5 Phylogeny 5.1 References
Transcription
5 Phylogeny 5.1 References
66 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 5 Phylogeny Evolutionary tree of organisms, Ernst Haeckel, 1866 5.1 References • J. Felsenstein, Inferring Phylogenies, Sinauer, 2004. • C. Semple and M. Steel, Phylogenetics, Oxford University Press, 2003. • R. Durbin, S. Eddy, A. Krogh & G. Mitchison, Biological sequence analysis, Cambridge, 1998 • D.L. Swofford, G.J. Olsen, P.J.Waddell & D.M. Hillis, Phylogenetic Inference, in: D.M. Hillis (ed.), Molecular Systematics, 2 ed., Sunderland Mass., 1996. • M. Salemi and A-M. Vandamme, The Phylogenetic Handbook. Cambridge University Press 2003. 5.2 Why phylogenetic analysis? Phylogenetic analysis has many applications, including understanding the emergence and development of diseases: Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 5.3 67 Outline of the chapter 1. Enumerating phylogenetic trees 2. Computer representation of trees 3. Distance-based reconstruction methods 4. Parsimony-based reconstruction methods 5. Jukes-Cantor model of DNA sequences evolution 6. Maximum likelihood reconstruction methods 7. Bayesian reconstruction methods 8. Phylogenetic Networks 1, 3, 4 will cover some material already presented in “Grundlagen der Bioinformatik” 5.4 Phylogenetic analysis The goal of phylogenetic analysis is to determine and describe the evolutionary relationship between a givens set of species. In particular, this involves determining the order of speciation events and their approximate timing. It is generally assumed that speciation is a branching process: a population of organisms becomes separated into two sub-populations. Over time, these evolve into separate species that do not crossbreed. Because of this assumption, a tree is often used to represent a proposed phylogeny for a set of species, showing how the species evolved from a common ancestor. 68 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 5.4.1 Early Evolutionary Studies Before the early 1960s: Anatomical/Morphological features were the dominant criteria used to derive evolutionary relationships between species 1958: Francis Crick suggested to use molecular sequences for phylogenetic tree reconstruction. 1962: Emile Zuckerkandl and Linus Pauling built the first phylogenetic tree from aligned amino acid sequences. They also proposed the theory of the molecular clock. 1967: Walter Fitch and Emanuel Margoliash published the first algorithm for phylogenetic tree reconstruction (least-squares algorithm). 1977: Carl Woese uses phylogenies based on SSU rRNA sequences to show that are are three Domains of Life: Eukaryotes, Bacteria and Archaea. 5.4.2 Phylogenetic trees In the following, we will use X = {x1 , x2 , . . . , xn } to denote a set of taxa, where a taxon xi is simply a representative of a group of individuals defined in some way. Definition 5.4.1 (Phylogenetic tree) A phylogenetic tree (on X) is a system T = (V, E, λ) consisting of a connected graph (V, E) without cycles, together with a labeling λ of the leaves by elements of X, such that: 1. every leaf is labeled by exactly one taxon, and 2. every taxon appears exactly once as a label. 5.4.3 Unrooted trees An unrooted phylogenetic tree is obtained by placing a set of taxa on the leaves of a tree: Pan_panisc Gorilla Homo_sap Mus_mouse Rattus_norv harbor_sel Bos_ta(cow) fin_whale blue whale Taxa X harbor_sel Gorilla Pan_panisc Bos_ta(cow) fin_whale Homo_sap Mus mouse + tree ⇒ blue_whale Rattus norv phylogenetic tree T on X Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 5.4.4 Unrooted and rooted trees 5.4.5 Enumerating phylogenetic trees 69 Let T be an unrooted phylogenetic tree on n taxa, i.e., with n leaves. We assume that T is bifurcating, i.e. that each internal node has degree 3. Any non-bifurcating tree on n taxa will have less nodes and edges. Lemma 5.4.2 (Number of nodes and edges) A bifurcating phylogenetic tree T on n taxa has 2n − 2 nodes, 2n − 3 edges and n − 3 internal edges for all n ≥ 3. Proof: by induction. For n = 3 there is exactly one tree: it has 3 outer edges and zero internal edges. It has 3 leaves and one internal node. Assume that the statement holds for n. Any tree T 0 with n + 1 leaves can be obtained from some tree T with n leaves by inserting a new node v into some edge e of T and connecting v to a new leaf w. This increases both the number of nodes and the number of edges by 2. Thus T 0 has 2n − 2 + 2 = 2(n + 1) − 2 nodes and 2n − 3 + 2 = 2n − 1 = 2(n + 1) − 3 edges. The number of internal edges is equal to the total number of edges minus the number of edges leading to leaves: (2n − 3) − n = n − 3. Lemma 5.4.3 (Number of trees) There are U (n) = (2n − 5)!! := 3 · 5 · 7 · . . . · (2n − 5) unrooted trees on n leaves, and R(n) = (2n − 3)!! = U (n) · (2n − 3) = 3 · 5 · . . . · (2n − 3) rooted trees. The number of possible trees grows extremely rapidly for increasing values of n. For example, for n = 10 we have U (n) = 2.027 × 106 and R(n) = 34.459 × 108 . 5.4.6 Distances in phylogenetic trees Let T = (V, E) be a phylogenetic tree with node set V and edge set E. An edge weight function ω : E → R≥0 assigns a non-negative number ω(e) to every edge e. These weights (or lengths) often represent 70 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 • the number of mutations on evolutionary path from one species to another, or • a time estimate for evolution of one species into another. If T is a phylogenetic tree with edge weights ω, then the tree distance (or patristic distance) between i and j is given by X ω(e), dT (i, j) = e∈P (i,j) where the sum is taken over all edges e that are contained in the path P (i, j) from i to j. Example: B A 9 8 2 D 12 7 5 C E The tree distance between B and C is dT (B, C) = 9 + 12 + 7 = 28. 5.5 Computer representation of a phylogenetic tree Let X be a set of taxa and T a phylogenetic tree on X. To represent a phylogenetic tree in a computer we maintain a set of nodes V and a set of edges E. Each edge e ∈ E maintains a reference to its source node s(e) and target node t(e). Each node v ∈ V maintains a list of references to all incident edges. Each leaf node v ∈ V maintains a reference λ(v) to the taxon that it is labeled by, and, vice versa, ν(x) maps a taxon x to the node that it labels. Each edge e ∈ E maintains its length ω(e). 5.5.1 Nested structure A rooted phylogenetic tree T = (V, E, λ) is a nested structure: Consider any node v ∈ V . Let Tv denote the subtree rooted at v. Let v1 , v2 , . . . , vk denote the children of v. Then Tv is obtainable from its subtrees Tv1 , Tv2 , . . . , Tvk by connecting v to each of their roots. Such a nested structure can be written using nested brackets: Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 Phylogenetic tree nested description v1 v10 taxon2 v7 v8 taxon6 v9 taxon1 v5 v6 taxon3 taxon4 v4 taxon5 v3 v2 71 v1 ↓ (v2 , v3 ) ↓ ((v4 , v5 ), (v6 , v7 , v8 )) ↓ (((v9 , v10 ), v5 ), (v6 , v7 , v8 )) Description: (((taxon1 , taxon2 ), taxon3 ), (taxon4 , taxon5 , taxon6 )) 5.5.2 Newick format Phylogenetic trees are usually read and printed using the so-called Newick format. The Newick Standard for representing trees in computer-readable form makes use of the correspondence between trees and nested parentheses, noticed in 1857 by the famous English mathematician Arthur Cayley. The tree ends with a semicolon. Interior nodes are represented by a pair of matched parentheses. Between them are representations of the nodes that are immediately descended from that node, separated by commas. Leaves are represented by their names. Trees can be multifurcating at any level. Branch lengths can be incorporated into a tree by putting a real number after a node and preceded by a colon. This represents the length of the branch immediately below that node. Examples: Unrooted tree Rooted tree with edge lengths e a d b c Newick string: ((a,b),c,(d,e)); Newick string: (((A:0.5,B:0.5):1.5,C:2.0):1.0,(D:1.0,E:1.0):2.0); In the Newick format, each pair of matching brackets corresponds to an internal node and each taxon label corresponds to an external node. To add edge lengths to the format, specify the length len of the edge along which given node we visited by adding :len behind the closing bracket, in the case of an internal node, and behind the label, in the case of a leaf. For example, consider the tree (a,b,(c,d)). If all edges have the same length 1, then this tree is written as (a:1,b:1,(c:1,d:1):1). 5.5.3 Algorithm for Tree to Newick This algorithm recursively prints a tree in Newick format. Initialisation with e = null and v = ρ, if the tree is rooted, or v set to an arbitrary internal node, else. Algorithm 5.5.1 (toNewick(e, v)) Input: Phylogenetic tree T = (V, E) on X, with labeling λ : V → X 72 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 Output: Newick description of tree begin if v is a leaf then print λ(v) else // v is an internal node print ’(’ for each edge f 6= e incident to v do If this is not the first pass of the loop, print ’,’ Let w 6= v be the other node incident to f call toNewick(f, w) print ’)’ print ’;’ end 5.5.4 Algorithm for parsing Newick Algorithm parses a tree in Newick format from a (space-free) string s = s1 s2 . . . sm . Initialisation: parseNewick(1, m, null). Algorithm 5.5.2 (parseNewick(i,j,v)) Input: a substring si . . . sj and a root node v Output: a phylogenetic tree T = (V, E, λ) begin while i ≤ j do Add a new node w to V if v 6= null then add a new edge {v, w} to E if si = ’(’ then // hit a subtree Set k to the position of the balancing close-bracket for i call parseNewick(i + 1, k − 1, w) // strip brackets and recurse else // hit a leaf Set k = min{k ≥ i | sk+1 = ’,’ or k + 1 = j} Set λ(w) = si . . . sk // grab label for leaf Set i = k + 1 // advance to next ’,’ or j if i < j then Check that i + 1 ≤ j and si+1 = ’,’ Increment i // advance to next token end 5.6 Constructing phylogenetic trees There are four main approaches to constructing phylogenetic trees: 1. Using a distance method, one first computes a distance matrix for a given set of biological data and then one computes a tree that represents these distances as closely as possible. 2. Maximum parsimony takes as input a set of aligned sequences and attempts to find a tree and a labeling of its internal nodes by auxiliary sequences such that the number of mutations along the tree is minimum. 3. Given a probabilistic model of evolution, maximum likelihood approaches aim at finding a phylogenetic tree that maximizes the likelihood of obtaining the given sequences. 4. Bayesian reconstruction methods estimate a posterior probability distribution of trees. Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 5.7 73 Distance-based tree reconstruction Assume we are given a set X = {x1 , x2 , . . . , xn } of taxa. The input to a distance method is a distance matrix D : X × X → R≥0 that associates a distance d(xi , xj ) with every pair of taxa xi , xj ∈ X. We require that D satisfies at least the first two properties of a metric: 1. symmetry, d(x, y) = d(y, x), 2. the triangle inequality, d(x, y) + d(y, z) ≥ (dx, z) and 3. identity, x 6= y ⇒ d(x, y) > 0. Distance-Based Phylogeny Problem: Reconstruct an evolutionary tree on n leaves from an n × n distance matrix Input: A distance matrix D = (dij ) on X. Output: A phylogenetic tree T (rooted or unrooted) with edge lengths on X such that the distances in T approximate D in some prescribed way. The most commonly distance-based methods used are 1. UPGMA 2. Neighbor joining 3. Minimum evolution and related methods 5.8 The neighbor-joining algorithm The most widely used distance method is neighbor joining (NJ), originally introduced by Saitou and Nei (1987)1 and modified by Studier and Keppler (1988). Given a distance matrix D, neighbor joining produces an unrooted phylogenetic tree T with edge lengths. It is more widely applicable than UPGMA, as it does not assume a “molecular clock”. Starting from a star tree topology the algorithm proceeds to build a tree based on D by repeatedly pairing “neighboring” taxa: c a c d d g b e 1 a → f Saitou, N., Nei, Y. (1987) SIAM J. on Comp. 10:405-421 b g e f 74 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 Algorithm 5.8.1 (Neighbor joining) Input: Distance matrix D on X Output: Unrooted phylogenetic tree T Initialization: Define L to be the set of leaf nodes, one for each taxon x ∈ X Set T = L while |L| > 2 do Compute the “neighbor-joining matrix” N : 1 P For all i, j set Nij = dij − (ri + rj ), where ri = |L|−2 k∈L dik Pick a pair i, j ∈ L for which Nij is minimum Create a new node k in T Update D: For all m ∈ L set dkm = 12 (dim + djm − dij ) Connect k to i and to j using two new edges of lengths dik = 12 (dij + ri − rj ) and djk = dij − dik , respectively. Remove i and j from L and add k to L. if L = {i, j}, then create a new edge between i and j of length dij . Let i and j be two neighboring leaves that have the same parent node, k. Remove i, j from the list of nodes L and add k to the current list of nodes. How do we update its distance to any given leaf m? Consider: i i −→ m j j k m We must have: dim = dik + dkm , djm = djk + dkm and dij = dik + djk , thus dim + djm = dik + dkm + djk + dkm = dij + 2dkm , which implies 1 dkm = (dim + djm − dij ). 2 Why do we set the length of the new edges to dik = 12 (dij + ri − rj ) and djk = dij − dik ? 1 P 1 P By definition, ri = |L|−2 k∈L dik equals the average distance qi = |L|−2 k∈L,k6=i,j dik from i to all other nodes m 6= i, j, plus dij |L|−2 : m1 i dij j k m2 qi average distance qj m5 m 4 m3 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 As we see from this figure: 2dik = dij + qi − qj = dij + qi − qj + 5.8.1 Taxa A B C D dij |L|−2 − dij |L|−2 75 = dij + ri − rj . Example A - B 8 - C 7 9 - D 12 14 11 - Initialisation: L = X = {A, B, C, D} Iteration: 1. Compute N: first compute column sums ri∗ : ∗ ∗ ∗ ∗ rA = 8 + 7 + 12 = 27, rB = 8 + 9 + 14 = 31, rC = 7 + 9 + 11 = 27, rD = 12 + 14 + 11 = 37 N = T axa A B C D A − ∗ ∗ 8 − (rA + rB )/2 ∗ ∗ 7 − (rA + rC )/2 ∗ ∗ )/2 + rD 12 − (rA B 8 − ∗ ∗ 9 − (rB + rC )/2 ∗ ∗ )/2 + rD 14 − (rB = T axa A B C D A − −21 −20 −20 D 12 14 11 − B 8 − −20 −20 C 7 9 − −21 C 7 9 − ∗ ∗ )/2 + rD 11 − (rC D 12 14 11 − One minimum is N (A, B) = −21. Define new internal node k to join A and B and compute edge lengths of k to A and of k to B: 1 1 (d(A, B) + rA − rB ) = (8 + 27/2 − 31/2) = 3 2 2 d(B, k) = d(A, B) − d(A, k) = 8 − 3 = 5 d(A, k) = Compute new D0 : T axa k C D k − d(k, C) d(k, D) C − 11 D − where d(k, C) = d(k, D) = 1 (d(A, C) + d(B, C) − d(A, B)) = 4 2 1 (d(A, D) + d(B, D) − d(A, B)) = 9 2 Second iteration: Compute N: we have L = {AB, C, D}, |L| = 3, |L| − 2 = 1 ∗ ∗ rk∗ = 4 + 9 = 13, rC = 4 + 11 = 15, rD = 9 + 11 = 20 N = T axa AB C D AB − 4 − (13 + 15)/1 9 − (13 + 20)/1 = T axa AB C D AB − −24 −24 C 4 − −24 D 9 11 − C 4 − 11 − (15 + 20)/1 D 9 11 − 76 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 A minimum is given by N (AB, C). Define a new node k to join AB and C: d(C, k) = d(AB, k) = 1 1 (d(AB, C) + rC − rAB ) = (4 + 15 − 13) = 3 2 2 d(AB, C) − d(C, k) = 4 − 3 = 1 Termination: Only two nodes left, D and ABC. Compute length of edge d(ABC, D): d(ABC, D) = d(C, D) − d(ABC, C) = 11 − 3 = 8 This is the resulting tree: B 5 8 1 D 3 A 3 C 5.8.2 Additivity and the four-point condition A distance matrix D is called additive, or a tree metric, if it can be represented by a phylogenetic tree. How to tell whether D is additive? Definition 5.8.2 (Four-point condition) Assume we are given a set of taxa X. A distance matrix D on X is said to fulfill the four-point condition if for all i, j, k, l ∈ X d(i, j) + d(k, l) ≤ max{d(i, k) + d(j, l), d(i, l) + d(j, k)} holds. (That is, two of the three expressions d(xi , xj ) + d(xk , xl ), d(xi , xk ) + d(xj , xl ), and d(xi , xl ) + d(xj , xk ) are equal and are not smaller than the third.) We have the following classical result (Buneman, 1971): Theorem 5.8.3 (Additivity metrics and the 4PC) A distance matrix D on X is additive, if and only if it fulfills the four-point condition. We will prove one direction of the theorem. Proof “⇒”: Assume that D is additive. Then there exists a phylogenetic tree T that represents D. Let i, j, k, l be elements of X. If they are all different, then we can assume w.l.o.g. that (i, j) and (k, l) are neighbors in the induced tree T |{i,j,k,l} , respectively. It follows that d(i, j) + d(k, l) ≤ d(i, k) + d(j, l) = d(i, l) + d(j, k), Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 77 as can be seen from the following figure: and thus D fulfills the four-point condition. Example additive metric: B C D A 7 6 5 D= B 3 6 C 5 A non-additive B A 7 D= B C Check the four-point condition: Check the four-point condition: metric: C 7 4 D 6 7 5 d(A, B) + d(C, D) = 7 + 5 = 12 d(A, C) + d(B, D) = 7 + 7 = 14 d(A, D) + d(B, C) = 6 + 4 = 10 ⇒ d(A, C) + d(B, D) 6≤ max (d(A, B) + d(C, D), d(A, D) + d(B, C)) ⇒ 4-point condition doesn’t hold. d(A, B) + d(C, D) = 7 + 5 = 12 d(A, C) + d(B, D) = 6 + 6 = 12 d(A, D) + d(B, C) = 5 + 3 = 8 ⇒ the four-point condition holds. D D C 2.0 C 2.0 3.0 2.0 1.0 1.0 2.0 1.0 2.0 3.0 2.0 B A B A 5.9 The Jukes-Cantor model of evolution T. Jukes and C. Cantor (1969) introduced a very simple model of DNA sequence evolution: Definition 5.9.1 (Jukes-Cantor model) The Jukes-Cantor model of evolution makes the following assumptions: 1. We are given a rooted phylogenetic tree T along which sequences evolve. 2. The sequence length is an input parameter and possible states for each site are A, C, G and T. 3. The possible states for each site are A, C, G and T. 4. For each site of the sequence at the root of T , the state is drawn from a given distribution (typically uniform). 5. The sites evolve identically and independently (i.i.d.) down the edges of the tree from the root at a fixed rate u. 6. With each edge e ∈ E we associate a duration t = t(e) and the expected number of mutations per site along e is uτ (e). The probabilities of change to each of the 3 remaining states are equal. 78 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 How do we “evolve” a sequence down an edge e under the Jukes-Cantor model? Let v = (v1 , . . . , vn ) and w = (w1 , . . . , wn ) denote the source and target sequences associated with e. We assume that v has already been determined and we want to determine w. Under the Jukes-Cantor model, the evolutionary event C3 = nucleotide changes to one of the other three bases occurs at a fixed rate of u. Now consider the event C4 = nucleotide changes to any base (including back to the same base). The rate of occurrences of this event is taken to be 43 u. 4 The probability that event C4 does not occur within time t is: e− 3 ut (Poisson distribution with k = 0). Given that the event C4 occurs, the probability of ending in any particular base is 14 . Let P and Q denote the base pair present at a given site before and after a given time period t. The probability that an event of type C4 occurs and changes P ∈ {A, C, G, T} to base Q within time t is: 4 1 Prob(Q | P, t) = 1 − e− 3 ut . 4 Now, adding the probability that the event C4 does not occur, we obtain the probability that the state does not change, i.e. P = Q: 1 4 4 4 1 1 − e− 3 ut = 1 + 3e− 3 ut . Prob(Q = P | P, t) = e− 3 ut + 4 4 Thus, we obtain a probability-of-change formula for the probability of an observable change occurring at any given site in time t: 4 Prob(change | t) = 1 − Prob(Q = P | P, t) = 1 − 41 1 + 3e− 3 ut 4 = 34 1 − e− 3 ut . We can use this model to generate artificial sequences along a model tree. E.g., set the sequence length to 10 and the mutation rate to u = 0.1. Initially, the root node is assigned a random sequence. Then the sequences are evolved down the tree, at each edge using the probability-of-change formula to decide whether a given base is to change: 1 a 1 b 1 c 1 d 1 e 1 f 2 ACGTTTAGAG ACGTTTCGAG 2 1 * 3 ACCTTGCGGG E.g., the probability of change along the edge marked ∗ is 4 0.75(1 − e− 3 ×0.1×3 ) = 0.75(1 − e−0.4 ) ≈ .247. 5.10 The Jukes-Cantor distance transformation The Hamming distance between two sequences underestimates their true evolutionary distance (average number of mutations that took place per site over the elapsed time), because it doesn’t count “back mutations” and multiple mutations at the same site. Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 79 Thus, a correction formula is required. For the Jukes-Cantor model, inversion of the probability-ofchange formula leads to the following: Definition 5.10.1 (Jukes-Cantor transformation) The Jukes-Cantor distance transformation for a pair of sequences a and b is defined as: formula: 3 4 JC(a, b) = − ln 1 − Ham(a, b) . (5.1) 4 3 5.11 More general transformations The Jukes-Cantor model assumes that all nucleotides occur with the same frequency. This assumption is relaxed under the Felsenstein 81 model introduced by Joe Felsenstein (1981), which has the following distance transformation: F 81(ai , aj ) = −B ln(1 − Ham(ai , aj )/B), (5.2) where B = 1 − (πA2 + πC2 + πG2 + πT2 ) and πQ is the frequency of base Q. The base frequency is obtained from the pair of sequences to be compared, or better, from the complete set of given sequences. Note that this contains the Jukes-Cantor transformation as a special case with equal base frequencies πA = πC = πG = πT = 0.25, thus B = 43 . Unlike the Jukes-Cantor transformation, this transformation can also be applied to protein P20 sequences, 2 if all amino acids are generated with the same frequency, and B = 1 − setting B = 19 i=1 πi , in the 20 proportional case. Given equal base frequencies, but different proportions P and Q of transitions and transversions between ai and aj , the distance for the Kimura 2 parameter model is computed as: 1 1 1 1 K2P (ai , aj ) = ln + ln . (5.3) 2 1 − 2P − Q 4 1 − 2Q If we drop the assumption of equal base frequencies, then we obtain the Felsenstein 84 transformation: P (A − B)Q Q F 84(ai , aj ) = −2A ln 1 − − + 2(A − B − C) ln 1 − , (5.4) 2A 2AC 2C where A = πC πT /πY +πA πG /πR , B = πC πT +πA πG , and C = πR πY with πY = πC +πT , πR = πA +πG . Even more general models exist, but we will skip the details. The following figure summarizes the complete overview of the time reversible models: 80 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 GTR Equal base frequency 3 substitution parameters (2 f. Transitions, 1 f. Transv.) SYM TrN 2 substitution parametera (1 f. Transitions, 1 f. Transv.) HKY =F84 1 substitution parameter 3 substitution parameters (1 f. Transitions, 2 f. Transv) K3SP Equal base frequency 2 substitution parameters (1 f. Transitions, 1 f. Transv) K2P F81 Equal base frequency 1 substitution parameter JC GTR TrN SYM HKY K3ST = = = = = “general time reversible” “Tamura-Nei” “symmetric” “Hasegawa-Kishino-Yano” “Kimura 3-Parameter” Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 5.12 81 Maximum parsimony In phylogenetic analysis, the Maximum Parsimony problem is to find a phylogenetic tree that explains a given set of aligned sequences using a minimum number of “evolutionary events”. The maximum parsimony method is one of the most widely used used tree reconstruction methods. Suppose we are given a multiple alignment of sequences A of length L and a phylogenetic tree T . The maximum parsimony method makes two fundamental assumptions: • Changes at different positions of the sequences are independent. • Changes in different parts of the tree are independent. 5.12.1 The parsimony score of a tree The difference between two sequences X = (x1 , . . . , xL ) and Y = (y1 , . . . , yL ) is simply their nonnormalized Hamming distance d(X, Y ) = |{k | xk 6= yk }|, that is, the number of positions at which they differ. Assume we are given a multiple alignment of sequences A = {A1 , A2 , . . . , An } and a corresponding phylogenetic tree T , leaf-labeled by A. If we assign a hypothetical ancestor sequence to every internal node in T , then we can obtain a score for T together with this assignment by summing over all differences d(X, Y ), where X and Y are two sequences labeling two nodes that are joined by an edge in T . Example: AGT AGA 1 0 CGA 0 CCA 1 AGA CGA ⇒ score = 3 1 0 CGA Definition 5.12.1 (Parsimony score of a tree) Suppose we are given a multiple alignment of sequences A of length L and a corresponding phylogenetic tree T leaf-labeled by A. Then the parsimony score for T and A is defined as: X P S(T, A) = min d(λ(u), λ(v)), λ {u,v}∈E where the minimum is taken over all possible labelings λ of the nodes of T by sequences of length L that extend the labeling of the leaves by A. . Small Parsimony problem: The Small Parsimony problem is to compute the parsimony score for a given tree T . 82 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 5.13 The Fitch algorithm In 1971 Walter Fitch2 published a dynamic programming algorithm that solves the Small Parsimony problem efficiently. Input: A phylogenetic tree T with n leaves. The leaves are labeled by the sequences of an MSA with n sequences over an alphabet Σ. Since maximum parsimony assumes indepedence of each column, one can compute the parsimony score column by column and the final result is given by the sum of the column results. The Fitch algorithm runs in two passes: The forward pass first computes the parsimony score and the backward pass then computes a labeling of the internal nodes. 5.13.1 The Fitch algorithm: Forward pass As the algorithms proceeds column-by-column, we can assume that each leaf v is labeled by a single character-state c(v). We will assign a set S(v) of states and a number l(v) to each node v as follows: S(v) = {c(v)}, l(v) = 0 For each leaf v: For each inner node v with children u, w: S(v) = S(u) ∩ S(w) if S(u) ∩ S(w) 6= ∅ S(u) ∪ S(w) if S(u) ∩ S(w) = ∅ l(v) = l(u) + l(w) if S(u) ∩ S(w) 6= ∅ l(v) = l(u) + l(w) + 1 if S(u) ∩ S(w) = ∅ To compute S(v) and l(v) we traverse the tree in postorder or “bottom-up”. The number of changes required is given by the value of l at the root. The forward pass algorithm requires O(n) steps for each column and thus O(nL) in total. 5.13.2 The Fitch algorithm: Example Example: we use the Fitch algorithm to compute the parsimony score for the following labeled tree: G 5.13.3 A G C C A Fitch algorithm: Backward pass The forward pass algorithm computes the parsimony score. An optimal labeling of the internal nodes is obtained via traceback: starting at the root node r, we label r using any character in S(r). Then, 2 Fitch, W. (1971) Toward defining the course of evolution: minimum change of specified tree topology. Syst. Zoology 20:406-416 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 83 for each child w, we use the same letter, if it is contained in S(w), otherwise we use any letter in S(w) as label for w. We then visit the children von w etc. {A,G} {G} {C,G} {G} {C} {A,C} {G} {C} {C} {A} {A} G → A G C C A Again, the algorithm requires O(nL) steps, in total. 5.13.4 The Fitch algorithm: Backward pass Algorithm 5.13.1 (Backward pass) Input: A phylogenetic tree T and the set S(v) of each node v in T Output: The fully labeled tree T do for all nodes v, starting at the root node, going to the leaves if v is the root node then Set c(v) = t with t ∈ S(v) else Let w be the parent node of v if c(w) ∈ S(v) then Set c(v) = c(w) else Set c(v) = t with t ∈ S(v) end 5.13.5 Sankoff ’s algorithm In Fitch’s algorithm each mutation is equally weighted independent of the nature of the mutation. For nucleotides and especially for amino acids the assumption of equal substitution costs for all possible character changes is inappropriate. In 1975, David Sankoff3 proposed an alternative to Fitch’s algorithm in which substitutions can be weighted. Input: • a phylogenetic tree T = (V, E) with n leaves • a character-state c(v) for each leaf v • a cost matrix C that specifies the cost C(α, β) of changing state α into state β Forward pass: In a post-order traversal, for each node v and each state α ∈ Σ we will compute Sα (v) as the minimum cost of the subtree rooted at v, given c(v) = α. Initialization: For each leaf node v and state α set Sα (v) = 3 0, c(v) = α ∞, else. Sankoff D. (1975) Minimal mutation trees of sequences. SIAM J. Appl. Math. 28:35-42 84 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 Main recursion: For each internal node v with children u and w recursively compute Sβ (u) and Sβ (w) for all states β. Then set Sα (v) = min(Sβ (u) + C(α, β)) + min(Sβ (w) + C(α, β)). β β Result: The minimal cost of the whole tree is given by min Sα (ρ), α where ρ is the root of the tree. Example: Backward pass: Based on the values of Sv (α) from the forward pass we will now choose an optimal character state α for each inner node. As in Fitch’s algorithm we start at the root and visit all nodes in a pre-order traversal. For the root ρ set: c(ρ) = arg min Sα (ρ) α For every internal node w 6= ρ with parent node v we set: c(w) = arg min(Sα (w) + C(α, c(v))). α Example: Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 5.13.6 85 The Large Parsimony problem Definition 5.13.2 (Parsimony score of an alignment) Given a multiple alignment A {a1 , . . . , an }, its parsimony score is defined as = P S(A) = min{P S(T, A) | T is a phylogenetic tree on A}. Definition 5.13.3 (Large Parsimony problem) The Large Parsimony problem is to compute P S(A). Theorem 5.13.4 (Parsimony is hard) The Large Parsimony problem is NP-hard. 5.14 Maximum likelihood estimation Let A = (a1 , . . . , an ) be a multiple sequence alignment of length m on X . The basic idea of the maximum likelihood estimation method is to compute a phylogenetic tree T , together with edge lengths ω, that maximizes the likelihood P(A | T, ω, M ) that the multiple sequence alignment A was generated on the phylogenetic tree T with edge lengths ω under the given model of sequence evolution M . Such a model M specifies how to choose the initial sequence at the root of the tree T and how sequences evolve along edges of the tree. For example, the Jukes-Cantor model specifies that at any position i of the root sequence, all four nucleotides appear with the same probability of 41 . The mutation rate is u. Moreover, the length ω(e) of any edge e is interpreted as a duration and the probability that an observable change occurs along an edge of length t is given by the probability-of-change formula: 4 3 P(change | t) = 1 − e− 3 ut . 4 In practice, the models employed are more general than the simple Jukes-Cantor model, usually allowing unequal nucleotide frequencies and different transition probabilities between different nucleotides. The models are generally time reversible, which implies that the optimal score attainable is independent of the location of the root. A main advantage of maximum likelihood estimation is that it provides a systematic framework for explicitly incorporating assumptions and knowledge about the process that generated the given data. Although evolutionary models are only a rough estimate of real-world biological evolution, in practice, maximum likelihood methods have proved to be quite robust to violations of the assumptions made in the models. 5.14.1 Maximum likelihood estimation is hard As with maximum parsimony, the main draw-back of this approach is that finding an optimal phylogenetic tree is NP-hard. In fact, the situation is in some sense worse: Theorem 5.14.1 (Maximum likelihood estimation is hard) For a given phylogenetic tree T , the problem of defining edge lengths ω so as to maximize the likelihood P (A | T, ω, M ) is NP-hard. 86 5.14.2 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 Solving maximum likelihood Assume we are given the following DNA alignment A as 1 2 i a1 A C . . . G C a2 A C . . . G C a3 A C . . . T A a4 A C . . . T G input: G C C G m A G G A ... ... ... ... There are three possible unrooted phylogenetic trees on a set of four taxa: a1 a3 a1 a2 a1 a2 a2 a4 a3 a4 a4 a3 To simplify the calculation of the maximum likelihood score for such a phylogenetic tree T we root it by choosing one of its internal nodes to be the root ρ: (1) (1) → (2) (2) (3) C (4) C A G (3) (4) (5) (5) → (6) (6) We will discuss how to compute the maximum likelihood for the first tree of the three unrooted trees. The other two trees are processed similarly. Conceptually, we obtain the likelihood L(i) that a specific rooted phylogenetic tree T generated Ai , the i-th character of A, by summing over all possible labellings α of the internal nodes by states. For a bifurcating unrooted phylogenetic tree on n taxa there are n − 2 internal nodes, and so there are k (n−2) possible labellings to be considered, where k is the number of possible character states. For our example, we need to consider 16 different possibilities: C C A G A L(i) = P Ai + P Ai A + ... + P Ai + ... + P Ai C C A G C A C C A G G C C C A G T T In general, for a given rooted phylogenetic tree T and model M , the likelihood of T given a single character Ai is defined as L(i) = P(A P i |T, ω, M ) = P(Ai , α|T, ω, M ) Pα Q = P(α(ρ)) P (α(v), α(w)) , e α e={v,w} where the summation is over all possible labellings α of the internal nodes, P(α(ρ)) is the probability of generating the state α(ρ) at the root and Pe (α(v), α(w)) is the probability of observing the two states α(v) and α(w) at the two ends of an edge e = {v, w}. Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 87 The total likelihood that T generated the multiple sequence alignment A is obtained as the product of the likelihoods for all positions of A: L(T ) = L(1) · L(2) · · · · · L(m) = m Y L(i). i=1 The computation of L(i) is actually more complicated than we have described so far, because not only do we have to consider all possible rooted phylogenetic trees on n taxa, but we must also consider all possible ways of defining edge lengths ω on each such tree. Assume that we have a rooted phylogenetic tree T with edge lengths ω, we will now discuss in more detail how to calculate Y Pe (α(v), α(w)) P (Ai , α | T, ω, M ) = P(α(ρ)) e={v,w} for the case of the Jukes-Cantor model with mutation rate u. Under this model, the probability of observing any particular nucleotide at the root node is 41 . The probability of observing two different states α(v) and α(w) at opposite ends of an edge e of length t = ω(e) is given by 4 3 P (α(v) 6= α(w) | T ) = (1 − e− 3 ut ), 4 whereas the probability of observing the same state at opposite nodes of e equals 4 1 1 − P (α(v) = α(w) | T ) = (1 + 3e− 3 ut ). 4 For example, the value of C C A G A P A i A is given by P (A)Pe1 (A, C)Pe2 (A, C)Pe3 (A, A)Pe4 (A, A)Pe5 (A, G) = 33 (1 46 4 4 4 4 4 − e− 3 ut1 )(1 − e− 3 ut2 )(1 + 3e− 3 ut3 )(1 + 3e− 3 ut4 )(1 − e− 3 ut5 ), where e1 , . . . , e5 are the five edges in T and ti = ω(ei ) for i = 1, . . . , 5. 5.14.3 Felsenstein’s algorithm One can efficiently compute the likelihood for a given bifurcating, rooted phylogenetic tree T with edge lengths ω using Felsenstein’s algorithm (Felsenstein, 1981). In this approach, for each node v in T and every possible character state a, in a post-order traversal of T , one recursively computes the conditional likelihood Lv (a) for the subtree Tv rooted at v, under the condition that the state of v is a, that is, we have α(v) = a. More precisely, for any internal node v with children w1 and w2 , the main recursion is ! ! X X Lv (a) = P{v,w1 } (a, b)Lw1 (b) P{v,w2 } (a, b)Lw2 (b) , b b 88 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 where summation is over all possible states b and, as above, P{v,w} (a, b) is the probability of observing states a and b at the two ends of the edge {v, w} under the given model M . To initialize the computation, for each leaf v and every symbol a, we set Lv (a) = 1, if the node v is labeled by the character state a, and Lv (a) = 0, else. 5.14.4 Summary As mentioned above, we must consider all possible choices of edge lengths ω. Determining an optimal choice of ω is in general a computationally hard problem. One possible heuristic is to optimize the length of each edge in turn, while keeping all other edge lengths fixed. In summary, for a given multiple sequence alignment A on X , in maximum likelihood estimation, a result is obtained by searching through all possible phylogenetic trees T and for each such tree T , computing the maximum likelihood score. The method returns the tree that attains the best score. 5.15 Quartet puzzling Quartet puzzling 4 is a heuristic for approximating the maximum likelihood (MLE) tree. It is based on the idea of computing the MLE trees for all quartets of taxa and then using many rounds of combination steps in which one attempts to fit a random set of quartet trees together so as to obtain a tree on the whole dataset. Assume we are given a multiple alignment of sequences A = {a1 , . . . , an }. The quartet puzzling heuristic proceeds in three steps: 1. For each quartet q of four distinct taxa ai , aj , ak , al ∈ A compute the maximum likelihood tree Tq . This step is called maximum likelihood step. 2. Choose a random order ai1 , ai2 , . . . , ain of the taxa and then use it to guide a combining or puzzle step that produces a tree on the whole data set from quartet trees. 3. Repeat step (2) many times to obtain trees T1 , T2 , . . . , TR . Report the majority consensus tree T. 5.15.1 Computing all quartet trees Given four taxa {a1 , a2 , a3 , a4 }, there are three possible bifurcating topologies: a1 a3 a1 a2 a1 a2 a2 a4 a3 a4 a4 a3 {{a1 , a2 }, {a3 , a4 }} {{a1 , a3 }, {a2 , a4 }} {{a1 , a4 }, {a2 , a3 }} Now, let T be any phylogenetic tree that contains leaves labeled a1 , a2 , a3 , a4 , We say that T induces the split or tree topology {{a1 , a2 }, {a3 , a4 }}, if there exists an edge in T that separates the nodes labeled a1 and a2 from the nodes a3 and a4 . We will sometimes abbreviate {{a1 , a2 }, {a3 , a4 }} by a1 , a2 | a3 , a4 . Assume we are given this tree T : 4 Korbinian Strimmer and Arndt von Haeseler (1996) Quartet-puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol Biol Evol 13, 964-969. Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 a1 89 a5 a2 a4 a3 This tree induces the following five quartet tree topologies: a1 a3 a1 a3 a1 a4 a1 a4 a2 a4 a2 a5 a2 a2 a5 a3 a5 a4 a3 a5 It is easy to reconstruct the original tree from its set of quartets. However, in the presence of false quartet topologies (as inevitably produced by any phylogenetic reconstruction method), the reconstruction of the original tree can become difficult. In quartet puzzling, quartets are greedily combined to produce a tree for the whole data set. In an attempt to avoid the usual problems of a greedy approach, the combining step is run many times using different random orderings of the taxa. 5.15.2 The puzzle step The puzzle step is initialized with the maximum likelihood tree for {a1 , a2 , a3 , a4 }, e.g.: a1 0 0 a2 0 0 a3 0 a4 Each edge e is labeled with a number of conflicts c(e) that is originally set to 0. In the puzzle step, the task is to decide how to insert the next taxon ai+1 into the tree Ti already constructed for taxa {a1 , . . . , ai }. We proceed as follows: For each edge e in Ti we consider what would happen if we were to attach taxon ai+1 into the middle of e by a new edge: We consider all quartets consisting of the taxon ai+1 and three taxa from {a1 , a2 , . . . , ai }. For each such quartet q we compare the precomputed maximum likelihood tree Tq with the restriction of Ti to q, denoted by Ti |q . If these two trees differ, then we increment c(e) by one. Finally, we attach e to an edge f in Ti that has a minimum number of conflicts c(f ), thereby obtaining Ti+1 . Example: We illustrate the puzzle step using the five taxon tree shown above. For each of the five quartets we compute the respective ML tree. Assume that these are indeed a1 a3 a1 a3 a1 a4 a1 a4 a2 a4 a2 a5 a2 a2 a5 a3 a5 a3 a4 a5 We start with the maximum likelihood tree on {a1 , a2 , a3 , a4 }. The puzzle step for a5 proceeds as 90 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 follows: a1 0 0 Initial tree: a2 0 a1 −→ a1 , a3 | a4 , a5 2 1 a2 2 0 a3 0 a4 1 a3 0 a4 a1 1 0 a2 1 a1 3 2 −→ a2 , a3 | a4 , a5 a2 3 0 a3 0 a4 2 a3 0 a4 a3 4 −→ a1 , a2 | a3 , a5 a1 −→ a1 , a2 | a4 , a5 2 2 0 4 a2 a4 Hence, the puzzle step suggests that taxon a5 should be attached in the edge adjacent to the leaf labeled a4 to obtain the tree a1 a5 a2 a3 a4 In geneal, this procedure is repeated for different orders of taxa many times (say, 1000 for a large dataset). Finally a consensus of all computed trees is derived as the final result. 5.15.3 The quartet puzzling algorithm Algorithm 5.15.1 (Quartet puzzling) Input: Aligned sequences A = {a1 , . . . , an }, number R of trees Output: Phylogenetic trees T 1 , T 2 , . . . , T R on A Initialization: Compute an MLE tree tq for each quartet q ⊆ A for r = 1 to R do Randomly permute the order of the sequences in A Set T4 equal to the precomputed MLE tree on {a1 , a2 , a3 , a4 } for i = 5 to n do Set c(e) = 0 for all edges in Ti−1 for each quartet q ⊆ {a1 , . . . , ai } with ai ∈ q do Let x, y | z, ai be the MLE tree topology on q Let p denote the path from x to y in Ti−1 Increment c(e) by 1 for each edge e that is contained in p or that is separated from z by p Choose any edge f for which c(f ) is minimal Obtain Ti from Ti−1 by attaching a new leaf with label ai to f Output Tn end Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 5.16 91 Bayesian methods Bayesian inference of phylogenetic trees is a method whose goal is to obtain a quantity called the posterior probability of a phylogenetic tree. Generally speaking, the posterior probability of a result is the conditional probability of the result being observed, computed after seeing a given input dataset. The prior probability of a result is the marginal probability of the result being observed, set before the input dataset is examined. In other words, the prior probability distribution of phylogenetic trees on X reflects our belief of the distribution of all possible phylogenetic trees on X . This distribution is determined in advance, without knowledge of the actual sequence data to be examined. Let A be a multiple sequence alignment and T a phylogenetic tree with edge lengths ω, on X . Also, we will assume that some fixed model of evolution M is given. The posterior probability of T can be obtained from the prior probability of T with the help of the likelihood P (A | T ) = P (A | T, ω, M ) using Bayes’ Theorem: Theorem 5.16.1 (Bayes’ Theorem) P (T | A) = P (A | T ) × P (T ) . P (A) The posterior probability P (T | A) can be interpreted as the probability that the tree T is correct, given the data. In the simplest case, all trees on X are considered to be equally probable and then the prior distribution is called a flat prior. In the case of a flat prior, the posterior probability P (T | A) is proportional to the likelihood P (A | T ) (computed under the model M ) and the maximum likelihood tree also has maximum posterior probability. The maximum posterior probability over all trees is usually very low, if we think of tree space as a discrete sample space. In Bayesian tree inference the goal is not to determine a single phylogentic tree of maximum posterior probability, but rather to compute a sample of phylogenetic trees according to the posterior probability distribution of trees. Such a sample of trees is then processed further, for example, to generate a single “consensus tree” (as described later) and then to label the clusters in such a tree by their posterior probabilities. The expression for the posterior probability given by Bayes’ Theorem looks quite simple and we can efficiently compute P (A | T ) using Felsensteins algorithm, as described in in the context of maximum likelihood. However, it is usually impossible to solve the complete expression analytically, as computation of the normalizing factor P (A) usually involves summing over all possible phylogenetic tree topologies T 0 and integrating over all possible edge lengths ω and model parameters µ (of the given model M ): P (T | A) = P (A | T )P (T ) P (A | T )P (T ) =P , 0 0 P (A) T 0 P (A | T )P (T ) where 0 Z Z P (A | T ) = ω P (A | T 0 , ω, µ)P (ω, µ)dµdω. µ To avoid this problem, Bayesian inference employs the Markov chain Monte Carlo (MCMC) approach, which is based on the idea of sampling from the posterior probability distribution using a suitably constructed chain of results. 92 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 This is a general approach that is used in a wide range of applications. In the context of phylogenetic tree reconstruction, the MCMC approach constructs a chain of phylogenetic trees. At each step, a modification of the current phylogenetic tree is proposed and then a probabilistic decision is made whether to “accept” the newly proposed phylogenetic tree or whether to keep the current one. 5.16.1 Metropolis-Hastings algorithm Algorithm 5.16.2 (Metropolis-Hastings) Let Ti be the phylogenetic tree at position i of the MCMC chain and let T 0 be a proposed modification of Ti . Using what is called the Metropolis-Hastings algorithm, the decision whether to accept T 0 or to keep Ti is based on the ratio of the posterior probabilities of the two trees, which is given by R= P (T 0 | A) P (A | T 0 )P (T 0 )P (A) P (A | T 0 )P (T 0 ) = = . P (Ti | A) P (A | Ti )P (Ti )P (A) P (A | Ti )P (Ti ) Note that the normalizing factor P (A), which is so difficult to compute, cancels out and this expression is much easier to calculate than the actual posterior probability of either Ti or T 0 . Based on R, we accept the proposed phylogenetic tree T 0 with probability min(1, R). In other words, we always accept T 0 , if it has a better posterior probability than Ti , and otherwise we draw a random number p uniformly from the interval [0, 1] and accept the tree T 0 if p < R, and reject T 0 , otherwise. Acceptance means that we set Ti+1 = T 0 , whereas rejection implies Ti+1 = Ti . The aim is to produce an MCMC chain M of bifurcating phylogenetic trees that has the property that the stationary distribution of the trees (including the parameters ω and µ) in the chain equals the posterior probability distribution of the trees (again, including ω and µ). One can show that, by construction, the decision rule described above implies that the number of times that a phylogenetic tree T occurs in M is a valid approximation of the posterior probability of that tree. 5.16.2 Practical issues For this approach to work well in practice, a number of issues must be taken into account. • First, the mechanism for modifying the current phylogenetic tree must be designed with care. On the one hand, the suggested changes should be quite small so as to avoid a drastic drop in posterior probability. On the other hand, the modifications should be large enough so as to ensure that M is able to explore a sufficient portion of parameter space in a reasonable amount of time. • Second, as the MCMC chain M starts at a random location in parameter space, usually of very low posterior probability, the chain must be allowed to burn in, that is, it must be given time to reach stationarity, a state in which the chain is actually sampling from the posterior probability. In practice, the first 10,000 elements of M, say, are excluded from further analysis. • Third, trees that are close to each other in the MCMC chain M are highly correlated by construction, as they differ only by small modifications. To avoid this problem of autocorrelation, the final set of output trees is obtained by sparsely sampling M, retaining only every 1,000-th tree, say. Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 5.16.3 93 The local algorithm What types of modifications are used in practice? Branch-swapping methods such as NNI (nearest neighbor interchange), SPR (subtree prune and regraft) and TBR (tree bisection and reattachment) are used in the context of maximum parsimony and maximum likelihood methods to search through the space of phylogenetic tree topologies. For the purposes of constructing an MCMC chain we need to couple such an approach with a method for proposing modifications of edge lengths and parameters of the evolutionary model. We will now describe one approach that is used in practice called the local algorithm, which is a modification of the NNI method. The algorithm starts by choosing an internal edge (v, w) of the current phylogenetic tree Ti at random. As we assume that Ti is bifurcating, both of the nodes v and w are each connected to two further nodes, which we will randomly label a and b for v, and c and d for w, respectively. Let x, y and z denote the distance from a to v, w and d, respectively, as obtained by adding the edge lengths ω of the edges from a to v, v to w, and w to d, appropriately. Here we show a phylogenetic tree before and after a modification by the local algorithm: 0 x y z a v w d b c 0 −→ y' x' z' a w v d c b The distance z from node a to node d is modified by a randomly chosen factor to a new value of z 0 . In this example, the node w is moved by a random amount along the path from a to d, whereas the relative position of node v on the path from a to d is not changed. 5.16.4 Details of the local algorithm The local algorithm changes the length of the path from a to d and modifies the positions of v and w in that path. Here are some details: • First define the new distance from a to d as z 0 = z × ep−0.5 , where p is a random number uniformly chosen from the interval (0, 1). The lengths of edges on the path from a to d will be modified so that the distance from a to d will be z 0 . • Second, randomly decide with equal odds whether node v or node w is to be moved by a random amount. If we decide to move node v, then define x0 = q × z 0 and y 0 = y × z 0 /z. Otherwise, define x0 = x × z 0 /z and y 0 = q × z 0 . In both cases, q is a random number uniformly chosen from the interval (0, 1). • Third, to reposition the nodes v and w according to the values just chosen, remove the two nodes v and w from the path connecting a to d and then reinsert them into the path at distances x0 and y 0 from a, respectively. If x0 < y 0 , then the local algorithm does not result in a change of the current tree topology. In 94 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 this case, only the lengths of the edges from a to v, v to w, and w to d change. In more detail, we set ω(a, v) = x0 , ω(v, w) = y 0 − x0 and ω(w, d) = z 0 − y 0 . If x0 > y 0 , then the modifications result is a new topology in which a and w, w and v, and v and d, respectively, are both connected by an edge. In this case, the edge lengths are ω(a, w) = y 0 , ω(w, v) = x0 − y 0 and ω(v, d) = z 0 − x0 . Simple parameters of the evolutionary model, such as the mutation rate µ in the Jukes-Cantor model, are modified by adding a random number uniformly chosen from an appropriate interval centered at 0. For a set of parameters that is constrained to sum to some specific value, such as 1 in the case of probabilities, the values are randomly modified according to a so-called “Dirichlet distribution”. A proposal for modifying the tree and a proposal for new values for all parameters of the specified evolutionary model are generated independently and simultaneously, and then accepted or rejected together in in a single application of the ratio-based decision step of the Metropolis-Hasting algorithm. 5.16.5 Metropolis-coupled Markov chain Monte Carlo It is important that the MCMC chain M is given the opportunity to visit all relevant parts of parameter space. In consequence, the larger the number of taxa is, and the more parameters the evolutionary model M has, the longer the chain must be run. A popular approach for exploring more of parameter space in less time is to use a variant of MCMC called Metropolis-coupled Markov chain Monte Carlo, or (MC 3 ), for short. In this approach k different MCMC chains M1 , . . . , Mk are run in parallel. The first chain M1 is called the cold chain and this is the one from which inferences will be made. The other k − 1 chains are called heated chains. Each of the heated chains aims at sampling from a posterior probability distribution raised to a power of β > 1. Heating a chain has the effect of “flattening out” the probability landscape and the hope is that a heated chain will be able to move more freely through parameter space. For example, in a strategy called incremental heating, the i-th chain Mi aims at sampling from the distribution P (T | A)β(i) , with 1 β(i) = , 1 + (i − 1)t where t is a user-specified heating parameter. After all k chains have moved one step, a swap of states is proposed between two randomly chosen chains Mi and Mj , using a Metropolis-like algorithm. Let Ti denote the current state of chain Mi , for all i = 1, . . . , k. (Here, by current state we mean the total state consisting of the current tree T , the current length parameters ω and the current model parameters µ.) A swap between Mi and Mj is accepted with probability P (Tj | A)β(i) P (Ti | A)β(j) α= . P (Ti | A)β(i) P (Tj | A)β(j) The main idea is that the heated chains act as “scouts” exploring the parameter space and that swapping will occasionally help the cold chain to move to a new part of the space which it would otherwise have difficulty reaching because of a deep valley in the original distribution. At the end of the run, the heated chains are discarded and the cold chain is then processed as described above. 5.16.6 Outlook A major challenge is how do we determine whether an MCMC chain M has been run long enough so that that chain has converged, that is, it provides a good approximation of the posterior distribution of trees. A simple approach is to monitor the likelihood of the phylogenetic trees added to M and to Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 95 assume that convergence has occurred once there is no further increase in likelihood. A more reliable approach may be to use multiple MCMC chains in parallel and to keep them running until they appear to be sampling from the same distribution. One of the main attractions of Bayesian inference is that it produces a distribution of phylogenetic trees rather than a point estimate, in many cases using less time than a maximum-likelihood analysis followed by a bootstrap analysis. Areas of on-going research are how to ensure convergence and also how to choose appropriate prior distributions. 96 5.17 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 Consensus trees In most practical phylogenetic studies, a whole collection of different phylogenetic trees is computed for the set of taxa under investigation. These might be different phylogenetic “gene trees” that were obtained from a number of different gene sequences in a multi-gene study. Or, they might be different phylogenetic trees obtained from the same multiple sequence alignment using different inference methods, such as maximum-parsimony, maximum-likelihood and so forth. Yet another possibility is that the collection of phylogenetic trees was obtained using a method that produces multiple trees, such as Bayesian inference. In all such cases, one may want to treat the different phylogenetic trees as different estimations of the same underlying “true” tree and then a “consensus method” can be used to construct a “consensus tree” that represents those parts of the evolutionary history on which the different phylogenetic trees agree on, in some sense. Although consensus methods can be formulated for rooted phylogenetic trees, such methods are less commonly used, probably because most tree inference methods produce unrooted trees, and also perhaps because the location of the roots in the input trees may affect the resulting consensus tree too much. We will discuss the two most important consensus methods for unrooted phylogenetic trees, namely the strict consensus and the majority consensus. But first we must introduce the important concept of a split. 5.17.1 Trees and splits Any edge e of a phylogenetic tree T defines a split S = A|A¯ on X, that is, a bipartition of X into two non-empty sets A and A, consisting of all taxa on the one side and the other side of e, respectively. t1 t8 t2 For example: t3 e t4 t7 t6 t5 Here, A = {t3 , t4 , t5 } and A = {t1 , t2 , t6 , t7 , t8 }. We will use Σ(T ) to denote the split encoding of T , i.e. the set of all splits obtained from T . Here is an unrooted phylogenetic tree and its split encoding: b a c e d (a) Unrooted tree T {a} | {b, c, d, e} {b} | {a, c, d, e} {c} | {a, b, d, e} {d} | {a, b, c, e} {e} | {a, b, c, d} {a, b} | {c, d, e} {a, b, e} | {c, d} (b) Split encoding Σ(T ). For a given set tree it is easy to write down all splits. But, given a set of splits, how to decide whether it comes from some tree? Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 97 Definition 5.17.1 (Compatible splits) Two split S1 = A1 |A¯1 and S2 = A2 |A¯2 on X are called compatible, if one of the folowing four possible intersections of split parts is empty: A1 ∩ A2 , A1 ∩ A¯2 , A¯1 ∩ A2 or A¯1 ∩ A¯2 . A set of splits Σ on X is called compatible, if all pairs of splits in Σ are compatible. Theorem 5.17.2 (Buneman, 1971) A set of splits Σ on X (that contains all trivial splits) is compatible, if and only if there exists a phylogenetic tree T on X with Σ = Σ(T ). 5.17.2 Strict consensus Let X be a set of taxa and assume that we are given a collection of phylogenetic trees T = (T1 , . . . , Tk ) on X . Let Σstrict denote the set of all splits that occur in every input tree, that is, we define \ Σstrict = Σ(T ). T ∈T We call this the set of strict consensus splits of the collection of phylogenetic trees T . The strict consensus tree is the (unique) phylogenetic tree Tstrict for which Σ(Tstrict ) = Σstrict holds. It can be obtained from any tree in T by contracting every edge whose split is not contained in Σstrict . In practice, the strict consensus tree may be quite unresolved, often being very close to the “star tree”, which consists of one central node incident to all leaf edges. The “majority consensus” is usually more informative. 5.17.3 Majority consensus Let Σmajority denote the set of majority consensus splits, that is, the set of all splits that occur in more than half of all trees in T , formally defined as k Σmajority = S : |{T ∈ T : S ∈ Σ(T )}| > . 2 By the pigeon-hole principle, any two splits S1 and S2 that are contained in Σmajority will also occur in some input tree together, and thus they are compatible. Thus, there exists a unique phylogenetic tree Tmajority corresponding to Σmajority , which is called the majority consensus tree. Here are four unrooted phylogenetic trees on X = {a, b, c, d, e, f } and the corresponding strict and majority consensus trees: c d d d c c d c e b e b e b b e f a f f a a a f (a) Tree T1 (b) Tree T2 c (c) Tree T3 d (d) Tree T4 c d e e b b f a (e) Strict consensus Tstrict a f (f) Majority consensus Tmajority 98 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 5.18 Rooted phylogenetic networks Rooted phylogenetic trees can be generalized to “rooted phylogenetic networks”: Definition 5.18.1 (Rooted phylogenetic network) Let X be a set of taxa. A rooted phylogenetic network N = (V, E, λ) on X consists of a directed graph G = (V, E) and a node labeling λ : X → V such that: 1. The graph G is a directed, acyclic graph (DAG). 2. The graph G is rooted, that is, it has precisely one node ρ of indegree 0. 3. The node labeling λ assigns a label to each leaf of G. We will usually assume that the network does not contain any node that is suppressible, in other words, is an unlabeled node with indegree 1 and outdegree 1. In addition we will usually assume that the taxon labeling λ is a bijection between the taxon set X and the set of leaves of N . Note: a rooted DAG is always connected. In the the rooted phylogenetic network N depicted below, the leaves are labeled by taxa a,. . . ,i, the root node is ρ, there are two reticulate nodes labeled r and s, and all other nodes are tree nodes. Reticulate edges are shown in gray. The node labeled v is suppressible. All tree nodes except the node labeled w are bifurcations. The reticulation node r is a bicombination, whereas s is not. ½ v w s r a b c d e f g h i Here is a summary of all the types of nodes and edges that we would like to distinguish among: Definition 5.18.2 (Types of nodes and edges) Let N be a rooted phylogenetic network on a taxon set X . 1. The one node with indegree 0 is called the root. 2. A node with outdegree 0 is called a leaf node. 3. A node with at most one in-edge is called a tree node. 4. Any edge leading to a tree node is called a tree edge. 5. A node with at least two in-edges is called a reticulate node or a reticulation. 6. Any edge leading to a reticulate node is called a reticulate edge. 7. A node with exactly two out-edges is called a bifurcation. 8. A node with exactly two in-edges is called a bicombination. Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 5.19 99 Representing trees in rooted networks Let T = {T1 , T2 , . . . , Tm } be a collection of rooted phylogenetic trees on some taxon set X . For example, these might be the result of a multiple gene analysis or they might be different estimations of a single gene-based phylogeny. In practice, such a collection of trees is often combined into a consensus tree T¯ that is interpreted as representing the parts of the evolutionary history upon which the different trees agree. An alternative idea is to use a rooted phylogenetic network N to represent T in the following sense: Definition 5.19.1 (Rooted trees represented by a rooted network) Let N be a rooted phylogenetic network on a taxon set X . We say that a rooted phylogenetic tree T on X is represented by N , or that T is displayed by N , if T is embedded in N . We will use T (N ) to denote the set of all rooted phylogenetic trees on X that are represented by N . Below we show two rooted phylogenetic trees T1 and T2 , and then two different phylogenetic networks N1 and N2 that both represent the two trees: (a) Tree T1 : a b c d (b) Tree T2 : e f a (c) Network N1 : b c d e f (d) Network N2 : r a 5.20 b c d e f a b c d e f Recombination networks A recombination network is a rooted phylogenetic network that is used to describe the evolution of a set of closely related sequences (usually from different individuals of a population) in terms of mutations (along edges of the network), speciation or duplication events (at tree nodes) and recombination events (at reticulate nodes). In a simple recombination event, two sequences a and b of the same length L give rise to a new sequence c of length L that consists of a prefix of the sequence a and a suffix of the sequence b (or vice-versa). This is known as a single crossover event: a ACGTTGGCC AGTTGA b ccattgaat gtttca c ACGTTGGCC gtttca In a multiple crossover event, the resulting sequence c consists of a mosaic of segments from a and b. In either type of event, the character state ci is either ai or bi for every position i = 1, . . . , L of the sequence c. 100 5.20.1 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 Definition of a recombination network Definition 5.20.1 (Recombination network) Let M be a multiple alignment of sequences of length L, on X . A recombination network N representing M is given by a bicombining rooted phylogenetic network on X , together with two additional labellings: 1. Each node v of N is labeled by a sequence σ(v) of length L. 2. Each tree edge e is labeled by a set of positions δ(e) ⊆ {1, . . . , L}. These two labellings must fulfill the following compatibility conditions: 1. The sequence σ(v) assigned to any leaf v must equal the sequence in M that is given for the taxon associated with v. 2. If r is a reticulate node (often called a recombination node in this context) with parents v and w, then the sequence σ(r) must be obtainable from the two sequences σ(v) and σ(w) by a crossover. 3. If e = (v, w) is a tree edge, then the set of positions at which the two sequences σ(v) and σ(w) differ must equal δ(e). For computational reasons, the following condition is usually also required: (iv) Any given position may mutate at most once in the network. In other words, for any given position i there exists at most one edge e with i ∈ δ(e). The latter condition is usually referred to as the infinite sites assumption because for sequences of infinite length it holds that the probability of the same site being hit by a mutation more than once is zero, under a uniform distribution. In population studies one sometimes extends the definition of a recombination network to allow more than one root. This is necessary, for example, when different lineages are traced back to a founding population. However, we will not explore this further. Below we show (a) a multiple alignment of sequences of length five on X = {a, . . . , e} and (b) a recombination network N for M . For each node v we show the sequence σ(v) and for each edge e we show the set of positions δ(e). In this example, that set is either empty (unlabeled edges) or contains only one position (from 1 to 5): 00000 1 a 10000 b 11000 c 11100 d 00110 e 00111 3 10000 2 11000 00100 r 4 00110 11100 5 (a) Alignment M a b c d e 10000 11000 11100 00110 00111 (b) Recombination network N Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 5.20.2 101 The local-tree parsimony approach Let M be a multiple alignment of sequences of length L and let N be a corresponding recombination network, on X . Consider a position i in the alignment. For any particular reticulate node r in N the character state at position i of the sequence σ(r) is copied from exactly one of the two parents of r. In consequence, each individual character in M evolves along some rooted phylogenetic tree embedded in N . In the recombination network shown above, the first two characters evolve along the tree in which the leaf labeled c is a child of the node labeled 11000, whereas the last three characters evolve along the tree in which the leaf labeled c is a child of the node labeled 00100. So, in consequence, the evolutionary history of any sufficiently short segment of sequence is a rooted phylogenetic tree. This insight leads to the following approach: 1. Determine a suitable rooted phylogenetic tree Ti for each position i in the alignment M . We will refer to any such tree Ti as a local tree. 2. Combine all the trees into a suitable rooted phylogenetic network N . When doing this, there is trade-off to be made between the number of incompatibilities between a character and its associated local tree on the one hand, and the “recombination cost ” of switching from one tree topology to a different when going from position i to i + 1, on the other. The first part can be formulated as a dynamic programming problem (Hein, 1993). For this, define a local-tree graph G = (V, E) as follows. For each position i in M and each possible rooted tree T on X let v(i, T ) be a node in V . Any two nodes v(i, T ) and v(i + 1, T 0 ) at adjacent positions in the alignment are connected by a directed edge e = (v(i, T ), v(i + 1, T 0 )): ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... (1,T1 ) (11,T2) ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... Each node in G is assigned a weight ω(v(i, T )) equal to the minimum number of substitutions required for the i-th character on the rooted tree T . Each edge e = (v(i, T ), v(i + 1, T 0 )) is assigned a weight ω(v(i, T ), v(i + 1, T 0 )) = ω(e) that reflects the “recombination distance” between T and T 0 . Now, in the local-tree graph G any path P = (v(1, T1 ), v(2, T2 ), . . . , v(L, TL )) that starts at some node v(1, T1 ) and ends at some node v(L, TL ) selects a rooted phylogenetic tree Ti for each position i in M . The total weight of such a path P is obtained by adding all node weights and all edge weights along the path: L L−1 X X W (P ) = ω(v(i, Ti )) + ω(v(i, Ti ), v(i + 1, Ti+1 )). i=1 i=1 102 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 A most parsimonious set of trees for M is given by a path P of minimum weight W (P ), which can be computed by standard dynamic programming using the following recursion: minT 0 {W (i − 1, T 0 ) + ω((v(i − 1, T 0 ), v(i, T ))) + ω(v(i, T ))}, if i > 1, W (i, T ) = ω(v(1, T )), else. To complete the recursion we need to specify how to determine the weight of a node ω(v(i, T )) and the weight of an edge ω(v(i − 1, T 0 ), v(i + 1, T )): • The weight of a node is simply the parsimony score for the i-th character on the tree T , which can be efficiently computed using the Sankoff algorithm. • The weight of an edge is the minimum number of reticulation nodes in a rooted phylogenetic network that contains both trees T and T 0 . This problem is NP-hard problem. The described approach is not practical for three reasons. First, the number of possible rooted phylogenetic trees for n taxa is (2n − 3)!! and thus is very large even for relatively small values of n. Second, the computation of each edge weight in the graph is NP-hard. Finally, the second part of the approach, computation of a suitable rooted phylogenetic network that represents all the trees, is an NP-hard problem. Although the approach is not practical, it is conceptually appealing because it explicitly addresses the mosaic nature of aligned sequences: A long multiple sequence alignment consists of stretches of sequence that have evolved along a common rooted phylogenetic tree and these stretches are separated by crossover positions at which recombinations have occurred. Here is an illustration of the approach: 1 2 3 a 1 0 0 b 1 1 0 c 1 1 0 d 0 0 1 e 0 0 1 ... ... ... ... ... ... 4 5 6 7 8 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 (a) Alignment M ... ... ... ... ... ... ... ... 9 0 0 1 1 1 ... 10 1 1 0 0 0 ... ... 11 1 1 1 0 0 ... ... ... ... (1,T1 ) (11,T2) ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... (b) Local-trees graph G The path highlighted in bold chooses the tree T1 for the first five positions of M and then the tree T2 for the remaining six positions: Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 (c) Tree T1 : a b c 103 (d) Tree T2 : d a e b c d e (e) The resulting recombination network N : a b c d e Explaination: For the given multiple alignment M of sequences of length 11 on X = {a, b, c, d, e} we indicate the corresponding local-tree graph G. The graph has 11 columns of nodes, each column containing one node for each of the possible rooted phylogenetic trees on X . All nodes of a column are connected to all nodes of the next column by edges. The nodes and edges are weighted as described above. In this example the first five characters of M are all compatible with the tree T1 , whereas the last six characters are all compatible with the tree T2 , and so in this case a path through G that chooses T1 for the first five positions and then T2 for the remaining six positions is optimal. The depicted rooted phylogenetic network N contains both T1 and T2 and is a recombination network for M . More precisely, N is a recombination network if one adds the appropriate labeling of the nodes by sequences and edges by mutations. 5.20.3 A heuristic for the local-tree approach Let M be a multiple alignment of sequences of length L on X . We now look at a heuristic for computing a recombination network N for M that is based on the ideas just discussed (Hein, 1993). A first simplification is to use unrooted phylogenetic trees rather than rooted ones. The main simplification is to consider only a small part of the local-tree graph. The heuristic starts by computing a phylogenetic tree T from the whole dataset M using the maximum parsimony method. This tree will be used as a start tree and is assigned to position 1 of the alignment. Next, we compute the SPR-neighborhood S of T consisting of T and all phylogenetic trees that are exactly one SPR modification away from T . The main recursion of the previous section is repeatedly applied to the local-tree graph restricted to nodes that contain a tree from S until a position is reached at which a jump to a different tree T 0 in S is indicated by the recursion. We then determine the SPR-neighborhood S 0 of T 0 and continue the calculation with T 0 and S 0 in place of T and S. The heuristic algorithm considers only those trees in the local-tree graph that differ at most by one “SPR” from the current tree: 104 Bioinformatics I, WS’09-10, D. Huson, December 1, 2009 ... ... ... ... ... ... ... ... ... ... ... (1,T1 ) (11,T2) ... ... ... ... ... ... ... ... ... ... ... The resulting set of trees might depend quite strongly on the initial tree used. Hence, after completing the first pass of the algorithm, a second pass is performed in the opposite direction, using the local tree that was previously chosen for the last position of the alignment as the new start tree. This should be repeated a couple of times to completely erase any dependency on the initial choice of start tree. The final pass produces a set of local trees and breakpoint positions along the sequences that reflect the recombination history of the given set of sequences. This heuristic is an example of a general approach to studying recombination that operates by scanning along a multiple sequence alignment, building trees for local segments of the sequence and then comparing the trees in an attempt to determine recombination crossover positions. Once a set of phylogenetic trees has been constructed for the different stretches of a multiple sequence alignment M , then in a second step these trees may be combined into a recombination network to represent the evolutionary history of the data. By design, the local trees of neighboring positions will differ by at most one SPR and so in this case the calculation of a corresponding network is easier than in the general case, although it is still challenging in practice. 5.21 Summary • Phylogenetic trees are used to represent the evolutionary history of species. • They can be computed from distances using a method such as neighbor-joining. • They can be computed from a multiple alignment of sequences using a method based on maximum parsimony, maximum likelihood or the Bayesian approach. • Rooted phylogenetic networks can be used to represent multiple trees. • The local-tree approach attempts to explain the evolution of a set of sequences in the presence of recombination by finding a most parsimonious path through the local-tree graph.