1 Building Phylogenetic Trees Yaw-Ling Lin ( 林耀鈴 ) Dept Computer Sci and Info Management Providence University, Taiwan WWW:
2 Phylogenetic Tree Topology: bifurcating –Leaves: 1 … N –Internal nodes: N+1 … 2N-2 [Figure labels: leaf, branch, internal node]
3 Orthologues / Paralogues
4 Rooted / Unrooted Tree
5 Counting Trees
6 $(2N-5)!!$ = number of unrooted trees for N taxa; $(2N-3)!!$ = number of rooted trees for N taxa. [Figure: example trees on 3, 4, 5 and 6 taxa, labeled A to F]
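The two double-factorial counts can be checked numerically. The following is a minimal sketch; the function names are illustrative and not part of the slides.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k-2) * (k-4) * ..., taken as 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees on n_taxa labeled leaves: (2N-5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted bifurcating tree")
    return double_factorial(2 * n_taxa - 5)


def num_rooted_trees(n_taxa: int) -> int:
    """Number of rooted bifurcating trees on n_taxa labeled leaves: (2N-3)!!."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa for a rooted bifurcating tree")
    return double_factorial(2 * n_taxa - 3)


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

The counts grow very quickly (10 taxa already give 2,027,025 unrooted trees), which is why exhaustive search over topologies is rarely feasible.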
7 Rooting the tree: To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root. Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D. [Figure: unrooted tree on A, B, C, D with the root marked, next to the resulting rooted tree]
8 UPGMA -- Unweighted Pair Group Method with Arithmetic mean. Simplest method: uses a sequential clustering algorithm (assumes rate constancy among lineages, which is often violated). Step 1: from the distance matrix over A, B, C (entries $d_{AB}$, $d_{AC}$, $d_{BC}$), join A and B, each at depth $d_{AB}/2$. Step 2: form the reduced matrix over (AB) and C with $d_{(AB)C} = (d_{AC} + d_{BC})/2$, and join (AB) with C at depth $d_{(AB)C}/2$.
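9 UPGMA Step 1: combine B and C [Figure: tree on leaves a, e, c, b, d]
10 UPGMA Step 2: combine BC and D; new distances (10+12)/2 and (4+6)/2 [Figure: tree on leaves a, e, c, b, d with branch lengths 2, 2]
11 UPGMA Step 3: combine A and E [Figure: tree on leaves a, e, c, b, d]
12 UPGMA Step 4: combine AE and BCD [Figure: tree on leaves a, e, c, b, d]
13 UPGMA Result [Figure: tree on leaves a, e, c, b, d]
14 UPGMA Result [Figure: tree on leaves a, e, c, b, d]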
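The clustering steps above can be sketched in a few lines. This is a minimal illustration rather than the deck's own code: it assumes the usual size-weighted average for cluster distances (which reduces to the simple mean when both clusters are single taxa), returns a Newick string, and uses a made-up three-taxon matrix.

```python
def upgma(labels, matrix):
    """Build a rooted UPGMA tree and return it as a Newick string.

    labels: list of taxon names; matrix: symmetric list-of-lists of distances.
    """
    n = len(labels)
    # Active clusters: id -> (newick subtree, number of leaves, height above leaves)
    clusters = {i: (labels[i], 1, 0.0) for i in range(n)}
    # Pairwise distances keyed by (smaller id, larger id)
    dist = {(i, j): float(matrix[i][j]) for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of clusters
        (a, b), d_ab = min(dist.items(), key=lambda kv: kv[1])
        sub_a, size_a, h_a = clusters.pop(a)
        sub_b, size_b, h_b = clusters.pop(b)
        height = d_ab / 2  # the new node sits at half the joined distance
        newick = f"({sub_a}:{height - h_a:g},{sub_b}:{height - h_b:g})"
        # Distance from the merged cluster to every remaining cluster
        new_d = {}
        for c in clusters:
            d_ac = dist[tuple(sorted((a, c)))]
            d_bc = dist[tuple(sorted((b, c)))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop all entries involving the two merged clusters, then add the new ones
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_d)
        clusters[next_id] = (newick, size_a + size_b, height)
        next_id += 1
    (newick, _, _), = clusters.values()
    return newick + ";"


if __name__ == "__main__":
    # Hypothetical ultrametric distances, for illustration only
    print(upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]]))
```

Because every join places the new node at half the joined distance, all leaves end up equidistant from the root, which is exactly the rate-constancy assumption that can make UPGMA fail.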
15 UPGMA(1)
16 UPGMA(2)
17 UPGMA -- Illustrations
18 When UPGMA fails …
19 Neighbor Joining –Very popular method –Does not make the molecular clock assumption: a modified distance matrix is constructed to adjust for differences in the evolution rate of each taxon –Produces an unrooted tree –Assumes additivity: distance between a pair of leaves = sum of lengths of the edges connecting them –Like UPGMA, constructs the tree by sequentially joining subtrees
20 Additivity
21 Naïve NJ by Additivity? –$O(n^2)$ (i,j) pairs –$O(n^2)$ (k,l) pairs –(k,l) “rejects” (i,j) whenever additivity fails –$O(n^4)$ to pick an (i,j) neighbor pair –With $O(n)$ joins in total, $O(n^5)$ time suffices
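The rejection test can be sketched with the four-point condition. This is an illustrative reading of the slide, assuming exactly additive distances on a binary tree with positive internal branches: (i,j) survives only if $d_{ij} + d_{kl}$ is strictly the smallest of the three pairings for every other pair (k,l). The function names and the example matrix are hypothetical.

```python
from itertools import combinations


def is_neighbor_pair(D, i, j, tol=1e-9):
    """Return True if no other pair (k, l) rejects (i, j) under the four-point test."""
    others = [k for k in range(len(D)) if k not in (i, j)]
    for k, l in combinations(others, 2):
        inside = D[i][j] + D[k][l]
        cross = min(D[i][k] + D[j][l], D[i][l] + D[j][k])
        if inside > cross - tol:
            return False  # (k, l) rejects (i, j)
    return True


def find_neighbor_pairs(D):
    """Check all O(n^2) pairs against all O(n^2) other pairs: O(n^4) overall."""
    n = len(D)
    return [(i, j) for i, j in combinations(range(n), 2) if is_neighbor_pair(D, i, j)]


if __name__ == "__main__":
    # Additive distances for the tree ((A:1,B:2):5,(C:3,D:4)), illustration only
    D = [
        [0, 3, 9, 10],
        [3, 0, 10, 11],
        [9, 10, 0, 7],
        [10, 11, 7, 0],
    ]
    print(find_neighbor_pairs(D))
```

Repeating this search after each join is what leads to the $O(n^5)$ bound, which motivates the cheaper selection criterion used by neighbor joining.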
22 Neighbor Joining: Once we know the correct (i,j) pair
23 Neighbor Joining: why not pick the smallest (i,j) pair?
24 Neighbor Joining(3) [Figure: nodes i, j]
25 Neighbor Joining: Algorithm
26 Neighbor-Joining: Algorithm [Figure: nodes i, j, k, m]
27 Neighbor-Joining: Complexity Each iteration searches the matrix for the pair to join in $O(n^2)$ time and updates the distance matrix in at most $O(n^2)$ time. Over $O(n)$ iterations this gives a total time complexity of $O(n^3)$ and a space complexity of $O(n^2)$.
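The selection formula itself lives in the slide figures and is not in the text, so the sketch below assumes the widely used Q-criterion form: with row sums $R_i = \sum_k d_{ik}$, join the pair minimizing $(n-2) d_{ij} - R_i - R_j$. The internal node names (`U0`, `U1`, ...) and the example matrix are made up for illustration.

```python
def neighbor_joining(labels, matrix):
    """Return the edges (node, node, branch_length) of an unrooted NJ tree."""
    nodes = list(labels)
    d = {a: {b: float(matrix[i][j]) for j, b in enumerate(labels)}
         for i, a in enumerate(labels)}
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums over the currently active nodes
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # O(n^2) search for the pair minimizing the Q criterion
        i, j = min(
            ((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:]),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        u = f"U{next_id}"  # hypothetical name for the new internal node
        next_id += 1
        # Branch lengths from the joined pair to the new node
        len_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i
        edges += [(u, i, len_i), (u, j, len_j)]
        # Distances from the new node to every remaining node
        d[u] = {u: 0.0}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = (d[i][k] + d[j][k] - d[i][j]) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))  # connect the last two nodes
    return edges


if __name__ == "__main__":
    # Additive distances for the tree ((A:1,B:2):5,(C:3,D:4)), illustration only
    matrix = [[0, 3, 9, 10], [3, 0, 10, 11], [9, 10, 0, 7], [10, 11, 7, 0]]
    for edge in neighbor_joining(["A", "B", "C", "D"], matrix):
        print(edge)
```

On exactly additive input such as this one, the recovered branch lengths match the generating tree. Each pass recomputes the row sums and rescans all pairs, which is the $O(n^2)$ per-iteration cost behind the overall $O(n^3)$.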
28 Reasoning the NJ Method –Where do the ideas of $S_{i,j}$ and $R_i$ come from? –How correct is the algorithm? Is it a heuristic or an exact solution?
29 The “1-star” Sum of the Branch Lengths –D denotes the distance between OTUs and L the branch length between nodes –Each branch is counted N-1 times when all distances are added
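The counting argument can be written out as a short derivation. The symbols X (the central node of the star) and $S_0$ (the total branch length) are assumed here for illustration, since the slide's own formula is not in the text.

```latex
% Star tree with central node X: every pairwise distance is D_{ij} = L_{iX} + L_{jX}.
% Each branch L_{iX} appears in the N-1 pairs that contain OTU i.
\sum_{i<j} D_{ij}
  = \sum_{i<j} \left( L_{iX} + L_{jX} \right)
  = (N-1) \sum_{i=1}^{N} L_{iX}
\quad\Longrightarrow\quad
S_0 = \sum_{i=1}^{N} L_{iX} = \frac{1}{N-1} \sum_{i<j} D_{ij}
```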
30 The “paired-2-star” Sum of the Branch Lengths
31 The “paired-2-star” Tree Size
32 The Distance and Branch Lengths between a Combined OTU and another One
33 Before the proof
34 Before the proof (Cont.)
35 Neighbor-Joining: The proof
36 Lemma
37 Lemma (Cont.)
38 Proof
39 Proof of the Theorem: by contradiction. Suppose that i and j are not neighbors. Let k and l be any pair of neighbors, so that i, j, k, and l are distinct and are represented in the tree. Consider the sum in formula (b), which is nonnegative. If m is a fifth OTU, then it joins the tree at a point x along one of the indicated arcs. Say that m is of type 1 if it joins the path from i to j at any node different from u, and that m is of type 2 if it joins the path from i to j at node u. Type 1: $A = -2D_{ux} - 2D_{uv}$. Type 2: $B = -4D_{vx} + 2D_{uv}$. For the sum in formula (b) to be nonnegative, there must be at least as many type 2 terms as type 1 terms. [Figure: tree with leaves i, j, k, l, r, s and internal nodes u, v, w; attachment points x are marked A or B by type]
40 Proof of the theorem (Cont.) If m is of type 1, then the corresponding summand in formula (b) is $-2D_{ux} - 2D_{uv}$. If m is of type 2, then the corresponding summand in formula (b) is $-4D_{vx} + 2D_{uv}$. For the sum in formula (b) to be nonnegative, there must be at least as many terms corresponding to OTUs m of type 2 as there are terms corresponding to OTUs m of type 1. It follows that there are more OTUs that join the path from i to j at u than there are OTUs that join that path at all other nodes combined. Because neither i nor j has a neighbor, there must be a pair r, s of neighbors that joins the path from i to j at a node w that is different from u. By the above argument applied to w, there are more OTUs that join the path from i to j at w than there are OTUs that join that path at all other nodes combined. The conclusions about u and w contradict each other, and the theorem follows.
41 Speeding up Neighbor-Joining Tree Construction In this paper, the authors present several heuristics for speeding up the NJ method. The heuristics attempt to reduce the search time by using a quad-tree. The worst-case time complexity remains $O(n^3)$, and the space complexity after adding the quad-tree is still $O(n^2)$. The authors have implemented a tool, QuickJoin.
42 Previous Work The neighbor-joining method was introduced by Saitou and Nei. The algorithm was later amended by Studier and Keppler, with a running time of $O(n^3)$. BIONJ -- Gascuel et al. produced an $O(n^3)$ implementation of a variant of the NJ algorithm that produces more accurate trees in many cases. QuickTree -- Durbin et al. produced a code-optimized implementation of the NJ algorithm.
43 Appendix: Proof of neighbor-joining
44 +/- of distance methods Advantages: –easy to perform –quick calculation –fit for sequences having high similarity scores Disadvantages: –the sequences are not considered as such (loss of information) –all sites are generally treated equally (differences in substitution rates are not taken into account) –not applicable to distantly divergent sequences
45 Parsimony
46 Maximum Parsimony Method –Principle: search for the tree that requires the smallest number of character state changes between the OTUs –Informative sites: those that favor some trees over others –Operationally: a site is informative if it has at least two different kinds of residues, each of which is found in at least two of the OTU sequences. Example alignment (axes labeled Site and OTU): T C A G A T C T A G T T A G A A C T A G T T C G A T C G A G T T C T A A G G A C
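The operational definition of an informative site translates directly into a column test. The sketch below uses a small hypothetical alignment (the slide's own table layout is not recoverable from the text), and the function names are illustrative.

```python
from collections import Counter


def is_informative(column):
    """A site is informative if at least two residues each occur in at least two sequences."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_sites(seqs):
    """Return the 0-based positions of the parsimony-informative sites."""
    return [pos for pos, column in enumerate(zip(*seqs)) if is_informative(column)]


if __name__ == "__main__":
    # Hypothetical alignment of four OTUs, for illustration only
    alignment = [
        "AAGAGTGCA",
        "AGCCGTGCG",
        "AGATATCCA",
        "AGAGATCCG",
    ]
    print(informative_sites(alignment))
```

Only the informative columns can change which topology scores best, so parsimony programs can skip the others when comparing trees.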
47 Evaluating Parsimony Scores How do we compute the Parsimony score for a given tree? Traditional Parsimony –Each base change has a cost of 1 Weighted Parsimony –Each change is weighted by the score c(a,b)
48 Traditional Parsimony –Solved independently for each position –Linear time solution [Figure: tree with leaves a, g, a; internal nodes labeled {a,g} and {a}, with states a, a assigned]
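The set-based bottom-up pass suggested by the figure is commonly known as Fitch's algorithm; the following is a minimal sketch of it, not the deck's own code. It borrows the five sequences from the deck's Aardvark/Bison/Chimp/Dog/Elephant example, but the tree topology is an assumption made only for illustration.

```python
def fitch_site(tree, state_of):
    """Return (candidate state set, number of changes) for one site.

    tree is a leaf name (str) or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {state_of[tree]}, 0
    (ls, lc), (rs, rc) = (fitch_site(child, state_of) for child in tree)
    common = ls & rs
    if common:
        return common, lc + rc       # children agree: no extra change
    return ls | rs, lc + rc + 1      # children disagree: one change


def parsimony_score(tree, seqs):
    """Sum the per-site change counts; sites are handled independently."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[pos] for name, s in seqs.items()})[1]
        for pos in range(n_sites)
    )


# Sequences from the deck's example slide; the topology below is hypothetical
SEQS = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA", "D": "TGCACT", "E": "TGCGTA"}
TREE = ((("A", "B"), "C"), ("D", "E"))

if __name__ == "__main__":
    print(parsimony_score(TREE, SEQS))
```

Each node does a constant number of set operations per site, which is where the linear-time claim comes from.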
49 Traditional Parsimony
50 Evaluating Weighted Parsimony: dynamic programming on the tree. Initialization: for each leaf i, set $S(i,a) = 0$ if i is labeled by a, otherwise $S(i,a) = \infty$. Iteration: if k is a node with children i and j, then $S(k,a) = \min_b (S(i,b) + c(a,b)) + \min_b (S(j,b) + c(a,b))$. Termination: the cost of the tree is $\min_a S(r,a)$, where r is the root. [Figure: node k with children i and j]
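The recurrence can be sketched as a recursive function that returns the table $S(k,\cdot)$ for each node. This is an illustrative implementation under stated assumptions: a DNA alphabet, trees given as nested pairs of leaf names, a caller-supplied cost function, and a hypothetical topology over the deck's five example sequences.

```python
import math

ALPHABET = "ACGT"


def sankoff(tree, leaf_state, cost):
    """Return {a: S(k, a)} for the subtree rooted at `tree`, for one site."""
    if isinstance(tree, str):
        # Initialization: 0 for the observed state, infinity otherwise
        return {a: 0.0 if a == leaf_state[tree] else math.inf for a in ALPHABET}
    left, right = (sankoff(child, leaf_state, cost) for child in tree)
    # Iteration: best child state b for each parent state a, for both children
    return {
        a: min(left[b] + cost(a, b) for b in ALPHABET)
        + min(right[b] + cost(a, b) for b in ALPHABET)
        for a in ALPHABET
    }


def weighted_parsimony(tree, seqs, cost):
    """Termination: min over root states, summed over independent sites."""
    n_sites = len(next(iter(seqs.values())))
    total = 0.0
    for pos in range(n_sites):
        states = {name: s[pos] for name, s in seqs.items()}
        total += min(sankoff(tree, states, cost).values())
    return total


def unit_cost(a, b):
    """Every change costs 1, which recovers traditional parsimony."""
    return 0.0 if a == b else 1.0


# Sequences from the deck's example slide; the topology below is hypothetical
SEQS = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA", "D": "TGCACT", "E": "TGCGTA"}
TREE = ((("A", "B"), "C"), ("D", "E"))

if __name__ == "__main__":
    print(weighted_parsimony(TREE, SEQS, unit_cost))
```

Swapping `unit_cost` for a function that, for example, charges transversions more than transitions changes the score without touching the recurrence.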
51 Example: Aardvark (A): CAGGTA; Bison (B): CAGACA; Chimp (C): CGGGTA; Dog (D): TGCACT; Elephant (E): TGCGTA
52 Cost of Evaluating Parsimony The score is evaluated on each position independently. Scores are then summed over all positions. If there are n nodes, m characters, and k possible values for each character, then the complexity is $O(nmk)$ for traditional parsimony and $O(nmk^2)$ for the weighted recurrence, which minimizes over k child states for each of the k parent states. By keeping traceback information, we can reconstruct the most parsimonious values at each ancestor node.
53 Weighted Parsimony
54 Traditional Parsimony is not “complete”
55 Parsimony: Searching over all trees by Branch and Bound
56 Assessing the trees: the bootstrap
58 Simultaneous alignment and phylogeny(1)
59 Inferring trees – Maximum Likelihood method Maximum likelihood supposes a model of evolution along tree branches. Strategy: find the parameters (tree, branch lengths, substitution rate) that maximize the likelihood assigned to the data. Note: the model of evolution does not include indels! In the Phylip package: program PROTML
60 Probabilistic Methods The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences. Background probabilities: q(a) Mutation probabilities: P(a|b, t) Models for evolutionary mutations –Jukes Cantor –Kimura 2-parameter model Such models are used to derive the probabilities
61 Jukes Cantor model A model for mutation rates Mutation occurs at a constant rate Each nucleotide is equally likely to mutate into any other nucleotide with rate a.
62 Kimura 2-parameter model Allows a different rate for transitions and transversions.
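63 Mutation Probabilities The rate matrix R is used to derive the mutation probability matrix S(t), which is obtained by integration. For Jukes Cantor with substitution rate $\alpha$: the diagonal entries of S(t) are $\frac{1}{4}(1 + 3e^{-4\alpha t})$ and the off-diagonal entries are $\frac{1}{4}(1 - e^{-4\alpha t})$. q can be obtained by setting t to infinity.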
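64 Mutation Probabilities Both models satisfy the following properties: –Lack of memory: $P(a|c, t+t') = \sum_b P(a|b,t) P(b|c,t')$ –Reversibility: there exist stationary probabilities $\{P_a\}$ s.t. $P_a P(b|a,t) = P_b P(a|b,t)$ [Figure: substitutions among A, G, T, C]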
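For the Jukes Cantor model these properties can be checked numerically from the standard closed form: probability $\frac{1}{4}(1 + 3e^{-4\alpha t})$ of staying in the same base and $\frac{1}{4}(1 - e^{-4\alpha t})$ of moving to each other base. The sketch below assumes that parameterization (off-diagonal rate `alpha`); the helper names and numeric values are illustrative.

```python
import math

BASES = "ACGT"


def jc_matrix(alpha, t):
    """Jukes Cantor substitution probabilities after time t, as a 4x4 list."""
    e = math.exp(-4.0 * alpha * t)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if a == b else diff for b in BASES] for a in BASES]


def matmul(x, y):
    """Plain 4x4 matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


if __name__ == "__main__":
    alpha = 0.1  # hypothetical rate, for illustration only
    short, long_ = jc_matrix(alpha, 0.5), jc_matrix(alpha, 1.5)
    # Lack of memory: evolving 0.5 then 1.5 equals evolving 2.0 in one step
    print(matmul(short, long_)[0][0], jc_matrix(alpha, 2.0)[0][0])
    # Stationary distribution: every entry approaches 1/4 as t grows
    print(jc_matrix(alpha, 1e6)[0])
```

Since the Jukes Cantor matrix is symmetric and its stationary distribution is uniform, reversibility holds trivially here; the Kimura model needs the same checks with two rates.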
65 Probabilistic Approach Given P, q, the tree topology and branch lengths, we can compute: [Figure: tree with nodes x1 … x5 and branch lengths t1 … t4]
66 Computing the Tree Likelihood –We are interested in the probability of the observed data given the tree and branch “lengths” –Computed by summing over internal nodes –This can be done efficiently using a tree upward traversal pass
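67 Tree Likelihood Computation Define $P(L_k|a)$ = prob. of leaves below node k given that $x_k = a$. Init: for leaves, $P(L_k|a) = 1$ if $x_k = a$; 0 otherwise. Iteration: if k is a node with children i and j, then $P(L_k|a) = \sum_b P(b|a,t_i) P(L_i|b) \cdot \sum_c P(c|a,t_j) P(L_j|c)$. Termination: the likelihood is $\sum_a q(a) P(L_{root}|a)$.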
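The upward pass can be sketched for a single site as follows. This is an illustration under assumptions not fixed by the slide: Jukes Cantor substitution probabilities with rate `alpha`, uniform background probabilities q(a) = 1/4, and a tree encoded as nested tuples `(left, t_left, right, t_right)` with observed bases at the leaves.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t, alpha=1.0):
    """Jukes Cantor probability of observing b after time t, starting from a."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def conditional(node):
    """Return {a: P(L_k | a)} for the subtree rooted at `node`."""
    if isinstance(node, str):
        # Init: a leaf is certain about its own observed base
        return {a: 1.0 if a == node else 0.0 for a in BASES}
    left, t_left, right, t_right = node
    pl, pr = conditional(left), conditional(right)
    # Iteration: sum over the child states along each branch
    return {
        a: sum(jc_prob(a, b, t_left) * pl[b] for b in BASES)
        * sum(jc_prob(a, c, t_right) * pr[c] for c in BASES)
        for a in BASES
    }


def site_likelihood(root):
    """Termination: average the root table over the background distribution."""
    cond = conditional(root)
    return sum(0.25 * cond[a] for a in BASES)


if __name__ == "__main__":
    # Hypothetical tree: leaves A and G under one node, joined with leaf A
    tree = (("A", 0.1, "G", 0.2), 0.3, "A", 0.4)
    print(site_likelihood(tree))
```

Each node is visited once and does $O(k^2)$ work for an alphabet of size k, so the sum over all internal-node assignments never has to be enumerated explicitly.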
68 Maximum Likelihood (ML) Score each tree by its likelihood, taken as a product over positions (assumption of independent positions). Branch lengths t can be optimized: –Gradient ascent –EM. We look for the highest scoring tree: –Exhaustive –Sampling methods (Metropolis)
69 Optimal Tree Search Perform search over possible topologies. [Figure: candidate topologies T1, T2, T3, T4, …, Tn, each with a parametric optimization (EM) over its parameter space, showing local maxima]
70 Computational Problem Such procedures are computationally expensive! Computing the optimal parameters for each candidate requires a non-trivial optimization step, so non-negligible computation is spent on a candidate even if it is a low scoring one. In practice, such learning procedures can only consider small sets of candidate structures.
71 Max Likelihood versus Parsimony (Example from BSA p. 225) Choose tree T, with unequal branch lengths. Generate 1000 sequences of length N according to the probabilistic model. (A) Reconstruction by ML (B) Reconstruction by Parsimony [Tables with columns N, T1, T2, T3] Conclusion: ML infers the right tree as N gets larger; Parsimony does not necessarily.
72 Max Likelihood versus NJ (Example from BSA p. 225) Choose tree T, with unequal branch lengths. Generate 1000 sequences of length N according to the probabilistic model. (A) Reconstruction by ML (B) Reconstruction by NJ [Tables with columns N, T1, T2, T3] Conclusion: ML infers the right tree as N gets larger. If the probabilistic model is correct, the ML distances should be very close to additive, and therefore the NJ method predicts the correct tree.
73 Phylip - practicalities –Menu-driven, no command line –Input file format: the first line gives the number of sequences and the sequence length; the next lines give the sequences. The first ten characters are the sequence name, then the sequence follows. Spaces and newlines are allowed. Dashes (-) signify gaps. Example:
4 46
hba1      MV-LSPADKTNVKAAWGKVG AHAGEYGAEALERMFLSFPTTKTYFP
beta      MVHLTPEEKSAVTALWGKVN VDEVGGEALGRLLVVYPWTQRFFESF
Myoglobin -MGLSDGEWQLVLNVWGKVE ADIPGHGQEVLIRLFKGHPETLEKFD
Leghemogl MGAFSEKQESLVKSSWEAFK QNVPHHSAVFYTLILEKAPAAQNMFS
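A reader for this input format can be sketched as follows. It is a minimal illustration that covers only the sequential layout described on the slide (ten-character name, then the sequence, with spaces and line breaks ignored); `parse_phylip` is a hypothetical helper, not part of the Phylip package.

```python
def parse_phylip(text):
    """Parse sequential Phylip-style input into an ordered {name: sequence} dict."""
    lines = text.splitlines()
    n_seqs, length = map(int, lines[0].split())
    records = {}
    name = None
    for line in lines[1:]:
        if not line.strip():
            continue
        if name is None or len(records[name]) >= length:
            # Start of a new record: the first ten characters are the name
            name, chunk = line[:10].strip(), line[10:]
            records[name] = ""
        else:
            # Continuation line of the current sequence
            chunk = line
        records[name] += chunk.replace(" ", "")
    if len(records) != n_seqs or any(len(s) != length for s in records.values()):
        raise ValueError("header does not match the sequences found")
    return records


# The example from the slide, with names padded to ten characters
ROWS = [
    ("hba1", "MV-LSPADKTNVKAAWGKVG AHAGEYGAEALERMFLSFPTTKTYFP"),
    ("beta", "MVHLTPEEKSAVTALWGKVN VDEVGGEALGRLLVVYPWTQRFFESF"),
    ("Myoglobin", "-MGLSDGEWQLVLNVWGKVE ADIPGHGQEVLIRLFKGHPETLEKFD"),
    ("Leghemogl", "MGAFSEKQESLVKSSWEAFK QNVPHHSAVFYTLILEKAPAAQNMFS"),
]
EXAMPLE = "4 46\n" + "\n".join(f"{name:<10}{seq}" for name, seq in ROWS)

if __name__ == "__main__":
    for name, seq in parse_phylip(EXAMPLE).items():
        print(name, len(seq), seq[:10])
```

Checking the header counts against the parsed records catches the most common formatting mistake, a name field that is not exactly ten characters wide.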
74 The End