Published by Kristina Fitzgerald. Modified over 9 years ago.
1
1 Building Phylogenetic Trees Yaw-Ling Lin ( 林耀鈴 ) Dept Computer Sci and Info Management Providence University, Taiwan E-mail: yllin@pu.edu.tw WWW: http://www.cs.pu.edu.tw/~yawlin
2
2 Phylogenetic Tree Topology: bifurcating –Leaves: 1 … N –Internal nodes: N+1 … 2N-2 (Figure labels: leaf, branch, internal node.)
3
3 Orthologues / Paralogues
4
4 Rooted / Unrooted Tree
5
5 Counting Trees
6
6 (2N - 5)!! = # unrooted trees for N taxa; (2N - 3)!! = # rooted trees for N taxa. (Figure: example trees built up by adding taxa A through F one at a time.)
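The two counting formulas on this slide can be checked numerically. The sketch below is only an illustration: the function names are made up here, and it assumes labelled taxa and strictly bifurcating trees, as on the slide.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * (k - 4) * ...; taken to be 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n_taxa: int) -> int:
    """(2N - 5)!! unrooted bifurcating topologies for N >= 3 labelled taxa."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


def count_rooted_trees(n_taxa: int) -> int:
    """(2N - 3)!! rooted bifurcating topologies for N >= 2 labelled taxa."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n_taxa - 3)


# Print N, the number of unrooted trees, and the number of rooted trees.
for n in (3, 4, 5, 6, 10):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Already at N = 10 there are 2,027,025 unrooted and 34,459,425 rooted topologies, which is why exhaustive search over trees is rarely feasible.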
7
7 Rooting the tree: To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root. Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D. (Figure: an unrooted tree on taxa A, B, C, D with the root position marked, next to the corresponding rooted tree.)
8
8 UPGMA -- Unweighted Pair Group Method with Arithmetic mean: the simplest method; uses a sequential clustering algorithm (assumes rate constancy among lineages, which is often violated). Step 1: the distance matrix on A, B, C has entries dAB, dAC, dBC; join A and B, placing each at branch length dAB / 2. Step 2: the reduced matrix on (AB), C has the single entry d(AB)C = (dAC + dBC) / 2; join (AB) and C at height d(AB)C / 2.
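The clustering loop described on this slide can be sketched as follows. This is a minimal illustration, not the slide author's code: the data layout (distances keyed by `frozenset` pairs, clusters as nested tuples) and the toy distances are assumptions. The update uses the size-weighted average, which reduces to the slide's (dAC + dBC) / 2 when A and B are single taxa.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns (root, height), where root is a nested tuple of labels and
    height maps every cluster to its distance above the leaves.
    """
    d = dict(dist)
    size = {name: 1 for name in names}          # active clusters and their sizes
    height = {name: 0.0 for name in names}
    while len(size) > 1:
        # Pick the closest pair of active clusters.
        a, b = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        height[merged] = d[frozenset((a, b))] / 2
        size_a, size_b = size.pop(a), size.pop(b)
        # Distance from the merged cluster to every remaining cluster.
        for c in size:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        size[merged] = size_a + size_b
    (root,) = size
    return root, height


# Toy distances (hypothetical values, not taken from the slides).
toy = {frozenset("AB"): 2.0, frozenset("AC"): 4.0, frozenset("BC"): 6.0}
root, height = upgma(["A", "B", "C"], toy)
print(root, height[root])
```

Because every cluster is placed at half the distance between its two children, all leaves end up equally far from the root, which is exactly the rate-constancy assumption the slide warns about.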
9
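9 UPGMA step 1: combine B and C. (Figure: tree under construction on taxa a, b, c, d, e.)
10
10 UPGMA step 2: combine BC and D. Averaged distances shown: (10+12)/2 and (4+6)/2. (Figure branch labels: 2, 2.)
11
11 UPGMA step 3: combine A and E. (Figure branch labels: 2, 2.5, 2.)
12
12 UPGMA step 4: combine AE and BCD. (Figure branch labels: 3.5, 2, 2.5, 2.)
13
13 UPGMA Result. (Figure: final tree on taxa a, b, c, d, e; branch labels: 3.5, 2, 2.5, .5, 2, 2, 3.5.)
14
14 UPGMA Result. (Figure: final tree on taxa a, b, c, d, e; branch labels: 3.5, 2, 2.5, .5, 2, 2, 3.5.)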
15
15 UPGMA(1)
16
16 UPGMA(2)
17
17 UPGMA -- Illustrations
18
18 When UPGMA fails …
19
19 Neighbor Joining: –Very popular method –Does not make the molecular clock assumption: a modified distance matrix is constructed to adjust for differences in the evolution rate of each taxon –Produces an unrooted tree –Assumes additivity: the distance between a pair of leaves = the sum of the lengths of the edges connecting them –Like UPGMA, constructs the tree by sequentially joining subtrees
20
20 Additivity
21
21 Naïve NJ by Additivity? There are O(n^2) (i,j) pairs and O(n^2) (k,l) pairs; (k,l) “rejects” (i,j) whenever additivity fails. So it takes O(n^4) to pick an (i,j) neighbor pair, and O(n^5) time in total suffices.
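One way to read the "rejection" test is through the four-point condition: on an additive tree, if i and j are neighbors, then d(i,j) + d(k,l) is never larger than the other two pairings of i, j, k, l. The sketch below assumes that reading, since the slide does not spell the test out; the function names and the toy matrix are illustrative.

```python
from itertools import combinations


def is_neighbor_pair(d, i, j, tol=1e-9):
    """Return True unless some pair (k, l) rejects (i, j).

    d is a symmetric matrix (list of lists) of additive distances.
    (k, l) rejects (i, j) when d[i][j] + d[k][l] exceeds the smaller of
    the two other pairings of the four leaves.
    """
    others = [k for k in range(len(d)) if k not in (i, j)]
    for k, l in combinations(others, 2):          # O(n^2) candidate rejecters
        paired = d[i][j] + d[k][l]
        crossed = min(d[i][k] + d[j][l], d[i][l] + d[j][k])
        if paired > crossed + tol:
            return False
    return True


def find_neighbor_pair(d):
    """Scan all O(n^2) pairs, so one pick costs O(n^4) overall."""
    for i, j in combinations(range(len(d)), 2):
        if is_neighbor_pair(d, i, j):
            return i, j
    return None


# Additive distances for the tree ((0,1),(2,3)) with leaf edges 1, 2, 3, 4
# and an internal edge of length 1 (hypothetical example data).
d = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 7],
    [6, 7, 7, 0],
]
print(find_neighbor_pair(d))
```

Repeating this pick for each of the O(n) joins gives the O(n^5) bound quoted on the slide.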
22
22 Neighbor Joining: Once we know the correct (i,j) pair
23
23 Neighbour Joining: why not pick the smallest (i,j) pair?
24
24 Neighbour Joining(3) i j
25
25 Neighbour Joining: Algorithm
26
26 Neighbor-Joining: Algorithm i j i j k m
27
27 Neighbor-Joining: Complexity. Each of the O(n) joining steps searches for the pair to join in O(n^2) time and updates the distance matrix in O(n^2) time, giving a total time complexity of O(n^3) and a space complexity of O(n^2).
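The formulas on the algorithm slides did not survive in this transcript, so the sketch below uses the commonly stated Studier-Keppler form of neighbor joining: minimize Q(i,j) = (n - 2) d(i,j) - r(i) - r(j), where r(i) is the sum of distances from i. Treat the selection criterion, the branch-length formulas, and the internal node names `U0`, `U1`, … as assumptions rather than as the slides' exact notation.

```python
def neighbor_joining(names, dist):
    """Build an unrooted tree; returns a list of (node, node, length) edges.

    names: list of taxon labels.
    dist:  dict of dicts with dist[a][b] == dist[b][a] for a != b.
    """
    nodes = list(names)
    d = {a: dict(dist[a]) for a in names}
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # O(n^2) search for the pair minimizing Q.
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                a, b = nodes[x], nodes[y]
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"U{next_id}"                      # hypothetical internal node name
        next_id += 1
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges += [(a, u, len_a), (b, u, len_b)]
        # Distances from the new node to every remaining node.
        d[u] = {}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


# Additive toy data: tree ((A,B),(C,D)), leaf edges 1, 2, 3, 4, internal edge 1.
pairs = {("A", "B"): 3, ("A", "C"): 5, ("A", "D"): 6,
         ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 7}
dist = {a: {} for a in "ABCD"}
for (a, b), value in pairs.items():
    dist[a][b] = dist[b][a] = float(value)
for edge in neighbor_joining(list("ABCD"), dist):
    print(edge)
```

As written, `r` is recomputed from scratch in every round, which matches the slide's O(n^2) per-step accounting.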
28
28 Reasoning about the NJ Method. Where do the ideas of S_ij and R_i come from? How correct is the algorithm? Is it a heuristic or an exact solution?
29
29 The “1-star” Sum of the Branch Lengths. D and L denote the distance between OTUs and the branch length between nodes, respectively. Each branch is counted N-1 times when all distances are added.
30
30 The “paired-2-star” Sum of the Branch Lengths
31
31 The “paired-2-star” Tree Size
32
32 The Distance and Branch Lengths between a Combined OTU and another One
33
33 Before the proof
34
34 Before the proof (Cont.)
35
35 Neighbor-Joining: The proof
36
36 Lemma 2 1 3 4
37
37 Lemma (Cont.)
38
38 Proof 2 1 3 4
39
39 Proof of the Theorem: by contradiction. Suppose that i and j are not neighbors. Let k and l be any pair of neighbors, so that i, j, k, and l are distinct and are represented in the tree. Consider the sum in formula (b), which is nonnegative. If m is a fifth OTU, then it joins the tree at a point x along one of the indicated arcs. Say that m is of type 1 if it joins the path from i to j at any node different from u, and that m is of type 2 if it joins the path from i to j at node u. Type 1: A = -2D_ux - 2D_uv. Type 2: B = -4D_vx + 2D_uv. For the sum in formula (b) to be nonnegative, there should be more OTUs of type 2 than of type 1. (Figure: tree with leaves i, j, k, l, r, s, internal nodes u, v, w, attachment points x, and arcs labelled A and B.)
40
40 Proof of the theorem (Cont.) If m is of type 1, then the corresponding summand in formula (b) is -2D_ux - 2D_uv. If m is of type 2, then the corresponding summand in formula (b) is -4D_vx + 2D_uv. For the sum in formula (b) to be nonnegative, there must be at least as many terms corresponding to OTUs m of type 2 as there are terms corresponding to OTUs m of type 1. It follows that there are more OTUs that join the path from i to j at u than there are OTUs that join that path at all other nodes combined. Because neither i nor j has a neighbor, there must be a pair r, s of neighbors that joins the path from i to j at a node w that is different from u. By the above argument applied to w, there are more OTUs that join the path from i to j at w than there are OTUs that join that path at all other nodes combined. The conclusions about u and w contradict each other, and the theorem follows.
41
41 Speeding up Neighbor-Joining Tree Construction. In this paper, the authors present several heuristics for speeding up the NJ method. The heuristics attempt to reduce the search time by using a quad-tree. The worst-case time complexity remains O(n^3), and the space complexity after adding the quad-tree is still O(n^2). The authors have implemented a tool, QuickJoin.
42
42 Previous Work. The neighbor-joining method was introduced by Saitou and Nei. The algorithm was later amended by Studier and Keppler, with a running time of O(n^3). BIONJ -- Gascuel et al. produced an O(n^3) implementation of a variant of the NJ algorithm that produces more accurate trees in many cases. QuickTree -- Durbin et al. produced a code-optimized implementation of the NJ algorithm.
43
43 Appendix:Proof of neighbour-joining
44
44 +/- of distance methods Advantages: –easy to perform –quick calculation –fit for sequences having high similarity scores Disadvantages: –the sequences are not considered as such (loss of information) –all sites are generally equally treated (do not take into account differences of substitution rates ) –not applicable to distantly divergent sequences.
45
45 Parsimony
46
46 Maximum Parsimony Method. Principle: search for the tree that requires the smallest number of character state changes between the OTUs. Informative sites: those that favor some trees over others; operationally, sites with at least two different kinds of residues, each of which is found in at least two of the OTU sequences.
Site    1  2  3  4  5  6  7  8  9  10
OTU 1   T  C  A  G  A  T  C  T  A  G
OTU 2   T  T  A  G  A  A  C  T  A  G
OTU 3   T  T  C  G  A  T  C  G  A  G
OTU 4   T  T  C  T  A  A  G  G  A  C
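The operational definition of an informative site translates directly into a column scan. In the sketch below, the four sequences are the slide's 40 letters read row by row as OTU 1 to OTU 4; that reading order is an assumption about the original table layout, and the function name is illustrative.

```python
from collections import Counter


def informative_sites(seqs):
    """Return 1-based positions where at least two different residues
    each occur in at least two of the sequences."""
    sites = []
    for pos, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(pos)
    return sites


otus = [
    "TCAGATCTAG",  # OTU 1 (assumed row-by-row reading of the slide's table)
    "TTAGAACTAG",  # OTU 2
    "TTCGATCGAG",  # OTU 3
    "TTCTAAGGAC",  # OTU 4
]
print(informative_sites(otus))
```

Under that reading, only sites 3, 6 and 8 are informative; every other column is either constant or has a residue that appears in a single OTU.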
47
47 Evaluating Parsimony Scores How do we compute the Parsimony score for a given tree? Traditional Parsimony –Each base change has a cost of 1 Weighted Parsimony –Each change is weighted by the score c(a,b)
48
48 Traditional Parsimony: solved independently for each position; linear time solution. (Figure: tree with leaves a, g, a and ancestral state sets {a,g} and {a}.)
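The state sets in the figure ({a,g} below, {a} at the root) are what the standard bottom-up set procedure for traditional parsimony (usually attributed to Fitch) produces. A minimal sketch for one position, assuming a tree written as nested 2-tuples with single-character strings at the leaves:

```python
def fitch(tree):
    """Return (candidate_states, cost) for one alignment position.

    tree: a leaf is a str holding the observed state; an internal node is
    a 2-tuple (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left_states, left_cost = fitch(tree[0])
    right_states, right_cost = fitch(tree[1])
    common = left_states & right_states
    if common:
        # The children can agree on a state: no extra change is needed.
        return common, left_cost + right_cost
    # Otherwise keep every candidate and charge one change.
    return left_states | right_states, left_cost + right_cost + 1


# The slide's example: leaves a and g under one node, joined with leaf a.
states, cost = fitch((("a", "g"), "a"))
print(states, cost)
```

Each node is visited once and does a constant number of set operations, which is the linear-time behaviour the slide refers to.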
49
49 Traditional Parsimony
50
50 Evaluating Weighted Parsimony: dynamic programming on the tree. Initialization: for each leaf i, set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞. Iteration: if k is a node with children i and j, then S(k,a) = min_b (S(i,b) + c(a,b)) + min_b (S(j,b) + c(a,b)). Termination: the cost of the tree is min_a S(r,a), where r is the root. (Figure: node k with children i and j.)
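The recurrence on this slide can be written almost verbatim as a recursive function. The tree encoding (nested 2-tuples, single-character leaves) and the transition/transversion cost function below are illustrative assumptions, not part of the slides.

```python
import math


def sankoff(tree, alphabet, cost):
    """Return S with S[a] = minimal cost of the subtree if its root has state a."""
    if isinstance(tree, str):
        # Initialization: 0 for the observed state, infinity otherwise.
        return {a: (0 if a == tree else math.inf) for a in alphabet}
    s_left = sankoff(tree[0], alphabet, cost)
    s_right = sankoff(tree[1], alphabet, cost)
    # Iteration: S(k,a) = min_b(S(i,b) + c(a,b)) + min_b(S(j,b) + c(a,b)).
    return {
        a: min(s_left[b] + cost(a, b) for b in alphabet)
        + min(s_right[b] + cost(a, b) for b in alphabet)
        for a in alphabet
    }


def weighted_parsimony(tree, alphabet, cost):
    """Termination: the cost of the tree is min_a S(root, a)."""
    return min(sankoff(tree, alphabet, cost).values())


def ts_tv_cost(a, b):
    """Hypothetical weights: 0 for no change, 1 for transitions, 2 for transversions."""
    if a == b:
        return 0
    return 1 if {a, b} in ({"a", "g"}, {"c", "t"}) else 2


print(weighted_parsimony((("a", "g"), ("c", "t")), "acgt", ts_tv_cost))
```

With a unit cost for every change, the same function reproduces the traditional parsimony score.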
51
51 Example: Aardvark, Bison, Chimp, Dog, Elephant. A: CAGGTA; B: CAGACA; C: CGGGTA; D: TGCACT; E: TGCGTA
52
52 Cost of Evaluating Parsimony. The score is evaluated on each position independently. Scores are then summed over all positions. If there are n nodes, m characters, and k possible values for each character, then the complexity is O(nmk) for traditional parsimony; the weighted recurrence, which minimizes over k values of b for each of the k values of a, takes O(nmk^2). By keeping traceback information, we can reconstruct the most parsimonious values at each ancestor node.
53
53 Weighted Parsimony
54
54 Traditional Parsimony is not “complete”
55
55 Parsimony: Searching over all trees by Branch and Bound
56
56 Assessing the trees: the bootstrap
57
57
58
58 Simultaneous alignment and phylogeny(1)
59
59 Inferring trees – Maximum Likelihood method. Maximum likelihood supposes a model of evolution along tree branches. Strategy: find the parameters (tree, branch lengths, substitution rate) that maximize the likelihood assigned to the data. Note: the model of evolution does not include indels! In the Phylip package: program PROTML
60
60 Probabilistic Methods The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences. Background probabilities: q(a) Mutation probabilities: P(a|b, t) Models for evolutionary mutations –Jukes Cantor –Kimura 2-parameter model Such models are used to derive the probabilities
61
61 Jukes Cantor model A model for mutation rates Mutation occurs at a constant rate Each nucleotide is equally likely to mutate into any other nucleotide with rate a.
62
62 Kimura 2-parameter model Allows a different rate for transitions and transversions.
63
63 Mutation Probabilities The rate matrix R is used to derive the mutation probability matrix S: S is obtained by integration. For Jukes Cantor: q can be obtained by setting t to infinity
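The slide's formulas are not preserved in this transcript. For the Jukes-Cantor model, the standard closed form obtained by integrating the rate matrix is P(same, t) = 1/4 (1 + 3e^(-4αt)) and P(different, t) = 1/4 (1 - e^(-4αt)), where α is the per-nucleotide substitution rate. The sketch below assumes that standard form; the function name and the dict-of-dicts layout are illustrative.

```python
import math

NUCLEOTIDES = "ACGT"


def jukes_cantor(t, alpha=1.0):
    """Return p with p[a][b] = probability that a becomes b after time t."""
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 * (1.0 + 3.0 * decay)
    diff = 0.25 * (1.0 - decay)
    return {a: {b: (same if a == b else diff) for b in NUCLEOTIDES}
            for a in NUCLEOTIDES}


# As t grows, every entry approaches the background probability q = 1/4.
for t in (0.0, 0.1, 1.0, 10.0):
    p = jukes_cantor(t)
    print(t, round(p["A"]["A"], 4), round(p["A"]["C"], 4))
```

Setting t to infinity sends the exponential term to zero, which is how the uniform background distribution q falls out of the same formula.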
64
64 Mutation Probabilities. Both models satisfy the following properties: –Lack of memory: P(a|b, t+t') = Σ_c P(a|c, t) P(c|b, t') –Reversibility: there exist stationary probabilities {P_a} s.t. P_a P(b|a, t) = P_b P(a|b, t)
65
65 Probabilistic Approach. Given P, q, the tree topology and branch lengths, we can compute: (Figure: tree with nodes x1, x2, x3, x4, x5 and branch lengths t1, t2, t3, t4.)
66
66 Computing the Tree Likelihood –We are interested in the probability of the observed data given the tree and branch “lengths” –Computed by summing over internal nodes –This can be done efficiently using a tree upward traversal pass.
67
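67 Tree Likelihood Computation. Define P(L_k|a) = probability of the leaves below node k given that x_k = a. Initialization: for leaves, P(L_k|a) = 1 if x_k = a, 0 otherwise. Iteration: if k is a node with children i and j, then P(L_k|a) = [Σ_b P(b|a, t_i) P(L_i|b)] · [Σ_c P(c|a, t_j) P(L_j|c)], where t_i and t_j are the lengths of the branches leading to i and j. Termination: the likelihood is Σ_a q(a) P(L_r|a), where r is the root.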
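The upward pass can be sketched for a single alignment position as below. This assumes the standard pruning recursion (each child's contribution is summed over its possible states, and the root is weighted by q), a Jukes-Cantor substitution model with rate `alpha`, and a hypothetical tree encoding in which an internal node is a pair of (subtree, branch length) tuples.

```python
import math

NUCLEOTIDES = "ACGT"


def jc(b, a, t, alpha=1.0):
    """P(b | a, t) under Jukes-Cantor (used here as an example model)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * decay) if a == b else 0.25 * (1.0 - decay)


def conditional(tree, trans):
    """Return P(L_k | a) for every state a at the root of `tree`.

    tree: a leaf is a one-letter str; an internal node is
    ((left_subtree, t_left), (right_subtree, t_right)).
    """
    if isinstance(tree, str):
        return {a: (1.0 if a == tree else 0.0) for a in NUCLEOTIDES}
    (left, t_left), (right, t_right) = tree
    p_left = conditional(left, trans)
    p_right = conditional(right, trans)
    return {
        a: sum(trans(b, a, t_left) * p_left[b] for b in NUCLEOTIDES)
        * sum(trans(c, a, t_right) * p_right[c] for c in NUCLEOTIDES)
        for a in NUCLEOTIDES
    }


def likelihood(tree, q, trans):
    """Sum over root states, weighted by the background probabilities q."""
    at_root = conditional(tree, trans)
    return sum(q[a] * at_root[a] for a in NUCLEOTIDES)


q = {a: 0.25 for a in NUCLEOTIDES}
tree = ((("A", 0.1), ("C", 0.2)), 0.05), ("A", 0.3)
print(likelihood(tree, q, jc))
```

For a full alignment, the per-position likelihoods would be multiplied (or their logarithms summed) under the assumption of independent positions.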
68
68 Maximum Likelihood (ML) Score each tree by –Assumption of independent positions Branch lengths t can be optimized –Gradient ascent –EM We look for the highest scoring tree –Exhaustive –Sampling methods (Metropolis)
69
69 Optimal Tree Search. Perform a search over possible topologies T1, T2, T3, T4, …, Tn, with parametric optimization (EM) for each. (Figure labels: parameter space, local maxima.)
70
70 Computational Problem Such procedures are computationally expensive! Computation of optimal parameters, per candidate, requires non-trivial optimization step. Spend non-negligible computation on a candidate, even if it is a low scoring one. In practice, such learning procedures can only consider small sets of candidate structures
71
71 Max Likelihood versus Parsimony (Example from BSA p. 225). Choose tree T, with unequal branch lengths. Generate 1000 sequences of length N according to the probabilistic model. The tables give how often each topology T1, T2, T3 was reconstructed.
(A) Reconstruction by ML
N      T1    T2    T3
20     419   339   242
100    638   204   158
500    904   61    35
2000   997   3     0
(B) Reconstruction by Parsimony
N      T1    T2    T3
20     396   378   224
100    405   515   79
500    404   594   2
2000   353   646   0
(Figure: tree T on leaves 1 to 4 with branch lengths 0.1, 0.09 and 0.3, and the three candidate topologies T1, T2, T3.)
Conclusion: ML infers the right tree as N gets larger; parsimony does not necessarily.
72
72 Max Likelihood versus NJ (Example from BSA p. 225). Choose tree T, with unequal branch lengths. Generate 1000 sequences of length N according to the probabilistic model.
(A) Reconstruction by ML
N      T1    T2    T3
20     419   339   242
100    638   204   158
500    904   61    35
2000   997   3     0
(B) Reconstruction by NJ
(Figure: tree T on leaves 1 to 4 with branch lengths 0.1, 0.09 and 0.3, and the three candidate topologies T1, T2, T3.)
Conclusion: ML infers the right tree as N gets larger. If the probabilistic model is correct, the ML distances should be very close to additive, and therefore the NJ method predicts the correct tree.
73
73 Phylip - practicalities. Menu-driven, no command line. Input file format: –First line: number of sequences and sequence length –Next lines: sequences. The first ten characters are the sequence name; then the sequence follows. Spaces and newlines are allowed. Dashes (-) signify gaps. Example:
4 46
hba1      MV-LSPADKTNVKAAWGKVG AHAGEYGAEALERMFLSFPTTKTYFP
beta      MVHLTPEEKSAVTALWGKVN VDEVGGEALGRLLVVYPWTQRFFESF
Myoglobin -MGLSDGEWQLVLNVWGKVE ADIPGHGQEVLIRLFKGHPETLEKFD
Leghemogl MGAFSEKQESLVKSSWEAFK QNVPHHSAVFYTLILEKAPAAQNMFS
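The layout described on this slide is simple enough to read with a few lines of code. The parser below is only a sketch: it assumes one sequence per line (no continuation lines or interleaving), takes the two numbers on the first line to be the sequence count and the sequence length, and uses a made-up function name and made-up example records.

```python
def parse_phylip(text):
    """Parse a simple PHYLIP-style block into {name: sequence}.

    Assumes each record sits on a single line: a 10-character name field
    followed by the sequence, in which spaces are ignored and '-' is a gap.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    n_seqs, length = (int(field) for field in lines[0].split())
    records = {}
    for line in lines[1:1 + n_seqs]:
        name = line[:10].strip()
        sequence = "".join(line[10:].split())
        if len(sequence) != length:
            raise ValueError(f"{name}: expected {length} residues, got {len(sequence)}")
        records[name] = sequence
    if len(records) != n_seqs:
        raise ValueError("fewer sequences than announced in the header")
    return records


# Build a tiny example with names padded to exactly ten characters.
example = "\n".join([
    "2 8",
    "seqA".ljust(10) + "MV-L SPAD",
    "seqB".ljust(10) + "MVHL TPEE",
])
print(parse_phylip(example))
```

Padding the names with `ljust(10)` mirrors the fixed-width name field; sequences that wrap over several lines would need extra handling that this sketch leaves out.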
74
74 The End