Phylogenetic Trees, Lecture 3. Based on: Durbin et al. 7.4; Gusfield 17
2 Character-based methods for constructing phylogenies
In this approach, trees are constructed by comparing the characters of the corresponding species. Characters may be morphological (e.g. tooth structure) or molecular (homologous DNA sequences). One common approach is Maximum Parsimony.
Assumptions:
- Independence of characters (no interactions)
- The best tree is the one in which the fewest changes take place
3 1. Maximum Parsimony
Input: four nucleotide sequences, AAG, AAA, GGA and AGA, taken from four species.
Question: Which evolutionary tree best explains these sequences?
[Figure: example tree with node labels AGA, AAA, GGA, AAG, AAA. Total #substitutions = 4]
One answer (the parsimony principle): pick a tree that has the minimum total number of symbol substitutions between species and their ancestors in the phylogenetic tree.
4 Example Continued
There are many possible trees. For example:
[Figure: two candidate trees. Left tree, node labels AGA, GGA, AAA, AAG, AAA, AGA, AAA: total #substitutions = 3. Right tree, node labels GGA, AAA, AGA, AAG, AAA: total #substitutions = 4]
The left tree is preferred over the right tree. The total number of changes is called the parsimony score.
5 Simple Example
- Suppose we have five species, three of which have 'C' and two 'T' at a specified position.
- The minimal tree has one evolutionary change.
[Figure: tree over leaves C, C, C, T, T with a single change between C and T]
6 Extension to Many Letters
What is the parsimony score of the following sequences?
Aardvark (A): CAGGTA
Bison    (B): CAGACA
Chimp    (C): CGGGTA
Dog      (D): TGCACT
Elephant (E): TGCGTA
We proceed character by character; each character's score is computed independently of the others.
7 Fitch’s Algorithm for Evaluating Trees
1. Traverse the tree from the leaves to the root, determining the set of possible states (e.g. nucleotides) for each internal node.
2. Traverse the tree from the root to the leaves, picking ancestral states for the internal nodes.
8 Fitch’s Algorithm – Step 1
# of changes = # of union operations
[Figure: example tree with leaf states and candidate state sets (such as CT, GT, AGT) at the internal nodes]
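9 Fitch’s Algorithm – Step 1
Do a post-order (from leaves to root) traversal of the tree.
Determine the possible states $R_i$ of internal node $i$ with children $j$ and $k$: $R_i = R_j \cap R_k$ if $R_j \cap R_k \neq \emptyset$, and $R_i = R_j \cup R_k$ otherwise.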
10 Fitch’s Algorithm – Step 2
[Figure: the same example tree shown over successive steps as ancestral states are picked from the root down to the leaves]
11 Fitch’s Algorithm – Step 2
Do a pre-order (from root to leaves) traversal of the tree.
Select the state $r_j$ of internal node $j$ with parent $i$: $r_j = r_i$ if $r_i \in R_j$; otherwise $r_j$ is an arbitrary state in $R_j$.
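The two passes of Fitch's algorithm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's own code: the tree is assumed to be a nested pair of leaf names, the names `s1`..`s4`, `fitch_up`, `fitch_down` and `parsimony_score` are hypothetical, and ties in the second pass are broken by taking the alphabetically smallest candidate state. The sequences are the four from the Maximum Parsimony example (AAG, AAA, GGA, AGA).

```python
def fitch_up(node, site, seqs):
    """Post-order pass for one site.

    `node` is a leaf name (str) or a pair (left, right).
    Returns (candidate_states, n_changes, annotated), where `annotated`
    is (candidate_states, left_annotated, right_annotated) and leaves
    carry None for both children.
    """
    if isinstance(node, str):
        states = frozenset([seqs[node][site]])
        return states, 0, (states, None, None)
    ls, lc, la = fitch_up(node[0], site, seqs)
    rs, rc, ra = fitch_up(node[1], site, seqs)
    states = ls & rs
    changes = lc + rc
    if not states:  # empty intersection: take the union and count one change
        states = ls | rs
        changes += 1
    return states, changes, (states, la, ra)


def fitch_down(annotated, parent_state=None):
    """Pre-order pass: keep the parent's state when allowed, else pick one."""
    states, left, right = annotated
    state = parent_state if parent_state in states else min(states)
    if left is None:  # leaf
        return state
    return (state, fitch_down(left, state), fitch_down(right, state))


def parsimony_score(tree, seqs):
    """Sum the per-site change counts; sites are scored independently."""
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch_up(tree, s, seqs)[1] for s in range(n_sites))


seqs = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}
tree_a = (("s1", "s2"), ("s3", "s4"))
tree_b = (("s1", "s3"), ("s2", "s4"))
print(parsimony_score(tree_a, seqs), parsimony_score(tree_b, seqs))  # 3 4
```

Under this sketch, grouping AAG with AAA needs 3 substitutions and grouping AAG with GGA needs 4, so the first grouping is the more parsimonious of the two.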
12 Weighted Version of Fitch’s Algorithm
Instead of assuming all state changes are equally likely, use different costs c(a, b) for different changes.
The first step of the algorithm is to propagate costs up through the tree.
13 Weighted Version of Fitch’s Algorithm
We want to determine the minimal cost $S(i, a)$ of assigning character $a$ to node $i$.
For leaves: $S(i, a) = 0$ if leaf $i$ is labeled by $a$, and $S(i, a) = \infty$ otherwise.
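14 Weighted Version of Fitch’s Algorithm
We want to determine the minimal cost $S(i, a)$ of assigning character $a$ to node $i$.
For an internal node $i$ with children $j$ and $k$: $S(i, a) = \min_x \big(S(j, x) + c(a, x)\big) + \min_y \big(S(k, y) + c(a, y)\big)$.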
15 Weighted Version of Fitch’s Algorithm – Step 2
Do a pre-order (from root to leaves) traversal of the tree.
Select the minimal-cost character for the root.
For each internal node j, select the character that produced the minimal cost at its parent i.
16 Weighted Parsimony Scores
Weighted parsimony score: each change from a to b is weighted by a cost c(a, b). The weighted parsimony score reduces to the parsimony score when c(a, a) = 0 and c(a, b) = 1 for all b ≠ a.
17 Evaluating Weighted Parsimony Scores
Each position is independent and is scored by itself. Use dynamic programming on a given tree.
- $S(i, a)$ is the minimum score of the subtree rooted at $i$ when $i$ has character $a$.
- If $i$ is a node with children $j$ and $k$, then $S(i, a) = \min_x \big(S(j, x) + c(a, x)\big) + \min_y \big(S(k, y) + c(a, y)\big)$.
[Figure: node i with score S(i, a) above its children j and k with scores S(j, x) and S(k, y)]
18 Evaluating Parsimony Scores
Dynamic programming on a given tree.
Initialization: for each leaf $i$, set $S(i, a) = 0$ if $i$ is labeled by $a$; otherwise $S(i, a) = \infty$.
Iteration: if $i$ is a node with children $j$ and $k$, then $S(i, a) = \min_x \big(S(j, x) + c(a, x)\big) + \min_y \big(S(k, y) + c(a, y)\big)$.
Termination: the cost of the tree is $\min_x S(r, x)$, where $r$ is the root.
Comment: to reconstruct an optimal assignment, keep for each node $i$ and each character $a$ the two characters $x$, $y$ that attain the minimum when $i$ has character $a$.
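The initialization, iteration and termination steps above can be sketched as follows. This is an illustrative sketch, assuming a tree given as nested pairs of leaf names; the function names, the leaf names and the transition/transversion cost values are hypothetical and chosen only for the example.

```python
import math


def sankoff(node, leaf_state, alphabet, cost):
    """Return {a: S(node, a)} for the subtree rooted at `node`.

    `node` is a leaf name (str) or a pair (left, right).
    """
    if isinstance(node, str):
        # Initialization: 0 for the observed character, infinity otherwise.
        return {a: 0 if leaf_state[node] == a else math.inf for a in alphabet}
    s_j = sankoff(node[0], leaf_state, alphabet, cost)
    s_k = sankoff(node[1], leaf_state, alphabet, cost)
    # Iteration: best child character for each child, independently.
    return {
        a: min(s_j[x] + cost(a, x) for x in alphabet)
        + min(s_k[y] + cost(a, y) for y in alphabet)
        for a in alphabet
    }


def unit_cost(a, b):
    """c(a, a) = 0 and c(a, b) = 1: plain (unweighted) parsimony."""
    return 0 if a == b else 1


PURINES = {"A", "G"}


def ts_tv_cost(a, b):
    """Assumed example costs: transition = 1, transversion = 2."""
    if a == b:
        return 0
    return 1 if (a in PURINES) == (b in PURINES) else 2


# Five species, three with C and two with T at one position.
leaf_state = {"v": "C", "w": "C", "x": "C", "y": "T", "z": "T"}
tree = ((("v", "w"), "x"), ("y", "z"))
table = sankoff(tree, leaf_state, "ACGT", unit_cost)
print(min(table.values()))  # Termination: minimum over root characters -> 1
```

With `unit_cost` the result agrees with the unweighted parsimony score; swapping in `ts_tv_cost` changes only the cost function, not the recursion.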
19 Cost of Evaluating Parsimony for binary trees
If there are n nodes, m characters, and k possible values for each character, then the complexity is $O(nmk^2)$.
Of course, we still need to search over ALL possible trees and find the best one. One usually resorts to heuristic search techniques.
20 Exploring the Space of Trees
We’ve considered how to find the minimum number of changes for a given tree topology.
We need some search procedure for exploring the space of tree topologies.
Given n sequences there are $(2n-3)!! = 1 \cdot 3 \cdot 5 \cdots (2n-3)$ possible rooted trees.
21 Counting Trees
[Figure: n = 3, one unrooted tree; n = 4, three unrooted trees]
A rooted tree with n leaves has (2n-1) nodes and (2n-2) edges, discounting the edge to the root; hence an unrooted tree has (2n-3) edges. A new leaf can be attached to any existing edge, and each additional leaf adds two edges. Therefore there are $1 \cdot 3 \cdot 5 \cdots (2n-5)$ unrooted trees with n leaves. Each such tree has (2n-3) edges, any of which can be chosen as the position of the root. Hence there are $1 \cdot 3 \cdot 5 \cdots (2n-5) \cdot (2n-3)$ rooted trees with n leaves.
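The counting argument can be checked numerically with a short sketch; the function names are hypothetical and the code only evaluates the two products described above.

```python
def unrooted_trees(n):
    """Number of unrooted binary trees with n >= 3 labeled leaves: 1*3*5*...*(2n-5)."""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


def rooted_trees(n):
    """Number of rooted binary trees with n >= 2 labeled leaves: 1*3*5*...*(2n-3)."""
    count = 1
    for k in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n-3
        count *= k
    return count


for n in (3, 4, 5, 10):
    print(n, unrooted_trees(n), rooted_trees(n))
```

Each rooted count equals the unrooted count times the (2n-3) possible root positions, which is the last step of the argument.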
22 Exploring the Space of Trees
[Table: taxa (n) vs. # of rooted trees]
23 Maximum Parsimony
Species 1 - A G G G T A A C T G
Species 2 - A C G A T T A T T A
Species 3 - A T A A T T G T C T
Species 4 - A A T G T T G T C G
How many possible unrooted trees?
24 Maximum Parsimony
How many possible unrooted trees?
Species 1 - A G G G T A A C T G
Species 2 - A C G A T T A T T A
Species 3 - A T A A T T G T C T
Species 4 - A A T G T T G T C G
25 Maximum Parsimony How many substitutions? MP
26 Maximum Parsimony
1 - A G G G T A A C T G
2 - A C G A T T A T T A
3 - A T A A T T G T C T
4 - A A T G T T G T C G
[Figure: per-site substitution counts shown so far: 0 0 0]
27 Maximum Parsimony
1 - A G G G T A A C T G
2 - A C G A T T A T T A
3 - A T A A T T G T C T
4 - A A T G T T G T C G
[Figure: per-site substitution counts shown so far: 0 3]
28 Maximum Parsimony
Site 2: 1 - G, 2 - C, 3 - T, 4 - A
[Figure: the three unrooted trees with reconstructed internal states; each requires 3 substitutions]
29 Maximum Parsimony
1 - A G G G T A A C T G
2 - A C G A T T A T T A
3 - A T A A T T G T C T
4 - A A T G T T G T C G
[Figure: per-site substitution counts shown so far: 0 3 2]
30 Maximum Parsimony
1 - A G G G T A A C T G
2 - A C G A T T A T T A
3 - A T A A T T G T C T
4 - A A T G T T G T C G
31 Maximum Parsimony
Site 4: 1 - G, 2 - A, 3 - A, 4 - G
[Figure: the three unrooted trees with reconstructed internal states; they require 2, 2 and 1 substitutions]
32 Maximum Parsimony
33 Maximum Parsimony
1 - A G G G T A A C T G
2 - A C G A T T A T T A
3 - A T A A T T G T C T
4 - A A T G T T G T C G
34 Finding most parsimonious trees - exact solutions
- Exact solutions can only be used for small numbers of taxa.
- Exhaustive search examines all possible trees.
- Typically used for problems with fewer than 10 taxa.
35 Finding most parsimonious trees - exhaustive search
(1) Starting tree: any 3 taxa (A, B, C).
(2a), (2b), (2c) Add the fourth taxon (D) in each of the three possible positions: three trees.
Add the fifth taxon (E) in each of the five possible positions on each of the three trees: 15 trees, and so on.
[Figure: the starting tree and the trees obtained by adding D and then E]
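The stepwise enumeration can be sketched as below. The representation is an assumption made for the example: the first taxon is held fixed as the point where the tree is rooted, so every unrooted tree appears exactly once as a nested pair, and a new taxon is attached to each edge in turn. The names `insertions` and `all_unrooted_trees` are hypothetical.

```python
def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one edge of `tree`.

    `tree` is a leaf name (str) or a pair (left, right); the edge above
    the root of `tree` counts as an edge too.
    """
    yield (tree, leaf)  # attach on the edge above this subtree
    if isinstance(tree, tuple):
        left, right = tree
        for t in insertions(left, leaf):
            yield (t, right)
        for t in insertions(right, leaf):
            yield (left, t)


def all_unrooted_trees(taxa):
    """Enumerate all unrooted binary trees, with taxa[0] attached at the root."""
    subtrees = [(taxa[1], taxa[2])]  # the single tree on the first three taxa
    for leaf in taxa[3:]:
        subtrees = [t for sub in subtrees for t in insertions(sub, leaf)]
    return [(taxa[0], sub) for sub in subtrees]


for n in range(3, 7):
    print(n, len(all_unrooted_trees("ABCDEF"[:n])))  # 1, 3, 15, 105 trees
```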
36 Finding most parsimonious trees - exact solutions
- Branch and bound saves time by discarding, during tree construction, families of trees that cannot be smaller than the smallest tree found so far. (Here “smaller” means more parsimonious.)
- Can be enhanced by specifying an initial upper bound for tree length.
- Typically used only for problems with fewer than 20 taxa.
37 Finding most parsimonious trees: branch and bound
[Figure: branch-and-bound search tree. The three-taxon tree (A, B, C) branches into the four-taxon trees B1, B2 and B3 (taxon D added); each of these branches into five-taxon trees C1.1-C1.5, C2.1-C2.5 and C3.1-C3.5 (taxon E added)]
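A minimal branch-and-bound sketch, under assumptions: partial trees are scored with unweighted Fitch counting, a partial tree is discarded as soon as its length reaches the best complete length found so far (adding taxa cannot shorten a tree), and only one optimal tree is kept. The names `sp1`..`sp4` and all function names are hypothetical; the sequences are those of the four-species Maximum Parsimony example.

```python
import math


def site_changes(node, site, seqs):
    """Fitch post-order pass for one site: (candidate states, #changes)."""
    if isinstance(node, str):
        return {seqs[node][site]}, 0
    ls, lc = site_changes(node[0], site, seqs)
    rs, rc = site_changes(node[1], site, seqs)
    if ls & rs:
        return ls & rs, lc + rc
    return ls | rs, lc + rc + 1


def tree_length(tree, seqs):
    n_sites = len(seqs[next(iter(seqs))])
    return sum(site_changes(tree, s, seqs)[1] for s in range(n_sites))


def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one edge of `tree`."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for t in insertions(left, leaf):
            yield (t, right)
        for t in insertions(right, leaf):
            yield (left, t)


def branch_and_bound(taxa, seqs):
    """Return (length, tree) of one most parsimonious tree; taxa[0] sits at the root."""
    best = {"length": math.inf, "tree": None}

    def extend(sub, remaining):
        tree = (taxa[0], sub)
        length = tree_length(tree, seqs)
        if length >= best["length"]:
            return  # bound: no completion of this partial tree can be shorter
        if not remaining:
            best["length"], best["tree"] = length, tree
            return
        for new_sub in insertions(sub, remaining[0]):
            extend(new_sub, remaining[1:])

    extend((taxa[1], taxa[2]), list(taxa[3:]))
    return best["length"], best["tree"]


seqs = {
    "sp1": "AGGGTAACTG",
    "sp2": "ACGATTATTA",
    "sp3": "ATAATTGTCT",
    "sp4": "AATGTTGTCG",
}
print(branch_and_bound(["sp1", "sp2", "sp3", "sp4"], seqs))
```

On this alignment the sketch reports a length of 13 for the tree that groups species 1 with 2 and species 3 with 4; the other two groupings score 14 and 15.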
38 Finding most parsimonious trees - heuristics
- The number of possible trees grows faster than exponentially with the number of taxa, making exhaustive searches impractical for many data sets (finding a most parsimonious tree is an NP-hard problem).
- Heuristic methods are used to search tree space for most parsimonious trees.
- The trees found are not guaranteed to be the most parsimonious; they are best guesses.
39 Finding most parsimonious trees - heuristics
- Stepwise addition:
  - Asis: taxa are added in the order in which they appear in the data matrix.
  - Closest: starts with the shortest 3-taxon tree and adds taxa in the order that produces the least increase in tree length.
  - Simple: the first taxon in the matrix is taken as a reference, and taxa are added to it in order of decreasing similarity to the reference.
  - Random: taxa are added in a random sequence; many different sequences can be used.
- Recommendation: random, with as many addition sequences as practical.
40 Finding most parsimonious trees - heuristics
Branch swapping:
- Nearest neighbor interchange (NNI)
- Subtree pruning and regrafting (SPR)
- Tree bisection and reconnection (TBR)
41 Finding most parsimonious trees - heuristics 1
Nearest neighbor interchange (NNI)
[Figure: a tree on taxa A-G and two trees obtained from it by nearest neighbor interchange]
42 Finding most parsimonious trees - heuristics 2
Subtree pruning and regrafting (SPR)
[Figure: a tree on taxa A-G, a subtree pruned from it, and the tree after regrafting the subtree elsewhere]
43 Finding most parsimonious trees - heuristics 3
Tree bisection and reconnection (TBR)
[Figure: a tree on taxa A-G, the two subtrees obtained by bisecting it, and the tree after reconnecting them]
44 Finding most parsimonious trees - heuristics - summary
- Branch swapping:
  - Nearest neighbor interchange (NNI)
  - Subtree pruning and regrafting (SPR)
  - Tree bisection and reconnection (TBR)
- The nature of heuristic searches means we cannot know which method will find the most parsimonious trees, or all such trees.
- However, TBR is the most extensive swapping routine, and its use with multiple random addition sequences should work well.
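45 Tree space may be populated by local minima and islands of most parsimonious trees
[Figure: tree length across tree space, with a global minimum and several local minima. Random addition sequence replicates start at different points; branch swapping from a start reaches either the global minimum (success) or a local minimum (failure)]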
46 Multiple most parsimonious trees
- Many parsimony analyses yield multiple equally optimal trees.
- Multiple trees are due to either:
  - alternative equally parsimonious optimizations of homoplastic characters,
  - missing data,
  - or both.
- We can further select among these trees with additional criteria, but
- most commonly, relationships common to all the optimal trees are summarized with consensus trees.
47 Consensus methods - 1
- A consensus tree is a summary of the agreement among a set of fundamental trees.
- There are many different consensus methods, which differ in:
  1. the kind of agreement
  2. the level of agreement
- Consensus methods can be used with any type of tree, not just parsimony trees.
48 Strict consensus methods - 1
- Strict consensus methods require agreement across all the fundamental trees.
- They show only those relationships that are unambiguously supported by the parsimonious interpretation of the data.
- The commonest method (strict component consensus) focuses on clades.
- This method produces a consensus tree that includes all and only those clades found in all the fundamental trees.
- Other relationships (those in which the fundamental trees disagree) are shown as unresolved polytomies.
49 Strict consensus methods - 2
[Figure: two fundamental trees on taxa A-G and their strict component consensus tree]
50 Majority-rule consensus methods
- Majority-rule consensus methods require agreement across a majority of the fundamental trees.
- May include relationships that are not supported by the most parsimonious interpretation of the data.
- The commonest method focuses on clades.
- This method produces a consensus tree that includes all and only those clades found in a majority (>50%) of the fundamental trees.
- Other relationships are shown as unresolved polytomies.
- Of particular use in bootstrapping.
51 Majority rule consensus
[Figure: three fundamental trees on taxa A-G and their majority-rule component consensus tree. Numbers indicate the frequency of clades in the fundamental trees]
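The clade-based definitions of strict and majority-rule consensus can be illustrated with a short sketch. It assumes rooted trees written as nested tuples of leaf names, and it returns only the set of clades that the consensus tree would contain rather than drawing the tree; the three example trees and all function names are made up for the illustration.

```python
from collections import Counter


def clades(tree):
    """Return the set of clades (frozensets of leaf names) of a rooted tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found


def consensus_clades(trees, strict=False):
    """Clades found in all trees (strict) or in more than 50% of them."""
    counts = Counter(c for t in trees for c in clades(t))
    if strict:
        return {c for c, k in counts.items() if k == len(trees)}
    return {c for c, k in counts.items() if 2 * k > len(trees)}


trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "D"), "C"),
    ((("A", "C"), "B"), "D"),
]
print(sorted("".join(sorted(c)) for c in consensus_clades(trees)))
print(sorted("".join(sorted(c)) for c in consensus_clades(trees, strict=True)))
```

In this toy example the majority-rule consensus keeps the clades AB and ABC (each found in two of three trees) plus the full taxon set, while the strict consensus keeps only the full taxon set, i.e. an unresolved polytomy.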
52 Reduced consensus methods - 1
- Focus on any cladistic relationships (statements that some taxa are more closely related to each other than to some other taxa).
- Reduced consensus methods occur in strict and majority-rule varieties.
- Other relationships are shown as unresolved polytomies.
- May be more sensitive than methods focusing only on clades.
53 Reduced consensus methods - 2
[Figure: two fundamental trees on taxa A-G. Their strict component consensus is completely unresolved; the strict reduced cladistic consensus tree excludes taxon G and relates the remaining taxa A-F]
54 Consensus methods - 2
[Figure: three fundamental trees on the taxa Ochromonas, Symbiodinium, Prorocentrum, Loxodes, Tetrahymena, Spirostomum, Euplotes, Tracheloraphis and Gruberia, together with their majority-rule, strict (component) and strict reduced cladistic consensus trees; Euplotes is excluded from the strict reduced cladistic consensus tree]
55 Consensus methods - 3
- Use strict methods to identify those relationships unambiguously supported by the parsimonious interpretation of the data.
- Use reduced methods where consensus trees are poorly resolved.
- Use majority-rule methods in bootstrapping.
- Avoid other methods, which have ambiguous interpretations.
56 Parsimony - advantages
- A simple method with an easily understood operation.
- Does not seem to depend on an explicit model of evolution.
- Gives both trees and associated hypotheses of character evolution.
- Should give reliable results if the data are well structured and homoplasy is either rare or randomly distributed on the tree.
57 Parsimony - disadvantages
- May give misleading results if homoplasy is common or concentrated in particular parts of the tree, e.g.:
  - thermophilic convergence
  - base composition biases
  - long branch attraction
- Underestimates branch lengths.
- The model of evolution is implicit, so the behaviour of the method is not well understood.
- Parsimony is often justified on purely philosophical grounds (we must prefer the simplest hypotheses), particularly by morphologists.
- For most molecular systematists this is uncompelling.
58 Parsimony can be inconsistent
- Felsenstein (1978) developed a simple model phylogeny including four taxa and a mixture of short and long branches.
- Under this model parsimony will give the wrong tree.
- With more data, the certainty that parsimony will give the wrong tree increases, so parsimony is statistically inconsistent.
- Advocates of parsimony initially responded by claiming that Felsenstein’s result showed only that his model was unrealistic.
- It is now recognized that long-branch attraction (the Felsenstein Zone) is one of the most serious problems in phylogenetic inference.
[Figure: model tree on taxa A, B, C, D with rates or branch lengths p and q, where p >> q, next to the wrong tree returned by parsimony. Long branches are attracted, but the similarity is homoplastic]
59 2. Perfect Phylogeny
Data on species is given by a Character State Matrix. Cell (p, i) has value j iff character i of object (species) p has state j.
Goal: construct an evolutionary tree for the species.
Object  c1 c2 c3 c4 c5
A        1  1  2  0  0
B        2  0  1  2  1
C        3  2  3  3  1
D        0  3  4  1  0
E        1  1  0  0  1
60 Motivation: Evolution Tree Internal nodes correspond to speciation events, where some character (attribute) is acquired. Assumptions: 1. No reversals (characters are not lost) 2. No convergences (a character is created only once)
62 Perfect Phylogeny for a 0-1 Matrix
A 0-1 matrix: each character is either 0 (absent) or 1 (present).
A perfect phylogeny for the matrix is a tree T with no convergences and no reversals:
- Each of the n objects labels exactly one leaf of T.
- Each of the m characters labels exactly one edge of T.
- Object p has exactly the characters labeling the path from p to the root.
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 0 1
D 0 0 1 1 0
E 0 1 0 0 0
[Figure: the corresponding tree with leaves A, E, D, C, B]
63 The (Binary) Perfect Phylogeny Problem
Problem: Given a 0-1 matrix M, determine if it has a perfect phylogeny, and construct one if it does.
(Note: edges are labeled by characters; an edge labeled by i represents a change of character i’s state from 0 to 1.)
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 0 1
D 0 0 1 1 0
E 0 1 0 0 0
[Figure: the corresponding tree with leaves A, E, D, C, B]
64 Solution to Perfect Phylogeny Problem
Definition: Given a 0-1 matrix M, $O_k = \{j : M_{jk} = 1\}$; i.e., $O_k$ is the set of objects that have character k.
Theorem: M has a perfect phylogenetic tree iff the sets $\{O_i\}$ are laminar, i.e. for all i, j, either $O_i$ and $O_j$ are disjoint, or one includes the other.
Laminar:        Not laminar:
A 1 1 0 0 0     A 1 1 0 0 0
B 0 0 1 0 0     B 0 0 1 0 1
C 1 1 0 0 1     C 1 1 0 0 1
D 0 0 1 1 0     D 0 0 1 1 0
E 0 1 0 0 0     E 0 1 0 0 1
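The laminarity condition of the theorem can be checked directly by comparing every pair of character sets. The sketch below is the straightforward quadratic check, not the efficient procedure; the matrices are assumed to be stored as a dictionary from object name to a string of 0/1 characters, and the function names are hypothetical. Row E of the laminar matrix is taken from the 0-1 matrix of the earlier slides.

```python
from itertools import combinations


def character_sets(matrix):
    """Return [O_1, ..., O_m]: for each character, the objects that have it."""
    n_chars = len(next(iter(matrix.values())))
    return [
        {obj for obj, row in matrix.items() if row[k] == "1"}
        for k in range(n_chars)
    ]


def is_laminar(sets):
    """True iff every two sets are disjoint or one includes the other."""
    return all(
        a.isdisjoint(b) or a <= b or b <= a
        for a, b in combinations(sets, 2)
    )


laminar = {"A": "11000", "B": "00100", "C": "11001", "D": "00110", "E": "01000"}
not_laminar = {"A": "11000", "B": "00101", "C": "11001", "D": "00110", "E": "01001"}
print(is_laminar(character_sets(laminar)))      # True
print(is_laminar(character_sets(not_laminar)))  # False
```

In the second matrix, characters 3 and 5 share object B while neither set contains the other, which is what breaks laminarity.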
65 Proof (only if): Assume M has a perfect phylogeny, and let i, j be given. Consider the edges labeled i and j.
Case 1: There is a root-to-leaf path containing both edges. Then one of $O_i$, $O_j$ includes the other (characters 2 and 1 below).
Case 2: Not case 1. Then $O_i$ and $O_j$ are disjoint (characters 2 and 3 below).
[Figure: the perfect phylogeny with leaves A, E, D, C, B]
66 Proof (cont., if): Assume that for all i, j, either $O_i$ and $O_j$ are disjoint, or one includes the other. We prove by induction on the number of characters that M has a perfect phylogeny.
Basis: one character. Then there are at most two distinct objects, one with and one without this character.
[Figure: the matrix with rows A = 1, B = 0 and its tree, in which the edge to A is labeled 1]
67 Proof (cont.): Induction step: Assume correctness for n-1 characters, and consider a matrix with n characters (non-zero columns). WLOG assume that $O_1$ is not contained in $O_j$ for j > 1. Let $S_1$ be the set of objects that have character 1, and $S_2$ the remaining objects. Then each character belongs to objects in $S_1$ or in $S_2$, but not both. By induction there are trees $T_1$ and $T_2$ for $S_1$ and $S_2$. Combining them as below gives the desired tree.
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 0 1
D 0 0 1 1 0
E 1 0 0 0 0
[Figure: $T_1$ attached below an edge labeled 1, and $T_2$ attached at the root]
68 Efficient Implementation
1. Sort the columns by decreasing value when considered as binary numbers. (Time complexity: O(mn), using radix sort.)
Claim: If the binary value of column i is larger than that of column j, then $O_i$ is not a proper subset of $O_j$.
Proof: A positive difference between column i and column j, read as binary numbers, means that the 1’s in $O_i$ are not all covered by the 1’s in $O_j$.
Original:       Sorted:
A 1 1 0 0 0     A 1 1 0 0 0
B 0 0 1 0 0     B 0 0 1 0 0
C 1 1 0 0 1     C 1 1 0 1 0
D 0 0 1 1 0     D 0 0 1 0 1
E 0 1 0 0 0     E 1 0 0 0 0
69 Efficient Implementation (2)
2. Make a backwards linked list of the 1’s in each row (the leftmost 1 in each row points at itself). Time complexity: O(mn).
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
Claim: If the columns are sorted, then the set of columns is laminar iff for each column i, all the links leaving column i point at the same column. This can be checked in O(mn) time.
70 Examples
Laminar:        Not laminar:
A 1 1 0 0 0     A 1 1 0 0 0
B 0 0 1 0 0     B 0 0 1 0 0
C 1 1 0 1 0     C 1 1 0 1 0
D 0 0 1 0 1     D 0 0 1 0 1
E 1 0 0 0 0     E 1 0 1 1 0
71 Efficient Implementation (3)
3. When the matrix is laminar, the tree edges corresponding to characters are defined by the backwards links in the matrix. The remaining edges and leaves are determined by the characters of each object. Needs O(mn) time.
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
[Figure: the resulting tree with leaves A, E, D, C, B]
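The three implementation steps can be put together in a compact sketch. It departs from the slides in two labeled ways: the columns are ordered with Python's comparison sort rather than a radix sort, so the sketch does not achieve the O(mn) bound, and the leftmost 1 of a row links to `None` (standing for the root) instead of pointing at itself. The function name and the returned dictionaries are hypothetical; characters are reported by their 0-based column index in the original matrix.

```python
def perfect_phylogeny(matrix):
    """Return (parent, attach) for a 0-1 matrix, or None if it is not laminar.

    parent[k] is the character whose edge lies directly above the edge of
    character k (None means the edge hangs from the root); attach[obj] is
    the lowest character on the path from the root to the leaf of `obj`.
    """
    objects = list(matrix)
    n_chars = len(matrix[objects[0]])

    def column(k):
        return "".join(matrix[obj][k] for obj in objects)

    # Step 1: equal-length 0/1 strings sort like binary numbers.
    order = sorted(range(n_chars), key=column, reverse=True)

    parent, attach = {}, {}
    for obj in objects:
        prev = None  # Step 2: backwards link of the next 1 in this row
        for k in order:
            if matrix[obj][k] == "1":
                # Every link leaving column k must point at the same column.
                if parent.setdefault(k, prev) != prev:
                    return None
                prev = k
        attach[obj] = prev  # Step 3: the leaf hangs below its last character
    return parent, attach


laminar = {"A": "11000", "B": "00100", "C": "11001", "D": "00110", "E": "01000"}
not_laminar = {"A": "11000", "B": "00101", "C": "11001", "D": "00110", "E": "01001"}
print(perfect_phylogeny(laminar))
print(perfect_phylogeny(not_laminar))  # None
```

For the laminar matrix, the links place character 2 (index 1) at the root with character 1 below it and character 5 below that, and character 3 at the root with character 4 below it, which matches the sets $O_k$ of the matrix.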