Phylogenetic trees as a visualization tool for evolutionary classification
Trees [Figure: drawings of a tree over Chimp, Human, and Gorilla with the leaves in different orders, joined by "=" signs to mark them as equal.]
Same thing… [Figure: two drawings of a tree over s1, s2, s3, s4, and s5, joined by "=".]
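The equality of such drawings can be checked mechanically. The following is a minimal sketch, assuming rooted trees written as nested Python tuples with leaf names as strings; the representation and the helper name `canonical` are illustrative and do not come from the slides.

```python
def canonical(tree):
    """Return an order-independent form of a rooted tree.

    A leaf is a string; an internal node is a tuple of child subtrees.
    Children are sorted by their printed form, so swapping the branches
    around any node gives the same result.
    """
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


# Two drawings of the same rooted tree: only the left-to-right order differs.
t1 = (("Chimp", "Human"), "Gorilla")
t2 = ("Gorilla", ("Human", "Chimp"))
# A genuinely different grouping (hypothetical, for contrast).
t3 = (("Chimp", "Gorilla"), "Human")

same = canonical(t1) == canonical(t2)
different = canonical(t1) != canonical(t3)
```

Sorting the children at every node is one simple way to ignore drawing order; it says nothing about whether the grouping itself is biologically right.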
Bifurcating / Multifurcating: a bifurcation = dichotomy; a multifurcation = polytomy. There are two types of polytomies: soft (lack of information to resolve the tree) and hard (multiple divergences within a short evolutionary time). [Figure: a multifurcating tree and a bifurcating tree over s1 to s5.]
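As an illustration, a tree can be tested for polytomies by checking the number of children at each internal node. This is a sketch under the assumption that trees are nested tuples with string leaves; the two example topologies over s1 to s5 are hypothetical and are not the ones drawn on the slide.

```python
def is_bifurcating(tree):
    """True if every internal node has exactly two children."""
    if isinstance(tree, str):  # a leaf has no children to check
        return True
    return len(tree) == 2 and all(is_bifurcating(child) for child in tree)


# Hypothetical fully resolved tree: only dichotomies.
resolved = ((("s4", "s5"), "s1"), ("s3", "s2"))
# Hypothetical tree whose root has four children: a polytomy.
polytomy = (("s4", "s5"), "s1", "s3", "s2")

resolved_ok = is_bifurcating(resolved)
polytomy_ok = is_bifurcating(polytomy)
```

The check cannot tell a soft polytomy from a hard one; that distinction depends on the data, not on the shape of the tree.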
A “comb” [Figure: a comb-shaped tree over s1 to s5.]
Terminology: a branch = an edge; an external node = a leaf; the root; internal nodes. [Figure: a tree over Human, Chimp, Gorilla, and Chicken with these parts labeled.]
Ingroup / Outgroup [Figure: the tree over Human, Chimp, Gorilla, and Chicken with the INGROUP and the OUTGROUP marked.]
Subtrees [Figure: a tree over Human, Chimp, Gorilla, Chicken, and Duck with one subtree marked.]
Monophyletic groups: Gorilla+Human+Chimp are monophyletic. A clade is a monophyletic group. [Figure: tree over Human, Chimp, Gorilla, and Chicken.]
Paraphyletic = non-monophyletic groups: Zebrafish+Whale are paraphyletic. [Figure: tree over Whale, Chimp, Zebrafish, and Drosophila.]
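The two examples above can be checked in code: a group is monophyletic exactly when it is the full leaf set of some subtree. The sketch below assumes nested-tuple trees, and both topologies are assumptions chosen to match the statements on the slides, since the drawings themselves are not available.

```python
def leaf_set(tree):
    """All leaf names below (and including) this node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """The leaf sets of every subtree, i.e. every monophyletic group."""
    if isinstance(tree, str):
        return {frozenset([tree])}
    found = {leaf_set(tree)}
    for child in tree:
        found |= clades(child)
    return found


def is_monophyletic(tree, group):
    return frozenset(group) in clades(tree)


# Assumed topologies, consistent with the text of the slides.
primates = ((("Human", "Chimp"), "Gorilla"), "Chicken")
animals = ((("Whale", "Chimp"), "Zebrafish"), "Drosophila")

apes_ok = is_monophyletic(primates, ["Gorilla", "Human", "Chimp"])
fish_whale_ok = is_monophyletic(animals, ["Zebrafish", "Whale"])
```

Under these assumed trees, Zebrafish+Whale fails because the smallest subtree containing both also contains Chimp.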
3. Tree building: the maximum parsimony principle.
Genes: 0 = absence, 1 = presence. [Table: one row per species (five species), one column per gene (g1 to g6); the 0/1 entries did not survive extraction.]
Evaluate this tree… [Figure: a candidate tree over s1, s2, s3, s4, and s5.]
Gene number 1 [Figure: the candidate tree annotated with gene 1, shown with two options for the ancestral states.] Number of changes for gene 1 (character 1) = 1.
Gene number 2 [Figure: the candidate tree annotated with gene 2, shown with several options for the ancestral states.] Number of changes for gene 2 (character 2) = 2.
Gene number 3 [Figure: the candidate tree annotated with gene 3.] Number of changes for gene 3 (character 3) = 1.
Gene number 4 [Figure: the candidate tree annotated with gene 4.] Number of changes for gene 4 (character 4) = 2.
Gene number 5 is the same as gene number 4. Number of changes for gene 5 (character 5) = 2.
Gene number 6, 1 option only. Number of changes for gene 6 (character 6) = 1. [Figure: the candidate tree annotated with gene 6.]
Sum of changes
Number of changes for gene 1 (character 1) = 1
Number of changes for gene 2 (character 2) = 2
Number of changes for gene 3 (character 3) = 1
Number of changes for gene 4 (character 4) = 2
Number of changes for gene 5 (character 5) = 2
Number of changes for gene 6 (character 6) = 1
Sum of changes for this tree topology = 9. Can we do better?
The MP (most parsimonious) tree: sum of changes for this tree topology = 8. [Figure: the most parsimonious tree over s1 to s5.]
How to efficiently compute the MP score of a tree
The Fitch algorithm (1971): postorder tree scan. At each internal node, take the intersection of the sets of its children; if that intersection is empty, apply a union operator instead. [Figure: tree over Human, Chimp, Gorilla, Chicken, and Duck with leaf characters A, G, C, C, A and internal sets {A,G}, {A,C,G}, {A,C}.]
Number of changes: total number of changes = number of union operators. [Figure: the same tree with internal sets {A,G}, {A,C,G}, {A,C}.]
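The postorder scan and the union count can be sketched as follows. This is a minimal illustration, not the lecturer's code: the tree is a nested pair structure, and the topology ((Human, Chimp), Gorilla), (Chicken, Duck) is an assumption chosen to be consistent with the internal sets shown on the slide.

```python
def fitch(tree, states):
    """Return (state_set, changes) for the subtree rooted at `tree`.

    `tree` is a leaf name (str) or a pair (left, right) of subtrees;
    `states` maps each leaf name to its observed character.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:  # non-empty intersection: no change is charged here
        return common, left_changes + right_changes
    # empty intersection: take the union and count one change
    return left_set | right_set, left_changes + right_changes + 1


# Assumed topology; leaf characters as on the slide.
tree = ((("Human", "Chimp"), "Gorilla"), ("Chicken", "Duck"))
states = {"Human": "A", "Chimp": "G", "Gorilla": "C", "Chicken": "C", "Duck": "A"}

root_set, changes = fitch(tree, states)
```

With these inputs the unions happen at the nodes that produce {A,G}, {A,C,G}, and {A,C}, so `changes` is 3 and the root set is {A, C}.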
Patterns: CACAG requires the same number of changes as CACAT, and in general so do all positions with the pattern XYXYZ.
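One way to detect identical patterns is to relabel each column by order of first appearance, so that columns with the same pattern get the same key. The helper name `site_pattern` is hypothetical; this is a sketch of the idea rather than a prescribed method.

```python
def site_pattern(column):
    """Relabel characters by first appearance: 'CACAG' -> (0, 1, 0, 1, 2)."""
    labels = {}
    pattern = []
    for character in column:
        if character not in labels:
            labels[character] = len(labels)  # next unused label
        pattern.append(labels[character])
    return tuple(pattern)


p1 = site_pattern("CACAG")
p2 = site_pattern("CACAT")
p3 = site_pattern("CCAAG")  # a different pattern (XXYYZ)
```

Columns that share a key need the parsimony count computed only once, which saves work on long alignments.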
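Ex: find the minimum number of changes, and point to all identical patterns. [Figure: the tree over Human, Chimp, Chicken, Gorilla, and Duck with the leaf sequences GACA, GGGA, CAAG, GCGA, and GAAA, listed here in the order in which they were extracted.]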
Ambiguous characters: R = {A,G} = purine. [Figure: the tree over Human, Chimp, Gorilla, Chicken, and Duck with leaf characters A, G, C, C, R and internal sets {A,G}, {A,C,G}, {A,C,G}, {A,C,G}.]
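Ambiguity codes fit into the Fitch scan by starting each leaf with a set instead of a single character. The sketch below assumes the same nested-pair tree representation and an assumed topology; only a few IUPAC codes are listed, and the name `fitch_sets` is illustrative.

```python
# A few IUPAC nucleotide codes as sets (R = purine, Y = pyrimidine, N = any).
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "N": {"A", "C", "G", "T"},
}


def fitch_sets(tree, codes):
    """Fitch scan where each leaf starts from the set for its IUPAC code."""
    if isinstance(tree, str):
        return set(IUPAC[codes[tree]]), 0
    left, right = tree
    left_set, left_changes = fitch_sets(left, codes)
    right_set, right_changes = fitch_sets(right, codes)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


# Assumed topology; Duck carries the ambiguous character R.
tree = ((("Human", "Chimp"), "Gorilla"), ("Chicken", "Duck"))
codes = {"Human": "A", "Chimp": "G", "Gorilla": "C", "Chicken": "C", "Duck": "R"}

root_set, changes = fitch_sets(tree, codes)
```

Under these assumptions the internal sets are {A,G}, {A,C,G}, {A,C,G}, and {A,C,G} at the root, matching the sets listed on the slide.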
Subtrees: each node has an ID. [Figure: the tree over Human, Chimp, Gorilla, Chicken, and Duck with numbered nodes; the subtree of node 4 is marked.]
The Sankoff algorithm: a generalization that assumes a cost function $C_{ij}$ for changing from $i$ to $j$. If $C_{ij} = 1$ for every $i \neq j$ (and $C_{ii} = 0$), it just counts the number of changes. We now search for the tree with the minimum cost. Definition: $S_i(k)$ = minimum cost of the subtree of node $i$, given that the assignment of node $i$ is character $k$.
Easy to compute for the leaves. For example, if leaf 2 carries A, then $S_2(A) = 0$ (no cost, since A is there) and $S_2(C) = S_2(G) = S_2(T) = \infty$ (they just can't be there). [Figure: tree with leaf characters A, G, A, A, C.]
Leaf vectors, written in the order [A, C, G, T]: a leaf with A gets [0, ∞, ∞, ∞], a leaf with C gets [∞, 0, ∞, ∞], and a leaf with G gets [∞, ∞, 0, ∞]. [Figure: the tree with its leaf vectors.]
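The leaf initialization is short enough to write out directly. This sketch uses `math.inf` for the infinite entries and assumes the vector order A, C, G, T; the function name `leaf_vector` is illustrative.

```python
import math

ALPHABET = "ACGT"  # assumed order of the vector entries


def leaf_vector(observed):
    """Cost vector of a leaf: 0 for the observed character, infinity otherwise."""
    return [0 if k == observed else math.inf for k in ALPHABET]


vector_a = leaf_vector("A")
vector_c = leaf_vector("C")
vector_g = leaf_vector("G")
```

Using infinity rather than a large number keeps impossible assignments from ever winning a minimum further up the tree.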
Internal nodes: node 0 has two children, node 1 with vector $[S_1(A), S_1(C), S_1(G), S_1(T)]$ and node 2 with vector $[S_2(A), S_2(C), S_2(G), S_2(T)]$.

Costs $C_{ij}$:
       A  C  G  T
   A   0  3  1  2
   C   3  0  2  1
   G   1  2  0  3
   T   2  1  3  0

$S_0(A) = \min_X \left( C_{AX} + S_1(X) \right) + \min_Y \left( C_{AY} + S_2(Y) \right)$, and likewise for C, G, and T.

With $S_1 = [13, 17, 22, 14]$ and $S_2 = [15, 14, 21, 17]$:
$S_0(A) = \min\{13, 20, 23, 16\} + \min\{15, 17, 22, 19\} = 13 + 15 = 28$
$S_0(C) = \min\{16, 17, 24, 15\} + \min\{18, 14, 23, 18\} = 15 + 14 = 29$
$S_0(G) = \min\{14, 19, 22, 17\} + \min\{16, 16, 21, 20\} = 14 + 16 = 30$
$S_0(T) = \min\{15, 18, 25, 14\} + \min\{17, 15, 24, 17\} = 14 + 15 = 29$

The vector at node 0 is [28, 29, 30, 29]. The cost of the tree is the minimum of this vector, which is 28.
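The step from the two child vectors to the parent vector can be reproduced in a few lines. This is a sketch using the cost matrix and the child vectors from the worked example; the names `COST` and `parent_vector` are illustrative.

```python
ALPHABET = "ACGT"

# Cost matrix from the worked example: COST[i][j] is the cost of changing i to j.
COST = {
    "A": {"A": 0, "C": 3, "G": 1, "T": 2},
    "C": {"A": 3, "C": 0, "G": 2, "T": 1},
    "G": {"A": 1, "C": 2, "G": 0, "T": 3},
    "T": {"A": 2, "C": 1, "G": 3, "T": 0},
}


def parent_vector(child_vectors):
    """Sankoff step: for each character k at the parent, sum over the children
    the cheapest combination of a change cost and the child's subtree cost."""
    vector = []
    for k in ALPHABET:
        total = 0
        for child in child_vectors:
            total += min(COST[k][x] + child[j] for j, x in enumerate(ALPHABET))
        vector.append(total)
    return vector


s1 = [13, 17, 22, 14]
s2 = [15, 14, 21, 17]
s0 = parent_vector([s1, s2])
tree_cost = min(s0)
```

Running this gives `s0 == [28, 29, 30, 29]` and `tree_cost == 28`, in agreement with the slide.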
Dynamic programming. This is an example of dynamic programming, because you first solve small subproblems and then, recursively, reuse their solutions to build the solution to a larger problem.
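The recursive structure can be made concrete with a small Sankoff implementation over a whole tree. This is a sketch, not a reference implementation: trees are nested tuples, the topology and leaf characters are assumptions borrowed from the Fitch example, and the unit cost function is used so that the result can be compared with a plain count of changes.

```python
import math

ALPHABET = "ACGT"


def sankoff(tree, states, cost):
    """Return [S(A), S(C), S(G), S(T)] for the subtree rooted at `tree`.

    Leaves are solved directly; each internal node reuses the vectors
    already computed for its children.
    """
    if isinstance(tree, str):
        return [0 if k == states[tree] else math.inf for k in ALPHABET]
    child_vectors = [sankoff(child, states, cost) for child in tree]
    return [
        sum(
            min(cost(k, x) + vector[j] for j, x in enumerate(ALPHABET))
            for vector in child_vectors
        )
        for k in ALPHABET
    ]


def unit_cost(i, j):
    """Every change costs 1, so the score is just the number of changes."""
    return 0 if i == j else 1


# Assumed topology and leaf characters (same as the Fitch example).
tree = ((("Human", "Chimp"), "Gorilla"), ("Chicken", "Duck"))
states = {"Human": "A", "Chimp": "G", "Gorilla": "C", "Chicken": "C", "Duck": "A"}

root_vector = sankoff(tree, states, unit_cost)
minimum_cost = min(root_vector)
```

With unit costs the root vector is [3, 3, 4, 5], so the minimum is 3, the same as the number of union operators in the Fitch scan of this tree.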
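Exercise. Compute the minimal cost for this tree. [Figure: a tree with leaf characters A, G, A, C, C.]

Costs (only part of the table survived extraction; "?" marks a lost entry, and the column placement of the surviving entries is a best reading):
       A    C    G    T
   A   0    2.5  1    ?
   C   ?    0    ?    1
   G   1    ?    0    ?
   T   ?    1    ?    0

Solution: the vector at the root should be [6, 6, 7, 8]; thus, the answer is 6.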