The Tree of Life From Ernst Haeckel, 1891
But, is there only one “Tree of Life?” There are many theories of evolution Basic idea: Speciation is caused by physical separation into groups where different genetic variants become dominant Basic Tennant: Any two species share a common ancestor some time in the distant past
We are generally considering a “Gene Tree” as opposed to a “Species Tree.” Divergence within a gene generally happens before splitting into species occurs. In order to get a picture of evolution involving species, there is a need to look at collections of genes as opposed to individual genes.
Classical phylogenetic analysis: morphological features number of legs, lengths of legs, etc. Modern biological methods allow for the use of molecular features Gene sequences Protein sequences Analysis based on homologous sequences (e.g., globins) in different species
Use of Molecular Data Provides an objective criteria for constructing phylogenetic trees Basic data includes Gene sequences Protein sequences Analysis based on homologous sequences in different species
However, gene/protein sequence can be homologous for different reasons: Orthologs -- sequences diverged after a speciation event Paralogs -- sequences diverged after a duplication event Xenologs -- sequences diverged after a horizontal transfer (e.g., by virus)
Two main kinds of information are contained in trees: The tree topology The tree metric The form or shape of the tree distance or branch length 2 1 3 6
NO! YES! Cycle Family trees can look like this These two trees are topologically equivalent. The one on the right is more common in biology texts.
Trees can be unrooted or rooted I V III IV II I V IV III II
If there are n leaf nodes or taxa, how many different trees are possible? NR = Number of Possible Rooted Trees = NU = Number of Possible Unrooted Trees = The numerator contains the computationally insidious factorial function. Well, so does the denominator, but it is for a much smaller number.
Three Leaf Nodes Four Leaf Nodes Only one unrooted tree is possible C Only one unrooted tree is possible Four Leaf Nodes C A D C B D A B C B A D Three different unrooted trees are possible
A Table Showing the Growth of Unrooted and Rooted Trees 3 1 4 15 5 105 6 945 7 10395 8 135135 9 2027025 10 34459425 11 654729075 12 65729075 1.375*10-10 What do you notice in this table? Why is this true?
To Further Make the Point If we are creating a tree with 15 different taxa, there are 213,458,046,767,875 possible rooted trees. Assuming a computer can create a tree in 10-9 seconds, it would take 2.47 days of computation time to create them. If 20 species, 8,200,794,532,637,891,559,337 possible trees and the same computer would take 259,867 years to generate this many trees!
If we assume the Molecular Clock is working, then the distance from the root to each leaf is the same. The above tree does not assume a Molecular Clock!
THE MODERN MOLECULAR CLOCK Lindell Bromham and David Penny
Distance data can be generated from character data: Jukes-Cantor where p = percent of mismatches Kimura where P = percent transitions Q = percent transversions
Next we create a matrix of these distances species A B C dAB - dAC dBC D dAD dBD dCD For Example: species A B C D 9 - 8 11 12 15 10 E 18 13 5
Input: distance matrix between species Outline: Simple Distance-Based Method Unweighted-pair-group method with arithmetic mean Input: distance matrix between species Outline: Cluster species together Initially clusters are singletons At each iteration combine two “closest” clusters to get a new one
UPGMA Clustering Let Ci and Cj be clusters, define distance between them to be When combining two clusters, Ci and Cj, to form a new cluster Ck, then
Begin with the following distance matrix B C D E Species B C D E A 4 2 3 - 1 Closest Pair is {B, D} so cluster them, C1 = {B, D} d(C1,A) = 1/2 (4 + 4) = 4 d(C1,C) = 1/2 (4 + 4) = 4 d(C1,E) = 1/2 (4 + 4) = 4 B D 0.5 Tree at end of Stage 1
Create a new matrix that includes C1 A C E Species A C E C1 4 - 2 3 Closest are A and C, so C2 = {A, C} d(C1, C2) = 1/2 (4 + 4) = 4 d(C2, E) = 1/2 (3 + 3) = 3 A C 1 B D 0.5 Tree at end of Stage 2
Once again we revise the distance matrix: Species C2 E C1 4 - 3 We create group C3 = {C2, E} = {{A, C}, E} d(C1, C3) = 1/6 ( d(B,A) + d(B,C) + d(B,E) + d(D,A) + d(D,C) + d(D,E)) = 1/6(4+4+4+4+4+4)=4 B D 0.5 E C A 1 1.5 NOTE: This tree satisfies the Molecular Clock Assumption. This is a basic property of UPGMA produced trees. Completed Tree
The Fitch-Margoliash(FM) Algorithm A weaker requirement is additivity In “real” tree, distances between species are the sum of distances between intermediate nodes k c b j a i
Consequences of Additivity Suppose input distances are additive For any three leaves Thus k c b j a m i
Applying this idea to three taxa, A, B, and C: z A y x B Using the fact that: x + y = d(A,B) x+ z = d(A,C) y + z = d(B,C) and a little high school algebra, we have x = 1/2 (d(A,B) + d(A,C) – d(B,C)) y = 1/2 (d(A,B) + d(B,C) – d(A,C)) z = 1/2 (d(A,C) + d(B,C) – d(A,B)
We will apply this criterion to the following data: B C D E A .31 1.01 .75 1.03 - 1.00 .69 .90 .61 .42 .37 We note that A and B are the closest, but to group them without the assumption of equal distance from a common ancestor, we temporarily group C-D-E and use the three taxa case: d(A,C-D-E) = 1/3(1.01+.75+1.03) = .93 d(B,C-D-E) = 1/3(1.00+.69+.90) = .863
From the formulas two slides previous we have: C-D-E .7415 A .1215 .1885 B Recall, the joining of C-D-E was only temporary so that we could get accurate distances for joining A and B Separating C, D, and E and combining A and B for the rest of the algorithm gives the table: C D E A-B 1.005 .72 .965 - .61 .42 .37
We now have the table: C D E A-B 1.005 .72 .965 - .61 .42 .37 The closest distance is D,E. So we combine everything else into a single group A-B-C d(D,A-B-C) = 1/3(.75+.69+.61) = .683 d(E,A-B-C) = 1/3(1.03+.90+.42) = .783 D E A-B-C .683 .783 - .37
This yeilds an intermediate tree of: .135 .548 A-B-C .235 E We keep the edges joining D and E while discarding the grouping A-B-C. We now have four edges of our tree and two groupings. d(A-B,D-E) = 1/4 (.75+1.03+.69+90) = .8425 d(A-B,C) = .72 d(C,D-E) = 1/2 (.61+.42) = .515
We can now produce the table: C D-E 1.005 .8425 - .515 Again applying the distance formulas we have the tree: C .33875 .6625 A-B .17625 D-E
All that remains is to compute a and b 0.1215 0.1885 0.135 0.235 0.33875 The average of A and B from their common vertex is 0.155. For D and E the average distance is .185 So for the value of a we have .66625 – .155 = .51125 for the value of b we have .1765 – .185 = -.00875 The negative value for b is a cause for concern about the quality of our data. If we are confident of our data and since .00875 is close to 0, most researchers would assign a value of 0 to b.
One concern is that we have produced an unrooted tree for our five species is that we have an unrooted tree and no real clue on where to place the root. Sometimes physical evidence can help with the placement of a root for the tree; however, many times such evidence does not exist. A common heuristic practice is to include an extra taxon that is more distantly related to those under consideration than they are to each other. Such a taxon is called an outgroup. The biological asumption is that this group must have split from the others before they split from themselves.