Published by Emery Phillips. Modified over 8 years ago.
2
14.4. Tue: Introduction to models (Jarno)
16.4. Thu: Distance-based methods (Jarno)
17.4. Fri: ML analyses (Jarno)
20.4. Mon: Assessing hypotheses (Jarno)
21.4. Tue: Problems with molecular data (Jarno)
23.4. Thu: Problems with molecular data (Jarno), Phylogenomics
24.4. Fri: Search algorithms, visualization, and other computational aspects (Jarno)
3
Character based
◦ Parsimony
◦ Model based analyses: maximum likelihood, Bayesian methods
Similarity based
◦ Distance methods
4
Distance estimates attempt to estimate the mean number of changes per site since two taxa last shared a common ancestor, based on a model of how the sequences may have evolved.
5
8
Number of changes between two sequences. Amino acid sequences (nucleotide sequences are treated the same way):
◦ KIMMO vs KIMMA: $d_H = 1$
◦ KIMMO vs TI-MO: $d_H = 1$
Hamming distance does not count gaps. Sometimes used in parsimony methods.
9
Edit distance does count gaps (-):
◦ KIMMO vs KIMMA: $d_H = 1$, $d_E = 1$
◦ KIMMO vs TI-MO: $d_H = 1$, $d_E = 2$
Often used in parsimony methods.
10
The p-distance is a normalized Hamming or edit distance.
◦ Normalized to the length of the sequence alignment: $p_d = n_d / n$,
◦ where $p_d$ is the distance,
◦ $n_d$ is the number of differing nucleotides between the (aligned) sequences (Hamming distance),
◦ $n$ is the total length of the alignment.
11
Sequences        d_H   d_E   p_d
KIMMO / KIMMA    1     1     1/5 = 0.2
KIMMO / TI-MO    1     2     1/4 = 0.25
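The three measures in the table can be sketched in a few lines of Python. This is a minimal illustration that assumes the sequences are already aligned and that "edit distance" means what the slides use it for: counting every differing column, gaps included. The function names are hypothetical, and the p-distance here divides by the number of gap-free columns, which is what gives 1/4 for the second pair.

```python
def hamming_distance(a: str, b: str, gap: str = "-") -> int:
    """Count mismatching columns, skipping any column that contains a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if gap not in (x, y) and x != y)


def edit_distance_aligned(a: str, b: str) -> int:
    """Count mismatching columns of an alignment, treating a gap as a difference."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def p_distance(a: str, b: str, gap: str = "-") -> float:
    """Hamming distance divided by the number of gap-free columns."""
    n = sum(1 for x, y in zip(a, b) if gap not in (x, y))
    if n == 0:
        raise ValueError("no gap-free columns to compare")
    return hamming_distance(a, b, gap) / n


print(hamming_distance("KIMMO", "TI-MO"))       # 1
print(edit_distance_aligned("KIMMO", "TI-MO"))  # 2
print(p_distance("KIMMO", "TI-MO"))             # 0.25
```

A true edit (Levenshtein) distance on unaligned sequences would need dynamic programming; the column-wise count above only reproduces the slide's aligned example.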
12
Distance models are often based upon some of the same assumptions as the models in ML, but they are implemented in a different way.
◦ Jukes-Cantor model: assumes all changes equally likely
◦ General time reversible model (GTR): assigns different probabilities to each type of change
◦ LogDet (paralinear) distance model: was devised to deal with unequal base frequencies in different sequences
13
All models include a correction for multiple substitutions at the same site.
All (except LogDet distances) can be modified to include a gamma correction for site rate heterogeneity.
[Figure: successive substitutions at sites of Seq 1 and Seq 2, labelled with the number of changes]
14
Jukes & Cantor model: $d_{xy} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} D\right)$
◦ $d_{xy}$ = distance between sequence x and sequence y, expressed as the number of changes per site (note $d_{xy} = r/n$, where $r$ is the number of replacements and $n$ is the total number of sites. This assumes all sites can vary; when unvaried sites are present in two sequences it will underestimate the amount of change which has occurred at variable sites)
◦ $D$ = the observed proportion of nucleotides which differ between two sequences (fractional dissimilarity)
◦ ln = natural log function, to correct for superimposed substitutions
15
The 3/4 and 4/3 terms reflect that there are four types of nucleotides and three ways in which a second nucleotide may not match a first - with all types of change being equally likely (i.e. unrelated sequences should be 25% identical by chance alone)
16
If two sequences are 95% identical, they differ at 5%, or 0.05 ($D$), of sites, thus:
$d_{xy} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} \times 0.05\right) = 0.0517$
Note that the observed dissimilarity 0.05 increases only slightly to an estimated 0.0517. This makes sense because in two very similar sequences one would expect very few changes to have been superimposed at the same site in the short time since the sequences diverged.
However, if two sequences are only 50% identical, they differ at 50%, or 0.50 ($D$), of sites, thus:
$d_{xy} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} \times 0.5\right) = 0.824$
For dissimilar sequences, which may have diverged a long time ago, the logarithmic correction implies that a much larger number of superimposed changes have occurred at the same site.
The natural logarithm ln is used to correct for superimposed changes at the same site.
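The two worked values can be checked with a short Python sketch of the Jukes & Cantor correction. The function name is hypothetical, and the range check reflects the fact that the logarithm is undefined once $D$ reaches 0.75, the dissimilarity expected between unrelated sequences under this model.

```python
import math


def jukes_cantor_distance(D: float) -> float:
    """Jukes & Cantor distance from the observed proportion of differing sites D."""
    if not 0.0 <= D < 0.75:
        raise ValueError("D must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * D)


print(round(jukes_cantor_distance(0.05), 4))  # 0.0517
print(round(jukes_cantor_distance(0.50), 3))  # 0.824
```

The correction grows quickly as $D$ approaches 0.75, which is why distance estimates for highly divergent sequences carry large variances.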
17
The most common additional parameters are:
◦ A correction to allow different substitution rates for each type of nucleotide change
◦ A correction for the proportion of sites which are unable to change
◦ A correction for variable site rates at those sites which can change
18
LogDet (paralinear) distances were designed to deal with unequal base frequencies in each pairwise sequence comparison, and thus (putatively) allow base compositions to vary over the tree! This distinguishes them from the GTR distance model, which takes the average base composition and applies it to all comparisons.
19
LogDet distances assume all sites can vary, so it is important to remove those sites which cannot change.
The proportion of such sites is typically slightly smaller than the observed proportion of constant sites and is estimated using ML.
Invariable sites are removed according to the base composition of constant sites (rather than the base composition of all sites, which may be different) in order to preserve the correct base frequencies among the remaining constant sites.
20
LogDet distances: $d_{xy} = -\ln(\det F_{xy})$
◦ $d_{xy}$ = estimated distance between sequence x and sequence y
◦ ln = natural log function, to correct for superimposed substitutions
◦ $F_{xy}$ = 4 x 4 (there are four bases in DNA) divergence matrix for sequences x and y; this matrix summarises the relative frequencies of bases in a given pairwise comparison
◦ det = the determinant (a unique mathematical value) of the matrix
21
                    Sequence B
                    a     c     g     t
Sequence A    a   224     5    24     8
              c     3   149     1    16
              g    24     5   230     4
              t     5    19     8   175

For sequences A and B, for 900 sequence positions, this matrix summarises pairwise site by site comparisons (it uses the data very efficiently).
22
The matrix $F_{xy}$ expresses this data as the proportions (e.g. 224/900 = 0.249) of sites:

              a      c      g      t
F_xy =  a   .249   .006   .027   .009
        c   .003   .166   .001   .018
        g   .027   .006   .256   .004
        t   .006   .021   .009   .194

$d_{xy} = -\ln[\det F_{xy}] = -\ln[.002] = 6.216$ (the LogDet distance between sequences A and B)
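The calculation can be reproduced from the raw counts with a small Python sketch. It implements only the simple form given on these slides, $d_{xy} = -\ln(\det F_{xy})$; phylogenetics packages usually apply additional normalisation, which is not shown here. The function names are hypothetical, and the determinant uses a plain Laplace expansion, which is adequate for a 4 x 4 matrix.

```python
import math


def determinant(m):
    """Determinant by Laplace expansion along the first row (fine for 4 x 4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for col, value in enumerate(m[0]):
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * value * determinant(minor)
    return total


def logdet_distance(counts):
    """Simple LogDet distance, -ln(det F), from a 4 x 4 table of site counts."""
    n_sites = sum(sum(row) for row in counts)
    F = [[c / n_sites for c in row] for row in counts]  # counts -> proportions
    det = determinant(F)
    if det <= 0:
        raise ValueError("determinant is not positive; distance undefined")
    return -math.log(det)


# Site-by-site comparison of sequences A (rows) and B (columns), 900 positions.
counts = [
    [224, 5, 24, 8],
    [3, 149, 1, 16],
    [24, 5, 230, 4],
    [5, 19, 8, 175],
]
print(round(logdet_distance(counts), 3))  # 6.216
```

Working from the unrounded proportions is what reproduces 6.216; plugging in the rounded determinant .002 would give about 6.215.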
23
Very good for situations where base compositions vary significantly between sequences.
Even when base compositions do not appear to vary, the LogDet distance model performs at least as well as other distance methods.
A drawback is that it assumes sites evolve identically and rates are equal for all sites.
However, a correction whereby a proportion of invariable sites is removed prior to analysis appears to work very well in simulations.
24
Site rate heterogeneity occurs when different sites in a molecule evolve at different rates due to different functional constraints.
Many models (Jukes-Cantor, LogDet, some ML models) assume all sites can vary and all evolve at the same rate.
Ignoring rate heterogeneity underestimates the amount of change that has occurred, and thus the distances between sequences, leading to incorrect trees.
A gamma correction for site rate heterogeneity can be included if the model allows this (many do).
25
                 O. eremita  O. italicum  O. cristinae  O. lassallei  O. barnabita¹  O. barnabita
O. eremita       -
O. italicum      0.038       -
O. cristinae     0.049       0.044        -
O. lassallei     0.101       0.098        0.100         -
O. barnabita¹    0.106       0.098        0.087         0.062         -
O. barnabita     0.115       0.115        0.105         0.068         0.006          -

This summary of the data is then used to infer the phylogenetic relationships of taxa.
26
Four mathematical conditions must be satisfied:
◦ d(x,y) >= 0
◦ d(x,x) = 0 and d(y,y) = 0
◦ d(x,y) = d(y,x) (symmetry)
◦ d(x,z) <= d(x,y) + d(y,z) (metric)
The last is also called the triangle inequality.
If these conditions are not satisfied, then d is not a distance but a dissimilarity.
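The four conditions can be checked mechanically for any pairwise matrix. The following is a minimal Python sketch, assuming the matrix is given as a square list of lists; the function name and the tolerance parameter are hypothetical.

```python
from itertools import product


def is_metric(d, tol: float = 1e-9) -> bool:
    """Check non-negativity, zero diagonal, symmetry and the triangle inequality."""
    n = len(d)
    for i, j in product(range(n), repeat=2):
        if d[i][j] < -tol:                   # d(x,y) >= 0
            return False
        if abs(d[i][j] - d[j][i]) > tol:     # d(x,y) = d(y,x)
            return False
    if any(abs(d[i][i]) > tol for i in range(n)):  # d(x,x) = 0
        return False
    for i, j, k in product(range(n), repeat=3):
        if d[i][k] > d[i][j] + d[j][k] + tol:      # triangle inequality
            return False
    return True


print(is_metric([[0, 1, 2], [1, 0, 1], [2, 1, 0]]))  # True
print(is_metric([[0, 1, 5], [1, 0, 1], [5, 1, 0]]))  # False: 5 > 1 + 1
```

Corrected distances estimated from real data often fail the triangle inequality slightly, in which case they are dissimilarities in the sense used above.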
27
[Figure: Euclidean distance (green); taxicab, city block or Manhattan distance (other colors)]
28
A metric is an ultrametric if:
◦ d(x, z) ≤ max(d(x, y), d(y, z)),
◦ which means that points can never fall between other points.
If the distances between the sequences are ultrametric, then the tree formed by certain clustering methods (UPGMA) will be an accurate ultrametric tree, and therefore an accurately rooted tree.
In practice, ultrametric distances would mean that a molecular clock exists!
If the distances are not ultrametric, the resulting tree will be an inaccurate ultrametric tree.
29
Additive trees are generalizations of ultrametric trees.
An additive metric is one for which:
◦ d(x,y)+d(u,v) <= max(d(x,u)+d(y,v), d(x,v)+d(y,u))
An additive tree is further restricted by:
◦ d(x,y)+d(u,v) <= d(x,u)+d(y,v) = d(x,v)+d(y,u)
30
Additive distances: If we could determine exactly the true evolutionary distance implied by a given amount of observed sequence change, between each pair of taxa under study, these distances would have the useful property of tree additivity.
31
Additive trees: A phylogenetic tree is additive if the evolutionary distance separating any two points on a tree is equal to the total of the lengths of the branches that join the two points.
Obtaining a tree using pairwise distances
32
     A    B    C    D
A    -    4    4    8
B    4    -    6   10
C    4    6    -    8
D    8   10    8    -

[Figure: additive tree for A, B, C and D with branch lengths 1, 1, 3, 6 and 2]
Note that the distances in the matrix and the path lengths on the tree match perfectly: this is a single unique additive tree.
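Both the ultrametric (three-point) and additive (four-point) conditions can be tested directly on a matrix such as the one above. The sketch below is an illustration with hypothetical function names; it uses the equivalent formulation that, among the three distances of a triple (or the three pairwise sums of a quartet), the two largest must be equal.

```python
from itertools import combinations


def is_ultrametric(d, tol: float = 1e-9) -> bool:
    """Three-point condition: in every triple the two largest distances are equal."""
    for i, j, k in combinations(range(len(d)), 3):
        _, middle, largest = sorted((d[i][j], d[i][k], d[j][k]))
        if largest - middle > tol:
            return False
    return True


def is_additive(d, tol: float = 1e-9) -> bool:
    """Four-point condition: of the three pairwise sums, the two largest are equal."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted((d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]))
        if sums[2] - sums[1] > tol:
            return False
    return True


# Matrix from the slide, taxa in the order A, B, C, D.
D = [
    [0, 4, 4, 8],
    [4, 0, 6, 10],
    [4, 6, 0, 8],
    [8, 10, 8, 0],
]
print(is_additive(D))     # True: sums are 12, 14, 14
print(is_ultrametric(D))  # False: the triple A, B, C gives 4, 4, 6
```

The example matrix is therefore tree-additive without being clock-like, which is the situation where neighbour-joining is expected to do better than UPGMA.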
33
Unfortunately, due to the finite amount of available data, stochastic (random) errors will cause the estimated distances to deviate from perfect tree additivity, even when evolution proceeds exactly according to the distance model used.
Poor estimates obtained using an inappropriate model will compound the problem.
How can we identify the tree which best fits the experimental data from the many possible trees?
34
We have uncertain data that we want to fit to a particular mathematical model (an additive tree) and find the optimal value for the adjustable parameters (branching pattern and branch lengths)
35
The least squares criterion seeks to minimise the squared deviation of the tree path length distances from the distance estimates.
36
[Figure: unrooted tree with tips A, E, C, D and B and branch lengths v1 to v7]

     A      B      C      D      E
A    0      0.23   0.16   0.20   0.17
B    0.23   0      0.23   0.17   0.24
C    0.16   0.23   0      0.15   0.11
D    0.20   0.17   0.15   0      0.21
E    0.17   0.24   0.11   0.21   0

Observed $D_{ij}$, inferred $d_{ij}$
Least squares methods: minimise the discrepancy between observed $D_{ij}$ and inferred $d_{ij}$.
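Written out, the quantity being minimised is usually a sum over all pairs of taxa. The form below is a sketch of the standard criterion rather than something taken from the slides; the weights $w_{ij}$ are an added assumption, with $w_{ij} = 1$ giving the unweighted version described above and $w_{ij} = 1/D_{ij}^2$ being one commonly used weighted alternative.

```latex
Q = \sum_{i<j} w_{ij}\,\left(D_{ij} - d_{ij}\right)^{2},
\qquad
d_{ij} = \sum_{k \,\in\, \mathrm{path}(i,j)} v_k
```

Here $d_{ij}$ is the path length between taxa $i$ and $j$ on the candidate tree, that is, the sum of the branch lengths $v_k$ along the path joining them.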
37
For 20 taxa there are $\sim 2 \times 10^{20}$ unrooted trees (close to Avogadro's constant).
For 50 taxa there are $\sim 3 \times 10^{74}$ unrooted trees (number of electrons in the universe?).
How can we find the best tree?
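These counts follow from the standard formula for the number of unrooted, fully bifurcating trees on $n$ labelled taxa, $(2n-5)!! = 3 \times 5 \times \dots \times (2n-5)$. The formula is not stated on the slide, so take the sketch below as a check under that assumption; the function name is hypothetical.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees for n labelled taxa: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count


for n in (4, 5, 10, 20, 50):
    print(n, f"{num_unrooted_trees(n):.3g}")
```

Python integers have arbitrary precision, so the count for 50 taxa is computed exactly before being formatted.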
38
Minimum Evolution Method
For each possible alternative tree, one can estimate the length of each branch from the estimated pairwise distances between taxa and then compute the sum (S) of all branch length estimates.
The minimum evolution criterion is to choose the tree with the smallest S value.
39
Clustering methods do not optimize a criterion
◦ they apply a particular algorithm to the observed data to come up with a tree
UPGMA and Neighbour-joining
40
Unweighted Pair Group Method using Arithmetic averages
◦ Assumes that sequences evolve under a "molecular clock"
◦ Always connects those taxa, or amalgamates of taxa, that have the shortest distance
◦ Gives an ultrametric tree
42
Distance 3.3 (Human - Monkey) is the minimum. So we'll join Human and Monkey to MonHum and we'll calculate the new distances.
[Figure: Monkey and Human joined as Mon-Hum; Spinach, Mosquito and Rice still unjoined]
43
After we have joined two species in a subtree, we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:
Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2 = (90.8 + 86.3)/2 = 88.55
[Figure: Spinach compared with the Mon-Hum subtree (Monkey, Human)]
44
[Figure: Mosquito joined to Mon-Hum (Monkey, Human), forming Mos-(Mon-Hum); Spinach and Rice still unjoined]
45
[Figure: Spinach and Rice joined as Spin-Rice, alongside Mos-(Mon-Hum)]
46
[Figure: final tree (Spin-Rice)-(Mos-(Mon-Hum)), joining Spin-Rice with Mos-(Mon-Hum)]
47
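True tree
[Figure: true tree for A, B, C and D; branch labels 13, 4, 4, 2, 2, 10]

Distance matrix
     A    B    C    D
A    0   17   21   27
B   17    0   12   18
C   21   12    0   14
D   27   18   14    0

UPGMA tree
[Figure: UPGMA tree for A, B, C and D; branch labels 6, 6, 2, 8, 2.833, 10.833]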
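The UPGMA branch labels can be reproduced with a compact Python sketch. It is an illustration under stated assumptions, not a reference implementation: the function name and the return format (a list of merges with node heights) are hypothetical, and the distance between two clusters is the arithmetic mean over all pairs of original taxa.

```python
from itertools import combinations


def upgma(labels, matrix):
    """Cluster taxa by UPGMA; return merges as (members_a, members_b, node_height)."""
    # Each cluster is a tuple of indices into the original matrix.
    clusters = [(i,) for i in range(len(labels))]
    merges = []

    def avg_dist(a, b):
        # Arithmetic mean over all pairs of original taxa in the two clusters.
        return sum(matrix[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Always join the two clusters with the shortest average distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        height = avg_dist(a, b) / 2  # the new node sits halfway between them
        merges.append((
            tuple(labels[i] for i in a),
            tuple(labels[i] for i in b),
            height,
        ))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


matrix = [
    [0, 17, 21, 27],
    [17, 0, 12, 18],
    [21, 12, 0, 14],
    [27, 18, 14, 0],
]
for members_a, members_b, height in upgma(["A", "B", "C", "D"], matrix):
    print(members_a, members_b, round(height, 3))
```

The merges come out as B with C at height 6, then D at height 8, then A at height 10.833, so the branch labels are 6, 6, 8 - 6 = 2, 8, 10.833 - 8 = 2.833 and 10.833. UPGMA groups B with C, whereas the matrix itself is additive on a tree that pairs A with B and C with D, so the clock assumption misleads the method here.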
48
◦ Does not assume a molecular clock
◦ Approximates the minimum evolution method
◦ Guaranteed (supposedly) to recover the true tree if the distance matrix is an exact reflection of the tree
49
Calculate a corrected distance matrix
◦ Adjust the distance of each pair of taxa with the average distance to all other taxa
Join the two least distant taxa together to create a new node
Calculate branch lengths from the node to each taxon separately, taking into account the average distance to all other taxa
Combine the joined taxa, calculate corrected distances to the remaining taxa, and go through the cycle again
50
     A    B    C    D    E
B    5
C    4    7
D    7   10    7
E    6    9    6    5
F    8   11    8    9    8

1. Compute net divergences for every node:
$r_A = 5+4+7+6+8 = 30$, $r_B = 5+7+10+9+11 = 42$, $r_C = 32$, $r_D = 38$, $r_E = 34$, $r_F = 44$

2. Compute rate corrected distance matrix: $M_{ij} = d_{ij} - (r_i + r_j)/(N-2)$
$M_{AB} = d_{AB} - (r_A + r_B)/(N-2) = 5 - (30+42)/4 = -13$, $M_{AC}$ etc.

     A       B       C       D       E
B    -13
C    -11.5   -11.5
D    -10     -10     -10.5
E    -10     -10     -10.5   -13
F    -10.5   -10.5   -11     -11.5   -11.5

[Figure: starting tree with tips A, B, C, D, E and F]
51
3. Join neighbours A and B to form node U and
4. compute their branch lengths:
$S_{AU} = d_{AB}/2 + (r_A - r_B)/(2(N-2)) = 5/2 + (30-42)/(2(6-2)) = 1$
$S_{BU} = d_{AB} - S_{AU} = 4$
[Figure: tree with A and B joined at node U; branch lengths A-U = 1, B-U = 4]

5. Distance from U to remaining terminals:
$d_{CU} = (d_{AC} + d_{BC} - d_{AB})/2 = 3$
$d_{DU} = (d_{AD} + d_{BD} - d_{AB})/2 = 6$
$d_{EU} = (d_{AE} + d_{BE} - d_{AB})/2 = 5$
$d_{FU} = (d_{AF} + d_{BF} - d_{AB})/2 = 7$

     U    C    D    E
C    3
D    6    7
E    5    6    5
F    7    8    9    8

Repeat steps 1-5

1. Compute net divergences for every node:
$r_U = 3+6+5+7 = 21$, $r_C = 24$, $r_D = 27$, $r_E = 24$, $r_F = 32$

2. Compute rate corrected distance matrix: $M_{ij} = d_{ij} - (r_i + r_j)/(N-2)$

     U       C       D       E
C    -12
D    -10     -10
E    -10     -10     -12
F    -10.7   -10.7   -10.7   -10.7

3. Join neighbours U and C to form node V and
4. compute their branch lengths:
$S_{UV} = d_{CU}/2 + (r_U - r_C)/(2(N-2)) = 1$
$S_{CV} = d_{CU} - S_{UV} = 2$
[Figure: tree with U and C joined at node V; branch lengths A-U = 1, B-U = 4, U-V = 1, C-V = 2]

5. Distance from V to remaining terminals:
$d_{DV} = (d_{DU} + d_{CD} - d_{CU})/2 = 5$
$d_{EV} = (d_{EU} + d_{CE} - d_{CU})/2 = 4$
$d_{FV} = (d_{FU} + d_{CF} - d_{CU})/2 = 6$

Repeat steps 1-5
52
     V    D    E
D    5
E    4    5
F    6    9    8

1. Compute net divergences for every node:
$r_V = 5+4+6 = 15$, $r_D = 19$, $r_E = 17$, $r_F = 23$

2. Compute rate corrected distance matrix: $M_{ij} = d_{ij} - (r_i + r_j)/(N-2)$

     V      D      E
D    -12
E    -12    -13
F    -13    -12    -12

3. Join neighbours D and E to form node W and
4. compute their branch lengths:
$S_{DW} = d_{DE}/2 + (r_D - r_E)/(2(N-2)) = 3$
$S_{EW} = d_{DE} - S_{DW} = 2$
[Figure: tree with D and E joined at node W; branch lengths A-U = 1, B-U = 4, U-V = 1, C-V = 2, D-W = 3, E-W = 2]

5. Distance from W to remaining terminals:
$d_{VW} = (d_{DV} + d_{EV} - d_{DE})/2 = 2$
$d_{FW} = (d_{DF} + d_{EF} - d_{DE})/2 = 6$

Repeat steps 1-5
53
     V    W
W    2
F    6    6

1. Compute net divergences for every node:
$r_V = 2+6 = 8$, $r_W = 8$, $r_F = 12$

2. Compute rate corrected distance matrix: $M_{ij} = d_{ij} - (r_i + r_j)/(N-2)$

     V      W
W    -14
F    -14    -14

3. Join neighbours F and V to form node X and
4. compute their branch lengths:
$S_{VX} = d_{FV}/2 + (r_V - r_F)/(2(N-2)) = 1$
$S_{FX} = d_{FV} - S_{VX} = 5$

5. Distance from W to remaining terminals:
$d_{WX} = (d_{FW} + d_{VW} - d_{FV})/2 = 1$

     W
X    1

[Figure: final tree; branch lengths A-U = 1, B-U = 4, U-V = 1, C-V = 2, D-W = 3, E-W = 2, V-X = 1, F-X = 5, W-X = 1]
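The five steps of the worked example can be sketched in Python as follows. This is a minimal illustration using the formulas exactly as they appear on the slides; the function name, the return format and the internal node names N1, N2, ... (standing in for U, V, W, X) are hypothetical. Ties in the corrected matrix are broken by taking the first pair encountered, which happens to reproduce the joins chosen in the example.

```python
def neighbor_joining(labels, matrix):
    """Neighbour-joining; returns (joins, last_branch).

    joins: list of (node_a, node_b, branch_a, branch_b, new_node)
    last_branch: (node_a, node_b, length) connecting the final two nodes
    """
    # Dict-of-dicts so that nodes can be added and dropped by name.
    d = {a: {b: float(matrix[i][j]) for j, b in enumerate(labels)}
         for i, a in enumerate(labels)}
    nodes = list(labels)
    joins = []

    while len(nodes) > 2:
        n = len(nodes)
        # Step 1: net divergence of every node.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Step 2: rate corrected distances; keep the first pair with the minimum.
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                a, b = nodes[x], nodes[y]
                m = d[a][b] - (r[a] + r[b]) / (n - 2)
                if best is None or m < best[0]:
                    best = (m, a, b)
        _, a, b = best
        # Steps 3-4: join a and b and compute their branch lengths.
        s_a = d[a][b] / 2 + (r[a] - r[b]) / (2 * (n - 2))
        s_b = d[a][b] - s_a
        u = f"N{len(joins) + 1}"  # hypothetical name for the new internal node
        # Step 5: distances from the new node to the remaining nodes.
        d[u] = {}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = (d[a][c] + d[b][c] - d[a][b]) / 2
        nodes = [c for c in nodes if c not in (a, b)] + [u]
        joins.append((a, b, s_a, s_b, u))

    a, b = nodes
    return joins, (a, b, d[a][b])


six_taxa = [
    [0, 5, 4, 7, 6, 8],
    [5, 0, 7, 10, 9, 11],
    [4, 7, 0, 7, 6, 8],
    [7, 10, 7, 0, 5, 9],
    [6, 9, 6, 5, 0, 8],
    [8, 11, 8, 9, 8, 0],
]
joins, last_branch = neighbor_joining(list("ABCDEF"), six_taxa)
for join in joins:
    print(join)
print(last_branch)
```

On the six-taxon matrix this joins A with B (branches 1 and 4), then C with the new node (2 and 1), then D with E (3 and 2), then F with the second internal node (5 and 1), and finishes with a branch of length 1, matching the tree built up in the slides.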
54
[Figure: true tree for A, B, C and D; branch labels 13, 4, 4, 2, 2, 10]

     A    B    C    D
A    0   17   21   27
B   17    0   12   18
C   21   12    0   14
D   27   18   14    0

Compute net divergences for every node:
$r_A = 17+21+27 = 65$
$r_B = 17+12+18 = 47$
$r_C = 21+12+14 = 47$
$r_D = 27+18+14 = 59$

Compute rate corrected distance matrix: $M_{ij} = d_{ij} - (r_i + r_j)/(N-2)$
$M_{AB} = d_{AB} - (r_A + r_B)/(N-2) = 17 - (65+47)/2 = -39$
$M_{AC} = d_{AC} - (r_A + r_C)/(N-2) = 21 - (65+47)/2 = -35$
$M_{AD} = d_{AD} - (r_A + r_D)/(N-2) = 27 - (65+59)/2 = -35$
$M_{BC} = d_{BC} - (r_B + r_C)/(N-2) = 12 - (47+47)/2 = -35$
$M_{BD} = d_{BD} - (r_B + r_D)/(N-2) = 18 - (47+59)/2 = -35$
$M_{CD} = d_{CD} - (r_C + r_D)/(N-2) = 14 - (47+59)/2 = -39$

     A     B     C     D
A    0
B    -39   0
C    -35   -35   0
D    -35   -35   -39   0

The smallest values, $M_{AB} = M_{CD} = -39$, pair A with B and C with D.
Beware! There are cases where NJ does not work!
55
◦ Fast when using clustering algorithms: suitable for analysing data sets which are too large for ML
◦ A large number of models are available with many parameters: improves estimation of distances
58
Distance estimates are only correct if the model used is correct.
Rate variations in different parts of a tree are intractable for distance measures
◦ Information on variation in characters is lost once sequence differences are converted to distances
59
Information is lost: given only the distances, it is impossible to derive the original sequences.
Only through character based analyses (ML, parsimony) can the most informative positions be inferred.
Generally outperformed by maximum likelihood methods in choosing the correct tree in computer simulations (but LogDet is better in some situations).
60
"Nothing in biology makes sense except in the light of evolution" - Dobzhansky 1973
"Nothing in evolution makes sense except in the light of phylogeny" - Savage 1997
61
◦ The study of character evolution
◦ The study of historical biogeography
◦ The study of the temporal framework of evolution and diversification
◦ The study of molecular evolution
62
◦ Ease of data generation for large numbers of taxa
◦ Ease of generating a large number of independent data sets for given taxa
◦ Molecular characters behind the morphological characters we see
63
The butterfly family Nymphalidae
64
Wahlberg et al (2009) Proc R Soc 276: 4295-4302
65
[Figure: dated phylogeny of Nymphalidae with time markers at 104, 94 and 65 mya; clades shown: Libytheinae, Danaini, Tellervini+Ithomiini, Limenitidinae, Heliconiinae, Pseudergolinae, Apaturinae, Biblidinae, Cyrestinae, Nymphalinae, Calinaginae, Charaxinae, Satyrinae]
Wahlberg et al (2009) Proc R Soc 276: 4295-4302
66
Peña & Wahlberg (2008) Biology Letters 4: 274-278. satyrine clade
67
[Figure: inferred ancestral areas on the Nymphalidae phylogeny; node labels include Widespread, Neotropics, Oriental, Australia and Afrotropics, singly or in combination]
Wahlberg et al (2009) Proc R Soc 276: 4295-4302
68
90 Mya
69
80 Mya
70
70 Mya
71
60 Mya
72
Wahlberg et al (2009) Proc R Soc 276: 4295-4302