Tree Confidence Have we got the true tree? Use known phylogenies Unfortunately, very rare Hillis et al. (1992) created experimental phylogenies using phage.


1 Tree Confidence Have we got the true tree? Use known phylogenies. Unfortunately, these are very rare. Hillis et al. (1992) created experimental phylogenies using phage cultures and mutagens.

2 Tree Confidence Created a phylogeny of nine taxa from T7 phage cultures (~135,000 possible topologies). Divided cultures in the presence of a mutagen at predetermined intervals. True phylogeny is symmetric and has equal branch lengths. Obtained restriction maps and sequences for all cultures (and the internodes) and inferred phylogenies using UPGMA, NJ, and parsimony.
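The figure of roughly 135,000 topologies matches the standard count of unrooted, strictly bifurcating trees on n tips, (2n-5)!!. The sketch below assumes that this is the count the slide refers to; the function name is illustrative only.

```python
def num_unrooted_topologies(n_taxa: int) -> int:
    """Count unrooted, strictly bifurcating topologies: (2n-5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


nine_taxa_count = num_unrooted_topologies(9)  # 135135
```

The count grows very quickly with the number of taxa, which is why exhaustive evaluation of every topology is only feasible for small data sets.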

3 Tree Confidence Correct phylogeny inferred: 100%. Ancestral states (via parsimony) correctly inferred: 97.3%. Correlations between predicted and actual branch lengths were used to compare the methods: parsimony > NJ > UPGMA. “The results… directly support the legitimacy of methods for phylogenetic estimation… with regard to branching relationships…, branch lengths, and ancestral genotypes.” Unfortunately, the tree was relatively simple, with plenty of informative changes along each branch.

4 Tree Confidence Use simulated data sets. Supply a computer with sequences and ‘evolve’ the sequences according to some model. Analyze the resulting sequences using various phylogenetic methods. Advantage: we can test a wide variety of parameters. Disadvantage: are the models biologically accurate? Hillis et al. (1993) did this also.
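As a minimal sketch of what "evolving" a sequence under a model can mean, the code below mutates a sequence along a single branch under the Jukes-Cantor (JC69) model. The choice of JC69 and the function name are assumptions for illustration; the slides do not say which models the cited simulations used.

```python
import math
import random

BASES = "ACGT"


def evolve_jc69(sequence: str, branch_length: float, rng: random.Random) -> str:
    """Evolve a sequence along one branch under JC69.

    branch_length is in expected substitutions per site. Under JC69 the
    probability that a site ends in a different base is
    3/4 * (1 - exp(-4d/3)), and each of the three other bases is equally likely.
    """
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    evolved = []
    for base in sequence:
        if rng.random() < p_change:
            base = rng.choice([b for b in BASES if b != base])
        evolved.append(base)
    return "".join(evolved)


ancestor = "ACGT" * 5
descendant = evolve_jc69(ancestor, 0.5, random.Random(1))
```

Applying the function recursively down a known tree yields a simulated alignment whose true phylogeny is known, which is what allows the methods to be scored.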

5 Tree Confidence Attempted to find the correct unrooted tree for four sequences, a relatively simple problem. Varied rates of nucleotide substitution at sites and along branches. Tested performance using UPGMA and parsimony.

6 Tree Confidence Note the relative performances. Why does UPGMA do so well along the diagonal? Note the failure of both methods in the ‘Felsenstein zone’, where long branch attraction tends to occur. [Figure panels: UPGMA, MP]

7 Tree Confidence

8 What about the amount of data available? (Huelsenbeck et al., 1996) Note that some methods converge on the correct tree with less data than others… if all branches evolve equally. What if some branches evolve more quickly?

9 Tree Confidence Congruence among alternative/independent data sets

10 Tree Confidence Putting confidence estimates on nodes. We usually have only one data set. How do we obtain information on the statistical support of the nodes in our tree if we don’t have replicate data? The two most common techniques are resampling methods: bootstrapping and jackknifing.

11 Tree Confidence Bootstrap analysis (non-parametric). First used for phylogenetics by Felsenstein in 1985. “Resampling with replacement” to generate pseudoreplicates. Typically repeated 100 to 5000 times. Useful and widely implemented for phenetic and likelihood methods.

12 Tree Confidence Bootstrap analysis

Original data set:
Site  1 2 3 4 5 6 7 8 9 10
A     A T G G A T T T C G
B     A T G G C G T T C G
C     G C G G A G T T C G
D     G C G G C G T T T G

Sampling columns with replacement, e.g. site 4 (GGGG), site 2 (TTCC), site 9 (CCCT), site 4 (GGGG), ...

Pseudoreplicate 1:
Site  4 2 9 4 8 7 5 1 3 1
A     G T C G T T A A G A
B     G T C G T T C A G A
C     G C C G T T A G G G
D     G C T G T T C G G G

Pseudoreplicate 2:
Site  1 3 10 9 2 7 5 7 3 4
A     A G G  C T T A T G G
B     A G G  C T T C T G G
C     G G G  C C T A T G G
D     G G G  T C T C T G G

Pseudoreplicate 3:
Site  10 10 4 5 2 7 9 2 3 9
A     G  G  G A T T C T G C
B     G  G  G C T T C T G C
C     G  G  G A C T C C G C
D     G  G  G C C T T C G T
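The column resampling illustrated on this slide can be sketched in a few lines. The alignment is the four-taxon example from the slide; the function names are illustrative, and the sketch only builds pseudoreplicate alignments (tree inference on each replicate is left to whatever method is being evaluated).

```python
import random

# Four-taxon, ten-site alignment from the slide.
alignment = {
    "A": "ATGGATTTCG",
    "B": "ATGGCGTTCG",
    "C": "GCGGAGTTCG",
    "D": "GCGGCGTTTG",
}


def draw_bootstrap_columns(n_sites: int, rng: random.Random) -> list[int]:
    """Sample n_sites column indices (0-based) with replacement."""
    return [rng.randrange(n_sites) for _ in range(n_sites)]


def resample_alignment(aln: dict[str, str], columns: list[int]) -> dict[str, str]:
    """Build a pseudoreplicate by copying the chosen columns in order."""
    return {taxon: "".join(seq[c] for c in columns) for taxon, seq in aln.items()}


# Reproduce pseudoreplicate 1 from the slide (site numbers there are 1-based).
slide_columns = [4, 2, 9, 4, 8, 7, 5, 1, 3, 1]
replicate_1 = resample_alignment(alignment, [c - 1 for c in slide_columns])

# A fresh random pseudoreplicate.
random_replicate = resample_alignment(
    alignment, draw_bootstrap_columns(10, random.Random(42))
)
```

Because columns are drawn with replacement, some sites appear several times in a pseudoreplicate and others not at all, while the alignment length stays the same.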

13 Tree Confidence

Original data set (O):
Site  1 2 3 4 5 6 7 8 9 10
A     A T G G A T T T C G
B     A T G G C G T T C G
C     G C G G A G T T C G
D     G C G G C G T T T G

Pseudoreplicate 1:
Site  4 2 9 4 8 7 5 1 3 1
A     G T C G T T A A G A
B     G T C G T T C A G A
C     G C C G T T A G G G
D     G C T G T T C G G G

Pseudoreplicate 2:
Site  1 3 10 9 2 7 5 7 3 4
A     A G G  C T T A T G G
B     A G G  C T T C T G G
C     G G G  C C T A T G G
D     G G G  T C T C T G G

Pseudoreplicate 3:
Site  10 10 4 5 2 7 9 5 3 9
A     G  G  G A T T C A G C
B     G  G  G C T T C C G C
C     G  G  G A C T C A G C
D     G  G  G C C T T C G T

[Tree figures: the original data set (O) gives ((A,B),(C,D)); of the remaining trees, two are ((A,B),(C,D)) and one is ((A,C),(B,D))]

14 Tree Confidence Consensus trees. Majority-rule consensus trees only display branches with 50% support or more. A majority-rule consensus tree may or may not be congruent with any of the pseudoreplicate topologies. Other people and software will superimpose the branch support on the tree obtained from the original data set.

15 Tree Confidence Consensus trees. In this example, ¾ of the trees contain the branch linking AB and CD, so that branch gets 75% bootstrap support. [Tree figures: three trees with ((A,B),(C,D)) and one with ((A,C),(B,D)); consensus tree ((A,B),(C,D)) with 75 on the internal branch]
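The 75% figure is simply the share of trees that contain a given branch. A minimal sketch, assuming each unrooted tree is represented as the set of its internal bipartitions (a four-taxon tree has exactly one); the representation and function names are illustrative.

```python
def bipartition(side: str, taxa: str) -> frozenset:
    """Represent an internal branch as the unordered pair of taxon sets it separates."""
    one_side = frozenset(side)
    return frozenset({one_side, frozenset(taxa) - one_side})


def bootstrap_support(split: frozenset, trees: list[set]) -> float:
    """Percentage of trees that contain the given bipartition."""
    return 100.0 * sum(split in tree for tree in trees) / len(trees)


TAXA = "ABCD"
ab_cd = bipartition("AB", TAXA)  # the branch in ((A,B),(C,D))
ac_bd = bipartition("AC", TAXA)  # the branch in ((A,C),(B,D))

# Three trees with ((A,B),(C,D)) and one with ((A,C),(B,D)), as on the slide.
trees = [{ab_cd}, {ab_cd}, {ab_cd}, {ac_bd}]

support_ab_cd = bootstrap_support(ab_cd, trees)  # 75.0
support_ac_bd = bootstrap_support(ac_bd, trees)  # 25.0
```

A majority-rule consensus would keep the AB|CD branch (support at or above 50%) and discard the AC|BD branch.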

16 Tree Confidence What does the bootstrap really tell us? It only reflects the strength of the phylogenetic signal in the data. It tells us nothing about the accuracy of the method we chose. If the data set is biased, the bootstrap tree will be also. If evolutionary rates are unequal, long branch attraction will likely influence the consensus tree as much as the original tree. Sites may not evolve independently; if that is true, randomly sampling them is invalid. Block-bootstrapping: sample n/b blocks of b adjacent sites to correct for correlation among adjacent sites (Künsch, 1989).
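A block bootstrap replaces single-column draws with draws of runs of adjacent columns. The sketch below assumes overlapping ("moving") blocks with uniformly chosen start positions and truncates the result to the original length when n is not a multiple of b; both choices are assumptions, since the slide gives only the n/b rule.

```python
import random


def block_bootstrap_columns(n_sites: int, block_len: int, rng: random.Random) -> list[int]:
    """Draw about n/b blocks of b adjacent column indices, with replacement."""
    if not 1 <= block_len <= n_sites:
        raise ValueError("block_len must be between 1 and n_sites")
    n_blocks = -(-n_sites // block_len)  # ceiling of n / b
    columns: list[int] = []
    for _ in range(n_blocks):
        start = rng.randrange(n_sites - block_len + 1)
        columns.extend(range(start, start + block_len))
    return columns[:n_sites]  # trim any overshoot from the last block


block_columns = block_bootstrap_columns(10, 2, random.Random(0))
```

With block_len set to 1 this reduces to the ordinary site-by-site bootstrap.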

17 Tree Confidence Parametric bootstrap Here, we are trying to determine if the data set is typical of the parameters we have estimated for it. For example, we may find the ML tree, with estimates of branch lengths and substitution rates. We can now construct alignments by simulating sequences following the parameters: topology, branch lengths, substitution rates. Does the original alignment resemble the bootstraps?

18 Tree Confidence Parametric bootstrap: application. Say that we obtain tree T1 in a phylogenetic analysis, but we were expecting T2. We can test the null hypothesis that the data were actually generated on tree T2 but that stochasticity (or some other process) resulted in T1 being preferred. Estimate all model parameters on T2 and generate a set of reference data sets using those values and parametric bootstrapping. On each of the generated data sets, measure the difference in likelihood score of the two trees. Use this reference distribution in evaluating whether the preference for T1 in the original data set could be due to chance alone causing a deviation in data actually generated on tree T2. Extremely computationally intensive.

19 Tree Confidence Parametric bootstrap: the placement of Strepsiptera. Two topologies proposed. Classical (bottom): Strepsiptera is sister to the beetles. MP-based analysis of rRNA suggested Strepsiptera is sister to the flies. Huelsenbeck (1997) performed a parametric bootstrap test 1. created a “constrained” tree, forcing the relationship with

20 Tree Confidence Huelsenbeck (1997) performed a parametric bootstrap test. Created a “constrained” tree, forcing the relationship with beetles. Identified the best tree under this constraint. Created many simulated data sets using a parametric bootstrap and allowed them to evolve under the constraints. Analyzed the resulting data sets under the parsimony criterion. 92% of the trees had topologies that included a sister relationship between flies and Strepsiptera, despite the fact that we had stacked the deck in favor of the classical topology.

21 Tree Confidence Jackknifing: the most common variation is the delete-half jackknife. Randomly purge half of the sites from the original data set. Not commonly used anymore.
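The delete-half jackknife differs from the bootstrap in that sites are sampled without replacement. A minimal sketch; the function name is illustrative, and for an odd number of sites it keeps the smaller half, which is an arbitrary choice.

```python
import random


def delete_half_jackknife_columns(n_sites: int, rng: random.Random) -> list[int]:
    """Keep a random half of the column indices, each at most once."""
    return sorted(rng.sample(range(n_sites), n_sites // 2))


kept_columns = delete_half_jackknife_columns(10, random.Random(7))
```

Each jackknife replicate is therefore half the length of the original alignment and contains no duplicated sites.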

22 Tree Confidence Decay indices, aka Bremer support. The decay index is the length difference between the shortest trees including a group and the shortest trees that exclude the group (the extra steps required to overturn a group). Generally, the higher the decay index, the better the relative support for a group. Like bootstrap proportions (BPs), decay indices may be misleading if the data are misleading. Unlike BPs, decay indices are not scaled (0-100), and it is less clear what an acceptable decay index is. The magnitudes of decay indices and BPs are generally correlated. The higher the number of terminal taxa, the higher the index.
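The definition reduces to a single subtraction once the two constrained parsimony searches have been run. The tree lengths below are hypothetical values chosen only to illustrate the arithmetic.

```python
def decay_index(shortest_with_group: int, shortest_without_group: int) -> int:
    """Extra steps needed before the group no longer appears in the shortest trees."""
    return shortest_without_group - shortest_with_group


# Hypothetical lengths: 120 steps with the group, 123 steps once it is excluded.
example_decay = decay_index(120, 123)  # 3 extra steps
```

The expensive part in practice is finding the shortest trees that exclude the group, not the subtraction itself.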

23 Tree Confidence The approximate Likelihood Ratio Test (aLRT). For any strictly bifurcating tree, any internal branch connects to four other branches. The tree can be simplified. We can also hypothesize that the internal branch does not exist. The likelihoods of all three possible internal nodes can be calculated (and are, as a part of standard ML inference). These likelihoods can be compared via a modified LRT to determine whether any one alternative is significantly better than the original. [Figure: three four-taxon trees with tips A, B, C, D]

24 Tree Confidence Permutation testing. Sometimes it is desirable to test if various rearrangements of a phylogeny are significantly different from others. Several such tests allow us to determine if one tree is statistically significantly worse than another: the Kishino-Hasegawa (KH) test and the Shimodaira-Hasegawa (SH) test. The null hypothesis for all tests is that the trees are no different than would be expected from random sampling error.

25 Tree Confidence Distributions and hypothesis testing. Typical procedure: Generate a sampling distribution consisting of many values of a test statistic, generated from the data or from some other distribution of values. Generate a test statistic for your particular situation. Find out where your test statistic falls in the overall distribution. Is it in the acceptance region or the rejection region, given a significance level set a priori? One-sided test: we know the directionality of the effect we are expecting. Two-sided test: we don’t know the directionality of the effect.

26 Tree Confidence Distributions and hypothesis testing. Typical procedure: Generate a sampling distribution consisting of many values of a test statistic, generated from the data or from some other distribution of values. Generate a test statistic for your particular situation. Find out where your test statistic falls in the overall distribution. Is it in the acceptance region or the rejection region, given a significance level set a priori? One-sided test: we know the directionality of the effect we are expecting. Two-sided test: we don’t know the directionality of the effect. Ideally, we would be able to generate distributions by sampling from the real world. We can’t re-run evolution, so this isn’t possible; instead, we generate distributions by resampling from the data.
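Locating a test statistic within a resampled distribution can be sketched as a simple proportion. The two functions below are illustrative; the two-sided version assumes the null distribution is centered on zero, and the example values are hypothetical.

```python
def one_sided_p(observed: float, null_values: list[float]) -> float:
    """Share of null values at least as large as the observed statistic."""
    return sum(v >= observed for v in null_values) / len(null_values)


def two_sided_p(observed: float, null_values: list[float]) -> float:
    """Share of null values at least as extreme in either direction (null centered on 0)."""
    return sum(abs(v) >= abs(observed) for v in null_values) / len(null_values)


# Hypothetical null distributions, for illustration only.
p_upper = one_sided_p(8, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])  # 2 of 10 values are >= 8
p_both = two_sided_p(2, [-2, -1, 0, 1, 2])                # 2 of 5 values have |v| >= 2
```

The null hypothesis is rejected when the resulting proportion falls below the significance level chosen beforehand.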

27 Tree Confidence The Kishino-Hasegawa test uses differences in the support provided by individual sites for two trees (T0, T1) to determine if the overall differences between the trees are significantly greater than expected from random sampling error. Valid only if the trees to be tested are identified a priori. Procedure: infer the likelihoods of each tree (l0, l1); generate many bootstrap replicates (i) for each tree and calculate l for all of them; generate a distribution of differences between the trees, which is the null distribution; use the distribution to test your hypothesis: does the difference you originally calculated fall into one of the tails?

28 Tree Confidence The Kishino-Hasegawa test H0 – the trees are not different H1 – the trees are different

29 Tree Confidence The Kishino-Hasegawa test. The test statistic is the score (likelihood) difference between the trees. The null distribution is all of the possible differences if the trees are not different (generated by bootstrapping the data). Assumptions: trees are selected a priori; sites are independent; sites are identically distributed; large numbers of sites are sampled.
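One way to sketch the test is to resample per-site log-likelihoods rather than re-estimating likelihoods on every bootstrap data set. This shortcut (often called RELL resampling), together with centering the site differences to impose the null hypothesis, is a swapped-in simplification and not necessarily the procedure the slides have in mind. The per-site log-likelihoods are assumed to come from an external likelihood program; the function name is hypothetical.

```python
import random


def kh_test_rell(site_lnl_0, site_lnl_1, n_reps=1000, seed=0):
    """Two-sided KH-style test from per-site log-likelihoods of two trees.

    Returns the observed log-likelihood difference (T0 minus T1) and the
    share of resampled null differences that are at least as extreme.
    """
    rng = random.Random(seed)
    n_sites = len(site_lnl_0)
    site_diffs = [a - b for a, b in zip(site_lnl_0, site_lnl_1)]
    observed = sum(site_diffs)
    # Center the site differences so the resampled totals have expectation 0 (H0).
    mean_diff = observed / n_sites
    centered = [d - mean_diff for d in site_diffs]
    null_diffs = [
        sum(centered[rng.randrange(n_sites)] for _ in range(n_sites))
        for _ in range(n_reps)
    ]
    n_extreme = sum(abs(x) >= abs(observed) for x in null_diffs)
    return observed, n_extreme / n_reps
```

The sketch inherits the assumptions listed on the slide: it is only meaningful when both trees were chosen before looking at the data and when sites can be treated as independent and identically distributed.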

30 Tree Confidence The Shimodaira-Hasegawa (SH) test is an alternative test involving bootstrapping to test whether the best tree is better than other trees identified a posteriori. The test statistic in this case is the score difference between the best tree and all other trees to be compared. H0: all trees to be compared are equally good. H1: some or all trees are not equally good explanations of the data.

31 Tree Confidence If there is a lot of homoplasy, trees derived using parsimony could be unreliable. Values are available to evaluate the reliability of parsimony-based trees. Consistency Index (CI) = the minimum number of changes / tree length. For the i-th site in the alignment, $c_i = m_i / s_i$, where $m_i$ is the minimum possible number of substitutions for any conceivable topology and $s_i$ is the minimum number of substitutions required for the topology being considered. The overall CI for the entire tree is $\sum_i m_i / \sum_i s_i$. Homoplasy Index (HI) = 1 – CI. Both will provide an idea of the relative value of the data with regard to the given tree. But random data can give CIs with values between 0.4 and 0.6.
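The two indices follow directly from the per-site step counts. The counts used below are hypothetical values for a three-site alignment, included only to show the arithmetic.

```python
def consistency_index(min_steps: list[int], observed_steps: list[int]) -> float:
    """Overall CI: sum of minimum possible steps over sum of steps on this tree."""
    return sum(min_steps) / sum(observed_steps)


def homoplasy_index(min_steps: list[int], observed_steps: list[int]) -> float:
    """HI = 1 - CI."""
    return 1.0 - consistency_index(min_steps, observed_steps)


# Hypothetical per-site values: m_i (minimum on any tree) and s_i (on the tree considered).
m = [1, 1, 2]
s = [1, 2, 4]

per_site_ci = [mi / si for mi, si in zip(m, s)]  # [1.0, 0.5, 0.5]
overall_ci = consistency_index(m, s)             # 4 / 7
overall_hi = homoplasy_index(m, s)               # 3 / 7
```

Note that the overall CI is a ratio of sums, not the average of the per-site values.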

32 Tree Confidence What can go wrong in phylogenetic analysis? Sampling error: all inferred trees can only represent the sequences used; improper sampling of the taxa will yield incorrect trees; improperly selected sequences will yield incorrect trees. Incorrect evolutionary model: choosing the wrong model of sequence evolution will likely lead to the wrong tree. Evolutionary history: sometimes, despite our best efforts, finding the best tree may just be impossible (rapid radiations, widely differing rates of evolution, extinction).

