Processing & Testing Phylogenetic Trees. Rooting.

Processing & Testing Phylogenetic Trees

Rooting

1. Outgroup Rooting: Based on external information.
2. Midpoint Rooting: Direct a posteriori use of the ultrametricity assumption.
3. Largest-Genetic-Variability-Group Rooting: Indirect a posteriori use of the ultrametricity assumption.

Rooting with outgroup

[Figure: an unrooted tree and the corresponding rooted tree for plant and animal lineages with a bacterial outgroup. Labels: Unrooted tree, Rooted tree, root, Monophyletic group (two groups).]

Midpoint rooting
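As a rough illustration of midpoint rooting, the sketch below finds the longest tip-to-tip path in an unrooted tree and places the root halfway along it. The edge-list representation, the node names, and the function names `farthest` and `midpoint_root` are assumptions made for this example only.

```python
from collections import defaultdict

def farthest(adj, start):
    """Return (node, distance, path) for the node farthest from `start`."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if dist > best[1]:
            best = (node, dist, path)
        for nxt, length in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, dist + length, path + [nxt]))
    return best

def midpoint_root(edges):
    """edges: list of (u, v, branch_length) describing an unrooted tree.
    Returns (u, v, offset): the root lies on edge u-v, `offset` away from u."""
    adj = defaultdict(list)
    lengths = {}
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
        lengths[(u, v)] = lengths[(v, u)] = w
    # Two sweeps find the two most distant tips (the longest path in the tree).
    tip_a, _, _ = farthest(adj, edges[0][0])
    _, longest, path = farthest(adj, tip_a)
    half = longest / 2.0
    walked = 0.0
    for u, v in zip(path, path[1:]):
        w = lengths[(u, v)]
        if walked + w >= half:
            return u, v, half - walked
        walked += w

# Hypothetical tree: tips A, B, C, D; internal nodes x and y.
edges = [("A", "x", 1.0), ("B", "x", 2.0), ("x", "y", 1.0),
         ("C", "y", 2.0), ("D", "y", 6.0)]
root_position = midpoint_root(edges)
```

Here the longest path runs between B and D (length 9), so the root falls on the branch leading to D, 4.5 units from the tip. The placement is only as good as the assumption that the two most distant tips evolved at roughly equal rates.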

Largest variation = Most ancient

Estimating Branch Length

From pairwise distances to branch lengths: maximum likelihood, least squares, etc.
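For the smallest possible case, three taxa joined at a single internal node, the branch lengths follow exactly from the three pairwise distances, and the least-squares fit has zero error. The sketch below shows that case only; the function name and the distance values are made up for illustration. With more taxa there are more distances than branches, and a least-squares or likelihood fit is needed.

```python
def three_taxon_branch_lengths(d_ab, d_ac, d_bc):
    """Branch lengths (a, b, c) of a star tree with tips A, B, C,
    solved from d_ab = a + b, d_ac = a + c, d_bc = b + c."""
    a = (d_ab + d_ac - d_bc) / 2.0
    b = (d_ab + d_bc - d_ac) / 2.0
    c = (d_ac + d_bc - d_ab) / 2.0
    return a, b, c

# Hypothetical pairwise distances (substitutions per site).
a, b, c = three_taxon_branch_lengths(0.30, 0.50, 0.60)
```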

Estimating Divergence Times

Topological comparisons

Penny and Hendy's topological distance ($d_T$)

A commonly used measure of dissimilarity between two tree topologies. The measure is based on tree partitioning.

$d_T = 2c$

where $c$ is the number of partitions resulting in different divisions of the OTUs in the two tree topologies under consideration.
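A minimal sketch of this partition-based distance, assuming trees are written as nested Python tuples of tip names and treated as unrooted. The helper names `tips`, `splits`, and `topological_distance` are illustrative. For two bifurcating trees on the same OTUs, the number of partitions found in exactly one of the trees equals 2c.

```python
def tips(tree):
    """Set of tip names under a node (a tip is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(tips(child) for child in tree))

def splits(tree):
    """Non-trivial partitions of the OTUs induced by the internal branches."""
    all_tips = tips(tree)
    anchor = min(all_tips)  # each partition is stored as the side without this tip
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = tips(node)
        side = all_tips - clade if anchor in clade else clade
        # Skip trivial partitions (a single tip versus all the others).
        if 1 < len(side) < len(all_tips) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found

def topological_distance(tree1, tree2):
    """Number of partitions present in exactly one of the two trees."""
    return len(splits(tree1) ^ splits(tree2))

tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
d_t = topological_distance(tree_1, tree_2)
```

The two example trees share the partition DE | ABC and differ in one partition each, so c = 1 and the distance is 2.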

Trees inferred from the analysis of a particular data set are called fundamental trees, i.e., they summarize the phylogenetic information in a data set.

Sometimes we have many fundamental trees pertaining to the same question. For example, we may have trees derived from different genes for the same taxa, or trees derived through different methods, or different runs in a simulation. In these cases we need to be able to summarize the data.

Consensus trees are trees that summarize the phylogenetic information in a set of fundamental trees.

Strict consensus tree: In a strict consensus tree, all conflicting branching patterns are collapsed into multifurcations.

Majority-rule consensus tree: In an X% majority-rule consensus tree, a branching pattern that occurs with a frequency of X% or more is adopted. When X = 100%, the majority-rule consensus tree is identical to the strict consensus tree.
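The two consensus rules can be sketched as a single frequency filter over groups. This example assumes each fundamental tree has already been reduced to the set of groups (clades) it contains; the function name `consensus_groups` and the toy trees are illustrative.

```python
from collections import Counter

def consensus_groups(trees, threshold=0.5):
    """trees: list of sets of groups, each group a frozenset of OTU names.
    Returns {group: frequency} for groups found in at least `threshold`
    of the trees. threshold=1.0 gives the strict consensus."""
    counts = Counter(group for tree in trees for group in tree)
    n = len(trees)
    return {g: k / n for g, k in counts.items() if k / n >= threshold}

# Three hypothetical fundamental trees for the OTUs A, B, C, D.
trees = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("AB"), frozenset("ABC")},
]
strict = consensus_groups(trees, threshold=1.0)
majority = consensus_groups(trees, threshold=0.6)
```

Only the group AB survives in the strict consensus, while the 60% majority-rule consensus also keeps ABC, which occurs in two of the three trees. Groups kept at a threshold above 50% cannot conflict with one another, which is why majority-rule thresholds are normally set above that value.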

A tree is an evolutionary hypothesis

How do we know that the inferred tree is correct?

Joseph H. Camin (1922-1979)

Assessing tree reliability

Phylogenetic reconstruction is a problem of statistical inference. One must assess the reliability of the inferred phylogeny and its component parts.

Questions:
(1) How reliable is the tree?
(2) Which parts of the tree are reliable?
(3) Is this tree significantly better than another one?

Bootstrapping

A statistical technique that uses intensive random resampling of data to estimate a statistic whose underlying distribution is unknown.

Bootstrapping

Characters are resampled with replacement to create many bootstrap replicate data sets (pseudosamples).
Each bootstrap replicate data set is analyzed.
The frequency of occurrence of a group (bootstrap proportion) is a measure of support for the group.
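The three steps above can be sketched as follows. The alignment is assumed to be a dictionary of equal-length sequences, and `infer_groups` stands for whatever tree-building analysis is applied to each replicate; it is a placeholder supplied by the caller, not a real library function. The toy `closest_pair` analysis exists only to make the example run.

```python
import random
from collections import Counter

def bootstrap_columns(alignment, rng):
    """Resample alignment columns (characters) with replacement."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}

def bootstrap_proportions(alignment, infer_groups, n_replicates=100, seed=0):
    """Percentage of replicates in which each group is recovered."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_replicates):
        replicate = bootstrap_columns(alignment, rng)
        counts.update(set(infer_groups(replicate)))
    return {group: 100.0 * k / n_replicates for group, k in counts.items()}

def closest_pair(alignment):
    """Toy stand-in for a tree-building method: report the single pair
    of taxa with the fewest mismatching sites."""
    names = sorted(alignment)
    pairs = [(sum(x != y for x, y in zip(alignment[a], alignment[b])), a, b)
             for i, a in enumerate(names) for b in names[i + 1:]]
    _, a, b = min(pairs)
    return [frozenset((a, b))]

# Hypothetical four-taxon alignment.
alignment = {
    "t1": "ACGTACGTAC",
    "t2": "ACGTACGTTC",
    "t3": "ACGAACCTTG",
    "t4": "TCGAACCTTG",
}
support = bootstrap_proportions(alignment, closest_pair, n_replicates=200)
```

In a real analysis `infer_groups` would run a full parsimony, distance, or likelihood search on each pseudosample and return every group in the resulting tree.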

Bootstrapping - an example

Ciliate SSUrDNA - parsimony bootstrap

Taxa: Ochromonas (1), Symbiodinium (2), Prorocentrum (3), Loxodes (4), Tracheloraphis (5), Spirostomum (6), Gruberia (7), Euplotes (8), Tetrahymena (9)

Partition Table

123456789   Freq
-----------------
.**......   100.00
...**....   100.00
.....**..   100.00
...****..   100.00
...******    95.50
.......**    84.33
...****.*    11.83
...*****.     3.83
.*******.     2.50
.**....*.     1.00
.**.....*     1.00

[Figure: bootstrap tree for the nine taxa; the node labels that survive are 100, 96, 84, 100.]

Reduction of a phylogenetic tree by the collapsing of internal branches associated with bootstrap values that are lower than a critical value (C). (a) Gene tree for  -tubulin (b) C = 50% (c) C = 90%

Tests for two competing trees

All these tests use the null hypothesis that the differences between two trees (A and B) are no greater than expected from sampling error.

Under the null hypothesis the mean of the differences in parsimony steps at each site is expected to be zero.

[Figure: distribution of differences at each site, centered on 0; values on one side favor tree A, values on the other favor tree B.]

Tests for two competing trees

A parametric test for comparing two trees under the assumption that all nucleotide sites are independent and equivalent.

$D_i$ = difference in the minimum number of substitutions between the two trees at the $i$th informative site.
$D = \sum_i D_i$.
$n$ = number of informative sites.
$V(D)$ = sample variance of $D$.

The null hypothesis, D = 0, is tested with the Student paired t-test with n – 1 degrees of freedom:
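The formula itself does not survive in this transcript. As a hedged sketch, the standard paired t statistic on the per-site differences is $t = \bar{D} / (s_D / \sqrt{n})$, with $\bar{D} = D/n$ and $s_D^2$ the sample variance of the $D_i$. That the slide used exactly this form is an assumption. The function name and the example differences below are made up.

```python
import math

def paired_site_t(diffs):
    """Paired t statistic for per-site differences in substitution counts.
    diffs[i] = (steps on tree A) - (steps on tree B) at informative site i.
    Returns (t, degrees_of_freedom)."""
    n = len(diffs)
    mean = sum(diffs) / n                                   # D / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)     # sample variance of D_i
    t = mean / math.sqrt(var / n)
    return t, n - 1

# Hypothetical differences at ten informative sites.
diffs = [1, 0, 1, -1, 1, 0, 1, 1, 0, 1]
t_stat, dof = paired_site_t(diffs)
```

The resulting t is compared with the Student t distribution with n - 1 degrees of freedom. A value far from zero indicates that one tree requires significantly fewer substitutions than the other.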

Likelihood Ratio Test

Likelihood of Hypothesis 1 = $L_1$
Likelihood of Hypothesis 2 = $L_2$

$\Delta = 2(\ln L_1 - \ln L_2)$

Compare $\Delta$ to a $\chi^2$ distribution or to a simulated distribution.
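A small sketch of the test statistic and its chi-square comparison, restricted to one degree of freedom so that only the standard library is needed. It assumes the two hypotheses are nested and differ by a single free parameter, with Hypothesis 1 the more general one. The log-likelihood values are invented for illustration.

```python
import math

def likelihood_ratio_test_df1(ln_l1, ln_l2):
    """Return (statistic, p_value) for 2 * (ln L1 - ln L2), compared with a
    chi-square distribution with 1 degree of freedom.
    Assumes ln_l1 >= ln_l2 (Hypothesis 1 is the more general model)."""
    stat = 2.0 * (ln_l1 - ln_l2)
    # Upper tail of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods of two nested models.
stat, p_value = likelihood_ratio_test_df1(-1000.0, -1003.0)
```

When the hypotheses are not nested, as is typical when comparing two tree topologies, the chi-square approximation is not appropriate and the statistic is compared with a simulated distribution instead.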

Reliability of Phylogenetic Methods

Phylogenetic methods can also be evaluated in terms of their general performance, particularly their:
consistency - approach the truth with more data
efficiency - how quickly can they handle how much data
robustness - how sensitive to violations of assumptions

Studies of these properties can be analytical or by simulation.

Problems with long branches

With long branches most methods may yield erroneous trees. For example, the maximum-parsimony method tends to cluster long branches together. This phenomenon is called long-branch attraction; the combination of branch lengths under which it occurs is known as the Felsenstein zone.

[Figure: true tree versus wrong tree for four taxa A, B, C, D, with branch lengths p and q, where p >> q.]

Chaperonin Maximum Likelihood Tree (Roger et al. 1998. PNAS 95: 229) Longest branches

Trees: Pectinate (a) versus Symmetrical (b)

Recommendations

Avoid the “Black Box”

Researchers invest considerable resources in producing molecular sequence data. They should also invest the time and effort needed to get the most out of their data. Modern phylogenetic software makes it easy to produce trees from aligned sequences, but phylogenetic inference should not be treated as a “black box.”

Choices are Unavoidable

There are many phylogenetic methods. Thus, the investigator is confronted with unavoidable choices. Not all methods are equally good for all data. An understanding of the basic properties of the various phylogenetic methods is essential for informed choice of method and interpretation of results.

Data are not Perfect

Most data include misleading evidence, and we need to have a cautious attitude to the quality of data and trees. Data may have both systematic biases and unbiased noise that affect our chances of getting the correct tree. Different methods may be more or less sensitive to some problems.

Alignment

The data determine the results. The alignment determines the data. Be aware of alignment artefacts. If using multiple alignment software, explore the sensitivity of the alignment to the parameters used. Eliminate regions that cannot be aligned with confidence.

Models

The data should fit the assumptions of the model. Explore the data for potential biases and deviations from the assumptions of the model.

Choice of Models

Complex models may better approximate the evolution of the sequences and, therefore, might be expected to give more accurate results. More complex models require the estimation of more parameters, each of which is subject to some error. There is a trade-off between more realistic and complex models and their power to discriminate between alternative hypotheses.

Not all methods are good for all problems.
