Processing & Testing Phylogenetic Trees. Rooting.

Processing & Testing Phylogenetic Trees

Rooting

1. Outgroup rooting: based on external information.
2. Midpoint rooting: direct a posteriori use of the ultrametricity assumption.
3. Largest-genetic-variability-group rooting: indirect a posteriori use of the ultrametricity assumption.

Rooting with outgroup
[Figure: an unrooted tree and the corresponding rooted tree. Labels: plant, animal, bacterial outgroup, root, monophyletic group (twice).]

Midpoint rooting
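As a rough illustration of midpoint rooting, the sketch below places the root halfway along the longest tip-to-tip path of an unrooted tree. The tree encoding (a list of edges with lengths), the node names, and the helper names `build_tree`, `path_between` and `midpoint_root` are assumptions made for this example, not part of the slides.

```python
from itertools import combinations

def build_tree(edges):
    """Turn (node, node, length) triples into a symmetric adjacency dict."""
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length
    return adj

def path_between(adj, start, goal):
    """Return the nodes on the unique path from start to goal in a tree."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in adj[node]:
            # A tree has no cycles, so it is enough not to step straight back.
            if len(path) < 2 or nxt != path[-2]:
                stack.append((nxt, path + [nxt]))
    raise ValueError("nodes are not connected")

def midpoint_root(adj):
    """Return (u, v, offset): the root lies on edge u-v, offset away from u."""
    tips = [node for node, nbrs in adj.items() if len(nbrs) == 1]
    best_length, best_path = -1.0, None
    for a, b in combinations(tips, 2):
        path = path_between(adj, a, b)
        length = sum(adj[u][v] for u, v in zip(path, path[1:]))
        if length > best_length:
            best_length, best_path = length, path
    half = best_length / 2
    walked = 0.0
    for u, v in zip(best_path, best_path[1:]):
        if walked + adj[u][v] >= half:
            return (u, v, half - walked)
        walked += adj[u][v]

# Hypothetical four-taxon tree with internal nodes X and Y.
example = build_tree([("A", "X", 1), ("B", "X", 3), ("X", "Y", 1),
                      ("C", "Y", 2), ("D", "Y", 6)])
print(midpoint_root(example))
```

If rates are roughly clock-like (the ultrametricity assumption), the two most divergent tips are about equally far from the root, which is why the midpoint of the path between them is a reasonable guess.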

Largest variation = Most ancient

Estimating Branch Length
From pairwise distances to branch lengths: maximum likelihood, least squares, etc.
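For the smallest possible case, three taxa joined at a single internal node, branch lengths can be read directly from the pairwise distances, and the fit is exact. The sketch below shows only that special case; the function name and the example distances are assumptions for illustration, and larger trees need a fitting method such as the least-squares or maximum-likelihood approaches named above.

```python
def three_taxon_branch_lengths(d_ab, d_ac, d_bc):
    """Branch lengths (a, b, c) from the internal node to taxa A, B and C.

    They solve a + b = d_ab, a + c = d_ac, b + c = d_bc.
    """
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c

# Hypothetical pairwise distances.
print(three_taxon_branch_lengths(3, 5, 6))
```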

Estimating Divergence Times

Topological comparisons

Penny and Hendy's topological distance ($d_T$)
A commonly used measure of dissimilarity between two tree topologies. The measure is based on tree partitioning.
$d_T = 2c$, where $c$ is the number of partitions resulting in different divisions of the OTUs in the two tree topologies under consideration.
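A minimal sketch of this partition-based distance, assuming both trees are bifurcating, share the same OTUs, and are written as nested tuples of OTU names (an encoding chosen for this example). Each internal branch defines a partition of the OTUs. For two such trees, the number of partitions found in one tree but not the other is the same in both directions, so counting the partitions unique to either tree gives $2c$.

```python
def leaves(tree):
    """Set of OTU names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Non-trivial partitions of the OTUs, one per internal branch.

    Each partition is stored as the side that does not contain a fixed
    reference OTU, so that the same partition always looks the same.
    """
    all_taxa = leaves(tree)
    ref = min(all_taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        if ref in side:
            side = all_taxa - side
        # Keep only partitions with at least two OTUs on each side.
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found

def topological_distance(tree1, tree2):
    """Number of partitions present in exactly one of the two trees."""
    return len(splits(tree1) ^ splits(tree2))

# Hypothetical five-OTU trees that differ in one internal branch (c = 1).
t1 = (("A", "B"), "C", ("D", "E"))
t2 = (("A", "C"), "B", ("D", "E"))
print(topological_distance(t1, t2))
```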

Trees inferred from the analysis of a particular data set are called fundamental trees, i.e., they summarize the phylogenetic information in a data set.

Sometimes we have many fundamental trees pertaining to the same question. For example, we may have trees derived from different genes for the same taxa, or trees derived through different methods, or different runs in a simulation. In these cases we need to be able to summarize the data.

Consensus trees are trees that summarize the phylogenetic information in a set of fundamental trees.

Strict consensus tree: In a strict consensus tree, all conflicting branching patterns are collapsed into multifurcations.
Majority-rule consensus trees: In an X% majority-rule consensus tree, a branching pattern that occurs with a frequency of X% or more is adopted. When X = 100%, the majority-rule consensus tree is identical to the strict consensus tree.
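The two consensus rules can be sketched as a frequency count over groups, assuming each fundamental tree is represented simply as the set of groups (sets of taxa) it contains. That representation, the function name and the example trees are assumptions for illustration. The sketch returns the groups to keep and does not assemble them into a drawn tree.

```python
from collections import Counter

def consensus_groups(trees, threshold):
    """Groups whose frequency across the trees is at least `threshold`.

    `threshold` is a fraction: 1.0 gives the strict consensus. Values
    above 0.5 guarantee that the retained groups are mutually compatible.
    """
    counts = Counter(group for tree in trees for group in tree)
    n = len(trees)
    return {group for group, count in counts.items() if count / n >= threshold}

# Three hypothetical fundamental trees on taxa A-E.
AB, CD, DE = frozenset("AB"), frozenset("CD"), frozenset("DE")
trees = [{AB, DE}, {AB, CD}, {AB, DE}]

print(consensus_groups(trees, 1.0))   # strict: only groups found in every tree
print(consensus_groups(trees, 0.6))   # 60% majority rule
```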

A tree is an evolutionary hypothesis

How do we know that the inferred tree is correct?

Joseph H. Camin ( )

Assessing tree reliability
Phylogenetic reconstruction is a problem of statistical inference. One must assess the reliability of the inferred phylogeny and its component parts.
Questions:
(1) How reliable is the tree?
(2) Which parts of the tree are reliable?
(3) Is this tree significantly better than another one?

Bootstrapping
A statistical technique that uses intensive random resampling of data to estimate a statistic whose underlying distribution is unknown.

Bootstrapping
- Characters are resampled with replacement to create many bootstrap replicate data sets (pseudosamples).
- Each bootstrap replicate data set is analyzed.
- The frequency of occurrence of a group (its bootstrap proportion) is a measure of support for the group.
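The resampling procedure can be sketched as follows, assuming the alignment is a dict mapping taxon names to equal-length sequences. `infer_groups` is a hypothetical stand-in for whatever tree-building method is applied to each replicate; it is assumed to return the groups found in the resulting tree. Nothing here is tied to a particular phylogenetics package.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns (characters) with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}

def bootstrap_proportions(alignment, infer_groups, n_replicates=100, seed=0):
    """Fraction of replicates in which each group is recovered."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_replicates):
        replicate = bootstrap_replicate(alignment, rng)
        for group in infer_groups(replicate):
            counts[group] = counts.get(group, 0) + 1
    return {group: count / n_replicates for group, count in counts.items()}

# Hypothetical toy alignment; a real analysis would use many more sites.
toy = {"t1": "ACGTAC", "t2": "ACGTTC", "t3": "AGGTTA"}
print(bootstrap_replicate(toy, random.Random(1)))
```

Every replicate has the same taxa and the same number of characters as the original data; only the mix of columns changes.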

Bootstrapping - an example
Ciliate SSUrDNA - parsimony bootstrap
[Figure: bootstrap tree with a partition table of group frequencies for Ochromonas (1), Symbiodinium (2), Prorocentrum (3), Euplotes (8), Tetrahymena (9), Loxodes (4), Tracheloraphis (5), Spirostomum (6), Gruberia (7).]

Reduction of a phylogenetic tree by collapsing internal branches associated with bootstrap values lower than a critical value (C).
[Figure panels: (a) gene tree for tubulin (Greek-letter prefix lost in extraction); (b) C = 50%; (c) C = 90%.]

Tests for two competing trees
All these tests use the null hypothesis that the differences between two trees (A and B) are no greater than expected from sampling error.

Under the null hypothesis, the mean of the differences in parsimony steps at each site is expected to be zero.
[Figure: distribution of differences at each site, with 0 marked and the two sides labeled "Favoring tree A" and "Favoring tree B".]

Tests for two competing trees
A parametric test for comparing two trees under the assumption that all nucleotide sites are independent and equivalent.
$D_i$ = difference in the minimum number of substitutions between the two trees at the $i$th informative site.
$D = \sum_i D_i$.
$n$ = number of informative sites.
$V(D)$ = sample variance of $D$.

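The null hypothesis, $D = 0$, is tested with the Student paired t-test with $n - 1$ degrees of freedom. In its standard form, with $s^2$ denoting the sample variance of the per-site differences $D_i$:

$t = \frac{D/n}{\sqrt{s^2/n}}$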
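A minimal sketch of the paired t-test on per-site differences, assuming the standard form of the statistic (the mean difference divided by its standard error). The function name and the example differences are hypothetical. The resulting t would then be compared with a Student t distribution on n - 1 degrees of freedom.

```python
import math

def paired_site_t(differences):
    """Return (t, degrees of freedom) for per-site step differences.

    `differences` holds one value per informative site: the number of
    steps on tree A minus the number of steps on tree B.
    """
    n = len(differences)
    mean = sum(differences) / n
    # Unbiased sample variance of the per-site differences.
    variance = sum((d - mean) ** 2 for d in differences) / (n - 1)
    t = mean / math.sqrt(variance / n)
    return t, n - 1

# Hypothetical differences at four informative sites.
print(paired_site_t([1, 1, 1, -1]))
```

The sketch assumes at least two sites and that the differences are not all identical, since the variance would otherwise be zero.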

Likelihood Ratio Test
- Likelihood of Hypothesis 1 = $L_1$
- Likelihood of Hypothesis 2 = $L_2$
- Test statistic: $2(\ln L_1 - \ln L_2)$
- Compare the statistic to a $\chi^2$ distribution or to a simulated distribution.
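As a small worked sketch, the code below computes the statistic from two log-likelihoods and compares it with a chi-square distribution. It assumes the two hypotheses are nested and differ by exactly one free parameter, so that one degree of freedom applies. When that assumption does not hold, a simulated distribution is the alternative mentioned above. Function names and log-likelihood values are made up for illustration.

```python
import math

def lrt_statistic(ln_l1, ln_l2):
    """2 (ln L1 - ln L2), with hypothesis 1 the more general one."""
    return 2.0 * (ln_l1 - ln_l2)

def chi2_p_value_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))

# Hypothetical log-likelihoods.
stat = lrt_statistic(-1000.0, -1002.0)
print(stat, chi2_p_value_df1(stat))
```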

Reliability of Phylogenetic Methods
Phylogenetic methods can also be evaluated in terms of their general performance, particularly their:
- consistency: whether they approach the truth with more data
- efficiency: how quickly they can handle how much data
- robustness: how sensitive they are to violations of assumptions
Studies of these properties can be analytical or by simulation.

Problems with long branches
With long branches, most methods may yield erroneous trees. For example, the maximum-parsimony method tends to cluster long branches together. This phenomenon is called long-branch attraction, and the conditions under which it occurs are known as the Felsenstein zone.

[Figure: true tree versus wrong tree for taxa A, B, C and D, with branch lengths p and q, where p >> q.]

[Figure: chaperonin maximum likelihood tree (Roger et al., PNAS 95: 229), with the longest branches marked.]

Trees: Pectinate (a) versus Symmetrical (b)

Recommendations

Avoid the “Black Box”
- Researchers invest considerable resources in producing molecular sequence data.
- They should also invest the time and effort needed to get the most out of their data.
- Modern phylogenetic software makes it easy to produce trees from aligned sequences, but phylogenetic inference should not be treated as a “black box.”

Choices are Unavoidable
- There are many phylogenetic methods.
- Thus, the investigator is confronted with unavoidable choices.
- Not all methods are equally good for all data.
- An understanding of the basic properties of the various phylogenetic methods is essential for informed choice of method and interpretation of results.

Data are not Perfect
- Most data include misleading evidence, and we need to take a cautious attitude toward the quality of data and trees.
- Data may have both systematic biases and unbiased noise that affect our chances of getting the correct tree.
- Different methods may be more or less sensitive to some problems.

Alignment
- The data determine the results.
- The alignment determines the data.
- Be aware of alignment artefacts.
- If using multiple alignment software, explore the sensitivity of the alignment to the parameters used.
- Eliminate regions that cannot be aligned with confidence.

Models
- The data should fit the assumptions of the model.
- Explore the data for potential biases and deviations from the assumptions of the model.

Choice of Models
- Complex models may better approximate the evolution of the sequences and, therefore, might be expected to give more accurate results.
- More complex models require the estimation of more parameters, each of which is subject to some error.
- There is a trade-off between more realistic and complex models and their power to discriminate between alternative hypotheses.

Not all methods are good for all problems.