Assessing Phylogenetic Hypotheses and Phylogenetic Data

Presentation transcript:

Assessing Phylogenetic Hypotheses and Phylogenetic Data
- We use numerical phylogenetic methods because most data includes potentially misleading evidence of relationships
- We should not be content with constructing phylogenetic hypotheses but should also assess what ‘confidence’ we can place in our hypotheses
- This is not always simple! (but do not despair!)

Assessing Data Quality
- We expect (or hope) our data will be well structured and contain strong phylogenetic signal
- We can test this using randomisation tests of explicit null hypotheses
- The behaviour of some measure of the quality of our real data is contrasted with that of comparable but phylogenetically uninformative data produced by randomisation of the data

Random Permutation
- Random permutation reduces any correlation among characters to that expected by chance alone
- It preserves the number of taxa, characters and character states in each character (and the theoretical maximum and minimum tree lengths)

Original structured data with strong correlations among characters:
‘TAXA’   ‘CHARACTERS’
R        PRPRPRPRP
A        EAEAEAEAE
N        RNRNRNRNR
D        MDMDMDMDM
O        UOUOUOUOU
M        TMTMTMTMT
L        ELELELELE
Y        DYDYDYDYD

Randomly permuted data with correlation among characters due to chance alone:
‘TAXA’   ‘CHARACTERS’
R        PNUDERTOU
A        EREAPLEAD
N        RMRMMADNP
D        MLTREYMDR
O        UDEYUDEYM
M        TOMOTOULT
L        EYDNDMPME
Y        DAPLRNRRE
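The permutation step can be sketched in a few lines of Python. This is a minimal illustration, assuming the matrix is held as one string per taxon; the function name `permute_characters` is hypothetical and not taken from PAUP* or any other package. States are shuffled among taxa independently within each character, so each character keeps its own states while correlations among characters are left to chance.

```python
import random

def permute_characters(matrix, rng=None):
    """Shuffle states among taxa independently within each character (column).

    matrix: list of equal-length strings, one row per taxon.
    Returns a new list of strings with the same dimensions.
    """
    rng = rng or random.Random()
    n_taxa = len(matrix)
    n_chars = len(matrix[0])
    columns = []
    for j in range(n_chars):
        column = [row[j] for row in matrix]
        rng.shuffle(column)          # permute this character only
        columns.append(column)
    # Reassemble rows (taxa) from the permuted columns
    return ["".join(columns[j][i] for j in range(n_chars)) for i in range(n_taxa)]

# The structured example matrix from the slide (taxa R, A, N, D, O, M, L, Y)
structured = ["PRPRPRPRP", "EAEAEAEAE", "RNRNRNRNR", "MDMDMDMDM",
              "UOUOUOUOU", "TMTMTMTMT", "ELELELELE", "DYDYDYDYD"]
permuted = permute_characters(structured, random.Random(1))
for taxon, row in zip("RANDOMLY", permuted):
    print(taxon, row)
```

Because every column is only reordered, the number of taxa, the number of characters and the states present in each character are unchanged.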

Matrix Randomisation Tests
- Compare some measure of data quality (hierarchical structure) for the real and many randomly permuted data sets
- This allows us to define a test statistic for the null hypothesis that the real data are no better structured than randomly permuted and phylogenetically uninformative data
- A permutation tail probability (PTP) is the proportion of data sets with as good or better measure of quality than the real data
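As a rough sketch, the PTP can be computed from the score of the real data and the scores of the permuted data sets. The function name `ptp` and the example tree lengths are hypothetical. Counting the real data set as one member of the reference set is an assumption made here (it means 99 permutations give a smallest possible PTP of 0.01); other conventions exist.

```python
def ptp(real_score, permuted_scores, smaller_is_better=True):
    """Permutation tail probability for one measure of data quality.

    real_score: measure for the real data (e.g. length of the shortest tree).
    permuted_scores: the same measure for each randomly permuted data set.
    The real data set is counted among the data sets (an assumed convention).
    """
    if smaller_is_better:
        as_good = sum(1 for s in permuted_scores if s <= real_score)
    else:
        as_good = sum(1 for s in permuted_scores if s >= real_score)
    return (as_good + 1) / (len(permuted_scores) + 1)

# Hypothetical tree lengths: real data versus four permuted data sets
print(ptp(618, [792, 801, 780, 795]))
```

With tree length as the measure, shorter is better, so `smaller_is_better` stays at its default; for a likelihood score it would be set to `False`.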

Structure of Randomisation Tests
- Reject the null hypothesis if, for example, fewer than 5% of random permutations have as good or better a measure than the real data
[Figure: frequency distribution of the measure of data quality (e.g. tree length, ML, pairwise incompatibilities) for the permuted data sets, running from GOOD to BAD, with a 95% cutoff. Real data beyond the cutoff at the good end PASS the test (reject null hypothesis); otherwise they FAIL.]

Matrix Randomisation Tests
Measures of data quality include:
1. Tree length for most parsimonious trees - the shorter the tree length the better the data (PAUP*)
2. Any other objective function (Likelihood, Least Squares Fit, etc)
3. Numbers of pairwise incompatibilities between characters (pairs of incongruent characters) - the fewer character conflicts the better the data
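The third measure is simple enough to sketch directly. The example below assumes binary, unordered characters and uses the standard pairwise compatibility criterion (two binary characters conflict when all four state combinations occur among the taxa); the function names are hypothetical, and multistate characters would need a different test.

```python
from itertools import combinations

def incompatible(char_a, char_b):
    """Two binary characters are incompatible if all four state pairs occur."""
    return len(set(zip(char_a, char_b))) == 4

def count_incompatibilities(matrix):
    """Count incompatible character pairs in a matrix of binary characters.

    matrix: list of equal-length strings, one row per taxon.
    """
    characters = list(zip(*matrix))   # transpose rows (taxa) into columns
    return sum(incompatible(a, b) for a, b in combinations(characters, 2))

# Small two-state matrix (rows are taxa A, B, C, D and an outgroup)
matrix = ["RRYYYYYY", "RRYYYYYY", "YYYYYRRR", "YYRRRRRR", "RRRRRRRR"]
print(count_incompatibilities(matrix))
```

The same function can be applied to the real matrix and to each permuted matrix, and the resulting counts compared in a randomisation test (fewer conflicts is better).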

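Matrix Randomisation Tests - example (Ciliate SSUrDNA)
- Real data: 1 MPT, L = 618, CI = 0.696, RI = [value missing], PTP = 0.01, PC-PTP = [value missing] - significantly non-random
- Randomly permuted data: 3 MPTs, L = 792, CI = 0.543, RI = [value missing], PTP = 0.68, PC-PTP = [value missing] - not significantly different from random
- Min = 430, Max = 927
[Figure: trees for the real and the randomly permuted data (strict consensus)]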

Matrix Randomisation Tests - use and limitations
- Can detect very poor data that provides no good basis for phylogenetic inferences (throw it away!)
- However, only very little structure may be needed to reject the null hypothesis (passing the test ≠ great data)
- Doesn’t indicate the location of this structure (more discerning tests are possible)

Skewness of Tree Length Distributions
- Studies with random and thus phylogenetically uninformative data showed that the distribution of tree lengths tends to be normal
- In contrast, phylogenetically informative data is expected to have a strongly skewed distribution with few shortest trees and few trees nearly as short
[Figure: two histograms of number of trees against tree length, each marking the shortest tree]

Skewness of Tree Length Distributions
- Measured with the g1 statistic (PAUP*)
- Skewness of tree length distributions could be used as a measure of data quality in a randomisation test
- Significance cut-offs for data sets of up to eight taxa have been published based on randomly generated data (rather than randomly permuted data)
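For illustration, g1 can be computed from a list of tree lengths with the usual moment formula (third central moment divided by the second central moment raised to the power 3/2). This is a sketch under that assumption; whether PAUP* applies any small-sample correction is not stated in the slides, and the tree lengths below are hypothetical.

```python
def g1(tree_lengths):
    """Moment coefficient of skewness: m3 / m2**1.5 (central moments, divisor n)."""
    n = len(tree_lengths)
    mean = sum(tree_lengths) / n
    m2 = sum((x - mean) ** 2 for x in tree_lengths) / n
    m3 = sum((x - mean) ** 3 for x in tree_lengths) / n
    return m3 / m2 ** 1.5

# Hypothetical lengths: most trees are long, a few are much shorter
lengths = [30, 30, 30, 29, 29, 28, 25, 20]
print(g1(lengths))   # negative: the tail lies on the side of the short trees
```

A symmetric distribution gives a value near zero, while a distribution with a tail of few short trees gives a negative value.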

Skewness - example
[Figure: tree length distributions for the randomly permuted data and for the real Ciliate SSUrDNA data; the g1 values are missing]

Assessing Phylogenetic Hypotheses - groups on trees
- Several methods have been proposed that attach numerical values to internal branches in trees that are intended to provide some measure of the strength of support for those branches and the corresponding groups
- These methods include:
  - character resampling methods - the bootstrap and jackknife
  - comparisons with suboptimal trees - decay analyses
  - additional randomisation tests

Bootstrapping (non-parametric)
- Bootstrapping is a modern statistical technique that uses computer intensive random resampling of data to determine sampling error or confidence intervals for some estimated parameter

Bootstrapping
- Characters are resampled with replacement to create many bootstrap replicate data sets
- Each bootstrap replicate data set is analysed (e.g. with parsimony, distance, ML)
- Agreement among the resulting trees is summarized with a majority-rule consensus tree
- Frequency of occurrence of groups, bootstrap proportions (BPs), is a measure of support for those groups
- Additional information is given in partition tables

Bootstrapping
Original data matrix:
Taxa    Characters
A       R R Y Y Y Y Y Y
B       R R Y Y Y Y Y Y
C       Y Y Y Y Y R R R
D       Y Y R R R R R R
Outgp   R R R R R R R R

Resampled data matrix:
Taxa    Characters
A       R R R Y Y Y Y Y
B       R R R Y Y Y Y Y
C       Y Y Y Y Y R R R
D       Y Y Y R R R R R
Outgp   R R R R R R R R

- Randomly resample characters from the original data with replacement to build many bootstrap replicate data sets of the same size as the original, and analyse each replicate data set
- Summarise the results of multiple analyses with a majority-rule consensus tree
- Bootstrap proportions (BPs) are the frequencies with which groups are encountered in analyses of replicate data sets
[Figure: trees of A, B, C, D and the outgroup for the original and resampled data, and a majority-rule consensus tree with bootstrap proportions of 96% and 66%]
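The two mechanical parts of the procedure, resampling characters and tallying groups, can be sketched as follows. The tree search itself is deliberately left out: the sketch assumes each replicate analysis has already been reduced to a set of groups (clades), and the function names and the example clade sets are hypothetical.

```python
import random
from collections import Counter

def bootstrap_replicate(matrix, rng):
    """Resample characters (columns) with replacement, keeping the matrix size."""
    n_chars = len(matrix[0])
    picks = [rng.randrange(n_chars) for _ in range(n_chars)]
    return ["".join(row[j] for j in picks) for row in matrix]

def bootstrap_proportions(replicate_clades):
    """Frequency with which each group occurs across the replicate trees.

    replicate_clades: one collection of frozensets (groups) per replicate tree.
    """
    counts = Counter(clade for clades in replicate_clades for clade in set(clades))
    n = len(replicate_clades)
    return {clade: count / n for clade, count in counts.items()}

# Original matrix from the slide (taxa A, B, C, D, Outgp)
matrix = ["RRYYYYYY", "RRYYYYYY", "YYYYYRRR", "YYRRRRRR", "RRRRRRRR"]
print(bootstrap_replicate(matrix, random.Random(3)))

# Hypothetical groups found in three replicate analyses
trees = [{frozenset("AB"), frozenset("ABC")},
         {frozenset("AB")},
         {frozenset("AB"), frozenset("ABC")}]
print(bootstrap_proportions(trees))
```

In a real analysis each replicate matrix would be passed to a tree-building program, and groups with a proportion above 0.5 would appear in the majority-rule consensus tree.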

Bootstrapping - an example
Ciliate SSUrDNA - parsimony bootstrap
[Figure: majority-rule consensus tree with partition table (group frequencies). Taxa: Ochromonas (1), Symbiodinium (2), Prorocentrum (3), Euplotes (8), Tetrahymena (9), Loxodes (4), Tracheloraphis (5), Spirostomum (6), Gruberia (7)]

Bootstrapping - random data
Randomly permuted data - parsimony bootstrap
[Figure: majority-rule consensus tree (with minority components) and partition table of group frequencies]

Bootstrap - interpretation
- Bootstrapping was introduced as a way of establishing confidence intervals for phylogenies
- This interpretation of bootstrap proportions (BPs) depends on assuming that the original data is a random (fair) sample from independent and identically distributed data
- However, several things complicate this interpretation:
  - Perhaps the assumptions are unreasonable, making any statistical interpretation of BPs invalid
  - Some theoretical work indicates that BPs are very conservative and may underestimate confidence intervals; the problem increases with numbers of taxa
  - BPs can be high for incongruent relationships in separate analyses and can therefore be misleading (misleading data -> misleading BPs)
  - With parsimony, BPs may be highly affected by inclusion or exclusion of only a few characters

Bootstrap - interpretation
- Bootstrapping is a very valuable and widely used technique - it (or some suitable alternative) is demanded by some journals, but it may require a pragmatic interpretation:
- BPs depend on two aspects of the support for a group - the numbers of characters supporting a group and the level of support for incongruent groups
- BPs thus provide an index of the relative support for groups provided by a set of data under whatever interpretation of the data (method of analysis) is used

Bootstrap - interpretation
- High BPs (e.g. > 85%) are indicative of strong ‘signal’ in the data
- Provided we have no evidence of strong misleading signal (e.g. base composition biases, great differences in branch lengths), high BPs are likely to reflect strong phylogenetic signal
- Low BPs need not mean the relationship is false, only that it is poorly supported
- Bootstrapping can be viewed as a way of exploring the robustness of phylogenetic inferences to perturbations in the balance of supporting and conflicting evidence for groups

Jackknifing
- Jackknifing is very similar to bootstrapping and differs only in the character resampling strategy
- Some proportion of characters (e.g. 50%) are randomly selected and deleted
- Replicate data sets are analysed and the results summarised with a majority-rule consensus tree
- Jackknifing and bootstrapping tend to produce broadly similar results and have similar interpretations
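The jackknife resampling step differs from the bootstrap only in that characters are deleted rather than drawn with replacement. A minimal sketch, with a hypothetical function name and the deletion proportion exposed as a parameter (50% by default, as in the example above):

```python
import random

def jackknife_replicate(matrix, rng, delete_fraction=0.5):
    """Randomly delete a fraction of characters (columns) without replacement."""
    n_chars = len(matrix[0])
    n_keep = n_chars - int(round(n_chars * delete_fraction))
    kept = sorted(rng.sample(range(n_chars), n_keep))   # keep original order
    return ["".join(row[j] for j in kept) for row in matrix]

matrix = ["RRYYYYYY", "RRYYYYYY", "YYYYYRRR", "YYRRRRRR", "RRRRRRRR"]
print(jackknife_replicate(matrix, random.Random(5)))
```

Each replicate is smaller than the original matrix, and no character can appear more than once.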

Decay analysis
- In parsimony analysis, a way to assess support for a group is to see if the group also occurs in slightly less parsimonious trees
- The length difference between the shortest trees including the group and the shortest trees that exclude the group (the extra steps required to overturn a group) is the decay index or Bremer support
- Can be extended to any optimality criterion and to other relationships
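The arithmetic behind the decay index is a single subtraction once the relevant tree lengths are known. The sketch below assumes those lengths have already been obtained from separate searches (for example with and without a reverse topological constraint); the function name and the lengths are hypothetical.

```python
def decay_index(lengths_with_group, lengths_without_group):
    """Extra steps needed to overturn a group.

    lengths_with_group: lengths of trees that include the group.
    lengths_without_group: lengths of trees that lack the group.
    """
    return min(lengths_without_group) - min(lengths_with_group)

# Hypothetical tree lengths from constrained and unconstrained searches
print(decay_index([618, 620], [621, 625]))
```

A positive value means the group is present in all of the shortest trees; finding the shortest trees on each side is the expensive part and is not shown.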

Decay analysis - example
[Figure: trees with decay indices for the Ciliate SSUrDNA data and for the randomly permuted data. Taxa: Ochromonas, Symbiodinium, Prorocentrum, Loxodes, Tracheloraphis, Spirostomum, Gruberia, Euplotes, Tetrahymena]

Decay analyses - in practice
Decay indices for each clade can be determined by:
- Saving increasingly less parsimonious trees and producing corresponding strict consensus trees until the consensus is completely unresolved
- Analyses using reverse topological constraints to determine shortest trees that lack each clade
- The Autodecay or TreeRot programs (in conjunction with PAUP)

Decay indices - interpretation
- Generally, the higher the decay index the better the relative support for a group
- Like BPs, decay indices may be misleading if the data is misleading
- Unlike BPs, decay indices are not scaled (0-100) and it is less clear what is an acceptable decay index
- Magnitudes of decay indices and BPs are generally correlated (i.e. they tend to agree)
- Only groups found in all most parsimonious trees have decay indices > zero

Trees are typically complex - they can be thought of as sets of less complex relationships

Extending Support Measures
- The same measures (BP, JP & DI) that are used for clades/splits can also be determined for triplets and quartets
- This provides a lot more information because there are more triplets/quartets than there are clades
- Furthermore...

The Decay Theorem
- The DI of an hypothesis of relationships is equal to the lowest DI of the resolved triplets that the hypothesis entails
- This applies equally to BPs and JPs as well as DIs
- Thus a phylogenetic chain is no stronger than its weakest link!
- And measures of clade support may give a very incomplete picture of the distribution of support

Bootstrapping with Reduced Consensus
[Figure: trees on taxa A to J and X]

Pinpointing Uncertainty

Leaf Stability
- Leaf stability is the average of supports of the triplets/quartets containing the leaf
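The definition translates directly into an average over triplets. The sketch below assumes triplet supports are already available (for example as bootstrap proportions) in a dictionary keyed by the three leaf names; the function name and the support values are hypothetical.

```python
def leaf_stability(leaf, triplet_support):
    """Average support of all triplets that contain the given leaf.

    triplet_support: dict mapping frozenset of three leaf names -> support value.
    """
    values = [s for triplet, s in triplet_support.items() if leaf in triplet]
    return sum(values) / len(values)

# Hypothetical triplet supports (e.g. bootstrap proportions)
support = {
    frozenset({"A", "B", "C"}): 0.9,
    frozenset({"A", "B", "D"}): 0.5,
    frozenset({"B", "C", "D"}): 0.7,
}
print(leaf_stability("A", support))
```

Leaves whose triplets are consistently poorly supported get low values, which helps pinpoint the taxa responsible for uncertainty.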

PTP tests of groups
- A number of randomisation tests have been proposed for evaluating particular groups rather than entire data matrices by testing null hypotheses regarding the level of support they receive from the data
- Randomisation can be of the data or the group
- These methods have not become widely used, both because they are not readily performed and because their properties are still under investigation
- One type, the topology dependent PTP tests, are included in PAUP* but have serious problems

Comparing competing phylogenetic hypotheses - tests of two (or more) trees
- Particularly useful techniques are those designed to allow evaluation of alternative phylogenetic hypotheses
- Several such tests allow us to determine if one tree is statistically significantly worse than another:
  - Winning sites, Templeton, Kishino-Hasegawa, parametric bootstrapping (SOWH)
  - Shimodaira-Hasegawa, Approximately Unbiased

Tests of two trees
- Tests are of the null hypothesis that the differences between two trees (A and B) are no greater than expected from sampling error
- The simplest ‘winning sites’ test sums the number of sites supporting tree A over tree B and vice versa (those having fewer steps on, and better fit to, one of the trees)
- Under the null hypothesis characters are equally likely to support tree A or tree B, and a binomial distribution gives the probability of the observed difference in numbers of winning sites
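The binomial calculation for the winning sites test can be sketched as below. The function name is hypothetical, the test is taken to be two-tailed (an assumption; a one-tailed version would drop the doubling), and sites that fit both trees equally well are assumed to have been excluded before counting.

```python
from math import comb

def winning_sites_p(wins_a, wins_b):
    """Two-tailed binomial probability of a split at least this uneven.

    Under the null hypothesis each informative site favours tree A or
    tree B with probability 0.5.
    """
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: 8 sites favour tree A, 2 favour tree B
print(winning_sites_p(8, 2))
```

With these counts the probability is a little above 0.1, so the null hypothesis of equal support would not be rejected at the 5% level.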

The Templeton test
- Templeton’s test is a non-parametric Wilcoxon signed ranks test of the differences in fits of characters to two trees
- It is like the ‘winning sites’ test but also takes into account the magnitudes of differences in the support of characters for the two trees
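A signed ranks calculation of this kind can be sketched with the standard library alone. The version below is an approximation under stated assumptions: it uses the large-sample normal approximation without tie or continuity corrections, so with the many tied differences typical of parsimony step counts it is only indicative. The function name and the per-character step counts are hypothetical.

```python
from math import sqrt, erf

def signed_rank_test(steps_a, steps_b):
    """Wilcoxon signed ranks test on per-character steps for two trees.

    Returns (W+, z, two-sided p) using a normal approximation.
    """
    diffs = [a - b for a, b in zip(steps_a, steps_b) if a != b]  # drop zeros
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                      # assign average ranks to tied |differences|
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 1 - erf(abs(z) / sqrt(2))     # two-sided tail of the standard normal
    return w_plus, z, p

# Hypothetical steps per character on trees A and B
print(signed_rank_test([1, 2, 1, 3, 2], [1, 1, 2, 1, 2]))
```

Characters that fit both trees equally well contribute nothing, and larger differences receive larger ranks, which is what distinguishes this test from the winning sites test.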

Templeton’s test - an example
- Recent studies of the relationships of turtles using morphological data have produced very different results, with turtles grouping either within the parareptiles (H1) or within the diapsids (H2), the result depending on the morphologist
- This suggests there may be:
  - problems with the data
  - special problems with turtles
  - weak support for turtle relationships
- Parsimony analysis of the most recent data favoured H2
- However, analyses constrained by H1 produced trees that required only 3 extra steps (<1% tree length)
- The Templeton test was used to evaluate the trees and showed that the slightly longer H1 tree found in the constrained analyses was not significantly worse than the unconstrained H2 tree
- The morphological data do not allow choice between H1 and H2
[Figure: tree of Seymouriidae, Diadectomorpha, Synapsida, Parareptilia, Captorhinidae, Paleothyris, Claudiosaurus, Younginiformes, Archosauromorpha, Lepidosauriformes, Placodus, Eosauropterygia and Araeoscelidia, with positions marked 1 and 2]

Kishino-Hasegawa test
- The Kishino-Hasegawa test is similar in using differences in the support provided by individual sites for two trees to determine if the overall differences between the trees are significantly greater than expected from random sampling error
- It is a parametric test that depends on assumptions that the characters are independent and identically distributed (the same assumptions underlying the statistical interpretation of bootstrapping)
- It can be used with parsimony and maximum likelihood - implemented in PHYLIP and PAUP*

Kishino-Hasegawa test
- Under the null hypothesis the mean of the differences in parsimony steps or likelihoods for each site is expected to be zero, and the distribution normal
- From the observed differences we calculate a standard deviation
- If the difference between trees (tree lengths or likelihoods) is attributable to sampling error, then characters will randomly support tree A or B and the total difference will be close to zero
- The observed difference is significantly different from zero if it is more than 1.96 standard deviations from zero
- This allows us to reject the null hypothesis and declare the sub-optimal tree significantly worse than the optimal tree (p < 0.05)
[Figure: distribution of step/likelihood differences at each site, centred on the expected mean of 0, with sites favouring tree A on one side and sites favouring tree B on the other]
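The calculation described above can be sketched from per-site scores (parsimony steps or log-likelihoods) on the two trees. The sketch assumes a commonly used formulation in which the standard deviation of the total difference is estimated as the square root of n/(n-1) times the sum of squared deviations of the site differences from their mean; the function name, the 1.96 cut-off parameter and the example scores are illustrative rather than taken from PHYLIP or PAUP*.

```python
from math import sqrt

def kh_test(site_scores_a, site_scores_b, critical=1.96):
    """Normal-approximation test on the summed per-site difference.

    Returns (total difference, SD of the total, significant?).
    """
    diffs = [a - b for a, b in zip(site_scores_a, site_scores_b)]
    n = len(diffs)
    total = sum(diffs)
    mean = total / n
    # Variance of the sum, estimated from the spread of the site differences
    var_total = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    sd_total = sqrt(var_total)
    return total, sd_total, abs(total) > critical * sd_total

# Hypothetical per-site steps on trees A and B
steps_a = [1, 1, 1, 1, 2]
steps_b = [1, 1, 1, 1, 1]
print(kh_test(steps_a, steps_b))
```

A result reported as "difference ± SD" is judged significant when the absolute difference exceeds roughly 1.96 times the SD.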

Kishino-Hasegawa test - example
Ciliate SSUrDNA maximum likelihood tree
[Figure: maximum likelihood tree with the anaerobic ciliates with hydrogenosomes marked. Taxa: Ochromonas, Symbiodinium, Prorocentrum, Sarcocystis, Theileria, Plagiopyla n, Plagiopyla f, Trimyema c, Trimyema s, Cyclidium p, Cyclidium g, Cyclidium l, Glaucoma, Colpodinium, Tetrahymena, Paramecium, Discophrya, Trithigmostoma, Opisthonecta, Colpoda, Dasytrichia, Entodinium, Spathidium, Loxophylum, Homalozoon, Metopus c, Metopus p, Stylonychia, Onychodromous, Oxytrichia, Loxodes, Tracheloraphis, Spirostomum, Gruberia, Blepharisma]
- Parsimonious character optimization of the presence and absence of hydrogenosomes suggests four separate origins of hydrogenosomes within the ciliates
- Questions:
  - how reliable is this result?
  - in particular, how well supported is the idea of multiple origins?
  - how many origins can we confidently infer?

Kishino-Hasegawa test - example
[Figure: two topological constraint trees for the same ciliate taxa]
- Parsimony analyses with topological constraints found the shortest trees forcing hydrogenosomal ciliate lineages together, thereby reducing the number of separate origins of hydrogenosomes
- Each of the constrained parsimony trees was compared to the ML tree, and the Kishino-Hasegawa test was used to determine which of these trees were significantly worse than the ML tree

Kishino-Hasegawa test
Test summary and results (simplified)

No. origins   Constraint tree    Extra steps   Difference and SD   Significantly worse?
4             ML
              MP                 -             -13 ± 18            No
3             (cp,pt)                          ± 22                No
3             (cp,rc)                          ± 40                Yes
3             (cp,m)                           ± 36                Yes
3             (pt,rc)                          ± 38                Yes
3             (pt,m)                           ± 29                Yes
3             (rc,m)                           ± 34                Yes
2             (pt,cp,rc)                       ± 40                Yes
2             (pt,rc,m)                        ± 43                Yes
2             (pt,cp,m)                        ± 37                Yes
2             (cp,rc,m)                        ± 49                Yes
2             (pt,cp)(rc,m)                    ± 39                Yes
2             (pt,m)(rc,cp)                    ± 48                Yes
2             (pt,rc)(cp,m)                    ± 50                Yes
1             (pt,cp,m,rc)                     ± 49                Yes

- Constrained analyses were used to find the most parsimonious trees with fewer than four separate origins of hydrogenosomes.
- These trees were tested against the ML tree.
- Trees with 2 or 1 origin are all significantly worse than the ML tree.
- We can confidently conclude that there have been at least three separate origins of hydrogenosomes within the sampled ciliates.
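The "difference and SD" column can be read as follows. This is a minimal sketch assuming the usual normal-approximation form of the KH test: per-site score differences between two trees are summed, their standard deviation is estimated from the site-to-site variation, and the difference is called significant when it exceeds about 1.96 standard deviations. The function names and the toy per-site scores are hypothetical; only the values -13 and 18 come from the MP row of the table.

```python
import math


def kh_difference_and_sd(site_scores_a, site_scores_b):
    """Total score difference between two trees and its estimated SD.

    Scores are per-site values (e.g. log-likelihoods) for the same sites.
    """
    diffs = [a - b for a, b in zip(site_scores_a, site_scores_b)]
    n = len(diffs)
    delta = sum(diffs)
    mean = delta / n
    # SD of the summed difference, from the variance of per-site differences
    sd = math.sqrt(n / (n - 1) * sum((d - mean) ** 2 for d in diffs))
    return delta, sd


def significantly_different(delta, sd, critical=1.96):
    """Two-sided normal-approximation test at roughly the 5% level."""
    return abs(delta) > critical * sd


# Toy per-site log-likelihoods for two trees (hypothetical numbers).
tree_a = [-2.0, -1.5, -3.0, -2.5]
tree_b = [-2.5, -1.0, -3.5, -3.5]
delta, sd = kh_difference_and_sd(tree_a, tree_b)

# MP row of the table: difference -13 with SD 18.
mp_row_significant = significantly_different(-13, 18)
print(delta, sd, mp_row_significant)
```

For the MP row, 13 is well inside 1.96 × 18, which agrees with the "No" in the last column.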

Problems with tests of trees
- To be statistically valid, the Kishino-Hasegawa test should be applied to trees that are selected a priori.
- However, most applications have used trees selected a posteriori on the basis of the phylogenetic analysis.
- Where we test the ‘best’ tree against some other tree, the KH test will be biased towards rejection of the null hypothesis.
- Only if the null hypothesis is not rejected will the result be safe from some unknown level of bias.
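The selection bias can be seen in a small simulation. The sketch below makes strong simplifying assumptions that are not from the slides: every candidate tree is equally good in expectation, and per-site scores are independent standard normal draws (real per-site likelihoods for competing trees are strongly correlated). It compares how often a nominal 5% KH-style test rejects when the two trees are fixed in advance and when one of them is chosen because it scored best.

```python
import math
import random


def kh_z(scores_a, scores_b):
    """Summed per-site score difference divided by its estimated SD."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    delta = sum(diffs)
    mean = delta / n
    sd = math.sqrt(n / (n - 1) * sum((d - mean) ** 2 for d in diffs))
    return delta / sd


def rejection_rates(n_trees=20, n_sites=100, n_reps=300, seed=1):
    """Return (a priori rate, a posteriori rate) of rejecting a true null."""
    rng = random.Random(seed)
    a_priori = a_posteriori = 0
    for _ in range(n_reps):
        # All trees are equally good: scores differ only by noise.
        scores = [[rng.gauss(0.0, 1.0) for _ in range(n_sites)]
                  for _ in range(n_trees)]
        # A priori: trees 0 and 1 were chosen before seeing any scores.
        if abs(kh_z(scores[0], scores[1])) > 1.96:
            a_priori += 1
        # A posteriori: the best-scoring tree is tested against another tree.
        totals = [sum(s) for s in scores]
        best = max(range(n_trees), key=totals.__getitem__)
        other = 0 if best != 0 else 1
        if abs(kh_z(scores[best], scores[other])) > 1.96:
            a_posteriori += 1
    return a_priori / n_reps, a_posteriori / n_reps


prior_rate, posterior_rate = rejection_rates()
print(prior_rate, posterior_rate)
```

Under these assumptions the a priori comparison rejects at close to the nominal rate, while the comparison involving the tree picked for having the best score rejects more often, even though no tree is truly better.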

Problems with tests of trees
- The Shimodaira-Hasegawa test is a more statistically correct technique for testing trees selected a posteriori and is implemented in PAUP*.
- However, it requires selection of a set of plausible topologies, and it is hard to give practical advice on how to choose them.
- Parametric bootstrapping (SOWH test) is an alternative, but it is harder to implement and may suffer from an opposite bias due to model misspecification.
- The Approximately Unbiased test (implemented in CONSEL) may be the best option currently.


Taxonomic Congruence
- Trees inferred from different data sets (different genes, morphology) should agree if they are accurate.
- Congruence between trees is best explained by their accuracy.
- Congruence can be investigated using consensus (and supertree) methods.
- Incongruence requires further work to explain or resolve disagreements.
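One simple way to examine congruence is to compare the clades found in each tree: the clades shared by all trees are those that appear in a strict consensus, and the clades found in only one tree measure the conflict. The sketch below assumes rooted trees written as nested tuples; the two five-taxon "gene trees" are hypothetical.

```python
def collect_clades(tree):
    """Return (leaves under this node, all clades at or below this node)."""
    if isinstance(tree, str):                # a leaf contributes no clade
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = collect_clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def nontrivial_clades(tree):
    """Clades of a rooted tree, excluding the clade that holds every taxon."""
    all_leaves, found = collect_clades(tree)
    found.discard(all_leaves)
    return found


# Two hypothetical trees for the same taxa, e.g. from different genes.
gene_tree_1 = ((("A", "B"), "C"), ("D", "E"))
gene_tree_2 = ((("A", "B"), "D"), ("C", "E"))

clades_1 = nontrivial_clades(gene_tree_1)
clades_2 = nontrivial_clades(gene_tree_2)

shared = clades_1 & clades_2        # clades kept by a strict consensus
conflicting = clades_1 ^ clades_2   # clades present in only one tree
print(sorted(sorted(c) for c in shared), len(conflicting))
```

Here the two trees agree only on the grouping of A with B; the remaining clades are the disagreements that would need to be explained or resolved.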

Reliability of Phylogenetic Methods
- Phylogenetic methods (e.g. parsimony, distance, ML) can also be evaluated in terms of their general performance, particularly their:
  - consistency: whether they approach the truth with more data
  - efficiency: how quickly they do so (how much data is needed)
  - robustness: sensitivity to violations of assumptions
- Studies of these properties can be analytical or by simulation.

Reliability of Phylogenetic Methods
- There have been many arguments that ML methods are best because they have desirable statistical properties, such as consistency.
- However, ML does not always have these properties:
  - if the model is wrong or inadequate (fortunately this is testable to some extent)
  - the properties have not yet been demonstrated for complex inference problems such as phylogenetic trees

Reliability of Phylogenetic Methods
- “Simulations show that ML methods generally outperform distance and parsimony methods over a broad range of realistic conditions” (Whelan et al., Trends in Genetics 17)
- But…
- Most simulations cover a narrow range of very (unrealistically) simple conditions:
  - few taxa (typically just four!)
  - few parameters (standard models: JC, K2P, etc.)

Reliability of Phylogenetic Methods
Simulations with four taxa have shown:
- Model-based methods (distance and maximum likelihood) perform well when the model is accurate (not surprising!).
- Violations of assumptions can lead to inconsistency for all methods (a Felsenstein zone) when branch lengths or rates are highly unequal.
- Maximum likelihood methods are quite robust to violations of model assumptions.
- Weighting can improve the performance of parsimony (reduce the size of the Felsenstein zone).
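The Felsenstein zone for parsimony can be made concrete with a four-taxon calculation. The sketch below assumes a symmetric two-state model, in the style of Felsenstein's classic example, on the true tree ((A,B),(C,D)): the branches to A and C have change probability `p` (long), and the branches to B and D and the internal branch have change probability `q` (short). These parameter names and values are illustrative assumptions. With unlimited data, parsimony picks the topology whose supporting site pattern is most frequent, so comparing the expected pattern frequencies shows whether it converges on the true tree.

```python
from itertools import product


def transition(x, y, r):
    """Probability of state y at the end of a branch that starts in state x,
    where r is the probability that the two ends differ."""
    return r if x != y else 1.0 - r


def pattern_prob(a, b, c, d, p, q):
    """Probability of tip states (a, b, c, d) on the true tree ((A,B),(C,D)).

    u and v are the two internal nodes; A and C sit on long branches (p),
    B, D and the internal branch are short (q). The root state is uniform.
    """
    total = 0.0
    for u, v in product((0, 1), repeat=2):
        total += (0.5
                  * transition(u, a, p) * transition(u, b, q)
                  * transition(u, v, q)
                  * transition(v, c, p) * transition(v, d, q))
    return total


def parsimony_support(p, q):
    """Expected frequency of the informative pattern favouring each topology.

    The factor 2 counts the mirror-image pattern (states swapped), which has
    the same probability under this symmetric model.
    """
    return {
        "AB|CD": 2 * pattern_prob(0, 0, 1, 1, p, q),   # true tree
        "AC|BD": 2 * pattern_prob(0, 1, 0, 1, p, q),   # long branches together
        "AD|BC": 2 * pattern_prob(0, 1, 1, 0, p, q),
    }


unequal = parsimony_support(p=0.4, q=0.05)   # very unequal branch lengths
mild = parsimony_support(p=0.1, q=0.05)      # mildly unequal branch lengths
print(max(unequal, key=unequal.get), max(mild, key=mild.get))
```

With the very unequal branches, the pattern grouping the two long branches (AC|BD) is expected more often than the pattern supporting the true tree, so adding data only strengthens the wrong answer; with the milder values the true grouping AB|CD is favoured.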

Reliability of Phylogenetic Methods
However:
- Generalising from four-taxon simulations may be dangerous, as conclusions may not hold for more complex cases.
- A few large-scale simulations (many taxa) have suggested that parsimony can be very accurate and efficient.
- Most methods are accurate in correctly recovering known phylogenies produced in laboratory studies.
More realistic simulations are needed if they are to help in choosing and understanding methods.