Presentation is loading. Please wait.

Presentation is loading. Please wait.

Characterizing the Phylogenetic Tree-Search Problem Daniel Money And Simon Whelan ~Anusha Sura.

Similar presentations


Presentation on theme: "Characterizing the Phylogenetic Tree-Search Problem Daniel Money And Simon Whelan ~Anusha Sura."— Presentation transcript:

1 Characterizing the Phylogenetic Tree-Search Problem Daniel Money And Simon Whelan ~Anusha Sura

2 What is this? A phylogenetic tree or evolutionary tree is a branching diagram or "tree" showing the inferred evolutionary relationships among various biological species or other entities.

3 Why do we need to know about this? Motivation The problem of explaining the evolutionary history of today's species How do species relate to one another in terms of common ancestors Approaches People who are curious to know the origin of Fossil Records Phylogenetic Trees is the best head start

4 Brief Overview In Phylogenetic trees Leaves represent present day species Interior nodes represent hypothesized ancestors

5 Key Idea Phylogenetic studies frequently use some form of optimality criterion to assess how well specific tree topologies describe the observed sequence data. Optimality methods typically work by finding the best scoring tree for a sequence alignment, which is taken to be the best estimate of the evolutionary relationships between a set of sequences

6 Why not Hill-climbing? Hill-climbing can produce many different optimal trees depending where the algorithm starts. Only the optimal tree with the highest likelihood, the global optimum, has the appealing properties of the ML estimator Tree-search is NP-hard.

7 Goals? Learn about the factors that affect the topography of tree-space, and to provide pragmatic suggestions that will aid phylogenetic inference with existing methods Investigate how the difficulty of tree-search differs between alignments and use correlation analyses to identify predictors for the difficulty of tree-search. Also examine whether optima share any properties, such as their relative size or their location in tree-space.

8 Data Sets Phylogenomic data sets consisting of 8-, 20- and 40-taxa. Phylogenomic datasets consist of a series of genes taken from the same set of taxa, leading us to expect a single tree relating the taxa, and enabling us to compare results between genes to highlight similarities and differences in tree-space caused by alignment properties.

9 Observation from the Data Set 8 taxa – ungapped nucleotide sequence alignments (106 genes) 20 and 40 taxa – gapped amino acid sequence alignments (146 genes) Genes with more than 10% unknown character or gap are excluded 40 taxa – 20 different genes 20 taxa – 52 different genes

10 Exhaustive tree search(8 taxa data set) (i) assign a start tree to the current tree object (ii) use a rearrangement operation to define the neighborhood of the current tree, (iii) calculate likelihoods for the trees in the neighborhood and assign the highest scoring as the new current tree, and (iv) if no improvement in likelihood occurs, then tree-search reaches an optimum and stops, otherwise go to (ii)

11 Number and size of optima 1.Assume the number of different optima to be a suitable proxy for the difficulty of tree-search problem 2.The number of optima identified during tree-search under a specific rearrangement strategy from all possible starting points in tree- space 3.The size of an optimum is defined as the number of start trees that reach that specific optimum when performing tree-search

12 Statistical comparison of optima Assess whether each local optimum is significantly different to the global optimum using the SH test (Shimodaira and Hasegawa 1999), implemented in PAML (Yang 1997).

13 Optima Clustering Compute the mean NNI distance between ‘n’ identified optima and assess the significance of any clustering observed using a bootstrap approach Take 1000 draws from the null distribution of no significant clustering by sampling n randomly chosen trees, with the condition that none is a neighbor to any other, and computing their mean NNI distance

14 Correlating the number of optima with gene and tree properties Spearman correlation coefficients between specific properties and the number of optima identified I.tree length, defined as the sum of all branches of the globally optimal tree II.alignment length, defined as the length of the gapped sequence alignment associated with a gene III.the difference in likelihood between the fully resolved globally optimal tree and the unresolved star-tree (Δ ln L ^ )

15 Parameter distributions Investigate the estimates of tree length and the α parameter of the Γ- distributed rates-across-sites model (by examining their distribution across the 10,395 bifurcating trees relating 8 taxa for all 106 genes) Calculate the mean rank of the global optimum and the skewness of the distribution for each gene

16 Heuristic Tree-Search Using Sampling for the 20- and 40-taxa Phylogenomic and Bench marking Data Set Number and size of optima 1.The relative numbers of optima discovered by phyml and RAxML from the randomly sampled start trees should be indicative of the difficulty of tree-search 2.The relative size of the sampled optima are calculated as the number of start trees that lead to them

17 Statistical comparison of optima The best tree identified during tree-search is taken as a proxy for the global optimum Compare the 95% confidence interval of this best tree with the other optima, identified using the SH test

18 Optima Clustering Robinson–Foulds (RF) distance metric 1.The average RF distance between the ‘n’ sampled optima is calculated 2.a bootstrap procedure used to assign P values of clustering by comparing the observed distance to distribution of average RF- distances between n randomly sampled trees

19 Correlating the number of optima with gene and tree properties Ignore minor variations in gene tree topology that can result from (e.g.,) incomplete lineage sorting or model misspecification Spearman correlation coefficients are calculated between the observed number of optima and gene length, tree length, and likelihood difference between the star-tree and the best optimum identified

20 Parameter distributions The distribution of tree length and rates-across-sites parameter, α, across trees are approximated by taking estimates from random start trees, for both the simple and complex model

21 Exhaustive Analysis of Eight-taxa Yeast Phylogenomic Data Set The global optimum is frequently the largest optimum 1. There is a strong tendency for the global optimum to be larger than other less good optima 2.The global optimum is on average 2.37 times larger than expected under JC, and 2.16 times larger than expected under GTR+Γ

22 Comparison of ordered rank with average size

23 Statistical comparison of optima Of the 91 genes with multiple optima under GTR+Γ, 28 have local optima that the SH test finds significantly different from the global optima. When averaged across all 91 genes, 14.6% of the locally optima found are significantly different to the global optima

24 Clustering of optima The likelihood decreases as the NNI distance from the global optimum increases, but the slope is less steep bootstrap analyses show that for the overwhelming majority of genes, the mean NNI distance between optima was less than expected by chance and under JC (GTR+Γ), 49/92(50/91) genes display significant clustering

25 Representation of tree-space for the genes YBR198C and YLR389C

26 Correlation between number of optima and data properties The number of optima in a gene is correlated with the value of Δln L at the global optimum The number of optima compared with the value of Δln L across all genes under JC and GTR+Γ, and we find there are significant, but imperfect, correlations between these variables

27 Correlations between Δ ln Lˆ and the number of optima per gene

28 Correlations between gene properties and the number of optima

29 Parameter distributions Globally optimal trees tend to have relatively high estimates of α from Γ-distributed rates-across-sites and low estimates of tree length There is positive skew for α, with the majority of trees having low parameter estimates, in contrast to the high estimate in the globally optimal tree Negative skew for tree length, with the majority of trees having longer estimates

30 Conclusions Major differences in the topography of tree search. The global optimum tends to have the greatest number of trees attracted to it. Model choice affects tree-space. The difference in log likelihood between a well-resolved topology and the star topology provides a proxy for the phylogenetic information in an alignment. NNI tree-search performs poorly on real data No single program is likely to yield the best tree estimate

31 References Morrison D.A. 2007. Increasing the efficiency of searches for the maximum likelihood tree in a phylogenetic analysis of up to 150 nucleotide sequences. Felsenstein J. 2003. Inferring phylogenies. Sunderland (MA): Sinauer Associates. Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Shimodaira H., Hasegawa M. 1999. Multiple comparisons of loglikelihoods with applications to phylogenetic inference. Morell V. 1996. TreeBASE: the roots of phylogeny. Guindon S., Gascuel O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood.

32

33


Download ppt "Characterizing the Phylogenetic Tree-Search Problem Daniel Money And Simon Whelan ~Anusha Sura."

Similar presentations


Ads by Google