Characterizing the Phylogenetic Tree-Search Problem Daniel Money And Simon Whelan ~Anusha Sura.



What is this?
A phylogenetic tree, or evolutionary tree, is a branching diagram showing the inferred evolutionary relationships among various biological species or other entities.

Why do we need to know about this?
Motivation: the problem of explaining the evolutionary history of today's species, and how species relate to one another in terms of common ancestors.
Approaches: fossil records and phylogenetic trees. For people curious about origins, phylogenetic trees are the best head start.

Brief Overview
In phylogenetic trees, leaves represent present-day species and interior nodes represent hypothesized ancestors.

Key Idea
Phylogenetic studies frequently use some form of optimality criterion to assess how well specific tree topologies describe the observed sequence data.
Optimality methods typically work by finding the best-scoring tree for a sequence alignment, which is taken to be the best estimate of the evolutionary relationships between a set of sequences.

Why not Hill-climbing?
Hill-climbing can produce many different locally optimal trees, depending on where the algorithm starts.
Only the optimal tree with the highest likelihood, the global optimum, has the appealing properties of the ML estimator.
Tree-search is NP-hard.

Goals?
Learn about the factors that affect the topography of tree-space, and provide pragmatic suggestions that will aid phylogenetic inference with existing methods.
Investigate how the difficulty of tree-search differs between alignments, and use correlation analyses to identify predictors of that difficulty.
Examine whether optima share any properties, such as their relative size or their location in tree-space.

Data Sets
Phylogenomic data sets consisting of 8, 20, and 40 taxa.
Phylogenomic data sets consist of a series of genes taken from the same set of taxa, leading us to expect a single tree relating the taxa, and enabling us to compare results between genes to highlight similarities and differences in tree-space caused by alignment properties.

Observation from the Data Set
8 taxa: ungapped nucleotide sequence alignments (106 genes)
20 and 40 taxa: gapped amino acid sequence alignments (146 genes)
Genes with more than 10% unknown characters or gaps are excluded
40 taxa: 20 different genes
20 taxa: 52 different genes

Exhaustive tree search (8-taxa data set)
(i) assign a start tree to the current tree object
(ii) use a rearrangement operation to define the neighborhood of the current tree
(iii) calculate likelihoods for the trees in the neighborhood and assign the highest scoring as the new current tree
(iv) if no improvement in likelihood occurs, then tree-search has reached an optimum and stops; otherwise go to (ii)
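The four steps above can be sketched as a generic greedy search. This is a minimal illustration, not the authors' implementation: `hill_climb`, `toy_neighbors`, `toy_score`, and the `TOY_LNL` landscape are hypothetical names, and integers on a line stand in for tree topologies so the example stays runnable without a likelihood engine.

```python
from typing import Callable, Iterable, TypeVar

Tree = TypeVar("Tree")


def hill_climb(start: Tree,
               neighbors: Callable[[Tree], Iterable[Tree]],
               score: Callable[[Tree], float]) -> Tree:
    """Greedy tree-search following steps (i) to (iv)."""
    current = start                               # (i) start tree
    current_score = score(current)
    while True:
        candidates = list(neighbors(current))     # (ii) neighborhood
        if not candidates:
            return current
        best = max(candidates, key=score)         # (iii) best neighbor
        best_score = score(best)
        if best_score <= current_score:           # (iv) no improvement
            return current
        current, current_score = best, best_score


# Toy stand-in for tree-space (an assumption for illustration only):
# positions 0..9 play the role of trees, TOY_LNL their log-likelihoods.
TOY_LNL = [1, 3, 2, 0, 4, 6, 5, 2, 7, 1]


def toy_neighbors(i: int) -> list:
    """Adjacent positions, standing in for one rearrangement move."""
    return [j for j in (i - 1, i + 1) if 0 <= j < len(TOY_LNL)]


def toy_score(i: int) -> float:
    return TOY_LNL[i]


print(hill_climb(3, toy_neighbors, toy_score))  # prints 5
```

Starting from position 3 the search stops at position 5, a local optimum, even though position 8 scores higher. This is the behavior that makes the choice of start tree matter.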

Number and size of optima
1. Assume the number of different optima to be a suitable proxy for the difficulty of the tree-search problem
2. The number of optima is counted by running tree-search under a specific rearrangement strategy from all possible starting points in tree-space
3. The size of an optimum is defined as the number of start trees that reach that specific optimum when performing tree-search
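The two definitions above amount to tallying where each start tree ends up. The sketch below assumes a toy one-dimensional landscape (`SCORES`) and a hypothetical `climb` function in place of a real likelihood search; only the counting logic in `optimum_sizes` reflects the definitions on the slide.

```python
from collections import Counter

# Toy landscape (illustrative assumption): index = "tree", value = log-likelihood.
SCORES = [1, 3, 2, 0, 4, 6, 5, 2, 7, 1]


def climb(i: int) -> int:
    """Greedy search on the toy landscape; returns the optimum reached."""
    while True:
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < len(SCORES)]
        best = max(nbrs, key=SCORES.__getitem__)
        if SCORES[best] <= SCORES[i]:
            return i
        i = best


def optimum_sizes(starts, search) -> Counter:
    """Map each optimum to its size: the number of start trees reaching it."""
    return Counter(search(s) for s in starts)


sizes = optimum_sizes(range(len(SCORES)), climb)
print(len(sizes))   # number of optima
print(dict(sizes))  # size of each optimum
```

In this toy case there are three optima, and the global optimum (position 8) is not the largest one, so the tendency reported for real data is an empirical finding rather than a property of the definitions.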

Statistical comparison of optima
Assess whether each local optimum is significantly different from the global optimum using the SH test (Shimodaira and Hasegawa 1999), as implemented in PAML (Yang 1997).

Optima Clustering
Compute the mean NNI distance between the n identified optima, and assess the significance of any clustering observed using a bootstrap approach.
Take 1000 draws from the null distribution of no significant clustering by sampling n randomly chosen trees, with the condition that none is a neighbor of any other, and computing their mean NNI distance.
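The resampling procedure described above might look roughly like the following. It is a sketch under stated assumptions: `clustering_p_value` and `mean_pairwise` are hypothetical names, the distance function is supplied by the caller (an NNI distance in the study, a plain absolute difference in the toy call), and the P value is taken as the fraction of null draws whose mean distance is no larger than the observed one.

```python
import random
from itertools import combinations


def mean_pairwise(items, dist) -> float:
    """Mean distance over all unordered pairs (needs at least two items)."""
    pairs = list(combinations(items, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def clustering_p_value(optima, tree_space, dist, draws=1000, seed=0):
    """Compare the observed mean distance with a null of random trees."""
    rng = random.Random(seed)
    observed = mean_pairwise(optima, dist)
    hits = 0
    for _ in range(draws):
        # Redraw until no two sampled trees are neighbors (distance 1).
        while True:
            sample = rng.sample(tree_space, len(optima))
            if all(dist(a, b) > 1 for a, b in combinations(sample, 2)):
                break
        if mean_pairwise(sample, dist) <= observed:
            hits += 1
    return observed, hits / draws


# Toy call (illustrative assumption): integers stand in for trees.
observed, p = clustering_p_value(
    optima=[10, 12, 14],
    tree_space=range(100),
    dist=lambda a, b: abs(a - b),
)
print(observed, p)
```

A small P value means the optima sit closer together than randomly chosen, non-neighboring trees would.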

Correlating the number of optima with gene and tree properties
Spearman correlation coefficients between specific properties and the number of optima identified:
I. tree length, defined as the sum of all branches of the globally optimal tree
II. alignment length, defined as the length of the gapped sequence alignment associated with a gene
III. the difference in likelihood between the fully resolved globally optimal tree and the unresolved star-tree (Δ ln L̂)
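For readers unfamiliar with the statistic, Spearman's coefficient is the Pearson correlation of the ranks. The sketch below is a plain-Python version with average ranks for ties; the function names and the per-gene numbers are made up for illustration and are not results from the study.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Invented per-gene values, for illustration only.
delta_lnl = [120.0, 85.5, 40.2, 300.1, 15.0]
n_optima = [3, 5, 9, 1, 12]
print(spearman(delta_lnl, n_optima))
```

Because only ranks are used, the coefficient captures any monotonic relationship, not just a linear one.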

Parameter distributions
Investigate the estimates of tree length and of the α parameter of the Γ-distributed rates-across-sites model by examining their distribution across the 10,395 bifurcating trees relating 8 taxa, for all 106 genes.
Calculate the mean rank of the global optimum and the skewness of the distribution for each gene.
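The figure of 10,395 follows from the standard count of unrooted bifurcating trees on n labeled taxa, (2n-5)!!. The sketch below checks that count and adds a simple moment-based skewness; both function names are hypothetical, and the skewness formula (third central moment over the 1.5 power of the second) is one common convention, assumed here rather than taken from the slides.

```python
from math import prod


def num_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees: (2n-5)!! = 3 * 5 * ... * (2n-5)."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return prod(range(3, 2 * n_taxa - 4, 2))


def skewness(values) -> float:
    """Moment-based skewness of a list of parameter estimates."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


print(num_unrooted_trees(8))  # prints 10395
```

A positive skewness means most trees give low estimates with a tail of high ones; a negative value means the reverse.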

Heuristic Tree-Search Using Sampling for the 20- and 40-taxa Phylogenomic and Benchmarking Data Sets
Number and size of optima
1. The relative numbers of optima discovered by PhyML and RAxML from the randomly sampled start trees should be indicative of the difficulty of tree-search
2. The relative sizes of the sampled optima are calculated as the number of start trees that lead to them

Statistical comparison of optima
The best tree identified during tree-search is taken as a proxy for the global optimum.
The other optima are compared against the 95% confidence interval of this best tree, identified using the SH test.

Optima Clustering
Robinson–Foulds (RF) distance metric
1. The average RF distance between the n sampled optima is calculated
2. A bootstrap procedure is used to assign P values of clustering by comparing the observed distance to the distribution of average RF distances between n randomly sampled trees
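As a reminder of what the RF metric measures, the sketch below computes it from the non-trivial bipartitions (splits) of two unrooted trees on the same taxa. The representation is an assumption made for brevity: each tree is given directly as a collection of splits, each split as the set of taxa on one side. Here the distance is the raw count of splits found in one tree but not the other; some programs report half this value or a normalized version.

```python
def canonical_splits(splits, taxa) -> set:
    """Store each split as the side that excludes a fixed reference taxon."""
    all_taxa = frozenset(taxa)
    ref = min(all_taxa)  # arbitrary but consistent reference taxon
    result = set()
    for side in splits:
        side = frozenset(side)
        if ref in side:
            side = all_taxa - side
        result.add(side)
    return result


def rf_distance(splits_a, splits_b, taxa) -> int:
    """Robinson-Foulds distance: size of the symmetric difference of splits."""
    a = canonical_splits(splits_a, taxa)
    b = canonical_splits(splits_b, taxa)
    return len(a ^ b)


# Toy trees on five taxa (illustrative assumption).
TAXA = ["A", "B", "C", "D", "E"]
tree1 = [{"A", "B"}, {"D", "E"}]  # ((A,B),C,(D,E))
tree2 = [{"A", "C"}, {"D", "E"}]  # ((A,C),B,(D,E))
print(rf_distance(tree1, tree2, TAXA))  # prints 2
```

Averaging `rf_distance` over all pairs of sampled optima gives the observed statistic in step 1; the same average over randomly sampled trees gives the null distribution in step 2.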

Correlating the number of optima with gene and tree properties
Ignore minor variations in gene tree topology that can result from, for example, incomplete lineage sorting or model misspecification.
Spearman correlation coefficients are calculated between the observed number of optima and gene length, tree length, and the likelihood difference between the star-tree and the best optimum identified.

Parameter distributions
The distributions of tree length and of the rates-across-sites parameter, α, across trees are approximated by taking estimates from random start trees, for both the simple and the complex model.

Exhaustive Analysis of Eight-taxa Yeast Phylogenomic Data Set
The global optimum is frequently the largest optimum
1. There is a strong tendency for the global optimum to be larger than other, poorer optima
2. The global optimum is on average 2.37 times larger than expected under JC, and 2.16 times larger than expected under GTR+Γ

Comparison of ordered rank with average size

Statistical comparison of optima
Of the 91 genes with multiple optima under GTR+Γ, 28 have local optima that the SH test finds significantly different from the global optimum.
When averaged across all 91 genes, 14.6% of the local optima found are significantly different from the global optimum.

Clustering of optima
The likelihood decreases as the NNI distance from the global optimum increases, but the slope is less steep.
Bootstrap analyses show that for the overwhelming majority of genes, the mean NNI distance between optima is less than expected by chance.
Under JC, 49/92 genes display significant clustering; under GTR+Γ, 50/91 genes do.

Representation of tree-space for the genes YBR198C and YLR389C

Correlation between number of optima and data properties
The number of optima in a gene is correlated with the value of Δ ln L̂ at the global optimum.
Comparing the number of optima with the value of Δ ln L̂ across all genes under JC and GTR+Γ, we find significant but imperfect correlations between these variables.

Correlations between Δ ln L̂ and the number of optima per gene

Correlations between gene properties and the number of optima

Parameter distributions
Globally optimal trees tend to have relatively high estimates of α from Γ-distributed rates-across-sites and low estimates of tree length.
There is positive skew for α, with the majority of trees having low parameter estimates, in contrast to the high estimate in the globally optimal tree.
There is negative skew for tree length, with the majority of trees having longer estimates.

Conclusions
Major differences in the topography of tree search.
The global optimum tends to have the greatest number of trees attracted to it.
Model choice affects tree-space.
The difference in log likelihood between a well-resolved topology and the star topology provides a proxy for the phylogenetic information in an alignment.
NNI tree-search performs poorly on real data.
No single program is likely to yield the best tree estimate.

References
Morrison D.A. Increasing the efficiency of searches for the maximum likelihood tree in a phylogenetic analysis of up to 150 nucleotide sequences.
Felsenstein J. Inferring phylogenies. Sunderland (MA): Sinauer Associates.
Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models.
Shimodaira H., Hasegawa M. Multiple comparisons of log-likelihoods with applications to phylogenetic inference.
Morell V. TreeBASE: the roots of phylogeny.
Guindon S., Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood.