Computational and mathematical challenges involved in very large-scale phylogenetics Tandy Warnow The University of Texas at Austin.



How did life evolve on earth? An international effort to understand how life evolved on earth. Biomedical applications: drug design, protein structure and function prediction, biodiversity. Courtesy of the Tree of Life project.

Steps in a phylogenetic analysis: (1) Gather data. (2) Align sequences. (3) Reconstruct a phylogeny on the multiple alignment, often obtaining a large number of trees. (4) Compute a consensus (or otherwise estimate the reliable components of the evolutionary history). (5) Perform post-tree analyses.

Reconstructing the “Tree” of Life. Handling large datasets: millions of species, and most problems are NP-hard. Evolution is complex: reticulate evolution; indels, genome rearrangements, etc.; gene tree/species tree differences.

The CIPRES Project (Cyber-Infrastructure for Phylogenetic Research). The US National Science Foundation funds this project, which has the following major components. ALGORITHMS and SOFTWARE: scaling to millions of sequences (open source, freely distributed). MATHEMATICS/PROBABILITY/STATISTICS: obtaining better mathematical theory under complex models of evolution. DATABASES: producing new database technology for structured data, to enable scientific discoveries. SIMULATIONS: the first million-taxon simulation under realistically complex models. OUTREACH: museum partners, K-12, general scientific public. PORTAL: available to all researchers. See for more about CIPRES.

This talk. Part 1: Overview of some interesting mathematical issues. Part 2: Better heuristics for NP-hard optimization problems. Part 3: New methods for simultaneous estimation of trees and alignments. Part 4: Related research and open problems.

Part 1: Mathematical issues under stochastic models

DNA Sequence Evolution. [Figure: DNA sequences such as AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, and AGCACTT evolving down a tree, with a time axis marked -3 mil yrs, -2 mil yrs, -1 mil yrs, and today.]

Markov models of single site evolution Simplest (Jukes-Cantor): The model tree is a pair (T,{e,p(e)}), where T is a rooted binary tree, and p(e) is the probability of a substitution on the edge e. The state at the root is random. If a site changes on an edge, it changes with equal probability to each of the remaining states. The evolutionary process is Markovian. More complex models (such as the General Markov model) are also considered, often with little change to the theory.
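The Jukes-Cantor description above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary-based tree encoding, the node names, and the function names `evolve_site` and `simulate` are hypothetical choices made here, not part of the talk. Each edge is identified by its child node, and `p[child]` plays the role of p(e).

```python
import random

STATES = "ACGT"

def evolve_site(state, p, rng):
    """With probability p, substitute to one of the 3 other states, chosen uniformly."""
    if rng.random() < p:
        return rng.choice([s for s in STATES if s != state])
    return state

def simulate(tree, p, n_sites, rng):
    """Evolve n_sites independent sites down a rooted tree (hypothetical encoding).

    tree: dict mapping each internal node to its list of children; the root is "root".
    p: dict mapping each non-root node to the substitution probability on the
       edge from its parent (the p(e) of the model tree).
    Returns a dict mapping each leaf to its sequence.
    """
    # The state at the root is random, as in the model description.
    seqs = {"root": [rng.choice(STATES) for _ in range(n_sites)]}
    stack = ["root"]
    while stack:
        node = stack.pop()
        for child in tree.get(node, []):
            # Markov property: the child depends only on its parent's state.
            seqs[child] = [evolve_site(s, p[child], rng) for s in seqs[node]]
            stack.append(child)
    return {n: "".join(s) for n, s in seqs.items() if n not in tree}

# Example: a four-leaf tree with the same substitution probability on every edge.
tree = {"root": ["x", "y"], "x": ["A", "B"], "y": ["C", "D"]}
p = {node: 0.1 for node in ["x", "y", "A", "B", "C", "D"]}
leaves = simulate(tree, p, 10, random.Random(0))
```

Sites are simulated independently here, which matches the single-site Markov model; rate variation between sites is not modelled in this sketch.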

Modelling variation between characters: Rates-across-sites. If a site (i.e., character) evolves twice as fast as another on one edge, it evolves twice as fast on every edge. The distribution of the rates is typically assumed to be gamma. [Figure: trees on leaves A, B, C, D.]
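A small sketch may make the rates-across-sites assumption concrete. It assumes a gamma distribution with mean 1 (shape `alpha`, scale `1/alpha`) and uses the standard Jukes-Cantor formula for the probability that the two endpoints of an edge differ, given a branch length in expected substitutions per site. The function names are hypothetical, and the branch-length parameterization is an assumption of this sketch rather than something stated on the slide.

```python
import math
import random

def gamma_site_rates(n_sites, alpha, rng):
    """Draw one rate per site from a gamma distribution with mean 1 (assumed parameterization)."""
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

def jc_diff_prob(branch_length, rate=1.0):
    """Jukes-Cantor probability that the endpoints of an edge differ at a site.

    The site's rate simply rescales the branch length, on every edge alike.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * rate * branch_length / 3.0))

# A site with rate 2 on an edge of length 0.1 behaves like a rate-1 site
# on an edge of length 0.2, and the same holds for every other edge.
rates = gamma_site_rates(5, 0.5, random.Random(0))
probs = [jc_diff_prob(0.1, r) for r in rates]
```

Because the rate multiplies every branch length, a single number per site captures all of its variation, which is exactly what the "no-common-mechanism" model gives up.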

Modelling variation between characters: The “no-common-mechanism” model. There is a separate random variable for every combination of site and edge: the underlying tree is fixed, but otherwise there are no constraints on variation between sites. [Figure: trees on leaves A, B, C, D.]

Identifiability and statistical consistency A model is identifiable if it is uniquely characterized by the probability distribution it defines. A phylogenetic reconstruction method is statistically consistent under a model if the probability that the method reconstructs the true tree goes to 1 as the sequence length increases.

Identifiability results The “standard” Markov models (from Jukes-Cantor to the General Markov model) are identifiable. These models are also identifiable when sites draw rates from a gamma distribution (easy to prove if the distribution is known, and harder to prove if the distribution must be estimated - cf. Allman’s talk). However, mixed models are often not identifiable (cf. Matsen’s talk), nor are some models in which sites draw rates from more complex distributions. Phylogeny estimation typically is done under identifiable models.

What about phylogeny reconstruction methods? [Figure: five input sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) for taxa U, V, W, X, Y, and a tree with leaves U, V, W, X, Y.]

Phylogenetic reconstruction methods. 1. Hill-climbing heuristics for NP-hard optimization criteria (Maximum Parsimony and Maximum Likelihood). 2. Polynomial time distance-based methods: Neighbor Joining, FastME, Weighbor, etc. 3. Bayesian methods. [Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked.]
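To make the Maximum Parsimony criterion concrete, here is a minimal sketch of how the cost (tree length) of one fixed tree can be computed with the classical Fitch algorithm. The NP-hard part is searching over trees, not scoring a single tree. The nested-tuple tree encoding and the function names are assumptions chosen for this illustration.

```python
def fitch_site_score(tree, leaf_states):
    """Minimum number of substitutions for one site on a rooted binary tree.

    tree: nested 2-tuples with leaf labels as strings (hypothetical encoding).
    leaf_states: dict mapping each leaf label to its state at this site.
    """
    def visit(node):
        if isinstance(node, str):
            return {leaf_states[node]}, 0
        (left_set, left_cost), (right_set, right_cost) = visit(node[0]), visit(node[1])
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        # No shared state: at least one substitution is needed below this node.
        return left_set | right_set, left_cost + right_cost + 1
    return visit(tree)[1]

def parsimony_length(tree, seqs):
    """Tree length: the sum of per-site Fitch scores over an alignment."""
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch_site_score(tree, {k: s[i] for k, s in seqs.items()})
               for i in range(n_sites))

seqs = {"A": "AC", "B": "AC", "C": "GC", "D": "GC"}
better = parsimony_length((("A", "B"), ("C", "D")), seqs)
worse = parsimony_length((("A", "C"), ("B", "D")), seqs)
```

A hill-climbing heuristic would repeatedly modify the tree and keep changes that lower this length, which is where local optima arise.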

Performance criteria. Running time. Space. Statistical performance issues (e.g., statistical consistency and sequence length requirements). “Topological accuracy” with respect to the underlying true tree, typically studied in simulation. Accuracy with respect to a particular criterion (e.g., tree length or likelihood score), on real data.

Quantifying Error. FN: false negative (missing edge). FP: false positive (incorrect edge). [Figure: a true tree and an estimated tree with FN and FP edges marked; 50% error rate.]
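The FN and FP rates can be computed by comparing the internal edges of the two trees, each edge being represented by the bipartition of the taxa that it induces. The sketch below assumes each tree is already given as a list of bipartitions (one side of each internal edge); the function names and this input format are choices made for illustration.

```python
def normalize(split, taxa):
    """Represent a bipartition by the side that excludes a fixed reference taxon."""
    side = frozenset(split)
    reference = min(taxa)
    return side if reference not in side else frozenset(taxa) - side

def fn_fp_rates(true_splits, est_splits, taxa):
    """Return (FN rate, FP rate) for an estimated tree against the true tree.

    Each tree is a collection of internal-edge bipartitions, with either side
    of the edge given as a set of taxa.
    """
    taxa = frozenset(taxa)
    true_set = {normalize(s, taxa) for s in true_splits}
    est_set = {normalize(s, taxa) for s in est_splits}
    fn = len(true_set - est_set) / len(true_set) if true_set else 0.0
    fp = len(est_set - true_set) / len(est_set) if est_set else 0.0
    return fn, fp

# True tree ((A,B),C,(D,E)) against estimated tree ((A,C),B,(D,E)):
# one of two true edges is missing and one of two estimated edges is wrong.
rates = fn_fp_rates([{"A", "B"}, {"D", "E"}], [{"A", "C"}, {"D", "E"}], "ABCDE")
```

In this example both rates are 0.5, the kind of 50% error rate the slide's figure refers to.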

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

Theoretical results: statistical consistency under typical models? Neighbor Joining is polynomial time and statistically consistent. Maximum Parsimony is NP-hard, and even exact solutions are not statistically consistent. Maximum Likelihood is NP-hard, but exact solutions are statistically consistent.
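For readers unfamiliar with Neighbor Joining, the following is a compact textbook-style sketch of the algorithm, written for clarity rather than speed (it is cubic per join as written, still polynomial). It is not the implementation used in any study cited in the talk. The input format, a dictionary keyed by unordered pairs of taxa, and the decision to return only the leaf sets joined (one per internal edge) are assumptions of this sketch.

```python
from itertools import combinations

def neighbor_joining(taxa, d):
    """Return the leaf sets joined by NJ, one per internal edge of the tree.

    taxa: list of leaf labels (at least 4).
    d: dict mapping frozenset({a, b}) to the distance between leaves a and b.
    """
    nodes = [frozenset([t]) for t in taxa]
    # Distances between current nodes, keyed by an unordered pair of nodes.
    D = {frozenset([frozenset([a]), frozenset([b])]): d[frozenset([a, b])]
         for a, b in combinations(taxa, 2)}

    def dist(x, y):
        return D[frozenset([x, y])]

    clusters = set()
    while len(nodes) > 3:
        n = len(nodes)
        total = {x: sum(dist(x, y) for y in nodes if y != x) for x in nodes}
        # Join the pair minimizing the NJ criterion (n-2)*d(x,y) - r(x) - r(y).
        x, y = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * dist(*p) - total[p[0]] - total[p[1]])
        new = x | y
        for z in nodes:
            if z not in (x, y):
                D[frozenset([new, z])] = 0.5 * (dist(x, z) + dist(y, z) - dist(x, y))
        nodes = [z for z in nodes if z not in (x, y)] + [new]
        clusters.add(new)
    return clusters

# Additive distances from the tree ((A,B),C,(D,E)) with all edge lengths 1.
pairs = {"AB": 2, "AC": 3, "AD": 4, "AE": 4, "BC": 3,
         "BD": 4, "BE": 4, "CD": 3, "CE": 3, "DE": 2}
d = {frozenset(k): float(v) for k, v in pairs.items()}
clusters = neighbor_joining(list("ABCDE"), d)
```

On exactly additive distances such as these, NJ recovers the generating tree; statistical consistency follows because estimated distances converge to additive ones as sequence length grows.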

Theoretical results: sequence length requirements under typical models? Neighbor joining (and some other distance-based methods) will return the true tree with high probability provided sequence lengths are exponential in the diameter of the tree (Erdos et al., Atteson). Maximum likelihood will return the true tree with high probability provided sequence lengths are exponential in the number of taxa (Steel and Szekely).
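One way to see why distance-based methods need long sequences on large-diameter trees is to look at how distances are estimated. The sketch below uses the standard Jukes-Cantor distance correction; the function names are hypothetical, and the choice of Jukes-Cantor (rather than, say, K2P) is an assumption made to keep the example short.

```python
import math

def p_distance(a, b):
    """Fraction of aligned sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jc_distance(a, b):
    """Jukes-Cantor corrected distance; infinite once the sequences look saturated."""
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

close = jc_distance("ACGTACGT", "ACGTACGA")
far = jc_distance("ACGTACGT", "TGCATGCT")
```

As the observed difference approaches 3/4, the correction grows without bound, so a small sampling error in the observed difference becomes a large error in the estimated distance. Pairs of taxa far apart in the tree are the ones closest to saturation, which fits the exponential dependence on diameter stated above.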

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]. Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. [Figure: NJ error rate plotted against number of taxa.]

DCM1 (Warnow, St. John, and Moret, SODA 2001). A two-phase procedure which reduces the sequence length requirement of methods. The DCM phase produces a collection of trees, and the SQS phase picks the “best” tree. The “base method” is applied to subsets of the original dataset. When the base method is NJ, you get DCM1-NJ. [Figure: DCM and SQS phases, labeled "exponentially converging method" and "absolute fast converging method".]

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001]. Theorem: DCM1-NJ converges to the true tree from polynomial length sequences. [Figure: error rates of NJ and DCM1-NJ plotted against number of taxa.]

However, the best accuracy in simulation tends to be from computationally intensive methods (and most molecular phylogeneticists prefer these methods). Unfortunately, these approaches can take weeks or more, just to reach decent local optima. Conclusion: We need better heuristics!

Part 2: Improved heuristics for NP-hard optimization problems

Approaches for “solving” MP/ML. 1. Hill-climbing heuristics (which can get stuck in local optima). 2. Randomized algorithms for getting out of local optima. 3. Approximation algorithms for MP (based upon Steiner Tree approximation algorithms). [Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked.]

Problems with current techniques for MP. Shown here is the performance of a TNT heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%. [Figure: performance of TNT with time.]

Observations The best MP heuristics cannot get acceptably good solutions within 24 hours on some large datasets. Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions. Apparent convergence can be misleading.

Rec-I-DCM3: a new technique (Roshan et al.). Combines a new decomposition technique (DCM3) with recursion and iteration, to produce a novel approach for escaping local optima. Tested initially on MP (maximum parsimony), but also implemented for ML and other optimization problems.

The DCM3 decomposition. Input: set S of sequences, and guide-tree T. 1. Compute the short subtree graph G(S,T), based upon T. 2. Find a clique separator in the graph G(S,T) and form subproblems. DCM3 decompositions (1) can be obtained in O(n) time (the short subtree graph is triangulated), (2) yield small subproblems, and (3) can be used iteratively.

Iterative-DCM3. [Figure: cycle linking tree T, the DCM3 decomposition, the base method, and tree T’.]

Rec-I-DCM3 significantly improves performance (Roshan et al.). Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset. [Figure: current best techniques compared with the DCM-boosted version of the best techniques.]

Observations Rec-I-DCM3 improves upon the best performing heuristics for MP. The improvement increases with the difficulty of the dataset. Rec-I-DCM3 also improves maximum likelihood (using RAxML) analyses (data not shown), and is included in the CIPRES portal.

Part 3: Multiple sequence alignment

Steps in a phylogenetic analysis: (1) Gather data. (2) Align sequences. (3) Reconstruct a phylogeny on the multiple alignment, often obtaining a large number of trees. (4) Compute a consensus (or otherwise estimate the reliable components of the evolutionary history). (5) Perform post-tree analyses.


Basic observations about standard two-phase methods Many new MSA methods improve on ClustalW, with ProbCons and MAFFT the two best MSA methods. Although alignment accuracy correlates with phylogenetic accuracy, it has less effect than might be expected (Wang et al., unpublished).

What about simultaneous estimation? Several Bayesian methods for simultaneous estimation of trees and alignments have been developed, but none can be applied to datasets with more than (approx.) 20 sequences. POY attempts to solve the NP-hard “minimum length tree” problem, in which gaps contribute to the length of the tree, and it can be applied to large datasets. However, its performance on simulated data isn’t competitive with the best two-phase methods (unpublished data).

New method: SATe (Simultaneous Alignment and Tree estimation). Developers: Warnow, Linder, Liu, Nelesen, and Zhao. Basic technique: propose alignments (using treelength under a selected affine gap penalty), and compute maximum likelihood trees for these alignments under GTR. Unpublished.

Simulation study. 100-taxon model trees (generated by r8s and then modified, so as to deviate from the molecular clock). DNA sequences evolved under ROSE (indel events of blocks of nucleotides, plus HKY site evolution). The root sequence has 1000 sites. We vary the gap length distribution, probability of gaps, and probability of substitutions, to produce 8 model conditions: models 1-4 have “long gaps” and 5-8 have “short gaps”. We compared RAxML on various alignments (including the true alignment) to SATe.

Topological accuracy. FN (false negative): proportion of correct edges missing from the estimated tree. FP (false positive): proportion of incorrect edges in the estimated tree.

Alignment accuracy. Normalized number of columns in the estimated alignment relative to the true alignment.

Alignment accuracy. FN: proportion of correctly homologous pairs of nucleotides missing from the estimated alignment (i.e., 1-SP score). FP: proportion of incorrect pairings of nucleotides in the estimated alignment.
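The pairwise FN and FP definitions for alignments can be sketched as follows. The alignment format (a dictionary from sequence name to gapped row, with "-" as the gap character) and the function names are assumptions made for this illustration; residues are identified by sequence name and ungapped position.

```python
def homologous_pairs(alignment):
    """Set of residue pairs that the alignment places in the same column.

    alignment: dict mapping sequence name to a gapped row of equal length.
    Each residue is identified as (name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    position = {name: 0 for name in names}
    pairs = set()
    for col in range(length):
        present = []
        for name in names:
            if alignment[name][col] != "-":
                present.append((name, position[name]))
                position[name] += 1
        for i in range(len(present)):
            for j in range(i + 1, len(present)):
                pairs.add((present[i], present[j]))
    return pairs

def alignment_fn_fp(true_aln, est_aln):
    """Return (FN rate, FP rate); the SP score is 1 minus the FN rate."""
    true_pairs = homologous_pairs(true_aln)
    est_pairs = homologous_pairs(est_aln)
    fn = len(true_pairs - est_pairs) / len(true_pairs) if true_pairs else 0.0
    fp = len(est_pairs - true_pairs) / len(est_pairs) if est_pairs else 0.0
    return fn, fp

true_aln = {"s1": "AC-G", "s2": "A-TG"}
est_aln = {"s1": "ACG", "s2": "ATG"}
fn, fp = alignment_fn_fp(true_aln, est_aln)
```

In this example the estimated alignment keeps both true pairs (FN is 0) but adds one incorrect pairing by collapsing the two gapped columns, so a perfect SP score can still hide over-alignment. That is one reason the FP rate and the alignment length are reported separately.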

Other observations SATe is more accurate at estimating the number of gaps and the correct alignment length than other methods. The standard alignment accuracy measure, SP, is not particularly predictive of phylogenetic accuracy. We still need methods for MSA under models that include rearrangements and duplications.

Part 4: Open questions. Tree shape (including branch lengths) has an impact on phylogeny reconstruction, but what model of tree shape should be used? What is the sequence length requirement for Maximum Likelihood? (The current bound is probably too large.) Why is MP not so bad? How can reticulate evolution be detected and reconstructed? How can trees be “teased apart” under complex models? What complex models are identifiable?

General comments Current models of sequence evolution are clearly too simple, and more realistic ones are not identifiable. Relative performance between methods can change as the models become more complex or as the number of taxa increases. We do not know how methods perform under realistic conditions (nor how long we need to let computationally intensive methods run).

Acknowledgements Funding: NSF, The David and Lucile Packard Foundation, The Program in Evolutionary Dynamics at Harvard, and The Institute for Cellular and Molecular Biology at UT-Austin. Collaborators: Peter Erdos, Daniel Huson, Randy Linder, Kevin Liu, Bernard Moret, Serita Nelesen, Usman Roshan, Mike Steel, Katherine St. John, Laszlo Szekely, Tiffani Williams, and David Zhao.