Tandy Warnow The University of Texas at Austin

Slides:

Advertisements

Similar presentations

CS 598AGB What simulations can tell us. Questions that simulations cannot answer Simulations are on finite data. Some questions (e.g., whether a method.

Advertisements

A Separate Analysis Approach to the Reconstruction of Phylogenetic Networks Luay Nakhleh Department of Computer Sciences UT Austin.

Computer Science and Reconstructing Evolutionary Trees Tandy Warnow Department of Computer Science University of Illinois at Urbana-Champaign.

BNFO 602 Phylogenetics Usman Roshan.

CIS786, Lecture 3 Usman Roshan.

Phylogeny reconstruction BNFO 602 Roshan. Simulation studies.

BNFO 602 Phylogenetics Usman Roshan. Summary of last time Models of evolution Distance based tree reconstruction –Neighbor joining –UPGMA.

CIS786, Lecture 4 Usman Roshan.

CIS786, Lecture 8 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David.

Computational and mathematical challenges involved in very large-scale phylogenetics Tandy Warnow The University of Texas at Austin.

Combinatorial and graph-theoretic problems in evolutionary tree reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin.

Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance Tandy Warnow Department of Computer Sciences University of Texas.

Phylogenetic Tree Reconstruction Tandy Warnow The Program in Evolutionary Dynamics at Harvard University The University of Texas at Austin.

Complexity and The Tree of Life Tandy Warnow The University of Texas at Austin.

Barking Up the Wrong Treelength Kevin Liu, Serita Nelesen, Sindhu Raghavan, C. Randal Linder, and Tandy Warnow IEEE TCCB 2009.

Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin.

Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Evolutionary Trees Usman Roshan Department of Computer Science New Jersey Institute of.

Benjamin Loyle 2004 Cse 397 Solving Phylogenetic Trees Benjamin Loyle March 16, 2004 Cse 397 : Intro to MBIO.

CS 173, Lecture B August 25, 2015 Professor Tandy Warnow.

More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.

Statistical stuff: models, methods, and performance issues CS 394C September 16, 2013.

CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin.

Introduction to Phylogenetic Estimation Algorithms Tandy Warnow.

Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.

Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.

Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.

Three approaches to large- scale phylogeny estimation: SATé, DACTAL, and SEPP Tandy Warnow Department of Computer Science The University of Texas at Austin.

CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.

Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.

Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.

Approaching multiple sequence alignment from a phylogenetic perspective Tandy Warnow Department of Computer Sciences The University of Texas at Austin.

Simultaneous alignment and tree reconstruction Collaborative grant: Texas, Nebraska, Georgia, Kansas Penn State University, Huston-Tillotson, NJIT, and.

The Tree of Life: Algorithmic and Software Challenges Tandy Warnow The University of Texas at Austin.

Absolute Fast Converging Methods CS 598 Algorithmic Computational Genomics.

394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.

CS 466 and BIOE 498: Introduction to Bioinformatics

Constrained Exact Optimization in Phylogenetics

Distance-based phylogeny estimation

The Disk-Covering Method for Phylogenetic Tree Reconstruction

Advances in Ultra-large Phylogeny Estimation

Phylogenetic basis of systematics

New Approaches for Inferring the Tree of Life

394C, Spring 2012 Jan 23, 2012 Tandy Warnow.

Statistical tree estimation

Distance based phylogenetics

Multiple Sequence Alignment Methods

Tandy Warnow Department of Computer Sciences

Challenges in constructing very large evolutionary trees

Techniques for MSA Tandy Warnow.

Algorithm Design and Phylogenomics

CIPRES: Enabling Tree of Life Projects

Mathematical and Computational Challenges in Reconstructing Evolution

New methods for simultaneous estimation of trees and alignments

BNFO 602 Phylogenetics Usman Roshan.

BNFO 602 Phylogenetics – maximum parsimony

Absolute Fast Converging Methods

CS 581 Tandy Warnow.

CS 581 Tandy Warnow.

CS 581 Algorithmic Computational Genomics

Tandy Warnow Department of Computer Sciences

New methods for simultaneous estimation of trees and alignments

Texas, Nebraska, Georgia, Kansas

Ultra-Large Phylogeny Estimation Using SATé and DACTAL

Recent Breakthroughs in Mathematical and Computational Phylogenetics

CS 394C: Computational Biology Algorithms

September 1, 2009 Tandy Warnow

Algorithms for Inferring the Tree of Life

Sequence alignment CS 394C Tandy Warnow Feb 15, 2012.

Tandy Warnow The University of Texas at Austin

New methods for simultaneous estimation of trees and alignments

Presentation transcript:

Tandy Warnow The University of Texas at Austin Computational and mathematical challenges involved in very large-scale phylogenetics Tandy Warnow The University of Texas at Austin

Species phylogeny Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona Orangutan Gorilla Chimpanzee Human A phylogeny is a tree representation for the evolutionary history relating the species we are interested in. This is an example of a 13-species phylogeny. At each leaf of the tree is a species – we also call it a taxon in phylogenetics (plural form is taxa). They are all distinct. Each internal node corresponds to a speciation event in the past. When reconstructing the phylogeny we compare the characteristics of the taxa, such as their appearance, physiological features, or the composition of the genetic material.

How did life evolve on earth? An international effort to understand how life evolved on earth Biomedical applications: drug design, protein structure and function prediction, biodiversity Phylogenetic estimation is a “Grand Challenge”: millions of taxa, NP-hard optimization problems Courtesy of the Tree of Life project

The CIPRES Project (Cyber-Infrastructure for Phylogenetic Research) www.phylo.org This project is funded by the NSF under a Large ITR grant ALGORITHMS and SOFTWARE: scaling to millions of sequences (open source, freely distributed) MATHEMATICS/PROBABILITY/STATISTICS: Obtaining better mathematical theory under complex models of evolution DATABASES: Producing new database technology for structured data, to enable scientific discoveries SIMULATIONS: The first million taxon simulation under realistically complex models OUTREACH: Museum partners, K-12, general scientific public PORTAL available to all researchers

Step 1: Gather data S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Step 2: Multiple Sequence Alignment S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA

Step 3: Construct tree S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 S2 S4 S3

Performance criteria Estimated alignments are evaluated with respect to the true alignment. Studied both in simulation and on real data. Estimated trees are evaluated for “topological accuracy” with respect to the true tree. Typically studied in simulation. Methods for these problems can also be evaluated with respect to an optimization criterion (e.g., maximum likelihood score) as a function of running time. Typically studied on real data. Issues: Simulation studies need to be based upon realistic models, and “truth” is often not known for real data.

Observations The best current multiple sequence alignment methods can produce highly inacccurate alignments on large datasets (with the result that trees estimated on these alignments are also inaccurate). The fast (polynomial time) methods produce highly inaccurate trees for many datasets. Heuristics for NP-hard optimization problems often produce highly accurate trees, but can take months to reach solutions on large datasets.

This talk Part 1: Improving the topological accuracy of polynomial time phylogeny reconstruction methods (and absolute fast converging methods) Part 2: Improving heuristics for NP-hard optimization problems (getting better solutions faster) Part 3: Simultaneous Alignment and Tree estimation (SATe) Part 4: Conclusions

Part 1: Improving polynomial time methods (and absolute fast converging methods)

DNA Sequence Evolution -3 mil yrs -2 mil yrs -1 mil yrs today AAGACTT TGGACTT AAGGCCT AGGGCAT TAGCCCT AGCACTT AGCGCTT AGCACAA TAGACTT TAGCCCA AAGACTT TGGACTT AAGGCCT AGGGCAT TAGCCCT AGCACTT AAGGCCT TGGACTT TAGCCCA TAGACTT AGCGCTT AGCACAA AGGGCAT TAGCCCT AGCACTT

Markov models of single site evolution Simplest (Jukes-Cantor): The model tree is a pair (T,{e,p(e)}), where T is a rooted binary tree, and p(e) is the probability of a substitution on the edge e. The state at the root is random. If a site changes on an edge, it changes with equal probability to each of the remaining states. The evolutionary process is Markovian. More complex models (such as the General Markov model) are also considered, often with little change to the theory.

Distance-based Phylogenetic Methods

FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FP

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001] Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. 0.8 NJ 0.6 Error Rate 0.4 0.2 400 800 1200 1600 No. Taxa

Theorem: Neighbor joining (and some other distance-based methods) will return the true tree with high probability provided sequence lengths are exponential in the diameter of the tree (Erdos et al., Atteson).

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

DCM1 Warnow, St. John, and Moret, SODA 2001 Exponentially converging method Absolute fast converging method DCM SQS A two-phase procedure which reduces the sequence length requirement of methods. The DCM phase produces a collection of trees, and the SQS phase picks the “best” tree. The “base method” is applied to subsets of the original dataset. When the base method is NJ, you get DCM1-NJ.

Graph-theoretic divide-and-conquer (DCM’s) Define a triangulated (i.e. chordal) graph so that its vertices correspond to the input taxa Compute a decomposition of the graph into overlapping subgraphs, thus defining a decomposition of the taxa into overlapping subsets. Apply the “base method” to each subset of taxa, to construct a subtree Merge the subtrees into a single tree on the full set of taxa.

DCM (cartoon)

Some properties of chordal graphs Every chordal graph has at most n maximal cliques, and these can be found in polynomial time: Maxclique decomposition. Every chordal graph has a vertex separator which is a maximal clique: Separator-component decomposition. Every chordal graph has a perfect elimination scheme: enables us to merge correct subtrees and get a correct supertree back, if subtrees are big enough.

DCM1 Decompositions Input: Set S of sequences, distance matrix d, threshold value 1. Compute threshold graph 2. Perform minimum weight triangulation (note: if d is an additive matrix, then the threshold graph is provably chordal). DCM1 decomposition : Compute maximal cliques

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001] Theorem: DCM1-NJ converges to the true tree from polynomial length sequences 0.8 NJ DCM1-NJ 0.6 Error Rate 0.4 0.2 400 800 1200 1600 No. Taxa

However, The best phylogenetic accuracy tends to be from computationally intensive methods (and most molecular phylogeneticists prefer these methods). Unfortunately, these approaches can take weeks or more, just to reach decent local optima. Conclusion: We need better heuristics for NP-hard optimization methods!

Part 2: Improved heuristics for NP-hard optimization problems Rec-I-DCM3: Roshan, Williams, Moret, and Warnow Part of the CIPRES software distribution and portal

Standard problem: Maximum Parsimony (Hamming distance Steiner Tree) Input: Set S of n aligned sequences of length k Output: A phylogenetic tree T leaf-labeled by sequences in S additional sequences of length k labeling the internal nodes of T such that is minimized.

Maximum parsimony (example) Input: Four sequences ACT ACA GTT GTA Question: which of the three trees has the best MP scores?

Maximum Parsimony ACT GTA ACA ACT GTT ACA GTT GTA GTA ACA ACT GTT

Maximum Parsimony ACT GTA ACA ACT GTT GTA ACA ACT 2 1 1 2 GTT 3 3 GTT MP score = 7 MP score = 5 GTA ACA ACA GTA 2 1 1 ACT GTT MP score = 4 Optimal MP tree

Maximum Parsimony: computational complexity ACT ACA GTT GTA 1 2 MP score = 4 Finding the optimal MP tree is NP-hard Optimal labeling can be computed in linear time O(nk)

Maximum Likelihood (ML) Given: stochastic model of sequence evolution (e.g. Jukes-Cantor) and a set S of sequences Objective: Find tree T and parameter values so as to maximize the probability of the data. Preferred by some systematists, but even harder than MP in practice.

Approaches for “solving” MP (and other NP-hard problems in phylogeny) Hill-climbing heuristics (which can get stuck in local optima) Randomized algorithms for getting out of local optima Approximation algorithms for MP (based upon Steiner Tree approximation algorithms). Phylogenetic trees Cost Global optimum Local optimum

Problems with current techniques for MP Shown here is the performance of a TNT heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%. Performance of TNT with time

Rec-I-DCM3: a new technique (Roshan et al.) Combines a new decomposition technique (DCM3) with recursion and iteration, to produce a novel approach for escaping local optima Demonstrated here on MP (maximum parsimony), but also implemented for ML and other optimization problems

The DCM3 decomposition Input: Set S of sequences, and guide-tree T 1. Compute short subtree graph G(S,T), based upon T 2. Find clique separator in the graph G(S,T) and form subproblems DCM3 decompositions can be obtained in O(n) time (the short subtree graph is triangulated) (2) yield small subproblems (3) can be used iteratively

Iterative-DCM3 T DCM3 Base method T’

Rec-I-DCM3 significantly improves performance (Roshan et al.) Current best techniques DCM boosted version of best techniques Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset

Part 3: Multiple sequence alignment SATe (Simultaneous Alignment and Tree estimation) Developers: Liu, Nelesen, Linder, and Warnow unpublished

Multiple Sequence Alignment AGGCTATCACCTGACCTCCA TAGCTATCACGACCGC TAGCTGACCGC -AGGCTATCACCTGACCTCCA TAG-CTATCAC--GACCGC-- TAG-CT-------GACCGC-- Notes: 1. We insert gaps (dashes) to each sequence to make them “line up”. 2. Nucleotides in the same column are presumed to have a common ancestor (i.e., they are “homologous”).

Indels and substitutions at the DNA level Deletion Mutation …ACGGTGCAGTTACCA…

Indels and substitutions at the DNA level Deletion Mutation …ACGGTGCAGTTACCA…

Indels and substitutions at the DNA level Deletion Mutation …ACGGTGCAGTTACCA… …ACCAGTCACCA…

Deletion Mutation …ACGGTGCAGTTACCA… …ACCAGTCACCA… The true multiple alignment is: …ACGGTGCAGTTACCA… …AC----CAGTCACCA…

Basic observations about standard two-phase methods Clustal is the standard multiple alignment method used by systematists. However, many new MSA methods improve on ClustalW, with ProbCons and MAFFT the two best MSA methods. The best current two-phase techniques are obtained by computing maximum likelihood trees on ProbCons or MAFFT alignments (joint work with Wang, Leebens-Mack, and dePamphilis - unpublished).

New method: SATe (Simultaneous Alignment and Tree estimation) Developers: Warnow, Linder, Liu, Nelesen, and Zhao. Basic technique: iteratively propose alignments (using various techniques), and compute maximum likelihood trees for these alignments. Unpublished.

Simulation study 100 taxon model trees, 1000 sites at the root DNA sequences evolved with indels and substitutions (using ROSE). We vary the gap length distribution, probability of gaps, and probability of substitutions, to produce 8 model conditions: models 1-4 have “long gaps” and 5-8 have “short gaps”. We compare SATe to maximum likelihood trees (using RAxML) on various alignments (including the true alignment), each method limited to 24 hours.

Error rates refer to the proportion of incorrect edges.

Errors in estimating alignments Normalized number of columns in the estimated alignment relative to the true alignment.

Summary of SATe SATe produces more accurate trees than the best current two-phase method, especially when the evolutionary process has many gap events. SATe alignments do not compress the data (“over-align”) as much as standard MSA methods, most of which are based upon progressive alignment.

Future work Our current research is focused on extending SATe to estimate maximum likelihood under models that include gap events. Evolution is more complex than just indels and substitutions: we need methods that can handle genome rearrangements and duplications.

Acknowledgements Funding: NSF, The David and Lucile Packard Foundation, The Program in Evolutionary Dynamics at Harvard, and The Institute for Cellular and Molecular Biology at UT-Austin. Collaborators: Peter Erdos, Daniel Huson, Randy Linder, Kevin Liu, Bernard Moret, Serita Nelesen, Usman Roshan, Mike Steel, Katherine St. John, Laszlo Szekely, Tiffani Williams, and David Zhao. Thanks also to Li-San Wang and Serafim Batzoglou (slides)