CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin.

Reconstructing the “Tree” of Life
- Handling large datasets: millions of species
- The “Tree of Life” is not really a tree: reticulate evolution

Cyber Infrastructure for Phylogenetic Research
- Purpose: to create a national infrastructure of hardware, open source software, database technology, etc., necessary to infer the Tree of Life.
- Group: 40 biologists, computer scientists, and mathematicians from 13 institutions.
- Funding: $11.6 M (large ITR grant from NSF).
- URL:

CIPRES Members
- University of New Mexico: Bernard Moret, David Bader
- UCSD/SDSC: Fran Berman, Alex Borchers, Phil Bourne, John Huelsenbeck, Terri Liebowitz, Mark Miller
- University of Connecticut: Paul O. Lewis
- University of Pennsylvania: Junhyong Kim, Susan Davidson, Sampath Kannan, Val Tannen
- Texas A&M: Tiffani Williams
- UT Austin: Tandy Warnow, David M. Hillis, Warren Hunt, Robert Jansen, Randy Linder, Lauren Meyers, Daniel Miranker
- University of Arizona: David R. Maddison
- University of British Columbia: Wayne Maddison
- North Carolina State University: Spencer Muse
- American Museum of Natural History: Ward C. Wheeler
- NJIT: Usman Roshan
- UC Berkeley: Satish Rao, Steve Evans, Richard M. Karp, Brent Mishler, Elchanan Mossel, Eugene W. Myers, Christos M. Papadimitriou, Stuart J. Russell
- Rice: Luay Nakhleh
- SUNY Buffalo: William Piel
- Florida State University: David L. Swofford, Mark Holder
- Yale: Michael Donoghue, Paul Turner

CIPRES activity
- Databases, e.g., TreeBase II (Bill Piel and others)
- Simulations of large-scale complex genome-scale evolution (Junhyong Kim)
- Outreach (Michael Donoghue and Brent Mishler)
- Algorithms (Tandy Warnow)
- Open source software (Wayne Maddison, Dave Swofford, Mark Holder, and Bernard Moret)
- Computer cluster at SDSC (Fran Berman and Mark Miller), available to ATOL projects and other groups with datasets above 1000 taxa

DNA Sequence Evolution
[Figure: DNA sequences evolving down a tree from the ancestral sequence AAGACTT, drawn against a time axis marked -3 mil yrs, -2 mil yrs, -1 mil yrs, today]

Phylogeny Problem
[Figure: five DNA sequences labeled U, V, W, X, Y, and a tree on the leaves U, V, W, X, Y]

Complex Evolutionary Processes
- Gap events
- “Heterotachy” (violations of the rates-across-sites assumption)
- New types of data (e.g., whole genomes)
- Reticulate evolution (e.g., hybrid speciation and horizontal gene transfer)

Challenges in reconstructing large and/or complex evolutionary histories
- Previous simulation studies don’t necessarily help us understand phylogenetic reconstruction on large or complex datasets.
- We need new statistical models, new theory, and probably new methods.
- Reticulate evolution and whole genome evolution in particular present many interesting challenges for reconstruction.

CIPRES research in algorithms
- Multiple sequence alignment
- Genomic alignment
- Heuristics for Maximum Parsimony and Maximum Likelihood
- Bayesian MCMC methods
- Supertree methods
- Whole genome phylogeny reconstruction
- Reticulate evolution detection and reconstruction
- Data mining on sets of trees, and compact representations of these sets

Phylogenetic reconstruction methods
1. Heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood): hard to solve on large datasets
2. Polynomial time distance-based methods (Neighbor Joining, FastME, Weighbor, etc.): poor accuracy on datasets with large evolutionary distances
[Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked]
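To make the Maximum Parsimony criterion concrete, the sketch below scores one fixed tree with Fitch's algorithm. Fitch's algorithm is a standard small-parsimony method that the slide does not name, so its use here is an assumption; the nested-tuple tree encoding, the function names, and the four toy sequences are also illustrative. The hard part the slide refers to is the search over all trees, which this sketch does not attempt.

```python
def fitch_site(node, site):
    """Return (candidate state set, substitution count) for one site below `node`.

    A leaf is a sequence string; an internal node is a (left, right) tuple.
    """
    if isinstance(node, str):
        return {node[site]}, 0
    left, right = node
    left_states, left_cost = fitch_site(left, site)
    right_states, right_cost = fitch_site(right, site)
    shared = left_states & right_states
    if shared:
        # Children agree on at least one state: no extra substitution needed.
        return shared, left_cost + right_cost
    # Children disagree: keep the union and count one substitution.
    return left_states | right_states, left_cost + right_cost + 1


def first_leaf(node):
    """Walk down the leftmost branch to find any leaf sequence."""
    while not isinstance(node, str):
        node = node[0]
    return node


def parsimony_score(tree):
    """Sum the per-site Fitch costs over all alignment columns."""
    n_sites = len(first_leaf(tree))
    return sum(fitch_site(tree, site)[1] for site in range(n_sites))


# Two candidate topologies on the same four (hypothetical) sequences.
tree_a = (("AAGACTT", "TGGACTT"), ("AAGGCCT", "AGGGCAT"))
tree_b = (("AAGACTT", "AGGGCAT"), ("TGGACTT", "AAGGCCT"))
score_a = parsimony_score(tree_a)
score_b = parsimony_score(tree_b)
print(score_a, score_b)
```

Scoring a single tree is linear in the number of leaves and sites; the cost of an MP analysis comes from how many trees have to be scored.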

DCMs: Divide-and-conquer for improving phylogeny reconstruction

“Boosting” phylogeny reconstruction methods
- DCMs “boost” the performance of phylogeny reconstruction methods.
[Figure: a DCM combined with a base method M yields the boosted method DCM-M]

DCMs (Disk-Covering Methods)
- DCMs for polynomial time methods improve topological accuracy (empirical observation), and have provable theoretical guarantees under Markov models of evolution
- DCMs for hard optimization problems reduce the running time needed to achieve good levels of accuracy (empirical observation)

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001]
- DCM1-boosting makes distance-based methods more accurate
- Theoretical guarantees that DCM1-NJ converges to the true tree from polynomial length sequences
[Figure: error rate versus number of taxa for NJ and DCM1-NJ]

Major challenge: MP and ML
- Maximum Parsimony (MP) and Maximum Likelihood (ML) remain the methods of choice for most systematists
- The main challenge here is to make it possible to obtain good solutions to MP or ML in reasonable time periods on large datasets

Solving NP-hard problems exactly is … unlikely
- Number of (unrooted) binary trees on n leaves is (2n-5)!!
- If each tree on 1000 taxa could be analyzed in seconds, we would find the best tree in 2890 millennia
[Table: #leaves versus #trees]
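The growth of (2n-5)!! is easy to check directly. The following is a minimal sketch (the function name is illustrative) that evaluates the double factorial 1 · 3 · 5 · … · (2n-5) with exact integer arithmetic and prints the order of magnitude for a few leaf counts.

```python
def num_unrooted_binary_trees(n_leaves):
    """Number of unrooted binary trees on n labeled leaves: (2n-5)!!."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product for n = 3).
    for odd in range(3, 2 * n_leaves - 4, 2):
        count *= odd
    return count


for n in (4, 5, 10, 20, 100, 1000):
    trees = num_unrooted_binary_trees(n)
    # Report the exact value when short, else only its power of ten.
    if trees < 10**9:
        print(n, trees)
    else:
        print(n, "about 10^%d" % (len(str(trees)) - 1))
```

Because Python integers are arbitrary precision, the count for 1000 leaves is computed exactly, which makes the infeasibility of exhaustive search easy to see.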

How good an MP analysis do we need?
- Our research shows that we need to get within 0.01% of optimal (or even closer on large datasets) to return reasonable estimates of the true tree’s “topology”

Problems with current techniques for MP
- Shown here is the performance of a heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%.
[Figure: performance of TNT with time]
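The error measure used here is a relative gap to the best score found so far. A small sketch of that arithmetic follows; the function names and the example scores are hypothetical and are not taken from the dataset on the slide.

```python
def percent_above_optimal(score, best_known):
    """Relative gap, in percent, between an MP score and the best score to date."""
    return 100.0 * (score - best_known) / best_known


def is_acceptable(score, best_known, threshold_percent=0.01):
    """True when the score is less than `threshold_percent` above the best known."""
    return percent_above_optimal(score, best_known) < threshold_percent


# Hypothetical scores: with a best known score of 100000 steps,
# 0.01% corresponds to only 10 extra steps.
best = 100000
for candidate in (100005, 100020):
    print(candidate, percent_above_optimal(candidate, best), is_acceptable(candidate, best))
```

The example shows why the target is demanding: on a tree with a score in the hundreds of thousands, a handful of extra steps already exceeds the threshold.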

Observations
- The best MP heuristics cannot get acceptably good solutions within 24 hours on most of these large datasets.
- Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions.
- Apparent convergence can be misleading.

Our objective: speed up the best MP heuristics
[Figure (fake study): MP score of best trees versus time, comparing the performance of a hill-climbing heuristic with the desired performance]

DCM3 decomposition
Input: Set S of sequences, and guide-tree T
1. Compute short subtree graph G(S,T), based upon T
2. Find clique separator in the graph G(S,T) and form subproblems
DCM3 decompositions
(1) can be obtained in O(n) time
(2) yield small subproblems
(3) can be used iteratively
(4) can be applied recursively

Iterative-DCM3
[Figure: tree T is passed through DCM3 and the base method to produce tree T’]

New DCMs
DCM3
1. Compute subproblems using DCM3 decomposition
2. Apply base method to each subproblem to yield subtrees
3. Merge subtrees using the Strict Consensus Merger technique
4. Randomly refine the merged tree to make it binary
Recursive-DCM3
Iterative DCM3
1. Compute a DCM3 tree
2. Perform local search and go to step 1
Recursive-Iterative DCM3
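The control flow of DCM3 and Iterative DCM3 can be sketched as follows. This is only a skeleton under stated assumptions: `decompose`, `base_method`, `merge`, `refine`, and `local_search` are hypothetical callables supplied by the caller, and the toy stand-ins at the bottom exercise the loop without implementing the real decomposition, the Strict Consensus Merger, or any tree search.

```python
def dcm3_pass(sequences, guide_tree, decompose, base_method, merge, refine):
    """One DCM3 pass: decompose, solve subproblems, merge, refine."""
    subproblems = decompose(sequences, guide_tree)        # step 1
    subtrees = [base_method(sub) for sub in subproblems]  # step 2
    merged = merge(subtrees)                              # step 3
    return refine(merged)                                 # step 4


def iterative_dcm3(sequences, start_tree, rounds, local_search, **steps):
    """Repeat: compute a DCM3 tree from the current tree, then run local search."""
    tree = start_tree
    for _ in range(rounds):
        tree = dcm3_pass(sequences, tree, **steps)
        tree = local_search(tree)
    return tree


# Toy stand-ins (hypothetical): a "tree" is just a sorted tuple of labels.
toy_steps = dict(
    decompose=lambda seqs, tree: [seqs[: len(seqs) // 2 + 1], seqs[len(seqs) // 2:]],
    base_method=lambda sub: sorted(sub),
    merge=lambda subtrees: sorted(set().union(*subtrees)),
    refine=lambda merged: tuple(merged),
)
result = iterative_dcm3(["d", "b", "a", "c"], None, 2,
                        local_search=lambda tree: tree, **toy_steps)
print(result)
```

Passing the steps in as callables keeps the skeleton independent of any particular base method, which mirrors how the slides describe DCMs as wrappers around an existing method.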

Rec-I-DCM3 significantly improves performance
[Figure: comparison of TNT to Rec-I-DCM3(TNT) on one large dataset; curves show the current best techniques and the DCM-boosted version of the best techniques]

Datasets
- 1322 lsu rRNA of all organisms
- 2000 Eukaryotic rRNA
- 2594 rbcL DNA
- 4583 Actinobacteria 16s rRNA
- 6590 ssu rRNA of all Eukaryotes
- 7180 three-domain rRNA
- 7322 Firmicutes bacteria 16s rRNA
- 8506 three-domain+2org rRNA
- ssu rRNA of all Bacteria
- Proteobacteria 16s rRNA
Obtained from various researchers and online databases

Rec-I-DCM3(TNT) vs. TNT (comparison of scores at 24 hours)
- Base method is the default TNT technique, the current best method for MP.
- Rec-I-DCM3 significantly improves upon the unboosted TNT by returning trees which are at most 0.01% above optimal on most datasets.

Observations
- Rec-I-DCM3 improves upon the best performing heuristics for MP.
- The improvement increases with the difficulty of the dataset.

DCMs
- DCM for NJ and other distance methods produces absolute fast converging (afc) methods
- DCMs for MP heuristics
- DCMs for use with the GRAPPA software for whole genome phylogenetic analysis; these have been shown to let GRAPPA scale from its maximum of about genomes to 1000 genomes.
- Current projects: DCM development for maximum likelihood and multiple sequence alignment.

Part II: Whole-Genome Phylogenetics
[Figure: tree with leaves A, B, C, D, E, F and internal nodes X, Y, Z, W]

Genomes Evolve by Rearrangements
[Figure: examples of an inversion (reversal), a transposition, and an inverted transposition applied to a signed gene order]
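The three rearrangement events can be illustrated on a signed gene order stored as a list of integers, where a negative number marks a gene on the opposite strand. This is a minimal sketch: the list encoding, the half-open index convention, and the function names are assumptions made for illustration, and the genome is treated as linear for simplicity.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Cut out genome[i:j] and reinsert it at index k of the remaining genes."""
    segment = genome[i:j]
    rest = genome[:i] + genome[j:]
    return rest[:k] + segment + rest[k:]


def inverted_transposition(genome, i, j, k):
    """Like a transposition, but the moved segment is also inverted."""
    segment = [-g for g in reversed(genome[i:j])]
    rest = genome[:i] + genome[j:]
    return rest[:k] + segment + rest[k:]


genome = list(range(1, 11))  # genes 1..10, all on the same strand
print(inversion(genome, 3, 8))
print(transposition(genome, 3, 8, 4))
print(inverted_transposition(genome, 3, 8, 4))
```

Each function returns a new list, so the original gene order is left untouched and events can be chained.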

Genome Rearrangement Has A Huge State Space
- DNA sequences: 4 states per site
- Signed circular genomes with n genes: states, 1 site
- Circular genomes (1 site)
  – with 37 genes: states
  – with 120 genes: states
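The size of this state space can be computed under a standard counting argument, which is an assumption here rather than something stated on the slide: there are 2^n · n! signed linear orders of n genes, and each circular genome corresponds to 2n of them (n rotations, times 2 reading directions), giving 2^(n-1) · (n-1)! distinct signed circular genomes. The function name below is illustrative.

```python
from math import factorial


def signed_circular_genomes(n_genes):
    """Count signed circular gene orders: 2^(n-1) * (n-1)!.

    Assumes genomes are identified up to rotation and reading direction.
    """
    return 2 ** (n_genes - 1) * factorial(n_genes - 1)


for n in (37, 120):
    count = signed_circular_genomes(n)
    # Print only the power of ten; the exact integers are very long.
    print(n, "genes: about 10^%d states" % (len(str(count)) - 1))
```

Compared with 4 states per DNA site, a single gene-order "site" carries a state space so large that two unrelated lineages are very unlikely to reach the same arrangement by chance.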

Why use gene orders?
- “Rare genomic changes”: huge state space and relative infrequency of events (compared to site substitutions) could make the inference of deep evolution easier, or more accurate.
- Our research shows this is true, but accurate analysis of gene order data is computationally very intensive!

The Generalized Nadeau-Taylor model [Wang and Warnow, 2001]
- Three types of events: inversions, transpositions, and inverted transpositions
- Each event of each type is equiprobable
- The relative probabilities of the three events are parameters that the user can specify
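The user-specified relative probabilities can be read as weights over the three event types. The sketch below, with illustrative names, normalizes a ratio such as 2:1:1 and samples a sequence of event types; it deliberately stops short of choosing which specific event of each type occurs, which the model takes to be uniform within a type.

```python
import random

EVENT_TYPES = ("inversion", "transposition", "inverted_transposition")


def event_probabilities(w_inv, w_transp, w_inv_transp):
    """Turn relative weights (e.g. 2:1:1) into probabilities that sum to 1."""
    total = w_inv + w_transp + w_inv_transp
    return (w_inv / total, w_transp / total, w_inv_transp / total)


def sample_event_types(n_events, weights, rng):
    """Draw the type of each of `n_events` rearrangements independently."""
    return rng.choices(EVENT_TYPES, weights=weights, k=n_events)


probs = event_probabilities(2, 1, 1)
events = sample_event_types(10, (2, 1, 1), random.Random(0))
print(probs)
print(events)
```

Passing an explicit `random.Random` instance keeps a simulated history reproducible from its seed.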

Phylogeny reconstruction in 1998
- Distance-based
  – Breakpoint (BP) distances [Blanchette, Kunisawa, Sankoff 1998]
- Minimum length trees (NP-hard, even for three taxa)
  – BPAnalysis [Sankoff & Blanchette 1998]: exhaustive search through treespace to find the minimum breakpoint length (the number of breakpoints on the tree)
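A breakpoint distance between two genomes can be computed in linear time. The sketch below uses one common definition, assumed here: for signed circular genomes on the same gene set, count the adjacencies of the first genome that appear in the second in neither reading direction. The function names and list encoding are illustrative.

```python
def adjacencies(genome):
    """All adjacencies of a signed circular genome, in both reading directions."""
    n = len(genome)
    found = set()
    for idx in range(n):
        a, b = genome[idx], genome[(idx + 1) % n]
        found.add((a, b))
        found.add((-b, -a))  # the same adjacency read from the other strand
    return found


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that are not conserved in g2."""
    conserved = adjacencies(g2)
    n = len(g1)
    return sum(
        (g1[idx], g1[(idx + 1) % n]) not in conserved
        for idx in range(n)
    )


reference = [1, 2, 3, 4, 5]
after_inversion = [1, -3, -2, 4, 5]  # genes 2..3 inverted
print(breakpoint_distance(reference, after_inversion))
```

A single inversion breaks at most two adjacencies, which is why breakpoint counts underestimate the number of events once a dataset becomes highly diverged.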

[Figure: error in inferred tree versus amount of evolution for NJ(BP)]
- 40 taxa, 120 genes, Inv.:Transp.:InvTransp = 2:1:1, birth-death trees, expected deviation from ultrametricity = 2
- NJ(BP): seconds
- BPAnalysis: will not finish (will take 200 years for a 13 genome dataset)

Progress
- Statistically-defined distance estimators: EDE and IEBP, highly robust to model violations
- FastME(EDE) yields very accurate trees, except when the datasets are close to “saturated”

[Figure: method comparison plotted against amount of evolution]
- BP = breakpoint distance
- INV = inversion distance
- EDE: statistically-based estimator [Wang et al. ’01], highly robust
- All these methods are polynomial time.
- 40 taxa, 120 genes, Inv.:Transp.:InvTransp = 2:1:1, birth-death trees, expected deviation from ultrametricity = 2

Progress for equal gene content
- EDE and IEBP: statistically-based evolutionary distance estimators, which enable fast and accurate tree reconstruction and are robust to model violations (Wang and Warnow)
- GRAPPA (software package) for finding trees of minimum inversion or breakpoint length (Moret et al.). Computationally intensive but robust.
- MPME: heuristic for the breakpoint phylogeny
- DCM4-MPME: can handle larger datasets than MPME or GRAPPA (Wang and Warnow)
- All methods are fairly robust to model violations

Minimum length trees (“parsimony”)
- Breakpoint length and inversion length: both NP-hard to solve even on three-leaf trees.
- Exact solutions are exponential in both the number of taxa and the number of genes.
- Inversion phylogeny has better topological accuracy than breakpoint phylogeny, but is harder to solve.
- Highly robust to model violations.

“Solving” the inversion and breakpoint phylogeny problems
- Usual issue of getting stuck in local optima, since the optimization problems are NP-hard
- Additional problem: finding the best trees is enormously hard, since even the “point estimation” problem is hard (worse than estimating branch lengths in ML).
[Figure: MP score plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked]

Minimum length trees (“parsimony”)
Breakpoint phylogeny:
  – BPAnalysis [Sankoff & Blanchette 1998]
  – GRAPPA [Moret et al. 2001]
  – MPME [Wang et al. PSB 2002]: represents gene orders as multi-state strings, and solves parsimony on this modified dataset. (This problem is exponential in the number of taxa, but polynomial in the number of genes.) Because it relies on MP software, it cannot handle large datasets.
  – DCM4-MPME: uses a divide-and-conquer strategy (similar to DCM3) to decompose a large dataset into smaller datasets, on the basis of a guide tree. It can handle larger datasets than any of the other methods.
Inversion phylogeny:
  – GRAPPA: highly accurate, robust to model violations, but cannot analyze trees with large edge lengths in reasonable time periods.

Benchmark gene order dataset: Campanulaceae
- 12 genomes + 1 outgroup (Tobacco), 105 gene segments
- NP-hard optimization problems: breakpoint and inversion phylogenies (techniques score every tree)
- Joint work with Bob Jansen, Linda Raubeson, Jijun Tang, and Li-San Wang
- 1997: BPAnalysis (Blanchette and Sankoff): 200 years (est.)
- 2000: Using GRAPPA v1.1 on the 512-processor Los Lobos Supercluster machine: 2 minutes (200,000-fold speedup per processor)
- 2003: Using latest version of GRAPPA: 2 minutes on a single processor (1-billion-fold speedup per processor)

Analyzing Large Datasets
- NJ(EDE): poor accuracy for highly diverged datasets
- GRAPPA: cannot handle datasets of moderate size, or trees with long branches
- MPME: cannot handle datasets of large size
- NJ(EDE)+MPME in a Divide-and-Conquer approach
[Figure: topological error versus problem size / divergence for NJ(BP), NJ(EDE), MPME, and GRAPPA]

DCM4-MPME: Guide tree=NJ(EDE)
- GRAPPA & MPME won’t finish (long branch lengths; too many taxa)
- 120 genes, 200 taxa, Inversion/Transposition/Inverted Transposition = 2:1:1
- Birth-Death Trees with deviation from ultrametricity
[Figure: comparison of NJ(EDE) and DCM4-MPME]

Summary
- True evolutionary distance estimators improve accuracy of NJ
- Sequence-based heuristic (MPME)
- Divide-and-conquer, integrated approach for large-scale data

Future Directions for Whole Genome Phylogeny
- Scale maximum parsimony to large datasets
- Analyze nuclear genomes
  – Multiple chromosomes
  – Unequal gene content: insertions, deletions, duplications (Moret’s group)
  – Nonuniform model: hotspots/short inversions

Limitations and ongoing research
- Current methods are mostly limited to single chromosomes with equal gene content (or very small amounts of deletions and duplications).
- Moret et al. have made some progress on developing a reliable distance-based method for chromosomes with unequal gene content (tests on real and simulated data show high accuracy).
- Handling the multiple chromosome case is harder.

GRAPPA (Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms)
- Heuristics for NP-hard optimization problems
- Fast polynomial time distance-based methods
- Contributors: U. New Mexico, U. Texas at Austin, Università di Bologna, Italy
- Freely available in source code at this site.
- Project leader: Bernard Moret (UNM)

CIPRES software distributions
- Software group leaders: Wayne Maddison and Dave Swofford
- The first distribution (in the next months) will focus on Rec-I-DCM3(PAUP*): fast heuristic searches for maximum parsimony on large datasets for PAUP* users
- All software will be open source
- Community contributions to software will be enabled

Acknowledgements
- NSF
- The David and Lucile Packard Foundation
- The Program in Evolutionary Dynamics at Harvard
- The Institute for Cellular and Molecular Biology at UT-Austin
See and and