CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin.

Slides:



Advertisements
Similar presentations
CS 598AGB What simulations can tell us. Questions that simulations cannot answer Simulations are on finite data. Some questions (e.g., whether a method.
Advertisements

A Separate Analysis Approach to the Reconstruction of Phylogenetic Networks Luay Nakhleh Department of Computer Sciences UT Austin.
Challenges in computational phylogenetics Tandy Warnow Radcliffe Institute for Advanced Study University of Texas at Austin.
Computer Science and Reconstructing Evolutionary Trees Tandy Warnow Department of Computer Science University of Illinois at Urbana-Champaign.
Large-Scale Phylogenetic Analysis Tandy Warnow Associate Professor Department of Computer Sciences Graduate Program in Evolution and Ecology Co-Director.
Perfect phylogenetic networks, and inferring language evolution Tandy Warnow The University of Texas at Austin (Joint work with Don Ringe, Steve Evans,
Computational biology and computational biologists Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular and Molecular Biology.
BNFO 602 Phylogenetics Usman Roshan.
CIS786, Lecture 3 Usman Roshan.
Phylogeny reconstruction BNFO 602 Roshan. Simulation studies.
BNFO 602 Phylogenetics Usman Roshan. Summary of last time Models of evolution Distance based tree reconstruction –Neighbor joining –UPGMA.
CIS786, Lecture 4 Usman Roshan.
CIS786, Lecture 8 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David.
Computing the Tree of Life The University of Texas at Austin Department of Computer Sciences Tandy Warnow.
Computational and mathematical challenges involved in very large-scale phylogenetics Tandy Warnow The University of Texas at Austin.
Combinatorial and graph-theoretic problems in evolutionary tree reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance Tandy Warnow Department of Computer Sciences University of Texas.
CIPRES: Enabling Tree of Life Projects Tandy Warnow The Program in Evolutionary Dynamics at Harvard University The University of Texas at Austin.
CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin.
Disk-Covering Methods for phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
Phylogenetic Tree Reconstruction Tandy Warnow The Program in Evolutionary Dynamics at Harvard University The University of Texas at Austin.
Complexity and The Tree of Life Tandy Warnow The University of Texas at Austin.
Maximum Parsimony Input: Set S of n aligned sequences of length k Output: –A phylogenetic tree T leaf-labeled by sequences in S –additional sequences of.
Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Gene Order Phylogeny Tandy Warnow The Program in Evolutionary Dynamics, Harvard University The University of Texas at Austin.
Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Evolutionary Trees Usman Roshan Department of Computer Science New Jersey Institute of.
NP-hardness and Phylogeny Reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Software for Scientists Tandy Warnow Department of Computer Science University of Texas at Austin.
CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin.
Benjamin Loyle 2004 Cse 397 Solving Phylogenetic Trees Benjamin Loyle March 16, 2004 Cse 397 : Intro to MBIO.
394C: Algorithms for Computational Biology Tandy Warnow Sept 9, 2013.
394C, Spring 2013 Sept 4, 2013 Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Statistical stuff: models, methods, and performance issues CS 394C September 16, 2013.
CS 394C Algorithms for Computational Biology Tandy Warnow Spring 2012.
Introduction to Phylogenetic Estimation Algorithms Tandy Warnow.
Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
Algorithms research Tandy Warnow UT-Austin. “Algorithms group” UT-Austin: Warnow, Hunt UCB: Rao, Karp, Papadimitriou, Russell, Myers UCSD: Huelsenbeck.
GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular.
Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.
598AGB Basics Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT.
SupreFine, a new supertree method Shel Swenson September 17th 2009.
Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.
Iterative-DCM3: A Fast Algorithmic Technique for Reconstructing Large Phylogenetic Trees Usman Roshan and Tandy Warnow U. of Texas at Austin Bernard Moret.
Approaching multiple sequence alignment from a phylogenetic perspective Tandy Warnow Department of Computer Sciences The University of Texas at Austin.
Simultaneous alignment and tree reconstruction Collaborative grant: Texas, Nebraska, Georgia, Kansas Penn State University, Huston-Tillotson, NJIT, and.
Chapter AGB. Today’s material Maximum Parsimony Fixed tree versions (solvable in polynomial time using dynamic programming) Optimal tree search.
The Tree of Life: Algorithmic and Software Challenges Tandy Warnow The University of Texas at Austin.
Absolute Fast Converging Methods CS 598 Algorithmic Computational Genomics.
394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.
Distance-based phylogeny estimation
The Disk-Covering Method for Phylogenetic Tree Reconstruction
New Approaches for Inferring the Tree of Life
Distance based phylogenetics
Tandy Warnow Department of Computer Sciences
Challenges in constructing very large evolutionary trees
CIPRES: Enabling Tree of Life Projects
BNFO 602 Phylogenetics Usman Roshan.
BNFO 602 Phylogenetics – maximum parsimony
Absolute Fast Converging Methods
CS 581 Tandy Warnow.
CS 581 Tandy Warnow.
Tandy Warnow Department of Computer Sciences
CS 394C: Computational Biology Algorithms
September 1, 2009 Tandy Warnow
Algorithms for Inferring the Tree of Life
Tandy Warnow The University of Texas at Austin
Tandy Warnow The University of Texas at Austin
Presentation transcript:

CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin

Phylogeny Orangutan GorillaChimpanzee Human From the Tree of the Life Website, University of Arizona

Evolution informs about everything in biology Big genome sequencing projects just produce data – so what? Evolutionary history relates all organisms and genes, and helps us understand and predict –interactions between genes (genetic networks) –drug design –predicting functions of genes –influenza vaccine development –origins and spread of disease –origins and migrations of humans

Reconstructing the “Tree” of Life Handling large datasets: millions of species and NP-hard optimization problems NSF funds many projects towards this goal, under the Assembling the Tree of Life program

Cyber Infrastructure for Phylogenetic Research Purpose: to create a national infrastructure of hardware, algorithms, database technology, etc., necessary to infer the Tree of Life. Group: approx. 36 biologists, computer scientists, and mathematicians from 18 institutions. Funding: $11.6 M (large ITR grant from NSF).

CIPRES Members EPFL (Switzerland) Bernard Moret Georgia Tech David Bader UCSD/SDSC Fran Berman Alex Borchers John Huelsenbeck Terri Liebowitz Mark Miller University of Connecticut Paul O Lewis University of Pennsylvania Junhyong Kim Susan Davidson Sampath Kannan Val Tannen Texas A&M Tiffani Williams UT Austin Tandy Warnow David M. Hillis Warren Hunt Robert Jansen Randy Linder Lauren Meyers Daniel Miranker University of Arizona David R. Maddison University of British Columbia Wayne Maddison North Carolina State University Spencer Muse American Museum of Natural History Ward C. Wheeler NJIT Usman Roshan UC Berkeley Satish Rao Steve Evans Richard M Karp Brent Mishler Elchanan Mossel Eugene W. Myers Christos M. Papadimitriou Stuart J. Russell Rice Luay Nakhleh SUNY Buffalo William Piel Florida State University David L. Swofford Mark Holder Yale Michael Donoghue Paul Turner

CIPRES algorithms research (sample) Improved heuristics for NP-hard optimization problems (MP, ML, tree alignment) Obtaining better mathematical theory for phylogeny reconstruction methods under Markov models of evolution Supertree methods Constructing networks rather than trees (detecting and reconstructing reticulate evolution) Whole genome phylogeny

This talk Phylogeny reconstruction through a divide-and-conquer strategy using chordal graph theory –“Absolute fast converging” methods –Improved heuristics for NP-hard optimization problems

DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT TAGCCCATAGACTTAGCGCTTAGCACAAAGGGCAT TAGCCCTAGCACTT AAGACTT TGGACTTAAGGCCT AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT

Phylogeny Problem TAGCCCATAGACTTTGCACAATGCGCTTAGGGCAT UVWXY U VW X Y

1.Heuristics for NP-hard optimization criteria (Maximum Parsimony and Maximum Likelihood) Phylogenetic reconstruction methods Phylogenetic trees Cost Global optimum Local optimum 2.Polynomial time distance-based methods: Neighbor Joining, FastME, etc. 3. Bayesian MCMC methods.

Evaluating phylogeny reconstruction methods In simulation: how “topologically” accurate are trees reconstructed by the method? On real data: how good are the “scores” (typically either MP or ML scores) obtained by the method, as a function of time?

DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT TAGCCCATAGACTTAGCGCTTAGCACAAAGGGCAT TAGCCCTAGCACTT AAGACTT TGGACTTAAGGCCT AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT

Markov models of DNA sequence evolution General Time Reversible (GTR) Markov Model: The state at the root is random The model tree is a pair consisting of a rooted binary tree T with edge lengths, where w(e) indicates the number of times a site changes on edge e. There is a 4x4 symmetric substitution matrix for the sites, so that if a site changes on an edge, it uses the matrix to determine the probability of each change. The evolutionary process is Markovian All sites evolve identically and independently

Distance-based Phylogenetic Methods

Quantifying Error FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FN FP

Neighbor joining has poor accuracy on large diameter model trees [Nakhleh et al. ISMB 2001] Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. NJ No. Taxa Error Rate

Problems with current techniques for MP Shown here is the performance of the TNT software for maximum parsimony on a real dataset of almost 14,000 sequences. The required level of accuracy with respect to MP score is no more than 0.01% error (otherwise high topological error results). (“Optimal” here means best score to date, using any method for any amount of time.) Performance of TNT with time

Empirical problems with existing methods Polynomial time methods have poor topological accuracy on large datasets – we need better polynomial time methods. Heuristics for Maximum Parsimony (MP) and Maximum Likelihood (ML) and Bayesian MCMC methods cannot handle large datasets (take too long!) – we need new heuristics that can analyze large datasets.

“Boosting” phylogeny reconstruction methods DCMs “boost” the performance of phylogeny reconstruction methods. DCM Base method MDCM-M

Graph-theoretic divide-and-conquer (DCM’s) Define a triangulated (i.e. chordal) graph so that its vertices correspond to the input taxa Compute a decomposition of the graph into overlapping subgraphs, thus defining a decomposition of the taxa into overlapping subsets. Apply the “base method” to each subset of taxa, to construct a subtree Merge the subtrees into a single tree on the full set of taxa.

Some properties of chordal graphs Every chordal graph has at most n maximal cliques, and these can be found in polynomial time: Maxclique decomposition. Every chordal graph has a vertex separator which is a maximal clique: Separator-component decomposition. Every chordal graph has a perfect elimination scheme: enables us to merge correct subtrees and get a correct supertree back, if subtrees are big enough.

A separator-component DCM (cartoon)

Strict Consensus Merger (SCM)

DCMs (Disk-Covering Methods) DCMs for polynomial time methods improve topological accuracy (empirical observation), and have provable theoretical guarantees under Markov models of evolution DCMs for hard optimization problems reduce running time needed to achieve good levels of accuracy (empirically observation)

Statistical consistency, convergence rates, and absolute fast convergence

Neighbor Joining’s sequence length requirement is exponential! Atteson: Let T be a General Markov model tree defining additive matrix D. Then Neighbor Joining will reconstruct the true tree with high probability from sequences that are of length at least O(lg n e max Dij ).

DCM1-Boosting [Warnow et al. SODA 2001] DCM1+SQS is a two-phase procedure which reduces the sequence length requirement of methods. DCM1SQS Exponentially converging method Absolute fast converging method

Improving upon NJ Construct trees on a number of smaller diameter subproblems, and merge the subtrees into a tree on the full dataset. Our approach: –Phase I: produce O(n 2 ) trees (one for each diameter) –Phase II: pick the “best” tree from the set.

DCM1 Decompositions DCM1 decomposition :Compute maximal cliques Input: Set S of sequences, distance matrix d, threshold value 1. Compute threshold graph 2. Perform minimum weight triangulation (note: if d is an additive matrix, then the threshold graph is provably chordal).

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001 and Warnow et al. SODA 2001] Theorem: DCM1-NJ converges to the true tree from polynomial length sequences NJ DCM1-NJ No. Taxa Error Rate

What about solving MP and ML? Maximum Parsimony (MP) and maximum likelihood (ML) are the major phylogeny estimation methods used by systematists.

Maximum Parsimony Input: Set S of n aligned sequences of length k Output: A phylogenetic tree T –leaf-labeled by sequences in S –additional sequences of length k labeling the internal nodes of T such that is minimized.

Maximum Parsimony: computational complexity ACT ACA GTT GTA ACAGTA MP score = 4 Finding the optimal MP tree is NP-hard Optimal labeling can be computed in linear time O(nk)

1.Hill-climbing heuristics (which can get stuck in local optima) 2.Randomized algorithms for getting out of local optima 3.Approximation algorithms for MP (based upon Steiner Tree approximation algorithms ). Approaches for “solving” MP/ML Phylogenetic trees Cost Global optimum Local optimum

Problems with current techniques for MP Best methods are a combination of simulated annealing, divide-and-conquer and genetic algorithms, as implemented in the software package TNT. However, they do not reach 0.01% of optimal on large datasets in 24 hours. Performance of TNT with time

Observations The best MP heuristics cannot get acceptably good solutions within 24 hours on most of these large datasets. Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions. Apparent convergence can be misleading.

How can we improve upon existing techniques?

Our objective: speed up the best MP heuristics Time MP score of best trees Performance of hill-climbing heuristic Desired Performance Fake study

DCM Decompositions DCM1 decomposition : Max cliques DCM2 decomposition: Separator plus components Input: Set S of sequences, distance matrix d, threshold value 1. Compute threshold graph 2. Perform minimum weight triangulation

Empirical observation No DCM based upon the threshold graphs gave us an improvement over the best heuristics for MP!

How can we improve upon existing techniques?

A conjecture as to why current techniques are poor: Our studies suggest that trees with near optimal scores tend to be topologically close (RF distance less than 15%) from the other near optimal trees. The best heuristics for MP are based upon the TBR move to explore tree space: there are O(n 3 ) neighbors of every tree, most of which have large RF distances. So TBR may be useful initially (to reach near optimality) but then more “localized” searches are more productive.

Using DCMs differently Observation: DCMs make small local changes to the tree New algorithmic strategy: use DCMs iteratively and/or recursively to improve heuristics on large datasets We needed a decomposition strategy that produces small subproblems quickly.

New DCM3 decomposition Input: Set S of sequences, and guide-tree T 1.We use a new graph (“short subtree graph”) G(S,T)) Note: G(S,T) is chordal! 2. Find clique separator in G(S,T) and form subproblems DCM3 decompositions (1) can be obtained in O(n) time (2) yield small subproblems (3) can be used iteratively

DCM3 decompositions

Iterative-DCM3 T T’ DCM3 Base method

Comparison of DCMs (13,921 sequences) Base method is the TNT-ratchet. “Optimal” refers to the best score found by any method using any amount of time, to date.

Rec-I-DCM3 significantly improves performance Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset Current best techniques DCM boosted version of best techniques

Conclusions (and comments) Rec-I-DCM3 improves upon the best performing heuristics for MP. The improvement increases with the difficulty of the dataset. DCMs also boost the performance of ML heuristics (not shown). Rec-I-DCM3 will be in the first software release from the CIPRES project

Other research projects Simultaneous estimation of tree and multiple sequence alignment Supertree methods Constructing networks rather than trees (detecting and reconstructing reticulate evolution) Obtaining better bounds on sequence length requirements of phylogeny reconstruction methods Whole genome phylogeny Constructing forests rather than trees

Acknowledgments The CIPRES project The National Science Foundation The David and Lucile Packard Foundation The Program for Evolutionary Dynamics at Harvard, and the Radcliffe Institute for Advanced Research The Institute for Cellular and Molecular Biology at UT-Austin Collaborators: Bernard Moret, Usman Roshan, Tiffani Williams, and Daniel Huson.