Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Evolutionary Trees
Usman Roshan, Department of Computer Science, New Jersey Institute of Technology

Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Evolutionary Trees Usman Roshan Department of Computer Science New Jersey Institute of Technology

Phylogeny
[Figure: phylogenetic tree of Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life website, University of Arizona]

DNA Sequence Evolution
[Figure: DNA sequences evolving down a tree from the ancestral sequence AAGACTT to the present-day sequences TAGCCCA, TAGACTT, AGCGCTT, AGCACAA, and AGGGCAT; time axis: -3 mil yrs, -2 mil yrs, -1 mil yrs, today]

Phylogeny Problem
Sequences: TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT
Taxa: U, V, W, X, Y
[Figure: a tree with leaves U, V, W, X, Y]

Why construct phylogenies?
Evolutionary history relates all organisms and genes, and helps us understand
– interactions between genes (genetic networks)
– functions of genes
– influenza vaccine development
– origins and spread of disease
– origins and migrations of humans
– drug design

Phylogenetic reconstruction methods
1. Hill-climbing heuristics for hard optimization criteria: Maximum Parsimony and Maximum Likelihood
   [Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked]
2. Polynomial time distance-based methods: Neighbor Joining, etc.
3. Bayesian methods

Maximum Parsimony (a.k.a. the Steiner Tree problem in phylogenetics)
Input: Set S of n aligned sequences of length k
Output: A phylogenetic tree T
– leaf-labeled by sequences in S
– with additional sequences of length k labeling the internal nodes of T
such that $\sum_{(u,v) \in E(T)} H(u,v)$ is minimized, where H(u,v) is the Hamming distance between the sequences labeling u and v.
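The quantity being minimized can be made concrete with a small sketch. The code below only scores one fully labeled tree; it does not search for the best tree or the best internal labels. The function names, the four-leaf toy tree, and its sequences are illustrative assumptions, not taken from the slides.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def tree_length(edges, labels):
    """Parsimony length: sum of Hamming distances over all tree edges."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


# Hypothetical toy tree: leaves A, B, C, D and internal nodes x, y.
toy_labels = {
    "A": "ACGT", "B": "ACGA", "C": "TCGA", "D": "TCTA",
    "x": "ACGA", "y": "TCGA",
}
toy_edges = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]

print(tree_length(toy_edges, toy_labels))
```

Maximum Parsimony asks for the tree topology and internal labels that make this sum as small as possible, which is what makes the problem hard.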

Very large tree space for MP and ML
– The number of (unrooted) binary trees on n leaves is (2n-5)!!
– If each tree on 1000 taxa could be analyzed in seconds, we would find the best tree in 2890 millennia
[Table: #leaves, #trees]
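A short sketch of the (2n-5)!! count shows how quickly the tree space grows. The function name is an assumption made for illustration; the formula itself is the one stated on the slide.

```python
def num_unrooted_binary_trees(n: int) -> int:
    """(2n-5)!! = 1 * 3 * 5 * ... * (2n-5), for n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

Python integers are arbitrary precision, so the same function also works for n = 1000, where the result has thousands of digits.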

Problems with heuristics for ML and MP
– Many software packages implement heuristics for finding MP and ML trees: PAUP*, PHYLIP, MrBayes, TNT, …
– Heuristics for Maximum Parsimony (MP) and Maximum Likelihood (ML) cannot handle large datasets: they get trapped in local optima

Problems with current heuristics
– Current best technique for MP: the TNT software package, available from Pablo Goloboff.
– TNT is still well above the best known score even after 168 hours (7 days) of computation.
– A separate study of ours shows that trees scoring more than 0.01% above “optimal” can differ significantly in structure, whereas trees close to the 0.01% threshold are topologically similar.

Our approach
Use Disk-Covering Methods (DCMs) to boost the performance of the best existing technique.

The Warnow et al. DCM2 technique for speeding up MP/ML searches
1. Decompose sequences into overlapping subproblems
2. Compute subtrees using a base method
3. Merge subtrees using the Strict Consensus Merge (SCM)
4. Refine to make the tree binary

Problems with DCM1 and DCM2
– DCM1 was designed to improve the statistical performance of distance-based methods such as NJ. It does not help with MP or ML analyses: too many subproblems and much loss of resolution after the merger.
– DCM2 helps with MP and ML analyses, but only on some datasets: the decomposition doesn't reduce size enough and takes too long to compute.

DCM2 decomposition vs DCM3 decomposition (our new DCM)

DCM2
Input: distance matrix d, threshold q, sequences S
Algorithm:
1a. Compute a threshold graph G using q and d
1b. Perform a minimum weight triangulation of G

DCM3
Input: guide-tree T on S, sequences S
Algorithm:
1. Compute a short quartet graph G using T. The graph G is provably triangulated.

Both methods then continue with the same steps:
2. Find a separator X in G which minimizes $\max_i |X \cup C_i|$, where $C_1, \ldots, C_r$ are the connected components of G – X
3. Output the subproblems as $X \cup C_i$

DCM3 advantage: it is faster and produces smaller subproblems than DCM2

DCM2 vs DCM3 (threshold graph vs short quartet graph)

DCM2 threshold graph
– V = {sequences}
– E = {(i,j): d(i,j) <= q}
– q is at least the minimum required to make the graph connected
– G can become very dense, especially given outliers, and thus produce large separators (on the order of 80%) and subproblems up to 90% in size!

DCM3 short quartet graph
– V = {sequences}
– E = {(i,j): sequences i and j are in some short quartet}
– G is not so dense, and any outlier will be in one short quartet only
– Separators are small and subproblems are at most 50% in size in practice
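The threshold graph and the effect of an outlier can be sketched in a few lines. The code below finds the smallest q that connects the graph (using a union-find pass over the sorted distances) and then lists the edges at that threshold. The function names and the toy distance matrix are assumptions for illustration; this is not the DCM2 implementation.

```python
from itertools import combinations


def min_connecting_threshold(d):
    """Smallest q for which the graph with edges d[i][j] <= q is connected."""
    n = len(d)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    pairs = sorted((d[i][j], i, j) for i, j in combinations(range(n), 2))
    for w, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
            if components == 1:
                return w
    raise ValueError("need at least two sequences")


def threshold_graph(d, q):
    """Edge list of the threshold graph: all pairs at distance <= q."""
    return [(i, j) for i, j in combinations(range(len(d)), 2) if d[i][j] <= q]


# Hypothetical distances: sequences 0-2 are close, sequence 3 is an outlier.
toy_d = [
    [0, 1, 2, 9],
    [1, 0, 1, 10],
    [2, 1, 0, 8],
    [9, 10, 8, 0],
]
q = min_connecting_threshold(toy_d)
print(q, threshold_graph(toy_d, q))
```

In this toy input the outlier forces q up to 8, at which point every pair among the three close sequences is already an edge. This is the density effect the slide describes.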

Time to compute DCM3 decompositions
– An optimal DCM3 decomposition takes O(n^3) time to compute, the same as for DCM2
– The centroid edge DCM3 decomposition can be computed in O(n^2) time
– An approximate centroid edge decomposition can be computed in O(n) time

DCM3 decomposition – example
1. Locate the centroid edge e in the guide tree T (O(n) time)
2. Set the closest leaves around e to be the separator (O(n) time)
3. The remaining leaves in the subtrees around e form the subsets (each unioned with the separator)
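The three steps above can be sketched on a small unweighted tree. This is a simplified illustration under stated assumptions: the tree is an unrooted binary tree with at least four leaves, "closest" is measured in number of edges, exactly one nearest leaf per subtree goes into the separator, and the centroid edge is found by a naive quadratic scan rather than in O(n) time. All function names and the example tree are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def leaf_depths(adj, start, blocked):
    """BFS from start without crossing into blocked; returns {leaf: depth}."""
    dist = {start: 0}
    queue = deque([start])
    found = {}
    while queue:
        u = queue.popleft()
        if len(adj[u]) == 1:  # degree-1 node is a leaf
            found[u] = dist[u]
        for v in adj[u]:
            if v != blocked and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return found


def centroid_edge(adj):
    """Edge whose removal splits the leaf set most evenly (naive scan)."""
    total = sum(1 for u in adj if len(adj[u]) == 1)
    best_edge, best_gap = None, None
    for u in adj:
        for v in adj[u]:
            gap = abs(2 * len(leaf_depths(adj, u, v)) - total)
            if best_gap is None or gap < best_gap:
                best_edge, best_gap = (u, v), gap
    return best_edge


def centroid_decomposition(adj):
    u, v = centroid_edge(adj)
    if len(adj[u]) == 1 or len(adj[v]) == 1:
        raise ValueError("centroid edge touches a leaf; tree is too small")
    # Subtrees hanging off each endpoint of e, excluding e itself.
    subtrees = [leaf_depths(adj, w, end)
                for end, other in ((u, v), (v, u))
                for w in adj[end] if w != other]
    separator = {min(s, key=s.get) for s in subtrees}  # nearest leaf per subtree
    subsets = [set(s) | separator for s in subtrees]
    return (u, v), separator, subsets


# Hypothetical 8-leaf tree; x-y is the centroid edge (4 leaves on each side).
example_edges = [
    ("x", "y"),
    ("x", "a"), ("x", "p"), ("p", "b"), ("p", "p2"), ("p2", "c"), ("p2", "g"),
    ("y", "d"), ("y", "q"), ("q", "e"), ("q", "q2"), ("q2", "f"), ("q2", "h"),
]
edge, separator, subsets = centroid_decomposition(build_adjacency(example_edges))
print(edge, sorted(separator), [sorted(s) for s in subsets])
```

Every subset contains the whole separator, so the subproblems overlap and their subtrees can later be merged.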

Improving upon the DCM3 tree: Iterative-DCM3
[Figure: a cycle. The starting tree T is passed to DCM3, which produces T'; local search on T' yields the next T]

Recursive-Iterative-DCM3
[Figure: a cycle. The starting tree T is passed to Recursive-DCM3, which produces T'; local search on T' yields the next T]
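The cycle in the two diagrams above can be read as a simple driver loop. The sketch below is only a skeleton under assumptions: `dcm_step`, `local_search`, and `score` are hypothetical callables supplied by the caller (for example, a DCM3 or Recursive-DCM3 pass, a base-method search, and a parsimony score), and tracking the best tree seen so far is a choice made here, not something the slides specify.

```python
def iterative_dcm(start_tree, dcm_step, local_search, score, iterations):
    """Alternate a DCM pass and a local search, remembering the best tree."""
    tree = start_tree
    best, best_score = start_tree, score(start_tree)
    for _ in range(iterations):
        tree = local_search(dcm_step(tree))  # T -> T' -> next T
        current = score(tree)
        if current < best_score:  # lower score is better (e.g. parsimony)
            best, best_score = tree, current
    return best, best_score


# Toy stand-ins: "trees" are integers and the score is the integer itself.
result = iterative_dcm(
    start_tree=10,
    dcm_step=lambda t: max(t - 3, 0),
    local_search=lambda t: max(t - 1, 0),
    score=lambda t: t,
    iterations=2,
)
print(result)
```

Passing a recursive decomposition as `dcm_step` corresponds to the Recursive-Iterative variant; the loop itself does not change.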

Experimental design
We compare methods on 10 real datasets obtained from researchers and public databases:
1. lsu rRNA of all organisms
2. Eukaryotic rRNA (2000 sequences)
3. rbcL DNA (2560 sequences)
4. Actinobacteria 16s rRNA (4114 sequences)
5. ssu rRNA of all Eukaryotes (6281 sequences)
6. Firmicutes bacteria 16s rRNA (6458 sequences)
7. three-domain rRNA (6722 sequences)
8. three-domain+2org rRNA (7769 sequences)
9. ssu rRNA of all Bacteria
10. Proteobacteria 16s rRNA

Experimental design
– Datasets: 10 real datasets ranging from 1127 to sequences (DNA and rRNA)
– Methods studied:
  – Recursive-Iterative-DCM3 (Rec-I-DCM3) with 1/4th and 1/8th subset sizes
  – TNT (combination of simulated annealing, divide-and-conquer, and genetic algorithms)
– Five runs of each method:
  – Rec-I-DCM3: each run for 168 hours (1 week)
  – TNT: each run for 336 hours (2 weeks)

Results
1. Performance as a function of time on dataset of sequences
2. Comparison of scores found at 168 hours
3. Rec-I-DCM3 speedup over TNT

[Figure: 5 to 60 minutes on sequences]
All three methods are well above the best score in the first hour.

[Figure: 1 to 24 hours on sequences]
Rec-I-DCM3 scores improve faster than TNT.

[Figure: 1 to 336 hours on sequences]
Rapid improvement for both methods in the first 24 hours. Rec-I-DCM3 continues to improve faster than TNT thereafter.

[Figure: 1 to 336 hours on sequences, with all five runs plotted]
Plotting all five runs of each method shows statistically sound results. Similar behavior on all datasets.

Average percent above the best known score found to date on each dataset
[Figure: two panels, 24 hours and 168 hours]
By 168 hours, Rec-I-DCM3 halves its percentage above optimal, whereas TNT improves slowly.

(avg Rec-I-DCM3 time / avg TNT time) to reach the avg TNT 2-week score
On average, Rec-I-DCM3 reaches the average TNT score 25 times faster than TNT on datasets 9 and 10, and 50 times faster on datasets 6 and 8.

Software Open source DCM3 code available from See CIPRES ( Contact Usman Roshan or Tandy Warnow for more

Future work
– DCM3–ML
– Biological discoveries from large dataset analysis
– Optimal subset size

Acknowledgements
This work was done in collaboration with
– Tandy Warnow (UT Austin and Radcliffe)
– Bernard Moret (UNM)
– Tiffani Williams (UNM)
Thanks to
– Pablo Goloboff for support on TNT
– Dave Swofford (FSU) for support on PAUP*
– Robin Gutell (UT Austin) for providing large accurate alignments
– Doug Burger (UT Austin) and Steve Keckler (UT Austin) for usage of the SCOUT Pentium and Mastadon Xeon clusters