Gene Order Phylogeny. Tandy Warnow; The Program in Evolutionary Dynamics, Harvard University; The University of Texas at Austin.

Cyber-Infrastructure for Phylogenetic RESearch. Main research: large-scale phylogenetics, reticulate evolution, gene order phylogeny, complex simulations, and databases. Funded by an $11.6M ITR grant from NSF. 40 biologists, computer scientists, and mathematicians are collaborating on the project.

CIPRes Members
University of New Mexico: Bernard Moret, David Bader, Tiffani Williams
UCSD/SDSC: Fran Berman, Alex Borchers, David Stockwell, Phil Bourne, John Huelsenbeck, Dana Jermanis, Mark Miller, Michael Alfaro, Tracy Zhao
University of Connecticut: Paul O Lewis
University of Pennsylvania: Junhyong Kim, Sampath Kannan
UT Austin: Tandy Warnow, David M. Hillis, Warren Hunt, Robert Jansen, Randy Linder, Lauren Meyers, Daniel Miranker, Usman Roshan, Luay Nakhleh
University of Arizona: David R. Maddison
University of British Columbia: Wayne Maddison
North Carolina State University: Spencer Muse
American Museum of Natural History: Ward C. Wheeler
UC Berkeley: Satish Rao, Joseph M. Hellerstein, Richard M Karp, Brent Mishler, Elchanan Mossel, Eugene W. Myers, Christos M. Papadimitriou, Stuart J. Russell
SUNY Buffalo: William Piel
Florida State University: David L. Swofford, Mark Holder
Yale: Michael Donoghue, Paul Turner
Aventis Pharmaceuticals: Lisa Vawter

Limitations of DNA phylogenetics: Deep evolutionary histories may not be recoverable from DNA sequence phylogeny due to lack of specificity, that is, too much noise (homoplasy) and insufficient sequence length. The systematics community has looked to “rare genomic changes” for better sources of phylogenetic signal.

Whole-Genome Phylogenetics [Figure: phylogenetic tree with node labels A, B, C, D, E, F and X, Y, Z, W.]

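Genomes As Signed Permutations: each genome is written as a sequence of signed gene numbers, where the sign records the orientation (strand) of the gene. [Figure: example genomes written as signed permutations, one ending in -3 5 -1.]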

Genomes Evolve by Rearrangements: Inversion (Reversal), Transposition, and Inverted Transposition. [Figure: an example gene order before and after each of the three events.]
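The three events named on this slide can be sketched on a signed permutation stored as a Python list. This is a minimal illustration, not code from GRAPPA or any other package: the function names, the half-open index convention, and the choice of a linear (rather than circular) genome are assumptions made here for brevity.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of every gene in it."""
    segment = [-g for g in reversed(genome[i:j])]
    return genome[:i] + segment + genome[j:]


def transposition(genome, i, j, k):
    """Move the segment genome[i:j] so that it follows genome[j:k] (needs i < j < k)."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Like transposition, but the moved segment is also reversed and sign-flipped."""
    moved = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + moved + genome[k:]


genome = list(range(1, 11))  # hypothetical genome with genes 1..10, all on one strand

print(inversion(genome, 4, 8))
# [1, 2, 3, 4, -8, -7, -6, -5, 9, 10]
print(transposition(genome, 4, 8, 10))
# [1, 2, 3, 4, 9, 10, 5, 6, 7, 8]
print(inverted_transposition(genome, 4, 8, 10))
# [1, 2, 3, 4, 9, 10, -8, -7, -6, -5]
```

Each function returns a new list, so the original genome can be reused to compare the effect of different events.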

Other types of events: Duplications, insertions, and deletions (which change gene content). Fissions and fusions (for genomes with more than one chromosome). These events change the number of copies of each gene in each genome (“unequal gene content”).

Genome Rearrangement Has A Huge State Space. DNA sequences: 4 states per site. Signed circular genomes with n genes: 2^(n-1) (n-1)! states, 1 site. Circular genomes (1 site) with 37 genes (mitochondria): about 2.56 x 10^52 states; with 120 genes (chloroplasts): about 3.70 x 10^232 states.
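The size of this state space can be checked with a short calculation. The sketch below assumes the standard count for signed circular genomes: n! gene orders times 2^n sign choices, divided by n rotations and by 2 for reading the circle from the other strand, which gives 2^(n-1) (n-1)!. The function name is hypothetical.

```python
from math import factorial


def signed_circular_genomes(n):
    """Number of distinct signed circular gene orders on n genes: 2^(n-1) * (n-1)!."""
    return 2 ** (n - 1) * factorial(n - 1)


# Gene counts taken from the slide: 37 (mitochondria) and 120 (chloroplasts).
for n in (37, 120):
    count = signed_circular_genomes(n)
    print(f"n = {n}: {count:.2e} states ({len(str(count))} decimal digits)")
```

For comparison, a single DNA site has 4 states, so one gene-order character carries a vastly larger state space than any one sequence position.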

Why use gene orders? “Rare genomic changes”: huge state space and relative infrequency of events (compared to site substitutions) could make the inference of deep evolution easier, or more accurate. Our research shows this is true, but accurate analysis of gene order data is computationally very intensive!

Phylogeny reconstruction from gene orders
Distance-based reconstruction: estimate pairwise distances, and apply methods like Neighbor-Joining or Weighbor.
“Maximum Parsimony”: find the tree with the minimum length (inversions, transpositions, or other edit distances).
Maximum Likelihood: find the tree and parameters of evolution most likely to generate the observed data.

Maximum Parsimony on Rearranged Genomes (MPRG): The leaves are rearranged genomes. Find the tree that minimizes the total number of rearrangement events (e.g., inversion phylogeny minimizes the number of inversions). [Figure: input genomes A, B, C, D and a tree on them with internal nodes E and F; total length = 18.]
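The quantity being minimized, the length of a tree, is the sum of the edge distances once every internal node has been assigned a genome. The sketch below shows only that scoring step. It swaps in the breakpoint distance on linear signed genomes because it fits in a few lines; the slide's example of inversion distance would need a more involved algorithm. The genomes, node labels, and function names are all hypothetical.

```python
def breakpoint_distance(a, b):
    """Count adjacencies of linear signed genome a that are absent from genome b."""
    # The adjacency (x, y) is the same as (-y, -x) read from the other strand.
    # Simplification: the two ends of the genome are not capped.
    adjacencies = set()
    for x, y in zip(b, b[1:]):
        adjacencies.add((x, y))
        adjacencies.add((-y, -x))
    return sum((x, y) not in adjacencies for x, y in zip(a, a[1:]))


def tree_length(edges, genomes, distance):
    """Sum the distance between the genomes at the two ends of every edge."""
    return sum(distance(genomes[u], genomes[v]) for u, v in edges)


# Toy data: four leaf genomes A-D and candidate internal genomes E and F.
genomes = {
    "A": [1, 2, 3, 4, 5],
    "B": [1, -3, -2, 4, 5],
    "C": [1, 2, 3, -5, -4],
    "D": [1, 2, -4, -3, 5],
    "E": [1, 2, 3, 4, 5],
    "F": [1, 2, 3, 4, 5],
}
edges = [("A", "E"), ("B", "E"), ("E", "F"), ("C", "F"), ("D", "F")]

print(tree_length(edges, genomes, breakpoint_distance))  # 5
```

Solving MPRG means searching over both the tree shape and the internal genomes for the smallest such total, which is where the computational cost comes from.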

Optimization problems for gene order phylogeny
Breakpoint phylogeny: find the phylogeny which minimizes the total number of breakpoints (NP-hard, even to find the median of three genomes).
Inversion phylogeny: find the phylogeny which minimizes the sum of inversion distances on the edges (NP-hard, even to find the median of three genomes).
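To make the "median of three genomes" subproblem concrete, here is a brute-force sketch for the breakpoint version on very small linear signed genomes. It enumerates all 2^n n! signed orderings, so it is only usable for a handful of genes, which is consistent with the problem being NP-hard. This is not how BPAnalysis or GRAPPA solve the problem; the function names and the toy genomes are assumptions.

```python
from itertools import permutations, product


def breakpoint_distance(a, b):
    """Count adjacencies of linear signed genome a that are absent from genome b."""
    adjacencies = set()
    for x, y in zip(b, b[1:]):
        adjacencies.add((x, y))
        adjacencies.add((-y, -x))  # same adjacency read from the other strand
    return sum((x, y) not in adjacencies for x, y in zip(a, a[1:]))


def signed_permutations(n):
    """Yield every signed ordering of genes 1..n (2^n * n! of them)."""
    for order in permutations(range(1, n + 1)):
        for signs in product((1, -1), repeat=n):
            yield [s * g for s, g in zip(signs, order)]


def breakpoint_median(g1, g2, g3):
    """Exhaustively find a genome minimizing the summed breakpoint distance to g1, g2, g3."""
    def total(m):
        return sum(breakpoint_distance(m, g) for g in (g1, g2, g3))

    best = min(signed_permutations(len(g1)), key=total)
    return best, total(best)


median, score = breakpoint_median([1, 2, 3, 4], [1, -3, -2, 4], [1, 2, -4, -3])
print(median, score)
```

The median need not be unique; `min` simply returns the first optimal genome it encounters.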

Inversion phylogenies. [Figure: tree length plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked.] When the data are close to saturated, even the best distance-based analyses are insufficiently accurate. In these cases, our initial investigations suggest that the inversion phylogeny approach may be superior. Problem: finding the best trees is enormously hard, since even the “point estimation” problem is hard (worse than estimating branch lengths in ML).

Observations: For equal gene content, heuristics for the inversion phylogeny problem are extremely accurate, even under model conditions in which transpositions are dominant. For unequal gene content, the parsimony-style problems are too computationally intensive, but NJ (neighbor joining) with a new distance estimator (Moret et al. 2004) works extremely well.

Software
BPAnalysis (Sankoff): open source, restricted to breakpoint phylogeny reconstruction.
GRAPPA (Moret et al.): open source, restricted to single-chromosome genomes, but can handle both equal and unequal gene content.
MGR (Pevzner et al.): handles multiple chromosomes, limited to equal gene content, performs well if the dataset is small (fewer than 10 genomes).
Bayesian analysis by Bret Larget (not yet released).

[Figure: tree on 13 taxa: Tobacco, Platycodon, Cyananthus, Asyneuma, Triodanis, Legousia, Merciera, Wahlenbergia, Symphyandra, Adenophora, Trachelium, Campanula, Codonopsis.] The strict consensus of 24 trees, each with inversion length 64. The analysis finished within 40 minutes on a laptop using GRAPPA version 1.8.

GRAPPA (Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms): Heuristics for maximum parsimony style problems for equal gene content. Fast polynomial time distance-based methods. Contributors: U. New Mexico, U. Texas at Austin, Università di Bologna, Italy. Freely available in source code at this site. Project leader: Bernard Moret (UNM).

Speeding up MP and ML: DCM3 Tandy Warnow Radcliffe Institute The University of Texas at Austin

Reconstructing the “Tree” of Life. Handling large datasets: millions of species.

Methods for phylogenetic inference: polynomial time methods, mostly based upon estimating evolutionary distances; heuristics for hard optimization problems (such as maximum parsimony and maximum likelihood); Bayesian methods.

Main research objectives: Determine the best current methods available for MP and ML, and then improve upon them. Focus on performance within one day, one week, or one month, on large real datasets (1K to 20K sequences for MP). The final objective is hundreds of thousands (or millions) of sequences.

Initial results: Very large datasets are hard for both MP and ML, no matter what software is used. Suboptimal solutions to MP yield reasonable estimates of the optimal MP trees, but only if they are within 0.01% of the optimal MP score. Improving upon techniques for searching treespace will yield improvements for both MP and ML.

Datasets
1322  lsu rRNA of all organisms
2000  Eukaryotic rRNA
2594  rbcL DNA
4583  Actinobacteria 16S rRNA
6590  ssu rRNA of all Eukaryotes
7180  three-domain rRNA
7322  Firmicutes bacteria 16S rRNA
8506  three-domain+2org rRNA
ssu rRNA of all Bacteria
Proteobacteria 16S rRNA
Obtained from various researchers and online databases.

Problems with current techniques for MP. [Figure: average MP scores above “optimal” for the best methods at 24 hours, across 10 datasets.] The best current techniques fail to come within 0.01% of optimal by the end of 24 hours on large datasets.

Problems with current techniques for MP: The best current method (default TNT) fails to reach acceptable levels of accuracy (within 0.01% of “optimal”) within 24 hours on many large datasets; evidence suggests that this level will not be reached for weeks or months (or more) of further analysis. [Figure: performance of TNT with time.]

Observations The best methods cannot get acceptably good solutions within 24 hours on most of these large datasets. Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions. Apparent convergence can be misleading.

Disk-Covering Methods (DCMs): DCMs are divide-and-conquer methods that our group has developed for use in phylogeny reconstruction. DCM2 was designed for speeding up maximum parsimony and maximum likelihood heuristics. DCM2 was good enough for PAUP*. DCM3 is a recent improvement over DCM2 which enables iteration (and gives smaller subproblems), and is good enough for TNT.

“Boosting” MP heuristics: DCMs “boost” the performance of phylogeny reconstruction methods. [Figure: a DCM combined with a base method M gives the boosted method DCM-M.]

DCM3 technique for speeding up MP searches

Iterative-DCM3 [Figure: a tree T is passed through DCM3 and the base method to produce a new tree T’.]

New DCMs
DCM3:
1. Compute subproblems using DCM3 decomposition
2. Apply base method to each subproblem to yield subtrees
3. Merge subtrees using the Strict Consensus Merger technique
4. Randomly refine to make it binary
Recursive-DCM3
Iterative DCM3:
1. Compute a DCM3 tree
2. Perform local search and go to step 1
Recursive-Iterative DCM3
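The control flow of the DCM3 and Iterative DCM3 steps above can be sketched as a skeleton with pluggable components. Everything here is an assumption made for illustration: the function names, the fixed iteration count (the slide gives no stopping rule), and especially the toy stand-ins, which are not a real decomposition, base method, or Strict Consensus Merger.

```python
def dcm3_tree(tree, taxa, decompose, base_method, merge, refine):
    """One DCM3 pass: decompose, solve each subproblem, merge, refine to binary."""
    subproblems = decompose(tree, taxa)                         # step 1
    subtrees = [base_method(subset) for subset in subproblems]  # step 2
    merged = merge(subtrees)                                    # step 3
    return refine(merged)                                       # step 4


def iterative_dcm3(start_tree, taxa, local_search, iterations, **steps):
    """Alternate a DCM3 pass with local search for a fixed number of rounds."""
    tree = start_tree
    for _ in range(iterations):
        tree = dcm3_tree(tree, taxa, **steps)
        tree = local_search(tree)
    return tree


# Toy stand-ins so the skeleton runs; none of these is a real phylogenetic method.
def toy_decompose(tree, taxa):
    half = len(taxa) // 2
    return [taxa[: half + 1], taxa[half:]]  # two overlapping subsets of taxa


def toy_base_method(subset):
    return tuple(sorted(subset))  # a "tree" is just a sorted tuple of taxa


def toy_merge(subtrees):
    return tuple(sorted(set().union(*subtrees)))


def toy_identity(tree):
    return tree


result = iterative_dcm3(
    start_tree=None,
    taxa=["A", "B", "C", "D", "E"],
    local_search=toy_identity,
    iterations=2,
    decompose=toy_decompose,
    base_method=toy_base_method,
    merge=toy_merge,
    refine=toy_identity,
)
print(result)  # ('A', 'B', 'C', 'D', 'E')
```

Passing the components as functions mirrors the "boosting" idea: the same wrapper can be applied to any base method.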

“Boosting” MP heuristics: We examine DCMs using DCM2 and DCM3, and using recursion and/or iteration. [Figure: a DCM combined with a base method M gives the boosted method DCM-M.]

Performance Study: How well do these “boosted” versions of the best MP heuristics perform, compared to the best MP heuristics? We examine performance with respect to “optimal” MP scores (best found so far, using any method) for a number of very large datasets, over 24 hours. The benchmark MP heuristic is the default TNT.
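The accuracy criterion used throughout these slides, a score within 0.01% of the best known ("optimal") MP score, is simple arithmetic. The sketch below shows one way to compute it; the function names and the scores are hypothetical and are not results from the study.

```python
def percent_above_optimal(score, best_known):
    """Relative excess of an MP score over the best known score, in percent."""
    return 100.0 * (score - best_known) / best_known


def is_acceptable(score, best_known, threshold_percent=0.01):
    """True if the score is within threshold_percent of the best known score."""
    return percent_above_optimal(score, best_known) <= threshold_percent


# Hypothetical scores, for illustration only.
best_known = 200_000
for score in (200_010, 200_050):
    excess = percent_above_optimal(score, best_known)
    print(score, f"{excess:.4f}%", is_acceptable(score, best_known))
```

With a best known score of 200,000 steps, the 0.01% threshold corresponds to only 20 extra steps, which shows how tight the criterion is.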

Rec-I-DCM3 significantly improves performance. [Figure: comparison of TNT to Rec-I-DCM3(TNT) on one large dataset; curves labeled “Current best techniques” and “DCM boosted version of best techniques”.]

Rec-I-DCM3(TNT) vs. TNT (comparison of scores at 24 hours): The base method is the default TNT technique, the current best method for MP. Rec-I-DCM3 significantly improves upon the unboosted TNT by returning trees which are at most 0.01% above optimal on most datasets.

Summary: Rec-I-DCM3 is a powerful technique for escaping local optima, and “boosts” the performance of the best heuristics for solving MP. The improvement increases with the difficulty of the dataset: Rec-I-DCM3(TNT) is 50 times faster than TNT on our hardest datasets, but we expect even bigger speedups in our next version. DCMs also boost the performance of Maximum Likelihood heuristics (not shown).

Acknowledgements. Collaborators: Bernard Moret (UNM), Usman Roshan (UT-Austin), and Tiffani Williams (UNM). Funding: NSF, The David and Lucile Packard Foundation, The Radcliffe Institute for Advanced Study, The Institute for Cellular and Molecular Biology at UT-Austin, and The Program in Evolutionary Dynamics at Harvard University. Software will be part of the CIPRES Project’s first distribution - see

Cyber-Infrastructure for Phylogenetic RESearch. Main research: large-scale phylogenetics, reticulate evolution, gene order phylogeny, and databases. Funded by an $11.6M ITR grant from NSF. 40 biologists, computer scientists, and mathematicians are collaborating on the project.