Large-Scale Phylogenetic Analysis Tandy Warnow Associate Professor Department of Computer Sciences Graduate Program in Evolution and Ecology Co-Director.

Large-Scale Phylogenetic Analysis Tandy Warnow Associate Professor Department of Computer Sciences Graduate Program in Evolution and Ecology Co-Director The Center for Computational Biology and Bioinformatics The University of Texas at Austin

Outline of Talk
Phylogenetic reconstruction from DNA sequences: the problems, and the progress
Phylogenetic reconstruction from gene order and content in whole genomes: initial work
The future of large-scale phylogeny, and the possibilities of inferring the “Tree of Life”

I. Molecular Systematics [Figure: five sequences, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT, labelled U, V, W, X, Y, and a tree with leaves U, V, W, X, Y]

DNA Sequence Evolution [Figure: sequences evolving down a tree. -3 mil yrs: AAGACTT. -2 mil yrs: TGGACTT, AAGGCCT. -1 mil yrs: AGGGCAT, TAGCCCT, AGCACTT. Today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT]

Major Phylogenetic Reconstruction Methods
Polynomial-time distance-based methods (neighbor joining is the most popular)
NP-hard sequence-based methods
–Maximum Parsimony
–Maximum Likelihood
There are heated debates over the relative performance of these methods.

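Quantifying Error
FN: false negative (an edge of the true tree that is missing from the estimated tree)
FP: false positive (an edge of the estimated tree that is not in the true tree)
[Figure: example with a 50% error rate, with the FN and FP edges marked]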
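
The two rates can be computed by comparing bipartitions, as in this minimal sketch. It assumes each internal edge is represented by the bipartition of the taxa it induces; the helper names `canonical` and `error_rates` and the five-taxon example are illustrative assumptions, not taken from the talk.

```python
TAXA = frozenset("ABCDE")

def canonical(side, taxa=TAXA, anchor="A"):
    """Represent a bipartition by the side that contains the anchor taxon."""
    side = frozenset(side)
    return side if anchor in side else taxa - side

def error_rates(true_splits, est_splits):
    """Return (FN rate, FP rate) for two sets of canonical bipartitions."""
    true_splits, est_splits = set(true_splits), set(est_splits)
    fn = len(true_splits - est_splits) / len(true_splits) if true_splits else 0.0
    fp = len(est_splits - true_splits) / len(est_splits) if est_splits else 0.0
    return fn, fp

# True tree ((A,B),C,(D,E)) has internal edges AB|CDE and ABC|DE.
true_tree = {canonical("AB"), canonical("DE")}
# Estimated tree ((A,B),D,(C,E)) has internal edges AB|CDE and ABD|CE.
est_tree = {canonical("AB"), canonical("CE")}

fn_rate, fp_rate = error_rates(true_tree, est_tree)
print(fn_rate, fp_rate)  # 0.5 0.5
```

In this example one of the two true edges is missed and one of the two estimated edges is wrong, so both rates are 50%.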

Main Result: DCM-Boosting and DCM-NJ+ML
We have developed the first polynomial-time methods that improve upon NJ (with respect to topological accuracy) and are never worse than NJ. The methods are obtained through DCM-boosting.

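Basis of Distance-Based Methods: Additivity
A distance matrix $D$ is additive if there exists a tree $T$ and an edge weighting $w$ such that $D_{ij} = \sum_{e \in P_{ij}} w(e)$, where $P_{ij}$ is the path between leaves $i$ and $j$ in $T$.
Waterman et al. (1977) showed that: [statement not preserved in the transcript]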
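
Additivity can be tested directly on a matrix. The sketch below uses the four-point condition, a standard characterization of additive matrices that the slide itself does not name: for every four taxa, the two largest of the three pairwise distance sums must be equal. The function name `is_additive` and the example matrices are assumptions for illustration.

```python
from itertools import combinations

def is_additive(d, tol=1e-9):
    """Four-point check on a symmetric matrix d with zero diagonal."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        # The two largest of the three pairwise sums must agree.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Tree ((A,B),(C,D)): every leaf edge has weight 1, the internal edge weight 2.
tree_metric = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
noisy = [row[:] for row in tree_metric]
noisy[0][2] = noisy[2][0] = 5  # perturb d(A,C)

print(is_additive(tree_metric))  # True
print(is_additive(noisy))        # False
```

Distances estimated from finite sequences are rarely exactly additive, which is why the later slides ask how close to additive the estimates need to be.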

Distance-based Phylogenetic Methods

Statistical Consistency
Atteson (1999) showed that NJ returns the true tree if the error in the estimated distance matrix is small enough. [Figure: error as a function of sequence length]
Hence NJ is statistically consistent for many models of evolution. But what about performance on finite sequence lengths?

We focus on performance on finite sequence lengths

Absolute fast convergence vs. exponential convergence

General Markov (GM) Model
A GM model tree is a pair $(T, M)$ where
–$T$ is a rooted binary tree.
–$M = \{M_e : e \in E(T)\}$, and $M_e$ is a stochastic substitution matrix with [condition not preserved in the transcript].
–The sequence at the root of $T$ is drawn from a uniform distribution.
–The rates of evolution across the sites can be drawn from a fixed distribution.
GM contains models like the Jukes-Cantor (JC) and Kimura 2-Parameter (K2P) models.

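Absolute Fast Convergence
Let $f$ and $g$ be lower and upper bounds on the edge lengths of a model tree [exact definitions not preserved in the transcript], and parameterize the GM model as $GM_{f,g}$.
A phylogenetic reconstruction method $\Phi$ is absolute fast-converging (AFC) for the GM model if for all positive $f, g, \varepsilon$ there is a polynomial $p$ such that for all $(T, M) \in GM_{f,g}$, on a set $S$ of $n$ sequences of length at least $p(n)$ generated on $T$, we have $\Pr[\Phi(S) = T] > 1 - \varepsilon$.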

Theoretical Comparison of Early AFC Methods to NJ
Theorem 1 [Warnow et al. 2001]: DCM-NJ+SQS is absolute fast converging for the GM model.
Theorem 2 [Csűrös 2001]: HGT+FP is absolute fast converging for the GM model.
Theorem 3 [Atteson 1999]: NJ is exponentially converging for the GM model (but is not known to be AFC).

DCM-Boosting [Warnow et al. 2001]
DCM+SQS is a two-phase procedure which reduces the sequence length requirement of methods.
[Diagram: exponentially converging method, then DCM, then SQS, yielding an absolute fast converging method]
DCM-NJ+SQS is the result of DCM-boosting NJ.

Experimental Comparison of Early AFC Methods to NJ rbcL 500-taxon tree Jukes-Cantor model Avg. branch length = 0.264

Improving upon early AFC methods These early AFC methods outperform NJ only on long enough sequences and on large enough trees with high enough rates of evolution. Hence we need new fast converging methods which improve upon NJ on more of the parameter space, and are never worse than NJ. We modify the second phase to improve the empirical performance, replacing SQS with ML (maximum likelihood) or MP (maximum parsimony).

DCM-NJ+ML vs. other methods on a fixed tree
500-taxon rbcL tree
K2P+Γ model (two parameters set to 2 and 1; the parameter symbols were not preserved)
Avg. branch length = [value not preserved]
Typical performance

Comparison of methods on random trees as a function of the number of taxa
Random tree topologies
K2P+Γ model (two parameters set to 2 and 1; the parameter symbols were not preserved)
Avg. branch length = 0.05
Seq. length = 1000

Summary
These are the first polynomial-time methods that improve upon NJ (with respect to topological accuracy) and are never worse than NJ.
The advantage obtained with DCM-NJ+MP and DCM-NJ+ML increases with the number of taxa.
In practice these new methods are slower than NJ (minutes vs. seconds), but still much faster than MP and ML (which can take days).
Conjecture: DCM-NJ+ML is AFC.

II. Whole-Genome Phylogeny [Figure: tree with leaf genomes A, B, C, D, E, F and internal nodes X, Y, Z, W]

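Genomes As Signed Permutations
Each genome is written as a signed permutation of its genes, with the sign giving the orientation of the gene. [Example permutations only partly preserved: 1 – or –3 5 –1 etc.]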

Genomes Evolve by Rearrangements
Inversion (Reversal)
Transposition
Inverted Transposition
[Figure: one example of each event on a signed permutation; only fragments such as –8 –7 –6 –5 were preserved]
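
The three event types can be sketched as operations on a signed permutation. This is an illustrative sketch only: it treats the genome as a linear Python list (the talk works with circular genomes), and the function names and index conventions are assumptions.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i..j] (inclusive) and flip every sign in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j + 1])] + genome[j + 1:]

def transposition(genome, i, j, k):
    """Cut out genome[i..j] and reinsert it at index k of what remains."""
    segment, rest = genome[i:j + 1], genome[:i] + genome[j + 1:]
    return rest[:k] + segment + rest[k:]

def inverted_transposition(genome, i, j, k):
    """Like a transposition, but the moved segment is also inverted."""
    segment, rest = genome[i:j + 1], genome[:i] + genome[j + 1:]
    return rest[:k] + [-g for g in reversed(segment)] + rest[k:]

genome = list(range(1, 11))  # genes 1..10, all in the forward orientation
print(inversion(genome, 4, 7))               # [1, 2, 3, 4, -8, -7, -6, -5, 9, 10]
print(transposition(genome, 4, 7, 1))        # [1, 5, 6, 7, 8, 2, 3, 4, 9, 10]
print(inverted_transposition(genome, 4, 7, 1))  # [1, -8, -7, -6, -5, 2, 3, 4, 9, 10]
```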

Genome Rearrangement Has A Huge State Space
DNA sequences: 4 states per site
Signed circular genomes with n genes: $2^{n-1}(n-1)!$ states, 1 site
Circular genomes (1 site)
–with 37 genes: about $2.56 \times 10^{52}$ states
–with 120 genes: about $3.70 \times 10^{232}$ states
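
To give a sense of scale, the sketch below counts signed circular genomes using the standard formula $2^{n-1}(n-1)!$. The formula is stated here as an assumption, since the slide's own expression did not survive extraction, and the function name is hypothetical.

```python
import math

def signed_circular_genomes(n):
    """Number of signed circular gene orders on n genes, assuming 2^(n-1) * (n-1)!."""
    return 2 ** (n - 1) * math.factorial(n - 1)

for n in (3, 37, 120):
    # Python integers are exact; format as a float only for display.
    print(n, f"{signed_circular_genomes(n):.2e}")
```

Under this count, 37 genes give roughly 2.56e52 states and 120 genes roughly 3.70e232, compared with 4 states per site for DNA.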

Distance-based Phylogenetic Methods for Genomes

Genomic Distance Estimators
Standard:
–Breakpoint distance
–(Minimum) Inversion distance
Our estimators: we attempt to estimate the actual number of events (the “true evolutionary distance”):
–EDE [Moret et al., ISMB’01]
–Approx-IEBP [Wang and Warnow, STOC’01]
–Exact-IEBP [Wang, WABI’01]

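Breakpoint Distance
The breakpoint distance between two genomes is the number of adjacencies present in one genome but not in the other. [Figure: worked example; the permutations and the resulting distance were not preserved]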
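
A minimal sketch of the breakpoint distance for signed gene orders follows. It assumes both genomes contain the same genes, counts an adjacency as preserved if it appears in either reading direction, and treats genomes as circular by default; the function names are illustrative.

```python
def adjacencies(genome, circular=True):
    """Ordered gene pairs that sit next to each other, read along the genome."""
    pairs = list(zip(genome, genome[1:]))
    if circular:
        pairs.append((genome[-1], genome[0]))
    return pairs

def breakpoint_distance(g1, g2, circular=True):
    """Count adjacencies of g1 that appear in g2 in neither orientation."""
    kept = set()
    for a, b in adjacencies(g2, circular):
        kept.add((a, b))
        kept.add((-b, -a))  # the same adjacency read in the opposite direction
    return sum(1 for pair in adjacencies(g1, circular) if pair not in kept)

identity = list(range(1, 11))
one_inversion = [1, 2, 3, 4, -8, -7, -6, -5, 9, 10]
print(breakpoint_distance(identity, identity))       # 0
print(breakpoint_distance(identity, one_inversion))  # 2
```

A single inversion breaks at most two adjacencies, which is why the measured distance underestimates the actual number of events once many events have occurred.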

Minimum Inversion Distance [Figure: a sequence of three inversions transforming one signed permutation into another] Inversion distance = 3

Measured Distance vs. Actual Number of Events [Two plots: Breakpoint Distance; Inversion Distance] 120 genes, inversion-only evolution

Generalized Nadeau-Taylor Model
Three types of events:
–Inversions
–Transpositions
–Inverted Transpositions
Events of the same type are equiprobable.
The probabilities of the three types have a fixed ratio: Inv : Trp : Inv.Trp = $(1-\alpha-\beta) : \alpha : \beta$
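
Choosing an event type under this model can be sketched as follows, assuming the three probabilities are written (1 - alpha - beta), alpha, and beta for inversions, transpositions, and inverted transpositions. The parameter names and the function `sample_event_type` are this sketch's own; choosing which segment is affected (uniformly within a type) is not shown.

```python
import random

KINDS = ["inversion", "transposition", "inverted transposition"]

def sample_event_type(alpha, beta, rng=random):
    """Pick an event type with Inv : Trp : Inv.Trp = (1-alpha-beta) : alpha : beta."""
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("need alpha, beta >= 0 and alpha + beta <= 1")
    # Clamp to guard against a tiny negative value from floating-point rounding.
    weights = [max(0.0, 1 - alpha - beta), alpha, beta]
    return rng.choices(KINDS, weights=weights, k=1)[0]

rng = random.Random(0)
# Inversion-only evolution: alpha = beta = 0.
print(sample_event_type(0.0, 0.0, rng))  # inversion

# All three event types equiprobable: alpha = beta = 1/3.
counts = {kind: 0 for kind in KINDS}
for _ in range(3000):
    counts[sample_event_type(1 / 3, 1 / 3, rng)] += 1
print(counts)
```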

Estimating True Evolutionary Distances for Genomes
Given fixed probabilities for each type of event, we estimate the expected breakpoint distance after k random events:
Approx-IEBP [Wang, Warnow 2001]
–Polynomial-time closed-form approximation to the expected breakpoint distance
–Proven error bound
Exact-IEBP [Wang 2001]
–Exact, recursive solution for the expected breakpoint distance
–Polynomial-time but slower than Approx-IEBP

Estimating True Evolutionary Distances for Genomes (cont.)
Estimating the expected inversion distance:
EDE [Moret, Wang, Warnow, Wyman 2001]
–Closed-form formula based upon an empirical estimation of the expected inversion distance after k random events (based upon 120 genes and inversion-only evolution, but robust to errors in the model)
–Polynomial time, fastest of the three

Goodness of fit for Approx-IEBP
120 genes
Inversion-only evolution (similar performance under other models)
EDE and Exact-IEBP have similar performance

Absolute Difference
120 genes
Inversion-only evolution (similar relative performance under other models)

Accuracy of Neighbor Joining Using Distance Estimators 120 genes Inversion-only evolution 10, 20, 40, 80, and 160 genomes Similar relative performance under other models

Accuracy of Neighbor Joining Using Distance Estimators 120 genes All three event types equiprobable 10, 20, 40, 80, and 160 genomes Similar relative performance under other models

Summary of Genomic Distance Estimators
Statistically based estimation of genomic distances improves NJ analyses.
Our IEBP estimators assume knowledge of the probabilities of each type of event, but are robust to model violations.
NJ(EDE) outperforms NJ on other estimators, under all models studied.
Accuracy is very good, except when very close to saturation.

Maximum Parsimony on Rearranged Genomes (MPRG)
The leaves are rearranged genomes.
Find the tree that minimizes the total number of rearrangement events.
[Figure: example tree on genomes A, B, C, D with internal nodes E and F; total length = 18]

GRAPPA [Bader et al., PSB’01] (Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algorithms)
Reimplementation of BPAnalysis [Blanchette et al. 1997] for the Breakpoint Phylogeny problem.
Uses algorithm engineering to improve performance.
Improves the algorithm by reducing the number of tree length evaluations. (Evaluating the length of a fixed tree is NP-hard.)

Campanulaceae

Analysis of Campanulaceae
12 genomes + 1 outgroup (Tobacco)
105 gene segments
BPAnalysis [Blanchette et al. 1997]: estimated at over 200 years [Cosner et al. 2000]
Using GRAPPA v1.1 on the 512-processor Los Lobos Supercluster machine: 2 minutes = 100 million-fold speedup (200,000-fold speedup per processor)

Consensus of 216 MP Trees
Strict consensus of 216 trees; 6 out of 10 internal edges recovered.
[Figure: consensus tree with leaves Trachelium, Campanula, Adenophora, Symphandra, Legousia, Asyneuma, Triodanis, Wahlenbergia, Merciera, Codonopsis, Cyananthus, Platycodon, Tobacco]

Future Work
New focus on Rare Genomic Changes
–New data
–New models
–New methods
New techniques for large-scale analyses
–Divide-and-conquer methods
–Non-tree models
–Visualization of large trees and large sets of trees

Acknowledgements
Funding: The David and Lucile Packard Foundation, The National Science Foundation, and Paul Angello
Collaborators:
–Robert Jansen (U. Texas)
–Bernard Moret, David Bader, Mi-Yan (U. New Mexico)
–Daniel Huson (Celera)
–Katherine St. John (CUNY)
–Linda Raubeson (Central Washington U.)
–Luay Nakhleh, Usman Roshan, Jerry Sun, Li-San Wang, Stacia Wyman (Phylolab, U. Texas)

Phylolab, U. Texas Please visit us at