GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular.

Slides:



Advertisements
Similar presentations
Reconstructing Phylogenies from Gene-Order Data Overview.
Advertisements

CS 598AGB What simulations can tell us. Questions that simulations cannot answer Simulations are on finite data. Some questions (e.g., whether a method.
A Separate Analysis Approach to the Reconstruction of Phylogenetic Networks Luay Nakhleh Department of Computer Sciences UT Austin.
Challenges in computational phylogenetics Tandy Warnow Radcliffe Institute for Advanced Study University of Texas at Austin.
DCJUC: A Maximum Parsimony Simulator for Constructing Phylogenetic Tree of Genomes with Unequal Contents Zhaoming Yin Bader-Polo Joint Group Meeting, Nov.
BALANCED MINIMUM EVOLUTION. DISTANCE BASED PHYLOGENETIC RECONSTRUCTION 1. Compute distance matrix D. 2. Find binary tree using just D. Balanced Minimum.
Large-Scale Phylogenetic Analysis Tandy Warnow Associate Professor Department of Computer Sciences Graduate Program in Evolution and Ecology Co-Director.
Perfect phylogenetic networks, and inferring language evolution Tandy Warnow The University of Texas at Austin (Joint work with Don Ringe, Steve Evans,
Computational biology and computational biologists Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular and Molecular Biology.
Molecular Evolution Revised 29/12/06
High-Performance Algorithm Engineering for Computational Phylogenetics [B Moret, D Bader] Kexue Liu CMSC 838 Presentation.
Bioinformatics Chromosome rearrangements Chromosome and genome comparison versus gene comparison Permutations and breakpoint graphs Transforming Men into.
Current Approaches to Whole Genome Phylogenetic Analysis Hongli Li.
Genome Rearrangement Phylogeny
Bioinformatics and Phylogenetic Analysis
In addition to maximum parsimony (MP) and likelihood methods, pairwise distance methods form the third large group of methods to infer evolutionary trees.
FPGA Acceleration of Phylogeny Reconstruction for Whole Genome Data Jason D. Bakos Panormitis E. Elenis Jijun Tang Dept. of Computer Science and Engineering.
BNFO 602 Phylogenetics Usman Roshan. Summary of last time Models of evolution Distance based tree reconstruction –Neighbor joining –UPGMA.
FPGA Acceleration of Gene Rearrangement Analysis Jason D. Bakos Dept. of Computer Science and Engineering University of South Carolina Columbia, SC USA.
Inferring Phylogeny using Permutation Patterns on Genomic Data 1 Md Enamul Karim 2 Laxmi Parida 1 Arun Lakhotia 1 University of Louisiana at Lafayette.
CIS786, Lecture 4 Usman Roshan.
CIS786, Lecture 8 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David.
Phylogenetic trees Sushmita Roy BMI/CS 576
Computing the Tree of Life The University of Texas at Austin Department of Computer Sciences Tandy Warnow.
Computational and mathematical challenges involved in very large-scale phylogenetics Tandy Warnow The University of Texas at Austin.
Combinatorial and graph-theoretic problems in evolutionary tree reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance Tandy Warnow Department of Computer Sciences University of Texas.
CIPRES: Enabling Tree of Life Projects Tandy Warnow The Program in Evolutionary Dynamics at Harvard University The University of Texas at Austin.
CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin.
Disk-Covering Methods for phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
Phylogenetic Tree Reconstruction Tandy Warnow The Program in Evolutionary Dynamics at Harvard University The University of Texas at Austin.
Complexity and The Tree of Life Tandy Warnow The University of Texas at Austin.
Combinatorial and Statistical Approaches in Gene Rearrangement Analysis Jijun Tang Computer Science and Engineering University of South Carolina
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
Phylogenetic Analysis. 2 Introduction Intension –Using powerful algorithms to reconstruct the evolutionary history of all know organisms. Phylogenetic.
Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Binary Encoding and Gene Rearrangement Analysis Jijun Tang Tianjin University University of South Carolina (803)
Gene Order Phylogeny Tandy Warnow The Program in Evolutionary Dynamics, Harvard University The University of Texas at Austin.
Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Evolutionary Trees Usman Roshan Department of Computer Science New Jersey Institute of.
NP-hardness and Phylogeny Reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Molecular phylogenetics 1 Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Sections
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Combining with phylogeny Wafa Jobran Seminar in Bioinformatics Technion spring 2005.
Statistical stuff: models, methods, and performance issues CS 394C September 16, 2013.
CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin.
Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
Algorithms research Tandy Warnow UT-Austin. “Algorithms group” UT-Austin: Warnow, Hunt UCB: Rao, Karp, Papadimitriou, Russell, Myers UCSD: Huelsenbeck.
Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.
Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.
Iterative-DCM3: A Fast Algorithmic Technique for Reconstructing Large Phylogenetic Trees Usman Roshan and Tandy Warnow U. of Texas at Austin Bernard Moret.
Approaching multiple sequence alignment from a phylogenetic perspective Tandy Warnow Department of Computer Sciences The University of Texas at Austin.
The Tree of Life: Algorithmic and Software Challenges Tandy Warnow The University of Texas at Austin.
394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.
Bioinformatics Overview
20 years and 22 papers with Bernard Moret
New Approaches for Inferring the Tree of Life
Multiple Sequence Alignment Methods
Challenges in constructing very large evolutionary trees
Phylogenetic Inference
BNFO 602 Phylogenetics Usman Roshan.
CS 581 Tandy Warnow.
CS 394C: Computational Biology Algorithms
Algorithms for Inferring the Tree of Life
Tandy Warnow The University of Texas at Austin
Tandy Warnow The University of Texas at Austin
Presentation transcript:

GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular and Molecular Biology Program in Evolution, Ecology, and Behavior Center for Computational Biology and Bioinformatics

Whole-Genome Phylogenetics A B C D E F X Y Z W A B C D E F

Genomes As Signed Permutations 1 – or –3 5 –1 etc.

Genomes Evolve by Rearrangements Inverted Transposition –7 –6 –5 – Inversion (Reversal) –8 –7 –6 – Transposition

Genome Rearrangement Has A Huge State Space DNA sequences : 4 states per site Signed circular genomes with n genes: states, 1 site Circular genomes (1 site) –with 37 genes: states –with 120 genes: states

Why use gene orders? “Rare genomic changes”: huge state space and relative infrequency of events (compared to site substitutions) could make the inference of deep evolution easier, or more accurate. Our research shows this is true, but accurate analysis of gene order data is computationally very intensive!

Phylogeny reconstruction from gene orders Distance-based reconstruction: estimate pairwise distances, and apply methods like Neighbor- Joining or Weighbor “Maximum Parsimony”: find tree with the minimum length (inversions, transpositions, or other edit distances) Maximum Likelihood: find tree and parameters of evolution most likely to generate the observed data

This talk Distance-based methods: we show how to estimate genomic distances appropriately for the Generalized Nadeau-Taylor model Parsimony-style methods: we can find very good solutions to NP-hard problems (inversion and breakpoint phylogeny) quite quickly Validation of these approaches on real and simulated data

Distance-based methods

Genomic Distance Estimators Standard: –Breakpoint distance –(Minimum) Inversion distance Our estimators: We attempt to estimate the actual number of events (the “true evolutionary distance”): –EDE [Moret et al, ISMB’01] –Approx-IEBP [Wang and Warnow, STOC’01] –Exact-IEBP [Wang, WABI’01]

Breakpoint Distance Breakpoint distance= –3 –

Minimum Inversion Distance –8 –7 –6 –5 – –3 –2 –7 –6 –5 – –3 7 2 –6 –5 – Inversion distance=3

Measured Distance vs. Actual Number of Events Breakpoint DistanceInversion Distance 120 genes, inversion-only evolution

Generalized Nadeau-Taylor Model Three types of events: –Inversions –Transpositions –Inverted Transpositions Events of the same type are equiprobable Probability of the three types have fixed ratio: Inv : Trp : Inv.Trp = (1-  -  ):  : 

Estimating True Evolutionary Distances for Genomes Given fixed probabilities for each type of event, we estimate the expected breakpoint distance after k random events, or the expected inversion distance after k random events. Inverting these functions gives us a better estimate of true evolutionary distances. Approx-IEBP [Wang and Warnow 2001] Exact-IEBP [Wang 2001] EDE [ Moret et al, ISMB 2001 ] (inversion based)

Estimating True Evolutionary Distances for Genomes (cont.) Estimating the expected Inversion distance: EDE [Moret, Wang, Warnow, Wyman 2001] –Closed-form formula based upon an empirical estimation of the expected inversion distance after k random events (based upon 120 genes and inversion only, but robust to errors in the model). –Polynomial time, fastest of the three.

Absolute Difference 120 genes Inversion only evolution (Similar relative performance under other models)

Accuracy of Neighbor Joining Using Distance Estimators 120 genes All three event types equiprobable 10, 20, 40, 80, and 160 genomes Similar relative performance under other models

Summary of Distance-based Reconstruction Methods Statistically-based estimation of genomic distances improves NJ analyses. Our IEBP estimators assume knowledge of the probabilities of each type of event, but are robust to model violations. EDE is based upon an inversion-only evolutionary model, but is robust. Best performing method: Weighbor(EDE); second best is NJ(EDE); both are robust to model violations. Worst performing is NJ(BP). Accuracy is very good, except when very close to saturation.

Maximum Parsimony on Rearranged Genomes (MPRG) The leaves are rearranged genomes. Find the tree that minimizes the total number of rearrangement events A B C D A B C D E F Total length = 18

Optimization problems for gene order phylogeny Breakpoint phylogeny: find the phylogeny which minimizes the total number of breakpoints (NP-hard, even to find the median of three genomes) Inversion phylogeny: find the phylogeny which minimizes the sum of inversion distances on the edges (NP-hard, even to find the median of three genomes)

Inversion and Breakpoint phylogenies Phylogenetic trees MP score Global optimum Local optimum When the data are close to saturated, even Weighbor(EDE) analyses are insufficiently accurate. In these cases, our initial investigations suggest that the inversion and breakpoint phylogeny approaches may be superior. Problem: finding the best trees is enormously hard, since even the “point estimation” problem is hard (worse than estimating branch lengths in ML).

GRAPPA (Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms) Heuristics for NP-hard optimization problems Fast polynomial time distance-based methods Contributors: U. New Mexico,U. Texas at Austin, Universitá di Bologna, Italy Freely available in source code at this site. Project leader: Bernard Moret (UNM)

Benchmark gene order dataset: Campanulaceae 12 genomes + 1 outgroup (Tobacco), 105 gene segments NP-hard optimization problems: breakpoint and inversion phylogenies (techniques score every tree) 1997: BPAnalysis (Blanchette and Sankoff): 200 years (est.)

Benchmark gene order dataset: Campanulaceae 12 genomes + 1 outgroup (Tobacco), 105 gene segments NP-hard optimization problems: breakpoint and inversion phylogenies (techniques score every tree) 1997: BPAnalysis (Blanchette and Sankoff): 200 years (est.) 2000: Using GRAPPA v1.1 on the 512-processor Los Lobos Supercluster machine: 2 minutes (200,000-fold speedup per processor)

Benchmark gene order dataset: Campanulaceae 12 genomes + 1 outgroup (Tobacco), 105 gene segments NP-hard optimization problems: breakpoint and inversion phylogenies (techniques score every tree) 1997: BPAnalysis (Blanchette and Sankoff): 200 years (est.) 2000: Using GRAPPA v1.1 on the 512-processor Los Lobos Supercluster machine: 2 minutes (200,000-fold speedup per processor) 2003: Using latest version of GRAPPA: 2 minutes on a single processor (1-billion-fold speedup per processor)

Summary Weighbor(EDE) and NJ(EDE) are highly accurate polynomial time distance-based reconstructions, except when datasets are close to saturated GRAPPA (inversion phylogeny or breakpoint phylogeny) produces highly accurate estimates of trees, and even of ancestral gene orders, acceptably fast

DCM-boosting MP and ML Idea: it may be faster to run a computationally expensive method on a few overlapping subproblems of somewhat smaller size Challenge: how to pick the best decomposition?

Addressing the accuracy/time issues: Disk-Covering Methods DCM1 decomposition: lots of small diameter subproblems. (Used for NJ.) DCM2 decomposition: Very few subproblems, each somewhat smaller. (Used for MP or ML.)

DCM-boosting Speeding up MP/ML heuristics Time MP score of best trees Performance of hill-climbing heuristic Desired Performance Fake study

The DCM2 technique for speeding up MP/ML searches

GRAPPA (Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms) Heuristics for NP-hard optimization problems Fast polynomial time distance-based methods Contributors: U. New Mexico,U. Texas at Austin, Universitá di Bologna, Italy Freely available in source code Project leader: Bernard Moret (UNM)

Limitations and ongoing research Current methods limited to single chromosomes with equal gene content (or very small amounts of deletions and duplications) -- we are working on developing reliable techniques for genomes with unequal gene content

Acknowledgements Funding: The David and Lucile Packard Foundation, and The National Science Foundation. Collaborators: Bernard Moret (UNM), David Bader (UNM), Bob Jansen (UT), Linda Raubeson (CWU) Students: Li-San Wang (now postdoc of Junhyong Kim at Penn), Jijun Tang (UNM), and others at UNM

Phylolab, U. Texas Please visit us at