Genome Rearrangement Phylogeny

Slides:



Advertisements
Similar presentations
Reconstructing Phylogenies from Gene-Order Data Overview.
Advertisements

CS 598AGB What simulations can tell us. Questions that simulations cannot answer Simulations are on finite data. Some questions (e.g., whether a method.
A Separate Analysis Approach to the Reconstruction of Phylogenetic Networks Luay Nakhleh Department of Computer Sciences UT Austin.
Challenges in computational phylogenetics Tandy Warnow Radcliffe Institute for Advanced Study University of Texas at Austin.
BALANCED MINIMUM EVOLUTION. DISTANCE BASED PHYLOGENETIC RECONSTRUCTION 1. Compute distance matrix D. 2. Find binary tree using just D. Balanced Minimum.
Large-Scale Phylogenetic Analysis Tandy Warnow Associate Professor Department of Computer Sciences Graduate Program in Evolution and Ecology Co-Director.
Molecular Evolution Revised 29/12/06
High-Performance Algorithm Engineering for Computational Phylogenetics [B Moret, D Bader] Kexue Liu CMSC 838 Presentation.
BNFO 602 Phylogenetics Usman Roshan.
In addition to maximum parsimony (MP) and likelihood methods, pairwise distance methods form the third large group of methods to infer evolutionary trees.
BNFO 602 Phylogenetics Usman Roshan. Summary of last time Models of evolution Distance based tree reconstruction –Neighbor joining –UPGMA.
FPGA Acceleration of Gene Rearrangement Analysis Jason D. Bakos Dept. of Computer Science and Engineering University of South Carolina Columbia, SC USA.
Inferring Phylogeny using Permutation Patterns on Genomic Data 1 Md Enamul Karim 2 Laxmi Parida 1 Arun Lakhotia 1 University of Louisiana at Lafayette.
CIS786, Lecture 4 Usman Roshan.
CIS786, Lecture 8 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David.
Computing the Tree of Life The University of Texas at Austin Department of Computer Sciences Tandy Warnow.
Computational and mathematical challenges involved in very large-scale phylogenetics Tandy Warnow The University of Texas at Austin.
Combinatorial and graph-theoretic problems in evolutionary tree reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance Tandy Warnow Department of Computer Sciences University of Texas.
CIPRES: Enabling Tree of Life Projects Tandy Warnow The Program in Evolutionary Dynamics at Harvard University The University of Texas at Austin.
CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin.
Disk-Covering Methods for phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
Complexity and The Tree of Life Tandy Warnow The University of Texas at Austin.
Combinatorial and Statistical Approaches in Gene Rearrangement Analysis Jijun Tang Computer Science and Engineering University of South Carolina
Phylogenetic Analysis. 2 Introduction Intension –Using powerful algorithms to reconstruct the evolutionary history of all know organisms. Phylogenetic.
Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Binary Encoding and Gene Rearrangement Analysis Jijun Tang Tianjin University University of South Carolina (803)
Gene Order Phylogeny Tandy Warnow The Program in Evolutionary Dynamics, Harvard University The University of Texas at Austin.
Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Evolutionary Trees Usman Roshan Department of Computer Science New Jersey Institute of.
NP-hardness and Phylogeny Reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Computational Biology, Part D Phylogenetic Trees Ramamoorthi Ravi/Robert F. Murphy Copyright  2000, All rights reserved.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin.
Introduction to Phylogenetics
394C: Algorithms for Computational Biology Tandy Warnow Sept 9, 2013.
394C, Spring 2013 Sept 4, 2013 Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Combining with phylogeny Wafa Jobran Seminar in Bioinformatics Technion spring 2005.
Statistical stuff: models, methods, and performance issues CS 394C September 16, 2013.
CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin.
Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
Algorithms research Tandy Warnow UT-Austin. “Algorithms group” UT-Austin: Warnow, Hunt UCB: Rao, Karp, Papadimitriou, Russell, Myers UCSD: Huelsenbeck.
GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular.
Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.
598AGB Basics Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT.
SupreFine, a new supertree method Shel Swenson September 17th 2009.
Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.
CS 598 AGB Supertrees Tandy Warnow. Today’s Material Supertree construction: given set of trees on subsets of S (the full set of taxa), construct tree.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.
Iterative-DCM3: A Fast Algorithmic Technique for Reconstructing Large Phylogenetic Trees Usman Roshan and Tandy Warnow U. of Texas at Austin Bernard Moret.
Absolute Fast Converging Methods CS 598 Algorithmic Computational Genomics.
394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.
Distance-based phylogeny estimation
20 years and 22 papers with Bernard Moret
New Approaches for Inferring the Tree of Life
394C, Spring 2012 Jan 23, 2012 Tandy Warnow.
Multiple Sequence Alignment Methods
Challenges in constructing very large evolutionary trees
BNFO 602 Phylogenetics Usman Roshan.
BNFO 602 Phylogenetics – maximum parsimony
CS 581 Tandy Warnow.
Multiple Genome Rearrangement
Tandy Warnow Department of Computer Sciences
The Most General Markov Substitution Model on an Unrooted Tree
CS 394C: Computational Biology Algorithms
September 1, 2009 Tandy Warnow
Algorithms for Inferring the Tree of Life
Tandy Warnow The University of Texas at Austin
Tandy Warnow The University of Texas at Austin
Presentation transcript:

Genome Rearrangement Phylogeny Robert K. Jansen School of Biology University of Texas at Austin Bernard M.E. Moret Department of Computer Science University of New Mexico Li-San Wang Tandy Warnow Department of Computer Sciences

Outline Introduction Genome rearrangement phylogeny reconstruction Application Other methods Future research

New Phylogenetic Signals Large-throughput sequencing efforts lead to larger datasets Challenge: inferring deep evolutionary events Biologists turning to “rare genomic changes” Rare Large state space High signal-to-noise ratio Potential for clarifying early evolution Best studied: gene order evolution (genome rearrangement)

Genomes As Signed Permutations 1 –5 3 4 -2 -6 or 5 –1 6 2 -4 -3 etc.

Gene Order Data Rare changes on the genomic scale Large state space DNA: 4 states/character Protein (amino acid sequence): 20 states/character Circular gene order with 120 genes: High signal-to-noise ratio states/character

Genomes Evolve by Rearrangements 1 2 3 4 5 6 7 8 9 10 Inversion: 1 2 –6 –5 -4 -3 7 8 9 10 Transposition: 1 2 7 8 3 4 5 6 9 10 Inverted Transposition: 1 2 7 8 –6 -5 -4 -3 9 10

Edit Distances Between Genomes (INV) Inversion distance [Hannenhalli & Pevzner 1995] Computable in linear time [Moret et al 2001] (BP) Breakpoint distance [Watterson et al. 1982] Computable in linear time NJ(BP): [Blanchette, Kunisawa, Sankoff, 1999] A = 1 2 3 4 5 6 7 8 9 10 B = 1 2 3 -8 -7 -6 4 5 9 10 BP(A,B)=3

Our Model: the Generalized Nadeau-Taylor Model [STOC’01] Three types of events: Inversions (INV) Transpositions (TRP) Inverted Transpositions (ITP) Events of the same type are equiprobable Probabilities of the three types have fixed ratio We focus on signed circular genomes in this talk.

Simulation Study Protocol Synthetic Input Evolutionary Process A B C D E F True Tree A 1 -4 2 -3 5 6 B 1 3 2 5 4 6 C 1 3 -2 5 6 4 D 1 3 4 5 -2 6 E 1 4 5 -3 -2 6 F 1 2 3 4 5 6 Known in simulation Phylogenetic Method D B A C E F Inferred Tree

Quantifying Error FN: false negative (missing edge) 1/3=33.3% error rate

Outline Genome rearrangement evolution Genome rearrangement phylogeny reconstruction Application Other methods Future research

Gene Order Parsimony

Breakpoint Phylogeny [Sankoff & Blanchette 1998] “Maximum Parsimony”-style problem: Find tree(s), leaf-labeled by genomes, with shortest breakpoint length NP-hard problem on two levels: Find the shortest tree (the space of trees has exponential size) Given a tree, find its breakpoint length (Even for a tree with 3 leaves, but can be reduced to TSP) BPAnalysis [Sankoff & Blanchette 1998] Takes 200 years to compute our 13-taxon dataset on a Sun workstation

BPAnalysis A B Y X’ C Z Y’ X’ Tree length evaluation for EVERY tree Given a fixed tree topology, evaluate the tree length: Iteratively evaluate the median problem (tree length for a 3-leaf tree) A B Y X’ C Z Y’ X’

GRAPPA (Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms) http://www.cs.unm.edu/~moret/GRAPPA/ Uses lowerbound techniques to speed up Used on real datasets, producing thousand-fold speedups over BPAnalysis [ISMB’01] Contributors: (led by Bernard Moret at UNM) U. New Mexico U. Texas at Austin Universitá di Bologna, Italy

The Circular Lowerbound of the Length of a Tree Given a tree, we can lowerbound its length very quickly: 1 2 3 4

The Lowerbound Technique Avoid any tree X without potential: tree X whose lowerbound lb(X) is higher than twice the length c(T) of the best tree T Finding a good starting tree quickly is of utmost importance We turn to distance-based methods Neighbor joining (NJ) [Saitou and Nei 1987] Weighbor [Bruno et al. 2000]

Additive Distance Matrix and True Evolutionary Distance (T.E.D.) 7 S4 5 5 S3 0 13 16 3 1 S4 0 13 4 S5 0 8 S2 S5 Theorem [Waterman et al. 1977] Given an m×m additive distance matrix, we can reconstruct a tree realizing the distance in O(m2) time.

Error Tolerance of Neighbor Joining Theorem [Atteson 1999] Let {Dij} be the true evolutionary distances, and {dij} be the estimated distances for T. Let be the length of the shortest edge in T. If for all taxa i,j, we have then neighbor joining returns T.

BP and INV BP/2 vs K (120 genes) INV vs K (K: Actual number of inversions) (Inversion-only evolution)

NJ(BP) [Blanchette, Kunisawa, Sankoff 1999] and NJ(INV) Inversion only Transpositions/ inverted transpositions only 120 genes, 160 leaves Uniformly Random Tree

Estimate True Evolutionary Distances Using BP To use the scatter plot to estimate the actual number of events (K): Compute BP/2 From the curve, look up the corresponding value of K (2) (1) BP/2 vs K (120 genes) (K: Actual number of inversions) (Inversion-only evolution)

True Evolutionary Distance (t.e.d.) Estimators for Gene Order Data Exact-IEBP [WABI’01] Approx-IEBP [STOC’01] EDE [ISMB’01] Based on the Expectation of Breakpoint distance (Exact) Breakpoint distance (Approx.) Inversion distance (Approx.) Derivation Analytical Empirical Model knowledge Required Inversion-only IEBP: Inverting the Expected BreakPoint distance EDE: Empirically Derived Estimator

True Evolutionary Distance Estimators BP vs K (120 genes) Exact-IEBP vs K (K: Actual number of inversions) (Inversion-only evolution)

Variance of True Evolutionary Distance Estimators There are new distance-based phylogeny reconstruction methods (though designed for DNA sequences) Weighbor [Bruno et al. 2000] These methods use the variance of good t.e.d.’s, and yield more accurate trees than NJ. Variance estimates for the t.e.d.s [Wang WABI’02] Weighbor(IEBP), Weighbor(EDE) K vs Exact-IEBP (120 genes)

Using T.E.D. Helps 5% 120 genes 160 leaves Uniformly random tree Transpositions/inverted transpositions only (180 runs per figure) 5%

Observations EDE is the best distance estimator when used with NJ and Weighbor. True evolutionary distance estimators are reliable even when we do not know the GNT model parameters (the probability ratios of the three types of events).

Outline Genome rearrangement evolution Genome rearrangement phylogeny reconstruction Application Other methods Future research

Percentage of Trees Eliminated Through Bounding [ISMB’01] edge length=2 # taxa 10 20 40 80 160 1% 80% 91% 100% 99% 320 #genes Uses NJ(EDE) as starting tree

Campanulaceae cpDNA BPAnalysis takes 2 centuries on a Sun workstation 13 taxa (tobacco as outlier) 105 gene segments GRAPPA finds 216 trees with shortest breakpoint length (out of 654,729,075 trees) Running Time: BPAnalysis takes 2 centuries on a Sun workstation GRAPPA takes 1.5 hours on a 512-node supercluster About 2300-fold speedup on a single node

Campanulaceae [Moret et al. ISMB 2001] Strict consensus of 216 optimal trees found by GRAPPA 6 out of 10 max. edges found Tobacco Platycodon Cyananthus Codonopsis Merciera Wahlenbergia Triodanis Asyneuma Legousia Symphandra Adenophora Campanula Trachelium

Outline Genome rearrangement evolution Genome rearrangement phylogeny reconstruction Application Other methods Future research

“Fast” Approaches for Genome Rearrangement Phylogeny Basic technique: encode data as strings and apply maximum parsimony Running time exponential in the number of genomes, but polynomial in the number of genes (faster than GRAPPA) MPBE [ISMB’00] Maximum Parsimony using Binary Encodings MPME [Boore et al. Nature ’95, PSB’02] Maximum Parsimony using Multi-state Encodings The length of a tree using these two methods is a lowerbound of the true breakpoint length [Bryant ’01]

Maximum Parsimony using Binary Encoding (MPBE) Input genome (circular) A: 1 2 3 4 = -4 –3 –2 –1 B: 1 -4 -3 –2 = 2 3 4 -1 C: 1 2 -3 –4 = 4 3 –2 -1 MPBE Strings (1,2) (2,3) (3,4) (4,1) (1,-4) (-2,1) (2,-3) (-3,-4) (-4,1) A: 1 1 1 1 0 0 0 0 0 B: 0 1 1 0 1 1 0 0 0 C: 1 0 0 0 0 0 1 1 1

Maximum Parsimony using Multistate Encoding (MPME) Input genome (circular) A: 1 2 3 4 = -4 –3 –2 –1 B: 1 -4 -3 –2 = 2 3 4 -1 C: 1 2 -3 –4 = 4 3 –2 -1 MPME Strings We use PAUP to solve Maximum Parsimony => Constraint: number of states per site cannot exceed 32 1 2 3 4 -1 –2 –3 -4 A: 2 3 4 1 –4 –1 –2 -3 B: -4 3 4 –1 2 1 –2 -3 C: 2 –3 -2 3 4 –1 -4 1

NJ vs MP (120 genes, 160 genomes) All three event types equiprobable (datasets that exceed 32-state limit for MPME are dropped)

Inversion Phylogeny Inversion median has higher running time than breakpoint median Inversion phylogeny overall has shorter running time than breakpoint phylogeny, and returns more accurate trees [Moret et al. WABI ’02]

DCM-GRAPPA [Moret & Tang 2003] Disk-Covering Method: divide the original problem into subproblems [Huson, Nettles, Parida, Warnow and Yooseph, 1998] Uses inversion distance DCM-GRAPPA: can now process thousands of genomes, each having hundreds of genes

Ongoing and Future Research Genome rearrangement phylogeny with unequal gene content (duplications, deletions, etc.) Non-uniform genome rearrangement models (Segment-length dependent model, hotspots)

Acknowledgements University of Texas Tandy Warnow (Advisor) Robert K. Jansen Stacia Wyman Luay Nakhleh Usman Roshan Cara Stockham Jerry Sun University of New Mexico Bernard M.E. Moret David Bader Jijun Tang Mi Yan Central Washington University Linda Raubeson

Phylolab Department of Computer Sciences University of Texas at Austin Please visit us at http://www.cs.utexas.edu/users/phylo/