High-Performance Computing for Reconstructing Phylogenies from Gene-Order Data
David A. Bader
Electrical & Computer Engineering, University of New Mexico


Computational Biology Special Interest Group, HPC November 2001 High-Performance for Phylogeny Reconstruction, David A. Bader

Slide 2: Acknowledgment of Support
- National Science Foundation
  - CAREER: High-Performance Algorithms for Scientific Applications ( )
  - ITR: Algorithms for Irregular Discrete Computations on SMPs ( )
  - DEB: Ecosystem Studies: Self-Organization of Semi-Arid Landscapes: Test of Optimality Principles ( )
  - ITR/AP: Reconstructing Complex Evolutionary Histories ( )
  - DEB: Comparative Chloroplast Genomics: Integrating Computational Methods, Molecular Evolution, and Phylogeny ( )
  - ITR/AP(DEB): Computing Optimal Phylogenetic Trees under Genome Rearrangement Metrics ( )
- PACI: NCSA/Alliance, NPACI/SDSC, PSC
- Sun Microsystems

Slide 3: Algorithms that Scale from the Blade to the Fire

Slide 4: Commercial Aspects of Phylogeny Reconstruction
- Identification of microorganisms
  - public health entomology
  - sequence motifs for groups are patented
  - example: differentiating tuberculosis strains
- Dynamics of microbial communities
  - pesticide exposure: identify and quantify microbes in soil
- Vaccine development
  - variants of a cell wall or protein coat component
  - porcine reproductive and respiratory syndrome virus isolates from US and Europe were separate populations
  - HIV studied through DNA markers
- Biochemical pathways
  - antibacterials and herbicides
  - Glyphosate (Roundup, Rodeo, and Pondmaster): first herbicide targeted at a pathway not present in mammals
  - phylogenetic distribution of a pathway is studied by the pharmaceutical industry before a drug is developed
- Pharmaceutical industry
  - predicting the natural ligands for cell surface receptors which are potential drug targets
  - a single family, G protein coupled receptors (GPCRs), contains 40% of the targets of most pharmaceutical companies

Slide 5: GRAPPA: Genome Rearrangements Analysis
- Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algorithms
- Open-source
  - already used by other computational phylogeny groups: Caprara, Pevzner, LANL, FBI, PharmCos.
- Gene-order phylogeny reconstruction
  - Breakpoint Median
  - Inversion Median
- Over one-million-fold speedup from previous codes
- Parallelism
  - scales linearly with the number of processors
- Developed using Sun Forte C

Slide 6: Molecular Data for Phylogeny
- simple DNA sequence: nucleotides
- low-level functionality: amino acids, etc.
- genomic level: genes
- (next is functional level: proteomics, etc.)
- Biologists now have full gene sequences for many single-chromosome organisms and organelles (e.g., mitochondria, chloroplasts) and for more and more larger organisms

Slide 7: Gene Order Phylogeny
- Many organelles appear to evolve mostly through processes that simply rearrange gene ordering (inversion, transposition) and perhaps alter gene content (duplication, loss).
- Chloroplasts have a single, typically circular, chromosome and appear to evolve mostly through inversion:
    before:  i-1,  i, ...,  j,  j+1
    after:   i-1, -j, ..., -i,  j+1
- The sequence of genes i, i+1, ..., j is inverted and every gene is flipped.
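The inversion operation described on this slide can be sketched in a few lines. This is an illustration only, not GRAPPA code: the function name `invert` and the representation of a genome as a Python list of signed integers are assumptions made for the example, and the genome is indexed linearly (on a circular chromosome, an inversion that wraps around the origin is equivalent to inverting the complementary segment).

```python
def invert(genome, i, j):
    """Return a copy of `genome` with positions i..j (inclusive) inverted.

    The segment is reversed and every gene in it changes sign,
    which models the genes ending up on the opposite strand.
    `invert` is a hypothetical helper name, not part of any library.
    """
    if not 0 <= i <= j < len(genome):
        raise ValueError("need 0 <= i <= j < len(genome)")
    flipped = [-g for g in reversed(genome[i:j + 1])]
    return genome[:i] + flipped + genome[j + 1:]


genome = [1, 2, 3, 4, 5, 6]
print(invert(genome, 1, 3))  # [1, -4, -3, -2, 5, 6]
```

Applying the same inversion twice restores the original gene order, which is a convenient sanity check.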

Slide 8: Gene Order Phylogeny (cont'd)
- The real problem
  - Reconstruct the "true" tree, identify the "true" ancestral genomes, and recover on each edge the "true" sequence of evolutionary changes
- The optimization problem (parsimony)
  - Reconstruct a tree and ancestral genomes so as to minimize the sum, over all tree edges, of the inferred evolutionary distance along each edge
- The surrogate problem
  - Do the optimization problem with a measure of inferred evolutionary distance that lends itself to analysis

Slide 9: Breakpoint Analysis: A Surrogate for Gene Order
- Breakpoint: an adjacent pair of genes present in one genome, but absent in the other
- Breakpoint distance: the total number of breakpoints between two genomes (a true metric, similar to Hamming distance)
- Breakpoint phylogeny: the tree and ancestral genomes that minimize the sum, over all edges of the tree, of the breakpoint distances
- Naturally, it is an NP-hard problem, even with just 3 leaves.
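As a rough illustration of the breakpoint distance defined above, the sketch below counts the adjacencies of one genome that are missing from the other. It assumes signed, circular genomes over the same gene set, each given as a list of nonzero integers; the helper names `adjacencies` and `breakpoint_distance` are made up for this example and are not taken from BPAnalysis or GRAPPA.

```python
def adjacencies(genome):
    """Set of adjacencies of a signed circular genome.

    Reading the opposite strand turns the pair (a, b) into (-b, -a),
    so both spellings are stored under one canonical form.
    """
    n = len(genome)
    result = set()
    for k in range(n):
        a, b = genome[k], genome[(k + 1) % n]  # wrap around: circular genome
        result.add(max((a, b), (-b, -a)))
    return result


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that do not occur in g2."""
    return len(adjacencies(g1) - adjacencies(g2))


original = [1, 2, 3, 4, 5]
inverted = [1, -3, -2, 4, 5]   # genes 2..3 inverted
print(breakpoint_distance(original, inverted))  # 2
```

A single inversion cuts the chromosome in two places, so it creates at most two breakpoints, as in the example.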

Slide 10: Breakpoint Analysis (Sankoff & Blanchette 1998)

For each tree topology do                      [$(2n-5)!! = (2n-5)\cdot(2n-7)\cdots 5\cdot 3$ trees]
    somehow assign initial genomes to the internal nodes      [unknown]
    repeat                                     [iterative heuristic]
        for each internal node do
            compute a new genome that minimizes the distances to its three neighbors      [NP-hard]
            replace old genome by new if distance is reduced
    until no change

Sankoff & Blanchette implemented this in a C++ package.
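To give a feel for the size of the search space quoted on this slide, the following sketch evaluates (2n-5)!! directly. The function name is an assumption for illustration; the formula itself is the one on the slide.

```python
def num_tree_topologies(n):
    """Evaluate (2n-5)!! = (2n-5)(2n-7)...5*3 for n >= 3 leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for factor in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= factor
    return count


for leaves in (4, 5, 10, 13):
    print(leaves, num_tree_topologies(leaves))
```

For the 13 genomes of the Campanulaceae dataset this comes to 13,749,310,575 topologies, which is why both pruning and per-tree speed matter.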

Slide 11: Algorithm Engineering Works!
- We reimplemented everything: the original code is too slow and not as flexible as we wanted.
- Our main dataset is a collection of chloroplast data from the flowering plant family Campanulaceae (bluebells): 13 genomes of 105 gene segments each
- On our old workstation:
  - BPAnalysis processes trees/minute
  - Our implementation processes over 50,000 trees/minute
  - Speedup ratio is over 5,000!!
- On synthetic datasets, we see speedups from 300 to over 50,000…

Slide 12: So… What did we do?!
- Absolutely no high-level algorithmic changes *
- Three low-level algorithmic changes (~10x):
  - better bounding
  - strong upper bound initialization
  - "condensing"
- Completely different data representation (~10x)
- Two low-level algorithmic changes (~6x):
  - all memory is pre-allocated
  - some loops are hand-unrolled
- Written in C instead of C++ (? (convenience))

* Well, so I lied just a little bit…

Slide 13: One high-level algorithmic change (Ok, so I lied a little…)
- Avoid labeling the tree if possible
- Use current best score as an upper bound.
- Compute lower bound and prune tree away if lower bound > upper bound
- Lower bound:
  - Get circular ordering of leaves, $x_1, x_2, \dots, x_n$
  - Compute the circular sum $S = d(x_1,x_2) + d(x_2,x_3) + \dots + d(x_n,x_1)$
  - Then $\tfrac{1}{2}S$ is a lower bound because
    - $d(\cdot)$ obeys the triangle inequality
    - every tree edge is used twice in a tree-based version of $S$
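The pruning test on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: `leaves` must already be listed in a circular order consistent with the candidate tree (for example, the order in which a traversal of the tree visits them), `dist` is any metric supplied by the caller, and both function names are hypothetical.

```python
def circular_lower_bound(leaves, dist):
    """Half the sum of distances between circularly consecutive leaves."""
    n = len(leaves)
    total = sum(dist(leaves[k], leaves[(k + 1) % n]) for k in range(n))
    return total / 2


def can_prune(leaves, dist, best_score):
    """True if the tree cannot beat the best score found so far."""
    return circular_lower_bound(leaves, dist) > best_score


# Toy metric: points on a line with absolute difference as the distance.
points = [0, 1, 3, 6]
metric = lambda a, b: abs(a - b)
print(circular_lower_bound(points, metric))   # 6.0
print(can_prune(points, metric, best_score=5))  # True
```

Because the bound needs only pairwise leaf distances, it can reject a topology before any ancestral genome is computed, which is the point of "avoid labeling the tree if possible".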

Slide 14

[Figure: a tree with leaves a, b, c, d, e, shown next to the tree version of the circular sum, in which each term d(a,b), d(b,c), d(c,d), d(d,e), d(e,a) is drawn as a path between the two leaves.]

Circular sum: $S = d(a,b) + d(b,c) + d(c,d) + d(d,e) + d(e,a)$

(Same trick as in the "twice around the tree" approximation for the TSP with triangle inequality.)

Slide 15: Algorithmic Changes (~10x)
1. Better bounding: skip edges that would cause degree 3 or premature cycle
2. "Condensing": whenever the same gene subsequence appears in all genomes, it can be condensed into a single "superfragment"
   - done as static processing and on the fly before each TSP
3. Initializing the new median with the best of the old one and its three neighbors.

Condensing is very effective on real data within families, but easily defeated by large evolutionary distances.

(1) and (3) cause over half of the TSP instances (for computing "median-of-three" updated internal nodes) to be pruned away instantly.
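One ingredient of condensing can be illustrated with a short sketch: finding the adjacencies that occur in every genome, since any run of such adjacencies is a subsequence shared by all genomes and is a candidate for merging into a superfragment. This is not the GRAPPA implementation; it assumes signed circular genomes given as integer lists, and the names `canonical` and `shared_adjacencies` are invented for the example.

```python
def canonical(a, b):
    """One spelling for an adjacency and its reverse-strand reading."""
    return max((a, b), (-b, -a))


def shared_adjacencies(genomes):
    """Adjacencies (up to strand) present in every signed circular genome."""
    common = None
    for genome in genomes:
        n = len(genome)
        adj = {canonical(genome[k], genome[(k + 1) % n]) for k in range(n)}
        common = adj if common is None else common & adj
    return common if common is not None else set()


genomes = [
    [1, 2, 3, 4, 5],
    [1, -3, -2, 4, 5],
]
print(sorted(shared_adjacencies(genomes)))  # [(2, 3), (4, 5), (5, 1)]
```

In this toy input the genes 2, 3 always travel together, as do 4, 5, 1, so five genes could be treated as two fragments. The more the genomes diverge, the fewer adjacencies survive the intersection, which matches the slide's remark that large evolutionary distances defeat condensing.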

Slide 16: Data Representation (~10x)
- No distance matrix for reduction of "median-of-three" to Traveling Salesperson Problem (TSP): at most 4n edges can be of interest; the others are treated as an undifferentiated pool.
- The adjacency lists have length ≤ 4; thus, linear time at each step and reduced storage.
- Backtracking search has a small list of edges and only searches among edges of cost 1 and 2 (−∞ and 0 are always included): still NP-hard, but often easy
- When search runs out of edges, tour is completed in linear time from the pool of edges of cost 3
- Many auxiliary arrays (à la Fortran!) to carry information on flags, degrees, other end of chains, …

Slide 17: Low-Level Coding Changes (~6x)
1. All storage allocated at start, with large numbers of pointers passed to subroutines (no globals, to allow parallel execution).
   - Avoids malloc/free overhead
   - Improves cache locality
2. Avoid recomputations.
   - Use local variables for intermediate pointers
   - Hand-unroll loops on adjacencies to preserve locality (and to avoid "mod" operations with circular genomes)
   - Speeds up addressing (never dereference!) and improves cache locality

Memory use on our real data:
- BPAnalysis uses 65MB and has a real memory footprint of ~12MB
- Our reimplementation uses 1.6MB with a footprint of 0.6MB

Slide 18: And… How did we do it?
- 3 strategies: Profile, Profile, Profile (and use your engineering sense/nose/…)
  - Sun Forte 6 Analyzer
- We began with 4 main culprits:
  - preparing adjacency lists for the TSP
  - computing breakpoint distances
  - computing lower bounds in TSP
  - backtracking in TSP
- Over 10 – 12 major iterations, each of which yielded a 1.5 – 2 fold speed-up, these four switched places over and over.

Slide 19: Profiling
And our final tally (still on the Campanulaceae dataset) is:
- 30% backtracking (excl. LB)
- 20% preparing adjacency lists
- 20% condensing & expanding
- 15% computing LB
- 8% computing distances
- 7% miscellaneous overhead
(no obvious culprits left)

Slide 20: High-Performance Computing Techniques
- Availability of hundreds of powerful processors
- Standard parallel programming interfaces (Sun HPC)
  - Message passing interface (MPI)
  - OpenMP or POSIX threads
- Algorithmic libraries for SMP clusters
  - SIMPLE
- Goal: make efficient use of parallelism for
  - exploring candidate tree topologies
  - sharing of improved bounds

Slide 21: Parallelization of the Phylogeny Algorithm
- Enumerating tree topologies is pleasantly parallel and allows multiple processors to independently search the tree space with little or no overhead
- Improved bounds can be broadcast to other processors without interrupting work
- Load is evenly balanced when trees are cyclically assigned (e.g., in a round-robin fashion) to the processors
- Linear speedup
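The cyclic (round-robin) assignment mentioned on this slide amounts to giving processor `rank` every tree whose index is congruent to `rank` modulo the number of processors. The sketch below shows only that index arithmetic, with no actual message passing; the function name and the idea of numbering trees 0, 1, 2, … in enumeration order are assumptions for the example.

```python
def trees_for_processor(rank, num_procs, num_trees):
    """Indices of the trees handled by processor `rank` under cyclic assignment."""
    if not 0 <= rank < num_procs:
        raise ValueError("rank must be in range(num_procs)")
    return range(rank, num_trees, num_procs)


num_procs, num_trees = 3, 10
for rank in range(num_procs):
    print(rank, list(trees_for_processor(rank, num_procs, num_trees)))
# 0 [0, 3, 6, 9]
# 1 [1, 4, 7]
# 2 [2, 5, 8]
```

Every tree is assigned exactly once and the per-processor counts differ by at most one. Interleaving also spreads neighboring topologies, which tend to have similar cost, across processors instead of handing one processor a contiguous block.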

Slide 22: Final Remarks
- Our reimplementation led to numerous extensions as well as to new theoretical results
  - GRAPPA has been extended to inversion phylogeny, with linear-time algorithms for inversion distance and a new approach to exact inversion median-of-three.
  - Better bounding in the next version of GRAPPA yields two more orders of magnitude speedup.
- These insights and improvements are made possible by mature development tools (Forte)
- Algorithmic engineering techniques are widely applicable
  - We may not always get 6 orders of magnitude, but 3 – 4 orders should be nearly routine with most codes. (We are starting work on TBR and exact parsimony solvers.)

Slide 23: Final Remarks (cont'd)
- High-performance implementations enable:
  - better approximations for difficult problems (MP, ML)
  - true optimization for larger instances
  - realistic data exploration (e.g., testing evolutionary scenarios, assessing answers obtained through other means, etc.)
- Our analysis of the Campanulaceae dataset confirmed the conjecture of Robert Jansen et al. that inversion is the principal process of genome evolution in cpDNA for this group.

Slide 24: Work-In-Progress and Future Work
- Tree enumeration using circular ordering
- Handle unequal gene content and duplicate genes using exemplars
- Parallel branch and bound techniques (optimized for Sun HPC Servers) for searching tree space
- Improved SPR and TBR techniques (local searches around good trees)
- Exact Algorithm for Maximum Parsimony

Slide 25: Recent publications (2001)
- A New Implementation and Detailed Study of Breakpoint Analysis, B.M.E. Moret, S. Wyman, D.A. Bader, T. Warnow, and M. Yan, Sixth Pacific Symposium on Biocomputing 2001, Hawaii, January.
- High-Performance Algorithm Engineering for Gene-Order Phylogenies, D.A. Bader, B.M.E. Moret, T. Warnow, S.K. Wyman, and M. Yan, DIMACS Workshop on Whole Genome Comparison, DIMACS Center, Rutgers University, Piscataway, NJ, March.
- Variation in vegetation growth rates: Implications for the evolution of semi-arid landscapes, C. Restrepo, B.T. Milne, D. Bader, W. Pockman, and A. Kerkhoff, 16th Annual Symposium of the US-International Association of Landscape Ecology, Arizona State University, Tempe, April.
- High-Performance Algorithm Engineering for Computational Phylogeny, B.M.E. Moret, D.A. Bader, and T. Warnow, 2001 International Conference on Computational Science, San Francisco, CA, May.
- Cluster Computing: Applications, David A. Bader and Robert Pennington, The International Journal of High Performance Computing, 15(2), May.
- New approaches for using gene order data in phylogeny reconstruction, R.K. Jansen, D.A. Bader, B.M.E. Moret, L.A. Raubeson, L.-S. Wang, T. Warnow, and S. Wyman, Botany 2001, Albuquerque, NM, August.
- GRAPPA: a high-performance computational tool for phylogeny reconstruction from gene-order data, B.M.E. Moret, D.A. Bader, T. Warnow, S.K. Wyman, and M. Yan, Botany 2001, Albuquerque, NM, August.
- Inferring phylogenies of photosynthetic organisms from chloroplast gene orders, L.A. Raubeson, D.A. Bader, B.M.E. Moret, L.-S. Wang, T. Warnow, and S.K. Wyman, Botany 2001, Albuquerque, NM, August.
- Industrial Applications of High-Performance Computing for Phylogeny Reconstruction, D.A. Bader, B.M.E. Moret, and L. Vawter, SPIE ITCom: Commercial Applications for High-Performance Computing, Denver, CO, SPIE Vol. 4528, August.
- Using PRAM Algorithms on a Uniform-Memory-Access Shared-Memory Architecture, D.A. Bader, A. Illendula, B.M.E. Moret, and N.R. Weisse-Bernstein, Fifth Workshop on Algorithm Engineering, Springer-Verlag LNCS 2141, Aarhus, Denmark, August.
- A Linear-Time Algorithm for Computing Inversion Distance Between Two Signed Permutations with an Experimental Study, D.A. Bader, B.M.E. Moret, and M. Yan, Journal of Computational Biology, 8(5), October 2001.