High-Performance Algorithm Engineering for Computational Phylogenetics [B Moret, D Bader] Kexue Liu CMSC 838 Presentation.

CMSC 838T – Presentation
Motivation
- Phylogeny reconstruction from molecular data
  - Poses a complex optimization problem
  - NP-hard, and thus computationally intractable
- High-performance algorithm engineering
  - Reduces the running time of existing phylogenetic algorithms

Talk Overview
- Background
- Breakpoint Phylogeny
- Breakpoint Analysis
- Re-Engineering Techniques
- Impact in Computational Biology
- Observations

Background
- Algorithm engineering
  - Transforms a pencil-and-paper algorithm into an efficient, robust implementation
  - Main focus is experimentation
- High-performance algorithm engineering
  - Running time and solution quality are the paramount goals
  - Includes parallelism
  - Refines the serial part of the code
  - Cache-aware programming is key to performance

Background
- Phylogeny
  - Reconstruction of the evolutionary history of a collection of organisms
  - Takes the form of an evolutionary tree
- Computational phylogenetics
  - Is extremely computation-intensive
  - Methods for sequence data (RNA, DNA, amino acid, protein) do not scale up to whole genomes
  - Genome-level data
    a) At this level, evolution is slow
    b) Enables us to recover deep evolutionary relationships
    c) Much harder to analyze than sequence data
  - Optimization criteria
    a) Heuristics
    b) Parsimony criterion
    c) Maximum likelihood

Breakpoint Phylogeny
- Deals with simple genomic data
  - Organisms have a single chromosome or contain single-chromosome organelles
  - Each chromosome can be represented by an ordering of oriented genes
  - The evolutionary process includes inversion, transposition, insertion, deletion and duplication
- Approaches
  - Construct a parsimonious tree
    a) Known or conjectured to be NP-hard
    b) No automated tool to solve it
  - Neighbor-joining heuristics
    a) Fast and valuable
    b) Cannot recover the ancestral gene orders
  - Breakpoint phylogeny by Blanchette and Sankoff

Breakpoint Phylogeny
- A more special case:
  - All the genomes have the same set of genes
  - Each gene appears once
- Is of interest to biologists
  - Inversions are the main evolutionary mechanism on such genomes
- Works well for certain datasets
- Implementation developed by Sankoff and Blanchette
  - Breakpoint Analysis
  - Too slow to be used on anything other than small datasets with a few genes

Breakpoint Analysis: Details
- Breakpoint
  - Given two genomes G and G' with the same set of genes, each gene appearing exactly once in each genome
  - An ordered pair of genes (g_i, g_j) that appears in G is a breakpoint if
  - neither (g_i, g_j) nor (-g_j, -g_i) appears in G'
- Breakpoint distance
  - Number of breakpoints between two genomes
- Median for three genomes
  - The genome that minimizes the sum of the breakpoint distances to the three given genomes
- Median Problem for Breakpoints (MPB)
  - Construct a median of the given genomes
  - NP-hard
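The breakpoint definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in BPAnalysis or GRAPPA: the function names, the encoding of a genome as a sequence of signed integers, and the `circular` flag (which treats the last and first genes as adjacent, as on a circular chromosome) are all assumptions made for this example.

```python
def adjacencies(genome, circular=True):
    """Return the ordered pairs of neighbouring signed genes in a genome."""
    genome = list(genome)
    pairs = list(zip(genome, genome[1:]))
    if circular and len(genome) > 1:
        # Assumption: on a circular chromosome the last gene precedes the first.
        pairs.append((genome[-1], genome[0]))
    return pairs


def breakpoint_distance(g, h, circular=True):
    """Count adjacencies (a, b) of g for which neither (a, b) nor (-b, -a) occurs in h."""
    genes_g = sorted(abs(x) for x in g)
    genes_h = sorted(abs(x) for x in h)
    if genes_g != genes_h or len(set(genes_g)) != len(genes_g):
        raise ValueError("genomes must contain the same genes, each exactly once")
    present = set()
    for a, b in adjacencies(h, circular):
        present.add((a, b))
        present.add((-b, -a))  # the same adjacency read on the opposite strand
    return sum(1 for pair in adjacencies(g, circular) if pair not in present)
```

For instance, inverting the segment 2 3 of the genome 1 2 3 4 gives 1 -3 -2 4, and the two adjacencies at the ends of the inverted segment become breakpoints while the adjacency inside it is preserved.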

Breakpoint Analysis
- Method developed by Sankoff and Blanchette to solve the breakpoint phylogeny
- Uses a reduction from the Median Problem for Breakpoints (MPB) to the Travelling Salesman Problem (TSP)
  - Directed MPB to undirected TSP
  - Each gene is represented by a pair of cities connected by an edge
- Outer loop enumerates all (2n-5)!! trees on n leaves
- Inner loop runs an unknown number of iterations
- Computational complexity is exponential in both the number of genomes and the number of genes
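To give a feel for how quickly the outer loop grows, the double factorial (2n-5)!! = 1 · 3 · 5 · … · (2n-5) can be evaluated directly. The function name below is chosen for this illustration only.

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary trees on n labelled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count
```

With this count, 10 leaves give 2,027,025 trees and 13 leaves give 13,749,310,575 trees, so three extra taxa multiply the search space by 17 · 19 · 21 = 6,783.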

Breakpoint Analysis
    Initially label all internal nodes with gene orders
    Repeat
        For each internal node v, with neighbors A, B, C do
            Solve the MPB on A, B, C to yield label m
            If relabelling v with m improves the score of T, then do it
    until no internal node can be relabelled
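The relabelling loop above can be tried out on toy inputs with the sketch below. It is only an illustration under several assumptions: genomes are circular signed permutations stored as tuples, the median is found by exhaustive search over all signed permutations (feasible only for a handful of genes, and a stand-in for the TSP-based median solver described in the talk), and all function names and the edge-list tree encoding are hypothetical.

```python
from itertools import permutations, product


def breakpoint_distance(g, h):
    """Breakpoints of circular signed genome g relative to h (both tuples)."""
    def adjacencies(x):
        return list(zip(x, x[1:] + x[:1]))
    present = set()
    for a, b in adjacencies(h):
        present.add((a, b))
        present.add((-b, -a))
    return sum(pair not in present for pair in adjacencies(g))


def all_signed_genomes(n):
    """Yield every signed permutation of genes 1..n (n! * 2**n of them)."""
    for perm in permutations(range(1, n + 1)):
        for signs in product((1, -1), repeat=n):
            yield tuple(s * p for s, p in zip(signs, perm))


def brute_force_median(a, b, c):
    """Exhaustive stand-in for the MPB solver: minimize the summed distance."""
    def cost(m):
        return sum(breakpoint_distance(m, x) for x in (a, b, c))
    return min(all_signed_genomes(len(a)), key=cost)


def tree_score(edges, labels):
    """Total breakpoint distance over all edges of the tree."""
    return sum(breakpoint_distance(labels[u], labels[v]) for u, v in edges)


def relabel(edges, labels, internal):
    """Repeat median relabelling until no internal node improves the score."""
    labels = dict(labels)  # leave the caller's labelling untouched
    neighbors = {v: [] for v in labels}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    improved = True
    while improved:
        improved = False
        for v in internal:
            a, b, c = (labels[w] for w in neighbors[v])
            m = brute_force_median(a, b, c)
            old = sum(breakpoint_distance(labels[v], x) for x in (a, b, c))
            new = sum(breakpoint_distance(m, x) for x in (a, b, c))
            if new < old:  # only the three edges at v change, so this is the change in the score of T
                labels[v] = m
                improved = True
    return labels
```

Because every accepted relabelling strictly lowers a non-negative integer score, the loop terminates, although the number of passes is not known in advance.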

Re-Engineering Techniques
- Profiling
  - Identify bottlenecks to balance the implementation
  - Eliminate problems such as excessive resource consumption or poor results
  - Examples:
    a. Hand-unrolling loops cut the running time by a factor of at least six
    b. Refined distance computations
    c. Refined lower-bound computations
  - Speed-up of one order of magnitude on the Campanulaceae dataset

Re-Engineering Techniques
- Cache awareness
  - Memory footprint
    a. BPAnalysis: 60 MB
    b. GRAPPA: 1.8 MB
  - Memory locality
    a. BPAnalysis: poor locality, working set size of about 12 MB
    b. GRAPPA: good locality, working set size of about 600 KB
  - Minimizes pointer dereferencing
  - Reuses allocated storage
  - Studies indicate that the gain is likely to be a factor of anywhere from 2 to 40

Re-Engineering Techniques
- Low-level algorithmic changes
  - Use all of the available information
  - Examples:
    a. Using a lower bound to eliminate over 95% of the trees
    b. Taking advantage of special structure: the TSP instances have only two nontrivial edge costs (cost 1 and cost 2)
  - Speed-up by a factor of 5 to 10
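The slide does not say which lower bound is used. One bound that fits the description, given here as an assumption rather than as the implementation's actual test, follows from the triangle inequality: if the leaves are listed in the order in which they appear around a planar drawing of the tree, a tour through consecutive leaves crosses every tree edge exactly twice, so half the sum of the consecutive leaf-to-leaf distances cannot exceed the tree score. The function names and the `distance` callback are hypothetical.

```python
def circular_lower_bound(leaf_genomes, distance):
    """Lower bound on the score of any labelling of a tree whose leaves
    appear in the given circular order (assumes `distance` is a metric)."""
    n = len(leaf_genomes)
    total = sum(
        distance(leaf_genomes[i], leaf_genomes[(i + 1) % n]) for i in range(n)
    )
    return (total + 1) // 2  # scores are integers, so round half the tour up


def can_prune(leaf_genomes, distance, best_score):
    """A tree can be skipped if its bound cannot beat the best score so far."""
    return circular_lower_bound(leaf_genomes, distance) >= best_score
```

The appeal of such a test is that it needs only pairwise leaf distances, which can be computed once, so a tree can be discarded without solving a single median problem.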

Re-Engineering Techniques: Parallel Aspects
- Efficient tree generation
  - Avoids unbounded-precision arithmetic
  - Allows generation from any count with a variable gap
  - Provides parallel generation and also sampling of the search space
- Portable MPI implementation; each processor handles a fraction of the trees
- On the 512-processor Alliance cluster Los Lobos at UNM, obtained a 512-fold speedup
- Summary of speedups:
  - Profiling: one order of magnitude
  - Cache awareness: a factor of anywhere from 2 to 40
  - Low-level algorithmic changes: a factor of 5 to 10
  - 512-processor parallelism: a factor of 512
  - Overall, GRAPPA demonstrated a million-fold speedup over the original implementation
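One way to picture "generation from any count with a variable gap" is to number the trees from 0 to (2n-5)!! - 1 and decode each number into a sequence of edge choices: leaf k (for k = 4, ..., n) is attached to one of the 2k-5 edges of the tree built so far. The sketch below shows that idea only; it is not GRAPPA's generator, the function names are hypothetical, and because Python integers are arbitrary-precision it does not demonstrate the fixed-precision arithmetic mentioned on the slide.

```python
def num_trees(n):
    """(2n-5)!!, the number of unrooted binary trees on n leaves."""
    count = 1
    for k in range(4, n + 1):
        count *= 2 * k - 5
    return count


def decode_tree(index, n):
    """Map a tree number to edge choices for leaves 4..n (mixed-radix digits)."""
    if not 0 <= index < num_trees(n):
        raise ValueError("index out of range")
    choices = []
    for k in range(4, n + 1):
        # A tree on k-1 leaves has 2k-5 edges on which leaf k can be inserted.
        index, edge = divmod(index, 2 * k - 5)
        choices.append(edge)
    return choices


def indices_for_rank(rank, size, n):
    """Tree numbers handled by one of `size` processes: start at `rank`, gap `size`."""
    return range(rank, num_trees(n), size)
```

With a gap equal to the number of processes, every process can start from its own rank and step through a disjoint share of the trees without communicating; a larger gap samples the search space instead of covering it.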

Evaluation: the Bluebell Family
- Dataset: full gene sequences for the chloroplasts of 12 species of Campanulaceae (bluebells), plus tobacco
  - Chloroplast
    a. A semi-independent organism that lives within plant cells and allows them to photosynthesize
    b. Has a single chromosome with about 120 genes
- Optimization target: reconstruct the phylogeny with the least total amount of genomic change
- Environment: 512-processor Los Lobos supercluster at UNM
- Results:
  - Speedup of three to four orders of magnitude in the serial part
  - Total speedup of over one million

Phylogeny of the Bluebell Family [figure not reproduced in the transcript]

Impact in Computational Biology
- Much faster implementations
  - Alter the practice of research in biology and medicine
  - Reduce the time of an analysis from two years down to a day
  - Make an enormous difference in the pace and cost of drug discovery and development
- Fast and accurate analysis software
  - Enables researchers to pursue more leads and develop better intuition on small datasets
  - Helps them form new conjectures about biological mechanisms

Observations
- Algorithm re-engineering
  - Uncovers salient characteristics of the algorithm
  - Enables us to develop better algorithms
    - Example: a true linear-time algorithm for computing inversion distance was found during the development of GRAPPA
  - Can be applied to any existing bioinformatics algorithm
    - Several, such as BLAST, have already been engineered for performance
  - Limited benefits in theoretical terms when applied to NP-hard optimization problems
  - Does not scale up to "industrial strength"
    - GRAPPA only enables a move from 10 taxa to 13 taxa

Thank you