High-Performance Algorithm Engineering for Computational Phylogenetics [B Moret, D Bader] Kexue Liu CMSC 838 Presentation.


1 High-Performance Algorithm Engineering for Computational Phylogenetics [B Moret, D Bader] Kexue Liu CMSC 838 Presentation

2 CMSC 838T – Presentation: Motivation
- Phylogeny reconstruction from molecular data
  - Poses a complex optimization problem
  - NP-hard and thus computationally intractable
- High-performance algorithm engineering
  - Reduces the running time of existing phylogenetic algorithms

3 Talk Overview
- Overview of talk
  - Background
  - Breakpoint Phylogeny
  - Breakpoint Analysis
  - Re-Engineering Techniques
  - Impact in Computational Biology
  - Observations

4 Background
- Algorithm engineering
  - Transforms a pencil-and-paper algorithm into an efficient, robust implementation
  - Main focus is experimentation
- High-performance algorithm engineering
  - Running time and solution quality are the paramount goals
  - Includes parallelism
  - Refines the serial part of the code
  - Cache-aware programming is a key to performance

5 Background
- Phylogeny
  - Reconstruction of the evolutionary history of a collection of organisms
  - Takes the form of an evolutionary tree
- Computational phylogenetics
  - Is extremely computation-intensive
  - Methods for sequence data (RNA, DNA, amino acid, protein) do not scale up to whole genomes
  - Genome-level data
    a) At this level, evolution is slow
    b) Enables us to recover deep evolutionary relationships
    c) Much harder to analyze than sequence data
  - Optimization criteria
    a) Heuristics
    b) Parsimony criterion
    c) Maximum likelihood

6 Breakpoint Phylogeny
- Deals with simple genomic data
  - Organisms have a single chromosome or contain single-chromosome organelles
  - Each chromosome can be represented by an ordering of oriented genes
  - Evolutionary processes include inversion, transposition, insertion, deletion, and duplication
- Approaches
  - Construct a parsimonious tree
    a) Known or conjectured to be NP-hard
    b) No automated tool to solve it
  - Neighbor-joining heuristics
    a) Fast and valuable
    b) Cannot recover the ancestral gene orders
  - Breakpoint phylogeny by Blanchette and Sankoff

7 Breakpoint Phylogeny
- A more specialized case:
  - All the genomes have the same set of genes
  - Each gene appears exactly once
- Of interest to biologists
  - Inversions are the main evolutionary mechanism on such genomes
- Works well for certain datasets
- Implementation developed by Sankoff and Blanchette
  - Breakpoint Analysis (BPAnalysis)
  - Too slow to be used on anything other than small datasets with a few genes

8 Breakpoint Analysis: Details
- Breakpoint
  - Given two genomes G and G' with the same set of genes, each gene appearing exactly once in each genome
  - An ordered pair of genes (g_i, g_j) that appears in G is a breakpoint
  - if neither (g_i, g_j) nor (-g_j, -g_i) appears in G'
- Breakpoint distance
  - Number of breakpoints between two genomes
- Median of three genomes
  - The genome that minimizes the sum of the breakpoint distances to the three genomes
- Median Problem for Breakpoints (MPB)
  - Construct a median of the given genomes
  - NP-hard
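The breakpoint definition above can be made concrete with a short sketch. It assumes genomes are encoded as lists of signed integers (sign = gene orientation); the function names and the `circular` flag are illustrative choices, not taken from BPAnalysis or GRAPPA.

```python
def adjacencies(genome, circular=False):
    """Ordered pairs of consecutive genes; optionally close the circle."""
    pairs = list(zip(genome, genome[1:]))
    if circular and len(genome) > 1:
        pairs.append((genome[-1], genome[0]))
    return pairs


def breakpoint_distance(g, h, circular=False):
    """Count adjacencies (a, b) of g such that neither (a, b) nor
    (-b, -a) is an adjacency of h."""
    if sorted(map(abs, g)) != sorted(map(abs, h)):
        raise ValueError("genomes must contain the same genes, each once")
    present = set()
    for a, b in adjacencies(h, circular):
        present.add((a, b))
        present.add((-b, -a))  # the same adjacency read on the other strand
    return sum(1 for pair in adjacencies(g, circular) if pair not in present)


if __name__ == "__main__":
    g = [1, 2, 3, 4, 5]
    h = [1, -3, -2, 4, 5]  # g with the segment 2..3 inverted
    print(breakpoint_distance(g, h))  # 2
```

A single inversion creates at most two breakpoints, one at each end of the inverted segment, which is why the example reports 2.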

9 Breakpoint Analysis
- Method developed by Sankoff and Blanchette to solve the breakpoint phylogeny problem
- Uses a reduction from the MPB to the Travelling Salesman Problem (TSP)
  - Directed MPB to undirected TSP
  - Each gene is represented by a pair of cities connected by an edge
- Outer loop enumerates all (2n-5)!! trees on n leaves
- Inner loop runs an unknown number of iterations
- Computational complexity is exponential in both the number of genomes and the number of genes
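To give a sense of how fast the outer loop grows, the following sketch evaluates the (2n-5)!! tree count for a few values of n. The function name is an illustrative choice.

```python
def num_unrooted_binary_trees(n):
    """(2n-5)!! = 1 * 3 * 5 * ... * (2n-5): the number of unrooted
    binary trees on n labelled leaves, for n >= 3."""
    if n < 3:
        raise ValueError("need at least three leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= odd
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 13):
        print(n, num_unrooted_binary_trees(n))
    # 10 leaves ->      2,027,025 trees
    # 13 leaves -> 13,749,310,575 trees
```

Going from 10 to 13 leaves multiplies the number of trees by 17 x 19 x 21 = 6,783, so even a large constant-factor speedup buys only a few additional taxa.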

10 Breakpoint Analysis
Initially label all internal nodes with gene orders
Repeat
    For each internal node v, with neighbors A, B, C do
        Solve the MPB on A, B, C to yield label m
        If relabelling v with m improves the score of T, then do it
until no internal node can be relabelled
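The relabelling loop above can be sketched in a few lines for toy inputs. This is only an illustration under stated assumptions: the median is found by brute force over all signed gene orders (feasible for about four genes), whereas the method described in the talk solves the median through a TSP reduction; genomes are treated as linear; and all names (`median`, `tree_score`, `relabel_until_stable`) are hypothetical.

```python
from itertools import permutations, product


def breakpoint_distance(g, h):
    """Breakpoints of g relative to h (linear genomes, signed genes)."""
    present = set()
    for a, b in zip(h, h[1:]):
        present.update([(a, b), (-b, -a)])
    return sum(1 for pair in zip(g, g[1:]) if pair not in present)


def all_gene_orders(n):
    """Every signed permutation of genes 1..n (n! * 2**n of them)."""
    for perm in permutations(range(1, n + 1)):
        for signs in product((1, -1), repeat=n):
            yield tuple(s * gene for s, gene in zip(signs, perm))


def median(a, b, c):
    """Brute-force stand-in for an MPB solver: toy sizes only."""
    return min(all_gene_orders(len(a)),
               key=lambda m: sum(breakpoint_distance(m, x) for x in (a, b, c)))


def tree_score(edges, labels):
    """Sum of breakpoint distances along the tree edges."""
    return sum(breakpoint_distance(labels[u], labels[v]) for u, v in edges)


def relabel_until_stable(edges, labels, internal):
    """Repeat median relabelling until no internal node improves."""
    labels = dict(labels)
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)
    improved = True
    while improved:
        improved = False
        for v in internal:
            around = [labels[w] for w in neighbors[v]]  # the three neighbors
            m = median(*around)
            # Relabelling v only changes the three edges incident to v,
            # so comparing these local sums is comparing the tree score.
            old = sum(breakpoint_distance(labels[v], x) for x in around)
            new = sum(breakpoint_distance(m, x) for x in around)
            if new < old:
                labels[v] = m
                improved = True
    return labels


if __name__ == "__main__":
    ident, inv = (1, 2, 3, 4), (1, -3, -2, 4)
    edges = [("A", "X"), ("B", "X"), ("C", "Y"), ("D", "Y"), ("X", "Y")]
    start = {"A": ident, "B": ident, "C": inv, "D": inv,
             "X": inv, "Y": ident}  # deliberately poor initial labels
    final = relabel_until_stable(edges, start, ["X", "Y"])
    print(tree_score(edges, start), "->", tree_score(edges, final))
```

The loop terminates because each accepted relabelling strictly lowers a non-negative integer score; it reaches a local optimum, not necessarily the best labelling.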

11 Re-Engineering Techniques
- Profiling
  - Identifies bottlenecks so that the implementation can be balanced
  - Eliminates problems such as excessive resource consumption or poor results
  - Examples:
    a. Hand-unrolling loops cut the running time by a factor of at least six
    b. Refined distance computations
    c. Refined lower-bound computations
  - Speedup of one order of magnitude on the Campanulaceae dataset

12 Re-Engineering Techniques
- Cache awareness
  - Memory footprint
    a. BPAnalysis: 60 MB
    b. GRAPPA: 1.8 MB
  - Memory locality
    a. BPAnalysis: poor locality, working set of about 12 MB
    b. GRAPPA: good locality, working set of about 600 KB
  - Minimize pointer dereferencing
  - Reuse allocated storage
  - Studies indicate that the gain is likely to be a factor of anywhere from 2 to 40

13 Re-Engineering Techniques
- Low-level algorithmic changes
  - Use all of the available information
  - Examples:
    a. Using a lower bound to eliminate over 95% of the trees
    b. Taking advantage of special structure: the TSP instance has only two nontrivial edge costs (1 and 2)
  - Speedup by a factor of 5 to 10

14 Re-Engineering Techniques: Parallel Aspects
- Efficient tree generation
  - Avoids unbounded-precision arithmetic
  - Allows generation from any count with a variable gap
  - Provides parallel generation and also sampling of the search space
- Portable MPI implementation; each processor handles a fraction of the trees
- On the 512-processor Alliance cluster Los Lobos at UNM, obtained a 512-fold speedup
- Summary of speedups:
  - Profiling: one order of magnitude
  - Cache awareness: a factor of anywhere from 2 to 40
  - Low-level algorithmic changes: a factor of 5 to 10
  - 512-processor parallelism: a factor of 512
  - Overall, GRAPPA demonstrated a million-fold speedup over the original implementation
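One way to picture "generation from any count with a variable gap" is to number the trees and build the tree for a given number directly, so that each processor can walk its own strided slice of the numbering. The sketch below does this with a mixed-radix decoding of the tree number into edge-insertion choices. It is an assumption-laden illustration, not GRAPPA's generator: the names are hypothetical, no MPI is used, and Python integers are arbitrary-precision, so the bounded-precision aspect mentioned on the slide is not shown.

```python
def tree_from_rank(rank, n):
    """Edge list of the rank-th unrooted binary tree on leaves 0..n-1.

    Leaves 0, 1, 2 start as a star around internal node n. Each further
    leaf is attached by splitting one existing edge; the edge index is
    the next mixed-radix digit of rank (radices 3, 5, ..., 2n-5).
    """
    if n < 3 or rank < 0:
        raise ValueError("need n >= 3 and rank >= 0")
    edges = [(0, n), (1, n), (2, n)]
    next_internal = n + 1
    for leaf in range(3, n):
        rank, choice = divmod(rank, len(edges))
        u, v = edges[choice]
        w = next_internal  # new internal node placed on edge (u, v)
        next_internal += 1
        edges[choice] = (u, w)
        edges.append((w, v))
        edges.append((leaf, w))
    if rank != 0:
        raise ValueError("rank exceeds the number of trees")
    return edges


def ranks_for_worker(worker, num_workers, total):
    """Strided share of tree numbers: start at `worker`, gap `num_workers`."""
    return range(worker, total, num_workers)


if __name__ == "__main__":
    total = 15  # (2*5 - 5)!! trees on 5 leaves
    for worker in range(4):
        for r in ranks_for_worker(worker, 4, total):
            tree = tree_from_rank(r, 5)
            print(worker, r, tree)
```

Because a tree is built from its number alone, workers need no communication to split the search space, and a large gap gives a cheap way to sample it.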

15 Evaluation: the Bluebell Family
- Dataset: full gene sequences for the chloroplasts of 12 species of Campanulaceae (bluebells), plus tobacco
  - Chloroplast
    a. A semi-independent organism that lives within plant cells and allows them to photosynthesize
    b. Has a single chromosome with about 120 genes
- Optimization target: reconstruct the phylogeny with the least total amount of genomic change
- Environment: 512-processor Los Lobos supercluster at UNM
- Results:
  - Speedup of three to four orders of magnitude in the serial part
  - Total speedup of over one million

16 Phylogeny of the Bluebell Family (figure slide)

17 Impact in Computational Biology
- Much faster implementations
  - Alter the practice of research in biology and medicine
  - Reduce the time of an analysis from two years down to a day
  - Make an enormous difference in the pace and cost of drug discovery and development
- Fast and accurate analysis software
  - Enables researchers to pursue more leads and develop better intuition on small datasets
  - Helps form new conjectures about biological mechanisms

18 Observations
- Algorithm re-engineering
  - Uncovers salient characteristics of the algorithm
  - Enables us to develop better algorithms
    - Example: a true linear-time algorithm for computing inversion distance was found during the development of GRAPPA
  - Can be applied to any existing bioinformatics algorithm
    - Several have been engineered for performance, such as BLAST
  - Limited benefits in theoretical terms when applied to NP-hard optimization problems
  - Does not scale up to "industrial strength"
    - GRAPPA only enables a move from 10 taxa to 13 taxa

19 Thank you

