1 SC'03, Nov. 15–21, 2003 A Million-Fold Speed Improvement in Genomic Repeats Detection John W. Romein Jaap Heringa Henri E. Bal Vrije Universiteit, Amsterdam.

Presentation transcript:

1 A Million-Fold Speed Improvement in Genomic Repeats Detection
SC'03, Nov. 15–21, 2003
John W. Romein, Jaap Heringa, Henri E. Bal
Vrije Universiteit, Faculty of Sciences, Department of Computer Science
Bio-Informatics Group & Computer Systems Group
Amsterdam, the Netherlands

2 repeats in bio sequences
important to detect
  - essential for evolution
  - protein structure & function
  - diseases
hard to detect
  - any length
  - mutations
  - insertions/deletions
  - different fragment sizes
  - tandem and distant

3 repro
delineates repeats
  ☺ sensitive
two phases
  1. find top alignments (slow)
  2. find repeats
replaced phase 1
  - old algorithm: ☹ O(n^4), n < 2,000
  - new algorithm: ☺ O(n^3), n < 60,000
  ☺ 3-level parallel: SIMD, SMP, cluster

4 sidestep: sequence alignment
  - superpose two sequences (TATGCAG, TCTGAG)
  - match symbols vertically (good: +2, bad: -1)
  - allow gaps (-2 - 1*length)
  - maximize score
  - compute matrix using dynamic programming
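
The scoring rules on this slide can be made concrete with a small sketch. It scores one given alignment rather than searching for the best one; the function name `score_alignment`, its parameter names, and the use of `-` as the gap symbol are assumptions made for illustration, and the example alignment was chosen by hand.

```python
def score_alignment(top, bottom, match=2, mismatch=-1, gap_open=2, gap_extend=1):
    """Score two equal-length rows of an alignment; '-' marks a gap symbol."""
    if len(top) != len(bottom):
        raise ValueError("both rows must have the same number of columns")
    score = 0
    gap_row = None  # row holding the gap in the previous column, if any
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            row = 0 if x == "-" else 1
            if row != gap_row:
                score -= gap_open    # a new gap starts: pay the fixed -2 once
            score -= gap_extend      # every gap symbol pays the -1*length part
            gap_row = row
        else:
            score += match if x == y else mismatch
            gap_row = None
    return score


# Five matches (+10), one mismatch (-1), one gap of length 1 (-3).
print(score_alignment("TATGCAG", "TCTG-AG"))  # prints 6
```

Dynamic programming finds the highest-scoring alignment without enumerating all of them; this helper only checks the score of a single candidate.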

5 sidestep: local alignment
  - find sub-sequences that match well
  - ignores non-matching values before and after the subsequence (by disallowing negative values)
  - construct actual alignment: O(n^3) time
  - computing only the scores: O(n^2) time
  - (see paper)
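
As a minimal sketch of the score-only computation, the function below uses the textbook Smith-Waterman recurrence with Gotoh-style gap handling. This is an assumption about how an O(n^2) score pass could look, not the code from the paper; `local_score` and its parameters are illustrative names. Clamping every cell at 0 is what "disallowing negative values" amounts to.

```python
def local_score(a, b, match=2, mismatch=-1, gap_open=2, gap_extend=1):
    """Best local alignment score of a and b; a gap of length k costs 2 + k."""
    neg = float("-inf")
    cols = len(b)
    h_prev = [0] * (cols + 1)    # best scores of the previous row
    f = [neg] * (cols + 1)       # best score ending in a vertical gap, per column
    best = 0
    for i in range(1, len(a) + 1):
        h = [0] * (cols + 1)
        e = neg                  # best score ending in a horizontal gap
        for j in range(1, cols + 1):
            e = max(e - gap_extend, h[j - 1] - gap_open - gap_extend)
            f[j] = max(f[j] - gap_extend, h_prev[j] - gap_open - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment start anywhere (no negative values).
            h[j] = max(0, h_prev[j - 1] + s, e, f[j])
            best = max(best, h[j])
        h_prev = h
    return best


print(local_score("TATGCAG", "TCTGAG"))  # prints 6
```

Only two rows are kept in memory, which is enough for the score but not for reconstructing the alignment itself.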

6 summary
  - (TATGCAG, TCTGAG) => 6
    - takes O(n^2) time
  - (TATGCAG, TCTGAG) => (the alignment itself; figure not preserved in the transcript)
    - takes O(n^3) time
  - Matching TATGCAG with TCTGAG gives same result as matching only the substrings TATGCAG and TCTGAG

7 finding top alignments
(figure) red lines: top alignments
split sequence every possible way
  - align subsequence-pair
  - best is first top alignment
trick: find next best (top) alignment using O(n^2) algorithm n times; construct top alignment using O(n^3) algorithm
repeat while avoiding found top alignments
  - user typically wants 5-30 top alignments
  - ordered list, do most promising alignments first
  - realign 3-10%
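
The first step (split the sequence at every position, score each prefix/suffix pair, take the best) can be sketched as follows. The names `local_score` and `best_splits` are hypothetical, the score function is a generic affine-gap local alignment assumed for illustration, and the step that avoids already-found top alignments is left out.

```python
def local_score(a, b, match=2, mismatch=-1, gap_open=2, gap_extend=1):
    """Score-only local alignment (O(len(a) * len(b)) time)."""
    neg = float("-inf")
    h_prev = [0] * (len(b) + 1)
    f = [neg] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        h = [0] * (len(b) + 1)
        e = neg
        for j in range(1, len(b) + 1):
            e = max(e - gap_extend, h[j - 1] - gap_open - gap_extend)
            f[j] = max(f[j] - gap_extend, h_prev[j] - gap_open - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[j] = max(0, h_prev[j - 1] + s, e, f[j])
            best = max(best, h[j])
        h_prev = h
    return best


def best_splits(seq, top=3):
    """Score every split of seq into prefix and suffix; return the best (score, k)."""
    scored = [(local_score(seq[:k], seq[k:]), k) for k in range(1, len(seq))]
    # Highest score first; ties are broken by the smaller split position.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:top]


# The repeat ACGT|ACGT is found at split position 4.
print(best_splits("ACGTACGT"))
```

There are n - 1 splits and each score pass is quadratic, so this ranking costs O(n^3) in total; only the winning split then needs the more expensive pass that builds the actual alignment. The sorted list also gives the order in which remaining splits look most promising.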

8 performance old vs. new
  - sequence: longest known protein (titin)
  - speed improvement increases with sequence length

9 parallel alignment
parallelism within alignment
  ☹ loop-carried dependency
concurrent alignments
  ☹ speculative parallelism
  ☺ good performance
three-level parallelism
  - SSE/SSE2 multimedia extensions (SIMD)
  - shared memory MIMD
  - distributed memory MIMD

10 SIMD parallelism
multimedia extensions
  - 4 (SSE) or 8 (SSE2) parallel operations on consecutive 2-byte words
  - compiler intrinsics
compute 4 (or 8) neighboring matrices concurrently
  ☹ interleaved memory layout
use fine-grained hardware for coarse-grained computation
applicable to any program that does many alignments
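
The idea of filling several neighboring matrices in lock-step can be emulated in plain Python. This is only a sketch of the data layout: Python has no SSE intrinsics, so the inner loop over `lane` stands in for what a single SIMD instruction would do. The function name, the padding with the sentinels `#` and `$` (which never match anything and so cannot raise a local score), and the affine-gap recurrence are all assumptions made for illustration.

```python
LANES = 4  # four 2-byte words per operation, as with SSE on the slide


def local_scores_interleaved(pairs, match=2, mismatch=-1, gap_open=2, gap_extend=1):
    """Score LANES sequence pairs at once; cell j of lane l lives at j*LANES + l."""
    if len(pairs) != LANES:
        raise ValueError("expected exactly LANES sequence pairs")
    rows = max(len(a) for a, _ in pairs)
    cols = max(len(b) for _, b in pairs)
    # Pad every lane to the same matrix size with non-matching sentinels.
    tops = [a.ljust(rows, "#") for a, _ in pairs]
    lefts = [b.ljust(cols, "$") for _, b in pairs]
    neg = -10**9
    h_prev = [0] * ((cols + 1) * LANES)
    f = [neg] * ((cols + 1) * LANES)
    best = [0] * LANES
    for i in range(rows):
        h = [0] * ((cols + 1) * LANES)
        e = [neg] * LANES
        for j in range(1, cols + 1):
            base = j * LANES
            for lane in range(LANES):  # one SIMD operation would cover all lanes
                e[lane] = max(e[lane] - gap_extend,
                              h[base - LANES + lane] - gap_open - gap_extend)
                f[base + lane] = max(f[base + lane] - gap_extend,
                                     h_prev[base + lane] - gap_open - gap_extend)
                s = match if tops[lane][i] == lefts[lane][j - 1] else mismatch
                cell = max(0, h_prev[base - LANES + lane] + s, e[lane], f[base + lane])
                h[base + lane] = cell
                best[lane] = max(best[lane], cell)
        h_prev = h
    return best


seq = "ACGTACGT"
neighbors = [(seq[:k], seq[k:]) for k in range(3, 3 + LANES)]  # splits 3, 4, 5, 6
print(local_scores_interleaved(neighbors))
```

With this layout the values a vector instruction needs for one cell position across all lanes sit next to each other in memory, which is the point of interleaving; the cost is that each matrix is no longer stored contiguously.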

11 SSE/SSE2 performance
speedups w.r.t. new algorithm
superlinear speedups
  - MAX operator
  - 8 extra mmx/xmm registers
  - scheduling
cache-aware alignment: 4 – 6.5 times faster

12 MIMD parallelism
SIMD (SSE) parallelism is speculative
  - if a matrix (alignment) is 'promising', its neighbors probably also are promising
MIMD parallelism:
  - use dynamic task scheduling, selecting most promising tasks from a job queue
Shared memory (SMP): easy
Distributed memory: MPI, master/worker
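
Dynamic task scheduling from a job queue can be sketched with a shared priority queue and a few worker threads. The function `run_tasks`, its arguments, and the use of threads are assumptions for illustration; CPython threads do not run CPU-bound work in parallel, so this shows the scheduling structure only, and the MPI master/worker variant is not shown.

```python
import queue
import threading


def run_tasks(tasks, work, workers=2):
    """Run work(payload) for each (priority, payload); highest priority first."""
    jobs = queue.PriorityQueue()
    for index, (priority, payload) in enumerate(tasks):
        # PriorityQueue pops the smallest item, so the priority is negated.
        # The index breaks ties, so payloads themselves are never compared.
        jobs.put((-priority, index, payload))
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                _, index, payload = jobs.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            outcome = work(payload)
            with lock:
                results[index] = outcome

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results[i] for i in range(len(results))]


# Each idle worker takes the most promising remaining task.
print(run_tasks([(5, "split 10"), (9, "split 42"), (1, "split 7")], str.upper))
```

Because workers pull tasks when they become idle, the load balances itself even when alignments take very different amounts of time.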

13 total parallel performance
SMP: 2 CPUs
  - 2 times faster
cluster: 64*2 CPUs
  - 548 – 889-fold speedup
Up to 125x faster than SSE version on 1 CPU

14 conclusions
new algorithm >> 100 times faster
  - much more for longer sequences
parallel: SSE(2), SMP, cluster
  - SSE(2) parallelism yields superlinear speedups
  - 128 CPUs: 548 – 889-fold speedup
1,000,000-fold speed improvement