Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.

Presentation transcript:

Biosequence Similarity Search on the Mercury System
Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster
Department of Computer Science and Engineering, Washington University in Saint Louis, MO
Supported by an NIH STTR Grant & NSF Grants DBI , ITR , CCR

Slide # 2, Washington University in St. Louis: Outline
Overview of BLAST
Overview of the Mercury system
Description of BLASTN algorithm
Algorithmic changes to BLASTN
Improvement in performance
Related work
Conclusion

Slide # 3: Basic Local Alignment Search Tool
Biosequence comparison software
  – Query sequence (new genome) to large database of known biosequences
      Look for similar regions
Exponential growth of genomic databases
  – Longer time for searches to complete
  – Solutions
      Perform comparison over multiple machines
      Specialized hardware - Our Approach

Slide # 4: The Mercury System

Slide # 5: The Mercury System
Proximity to disk
  – Simple operations performed close to disk
      Avoids CPU use
  – 400 Mbytes/s throughput from the disk
Concurrent independent operation
  – Does not use processor cache cycles, memory or I/O buses
Reconfigurable logic
  – Logic can be tuned to the particular need of the application

Slide # 6: BLASTN
  – Both the query and the database are long DNA strings
  – Consist of {A, C, T, G} and some unknowns
Each stage processes less data than the one before it
The stages become more computationally expensive

Slide # 7: BLASTN - Terminology
Query: … ACTGTGTTTCACTGACGGGTGT …
Database: … CTGTGTCCCCAACACTGCTGACGTAGAATCGTGTAG …
A ‘w-mer’ is a sequence of ‘w’ consecutive bases
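The w-mer definition can be illustrated with a minimal sketch. The function name `w_mers` is hypothetical; the query string is the fragment shown on the slide.

```python
def w_mers(sequence, w=11):
    """Yield (offset, w-mer) pairs for every window of w consecutive bases."""
    for offset in range(len(sequence) - w + 1):
        yield offset, sequence[offset:offset + w]


query = "ACTGTGTTTCACTGACGGGTGT"  # query fragment from the slide
words = list(w_mers(query, w=11))

# A sequence of length n contains n - w + 1 overlapping w-mers.
print(len(words), words[0])
```

Consecutive w-mers overlap by w - 1 bases, so a long sequence yields roughly one w-mer per base.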

Slide # 8: BLASTN - Pipeline - Stage 1
Matches each ‘11-mer’ in query to database
  – Exact string matching
83% of overall time is spent in this stage
Filters out 92% of data entering this stage
  – Only 8% of data proceeds to the next stage

Slide # 9: BLASTN - Pipeline - Stage 2
Extends the matches from stage 1
… ACTGTGTTTCACTGACGGGTGT …
… GTGTCCCCAACATTTCACTGACGAGAATCGTGTAG …

Slide # 10: BLASTN - Pipeline - Stage 2
Extends the matches from stage 1
  – Allows mismatches of individual bases
  – Does not allow gaps in either the query or the database
  – Match score should be higher than threshold to proceed
16% of pipeline time is spent in this stage
Only 2/100,000 of data entering this stage proceeds to the next stage
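A minimal sketch of ungapped extension from a stage 1 seed follows. The scoring values (`match=1`, `mismatch=-3`), the `x_drop` cutoff that stops an extension once the score falls too far below its best value, and the function names are all assumptions made for illustration; the slides only state that mismatches are allowed, gaps are not, and the score must exceed a threshold.

```python
def extend_one_direction(query, database, q, d, step, match, mismatch, x_drop):
    """Extend without gaps from (q, d) in direction step (+1 or -1).

    Returns the best score gained by the extension (0 if none helps).
    """
    best = score = 0
    while 0 <= q < len(query) and 0 <= d < len(database):
        score += match if query[q] == database[d] else mismatch
        if score > best:
            best = score
        elif best - score > x_drop:  # assumed stopping rule
            break
        q += step
        d += step
    return best


def ungapped_score(query, database, q_off, d_off, w,
                   match=1, mismatch=-3, x_drop=5):
    """Score a w-base exact seed plus its best left and right extensions."""
    seed = w * match
    right = extend_one_direction(query, database, q_off + w, d_off + w,
                                 +1, match, mismatch, x_drop)
    left = extend_one_direction(query, database, q_off - 1, d_off - 1,
                                -1, match, mismatch, x_drop)
    return seed + left + right


# Seed "ACGT" at offset 2 in both strings; one extra matching base on the right.
score = ungapped_score("GGACGTACC", "TTACGTAGG", q_off=2, d_off=2, w=4)
print(score)  # 5
```

A seed would proceed to stage 3 only if this score exceeds the chosen threshold.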

Slide # 11: BLASTN - Pipeline - Stage 3
Extends the matches from stage 2
… ACCACTGTTTCACTGACG_GA_T_GT …
… CTGTGTCCCCAC_GTTTCACTGACGAGAATCGTGTAG …

Slide # 12: BLASTN - Pipeline - Stage 3
Extends the matches from stage 2
  – Scores matches with gaps inserted in both sequences
  – Smith-Waterman dynamic programming algorithm
<1% of pipeline time is spent in this stage
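The gapped scoring in stage 3 can be sketched with a basic Smith-Waterman recurrence. This version uses a linear gap penalty and returns only the best local score; the scoring values and the function name are assumptions, and it is not the BLASTN implementation.

```python
def smith_waterman_score(a, b, match=1, mismatch=-3, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Seven matching bases and one gap: 7 * 1 - 2 = 5.
print(smith_waterman_score("ACGTTACG", "ACGTACG"))  # 5
```

The cost is proportional to the product of the two lengths, which is affordable here only because so little data reaches this stage.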

Slide # 13: NCBI - BLASTN
Stage 1 (word matching) is implemented as a lookup table
  – Efficient only for certain word lengths (w = 11)
Performance degrades dramatically for larger query sizes
Query Size 10 Kbases 100 Kbases 1 Mbases Units
Throughput Mbases/s
Pentium-4 2.6GHz 1Gbyte RAM
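Word matching with a lookup table can be sketched as follows. A Python dictionary stands in for the table here, which is an assumption for illustration and not NCBI's data structure; the function names are hypothetical.

```python
from collections import defaultdict


def build_lookup_table(query, w):
    """Map every w-mer of the query to the list of offsets where it occurs."""
    table = defaultdict(list)
    for q_off in range(len(query) - w + 1):
        table[query[q_off:q_off + w]].append(q_off)
    return table


def find_word_matches(query, database, w=11):
    """Return (query_offset, database_offset) pairs of exact w-mer matches."""
    table = build_lookup_table(query, w)
    matches = []
    for d_off in range(len(database) - w + 1):
        # .get() avoids inserting empty entries into the defaultdict.
        for q_off in table.get(database[d_off:d_off + w], ()):
            matches.append((q_off, d_off))
    return matches


# Short word length so the toy strings can match.
print(find_word_matches("ACGTAC", "TTACGTTT", w=4))  # [(0, 2)]
```

The table grows with the query, which is one plausible reason throughput falls as queries get larger: a bigger table fits less well in cache.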

Slide # 14: Firmware implementation - Stage 1
Bloom Filters: match ‘11-mers’ to the query, but generate false positives
Hash Lookup: eliminates false positives from the Bloom filters and obtains the offset in the query
Redundancy Eliminator: discards matches that are close to one another

Slide # 15: Bloom filters operation
Programming the query into the Bloom filter (processing query)
query → ‘11-mer’ → K hash functions → ‘m-bit’ vector

Slide # 16: Bloom filters operation
Finding matches in the database
database → ‘11-mer’ → K hash functions → ‘m-bit’ vector
1: Potential match
0: Not a match

Slide # 17: Bloom filters operation
Finding matches in the database
database → ‘11-mer’ → K hash functions → ‘m-bit’ vector
1*: Potential match
0: Not a match
* False positives are eliminated using a hash table
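The Bloom filter operation on the preceding slides can be sketched in software. The hash functions here are derived from SHA-256 purely for illustration (the hardware hash functions are not described in the slides), and the class, function, and parameter names are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits)  # one byte per bit, for clarity

    def _positions(self, word):
        """Yield k positions in the m-bit vector for this word."""
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos] = 1

    def might_contain(self, word):
        # All k bits set: potential match. Any bit clear: definitely not a match.
        return all(self.bits[pos] for pos in self._positions(word))


def stage1_matches(query, database, w=11, m_bits=1 << 16, k_hashes=4):
    """Program the query into the filter, then stream the database through it."""
    bloom = BloomFilter(m_bits, k_hashes)
    offsets = {}  # exact hash table: w-mer -> query offsets
    for q_off in range(len(query) - w + 1):
        word = query[q_off:q_off + w]
        bloom.add(word)
        offsets.setdefault(word, []).append(q_off)
    matches = []
    for d_off in range(len(database) - w + 1):
        word = database[d_off:d_off + w]
        if bloom.might_contain(word):
            # The exact lookup removes false positives and yields the query offset.
            for q_off in offsets.get(word, ()):
                matches.append((q_off, d_off))
    return matches


print(stage1_matches("ACGTAC", "TTACGTTT", w=4))  # [(0, 2)]
```

A Bloom filter never misses a word that was programmed into it, so the only errors are false positives, and those are cheap to remove because most database w-mers have already been rejected by the filter.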

Slide # 18: Bloom filter performance

Slide # 19: Performance analysis
Firmware vs. software, Stage 1
Query Size 10 Kbases 100 Kbases 1 Mbases Units
NCBI BLASTN Mbases/s
Mercury BLASTN Mbases/s
Speedup

Slide # 20: Overall system throughput
Query Size 10 Kbases 100 Kbases 1 Mbases Units
NCBI BLASTN Mbases/s
Mercury BLASTN Mbases/s
Speedup
Tput_overall = min(Tput_1, Tput_(2&3))
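The throughput formula says that when stage 1 and stages 2 & 3 run concurrently, the slower of the two sets the overall rate. A minimal sketch follows; the function name is hypothetical and the rates are made-up values chosen only to exercise the formula, not measurements from the slides.

```python
def overall_throughput(tput_stage1, tput_stages_2_3):
    """Return (throughput, bottleneck) for two concurrently running pipeline parts."""
    if tput_stage1 <= tput_stages_2_3:
        return tput_stage1, "stage 1"
    return tput_stages_2_3, "stages 2 & 3"


# Hypothetical rates in Mbases/s.
print(overall_throughput(100.0, 40.0))  # (40.0, 'stages 2 & 3')
```

Under this model, speeding up stage 1 alone helps only until stages 2 & 3 become the bottleneck.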

Slide # 21: Stage 2 in firmware - Throughput

Slide # 22: Stage 2 in firmware - Speedup

Slide # 23: Related work
Hardware based commercial systems
  – Paracel GeneMatcher™, uses an ASIC and hence is inflexible
  – RDisk, FPGA based system with throughput of 60 Mbases/s for stage 1
High-end commercial systems
  – Paracel BLASTMachine2™, 32 CPU Linux cluster
      2.93 Mbases/s for 2.8 Mbase query
      2 times faster than 1-node Mercury BLASTN
  – TimeLogic DeCypherBLAST™, FPGA based
      213 Kbases/s for a 16 Mbase query
      Comparable to 1-node Mercury BLASTN

Slide # 24: Conclusion
BLASTN on the Mercury system
  – Bloom filters to improve performance of stage 1
      Efficient hash functions in hardware
  – 7x improvement in speed with only stage 1 firmware
  – >50x speedup with stage 2 implemented in firmware
Future work
  – Algorithmic changes to stage 2
      Efficient use of hardware capabilities
  – Other apps: BLASTP, BLASTX, etc.

Slide # 25: Thank you