Database Allocation Strategies for Parallel BLAST Evaluation on Clusters Louis Woods.


Database Allocation Strategies for Parallel BLAST Evaluation on Clusters Louis Woods

24/05/2008 Database Allocation Strategies for Parallel BLAST

Outline
PART I
- What is BLAST?
- How does BLAST work?
- Some details about the BLAST algorithm
PART II
- Parallel BLAST: Motivation
- Parallel strategies
- Implementation issues
Q & A

Sequence Comparison
- DNA and protein sequences can be represented as finite strings over a restricted alphabet, e.g. {A, T, C, G}
- Given a query sequence, molecular biologists are interested in finding similar (evolutionarily related) sequences in a sequence database

BLAST – Basic Local Alignment Search Tool
- BLAST was developed in 1990 at the National Center for Biotechnology Information (NCBI)
- Due to its speed, BLAST has become the most popular tool for sequence comparison
- BLAST uses a heuristic approach, which makes it much faster than exact (optimal) alignment algorithms
- Trade-off: results are good (in terms of accuracy), but not guaranteed to be optimal

Sequence Database
- A sequence database is a text file
- Extract from the sequence database Homo_sapiens.NCBI36.48.dna.nonchromosomal.fa:

    >NT_ dna:supercontig supercontig::NT_113887:1:3994:1
    TTGTTTGTCCCACGACAGAGCTGGGCTGAATTATTAATGTGG
    ATTTTGTCCAACAATGGACTGAAAAGGGAGAAGCCCATGAAC
    TCTGTGAGGAGTGCATGACAGGTGCTCGTGAGATGA...
    >NT_ dna:supercontig supercontig::NT_113947:1:4262:1
    CATACATTTAATATACCCTCACCATACAGAATGTTCTTTCCC
    TATTACATAAGGAGTATATGTATTAAGCACTAAATCTTTGGA
    ATAATAAAAGACTATATTCATATTTGGTAACTTATT...
    >NT_ dna:supercontig supercontig::NT_113903:1:12854:1
    ACCAGTTCTCACAGGAACTAATAAGAGTGAGAGCTCACTCAC
    CACTGTGAGGAAGGCACCAAGATGTCCATGAGTGACCTGTCC
    CTAAGACCCAAAGACCTCCCATTAGGCCCCACCTCC...
    >NT_ dna:supercontig supercontig::NT_113908:1:13036:1
    CGGCCCTGCTGAGGCCGGGCATGGAGCTGGGGGTCAGGCCCT
    TCAGTCTCTTGAGGGTGTTCAGGACCACGTTCAGGTACCTGT
    TCTTGTTGGGGCTGCAGTCGTAGGCCACCTTCTCCT...
    >NT_ dna:supercontig supercontig::NT_113940:1:19187:1
    CTTTGCATTCAACGCACAGTGTTGAACCTTTCTTTGATAGTT
    CAGATTTGAAACACTCTTTTTGTAGAAACTGCAAGTGGATAA
    CTGCACTTCTTTGAGGCCTATCGTAGTAAAGGAAAT...
    >NT_ dna:supercontig supercontig::NT_113917:1:19840:1
    CCCCCAACTCTCGCCGGCTGTCCCTAGGAGTCCACGGGCCGT
    CCGGGGCCCCCCCCAGGCCTGGCGGGACCAGGATGCTGCCCT
    GTCACCTGCCCCCCAGCCCCACACAACGCCCCCACC...
    >NT_ dna:supercontig supercontig::NT_113963:1:24360:1
    TACCTTTCCTCGTTGTGCGTCCGGTCTTCTGAGCCATGTGCT
    AATATCCTGGGACTCTGTCTGGTTTTGTTTTTTTTTTTAACA
    TTGCCTAAATCATATTTTCCATTTAAAGAAATTTCTAA...

- A simplified analogy to BLAST from the SQL world:

    SELECT * FROM sequence_table WHERE sequence LIKE '%TGTTG%' ORDER BY $score;
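To make the "text file" view and the SQL analogy concrete, the following is a minimal Python sketch that reads FASTA-style records and returns the records whose sequence contains a given substring. The record names `seq_a` and `seq_b`, their shortened sequences, and the function names are hypothetical and chosen for illustration only. Like the SQL `LIKE` query, this is exact substring matching, not BLAST's scored, heuristic search.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into a mapping {header: sequence}.

    A header line starts with '>'; the sequence lines that follow are
    joined into one string, because a match may span a line break.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def naive_search(records: Dict[str, str], pattern: str) -> List[str]:
    """Return the headers of all records whose sequence contains `pattern`."""
    return [header for header, seq in records.items() if pattern in seq]


# Hypothetical two-record database (not the extract shown on the slide).
EXAMPLE = """\
>seq_a
CTTTGCATTCAACGCACAGTG
TTGAACC
>seq_b
CATACATTTAATATACCC
"""

if __name__ == "__main__":
    db = parse_fasta(EXAMPLE.splitlines())
    print(naive_search(db, "TGTTG"))
```

In `seq_a` the pattern `TGTTG` straddles the line break, which is why the sequence lines are concatenated before searching.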

BLAST Search
[Figure: sequences from a query set are compared with sequences from the sequence database; each comparison yields a score (e.g. Score = 3, Score = 6), which is checked against the threshold S = 4]

Scoring Matrices
- Scoring matrix: 4x4 matrix (DNA) or 20x20 matrix (protein)
- The matrices are assumed to be given and are not discussed further in this talk
- Score of an alignment: sum of the pairwise scores
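As an illustration of "score of an alignment = sum of pairwise scores", here is a small Python sketch. The talk does not give its scoring matrix, so the 4x4 DNA matrix below (match = 5, mismatch = -4) is an assumption made for this example, and the function names are hypothetical.

```python
from itertools import product
from typing import Dict, Tuple

ALPHABET = "ACGT"

# Assumed values for illustration only; the talk treats the matrix as given.
MATCH = 5
MISMATCH = -4


def build_matrix(match: int = MATCH, mismatch: int = MISMATCH) -> Dict[Tuple[str, str], int]:
    """Build a 4x4 DNA scoring matrix as a dict keyed by (base, base)."""
    return {
        (a, b): (match if a == b else mismatch)
        for a, b in product(ALPHABET, repeat=2)
    }


def alignment_score(x: str, y: str, matrix: Dict[Tuple[str, str], int]) -> int:
    """Score an ungapped alignment of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment requires equal lengths")
    return sum(matrix[(a, b)] for a, b in zip(x, y))


if __name__ == "__main__":
    m = build_matrix()
    # Four matches and two mismatches: 4 * 5 + 2 * (-4)
    print(alignment_score("ACTTAG", "AGTTAC", m))
```

A protein matrix would have the same shape with a 20-letter alphabet and non-uniform entries.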

BLAST Algorithm - Phase 1
- Compile a list of high-scoring words: break the query sequence into words of length w, then add all w-length words that score above some threshold T when aligned with a word of the query
- Example: query sequence = ACTTAG, w = 3, T = 12
  - Words of the query: ACT, CTT, TTA, TAG
  - Additional high-scoring words: ACC, TTT
[Figure: pairwise alignment of a candidate word with a query word, Score = 13]
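Phase 1 can be sketched in Python as follows. This is a brute-force illustration, not NCBI's implementation: it enumerates every possible w-length word and keeps those scoring above T against some query word. The uniform scores (match = 5, mismatch = -4) are an assumption, so the resulting list differs from the slide's example, which uses a matrix that is not shown.

```python
from itertools import product
from typing import List, Set

ALPHABET = "ACGT"


def pair_score(a: str, b: str, match: int = 5, mismatch: int = -4) -> int:
    """Assumed uniform scoring; the talk's real matrix is not given."""
    return match if a == b else mismatch


def word_score(u: str, v: str) -> int:
    """Sum of pairwise scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(u, v))


def query_words(query: str, w: int) -> List[str]:
    """All overlapping words of length w in the query, in order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def high_scoring_words(query: str, w: int, T: int) -> Set[str]:
    """Every w-length word scoring above T against at least one query word."""
    words: Set[str] = set()
    for qw in query_words(query, w):
        for candidate in map("".join, product(ALPHABET, repeat=w)):
            if word_score(qw, candidate) > T:
                words.add(candidate)
    return words


if __name__ == "__main__":
    print(query_words("ACTTAG", 3))
    print(sorted(high_scoring_words("ACTTAG", 3, T=12)))
```

Under these assumed scores, T = 12 admits only the exact query words (each scores 15), while a lower threshold such as T = 5 also admits one-mismatch neighbors like ACC and TTT (each scores 6).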

BLAST Algorithm - Phase 2
- The sequence database is searched for sequences that contain exact matches to words from the phase 1 list
- List from phase 1: {ACT, ACC, CTT, TTA, TTT, TAG}
- Example: matching database sequence = AGTTAC (contains the word TTA)
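A minimal Python sketch of this scan, using the word list and the database sequence from the slide. The function name `find_hits` is hypothetical; storing the words in a set makes each window lookup a constant-time operation on average.

```python
from typing import List, Set, Tuple


def find_hits(words: Set[str], db_sequence: str, w: int) -> List[Tuple[str, int]]:
    """Slide a window of length w over the database sequence and
    report (word, position) for every window found in the word list."""
    hits: List[Tuple[str, int]] = []
    for pos in range(len(db_sequence) - w + 1):
        window = db_sequence[pos:pos + w]
        if window in words:
            hits.append((window, pos))
    return hits


if __name__ == "__main__":
    phase1_words = {"ACT", "ACC", "CTT", "TTA", "TTT", "TAG"}
    print(find_hits(phase1_words, "AGTTAC", 3))
```

For AGTTAC the windows are AGT, GTT, TTA and TAC, so the only hit is TTA at position 2 (0-based).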

BLAST Algorithm - Phase 3
- Extend hits: for every pair (word, match), extend the pair in both directions until a score >= S (a second threshold) is reached
- All sequences with a score >= S are listed in the final results
- Example:

    query sequence    = ACTTAG
    database sequence = AGTTAC

    hit:        TTA        extension:   ACTTAG
                |||                     | |||
                TTA                     AGTTAC

  The extension adds the pairs C|G, G|C and A|A; Score = 12
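The extension step as worded on the slide can be sketched in Python as below: starting from the hit, one position is added on each side per round until the score reaches S or both sequences are exhausted. The scores (match = 5, mismatch = -4) and the function name are assumptions for illustration. Production BLAST implementations end the extension with a score drop-off rule instead, which this sketch does not model.

```python
from typing import Tuple


def extend_hit(query: str, db: str, q_pos: int, d_pos: int, w: int, S: int,
               match: int = 5, mismatch: int = -4) -> Tuple[int, str, str]:
    """Ungapped extension of a word hit in both directions.

    Returns (score, aligned query segment, aligned database segment).
    """
    def s(a: str, b: str) -> int:
        return match if a == b else mismatch

    left_q, left_d = q_pos, d_pos            # inclusive left ends
    right_q, right_d = q_pos + w, d_pos + w  # exclusive right ends
    score = sum(s(query[q_pos + k], db[d_pos + k]) for k in range(w))

    while score < S:
        extended = False
        if left_q > 0 and left_d > 0:                    # extend to the left
            left_q -= 1
            left_d -= 1
            score += s(query[left_q], db[left_d])
            extended = True
        if right_q < len(query) and right_d < len(db):   # extend to the right
            score += s(query[right_q], db[right_d])
            right_q += 1
            right_d += 1
            extended = True
        if not extended:                                  # nothing left to add
            break
    return score, query[left_q:right_q], db[left_d:right_d]


if __name__ == "__main__":
    # Hit TTA at position 2 in both sequences (0-based), word length 3.
    print(extend_hit("ACTTAG", "AGTTAC", 2, 2, 3, S=20))
```

With an unreachable threshold such as S = 20 the hit is extended over the full sequences and scores 12 under the assumed values; with S = 12 the initial hit (score 15) already qualifies and no extension takes place.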

Why Parallelize BLAST?
- The growth of sequence databases outpaces Moore's Law
- BLAST alone is no longer fast enough
Source:

Parallel Environment
- Hardware: distributed memory architecture (workstation cluster)
- Parallel interface: Message Passing Interface (MPI)
- Set-up: master-slave

Parallel Strategies
- Advantages of clusters:
  (1) a lot of memory available
  (2) parallelize the data, not the BLAST algorithm
- Focus: how can the data be distributed?
  - Partition the query set: replicated approach
  - Partition the sequence database: fragmented approach
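The difference between the two strategies can be shown with a short Python sketch. It only builds the work assignment per slave and involves no MPI; the round-robin split and all names are assumptions made for illustration.

```python
from typing import List, Tuple

Plan = List[Tuple[List[str], List[str]]]  # per slave: (queries, database sequences)


def replicated_plan(queries: List[str], database: List[str], n_slaves: int) -> Plan:
    """Replicated approach: every slave holds the whole database
    and receives a share of the query set."""
    return [(queries[i::n_slaves], database) for i in range(n_slaves)]


def fragmented_plan(queries: List[str], database: List[str], n_slaves: int) -> Plan:
    """Fragmented approach: every slave receives all queries
    and holds one fragment of the database."""
    return [(queries, database[i::n_slaves]) for i in range(n_slaves)]


def covered_pairs(plan: Plan) -> List[Tuple[str, str]]:
    """All (query, sequence) comparisons performed under a plan."""
    return sorted((q, s) for qs, seqs in plan for q in qs for s in seqs)


if __name__ == "__main__":
    queries = ["q1", "q2", "q3"]
    database = ["s1", "s2", "s3", "s4"]
    print(replicated_plan(queries, database, 2))
    print(fragmented_plan(queries, database, 2))
```

Both plans cover every (query, sequence) pair exactly once; they differ in which data each slave has to hold and in how the results must be combined afterwards.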

Replicated Approach
[Figure] Source: [1] Database allocation strategies for parallel BLAST evaluation on clusters

Replicated Approach: Analysis
Advantages:
- Very simple and easy to implement
  - Result merging: concatenate the results from the slaves
  - Low communication overhead: for a query set of m sequences, m messages have to be sent to the slaves
  - Linear speed-up (1 h 7 min -> 2 min 20 sec with 30 slaves)
Disadvantages:
- Doesn't solve the database size problem (e.g. when the database doesn't fit into main memory)
- Load balancing depends on the composition of the query set
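The "linear speed-up" claim can be checked with a little arithmetic, sketched here in Python. It assumes that 1 h 7 min is the runtime without parallelization, which the slide does not state explicitly; the function names are hypothetical.

```python
def to_seconds(hours: int = 0, minutes: int = 0, seconds: int = 0) -> int:
    """Convert a wall-clock duration to seconds."""
    return 3600 * hours + 60 * minutes + seconds


def speedup(t_serial: float, t_parallel: float) -> float:
    """Speed-up = serial runtime / parallel runtime."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, n_slaves: int) -> float:
    """Parallel efficiency = speed-up / number of slaves (1.0 is ideal)."""
    return speedup(t_serial, t_parallel) / n_slaves


if __name__ == "__main__":
    t1 = to_seconds(hours=1, minutes=7)       # 4020 s
    t30 = to_seconds(minutes=2, seconds=20)   # 140 s
    print(round(speedup(t1, t30), 2), round(efficiency(t1, t30, 30), 3))
```

Under that assumption the speed-up is 4020 / 140, roughly 28.7 with 30 slaves, an efficiency of about 0.96, which is close to linear.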

Fragmented Approach
[Figure] Source: [1] Database allocation strategies for parallel BLAST evaluation on clusters

Fragmented Approach: Analysis
Advantages:
- Attacks the problem of exponential sequence database growth
- Superlinear speed-up with large databases (the fragments together keep the whole database in main memory)
- Load balancing is independent of the query set
Disadvantages:
- Partitioning strategy required
- Result merging phase is more complicated
- Higher communication overhead (a query set of m sequences is sent to each of n slaves)
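Why result merging is more involved here than simple concatenation can be seen in a small Python sketch: each slave reports hits from its own fragment, and the master has to interleave them by score. The hit format `(score, sequence_id)`, the `top_k` cut-off and the function name are assumptions for illustration.

```python
import heapq
from itertools import islice
from typing import List, Tuple

Hit = Tuple[int, str]  # (score, sequence id)


def merge_results(per_slave_hits: List[List[Hit]], top_k: int) -> List[Hit]:
    """Merge per-fragment hit lists into one list ordered by
    descending score and keep only the best `top_k` hits."""
    ordered = [
        sorted(hits, key=lambda h: h[0], reverse=True)
        for hits in per_slave_hits
    ]
    # heapq.merge interleaves already-sorted inputs lazily.
    merged = heapq.merge(*ordered, key=lambda h: h[0], reverse=True)
    return list(islice(merged, top_k))


if __name__ == "__main__":
    hits = [
        [(12, "s1"), (6, "s3")],   # slave 1
        [(9, "s2")],               # slave 2
        [(15, "s4"), (3, "s5")],   # slave 3
    ]
    print(merge_results(hits, top_k=3))
```

In the replicated approach this step is unnecessary, because each slave already sees the whole database for its queries.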

Static Partitioning
- Prior to executing BLAST, a program computes the fragments of the sequence database
- The slaves then copy their fragment from the shared storage to their local storage
[Figure: the master and slaves 1-4 connected to a shared storage that holds the sequence database]

Disadvantages of Static Partitioning
- A large number of smaller files is created, which are harder to manage, migrate, and share
- The database needs to be re-partitioned if the number of slaves in a parallel BLAST search exceeds the number of existing fragments
- Creating a large number of fragments in advance, to allow runs on different numbers of slaves, is not a good option because of:
  (1) I/O overhead
  (2) Result merging and retrieval overhead

Dynamic Partitioning with Parallel I/O
- Instead of physically creating the fragment files, only the offsets are calculated
- The offsets are distributed over the network to the slaves
- The slaves read sequences directly from shared storage via parallel I/O into their main memory
[Figure: the master communicates the file offsets [0,29], [30,59], [60,89], [90,117] to slaves 1-4, one range per slave; each slave reads its range of the sequence database from shared storage]
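The offset calculation can be sketched in Python as follows. The sketch splits a range of `total` units evenly and reads one range from a file with an ordinary seek; the function names are hypothetical. A real implementation would presumably align the boundaries with sequence record boundaries and use a parallel I/O layer such as MPI-IO, neither of which is modeled here.

```python
from typing import List, Tuple


def fragment_ranges(total: int, n_slaves: int) -> List[Tuple[int, int]]:
    """Split [0, total - 1] into at most n_slaves contiguous,
    inclusive (start, end) ranges of nearly equal size."""
    if total <= 0 or n_slaves <= 0:
        raise ValueError("total and n_slaves must be positive")
    size = (total + n_slaves - 1) // n_slaves  # ceiling division
    ranges: List[Tuple[int, int]] = []
    start = 0
    while start < total:
        end = min(start + size, total) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def read_fragment(path: str, start: int, end: int) -> bytes:
    """Read the inclusive byte range [start, end] from a shared file."""
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(end - start + 1)


if __name__ == "__main__":
    print(fragment_ranges(118, 4))
```

For 118 units and 4 slaves this yields the ranges shown in the figure: (0, 29), (30, 59), (60, 89), (90, 117).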

Load Balancing – Adaptive Distribution
- Dynamic adaptive approach: initially, only part of the database is distributed (e.g. only 70%)
- When the first slave has processed all queries on its fragment and reports back to the master, the master assigns a new fragment to that slave
- The master gradually decreases the size of the fragments until the query has been processed on the entire database
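One way to produce such a sequence of shrinking fragments is sketched below in Python. The slide does not specify the rule by which fragment sizes decrease; the rule used here (each new fragment is the remaining work divided by the number of slaves, with a lower bound `min_size`) is an assumption borrowed from guided self-scheduling, and all names and default values are hypothetical.

```python
from typing import List


def adaptive_fragments(total: int, n_slaves: int,
                       initial_percent: int = 70, min_size: int = 10) -> List[int]:
    """Return fragment sizes in the order the master hands them out.

    The first n_slaves fragments share `initial_percent` of the database
    equally; every later fragment is ceil(remaining / n_slaves), but not
    smaller than min_size (except for the very last remainder).
    """
    base = (total * initial_percent // 100) // n_slaves
    sizes = [base] * n_slaves
    remaining = total - base * n_slaves
    while remaining > 0:
        chunk = max(min_size, (remaining + n_slaves - 1) // n_slaves)
        chunk = min(chunk, remaining)
        sizes.append(chunk)
        remaining -= chunk
    return sizes


if __name__ == "__main__":
    print(adaptive_fragments(1000, 4))
```

For 1000 sequences and 4 slaves the initial fragments hold 175 sequences each, and the later ones shrink from 75 down to a final remainder of 9. Small late fragments let fast slaves pick up leftover work, so that all slaves finish at about the same time.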

Result Merging Phase: Problem
- Problem: the master has to regroup and filter the result sequences from the slaves
- The master waits for all of the results and then generates the global output file itself
- The slaves are idle and don't make use of their I/O capacity in the result merging phase

Result Merging Phase: Solution
- Slaves cache their results and send only the information relevant for filtering to the master
- The master does the filtering and informs the slaves which results to write at which offset in the global output file
- Slaves use parallel I/O and write the global output file simultaneously
- No I/O bottleneck at the master
- Communication overhead from the slaves is reduced
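The master's part of this scheme amounts to a filter followed by a running sum of sizes, as the Python sketch below shows. The metadata format `(slave_id, result_id, score, n_bytes)`, the `top_k` filter and the function name are assumptions for illustration; the talk does not specify what the slaves send.

```python
from typing import Dict, List, Tuple

Meta = Tuple[int, str, int, int]  # (slave id, result id, score, size in bytes)


def plan_writes(metadata: List[Meta], top_k: int) -> Tuple[Dict[int, List[Tuple[str, int]]], int]:
    """Select the top_k results by score and assign each a byte offset
    in the global output file, in descending score order.

    Returns ({slave_id: [(result_id, offset), ...]}, total_file_size).
    """
    selected = sorted(metadata, key=lambda m: m[2], reverse=True)[:top_k]
    plan: Dict[int, List[Tuple[str, int]]] = {}
    offset = 0
    for slave_id, result_id, _score, n_bytes in selected:
        plan.setdefault(slave_id, []).append((result_id, offset))
        offset += n_bytes  # next result starts where this one ends
    return plan, offset


if __name__ == "__main__":
    meta = [(1, "a", 12, 100), (2, "b", 15, 80), (1, "c", 3, 50), (3, "d", 9, 120)]
    print(plan_writes(meta, top_k=3))
```

Only the small metadata tuples travel to the master; the result text itself stays on the slaves until each one writes its selected results at the assigned offsets.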

Bottleneck Master
[Figure: a single master serving twelve slaves (S)]

Introducing a Supermaster
[Figure: a supermaster, master and twelve slaves (S)]

Blue Gene/L
- The world's top supercomputer at the time
- Parallel BLAST was implemented on a cluster with 4096 nodes
- To utilize as many nodes as possible, the fragmented approach was combined with the replicated approach

Replicated & Fragmented Approach
[Figure: supermaster, master and twelve slaves (S); the levels of the hierarchy are labeled "Fragmented Approach" and "Replicated Approach"]
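One plausible reading of the combined scheme, sketched in Python: the slaves are divided into groups, the query set is partitioned across the groups (replicated approach between groups), and within each group the database is fragmented across the group's slaves (fragmented approach inside a group). This grouping is an assumption; the slide shows only the labeled hierarchy. All names are hypothetical.

```python
from typing import Dict, List, Tuple


def combined_plan(queries: List[str], n_groups: int,
                  fragments_per_group: int) -> Dict[Tuple[int, int], Tuple[List[str], int]]:
    """Map each slave, identified by (group, fragment index), to its work:
    the group's share of the queries and the database fragment it holds."""
    plan: Dict[Tuple[int, int], Tuple[List[str], int]] = {}
    for group in range(n_groups):
        group_queries = queries[group::n_groups]   # replicated across groups
        for fragment in range(fragments_per_group):
            plan[(group, fragment)] = (group_queries, fragment)  # fragmented within a group
    return plan


if __name__ == "__main__":
    plan = combined_plan(["q1", "q2", "q3", "q4"], n_groups=2, fragments_per_group=3)
    for slave, work in sorted(plan.items()):
        print(slave, work)
```

With this layout the number of usable nodes is the product of groups and fragments per group, so neither the query count nor the fragment count alone limits how many nodes can be kept busy.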

Summary
- BLAST: popular tool for sequence comparison (good performance)
- Problem: exponential growth of sequence databases
- Solution: parallelization, distribute the data
  - Replicated approach
  - Fragmented approach
- Implementation issues: partitioning, load balancing, I/O bottleneck, message overhead
- MPI & BLAST = a good solution

References
1. R. de Carvalho Costa and S. Lifschitz, "Database allocation strategies for parallel BLAST evaluation on clusters", Distributed and Parallel Databases, 13(1).
2. H. Lin, X. Ma, P. Chandramohan, A. Geist, and N. Samatova, "Efficient Data Access for Parallel BLAST", Proc. 19th Int'l Parallel and Distributed Processing Symp. (IPDPS '05).
3. The European Bioinformatics Institute (EBI).
4. Karl Jiang, Oystein Thorsen, Amanda Peters, Brian Smith and Carlos P. Sosa, "An Efficient Parallel Implementation of the Hidden Markov Methods for Genomic Sequence Search on a Massively Parallel System".
5. A. E. Darling, L. Carey, and W. Feng, "The design, implementation, and evaluation of mpiBLAST", in Proceedings of ClusterWorld and Expo.
6. Huzefa Rangwala, Eric Lantz, Roy Musselman, Kurt Pinnow, Brian Smith, Brian Wallenfelt, "Massively Parallel BLAST for the Blue Gene/L", presented at the High Availability and Performance Computing Workshop held in conjunction with the 6th Los Alamos Computer Science Institute Symposium, Santa Fe, New Mexico (October 2005).

Questions