Published by Constance Caldwell. Modified over 8 years ago.
1
Database Allocation Strategies for Parallel BLAST Evaluation on Clusters
Louis Woods
2
24/05/2008 Database Allocation Strategies for Parallel BLAST
Outline
PART I
  What is BLAST?
  How does BLAST work?
  Some details about the BLAST algorithm
PART II
  Parallel BLAST: Motivation
  Parallel strategies
  Implementation issues
Q & A
3
Sequence Comparison
DNA and protein sequences can be represented as finite strings over a restricted alphabet, e.g. {A, T, C, G}
Given a query sequence, molecular biologists are interested in finding similar (evolutionarily related) sequences in a sequence database
4
BLAST – Basic Local Alignment Search Tool
BLAST was developed in 1990 by the National Center for Biotechnology Information (NCBI)
Due to its speed, BLAST has become the most popular tool for sequence comparison
BLAST uses a heuristic approach, which makes it a lot faster than algorithms that guarantee optimal alignments
Trade-off: results are good (in terms of accuracy), but not optimal
5
Sequence Database
Extract from the sequence database Homo_sapiens.NCBI36.48.dna.nonchromosomal.fa:

>NT_113887 dna:supercontig supercontig::NT_113887:1:3994:1
TTGTTTGTCCCACGACAGAGCTGGGCTGAATTATTAATGTGG
ATTTTGTCCAACAATGGACTGAAAAGGGAGAAGCCCATGAAC
TCTGTGAGGAGTGCATGACAGGTGCTCGTGAGATGA...
>NT_113947 dna:supercontig supercontig::NT_113947:1:4262:1
CATACATTTAATATACCCTCACCATACAGAATGTTCTTTCCC
TATTACATAAGGAGTATATGTATTAAGCACTAAATCTTTGGA
ATAATAAAAGACTATATTCATATTTGGTAACTTATT...
>NT_113903 dna:supercontig supercontig::NT_113903:1:12854:1
ACCAGTTCTCACAGGAACTAATAAGAGTGAGAGCTCACTCAC
CACTGTGAGGAAGGCACCAAGATGTCCATGAGTGACCTGTCC
CTAAGACCCAAAGACCTCCCATTAGGCCCCACCTCC...
>NT_113908 dna:supercontig supercontig::NT_113908:1:13036:1
CGGCCCTGCTGAGGCCGGGCATGGAGCTGGGGGTCAGGCCCT
TCAGTCTCTTGAGGGTGTTCAGGACCACGTTCAGGTACCTGT
TCTTGTTGGGGCTGCAGTCGTAGGCCACCTTCTCCT...
>NT_113940 dna:supercontig supercontig::NT_113940:1:19187:1
CTTTGCATTCAACGCACAGTGTTGAACCTTTCTTTGATAGTT
CAGATTTGAAACACTCTTTTTGTAGAAACTGCAAGTGGATAA
CTGCACTTCTTTGAGGCCTATCGTAGTAAAGGAAAT...
>NT_113917 dna:supercontig supercontig::NT_113917:1:19840:1
CCCCCAACTCTCGCCGGCTGTCCCTAGGAGTCCACGGGCCGT
CCGGGGCCCCCCCCAGGCCTGGCGGGACCAGGATGCTGCCCT
GTCACCTGCCCCCCAGCCCCACACAACGCCCCCACC...
>NT_113963 dna:supercontig supercontig::NT_113963:1:24360:1
TACCTTTCCTCGTTGTGCGTCCGGTCTTCTGAGCCATGTGCT
AATATCCTGGGACTCTGTCTGGTTTTGTTTTTTTTTTTAACA
TTGCCTAAATCATATTTTCCATTTAAAGAAATTTCTAA...

A sequence database is a text file
A simplified analogy to BLAST from the SQL world:

    SELECT *
    FROM sequence_table
    WHERE sequence LIKE '%TGTTG%'
    ORDER BY $score;
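Since the sequence database is a plain text file in FASTA format (a ">" header line followed by sequence lines), reading it takes only a few lines of code. The following is a minimal sketch, not part of the original talk; the function name read_fasta and the shortened two-record sample are illustrative assumptions.

```python
from typing import Iterable, Iterator, Tuple

def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical sample: two records with shortened headers and sequences
sample = """>NT_113887 dna:supercontig
TTGTTTGTCC
CACGACAGAG
>NT_113947 dna:supercontig
CATACATTTA
"""
records = list(read_fasta(sample.splitlines()))
```

Each record joins its sequence lines into one string, which is the form the later search phases operate on.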
6
BLAST Search
[Figure: sequences from the query set are compared against sequences in the sequence database; the three example comparisons score 5, 3, and 6, with threshold S = 4]
7
Scoring Matrices
Scoring matrix: 4x4 matrix (DNA) or 20x20 matrix (protein)
These matrices are assumed to be given and are not discussed further in this talk
Score of an alignment: sum of the pairwise scores
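The rule "score of an alignment = sum of pairwise scores" can be sketched as follows. The match and mismatch values (+5 and -3) are illustrative assumptions standing in for a real 4x4 matrix; they are not the matrix used in the talk.

```python
# Illustrative scores only; a real run would look up a full 4x4 or 20x20 matrix
MATCH, MISMATCH = 5, -3

def pair_score(a: str, b: str) -> int:
    """Score for aligning character a with character b."""
    return MATCH if a == b else MISMATCH

def alignment_score(x: str, y: str) -> int:
    """Score of an ungapped alignment: the sum of the pairwise scores."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(a, b) for a, b in zip(x, y))

# Four matching columns and two mismatching columns: 4 * 5 + 2 * (-3)
score = alignment_score("ACTTAG", "AGTTAC")
```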
8
BLAST Algorithm - Phase 1
Compile a list of high-scoring words: break the query sequence into words of length w and add all w-length words that score above some threshold T when compared with a query word.
Example: query sequence = ACTTAG, w = 3, T = 12
Words of the query: ACT, CTT, TTA, TAG
Additional high-scoring words: ACC, TTT
Scoring ACT against ACC: A|A = 5, C|C = 9, T|C = -1, so Score = 5 + 9 - 1 = 13
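Phase 1 can be sketched by enumerating all w-length words and keeping those that score above T against at least one query word. The scores (+5 match, -3 mismatch) and the function names are assumptions for illustration; with the matrix used on the slide the list additionally contains ACC and TTT, which this simplified scoring does not reproduce at T = 12.

```python
from itertools import product

ALPHABET = "ACGT"
MATCH, MISMATCH = 5, -3  # illustrative scores, not the matrix used on the slide

def word_score(u: str, v: str) -> int:
    """Sum of pairwise scores of two equal-length words."""
    return sum(MATCH if a == b else MISMATCH for a, b in zip(u, v))

def high_scoring_words(query: str, w: int, T: int) -> set:
    """Return every w-length word scoring above T against some query word."""
    query_words = {query[i:i + w] for i in range(len(query) - w + 1)}
    candidates = ("".join(p) for p in product(ALPHABET, repeat=w))
    return {c for c in candidates
            if any(word_score(c, q) > T for q in query_words)}

words = high_scoring_words("ACTTAG", w=3, T=12)
```

Lowering T admits more neighbouring words, which makes the search more sensitive but increases the number of hits to process in the later phases.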
9
BLAST Algorithm - Phase 2
The sequence database is searched for sequences with exact matches to words from the list from phase 1.
List from phase 1: {ACT, ACC, CTT, TTA, TTT, TAG}
Example: the database sequence AGTTAC matches the word TTA
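A minimal sketch of the phase 2 scan, assuming the database is held as a dictionary from sequence id to sequence string. The ids db1 and db2 and the second sequence are hypothetical; a real implementation would use an index over the word list rather than repeated substring searches.

```python
def find_hits(word_list, database):
    """Return (sequence_id, position, word) for every exact word occurrence."""
    hits = []
    for seq_id, seq in database.items():
        for word in sorted(word_list):
            start = seq.find(word)
            while start != -1:
                hits.append((seq_id, start, word))
                start = seq.find(word, start + 1)  # also find overlapping hits
    return hits

word_list = {"ACT", "ACC", "CTT", "TTA", "TTT", "TAG"}  # list from phase 1
database = {"db1": "AGTTAC", "db2": "GGGCGG"}            # hypothetical toy database
hits = find_hits(word_list, database)
```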
10
BLAST Algorithm - Phase 3
Extend hits: for every pair (word, match), extend the pair in both directions until a score >= S (a second threshold) is reached. All sequences with score >= S will be listed in the final results.
Example: query sequence = ACTTAG, database sequence = AGTTAC
Hit: TTA|TTA = 13
Extension: C|G = -3, G|C = -3, A|A = 5
Score = 13 - 3 - 3 + 5 = 12
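The extension step as described on the slide can be sketched like this. It is a simplified illustration under assumed scores (+5 match, -3 mismatch), so the numbers differ from the slide's example; production BLAST implementations use a different stopping rule that ends an extension once the score drops too far below the best score seen.

```python
MATCH, MISMATCH = 5, -3  # illustrative scores, not the matrix used on the slide

def pair_score(a, b):
    return MATCH if a == b else MISMATCH

def extend_hit(query, subject, q_pos, s_pos, w, S):
    """Extend a w-length hit in both directions until the score reaches S
    or neither side can be extended further. Returns (score, reached_S)."""
    score = sum(pair_score(query[q_pos + i], subject[s_pos + i]) for i in range(w))
    left, right = 1, w  # distance of the next unused column on each side
    while score < S:
        extended = False
        if q_pos - left >= 0 and s_pos - left >= 0:
            score += pair_score(query[q_pos - left], subject[s_pos - left])
            left += 1
            extended = True
        if score < S and q_pos + right < len(query) and s_pos + right < len(subject):
            score += pair_score(query[q_pos + right], subject[s_pos + right])
            right += 1
            extended = True
        if not extended:
            break  # both sequences are exhausted on both sides
    return score, score >= S

# The hit TTA starts at index 2 in both ACTTAG and AGTTAC
result_high_threshold = extend_hit("ACTTAG", "AGTTAC", 2, 2, 3, S=20)
result_low_threshold = extend_hit("ACTTAG", "AGTTAC", 2, 2, 3, S=12)
```

With S = 20 the extension consumes all six columns without reaching the threshold; with S = 12 the initial hit already satisfies it and no extension is needed.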
11
Why Parallelize BLAST?
The growth of sequence databases outpaces Moore's Law
BLAST alone is not fast enough anymore
Source: http://www.expasy.org/sprot/relnotes/relstat.html
12
Parallel Environment
Hardware: distributed memory architecture (workstation cluster)
Parallel interface: Message Passing Interface (MPI)
Set-up: Master-Slave
13
Parallel Strategies
Advantages of clusters:
  (1) a lot of memory available
  (2) parallelize data, not the BLAST algorithm
Focus: How can the data be distributed?
  Partition query set: replicated approach
  Partition sequence database: fragmented approach
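The two strategies differ only in what is partitioned. A minimal sketch, assuming a simple round-robin split (the talk does not prescribe a particular splitting rule) and hypothetical query and sequence names:

```python
def round_robin(items, n_slaves):
    """Deal items out to n_slaves in turn; returns {slave: [items]}."""
    return {s: items[s::n_slaves] for s in range(n_slaves)}

queries = ["q1", "q2", "q3", "q4", "q5"]
database = ["s1", "s2", "s3", "s4", "s5", "s6"]

# Replicated approach: every slave holds the whole database,
# and the query set is partitioned.
replicated = round_robin(queries, 2)

# Fragmented approach: every slave receives all queries,
# and the sequence database is partitioned.
fragmented = round_robin(database, 3)
```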
14
Replicated Approach
Source: [1] Database allocation strategies for parallel BLAST evaluation on clusters
15
Replicated Approach: Analysis
Advantages:
  Very simple and easy to implement
  Result merging: concatenate results from slaves
  Low communication overhead: for a query set of m sequences, m messages have to be sent to the slaves
  Linear speed-up (1 h 7 min -> 2 min 20 s with 30 slaves)
Disadvantages:
  Doesn't solve the database size problem (e.g. when the database doesn't fit into main memory)
  Load balancing depends on the composition of the query set
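The quoted timings can be turned into a speed-up and a parallel efficiency. This small calculation assumes that 1 h 7 min is the run time without parallelization, which the slide implies but does not state explicitly.

```python
serial_seconds = 1 * 3600 + 7 * 60        # 1 h 7 min = 4020 s
parallel_seconds = 2 * 60 + 20            # 2 min 20 s = 140 s
slaves = 30

speedup = serial_seconds / parallel_seconds   # about 28.7
efficiency = speedup / slaves                 # about 0.96
```

A speed-up of roughly 28.7 on 30 slaves is close to the ideal value of 30, which is what the slide summarizes as linear.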
16
Fragmented Approach
Source: [1] Database allocation strategies for parallel BLAST evaluation on clusters
17
Fragmented Approach: Analysis
Advantages:
  Attacks the problem of exponential sequence database growth
  Superlinear speed-up with large databases (the whole database fits in main memory)
  Load balancing independent of the query set
Disadvantages:
  Partitioning strategy required
  Result merging phase more complicated
  Higher communication overhead (query set of m sequences sent to n slaves)
18
Static Partitioning
Prior to executing BLAST, a program computes the fragments of the sequence database
The slaves then copy their fragment from the shared storage to their local storage
[Figure: master, shared storage holding the sequence database, and slaves 1 to 4]
19
Disadvantages of Static Partitioning
A large number of smaller files are created, which are harder to manage, migrate, and share
The database needs to be re-partitioned if the number of slaves in a parallel BLAST search exceeds the number of existing fragments
Creating a large number of fragments in advance, to allow runs on different numbers of slaves, is not a good option:
  (1) I/O overhead
  (2) Overhead for merging and retrieving results
20
Dynamic Partitioning with Parallel I/O
Instead of physically computing the fragment files, only the offsets are calculated
Offsets are distributed over the network to the slaves
Slaves read sequences directly from shared storage via parallel I/O into their main memory
[Figure: the master communicates the file offsets [0,29], [30,59], [60,89], [90,117] to slaves 1 to 4; each slave reads its range of the sequence database from the shared storage]
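Computing offsets instead of fragment files can be sketched as below. The total size of 118 is an assumption read off the last range in the figure. The sketch splits purely by position; a real implementation would presumably move each boundary to the start of a sequence so that no sequence is cut in half.

```python
def fragment_offsets(file_size, n_slaves):
    """Split [0, file_size - 1] into at most n_slaves contiguous ranges."""
    chunk = -(-file_size // n_slaves)  # ceiling division
    ranges = []
    for s in range(n_slaves):
        start = s * chunk
        end = min(start + chunk, file_size) - 1
        if start <= end:  # skip empty ranges when slaves outnumber chunks
            ranges.append((start, end))
    return ranges

offsets = fragment_offsets(118, 4)
```

Because only these pairs of numbers travel over the network, changing the number of slaves needs no re-partitioning of the database file.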
21
Load Balancing – Adaptive Distribution
Dynamic adaptive approach: initially, not the whole database is distributed (e.g. only 70%)
When the first slave has processed all queries on its fragment and reports back to the master, the master assigns a new fragment to that slave
The master gradually decreases the size of the fragments until the query has been processed on the entire database
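One possible schedule of fragment sizes for the adaptive approach is sketched below. The shrinking rule (each new fragment is the remaining part divided by twice the number of slaves, with a minimum size) is an assumption chosen for illustration; the talk only says that fragment sizes decrease gradually.

```python
def adaptive_fragments(total, n_slaves, initial_percent=70, min_size=10):
    """Return the fragment sizes in the order the master hands them out."""
    # Initial round: an equal share of the first initial_percent of the database
    first = total * initial_percent // 100 // n_slaves
    sizes = [first] * n_slaves
    remaining = total - first * n_slaves
    # Later fragments shrink as the remaining part of the database shrinks
    while remaining > 0:
        size = max(min_size, remaining // (2 * n_slaves))
        size = min(size, remaining)
        sizes.append(size)
        remaining -= size
    return sizes

sizes = adaptive_fragments(total=1000, n_slaves=4)
```

Small fragments near the end let fast slaves pick up extra work, so that all slaves finish at roughly the same time.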
22
Result Merging Phase: Problem
Problem: the master has to regroup and filter result sequences from slaves
Master waits for all of the results and then generates the global output file itself
Slaves are idle and don't make use of their I/O capacity in the result merging phase
23
Result Merging Phase: Solution
Slaves cache results and send only the information relevant for filtering to the master
Master does the filtering and informs slaves which results to write at which offset in the global output file
Slaves use parallel I/O and write the global output file simultaneously
No I/O bottleneck at the master
Communication overhead of the slaves is reduced
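The master's part of this scheme can be sketched as a function that receives only metadata and returns write offsets. The metadata layout (slave id, result id, score, size in bytes) and the filter rule (keep the best results by score) are assumptions for illustration; the talk does not specify them.

```python
def plan_output(metadata, keep):
    """metadata: list of (slave_id, result_id, score, n_bytes).
    Returns {slave_id: [(result_id, offset), ...]} for the kept results."""
    # Filtering: keep only the highest-scoring results
    ranked = sorted(metadata, key=lambda m: m[2], reverse=True)[:keep]
    plan = {}
    offset = 0
    for slave_id, result_id, _score, n_bytes in ranked:
        # Each kept result gets its own byte range in the global output file
        plan.setdefault(slave_id, []).append((result_id, offset))
        offset += n_bytes
    return plan

metadata = [
    (0, "r0", 12, 100),
    (1, "r1", 30, 80),
    (0, "r2", 25, 120),
    (2, "r3", 5, 90),
]
plan = plan_output(metadata, keep=3)
```

Each slave then writes its cached results at the assigned offsets, so the result data itself never passes through the master.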
24
Bottleneck Master
[Figure: a single master serving all slaves (S)]
25
Introducing a Supermaster
[Figure: a supermaster placed above the master level, with the slaves (S) below]
26
Blue Gene/L
The world's top supercomputer at the time of this talk
Parallel BLAST was implemented on a cluster with 4096 nodes
To utilize as many nodes as possible, the fragmented approach was combined with the replicated approach
27
Replicated & Fragmented Approach
[Figure: supermaster, master, and slaves (S), with one level labelled "Replicated Approach" and the other "Fragmented Approach"]
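One way to read the combination of both approaches is a two-level layout: nodes are grouped, each group holds the complete database split into fragments (fragmented approach), and the query set is divided among the groups (replicated approach). The sketch below illustrates that reading only; the function name and the numbering scheme are assumptions, not the layout actually used on Blue Gene/L.

```python
def hybrid_layout(n_nodes, n_fragments):
    """Group nodes so that each group holds the whole database in
    n_fragments pieces. Returns {node: (group, fragment)}; nodes that
    do not fill a complete group are left unused."""
    n_groups = n_nodes // n_fragments
    return {node: (node // n_fragments, node % n_fragments)
            for node in range(n_groups * n_fragments)}

layout = hybrid_layout(n_nodes=12, n_fragments=4)
groups = {group for group, _fragment in layout.values()}
```

With 12 nodes and 4 fragments this gives 3 groups, so three parts of the query set can be searched at once while each node holds only a quarter of the database.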
28
Summary
BLAST: popular tool for sequence comparison (good performance)
Problem: exponential growth of sequence databases
Solution: parallelization, distribute data
  Replicated approach
  Fragmented approach
Implementation issues: partitioning, load balancing, I/O bottleneck, message overhead
MPI & BLAST = a good solution
29
References
1. R. de Carvalho Costa and S. Lifschitz, “Database allocation strategies for parallel BLAST evaluation on clusters”, Distributed and Parallel Databases, 13(1), 2003.
2. H. Lin, X. Ma, P. Chandramohan, A. Geist, and N. Samatova, “Efficient Data Access for Parallel BLAST”, Proc. 19th Int'l Parallel and Distributed Processing Symp. (IPDPS '05), 2005.
3. The European Bioinformatics Institute (EBI), http://www.ebi.ac.uk/swissprot/
4. Karl Jiang, Oystein Thorsen, Amanda Peters, Brian Smith, and Carlos P. Sosa, “An Efficient Parallel Implementation of the Hidden Markov Methods for Genomic Sequence Search on a Massively Parallel System”, 2008.
5. A. E. Darling, L. Carey, and W. Feng, “The design, implementation, and evaluation of mpiBLAST”, in Proceedings of ClusterWorld and Expo, 2003.
6. Huzefa Rangwala, Eric Lantz, Roy Musselman, Kurt Pinnow, Brian Smith, and Brian Wallenfelt, “Massively Parallel BLAST for the Blue Gene/L”, presented at the High Availability and Performance Computing Workshop held in conjunction with the 6th Los Alamos Computer Science Institute Symposium, Santa Fe, New Mexico (October 2005).
30
Questions