Parallel Computation in Biological Sequence Analysis: ParAlign & TurboBLAST Larissa Smelkov
Biological Sequence Alignment
Global alignment
  Goal: to identify conserved regions and differences
  Algorithm: Needleman-Wunsch
  Application: comparing two genes with the same function (human vs. mouse); comparing two proteins with similar function
Local alignment
  Goal: to see whether 2 strings have a common substring
  Algorithm: Smith-Waterman
  Application: searching for local similarities in large sequences (newly sequenced genomes); looking for motifs in 2 proteins
Protein Responsible for Iron Transport Human MQEYTNHSDTTFALRNISFRVPGRTLLHPLSLTFPAGKVTGLIGHNGSGKSTLLK MLGRHPPSEGEILLDAQPLESWSSKAFARKVAYLPQQLPPAEGMTVRELVAIGR YPWHGALGRFGAADREKVEEAISLVGLKPLAHRLVDSLSGGERQRAWIAMLVA QDSRCLLLDEPTSALDIHQVDVLSLVHRLSQERGLTVIAVLHDINMAARYCDYL VALRGGEMIAQGTPAEIMRGETLEMIYGIPMGILPHPAGAAPVSFVY Chicken MKLILCTVLSLGIAAVCFAAPPKSVIRWCTISSPEEKKCNNLRDLTQQERISLTCV QKATYLDCIKAIANNEADAISLDGGQVFEAGLAPYKLKPIAAEIYEHTEGSTTSY YAVAVVKKGTEFTVNDLQGKTSCHTGLGRSAGWNIPIGTLLHWGAIEWEGIESG SVEQAVAKFFSASCVPGATIEQKLCRQCKGDPKTKCARNAPYSGYSGAFHCLKD GKGDVAFVKHTTVNENAPDLNDEYELLCLDGSRQPVDNYKTCNWARVAAHAV VARDDNKVEDIWSFLSKAQSDFGVDTKSDFHLFGPPGKKDPVLKDLLFKDSAIM LKRVPSLMSQLYLGFEYYSAIQSMRKDQLSGSPRQNRIQWIAVLKAEKSKCDRW SVVSNGDVECTVVDETKDCIIKIMKGEADAV
Similar Substrings
DSLSGGERQ-RA-WIAMLVAQDSRC
: :::  :: :  ::: : :  : :
DQLSGSPRQNRIQWIAVLKAEKSKC
Talk Outline
  Problem Description
  Smith-Waterman Algorithm
  BLAST
  ParAlign
  TurboBLAST
  Comparison
Problems in Comparing 2 Sequences
  Evolution factor: additions, deletions, substitutions
  Human factor: typos, duplicates
Solution
  Smith-Waterman Algorithm (S-W)
  Score matrix
  Gap penalty
Score Matrix: BLOSUM45
Pairwise Alignment Example ELEPHANT PANTHER
S-W: Dynamic Programming Matrix
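S-W: Formula
$$
T[i,j] = \max
\begin{cases}
T[i-1,\,j-1] + \mathrm{score}(s[i],\, t[j]) \\
T[i-1,\,j] - g \\
T[i,\,j-1] - g \\
0
\end{cases}
$$
g is the gap penalty; g = 8 in our example.
The cell T[i, j] (marked "?" in the diagram) is computed from its neighbors T[i-1, j-1], T[i-1, j], and T[i, j-1].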
S-W: Dynamic Programming Matrix
S-W: Result Alignment
ELEPHANT
   : :::
   P-ANTHER
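The S-W recurrence and traceback can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the slides: the function name `smith_waterman`, the `toy_score` function (+2 for identical letters, -1 otherwise), and the gap penalty of 1 are all assumptions chosen to keep the example short. The slides use BLOSUM45 with g = 8, which is not reproduced here.

```python
def smith_waterman(s, t, score, gap):
    """Return (best_score, aligned_s, aligned_t) for a local alignment."""
    m, n = len(s), len(t)
    # T[i][j] = best score of a local alignment ending at s[i-1] and t[j-1].
    T = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            T[i][j] = max(
                0,
                T[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
                T[i - 1][j] - gap,
                T[i][j - 1] - gap,
            )
            if T[i][j] > best:
                best, best_pos = T[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    a, b = [], []
    while i > 0 and j > 0 and T[i][j] > 0:
        if T[i][j] == T[i - 1][j - 1] + score(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif T[i][j] == T[i - 1][j] - gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


def toy_score(x, y):
    # Hypothetical scoring, NOT BLOSUM45: +2 for identity, -1 otherwise.
    return 2 if x == y else -1


result = smith_waterman("ELEPHANT", "PANTHER", toy_score, gap=1)
print(result)
```

Under these toy parameters the sketch pairs PHANT with P-ANT, one gap opposite the H. Both loops visit every cell once, which is where the O(mn) cost comes from.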
S-W: Summary
  Uses: score matrix, gap penalties
  Complexity: O(mn)
  Sensitivity: high
Growth of GenBank: about 33 million sequences as of Feb. 14, 2004
BLAST: Basic Local Alignment Search Tool
BLAST: Steps
  1. Divide both sequences into words of length w (default w = 3)
  2. Calculate the score for each pair of words
  3. Extend high-scoring pairs to increase the score
BLAST: Divide Sequences
BLAST: Calculate Score
  ELE vs. PAN   score: -1
  LEP vs. PAN   score: -6
  EPH vs. PAN   score: 0
  PHA vs. PAN   score: 6
  HAN vs. PAN   score: 9
  ANT vs. PAN   score: -2
BLAST: Sort Pairs on Score
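The word-splitting, scoring, and sorting steps can be sketched as follows. This is a simplified illustration under stated assumptions: `words`, `ranked_word_pairs`, and `toy_score` are hypothetical names, and the toy scoring (+2 identity, -1 otherwise) stands in for BLOSUM45, so the numbers differ from the scores on the slide.

```python
def words(seq, w=3):
    """All overlapping words of length w, with their start positions."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def toy_score(x, y):
    # Hypothetical scoring, NOT BLOSUM45: +2 for identity, -1 otherwise.
    return 2 if x == y else -1


def ranked_word_pairs(s, t, w=3, score=toy_score):
    """Score every word of s against every word of t, best pairs first."""
    pairs = []
    for i, a in words(s, w):
        for j, b in words(t, w):
            total = sum(score(x, y) for x, y in zip(a, b))
            pairs.append((total, i, j, a, b))
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs


query_words = [word for _, word in words("ELEPHANT")]
ranked = ranked_word_pairs("ELEPHANT", "PANTHER")
top = ranked[0]
print(query_words)
print(top)
```

With w = 3, ELEPHANT yields the same six words that appear on the scoring slide, and the best pair under the toy scoring is ANT against ANT.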
BLAST: Extension
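One way to picture the extension step is an ungapped extension of a word hit to the left and right, keeping the best running score. The sketch below is an assumption-laden illustration, not BLAST's actual code: `extend_seed` and `toy_score` are hypothetical names, and the `x_drop` cutoff (stop once the running score falls more than `x_drop` below the best seen) is a simplified stand-in for BLAST's termination rule.

```python
def toy_score(x, y):
    # Hypothetical scoring, NOT BLOSUM45: +2 for identity, -1 otherwise.
    return 2 if x == y else -1


def extend_seed(s, t, i, j, w, score=toy_score, x_drop=3):
    """Extend the word hit s[i:i+w] / t[j:j+w] without gaps in both directions."""
    seed = sum(score(a, b) for a, b in zip(s[i:i + w], t[j:j + w]))

    # Extend to the right of the seed.
    best_r = run = right = k = 0
    while i + w + k < len(s) and j + w + k < len(t):
        run += score(s[i + w + k], t[j + w + k])
        k += 1
        if run > best_r:
            best_r, right = run, k
        elif best_r - run > x_drop:
            break

    # Extend to the left of the seed.
    best_l = run = left = k = 0
    while i - 1 - k >= 0 and j - 1 - k >= 0:
        run += score(s[i - 1 - k], t[j - 1 - k])
        k += 1
        if run > best_l:
            best_l, left = run, k
        elif best_l - run > x_drop:
            break

    return (seed + best_l + best_r,
            s[i - left:i + w + right],
            t[j - left:j + w + right])


# Made-up strings: the seed CDE grows to ACDEFG on both sides.
grown = extend_seed("XXACDEFGYY", "ZACDEFGW", 3, 2, 3)
# The ANT hit from ELEPHANT / PANTHER cannot be extended under toy scoring.
stuck = extend_seed("ELEPHANT", "PANTHER", 5, 1, 3)
print(grown, stuck)
```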
BLAST: Summary
  Uses: score matrix, gap penalties, heuristics to reduce computations
  Complexity: O(m) with O(n) processors
  Sensitivity: low
Sensitivity
Task: align 2 sequences: AXBXCXDXE and ABCDE
Smith-Waterman:
AXBXCXDXE
: : : : :
A-B-C-D-E
BLAST: Ø (no similar substrings)
S-W vs. BLAST (chart: speed vs. sensitivity for S-W and BLAST)
S-W and BLAST
  Using them now: too costly, inefficient, time-consuming
  Solution: more heuristics, more parallelism
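The reason a word-based search finds nothing here can be checked directly. The sketch below assumes exact word matching with w = 3, which is a simplification (BLAST also considers similar, not only identical, words), and `shared_words` is a hypothetical helper name.

```python
def shared_words(s, t, w=3):
    """Words of length w that occur in both sequences (exact matches only)."""
    in_s = {s[i:i + w] for i in range(len(s) - w + 1)}
    in_t = {t[i:i + w] for i in range(len(t) - w + 1)}
    return in_s & in_t


no_seed = shared_words("AXBXCXDXE", "ABCDE")
one_seed = shared_words("ELEPHANT", "PANTHER")
print(no_seed, one_seed)
```

Every 3-letter word of AXBXCXDXE contains an X, so no word is shared with ABCDE and there is nothing to extend, whereas ELEPHANT and PANTHER share the word ANT.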
ParAlign
ParAlign: Steps
  1. Find ungapped alignments
  2. Calculate approximate alignment scores
  3. Choose high-scoring sequences
  4. Apply S-W
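The first step, finding ungapped alignments, can be pictured as scoring each diagonal of the alignment matrix separately. The sketch below is only an illustration of that idea, not ParAlign's code: `best_ungapped_per_diagonal` and `toy_score` are hypothetical names, the scoring is a toy substitute for a real score matrix, and the loop is written sequentially rather than with SIMD.

```python
def toy_score(x, y):
    # Hypothetical scoring, NOT a real substitution matrix.
    return 2 if x == y else -1


def best_ungapped_per_diagonal(s, t, score=toy_score):
    """Map each diagonal d = i - j to its best ungapped local score."""
    best = {}
    for d in range(-(len(t) - 1), len(s)):
        run = top = 0
        i, j = max(d, 0), max(-d, 0)
        while i < len(s) and j < len(t):
            # Running score restarts at 0 whenever it would go negative.
            run = max(0, run + score(s[i], t[j]))
            top = max(top, run)
            i += 1
            j += 1
        best[d] = top
    return best


diag = best_ungapped_per_diagonal("ELEPHANT", "PANTHER")
strongest = max(diag, key=diag.get)
print(strongest, diag[strongest])
```

For ELEPHANT and PANTHER the strongest diagonal is d = 4, where ANT lines up with ANT; the diagonal d = 3 holds only the P/P match. Diagonals with high scores are the candidates for the later, more expensive steps.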
ParAlign: Microparallelism
  Divide wide registers into smaller units
  Perform the same operation on different data sources
  Modern microprocessors have this technology built in
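The idea of splitting one wide register into independent lanes can be imitated in plain Python with integer bit tricks. This is a software emulation for illustration only, assuming a 32-bit word holding four 8-bit lanes; `pack`, `unpack`, and `add_lanes` are hypothetical names, and the addition wraps around modulo 256, whereas hardware SIMD instruction sets typically also offer saturating variants.

```python
LOW7 = 0x7F7F7F7F   # low 7 bits of every 8-bit lane
HIGH = 0x80808080   # top bit of every 8-bit lane


def pack(lanes):
    """Pack four 8-bit values into one 32-bit word (lane 0 is the low byte)."""
    word = 0
    for k, v in enumerate(lanes):
        word |= (v & 0xFF) << (8 * k)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * k)) & 0xFF for k in range(4)]


def add_lanes(a, b):
    """Add all four lanes at once; carries never cross a lane boundary."""
    # Add the low 7 bits of each lane, then fix up the top bits with XOR.
    return ((a & LOW7) + (b & LOW7)) ^ ((a ^ b) & HIGH)


plain = unpack(add_lanes(pack([1, 2, 3, 4]), pack([10, 20, 30, 40])))
wrapped = unpack(add_lanes(pack([200, 0, 0, 0]), pack([100, 0, 0, 0])))
print(plain, wrapped)
```

A single call to `add_lanes` performs four additions, which is the sense in which the same operation is applied to different data at once.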
ParAlign: Calculate Scores in Parallel
ParAlign: Estimate of Gaps
ParAlign: Apply S-W in Parallel
ParAlign: Summary
  Uses: SIMD technology (single instruction, multiple data), S-W algorithm, heuristics to reduce computations
  Requirement for machine: modern microprocessor
  Speed: fast
  Sensitivity: medium
TurboBLAST
TurboBLAST: Steps
  1. Divide the job: parts of the query against partitions of the database
  2. Apply BLAST
  3. Merge results
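The divide and merge steps can be sketched as pairing chunks of queries with database partitions and then combining the per-task hit lists. This is an illustrative sketch, not TurboBLAST's code: `make_tasks`, `merge_results`, the `(query_id, subject_id, score)` hit format, and all identifiers in the example are assumptions, and the merge ignores any statistics that depend on database size.

```python
from itertools import product


def make_tasks(queries, db_partitions, queries_per_task):
    """Pair every chunk of queries with every database partition."""
    chunks = [queries[k:k + queries_per_task]
              for k in range(0, len(queries), queries_per_task)]
    return list(product(chunks, db_partitions))


def merge_results(per_task_hits):
    """Combine hits from all tasks, best score first for each query.

    Each hit is a hypothetical (query_id, subject_id, score) tuple.
    """
    merged = {}
    for hits in per_task_hits:
        for query_id, subject_id, score in hits:
            merged.setdefault(query_id, []).append((score, subject_id))
    return {q: sorted(v, reverse=True) for q, v in merged.items()}


tasks = make_tasks(["q1", "q2", "q3", "q4", "q5"], ["dbA", "dbB"], 2)
merged = merge_results([[("q1", "s1", 10)], [("q1", "s9", 42)]])
print(len(tasks), merged)
```

Five queries in chunks of two against two partitions give six independent tasks, each of which could be handed to an unmodified BLAST run.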
TurboBLAST: Implementation
  A three-tier system
  Components: Client, Master, Workers
TurboBLAST: Schema
  Client: divides the job into tasks; writes results to file
  Master: sets up tasks
    Turbo Hub: manages execution; coordinates Workers; provides VSM
    File Provider: answers DB requests by delivering DB parts
  Workers: divide the task; schedule subtasks; solve subtasks; merge results
  Flow: job -> Client -> tasks -> Master -> task (on request) -> Workers -> results
  The Master does not push work out to the Workers. It simply posts information about what work needs to be done and lets the machines grab work from their remote locations.
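The pull model, where work is posted and workers grab it, can be imitated on one machine with a shared queue and threads. This is a local analogy under stated assumptions, not TurboHub's mechanism: `run_workers` and `solve` are hypothetical names, and threads stand in for the remote Worker machines.

```python
import queue
import threading


def run_workers(tasks, solve, n_workers=4):
    """Post all tasks to a shared queue and let workers pull them."""
    todo = queue.Queue()
    for task in tasks:
        todo.put(task)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = todo.get_nowait()   # grab work; nobody pushes it
            except queue.Empty:
                return                     # nothing left to do
            out = solve(task)
            with lock:
                results.append((task, out))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results


done = sorted(run_workers(range(10), lambda x: x * x))
print(done)
```

Because each worker takes a new task only when it finishes the previous one, faster workers naturally end up doing more tasks, which is the load-balancing effect a pull model is meant to give.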
TurboBLAST: Client
  Takes a BLAST job and divides it into a number of initial BLAST tasks
  Submits these tasks to the Master
  Retrieves the results and writes them to file
TurboBLAST: Master
  Accepts tasks from Clients and sets them up for processing by the Workers
  Includes TurboHub (the server portion of a parallel execution system)
  Includes File Provider (Java application that manages the databases)
TurboBLAST: Worker
  Workers are processors
  Run a Java application and perform the BLAST computations
  Merge the results
  Are responsible for scheduling
TurboHub
  TurboHub is an execution engine for parallel and distributed Java applications
  Scalable, high performance
  Wide range of computing environments
  Manages the flow of data through the workflows
  Schedules the components
  Transforms data between components
  Balances load
  Handles errors
TurboBLAST: TurboHub
  Manages task execution
  Coordinates the Workers
  Provides a virtual shared memory
  Supports dynamic changes in the set of Workers
  Supports fault tolerance
TurboBLAST: File Provider
  Maintains a copy of each database
  Delivers all or part of each database to Workers as they require it
TurboBLAST: Advantages
  Size of each task is optimal: processing is efficient on the processor that computes the task
  Large set of tasks: no waste of time for processors
  No algorithm change: support for all flavors of BLAST; easy to update
  Applicable to different environments (PC, Macintosh …)
TurboBLAST: Experiment
  Input data: 500 proteins, 200 to 400 amino acids in each
  Database: 1,681,522,266 sequences
  Hardware: IBM Linux cluster of 8 dual-processor workstations, each with 2 Pentium III processors (996 MHz each), 2 GB memory, 100 Mbit Ethernet
TurboBLAST: Results of Experiment
TurboBLAST: Summary
  Divide and conquer: use many copies of BLAST in parallel
  Uses: BLAST algorithm
  Requirement for each machine: Java VM, local BLAST executable
  Speed: very fast
  Sensitivity: low
Comparison of Algorithms/Products (chart: speed vs. sensitivity for S-W, BLAST, ParAlign, TurboBLAST)
References
  R.D. Bjornson, A.H. Sherman, S.B. Weston, N. Willard, J. Wing. "TurboBLAST: A Parallel Implementation of BLAST Built on the TurboHub." Intl. Parallel and Distributed Processing Symposium (IPDPS).
  T. Rognes. "ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches." Oxford University Press, 2001.
Don’t ask any Questions, please…
P.S. A web site where you can donate your computer time to participate in the search for methods to cure cancer: