1
Parallel Computation in Biological Sequence Analysis: ParAlign & TurboBLAST Larissa Smelkov
2
Biological Sequence Alignment
Global alignment
  Goal: To identify conserved regions and differences
  Algorithm: Needleman-Wunsch
  Application: Comparing two genes with the same function (human vs. mouse); comparing two proteins with similar function
Local alignment
  Goal: To see whether 2 strings have a common substring
  Algorithm: Smith-Waterman
  Application: Searching for local similarities in large sequences (newly sequenced genomes); looking for motifs in 2 proteins
3
Protein Responsible for Iron Transport
Human
MQEYTNHSDTTFALRNISFRVPGRTLLHPLSLTFPAGKVTGLIGHNGSGKSTLLK
MLGRHPPSEGEILLDAQPLESWSSKAFARKVAYLPQQLPPAEGMTVRELVAIGR
YPWHGALGRFGAADREKVEEAISLVGLKPLAHRLVDSLSGGERQRAWIAMLVA
QDSRCLLLDEPTSALDIHQVDVLSLVHRLSQERGLTVIAVLHDINMAARYCDYL
VALRGGEMIAQGTPAEIMRGETLEMIYGIPMGILPHPAGAAPVSFVY
Chicken
MKLILCTVLSLGIAAVCFAAPPKSVIRWCTISSPEEKKCNNLRDLTQQERISLTCV
QKATYLDCIKAIANNEADAISLDGGQVFEAGLAPYKLKPIAAEIYEHTEGSTTSY
YAVAVVKKGTEFTVNDLQGKTSCHTGLGRSAGWNIPIGTLLHWGAIEWEGIESG
SVEQAVAKFFSASCVPGATIEQKLCRQCKGDPKTKCARNAPYSGYSGAFHCLKD
GKGDVAFVKHTTVNENAPDLNDEYELLCLDGSRQPVDNYKTCNWARVAAHAV
VARDDNKVEDIWSFLSKAQSDFGVDTKSDFHLFGPPGKKDPVLKDLLFKDSAIM
LKRVPSLMSQLYLGFEYYSAIQSMRKDQLSGSPRQNRIQWIAVLKAEKSKCDRW
SVVSNGDVECTVVDETKDCIIKIMKGEADAV
5
Similar Substrings
DSLSGGERQ-RA-WIAMLVAQDSRC
: : : : : : :
DQLSGSPRQNRIQWIAVLKAEKSKC
6
Talk Outline
Problem Description
Smith-Waterman Algorithm
BLAST
ParAlign
TurboBLAST
Comparison
7
Problems of Comparison of 2 Sequences
Evolution Factor: Additions, Deletions, Substitutions
Human Factor: Typos, Duplicates
8
Solution Smith-Waterman Algorithm (S-W) Score Matrix Gap Penalty
9
Score Matrix: BLOSUM45
10
Pairwise Alignment Example ELEPHANT PANTHER
11
S-W: Dynamic Programming Matrix
12
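S-W: Formula
$$T[i, j] = \max \begin{cases} T[i-1, j-1] + \mathrm{score}(s[i], t[j]) \\ T[i-1, j] - g \\ T[i, j-1] - g \\ 0 \end{cases}$$
where g is the gap penalty (g = 8 in our example).
Diagram: the unknown cell T[i, j] (marked "?") is filled from its three neighbors T[i-1, j-1], T[i-1, j], and T[i, j-1].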
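The recurrence for T[i, j] can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the function names are hypothetical, and `simple_score` (+5 for a match, -4 for a mismatch) is an assumed stand-in for a real BLOSUM45 lookup. Only the gap penalty g = 8 comes from the slide.

```python
def smith_waterman_score(s, t, score, g=8):
    """Fill the S-W matrix T and return (best local score, T)."""
    m, n = len(s), len(t)
    # Row 0 and column 0 stay 0: a local alignment may start anywhere.
    T = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            T[i][j] = max(
                T[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # align the two residues
                T[i - 1][j] - g,                              # gap in t
                T[i][j - 1] - g,                              # gap in s
                0,                                            # start a new alignment
            )
            best = max(best, T[i][j])
    return best, T


def simple_score(a, b):
    # Assumption: +5 match / -4 mismatch replaces the BLOSUM45 lookup.
    return 5 if a == b else -4


best, T = smith_waterman_score("ELEPHANT", "PANTHER", simple_score)
print(best)  # 15 under these assumed scores (the shared substring ANT)
```

With the assumed scores the gap is not worth paying for, so the best local alignment is the ungapped ANT. A substitution matrix that rewards P-P more strongly, as BLOSUM45 does, can make a gapped alignment win instead.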
13
S-W: Dynamic Programming Matrix
17
S-W: Result Alignment
ELEPHANT
   : :::
   P-ANTHER
18
S-W: Summary
Uses: Score matrix, Gap penalties
Complexity: O(mn)
Sensitivity: High
19
Growth of GenBank
~33 million sequences as of Feb. 14, 2004
Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
20
BLAST: Basic Local Alignment Search Tool
21
BLAST: Steps
Divide both sequences into words of length w (default w = 3)
Calculate the score for each pair of words
Extend high-scoring pairs to increase the score
22
BLAST: Divide Sequences
23
BLAST: Calculate Score
Word pair    Per-residue scores    Score
ELE / PAN     0, -1,  0            -1
LEP / PAN    -3, -1, -2            -6
EPH / PAN     0, -1,  1             0
PHA / PAN     9, -2, -1             6
HAN / PAN    -2,  5,  6             9
ANT / PAN    -1, -1,  0            -2
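The word-scoring step above can be reproduced with a short Python sketch. The pair scores in `PAIR_SCORES` are copied from the per-residue values on the slide. The function names are hypothetical, and the code covers only the word PAN against the words of ELEPHANT, not a full BLAST seeding stage.

```python
# Pair scores copied from the per-residue values on the slide (symmetric).
PAIR_SCORES = {
    ("E", "P"): 0, ("L", "A"): -1, ("E", "N"): 0,
    ("L", "P"): -3, ("E", "A"): -1, ("P", "N"): -2,
    ("P", "A"): -1, ("H", "N"): 1,
    ("P", "P"): 9, ("H", "A"): -2, ("A", "N"): -1,
    ("H", "P"): -2, ("A", "A"): 5, ("N", "N"): 6,
    ("T", "N"): 0,
}


def pair_score(a, b):
    # Look the pair up in either order, since the matrix is symmetric.
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def words(seq, w=3):
    """All overlapping words of length w, in order."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def word_score(u, v):
    """Sum of per-residue scores for two words of equal length."""
    return sum(pair_score(a, b) for a, b in zip(u, v))


query_word = "PAN"
scored = [(word_score(u, query_word), u) for u in words("ELEPHANT")]
scored.sort(reverse=True)  # highest-scoring pairs first, ready for extension
for s, u in scored:
    print(u, query_word, s)
```

The highest-scoring pair, HAN / PAN with score 9, is the one a later step would try to extend.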
24
BLAST: Sort Pairs on Score
25
BLAST: Extension
26
BLAST: Summary
Uses: Score matrix, Gap penalties, Heuristics to reduce computations
Complexity: O(m) with O(n) processors
Sensitivity: Low
27
Sensitivity
Task: Align 2 sequences: AXBXCXDXE and ABCDE
Smith-Waterman:
AXBXCXDXE
: : : : :
A-B-C-D-E
BLAST: Ø (no similar substrings)
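A quick check of why a word-based search comes up empty on this pair. The sketch assumes exact word matches with w = 3, which is a simplification: real BLAST also accepts similar words scoring above a threshold. The function name is hypothetical.

```python
def exact_words(seq, w=3):
    """Set of all overlapping words of length w."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


shared = exact_words("AXBXCXDXE") & exact_words("ABCDE")
print(shared)  # set(): the two sequences share no 3-letter word
```

Every window of length 3 in the first sequence contains an X, so there is no seed to extend, even though a gapped alignment matches all five letters.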
28
S-W vs. BLAST Speed Sensitivity S-W BLAST
29
S-W and BLAST
Using them now: Too costly, Inefficient, Time-consuming
Solution: More heuristics, More parallelism
30
ParAlign
31
ParAlign: Steps
Find ungapped alignments
Calculate approximate alignment scores
Choose high-scoring sequences
Apply S-W
32
ParAlign: Microparallelism
Divide wide registers into smaller units
Perform the same operation on different data sources
Modern microprocessors have this technology built in
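The idea of splitting one wide register into smaller units can be imitated in plain Python. The sketch below treats a 32-bit integer as four 8-bit lanes and adds all four with a single pass of integer operations. It is a software emulation meant only to show the principle; real processors do this in hardware SIMD instructions, and all names here are hypothetical.

```python
LOW7 = 0x7F7F7F7F   # low 7 bits of each 8-bit lane
HIGH1 = 0x80808080  # top bit of each 8-bit lane


def pack(lanes):
    """Pack four values in 0..255 into one 32-bit word, lane 0 lowest."""
    word = 0
    for k, v in enumerate(lanes):
        word |= (v & 0xFF) << (8 * k)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * k)) & 0xFF for k in range(4)]


def add_lanes(x, y):
    """Add four 8-bit lanes at once; each lane wraps modulo 256."""
    low = (x & LOW7) + (y & LOW7)   # carries cannot cross a lane boundary
    return low ^ ((x ^ y) & HIGH1)  # fold the top bit of each lane back in


a = pack([1, 2, 3, 4])
b = pack([10, 20, 30, 40])
print(unpack(add_lanes(a, b)))  # [11, 22, 33, 44]
```

One call to `add_lanes` does the work of four separate additions, which is the same trade that lets a SIMD unit score several matrix cells per instruction.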
33
ParAlign: Calculate Scores in Parallel
34
ParAlign: Estimate of Gaps
35
ParAlign: Apply S-W in Parallel
36
ParAlign: Summary
Uses: SIMD technology (single instruction, multiple data), S-W Algorithm, Heuristics to reduce computations
Requirement for machine: Modern microprocessor
Speed: Fast
Sensitivity: Medium
37
TurboBLAST
38
TurboBLAST: Steps
Divide the job: parts of the query are run against partitions of the database
Apply BLAST
Merge results
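The divide / apply / merge steps can be sketched as follows. This is an illustration under stated assumptions, not TurboBLAST code: `toy_search` is a hypothetical stand-in that reports exact substring hits instead of running BLAST, and the chunk sizes and thread pool are arbitrary choices.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def toy_search(queries, db_part):
    """Hypothetical stand-in for one BLAST run: exact substring hits."""
    return [(q, name) for q in queries for name, seq in db_part if q in seq]


def chunks(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_job(queries, database, query_chunk=2, db_chunk=2, workers=4):
    # Divide: pair every query chunk with every database partition.
    tasks = list(product(chunks(queries, query_chunk), chunks(database, db_chunk)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda task: toy_search(*task), tasks)
        # Merge: concatenate the partial hit lists into one sorted result.
        return sorted(hit for hits in partial for hit in hits)


database = [("d1", "ELEPHANT"), ("d2", "PANTHER"), ("d3", "ANTELOPE")]
print(run_job(["ANT", "PAN", "HER"], database))
```

Because each task touches only one query chunk and one database partition, the tasks are independent and the merge is a plain concatenation followed by a sort.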
39
TurboBLAST: Implementation
A three-tier system
Components: Client, Master, Workers
40
TurboBLAST: Schema
[Diagram: Client, Master (Turbo Hub and File Provider), and Workers. Arrows labeled: job, tasks, task, request task, results, DB request, DB part.]
Client: Divides job into tasks; Writes results to file
Master (Turbo Hub): Sets up tasks; Manages execution; Coordinates Workers; Provides VSM
Workers: Divide task; Schedule subtasks; Solve subtasks; Merge results
The Master does not push work out to the machines. It simply posts information about what work needs to be done and lets the machines grab work from their remote locations.
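The "post the work and let the machines grab it" model can be sketched with a shared queue. This is a single-process illustration with threads standing in for remote Workers. The names `run_workers`, `board`, and `solve` are hypothetical and do not come from TurboHub.

```python
import queue
import threading


def run_workers(tasks, solve, n_workers=3):
    board = queue.Queue()  # the posted work, visible to every worker
    for task in tasks:
        board.put(task)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = board.get_nowait()  # the worker pulls; nobody pushes
            except queue.Empty:
                return                     # no work left, so the worker stops
            outcome = solve(task)
            with lock:
                results.append((task, outcome))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


print(run_workers(range(5), lambda x: x * x))
```

A fast worker simply comes back to the queue more often, so load balances itself without the poster needing to know how many workers exist or how fast they are.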
41
TurboBLAST: Client
Takes a BLAST job and divides it into a number of initial BLAST tasks.
Submits these tasks to the Master.
Retrieves the results and writes them to file.
42
TurboBLAST: Master
Accepts tasks from Clients and sets them up for processing by the Workers
Includes TurboHub (the server portion of a parallel execution system)
Includes File Provider (a Java application that manages the databases)
43
TurboBLAST: Worker
Workers are processors
Run a Java application and perform the BLAST computations
Merge the results
Are responsible for scheduling
44
TurboHub
TurboHub is an execution engine for parallel and distributed Java applications
Scalable, high performance
Runs in a wide range of computing environments
Manages the flow of data through the workflows
Schedules the components
Transforms data between components
Balances load
Handles errors
45
TurboBLAST: TurboHub
Manages task execution
Coordinates the Workers
Provides a virtual shared memory
Supports dynamic changes in the set of Workers
Supports fault tolerance
46
TurboBLAST: File Provider
Maintains a copy of each database
Delivers all or part of each database to Workers as they require them
47
TurboBLAST: Advantages
Size of each task is optimal: processing is efficient on the processor that computes the task
Large set of tasks: processors do not sit idle
No algorithm change
Support for all flavors of BLAST
Easy to update
Applicable to different environments (PC, Macintosh …)
48
TurboBLAST: Experiment
Input data: 500 proteins, 200 - 400 amino acids in each
Database: 1,681,522,266 sequences
Hardware: IBM Linux cluster
  8 dual-processor workstations
  2 Pentium III processors, 996 MHz each
  2 Gbyte memory
  100 Mbit Ethernet
49
TurboBLAST: Results of Experiment
51
TurboBLAST: Summary
Divide and Conquer: Use many copies of BLAST in parallel
Uses: BLAST Algorithm
Requirement for each machine: Java VM, Local BLAST executable
Speed: Very fast
Sensitivity: Low
52
Comparison of Algorithms/Products Speed Sensitivity S-W BLAST ParAlign Turbo BLAST
53
References
R.D. Bjornson, A.H. Sherman, S.B. Weston, N. Willard, J. Wing. "TurboBLAST: A Parallel Implementation of BLAST Built on the TurboHub". Intl. Parallel and Distributed Processing Symposium (IPDPS), 2002.
T. Rognes. "ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches". Oxford University Press, 2001.
54
Don’t ask any Questions, please…
55
P.S. A web site where you can donate your computer time to the search for methods to cure cancer: http://www.the-optimists.org.uk