Parallel Computation in Biological Sequence Analysis: ParAlign & TurboBLAST Larissa Smelkov.



Biological Sequence Alignment
Global alignment. Goal: to identify conserved regions and differences. Algorithm: Needleman-Wunsch. Applications: comparing two genes with the same function (human vs. mouse); comparing two proteins with similar function.
Local alignment. Goal: to see whether 2 strings have a common substring. Algorithm: Smith-Waterman. Applications: searching for local similarities in large sequences (newly sequenced genomes); looking for motifs in 2 proteins.

Protein Responsible for Iron Transport Human MQEYTNHSDTTFALRNISFRVPGRTLLHPLSLTFPAGKVTGLIGHNGSGKSTLLK MLGRHPPSEGEILLDAQPLESWSSKAFARKVAYLPQQLPPAEGMTVRELVAIGR YPWHGALGRFGAADREKVEEAISLVGLKPLAHRLVDSLSGGERQRAWIAMLVA QDSRCLLLDEPTSALDIHQVDVLSLVHRLSQERGLTVIAVLHDINMAARYCDYL VALRGGEMIAQGTPAEIMRGETLEMIYGIPMGILPHPAGAAPVSFVY Chicken MKLILCTVLSLGIAAVCFAAPPKSVIRWCTISSPEEKKCNNLRDLTQQERISLTCV QKATYLDCIKAIANNEADAISLDGGQVFEAGLAPYKLKPIAAEIYEHTEGSTTSY YAVAVVKKGTEFTVNDLQGKTSCHTGLGRSAGWNIPIGTLLHWGAIEWEGIESG SVEQAVAKFFSASCVPGATIEQKLCRQCKGDPKTKCARNAPYSGYSGAFHCLKD GKGDVAFVKHTTVNENAPDLNDEYELLCLDGSRQPVDNYKTCNWARVAAHAV VARDDNKVEDIWSFLSKAQSDFGVDTKSDFHLFGPPGKKDPVLKDLLFKDSAIM LKRVPSLMSQLYLGFEYYSAIQSMRKDQLSGSPRQNRIQWIAVLKAEKSKCDRW SVVSNGDVECTVVDETKDCIIKIMKGEADAV


Similar Substrings
DSLSGGERQ–RA–WIAMLVAQDSRC
: : : : : : :
DQLSGSPRQNRIQWIAVLKAEKSKC

Talk Outline: Problem Description; Smith-Waterman Algorithm; BLAST; ParAlign; TurboBLAST; Comparison

Problems in Comparing 2 Sequences. Evolution factor: additions, deletions, substitutions. Human factor: typos, duplicates.

Solution: the Smith-Waterman Algorithm (S-W), which uses a score matrix and a gap penalty.

Score Matrix: BLOSUM45

Pairwise Alignment Example: ELEPHANT vs. PANTHER

S-W: Dynamic Programming Matrix

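S-W: Formula
\[
T[i,j] = \max \begin{cases} T[i-1,j-1] + \mathrm{score}(s[i], t[j]) \\ T[i-1,j] - g \\ T[i,j-1] - g \\ 0 \end{cases}
\]
where g is the gap penalty (g = 8 in our example). The new cell T[i, j] is computed from its three neighbors T[i-1, j-1], T[i-1, j], and T[i, j-1].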
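The recurrence can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: it keeps the gap penalty g = 8 from the slide, but as an assumption it replaces the BLOSUM45 score matrix with a simple match/mismatch score (+5 / -3), so the numbers and the resulting alignment differ from the ones shown on the slides. The function name `smith_waterman` and its parameters are hypothetical.

```python
def smith_waterman(s, t, g=8, match=5, mismatch=-3):
    """Return (best_score, aligned_s, aligned_t) for one best local alignment.

    Assumption: a flat match/mismatch score stands in for BLOSUM45.
    """
    def score(a, b):
        return match if a == b else mismatch

    rows, cols = len(s) + 1, len(t) + 1
    T = [[0] * cols for _ in range(rows)]  # row 0 and column 0 stay 0
    best, best_pos = 0, (0, 0)

    # Fill the dynamic programming matrix with the four-way maximum.
    for i in range(1, rows):
        for j in range(1, cols):
            T[i][j] = max(
                T[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # (mis)match
                T[i - 1][j] - g,                              # gap in t
                T[i][j - 1] - g,                              # gap in s
                0,                                            # start afresh
            )
            if T[i][j] > best:
                best, best_pos = T[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_s, out_t = [], []
    while i > 0 and j > 0 and T[i][j] > 0:
        if T[i][j] == T[i - 1][j - 1] + score(s[i - 1], t[j - 1]):
            out_s.append(s[i - 1]); out_t.append(t[j - 1])
            i -= 1; j -= 1
        elif T[i][j] == T[i - 1][j] - g:
            out_s.append(s[i - 1]); out_t.append("-")
            i -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(out_s)), "".join(reversed(out_t))


print(smith_waterman("ELEPHANT", "PANTHER"))
```

With this simplified scoring the best local alignment of ELEPHANT and PANTHER is the shared substring ANT; a substitution matrix such as BLOSUM45 scores individual residue pairs differently and can make a gapped alignment worthwhile.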

S-W: Dynamic Programming Matrix

S-W: Result Alignment
ELEPHANT
: : : :
P–ANTHER

S-W: Summary. Uses: score matrix, gap penalties. Complexity: O(mn). Sensitivity: high.

Growth of GenBank: ~33 million sequences as of Feb. 14, 2004

BLAST: Basic Local Alignment Search Tool

BLAST: Steps. 1. Divide both sequences into words of length w (default w = 3). 2. Calculate a score for each pair of words. 3. Extend high-scoring pairs to increase the score.

BLAST: Divide Sequences

BLAST: Calculate Score
ELE vs. PAN, score: -1
LEP vs. PAN, score: -6
EPH vs. PAN, score: 0
PHA vs. PAN, score: 6
HAN vs. PAN, score: 9
ANT vs. PAN, score: -2

BLAST: Sort Pairs on Score

BLAST: Extension
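The three steps described on the slides (divide into words, score word pairs, extend the best pairs) can be sketched as follows. This is a simplified illustration of the slides' description and not the real BLAST implementation: it assumes a flat match/mismatch score (+5 / -3) instead of a substitution matrix, scores every word pair exhaustively, and extends all the way to the sequence ends without gaps rather than stopping early. All function names are hypothetical.

```python
def pair_score(a, b, match=5, mismatch=-3):
    # Assumption: flat scoring stands in for a substitution matrix.
    return match if a == b else mismatch


def words(seq, w=3):
    """All (start, word) pairs of length w in seq."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def scored_pairs(query, subject, w=3):
    """Score every query word against every subject word, best first."""
    pairs = []
    for qi, qw in words(query, w):
        for si, sw in words(subject, w):
            s = sum(pair_score(a, b) for a, b in zip(qw, sw))
            pairs.append((s, qi, si))
    return sorted(pairs, reverse=True)


def extend(query, subject, qi, si, w=3):
    """Ungapped extension of a word pair.

    Returns (score, query_start, subject_start, length) of the best
    segment found by growing the seed to the right, then to the left.
    """
    seed = sum(pair_score(a, b)
               for a, b in zip(query[qi:qi + w], subject[si:si + w]))
    best, start_q, start_s, length = seed, qi, si, w

    # Extend to the right, remembering the best-scoring end point.
    run, k = seed, w
    while qi + k < len(query) and si + k < len(subject):
        run += pair_score(query[qi + k], subject[si + k])
        k += 1
        if run > best:
            best, length = run, k

    # Extend to the left, starting from the best right-extended segment.
    run, k, right_len = best, 1, length
    while qi - k >= 0 and si - k >= 0:
        run += pair_score(query[qi - k], subject[si - k])
        if run > best:
            best, start_q, start_s, length = run, qi - k, si - k, right_len + k
        k += 1
    return best, start_q, start_s, length


top_score, qi, si = scored_pairs("ELEPHANT", "PANTHER")[0]
print(top_score, extend("ELEPHANT", "PANTHER", qi, si))
```

For ELEPHANT and PANTHER the top-scoring word pair is ANT/ANT, and extending it in either direction only lowers the score, so the seed itself is reported.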

BLAST: Summary. Uses: score matrix, gap penalties, heuristics to reduce computations. Complexity: O(m) with O(n) processors. Sensitivity: low.

Sensitivity. Task: align 2 sequences, AXBXCXDXE and ABCDE.
Smith-Waterman: AXBXCXDXE aligned with A–B–C–D–E (five matching positions).
BLAST: Ø (no similar substrings)
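A quick way to see why a word-based method finds nothing here is to list the words the two sequences share. The sketch below assumes exact word matching with w = 3, which is a simplification of BLAST's seeding; the helper name `shared_words` is hypothetical.

```python
def shared_words(a, b, w=3):
    """Words of length w that occur in both sequences."""
    words_a = {a[i:i + w] for i in range(len(a) - w + 1)}
    words_b = {b[i:i + w] for i in range(len(b) - w + 1)}
    return words_a & words_b


print(shared_words("AXBXCXDXE", "ABCDE"))   # no common word, so no seed
print(shared_words("ELEPHANT", "PANTHER"))  # one common word to extend
```

Because every matching letter in AXBXCXDXE is separated by an X, no window of three letters is shared with ABCDE, so a search that starts from shared words has nothing to extend.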

S-W vs. BLAST (chart comparing S-W and BLAST on speed and sensitivity)

S-W and BLAST. Using them now: too costly, inefficient, time-consuming. Solution: more heuristics, more parallelism.

ParAlign

ParAlign: Steps. 1. Find ungapped alignments. 2. Calculate approximate alignment scores. 3. Choose high-scoring sequences. 4. Apply S-W.
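The overall shape of these steps, a cheap filter followed by an exact rescoring of the survivors, can be sketched as below. This is only an illustration of the filter-then-refine idea and does not reproduce ParAlign's actual score estimate: as assumptions, the approximate score is simply the best ungapped diagonal segment, scoring is a flat +5 / -3 with gap penalty 8, and the names `ungapped_score`, `sw_score`, and `filter_then_align` are hypothetical.

```python
def pair_score(a, b, match=5, mismatch=-3):
    return match if a == b else mismatch


def ungapped_score(s, t):
    """Best-scoring gap-free segment over all diagonals (cheap estimate)."""
    best = 0
    for d in range(-(len(t) - 1), len(s)):
        i, j = (d, 0) if d >= 0 else (0, -d)
        run = 0
        while i < len(s) and j < len(t):
            run = max(0, run + pair_score(s[i], t[j]))
            best = max(best, run)
            i += 1
            j += 1
    return best


def sw_score(s, t, g=8):
    """Smith-Waterman score only, keeping one previous row in memory."""
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            cur[j] = max(prev[j - 1] + pair_score(s[i - 1], t[j - 1]),
                         prev[j] - g, cur[j - 1] - g, 0)
            best = max(best, cur[j])
        prev = cur
    return best


def filter_then_align(query, database, keep=3):
    """Rank by the cheap estimate, then rescore the top `keep` with S-W."""
    ranked = sorted(database, key=lambda t: ungapped_score(query, t),
                    reverse=True)
    rescored = [(t, sw_score(query, t)) for t in ranked[:keep]]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


db = ["PANTHER", "ELEPHANTS", "XXXXXXXX", "ELEGANT"]
print(filter_then_align("ELEPHANT", db))
```

In this toy database the ungapped estimate ties PANTHER and ELEGANT, while the gapped S-W pass ranks ELEGANT higher, which is the kind of refinement the final step is meant to provide.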

ParAlign: Microparallelism. Divide wide registers into smaller units. Perform the same operation on different data sources. Modern microprocessors have this technology built in.
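The idea of splitting one wide register into several small lanes can be emulated in plain Python with a 64-bit integer treated as eight 8-bit lanes. This is only a sketch of the principle: real SIMD instructions do this in hardware, and the lane width, the helper names, and the wraparound (rather than saturating) addition are assumptions made for illustration.

```python
LANES = 8                      # eight 8-bit lanes in one 64-bit word
LOW7 = 0x7F7F7F7F7F7F7F7F      # low 7 bits of every lane
HIGH = 0x8080808080808080      # top bit of every lane


def pack(values):
    """Pack up to eight small integers (0..255) into one 64-bit word."""
    word = 0
    for k, v in enumerate(values):
        word |= (v & 0xFF) << (8 * k)
    return word


def unpack(word):
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return [(word >> (8 * k)) & 0xFF for k in range(LANES)]


def add_lanes(a, b):
    """Add all eight lanes at once, each lane wrapping modulo 256.

    The low 7 bits are added normally (no carry can leave a lane), and the
    top bit of each lane is fixed up with XOR so carries never spill over.
    """
    return ((a & LOW7) + (b & LOW7)) ^ ((a ^ b) & HIGH)


x = pack([1, 2, 3, 4, 5, 6, 7, 8])
y = pack([10, 20, 30, 40, 50, 60, 70, 80])
print(unpack(add_lanes(x, y)))
```

One call to `add_lanes` updates eight values with a handful of integer operations, which is the same trade the slide describes: one instruction, several data items.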

ParAlign: Calculate Scores in Parallel

ParAlign: Estimate of Gaps

ParAlign: Apply S-W in Parallel

ParAlign: Summary. Uses: SIMD technology (single instruction, multiple data), S-W algorithm, heuristics to reduce computations. Machine requirement: modern microprocessor. Speed: fast. Sensitivity: medium.

TurboBLAST

TurboBLAST: Steps. 1. Divide the job into tasks, each comparing part of the query set against a partition of the database. 2. Apply BLAST to each task. 3. Merge the results.
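The divide, apply, and merge steps can be sketched with a thread pool. This is not TurboBLAST's code (which the slides describe as Java-based): `toy_search` is a hypothetical stand-in for a BLAST run that merely counts shared 3-letter words, and the chunk sizes and function names are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def chunks(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def toy_search(queries, db_part, w=3):
    """Hypothetical stand-in for BLAST: count words shared with each entry."""
    hits = []
    for q in queries:
        q_words = {q[i:i + w] for i in range(len(q) - w + 1)}
        for name, seq in db_part:
            n = sum(1 for i in range(len(seq) - w + 1) if seq[i:i + w] in q_words)
            if n:
                hits.append((q, name, n))
    return hits


def run_job(queries, database, q_size=2, db_size=2, workers=4):
    # 1. Divide: every query chunk is paired with every database partition.
    tasks = list(product(chunks(queries, q_size), chunks(database, db_size)))
    # 2. Apply the search to each task in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda task: toy_search(*task), tasks))
    # 3. Merge: flatten and order hits per query, best first.
    merged = [hit for part in partial for hit in part]
    return sorted(merged, key=lambda h: (h[0], -h[2], h[1]))


queries = ["ELEPHANT", "PANTHER"]
database = [("d1", "ANTELOPE"), ("d2", "PANTRY"), ("d3", "XXXX")]
print(run_job(queries, database, q_size=1, db_size=1))
```

Because each task is independent, the merged output does not depend on how the job was split; only the wall-clock time does.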

TurboBLAST: Implementation. A three-tier system. Components: Client, Master, Workers.

TurboBLAST: Schema (diagram of the job and task flow between Client, Master, and Workers)
Client: divides the job into tasks; writes results to file.
Master: TurboHub (sets up tasks, manages execution, coordinates Workers, provides virtual shared memory) and File Provider (serves database parts on request).
Workers: request tasks; divide each task, schedule subtasks, solve subtasks, merge results.
The Master distributes work not by pushing it out, but by simply posting information about what work needs to be done and letting the machines grab work from the remote locations.
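The pull model described here, where work is posted and idle machines grab it, can be sketched with a shared queue and a few threads. This is a single-process analogy under stated assumptions, not TurboHub's actual mechanism: the queue plays the role of the posted task list, threads play the role of Workers, and `run_pull_model` and `solve` are hypothetical names.

```python
import queue
import threading


def run_pull_model(tasks, solve, n_workers=3):
    """Post all tasks on a shared board and let workers pull them."""
    board = queue.Queue()
    for task in tasks:
        board.put(task)          # the "master" only posts work

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = board.get_nowait()   # grab the next available task
            except queue.Empty:
                return                      # nothing left to do
            outcome = solve(task)
            with lock:
                results.append((task, outcome))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return sorted(results)       # completion order varies, so sort


print(run_pull_model(range(10), lambda x: x * x))
```

A fast worker simply comes back for more tasks sooner, so load balances itself without the master tracking who is busy, and workers can be added or removed without changing the posting side.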

TurboBLAST: Client. Takes a BLAST job and divides it into a number of initial BLAST tasks. Submits these tasks to the Master. Retrieves the results and writes them to a file.

TurboBLAST: Master. Accepts tasks from Clients and sets them up for processing by the Workers. Includes TurboHub (the server portion of a parallel execution system). Includes the File Provider (a Java application that manages the databases).

TurboBLAST: Worker. Workers are processors. They run a Java application and perform the BLAST computations, merge the results, and are responsible for scheduling.

TurboHub. TurboHub is an execution engine for parallel and distributed Java applications. Scalable, high performance. Runs in a wide range of computing environments. Manages the flow of data through the workflows: schedules the components, transforms data between components, balances load, handles errors.

TurboBLAST: TurboHub. Manages task execution. Coordinates the Workers. Provides a virtual shared memory. Supports dynamic changes in the set of Workers. Supports fault tolerance.

TurboBLAST: File Provider. Maintains a copy of each database. Delivers all or part of each database to Workers as they require them.

TurboBLAST: Advantages. The size of each task is optimal: processing is efficient on the processor that computes the task. Large set of tasks: no wasted time for processors. No algorithm change. Support for all flavors of BLAST. Easy to update. Applicable to different environments (PC, Macintosh …).

TurboBLAST: Experiment
Input data: 500 proteins, 200 – 400 amino acids in each.
Database: 1,681,522,266 sequences.
Hardware: IBM Linux cluster of 8 dual-processor workstations, each with 2 Pentium III processors (996 MHz each), 2 Gbyte memory, and 100 Mbit Ethernet.

TurboBLAST: Results of Experiment

TurboBLAST: Summary. Divide and conquer: use many copies of BLAST in parallel. Uses: BLAST algorithm. Requirement for each machine: Java VM, local BLAST executable. Speed: very fast. Sensitivity: low.

Comparison of Algorithms/Products (chart comparing S-W, BLAST, ParAlign, and TurboBLAST on speed and sensitivity)

References
R.D. Bjornson, A.H. Sherman, S.B. Weston, N. Willard, J. Wing. “TurboBLAST: A Parallel Implementation of BLAST Built on the TurboHub.” Intl. Parallel and Distributed Processing Symposium (IPDPS).
T. Rognes. “ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches.” Oxford University Press, 2001.

Don’t ask any Questions, please…

P.S. A web site where you can donate your computer time to participate in the search for methods to cure cancer: