1
A Parallel Solution to Global Sequence Comparisons
CSC 583 – Parallel Programming
By: Nnamdi Ihuegbu
12/19/03
2
Brief Introduction
Human Genome Project (and others) -> vast amount of biological data
Venture: Computer Science and Biology (BCB) -> genetic databases (map, genomic, proteomic)
Expected date of completed map of human genome: end of 2003
Next stage: sequence comparison and sequence-protein function
Useful to pharmaceutical companies (CADD – e.g. SKB’s Relenza)
3
Results - Sequence
Current Sequence Generation Technologies
Maxam-Gilbert (uses chemicals to cleave DNA at a specific base/length)
Sanger (uses enzymatic procedures to produce DNA based on a specific base, i.e. length)
4
Derivation of nucleotide sequence from human chromosome
5
Sequence Comparison Methods
Types of sequence comparisons/alignments:
Global (“How similar are these two sequences?”)
  Finds the best overall alignment between two sequences
  1970: Needleman and Wunsch (global, dynamic programming)
  Shortcoming: can miss small regions of similarity within two subsequences
Local (“What sequences in a database are most similar to this sequence?”)
  Finds the best subsequence match between two sequences
  1981: Smith and Waterman (local, dynamic programming)
  Shortcoming: not computationally efficient, slow
6
Results - Sequence
7
Heuristic Search (Quick, Approximate)
Quickly search for “words” that match the sequence, then recursively perform a local search around each matched word until no further matches are found
FASTA (1988), BLAST (1990)
Shortcomings: approximate, not exact; significance is judged by E-value (significant if < 0.05)
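The word-matching idea on this slide can be sketched in a few lines. This is only an illustration of seeding and extension, not the actual FASTA or BLAST algorithm: the function names `find_word_hits` and `extend_hit`, the word length `k=3`, and the exact-match extension rule are all assumptions made for the example.

```python
from collections import defaultdict

def find_word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a length-k word matches exactly."""
    index = defaultdict(list)              # word -> start positions in the query
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    hits = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            hits.append((i, j))
    return hits

def extend_hit(query, subject, i, j, k=3):
    """Grow a word hit left and right while characters keep matching (ungapped).

    Returns (query_start, subject_start, length) of the extended match.
    """
    left = 0
    while i - left > 0 and j - left > 0 and query[i - left - 1] == subject[j - left - 1]:
        left += 1
    right = k
    while (i + right < len(query) and j + right < len(subject)
           and query[i + right] == subject[j + right]):
        right += 1
    return i - left, j - left, left + right

hits = find_word_hits("ACGTAC", "TTACGTTT")
extended = {extend_hit("ACGTAC", "TTACGTTT", i, j) for i, j in hits}
```

Several word hits can extend to the same region, which is why the usage lines collect the results in a set. Real tools score the extensions and stop on a score drop rather than on the first mismatch.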
8
Results – Sequence (CSC Implementation)
Sequence alignment can be represented as matrices and graphs (using rules and costs)
When converted into a directed acyclic graph, the solution of the sequence alignment is the path with the maximum total value (max. path problem)
9
Sequencing (CSC Implementation)
Needleman-Wunsch Dynamic Program
Diagonal edge = character match (or mismatch); down edge = gap in string 2; across edge = gap in string 1
Can be solved dynamically as a ‘running max score’ (RMS)
For each D(i,j), best RMS = max(west + gap1, north + gap2, NW + current_score)
Replace D(i,j) with this max
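A minimal sequential sketch of the running-max recurrence above, written in Python for illustration. The scoring values (match = 3, mismatch = -3, gap = -2), the function name `needleman_wunsch`, and the use of `_` for a gap are assumptions for the example, not values confirmed by the slides. String 1 runs down the rows, so a down move is a gap in string 2 and an across move is a gap in string 1.

```python
def needleman_wunsch(s1, s2, match=3, mismatch=-3, gap=-2):
    """Global alignment; returns (score, aligned_s1, aligned_s2)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):               # first column: s1 prefix against gaps
        D[i][0] = i * gap
    for j in range(1, cols):               # first row: s2 prefix against gaps
        D[0][j] = j * gap

    def pair(i, j):                        # score of aligning s1[i-1] with s2[j-1]
        return match if s1[i - 1] == s2[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            D[i][j] = max(D[i][j - 1] + gap,             # west: gap in string 1
                          D[i - 1][j] + gap,             # north: gap in string 2
                          D[i - 1][j - 1] + pair(i, j))  # NW: aligned characters

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a1, a2 = [], []
    i, j = len(s1), len(s2)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + pair(i, j):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("_"); i -= 1
        else:
            a1.append("_"); a2.append(s2[j - 1]); j -= 1
    return D[-1][-1], "".join(reversed(a1)), "".join(reversed(a2))

score, top, bottom = needleman_wunsch("GATTACA", "GATACA")
```

Filling the matrix takes time proportional to the product of the two sequence lengths, which is the cost the parallel version tries to spread across processes.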
10
Parallel Solution
Work (Slaves) allocated in stripes
11
Parallel Solution (Cont’d)
Allocating Strips in SubMatrix
[Figure: score submatrix with column labels A, T, T and row labels T, G, [Gap]; cell values surviving in the transcript: 3, 3, -3, -2, -6]
12
Parallel Results
Each cell in each strip computes the maximum of its NEIGHBORS (running max)
[Figure: the score submatrix (columns A, T, T; rows T, G, [Gap]) and the traced path; path text as extracted: T A G T -3 _ T -6 -10]
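The strip scheme can be illustrated with a small wavefront schedule. This is a sketch under stated assumptions, not the implementation from the presentation: each worker is assumed to own a strip of columns, rows are cut into blocks of `block_rows`, and a block is computed once the blocks to its north and west are finished. The names `fill_block` and `striped_score` and the scoring values are hypothetical. Python threads are used only to show the dependency structure; they do not give a real speedup for this CPU-bound loop.

```python
from concurrent.futures import ThreadPoolExecutor

def fill_block(D, s1, s2, r0, r1, c0, c1, match, mismatch, gap):
    """Fill cells D[r0:r1][c0:c1] with the running max of their neighbours."""
    for i in range(r0, r1):
        for j in range(c0, c1):
            diag = D[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            D[i][j] = max(D[i][j - 1] + gap, D[i - 1][j] + gap, diag)

def striped_score(s1, s2, n_strips=2, block_rows=2, match=3, mismatch=-3, gap=-2):
    """Global alignment score computed strip by strip in wavefront order."""
    rows, cols = len(s1) + 1, len(s2) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        D[i][0] = i * gap
    for j in range(1, cols):
        D[0][j] = j * gap

    width = max(1, -(-(cols - 1) // n_strips))      # ceiling division
    col_bounds = [(c, min(c + width, cols)) for c in range(1, cols, width)]
    row_bounds = [(r, min(r + block_rows, rows)) for r in range(1, rows, block_rows)]

    with ThreadPoolExecutor(max_workers=max(1, len(col_bounds))) as pool:
        # Blocks with the same (row block + strip) index are independent.
        for wave in range(len(row_bounds) + len(col_bounds) - 1):
            futures = []
            for s, (c0, c1) in enumerate(col_bounds):
                b = wave - s
                if 0 <= b < len(row_bounds):
                    r0, r1 = row_bounds[b]
                    futures.append(pool.submit(fill_block, D, s1, s2,
                                               r0, r1, c0, c1, match, mismatch, gap))
            for f in futures:                        # barrier between wavefronts
                f.result()
    return D[-1][-1]

result = striped_score("GATTACA", "GATACA", n_strips=3, block_rows=2)
```

In a message-passing setting, the barrier would be replaced by each worker sending the last column of a finished block to its right-hand neighbour, so strips work as a pipeline.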
13
Improvements
Parallel Smith-Waterman (localized: start, continue while the score is > 0, then end); (BLAZE, Stanford)
Pipeline implementation on an actual mesh topology
Other possible data structures for traversing the data in search of the shortest path (e.g. trees, specialized)
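The "continue while > 0, then end" rule for Smith-Waterman can be sketched sequentially as follows. This is an illustrative Python sketch, not the BLAZE implementation: the function name `smith_waterman` and the scoring values (match = 3, mismatch = -3, gap = -2) are assumptions. The only changes from the global recurrence are that every cell is clamped at 0 and the traceback starts at the highest-scoring cell and stops at the first 0.

```python
def smith_waterman(s1, s2, match=3, mismatch=-3, gap=-2):
    """Local alignment; returns (best_score, aligned_s1, aligned_s2)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def pair(i, j):                        # score of aligning s1[i-1] with s2[j-1]
        return match if s1[i - 1] == s2[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0,                             # restart: never go negative
                          H[i][j - 1] + gap,
                          H[i - 1][j] + gap,
                          H[i - 1][j - 1] + pair(i, j))
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the running score drops to 0.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + pair(i, j):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("_"); i -= 1
        else:
            a1.append("_"); a2.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))

local = smith_waterman("CCCGATTACACCC", "TTGATTACATT")
```

Because the cell dependencies are the same as in the global case, the same strip allocation would apply; only the clamp at 0 and the search for the maximum cell differ.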
14
Improvements (Cont’d)
Faster means of comparing and aligning multiple sequences simultaneously (e.g. comparing a novel protein sequence to a family)
15
Any Questions?