Whole Genome Alignment using Multithreaded Parallel Implementation Hyma S Murthy CMSC 838 Presentation.


1 Whole Genome Alignment using Multithreaded Parallel Implementation Hyma S Murthy CMSC 838 Presentation

2 CMSC 838T – Presentation: Talk Overview
- Organization of the paper
  - Motivation
  - Technique: Pairwise Sequence Comparison using Dynamic Programming
  - EARTH Execution Model
  - Evaluation
  - Result Graphs
  - Conclusions
  - Related Work (MUMmer)

3 Motivation
- Importance of genome alignment:
  - Identify important matched and mismatched regions
  - “Matches” represent homolog pairs, conserved regions, or long repeats
  - “Mismatches” represent foreign fragments inserted by transposition, sequence reversal, or lateral transfer
  - Detect functional differences between pathogenic and non-pathogenic strains, evolutionary distance, mutations leading to disease, phenotypes, etc.
- Problems
  - Requires large computational power, memory, and execution time
  - Existing algorithms apply dynamic programming only to subsequences
  - Applying it to whole sequences is computationally intensive (O(n^2))
  - Thus applicable only to closely related genomes

4 Solution
- Multithreaded parallel implementation of a sequence alignment algorithm to align whole genomes
  - Parallel implementation of the dynamic programming technique
  - Uses the collective memory of several nodes
  - Uses multithreading to overlap computation and communication
  - Applicable to closely related as well as less similar genomes
  - Reliable output in reasonable time

5 Pairwise Sequence Comparison using Dynamic Programming
- Basic idea:
  - Quantify the similarity between pairs of symbols of the target sequences
  - Associate a score with each possible arrangement
  - Similarity is given by the highest score
  - Example:

      sequence x   A  T  A  A  G  T
      sequence y   A  T  G  C  A  G  T
      SCORE        1  1 -1 -1 -1 -1 -1    TOTAL = -3

      sequence x   A  T  A  -  A  G  T
      sequence y   A  T  G  C  A  G  T
      SCORE        1  1 -1 -2  1  1  1    TOTAL = 2

  - Model mutation by “gaps” (gaps indicate evolution of one sequence into another)
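The scoring in the second example can be reproduced with a few lines of Python. This is a minimal sketch, assuming the values read off the slide (+1 for a match, -1 for a mismatch, -2 for a gap); the function name `score_alignment` is hypothetical and not part of the presentation.

```python
# Scores read off the slide's second example; treat them as assumptions.
MATCH, MISMATCH, GAP = 1, -1, -2


def score_alignment(x: str, y: str) -> int:
    """Sum the column scores of two equal-length aligned strings ('-' is a gap)."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(x, y):
        if a == "-" or b == "-":
            total += GAP          # one symbol aligned against a gap
        elif a == b:
            total += MATCH        # identical symbols
        else:
            total += MISMATCH     # substitution
    return total


print(score_alignment("ATA-AGT", "ATGCAGT"))  # 2, as on the slide
```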

6 Dynamic Programming
- Smith and Waterman approach:
  - Aligns subsequences of the given sequences
  - Involves: (a) calculation of scores indicating similarity; (b) identification of the alignment(s) corresponding to the score
  - Builds the solution from previous solutions for smaller subsequences
  - Constructs a two-dimensional array, the “Similarity Matrix”, to store scores corresponding to partial results
  - The matrix represents all possible alignments of the input sequences
  - Recurrence equation (gp: gap penalty; ss: score of substituting one symbol for the other):

      SM[i, j] = max { SM[i, j-1] + gp,  SM[i-1, j-1] + ss,  SM[i-1, j] + gp,  0 }

7 Contd.
Each element of the matrix is the maximum of the following four values: left element + gap, upper-left element + score of replacing the vertical symbol with the horizontal symbol, upper element + gap, and 0.
Consider the following example, with the matrix partially filled in:

           T  G  A  T  G  G  A  G  G  T
        0  0  0  0  0  0  0  0  0  0  0
     G  0  0  1  0
     A  0  0  0  2
     T  0
     A  0
     G  0
     G  0

2 = max{0 + (-2), 1 + (1), 0 + (-2), 0}
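The recurrence and the worked cell above can be checked with a short sketch. It assumes the scores implied by the example (match +1, mismatch -1, gap -2) and takes the vertical sequence to be GATAGG, the sequence that appears in the alignments on the next slide; `similarity_matrix` is a hypothetical name.

```python
def similarity_matrix(x, y, match=1, mismatch=-1, gap=-2):
    """Fill the similarity matrix; x labels the rows, y labels the columns."""
    rows, cols = len(x) + 1, len(y) + 1
    sm = [[0] * cols for _ in range(rows)]   # row 0 and column 0 stay at 0
    for i in range(1, rows):
        for j in range(1, cols):
            ss = match if x[i - 1] == y[j - 1] else mismatch
            sm[i][j] = max(
                sm[i][j - 1] + gap,      # left element + gap
                sm[i - 1][j - 1] + ss,   # upper-left element + substitution score
                sm[i - 1][j] + gap,      # upper element + gap
                0,
            )
    return sm


sm = similarity_matrix("GATAGG", "TGATGGAGGT")
print(sm[2][3])  # 2, the cell worked out on the slide
for row in sm:
    print("".join(str(v) for v in row))
```

The printed rows can be compared digit by digit with the full matrix on the slide that follows.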

8 Identifying alignments
- Alignments with a score above a given threshold are reported
- Start at the end of the alignment and move backwards to the beginning

           T  G  A  T  G  G  A  G  G  T
        0  0  0  0  0  0  0  0  0  0  0
     G  0  0  1  0  0  1  1  0  1  1  0
     A  0  0  0  2  0  0  0  2  0  0  0
     T  0  1  0  0  3  1  0  0  1  0  1
     A  0  0  0  1  1  2  0  1  0  0  0
     G  0  0  1  0  0  2  3  1  2  1  0
     G  0  0  1  0  0  1  3  2  2  3  1

Alignments shown on the slide:
  T G A T - G G A G G T  /  G A T A G G
  T G A T G G A G G T  /  G A T A G G
  T G A T G G A G G T  /  G A T A G G
  T G A T G G A G G T  /  G A T A G G
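The backward walk can be sketched as follows. This is an illustrative version, not the presenter's code: it assumes the same scores as the example (match +1, mismatch -1, gap -2), treats a threshold of 3 as "report every cell scoring at least 3", and uses hypothetical names (`similarity_matrix`, `traceback`, `report`).

```python
def similarity_matrix(x, y, match=1, mismatch=-1, gap=-2):
    """Fill the similarity matrix; x labels the rows, y labels the columns."""
    sm = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            ss = match if x[i - 1] == y[j - 1] else mismatch
            sm[i][j] = max(sm[i][j - 1] + gap, sm[i - 1][j - 1] + ss,
                           sm[i - 1][j] + gap, 0)
    return sm


def traceback(sm, x, y, i, j, match=1, mismatch=-1, gap=-2):
    """Walk back from cell (i, j) until a 0 is reached; return the aligned pair."""
    ax, ay = [], []
    while i > 0 and j > 0 and sm[i][j] > 0:
        ss = match if x[i - 1] == y[j - 1] else mismatch
        if sm[i][j] == sm[i - 1][j - 1] + ss:    # came from the upper-left cell
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif sm[i][j] == sm[i - 1][j] + gap:     # came from above: gap in y
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:                                    # came from the left: gap in x
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay))


def report(x, y, threshold):
    """Trace back from every cell whose score reaches the threshold."""
    sm = similarity_matrix(x, y)
    return [traceback(sm, x, y, i, j)
            for i in range(1, len(x) + 1)
            for j in range(1, len(y) + 1)
            if sm[i][j] >= threshold]


for ax, ay in report("GATAGG", "TGATGGAGGT", threshold=3):
    print(ax, ay)
```

With these assumptions the example matrix contains four cells scoring 3, so four alignments are reported; one of them is the gapped alignment GATAGG against GAT-GG.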

9 EARTH Execution Model
- A program is viewed as a collection of threads
  - Execution order is determined by data and control dependencies
- Threads are further divided into fibers
  - Fibers are non-preemptive
  - All data is ready before a fiber executes
- Each node in EARTH has
  - an execution unit (EU)
  - a synchronization unit (SU)
  - queues linking the two (RQ and EQ)
  - local memory
  - an interface to the interconnection network

10 EARTH Architecture
[Figure: EARTH node architecture. Labels: PE, local memory, memory bus, EU, SU, EQ, RQ, node, interconnection network.]

11 Multithreaded parallel implementation
- Divide the scoring matrix as follows
  - into horizontal strips (each element of input sequence X)
  - strips into rectangular blocks
- Blocks are calculated by two fibers within a thread
  - only one fiber is active at any given time
- Each thread is assigned to one horizontal strip
  - the computation is done by the even/odd fibers within the thread
- Initialization delay of reading sequences from the server is minimized
  - Each thread needs only the piece of input sequence X it grabs, not the whole sequence
  - After computing a block, a fiber sends a piece of sequence Y, among other information, to the fiber beneath it
- The computation of the anti-diagonal elements of the matrix is shown on the next slide
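The strip-and-block decomposition can be illustrated with a sequential simulation. This sketch is not EARTH or Threaded-C code and uses no real threads or fibers; it only shows, under assumed block sizes, that blocks on the same anti-diagonal depend solely on blocks from earlier anti-diagonals and could therefore be computed concurrently. The names `fill_block`, `wavefront_fill`, `strip_height`, and `block_width` are hypothetical.

```python
def fill_block(sm, x, y, rows, cols, match=1, mismatch=-1, gap=-2):
    """Fill one rectangular block; rows and cols are (start, stop) index ranges."""
    for i in range(*rows):
        for j in range(*cols):
            ss = match if x[i - 1] == y[j - 1] else mismatch
            sm[i][j] = max(sm[i][j - 1] + gap, sm[i - 1][j - 1] + ss,
                           sm[i - 1][j] + gap, 0)


def wavefront_fill(x, y, strip_height, block_width):
    """Fill the matrix block by block, one anti-diagonal of blocks at a time."""
    n_rows, n_cols = len(x) + 1, len(y) + 1
    sm = [[0] * n_cols for _ in range(n_rows)]
    # Horizontal strips (one per simulated thread) and the blocks within a strip.
    strips = [(r, min(r + strip_height, n_rows)) for r in range(1, n_rows, strip_height)]
    blocks = [(c, min(c + block_width, n_cols)) for c in range(1, n_cols, block_width)]
    schedule = []
    for d in range(len(strips) + len(blocks) - 1):
        # Block (s, b) needs only blocks (s-1, b), (s, b-1), and (s-1, b-1),
        # all of which lie on earlier anti-diagonals.
        wave = [(s, d - s) for s in range(len(strips)) if 0 <= d - s < len(blocks)]
        for s, b in wave:
            fill_block(sm, x, y, strips[s], blocks[b])
        schedule.append(wave)
    return sm, schedule


sm, schedule = wavefront_fill("GATAGG", "TGATGGAGGT", strip_height=2, block_width=5)
for step, wave in enumerate(schedule):
    print(step, wave)   # (strip, block) pairs that could run in parallel
```

In a real run each strip would belong to a different thread, and the data a block needs from the strip above would arrive as a message rather than through shared memory.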

12 Computation of similarity matrix on EARTH
[Figure: threads A and B, each with even (E) and odd (O) fibers, computing blocks P1 to P4. Legend: inactive fiber, active fiber, ack, sync, data.]

13 Evaluation
- Experimental environment
  - Beowulf implementation of EARTH
  - Uses a Beowulf machine consisting of 64 nodes, each containing two 200MHz Pentium Pro processors (a total of 128 processors and 128MB of memory)
  - Sequences of lengths ranging from 30K to 900K were tested
  - Execution times for the sequential and parallel implementations of the Smith and Waterman algorithm are given below:

      Implementation         Time
      Seq. Smith-Waterman    53 hours
      ATGC on 16 nodes       3.3 hours
      ATGC on 32 nodes       2.1 hours
      ATGC on 64 nodes       1.3 hours
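The timings above translate into speedups with simple arithmetic. The sketch below uses only the four figures from the table; defining efficiency as speedup divided by the number of nodes is an assumption made here for illustration (each node holds two processors), and `speedups` is a hypothetical name.

```python
SEQ_HOURS = 53.0                              # sequential Smith-Waterman
ATGC_HOURS = {16: 3.3, 32: 2.1, 64: 1.3}      # nodes -> hours, from the table


def speedups(seq_hours, parallel_hours):
    """Return {nodes: (speedup, speedup per node)} rounded for display."""
    out = {}
    for nodes, hours in parallel_hours.items():
        speedup = seq_hours / hours
        out[nodes] = (round(speedup, 1), round(speedup / nodes, 2))
    return out


for nodes, (s, e) in speedups(SEQ_HOURS, ATGC_HOURS).items():
    print(f"{nodes} nodes: speedup {s}x, per-node efficiency {e}")
```

The speedup keeps growing with the node count (about 16x, 25x, and 41x), while the per-node efficiency falls from about 1.0 to about 0.64.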

14 Evaluation
- The multithreaded parallel implementation is named ATGC (Another Tool for Genomic Comparison)
- The experiment aligns
  - human and mouse mitochondrial genomes
  - human and Drosophila mitochondrial genomes
- Reason for selection
  - human and mouse are closely related, while the other pair is less similar
- The results were confirmed with MUMmer, another whole genome alignment tool
- Result graphs show that ATGC is more accurate than MUMmer (verified using NCBI BLAST)

15 Result Graphs

16 Result Graphs (contd.)

17 Conclusions
- Comparison of whole genomes requires high computation and memory
- Made convenient by using a multithreaded parallel implementation of dynamic programming on a cluster of PCs
- Accurate results obtained in a reasonable amount of time
- Aligns closely related as well as less similar genomes
- Slower, but plays an important role where high accuracy is needed (as seen in the comparison with MUMmer for the human and Drosophila mitochondrial genomes)

18 Related work: MUMmer (Maximal Unique Match)
- Given genomes A and B
  - find all maximal, unique, matching subsequences (MUMs)
  - extract the longest possible set of matches that occur in the same order in both genomes
  - close the gaps
  - output the alignment
- Maximal unique match (MUM):
  - occurs exactly once in both genomes A and B
  - not contained in any longer MUM
- Key idea in identifying MUMs is to build a suffix tree for genomes A and B
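The MUM definition can be made concrete with a brute-force sketch. Note that this deliberately swaps the suffix tree for a naive scan, so it illustrates the definition only and is far too slow for real genomes; `find_mums`, `count_occurrences`, and the `min_len` parameter are hypothetical names introduced here.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=1):
    """Return (pos_in_a, pos_in_b, length) for every maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue                      # extendable to the left: not maximal
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1                        # extend to the right as far as possible
            if k < min_len:
                continue
            s = a[i:i + k]
            # Unique: the matched string occurs exactly once in each sequence.
            if count_occurrences(a, s) == 1 and count_occurrences(b, s) == 1:
                mums.append((i, j, k))
    return mums


print(find_mums("GACGTA", "TACGTC", min_len=2))  # [(1, 1, 4), (4, 0, 2)]
```

In this toy input the two MUMs ("ACGT" and "TA") appear in opposite order in the two sequences, which is the situation the "same order in both genomes" step has to resolve by keeping only a consistently ordered subset.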

