
1 ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases Xiaochun Yang, Honglei Liu, Bin Wang Northeastern University, China.



2 2 Local Alignment
Sequences are similar over short conserved regions and dissimilar over the remaining regions.
Applications:
- Comparing long stretches of anonymous DNA
- Searching for unknown domains or motifs within proteins from different families
- …

3 3 Related Work
- Smith-Waterman algorithm (1981): an exact approach but very slow; not used for search
- BLAST: an efficient but approximate approach
- OASIS: an exact approach, efficient only for short query sequences (fewer than 60 characters)
- BWT-SW: an exact approach but inefficient
Our target: an efficient and exact approach, ALAE (Accelerating Local Alignment with affine gap Exactly)

4 4 Local Alignment
Input: two sequences (a text T and a query P), a similarity function, and a threshold H.
Output: alignments between substrings of T and P with score >= H.

5 5 Measure Similarity
Scoring scheme:
- An identical mapping (match): positive score s_a
- A mismatch: negative score s_b
- A gap of length r: negative score s_g + r × s_s, where s_g is the gap opening penalty and s_s is the gap extension penalty
Example:
S1: TGCGC-ATGGATTGACCGA
S2: TGCGCCATTGAT--ACCGA
sim(S1, S2) = 15×1 + (-3) + (-2-1) + (-2 + 2×(-1)) = 5
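The arithmetic in this example can be checked mechanically. The sketch below assumes the scoring values implied by the slide's sum (match +1, mismatch -3, gap opening -2, gap extension -1); the function name `affine_score` is hypothetical and not taken from the ALAE source code.

```python
def affine_score(s1, s2, sa=1, sb=-3, sg=-2, ss=-1):
    """Score a gapped alignment given as two equal-length rows ('-' marks a gap).

    A gap of length r contributes sg + r * ss. The default values are
    assumptions read off the arithmetic on the slide.
    """
    assert len(s1) == len(s2)
    score = 0
    gap_row = None  # row (1 or 2) holding the current run of '-', or None
    for a, b in zip(s1, s2):
        if a == "-" or b == "-":
            row = 1 if a == "-" else 2
            if gap_row != row:
                score += sg  # a new gap opens
            score += ss      # every gap position pays the extension penalty
            gap_row = row
        else:
            score += sa if a == b else sb
            gap_row = None
    return score


print(affine_score("TGCGC-ATGGATTGACCGA", "TGCGCCATTGAT--ACCGA"))  # prints 5
```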

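6 6 A Basic Approach
For a substring X of T, entry (i, j) of its matrix holds the best alignment score of X[1,i] and any substring of P ending at position j. [Figure: text T, query P, substring X, positions i and j]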

7 7 A DP Algorithm
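The recurrence on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of a standard affine-gap dynamic program (Gotoh style) for the matrix described on the previous slide: M[i][j] is the best score of aligning X[1..i] with any substring of P ending at position j. The function name `align_matrix`, the helper matrices E and F, and the default scoring scheme (match 1, mismatch -3, gap opening -5, gap extension -2) are assumptions for illustration, not the authors' exact formulation.

```python
NEG_INF = float("-inf")


def align_matrix(X, P, sa=1, sb=-3, sg=-5, ss=-2):
    """Return M, where M[i][j] is the best score of X[:i] against any
    substring of P that ends at position j (1-based i and j)."""
    m, n = len(X), len(P)
    M = [[0] * (n + 1) for _ in range(m + 1)]        # row 0: alignment may start anywhere in P
    E = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # best score ending with X[i] against a gap
    F = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # best score ending with P[j] against a gap
    for i in range(1, m + 1):
        # Column 0: X[:i] aligned to nothing is one gap of length i.
        E[i][0] = M[i][0] = sg + i * ss
        for j in range(1, n + 1):
            E[i][j] = max(M[i - 1][j] + sg + ss, E[i - 1][j] + ss)
            F[i][j] = max(M[i][j - 1] + sg + ss, F[i][j - 1] + ss)
            diag = M[i - 1][j - 1] + (sa if X[i - 1] == P[j - 1] else sb)
            M[i][j] = max(diag, E[i][j], F[i][j])
    return M


M = align_matrix("GCTA", "GCTAG")
print(max(max(row) for row in M[1:]))  # best score over all non-empty prefixes of X
```

Filling one such matrix costs time proportional to |X| × |P|, which is the cost the filtering and reuse techniques in the later slides aim to avoid.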

8 8 An Example of a DP Matrix
P = GCTAG, T = AAAGCTA. [Figure: scoring scheme and the DP matrices, including the gap matrices Ga and Gb; entries not recoverable from the transcript]

9 9 A Basic Approach
i = i_1 + t_1 = i_2 + t_2 [Figure: text T, query P, positions i and j]

10 10 Challenges
Speed:
- Each matrix contains m ~ m×n entries
- There are n matrices
- How can we avoid calculating most entries without impairing the accuracy of the alignment results?
In-memory algorithm:
- Long sequences: both T and P are long

11 11 Contributions
Speed:
- Prune unnecessary calculations
- Avoid duplicate calculations
In-memory algorithm:
- Use compressed suffix array
Mathematical analysis

12 12 Outline Local filterings Global filtering Reusing calculations A hybrid algorithm

13 13 Local filterings: Length Filtering [Figure: pruned entries]

14 14 Local filterings: Score Filtering [Figure: pruned entries]

15 15 Local filterings: q-Prefix Filtering [Figure: pruned entries; simpler function]

16 16 Comparison of Calculating One Matrix
P = G1 C2 T3 A4 A5 G6 C7 T8 A9 A10 G11 C12 T13 G14 C15
X = G1 C2 T3 A4 A5 G6 C7 T8 A9 G10 T11
H = 3
[Figure: scoring scheme and the DP matrix of X against P with its entries filled in; the entries are not recoverable from the transcript]

17 17 Comparison of Calculating One Matrix
Same P, X, and H = 3 as on the previous slide.
[Figure: the same matrix with only a few entries along the matching diagonals calculated; the entries are not recoverable from the transcript]

18 18 Outline Local filterings Global filtering Reusing calculations A hybrid algorithm

19 19 Global Filtering
i = i_1 + t_1 = i_2 + t_2 [Figure: pruned area]

20 20 Global Filtering
Pruned fork areas. Using X': alignment score >= S_a. It is unnecessary to calculate the fork area in the matrix of X and P.
Question: can these calculations be safely avoided based on the matrices already calculated?

21 21 Global Filtering
Update and check unnecessary calculations on the fly using a Boolean matrix. [Figure: X', X, scoring scheme]
Drawbacks: (1) space consuming: m×n space; (2) requires a calculation order.

22 22 Global Filtering
q-prefix domination: X' dominates X. [Figure: X' and X]

23 23 Global Filtering
q-prefix domination: X' dominates X.
- Text T: construct dominations offline in O(n) time
- Query P: check useless calculations on the fly
A calculation order is unnecessary.

24 24 Outline Local filterings Global filtering Reusing calculations A hybrid algorithm

25 25 Reusing score calculations for P
Entries with a common prefix P_s can share alignment scores. [Figure: reusable alignment entries]

26 26 Reusing score calculations for P
If two forks have equivalent scores for their FGOEs, their entries with a common substring P_s can share alignment scores. [Figure: reusable alignment entries]

27 27 Outline Local filterings Global filtering Reusing calculations A hybrid algorithm

28 28 A Hybrid Algorithm Row by row Column by column

29 29 Mathematical Analysis
Upper bound on the number of calculated entries for representative scoring schemes specified by BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi):
- DNA: 4.50·m·n^0.520 ~ 9.05·m·n^0.896
- Proteins: 8.28·m·n^0.364 ~ 7.49·m·n^0.723
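To get a feel for these bounds, the sketch below evaluates them for one text and query size, reading each bound as c·m·n^e with the coefficient and exponent taken from the slide. The function name `entry_bound` is hypothetical, and the sizes (m = 10 thousand, n = 50 million) are example values borrowed from the alignment-time slide.

```python
def entry_bound(coeff, exponent, m, n):
    """Upper bound coeff * m * n**exponent on the number of calculated entries."""
    return coeff * m * n ** exponent


m, n = 10_000, 50_000_000  # example query and text lengths (assumed for illustration)
bounds = {
    "DNA, low": (4.50, 0.520),
    "DNA, high": (9.05, 0.896),
    "Proteins, low": (8.28, 0.364),
    "Proteins, high": (7.49, 0.723),
}
for name, (c, e) in bounds.items():
    print(f"{name}: {entry_bound(c, e, m, n):.3e} entries")
```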

30 30 Experiments
Data sets:
- Human genome data set. Length of a text: 50 million ~ 1 billion.
- Mouse genome data set. Length of each query: 1 thousand ~ 1 million.
- Protein data set. Length of a text: 10 million ~ 50 million. Length of each query: 200 ~ 100,000.
E-value: threshold
Scoring scheme: the same parameters as BLAST
Environment: GNU C++, Intel Core i7 quad-core 2.93 GHz CPU, 8 GB memory, and a 500 GB disk, running Ubuntu (Linux).

31 31 Alignment Time and Number of Results
[Figure annotations: 76 times faster than BWT-SW; 16 times faster than BWT-SW]

32 32 Filtering Ratio

33 33 Reusing Ratio

34 34 Index Size

35 35 Conclusions
High efficiency of ALAE:
- Improves on BWT-SW significantly
- Accelerates BLAST for most of the scoring schemes
In-memory approach using compressed suffix array
Mathematical analysis:
- Upper bound on calculated entries

36 36 Thank you! Source code to be available at http://faculty.neu.edu.cn/yangxc/project

37 37 Simulating Searches Using Compressed Suffix Array
- Match a q-length substring in text: identify forks
- Find occurrences of a substring in text: calculate end positions of alignments
- Get all suffixes with the same prefix as X_q

38 38 Review of Compressed Suffix Array
T = G1 C2 T3 A4 G5 C6
T' = G1 C2 T3 A4 G5 C6 $7
Conceptual matrix (sorted rotations of T') with SA[0,6]:
SA  rotation
7   $GCTAGC
4   AGC$GCT
6   C$GCTAG
2   CTAGC$G
5   GC$GCTA
1   GCTAGC$
3   TAGC$GC
BWT = CTGGA$C
X = GC. Positions of GC in T: SA[4] = 5, SA[5] = 1
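The numbers on this slide can be reproduced with a small, uncompressed sketch. It builds the suffix array and BWT of T' = GCTAGC$ naively and then runs a backward search for GC. The function names are hypothetical; a real compressed suffix array stores sampled and encoded versions of these structures rather than plain Python lists, so this only illustrates the search logic.

```python
def suffix_array(text):
    """1-based start positions of the suffixes of text in sorted order.
    Assumes text ends with a unique '$' that sorts before every other symbol."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def bwt(text, sa):
    """Last column of the sorted rotations: the symbol preceding each suffix
    (index -1 wraps around to '$' for the suffix starting at position 1)."""
    return "".join(text[i - 2] for i in sa)


def backward_search(pattern, text_bwt, sa):
    """Return the sorted 1-based start positions of pattern, found by
    narrowing an SA interval [lo, hi) one symbol at a time, right to left."""
    first_col = sorted(text_bwt)
    lo, hi = 0, len(text_bwt)
    for c in reversed(pattern):
        if c not in first_col:
            return []
        start = first_col.index(c)  # number of symbols smaller than c
        lo = start + text_bwt[:lo].count(c)
        hi = start + text_bwt[:hi].count(c)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


text = "GCTAGC$"
sa = suffix_array(text)
print(sa)                                       # prints [7, 4, 6, 2, 5, 1, 3]
print(bwt(text, sa))                            # prints CTGGA$C
print(backward_search("GC", bwt(text, sa), sa)) # prints [1, 5]
```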

39 39 Compressed Suffix Array: reverse T to T^-1
T = G1 C2 T3 A4 G5 C6
T' = $0 G1 C2 T3 A4 G5 C6
T^-1 = C6 G5 A4 T3 C2 G1 $0
Conceptual matrix (sorted rotations of T^-1) with SA[0,6]:
SA  rotation
0   $CGATCG
4   ATCG$CG
2   CG$CGAT
6   CGATCG$
1   G$CGATC
5   GATCG$C
3   TCG$CGA
BWT = GGT$CCA
X = GC, so X^-1 = CG. Positions of CG in T^-1: SA[2] = 2, SA[3] = 6
Therefore, positions of GC in T: SA[2]-|X|+1 = 1, SA[3]-|X|+1 = 5
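The position mapping on this slide can be illustrated without any index. The sketch below searches the reversed pattern in the reversed text with plain `str.find` (a stand-in for the compressed suffix array lookup), labels each hit with the original position in T of its first character, and converts it with the slide's formula, position - |X| + 1. The function name `positions_via_reverse` is hypothetical.

```python
def positions_via_reverse(T, X):
    """1-based start positions of X in T, found by searching X reversed
    in T reversed and mapping each hit back to T."""
    n, q = len(T), len(X)
    T_rev, X_rev = T[::-1], X[::-1]
    starts = []
    k = T_rev.find(X_rev)
    while k != -1:
        label = n - k                 # original position in T of the symbol T_rev[k]
        starts.append(label - q + 1)  # the hit ends at `label` in T, so it starts here
        k = T_rev.find(X_rev, k + 1)
    return sorted(starts)


print(positions_via_reverse("GCTAGC", "GC"))  # prints [1, 5]
```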

40 40 Align Distinct Substring in T with P
[Figure: text T, query P, distinct substring X, positions i and j]

41 41 Alignment Time
T = 50 million characters, P = 10 thousand characters
- Smith-Waterman algorithm: 7.7 hours
- ALAE: 25 ms

