1 ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases Xiaochun Yang, Honglei Liu, Bin Wang Northeastern University, China.

2 Local Alignment
Sequences that are similar over short conserved regions and dissimilar over the remaining regions.
Applications:
- Comparing long stretches of anonymous DNA
- Searching for unknown domains or motifs within proteins from different families
- …

3 Related Work
- Smith-Waterman algorithm (1981): an exact approach, but very slow; not used for search
- BLAST: an efficient but approximate approach
- OASIS: an exact approach, efficient only for short query sequences (fewer than 60 characters)
- BWT-SW: an exact approach, but inefficient
Our target: an efficient and exact approach, ALAE (Accelerating Local Alignment with affine gap Exactly)

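4 Local Alignment
Input: two sequences (a text T and a query P), a similarity function, and a threshold H
Output: alignments between substrings of T and P with score >= H
[Figure: aligned regions of T and P with score >= H]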

5 Measure Similarity
Scoring scheme:
- An identical mapping: positive score s_a
- A mismatch: negative score s_b
- A gap of length r: negative score s_g + r×s_s, where s_g is the gap opening penalty and s_s is the gap extension penalty
Example:
S1: TGCGC-ATGGATTGACCGA
S2: TGCGCCATTGAT--ACCGA
sim(S1,S2) = 15×1 + (-3) + (-2-1) + (-2 + 2×(-1)) = 5
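The arithmetic on slide 5 can be checked with a short script. This is a minimal sketch, not the authors' code: the function name `alignment_score` and its parameter names are hypothetical, and the default values (1, -3, -2, -1) are read off the worked sum on the slide.

```python
def alignment_score(s1, s2, sa=1, sb=-3, sg=-2, ss=-1):
    """Score two already-aligned strings of equal length ('-' marks a gap).

    A gap of length r costs sg + r * ss (affine gap penalty).
    Default values are assumptions taken from the slide's worked example.
    """
    assert len(s1) == len(s2), "aligned strings must have equal length"
    score = 0
    prev_gap_row = None  # which string carried the gap in the previous column
    for a, b in zip(s1, s2):
        if a == "-" or b == "-":
            gap_row = 0 if a == "-" else 1
            if gap_row != prev_gap_row:
                score += sg  # a new gap starts: pay the opening penalty once
            score += ss      # every gap position pays the extension penalty
            prev_gap_row = gap_row
        else:
            score += sa if a == b else sb
            prev_gap_row = None
    return score


print(alignment_score("TGCGC-ATGGATTGACCGA", "TGCGCCATTGAT--ACCGA"))  # 5
```

The two gaps contribute -3 and -4, the single mismatch contributes -3, and the 15 matches contribute 15, which reproduces the total of 5.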

6 A Basic Approach
Matrix entry (i, j): the best alignment score of X[1,i] and any substring of P ending at position j.
[Figure: a substring X of T aligned against P; i indexes X and j indexes P]

7 A DP Algorithm
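The recurrences shown on this slide did not survive extraction. As a stand-in, the following is a minimal sketch of the textbook affine-gap local alignment DP (Smith-Waterman with Gotoh's three matrices), not ALAE's own formulation. The function name, the matrix names `H`, `E`, `F`, and the default scores (taken from the slide 5 example) are assumptions.

```python
def local_alignment_score(t, p, sa=1, sb=-3, sg=-2, ss=-1):
    """Best local alignment score of t and p with affine gaps.

    Textbook Smith-Waterman/Gotoh DP in O(len(t) * len(p)) time.
    A gap of length r costs sg + r * ss.
    """
    neg_inf = float("-inf")
    n, m = len(t), len(p)
    # H: best score of an alignment ending at (i, j), floored at 0
    # E: best score ending with a gap that consumes p[j-1]
    # F: best score ending with a gap that consumes t[i-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # extend an existing gap, or open a new one from H
            E[i][j] = max(E[i][j - 1] + ss, H[i][j - 1] + sg + ss)
            F[i][j] = max(F[i - 1][j] + ss, H[i - 1][j] + sg + ss)
            diag = H[i - 1][j - 1] + (sa if t[i - 1] == p[j - 1] else sb)
            H[i][j] = max(0, diag, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


# Sequences from slide 8; the scores are the slide 5 values, an assumption here.
print(local_alignment_score("AAAGCTA", "GCTAG"))  # 4 (the shared substring GCTA)
```

This baseline fills every entry, which is exactly the cost the filtering and reuse techniques in the following slides aim to avoid.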

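8 An Example of a DP Matrix
P = GCTAG, T = AAAGCTA.
[Figure: scoring scheme and the DP matrices for this example, including the matrices labeled Ga and Gb; cell values not recoverable from the transcript]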

9 A Basic Approach
[Figure: matrices for T and P, with i = i_1 + t_1 = i_2 + t_2]

10 Challenges
Speed:
- Each matrix contains m ~ m×n entries
- n matrices
- How can most entries be left uncalculated without impairing the accuracy of the alignment results?
In-memory algorithm:
- Long sequences: both T and P are long

11 Contributions
Speed:
- Prune unnecessary calculations
- Avoid duplicate calculations
In-memory algorithm:
- Use a compressed suffix array
Mathematical analysis

12 Outline
- Local filterings
- Global filtering
- Reusing calculations
- A hybrid algorithm

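13 Local filterings: Length Filtering
[Figure: entries pruned by length filtering]

14 Local filterings: Score Filtering
[Figure: entries pruned by score filtering]

15 Local filterings: q-Prefix Filtering
[Figure: entries pruned by q-prefix filtering; annotated "Simpler function"]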

16 Comparison of Calculating One Matrix
P = GCTAAGCTAAGCTGC (positions 1 to 15)
X = GCTAAGCTAGT (positions 1 to 11)
Threshold H = 3
[Table: DP matrix of X (rows) against P (columns) with every entry calculated; scoring scheme and cell values garbled in extraction]

17 Comparison of Calculating One Matrix
P = GCTAAGCTAAGCTGC (positions 1 to 15)
X = GCTAAGCTAGT (positions 1 to 11)
Threshold H = 3
[Table: the same matrix of X (rows) against P (columns) with only a small fraction of the entries calculated; scoring scheme and cell values garbled in extraction]

19 Global Filtering
[Figure: matrices with i = i_1 + t_1 = i_2 + t_2; pruned entries marked]

20 Global Filtering
Pruned fork areas
- Using X': alignment score >= s_a
- It is unnecessary to calculate the fork area in the matrix of X and P
Question: how can calculations be safely avoided based on matrices that have already been calculated?

21 Global Filtering
Update and check unnecessary calculations on the fly with a Boolean matrix. Issues:
(1) Space consuming: m×n space
(2) Calculation order
[Figure: X' and X with the scoring scheme and a Boolean matrix]

22 Global Filtering
q-prefix domination: X' dominates X
[Figure: X' and X]

23 Global Filtering
q-prefix domination: X' dominates X
- Text T: construct dominations offline in O(n) time
- Query P: check useless calculations on the fly
A calculation order is unnecessary.

25 Reusing score calculations for P
Entries with a common prefix P_s can share alignment scores.
[Figure: reusable alignment entries]

26 Reusing score calculations for P
If two forks have equivalent scores for their FGOEs, their entries with a common substring P_s can share alignment scores.
[Figure: reusable alignment entries]

28 A Hybrid Algorithm
- Row by row
- Column by column

29 Mathematical Analysis
Upper bound on the number of calculated entries for representative scoring schemes specified by BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi):
- DNA: 4.50mn^0.520 ~ 9.05mn^0.896
- Proteins: 8.28mn^0.364 ~ 7.49mn^0.723
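To get a feel for these bounds, each one can be compared with the m×n entries of a single full matrix. The sketch below only does that arithmetic; the function name `bound_ratio` is hypothetical, and the sample n of 50 million is borrowed from the experiment slides rather than from the analysis itself.

```python
def bound_ratio(c, e, n):
    """Ratio of the bound c * m * n**e to the full m * n entries.

    The factor m cancels, so the ratio c * n**(e - 1) depends only on n.
    """
    return c * n ** (e - 1)


n = 50_000_000  # assumed text length, for illustration only
bounds = [
    ("DNA, lower end", 4.50, 0.520),
    ("DNA, upper end", 9.05, 0.896),
    ("Proteins, lower end", 8.28, 0.364),
    ("Proteins, upper end", 7.49, 0.723),
]
for label, c, e in bounds:
    print(f"{label}: bound / (m*n) = {bound_ratio(c, e, n):.3g}")
```

A ratio well below 1 means the bound guarantees that only a small fraction of the matrix is calculated; a ratio above 1 means the bound is looser than m×n at that text length.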

30 Experiments
Data sets:
- Human genome data set. Length of a text: 50 million ~ 1 billion.
- Mouse genome data set. Length of each query: 1 thousand ~ 1 million.
- Protein data set. Length of a text: 10 million ~ 50 million. Length of each query: 200 ~ 100,000.
E-value: threshold
Scoring scheme: the same parameters as BLAST
Environment: GNU C++ on an Intel Core i7 quad-core 2.93 GHz CPU with 8 GB memory and a 500 GB disk, running Ubuntu (Linux).

31 Alignment Time and Number of Results
[Figure: alignment time and number of results; annotated "76 times faster than BWT-SW" and "16 times faster than BWT-SW"]

32 Filtering Ratio

33 Reusing Ratio

34 Index Size

35 Conclusions
High efficiency of ALAE:
- Improves on BWT-SW significantly
- Accelerates BLAST for most of the scoring schemes
In-memory approach using a compressed suffix array
Mathematical analysis:
- Upper bound on calculated entries

36 Thank you! Source code to be available at http://faculty.neu.edu.cn/yangxc/project

37 Simulating Searches Using Compressed Suffix Array
- Match a q-length substring in the text: identify forks
- Find occurrences of a substring in the text: calculate end positions of alignments
- Get all suffixes with the same prefix as X_q

38 Review of Compressed Suffix Array
T = GCTAGC (positions 1 to 6), T' = GCTAGC$ (positions 1 to 7)
Conceptual matrix (rotations of T'): GCTAGC$, CTAGC$G, TAGC$GC, AGC$GCT, GC$GCTA, C$GCTAG, $GCTAGC
Sorted rotations with SA[0,6] and BWT:
i  SA[i]  Sorted rotation  BWT[i]
0  7      $GCTAGC          C
1  4      AGC$GCT          T
2  6      C$GCTAG          G
3  2      CTAGC$G          G
4  5      GC$GCTA          A
5  1      GCTAGC$          $
6  3      TAGC$GC          C
BWT = CTGGA$C
X = GC. Positions of GC in T:
- SA[4] = 5
- SA[5] = 1
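The example on slide 38 can be reproduced with an uncompressed suffix array. This is a minimal sketch under that assumption: a real compressed suffix array stores the same information in far less space, and the helper names `suffix_array`, `bwt`, and `sa_range` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def suffix_array(text):
    """1-based start positions of all suffixes of text, in sorted order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def bwt(text, sa):
    """Character preceding each sorted suffix (wrapping for the first suffix)."""
    return "".join(text[i - 2] for i in sa)


def sa_range(text, sa, pattern):
    """Half-open range [lo, hi) of SA entries whose suffix starts with pattern."""
    prefixes = [text[i - 1:i - 1 + len(pattern)] for i in sa]  # already sorted
    return bisect_left(prefixes, pattern), bisect_right(prefixes, pattern)


t_prime = "GCTAGC" + "$"  # '$' sorts before A, C, G, T in ASCII
sa = suffix_array(t_prime)
print(sa)                  # [7, 4, 6, 2, 5, 1, 3]
print(bwt(t_prime, sa))    # CTGGA$C
lo, hi = sa_range(t_prime, sa, "GC")
print(lo, hi, sa[lo:hi])   # 4 6 [5, 1]
```

The naive construction sorts whole suffixes and is only suitable for small examples like this one.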

39 Compressed Suffix Array: reverse T to T^-1
T = GCTAGC (positions 1 to 6), T' = $GCTAGC (positions 0 to 6)
T^-1 = CGATCG$ (characters carry their positions in T': C6 G5 A4 T3 C2 G1 $0)
Conceptual matrix (rotations of T^-1): CGATCG$, GATCG$C, ATCG$CG, TCG$CGA, CG$CGAT, G$CGATC, $CGATCG
Sorted rotations with SA[0,6] and BWT:
i  SA[i]  Sorted rotation  BWT[i]
0  0      $CGATCG          G
1  4      ATCG$CG          T
2  2      CG$CGAT          T
3  6      CGATCG$          $
4  1      G$CGATC          C
5  5      GATCG$C          C
6  3      TCG$CGA          A
BWT = GGT$CCA
X = GC, so the reversed pattern is X^-1 = CG. Positions of CG in T^-1:
- SA[2] = 2
- SA[3] = 6
Therefore, positions of GC in T:
- SA[2]-|X|+1 = 1
- SA[3]-|X|+1 = 5
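The position arithmetic on slide 39 can be checked the same way on the reversed text. The sketch below is an illustration with an uncompressed suffix array and a linear scan in place of the index lookup; the helper names are hypothetical, and labels follow the slide's convention that each character of T^-1 keeps its position in T (with $ at position 0).

```python
def reversed_suffix_array(t):
    """Return (T^-1, SA), where SA lists each sorted suffix of T^-1 by the
    position in T of its first character ('$' is labeled 0)."""
    rev = t[::-1] + "$"
    n = len(t)
    order = sorted(range(len(rev)), key=lambda k: rev[k:])
    return rev, [n - k for k in order]  # index k in T^-1 carries label n - k


def start_positions(t, x):
    """Start positions of x in t (1-based), found by matching the reversed
    pattern against suffixes of T^-1 and converting end to start positions."""
    rev, sa = reversed_suffix_array(t)
    n = len(t)
    x_rev = x[::-1]
    # Linear scan for clarity; an index would locate this SA range directly.
    ends = [label for label in sa if rev[n - label:].startswith(x_rev)]
    return sorted(end - len(x) + 1 for end in ends)


rev, sa = reversed_suffix_array("GCTAGC")
print(rev, sa)                          # CGATCG$ [0, 4, 2, 6, 1, 5, 3]
print(start_positions("GCTAGC", "GC"))  # [1, 5]
```

Each hit in T^-1 marks where an occurrence of X ends in T, which is why |X| - 1 is subtracted to recover the start.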

40 Align Distinct Substring in T with P
[Figure: a distinct substring X of T aligned against P; axes i and j, with entries labeled v]

41 Alignment Time
T = 50 million characters, P = 10 thousand characters
- Smith-Waterman algorithm: 7.7 hours
- ALAE: 25 ms
