1 ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases Xiaochun Yang, Honglei Liu, Bin Wang Northeastern University, China
2 Local Alignment. Sequences are similar over short conserved regions and dissimilar over the remaining regions. Applications: comparing long stretches of anonymous DNA; searching for unknown domains or motifs within proteins from different families; …
3 Related Work. Smith-Waterman algorithm (1981): an exact approach, but very slow and not used for search. BLAST: an efficient but approximate approach. OASIS: an exact approach, efficient only for short query sequences (fewer than 60 characters). BWT-SW: an exact approach, but inefficient. Our target: an efficient and exact approach, ALAE (Accelerating Local Alignment with affine gap Exactly).
4 Local Alignment. Input: two sequences T and P, a similarity function, and a threshold H. Output: the alignments of T and P whose score >= H.
5 Measure Similarity. Scoring scheme: an identical mapping has a positive score s_a; a mismatch has a negative score s_b; a gap of length r has a negative score s_g + r×s_s, where s_g is the gap opening penalty and s_s is the gap extension penalty.
S1: TGCGC-ATGGATTGACCGA
S2: TGCGCCATTGAT--ACCGA
sim(S1,S2) = 15×1 + (-3) + (-2-1) + (-2 + 2×(-1)) = 5
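The scoring rule on this slide can be checked with a short sketch. The function name `affine_alignment_score` and the default values s_a = 1, s_b = -3, s_g = -2, s_s = -1 are assumptions read off the arithmetic of the example (15 matches, one mismatch, one gap of length 1, one gap of length 2); they are not stated as the scheme used in the experiments.

```python
def affine_alignment_score(s1, s2, sa=1, sb=-3, sg=-2, ss=-1):
    """Score two gapped strings of equal length.

    A run of r consecutive gap symbols in the same row costs sg + r * ss.
    Default parameter values are assumptions inferred from the slide's example.
    """
    if len(s1) != len(s2):
        raise ValueError("aligned strings must have equal length")
    score = 0
    prev_gap_row = None  # row holding the gap in the previous column, if any
    for a, b in zip(s1, s2):
        if a == "-" or b == "-":
            gap_row = 1 if a == "-" else 2
            if gap_row != prev_gap_row:
                score += sg  # a new gap run starts: pay the opening penalty once
            score += ss      # every gap column pays the extension penalty
            prev_gap_row = gap_row
        else:
            score += sa if a == b else sb
            prev_gap_row = None
    return score


S1 = "TGCGC-ATGGATTGACCGA"
S2 = "TGCGCCATTGAT--ACCGA"
print(affine_alignment_score(S1, S2))  # 5
```

With these assumed values the two gaps contribute -3 and -4, which together with 15 matches and one mismatch gives the score of 5 shown on the slide.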
6 A Basic Approach. For a substring X of T, entry (i, j) of the matrix holds the best alignment score of X[1,i] and any substring of P ending at position j.
7 A DP Algorithm
8 An Example of a DP Matrix P = GCTAG, T = AAAGCTA. Scoring scheme = Ga Gb
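As a point of reference for the DP slides, here is a minimal sketch of the textbook Smith-Waterman recurrence with affine gaps (Gotoh's formulation), run on the example P and T. It is the standard full-matrix baseline, not ALAE's pruned computation, and it does not reproduce the per-substring matrices of the basic approach. The function name and the scoring values (1, -3, -2, -1) are assumptions taken from the arithmetic on the Measure Similarity slide, since the scheme on this slide did not survive extraction.

```python
NEG_INF = float("-inf")


def local_alignment_score(t, p, sa=1, sb=-3, sg=-2, ss=-1):
    """Best local alignment score of t and p; a gap of length r costs sg + r * ss.

    Standard Gotoh recurrence, keeping only two rows of the matrix.
    Scoring defaults are assumptions, not values confirmed by the slides.
    """
    best = 0
    prev_h = [0] * (len(p) + 1)        # best score ending at (i-1, j)
    prev_f = [NEG_INF] * (len(p) + 1)  # best score ending at (i-1, j) with a vertical gap
    for i in range(1, len(t) + 1):
        cur_h = [0] * (len(p) + 1)
        cur_f = [NEG_INF] * (len(p) + 1)
        e = NEG_INF                    # best score ending at (i, j) with a horizontal gap
        for j in range(1, len(p) + 1):
            e = max(e + ss, cur_h[j - 1] + sg + ss)
            cur_f[j] = max(prev_f[j] + ss, prev_h[j] + sg + ss)
            diag = prev_h[j - 1] + (sa if t[i - 1] == p[j - 1] else sb)
            cur_h[j] = max(0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best


print(local_alignment_score("AAAGCTA", "GCTAG"))  # 4
```

Under the assumed scheme the best local alignment is the exact match GCTA, which scores 4. Every one of the (|T|+1)×(|P|+1) entries is computed here, which is the cost the filtering slides aim to avoid.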
9 A Basic Approach i = i 1 +t 1 = i 2 +t T P i j
10 Challenges. Speed: each matrix contains m to m×n entries, and there are n matrices. How can most of the entries be skipped without impairing the accuracy of the alignment results? In-memory algorithm for long sequences: both T and P are long.
11 Contributions. Speed: prune unnecessary calculations; avoid duplicate calculations. In-memory algorithm: use a compressed suffix array. Mathematical analysis.
12 Outline Local filterings Global filtering Reusing calculations A hybrid algorithm
13 Local filterings Length Filtering Pruned
14 Local filterings Score Filtering Pruned
15 Local filterings q-Prefix Filtering Pruned Simpler function
16 Comparison of Calculating One Matrix. P = GCTAAGCTAAGCTGC (positions 1 to 15), X = GCTAAGCTAGT (positions 1 to 11), threshold H = 3. [Figure: DP matrix of X against P]
17 Comparison of Calculating One Matrix. P = GCTAAGCTAAGCTGC (positions 1 to 15), X = GCTAAGCTAGT (positions 1 to 11), threshold H = 3. [Figure: DP matrix of X against P with calculated entries filled in]
18 Outline Local filterings Global filtering Reusing calculations A hybrid algorithm
19 Global Filtering i = i 1 +t 1 = i 2 +t Pruned
20 Global Filtering. Pruned fork areas. Using X’: alignment score >= S a. It is unnecessary to calculate the fork area in the matrix of X and P. Question: how can calculations be safely avoided based on the matrices already calculated?
21 Global Filtering X’ Update and check unnecessary calculations on-the-fly Scoring scheme Boolean matrix X (1)Space consuming: m×n space (2) Calculation order
22 Global Filtering X’ X q-prefix domination X’ dominates X
23 Global Filtering. q-prefix domination: X’ dominates X. Text T: construct dominations offline in O(n) time. Query P: check useless calculations on the fly. No particular calculation order is required.
24 Outline Local filterings Global filtering Reusing calculations A hybrid algorithm
25 Reusing score calculations for P. Entries with a common prefix P_s can share alignment scores (reusable alignment entries).
26 Reusing score calculations for P. If two forks have equivalent scores for their FGOEs, their entries with a common substring P_s can share alignment scores (reusable alignment entries).
27 Outline Local filterings Global filtering Reusing calculations A hybrid algorithm
28 A Hybrid Algorithm Row by row Column by column
29 Mathematical Analysis. Upper bound on the number of calculated entries for representative scoring schemes specified by BLAST. DNA: 4.50mn^0.520 ~ 9.05mn^0.896. Proteins: 8.28mn^0.364 ~ 7.49mn^0.723.
30 Experiments. Data sets: human genome data set (length of a text: 50 million ~ 1 billion); mouse genome data set (length of each query: 1 thousand ~ 1 million); protein data set (length of a text: 10 million ~ 50 million; length of each query: 200 ~ 100,000). E-value: threshold. Scoring scheme: the same parameters as BLAST. Environment: GNU C++, Intel 2.93GHz Quad Core i7 CPU and 8GB memory with a 500GB disk, running an Ubuntu (Linux) operating system.
31 Alignment Time and Number of Results 76 times faster than BWT-SW 16 times faster than BWT-SW
32 Filtering Ratio
33 Reusing Ratio
34 Index Size
35 Conclusions. High efficiency of ALAE: improves on BWT-SW significantly; accelerates BLAST for most of the scoring schemes. In-memory approach using a compressed suffix array. Mathematical analysis: upper bound on calculated entries.
36 Thank you! Source code to be available at
37 Simulating Searches Using Compressed Suffix Array Match a q-length substring in text Identify forks Find occurrences of a substring in text Calculate end positions of alignments Get all suffixes with the same prefix as X q
38 Review of Compressed Suffix Array. T = G1 C2 T3 A4 G5 C6, T’ = G1 C2 T3 A4 G5 C6 $7.
Conceptual matrix (rotations of T’): GCTAGC$, CTAGC$G, TAGC$GC, AGC$GCT, GC$GCTA, C$GCTAG, $GCTAGC.
Sorted rotations, rows SA[0,6]: $GCTAGC, AGC$GCT, C$GCTAG, CTAGC$G, GC$GCTA, GCTAGC$, TAGC$GC.
BWT = CTGGA$C.
X = GC. Positions of GC in T: SA[4] = 5, SA[5] = 1.
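The numbers on this slide can be reproduced with a small sketch. It builds a plain (uncompressed) suffix array by naive sorting, which is an assumption made for brevity; a compressed suffix array would answer the same range query without storing the suffixes. The helper names `suffix_array`, `bwt_from_sa`, and `sa_range` are hypothetical.

```python
import bisect


def suffix_array(text):
    """1-based start positions of the suffixes of text, in sorted order."""
    order = sorted(range(len(text)), key=lambda k: text[k:])
    return [k + 1 for k in order]


def bwt_from_sa(text, sa):
    """Character preceding each sorted suffix (wrapping around for position 1)."""
    return "".join(text[pos - 2] for pos in sa)


def sa_range(text, sa, x):
    """Half-open range [lo, hi) of SA rows whose suffix starts with x."""
    suffixes = [text[pos - 1:] for pos in sa]  # materialised only for illustration
    lo = bisect.bisect_left(suffixes, x)
    hi = bisect.bisect_right(suffixes, x + "\uffff")  # sentinel above any real character
    return lo, hi


text = "GCTAGC$"
sa = suffix_array(text)
print(sa)                     # [7, 4, 6, 2, 5, 1, 3]
print(bwt_from_sa(text, sa))  # CTGGA$C
lo, hi = sa_range(text, sa, "GC")
print([(row, sa[row]) for row in range(lo, hi)])  # [(4, 5), (5, 1)]
```

Rows 4 and 5 of the suffix array hold positions 5 and 1, matching SA[4] = 5 and SA[5] = 1 on the slide.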
39 Compressed Suffix Array: reverse T to T^-1. T = G1 C2 T3 A4 G5 C6, T’ = $0 G1 C2 T3 A4 G5 C6, T^-1 = C6 G5 A4 T3 C2 G1 $0.
Conceptual matrix (rotations of T^-1): CGATCG$, GATCG$C, ATCG$CG, TCG$CGA, CG$CGAT, G$CGATC, $CGATCG.
Sorted rotations, rows SA[0,6]: $CGATCG, ATCG$CG, CG$CGAT, CGATCG$, G$CGATC, GATCG$C, TCG$CGA.
BWT = GGT$CCA.
X = GC, X^-1 = CG. Positions of CG in T^-1: SA[2] = 2, SA[3] = 6. Therefore, positions of GC in T: SA[2]-|X|+1 = 1, SA[3]-|X|+1 = 5.
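The position mapping for the reversed text can be sketched the same way. As above, this uses a naively sorted, uncompressed suffix array as a stand-in, and the function names are hypothetical. SA values are stored as positions in the original T (0 for the sentinel), which is the convention the slide's numbers follow.

```python
import bisect


def reversed_suffix_array(t):
    """Sorted suffixes of reverse(t) + '$' and their positions in the original t."""
    rev = t[::-1] + "$"
    n = len(t)
    order = sorted(range(len(rev)), key=lambda k: rev[k:])
    sa = [n - k for k in order]          # original 1-based position; 0 for '$'
    suffixes = [rev[k:] for k in order]  # materialised only for illustration
    return sa, suffixes


def occurrences_via_reverse(t, x):
    """Start positions of x in t, found by searching reverse(x) in reverse(t)."""
    sa, suffixes = reversed_suffix_array(t)
    x_rev = x[::-1]
    lo = bisect.bisect_left(suffixes, x_rev)
    hi = bisect.bisect_right(suffixes, x_rev + "\uffff")
    # A match of reverse(x) that starts at original position p means x ends at p.
    return [sa[row] - len(x) + 1 for row in range(lo, hi)]


sa, _ = reversed_suffix_array("GCTAGC")
print(sa)                                     # [0, 4, 2, 6, 1, 5, 3]
print(occurrences_via_reverse("GCTAGC", "GC"))  # [1, 5]
```

Rows 2 and 3 hold the values 2 and 6, the end positions of GC in T, and subtracting |X| - 1 yields the start positions 1 and 5.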
40 Align Distinct Substring in T with P T P X … i v j v v
41 Alignment Time. T = 50 million characters, P = 10 thousand characters. Smith-Waterman algorithm: 7.7 hours. ALAE: 25 ms.