Published by Pauline Boone. Modified over 9 years ago.
1 ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases
Xiaochun Yang, Honglei Liu, Bin Wang
Northeastern University, China
2 Local Alignment
Sequences are similar over short conserved regions and dissimilar over the remaining regions.
Applications:
  Comparing long stretches of anonymous DNA
  Searching for unknown domains or motifs within proteins from different families
  …
3 Related Work
Smith-Waterman algorithm (1981): an exact approach, but very slow; not used for search
BLAST: an efficient but approximate approach
OASIS: an exact approach, efficient only for short query sequences (less than 60 characters)
BWT-SW: an exact approach, but inefficient
Our target: an efficient and exact approach, ALAE (Accelerating Local Alignment with affine gap Exactly)
4 Local Alignment
Input: two sequences (a text T and a query P), a similarity function, and a threshold H
Output: alignments between a substring of T and a substring of P with score >= H
5 Measure Similarity
Scoring scheme:
  An identical mapping: positive score s_a
  A mismatch: negative score s_b
  A gap of length r: negative score s_g + r×s_s (s_g: gap opening penalty, s_s: gap extension penalty)
Example (the arithmetic corresponds to s_a = 1, s_b = -3, s_g = -2, s_s = -1):
  S1: TGCGC-ATGGATTGACCGA
  S2: TGCGCCATTGAT--ACCGA
  sim(S1,S2) = 15×1 + (-3) + (-2-1) + (-2 + 2×(-1)) = 5
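The arithmetic in this example can be checked mechanically. The following is a minimal sketch, assuming the two strings are already aligned (equal length, with `-` marking a gap) and assuming the scheme s_a = 1, s_b = -3, s_g = -2, s_s = -1, which is read off the computation on the slide rather than stated in the transcript. The function name `similarity` is hypothetical.

```python
def similarity(s1, s2, sa=1, sb=-3, sg=-2, ss=-1):
    """Score two aligned strings under an affine gap scheme.

    A gap of length r costs sg + r * ss. The default values are an
    assumption inferred from the worked example on the slide.
    """
    if len(s1) != len(s2):
        raise ValueError("aligned strings must have equal length")
    score = 0
    prev_side = 0  # 0: no gap, 1: gap in s1, 2: gap in s2
    for a, b in zip(s1, s2):
        side = 1 if a == "-" else 2 if b == "-" else 0
        if side:
            if side != prev_side:
                score += sg        # a new gap opens
            score += ss            # every gap position pays the extension
        else:
            score += sa if a == b else sb
        prev_side = side
    return score


S1 = "TGCGC-ATGGATTGACCGA"
S2 = "TGCGCCATTGAT--ACCGA"
print(similarity(S1, S2))  # 15 matches, 1 mismatch, gaps of length 1 and 2
```

Tracking which sequence carries the gap means that two adjacent gaps on opposite sides are charged two opening penalties, which is one common convention; the slide does not say how that case is handled.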
6 A Basic Approach
For a substring X of T, entry (i, j) of the matrix of X and P holds the best alignment score of X[1,i] and any substring of P ending at position j.
7 A DP Algorithm
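The recurrence on this slide was an image and did not survive in the transcript. As a minimal sketch, assuming a standard three-matrix affine-gap recurrence with a free start in P, the matrix of one X against P could be computed as below. The scoring scheme s_a = 1, s_b = -3, s_g = -5, s_s = -2 is an inference from the boundary values -7, -9, -11, … in the slide 16 matrix, not something the transcript states. The function name `matrix_for_x` is hypothetical, and the assignment of the labels Ga and Gb to the two gap directions is also an assumption.

```python
NEG_INF = float("-inf")


def matrix_for_x(x, p, sa=1, sb=-3, sg=-5, ss=-2):
    """Return M, where M[i][j] is the best score of x[:i] aligned with
    any substring of p that ends at position j (1-based j, 0 = empty)."""
    rows, cols = len(x) + 1, len(p) + 1
    M = [[NEG_INF] * cols for _ in range(rows)]
    # Ga: alignment ends with a character of x against a gap (assumed label).
    Ga = [[NEG_INF] * cols for _ in range(rows)]
    # Gb: alignment ends with a character of p against a gap (assumed label).
    Gb = [[NEG_INF] * cols for _ in range(rows)]

    M[0] = [0] * cols  # the alignment may start anywhere in p for free
    for i in range(1, rows):
        # Column 0: x[:i] is aligned entirely against one gap of length i.
        Ga[i][0] = sg + i * ss
        M[i][0] = Ga[i][0]
        for j in range(1, cols):
            Ga[i][j] = max(Ga[i - 1][j] + ss, M[i - 1][j] + sg + ss)
            Gb[i][j] = max(Gb[i][j - 1] + ss, M[i][j - 1] + sg + ss)
            diag = M[i - 1][j - 1] + (sa if x[i - 1] == p[j - 1] else sb)
            M[i][j] = max(diag, Ga[i][j], Gb[i][j])
    return M


X = "GCTAAGCTAGT"
P = "GCTAAGCTAAGCTGC"
M = matrix_for_x(X, P)
print(M[3])  # row for X[1,3] = GCT
```

Under these assumptions the rows for T3, A4, and A5 agree with the corresponding rows of the unfiltered matrix on slide 16. This sketch fills every entry; it does not include any of the filtering that the rest of the talk is about.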
8 An Example of a DP Matrix
P = GCTAG, T = AAAGCTA.
[Figure: scoring scheme and DP matrices, including the gap matrices Ga and Gb; the entries are not recoverable from the transcript.]
9 A Basic Approach
[Figure: matrices for substrings of T against P, with entries indexed by (i, j); text positions are related by i = i_1 + t_1 = i_2 + t_2.]
10 Challenges
Speed
  Each matrix contains m ~ m×n entries, and there are n matrices.
  How can we avoid calculating most of the entries without impairing the accuracy of the alignment results?
In-memory algorithm
  Long sequences: both T and P are long
11 Contributions
Speed
  Prune unnecessary calculations
  Avoid duplicate calculations
In-memory algorithm
  Use a compressed suffix array
Mathematical analysis
12 Outline: local filterings, global filtering, reusing calculations, a hybrid algorithm
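13 Local Filterings: Length Filtering
[Figure: matrix region pruned by length filtering.]
14 Local Filterings: Score Filtering
[Figure: matrix region pruned by score filtering.]
15 Local Filterings: q-Prefix Filtering
[Figure: matrix region pruned by q-prefix filtering; figure annotation: "Simpler function".]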
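16 Comparison of Calculating One Matrix
P = G1 C2 T3 A4 A5 G6 C7 T8 A9 A10 G11 C12 T13 G14 C15
X = G1 C2 T3 A4 A5 G6 C7 T8 A9 G10 T11
Scoring scheme consistent with the entries: s_a = 1, s_b = -3, s_g = -5, s_s = -2; H = 3
Full matrix without filtering (rows: X, columns: P):
         G   C   T   A   A   G   C   T   A   A   G   C   T   G   C
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
G   -7   1  -3  -3  -3  -3   1  -3  -3  -3  -3   1  -3  -3   1  -3
C   -9  -6   2  -5  -6  -6  -6   2  -5  -6  -6  -6   2  -5  -6   2
T  -11  -8  -5   3  -4  -6  -8  -5   3  -4  -6  -8  -5   3  -4  -5
A  -13 -10  -7  -4   4  -3  -5  -7  -4   4  -3  -5  -7  -4   0  -7
A  -15 -12  -9  -6  -3   5  -2  -4  -6  -3   5  -2  -4  -6  -7  -3
G  -17 -14 -11  -8  -5  -2   6  -1  -3  -5  -2   6  -1  -3  -5  -7
C  -19 -16 -13 -10  -7  -4  -1   7   0  -2  -4  -1   7   0  -2  -4
T  -21 -18 -15 -12  -9  -6  -3   0   8   1  -1  -3   0   8   1  -1
A  -23 -20 -17 -14 -11  -8  -5  -2   1   9   2   0  -2   1   5  -2
G  -25 -22 -19 -16 -13 -10  -7  -4  -1   2   6   3  -3  -1   2   2
T  -27 -24 -21 -18 -15 -12  -9  -6  -3   0  -1   3   0  -2  -4  -1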
17 Comparison of Calculating One Matrix
P = G1 C2 T3 A4 A5 G6 C7 T8 A9 A10 G11 C12 T13 G14 C15
X = G1 C2 T3 A4 A5 G6 C7 T8 A9 G10 T11
Same scoring scheme, H = 3
Matrix with filtering: only the entries listed below remain (per row of X, left to right); the top row is 0 in every column.
  G1: -∞, 1, 1, 1, 1
  C2: 2, 2, 2, 2
  T3: 3, 3, 3
  A4: 4, 4
  A5: 5, 5
  G6: 6, 6
  C7: 7, 7
  T8: 8, 1, 8
  A9: 1, 9, 2, 1, 5
  G10: 2, 6, 3, 2
  T11: 3
18 Outline: local filterings, global filtering, reusing calculations, a hybrid algorithm
19 Global Filtering
[Figure: matrices of overlapping substrings of T, with i = i_1 + t_1 = i_2 + t_2; the shared region is marked "Pruned".]
20 Global Filtering
Pruned fork areas
Using X': alignment score >= S_a
It is unnecessary to calculate the fork area in the matrix of X and P.
Question: can these calculations be safely avoided based on the matrices already calculated?
21 Global Filtering
Update and check unnecessary calculations on the fly, using a Boolean matrix.
  (1) Space consuming: m×n space
  (2) Calculation order
[Figure: matrices of X' and X with the scoring scheme.]
22 Global Filtering
q-prefix domination: X' dominates X
[Figure: X' and X.]
23 Global Filtering
q-prefix domination: X' dominates X
Text T: construct dominations offline in O(n) time
Query P: check useless calculations on the fly
No particular calculation order is required.
24 Outline: local filterings, global filtering, reusing calculations, a hybrid algorithm
25 Reusing Score Calculations for P
Entries with a common prefix P_s can share alignment scores.
[Figure: reusable alignment entries.]
26 Reusing Score Calculations for P
If two forks have equivalent scores for their FGOEs, their entries with a common substring P_s can share alignment scores.
[Figure: reusable alignment entries.]
27 Outline: local filterings, global filtering, reusing calculations, a hybrid algorithm
28 A Hybrid Algorithm
Row by row
Column by column
29 Mathematical Analysis
Upper bound on the number of calculated entries for representative scoring schemes specified by BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi):
  DNA: 4.50mn^0.520 ~ 9.05mn^0.896
  Proteins: 8.28mn^0.364 ~ 7.49mn^0.723
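To get a feel for these bounds, they can be compared with the m×n entries of a full dynamic-programming table. The sketch below assumes m is the query length and n is the text length, and it uses n = 50 million (the text length quoted on the "Alignment Time" slide). The helper name `bound_fraction` is hypothetical.

```python
def bound_fraction(c, e, n):
    """Ratio of the bound c * m * n**e to the m * n entries of a full
    table. The query length m cancels, so only n is needed."""
    return c * n ** (e - 1)


n = 50_000_000  # text length, taken from the "Alignment Time" slide

dna_low = bound_fraction(4.50, 0.520, n)
dna_high = bound_fraction(9.05, 0.896, n)
protein_low = bound_fraction(8.28, 0.364, n)
protein_high = bound_fraction(7.49, 0.723, n)

for name, value in [("DNA, smallest exponent", dna_low),
                    ("DNA, largest exponent", dna_high),
                    ("protein, smallest exponent", protein_low),
                    ("protein, largest exponent", protein_high)]:
    print(f"{name}: {value:.6f} of m*n")
```

For the largest DNA exponent the ratio comes out above 1 at this text length, so that bound only becomes tighter than a full table for considerably longer texts; for the other three cases it is well below 1.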
30 Experiments
Data sets
  Human genome data set: length of a text, 50 million ~ 1 billion
  Mouse genome data set: length of each query, 1 thousand ~ 1 million
  Protein data set: length of a text, 10 million ~ 50 million; length of each query, 200 ~ 100,000
E-value: threshold
Scoring scheme: the same parameters as BLAST
Environment: GNU C++, Intel 2.93GHz quad-core i7 CPU, 8GB memory, 500GB disk, Ubuntu (Linux) operating system
31 Alignment Time and Number of Results
76 times faster than BWT-SW
16 times faster than BWT-SW
32 Filtering Ratio
33 Reusing Ratio
34 Index Size
35 Conclusions
High efficiency of ALAE
  Improves BWT-SW significantly
  Accelerates BLAST for most of the scoring schemes
In-memory approach using a compressed suffix array
Mathematical analysis
  Upper bound on calculated entries
36 Thank you!
Source code to be available at http://faculty.neu.edu.cn/yangxc/project
37 Simulating Searches Using Compressed Suffix Array
Match a q-length substring in the text
Identify forks
Find occurrences of a substring in the text
Calculate end positions of alignments
Get all suffixes with the same prefix as X_q
38 Review of Compressed Suffix Array
T = G1 C2 T3 A4 G5 C6
T' = G1 C2 T3 A4 G5 C6 $7
Conceptual matrix (rotations of T', then the same rotations sorted, with SA):
  Rotations   Sorted     SA
  GCTAGC$     $GCTAGC    7
  CTAGC$G     AGC$GCT    4
  TAGC$GC     C$GCTAG    6
  AGC$GCT     CTAGC$G    2
  GC$GCTA     GC$GCTA    5
  C$GCTAG     GCTAGC$    1
  $GCTAGC     TAGC$GC    3
BWT = CTGGA$C (last column of the sorted rotations)
SA[0,6] = 7 4 6 2 5 1 3
X = GC
Positions of GC in T: SA[4] = 5, SA[5] = 1
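The values on this slide can be reproduced with a small, uncompressed stand-in. The sketch below builds the suffix array by plain sorting and locates a pattern with the usual backward search over the BWT; it is an illustration of the idea only, not the compressed suffix array that ALAE uses, and the function names are hypothetical. Positions are 1-based, as on the slide.

```python
from collections import Counter


def suffix_array_and_bwt(text):
    """Return (SA, BWT) for text + '$', with 1-based positions in SA.

    Built by sorting suffixes directly, which is fine for a toy example
    but not how a compressed suffix array is constructed in practice.
    """
    t = text + "$"
    order = sorted(range(len(t)), key=lambda k: t[k:])
    sa = [k + 1 for k in order]
    # The BWT character is the one preceding each sorted suffix
    # (t[-1] wraps around to '$' for the suffix starting at 0).
    bwt = "".join(t[k - 1] for k in order)
    return sa, bwt


def backward_search(bwt, pattern):
    """Return the inclusive SA range (lo, hi) of suffixes that start
    with pattern, or None if the pattern does not occur."""
    counts = Counter(bwt)
    first = {}  # first[c]: number of characters smaller than c
    total = 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt)  # half-open range over SA rows
    for c in reversed(pattern):
        if c not in first:
            return None
        lo = first[c] + bwt[:lo].count(c)
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return None
    return lo, hi - 1


sa, bwt = suffix_array_and_bwt("GCTAGC")
lo, hi = backward_search(bwt, "GC")
positions = [sa[r] for r in range(lo, hi + 1)]
print(sa, bwt, (lo, hi), positions)
```

Counting occurrences with `bwt[:k].count(c)` is linear in the text length; a real index replaces it with a rank structure so that each step takes constant or near-constant time.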
39 Compressed Suffix Array: Reverse T to T^-1
T = G1 C2 T3 A4 G5 C6
T' = $0 G1 C2 T3 A4 G5 C6
T^-1 = C6 G5 A4 T3 C2 G1 $0
Conceptual matrix (rotations of T^-1, then the same rotations sorted, with SA):
  Rotations   Sorted     SA
  CGATCG$     $CGATCG    0
  GATCG$C     ATCG$CG    4
  ATCG$CG     CG$CGAT    2
  TCG$CGA     CGATCG$    6
  CG$CGAT     G$CGATC    1
  G$CGATC     GATCG$C    5
  $CGATCG     TCG$CGA    3
BWT = GGT$CCA (last column of the sorted rotations)
SA[0,6] = 0 4 2 6 1 5 3
X = GC, X^-1 = CG
Positions of CG in T^-1: SA[2] = 2, SA[3] = 6
Therefore, positions of GC in T: SA[2]-|X|+1 = 1, SA[3]-|X|+1 = 5
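The position mapping on this slide can be checked with a short sketch. It assumes, as the slide's labels suggest, that each suffix of the reversed text is labelled with the original position of its first character, so a match of the reversed pattern yields the end position of the pattern in T. The function names are hypothetical, and the linear scan over suffixes stands in for a real index lookup.

```python
def reversed_suffix_array(text):
    """Return (rev, SA, BWT) for the reversed text followed by '$'.

    SA holds original 1-based positions in text; the sentinel gets 0.
    """
    n = len(text)
    rev = text[::-1] + "$"
    order = sorted(range(n + 1), key=lambda k: rev[k:])
    sa = [n - k for k in order]          # rev[k] is text position n - k
    bwt = "".join(rev[k - 1] for k in order)
    return rev, sa, bwt


def occurrences(text, x):
    """Start positions (1-based) of x in text, found via the reversed text."""
    n = len(text)
    rev, sa, _ = reversed_suffix_array(text)
    x_rev = x[::-1]
    # A suffix of rev that starts with x_rev marks where x ends in text.
    ends = [label for label in sa if rev[n - label:].startswith(x_rev)]
    return sorted(end - len(x) + 1 for end in ends)


rev, sa, bwt = reversed_suffix_array("GCTAGC")
print(rev, sa, bwt)
print(occurrences("GCTAGC", "GC"))
```

Working on the reversed text means that extending a match corresponds to reading X from left to right, which fits a search that grows X one character at a time.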
40 Align Distinct Substrings in T with P
[Figure: a distinct substring X of T aligned with P; entries indexed by (i, j), with value v.]
41 Alignment Time
T = 50 million characters, P = 10 thousand characters
Smith-Waterman algorithm: 7.7 hours
ALAE: 25 ms