Published by Hubert Johnston. Modified over 8 years ago.
1
A Sub-quadratic Sequence Alignment Algorithm
2
Global alignment
Alignment graph for S = aacgacga, T = ctacgaga
Complexity: O(n^2)
V(i,j) = max { V(i-1,j-1) + δ(S[i], T[j]), V(i-1,j) + δ(S[i], -), V(i,j-1) + δ(-, T[j]) }
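The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function names and the scoring function `example_delta` (match +1, mismatch -1, gap -1) are assumptions chosen for the example, and any other δ can be passed in.

```python
def global_alignment_score(s, t, delta):
    """Fill the (len(s)+1) x (len(t)+1) table V and return V[len(s)][len(t)]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # First column: s[0..i) aligned against gaps only.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + delta(s[i - 1], "-")
    # First row: t[0..j) aligned against gaps only.
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + delta("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),  # (mis)match
                V[i - 1][j] + delta(s[i - 1], "-"),           # gap in t
                V[i][j - 1] + delta("-", t[j - 1]),           # gap in s
            )
    return V[n][m]


def example_delta(x, y):
    # Hypothetical scoring function: gap -1, match +1, mismatch -1.
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


print(global_alignment_score("acg", "ag", example_delta))  # 1
```

The two nested loops visit every cell once, which is where the O(n^2) bound comes from.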
3
FOUR RUSSIANS ALGORITHM
5
UNRESTRICTED SCORING FUNCTION
6
Main idea: Compress the sequences
S = aacgacga, T = ctacgaga
LZ-78: Divide the sequence into distinct words
[Figure: S = aacgacga with its words numbered 1-4 and T = ctacgaga with its words numbered 1-5, each next to its LZ-78 trie (nodes 0-4 for S, nodes 0-5 for T).]
The number of distinct words:
7
Main idea
– Compute the alignment score in each block
– Propagate the scores between the adjacent blocks
[Figure: alignment graph of S and T partitioned into blocks such as 3/4, 3/2, 5/4, 5/2, shown with the Trie for T and the Trie for S.]
8
Main idea
– Compress the sequence into words
– Pre-compute the score for each block
– Do alignment between blocks
Note:
– Replace normal characters by words
– Operate on blocks
9
COMPRESS THE SEQUENCE LZ-78
10
S = aacgacga, T = ctacgaga
LZ-78: Divide the sequence into distinct words
[Figure: S = aacgacga with its words numbered 1-4 and T = ctacgaga with its words numbered 1-5, each next to its LZ-78 trie (nodes 0-4 for S, nodes 0-5 for T).]
The number of distinct words:
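The LZ-78 word division can be sketched as follows. This is an illustrative implementation under assumptions: the function name `lz78_parse` and the dictionary-based trie are not from the slides, and a trailing piece that merely repeats an earlier word is returned as a final, non-distinct phrase.

```python
def lz78_parse(seq):
    """Split seq into LZ-78 phrases: each new phrase is the shortest
    prefix of the remaining text that is not yet a word in the trie."""
    trie = {}        # (parent node id, character) -> child node id; 0 is the root
    phrases = []
    node, start = 0, 0
    for pos, ch in enumerate(seq):
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # still inside a known word
            continue
        trie[(node, ch)] = len(trie) + 1      # new trie node = new distinct word
        phrases.append(seq[start:pos + 1])
        node, start = 0, pos + 1
    if start < len(seq):
        phrases.append(seq[start:])           # leftover that repeats an earlier word
    return phrases


print(lz78_parse("aacgacga"))  # ['a', 'ac', 'g', 'acg', 'a']
print(lz78_parse("ctacgaga"))  # ['c', 't', 'a', 'cg', 'ag', 'a']
```

On the two example sequences this yields four distinct words for S and five for T, followed in each case by a leftover "a" that repeats an existing word.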
11
LZ-78
Theorem (Lempel and Ziv):
– Constant alphabet sequence S of length n
– The maximal number of distinct phrases in S is O(n/log n).
Tighter upper bound: O(hn/log n)
– h is the entropy factor
– a real number, 0 < h ≤ 1
– Entropy is small when the sequence is repetitive
12
COMPUTE THE ALIGNMENT SCORE IN EACH BLOCK
13
Compute the alignment score in each block
[Figure: alignment graph of S and T partitioned into blocks such as 3/4, 3/2, 5/4, 5/2.]
14
Given
– Input border: I
– Block
Compute
– Output border: O
[Figure: a block with its input border I and output border O, border entries numbered 0-5.]
15
Matrices
– I[i]: the input border value
– DIST[i,j]: weight of the optimal path from entry i of the input border to entry j of the output border
– OUT[i,j]: merges the information from the input border I and DIST: OUT[i,j] = I[i] + DIST[i,j]
– O[j] = max{OUT[i,j] for i=1..n}
[Figure: a block with its input border I and output border O, border entries numbered 0-5.]
16
DIST and OUT matrix example
Block – sub-sequences “acg”, “ag”
I (input borders): I0 = 1, I1 = 2, I2 = 3, I3 = 2, I4 = 1, I5 = 3
DIST matrix (rows I0..I5, columns O0..O5; △ = no path)
        O0   O1   O2   O3   O4   O5
I0       0   -1   -2   -3    △    △
I1      -1   -1   -2   -1   -3    △
I2      -2    0    0    1   -1   -3
I3       △   -2   -2    0   -2   -2
I4       △    △   -2    0   -1   -1
I5       △    △    △   -2   -1    0
OUT matrix (OUT[i,j] = I[i] + DIST[i,j])
        O0   O1   O2   O3   O4   O5
I0       1    0   -1   -2   -∞   -∞
I1       1    1    0    1   -1   -∞
I2       1    3    3    4    2    0
I3      -∞    0    0    2    0    0
I4      -∞   -∞   -1    1    0    0
I5      -∞   -∞   -∞    1    2    3
max col  1    3    3    4    2    3
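The example can be checked mechanically. The sketch below builds OUT from I and DIST and takes column maxima. The I and DIST values are transcribed from the slide's example as best they can be read, so treat them as an assumption. The name `output_border` and the use of `-inf` for the △ entries are choices made here for illustration.

```python
NEG_INF = float("-inf")  # stands in for the △ (no path) entries

I = [1, 2, 3, 2, 1, 3]
DIST = [
    [0, -1, -2, -3, NEG_INF, NEG_INF],
    [-1, -1, -2, -1, -3, NEG_INF],
    [-2, 0, 0, 1, -1, -3],
    [NEG_INF, -2, -2, 0, -2, -2],
    [NEG_INF, NEG_INF, -2, 0, -1, -1],
    [NEG_INF, NEG_INF, NEG_INF, -2, -1, 0],
]


def output_border(I, DIST):
    """Return (OUT, O) with OUT[i][j] = I[i] + DIST[i][j]
    and O[j] = max over i of OUT[i][j]."""
    rows, cols = len(DIST), len(DIST[0])
    OUT = [[I[i] + DIST[i][j] for j in range(cols)] for i in range(rows)]
    O = [max(OUT[i][j] for i in range(rows)) for j in range(cols)]
    return OUT, O


OUT, O = output_border(I, DIST)
print(O)  # [1, 3, 3, 4, 2, 3]
```

This direct approach touches every entry of OUT, so it is quadratic in the border size.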
17
For each block, given two sub-sequences S1, S2
– Compute (from scratch) DIST in O(n*m) time
– Given I and DIST, compute OUT in O(n*m) time
– Given OUT[i,j], compute O in O(m*n) time
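To make "DIST from scratch" concrete, here is a brute-force sketch that runs one small dynamic program per input border entry. Several things are assumptions made for illustration and are not stated in the slides: the border numbering (input entries run from the bottom-left corner up the left side and then along the top, output entries from the bottom-left corner along the bottom and then up the right side), the scoring function `example_delta` (match +1, mismatch -1, gap -1), and the names `block_dist`, `row_word` and `col_word`. This version is slower than the per-block bound quoted above and is meant only to pin down what DIST contains.

```python
NEG_INF = float("-inf")  # no path between the two border entries


def example_delta(x, y):
    # Hypothetical scoring: gap -1, match +1, mismatch -1.
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


def block_dist(row_word, col_word, delta):
    """DIST[i][j] = best path weight from input border entry i to output
    border entry j of the block; row_word runs down, col_word runs across."""
    n, m = len(row_word), len(col_word)
    # Grid vertices are (r, c) with 0 <= r <= n and 0 <= c <= m.
    inputs = [(n - i, 0) for i in range(n + 1)] + [(0, c) for c in range(1, m + 1)]
    outputs = [(n, c) for c in range(m + 1)] + [(n - k, m) for k in range(1, n + 1)]
    DIST = []
    for r0, c0 in inputs:
        best = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
        best[r0][c0] = 0
        for r in range(r0, n + 1):
            for c in range(c0, m + 1):
                if (r, c) == (r0, c0):
                    continue
                cands = []
                if r > r0:   # move down: row_word[r-1] against a gap
                    cands.append(best[r - 1][c] + delta(row_word[r - 1], "-"))
                if c > c0:   # move right: col_word[c-1] against a gap
                    cands.append(best[r][c - 1] + delta("-", col_word[c - 1]))
                if r > r0 and c > c0:   # diagonal: align the two characters
                    cands.append(best[r - 1][c - 1]
                                 + delta(row_word[r - 1], col_word[c - 1]))
                best[r][c] = max(cands)
        DIST.append([best[r][c] for r, c in outputs])
    return DIST


for row in block_dist("ag", "acg", example_delta):
    print(row)
```

Each of the n+m+1 input entries triggers an O(n*m) table fill, so this brute-force version costs O((n+m)*n*m) per block.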
18
Review
– Compress the sequence
– Pre-compute DIST[i,j] for each block
– Compute border values of each block
Remaining questions
– How to compute DIST[i,j] efficiently?
– How to compute O[j] from I[i] and DIST[i,j] efficiently?
[Figure: alignment graph partitioned into blocks by the words of S and T.]
19
COMPUTE O[J] EFFICIENTLY
20
Compute O[j] efficiently For each block of two sub-sequences S1, S2 Given – I[i] – DIST[i,j] Compute – O[j]
22
Compute O without explicit OUT
Block – sub-sequences “acg”, “ag”
SMAWK takes the DIST matrix and I (input borders) and produces O.
I0 = 1, I1 = 2, I2 = 3, I3 = 2, I4 = 1, I5 = 3
O0..O5 = 1, 3, 3, 4, 2, 3
[Figure: the block with its DIST matrix, input border I and output border O.]
23
Given DIST[i,j] and I[i], we can compute O[j] in O(n+m) time
– Without creating OUT[i,j]
How? Why?
24
Why?
Aggarwal, Park and Schmidt observed that DIST and OUT matrices are Monge arrays.
Definition: a matrix M[0…m, 0…n] is totally monotone if either condition 1 or 2 below holds for all a,b = 0…m; c,d = 0…n:
1. Convex condition: M[a,c] ≥ M[b,c] ⟹ M[a,d] ≥ M[b,d] for all a<b and c<d.
2. Concave condition: M[a,c] ≤ M[b,c] ⟹ M[a,d] ≤ M[b,d] for all a<b and c<d.
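A brute-force checker makes the definition concrete. The sketch assumes the convex condition reads M[a,c] ≥ M[b,c] ⟹ M[a,d] ≥ M[b,d] and the concave condition is the same statement with ≤. The function names and the small test matrix are illustrative and not taken from the slides.

```python
from itertools import combinations


def is_convex_totally_monotone(M):
    """True if M[a][c] >= M[b][c] implies M[a][d] >= M[b][d]
    for every pair of rows a < b and every pair of columns c < d."""
    rows, cols = range(len(M)), range(len(M[0]))
    return all(
        not (M[a][c] >= M[b][c]) or M[a][d] >= M[b][d]
        for a, b in combinations(rows, 2)
        for c, d in combinations(cols, 2)
    )


def is_concave_totally_monotone(M):
    """True if M[a][c] <= M[b][c] implies M[a][d] <= M[b][d]
    for every pair of rows a < b and every pair of columns c < d."""
    rows, cols = range(len(M)), range(len(M[0]))
    return all(
        not (M[a][c] <= M[b][c]) or M[a][d] <= M[b][d]
        for a, b in combinations(rows, 2)
        for c, d in combinations(cols, 2)
    )


# Illustrative matrix (an assumption, not from the slides): M[i][j] = -(i - j)^2.
M = [[-(i - j) ** 2 for j in range(3)] for i in range(3)]
print(is_concave_totally_monotone(M))  # True
print(is_convex_totally_monotone(M))   # False
```

The checker inspects every pair of rows and columns, so it is only suitable for validating small examples.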
25
How?
Aggarwal et al. gave a recursive algorithm, called SMAWK, which can find all row and column maxima of a totally monotone matrix by querying only O(n) elements of the matrix.
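SMAWK itself is fairly intricate, so the sketch below swaps in a simpler divide-and-conquer search that exploits the same structure. It is not SMAWK and needs roughly O((rows + cols) log cols) queries rather than a linear number. It assumes every entry is defined and that the matrix satisfies the concave condition (M[a,c] ≤ M[b,c] ⟹ M[a,d] ≤ M[b,d] for a<b, c<d), under which the bottommost maximum of each column never moves up as the column index grows. The matrix is accessed through a callback, so OUT is never materialized. All names and the test data are illustrative.

```python
def column_maxima(value, n_rows, n_cols):
    """Column maxima of an implicit concave totally monotone matrix.
    value(i, j) returns entry (i, j); only queried entries are computed."""
    best = [None] * n_cols

    def solve(c_lo, c_hi, r_lo, r_hi):
        if c_lo > c_hi:
            return
        mid = (c_lo + c_hi) // 2
        arg = r_lo
        for r in range(r_lo, r_hi + 1):
            if value(r, mid) >= value(arg, mid):  # >= keeps the bottommost maximum
                arg = r
        best[mid] = value(arg, mid)
        solve(c_lo, mid - 1, r_lo, arg)   # columns to the left: rows up to arg
        solve(mid + 1, c_hi, arg, r_hi)   # columns to the right: rows from arg

    solve(0, n_cols - 1, 0, n_rows - 1)
    return best


# Hypothetical data: I is an input border and DIST[i][j] = -(i - j)^2,
# which satisfies the concave condition; OUT(i, j) = I[i] + DIST[i][j].
I = [3, 0, 5, 1]
out = lambda i, j: I[i] - (i - j) ** 2
print(column_maxima(out, n_rows=4, n_cols=6))
```

Adding I[i] to every entry of row i does not disturb the condition for this family of matrices, which is why the search can run on OUT directly.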
26
Why is DIST[i,j] totally monotone?
The concave condition: if b-c is better than a-c, then b-d is better than a-d.
[Figure: a block with input border entries a, b and output border entries c, d.]
27
Other problem
Rectangle problem of DIST
– Set upper right corner of OUT to -∞
– Set lower left corner of OUT to -(n+i-1)*k
– Preserve the totally monotone property of OUT
[Figure: the DIST matrix with △ entries in its upper right and lower left corners.]
28
COMPUTE DIST[I,J] EFFICIENTLY
29
Compute DIST[i,j] for block(5/4)
[Figure: alignment graph partitioned into blocks such as 3/4, 3/2, 5/4, 5/2, with the Trie for T and the Trie for S.]
30
DIST matrix
35
Only column m in DIST[i,j] is new.
The DIST block can be updated in O(m+n) time.
36
MAINTAINING DIRECT ACCESS TO DIST TABLE
37
[Figure: Trie for T, Trie for S and the block grid, annotated with DIST values.]
38
[Figure: Trie for T, Trie for S and the block grid, annotated with further DIST values.]
39
DIST
[Figure: Trie for T, Trie for S and the block grid, with the DIST values collected into a table.]
41
Complexity
Assume |S| = |T| = n
– Number of words in S, T = O(hn/log n)
– Number of blocks in alignment graph: O(h^2 n^2 / (log n)^2)
For each block
– Update new DIST block: O(t), where t = size of the border
– Create direct access table: O(t)
Propagating I/O across blocks
– SMAWK: O(t)
Sum of the sizes of all borders is O(hn^2 / log n)
Total complexity: O(hn^2 / log n)
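The bound on the sum of border sizes follows from a short calculation. As a sketch, assume S is cut into k_S words of lengths ℓ_1, …, ℓ_{k_S} and T into k_T words of lengths ℓ'_1, …, ℓ'_{k_T}, with both length sums equal to n, and that the border of block (i, j) has size proportional to ℓ_i + ℓ'_j. These symbols are introduced here for illustration and do not appear in the slides.

```latex
\sum_{i=1}^{k_S} \sum_{j=1}^{k_T} \left( \ell_i + \ell'_j \right)
  = k_T \sum_{i=1}^{k_S} \ell_i \; + \; k_S \sum_{j=1}^{k_T} \ell'_j
  = n \, (k_S + k_T)
  = O\!\left( \frac{h n^2}{\log n} \right),
\qquad \text{since } k_S,\, k_T = O\!\left( \frac{h n}{\log n} \right).
```

With O(t) work per block of border size t, the total running time is of the same order as this sum.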
42
Other extensions
– Trace
– Reducing the space complexity for discrete scoring
– Local alignment
43
References
Crochemore, M.; Landau, G. M. & Ziv-Ukelson, M. A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. ACM-SIAM, 2002, 679-688.
Some pictures from 葉恆青