A Sub-quadratic Sequence Alignment Algorithm
Global alignment
Alignment graph for S = aacgacga, T = ctacgaga
Complexity: $O(n^2)$
$V(i,j) = \max\{\, V(i-1,j-1) + \delta(S[i], T[j]),\; V(i-1,j) + \delta(S[i], -),\; V(i,j-1) + \delta(-, T[j]) \,\}$
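A minimal Python sketch of the quadratic recurrence above, assuming 0-based strings and a scoring function passed in as `delta`. The names `global_alignment_score` and `simple_delta`, and the +2/-1 scores, are illustrative assumptions and are not taken from the slides.

```python
def global_alignment_score(S, T, delta):
    """Fill the (n+1) x (m+1) table V and return the global alignment score."""
    n, m = len(S), len(T)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # First column: S[0..i) aligned against gaps only.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + delta(S[i - 1], "-")
    # First row: T[0..j) aligned against gaps only.
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + delta("-", T[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + delta(S[i - 1], T[j - 1]),  # match / mismatch
                V[i - 1][j] + delta(S[i - 1], "-"),           # gap in T
                V[i][j - 1] + delta("-", T[j - 1]),           # gap in S
            )
    return V[n][m]


def simple_delta(x, y):
    # Hypothetical scoring: +2 for a match, -1 for a mismatch or a gap.
    return 2 if x == y and x != "-" else -1


score = global_alignment_score("aacgacga", "ctacgaga", simple_delta)
```

Because `delta` is an arbitrary function, the sketch covers the unrestricted scoring case; the two nested loops are where the O(n^2) cost comes from.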
FOUR RUSSIANS ALGORITHM
UNRESTRICTED SCORING FUNCTION
Main idea: Compress the sequences
S = aacgacga, T = ctacgaga
LZ-78: Divide the sequence into distinct words
[Figure: S and T divided into numbered words, with the trie for each sequence.]
The number of distinct words:
Main idea
– Compute the alignment score in each block
– Propagate the scores between the adjacent blocks
[Figure: alignment graph of S and T divided into blocks, with blocks 3/4, 3/2, 5/4 and 5/2 labeled, next to the trie for S and the trie for T.]
Main idea
– Compress the sequence into words
– Pre-compute the score for each block
– Do alignment between blocks
Note:
– Replace normal characters by words
– Operate on blocks
COMPRESS THE SEQUENCE LZ-78
S = aacgacga, T = ctacgaga
LZ-78: Divide the sequence into distinct words
– S: a | ac | g | acg | a
– T: c | t | a | cg | ag | a
[Figure: the words of S and T, numbered, with the trie for each sequence.]
The number of distinct words:
LZ-78
Theorem (Lempel and Ziv):
– S is a sequence of length n over a constant-size alphabet
– The maximal number of distinct phrases in S is O(n/log n)
Tighter upper bound: O(hn/log n)
– h is the entropy factor, a real number with 0 < h ≤ 1
– The entropy is small when the sequence is repetitive
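The LZ-78 division into words can be sketched in a few lines of Python. This is an illustrative version, assuming a trie stored as nested dictionaries; the function name `lz78_parse` is made up for this example. Each new word is the longest previously seen word plus one extra character, and a leftover suffix that repeats an earlier word is kept as a final word.

```python
def lz78_parse(seq):
    """Split seq into LZ-78 words using a trie of nested dicts."""
    trie = {}              # trie[c] is the child node reached by character c
    words = []
    node, start = trie, 0  # current trie node, start index of the current word
    for pos, ch in enumerate(seq):
        if ch in node:
            node = node[ch]                    # still inside a known word
        else:
            node[ch] = {}                      # new trie node = new distinct word
            words.append(seq[start:pos + 1])
            node, start = trie, pos + 1        # restart from the root
    if start < len(seq):
        words.append(seq[start:])              # leftover repeats an earlier word
    return words


words_S = lz78_parse("aacgacga")   # ['a', 'ac', 'g', 'acg', 'a']
words_T = lz78_parse("ctacgaga")   # ['c', 't', 'a', 'cg', 'ag', 'a']
```

Each word of S paired with each word of T defines one block of the alignment graph, so fewer words means fewer blocks.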
COMPUTE THE ALIGNMENT SCORE IN EACH BLOCK
Compute the alignment score in each block
[Figure: alignment graph of S and T divided into blocks, with blocks 3/4, 3/2, 5/4 and 5/2 labeled.]
Given
– Input border: I
– Block
Compute
– Output border: O
[Figure: block G with input border I and output border O.]
Matrices
– I[i]: the input border value
– DIST[i,j]: weight of the optimal path from entry i of the input border to entry j of the output border
– OUT[i,j]: merges the information from input row I and DIST: OUT[i,j] = I[i] + DIST[i,j]
– O[j] = max{OUT[i,j] for i=1..n}
[Figure: block G with input border I and output border O.]
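The definitions of OUT and O translate directly into a naive computation. The sketch below assumes DIST is given as a list of rows and that undefined entries are stored as negative infinity; the name `output_border` and the small 3 x 3 DIST are hypothetical values chosen for illustration only.

```python
NEG_INF = float("-inf")


def output_border(I, DIST):
    """Return (OUT, O) with OUT[i][j] = I[i] + DIST[i][j], O[j] = max_i OUT[i][j]."""
    t = len(I)
    ncols = len(DIST[0])
    OUT = [[I[i] + DIST[i][j] for j in range(ncols)] for i in range(t)]
    O = [max(OUT[i][j] for i in range(t)) for j in range(ncols)]
    return OUT, O


# Hypothetical example: 3 input entries, 3 output entries.
I = [1, 2, 3]
DIST = [
    [0, -1, -2],
    [NEG_INF, 0, -1],
    [NEG_INF, NEG_INF, 0],
]
OUT, O = output_border(I, DIST)   # O == [1, 2, 3]
```

This version touches every entry of DIST, which is the cost the later slides aim to avoid.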
DIST and OUT matrix example
Block – sub-sequences “acg”, “ag”
I (input borders): I0 = 1, I1 = 2, I2 = 3, I3 = 2, I4 = 1, I5 = 3
[Figure: DIST matrix and OUT matrix for the block; each output border value O0 … O5 is the maximum of the corresponding OUT column.]
For each block, given two sub-sequences S1, S2
– Compute (from scratch) DIST in (n*m) time
– Given I and DIST, compute OUT in (n*m) time
– Given OUT[i,j], compute O in (n*m) time
Recap
– Compress the sequence
– Pre-compute DIST[i,j] for each block
– Compute the border values of each block
Remaining questions
– How to compute DIST[i,j] efficiently?
– How to compute O[j] from I[i] and DIST[i,j] efficiently?
[Figure: alignment graph divided into blocks, with blocks 4/4, 5/4 and 5/3 labeled.]
COMPUTE O[J] EFFICIENTLY
Compute O[j] efficiently
For each block of two sub-sequences S1, S2
Given
– I[i]
– DIST[i,j]
Compute
– O[j]
Compute O without explicit OUT
Block – sub-sequences “acg”, “ag”
I (input borders): I0 = 1, I1 = 2, I2 = 3, I3 = 2, I4 = 1, I5 = 3
[Figure: DIST matrix and input borders I; SMAWK produces the output border values O0 … O5.]
Given DIST[i,j] and I[i], we can compute O[j] in O(n+m) time
– Without creating OUT[i,j]
How? Why?
Why?
Aggarwal, Park and Schmidt observed that the DIST and OUT matrices are Monge arrays.
Definition: a matrix M[0…m, 0…n] is totally monotone if either condition 1 or 2 below holds for all a, b = 0…m; c, d = 0…n:
1. Convex condition: $M[a,c] \ge M[b,c] \Rightarrow M[a,d] \ge M[b,d]$ for all a < b and c < d.
2. Concave condition: $M[a,c] \le M[b,c] \Rightarrow M[a,d] \le M[b,d]$ for all a < b and c < d.
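The two conditions can be checked by brute force on a small matrix. The sketch below reads each condition as an implication (if the first inequality holds for column c, the same inequality holds for column d); the function names are made up for this example, and the quartic-time check is meant only for experimenting with small matrices.

```python
from itertools import combinations


def is_convex(M):
    """M[a][c] >= M[b][c] implies M[a][d] >= M[b][d] for all a < b, c < d."""
    return all(
        not (M[a][c] >= M[b][c]) or M[a][d] >= M[b][d]
        for a, b in combinations(range(len(M)), 2)
        for c, d in combinations(range(len(M[0])), 2)
    )


def is_concave(M):
    """M[a][c] <= M[b][c] implies M[a][d] <= M[b][d] for all a < b, c < d."""
    return all(
        not (M[a][c] <= M[b][c]) or M[a][d] <= M[b][d]
        for a, b in combinations(range(len(M)), 2)
        for c, d in combinations(range(len(M[0])), 2)
    )


def is_totally_monotone(M):
    return is_convex(M) or is_concave(M)


# Hypothetical test matrix: entries -(i - j)^2 satisfy the concave condition.
M = [[-(i - j) ** 2 for j in range(4)] for i in range(4)]
```

`combinations(range(k), 2)` yields index pairs in increasing order, which matches the a < b and c < d requirement.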
How?
Aggarwal et al. gave a recursive algorithm, called SMAWK, which finds all row and column maxima of a totally monotone matrix by querying only O(n) elements of the matrix.
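SMAWK itself is fairly intricate, so the sketch below swaps in a simpler divide-and-conquer that exploits the same property. It is not SMAWK and queries O((rows + columns) log columns) entries rather than a linear number. It assumes the concave condition holds and that all entries are finite; under that assumption the bottom-most row attaining each column maximum never moves up as the column index grows. The name `column_maxima` and the example DIST are hypothetical.

```python
def column_maxima(nrows, ncols, entry):
    """Column maxima of a matrix given lazily by entry(i, j).

    Assumes: for a < b and c < d,
    entry(a, c) <= entry(b, c) implies entry(a, d) <= entry(b, d).
    """
    best = [None] * ncols

    def solve(col_lo, col_hi, row_lo, row_hi):
        if col_lo > col_hi:
            return
        mid = (col_lo + col_hi) // 2
        # Scan the candidate rows; ">=" keeps the bottom-most maximum.
        arg = row_lo
        for i in range(row_lo, row_hi + 1):
            if entry(i, mid) >= entry(arg, mid):
                arg = i
        best[mid] = entry(arg, mid)
        solve(col_lo, mid - 1, row_lo, arg)   # earlier columns: rows <= arg
        solve(mid + 1, col_hi, arg, row_hi)   # later columns: rows >= arg

    solve(0, ncols - 1, 0, nrows - 1)
    return best


# OUT is never materialised: entries I[i] + DIST[i][j] are computed on demand.
I = [1, 2, 3, 2, 1, 3]
DIST = [[-(i - j) ** 2 for j in range(6)] for i in range(6)]  # hypothetical DIST
O = column_maxima(6, 6, lambda i, j: I[i] + DIST[i][j])
```

Passing OUT as a function rather than a matrix mirrors the point of the slide: only the queried entries are ever evaluated.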
Why is DIST[i,j] totally monotone?
The concave condition: if the path b→c is better than the path a→c, then the path b→d is better than the path a→d.
[Figure: block G with input border entries a, b and output border entries c, d.]
Another problem: DIST is not a full rectangular matrix
– Set the upper-right corner of OUT to -∞
– Set the lower-left corner of OUT to -(n+i-1)*k
– This preserves the totally monotone property of OUT
[Figure: DIST matrix with its missing corner entries marked △.]
COMPUTE DIST[I,J] EFFICIENTLY
Compute DIST[i,j] for block (5/4)
[Figure: alignment graph with blocks 3/4, 3/2, 5/4 and 5/2 labeled, next to the trie for T and the trie for S.]
DIST matrix
Only column m of DIST[i,j] is new
The DIST block can be updated in O(m+n) time
MAINTAINING DIRECT ACCESS TO DIST TABLE
[Figure, shown in three steps: the alignment graph divided into blocks, the trie for T, the trie for S, and the DIST table.]
Complexity
Assume |S| = |T| = n
– Number of words in S, T = $O(hn/\log n)$
– Number of blocks in the alignment graph: $O(h^2 n^2/(\log n)^2)$
For each block
– Update new DIST block: O(t), where t = size of the border
– Create direct access table: O(t)
Propagating I/O across blocks
– SMAWK: O(t)
Sum of the sizes of all borders is $O(hn^2/\log n)$
Total complexity: $O(hn^2/\log n)$
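The bound on the sum of the border sizes can be checked with a short calculation. The notation is introduced here for illustration: assume S is divided into words s_1, …, s_p and T into words t_1, …, t_q, with p, q = O(hn/log n) and total length n each, and assume the block for the pair (s_i, t_j) has a border of size proportional to |s_i| + |t_j|.

```latex
\sum_{i=1}^{p}\sum_{j=1}^{q}\bigl(|s_i| + |t_j|\bigr)
  = q\sum_{i=1}^{p}|s_i| + p\sum_{j=1}^{q}|t_j|
  = (p + q)\,n
  = O\!\left(\frac{h n^2}{\log n}\right)
```

Since every per-block step costs O(t) for a border of size t, the total running time is bounded by the same expression.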
Other extensions
– Trace
– Reducing the space complexity for discrete scoring
– Local alignment
References
Crochemore, M.; Landau, G. M. & Ziv-Ukelson, M. A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. ACM-SIAM, 2002.
Some pictures from 葉恆青