SMAWK
REVISE
Global alignment (Revise)
Alignment graph for S = aacgacga, T = ctacgaga
Complexity: O(n^2)
V(i,j) = max { V(i-1,j-1) + δ(S[i], T[j]), V(i-1,j) + δ(S[i], -), V(i,j-1) + δ(-, T[j]) }
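The recurrence above can be sketched as a short dynamic program. This is a minimal illustration, not the deck's own code: the function name and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for the example, since the slides do not state the scoring function δ.

```python
def global_alignment_score(S, T, match=1, mismatch=-1, gap=-1):
    """Return V(n, m), the best global alignment score of S against T.

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    def delta(x, y):
        return match if x == y else mismatch

    n, m = len(S), len(T)
    # V[i][j] is the best score for aligning S[:i] with T[:j].
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + gap
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + delta(S[i - 1], T[j - 1]),  # align S[i] with T[j]
                V[i - 1][j] + gap,                            # align S[i] with a gap
                V[i][j - 1] + gap,                            # align a gap with T[j]
            )
    return V[n][m]


score = global_alignment_score("aacgacga", "ctacgaga")
```

The two nested loops visit every cell of the alignment graph once, which is where the O(n^2) complexity comes from.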
DIST and OUT matrix (Revise)
[Figure: block G for the sub-sequences “acg” and “ag”, with input border I and output border O, shown next to its DIST matrix and OUT matrix.]
Input borders: I0 = 1, I1 = 2, I2 = 3, I3 = 2, I4 = 1, I5 = 3
Output borders O0 … O5: each Oj is the maximum of column j of OUT
Compute O without explicit OUT
[Figure: the same block G for the sub-sequences “acg” and “ag”, with only the DIST matrix and the input borders I0 = 1, I1 = 2, I2 = 3, I3 = 2, I4 = 1, I5 = 3.]
The output borders O0 … O5 are computed with SMAWK
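As a baseline for what SMAWK replaces, the output border can be computed by brute force. The sketch below assumes the usual definition from the cited literature, OUT[i][j] = I[i] + DIST[i][j], with O[j] the maximum of column j. The function name and the small DIST matrix are hypothetical; the slide's actual DIST values are not recoverable from the text.

```python
import math

NEG_INF = -math.inf  # marks border pairs with no connecting path (assumption)


def output_border(I, DIST):
    """Return O, where O[j] = max over i of I[i] + DIST[i][j].

    This never stores OUT; it evaluates each OUT entry on the fly.
    """
    rows, cols = len(DIST), len(DIST[0])
    return [max(I[i] + DIST[i][j] for i in range(rows)) for j in range(cols)]


# Hypothetical 3x3 example, not the block from the slide.
I = [1, 2, 3]
DIST = [
    [0, -1, -2],
    [1, 0, -1],
    [NEG_INF, 1, 0],
]
O = output_border(I, DIST)
```

This version reads every entry of DIST, so it costs time proportional to the size of the matrix. The point of SMAWK is to obtain the same column maxima while querying only O(n) entries.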
Aggarwal, Park and Schmidt observed that DIST and OUT matrices are Monge arrays.
Definition: a matrix M[0…m, 0…n] is totally monotone if either condition 1 or 2 below holds for all a, b = 0…m and c, d = 0…n with a < b and c < d:
1. Convex condition: M[a,c] ≥ M[b,c] ⟹ M[a,d] ≥ M[b,d]
2. Concave condition: M[a,c] ≤ M[b,c] ⟹ M[a,d] ≤ M[b,d]
SMAWK
Aggarwal et al. gave a recursive algorithm, called SMAWK, which finds all row and column maxima of a totally monotone matrix by querying only O(n) elements of the matrix.
Presentation Outline
What are Monge arrays?
– Monge ⟹ totally monotone
Why is the DIST alignment matrix a Monge array?
How to search totally monotone arrays efficiently?
– SMAWK: given a totally monotone array, compute all column maxima in O(n)
MONGE AND TOTALLY MONOTONE PROPERTIES
Monge
A matrix M[0…m, 0…n] is Monge if either condition 1 or 2 below holds for all a, b = 0…m and c, d = 0…n with a < b and c < d:
1. M[a,c] + M[b,d] ≤ M[a,d] + M[b,c]
2. M[a,c] + M[b,d] ≥ M[a,d] + M[b,c]
[Figure: matrix with rows a, b, …, x and columns c, d, …, z, highlighting the entries M[a,c], M[a,d], M[b,c], M[b,d].]
Totally monotone
A matrix M[0…m, 0…n] is totally monotone if either condition 1 or 2 below holds for all a, b = 0…m and c, d = 0…n with a < b and c < d:
1. Convex condition: M[a,c] ≥ M[b,c] ⟹ M[a,d] ≥ M[b,d]
2. Concave condition: M[a,c] ≤ M[b,c] ⟹ M[a,d] ≤ M[b,d]
Monge ⟹ totally monotone
[Figure: matrix with rows a, b, …, x and columns c, d, …, z, highlighting the entries M[a,c], M[a,d], M[b,c], M[b,d].]
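Intuition
Monge: quadrangle inequality
[Figure: quadrangle on the points a, b and c, d, next to the matrix with rows a, b, …, x and columns c, d, …, z.]
The two diagonal sums of every 2×2 sub-matrix are ordered consistently: M[a,c] + M[b,d] ≤ M[a,d] + M[b,c] in the convex case, and ≥ in the concave case.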
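Both definitions can be checked by brute force on small matrices, which helps when experimenting with the conditions. The sketch below is an illustration only; the function names and the `concave` flag are assumptions introduced here, and the checks take time proportional to the number of 2×2 sub-matrices.

```python
from itertools import combinations


def is_monge(M, concave=False):
    """Check the Monge condition on every pair of rows a < b and columns c < d."""
    for a, b in combinations(range(len(M)), 2):
        for c, d in combinations(range(len(M[0])), 2):
            lhs = M[a][c] + M[b][d]
            rhs = M[a][d] + M[b][c]
            if (lhs < rhs) if concave else (lhs > rhs):
                return False
    return True


def is_totally_monotone(M, concave=False):
    """Check the convex (default) or concave total monotonicity condition."""
    for a, b in combinations(range(len(M)), 2):
        for c, d in combinations(range(len(M[0])), 2):
            if concave:
                if M[a][c] <= M[b][c] and not M[a][d] <= M[b][d]:
                    return False
            else:
                if M[a][c] >= M[b][c] and not M[a][d] >= M[b][d]:
                    return False
    return True


# (i - j)^2 satisfies the convex Monge condition; its negation the concave one.
convex_example = [[(i - j) ** 2 for j in range(5)] for i in range(4)]
concave_example = [[-x for x in row] for row in convex_example]
# Totally monotone (convex) but not Monge: the implication does not reverse.
only_totally_monotone = [[0, 0], [1, 5]]
```

The last matrix shows that the implication goes one way only: every Monge matrix is totally monotone, but a totally monotone matrix need not be Monge.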
History
Computational geometry
All nearest neighbor problem
– Shamos and Hoey proved an n log n bound in 1975
All farthest neighbor problem
– F. P. Preparata proved an n log n bound in 1977
All farthest neighbor problem in a convex polygon
– Lee and Preparata proved O(n) in 1978
SMAWK
Aggarwal et al. proved O(n) for the all farthest neighbor problem in a convex polygon in 1987.
Aggarwal et al. gave a recursive algorithm, called SMAWK, which finds all row and column maxima of a totally monotone matrix by querying only O(n) elements of the matrix.
DIST AND OUT MATRICES
Assumption
– Row and column maxima of a totally monotone matrix can be computed in O(n)
Why are the DIST and OUT matrices of the alignment problem totally monotone?
DIST is Monge
[Figure: block G with input border I and output border O.]
DIST is a Monge array
Monge: M[a,c] + M[b,d] ≥ M[a,d] + M[b,c]
Totally monotone by the concave condition: M[a,c] ≤ M[b,c] ⟹ M[a,d] ≤ M[b,d]
Comment on this approach
Advantages
– Easy to parallelize
– Easy to combine
Disadvantages
– Need to compute/keep more information
Applications
Parallel sequence alignment
– O(log m log n) time
– Using O(mn / log m) processors (CREW PRAM)
Best non-overlapping alignment score
– O(n^2 log^2 n) time
Tandem approximate repeat
– O(n^2 log n) time
Common Substring Alignment
SMAWK
Find all column minima of the following totally monotone arrays
Every 2×2 sub-array
[a b]
[c d]
satisfies
– a > c ⟹ b > d
– a = c ⟹ b ≥ d
or, equivalently,
– b < d ⟹ a < c
– b = d ⟹ a ≤ c
[Figures: Observation 1 and Observation 2.]
SMAWK is a recursive algorithm of 2 steps
– REDUCE removes rows
– INTERPOLATE removes half of the columns
REDUCE
INTERPOLATE
Remove all odd-indexed columns
RECURSIVE Find all row minima
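The REDUCE, RECURSIVE and INTERPOLATE steps can be sketched in Python for the column-minima setting. This is a minimal sketch under stated assumptions, not a reference implementation: the name `smawk_column_minima` and the `lookup(row, col)` callback are introduced here, the matrix is assumed non-empty, and every 2×2 sub-array [a b; c d] is assumed to satisfy a > c ⟹ b > d and a = c ⟹ b ≥ d. The recursion is written on the odd-positioned columns (0-based), with the remaining columns filled in afterwards.

```python
def smawk_column_minima(rows, cols, lookup):
    """Return {col: row} with a row that attains the minimum of each column."""
    if not cols:
        return {}

    # REDUCE: keep at most len(cols) rows that can still hold a column minimum.
    stack = []
    for r in rows:
        while stack:
            c = cols[len(stack) - 1]
            if lookup(stack[-1], c) <= lookup(r, c):
                break        # the row on top of the stack survives for now
            stack.pop()      # the top row can no longer hold any column minimum
        if len(stack) < len(cols):
            stack.append(r)
    rows = stack

    # RECURSIVE: solve every second column on the reduced rows.
    result = smawk_column_minima(rows, cols[1::2], lookup)

    # INTERPOLATE: the minimum of each remaining column lies between the
    # minima already found for its two neighbouring columns.
    position = {r: k for k, r in enumerate(rows)}
    start = 0
    for i in range(0, len(cols), 2):
        c = cols[i]
        if i + 1 < len(cols):
            stop = position[result[cols[i + 1]]]
        else:
            stop = len(rows) - 1
        best = start
        for k in range(start + 1, stop + 1):
            if lookup(rows[k], c) < lookup(rows[best], c):
                best = k
        result[c] = rows[best]
        start = stop
    return result


# Hypothetical test matrix: (3i - 2j)^2 + 5i satisfies the convex Monge condition.
M = [[(3 * i - 2 * j) ** 2 + 5 * i for j in range(10)] for i in range(7)]
minima = smawk_column_minima(list(range(7)), list(range(10)), lambda i, j: M[i][j])
```

To find column maxima of a matrix that is totally monotone in the concave sense, as with DIST, one option is to pass a `lookup` that returns the negated entry. The matrix is only ever accessed through `lookup`, so OUT entries such as I[i] + DIST[i][j] can be evaluated on demand without building OUT.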
APPROXIMATE TANDEM REPEAT Application of DIST and SMAWK
Tandem repeat IRQI QLWLR QIWIR LRQL
Observation
Approximate tandem repeat
– With the mid-point c
– Alignments start at column c and end at row c
[Figure: alignment grid from 0 to n on both axes, with column c and row c marked.]
4 cases
– Cross column n/2
– Cross row n/2
– Inside sub-triangle [0, n/2]
– Inside sub-triangle [n/2, n]
Algorithm
1. Find all repeats that cross
– row n/2
– column n/2
2. Recursively solve the
– sub-array [0..n/2, 0..n/2]
– sub-array [n/2..n, n/2..n]
[Figure: grid split at n/2, with the mid-points c1, c2, c3 marked on both axes.]
Cross column n/2
Combine
– Best path from column c to (k, n/2)
– Best path from (k, n/2) to row c
[Figure: grid from 0 to n with column n/2, column c and row c marked.]
Cross column n/2
Sub-problems:
– DIST_col (c,n/2) [i,j]
– DIST_row (c,n/2) [i,j]
[Figure: grid split at n/2, with the mid-points c1, c2 marked on both axes.]
Cross column n/2
DIST_col (c,n/2) [i,j]: O(n^3) words
Encode in an array of binary trees
– Using O(n^2 log n) words
– B[j,c] is a binary tree
– B[j,c](i) is a leaf of the tree
– Read an entry of DIST_col (c,n/2) [i,j] in O(log n)
[Figure: grid split at n/2, with the mid-points c1, c2 marked on both axes.]
Algorithm
1. Find all repeats that cross, in O(n^2 log n)
– row n/2
– column n/2
2. Recursively solve the
– sub-array [0..n/2, 0..n/2]
– sub-array [n/2..n, n/2..n]
[Figure: grid split at n/2, with the mid-points c1, c2, c3 marked on both axes.]
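The O(n^2 log n) total for the whole recursion can be checked with a short recurrence argument. The derivation below is a sketch under two assumptions: the crossing step costs at most c n^2 log n for some constant c, and each of the two recursive calls works on an array of half the side length. The names T(n) and c are introduced here for illustration.

```latex
\begin{align*}
T(n) &\le 2\,T(n/2) + c\,n^2 \log n \\
     &\le \sum_{k=0}^{\log_2 n} 2^k \cdot c \left(\frac{n}{2^k}\right)^2 \log n + O(n) \\
     &=   c\,n^2 \log n \sum_{k=0}^{\log_2 n} 2^{-k} + O(n) \\
     &\le 2c\,n^2 \log n + O(n) \;=\; O(n^2 \log n)
\end{align*}
```

The work per level shrinks geometrically, so the top level dominates and the recursion adds only a constant factor. The O(n) term accounts for the constant-size base cases.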
References
Aggarwal, A. and Park, J. Notes on Searching in Multidimensional Monotone Arrays. IEEE.
Schmidt, J. P. All highest scoring paths in weighted grid graphs and their application to finding all approximate repeats in strings. SIAM.
Larmore, L. L. The SMAWK Algorithm. UNLV.
Apostolico, A., Atallah, M. J., Larmore, L. L. and McFaddin, S. Efficient Parallel Algorithms for String Editing and Related Problems. SIAM J. Comput.
Landau, G. M. and Ziv-Ukelson, M. On the Common Substring Alignment Problem. J. of Algorithms.