A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices Maxime Crochemore Gad M. Landau Michal Ziv-Ukelson
Presentation R 蘇展弘 B 葉恆青 R 呂育恩 R 張文亮 R 游騰楷
Outline Introduction and preliminaries. –LZ-78. –The basic concept. Global alignment. Local alignment. Proof for LZ-76. Proof for SMAWK algorithm.
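LZ-78 Example: aacgacga parses into the phrases a, ac, g, acg, a. [Figure: phrase trie with root 0 and edges 0 –a→ 1, 1 –c→ 2, 0 –g→ 3, 2 –g→ 4] The number of distinct code words: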
Sample of LZ [Figure: LZ-78 phrase tries for the two strings ctacgaga and aacgacga]
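The LZ-78 parse used on these slides can be sketched as follows. This is a minimal illustration, not the presenters' code; the function name `lz78_parse` and the dictionary-based trie are assumptions made for the example. Each new phrase is the longest previously seen phrase extended by one character.

```python
def lz78_parse(s):
    """Split s into LZ-78 phrases (hypothetical helper for illustration)."""
    trie = {}            # maps (parent_node_id, char) -> child_node_id
    phrases = []
    node, start = 0, 0   # node 0 is the trie root (the empty phrase)
    for pos, ch in enumerate(s):
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # keep extending the current phrase
            continue
        trie[(node, ch)] = len(phrases) + 1   # new trie node = new phrase
        phrases.append(s[start:pos + 1])
        node, start = 0, pos + 1
    if start < len(s):                        # trailing phrase that repeats an old one
        phrases.append(s[start:])
    return phrases


print(lz78_parse("aacgacga"))
print(lz78_parse("ctacgaga"))
```

The second call reproduces the phrase numbering 1 c, 2 t, 3 a, 4 cg, 5 ag that labels the block rows on the following slides.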
Basic Concept [Figure: the alignment graph of aacgacga against ctacgaga, partitioned into blocks by the LZ-78 phrases; the phrases of ctacgaga are numbered 1 c, 2 t, 3 a, 4 cg, 5 ag. Each block G is shown together with its left prefix, top prefix, and diagonal prefix blocks.] Input border: I. Output border: O.
Basic Concept I/O Propagation Across G [Figure: DIST matrix and OUT matrix of block G, with input border entries I_0 … I_5 as rows and output border entries O_0 … O_5 as columns; undefined DIST entries are marked △.] Entries of OUT that would be -∞ are directly assigned OUT[i,j] = -(n+i+1) x k, where k is the maximal absolute value in the penalty matrix.
Basic Concept Monge Property [Figure: DIST matrix with rows I_0 … I_5] Aggarwal and Park, and Schmidt, observed that DIST matrices are Monge arrays. Def : A matrix M[ m x n ] is Monge if either condition 1 or 2 below holds for all a, b=0 … m; c, d=0 … n: 1. convex condition : M[a,c] + M[b,d] ≤ M[b,c] + M[a,d] for all a<b and c<d. 2. concave condition : M[a,c] + M[b,d] ≥ M[b,c] + M[a,d] for all a<b and c<d.
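A small check of the concave Monge condition, M[a,c] + M[b,d] ≥ M[b,c] + M[a,d] for all a<b and c<d, may make the definition concrete. The sketch below is illustrative only; the function name `is_concave_monge` is an assumption. It relies on the standard fact that, for finite entries, testing every adjacent 2 x 2 window is enough, because the general inequality is a sum of adjacent ones.

```python
def is_concave_monge(M):
    """True if M[a][c] + M[b][d] >= M[b][c] + M[a][d] for all a < b, c < d.

    Only adjacent rows and columns are tested; summing those inequalities
    yields the condition for arbitrary a < b and c < d.
    """
    rows, cols = len(M), len(M[0])
    return all(
        M[a][c] + M[a + 1][c + 1] >= M[a + 1][c] + M[a][c + 1]
        for a in range(rows - 1)
        for c in range(cols - 1)
    )


# Toy matrices (made up for the example, not DIST matrices from the slides).
concave = [[-(i - j) ** 2 for j in range(4)] for i in range(4)]
convex = [[(i - j) ** 2 for j in range(4)] for i in range(4)]
print(is_concave_monge(concave), is_concave_monge(convex))
```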
Basic Concept Totally Monotone An important property of Monge arrays is that of being totally monotone. Def : A matrix M[ m x n ] is totally monotone if either condition 1 or 2 below holds for all a, b=0 … m; c, d=0 … n: 1. convex condition : M[a,c] ≥ M[b,c] ⟹ M[a,d] ≥ M[b,d] for all a<b and c<d. 2. concave condition : M[a,c] ≤ M[b,c] ⟹ M[a,d] ≤ M[b,d] for all a<b and c<d. Both DIST and OUT matrices are totally monotone by the concave condition. Aggarwal et al. gave a recursive algorithm, nicknamed SMAWK, which can compute in O(n) time all row and column maxima of an n x n totally monotone matrix, by querying only O(n) elements of the array.
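The concave total-monotonicity condition, M[a,c] ≤ M[b,c] ⟹ M[a,d] ≤ M[b,d] for all a<b and c<d, can be tested by brute force on small matrices. This is a sketch for intuition only (the name `is_totally_monotone_concave` is hypothetical), and it costs far more than SMAWK itself.

```python
from itertools import combinations


def is_totally_monotone_concave(M):
    """Brute-force test of: M[a][c] <= M[b][c] implies M[a][d] <= M[b][d]
    for every pair of rows a < b and every pair of columns c < d."""
    for a, b in combinations(range(len(M)), 2):
        for c, d in combinations(range(len(M[0])), 2):
            if M[a][c] <= M[b][c] and not M[a][d] <= M[b][d]:
                return False
    return True


# Toy matrices made up for the example.
print(is_totally_monotone_concave([[-(i - j) ** 2 for j in range(4)] for i in range(4)]))
print(is_totally_monotone_concave([[0, 5], [1, 0]]))
```

The first matrix is concave Monge and therefore passes; in the second, the upper entry is smaller in the left column but larger in the right column, so the condition fails.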
DIST matrix [Figure: DIST matrix with rows I_0 … I_5; undefined entries are marked Δ]
OUT matrix [Figure: OUT matrix]
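OUT matrix [Figure: OUT matrix] Concave monotonicity: if the upper entry of a left column is smaller than the lower entry, then in any column to its right the upper entry is also smaller than the lower entry. Padding undefined entries with –(n + i + 1) * k introduces no new column maximum.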
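The relation between the input border, DIST, and the output border can be sketched as below, assuming OUT[i,j] = I[i] + DIST[i,j] and O[j] = the maximum of column j of OUT. The function name `propagate`, the use of `None` for undefined DIST entries, and the sample numbers are assumptions for illustration. The sketch scans every entry in O(t²) time; the algorithm on the slides replaces this scan with SMAWK.

```python
def propagate(I, DIST, n, k):
    """Return the output border O, where O[j] = max over i of OUT[i][j].

    OUT[i][j] = I[i] + DIST[i][j] when DIST[i][j] is defined; undefined
    entries (None here) are padded with -(n + i + 1) * k as on the slide.
    """
    O = []
    for j in range(len(DIST[0])):
        column = []
        for i, row in enumerate(DIST):
            d = row[j]
            column.append(I[i] + d if d is not None else -(n + i + 1) * k)
        O.append(max(column))
    return O


# Hypothetical 3 x 3 block; None marks entries with no path.
I = [1, 2, 3]
DIST = [[0, -2, None],
        [1, 0, -2],
        [None, 1, 0]]
print(propagate(I, DIST, n=4, k=3))
```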
The New Block [Figure: the block partition of the alignment graph with the newly added block highlighted]
Corresponding matrices
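Maintaining Direct Access to DIST Columns Goal: the OUT matrix that SMAWK queries must be derived from DIST and the input, and every entry of OUT must be available in constant time, without exceeding the space bound. Approach: store only the newly generated column of each block, and maintain a data structure that gives access to the other columns.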
Data Structure [Figure: data structure giving access to the columns of DIST(5,4)]
Construction [Figure: construction of DIST(5,4)]
Time and Space Complexity –Build the new column. –Build the DIST vector (i.e., locate all columns of this DIST matrix). –Use SMAWK on this DIST (plus the input) to compute the output maxima. O(t)
Total complexity [Figure: block partition of the alignment graph, with one side annotated hn / log(n) and the other n] O(hn²/log n)
Sub-Quadratic Local Alignment Eric, Yu En Lu Information Management Dept. National Taiwan University
Sub-Quadratic Global Alignment Exploits the redundancy (self-repetition) within the sequences, as exposed by Lempel-Ziv compression, to obtain the sub-quadratic speedup
Sub-Quadratic Local Alignment Requires additional knowledge of where a locally optimal alignment starts and ends. Because the algorithm runs on a per-block basis, we must compute additional information specific to each block and then use it as the cue to the final score
Additional Information [Figure: block G with input border I and the per-block values S[i], C, and E[k]] F = max{ max_{i=0..t} {I[i] + E[i]}, C }
Algorithm Body Given: DIST of G. Encoding –Compute values of E –Compute values of S –Compute values of C. Propagation –Compute values of O′ (modified from the O in global alignment) –Compute F. Seek Highest Score –Find the highest score F
Back-Tracking the Exact Path Global Alignment Local Alignment –Given the block with the maximum F value –We trace its path by recursively following the max{lp, tp, dia} prefix block until the score reaches 0
Time/Space Analysis Encoding –E: E[i] = max{E_lp[i], E_tp[i], DIST[i, l_c]}, O(t) –S: (all other entries can be copied, except..) S_{l_r,l_c} = max{S_{l_r-1,l_c} + W, S_{l_r,l_c-1} + W, S_{l_r-1,l_c-1} + W}, O(t) –C: C = max{C_lp, C_tp, S[l_c]}, O(1). Propagation –O[i] = max{O[i], S[i]}, O(t) –F = max{ max_{i=0..t} {I[i] + E[i]}, C }, O(t). Find F: O(hn²/log² n). Total Complexity: O(hn²/log n)
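The two propagation formulas, O[i] = max{O[i], S[i]} and F = max{max_i {I[i] + E[i]}, C}, amount to one pass over the border of a block. The sketch below assumes the readings suggested by the slides: E[i] scores the best path from input entry i that ends inside the block, S[i] the best path that starts inside the block and reaches output entry i, and C the best path lying entirely inside the block. The function name `local_block_update` and the sample numbers are hypothetical.

```python
def local_block_update(I, O, E, S, C):
    """One block's propagation step for local alignment, in O(t) time.

    Returns (O_prime, F): the modified output border and the best score
    of any local alignment that ends inside this block.
    """
    O_prime = [max(o, s) for o, s in zip(O, S)]       # a path may start inside the block
    F = max(max(i + e for i, e in zip(I, E)), C)      # a path may end inside the block
    return O_prime, F


# Hypothetical border values for a block with t = 3.
print(local_block_update(I=[0, 2, 1], O=[1, 5, 2], E=[3, 1, 4], S=[2, 3, 6], C=4))
```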
Further Improvements Efficient alignment storage algorithm –Applies under “discrete weights” –Gives a minimal encoding of DIST for G (O(t) → O(1)) –Thus we obtain O(hn²/(log n)²) storage complexity for the global alignment problem –While the time complexity remains O(hn²/log n)
Now, we are going to have presentations on SMAWK & LZ-76 Thank you!
The Maximum Number of Distinct Words Speaker : Emory Chang Date : 2002/1/31 Reference : Lempel and Ziv, 1976, “On the Complexity of Finite Sequences”
What is a Distinct Word? EX: (LZ78) A = {0,1},a = |A| = 2 S = ,n = |S| = we have four distinct words, and five steps to generate the sequence.
Notation –A : the alphabet (the set of symbols) –α : the size of the alphabet –S : a sequence over A –n : the length of S –C(S) : the production complexity of S –N : the maximum possible number of distinct words.
The upper bound Any sequential encoding procedure employs a parsing rule by which a long string of data is broken down into words that are individually mapped into distinct code words. For every :
Special case(1/2) Let N denote the maximum possible number of distinct words. Clearly C(S) < N+1 (with the possible exception of the last word) Consider the special case : The sequence is formed by all distinct words of length one, two, …, k ex: S = n =
Special case(2/2) The length of symbols at level i The number of nodes at level i
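For the special case in which S is the concatenation of every word of length 1, 2, …, k over an alphabet of size α, the counts can be worked out directly: there are α^i words of length i, and they contribute i·α^i symbols. The sketch below only illustrates this arithmetic; the function name `special_case_sizes` is an assumption.

```python
def special_case_sizes(alpha, k):
    """Return (N, n) for the sequence made of all words of length 1..k.

    N = number of distinct words   = sum of alpha**i     for i = 1..k
    n = total length of the sequence = sum of i * alpha**i for i = 1..k
    """
    N = sum(alpha ** i for i in range(1, k + 1))
    n = sum(i * alpha ** i for i in range(1, k + 1))
    return N, n


print(special_case_sizes(alpha=2, k=3))
```

With α = 2 and k = 3 this counts 2 + 4 + 8 words and 2 + 8 + 24 symbols.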
General Case The length of level k+1=k+1 Level k+1
Proof: Since We have Therefore from
SMAWK A Linear Time Algorithm for the Maximum Problem on Wide Totally Monotone Matrices
Definition Let A be an n × m matrix with real entries. –A^j denotes the jth column of A and A_i denotes the ith row of A. –A[i_1,…,i_k; j_1,…,j_k] denotes the submatrix of A formed by rows i_1,…,i_k and columns j_1,…,j_k. –Let j(i) be the smallest column index j such that A(i,j) equals the maximum value in A_i.
i=2 j(i)=3
Definition An n × m matrix A is monotone if for 1 ≤ i_1 < i_2 ≤ n, j(i_1) ≤ j(i_2). A is totally monotone if every submatrix of A is monotone.
Another Definition In the previous paper, the definition of totally monotone is: –A matrix M[0…m,0…n] is totally monotone if either condition 1 or 2 below holds for all a,b=0…m; c,d=0…n: –1. Convex condition: M[a,c] ≥ M[b,c] ⟹ M[a,d] ≥ M[b,d] for all a<b and c<d –2. Concave condition: M[a,c] ≤ M[b,c] ⟹ M[a,d] ≤ M[b,d] for all a<b and c<d We use concave here.
Comparison Now we want to compare these two definitions. The definition in the SMAWK paper is called D_s; the definition in this paper is called D_c (we need a transpose to match the rows and columns of these two definitions).
Comparison (cont.) To prove D_c ⟹ D_s. –D_c holds on matrix A[0…n,0…m] –Let A’[i…i’,j…j’] be a submatrix of A, with i ≤ i_1 < i_2 ≤ i’, j_1 = j(i_1), j_2 = j(i_2); we must show j_1 ≤ j_2. [Figure: rows i_1 and i_2 against columns j_1 and j_2, with entries labeled a,b,c, d, e,f,g, h] So j_1 ≤ j_2
Comparison (cont.) Does D_s ⟹ D_c? –No. [Figure: counterexample matrix] The matrix satisfies D_s but not D_c, so D_c is the stronger condition.
Lemma 1 –We say an entry A[i,j] is dead if j ≠ j(i). –Lemma 1: –Let A be a totally monotone n × m matrix and let 1 ≤ j_1 < j_2 ≤ m. If A(r, j_1) ≥ A(r, j_2) then the entries in {A(i, j_2): 1 ≤ i ≤ r} are dead. If A(r, j_1) < A(r, j_2) then the entries in {A(i, j_1): r ≤ i ≤ n} are dead. [Figure: columns j_1 and j_2 with row r, showing the dead ranges of i]
REDUCE(A) –C=A; k=1 –While C has more than n columns do – case – C(k,k) ≥ C(k,k+1) and k < n : k = k+1 – C(k,k) ≥ C(k,k+1) and k = n : Delete column C^{k+1} – C(k,k) < C(k,k+1) : Delete column C^k ; – if k>1 then k = k-1 –Return C
REDUCE(A) [Figure: illustration of the comparisons made by REDUCE]
Time Complexity –Case 2 + Case 3 = m – n (each deletes one column) –Case 1: at most n + (m – n) – 1 –Total: at most 2m – n – 1 = O(m)
MAXCOMPUTE(A) – B = REDUCE(A) –If n=1 then output the maximum and return –C=B[2,4,…,2⌊n/2⌋; 1,2,…,n] –MAXCOMPUTE(C) –From the known positions of the maxima in the even rows of B, find the maxima in its odd rows.
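REDUCE and MAXCOMPUTE together can be sketched as a single recursive function. This is a compact illustration under stated assumptions, not the slides' exact pseudocode: REDUCE is written with a stack instead of the index k, the matrix is accessed through a lookup function `A(i, j)`, and the name `smawk_row_maxima` is hypothetical. Ties are resolved to the leftmost column, matching the definition of j(i); the input is assumed to be totally monotone.

```python
def smawk_row_maxima(rows, cols, A):
    """Return {row: column of its leftmost maximum}; A(i, j) reads one entry."""
    if not rows:
        return {}
    # REDUCE: discard dead columns until at most len(rows) remain.
    stack = []
    for c in cols:
        while stack:
            r = rows[len(stack) - 1]
            if A(r, stack[-1]) >= A(r, c):   # top column survives in row r
                break
            stack.pop()                      # top column is dead from row r downward
        if len(stack) < len(rows):
            stack.append(c)
    # MAXCOMPUTE on every second row (the even rows in 1-based numbering).
    result = smawk_row_maxima(rows[1::2], stack, A)
    # Fill in the other rows: each maximum lies between its neighbours' maxima.
    pos = 0
    for idx in range(0, len(rows), 2):
        r = rows[idx]
        last = result[rows[idx + 1]] if idx + 1 < len(rows) else stack[-1]
        best = stack[pos]
        while stack[pos] != last:
            pos += 1
            if A(r, stack[pos]) > A(r, best):   # strict: keep the leftmost maximum
                best = stack[pos]
        result[r] = best
    return result


# Toy 3 x 6 totally monotone matrix, made up for the example.
A = lambda i, j: -(2 * i - j) ** 2
print(smawk_row_maxima(list(range(3)), list(range(6)), A))
```

Passing a lookup function instead of a stored matrix fits the setting of the earlier slides, where OUT entries are computed on demand from DIST and the input border.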
IDEA
Time Complexity T(n,m) = c_1 m + c_2 n + T(n/2, n) = c_1 m + (c_1 + c_2) n + c_2 n/2 + T(n/4, n/2) = … ≤ c_1 m + 2(c_1 + c_2) n = O(m), since m ≥ n for a wide matrix.