A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
Maxime Crochemore, Gad M. Landau, Michal Ziv-Ukelson


Presentation R89922024 蘇展弘 B86202049 葉恆青 R90725054 呂育恩 R90922001 張文亮 R90922091 游騰楷

Outline
Introduction and preliminaries.
–LZ-78.
–The basic concept.
Global alignment.
Local alignment.
Proof for LZ-76.
Proof for SMAWK algorithm.

LZ-78
[Figure: the LZ-78 trie of aacgacga, rooted at node 0, with edges 0 -a- 1, 1 -c- 2, 2 -g- 4 and 0 -g- 3.]
The number of distinct code words:

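Sample of LZ-78
ctacgaga parses into phrases 1 to 5: c | t | a | cg | ag, followed by a trailing a.
aacgacga parses into phrases 1 to 4: a | ac | g | acg, followed by a trailing a.
[Figure: the two LZ-78 tries. For ctacgaga: 0 -c- 1, 0 -t- 2, 0 -a- 3, 1 -g- 4, 3 -g- 5. For aacgacga: 0 -a- 1, 1 -c- 2, 0 -g- 3, 2 -g- 4.]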
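The parsing on this slide can be reproduced with a few lines of Python. This is a minimal sketch of textbook LZ-78 phrase parsing, not code from the presentation; the function name `lz78_parse` and the dictionary-based trie are illustrative assumptions.

```python
def lz78_parse(s):
    """Split s into LZ-78 phrases: each new phrase is the longest
    previously seen phrase extended by one symbol."""
    trie = {}            # (parent node, symbol) -> child node; node 0 is the root
    phrases = []
    node, start = 0, 0
    for pos, ch in enumerate(s):
        child = trie.get((node, ch))
        if child is None:
            # Unseen extension: create a new trie node and close the phrase.
            trie[(node, ch)] = len(trie) + 1
            phrases.append(s[start:pos + 1])
            node, start = 0, pos + 1
        else:
            node = child
    if start < len(s):
        # Leftover suffix that repeats an existing phrase (no new trie node).
        phrases.append(s[start:])
    return phrases


print(lz78_parse("ctacgaga"))
print(lz78_parse("aacgacga"))
```

The trailing phrase is returned as well, so the number of distinct code words is the number of trie nodes, which may be one less than the length of the returned list.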

Basic Concept
[Figure: the alignment graph of ctacgaga (rows, phrases 1 to 5) against aacgacga (columns, phrases 1 to 4), cut into blocks by the two LZ-78 parses. Block G = 5/4 (ag against acg) is shown together with its left prefix block 5/2, its top prefix block 3/4 and its diagonal prefix block 3/2.]
Input border: I
Output border: O

Basic Concept: I/O Propagation Across G
[Figure: the 6 × 6 DIST matrix of block G (rows I_0 … I_5, columns 0 … 5; △ marks undefined entries) and the OUT matrix obtained by adding the input value I_i to row i of DIST. The column maxima of OUT form the output border O_0 … O_5.]
Directly assign −∞: $OUT[i,j] = -(n+i+1) \times k$, where $k$ is the maximal absolute value in the penalty matrix.
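The propagation step can be sketched in Python as follows. This is a brute-force illustration only: it assumes OUT[i][j] = I[i] + DIST[i][j], takes the output border as the column maxima of OUT, and pads undefined DIST entries (written `None` here) with −(n+i+1)·k as on the slide. The function name and the sample numbers are hypothetical; the actual algorithm obtains the same maxima with SMAWK instead of scanning every entry.

```python
def output_border(I, DIST, n, k):
    """Column maxima of OUT, where OUT[i][j] = I[i] + DIST[i][j].

    Undefined DIST entries (None) get the padding value -(n + i + 1) * k,
    which is low enough never to win a column maximum.
    """
    O = []
    for j in range(len(DIST[0])):
        best = None
        for i in range(len(I)):
            d = DIST[i][j]
            out = -(n + i + 1) * k if d is None else I[i] + d
            if best is None or out > best:
                best = out
        O.append(best)
    return O


# Illustrative values only (not the matrices from the slide).
I = [1, 2, 3]
DIST = [[0, -1, None],
        [-1, 0, -1],
        [None, -1, 0]]
print(output_border(I, DIST, n=8, k=2))
```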

Basic Concept: Monge Property
[Figure: the DIST matrix of block G, with rows I_0 … I_5 and columns 0 … 5; △ marks undefined entries.]
Aggarwal and Park, and Schmidt, observed that DIST matrices are Monge arrays.
Def: A matrix $M[m \times n]$ is Monge if either condition 1 or 2 below holds for all $a, b = 0 \ldots m$; $c, d = 0 \ldots n$:
1. convex condition: $M[a,c] + M[b,d] \le M[b,c] + M[a,d]$ for all $a < b$ and $c < d$
2. concave condition: $M[a,c] + M[b,d] \ge M[b,c] + M[a,d]$ for all $a < b$ and $c < d$
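A small Python check can make the concave Monge condition concrete. The sketch below assumes the condition in its usual form, M[a][c] + M[b][d] ≥ M[b][c] + M[a][d] for a < b and c < d, and relies on the standard fact that it suffices to test adjacent rows and columns. The function name is illustrative, and undefined (△) entries are not handled.

```python
def is_concave_monge(M):
    """True if M[a][c] + M[b][d] >= M[b][c] + M[a][d] for all a < b, c < d.

    Checking adjacent rows and columns is enough: the general inequality
    is a sum of the adjacent ones.
    """
    rows, cols = len(M), len(M[0])
    return all(
        M[a][c] + M[a + 1][c + 1] >= M[a + 1][c] + M[a][c + 1]
        for a in range(rows - 1)
        for c in range(cols - 1)
    )


product = [[i * j for j in range(4)] for i in range(4)]   # concave Monge
negated = [[-i * j for j in range(4)] for i in range(4)]  # violates the condition
print(is_concave_monge(product), is_concave_monge(negated))
```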

Basic Concept: Totally Monotone
An important property of Monge arrays is that of being totally monotone.
Def: A matrix $M[m \times n]$ is totally monotone if either condition 1 or 2 below holds for all $a, b = 0 \ldots m$; $c, d = 0 \ldots n$:
1. convex condition: $M[a,c] \ge M[b,c] \Rightarrow M[a,d] \ge M[b,d]$ for all $a < b$ and $c < d$
2. concave condition: $M[a,c] \le M[b,c] \Rightarrow M[a,d] \le M[b,d]$ for all $a < b$ and $c < d$
Both DIST and OUT matrices are totally monotone by the concave condition.
Aggarwal et al. gave a recursive algorithm, nicknamed SMAWK, which can compute in $O(n)$ time all row and column maxima of an $n \times n$ totally monotone matrix, by querying only $O(n)$ elements of the array.

DIST matrix
[Figure: the DIST matrix of block G with the input border values I_0 = 1, I_1 = 2, I_2 = 3, I_3 = 2, I_4 = 1, I_5 = 3 written next to its rows; Δ marks undefined entries.]

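OUT matrix
[Figure: the OUT matrix of block G, with the undefined entries filled in.]
Concave monotonicity: if, in a left column, the upper entry is smaller than the lower entry, then in any column to its right the upper entry is also smaller than the lower entry.
No new column maximum: the padding value $-(n + i + 1) \times k$ is too small to become a column maximum.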

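The New Block
[Figure: the block grid of ctacgaga (rows c, t, a, cg, ag, a) against aacgacga (columns a, ac, g, acg, a), with the newly created block highlighted.]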

Corresponding matrices

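Maintaining Direct Access to DIST Columns
Goal: the OUT matrix that SMAWK queries must be supplied from DIST and the input, and every entry of OUT must be obtainable in constant time, without exceeding the space bound.
Approach: store only the newly generated column, and maintain a data structure.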

Data Structure
[Figure: DIST(5,4) shown as a set of stored column vectors next to the block grid of ctacgaga against aacgacga.]

Construction
[Figure: the same DIST(5,4) column vectors and block grid, illustrating how the structure is built.]

Time and Space Complexity
Build the new column.
Build the DIST vector (that is, locate all the columns of this DIST matrix).
Use SMAWK to compute the output maxima from this DIST (plus the input).
$O(t)$

Total complexity
[Figure: the block grid of ctacgaga against aacgacga, annotated with $hn/\log n$ and $n$.]
$O(hn^2/\log n)$

Sub-Quadratic Local Alignment Eric, Yu En Lu Information Management Dept. National Taiwan University

Sub-Quadratic Global Alignment
Exploits the redundancy (self-repetition) within the sequences, as exposed by Lempel-Ziv compression, to obtain the sub-quadratic bound.

Sub-Quadratic Local Alignment
Requires additional knowledge of where a locally optimal alignment starts and ends.
However, because the algorithm works on a per-block basis, we have to compute additional information specific to each block and then use it as the cue to the final score.

Additional Information
[Figure: a block with its input border I and the additional values S[i], E[k] and C.]
$F = \max\{\max_{i=0}^{t}\{I[i] + E[i]\},\ C\}$

Algorithm Body
Given: DIST G
Encoding
–Compute values of E
–Compute values of S
–Compute values of C
Propagation
–Compute values of O′ (modified from the O in global alignment)
–Compute F
Seek Highest Score
–Find the highest score F

Back-Tracking the Exact Path
Global Alignment
Local Alignment
–Given the block with the maximum F value
–We trace its path by recursively following its max{lp, tp, dia} prefix block until the score reaches 0

Time/Space Analysis
Encoding
–E: $\max\{E_{lp}[i],\ E_{tp}[i],\ DIST[i, l_c]\}$ → $O(t)$
–S: (all others can be copied, except..) $S_{l_r,l_c} = \max\{S_{l_r-1,l_c} + W,\ S_{l_r,l_c-1} + W,\ S_{l_r-1,l_c-1} + W\}$ → $O(t)$
–C: $\max\{C_{lp},\ C_{tp},\ S[l_c]\}$ → $O(1)$
Propagation
–$O[i] = \max\{O[i],\ S[i]\}$ → $O(t)$
–$F = \max\{\max_{i=0}^{t}\{I[i] + E[i]\},\ C\}$ → $O(t)$
Find F → $O(hn^2/\log^2 n)$
Total Complexity → $O(hn^2/\log n)$
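The two propagation formulas can be sketched in Python as below. This only illustrates the per-block bookkeeping, assuming the vectors I, E, O, S and the scalar C of one block are already available; the function name and the sample values are hypothetical.

```python
def propagate_local(I, E, O, S, C):
    """Per-block propagation step for local alignment.

    O'[i] = max(O[i], S[i])            (paths may start inside the block)
    F     = max(max_i(I[i] + E[i]), C) (best score of a path ending in the block)
    Both take O(t) time for a block border of length t.
    """
    O_prime = [max(o, s) for o, s in zip(O, S)]
    F = max(max(i_val + e_val for i_val, e_val in zip(I, E)), C)
    return O_prime, F


# Illustrative values only.
O_prime, F = propagate_local(I=[0, 2, 1], E=[3, 1, 4], O=[1, 5, 2], S=[2, 3, 6], C=4)
print(O_prime, F)
```

The overall local alignment score is then the largest F over all blocks.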

Further Improvements
Efficient alignment storage algorithm
–Conditioned on “discrete weights”
–Gives a minimal encoding of DIST ($O(t) \to O(1)$) for G
–Thus we obtain $O(hn^2/(\log n)^2)$ storage complexity in the global alignment problem
–While the time complexity is $O(hn^2/\log n)$

Now, we are going to have presentations on SMAWK & LZ-76 Thank you!

The Maximum Number of Distinct Words
Speaker: Emory Chang
Date: 2002/1/31
Reference: Lempel and Ziv, 1976, “On the Complexity of Finite Sequences”

What is a Distinct Word?
EX (LZ78): $A = \{0,1\}$, $\alpha = |A| = 2$; $S = 0101000$, $n = |S| = 7$
[Figure: the LZ78 trie of S.]
S is parsed as 0 | 1 | 01 | 00 | 0: we have four distinct words, and five steps to generate the sequence.

Notation
A: the alphabet (the set of symbols)
α: the number of symbols in A
S: a sequence over A
n: the length of S
C(S): the production complexity of S
N: the maximum possible number of distinct words

The upper bound
Any sequential encoding procedure employs a parsing rule by which a long string of data is broken down into words that are individually mapped into distinct words.
For every sequence S of length n over A: [equation not preserved in the transcript]

Special case (1/2)
Let N denote the maximum possible number of distinct words.
Clearly $C(S) < N + 1$ (with the possible exception of the last word).
Consider the special case in which the sequence is formed by all distinct words of length one, two, …, k.
Ex: S = 0 1 00 01 10 11
[Figure: the binary trie with nodes 0 to 6 holding these six words.]
$n = 1 \cdot 2^1 + 2 \cdot 2^2 = 10$

Special case (2/2)
The total length of the words at level i: $i \cdot \alpha^i$
The number of nodes at level i: $\alpha^i$
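The counts for the special case can be checked with a short Python sketch. It assumes that level i of the trie holds all α^i words of length i, so that n is the sum of i·α^i and N is the sum of α^i for i = 1 … k; the function name is illustrative.

```python
from itertools import product


def special_case(alphabet, k):
    """Build the sequence made of all words of length 1..k over the alphabet.

    Returns (S, n, N): the sequence, its length, and the number of distinct words.
    """
    alpha = len(alphabet)
    words = ["".join(w) for i in range(1, k + 1) for w in product(alphabet, repeat=i)]
    S = "".join(words)
    n = sum(i * alpha ** i for i in range(1, k + 1))   # total length, level by level
    N = sum(alpha ** i for i in range(1, k + 1))       # number of trie nodes (words)
    assert len(S) == n and len(words) == N
    return S, n, N


print(special_case("01", 2))
```

For the binary alphabet and k = 2 this reproduces the example sequence 0 1 00 01 10 11.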

General Case
The length of each word at level k+1 is k+1.
[Figure: the trie with an additional, partially filled level k+1.]

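Proof
[Equations not preserved in the transcript: the slide chains three displayed steps introduced by "Since", "We have" and "Therefore".]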

SMAWK A Linear Time Algorithm for the Maximum Problem on Wide Totally Monotone Matrices

Definition
Let A be an $n \times m$ matrix with real entries.
–$A^j$ denotes the jth column of A and $A_i$ denotes the ith row of A.
–$A[i_1, \ldots, i_k; j_1, \ldots, j_k]$ denotes the submatrix of A formed by those rows and columns.
–Let $j(i)$ be the smallest column index j such that $A(i,j)$ equals the maximum value in $A_i$.

[Figure: an example matrix in which row i = 2 has its maximum in column j(i) = 3.]

Definition
An $n \times m$ matrix A is monotone if for $1 \le i_1 < i_2 \le n$, $j(i_1) \le j(i_2)$.
A is totally monotone if every submatrix of A is monotone.
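Both definitions can be tested by brute force, as in the Python sketch below. It assumes j(i) is the leftmost position of the row maximum and uses the standard fact that a matrix is totally monotone exactly when every 2 × 2 submatrix is monotone. The function names are illustrative, and the check costs O(n²m²), so it is meant for small examples only.

```python
from itertools import combinations


def leftmost_row_maxima(A):
    """j(i) for every row: the smallest column index holding the row maximum."""
    return [row.index(max(row)) for row in A]


def is_monotone(A):
    j = leftmost_row_maxima(A)
    return all(j[i] <= j[i + 1] for i in range(len(j) - 1))


def is_totally_monotone(A):
    """Every 2 x 2 submatrix must be monotone.

    A 2 x 2 submatrix (rows i1 < i2, columns j1 < j2) fails only when the
    upper row prefers the right column and the lower row prefers the left one.
    """
    for i1, i2 in combinations(range(len(A)), 2):
        for j1, j2 in combinations(range(len(A[0])), 2):
            if A[i1][j1] < A[i1][j2] and A[i2][j1] >= A[i2][j2]:
                return False
    return True


A = [[-(2 * i - j) ** 2 for j in range(9)] for i in range(4)]
print(leftmost_row_maxima(A), is_monotone(A), is_totally_monotone(A))
```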

Another Definition
In the previous paper, the definition of totally monotone is:
–A matrix $M[0 \ldots m, 0 \ldots n]$ is totally monotone if either condition 1 or 2 below holds for all $a, b = 0 \ldots m$; $c, d = 0 \ldots n$:
–1. Convex condition: $M[a,c] \ge M[b,c] \Rightarrow M[a,d] \ge M[b,d]$ for all $a < b$ and $c < d$
–2. Concave condition: $M[a,c] \le M[b,c] \Rightarrow M[a,d] \le M[b,d]$ for all $a < b$ and $c < d$
We use concave here.

Comparison
Now we want to compare these two definitions.
The definition in the SMAWK paper is called $D_s$; the definition in this paper is called $D_c$ (we need a transpose to match the rows and columns of these two definitions).

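Comparison (cont.)
To prove $D_c \Rightarrow D_s$:
–$D_c$ holds on matrix $A[0 \ldots n, 0 \ldots m]$
–Let $A'[i \ldots i', j \ldots j']$ be a submatrix of A, with $i \le i_1 < i_2 \le i'$, $j_1 = j(i_1)$, $j_2 = j(i_2)$; we must show $j_1 \le j_2$.
[Figure: rows $i_1$ and $i_2$ of $A'$, with entries a, b, c, d in row $i_1$ and e, f, g, h in row $i_2$; by $D_c$, the comparison of a, b, c against d carries over to e, f, g against h.]
So $j_1 \le j_2$.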

Comparison (cont.)
$D_s$ does not imply $D_c$:
–[Figure: a counterexample matrix.] The matrix satisfies $D_s$ but not $D_c$.
$D_c$ is stronger.

Lemma 1
–We define an entry $A[i,j]$ to be dead if $j \ne j(i)$.
–Lemma 1: Let A be a totally monotone $n \times m$ matrix and let $1 \le j_1 < j_2 \le m$.
–If $A(r, j_1) \ge A(r, j_2)$ then the entries in $\{A(i, j_2) : 1 \le i \le r\}$ are dead.
–If $A(r, j_1) < A(r, j_2)$ then the entries in $\{A(i, j_1) : r \le i \le n\}$ are dead.
[Figure: columns $j_1$ and $j_2$ and row r, with the dead entries marked.]

REDUCE(A)
–C = A; k = 1
–While C has more than n columns do
–  case
–    $C(k,k) \ge C(k,k+1)$ and $k < n$: k = k+1
–    $C(k,k) \ge C(k,k+1)$ and $k = n$: delete column $C^{k+1}$
–    $C(k,k) < C(k,k+1)$: delete column $C^k$; if k > 1 then k = k−1
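The pseudocode above translates almost line for line into Python. The sketch below uses 0-based indices, keeps a list of surviving column indices instead of copying the matrix, and also counts the loop iterations so they can be compared with the 2m − n − 1 bound; the function name and the sample matrix are illustrative assumptions.

```python
def reduce_columns(A):
    """REDUCE for an n x m totally monotone matrix A with m >= n.

    Returns (cols, steps): the indices of the n surviving columns and the
    number of loop iterations performed.
    """
    n = len(A)
    cols = list(range(len(A[0])))   # columns of C, as indices into A
    k = 0                           # 0-based counterpart of the slide's k
    steps = 0
    while len(cols) > n:
        steps += 1
        if A[k][cols[k]] >= A[k][cols[k + 1]]:
            if k < n - 1:
                k += 1              # case 1: advance along the diagonal
            else:
                del cols[k + 1]     # case 2: column k+1 is dead
        else:
            del cols[k]             # case 3: column k is dead
            if k > 0:
                k -= 1
    return cols, steps


# Sample totally monotone matrix: row i has its maximum in column 2i.
A = [[-(2 * i - j) ** 2 for j in range(9)] for i in range(4)]
print(reduce_columns(A))
```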

REDUCE(A)
[Figure: step-by-step illustration of the three cases (≥ with k < n, ≥ with k = n, and <).]

Time Complexity
Case 2 + Case 3 = m − n
Case 1: at most n + (m − n) − 1
In total: 2m − n − 1
$O(m)$

MAXCOMPUTE(A)
–B = REDUCE(A)
–If n = 1 then output the maximum and return
–$C = B[2, 4, \ldots, 2\lfloor n/2 \rfloor;\ 1, 2, \ldots, n]$
–MAXCOMPUTE(C)
–From the known positions of the maxima in the even rows of B, find the maxima in its odd rows.
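A compact Python sketch of the whole recursion is given below. It follows the three steps of MAXCOMPUTE (reduce the columns, recurse on every second row, fill in the remaining rows), but it is a common stack-based formulation rather than the slide's exact pseudocode; the function name, the `value(r, c)` callback and the sample matrix are assumptions made for illustration.

```python
def smawk_row_maxima(rows, cols, value):
    """Leftmost row maxima of a totally monotone matrix.

    rows, cols: lists of row and column indices; value(r, c) returns an entry.
    Returns a dict mapping each row to the column of its leftmost maximum.
    """
    if not rows:
        return {}
    # REDUCE: keep at most len(rows) columns that can still hold a maximum.
    stack = []
    for c in cols:
        while stack:
            r = rows[len(stack) - 1]
            if value(r, stack[-1]) >= value(r, c):
                break
            stack.pop()                   # top column is dead
        if len(stack) < len(rows):
            stack.append(c)               # otherwise column c is dead
    # Recurse on the 2nd, 4th, ... rows (the slide's even rows).
    result = smawk_row_maxima(rows[1::2], stack, value)
    # Fill in the 1st, 3rd, ... rows by scanning between known neighbours.
    pos = 0
    for i in range(0, len(rows), 2):
        r = rows[i]
        last = result[rows[i + 1]] if i + 1 < len(rows) else stack[-1]
        best = stack[pos]
        while stack[pos] != last:
            pos += 1
            if value(r, stack[pos]) > value(r, best):
                best = stack[pos]
        result[r] = best
    return result


A = [[-(2 * i - j) ** 2 for j in range(9)] for i in range(4)]
maxima = smawk_row_maxima(list(range(4)), list(range(9)), lambda r, c: A[r][c])
print(sorted(maxima.items()))
```

Passing a callback instead of a stored matrix matches the way OUT is used in the alignment algorithm, where entries are computed on demand from DIST and the input border.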

Time Complexity
$T(n,m) = c_1 m + c_2 n + T(n/2, n)$
$= c_1 m + (c_1 + c_2) n + c_2 n/2 + T(n/4, n/2)$
$T(n,m) \le 2(c_1 + c_2) n + c_1 m = O(m)$
