Published by Albert Wilkinson. Modified over 9 years ago.
2
A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
Maxime Crochemore, Gad M. Landau, Michal Ziv-Ukelson
3
Presentation
R89922024 蘇展弘, B86202049 葉恆青, R90725054 呂育恩, R90922001 張文亮, R90922091 游騰楷
4
Outline
Introduction and preliminaries.
– LZ-78.
– The basic concept.
Global alignment.
Local alignment.
Proof for LZ-76.
Proof for SMAWK algorithm.
5
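LZ-78
Example: aacgacga is parsed into the phrases a | ac | g | acg | a.
[Figure: the LZ-78 trie, with root 0 and edges 0 –a→ 1, 1 –c→ 2, 0 –g→ 3, 2 –g→ 4.]
The number of distinct code words: $O(hn / \log n)$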
6
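Sample of LZ-78
ctacgaga is parsed into phrases 1 to 5: c | t | a | cg | ag (followed by a repeated a).
[Figure: its trie, with root 0 and edges 0 –c→ 1, 0 –t→ 2, 0 –a→ 3, 1 –g→ 4, 3 –g→ 5.]
aacgacga is parsed into phrases 1 to 4: a | ac | g | acg (followed by a repeated a).
[Figure: its trie, with root 0 and edges 0 –a→ 1, 1 –c→ 2, 0 –g→ 3, 2 –g→ 4.]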
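The parsing on this slide can be reproduced with a few lines of Python. This is a minimal sketch of LZ-78 phrase parsing, not the presenters' code; the function name `lz78_parse` and the dictionary-based trie are illustrative assumptions.

```python
def lz78_parse(text):
    """Split text into LZ-78 phrases; return (phrases, trie).

    The trie maps (parent_node, symbol) -> child_node, with node 0 as root.
    """
    trie = {}
    phrases = []
    node, start = 0, 0
    for pos, symbol in enumerate(text):
        child = trie.get((node, symbol))
        if child is not None:
            node = child                      # keep extending a known phrase
            continue
        trie[(node, symbol)] = len(trie) + 1  # new phrase = known phrase + one symbol
        phrases.append(text[start:pos + 1])
        node, start = 0, pos + 1
    if start < len(text):
        phrases.append(text[start:])          # trailing phrase repeats an earlier one
    return phrases, trie


print(lz78_parse("aacgacga")[0])  # ['a', 'ac', 'g', 'acg', 'a']
print(lz78_parse("ctacgaga")[0])  # ['c', 't', 'a', 'cg', 'ag', 'a']
```

The node numbers produced this way match the trie labels in the figure: each new phrase gets the next free number.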
7
Basic Concept
[Figure: the alignment graph of ctacgaga (rows, phrases 1 c, 2 t, 3 a, 4 cg, 5 ag) against aacgacga (columns), partitioned into blocks by the LZ-78 phrases. Block G = 5/4 is shown together with its three prefix blocks.]
– Left prefix: block 5/2
– Top prefix: block 3/4
– Diagonal prefix: block 3/2
– Input border: I
– Output border: O
8
Basic Concept: I/O Propagation Across G
[Figure: the DIST matrix of block G, with rows labeled by the input border entries I_0 to I_5 and columns 0 to 5 (△ marks undefined entries); the OUT matrix derived from it; and the output border O_0 to O_5.]
Input border in the example: I_0 = 1, I_1 = 2, I_2 = 3, I_3 = 2, I_4 = 1, I_5 = 3.
OUT[i,j] = I[i] + DIST[i,j], and the output border is the vector of column maxima of OUT, O[j] = max_i OUT[i,j]. In the example, (O_0, …, O_5) = (1, 3, 3, 4, 2, 3).
Undefined entries of OUT are either assigned −∞ directly or set to OUT[i,j] = −(n+i+1) × k, where k is the maximal absolute value in the penalty matrix.
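The propagation step can be sketched naively as follows. This only illustrates the relation between I, DIST, OUT and O with a quadratic scan; it is not the O(t) SMAWK-based method the talk describes. The function name `propagate`, the 3 × 3 numbers, and the use of `float("-inf")` for undefined entries are assumptions made for the example.

```python
NEG_INF = float("-inf")  # stand-in for an undefined DIST entry


def propagate(I, DIST):
    """Return (OUT, O) where OUT[i][j] = I[i] + DIST[i][j] and O[j] = max_i OUT[i][j]."""
    rows, cols = len(DIST), len(DIST[0])
    OUT = [[I[i] + DIST[i][j] for j in range(cols)] for i in range(rows)]
    O = [max(OUT[i][j] for i in range(rows)) for j in range(cols)]
    return OUT, O


# Hypothetical 3 x 3 block, not the one on the slide.
I = [1, 2, 3]
DIST = [
    [0, -1, NEG_INF],
    [-2, 0, -1],
    [NEG_INF, -2, 0],
]
OUT, O = propagate(I, DIST)
print(O)  # [1, 2, 3]
```

The scan above reads every entry of OUT. The point of the total-monotonicity property is that SMAWK finds the same column maxima while querying only O(t) entries.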
9
Basic Concept: Monge Property
[Figure: the DIST matrix of the example block, rows I_0 to I_5, columns 0 to 5.]
Aggarwal and Park, and Schmidt, observed that DIST matrices are Monge arrays.
Def: A matrix M[m × n] is Monge if either condition 1 or 2 below holds for all a, b = 0…m; c, d = 0…n:
1. convex condition: $M[a,c] + M[b,d] \le M[b,c] + M[a,d]$ for all a < b and c < d
2. concave condition: $M[a,c] + M[b,d] \ge M[b,c] + M[a,d]$ for all a < b and c < d
10
Basic Concept: Totally Monotone
An important property of Monge arrays is that of being totally monotone.
Def: A matrix M[m × n] is totally monotone if either condition 1 or 2 below holds for all a, b = 0…m; c, d = 0…n:
1. convex condition: $M[a,c] \ge M[b,c] \Rightarrow M[a,d] \ge M[b,d]$ for all a < b and c < d
2. concave condition: $M[a,c] \le M[b,c] \Rightarrow M[a,d] \le M[b,d]$ for all a < b and c < d
Both DIST and OUT matrices are totally monotone by the concave condition.
Aggarwal et al. gave a recursive algorithm, nicknamed SMAWK, which can compute in O(n) time all row and column maxima of an n × n totally monotone matrix, by querying only O(n) elements of the array.
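Both concave conditions can be checked by brute force on small matrices. The sketch below assumes the concave Monge condition M[a,c] + M[b,d] ≥ M[b,c] + M[a,d] and the concave monotonicity condition M[a,c] ≤ M[b,c] ⇒ M[a,d] ≤ M[b,d], both for a < b and c < d. The function names and test matrices are illustrative and not from the talk.

```python
from itertools import combinations


def is_concave_monge(M):
    """True if M[a][c] + M[b][d] >= M[b][c] + M[a][d] for all a < b, c < d."""
    rows, cols = range(len(M)), range(len(M[0]))
    return all(
        M[a][c] + M[b][d] >= M[b][c] + M[a][d]
        for a, b in combinations(rows, 2)
        for c, d in combinations(cols, 2)
    )


def is_concave_totally_monotone(M):
    """True if M[a][c] <= M[b][c] implies M[a][d] <= M[b][d] for all a < b, c < d."""
    rows, cols = range(len(M)), range(len(M[0]))
    return all(
        M[a][c] > M[b][c] or M[a][d] <= M[b][d]
        for a, b in combinations(rows, 2)
        for c, d in combinations(cols, 2)
    )


product_table = [[i * j for j in range(4)] for i in range(4)]  # satisfies both conditions
swap = [[0, 1], [1, 0]]                                        # violates both conditions
print(is_concave_monge(product_table), is_concave_totally_monotone(product_table))  # True True
print(is_concave_monge(swap), is_concave_totally_monotone(swap))                    # False False
```

Rearranging the Monge inequality as M[b,d] − M[a,d] ≥ M[b,c] − M[a,c] shows why the first check implies the second.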
11
DIST matrix
[Figure: the DIST matrix of the example block, with the input border I_0 = 1, I_1 = 2, I_2 = 3, I_3 = 2, I_4 = 1, I_5 = 3 written beside its rows; Δ marks undefined entries.]
12
OUT matrix
[Figure: the OUT matrix of the example block, with the undefined entries padded.]
13
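OUT matrix
[Figure: the OUT matrix of the example block, with the undefined entries padded.]
Concave monotonicity: if, in the left column, the upper entry is smaller than the lower one, then in the right column the upper entry is also smaller than the lower one.
No new column maximum: the padding value −(n + i + 1) × k never becomes a column maximum.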
14
The New Block
[Figure: the block grid of ctacgaga (rows, phrases 1 c, 2 t, 3 a, 4 cg, 5 ag) against aacgacga (columns), showing the new block.]
15
Corresponding matrices
16
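Maintaining Direct Access to DIST Columns
Goal: the OUT matrix that SMAWK needs must be supplied from DIST and the input, and every entry of the OUT matrix must be available in constant time, without exceeding the space bound.
Approach: store only the newly generated column, and maintain a data structure.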
17
Data Structure
[Figure: the columns of DIST(5,4) on the block grid of ctacgaga (rows) against aacgacga (columns).]
18
Construction
[Figure: construction of DIST(5,4) on the block grid of ctacgaga (rows) against aacgacga (columns).]
19
Time and Space Complexity
– Build the new column.
– Build the DIST vector (that is, locate all the columns of this DIST matrix).
– Use SMAWK to compute the output maxima from this DIST (plus the input).
O(t)
20
Total complexity
[Figure: the block grid, labeled hn / log(n) (number of phrases) along one side and n (sequence length) along the other.]
$O(hn^2 / \log n)$
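The bound can be recovered by summing the per-block cost over the block grid. The derivation below is a sketch under two assumptions: a block whose border has length t costs O(t), as on the Time and Space Complexity slide, and each sequence of length n is parsed into O(hn / log n) phrases. The symbols P_A and P_B (phrase counts) and ℓ_r, ℓ_c (phrase lengths) are introduced here and are not in the original.

```latex
\sum_{r=1}^{P_A} \sum_{c=1}^{P_B} O(\ell_r + \ell_c)
  = O\Big( P_B \sum_{r=1}^{P_A} \ell_r + P_A \sum_{c=1}^{P_B} \ell_c \Big)
  = O\big( (P_A + P_B)\, n \big)
  = O\!\left( \frac{h n^2}{\log n} \right),
\qquad \text{where } \sum_r \ell_r = \sum_c \ell_c = n
\text{ and } P_A, P_B = O\!\left( \frac{h n}{\log n} \right).
```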
21
Sub-Quadratic Local Alignment Eric, Yu En Lu Information Management Dept. National Taiwan University
22
Sub-Quadratic Global Alignment
Exploits the redundancy (self-repetition) within the sequences, as exposed by Lempel-Ziv compression, to obtain the sub-quadratic bound.
23
Sub-Quadratic Local Alignment
Requires additional knowledge of where a locally optimal alignment starts and ends.
However, because the algorithm works on a per-block basis, we have to compute additional information specific to each block,
and then use it as the cue to the final score.
24
Additional Information
[Figure: a block annotated with I, S[i], C and E[k].]
$F = \max\{\max_{i=0}^{t}\{I[i] + E[i]\},\ C\}$
25
Algorithm Body
Given: DIST of G
Encoding
– Compute values of E
– Compute values of S
– Compute values of C
Propagation
– Compute values of O′ (modified from the O in global alignment)
– Compute F
Seek Highest Score
– Find the highest score F
26
Back-Tracking the Exact Path
Global Alignment
Local Alignment
– Given the block with the maximum F value
– We trace its path by recursively following the block attaining max{lp, tp, dia} (left prefix, top prefix, diagonal prefix) until the score reaches 0
27
Time/Space Analysis
Encoding
– E: $E[i] = \max\{E_{lp}[i],\ E_{tp}[i],\ DIST[i, l_c]\}$, O(t)
– S: (all others can be copied, except..) $S_{l_r,l_c} = \max\{S_{l_r-1,l_c} + W,\ S_{l_r,l_c-1} + W,\ S_{l_r-1,l_c-1} + W\}$, O(t)
– C: $\max\{C_{lp},\ C_{tp},\ S[l_c]\}$, O(1)
Propagation
– $O[i] = \max\{O[i],\ S[i]\}$, O(t)
– $F = \max\{\max_{i=0}^{t}\{I[i] + E[i]\},\ C\}$, O(t)
Find F: $O(hn^2 / \log^2 n)$
Total Complexity: $O(hn^2 / \log n)$
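The two propagation formulas are simple enough to state directly in code. The sketch below assumes that the vectors I, O, E, S and the scalar C of one block have already been computed. The function name `propagate_local` and the sample numbers are hypothetical.

```python
def propagate_local(I, O, E, S, C):
    """Apply the per-block propagation step of local alignment.

    O'[i] = max(O[i], S[i])
    F     = max(max_i(I[i] + E[i]), C)
    """
    O_prime = [max(o, s) for o, s in zip(O, S)]
    F = max(max(i_val + e_val for i_val, e_val in zip(I, E)), C)
    return O_prime, F


# Hypothetical values for a block with border length 3.
O_prime, F = propagate_local(I=[0, 2, 1], O=[2, 0, 5], E=[3, 1, 4], S=[1, 3, 4], C=4)
print(O_prime, F)  # [2, 3, 5] 5
```

Both steps are single passes over vectors of length t, which matches the O(t) costs listed on the slide.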
28
Further Improvements
Efficient alignment storage algorithm
– Requires “discrete weights”
– Gives a minimal encoding of DIST (O(t) → O(1)) for G
– Thus we obtain $O(hn^2 / \log^2 n)$ storage complexity in the global alignment problem
– While the time complexity is $O(hn^2 / \log n)$
29
Now we will have the presentations on SMAWK and LZ-76. Thank you!
31
The Maximum Number of Distinct Words
Speaker: Emory Chang
Date: 2002/1/31
Reference: Lempel and Ziv, 1976, “On the Complexity of Finite Sequences”
32
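What is a Distinct Word?
EX (LZ78): A = {0,1}, α = |A| = 2; S = 0101000, n = |S| = 7
Parsing: 0 | 1 | 01 | 00 | 0
[Figure: the trie, with root 0 and edges 0 –0→ 1, 0 –1→ 2, 1 –1→ 3, 1 –0→ 4; step 5 reuses the word 0.]
We have four distinct words, and five steps to generate the sequence.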
33
Notation
A: the alphabet (the set of symbols)
α: the number of symbols in A
S: a sequence over A
n: the length of S
C(S): the production complexity of S
N: the maximum possible number of distinct words
34
The upper bound
Any sequential encoding procedure employs a parsing rule by which a long string of data is broken down into words that are individually mapped into distinct words.
For every sequence S: [bound shown as an image on the slide]
35
Special case (1/2)
Let N denote the maximum possible number of distinct words.
Clearly C(S) < N + 1 (with a possible exception of the last one).
Consider the special case: the sequence is formed by all distinct words of length one, two, …, k.
ex: S = 0 | 1 | 00 | 01 | 10 | 11
[Figure: the trie, with root 0 and nodes 1 to 6.]
$n = 1 \cdot 2^1 + 2 \cdot 2^2 = 10$
36
Special case (2/2)
The number of nodes at level i: $\alpha^i$
The total length of the symbols at level i: $i \cdot \alpha^i$
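The special case is easy to enumerate. The sketch below builds the sequence formed by all words of length 1 to k over an alphabet of size α, then compares its length n and word count N with the sums of i·α^i and α^i over the levels. The function name `special_case` is an illustrative assumption.

```python
from itertools import product


def special_case(alphabet, k):
    """Concatenate all words of length 1..k over alphabet; return (sequence, word_count)."""
    words = [
        "".join(letters)
        for length in range(1, k + 1)
        for letters in product(alphabet, repeat=length)
    ]
    return "".join(words), len(words)


sequence, N = special_case("01", 2)
print(sequence, len(sequence), N)  # 0100011011 10 6

alpha, k = 2, 2
assert len(sequence) == sum(i * alpha**i for i in range(1, k + 1))  # n = sum of i * alpha^i
assert N == sum(alpha**i for i in range(1, k + 1))                  # N = sum of alpha^i
```

For the binary example on the slide this gives n = 10 and N = 6, one word per trie node.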
37
General Case
[Figure: the trie with the words of level k+1 added.]
The length of each word at level k+1 is k+1.
38
Proof
[The equations of this slide, introduced by “Since”, “We have” and “Therefore from”, were shown as images.]
39
SMAWK A Linear Time Algorithm for the Maximum Problem on Wide Totally Monotone Matrices
40
Definition
Let A be an n × m matrix with real entries.
– $A^j$ denotes the jth column of A and $A_i$ denotes the ith row of A.
– $A[i_1,…,i_k; j_1,…,j_k]$ denotes the submatrix of A formed by those rows and columns.
– Let j(i) be the smallest column index j such that A(i,j) equals the maximum value in $A_i$.
41
[Figure: an example matrix in which, for row i = 2, j(i) = 3.]
42
Definition
An n × m matrix A is monotone if for $1 \le i_1 < i_2 \le n$, $j(i_1) \le j(i_2)$.
A is totally monotone if every submatrix of A is monotone.
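These definitions translate directly into brute-force checks. The sketch below uses 0-based indices, unlike the 1-based slides, and assumes monotone means j(i_1) ≤ j(i_2) whenever i_1 < i_2. Function names and matrices are illustrative. The total-monotonicity check inspects only 2 × 2 submatrices, which suffices: if some submatrix had j(i_1) > j(i_2), those two rows and two columns would already form a non-monotone 2 × 2 submatrix.

```python
from itertools import combinations


def leftmost_row_maxima(A):
    """j(i): smallest column index attaining the maximum of row i (0-based)."""
    return [row.index(max(row)) for row in A]


def is_monotone(A):
    j = leftmost_row_maxima(A)
    return all(j[i] <= j[i + 1] for i in range(len(j) - 1))


def is_totally_monotone(A):
    """Check every 2 x 2 submatrix; this is equivalent to checking all submatrices."""
    return all(
        is_monotone([[A[r1][c1], A[r1][c2]], [A[r2][c1], A[r2][c2]]])
        for r1, r2 in combinations(range(len(A)), 2)
        for c1, c2 in combinations(range(len(A[0])), 2)
    )


A = [[-1, -1, -16, -36],
     [-16, -4, -1, -9],
     [-36, -16, -1, -1]]
print(leftmost_row_maxima(A))                 # [0, 2, 2]
print(is_monotone(A), is_totally_monotone(A)) # True True

B = [[1, 2, 3],
     [3, 0, 4]]
print(is_monotone(B), is_totally_monotone(B)) # True False
```

Matrix B shows why the two notions differ: its row maxima are in order, but the submatrix on columns 0 and 1 is not monotone.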
43
Another Definition
In the previous paper, the definition of totally monotone is:
– A matrix M[0…m, 0…n] is totally monotone if either condition 1 or 2 below holds for all a, b = 0…m; c, d = 0…n:
– 1. Convex condition: $M[a,c] \ge M[b,c] \Rightarrow M[a,d] \ge M[b,d]$ for all a < b and c < d
– 2. Concave condition: $M[a,c] \le M[b,c] \Rightarrow M[a,d] \le M[b,d]$ for all a < b and c < d
We use concave here.
44
Comparison
Now we want to compare these two definitions.
The definition in the SMAWK paper is called $D_s$; the definition in this paper is called $D_c$. (We need a transpose to match the rows and columns of these two definitions.)
45
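Comparison (cont.)
To prove $D_c \Rightarrow D_s$:
– $D_c$ holds on the matrix A[0…n, 0…m].
– Let A′[i…i′, j…j′] be a submatrix of A, with $i \le i_1 < i_2 \le i'$, $j_1 = j(i_1)$ and $j_2 = j(i_2)$. We must show $j_1 \le j_2$.
[Figure: rows $i_1$, $i_2$ and columns $j_1$, $j_2$ of the submatrix, with entries labeled a, b, c, d and e, f, g, h.]
So $j_1 \le j_2$.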
46
Comparison (cont.)
To show that $D_s$ does not imply $D_c$:
– The matrix [shown as an image on the slide] satisfies $D_s$ but not $D_c$. $D_c$ is stronger.
47
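Lemma 1
– We say that an entry A[i,j] is dead if $j \ne j(i)$.
– Lemma 1:
– Let A be a totally monotone n × m matrix and let $1 \le j_1 < j_2 \le m$.
If $A(r, j_1) \ge A(r, j_2)$ then the entries in $\{A(i, j_2) : 1 \le i \le r\}$ are dead.
If $A(r, j_1) < A(r, j_2)$ then the entries in $\{A(i, j_1) : r \le i \le n\}$ are dead.
[Figure: columns $j_1$ and $j_2$ with row r marked, showing the dead entries above and below r.]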
48
REDUCE(A)
– C = A; k = 1
– While C has more than n columns do
–   case
–     C(k,k) ≥ C(k,k+1) and k < n: k = k+1
–     C(k,k) ≥ C(k,k+1) and k = n: delete column $C^{k+1}$
–     C(k,k) < C(k,k+1): delete column $C^k$; if k > 1 then k = k−1
– Return C
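A common way to implement REDUCE is with a stack of surviving column indices. The sketch below uses that stack formulation rather than the in-place column deletion of the pseudocode, and 0-based indices. It assumes a totally monotone matrix in which row maxima move rightward with increasing row index; the function name `reduce_columns` is illustrative.

```python
def reduce_columns(A):
    """Return at most len(A) column indices that still contain every leftmost row maximum."""
    n = len(A)
    stack = []  # stack[k] is the k-th surviving column, compared on row k
    for c in range(len(A[0])):
        while stack:
            r = len(stack) - 1
            if A[r][stack[-1]] >= A[r][c]:
                break        # column c is dead in rows 0..r (first part of Lemma 1)
            stack.pop()      # top column is dead in rows r..n-1, hence everywhere
        if len(stack) < n:
            stack.append(c)  # column c may still hold maxima of later rows
        # otherwise c is dead in all n rows and is dropped
    return stack


A = [[-1, -1, -16, -36],
     [-16, -4, -1, -9],
     [-36, -16, -1, -1]]
print(reduce_columns(A))  # [0, 2, 3]
```

Each column is pushed at most once and popped at most once, which gives the O(m) bound for REDUCE.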
49
REDUCE(A)
[Figures on slides 49 to 51: a step-by-step example run of REDUCE, showing the column comparisons.]
52
Time Complexity
Case 2 + Case 3 = m − n
Case 1: at most n + (m − n) − 1
Total: 2m − n − 1 = O(m)
53
MAXCOMPUTE(A)
– B = REDUCE(A)
– If n = 1 then output the maximum and return
– C = B[2, 4, …, 2⌊n/2⌋; 1, 2, …, n]
– MAXCOMPUTE(C)
– From the known positions of the maxima in the even rows of B, find the maxima in its odd rows.
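The recursion can be put together as follows. This is a sketch of the usual SMAWK structure (reduce, recurse on every other row, interpolate the remaining rows), written with 0-based indices and a stack-based reduce. It assumes a totally monotone input in which row maxima move rightward with increasing row index. The names `smawk_row_maxima` and `value` are illustrative, and entries are queried through a callback so that the matrix need not be stored.

```python
def smawk_row_maxima(value, n_rows, n_cols):
    """Return, for each row, the leftmost column holding that row's maximum."""
    result = [0] * n_rows

    def reduce_cols(rows, cols):
        stack = []
        for c in cols:
            while stack:
                r = rows[len(stack) - 1]
                if value(r, stack[-1]) >= value(r, c):
                    break
                stack.pop()
            if len(stack) < len(rows):
                stack.append(c)
        return stack

    def solve(rows, cols):
        if not rows:
            return
        cols = reduce_cols(rows, cols)
        solve(rows[1::2], cols)  # the even rows of the slides (1-based)
        k = 0                    # position in cols; never moves backwards
        for idx in range(0, len(rows), 2):
            r = rows[idx]
            # The maximum of row r lies no further right than that of the next row.
            stop = result[rows[idx + 1]] if idx + 1 < len(rows) else cols[-1]
            best = cols[k]
            while cols[k] != stop:
                k += 1
                if value(r, cols[k]) > value(r, best):
                    best = cols[k]
            result[r] = best

    solve(list(range(n_rows)), list(range(n_cols)))
    return result


# Illustrative totally monotone matrix: value(i, j) = -(a[i] - b[j])^2 with sorted a, b.
a = [1, 4, 6]
b = [0, 2, 5, 7]
print(smawk_row_maxima(lambda i, j: -(a[i] - b[j]) ** 2, len(a), len(b)))  # [0, 2, 2]
```

On inputs that are not totally monotone the interpolation loop is not guaranteed to find `stop`, so the sketch should only be used where the property is known to hold.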
54
IDEA
55
Time Complexity
$T(n,m) = c_1 m + c_2 n + T(n/2, n)$
$= c_1 m + (c_1 + c_2) n + c_2 n/2 + T(n/4, n/2)$
$T(n,m) \le 2(c_1 + c_2) n + c_1 m = O(m)$, since m ≥ n for a wide matrix.