
1 A Sub-quadratic Sequence Alignment Algorithm

2 Global alignment
Alignment graph for S = aacgacga, T = ctacgaga
Complexity: O(n^2)
V(i,j) = max { V(i-1,j-1) + δ(S[i], T[j]), V(i-1,j) + δ(S[i], -), V(i,j-1) + δ(-, T[j]) }
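The recurrence on this slide can be sketched as a straightforward table fill. This is a minimal illustration, not the presenters' code: the function names, the use of "-" as the gap symbol, and the example scoring function `example_delta` (+1 match, -1 mismatch, -1 gap) are assumptions made here. Any scoring function with the same signature can be passed in.

```python
def global_alignment_score(s, t, delta):
    """Fill the (len(s)+1) x (len(t)+1) table V and return V[n][m]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # First column: s[0..i) aligned against gaps only.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + delta(s[i - 1], "-")
    # First row: t[0..j) aligned against gaps only.
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + delta("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),  # (mis)match
                V[i - 1][j] + delta(s[i - 1], "-"),           # gap in t
                V[i][j - 1] + delta("-", t[j - 1]),           # gap in s
            )
    return V[n][m]


def example_delta(x, y):
    """Hypothetical scoring function: +1 match, -1 mismatch, -1 gap."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


if __name__ == "__main__":
    print(global_alignment_score("aacgacga", "ctacgaga", example_delta))
```

The two nested loops visit every cell once, which is where the quadratic cost comes from.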

3 FOUR RUSSIANS ALGORITHM

5 UNRESTRICTED SCORING FUNCTION

6 Main idea: Compress the sequences
S = aacgacga, T = ctacgaga
LZ-78: Divide the sequence into distinct words
[Figure: S divided into words 1-4, T divided into words 1-5, and the trie built from each set of words]
The number of distinct words:

7 Main idea
[Figure: alignment graph partitioned into blocks, one per pair of words (blocks 3/4, 3/2, 5/4, 5/2 labelled), with the trie for T and the trie for S]
- Compute the alignment score in each block
- Propagate the scores between the adjacent blocks

8 Main idea
- Compress the sequence into words
- Pre-compute the score for each block
- Do alignment between blocks
Note:
- Replace normal characters by words
- Operate on blocks

9 COMPRESS THE SEQUENCE LZ-78

10 S = aacgacga, T = ctacgaga
LZ-78: Divide the sequence into distinct words
[Figure: S divided into words 1-4, T divided into words 1-5, and the trie built from each set of words]
The number of distinct words:
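The LZ-78 division into distinct words can be sketched with a dictionary-based trie. This is an illustrative sketch under assumptions: the function name `lz78_parse` is made up here, and appending any leftover suffix as a final (possibly repeated) phrase is one common convention, not something the slide specifies.

```python
def lz78_parse(seq):
    """Split seq into LZ-78 phrases: each new phrase is the longest
    previously seen phrase extended by one character."""
    trie = {}      # (parent node id, character) -> child node id; root is 0
    phrases = []
    node = 0       # current position in the trie
    start = 0      # start index of the phrase being read
    for pos, ch in enumerate(seq):
        key = (node, ch)
        if key in trie:
            node = trie[key]                 # keep extending a known phrase
        else:
            trie[key] = len(phrases) + 1     # new trie node = new phrase
            phrases.append(seq[start:pos + 1])
            node = 0
            start = pos + 1
    if start < len(seq):
        # Leftover suffix that is already in the trie (assumed convention).
        phrases.append(seq[start:])
    return phrases


if __name__ == "__main__":
    print(lz78_parse("aacgacga"))
    print(lz78_parse("ctacgaga"))
```

For the slide's sequences, this gives the distinct words a, ac, g, acg for S and c, t, a, cg, ag for T, each followed by a leftover "a" that is already in the trie.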

11 LZ-78
Theorem (Lempel and Ziv):
- For a sequence S of length n over a constant-size alphabet, the maximal number of distinct phrases in S is O(n/log n).
Tighter upper bound: O(hn/log n)
- h is the entropy factor, a real number with 0 < h ≤ 1
- Small entropy means the sequence is repetitive

12 COMPUTE THE ALIGNMENT SCORE IN EACH BLOCK

13 Compute the alignment score in each block
[Figure: alignment graph partitioned into blocks (blocks 3/4, 3/2, 5/4, 5/2 labelled)]

14 Given
- Input border: I
- Block
Compute
- Output border: O
[Figure: block G with input border I and output border O, border entries indexed 0-5]

15 Matrices
- I[i]: the input border value
- DIST[i,j]: weight of the optimal path from entry i of the input border to entry j of the output border
- OUT[i,j]: merges the information from input row I and DIST: OUT[i,j] = I[i] + DIST[i,j]
- O[j] = max over all input border entries i of OUT[i,j]
[Figure: block G with input border I and output border O, border entries indexed 0-5]
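The definitions of OUT and O translate directly into the naive computation below. It is a sketch only: the function name `output_border`, the use of `float("-inf")` for undefined DIST entries, and the toy 2 x 2 input are assumptions for illustration and are not the values from the slides.

```python
NEG_INF = float("-inf")  # assumed marker for "no path" entries in DIST


def output_border(I, DIST):
    """Build OUT[i][j] = I[i] + DIST[i][j] explicitly, then take the
    maximum of each column to obtain the output border O."""
    rows, cols = len(I), len(DIST[0])
    OUT = [[I[i] + DIST[i][j] for j in range(cols)] for i in range(rows)]
    O = [max(OUT[i][j] for i in range(rows)) for j in range(cols)]
    return OUT, O


if __name__ == "__main__":
    # Toy data, not taken from the slides.
    OUT, O = output_border([1, 2], [[0, -1], [NEG_INF, 0]])
    print(OUT)
    print(O)
```

This version touches every entry of OUT, so its cost is proportional to the product of the two border sizes. It serves as a baseline for checking a faster method.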

16 DIST and OUT matrix example
Block: sub-sequences "acg", "ag"
I (input borders): I0 = 1, I1 = 2, I2 = 3, I3 = 2, I4 = 1, I5 = 3
O (max of each OUT column): O0 = 1, O1 = 3, O2 = 3, O3 = 4, O4 = 2, O5 = 3
[Figure: block G with its 6 x 6 DIST matrix and OUT matrix, rows I0-I5, columns 0-5]

17 For each block, given two sub-sequences S1, S2
- Compute (from scratch) DIST in Θ(n*m) time
- Given I and DIST, compute OUT in Θ(n*m) time
- Given OUT[i,j], compute O in Θ(m*n) time

18 Review
- Compress the sequence
- Pre-compute DIST[i,j] for each block
- Compute border values of each block
Remaining questions
- How to compute DIST[i,j] efficiently?
- How to compute O[j] from I[i] and DIST[i,j] efficiently?
[Figure: alignment graph partitioned into blocks]

19 COMPUTE O[J] EFFICIENTLY

20 Compute O[j] efficiently
For each block of two sub-sequences S1, S2
Given
- I[i]
- DIST[i,j]
Compute
- O[j]

21 DIST and OUT matrix example
Block: sub-sequences "acg", "ag"
I (input borders): I0 = 1, I1 = 2, I2 = 3, I3 = 2, I4 = 1, I5 = 3
O (max of each OUT column): O0 = 1, O1 = 3, O2 = 3, O3 = 4, O4 = 2, O5 = 3
[Figure: block G with its 6 x 6 DIST matrix and OUT matrix, rows I0-I5, columns 0-5]

22 Compute O without explicit OUT
Block: sub-sequences "acg", "ag"
I (input borders): I0 = 1, I1 = 2, I2 = 3, I3 = 2, I4 = 1, I5 = 3
O (computed by SMAWK): O0 = 1, O1 = 3, O2 = 3, O3 = 4, O4 = 2, O5 = 3
[Figure: block G with its 6 x 6 DIST matrix, rows I0-I5, columns 0-5]

23 Given DIST[i,j] and I[i], we can compute O[j] in O(n+m) time, without creating OUT[i,j]
How? Why?

24 Why?
Aggarwal, Park and Schmidt observed that DIST and OUT matrices are Monge arrays.
Definition: a matrix M[0…m,0…n] is totally monotone if either condition 1 or 2 below holds for all a,b = 0…m; c,d = 0…n:
1. Convex condition: M[a,c] ≥ M[b,c] ⟹ M[a,d] ≥ M[b,d] for all a<b and c<d.
2. Concave condition: M[a,c] ≤ M[b,c] ⟹ M[a,d] ≤ M[b,d] for all a<b and c<d.
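A brute-force check can make the definition concrete. The sketch below assumes the concave condition reads "M[a,c] ≤ M[b,c] implies M[a,d] ≤ M[b,d] for all a<b and c<d"; the function name is made up for illustration. It tests every pair of rows and columns, so it is only meant for checking small matrices, not for use inside the alignment algorithm.

```python
from itertools import combinations


def is_concave_totally_monotone(M):
    """Return True if, for all rows a < b and columns c < d,
    M[a][c] <= M[b][c] implies M[a][d] <= M[b][d]."""
    rows = range(len(M))
    cols = range(len(M[0]))
    for a, b in combinations(rows, 2):        # yields pairs with a < b
        for c, d in combinations(cols, 2):    # yields pairs with c < d
            if M[a][c] <= M[b][c] and not (M[a][d] <= M[b][d]):
                return False
    return True


if __name__ == "__main__":
    print(is_concave_totally_monotone([[3, 1], [2, 2]]))  # premise never holds
    print(is_concave_totally_monotone([[1, 3], [2, 2]]))  # violated at d = 1
```

The convex condition can be checked the same way by flipping both comparisons.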

25 How?
Aggarwal et al. gave a recursive algorithm, called SMAWK, which can find all row and column maxima of a totally monotone matrix by querying only O(n) elements of the matrix.

26 Why is DIST[i,j] totally monotone?
The concave condition: if b-c is better than a-c, then b-d is better than a-d.
[Figure: block G with entries a, b marked on the input border I and entries c, d on the output border O]

27 Other problem
Rectangle problem of DIST
- Set upper right corner of OUT to -∞
- Set lower left corner of OUT to -(n+i-1)*k
- Preserve the totally monotone property of OUT
[Figure: DIST matrix of the example block, rows I0-I5, columns 0-5, with undefined entries marked △]

28 COMPUTE DIST[I,J] EFFICIENTLY

29 Compute DIST[i,j] for block (5/4)
[Figure: alignment graph partitioned into blocks, with the trie for T and the trie for S]

30 DIST matrix

35 Only column m in DIST[i,j] is new
The DIST block can be updated in O(m+n) time

36 MAINTAINING DIRECT ACCESS TO DIST TABLE

37 [Figure: trie for S and trie for T beside the block grid, annotated with DIST values]

38 [Figure: trie for S and trie for T beside the block grid, annotated with further DIST values]

39 DIST
[Figure: trie for S and trie for T beside the block grid, annotated with DIST values]

41 Complexity
Assume |S| = |T| = n
- Number of words in S, T = O(hn/log n)
- Number of blocks in alignment graph: O(h^2 n^2/(log n)^2)
For each block (t = size of the border)
- Update new DIST block: O(t)
- Create direct access table: O(t)
Propagating I/O across blocks
- SMAWK: O(t)
Sum of the sizes of all borders is O(hn^2/log n)
Total complexity: O(hn^2/log n)
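The bound on the total border size follows from a short calculation. As a sketch, write the LZ-78 words of S as s_1, ..., s_p and those of T as t_1, ..., t_q (the names p, q, s_i, t_j are introduced here for illustration), and assume the border of the block for the pair (s_i, t_j) has size proportional to |s_i| + |t_j|:

```latex
\begin{align*}
\sum_{i=1}^{p}\sum_{j=1}^{q}\bigl(|s_i| + |t_j|\bigr)
  &= q\sum_{i=1}^{p}|s_i| + p\sum_{j=1}^{q}|t_j| \\
  &= qn + pn \\
  &= O\!\left(\frac{hn^2}{\log n}\right)
  \quad\text{since } p, q = O\!\left(\frac{hn}{\log n}\right).
\end{align*}
```

Because each block costs O(t) for a border of size t, the total running time is of the same order as this sum.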

42 Other extensions
- Trace
- Reducing the space complexity for discrete scoring
- Local alignment

43 References
Crochemore, M.; Landau, G. M. & Ziv-Ukelson, M. A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. ACM-SIAM, 2002, 679-688.
Some pictures from 葉恆青

