1 Fast Parallel and Serial Approximate String Matching
G. Landau and U. Vishkin, Journal of Algorithms, Vol. 10 (1989)
Advisor: Prof. R. C. T. Lee
Speaker: L. Y. Huang
2 Problem
Given two arrays, the pattern $P = p_1 p_2 \dots p_m$ and the text $T = t_1 t_2 \dots t_n$, and an integer $k$ ($k \ge 1$), find all occurrences of the pattern in the text with edit distance at most $k$.
3 This algorithm improves on the Alternative Dynamic Programming Computation. We first review the Dynamic Programming Computation on which that alternative is based.
4 The Dynamic Programming Algorithm [S80]
In the dynamic programming approach, we construct an $(n+1) \times (m+1)$ matrix $D$, where $D_{i,j}$ is the minimum edit distance between $P(1, j)$ and any substring of $T$ that ends at $T_i$.
Example: T = gggtcta, P = gttc, k = 2.
[Table: matrix D for T = gggtcta (index i) and P = gttc (index j = 1..4)]
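The table for this example can be reproduced with a short Python sketch. The function names `sellers_table` and `occurrences` are illustrative choices, not names from the slides, and the code indexes strings from 0 while the slides count from 1.

```python
def sellers_table(text, pattern):
    """Return D, where D[i][j] is the minimum edit distance between
    pattern[:j] and any substring of text that ends at text position i
    (positions counted from 1, as on the slides)."""
    n, m = len(text), len(pattern)
    D = [[0] * (m + 1) for _ in range(n + 1)]  # D[i][0] = 0: a match may start anywhere
    for j in range(1, m + 1):
        D[0][j] = j  # pattern[:j] against the empty text costs j edits
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if text[i - 1] == pattern[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost,  # match or substitution
                          D[i][j - 1] + 1,         # step from Diagonal d+1
                          D[i - 1][j] + 1)         # step from Diagonal d-1
    return D


def occurrences(text, pattern, k):
    """End positions i (1-based) where the whole pattern matches with at most k edits."""
    D = sellers_table(text, pattern)
    m = len(pattern)
    return [i for i in range(1, len(text) + 1) if D[i][m] <= k]


D = sellers_table("gggtcta", "gttc")
print([D[i][4] for i in range(8)])         # [4, 3, 3, 3, 2, 1, 2, 2]
print(occurrences("gggtcta", "gttc", 2))   # [4, 5, 6, 7]
```

This fills all $(n+1)(m+1)$ entries, which is the O(mn) cost that the diagonal method avoids.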
5 We found (each substring of T is aligned with P = gttc):
– (1) gt: distance = 2
– (2) gtc: distance = 1
6
– (3) gtct: distance = 2
– (4) gtct, a second alignment: distance = 2
– (5) gtcta: distance = 2
[Table: matrix D for T = gggtcta and P = gttc with the occurrences marked]
7 An Alternative Dynamic Programming Computation
We make heavy use of the concept of a diagonal. Diagonal $d$ consists of all entries $D_{i,j}$ with $d = i - j$.
[Figure: a small example table with Diagonal 2 highlighted]
8 We first have the following:
– (a) If $T_i = P_j$, then $D_{i,j} = D_{i-1,j-1}$;
– (b) otherwise, $D_{i,j}$ is the minimum of $D_{i-1,j-1} + 1$ (substitution), $D_{i,j-1} + 1$ (deletion), and $D_{i-1,j} + 1$ (insertion).
9 Consider any diagonal $d$. Let us find the largest $j$, if it exists, such that $(i, j)$ is on Diagonal $d$ ($i - j = d$) and $D_{i,j} = 0$. Let us now label all of these locations.
[Table: D for T = gggtcta, P = gttc with the 0 entries on Diagonals 0, 1 and 2 labeled]
10 Having found the above locations $(i, j)$ where $D_{i,j} = 0$, we can further find the largest $j$, if it exists, such that $(i, j)$ is on Diagonal $d$ and $D_{i,j} = 1$. To do this, we use the following observation: each element in Diagonal $d$ can only influence elements in Diagonals $d-1$, $d$ and $d+1$.
11 Let us consider any location $(i, j)$ on Diagonal $d$. Why can $D_{i,j}$ suddenly become 1?
– It can only be influenced by three neighbors: $D_{i-1,j-1}$ on Diagonal $d$ (substitution), $D_{i,j-1}$ on Diagonal $d+1$ (deletion), and $D_{i-1,j}$ on Diagonal $d-1$ (insertion).
Thus, we conclude that we only need to consider Diagonals $d-1$, $d$ and $d+1$.
[Figure: $D_{i,j}$ with its three neighbors $D_{i-1,j-1}$, $D_{i,j-1}$ and $D_{i-1,j}$]
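12 Let us consider the following table. Question: what is the value of $D_{4,3}$?
– It cannot be 0, because we have already determined that the largest $j$ on Diagonal 1 with $D_{i,j} = 0$ is 1. It is at most 1, because $D_{4,2} = 0$ lies on Diagonal 2 and one deletion leads from $D_{4,2}$ to $D_{4,3}$. Thus $D_{4,3} = 1$.
[Table: D for T = gggtcta, P = gttc with the 0 entries filled in and $D_{4,3}$ marked "?"; d = 1]
13 Question: what is the value of $D_{5,4}$?
– Since $T_5 = P_4$, $D_{5,4} = D_{4,3} = 1$.
[Table: the same table with $D_{4,3} = 1$ filled in and $D_{5,4}$ marked "?"; d = 1]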
14 Based upon the above discussion, we can find all $(i, j)$ where $D_{i,j} = 1$ after finding all $(i, j)$ where $D_{i,j} = 0$. In general, after finding all $(i, j)$ where $D_{i,j} = e$, we can find all $(i, j)$ where $D_{i,j} = e + 1$. Thus the full dynamic programming table does not have to be computed. In the following, we give the Alternative Dynamic Programming Computation formally.
15 Let $L_{d,e}$ denote the largest row $j$ such that $D_{i,j}$ is on Diagonal $d$ ($i - j = d$) and $D_{i,j} = e$. By this definition, $e$ is the minimum edit distance between $P(1, L_{d,e})$ and any substring of $T$ ending at $T_{L_{d,e}+d}$, and $P_{L_{d,e}+1} \ne T_{L_{d,e}+d+1}$.
Example: let $d = 3$. Then $L_{3,0} = 0$, $L_{3,1} = 3$, $L_{3,2} = 4$.
[Table: D for T = gggtcta, P = gttc with Diagonal 3 highlighted]
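The definition can be checked directly against the full table. The sketch below reads $L_{d,e}$ off a completely computed D, which is exactly the work the diagonal method is meant to avoid, so it serves only as a reference. `L_from_table` is a hypothetical helper name, and returning `None` for a value that does not occur on the diagonal is a choice made here, not a convention from the slides.

```python
def sellers_table(text, pattern):
    """D[i][j]: minimum edit distance between pattern[:j] and any
    substring of text ending at position i (1-based)."""
    n, m = len(text), len(pattern)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if text[i - 1] == pattern[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost, D[i][j - 1] + 1, D[i - 1][j] + 1)
    return D


def L_from_table(D, d, e):
    """Largest row j with D[i][j] == e on Diagonal d (i - j = d),
    or None if no entry on that diagonal equals e."""
    n, m = len(D) - 1, len(D[0]) - 1
    best = None
    for j in range(m + 1):
        i = j + d
        if 0 <= i <= n and D[i][j] == e:
            best = j
    return best


D = sellers_table("gggtcta", "gttc")
print([L_from_table(D, 3, e) for e in range(3)])  # [0, 3, 4]
```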
16 Example:
– T = gggtcta
– P = gttc
– k = 2
Now, $L_{3,1} = 3$. It means that we have found a substring A, which is T(3,6) = gtct, ending at $T_{L_{d,e}+d} = T_{3+3} = T_6$, such that the edit distance between A and P(1,3) = gtt is 1. Moreover, $P_{L_{d,e}+1} \ne T_{L_{d,e}+d+1}$, that is, $P_{3+1} = c \ne T_{3+3+1} = a$.
[Table: D for T = gggtcta, P = gttc]
17 Example:
– T = gggtcta
– P = gttc
– k = 2
Now, $L_{1,1} = 4 = m$. It means that we have found a substring A, which is T(2,5) = ggtc, ending at $T_{L_{d,e}+d} = T_{4+1} = T_5$, such that the edit distance between A and P(1,4) = gttc is 1. Since $L_{1,1} = m$, this is an occurrence of the whole pattern: T(2,5) = ggtc matches P = gttc with one error.
[Table: D for T = gggtcta, P = gttc]
18 The alternative dynamic programming computation computes the values $L_{d,e}$.
19 An Alternative Dynamic Programming Computation
First, we set the initial values.
Example:
– T = gggtcta
– P = gttc
[Table: initial table for T = gggtcta, P = gttc]
20 e = 0
For d = 0 to d = n, find the largest $j$ such that P[1…j] is equal to T[d+1…d+j], and set $L_{d,0} = j$.
d = 0: $P_1 = T_1$ and $P_2 \ne T_2$, so $L_{0,0} = 1$.
[Table: D for T = gggtcta, P = gttc; d = 0]
21 e = 0
d = 1: $P_1 = T_2$ and $P_2 \ne T_3$, so $L_{1,0} = 1$.
[Table: D for T = gggtcta, P = gttc; d = 1]
22 e = 0
d = 2: $P_1 = T_3$, $P_2 = T_4$ and $P_3 \ne T_5$, so $L_{2,0} = 2$.
[Table: D for T = gggtcta, P = gttc; d = 2]
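The e = 0 pass amounts to a character-by-character comparison along each diagonal. A minimal sketch, assuming 0-based Python strings (so $T_{d+1}$ is `text[d]`); `exact_extension` is a name chosen here for illustration:

```python
def exact_extension(text, pattern, d):
    """L_{d,0}: the largest j such that P[1..j] equals T[d+1..d+j]."""
    row = 0
    while (row < len(pattern) and row + d < len(text)
           and pattern[row] == text[row + d]):
        row += 1
    return row


# Diagonals 0..3 of the running example T = gggtcta, P = gttc.
print([exact_extension("gggtcta", "gttc", d) for d in range(4)])  # [1, 1, 2, 0]
```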
23 Our approach is based upon Rule 1 proposed by Professor Lee. Consider two substrings $A_1$ and $A_2$ as shown below:
[Figure: substrings $A_1$ and $A_2$ with parts labeled $P_1$, $S_1$ and $P_2$, $S_2$]
If $d(A_1, A_2) \le k$ and $S_1 = S_2$, then $d(P_1, P_2) \le k$.
24 Observe the following: if $d(A_1, A_2) = k$, $S_1 = S_2$ and $x \ne y$, then $d(A_1 + S_1 + x, A_2 + S_2 + y) \le k + 1$.
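The observation on slide 24 can be spot-checked on one instance with an ordinary edit distance routine. The strings below are made up for illustration and do not come from the slides; `edit_distance` is the standard two-string distance, not the substring variant $D_{i,j}$.

```python
def edit_distance(a, b):
    """Edit distance between two whole strings (substitution, insertion, deletion)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb),  # match or substitution
                           prev[j] + 1,               # drop ca
                           cur[j - 1] + 1))           # drop cb
        prev = cur
    return prev[-1]


# Illustrative instance: A1 = "gg", A2 = "g", S1 = S2 = "t", x = "c", y = "t".
a1, a2, s, x, y = "gg", "g", "t", "c", "t"
k = edit_distance(a1, a2)
extended = edit_distance(a1 + s + x, a2 + s + y)
print(k, extended)  # 1 2
assert extended <= k + 1
```

Appending the equal parts costs nothing and the final mismatch costs at most one substitution, which is why each error level only needs the results of the previous one.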
25 For e > 0, we go through d = -e to d = n.
Let $\text{row} = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)]$, where the three terms correspond to substitution (Diagonal $d$), insertion (Diagonal $d-1$) and deletion (Diagonal $d+1$).
Find the largest $j$, if it exists, such that P(row+1, j) = T(row+1+d, i) = T(row+1+i-j, i), and set $L_{d,e} = j$. If no such $j$ exists, set $L_{d,e} = \text{row}$.
26 [Figure: $L_{d,e-1}$ on Diagonal $d$ (substitution), $L_{d+1,e-1}$ on Diagonal $d+1$ (deletion) and $L_{d-1,e-1}$ on Diagonal $d-1$ (insertion) feeding into $L_{d,e}$]
27 d = -1, e = 1, with the boundary values $L_{-1,0} = 0$ and $L_{-2,0} = 0$:
$\text{row} = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = \max[0+1, 0, 1+1] = \max[1, 0, 2] = 2$
$P_3 \ne T_2$, so $L_{-1,1} = 2$.
[Table: D for T = gggtcta, P = gttc; d = -1]
28 d = 0, e = 1:
$\text{row} = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = \max[1+1, 0, 1+1] = \max[2, 0, 2] = 2$
$P_3 \ne T_3$, so $L_{0,1} = 2$.
[Table: D for T = gggtcta, P = gttc; d = 0]
29 d = 1, e = 1:
$\text{row} = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = \max[1+1, 1, 2+1] = \max[2, 1, 3] = 3$
$P_4 = T_5 = c$, so $L_{1,1} = 4 = m$.
We find an occurrence of the pattern in the text with edit distance at most 1 that ends at $T_{d+m} = T_{1+4} = T_5$.
[Table: D for T = gggtcta, P = gttc; d = 1]
30 d = 3, e = 1:
$\text{row} = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = \max[0+1, 2, 0+1] = \max[1, 2, 1] = 2$
$P_3 = T_6$ and $P_4 \ne T_7$, so $L_{3,1} = 3$.
[Table: D for T = gggtcta, P = gttc; d = 3]
31 d = 3, e = 2:
$\text{row} = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)] = \max[3+1, 3, 2+1] = \max[4, 3, 3] = 4$
$L_{3,2} = 4 = m$.
We find an occurrence of the pattern in the text with edit distance at most 2 that ends at $t_{d+m} = t_{3+4} = t_7$.
[Table: D for T = gggtcta, P = gttc; d = 3]
32 An Alternative Dynamic Programming Computation
Initialization
  for all d, 0 ≤ d ≤ n: L_{d,-1} = -1
  for all d, -(k+1) ≤ d ≤ -1: L_{d,|d|-1} = |d|-1, L_{d,|d|-2} = |d|-2
  for all e, -1 ≤ e ≤ k: L_{n+1,e} = -1
For e = 0 to k do
  For d = -e to n do
    row = max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)]
    row = min(row, m)
    while row < m and row + d < n and p_{row+1} = t_{row+1+d} do
      row = row + 1
    L_{d,e} = row
    if L_{d,e} = m then
      print "there is an occurrence ending at t_{d+m}"
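The pseudocode translates almost line for line into Python. The sketch below is one possible rendering, not the authors' code: the dictionary `L` keyed by `(d, e)`, the function name `diagonal_search`, and the extra check `d + m <= n` (which drops end positions past the end of the text) are assumptions made here. The boundary values follow the initialization $L_{d,|d|-1} = |d|-1$.

```python
def diagonal_search(text, pattern, k):
    """Return (ends, L): the sorted 1-based end positions of occurrences
    with at most k differences, and the table of values L[d, e]."""
    n, m = len(text), len(pattern)
    L = {}
    for d in range(0, n + 1):
        L[d, -1] = -1
    for d in range(-(k + 1), 0):
        L[d, abs(d) - 1] = abs(d) - 1
        L[d, abs(d) - 2] = abs(d) - 2
    for e in range(-1, k + 1):
        L[n + 1, e] = -1

    ends = set()
    for e in range(0, k + 1):
        for d in range(-e, n + 1):
            row = max(L[d, e - 1] + 1, L[d - 1, e - 1], L[d + 1, e - 1] + 1)
            row = min(row, m)
            # Slide along Diagonal d while the characters agree
            # (p_{row+1} is pattern[row], t_{row+1+d} is text[row + d]).
            while row < m and row + d < n and pattern[row] == text[row + d]:
                row += 1
            L[d, e] = row
            if row == m and d + m <= n:  # guard added in this sketch
                ends.add(d + m)
    return sorted(ends), L


ends, L = diagonal_search("gggtcta", "gttc", 2)
print(ends)                # [4, 5, 6, 7]
print(L[1, 1], L[3, 1])    # 4 3
```

The values agree with the worked slides: $L_{1,1} = 4$, $L_{3,1} = 3$ and $L_{3,2} = 4$.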
33 Difference from this algorithm
In the alternative dynamic programming computation, we must search for the largest $j$ such that P(row+1, j) = T(row+1+d, i) = T(row+1+i-j, i). Essentially, we are looking for equal substrings $S_1$ and $S_2$ in T and P respectively.
[Figure: $S_1$ in T and $S_2$ in P]
The paper uses LCA (lowest common ancestor) queries to speed up this search.
34 Algorithm
This algorithm has two steps:
– Concatenate the text and the pattern into one string $t_1, \dots, t_n, p_1, \dots, p_m$. Compute the suffix tree of this string.
– Find all occurrences of the pattern in the text with edit distance at most k.
35 T = ABCDEA, P = DDBE, S = ABCDEADDBE
[Figure: suffix tree of S]
The suffix tree of a string of length n can be constructed in O(n) time (Weiner, 1973; McCreight, 1976; Ukkonen, 1995).
36 The lowest common ancestor of two leaf nodes can be found in O(1) time after O(n) preprocessing at construction time (Harel and Tarjan, 1984).
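37 To find such an S, if it exists, we concatenate T and P to form a new string. On the suffix tree of this string, the suffixes that begin with $S_1$ and $S_2$ have a common ancestor whose path label is S.
[Figure: $S_1$ in T and $S_2$ in P within the concatenated string]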
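A full suffix tree with constant-time LCA is long to write out, so the sketch below swaps in a different, simpler structure that answers the same question: a suffix array with an LCP array and a sparse table for range minima. This is a stand-in chosen here, not the construction used in the paper. Its preprocessing is slower than linear because the suffixes are sorted naively, but each query afterwards takes constant time. The name `build_lce` is hypothetical.

```python
def build_lce(s):
    """Preprocess s and return lce(a, b): the length of the longest common
    prefix of the suffixes s[a:] and s[b:] (0-based positions)."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # naive suffix array
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    # Kasai's method: lcp[r] is the LCP of the suffixes sa[r-1] and sa[r].
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    # Sparse table: table[l][i] = min(lcp[i : i + 2**l]).
    table = [lcp]
    span = 1
    while 2 * span <= n:
        prev = table[-1]
        table.append([min(prev[i], prev[i + span]) for i in range(n - 2 * span + 1)])
        span *= 2

    def lce(a, b):
        if a >= n or b >= n:
            return 0
        if a == b:
            return n - a
        lo, hi = sorted((rank[a], rank[b]))
        lo += 1  # the answer is min(lcp[lo+1 .. hi]) in suffix-array order
        level = (hi - lo + 1).bit_length() - 1
        return min(table[level][lo], table[level][hi - (1 << level) + 1])

    return lce


text, pattern = "gggtcta", "gttc"
n, m = len(text), len(pattern)
lce = build_lce(text + pattern)
# Diagonal d = 3 with row = 2: compare T from t_6 with P from p_3.
a, b = 5, 2                                   # 0-based positions in T and in P
q = min(lce(a, n + b), n - a, m - b)          # cap so the match stays inside T and P
print(q)  # 1
```

The cap `min(..., n - a, m - b)` is needed because a common prefix in the concatenated string may run past the end of T or of P.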
38 If we want to compute $L_{3,1}$, we use $L_{2,0}$, $L_{3,0}$ and $L_{4,0}$ to decide the row value (row = 2).
In this paper, we find that the length of $LCA_{2,3}$ is 2, so q = 2 and $L_{3,1} = \text{row} + 2 = 4$.
[Table: D for a pattern of length 5 with $S_1$ and $S_2$ marked; d = 3]
39 S = gggtctacgttac
[Figure: S, the text followed by the pattern]
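Putting the pieces together, the character-by-character loop of the diagonal method can be replaced by a single common-extension query on the concatenated string. In the sketch below the query is passed in as a function; the default `naive_lce` is a plain scan used only to keep the example self-contained, so this version does not reach the O(nk) bound by itself. With a constant-time query (suffix tree plus LCA, as in the paper) each $L_{d,e}$ costs O(1). The names `naive_lce` and `k_differences` and the guard `d + m <= n` are assumptions of this sketch.

```python
def naive_lce(s, a, b):
    """Longest common prefix of s[a:] and s[b:]; stand-in for an O(1) LCA query."""
    q = 0
    while a + q < len(s) and b + q < len(s) and s[a + q] == s[b + q]:
        q += 1
    return q


def k_differences(text, pattern, k, lce=naive_lce):
    """Sorted 1-based end positions of occurrences with at most k differences."""
    n, m = len(text), len(pattern)
    s = text + pattern  # t_1 ... t_n p_1 ... p_m
    L = {}
    for d in range(0, n + 1):
        L[d, -1] = -1
    for d in range(-(k + 1), 0):
        L[d, abs(d) - 1] = abs(d) - 1
        L[d, abs(d) - 2] = abs(d) - 2
    for e in range(-1, k + 1):
        L[n + 1, e] = -1

    ends = set()
    for e in range(k + 1):
        for d in range(-e, n + 1):
            row = min(max(L[d, e - 1] + 1, L[d - 1, e - 1], L[d + 1, e - 1] + 1), m)
            i = row + d  # 0-based position of t_{row+1+d}
            if row < m and 0 <= i < n:
                q = lce(s, i, n + row)            # jump instead of scanning
                row += min(q, n - i, m - row)     # stay inside T and inside P
            L[d, e] = row
            if row == m and d + m <= n:
                ends.add(d + m)
    return sorted(ends)


print(k_differences("gggtcta", "gttc", 2))  # [4, 5, 6, 7]
```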
40 Time Complexity
The Alternative Dynamic Programming Computation takes O(mn) time.
The suffix tree has O(n) nodes.
An LCA query is answered in O(1) time.
For each of the n+k+1 diagonals, we evaluate k+1 values $L_{d,e}$.
This algorithm takes O(nk) time.
41 References
[AHU-74] A. V. AHO, J. E. HOPCROFT, AND J. D. ULLMAN, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, MA, 1974.
[AILSV-88] A. APOSTOLICO, C. ILIOPOULOS, G. M. LANDAU, B. SCHIEBER, AND U. VISHKIN, Parallel construction of a suffix tree with applications, Algorithmica 3 (1988).
[BM-77] R. S. BOYER AND J. S. MOORE, A fast string searching algorithm, Comm. ACM 20 (1977).
[CS-85] M. T. CHEN AND J. SEIFERAS, Efficient and elegant subword tree construction, in Combinatorial Algorithms on Words (A. Apostolico and Z. Galil, Eds.), NATO ASI Series F: Computer and System Sciences Vol. 12, Springer-Verlag, New York/Berlin.
[G-84] Z. GALIL, Optimal parallel algorithms for string matching, in Proceedings, 16th ACM Symposium on Theory of Computing, 1984; Inform. and Control 67 (1985).
[GG-86] Z. GALIL AND R. GIANCARLO, Improved string matching with k mismatches, SIGACT News 17, No. 4 (1986).
[GG-87] Z. GALIL AND R. GIANCARLO, Parallel string matching with k mismatches, Theoret. Comput. Sci. 51 (1987).
[GS-83] Z. GALIL AND J. I. SEIFERAS, Time-space-optimal string matching, J. Comput. System Sci. 26 (1983).
[HT-84] D. HAREL AND R. E. TARJAN, Fast algorithms for finding nearest common ancestors, SIAM J. Comput. 13, No. 2 (1984).
[KMP-77] D. E. KNUTH, J. H. MORRIS, AND V. R. PRATT, Fast pattern matching in strings, SIAM J. Comput. 6 (1977).
[KR-87] R. KARP AND M. O. RABIN, Efficient randomized pattern-matching algorithms, IBM J. Res. Develop. 31, No. 2 (1987).
42 [LSV-87] G. M. LANDAU, B. SCHIEBER, AND U. VISHKIN, Parallel construction of a suffix tree, in Proceedings 14th ICALP, Lecture Notes in Computer Science Vol. 267, Springer-Verlag, New York/Berlin, 1987.
[LV-86a] G. M. LANDAU AND U. VISHKIN, Introducing efficient parallelism into approximate string matching, in Proc. 18th ACM Symposium on Theory of Computing, 1986.
[LV-86b] G. M. LANDAU AND U. VISHKIN, Efficient string matching with k mismatches, Theoret. Comput. Sci. 43 (1986).
[LV-88] G. M. LANDAU AND U. VISHKIN, Fast string matching with k differences, J. Comput. System Sci. 37, No. 1 (1988), 63-78.
[S80] P. H. SELLERS, The theory and computation of evolutionary distances: Pattern recognition, J. Algorithms 1, No. 4 (1980), 359-373.
[SK-83] D. SANKOFF AND J. B. KRUSKAL (Eds.), Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, MA.
[SV-88] B. SCHIEBER AND U. VISHKIN, Parallel computation of lowest common ancestor in trees, SIAM J. Comput., in press.
[U-83] E. UKKONEN, On approximate string matching, in Proceedings Int. Conf. Found. Comput. Theory, Lecture Notes in Computer Science Vol. 158, Springer-Verlag, Berlin/New York.
[U-85] E. UKKONEN, Finding approximate patterns in strings, J. Algorithms 6 (1985).
[V-83] U. VISHKIN, Synchronous parallel computation - A survey, TR-71, Department of Computer Science, Courant Institute, NYU.
[V-85] U. VISHKIN, Optimal parallel pattern matching in strings, in Proceedings 12th ICALP, Lecture Notes in Computer Science Vol. 194, Springer-Verlag, New York/Berlin; Inform. and Control 67 (1985).
43 Thank you