Fast Parallel and Serial Approximate String Matching. G. Landau and U. Vishkin, Journal of Algorithms, Vol. 10 (1989), pp. 157-169.


1 Fast Parallel and Serial Approximate String Matching. Journal of Algorithms, Vol. 10 (1989), pp. 157-169. G. Landau and U. Vishkin. Advisor: Prof. R. C. T. Lee. Speaker: L. Y. Huang.

2 Problem. Given two arrays, $P = p_1 p_2 \dots p_m$ (the pattern) and $T = t_1 t_2 \dots t_n$ (the text), and an integer $k$ ($k \ge 1$), find all occurrences of the pattern in the text with edit distance at most $k$.

3 This algorithm improves the Alternative Dynamic Programming Computation. First, we introduce the Dynamic Programming Computation.

4 The Dynamic Programming Algorithm [S80]. In the dynamic programming approach, we construct an $(n+1) \times (m+1)$ matrix $D$, where $D_{i,j}$ is the minimum edit distance between $P(1, j)$ and any substring of $T$ that ends at $T_i$. Example: T = gggtcta, P = gttc, k = 2. [Table: matrix D for this example, with text position i across the columns and pattern position j = 1..4 down the rows.]
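The table on this slide can be reproduced with a short sketch. This is a minimal illustration of the standard recurrence for this kind of table, not code from the paper; the function names `sellers_table` and `occurrences` are hypothetical, and strings are 0-indexed in the code while the slide indices i and j are 1-based.

```python
def sellers_table(text, pattern):
    """Return D, where D[i][j] is the minimum edit distance between
    pattern[:j] and any substring of text ending at position i (1-based)."""
    n, m = len(text), len(pattern)
    # Row 0: against an empty text, pattern[:j] costs j edits.
    D = [list(range(m + 1))]
    for i in range(1, n + 1):
        row = [0]  # D[i][0] = 0: an occurrence may start anywhere in the text
        for j in range(1, m + 1):
            cost = 0 if text[i - 1] == pattern[j - 1] else 1
            row.append(min(D[i - 1][j - 1] + cost,  # match or substitution
                           row[j - 1] + 1,          # deletion (slide naming)
                           D[i - 1][j] + 1))        # insertion (slide naming)
        D.append(row)
    return D


def occurrences(text, pattern, k):
    """End positions i (1-based) where the pattern occurs with at most k edits."""
    D = sellers_table(text, pattern)
    m = len(pattern)
    return [i for i in range(1, len(text) + 1) if D[i][m] <= k]


D = sellers_table("gggtcta", "gttc")
print(occurrences("gggtcta", "gttc", 2))  # [4, 5, 6, 7]
```

Filling the whole table costs O(mn) time, which is the cost the diagonal-based computation on the later slides tries to avoid.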

5 We found: (1) the substring gt, which matches P = gttc with distance 2; (2) the substring gtc, which matches P = gttc with distance 1. [Figure: the corresponding alignments and the D table.]

6 We also found: (3) and (4) the substring gtct, which matches P = gttc with distance 2; (5) the substring gtcta, which matches P = gttc with distance 2. [Figure: the corresponding alignments and the D table.]

7 An Alternative Dynamic Programming Computation. We make heavy use of the concept of a diagonal. Diagonal $d$ consists of all entries $D_{i,j}$ with $d = i - j$. [Figure: a small D table with Diagonal 2 marked.]

8 We first have the following: (a) If $T_i = P_j$, then $D_{i,j} = D_{i-1,j-1}$. (b) Otherwise, $D_{i,j}$ is the minimum of $D_{i-1,j-1} + 1$ (substitution), $D_{i,j-1} + 1$ (deletion) and $D_{i-1,j} + 1$ (insertion).

9 Consider any diagonal $d$. Let us find the largest $j$, if it exists, such that $(i, j)$ is on Diagonal $d$ ($i - j = d$) and $D_{i,j} = 0$. Let us now label all of these locations. [Figure: D table for T = gggtcta and P = gttc with the zero entries on Diagonals 0, 1 and 2 marked.]

10 Having found the above locations $(i, j)$ where $D_{i,j} = 0$, we can further find the largest $j$, if it exists, such that $(i, j)$ is on Diagonal $d$ and $D_{i,j} = 1$. To do this, we use the following observation: each element on Diagonal $d$ can only influence elements on Diagonals $d-1$, $d$ and $d+1$.

11 Let us consider any location $(i, j)$ on Diagonal $d$. Why can $D_{i,j}$ suddenly become 1? It can only be influenced by three neighbors: $D_{i-1,j-1}$ on Diagonal $d$ (substitution), $D_{i,j-1}$ on Diagonal $d+1$ (deletion) and $D_{i-1,j}$ on Diagonal $d-1$ (insertion). Thus, we conclude that we only need to consider Diagonals $d-1$, $d$ and $d+1$.

12 Let us consider the following table. Question: what is the value of $D_{4,3}$? It cannot be 0, because we have already determined that the largest $j$ on Diagonal 1 with $D_{i,j} = 0$ is 1. Thus $D_{4,3} = 1$. [Table: D table for T = gggtcta and P = gttc with Diagonal d = 1 marked.]

13 Question: what is the value of $D_{5,4}$? Since $T_5 = P_4$, $D_{5,4} = D_{4,3} = 1$. [Table: D table for T = gggtcta and P = gttc with Diagonal d = 1 marked.]

14 Based upon the above discussion, we can find all $(i, j)$ where $D_{i,j} = 1$ after finding all $(i, j)$ where $D_{i,j} = 0$. In fact, after finding all $(i, j)$ where $D_{i,j} = e$, we can find all $(i, j)$ where $D_{i,j} = e+1$. Thus the dynamic programming table does not have to be computed in full. In the following, we give the Alternative Dynamic Programming Computation method formally.

15 Let $L_{d,e}$ denote the largest row $j$ such that $D_{i,j}$ is on Diagonal $d$ ($i - j = d$) and $D_{i,j} = e$. Based upon this definition, $e$ is the minimum edit distance between $P(1, L_{d,e})$ and any substring of $T$ ending at $T_{L_{d,e}+d}$, and $P_{L_{d,e}+1} \ne T_{L_{d,e}+d+1}$. Example: let $d = 3$. Then $L_{3,0} = 0$, $L_{3,1} = 3$ and $L_{3,2} = 4$. [Table: D table for T = gggtcta and P = gttc.]
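As a cross-check of this definition, the values of $L_{d,e}$ can be read off a fully computed table. The sketch below does this by brute force for the running example; it only illustrates the definition and is not how the algorithm obtains these values. The helper names `sellers_table` and `largest_row` are hypothetical.

```python
def sellers_table(text, pattern):
    """D[i][j]: min edit distance between pattern[:j] and a substring
    of text ending at position i (1-based i and j)."""
    n, m = len(text), len(pattern)
    D = [list(range(m + 1))]
    for i in range(1, n + 1):
        row = [0]
        for j in range(1, m + 1):
            cost = 0 if text[i - 1] == pattern[j - 1] else 1
            row.append(min(D[i - 1][j - 1] + cost, row[j - 1] + 1, D[i - 1][j] + 1))
        D.append(row)
    return D


def largest_row(D, d, e):
    """Largest row j on diagonal d (i - j = d) with D[i][j] == e, else None."""
    n, m = len(D) - 1, len(D[0]) - 1
    best = None
    for j in range(max(0, -d), m + 1):
        i = j + d
        if i > n:
            break  # the diagonal has left the table
        if D[i][j] == e:
            best = j
    return best


D = sellers_table("gggtcta", "gttc")
print([largest_row(D, 3, e) for e in (0, 1, 2)])  # [0, 3, 4]
```

Values along a diagonal never decrease, so scanning the diagonal and keeping the last row with value e is enough.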

16 Example: T = gggtcta, P = gttc, k = 2. Now, $L_{3,1} = 3$. It means that we have found a substring $A$, which is $T(3,6)$ = gtct, ending at $T_{L_{d,e}+d} = T_{3+3} = T_6$, such that the edit distance between $A$ and $P(1,3)$ = gtt is 1. Moreover, $P_{L_{d,e}+1} \ne T_{L_{d,e}+d+1}$, that is, $P_{3+1} \ne T_{3+3+1}$. [Table: D table for this example.]

17 Example: T = gggtcta, P = gttc, k = 2. Now, $L_{1,1} = 4 = m$. It means that we have found a substring $A$, which is $T(2,5)$ = ggtc, ending at $T_{L_{d,e}+d} = T_{4+1} = T_5$, such that the edit distance between $A$ and $P(1,4)$ = gttc is 1. The two strings are $T(2,5)$ = ggtc and $P$ = gttc. [Table: D table for this example.]

18 The alternative dynamic programming computation computes the values $L_{d,e}$.

19 An Alternative Dynamic Programming Computation. First, we set the initial values. Example: T = gggtcta, P = gttc. [Table: initial D table for this example.]

20 e = 0. For $d = 0$ to $d = n$, find the largest $j$ such that $P(1, j)$ equals $T(d+1, i)$, where $i = d + j$, and set $L_{d,0} = j$. For $d = 0$: $P_1 = T_1$, so $L_{0,0} = 1$. [Table: D table with Diagonal d = 0 marked.]

21 e = 0, $d = 1$: $P_1 = T_2$, so $L_{1,0} = 1$. [Table: D table with Diagonal d = 1 marked.]

22 e = 0, $d = 2$: $P_1 = T_3$ and $P_2 = T_4$, so $L_{2,0} = 2$. [Table: D table with Diagonal d = 2 marked.]
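The e = 0 step on these slides amounts to finding, for each diagonal, how far the pattern matches the text exactly when it is aligned at text position d + 1. A minimal sketch, assuming 0-indexed Python strings and a hypothetical helper name `l_zero`:

```python
def l_zero(text, pattern, d):
    """L[d][0] for d >= 0: the largest j with pattern[:j] == text[d:d + j]."""
    j = 0
    while j < len(pattern) and d + j < len(text) and pattern[j] == text[d + j]:
        j += 1
    return j


text, pattern = "gggtcta", "gttc"
print([l_zero(text, pattern, d) for d in range(4)])  # [1, 1, 2, 0]
```

The first three values agree with $L_{0,0} = 1$, $L_{1,0} = 1$ and $L_{2,0} = 2$ above.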

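23 Our approach is based upon Rule 1 proposed by Professor Lee. Consider two strings $A_1$ and $A_2$, where $A_1$ consists of $P_1$ followed by $S_1$ and $A_2$ consists of $P_2$ followed by $S_2$. If $d(A_1, A_2) \le k$ and $S_1 = S_2$, then $d(P_1, P_2) \le k$. [Figure: $A_1 = P_1 S_1$ and $A_2 = P_2 S_2$.]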

24 Observe the following: if $d(A_1, A_2) = k$, $S_1 = S_2$ and $x \ne y$, then $d(A_1 + S_1 + x, A_2 + S_2 + y) \le k+1$.

25 For $e > 0$, we go through $d = -e$ to $d = n$. Let row = max[$L_{d,e-1} + 1$ (substitution), $L_{d-1,e-1}$ (insertion), $L_{d+1,e-1} + 1$ (deletion)]. Find the largest $j$, if it exists, such that $P(row+1, j) = T(row+1+d, i) = T(row+1+i-j, i)$, and set $L_{d,e} = j$. If no such $j$ exists, set $L_{d,e} = row$.

26 Let row = max[$L_{d,e-1} + 1$, $L_{d-1,e-1}$, $L_{d+1,e-1} + 1$]. [Figure: $L_{d,e-1}$ on Diagonal $d$ (substitution), $L_{d+1,e-1}$ on Diagonal $d+1$ (deletion) and $L_{d-1,e-1}$ on Diagonal $d-1$ (insertion).]

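27 $d = -1$, $e = 1$, with the boundary values $L_{-1,0} = 0$ and $L_{-2,0} = 0$: row = max[$L_{d,e-1} + 1$, $L_{d-1,e-1}$, $L_{d+1,e-1} + 1$] = max[0+1, 0, 1+1] = max[1, 0, 2] = 2. Extending from row 2: $P_3 \ne T_2$, so $L_{-1,1} = 2$. [Table: D table with Diagonal d = -1 marked.]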

28 $d = 0$, $e = 1$, with the boundary value $L_{-1,0} = 0$: row = max[$L_{d,e-1} + 1$, $L_{d-1,e-1}$, $L_{d+1,e-1} + 1$] = max[1+1, 0, 1+1] = max[2, 0, 2] = 2. Extending from row 2: $P_3 \ne T_3$, so $L_{0,1} = 2$. [Table: D table with Diagonal d = 0 marked.]

29 $d = 1$, $e = 1$: row = max[$L_{d,e-1} + 1$, $L_{d-1,e-1}$, $L_{d+1,e-1} + 1$] = max[1+1, 1, 2+1] = max[2, 1, 3] = 3. Extending from row 3: $P_4 = T_5$ = c, so $L_{1,1} = 4 = m$. We find an occurrence of the pattern in the text with edit distance at most 1 that ends at $T_{d+m} = T_{1+4} = T_5$. [Table: D table with Diagonal d = 1 marked.]

30 $d = 3$, $e = 1$: row = max[$L_{d,e-1} + 1$, $L_{d-1,e-1}$, $L_{d+1,e-1} + 1$] = max[0+1, 2, 0+1] = max[1, 2, 1] = 2. Extending from row 2: $P_3 = T_6$ and $P_4 \ne T_7$, so $L_{3,1} = 3$. [Table: D table with Diagonal d = 3 marked.]

31 $d = 3$, $e = 2$: row = max[$L_{d,e-1} + 1$, $L_{d-1,e-1}$, $L_{d+1,e-1} + 1$] = max[3+1, 3, 2+1] = max[4, 3, 3] = 4, so $L_{3,2} = 4 = m$. We find an occurrence of the pattern in the text with edit distance at most 2 that ends at $t_{d+m} = t_{3+4} = t_7$. [Table: D table with Diagonal d = 3 marked.]

32 An Alternative Dynamic Programming Computation
Initialization:
  for all d, 0 ≤ d ≤ n: L_{d,-1} = -1
  for all d, -(k+1) ≤ d ≤ -1: L_{d,|d|-1} = |d|-1, L_{d,|d|-2} = |d|-2
  for all e, -1 ≤ e ≤ k: L_{n+1,e} = -1
for e = 0 to k do
  for d = -e to n do
    row = max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)]
    row = min(row, m)
    while row < m and row + d < n and p_{row+1} = t_{row+1+d} do
      row = row + 1
    L_{d,e} = row
    if L_{d,e} = m then print *there is an occurrence ending at t_{d+m}*
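The procedure on this slide can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: it assumes 1 ≤ k < m ≤ n, uses the boundary values $L_{d,|d|-1} = |d|-1$ and $L_{d,|d|-2} = |d|-2$ for the negative diagonals, stores L in a dictionary keyed by (d, e), and adds a range check on the reported end position that does not appear on the slide. The function name `landau_vishkin` is hypothetical.

```python
def landau_vishkin(text, pattern, k):
    """Return (ends, L): 1-based end positions of occurrences with at most
    k edits, and the table L[(d, e)] of furthest rows per diagonal."""
    n, m = len(text), len(pattern)
    L = {}
    # Boundary values (assumed form, see the lead-in).
    for d in range(0, n + 1):
        L[(d, -1)] = -1
    for d in range(-(k + 1), 0):
        L[(d, abs(d) - 1)] = abs(d) - 1
        L[(d, abs(d) - 2)] = abs(d) - 2
    for e in range(-1, k + 1):
        L[(n + 1, e)] = -1

    ends = set()
    for e in range(0, k + 1):
        for d in range(-e, n + 1):
            row = max(L[(d, e - 1)] + 1,      # substitution, same diagonal
                      L[(d - 1, e - 1)],      # insertion, from diagonal d-1
                      L[(d + 1, e - 1)] + 1)  # deletion, from diagonal d+1
            row = min(row, m)
            # Slide along the diagonal while pattern and text agree.
            while row < m and row + d < n and pattern[row] == text[row + d]:
                row += 1
            L[(d, e)] = row
            # Added guard (not on the slide): keep end positions inside the text.
            if row == m and 1 <= d + m <= n:
                ends.add(d + m)
    return sorted(ends), L


ends, L = landau_vishkin("gggtcta", "gttc", 2)
print(ends)  # [4, 5, 6, 7]
```

With the character-by-character `while` loop the worst case is still O(mn); the later slides replace that loop with a constant-time lookup.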

33 Difference from this algorithm. In the alternative dynamic programming computation, we must search for the largest $j$ such that $P(row+1, j) = T(row+1+d, i) = T(row+1+i-j, i)$. Essentially, we are looking for equal substrings $S_1$ and $S_2$ in $T$ and $P$ respectively. [Figure: $S_1$ in T and $S_2$ in P.] This paper uses the LCA (lowest common ancestor) to speed up this search.

34 Algorithm. This algorithm has two steps: (1) Concatenate the text and the pattern into one string $t_1, \dots, t_n, p_1, \dots, p_m$ and compute the suffix tree of this string. (2) Find all occurrences of the pattern in the text with edit distance at most $k$.

35 Example: T = ABCDEA, P = DDBE, S = ABCDEADDBE. The suffix tree of a string of length n can be constructed in O(n) time (Weiner, 1973; McCreight, 1976; Ukkonen, 1995).

36 After O(n) preprocessing, the lowest common ancestor of two leaf nodes can be found in O(1) time (Harel and Tarjan, 1984).

37 To find such an S, if it exists, we concatenate T and P into a new string. On the suffix tree of this string, the suffixes beginning with $S_1$ and $S_2$ have a common ancestor S. [Figure: $S_1$ in T and $S_2$ in P.]
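The query needed here is the length of the longest common prefix of two suffixes of the concatenated string. The sketch below answers it without rescanning characters, but it swaps in a different data structure from the slides: a suffix array with Kasai's LCP array and a sparse table for range minima, instead of a suffix tree with LCA queries. The suffix array is built naively for brevity, so preprocessing is not linear. The name `build_lce` is hypothetical.

```python
def build_lce(s):
    """Preprocess s; the returned lce(a, b) gives the length of the longest
    common prefix of s[a:] and s[b:]."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # naive suffix array
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    # Kasai: lcp[r] = common prefix length of suffixes sa[r-1] and sa[r].
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    # Sparse table: table[l][i] = min(lcp[i : i + 2**l]).
    table = [lcp]
    span = 1
    while 2 * span <= n:
        prev = table[-1]
        table.append([min(prev[i], prev[i + span])
                      for i in range(n - 2 * span + 1)])
        span *= 2

    def lce(a, b):
        if a == b:
            return n - a
        lo, hi = sorted((rank[a], rank[b]))
        lo += 1  # the answer is min(lcp[lo+1 .. hi]) in the original ranks
        level = (hi - lo + 1).bit_length() - 1
        return min(table[level][lo], table[level][hi - (1 << level) + 1])

    return lce


text, pattern = "gggtcta", "gttc"
lce = build_lce(text + pattern)
row, d = 2, 3
# Compare pattern[row:] with text[row + d:]; clamp so that the match cannot
# run past the end of the text into the pattern part of the concatenation.
q = min(lce(row + d, len(text) + row), len(text) - (row + d), len(pattern) - row)
print(row + q)  # 3
```

The clamp is an assumption made here because the two strings are concatenated without a separator; with it, the example reproduces $L_{3,1} = 3$ from the earlier slides.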

38 If we want to compute $L_{3,1}$, we use $L_{2,0}$, $L_{3,0}$ and $L_{4,0}$ to decide the row value (row = 2). In this paper, we find that the length of LCA$_{2,3}$ is 2, so q = 2 and $L_{3,1}$ = row + 2 = 4. [Figure: D table with Diagonal d = 3 marked, together with the matching substrings $S_1$ and $S_2$.]

39 S = gggtctacgttac (the text followed by the pattern).

40 Time Complexity. The alternative dynamic programming computation takes O(mn) time. The suffix tree has O(n) nodes. An LCA query is answered in O(1) time. For each of the n+k+1 diagonals, we evaluate k+1 values $L_{d,e}$. This algorithm therefore takes O(nk) time.
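The O(nk) bound can be spelled out by adding the preprocessing cost to the cost of filling in the values $L_{d,e}$. The breakdown below is a sketch that assumes $1 \le k \le m \le n$ and constant time per $L_{d,e}$, as stated on this slide.

```latex
\[
\underbrace{O(n + m)}_{\text{suffix tree and LCA preprocessing}}
\;+\;
\underbrace{(n + k + 1)(k + 1)}_{\text{number of values } L_{d,e}} \cdot O(1)
\;=\; O(n) + O(nk) \;=\; O(nk).
\]
```

The middle step uses $(n + k + 1)(k + 1) \le (2n + 1) \cdot 2k$, which holds because $1 \le k \le n$.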

41 Reference
[AHU-74] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, MA, 1974.
[AILSV-88] A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber, and U. Vishkin, Parallel construction of a suffix tree with applications, Algorithmica 3 (1988).
[BM-77] R. S. Boyer and J. S. Moore, A fast string searching algorithm, Comm. ACM 20 (1977).
[CS-85] M. T. Chen and J. Seiferas, Efficient and elegant subword tree construction, in Combinatorial Algorithms on Words (A. Apostolico and Z. Galil, Eds.), NATO ASI Series F: Computer and System Sciences Vol. 12, Springer-Verlag, New York/Berlin.
[G-84] Z. Galil, Optimal parallel algorithms for string matching, in Proceedings, 16th ACM Symposium on Theory of Computing, 1984; Inform. and Control 67 (1985).
[GG-86] Z. Galil and R. Giancarlo, Improved string matching with k mismatches, SIGACT News 17, No. 4 (1986).
[GG-87] Z. Galil and R. Giancarlo, Parallel string matching with k mismatches, Theoret. Comput. Sci. 51 (1987).
[GS-83] Z. Galil and J. I. Seiferas, Time-space-optimal string matching, J. Comput. System Sci. 26 (1983).
[HT-84] D. Harel and R. E. Tarjan, Fast algorithms for finding nearest common ancestors, SIAM J. Comput. 13, No. 2 (1984).
[KMP-77] D. E. Knuth, J. H. Morris, and V. R. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977).
[KR-87] R. Karp and M. O. Rabin, Efficient randomized pattern-matching algorithms, IBM J. Res. Develop. 31, No. 2 (1987).

42 Reference (continued)
[LSV-87] G. M. Landau, B. Schieber, and U. Vishkin, Parallel construction of a suffix tree, in Proceedings 14th ICALP, Lecture Notes in Computer Science Vol. 267, Springer-Verlag, New York/Berlin, 1987.
[LV-86a] G. M. Landau and U. Vishkin, Introducing efficient parallelism into approximate string matching, in Proc. 18th ACM Symposium on Theory of Computing, 1986.
[LV-86b] G. M. Landau and U. Vishkin, Efficient string matching with k mismatches, Theoret. Comput. Sci. 43 (1986).
[LV-88] G. M. Landau and U. Vishkin, Fast string matching with k differences, J. Comput. System Sci. 37, No. 1 (1988), 63-78.
[S80] P. H. Sellers, The theory and computation of evolutionary distances: Pattern recognition, Journal of Algorithms, Vol. 1, No. 4, 1980, pp. 359-373.
[SK-83] D. Sankoff and J. B. Kruskal (Eds.), Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, MA.
[SV-88] B. Schieber and U. Vishkin, Parallel computation of lowest common ancestor in trees, SIAM J. Comput., in press.
[U-83] E. Ukkonen, On approximate string matching, in Proceedings Int. Conf. Found. Comput. Theory, Lecture Notes in Computer Science Vol. 158, Springer-Verlag, Berlin/New York.
[U-85] E. Ukkonen, Finding approximate patterns in strings, J. Algorithms 6 (1985).
[V-83] U. Vishkin, Synchronous parallel computation: A survey, TR-71, Department of Computer Science, Courant Institute, NYU.
[V-85] U. Vishkin, Optimal parallel pattern matching in strings, in Proceedings 12th ICALP, Lecture Notes in Computer Science Vol. 194, Springer-Verlag, New York/Berlin; Inform. and Control 67 (1985).

43 Thank you