1 Advisor: Prof. R. C. T. Lee Speaker: G. W. Cheng Two exact string matching algorithms using suffix to prefix rule
2 Speeding up two string-matching algorithms. Algorithmica, Vol. 12, 1994. CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W. and RYTTER, W.
3 Problem Definition: We are given a text string T and a pattern string P, and we want to find all occurrences of P in T.
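To pin down what "all occurrences" means, here is a minimal brute-force sketch. The function name `find_all` and the 0-based positions are assumptions made for illustration; the text and pattern are the ones used in the later slides of this talk.

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Return the 0-based start index of every occurrence of pattern in text."""
    m = len(pattern)
    # Try every alignment of the pattern against the text.
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


print(find_all("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))
```

This baseline compares every window in full; the algorithms in the rest of the talk try to skip windows that cannot match.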
4 Consider the following example: There are two occurrences of P in T as shown below:
5 Rule 1: The Suffix to Prefix Rule. For a window to have any chance of matching the pattern, some suffix of the window must be equal to a prefix of the pattern.
6 Basic Ideas: Open a window W of size |P| in the text T. Find the longest suffix of W that is also a prefix of the pattern P. [Figure: W of length |P| in T aligned with P; labelled "Match!" and "Case 1".]
7 [Figures: Case 2 and Case 3, W and P aligned on T.] Case 3: If there is no such suffix, we shift W by |P|.
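The shift implied by the suffix to prefix rule can be sketched as follows. This is an illustrative brute-force version, not the preprocessing-based method of the talk; the names `longest_suffix_prefix` and `shift_for` are hypothetical.

```python
def longest_suffix_prefix(window: str, pattern: str) -> int:
    """Length of the longest suffix of window that is also a prefix of pattern."""
    for k in range(min(len(window), len(pattern)), 0, -1):
        if window.endswith(pattern[:k]):
            return k
    return 0


def shift_for(window: str, pattern: str) -> int:
    """How far the window can move without skipping a possible occurrence."""
    m = len(pattern)
    k = longest_suffix_prefix(window, pattern)
    if k == m:
        # Full match: fall back to the longest proper prefix for the shift.
        k = longest_suffix_prefix(window[1:], pattern)
    return m - k  # equals m when no suffix of the window is a prefix


print(shift_for("GCATCGCA", "GCAGAGAG"))
```

With the window "GCATCGCA" and P = GCAGAGAG, the longest such suffix is "GCA", which gives the shift 8 - 3 = 5 used in the whole example of this talk.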
8 Preprocessing phase: T=GCATCGCAGAGAGTATACAGTACG P=GCAGAGAG We construct the suffix automaton of P. [Figure: Suffix Automaton]
9 Preprocessing: Construct a Suffix Tree of the reverse of Pattern P. P^R: the reversal of P
10 GCATCGCAGAGAGTATACAGTACG GCAGAGAG When there is a match, how do we move the window? T P
11 GCATCGCAGAGAGTATACAGTACG GCAGAGAG T P
12 GCATCGCAGGCAGTATACAGTACG GCAGAGAG T P Find the longest suffix of W that is also a prefix of the pattern.
13 GCATCGCAGGCAGTATACAGTACG GCAGAGAG T P
14 A Whole Example T=GCATCGCAGAGAGTATACAGTACG P=GCAGAGAG First attempt : GCATCGCAGAGAGTATACAGTACG GCAGAGAG Shift by: 5 (8 - 3) T P
15 GCATCGCAGAGAGTATACAGTACG GCAGAGAG Second attempt : Shift by: 7 (8 - 1) T P
16 Third attempt: GCATCGCAGAGAGTATACAGTACG GCAGAGAG Shift by: 7 (8 - 1) T P
17 Third attempt: GCATCGCAGAGAGTATACAGTACG GCAGAGAG T P
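The whole example can be replayed with a small sketch. As a loud assumption, it replaces the suffix automaton with a plain set of all substrings of P (quadratic space, for illustration only); the function name `reverse_factor_search` is hypothetical and positions are 0-based.

```python
def reverse_factor_search(text: str, pattern: str) -> tuple[list[int], list[int]]:
    """Return (match positions, shift used at each attempt)."""
    m, n = len(pattern), len(text)
    # Stand-in for the suffix automaton: every substring of the pattern.
    factors = {pattern[i:j] for i in range(m) for j in range(i + 1, m + 1)}
    matches, shifts = [], []
    pos = 0
    while pos <= n - m:
        shift = m
        k = 1
        # Grow the suffix of the window leftwards while it is a substring of P.
        while k <= m and text[pos + m - k:pos + m] in factors:
            if pattern.startswith(text[pos + m - k:pos + m]):
                if k == m:
                    matches.append(pos)   # the whole window matches
                else:
                    shift = m - k         # longest proper prefix seen so far
            k += 1
        shifts.append(shift)
        pos += shift
    return matches, shifts


print(reverse_factor_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))
```

On the text and pattern of this example the sketch reports one occurrence and the shifts 5, 7 and 7 shown on the slides.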
18 Conclusion: The preprocessing phase takes O(m) time and the searching phase takes O(mn) time in the worst case, where m = |P| and n = |T|.
21 A Bit-Parallel Approach to Suffix Automata: Fast Extended String Matching. In Proceedings of the 9th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science 1448, Springer-Verlag, Berlin, pp. 14-31. NAVARRO, G. and RAFFINOT, M.
22 Problem Definition: We are given a text string T and a pattern string P, and we want to find all occurrences of P in T.
23 This algorithm compares the pattern P with the text T inside a sliding window, which moves from left to right. Example: Text : ABDAACDGAEEGGGGJJ Pattern : ACDAAC sliding window
24 Text : ABDAACDGAEEGGGGJJ Pattern : ACDAAC sliding window Example:
26 Basic idea: In this algorithm, we want to find the longest prefix of the pattern which is equal to a suffix of the window.
27 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: We want to find the longest suffix of "BDDCACDAD" which is also a prefix of the pattern.
28 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: We find all substrings ”D” in the pattern.
29 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: Actually, it means that we compare the windows as above. ACDADCEAD
30 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: mismatch Then we try to find out all substrings ”AD” in the pattern.
31 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: We succeed in finding all substrings ”AD” in the pattern.
32 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: mismatch We try to find out all substrings ”DAD” in the pattern.
33 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: We find all substrings ”DAD” in the pattern.
34 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: We try to find all substrings ”CDAD” in the pattern.
35 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: We try to find all substrings ”ACDAD” in the pattern.
36 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: "ACDAD" is the longest prefix of the pattern which is equal to a suffix of the window, so we shift the pattern until this prefix is aligned with that suffix.
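The right-to-left scan of this example can be sketched directly. The helper name `suffix_occurrences` is an assumption, start positions are 0-based, and the substring test is done naively rather than with the bit-parallel method introduced later.

```python
def suffix_occurrences(window: str, pattern: str):
    """Yield (suffix, start positions in pattern) while the suffix still occurs."""
    for k in range(1, len(window) + 1):
        suffix = window[-k:]
        starts = [i for i in range(len(pattern) - k + 1)
                  if pattern.startswith(suffix, i)]
        if not starts:
            break  # the suffix is no longer a substring of the pattern
        yield suffix, starts


for suffix, starts in suffix_occurrences("BDDCACDAD", "ACDADCEAD"):
    print(suffix, starts)
```

A start position of 0 means the suffix is a prefix of the pattern; in this example that happens only for "ACDAD".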
37 Why do we want to find the longest suffix of the text in the sliding window which is also a prefix of pattern? We will explain this by the following idea.
38 Case 1: Let u be the longest suffix of the window which is also a substring of P. u is not a prefix of P, and no prefix of P is equal to a suffix of the window.
39 So, we can shift the pattern past the whole window, that is, by |P| positions.
40 Text : ABDDCCDDADEGGGGJJ Pattern : ACDADCEAD Example: P must be shifted so that no part of P is compared with "DDAD".
41 Text : ABDDCCDDADEGGGGJJ Pattern : ACDADCEAD Example: So, we can shift the pattern as above.
42 Case 2: u is not a prefix of P.
43 But a suffix v of the window may be a prefix of P.
44 So, we shift the pattern until its prefix v is aligned with the suffix v of the window.
45 Text: ABCABCABA Pattern: CABBCAD Example: "BCA" is the longest suffix of "ABCABCA" which is also a substring of the pattern. "CA" is a suffix of "BCA" which is a prefix of the pattern.
46 Text: ABCABCABA Pattern: CABBCAD Example: So we can shift as above.
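The two cases differ only in whether v is empty. A small sketch, assuming the hypothetical helper name `longest_factor_and_prefix` and a naive substring test, computes u and v for a window and derives the shift as |P| - |v|.

```python
def longest_factor_and_prefix(window: str, pattern: str) -> tuple[str, str]:
    """Return (u, v): the longest suffix of window that is a substring of
    pattern, and the longest suffix of window that is a prefix of pattern."""
    u = v = ""
    for k in range(1, min(len(window), len(pattern)) + 1):
        suffix = window[-k:]
        if suffix not in pattern:
            break          # longer suffixes cannot be substrings either
        u = suffix
        if pattern.startswith(suffix):
            v = suffix     # every prefix is also a substring, so v is found here
    return u, v


u, v = longest_factor_and_prefix("ABCABCA", "CABBCAD")
print(u, v, len("CABBCAD") - len(v))
```

For the window "ABCABCA" this gives u = "BCA" and v = "CA", so the shift is 7 - 2 = 5. When v is empty (Case 1) the shift is the full pattern length.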
47 The idea explained above is the main idea of this algorithm. We will use a bit-parallel method to implement it.
48 Here, we explain how to use bit-parallelism to find the substrings of a pattern which are equal to a suffix of the window. Example: Text: ABCABCCBA, Pattern: ACBCCBB, ∑={A,B,C}
49 Text: ABCABCCBA Pattern: ACBCCBB Example: For every character of the alphabet we build a bit mask with a "1" at each position where the character occurs in the pattern: A: 1000000 B: 0010011 C: 0101100 others: 0000000
50 We use a mask D to record the positions of the pattern at which the part of the window read so far occurs. Initially D: 1111111
51 The window "ABCABCC" is read from right to left. Reading C: D: 1111111 C: 0101100
52 D = 1111111 And 0101100 = 0101100. Where there is a "1", there is a substring "C" in Pattern. Then D << 1 = 1011000
53 Reading C: D = 1011000 And 0101100 = 0001000. Where there is a "1", there is a substring "CC" in Pattern. Then D << 1 = 0010000
54 Reading B: D = 0010000 And 0010011 = 0010000. Where there is a "1", there is a substring "BCC" in Pattern. Then D << 1 = 0100000
55 Reading A: D = 0100000 And 1000000 = 0000000. There is no substring "ABCC" in Pattern. The leading bit of D was never "1", so no prefix of Pattern is equal to a suffix of the window.
56 Text: ABCABCCBA Pattern: ACBCCBB Example: We can shift Pattern as above.
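The masks and the updates of D for this example can be reproduced with a short sketch. It assumes Python integers as bit vectors, with the leftmost bit standing for pattern position 1; the names `build_masks` and `trace_d` are hypothetical.

```python
def build_masks(pattern: str) -> dict[str, int]:
    """One bit mask per character: bit (m - 1 - j) is set when pattern[j] == ch."""
    m = len(pattern)
    masks: dict[str, int] = {}
    for j, ch in enumerate(pattern):          # j is 0-based here
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - j))
    return masks


def trace_d(window: str, pattern: str) -> list[str]:
    """Value of D after each AND while reading the window right to left."""
    m = len(pattern)
    full = (1 << m) - 1                        # m ones
    masks = build_masks(pattern)
    d, states = full, []
    for ch in reversed(window):
        d &= masks.get(ch, 0)                  # characters not in P have mask 0
        states.append(format(d, f"0{m}b"))
        if d == 0:
            break
        d = (d << 1) & full                    # keep only m bits after the shift
    return states


print(trace_d("ABCABCC", "ACBCCBB"))
```

Python integers are unbounded, so the `& full` after the shift is what keeps D at m bits; in a language with fixed-width words the truncation would need similar care when m is smaller than the word size.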
57 We give another example: Text: ABCABCABA Pattern: CABBCAB, ∑={A,B,C} A: 0100010 B: 0011001 C: 1000100 others: 0000000 The window is "ABCABCA". Reading A: D = 1111111 And 0100010 = 0100010. Then D << 1 = 1000100
58 Reading C: D = 1000100 And 1000100 = 1000100. The leading bit is "1", so "CA" is a substring of the pattern which starts at position 1, and this means that "CA" is a prefix of the pattern. Then D << 1 = 0001000
59 Reading B: D = 0001000 And 0011001 = 0001000. So, we know "BCA" is a substring of the pattern. Then D << 1 = 0010000
60 Reading A: D = 0010000 And 0100010 = 0000000. There is no substring "ABCA" in Pattern.
61 "BCA" is the longest suffix of "ABCABCA" which is also a substring of the pattern, but the longest prefix of the pattern which is equal to a suffix of the window is "CA".
62 We take an example of the whole algorithm.
63 We use "read" to store the suffix of the sliding window which we have already read, and "pre-temp" to store the longest suffix of the window found so far which is also a proper prefix of the pattern.
64 Example: P: ATATA T: AGATACGATATATAC Preprocessing: B[A] = 10101, B[T] = 01010, B[*] = 00000
65 Initial: D: 11111 read : empty pre-temp : empty The window is "AGATA".
66 Reading A: D = 11111 And 10101 = 10101 read : A The leading bit of D is "1", so we set pre-temp = "A", which is a prefix of the pattern.
67 D = 10101 << 1 = 01010 read : A pre-temp : A
68 Reading T: D = 01010 And 01010 = 01010 read : TA pre-temp : A
69 D = 01010 << 1 = 10100 read : TA pre-temp : A
70 Reading A: D = 10100 And 10101 = 10100 read : ATA The leading bit of D is "1", so we set pre-temp = "ATA", which is a prefix of the pattern.
71 D = 10100 << 1 = 01000 read : ATA pre-temp : ATA
72 Reading G: D = 01000 And 00000 = 00000 read : GATA pre-temp : ATA
73 D: 00000 read : GATA pre-temp : ATA We find that "ATA" is the longest suffix of "AGATA" which is also a prefix of the pattern.
74 So, we can shift the pattern by 5 - 3 = 2 positions.
75 Then we reset D = 11111, read = empty and pre-temp = empty. The window is now "ATACG".
76 Reading G: D = 11111 And 00000 = 00000 read : G pre-temp : empty
77 D: 00000 There is no substring "G" in the pattern.
78 So, we can shift by the length of P to the right, and we reset D = 11111, read = empty and pre-temp = empty. The window is now "ATATA".
79 Reading A: D = 11111 And 10101 = 10101 read : A We set pre-temp = "A", which is a prefix of the pattern.
80 D = 10101 << 1 = 01010 read : A pre-temp : A
81 Reading T: D = 01010 And 01010 = 01010 read : TA pre-temp : A Then D = 01010 << 1 = 10100
82 Reading A: D = 10100 And 10101 = 10100 read : ATA We set pre-temp = "ATA", which is a prefix of the pattern.
83 D = 10100 << 1 = 01000 read : ATA pre-temp : ATA
84 Reading T: D = 01000 And 01010 = 01000 read : TATA pre-temp : ATA Then D = 01000 << 1 = 10000
85 Reading A: D = 10000 And 10101 = 10000 read : ATATA pre-temp : ATA
86 We find that "ATATA", of length m, is the longest prefix of the pattern which is equal to a suffix of the window, so an exact match occurs.
87 Besides the full pattern, "ATA" is the longest prefix of the pattern which is equal to a suffix of the window.
88 So, we can shift the pattern by 5 - 3 = 2 positions.
89 We reset D = 11111, read = empty and pre-temp = empty, and repeat the above steps until the window slides out of Text.
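The bookkeeping of "read" and "pre-temp" for a single window can be sketched as below. The function name `trace_window` is hypothetical, the window is assumed to have the same length as the pattern, and the leftmost bit of D stands for pattern position 1.

```python
def trace_window(window: str, pattern: str):
    """Return ([(char read, D after AND, pre-temp)], shift) for one window."""
    m = len(pattern)
    full, top = (1 << m) - 1, 1 << (m - 1)
    masks: dict[str, int] = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - j))
    d, read, pre_temp, steps = full, "", "", []
    for ch in reversed(window):
        d &= masks.get(ch, 0)
        read = ch + read
        # Leading bit set: "read" is a prefix of the pattern.
        # The full pattern itself is not stored in pre-temp.
        if d & top and len(read) < m:
            pre_temp = read
        steps.append((ch, format(d, f"0{m}b"), pre_temp))
        if d == 0:
            break
        d = (d << 1) & full
    return steps, m - len(pre_temp)


steps, shift = trace_window("AGATA", "ATATA")
for step in steps:
    print(step)
print("shift:", shift)
```

For the first window "AGATA" the trace ends with pre-temp = "ATA" and a shift of 2, and for the matching window "ATATA" the shift is also 2, as in the slides.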
90 We give an extreme example to show the worst case of the algorithm.
91 Example: P: AAAAA T: AAAAAAAA Preprocessing: B[A] = 11111, B[*] = 00000 Initial: D: 11111 read : empty pre-temp : empty
92 Reading A: D = 11111 And 11111 = 11111 read : A pre-temp : empty
93 read : A pre-temp : A D = 11111 << 1 = 11110
94 Reading A: D = 11110 And 11111 = 11110 read : AA pre-temp : A
95 read : AA pre-temp : AA D = 11110 << 1 = 11100
96 Reading A: D = 11100 And 11111 = 11100 read : AAA pre-temp : AA
97 read : AAA pre-temp : AAA D = 11100 << 1 = 11000
98 Reading A: D = 11000 And 11111 = 11000 read : AAAA pre-temp : AAA
99 read : AAAA pre-temp : AAAA D = 11000 << 1 = 10000
100 Reading A: D = 10000 And 11111 = 10000 read : AAAAA pre-temp : AAAA We find that "AAAAA", of length m, is the longest prefix of the pattern which is equal to a suffix of the window, so an exact match occurs. Since pre-temp has length 4, the window shifts by only 5 - 4 = 1 position.
101 We reset D = 11111, read = empty and pre-temp = empty for the next window.
102 The second window is processed in exactly the same way as the first: D goes through 11111, 11110, 11100, 11000 and 10000, all five characters are read, and another exact match occurs. Every window of this text behaves the same way, reading m characters and then shifting by one position.
111 Time Complexity: If the length of the text is n and the length of pattern is m, the time complexity of this algorithm is O(mn) in the worst case.
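One way to make the worst case concrete is to count how many text characters the search reads. The sketch below follows the searching loop of the algorithm, with 0-based indexing and a hypothetical function name `bndm_count_reads`; it assumes the pattern is not empty.

```python
def bndm_count_reads(text: str, pattern: str) -> int:
    """Number of text characters read by the bit-parallel search."""
    m, n = len(pattern), len(text)
    full, top = (1 << m) - 1, 1 << (m - 1)
    masks: dict[str, int] = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - j))
    pos = reads = 0
    while pos <= n - m:
        j, last, d = m, m, full
        while d:
            d &= masks.get(text[pos + j - 1], 0)
            reads += 1
            j -= 1
            if d & top and j > 0:
                last = j               # a proper prefix ends the window here
            d = (d << 1) & full
        pos += last
    return reads


print(bndm_count_reads("AAAAAAAA", "AAAAA"))   # every window is read in full
print(bndm_count_reads("GGGGGGGG", "AAAAA"))   # one read, then a full shift
```

For P = AAAAA and T = AAAAAAAA each of the n - m + 1 = 4 windows reads all m = 5 characters, 20 reads in total, which is the O(mn) behaviour. When the first character read does not occur in the pattern, a single read is followed by a shift of m.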
112 Reference:
M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two string matching algorithms. Algorithmica, 12(4/5), 1994.
G. Navarro and M. Raffinot. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics, 5, 2000.
W. I. Chang and E. L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5), 1994.
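114 Algorithm:
Preprocessing
  For c ∈ ∑ Do B[c] ← 0^m
  For j ∈ 1…m Do B[p_j] ← B[p_j] | 0^(j-1) 1 0^(m-j)
Searching
  pos ← 0
  While pos ≤ n-m Do
    j ← m, last ← m
    D ← 1^m
    While D ≠ 0^m Do
      D ← D & B[t_(pos+j)]
      j ← j-1
      If D & 1 0^(m-1) ≠ 0^m Then
        If j > 0 Then last ← j
        Else report an occurrence at pos+1
      End of if
      D ← D << 1
    End of while
    pos ← pos + last
  End of while
Here 0^m and 1^m denote m zeros and m ones, p_j is the j-th character of the pattern, t_i is the i-th character of the text, and D is kept to m bits when it is shifted.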
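The pseudocode above can be translated almost line by line. The following is a sketch under stated assumptions rather than a reference implementation: positions are reported 0-based instead of pos+1, Python integers stand in for the machine word (so the pattern length is not limited by the word size), and the name `bndm_search` is hypothetical.

```python
def bndm_search(text: str, pattern: str) -> list[int]:
    """Return the 0-based positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    full, top = (1 << m) - 1, 1 << (m - 1)    # 1^m and 1 0^(m-1)
    # Preprocessing: B[c] has a 1 at each position where c occurs in the pattern.
    b: dict[str, int] = {}
    for j, ch in enumerate(pattern):
        b[ch] = b.get(ch, 0) | (1 << (m - 1 - j))
    occurrences = []
    pos = 0
    while pos <= n - m:
        j, last, d = m, m, full
        while d != 0:
            d &= b.get(text[pos + j - 1], 0)  # t_(pos+j) in 1-based notation
            j -= 1
            if d & top:
                if j > 0:
                    last = j                  # a prefix of the pattern was found
                else:
                    occurrences.append(pos)   # the whole window matches
            d = (d << 1) & full               # truncate the shift to m bits
        pos += last
    return occurrences


print(bndm_search("AGATACGATATATAC", "ATATA"))
```

The inner loop always terminates because each shift clears one more low bit of D, so D is zero after at most m iterations.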