1
1 Two exact string matching algorithms using the suffix to prefix rule. Advisor: Prof. R. C. T. Lee. Speaker: G. W. Cheng.
2
2 CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W. and RYTTER, W., Speeding up two string matching algorithms, Algorithmica, Vol. 12, 1994, pp. 247-267.
3
3 Problem Definition: We are given a text string T and a pattern string P, and we want to find all occurrences of P in T.
4
4 Consider the following example: There are two occurrences of P in T as shown below:
5
5 Rule 1: The Suffix to Prefix Rule. For a window to have any chance of matching the pattern, there must be a suffix of the window which is equal to a prefix of the pattern.
6
6 Basic Ideas: Open a window W of size |P| in the text T, and find the longest suffix of W that is also a prefix of the pattern P. Case 1: the suffix is W itself, so W matches P.
7
7 Case 2: a proper suffix of W is a prefix of P, so we shift W to align that prefix of P with the suffix. Case 3: if there is no such suffix, we move W by |P|.
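The three cases come down to one quantity: the length of the longest suffix of W that is also a prefix of P. The sketch below computes it by brute force to make the case analysis concrete. The function names are illustrative and not from the slides, and the algorithms in this talk obtain the same length from a suffix automaton or from bit masks rather than by direct comparison.

```python
def longest_suffix_prefix(window: str, pattern: str, limit: int) -> int:
    """Length (at most limit) of the longest suffix of window that is a prefix of pattern."""
    for k in range(min(limit, len(window), len(pattern)), 0, -1):
        if window.endswith(pattern[:k]):
            return k
    return 0


def next_shift(window: str, pattern: str) -> tuple[str, int]:
    """Classify the window and return (case, shift) following the three cases."""
    m = len(pattern)
    if window == pattern:
        # Case 1: a match. To move on, align the longest proper prefix instead.
        return "match", m - longest_suffix_prefix(window, pattern, m - 1)
    k = longest_suffix_prefix(window, pattern, m)
    if k > 0:
        return "partial", m - k   # Case 2: align the prefix of length k
    return "none", m              # Case 3: skip the whole window


print(next_shift("GCATCGCA", "GCAGAGAG"))  # ('partial', 5)
```

With P = GCAGAGAG, the window GCATCGCA ends in GCA, a prefix of length 3, so the shift is 8 - 3 = 5.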
8
8 Preprocessing phase: T = GCATCGCAGAGAGTATACAGTACG, P = GCAGAGAG. We construct the suffix automaton of P. [Figure: suffix automaton]
9
9 Preprocessing: Construct a suffix tree of P^R, the reversal of the pattern P. [Figure: suffix tree of P^R]
10
10 When there is a match, how do we move the window? T = GCATCGCAGAGAGTATACAGTACG, P = GCAGAGAG.
12
12 Find the longest suffix of W that is also a prefix of the pattern. T = GCATCGCAGGCAGTATACAGTACG, P = GCAGAGAG.
14
14 A Whole Example: T = GCATCGCAGAGAGTATACAGTACG, P = GCAGAGAG. First attempt: shift by 5 (8 - 3).
15
15 Second attempt: shift by 7 (8 - 1).
16
16 Third attempt: shift by 7 (8 - 1).
17
17 Third attempt: T = GCATCGCAGAGAGTATACAGTACG, P = GCAGAGAG.
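The attempts above can be reproduced with a short sketch. It is only an illustration: it tests whether a suffix is a substring of P with Python's `in` operator instead of the suffix automaton built in the preprocessing phase, and the name `reverse_factor_search` is made up for this example. Positions are 0-based.

```python
def reverse_factor_search(text: str, pattern: str) -> tuple[list[int], list[int]]:
    """Return (match positions, shifts) using a naive substring test on the pattern."""
    n, m = len(text), len(pattern)
    matches, shifts = [], []
    pos = 0
    while pos <= n - m:
        j = m      # number of window characters not yet read
        last = m   # shift implied by the longest prefix seen so far
        # Extend the suffix leftwards while it still occurs in the pattern.
        while j > 0 and text[pos + j - 1:pos + m] in pattern:
            j -= 1
            suffix = text[pos + j:pos + m]
            if pattern.startswith(suffix):
                if j > 0:
                    last = j             # a proper prefix: remember its alignment
                else:
                    matches.append(pos)  # the whole window was read: a match
        shifts.append(last)
        pos += last
    return matches, shifts


matches, shifts = reverse_factor_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG")
print(matches, shifts)  # [5] [5, 7, 7]
```

The shifts 5, 7 and 7 agree with the three attempts, and the single occurrence is found at the second attempt.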
18
18 Conclusion: with m = |P| and n = |T|, the preprocessing phase takes O(m) time and the searching phase takes O(mn) time.
20
20 CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W. and RYTTER, W., Speeding up two string matching algorithms, Algorithmica, Vol. 12, 1994, pp. 247-267.
21
21 NAVARRO, G. and RAFFINOT, M., A Bit-Parallel Approach to Suffix Automata: Fast Extended String Matching, in Proceedings of the 9th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science 1448, Springer-Verlag, Berlin, 1998, pp. 14-31.
22
22 Problem Definition: We are given a text string T and a pattern string P, and we want to find all occurrences of P in T.
23
23 This algorithm compares the pattern P with the text T inside a sliding window, which slides from left to right. Example: Text: ABDAACDGAEEGGGGJJ, Pattern: ACDAAC.
26
26 Basic idea: In this algorithm, we want to find the longest prefix of the pattern which is equal to a suffix of the window.
27
27 Example: Text: ABDDCACDADEGGGGJJ, Pattern: ACDADCEAD. We want to find the longest suffix of the window “BDDCACDAD” that is also a prefix of the pattern.
28
28 We find all occurrences of the substring “D” in the pattern.
29
29 This amounts to aligning the end of the window with each of these occurrences in the pattern.
30
30 An occurrence whose preceding character mismatches is dropped. Then we try to find all occurrences of “AD” in the pattern.
31
31 We succeed in finding all occurrences of “AD” in the pattern.
32
32 After another mismatch, we try to find all occurrences of “DAD” in the pattern.
33
33 We find all occurrences of “DAD” in the pattern.
34
34 We try to find all occurrences of “CDAD” in the pattern.
35
35 We try to find all occurrences of “ACDAD” in the pattern.
36
36 “ACDAD” is a prefix of the pattern, so we can align this longest prefix of the pattern with the matching suffix of the window.
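The walk through “D”, “AD”, “DAD”, “CDAD” and “ACDAD” can be checked with a small brute-force sketch. The helper name `suffix_occurrences` is an assumption made for this illustration. It lists, for each suffix of the window (shortest first), the 1-based positions where that suffix starts in the pattern, and it stops at the first suffix that does not occur.

```python
def suffix_occurrences(window: str, pattern: str) -> list[tuple[str, list[int]]]:
    """List (suffix, 1-based start positions in pattern) for growing suffixes of window."""
    steps = []
    for k in range(1, len(window) + 1):
        suffix = window[-k:]
        starts = [i + 1 for i in range(len(pattern) - k + 1)
                  if pattern.startswith(suffix, i)]
        if not starts:
            break  # this suffix is not a substring of the pattern
        steps.append((suffix, starts))
    return steps


for suffix, starts in suffix_occurrences("BDDCACDAD", "ACDADCEAD"):
    print(suffix, starts)
# D [3, 5, 9]
# AD [4, 8]
# DAD [3]
# CDAD [2]
# ACDAD [1]
```

A start position of 1 means the suffix is a prefix of the pattern, which happens here only for “ACDAD”.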
37
37 Why do we want to find the longest suffix of the sliding window that is also a prefix of the pattern? We explain this with the following cases.
38
38 Let u be the longest suffix of the window that is also a substring of P. Case 1: u is not a prefix of P, and no prefix of P is equal to a suffix of the window.
39
39 So we can shift the pattern past the whole window.
40
40 Example: Text: ABDDCCDDADEGGGGJJ, Pattern: ACDADCEAD. P must be shifted so that no part of P is compared with “DDAD”.
41
41 So we can shift the pattern past the window.
42
42 Case 2: u is not a prefix of P.
43
43 But a suffix v of the window of T may be a prefix of P.
44
44 So we can shift the pattern to align its prefix v with the suffix v of the window.
45
45 Example: Text: ABCABCABA, Pattern: CABBCAD. “BCA” is the longest suffix of the window “ABCABCA” that is also a substring of the pattern, and “CA” is a suffix of “BCA” that is a prefix of the pattern.
46
46 So we can shift the pattern to align its prefix “CA” with the suffix “CA” of the window.
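The difference between u (the longest suffix that is a substring) and v (the longest suffix that is a prefix) can be made concrete with a brute-force sketch. The function name is hypothetical, and substring tests use Python's `in` operator rather than any preprocessing.

```python
def factor_and_prefix(window: str, pattern: str) -> tuple[str, str]:
    """Return (u, v): the longest suffix of window that is a substring of pattern,
    and the longest suffix of window that is a prefix of pattern."""
    u, v = "", ""
    for k in range(1, len(window) + 1):
        suffix = window[-k:]
        if suffix not in pattern:
            break            # longer suffixes cannot be substrings either
        u = suffix
        if pattern.startswith(suffix):
            v = suffix       # this suffix also starts the pattern
    return u, v


u, v = factor_and_prefix("ABCABCA", "CABBCAD")
print(u, v, len("CABBCAD") - len(v))  # BCA CA 5
```

The shift is the pattern length minus |v|, here 7 - 2 = 5. If v is empty, the same formula gives a shift of the full pattern length, which is Case 1.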
47
47 The idea explained above is the main idea of this algorithm, and we will use a bit-parallel method to implement it.
48
48 Here we explain how to use bit-parallelism to find the substrings of the pattern that are equal to a suffix of the window. Example: Text: ABCABCCBA, Pattern: ACBCCBB, ∑ = {A, B, C}.
49
49 For every character, we build a bit mask with a 1 at each position where the character occurs in the pattern ACBCCBB: A: 1000000, B: 0010011, C: 0101100, others: 0000000.
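A minimal sketch of how such masks could be built, assuming the leftmost bit stands for position 1 of the pattern as in the table above. The names `build_masks` and `bits` are illustrative. Characters that do not occur in the pattern are simply absent from the dictionary and are treated as the all-zero mask.

```python
def build_masks(pattern: str) -> dict[str, int]:
    """Map each pattern character to a mask with a 1 at every position it occupies."""
    m = len(pattern)
    masks: dict[str, int] = {}
    for j, ch in enumerate(pattern):  # j is 0-based; the leftmost bit is position 1
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - j))
    return masks


def bits(value: int, m: int) -> str:
    """Render a mask as an m-character binary string."""
    return format(value, f"0{m}b")


masks = build_masks("ACBCCBB")
for ch in "ABC":
    print(ch, bits(masks[ch], 7))
# A 1000000
# B 0010011
# C 0101100
```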
50
50 We use a mask D to record the current state of the search.
51
51 Initially, D = 1111111.
52
52 Reading C: D = 1111111 & 0101100 = 0101100. Wherever there is a “1”, an occurrence of “C” starts in the pattern. Then D = 0101100 << 1 = 1011000.
53
53 Reading C: D = 1011000 & 0101100 = 0001000. Wherever there is a “1”, an occurrence of “CC” starts in the pattern. Then D = 0001000 << 1 = 0010000.
54
54 Reading B: D = 0010000 & 0010011 = 0010000. Wherever there is a “1”, an occurrence of “BCC” starts in the pattern. Then D = 0010000 << 1 = 0100000.
55
55 Reading A: D = 0100000 & 1000000 = 0000000. There is no substring “ABCC” in the pattern. The leftmost bit of D was never 1, so no prefix of the pattern is equal to a suffix of the window.
56
56 We can shift the pattern past the window.
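The sequence of D values in this example can be reproduced with the sketch below. It assumes the window is the first seven characters of the text, ABCABCC, and reads it from right to left. The name `trace_window` is hypothetical.

```python
def trace_window(window: str, pattern: str) -> list[str]:
    """Return D after each AND step while reading the window from right to left."""
    m = len(pattern)
    full = (1 << m) - 1                  # m ones
    masks: dict[str, int] = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - j))
    d = full
    states = []
    for ch in reversed(window):
        d &= masks.get(ch, 0)            # keep occurrences that extend by ch
        states.append(format(d, f"0{m}b"))
        if d == 0:
            break                        # the suffix read is no longer a substring
        d = (d << 1) & full              # move every surviving start one step left
    return states


print(trace_window("ABCABCC", "ACBCCBB"))
# ['0101100', '0001000', '0010000', '0000000']
```

The shift is masked with `full` because Python integers are unbounded; on a fixed-width machine word the overflowing bit would simply drop off.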
57
57 We give another example: Text: ABCABCABA, Pattern: CABBCAB, ∑ = {A, B, C}. Masks: A: 0100010, B: 0011001, C: 1000100, others: 0000000. Initially D = 1111111. Reading A: D = 1111111 & 0100010 = 0100010. Then D = 0100010 << 1 = 1000100.
58
58 Reading C: D = 1000100 & 1000100 = 1000100. The leftmost bit is 1, so “CA” is a substring of the pattern which starts at position 1, and this means that “CA” is a prefix of the pattern. Then D = 1000100 << 1 = 0001000.
59
59 Reading B: D = 0001000 & 0011001 = 0001000. So we know “BCA” is a substring of the pattern. Then D = 0001000 << 1 = 0010000.
60
60 Reading A: D = 0010000 & 0100010 = 0000000. There is no substring “ABCA” in the pattern.
61
61 “BCA” is the longest suffix of the window “ABCABCA” which is also a substring of the pattern, but the longest prefix of the pattern which is equal to a suffix of the window is “CA”.
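The example separates two questions: is the suffix read so far a substring of the pattern (D is nonzero), and is it a prefix (the leftmost bit of D is 1)? A minimal sketch of that test follows; the name `scan_window` is an assumption for this illustration.

```python
def scan_window(window: str, pattern: str) -> tuple[str, str]:
    """Return (longest suffix that is a substring, longest suffix that is a prefix)."""
    m = len(pattern)
    full = (1 << m) - 1
    top = 1 << (m - 1)                   # the leftmost bit, i.e. pattern position 1
    masks: dict[str, int] = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - j))
    d = full
    factor_len = prefix_len = 0
    for k, ch in enumerate(reversed(window), start=1):
        d &= masks.get(ch, 0)
        if d == 0:
            break
        factor_len = k                   # the last k characters occur in the pattern
        if d & top:
            prefix_len = k               # and they occur starting at position 1
        d = (d << 1) & full
    n = len(window)
    return window[n - factor_len:], window[n - prefix_len:]


print(scan_window("ABCABCA", "CABBCAB"))  # ('BCA', 'CA')
```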
62
62 We take an example of the whole algorithm.
63
63 We use “read” to store the suffix of the sliding window which we have already read, and “pre-temp” to store the longest suffix of “read” which is also a prefix of the pattern.
64
64 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B[A] = 10101, B[T] = 01010, B[*] = 00000, where * stands for any other character.
65
65 Initial: D = 11111, read: empty, pre-temp: empty.
66
66 Reading A: D = 11111 & 10101 = 10101. The leftmost bit is 1, so we set pre-temp = “A”, which is a prefix of the pattern. read: A.
67
67 D = 10101 << 1 = 01010. read: A, pre-temp: A.
68
68 Reading T: D = 01010 & 01010 = 01010. read: TA, pre-temp: A.
69
69 D = 01010 << 1 = 10100. read: TA, pre-temp: A.
70
70 Reading A: D = 10100 & 10101 = 10100. The leftmost bit is 1, so we set pre-temp = “ATA”, which is a prefix of the pattern. read: ATA.
71
71 D = 10100 << 1 = 01000. read: ATA, pre-temp: ATA.
72
72 Reading G: D = 01000 & 00000 = 00000. read: GATA, pre-temp: ATA.
73
73 D = 00000, so we stop. We find that “ATA” is the longest suffix of the window “AGATA” which is also a prefix of the pattern.
74
74 So we can shift the pattern by 5 - 3 = 2 to align its prefix “ATA” with the window.
75
75 Then we reset D = 11111, read = empty and pre-temp = empty.
76
76 Reading G: D = 11111 & 00000 = 00000. read: G, pre-temp: empty.
77
77 D = 00000. There is no substring “G” in the pattern.
78
78 So we can shift the pattern to the right by the length of P. We reset D = 11111, read = empty and pre-temp = empty.
79
79 Reading A: D = 11111 & 10101 = 10101. The leftmost bit is 1, so we set pre-temp = “A”, which is a prefix of the pattern. read: A.
80
80 D = 10101 << 1 = 01010. read: A, pre-temp: A.
81
81 Reading T: D = 01010 & 01010 = 01010. Then D = 01010 << 1 = 10100. read: TA, pre-temp: A.
82
82 Reading A: D = 10100 & 10101 = 10100. The leftmost bit is 1, so we set pre-temp = “ATA”, which is a prefix of the pattern. read: ATA.
83
83 D = 10100 << 1 = 01000. read: ATA, pre-temp: ATA.
84
84 Reading T: D = 01000 & 01010 = 01000. Then D = 01000 << 1 = 10000. read: TATA, pre-temp: ATA.
85
85 Reading A: D = 10000 & 10101 = 10000. read: ATATA, pre-temp: ATA.
86
86 We have read the whole window of length m and the leftmost bit of D is 1, so “ATATA” is the longest prefix of the pattern which is equal to a suffix of the window, and an exact match occurs.
87
87 Besides the full pattern, “ATA” (stored in pre-temp) is the longest prefix of the pattern which is equal to a suffix of the window.
88
88 So we can shift the pattern by 5 - 3 = 2.
89
89 We reset D = 11111, read = empty and pre-temp = empty, and repeat the above steps until the window slides out of the text.
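The walkthrough can be condensed into a trace of window positions and shifts. The sketch below is illustrative only: `bndm_trace` is a made-up name, window starts are 0-based, and the variable `last` plays the role of pre-temp by recording the shift that aligns the longest proper prefix found.

```python
def bndm_trace(text: str, pattern: str) -> list[tuple[int, int, bool]]:
    """Return one (window start, shift, matched) triple per window."""
    n, m = len(text), len(pattern)
    full, top = (1 << m) - 1, 1 << (m - 1)
    masks: dict[str, int] = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - j))
    trace = []
    pos = 0
    while pos <= n - m:
        j, last, d, matched = m, m, full, False
        while d != 0:
            d &= masks.get(text[pos + j - 1], 0)  # read the window right to left
            j -= 1
            if d & top:                # the suffix read is a prefix of the pattern
                if j > 0:
                    last = j           # proper prefix: remember where to align it
                else:
                    matched = True     # whole window read: an occurrence
            d = (d << 1) & full
        trace.append((pos, last, matched))
        pos += last
    return trace


for row in bndm_trace("AGATACGATATATAC", "ATATA"):
    print(row)
# (0, 2, False)
# (2, 5, False)
# (7, 2, True)
# (9, 2, True)
```

The first two rows are the shift by 2 and the shift by the length of P from the walkthrough; the third row is the exact match.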
90
90 We give an extreme example to show the worst case of the algorithm.
91
91 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B[A] = 11111, B[*] = 00000. Initial: D = 11111, read: empty, pre-temp: empty.
92
92 Reading A: D = 11111 & 11111 = 11111. read: A, pre-temp: empty.
93
93 D = 11111 << 1 = 11110. read: A, pre-temp: A.
94
94 Reading A: D = 11110 & 11111 = 11110. read: AA, pre-temp: A.
95
95 D = 11110 << 1 = 11100. read: AA, pre-temp: AA.
96
96 Reading A: D = 11100 & 11111 = 11100. read: AAA, pre-temp: AA.
97
97 D = 11100 << 1 = 11000. read: AAA, pre-temp: AAA.
98
98 Reading A: D = 11000 & 11111 = 11000. read: AAAA, pre-temp: AAA.
99
99 D = 11000 << 1 = 10000. read: AAAA, pre-temp: AAAA.
100
100 Reading A: D = 10000 & 11111 = 10000. read: AAAAA, pre-temp: AAAA. We find “AAAAA”, which is the longest prefix of the pattern which is equal to a suffix of the window with length m, so an exact match occurs.
101
101 Since pre-temp = “AAAA”, we shift the pattern by 5 - 4 = 1 and reset D = 11111, read = empty and pre-temp = empty.
102
102 Reading A: D = 11111 & 11111 = 11111. read: A, pre-temp: empty.
103
103 D = 11111 << 1 = 11110. read: A, pre-temp: A.
104
104 Reading A: D = 11110 & 11111 = 11110. read: AA, pre-temp: A.
105
105 D = 11110 << 1 = 11100. read: AA, pre-temp: AA.
106
106 Reading A: D = 11100 & 11111 = 11100. read: AAA, pre-temp: AA.
107
107 D = 11100 << 1 = 11000. read: AAA, pre-temp: AAA.
108
108 Reading A: D = 11000 & 11111 = 11000. read: AAAA, pre-temp: AAA.
109
109 D = 11000 << 1 = 10000. read: AAAA, pre-temp: AAAA.
110
110 Reading A: D = 10000 & 11111 = 10000. read: AAAAA, pre-temp: AAAA. We find “AAAAA”, which is the longest prefix of the pattern which is equal to a suffix of the window with length m, so an exact match occurs.
111
111 Time Complexity: If the length of the text is n and the length of pattern is m, the time complexity of this algorithm is O(mn) in the worst case.
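As a worked instance of this bound, assume the pattern and the text consist of one repeated character, as in the extreme example. Every window is then read completely (m characters) and the shift is always 1, so there are n - m + 1 windows:

```latex
\[
\underbrace{(n - m + 1)}_{\text{windows}} \times \underbrace{m}_{\text{reads per window}}
  = m\,(n - m + 1) \;\le\; mn ,
\qquad \text{e.g. } m = 5,\ n = 8:\quad 4 \times 5 = 20 \text{ character reads.}
\]
```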
112
112 Reference:
M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two string matching algorithms. Algorithmica, 12(4/5):247-267, 1994.
G. Navarro and M. Raffinot. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics, 5, 2000.
W. I. Chang and E. L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5):327-344, 1994.
113
113 Thanks for your attention.
114
114 Algorithm:
Preprocessing
    For c ∈ ∑ Do B[c] ← 0^m
    For j ∈ 1…m Do B[p_j] ← B[p_j] | 0^(j-1) 1 0^(m-j)
Searching
    pos ← 0
    While pos ≤ n-m Do
        j ← m, last ← m
        D ← 1^m
        While D ≠ 0^m Do
            D ← D & B[t_(pos+j)]
            j ← j-1
            If D & 10^(m-1) ≠ 0^m Then
                If j > 0 Then last ← j
                Else report an occurrence at pos+1
            End of if
            D ← D << 1
        End of while
        pos ← pos + last
    End of while
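A direct Python transcription of this pseudocode, offered as a sketch rather than a reference implementation. It reports 1-based positions, as the pseudocode does with pos+1. Two details are assumptions of the transcription: the mask with `full` after the shift replaces the fixed machine-word width, and characters missing from the pattern are looked up with a default of 0 instead of initialising B for the whole alphabet.

```python
def bndm_search(text: str, pattern: str) -> list[int]:
    """Return the 1-based start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    full = (1 << m) - 1                  # 1^m
    top = 1 << (m - 1)                   # 1 0^(m-1)
    # Preprocessing: B[p_j] gets a 1 at position j, counted from the left.
    b: dict[str, int] = {}
    for j in range(1, m + 1):
        b[pattern[j - 1]] = b.get(pattern[j - 1], 0) | (1 << (m - j))
    # Searching
    occurrences = []
    pos = 0
    while pos <= n - m:
        j, last, d = m, m, full
        while d != 0:
            d &= b.get(text[pos + j - 1], 0)   # t_(pos+j) in 1-based notation
            j -= 1
            if d & top:
                if j > 0:
                    last = j
                else:
                    occurrences.append(pos + 1)
            d = (d << 1) & full
        pos += last
    return occurrences


print(bndm_search("AGATACGATATATAC", "ATATA"))  # [8, 10]
print(bndm_search("AAAAAAAA", "AAAAA"))         # [1, 2, 3, 4]
```

Because Python integers have arbitrary precision, this sketch is not limited to patterns that fit in one machine word, although the speed advantage of the bit-parallel method depends on that in practice.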