1 Advisor: Prof. R. C. T. Lee Speaker: G. W. Cheng Two exact string matching algorithms using suffix to prefix rule.


2 2 Speeding up two string matching algorithms, Algorithmica, Vol. 12, 1994, pp. 247-267. CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W. and RYTTER, W.

3 3 Problem Definition: We are given a text string T of length n and a pattern string P of length m, and we want to find all occurrences of P in T.

4 4 Consider the following example: There are two occurrences of P in T as shown below:

5 5 Rule 1: The Suffix to Prefix Rule. For a window to have any chance to match a pattern, there must be a suffix of the window which is equal to a prefix of the pattern. [Figure: a suffix of the window in T aligned with a prefix of P]

6 6 Basic Ideas: Open a window W of size |P| in the text. Find the longest suffix of W that is also a prefix of the pattern. Case 1: this suffix is the whole window, so W matches P. [Figure: T, the window W of size |P|, and P]

7 7 Case 2: the suffix is shorter than W. We move W so that the prefix of P is aligned with this suffix. Case 3: if there is no such suffix, we move W by |P|. [Figure: T, W and P for Cases 2 and 3]

8 8 Preprocessing phase: T = GCATCGCAGAGAGTATACAGTACG, P = GCAGAGAG. We construct the suffix automaton of the reverse of P. [Figure: Suffix Automaton]

9 9 Preprocessing: Construct a suffix tree of P^R, the reverse of the pattern P. [Figure: suffix tree of P^R]
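As a rough illustration of this preprocessing step, the sketch below builds a suffix automaton of the reversed pattern with the standard online construction, then scans a window from right to left. The function names `build_suffix_automaton` and `longest_prefix_as_suffix` are assumptions made for this sketch; they do not come from the slides.

```python
def build_suffix_automaton(s):
    """Return (transitions, terminal_states) of the suffix automaton of s."""
    nxt, link, length = [{}], [-1], [0]   # state 0 is the initial state
    last = 0
    for ch in s:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split state q by cloning it.
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    # A state is terminal when the strings it accepts are suffixes of s.
    terminal, p = set(), last
    while p != -1:
        terminal.add(p)
        p = link[p]
    return nxt, terminal


def longest_prefix_as_suffix(window, nxt, terminal):
    """Length of the longest suffix of window that is a prefix of P,
    given the automaton of the reversed pattern."""
    state, best = 0, 0
    for k, ch in enumerate(reversed(window), 1):
        if ch not in nxt[state]:
            break                  # the suffix read so far is not a factor of P
        state = nxt[state][ch]
        if state in terminal:
            best = k               # a suffix of P^R, i.e. a prefix of P
    return best


nxt, terminal = build_suffix_automaton("GCAGAGAG"[::-1])
print(longest_prefix_as_suffix("GCATCGCA", nxt, terminal))
```

Reading the window backwards spells reversed suffixes of the window, so reaching a terminal state of the automaton of P^R means that the suffix read so far is a prefix of P.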

10 10 T = GCATCGCAGAGAGTATACAGTACG, P = GCAGAGAG. When there is a match, how do we move the window?

11 11 T = GCATCGCAGAGAGTATACAGTACG, P = GCAGAGAG.

12 12 T = GCATCGCAGGCAGTATACAGTACG, P = GCAGAGAG. Find the longest suffix of W that is also a prefix of the pattern.

13 13 T = GCATCGCAGGCAGTATACAGTACG, P = GCAGAGAG.

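14 14 A Whole Example: T = GCATCGCAGAGAGTATACAGTACG, P = GCAGAGAG. First attempt: the window is GCATCGCA, and its longest suffix that is a prefix of P is GCA (length 3). Shift by: 5 (8 - 3).

15 15 Second attempt: the window is GCAGAGAG, which matches P. Its longest proper suffix that is a prefix of P is G (length 1). Shift by: 7 (8 - 1).

16 16 Third attempt: the window is GTATACAG, and its longest suffix that is a prefix of P is G (length 1). Shift by: 7 (8 - 1).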

17 17 Third attempt: GCATCGCAGAGAGTATACAGTACG GCAGAGAG T P
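The shifts in this example can be checked with a small, deliberately naive sketch. It recomputes the longest suffix of each window that is a prefix of P by direct comparison rather than with a suffix automaton, so it only illustrates the shift rule; the names `longest_suffix_prefix` and `suffix_prefix_search` are hypothetical.

```python
def longest_suffix_prefix(window, pattern, proper=False):
    """Length of the longest suffix of window that is a prefix of pattern.
    With proper=True the full-length overlap is excluded."""
    max_len = min(len(window), len(pattern))
    if proper:
        max_len -= 1
    for k in range(max_len, 0, -1):
        if window[-k:] == pattern[:k]:
            return k
    return 0


def suffix_prefix_search(text, pattern):
    """Return (match positions, list of shifts), positions being 0-based."""
    m = len(pattern)
    if m == 0:
        return [], []
    pos, matches, shifts = 0, [], []
    while pos + m <= len(text):
        window = text[pos:pos + m]
        if window == pattern:
            matches.append(pos)
            # After a match, use the longest proper overlap.
            k = longest_suffix_prefix(window, pattern, proper=True)
        else:
            k = longest_suffix_prefix(window, pattern)
        shifts.append(m - k)
        pos += m - k
    return matches, shifts


matches, shifts = suffix_prefix_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG")
print(matches, shifts)  # [5] [5, 7, 7]
```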

18 18 Conclusion: With m = |P| and n = |T|, the preprocessing phase is O(m) and the searching phase is O(mn) in the worst case.


20 20 Speeding up two string matching algorithms, Algorithmica, Vol. 12, 1994, pp. 247-267. CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W. and RYTTER, W.

21 21 A Bit-Parallel Approach to Suffix Automata: Fast Extended String Matching. In Proceedings of the 9th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science 1448, Springer-Verlag, Berlin, pp. 14-31, 1998. NAVARRO, G. and RAFFINOT, M.

22 22 Problem Definition: We are given a text string T of length n and a pattern string P of length m, and we want to find all occurrences of P in T.

23 23 This algorithm compares the pattern P with the text T inside a sliding window, and the window slides from left to right. Example: Text : ABDAACDGAEEGGGGJJ Pattern : ACDAAC

24 24 Text : ABDAACDGAEEGGGGJJ Pattern : ACDAAC sliding window Example:

25 25 Text : ABDAACDGAEEGGGGJJ Pattern : ACDAAC sliding window Example:

26 26 Basic idea: In this algorithm, we want to find the longest prefix of the pattern that is equal to a suffix of the window.

27 27 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: We want to find the longest suffix of the window “BDDCACDAD” that is also a prefix of the pattern.

28 28 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: We find all occurrences of “D” in the pattern.

29 29 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: In effect, we compare the window with the pattern at each of these occurrences. [Figure: the pattern ACDADCEAD aligned under the window]

30 30 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: (mismatch) Then we try to find all occurrences of “AD” in the pattern.

31 31 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: We succeed in finding occurrences of “AD” in the pattern.

32 32 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: (mismatch) We try to find all occurrences of “DAD” in the pattern.

33 33 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: We find “DAD” in the pattern.

34 34 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: We try to find all occurrences of “CDAD” in the pattern.

35 35 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: We try to find all occurrences of “ACDAD” in the pattern.

36 36 Text : ABDDCACDADEGGGGJJ Pattern : ACDADCEAD Example: “ACDAD” is also a prefix of the pattern, so we can align this prefix of the pattern with the matching suffix of the window.

37 37 Why do we want to find the longest suffix of the text in the sliding window which is also a prefix of pattern? We will explain this by the following idea.

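38 38 Case 1: Let u be the longest suffix of the window that is a substring of P. u is not a prefix of P, and no prefix of P is equal to a suffix of the window. [Figure: P, T and u]

39 39 So, we can shift the pattern entirely past the window. [Figure: P after the shift]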

40 40 Text : ABDDCCDDADEGGGGJJ Pattern : ACDADCEAD Example: P must be shifted in such a way to avoid comparing any part of P with “DDAD”.

41 41 Text : ABDDCCDDADEGGGGJJ Pattern : ACDADCEAD Example: So, we can shift the pattern as above.

42 42 Case 2: u is not a prefix of P. [Figure: P, T and u]

43 43 But a suffix v of the window of T may be a prefix of P. [Figure: P, T, u and v]

44 44 So, we can shift the pattern so that its prefix v is aligned with the suffix v of the window. [Figure: P after the shift]

45 45 Text: ABCABCABA Pattern: CABBCAD Example: “BCA” is the longest suffix of “ABCABCA” which is also a substring of the pattern. “CA” is a suffix of “BCA” which is a prefix of the pattern.

46 46 Text: ABCABCABA Pattern: CABBCAD Example: So we can shift as above.
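The two cases can be summarised in a short sketch. It is a naive version that uses plain substring tests instead of an automaton or bit masks, and the name `scan_window` is an assumption made here: it extends the suffix of the window one character at a time, stops as soon as the suffix is no longer a substring of the pattern, and remembers the longest suffix u that is a substring and the longest suffix v that is a prefix.

```python
def scan_window(window, pattern):
    """Return (u, v): u is the longest suffix of window that occurs in
    pattern, v is the longest suffix of window that is a prefix of pattern."""
    u = v = ""
    for k in range(1, len(window) + 1):
        suffix = window[-k:]
        if suffix not in pattern:
            break                      # longer suffixes cannot occur either
        u = suffix
        if pattern.startswith(suffix):
            v = suffix
    return u, v


window, pattern = "ABCABCA", "CABBCAD"
u, v = scan_window(window, pattern)
shift = len(window) - len(v)           # Case 1 (v empty) shifts by the whole window
print(u, v, shift)  # BCA CA 5
```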

47 47 The idea explained above is the main idea of this algorithm, and we will use a bit-parallel method to implement it.

48 48 Here, we explain how to use bit-parallelism to find the substrings of the pattern that are equal to a suffix of the window. Example: Text: ABCABCCBA, Pattern: ACBCCBB, ∑ = {A, B, C}.

49 49 Text: ABCABCCBA Pattern: ACBCCBB Example: For every character of the alphabet, we build a bit mask with a 1 at each position where that character occurs in Pattern: A: 1000000, B: 0010011, C: 0101100, others: 0000000.
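A minimal sketch of this mask construction is given below. It assumes each mask is stored as a Python integer whose most significant of m bits corresponds to the first pattern position, so that printing it in binary reproduces the strings on the slide; `build_masks` and `as_bits` are hypothetical helper names.

```python
def build_masks(pattern):
    """Map each character of pattern to an m-bit mask (as an int) with a 1
    at every position where the character occurs, leftmost position first."""
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - i))
    return masks


def as_bits(value, m):
    """Format an integer as an m-character binary string."""
    return format(value, "0{}b".format(m))


masks = build_masks("ACBCCBB")
for ch in "ABC":
    print(ch, as_bits(masks[ch], 7))
# Characters that do not occur in the pattern get the all-zero mask.
print("others", as_bits(masks.get("D", 0), 7))
```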

50 50 Text: ABCABCCBA Pattern: ACBCCBB Example: Masks: A: 1000000, B: 0010011, C: 0101100, other: 0000000. We use a mask D to record where the suffix of the window read so far occurs in Pattern. Initially D = 1111111.

51 51 Text: ABCABCCBA Pattern: ACBCCBB Example: Masks: A: 1000000, B: 0010011, C: 0101100, other: 0000000. D = 1111111.

52 52 Text: ABCABCCBA Pattern: ACBCCBB Example: Masks: A: 1000000, B: 0010011, C: 0101100, other: 0000000. D = 1111111. Reading C: 1111111 AND 0101100 = 0101100. Where there is a “1”, there is a substring “C” in Pattern. We set D = 0101100 << 1 = 1011000.

53 53 Text: ABCABCCBA Pattern: ACBCCBB Example: Masks: A: 1000000, B: 0010011, C: 0101100, other: 0000000. D = 1011000. Reading C: 1011000 AND 0101100 = 0001000. Where there is a “1”, there is a substring “CC” in Pattern. We set D = 0001000 << 1 = 0010000.

54 54 Text: ABCABCCBA Pattern: ACBCCBB Example: Masks: A: 1000000, B: 0010011, C: 0101100, other: 0000000. D = 0010000. Reading B: 0010000 AND 0010011 = 0010000. Where there is a “1”, there is a substring “BCC” in Pattern. We set D = 0010000 << 1 = 0100000.

55 55 Text: ABCABCCBA Pattern: ACBCCBB Example: Masks: A: 1000000, B: 0010011, C: 0101100, other: 0000000. D = 0100000. Reading A: 0100000 AND 1000000 = 0000000. There is no substring “ABCC” in Pattern. So, we can say that there is no prefix of Pattern which is equal to a suffix of the window.

56 56 Text: ABCABCCBA Pattern: ACBCCBB Example: We can shift Pattern as above.

57 57 We give another example: Text: ABCABCABA, Pattern: CABBCAB, ∑ = {A, B, C}. Masks: A: 0100010, B: 0011001, C: 1000100, other: 0000000. D = 1111111. Reading A: 1111111 AND 0100010 = 0100010. D = 0100010 << 1 = 1000100.

58 58 Text: ABCABCABA, Pattern: CABBCAB, ∑ = {A, B, C}. Masks: A: 0100010, B: 0011001, C: 1000100, other: 0000000. D = 1000100. Reading C: 1000100 AND 1000100 = 1000100. The leftmost bit is 1, so “CA” is a substring of the pattern which starts at position 1, and this means that “CA” is a prefix of the pattern. D = 1000100 << 1 = 0001000.

59 59 Text: ABCABCABA, Pattern: CABBCAB, ∑ = {A, B, C}. Masks: A: 0100010, B: 0011001, C: 1000100, other: 0000000. D = 0001000. Reading B: 0001000 AND 0011001 = 0001000. So, we know “BCA” is a substring of the pattern. D = 0001000 << 1 = 0010000.

60 60 Text: ABCABCABA, Pattern: CABBCAB, ∑ = {A, B, C}. Masks: A: 0100010, B: 0011001, C: 1000100, other: 0000000. D = 0010000. Reading A: 0010000 AND 0100010 = 0000000. There is no substring “ABCA” in Pattern.

61 61 Text: ABCABCABA, Pattern: CABBCAB, ∑ = {A, B, C}. “BCA” is the longest suffix of the window “ABCABCA” which is also a substring of the pattern, but the longest prefix of the pattern which is equal to a suffix of the window is “CA”.
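The two bit-parallel walkthroughs can be replayed with the sketch below. It assumes masks stored as Python integers with the leftmost pattern position in the most significant of m bits, and it masks every shift back to m bits, since Python integers are unbounded; `scan_window_bits` is a hypothetical name.

```python
def scan_window_bits(window, pattern):
    """Scan window from right to left. Return (trace, prefix_len): trace holds
    D after each AND, prefix_len is the length of the longest suffix of
    window that is a prefix of pattern."""
    m = len(pattern)
    full = (1 << m) - 1          # 1^m
    high = 1 << (m - 1)          # 10^(m-1), the leftmost bit
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - i))

    d, read, prefix_len, trace = full, 0, 0, []
    for ch in reversed(window):
        d &= masks.get(ch, 0)
        trace.append(format(d, "0{}b".format(m)))
        if d == 0:
            break                # the suffix read so far is not in pattern
        read += 1
        if d & high:
            prefix_len = read    # an occurrence starts at pattern position 1
        d = (d << 1) & full
    return trace, prefix_len


print(scan_window_bits("ABCABCA", "CABBCAB"))
# (['0100010', '1000100', '0001000', '0000000'], 2)
```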

62 62 We take an example of the whole algorithm.

63 63 We use “read” to store the suffix of the sliding window that we have already read, and “pre-temp” to store the longest suffix read so far that is also a prefix of the pattern.

64 64 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}.

65 65 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. Initial: D = 11111, read: empty, pre-temp: empty.

66 66 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 11111. Reading A: 11111 AND 10101 = 10101. The leftmost bit is 1, so we set pre-temp = “A”, which is a prefix of the pattern. read: A, pre-temp: empty.

67 67 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 11111. Reading A: 11111 AND 10101 = 10101. D = 10101 << 1 = 01010. read: A, pre-temp: A.

68 68 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 01010. Reading T: 01010 AND 01010 = 01010. read: TA, pre-temp: A.

69 69 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 01010. Reading T: 01010 AND 01010 = 01010. D = 01010 << 1 = 10100. read: TA, pre-temp: A.

70 70 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 10100. Reading A: 10100 AND 10101 = 10100. The leftmost bit is 1, so we set pre-temp = “ATA”, which is a prefix of the pattern. read: ATA, pre-temp: A.

71 71 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 10100. Reading A: 10100 AND 10101 = 10100. D = 10100 << 1 = 01000. read: ATA, pre-temp: ATA.

72 72 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 01000. Reading G: 01000 AND 00000 = 00000. read: GATA, pre-temp: ATA.

73 73 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 00000. read: GATA, pre-temp: ATA. We find that “ATA” is the longest suffix of the window “AGATA” which is also a prefix of the pattern.

74 74 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 00000. read: GATA, pre-temp: ATA. So, we can shift the pattern by 5 - 3 = 2 positions.

75 75 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. Then we reset D = 11111, read = empty and pre-temp = empty.

76 76 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 11111. Reading G: 11111 AND 00000 = 00000. read: G, pre-temp: empty.

77 77 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 00000. There is no substring “G” in the pattern. read: G, pre-temp: empty.

78 78 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. So, we can shift the pattern to the right by the length of P. And we reset D = 11111, read = empty and pre-temp = empty.

79 79 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 11111. Reading A: 11111 AND 10101 = 10101. We set pre-temp = “A”, which is a prefix of the pattern. read: A, pre-temp: empty.

80 80 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 11111. Reading A: 11111 AND 10101 = 10101. D = 10101 << 1 = 01010. read: A, pre-temp: A.

81 81 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 01010. Reading T: 01010 AND 01010 = 01010. D = 01010 << 1 = 10100. read: TA, pre-temp: A.

82 82 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 10100. Reading A: 10100 AND 10101 = 10100. We set pre-temp = “ATA”, which is a prefix of the pattern. read: ATA, pre-temp: A.

83 83 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 10100. Reading A: 10100 AND 10101 = 10100. D = 10100 << 1 = 01000. read: ATA, pre-temp: ATA.

84 84 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 01000. Reading T: 01000 AND 01010 = 01000. D = 01000 << 1 = 10000. read: TATA, pre-temp: ATA.

85 85 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 10000. Reading A: 10000 AND 10101 = 10000. read: ATATA, pre-temp: ATA.

86 86 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 10000. read: ATATA, pre-temp: ATA. We find “ATATA”, the longest prefix of the pattern that is equal to a suffix of the window. It has length m, so an exact match occurs.

87 87 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. D = 10000. read: ATATA, pre-temp: ATA. Besides the full pattern, “ATA” is the longest prefix of the pattern which is equal to a suffix of the window.

88 88 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. read: ATATA, pre-temp: ATA. So, we can shift the pattern by 5 - 3 = 2 positions.

89 89 Example: P: ATATA, T: AGATACGATATATAC. Preprocessing: B = {A: 10101, T: 01010, *: 00000}. Initial: D = 11111, read: empty, pre-temp: empty. Repeat the above steps until the window slides out of Text.

90 90 We give an extreme example to show the worst case of the algorithm.

91 91 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. Initial: D = 11111, read: empty, pre-temp: empty.

92 92 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 11111. Reading A: 11111 AND 11111 = 11111. read: A, pre-temp: empty.

93 93 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 11111. Reading A: 11111 AND 11111 = 11111. D = 11111 << 1 = 11110. read: A, pre-temp: A.

94 94 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 11110. Reading A: 11110 AND 11111 = 11110. read: AA, pre-temp: A.

95 95 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 11110. Reading A: 11110 AND 11111 = 11110. D = 11110 << 1 = 11100. read: AA, pre-temp: AA.

96 96 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 11100. Reading A: 11100 AND 11111 = 11100. read: AAA, pre-temp: AA.

97 97 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 11100. Reading A: 11100 AND 11111 = 11100. D = 11100 << 1 = 11000. read: AAA, pre-temp: AAA.

98 98 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 11000. Reading A: 11000 AND 11111 = 11000. read: AAAA, pre-temp: AAA.

99 99 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 11000. Reading A: 11000 AND 11111 = 11000. D = 11000 << 1 = 10000. read: AAAA, pre-temp: AAAA.

100 100 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 10000. Reading A: 10000 AND 11111 = 10000. read: AAAAA, pre-temp: AAAA. We find “AAAAA”, the longest prefix of the pattern that is equal to a suffix of the window. It has length m, so an exact match occurs.

101 101 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. We shift the pattern by 5 - 4 = 1 position and reset D = 11111, read: empty, pre-temp: empty.

102 102 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 11111. Reading A: 11111 AND 11111 = 11111. read: A, pre-temp: empty.

103 103 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 11111. Reading A: 11111 AND 11111 = 11111. D = 11111 << 1 = 11110. read: A, pre-temp: A.

104 104 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 11110. Reading A: 11110 AND 11111 = 11110. read: AA, pre-temp: A.

105 105 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 11110. Reading A: 11110 AND 11111 = 11110. D = 11110 << 1 = 11100. read: AA, pre-temp: AA.

106 106 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 11100. Reading A: 11100 AND 11111 = 11100. read: AAA, pre-temp: AA.

107 107 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 11100. Reading A: 11100 AND 11111 = 11100. D = 11100 << 1 = 11000. read: AAA, pre-temp: AAA.

108 108 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 11000. Reading A: 11000 AND 11111 = 11000. read: AAAA, pre-temp: AAA.

109 109 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 11000. Reading A: 11000 AND 11111 = 11000. D = 11000 << 1 = 10000. read: AAAA, pre-temp: AAAA.

110 110 Example: P: AAAAA, T: AAAAAAAA. Preprocessing: B = {A: 11111, *: 00000}. D = 10000. Reading A: 10000 AND 11111 = 10000. read: AAAAA, pre-temp: AAAA. We find “AAAAA”, the longest prefix of the pattern that is equal to a suffix of the window. It has length m, so an exact match occurs.

111 111 Time Complexity: If the length of the text is n and the length of pattern is m, the time complexity of this algorithm is O(mn) in the worst case.
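The bound can be made concrete with the extreme example, assuming P consists of m copies of A and T of n copies of A with n ≥ m. Every window is read completely, and after each match the shift is m - (m - 1) = 1, so there are n - m + 1 windows:

```latex
\underbrace{(n - m + 1)}_{\text{windows}} \times
\underbrace{m}_{\text{characters read per window}} = O(mn),
\qquad \text{e.g. } n = 8,\ m = 5:\quad 4 \times 5 = 20 \text{ character reads.}
```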

112 112 Reference:
M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two string matching algorithms. Algorithmica, 12(4/5):247-267, 1994.
G. Navarro and M. Raffinot. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics, 5, 2000.
W. I. Chang and E. L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5):327-344, 1994.

113 113 Thanks for your attention.

114 114 Algorithm:
Preprocessing
    For c ∈ ∑ Do B[c] ← 0^m
    For j ∈ 1…m Do B[p_j] ← B[p_j] | 0^(j-1) 1 0^(m-j)
Searching
    pos ← 0
    While pos ≤ n-m Do
        j ← m, last ← m
        D ← 1^m
        While D ≠ 0^m Do
            D ← D & B[t_(pos+j)]
            j ← j-1
            If D & 10^(m-1) ≠ 0^m Then
                If j > 0 Then last ← j
                Else report an occurrence at pos+1
            End of if
            D ← D << 1
        End of while
        pos ← pos + last
    End of while
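A direct Python transcription of this pseudocode might look as follows. It is a sketch under a few assumptions: positions are 0-based (the slide reports pos+1), Python's unbounded integers stand in for an m-bit machine word, so every shift is masked back to m bits, and the name `bndm_search` is hypothetical.

```python
def bndm_search(text, pattern):
    """Return the 0-based start positions of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    full = (1 << m) - 1              # 1^m
    high = 1 << (m - 1)              # 10^(m-1)

    # Preprocessing: B[c] has a 1 at each position where c occurs in pattern.
    B = {}
    for i, ch in enumerate(pattern):
        B[ch] = B.get(ch, 0) | (1 << (m - 1 - i))

    # Searching: scan each window from right to left.
    occurrences = []
    pos = 0
    while pos <= n - m:
        j, last = m, m
        D = full
        while D != 0:
            D &= B.get(text[pos + j - 1], 0)
            j -= 1
            if D & high:
                if j > 0:
                    last = j         # a prefix of pattern starts at pos + j
                else:
                    occurrences.append(pos)
            D = (D << 1) & full
        pos += last
    return occurrences


print(bndm_search("AGATACGATATATAC", "ATATA"))  # [7, 9]
print(bndm_search("AAAAAAAA", "AAAAA"))         # [0, 1, 2, 3]
```

The variable last plays the role of pre-temp in the walkthrough: it records where the longest prefix found in the window starts, so adding it to pos realigns that prefix with the start of the next window.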

