1 Boyer-Moore Charles Yan 2007

2 Exact Matching
Boyer-Moore (worst case: linear time; typical: sublinear time)
Aho-Corasick (a set of patterns)

3 Boyer-Moore
Idea 1: Right-to-left comparison
   12345678901234567
T: xpbctbxabpqxctbpq
P:   tpabxab

4 Boyer-Moore
Idea 2: Bad character rule
   12345678901234567
T: spbctbsabpqsctbpq
P:   tpabsab
R(x): the position of the right-most occurrence of x in P; R(x)=0 if x does not occur in P. Here R(t)=1 and R(s)=5.
i: the position of the mismatch in P (i=3). k: the corresponding position in T (k=5), so T[k]=t.
The bad character rule says P should be shifted right by max{1, i-R(T[k])}. That is, if the right-most occurrence of character T[k] in P is at position j (j<i), then P[j] should be below T[k] after the shift. Here the shift is max{1, 3-1}=2:
P:     tpabsab
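The bad character rule can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names `rightmost_occurrence` and `bad_character_shift` are made up for this example, and positions are 1-based as on the slides.

```python
def rightmost_occurrence(pattern):
    """R(x): 1-based position of the right-most occurrence of x in pattern."""
    R = {}
    for pos, ch in enumerate(pattern, start=1):
        R[ch] = pos  # later (more rightward) positions overwrite earlier ones
    return R


def bad_character_shift(R, i, text_char):
    """Shift for a mismatch at pattern position i against text_char = T[k].

    Characters absent from the pattern have R(x) = 0.
    """
    return max(1, i - R.get(text_char, 0))


R = rightmost_occurrence("tpabsab")
print(R["t"], R["s"])                   # 1 5
print(bad_character_shift(R, 3, "t"))   # 2
```

With the pattern tpabsat instead, R(t)=7 and the same mismatch at i=3 gives a shift of only 1.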

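5 Boyer-Moore
The idea of the bad character rule is to shift P by more than one character when possible. But it has no effect if j>i: the shift is then only 1. Unfortunately, it is often the case that j>i.
   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
Here the mismatch is again at i=3 with T[k]=t, but R(t)=7>i, so the rule gives a shift of only 1.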

6 Boyer-Moore
Let x=T[k], the mismatched character in T.
Idea 3: The extended bad character rule says P should be shifted right so that the closest x to the left of position i in P is below T[k].
   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat

7 Boyer-Moore
To use the extended bad character rule we need, for each position i of P and each character x in the alphabet, the position of the closest occurrence of x to the left of i.
Approach 1: A two-dimensional array of size n×|Σ|. Space and time: expensive.

8 Boyer-Moore
Approach 2: Scan P from right to left and, for each character x, maintain a list of the positions where x occurs (in decreasing order).
P: tpabsat
t → 7, 1
a → 6, 3
…
When P[i] is mismatched with T[k] (let x=T[k]), scan x's list, find the first number j that is less than i, and shift P to the right so that P[j] is below T[k]. If no such j is found, shift P past T[k].
Space and time: linear.
   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
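The list-based approach might look as follows in Python. This is a sketch under the slides' 1-based position convention; the names `occurrence_lists` and `extended_bad_character_shift` are illustrative, not from the original deck.

```python
from collections import defaultdict


def occurrence_lists(pattern):
    """For each character, its 1-based positions in pattern, in decreasing order."""
    lists = defaultdict(list)
    for pos in range(len(pattern), 0, -1):  # scan P from right to left
        lists[pattern[pos - 1]].append(pos)
    return dict(lists)


def extended_bad_character_shift(lists, i, text_char):
    """Shift for a mismatch at pattern position i against text_char = T[k]."""
    for j in lists.get(text_char, []):
        if j < i:
            return i - j  # put P[j] below T[k]
    return i  # no occurrence to the left of i: move P past T[k]


lists = occurrence_lists("tpabsat")
print(lists["t"], lists["a"])                      # [7, 1] [6, 3]
print(extended_bad_character_shift(lists, 3, "t"))  # 2
```

For the mismatch at i=3 against T[k]=t, the list for t is scanned past 7 and stops at 1, giving a shift of 2 where the plain bad character rule gave only 1.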

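9 Boyer-Moore
Idea 4: Strong good suffix rule
t is a suffix of P that matches a substring t of T, and the characters to its left mismatch: x≠y, where x precedes t in T and y precedes t in P.
t' is the right-most copy of t in P such that t' is not a suffix of P and z≠y, where z is the character preceding t' in P.
[Figure: T contains x followed by t; P contains z followed by t' further to the left and ends with y followed by t.]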

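10 Boyer-Moore
The strong good suffix rule says: (1) if t' exists, then shift P to the right so that t' in P is below t in T.
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
Here P[8]=d mismatches T[10]=b, so t=ab, and t' is the copy of ab at positions 3-4 of P (preceded by c≠d). After the shift:
P:         qcabdabdab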

11 Boyer-Moore
The extended bad character rule focuses on characters. The strong good suffix rule focuses on substrings.
How do we get the information needed for the strong good suffix rule? That is, for a given t, how do we find t'?

12 Boyer-Moore
L'(i): For each i, L'(i) is the largest position less than n such that the substring P[i,…,n] matches a suffix of P[1,…,L'(i)], with the additional requirement that the character preceding that suffix is not equal to P[i-1]. If there is no such position, L'(i)=0.
Let t=P[i,…,n]; then L'(i) is the right end-position of t'.
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
     1234567890
L'(9)=4, L'(10)=0, L'(8)=?, L'(7)=?, L'(6)=?

13 Boyer-Moore
Let t=P[i,…,n]; then L'(i) is the right end-position of t'. Thus, to use the strong good suffix rule, we need to find L'(i) for every i=1,…,n.
For pattern P, N_j is the length of the longest substring that ends at position j and is also a suffix of P.
[Figure: in P, t' ends at position j and is preceded by x; the suffix t is preceded by y; t=t', N_j=|t'|=|t|, x≠y.]

14 Boyer-Moore
N_j: the length of the longest substring of P that ends at j and is also a suffix of P.
Z_i: the length of the longest substring of P that starts at i and matches a prefix of P.

15 Boyer-Moore
N is the reverse of Z!
P: the pattern. P^r: the string obtained by reversing P. Then N_j(P) = Z_{n-j+1}(P^r).
position: 1 2 3 4 5 6 7 8 9 0
P:        q c a b d a b d a b
N_j:      0 0 0 2 0 0 5 0 0 0
P^r:      b a d b a d b a c q
Z_i:      0 0 0 5 0 0 2 0 0 0
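The relation N_j(P) = Z_{n-j+1}(P^r) can be checked with a short Python sketch. The Z algorithm here is a common textbook formulation, not necessarily the one the author had in mind; `z_array` and `n_array` are illustrative names. As in the table, the entries for Z_1 and N_n are left at 0 (an assumption made to match the slide).

```python
def z_array(s):
    """z[i] (0-based) = length of the longest substring of s starting at i
    that matches a prefix of s; z[0] is left at 0 by convention."""
    n = len(s)
    z = [0] * n
    left = right = 0  # current Z-box is s[left:right]
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])  # reuse earlier comparisons
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def n_array(pattern):
    """N[1..n] with N[0] unused; N[n] is left at 0 as in the slides' table."""
    n = len(pattern)
    z = z_array(pattern[::-1])
    N = [0] * (n + 1)
    for j in range(1, n):
        N[j] = z[n - j]  # Z_{n-j+1} in 1-based terms is z[n-j] in 0-based terms
    return N


print(n_array("qcabdabdab")[1:])  # [0, 0, 0, 2, 0, 0, 5, 0, 0, 0]
```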

16 Boyer-Moore
For pattern P, N_j (for j=1,…,n) can be calculated in O(n) time using the Z algorithm.
Why do we need to define N_j? To use the strong good suffix rule, we need to find L'(i) for every i=1,…,n, and we can get L'(i) from N_j!

17 Boyer-Moore
For position i, let t=P[i,…,n]. L'(i) is the largest position j less than n such that N_j=|t|.
position: 1 2 3 4 5 6 7 8 9 0
P:        q c a b d a b d a b
N_j:      0 0 0 2 0 0 5 0 0 0
L'(i):    0 0 0 0 0 7 0 0 4 0
P^r:      b a d b a d b a c q
Z_i:      0 0 0 5 0 0 2 0 0 0

18 Boyer-Moore
How to obtain L'(i) from N_j in linear time?
Input: Pattern P
Output: L'(i) for i=1,…,n
Algorithm:
  Calculate N_j for j=1,…,n using the Z algorithm
  for i=1; i<=n; i++
    L'(i)=0
  for j=1; j<n; j++
    if N_j>0
      i=n-N_j+1
      L'(i)=j
Because j increases, a later (larger) j overwrites an earlier one, so each L'(i) ends up holding the largest qualifying position.
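A direct Python transcription of this loop, as a sketch: the function name `big_l_prime` is illustrative, the N values are passed in as a list with index 0 unused, and positions with N_j=0 are skipped (they would otherwise map to i=n+1, outside the pattern).

```python
def big_l_prime(N):
    """Compute L'(1..n) from N(1..n); both lists are 1-based with index 0 unused."""
    n = len(N) - 1
    L = [0] * (n + 1)
    for j in range(1, n):          # j < n, in increasing order
        if N[j] > 0:
            i = n - N[j] + 1       # the suffix P[i..n] has length N[j]
            L[i] = j               # a larger j overwrites a smaller one
    return L


# N values for P = qcabdabdab, taken from the slides' table
N = [0, 0, 0, 0, 2, 0, 0, 5, 0, 0, 0]
print(big_l_prime(N)[1:])  # [0, 0, 0, 0, 0, 7, 0, 0, 4, 0]
```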

19 Boyer-Moore
The strong good suffix rule says: (1) if t' exists, then shift P to the right so that t' in P is below t in T.
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
i=9; L'(9)=4, so P is shifted by n-L'(9)=10-4=6 positions:
P:         qcabdabdab

20 Boyer-Moore
The strong good suffix rule:
(1) If a mismatch occurs at position i-1 of P and L'(i)>0 (i.e., t' exists), then we can shift P by n-L'(i) positions to the right.
(2) What if a mismatch occurs at position i-1 of P and L'(i)=0 (i.e., t' does not exist)? We can shift P at least like this:
[Figure: T with x followed by t; P with y followed by t, shown before and after the shift.]

21 Boyer-Moore
But we can do more than that!

22 Boyer-Moore
Observation 1: If a prefix of P is also a suffix of P, then…
[Figure: the shifted P is placed so that its prefix lies below the matching suffix of t in T.]

23 Boyer-Moore
Observation 2: If there is more than one candidate prefix, then shift P by the least amount.
[Figure: two candidate shifted copies of P, labeled P1 and P2.]

24 Boyer-Moore
The strong good suffix rule: When a mismatch occurs at position i-1 of P,
(1) If L'(i)>0 (i.e., t' exists), then we can shift P by n-L'(i) positions to the right.
(2) Else if L'(i)=0 (i.e., t' does not exist), we can shift P past the left end of t by the least amount such that a prefix of the shifted pattern matches a suffix of t.

25 Boyer-Moore
l'(i): the length of the largest suffix of P[i,…,n] that is also a prefix of P. If none exists, then l'(i)=0.
l'(i) is the length of the overlap between the unshifted and shifted patterns.

26 Boyer-Moore
l'(i) equals the largest j≤|P[i,…,n]| such that N_j=j.
1. If N_j=j, then P[1,…,j] is a prefix of P that is also a suffix of P.
2. We want the largest such j.

27 Boyer-Moore
l'(i) equals the largest j≤|P[i,…,n]| such that N_j=j.
position: 1 2 3 4 5 6 7 8 9 0
P:        a b d a b a b d a b
N_j:      0 2 0 0 5 0 2 0 0 0
l'(i):    5 5 5 5 5 5 2 2 2 0

28 Boyer-Moore
How to calculate l'(i) from N_j in linear time?
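One possible answer, sketched in Python: walk j upward, remember the largest j seen so far with N_j=j, and record it for the suffix of length j. The name `small_l_prime` is illustrative, and restricting candidates to j<n (proper prefixes only) is an assumption made so that the result matches the slides' table.

```python
def small_l_prime(N):
    """Compute l'(1..n) from N(1..n); both lists are 1-based with index 0 unused."""
    n = len(N) - 1
    l = [0] * (n + 1)
    best = 0                        # largest j so far with N[j] == j
    for j in range(1, n + 1):       # j = length of the suffix P[n-j+1..n]
        if j < n and N[j] == j:     # P[1..j] is also a suffix of P
            best = j
        l[n - j + 1] = best
    return l


# N values for P = abdababdab, taken from the slides' table
N = [0, 0, 2, 0, 0, 5, 0, 2, 0, 0, 0]
print(small_l_prime(N)[1:])  # [5, 5, 5, 5, 5, 5, 2, 2, 2, 0]
```

Each j is visited once, so the computation is linear in n.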

29 Boyer-Moore
The strong good suffix rule: When a mismatch occurs at position i-1 of P,
(1) If L'(i)>0 (i.e., t' exists), then we can shift P by n-L'(i) positions to the right.
(2) Else if L'(i)=0 (i.e., t' does not exist), we can shift P past the left end of t by the least amount such that a prefix of the shifted pattern matches a suffix of t, that is, by n-l'(i) positions to the right.

30 Boyer-Moore
What if a match is found? We could shift P by one position, but we can do better: shift P by the least amount such that a prefix of the shifted pattern matches a suffix of t (here t is the whole occurrence of P in T), that is, shift P to the right by n-l'(2).

31 Boyer-Moore
The strong good suffix rule: When a mismatch occurs at position i-1 of P,
(1) If L'(i)>0 (i.e., t' exists), then we can shift P by n-L'(i) positions to the right.
(2) Else if L'(i)=0 (i.e., t' does not exist), we can shift P past the left end of t by the least amount such that a prefix of the shifted pattern matches a suffix of t, that is, by n-l'(i) positions to the right.
(3) If a match is found, then shift P to the right by n-l'(2).
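The three cases translate into a small Python helper. This is a sketch with illustrative names (`good_suffix_shift`, `match_shift`); the L' and l' tables are 1-based lists with index 0 unused. The slides do not cover a mismatch at the last position of P (no matched suffix t); shifting by 1 in that case is an assumption of this sketch.

```python
def good_suffix_shift(L, l, n, mismatch_pos):
    """Shift for a mismatch at 1-based pattern position mismatch_pos (= i - 1)."""
    i = mismatch_pos + 1
    if i > n:
        return 1             # assumption: nothing matched yet, shift by one
    if L[i] > 0:
        return n - L[i]      # case (1): t' exists
    return n - l[i]          # case (2): fall back to the longest prefix overlap


def match_shift(l, n):
    """Case (3): shift after a full occurrence of P has been found."""
    return n - l[2] if n > 1 else 1


# Tables for P = qcabdabdab (n = 10); l'(i) is 0 everywhere for this pattern
L = [0, 0, 0, 0, 0, 0, 7, 0, 0, 4, 0]
l = [0] * 11
print(good_suffix_shift(L, l, 10, 8))  # 6
```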

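32 Boyer-Moore
The extended bad character rule vs. the strong good suffix rule
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
Mismatch at P[8] against T[10]=b: the closest b to the left of position 8 in P is at position 7, so the extended bad character rule shifts by 1; the strong good suffix rule shifts by n-L'(9)=6.
   123456789012345678
T: prstabstuqabvqxrst
P:   qcabdabdab
Mismatch at P[8] against T[10]=q: the closest q to the left of position 8 in P is at position 1, so the extended bad character rule shifts by 7; the strong good suffix rule still shifts by 6.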

33 Boyer-Moore
Shift P by the largest amount given by either rule. That results in the Boyer-Moore algorithm!
Input: Text T (length m) and pattern P (length n)
Output: The occurrences of P in T
Algorithm Boyer-Moore:
  Compute L'(i), l'(i), and R(x)
  k=n
  while k≤m do
    i=n; h=k
    while i>0 and P[i]=T[h] do
      i--; h--
    if i=0
      report an occurrence of P in T ending at position k
      k=k+n-l'(2)
    else
      shift P (increase k) by the maximum amount determined by the extended bad character rule and the strong good suffix rule
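Putting the pieces together, the algorithm could be implemented along these lines in Python. This is a self-contained sketch rather than the author's implementation: all function names are illustrative, positions are 1-based as on the slides, occurrences are reported by their end position k, and a mismatch at the last pattern position is given a good-suffix shift of 1 (an assumption, since the slides do not state that case).

```python
from collections import defaultdict


def z_array(s):
    """z[i] = length of the longest substring starting at i matching a prefix of s."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def preprocess(P):
    """Return L', l' (1-based, padded to index n+1) and the occurrence lists."""
    n = len(P)
    z = z_array(P[::-1])
    N = [0] * (n + 1)
    for j in range(1, n):
        N[j] = z[n - j]                  # N_j(P) = Z_{n-j+1}(P^r)
    L = [0] * (n + 2)
    l = [0] * (n + 2)
    for j in range(1, n):
        if N[j] > 0:
            L[n - N[j] + 1] = j
    best = 0
    for j in range(1, n + 1):
        if j < n and N[j] == j:
            best = j
        l[n - j + 1] = best
    occ = defaultdict(list)
    for pos in range(n, 0, -1):          # positions in decreasing order
        occ[P[pos - 1]].append(pos)
    return L, l, occ


def boyer_moore(T, P):
    """Return the 1-based end positions of all occurrences of P in T."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    L, l, occ = preprocess(P)
    hits = []
    k = n                                # T position aligned with P[n]
    while k <= m:
        i, h = n, k
        while i > 0 and P[i - 1] == T[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k)
            k += n - l[2]                # l[2] is 0 when n == 1
        else:
            bad = i                      # default: move P past T[h]
            for j in occ.get(T[h - 1], ()):
                if j < i:
                    bad = i - j
                    break
            if i == n:
                good = 1                 # assumption: no matched suffix
            elif L[i + 1] > 0:
                good = n - L[i + 1]
            else:
                good = n - l[i + 1]
            k += max(bad, good)
    return hits


print(boyer_moore("xxqcabdabdabyy", "qcabdabdab"))  # [12]
```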
