Boyer-Moore Algorithm 3 main ideas –right to left scan –bad character rule –good suffix rule.


1 Boyer-Moore Algorithm
3 main ideas
–right to left scan
–bad character rule
–good suffix rule

2 Right to left scan
Positions: 1 to 17
T: x z b c t z x t b p t x c t b p q
P: t b c b b

3 Bad character rule
Definition
–For each character x in the alphabet, let R(x) denote the position of the right-most occurrence of x in P.
–R(x) is defined to be 0 if x does not occur in P.
Usage
–Suppose P[i+1..n] matches T[k+1..k+n-i] and P[i] mismatches T[k].
–Shift P to the right by max(1, i - R(T[k])) places (hopefully more than 1).
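
The definition and usage above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `bad_char_table` and `bad_char_shift` are made up for this sketch, and positions are 1-based to match the slides.

```python
def bad_char_table(P):
    """Return R as a dict: R[x] is the 1-based position of the
    right-most occurrence of x in P. Characters absent from P are
    simply missing from the dict and are treated as R(x) = 0."""
    R = {}
    for pos, ch in enumerate(P, start=1):
        R[ch] = pos  # a later occurrence overwrites an earlier one
    return R


def bad_char_shift(R, i, text_char):
    """Shift used when P[i] (1-based) mismatches text_char = T[k]."""
    return max(1, i - R.get(text_char, 0))


R = bad_char_table("tbcbb")
print(bad_char_shift(R, 5, "z"))  # 5
print(bad_char_shift(R, 5, "t"))  # 4
print(bad_char_shift(R, 4, "t"))  # 3
```

Using a dict instead of an array indexed by the whole alphabet keeps the table at one entry per distinct character of P.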

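4 Illustration of bad character rule
T = x z b c t z x t b p t x c t b p q (positions 1 to 17), P = t b c b b
–P aligned under T[2..6]: P[5] = b mismatches T[6] = z; i = 5, R(z) = 0, so shift max(1, 5-0) = 5
–P aligned under T[7..11]: P[5] = b mismatches T[11] = t; i = 5, R(t) = 1, so shift max(1, 5-1) = 4
–P aligned under T[11..15]: P[5] matches T[15] = b, P[4] = b mismatches T[14] = t; i = 4, R(t) = 1, so shift max(1, 4-1) = 3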

5 Extended bad character rule
Definition
–For each character x in the alphabet, let R(x,i) denote the position of the right-most occurrence of x in P[1..i-1].
–R(x,i) is defined to be 0 if x does not occur in P[1..i-1].
Usage
–Suppose P[i+1..n] matches T[k+1..k+n-i] and P[i] mismatches T[k].
–Shift P to the right by max(1, i - R(T[k],i)) places (hopefully more than 1).
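
One way to support R(x,i) lookups in O(n) space is to store, for each character, the sorted list of positions where it occurs in P. The sketch below makes that assumption and uses binary search for the lookup, which is a choice made here rather than something the slides prescribe. The names `occurrence_lists`, `extended_R` and `extended_bad_char_shift` are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def occurrence_lists(P):
    """Map each character of P to the ascending list of its 1-based positions."""
    occ = defaultdict(list)
    for pos, ch in enumerate(P, start=1):
        occ[ch].append(pos)
    return occ


def extended_R(occ, x, i):
    """R(x, i): right-most position of x in P[1..i-1], or 0 if there is none."""
    positions = occ.get(x, [])
    idx = bisect_left(positions, i)  # count of positions strictly less than i
    return positions[idx - 1] if idx > 0 else 0


def extended_bad_char_shift(occ, i, text_char):
    """Shift used when P[i] (1-based) mismatches text_char = T[k]."""
    return max(1, i - extended_R(occ, text_char, i))


occ = occurrence_lists("btctb")
print(extended_R(occ, "b", 4))               # 1
print(extended_bad_char_shift(occ, 4, "b"))  # 3
```

The total size of all lists is n, since every position of P appears in exactly one list.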

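6 Illustration of extended bad character rule
T = x z b c b b x t b p t x c t b p q (positions 1 to 17), P = b t c t b
–P aligned under T[2..6]: P[5] = b matches T[6], P[4] = t mismatches T[5] = b, so i = 4
–Bad character rule: R(b) = 5, so shift max(1, 4-5) = 1
–Extended bad character rule: R(b,4) = 1, so shift max(1, 4-1) = 3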

7 Implementation Issues
Bad character rule
–Space required: O(|Σ|), one entry per character in the alphabet
–Calculate the R[] table in O(n) time (exercise)
Extended bad character rule
–Space required: full table is O(n|Σ|)
–Smaller implementation: O(n)
–Preprocess time: O(n)
–Search time impact: increases search time by at worst twice the number of mismatches
See book for details (pg 18)

8 Observations
Bad character rules
–work well in practice with large alphabets like the English alphabet
–work less well with small alphabets like DNA
–do not guarantee linear worst-case run-time (exercise: give an example of such a case)
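
One possible answer to the exercise, as a sketch: take T = a...a (m copies) and P = b followed by n-1 copies of a. Every alignment matches n-1 characters before failing at P[1], and because R(a) = n the rule only allows a shift of 1. The search below uses the simple bad character rule alone and counts comparisons. The function name and the choice to shift by 1 after a full match are assumptions of this sketch.

```python
def bad_char_only_search(P, T):
    """Right-to-left scan using only the simple bad character rule.
    Returns (1-based match positions, number of character comparisons)."""
    n, m = len(P), len(T)
    R = {ch: pos for pos, ch in enumerate(P, start=1)}  # right-most positions
    matches, comparisons = [], 0
    k = n  # 1-based position in T aligned with P[n]
    while k <= m:
        i, h = n, k
        while i >= 1:
            comparisons += 1
            if P[i - 1] != T[h - 1]:
                break
            i -= 1
            h -= 1
        if i == 0:
            matches.append(k - n + 1)
            k += 1  # assumption: shift by 1 after an occurrence
        else:
            k += max(1, i - R.get(T[h - 1], 0))
    return matches, comparisons


n, m = 4, 12
matches, comparisons = bad_char_only_search("b" + "a" * (n - 1), "a" * m)
print(matches, comparisons)  # [] 36
```

Here 36 = (m - n + 1) * n, so the number of comparisons grows as the product of the two lengths.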

9 Strong good suffix rule part 1
Situation
–P[i..n] matches text T[j..j+n-i] but T[j-1] does not match P[i-1]
Find the rightmost non-suffix substring t’ of P that matches the suffix P[i..n] AND whose left neighbor in P differs from P[i-1]
Shift P so that t’ lines up with T[j..j+n-i]

10 Illustration of suffix rule part 1
Positions: 1 to 18
T: p r s t a b s t u b a b v q x r s t
P: q c a b d a b d a b

11 Preprocessing for suffix rule part 1
Definitions
–For each i, L’(i) is the largest position less than n such that P[i..n] matches a suffix of P[1..L’(i)] and the character preceding that suffix is not equal to P[i-1].
–For string P, N_j(P) is the length of the longest suffix of the substring P[1..j] that is also a suffix of P.
Observations
–N_j(P) = Z_{n-j+1}(P^r), where P^r is the reverse of P
–L’(i) is the largest j < n such that N_j(P) = |P[i..n]|, which equals n-i+1
–If L’(i) > 0, shift P by n-L’(i) places to the right
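
The identity N_j(P) = Z_{n-j+1}(P^r) can be sketched as follows. The Z computation is the standard linear-time one. The function names are made up for this sketch, and `Z[0]` is set to the full length (an assumption made here so that N_n(P) = n falls out naturally).

```python
def z_array(S):
    """0-based Z values: Z[i] is the length of the longest substring
    starting at i that matches a prefix of S. Z[0] is set to len(S)."""
    n = len(S)
    Z = [0] * n
    if n == 0:
        return Z
    Z[0] = n
    left = right = 0  # current Z-box is S[left:right]
    for i in range(1, n):
        if i < right:
            Z[i] = min(right - i, Z[i - left])
        while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]:
            Z[i] += 1
        if i + Z[i] > right:
            left, right = i, i + Z[i]
    return Z


def n_array(P):
    """N[j] = N_j(P) for j = 1..n (N[0] is unused padding)."""
    n = len(P)
    Z = z_array(P[::-1])
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        N[j] = Z[n - j]  # 1-based index n-j+1 is 0-based index n-j
    return N


print(n_array("qcabdabdab"))  # [0, 0, 0, 0, 2, 0, 0, 5, 0, 0, 10]
```

For P = qcabdabdab this gives N_4(P) = 2 and N_7(P) = 5: the prefixes qcab and qcabdab end in ab and abdab, which are suffixes of P.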

12 Z-based computation of L’(i)
for (i = 1; i <= n; i++)
    L’[i] = 0;
for (j = 1; j <= n-1; j++) {
    k = n - N[j] + 1;   /* N[j] = N_j(P); k = n+1 when N[j] = 0, so L’ needs an entry at n+1 */
    L’[k] = j;
}
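
A runnable Python version of the loop above, as a sketch. To stay self-contained it computes N_j(P) with a simple quadratic helper instead of the Z algorithm; that substitution, the function names, and the extra slot at index n+1 are choices made for this illustration.

```python
def n_array_naive(P):
    """N[j] = N_j(P) for j = 1..n, computed by direct comparison (O(n^2))."""
    n = len(P)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        length = 0
        # Extend the common suffix of P[1..j] and P one character at a time.
        while length < j and P[j - 1 - length] == P[n - 1 - length]:
            length += 1
        N[j] = length
    return N


def big_l_prime(P):
    """Lp[i] = L'(i) for i = 1..n. Index n+1 only absorbs the writes
    made when N_j(P) = 0 and is never read as a shift."""
    n = len(P)
    N = n_array_naive(P)
    Lp = [0] * (n + 2)
    for j in range(1, n):  # j = 1..n-1, increasing, so the largest j wins
        k = n - N[j] + 1
        Lp[k] = j
    return Lp


P = "qcabdabdab"
Lp = big_l_prime(P)
print(Lp[9], len(P) - Lp[9])  # 4 6
```

For this P, L’(9) = 4, so when the suffix ab has matched and the next character mismatches, the rule shifts P by n - L’(9) = 6 places.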

13 Strong good suffix rule part 2
If L’(i) = 0 then …
–Let t’’ = the longest suffix of P[i..n] that is also a prefix of P, if one exists.
–If t’’ exists, shift P so that the prefix of P lines up with t’’ at the end of T[j..j+n-i].
–Otherwise, shift P past T[j+n-i].

14 Illustration of suffix rule part 2
Positions: 1 to 18
T: p r s t a b s t a b a b v q x r s t
P: a b a b s t a b a b

15 Preprocessing for suffix rule part 2
Definitions
–For each i, let l’(i) denote the length of the longest suffix of P[i..n] that is also a prefix of P, if one exists. Otherwise, let l’(i) = 0.
Observations
–l’(i) = the largest j <= |P[i..n]| such that N_j(P) = j
Question
–How does l’(i) relate to l’(i+1)? The same unless N_{n-i+1}(P) = n-i+1, in which case l’(i) = n-i+1

16 Z-based computation of l’(i)
l’[n+1] = 0;
for (i = n; i >= 2; i--) {
    if (N[n-i+1] == n-i+1)
        l’[i] = n-i+1;
    else
        l’[i] = l’[i+1];
}
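
The same recurrence in runnable Python, as a sketch. As an assumption of this illustration, N_j(P) is again computed by a simple quadratic helper so the block stands alone; the function names are hypothetical.

```python
def n_array_naive(P):
    """N[j] = N_j(P) for j = 1..n, computed by direct comparison (O(n^2))."""
    n = len(P)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        length = 0
        while length < j and P[j - 1 - length] == P[n - 1 - length]:
            length += 1
        N[j] = length
    return N


def small_l_prime(P):
    """lp[i] = l'(i) for i = 2..n, with lp[n+1] = 0 as the base case."""
    n = len(P)
    N = n_array_naive(P)
    lp = [0] * (n + 2)
    for i in range(n, 1, -1):  # i = n down to 2
        j = n - i + 1          # length of the suffix P[i..n]
        if N[j] == j:          # P[1..j] is itself a suffix of P
            lp[i] = j
        else:
            lp[i] = lp[i + 1]
    return lp


lp = small_l_prime("ababstabab")
print(lp[2:11])  # [4, 4, 4, 4, 4, 4, 2, 2, 0]
```

For P = ababstabab the prefixes ab and abab are also suffixes, which is why the values step from 0 to 2 to 4 as i decreases.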

17 Addendum to suffix rule
Shift by 1 if there is an immediate mismatch, that is, if P[n] mismatches the corresponding character in T

18 Boyer-Moore Overview
Precompute L’(i), l’(i) for each position in P
Precompute R(x) or R(x,i) for each character x in Σ
Align P to T
Compare right to left
On mismatch, shift by the max possible from (extended) bad character rule and good suffix rule and return to compare
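
Putting the steps of the overview together, a compact sketch might look like this. Several choices here are assumptions of the sketch rather than statements from the slides: it uses the simple bad character rule R(x), computes N_j(P) quadratically for brevity, shifts by n - l’(2) after a full occurrence, and indexes by the mismatch position i (so the matched suffix is P[i+1..n] and the tables are read at i+1). The function name is hypothetical.

```python
def boyer_moore(P, T):
    """Return the 1-based start positions of all occurrences of P in T."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []

    # Bad character table: right-most 1-based position of each character.
    R = {ch: pos for pos, ch in enumerate(P, start=1)}

    # N[j] = N_j(P), computed naively for brevity.
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        length = 0
        while length < j and P[j - 1 - length] == P[n - 1 - length]:
            length += 1
        N[j] = length

    # Strong good suffix tables L'(i) and l'(i).
    Lp = [0] * (n + 2)
    for j in range(1, n):
        Lp[n - N[j] + 1] = j
    lp = [0] * (n + 2)
    for i in range(n, 1, -1):
        lp[i] = n - i + 1 if N[n - i + 1] == n - i + 1 else lp[i + 1]

    matches = []
    k = n  # 1-based position in T aligned with P[n]
    while k <= m:
        i, h = n, k
        while i >= 1 and P[i - 1] == T[h - 1]:  # compare right to left
            i -= 1
            h -= 1
        if i == 0:
            matches.append(k - n + 1)
            k += n - lp[2]
        else:
            bad = max(1, i - R.get(T[h - 1], 0))
            if i == n:
                good = 1                # immediate mismatch
            elif Lp[i + 1] > 0:
                good = n - Lp[i + 1]    # good suffix rule part 1
            else:
                good = n - lp[i + 1]    # good suffix rule part 2
            k += max(bad, good)
    return matches


print(boyer_moore("abab", "abababab"))  # [1, 3, 5]
```

Taking the maximum of the two shifts is safe because each rule on its own never skips an occurrence.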

19 Observations I
Original Boyer-Moore algorithm uses the “weak good suffix rule”, which does not use the mismatch information
–This is not sufficient to prove that the search part of Boyer-Moore runs in linear time in the worst case
Using only the strong good suffix rule, one can prove a worst-case search time of O(m) provided P is not in T (here n = |P| and m = |T|)
If P is in T, original Boyer-Moore runs in Θ(nm) time in the worst case, but this can be corrected with simple modifications
Using only the bad character shift rule leads to O(nm) time in the worst case, but works in sublinear time on random strings

