Boyer-Moore Algorithm
3 main ideas
–right to left scan
–bad character rule
–good suffix rule

Right to left scan
   0                 1
   1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
T: x z b c t z x t b p t x c t b p q
P: t b c b b

Bad character rule
Definition
–For each character x in the alphabet, let R(x) denote the position of the right-most occurrence of character x in P.
–R(x) is defined to be 0 if x is not in P.
Usage
–Suppose characters in P[i+1..n] match T[k+1..k+n-i] and P[i] mismatches T[k].
–Shift P to the right by max(1, i - R(T[k])) places (hopefully more than 1).
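The definition and usage above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names `bad_char_table` and `bad_char_shift` are made up for this example, and positions are 1-based to match the slides.

```python
def bad_char_table(P):
    """R[x] = 1-based position of the right-most occurrence of x in P."""
    R = {}
    for pos, ch in enumerate(P, start=1):
        R[ch] = pos  # later (further right) occurrences overwrite earlier ones
    return R


def bad_char_shift(R, i, text_char):
    """Shift used when P[i] (1-based) mismatches text_char.

    Characters that do not occur in P are treated as R(x) = 0.
    """
    return max(1, i - R.get(text_char, 0))


R = bad_char_table("tbcbb")
print(bad_char_shift(R, 5, "t"))  # 4
print(bad_char_shift(R, 4, "t"))  # 3
print(bad_char_shift(R, 5, "x"))  # 5, since x does not occur in P
```

Storing R in a dictionary keeps the table proportional to the number of distinct characters in P; a fixed array indexed by character code would be the usual choice for a small known alphabet.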

Illustration of bad character rule
   0                 1
   1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
T: x z b c t z x t b p t x c t b p q
P: t b c b b                            i = 5, R(t) = 1, so max(1, 5-1) = 4
P:         t b c b b                    i = 4, R(t) = 1, so max(1, 4-1) = 3
P:               t b c b b              i = 5, R(x) = 0, so max(1, 5-0) = 5

Extended bad character rule
Definition
–For each character x in the alphabet, let R(x,i) denote the position of the right-most occurrence of character x in P[1..i-1].
–R(x,i) is defined to be 0 if x is not in P[1..i-1].
Usage
–Suppose characters in P[i+1..n] match T[k+1..k+n-i] and P[i] mismatches T[k].
–Shift P to the right by max(1, i - R(T[k],i)) places (hopefully more than 1).
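One way to get R(x,i) without a full n-by-alphabet table is to keep, for each character, the list of its positions in P in decreasing order and scan that list for the first position below i. The sketch below assumes this list-based layout; the names `occurrence_lists`, `extended_R` and `extended_bad_char_shift` are illustrative only.

```python
from collections import defaultdict


def occurrence_lists(P):
    """For each character of P, its 1-based positions in decreasing order."""
    occ = defaultdict(list)
    for pos in range(len(P), 0, -1):
        occ[P[pos - 1]].append(pos)
    return occ


def extended_R(occ, x, i):
    """R(x, i): right-most position of x in P[1..i-1], or 0 if there is none."""
    for pos in occ.get(x, ()):
        if pos < i:
            return pos
    return 0


def extended_bad_char_shift(occ, i, text_char):
    """Shift used when P[i] (1-based) mismatches text_char."""
    return max(1, i - extended_R(occ, text_char, i))


occ = occurrence_lists("btctb")
print(extended_R(occ, "b", 4))               # 1
print(extended_bad_char_shift(occ, 4, "b"))  # 3
```

The lists together hold exactly n entries, so the space is O(n); the price is a short scan at each mismatch instead of a single table lookup.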

Illustration of extended bad character rule
   0                 1
   1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
T: x z b c b b x t b p t x c t b p q
P:   b t c t b
Bad character rule: i = 4, R(b) = 5, so max(1, 4-5) = 1
Extended bad character rule: i = 4, R(b,4) = 1, so max(1, 4-1) = 3
P:         b t c t b

Implementation Issues
Bad character rule
–Space required: O(|Σ|) for the number of characters in the alphabet
–Calculate the R[] array in O(n) time (exercise)
Extended bad character rule
–Space required: full table is O(n|Σ|)
–Smaller implementation: O(n)
–Preprocess time: O(n)
–Search time impact: increases search time by at worst twice the number of mismatches
See book for details (pg 18)

Observations
Bad character rules
–work well in practice with large alphabets like the English alphabet
–work less well with small alphabets like DNA
–do not guarantee linear worst-case run-time (exercise: give an example of such a case)

Strong good suffix rule part 1
Situation
–P[i..n] matches text T[j..j+n-i] but T[j-1] does not match P[i-1]
Find the rightmost non-suffix substring t’ of P that matches the suffix P[i..n] and whose preceding character in P is different from P[i-1]
Shift P so that t’ matches up with T[j..j+n-i]

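Illustration of suffix rule part 1
   0                 1
   1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
T: p r s t a b s t u b a b v q x r s t
P:     q c a b d a b d a b
P[9..10] = ab matches T[11..12] and P[8] = d mismatches T[10] = b. The rightmost earlier copy of ab that is not preceded by d is P[3..4], so P shifts 6 places to the right.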

Preprocessing for suffix rule part 1
Definitions
–For each i, L’(i) is the largest position less than n such that P[i..n] matches a suffix of P[1..L’(i)] and the character preceding that suffix is not equal to P[i-1]. If no such position exists, L’(i) = 0.
–For string P, N_j(P) is the length of the longest suffix of the substring P[1..j] that is also a suffix of P
Observations
–N_j(P) = Z_{n-j+1}(P^r), where P^r is the reverse of P
–L’(i) is the largest j < n such that N_j(P) = |P[i..n]|, which equals n-i+1
–If L’(i) > 0, shift P by n-L’(i) places to the right
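The identity between N values and Z values of the reversed pattern can be sketched as follows. The Z computation is the standard linear-time Z algorithm; the names `z_array` and `n_array` are illustrative, and the returned N list is padded at index 0 so that N[j] lines up with the 1-based N_j(P) of the slides.

```python
def z_array(S):
    """Z[k] = length of the longest substring starting at k that matches a prefix of S (0-based k)."""
    n = len(S)
    Z = [0] * n
    if n == 0:
        return Z
    Z[0] = n
    l = r = 0  # [l, r) is the right-most prefix match found so far
    for k in range(1, n):
        if k < r:
            Z[k] = min(r - k, Z[k - l])
        while k + Z[k] < n and S[Z[k]] == S[k + Z[k]]:
            Z[k] += 1
        if k + Z[k] > r:
            l, r = k, k + Z[k]
    return Z


def n_array(P):
    """N[j] for j = 1..n (index 0 unused), via Z values of the reversed pattern."""
    n = len(P)
    Zr = z_array(P[::-1])
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        N[j] = Zr[n - j]  # 1-based Z index n-j+1 is 0-based index n-j
    return N


print(n_array("qcabdabdab")[1:])  # [0, 0, 0, 2, 0, 0, 5, 0, 0, 10]
```

For the pattern of the part 1 illustration, N_4 = 2 and N_7 = 5 record the two earlier copies of the suffix ab (the second one extended to bdab).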

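Z-based computation of L’(i)
for (i = 1; i <= n; i++)
    L’(i) = 0;
for (j = 1; j <= n-1; j++) {
    k = n - N_j(P) + 1;   // P[k..n] is the suffix of length N_j(P)
    L’(k) = j;            // k = n+1 when N_j(P) = 0, so allow one extra slot
}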
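A direct Python transcription of the loop above, as a sketch: it assumes the N values are already available as a list with index 0 unused (the hypothetical layout N[1..n]), and it allocates one extra slot so that the k = n+1 case is harmless. The example N values are those of the pattern qcabdabdab, worked out by hand.

```python
def big_l_prime(N):
    """L'(i) for i = 1..n from N[1..n] (index 0 of N is unused)."""
    n = len(N) - 1
    Lp = [0] * (n + 2)  # slot n+1 absorbs the N_j = 0 case
    for j in range(1, n):
        k = n - N[j] + 1
        Lp[k] = j  # j increases, so the largest qualifying j is kept
    return Lp


# N values for P = "qcabdabdab" (n = 10), index 0 unused
N = [0, 0, 0, 0, 2, 0, 0, 5, 0, 0, 10]
Lp = big_l_prime(N)
print(Lp[9], Lp[6])  # 4 7
print(10 - Lp[9])    # 6, the shift when P[9..10] matched and P[8] mismatched
```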

Strong good suffix rule part 2
If L’(i) = 0 then …
–Let t’’ = the longest suffix of P[i..n] that is also a prefix of P, if one exists.
–If t’’ exists, shift P so that the prefix of P matches up with t’’ at the end of T[j..j+n-i].
–Otherwise, shift P past T[j+n-i].

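Illustration of suffix rule part 2
   0                 1
   1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
T: p r s t a b s t a b a b v q x r s t
P:     a b a b s t a b a b
P[3..10] matches T[5..12] and P[2] = b mismatches T[4] = t. No earlier copy of P[3..10] exists, so L’(3) = 0. The longest suffix of P[3..10] that is also a prefix of P is abab, so P shifts 6 places to the right.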

Preprocessing for suffix rule part 2
Definitions
–For each i, let l’(i) denote the length of the longest suffix of P[i..n] that is also a prefix of P, if one exists. Otherwise, let l’(i) = 0.
Observations
–l’(i) = the largest j <= |P[i..n]| such that N_j(P) = j
Question
–How does l’(i) relate to l’(i+1)? They are the same unless N_{n-i+1}(P) = n-i+1

Z-based computation of l’(i)
l’[n+1] = 0;
for (i = n; i >= 2; i--) {
    if (N[n-i+1] == n-i+1)
        l’[i] = n-i+1;    // the whole of P[i..n] is also a prefix of P
    else
        l’[i] = l’[i+1];  // otherwise inherit from the shorter suffix
}
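The same recurrence in Python, as a sketch under the assumption that N is supplied as a list with index 0 unused. The function name `small_l_prime` is illustrative, and the example N values are those of the pattern ababstabab, worked out by hand.

```python
def small_l_prime(N):
    """l'(i) for i = 2..n from N[1..n] (index 0 of N is unused)."""
    n = len(N) - 1
    lp = [0] * (n + 2)  # lp[n+1] = 0 is the base case
    for i in range(n, 1, -1):
        if N[n - i + 1] == n - i + 1:
            lp[i] = n - i + 1
        else:
            lp[i] = lp[i + 1]
    return lp


# N values for P = "ababstabab" (n = 10), index 0 unused
N = [0, 0, 2, 0, 4, 0, 0, 0, 2, 0, 10]
lp = small_l_prime(N)
print(lp[3])       # 4, the prefix abab
print(10 - lp[3])  # 6, the shift when L'(3) = 0
```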

Addendum to suffix rule
Shift by 1 if there is an immediate mismatch, that is, if P[n] mismatches the corresponding character in T

Boyer-Moore Overview
Precompute L’(i), l’(i) for each position in P
Precompute R(x) or R(x,i) for each character x in Σ
Align P to T
Compare right to left
On mismatch, shift by the max possible from (extended) bad character rule and good suffix rule and return to compare
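Putting the steps of the overview together, a compact search might look like the sketch below. It is one possible arrangement, not the definitive algorithm: it uses the simple bad character rule R(x) rather than R(x,i), reports 1-based start positions, and after a full match shifts by n - l’(2), a common choice that the slides do not spell out. The names `boyer_moore`, `Lp` and `lp` are illustrative.

```python
def z_array(S):
    """Z[k] = length of the longest substring starting at k that matches a prefix of S."""
    n = len(S)
    Z = [0] * n
    if n == 0:
        return Z
    Z[0] = n
    l = r = 0
    for k in range(1, n):
        if k < r:
            Z[k] = min(r - k, Z[k - l])
        while k + Z[k] < n and S[Z[k]] == S[k + Z[k]]:
            Z[k] += 1
        if k + Z[k] > r:
            l, r = k, k + Z[k]
    return Z


def boyer_moore(P, T):
    """Return the 1-based start positions of all occurrences of P in T."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    # Preprocessing: N via Z on the reversed pattern, then L', l' and R.
    Zr = z_array(P[::-1])
    N = [0] + [Zr[n - j] for j in range(1, n + 1)]
    Lp = [0] * (n + 2)
    for j in range(1, n):
        Lp[n - N[j] + 1] = j
    lp = [0] * (n + 2)
    for i in range(n, 1, -1):
        lp[i] = n - i + 1 if N[n - i + 1] == n - i + 1 else lp[i + 1]
    R = {ch: pos for pos, ch in enumerate(P, start=1)}

    matches = []
    k = n  # 1-based position in T aligned with P[n]
    while k <= m:
        i, h = n, k
        while i > 0 and P[i - 1] == T[h - 1]:  # right to left scan
            i -= 1
            h -= 1
        if i == 0:
            matches.append(k - n + 1)
            k += n - lp[2]
        else:
            bad = max(1, i - R.get(T[h - 1], 0))
            if i == n:            # immediate mismatch: good suffix rule shifts by 1
                good = 1
            elif Lp[i + 1] > 0:   # part 1 of the strong good suffix rule
                good = n - Lp[i + 1]
            else:                 # part 2
                good = n - lp[i + 1]
            k += max(bad, good)
    return matches


print(boyer_moore("abc", "xxabcxabc"))  # [3, 7]
print(boyer_moore("aa", "aaaa"))        # [1, 2, 3]
```

Note the index offset: the slides state the good suffix rule for a matched suffix P[i..n] with the mismatch at i-1, while the loop above stops with i at the mismatching position, so it looks up L’(i+1) and l’(i+1).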

Observations I
Original Boyer-Moore algorithm uses the “weak good suffix rule”, which does not use the mismatch information
–This is not sufficient to prove that the search part of Boyer-Moore runs in linear time in the worst case
Using only the strong good suffix rule, one can prove a worst-case time of O(m), where m = |T|, provided P is not in T
If P is in T, original Boyer-Moore runs in Θ(nm) time in the worst case, but this can be corrected with simple modifications
Using only the bad character shift rule leads to O(nm) time in the worst case, but works in sublinear expected time on random strings
