1
Boyer-Moore Algorithm
3 main ideas
–right to left scan
–bad character rule
–good suffix rule
2
Right to left scan
   0                 1
   1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
T: x z b c t z x t b p t x c t b p q
P:   t b c b b
3
Bad character rule
Definition
–For each character x in the alphabet, let R(x) denote the position of the right-most occurrence of character x in P.
–R(x) is defined to be 0 if x is not in P.
Usage
–Suppose the characters in P[i+1..n] match T[k+1..k+n-i] and P[i] mismatches T[k].
–Shift P to the right by max(1, i - R(T[k])) places (hopefully more than 1).
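The definition and usage above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the function names `bad_char_table` and `bad_char_shift` are made up for this example, and positions are 1-based to match the slides.

```python
def bad_char_table(P):
    """Map each character of P to the 1-based position of its rightmost occurrence."""
    R = {}
    for pos, ch in enumerate(P, start=1):
        R[ch] = pos  # a later occurrence overwrites an earlier one
    return R


def bad_char_shift(R, i, text_char):
    """Shift for a mismatch between P[i] (1-based) and text_char = T[k]."""
    # R(x) = 0 when x does not occur in P, hence the default of 0.
    return max(1, i - R.get(text_char, 0))


R = bad_char_table("tbcbb")
print(R)                          # {'t': 1, 'b': 5, 'c': 3}
print(bad_char_shift(R, 5, "z"))  # 5
print(bad_char_shift(R, 5, "t"))  # 4
print(bad_char_shift(R, 4, "t"))  # 3
```

A dictionary stands in for the R[] table here, so only characters that occur in P take up space.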
4
Illustration of bad character rule
   0                 1
   1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
T: x z b c t z x t b p t x c t b p q
P:   t b c b b                          i = 5, R(z) = 0, so max(1, 5-0) = 5
P:             t b c b b                i = 5, R(t) = 1, so max(1, 5-1) = 4
P:                     t b c b b        i = 4, R(t) = 1, so max(1, 4-1) = 3
5
Extended bad character rule
Definition
–For each character x in the alphabet, let R(x,i) denote the position of the right-most occurrence of character x in P[1..i-1].
–R(x,i) is defined to be 0 if x is not in P[1..i-1].
Usage
–Suppose the characters in P[i+1..n] match T[k+1..k+n-i] and P[i] mismatches T[k].
–Shift P to the right by max(1, i - R(T[k],i)) places (hopefully more than 1).
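One way to look up R(x,i) without storing a full table is to keep, for each character, the sorted list of positions where it occurs in P. The sketch below is an assumption-laden illustration: the names `occurrence_lists`, `R_ext` and `extended_shift` are invented here, and it uses a binary search over each list, which is a swapped-in choice rather than the list scan described in the book.

```python
from bisect import bisect_left
from collections import defaultdict


def occurrence_lists(P):
    """For each character, the ascending list of its 1-based positions in P."""
    occ = defaultdict(list)
    for pos, ch in enumerate(P, start=1):
        occ[ch].append(pos)
    return occ


def R_ext(occ, x, i):
    """Rightmost position of x in P[1..i-1], or 0 if x does not occur there."""
    positions = occ.get(x, [])
    idx = bisect_left(positions, i)  # how many occurrences lie strictly left of i
    return positions[idx - 1] if idx > 0 else 0


def extended_shift(occ, i, text_char):
    """Shift for a mismatch between P[i] (1-based) and text_char = T[k]."""
    return max(1, i - R_ext(occ, text_char, i))


occ = occurrence_lists("btctb")
print(R_ext(occ, "b", 4))           # 1
print(extended_shift(occ, 4, "b"))  # 3
```

The lists hold n positions in total, so the space is O(n) regardless of alphabet size.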
6
Illustration of extended bad character rule
   0                 1
   1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
T: x z b c b b x t b p t x c t b p q
P:   b t c t b
Bad character rule: i = 4, R(b) = 5, so max(1, 4-5) = 1
Extended bad character rule: i = 4, R(b,4) = 1, so max(1, 4-1) = 3
P:         b t c t b
7
Implementation Issues
Bad character rule
–Space required: O(|Σ|), one entry for each character in the alphabet
–Calculate the R[] table in O(n) time (exercise)
Extended bad character rule
–Space required: full table is O(n|Σ|)
–Smaller implementation: O(n)
–Preprocess time: O(n)
–Search time impact: increases search time by at worst twice the number of mismatches
See book for details (pg 18)
8
Observations
Bad character rules
–work well in practice with large alphabets like the English alphabet
–work less well with small alphabets like DNA
–do not guarantee linear worst-case run-time
Give an example of such a case
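One possible case, offered as a sketch rather than the intended answer: take T = aaa…a and P = baa…a. Every alignment matches n-1 characters before mismatching at P[1], and the bad character rule then shifts by only 1. The function below is a hypothetical helper (`search_bad_char_only` is not from the slides) that runs the right-to-left scan with the plain bad character rule alone and counts character comparisons.

```python
def search_bad_char_only(P, T):
    """Return (1-based match positions, number of character comparisons)."""
    n, m = len(P), len(T)
    R = {ch: pos for pos, ch in enumerate(P, start=1)}  # rightmost positions
    matches, comparisons = [], 0
    start = 0  # 0-based index in T aligned with P[1]
    while start + n <= m:
        i = n  # 1-based pattern position, scanned right to left
        while i >= 1:
            comparisons += 1
            if P[i - 1] != T[start + i - 1]:
                break
            i -= 1
        if i == 0:
            matches.append(start + 1)
            start += 1  # assumption: shift by 1 after a full match
        else:
            start += max(1, i - R.get(T[start + i - 1], 0))
    return matches, comparisons


# n = 5, m = 20: 16 alignments, 5 comparisons each.
print(search_bad_char_only("b" + "a" * 4, "a" * 20))  # ([], 80)
```

With n = 5 and m = 20 the count is (m-n+1)·n = 80, which grows as nm rather than as m.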
9
Strong good suffix rule part 1
Situation
–P[i..n] matches text T[j..j+n-i] but T[j-1] does not match P[i-1]
–Find t’, the rightmost non-suffix substring of P that matches the suffix P[i..n] AND whose preceding character in P is different from P[i-1]
–Shift P so that t’ matches up with T[j..j+n-i]
10
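Illustration of suffix rule part 1
   0                 1
   1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
T: p r s t a b s t u b a b v q x r s t
P:     q c a b d a b d a b
P[9..10] = ab matches T[11..12] and P[8] = d mismatches T[10] = b; t’ = P[3..4] = ab (preceded by c), so P shifts 6 places to the right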
11
Preprocessing for suffix rule part 1
Definitions
–For each i, L’(i) is the largest position less than n such that P[i..n] matches a suffix of P[1..L’(i)] and the character preceding that suffix is not equal to P[i-1]. If no such position exists, L’(i) = 0.
–For string P, N_j(P) is the length of the longest suffix of the substring P[1..j] that is also a suffix of P.
Observations
–N_j(P) = Z_{n-j+1}(P^r), where P^r is the reverse of P
–L’(i) is the largest j < n such that N_j(P) = |P[i..n]|, which equals n-i+1
–If L’(i) > 0, shift P by n-L’(i) places to the right
12
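Z-based computation of L’(i)
for (i=1; i<=n; i++)
    L’(i) = 0;
for (j=1; j<=n-1; j++) {
    k = n - N_j(P) + 1;   /* P[k..n] is the suffix of length N_j(P); k = n+1 when N_j(P) = 0 */
    L’(k) = j;            /* j increases, so the largest qualifying j is the one kept */
}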
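The loop above can be made runnable as follows. This is a sketch under stated assumptions: `z_array`, `n_values` and `strong_L_prime` are names chosen for this example, the Z computation is the usual linear-time Z algorithm applied to the reversed pattern, and the lists are padded so that the slides' 1-based indices can be used directly.

```python
def z_array(s):
    """z[k] = length of the longest substring of s starting at k (0-based)
    that matches a prefix of s; z[0] is set to len(s)."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    l = r = 0  # current Z-box is s[l:r]
    for k in range(1, n):
        if k < r:
            z[k] = min(r - k, z[k - l])
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > r:
            l, r = k, k + z[k]
    return z


def n_values(P):
    """N[j] for j = 1..n via N_j(P) = Z_{n-j+1}(P reversed); N[0] is unused."""
    n = len(P)
    z = z_array(P[::-1])
    return [0] + [z[n - j] for j in range(1, n + 1)]


def strong_L_prime(P):
    """L[i] = L'(i) for i = 1..n; slot n+1 absorbs the N[j] == 0 case."""
    n = len(P)
    N = n_values(P)
    L = [0] * (n + 2)
    for j in range(1, n):
        L[n - N[j] + 1] = j
    return L


L = strong_L_prime("qcabdabdab")
print(L[9], L[6])  # 4 7
```

For P = qcabdabdab this gives L’(9) = 4, so a mismatch just left of the matched suffix ab shifts P by n - L’(9) = 6 places.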
13
Strong good suffix rule part 2
If L’(i) = 0 then …
–Let t’’ = the longest suffix of P[i..n] that is also a prefix of P, if one exists.
–If t’’ exists, shift P so that the prefix of P matches up with t’’ at the end of T[j..j+n-i].
–Otherwise, shift P past T[j+n-i] (a shift of n places).
14
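Illustration of suffix rule part 2
   0                 1
   1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
T: p r s t a b s t a b a b v q x r s t
P:     a b a b s t a b a b
P[3..10] matches T[5..12] and P[2] = b mismatches T[4] = t; L’(3) = 0 and t’’ = abab, so P shifts 6 places to the right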
15
Preprocessing for suffix rule part 2
Definitions
–For each i, let l’(i) denote the length of the longest suffix of P[i..n] that is also a prefix of P, if one exists. Otherwise, let l’(i) = 0.
Observations
–l’(i) = the largest j <= |P[i..n]| such that N_j(P) = j
Question
–How does l’(i) relate to l’(i+1)?
–They are the same unless N_{n-i+1}(P) = n-i+1, in which case l’(i) = n-i+1
16
Z-based computation of l’(i)
l’[n+1] = 0;
for (i=n; i>=2; i--) {
    if (N[n-i+1] == (n-i+1))
        l’[i] = n-i+1;
    else
        l’[i] = l’[i+1];
}
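A runnable version of this loop might look like the sketch below. To keep it short and self-contained, the N values are computed by direct comparison in quadratic time instead of through Z values; that substitution, and the names `n_values_naive` and `l_prime`, are assumptions of this example.

```python
def n_values_naive(P):
    """N[j] = length of the longest suffix of P[1..j] that is also a suffix of P.
    1-based, N[0] unused; quadratic time, for illustration only."""
    n = len(P)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        length = 0
        while length < j and P[j - 1 - length] == P[n - 1 - length]:
            length += 1
        N[j] = length
    return N


def l_prime(P):
    """lp[i] = l'(i) for i = 2..n, with lp[n+1] = 0; lp[0] and lp[1] are left at 0."""
    n = len(P)
    N = n_values_naive(P)
    lp = [0] * (n + 2)
    for i in range(n, 1, -1):  # i = n, n-1, ..., 2
        if N[n - i + 1] == n - i + 1:
            lp[i] = n - i + 1  # the prefix of length n-i+1 is also a suffix
        else:
            lp[i] = lp[i + 1]
    return lp


lp = l_prime("ababstabab")
print(lp[3], lp[9], lp[10])  # 4 2 0
```

For P = ababstabab, l’(3) = 4: the longest suffix of P[3..10] that is also a prefix of P is abab.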
17
Addendum to suffix rule
–Shift by 1 if there is an immediate mismatch
–That is, if P[n] mismatches with the corresponding character in T
18
Boyer-Moore Overview
–Precompute L’(i), l’(i) for each position in P
–Precompute R(x) or R(x,i) for each character x in Σ
–Align P to T
–Compare right to left
–On mismatch, shift by the max possible from (extended) bad character rule and good suffix rule and return to compare
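The overview can be assembled into a compact search routine. The sketch below makes several assumptions that are not on the slides: it uses the plain (not extended) bad character rule, computes N by direct comparison rather than from Z values, and after a full match shifts by n - l’(2), which is the convention in standard textbook treatments. The names `preprocess` and `boyer_moore` are invented for this example.

```python
def n_values(P):
    """N[j], 1-based, by direct comparison (quadratic; kept short on purpose)."""
    n = len(P)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        k = 0
        while k < j and P[j - 1 - k] == P[n - 1 - k]:
            k += 1
        N[j] = k
    return N


def preprocess(P):
    """Return R (rightmost positions), L (strong L') and lp (l'), all 1-based."""
    n = len(P)
    N = n_values(P)
    R = {ch: pos for pos, ch in enumerate(P, start=1)}
    L = [0] * (n + 2)
    for j in range(1, n):
        L[n - N[j] + 1] = j
    lp = [0] * (n + 2)
    for i in range(n, 1, -1):
        lp[i] = n - i + 1 if N[n - i + 1] == n - i + 1 else lp[i + 1]
    return R, L, lp


def boyer_moore(P, T):
    """Return the 1-based start positions of all occurrences of P in T."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    R, L, lp = preprocess(P)
    matches = []
    k = n  # 1-based position in T aligned with P[n]
    while k <= m:
        i, h = n, k
        while i >= 1 and P[i - 1] == T[h - 1]:  # compare right to left
            i -= 1
            h -= 1
        if i == 0:
            matches.append(k - n + 1)
            k += n - lp[2]  # assumption: shift after a full match
        else:
            bad = max(1, i - R.get(T[h - 1], 0))
            if i == n:
                good = 1  # immediate mismatch (addendum)
            elif L[i + 1] > 0:
                good = n - L[i + 1]  # part 1: P[i+1..n] matched
            else:
                good = n - lp[i + 1]  # part 2
            k += max(bad, good)
    return matches


print(boyer_moore("abab", "abababab"))         # [1, 3, 5]
print(boyer_moore("cabdab", "xxxcabcabdab"))   # [7]
```

The slides define L’(i) and l’(i) for a matched suffix P[i..n] with the mismatch at position i-1, so a mismatch at position i in this code looks up index i+1.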
19
Observations I
–Original Boyer-Moore algorithm uses the “weak good suffix rule”, which does not use the mismatch information
–This is not sufficient to prove that the search part of Boyer-Moore runs in linear time in the worst case
–Using only the strong good suffix rule, one can prove a worst-case search time of O(m), where n = |P| and m = |T|, provided P is not in T
–If P is in T, original Boyer-Moore runs in Θ(nm) time in the worst case, but this can be corrected with simple modifications
–Using only the bad character shift rule leads to O(nm) time in the worst case, but works in sublinear time on random strings