15-211 Fundamental Data Structures and Algorithms String Matching II March 30, 2006 Ananda Gunawardena
In this lecture: FSM revisited; Aho-Corasick Algorithm (multiple pattern matching); Boyer-Moore Algorithm (right-to-left matching); Rabin-Karp Algorithm (based on hash codes); Summary
FSM Revisited. Suppose we consider the alphabet Σ = {a, b} and a pattern P = “ababa”. The states of the FSM are all the prefixes of P, i.e. {ε, a, ab, aba, abab, ababa}. [Figure: FSM with states Q0 through Q5.] Exercise: Mark the failure (backward) transitions.
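FSM Revisited. P = “ababa”. [Figure: FSM with states Q0 through Q5.] Table of prefixes with columns j, f, a, b (only j is filled in):
prefix  j
a       1
ab      2
aba     3
abab    4
ababa   5
Expressed as a table: [transition table indexed by state and by input letter a, b; entries not filled in].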
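One way to fill in such a transition table mechanically is sketched below. This is a minimal illustration, not the lecture's own code: the class name PatternFsm and the methods build and find are hypothetical, states are represented by the number of pattern characters matched so far (so state j corresponds to Qj), and any text character outside the alphabet is assumed to reset the machine to state 0. The table is built by brute force, which is slow but easy to check by hand.

```java
final class PatternFsm {
    /**
     * delta[q][c] is the state reached from state q (the number of pattern
     * characters matched so far) after reading alphabet[c].
     */
    static int[][] build(String p, char[] alphabet) {
        int m = p.length();
        int[][] delta = new int[m + 1][alphabet.length];
        for (int q = 0; q <= m; q++) {
            for (int c = 0; c < alphabet.length; c++) {
                // Text seen so far if we were in state q and then read alphabet[c].
                String seen = p.substring(0, q) + alphabet[c];
                // New state: the longest prefix of p that is a suffix of 'seen'.
                int k = Math.min(m, q + 1);
                while (k > 0 && !seen.endsWith(p.substring(0, k))) k--;
                delta[q][c] = k;
            }
        }
        return delta;
    }

    /** Start index of the first occurrence of p in t, or -1 if there is none. */
    static int find(String t, String p, char[] alphabet) {
        if (p.isEmpty()) return 0;
        int[][] delta = build(p, alphabet);
        String sigma = new String(alphabet);
        int q = 0;
        for (int i = 0; i < t.length(); i++) {
            int c = sigma.indexOf(t.charAt(i));
            q = (c < 0) ? 0 : delta[q][c];   // assumption: unknown letters reset the FSM
            if (q == p.length()) return i - p.length() + 1;
        }
        return -1;
    }
}
```

For P = “ababa” over {a, b}, build gives for example delta[3][0] = 1 (from aba on a, fall back to a) and delta[5][1] = 4 (from ababa on b, fall back to abab); these are exactly the backward transitions the exercise asks for.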
AHO-CORASICK
Multiple Pattern Search. Suppose we need to search for a set of k patterns P1, P2, …, Pk in a text T. Possible solution: apply KMP to each of the k patterns; the cost is O(k(n+m)), where |T| = n and m = max|Pi|. Is there a better solution? Consider all patterns at once: some patterns may be prefixes of others. Analogy: the minimum and maximum can be found in one scan.
All Prefixes. Consider a set of patterns P = {ab, ac, abc, bca, bcc, ba, bc}. Prefixes of the patterns: {a, ab, abc, b, bc, bca, bcc, ba, ac}. A trie representing the patterns can be built in O(M) time, where M = |P1| + |P2| + … + |Pk| is the total length of the patterns. Nodes of the trie are states, and forward transitions are easy.
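A minimal sketch of the trie construction, under the assumption that each node simply maps a character to a child node; the names PrefixTrie, Node, insert and nodeCount are hypothetical and chosen for illustration. Each inserted character does a constant amount of expected work, which is where the O(M) bound comes from.

```java
import java.util.HashMap;
import java.util.Map;

final class PrefixTrie {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endsPattern;   // true if some pattern ends exactly at this node
    }

    final Node root = new Node();   // the root stands for the empty prefix
    int nodeCount = 1;              // counts the root as well

    /** Inserts one pattern; creates one new node per prefix not seen before. */
    void insert(String pattern) {
        Node cur = root;
        for (char ch : pattern.toCharArray()) {
            Node child = cur.children.get(ch);
            if (child == null) {
                child = new Node();
                cur.children.put(ch, child);
                nodeCount++;
            }
            cur = child;
        }
        cur.endsPattern = true;
    }
}
```

Inserting the seven patterns above yields 10 nodes: the root plus one node for each of the nine distinct non-empty prefixes.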
Failure Transitions. How do we deal with the backward (failure) transitions? Suppose U is the current match, followed by a “wrong” letter. Find the longest proper suffix V of U that is a prefix of some pattern in the set of patterns P. Example: Let P = {aba, baba, cabab}. The failure function π is given by
U     a  ab  aba  b  ba  bab  baba  c  ca  cab  caba  cabab
π(U)  ε  b   ba   ε  a   ab   aba   ε  a   ab   aba   bab
Failure functions. [Figure: trie for P = {aba, baba, cabab} rooted at Q0, with nodes ε, a, ab, aba, b, ba, bab, baba, c, ca, cab, caba, cabab and their failure links.]
More Formally. Let P = {W1, W2, …, WM} be the set of all prefixes of all patterns in the set of patterns {P1, P2, …, Pk}. The transition function δ is given by δ : P × Σ → P. The failure function π is given by π : P⁺ → P, where π(p) = the longest proper suffix of p that is in P, i.e. that is a prefix of some Pi.
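Transition Function. Given the failure function π, we can compute the transition function as follows: $\delta(u, a) = \begin{cases} ua & \text{if } ua \in P \\ \delta(\pi(u), a) & \text{if } ua \notin P \text{ and } u \neq \varepsilon \\ \varepsilon & \text{otherwise} \end{cases}$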
Computing π. How do we compute the failure function π? KMP traverses a single string from left to right. Instead, we need to traverse a trie in breadth-first order, computing failure functions. Complexity: As in KMP, we can show that the complexity of Aho-Corasick is O(M+n), where M = total length of the patterns and n = |T|.
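The breadth-first computation of π can be sketched as follows. This is an illustrative version under stated assumptions, not the lecture's implementation: the class AhoCorasickSketch, its fields, and the "startIndex:pattern" output format are all hypothetical. States are integers (0 is the root), forward edges are stored in hash maps, and each state also carries the list of patterns that end there or at a state reachable through failure links.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

final class AhoCorasickSketch {
    private final List<Map<Character, Integer>> next = new ArrayList<>(); // forward trie edges
    private final List<Integer> fail = new ArrayList<>();                 // failure function pi
    private final List<List<String>> out = new ArrayList<>();             // patterns recognised at a state

    AhoCorasickSketch(String... patterns) {
        newState();                                   // state 0 = empty prefix
        for (String p : patterns) {                   // build the trie
            int s = 0;
            for (char c : p.toCharArray()) {
                Integer t = next.get(s).get(c);
                if (t == null) { t = newState(); next.get(s).put(c, t); }
                s = t;
            }
            out.get(s).add(p);
        }
        // Breadth-first traversal: depth-1 states keep pi = root.
        Queue<Integer> queue = new ArrayDeque<>(next.get(0).values());
        while (!queue.isEmpty()) {
            int u = queue.remove();
            for (Map.Entry<Character, Integer> e : next.get(u).entrySet()) {
                int v = e.getValue();
                fail.set(v, step(fail.get(u), e.getKey()));   // pi(ua) = delta(pi(u), a)
                out.get(v).addAll(out.get(fail.get(v)));      // inherit matches from the suffix
                queue.add(v);
            }
        }
    }

    private int newState() {
        next.add(new HashMap<>());
        fail.add(0);
        out.add(new ArrayList<>());
        return next.size() - 1;
    }

    /** delta(s, c): follow failure links until a forward edge on c exists, else the root. */
    private int step(int s, char c) {
        while (s != 0 && !next.get(s).containsKey(c)) s = fail.get(s);
        return next.get(s).getOrDefault(c, 0);
    }

    /** Returns one "startIndex:pattern" entry per occurrence, in order of end position. */
    List<String> search(String text) {
        List<String> hits = new ArrayList<>();
        int s = 0;
        for (int i = 0; i < text.length(); i++) {
            s = step(s, text.charAt(i));
            for (String p : out.get(s)) hits.add((i - p.length() + 1) + ":" + p);
        }
        return hits;
    }
}
```

With the patterns {aba, baba, cabab} and the text cababa, search reports 1:aba, 0:cabab, 2:baba and 3:aba. Copying the output lists along failure links keeps the scan simple, at the cost of extra memory when many patterns are suffixes of one another.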
Boyer Moore
Boyer-Moore Idea. Scan the pattern from right to left and the text from left to right. Allow for bigger jumps on early failures. Use a table similar to KMP. Follow a “better” idea: use information about T as well as P in deciding what to do next.
Brute Force vs. B-M. Text: “abcdeabcdeabcedfghijkl”; pattern: “bcedfg”. Brute force: 15 + 6 = 21 comparisons. B-M: 2 + 6 = 8 comparisons.
Brute Force vs. B-M. Text: “This string is textual”; pattern: “textual”. Brute force: 16 + 7 = 23 comparisons. B-M: 3 + 7 = 10 comparisons.
Brute Force vs. B-M. Text: “This is a sample sentence”; pattern: “foobar”. Brute force: 25 comparisons. B-M: 5 comparisons.
Boyer-Moore Ideas. Scan the pattern from right to left (and the target from left to right). This allows for bigger jumps on early failures. We could use a table similar to KMP, but follow a better idea: use information about T as well as P in deciding what to do next. If T[i] does not appear in the pattern, skip forward beyond the end of the pattern.
Boyer-Moore matcher: building the last table.
static int[] buildLast(char[] P) {
    // Assumes 7-bit ASCII: one entry per character code 0..127.
    int[] last = new int[128];
    for (int i = 0; i < 128; i++) last[i] = -1;          // default: char not in pattern
    for (int j = 0; j < P.length; j++) last[P[j]] = j;   // rightmost index of each char
    return last;
}
Mismatch char is nowhere in the pattern (default): last says “jump the full pattern length”.
Mismatch char is a pattern char: last says “jump to align the pattern with the last instance of this char”.
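Boyer-Moore matcher: use last to determine the next value for i.
static int match(char[] T, char[] P) {
    int[] last = buildLast(P);
    int n = T.length;
    int m = P.length;
    if (m == 0) return 0;          // empty pattern matches at the start
    int i = m - 1;                 // current position in the text
    int j = m - 1;                 // current position in the pattern
    if (i > n - 1) return -1;      // pattern longer than text
    do {
        if (P[j] == T[i]) {
            if (j == 0) return i;  // whole pattern matched, starting at i
            else { i--; j--; }     // keep comparing right to left
        } else {
            i = i + m - Math.min(j, 1 + last[T[i]]);   // jump using the last table
            j = m - 1;
        }
    } while (i <= n - 1);
    return -1;
}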
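To relate the matcher to the comparison counts quoted in these slides, it can be instrumented with a counter. The sketch below is a self-contained variant written for illustration: the class name BoyerMooreCount and the static field comparisons are hypothetical, it takes Strings rather than char arrays, and it assumes 7-bit ASCII input as the last table above does.

```java
final class BoyerMooreCount {
    static int comparisons;   // character comparisons made by the most recent call to match

    /** last[c] = rightmost index of character c in the pattern, or -1 (ASCII only). */
    static int[] buildLast(char[] p) {
        int[] last = new int[128];
        java.util.Arrays.fill(last, -1);
        for (int j = 0; j < p.length; j++) last[p[j]] = j;
        return last;
    }

    /** Start index of the first occurrence of pattern in text, or -1. */
    static int match(String text, String pattern) {
        char[] T = text.toCharArray();
        char[] P = pattern.toCharArray();
        int n = T.length, m = P.length;
        comparisons = 0;
        if (m == 0) return 0;
        int[] last = buildLast(P);
        int i = m - 1, j = m - 1;
        while (i <= n - 1) {
            comparisons++;
            if (P[j] == T[i]) {
                if (j == 0) return i;
                i--;
                j--;
            } else {
                i += m - Math.min(j, 1 + last[T[i]]);   // same jump rule as the matcher above
                j = m - 1;
            }
        }
        return -1;
    }
}
```

For example, match("abcdeabcdeabcedfghijkl", "bcedfg") returns 11 and leaves comparisons at 8, and match("This is a sample sentence", "foobar") returns -1 after 5 comparisons.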
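KMP vs. B-M. Text: “1234561234356”; pattern: “7777777”. KMP: 13 comparisons. B-M: 1 comparison.
KMP vs. B-M. Text: “This is a string”; pattern: “ring”. KMP: 16 comparisons. B-M: 7 comparisons.
KMP vs. B-M. Text: “This is a string”; pattern: “tring”. KMP: 16 comparisons. B-M: 8 comparisons.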
Rabin-Karp
Rabin-Karp Algorithm. Suppose P is a pattern and T is the search text. Compute a hash code of P and ALL the hash codes of substrings of T of length |P| = m. If hash(P) = hash(T(i..i+m-1)) for some 0 ≤ i ≤ n-m, then we have possibly found the pattern. But computing all hash codes from scratch takes Ω(nm) time, where |T| = n, |P| = m. How to compute a good hash code? $H(P) = \sum_{j=0}^{m-1} P[j] \cdot B^{m-1-j}$, where B is a large enough base, e.g. B = 256. How to compute the hash code efficiently?
Rabin-Karp: how to compute the hash code efficiently. We need hash codes for the substrings of length m of the form T[i…i+m-1]. How do we get T[i+1…i+m] from T[i…i+m-1]? Drop T[i] and add T[i+m]. Find a relation between the hash codes for T[i…i+m-1] and T[i+1…i+m].
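Rabin-Karp Algorithm. $H(T(i..i+m-1)) = \sum_{j=0}^{m-1} T[i+j] \cdot B^{m-1-j}$. What about H(T(i+1..i+m))? $H(T(i+1..i+m)) = \left( H(T(i..i+m-1)) - T[i] \cdot B^{m-1} \right) \cdot B + T[i+m]$. Keep arithmetic overflow under control by working mod q for some prime q. However, we still need to do a character-by-character comparison after we get a hash match.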
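The rolling update can be sketched as below. This is an illustration under explicit assumptions: the hash treats the window as a base-B number with the leftmost character most significant, B = 256 as in the slides, and the prime modulus Q = 1,000,000,007 is chosen here only for convenience. The class name RabinKarpSketch and the method find are hypothetical.

```java
final class RabinKarpSketch {
    static final long B = 256;              // base, as suggested in the slides
    static final long Q = 1_000_000_007L;   // prime modulus (an assumption of this sketch)

    /** Start index of the first occurrence of p in t, or -1 if there is none. */
    static int find(String t, String p) {
        int n = t.length(), m = p.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        long high = 1;                      // B^(m-1) mod Q, weight of the leading character
        for (int k = 1; k < m; k++) high = high * B % Q;

        long hp = 0, ht = 0;                // hash of the pattern and of the window T[0..m-1]
        for (int k = 0; k < m; k++) {
            hp = (hp * B + p.charAt(k)) % Q;
            ht = (ht * B + t.charAt(k)) % Q;
        }

        for (int i = 0; ; i++) {
            // Equal hashes only mean a possible match, so verify character by character.
            if (hp == ht && t.regionMatches(i, p, 0, m)) return i;
            if (i + m >= n) return -1;      // no further window to slide to
            ht = (ht + Q - t.charAt(i) * high % Q) % Q;   // drop T[i]
            ht = (ht * B + t.charAt(i + m)) % Q;          // shift and add T[i+m]
        }
    }
}
```

Each slide of the window costs a constant number of arithmetic operations, so all n-m+1 hash codes are obtained in O(n+m) time; the explicit comparison after a hash match guards against collisions.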
Summary
Knuth-Morris-Pratt Summary. Intuition: Analyze the pattern; analogous to a matching FSM; never decrement i. Works well: for self-repetitive patterns in self-repetitive text. But: for text, performance is similar to brute force, and possibly slower due to precomputation.
Aho-Corasick Summary. Intuition: Use prefixes of multiple patterns to define failure transitions; a natural extension of the KMP idea. Works well: for multiple pattern search; used in the famous fgrep utility.
Boyer-Moore Summary. Intuition: Analyze the target and the pattern; work backwards from the end of the pattern; jump forward in the target when failing. Works well: for large alphabets (what does the last table look like for {0,1}?); for text, in practice. But: streams? Must be able to decrement i.
Rabin-Karp Summary. Intuition: If the hash codes of two patterns are the same, then the patterns “might” be the same; if the pattern has length m, compute hash codes of all substrings of length m; leverage the previous hash code to compute the next one. Works well: multiple pattern search. But: computing hash codes may be expensive.