Contest Algorithms, January 2016
13. String Searching
Three types of string search: brute force, Knuth-Morris-Pratt (KMP) and Rabin-Karp
Contest Algorithms: 13. String Srch
1. What is String Searching?
Definition: given a text string T and a search string (pattern) P, find P inside T.
  T: "the rain in spain stays mainly on the plain"
  P: "n th"
Applications: text editors, Web search engines (e.g. Google), image analysis.
String Concepts
Assume S is a string of size m.
A substring S[i..j] of S is the string fragment between indexes i and j.
A prefix of S ("start of S") is a substring S[0..i].
A suffix of S ("end of S") is a substring S[i..m-1].
Here i is any index between 0 and m-1.
Examples
S = "andrew" (indexes 0 to 5)
Substring S[1..3] == "ndr"
All possible prefixes of S: "andrew", "andre", "andr", "and", "an", "a"
All possible suffixes of S: "andrew", "ndrew", "drew", "rew", "ew", "w"
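The definitions above can be checked with a few lines of Java. This is only an illustrative sketch: the class name StringConcepts and its helper methods are hypothetical and not part of the slides. Note that Java's substring(begin, end) excludes the end index, so S[1..3] is written substring(1, 4).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class (not from the slides): lists prefixes and suffixes.
final class StringConcepts {

    // All prefixes S[0..i], longest first.
    static List<String> prefixes(String s) {
        List<String> out = new ArrayList<>();
        for (int i = s.length() - 1; i >= 0; i--)
            out.add(s.substring(0, i + 1)); // end index is exclusive
        return out;
    }

    // All suffixes S[i..m-1], longest first.
    static List<String> suffixes(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++)
            out.add(s.substring(i));
        return out;
    }

    public static void main(String[] args) {
        String s = "andrew";
        System.out.println(s.substring(1, 4)); // the substring S[1..3]
        System.out.println(prefixes(s));
        System.out.println(suffixes(s));
    }
}
```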
2. The Brute Force Algorithm
Check each position in the text T to see if the pattern P starts at that position.
P moves 1 char at a time through T.
(Figure: the pattern P = "rew" sliding along the text T = "andrew", one position per step.)
Code
public static int brute(String text, String pattern) {
  int n = text.length();
  int m = pattern.length();
  int j;
  for (int i = 0; i <= (n - m); i++) {   // try each start position i in the text
    j = 0;
    while ((j < m) && (text.charAt(i + j) == pattern.charAt(j)))
      j++;
    if (j == m)
      return i;    // match at i
  }
  return -1;       // no match
} // end of brute()
Properties of Brute-force Search
Easy to code.
No preprocessing needs to be done on the pattern.
Usually takes O(n+m) steps, which is not so bad.
  n = length of text; m = length of pattern
Worst case is O(nm), e.g. when searching for "aaab" in "aaaaaaaaaaaaaaaaaaaaaaaab".
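To make the O(nm) worst case concrete, the sketch below counts character comparisons using the same loop structure as brute(). The class name BruteCount, the helper worstCaseText(), and the choice of 24 leading 'a' characters are assumptions made for illustration. With n = 25 and m = 4, every one of the n-m+1 = 22 start positions costs m = 4 comparisons.

```java
// Hypothetical demo class (not from the slides): counts comparisons in brute-force search.
final class BruteCount {

    // Same loop as brute(), but returns how many character comparisons were made.
    static long comparisons(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        long count = 0;
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                count++;                                   // one comparison
                if (text.charAt(i + j) != pattern.charAt(j))
                    break;                                 // mismatch: try next i
                j++;
            }
            if (j == m)
                return count;                              // match found at i
        }
        return count;                                      // no match
    }

    // Builds k copies of 'a' followed by a single 'b'.
    static String worstCaseText(int k) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < k; i++)
            sb.append('a');
        return sb.append('b').toString();
    }

    public static void main(String[] args) {
        String text = worstCaseText(24);                   // n = 25
        System.out.println(comparisons(text, "aaab"));     // (n-m+1)*m = 22*4 = 88
    }
}
```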
3. The KMP Algorithm
The Knuth-Morris-Pratt (KMP) algorithm shifts the pattern more intelligently than the brute force algorithm.
Its shifts can be bigger than a 1-character move.
If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?
Answer: shift so that the largest prefix of P[0..j-1] that is also a suffix of P[1..j-1] stays aligned with the text; matching then resumes just after that prefix.
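Example
(Figure: pattern P aligned under text T; a mismatch at text index i with j = 5 shifts the pattern so that matching resumes at j_new = 2.)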
Why j becomes 2 when j == 5
Find the largest prefix (start) of "abaab" (P[0..j-1]) which is also a suffix (end) of "baab" (P[1..j-1]).
Answer: "ab"
Set j = 2   // the new j value
KMP Failure Function
KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
  j = mismatch position in P[]
  k = position before the mismatch (k = j-1)
The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
Failure Function Example
P: "abaaba"
F(k) is the size of the largest prefix of P[0..k] that is also a suffix of P[1..k] (k == j-1).

  k      0  1  2  3  4  5
  P[k]   a  b  a  a  b  a
  F(k)   0  0  1  1  2  3

In code, F() is represented by an array, like the table.
Why is F(4) == 2?
P: "abaaba"
F(4) means: find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaab" that is also a suffix of "baab"
  = find the size of "ab"
  = 2
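The definition of F(k) can be turned directly into code. The sketch below is a deliberately naive version that tries every candidate length from longest to shortest; it is not the linear-time method KMP actually uses, and the class name FailureByDefinition is hypothetical. It is useful mainly for checking values such as F(4) == 2 by hand.

```java
import java.util.Arrays;

// Hypothetical demo class (not from the slides): F(k) computed straight from its definition.
final class FailureByDefinition {

    // Size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
    static int f(String p, int k) {
        for (int len = k; len >= 1; len--) {                  // P[1..k] has only k chars
            String prefix = p.substring(0, len);              // P[0..len-1]
            String suffix = p.substring(k - len + 1, k + 1);  // last len chars of P[1..k]
            if (prefix.equals(suffix))
                return len;
        }
        return 0;                                             // no such prefix
    }

    // The whole failure table, one entry per position of the pattern.
    static int[] table(String p) {
        int[] fail = new int[p.length()];
        for (int k = 0; k < p.length(); k++)
            fail[k] = f(p, k);
        return fail;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(table("abaaba")));
        System.out.println(Arrays.toString(table("abacab")));
    }
}
```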
Using the Failure Function
Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm.
If a mismatch occurs at P[j] (i.e. P[j] != T[i]) and j > 0, then:
  k = j-1;
  j = F(k);   // obtain the new j
If j == 0 there is no k, so the text index i moves on by one instead.
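Code
int kmpMatch(String text, String pattern)
// Return index where pattern starts, or -1
{
  int n = text.length();
  int m = pattern.length();
  int fail[] = computeFail(pattern);
  int i = 0;   // index into text
  int j = 0;   // index into pattern
  while (i < n) {
    if (pattern.charAt(j) == text.charAt(i)) {
      if (j == m - 1)
        return i - m + 1;   // match
      i++;
      j++;
    }
    else if (j > 0)
      j = fail[j-1];        // shift the pattern using the failure function
    else
      i++;
  }
  return -1;   // no match
} // end of kmpMatch()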
Similar code to kmpMatch()
int[] computeFail(String pattern)
{
  int fail[] = new int[pattern.length()];
  fail[0] = 0;
  int m = pattern.length();
  int j = 0;
  int i = 1;
  while (i < m) {
    if (pattern.charAt(j) == pattern.charAt(i)) {   // j+1 chars match
      fail[i] = j + 1;
      i++;
      j++;
    }
    else if (j > 0)   // j follows matching prefix
      j = fail[j-1];
    else {            // no match
      fail[i] = 0;
      i++;
    }
  }
  return fail;
} // end of computeFail()
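As a hedged sketch, the two KMP methods can be combined into one self-contained class and run on the example from the start of the section. The class name KmpDemo, the convention that an empty pattern matches at index 0, and the main() method are assumptions added for illustration.

```java
import java.util.Arrays;

// Hypothetical demo class: a runnable assembly of kmpMatch() and computeFail().
final class KmpDemo {

    // fail[k] = size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
    static int[] computeFail(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];          // entries start at 0
        int i = 1;
        int j = 0;
        while (i < m) {
            if (pattern.charAt(j) == pattern.charAt(i)) {
                fail[i] = j + 1;          // j+1 chars match
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];          // fall back to a shorter matching prefix
            } else {
                fail[i] = 0;              // no match
                i++;
            }
        }
        return fail;
    }

    // Returns the index where pattern starts in text, or -1.
    static int kmpMatch(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0)
            return 0;                     // assumption: empty pattern matches at 0
        int[] fail = computeFail(pattern);
        int i = 0;
        int j = 0;
        while (i < n) {
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == m - 1)
                    return i - m + 1;     // whole pattern matched
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];          // shift pattern; i never moves backwards
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "the rain in spain stays mainly on the plain";
        System.out.println(Arrays.toString(computeFail("abaaba")));
        System.out.println(kmpMatch(text, "n th"));
    }
}
```

Because i never decreases, the text is scanned once, which is where the O(n) matching time comes from.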
Example
(Figure: a KMP trace of a pattern P against a text T, shown with the failure table F(k).)
Why is F(4) == 1?
P: "abacab"
F(4) means: find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaca" that is also a suffix of "baca"
  = find the size of "a"
  = 1
Properties of KMP
Time to find a match is only O(n), with O(m) preprocessing time.
  n = length of text; m = length of the pattern
Can be modified to search for multiple patterns in a single search.
KMP Disadvantage
KMP doesn't work so well as the size of the alphabet increases:
  there is more chance of a mismatch (more possible mismatches);
  mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later.
KMP Extensions
The basic algorithm doesn't take into account the letter in the text that caused the mismatch.
(Figure: a text T and a pattern P where taking the mismatched text letter into account gives a different shift. Basic KMP does not do this.)
5. The Rabin-Karp Algorithm
String search is based on a hash function applied to the pattern and to substrings of the text.
Look for a match by comparing the hash values, not the substrings.
Typical hash function
long hash(String s) {
  long h = 0;
  for (int j = 0; j < s.length(); j++)
    h = (R * h + s.charAt(j)) % Q;   // % acts as mod
  return h;
}
R == radix; often 10 for numeric data, 128 for ASCII, etc.
Q == a large prime number; e.g. 997
hash("26535") == 613 (with R = 10, Q = 997, and each character taken as its digit value)
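The value hash("26535") == 613 only comes out if each character is treated as a decimal digit rather than as its character code. The sketch below makes that assumption explicit by subtracting '0'; the class name HornerHash and passing R and Q as parameters are choices made for illustration.

```java
// Hypothetical demo class (not from the slides): the hash applied to digit strings.
final class HornerHash {

    // Horner's rule: h = (R*h + next digit) mod Q, left to right.
    static long hash(String s, int R, long Q) {
        long h = 0;
        for (int j = 0; j < s.length(); j++) {
            int digit = s.charAt(j) - '0';   // assumption: s holds only '0'..'9'
            h = (R * h + digit) % Q;
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("26535", 10, 997));   // 26535 mod 997 = 613
        System.out.println(hash("26", 10, 11));       // 26 mod 11 = 4
    }
}
```

Taking the remainder at every step keeps the intermediate value below R*Q, so a long text never overflows the long type as long as Q is modest.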
Hash Function explained
The pattern P = p_0 p_1 p_2 p_3 ... p_{m-2} p_{m-1} has m chars; its hash value is hash(P).
The text is T = t_0 t_1 t_2 t_3 ... t_{m-2} t_{m-1} t_m t_{m+1} ...
Examine m chars of the text at a time: each m-char substring X_i has the hash value hash(X_i).
The hash function calculates (for the substring t_0 ... t_{m-1}):
$\mathrm{hash}(X_i) = (t_0 R^{m-1} + t_1 R^{m-2} + t_2 R^{m-3} + \dots + t_{m-2} R + t_{m-1}) \bmod Q$
Example
T = " " and P = "26"
R = 10; Q = 11
hash("ab") = (a*10 + b) mod 11
hash(P) == hash("26") == 26 mod 11 = 4
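Iterate through the Text
Compare the hash of each 2-char substring of T with hash(P) = 4.
The first substrings hash to values other than 4 (for example 31 mod 11 = 9; two others hash to 3 and 8), so they cannot match.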
Four later substrings hash to 4, which is equal to hash(P).
Three of them differ from P: wrong matches.
The fourth is "26" itself: the correct match.
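The effect of a small Q can be reproduced with a short sketch. The text used below, "3141592653589793", is a hypothetical digit string chosen for illustration and is not recovered from the slides; the class name SmallRabinKarp and its methods are likewise assumptions. With R = 10 and Q = 11 it produces three wrong hash matches before the real occurrence of "26".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: hash matches versus true matches with R = 10, Q = 11.
final class SmallRabinKarp {
    static final int R = 10;
    static final int Q = 11;

    // Hash of the len digits of s starting at index start.
    static int hash(String s, int start, int len) {
        int h = 0;
        for (int j = 0; j < len; j++)
            h = (R * h + (s.charAt(start + j) - '0')) % Q;   // digits only
        return h;
    }

    // Offsets whose substring hash equals hash(pattern); these may be wrong matches.
    static List<Integer> hashHits(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        int m = pattern.length();
        int target = hash(pattern, 0, m);
        for (int i = 0; i + m <= text.length(); i++)
            if (hash(text, i, m) == target)
                hits.add(i);
        return hits;
    }

    // First offset that survives the string double-check, or -1.
    static int firstTrueMatch(String text, String pattern) {
        for (int i : hashHits(text, pattern))
            if (text.startsWith(pattern, i))
                return i;
        return -1;
    }

    public static void main(String[] args) {
        String text = "3141592653589793";   // hypothetical example text
        System.out.println(hashHits(text, "26"));
        System.out.println(firstTrueMatch(text, "26"));
    }
}
```

This version rehashes every substring from scratch, so it only illustrates wrong matches, not the speed of Rabin-Karp.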
Why Wrong Matches?
The hash() function uses modulo Q, so the range of results is 0 to Q-1.
If Q is small then it is likely that two different strings will hash to the same result; the probability is about 1/Q.
The solution is to make Q very big, which reduces the chance of a wrong match (e.g. Q of about 4.3 billion).
Also double-check the match using string operations.
Gambling Names
Comparing hash values alone is an example of a Monte Carlo algorithm: it's fast but may output an incorrect answer with a small probability (about 1/Q).
The "double-checking" approach is known as a Las Vegas algorithm: its answer is always correct, but it can be slow.
Speeding up hash Calculation
After the hash() of the first substring of T, there is no need to keep calling hash() for the 2nd substring, 3rd substring, etc.
It is possible to calculate the next hash, hash(X_{i+1}), from the current hash value, hash(X_i):
  much faster (O(m) --> O(1) running time per substring)
  less memory needed
Connection between hash()s
X_i is the substring t_0 ... t_{m-1} of T; X_{i+1} is the next substring, t_1 ... t_m.
$\mathrm{hash}(X_i) = (t_0 R^{m-1} + t_1 R^{m-2} + t_2 R^{m-3} + \dots + t_{m-2} R + t_{m-1}) \bmod Q$
$\mathrm{hash}(X_{i+1}) = (t_1 R^{m-1} + t_2 R^{m-2} + t_3 R^{m-3} + \dots + t_{m-1} R + t_m) \bmod Q$
Therefore:
$\mathrm{hash}(X_{i+1}) = \big( ((\mathrm{hash}(X_i) - t_0 R^{m-1}) \bmod Q) \cdot R + t_m \big) \bmod Q$
$= \big( ((\mathrm{hash}(X_i) + (t_0 Q - t_0 R^{m-1})) \bmod Q) \cdot R + t_m \big) \bmod Q$
$= \big( (\mathrm{hash}(X_i) + t_0 (Q - (R^{m-1} \bmod Q))) \cdot R + t_m \big) \bmod Q$
  t_0 is the old front value; t_m is the new end value.
  t_0 Q is included so that the mod value is positive.
  R^{m-1} mod Q is a constant, which can be pre-calculated.
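The update rule can be checked numerically by rolling the hash along a text and comparing each result with a hash computed from scratch. The sketch below assumes a digit text, R = 10 and Q = 997; the text "3141592653589793" is a hypothetical example, and the class and method names are illustrative.

```java
// Hypothetical demo class: checks the O(1) hash update against direct hashing.
final class RollingHashDemo {

    static int digit(String s, int i) {
        return s.charAt(i) - '0';   // assumption: digit strings only
    }

    // Hash of the m digits starting at index start, computed from scratch.
    static long directHash(String s, int start, int m, int R, long Q) {
        long h = 0;
        for (int j = 0; j < m; j++)
            h = (R * h + digit(s, start + j)) % Q;
        return h;
    }

    // hash(X_{i+1}) from hash(X_i): remove the old front digit, append the new end digit.
    static long roll(long h, int oldFront, int newEnd, long RM, int R, long Q) {
        h = (h + oldFront * (Q - RM)) % Q;   // same as subtracting oldFront*RM, but never negative
        return (h * R + newEnd) % Q;
    }

    // True if rolling and direct hashing agree at every offset of text.
    static boolean rollingAgreesWithDirect(String text, int m, int R, long Q) {
        long RM = 1;                          // R^(m-1) mod Q, pre-calculated once
        for (int i = 1; i <= m - 1; i++)
            RM = (R * RM) % Q;
        long h = directHash(text, 0, m, R, Q);
        for (int i = 1; i + m <= text.length(); i++) {
            h = roll(h, digit(text, i - 1), digit(text, i + m - 1), RM, R, Q);
            if (h != directHash(text, i, m, R, Q))
                return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "3141592653589793";     // hypothetical example text
        System.out.println(rollingAgreesWithDirect(text, 5, 10, 997));
        System.out.println(directHash(text, 6, 5, 10, 997));   // hash of "26535"
    }
}
```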
Using: Modulo Properties
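The derivation of the hash update appears to rely on the standard identities of modular arithmetic. They are stated here as an assumption about what this slide showed, with a and b standing for any integers and mod always giving a result in the range 0 to Q-1:

```latex
\begin{align*}
(a + b) \bmod Q &= \big( (a \bmod Q) + (b \bmod Q) \big) \bmod Q \\
(a \cdot b) \bmod Q &= \big( (a \bmod Q) \cdot (b \bmod Q) \big) \bmod Q \\
(a - b) \bmod Q &= \big( (a \bmod Q) + Q - (b \bmod Q) \big) \bmod Q
\end{align*}
```

The third identity is the reason for adding Q before subtracting: in Java the % operator can return a negative result for a negative left operand, so the code keeps every intermediate value non-negative.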
Creating the Hash
We move through the text left-to-right, one character at a time, building up the hash for an m-character substring from preceding hash values.
Hash of the Pattern
P: "26535"
R = 10, Q = 997
(Figure: step-by-step calculation of the hash value for the pattern.)
Hashing the Text Substrings
T: " "
M = 5, R = 10, Q = 997
In the code, RM = R^(M-1) mod Q.
(Figure: the hash values for the M-char substrings.)
Code
public static void main(String[] args) {
  if (args.length != 2) {
    System.out.println("Usage: java RabinKarp <text> <pattern>");
    return;
  }
  RabinKarp searcher = new RabinKarp(args[1]);   // args[1] is the pattern
  int pos = searcher.search(args[0]);            // args[0] is the text
  showPos(args[0], args[1], pos);
} // end of main()
import java.math.BigInteger;
import java.util.Random;

public class RabinKarp {
  private static final int R = 256;   // radix
  private String pat;      // the pattern; needs to be global for LV checking
  private long patHash;    // pattern hash value
  private int M;           // pattern length
  private long Q;          // a large prime, small enough to avoid long overflow
  private long RM;         // == R^(M-1) % Q

  public RabinKarp(String pat) {
    this.pat = pat;        // save pattern (needed only for Las Vegas)
    M = pat.length();
    Q = longRandomPrime();

    // precompute R^(M-1) % Q for use in removing leading digit
    RM = 1;
    for (int i = 1; i <= M - 1; i++)
      RM = (R * RM) % Q;
    patHash = hash(pat, M);
  } // end of RabinKarp()
  :
  private static long longRandomPrime()
  // a random 31-bit probable prime
  {
    BigInteger prime = new BigInteger(31, 20, new Random());
    return prime.longValue();
  }

  private long hash(String key, int M)
  // Compute hash for key[0..M-1].
  {
    long h = 0;
    for (int j = 0; j < M; j++)
      h = (R * h + key.charAt(j)) % Q;
    return h;
  } // end of hash()
  public int search(String txt) {
    int N = txt.length();
    if (N < M)
      return -1;
    long txtHash = hash(txt, M);

    // hash match found at offset 0, so double-check
    if ((patHash == txtHash) && check(txt, 0))
      return 0;

    // iterate through the text
    for (int i = M; i < N; i++) {
      // Calculate new hash by removing leading digit, add trailing digit
      txtHash = (txtHash + Q - RM * txt.charAt(i - M) % Q) % Q;
      txtHash = (txtHash * R + txt.charAt(i)) % Q;

      // found a hash match, so double-check
      int offset = i - M + 1;
      if ((patHash == txtHash) && check(txt, offset))
        return offset;
    }
    return -1;   // no match found
  } // end of search()
  private boolean check(String txt, int i)
  // Las Vegas version: does pat[] match txt[i..i+M-1] ?
  {
    for (int j = 0; j < M; j++)
      if (pat.charAt(j) != txt.charAt(i + j))
        return false;
    return true;
  } // end of check()
Properties of Rabin-Karp
Has a poor worst-case running time (O(nm)), and so KMP is probably better for plain string searching.
Rabin-Karp's hashing technique allows the search algorithm to be used on things other than text, e.g. image, audio, video search.
Rabin-Karp can be easily modified to do fast multiple pattern search:
  check whether the hash of a string in the text belongs to a set of hash values of patterns.
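The multiple-pattern idea can be sketched as follows. This is one possible reading of the slide, not the author's code: it assumes all patterns have the same length, uses a fixed prime Q = 1,000,000,007 instead of a random one, and double-checks each hash hit against the pattern set. The class name MultiPatternRabinKarp is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class: Rabin-Karp search for any one of several equal-length patterns.
final class MultiPatternRabinKarp {
    private static final int R = 256;              // radix
    private static final long Q = 1_000_000_007L;  // assumption: a fixed large prime

    // Hash of s[0..m-1].
    static long hash(String s, int m) {
        long h = 0;
        for (int j = 0; j < m; j++)
            h = (R * h + s.charAt(j)) % Q;
        return h;
    }

    // Offset of the first occurrence of any pattern in text, or -1.
    static int search(String text, Set<String> patterns) {
        if (patterns.isEmpty())
            return -1;
        int m = patterns.iterator().next().length();
        Set<Long> hashes = new HashSet<>();
        for (String p : patterns) {
            if (p.length() != m)
                throw new IllegalArgumentException("patterns must share one length");
            hashes.add(hash(p, m));
        }
        int n = text.length();
        if (n < m)
            return -1;

        long RM = 1;                                // R^(m-1) mod Q
        for (int i = 1; i <= m - 1; i++)
            RM = (R * RM) % Q;

        long h = hash(text, m);
        for (int i = 0; ; i++) {
            // set lookup replaces the single hash comparison; then double-check
            if (hashes.contains(h) && patterns.contains(text.substring(i, i + m)))
                return i;
            if (i + m >= n)
                return -1;                          // no more substrings
            h = (h + Q - RM * text.charAt(i) % Q) % Q;   // remove leading char
            h = (h * R + text.charAt(i + m)) % Q;        // add trailing char
        }
    }

    public static void main(String[] args) {
        String text = "the rain in spain stays mainly on the plain";
        Set<String> pats = new HashSet<>(Arrays.asList("plain", "spain", "stays"));
        System.out.println(search(text, pats));
    }
}
```

Each text position costs one rolling update and one expected constant-time set lookup, so the search time does not grow with the number of patterns apart from the initial hashing.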
6. Summary
m = pattern length; n = text length

Algorithm            Preprocessing time     Matching time (average, worst)
Brute force          0 (no preprocessing)   O(n+m), O(nm)
Knuth-Morris-Pratt   O(m)                   O(n)
Rabin-Karp           O(m)                   O(n+m), O(nm)

35 algorithms with C code at