
1  Contest Algorithms, January 2016
13. String Searching
Three types of string search: brute force, Knuth-Morris-Pratt (KMP) and Rabin-Karp

2  1. What is String Searching?
• Definition: given a text string T and a search string (pattern) P, find P inside T
   - T: "the rain in spain stays mainly on the plain"
   - P: "n th"
• Applications: text editors, Web search engines (e.g. Google), image analysis
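The definition above can be tried out directly with Java's built-in `String.indexOf`. This is only a minimal sketch: the class name `FindDemo` and the helper `find` are illustrative names, not part of the slides.

```java
class FindDemo {
    // Returns the index of the first occurrence of pattern in text, or -1.
    // Delegates to the standard library; the later slides build this by hand.
    static int find(String text, String pattern) {
        return text.indexOf(pattern);
    }

    public static void main(String[] args) {
        String t = "the rain in spain stays mainly on the plain";
        System.out.println(find(t, "n th"));   // P from the slide
        System.out.println(find(t, "rainy"));  // a pattern that does not occur
    }
}
```

The first call reports where "n th" (the end of "on" followed by " th") begins; the second returns -1.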

3  String Concepts
• Assume S is a string of size m.
• A substring S[i..j] of S is the string fragment between indexes i and j (inclusive).
• A prefix of S is a substring S[0..i] (the "start of S").
• A suffix of S is a substring S[i..m-1] (the "end of S").
• i is any index between 0 and m-1.

4  Examples
• S = "andrew" (indexes 0 to 5)
• Substring S[1..3] == "ndr"
• All possible prefixes of S:
   - "andrew", "andre", "andr", "and", "an", "a"
• All possible suffixes of S:
   - "andrew", "ndrew", "drew", "rew", "ew", "w"
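The prefix and suffix lists above can be generated mechanically. The following is a small sketch; `AffixDemo`, `prefixes` and `suffixes` are hypothetical names chosen for illustration. Note that Java's `substring(a, b)` excludes index b, so S[1..3] is `substring(1, 4)`.

```java
import java.util.ArrayList;
import java.util.List;

class AffixDemo {
    // All non-empty prefixes S[0..i], longest first.
    static List<String> prefixes(String s) {
        List<String> out = new ArrayList<>();
        for (int i = s.length() - 1; i >= 0; i--)
            out.add(s.substring(0, i + 1));
        return out;
    }

    // All non-empty suffixes S[i..m-1], longest first.
    static List<String> suffixes(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++)
            out.add(s.substring(i));
        return out;
    }

    public static void main(String[] args) {
        String s = "andrew";
        System.out.println(s.substring(1, 4));  // S[1..3]
        System.out.println(prefixes(s));
        System.out.println(suffixes(s));
    }
}
```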

5  2. The Brute Force Algorithm
• Check each position in the text T to see if the pattern P starts at that position.
• P moves 1 char at a time through T.
• (Figure: P = "rew" slides along T = "andrew", one position per step.)

6  Code (see BruteSearch.java)

    public static int brute(String text, String pattern)
    {
      int n = text.length();
      int m = pattern.length();
      int j;
      for (int i = 0; i <= (n-m); i++) {
        j = 0;
        while ((j < m) && (text.charAt(i+j) == pattern.charAt(j)))
          j++;
        if (j == m)
          return i;   // match at i
      }
      return -1;   // no match
    }  // end of brute()

7  Properties of Brute-force Search
• Easy to code
• No preprocessing needs to be done on the pattern
• Usually takes O(n+m) steps, which is not so bad
   - n = length of text; m = length of pattern
• Worst case is O(nm), e.g. when searching for "aaab" in "aaaaaaaaaaaaaaaaaaaaaaaab"
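To make the O(nm) worst case concrete, the sketch below adds a comparison counter to the brute-force search. The class `BruteCount` and the helper `worstText` are assumptions introduced for this illustration; the text used is 24 a's followed by one b, in the spirit of the slide's example.

```java
class BruteCount {
    // Brute-force search that also counts character comparisons.
    // Returns {index of first match or -1, number of comparisons made}.
    static long[] search(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        long comparisons = 0;
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (text.charAt(i + j) != pattern.charAt(j)) break;
                j++;
            }
            if (j == m) return new long[] { i, comparisons };
        }
        return new long[] { -1, comparisons };
    }

    // Builds a text of `count` copies of 'a' followed by a single 'b'.
    static String worstText(int count) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < count; k++) sb.append('a');
        return sb.append('b').toString();
    }

    public static void main(String[] args) {
        long[] r = search(worstText(24), "aaab");
        System.out.println("match at " + r[0] + " after " + r[1] + " comparisons");
    }
}
```

Every failed alignment here costs the full m = 4 comparisons before the mismatch on the last pattern character, which is what drives the count towards n*m.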

8  3. The KMP Algorithm
• The Knuth-Morris-Pratt (KMP) algorithm shifts the pattern more intelligently than the brute force algorithm.
   - shifts can be bigger than a 1-character move

9
• If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?
• Answer: shift so that the new j is the size of the largest prefix of P[0..j-1] that is also a suffix of P[1..j-1]

10  Example
• (Figure: P aligned under T with a mismatch at text index i and pattern index j = 5; after the shift, the new j = 2.)

11  Why (j == 5)
• Find the largest prefix (start) of "a b a a b" (P[0..j-1]) which is a suffix (end) of "b a a b" (P[1..j-1])
• Answer: "a b"
• Set j = 2   // the new j value

12  KMP Failure Function
• KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
• j = mismatch position in P[]
• k = position before the mismatch (k = j-1)
• The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].

13  Failure Function Example
• P: "a b a a b a"
  j:  0 1 2 3 4 5
• F(k) is the size of the largest prefix (k == j-1):

    k     0  1  2  3  4
    F(k)  0  0  1  1  2

• In code, F() is represented by an array, like the table.

14  Why is F(4) == 2?
• P: "abaaba"
• F(4) means: find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
   = find the size of the largest prefix of "abaab" that is also a suffix of "baab"
   = find the size of "ab"
   = 2
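The definition of F(k) can be checked by brute force, trying every candidate length from longest to shortest. This is a deliberately naive sketch (roughly cubic time) meant only to mirror the definition; it is not the linear-time method KMP uses, and the class name `FailureByDefinition` is hypothetical.

```java
import java.util.Arrays;

class FailureByDefinition {
    // F(k) = size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
    static int[] failure(String p) {
        int m = p.length();
        int[] f = new int[m];
        for (int k = 0; k < m; k++) {
            // A suffix of P[1..k] has at most k characters, so start at len = k.
            for (int len = k; len >= 1; len--) {
                String prefix = p.substring(0, len);             // P[0..len-1]
                String suffix = p.substring(k - len + 1, k + 1);  // P[k-len+1..k]
                if (prefix.equals(suffix)) {
                    f[k] = len;
                    break;
                }
            }
        }
        return f;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(failure("abaaba")));
        System.out.println(Arrays.toString(failure("abacab")));
    }
}
```

For "abaaba" the entry at k = 4 comes out as 2, matching the worked reasoning on the slide.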

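15  Using the Failure Function
• Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm.
   - if a mismatch occurs at P[j] (i.e. P[j] != T[i]) and j > 0, then
        k = j-1;
        j = F(k);   // obtain the new j
   - if the mismatch occurs at j == 0 there is no k, so move on to the next text character (i++)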

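16-17  Code (see KmpSearch.java)
• Return the index where the pattern starts, or -1.

    int kmpMatch(String text, String pattern)
    {
      int n = text.length();
      int m = pattern.length();
      int fail[] = computeFail(pattern);
      int i = 0;   // position in text
      int j = 0;   // position in pattern
      while (i < n) {
        if (pattern.charAt(j) == text.charAt(i)) {
          if (j == m - 1)
            return i - m + 1;   // match
          i++;
          j++;
        }
        else if (j > 0)
          j = fail[j-1];
        else
          i++;
      }
      return -1;   // no match
    }  // end of kmpMatch()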

18-19  Code: computeFail() (similar code to kmpMatch())

    int[] computeFail(String pattern)
    {
      int fail[] = new int[pattern.length()];
      fail[0] = 0;
      int m = pattern.length();
      int j = 0;
      int i = 1;
      while (i < m) {
        if (pattern.charAt(j) == pattern.charAt(i)) {   // j+1 chars match
          fail[i] = j + 1;
          i++;
          j++;
        }
        else if (j > 0)   // j follows matching prefix
          j = fail[j-1];
        else {   // no match
          fail[i] = 0;
          i++;
        }
      }
      return fail;
    }  // end of computeFail()

20  Example
• (Figure: P aligned under T during a KMP search; the strings are not recoverable from the transcript.)

    k     0  1  2  3  4
    F(k)  0  0  1  0  1

21  Why is F(4) == 1?
• P: "abacab"
• F(4) means: find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
   = find the size of the largest prefix of "abaca" that is also a suffix of "baca"
   = find the size of "a"
   = 1

22  Properties of KMP
• Time to find a match is only O(n), with O(m) preprocessing time
   - n = length of text; m = length of the pattern
• Can be modified to search for multiple patterns in a single search.
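The O(n) claim can be observed by counting comparisons on the input that is worst for brute force. The sketch below is a self-contained variant of the standard KMP matcher with a counter added; `KmpCount`, the `comparisons` field and `worstText` are names assumed for this illustration.

```java
class KmpCount {
    static long comparisons;  // text-vs-pattern comparisons in the last search

    // Standard failure-table construction, O(m).
    static int[] computeFail(String p) {
        int m = p.length();
        int[] fail = new int[m];
        int i = 1, j = 0;
        while (i < m) {
            if (p.charAt(j) == p.charAt(i)) { fail[i] = j + 1; i++; j++; }
            else if (j > 0) j = fail[j - 1];
            else { fail[i] = 0; i++; }
        }
        return fail;
    }

    // Returns the index where pattern starts in text, or -1.
    static int kmpMatch(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        comparisons = 0;
        if (m == 0) return 0;  // assumption: empty pattern matches at 0
        int[] fail = computeFail(pattern);
        int i = 0, j = 0;
        while (i < n) {
            comparisons++;
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == m - 1) return i - m + 1;
                i++; j++;
            } else if (j > 0) j = fail[j - 1];
            else i++;
        }
        return -1;
    }

    // `count` copies of 'a' followed by one 'b'.
    static String worstText(int count) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < count; k++) sb.append('a');
        return sb.append('b').toString();
    }

    public static void main(String[] args) {
        String t = worstText(24);
        System.out.println(kmpMatch(t, "aaab") + " with " + comparisons + " comparisons");
    }
}
```

Because i never moves backwards and each mismatch shortens j, the number of comparisons stays within 2n for any input.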

23  KMP Disadvantage
• KMP doesn't work so well as the size of the alphabet increases
   - more chance of a mismatch (more possible mismatches)
   - mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later

24  KMP Extensions
• The basic algorithm doesn't take into account the letter in the text that caused the mismatch.
• (Figure: T and P aligned at a mismatch, with a larger shift that uses the mismatched text character. Basic KMP does not do this.)

25  5. The Rabin-Karp Algorithm
• String search is based on a hash function applied to the pattern and to substrings of the text
• Look for a match by comparing the hash values, not the substrings.

26  Typical hash function

    long hash(String s)
    {
      long h = 0;
      for (int j = 0; j < s.length(); j++)
        h = (R * h + s.charAt(j)) % Q;   // % acts as mod
      return h;
    }

• R == radix; often 10 for numeric data; 128 for ASCII, etc.
• Q == a large prime number; e.g. 997
• hash("26535") == 613 with R = 10 and Q = 997, when each character is treated as its digit value (26535 mod 997 = 613)
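The value 613 depends on treating each character as a decimal digit rather than as its character code. The sketch below shows both readings side by side; `HashDemo`, `hashDigits` and `hashChars` are illustrative names, and passing R and Q as parameters is an assumption made to keep the example self-contained.

```java
class HashDemo {
    // Horner's rule over digit values: '0'..'9' contribute 0..9.
    static long hashDigits(String s, int R, long Q) {
        long h = 0;
        for (int j = 0; j < s.length(); j++)
            h = (R * h + (s.charAt(j) - '0')) % Q;
        return h;
    }

    // Same rule over raw character codes ('2' contributes 50, and so on).
    static long hashChars(String s, int R, long Q) {
        long h = 0;
        for (int j = 0; j < s.length(); j++)
            h = (R * h + s.charAt(j)) % Q;
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hashDigits("26535", 10, 997));  // 26535 mod 997
        System.out.println(hashChars("26535", 10, 997));   // differs: uses char codes
    }
}
```

Either reading works for searching, as long as the pattern and the text windows are hashed the same way.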

27  Hash Function explained
• The pattern P = p_0 p_1 p_2 ... p_{m-1} has m chars; compute hash(P).
• Examine m chars of the text T = t_0 t_1 t_2 ... at a time; call each such window X_i and compute hash(X_i).

28
• The hash function calculates (shown for the window that starts at t_0):

$$\mathrm{hash}(X_i) = \left( t_0 R^{m-1} + t_1 R^{m-2} + t_2 R^{m-3} + \dots + t_{m-2} R + t_{m-1} \right) \bmod Q$$

29  Example
• T = "31415926535" and P = "26"
• R = 10; Q = 11
• hash("ab") = (a*10 + b) mod 11
• hash(P) == hash("26") == 26 mod 11 == 4

30  Iterate through the Text
• "31": 31 mod 11 = 9, not equal to 4
• "14": 14 mod 11 = 3, not equal to 4
• "41": 41 mod 11 = 8, not equal to 4

31
• "15": 15 mod 11 = 4, equal to 4 -> wrong match
• "59": 59 mod 11 = 4, equal to 4 -> wrong match
• "92": 92 mod 11 = 4, equal to 4 -> wrong match
• "26": 26 mod 11 = 4, equal to 4 -> correct match
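The walk through the text can be reproduced with a few lines of code. This is a sketch under the slide's assumptions (digit values, R = 10, Q = 11); it recomputes each window hash from scratch for clarity, and the names `SpuriousHits`, `hash` and `hashHits` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SpuriousHits {
    // Hash of the m digits of s starting at index `from`, using digit values.
    static long hash(String s, int from, int m, int R, long Q) {
        long h = 0;
        for (int j = from; j < from + m; j++)
            h = (R * h + (s.charAt(j) - '0')) % Q;
        return h;
    }

    // Offsets whose window hash equals the pattern hash (correct or wrong matches).
    static List<Integer> hashHits(String text, String pat, int R, long Q) {
        List<Integer> hits = new ArrayList<>();
        int m = pat.length();
        long patHash = hash(pat, 0, m, R, Q);
        for (int i = 0; i + m <= text.length(); i++)
            if (hash(text, i, m, R, Q) == patHash)
                hits.add(i);
        return hits;
    }

    public static void main(String[] args) {
        String t = "31415926535", p = "26";
        for (int i : hashHits(t, p, 10, 11))
            System.out.println(i + (t.startsWith(p, i) ? ": correct match" : ": wrong match"));
    }
}
```

With such a small Q, three of the four hash hits are wrong matches, which is the motivation for choosing a large Q and double-checking.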

32  Why Wrong Matches?
• The hash() function uses modulo Q, so the range of results is 0 to Q-1.
• If Q is small then it is likely that two different strings will hash to the same result
   - the probability is about 1/Q
• Solution is to make Q very big, which reduces the chance of a wrong match (e.g. Q = 2^32 - 1, about 4.3 billion)
• Also double-check the match using string operations

33  Gambling Names
• Trusting the hash comparison alone is an example of a Monte Carlo algorithm
   - it's fast but may output an incorrect answer with a small probability (about 1/Q)
• The "double-checking" approach is known as a Las Vegas algorithm
   - it always gives the correct answer, but it can be slow

34  Speeding up hash Calculation
• After the hash() of the first substring of T, there is no need to keep calling hash() for the 2nd substring, 3rd substring, etc.
• It is possible to calculate the next hash, hash(X_{i+1}), from the current hash value, hash(X_i)
   - much faster (O(m) --> O(1) running time per substring)
   - less memory needed

35  Connection between hash()s

$$\mathrm{hash}(X_i) = \left( t_0 R^{m-1} + t_1 R^{m-2} + t_2 R^{m-3} + \dots + t_{m-2} R + t_{m-1} \right) \bmod Q$$

$$\mathrm{hash}(X_{i+1}) = \left( t_1 R^{m-1} + t_2 R^{m-2} + t_3 R^{m-3} + \dots + t_{m-1} R + t_m \right) \bmod Q$$

• (Figure: within T, X_i covers t_0 ... t_{m-1} and X_{i+1} covers t_1 ... t_m.)

36
• Therefore:

$$\begin{aligned}
\mathrm{hash}(X_{i+1}) &= \Big( \big( (\mathrm{hash}(X_i) - t_0 R^{m-1}) \bmod Q \big) \cdot R + t_m \Big) \bmod Q \\
&= \Big( \big( (\mathrm{hash}(X_i) + t_0 Q - t_0 R^{m-1}) \bmod Q \big) \cdot R + t_m \Big) \bmod Q \\
&= \Big( \big( \mathrm{hash}(X_i) + t_0 \, (Q - (R^{m-1} \bmod Q)) \big) \cdot R + t_m \Big) \bmod Q
\end{aligned}$$

• t_0 is the old front value; t_m is the new end value
• Q is included so that the mod value is positive
• R^{m-1} mod Q is a constant, which can be pre-calculated

37  Modulo Properties
• Using: (figure listing the modulo properties)
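The slide's own list of properties is not in the transcript. As an assumption, the identities that the rolling-hash derivation relies on are presumably the standard ones below, stated for integers a, b and a positive modulus Q, with mod returning a value in 0..Q-1.

```latex
\begin{align*}
(a + b) \bmod Q &= \big( (a \bmod Q) + (b \bmod Q) \big) \bmod Q \\
(a \cdot b) \bmod Q &= \big( (a \bmod Q) \cdot (b \bmod Q) \big) \bmod Q \\
(a - b) \bmod Q &= \big( a + Q - (b \bmod Q) \big) \bmod Q
\end{align*}
```

As a worked instance with R = 10, Q = 997 and m = 5 (so R^{m-1} mod Q = 30): moving from "31415" (hash 508) to "14159" removes the leading 3 via (508 + 997 - 30*3) mod 997 = 418, then appends 9 via (418*10 + 9) mod 997 = 201, which equals 14159 mod 997.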

38  Creating the Hash
• We move through the text left-to-right, one character at a time, building up the hash for an m-character substring from preceding hash values.

39  Hash of the Pattern
• P: "26535"
• R = 10, Q = 997
• (Figure: the hash value for the pattern.)

40  Hashing the Text Substrings
• T: "3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3"
• M = 5, R = 10, Q = 997
• In the code, RM = R^(M-1) mod Q
• (Figure: the hash values for the M-char substrings.)
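The hash values for the M-char substrings can be regenerated with the rolling update. The sketch below assumes digit values (as in the R = 10 example) and uses hypothetical names `RollingHashDemo`, `windowHashes` and `digit`; it also assumes the text has at least M characters.

```java
import java.util.Arrays;

class RollingHashDemo {
    static int digit(String s, int i) {
        return s.charAt(i) - '0';
    }

    // Hash of every M-digit window of txt, each derived from the previous one in O(1).
    static long[] windowHashes(String txt, int M, int R, long Q) {
        int N = txt.length();
        long RM = 1;                       // R^(M-1) mod Q
        for (int i = 1; i <= M - 1; i++)
            RM = (R * RM) % Q;

        long[] out = new long[N - M + 1];
        long h = 0;
        for (int j = 0; j < M; j++)        // first window: plain Horner's rule
            h = (R * h + digit(txt, j)) % Q;
        out[0] = h;

        for (int i = M; i < N; i++) {
            h = (h + Q - RM * digit(txt, i - M) % Q) % Q;  // remove leading digit
            h = (h * R + digit(txt, i)) % Q;               // add trailing digit
            out[i - M + 1] = h;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] hs = windowHashes("3141592653589793", 5, 10, 997);
        System.out.println(Arrays.toString(hs));
    }
}
```

The window at offset 6 is "26535", so its entry equals the pattern hash from the previous slide.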

41  Code (see RabinKarp.java)

    public static void main(String[] args)
    {
      // args[0] is the text, args[1] is the pattern
      if (args.length != 2) {
        System.out.println("Usage: java RabinKarp ");
        return;
      }
      RabinKarp searcher = new RabinKarp(args[1]);
      int pos = searcher.search(args[0]);
      showPos(args[0], args[1], pos);
    }  // end of main()

42
    import java.math.BigInteger;
    import java.util.Random;

    public class RabinKarp
    {
      private static final int R = 256;   // radix

      private String pat;     // the pattern; needs to be global for LV checking
      private long patHash;   // pattern hash value
      private int M;          // pattern length
      private long Q;         // a large prime, small enough to avoid long overflow
      private long RM;        // == R^(M-1) % Q

      public RabinKarp(String pat)
      {
        this.pat = pat;   // save pattern (needed only for Las Vegas)
        M = pat.length();
        Q = longRandomPrime();

        // precompute R^(M-1) % Q for use in removing leading digit
        RM = 1;
        for (int i = 1; i <= M - 1; i++)
          RM = (R * RM) % Q;
        patHash = hash(pat, M);
      }  // end of RabinKarp()
      :

43
      private static long longRandomPrime()
      // a random 31-bit probable prime
      {
        BigInteger prime = new BigInteger(31, 20, new Random());
        return prime.longValue();
      }

      private long hash(String key, int M)
      // Compute hash for key[0..M-1].
      {
        long h = 0;
        for (int j = 0; j < M; j++)
          h = (R * h + key.charAt(j)) % Q;
        return h;
      }  // end of hash()

44
      public int search(String txt)
      {
        int N = txt.length();
        if (N < M)
          return -1;
        long txtHash = hash(txt, M);

        // hash match found at offset 0, so double-check
        if ((patHash == txtHash) && check(txt, 0))
          return 0;

        // iterate through the text
        for (int i = M; i < N; i++) {
          // Calculate new hash by removing leading digit, add trailing digit
          txtHash = (txtHash + Q - RM * txt.charAt(i - M) % Q) % Q;
          txtHash = (txtHash * R + txt.charAt(i)) % Q;

          // found a hash match, so double-check
          int offset = i - M + 1;
          if ((patHash == txtHash) && check(txt, offset))
            return offset;
        }
        return -1;   // no match found
      }  // end of search()

45
      private boolean check(String txt, int i)
      // Las Vegas version: does pat[] match txt[i..i+M-1] ?
      {
        for (int j = 0; j < M; j++)
          if (pat.charAt(j) != txt.charAt(i + j))
            return false;
        return true;
      }  // end of check()

46  Properties of Rabin-Karp
• Has a poor worst-case running time (O(nm)), and so KMP is probably better for string searching.
• Rabin-Karp's hashing technique allows the search algorithm to be used on things other than text
   - e.g. image, audio, video search
• Rabin-Karp can be easily modified to do fast multiple pattern search.
   - check whether the hash of a string in the text belongs to a set of hash values of patterns
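The multiple-pattern idea can be sketched as follows, assuming all patterns have the same length so that one rolling hash serves them all. The class `MultiPatternRK`, the method `searchAny` and the fixed prime Q = 1,000,000,007 are assumptions made for this illustration (the slides' code picks a random prime instead); each hash hit is double-checked in the Las Vegas style.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MultiPatternRK {
    static final int R = 256;               // radix
    static final long Q = 1_000_000_007L;   // assumed fixed prime; R*Q fits in a long

    static long hash(String s, int m) {
        long h = 0;
        for (int j = 0; j < m; j++)
            h = (R * h + s.charAt(j)) % Q;
        return h;
    }

    // First offset in txt where any of the equal-length patterns occurs, or -1.
    static int searchAny(String txt, List<String> pats) {
        if (pats.isEmpty()) return -1;
        int m = pats.get(0).length();
        int n = txt.length();
        if (m == 0 || n < m) return -1;

        Set<Long> patHashes = new HashSet<>();
        Set<String> patSet = new HashSet<>(pats);
        for (String p : pats) {
            if (p.length() != m)
                throw new IllegalArgumentException("patterns must share one length");
            patHashes.add(hash(p, m));
        }

        long rm = 1;                          // R^(m-1) mod Q
        for (int i = 1; i <= m - 1; i++)
            rm = (R * rm) % Q;

        long h = hash(txt, m);
        if (patHashes.contains(h) && patSet.contains(txt.substring(0, m)))
            return 0;
        for (int i = m; i < n; i++) {
            h = (h + Q - rm * txt.charAt(i - m) % Q) % Q;   // remove leading char
            h = (h * R + txt.charAt(i)) % Q;                // add trailing char
            int offset = i - m + 1;
            // set membership replaces the single patHash comparison
            if (patHashes.contains(h) && patSet.contains(txt.substring(offset, offset + m)))
                return offset;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(searchAny("the rain in spain", Arrays.asList("spai", "rain")));
    }
}
```

The text is still scanned once; only the per-window test changes, from one hash comparison to a hash-set lookup.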

47  6. Summary

    Algorithm            Preprocessing time      Matching time (average, worst)
                         (m = pattern length)    (n = text length)
    Brute force          0 (no preprocessing)    O(n+m), O(nm)
    Knuth-Morris-Pratt   O(m)                    O(n)
    Rabin-Karp           O(m)                    O(n+m), O(nm)

• 35 algorithms with C code at http://www-igm.univ-mlv.fr/~lecroq/string/

