Contest Algorithms, January 2016
13. String Searching
Three types of string search: brute force, Knuth-Morris-Pratt (KMP) and Rabin-Karp



1. What is String Searching?
• Definition: given a text string T and a search string (pattern) P, find P inside T
  • T: "the rain in spain stays mainly on the plain"
  • P: "n th"
• Applications: text editors, Web search engines (e.g. Google), image analysis

String Concepts
• Assume S is a string of size m.
• A substring S[i..j] of S is the string fragment between indexes i and j (both inclusive).
• A prefix of S ("start of S") is a substring S[0..i].
• A suffix of S ("end of S") is a substring S[i..m-1].
• i is any index between 0 and m-1.

Examples
• S = "andrew" (indexes 0 to 5)
• Substring S[1..3] == "ndr"
• All possible prefixes of S:
  • "andrew", "andre", "andr", "and", "an", "a"
• All possible suffixes of S:
  • "andrew", "ndrew", "drew", "rew", "ew", "w"
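The definitions above can be checked with a few lines of Java. This is a minimal sketch; the class and method names (`StringParts`, `sub`, `prefixes`, `suffixes`) are hypothetical and not part of the slides. Note that `String.substring` takes an exclusive end index, while the slide notation S[i..j] is inclusive.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper; the class name is an assumption, not from the slides.
final class StringParts {
    // S[i..j] with both indexes inclusive, as in the slide notation.
    static String sub(String s, int i, int j) {
        return s.substring(i, j + 1);   // substring's end index is exclusive
    }

    // All non-empty prefixes S[0..i], longest first.
    static List<String> prefixes(String s) {
        List<String> out = new ArrayList<>();
        for (int i = s.length() - 1; i >= 0; i--)
            out.add(sub(s, 0, i));
        return out;
    }

    // All non-empty suffixes S[i..m-1], longest first.
    static List<String> suffixes(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++)
            out.add(s.substring(i));
        return out;
    }
}
```

Calling `StringParts.prefixes("andrew")` and `StringParts.suffixes("andrew")` should reproduce the two lists on the slide.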

2. The Brute Force Algorithm
• Check each position in the text T to see if the pattern P starts in that position
(figure: T = "andrew", P = "rew"; P moves 1 char at a time through T)

Code (see BruteSearch.java)

public static int brute(String text, String pattern)
// return the index of the first match of pattern in text, or -1
{
  int n = text.length();
  int m = pattern.length();
  int j;
  for (int i = 0; i <= (n-m); i++) {   // try each start position in the text
    j = 0;
    while ((j < m) && (text.charAt(i+j) == pattern.charAt(j)))
      j++;
    if (j == m)
      return i;   // match at i
  }
  return -1;   // no match
}  // end of brute()

Properties of Brute-force Search
• Easy to code
• No preprocessing needs to be done on the pattern
• Usually takes O(n+m) steps, which is not so bad
  • n = length of text; m = length of pattern
• Worst case is O(nm), e.g. when searching for aaab in aaaaaaaaaaaaaaaaaaaaaaaab
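To make the O(nm) worst case concrete, the sketch below counts character comparisons using the same loop structure as brute(). The class name `BruteCount` and the counting method are assumptions added for illustration only.

```java
// Illustrative only: counts comparisons made by the brute-force loop.
final class BruteCount {
    // Same scan as brute(), but returns the number of character comparisons
    // performed up to and including the first match (or the end of the text).
    static long comparisons(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        long count = 0;
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                count++;                                   // one text/pattern comparison
                if (text.charAt(i + j) != pattern.charAt(j))
                    break;
                j++;
            }
            if (j == m)
                return count;                              // stop at the first match
        }
        return count;
    }
}
```

For "aaab" in "aaaaaab" every start position compares all 4 pattern characters, giving (n-m+1)*m = 16 comparisons, whereas a text that mismatches on the first character at every position needs only one comparison per position.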

3. The KMP Algorithm
• The Knuth-Morris-Pratt (KMP) algorithm shifts the pattern more intelligently than the brute force algorithm.
  • its shifts can be bigger than a single character move

• If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?
• Answer: the largest prefix of P[0..j-1] that is a suffix of P[1..j-1]

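Example
(figure: pattern P aligned against text T with a mismatch at text index i and pattern index j = 5; after the shift, the new j = 2)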

Why (j == 5)
• Find the largest prefix (start) of "abaab" (P[0..j-1]) which is a suffix (end) of "baab" (P[1..j-1])
• Answer: "ab"
• Set j = 2   // the new j value

KMP Failure Function
• KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
• j = mismatch position in P[]
• k = position before the mismatch (k = j-1).
• The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].

Failure Function Example
• P: "abaaba"

  k     0  1  2  3  4  5
  P[k]  a  b  a  a  b  a
  F(k)  0  0  1  1  2  3

• F(k) is the size of the largest prefix of P[0..k] that is also a suffix of P[1..k] (k == j-1)
• In code, F() is represented by an array, like the table.

Why is F(4) == 2? (P: "abaaba")
• F(4) means
  • find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaab" that is also a suffix of "baab"
  = find the size of "ab"
  = 2
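The definition can be evaluated directly, without the clever O(m) construction, to check table entries by hand. The following is a deliberately naive sketch (roughly cubic time in the worst case); `NaiveFail` and `failByDefinition` are hypothetical names, and this is not the computeFail() code used by the slides.

```java
// Illustrative only: computes F(k) straight from its definition.
final class NaiveFail {
    static int[] failByDefinition(String p) {
        int m = p.length();
        int[] f = new int[m];
        for (int k = 0; k < m; k++) {
            // A suffix of P[1..k] has at most k characters, so try lengths k, k-1, ..., 1.
            for (int len = k; len >= 1; len--) {
                // Compare prefix P[0..len-1] with suffix P[k-len+1..k].
                if (p.regionMatches(0, p, k - len + 1, len)) {
                    f[k] = len;        // first hit is the largest
                    break;
                }
            }
            // If no length matched, f[k] stays 0.
        }
        return f;
    }
}
```

For "abaaba" this gives F(4) = 2, matching the walk-through above.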

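Using the Failure Function
• Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm.
  • if a mismatch occurs at P[j] (i.e. P[j] != T[i]) and j > 0, then
      k = j-1;
      j = F(k);   // obtain the new j
  • if the mismatch occurs at j == 0, there is nothing to reuse, so move on to the next text character (i++)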

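Code (see KmpSearch.java)
Returns the index where the pattern starts, or -1

int kmpMatch(String text, String pattern)
{
  int n = text.length();
  int m = pattern.length();
  int fail[] = computeFail(pattern);
  int i = 0;   // index into the text
  int j = 0;   // index into the pattern

  while (i < n) {
    if (pattern.charAt(j) == text.charAt(i)) {
      if (j == m-1)
        return i - m + 1;   // match
      i++;
      j++;
    }
    else if (j > 0)
      j = fail[j-1];   // reuse the matched prefix
    else
      i++;
  }
  return -1;   // no match
}  // end of kmpMatch()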

Similar code to kmpMatch()

int[] computeFail(String pattern)
{
  int fail[] = new int[pattern.length()];
  fail[0] = 0;
  int m = pattern.length();
  int j = 0;
  int i = 1;

  while (i < m) {
    if (pattern.charAt(j) == pattern.charAt(i)) {   // j+1 chars match
      fail[i] = j + 1;
      i++;
      j++;
    }
    else if (j > 0)   // j follows matching prefix
      j = fail[j-1];
    else {   // no match
      fail[i] = 0;
      i++;
    }
  }
  return fail;
}  // end of computeFail()

Example
(figure: KMP matching of pattern P against text T, with the failure table F(k) for P)

Why is F(4) == 1? (P: "abacab")
• F(4) means
  • find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaca" that is also a suffix of "baca"
  = find the size of "a"
  = 1

Properties of KMP
• Time to find a match is only O(n), with O(m) preprocessing time
  • n = length of text; m = length of the pattern
• Can be modified to search for multiple patterns in a single search.
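As a small illustration of the O(n + m) behaviour, here is a hedged sketch of a variant that reports every occurrence of a single pattern rather than stopping at the first. It is not the slides' KmpSearch.java: the class name `KmpAll` and the method `findAll` are assumptions, and the failure function is recomputed inside the class so the example is self-contained. After a full match, j falls back to fail[m-1] so that overlapping occurrences are found without rescanning the text.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: KMP that collects all match positions.
final class KmpAll {
    static int[] computeFail(String p) {
        int m = p.length();
        int[] fail = new int[m];
        int i = 1, j = 0;
        while (i < m) {
            if (p.charAt(i) == p.charAt(j)) { fail[i] = j + 1; i++; j++; }
            else if (j > 0) j = fail[j - 1];
            else { fail[i] = 0; i++; }
        }
        return fail;
    }

    static List<Integer> findAll(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        int n = text.length(), m = pattern.length();
        if (m == 0 || m > n) return hits;          // nothing to search for
        int[] fail = computeFail(pattern);
        int i = 0, j = 0;
        while (i < n) {
            if (text.charAt(i) == pattern.charAt(j)) {
                i++; j++;
                if (j == m) {                      // full match ending at i-1
                    hits.add(i - m);
                    j = fail[m - 1];               // keep the longest reusable prefix
                }
            } else if (j > 0) {
                j = fail[j - 1];
            } else {
                i++;
            }
        }
        return hits;
    }
}
```

The text index i never moves backwards, which is where the linear matching time comes from.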

KMP Disadvantage
• KMP doesn't work so well as the size of the alphabet increases
  • more chance of a mismatch (more possible mismatches)
  • mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later

KMP Extensions
• The basic algorithm doesn't take into account the letter in the text that caused the mismatch.
(figure: example shift of P against T after a mismatch on the text character x. Basic KMP does not do this.)

5. The Rabin-Karp Algorithm
• String search is based on a hash function applied to the pattern and to substrings in the text
• Look for a match by comparing the hash values, not the substrings.

Typical hash function

long hash(String s)
// R and Q are constants defined elsewhere (see below)
{
  long h = 0;
  for (int j = 0; j < s.length(); j++)
    h = (R * h + s.charAt(j)) % Q;   // % acts as mod
  return h;
}

• R == radix; often 10 for numeric data; 128 for ASCII, etc.
• Q == a large prime number; e.g. 997
• hash("26535") == 613 (with R = 10, Q = 997, and each character taken as its digit value rather than its character code)

Hash Function explained
(figure: the pattern P = p0 p1 ... p(m-1) has m chars and hash value hash(P); the text T = t0 t1 t2 ... is examined m chars at a time, and each m-char window Xi has hash value hash(Xi))

• The hash function calculates (shown for the window that starts at $t_0$):
  $\mathrm{hash}(X_i) = (t_0 R^{m-1} + t_1 R^{m-2} + t_2 R^{m-3} + \dots + t_{m-2} R + t_{m-1}) \bmod Q$

Example
• T is a string of decimal digits; P = "26"
• R = 10; Q = 11
• hash("ab") = (a*10 + b) mod 11, where a and b are digit values
• hash(P) == hash("26") == 26 mod 11 == 4

Iterate through the Text
(figure: each 2-digit window of T is hashed and compared with hash(P) == 4. The early windows do not equal 4, e.g. 31 mod 11 = 9. Three later windows give a hash of 4 but contain different digits, so they are wrong matches. The window "26" gives 26 mod 11 = 4 and is the correct match.)
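The walk-through can be replayed in code. The transcript does not preserve the digits of T, so the sample text "31415926" below is an assumption, chosen only because it begins with the window 31 and produces three wrong matches before the correct one. The class name `SmallHashDemo` and its methods are likewise hypothetical.

```java
// Illustrative only: 2-digit windows hashed with R = 10 and a small modulus q.
final class SmallHashDemo {
    // Hash of the 2-digit window starting at i, digits taken as numeric values.
    static int hash2(String digits, int i, int q) {
        int a = digits.charAt(i) - '0';
        int b = digits.charAt(i + 1) - '0';
        return (a * 10 + b) % q;
    }

    // Scans the text for a 2-digit pattern.
    // Returns { number of wrong hash matches seen, offset of the first true match or -1 }.
    static int[] scan(String text, String pat, int q) {
        int target = hash2(pat, 0, q);
        int wrong = 0;
        for (int i = 0; i + 2 <= text.length(); i++) {
            if (hash2(text, i, q) != target)
                continue;                          // hashes differ: definitely no match
            if (text.startsWith(pat, i))
                return new int[] { wrong, i };     // hashes equal and digits equal
            wrong++;                               // hashes equal but digits differ
        }
        return new int[] { wrong, -1 };
    }
}
```

With the assumed text, the windows 15, 59 and 92 all hash to 4 (wrong matches) before "26" is found at offset 6. A modulus as small as 11 makes such collisions frequent.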

Why Wrong Matches?
• The hash() function uses modulo Q, so the range of results is 0 to Q-1.
• If Q is small then it is likely that two different strings will hash to the same result
  • the probability is about 1/Q
• The solution is to make Q very big, which reduces the chance of a wrong match (e.g. Q of about 4.3 billion)
• Also double-check the match using string operations

Gambling Names
• Trusting a hash match without checking the characters is an example of a Monte Carlo algorithm
  • it's fast but may output an incorrect answer with a small probability (about 1/Q)
• The "double-checking" approach is known as a Las Vegas algorithm
  • the answer is always correct, but it can be slow

Speeding up hash Calculation
• After the hash() of the first substring of T, there is no need to keep calling hash() for the 2nd substring, 3rd substring, etc.
• It is possible to calculate the next hash, hash(Xi+1), from the current hash value, hash(Xi)
  • much faster (O(m) --> O(1) running time per substring)
  • less memory needed

Connection between hash()s
• $\mathrm{hash}(X_i) = (t_0 R^{m-1} + t_1 R^{m-2} + t_2 R^{m-3} + \dots + t_{m-2} R + t_{m-1}) \bmod Q$
• $\mathrm{hash}(X_{i+1}) = (t_1 R^{m-1} + t_2 R^{m-2} + t_3 R^{m-3} + \dots + t_{m-1} R + t_m) \bmod Q$
(figure: in T, the window $X_i$ covers $t_0 \dots t_{m-1}$ and the window $X_{i+1}$ covers $t_1 \dots t_m$)

• Therefore:
  $\mathrm{hash}(X_{i+1}) = \big( ((\mathrm{hash}(X_i) - t_0 R^{m-1}) \bmod Q) \cdot R + t_m \big) \bmod Q$
  $= \big( ((\mathrm{hash}(X_i) + t_0 Q - t_0 R^{m-1}) \bmod Q) \cdot R + t_m \big) \bmod Q$
  $= \big( (\mathrm{hash}(X_i) + t_0 (Q - (R^{m-1} \bmod Q))) \cdot R + t_m \big) \bmod Q$
• $t_0$ is the old front value; $t_m$ is the new end value
• the $t_0 Q$ term is included so that the mod value is positive
• $R^{m-1} \bmod Q$ is a constant, which can be pre-calculated
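The rolling update can be sanity-checked by comparing it with a full recomputation at every window. This is a minimal sketch under stated assumptions: digits are hashed by numeric value with R = 10 and Q = 997, and the names `RollingHashCheck`, `roll` and `consistent` are hypothetical.

```java
// Illustrative only: verifies the O(1) rolling update against the O(m) hash.
final class RollingHashCheck {
    static final int R = 10;      // radix for decimal digits
    static final long Q = 997;    // small prime, as in the slides' example

    // Horner hash of digits[from .. from+m-1], digits taken as numeric values.
    static long hash(String digits, int from, int m) {
        long h = 0;
        for (int j = 0; j < m; j++)
            h = (R * h + (digits.charAt(from + j) - '0')) % Q;
        return h;
    }

    // hash(X_{i+1}) from hash(X_i): drop the old front digit, append the new end digit.
    // rm must be R^(m-1) mod Q; (Q - rm) keeps the intermediate value non-negative.
    static long roll(long h, int oldFront, int newEnd, long rm) {
        h = (h + oldFront * (Q - rm)) % Q;
        return (h * R + newEnd) % Q;
    }

    // True if rolling agrees with recomputation for every m-digit window.
    static boolean consistent(String digits, int m) {
        if (digits.length() < m) return true;      // no windows to compare
        long rm = 1;
        for (int i = 1; i < m; i++) rm = (rm * R) % Q;
        long h = hash(digits, 0, m);
        for (int i = 1; i + m <= digits.length(); i++) {
            h = roll(h, digits.charAt(i - 1) - '0', digits.charAt(i + m - 1) - '0', rm);
            if (h != hash(digits, i, m)) return false;
        }
        return true;
    }
}
```

For example, with m = 5 the window "31415" hashes to 508 and R^4 mod 997 is 30; rolling out the 3 and rolling in a 9 gives 201, which equals 14159 mod 997.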

Modulo Properties
(figure: the modulo properties used in the derivation)
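The slide's figure is not preserved in this transcript. As an assumption about what it showed, these are the standard modular arithmetic identities that the rolling-hash derivation relies on, for integers a, b and a positive modulus Q:

```latex
\begin{align*}
(a + b) \bmod Q &= \big( (a \bmod Q) + (b \bmod Q) \big) \bmod Q \\
(a \cdot b) \bmod Q &= \big( (a \bmod Q) \cdot (b \bmod Q) \big) \bmod Q \\
(a - b) \bmod Q &= \big( (a \bmod Q) - (b \bmod Q) + Q \big) \bmod Q
\end{align*}
```

The added Q in the last identity changes nothing mathematically, but it keeps the intermediate value non-negative, which matters in Java because `%` can return a negative remainder.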

Creating the Hash
• We move through the text left-to-right, one character at a time, building up the hash for an m-character substring from preceding hash values.

Hash of the Pattern
• P: "26535"
• R = 10, Q = 997
(figure: the step-by-step calculation of the hash value for the pattern)
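Since the figure with the step-by-step calculation is lost, the sketch below records the intermediate values of the left-to-right (Horner) evaluation. It assumes each character is a decimal digit taken by numeric value; `HornerTrace` and `steps` are hypothetical names.

```java
// Illustrative only: intermediate hash values after each digit of the pattern.
final class HornerTrace {
    static long[] steps(String digits, int r, long q) {
        long[] out = new long[digits.length()];
        long h = 0;
        for (int j = 0; j < digits.length(); j++) {
            h = (r * h + (digits.charAt(j) - '0')) % q;   // shift left one digit, add the next
            out[j] = h;
        }
        return out;
    }
}
```

For "26535" with R = 10 and Q = 997 the successive values are 2, 26, 265, 659 (2653 mod 997) and 613 (6595 mod 997), so the pattern's hash is 613.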

Hashing the Text Substrings
• T: a string of decimal digits
• M = 5, R = 10, Q = 997
• In the code, RM = R^(M-1) mod Q
(figure: the hash values for the M-char substrings of T)

Code (see RabinKarp.java)

public static void main(String[] args)
{
  if (args.length != 2) {
    System.out.println("Usage: java RabinKarp <text> <pattern>");
    return;
  }
  RabinKarp searcher = new RabinKarp(args[1]);   // args[1] is the pattern
  int pos = searcher.search(args[0]);            // args[0] is the text
  showPos(args[0], args[1], pos);                // helper defined in RabinKarp.java
}  // end of main()

import java.math.BigInteger;
import java.util.Random;

public class RabinKarp
{
  private static final int R = 256;   // radix

  private String pat;     // the pattern; kept as a field for the Las Vegas check
  private long patHash;   // pattern hash value
  private int M;          // pattern length
  private long Q;         // a large prime, small enough to avoid long overflow
  private long RM;        // == R^(M-1) % Q

  public RabinKarp(String pat)
  {
    this.pat = pat;   // save pattern (needed only for Las Vegas)
    M = pat.length();
    Q = longRandomPrime();

    // precompute R^(M-1) % Q for use in removing the leading digit
    RM = 1;
    for (int i = 1; i <= M - 1; i++)
      RM = (R * RM) % Q;
    patHash = hash(pat, M);
  }  // end of RabinKarp()
  :

  private static long longRandomPrime()
  // a random 31-bit probable prime
  {
    BigInteger prime = new BigInteger(31, 20, new Random());
    return prime.longValue();
  }

  private long hash(String key, int M)
  // Compute hash for key[0..M-1].
  {
    long h = 0;
    for (int j = 0; j < M; j++)
      h = (R * h + key.charAt(j)) % Q;
    return h;
  }  // end of hash()

  public int search(String txt)
  // return the offset of the first match of the pattern in txt, or -1
  {
    int N = txt.length();
    if (N < M)
      return -1;
    long txtHash = hash(txt, M);

    // if there is a hash match at offset 0, double-check it
    if ((patHash == txtHash) && check(txt, 0))
      return 0;

    // iterate through the text
    for (int i = M; i < N; i++) {
      // calculate the new hash by removing the leading digit, then adding the trailing digit
      txtHash = (txtHash + Q - RM * txt.charAt(i - M) % Q) % Q;
      txtHash = (txtHash * R + txt.charAt(i)) % Q;

      // if there is a hash match, double-check it
      int offset = i - M + 1;
      if ((patHash == txtHash) && check(txt, offset))
        return offset;
    }
    return -1;   // no match found
  }  // end of search()

  private boolean check(String txt, int i)
  // Las Vegas version: does pat[] match txt[i..i+M-1] ?
  {
    for (int j = 0; j < M; j++)
      if (pat.charAt(j) != txt.charAt(i + j))
        return false;
    return true;
  }  // end of check()

Properties of Rabin-Karp
• Has a poor worst-case running time (O(nm)), and so KMP is probably better for string searching.
• Rabin-Karp's hashing technique allows the search algorithm to be used on things other than text
  • e.g. image, audio, video search
• Rabin-Karp can be easily modified to do fast multiple pattern search.
  • check whether the hash of a string in the text belongs to a set of hash values of patterns
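The multiple pattern idea can be sketched as follows. This is one possible arrangement under stated assumptions, not the slides' code: all patterns must have the same length, Q is a fixed prime (1,000,000,007) rather than a random one, and the names `MultiRabinKarp` and `searchAny` are hypothetical. Pattern hashes are stored in a map so that each text window needs only one lookup, followed by a Las Vegas check of the candidates.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: Rabin-Karp for several equal-length patterns.
final class MultiRabinKarp {
    static final int R = 256;                 // radix
    static final long Q = 1_000_000_007L;     // assumed fixed prime; products stay within long

    static long hash(String s, int from, int m) {
        long h = 0;
        for (int j = 0; j < m; j++)
            h = (R * h + s.charAt(from + j)) % Q;
        return h;
    }

    // Offset of the first window of text that equals any pattern, or -1.
    static int searchAny(String text, List<String> patterns) {
        if (patterns.isEmpty()) return -1;
        int m = patterns.get(0).length();
        int n = text.length();
        if (m == 0 || n < m) return -1;

        // Group the patterns by hash value.
        Map<Long, List<String>> byHash = new HashMap<>();
        for (String p : patterns) {
            if (p.length() != m)
                throw new IllegalArgumentException("patterns must share one length");
            byHash.computeIfAbsent(hash(p, 0, m), k -> new ArrayList<>()).add(p);
        }

        long rm = 1;                           // R^(m-1) mod Q
        for (int i = 1; i < m; i++) rm = (rm * R) % Q;

        long h = hash(text, 0, m);
        for (int i = 0; ; i++) {
            List<String> candidates = byHash.get(h);
            if (candidates != null)
                for (String p : candidates)
                    if (text.startsWith(p, i)) return i;   // double-check the characters
            if (i + m >= n) return -1;                     // no further window
            h = (h + Q - rm * text.charAt(i) % Q) % Q;     // remove the leading char
            h = (h * R + text.charAt(i + m)) % Q;          // add the trailing char
        }
    }
}
```

For instance, searching "the rain in spain" for any of "spa", "rai" or "xyz" returns 4, the offset of "rai". Patterns of different lengths would need one hash set and one rolling hash per length.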

6. Summary

Algorithm            Preprocessing time      Matching time (average, worst)
                     (m = pattern length)    (n = text length)
Brute force          0 (no preprocessing)    O(n+m), O(nm)
Knuth-Morris-Pratt   O(m)                    O(n)
Rabin-Karp           O(m)                    O(n+m), O(nm)

35 algorithms with C code at