UMass Lowell Computer Science Analysis of Algorithms Prof. Karen Daniels Fall, 2006 Wednesday, 12/6/06 String Matching Algorithms Chapter 32
Chapter Dependencies Ch 32 String Matching Automata You’re responsible for material in Sections of this chapter.
String Matching Algorithms Motivation & Basics
String Matching Problem
source: textbook Cormen et al. (32.1)
Motivations: text editing, pattern matching in DNA sequences
Text: array T[1..n]
Pattern: array P[1..m]
Array element: character from a finite alphabet Σ
Pattern P occurs with shift s in T (0 <= s <= n-m) if P[1..m] = T[s+1..s+m]
String Matching Algorithms
- Naive Algorithm
  - Worst-case running time in O((n-m+1) m)
- Rabin-Karp
  - Worst-case running time in O((n-m+1) m)
  - Better than this on average and in practice
- Finite Automaton-Based
  - Worst-case running time in O(n + m|Σ|)
- Knuth-Morris-Pratt
  - Worst-case running time in O(n + m)
Notation & Terminology
- Σ* = set of all finite-length strings formed using characters from alphabet Σ
- Empty string: ε
- |x| = length of string x
- w is a prefix of x: w ⊏ x (x = wy for some string y in Σ*)
- w is a suffix of x: w ⊐ x (x = yw for some string y in Σ*)
- prefix and suffix are transitive relations
- Examples: ab ⊏ abcca, cca ⊐ abcca
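The prefix and suffix relations above can be checked directly in code. This is a minimal Python sketch; the helper names `is_prefix` and `is_suffix` are hypothetical and do not come from the slides.

```python
def is_prefix(w: str, x: str) -> bool:
    """Return True when w is a prefix of x, i.e. x = w + y for some string y."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """Return True when w is a suffix of x, i.e. x = y + w for some string y."""
    return x.endswith(w)
```

Note that the empty string is both a prefix and a suffix of every string, which these helpers respect.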
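Overlapping Suffix Lemma
source: textbook Cormen et al.
Suppose x, y, z are strings such that x ⊐ z and y ⊐ z.
- If |x| <= |y|, then x ⊐ y
- If |x| >= |y|, then y ⊐ x
- If |x| = |y|, then x = y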
String Matching Algorithms Naive Algorithm
Naive String Matching
source: textbook Cormen et al. (32.4)
Worst-case running time is in Θ((n-m+1)m)
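As a rough illustration of the naive approach, the sketch below tries every shift s and compares the pattern against the text window. It uses 0-based Python indexing, so `text[s:s + m]` plays the role of T[s+1..s+m]; the function name is an assumption, not the textbook's pseudocode.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return all shifts s at which pattern occurs in text (0-based)."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try every candidate shift s = 0, 1, ..., n - m.
    for s in range(n - m + 1):
        # The slice comparison costs up to m character comparisons,
        # giving (n - m + 1) * m comparisons in the worst case.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts
```

The worst case arises for inputs such as a text of all `a` characters and a pattern of all `a` characters, where every window is compared in full.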
String Matching Algorithms Rabin-Karp
Rabin-Karp Algorithm
source: textbook Cormen et al.
- Assume each character is a digit in radix-d notation (e.g. d = 10)
- p = decimal value of pattern P[1..m]
- t_s = decimal value of substring T[s+1..s+m] for s = 0, 1, ..., n-m
- Strategy:
  - compute p in O(m) time (which is in O(n))
  - compute all t_s values in a total of O(n) time
  - find all valid shifts s in O(n) time by comparing p with each t_s
- Compute p in O(m) time using Horner's rule:
  - p = P[m] + d(P[m-1] + d(P[m-2] + ... + d(P[2] + d P[1]) ...))
- Compute t_0 similarly from T[1..m] in O(m) time
- Compute remaining t_s values in O(n-m) time:
  - t_{s+1} = d(t_s - d^{m-1} T[s+1]) + T[s+m+1]
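Horner's rule and the window update can be sketched as follows. This is an illustrative Python fragment that assumes the text consists of decimal digit characters (so d <= 10); the function names `horner_value` and `window_values` are hypothetical.

```python
def horner_value(digits: str, d: int = 10) -> int:
    """Evaluate a digit string in radix d using Horner's rule, in O(m) time."""
    value = 0
    for ch in digits:
        value = d * value + int(ch)
    return value


def window_values(text: str, m: int, d: int = 10) -> list[int]:
    """Return t_0, t_1, ..., t_{n-m} for all length-m windows of text."""
    n = len(text)
    if m < 1 or m > n:
        return []
    high = d ** (m - 1)               # weight of the high-order digit
    t = horner_value(text[:m], d)     # t_0 from the first window
    values = [t]
    for s in range(n - m):
        # Drop the high-order digit text[s], shift left, append text[s + m].
        t = d * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values
```

For example, with the window 31415 followed by the digit 2, the update gives 10 * (31415 - 10000 * 3) + 2 = 14152.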
Rabin-Karp Algorithm
source: textbook Cormen et al. (32.5)
p and t_s may be too large to work with conveniently, so compute them modulo a suitable modulus q
Rabin-Karp Algorithm (continued)
source: textbook Cormen et al.
[Figure: worked example marking the value of p and a spurious hit]
Window update: t_{s+1} = d(t_s - d^{m-1} T[s+1]) + T[s+m+1]
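A spurious hit is a window whose value agrees with p modulo q even though the window differs from the pattern. The sketch below separates valid from spurious hits; the digit string, the pattern 31415, and q = 13 are assumptions chosen for illustration, and the window values are recomputed directly with `int()` for clarity rather than with the rolling update.

```python
def classify_hits(text: str, pattern: str, q: int) -> tuple[list[int], list[int]]:
    """Return (valid_shifts, spurious_shifts) for decimal digit strings."""
    m = len(pattern)
    p = int(pattern) % q
    valid, spurious = [], []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        if int(window) % q == p:
            # Equal residues are only a candidate; compare the digits to decide.
            if window == pattern:
                valid.append(s)
            else:
                spurious.append(s)
    return valid, spurious


valid, spurious = classify_hits("2359023141526739921", "31415", 13)
```

Here 31415 mod 13 = 7 and 67399 mod 13 = 7, so the window 67399 at shift 12 is a spurious hit while shift 6 is the valid match.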
Rabin-Karp Algorithm (continued) source: textbook Cormen et al.
Rabin-Karp Algorithm (continued)
source: textbook Cormen et al.
Worst-case running time is in Θ((n-m+1)m)
Annotations on the matcher pseudocode:
- d is radix, q is modulus
- h = d^{m-1} mod q: high-order digit position for m-digit window
- Preprocessing: Θ(m), which is in O(n)
- Matching: Θ((n-m+1)m); try all possible shifts
- Matching loop invariant: whenever line 10 is executed, t_s = T[s+1..s+m] mod q
- Explicit comparison of P[1..m] with T[s+1..s+m] rules out a spurious hit
Rabin-Karp Algorithm (continued)
source: textbook Cormen et al.
Average-case running time is in O(n+m)
- Assume reducing mod q is like a random mapping from Σ* to Z_q
- Estimate: Pr[t_s ≡ p (mod q)] = 1/q
- Number of spurious hits is in O(n/q)
- Expected matching time = O(n) + O(m(v + n/q)), where v = number of valid shifts
- If v is in O(1) and q >= m, then m(v + n/q) is in O(m + n), so the expected matching time is in O(n) (since m <= n)
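Putting the pieces together, a Rabin-Karp matcher might look like the sketch below. It assumes decimal digit characters (radix d = 10) and uses q = 13 as a default purely for illustration; in practice a much larger prime q would be chosen. The function name is hypothetical.

```python
def rabin_karp_matcher(text: str, pattern: str, d: int = 10, q: int = 13) -> list[int]:
    """Return all valid shifts of pattern in text (0-based), digit strings only."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)          # high-order digit weight, reduced mod q
    p = 0                         # pattern value mod q
    t = 0                         # current window value mod q
    # Preprocessing: Horner's rule mod q, Theta(m) time.
    for i in range(m):
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    shifts = []
    # Matching: try all possible shifts.
    for s in range(n - m + 1):
        # A hit mod q is only a candidate; the explicit comparison
        # rules out spurious hits.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Rolling update; Python's % always yields a value in [0, q).
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return shifts
```

The explicit comparison runs only when the residues agree, which is why the expected cost depends on the number of valid shifts plus the number of spurious hits.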
String Matching Algorithms Finite Automata
source: textbook Cormen et al. (32.6)
Strategy: Build automaton for pattern, then examine each text character once.
Worst-case running time is in Θ(n) + automaton creation time
Finite Automata source: textbook Cormen et al.
String-Matching Automaton
source: textbook Cormen et al. (32.7)
Pattern = P = ababaca
Automaton accepts strings ending in P
String-Matching Automaton
source: textbook Cormen et al.
Suffix function for P: σ(x) = length of longest prefix of P that is a suffix of x
Automaton's operational invariant at each step: it keeps track of the longest pattern prefix that is a suffix of what has been read so far
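The suffix function can be computed by brute force straight from its definition. The following Python sketch is meant only to make the definition concrete; the name `sigma` and its argument order are assumptions.

```python
def sigma(pattern: str, x: str) -> int:
    """Length of the longest prefix of pattern that is a suffix of x."""
    # Try the longest candidate prefix first and shrink until one fits.
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):
            return k
    return 0  # not reached: k = 0 (the empty prefix) always matches
```

For the pattern ab, this gives sigma of the empty string as 0, sigma of ccaca as 1, and sigma of ccab as 2.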
String-Matching Automaton
source: textbook Cormen et al.
Simulate behavior of string-matching automaton that finds occurrences of pattern P of length m in T[1..n]
Worst-case running time of matching is in Θ(n), assuming automaton has already been created
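The simulation reads each text character once and follows one transition per character. The sketch below assumes the transition function is supplied as a Python dict keyed by (state, character); the hand-written table `DELTA_AB` for the pattern ab over the alphabet {a, b} is an illustrative assumption.

```python
def finite_automaton_matcher(text: str, delta: dict[tuple[int, str], int], m: int) -> list[int]:
    """Return all shifts (0-based) at which the automaton reaches state m."""
    q = 0                      # start state: nothing matched yet
    shifts = []
    for i, ch in enumerate(text):
        q = delta[(q, ch)]     # one table lookup per text character
        if q == m:             # accepting state: all of P just matched
            shifts.append(i + 1 - m)
    return shifts


# Transition table for pattern "ab" over alphabet {a, b}; state = chars matched.
DELTA_AB = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
```

Because each character causes exactly one lookup, the matching loop does work proportional to n regardless of the pattern.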
String-Matching Automaton (continued) source: textbook Cormen et al. Correctness of matching procedure to be proved next…
String-Matching Automaton (continued) source: textbook Cormen et al. Correctness of matching procedure
String-Matching Automaton (continued)
source: textbook Cormen et al.
- Automaton creation time: worst case in O(m^3 |Σ|); can be improved to O(m |Σ|)
- Entire string-matching strategy, worst case: O(m |Σ|) (automaton creation time) + Θ(n) (pattern matching time)
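The O(m^3 |Σ|) bound comes from a direct construction: for each of the m+1 states and each alphabet character, try up to m+1 candidate prefix lengths, each checked with an O(m) suffix test. The sketch below follows that idea; the function name and the dict representation are assumptions.

```python
def compute_transition_function(pattern: str, alphabet: str) -> dict[tuple[int, str], int]:
    """Build delta(q, a) = length of longest prefix of pattern that is a suffix of P_q + a."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):                 # m + 1 states
        for a in alphabet:                 # |alphabet| characters
            read = pattern[:q] + a         # what state q has matched, plus a
            k = min(m, q + 1)              # longest prefix that could fit
            # Up to m + 1 candidates, each tested in O(m) time.
            while k > 0 and not read.endswith(pattern[:k]):
                k -= 1
            delta[(q, a)] = k
    return delta


delta = compute_transition_function("ababaca", "abc")
```

For the pattern ababaca, state 5 on character c advances to state 6, while state 5 on character b falls back to state 4.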
String Matching Algorithms Knuth-Morris-Pratt
Knuth-Morris-Pratt Overview
Achieve Θ(n+m) time by shortening automaton preprocessing time below O(m |Σ|)
- Approach:
  - don't precompute automaton's transition function
  - calculate enough transition data "on-the-fly"
  - obtain data via "alphabet-independent" pattern preprocessing
  - pattern preprocessing compares pattern against shifts of itself
Knuth-Morris-Pratt Algorithm
source: textbook Cormen et al. (32.10)
Determine how pattern matches against itself
Knuth-Morris-Pratt Algorithm
source: textbook Cormen et al. (32.5)
Prefix function π shows how pattern matches against itself
- π(q) is the length of the longest prefix of P that is a proper suffix of P_q
- Equivalently, what is the largest k < q such that P_k ⊐ P_q?
Example: [figure]
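A possible Python rendering of the prefix-function computation is sketched below. It uses 0-based lists, so `pi[q - 1]` holds the slide's π(q); the function name is an assumption.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """Return pi where pi[q - 1] is the length of the longest prefix of
    pattern that is a proper suffix of pattern[:q]."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                   # length of current matched prefix
    for q in range(1, m):
        # Fall back through shorter borders until the next character fits.
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi
```

For the pattern ababaca the result is 0, 0, 1, 2, 3, 0, 1: for instance, the prefix ababa has aba as its longest proper suffix that is also a prefix.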
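Knuth-Morris-Pratt Algorithm
source: textbook Cormen et al.
Annotations on the matcher pseudocode:
- Prefix-function computation: Θ(m), which is in O(n), using amortized analysis
- q = number of characters matched
- Scan text left-to-right
- Next character does not match: fall back using π
- Next character matches: advance q
- Is all of P matched? If so, report the shift and look for next match
- Matching loop: Θ(n)
- Total: Θ(m+n) using amortized analysis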
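The matcher annotated above can be sketched in Python as follows. The block is self-contained, so it recomputes the prefix function internally; indices are 0-based and the function name is an assumption.

```python
def kmp_matcher(text: str, pattern: str) -> list[int]:
    """Return all shifts (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    # Prefix function: pi[j] = longest proper border of pattern[:j + 1].
    pi = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[k] != pattern[j]:
            k = pi[k - 1]
        if pattern[k] == pattern[j]:
            k += 1
        pi[j] = k
    shifts = []
    q = 0                                    # number of characters matched
    for i in range(n):                       # scan text left-to-right
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]                    # next character does not match
        if pattern[q] == text[i]:
            q += 1                           # next character matches
        if q == m:                           # is all of P matched?
            shifts.append(i - m + 1)
            q = pi[q - 1]                    # look for next match
    return shifts
```

The text index i never moves backwards, which is the key difference from the naive matcher.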
Knuth-Morris-Pratt Algorithm: Amortized Analysis (Potential Method)
source: textbook Cormen et al.
Analysis of the prefix-function computation, Θ(m), which is in O(n):
- Potential = k, the current state of the algorithm
- Initial potential value is 0
- Potential decreases each time the while loop body runs, since π(k) < k
- Potential is never negative, since π(k) >= 0 for all k
- Potential increases by <= 1 in each execution of the for loop body
- Amortized cost of the loop body is in O(1)
- Θ(m) loop iterations
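The potential argument can be checked empirically by counting how often the inner while loop runs. Since k rises by at most 1 per for-loop iteration and every while step lowers k by at least 1 without going below 0, the total number of while steps cannot exceed m - 1. The instrumented function below is a hypothetical sketch written for this check.

```python
def prefix_function_with_counts(pattern: str) -> tuple[list[int], int]:
    """Return (pi, while_steps), where while_steps counts inner-loop executions."""
    m = len(pattern)
    pi = [0] * m
    k = 0                     # the potential: never negative
    while_steps = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]     # potential strictly decreases
            while_steps += 1
        if pattern[k] == pattern[q]:
            k += 1            # potential increases by at most 1 per iteration
        pi[q] = k
    return pi, while_steps
```

For ababaca the while loop runs only twice in total, both times when the character c breaks the run of matches.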
Knuth-Morris-Pratt Algorithm source: textbook Cormen et al. Correctness...
Knuth-Morris-Pratt Algorithm source: textbook Cormen et al. Correctness