Application: String Matching By Rong Ge COSC3100 Preprocessing Application: String Matching By Rong Ge COSC3100
Space-for-time tradeoffs Two varieties of space-for-time algorithms: input enhancement — preprocess the input (or its part) to store some info to be used later in solving the problem counting sorts (Last Lecture) string searching algorithms (today) prestructuring — preprocess the input to make accessing its elements easier hashing A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Review: String searching by brute force String Matching problem: Given a pattern P: a string of m characters to search for and a text T: a (long) string of n>=m characters to search in Return whether P occurs in T and the position of P in T Brute force algorithm Revisit Step 1 Align pattern P at beginning of text T Step 2 Scan from left to right, compare each character of P to the corresponding character in T until either all characters of P are found to match (successful search) or a mismatch is detected Step 3 While a mismatch is detected and T is not yet exhausted, realign P one position to the right and repeat Step 2 #comparison: O(mn) More efficient algorithms with O(n) time?
String searching by preprocessing Several string searching algorithms are based on the input enhancement idea of preprocessing the pattern P Horspool’s algorithm preprocess P to build one shift table tb align P against the beginning of T repeat until a matching substring is found or T ends: compare the corresponding characters from right to left for P if mismatch occurs, shift P to right by tb(c) where c is the rightmost character of T in the current alignment Boyer -Moore algorithm improves Horspool’s algorithm by using two tables: one good table and one bad table A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Horspool’s Algorithm Two key points: preprocesses P to generate a shift table that determines how much to shift P when a mismatch occurs always makes a shift based on the T’s character c aligned with the last character in P according to the shift table’s entry for c A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
How far to shift for each case? Focus on the rightmost character in T in the alignment: Case I: character C != B (mismatch), and C does not occur in P .....C...................... Text (C not in pattern) BAOBAB Pattern Case II: Character O != B (mismatch), but occur in P once .....O...................... (O occurs once in pattern) BAOBAB Character A != B (mismatch), and occurs in P more than once .....A...................... (A occurs twice in pattern) BAOBAB Case III: character R = R (match), but doesn’t appear in P in the first m-1 characters ...MER...................... LEADER Case IV: character B = B (match), and occurs in P in the first m-1 characters .....B...................... Your answers?
Build a shift table for a given pattern P An 1-D array indicating the shift size for each character c if it is the rightmost character of T in the alignment array size: number of characters in the alphabet E.g.: 26 if all characters in T are capital case English letters The array entry for a character c is computed as follows Cases I & III: P’s length m Case II & IV: distance from c’s rightmost occurrence in P among P’s first m-1 characters to P’s right end Horspool's alg. populates the shift table with two steps: Initialize each array entry as P’s length Scan P from left to right and update the array entries A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Build shift table Example: Pattern: BAOBAB Preprocessing Alphabet: capital case English letters Preprocessing Initialize all array entries as P’s length, which is 6 for BAOBAB - Taking care of cases I and III - Shift table is indexed by text and pattern alphabet Scan P from left to right for the first m-1 characters and update the table 0. BAOBAB: distance = 5 (B’s entry) 1. BAOBAB: distance = 4 (A’s entry) 2. BAOBAB: distance = 3 (O’s entry) 3. BAOBAB: distance = 2 (B’s entry) 4. BAOBAB: distance = 1 (A’s entry) skip the last character of P, - Taking care of cases II & IV Resulting shift table A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6 A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Example of Horspool’s alg. application A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6 _ 6 Alg: keep aligning P against T and comparing from right to left, shifting if mismatched. Example: BARD LOVED BANANAS (Text) BAOBAB (shift t(L)=6) BAOBAB (shift t(B)=2) BAOBAB (shift t(N)=6) BAOBAB (T out of bound, unsuccessful search) # total comparisons: 4. 1st alignment: 1 2nd alignment: 2 3rd alignment: 1 4th alignment: 0 How many comparisons with brute force alg? A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Boyer-Moore algorithm Based on same two ideas: comparing P to T from right to left for each alignment precomputing shift sizes in two tables bad-symbol table t1 indicates how much to shift based on text’s character causing a mismatch t1 is built the same as Horspool’s alg. good-suffix table d2 indicates how much to shift based on matched part (suffix) of the pattern A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Scenarios in string matching SI: the rightmost character of P doesn’t match, BM algorithm acts as Horspool’s. Shift P to right by t1(c) text pattern SII: k characters are matched, and then a mismatch occurs. 0 < k < m How much do we shift Pattern to right? Rightmost character of P doesn’t match c x k>0 matches c K characters x A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Prefix and suffix of strings Prefix of a string P: A prefix of a string P is a substring of P that occurs at the beginning of P. E.g. P: BANANA Suffix of a string P: A suffix of a string P is a substring that occurs at the end of P. Suffixes: A, NA, ANA, NANA, ANANA, BANANA Prefix B BA BAN BANA BANAN BANANA Length 1 2 3 4 5 6 suffix A NA ANA NANA ANANA BANANA Length 1 2 3 4 5 6 A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Good-suffix table d2 for pattern applied after a suffix of P matches T for an alignment, and then a mismatch occurs k is the length of the matched suffix of P, 0 < k < m Good-suffix table d2: an 1-D array with size of m, but the entry indexed by 0 is not used. Each entry indicates the shift size for the good, matched suffix A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Build good-suffix table d2 for the pattern For each entry indexed by k, set d2(k) with three cases in order, k is in [1, m-1] and corresponds to the suffix of P with size k the distance between matched suffix of size k and its rightmost occurrence in the pattern that is not preceded by the same character as the suffix E.g.: P: CABABA d2(k=1) = 4 //Suffix with size 1: A; the rightmost A that is not preceded by B is the second character in P. the distance between the longest part of the k-character suffix and the corresponding prefix of P, if there is no occurrence in case 1 E.g.: P: ANAN d2(k=3) = 2 // Suffix with size 3: NAN; the longest part of NAN that matches a prefix of P is AN m, the length of P if there is no occurrence in case 2 E.g.: P: ANAN d2(k=1) = 4 //Suffix with size 1: N A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Exercise Build the good suffix table for pattern WOWWOW A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Exercise Build the good suffix table for pattern WOWWOW k pattern d2 1 5 3 4 Case 1 Case 2: longest part of OW with a matched prefix is W Case 2: longest part of WWOW with a matched prefix is WOW A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
BM alg. for scenario II in string matching After successfully matching 0 < k < m characters, the algorithm shifts the pattern to right by d = max {d1, d2} where d1 = max{t1(c) - k, 1} is bad-symbol shift d2(k) is good-suffix shift A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Boyer-Moore Algorithm (cont.) Step 1 Fill in the bad-symbol shift table Step 2 Fill in the good-suffix shift table Step 3 Align the pattern against the beginning of the text Step 4 Repeat until a matching substring is found or text ends: Compare the corresponding characters right to left. If no characters match, retrieve entry t1(c) from the bad-symbol table for the text’s character c causing the mismatch and shift the pattern to the right by t1(c). If 0 < k < m characters are matched, retrieve entry t1(c) from the bad-symbol table for the text’s character c causing the mismatch and entry d2(k) from the good-suffix table and shift the pattern to the right by d = max {d1, d2} where d1 = max{t1(c) - k, 1}. A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Example of Boyer-Moore alg. application Table t1: B E S S _ N N E W _ A B O U T _ B A O B A B S B A O B A B 0 match shift d1 = t1(N) = 6 Table d2 2 matches: k=2 d1 = t1(_)-2 = 4 shift d2(2) = 5 1 match: k=1 shift d1 = t1(_)-1 = 5 d2(1) = 2 B A O B A B (success) #comp: 12 with BM, 13 with Horspool #alignment: 4 with BM, 5 with Horspool A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6 _ 6 k pattern d2 1 BAOBAB 2 5 3 4 A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Summary Preprocessing - preprocess the input (or its part) to store some info to be used later in solving the problem With preprocessing, Horspool’s and BM string algorithms are fast Comparison starts from right to left on P A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Finishing Counting Sort Alg. CountingSort(A[0,…,n-1], lb, ub) Input: an array A of integers between lb and ub (lb <=ub) Output: Array B[0,…,n-1] of A’s elements sorted in nondescending order for j 0 to ub-lb do //init freq to 0 C[j] 0 for i 0 to n-1 do //count freq C[A[i]-lb] C[A[i]-lb] + 1 for j 1 to ub-lb do //calc. last pos. of lb+j-1 C[j] C[j-1] + C[j] for i n-1 to 0 do //copy from A to B j A[i] – lb B[C[j]–1] A[i] C[j] C[j] – 1 E.G. A = {3, 2, 4, 1, 4, 1} A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.