Exact String Matching Algorithms. Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU.


Classical Comparison-Based Methods: the Boyer-Moore Algorithm and the Knuth-Morris-Pratt (KMP) Algorithm.

Boyer-Moore Algorithm Basic ideas. From the naïve matching discussed previously: 1. successively align P with T and check for a match; 2. shift P to the right on match failure. New concepts relative to the naïve algorithm: 1. scan from right to left; 2. the (extended) bad character rule; 3. the good suffix shift rule.

Concept: Right-to-left Scan How can we check for a match of pattern P at location i in target T? The naïve algorithm scans left to right, comparing T[i+k] with P[1+k] for k = 0 to length(P) − 1. Example: P = adab, T = abaracadabara. Aligning P at the start of T, the scan finds (1) a == a, then (2) d != b, a mismatch. T: abaracadabara, P: adab.

Concept: Right-to-left Scan Alternatively, scan right to left, comparing T[i+k] with P[1+k] for k = length(P) − 1 down to 0. Example: P = adab, T = abaracadabara; the very first comparison, (1) b != r, already fails. T: abaracadabara, P: adab.

Concept: Right-to-left Scan Why is scanning right-to-left a good idea? Answer: by itself, it isn't any better than left-to-right; a naïve approach with right-to-left scanning is also Θ(nm). Larger shifts, supported by a clever bad character rule and a good suffix shift rule, make it better.
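As a concrete check of this point, here is a minimal Python sketch (the function name naive_rtl_match is mine) of naïve matching with a right-to-left scan at every alignment; without shift rules it still advances P only one position per attempt, hence Θ(nm):

```python
def naive_rtl_match(P, T):
    """Naive matching with a right-to-left scan at each alignment.

    Still O(n*m) worst case: the pattern advances only one position
    after every attempt, so only the scan direction has changed.
    """
    n, m = len(P), len(T)
    occurrences = []
    for k in range(m - n + 1):          # left end of each alignment in T
        i = n - 1
        while i >= 0 and P[i] == T[k + i]:
            i -= 1                      # scan the pattern right to left
        if i < 0:
            occurrences.append(k + 1)   # 1-based positions, as in the slides
    return occurrences
```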

Concept: Bad Character Rule Idea: the mismatched character indicates a safe minimum shift. Example: P = adacara, T = abaracadabara. Scanning right to left: (1) a == a, (2) r != c. T: abaracadabara, P: adacara. Here the bad character is c. Perhaps we should shift to align this character with its rightmost occurrence in P?

Concept: Bad Character Rule Shift P two positions to the right, aligning the rightmost occurrence of the mismatched character c in P with the c in T. Now start matching again, from right to left.

Concept: Bad Character Rule Second example: P = adacara, T = abaxaradabara. Scanning right to left: (1) a == a, (2) r == r, (3) a == a, (4) c != x. Here the bad character is x. The minimum shift should align this character with its rightmost occurrence in P. But x doesn't occur in P at all!

Concept: Bad Character Rule Second example continued: P = adacara, T = abaxaradabara. Since x doesn't occur in P, we can shift P entirely past it. Now start matching again, from right to left.

Concept: Bad Character Rule The idea of the bad character rule is to shift P by more than one character when possible. But the rule gives no useful shift when the rightmost occurrence lies at or to the right of the mismatched position, and unfortunately that is often the case. T: spbctbsatpqsctbpq, P: tpabsat.

Concept: Bad Character Rule We will define a bad character rule that uses the concept of the rightmost occurrence of each letter. Let R(x) be the rightmost position of the letter x in P, for each letter x in our alphabet. If x doesn't occur in P, define R(x) to be 0. Example: P = adacara gives R(a) = 7, R(b) = 0, R(c) = 4, R(d) = 2, R(r) = 6, R(z) = 0.
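A minimal sketch of this preprocessing in Python (the helper name rightmost_occurrence is mine): one left-to-right pass over P, letting later positions overwrite earlier ones.

```python
def rightmost_occurrence(P, alphabet):
    """Bad character table: R[x] = rightmost (1-based) position of x in P,
    or 0 if x does not occur in P."""
    R = {x: 0 for x in alphabet}
    for pos, ch in enumerate(P, start=1):   # later positions overwrite earlier
        R[ch] = pos
    return R
```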

Concept: Bad Character Rule T: spbctbsabpqsctbpq, P: tpabsab, with R(t) = 1 and R(s) = 5. Let i be the position of the mismatch in P (here i = 3) and k its counterpart in T (here k = 5, so T[k] = t). The bad character rule says P should be shifted right by max{1, i − R(T[k])}: if the rightmost occurrence of character T[k] in P is at position j < i, then P[j] lies below T[k] after the shift; otherwise we shift P by just one position, since R(T[k]) >= i implies 1 >= i − R(T[k]). Obviously this rule is not very useful when R(T[k]) >= i, which is usually the case for small alphabets such as DNA sequences.

Concept: Extended Bad Character Rule Extended Bad Character Rule: if P[i] mismatches T[k], shift P along T so that the closest occurrence in P of the letter T[k] to the left of position i is aligned with T[k]. Example: P = aracara, T = abararadabara. Scanning right to left: (1) a == a, (2) r == r, (3) a == a, (4) c != r. The rightmost occurrence of r in P is position 6, so i − R(T[k]) = 4 − 6 < 0 and the simple rule fails. The rightmost occurrence of r to the left of i in P is position 2, and 4 − 2 > 0, so this gives us a positive shift.

Concept: Extended Bad Character Rule The amount of shift is i − j, where i is the index of the mismatch in P and j is the rightmost occurrence of T[k] to the left of i in P. Example: P = aracara, T = abataradabara. Scanning right to left: (1) a == a, (2) r == r, (3) a == a, (4) c != t. There is no occurrence of t in P, so j = 0 and i − j = 4, which gives us a positive shift past the point of mismatch.

Concept: Extended Bad Character Rule How do we implement this rule? We preprocess P (from right to left), recording the positions of each letter's occurrences: for each character x in Σ, the alphabet, we create a list of its occurrences in P. If x doesn't occur in P, its list is empty.

Concept: Extended Bad Character Rule Example: Σ = {a, b, c, d, r, t}, P = abataradabara. a_list = (13, 11, 9, 7, 5, 3, 1), since a occurs at these positions in P; b_list = (10, 2); c_list = Ø; d_list = (8); r_list = (12, 6); t_list = (4).
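The preprocessing and lookup just described can be sketched in Python as follows (the function names occurrence_lists and ext_bad_char_shift are mine); building the lists right to left leaves each one already in decreasing position order, so a lookup scans for the first position strictly left of the mismatch:

```python
from collections import defaultdict

def occurrence_lists(P):
    """Extended bad character preprocessing: for each character, the list of
    its (1-based) positions in P, in decreasing order."""
    lists = defaultdict(list)
    for pos in range(len(P), 0, -1):        # scan P right to left
        lists[P[pos - 1]].append(pos)
    return lists

def ext_bad_char_shift(lists, i, mismatched):
    """Shift amount i - j, where j is the closest occurrence of the mismatched
    text character strictly to the left of position i in P (j = 0 if none)."""
    j = next((p for p in lists.get(mismatched, []) if p < i), 0)
    return i - j
```

For P = aracara with a mismatch at i = 4, this reproduces the two cases on the slides: a shift of 2 for bad character r, and a shift of 4 for the absent character t.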

Concept: Suffix Shift Rule Recall that we investigated finding prefixes before. Since we are matching P against T from right to left, we will instead need to use suffixes.

Suffix Shift Rule t is a suffix of P that matches a substring t of T, with mismatched characters x ≠ y immediately to its left; t′ is the rightmost copy of t in P such that t′ is not a suffix of P and the character z preceding t′ satisfies z ≠ y.

Concept: Suffix Shift Rule Consider the partial right-to-left matching of P to T below. This partial match involves α, a suffix of P.

Concept: Suffix Shift Rule This partial match ends where the first mismatch occurs, where x is aligned with d.

Concept: Suffix Shift Rule We want to find the rightmost copy α′ of this substring α in P such that: 1. α′ is not a suffix of P, and 2. the character to the left of α′ is not the same as the character to the left of α.

Concept: Suffix Shift Rule 1. If α′ exists, shift P to the right so that α′ is now aligned with the substring of T that was previously aligned with α.

Concept: Suffix Shift Rule 2. If α′ doesn't exist, shift P right by the least amount such that a prefix of P is aligned with a suffix of α in T.

Concept: Suffix Shift Rule 3. If α′ doesn't exist and no prefix of P matches a suffix of α in T, shift P right by n positions.

Preprocessing for the good suffix rule Let L(i) denote the largest position less than n such that P[i..n] matches a suffix of P[1..L(i)]; if there is no such position, then L(i) = 0. Example 1: if i = 17 then L(i) = 9. Example 2: if i = 16 then L(i) = 0.

Concept: Suffix Shift Rule Let L′(i) denote the largest position less than n such that P[i..n] matches a suffix of P[1..L′(i)] and such that the character preceding that suffix is not equal to P(i − 1); if there is no such position, then L′(i) = 0. Example 1: if i = 20 then L(i) = 12 and L′(i) = 6.

Concept: Suffix Shift Rule Example 2: for P = slydogsaddogdbadbaddog, if i = 19 then L(i) = 12 and L′(i) = 0.

Concept: Suffix Shift Rule Notice that L(i) indicates the rightmost copy of P[i..n] that is not a suffix of P. In contrast, L′(i) indicates the rightmost copy of P[i..n] that is not a suffix of P and whose preceding character differs from P(i − 1). The relation between L′(i) and L(i) is analogous to the relation between α′ and α.

Concept: Suffix Shift Rule Q: What is the point? A: If P(i − 1) causes the mismatch and L′(i) > 0, then we can shift P right by n − L′(i) positions.

Concept: Suffix Shift Rule If L(i) and L′(i) differ, then shifting by n − L′(i) positions is obviously a greater shift than shifting by n − L(i) positions.

Concept: Suffix Shift Rule Let N_j(P) denote the length of the longest suffix of P[1..j] that is also a suffix of P. Example 1: N_6(P) = 3 and N_12(P) = 5. Example 2: N_3(P) = 2, N_9(P) = 3, N_15(P) = 5, N_19(P) = 0.

Concept: Suffix Shift Rule Q: How are the concepts of N_i and Z_i related? Recall that Z_i is the length of the maximal substring starting at position i that matches a prefix of P. In contrast, N_i is the length of the maximal substring ending at position i that matches a suffix of P. In the case of Boyer-Moore, we are naturally interested in suffixes, since we scan right to left.

Concept: Suffix Shift Rule Let P^r denote the mirror image (reverse) of P; then the relationship can be expressed as N_j(P) = Z_(n−j+1)(P^r). In words, the length of the substring matching a suffix at position j in P equals the length of the corresponding substring matching a prefix in the reverse of P. Q: Why must this be true? A: Because they are the same substring, except that one is the reverse of the other.

Concept: Suffix Shift Rule Since N_j(P) = Z_(n−j+1)(P^r), we can use the Z algorithm to compute all N values in O(n) time. Q: How do we do this? A: We create P^r, the reverse of P, and process it with the Z algorithm.

Concept: Suffix Shift Rule N is the reverse of Z! Let P be the pattern and P^r the string obtained by reversing P; then N_j(P) = Z_(n−j+1)(P^r). Example: P: q c a b d a b d a b, P^r: b a d b a d b a c q.
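The reversal trick can be sketched in Python (the names z_array and n_array are mine), assuming the standard linear-time Z algorithm with the convention Z[0] = n:

```python
def z_array(S):
    """Z algorithm: Z[a] (0-based) = length of the longest substring starting
    at a that matches a prefix of S. Z[0] is set to len(S) by convention."""
    n = len(S)
    Z = [0] * n
    Z[0] = n
    l = r = 0                       # rightmost Z-box seen so far, [l, r)
    for a in range(1, n):
        if a < r:                   # reuse information from the enclosing box
            Z[a] = min(r - a, Z[a - l])
        while a + Z[a] < n and S[Z[a]] == S[a + Z[a]]:
            Z[a] += 1
        if a + Z[a] > r:
            l, r = a, a + Z[a]
    return Z

def n_array(P):
    """N_j(P) = Z_(n-j+1)(P^r): run the Z algorithm on the reversed pattern,
    then flip. Returns the 1-based N_j values, with N_j at index j-1."""
    Zr = z_array(P[::-1])
    n = len(P)
    return [Zr[n - j] for j in range(1, n + 1)]
```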

Concept: Suffix Shift Rule For pattern P, the N_j values (for j = 1,…,n) can be calculated in O(n) time using the Z algorithm. Why do we need N_j? To use the strong good suffix rule we need L′(i) for every i = 1,…,n, and we can get the L′(i) values from the N_j values!

Concept: Suffix Shift Rule We can then find the L′(i) and L(i) values from the N values in linear time with the following: for i = 1 to n { L′(i) = 0; } for j = 1 to n − 1 { i = n − N_j(P) + 1; L′(i) = j; } // L values (if desired) can be obtained by: L(2) = L′(2); for i = 3 to n { L(i) = max(L(i − 1), L′(i)); }

Concept: Suffix Shift Rule Example: P = asdbasasas, n = 10. Values of N_j(P) for j = 1,…,9: 0, 2, 0, 0, 0, 2, 0, 4, 0. Computed values of i = n − N_j(P) + 1: 11, 9, 11, 11, 11, 9, 11, 7, 11 (i = 11 falls outside P and records nothing). Resulting values of L′(i) for i = 1,…,10: 0, 0, 0, 0, 0, 0, 8, 0, 6, 0. The loops used: for i = 1 to n { L′(i) = 0; } for j = 1 to n − 1 { i = n − N_j(P) + 1; L′(i) = j; } then L(2) = L′(2); for i = 3 to n { L(i) = max(L(i − 1), L′(i)); }
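The two loops on this slide can be sketched in Python as follows (the function name big_l_primes is mine; N is the 1-based list of N_j values for j = 1,…,n−1, obtained from the Z-based method or by brute force). The guard on i only skips the out-of-range index that arises when N_j = 0:

```python
def big_l_primes(P, N):
    """L'(i) and L(i) tables (1-based; index 0 unused), following the
    slide's preprocessing loops. N[j-1] holds N_j(P) for j = 1..n-1."""
    n = len(P)
    Lp = [0] * (n + 2)
    for j in range(1, n):            # j = 1 .. n-1
        i = n - N[j - 1] + 1
        if i <= n:                   # i = n+1 means N_j = 0: nothing to record
            Lp[i] = j
    L = [0] * (n + 2)
    L[2] = Lp[2]
    for i in range(3, n + 1):
        L[i] = max(L[i - 1], Lp[i])
    return Lp, L
```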

Concept: Suffix Shift Rule Let l′(i) denote the length of the largest suffix of P[i..n] that is also a prefix of P, with l′(i) = 0 if no such suffix exists. Example: P = asasbsasas: l′(1) = 4, l′(2) = 4, l′(3) = 4, l′(4) = 4, l′(5) = 4, l′(6) = 4, l′(7) = 4, l′(8) = 2, l′(9) = 2, l′(10) = 0.

Concept: Suffix Shift Rule Thm: l′(i) = the largest j ≤ n − i + 1 such that N_j(P) = j. Q: How can we compute the l′(i) values in linear time? A: This is problem #9 in Chapter 2; it would make an interesting homework problem.
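The theorem above yields one possible linear-time sketch (this is my own attempt at the cited exercise, not the book's solution; the name small_l_prime is mine): scan i from n down to 1, remembering the largest j ≤ n − i + 1 with N_j(P) = j seen so far.

```python
def small_l_prime(P, N):
    """l'(i): length of the longest suffix of P[i..n] that is also a prefix
    of P, via the theorem l'(i) = largest j <= n-i+1 with N_j(P) = j.
    N is the 1-based list of N_j values (index j-1), with N_n = n."""
    n = len(P)
    lp = [0] * (n + 2)
    longest = 0
    for i in range(n, 0, -1):        # window size j = n-i+1 grows as i shrinks
        j = n - i + 1
        if N[j - 1] == j:            # P[1..j] is also a suffix of P
            longest = j
        lp[i] = longest
    return lp
```

For P = golgol (whose N values are 0, 0, 3, 0, 0, 6) this reproduces l′(1) = 6, l′(2) = l′(3) = l′(4) = 3, and l′(5) = l′(6) = 0.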

Boyer-Moore Algorithm Preprocessing: compute L′(i) and l′(i) for each position i in P, and compute R(x), the rightmost occurrence of x in P, for each character x in Σ. Search: k = n; while k ≤ m { i = n; h = k; while i > 0 and P(i) = T(h) { i = i − 1; h = h − 1; } if i = 0 { report occurrence of P in T ending at position k; k = k + n − l′(2); } else shift P (increase k) by the maximum amount indicated by the extended bad character rule and the good suffix rule; }
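Putting the pieces together, here is a hedged end-to-end sketch in Python (the name boyer_moore is mine). Two simplifications relative to the slides: the simple rightmost-occurrence bad character rule stands in for the extended one, and the Z values of the reversed pattern are computed by an O(n²) brute-force loop rather than the linear-time Z algorithm:

```python
def boyer_moore(P, T):
    """Sketch of Boyer-Moore search: strong good suffix rule plus the simple
    (rightmost-occurrence) bad character rule. Returns 1-based start
    positions of P in T."""
    n, m = len(P), len(T)
    # Z values of the reversed pattern give N_j(P) = Z_(n-j+1)(P^r)
    Pr = P[::-1]
    Zr = [0] * n
    Zr[0] = n
    for a in range(1, n):            # brute force, for brevity
        while a + Zr[a] < n and Pr[Zr[a]] == Pr[a + Zr[a]]:
            Zr[a] += 1
    N = [Zr[n - j] for j in range(1, n + 1)]
    # L'(i): rightmost copy of P[i..n] that is not a suffix of P
    Lp = [0] * (n + 2)
    for j in range(1, n):
        i = n - N[j - 1] + 1
        if i <= n:                   # i = n+1 would mean N_j = 0
            Lp[i] = j
    # l'(i): longest suffix of P[i..n] that is also a prefix of P
    lp = [0] * (n + 2)
    longest = 0
    for i in range(n, 0, -1):
        if N[n - i] == n - i + 1:
            longest = n - i + 1
        lp[i] = longest
    # rightmost occurrence of each character (simple bad character rule)
    R = {}
    for pos, ch in enumerate(P, start=1):
        R[ch] = pos
    occurrences = []
    k = n                            # T position aligned with the end of P
    while k <= m:
        i, h = n, k
        while i > 0 and P[i - 1] == T[h - 1]:
            i -= 1
            h -= 1
        if i == 0:                   # full match; left end is k - n + 1
            occurrences.append(k - n + 1)
            k += n - lp[2]           # match-case shift n - l'(2)
        else:
            bad = max(1, i - R.get(T[h - 1], 0))
            if i == n:               # no suffix matched yet
                good = 1
            elif Lp[i + 1] > 0:      # shift by n - L'(i+1)
                good = n - Lp[i + 1]
            else:                    # else shift by n - l'(i+1)
                good = n - lp[i + 1]
            k += max(bad, good)
    return occurrences
```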

Boyer-Moore Algorithm Example: P = golgol. Preprocessing: compute L′(i) and l′(i) for each position i in P. Notice that we first need the N_j(P) values in order to compute L′(i) and l′(i): for i = 1 to n { L′(i) = 0; } for j = 1 to n − 1 { i = n − N_j(P) + 1; L′(i) = j; }

Boyer-Moore Algorithm Example: P = golgol. Recall that N_j(P) is the length of the longest suffix of P[1..j] that is also a suffix of P. N_1(P) = 0, since no suffix of P ends with g. N_2(P) = 0, since no suffix of P ends with o. N_3(P) = 3, since the suffix gol of P matches P[1..3]. N_4(P) = 0 and N_5(P) = 0, again since no suffix of P ends with g or o. So N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3.

Boyer-Moore Algorithm Preprocessing: P = golgol, n = 6. N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3. Compute L′(i): for i = 1 to n { L′(i) = 0; } for j = 1 to n − 1 { i = n − N_j(P) + 1; L′(i) = j; }. j = 1 gives i = 7, out of range, so nothing is recorded; j = 2 gives i = 7, likewise; j = 3 gives i = 4, so L′(4) = 3; j = 4 gives i = 7; j = 5 gives i = 7. Hence L′(1) = L′(2) = L′(3) = L′(5) = L′(6) = 0 and L′(4) = 3.

Boyer-Moore Algorithm Preprocessing: P = golgol, n = 6. N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3; L′(1) = L′(2) = L′(3) = L′(5) = L′(6) = 0 and L′(4) = 3. Compute l′(i) for each position i in P; recall that l′(i) is the length of the longest suffix of P[i..n] that is also a prefix of P. l′(1) = 6, since golgol itself is a suffix of P[1..n] that is a prefix of P. l′(2) = 3, since gol is the longest suffix of P[2..n] that is a prefix of P. l′(3) = 3 and l′(4) = 3, likewise. l′(5) = 0 and l′(6) = 0, since no suffix of P[5..n] or of P[6..n] is a prefix of P. So l′(1) = 6, l′(2) = l′(3) = l′(4) = 3, and l′(5) = l′(6) = 0.

Boyer-Moore Algorithm Preprocessing: P = golgol, n = 6. N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3; L′(1) = L′(2) = L′(3) = L′(5) = L′(6) = 0 and L′(4) = 3; l′(1) = 6, l′(2) = l′(3) = l′(4) = 3, and l′(5) = l′(6) = 0. Compute R(x), the rightmost occurrence of x in P, for each character x in Σ = {g, o, l}: R(g) = 4, R(o) = 5, R(l) = 6.

Boyer-Moore Algorithm Preprocessing: P = golgol, n = 6; T = lolgolgol, m = 9. L′(1) = L′(2) = L′(3) = L′(5) = L′(6) = 0 and L′(4) = 3; l′(1) = 6, l′(2) = l′(3) = l′(4) = 3, l′(5) = l′(6) = 0; R(g) = 4, R(o) = 5, R(l) = 6. Search: k = n; while k ≤ m { i = n; h = k; while i > 0 and P(i) = T(h) { i = i − 1; h = h − 1; } if i = 0 { report occurrence of P in T ending at position k; k = k + n − l′(2); } else shift P (increase k) by the maximum amount indicated by the extended bad character rule and the good suffix rule; }

Search Trace: T = lolgolgol, P = golgol, k = 6. Right-to-left comparisons: i = 6, h = 6; i = 5, h = 5; i = 4, h = 4; i = 3, h = 3; i = 2, h = 2; all match, and then at i = 1, h = 1 we find P(1) ≠ T(1). Bad character rule: there is no occurrence of l, the mismatched character of T, to the left of P(1); this suggests shifting only 1 place. Good suffix rule: L′(2) = 0 and l′(2) = 3, so shift P by n − l′(2) = 6 − 3 = 3 places. Thus k = k + 3 = 9. The search loop being executed: k = 6; while k ≤ 9 { i = 6; h = k; while i > 0 and P(i) = T(h) { i = i − 1; h = h − 1; } if i = 0 { report occurrence of P in T at position k − n + 1; k = k + n − l′(2); } else shift P (increase k) by the maximum amount indicated by the extended bad character rule and the good suffix rule; }

Search Continued: T = lolgolgol, P = golgol. With k = 9, the comparisons i = 6, h = 9 down to i = 1, h = 4 all match, so i = 0: report an occurrence of P in T at position 4 (= k − n + 1) and set k = k + n − l′(2) = 9 + 6 − 3 = 12. Since k = 12 > 9, we are done!

Homework 1: Due Next Week. Implement the Boyer-Moore algorithm.

Break

KMP Algorithm Preliminaries: KMP can be easily explained in terms of finite state machines; KMP has an easily proved linear bound; KMP is usually not the method of choice in practice.

KMP Algorithm Recall that the naïve approach to string matching is Θ(mn). How can we reduce this complexity? Avoid redundant comparisons, and use larger shifts (cf. the Boyer-Moore good suffix rule and extended bad character rule).

KMP Algorithm KMP finds larger shifts by recognizing patterns in P. Let sp_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P. By definition sp_1 = 0 for any string. Q: Why does this make sense? A: The only proper suffix of a one-character string is the empty string.

KMP Algorithm Example: P = abcaeabcabd. P[1..2] = ab, hence sp_2 = 0. P[1..3] = abc, hence sp_3 = 0. P[1..4] = abca, hence sp_4 = 1. P[1..5] = abcae, hence sp_5 = 0. P[1..6] = abcaea, hence sp_6 = 1.

KMP Algorithm Example continued: P[1..7] = abcaeab, hence sp_7 = 2. P[1..8] = abcaeabc, hence sp_8 = 3. P[1..9] = abcaeabca, hence sp_9 = 4. P[1..10] = abcaeabcab, hence sp_10 = 2. P[1..11] = abcaeabcabd, hence sp_11 = 0.
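These values can be checked with a small brute-force Python sketch (the name sp_values is mine; it is O(n³) and intended only for verifying small examples like the one above):

```python
def sp_values(P):
    """sp_i(P): length of the longest proper suffix of P[1..i] that matches
    a prefix of P. Brute force; fine for checking small examples."""
    n = len(P)
    sp = [0] * (n + 1)                       # sp[i] for i = 1..n; sp[0] unused
    for i in range(1, n + 1):
        for length in range(i - 1, 0, -1):   # proper suffix: length < i
            if P[i - length:i] == P[:length]:
                sp[i] = length               # first (longest) hit wins
                break
    return sp
```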

KMP Algorithm Like the L/L′ concept for Boyer-Moore, there is an analogous sp_i/sp′_i concept. Let sp′_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that the characters P(i + 1) and P(sp′_i + 1) are unequal. Example: P = abcdabce, sp′_7 = 3. Obviously sp′_i(P) ≤ sp_i(P), since the latter is less restrictive.

KMP Algorithm KMP Shift Rule: 1. Mismatch case: let position i + 1 in P and position k in T be the site of the first mismatch in a left-to-right scan; shift P to the right, aligning P[1..sp′_i] with T[k − sp′_i..k − 1]. 2. Match case: if no mismatch is found, an occurrence of P has been found; shift P by n − sp′_n places to continue searching for other occurrences.

KMP Algorithm Observations: The prefix P[1..sp′_i] of the shifted P aligns with the corresponding substring of T. Subsequent character matching proceeds from position sp′_i + 1. Unlike Boyer-Moore, the matched substring is not compared again. The shift rule based on sp′_i guarantees that the exact same mismatch won't recur at position sp′_i + 1, but it doesn't guarantee that P(sp′_i + 1) = T(k).

KMP Algorithm Example: P = abcxabcde. If a mismatch occurs at position 8, P will be shifted 4 positions to the right. Q: Where does the 4-position shift come from? A: The shift amount is i − sp′_i; in this example i = 7 and sp′_7 = 3, so 7 − 3 = 4. Notice that we know the amount of the shift without knowing anything about T other than that there was a mismatch at position 8.

KMP Algorithm Example continued: P = abcxabcde. After the shift, P[1..3] lines up with T[k − 4..k − 1]. Since it is known that P[1..3] must match T[k − 4..k − 1], no comparison is needed; the scan continues from P(4) and T(k). Advantages of the KMP shift rule: 1. P is often shifted by more than one character (namely by i − sp′_i); 2. the leftmost sp′_i characters of the shifted P are known to match the corresponding characters of T.

KMP Algorithm Full example: T = xyabcxabcxadcdqfeg, P = abcxabcde. Assume we have already shifted past the first two positions of T. Matching left to right, positions 1 through 7 of P match; at position 8, d ≠ x, so we shift 4 places and start matching again from position 4 of P.

Preprocessing for KMP Approach: show how to derive the sp′ values from the Z values. Definition: position j > 1 maps to i if i = j + Z_j(P) − 1. Recall that Z_j(P) denotes the length of the Z-box starting at position j; this says that j maps to i if i is the right end of a Z-box starting at j.

Preprocessing for KMP Theorem. For any i > 1, sp′_i(P) = Z_j = i − j + 1, where j > 1 is the smallest position that maps to i; if no such j exists, then sp′_i(P) = 0. Similarly for sp: for any i > 1, sp_i(P) = i − j + 1, where j (with i ≥ j > 1) is the smallest position that maps to i or beyond; if no such j exists, then sp_i(P) = 0. Definition: position j > 1 maps to i if i = j + Z_j(P) − 1.

Preprocessing for KMP Given the theorem from the preceding slide, the sp′_i and sp_i values can be computed in linear time from the Z_i values: for i = 1 to n { sp′_i = 0; } for j = n downto 2 { i = j + Z_j(P) − 1; sp′_i = Z_j; } sp_n(P) = sp′_n(P); for i = n − 1 downto 2 { sp_i(P) = max(sp_(i+1)(P) − 1, sp′_i(P)); }
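The loops above can be sketched in Python as follows (the name sp_values_from_z is mine; the Z values are computed with a short brute-force loop here so the block stands alone, though the linear-time Z algorithm would normally be used):

```python
def sp_values_from_z(P):
    """sp'_i and sp_i via the slide's linear-time loops over Z values.
    Returns two 1-based lists (index 0 unused)."""
    n = len(P)
    # brute-force Z values (an O(n^2) stand-in for the linear Z algorithm)
    Z = [0] * n                      # Z[a] (0-based) holds Z_(a+1)(P)
    Z[0] = n
    for a in range(1, n):
        while a + Z[a] < n and P[Z[a]] == P[a + Z[a]]:
            Z[a] += 1
    spp = [0] * (n + 1)
    for j in range(n, 1, -1):        # for j = n downto 2
        i = j + Z[j - 1] - 1         # j maps to i
        spp[i] = Z[j - 1]
    sp = [0] * (n + 1)
    sp[n] = spp[n]
    for i in range(n - 1, 1, -1):    # sp_i = max(sp_(i+1) - 1, sp'_i)
        sp[i] = max(sp[i + 1] - 1, spp[i])
    return spp, sp
```

For P = abcaeabcabd this reproduces the sp values of the earlier example, and shows where sp′ is stricter (e.g. sp_7 = 2 but sp′_7 = 0, since P(8) = P(3)).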

Preprocessing for KMP Defn. Failure function F′(i) = sp′_(i−1) + 1 for 1 ≤ i ≤ n + 1, with sp′_0 = 0 (similarly F(i) = sp_(i−1) + 1, 1 ≤ i ≤ n + 1, with sp_0 = 0). Shifting is only conceptual: P is never explicitly shifted; instead, a pointer into P is reset. Two special cases: 1. on a mismatch at position 1, F′(1) = 1; 2. when a match is found, P shifts by n − sp′_n places, which is F′(n + 1) = sp′_n + 1.

Preprocessing for KMP Idea: we maintain a pointer i into P and a pointer c into T. After a mismatch of P(i + 1) with T(c), shift P to align P(sp′_i + 1) with T(c), i.e., set i = sp′_i + 1. Special case 1: i = 1, then set i = F′(1) = 1 and c = c + 1. Special case 2: we find P in T, then shift n − sp′_n places, i.e., i = F′(n + 1) = sp′_n + 1.

Full KMP Algorithm Preprocess P to find F′(k) = sp′_(k−1) + 1 for k from 1 to n + 1, where |P| = n and |T| = m. c = 1; p = 1; while c + (n − p) ≤ m { while P(p) = T(c) and p ≤ n { p = p + 1; c = c + 1; } if p = n + 1 then report an occurrence of P at position c − n of T; if p = 1 then c = c + 1; p = F′(p); }
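The full algorithm can be sketched in Python (the name kmp_search is mine; sp′ is computed by brute force here for clarity, though the Z-based preprocessing described above gives the same values in linear time):

```python
def kmp_search(P, T):
    """KMP search driven by the failure function F'(p) = sp'_(p-1) + 1.
    Returns 1-based start positions of P in T."""
    n, m = len(P), len(T)
    # sp'_i: longest proper suffix of P[1..i] matching a prefix of P, with
    # the next pattern characters unequal (no such condition when i = n)
    spp = [0] * (n + 1)
    for i in range(1, n + 1):
        for length in range(i - 1, 0, -1):
            if P[i - length:i] == P[:length] and (i == n or P[i] != P[length]):
                spp[i] = length
                break
    F = [0] * (n + 2)
    for k in range(1, n + 2):        # F'(k) = sp'_(k-1) + 1, with sp'_0 = 0
        F[k] = spp[k - 1] + 1
    occurrences = []
    c = p = 1                        # 1-based cursors into T and P
    while c + (n - p) <= m:
        while p <= n and P[p - 1] == T[c - 1]:
            p += 1
            c += 1
        if p == n + 1:               # full match ending at c - 1
            occurrences.append(c - n)
        if p == 1:                   # mismatch at the first character
            c += 1
        p = F[p]
    return occurrences
```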

Full KMP Algorithm Trace: T = xyabcxabcxabcdefeg, P = abcxabcde. Step 1: a ≠ x at p = 1, so p ≠ n + 1; since p = 1, set c = 2 and p = F′(1) = 1.

Full KMP Algorithm Step 2: a ≠ y at p = 1, so again set c = 3 and p = F′(1) = 1.

Full KMP Algorithm Step 3: P(1..7) matches T(3..9); at p = 8, d ≠ x; p ≠ n + 1 and p = 8 ≠ 1, so c is unchanged and p = F′(8) = 4.

Full KMP Algorithm Step 4: matching resumes with p = 4, c = 10; the remaining positions of P match, so p = n + 1: report an occurrence of P at position c − n of T.

Real-Time KMP Q: What is meant by real-time algorithms? A: Typically these are algorithms that must interact synchronously with the real world. This implies a known, fixed turnaround time for processing a task; many embedded scheduling systems involve real-time algorithms. For KMP this means we require a constant bound on the time spent at each character of T.

Real-Time KMP Q: Why is KMP not real-time? A: A mismatched character of T may be compared several times. Recall that sp′_i only guarantees that P(i + 1) and P(sp′_i + 1) differ; there is NO guarantee that P(sp′_i + 1) and T(k) match. We need to ensure that a mismatch at T(k) does NOT entail additional comparisons at T(k). This means we have to compute sp′ values with respect to all characters in Σ, since any of them could appear in T.

Real-Time KMP Define sp′_(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that the character P(sp′_(i,x) + 1) is x. This tells us exactly what shift to use for each possible mismatched character, so a mismatched character T(k) will never be involved in subsequent comparisons.

Real-Time KMP Q: How do we know that the mismatched character T(k) will never be involved in subsequent comparisons? A: Because the shift either aligns a matching pattern character with T(k) or shifts P entirely past T(k). This yields a real-time version of KMP. Let's consider how to find the sp′_(i,x)(P) values in linear time.

Real-Time KMP Thm. For P(i + 1) ≠ x, sp′_(i,x)(P) = i − j + 1, where j is the smallest position such that j maps to i and P(Z_j + 1) = x; if there is no such j, then sp′_(i,x)(P) = 0. for i = 1 to n { sp′_(i,x) = 0 for every character x; } for j = n downto 2 { i = j + Z_j(P) − 1; x = P(Z_j + 1); sp′_(i,x) = Z_j; }

Real-Time KMP Notice how this works, starting from the right: find i, the right end of the Z-box associated with j; find x, the character immediately following the prefix corresponding to this Z-box; and set sp′_(i,x) = Z_j, the length of this Z-box.
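The table-building loop can be sketched in Python as follows (the name sp_prime_xtable is mine; note that the loop reads Z_j where the slide prints Z_i, which appears to be a typo, and the Z values are again computed by a short brute-force loop for brevity):

```python
def sp_prime_xtable(P, alphabet):
    """Real-time KMP table: table[(i, x)] = length of the longest proper
    suffix of P[1..i] matching a prefix of P whose next prefix character
    is x (i.e. sp'_(i,x)(P))."""
    n = len(P)
    # brute-force Z values (stand-in for the linear-time Z algorithm)
    Z = [0] * n
    Z[0] = n
    for a in range(1, n):
        while a + Z[a] < n and P[Z[a]] == P[a + Z[a]]:
            Z[a] += 1
    table = {(i, x): 0 for i in range(1, n + 1) for x in alphabet}
    for j in range(n, 1, -1):        # descending, so the smallest j wins
        zj = Z[j - 1]
        if zj > 0:                   # j maps to i = j + Z_j - 1
            i = j + zj - 1
            x = P[zj]                # x = P(Z_j + 1): next prefix character
            table[(i, x)] = zj
    return table
```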

Reference: Chapters 1 and 2, Exact Matching: Fundamental Preprocessing and First Algorithms.