Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Exact String Matching Algorithms.

Classical Comparison Based Methods Boyer-Moore Algorithm Knuth-Morris-Pratt Algorithm (KMP Algorithm)

Boyer-Moore Algorithm
Basic ideas:
– Previously discussed ideas for naïve matching:
  1. Successively align P and T to check for a match.
  2. Shift P to the right on match failure.
– New concepts with respect to the naïve algorithm:
  1. Scan from right to left, i.e., compare the last character of P first.
  2. Bad character rule.
  3. Suffix shift rule.

Concept: Right-to-left Scan
How can we check for a match of pattern P at location i in target T? The naïve algorithm scanned left to right, i.e., it compared T[i+k] with P[1+k] for k = 0 to length(P)-1.
Example: P = adab, T = abaracadabara
T: a b a r a c a d a b a r a
P: a d a b
Comparisons, left to right: 1. a == a, 2. d != b.

Concept: Right-to-left Scan
Alternatively, scan right to left, i.e., compare T[i+k] with P[1+k] for k = length(P)-1 down to 0.
Example: P = adab, T = abaracadabara
T: a b a r a c a d a b a r a
P: a d a b
Comparisons, right to left: 1. b != r.

Concept: Right-to-left Scan Why is scanning right-to-left a good idea? Answer: by itself, it isn’t any better than left-to-right. – A naïve approach with right-to-left scanning is also Θ(nm) in the worst case. – Larger shifts, supported by a clever bad character rule and a suffix shift rule, make it better.
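To make the point concrete, here is a minimal Python sketch of the naïve matcher with a right-to-left scan. The function name `naive_right_to_left` is made up for illustration; positions are reported 1-based to match the slides. It still tries every alignment, so it gains nothing over the left-to-right version.

```python
def naive_right_to_left(pattern, text):
    """Return the 1-based start positions of pattern in text.

    Every alignment is tried (shift by one each time); only the scan
    direction inside an alignment is right to left.
    """
    n, m = len(pattern), len(text)
    hits = []
    for start in range(m - n + 1):      # 0-based left end of the alignment
        k = n - 1                       # begin at the last character of P
        while k >= 0 and text[start + k] == pattern[k]:
            k -= 1
        if k < 0:                       # ran off the left end: full match
            hits.append(start + 1)
    return hits


print(naive_right_to_left("adab", "abaracadabara"))
```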

Concept: Bad Character Rule
Idea: the mismatched character indicates a safe minimum shift.
Example: P = adacara, T = abaracadabara
T: a b a r a c a d a b a r a
P: a d a c a r a
Comparisons, right to left: 1. a == a, 2. r != c.
Here the bad character is c. Perhaps we should shift to align this character with its rightmost occurrence in P?

Concept: Bad Character Rule
Shift two positions to align the rightmost occurrence of the mismatched character c in P.
T: a b a r a c a d a b a r a
P:     a d a c a r a
Now, start matching again from right to left.

Concept: Bad Character Rule
Second Example: P = adacara, T = abaxaradabara
T: a b a x a r a d a b a r a
P: a d a c a r a
Comparisons, right to left: 1. a == a, 2. r == r, 3. a == a, 4. c != x.
Here the bad character is x. The minimum shift should align this character with its occurrence in P. But x doesn’t occur in P!

Concept: Bad Character Rule
Second Example: P = adacara, T = abaxaradabara
Since x doesn’t occur in P, we can shift past it.
T: a b a x a r a d a b a r a
P:         a d a c a r a
Now, start matching again from right to left.

Concept: Bad Character Rule
The idea of the bad character rule is to shift P by more than one character when possible. This fails, however, when the rightmost position of the mismatched character in P is greater than the mismatch position. Unfortunately, that is often the case.
   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat

Concept: Bad Character Rule
We will define a bad character rule that uses the concept of the rightmost occurrence of each letter. Let R(x) be the rightmost position of the letter x in P for each letter x in our alphabet. If x doesn’t occur in P, define R(x) to be 0.
Example: P = adacara (positions 1234567)
x:    a b c d ... z
R(x): 7 0 4 2 ... 0

Concept: Bad Character Rule
   12345678901234567
T: spbctbsabpqsctbpq
P:   tpabsab
R(t) = 1, R(s) = 5.
i: the position of the mismatch in P; here i = 3.
k: the corresponding position in T; here k = 5 and T[k] = t.
The bad character rule says P should be shifted right by max{1, i - R(T[k])}. That is, if the rightmost occurrence of character T[k] in P is at position j with j < i, then P[j] should be below T[k] after the shift. Otherwise, when R(T[k]) >= i, we have i - R(T[k]) <= 0 and we shift P by one position. Obviously this rule is not very useful when R(T[k]) >= i, which is usually the case for DNA sequences.
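The R(x) table and the shift max{1, i - R(T[k])} can be sketched in a few lines of Python. The helper names `rightmost_occurrence` and `bad_character_shift` are illustrative only; positions are 1-based as on the slides, and a character absent from P is treated as R(x) = 0.

```python
def rightmost_occurrence(pattern):
    """R(x): 1-based rightmost position of each character that occurs in pattern."""
    R = {}
    for pos, ch in enumerate(pattern, start=1):
        R[ch] = pos                     # later positions overwrite earlier ones
    return R


def bad_character_shift(R, i, text_char):
    """Shift given a mismatch at pattern position i against text_char."""
    return max(1, i - R.get(text_char, 0))


R = rightmost_occurrence("tpabsab")
print(R["t"], R["s"])                   # the slide's R(t) and R(s)
print(bad_character_shift(R, 3, "t"))   # mismatch at i = 3 against T[k] = t
```

When R(T[k]) >= i the function falls back to a shift of 1, which is the weak case the slide warns about.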

Concept: Extended Bad Character Rule
Extended Bad Character Rule: If P[i] mismatches T[k], shift P along T so that the closest occurrence of the letter T[k] in P to the left of i in P is aligned with T[k].
Example: P = aracara, T = abararadabara
T: a b a r a r a d a b a r a
P: a r a c a r a
Comparisons, right to left: 1. a == a, 2. r == r, 3. a == a, 4. c != r.
Position 6 is the rightmost occurrence of r in P. Notice that i - R(T[k]) < 0, i.e., 4 – 6 < 0.
Position 2 is the rightmost occurrence of r to the left of i in P. Notice that 4 – 2 > 0, i.e., this gives us a positive shift.

Concept: Extended Bad Character Rule
The amount of shift is i – j, where:
– i is the index of the mismatch in P.
– j is the rightmost occurrence of T[k] to the left of i in P.
Example: P = aracara, T = abataradabara
T: a b a t a r a d a b a r a
P: a r a c a r a
Comparisons, right to left: 1. a == a, 2. r == r, 3. a == a, 4. c != t.
There is no occurrence of t in P, thus j = 0. Notice that i – j = 4, i.e., this gives us a positive shift past the point of mismatch.

Concept: Extended Bad Character Rule How do we implement this rule? We preprocess P (from right to left), recording the position of each occurrence of the letters. For each character x in Σ, the alphabet, create a list of its occurrences in P. If x doesn’t occur in P, then it has an empty list.

Concept: Extended Bad Character Rule
Example: Σ = {a, b, c, d, r, t}, P = abataradabara
a_list = <13, 11, 9, 7, 5, 3, 1>, since ‘a’ occurs at these positions in P
b_list = <10, 2>
c_list = Ø
d_list = <8>
r_list = <12, 6>
t_list = <4>
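The occurrence lists and the extended shift i – j can be sketched as follows. The names `occurrence_lists` and `extended_bad_character_shift` are hypothetical; each list is stored in decreasing order because P is preprocessed from right to left, so the first entry smaller than i is the closest occurrence to the left of the mismatch.

```python
def occurrence_lists(pattern):
    """Map each character to its 1-based positions in pattern, largest first."""
    lists = {}
    for pos in range(len(pattern), 0, -1):      # scan P from right to left
        lists.setdefault(pattern[pos - 1], []).append(pos)
    return lists


def extended_bad_character_shift(lists, i, text_char):
    """Return i - j, where j is the closest occurrence of text_char left of i (0 if none)."""
    for pos in lists.get(text_char, []):
        if pos < i:
            return i - pos
    return i                                     # j = 0: shift past the mismatch


lists = occurrence_lists("aracara")
print(extended_bad_character_shift(lists, 4, "r"))   # r occurs at position 2
print(extended_bad_character_shift(lists, 4, "t"))   # t does not occur in P
```

Scanning a list costs time proportional to the number of entries skipped, which is bounded by the number of characters already compared in the current alignment.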

Concept: Suffix Shift Rule Recall that we investigated finding prefixes before. Since we are matching P to T from right-to-left, we will instead need to use suffixes.

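Suffix Shift Rule: t is a suffix of P that matches a substring t of T, and the characters just to the left of the two copies mismatch (x ≠ y, with x in T and y in P). t’ is the right-most copy of t in P such that t’ is not a suffix of P and z ≠ y, where z is the character to the left of t’ in P.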

Concept: Suffix Shift Rule Consider the partial right-to-left matching of P to T below. This partial match involves t, a suffix of P.

Concept: Suffix Shift Rule This partial match ends where the first mismatch occurs, where x is aligned with d.

Concept: Suffix Shift Rule We want to find a right-most copy t´ of this substring t in P such that: 1. t´ is not a suffix of P, and 2. the character to the left of t´ is not the same as the character to the left of t.

Concept: Suffix Shift Rule 1. If t´ exists, shift P to the right such that t´ is now aligned with the substring in T that was previously aligned with t.

Concept: Suffix Shift Rule 2. If t´ doesn’t exist, shift P right by the least amount such that a prefix of P is aligned with a suffix of t in T.

Concept: Suffix Shift Rule 3. If t´ doesn’t exist, and there is no prefix of P that matches a suffix of t in T, shift P right by n positions.

Preprocessing for the good suffix rule Let L(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L(i)]. If there is no such position, then L(i) = 0 Example 1: If i = 17 then L(i) = 9 Example 2: If i = 16 then L(i) = 0

Concept: Suffix Shift Rule Let L´(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L´(i)] and s.t. the character preceding that suffix is not equal to P(i-1). If there is no such position, then L´(i) = 0. Example 1: P = slydogsaddogdbadbaddog. If i = 20 then L(i) = 12 and L´(i) = 6.

Concept: Suffix Shift Rule Example 2: P = slydogsaddogdbadbaddog. If i = 19 then L(i) = 12 and L´(i) = 0.

Concept: Suffix Shift Rule Notice that L(i) indicates the right-most copy of P[i..n] that is not a suffix of P. In contrast, L´(i) indicates the right-most copy of P[i..n] that is not a suffix of P and whose preceding character doesn’t match P(i-1). The relation between L´(i) and L(i) is analogous to the relation between t´ and t.

Concept: Suffix Shift Rule Q: What is the point? A: If P(i - 1) causes the mismatch and L´(i) > 0, then we can shift P right by n - L´(i) positions. Example:

Concept: Suffix Shift Rule If L(i) and L´(i) are different, then obviously shifting by n - L´(i) positions is a greater shift than n - L(i). Example:

Concept: Suffix Shift Rule Let N j (P) denote the length of the longest suffix of P[1..j] that is also a suffix of P. Example 1: N 6 (P) = 3 and N 12 (P) = 5. Example 2: N 3 (P) = 2, N 9 (P) = 3, N 15 (P) = 5, N 19 (P) = 0.

Concept: Suffix Shift Rule Q: How are the concepts of N_i and Z_i related? Recall that Z_i = length of a maximal substring starting at position i, which is a prefix of P. In contrast, N_i = length of a maximal substring ending at position i, which is a suffix of P. In the case of Boyer-Moore, we are naturally interested in suffixes since we are scanning right-to-left.

Concept: Suffix Shift Rule Let P^r denote the mirror image of P, then the relationship can be expressed as N_j(P) = Z_{n-j+1}(P^r). In words, the length of the substring matching a suffix at position j in P is equal to the length of the corresponding substring matching a prefix in the reverse of P. Q: Why must this be true? A: Because they are the same substring, except that one is the reverse of the other.

Concept: Suffix Shift Rule Since N j (P) = Z n-j+1 (P r ), we can use the Z algorithm to compute N in O(n). Q: How do we do this? A: We create P r, the reverse of P, and process it with the Z algorithm.

Concept: Suffix Shift Rule
N is the reverse of Z! Let P be the pattern and P^r the string obtained by reversing P. Then N_j(P) = Z_{n-j+1}(P^r).
pos:  1 2 3 4 5 6 7 8 9 10
P:    q c a b d a b d a b
N_j:  0 0 0 2 0 0 5 0 0 0
P^r:  b a d b a d b a c q
Z_i:  0 0 0 5 0 0 2 0 0 0
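The relation N_j(P) = Z_{n-j+1}(P^r) translates directly into code. The sketch below uses a standard linear-time Z computation with 0-based indices internally; the names `z_array` and `n_values` are made up for illustration, and the first Z entry (hence N_n) is left at 0 to match the table on the slide.

```python
def z_array(s):
    """z[k] = length of the longest substring starting at index k that is a prefix of s.

    z[0] is left at 0 by convention. Runs in O(len(s)).
    """
    n = len(s)
    z = [0] * n
    left = right = 0                     # current Z-box is s[left:right]
    for k in range(1, n):
        if k < right:                    # reuse information inside the Z-box
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1                    # extend by explicit comparison
        if k + z[k] > right:
            left, right = k, k + z[k]
    return z


def n_values(pattern):
    """Return [N_1, ..., N_n]: run Z on the reversed pattern, then reverse the result."""
    return z_array(pattern[::-1])[::-1]


print(n_values("qcabdabdab"))
```

Reversing the Z list is what turns index n-j+1 of P^r back into index j of P.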

Concept: Suffix Shift Rule For pattern P, N_j (for j = 1, …, n) can be calculated in O(n) using the Z algorithm. Why do we need to define N_j? To use the strong good suffix rule, we need to find L’(i) for every i = 1, …, n. We can get L’(i) from N_j!

Concept: Suffix Shift Rule
We can then find L´(i) and L(i) values from N values in linear time with the following:
For i = 1 to n { L´(i) = 0; }
For j = 1 to n – 1 { i = n - N_j(P) + 1; L´(i) = j; }
// L values (if desired) can be obtained
L(2) = L´(2);
For i = 3 to n { L(i) = max(L(i - 1), L´(i)); }

Concept: Suffix Shift Rule
Example: P = asdbasasas, n = 10
j:                  1   2   3   4   5   6   7   8   9
N_j(P):             0   2   0   0   0   2   0   4   0
i = n - N_j(P) + 1: 11  9   11  11  11  9   11  7   11
Values of L´(1), …, L´(9): 0, 0, 0, 0, 0, 0, 8, 0, 6
For i = 1 to n { L´(i) = 0; }
For j = 1 to n – 1 { i = n - N_j(P) + 1; L´(i) = j; }
L(2) = L´(2);
For i = 3 to n { L(i) = max(L(i - 1), L´(i)); }
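The two loops above can be written as a short Python sketch. It takes the N values as input (the list N_1 … N_{n-1}, here copied from the slide) so that it stays independent of how N was computed; the function name `l_prime_and_l` and the choice to skip the out-of-range index i = n + 1 when N_j = 0 are assumptions of this sketch.

```python
def l_prime_and_l(N):
    """N = [N_1, ..., N_{n-1}]. Return (L_prime, L) as 1-based lists (index 0 unused)."""
    n = len(N) + 1
    L_prime = [0] * (n + 1)
    for j in range(1, n):
        if N[j - 1] > 0:                 # N_j = 0 would give i = n + 1, outside P
            i = n - N[j - 1] + 1
            L_prime[i] = j               # later (larger) j overwrites earlier j
    L = [0] * (n + 1)
    if n >= 2:
        L[2] = L_prime[2]
    for i in range(3, n + 1):
        L[i] = max(L[i - 1], L_prime[i])
    return L_prime, L


L_prime, L = l_prime_and_l([0, 2, 0, 0, 0, 2, 0, 4, 0])   # P = asdbasasas
print(L_prime[1:])
print(L[1:])
```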

Concept: Suffix Shift Rule
Let l´(i) denote the length of the largest suffix of P[i..n] that is also a prefix of P. Let l´(i) = 0 if no such suffix exists.
Example: P = asasbsasas
l’(1) = 4, l’(2) = 4, l’(3) = 4, l’(4) = 4, l’(5) = 4, l’(6) = 4, l’(7) = 4, l’(8) = 2, l’(9) = 2, l’(10) = 0

Concept: Suffix Shift Rule Thm: l´(i) = largest j <= n – i + 1 s.t. N_j(P) = j. Q: How can we compute l´(i) values in linear time? A: This is problem #9 in Chapter 2. This would make an interesting homework problem.
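One possible linear-time approach, offered as a sketch rather than as the intended homework solution, follows directly from the theorem: walk j upward, remember the largest j seen so far with N_j = j, and record it for position i = n - j + 1. The function name `small_l_prime` is hypothetical, the N values are taken as input, and only i = 2 … n is filled in, since those are the values the search uses.

```python
def small_l_prime(N):
    """N = [N_1, ..., N_{n-1}]. Return a list whose index i (2 <= i <= n) holds l'(i)."""
    n = len(N) + 1
    result = [0] * (n + 1)               # indices 0 and 1 are unused
    best = 0                             # largest j so far with N_j = j
    for j in range(1, n):
        if N[j - 1] == j:                # P[1..j] is both a prefix and a suffix of P
            best = j
        result[n - j + 1] = best         # j is the largest length allowed for this i
    return result


# N values for P = asasbsasas, worked out by hand from the definition of N_j
print(small_l_prime([0, 2, 0, 4, 0, 1, 0, 3, 0])[2:])
```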

Boyer-Moore Algorithm
Preprocessing: Compute L´(i) and l´(i) for each position i in P. Compute R(x), the right-most occurrence of x in P, for each character x in Σ.
Search:
k = n;
While k <= m {
  i = n; h = k;
  While i > 0 and P(i) = T(h) { i = i – 1; h = h – 1; }
  if i = 0 { report an occurrence of P in T ending at position k; k = k + n - l´(2); }
  else shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}

Boyer-Moore Algorithm Example: P = golgol Preprocessing: Compute L´(i) and l´(i) for each position i in P Notice that first we need N j (P) values in order to compute L´(i) and l´(i) for each position i in P. For i = 1 to n {L´(i) = 0;} For j = 1 to n – 1 { i = n - N j (P) + 1; L´(i) = j; }

Boyer-Moore Algorithm Example: P = golgol Recall that N j (P) is the length of the longest suffix of P[1..j] that is also a suffix of P. N 1 (P) = 0, there is no suffix of P that ends with g N 2 (P) = 0, there is no suffix of P that ends with o N 3 (P) = 3, there is a suffix of P that ends with l N 4 (P) = 0, there is no suffix of P that ends with g N 5 (P) = 0, there is no suffix of P that ends with o N 1 (P) = N 2 (P) = N 4 (P) = N 5 (P) = 0 and N 3 (P) = 3

Boyer-Moore Algorithm Preprocessing: P = golgol, n = 6
N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3
Compute L´(i) for each position i in P:
For i = 1 to n { L´(i) = 0; }
For j = 1 to n – 1 { i = n - N_j(P) + 1; L´(i) = j; }
j = 1 ⇒ i = 7, therefore L´(7) = 1
j = 2 ⇒ i = 7, therefore L´(7) = 2
j = 3 ⇒ i = 4, therefore L´(4) = 3
j = 4 ⇒ i = 7, therefore L´(7) = 4
j = 5 ⇒ i = 7, therefore L´(7) = 5
Position 7 = n + 1 lies outside P, so those assignments can be ignored: L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3

Boyer-Moore Algorithm Preprocessing: P = golgol, n = 6
N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3
L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3
Compute l´(i) for each position i in P. Recall that l´(i) is the length of the longest suffix of P[i..n] that is also a prefix of P.
l´(1) = 6, since golgol itself is the longest suffix of P[1..n] that is a prefix of P.
l´(2) = 3, since gol is the longest suffix of P[2..n] that is a prefix of P.
l´(3) = 3, since gol is the longest suffix of P[3..n] that is a prefix of P.
l´(4) = 3, since gol is the longest suffix of P[4..n] that is a prefix of P.
l´(5) = 0, since there is no suffix of P[5..n] that is a prefix of P.
l´(6) = 0, since there is no suffix of P[6..n] that is a prefix of P.
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0

Boyer-Moore Algorithm Preprocessing: P = golgol, n = 6
N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3
L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
Compute the list R(x), the right-most occurrences of x in P, for each character x in Σ = {g, o, l}:
R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3>

Boyer-Moore Algorithm Preprocessing: P = golgol, n = 6, T = lolgolgol, m = 9
L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3>
Search:
k = n;
While k <= m {
  i = n; h = k;
  While i > 0 and P(i) = T(h) { i = i – 1; h = h – 1; }
  if i = 0 { report an occurrence of P in T ending at position k; k = k + n - l´(2); }
  else shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}

Search (k = 6)
T: l o l g o l g o l
P: g o l g o l
Right-to-left comparisons: (i = 6, h = 6), (i = 5, h = 5), (i = 4, h = 4), (i = 3, h = 3) and (i = 2, h = 2) all match; at i = 1, h = 1 we find P(1) != T(1).
Bad Character Rule: there is no occurrence of l, the mismatched character in T, to the left of P(1), because i = 1. This suggests shifting only 1 place.
Good Suffix Rule: the mismatch is at i = 1, so we consult position i + 1 = 2. Since L´(2) = 0 and l´(2) = 3, shift P by n - l´(2) places, i.e., 6 - 3 = 3 places.
Taking the larger of the two shifts, k = k + 3 = 9.

Search (k = 9)
T: l o l g o l g o l
P:       g o l g o l
Right-to-left comparisons: (i = 6, h = 9), (i = 5, h = 8), (i = 4, h = 7), (i = 3, h = 6), (i = 2, h = 5) and (i = 1, h = 4) all match, so the inner loop ends with i = 0, h = 3.
Since i = 0, report an occurrence of P in T starting at position 4 (ending at k = 9), then set k = k + 6 - l´(2) = 9 + 6 - 3 = 12.
k = 12 > m = 9, so we are done!
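The whole search, with both shift rules, can be put together as in the sketch below. It is one possible reading of the pseudocode, not a reference implementation: the function name `boyer_moore` is made up, occurrences are reported by 1-based start position, and the N values are computed by direct comparison (quadratic in n) to keep the example short, where the slides use the Z algorithm.

```python
def boyer_moore(pattern, text):
    """Return the 1-based start positions of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    P, T = " " + pattern, " " + text          # pad so indices are 1-based

    # N[j]: longest suffix of P[1..j] that is also a suffix of P (naive, for brevity)
    N = [0] * (n + 1)
    for j in range(1, n):
        while N[j] < j and P[j - N[j]] == P[n - N[j]]:
            N[j] += 1

    L_prime = [0] * (n + 2)                    # strong good suffix table L'(i)
    l_prime = [0] * (n + 2)                    # prefix/suffix table l'(i)
    best = 0
    for j in range(1, n):
        if N[j] > 0:
            L_prime[n - N[j] + 1] = j
        if N[j] == j:
            best = j
        l_prime[n - j + 1] = best

    occ = {}                                   # occurrence lists, largest position first
    for pos in range(n, 0, -1):
        occ.setdefault(P[pos], []).append(pos)

    hits = []
    k = n                                      # T position aligned with P(n)
    while k <= m:
        i, h = n, k
        while i > 0 and P[i] == T[h]:
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k - n + 1)
            k += n - l_prime[2]
            continue
        # extended bad character rule: closest occurrence of T[h] left of i
        j = next((pos for pos in occ.get(T[h], []) if pos < i), 0)
        bad_shift = i - j
        # good suffix rule: the matched suffix is P[i+1..n]
        if i == n:
            good_shift = 1
        elif L_prime[i + 1] > 0:
            good_shift = n - L_prime[i + 1]
        else:
            good_shift = n - l_prime[i + 1]
        k += max(bad_shift, good_shift)
    return hits


print(boyer_moore("golgol", "lolgolgol"))
```

On the worked example the first alignment mismatches at i = 1, the good suffix rule wins with a shift of 3, and the second alignment matches, as in the trace above.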

Homework 1: Due Next Week. Implement the Boyer-Moore Algorithm.

KMP Algorithm Preliminaries: – KMP can be easily explained in terms of finite state machines. – KMP has an easily proved linear bound. – KMP is usually not the method of choice.

KMP Algorithm Recall that the naïve approach to string matching is Θ(mn) in the worst case. How can we reduce this complexity? – Avoid redundant comparisons. – Use larger shifts, as with the Boyer-Moore good suffix rule and the Boyer-Moore extended bad character rule.

KMP Algorithm KMP finds larger shifts by recognizing patterns in P. – Let sp_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P. – By definition sp_1 = 0 for any string. – Q: Why does this make sense? – A: The only proper suffix of a one-character string is the empty string.

KMP Algorithm Example: P = abcaeabcabd – P[1..2] = ab hence sp 2 = ? – sp 2 = 0 – P[1..3] = abc hence sp 3 = ? – sp 3 = 0 – P[1..4] = abca hence sp 4 = ? – sp 4 = 1 – P[1..5] = abcae hence sp 5 = ? – sp 5 = 0 – P[1..6] = abcaea hence sp 6 = ? – sp 6 = 1

KMP Algorithm Example Continued – P[1..7] = abcaeab hence sp 7 = ? – sp 7 = 2 – P[1..8] = abcaeabc hence sp 8 = ? – sp 8 = 3 – P[1..9] = abcaeabca hence sp 9 = ? – sp 9 = 4 – P[1..10] = abcaeabcab hence sp 10 = ? – sp 10 = 2 – P[1..11] = abcaeabcabd hence sp 11 = ? – sp 11 = 0
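The sp values in this example can be checked mechanically. The sketch below uses the classical KMP prefix-function recurrence rather than the Z-based preprocessing these slides derive later; `sp_values` is an illustrative name and the result lists sp_1 … sp_n in order.

```python
def sp_values(pattern):
    """Return [sp_1, ..., sp_n] for pattern."""
    n = len(pattern)
    sp = [0] * n                          # sp[i] holds sp_{i+1}
    matched = 0                           # length of the current prefix/suffix match
    for i in range(1, n):
        while matched > 0 and pattern[i] != pattern[matched]:
            matched = sp[matched - 1]     # fall back to the next shorter candidate
        if pattern[i] == pattern[matched]:
            matched += 1
        sp[i] = matched
    return sp


print(sp_values("abcaeabcabd"))
```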

KMP Algorithm Like the t / t´ concept for Boyer-Moore, there is an analogous sp_i / sp´_i concept. Let sp´_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that characters P(i + 1) and P(sp´_i + 1) are unequal. Example: P = abcdabce, sp´_7 = 3. Obviously sp´_i(P) <= sp_i(P), since the latter is less restrictive.

KMP Algorithm KMP Shift Rule: 1. Mismatch case: Let position i+1 in P and position k in T be the first mismatch in a left-to-right scan. Shift P to the right, aligning P[1..sp´_i] with T[k-sp´_i..k-1]. 2. Match case: If no mismatch is found, an occurrence of P has been found. Shift P by n – sp´_n spaces to continue searching for other occurrences.

KMP Algorithm Observations: – The prefix P[1..sp´ i ] of the shifted P is shifted to match the corresponding substring in T. – Subsequent character matching proceeds from position sp´ i + 1 – Unlike Boyer-Moore, the matched substring is not compared again. – The shift rule based on sp´ i guarantees that the exact same mismatch won’t occur at sp´ i + 1 but doesn’t guarantee that P(sp´ i +1) = T(k)

KMP Algorithm Example: P = abcxabcde – If a mismatch occurs at position 8, P will be shifted 4 positions to the right. – Q: Where did the 4 position shift come from? – A: The number of positions is given by i - sp´_i; in this example i = 7 and sp´_7 = 3, so 7 – 3 = 4. – Notice that we know the amount of shift without knowing anything about T other than there was a mismatch at position 8.

KMP Algorithm Example Continued: P = abcxabcde – After the shift, P[1..3] lines up with T[k-3..k-1]. – Since it is known that P[1..3] must match T[k-3..k-1], no comparison is needed. – The scan continues from P(4) & T(k). Advantages of KMP Shift Rule: 1. P is often shifted by more than 1 character (i - sp´_i). 2. The left-most sp´_i characters in the shifted P are known to match the corresponding characters in T.

KMP Algorithm Full Example: T = xyabcxabcxadcdqfeg, P = abcxabcde. Assume that we have already shifted past the first two positions in T.
T: xyabcxabcxadcdqfeg
P:   abcxabcde
Comparisons 1 through 7 match (P[1..7] = abcxabc); comparison 8 fails, d != x, so shift 4 places.
T: xyabcxabcxadcdqfeg
P:       abcxabcde
Start again from position 4 of P.

Preprocessing for KMP Approach: show how to derive sp´ values from Z values. Definition: Position j > 1 maps to i if i = j + Z_j(P) – 1. – Recall that Z_j(P) denotes the length of the Z-box starting at position j. – This says that j maps to i if i is the right end of a Z-box starting at j.

Preprocessing for KMP Theorem. For any i > 1, sp´_i(P) = Z_j = i – j + 1, where j > 1 is the smallest position that maps to i. If no such j exists, then sp´_i(P) = 0. Similarly for sp: for any i > 1, sp_i(P) = i – j + 1, where j, with i ≥ j > 1, is the smallest position that maps to i or beyond. If no such j exists, then sp_i(P) = 0. Definition: Position j > 1 maps to i if i = j + Z_j(P) – 1.

Preprocessing for KMP
Given the theorem from the preceding slide, the sp´_i and sp_i values can be computed in linear time using Z_i values:
For i = 1 to n { sp´_i = 0; }
For j = n downto 2 { i = j + Z_j(P) – 1; sp´_i = Z_j; }
sp_n(P) = sp´_n(P);
For i = n - 1 downto 2 { sp_i(P) = max[sp_{i+1}(P) - 1, sp´_i(P)]; }
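A direct Python transcription of these two loops is sketched below. To stay self-contained it takes the Z values as an input list Z_1 … Z_n (Z_1 is ignored), here worked out by hand for P = abcaeabcabd; the function name `sp_from_z` is hypothetical.

```python
def sp_from_z(Z):
    """Z = [Z_1, ..., Z_n]. Return (sp_prime, sp) as 1-based lists (index 0 unused)."""
    n = len(Z)
    sp_prime = [0] * (n + 1)
    for j in range(n, 1, -1):             # j = n downto 2
        i = j + Z[j - 1] - 1              # right end of the Z-box starting at j
        sp_prime[i] = Z[j - 1]            # smaller j overwrites, so the smallest j wins
    sp = [0] * (n + 1)
    sp[n] = sp_prime[n]
    for i in range(n - 1, 1, -1):         # i = n - 1 downto 2
        sp[i] = max(sp[i + 1] - 1, sp_prime[i])
    return sp_prime, sp


# Z values for P = abcaeabcabd (Z_4 = 1, Z_6 = 4, Z_9 = 2, all others 0)
sp_prime, sp = sp_from_z([0, 0, 0, 1, 0, 4, 0, 0, 2, 0, 0])
print(sp_prime[1:])
print(sp[1:])
```

The second list reproduces the sp values listed for this pattern in the earlier KMP example.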

Preprocessing for KMP
Defn. Failure function F´(i) = sp´_{i-1} + 1, 1 ≤ i ≤ n + 1, sp´_0 = 0 (similarly F(i) = sp_{i-1} + 1, 1 ≤ i ≤ n + 1, sp_0 = 0).
T: xyabcxabcxadcdqfeg
P:   abcxabcde
Comparisons 1 through 7 match; comparison 8 fails, d != x, shift 4 places. Shifting is only conceptual and P is never explicitly shifted: a pointer i into P and a pointer c into T are updated instead.
Two special cases: 1. Mismatch at position 1, then F´(1) = 1. 2. Match found, then P shifts by n - sp´_n places, which is F´(n+1) = sp´_n + 1.

Preprocessing for KMP Defn. Failure function F´(i) = sp´_{i-1} + 1, 1 ≤ i ≤ n + 1, sp´_0 = 0 (similarly F(i) = sp_{i-1} + 1, 1 ≤ i ≤ n + 1, sp_0 = 0). Idea: – We maintain a pointer i in P and c in T. – After a mismatch at P(i+1) with T(c), shift P to align P(sp´_i + 1) with T(c), i.e., i = sp´_i + 1. – Special case 1: i = 1 ⇒ set i = F´(1) = 1 & c = c + 1. – Special case 2: we find P in T ⇒ shift n - sp´_n spaces, i.e., i = F´(n + 1) = sp´_n + 1.

Full KMP Algorithm
Preprocess P to find F´(k) = sp´_{k-1} + 1 for k from 1 to n + 1.
c = 1; p = 1;
While c + (n – p) ≤ m {
  While p ≤ n and P(p) = T(c) { p = p + 1; c = c + 1; }
  If (p = n + 1) then report an occurrence of P at position c – n of T.
  If (p = 1) then c = c + 1;
  p = F´(p);
}
Here |T| = m and |P| = n; p points into P and c points into T. Example: T = xyabcxabcxadcdqfeg, P = abcxabcde.

Full KMP Algorithm: trace on T = xyabcxabcxabcdefeg, P = abcxabcde, step 1 (c = 1, p = 1). Comparison 1 fails, a != x. p != n+1; p = 1, so c = 2; p = F’(1) = 1.

Full KMP Algorithm: trace, step 2 (c = 2, p = 1). Comparison 1 fails, a != y. p != n+1; p = 1, so c = 3; p = F’(1) = 1.

Full KMP Algorithm: trace, step 3 (c = 3, p = 1). Comparisons 1 through 7 match; comparison 8 fails, d != x, with p = 8 and c = 10. p != n+1; p = 8, so don’t change c; p = F´(8) = 4.

Full KMP Algorithm: trace, step 4 (p = 4, c = 10). Comparisons at p = 4 through p = 9 all match, so p = n+1 = 10 and c = 16. Report an occurrence of P at position c – n = 7 of T.
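The pseudocode and trace can be reproduced with the sketch below. The function name `kmp_search` is made up, and one substitution is worth flagging: sp´ is derived here from sp with the recurrence "if P(i+1) = P(sp_i + 1) then sp´_i = sp´_{sp_i}, else sp´_i = sp_i", which is a common alternative to the Z-based preprocessing described in these slides.

```python
def kmp_search(pattern, text):
    """Return the 1-based start positions of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    P, T = " " + pattern, " " + text          # pad so indices are 1-based

    sp = [0] * (n + 1)
    matched = 0
    for i in range(2, n + 1):
        while matched > 0 and P[i] != P[matched + 1]:
            matched = sp[matched]
        if P[i] == P[matched + 1]:
            matched += 1
        sp[i] = matched

    sp_prime = [0] * (n + 1)                   # sp_prime[0] = 0 plays the role of sp'_0
    for i in range(1, n + 1):
        if i == n or sp[i] == 0 or P[i + 1] != P[sp[i] + 1]:
            sp_prime[i] = sp[i]
        else:                                  # same next character: fall back further
            sp_prime[i] = sp_prime[sp[i]]

    def failure(p):                            # F'(p) = sp'_{p-1} + 1
        return sp_prime[p - 1] + 1

    hits = []
    c = p = 1
    while c + (n - p) <= m:
        while p <= n and P[p] == T[c]:
            p += 1
            c += 1
        if p == n + 1:
            hits.append(c - n)
        if p == 1:
            c += 1
        p = failure(p)
    return hits


print(kmp_search("abcxabcde", "xyabcxabcxabcdefeg"))
```

Note that c never moves backwards, which is the property the real-time variant below builds on.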

Real-Time KMP Q: What is meant by real-time algorithms? A: Typically these are algorithms that are meant to interact synchronously in the real world. – This implies a known fixed turn-around time for processing a task. – Many embedded scheduling systems are examples involving real-time algorithms. – For KMP this means that we require a constant bound on the time spent processing each character of T.

Real-Time KMP Q: Why is KMP not real-time? A: For any mismatched character in T, we may try matching it several times. – Recall that sp´_i only guarantees that P(i + 1) and P(sp´_i + 1) differ. – There is NO guarantee that P(sp´_i + 1) and T(k) match. We need to ensure that a mismatch at T(k) does NOT entail additional matches at T(k). This means that we have to compute sp´_i values with respect to all characters in Σ, since any could appear in T.

Real-Time KMP Define: sp´_(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that character P(sp´_(i,x) + 1) is x. This will tell us exactly what shift to use for each possible mismatch. A mismatched character T(k) will never be involved in subsequent comparisons.

Real-Time KMP Q: How do we know that the mismatched character T(k) will never be involved in subsequent comparisons? A: Because the shift will shift P so that either the matching character aligns with T(k) or P will be shifted past T(k). This results in a real-time version of KMP. Let’s consider how we can find the sp´ (i,x) (P) values in linear time.

Real-Time KMP
Thm. For P[i + 1] ≠ x, sp´_(i,x)(P) = i - j + 1. – Here j is the smallest position such that j maps to i and P(Z_j + 1) = x. – If there is no such j, then sp´_(i,x)(P) = 0.
For i = 1 to n { sp´_(i,x) = 0 for every character x; }
For j = n downto 2 { i = j + Z_j(P) – 1; x = P(Z_j + 1); sp´_(i,x) = Z_j; }

Real-Time KMP
Notice how this works, starting from the right:
– Find i, the right end of the Z-box associated with j.
– Find x, the character immediately following the prefix corresponding to this Z-box.
– Set sp´_(i,x) = Z_j, the length of this Z-box.
For i = 1 to n { sp´_(i,x) = 0 for every character x; }
For j = n downto 2 { i = j + Z_j(P) – 1; x = P(Z_j + 1); sp´_(i,x) = Z_j; }
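As a rough sketch of this table construction, the loop can be written with a dictionary keyed by (i, x), where a missing key stands for the initial value 0. The function name `real_time_sp_table` is hypothetical, and the Z values are passed in as a list Z_1 … Z_n (Z_1 ignored), here filled in by hand for P = abcxabcde.

```python
def real_time_sp_table(pattern, Z):
    """Return a dict mapping (i, x) to sp'_(i,x); absent keys mean 0.

    pattern is P, Z = [Z_1, ..., Z_n], and i is a 1-based position in P.
    """
    n = len(pattern)
    table = {}
    for j in range(n, 1, -1):             # j = n downto 2, so the smallest j wins
        z = Z[j - 1]
        if z == 0:
            continue                      # empty Z-box: the entry stays at 0
        i = j + z - 1                     # right end of the Z-box starting at j
        x = pattern[z]                    # P(Z_j + 1) in 1-based terms
        table[(i, x)] = z
    return table


table = real_time_sp_table("abcxabcde", [0, 0, 0, 0, 3, 0, 0, 0, 0])
print(table.get((7, "x"), 0))             # mismatch at position 8 against text character x
print(table.get((7, "a"), 0))             # any other text character
```

For the running example the only non-zero entry is at (7, x): after a mismatch at position 8 against an x in T, P[1..3] stays aligned and P(4) = x is already known to match that character.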

Reference Chapter 1, 2: Exact Matching: Fundamental Preprocessing and First Algorithms
