Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Exact String Matching Algorithms.

1 Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Exact String Matching Algorithms

2 Classical Comparison Based Methods Boyer-Moore Algorithm Knuth-Morris-Pratt Algorithm (KMP Algorithm)

3 Boyer-Moore Algorithm Basic ideas: – Previously discussed ideas for naïve matching: 1. Successively align P and T to check for a match. 2. Shift P to the right on match failure. – New concepts with respect to the naïve algorithm: 1. Scan from right to left. 2. Special bad character rule. 3. Suffix shift rule.

4 Concept: Right-to-left Scan How can we check for a match of pattern P at location i in target T? The naïve algorithm scans left-to-right, i.e., it compares T[i+k] with P[1+k] for k = 0 to length(P)-1. Example: P = adab, T = abaracadabara. With P aligned at position 1 of T, comparison 1 finds a == a and comparison 2 finds d != b.

5 Concept: Right-to-left Scan Alternatively, scan right-to-left, i.e., compare T[i+k] with P[1+k] for k = length(P)-1 down to 0. Example: P = adab, T = abaracadabara. With P aligned at position 1 of T, comparison 1 pairs P(4) = b with T(4) = r and finds b != r.

6 Concept: Right-to-left Scan Why is scanning right-to-left a good idea? Answer: by itself, it isn’t any better than left-to-right. – A naïve approach with right-to-left scanning is also Θ(nm). – Larger shifts, supported by a clever bad character rule and a suffix shift rule, make it better.
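To make the right-to-left scan concrete, here is a minimal Python sketch of the naïve matcher with the comparison direction reversed. The function name naive_right_to_left is an illustrative choice, not something from the slides, and positions are reported 1-based to follow the slides' convention. It always shifts by one, so it is still Θ(nm) in the worst case.

```python
def naive_right_to_left(pattern, text):
    """Return the 1-based start positions of pattern in text.

    Each alignment is checked from the right end of the pattern to the left,
    and the pattern is always shifted by exactly one position afterwards.
    """
    n, m = len(pattern), len(text)
    hits = []
    for start in range(m - n + 1):        # 0-based alignment of P against T
        k = n - 1
        while k >= 0 and text[start + k] == pattern[k]:
            k -= 1                        # keep moving left while characters agree
        if k < 0:                         # ran off the left end: full match
            hits.append(start + 1)
    return hits


print(naive_right_to_left("adab", "abaracadabara"))
```

The Boyer-Moore rules on the following slides replace the fixed shift of one by larger, provably safe shifts.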

7 Concept: Bad Character Rule Idea: the mismatched character indicates a safe minimum shift. Example: P = adacara, T = abaracadabara. With P aligned at position 1 of T, comparison 1 finds a == a (P(7) against T(7)) and comparison 2 finds r != c (P(6) against T(6)). Here the bad character is c. Perhaps we should shift to align this character with its rightmost occurrence in P?

8 Concept: Bad Character Rule Shift P two positions to the right so that the rightmost occurrence of the mismatched character c in P (position 4) lies under the c in T (position 6). Now, start matching again from right to left.

9 Concept: Bad Character Rule Second Example: P = adacara, T = abaxaradabara. With P aligned at position 1 of T, the right-to-left comparisons are: 1. a == a, 2. r == r, 3. a == a, 4. c != x. Here the bad character is x. The minimum shift should align this character with its occurrence in P. But x doesn’t occur in P!

10 Concept: Bad Character Rule Second Example: P = adacara, T = abaxaradabara. Since x doesn’t occur in P, we can shift P completely past it, so that P(1) is aligned with the character after x. Now, start matching again from right to left.

11 Concept: Bad Character Rule The idea of the bad character rule is to shift P by more than one character when possible. But the rule gives no useful shift when the rightmost occurrence of the bad character in P lies to the right of the mismatch position, and unfortunately this is often the case. Example (positions 1 to 17 of T): T: spbctbsatpqsctbpq P: tpabsat

12 Concept: Bad Character Rule We will define a bad character rule that uses the concept of the rightmost occurrence of each letter. Let R(x) be the rightmost position of the letter x in P for each letter x in our alphabet. If x doesn’t occur in P, define R(x) to be 0. Example: P = adacara (positions 1 to 7): R(a) = 7, R(b) = 0, R(c) = 4, R(d) = 2, R(r) = 6, and R(x) = 0 for every other letter up to z.

13 Concept: Bad Character Rule Example (positions 1 to 17 of T): T: spbctbsabpqsctbpq P: tpabsab, with P(1) aligned with T(3). R(t) = 1, R(s) = 5. i: the position of the mismatch in P; here i = 3. k: the counterpart in T; here k = 5 and T[k] = t. The bad character rule says P should be shifted right by max{1, i - R(T[k])}, i.e., if the rightmost occurrence of character T[k] in P is in position j (j < i), then P[j] should be below T[k] after the shift. Otherwise, when R(T[k]) >= i and hence 1 >= i - R(T[k]), we shift P by one position. This rule is not very useful when R(T[k]) >= i, which is usually the case for DNA sequences.
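The table R(x) and the shift max{1, i - R(T[k])} can be sketched in a few lines of Python. The function names rightmost_table and bad_character_shift are illustrative assumptions; positions are 1-based as on the slides, and a character missing from the table is treated as R(x) = 0.

```python
def rightmost_table(pattern):
    """R(x): 1-based position of the rightmost occurrence of x in pattern."""
    table = {}
    for pos, ch in enumerate(pattern, start=1):
        table[ch] = pos                   # later occurrences overwrite earlier ones
    return table


def bad_character_shift(table, i, bad_char):
    """Shift for a mismatch at pattern position i against text character bad_char."""
    return max(1, i - table.get(bad_char, 0))


R = rightmost_table("tpabsab")
print(R["t"], R["s"])                     # R(t) and R(s) from the slide
print(bad_character_shift(R, 3, "t"))     # mismatch at i = 3 against T[k] = t
```

For the slide's example the shift is max{1, 3 - 1} = 2. When the rightmost occurrence lies to the right of i, the function falls back to a shift of 1, which is the weakness the extended rule addresses.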

14 Concept: Extended Bad Character Rule Extended Bad Character Rule: If P[i] mismatches T[k], shift P along T so that the closest occurrence of the letter T[k] in P to the left of i in P is aligned with T[k]. Example: P = aracara, T = abararadabara. With P aligned at position 1 of T, the right-to-left comparisons are: 1. a == a, 2. r == r, 3. a == a, 4. c != r, so i = 4 and T[k] = r. The rightmost occurrence of r in P is at position 6; notice that i - R(T[k]) < 0, i.e., 4 – 6 < 0. The rightmost occurrence of r to the left of i in P is at position 2; notice that 4 – 2 > 0, i.e., this gives us a positive shift.

15 Concept: Extended Bad Character Rule The amount of shift is i – j, where: – i is the index of the mismatch in P. – j is the rightmost occurrence of T[k] to the left of i in P. Example: P = aracara, T = abataradabara. With P aligned at position 1 of T, the right-to-left comparisons are: 1. a == a, 2. r == r, 3. a == a, 4. c != t. There is no occurrence of t in P, thus j = 0. Notice that i – j = 4, i.e., this gives us a positive shift past the point of mismatch.

16 Concept: Extended Bad Character Rule How do we implement this rule? We preprocess P (from right to left), recording the position of each occurrence of the letters. For each character x in Σ, the alphabet, create a list of its occurrences in P. If x doesn’t occur in P, then it has an empty list.

17 Concept: Extended Bad Character Rule Example: Σ = {a, b, c, d, r, t}, P = abataradabara. a_list = <13, 11, 9, 7, 5, 3, 1>, since ‘a’ occurs at these positions in P. b_list = <10, 2>, c_list = Ø, d_list = <8>, r_list = <12, 6>, t_list = <4>. Each list is in decreasing order because P is preprocessed from right to left.
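The occurrence lists and the lookup they support can be sketched as follows. This is a minimal illustration, assuming 1-based positions and lists stored in decreasing order; the names occurrence_lists and extended_bad_character_shift are hypothetical.

```python
def occurrence_lists(pattern):
    """Map each character to the decreasing list of its 1-based positions in pattern."""
    lists = {}
    for pos in range(len(pattern), 0, -1):            # scan P from right to left
        lists.setdefault(pattern[pos - 1], []).append(pos)
    return lists


def extended_bad_character_shift(lists, i, bad_char):
    """Shift i - j, where j is the closest occurrence of bad_char left of position i."""
    for pos in lists.get(bad_char, []):               # positions in decreasing order
        if pos < i:
            return i - pos
    return i                                          # no such occurrence: j = 0


lists = occurrence_lists("aracara")
print(extended_bad_character_shift(lists, 4, "r"))    # r occurs at position 2, left of i = 4
print(extended_bad_character_shift(lists, 4, "t"))    # t does not occur in P
```

The scan of a list stops at the first position smaller than i, and every position it skips corresponds to a character that the current alignment has already matched.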

18 Concept: Suffix Shift Rule Recall that we investigated finding prefixes before. Since we are matching P to T from right-to-left, we will instead need to use suffixes.

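19 Suffix Shift Rule t is a suffix of P that matches a substring t of T, and the mismatch just to its left pairs x in T with y in P (x ≠ y). t’ is the right-most copy of t in P such that t’ is not a suffix of P and the character z to the left of t’ satisfies z ≠ y.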

20 Concept: Suffix Shift Rule Consider the partial right-to-left matching of P to T below. This partial match involves α, a suffix of P.

21 Concept: Suffix Shift Rule This partial match ends where the first mismatch occurs, where x is aligned with d.

22 Concept: Suffix Shift Rule We want to find a right-most copy α´ of this substring α in P such that: 1. α´ is not a suffix of P, and 2. the character to the left of α´ is not the same as the character to the left of α.

23 Concept: Suffix Shift Rule 1. If α´ exists, shift P to the right such that α´ is now aligned with the substring in T that was previously aligned with α.

24 Concept: Suffix Shift Rule 2. If α´ doesn’t exist, shift P right by the least amount such that a prefix of P is aligned with a suffix of α in T.

25 Concept: Suffix Shift Rule 3. If α´ doesn’t exist, and there is no prefix of P that matches a suffix of α in T, shift P right by n positions.

26 Preprocessing for the good suffix rule Let L(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L(i)]. If there is no such position, then L(i) = 0 Example 1: If i = 17 then L(i) = 9 Example 2: If i = 16 then L(i) = 0

27 Concept: Suffix Shift Rule Let L´(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L´(i)] and s.t. the character preceding that suffix is not equal to P(i-1). If there is no such position, then L´(i) = 0. Example 1 (P = slydogsaddogdbadbaddog): If i = 20 then L(i) = 12 and L´(i) = 6.

28 Concept: Suffix Shift Rule Example 2: P = slydogsaddogdbadbaddog. If i = 19 then L(i) = 12 and L´(i) = 0.

29 Concept: Suffix Shift Rule Notice that L(i) indicates the right-most copy of P[i..n] that is not a suffix of P. In contrast, L´(i) indicates the right-most copy of P[i..n] that is not a suffix of P and whose preceding character doesn’t match P(i-1). The relation between L´(i) and L(i) is analogous to the relation between α´ and α.

30 Concept: Suffix Shift Rule Q: What is the point? A: If P(i - 1) causes the mismatch and L´(i) > 0, then we can shift P right by n - L´(i) positions.

31 Concept: Suffix Shift Rule If L(i) and L´(i) are different, then L´(i) < L(i), so shifting by n - L´(i) positions is a greater shift than n - L(i).

32 Concept: Suffix Shift Rule Let N_j(P) denote the length of the longest suffix of P[1..j] that is also a suffix of P. Example 1: N_6(P) = 3 and N_12(P) = 5. Example 2: N_3(P) = 2, N_9(P) = 3, N_15(P) = 5, N_19(P) = 0.

33 Concept: Suffix Shift Rule Q: How are the concepts of N_i and Z_i related? Recall that Z_i = length of a maximal substring starting at position i which is a prefix of P. In contrast, N_i = length of a maximal substring ending at position i which is a suffix of P. In the case of Boyer-Moore, we are naturally interested in suffixes since we are scanning right-to-left.

34 Concept: Suffix Shift Rule Let P^r denote the mirror image of P; then the relationship can be expressed as N_j(P) = Z_{n-j+1}(P^r). In words, the length of the substring matching a suffix at position j in P is equal to the length of the corresponding substring matching a prefix in the reverse of P. Q: Why must this be true? A: Because they are the same substring, except that one is the reverse of the other.

35 Concept: Suffix Shift Rule Since N_j(P) = Z_{n-j+1}(P^r), we can use the Z algorithm to compute N in O(n). Q: How do we do this? A: We create P^r, the reverse of P, and process it with the Z algorithm.

36 Concept: Suffix Shift Rule N is the reverse of Z! P: the pattern. P^r: the string obtained by reversing P. Then N_j(P) = Z_{n-j+1}(P^r).
position:  1 2 3 4 5 6 7 8 9 10
P:         q c a b d a b d a b
P^r:       b a d b a d b a c q
N_j(P):    0 0 0 2 0 0 5 0 0 0
Z_i(P^r):  0 0 0 5 0 0 2 0 0 0
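The identity N_j(P) = Z_{n-j+1}(P^r) translates directly into code. The sketch below assumes a standard linear-time Z algorithm (the slides refer to it but do not list it here) and uses illustrative names z_values and n_values. Z_1 is left at 0, so N_n also comes out as 0, which follows the table on this slide.

```python
def z_values(s):
    """z[i] (0-based) = length of the longest substring starting at i that is a prefix of s.

    z[0] is left at 0 by convention.
    """
    n = len(s)
    z = [0] * n
    left = right = 0                      # current Z-box covers s[left:right]
    for i in range(1, n):
        if i < right:                     # reuse information from inside the Z-box
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1                     # extend the match by explicit comparisons
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def n_values(pattern):
    """Return [N_1, ..., N_n]: reverse P, run the Z algorithm, reverse the result."""
    return z_values(pattern[::-1])[::-1]


print(n_values("qcabdabdab"))
```

Reversing the Z array of the reversed pattern places Z_{n-j+1}(P^r) at index j, which is exactly the relationship stated above.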

37 Concept: Suffix Shift Rule For pattern P, N_j (for j = 1,…,n) can be calculated in O(n) using the Z algorithm. Why do we need to define N_j? To use the strong good suffix rule, we need to find L’(i) for every i = 1,…,n. We can get L’(i) from N_j!

38 Concept: Suffix Shift Rule We can then find L´(i) and L(i) values from N values in linear time with the following:
For i = 1 to n { L´(i) = 0; }
For j = 1 to n – 1 { i = n - N_j(P) + 1; L´(i) = j; }
// L values (if desired) can be obtained:
L(2) = L´(2);
For i = 3 to n { L(i) = max(L(i - 1), L´(i)); }
(When N_j(P) = 0 the computed i is n + 1, which lies outside P, so that assignment is ignored.)

39 Concept: Suffix Shift Rule Example: P = asdbasasas, n = 10. Values of N_j(P) for j = 1 to 9: 0, 2, 0, 0, 0, 2, 0, 4, 0. Computed values of i = n - N_j(P) + 1: 11, 9, 11, 11, 11, 9, 11, 7, 11. Values of L´(i) for i = 1 to 9: 0, 0, 0, 0, 0, 0, 8, 0, 6.
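The two loops can be written out in Python as a minimal sketch. It assumes the N values are supplied as a list with N_j at index j - 1; the function name good_suffix_tables is illustrative. An extra slot at index n + 1 absorbs the assignments made when N_j = 0 and is discarded afterwards.

```python
def good_suffix_tables(n_vals):
    """Given n_vals[j-1] = N_j(P), return (L_prime, L) as lists indexed 1..n (index 0 unused)."""
    n = len(n_vals)
    l_prime = [0] * (n + 2)               # slot n+1 catches the N_j = 0 case
    for j in range(1, n):                 # j = 1 .. n-1
        i = n - n_vals[j - 1] + 1
        l_prime[i] = j                    # larger j overwrites smaller j
    l_prime = l_prime[: n + 1]            # drop the out-of-range slot
    big_l = [0] * (n + 1)
    if n >= 2:
        big_l[2] = l_prime[2]
    for i in range(3, n + 1):
        big_l[i] = max(big_l[i - 1], l_prime[i])
    return l_prime, big_l


# N values for P = asdbasasas (N_10 is not used by the loops and is given as 0)
l_prime, big_l = good_suffix_tables([0, 2, 0, 0, 0, 2, 0, 4, 0, 0])
print(l_prime[1:])
print(big_l[1:])
```

Because j runs upwards, each L´(i) ends up holding the largest qualifying j, which is what "right-most copy" requires.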

40 Concept: Suffix Shift Rule Let l´(i) denote the length of the longest suffix of P[i..n] that is also a prefix of P. Let l´(i) = 0 if no such suffix exists. Example: P = asasbsasas: l’(1) = l’(2) = l’(3) = l’(4) = l’(5) = l’(6) = l’(7) = 4, l’(8) = l’(9) = 2, l’(10) = 0.

41 Concept: Suffix Shift Rule Thm: l´(i) = largest j <= n – i + 1 s.t. N_j(P) = j. Q: How can we compute l´(i) values in linear time? A: This is problem #9 in Chapter 2. This would make an interesting homework problem.
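One possible linear-time approach, offered as a sketch rather than as the intended homework solution, follows the theorem directly: walk i from n down to 1, so that the allowed length n - i + 1 grows by one each step, and remember the largest j seen so far with N_j(P) = j. The function name small_l_prime is an assumption, and N_n is taken to be 0 (as in the N table earlier), so only proper suffixes count.

```python
def small_l_prime(n_vals):
    """Given n_vals[j-1] = N_j(P), return l' as a list indexed 1..n (index 0 unused)."""
    n = len(n_vals)
    result = [0] * (n + 2)                # result[n+1] = 0 acts as the base case
    for i in range(n, 0, -1):
        j = n - i + 1                     # length of P[i..n]
        if n_vals[j - 1] == j:            # P[1..j] is also a suffix of P
            result[i] = j
        else:
            result[i] = result[i + 1]     # best shorter candidate found so far
    return result[: n + 1]


# N values for P = asasbsasas, with N_10 taken as 0
print(small_l_prime([0, 2, 0, 4, 0, 1, 0, 3, 0, 0])[1:])
```

Each position is visited once and does constant work, so the whole table costs O(n) once the N values are available.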

42 Boyer-Moore Algorithm
Preprocessing: Compute L´(i) and l´(i) for each position i in P. Compute R(x), the right-most occurrence of x in P, for each character x in Σ.
Search:
k = n;
While k <= m {
  i = n; h = k;
  While i > 0 and P(i) = T(h) { i = i – 1; h = h – 1; }
  if i = 0 { report an occurrence of P in T ending at position k (i.e., starting at position k – n + 1); k = k + n - l´(2); }
  else shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}

43 Boyer-Moore Algorithm Example: P = golgol Preprocessing: Compute L´(i) and l´(i) for each position i in P Notice that first we need N j (P) values in order to compute L´(i) and l´(i) for each position i in P. For i = 1 to n {L´(i) = 0;} For j = 1 to n – 1 { i = n - N j (P) + 1; L´(i) = j; }

44 Boyer-Moore Algorithm Example: P = golgol Recall that N j (P) is the length of the longest suffix of P[1..j] that is also a suffix of P. N 1 (P) = 0, there is no suffix of P that ends with g N 2 (P) = 0, there is no suffix of P that ends with o N 3 (P) = 3, there is a suffix of P that ends with l N 4 (P) = 0, there is no suffix of P that ends with g N 5 (P) = 0, there is no suffix of P that ends with o N 1 (P) = N 2 (P) = N 4 (P) = N 5 (P) = 0 and N 3 (P) = 3

45 Boyer-Moore Algorithm Preprocessing: P = golgol, n = 6. N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3. Compute L´(i) and l´(i) for each position i in P. For i = 1 to n { L´(i) = 0; } For j = 1 to n – 1 { i = n - N_j(P) + 1; L´(i) = j; } j = 1 ⇒ i = 7, therefore L´(7) = 1. j = 2 ⇒ i = 7, therefore L´(7) = 2. j = 3 ⇒ i = 4, therefore L´(4) = 3. j = 4 ⇒ i = 7, therefore L´(7) = 4. j = 5 ⇒ i = 7, therefore L´(7) = 5. Position 7 = n + 1 lies outside P, so those assignments are ignored. Result: L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3.

46 Boyer-Moore Algorithm Preprocessing: P = golgol, n = 6. N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3. L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3. Compute l´(i) for each position i in P. Recall that l´(i) is the length of the longest suffix of P[i..n] that is also a prefix of P. l´(1) = 6, since P[1..n] = golgol is itself a prefix of P (the longest proper such suffix is gol; the search only uses l´(i) for i >= 2). l´(2) = 3, since gol is the longest suffix of P[2..n] that is a prefix of P. l´(3) = 3, since gol is the longest suffix of P[3..n] that is a prefix of P. l´(4) = 3, since gol is the longest suffix of P[4..n] that is a prefix of P. l´(5) = 0, since there is no suffix of P[5..n] that is a prefix of P. l´(6) = 0, since there is no suffix of P[6..n] that is a prefix of P. Summary: l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0.

47 Boyer-Moore Algorithm Preprocessing: P = golgol, n = 6. N_1(P) = N_2(P) = N_4(P) = N_5(P) = 0 and N_3(P) = 3. L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3. l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0. Compute the list R(x), the right-most occurrences of x in P, for each character x in Σ = {g, o, l}: R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3>.

48 Boyer-Moore Algorithm Preprocessing: P = golgol, n = 6, T = lolgolgol, m = 9. L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3. l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0. R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3>. Search: k = n; While k <= m { i = n; h = k; While i > 0 and P(i) = T(h) { i = i – 1; h = h – 1; } if i = 0 { report an occurrence of P in T ending at position k; k = k + n - l´(2); } else shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule. }

49 Search T = lolgolgol, P = golgol, k = 6, so P is aligned with T(1..6). Right-to-left comparisons: i = 6, h = 6: match. i = 5, h = 5: match. i = 4, h = 4: match. i = 3, h = 3: match. i = 2, h = 2: match. i = 1, h = 1: P(1) != T(1), so the mismatch is at i = 1. Bad Character Rule: there is no occurrence of l, the mismatched character in T, to the left of P(1). This suggests shifting only 1 place. Good Suffix Rule: since L´(2) = 0 and l´(2) = 3, shift P by n - l´(2) places, i.e., 6 - 3 = 3 places. Taking the larger of the two shifts, k = k + 3 = 9.

50 Search T = lolgolgol, P = golgol, k = 9, so P is aligned with T(4..9). Right-to-left comparisons match at (i, h) = (6, 9), (5, 8), (4, 7), (3, 6), (2, 5), (1, 4), leaving i = 0, h = 3. Since i = 0, report an occurrence of P in T at position 4 and set k = k + 6 - l´(2) = 9 + 6 - 3 = 12. Now k = 12 > m = 9, so we are done!
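Putting the preprocessing and the search loop together gives the following compact Python sketch. It is one possible arrangement under stated assumptions, not the definitive implementation: the names z_values and boyer_moore are illustrative, positions are 1-based internally to mirror the slides, occurrences are reported by start position, and a mismatch at P(n) (no suffix matched yet) is given a good-suffix shift of 1.

```python
def z_values(s):
    """Standard Z algorithm; z[0] is left at 0."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def boyer_moore(pattern, text):
    """Return the 1-based start positions of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    n_vals = [0] + z_values(pattern[::-1])[::-1]      # n_vals[j] = N_j(P)
    big = [0] * (n + 2)                               # big[i] = L'(i)
    for j in range(1, n):
        big[n - n_vals[j] + 1] = j
    small = [0] * (n + 2)                             # small[i] = l'(i)
    for i in range(n, 0, -1):
        j = n - i + 1
        small[i] = j if n_vals[j] == j else small[i + 1]
    occ = {}                                          # decreasing occurrence lists
    for pos in range(n, 0, -1):
        occ.setdefault(pattern[pos - 1], []).append(pos)

    hits = []
    k = n                                             # T position under P(n)
    while k <= m:
        i, h = n, k
        while i > 0 and pattern[i - 1] == text[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k - n + 1)
            k += n - small[2]
        else:
            # extended bad character rule: closest occurrence left of i, else j = 0
            bad = next((i - p for p in occ.get(text[h - 1], []) if p < i), i)
            if i == n:
                good = 1                              # nothing matched yet
            elif big[i + 1] > 0:
                good = n - big[i + 1]
            else:
                good = n - small[i + 1]
            k += max(bad, good)
    return hits


print(boyer_moore("golgol", "lolgolgol"))
```

On the example from the search slides the loop performs exactly two alignments, k = 6 and k = 9, before k = 12 exceeds m.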

51 Homework 1: Due Next Week Implement the Boyer-Moore Algorithm

52 Break

53 KMP Algorithm Preliminaries: – KMP can be easily explained in terms of finite state machines. – KMP has an easily proved linear bound. – KMP is usually not the method of choice.

54 KMP Algorithm Recall that the naïve approach to string matching is Θ(mn). How can we reduce this complexity? – Avoid redundant comparisons. – Use larger shifts, as in the Boyer-Moore good suffix rule and the Boyer-Moore extended bad character rule.

55 KMP Algorithm KMP finds larger shifts by recognizing patterns in P. – Let sp_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P. – By definition sp_1 = 0 for any string. – Q: Why does this make sense? – A: The only proper suffix of a single character is the empty string.

56 KMP Algorithm Example: P = abcaeabcabd – P[1..2] = ab, hence sp_2 = 0 – P[1..3] = abc, hence sp_3 = 0 – P[1..4] = abca, hence sp_4 = 1 – P[1..5] = abcae, hence sp_5 = 0 – P[1..6] = abcaea, hence sp_6 = 1

57 KMP Algorithm Example Continued – P[1..7] = abcaeab, hence sp_7 = 2 – P[1..8] = abcaeabc, hence sp_8 = 3 – P[1..9] = abcaeabca, hence sp_9 = 4 – P[1..10] = abcaeabcab, hence sp_10 = 2 – P[1..11] = abcaeabcabd, hence sp_11 = 0
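The sp values in this example can be checked with a short Python sketch. Note that it uses the classical prefix-function recurrence rather than the Z-based preprocessing that the later slides describe; the function name sp_values is illustrative and the result is indexed 1..n with index 0 unused.

```python
def sp_values(pattern):
    """sp[i] = length of the longest proper suffix of P[1..i] that is a prefix of P."""
    n = len(pattern)
    sp = [0] * (n + 1)
    k = 0                                  # current candidate length, equals sp[i-1]
    for i in range(2, n + 1):
        # fall back to shorter borders until the next character extends one
        while k > 0 and pattern[k] != pattern[i - 1]:
            k = sp[k]
        if pattern[k] == pattern[i - 1]:
            k += 1
        sp[i] = k
    return sp


print(sp_values("abcaeabcabd")[1:])
```

The fallback loop is what keeps the total work linear: k increases by at most one per position, so it can only decrease O(n) times overall.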

58 KMP Algorithm As with the weak and strong forms of the good suffix rule for Boyer-Moore, there is an analogous sp_i / sp´_i concept. Let sp´_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that characters P(i + 1) and P(sp´_i + 1) are unequal. Example: P = abcdabce, sp´_7 = 3. Obviously sp´_i(P) <= sp_i(P), since the latter is less restrictive.

59 KMP Algorithm KMP Shift Rule: 1. Mismatch case: Let position i+1 in P and position k in T be the first mismatch in a left-to-right scan. Shift P to the right, aligning P[1..sp´_i] with T[k-sp´_i..k-1]. 2. Match case: If no mismatch is found, an occurrence of P has been found. Shift P by n – sp´_n places to continue searching for other occurrences.

60 KMP Algorithm Observations: – The prefix P[1..sp´_i] of the shifted P is shifted to match the corresponding substring in T. – Subsequent character matching proceeds from position sp´_i + 1. – Unlike Boyer-Moore, the matched substring is not compared again. – The shift rule based on sp´_i guarantees that the exact same mismatch won’t occur at sp´_i + 1, but it doesn’t guarantee that P(sp´_i + 1) = T(k).

61 KMP Algorithm Example: P = abcxabcde – If a mismatch occurs at position 8, P will be shifted 4 positions to the right. – Q: Where did the 4-position shift come from? – A: The number of positions is given by i - sp´_i; in this example i = 7 and sp´_7 = 3, so the shift is 7 – 3 = 4. – Notice that we know the amount of shift without knowing anything about T other than that there was a mismatch at position 8.

62 KMP Algorithm Example Continued: P = abcxabcde – After the shift, P[1..3] lines up with T[k-3..k-1]. – Since it is known that P[1..3] must match T[k-3..k-1], no comparison is needed. – The scan continues from P(4) and T(k). Advantages of the KMP Shift Rule: 1. P is often shifted by more than 1 character (i - sp´_i). 2. The left-most sp´_i characters in the shifted P are known to match the corresponding characters in T.

63 KMP Algorithm Full Example: T = xyabcxabcxadcdqfeg, P = abcxabcde. Assume that we have already shifted past the first two positions in T, so P(1) is aligned with T(3). Comparisons 1 to 7 match (abcxabc); comparison 8 finds d != x, so shift 4 places. After the shift, P[1..3] is aligned with T[7..9] and matching starts again from position 4 of P.

64 Preprocessing for KMP Approach: show how to derive sp´ values from Z values. Definition: Position j > 1 maps to i if i = j + Z_j(P) – 1. – Recall that Z_j(P) denotes the length of the Z-box starting at position j. – This says that j maps to i if i is the right end of a Z-box starting at j.

65 Preprocessing for KMP Theorem. For any i > 1, sp´_i(P) = Z_j = i – j + 1, where j > 1 is the smallest position that maps to i. If there is no such j, then sp´_i(P) = 0. Similarly for sp: for any i > 1, sp_i(P) = i – j + 1, where j, i ≥ j > 1, is the smallest position that maps to i or beyond. If there is no such j, then sp_i(P) = 0. Definition: Position j > 1 maps to i if i = j + Z_j(P) – 1.

66 Preprocessing for KMP Given the theorem from the preceding slide, the sp´_i and sp_i values can be computed in linear time using Z_i values:
For i = 1 to n { sp´_i = 0; }
For j = n downto 2 { i = j + Z_j(P) – 1; sp´_i = Z_j; }
sp_n(P) = sp´_n(P);
For i = n - 1 downto 2 { sp_i(P) = max[sp_{i+1}(P) - 1, sp´_i(P)]; }
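These loops carry over to Python almost line for line. The sketch below assumes a standard Z algorithm and uses illustrative names z_values and sp_tables; both returned lists are indexed 1..n with index 0 unused.

```python
def z_values(s):
    """Standard Z algorithm; z[0] is left at 0."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def sp_tables(pattern):
    """Return (sp_prime, sp) computed from the Z values of pattern."""
    n = len(pattern)
    z = [0] + z_values(pattern)            # z[j] = Z_j(P) with 1-based j
    sp_prime = [0] * (n + 1)
    for j in range(n, 1, -1):              # j = n downto 2
        i = j + z[j] - 1                   # right end of the Z-box starting at j
        sp_prime[i] = z[j]                 # smaller j later overwrites larger j
    sp = [0] * (n + 1)
    sp[n] = sp_prime[n]
    for i in range(n - 1, 1, -1):          # i = n-1 downto 2
        sp[i] = max(sp[i + 1] - 1, sp_prime[i])
    return sp_prime, sp


sp_prime, sp = sp_tables("abcxabcde")
print(sp_prime[1:])
print(sp[1:])
```

Running j downwards matters: when several positions map to the same i, the last write comes from the smallest j, which gives the longest match as the theorem requires.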

67 Preprocessing for KMP Defn. Failure function F´(i) = sp´_{i-1} + 1, 1 ≤ i ≤ n + 1, with sp´_0 = 0 (similarly F(i) = sp_{i-1} + 1, 1 ≤ i ≤ n + 1, with sp_0 = 0). Example: T = xyabcxabcxadcdqfeg, P = abcxabcde. Comparisons 1 to 7 match and comparison 8 finds d != x, so P shifts 4 places. Shifting is only conceptual and P is never explicitly shifted: the algorithm only moves a pointer i in P and a pointer c in T. Two special cases: 1. Mismatch at position 1: then F’(1) = 1. 2. Match found: then P shifts by n - sp’_n places, which corresponds to F’(n+1) = sp’_n + 1.

68 Preprocessing for KMP Defn. Failure function F´(i) = sp´_{i-1} + 1, 1 ≤ i ≤ n + 1, with sp´_0 = 0 (similarly F(i) = sp_{i-1} + 1, 1 ≤ i ≤ n + 1, with sp_0 = 0). Idea: – We maintain a pointer i in P and c in T. – After a mismatch at P(i+1) with T(c), shift P to align P(sp´_i + 1) with T(c), i.e., i = sp´_i + 1. – Special case 1: i = 1 ⇒ set i = F´(1) = 1 and c = c + 1. – Special case 2: we find P in T ⇒ shift n - sp´_n places, i.e., i = F´(n + 1) = sp´_n + 1.

69 Full KMP Algorithm
Preprocess P to find F´(k) = sp´_{k-1} + 1 for k from 1 to n + 1.
c = 1; p = 1;
While c + (n – p) ≤ m {
  While p ≤ n and P(p) = T(c) { p = p + 1; c = c + 1; }
  If (p = n + 1) then report an occurrence of P at position c – n of T.
  If (p = 1) then c = c + 1;
  p = F´(p);
}
Here |T| = m and |P| = n; p is the pointer into P and c is the pointer into T. Example strings: T = xyabcxabcxadcdqfeg, P = abcxabcde.

70 Full KMP Algorithm T = xyabcxabcxabcdefeg, P = abcxabcde. Start with c = 1, p = 1. Comparison 1: P(1) = a != T(1) = x. p != n + 1, so nothing is reported. p = 1, so c = 2. Then p = F’(1) = 1.

71 Full KMP Algorithm T = xyabcxabcxabcdefeg, P = abcxabcde. Now c = 2, p = 1. Comparison 1: P(1) = a != T(2) = y. p != n + 1, so nothing is reported. p = 1, so c = 3. Then p = F’(1) = 1.

72 Full KMP Algorithm T = xyabcxabcxabcdefeg, P = abcxabcde. Now c = 3, p = 1. Comparisons 1 to 7 match (abcxabc against T(3..9)); comparison 8 finds P(8) = d != T(10) = x. p != n + 1, so nothing is reported. p = 8, not 1, so c stays at 10. Then p = F´(8) = sp´_7 + 1 = 4.

73 Full KMP Algorithm T = xyabcxabcxabcdefeg, P = abcxabcde. Now p = 4, c = 10. Comparisons 4 to 9 match (P(4..9) = xabcde against T(10..15)), so p = n + 1 = 10 and c = 16: report an occurrence of P at position c – n = 7 of T.
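The traced search can be reproduced with the following Python sketch of the full algorithm. It assumes a standard Z algorithm for the preprocessing and uses illustrative names z_values and kmp_search; the pointers p and c are kept 1-based so that the loop matches the pseudocode.

```python
def z_values(s):
    """Standard Z algorithm; z[0] is left at 0."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def kmp_search(pattern, text):
    """Return the 1-based start positions of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0:
        return []
    z = [0] + z_values(pattern)                        # z[j] = Z_j(P)
    sp_prime = [0] * (n + 1)                           # sp_prime[0] = sp'_0 = 0
    for j in range(n, 1, -1):
        sp_prime[j + z[j] - 1] = z[j]
    # failure[k] = F'(k) = sp'_{k-1} + 1 for k = 1 .. n+1 (index 0 unused)
    failure = [0] + [sp_prime[k - 1] + 1 for k in range(1, n + 2)]

    hits = []
    c = p = 1                                          # pointers into T and P
    while c + (n - p) <= m:
        while p <= n and pattern[p - 1] == text[c - 1]:
            p += 1
            c += 1
        if p == n + 1:
            hits.append(c - n)                         # occurrence starts at c - n
        if p == 1:
            c += 1                                     # mismatch at P(1): advance in T
        p = failure[p]
    return hits


print(kmp_search("abcxabcde", "xyabcxabcxabcdefeg"))
```

The outer loop condition keeps the end of the pattern inside T, which is also why the inner comparison never reads past the end of the text.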

74 Real-Time KMP Q: What is meant by real-time algorithms? A: Typically these are algorithms that are meant to interact synchronously in the real world. – This implies a known fixed turn-around time for processing a task – Many embedded scheduling systems are examples involving real-time algorithms. – For KMP this means that we require a constant time for processing all strings of length n.

75 Real-Time KMP Q: Why is KMP not real-time? A: For any mismatched character in T, we may try matching it several times. – Recall that sp´_i only guarantees that P(i + 1) and P(sp´_i + 1) differ. – There is NO guarantee that P(sp´_i + 1) and T(k) match. We need to ensure that a mismatch at T(k) does NOT entail additional matches at T(k). This means that we have to compute sp´_i values with respect to all characters in Σ, since any could appear in T.

76 Real-Time KMP Define: sp´_(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that the character P(sp´_(i,x) + 1) is x. This tells us exactly what shift to use for each possible mismatch. A mismatched character T(k) will never be involved in subsequent comparisons.

77 Real-Time KMP Q: How do we know that the mismatched character T(k) will never be involved in subsequent comparisons? A: Because the shift will shift P so that either the matching character aligns with T(k) or P will be shifted past T(k). This results in a real-time version of KMP. Let’s consider how we can find the sp´ (i,x) (P) values in linear time.

78 Real-Time KMP Thm. For P[i + 1] ≠ x, sp´_(i,x)(P) = i - j + 1. – Here j is the smallest position such that j maps to i and P(Z_j + 1) = x. – If there is no such j, then sp´_(i,x)(P) = 0.
For i = 1 to n { sp´_(i,x) = 0 for every character x; }
For j = n downto 2 { i = j + Z_j(P) – 1; x = P(Z_j + 1); sp´_(i,x) = Z_j; }

79 Real-Time KMP Notice how this works, starting from the right (j = n down to 2): – Find i, the right end of the Z-box associated with j. – Find x, the character immediately following the prefix corresponding to this Z-box. – Set sp´_(i,x) = Z_j, the length of this Z-box.
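A minimal Python sketch of this preprocessing is given below. To avoid storing a zero for every character of the alphabet, it assumes a sparse representation: one dictionary per position i, with a missing entry standing for sp´_(i,x) = 0. The names z_values and real_time_table are illustrative.

```python
def z_values(s):
    """Standard Z algorithm; z[0] is left at 0."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def real_time_table(pattern):
    """table[i][x] = sp'_(i,x)(P) for 1 <= i <= n; absent keys mean 0."""
    n = len(pattern)
    z = [0] + z_values(pattern)            # z[j] = Z_j(P) with 1-based j
    table = [dict() for _ in range(n + 1)]
    for j in range(n, 1, -1):              # j = n downto 2
        i = j + z[j] - 1                   # right end of the Z-box starting at j
        x = pattern[z[j]]                  # P(Z_j + 1): character after the matched prefix
        table[i][x] = z[j]
    return table


table = real_time_table("abcxabcde")
print(table[7].get("x", 0))                # mismatch after position 7 against text character x
print(table[7].get("d", 0))                # d is P(8) itself, so no entry exists
```

For P = abcxabcde, a mismatch at position 8 against the text character x yields sp´_(7,x) = 3, and P(4) = x is then guaranteed to match that same text character.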

80 Reference Chapter 1, 2: Exact Matching: Fundamental Preprocessing and First Algorithms

