1
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 2: KMP Algorithm Lecturer: Dr. Rose Slides by: Dr. Rose January 28, 2003
2
KMP Algorithm
Preliminaries:
– KMP can be easily explained in terms of finite state machines.
– KMP has an easily proved linear time bound.
– KMP is usually not the method of choice.
3
KMP Algorithm
Recall that the naïve approach to string matching takes Θ(mn) time in the worst case, where n = |P| and m = |T|. How can we reduce this complexity?
– Avoid redundant comparisons
– Use larger shifts
    – Boyer-Moore good suffix rule
    – Boyer-Moore extended bad character rule
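To make the cost of the naïve approach concrete, here is a minimal Python sketch that counts character comparisons. The function name naive_search and the test strings are illustrative assumptions, not part of the slides.

```python
def naive_search(P, T):
    """Return (1-based start positions of P in T, number of character comparisons)."""
    n, m = len(P), len(T)
    hits, comparisons = [], 0
    for s in range(m - n + 1):           # try every alignment of P against T
        k = 0
        while k < n:
            comparisons += 1
            if P[k] != T[s + k]:
                break
            k += 1
        if k == n:
            hits.append(s + 1)           # report 1-based positions, as on the slides
    return hits, comparisons


# Every alignment rescans "aa" before it can fail or succeed on the third character.
print(naive_search("aab", "aaaaaaab"))   # ([6], 18)
```

Each of the 6 alignments costs 3 comparisons here, and none of the work from one alignment is reused by the next; that redundancy is what KMP removes.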
4
KMP Algorithm
KMP finds larger shifts by recognizing patterns in P.
– Let sp_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P.
– By definition sp_1 = 0 for any string.
– Q: Why does this make sense?
– A: The only proper suffix of a one-character string is the empty string.
5
KMP Algorithm
Example: P = abcaeabcabd
– P[1..2] = ab, hence sp_2 = 0
– P[1..3] = abc, hence sp_3 = 0
– P[1..4] = abca, hence sp_4 = 1
– P[1..5] = abcae, hence sp_5 = 0
– P[1..6] = abcaea, hence sp_6 = 1
6
KMP Algorithm
Example Continued
– P[1..7] = abcaeab, hence sp_7 = 2
– P[1..8] = abcaeabc, hence sp_8 = 3
– P[1..9] = abcaeabca, hence sp_9 = 4
– P[1..10] = abcaeabcab, hence sp_10 = 2
– P[1..11] = abcaeabcabd, hence sp_11 = 0
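The sp values in this example can be checked mechanically. The sketch below computes them straight from the definition (quadratic time, for checking only); the name sp_naive is an assumption made for illustration.

```python
def sp_naive(P):
    """sp[i] for i = 1..n (index 0 unused), computed directly from the definition."""
    n = len(P)
    sp = [0] * (n + 1)
    for i in range(1, n + 1):
        for length in range(i - 1, 0, -1):        # proper suffixes, longest first
            if P[i - length:i] == P[:length]:     # suffix of P[1..i] equals a prefix of P
                sp[i] = length
                break
    return sp


print(sp_naive("abcaeabcabd")[1:])   # [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 0]
```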
7
KMP Algorithm
Like the L/L´ concept for Boyer-Moore, there is an analogous sp_i/sp´_i concept.
Let sp´_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that characters P(i + 1) and P(sp´_i + 1) are unequal.
Example: P = abcdabce, sp´_7 = 3
Obviously sp´_i(P) <= sp_i(P), since the latter is less restrictive.
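A small sketch of sp´ computed directly from this definition may help. The name sp_prime_naive is hypothetical, and the code assumes that at i = n, where P(n + 1) does not exist, the extra condition holds vacuously so that sp´_n = sp_n.

```python
def sp_prime_naive(P):
    """sp'_i for i = 1..n (index 0 unused), directly from the definition."""
    n = len(P)
    spp = [0] * (n + 1)
    for i in range(1, n + 1):
        for length in range(i - 1, 0, -1):                  # longest candidate first
            suffix_is_prefix = P[i - length:i] == P[:length]
            # 0-based: P[i] is P(i+1) and P[length] is P(length+1).
            next_chars_differ = i == n or P[i] != P[length]
            if suffix_is_prefix and next_chars_differ:
                spp[i] = length
                break
    return spp


print(sp_prime_naive("abcdabce")[7])      # 3, as in the example above
print(sp_prime_naive("abcaeabcabd")[6])   # 0, although sp_6 = 1, because P(7) = P(2) = b
```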
8
KMP Algorithm
KMP Shift Rule:
1. Mismatch case: Let position i + 1 in P and position k in T be the first mismatch in a left-to-right scan. Shift P to the right, aligning P[1..sp´_i] with T[k - sp´_i..k - 1].
2. Match case: If no mismatch is found, an occurrence of P has been found. Shift P by n - sp´_n places to continue searching for other occurrences.
9
KMP Algorithm
Observations:
– After the shift, the prefix P[1..sp´_i] of P is aligned with a substring of T that it is known to match.
– Subsequent character matching proceeds from position sp´_i + 1 of P.
– Unlike Boyer-Moore, the matched substring is not compared again.
– The shift rule based on sp´_i guarantees that the exact same mismatch won't occur at sp´_i + 1, but it doesn't guarantee that P(sp´_i + 1) = T(k).
10
KMP Algorithm
Example: P = abcxabcde
– If a mismatch occurs at position 8, P will be shifted 4 positions to the right.
– Q: Where did the 4 position shift come from?
– A: The number of positions is given by i - sp´_i. In this example i = 7 and sp´_7 = 3, so 7 - 3 = 4.
– Notice that we know the amount of the shift without knowing anything about T other than that there was a mismatch at position 8.
11
KMP Algorithm
Example Continued: P = abcxabcde
– After the shift, P[1..3] lines up with T[k-3..k-1].
– Since it is known that P[1..3] must match T[k-3..k-1], no comparison is needed.
– The scan continues from P(4) and T(k).
Advantages of KMP Shift Rule
1. P is often shifted by more than 1 character (by i - sp´_i).
2. The left-most sp´_i characters in the shifted P are known to match the corresponding characters in T.
12
KMP Algorithm
Full Example:
T = xyabcxabcxadcdqfeg
P = abcxabcde
Assume that we have already shifted past the first two positions in T.

T:  x y a b c x a b c x a d c d q f e g
P:      a b c x a b c d e
        ^ ^ ^ ^ ^ ^ ^ ^
        1 2 3 4 5 6 7 8      d != x, shift 4 places

T:  x y a b c x a b c x a d c d q f e g
P:              a b c x a b c d e
                      ^
                      1      start again from position 4 of P
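The first step of this example can be reproduced with a short sketch. It uses a quadratic, definition-based sp´ helper purely for illustration; the function names are assumptions, and the one-place shift when nothing matched is the usual convention rather than something stated on this slide.

```python
def sp_prime(P):
    """sp'_i for i = 0..n straight from the definition (quadratic, illustration only)."""
    n = len(P)
    spp = [0] * (n + 1)
    for i in range(1, n + 1):
        for length in range(i - 1, 0, -1):
            suffix_is_prefix = P[i - length:i] == P[:length]
            next_chars_differ = i == n or P[i] != P[length]
            if suffix_is_prefix and next_chars_differ:
                spp[i] = length
                break
    return spp


def shift_after_mismatch(P, T, start):
    """Scan P left to right with P(1) under T(start); all positions are 1-based.

    Returns (mismatch position in P, k, shift), or None if P occurs at start.
    Assumes the alignment fits inside T: start + len(P) - 1 <= len(T).
    """
    spp = sp_prime(P)
    for q in range(1, len(P) + 1):
        k = start + q - 1
        if P[q - 1] != T[k - 1]:
            i = q - 1                             # P[1..i] matched before the mismatch
            shift = i - spp[i] if i > 0 else 1    # nothing matched: move one place
            return q, k, shift
    return None


print(shift_after_mismatch("abcxabcde", "xyabcxabcxadcdqfeg", 3))   # (8, 10, 4)
```

The result says: mismatch at position 8 of P against T(10), shift by 4, which puts P(1) under T(7) and resumes the scan at P(4) against T(10).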
13
Preprocessing for KMP
Approach: show how to derive sp´ values from Z values.
Definition: Position j > 1 maps to i if i = j + Z_j(P) - 1
– Recall that Z_j(P) denotes the length of the Z-box starting at position j.
– This says that j maps to i if i is the right end of a Z-box starting at j.
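As a concrete illustration of the "maps to" definition, the sketch below computes Z values with the standard linear-time Z algorithm and lists which positions map where. The names z_values and maps_to are assumptions; positions with an empty Z-box (Z_j = 0) are left out of the listing.

```python
def z_values(P):
    """Z[j] for j = 1..n, 1-based; Z[0] is unused and Z[1] is left at 0 (j > 1 is required)."""
    n = len(P)
    z = [0] * n                          # 0-based working array
    left = right = 0                     # current rightmost Z-box is P[left:right]
    for k in range(1, n):
        if k < right:                    # reuse information from inside the box
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and P[z[k]] == P[k + z[k]]:
            z[k] += 1                    # extend by explicit comparisons
        if k + z[k] > right:
            left, right = k, k + z[k]
    return [0] + z                       # shift so that Z[j] = z[j - 1]


def maps_to(P):
    """{j: i} for every j > 1 with a non-empty Z-box, where i = j + Z_j - 1."""
    Z = z_values(P)
    return {j: j + Z[j] - 1 for j in range(2, len(P) + 1) if Z[j] > 0}


print(maps_to("abcxabcde"))      # {5: 7}
print(maps_to("abcaeabcabd"))    # {4: 4, 6: 9, 9: 10}
```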
14
Preprocessing for KMP
Theorem. For any i > 1, sp´_i(P) = Z_j = i - j + 1, where j > 1 is the smallest position that maps to i. If there is no such j then sp´_i(P) = 0.
Similarly for sp: For any i > 1, sp_i(P) = i - j + 1, where j, with i >= j > 1, is the smallest position that maps to i or beyond. If there is no such j then sp_i(P) = 0.
15
Preprocessing for KMP
Given the theorem from the preceding slide, the sp´_i and sp_i values can be computed in linear time using the Z values:
For i = 1 to n { sp´_i = 0; }
For j = n downto 2 {
    i = j + Z_j(P) - 1;
    sp´_i = Z_j;
}
sp_n(P) = sp´_n(P);
For i = n - 1 downto 2 {
    sp_i(P) = max[sp_{i+1}(P) - 1, sp´_i(P)];
}
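The two loops translate almost line for line into Python. This is a sketch under the assumption that Z values come from the standard linear-time Z algorithm; the names z_values and sp_tables are illustrative, and arrays are 1-based with index 0 holding sp´_0 = sp_0 = 0.

```python
def z_values(P):
    """Z[j] for j = 1..n (1-based, Z[1] left at 0), via the standard Z algorithm."""
    n = len(P)
    z = [0] * n
    left = right = 0
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and P[z[k]] == P[k + z[k]]:
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return [0] + z


def sp_tables(P):
    """Return (sp', sp) as lists indexed 0..n, following the slide's two loops."""
    n = len(P)
    Z = z_values(P)
    spp = [0] * (n + 1)
    for j in range(n, 1, -1):            # j = n downto 2
        i = j + Z[j] - 1                 # right end of the Z-box starting at j
        spp[i] = Z[j]                    # smaller j is visited later and overwrites
    sp = [0] * (n + 1)
    sp[n] = spp[n]
    for i in range(n - 1, 1, -1):        # i = n - 1 downto 2
        sp[i] = max(sp[i + 1] - 1, spp[i])
    return spp, sp


spp, sp = sp_tables("abcaeabcabd")
print(spp[1:])   # [0, 0, 0, 1, 0, 0, 0, 0, 4, 2, 0]
print(sp[1:])    # [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 0]
```

Because j runs downward, the last write to sp´_i comes from the smallest j that maps to i, which is exactly what the theorem asks for.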
16
Preprocessing for KMP
Defn. Failure function F´(i) = sp´_{i-1} + 1, 1 <= i <= n + 1, with sp´_0 = 0
(similarly F(i) = sp_{i-1} + 1, 1 <= i <= n + 1, with sp_0 = 0)
Idea:
– We maintain a pointer p in P and a pointer c in T.
– After a mismatch at P(p) with T(c), where p = i + 1, shift P to align P(sp´_i + 1) with T(c), i.e., set p = F´(p) = sp´_i + 1.
– Special case 1: if p = 1, set p = F´(1) = 1 and c = c + 1.
– Special case 2: if we find P in T, shift n - sp´_n places, i.e., set p = F´(n + 1) = sp´_n + 1.
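A tiny sketch shows the failure function as a table lookup. It takes the sp´ values as given (here typed in by hand for P = abcxabcde); the function name failure_function is an assumption.

```python
def failure_function(spp):
    """Build F' from sp' values.

    spp[i] holds sp'_i for i = 0..n, with spp[0] = 0.
    Returns F with F[k] = sp'_{k-1} + 1 for k = 1..n+1 (F[0] is unused).
    """
    n = len(spp) - 1
    return [0] + [spp[k - 1] + 1 for k in range(1, n + 2)]


# sp' values for P = abcxabcde: only sp'_7 = 3 is non-zero.
spp = [0, 0, 0, 0, 0, 0, 0, 3, 0, 0]
F = failure_function(spp)
print(F[1:])   # [1, 1, 1, 1, 1, 1, 1, 4, 1, 1]
print(F[8])    # 4: after a mismatch at P(8), resume comparing at P(4)
```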
17
Full KMP Algorithm
Preprocess P to find F´(k) = sp´_{k-1} + 1 for k from 1 to n + 1
c = 1; p = 1;
While c + (n - p) <= m {
    While P(p) = T(c) and p <= n {
        p = p + 1; c = c + 1;
    }
    If (p = n + 1) then report an occurrence of P at position c - n of T.
    If (p = 1) then c = c + 1;
    p = F´(p);
}
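For readers who want to run the algorithm, here is a hedged Python transcription. It keeps the slide's 1-based pointers p and c and builds F´ from Z values; the helper names are assumptions, and the sketch assumes P is non-empty.

```python
def z_values(P):
    """Z[j] for j = 1..n (1-based, Z[1] left at 0), via the standard Z algorithm."""
    n = len(P)
    z = [0] * n
    left = right = 0
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and P[z[k]] == P[k + z[k]]:
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return [0] + z


def failure_table(P):
    """F[k] = sp'_{k-1} + 1 for k = 1..n+1, with sp' derived from Z values."""
    n = len(P)
    Z = z_values(P)
    spp = [0] * (n + 1)
    for j in range(n, 1, -1):
        spp[j + Z[j] - 1] = Z[j]
    return [0] + [spp[k - 1] + 1 for k in range(1, n + 2)]


def kmp_search(P, T):
    """Return the 1-based start positions of all occurrences of P in T."""
    n, m = len(P), len(T)
    F = failure_table(P)
    hits = []
    c = p = 1
    while c + (n - p) <= m:
        while p <= n and P[p - 1] == T[c - 1]:   # test p <= n first to stay in range
            p += 1
            c += 1
        if p == n + 1:
            hits.append(c - n)
        if p == 1:
            c += 1
        p = F[p]
    return hits


print(kmp_search("abcxabcde", "xyabcxabcxabcdefeg"))   # [7]
print(kmp_search("aa", "aaaa"))                        # [1, 2, 3]
```

The only deliberate deviation from the pseudocode is testing p <= n before comparing characters, which avoids reading past the end of P in a language with bounds-checked indexing.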
18
Full KMP Algorithm
Trace of the algorithm from the Full KMP Algorithm slide, starting with c = 1, p = 1:

T:  x y a b c x a b c x a b c d e f e g
P:  a b c x a b c d e
    ^
    1      a != x

– p != n + 1, so no occurrence is reported.
– p = 1, so c = c + 1 = 2.
– p = F´(1) = 1.
19
Full KMP Algorithm
Trace continued, with c = 2, p = 1:

T:  x y a b c x a b c x a b c d e f e g
P:    a b c x a b c d e
      ^
      1      a != y

– p != n + 1, so no occurrence is reported.
– p = 1, so c = c + 1 = 3.
– p = F´(1) = 1.
20
Full KMP Algorithm
Trace continued, with c = 3, p = 1:

T:  x y a b c x a b c x a b c d e f e g
P:      a b c x a b c d e
        ^ ^ ^ ^ ^ ^ ^ ^
        1 2 3 4 5 6 7 8      d != x

– p != n + 1, so no occurrence is reported.
– p = 8, which is not 1, so c is not changed (c = 10).
– p = F´(8) = 4.
21
Full KMP Algorithm
Trace continued, with p = 4, c = 10:

T:  x y a b c x a b c x a b c d e f e g
P:              a b c x a b c d e
                      ^ ^ ^ ^ ^ ^
                      4 5 6 7 8 9

– All remaining characters match, so the inner loop ends with p = n + 1 = 10 and c = 16.
– p = n + 1, so an occurrence of P is reported at position c - n = 7 of T.
22
Real-Time KMP
Q: What is meant by real-time algorithms?
A: Typically these are algorithms that are meant to interact synchronously with the real world.
– This implies a known, fixed turn-around time for processing a task.
– Many embedded scheduling systems are examples involving real-time algorithms.
– For KMP this means that we require a constant bound on the work done for each character of T, regardless of the length n of P.
23
Real-Time KMP
Q: Why is KMP not real-time?
A: For any mismatched character in T, we may try matching it several times.
– Recall that sp´_i only guarantees that P(i + 1) and P(sp´_i + 1) differ.
– There is NO guarantee that P(sp´_i + 1) and T(k) match.
We need to ensure that a mismatch at T(k) does NOT entail additional match attempts at T(k). This means that we have to compute sp´_i values with respect to all characters in the alphabet Σ, since any of them could appear in T.
24
Real-Time KMP
Define: sp´_(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that character P(sp´_(i,x) + 1) is x.
This will tell us exactly what shift to use for each possible mismatch.
A mismatched character T(k) will never be involved in subsequent comparisons.
25
Real-Time KMP
Q: How do we know that the mismatched character T(k) will never be involved in subsequent comparisons?
A: Because the shift moves P so that either a character of P known to equal T(k) aligns with T(k), or P is shifted past T(k).
This results in a real-time version of KMP.
Let's consider how we can find the sp´_(i,x)(P) values in linear time.
26
Real-Time KMP
Thm. For P(i + 1) != x, sp´_(i,x)(P) = i - j + 1
– Here j is the smallest position such that j maps to i and P(Z_j + 1) = x.
– If there is no such j then sp´_(i,x)(P) = 0.
For i = 1 to n { sp´_(i,x) = 0 for every character x; }
For j = n downto 2 {
    i = j + Z_j(P) - 1;
    x = P(Z_j + 1);
    sp´_(i,x) = Z_j;
}
27
Real-Time KMP
Notice how this works:
– Starting from the right (j = n downto 2):
    – Find i, the right end of the Z-box associated with j.
    – Find x, the character immediately following the prefix corresponding to this Z-box.
    – Set sp´_(i,x) = Z_j, the length of this Z-box.
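The per-character table can be sketched in Python as one dictionary per position i, where a missing key stands for the value 0. The names are illustrative assumptions, and the sketch skips empty Z-boxes (Z_j = 0) because they would only write zeros that the missing-key default already supplies.

```python
def z_values(P):
    """Z[j] for j = 1..n (1-based, Z[1] left at 0), via the standard Z algorithm."""
    n = len(P)
    z = [0] * n
    left = right = 0
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and P[z[k]] == P[k + z[k]]:
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return [0] + z


def sp_prime_by_char(P):
    """table[i][x] = sp'_(i,x) for i = 1..n; a missing key means 0."""
    n = len(P)
    Z = z_values(P)
    table = [dict() for _ in range(n + 1)]
    for j in range(n, 1, -1):            # j = n downto 2
        if Z[j] == 0:
            continue                     # empty Z-box: the value stays 0
        i = j + Z[j] - 1                 # right end of the Z-box starting at j
        x = P[Z[j]]                      # P(Z_j + 1) in the slides' 1-based notation
        table[i][x] = Z[j]
    return table


table = sp_prime_by_char("abcxabcde")
print(table[7])                # {'x': 3}
print(table[7].get("y", 0))    # 0: no border of P[1..7] is followed by y
```

For P = abcxabcde with a mismatch at P(8) against T(k) = x, the lookup gives 3, so the shift places P(4) = x under T(k) and that text character never has to be compared again.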