ADVANCED ALGORITHMS: String-Matching Algorithms (UNIT-5)
String Matching :
Let the text be an array T[1..n] of length n, and let the pattern be an array P[1..m] of length m, with m ≤ n. The elements of T and P are characters drawn from a finite alphabet Σ; T and P are often called 'strings of characters'.
The pattern P occurs with shift s in the text T if 0 ≤ s ≤ n – m and T[s+1..s+m] = P[1..m], i.e., T[s+j] = P[j] for 1 ≤ j ≤ m.
If P occurs with shift s in T, then s is a VALID SHIFT; otherwise, s is an INVALID SHIFT.
The String-matching Problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T.
Ex-1 : Let text T : a b c a b a a b c a b a c
Let pattern P : a b a a
Find the number of valid shifts and the values of s.
Answer : There is only one valid shift, s = 3, since T[4..7] = a b a a.
The symbol Σ* (read as 'sigma-star') denotes the set of all finite-length strings formed using characters from the alphabet Σ.
The zero-length string, called the 'Empty String' and denoted by ɛ, also belongs to Σ*.
The length of a string x is denoted |x|.
The concatenation of two strings x and y, denoted xy, has length |x| + |y|.
A string ω is a prefix of a string x, denoted ω ⊏ x, if x = ωy for some string y ∊ Σ*. Note that if ω ⊏ x, then |ω| ≤ |x|.
Similarly, a string ω is a suffix of a string x, denoted ω ⊐ x, if x = yω for some string y ∊ Σ*. Note that if ω ⊐ x, then |ω| ≤ |x|.
Ex-2 : Consider the string abcca. Here, ab ⊏ abcca and cca ⊐ abcca.
Note-1 : The empty string ɛ is both a suffix and a prefix of every string.
Note-2 : Both ⊏ (prefix) and ⊐ (suffix) are transitive relations.
Lemma : Suppose that x, y, and z are strings such that x ⊐ z and y ⊐ z. Then :
if |x| ≤ |y|, then x ⊐ y;
if |x| ≥ |y|, then y ⊐ x;
if |x| = |y|, then x = y.
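The prefix and suffix relations, and the lemma on suffixes of a common string, can be checked with a few lines of Python. This is only an illustrative sketch: the helper names is_prefix and is_suffix are assumptions made for this example, and the strings x, y, z are chosen arbitrarily around Ex-2.

```python
def is_prefix(w: str, x: str) -> bool:
    """Return True if w is a prefix of x, i.e. x = w + y for some string y."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """Return True if w is a suffix of x, i.e. x = y + w for some string y."""
    return x.endswith(w)


# Ex-2: ab is a prefix of abcca, and cca is a suffix of abcca.
print(is_prefix("ab", "abcca"), is_suffix("cca", "abcca"))

# Note-1: the empty string is a prefix and a suffix of every string.
print(is_prefix("", "abcca"), is_suffix("", "abcca"))

# Lemma: x and y are both suffixes of z and |x| <= |y|, so x is a suffix of y.
z, x, y = "abcca", "ca", "cca"
print(is_suffix(x, z) and is_suffix(y, z) and len(x) <= len(y) and is_suffix(x, y))
```

Each print statement should report True for these particular strings.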
2. The Naïve String-matching Algorithm :
This algorithm finds all valid shifts using a loop that checks the condition P[1..m] = T[s+1..s+m] for each of the n – m + 1 possible values of s.

NAÏVE-STRING-MATCHER(T, P)
    n = T.length
    m = P.length
    for s = 0 to n – m
        if P[1..m] == T[s+1..s+m]
            print "Pattern occurs with shift" s
Ex-3 : Let T = acaabc and P = aab. Find the value of s.
Answer : s = 2
Ex-4 : Let T = 000010001010001 and P = 0001. Find the values of s.
Answer : s = 1, 5 and 11
Ex-5 : Let T = a^n and P = a^m (the character a repeated n times and m times, respectively).
Answer : Every shift s = 0 to n – m is valid, i.e., there are n – m + 1 valid shifts.
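The naive matcher and the examples above can be reproduced with a short Python sketch. Python strings are 0-indexed, so the window T[s+1..s+m] of the pseudocode becomes text[s:s+m]; the shift values themselves are unchanged. The function name and the choice to return a list of shifts, rather than print them, are assumptions made for this illustration.

```python
def naive_string_matcher(text: str, pattern: str) -> list:
    """Return every valid shift s such that text[s:s+m] equals pattern."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each of the n - m + 1 candidate shifts in turn.
    for s in range(n - m + 1):
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_string_matcher("abcabaabcabac", "abaa"))    # Ex-1
print(naive_string_matcher("acaabc", "aab"))            # Ex-3
print(naive_string_matcher("000010001010001", "0001"))  # Ex-4
print(naive_string_matcher("a" * 5, "a" * 2))           # Ex-5 with n = 5, m = 2
```

For Ex-5 with n = 5 and m = 2, the result contains the n – m + 1 = 4 shifts 0 to 3.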
3. The Rabin-Karp Algorithm :
Let Σ = {0, 1, 2, … , 9}, so that each character is a decimal digit and d = |Σ| = 10. A string of digits is then read as a number in radix-d notation; for example, the string 31415 represents the decimal number 31,415.
Let the text be T[1..n] and the pattern be P[1..m].
Let p denote the decimal value of P[1..m].
Let t_s denote the decimal value of the length-m substring T[s+1..s+m], for s = 0, 1, 2, …, n – m.
Then t_s = p iff T[s+1..s+m] = P[1..m], i.e., s is a valid shift iff t_s = p.
The value of p can be computed using Horner's rule. Writing p as the number with digits P[1] P[2] P[3] … P[m], we get
p = P[m] + 10 (P[m-1] + 10 (P[m-2] + … + 10 (P[2] + 10 P[1])…)).
Similarly, t_0 can be computed as
t_0 = T[m] + 10 (T[m-1] + 10 (T[m-2] + … + 10 (T[2] + 10 T[1])…)).
Each subsequent value t_{s+1} can then be computed from t_s in constant time :
t_{s+1} = 10 (t_s – 10^{m-1} T[s+1]) + T[s+m+1].
Ex-6 : Let m = 5, t_s = 31415 and T[s+m+1] = 2. Then
t_{s+1} = 10 (t_s – 10^{m-1} T[s+1]) + T[s+m+1] = 10 (31415 – 10^4 · 3) + 2 = 14150 + 2 = 14152.
Since p and t_s may be too large to work with conveniently, all values are computed modulo q. Choose q so that d q fits in one computer word; the recurrence then becomes :
t_{s+1} = (d (t_s – T[s+1] h) + T[s+m+1]) mod q.
Here h ≡ d^{m-1} (mod q), i.e., h is the value of the digit '1' in the high-order position of an m-digit text window.
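Horner's rule and the two update recurrences can be checked numerically. The sketch below is illustrative only: the function names are assumptions, digits are passed as Python lists of integers, and the modulus q = 13 is an arbitrary small value picked for the demonstration.

```python
def horner_value(digits, d=10):
    """Evaluate a digit sequence as a radix-d number using Horner's rule."""
    value = 0
    for digit in digits:
        value = d * value + digit
    return value


def roll(t_s, leading, incoming, m, d=10):
    """Exact update: drop the leading digit, shift left, append the new digit."""
    return d * (t_s - d ** (m - 1) * leading) + incoming


def roll_mod(t_s, leading, incoming, h, d=10, q=13):
    """Modular update; h must equal d^(m-1) mod q."""
    # Python's % always returns a value in 0..q-1, even for a negative operand.
    return (d * (t_s - leading * h) + incoming) % q


window = [3, 1, 4, 1, 5]
m = len(window)
t_s = horner_value(window)            # 31415
t_next = roll(t_s, window[0], 2, m)   # 14152, as in Ex-6
print(t_s, t_next)

q = 13
h = pow(10, m - 1, q)                 # 10^4 mod 13
print(roll_mod(t_s % q, window[0], 2, h), t_next % q)  # the two values agree
```

The last line compares the modular update against reducing the exact value modulo q, which is the property the algorithm relies on.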
The test t_s ≡ p (mod q) is a fast heuristic test to rule out invalid shifts s.
For any value of s, if t_s ≡ p (mod q) is TRUE but P[1..m] = T[s+1..s+m] is FALSE, then s is called a SPURIOUS HIT.
Note :
a) If t_s ≡ p (mod q) is TRUE, then t_s = p may or may not be TRUE, so the shift must be verified by comparing the characters.
b) If t_s ≡ p (mod q) is FALSE, then t_s ≠ p is definitely TRUE, so the shift is invalid.
RABIN-KARP-MATCHER(T, P, d, q)
    n = T.length
    m = P.length
    h = d^{m-1} mod q
    p = 0
    t_0 = 0
    for i = 1 to m                      // preprocessing
        p = (d p + P[i]) mod q
        t_0 = (d t_0 + T[i]) mod q
    for s = 0 to n – m                  // matching
        if p == t_s
            if P[1..m] == T[s+1..s+m]
                print "Pattern occurs with shift" s
        if s < n – m
            t_{s+1} = (d (t_s – T[s+1] h) + T[s+m+1]) mod q
Ex-7 : Let T = 2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 1
Let P = 3 1 4 1 5
Here n = 19, m = 5, d = 10, q = 13 and h = 10^4 mod 13 = 3. Initially p = 0 and t_0 = 0.
First for statement (preprocessing) :

 i    p    t_0
 1    3     2
 2    5    10
 3    2     1
 4    8     6
 5    7     8

After the loop, p = 7 and t_0 = 8.
Second for statement (matching) :

  s   p   t_s   T[s+1..s+5]   p == t_s   s < n – m   t_{s+1}   Remark
  0   7     8   23590         FALSE      TRUE           9
  1   7     9   35902         FALSE      TRUE           3
  2   7     3   59023         FALSE      TRUE          11
  3   7    11   90231         FALSE      TRUE           0
  4   7     0   02314         FALSE      TRUE           1
  5   7     1   23141         FALSE      TRUE           7
  6   7     7   31415         TRUE       TRUE           8     valid match (s = 6)
  7   7     8   14152         FALSE      TRUE           4
  8   7     4   41526         FALSE      TRUE           5
  9   7     5   15267         FALSE      TRUE          10
 10   7    10   52673         FALSE      TRUE          11
 11   7    11   26739         FALSE      TRUE           7
 12   7     7   67399         TRUE       TRUE           9     spurious hit (s = 12)
 13   7     9   73992         FALSE      TRUE          11
 14   7    11   39921         FALSE      FALSE        ---

Hence, there is only ONE VALID MATCH, at s = 6, and only ONE SPURIOUS HIT, at s = 12.
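The Rabin-Karp procedure can be sketched in Python to reproduce Ex-7. This is a minimal sketch under stated assumptions: the text and pattern are strings of decimal digits, indices are 0-based, and the function returns two lists (valid shifts and spurious hits) instead of printing, which is a choice made for this illustration. The guard for an empty or over-long pattern is likewise an addition not present in the pseudocode.

```python
def rabin_karp_matcher(text: str, pattern: str, d: int = 10, q: int = 13):
    """Return (valid_shifts, spurious_hits) for digit strings text and pattern."""
    n, m = len(text), len(pattern)
    valid, spurious = [], []
    if m == 0 or m > n:          # guard added for this sketch
        return valid, spurious

    h = pow(d, m - 1, q)         # d^(m-1) mod q
    p = t = 0
    for i in range(m):           # preprocessing with Horner's rule, mod q
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q

    for s in range(n - m + 1):   # matching
        if p == t:
            # Equal hash values: verify character by character.
            if text[s:s + m] == pattern:
                valid.append(s)
            else:
                spurious.append(s)
        if s < n - m:
            # Roll the window: remove text[s], append text[s + m].
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return valid, spurious


valid, spurious = rabin_karp_matcher("2359023141526739921", "31415")
print("valid shifts:", valid)
print("spurious hits:", spurious)
```

With d = 10 and q = 13 this reports the valid shift 6 and the spurious hit 12, in agreement with the trace of Ex-7.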
The Knuth-Morris-Pratt Algorithm :
This algorithm is meant for 'Pattern Matching'. Here, the prefix function π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself.
Ex-8 : Let the text string T and the pattern P be :

 i    :  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 T[i] :  b  a  c  b  a  b  a  b  a  c  a  c  a  c  a

 j    :  1  2  3  4  5  6  7
 P[j] :  a  b  a  b  a  c  a
COMPUTE-PREFIX-FUNCTION(P)
    m = P.length
    let π[1..m] be a new array
    π[1] = 0
    k = 0
    for q = 2 to m
        while k > 0 and P[k+1] ≠ P[q]
            k = π[k]
        if P[k+1] == P[q]
            k = k + 1
        π[q] = k
    return π
Ex-8 (contd.) :

 j    :  1  2  3  4  5  6  7
 P[j] :  a  b  a  b  a  c  a

INIT : m = 7, π[1] = 0, k = 0
Step q = 2 : Here k = 0, P[k+1] = a and P[q] = b. So, while : FALSE and if : FALSE. Hence, π[2] = 0.
Step q = 3 : Here k = 0, P[k+1] = a and P[q] = a. So, while : FALSE and if : TRUE, giving k = 1. Hence, π[3] = 1.
Step q = 4 : Here k = 1, P[k+1] = b and P[q] = b. So, while : FALSE and if : TRUE, giving k = 2. Hence, π[4] = 2.
Step q = 5 : Here k = 2, P[k+1] = a and P[q] = a. So, while : FALSE and if : TRUE, giving k = 3. Hence, π[5] = 3.
Step q = 6 : Here k = 3, P[k+1] = b and P[q] = c. So, while : TRUE, giving k = π[3] = 1. Now k = 1, P[k+1] = b and P[q] = c, so while : TRUE again, giving k = π[1] = 0. Then if : FALSE (P[1] ≠ P[6]). Hence, π[6] = 0.
Step q = 7 : Here k = 0, P[k+1] = a and P[q] = a. So, while : FALSE and if : TRUE (P[1] == P[7]), giving k = 1. Hence, π[7] = 1.
Hence the array π is as follows :

 q    :  1  2  3  4  5  6  7
 π[q] :  0  0  1  2  3  0  1

The procedure returns this array π (not a single value).
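The prefix-function computation can be sketched in Python and checked against the array obtained for Ex-8. Because Python lists are 0-indexed, pi[q] in the sketch corresponds to π[q+1] in the pseudocode, and the fallback k = π[k] becomes k = pi[k - 1]; the function name is an assumption made for this illustration.

```python
def compute_prefix_function(pattern: str) -> list:
    """Return pi, where pi[q] is the length of the longest proper prefix of
    pattern[:q + 1] that is also a suffix of pattern[:q + 1]."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                  # length of the current matched prefix
    for q in range(1, m):
        # Fall back to shorter borders until the next character can extend one.
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("ababaca"))
```

For the pattern a b a b a c a this yields the values 0 0 1 2 3 0 1, matching the hand computation.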
KMP-MATCHER(T, P)
    n = T.length
    m = P.length
    π = COMPUTE-PREFIX-FUNCTION(P)
    q = 0
    for i = 1 to n
        while q > 0 and P[q+1] ≠ T[i]
            q = π[q]
        if P[q+1] == T[i]
            q = q + 1
        if q == m
            print "Pattern occurs with shift" i – m
            q = π[q]
Ex-8 (contd.) KMP-MATCHER (T, P) :
INIT : n = 15, m = 7, π = (0, 0, 1, 2, 3, 0, 1), q = 0
Column key : q is the value at the start of iteration i; C1 is the test q > 0; C2 is the test P[q+1] ≠ T[i]; 'while' is C1 and C2; the first 'if' tests P[q+1] == T[i]; the second 'if' tests q == m.

--------------------------------------------------------------------------------------
  i   q   C1   C2   while   q = π[q]   if   q = q+1   if   print     q = π[q]
--------------------------------------------------------------------------------------
  1   0   F    T    F       ---        F    ---       F    ---       ---
  2   0   F    F    F       ---        T    q = 1     F    ---       ---
  3   1   T    T    T       q = 0      F    ---       F    ---       ---
  4   0   F    T    F       ---        F    ---       F    ---       ---
  5   0   F    F    F       ---        T    q = 1     F    ---       ---
  6   1   T    F    F       ---        T    q = 2     F    ---       ---
  7   2   T    F    F       ---        T    q = 3     F    ---       ---
  8   3   T    F    F       ---        T    q = 4     F    ---       ---
  9   4   T    F    F       ---        T    q = 5     F    ---       ---
 10   5   T    F    F       ---        T    q = 6     F    ---       ---
 11   6   T    F    F       ---        T    q = 7     T    shift 4   q = 1
 12   1   T    T    T       q = 0      F    ---       F    ---       ---
 13   0   F    F    F       ---        T    q = 1     F    ---       ---
 14   1   T    T    T       q = 0      F    ---       F    ---       ---
 15   0   F    F    F       ---        T    q = 1     F    ---       ---
--------------------------------------------------------------------------------------

Hence, the pattern occurs once in T, with shift 11 – 7 = 4.
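The whole KMP procedure can be sketched in Python to confirm the trace of Ex-8. This is a minimal, self-contained sketch: it recomputes the prefix function inline, uses 0-based indices (so the shift i – m of the pseudocode becomes i - m + 1), and returns a list of shifts rather than printing them. The function name and the early return for an empty pattern are assumptions made for this illustration.

```python
def kmp_matcher(text: str, pattern: str) -> list:
    """Return all valid shifts of pattern in text using Knuth-Morris-Pratt."""
    n, m = len(text), len(pattern)
    if m == 0:                      # guard added for this sketch
        return []

    # Prefix function: pi[q] is the longest border length of pattern[:q + 1].
    pi = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[k] != pattern[j]:
            k = pi[k - 1]
        if pattern[k] == pattern[j]:
            k += 1
        pi[j] = k

    shifts = []
    q = 0                           # number of pattern characters matched so far
    for i in range(n):
        # On a mismatch, reuse the longest border instead of restarting.
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]
        if pattern[q] == text[i]:
            q += 1
        if q == m:                  # full match ending at text[i]
            shifts.append(i - m + 1)
            q = pi[q - 1]           # continue, allowing overlapping matches
    return shifts


print(kmp_matcher("bacbababacacaca", "ababaca"))   # Ex-8
print(kmp_matcher("000010001010001", "0001"))      # Ex-4 text and pattern
```

For Ex-8 the only shift reported is 4, and for the text and pattern of Ex-4 the shifts are 1, 5 and 11, the same as the naive matcher finds.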