Exact String Matching Algorithms Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU.

Exact String Matching Algorithms Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU

Exact Matching: What’s the Problem 1 1 2 34 5 67 8 90 1 2 T = bbabaxababay P = aba P occurs in T starting at locations 3, 7, and 9 P may overlap, as found at 7 and 9.

The Naive Method Problem is to find if a pattern P[1..m] occurs within text T[1..n] Let P = abxyabxz and T = xabxyabxyabxz Where m = 8 and n = 13

The Naive Method If P = aaa and T = aaaaaaaaaa then n=3, m=10 In worst case exactly n(m-n+1) comparisons In this case 24 comparisons in the order of θ ( mn ).

The Naive Algorithm Char text[], pat[] ; int n, m ; { int i, j, k, lim ; lim=n-m+1 ; for (i=1 ; i<=lim ; i++) /* search */ { k=i ; for (j=1 ; j<=m && text[k]==pat[j]; j++) k++; if (j>m) Report_match_at_position(i-j+1); } The worst-case bound can be reduced to O ( m + n ) For applications with n = 1000 and m = 10,000,000 the improvement is significant.

The Smart Algorithm Reasoning of this sort is the key to shifting by more than one character Instead of Skips over three comparisons If you know first character of P (namely a) does not occur again at P until position 5 of P 12345 678

The Smarter Algorithm Instead of Skips over three comparisons Instead of Starts at Skips another three

The Smart Algorithms Knuth-Morris-Pratt (KMP) Alogorithm Boyer-Moore Algorithm Reduced run-time to O ( n + m ) Additional knowledge requires preprocessing of strings Usually P is much shorter than T So P is preprocessed

The Preprocessing Approach Usually P is preprocessed instead of T Sometimes T is preprocessed, e.g. suffix tree The preprocessing methods are similar in spirit, but often quite different in detail and conceptual difficulty Fundamental preprocessing of P is independent of any particular algorithm Each algorithm uses this information

Basic String Definitions/Notations Let, S be the string S[i..j] is the substring of S starting at position i and ending at position j, S[i..j] is empty if i > j 1 1 2 34 5 67 8 90 1 2 S = bbabaxababay S[3..7] = abaxa S[1..4] = bbab |S| is the length of the string. Here, |S| = 12 S[1..i] is prefix of S that ends at position i Prefix S[i..|S|] is the suffix of S that begins at position i S[9..12] = abay Suffix

A proper prefix, suffix or substring of S is, respectively, a prefix, suffix or substring that is not the entire string S, not the empty string. For any string S, S(i) denotes the i th character of S Basic String Definitions/Notations

12 Preprocessing Goal: To gather the information needed for speeding up the algorithm Definitions: – Z i : For i>1, the length of the longest substring of S that starts at i and matches a prefix of S – Z-box: for any position i >1 where Z i >0, the Z-box at i starts at i and ends at i+Z i -1 – r i; For every i>1, r i is the right-most endpoint of the Z-boxes that begin at or before i – l i; For every i>1, l i is the left endpoint of the Z-box ends at r i

Preprocessing Z i (S) = The longest prefix of S[i..|S|] that matches a prefix of S, where i > 1 1 12 3 456 7 8 901 S = aabcaabxaaz Z 5 (S) = Z 6 (S) = Z 7 (S) = Z 8 (S) = 0 Z 9 (S) = 2 (aab…aaz) 3 (aabc…aabx…) 1 (aa…ab…) We will use Z i in place of Z i (S) Z Box for i > 1, where Z i is greater than zero Figure 1.2: From Gusfield

The l i and r i of Z-Box 40 50 55 62 70 78 82 85 89 95 r i = the right-most endpoint of the Z-boxes that begin at or before position i. l i = the left end of the Z-box that ends at r i. r 78 =95l 78 =78 r 82 =95l 82 =78 r 52 =50l 52 =40 r 75 =85l 75 =70

15 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 S: a a b a a b c a x a a b a a b c y Z: 0 1 0 3 1 0 0 1 0 7 1 0 3 1 0 0 0 Z-box a a b a a b c a x a a b a a b c y r i: 0 2 2 6 6 6 6 8 8 16 16 16 16 16 16 16 16 l i: 0 2 2 4 4 4 4 8 8 10 10 10 10 10 10 10 10 Preprocessing

16 Z-Algorithm Goal: To calculate Z i for an input string S in a linear time Starting from i=2, calculate Z 2, r 2 and l 2 For i=3; i<n; i++ In iteration k, calculate Z k, r k and l k based on Z j, r j and l j for j=2,…,k-1 For iteration k, the algorithm only need r k-1 and l k-1. Thus, there is no need to keep all r i and l i. We use r, and l to denote r k-1 and l k-1

17 Z-Algorithm ’’ k r l   k’ r’ l’ ’’ k’=k-l+1; r’=r-l+1;  =  ’;  =  ’ k r l In iteration k: (I) if k<=r a a b a a b c a x a a b a a b c y 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17   ’’ ’’

18 k r l   k’ r’ l’ ’’ ’’ ’’  A) If |  ’ |<|  ’ |, that is, Z k’ < r-k+1, Z k = Z k’  ’’ x y y  =  ’=  ’’; x≠y a a b a a b c a x a a b a a b c y 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 Z: 0 1 0 3 1 0 0 1 0 7 1 0 3    ’’ ’’  ’’ ’’ Z-Algorithm

19 Z-Algorithm k r l   k’ r’ l’ ’’ ’’ ’’  B) If |  ’ |>|  ’ |, that is, Z k’ >r-k+1, Z k =|  |, i.e., r-k+1  ’’ y  ’  ’’  ’=  ’’; x ≠y (because  is a Z box)  ’’ xx Z k =|  |, i.e., r-k+1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 S: a a b a a b c a x a a b a a c d Z: 0 1 0 3 1 0 0 1 0 6 1 0 2 1 0 0   ’’ ’’ ’’  ’’  ’’

20 Z-Algorithm k r l   k’ r’ l’ ’’ ’’ ’’  C) If |  ’ |=|  ’ |, that is, Z k’ =r-k+1, Z k ≥|  |, i.e., ≥ r-k+1  ’’ y  ’  ’’  =  ’=  ’’; x ≠y (because  is a Z box) z ≠x (because  ’ is a Z box) z ?? y  ’’ xz Compare S[r+1,...] with S[ |  | +1,…] until a mismatch occurs. Update Z k, r, and l 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 S: a a b a a e c a x a a b a a b d Z: 0 1 0 2 1 0 0 1 0 6 1 0 3 1 0 0   ’’ ’’ ’’  ’’

21 Z-Algorithm krl (II) if k>r Compare the characters starting at k+1 with those starting at 1. Update r, and l if necessary

22 Z-Algorithm Input: Pattern P Output: Z i Z Algorithm Calculate Z 2, r 2 and l 2 specifically by comparisons. R= r 2 and l=l 2 for i=3; i<n; i++ if k<=r if Z k-l+1 <r-k+1, then Z k = Z k-l+1 else if Z k-l+1 > r-k+1 Z k = r-k+1 else compare the characters starting at r+1 with those starting at |  | +1. Update r, and l if necessary else Compare the characters starting at k to those starting at 1. Update r, and l if necessary

23 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 S: a a b a a b c a x a a b a a b c y Z: 0 1 0 3 1 0 0 1 0 7 1 0 3 1 0 0 0 r : 0 2 2 6 6 6 6 8 8 16 16 16 16 16 16 16 16 l : 0 2 2 4 4 4 4 8 8 10 10 10 10 10 10 10 10 Preprocessing

24 Z-Algorithm Time complexity #mismatches <= number of iterations, n #matches Let q be the number of matches at iteration k, then we need to increase r by at least q r<=n Thus total #match <=n T=O( #matches + #mismatches +#iterations)=O(n) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 S: a a b a a b c a x a a b a a b c y Z: 0 1 0 3 1 0 0 1 0 7 1 0 3 1 0 0 0 r : 0 2 2 6 6 6 6 8 8 16 16 16 16 16 16 16 16 l : 0 2 2 4 4 4 4 8 8 10 10 10 10 10 10 10 10 #m: 0 1 0 3 0 0 0 1 0 7 0 0 0 0 0 0 0 #mis: 0 1 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1

25 Simplest Linear Time Exact Matching Algorithm Input: Pattern P, Text T Output: Occurrences of P in T Algorithm Simplest S=P$T, where $ is a character that do not appear in P and T For i=2; i<|S|; i++ Calculate Z i If Z i =|P|, then report that there is an occurrence of P in T starting at i-|P|-1 of T=O(|P|+|T|+1)=O(n+m)

26 Simplest Linear Time Exact Matching Algorithm Take only O (n) extra space Alphabet-independent linear time k r l   k’ r’ l’ ’’ ’’ $

Reference Chapter 1, 2: Exact Matching: Fundamental Preprocessing and First Algorithms

Exact String Matching Algorithms Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU.

Similar presentations

Presentation on theme: "Exact String Matching Algorithms Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Exact String Matching Algorithms Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU.

Similar presentations

Presentation on theme: "Exact String Matching Algorithms Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU."— Presentation transcript:

Similar presentations

About project

Feedback