Published by Harold Walters. Modified over 9 years ago.
1
Exact String Matching Algorithms Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU
2
Exact Matching: What's the Problem
i:  1  2  3  4  5  6  7  8  9 10 11 12
T:  b  b  a  b  a  x  a  b  a  b  a  y
P = aba
P occurs in T starting at locations 3, 7, and 9.
Occurrences of P may overlap, as found at 7 and 9.
3
The Naive Method
The problem is to find whether a pattern P[1..m] occurs within a text T[1..n].
Let P = abxyabxz and T = xabxyabxyabxz, where m = 8 and n = 13.
4
The Naive Method
If P = aaa and T = aaaaaaaaaa, then m = 3 and n = 10.
In the worst case the naive method makes exactly m(n-m+1) comparisons.
In this case that is 3 × 8 = 24 comparisons, which is of the order Θ(mn).
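The comparison count above can be checked with a short C sketch. It is only an illustration, not the presenter's code: the function name `naive_count_comparisons` is hypothetical, and it uses ordinary 0-indexed C strings rather than the 1-indexed arrays of the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive exact matching on 0-indexed C strings (hypothetical helper).
 * Returns the number of character comparisons performed and stores the
 * number of occurrences found in *matches. */
size_t naive_count_comparisons(const char *text, const char *pat, size_t *matches)
{
    size_t n = strlen(text);   /* n = |T| */
    size_t m = strlen(pat);    /* m = |P| */
    size_t comparisons = 0;

    assert(matches != NULL);
    *matches = 0;
    if (m == 0 || m > n)
        return 0;

    for (size_t i = 0; i + m <= n; i++) {   /* each of the n-m+1 alignments */
        size_t j = 0;
        while (j < m) {
            comparisons++;                  /* compare T[i+j] with P[j] */
            if (text[i + j] != pat[j])
                break;
            j++;
        }
        if (j == m)
            (*matches)++;
    }
    return comparisons;
}
```

With T = aaaaaaaaaa and P = aaa, every one of the 8 alignments costs 3 comparisons, which gives the 24 comparisons quoted on the slide.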
5
The Naive Algorithm
/* text[1..n] and pat[1..m] are 1-indexed; index 0 is unused. */
void naive_search(const char text[], int n, const char pat[], int m)
{
    int i, j, k, lim;
    lim = n - m + 1;
    for (i = 1; i <= lim; i++) {        /* try every alignment of pat */
        k = i;
        for (j = 1; j <= m && text[k] == pat[j]; j++)
            k++;
        if (j > m)                      /* all m characters matched */
            Report_match_at_position(i);
    }
}
The worst-case bound can be reduced to O(m + n).
For applications with m = 1000 and n = 10,000,000 the improvement is significant.
6
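The Smart Algorithm
If you know that the first character of P (namely a) does not occur again in P until position 5 of P, you can shift P by more than one position. This skips over three comparisons.
Reasoning of this sort is the key to shifting by more than one character.
[Figure: P, with positions 1 to 8 marked, aligned against T before and after the shift]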
7
The Smarter Algorithm
[Figure: two alignments of P against T. In the first, the shift skips over three comparisons. In the second, the comparison starts further along in P, which skips another three.]
8
The Smart Algorithms
– Knuth-Morris-Pratt (KMP) algorithm
– Boyer-Moore algorithm
Reduced run-time to O(n + m).
The additional knowledge requires preprocessing of the strings.
Usually P is much shorter than T, so P is preprocessed.
9
The Preprocessing Approach
– Usually P is preprocessed instead of T.
– Sometimes T is preprocessed, e.g. into a suffix tree.
– The preprocessing methods are similar in spirit, but often quite different in detail and conceptual difficulty.
– The fundamental preprocessing of P is independent of any particular algorithm.
– Each algorithm uses this information.
10
Basic String Definitions/Notations
Let S be the string.
S[i..j] is the substring of S starting at position i and ending at position j. S[i..j] is empty if i > j.
i:  1  2  3  4  5  6  7  8  9 10 11 12
S:  b  b  a  b  a  x  a  b  a  b  a  y
S[3..7] = abaxa
|S| is the length of the string. Here, |S| = 12.
Prefix: S[1..i] is the prefix of S that ends at position i, e.g. S[1..4] = bbab.
Suffix: S[i..|S|] is the suffix of S that begins at position i, e.g. S[9..12] = abay.
11
Basic String Definitions/Notations
A proper prefix, suffix or substring of S is, respectively, a prefix, suffix or substring that is neither the entire string S nor the empty string.
For any string S, S(i) denotes the i-th character of S.
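Because the slides number positions from 1 while C strings are indexed from 0, the notation S[i..j] can be mapped onto C as in the sketch below. The helper name `substring_1based` and its buffer-based interface are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies S[i..j] (1-based, inclusive, as on the slides) into out, which
 * must hold at least out_size bytes. The result is empty when i > j. */
void substring_1based(const char *s, size_t i, size_t j, char *out, size_t out_size)
{
    size_t len = strlen(s);
    assert(i >= 1 && i <= len + 1 && j <= len);

    size_t count = (i > j) ? 0 : j - i + 1;
    assert(count < out_size);           /* room for the terminating '\0' */

    memcpy(out, s + (i - 1), count);    /* position i is index i-1 in C */
    out[count] = '\0';
}
```

A prefix is then `substring_1based(s, 1, i, ...)` and a suffix is `substring_1based(s, i, strlen(s), ...)`.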
12
Preprocessing
Goal: To gather the information needed for speeding up the algorithm.
Definitions:
– Z_i: for i > 1, the length of the longest substring of S that starts at i and matches a prefix of S
– Z-box: for any position i > 1 where Z_i > 0, the Z-box at i starts at i and ends at i + Z_i - 1
– r_i: for every i > 1, r_i is the right-most endpoint of the Z-boxes that begin at or before i
– l_i: for every i > 1, l_i is the left endpoint of the Z-box that ends at r_i
13
Preprocessing
Z_i(S) = the length of the longest prefix of S[i..|S|] that matches a prefix of S, where i > 1.
i:  1  2  3  4  5  6  7  8  9 10 11
S:  a  a  b  c  a  a  b  x  a  a  z
Z_5(S) = 3 (aabc…aabx…)
Z_6(S) = 1 (aa…ab…)
Z_7(S) = Z_8(S) = 0
Z_9(S) = 2 (aab…aaz)
We will use Z_i in place of Z_i(S).
Z-box: defined for i > 1 where Z_i is greater than zero (Figure 1.2, from Gusfield).
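The definition of Z_i translates directly into a character-by-character comparison. The sketch below computes one Z value this way. It is the quadratic, definition-level computation, not the linear-time Z-algorithm, and the name `z_by_definition` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Z_i by direct comparison: the length of the longest prefix of S[i..|S|]
 * that matches a prefix of S. Position i is 1-based with i > 1. */
size_t z_by_definition(const char *s, size_t i)
{
    size_t n = strlen(s);
    assert(i > 1 && i <= n);

    size_t len = 0;
    /* s[len] is the prefix character, s[i - 1 + len] the character at
     * 1-based position i + len. */
    while (i - 1 + len < n && s[len] == s[i - 1 + len])
        len++;
    return len;
}
```

For S = aabcaabxaaz this reproduces the values on the slide: Z_5 = 3, Z_6 = 1, Z_7 = Z_8 = 0 and Z_9 = 2.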
14
The l_i and r_i of a Z-Box
r_i = the right-most endpoint of the Z-boxes that begin at or before position i.
l_i = the left end of the Z-box that ends at r_i.
[Figure: Z-boxes drawn along S, with endpoints marked at positions 40, 50, 55, 62, 70, 78, 82, 85, 89 and 95]
r_78 = 95, l_78 = 78
r_82 = 95, l_82 = 78
r_52 = 50, l_52 = 40
r_75 = 85, l_75 = 70
15
Preprocessing
i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:    a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:    0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r_i:  0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l_i:  0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
[Figure: the Z-boxes drawn over S]
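The r_i and l_i rows follow mechanically from the Z row. A minimal sketch, assuming the Z values are already known and stored 1-based (index 0 unused), is given below. The function name `z_box_bounds` is hypothetical, and when several Z-boxes end at the same r_i it keeps the left-most one.

```c
#include <assert.h>
#include <stddef.h>

/* Given z[1..n] (index 0 unused), fills r[1..n] and l[1..n]:
 * r[i] = right-most endpoint of the Z-boxes that begin at or before i,
 * l[i] = left end of the Z-box that ends at r[i];
 * both are 0 while no Z-box has been seen. */
void z_box_bounds(const size_t *z, size_t n, size_t *r, size_t *l)
{
    assert(z != NULL && r != NULL && l != NULL);

    size_t cur_r = 0, cur_l = 0;
    if (n >= 1) {
        r[1] = 0;
        l[1] = 0;
    }
    for (size_t i = 2; i <= n; i++) {
        /* The Z-box at i is [i, i + z[i] - 1]; keep it if it reaches further right. */
        if (z[i] > 0 && i + z[i] - 1 > cur_r) {
            cur_r = i + z[i] - 1;
            cur_l = i;
        }
        r[i] = cur_r;
        l[i] = cur_l;
    }
}
```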
16
Z-Algorithm
Goal: To calculate Z_i for an input string S of length n in linear time.
Starting from k = 2, calculate Z_2, r_2 and l_2 by explicit comparisons.
For k = 3; k <= n; k++: in iteration k, calculate Z_k, r_k and l_k based on Z_j, r_j and l_j for j = 2, …, k-1.
In iteration k, the algorithm only needs r_{k-1} and l_{k-1}. Thus, there is no need to keep all r_i and l_i. We use r and l to denote r_{k-1} and l_{k-1}.
17
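Z-Algorithm
In iteration k: (I) if k <= r
Position k lies inside the Z-box α = S[l..r], which matches the prefix α' = S[l'..r'] with l' = 1.
k' = k - l + 1; r' = r - l + 1; α = α'; β = β', where β = S[k..r] and β' = S[k'..r'].
i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:  a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y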
18
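Z-Algorithm
A) If |γ'| < |β'|, that is, Z_k' < r - k + 1, then Z_k = Z_k'.
Here γ' is the Z-box starting at k', γ'' is the prefix of S that it matches, and γ is the copy of γ' starting at k, so γ = γ' = γ''.
Let x be the character following γ'' and y the character following γ'. Because γ' ends strictly inside β', y also follows γ. x ≠ y because γ' is a Z-box, so the match at k stops after Z_k' characters.
i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:  a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:  0  1  0  3  1  0  0  1  0  7  1  0  3
Example: at k = 13 we have l = 10, r = 16 and k' = 4. Z_4 = 3 < r - k + 1 = 4, so Z_13 = 3.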
19
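Z-Algorithm
B) If |γ'| > |β'|, that is, Z_k' > r - k + 1, then Z_k = |β|, i.e., r - k + 1.
Let β'' be the prefix of S of length |β|, so β = β' = β''.
Let x be the character following β' and y the character following β. x ≠ y because α is a Z-box. Since Z_k' > |β|, the character following β'' is also x, so the match at k stops exactly at r.
i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:  a  a  b  a  a  b  c  a  x  a  a  b  a  a  c  d
Z:  0  1  0  3  1  0  0  1  0  5  1  0  2  1  0  0
Example: at k = 13 we have l = 10, r = 14 and k' = 4. Z_4 = 3 > r - k + 1 = 2, so Z_13 = 2.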
20
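Z-Algorithm
C) If |γ'| = |β'|, that is, Z_k' = r - k + 1, then Z_k ≥ |β|, i.e., Z_k ≥ r - k + 1.
Let β'' be the prefix of S of length |β|, so β = β' = β''.
Let z be the character following β'', x the character following β', and y the character following β.
x ≠ y (because α is a Z-box); z ≠ x (because γ' is a Z-box of length exactly |β|); whether z = y is unknown.
Compare S[r+1, …] with S[|β|+1, …] until a mismatch occurs. Update Z_k, r, and l.
i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:  a  a  b  a  a  e  c  a  x  a  a  b  a  a  b  d
Z:  0  1  0  2  1  0  0  1  0  5  1  0  3  1  0  0
Example: at k = 13 we have l = 10, r = 14 and k' = 4. Z_4 = 2 = r - k + 1, so S[15, …] is compared with S[3, …]: b = b, then d ≠ a. Hence Z_13 = 3, r = 15 and l = 13.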
21
Z-Algorithm
(II) if k > r
Compare the characters starting at k with those starting at 1 until a mismatch occurs. The number of matches is Z_k. Update r and l if necessary, i.e., if Z_k > 0 set r = k + Z_k - 1 and l = k.
22
Z-Algorithm
Input: Pattern P, with n = |P|
Output: Z_i for i = 2, …, n
Algorithm Z
  Calculate Z_2, r_2 and l_2 explicitly by comparisons. Set r = r_2 and l = l_2.
  for k = 3; k <= n; k++
    if k <= r
      if Z_{k-l+1} < r-k+1, then Z_k = Z_{k-l+1}
      else if Z_{k-l+1} > r-k+1, then Z_k = r-k+1
      else compare the characters starting at r+1 with those starting at |β|+1 = r-k+2 until a mismatch occurs at some position q. Set Z_k = q-k, r = q-1 and l = k
    else
      compare the characters starting at k with those starting at 1 to find Z_k. If Z_k > 0, set r = k+Z_k-1 and l = k
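The pseudocode above can be turned into C roughly as follows. This is a sketch under a few assumptions that are not on the slides: the function name `z_algorithm` is hypothetical, the Z array is stored 1-based (slot 0 unused, Z_1 set to 0) so that indices match the slides, and the k = 2 step is folded into the main loop because r = 0 initially forces case II.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fills z[2..n] for the string s of length n. Positions are 1-based as on
 * the slides, so 1-based position p is s[p - 1]. z needs n + 1 slots. */
void z_algorithm(const char *s, size_t *z)
{
    size_t n = strlen(s);
    size_t l = 0, r = 0;                /* current Z-box [l, r]; r = 0: none yet */

    assert(z != NULL);
    z[0] = 0;
    if (n >= 1)
        z[1] = 0;

    for (size_t k = 2; k <= n; k++) {
        if (k > r) {
            /* Case II: compare S[k..] with S[1..] explicitly. */
            size_t len = 0;
            while (k + len <= n && s[len] == s[k - 1 + len])
                len++;
            z[k] = len;
            if (len > 0) {
                l = k;
                r = k + len - 1;
            }
        } else {
            size_t k1 = k - l + 1;      /* k' */
            size_t beta = r - k + 1;    /* |beta| */
            if (z[k1] < beta) {
                z[k] = z[k1];           /* case A */
            } else if (z[k1] > beta) {
                z[k] = beta;            /* case B */
            } else {
                /* Case C: compare S[r+1..] with S[|beta|+1..]. */
                size_t q = r + 1;       /* 1-based position in S */
                while (q <= n && s[q - 1] == s[q - k])
                    q++;
                z[k] = q - k;
                l = k;
                r = q - 1;
            }
        }
    }
}
```

Only the current l and r are kept, as the slides suggest. The Z values themselves must be stored because cases A to C look up Z_k'.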
23
Preprocessing
i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:  a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:  0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r:  0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l:  0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
24
Z-Algorithm
Time complexity
– #mismatches <= number of iterations <= n, since each iteration ends at its first mismatch
– #matches: let q be the number of matches in iteration k; then r increases by at least q. Since r <= n, the total #matches <= n
– T = O(#matches + #mismatches + #iterations) = O(n)
i:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:     a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:     0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r:     0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l:     0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
#m:    0  1  0  3  0  0  0  1  0  7  0  0  0  0  0  0  0
#mis:  0  1  1  1  0  0  1  1  1  1  0  0  0  0  0  0  1
(#m = matches and #mis = mismatches found by explicit comparisons in each iteration)
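Written out as a short derivation, the counting argument above reads as follows. The totals for the example are obtained by summing the #m and #mis rows of the table for n = 17.

```latex
\begin{align*}
\#\text{mismatches} &\le \#\text{iterations} \le n
  && \text{(an iteration stops at its first mismatch)}\\
\#\text{matches} &\le n
  && \text{($q$ matches raise $r$ by at least $q$, and $r \le n$)}\\
T &= O(\#\text{matches} + \#\text{mismatches} + \#\text{iterations}) = O(n)\\[4pt]
\text{Example:}\quad \#\text{matches} &= 1 + 3 + 1 + 7 = 12, \qquad
\#\text{mismatches} = 8,\\
12 + 8 &= 20 \le 2n = 34 .
\end{align*}
```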
25
Simplest Linear Time Exact Matching Algorithm
Input: Pattern P, Text T
Output: Occurrences of P in T
Algorithm Simplest
  S = P$T, where $ is a character that does not appear in P or T
  for i = 2; i <= |S|; i++
    Calculate Z_i
    If Z_i = |P|, then report that there is an occurrence of P in T starting at position i-|P|-1 of T
Time: T = O(|P| + |T| + 1) = O(n + m)
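A compact C sketch of this P$T approach is shown below. Several things in it are assumptions rather than part of the slides: the names `z_values` and `z_match`, the caller-supplied output buffer, and the use of a common condensed form of the Z computation that folds cases A, B and C into one "copy, then extend" step on 0-indexed arrays. It also stores Z for all of S for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Z values, 0-indexed: z[i] = length of the longest prefix of s[i..] that
 * is also a prefix of s; z[0] is set to 0. */
static void z_values(const char *s, size_t n, size_t *z)
{
    size_t l = 0, r = 0;                    /* current Z-box is s[l..r-1] */
    if (n > 0)
        z[0] = 0;
    for (size_t i = 1; i < n; i++) {
        size_t len = 0;
        if (i < r) {                        /* inside a Z-box: reuse z[i - l] */
            len = z[i - l];
            if (len > r - i)
                len = r - i;
        }
        while (i + len < n && s[len] == s[i + len])
            len++;                          /* extend by explicit comparisons */
        z[i] = len;
        if (i + len > r) {
            l = i;
            r = i + len;
        }
    }
}

/* Writes the 1-based start positions of pat in text to out (capacity cap)
 * and returns the number of occurrences. sep must occur in neither string. */
size_t z_match(const char *pat, const char *text, char sep, size_t *out, size_t cap)
{
    size_t m = strlen(pat), n = strlen(text), total = m + 1 + n, found = 0;
    assert(m > 0 && strchr(pat, sep) == NULL && strchr(text, sep) == NULL);

    char *s = malloc(total + 1);
    size_t *z = malloc(total * sizeof *z);
    if (s == NULL || z == NULL) {
        free(s);
        free(z);
        return 0;
    }
    memcpy(s, pat, m);                      /* S = P $ T */
    s[m] = sep;
    memcpy(s + m + 1, text, n + 1);         /* also copies the '\0' */
    z_values(s, total, z);

    for (size_t i = m + 1; i < total; i++) {
        if (z[i] == m) {
            if (found < cap)
                out[found] = i - m;         /* 0-based i in S -> 1-based in T */
            found++;
        }
    }
    free(s);
    free(z);
    return found;
}
```

For P = aba and T = bbabaxababay this reports the three occurrences from the opening example, at positions 3, 7 and 9 of T.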
26
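Simplest Linear Time Exact Matching Algorithm
– Takes only O(|P|) extra space: because $ appears in neither P nor T, Z_i <= |P| for every i, so k' always falls inside P and only the Z values for the positions of P need to be stored
– Alphabet-independent linear time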
27
Reference Chapter 1, 2: Exact Matching: Fundamental Preprocessing and First Algorithms