Exact Matching. Charles Yan, 2008.


1. Exact Matching
Charles Yan, 2008

2. Naïve Method
Input: P, the pattern; T, the text
Output: occurrences of P in T
Algorithm Naive:
  Align P with the left end of T.
  Compare from left to right until a mismatch occurs or an occurrence of P is found.
  Shift P one place to the right and repeat.
Running time: O(n*m)
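The naïve method above can be sketched in a few lines of Python. This is only an illustration, not code from the slides: the function name `naive_match` is hypothetical, positions are reported 1-based to match the slides, and n = |P|, m = |T| is an assumed reading of the O(n*m) bound.

```python
def naive_match(pattern: str, text: str) -> list[int]:
    """Return the 1-based start positions of every occurrence of pattern in text."""
    n, m = len(pattern), len(text)  # assumed naming: n = |P|, m = |T|
    occurrences = []
    if n == 0:
        return occurrences
    # Try every alignment of P against T, shifting one place at a time.
    for shift in range(m - n + 1):
        j = 0
        # Compare left to right until a mismatch or the end of P.
        while j < n and text[shift + j] == pattern[j]:
            j += 1
        if j == n:
            occurrences.append(shift + 1)  # convert to a 1-based position
    return occurrences


print(naive_match("aab", "aabaabcaxaabaabcy"))
```

Each of the roughly m alignments may cost up to n comparisons, which is where the quadratic bound comes from.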

3. Speeding Up the Naïve Algorithm
Shift P by more than one place at a time.
Skip comparisons that have already been made.

4. Preprocessing
Goal: gather the information needed to speed up the algorithm.
Definitions: substring, prefix, suffix, proper prefix, proper suffix.
Z_i: for i > 1, the length of the longest substring of S that starts at i and matches a prefix of S.
Z-box: for any position i > 1 where Z_i > 0, the Z-box at i starts at i and ends at i + Z_i - 1.
r_i: for every i > 1, r_i is the right-most endpoint of the Z-boxes that begin at or before i.
l_i: for every i > 1, l_i is the left endpoint of the Z-box that ends at r_i.

5. Preprocessing
i:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:     a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z_i:   0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r_i:   0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l_i:   0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
[Figure: the Z-boxes drawn over S]
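To check the table against the definitions, the values can be computed directly, one position at a time. The sketch below is deliberately the slow, quadratic approach and is not the Z-algorithm itself. The names `z_by_definition` and `box_endpoints` are hypothetical, and index 0 of each list is unused padding so that `z[i]` lines up with the 1-based Z_i of the slides.

```python
def z_by_definition(s: str) -> list[int]:
    """Z values by direct comparison; z[i] is Z_i for 1-based i, z[0] and z[1] stay 0."""
    n = len(s)
    z = [0] * (n + 1)
    for i in range(2, n + 1):
        length = 0
        # 1-based position i + length is s[i + length - 1]; prefix position
        # 1 + length is s[length].
        while i + length <= n and s[i + length - 1] == s[length]:
            length += 1
        z[i] = length
    return z


def box_endpoints(z: list[int]) -> tuple[list[int], list[int]]:
    """r[i]: right-most end of any Z-box starting at or before i; l[i]: start of that box."""
    n = len(z) - 1
    r = [0] * (n + 1)
    l = [0] * (n + 1)
    cur_r = cur_l = 0
    for i in range(2, n + 1):
        end = i + z[i] - 1
        if z[i] > 0 and end > cur_r:  # a new box reaches further right
            cur_r, cur_l = end, i
        r[i], l[i] = cur_r, cur_l
    return r, l


z = z_by_definition("aabaabcaxaabaabcy")
r, l = box_endpoints(z)
print(z[1:], r[1:], l[1:], sep="\n")
```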

6. Z-Algorithm
Goal: calculate Z_i for an input string S of length n in linear time.
Starting from k = 2, calculate Z_2, r_2 and l_2.
For k = 3, ..., n: in iteration k, calculate Z_k, r_k and l_k based on Z_j, r_j and l_j for j = 2, ..., k-1.
Iteration k needs only r_{k-1} and l_{k-1}, so there is no need to keep all r_i and l_i. We use r and l to denote r_{k-1} and l_{k-1}.

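7. Z-Algorithm
In iteration k:
(I) If k <= r
    k' = k - l + 1; r' = r - l + 1
    The Z-box S[l..r] equals the prefix S[1..r'], so S[k..r] equals S[k'..r'].
[Figure: the Z-box S[l..r] containing k, and the matching prefix S[1..r'] containing k', drawn over the example string]
i:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:     a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y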

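8. Z-Algorithm
A) If Z_k' < r - k + 1, that is, the Z-box at k' is shorter than S[k'..r'], then Z_k = Z_k'.
   The substring of length Z_k' starting at k equals the one starting at k' and the prefix of the same length. The characters x and y that follow the prefix and the Z-box at k' differ (x ≠ y), and the same character follows the copy at k, so the match at k cannot be extended.
[Figure: the Z-box at k', its copy at k, and the matching prefix, with the characters x and y that follow them]
i:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:     a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z_i:   0  1  0  3  1  0  0  1  0  7  1  0  3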

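9. Z-Algorithm
B) If Z_k' > r - k + 1, that is, the Z-box at k' extends past r', then Z_k = r - k + 1, the length of S[k..r].
   Let x be the character after r' and y the character after r. x ≠ y because S[l..r] is a Z-box. The Z-box at k' runs through x, so the prefix has x at position r - k + 2, while position r + 1 holds y. The match at k therefore stops at r.
[Figure: the Z-box at k' reaching past r', with x after r' and y after r]
i:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:     a  a  b  a  a  b  c  a  x  a  a  b  a  a  c  d
Z_i:   0  1  0  3  1  0  0  1  0  5  1  0  2  1  0  0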

10. Z-Algorithm
C) If Z_k' = r - k + 1, that is, the Z-box at k' ends exactly at r', then Z_k >= r - k + 1.
   Let x be the character after r', y the character after r, and z the character after the prefix of length r - k + 1.
   x ≠ y (because S[l..r] is a Z-box)
   z ≠ x (because the box at k' is a Z-box)
   Whether z = y is not known.
   Compare S[r+1, ...] with S[r-k+2, ...] until a mismatch occurs. Update Z_k, r and l.
[Figure: the Z-box at k' ending at r', with x after r', y after r, and z after the matching prefix]
i:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
S:     a  a  b  a  a  e  c  a  x  a  a  b  a  a  b  d
Z_i:   0  1  0  2  1  0  0  1  0  5  1  0  3  1  0  0
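The three example strings on the case A, B and C slides differ only in a few characters, and position k = 13 lands in a different case in each. The sketch below makes that visible. It computes the Z values by brute force and then replays l and r to label each iteration, so it illustrates the case analysis rather than the linear-time algorithm. The function name `classify_cases` and the label "II" for k > r are hypothetical.

```python
def classify_cases(s: str) -> dict[int, str]:
    """Label each 1-based iteration k >= 2 as 'A', 'B', 'C' (k <= r) or 'II' (k > r)."""
    n = len(s)
    z = [0] * (n + 1)  # z[i] is Z_i for 1-based i; z[0] and z[1] are unused
    for i in range(2, n + 1):
        while i + z[i] <= n and s[i + z[i] - 1] == s[z[i]]:
            z[i] += 1

    cases = {}
    l = r = 0
    for k in range(2, n + 1):
        if k > r:
            cases[k] = "II"
        else:
            z_kp = z[k - l + 1]  # Z at k' = k - l + 1
            beta = r - k + 1     # length of S[k..r]
            cases[k] = "A" if z_kp < beta else "B" if z_kp > beta else "C"
        # Move the box only when the Z-box at k reaches further right.
        if z[k] > 0 and k + z[k] - 1 > r:
            l, r = k, k + z[k] - 1
    return cases


for s in ("aabaabcaxaabaabcy", "aabaabcaxaabaacd", "aabaaecaxaabaabd"):
    print(s, classify_cases(s)[13])
```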

11. Z-Algorithm
(II) If k > r
    Compare the characters starting at k with those starting at 1 until a mismatch occurs. Set Z_k to the number of matches, and update r and l if necessary.

12. Z-Algorithm
Input: pattern P (the string S above, of length n)
Output: Z_i for i = 2, ..., n
Algorithm Z:
  Calculate Z_2, r_2 and l_2 explicitly by comparisons. Set r = r_2 and l = l_2.
  for k = 3; k <= n; k++
    if k <= r
      if Z_{k-l+1} < r-k+1, then Z_k = Z_{k-l+1}
      else if Z_{k-l+1} > r-k+1, then Z_k = r-k+1
      else compare the characters starting at r+1 with those starting at r-k+2; set Z_k from the result and update r and l if necessary
    else
      compare the characters starting at k with those starting at 1; set Z_k from the result and update r and l if necessary
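One way to turn this pseudocode into runnable code is sketched below. It is an illustration under a few assumptions rather than the author's implementation: the names `z_algorithm` and `extend` are hypothetical, the list is padded so that `z[k]` is the 1-based Z_k, and the special handling of k = 2 is folded into the main loop by starting with r = 0.

```python
def z_algorithm(s: str) -> list[int]:
    """Compute Z_k for k = 2..n; z[k] is Z_k for 1-based k, z[0] and z[1] are unused."""
    n = len(s)
    z = [0] * (n + 1)
    l = r = 0  # current Z-box S[l..r]; r = 0 means no box has been found yet

    def extend(start: int, matched: int) -> int:
        # Given that the first `matched` characters at `start` already match the
        # prefix, keep comparing explicitly until a mismatch or the end of s.
        while start + matched <= n and s[start + matched - 1] == s[matched]:
            matched += 1
        return matched

    for k in range(2, n + 1):
        if k > r:
            # Case (II): nothing is known about position k; compare from scratch.
            z[k] = extend(k, 0)
            if z[k] > 0:
                l, r = k, k + z[k] - 1
        else:
            k_prime = k - l + 1
            beta = r - k + 1  # length of S[k..r]
            if z[k_prime] < beta:
                z[k] = z[k_prime]       # case A: copy the earlier value
            elif z[k_prime] > beta:
                z[k] = beta             # case B: the match stops at r
            else:
                z[k] = extend(k, beta)  # case C: continue comparing past r
                if k + z[k] - 1 > r:
                    l, r = k, k + z[k] - 1
    return z


print(z_algorithm("aabaabcaxaabaabcy")[1:])
```

Cases A and B perform no character comparisons at all, and case C resumes at r + 1, so no text position to the left of r is ever compared twice.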

13. Preprocessing
i:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:     a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:     0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r:     0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l:     0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10

14. Z-Algorithm: Time Complexity
#mismatches <= n, the number of iterations, because each iteration stops at its first mismatch.
#matches: let q be the number of matches in iteration k; then r increases by at least q. Since r <= n, the total #matches <= n.
T = O(#matches + #mismatches + #iterations) = O(n)
i:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:     a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:     0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
r:     0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
l:     0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10
#m:    0  1  0  3  0  0  0  1  0  7  0  0  0  0  0  0  0
#mis:  0  1  1  1  0  0  1  1  1  1  0  0  0  0  0  0  1
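The #m and #mis rows can be reproduced by instrumenting the algorithm to count character comparisons per iteration. The sketch below does that. The name `z_with_counts` is hypothetical, and the counters are an addition for illustration only; they are not part of the algorithm.

```python
def z_with_counts(s: str) -> tuple[list[int], list[int], list[int]]:
    """Return (z, matches, mismatches), each indexed by 1-based k with index 0 unused."""
    n = len(s)
    z = [0] * (n + 1)
    matches = [0] * (n + 1)
    mismatches = [0] * (n + 1)
    l = r = 0
    for k in range(2, n + 1):
        if k > r:
            known = 0  # case (II): start comparing at k
        else:
            beta = r - k + 1
            z_kp = z[k - l + 1]
            if z_kp != beta:
                # Cases A and B: Z_k is known without any comparison.
                z[k] = min(z_kp, beta)
                continue
            known = beta  # case C: start comparing at r + 1
        length = known
        while k + length <= n:
            if s[k + length - 1] == s[length]:
                matches[k] += 1
                length += 1
            else:
                mismatches[k] += 1
                break
        z[k] = length
        if length > 0 and k + length - 1 > r:
            l, r = k, k + length - 1
    return z, matches, mismatches


z, m_row, mis_row = z_with_counts("aabaabcaxaabaabcy")
print(m_row[1:], mis_row[1:], sep="\n")
```

On the example string the totals are 12 matches and 8 mismatches, both below n = 17, in line with the bound argued above.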

15. Simplest Linear Time Exact Matching Algorithm
Input: pattern P, text T
Output: occurrences of P in T
Algorithm Simplest:
  S = P$T, where $ is a character that does not appear in P or T
  for i = 2; i <= |S|; i++
    calculate Z_i
    if Z_i = |P|, report an occurrence of P in T starting at position i-|P|-1 of T
Running time: O(|P|+|T|+1) = O(n+m)
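Algorithm Simplest can be sketched as follows. The names `z_values` and `find_occurrences` and the `sep` parameter are hypothetical. The Z computation here is a compact variant that always attempts to extend after reusing earlier information, which may spend one extra comparison per iteration but stays linear. Positions are 1-based, as in the slides.

```python
def z_values(s: str) -> list[int]:
    """Linear-time Z values; z[k] is Z_k for 1-based k, z[0] and z[1] are unused."""
    n = len(s)
    z = [0] * (n + 1)
    l = r = 0
    for k in range(2, n + 1):
        # Reuse what the current Z-box S[l..r] already tells us about position k.
        length = min(z[k - l + 1], r - k + 1) if k <= r else 0
        # Then extend by explicit comparison.
        while k + length <= n and s[k + length - 1] == s[length]:
            length += 1
        z[k] = length
        if length > 0 and k + length - 1 > r:
            l, r = k, k + length - 1
    return z


def find_occurrences(pattern: str, text: str, sep: str = "$") -> list[int]:
    """Return the 1-based positions in text where pattern occurs, using S = P + sep + T."""
    if not pattern:
        return []
    if sep in pattern or sep in text:
        raise ValueError("separator must not appear in pattern or text")
    s = pattern + sep + text
    z = z_values(s)
    p = len(pattern)
    # Position i of S corresponds to position i - |P| - 1 of T.
    return [i - p - 1 for i in range(p + 2, len(s) + 1) if z[i] == p]


print(find_occurrences("aab", "aabaabcaxaabaabcy"))
```

Because the separator occurs in neither string, no Z value can exceed |P|, so Z_i = |P| holds exactly at the occurrences of P.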

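16. Simplest Linear Time Exact Matching Algorithm
Takes only O(n) extra space, with n = |P|: because $ occurs in neither P nor T, Z_i <= |P| for every position i in T, so k' always falls inside P and only the Z values for P need to be stored.
Alphabet-independent linear time.
[Figure: k, l and r in the T part of S = P$T, with k', l' and r' in the P part]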

