北海道大学 Hokkaido University 1 Lecture on Information knowledge network2010/12/23 Lecture on Information Knowledge Network "Information retrieval and pattern.


1 Lecture on Information Knowledge Network: "Information retrieval and pattern matching"
Takuya KIDA
Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University (北海道大学)
2010/12/23

2 The 3rd: Suffix type algorithms
– Boyer-Moore algorithm
– Galil algorithm
– Horspool algorithm
– Sunday algorithm

3 Knuth-Morris-Pratt algorithm (review)

KMP-String-Matching (T, P)
    n ← length[T].
    m ← length[P].
    q ← 1.
    next ← ComputeNext(P).
    for i ← 1 to n do
        while q > 0 and P[q] ≠ T[i] do q ← next[q];
        if q = m then report an occurrence at i-m+1;
        q ← q+1.

The next position to compare is given by the function next (the pattern P is shifted by q - next[q]). When the value of next is 0, the comparison restarts from the next text character. For P = ababc, next[5] = 3 and next[3] = 0.
The number of comparisons at each text position is O(1) amortized. Even in the worst case, the search takes only O(n+m) time (including the preprocessing of next).

Text T:    ababbababcbaababc (positions 1 to 17)
Pattern P: ababc
The pattern occurs at position 6 of T.

D. E. Knuth, J. H. Morris, Jr., and V. R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323-350, 1977.
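The KMP review can be made concrete with a small sketch. It is an illustration only: the names `compute_border` and `kmp_search` are hypothetical, the code uses 0-based indices internally and reports 1-based positions, and it uses the plain border (failure) table rather than the slide's `next` function, whose values (next[3] = 0) suggest an additionally optimized variant.

```python
def compute_border(pattern):
    """border[q] = length of the longest proper border of pattern[:q]."""
    m = len(pattern)
    border = [0] * (m + 1)
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[q] != pattern[k]:
            k = border[k]
        if pattern[q] == pattern[k]:
            k += 1
        border[q + 1] = k
    return border


def kmp_search(pattern, text):
    """Return the 1-based start positions of all occurrences of pattern."""
    m = len(pattern)
    if m == 0:
        return []
    border = compute_border(pattern)
    hits = []
    q = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = border[q]          # fall back without re-reading the text
        if pattern[q] == ch:
            q += 1
        if q == m:
            hits.append(i - m + 2)  # i is the 0-based end of the match
            q = border[q]           # keep going for further occurrences
    return hits


print(kmp_search("ababc", "ababbababcbaababc"))
```

On the slide's example the sketch reports position 6, and it also finds the second occurrence that ends at the last text character.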

4 Shift-And algorithm (review)

Text T:    abababba
Pattern P: ababb

Mask table M (one bit per pattern position, 1 to 5):
    M(a) = 10100
    M(b) = 01011

State update: R_i = ((R_{i-1} << 1) | 1) & M(T[i])
This can be calculated in O(1) time.
※ Only the bits that made a valid transition are kept, by taking the AND with the mask bits M.

[Figure: the bit matrix of the states R_i for each text position.]

R. A. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Proceedings of the 12th International Conference on Research and Development in Information Retrieval, 168-175. ACM Press, 1989.
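The Shift-And update rule can be tried out with the following minimal sketch. The function name `shift_and_search` is hypothetical, and the bit layout is an assumption of this sketch: bit 0 (the least significant bit) stands for the first pattern character, so the masks appear reversed compared with the left-to-right notation of the slide.

```python
def shift_and_search(pattern, text):
    """Return the 1-based start positions of all occurrences of pattern."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit j set when pattern[j] == c (bit 0 = first character)
    masks = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << j)
    accept = 1 << (m - 1)   # bit that signals a complete match
    hits = []
    r = 0
    for i, ch in enumerate(text):
        # R_i = ((R_{i-1} << 1) | 1) & M(T[i])
        r = ((r << 1) | 1) & masks.get(ch, 0)
        if r & accept:
            hits.append(i - m + 2)
    return hits


print(shift_and_search("ababb", "abababba"))
```

Python integers are unbounded, so the sketch works for any pattern length; with fixed-width machine words the O(1) update holds only while the pattern fits in one word.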

5 General form of efficient matching algorithms

MatchingAlgorithm (P, T)
    1 m ← length[P].
    2 n ← length[T].
    3 i ← 1.
    4 while i ≤ n - m + 1 do
    5     decide if i is an occurrence;
    6     if i is an occurrence then report the occurrence at i;
    7     decide the amount Δ by which the pattern can be shifted safely;
    8     i ← i + Δ.

Many efficient algorithms, including KMP and BM, fit this frame.
※ Masayuki Takeda, “High-speed pattern matching algorithms for full text processing,” Informatics symposium, January 1991 (written in Japanese).

Important points for speeding up the algorithm:
– How much work can we save at the 5th line?
– How large can we make Δ at the 7th line?
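The general frame can be written down almost literally. The sketch below is only an illustration: `frame_search` and its `decide_shift` parameter are hypothetical names, and the default shift of 1 turns the frame into the naive algorithm. A faster algorithm would supply a smarter occurrence test (line 5) and a larger safe shift (line 7).

```python
def frame_search(pattern, text, decide_shift=lambda window: 1):
    """Generic matching frame; returns 1-based occurrence positions.

    decide_shift receives the current text window and must return a
    safe shift; values below 1 are raised to 1 so the loop terminates.
    """
    m, n = len(pattern), len(text)
    hits = []
    i = 1
    while i <= n - m + 1:
        window = text[i - 1:i - 1 + m]
        if window == pattern:              # line 5: is i an occurrence?
            hits.append(i)                 # line 6: report it
        i += max(1, decide_shift(window))  # lines 7-8: shift by Δ
    return hits


print(frame_search("ababc", "ababbababcbaababc"))
```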

6 Boyer-Moore algorithm

Features:
– Characters of the pattern are compared from right to left.
– The values of two functions (delta1 and delta2) are compared, and the pattern is shifted by the larger one.
– Although the time complexity of the BM algorithm is O(mn) in the worst case, it becomes O(n/m) on average (sublinear!).

Bad-character heuristic:
delta1(char) := the jump width by which we shift the pattern so that the rightmost occurrence of char in P is aligned with the current text position (if the pattern does not include char, it is equal to the pattern length).

Text T:    a a b c d a a c b c a b c c a ・・・
Pattern P: a b c b a b a b

Shift P to align the rightmost ‘c’ in P with the current position:
delta1(c) = 5
Δ = delta1(char) - j + 1 = 5 - 0 = 5

R. S. Boyer and J. S. Moore. A fast string searching algorithm. Communications of the ACM, 20(10):762-772, 1977.
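The delta1 table of the bad-character heuristic can be sketched as follows. The function names are hypothetical. One assumption goes beyond the slide: when delta1(char) - j + 1 is not positive (the rightmost occurrence lies to the right of the mismatch), the sketch falls back to a shift of 1, which is one simple way to keep the shift valid.

```python
def delta1_table(pattern):
    """delta1[c] = m - (rightmost 1-based position of c in the pattern)."""
    m = len(pattern)
    table = {}
    for pos, ch in enumerate(pattern, start=1):
        table[ch] = m - pos   # later positions overwrite earlier ones
    return table


def bad_character_shift(pattern, ch, j):
    """Shift Δ = delta1(ch) - j + 1, where j - 1 characters have matched.

    Characters that do not occur in the pattern get delta1 = m.
    The max(1, ...) guard is an assumption of this sketch.
    """
    m = len(pattern)
    d = delta1_table(pattern).get(ch, m)
    return max(1, d - j + 1)


table = delta1_table("abcbabab")
print(table["c"], bad_character_shift("abcbabab", "c", 1))
```

For the pattern abcbabab this reproduces delta1(c) = 5 and the shift Δ = 5 of the example, where the mismatch happens at the first comparison (j = 1).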

7 delta2(j) (good-suffix heuristic)

delta2(j) := the jump width by which we shift to align the matched suffix of P of length j-1 with another occurrence of it as a factor of P (or with the longest prefix of P that is also a suffix of that string). (If there is no such factor, it is equal to the length of P.)

Example 1:
Text T:    a a b c d a a b b c a b c c a ・・・
Pattern P: a b c b a b a b
Shift P to align ‘ab’ with the prefix of P.
delta2(3) = 8
Δ = delta2(3) - 3 + 1 = 8 - 2 = 6
※ There are two candidates, positions 1 and 5, for the value of delta2(3). However, the character to the left of position 5, namely the 4th, is ‘b’, which does not match ‘a’. Therefore, position 1 is the only candidate.

Example 2:
Text T:    a a b c a b a b b c a b c c a ・・・
Pattern P: a b c b a b a b
Shift P to align ‘ab’ with the prefix of P.
delta2(5) = 10
Δ = delta2(5) - 5 + 1 = 10 - 4 = 6
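The two delta2 values in the examples can be checked with a naive computation. This is a sketch under stated assumptions, not the O(m) preprocessing: `good_suffix_shift` and `delta2` are hypothetical names, j - 1 is taken to be the number of matched characters, and a candidate is rejected when the character in front of it equals the pattern character that just mismatched (the rule described in the note of Example 1).

```python
def good_suffix_shift(pattern, j):
    """Smallest safe shift after j - 1 characters matched from the right."""
    m = len(pattern)
    k = j - 1            # length of the matched suffix
    mismatch = m - j     # 0-based index of the mismatching pattern character
    for s in range(1, m + 1):
        # the matched suffix must agree with the pattern shifted by s;
        # positions that fall off the left end are unconstrained
        agrees = all(
            t - s < 0 or pattern[t - s] == pattern[t]
            for t in range(m - k, m)
        )
        if not agrees:
            continue
        prev = mismatch - s
        if prev >= 0 and pattern[prev] == pattern[mismatch]:
            continue     # would fail again on the same text character
        return s
    return m             # not reached: s = m always qualifies


def delta2(pattern, j):
    """delta2(j) such that Δ = delta2(j) - j + 1 is the shift."""
    return good_suffix_shift(pattern, j) + j - 1


print(delta2("abcbabab", 3), delta2("abcbabab", 5))
```

Both examples yield a shift of 6, which corresponds to delta2(3) = 8 and delta2(5) = 10. The triple loop takes more than linear time, which is the cost issue raised on the next slide.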

8 Problems of the BM method

– It is complicated to compute the values of the delta functions.
    – It takes O(m^2) time in a naïve way.
    – Reducing it to O(m) is somewhat troublesome (it is done in a way similar to KMP).
– It is costly to compare the values of delta1 and delta2 at each iteration.
    – Generally, only delta1 is used. (However, some care is needed to shift the pattern correctly, since delta1 alone cannot always give a valid shift.)
– It takes O(mn) time in the worst case.
    – Consider T = a^n and P = b a^m.
– The efficiency of BM declines when the alphabet size is small.
    – For binary strings over ∑ = {0,1}, the values of Δ would be very small.

9 Galil algorithm

Since the information about the matched string is forgotten in the original BM method, it takes O(mn) time in the worst case.
The idea for improvement is to remember how long a prefix of P has already been matched with the text.
The Galil algorithm scans in O(n) time theoretically, but it slows down in practice since the algorithm becomes much more complicated.

Text T:    a a b c a b a b c b a b a b a ・・・
Pattern P: a b c b a b a b
delta2(5) = 10
Remember that the forward positions have already been compared. Only the remaining positions are to be compared!
Each character of the text is compared at most twice!

Z. Galil. On improving the worst case running time of the Boyer-Moore string searching algorithm. Communications of the ACM, 22(9):505-508, 1979.

10 Horspool algorithm

If ∑ is large enough, delta1 (the bad-character heuristic) can mostly give the best shift amount.
→ A small modification can enlarge the jump width.

Boyer-Moore (delta1):
Text T:    a a b c d c a d b c a b c c a b a c a ・・・
Pattern P: a b c b a b a d
delta1(c) = 5

Horspool (delta1’):
Text T:    a a b c d c a d b c a b c c a b a c a ・・・
Pattern P: a b c b a b a d
delta1’(d) = 10
delta1’(b) = 3
Always decide the jump width by the character of the text at the end position of the pattern.

R. N. Horspool. Practical fast searching in strings. Software Practice and Experience, 10(6):501-506, 1980.

11 Pseudo code

Horspool (P, T)
    m ← length[P].
    n ← length[T].
    Preprocessing:
        for each c ∈ ∑ do delta1’[c] ← m.
        for j ← 1 to m - 1 do delta1’[ P[j] ] ← m - j.
    Searching:
        i ← 0.
        while i ≤ n - m do
            j ← m;
            while j > 0 and T[i+j] = P[j] do j ← j - 1;
            if j = 0 then report an occurrence at i+1;
            i ← i + delta1’[ T[i+m] ].
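The Horspool pseudocode translates directly into the sketch below. The names `horspool_table` and `horspool_search` are hypothetical; the table is a dictionary, with characters that are missing from it treated as having the default shift m, so no explicit alphabet is needed. Indices are 0-based internally and reported positions are 1-based.

```python
def horspool_table(pattern):
    """delta1'[c] = m - j for the rightmost j < m with P[j] = c (1-based j)."""
    m = len(pattern)
    table = {}
    for j, ch in enumerate(pattern[:-1], start=1):  # last character excluded
        table[ch] = m - j
    return table


def horspool_search(pattern, text):
    """Return the 1-based start positions of all occurrences of pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    table = horspool_table(pattern)
    hits = []
    i = 0
    while i <= n - m:
        j = m
        # compare the window from right to left
        while j > 0 and text[i + j - 1] == pattern[j - 1]:
            j -= 1
        if j == 0:
            hits.append(i + 1)
        # shift by the text character under the last pattern position
        i += table.get(text[i + m - 1], m)
    return hits


print(horspool_search("ababc", "ababbababcbaababc"))
```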

12 Sunday algorithm

It is basically based on the BM method.
Different points:
– It compares the pattern positions in an arbitrary order. For example, it compares characters in order of increasing frequency.
– It uses the text character just to the right of the end position of the pattern to determine the value of delta1 (it also calculates delta2, and then compares them to select the longer).
The jump width tends to be longer than that of Horspool.
– However, the memory consumption is larger than that of Horspool.
– Moreover, it takes much more time to decide the jump width than Horspool.

Text T:    a a b c d c a b d c a b c c a b a c a ・・・
Pattern P: a b c b a b a b
delta1’(d) = 9
delta1’(c) = 6
Always decide the jump width by the character of the text at the right side of the end position of the pattern.

D. M. Sunday. A very fast substring search algorithm. Communications of the ACM, 33(8):132-142, 1990.
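A simplified sketch of the Sunday shift follows. It is deliberately reduced: only the delta1’ table based on the character to the right of the window is used, delta2 is omitted, and the window is compared with a plain string comparison instead of a frequency-based order. The names `sunday_table` and `sunday_search` are hypothetical.

```python
def sunday_table(pattern):
    """delta1'[c] = m + 1 - (rightmost 1-based position of c in the pattern)."""
    m = len(pattern)
    table = {}
    for j, ch in enumerate(pattern, start=1):
        table[ch] = m + 1 - j
    return table


def sunday_search(pattern, text):
    """Return the 1-based start positions of all occurrences of pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    table = sunday_table(pattern)
    hits = []
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:   # comparison order does not matter here
            hits.append(i + 1)
        if i + m >= n:                 # no character to the right of the window
            break
        # characters absent from the pattern allow the maximum shift m + 1
        i += table.get(text[i + m], m + 1)
    return hits


table = sunday_table("abcbabab")
print(table["c"], table.get("d", len("abcbabab") + 1))
```

For the pattern abcbabab the table gives 6 for ‘c’ and the default 9 for ‘d’, matching the values on the slide.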

13 Factor type algorithms
– BDM algorithm
– BOM algorithm
– BNDM algorithm

14 Backward Dawg Matching (BDM) algorithm

It is basically based on the BM method.
Different points:
– It decides whether the pattern occurs at the current position by detecting whether the string being read matches any factor of the pattern, not only a suffix of the pattern.
– It uses a suffix automaton (or suffix tree) to determine whether the string being read is a factor of the pattern.

Features of the suffix automaton (SA):
– It can tell whether a string u is a factor of the pattern P in O(|u|) time.
– It can also tell whether a string u is a suffix of P.
– For P = p_1 p_2 … p_m, there exists an online construction algorithm that runs in O(m) time.

Factor search (the string u has been read and σ is the next character):
Text T:    a a b c d c a b d c a b c c a b a c a ・・・
Pattern P: a b c b a b a b
As ‘cc’ is not a factor of P and ‘c’ is not a prefix of P, the pattern can be shifted safely to the next position.
We can see whether the string being read is a prefix of P or not from the second feature of the SA.

M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two string matching algorithms. Algorithmica, 12(4/5):247-267, 1994.

15 Suffix Automaton

[Figure: the suffix trie, the suffix tree, and the suffix automaton (states 0 to 9) for P^R, the reverse of P = announce.]

A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. T. Chen, and J. Seiferas. The smallest automaton recognizing the subwords of a text. Theoretical Computer Science, 40:31-55, 1985.

16 On-line construction algorithm

SuffixAutomaton (P = p_1 p_2 … p_m)
    Create the one-node graph G = DAWG(ε).
    root ← sink ← the node of G. suf[root] ← θ.
    for i ← 1 to m do
        create a new node newsink;
        make a solid edge (sink, newsink) labeled by a = p_i;
        w ← suf[sink];
        while w ≠ θ and son(w,a) = θ do
            make a non-solid a-edge (w, newsink);
            w ← suf[w];
        v ← son(w,a);
        if w = θ then suf[newsink] ← root
        else if (w,v) is a solid edge then suf[newsink] ← v
        else
            create a node newnode;
            newnode has the same outgoing edges as v except that they are all non-solid;
            change (w,v) into a solid edge (w, newnode);
            suf[newsink] ← newnode;
            suf[newnode] ← suf[v]; suf[v] ← newnode;
            w ← suf[w];
            while w ≠ θ and (w,v) is a non-solid a-edge do
                redirect this edge to newnode; w ← suf[w].
        sink ← newsink.

This is rather complicated! The online construction of the SA is a hard task!
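For experimentation, the online construction can be sketched in a commonly used equivalent formulation. This is not a line-by-line transcription of the pseudocode: instead of marking edges as solid or non-solid, each state stores the length of its longest string, and an edge (w, v) is treated as solid exactly when length[v] = length[w] + 1. The class and method names are hypothetical.

```python
class SuffixAutomaton:
    """Suffix automaton of a word, built online one letter at a time."""

    def __init__(self, word):
        self.next = [{}]      # outgoing edges of each state
        self.link = [-1]      # suffix links (suf); -1 plays the role of θ
        self.length = [0]     # length of the longest string of each state
        self.last = 0         # current sink
        for ch in word:
            self._extend(ch)

    def _new_state(self, length):
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        return len(self.next) - 1

    def _extend(self, ch):
        cur = self._new_state(self.length[self.last] + 1)
        w = self.last
        while w != -1 and ch not in self.next[w]:
            self.next[w][ch] = cur
            w = self.link[w]
        if w == -1:
            self.link[cur] = 0
        else:
            v = self.next[w][ch]
            if self.length[w] + 1 == self.length[v]:   # (w, v) is solid
                self.link[cur] = v
            else:                                       # split v
                clone = self._new_state(self.length[w] + 1)
                self.next[clone] = dict(self.next[v])
                self.link[clone] = self.link[v]
                self.link[v] = clone
                self.link[cur] = clone
                while w != -1 and self.next[w].get(ch) == v:
                    self.next[w][ch] = clone            # redirect the edge
                    w = self.link[w]
        self.last = cur

    def _walk(self, u):
        state = 0
        for ch in u:
            state = self.next[state].get(ch)
            if state is None:
                return None
        return state

    def is_factor(self, u):
        return self._walk(u) is not None

    def is_suffix(self, u):
        state = self._walk(u)
        t = self.last
        while state is not None and t != -1:   # terminal states: link chain
            if t == state:
                return True
            t = self.link[t]
        return False


sa = SuffixAutomaton("ecnuonna")   # reverse of "announce"
print(sa.is_factor("nuo"), sa.is_factor("cnn"), sa.is_suffix("nna"))
```

Both queries run in time proportional to the length of the query string, which is what BDM relies on.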

17 BNDM algorithm

The idea is basically the same as that of the BDM algorithm.
Different points:
– It uses a non-deterministic version of the suffix automaton to determine whether the string being read is a factor of the pattern.
– It simulates the moves of the NFA by the bit-parallel technique.

[Figure: an NFA that accepts the suffixes of P^R for the pattern P = announce, with ε-transitions from the initial state I to every state. This NFA is simulated.]

The mask table is the same as that of the Shift-And method.
Initial condition: R_0 = 1^m
State transition: R = (R << 1) & M[ T[i] ]

G. Navarro and M. Raffinot. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics (JEA), 5(4), 2000.

18 Pseudo code

BNDM (P, T)
    m ← length[P].
    n ← length[T].
    Preprocessing:
        for c ∈ ∑ do M[c] ← 0^m.
        for j ← 1 to m do M[ P[j] ] ← M[ P[j] ] | 0^{j-1} 1 0^{m-j}.
    Searching:
        s ← 0.
        while s ≤ n - m do
            j ← m, last ← m, R ← 1^m;
            while R ≠ 0^m do
                R ← R & M[ T[s+j] ];
                j ← j - 1;
                if R & 1 0^{m-1} ≠ 0^m then
                    if j > 0 then last ← j;
                    else report an occurrence at s+1;
                R ← R << 1;
            s ← s + last.
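The BNDM pseudocode can be tried with the following sketch. The function name `bndm_search` is hypothetical. Because Python integers are unbounded, the result of each left shift is masked back to m bits explicitly, which a fixed-width machine word would do implicitly. The highest of the m bits corresponds to the first pattern character, as in the 0^{j-1} 1 0^{m-j} notation.

```python
def bndm_search(pattern, text):
    """Return the 1-based start positions of all occurrences of pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    full = (1 << m) - 1        # 1^m
    high = 1 << (m - 1)        # 1 0^{m-1}
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - i))
    hits = []
    s = 0
    while s <= n - m:
        j, last, r = m, m, full
        while r != 0:
            r &= masks.get(text[s + j - 1], 0)   # read the window backwards
            j -= 1
            if r & high:                          # a prefix of P was read
                if j > 0:
                    last = j                      # remember the next candidate
                else:
                    hits.append(s + 1)            # the whole window matched
            r = (r << 1) & full
        s += last
    return hits


print(bndm_search("ababc", "ababbababcbaababc"))
```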

19 Backward Oracle Matching (BOM) algorithm

The idea is basically the same as that of the BDM algorithm.
Different points:
– It uses a factor oracle instead of a suffix automaton.
– What BDM really needs is to know that σu is not a factor, rather than to know that the string u is a factor.

Features of the factor oracle:
– It may accept strings other than the factors of P. For example, the oracle in the figure accepts ‘cnn’, although it is not a factor of P^R.
– It can be constructed in O(m) time. Moreover, it is easy to implement with small memory space. The number of states is m+1, and the number of state transitions is at most 2m-1.

[Figure: a factor oracle of P^R for P = announce (states 0 to 8).]

C. Allauzen, M. Crochemore, and M. Raffinot. Efficient experimental string matching by weak factor recognition. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, LNCS 2089:51-72, 2001.

20 Construction algorithm of the factor oracle

Oracle-on-line (P = p_1 p_2 … p_m)
    Create Oracle(ε) with one single initial state 0; S(0) ← θ.
    for i ∈ 1…m do
        Oracle(p_1 p_2 … p_i) ← Oracle_add_letter (Oracle(p_1 p_2 … p_{i-1}), p_i).

Oracle_add_letter (Oracle(P = p_1 p_2 … p_m), σ)
    Create a new state m+1.
    δ(m,σ) ← m+1.
    k ← S(m).
    while k ≠ θ and δ(k,σ) = θ do
        δ(k,σ) ← m+1;
        k ← S(k).
    if k = θ then s ← 0
    else s ← δ(k,σ).
    S(m+1) ← s.
    return Oracle(P = p_1 p_2 … p_m σ).
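The oracle construction is short enough to run directly. In the sketch below the names `build_factor_oracle` and `oracle_accepts` are hypothetical, `None` plays the role of θ, and the transition function δ is stored as one dictionary per state.

```python
def build_factor_oracle(word):
    """Return (delta, supply) for the factor oracle of word.

    delta[state][letter] is the target state; supply[state] is S(state).
    """
    delta = [{}]
    supply = [None]                 # S(0) = θ
    for m, ch in enumerate(word):   # m is the current last state
        delta.append({})            # create the new state m + 1
        delta[m][ch] = m + 1
        k = supply[m]
        while k is not None and ch not in delta[k]:
            delta[k][ch] = m + 1    # add an external transition
            k = supply[k]
        supply.append(0 if k is None else delta[k][ch])
    return delta, supply


def oracle_accepts(delta, u):
    """True if u labels a path from state 0 (every state is accepting)."""
    state = 0
    for ch in u:
        if ch not in delta[state]:
            return False
        state = delta[state][ch]
    return True


delta, supply = build_factor_oracle("ecnuonna")   # reverse of "announce"
print(len(delta), sum(len(d) for d in delta), oracle_accepts(delta, "cnn"))
```

For the reversed pattern ecnuonna the sketch builds 9 states and 15 transitions and accepts ‘cnn’, which illustrates that the oracle recognizes a superset of the factors.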

21 Matching time comparison

[Figure: experimental comparison of Horspool, Shift-Or, BNDM, and BOM, with axis ticks 2, 4, 8, 16, 32, 64, 128, 256 and 2, 4, 8, 16, 32, 64; DNA and English data are marked.]

Source: Flexible Pattern Matching in Strings, Navarro & Raffinot, 2002: Fig. 2.22, p. 39.

22 Extensions of Suffix & Factor type algorithms to multiple patterns
– Set Horspool algorithm
– Wu-Manber algorithm

23 Suffix & Factor type algorithms for multiple patterns

Commentz-Walter algorithm
– A straightforward extension of the BM algorithm.
– B. Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the 6th International Colloquium on Automata, Languages and Programming, LNCS 71:118-132, 1979.

Set Horspool algorithm
– A simplification of the Commentz-Walter algorithm based on the idea of Horspool.

Wu-Manber algorithm
– A practically fast algorithm based on hashing. Agrep employs this algorithm.
– S. Wu and U. Manber. A fast algorithm for multi-pattern searching. Report TR-94-17, Department of Computer Science, University of Arizona, Tucson, AZ, 1994.

Uratani-Takeda algorithm
– A BM type algorithm with an AC machine. It is faster than CW.

Set Backward Oracle Matching (SBOM) algorithm
– An extension of BOM obtained by extending the factor oracle to multiple patterns.
– C. Allauzen and M. Raffinot. Factor oracle of a set of words. Technical report 99-11, Institut Gaspard-Monge, Universite de Marne-la-Vallee, 1999.

24 Set Horspool algorithm

First, it builds a trie for the set of the reversed patterns in Π.
Its matching approach is the same as that of Horspool.
– It traverses the trie while doing a suffix search.
– If the string being read does not match any suffix of the patterns, it shifts by delta1’.

[Figure: the text T with the factor α σ β read by the suffix search on the reversed trie of the patterns; the range skipped by delta1’ does not include β.]

※ Cf. the Uratani-Takeda algorithm uses an AC machine instead of the trie and decides the jump width by the failure function.

25 Reason why the performance decreases

[Figure: the text T and patterns P with lengths between ℓmin and ℓmax; the shift delta satisfies delta ≤ ℓmin.]

– The maximum jump width is limited to ℓmin.
– When the number of patterns increases, the bad-character heuristic cannot work well since the frequency of each character increases.

26 Wu-Manber algorithm

It examines whether some patterns occur by reading the B characters that end at the current matching position of the text (i.e. T[i-B+1…i]).
– SHIFT[ T[i-B+1…i] ]: if T[i-B+1…i] is a suffix of some pattern, it returns 0. Otherwise, it returns the maximum length of a possible shift.
– HASH[ T[i-B+1…i] ]: when SHIFT returns 0 (i.e. T[i-B+1…i] is a suffix of some pattern), it returns the list of patterns that can occur at that position.

Example:
Text T: CPM annual conference announce
Patterns Π: annually, announce, annual

[Tables: SHIFT[B] lists the amount of shift for the blocks ll, no, ou, an, un, nc, ua, al, ly, nn, nu, ce, and * (any other block); HASH[B] lists the pattern IDs for the blocks whose SHIFT value is 0.]

SHIFT[an]=4
SHIFT[al]=0: some patterns may occur! HASH[al]=2, → shift by 1
SHIFT[l ]=5

S. Wu and U. Manber. A fast algorithm for multi-pattern searching. Report TR-94-17, Department of Computer Science, University of Arizona, Tucson, AZ, 1994.

27 Pseudo code

Construct_SHIFT (P = {p_1, p_2, …, p_r})
    initialize the SHIFT table with ℓmin - B + 1.
    for each Bl = p_i[j-B+1…j] do
        if SHIFT[h1(Bl)] > m_i - j then SHIFT[h1(Bl)] ← m_i - j.

Wu-Manber (P = {p_1, p_2, …, p_r}, T = T[1…n])
    Preprocessing:
        Computation of B.
        Construction of the hash tables SHIFT and HASH.
    Searching:
        pos ← ℓmin.
        while pos ≤ n do
            i ← h1( T[pos-B+1…pos] );
            if SHIFT[i] = 0 then
                list ← HASH[ h2( T[pos-B+1…pos] ) ];
                verify all the patterns in list one by one against the text;
                pos ← pos + 1;
            else pos ← pos + SHIFT[i].

※ In the implementation of agrep ver. 4.02 (mgrep.c), the size of SHIFT, the size of HASH, and B are in fact 4096, 8192, and 3, respectively.
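The Wu-Manber pseudocode can be illustrated with the sketch below. Several simplifications are assumptions of this sketch: Python dictionaries keyed by the block itself replace the hash functions h1 and h2 and the fixed-size tables, the block length B is passed as a parameter instead of being computed, and the HASH table stores the patterns themselves rather than pattern IDs. The names `build_tables` and `wu_manber_search` are hypothetical.

```python
def build_tables(patterns, block=2):
    """Build the SHIFT and HASH tables (as dictionaries keyed by block)."""
    lmin = min(len(p) for p in patterns)
    default = lmin - block + 1          # shift for blocks not in any pattern
    shift, bucket = {}, {}
    for p in patterns:
        m = len(p)
        for j in range(block, m + 1):   # block p[j-B+1…j], 1-based j
            bl = p[j - block:j]
            shift[bl] = min(shift.get(bl, default), m - j)
        bucket.setdefault(p[m - block:], []).append(p)
    return lmin, default, shift, bucket


def wu_manber_search(patterns, text, block=2):
    """Return (1-based start position, pattern) pairs in text order."""
    lmin, default, shift, bucket = build_tables(patterns, block)
    hits = []
    pos = lmin                          # 1-based end of the current window
    while pos <= len(text):
        bl = text[pos - block:pos]
        s = shift.get(bl, default)
        if s == 0:
            for p in bucket.get(bl, []):        # verify each candidate
                start = pos - len(p)
                if start >= 0 and text[start:pos] == p:
                    hits.append((start + 1, p))
            pos += 1
        else:
            pos += s
    return hits


pats = ["annually", "announce", "annual"]
print(wu_manber_search(pats, "CPM annual conference announce"))
```

With B = 2 the sketch reproduces SHIFT[an] = 4, SHIFT[al] = 0, and the default shift 5 from the example, and it finds annual at position 5 and announce at position 23.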

28 The 3rd summary

Suffix type algorithms
– They match the pattern from right to left.
– They take O(mn) time in the worst case, but O(n/m) time on average.
– Boyer-Moore, Galil, Horspool, and Sunday.

Factor type algorithms
– They determine whether the string read at the current position is a factor of the pattern, and then skip the text.
– BDM, BNDM, and BOM algorithms.

Extensions of Suffix & Factor type algorithms to multiple patterns
– When the number of patterns increases, the bad-character heuristic does not work well since the frequency of each character increases.
– Set Horspool and Wu-Manber algorithms.

The next theme
– Approximate pattern matching: pattern matching that allows errors.

