Shift-And Approach to Pattern Matching in LZW Compressed Text Takuya KIDA Department of Informatics Kyushu University, Japan Masayuki TAKEDA Ayumi SHINOHARA.

2 Shift-And Approach to Pattern Matching in LZW Compressed Text. Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Setsuo ARIKAWA. Department of Informatics, Kyushu University, Japan

3 Motivation
E-mail, Address book, Schedule, Dictionary, Phone numbers, Memo, Electronic book, Database
- The available storage devices are limited!
- I am eager to store as much information as possible!
- I want to do pattern matching as fast as possible!
Yes! Data compression! ...but a suffix trie is very large...

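4 Our goal
[Diagram: the usual route decompresses the Compressed Text into the Original Text and runs a Pattern Matching Machine on it. Our goal is a New Machine that does pattern matching directly on the Compressed Text.]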

5 Previous researches
year | researchers | compression method
1988 | Eilam-Tzoreff and Vishkin | run-length
1992 | Amir, Landau, and Vishkin | two-dimensional run-length
1995 | Farach and Thorup | LZ77
1996 | Amir, Benson and Farach | LZW
1997 | Karpinski, Rytter, and Shinohara | straight-line programs
1996 | Gasieniec, et al. | LZ77
1997 | Miyazaki, Shinohara, and Takeda | straight-line programs
1992 | Amir and Benson | two-dimensional run-length
1994 | Amir, Benson, and Farach | two-dimensional run-length
1997 | Takeda | finite state encoding
1998 | Shibata | byte pair encoding
1994 | Manber | original compression scheme
1998 | Fukamachi, Shinohara, and Takeda | Huffman encoding
1998 | Kida, et al. | LZW (AC automaton, DCC’98)

6 Recent researches
year | researchers | compression method
1998 | de Moura, Navarro, Ziviani, and Baeza-Yates | Word based encoding
1999 | Kida, Takeda, Shinohara, and Arikawa | LZW
1999 | Shibata, et al. | Byte pair encoding
1999 | Kida, et al. | Dictionary based methods (Collage system)
1999 | Navarro and Raffinot | LZ family
1999 | Shibata, Takeda, Shinohara, and Arikawa | Antidictionaries
Labels on the slide: Previous researches, CPM’99, SPIRE’99, Shift-And algorithm

7 Our main results
- The new algorithm scans a compressed text in O(n+r) time using O(|D|) space, and reports all occurrences of the pattern, after O(m+|Σ|) time and O(|Σ|) space preprocessing.
- The algorithm is about 1.3 times faster than our previous one, which simulates the AC automaton.
- The algorithm is about 1.5 times faster than decompression followed by a simple search using the Shift-And algorithm.
Notation: |D| : size of the dictionary trie; n : compressed text length; m : pattern length; r : number of pattern occurrences; Σ : alphabet

8 Lempel-Ziv-Welch Compression: how to compress and decompress

9 Lempel-Ziv-Welch (LZW) compression
Original text: a b ab ab ba b c aba bc abab
Compressed text: 1 2 4 4 5 2 3 6 9 11
Dictionary trie (node: string): 0: ε (root), 1: a, 2: b, 3: c, 4: ab, 5: ba, 6: aba, 7: abb, 8: bab, 9: bc, 10: ca, 11: abab, 12: bca
O(|D|) = O(n)

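10 How to compress a text
Original text: a b ab ab ba b c aba bc abab
Compressed text: 1 2 4 4 5 2 3 6 9 11
[Animation: the dictionary trie grows while the text is scanned. The root 0 starts with children 1, 2, 3 on edges a, b, c; nodes 4 to 12 are then added one per code, on edges labeled b, a, a, b, b, c, a, b, a.]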
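The compression step on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `lzw_compress`, the dict-based trie, and the convention that the alphabet characters get codes 1, 2, 3 (root 0) are assumptions chosen to match the node numbers shown in the dictionary trie.

```python
def lzw_compress(text, alphabet):
    """Hypothetical LZW sketch. Returns (codes, trie)."""
    # Trie as a dict: (parent_node, char) -> child_node. Node 0 is the root.
    trie = {(0, ch): i + 1 for i, ch in enumerate(alphabet)}
    next_node = len(alphabet) + 1
    codes = []
    node = 0
    for ch in text:
        if (node, ch) in trie:
            # Keep walking down the trie: longest-match parsing.
            node = trie[(node, ch)]
        else:
            # Emit the current node, add the new node, restart at ch.
            codes.append(node)
            trie[(node, ch)] = next_node
            next_node += 1
            node = trie[(0, ch)]
    if node != 0:
        codes.append(node)
    return codes, trie


codes, trie = lzw_compress("abababbabcababcabab", "abc")
print(codes)
```

Each emitted code adds at most one trie node, which is why the dictionary size |D| is O(n) in the compressed length.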

11 How to decompress a compressed text
Compressed text: 1 2 4 4 5 2 3 6 9 11
Original text: a b ab ab ba b c aba bc abab
[Animation: the same dictionary trie (nodes 0 to 12) is rebuilt while the codes are read.]
Labels on the slide: O(n) time, O(N) time
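Decompression rebuilds the same dictionary from the codes alone. The sketch below is an illustration under assumptions: the name `lzw_decompress` is hypothetical, codes 1..|alphabet| stand for single characters, and phrases are stored as whole strings for readability rather than as trie nodes with parent pointers.

```python
def lzw_decompress(codes, alphabet):
    """Hypothetical LZW decoder sketch matching the numbering used above."""
    phrases = {i + 1: ch for i, ch in enumerate(alphabet)}
    next_node = len(alphabet) + 1
    out = []
    prev = None
    for code in codes:
        if code in phrases:
            cur = phrases[code]
        else:
            # The code names the phrase that is being created in this very
            # step, so it must be prev followed by prev's first character.
            cur = prev + prev[0]
        if prev is not None:
            # New dictionary entry: previous phrase + first char of current.
            phrases[next_node] = prev + cur[0]
            next_node += 1
        out.append(cur)
        prev = cur
    return "".join(out)


print(lzw_decompress([1, 2, 4, 4, 5, 2, 3, 6, 9, 11], "abc"))
```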

12 Compressed Pattern Matching in LZW Compressed Text with Shift-And approach

13 Shift-And approach to pattern matching (Baeza-Yates and Gonnet [1992], Wu and Manber [1992])
text: aabaacaabacab
pattern: aabac
mask bits (one bit per pattern position 1..5): a: 1 1 0 1 0, b: 0 0 1 0 0, c: 0 0 0 0 1
[Animation: for each text character the state bit vector is shifted and ANDed (&) with the mask bits of that character. "Pattern was found!" appears when the last bit becomes 1.]
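The animation corresponds to the standard Shift-And loop, sketched here on uncompressed text. The function name `shift_and` and the bit convention (bit i-1 of an integer stands for pattern position i) are choices made for this illustration.

```python
def shift_and(text, pattern):
    """Return the 1-based end positions of all occurrences of pattern."""
    m = len(pattern)
    # mask[ch] has bit i-1 set when pattern position i holds ch.
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit for state m: the whole pattern matched
    state = 0
    ends = []
    for pos, ch in enumerate(text, start=1):
        # Shift, switch on state 1, then keep only positions matching ch.
        state = ((state << 1) | 1) & mask.get(ch, 0)
        if state & accept:
            ends.append(pos)
    return ends


print(shift_and("aabaacaabacab", "aabac"))
```

For the slide's text and pattern the single occurrence covers positions 7 to 11, so the call reports end position 11.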

14 Properties of Shift-And approach
- Simple, but very fast when the pattern length m is not greater than the word length of typical computers (32 or 64).
- Assuming m ≤ 32 (or 64) and that bit-shift operations and bitwise logical operations on integers can be performed in constant time, it runs in O(n) time.
- This method has many variations:
  - generalized pattern matching
  - pattern matching with k-mismatch
  - pattern matching for multiple patterns

15 Basic idea of our algorithm
text: aabaacaabacab
pattern: aabac
mask bits: a: 1 1 0 1 0, b: 0 0 1 0 0, c: 0 0 0 0 1
compressed text (as shown on the slide): 6151
[Animation: instead of updating the state once per text character, the state jumps over the whole string that one code of the compressed text stands for. Jump! Can it be done in O(1) time?]

16 Basic idea of our algorithm
text: aabaacaabacab
pattern: aabac
mask bits: a: 1 1 0 1 0, b: 0 0 1 0 0, c: 0 0 0 0 1
compressed text (as shown on the slide): 6151
[Animation: a jump passes over a pattern occurrence. Pattern was found!]
We need a mechanism for reporting all pattern occurrences.

17 Main results: technical details
Lemma 1 (Realization of ‘Jump’): The state transition function can be realized in O(|D|+m) time using O(|D|) space, and returns its value in O(1) time.
Lemma 2 (Realization of ‘Output’): The procedure which enumerates the pattern occurrences can be realized in O(|D|+m) time using O(|D|) space, and runs in O(r) time.
Notation: |D| : size of the dictionary trie; m : pattern length; r : number of pattern occurrences

18 Overview of the algorithm
Input. Pattern P, and u_1, u_2, …, u_n : LZW compressed text.
Output. All occurrences of the pattern.

Construct mask bits from P.
Initialize the dictionary trie, $\hat{M}$, U, and V;
l := 0; S := $\emptyset$;
for i := 1 to n do begin
    for each d $\in$ Output(S, u_i) do
        report ‘pattern occurs at position l+d’;
    S := $\hat{f}$(S, u_i); /* Jump the state! */
    l := l + |u_i|; /* increment the offset */
    Update the dictionary trie, $\hat{M}$, U, and V;
end

19 Detail of our algorithm: realization of Jump and Output

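20 Detail of ‘Jump’
For $a \in \Sigma$ and a state $S \subseteq \{1, \dots, m\}$:
$M(a) := \{\, 1 \le i \le m \mid \mathrm{Pattern}[i] = a \,\}$
$f(S, a) := ((S \ll 1) \cup \{1\}) \cap M(a)$
Here $S \ll 1 = \{\, i + 1 \mid i \in S \,\}$ is a bit shift, $\cup$ is a bitwise OR, and $\cap$ is a bitwise AND on the state bit vector.
Example for pattern aabac: M(a) = {1, 2, 4}, M(b) = {3}, M(c) = {5}, that is, mask bits a: 1 1 0 1 0, b: 0 0 1 0 0, c: 0 0 0 0 1. The state S = {1, 3} is the bit vector 1 0 1 0 0.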
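The set notation on this slide translates directly into Python sets. The following is only an illustrative sketch: the names `mask_sets` and `f` are chosen here, and states are kept as sets of 1-based positions rather than machine words.

```python
def mask_sets(pattern):
    """M(a): the set of 1-based pattern positions that hold character a."""
    M = {}
    for i, ch in enumerate(pattern, start=1):
        M.setdefault(ch, set()).add(i)
    return M


def f(S, a, M):
    """One transition: shift all states by one, add state 1, filter by M(a)."""
    shifted = {i + 1 for i in S}
    return (shifted | {1}) & M.get(a, set())


M = mask_sets("aabac")
next_state = f({1, 3}, "a", M)
print(M, next_state)
```

With the slide's state S = {1, 3} and character a, the shift gives {2, 4}, adding state 1 gives {1, 2, 4}, and intersecting with M(a) = {1, 2, 4} leaves {1, 2, 4}.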

21 Detail of ‘Jump’
With $f(S, a) := ((S \ll 1) \cup \{1\}) \cap M(a)$ and $M(a) := \{\, 1 \le i \le m \mid \mathrm{Pattern}[i] = a \,\}$, extend $f$ to strings $u \in \Sigma^*$ recursively:
$\hat{f}(S, \varepsilon) := S$
$\hat{f}(S, ua) := f(\hat{f}(S, u), a)$
Define $\hat{M}(u) := \hat{f}(\{1, \dots, m\}, u)$. Then
$\hat{f}(S, u) = ((S \ll |u|) \cup \{1, \dots, |u|\}) \cap \hat{M}(u)$,
which takes O(1) time once $\hat{M}(u)$ is known.
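The closed form for the jump can be checked against the recursive definition with integer bit vectors. This sketch assumes bit i-1 of an integer represents state i. The names `make_masks`, `f_hat`, and `f_hat_closed` are hypothetical, and the closed form recomputes the all-ones jump on the fly instead of looking it up in a table.

```python
def make_masks(pattern):
    """Bit i-1 of masks[ch] is set when pattern position i holds ch."""
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    return masks


def f_hat(S, u, masks):
    """Recursive definition: apply the one-character transition per char."""
    for a in u:
        S = ((S << 1) | 1) & masks.get(a, 0)
    return S


def f_hat_closed(S, u, masks, m):
    """Closed form: ((S << |u|) | {1..|u|}) & M_hat(u)."""
    full = (1 << m) - 1              # the state set {1, ..., m}
    m_hat_u = f_hat(full, u, masks)  # M_hat(u); tabulated per trie node in the talk
    low = (1 << len(u)) - 1          # the state set {1, ..., |u|}
    return ((S << len(u)) | low) & m_hat_u


masks = make_masks("aabac")
print(bin(f_hat(0b00101, "aba", masks)), bin(f_hat_closed(0b00101, "aba", masks, 5)))
```

For S = {1, 3} and u = aba both routes give the state {1, 4}. In this bit order that is 0b01001, which the slides write left to right as 1 0 0 1 0.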

22 Move of ‘Jump’: move of $\hat{f}(S, u)$
text: aabaacaabacab; pattern: aabac
Precomputed vectors: $\hat{M}(\mathrm{aba})$ = 1 0 0 1 0, $\hat{M}(\mathrm{acaabac})$ = 0 0 0 0 1
[Animation: the state jumps over the string aba in one step by a shift followed by an AND (&) with $\hat{M}(\mathrm{aba})$.]

23 Move of ‘Jump’: move of $\hat{f}(S, u)$
text: aabaacaabacab; pattern: aabac
Precomputed vectors: $\hat{M}(\mathrm{aba})$ = 1 0 0 1 0, $\hat{M}(\mathrm{acaabac})$ = 0 0 0 0 1
[Animation: the state jumps over the string acaabac in one step by a shift followed by an AND (&) with $\hat{M}(\mathrm{acaabac})$.]

24 How to calculate $\hat{M}(u)$
For a node $u \cdot a$ of the dictionary trie D whose parent is $u$:
$\hat{M}(u \cdot a) = \hat{f}(\{1, \dots, m\}, u \cdot a) = f(\hat{f}(\{1, \dots, m\}, u), a) = f(\hat{M}(u), a) = ((\hat{M}(u) \ll 1) \cup \{1\}) \cap M(a)$
Each node takes O(1) time; total: O(|D|) time and space.
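The per-node update can be sketched while replaying the LZW dictionary construction from the codes. Everything named here (`build_m_hat`, the `first` table of first characters, codes 1..|alphabet| for single characters) is an assumption made for this illustration, not the authors' code.

```python
def build_m_hat(codes, alphabet, pattern):
    """Return {node: M_hat(node string)} as integers (bit i-1 = state i)."""
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    full = (1 << m) - 1
    m_hat = {0: full}  # M_hat(empty string) = {1, ..., m}
    first = {}         # first character of each node's string
    for node, ch in enumerate(alphabet, start=1):
        m_hat[node] = ((full << 1) | 1) & masks.get(ch, 0)
        first[node] = ch
    next_node = len(alphabet) + 1
    prev = None
    for code in codes:
        if prev is not None:
            # LZW adds: previous phrase + first character of current phrase.
            # If the current code is the node being created, that character
            # equals the first character of the previous phrase.
            a = first[code] if code in first else first[prev]
            # O(1) update from the parent's value.
            m_hat[next_node] = ((m_hat[prev] << 1) | 1) & masks.get(a, 0)
            first[next_node] = first[prev]
            next_node += 1
        prev = code
    return m_hat


table = build_m_hat([1, 2, 4, 4, 5, 2, 3, 6, 9, 11], "abc", "aabac")
print(bin(table[6]))
```

Node 6 is the string aba in the earlier LZW example, and its entry is the state set {1, 4}, in agreement with the vector 1 0 0 1 0 shown for aba on the ‘Jump’ slides.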

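25 Detail of Output(S, u): how to enumerate the occurrences
$\mathrm{Output}(S, u) = \{\, 1 \le j \le |u| \mid m \in \hat{f}(S, u[1..j]) \,\}$, for a state $S \subseteq \{1, \dots, m\}$ and a string $u$ in the dictionary D.
[Figure: the state S stands for the length-i prefix of the pattern for the largest i ∈ S, placed just before u. In the example, pattern occurrences end at positions 2 and 11 of u, so Output(S, u) = { 2, 11 }.]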

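26 Realization of Output(S, u): two subsets U and V
$U(u) := \{\, 1 \le i \le |u| \mid i < m \text{ and } u[1..i] = \mathrm{Pattern}[m-i+1..m] \,\}$
$V(u) := \{\, 1 \le i \le |u| \mid i \ge m \text{ and } u[i-m+1..i] = \mathrm{Pattern} \,\}$
$\mathrm{Output}(S, u) = ((m \ominus S) \cap U(u)) \cup V(u)$, where $m \ominus S = \{\, m - i \mid i \in S \,\}$.
The first term is dependent on S; V(u) is independent of S.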

27 How to calculate U(u) and V(u)
For a node $u \cdot a$ of the dictionary trie D whose parent is $u$:
if $m \in \hat{M}(u \cdot a)$ then $U(u \cdot a) = U(u) \cup \{|u \cdot a|\}$ else $U(u \cdot a) = U(u)$;
We can deal with V(u) in the same way as in [DCC’98].
Each node takes O(1) time; total: O(|D|) time and space.
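Putting Jump and Output together, the overview loop can be sketched end to end as below. This is an illustration under several loudly stated assumptions, not the authors' implementation: the names `search_lzw` and `add_node` are hypothetical. U and V are copied per node as tuples, which is simpler than, and not equivalent in cost to, the O(1)-per-node representation claimed on the slides. The V update shown (add |ua| when m is in M_hat(ua) and |ua| >= m) is a choice made for this sketch rather than the [DCC’98] method. The guard |ua| < m on the U update comes from the definition of U.

```python
def search_lzw(codes, alphabet, pattern):
    """Return 1-based end positions of pattern in the text encoded by codes."""
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    full, top = (1 << m) - 1, 1 << (m - 1)
    length, first, m_hat, U, V = {}, {}, {}, {}, {}

    def add_node(node, parent, ch):
        # parent == 0 is the root: empty string, M_hat = {1..m}.
        length[node] = length.get(parent, 0) + 1
        first[node] = first.get(parent, ch)
        m_hat[node] = ((m_hat.get(parent, full) << 1) | 1) & masks.get(ch, 0)
        U[node], V[node] = U.get(parent, ()), V.get(parent, ())
        if m_hat[node] & top:
            if length[node] < m:
                U[node] += (length[node],)  # node string is a pattern suffix
            else:
                V[node] += (length[node],)  # pattern ends the node string

    for node, ch in enumerate(alphabet, start=1):
        add_node(node, 0, ch)
    next_node = len(alphabet) + 1

    S, offset, prev, hits = 0, 0, None, []
    for code in codes:
        if prev is not None:
            # Update the dictionary trie before using the current code.
            ch = first[code] if code in first else first[prev]
            add_node(next_node, prev, ch)
            next_node += 1
        # Output(S, u) = ((m - S) & U(u)) | V(u)
        for d in U[code]:
            if (S >> (m - d - 1)) & 1:  # is state m - d in S?
                hits.append(offset + d)
        hits.extend(offset + d for d in V[code])
        # Jump: S = ((S << |u|) | {1..|u|}) & M_hat(u)
        k = length[code]
        S = ((S << k) | ((1 << k) - 1)) & m_hat[code]
        offset += k
        prev = code
    return hits


print(search_lzw([1, 1, 2, 4, 3, 4, 6, 8, 2], "abc", "aabac"))
```

The code sequence in the usage line is what an LZW coder with a = 1, b = 2, c = 3 produces for the text aabaacaabacab. It is not the code sequence drawn on the slides.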

28 Is this really practical? But... is it really fast? Hmm...

29 Experimental Comparisons
◆ Method 1: Decompress the compressed text, then search with Shift-And.
◆ Method 2: Our previous algorithm (DCC’98), run on the compressed text.
◆ Method 3: Our new algorithm, run on the compressed text.

30 Experimental Comparisons
Original text: "The Brown corpus", 6.8 Mbytes
Compressed text: 3.4 Mbytes, compressed with compress (UNIX command)
Language: C (with gcc compiler)
Machine: Sun SPARCstation 20 with remote disk storage
File transfer rate: 0.96 Mbyte/sec

31 Experimental results (CPU time + File I/O time)
Method | elapsed time (s) | CPU time (s)
Shift-And with decompression | 8.16 | 7.52
Our previous algorithm (DCC’98) | 7.31 | 6.57
New algorithm | 6.05 | 5.15
Callouts: 1.3 times faster! 1.5 times faster!

32 Experimental results
Method | elapsed time (s) | CPU time (s)
Shift-And in original text | 9.36 | 3.09
Shift-And with decompression | 8.16 | 7.52
Our previous algorithm (DCC’98) | 7.31 | 6.57
New algorithm | 6.05 | 5.15
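As a quick check of the quoted speedups, assume the CPU times 7.52 s, 6.57 s, and 5.15 s belong to Shift-And with decompression, the previous algorithm (DCC’98), and the new algorithm, in that order. This is the reading under which the numbers agree with the callouts. The ratios are then:

```latex
\frac{7.52}{5.15} \approx 1.46, \qquad \frac{6.57}{5.15} \approx 1.28
```

These match the "1.5 times faster" and "1.3 times faster" statements when rounded to one decimal place.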

33 Conclusion
- The proposed algorithm scans an LZW compressed text in O(n+r) time using O(|D|) space, and reports all occurrences of the pattern, after O(m+|Σ|) time and O(|Σ|) space preprocessing.
- We implemented the algorithm, and showed that it is approximately 1.3 times faster than our previous algorithm.
- Our new algorithm has several extensions:
  - generalized pattern matching
  - pattern matching with k-mismatches
  - pattern matching for multiple patterns

