Shift-And Approach to Pattern Matching in LZW Compressed Text
Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Setsuo ARIKAWA
Department of Informatics, Kyushu University, Japan
Motivation
Address book, schedule, dictionary, phone numbers, memo, electronic book, database.
The available storage devices are limited!
I want to store as much information as possible!
I want to do pattern matching as fast as possible!
Yes! Data compression! ...but a suffix trie is very large...
Our goal
[Figure: the usual route decompresses the compressed text into the original text and runs a pattern matching machine on it; our goal is a new machine that runs on the compressed text directly.]
Previous researches
year: researchers (compression method)
1988: Eilam-Tsoreff and Vishkin (run-length)
1992: Amir, Landau, and Vishkin (two-dimensional run-length)
1995: Farach and Thorup (LZ)
Amir, Benson, and Farach (LZW)
1997: Karpinski, Rytter, and Shinohara (straight-line programs)
1996: Gasieniec, et al. (LZ)
Miyazaki, Shinohara, and Takeda (straight-line programs)
1992: Amir and Benson (two-dimensional run-length)
1994: Amir, Benson, and Farach (two-dimensional run-length)
1997: Takeda (finite state encoding)
1998: Shibata (byte pair encoding)
1994: Manber (original compression scheme)
1998: Fukamachi, Shinohara, and Takeda (Huffman encoding)
1998: Kida, et al. (LZW; AC automaton, DCC'98)
Recent researches
year: researchers (compression method)
1999: Kida, Takeda, Shinohara, and Arikawa (LZW; Shift-And algorithm)
1999: Shibata, et al. (byte pair encoding)
Kida, et al. (dictionary based methods (collage system))
1999: Navarro and Raffinot (LZ family)
1999: Shibata, Takeda, Shinohara, and Arikawa (antidictionaries)
de Moura, Navarro, Ziviani, and Baeza-Yates (word based encoding)
Venues named on the slide: CPM'99, SPIRE'
Our main results
The new algorithm scans a compressed text in O(n+r) time using O(|D|) space, and reports all occurrences of the pattern after O(m+|Σ|) time and O(|Σ|) space preprocessing.
The algorithm is about 1.3 times faster than our previous one, which simulates the AC automaton.
The algorithm is about 1.5 times faster than decompression followed by a simple search using the Shift-And algorithm.
|D| : size of the dictionary trie
n : compressed text length
m : pattern length
r : number of pattern occurrences
Lempel-Ziv-Welch Compression how to compress and decompress
Lempel-Ziv-Welch (LZW) compression
Original text, split into the phrases that LZW encodes: a b ab ab ba b c aba bc abab
[Figure: the compressed text and the dictionary trie built for this text.]
Size of the dictionary trie: O(|D|) = O(n)
How to compress a text
Original text, split into phrases: a b ab ab ba b c aba bc abab
[Figure: step-by-step compression. Starting from the nodes for a, b, and c, the dictionary trie gains nodes 4 to 12 along edges labeled b, a, a, b, b, c, a, b, a.]
How to decompress a compressed text
[Figure: step-by-step decompression. The same dictionary trie (nodes 4 to 12) is rebuilt while the original text is recovered phrase by phrase.]
Time labels on the slide: O(n) time, O(N) time
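The compression and decompression steps on these slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the dictionary is a plain Python dict of strings rather than a trie, and it assumes codes start at 1 with the alphabet occupying the first codes, which matches the node numbers 4 to 12 in the trie figure.

```python
def lzw_compress(text, alphabet):
    """Return the list of LZW codes for text (codes start at 1).

    Assumes every character of text occurs in alphabet.
    """
    code_of = {ch: i for i, ch in enumerate(alphabet, start=1)}
    next_code = len(alphabet) + 1
    out, w = [], ""
    for ch in text:
        if w + ch in code_of:
            w += ch                      # keep extending the current phrase
        else:
            out.append(code_of[w])       # emit the longest known phrase
            code_of[w + ch] = next_code  # add phrase + next character
            next_code += 1
            w = ch
    if w:
        out.append(code_of[w])
    return out


def lzw_decompress(codes, alphabet):
    """Rebuild the original text, re-creating the dictionary on the fly."""
    string_of = {i: ch for i, ch in enumerate(alphabet, start=1)}
    next_code = len(alphabet) + 1
    out, prev = [], None
    for code in codes:
        if code in string_of:
            cur = string_of[code]
        else:
            # The code refers to the entry that is about to be created.
            cur = prev + prev[0]
        if prev is not None:
            string_of[next_code] = prev + cur[0]
            next_code += 1
        out.append(cur)
        prev = cur
    return "".join(out)


print(lzw_compress("abababbabcababcabab", "abc"))
# [1, 2, 4, 4, 5, 2, 3, 6, 9, 11]
```

The ten codes correspond to the ten phrases a, b, ab, ab, ba, b, c, aba, bc, abab of the example text.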
Compressed Pattern Matching in LZW Compressed Text with Shift-And approach
Shift-And approach to pattern matching (Baeza-Yates and Gonnet [1992], Wu and Manber [1992])
text: aabaacaabacab
pattern: aabac
[Figure: at each text character the state bits are shifted and ANDed (&) with the mask bits of that character (a, b, or c); when the bit for the full pattern length is set, the pattern was found.]
Properties of Shift-And approach
Simple, but very fast when the pattern length m is not greater than the word length of typical computers (32 or 64).
Assuming m ≤ 32 (or 64) and that bit-shift operations and bitwise logical operations on integers can be performed in constant time, it runs in O(n) time.
This method has many variations:
  generalized pattern matching
  pattern matching with k-mismatch
  pattern matching for multiple patterns
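A minimal sketch of the Shift-And search on uncompressed text, using the text and pattern from the slides. The function name and the choice to report 1-based end positions are assumptions made for this illustration; Python integers stand in for machine words, so the limit m ≤ 32 (or 64) does not show up here.

```python
def shift_and(text, pattern):
    """Return the 1-based end positions of pattern in text."""
    m = len(pattern)
    # mask[ch] has bit i-1 set iff pattern[i] == ch (1-based i).
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)          # bit of state m: whole pattern matched
    state = 0
    ends = []
    for pos, ch in enumerate(text, start=1):
        # shift, OR in state 1, then AND with the mask of the character
        state = ((state << 1) | 1) & mask.get(ch, 0)
        if state & accept:
            ends.append(pos)
    return ends


print(shift_and("aabaacaabacab", "aabac"))   # [11]
```

One shift, one OR, and one AND per text character give the O(n) running time stated above.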
Basic idea of our algorithm
text: aabaacaabacab
pattern: aabac
[Figure: instead of advancing one character at a time, the Shift-And state jumps over each phrase of the compressed text in one step ("Jump!").]
Can each jump be done in O(1) time?
Basic idea of our algorithm
text: aabaacaabacab
pattern: aabac
[Figure: while the state jumps over a phrase of the compressed text, the pattern is found inside that phrase.]
We need a mechanism for reporting all pattern occurrences.
Technical details
Lemma 1 (Realization of 'Jump'): The state transition function can be realized in O(|D|+m) time using O(|D|) space, and returns its value in O(1) time.
Lemma 2 (Realization of 'Output'): The procedure which enumerates the pattern occurrences can be realized in O(|D|+m) time using O(|D|) space, and runs in O(r) time.
|D| : size of the dictionary trie
m : pattern length
r : number of pattern occurrences
Overview of the algorithm
Input. Pattern P and LZW compressed text $u_1, u_2, \dots, u_n$.
Output. All occurrences of the pattern.
Construct mask bits from P.
Initialize the dictionary trie, $\hat{M}$, U, and V;
l := 0; S := ∅;
for i := 1 to n do begin
  for each d ∈ Output(S, $u_i$) do
    report 'pattern occurs at position l+d';
  S := $\hat{f}(S, u_i)$; /* Jump the state! */
  l := l + $|u_i|$; /* increment the offset */
  Update the dictionary trie, $\hat{M}$, U, and V;
end
Detail of our Algorithm Realization of Jump and Output
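Detail of 'Jump'
State transition for one character: for $a \in \Sigma$ and $S \subseteq \{1, \dots, m\}$,
$f(S, a) = ((S \ll 1) \cup \{1\}) \cap M(a)$, where $M(a) = \{\, 1 \le i \le m \mid \mathrm{Pattern}[i] = a \,\}$.
On bit vectors, $\ll$ is a bit shift, $\cup$ is OR, and $\cap$ is AND.
Example (pattern aabac): mask bits $M(a) = \{1,2,4\}$, $M(b) = \{3\}$, $M(c) = \{5\}$; state $S = \{1,3\}$.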
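Detail of 'Jump'
For one character: $f(S, a) = ((S \ll 1) \cup \{1\}) \cap M(a)$, where $M(a) = \{\, 1 \le i \le m \mid \mathrm{Pattern}[i] = a \,\}$ and $\ll$ is a bit shift.
Extend $f$ to strings recursively: for $a \in \Sigma$, $u \in \Sigma^*$, and $S \subseteq \{1, \dots, m\}$,
$\hat{f}(S, \varepsilon) = S$ and $\hat{f}(S, ua) = f(\hat{f}(S, u), a)$.
Define $\hat{M}(u) = \hat{f}(\{1, \dots, m\}, u)$. Then
$\hat{f}(S, u) = ((S \ll |u|) \cup \{1, \dots, |u|\}) \cap \hat{M}(u)$,
which can be evaluated in O(1) time once $\hat{M}(u)$ is known.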
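The closed form for the jump can be checked against the recursive definition with a small sketch. The helper names are hypothetical, states are stored as bits of a Python integer (bit i-1 stands for state i), and the mask of M-hat is recomputed character by character here only to keep the example short; the slides store it per dictionary node instead.

```python
def make_masks(pattern):
    """mask[ch] has bit i-1 set iff pattern[i] == ch (1-based i)."""
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)
    return mask


def f(S, a, mask):
    """One-character transition: ((S << 1) | {1}) & M(a)."""
    return ((S << 1) | 1) & mask.get(a, 0)


def f_hat(S, u, mask):
    """Recursive definition: apply f for each character of u."""
    for a in u:
        S = f(S, a, mask)
    return S


def jump(S, u, mask, m):
    """Closed form: ((S << |u|) | {1..|u|}) & M_hat(u)."""
    k = len(u)
    m_hat = f_hat((1 << m) - 1, u, mask)   # M_hat(u) = f_hat({1..m}, u)
    return ((S << k) | ((1 << k) - 1)) & m_hat


mask = make_masks("aabac")
S = 0b00100                                # state {3}: "aab" has been read
print(jump(S, "ac", mask, 5) == f_hat(S, "ac", mask))   # True
```

Because every mask has only m bits, the final AND also discards any bits that the shift pushes past state m.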
Move of $\hat{f}(S, u)$
text: aabaacaabacab
[Figure: for a phrase $u$ of the text, the current state is shifted by $|u|$ and ANDed (&) with $\hat{M}(u)$.]
How to calculate $\hat{M}(u)$
For a node $u$ of the dictionary trie $D$ and its child $ua$:
$\hat{M}(ua) = \hat{f}(\{1, \dots, m\}, ua) = f(\hat{f}(\{1, \dots, m\}, u), a) = f(\hat{M}(u), a) = ((\hat{M}(u) \ll 1) \cup \{1\}) \cap M(a)$
Each new node takes O(1) time.
total: O(|D|) time and space
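A small sketch of the per-node update, assuming the dictionary trie is given as a list of (parent, character) pairs in creation order with node 0 as the root. That representation and the function name are choices made for this illustration only.

```python
def build_mhat(entries, pattern):
    """Compute M_hat for every trie node with one O(1) step per node.

    entries[k] = (parent, ch) describes node k+1; node 0 is the root
    (the empty string), whose M_hat is the full set {1..m}.
    Bit i-1 of each value stands for state i.
    """
    m = len(pattern)
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)
    mhat = [(1 << m) - 1]
    for parent, ch in entries:
        # M_hat(ua) = ((M_hat(u) << 1) | {1}) & M(a)
        mhat.append(((mhat[parent] << 1) | 1) & mask.get(ch, 0))
    return mhat


def states(bits):
    """Decode a bit vector into the sorted list of states it contains."""
    return [i + 1 for i in range(bits.bit_length()) if (bits >> i) & 1]


# Nodes 1..6 spell a, b, c, aa, ab, ba.
entries = [(0, "a"), (0, "b"), (0, "c"), (1, "a"), (1, "b"), (2, "a")]
mhat = build_mhat(entries, "aabac")
print(states(mhat[6]))   # [1, 4]
```

For the node "ba" the result {1, 4} says that "a" is a suffix matching Pattern[1..1] and that "ba" equals Pattern[3..4].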
How to enumerate the occurrences
$\mathrm{Output}(S, u) = \{\, 1 \le j \le |u| \mid m \in \hat{f}(S, u[1..j]) \,\}$
[Figure: the phrase $u$ is preceded by the length-$i$ prefix of the pattern for the largest $i \in S$; pattern occurrences end at positions 2 and 11 of $u$, so Output(S, u) = {2, 11}.]
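Realization of Output(S, u): two subsets U and V
$U(u) = \{\, 1 \le i \le |u| \mid i < m \text{ and } u[1..i] = \mathrm{Pattern}[m-i+1..m] \,\}$
$V(u) = \{\, 1 \le i \le |u| \mid i \ge m \text{ and } u[i-m+1..i] = \mathrm{Pattern} \,\}$
$\mathrm{Output}(S, u) = ((m \ominus S) \cap U(u)) \cup V(u)$, where $m \ominus S = \{\, m - i \mid i \in S \,\}$.
The first term is dependent on $S$; $V(u)$ is independent of $S$.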
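The decomposition of Output(S, u) into a part that depends on S and a part that does not can be checked directly from the definitions. The sketch below uses Python sets of integers for S, U, and V, and all function names are hypothetical; it aims at clarity, not at the O(r) enumeration of Lemma 2.

```python
def f_hat(S, u, pattern):
    """Shift-And transition on a set of states, one character at a time."""
    m = len(pattern)
    for a in u:
        # state i+1 is reached from state i (or from the start, i = 0)
        S = {i + 1 for i in S | {0} if i < m and pattern[i] == a}
    return S


def output_by_definition(S, u, pattern):
    """Positions j of u after which state m is reached."""
    m = len(pattern)
    return {j for j in range(1, len(u) + 1) if m in f_hat(S, u[:j], pattern)}


def U(u, pattern):
    """Prefixes of u, shorter than m, that are suffixes of the pattern."""
    m = len(pattern)
    return {i for i in range(1, len(u) + 1) if i < m and u[:i] == pattern[m - i:]}


def V(u, pattern):
    """End positions of occurrences lying entirely inside u."""
    m = len(pattern)
    return {i for i in range(m, len(u) + 1) if u[i - m:i] == pattern}


def output_by_formula(S, u, pattern):
    """((m - S) & U(u)) | V(u)."""
    m = len(pattern)
    return ({m - i for i in S} & U(u, pattern)) | V(u, pattern)


print(sorted(output_by_formula({3}, "acaabac", "aabac")))   # [2, 7]
```

With S = {3} the prefix "aab" has already been read, so the occurrence ending at position 2 of "acaabac" comes from U, and the one ending at position 7 comes from V.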
How to calculate U(u) and V(u)
For a node $u$ of the dictionary trie $D$ and its child $ua$:
if $m \in \hat{M}(ua)$ then $U(ua) = U(u) \cup \{|ua|\}$ else $U(ua) = U(u)$;
Each new node takes O(1) time.
We can deal with $V(u)$ in the same way as in [DCC'98].
total: O(|D|) time and space
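Putting Jump and Output together, the whole scan can be sketched as below. This is an illustration under several assumptions, not the authors' C implementation: codes start at 1 with the alphabet first, the dictionary node for a code is created just before the code is processed (the usual decoder order) instead of after the previous one, U is stored as a bit vector with position i kept at the bit of state m-i so that the intersection with m ⊖ S becomes a single AND, and V is kept as a link to the nearest ancestor whose string ends with the pattern. All names are hypothetical, and the pattern is assumed to be non-empty.

```python
def search_lzw(codes, pattern, alphabet):
    """Return sorted 1-based end positions of pattern in the encoded text."""
    m = len(pattern)
    top = 1 << (m - 1)                      # bit of state m
    mask = {ch: 0 for ch in alphabet}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)

    # Per-node tables; node 0 is the root (empty string).
    parent, first, length = [0], [None], [0]
    mhat, ubits, vlink = [(1 << m) - 1], [0], [0]

    def add_node(p, ch):
        n = length[p] + 1
        mh = ((mhat[p] << 1) | 1) & mask.get(ch, 0)
        hit = bool(mh & top)                # m is in M_hat(ua)
        parent.append(p)
        first.append(ch if p == 0 else first[p])
        length.append(n)
        mhat.append(mh)
        # U: position n is stored at the bit of state m - n.
        ubits.append(ubits[p] | (1 << (m - n - 1)) if hit and n < m else ubits[p])
        # V: link to the deepest node on this path that ends with the pattern.
        vlink.append(len(vlink) if hit and n >= m else vlink[p])

    for ch in alphabet:
        add_node(0, ch)

    ends, S, offset, prev = [], 0, 0, None
    for c in codes:
        if prev is not None:
            # New entry = previous phrase + first character of this phrase.
            add_node(prev, first[c] if c < len(length) else first[prev])
        bits = S & ubits[c]                 # (m - S) & U(u)
        while bits:
            low = bits & -bits
            ends.append(offset + m - low.bit_length())
            bits ^= low
        node = vlink[c]                     # V(u)
        while node:
            ends.append(offset + length[node])
            node = vlink[parent[node]]
        k = length[c]
        S = ((S << k) | ((1 << k) - 1)) & mhat[c]   # Jump the state
        offset += k
        prev = c
    return sorted(ends)


print(search_lzw([1, 1, 2, 4, 3, 4, 6, 8, 2], "aabac", "abc"))   # [11]
```

The code list in the usage line is the LZW encoding of the slide text aabaacaabacab under the numbering assumed here. In a fixed-width language the shift by k would have to be capped at the word length; Python's unbounded integers hide that detail.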
But... is it really fast? Hmm... Is this really practical?
Experimental Comparisons
◆ Method 1: Decompress the compressed text, then run Shift-And.
◆ Method 2: Run our previous algorithm (DCC'98) on the compressed text.
◆ Method 3: Run our new algorithm on the compressed text.
Experimental Comparisons
Original text: "The Brown corpus", 6.8 Mbytes
Compressed text: 3.4 Mbytes, compressed with compress (UNIX command)
Language: C (with gcc compiler)
Machine: Sun SPARCstation 20 with remote disk storage
File transfer ratio: 0.96 Mbyte/sec
Experimental results
[Figure: CPU time (s) and elapsed time (s) (CPU time + file I/O time) for each method: Shift-And on uncompressed text, Shift-And with decompression, our previous algorithm (DCC'98), and the new algorithm. Annotations: "1.3 times faster!" and "1.5 times faster!"]
Experimental results
[Figure: CPU time (s) and elapsed time (s) for each method: Shift-And in original text, Shift-And with decompression, our previous algorithm (DCC'98), and the new algorithm.]
Conclusion
The proposed algorithm scans an LZW compressed text in O(n+r) time using O(|D|) space, and reports all occurrences of the pattern after O(m+|Σ|) time and O(|Σ|) space preprocessing.
We implemented the algorithm and showed that it is approximately 1.3 times faster than our previous algorithm.
Our new algorithm has several extensions:
  generalized pattern matching
  pattern matching with k-mismatches
  pattern matching for multiple patterns