Shift-And Approach to Pattern Matching in LZW Compressed Text Takuya KIDA Department of Informatics Kyushu University, Japan Masayuki TAKEDA Ayumi SHINOHARA.


Shift-And Approach to Pattern Matching in LZW Compressed Text Takuya KIDA Department of Informatics Kyushu University, Japan Masayuki TAKEDA Ayumi SHINOHARA Setsuo ARIKAWA

Motivation

Data I want to carry: address book, schedule, dictionary, phone numbers, memos, electronic books, databases.
- The available storage is limited!
- I want to pack in as much information as possible!
- I want to do pattern matching as fast as possible!
Yes! Data compression! (...but a suffix trie is very large...)

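Our goal

[Figure: the usual route decompresses the compressed text into the original text and runs a pattern matching machine on it. Our goal is a new machine that performs pattern matching directly on the compressed text.]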

Previous research

Year  Researchers                        Compression method
1988  Eilam-Tsoreff and Vishkin          run-length
1992  Amir, Landau, and Vishkin          two-dimensional run-length
1995  Farach and Thorup                  LZ
      Amir, Benson, and Farach           LZW
1997  Karpinski, Rytter, and Shinohara   straight-line programs
1996  Gasieniec, et al.                  LZ
      Miyazaki, Shinohara, and Takeda    straight-line programs
1992  Amir and Benson                    two-dimensional run-length
1994  Amir, Benson, and Farach           two-dimensional run-length
1997  Takeda                             finite state encoding
1998  Shibata                            byte pair encoding
1994  Manber                             original compression scheme
1998  Fukamachi, Shinohara, and Takeda   Huffman encoding
1998  Kida, et al.                       LZW (AC automaton, DCC'98)

Previous research / Recent research

Year  Researchers                                   Compression method
1999  Kida, Takeda, Shinohara, and Arikawa          LZW
1999  Shibata, et al.                               byte pair encoding
      Kida, et al.                                  dictionary-based methods (collage system)
1999  Navarro and Raffinot                          LZ family
1999  Shibata, Takeda, Shinohara, and Arikawa       antidictionaries
      de Moura, Navarro, Ziviani, and Baeza-Yates   word-based encoding

Slide annotations: CPM'99, SPIRE', Shift-And algorithm.

Our main results

- The new algorithm scans a compressed text in O(n+r) time using O(|D|) space, and reports all occurrences of the pattern after preprocessing that takes O(m+|Σ|) time and O(|Σ|) space.
- The algorithm is about 1.3 times faster than our previous one, which simulates the AC automaton.
- The algorithm is about 1.5 times faster than decompression followed by a simple search using the Shift-And algorithm.

|D| : size of the dictionary trie
n : compressed text length
m : pattern length
r : number of pattern occurrences

Lempel-Ziv-Welch Compression how to compress and decompress

Lempel-Ziv-Welch (LZW) compression

Original text, split into phrases: a b ab ab ba b c aba bc abab
Compressed text: [phrase numbers not preserved in this transcript]
[Figure: the dictionary trie built from these phrases.]
O(|D|) = O(n)

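How to compress a text

Original text, split into phrases: a b ab ab ba b c aba bc abab
Compressed text: [phrase numbers not preserved in this transcript]
Dictionary trie: the root has children a, b, c. The nodes added during compression, with their edge labels in order of creation: b (4), a (5), a (6), b (7), b (8), c (9), a (10), b (11), a (12).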
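The compression step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `lzw_compress`, the dictionary-of-edges trie, and the convention that the alphabet characters a, b, c get the numbers 1, 2, 3 are assumptions chosen so that new nodes start at 4, as on the slide.

```python
def lzw_compress(text, alphabet):
    """Return (codes, trie). trie maps (parent_node, char) -> child_node; 0 is the root.

    Assumes every character of text occurs in alphabet.
    """
    trie = {}
    next_node = 1
    for ch in alphabet:                  # nodes 1..|alphabet| hold the single characters
        trie[(0, ch)] = next_node
        next_node += 1

    codes = []
    node = 0                             # trie node of the phrase read so far
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:            # the phrase can still be extended
            node = child
            continue
        codes.append(node)               # emit the longest phrase found in the trie
        trie[(node, ch)] = next_node     # add that phrase plus the next character
        next_node += 1
        node = trie[(0, ch)]             # restart from the unmatched character
    if node != 0:
        codes.append(node)               # emit the final phrase
    return codes, trie


codes, trie = lzw_compress("abababbabcababcabab", "abc")
print(codes)  # [1, 2, 4, 4, 5, 2, 3, 6, 9, 11]
```

Under this numbering the ten phrases a, b, ab, ab, ba, b, c, aba, bc, abab are emitted as ten codes, and the trie ends up with nodes 4 to 12.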

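How to decompress a compressed text

Compressed text: [phrase numbers not preserved in this transcript]
[Figure: the same example decoded phrase by phrase while the dictionary trie is rebuilt, with the edge labels b (4), a (5), a (6), b (7), b (8), c (9), a (10), b (11), a (12).]
Time bounds shown on the slide: O(n) time, O(N) time.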
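Decompression rebuilds the same dictionary from the codes alone. The sketch below is an illustration under assumptions: the name `lzw_decompress` is hypothetical, characters are numbered 1, 2, 3 in alphabet order, and phrases are stored as plain strings for readability (a real decoder would keep parent pointers in the trie instead).

```python
def lzw_decompress(codes, alphabet):
    """Decode a list of LZW codes; alphabet characters are numbered from 1."""
    phrases = {i + 1: ch for i, ch in enumerate(alphabet)}
    next_code = len(alphabet) + 1
    out = []
    prev = None                          # previously decoded phrase
    for code in codes:
        if code in phrases:
            cur = phrases[code]
        elif code == next_code and prev is not None:
            # The phrase is being defined by this very step:
            # it must be the previous phrase plus its own first character.
            cur = prev + prev[0]
        else:
            raise ValueError(f"invalid LZW code {code}")
        out.append(cur)
        if prev is not None:
            # Dictionary update: previous phrase + first character of the current one.
            phrases[next_code] = prev + cur[0]
            next_code += 1
        prev = cur
    return "".join(out)


print(lzw_decompress([1, 2, 4, 4, 5, 2, 3, 6, 9, 11], "abc"))  # abababbabcababcabab
```

The decoder is always one dictionary entry behind the encoder, which is why the `code == next_code` case needs separate handling.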

Compressed Pattern Matching in LZW Compressed Text with Shift-And approach

Shift-And approach to pattern matching (Baeza-Yates and Gonnet [1992], Wu and Manber [1992])

text: aabaacaabacab
pattern: aabac
[Figure: for each text character the state bit vector is shifted and ANDed (&) with the mask bits of that character (a, b, or c). When the last bit becomes 1, the pattern was found!]
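For readers who have not seen Shift-And before, here is a minimal sketch on uncompressed text. The function name `shift_and` and the choice to report 1-based end positions are assumptions made for this illustration; bit i-1 of the state stands for "the first i pattern characters match a suffix of the text read so far".

```python
def shift_and(pattern, text):
    """Return the 1-based end positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)    # mask bits of each character
    accept = 1 << (m - 1)                        # bit of the full-pattern state

    state = 0
    ends = []
    for pos, ch in enumerate(text, start=1):
        # shift, OR in state 1, AND with the mask bits of the current character
        state = ((state << 1) | 1) & mask.get(ch, 0)
        if state & accept:
            ends.append(pos)
    return ends


print(shift_and("aabac", "aabaacaabacab"))  # [11]
```

The single occurrence of aabac covers text positions 7 to 11, so the reported end position is 11.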

Properties of the Shift-And approach

- Simple, but very fast when the pattern length m is not greater than the word length of typical computers (32 or 64).
- Assuming m ≤ 32 (or 64) and that bit-shift and bitwise logical operations on integers can be performed in constant time, it runs in O(n) time.
- The method has many variations:
  - generalized pattern matching
  - pattern matching with k mismatches
  - pattern matching for multiple patterns

Basic idea of our algorithm

text: aabaacaabacab
pattern: aabac
[Figure: the text is shown split into the phrases of the compressed text. Instead of making one transition per character, the state jumps over a whole phrase at once, using the mask bits for a, b, and c.]
Jump! Can it be done in O(1) time?

When the pattern is found, it must be reported: we need a mechanism for reporting all pattern occurrences.

Technical details

Lemma 1 (Realization of 'Jump'). The state transition function can be realized in O(|D|+m) time using O(|D|) space, and it returns its value in O(1) time.
Lemma 2 (Realization of 'Output'). The procedure that enumerates the pattern occurrences can be realized in O(|D|+m) time using O(|D|) space, and it runs in O(r) time.

|D| : size of the dictionary trie
m : pattern length
r : number of pattern occurrences

Overview of the algorithm

Input. A pattern P and an LZW compressed text u_1, u_2, …, u_n.
Output. All occurrences of the pattern.

Construct the mask bits from P.
Initialize the dictionary trie, M̂, U, and V;
l := 0; S := ∅;
for i := 1 to n do begin
    for each d ∈ Output(S, u_i) do
        report 'pattern occurs at position l+d';
    S := f̂(S, u_i);     /* Jump the state! */
    l := l + |u_i|;      /* increment the offset */
    Update the dictionary trie, M̂, U, and V;
end

Detail of our Algorithm Realization of Jump and Output

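Detail of 'Jump'

Mask bits: $M(a) := \{\, 1 \le i \le m \mid \mathrm{Pattern}[i] = a \,\}$ for $a \in \Sigma$.
State transition: $f(S, a) := ((S \oplus 1) \cup \{1\}) \cap M(a)$ for $S \subseteq \{1, \dots, m\}$ and $a \in \Sigma$, where $S \oplus k := \{\, i + k \mid i \in S \,\}$.
The three operations are a bit shift ($\oplus$), an OR ($\cup$), and an AND ($\cap$).

Example (pattern aabac): state $S = \{1, 3\}$, with $M(a) = \{1, 2, 4\}$, $M(b) = \{3\}$, $M(c) = \{5\}$.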
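As a worked instance of the transition, take the state and mask bits of the slide's example. The derivation assumes that the shift operator adds 1 to every element of S and that elements larger than m are discarded by the intersection with the mask.

```latex
% Pattern aabac, so M(a) = {1,2,4}, M(b) = {3}, M(c) = {5}; state S = {1,3}.
\begin{align*}
f(\{1,3\}, a) &= ((\{1,3\} \oplus 1) \cup \{1\}) \cap M(a)
               = (\{2,4\} \cup \{1\}) \cap \{1,2,4\}
               = \{1,2,4\}, \\
f(\{1,3\}, b) &= (\{2,4\} \cup \{1\}) \cap M(b)
               = \{1,2,4\} \cap \{3\}
               = \emptyset .
\end{align*}
```

With state i stored in bit i-1 of an integer, the first line reads `((0b00101 << 1) | 1) & 0b01011 == 0b01011`.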

Detail of 'Jump' (continued)

Extend $f$ to strings recursively: $\hat{f}(S, \varepsilon) := S$ and $\hat{f}(S, ua) := f(\hat{f}(S, u), a)$, for $S \subseteq \{1, \dots, m\}$, $u \in \Sigma^*$, and $a \in \Sigma$.
Define $\hat{M}(u) := \hat{f}(\{1, \dots, m\}, u)$.
Then $\hat{f}(S, u) = ((S \oplus |u|) \cup \{1, \dots, |u|\}) \cap \hat{M}(u)$, where $S \oplus k = \{\, i + k \mid i \in S \,\}$.
Given $\hat{M}(u)$ and $|u|$, this takes O(1) time.
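The closed form for a whole phrase can be checked against the character-by-character definition. The sketch below uses Python sets to stay close to the slide notation; all function names are hypothetical, and states larger than m are simply dropped after shifting.

```python
from itertools import product


def char_masks(pattern):
    """M(a) = set of 1-based positions i with Pattern[i] == a."""
    masks = {}
    for i, ch in enumerate(pattern, start=1):
        masks.setdefault(ch, set()).add(i)
    return masks


def f(S, a, masks, m):
    """One-character transition: shift, add state 1, intersect with M(a)."""
    shifted = {i + 1 for i in S if i + 1 <= m}
    return (shifted | {1}) & masks.get(a, set())


def f_hat(S, u, masks, m):
    """Recursive definition: apply f once per character of u."""
    for a in u:
        S = f(S, a, masks, m)
    return S


def m_hat(u, masks, m):
    return f_hat(set(range(1, m + 1)), u, masks, m)


def jump(S, u_len, m_hat_u, m):
    """Closed form: only |u| and the precomputed M^(u) are needed."""
    shifted = {i + u_len for i in S if i + u_len <= m}
    return (shifted | set(range(1, min(u_len, m) + 1))) & m_hat_u


pattern = "aabac"
m = len(pattern)
masks = char_masks(pattern)

# Compare both definitions on all phrases of length 0..3 for a few states.
for k in range(4):
    for chars in product("abc", repeat=k):
        u = "".join(chars)
        for S in (set(), {1, 2}, {1, 3}, {5}, {1, 2, 3, 4, 5}):
            assert jump(S, len(u), m_hat(u, masks, m), m) == f_hat(S, u, masks, m)

print(sorted(jump({1, 2}, 3, m_hat("bac", masks, m), m)))  # [5]
```

The last line says that after reading aa (state {1, 2}) the phrase bac completes the pattern: state 5 is reached in one jump.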

Move of $\hat{f}(S, u)$

text: aabaacaabacab
[Figure: animation of one jump. The current state is shifted by |u| positions and ANDed (&) with $\hat{M}(u)$.]

How to calculate $\hat{M}(u)$

For a node $u$ of the dictionary trie $D$ and its child $u \cdot a$:
$\hat{M}(u \cdot a) = \hat{f}(\{1, \dots, m\}, u \cdot a) = f(\hat{f}(\{1, \dots, m\}, u), a) = f(\hat{M}(u), a) = ((\hat{M}(u) \oplus 1) \cup \{1\}) \cap M(a)$.
Each new node costs O(1) time; total: O(|D|) time and space.
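The update rule means that each trie node needs only its parent's value. A small bit-vector sketch follows; the function names and the example trie are hypothetical, and state i is stored in bit i-1.

```python
def char_masks(pattern):
    """Bit i-1 of mask[a] is set iff Pattern[i] == a (1-based i)."""
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)
    return mask


def build_m_hat(trie_nodes, pattern):
    """trie_nodes[k-1] = (parent, char) of node k; node 0 is the root.

    Parents must be listed before their children.
    """
    mask = char_masks(pattern)
    m_hat = [(1 << len(pattern)) - 1]    # M^(empty string) = {1, ..., m}
    for parent, ch in trie_nodes:
        # One shift, one OR, one AND per node.
        m_hat.append(((m_hat[parent] << 1) | 1) & mask.get(ch, 0))
    return m_hat


# Hypothetical trie holding a, b, c, aa, ab, ba, aac, ca, aab, bac (nodes 1..10).
nodes = [(0, "a"), (0, "b"), (0, "c"), (1, "a"), (1, "b"),
         (2, "a"), (4, "c"), (3, "a"), (4, "b"), (6, "c")]
m_hat = build_m_hat(nodes, "aabac")
print(format(m_hat[4], "05b"))   # 00011 : M^(aa)  = {1, 2}
print(format(m_hat[10], "05b"))  # 10000 : M^(bac) = {5}
```

Bit 5 being set for bac records that bac is a suffix of the pattern, which is the test used when U is updated.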

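How to enumerate the occurrences

$\mathrm{Output}(S, u) = \{\, 1 \le j \le |u| \mid m \in \hat{f}(S, u[1..j]) \,\}$, defined on $2^{\{1, \dots, m\}} \times D$.
[Figure: S stands for the length-i prefix of the pattern for the largest i ∈ S, followed by the phrase u. In the example two pattern occurrences end at positions 2 and 11 of u, so Output(S, u) = {2, 11}.]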

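Realization of Output(S, u): two subsets U and V

$U(u) := \{\, 1 \le i \le |u| \mid i < m \text{ and } u[1..i] = \mathrm{Pattern}[m-i+1..m] \,\}$
$V(u) := \{\, 1 \le i \le |u| \mid i \ge m \text{ and } u[i-m+1..i] = \mathrm{Pattern} \,\}$
$\mathrm{Output}(S, u) = ((m \ominus S) \cap U(u)) \cup V(u)$, where $m \ominus S := \{\, m - i \mid i \in S \,\}$.
The first term depends on S; V(u) is independent of S.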
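The decomposition into U and V can be tried out directly from the definitions. This sketch recomputes both sets by string comparison, so it illustrates the formula rather than the O(1)-per-node bookkeeping; the function names are hypothetical, and the difference operator is assumed to map each state i in S to m - i.

```python
def u_set(u, pattern):
    """Prefix lengths i < m such that u[1..i] is a suffix of the pattern."""
    m = len(pattern)
    return {i for i in range(1, min(len(u), m - 1) + 1) if u[:i] == pattern[m - i:]}


def v_set(u, pattern):
    """End positions i >= m of occurrences lying entirely inside u."""
    m = len(pattern)
    return {i for i in range(m, len(u) + 1) if u[i - m:i] == pattern}


def output(S, u, pattern):
    """Output(S, u) = ((m - S) & U(u)) | V(u), as positions inside u."""
    m = len(pattern)
    return ({m - i for i in S} & u_set(u, pattern)) | v_set(u, pattern)


# After reading "aa" the state is {1, 2}; the phrase "bac" completes "aabac".
print(sorted(output({1, 2}, "bac", "aabac")))  # [3]
# Pattern "aa", state {1} (one "a" read), phrase "aaa".
print(sorted(output({1}, "aaa", "aa")))        # [1, 2, 3]
```

In the second call position 1 comes from the S-dependent term (the occurrence starts before the phrase), while positions 2 and 3 come from V.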

How to calculate U(u) and V(u)

For a node $u$ of the dictionary trie $D$ and its child $u \cdot a$:
if $m \in \hat{M}(u \cdot a)$ then $U(u \cdot a) = U(u) \cup \{|u \cdot a|\}$ else $U(u \cdot a) = U(u)$;
Each update takes O(1) time; total: O(|D|) time and space.
V(u) can be handled in the same way as in [DCC'98].

But... is it really fast? Is this really practical? Hmm...

Experimental comparisons

- Method 1: decompress the compressed text, then run Shift-And.
- Method 2: run our previous algorithm (DCC'98) on the compressed text.
- Method 3: run our new algorithm on the compressed text.

Experimental comparisons: setup

- Original text: the Brown corpus, 6.8 Mbytes
- Compressed text: 3.4 Mbytes, produced with compress (UNIX command)
- Language: C (with the gcc compiler)
- Machine: Sun SPARCstation 20 with remote disk storage
- File transfer rate: 0.96 Mbyte/sec

Experimental results

[Table: elapsed time (s) and CPU time (s) per method; the measured values are not preserved in this transcript. Elapsed time is CPU time + file I/O time.]
Methods compared: Shift-And on the uncompressed (original) text, Shift-And with decompression, our previous algorithm (DCC'98), and the new algorithm.
Slide annotations: 1.3 times faster! 1.5 times faster!

Conclusion

- The proposed algorithm scans an LZW compressed text in O(n+r) time using O(|D|) space, and reports all occurrences of the pattern after preprocessing that takes O(m+|Σ|) time and O(|Σ|) space.
- We implemented the algorithm and showed that it is approximately 1.3 times faster than our previous algorithm.
- Our new algorithm has several extensions:
  - generalized pattern matching
  - pattern matching with k mismatches
  - pattern matching for multiple patterns