A Unifying Framework for Compressed Pattern Matching Takuya Kida, Masayuki Takeda, Ayumi Shinohara, Yusuke Shibata, Setsuo Arikawa Department of Informatics,

Presentation transcript:

A Unifying Framework for Compressed Pattern Matching Takuya Kida, Masayuki Takeda, Ayumi Shinohara, Yusuke Shibata, Setsuo Arikawa Department of Informatics, Kyushu University, Japan

2 Contents: Pattern matching and compressed pattern matching; Previous results; Collage system; Proposed algorithm; Conclusion

3 Pattern Matching Problem We introduce a general framework that captures the essence of compressed pattern matching for various dictionary-based compression methods. The goal is to find all occurrences of a pattern in a compressed text without decompressing it, which is one of the most active topics in string matching. The framework covers the Lempel-Ziv family (LZ77, LZSS, LZ78, LZW), byte-pair encoding, and static dictionary-based methods. Technically, our pattern matching algorithm substantially extends the one for LZW compressed text presented by Amir, Benson and Farach. [Figure: a text and a pattern; the text is compressed.]

4 Compressed Pattern Matching [Figure: the compressed text can be decompressed into the original text and searched with an ordinary pattern matching machine; the goal is a new machine that searches the compressed text directly.]

Previous Results (1)
year, researchers: compression
1988, Eilam-Tzoreff and Vishkin: run-length
1992, Amir, Landau, and Vishkin: two-dimensional run-length
1992, Amir and Benson: two-dimensional run-length
1994, Amir, Benson, and Farach: two-dimensional run-length
1994, Manber: original compression scheme
1995, Farach and Thorup: LZ
(year missing in transcript), Amir, Benson and Farach: LZW
1996, Gasieniec, et al.: LZ
(year missing in transcript), Miyazaki, Shinohara, and Takeda: straight-line programs
1997, Karpinski, Rytter, and Shinohara: straight-line programs
1997, Takeda: finite state encoding
1998, Shibata: byte pair encoding
1998, Fukamachi, Shinohara, and Takeda: Huffman encoding
1998, Kida, et al.: LZW

Previous Results (2)
year, researchers: compression
1998, de Moura, Navarro, Ziviani, and Baeza-Yates: word based encoding
1999, Kida, Takeda, Shinohara, and Arikawa: LZW
1999, Shibata, et al.: byte pair encoding
1999, Shibata, Takeda, Shinohara, and Arikawa: antidictionaries
1999, Navarro and Raffinot: LZ family
(year missing in transcript), Kida, et al.: dictionary based methods (Collage system), today's talk
[Callout, shown twice on the slide: faster than Agrep!]

7 Motivation
Previous: Compression A → PM Algorithm A; Compression B → PM Algorithm B; Compression C → PM Algorithm C.
Ours: Compression A, Compression B, Compression C → Collage system → General pattern matching algorithm on the unifying framework.

Collage System Definition and Several Examples

9 Dictionary Based Compression
The original text is factorized into a series of phrases, and the phrases are encoded to give the compressed text together with a dictionary structure.
• How to choose the phrases.
• How to design the data structure of the dictionary.
• How to encode phrases.

10 Definition of Collage System
• A collage system is a pair $\langle D, S \rangle$.
D: a sequence of assignments (dictionary structure): $X_1 = expr_1;\ X_2 = expr_2;\ \ldots;\ X_n = expr_n$
S: a sequence of variables defined in D (compressed text): $S := X_{i_1}, X_{i_2}, \ldots, X_{i_l}$ with each $X_{i_k} \in D$
$\|D\| = n$: number of assignments in D
$|S| = l$: number of variables in S

11 Definition of Collage System
D: a sequence of assignments (dictionary structure): $X_1 = expr_1;\ X_2 = expr_2;\ \ldots;\ X_n = expr_n$, where each $expr_k$ is one of:
• $a$, for $a \in \Sigma \cup \{\varepsilon\}$ (primitive assignment)
• $X_i \cdot X_j$, for $i, j < k$ (concatenation)
• $[j]X_i$, for $i < k$ and an integer $j$ (prefix truncation)
• $X_i[j]$, for $i < k$ and an integer $j$ (suffix truncation)
• $(X_i)^j$, for $i < k$ and an integer $j$ ($j$ times repetition)

Example of Collage System
D: $X_1 = a;\ X_2 = b;\ X_3 = X_1 \cdot X_2;\ X_4 = X_2 \cdot X_1;\ X_5 = (X_3)^3;\ X_6 = [3]X_5;\ X_7 = X_6 \cdot X_4$
S: $X_3, X_6, X_4, X_7$
Original text: abbabbababba
Phrases: $X_3$ = ab, $X_4$ = ba, $X_5$ = ababab (3 times repetition), $X_6$ = bab (prefix truncation), $X_7$ = babba
[Figure: the syntax tree $T(X_7)$.] $\mathrm{height}(X_7) = 4$, $\mathrm{height}(D) = 4$
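The example above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the tuple encoding of assignments (`"prim"`, `"cat"`, `"rep"`, `"ptrunc"`, `"strunc"`) and the function names are hypothetical, and the height convention (primitive assignments have height 0) is an assumption chosen because it reproduces height(X7) = 4 from the slide.

```python
# Hypothetical encoding of the dictionary D from the slide.
# Tags and layout are illustrative assumptions, not part of the original talk.
D = {
    1: ("prim", "a"),
    2: ("prim", "b"),
    3: ("cat", 1, 2),      # X3 = X1 . X2
    4: ("cat", 2, 1),      # X4 = X2 . X1
    5: ("rep", 3, 3),      # X5 = (X3)^3
    6: ("ptrunc", 5, 3),   # X6 = [3]X5: drop the first 3 characters
    7: ("cat", 6, 4),      # X7 = X6 . X4
}
S = [3, 6, 4, 7]           # compressed text: X3, X6, X4, X7


def expand(k):
    """Return the string that variable X_k stands for."""
    op = D[k]
    if op[0] == "prim":
        return op[1]
    if op[0] == "cat":
        return expand(op[1]) + expand(op[2])
    if op[0] == "rep":
        return expand(op[1]) * op[2]
    if op[0] == "ptrunc":
        return expand(op[1])[op[2]:]
    if op[0] == "strunc":                 # X_i[j]: drop the last j characters
        s = expand(op[1])
        return s[:len(s) - op[2]]
    raise ValueError(f"unknown assignment {op!r}")


def height(k):
    """Height of the syntax tree of X_k (assumed: primitives have height 0)."""
    op = D[k]
    if op[0] == "prim":
        return 0
    if op[0] == "cat":
        return 1 + max(height(op[1]), height(op[2]))
    return 1 + height(op[1])              # repetition or truncation


text = "".join(expand(k) for k in S)
print(text, height(7))
```

Expanding every variable like this is exactly what compressed pattern matching tries to avoid; the sketch only confirms that the dictionary and the sequence S describe the text on the slide.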

13 Example of Collage System: Byte Pair Encoding (BPE)
Original text: a b a b c b a b c c a b c a c b
After ab → D: D D c b D c c D c a c b
After Dc → E: D E b E c E a c b
D: $X_1 = a;\ X_2 = b;\ X_3 = c;\ X_4 = X_1 \cdot X_2;\ X_5 = X_4 \cdot X_3$
S: $X_4, X_5, X_2, X_5, X_3, X_5, X_1, X_3, X_2$
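The two substitution steps on this slide can be replayed with plain string replacement. This is a sketch under assumptions: the rule list is taken from the slide as given (a real BPE compressor would choose the pairs itself, typically by frequency), the helper names are hypothetical, and the symbols D and E are assumed not to occur in the original text.

```python
def bpe_replace(text, rules):
    """Apply each (pair, symbol) rule in order, left to right."""
    for pair, symbol in rules:
        text = text.replace(pair, symbol)
    return text


def bpe_expand(code, rules):
    """Undo the rules in reverse order to recover the original text."""
    for pair, symbol in reversed(rules):
        code = code.replace(symbol, pair)
    return code


rules = [("ab", "D"), ("Dc", "E")]        # rules from the slide
original = "ababcbabccabcacb"
compressed = bpe_replace(original, rules)

# Map each symbol to its variable index in D: a, b, c, D, E -> X1..X5.
index = {"a": 1, "b": 2, "c": 3, "D": 4, "E": 5}
S = [index[ch] for ch in compressed]
print(compressed, S)
```

Because every rule is a concatenation of two earlier variables, the resulting collage system contains no truncation, which is the case with the better time bound.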

14 Example of Collage System (LZSS [gzip])
D: $X_1 = a_1;\ X_2 = a_2;\ \ldots;\ X_q = a_q;$
$X_{q+1} = \bigl(([i_1]X_{l(1)} \cdot X_{l(1)+1} \cdots X_{r(1)})^{m_1}\bigr)[j_1] \cdot b_1;$
$X_{q+2} = \bigl(([i_2]X_{l(2)} \cdot X_{l(2)+1} \cdots X_{r(2)})^{m_2}\bigr)[j_2] \cdot b_2;$
$\ldots$
$X_{q+n} = \bigl(([i_n]X_{l(n)} \cdot X_{l(n)+1} \cdots X_{r(n)})^{m_n}\bigr)[j_n] \cdot b_n;$
S: $X_{q+1}, X_{q+2}, \ldots, X_{q+n}$

15 What is ‘Collage’? This is college!

16 Collage is... an artistic composition technique. 1. Cut or tear up materials. 2. Paste the pieces over a surface.

Our Algorithm Pattern Matching Algorithm on a Collage System

Compressed pattern matching on a collage system
The problem of compressed pattern matching can be solved in $O((\|D\| + |S|) \cdot \mathrm{height}(D) + m^2 + r)$ time using $O(\|D\| + m^2)$ space. If D contains no truncation, it can be solved in $O(\|D\| + |S| + m^2 + r)$ time.
m: pattern length
r: number of pattern occurrences
$\|D\|$: number of assignments in D
$|S|$: number of variables in S
Callout: O(compressed text length + $m^2$ + r)

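19 Basic Idea
Pattern $\pi$ = ababb; original text: abababba; compressed text S: $X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4}$
[Figure: the pattern matching automaton for $\pi$ with its goto and failure functions, and the state reached after each variable of S.]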

20 Jump and Output
The function $\mathrm{Jump}(j, u) = \delta_{\mathrm{KMP}}(j, u)$. It simulates the sequence of state transitions for u. The domain is $Q \times D$. Reply in O(1) time.
The set $\mathrm{Output}(j, u) = \{\, 1 \le i \le |u| \mid P \text{ is a suffix of } P[1:j] \cdot u[1:i] \,\}$. This set contains the pattern occurrences. Reply in O(l) time, where l is the size of the set Output.
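To make the two definitions concrete, here is a brute-force sketch that evaluates them literally on an explicit string u. It is not the O(1) or O(l) realization from the talk: the function names are hypothetical, states are assumed to be the number of matched pattern characters (0 to m), and Jump is computed through the standard characterization of the KMP state as the longest prefix of P that is a suffix of the text read so far.

```python
def jump(P, j, u):
    """State reached from state j after reading u.

    Computed as the length of the longest prefix of P that is a
    suffix of P[:j] + u (brute force, for illustration only).
    """
    w = P[:j] + u
    for k in range(min(len(P), len(w)), -1, -1):
        if w.endswith(P[:k]):
            return k
    return 0


def output(P, j, u):
    """All i in 1..|u| such that P is a suffix of P[:j] + u[:i]."""
    return [i for i in range(1, len(u) + 1)
            if (P[:j] + u[:i]).endswith(P)]


P = "ababb"
print(jump(P, 0, "abababba"), output(P, 0, "abababba"))
print(jump(P, 3, "ba"), output(P, 4, "b"))
```

Each call here costs time polynomial in |u| and m; the point of the talk is to answer the same queries from the dictionary D without materializing u.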

21 Realization of Jump
For Jump(q, $X_k$), if $X_k$ is:
• $a$ or $X_i \cdot X_j$: O(1) time. (If the factor concatenation problem for a string of length m can be solved in O(1) time, the concatenation case can be solved in O(1) time.)
• $[j]X_i$ or $X_i[j]$: $O(\mathrm{height}(X_i))$ time
• $(X_i)^j$: O(1) time

22 Factor Concatenation Problem
Instance: Two factors x and y of a string P, each represented as a node of the suffix trie of P.
Question: Is the string xy a factor of P? If 'yes', return its node number.
Example: P = COPACABANA, x = OPA, y = CABAN. Concatenating gives OPACABAN = P[2:9], so the answer is 'Yes'.
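The example can be verified with a naive scan. This sketch is only an illustration of the question being asked: it takes x and y as plain strings and returns a 1-based position interval instead of a suffix trie node, it runs in time linear (or worse) in m rather than O(1), and the function name is hypothetical.

```python
def concat_factor(P, x, y):
    """Return the 1-based (start, end) of the first occurrence of x+y in P,
    or None if x+y is not a factor of P."""
    pos = P.find(x + y)
    if pos < 0:
        return None
    return (pos + 1, pos + len(x) + len(y))


print(concat_factor("COPACABANA", "OPA", "CABAN"))   # OPACABAN = P[2:9]
print(concat_factor("COPACABANA", "ANA", "C"))       # ANAC is not a factor
```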

23 Solution to the problem
• Using a suffix trie, it can be solved in O(m) time after preprocessing of $O(m^2)$ time and space.
• Using a two-dimensional lookup table, it can be solved in O(1) time, but we need $O(m^4)$ time and space for preprocessing.
• It can be solved in O(1) time after $O(m^2)$ space and time preprocessing.

24 Realization of Output
For Output(q, $X_k$), if $X_k$ is:
• $a$ or $X_i \cdot X_j$: O(1) time. (The set can be enumerated in O(l) time from the Output sets of $X_i$ and $X_j$.)
• $[j]X_i$ or $X_i[j]$: $O(l \cdot \mathrm{height}(X_i))$ time
• $(X_i)^j$: O(l) time
l: size of the set Output

Outline of Our Algorithm
Input. A pattern P and a collage system $\langle D, S \rangle$ ($S := X_{i_1}, X_{i_2}, \ldots, X_{i_n}$)
Output. All occurrences of the pattern.
    /* preprocess for D and P */
    preprocess(D); preprocess(P);
    l := 0; q := 0;
    for j := 1 to n do begin
        for each d ∈ Output(q, X_{i_j}) do
            report 'pattern occurs at position l+d';
        q := Jump(q, X_{i_j});    /* state transition */
        l := l + |X_{i_j}|;       /* calculate the offset */
    end
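The control flow of the outline can be sketched in a few lines. This version is deliberately naive and is not the algorithm of the talk: it receives each phrase already expanded as a string and feeds it character by character through a KMP automaton, so Jump and Output cost time proportional to the phrase length instead of being answered from the dictionary. The function names and the factorization of the text into phrases are hypothetical, and the pattern is assumed to be non-empty. As in the outline, a reported position l+d is the text position where an occurrence ends.

```python
def failure_table(P):
    """fail[q] = length of the longest proper border of P[:q]."""
    fail = [0] * (len(P) + 1)
    k = 0
    for i in range(1, len(P)):
        while k and P[i] != P[k]:
            k = fail[k]
        if P[i] == P[k]:
            k += 1
        fail[i + 1] = k
    return fail


def step(P, fail, q, c):
    """One KMP state transition from state q on character c."""
    if q == len(P):            # after a full match, fall back first
        q = fail[q]
    while q and P[q] != c:
        q = fail[q]
    return q + 1 if P[q] == c else 0


def search(P, phrases):
    """Report end positions (1-based) of P in the concatenation of phrases."""
    fail = failure_table(P)
    hits, l, q = [], 0, 0
    for u in phrases:                      # u plays the role of X_{i_j}
        for d, c in enumerate(u, start=1):
            q = step(P, fail, q, c)
            if q == len(P):                # d belongs to Output(q, u)
                hits.append(l + d)
        # q now equals Jump(old q, u)
        l += len(u)                        # calculate the offset
    return hits


# Pattern and text from the Basic Idea slide; the phrase split is made up.
print(search("ababb", ["ab", "ab", "abb", "a"]))
# Text of the collage system example, split as X3, X6, X4, X7.
print(search("abba", ["ab", "bab", "ba", "babba"]))
```

Replacing the inner loop by constant-time Jump and output-sensitive Output queries on the variables of D is what yields the bounds stated for the real algorithm.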

Concluding Remarks Conclusion and Future Works

27 Our Results
1998 Kida, et al. (LZW): $O(n + m^2 + r)$ time, $O(n + m^2)$ space
Complexity of our algorithm: $O((\|D\| + |S|) \cdot \mathrm{height}(D) + m^2 + r)$ time, $O(\|D\| + m^2)$ space
If D contains no truncation: $O(\|D\| + |S| + m^2 + r)$ time
No truncation: LZ78, LZW, BPE, run-length, etc.
Truncation: LZ77, LZSS, etc.

28 Conclusion
• We introduced a general framework for compressed pattern matching (the collage system).
• We proposed a compressed pattern matching algorithm on a collage system and showed its complexity: $O((\|D\| + |S|) \cdot \mathrm{height}(D) + m^2 + r)$ time and $O(\|D\| + m^2)$ space; if there is no truncation, $O(\|D\| + |S| + m^2 + r)$ time.

29 Future Works
• Can we reduce the complexity of the preprocessing from $O(m^2)$ to $O(m)$?
• Extend our algorithm to deal with multiple patterns.
• Develop an approximate pattern matching algorithm on a collage system.
• Develop a new compression method that is suitable for compressed pattern matching.