Speeding up pattern matching by text compression Department of Informatics, Kyushu University, Japan Department of AI, Kyushu Institute of Technology,

Presentation transcript:

Speeding up pattern matching by text compression Department of Informatics, Kyushu University, Japan Department of AI, Kyushu Institute of Technology, Japan Yusuke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, Setsuo Arikawa

Contents Pattern matching on compressed text. A unifying framework for compressed pattern matching (Collage System) Byte pair encoding (BPE). Pattern matching algorithm on BPE compressed text. Experimental result. Conclusion.

Pattern Matching Problem. Pattern matching is one of the most fundamental operations in string processing. Recently, a new trend for accelerating pattern matching has emerged: speeding up pattern matching by text compression. By the traditional criteria for data compression, i.e., compression ratio and compression/decompression time, adaptive dictionary methods such as the Lempel-Ziv family are often preferred. However, such methods cannot speed up pattern matching, since extra work is needed to keep track of the compression mechanism. Classical algorithms for finding a pattern in a text: Knuth-Morris-Pratt (1974), Aho-Corasick (1975), Boyer-Moore (1977), Shift-Or (1992).

Pattern Matching on Compressed Text. [Figure: two search paths. Original text: secondary disk storage, file transfer, memory, search. Compressed text: secondary disk storage, file transfer, memory, expand, search.] Expanding the compressed text before searching requires extra time and space.

Pattern Matching on Compressed Text. [Figure: the compressed text is transferred from secondary disk storage to memory and searched directly, without expansion.]
GOAL 1: To perform a faster search in compressed texts in comparison with a regular decompression followed by an ordinary search.
GOAL 2: To perform a faster search in compressed texts in comparison with an ordinary search in the original texts, i.e., speeding up pattern matching by text compression.

Previous Results (1)
year  researcher                          compression
1988  Eilam-Tzoreff and Vishkin           run-length
1992  Amir, Landau, and Vishkin           two-dimensional run-length
1992  Amir and Benson                     two-dimensional run-length
1994  Amir, Benson, and Farach            two-dimensional run-length
1994  Manber                              original compression scheme
1995  Farach and Thorup                   LZ
      Amir, Benson, and Farach            LZW
1996  Gasieniec, et al.                   LZ
1997  Karpinski, Rytter, and Shinohara    straight-line programs
      Miyazaki, Shinohara, and Takeda     straight-line programs
1997  Takeda                              finite state encoding
1998  Shibata                             byte pair encoding
1998  Fukamachi, Shinohara, and Takeda    Huffman encoding
1998  Kida, et al.                        LZW

Previous Results (2)
year  researcher                                   compression
1998  de Moura, Navarro, Ziviani, and Baeza-Yates  word based encoding
1999  Shibata, Takeda, Shinohara, and Arikawa      antidictionary based
1999  Kida, Takeda, Shinohara, and Arikawa         LZW
1999  Navarro and Raffinot                         LZ family
      Kida, et al.                                 dictionary based methods (collage system): the unifying framework
2000  Shibata, et al.                              byte pair encoding (today's talk)

A Unifying Framework for Compressed Pattern Matching. Previous: each compression method had its own pattern matching (PM) algorithm: Compression A with PM Algorithm A, Compression B with PM Algorithm B, Compression C with PM Algorithm C. Kida et al. [1999]: Compressions A, B, and C are all described as a collage system, and one pattern matching algorithm works on this unifying framework.

Collage System Definition and Several Examples

Dictionary Based Compression. The original text is factorized into a series of phrases, which are then encoded; the result is a compressed text together with a dictionary structure. A method is characterized by three choices: how to choose the phrases, how to design the data structure of the dictionary, and how to encode phrases.

Collage System. A collage system is a pair $\langle D, S \rangle$.
D: a sequence of assignments (dictionary structure): $X_1 := expr_1;\ X_2 := expr_2;\ \ldots;\ X_n := expr_n$.
S: a sequence of variables defined in D (compressed text): $S = X_{i_1}, X_{i_2}, \ldots, X_{i_l}$ with each $X_{i_j} \in D$.
$\|D\| = n$: number of assignments in D. $|S| = l$: number of variables in S.

Collage System. In the sequence of assignments D (dictionary structure) $X_1 = expr_1;\ X_2 = expr_2;\ \ldots;\ X_n = expr_n$, each $expr_k$ is one of:
$a$, for $a \in \Sigma \cup \{\varepsilon\}$ (primitive assignment)
$X_i \cdot X_j$, for $i, j < k$ (concatenation)
$(X_i)^j$, for $i < k$ and integer $j$ ($j$ times repetition)
$^{[j]}X_i$, for $i < k$ and integer $j$ (prefix truncation)
$X_i^{[j]}$, for $i < k$ and integer $j$ (suffix truncation)

Example of Collage System.
D: $X_1 = a;\ X_2 = b;\ X_3 = X_1 \cdot X_2;\ X_4 = X_2 \cdot X_1;\ X_5 = (X_3)^3;\ X_6 = {}^{[3]}X_5;\ X_7 = X_6 \cdot X_4$.
S: $X_3, X_6, X_4, X_7$, which represents the text abbabbababba.
The variables derive: $X_3$ = ab, $X_4$ = ba, $X_5$ = ababab (3 times repetition), $X_6$ = bab (prefix truncation), $X_7$ = babba.
[Figure: derivation tree $T(X_7)$.] $height(X_7) = 4$ and $height(D) = 4$.
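
The example above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the tuple encoding of assignments and the names `expand`, `decode`, `D_EXAMPLE` and `S_EXAMPLE` are assumptions made for illustration, and prefix truncation is assumed to remove the first j characters.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding of the five kinds of assignment:
#   ("char", a)       primitive assignment
#   ("cat", i, j)     concatenation X_i . X_j
#   ("rep", i, j)     X_i repeated j times
#   ("ptrunc", i, j)  prefix truncation (assumed: drop the first j characters of X_i)
#   ("strunc", i, j)  suffix truncation (assumed: drop the last j characters of X_i)
Assignment = Tuple


def expand(D: Dict[int, Assignment], k: int,
           memo: Optional[Dict[int, str]] = None) -> str:
    """Return the string derived by variable X_k."""
    if memo is None:
        memo = {}
    if k not in memo:
        rule = D[k]
        op = rule[0]
        if op == "char":
            memo[k] = rule[1]
        elif op == "cat":
            memo[k] = expand(D, rule[1], memo) + expand(D, rule[2], memo)
        elif op == "rep":
            memo[k] = expand(D, rule[1], memo) * rule[2]
        elif op == "ptrunc":
            memo[k] = expand(D, rule[1], memo)[rule[2]:]
        elif op == "strunc":
            s = expand(D, rule[1], memo)
            memo[k] = s[:len(s) - rule[2]]
        else:
            raise ValueError(f"unknown operation: {op}")
    return memo[k]


def decode(D: Dict[int, Assignment], S: List[int]) -> str:
    """Expand the variable sequence S into the original text."""
    memo: Dict[int, str] = {}
    return "".join(expand(D, k, memo) for k in S)


# The dictionary and sequence from the slide.
D_EXAMPLE = {
    1: ("char", "a"),
    2: ("char", "b"),
    3: ("cat", 1, 2),
    4: ("cat", 2, 1),
    5: ("rep", 3, 3),
    6: ("ptrunc", 5, 3),
    7: ("cat", 6, 4),
}
S_EXAMPLE = [3, 6, 4, 7]

print(decode(D_EXAMPLE, S_EXAMPLE))
```

Expanding every variable like this takes time proportional to the text length; the point of the algorithm on the following slides is to avoid that expansion.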

Pattern Matching Algorithm on a Collage System

Compressed pattern matching on a collage system. Notation: $m$ is the pattern length, $r$ is the number of pattern occurrences, $\|D\|$ is the number of assignments in D, and $|S|$ is the number of variables in S.
Theorem [Kida et al. 1999]. The problem of compressed pattern matching can be solved in $O((\|D\| + |S|) \cdot height(D) + m^2 + r)$ time using $O(\|D\| + m^2)$ space. If D contains no truncation, it can be solved in $O(\|D\| + |S| + m^2 + r)$ time.

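Basic Idea. Pattern $\pi$ = ababb. [Figure: KMP automaton for the pattern, with the goto function reading a, b, a, b, b from state 0 and the failure function drawn as back edges. The original text abababba is shown above the compressed sequence S: $X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4}$, which covers the same text.]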

Jump and Output.
The function $Jump(j, u) = \delta_{KMP}(j, u)$. Its domain is $Q \times D$. It simulates the sequence of state transitions for u. Reply in $O(1)$ time.
The set $Output(j, u) = \{\, 1 \le i \le |u| \mid P \text{ is a suffix of } P[1:j] \cdot u[1:i] \,\}$. This set contains the pattern occurrences. Reply in $O(\ell)$ time, where $\ell$ is the size of the set.
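
To make the two definitions concrete, here is a naive sketch that evaluates Jump and Output by running the KMP automaton over an explicit string u. It takes O(|u|) time per call rather than the constant-time replies stated on the slide, and the function names `build_delta`, `jump` and `output` are assumptions for illustration.

```python
from typing import Dict, List


def build_delta(P: str, alphabet: str) -> List[Dict[str, int]]:
    """Full KMP transition table: delta[q][c] is the state reached from q on c.

    State q means "the last q characters read equal P[:q]".
    Built naively here; this is for illustration, not efficiency.
    """
    m = len(P)
    delta = []
    for q in range(m + 1):
        row = {}
        for c in alphabet:
            s = P[:q] + c
            k = min(m, len(s))
            # Longest prefix of P that is a suffix of s (k = 0 always succeeds).
            while not s.endswith(P[:k]):
                k -= 1
            row[c] = k
        delta.append(row)
    return delta


def jump(delta: List[Dict[str, int]], q: int, u: str) -> int:
    """Jump(q, u): state reached after reading u from state q."""
    for c in u:
        q = delta[q][c]
    return q


def output(delta: List[Dict[str, int]], q: int, u: str) -> List[int]:
    """Output(q, u): 1-based positions i in u at which an occurrence of P ends."""
    m = len(delta) - 1
    hits = []
    for i, c in enumerate(u, start=1):
        q = delta[q][c]
        if q == m:
            hits.append(i)
    return hits


delta = build_delta("ababb", "ab")
# Reading "abab" and then "abba" corresponds to the text abababba of the slide.
state_after_first = jump(delta, 0, "abab")
print(state_after_first, output(delta, state_after_first, "abba"))
```

In this run the pattern ababb ends at position 3 of the second piece, i.e. at position 7 of the whole text.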

Realization of Jump and Output.
For $Jump(q, X_k)$: if $X_k$ is a primitive assignment $a$, it is computed in $O(1)$ time. If $X_k$ is a concatenation $X_i \cdot X_j$, it is computed in $O(1)$ time, provided that the factor concatenation problem for a string of length m can be solved in $O(1)$ time.
For $Output(q, X_k)$: if $X_k$ is $a$ or $X_i \cdot X_j$, the set can be enumerated in $O(\ell)$ time from the Output sets of $X_i$ and $X_j$, where $\ell$ is the size of the set Output.

Factor Concatenation Problem. Instance: two factors x and y of a string P, each represented as a node of the suffix trie of P. Question: is the string xy a factor of P? If 'yes', then return its node number. Example: P = COPACABANA, x = OPA, y = CABAN. Concatenating them gives OPACABAN = P[2:9], so the answer is 'yes'.

Solution to the problem. Using a suffix trie, it can be solved in $O(m)$ time after $O(m^2)$ time and space preprocessing. Using a two-dimensional lookup table, it can be solved in $O(1)$ time, but the preprocessing needs $O(m^4)$ time and space. Our result: it can be solved in $O(1)$ time after $O(m^2)$ time and space preprocessing.
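
As an illustration of the two-dimensional lookup table mentioned above (the expensive variant, not the $O(m^2)$ solution), the sketch below numbers every distinct factor of P and tabulates which pairs concatenate to another factor. The integer ids stand in for suffix trie node numbers; the names `factor_ids` and `build_concat_table` are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple


def factor_ids(P: str) -> Dict[str, int]:
    """Assign an id to every distinct factor (substring) of P, including the empty one."""
    ids: Dict[str, int] = {"": 0}
    for i in range(len(P)):
        for j in range(i + 1, len(P) + 1):
            ids.setdefault(P[i:j], len(ids))
    return ids


def build_concat_table(P: str) -> Tuple[Dict[str, int], Dict[Tuple[int, int], int]]:
    """Tabulate (id of x, id of y) -> id of xy for every pair whose concatenation is a factor."""
    ids = factor_ids(P)
    table: Dict[Tuple[int, int], int] = {}
    for x, ix in ids.items():
        for y, iy in ids.items():
            xy = x + y
            if xy in ids:
                table[(ix, iy)] = ids[xy]
    return ids, table


def concat(ids: Dict[str, int], table: Dict[Tuple[int, int], int],
           x: str, y: str) -> Optional[int]:
    """Constant-time query (after preprocessing): id of xy, or None if xy is not a factor."""
    return table.get((ids[x], ids[y]))


ids, table = build_concat_table("COPACABANA")
print(concat(ids, table, "OPA", "CABAN") == ids["OPACABAN"])
```

The table has one entry per pair of factors, which is why this variant is quadratic in the number of factors; the slide's final result avoids that cost.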

Outline of Our Algorithm.
Input. Pattern P and collage system $\langle D, S \rangle$ ($S = X_{i_1}, X_{i_2}, \ldots, X_{i_n}$).
Output. All occurrences of the pattern.
    /* preprocessing of D and P */
    preprocess(D); preprocess(P);
    l := 0; q := 0;
    for j := 1 to n do begin
        for each d ∈ Output(q, X_{i_j}) do
            report 'pattern occurs at position l + d';
        q := Jump(q, X_{i_j});    /* state transition */
        l := l + |X_{i_j}|;       /* calculation of the offset */
    end
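
The loop above can be sketched in Python for the truncation-free case with only primitive and concatenation assignments. This is a simplified illustration under stated assumptions, not the algorithm of Kida et al.: it stores a full Jump row and Output lists for every variable, which costs far more space than the $O(\|D\| + m^2)$ bound, and the names `kmp_delta`, `preprocess` and `search` are hypothetical. Reported positions are 1-based end positions of occurrences, matching the definition of Output.

```python
from typing import Dict, List, Tuple, Union

# A rule is a single character, or a pair (i, j) meaning X_i . X_j.
Rule = Union[str, Tuple[int, int]]


def kmp_delta(P: str, alphabet: str) -> List[Dict[str, int]]:
    """Naively built KMP transition table: delta[q][c] -> next state."""
    delta = []
    for q in range(len(P) + 1):
        row = {}
        for c in alphabet:
            s = P[:q] + c
            k = min(len(P), len(s))
            while not s.endswith(P[:k]):
                k -= 1
            row[c] = k
        delta.append(row)
    return delta


def preprocess(D: Dict[int, Rule], P: str, alphabet: str):
    """Compute |X_k|, Jump(q, X_k) and Output(q, X_k) for every variable and state."""
    m = len(P)
    delta = kmp_delta(P, alphabet)
    length: Dict[int, int] = {}
    jump: Dict[int, List[int]] = {}
    out: Dict[int, List[List[int]]] = {}
    for k in sorted(D):  # rules only refer to smaller indices
        rule = D[k]
        if isinstance(rule, str):
            length[k] = 1
            jump[k] = [delta[q][rule] for q in range(m + 1)]
            out[k] = [[1] if jump[k][q] == m else [] for q in range(m + 1)]
        else:
            i, j = rule
            length[k] = length[i] + length[j]
            # Jump over X_i . X_j is Jump over X_j started where X_i ended.
            jump[k] = [jump[j][jump[i][q]] for q in range(m + 1)]
            out[k] = [out[i][q] + [length[i] + d for d in out[j][jump[i][q]]]
                      for q in range(m + 1)]
    return length, jump, out


def search(D: Dict[int, Rule], S: List[int], P: str, alphabet: str) -> List[int]:
    """Scan S variable by variable, as in the outline; never expands the text."""
    length, jump, out = preprocess(D, P, alphabet)
    l, q, hits = 0, 0, []
    for x in S:
        hits.extend(l + d for d in out[x][q])
        q = jump[x][q]   # state transition
        l += length[x]   # offset of the next variable
    return hits


# Dictionary and sequence of the BPE example (text ABABCDEBDEFABDEABC).
D_BPE = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F",
         7: (1, 2), 8: (4, 5), 9: (7, 3)}
S_BPE = [7, 9, 8, 2, 8, 6, 7, 8, 9]
print(search(D_BPE, S_BPE, "BDE", "ABCDEF"))
```

The main loop does one table lookup per variable of S, which is where the speed-up proportional to the compression ratio comes from.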

Compressed pattern matching on a collage system.
No truncation (LZ78, LZW, BPE, run-length, etc.): $O(\|D\| + |S| + m^2 + r)$ time.
Truncation (LZ77, LZSS, etc.): $O((\|D\| + |S|) \cdot height(D) + m^2 + r)$ time; not suitable for speeding up pattern matching.

Byte Pair Encoding original encoding algorithm and modified algorithm

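Byte Pair Encoding. Text: T = ABABCDEBDEFABDEABC. Used characters: A, B, C, D, E, F. At each step the most frequent pair of adjacent characters is replaced by an unused character code.
Step 1, AB → G: GGCDEBDEFGDEGC
Step 2, DE → H: GGCHBHFGHGC
Step 3, GC → I: GIHBHFGHI
Pair table (code: pair): G: AB, H: DE, I: GC.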

Byte Pair Encoding as a collage system. Text: T = ABABCDEBDEFABDEABC, compressed to GIHBHFGHI with the pair table AB → G, DE → H, GC → I.
D: $X_1 = A;\ X_2 = B;\ X_3 = C;\ X_4 = D;\ X_5 = E;\ X_6 = F;\ X_7 = X_1 \cdot X_2;\ X_8 = X_4 \cdot X_5;\ X_9 = X_7 \cdot X_3$.
S: $X_7, X_9, X_8, X_2, X_8, X_6, X_7, X_8, X_9$.
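
The encoding steps of the example can be reproduced with a short sketch of the basic (unoptimized) BPE loop. The function names `bpe_compress` and `bpe_expand`, the `fresh` parameter listing candidate codes, and the stopping rule (stop when no pair occurs at least twice) are assumptions for illustration; ties between equally frequent pairs are broken arbitrarily.

```python
from collections import Counter
from typing import List, Tuple


def bpe_compress(text: str, fresh: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Repeatedly replace the most frequent adjacent pair by an unused code.

    `fresh` lists candidate code characters; those already in the text are skipped.
    Returns the compressed text and the pair table as (code, pair) entries.
    """
    unused = [c for c in fresh if c not in text]
    table: List[Tuple[str, str]] = []
    while unused and len(text) >= 2:
        # Counts adjacent pairs; overlapping occurrences (as in "AAA") are counted too.
        pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # assumed stopping rule: replacing a unique pair saves nothing
        code = unused.pop(0)
        text = text.replace(pair, code)  # left-to-right, non-overlapping
        table.append((code, pair))
    return text, table


def bpe_expand(text: str, table: List[Tuple[str, str]]) -> str:
    """Undo the replacements in reverse order."""
    for code, pair in reversed(table):
        text = text.replace(code, pair)
    return text


compressed, pair_table = bpe_compress("ABABCDEBDEFABDEABC",
                                      "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
print(compressed, pair_table)
```

Each iteration rescans the whole text, so the running time grows with the number of codes times the text length; the next slides describe how the authors avoid this.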

Speeding up of compression. Time complexity of BPE: $O(uN)$, where u is the number of character codes and N is the text length. Using a doubly-linked list: $O(u + N)$ time.

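Speed-up of compression. We apply the BPE algorithm to the first block of the original text, which yields the dictionary D: $X_1 = A,\ X_2 = C,\ X_3 = X_2 \cdot X_1,\ \ldots,\ X_{255} = X_{247} \cdot X_8,\ X_{256} = X_{125} \cdot X_{48}$. The BPE compressed text is then produced from the original text by a pattern matching machine for multiple replacement [Arikawa et al. 1984] built from D.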

Comparison of Compression Ratio and Time. [Table: compression ratio (%) and compression time (sec) for BPE (original and modified), Compress, and Gzip on Brown corpus (6.8 Mb), Medline (60.3 Mb), and Genbank (17.1 Mb).] The compression ratios of BPE are worse than those of Compress and Gzip. The compression time of BPE is drastically accelerated by our modification.

Compressed pattern matching on BPE compressed text. The problem of compressed pattern matching on BPE compressed text can be solved in $O(\|D\| + |S| + m^2 + r)$ time. For BPE, $\|D\| \le 256$. The dictionary D is encoded separately from the sequence S. The size of D is small enough. The variables of S are encoded using a fixed length code.

Experimental result. [Figure: comparison of KMP, Agrep, and our algorithm on Medline data (compression ratio is 59%), a clinically-oriented subset of Medline, and on Genbank data (compression ratio is 32%), a data set from GenBank. Machine: Ultra...]

Concluding Remarks Conclusion and Future Works

Conclusion. We introduced compressed pattern matching from practical viewpoints. We observed that the search time of our algorithm is reduced at the same rate as the compression ratio, compared with the uncompressed case. We also observed that it is occasionally faster than Agrep.

Future Works. Can we reduce the complexity of the preprocessing from $O(m^2)$ to $O(m)$? To develop a sublinear algorithm on BPE compressed texts. To develop an approximate pattern matching algorithm on a collage system. To develop a new compression method which is suitable for compressed pattern matching. More recent work.

A Boyer-Moore type algorithm for compressed pattern matching [CPM2000]. We proposed a Boyer-Moore (BM) type algorithm for pattern matching in BPE compressed texts. Does text compression speed up such a sublinear time algorithm?

More recent work. [Figure: comparison of KMP, Agrep, our algorithm, and the most recent work on Medline data (compression ratio is 59%) and Genbank data (compression ratio is 32%).]