Multiple Pattern Matching in LZW Compressed Text. Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Masamichi MIYAZAKI, Setsuo ARIKAWA. Department of Informatics, Kyushu University, Japan.


Multiple Pattern Matching in LZW Compressed Text
Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Masamichi MIYAZAKI, Setsuo ARIKAWA
Department of Informatics, Kyushu University, Japan

Our Goal
(Figure) Search the compressed text directly with a new pattern matching machine, instead of first restoring the original text from the compressed text and feeding it to an ordinary pattern matching machine.

Previous studies (researchers and compression method)
– Eilam-Tsoreff and Vishkin: run-length
– Amir, Landau, and Vishkin: run-length
– Amir and Benson: two-dimensional run-length
– Farach and Thorup: LZ77
– Gasieniec et al.: LZ77
– Amir, Benson, and Farach: LZW
– Karpinski et al.: straight-line programs
– Miyazaki et al.: straight-line programs

Previous result vs. our result
Amir, Benson, and Farach's algorithm (JCSS 1996), "Let sleeping files lie: Pattern matching in Z-compressed files":
– deals with only a single pattern.
– can find only the first occurrence of the pattern.
– takes O(n + m²) time and space, where n is the length of the compressed text and m is the length of the pattern.
Our algorithm:
– deals with multiple patterns.
– can find all occurrences of the patterns.
– takes O(n + m² + r) time and O(n + m²) space, where m is the total length of the patterns and r is the number of pattern occurrences.

Lempel-Ziv-Welch compression a b ab ab ba b c aba bc abab Dictionary trie : D Σ= {a,b,c} b a b c a a a a b b b c original text compressed text O( |D| ) = O( n )

Basic Idea (Amir et al.)
(Figure) Pattern: abab. Build the KMP automaton for the pattern (solid edges: goto function, dashed edges: failure function, { }: output, here {abab} at the final state) and run it over the original text; each time the final state is reached, an occurrence is found.

Basic Idea (Amir et al.), continued
(Figure) Pattern: abab. Rather than reading the original text character by character, the KMP automaton jumps over a whole dictionary phrase at a time: for a state q and a dictionary string u (a, b, c, ab, ba, bc, ca, aba, bab, bca, abab, ...), Next(q, u) is the state reached from q after reading u. For example, Next(0, bab) = 2.

Basic Idea (Amir et al.), continued
(Figure) But who is watching the occurrences of the pattern inside a phrase?! The function Output(q, u) reports them: reading the phrase abc from state 2 gives Next(2, abc) = 0 and Output(2, abc) = { ⟨2, abab⟩ }, i.e. the pattern abab ends 2 characters into the phrase.
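The two functions can be pinned down with a small sketch. This is a naive realization (run the automaton over the phrase), not the paper's table-based one: the KMP automaton for abab is built as a full transition table, and Next and Output simply scan the phrase from the given state. Running it reproduces Next(0, bab) = 2 and Output(2, abc) = {⟨2, abab⟩} from the slides.

```cpp
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// KMP automaton for a single (non-empty) pattern: delta[q][c] is the state
// reached from state q on character c, with the goto and failure functions
// collapsed into one full transition table.
struct KmpAutomaton {
    std::string pat;
    std::vector<std::vector<int>> delta;              // (m+1) x 256

    explicit KmpAutomaton(const std::string& p)
        : pat(p), delta(p.size() + 1, std::vector<int>(256, 0)) {
        delta[0][(unsigned char)pat[0]] = 1;
        int fail = 0;                                  // restart state of the current state
        for (int q = 1; q <= (int)pat.size(); ++q) {
            for (int c = 0; c < 256; ++c) delta[q][c] = delta[fail][c];
            if (q < (int)pat.size()) {
                delta[q][(unsigned char)pat[q]] = q + 1;
                fail = delta[fail][(unsigned char)pat[q]];
            }
        }
    }

    // Next(q, u): the state reached from q after reading the whole phrase u.
    int next(int q, const std::string& u) const {
        for (char c : u) q = delta[q][(unsigned char)c];
        return q;
    }

    // Output(q, u): pairs <i, pattern> such that the pattern ends exactly i
    // characters into u when u is read from state q (naive per-character scan).
    std::vector<std::pair<int, std::string>> output(int q, const std::string& u) const {
        std::vector<std::pair<int, std::string>> occ;
        for (int i = 0; i < (int)u.size(); ++i) {
            q = delta[q][(unsigned char)u[i]];
            if (q == (int)pat.size()) occ.push_back({i + 1, pat});
        }
        return occ;
    }
};

int main() {
    KmpAutomaton M("abab");
    std::cout << "Next(0, bab) = " << M.next(0, "bab") << '\n';     // 2
    for (const auto& o : M.output(2, "abc"))
        std::cout << "<" << o.first << "," << o.second << ">\n";    // <2,abab>
}
```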

For multiple patterns: the Aho-Corasick pattern matching machine
(Figure) Patterns: Π = {aba, ababb, abca, bb}. The AC machine is the trie of the patterns equipped with a goto function (solid edges), a failure function (dashed edges), and an output function ({ }), here {aba}, {ababb, bb}, {abca}, {bb} at the corresponding states.
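A compact construction sketch of the AC machine for Π = {aba, ababb, abca, bb}, using the standard textbook construction rather than anything specific to the paper; it builds the goto, failure, and output functions and prints them (state numbers depend on insertion order, so they need not match the figure).

```cpp
#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <vector>

// Aho-Corasick machine: goto function (trie edges), failure function, and
// output function.  State 0 is the root; states are numbered in insertion order.
struct AhoCorasick {
    struct State {
        std::map<char, int> go;                       // goto function
        int fail = 0;                                 // failure function
        std::vector<std::string> out;                 // output function
    };
    std::vector<State> st;
    AhoCorasick() : st(1) {}                          // start with the root only

    void add_pattern(const std::string& p) {
        int q = 0;
        for (char c : p) {
            if (!st[q].go.count(c)) {
                st.push_back(State{});
                st[q].go[c] = (int)st.size() - 1;
            }
            q = st[q].go[c];
        }
        st[q].out.push_back(p);
    }

    // Breadth-first computation of the failure function; the output set of a
    // state's failure state is merged into the state's own output set.
    void build() {
        std::queue<int> bfs;
        for (auto& e : st[0].go) { st[e.second].fail = 0; bfs.push(e.second); }
        while (!bfs.empty()) {
            int q = bfs.front(); bfs.pop();
            for (auto& e : st[q].go) {
                char c = e.first;
                int r = e.second;
                int f = st[q].fail;
                while (f != 0 && !st[f].go.count(c)) f = st[f].fail;
                st[r].fail = st[f].go.count(c) ? st[f].go.at(c) : 0;
                for (const std::string& p : st[st[r].fail].out) st[r].out.push_back(p);
                bfs.push(r);
            }
        }
    }
};

int main() {
    std::vector<std::string> patterns = {"aba", "ababb", "abca", "bb"};
    AhoCorasick ac;
    for (const std::string& p : patterns) ac.add_pattern(p);
    ac.build();
    for (int q = 1; q < (int)ac.st.size(); ++q) {
        std::cout << "state " << q << ": fail=" << ac.st[q].fail << " output={";
        for (const std::string& p : ac.st[q].out) std::cout << ' ' << p;
        std::cout << " }\n";
    }
}
```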

Our Algorithm
Input: Π, a set of patterns; u_1, u_2, ..., u_n, an LZW compressed text.
Output: all occurrences of the patterns.
Construct from Π the AC machine and the generalized suffix trie;            (O(m²))
Initialize the dictionary trie, Next and Output;
l := 0; state := q_0;
for i := 1 to n do begin
    for each ⟨d, π⟩ ∈ Output(state, u_i) do                                  (O(n + r) in total)
        report "pattern π occurs at position l + d";
    state := Next(state, u_i);                                               (O(n) in total)
    l := l + |u_i|;
    update the dictionary trie, Next and Output
end.
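The control flow of the algorithm, written out as a runnable sketch. It reuses the AhoCorasick struct from the sketch above, and it realizes Next and Output naively by decoding and scanning each phrase, so it still spends time proportional to the original text length; the point of the paper is to get the same answers in O(n + m² + r) time by table lookups, without expanding the phrases.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Driver sketch for the algorithm above.  Assumes the AhoCorasick struct
// from the previous sketch.  Output(state, u) and Next(state, u) are
// realized here by decoding each phrase u and scanning it character by
// character, which is exactly what the paper's table-based method avoids.
void search_lzw(const AhoCorasick& ac, const std::vector<int>& codes,
                const std::string& alphabet) {
    std::vector<std::string> dict(1);                  // code 0 is unused
    for (char c : alphabet) dict.push_back(std::string(1, c));

    long long l = 0;                                   // text length scanned so far
    int state = 0;                                     // current AC state
    std::string prev;
    for (int code : codes) {
        // LZW decoding: the phrase named by the code, or prev + prev[0] in
        // the special case where the code is not in the dictionary yet.
        std::string u = (code < (int)dict.size()) ? dict[code] : prev + prev[0];
        if (!prev.empty()) dict.push_back(prev + u[0]);    // grow the dictionary

        // Output(state, u) and Next(state, u), realized by a plain scan of u.
        for (int d = 1; d <= (int)u.size(); ++d) {
            char c = u[d - 1];
            while (state != 0 && !ac.st[state].go.count(c)) state = ac.st[state].fail;
            if (ac.st[state].go.count(c)) state = ac.st[state].go.at(c);
            for (const std::string& p : ac.st[state].out)
                std::cout << "pattern " << p << " occurs at position " << l + d << '\n';
        }
        l += (long long)u.size();
        prev = u;
    }
}

int main() {
    std::vector<std::string> patterns = {"aba", "ababb", "abca", "bb"};
    AhoCorasick ac;
    for (const std::string& p : patterns) ac.add_pattern(p);
    ac.build();
    // Codes produced by the LZW sketch earlier for "abababbabcababcabab".
    search_lzw(ac, {1, 2, 4, 4, 5, 2, 3, 6, 9, 11}, "abc");
}
```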

Ok! Let’s go!

State Transition Function Next(q, u)
Next : Q × D → Q, where Q is the set of states of the AC machine, D is the set of strings represented by the dictionary trie, and m is the total length of the patterns. A full table would take O(m × |D|) space!! Instead, split it:
Next(q, u) = N1(q, u)·u   if u ∈ Factor(Π)   (a table of size O(m × m²)),
Next(q, u) = Next(0, u)   otherwise           (a table of size O(|D|)).
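A small illustration of where the table comes from and how a column can at least be built incrementally (a naive sketch, not the paper's N1-based realization): since reading u·a from q is the same as reading u and then a, Next(q, u·a) = δ(Next(q, u), a), so the column for a new dictionary phrase follows from its parent's column in O(|Q|) time. The example uses the single pattern abab from the earlier slides.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Naive incremental tabulation of Next.  Because reading u.a from state q is
// the same as reading u and then a, the column Next(., u.a) follows from the
// column Next(., u) at a cost of O(|Q|) per new dictionary phrase.  The
// paper avoids this per-state work with the N1 table and the suffix trie.
int main() {
    // Full transition table of the KMP automaton for the single pattern
    // "abab" (same construction as in the earlier sketch).
    const std::string pat = "abab";
    const int m = (int)pat.size();
    std::vector<std::vector<int>> delta(m + 1, std::vector<int>(256, 0));
    delta[0][(unsigned char)pat[0]] = 1;
    for (int q = 1, fail = 0; q <= m; ++q) {
        delta[q] = delta[fail];
        if (q < m) {
            delta[q][(unsigned char)pat[q]] = q + 1;
            fail = delta[fail][(unsigned char)pat[q]];
        }
    }

    // Column Next(., "ab"), then the column of its extension "aba" = "ab".'a'.
    std::vector<int> next_ab(m + 1), next_aba(m + 1);
    for (int q = 0; q <= m; ++q) next_ab[q] = delta[delta[q]['a']]['b'];
    for (int q = 0; q <= m; ++q) next_aba[q] = delta[next_ab[q]]['a'];

    for (int q = 0; q <= m; ++q)
        std::cout << "Next(" << q << ", ab) = " << next_ab[q]
                  << "   Next(" << q << ", aba) = " << next_aba[q] << '\n';
}
```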

State Transition Function Next(q, u), continued
(Table) The table of N1(q, u)·u for Π = {aba, ababb, abca, bb}, with one row per AC state and one column per factor of the patterns: a, b, c, ab, ba, bb, bc, ca, aba, abb, abc, bab, bca, abab, abca, babb, ababb. Stored naively it takes O(m × m²) entries, giving O(|D| + m³) space overall.

Generalized Suffix Trie
(Figure) The generalized suffix trie of the patterns contains one node for every factor of every pattern: the explicit nodes number O(m), and together with the non-explicit nodes the trie has O(m²) nodes.
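A sketch of the uncompacted generalized suffix trie: inserting every suffix of every pattern gives one node per distinct factor, O(m²) nodes in total, of which the paper stores only the O(m) explicit ones. For Π = {aba, ababb, abca, bb} it reports 17 factors, exactly the strings indexing the N1 table above.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Uncompacted generalized suffix trie: inserting every suffix of every
// pattern yields one node per distinct factor of the patterns.
struct SuffixTrie {
    std::vector<std::map<char, int>> child;
    SuffixTrie() : child(1) {}                        // node 0 is the root

    void insert_suffixes(const std::string& p) {
        for (int s = 0; s < (int)p.size(); ++s) {
            int v = 0;
            for (int i = s; i < (int)p.size(); ++i) {
                char c = p[i];
                if (!child[v].count(c)) {
                    child.push_back(std::map<char, int>());
                    child[v][c] = (int)child.size() - 1;
                }
                v = child[v][c];
            }
        }
    }
    // u is a factor of some pattern iff it spells a path from the root.
    bool is_factor(const std::string& u) const {
        int v = 0;
        for (char c : u) {
            auto it = child[v].find(c);
            if (it == child[v].end()) return false;
            v = it->second;
        }
        return true;
    }
};

int main() {
    std::vector<std::string> patterns = {"aba", "ababb", "abca", "bb"};
    SuffixTrie trie;
    for (const std::string& p : patterns) trie.insert_suffixes(p);
    std::cout << trie.child.size() - 1 << " non-root nodes (distinct factors)\n";  // 17
    std::cout << "bab is a factor: " << trie.is_factor("bab") << '\n';             // 1
    std::cout << "cab is a factor: " << trie.is_factor("cab") << '\n';             // 0
}
```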

State Transition Function Next(q, u), continued
(Table) Only the columns of the N1(q, u)·u table that correspond to explicit nodes of the generalized suffix trie (a, b, ab, ba, bb, ca, aba, abb, bca, abca, babb, ababb) need to be stored; the remaining factors are resolved through their explicit descendants in the trie. The table shrinks from O(m × m²) to O(m × m) entries, and the total space from O(|D| + m³) to O(|D| + m²).

Ancestor(q, k): the ancestor of node q at distance k in the trie of the AC machine.
û: one of the explicit descendants of node u in the generalized suffix trie.

Output Function
Output(q, u) = { ⟨i, π⟩ | 1 ≤ i ≤ |u|, π ∈ Π, and π is a suffix of the string q·u[1..i] }.
(Figure) A full table would take O(m × |D|) space!!!

Output Function, continued
(Figure) Let ũ be the longest prefix of u such that ũ is a suffix of some pattern. The occurrences ⟨i, π⟩ that end within ũ depend on the state q and are tabulated in O(m²) space; the occurrences that end in the remainder of u are independent of q and are shared among all states, taking O(|D|) space in total.

But... Is it really fast ? Uhmm....

Experiment
(Figure) Three methods are compared:
◆ Method 1: decompress the compressed text into the original text, then run the AC machine over it.
◆ Method 2: run the AC machine while decompressing, character by character.
◆ Method 3: our algorithm, which works on the compressed text without decompression.

Experiment settings
Original text: the Brown corpus, 6.8 Mbytes; compressed to 3.4 Mbytes with compress (the UNIX command).
Language: C++ (gcc, without optimization). Machine: Sun SPARCstation 20.

Result of the Experiment
(Graph) CPU time (s) plotted against the occurrence rate (%, the number of pattern occurrences divided by the original text length) for Method 1, Method 2, and Method 3 (our algorithm).

Conclusion
Previous result:
– deals with only a single pattern.
– can find only the first occurrence of the pattern.
– takes O(n + m²) time and space.
– no practical evaluation.
Our result:
– deals with multiple patterns.
– can find all occurrences of the patterns.
– takes O(n + m²) space and can answer in O(n + m² + r) time.
– about twice as fast as decompression followed by the AC machine.

Another result
(Graph) CPU time (s) against the occurrence rate (%, number of pattern occurrences / original text length) for Method 1, Method 2, and Method 3, compared with searching the plain text and with zgrep.

LZW Compression (reconstructing the dictionary trie from the compressed text)
Input: LZW compressed text u_1, u_2, ..., u_n.
Output: dictionary D represented in the form of a trie.
Method:
begin
    D := Σ;
    for i := 1 to n − 1 do begin
        if u_{i+1} ≤ |D| then
            let a be the first symbol of u_{i+1}
        else
            let a be the first symbol of u_i;
        D := D ∪ { u_i · a }
    end
end.
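The same procedure as a runnable sketch (with two small assumptions not on the slide: node indices double as LZW codes, and the first symbol of a phrase is found by walking up to its depth-1 ancestor). Fed the codes of the running example, it rebuilds the 12-phrase dictionary a, b, c, ab, ba, aba, abb, bab, bc, ca, abab, bca.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Rebuild the dictionary trie from the sequence of LZW codes alone.  Each
// node stores its parent and the character on its incoming edge.
struct DictTrie {
    std::vector<int> parent;
    std::vector<char> label;

    explicit DictTrie(const std::string& alphabet) : parent(1, 0), label(1, '\0') {
        for (char c : alphabet) { parent.push_back(0); label.push_back(c); }
    }
    int size() const { return (int)parent.size() - 1; }    // |D|, root excluded
    char first_symbol(int code) const {                     // first character of phrase `code`
        while (parent[code] != 0) code = parent[code];
        return label[code];
    }
    void add(int base_code, char a) {                       // add the phrase u_base . a to D
        parent.push_back(base_code);
        label.push_back(a);
    }
};

int main() {
    // Codes of the running example text "abababbabcababcabab" over {a,b,c}.
    std::vector<int> u = {1, 2, 4, 4, 5, 2, 3, 6, 9, 11};
    DictTrie D("abc");
    int n = (int)u.size();
    for (int i = 0; i + 1 < n; ++i) {                       // i = 1 .. n-1 on the slide
        char a = (u[i + 1] <= D.size()) ? D.first_symbol(u[i + 1])
                                        : D.first_symbol(u[i]);
        D.add(u[i], a);                                      // new phrase u_i . a
    }
    std::cout << "|D| = " << D.size() << '\n';               // 12 phrases
}
```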

Proof
(Figure) Let u be a string that is not a substring of any pattern, and let p = Next(q, u). Since u is not a factor of any pattern, no pattern prefix longer than |u| can be a suffix of q·u (it would contain u), so the state p is determined by u alone and does not depend on q; hence Next(q, u) = Next(0, u). This justifies
Next(q, u) = N1(q, u)·u if u ∈ Factor(Π), and Next(q, u) = Next(0, u) otherwise.

Output Function, continued
(Figure) Output(q, u) = { ⟨i, π⟩ | 1 ≤ i ≤ |u|, π ∈ Π, and π is a suffix of q·u[1..i] }: the occurrences that end within the prefix ũ depend on q, while those that end beyond ũ do not.

Realization of the Output function
(Figure) Patterns: Π = {aba, ababb, abca, bb}, Σ = {a, b, c}. Each node u of the dictionary trie D is annotated with three fields used to realize Output: a boolean flag, a pointer prev(u), and an AC state (for example, one node carries flag = true, prev = 0, AC state = 3, while others carry flag = false and a NULL AC state).

Realization of the Output function

Realization of the Output function
(Table) A table indexed by AC state (rows) and by the suffixes of the patterns (columns aba, ba, a, ababb, babb, abb, bb, b, abca, bca, ca), each entry being a pair of an AC state and a remaining string. Computing Output(1, babb) starts the chain (1, babb) → (2, abb).

For multiple patterns: the Aho-Corasick pattern matching machine (shown again)
(Figure) Patterns: Π = {aba, ababb, abca, bb}; solid edges: goto function, dashed edges: failure function, { }: output function, with outputs {aba}, {ababb, bb}, {abca}, {bb}.

Realization of the Output function, continued
(Table) The chain for Output(1, babb) continues: (1, babb) → (2, abb) → (3, bb) → (4, b) → (5, ε); along the way the occurrences ⟨2, aba⟩, ⟨4, ababb⟩, and ⟨4, bb⟩ are reported (found!).

Realization of the Output function, continued
(Table) With the refined table, the chain for Output(1, babb) becomes (1, babb) → (2, abb) → (4, b) → (5, ε), so enumerating the output takes O(number of occurrences) time.