
Multiple Pattern Matching in LZW Compressed Text Takuya KIDA Masayuki TAKEDA Ayumi SHINOHARA Masamichi MIYAZAKI Setsuo ARIKAWA Department of Informatics.



2 Multiple Pattern Matching in LZW Compressed Text
Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Masamichi MIYAZAKI, Setsuo ARIKAWA
Department of Informatics, Kyushu University, Japan
[Map labels: Nagano, Fukuoka]

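3 Our Goal
[Figure: the usual route turns the compressed text back into the original text and runs a pattern matching machine on it; the goal is a new machine that runs on the compressed text directly.]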

4 Previous studies
year  researcher                  compression method
1988  Eilam-Tsoreff and Vishkin   run-length
1992  Amir, Landau, and Vishkin   two-dimensional run-length
1992  Amir and Benson             two-dimensional run-length
1995  Farach and Thorup           LZ77
1996  Gasieniec, et al.           LZ77
1996  Amir, Benson and Farach     LZW
1997  Karpinski, et al.           straight-line programs
1997  Miyazaki, et al.            straight-line programs

5 Previous result vs Our result
Amir, Benson, and Farach's algorithm (JCSS 1996), "Let sleeping files lie: Pattern matching in Z-compressed files"
– deals with only a single pattern.
– can find only the first occurrence of the pattern.
– takes O(n + m^2) time and space.
  n: length of the compressed text, m: length of the pattern.
Our algorithm
– deals with multiple patterns.
– can find all occurrences of the patterns.
– takes O(n + m^2 + r) time and O(n + m^2) space.
  m: total length of the patterns, r: number of pattern occurrences.

6 Lempel-Ziv-Welch compression
Σ = {a, b, c}
original text (split into phrases): a b ab ab ba b c aba bc abab
compressed text: 1 2 4 4 5 2 3 6 9 11
[Figure: dictionary trie D with nodes 0 to 12; nodes 1, 2, 3 are the symbols a, b, c.]
Size of the dictionary trie: O(|D|) = O(n)
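The phrase parsing on this slide can be reproduced with a few lines of code. The following is a minimal sketch, assuming the numbering suggested by the trie (codes 1 to |Σ| for the single symbols, new phrases numbered in order of creation); the function name `lzw_compress` is illustrative and not taken from the slides.

```python
def lzw_compress(text, alphabet):
    """Return the list of LZW codes for text (sketch, not a bit-level encoder)."""
    # Assumption: codes 1..|alphabet| are the single symbols (a=1, b=2, c=3).
    codes = {ch: i + 1 for i, ch in enumerate(alphabet)}
    out = []
    w = ""  # longest dictionary word matched so far
    for ch in text:
        if w + ch in codes:
            w += ch
        else:
            out.append(codes[w])              # emit the code of the matched word
            codes[w + ch] = len(codes) + 1    # new trie node: w followed by ch
            w = ch
    if w:
        out.append(codes[w])
    return out


if __name__ == "__main__":
    print(lzw_compress("abababbabcababcabab", "abc"))
```

For the example text the dictionary ends with 12 entries, which matches the 12 non-root nodes of the trie in the figure.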

7 Basic Idea (Amir et al.)
Pattern: abab
[Figure: KMP automaton for abab with states 0 to 4 (legend: goto function, failure function, { }: output; state 4 outputs {abab}), run over the original text; each occurrence is marked "found!".]
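As a reference point for the uncompressed case, here is a small sketch of the KMP automaton for a single pattern, scanning a text symbol by symbol and reporting the end position of every occurrence. The helper names (`kmp_failure`, `kmp_step`, `find_all`) are illustrative assumptions, not names from the slides.

```python
def kmp_failure(pattern):
    """fail[k] = length of the longest proper border of pattern[:k]."""
    fail = [0] * (len(pattern) + 1)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k
    return fail


def kmp_step(pattern, fail, state, ch):
    """One transition of the automaton (goto, with failure links as fallback)."""
    if state == len(pattern):      # after a full match, fall back first
        state = fail[state]
    while state and pattern[state] != ch:
        state = fail[state]
    if pattern[state] == ch:
        state += 1
    return state


def find_all(pattern, text):
    """1-based end positions of all occurrences of pattern in text."""
    fail = kmp_failure(pattern)
    state, ends = 0, []
    for pos, ch in enumerate(text, start=1):
        state = kmp_step(pattern, fail, state, ch)
        if state == len(pattern):
            ends.append(pos)
    return ends


if __name__ == "__main__":
    print(find_all("abab", "aababab"))
```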

8 KMP automaton
Pattern: abab
[Figure: the KMP automaton (states 0 to 4, output {abab}) with transitions labelled by whole dictionary words such as a, b, c, ab, ba, bc, ca, aba, bab, bca, abab.]
Example: Next(0, bab) = 2

9 Basic Idea (Amir et al.)
[Figure: KMP automaton for abab (states 0 to 4, output {abab}) reading the dictionary word abc from state 2.]
Next(2, abc) = 0
Who is watching the occurrences of the pattern?!
Output(2, abc) = { ⟨2, abab⟩ }
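The two examples on these slides can be checked directly from the definitions. The sketch below is deliberately brute force: it identifies state q with the prefix of the pattern of length q and recomputes everything with string operations, so it illustrates what Next and Output mean, not how the algorithm computes them efficiently. The names `state_of`, `next_state` and `output` are assumptions made for this illustration.

```python
PATTERN = "abab"


def state_of(s):
    """KMP state after reading s: longest prefix of PATTERN that is a suffix of s."""
    for k in range(min(len(s), len(PATTERN)), -1, -1):
        if s.endswith(PATTERN[:k]):
            return k


def next_state(q, u):
    """Next(q, u): state reached from state q after reading the whole word u."""
    return state_of(PATTERN[:q] + u)


def output(q, u):
    """Output(q, u): pairs (i, pattern) such that the pattern ends at u[1..i]."""
    s = PATTERN[:q]
    return [(i, PATTERN) for i in range(1, len(u) + 1)
            if (s + u[:i]).endswith(PATTERN)]


if __name__ == "__main__":
    print(next_state(0, "bab"), next_state(2, "abc"), output(2, "abc"))
```

The second example shows why Next alone is not enough: the state after reading abc from state 2 is 0, yet an occurrence ended inside the word, which is exactly what Output has to record.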

10 for Multiple Patterns: Aho-Corasick Pattern Matching Machine
Patterns: Π = {aba, ababb, abca, bb}
[Figure: AC machine with states 0 to 9 (legend: goto function, failure function, { }: output); the outputs shown are {aba}, {ababb, bb}, {abca} and {bb}.]
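A compact sketch of how such a machine can be built is given below. It uses the textbook construction (trie for the goto function, breadth-first computation of failure links, outputs merged along failure links); the function names are illustrative. Inserting the patterns in the order aba, ababb, abca, bb happens to give a numbering of 0 to 9 that agrees with the transition table shown later in the deck, which is an observation about this example and not something the slides state.

```python
from collections import deque


def build_ac(patterns):
    """Return (goto, fail, out) for the Aho-Corasick machine of patterns."""
    goto, out = [{}], [set()]
    for p in patterns:                       # goto function: a trie of the patterns
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]           # inherit outputs via the failure link
            queue.append(t)
    return goto, fail, out


def step(goto, fail, s, ch):
    """One transition: follow failure links until a goto edge on ch exists."""
    while s and ch not in goto[s]:
        s = fail[s]
    return goto[s].get(ch, 0)


if __name__ == "__main__":
    goto, fail, out = build_ac(["aba", "ababb", "abca", "bb"])
    print(len(goto), out[5])
```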

11 Our Algorithm
Input. Π: set of patterns; u_1, u_2, …, u_n: LZW compressed text.
Output. All occurrences of the patterns.
  Construct from Π the AC machine and the generalized suffix trie;
  Initialize the dictionary trie, Next and Output;
  l := 0; state := q_0;
  for i := 1 to n do begin
    for each ⟨d, π⟩ ∈ Output(state, u_i) do
      report "pattern π occurs at position l + d";
    state := Next(state, u_i);
    l := l + |u_i|;
    Update the dictionary trie, Next and Output
  end.
Costs annotated on the slide: O(n + r), O(n), O(m^2).
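The control flow of this loop can be illustrated with a short program. This is only a sketch under simplifying assumptions: Next and Output are recomputed from their definitions with string operations, and each AC state is named by the pattern prefix it spells, so the code shows the structure of the scan but not the O(n + m^2 + r) bound. The function name `search_lzw`, the code numbering (1 to |Σ| for single symbols) and the placement of the dictionary update at the top of the loop body are choices made for this illustration.

```python
def search_lzw(codes, alphabet, patterns):
    """Report (end position, pattern) for every occurrence, reading LZW codes only."""
    # AC states, each named by the pattern prefix it represents ("" is the root).
    states = {p[:k] for p in patterns for k in range(len(p) + 1)}
    words = {i + 1: ch for i, ch in enumerate(alphabet)}   # dictionary D
    state, l, prev, found = "", 0, None, []
    for code in codes:
        # Update the dictionary: the previous word extended by the first
        # symbol of the current one (standard LZW decoder rule).
        u = words[code] if code in words else prev + prev[0]
        if prev is not None:
            words[len(words) + 1] = prev + u[0]
        # Output(state, u): pattern p is a suffix of state . u[1..d]
        for d in range(1, len(u) + 1):
            s = state + u[:d]
            found.extend((l + d, p) for p in patterns if s.endswith(p))
        # Next(state, u): longest suffix of state . u that is a pattern prefix
        s = state + u
        state = next(s[k:] for k in range(len(s) + 1) if s[k:] in states)
        l += len(u)
        prev = u
    return found


if __name__ == "__main__":
    hits = search_lzw([1, 2, 4, 4, 5, 2, 3, 6, 9, 11], "abc",
                      ["aba", "ababb", "abca", "bb"])
    print(sorted(hits))
```

The text is never materialised as a whole; only one dictionary word is looked at per code, which is the property the real algorithm preserves while replacing the brute-force steps by table lookups.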

12 Ok! Let’s go!

13 State Transition Function Next(q, u)
Next: Q × D → Q    O(m × |D|) !!
Next(q, u) = N1(q, u)     if u ∈ Factor(Π)    O(m × m^2)
             Next(0, u)   otherwise           O(|D|)
Q: states of AC machine
D: strings represented by dictionary trie
m: total length of patterns
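The second case of this formula says that when u is not a factor of any pattern, the starting state q no longer matters. A quick way to convince oneself is to test it exhaustively on the running example. The sketch below again names each AC state by the pattern prefix it represents and computes Next by brute force; `PREFIXES`, `FACTORS`, `next_state` and `check` are names invented for this check.

```python
from itertools import product

PATTERNS = ["aba", "ababb", "abca", "bb"]
# AC states, named by pattern prefixes ("" is state 0).
PREFIXES = {p[:k] for p in PATTERNS for k in range(len(p) + 1)}
# Factor(Π): all non-empty substrings of the patterns.
FACTORS = {p[i:j] for p in PATTERNS
           for i in range(len(p)) for j in range(i + 1, len(p) + 1)}


def next_state(q, u):
    """Next(q, u): longest suffix of q . u that is a prefix of some pattern."""
    s = q + u
    return next(s[k:] for k in range(len(s) + 1) if s[k:] in PREFIXES)


def check(max_len=4):
    """For every u outside Factor(Π), Next(q, u) must equal Next(0, u)."""
    for n in range(1, max_len + 1):
        for u in map("".join, product("abc", repeat=n)):
            if u not in FACTORS:
                assert all(next_state(q, u) == next_state("", u)
                           for q in PREFIXES)
    return True


if __name__ == "__main__":
    print(len(FACTORS), check())
```

Only the words in Factor(Π), 17 of them in this example, need a per-state table; every other dictionary word needs a single value.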

14 State Transition Function Next(q, u)
Π = {aba, ababb, abca, bb}
Table of N1(q, u): O(m × m^2). Next(q, u) in total: O(|D| + m^3).
u \ state   0  1  2  3  4  5  6  7  8  9
a           1  1  3  1  3  1  7  1  1  1
b           8  2  9  4  5  9  8  2  9  9
c           0  0  6  0  6  0  0  0  0  0
ab          2  2  4  2  4  2  2  2  2  2
ba          1  3  1  3  1  1  1  3  1  1
bb          9  9  9  5  9  9  9  9  9  9
bc          0  6  0  6  0  0  0  6  0  0
ca          1  1  7  1  7  1  1  1  1  1
aba         3  3  3  3  3  3  3  3  3  3
abb         9  9  5  9  5  9  9  9  9  9
abc         6  6  6  6  6  6  6  6  6  6
bab         2  4  2  4  2  2  2  4  2  2
bca         1  7  1  7  1  1  1  7  1  1
abab        4  4  4  4  4  4  4  4  4  4
abca        7  7  7  7  7  7  7  7  7  7
babb        9  5  9  5  9  9  9  5  9  9
ababb       5  5  5  5  5  5  5  5  5  5

15 Generalized Suffix Trie
[Figure: generalized suffix trie of the patterns, with explicit and nonexplicit nodes marked.]
Explicit nodes: O(m). Nonexplicit nodes: O(m^2).

16 State Transition Function Next(q, u)
[Figure: the table of N1(q, u) from slide 14 (17 rows, states 0 to 9) is cut down to the 12 rows a, b, ab, ba, bb, ca, aba, abb, bca, abca, babb, ababb.]
Table of N1(q, u): O(m × m). Next(q, u) in total: O(|D| + m^3) becomes O(|D| + m^2).

17 Ancestor(q, k): the ancestor of node q with distance k in the trie of AC machine.
u : one of the explicit descendants of node u in the generalized suffix trie.

18 Output Function
Output(q, u) = { ⟨i, π⟩ | 1 ≦ i ≦ |u|, π ∈ Π, and π is a suffix of string q・u[1..i] }
[Figure: the string q・u with an occurrence of π ending at position i of u.]
O(m × |D|) !!!

19 Output Function
Let u~ be the longest prefix of u such that u~ is a suffix of some pattern.
[Figure: occurrences of π1, π2, π3 in q・u, split at the end of u~ into a part dependent on q and a part independent of q; annotated costs O(|D|) and O(m^2).]
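One way to read this slide is that occurrences ending inside u~ may reach back into q, while occurrences ending after u~ must lie entirely inside u and can therefore be stored once per dictionary word. The sketch below checks that reading by brute force; the function names and the exact form of the split are assumptions made for illustration, and q is given as the string (pattern prefix) that the AC state represents.

```python
PATTERNS = ["aba", "ababb", "abca", "bb"]
# All suffixes of all patterns, including the empty string.
SUFFIXES = {p[k:] for p in PATTERNS for k in range(len(p) + 1)}


def u_tilde(u):
    """Longest prefix of u that is a suffix of some pattern."""
    return next(u[:k] for k in range(len(u), -1, -1) if u[:k] in SUFFIXES)


def output(q, u):
    """Output(q, u) straight from the definition."""
    return {(i, p) for i in range(1, len(u) + 1)
            for p in PATTERNS if (q + u[:i]).endswith(p)}


def split_output(q, u):
    """Split Output(q, u) into a q-dependent and a q-independent part."""
    t = len(u_tilde(u))
    dependent = {(i, p) for (i, p) in output(q, u) if i <= t}
    independent = {(i, p) for (i, p) in output("", u) if i > t}
    return dependent, independent


if __name__ == "__main__":
    print(u_tilde("babb"), split_output("a", "babb"))
    print(repr(u_tilde("cbb")), split_output("ab", "cbb"))
```

The independent part is computed with the empty string in place of q, which is what makes it reusable for every state.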

20 But... Is it really fast ? Uhmm....

21 Experiment
◆ Method 1 ◆ Method 2 ◆ Method 3
[Figure: the three methods compared. Two of them decompress the compressed text and search it with the AC machine; the third is our algorithm, which searches without decompression.]

22 Experiment
Original text: "The Brown corpus", 6.8 Mbytes
Compressed text: 3.4 Mbytes, produced by compress (UNIX command)
Language: C++ (gcc without optimization)
Machine: Sun SPARCstation 20

23 Result of the Experiment
[Plot: CPU time (s), 0 to 30, against occurrence rate (%), 0 to 25, where occurrence rate = number of pattern occurrences / original text length. Curves: Method 1, Method 2, Method 3 (Our Algorithm).]

24 Conclusion
Previous Result
– deals with only single pattern
– can find only the first occurrence of the pattern
– takes O(n + m^2) time and space
– no practical evaluation
Our Result
– deals with multiple patterns
– can find all occurrences of the patterns
– takes O(n + m^2) space, can answer in O(n + m^2 + r) time
– about twice as fast as decompression followed by the AC machine

25 Another result
[Plot: CPU time (s), 0 to 30, against occurrence rate (%), 0 to 25, where occurrence rate = number of pattern occurrences / original text length. Curves: Method 1, Method 2, Method 3; additional labels: plain, zgrep.]

26 LZW Compression
Input. LZW compressed text u_1, u_2, …, u_n.
Output. Dictionary D represented in the form of trie.
Method.
  begin
    D := Σ;
    for i := 1 to n-1 do begin
      if u_{i+1} ≦ |D| then
        let a be the first symbol of u_{i+1};
      else
        let a be the first symbol of u_i;
      D := D ∪ {u_i・a}
    end
  end.
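The method above translates almost line by line into code. The following sketch stores D as a mapping from code numbers to strings instead of a trie, and assumes that codes 1 to |Σ| denote the single symbols; `rebuild_dictionary` is an illustrative name.

```python
def rebuild_dictionary(codes, alphabet):
    """Rebuild the LZW dictionary from the compressed text alone."""
    # D := Σ, with codes 1..|Σ| for the single symbols (assumed numbering).
    D = {i + 1: ch for i, ch in enumerate(alphabet)}
    for i in range(len(codes) - 1):
        u_i = D[codes[i]]
        nxt = codes[i + 1]
        if nxt <= len(D):
            a = D[nxt][0]      # next word is already known: take its first symbol
        else:
            a = u_i[0]         # next word is the one being created: it starts like u_i
        D[len(D) + 1] = u_i + a
    return D


if __name__ == "__main__":
    codes = [1, 2, 4, 4, 5, 2, 3, 6, 9, 11]
    D = rebuild_dictionary(codes, "abc")
    print(D)
    print("".join(D[c] for c in codes))
```

The else branch covers the one case where a code refers to the entry that is about to be added, for example the codes 1 2 over Σ = {a}, which decode to a followed by aa.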

27 Proof
Next(q, u) = N1(q, u)     if u ∈ Factor(Π)
             Next(0, u)   otherwise
Let u be not a substring of any pattern.
[Figure: the string q・u and a state p, with Next(q, u) = p.]

28 Output Function
Output(q, u) = { ⟨i, π⟩ | 1 ≦ i ≦ |u|, π ∈ Π, and π is a suffix of string q・u[1..i] }
[Figure: the string q・u with u~ marked and occurrences of π1, π2, π3.]

29 Realization of Output function
Σ = {a, b, c}. Patterns: Π = {aba, ababb, abca, bb}
[Figure: dictionary trie D (nodes 0 to 12) whose nodes carry the fields flag, prev(u) and AC state. Sample entries shown: false 0 NULL; false 0 NULL; true 0 3; false 6 NULL.]

30 Realization of Output function

31 Realization of Output function
state  aba     ba     a      ababb     babb     abb     bb     b      abca     bca     ca
0      (1,ba)  (8,a)  (1,ε)  (1,babb)  (8,abb)  (1,bb)  (8,b)  (8,ε)  (1,bca)  (8,ca)  (0,a)
1      (1,ba)  (2,a)  (1,ε)  (1,babb)  (2,abb)  (1,bb)  (2,b)  (2,ε)  (1,bca)  (2,ca)  (0,a)
2      (3,ba)  (9,a)  (3,ε)  (3,babb)  (9,abb)  (3,bb)  (9,b)  (9,ε)  (3,bca)  (9,ca)  (6,a)
3      (1,ba)  (4,a)  (1,ε)  (1,babb)  (4,abb)  (1,bb)  (4,b)  (4,ε)  (1,bca)  (4,ca)  (0,a)
4      (3,ba)  (5,a)  (3,ε)  (3,babb)  (5,abb)  (3,bb)  (5,b)  (5,ε)  (3,bca)  (5,ca)  (6,a)
5      (1,ba)  (9,a)  (1,ε)  (1,babb)  (9,abb)  (1,bb)  (9,b)  (9,ε)  (1,bca)  (9,ca)  (0,a)
6      (7,ba)  (8,a)  (7,ε)  (7,babb)  (8,abb)  (7,bb)  (8,b)  (8,ε)  (7,bca)  (8,ca)  (0,a)
7      (1,ba)  (2,a)  (1,ε)  (1,babb)  (2,abb)  (1,bb)  (2,b)  (2,ε)  (1,bca)  (2,ca)  (0,a)
8      (1,ba)  (9,a)  (1,ε)  (1,babb)  (9,abb)  (1,bb)  (9,b)  (9,ε)  (1,bca)  (9,ca)  (0,a)
Output(1, babb): (1,babb) → (2,abb)

32 for Multiple Patterns: Aho-Corasick Pattern Matching Machine
Patterns: Π = {aba, ababb, abca, bb}
[Figure: AC machine with states 0 to 9 (legend: goto function, failure function, { }: output); the outputs shown are {aba}, {ababb, bb}, {abca} and {bb}.]

33 Realization of Output function
[Table: the same table of (state, remaining suffix) pairs as on slide 31.]
Output(1, babb): (1,babb) → (2,abb) → (3,bb) → (4,b) → (5,ε)
found ! found !

34 Realization of Output function
state  aba    ba     a      ababb    babb     abb    bb     b      abca   bca    ca
0      (2,a)  (1,ε)  (1,ε)  (2,abb)  (2,b)    (2,b)  (8,b)  (8,ε)  (6,a)  (1,ε)  (1,ε)
1      (2,a)  (2,a)  (1,ε)  (2,abb)  (2,abb)  (2,b)  (2,b)  (2,ε)  (6,a)  (6,a)  (1,ε)
2      (4,a)  (1,ε)  (3,ε)  (4,abb)  (2,b)    (4,b)  (9,b)  (9,ε)  (6,a)  (1,ε)  (6,a)
3      (2,a)  (4,a)  (1,ε)  (2,abb)  (4,abb)  (2,b)  (4,b)  (4,ε)  (6,a)  (6,a)  (1,ε)
4      (4,a)  (1,ε)  (3,ε)  (4,abb)  (2,b)    (4,b)  (5,b)  (5,ε)  (6,a)  (1,ε)  (6,a)
5      (2,a)  (1,ε)  (1,ε)  (2,abb)  (2,b)    (2,b)  (9,b)  (9,ε)  (6,a)  (1,ε)  (1,ε)
6      (2,a)  (1,ε)  (7,ε)  (2,abb)  (2,b)    (2,b)  (8,b)  (8,ε)  (6,a)  (1,ε)  (1,ε)
7      (2,a)  (2,a)  (1,ε)  (2,abb)  (2,abb)  (2,b)  (2,b)  (2,ε)  (6,a)  (6,a)  (1,ε)
8      (2,a)  (1,ε)  (1,ε)  (2,abb)  (2,b)    (2,b)  (9,b)  (9,ε)  (6,a)  (1,ε)  (1,ε)
Output(1, babb): (1,babb) → (2,abb) → (4,b) → (5,ε)
O(number of occurrences)

