Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda
Kyushu University, Japan
SPIRE Cartagena, Colombia

Outline
- Background
  - LZ78 Factorization
  - Straight Line Programs (SLP)
- Algorithms
  - LZ78 factorization using suffix trees
  - SLP to LZ78
  - Improvements

Background
Compressed String Processing (CSP)
- Compress a string for storage … but … don't decompress all of it when using it!
- Process the compressed representation directly: pattern matching, edit distance, pattern mining, etc.
- This can be faster than processing the uncompressed text, by exploiting regularities identified by compression.
- Regard compression as a generic preprocessing!
[Figure: a BIG string and its compressed representation, which is processed directly.]
This work: LZ78 factorization of grammar compressed strings

LZ78 Factorization [Ziv & Lempel '78]
The LZ78 factorization of a string S is a factorization S = f_1 f_2 ... f_m, where f_i is the longest prefix of f_i ... f_m such that f_i = f_j c for some 0 ≤ j < i and some character c (let f_0 = ε).
Example: S = alabaralalabarda$

  i    (j, c)    f_i
  1    (0, a)    a
  2    (0, l)    l
  3    (1, b)    ab
  4    (1, r)    ar
  5    (1, l)    al
  6    (5, a)    ala
  7    (0, b)    b
  8    (4, d)    ard
  9    (1, $)    a$

[Figure: LZ78 trie of S, with one node per factor.]
O(N log σ) time, O(m) space
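The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the stated bounds: the function name `lz78_factorize` is hypothetical, and a hash map stands in for the LZ78 trie, so it runs in expected O(N) time rather than O(N log σ).

```python
def lz78_factorize(s):
    """Return the LZ78 factorization of s as a list of (j, c) pairs.

    Factor i (1-based) equals factor j followed by character c, where
    factor 0 is the empty string. A dict keyed by (parent, character)
    plays the role of the LZ78 trie (an assumption of this sketch).
    """
    trie = {}    # (parent factor index, character) -> factor index
    pairs = []
    node = 0     # current trie node; 0 is the root (empty factor)
    for ch in s:
        nxt = trie.get((node, ch))
        if nxt is not None:
            node = nxt                  # keep extending the longest known prefix
        else:
            pairs.append((node, ch))    # new factor = known factor + one character
            trie[(node, ch)] = len(pairs)
            node = 0
    if node != 0:
        # The input ended inside a known factor (no unique terminator);
        # report the leftover with an empty character.
        pairs.append((node, ""))
    return pairs


if __name__ == "__main__":
    for i, (j, c) in enumerate(lz78_factorize("alabaralalabarda$"), start=1):
        print(i, (j, c))
```

For the example string S = alabaralalabarda$ the sketch produces nine factors, one per trie node.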

Straight Line Programs
A straight line program (SLP) is a CFG in Chomsky normal form that derives a single string.
SLPs can efficiently model the outputs of many compression algorithms: REPAIR, SEQUITUR, LZ78, etc.
Example SLP, n = 7:
  X_1 = a
  X_2 = b
  X_3 = X_1 X_2
  X_4 = X_1 X_3
  X_5 = X_4 X_3
  X_6 = X_4 X_5
  X_7 = X_6 X_5
X_7 derives S = aabaababaabab.
[Figure: derivation tree of S, rooted at X_7.]
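To make the example concrete, here is a small Python sketch of an SLP and its full expansion. The dictionary encoding and the names `EXAMPLE_SLP`, `derived_lengths`, and `expand` are assumptions made for illustration; the sketch also assumes that each rule refers only to variables with smaller indices.

```python
# A terminal rule is a one-character string; a rule X_i = X_l X_r is the
# pair (l, r). Children are assumed to have smaller indices than parents.
EXAMPLE_SLP = {
    1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5),
}


def derived_lengths(slp):
    """Length of the string derived by each variable, computed bottom-up."""
    size = {}
    for i in sorted(slp):
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size


def expand(slp, i):
    """Fully decompress X_i; the result can be exponentially longer than the SLP."""
    text = {}
    for k in sorted(slp):
        rule = slp[k]
        text[k] = rule if isinstance(rule, str) else text[rule[0]] + text[rule[1]]
        if k == i:
            break
    return text[i]


if __name__ == "__main__":
    print(expand(EXAMPLE_SLP, 7), derived_lengths(EXAMPLE_SLP)[7])
```

Full expansion is exactly what compressed string processing tries to avoid; it appears here only to check the example.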

Problem: SLP to LZ78
Input: SLP (e.g., the example rules X_1 = a, ..., X_7 = X_6 X_5)
Output: LZ78 factorization (trie)
Why "re-compress" a compressed representation?
- Convert the representation: some CSP algorithms require a specific compression.
- Re-compress an SLP modified by ad-hoc edits (dynamic compressed texts).
- Compute the Normalized Compression Distance [Li et al. 2004] for clustering & classification w/o decompression: C_LZ78(x), C_LZ78(y), C_LZ78(xy) from SLPs of x, y.
Computer Scientist Make Sleeping Files Walk in their Sleep!
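As a rough illustration of the NCD use case, the sketch below evaluates the usual formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)) on plain strings. Taking C to be the number of LZ78 factors is an assumption of this sketch, as are the function names; the slide's setting would obtain the three counts from SLPs instead of from uncompressed text.

```python
def lz78_factor_count(s):
    """Number of LZ78 factors of s (a trailing partial factor counts as one)."""
    trie = {}    # (parent factor index, character) -> factor index
    count = 0
    node = 0
    for ch in s:
        nxt = trie.get((node, ch))
        if nxt is not None:
            node = nxt
        else:
            count += 1
            trie[(node, ch)] = count
            node = 0
    return count + (1 if node != 0 else 0)


def ncd_lz78(x, y):
    """NCD with the LZ78 factor count as compressed size; x and y must be non-empty."""
    cx = lz78_factor_count(x)
    cy = lz78_factor_count(y)
    cxy = lz78_factor_count(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    print(ncd_lz78("abab", "abab"))
```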

Our Results
Algorithms to compute LZ78 from SLP

  Algorithm                       Time                      Space
  Direct (uncompressed)           O(N log σ)                O(m)
  Decompress + Direct             O(N log σ)                O(n + m)
  SLP (partial decompressions)    O(n N^{1/2} + m log N)    O(n N^{1/2} + m)
  SLP + Doubling                  O(nL + m log N)           O(nL + m)
  SLP + Redundancy Reduction      O(N_α + m log N)          O(N_α + m)

N: length of the uncompressed string S
σ: alphabet size
n: size of the SLP representing S
m: number of LZ78 factors (O(N / log N) for constant σ)
L: length of the longest LZ78 factor
N_α = N - α ≤ N, where α ≥ 0 is a quantity that represents the amount of redundancy in the string that is captured by the SLP

LZ78 Factorization using a Suffix Tree

Suffix Tree & LZ78
The LZ78 trie can be superimposed on the suffix tree.
[Figure: suffix tree of S = aabaababaabab, with the LZ78 trie of S superimposed on it.]

LZ78 Factorization on Suffix Tree
Build the LZ78 trie on top of the suffix tree ST of S; nodes corresponding to the LZ78 trie are marked.
At position i, the next factor is a prefix of S[i:N]:
1. Find the node in ST corresponding to S[i:N].
2. Find the longest prefix of S[i:N] in the LZ78 trie: O(1) time by dynamic nearest marked ancestor (nma) queries [Westbrook '92].
3. Make a new node of the LZ78 trie on ST: O(1) time by a level ancestor (la) query on ST [Berkman & Vishkin '94].
4. Compute the next position: i ← i + |f_i|.
LZ78 factorization in O(m) time, given a suffix tree preprocessed for nma & la queries.
[Figure: suffix tree of S = aabaababaabab with the marked LZ78 trie nodes and the current position i.]

SLP to LZ78

Our algorithm: SLP to LZ78
Key Observation: for any string of length N, the length of any LZ78 factor f_i satisfies
  |f_i| ≤ c_N = (2N + 1/4)^{1/2} - 1/2 = O(N^{1/2})
Main Idea:
- We only need a suffix tree that contains all distinct substrings of S with length at most c_N.
- Build a generalized suffix tree (GST) from a set of substrings of S that contain all distinct length-c_N substrings of S.
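One way to see where the bound c_N comes from (a sketch of a standard argument, not taken from the slides): if some factor has length L, then by f_i = f_j c its prefixes of lengths 1, 2, ..., L - 1 are earlier factors, so S contains disjoint factors of lengths 1 through L. Here L is a symbol introduced for the longest factor length.

```latex
\[
  \sum_{k=1}^{L} k \;=\; \frac{L(L+1)}{2} \;\le\; N
  \quad\Longrightarrow\quad
  L^{2} + L - 2N \;\le\; 0
  \quad\Longrightarrow\quad
  L \;\le\; \sqrt{2N + \tfrac{1}{4}} - \tfrac{1}{2} \;=\; c_N .
\]
% Example: N = 13 gives c_N = \sqrt{26.25} - 0.5 \approx 4.62,
% so no LZ78 factor of a length-13 string is longer than 4.
```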

Important Concept: Stabbing
X_i stabs an interval [u:v] of S when it is the shortest variable in the derivation tree whose derived substring covers the interval (any interval is stabbed by a unique variable).
Example SLP: X_1 = a, X_2 = b, X_3 = X_1 X_2, X_4 = X_1 X_3, X_5 = X_4 X_3, X_6 = X_4 X_5, X_7 = X_6 X_5
e.g.: aaba at [9:12] is stabbed by X_5
[Figure: derivation tree of S = aabaababaabab with the interval [9:12] and the stabbing node X_5 highlighted.]
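The stabbing variable can be found by walking down the derivation tree until the interval crosses the boundary between the two children. The sketch below assumes a dictionary encoding of the example SLP; `EXAMPLE_SLP`, `derived_lengths`, and `stabbing_variable` are hypothetical names.

```python
# A terminal rule is a one-character string; X_i = X_l X_r is the pair (l, r).
EXAMPLE_SLP = {
    1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5),
}


def derived_lengths(slp):
    """Length of the string derived by each variable, computed bottom-up."""
    size = {}
    for i in sorted(slp):
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size


def stabbing_variable(slp, root, u, v):
    """Return the index of the variable stabbing the 1-based inclusive interval [u:v]."""
    size = derived_lengths(slp)
    i, start = root, 1           # X_i derives S[start : start + size[i] - 1]
    while True:
        rule = slp[i]
        if isinstance(rule, str):
            return i             # reached a single character (u == v)
        left, right = rule
        mid = start + size[left] # first position covered by the right child
        if v < mid:
            i = left             # interval lies entirely in the left child
        elif u >= mid:
            i, start = right, mid
        else:
            return i             # interval crosses the boundary: X_i stabs it


if __name__ == "__main__":
    print(stabbing_variable(EXAMPLE_SLP, 7, 9, 12))
```

The walk takes time proportional to the height of the derivation tree.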

Substrings stabbed by X_i
All length-q substrings stabbed by X_i = X_l(i) X_r(i) are contained in a string t_i(q) of length at most 2(q - 1), which spans the boundary between X_l(i) and X_r(i) with at most q - 1 characters on each side.
Any length-q substring of S is stabbed by some unique variable X_i, and therefore is a substring of some t_i(q).
Hence { t_i(c_N) : |X_i| ≥ c_N, 1 ≤ i ≤ n } will contain all distinct length-c_N substrings of S.
[Figure: X_i with children X_l(i) and X_r(i); t_i(q) extends q - 1 characters to each side of their boundary.]
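The strings t_i(q) can be computed bottom-up without decompressing S, by keeping only the first and last q - 1 characters of each variable. The sketch below assumes that t_i(q) is the length-(q - 1) suffix of X_l(i) followed by the length-(q - 1) prefix of X_r(i) (shorter if a child is shorter), which matches the figure; the names `EXAMPLE_SLP` and `boundary_strings` are hypothetical.

```python
# A terminal rule is a one-character string; X_i = X_l X_r is the pair (l, r).
EXAMPLE_SLP = {
    1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5),
}


def boundary_strings(slp, q):
    """Return {i: t_i(q)} for every nonterminal X_i with |X_i| >= q (requires q >= 2)."""
    k = q - 1
    size, pre, suf, t = {}, {}, {}, {}
    for i in sorted(slp):                  # children have smaller indices
        rule = slp[i]
        if isinstance(rule, str):
            size[i], pre[i], suf[i] = 1, rule, rule
            continue
        l, r = rule
        size[i] = size[l] + size[r]
        pre[i] = (pre[l] + pre[r])[:k]     # first min(k, |X_i|) characters of X_i
        suf[i] = (suf[l] + suf[r])[-k:]    # last min(k, |X_i|) characters of X_i
        if size[i] >= q:
            t[i] = suf[l] + pre[r]         # at most 2(q - 1) characters
    return t


if __name__ == "__main__":
    print(boundary_strings(EXAMPLE_SLP, 4))
```

Each variable costs O(q) work, so the whole pass takes O(nq) time and space.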

LZ78 Factorization from SLP
Algorithm:
1. Compute { t_i(c_N) : |X_i| ≥ c_N, 1 ≤ i ≤ n }.
2. Build the generalized suffix tree (GST) for the strings { t_i(c_N) : |X_i| ≥ c_N, 1 ≤ i ≤ n }.
3. Run the LZ78 factorization algorithm using the GST.
Computing the strings and building the GST takes O(n c_N) time/space.

Example
N = 13, c_N = 4, n = 7
{ t_5(4), t_6(4), t_7(4) } = { aabab, aabaab, babaab }
[Figure: derivation tree of S = aabaababaabab with the boundaries of X_5, X_6, and X_7 highlighted.]
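The example values can be checked directly. The sketch below recomputes c_N with integer arithmetic and confirms that every length-4 substring of S occurs in one of the three strings; the helper names are hypothetical, and S is decompressed here only for the purpose of the check.

```python
import math


def c_bound(n_chars):
    """floor((2N + 1/4)^(1/2) - 1/2), computed exactly with integers."""
    return (math.isqrt(8 * n_chars + 1) - 1) // 2


def substrings_of_length(s, q):
    """Set of all distinct length-q substrings of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


S = "aabaababaabab"
PIECES = ["aabab", "aabaab", "babaab"]   # t_5(4), t_6(4), t_7(4)

q = c_bound(len(S))
in_text = substrings_of_length(S, q)
in_pieces = set().union(*(substrings_of_length(t, q) for t in PIECES))
missing = in_text - in_pieces

if __name__ == "__main__":
    print(q, sorted(in_text), sorted(missing))
```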

GST & LZ78 Factors
The LZ78 trie superimposed on the GST of { t_5(4), t_6(4), t_7(4) }, where t_5(4) = aabab, t_6(4) = aabaab, t_7(4) = babaab.
[Figure: GST of { t_5(4), t_6(4), t_7(4) } with the LZ78 trie of S = aabaababaabab marked on it.]

LZ78 Factorization on GST
Example with c_N = 4, on the GST of { t_5(4), t_6(4), t_7(4) } = { aabab, aabaab, babaab }.
At position i, the next factor is a prefix of S[i:N]:
1. Find the node in the GST corresponding to S[i:N]: O(log N) time w/ random access on the SLP [Bille et al. 2011].
2. Find the longest prefix of S[i:N] in the LZ78 trie: O(1) time w/ dynamic nma queries.
3. Make a new node for the LZ78 trie on the GST.
4. Compute the next position: i ← i + |f_i|.
LZ78 factorization can be computed in O(m log N) time, given a GST preprocessed for nma & la queries, and an SLP preprocessed for random access queries.
[Figure: animation over several slides showing the derivation tree of S = aabaababaabab, the GST, and the LZ78 trie growing as i advances.]
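For intuition about the random access step, the sketch below reads a single character of S by descending the derivation tree. It is a deliberately simple stand-in that costs time proportional to the tree height; it is not the O(log N) data structure of Bille et al., and the names `EXAMPLE_SLP`, `derived_lengths`, and `access` are hypothetical.

```python
# A terminal rule is a one-character string; X_i = X_l X_r is the pair (l, r).
EXAMPLE_SLP = {
    1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5),
}


def derived_lengths(slp):
    """Length of the string derived by each variable, computed bottom-up."""
    size = {}
    for i in sorted(slp):
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size


def access(slp, size, root, pos):
    """Return the character S[pos] (1-based) without decompressing S."""
    node = root
    while not isinstance(slp[node], str):
        left, right = slp[node]
        if pos <= size[left]:
            node = left              # position falls inside the left child
        else:
            pos -= size[left]        # shift into the right child's coordinates
            node = right
    return slp[node]


if __name__ == "__main__":
    sizes = derived_lengths(EXAMPLE_SLP)
    print("".join(access(EXAMPLE_SLP, sizes, 7, p) for p in range(9, 13)))
```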

Summary of Basic Algorithm

  Algorithm                Time                    Space
  Direct (uncompressed)    O(N log σ)              O(m)
  Decompress + Direct      O(N log σ)              O(n + m)
  SLP                      O(n c_N + m log N)      O(n c_N + m)

c_N = O(N^{1/2})
Extreme cases:
- If the string is compressible: n = O(log N), m = O(N^{1/2}), so O(n c_N + m log N) = O(N^{1/2} log N) = o(N).
- If the string is not compressible: n, m = O(N) and O(n c_N + m log N) = O(N^{1.5}).
Can we do better than just revert to decompress & process?

(1) Improving the n c_N term to nL ≤ n c_N
Let L denote the length of the longest LZ78 factor of S.
We built the GST for distinct substrings of length at most c_N, but actually we only need substrings of length at most L.
However, L is not known beforehand…
Doubling Technique:
- Assume L = 2 and run the algorithm.
- If the LZ78 trie expands beyond the GST, set L ← 2 × L, rebuild the GST and the LZ78 trie, and continue.
- Total time complexity for rebuilds: Σ_{i=1..log L} O(n 2^i + m) = O(nL + m log L)
O(n c_N + m log N) time, O(n c_N + m) space → O(nL + m log N) time, O(nL + m) space
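The control flow of the doubling technique can be sketched on an uncompressed string. In this illustration a length limit stands in for the depth of the GST, and each round restarts the factorization from scratch rather than rebuilding and continuing; both are simplifications, and the function names are hypothetical.

```python
def lz78_bounded(s, limit):
    """LZ78-factorize s, returning None as soon as a factor longer than limit is needed."""
    trie, factors = {}, []
    node, depth = 0, 0               # current trie node and its string depth
    for ch in s:
        nxt = trie.get((node, ch))
        if nxt is not None:
            node, depth = nxt, depth + 1
            continue
        if depth + 1 > limit:
            return None              # the trie would grow beyond the current limit
        factors.append((node, ch))
        trie[(node, ch)] = len(factors)
        node, depth = 0, 0
    if node != 0:
        factors.append((node, ""))   # input ended inside a known factor
    return factors


def lz78_by_doubling(s):
    """Guess the longest factor length, doubling the guess until it suffices."""
    limit = 2
    while True:
        factors = lz78_bounded(s, limit)
        if factors is not None:
            return factors, limit
        limit *= 2


if __name__ == "__main__":
    print(lz78_by_doubling("aabaababaabab"))
```

Because the guesses grow geometrically, the final guess is less than twice the true longest factor length, which is why the rebuild costs sum to a term proportional to nL.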

(2) Improving the n c_N term to N_α ≤ N
We can replace the GST with the suffix tree of a trie for q = c_N.
Lemma [Goto et al. CPM 2012]: Given an SLP for string S, the set of length-q substrings of S can be represented as paths in a reverse trie of size N_α = N - α(q) ≤ N, where
  α(q) = Σ_{i : |X_i| ≥ q} (vOcc(X_i) - 1) (|t_i(q)| - (q - 1)) ≥ 0
and vOcc(X_i) is the number of times X_i occurs in the derivation tree.
The trie can be computed in time linear in its size.
Lemma [Shibuya 2003]: The suffix tree of a reverse trie can be constructed in linear time.
N_α = O(n c_N)
O(n c_N + m log N) time, O(n c_N + m) space → O(N_α + m log N) time, O(N_α + m) space
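The quantity α(q) is easy to evaluate from the SLP alone. The sketch below computes vOcc top-down and then applies the formula; it assumes |t_i(q)| = min(q - 1, |X_l(i)|) + min(q - 1, |X_r(i)|), and the names `EXAMPLE_SLP` and `trie_size_bound` are hypothetical.

```python
# A terminal rule is a one-character string; X_i = X_l X_r is the pair (l, r).
EXAMPLE_SLP = {
    1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5),
}


def trie_size_bound(slp, root, q):
    """Return (alpha(q), N - alpha(q)) for the SLP rooted at X_root (requires q >= 2)."""
    order = sorted(slp)                  # children have smaller indices
    size = {}
    for i in order:
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]

    # vOcc(X_i): number of nodes labeled X_i in the derivation tree of X_root.
    vocc = {i: 0 for i in order}
    vocc[root] = 1
    for i in reversed(order):            # parents are visited before children
        rule = slp[i]
        if not isinstance(rule, str):
            vocc[rule[0]] += vocc[i]
            vocc[rule[1]] += vocc[i]

    alpha = 0
    for i in order:
        rule = slp[i]
        if isinstance(rule, str) or size[i] < q or vocc[i] == 0:
            continue                     # only reachable nonterminals with |X_i| >= q
        t_len = min(q - 1, size[rule[0]]) + min(q - 1, size[rule[1]])
        alpha += (vocc[i] - 1) * (t_len - (q - 1))
    return alpha, size[root] - alpha


if __name__ == "__main__":
    print(trie_size_bound(EXAMPLE_SLP, 7, 4))
```

In the example SLP with q = 4, only X_5 occurs more than once among the variables of length at least 4, so it is the only one contributing to α(4).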

Example: Trie of size N_α for q = 4
S = aabaababaabab
Σ |t_i(q)|: 17
Text size: 13
Trie size: 11
We can aggregate all t_i(q) into a trie of size at most the text size.
[Figure: derivation tree of S and the aggregated trie, built from the pieces aabab, aab, and bab.]

Summary
Showed an algorithm for SLP to LZ78 factorization:
- at least as fast as naïve decompress & process
- better when the string is compressible

  Algorithm                       Time                      Space
  Direct (uncompressed)           O(N log σ)                O(m)
  Decompress + Direct             O(N log σ)                O(n + m)
  SLP (partial decompressions)    O(n N^{1/2} + m log N)    O(n N^{1/2} + m)
  SLP + Doubling                  O(nL + m log N)           O(nL + m)
  SLP + Redundancy Reduction      O(N_α + m log N)          O(N_α + m)

N: length of the uncompressed string S
σ: alphabet size
n: size of the SLP representing S
m: number of LZ78 factors (O(N / log N) for constant σ)
L: length of the longest LZ78 factor
N_α = N - α(c_N) ≤ N