A simple storage scheme for strings achieving entropy bounds
Paolo Ferragina and Rossano Venturini
Dipartimento di Informatica, University of Pisa, Italy
The Problem
Given a string $S[1, n]$ drawn from an alphabet of size $\sigma$:
- encode S in a compressed data structure S' within entropy bounds;
- extract any substring of $\Theta(\log n)$ symbols in constant time.
Thus, S' completely replaces S under the RAM model.
Previous works
Sadakane and Grossi [SODA'06] introduced a scheme:
- space: $nH_k(S) + o(n \log \sigma)$ bits;
- tools: Ziv-Lempel string encoding, succinct dictionaries, and data structures for path decoding in LZ-tries.
González and Navarro [CPM'06] simplify it:
- slightly better space complexity in the $o(\cdot)$ term, but the order k must be fixed in advance;
- tools: a statistical encoder (namely, Arithmetic encoding), succinct binary dictionaries, and tables.
The $o(\cdot)$ term depends on k. The scheme is effective when $k = o(\log_\sigma n)$.
Our work
We propose a simpler storage scheme that:
- improves the space complexity;
- drops the use of any compressor (either LZ-like or statistical);
- deploys only binary encodings and tables.
An interesting corollary: our scheme, applied to the Burrows-Wheeler transformed string bwt(S), achieves a compressed space of $\min(nH_k(S), nH_k(\mathrm{bwt}(S))) + o(n \log \sigma)$ bits.
- This is the first time that a bound of this kind has been achieved.
- There are cases in which one entropy is smaller than the other.
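Our storage scheme
[Figure: S partitioned into blocks of size b; the table T of distinct blocks ordered by frequency; the set B of binary strings; the bit string V of concatenated codewords; the pointer array P.]
- S is partitioned into n/b blocks of size $b = \frac{1}{2} \log_\sigma n$, so there are $O(\sigma^b) = O(n^{1/2})$ distinct blocks.
- A table T stores the distinct blocks sorted by decreasing frequency of occurrence in S's partition.
- The function enc encodes the i-th block of T with the i-th element of B (the binary strings taken in order of increasing length).
- V is the concatenation of the codewords enc(·) of the blocks of S, in order.
- The enc()s are not uniquely decodable codewords, so a pointer to the start in V of each codeword is needed. These pointers form the array P.
- P is stored using a two-level storage scheme.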
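The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: B is taken to be the sequence '', '0', '1', '00', '01', ... (empty string first, which is consistent with the position formula 2^len + d used when decoding), V is a Python string of '0'/'1' characters rather than a packed bit vector, and P is a plain list rather than the two-level structure. The names `codeword` and `build` are hypothetical.

```python
from collections import Counter

def codeword(i):
    """Return the i-th element (1-based) of B = '', '0', '1', '00', '01', ...

    bin(i) looks like '0b1xxx'; dropping '0b' and the leading 1 leaves
    exactly the i-th binary string in order of increasing length.
    """
    return bin(i)[3:]

def build(S, b):
    """Return (T, V, P) for S split into consecutive blocks of b symbols."""
    blocks = [S[j:j + b] for j in range(0, len(S), b)]
    freq = Counter(blocks)
    # Distinct blocks by decreasing frequency; ties are broken
    # lexicographically (an arbitrary choice made for this sketch).
    T = sorted(freq, key=lambda blk: (-freq[blk], blk))
    rank = {blk: i for i, blk in enumerate(T, start=1)}

    parts, P, pos = [], [], 0
    for blk in blocks:
        cw = codeword(rank[blk])
        P.append(pos)          # start of this block's codeword in V
        parts.append(cw)
        pos += len(cw)
    P.append(pos)              # sentinel: end of the last codeword
    return T, "".join(parts), P

T, V, P = build("aabbaaccaabb", 2)
print(T, V, P)
```

In this toy input the most frequent block receives the empty codeword, which is why the pointers in P are needed to delimit codewords in V.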
Decode a block in constant time
[Figure: S, P, V, T and B, with the 5-th block of S highlighted.]
Example: extract the 5-th block of S.
- Access P[5] and P[6].
- Fetch the codeword 00 from V.
- len = P[6] - P[5] = 2. Now that its length is known, the codeword is uniquely decodable.
- Access T at position $2^{len} + d$, where d is the value of the codeword read as a binary number: $2^2 + (00)_2 = 4$.
Since len = O(log n) bits, all operations are executed in constant time.
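The decoding steps can be sketched as follows. The sketch assumes 1-based block numbers and table positions as on the slide, stored in 0-based Python lists, and uses a '0'/'1' string for V, so it illustrates the arithmetic rather than the constant-time RAM implementation. The function names and the small T, V, P instance are hypothetical.

```python
def table_position(length, d):
    """Position in T (1-based) of the codeword of given length and value d."""
    return (1 << length) + d

def extract_block(i, T, V, P):
    """Return the i-th block (1-based) of S, given T, V and P."""
    start, end = P[i - 1], P[i]        # boundaries of the i-th codeword in V
    length = end - start
    # An empty codeword has value 0 and maps to position 2^0 + 0 = 1.
    d = int(V[start:end], 2) if length else 0
    return T[table_position(length, d) - 1]

# Toy instance: S = "aabbaaccaabb" split into blocks of 2 symbols.
T = ['aa', 'bb', 'cc']        # distinct blocks by decreasing frequency
V = '010'                     # concatenated codewords '', '0', '', '1', '', '0'
P = [0, 0, 1, 1, 2, 2, 3]     # codeword starts, plus a final sentinel

print(extract_block(4, T, V, P))
```

The slide's example corresponds to `table_position(2, 0)`, that is, a codeword of length 2 with value 0.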
Space analysis
Blocks table T:
- $O(\sigma^b) = O(n^{1/2})$ entries;
- each entry is represented with $O(\log n)$ bits;
- T requires $O(n^{1/2} \log n) = o(n)$ bits.
Starting positions of the encs (P):
- we use a two-level storage scheme [Munro 96], which takes $O((n/b) \log\log n) = o(n \log \sigma)$ bits.
The real challenge is to bound the space of V. Let us show it by introducing an alternative encoding whose bound is simpler to evaluate.
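Empirical entropy
The 0-th order empirical entropy of S is defined as
$$H_0(S) = -\sum_{c \in \Sigma} P(c) \log P(c)$$
where P(c) is the frequency of the symbol c in S.
We define $w_S$ as the string of symbols following the context w in S.
Example: let S = mississippi and w = si; then $w_S$ = sp.
The k-th order empirical entropy of S is defined as
$$H_k(S) = \frac{1}{n} \sum_{w \in \Sigma^k} |w_S| \, H_0(w_S)$$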
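As a quick check of these definitions, here is a minimal Python sketch that computes $w_S$, $H_0$ and $H_k$ directly. It takes P(c) to be the relative frequency (occurrences of c divided by the string length) and logarithms in base 2. The function names `h0`, `following` and `hk` are hypothetical.

```python
from collections import Counter
from math import log2

def h0(s):
    """0-th order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())

def following(s, w):
    """w_S: the symbols that follow each occurrence of context w in s."""
    k = len(w)
    return "".join(s[i + k] for i in range(len(s) - k) if s[i:i + k] == w)

def hk(s, k):
    """k-th order empirical entropy of s, in bits per symbol."""
    n = len(s)
    contexts = {s[i:i + k] for i in range(n - k)}
    total = 0.0
    for w in contexts:
        ws = following(s, w)
        total += len(ws) * h0(ws)
    return total / n

print(following("mississippi", "si"))
print(h0("mississippi"), hk("mississippi", 1))
```

With k = 0 the only context is the empty string, whose $w_S$ is S itself, so `hk(s, 0)` coincides with `h0(s)`.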
Statistical encoding
For every position $k < i \le n$, $F_i$ denotes the frequency of seeing S[i] within $w_S$, where w = S[i-k, i-1].
Arithmetic encoding represents S within $\sum_{i=k+1}^{n} \log(1/F_i) + O(1)$ bits.
Grouping all the terms referring to the same k-th order context w, we obtain a summation upper bounded by $nH_k(S) + O(1)$ bits.
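The grouping step can be written out explicitly. The notation $n_{w,c}$, for the number of occurrences of symbol c in $w_S$, is introduced here only for this derivation; with it, $F_i = n_{w,c} / |w_S|$ whenever position i has context w and symbol c.

```latex
\begin{align*}
\sum_{i=k+1}^{n} \log\frac{1}{F_i}
  &= \sum_{w \in \Sigma^k} \sum_{c \in \Sigma} n_{w,c}\,\log\frac{|w_S|}{n_{w,c}} \\
  &= \sum_{w \in \Sigma^k} |w_S|\, H_0(w_S) \\
  &= n\,H_k(S).
\end{align*}
```

The first equality collects the positions sharing the same context and symbol, the second applies the definition of $H_0$ to each $w_S$, and the third is the definition of $H_k$.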
Blocked statistical encoding
Let us consider a compressor E that encodes each block $S_i$ of S individually:
- the first k symbols are represented explicitly with $k \log \sigma$ bits;
- the remaining b-k symbols are encoded with the k-th order Arithmetic encoder.
The codeword so assigned to $S_i$ uniquely identifies it among the other distinct blocks.
This blocking approach increases the previous bound by $O((n/b)\, k \log \sigma) = o(n \log \sigma)$ bits, with $k = o(\log_\sigma n)$. The extra term accounts for the cost of storing the first k symbols of each of the n/b blocks.
Our bound: $|V| + o(n \log \sigma)$
Let us show that $|V| \le |E(S)| \le nH_k(S) + o(n \log \sigma)$.
- The codewords assigned by E are a subset of B.
- The codewords assigned by enc are the shortest binary strings in B.
- enc is better than E because it follows a golden rule in data compression: it assigns the shortest codewords to the most frequent blocks.
Thus, the space occupancy of our scheme is $nH_k(S) + o(n \log \sigma)$ bits, for $k = o(\log_\sigma n)$.
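The golden rule can be checked on a toy instance. The sketch below assumes that the i-th element of B has length $\lfloor \log_2 i \rfloor$ (B listed as '', '0', '1', '00', ...) and compares the total codeword length obtained with the frequency-sorted table against every other ordering of the same table. The frequencies and function names are made up for illustration.

```python
from itertools import permutations

def codeword_length(i):
    """Length in bits of the i-th element (1-based) of B: floor(log2 i)."""
    return i.bit_length() - 1

def total_bits(freqs_in_table_order):
    """|V| when the block at table position i occurs freqs[i-1] times."""
    return sum(f * codeword_length(i)
               for i, f in enumerate(freqs_in_table_order, start=1))

freqs = [1, 7, 3, 2, 5]                       # hypothetical block frequencies
best = total_bits(sorted(freqs, reverse=True))
worst_case_free = all(best <= total_bits(p) for p in permutations(freqs))
print(best, total_bits(freqs), worst_case_free)
```

Sorting by decreasing frequency pairs the largest counts with the shortest codewords, so no other assignment of these codewords to blocks gives a smaller |V|.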
Summary of the main result
We presented a storage scheme with:
- O(1)-time access to any substring of length $\Theta(\log n)$;
- space occupancy of $nH_k(S) + o(n \log \sigma)$ bits;
- a better space bound in the $o(\cdot)$ term and a much simpler approach.
This can be used to convert any succinct data structure into a compressed data structure.
Open problems
- The $o(\cdot)$ term should be investigated more deeply, because it usually dominates the k-th order entropy term.
- Experiments are needed.
Thank you!!!