Published by Abigayle Sherilyn Rice. Modified over 8 years ago.
1
A simple storage scheme for strings achieving entropy bounds
Paolo Ferragina and Rossano Venturini
Dipartimento di Informatica, University of Pisa, Italy
2
The Problem
Given a string S[1, n] drawn from an alphabet of size σ:
- encode S in a compressed data structure S' within entropy bounds
- extract any substring of Θ(log_σ n) symbols in constant time
Thus, S' completely replaces S under the RAM model.
3
Previous works
Sadakane and Grossi [SODA'06] introduced a scheme using nH_k(S) + o(n log σ) bits
- it combines Ziv-Lempel string encoding, succinct dictionaries, and data structures for path decoding in LZ-tries
González and Navarro [CPM'06] simplify it
- slightly better space complexity in the o() term, but the order k must be fixed in advance
- it uses a statistical encoder (namely, Arithmetic encoding), succinct binary dictionaries, and tables
The o() term depends on k. The scheme is effective when k = o(log_σ n).
4
Our work
We propose a simpler storage scheme that
- improves the space complexity
- drops the use of any compressor (either LZ-like or statistical)
- deploys only binary encodings and tables
An interesting corollary
- our scheme, applied to the Burrows-Wheeler transformed string bwt(S), achieves a compressed space of min(nH_k(S), nH_k(bwt(S))) + o(n log σ) bits
- this is the first time that a bound of this kind is achieved
- there are cases in which one entropy is smaller than the other
5
Our storage scheme
- S is partitioned into n/b blocks of b = ½ log_σ n symbols each, so there are O(σ^b) = O(n^½) distinct blocks.
- A table T stores the distinct blocks sorted by decreasing frequency of occurrence in S's partition.
- The function enc encodes the i-th block of T with the i-th element of B.
- The enc()s are not uniquely decodable codewords, so a pointer to the start in V of each codeword is needed.
- P is stored using a two-level storage scheme.
[Figure: S split into blocks of b symbols; the table T, ordered by decreasing frequency, next to the list B = 0, 1, 00, 01, 10, 11, 000, ...; V = 01000... holding the concatenated codewords enc(...) enc(...) ...; P = 1, 1, 2, 2, 3, 5, 6, ... holding their starting positions in V.]
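The construction on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names `canonical_codeword` and `build` are made up. It also assumes that the empty string is the first element of B, which is what the zero-length entries in the figure's P (for example P[1] = P[2] = 1) and the position formula on the next slide suggest. The input string is hypothetical and was chosen so that V and P come out equal to the prefixes shown in the figure.

```python
from collections import Counter

def canonical_codeword(rank):
    """Return the rank-th (1-based) element of B = '', 0, 1, 00, 01, 10, 11, 000, ...
    Assumption: B starts with the empty string."""
    # bin(rank) looks like '0b1xyz'; dropping '0b1' leaves the codeword 'xyz'.
    return bin(rank)[3:]

def build(S, b):
    """Build T, V and P for string S with block size b (illustrative sketch)."""
    # Partition S into blocks of b symbols (the last block may be shorter).
    blocks = [S[i:i + b] for i in range(0, len(S), b)]
    # T: distinct blocks by decreasing frequency (ties keep first-occurrence order).
    T = [blk for blk, _ in Counter(blocks).most_common()]
    rank = {blk: r for r, blk in enumerate(T, start=1)}
    codewords = [canonical_codeword(rank[blk]) for blk in blocks]
    # P[i] is the 1-based start in V of the i-th codeword, plus one final sentinel.
    P, pos = [], 1
    for cw in codewords:
        P.append(pos)
        pos += len(cw)
    P.append(pos)
    return T, "".join(codewords), P

# Hypothetical input whose blocks are aa bb aa cc dd bb.
T, V, P = build("aabbaaccddbb", 2)
print(T, V, P)
```

A real implementation would pack V and P into machine words; Python strings and lists are used here only for readability.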
6
Decode a block in constant time
Extract the 5-th block of S:
- access P[5] and P[6], and compute len = P[6] - P[5] = 2
- fetch the codeword 00 from V; with len known, the codeword is now uniquely decodable
- access T in position 2^len + d = 2^2 + (00)_2 = 2^2 + 0 = 4, where d is the integer value of the codeword
Since len = O(log n) bits, all operations are executed in constant time.
[Figure: the arrays S, P, V and the table T with B from the previous slide, highlighting P[5] = 3, P[6] = 5 and the codeword 00 inside V = 01000...]
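The decoding steps above can be traced with a short Python sketch. It reuses the values visible in the figure (V = 01000..., P = 1, 1, 2, 2, 3, 5, 6, ...). The contents of T are hypothetical placeholders, since the slide does not show them, and `decode_block` is an illustrative name. The sketch assumes 1-based positions as on the slide and treats the empty codeword as having value d = 0.

```python
def decode_block(i, T, V, P):
    """Return the i-th block of S (1-based), following the steps on the slide."""
    start, end = P[i - 1], P[i]          # P[i] and P[i+1] in the slide's 1-based notation
    length = end - start                 # len = P[i+1] - P[i]
    codeword = V[start - 1:end - 1]      # fetch len bits from V
    d = int(codeword, 2) if codeword else 0
    position = 2 ** length + d           # 1-based position in T
    return T[position - 1]

T = ["aa", "bb", "cc", "dd"]             # hypothetical block contents
V = "01000"
P = [1, 1, 2, 2, 3, 5, 6]

print(decode_block(5, T, V, P))          # codeword 00, position 2^2 + 0 = 4
```

On a word RAM each step is a constant number of word operations because len = O(log n); the Python slicing here only mimics that behaviour.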
7
Space analysis
- Blocks table T: O(σ^b) = O(n^½) entries, each represented with O(log n) bits, so T requires O(n^½ log n) = o(n) bits
- We use a two-level storage scheme [Munro 96] for the starting positions of the encs (P): [...] bits
- The real challenge is to bound the space of V; let us show it by introducing an alternative encoding whose bound is simpler to evaluate
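The slide only cites [Munro 96] for the pointer array, so the following is a generic sketch of the two-level idea rather than the exact layout used in the talk. It assumes that an absolute position is stored once per superblock of `s` consecutive pointers, and a small relative offset for every pointer. The names `build_two_level` and `get_pointer`, and the parameter `s`, are hypothetical.

```python
def build_two_level(P, s):
    """Split the pointer array P into per-superblock bases and small offsets."""
    # One absolute (wide) value every s pointers.
    bases = [P[j] for j in range(0, len(P), s)]
    # One relative (narrow) value per pointer, measured from its superblock base.
    offsets = [P[j] - bases[j // s] for j in range(len(P))]
    return bases, offsets

def get_pointer(j, bases, offsets, s):
    """Recover P[j] (0-based j) with two lookups and one addition."""
    return bases[j // s] + offsets[j]

P = [1, 1, 2, 2, 3, 5, 6]
bases, offsets = build_two_level(P, 3)
print(bases, offsets)
```

The saving comes from the offsets: each one is bounded by s times the longest codeword, so it fits in far fewer bits than an absolute position in V.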
8
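Empirical entropy
The 0-th order empirical entropy of S is defined as
$$H_0(S) = -\sum_{c \in \Sigma} P(c) \log_2 P(c)$$
where P(c) is the relative frequency of the symbol c in S.
We define w_S as the string of symbols following the occurrences of the context w in S.
Let S = mississippi and w = si, then w_S = sp.
The k-th order empirical entropy of S is defined as
$$H_k(S) = \frac{1}{n} \sum_{w \in \Sigma^k} |w_S| \, H_0(w_S)$$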
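The definitions on this slide can be checked numerically with a small Python sketch. It assumes the standard formulas H_0(S) = -Σ_c P(c) log_2 P(c) and H_k(S) = (1/n) Σ_w |w_S| H_0(w_S), with logarithms in base 2. The function names `h0`, `contexts` and `hk` are illustrative.

```python
from collections import Counter, defaultdict
from math import log2

def h0(s):
    """0-th order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())

def contexts(s, k):
    """Map every length-k context w to w_S, the symbols that follow w in s."""
    following = defaultdict(str)
    for i in range(k, len(s)):
        following[s[i - k:i]] += s[i]
    return dict(following)

def hk(s, k):
    """k-th order empirical entropy of s, in bits per symbol."""
    return sum(len(ws) * h0(ws) for ws in contexts(s, k).values()) / len(s)

S = "mississippi"
print(contexts(S, 2)["si"])   # the slide's example: w = si gives w_S = sp
print(h0(S), hk(S, 1), hk(S, 2))
```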
9
Statistical encoding
For every position k < i ≤ n, F_i denotes the frequency of seeing S[i] within w_S, where w = S[i-k, i-1].
Arithmetic encoding represents S within [...] bits.
Grouping all the terms referring to the same k-th order context w, we obtain a summation upper bounded by [...] bits.
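The grouping step can be made explicit with a short derivation. The slide's own formulas did not survive extraction, so this is a reconstruction under stated assumptions: F_i is the relative frequency of S[i] in w_S, the k-th order entropy is defined by n H_k(S) = Σ_w |w_S| H_0(w_S), and n_{w,c} (notation introduced here) is the number of occurrences of c in w_S. Terms with n_{w,c} = 0 are omitted. The constant overhead of the arithmetic coder and the cost of the first k symbols are left out.

```latex
\begin{align*}
\sum_{i=k+1}^{n} \log_2 \frac{1}{F_i}
  &= \sum_{w \in \Sigma^k} \sum_{c \in \Sigma} n_{w,c} \log_2 \frac{|w_S|}{n_{w,c}} \\
  &= \sum_{w \in \Sigma^k} |w_S| \, H_0(w_S) \\
  &= n H_k(S)
\end{align*}
```

The first equality collects the positions i that share the same context w and the same symbol c = S[i], for which F_i = n_{w,c} / |w_S|.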
10
Blocked statistical encoding
Let us consider a compressor E that encodes each block S_i of S individually:
- the first k symbols are represented explicitly with k log σ bits
- the remaining b-k symbols are encoded with the k-th order Arithmetic encoder
The codeword so assigned to S_i uniquely identifies it among the other distinct blocks.
This blocking approach increases the previous bound by O((n/b) k log σ) = o(n log σ) bits, with k = o(log_σ n); the increase accounts for the cost of storing the first k symbols of each of the n/b blocks.
11
Our bound: |V| + o(n log σ)
Let us show that |V| ≤ |E(S)| ≤ nH_k(S) + o(n log σ)
- the codewords assigned by E are a subset of B
- the codewords assigned by enc are the shortest binary strings in B
- enc is better than E because it follows a golden rule in data compression: it assigns the shortest codewords to the most frequent blocks
Thus, the space occupancy of our scheme is nH_k(S) + o(n log σ) bits (for k = o(log_σ n)).
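The "shortest codewords to the most frequent blocks" argument can be illustrated numerically. The sketch below does not reproduce the compressor E from the talk. As a stand-in, it compares |V| with the zero-order entropy of the block sequence, which lower-bounds any encoding that assigns an ideal code length log_2(1/p) to a block of relative frequency p. It assumes B starts with the empty string, so the block of rank r gets a codeword of floor(log_2 r) bits. The function names and the sample blocks are hypothetical.

```python
from collections import Counter
from math import log2

def enc_total_bits(blocks):
    """|V|: total length of the rank-based codewords over all blocks."""
    counts = sorted(Counter(blocks).values(), reverse=True)
    # The block of rank r (1-based) gets a codeword of floor(log2 r) bits.
    return sum(c * (r.bit_length() - 1) for r, c in enumerate(counts, start=1))

def entropy_bits(blocks):
    """Number of blocks times the zero-order empirical entropy of the block sequence."""
    n = len(blocks)
    return sum(c * log2(n / c) for c in Counter(blocks).values())

blocks = ["aa", "bb", "aa", "cc", "dd", "bb"]   # hypothetical blocks
print(enc_total_bits(blocks), entropy_bits(blocks))
```

The inequality holds in general: the block of rank r has relative frequency p_r ≤ 1/r, so floor(log_2 r) ≤ log_2(1/p_r) for every block.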
12
Summary of the main result
We presented a storage scheme with
- O(1)-time access to any substring of length Θ(log_σ n)
- space occupancy of nH_k(S) + o(n log σ) bits
- a better space bound in the o() term and a much simpler approach
This can be used to convert any succinct data structure into a compressed data structure.
Open problems
- the o() term should be investigated more deeply, because it usually dominates the k-th order entropy term
- experiments are needed
13
Thank you!!!