Succinct Data Structures
Kunihiko Sadakane, National Institute of Informatics
Compressing Arrays
Input: an array (string) S[0..n−1] of length n, S[i] ∈ A, alphabet size σ = |A|
Query: return a substring of S at a given position, S[i..i+w−1] (w = O(log_σ n) characters, i.e. O(log n) bits), in O(1) time on the word RAM with word length O(log n) bits
Index size: nHk(S) + o(n log σ) bits
If the compressed suffix array is used to represent the string, the query time is not O(1)
Theorem
Size: asymptotically the same as LZ78 [3]
Consecutive log_σ n characters (log n bits) of S at any given position i are obtained in O(1) time
This access time is the same as on the uncompressed string, so the data can be regarded as uncompressed
LZ78 (LZW) Compression [3]
Divide the string into phrases using a dictionary; each phrase is encoded as a number, and the dictionary is updated after every phrase
The compression ratio converges to the entropy as the string grows
Example: the input aaabaabaabab is parsed into the phrases a | aa | b | aab | aaba | b, and the output is (0,a)(1,a)(0,b)(2,b)(4,a)(3)
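The parsing above can be sketched as follows; this is a minimal LZ78 parser, kept simple for exposition (a real compressor would emit the tokens as bit strings rather than Python tuples).

```python
def lz78_parse(s):
    """LZ78: greedily split s into phrases, each a previously seen
    phrase extended by one character.  Returns the output tokens
    (phrase_index, char), where index 0 denotes the empty phrase."""
    dictionary = {"": 0}                        # phrase -> index
    tokens = []
    cur = ""
    for ch in s:
        if cur + ch in dictionary:
            cur += ch                           # keep extending the match
        else:
            tokens.append((dictionary[cur], ch))
            dictionary[cur + ch] = len(dictionary)  # register new phrase
            cur = ""
    if cur:                                     # final phrase may repeat
        tokens.append((dictionary[cur], ""))
    return tokens

print(lz78_parse("aaabaabaabab"))
# → [(0, 'a'), (1, 'a'), (0, 'b'), (2, 'b'), (4, 'a'), (3, '')]
```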
Compression Ratio of LZ78
Let c be the number of phrases when a string S of length n is parsed
Compressed size: c(log c + log σ) bits (each phrase = a dictionary index plus one character)
If S is generated from a stationary ergodic source with entropy H, the compressed size divided by n converges to H
For the order-k empirical entropy Hk (σ: alphabet size), the compressed size is nHk(S) + O(n(k log σ + log log n)/log_σ n) bits [4]
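The size bound c(log c + log σ) can be checked numerically; a sketch on a toy highly compressible input (the input string and block sizes are illustrative choices, not from the slides):

```python
import math

def lz78_parse(s):
    """Same LZ78 parser as before; returns the list of output tokens."""
    dictionary, tokens, cur = {"": 0}, [], ""
    for ch in s:
        if cur + ch in dictionary:
            cur += ch
        else:
            tokens.append((dictionary[cur], ch))
            dictionary[cur + ch] = len(dictionary)
            cur = ""
    if cur:
        tokens.append((dictionary[cur], ""))
    return tokens

s = "ab" * 500                    # toy input: n = 1000, sigma = 2
c = len(lz78_parse(s))            # number of phrases
sigma = len(set(s))
# each token: ceil(log2 c) bits for the index + ceil(log2 sigma) for the char
compressed = c * (math.ceil(math.log2(c)) + math.ceil(math.log2(sigma)))
raw = len(s) * math.ceil(math.log2(sigma))
print(c, compressed, raw)         # far fewer bits than the raw encoding
```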
Difficulty in Partial Decoding
To achieve the order-k entropy, the code for a character must be determined from the preceding k characters
Hence, to decode a substring, its preceding substring is also necessary
However,…
The O(n·k log σ/log_σ n) term in the compressed size of LZ78 indicates that there are O(k log σ) bits of redundancy for each word (log n bits). That is, even if k characters are stored without compression for every word, the redundancy does not increase asymptotically. The information necessary to decode one word is then only its preceding k characters; it is not necessary to decode other parts. Note that k log σ < log n must hold.
Simple Data Structure 1 (S1)
Divide S into blocks of w = ½ log_σ n characters
Encode the characters by a Huffman code: at most n(H0(S)+1) bits
Store pointers to the blocks
The characters in a block are decoded in O(1) time by table lookups
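A sketch of S1's layout: Huffman-code the string, record the starting bit position of each block, and decode a block starting from its pointer. The table lookup is simulated here by bit-by-bit prefix decoding; the block length and sample string are illustrative, not the ½ log_σ n of the slides.

```python
import heapq
from collections import Counter

def huffman_code(s):
    """Classic Huffman construction; returns char -> bit-string code."""
    heap = [[f, [c, ""]] for c, f in Counter(s).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(heapq.heappop(heap)[1:])

w = 4                                    # block length (stands in for ½·log_σ n)
s = "compression_is_fun_" * 8
code = huffman_code(s)
decode = {v: k for k, v in code.items()}
bits = "".join(code[c] for c in s)

ptr, pos = [], 0                         # starting bit position of each block
for i in range(0, len(s), w):
    ptr.append(pos)
    pos += sum(len(code[c]) for c in s[i:i + w])

def read_block(i):
    """Decode block i directly from `bits`, starting at its pointer.
    (S1 would replace this loop by one precomputed table lookup.)"""
    out, cur, p = [], "", ptr[i]
    while len(out) < w and p < len(bits):
        cur += bits[p]; p += 1
        if cur in decode:
            out.append(decode[cur]); cur = ""
    return "".join(out)

print(read_block(2))                     # the characters s[8:12]
```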
Simple Data Structure 2 (S2) [5]
Divide S into blocks of w = ½ log_σ n characters
For each block, the first k characters are stored as-is; the remaining w − k characters are encoded by an arithmetic code conditioned on the context of the preceding k characters
Over all blocks, the space is at most nHk(S) + (n/w)·k log σ + 2n/w bits
The redundancy of the arithmetic codes is at most 2 bits per block, i.e. 2n/w = O(n log σ/log n) bits in total
Store pointers to the blocks
A table decodes the arithmetic codes in O(1) time
In total: nHk(S) + o(n log σ) bits (if k = o(log_σ n))
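All of the bounds above are stated in terms of the order-k empirical entropy Hk(S). A sketch of its definition in code: Hk is the average, over every distinct length-k context, of the order-0 entropy of the characters that follow that context.

```python
import math
from collections import Counter, defaultdict

def H0(s):
    """Order-0 empirical entropy, in bits per character."""
    n = len(s)
    return sum(-f / n * math.log2(f / n) for f in Counter(s).values())

def Hk(s, k):
    """Order-k empirical entropy: weighted average of H0 over the
    strings of characters following each length-k context."""
    if k == 0:
        return H0(s)
    ctx = defaultdict(list)
    for i in range(k, len(s)):
        ctx[s[i - k:i]].append(s[i])
    return sum(len(v) * H0("".join(v)) for v in ctx.values()) / len(s)

s = "abababababababab"
print(H0(s), Hk(s, 1))
# H0 = 1 bit/char, but H1 = 0: each character is determined by its predecessor
```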
Simple Data Structure 3 (S3) [6]
Divide S into blocks of w = ½ log_σ n characters
Regard each block as a single character over the alphabet of σ^w ≤ √n possible block contents, and assign it a code: count the frequency of each block content, then assign the codes 0, 1, 00, 01, 10, 11, 000, 001, … in decreasing order of frequency
Store pointers to the blocks
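A sketch of the S3 encoding (the scheme of [6]); the block size and input are illustrative, and a real implementation would compress the pointers with a two-level scheme rather than store them explicitly:

```python
from collections import Counter
from itertools import count, product

def fv_codes(num):
    """First `num` codewords of the sequence 0, 1, 00, 01, 10, 11, 000, …"""
    out = []
    for length in count(1):
        for tup in product("01", repeat=length):
            out.append("".join(tup))
            if len(out) == num:
                return out

def fv_encode(s, w):
    blocks = [s[i:i + w] for i in range(0, len(s), w)]
    ranked = [b for b, _ in Counter(blocks).most_common()]
    codes = dict(zip(ranked, fv_codes(len(ranked))))  # frequent -> short
    table = {v: k for k, v in codes.items()}          # decode table
    bits, ptr, pos = [], [], 0
    for b in blocks:
        ptr.append(pos)
        bits.append(codes[b])
        pos += len(codes[b])
    return "".join(bits), ptr, table

s = "abcdabcdabcdxyzw" * 4
bits, ptr, table = fv_encode(s, 4)
# random access: block i occupies bits[ptr[i] : ptr[i+1]]
i = 3
end = ptr[i + 1] if i + 1 < len(ptr) else len(bits)
print(table[bits[ptr[i]:end]])    # the block s[12:16]
```

Note that the codes 0, 1, 00, … are not prefix-free; decoding is still unambiguous because the pointers delimit each block's code.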
Size Analysis
Pointers to blocks: O(n log σ · log log n/log n) bits
Table to decode a block's substring from its code: o(n) bits
Lemma: The sum of the code lengths over all blocks is at most the size of S2
Proof: Compare the codes for a block.
S2: the first k characters are stored without compression; the remaining ones are encoded by arithmetic codes.
S3: all w characters are encoded by one code.
In S3, more frequent block contents are assigned shorter codes ⇒ the total code length is not longer than that of S2.
Note: The size of S3 does not depend on k ⇒ the claim holds simultaneously for all k with 0 ≤ k = o(log_σ n).
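The core of the proof is an exchange argument: pairing the length-sorted codes with the frequency-sorted block contents minimizes the total code length over all possible assignments (the rearrangement inequality). A brute-force check on a toy frequency vector (the numbers are illustrative):

```python
from itertools import permutations

code_lengths = [1, 1, 2, 2, 2]     # lengths of the codes 0, 1, 00, 01, 10
freqs = [9, 7, 4, 2, 1]            # block-content frequencies, decreasing

def total(assign):                 # assign[i] = code length given to content i
    return sum(f * l for f, l in zip(freqs, assign))

best = min(total(p) for p in permutations(code_lengths))
greedy = total(code_lengths)       # shortest code to most frequent content
print(greedy, best)                # → 30 30: the greedy assignment is optimal
```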
References
[1] Jesper Jansson, Kunihiko Sadakane, Wing-Kin Sung. Compressed random access memory. arXiv:1011.1708v1.
[2] Kunihiko Sadakane, Gonzalo Navarro. Fully-Functional Succinct Trees. SODA 2010: 134-149.
[3] Jacob Ziv, Abraham Lempel. Compression of Individual Sequences via Variable-Rate Coding. IEEE Transactions on Information Theory, September 1978.
[4] S. Rao Kosaraju, Giovanni Manzini. Compression of Low Entropy Strings with Lempel-Ziv Algorithms. SIAM J. Comput. 29(3): 893-911 (1999).
[5] Rodrigo González, Gonzalo Navarro. Statistical Encoding of Succinct Data Structures. Proc. CPM'06, pages 295-306. LNCS 4009.
[6] P. Ferragina, R. Venturini. A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science, 372(1):115-121, 2007.