Succinct Data Structures

1 Succinct Data Structures
Kunihiko Sadakane, National Institute of Informatics

2 Compressing Arrays
Input: an array (string) S[0..n−1] of length n, S[i] ∈ A, alphabet size σ
Query: return the substring S[i..i+w−1] of S at a given position i (w = O(logσ n) characters, i.e. O(log n) bits)
O(1) time on the word RAM with word length O(log n) bits
Index size: nHk(S) + o(n log σ) bits
If a compressed suffix array is used to represent the string, the query time is not O(1)
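
As a point of reference (not in the slides), the interface such a structure supports, sketched in Python; the class and method names are hypothetical, and this naive version stores S uncompressed, so it meets the O(1) query time but not the nHk(S) + o(n log σ) space bound.

```python
# Hypothetical interface sketch: a compressed-string structure should answer
# access(i, w) -> S[i..i+w-1] in O(1) time within nHk(S) + o(n log sigma) bits.
# This baseline stores S verbatim, so only the query interface matches.

class StringAccess:
    def __init__(self, s: str):
        self.s = s  # a real structure would store a compressed encoding instead

    def access(self, i: int, w: int) -> str:
        """Return the substring S[i..i+w-1]."""
        return self.s[i:i + w]

sa = StringAccess("abracadabra")
print(sa.access(3, 4))  # "acad"
```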

3 Theorem
Size: asymptotically the same as LZ78 [Ziv, Lempel 1978]
Consecutive logσ n characters (i.e. log n bits) of S at a given position i are obtained in O(1) time
This access time is the same as on the uncompressed string, so the data can be regarded as uncompressed

4 LZ78 (LZW) Compression [3]
Divide the string into phrases using a dictionary
Each phrase is encoded as a number (its dictionary index)
Update the dictionary after each phrase
The compression ratio converges to the entropy as the string grows
[Figure: LZ78 parsing of the input "aaabaabaabab", showing the dictionary trie and the output codes; a parsing sketch follows below]
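
As an illustration (not from the original slides), a minimal LZ78 parser in Python; it emits (dictionary index, next character) pairs, with index 0 standing for the empty phrase.

```python
def lz78_parse(s):
    """Parse s into LZ78 phrases; return a list of (index, char) pairs.
    Index 0 is the empty phrase; the dictionary grows by one entry per phrase."""
    dictionary = {"": 0}          # phrase -> index
    output = []
    phrase = ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch          # keep extending the current phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                    # flush the final (already known) phrase
        output.append((dictionary[phrase], ""))
    return output

print(lz78_parse("aaabaabaabab"))
# [(0, 'a'), (1, 'a'), (0, 'b'), (2, 'b'), (4, 'a'), (3, '')]
```

LZW, mentioned in the slide title, differs only in that the dictionary is initialized with all single characters and only indices are output.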

5 Compression Ratio of LZ78
Let c be the number of phrases when a string S of length n is parsed
Compressed size: c(log c + log σ) bits
If S is generated from a stationary ergodic source with entropy H, the compressed size divided by n converges to H
For the order-k empirical entropy Hk (σ: alphabet size), the compressed size is nHk(S) + O(nk log σ / logσ n) bits [4]

6 Difficulty in Partial Decoding
To achieve the order-k entropy, the code for a character must be determined from the preceding k characters
Hence, to decode a substring, its preceding substring is also necessary
However,…

7 The redundancy term O(nk log σ / logσ n) in the compressed size of LZ78 means that there are O(k log σ) bits of redundancy for each word (log n bits) of the string. That is, even if k characters are stored without compression for every word, the redundancy does not increase asymptotically. The information necessary to decode one word is then only the preceding k characters; it is not necessary to decode other parts. Note that k log σ < log n must hold.
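
A short calculation (added here, not on the slide) of why storing k plain characters per word stays within this redundancy:

```latex
% The string occupies n log(sigma) bits, i.e. n log(sigma)/log(n) = n/log_sigma(n)
% words of log(n) bits each. Storing k plain characters (k log(sigma) bits) per word costs
\[
  \frac{n}{\log_\sigma n}\cdot k\log\sigma
  \;=\; O\!\left(\frac{nk\log\sigma}{\log_\sigma n}\right)\ \text{bits},
\]
% the same order as the redundancy term of LZ78; it is o(n log sigma) when k = o(log_sigma n).
```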

8 Simple Data Structure 1 (S1)
Divide S into blocks of w = ½ logσ n characters
Encode the characters by a Huffman code: at most n(1 + H0(S)) bits
Store pointers to the blocks
The characters in a block are decoded in O(1) time by table lookups
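
A rough sketch of the S1 idea (Python; names and layout are mine, not the slides'): Huffman-encode the string, remember the bit offset of every block of w characters, and decode a requested block. A real implementation would replace the bit-by-bit decoding with a precomputed lookup table indexed by O(log n)-bit chunks to get O(1) time.

```python
import heapq, itertools
from collections import Counter

def huffman_code(s):
    """Build a Huffman code (character -> bit string) for the characters of s."""
    tie = itertools.count()                  # tie-breaker so dicts are never compared
    heap = [(f, next(tie), {c: ""}) for c, f in Counter(s).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {c: "0" + v for c, v in c1.items()}
        merged.update({c: "1" + v for c, v in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

class S1:
    """Blocked Huffman-coded string with per-block bit pointers (hypothetical layout)."""
    def __init__(self, s, w):
        self.w, self.code = w, huffman_code(s)
        self.dec = {v: c for c, v in self.code.items()}   # bit string -> character
        self.bits, self.ptr = "", []
        for i in range(0, len(s), w):
            self.ptr.append(len(self.bits))               # bit offset of each block
            self.bits += "".join(self.code[c] for c in s[i:i + w])
        self.ptr.append(len(self.bits))

    def block(self, j):
        """Decode block j; a real structure does this with one table lookup."""
        out, cur = [], ""
        for bit in self.bits[self.ptr[j]:self.ptr[j + 1]]:
            cur += bit
            if cur in self.dec:
                out.append(self.dec[cur])
                cur = ""
        return "".join(out)

s1 = S1("abracadabra", w=4)
print(s1.block(1))   # -> "cada"
```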

9 Simple Data Structure 2 (S2) [5]
Divide S into blocks of w = ½ logσ n characters
For each block, the first k characters are stored as they are; the remaining w − k characters are encoded by arithmetic codes defined by the context of length k
For all blocks, the space is at most nHk(S) + (n/w)(k log σ + 2) bits

10 Redundancy of using arithmetic codes: 2n/w = O(n log σ / log n) bits
Store pointers to the blocks
Table for decoding arithmetic codes in O(1) time
In total, nHk(S) + o(n log σ) bits (if k = o(logσ n))
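
A rough accounting (mine, not the slide's) of where the o(n log σ) total comes from, with w = ½ logσ n and k = o(logσ n):

```latex
% Space of S2, summed over the n/w blocks:
%   arithmetic codes            ->  n H_k(S) bits
%   k plain characters / block  ->  (n/w) k log(sigma) bits   (the stored contexts)
%   +2 bits overhead / block    ->  2n/w bits
\[
  nH_k(S) \;+\; \frac{n}{w}\,k\log\sigma \;+\; \frac{2n}{w}
  \;=\; nH_k(S) + o(n\log\sigma)\ \text{bits},
\]
% since k log(sigma) = o(log n); block pointers and the decoding table add lower-order terms.
```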

11 Simple Data Structure 3 (S3) [6]
Divide S into blocks of w = ½ logσ n characters
Regard each block as a single character (an integer from 1 to σ^w) and count the frequency of each such character
Assign the codes 0, 1, 00, 01, 10, 11, 000, 001, ... in decreasing order of frequency
Store pointers to the blocks
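
A small sketch (Python, details mine) of the S3 encoding: block values sorted by decreasing frequency receive the codes 0, 1, 00, 01, ... and pointers record where each block's code starts. Note that these codes are not prefix-free, which is exactly why the block pointers are needed for decoding.

```python
from collections import Counter

def rank_code(r):
    """r-th code (r = 0, 1, 2, ...) in the sequence 0, 1, 00, 01, 10, 11, 000, ..."""
    r += 2                                   # the sequence is the binary numbers
    return bin(r)[3:]                        # written without their leading 1-bit

class S3:
    def __init__(self, s, w):
        self.w = w
        blocks = [s[i:i + w] for i in range(0, len(s), w)]
        order = [b for b, _ in Counter(blocks).most_common()]
        self.code = {b: rank_code(r) for r, b in enumerate(order)}      # frequent -> short
        self.decode_tbl = {r: b for r, b in enumerate(order)}           # rank -> block text
        self.bits, self.ptr = "", []
        for b in blocks:
            self.ptr.append(len(self.bits))  # bit offset of each block's code
            self.bits += self.code[b]
        self.ptr.append(len(self.bits))

    def block(self, j):
        """Decode block j: read its code and look it up (O(1) with a real table)."""
        c = self.bits[self.ptr[j]:self.ptr[j + 1]]
        return self.decode_tbl[int("1" + c, 2) - 2]

s3 = S3("aaabaabaabaa", w=3)
print(s3.block(2))   # -> "baa"
```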

12 Size Analysis
Pointers to blocks: O(n log σ log log n / log n) bits
Table to decode the substring of a block from its code
Lemma: The sum of the code lengths over all blocks is at most the size of S2 (slide 9)

13 Proof: Codes for a Block
S2: the first k characters are stored without compression; the remaining ones are encoded by arithmetic codes
S3: the w characters of a block are encoded by one code
In S3, more frequent block patterns are assigned shorter codes; among all assignments of distinct bit strings to block values, this minimizes the total length ⇒ the total code length is not longer than that of S2
Note: the size of S3 does not depend on k ⇒ the claim holds simultaneously for all 0 ≤ k = o(logσ n)
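
The counting argument behind this step, written out (my rendering, under the assumption that S2 assigns distinct bit strings to distinct block contents):

```latex
% Let the distinct block values, sorted by decreasing frequency, be B_1, B_2, ...
% with frequencies f_1 >= f_2 >= ...  S3 gives B_r the r-th shortest binary string,
% so for any encoding E that assigns distinct binary strings to distinct block values,
\[
  \sum_r f_r\,\bigl|\mathrm{code}_{S3}(B_r)\bigr|
  \;\le\; \sum_r f_r\,\bigl|\mathrm{code}_{E}(B_r)\bigr|
\]
% (pairing the r shortest strings with the r largest frequencies minimizes the weighted
% sum). Taking E to be the per-block encoding of S2 yields the lemma.
```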

14 References
[1] Jesper Jansson, Kunihiko Sadakane, Wing-Kin Sung. Compressed Random Access Memory. arXiv: v1.
[2] Kunihiko Sadakane, Gonzalo Navarro. Fully-Functional Succinct Trees. SODA 2010.
[3] Jacob Ziv, Abraham Lempel. Compression of Individual Sequences via Variable-Rate Coding. IEEE Transactions on Information Theory, September 1978.
[4] S. Rao Kosaraju, Giovanni Manzini. Compression of Low Entropy Strings with Lempel-Ziv Algorithms. SIAM J. Comput. 29(3), 1999.
[5] Rodrigo González, Gonzalo Navarro. Statistical Encoding of Succinct Data Structures. Proc. CPM 2006, LNCS 4009.
[6] Paolo Ferragina, Rossano Venturini. A Simple Storage Scheme for Strings Achieving Entropy Bounds. Theoretical Computer Science, 372(1):115–121, 2007.

