Succinct Data Structures Kunihiko Sadakane National Institute of Informatics
Succinct Data Structures
Succinct data structures = succinct representation of data + succinct index
Examples: sets; trees, graphs; strings; permutations, functions
Succinct Representation
A representation of data whose size (roughly) matches the information-theoretic lower bound. If the input is taken from L distinct ones, its information-theoretic lower bound is log L bits (the base of the logarithm is 2).
Example 1: a lower bound for a set S ⊆ {1, 2, ..., n} is log 2^n = n bits. (Figure: for n = 3 there are 2^3 = 8 possible sets {}, {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}.)
Example 2: an n-node ordered tree: 2n − Θ(log n) bits (the number of n-node ordered trees is the Catalan number C(n−1) ≈ 4^n / n^{3/2}; the figure shows the case n = 4).
Example 3: a length-n string: n log σ bits (σ: alphabet size).
For any data, there exists a succinct representation: enumerate all L possible inputs in some order, and assign each a code of log L bits. (Figure: the five 4-node ordered trees, assigned codes 000, 001, 010, 011, 100.) However, it is not clear whether such a representation supports efficient queries. Query time depends on how the data are represented, so it is necessary to use appropriate representations.
Succinct Indexes
Auxiliary data structures to support queries.
Size: o(log L) bits.
(Almost) the same time complexity as using conventional data structures.
Computation model: word RAM with word length w = Θ(log L) (the same pointer size as conventional data structures).
Word RAM
A word RAM with word length w bits supports reading/writing w bits of memory at an arbitrary address in constant time. Arithmetic/logical operations on two w-bit numbers are done in constant time.
Arithmetic ops.: +, −, *, /, msb (position of the most significant bit)
Logical ops.: and, or, not, shift
These operations can be done in constant time using O(2^{εw})-bit tables (ε > 0 is an arbitrary constant).
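The table-lookup idea can be sketched in Python. This is a minimal illustration, not part of the original slides: an 8-bit popcount table stands in for the O(2^{εw})-bit tables, and a w-bit word is processed in a constant number of lookups.

```python
# Precomputed table of popcounts for all 8-bit values: a small stand-in
# for the O(2^(eps*w))-bit tables assumed on the word RAM.
POPCOUNT8 = [bin(i).count("1") for i in range(256)]

def popcount(x: int, w: int = 64) -> int:
    """Count the 1 bits of a w-bit word using w/8 table lookups,
    i.e., a constant number for fixed word length."""
    total = 0
    for _ in range(0, w, 8):
        total += POPCOUNT8[x & 0xFF]
        x >>= 8
    return total
```

With larger tables (e.g., 2^{w/2} entries) the number of lookups drops to two, which is the trade-off the ε controls.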
1. Bit Vectors
B: 0,1 vector of length n, B[0]B[1]…B[n−1]
Lower bound of size: log 2^n = n bits
Queries:
rank(B, x): number of ones in B[0..x] = B[0]B[1]…B[x]
select(B, i): position of the i-th 1 from the head (i ≥ 1)
These are the basics of all succinct data structures.
Naïve data structure: store all the answers in arrays: 2n words (2n log n bits), O(1)-time queries.
Example (n = 16): B = 1001010001000000; rank(B, 5) = 3, select(B, 3) = 5, select(B, 4) = 9.
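The naïve data structure is simple enough to state directly. A minimal Python sketch (class and method names are illustrative, not from the slides):

```python
class NaiveRankSelect:
    """Store all answers in arrays: 2n words of space,
    O(1)-time rank and select."""
    def __init__(self, bits):
        self.rank_table = []    # rank_table[x] = number of 1s in B[0..x]
        self.select_table = []  # select_table[i-1] = position of i-th 1
        ones = 0
        for pos, b in enumerate(bits):
            if b:
                ones += 1
                self.select_table.append(pos)
            self.rank_table.append(ones)

    def rank(self, x):
        return self.rank_table[x]

    def select(self, i):
        return self.select_table[i - 1]
```

On the slide's example B = 1001010001000000 this gives rank(5) = 3, select(3) = 5, select(4) = 9. The point of the following sections is to shrink the 2n log n bits of index space to o(n) bits while keeping O(1) time.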
Succinct Index for Rank: 1
Divide B into blocks of length log² n. Store rank(x) at each block boundary in an array R.
Size of R: O((n / log² n) · log n) = O(n / log n) bits
rank(x): O(log² n) time, starting from R[⌊x / log² n⌋] and scanning the block.
Succinct Index for Rank: 2
Divide each block into small blocks of length ½ log n. Store rank(x) at each small-block boundary, relative to the enclosing block boundary, in an array R2.
Size of R2: O((n log log n) / log n) bits (each relative rank is at most log² n, so it fits in O(log log n) bits)
rank(x): O(log n) time, from R1[⌊x / log² n⌋] + R2[⌊x / (½ log n)⌋] plus a scan of the small block.
Succinct Index for Rank: 3
Compute the answers to all possible queries on small blocks in advance, and store them in a table indexed by (small-block pattern, position x). There are 2^{½ log n} = √n patterns of ½ log n bits, so the table occupies O(√n · log n · log log n) bits.
Theorem: Rank on a bit-vector of length n is computed in constant time on a word RAM with word length Θ(log n) bits, using n + O(n log log n / log n) bits.
Note: The size of the table for computing rank inside a small block is O(√n log n log log n) bits. This can be ignored theoretically, but not in practice.
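The two-level directory can be sketched as follows. This is an illustrative Python rendering under simplifying assumptions: the large-block length is rounded to a multiple of the small-block length so the levels align, and the final in-block table lookup is replaced by direct counting for clarity.

```python
import math

class SuccinctRank:
    """Two-level rank directory: absolute ranks at large-block
    boundaries (R1), relative ranks at small-block boundaries (R2)."""
    def __init__(self, bits):
        self.bits = list(bits)
        n = max(len(self.bits), 2)
        lg = max(int(math.log2(n)), 1)
        self.small = max(lg // 2, 1)    # small-block length ~ (1/2) log n
        self.big = self.small * 2 * lg  # large-block length ~ log^2 n
        self.R1, self.R2 = [], []
        ones = 0
        for pos, b in enumerate(self.bits):
            if pos % self.big == 0:
                self.R1.append(ones)                # absolute rank
            if pos % self.small == 0:
                self.R2.append(ones - self.R1[-1])  # rank relative to R1
            ones += b

    def rank(self, x):
        """Number of 1s in bits[0..x]."""
        s = x // self.small
        r = self.R1[x // self.big] + self.R2[s]
        # in the real structure this scan is a single table lookup
        return r + sum(self.bits[s * self.small : x + 1])
```

Because R2 stores only offsets within a large block, its entries need O(log log n) bits each, which is where the o(n) space bound comes from.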
Succinct Index for Select
Select is more difficult than rank. Let i = q · log² n + r · (½ log n) + s.
If i is a multiple of log² n, store select(i) in S1; if i is a multiple of ½ log n, store select(i) in S2.
Elements of S2 may become Θ(n) → Θ(log n) bits are necessary to store each → Θ(n) bits for the entire S2.
select can be computed by binary search using the index for rank, but this takes O(log n) time.
Instead, divide B into large blocks, each of which contains log² n ones, and use one of two types of data structures for each large block.
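The O(log n)-time binary-search fallback mentioned above is worth seeing concretely. A minimal sketch, assuming a rank function is already available (the function name is illustrative):

```python
def select_by_binary_search(rank, n, i):
    """Position of the i-th 1 (i >= 1) in a bit vector of length n,
    using O(log n) calls to rank(x) = number of 1s in B[0..x]."""
    lo, hi = 0, n - 1           # invariant: the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if rank(mid) >= i:      # the i-th 1 is at mid or earlier
            hi = mid
        else:
            lo = mid + 1
    return lo
```

With O(1)-time rank this gives O(log n)-time select; the large-block construction below removes that logarithmic factor.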
If the length of a large block exceeds log^c n: store the positions of its log² n ones explicitly. The number of such large blocks is at most n / log^c n, so the total is O((n / log^c n) · log² n · log n) = O(n log³ n / log^c n) bits. By letting c = 4, the index size is O(n / log n) bits.
If the length m of a large block is at most log^c n: divide it into small blocks of length ½ log n, and construct a complete √(log n)-ary tree storing the small blocks in its leaves. Each node of the tree stores the number of ones in the bit vector stored in its descendants. The number of ones in a large block is log² n → each value fits in 2 log log n bits.
The height of the tree is O(c). The space for storing the values in the nodes of one large block is O((m / log n) · log log n) bits, so the space for the whole vector is O(n log log n / log n) bits.
To compute select(i), search the tree from the root. The information about the children of a node is represented in O(√(log n) · log log n) = o(log n) bits → the child to visit is determined by a table lookup in constant time. The search time is O(c), i.e., constant.
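The root-to-leaf search can be sketched as follows. This is an illustrative Python rendering (function names are not from the slides): the tree over small-block 1-counts is stored level by level, and the linear scan over a node's children stands in for the constant-time table lookup.

```python
def build_tree(counts, k):
    """Complete k-ary tree over small-block 1-counts; each internal
    node stores the number of 1s below it. Levels listed root first."""
    levels = [list(counts)]
    while len(levels[0]) > 1:
        prev = levels[0]
        parent = [sum(prev[j:j + k]) for j in range(0, len(prev), k)]
        levels.insert(0, parent)
    return levels

def select_in_block(levels, k, i):
    """Find the small block holding the i-th 1 (i >= 1) of this large
    block. Returns (leaf index, rank of the 1 within that leaf)."""
    node = 0
    for depth in range(1, len(levels)):
        child = node * k            # first child on the next level
        while levels[depth][child] < i:
            i -= levels[depth][child]
            child += 1              # one table lookup in the real structure
        node = child
    return node, i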
Theorem: rank and select on a bit-vector of length n are computed in constant time on a word RAM with word length Θ(log n) bits, using n + O(n log log n / log n) bits.
Note: rank inside a large block is computed using the index for select, by traversing the tree from a leaf to the root and summing rank values.
Extension of Rank/Select (1)
Queries on 0 bits:
rank_c(B, x): number of c's in B[0..x] = B[0]B[1]…B[x]
select_c(B, i): position of the i-th c in B (i ≥ 1)
From rank0(B, x) = (x + 1) − rank1(B, x), rank0 is computed in O(1) time using no additional index.
select0 cannot be computed from the index for select1; add an index similar to that for select1.
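The rank0 identity is a one-liner; a minimal sketch (with rank1 done by direct counting for illustration):

```python
def rank1(bits, x):
    """Number of 1s in bits[0..x] (direct count, for illustration)."""
    return sum(bits[:x + 1])

def rank0(bits, x):
    """rank0(B, x) = (x + 1) - rank1(B, x): every position up to x
    holds either a 0 or a 1, so no extra index is needed."""
    return (x + 1) - rank1(bits, x)
```

No such complementation works for select0: the position of the i-th 0 is not a simple function of the positions of the 1s, hence the separate index.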
Extension of Rank/Select (2)
Compression of sparse vectors: a 0,1 vector of length n having m ones. Lower bound of size: log C(n, m) ≈ m log(n/m) bits if m ≪ n.
Operations to be supported:
rank_c(B, x): number of c's in B[0..x] = B[0]B[1]…B[x]
select_c(B, i): position of the i-th c in B (i ≥ 1)
Entropy of a String
Definition: the order-0 entropy H0 of a string S is
H0(S) = Σ_c p_c log(1 / p_c)
(p_c: probability of appearance of letter c).
Definition: order-k entropy. Assumption: Pr[S[i] = c] is determined from S[i−k..i−1] (the context). With n_s the number of letters whose context is s, and p_{s,c} the probability of c appearing in context s,
H_k(S) = Σ_s (n_s / n) Σ_c p_{s,c} log(1 / p_{s,c}).
(Example: in S = abcababc with k = 2, the context of the final c is ab.)
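The two definitions translate directly into code. A minimal Python sketch (note that conventions differ on whether the first k letters, which lack a full context, are counted; here they are skipped but n = |S| is used as the normalizer):

```python
from collections import Counter
from math import log2

def order0_entropy(s: str) -> float:
    """H0(S) = sum over letters c of p_c * log2(1/p_c)."""
    n = len(s)
    return sum((m / n) * log2(n / m) for m in Counter(s).values())

def order_k_entropy(s: str, k: int) -> float:
    """H_k(S): group each letter by its length-k context and average
    the order-0 entropies of the groups, weighted by n_s / n."""
    groups = {}
    for i in range(k, len(s)):
        groups.setdefault(s[i - k:i], []).append(s[i])
    n = len(s)
    return sum((len(g) / n) * order0_entropy("".join(g))
               for g in groups.values())
```

For instance, "abababab" has H0 = 1 bit per letter but H1 = 0, since each letter is fully determined by its predecessor.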
The information-theoretic lower bound for a sparse vector asymptotically matches the order-0 entropy of the vector: log C(n, m) = nH0(B) + O(log n).
Compressing Vectors (1)
Divide the vector into small blocks of length l = ½ log n. Let m_i denote the number of 1's in the i-th small block B_i. A small block is represented by the pair (m_i, its index among the C(l, m_i) patterns with m_i ones), i.e., in ⌈log(l+1)⌉ + ⌈log C(l, m_i)⌉ bits. The total space for all blocks is Σ_i log C(l, m_i) + O(n log log n / log n) ≤ log C(n, m) + O(n log log n / log n) bits.
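The (class, index) pair is the combinatorial number system: a block with m ones is identified by the rank of its set of 1-positions among all C(l, m) possibilities. A minimal sketch (function names are illustrative):

```python
from math import comb

def encode_block(block):
    """Encode a 0/1 block as (m, idx): m = number of ones, idx = rank
    of the block among all C(l, m) patterns with m ones, so idx fits
    in ceil(log2 C(l, m)) bits (combinatorial number system)."""
    ones = [pos for pos, b in enumerate(block) if b]
    idx = sum(comb(p, j + 1) for j, p in enumerate(ones))
    return len(ones), idx

def decode_block(l, m, idx):
    """Inverse of encode_block for a block of length l."""
    block = [0] * l
    for j in range(m, 0, -1):
        p = j - 1                      # find largest p: comb(p, j) <= idx
        while comb(p + 1, j) <= idx:
            p += 1
        block[p] = 1
        idx -= comb(p, j)
    return block
```

In the real structure decoding is a table lookup on (m_i, idx) rather than this loop, since l = ½ log n makes the table small (o(n) bits).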
It is also necessary to store pointers to the small blocks. The indexes for rank and select are the same as those for uncompressed vectors: O(n log log n / log n) bits; the numbers of 1's in small blocks are already stored in the index for select.
Pointers to small blocks: let p_i denote the pointer to B_i; then 0 = p_0 < p_1 < … < p_{n/l} < n (less than n because the vector is compressed), and p_i − p_{i−1} ≤ ½ log n.
If i is a multiple of log n, store p_i as it is (no compression); otherwise, store p_i as the difference from the last uncompressed pointer, which fits in O(log log n) bits.
Theorem: A bit-vector of length n with m ones is stored in log C(n, m) + O(n log log n / log n) bits, and rank0, rank1, select0, select1 are computed in O(1) time on a word RAM with word length Θ(log n) bits. The data structure is constructed in O(n) time.
This data structure is called an FID (fully indexable dictionary) (Raman, Raman, Rao [2]).
Note: if m ≪ n, it can happen that log C(n, m) = o(n log log n / log n), and then the size of the index for rank/select cannot be ignored.
References
[1] G. Jacobson. Space-efficient Static Trees and Graphs. In Proc. IEEE FOCS, pages 549–554, 1989.
[2] R. Raman, V. Raman, and S. S. Rao. Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees and Multisets. ACM Transactions on Algorithms (TALG), Vol. 3, Issue 4, 2007.
[3] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing, 35(2):378–407, 2005.
[4] R. Grossi, A. Gupta, and J. S. Vitter. High-Order Entropy-Compressed Text Indexes. In Proc. ACM-SIAM SODA, pages 841–850, 2003.
[5] P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed Representations of Sequences and Full-Text Indexes. ACM Transactions on Algorithms (TALG), 3(2), article 20, 2007.
[6] J. Barbay, T. Gagie, G. Navarro, and Y. Nekrich. Alphabet Partitioning for Compressed Rank/Select and Applications. In Proc. ISAAC, pages 315–326, LNCS 6507, 2010.
[7] A. Golynski, J. I. Munro, and S. S. Rao. Rank/Select Operations on Large Alphabets: A Tool for Text Indexing. In Proc. SODA, pages 368–373, 2006.
[8] J. Barbay, M. He, J. I. Munro, and S. S. Rao. Succinct Indexes for Strings, Binary Relations and Multi-labeled Trees. In Proc. SODA, 2007.
[9] D. E. Willard. Log-logarithmic Worst-case Range Queries are Possible in Space Θ(n). Information Processing Letters, 17(2):81–84, 1983.
[10] J. I. Munro, R. Raman, V. Raman, and S. S. Rao. Succinct Representations of Permutations. In Proc. ICALP, pages 345–356, 2003.
[11] R. Pagh. Low Redundancy in Static Dictionaries with Constant Query Time. SIAM Journal on Computing, 31(2):353–363, 2001.