
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo




1 Discrete Methods in Mathematical Informatics
Kunihiko Sadakane, The University of Tokyo
http://researchmap.jp/sada/resources/

2 Problems in Handling Big Data
Data size (n) is huge
– human genome: n = 3G (three billion)
– Web pages: n > 10G
Data do not fit in the main memory of computers
– parallel/distributed algorithms
– data compression
– streaming algorithms

3 Problems in Data Compression
Conventional data compression algorithms are not suitable for data processing
– slow partial decompression
– not searchable
→ Succinct Data Structures

4 Examples of Succinct Data Structures
String index
– suffix array of 100GB text: 680GB → 22GB
Genome assembly
– human genome: 300GB → 2.5GB
Road networks
– positional data of all roads in Japan: 1.7GB → 170MB
[Figure: a tree and a DNA-sequence graph represented by bitstrings]

5 Succinct data structures = succinct representation of data + succinct index
A representation of data whose size (roughly) matches the information-theoretic lower bound
If the input is taken from L distinct ones, its information-theoretic lower bound is ⌈log L⌉ bits (base of logarithm is 2)
Example 1: a lower bound for a set S ⊆ {1, 2, ..., n}
– log 2^n = n bits
[Figure: the 2^3 = 8 subsets of {1, 2, 3} for n = 3]
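As a sanity check on Example 1, the lower bound ⌈log₂ L⌉ can be computed directly (a small Python sketch, not part of the slides):

```python
import math

# Representing one of L distinct objects requires ceil(log2(L)) bits.
def lower_bound_bits(L: int) -> int:
    return math.ceil(math.log2(L))

# Example 1: subsets of {1,...,n}. There are L = 2**n of them,
# so the lower bound is exactly n bits.
n = 3
print(lower_bound_bits(2 ** n))  # prints 3
```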

6 Example 2: n-node ordered tree
– the number of such trees is a Catalan number, so the lower bound is log C_{n−1} ≈ 2n bits
Example 3: length-n string
– n log σ bits (σ: alphabet size)
[Figure: the ordered trees with n = 4 nodes]

7 For any data, there exists a succinct representation
– enumerate all possible inputs in some order, and assign codes
– any input can be represented in ⌈log L⌉ bits
– not clear whether this supports efficient queries
Query time depends on how the data are represented
– necessary to use appropriate representations
[Figure: codes 000, 001, 010, 011, 100 assigned to the five ordered trees with n = 4 nodes]

8 Succinct Indexes
Auxiliary data structures to support queries
Size: o(log L) bits
(Almost) the same time complexity as using conventional data structures
Computation model: word RAM
– assume word length w = Θ(log log L) (the same pointer size as conventional data structures)

9 word RAM
A word RAM with word length w bits supports:
– reading/writing w bits of memory at an arbitrary address in constant time
– arithmetic/logical operations on two w-bit numbers in constant time
– arithmetic ops.: +, −, *, /, log (most significant bit)
– logical ops.: and, or, not, shift
These operations can be done in constant time using O(2^{εw})-bit tables (ε > 0 is an arbitrary constant)

10 Bit Vectors
B: a 0,1 vector of length n, B[0] B[1] … B[n−1]
Lower bound of size: log 2^n = n bits
Queries
– rank(B, x): number of ones in B[0..x] = B[0] B[1] … B[x]
– select(B, i): position of the i-th 1 from the head (i ≥ 1)
These are the basics of all succinct data structures
Naïve data structure
– store all the answers in arrays
– 2n words (2n log n bits)
– O(1) time queries
Example: B = 1001010001000000 (n = 16); the ones are at positions 0, 3, 5, 9
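The naïve structure above can be written out directly (an illustrative Python sketch, not from the slides); it answers in O(1) time but stores 2n words:

```python
# Naive rank/select: store all answers in arrays.
# O(1) queries, but 2n words (2n log n bits) of space.
class NaiveBitVector:
    def __init__(self, bits):
        self.ranks = []     # ranks[x] = number of ones in bits[0..x]
        self.selects = []   # selects[i-1] = position of the i-th one
        count = 0
        for pos, b in enumerate(bits):
            count += b
            self.ranks.append(count)
            if b == 1:
                self.selects.append(pos)

    def rank(self, x):      # number of ones in B[0..x]
        return self.ranks[x]

    def select(self, i):    # position of the i-th one, i >= 1
        return self.selects[i - 1]

# The slide's example: B = 1001010001000000, n = 16
B = [1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0]
v = NaiveBitVector(B)
print(v.rank(5), v.select(4))  # prints 3 9
```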

11 Succinct Index for Rank: 1
Divide B into blocks of length log² n
Store rank(x) at each block boundary in an array R
Size of R: (n / log² n) · log n = n / log n bits
rank(x): O(log² n) time — read R[x / log² n], then count the ones in the block by scanning

12 Succinct Index for Rank: 2
Divide each block into small blocks of length ½ log n
Store rank(x) relative to the block boundary at each small-block boundary in an array R₂
Size of R₂: each relative rank is less than log² n, so O(log log n) bits suffice; (n / (½ log n)) · O(log log n) = O(n log log n / log n) bits
rank(x): O(log n) time — R₁[x / log² n] + R₂[x / (½ log n)], then scan at most ½ log n bits

13 Succinct Index for Rank: 3
Compute the answers for all possible queries on small blocks in advance, and store them in a table
Size of table: 2^{½ log n} · ½ log n · O(log log n) = O(√n log n log log n) bits
Example for all patterns of ½ log n = 3 bits (entry = number of ones in pattern[0..x]):

pattern  x=0  x=1  x=2
000       0    0    0
001       0    0    1
010       0    1    1
011       0    1    2
100       1    1    1
101       1    1    2
110       1    2    2
111       1    2    3

14 Theorem: rank on a bit vector of length n is computed in constant time on a word RAM with word length Θ(log n) bits, using n + O(n log log n / log n) bits.
Note: the size of the table for computing rank inside a small block is O(√n log n log log n) bits. This can be ignored theoretically (asymptotically), but not in practice.
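Slides 11–13 fit together as one structure. The following is an illustrative Python sketch (not the slides' exact layout): block lengths are derived from log n but remain ordinary integers, and the in-block table is keyed by the bit pattern.

```python
import math

# Two-level rank index: R1 stores absolute ranks at big-block
# boundaries (~(log n)^2 bits wide), R2 stores ranks relative to the
# big block at small-block boundaries (~(1/2) log n bits wide), and a
# precomputed table answers queries inside a small block.
class RankIndex:
    def __init__(self, bits):
        self.bits = bits
        n = len(bits)
        logn = max(1, int(math.log2(n)))
        self.small = max(1, logn // 2)       # small-block length
        self.big = self.small * logn * 2     # ~ (log n)^2, multiple of small
        self.R1, self.R2 = [], []
        self.table = {}   # table[pattern][j] = ones in pattern[0..j]
        count = big_base = 0
        for pos in range(0, n, self.small):
            if pos % self.big == 0:
                self.R1.append(count)
                big_base = count
            self.R2.append(count - big_base)
            block = tuple(bits[pos:pos + self.small])
            if block not in self.table:
                acc, row = 0, []
                for b in block:
                    acc += b
                    row.append(acc)
                self.table[block] = row
            count += sum(block)

    def rank(self, x):
        pos = (x // self.small) * self.small
        block = tuple(self.bits[pos:pos + self.small])
        return (self.R1[x // self.big] + self.R2[x // self.small]
                + self.table[block][x - pos])

idx = RankIndex([1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0])
print(idx.rank(5), idx.rank(9))  # prints 3 4
```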

15 Succinct Index for Select
More difficult than rank
– let i = q · log² n + r · (½ log n) + s
– if i is a multiple of log² n, store select(i) in S₁
– if i is a multiple of ½ log n, store select(i) in S₂
– elements of S₂ may become Θ(n) → Θ(log n) bits are necessary to store each → Θ(n) bits for the entire S₂
– select can be computed by binary search using the index for rank, but this takes O(log n) time
Instead, divide B into large blocks, each of which contains log² n ones, and use one of two types of data structures for each large block
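The O(log n)-time binary-search fallback mentioned above can be sketched in Python (illustrative; any rank function can be plugged in):

```python
# select(i) via binary search on rank: the answer is the smallest
# position p with rank(p) >= i.  Costs O(log n) calls to rank.
def select_by_binary_search(rank, n, i):
    lo, hi = 0, n - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if rank(mid) < i:
            lo = mid + 1
        else:
            hi = mid
    return lo

B = [1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0]
rank = lambda x: sum(B[:x + 1])   # stand-in for the O(1) rank index
print(select_by_binary_search(rank, len(B), 4))  # prints 9
```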

16 If the length of a large block exceeds log^c n:
– store the positions of its log² n ones explicitly
– log² n · log n = log³ n bits for a large block
– the number of such large blocks is at most n / log^c n
– in total: n log³ n / log^c n bits
– by letting c = 4, the index size is O(n / log n) bits

17 If the length m of a large block is at most log^c n:
– divide it into small blocks of length ½ log n
– construct a complete √(log n)-ary tree storing the small blocks in its leaves
– each node of the tree stores the number of ones in the bit vector stored in its descendants
– the number of ones in a large block is log² n → each value fits in 2 log log n bits
Example: B = 100101000100010011100000001 with small blocks of length 3; leaf counts 1 2 0 1 1 2 1 0 1, internal counts 3 4 2, root 9

18 The height of the tree is O(c)
Space for storing the values in the nodes of one large block is O((m / log n) · log log n) bits
Space for the whole vector is O(n log log n / log n) bits
To compute select(i), search the tree from the root
– the information about the children of a node is represented in √(log n) · 2 log log n = o(log n) bits → the child to visit is determined by a table lookup in constant time
Search time is O(c), i.e., constant

19 Theorem: rank and select on a bit vector of length n are computed in constant time on a word RAM with word length Θ(log n) bits, using n + O(n log log n / log n) bits.
Note: rank inside a large block is computed using the index for select, by traversing the tree from a leaf to the root and summing rank values.

20 Extension of Rank/Select (1)
Queries on 0 bits
– rank_c(B, x): number of c's in B[0..x] = B[0] B[1] … B[x]
– select_c(B, i): position of the i-th c in B (i ≥ 1)
From rank_0(B, x) = (x + 1) − rank_1(B, x), rank_0 is done in O(1) time using no additional index
select_0 is not computed by the index for select_1
→ add an index similar to that for select_1

21 Extension of Rank/Select (2)
Compression of sparse vectors
– a 0,1 vector of length n, having m ones
– lower bound of size: ⌈log C(n, m)⌉ ≈ m log(n/m) bits if m << n
Operations to be supported
– rank_c(B, x): number of c's in B[0..x] = B[0] B[1] … B[x]
– select_c(B, i): position of the i-th c in B (i ≥ 1)

22 Entropy of String
Definition: order-0 entropy H₀ of string S:
H₀(S) = Σ_c p_c log(1/p_c)
(p_c: probability of appearance of letter c)
Definition: order-k entropy
– assumption: Pr[S[i] = c] is determined by S[i−k..i−1] (the context)
– n_s: the number of letters whose context is s
– p_{s,c}: probability of c appearing in context s
H_k(S) = (1/n) Σ_s n_s Σ_c p_{s,c} log(1/p_{s,c})
Example: in S = abcababc with k = 2, the context of the final c is ab
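The two definitions can be computed directly (a Python sketch, not part of the slides; base-2 logarithms as on the slide):

```python
import math
from collections import Counter, defaultdict

# Order-0 entropy: H0 = sum_c p_c * log2(1 / p_c).
def H0(S):
    n = len(S)
    return sum((f / n) * math.log2(n / f) for f in Counter(S).values())

# Order-k entropy: group letters by their length-k context
# S[i-k..i-1], then Hk = (1/n) * sum_s n_s * H0(letters in context s).
def Hk(S, k):
    groups = defaultdict(list)
    for i in range(k, len(S)):
        groups[S[i - k:i]].append(S[i])
    n = len(S)
    return sum(len(g) * H0(g) for g in groups.values()) / n

print(H0("aabb"))          # prints 1.0
print(Hk("abababab", 1))   # prints 0.0 (each context determines the next letter)
```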

23 The information-theoretic lower bound for a sparse vector asymptotically matches the order-0 entropy of the vector:
log C(n, m) ≈ nH₀, where H₀ = (m/n) log(n/m) + ((n−m)/n) log(n/(n−m))

24 Compressing Vectors (1)
Divide the vector into small blocks of length l = ½ log n
Let m_i denote the number of 1's in the i-th small block B_i
– a small block is represented by the pair (m_i, index of B_i among the C(l, m_i) possible patterns with the specified number of ones), in ⌈log C(l, m_i)⌉ + O(log log n) bits
Total space for all blocks is Σ_i ⌈log C(l, m_i)⌉ + O(n log log n / log n) ≤ log C(n, m) + O(n log log n / log n) bits
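One concrete way to realize the (#ones, pattern-index) representation of a block is the standard enumerative code (an illustrative sketch, not necessarily the slides' exact encoding): the offset fits in ⌈log₂ C(l, m_i)⌉ bits.

```python
from math import comb

# Enumerative coding: a block of length l with m ones is identified by
# its class m and its index (offset) among all C(l, m) such blocks.
def encode_block(block):
    l, m = len(block), sum(block)
    offset, ones_left = 0, m
    for pos, b in enumerate(block):
        if b == 1:
            # skip all blocks that have a 0 here but share the prefix
            offset += comb(l - pos - 1, ones_left)
            ones_left -= 1
    return m, offset

def decode_block(l, m, offset):
    block, ones_left = [], m
    for pos in range(l):
        skip = comb(l - pos - 1, ones_left)  # codes with 0 at this position
        if offset >= skip:
            block.append(1)
            offset -= skip
            ones_left -= 1
        else:
            block.append(0)
    return block

m, off = encode_block([1,0,1,1,0,0])
print(decode_block(6, m, off))  # prints [1, 0, 1, 1, 0, 0]
```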

25 The indexes for rank and select are the same as those for uncompressed vectors: O(n log log n / log n) bits
The numbers of 1's in the small blocks are already stored in the index for select
It is necessary to store pointers to the small blocks; let p_i denote the pointer to B_i
– 0 = p_0 < p_1 < … < p_{n/l} < n (less than n because compressed)
– p_i − p_{i−1} ≤ ½ log n
If i is a multiple of log n, store p_i as it is (no compression); otherwise, store p_i as the difference, in O(log log n) bits
→ O(n log log n / log n) bits in total

26 Theorem: a bit vector of length n with m ones is stored in log C(n, m) + O(n log log n / log n) bits, and rank_0, rank_1, select_0, select_1 are computed in O(1) time on a word RAM with word length Θ(log n) bits. The data structure is constructed in O(n) time.
This data structure is called an FID (fully indexable dictionary) (Raman, Raman, Rao [2]).
Note: if m << n, it happens that log C(n, m) = o(n log log n / log n), and then the size of the index for rank/select cannot be ignored.

27 Extension of Rank/Select (3)
Multi-alphabet strings
– access(S, x): returns S[x]
– rank_c(S, x): number of c's in S[0..x] = S[0] S[1] … S[x]
– select_c(S, i): position of the i-th c in S (i ≥ 1)
– c ∈ A = {0, 1, …, σ−1}, alphabet size σ > 2

28 Data Structure 1
Represent S by σ 0,1 indicator vectors and compress each by FID
rank and select are done in O(1) time; access is done in O(σ) time
Size
– if a letter c appears n_c times, the size of its vector is log C(n, n_c) + O(n log log n / log n) bits
– in total: n(H₀ + 1.44) + σ · O(n log log n / log n) bits
Example, S = g$ccaggaa:
$: 010000000
a: 000010011
c: 001100000
g: 100001100
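A toy version of Data Structure 1 (illustrative Python; plain lists and linear scans stand in for the compressed FIDs, so the O(1)/O(σ) bounds are only mimicked structurally):

```python
# One indicator bit vector per letter: rank_c / select_c reduce to
# rank/select on the vector for c; access must probe up to sigma
# vectors, hence O(sigma) time.
def indicator_vectors(S):
    return {c: [1 if s == c else 0 for s in S] for c in sorted(set(S))}

S = "g$ccaggaa"
vecs = indicator_vectors(S)

def rank_c(c, x):           # number of c's in S[0..x]
    return sum(vecs[c][:x + 1])

def access(x):              # scan the vectors until one has a 1 at x
    return next(c for c, v in vecs.items() if v[x] == 1)

print(rank_c('a', 8), access(0))  # prints 3 g
```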

29 Data Structure 2
Store S as it is (n log σ bits)
– access takes O(1) time
The σ 0,1 vectors are computed from S on the fly
– the succinct indexes for rank/select can be used without modification
– the time to obtain log n bits of a 0,1 vector is O(log σ)
– rank and select are done in O(log σ) time
Size: n log σ + σ · O(n log log n / log n) bits

30 Wavelet Trees
A binary tree consisting of σ − 1 0,1 vectors
The vector in the root node is V[0..n−1]
– V[i] = 1 ⇔ the most significant bit of S[i] is 1
The right (left) subtree of the root node is the wavelet tree for the string consisting of the letters of S whose most significant bit is 1 (0)
– remove the most significant bit
Example with $ = 00, a = 01, c = 10, g = 11:
S = g$ccaggaa → V = 101101100
S₀ = $aaa → V₀ = 0111; S₁ = gccgg → V₁ = 10011

31 Computing rank
b = (most significant bit of c)
c′ = (c with the most significant bit omitted)
r = rank_b(V, x)
rank_c(S, x) = rank_{c′}(S_b, r − 1) (computed recursively)
O(log σ) time
The total length of the vectors at depth d is n; the height of the tree is log σ
→ size: n log σ + O(n log σ log log n / log n) = |S| + o(|S|) bits, where |S| = n log σ

32 Computing access
b = access(V, x)
r = rank_b(V, x)
access(S, x) = b : access(S_b, r − 1) (recursion; ':' denotes bit concatenation)
O(log σ) time
Computing select
b = (most significant bit of c)
c′ = (c with the most significant bit omitted)
select_c(S, i) = select_b(V, select_{c′}(S_b, i) + 1) (the +1 converts the 0-indexed position into an occurrence count)
O(log σ) time
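Slides 30–32 fit together as a small wavelet tree. The following is an illustrative Python sketch (not the slides' implementation): `_rank_v` does a linear scan where a real structure would use the O(1) rank index, and select is omitted for brevity.

```python
# Wavelet tree over integer codes of a fixed bit width, splitting on
# the most significant bit of the remaining code at each level.
class WaveletTree:
    def __init__(self, codes, bits):
        self.bits = bits
        if bits == 0:            # leaf: all codes here are identical
            return
        msb = 1 << (bits - 1)
        self.V = [1 if c & msb else 0 for c in codes]
        self.left = WaveletTree([c for c in codes if not c & msb], bits - 1)
        self.right = WaveletTree([c & ~msb for c in codes if c & msb], bits - 1)

    def _rank_v(self, b, x):     # count of bit b in V[0..x] (linear scan)
        ones = sum(self.V[:x + 1])
        return ones if b else (x + 1 - ones)

    def access(self, x):
        if self.bits == 0:
            return 0
        b = self.V[x]
        child = self.right if b else self.left
        rest = child.access(self._rank_v(b, x) - 1)
        return (b << (self.bits - 1)) | rest

    def rank(self, c, x):        # occurrences of code c in positions 0..x
        if self.bits == 0:
            return x + 1
        b = (c >> (self.bits - 1)) & 1
        r = self._rank_v(b, x)
        if r == 0:
            return 0
        child = self.right if b else self.left
        return child.rank(c & ~(1 << (self.bits - 1)), r - 1)

# The slides' example: S = g$ccaggaa with $=0, a=1, c=2, g=3
wt = WaveletTree([3, 0, 2, 2, 1, 3, 3, 1, 1], bits=2)
print(wt.access(0), wt.rank(1, 8))  # prints 3 3  (S[0] = g; 'a' appears 3 times)
```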

33 Compressing the Vectors
n_c: frequency of letter c
Size of the compressed root vector V: log C(n, n₁) ≈ nH₀(V) bits, where n₁ is the number of ones in V
Size of the compressed V₀ and V₁: n₀H₀(V₀) + n₁H₀(V₁) bits, computed the same way level by level
Example: S = g$ccaggaa, V = 101101100, V₀ = 0111, V₁ = 10011

34 By adding the sizes over all levels, the total is nH₀ + O(n log σ log log n / log n) bits
Summary
– access/rank/select: O(log σ) time
– size: nH₀ + O(n log σ log log n / log n) bits

35 Multi-ary Wavelet Trees [5]
Use multi-ary trees, not binary, for the wavelet trees
access/rank/select: O(log σ / log log n) time
Size: nH₀ + o(n log σ) bits
If σ = polylog(n)
– access/rank/select: O(1) time
– size: nH₀ + o(n) bits
The size can be reduced to nH₀ + o(n)(H₀ + 1) bits

36 Summary

Size                   access              rank                select
n(H₀+1.44) + σ·o(n)    O(σ)                O(1)                O(1)
|S| + σ·o(n)           O(1)                O(log σ)            O(log σ)
nH₀ + log σ·o(n)       O(log σ)            O(log σ)            O(log σ)
nH₀ + o(n log σ)       O(log σ/log log n)  O(log σ/log log n)  O(log σ/log log n)
nH₀ + o(n)(H₀+1)       O(1)                O(log log σ)        O(1)

37 References
[1] G. Jacobson. Space-efficient Static Trees and Graphs. In Proc. IEEE FOCS, pages 549–554, 1989.
[2] R. Raman, V. Raman, and S. S. Rao. Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees and Multisets. ACM Transactions on Algorithms (TALG), Vol. 3, Issue 4, 2007.
[3] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing, 35(2):378–407, 2005.
[4] R. Grossi, A. Gupta, and J. S. Vitter. High-Order Entropy-Compressed Text Indexes. In Proc. ACM-SIAM SODA, pages 841–850, 2003.
[5] P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed Representations of Sequences and Full-Text Indexes. ACM Transactions on Algorithms (TALG), 3(2), article 20, 2007.

38 [6] J. Barbay, T. Gagie, G. Navarro, and Y. Nekrich. Alphabet Partitioning for Compressed Rank/Select and Applications. In Proc. ISAAC, pages 315–326, LNCS 6507, 2010.
[7] A. Golynski, J. I. Munro, and S. S. Rao. Rank/Select Operations on Large Alphabets: a Tool for Text Indexing. In Proc. SODA, pages 368–373, 2006.
[8] J. Barbay, M. He, J. I. Munro, and S. S. Rao. Succinct Indexes for Strings, Binary Relations and Multi-labeled Trees. In Proc. SODA, 2007.
[9] D. E. Willard. Log-logarithmic Worst-case Range Queries are Possible in Space Θ(n). Information Processing Letters, 17(2):81–84, 1983.
[10] J. I. Munro, R. Raman, V. Raman, and S. S. Rao. Succinct Representations of Permutations. In Proc. ICALP, pages 345–356, 2003.
[11] R. Pagh. Low Redundancy in Static Dictionaries with Constant Query Time. SIAM Journal on Computing, 31(2):353–363, 2001.

