Discrete Methods in Mathematical Informatics
Kunihiko Sadakane, The University of Tokyo
http://researchmap.jp/sada/resources/
Problems in Handling Big Data
Data size (n) is huge
– human genome: n = 3G (three billion)
– Web pages: n > 10G
Data do not fit in the main memory of computers
– parallel/distributed algorithms
– data compression
– streaming algorithms
Problems in Data Compression
Conventional data compression algorithms are not suitable for data processing
– slow partial decompression
– not searchable
→ Succinct Data Structures
Examples of Succinct Data Structures
String indexes
– suffix array of a 100GB text: 680GB → 22GB
Genome assembly
– human genome: 300GB → 2.5GB
Road networks
– positional data of all roads in Japan: 1.7GB → 170MB
(figure: trees represented by bitstrings)
Succinct Data Structures
Succinct data structure = succinct representation of data + succinct index
A representation of data whose size (roughly) matches the information-theoretic lower bound
– if the input is taken from L distinct ones, its information-theoretic lower bound is log L bits (base of logarithm is 2)
Example 1: a lower bound for a set S ⊆ {1,2,...,n}
– log 2^n = n bits
– e.g. n = 3: the 2³ = 8 possible sets are {}, {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}
Example 2: n-node ordered tree
– the number of such trees is a Catalan number, so the lower bound is about 2n bits
Example 3: length-n string
– n log σ bits (σ: alphabet size)
(figure: the ordered trees with n = 4 nodes)
For any data, there exists a succinct representation
– enumerate all possibilities in some order, and assign codes
– every instance can then be represented in log L bits
– but it is not clear whether this supports efficient queries
Query time depends on how the data are represented
– it is necessary to use appropriate representations
(figure: codes 000, 001, 010, 011, 100 assigned to the n = 4 trees)
Succinct Indexes
Auxiliary data structures to support queries
Size: o(log L) bits
(Almost) the same time complexity as conventional data structures
Computation model: word RAM
– assume word length w = Θ(log L) (the same pointer size as conventional data structures)
word RAM
A word RAM with word length w bits supports
– reading/writing w bits of memory at an arbitrary address in constant time
– arithmetic/logical operations on two w-bit numbers in constant time
– arithmetic ops.: +, −, *, /, log (most significant bit)
– logical ops.: and, or, not, shift
These operations can be done in constant time using tables of O(2^{εw}) bits (ε > 0 is an arbitrary constant)
Bit Vectors
B: a 0,1 vector of length n, B[0] B[1] … B[n−1]
Lower bound of size: log 2^n = n bits
Queries
– rank(B, x): number of ones in B[0..x] = B[0] B[1] … B[x]
– select(B, i): position of the i-th 1 from the head (i ≥ 1)
The basics of all succinct data structures
Naïve data structure
– store all the answers in arrays
– 2n words (2 n log n bits)
– O(1) time queries
Example: B = 1001010001000000 (n = 16); the ones are at positions 0, 3, 5, 9
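The naïve structure can be sketched as follows (a minimal Python illustration; it stores both answer arrays explicitly and is therefore not succinct — it is the baseline the indexes below improve on):

```python
# Naive rank/select: precompute all answers in two arrays.
# Space: 2n words; query time: O(1).

def build_naive(B):
    """B is a list of 0/1 bits."""
    ranks = []            # ranks[x] = number of ones in B[0..x]
    selects = []          # selects[i-1] = position of the i-th one
    ones = 0
    for x, bit in enumerate(B):
        ones += bit
        ranks.append(ones)
        if bit:
            selects.append(x)
    return ranks, selects

def rank(ranks, x):
    return ranks[x]

def select(selects, i):   # i >= 1
    return selects[i - 1]

# the slide's example vector, n = 16
B = [1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0]
ranks, selects = build_naive(B)
```

For this B, rank(ranks, 5) = 3 and select(selects, 4) = 9, matching the positions 0, 3, 5, 9 of the ones.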
Succinct Index for Rank: 1
Divide B into blocks of length log² n
Store rank(x) at every block boundary in an array R
Size of R: (n / log² n) · log n = O(n / log n) bits
rank(x): O(log² n) time
– R[x / log² n] plus a scan of at most one block
Succinct Index for Rank: 2
Divide each block into small blocks of length ½ log n
Store rank(x), relative to the enclosing block boundary, at every small-block boundary in R₂
Size of R₂: (2n / log n) · O(log log n) = O(n log log n / log n) bits
rank(x): O(log n) time
– R₁[x / log² n] + R₂[x / (½ log n)] plus a scan of at most one small block
13 Succinct Index for Rank:3 Compute answers for all possible queries for small blocks in advance, and store them in a table Size of table 012 000000 001001 010011 011012 100111 101112 110122 111123 All patterns of ½ log n bits x
Theorem: rank on a bit vector of length n is computed in constant time on a word RAM with word length Θ(log n) bits, using n + O(n log log n / log n) bits.
Note: the size of the table for computing rank inside a small block is O(√n log n log log n) bits. This can be ignored theoretically (asymptotically), but not in practice.
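The three-level construction can be sketched in Python (a toy version with illustrative block lengths; R₂ stores counts relative to the enclosing big block, and the table is indexed by the raw bit pattern of a small block):

```python
import math

class RankIndex:
    """Two-level rank index sketch: absolute counts at big-block
    boundaries (R1), relative counts at small-block boundaries (R2),
    and a precomputed table answering rank inside a small block."""
    def __init__(self, B):
        self.B = B
        n = max(len(B), 2)
        lg = max(int(math.log2(n)), 1)
        self.small = max(lg // 2, 1)          # small block ~ (1/2) log n
        self.big = self.small * lg * 2        # big block ~ log^2 n
        self.R1, self.R2 = [], []
        ones = big_start = 0
        for x in range(0, len(B), self.small):
            if x % self.big == 0:
                self.R1.append(ones)
                big_start = ones
            self.R2.append(ones - big_start)
            ones += sum(B[x:x + self.small])
        # table[pattern][s] = number of ones among the first s+1 bits
        self.table = {}
        for p in range(2 ** self.small):
            bits = tuple((p >> (self.small - 1 - j)) & 1
                         for j in range(self.small))
            acc, row = 0, []
            for b in bits:
                acc += b
                row.append(acc)
            self.table[bits] = row

    def rank(self, x):
        sb = x // self.small
        start = sb * self.small
        pat = tuple(self.B[start:start + self.small])
        pat = pat + (0,) * (self.small - len(pat))   # pad last block
        return (self.R1[x // self.big] + self.R2[sb]
                + self.table[pat][x - start])

B = [1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0]
idx = RankIndex(B)
```

Here every query is answered by two array reads and one table lookup, mirroring the constant-time argument of the theorem.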
Succinct Index for Select
More difficult than rank
– let i = q · log² n + r · (½ log n) + s
– if i is a multiple of log² n, store select(i) in S₁
– if i is a multiple of ½ log n, store select(i) in S₂
– elements of S₂ may become Θ(n) → Θ(log n) bits are necessary to store each of them → Θ(n) bits for the entire S₂
– select can be computed by binary search using the index for rank, but it takes O(log n) time
Divide B into large blocks, each of which contains log² n ones. Use one of two types of data structures for each large block.
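The O(log n)-time fallback mentioned above, select by binary search over rank, can be sketched as follows (a minimal Python illustration; `rank` is passed in as a function, here backed by a plain prefix-count array):

```python
def select_by_binary_search(rank, n, i):
    """Position of the i-th one (i >= 1): the smallest x with
    rank(x) >= i, found by binary search; O(log n) calls to rank."""
    lo, hi = 0, n - 1          # invariant: the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if rank(mid) >= i:
            hi = mid
        else:
            lo = mid + 1
    return lo

# demo on the slide's example vector
B = [1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0]
prefix = []
ones = 0
for bit in B:
    ones += bit
    prefix.append(ones)        # prefix[x] = rank(B, x)
rank_fn = lambda x: prefix[x]
```

The two-type large-block structure that follows exists precisely to replace this O(log n) search by constant time.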
If the length of a large block exceeds log^c n
– store the positions of its log² n ones explicitly
– log² n · log n = log³ n bits for a large block
– the number of such large blocks is at most n / log^c n
– in total: (n / log^c n) · log³ n = n / log^{c−3} n bits
– by letting c = 4, the index size is O(n / log n) bits
If the length m of a large block is at most log^c n
– divide it into small blocks of length ½ log n
– construct a complete √(log n)-ary tree storing the small blocks in its leaves
– each node of the tree stores the number of ones in the bit vector stored in its descendants
– the number of ones in a large block is log² n → each value fits in 2 log log n bits
(figure: a tree of one-counts over the small blocks of B = 100101000100010011100000001)
The height of the tree is O(c)
Space for storing the values in the nodes of one large block: O((m / log n) · log log n) bits
Space for the whole vector: O(n log log n / log n) bits
To compute select(i), search the tree from the root
– the information about the children of a node fits in √(log n) · 2 log log n = o(log n) bits → the child to visit is determined by a table lookup in constant time
Search time is O(c), i.e., constant
Theorem: rank and select on a bit vector of length n are computed in constant time on a word RAM with word length Θ(log n) bits, using n + O(n log log n / log n) bits.
Note: rank inside a large block is computed using the index for select, by traversing the tree from a leaf to the root and summing rank values.
Extension of Rank/Select (1)
Queries on 0 bits
– rank_c(B, x): number of c's in B[0..x] = B[0] B[1] … B[x]
– select_c(B, i): position of the i-th c in B (i ≥ 1)
From rank₀(B, x) = (x + 1) − rank₁(B, x), rank₀ is done in O(1) time using no additional index
select₀ is not computed by the index for select₁
– add an index similar to that for select₁
Extension of Rank/Select (2)
Compression of sparse vectors
– a 0,1 vector of length n, having m ones
– lower bound of size: log C(n, m) bits (C(n, m): binomial coefficient)
– if m ≪ n, log C(n, m) ≈ m log (n/m) ≪ n
Operations to be supported
– rank_c(B, x): number of c's in B[0..x] = B[0] B[1] … B[x]
– select_c(B, i): position of the i-th c in B (i ≥ 1)
Entropy of Strings
Definition: order-0 entropy H₀ of a string S
– H₀(S) = Σ_c p_c log (1 / p_c) (p_c: probability of appearance of letter c)
Definition: order-k entropy
– assumption: Pr[S[i] = c] is determined by S[i−k..i−1] (the context)
– n_s: the number of letters whose context is s
– p_{s,c}: probability of c appearing in context s
– H_k(S) = Σ_s (n_s / n) Σ_c p_{s,c} log (1 / p_{s,c})
Example: in S = abcababc with k = 2, context ab is followed by c, a, c
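These definitions translate directly into code (a minimal Python sketch; H_k follows the slide's convention of weighting the order-0 entropy of each context's followers by n_s / n):

```python
import math
from collections import Counter

def H0(S):
    """Order-0 entropy per symbol: sum over letters c of p_c * log2(1/p_c)."""
    n = len(S)
    return sum((m / n) * math.log2(n / m) for m in Counter(S).values())

def Hk(S, k):
    """Order-k entropy: for each length-k context s, take the order-0
    entropy of the letters that follow s, weighted by n_s / n."""
    n = len(S)
    followers = {}
    for i in range(k, n):
        followers.setdefault(S[i - k:i], []).append(S[i])
    return sum(len(f) * H0(f) for f in followers.values()) / n
```

For instance H0("aaaa") = 0 and H0("ab") = 1, while Hk("abababab", 1) = 0 because each context determines its successor completely.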
The information-theoretic lower bound for a sparse vector asymptotically matches the order-0 entropy of the vector
– log C(n, m) = nH₀ − O(log n), where H₀ = (m/n) log (n/m) + ((n−m)/n) log (n/(n−m))
Compressing Vectors (1)
Divide the vector into small blocks of length l = ½ log n
Let m_i denote the number of 1's in the i-th small block B_i
– a small block is represented in ⌈log (l + 1)⌉ + ⌈log C(l, m_i)⌉ bits: its number of ones, plus the index of its pattern among all C(l, m_i) possible patterns with that many ones
Total space for all blocks: Σ_i (⌈log (l + 1)⌉ + ⌈log C(l, m_i)⌉) ≈ nH₀ + O(n log log n / log n) bits
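The (number of ones, pattern index) representation is enumerative coding; a minimal Python sketch, with patterns of a given weight ordered lexicographically (function names are mine):

```python
from math import comb

def block_to_class_offset(bits):
    """Encode a 0/1 block as (class, offset): class = number of ones,
    offset = rank of the pattern among all C(l, class) patterns with
    that many ones, in lexicographic order (0 < 1)."""
    l, m = len(bits), sum(bits)
    offset = 0
    for j, b in enumerate(bits):
        if b:
            # all patterns with a 0 at position j come first
            offset += comb(l - j - 1, m)
            m -= 1
    return sum(bits), offset

def class_offset_to_block(l, m, offset):
    """Decode (class, offset) back to the block of length l."""
    bits = []
    for j in range(l):
        zeros_first = comb(l - j - 1, m)  # patterns with a 0 at position j
        if offset < zeros_first:
            bits.append(0)
        else:
            bits.append(1)
            offset -= zeros_first
            m -= 1
    return bits
```

A block of length l then costs ⌈log (l + 1)⌉ bits for the class and ⌈log C(l, m_i)⌉ bits for the offset, as in the space bound above.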
The indexes for rank and select are the same as those for uncompressed vectors: O(n log log n / log n) bits
– the numbers of 1's in the small blocks are already stored in the index for select
It is also necessary to store pointers to the small blocks
Let p_i denote the pointer to B_i
– 0 = p₀ < p₁ < … < p_{n/l} < n (less than n because the blocks are compressed)
– p_i − p_{i−1} ≤ ½ log n
If i is a multiple of log n, store p_i as it is (no compression)
Otherwise, store p_i as the difference from the preceding uncompressed pointer, in O(log log n) bits each
Theorem: a bit vector of length n with m ones is stored in log C(n, m) + O(n log log n / log n) bits, and rank₀, rank₁, select₀, select₁ are computed in O(1) time on a word RAM with word length Θ(log n) bits. The data structure is constructed in O(n) time.
This data structure is called an FID (fully indexable dictionary) (Raman, Raman, Rao [2]).
Note: if m ≪ n, it can happen that log C(n, m) ≪ n log log n / log n, and then the size of the index for rank/select cannot be ignored.
Extension of Rank/Select (3)
Multi-letter alphabets (alphabet size σ > 2)
– access(S, x): returns S[x]
– rank_c(S, x): number of c's in S[0..x] = S[0] S[1] … S[x]
– select_c(S, i): position of the i-th c in S (i ≥ 1)
– c ∈ A = {0, 1, …, σ − 1}
Data Structure 1
Represent S by σ 0,1 vectors (one per letter) and compress each by FID
rank and select are done in O(1) time
access is done in O(σ) time
Size
– if a letter c appears n_c times, the size of its vector is log C(n, n_c) + O(n log log n / log n) bits
– in total: Σ_c log C(n, n_c) + o(n) ≈ n(H₀ + 1.44) + o(n) bits
Example (S = g$ccaggaa):
$: 010000000
a: 000010011
c: 001100000
g: 100001100
Data Structure 2
Store S as it is (n log σ bits)
– access takes O(1) time
The 0,1 vectors are computed from S on the fly
– the succinct indexes for rank/select can be used without modification
– time to obtain log n bits of a 0,1 vector is O(log σ)
– rank and select are done in O(log σ) time
Example (S = g$ccaggaa):
$: 010000000
a: 000010011
c: 001100000
g: 100001100
Wavelet Trees
A binary tree consisting of σ − 1 bit vectors
The vector in the root node is V[0..n−1]
– V[i] = 1 ⇔ the most significant bit of (the code of) S[i] is 1
The right (left) subtree of the root is the wavelet tree for the string consisting of the letters of S whose most significant bit is 1 (0)
– with the most significant bit removed
Example: codes $ = 00, a = 01, c = 10, g = 11; S = g$ccaggaa
– V = 101101100
– S₀ = $aaa, V₀ = 0111; S₁ = gccgg, V₁ = 10011
Computing rank
b = (most significant bit of c), c′ = (c with its most significant bit removed)
r = rank_b(V, x)
rank_c(S, x) = rank_{c′}(S_b, r − 1) (computed recursively)
O(log σ) time
Size
– the total length of the vectors at depth d is n
– the height of the tree is log σ
→ n log σ + O(n log σ · log log n / log n) = |S| + o(|S|) bits, where |S| = n log σ
Computing access
b = access(V, x), r = rank_b(V, x)
access(S, x) = b : access(S_b, r − 1) (recursion; b becomes the next bit of the answer)
O(log σ) time
Computing select
b = (most significant bit of c), c′ = (c with its most significant bit removed)
select_c(S, i) = select_b(V, select_{c′}(S_b, i) + 1) (computed from the leaf upward)
O(log σ) time
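The three operations can be sketched with a pointer-based wavelet tree (a plain Python illustration over the alphabet {0,…,σ−1}; rank/select on the node vectors is done by scanning here, whereas a real implementation would use the succinct bit-vector indexes above):

```python
class WaveletTree:
    """Pointer-based wavelet tree sketch over the alphabet [lo, hi)."""
    def __init__(self, S, lo=0, hi=None):
        if hi is None:
            hi = max(S) + 1
        self.lo, self.hi = lo, hi
        if hi - lo <= 1 or not S:
            self.V = None                      # leaf: a single letter
            return
        self.mid = (lo + hi) // 2
        self.V = [1 if c >= self.mid else 0 for c in S]
        self.left = WaveletTree([c for c in S if c < self.mid], lo, self.mid)
        self.right = WaveletTree([c for c in S if c >= self.mid], self.mid, hi)

    def access(self, x):
        if self.V is None:
            return self.lo
        b = self.V[x]
        r = sum(1 for v in self.V[:x + 1] if v == b)   # rank_b(V, x)
        return (self.right if b else self.left).access(r - 1)

    def rank(self, c, x):
        """Number of occurrences of c in S[0..x]."""
        if self.V is None:
            return x + 1
        b = 1 if c >= self.mid else 0
        r = sum(1 for v in self.V[:x + 1] if v == b)
        if r == 0:
            return 0
        return (self.right if b else self.left).rank(c, r - 1)

    def select(self, c, i):
        """Position of the i-th occurrence of c (i >= 1)."""
        if self.V is None:
            return i - 1
        b = 1 if c >= self.mid else 0
        j = (self.right if b else self.left).select(c, i)
        cnt = 0                                # position of the (j+1)-th b in V
        for pos, v in enumerate(self.V):
            if v == b:
                cnt += 1
                if cnt == j + 1:
                    return pos

# the slide's example with $ = 0, a = 1, c = 2, g = 3: S = g$ccaggaa
S = [3, 0, 2, 2, 1, 3, 3, 1, 1]
wt = WaveletTree(S)
```

Each query walks one root-to-leaf path (access, rank) or one leaf-to-root path (select), giving the O(log σ) levels stated above.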
Compressing the Vectors
n_c: frequency of letter c in S
Compress each bit vector of the wavelet tree with FID
– size of compressed V: log C(n, m) bits, where m is its number of ones (for the example codes, m = n_c + n_g)
– sizes of compressed V₀ and V₁: log C(n_$ + n_a, n_a) and log C(n_c + n_g, n_g) bits, similarly
By adding the sizes over all levels, the total size is nH₀ + O(n log σ · log log n / log n) bits
Summary
– access/rank/select: O(log σ) time
– size: nH₀ + O(n log σ · log log n / log n) bits
Multi-ary Wavelet Trees [5]
Use multi-ary trees instead of binary ones for wavelet trees
access/rank/select: O(log σ / log log n) time
Size: nH₀ + o(n log σ) bits
If σ = polylog(n)
– access/rank/select: O(1) time
– size: nH₀ + o(n) bits
The size can be reduced to nH₀ + o(n)(H₀ + 1) bits
Summary

Size                  access              rank                select
n(H₀+1.44) + o(n)     O(σ)                O(1)                O(1)
|S| + o(|S|)          O(1)                O(log σ)            O(log σ)
nH₀ + o(n) · log σ    O(log σ)            O(log σ)            O(log σ)
nH₀ + o(n log σ)      O(log σ/log log n)  O(log σ/log log n)  O(log σ/log log n)
nH₀ + o(n)(H₀+1)      O(log log σ)        O(log log σ)        O(1)
References
[1] G. Jacobson. Space-efficient Static Trees and Graphs. In Proc. IEEE FOCS, pages 549–554, 1989.
[2] R. Raman, V. Raman, and S. S. Rao. Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees and Multisets. ACM Transactions on Algorithms (TALG), 3(4), 2007.
[3] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing, 35(2):378–407, 2005.
[4] R. Grossi, A. Gupta, and J. S. Vitter. High-Order Entropy-Compressed Text Indexes. In Proc. ACM-SIAM SODA, pages 841–850, 2003.
[5] P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed Representations of Sequences and Full-Text Indexes. ACM Transactions on Algorithms (TALG), 3(2), article 20, 2007.
[6] J. Barbay, T. Gagie, G. Navarro, and Y. Nekrich. Alphabet Partitioning for Compressed Rank/Select and Applications. In Proc. ISAAC, pages 315–326, LNCS 6507, 2010.
[7] A. Golynski, J. I. Munro, and S. S. Rao. Rank/Select Operations on Large Alphabets: a Tool for Text Indexing. In Proc. SODA, pages 368–373, 2006.
[8] J. Barbay, M. He, J. I. Munro, and S. S. Rao. Succinct Indexes for Strings, Binary Relations and Multi-labeled Trees. In Proc. SODA, 2007.
[9] D. E. Willard. Log-logarithmic Worst-case Range Queries are Possible in Space Θ(n). Information Processing Letters, 17(2):81–84, 1983.
[10] J. I. Munro, R. Raman, V. Raman, and S. S. Rao. Succinct Representations of Permutations. In Proc. ICALP, pages 345–356, 2003.
[11] R. Pagh. Low Redundancy in Static Dictionaries with Constant Query Time. SIAM Journal on Computing, 31(2):353–363, 2001.