Succinct Data Structures

Slides:

Advertisements

Similar presentations

Cell-Probe Lower Bounds for Succinct Partial Sums Mihai P ă trașcuEmanuele Viola.

Advertisements

Indexing DNA Sequences Using q-Grams

Succinct Representations of Dynamic Strings Meng He and J. Ian Munro University of Waterloo.

Succinct Data Structures for Permutations, Functions and Suffix Arrays

Analysis of Algorithms

5th July 2004CPM A Simple Optimal Representation for Balanced Parentheses Richard Geary, Naila Rahman, Rajeev Raman (University of Leicester, UK)

Succinct Representation of Balanced Parentheses, Static Trees and Planar Graphs J. Ian Munro & Venkatesh Raman.

An Improved Succinct Dynamic k-Ary Tree Representation (work in progress) Diego Arroyuelo Department of Computer Science, Universidad de Chile.

22C:19 Discrete Structures Trees Spring 2014 Sukumar Ghosh.

Paolo Ferragina, Università di Pisa Compressed Rank & Select on general strings Paolo Ferragina Dipartimento di Informatica, Università di Pisa.

Fusion Trees Advanced Data Structures Aris Tentes.

Fast Compressed Tries through Path Decompositions Roberto Grossi Giuseppe Ottaviano* Università di Pisa * Part of the work done while at Microsoft Research.

© The McGraw-Hill Companies, Inc., Chapter 2 The Complexity of Algorithms and the Lower Bounds of Problems.

Succinct Indexes for Strings, Binary Relations and Multi-labeled Trees Jérémy Barbay, Meng He, J. Ian Munro, University of Waterloo S. Srinivasa Rao, IT.

Compressed Compact Suffix Arrays Veli Mäkinen University of Helsinki Gonzalo Navarro University of Chile compact compress.

A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes Meng He, J. Ian Munro, and S. Srinivasa Rao University of Waterloo.

Succinct Data Structures Ian Munro University of Waterloo Joint work with David Benoit, Andrej Brodnik, D, Clark, F. Fich, M. He, J. Horton, A. López-Ortiz,

Succinct Representations of Trees S. Srinivasa Rao Seoul National University.

2 -1 Chapter 2 The Complexity of Algorithms and the Lower Bounds of Problems.

Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.

Tirgul 8 Universal Hashing Remarks on Programming Exercise 1 Solution to question 2 in theoretical homework 2.

Compressed Suffix Arrays based on Run-Length Encoding Veli Mäkinen Bielefeld University Gonzalo Navarro University of Chile BWTRLFID.

Obtaining Provably Good Performance from Suffix Trees in Secondary Storage Pang Ko & Srinivas Aluru Department of Electrical and Computer Engineering Iowa.

An Approach to Generalized Hashing Michael Klipper With Dan Blandford Guy Blelloch.

Mike 66 Sept Succinct Data Structures: Techniques and Lower Bounds Ian Munro University of Waterloo Joint work with/ work of Arash Farzan, Alex Golynski,

 Divide the encoded file into blocks of size b  Use an auxiliary bit vector to indicate the beginning of each block  Time – O(b)  Time vs. Memory.

Succinct Representations of Trees

Space Efficient Data Structures for Dynamic Orthogonal Range Counting Meng He and J. Ian Munro University of Waterloo.

Compressed suffix arrays and suffix trees with applications to text indexing and string matching.

Introduction n – length of text, m – length of search pattern string Generally suffix tree construction takes O(n) time, O(n) space and searching takes.

Summer School '131 Succinct Data Structures Ian Munro.

Succinct Orthogonal Range Search Structures on a Grid with Applications to Text Indexing Prosenjit Bose, Carleton University Meng He, Unversity of Waterloo.

Succinct Data Structures Ian Munro University of Waterloo Joint work with David Benoit, Andrej Brodnik, D, Clark, F. Fich, M. He, J. Horton, A. López-Ortiz,

Constant-Time LCA Retrieval Presentation by Danny Hermelin, String Matching Algorithms Seminar, Haifa University.

Succinct Dynamic Cardinal Trees with Constant Time Operations for Small Alphabet Pooya Davoodi Aarhus University May 24, 2011 S. Srinivasa Rao Seoul National.

Compressed Prefix Sums O’Neil Delpratt Naila Rahman Rajeev Raman.

Succinct Ordinal Trees Based on Tree Covering Meng He, J. Ian Munro, University of Waterloo S. Srinivasa Rao, IT University of Copenhagen.

Joint Advanced Student School Compressed Suffix Arrays Compression of Suffix Arrays to linear size Fabian Pache.

Compressing Bi-Level Images by Block Matching on a Tree Architecture Sergio De Agostino Computer Science Department Sapienza University of Rome ITALY.

Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)

Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo

Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo

Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.

A simple storage scheme for strings achieving entropy bounds Paolo Ferragina and Rossano Venturini Dipartimento di Informatica University of Pisa, Italy.

Succinct Data Structures

Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.

Succinct Data Structures

Succinct Data Structures

Tries 07/28/16 11:04 Text Compression

Succinct Data Structures

Succinct Data Structures

Tries 5/27/2018 3:08 AM Tries Tries.

Succinct Data Structures

Discrete Methods in Mathematical Informatics

CSCI 210 Data Structures and Algorithms

Succinct Data Structures: Upper, Lower & Middle Bounds

Succinct Data Structures

Reducing the Space Requirement of LZ-index

COSC160: Data Structures Linked Lists

Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.

Randomized Algorithms CS648

Discrete Methods in Mathematical Informatics

Database Design and Programming

Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.

Tries 2/27/2019 5:37 PM Tries Tries.

Succinct Data Structures

Discrete Range Maximum

Path Oram An Extremely Simple Oblivious RAM Protocol

Presentation transcript:

Succinct Data Structures Kunihiko Sadakane National Institute of Informatics

Succinct Data Structures Succinct data structures = succinct representation of data + succinct index Examples sets trees, graphs strings permutations, functions

Succinct Representation A representation of data whose size (roughly) matches the information-theoretic lower bound If the input is taken from L distinct ones, its information-theoretic lower bound is log L bits Example 1: a lower bound for a set S  {1,2,...,n} log 2n = n bits {1,2} {1,2,3} {1,3} {2,3} {3} {2} {1}  n = 3 Base of logarithm is 2

Example 2: n node ordered tree Example 3: length-n string n log  bits (: alphabet size) n = 4

For any data, there exists its succinct representation enumerate all in some order, and assign codes can be represented in log L bits not clear if it supports efficient queries Query time depends on how data are represented necessary to use appropriate representations n = 4 000 001 010 011 100

Succinct Indexes Auxiliary data structure to support queries Size: o(log L) bits (Almost) the same time complexity as using conventional data structures Computation model: word RAM assume word length w = log log L (same pointer size as conventional data structures)

word RAM word RAM with word length w bits supports reading/writing w bits of memory at arbitrary address in constant time arithmetic/logical operations on two w bits numbers are done in constant time arithmetic ops.: +, , *, /, log (most significant bit) logical ops.: and, or, not, shift These operations can be done in constant time using O(w) bit tables ( > 0 is an arbitrary constant)

1. Bit Vectors B: 0,1 vector of length n B[0]B[1]…B[n1] lower bound of size = log 2n = n bits queries rank(B, x): number of ones in B[0..x]=B[0]B[1]…B[x] select(B, i): position of i-th 1 from the head (i  1) basics of all succinct data structures naïve data structure store all the answers in arrays 2n words (2 n log n bits) O(1) time queries B = 1001010001000000 3 5 9 n = 16

Succinct Index for Rank:1 Divide B into blocks of length log2 n Store rank(x)’s at block boundaries in array R Size of R rank(x): O(log2 n) time R[x/log2 n]

Succinct Index for Rank:2 Divide each block into small blocks of length ½log n Store rank(x)’s at small block boundaries in R2 Size of R2 rank(x): O(log n) time R1[x/log2 n] R2[x/log n]

Succinct Index for Rank:3 Compute answers for all possible queries for small blocks in advance, and store them in a table Size of table x 1 2 000 001 010 011 100 101 110 111 3 All patterns of ½ log n bits

Theorem：Rank on a bit-vector of length n is computed in constant time on word RAM with word length (log n) bits, using n + O(n log log n/log n) bits. Note: The size of table for computing rank inside a small block is bits. This can be ignored theoretically, but not in practice.

Succinct Index for Select More difficult than rank Let i = q (log2 n) + r (½ log n) + s if i is multiple of log2 n, store select(i) in S1 if i is multiple of ½ log n, store select(i) in S2 elements of S2 may become (n) →(log n) bits are necessary to store them → (n) bits for the entire S2 select can be computed by binary search using the index for rank, but it takes O(log n) time Divide B into large blocks, each of which contains log2 n ones. Use one of two types of data structures for each large block

If the length of a large block exceeds logc n store positions of log2 n ones for a large block the number of such large blocks is at most In total By letting c = 4, index size is

If the length m of a large block is at most logc n devide it into small blocks of length ½ log n construct a complete -ary tree storing small blocks in their leaves each node of the tree stores the number of ones in the bit vector stored in descendants number of ones in a large block is log2 n → each value is in 2 log log n bits B = 100101000100010011100000001 1 2 3 4 9 O(c) m  logc n

The height of the tree is O(c) Space for storing values in the nodes for a large block is Space for the whole vector is To compute select(i), search the tree from the root the information about child nodes is represented in bits → the child to visit is determined by a table lookup in constant time Search time is O(c), i.e., constant

Theorem：rank and select on a bit-vector of length n is computed in constant time on word RAM with word length (log n) bits, using n+O(n log log n /log n) bits. Note： rank inside a large block is computed using the index for select by traversing the tree from a leaf to the root and summing rank values.

Extension of Rank/Select (1) Queries on 0 bits rankc(B, x): number of c in B[0..x] = B[0]B[1]…B[x] selectc(B, i): position of i-th c in B (i  1) from rank0(B, x) = (x+1)  rank1(B, x), rank0 is done in O(1) using no additional index select0 is not computed by the index for select1 add a similar index to that for select1

Extension of Rank/Select (2) Compression of sparse vectors A 0,1 vector of length n, having m ones lower bound of size if m << n Operations to be supported rankc(B, x): number of c in B[0..x] = B[0]B[1]…B[x] selectc(B, i): position of i-th c in B (i  1)

Entropy of String Definition: order-0 entropy H0 of string S (pc: probability of appearance of letter c) Definition: order-k entropy assumption: Pr[S[i] = c] is determined from S[ik..i1] (context) ns: the number of letters whose context is s ps,c: probability of appearing c in context s abcababc context

The information-theoretic lower bound for a sparse vector asymptotically matches the order-0 entropy of the vector

Compressing Vectors (1) Divide vector into small blocks of length l = ½ log n Let mi denote number of 1’s in i-th small block Bi a small block is represented in bits Total space for all blocks is #ones #possible patterns with speficied #ones

necessary to store pointers to small blocks indexes for rank, select are the same as those for uncompressed vectors: O(n log log n /log n) bits numbers of 1’s in small blocks are already stored in the index for select necessary to store pointers to small blocks let pi denote the pointer to Bi 0= p0 < p1 < <pn/l < n (less than n because compressed) pi  pi1  ½ log n if i is multiple of log n, store pi as it is (no comp.) otherwise, store pi as the difference

Theorem： A bit-vector of length n and m ones is stored in bits, and rank0, rank1, select0, select1 are computed in O(1) time on word RAM with word length (log n) bits. The data structure is constructed in O(n) time. This data structure is called FID (fully indexable dictionary) (Raman, Raman, Rao [2]). Note: if m << n, it happens , and the size of index for rank/select cannot be ignored.

References [1] G. Jacobson. Space-efficient Static Trees and Graphs. In Proc. IEEE FOCS, pages 549–554, 1989. [2] R. Raman, V. Raman, and S. S. Rao. Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees and Multisets, ACM Transactions on Algorithms (TALG) , Vol. 3, Issue 4, 2007. [3] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing, 35(2):378–407, 2005. [4] R. Grossi, A. Gupta, and J. S. Vitter. High-Order Entropy-Compressed Text Indexes. In Proc. ACM-SIAM SODA, pages 841–850, 2003. [5] Paolo Ferragina, Giovani Manzini, Veli Mäkinen, and Gonzalo Navarro. Compressed Representations of Sequences and Full-Text Indexes. ACM Transactions on Algorithms (TALG) 3(2), article 20, 24 pages, 2007.

[6] Jérémy Barbay, Travis Gagie, Gonzalo Navarro, and Yakov Nekrich. Alphabet Partitioning for Compressed Rank/Select and Applications. Proc. ISAAC, pages 315-326. LNCS 6507, 2010. [7] Alexander Golynski , J. Ian Munro , S. Srinivasa Rao, Rank/select operations on large alphabets: a tool for text indexing, Proc. SODA, pp. 368-373, 2006. [8] Jérémy Barbay, Meng He, J. Ian Munro, S. Srinivasa Rao. Succinct indexes for strings, binary relations and multi-labeled trees. Proc. SODA, 2007. [9] Dan E. Willard. Log-logarithmic worst-case range queries are possible in space theta(n). Information Processing Letters, 17(2):81–84, 1983. [10] J. Ian Munro, Rajeev Raman, Venkatesh Raman, S. Srinivasa Rao. Succinct representations of permutations. Proc. ICALP, pp. 345–356, 2003. [11] R. Pagh. Low Redundancy in Static Dictionaries with Constant Query Time. SIAM Journal on Computing, 31(2):353–363, 2001.