Succinct Data Structures: Upper, Lower & Middle Bounds

Slides:



Advertisements
Similar presentations
Succinct Representations of Dynamic Strings Meng He and J. Ian Munro University of Waterloo.
Advertisements

Succinct Data Structures for Permutations, Functions and Suffix Arrays
Introduction to Computer Science 2 Lecture 7: Extended binary trees
Succinct Representation of Balanced Parentheses, Static Trees and Planar Graphs J. Ian Munro & Venkatesh Raman.
An Improved Succinct Dynamic k-Ary Tree Representation (work in progress) Diego Arroyuelo Department of Computer Science, Universidad de Chile.
SODA Jan 11, 2004Partial Sums1 Tight Bounds for the Partial-Sums Problem Mihai PǎtraşcuErik Demaine (presenting) MIT CSAIL.
Succinct Data Structures Ian Munro University of Waterloo Joint work with David Benoit, Andrej Brodnik, D, Clark, F. Fich, M. He, J. Horton, A. López-Ortiz,
The Trie Data Structure Basic definition: a recursive tree structure that uses the digital decomposition of strings to represent a set of strings for searching.
Tries Standard Tries Compressed Tries Suffix Tries.
Succinct Representations of Trees S. Srinivasa Rao Seoul National University.
Modern Information Retrieval
B + -Trees (Part 1) Lecture 20 COMP171 Fall 2006.
B + -Trees (Part 1). Motivation AVL tree with N nodes is an excellent data structure for searching, indexing, etc. –The Big-Oh analysis shows most operations.
B + -Trees (Part 1) COMP171. Slide 2 Main and secondary memories  Secondary storage device is much, much slower than the main RAM  Pages and blocks.
Compact Representations of Separable Graphs From a paper of the same title submitted to SODA by: Dan Blandford and Guy Blelloch and Ian Kash.
Important Problem Types and Fundamental Data Structures
Mike 66 Sept Succinct Data Structures: Techniques and Lower Bounds Ian Munro University of Waterloo Joint work with/ work of Arash Farzan, Alex Golynski,
ICS 220 – Data Structures and Algorithms Week 7 Dr. Ken Cosh.
Succinct Representations of Trees
Space Efficient Data Structures for Dynamic Orthogonal Range Counting Meng He and J. Ian Munro University of Waterloo.
Introduction n – length of text, m – length of search pattern string Generally suffix tree construction takes O(n) time, O(n) space and searching takes.
Summer School '131 Succinct Data Structures Ian Munro.
Succinct Orthogonal Range Search Structures on a Grid with Applications to Text Indexing Prosenjit Bose, Carleton University Meng He, Unversity of Waterloo.
Succinct Data Structures Ian Munro University of Waterloo Joint work with David Benoit, Andrej Brodnik, D, Clark, F. Fich, M. He, J. Horton, A. López-Ortiz,
Succinct Dynamic Cardinal Trees with Constant Time Operations for Small Alphabet Pooya Davoodi Aarhus University May 24, 2011 S. Srinivasa Rao Seoul National.
Agenda Review: –Planar Graphs Lecture Content:  Concepts of Trees  Spanning Trees  Binary Trees Exercise.
Sets of Digital Data CSCI 2720 Fall 2005 Kraemer.
Lecture 10 Page 1 CS 111 Summer 2013 File Systems Control Structures A file is a named collection of information Primary roles of file system: – To store.
B-TREE. Motivation for B-Trees So far we have assumed that we can store an entire data structure in main memory What if we have so much data that it won’t.
Week 15 – Wednesday.  What did we talk about last time?  Review up to Exam 1.
Lower Bounds & Sorting in Linear Time
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
CSC317 Selection problem q p r Randomized‐Select(A,p,r,i)
Succinct Data Structures
Tries 07/28/16 11:04 Text Compression
Data Structures: Disjoint Sets, Segment Trees, Fenwick Trees
Tries 5/27/2018 3:08 AM Tries Tries.
Succinct Data Structures
Azita Keshmiri CS 157B Ch 12 indexing and hashing
B-Trees B-Trees.
Mark Redekopp David Kempe
File System Structure How do I organize a disk into a file system?
B+-Trees.
B+-Trees.
CS 430: Information Discovery
Hashing Exercises.
Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
13 Text Processing Hongfei Yan June 1, 2016.
Sorting.
CMSC 341 Lecture 10 B-Trees Based on slides from Dr. Katherine Gibson.
B+ Trees What are B+ Trees used for What is a B Tree What is a B+ Tree
B-Trees CSE 373 Data Structures CSE AU B-Trees.
B- Trees D. Frey with apologies to Tom Anastasio
Indexing and Hashing Basic Concepts Ordered Indices
B- Trees D. Frey with apologies to Tom Anastasio
Paolo Ferragina Dipartimento di Informatica, Università di Pisa
Lower Bounds & Sorting in Linear Time
B-Trees CSE 373 Data Structures CSE AU B-Trees.
B- Trees D. Frey with apologies to Tom Anastasio
Database Design and Programming
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
CSE 373, Copyright S. Tanimoto, 2002 B-Trees -
Tries 2/27/2019 5:37 PM Tries Tries.
CENG 351 Data Management and File Structures
Paolo Ferragina Dipartimento di Informatica, Università di Pisa
B-Trees CSE 373 Data Structures CSE AU B-Trees.
CMSC 341 Extensible Hashing.
Important Problem Types and Fundamental Data Structures
Succinct Data Structures
Presentation transcript:

Succinct Data Structures: Upper, Lower & Middle Bounds Ian Munro University of Waterloo Joint work with/of Arash Farzan, Alex Golynski, Meng He How do we encode a large combinatorial object (e.g. a tree, string, graph, group) … even a static one in a small amount of space & still perform required operations in constant time ??? CPM June 2008

Example of a Succinct Data Structure: The (Static) Bounded Subset Given: Universe of n elements [0,...n-1] and m arbitrary elements from this universe Create: a static structure to support search in constant time (lg n bit word & usual ops) Using: Essentially minimum possible # bits … Operation: Member query in O(1) time (Brodnik & M.) CPM June 2008

Careful .. Lower Bounds Beame-Fich: Find largest less than i is tough in some ranges of m(e.g. m≈2 √lg n) But OK if i is present this can be added (Raman, Raman, Rao) CPM June 2008

Focus on Trees .. Because Computer Science is .. Arbophilic - Directories (Unix, all the rest) - Search trees (B-trees, binary search trees, digital trees or tries) - Graph structures (we do a tree based search) - Search indices for text (including DNA) CPM June 2008

A Big Patricia Trie / Suffix Trie Given a large text file; treat it as bit vector Construct a trie with leaves pointing to unique locations in text that “match” path in trie (paths must start at character boundaries) Skip the nodes where there is no branching (n-1 internal nodes) 1 1 1 0 0 0 1 1 CPM June 2008

Space for Trees Abstract data type: binary tree Size: n-1 internal nodes, n leaves Operations: child, parent, subtree size, leaf data Motivation: “Obvious” representation of an n node tree takes about 6 n lg n words (up, left, right, size, memory manager, leaf reference) i.e. full suffix tree takes about 5 or 6 times the space of suffix array (i.e. leaf references only) CPM June 2008

Succinct Representations of Trees Start with Jacobson, then others: There are about 4n/(πn)3/2 ordered rooted trees, and same number of binary trees Lower bound on specifying is about 2n bits What are the natural representations? CPM June 2008

Arbitrary Ordered Trees Use parenthesis notation Represent the tree As the binary string (((())())((())()())): traverse tree as “(“ for node, then subtrees, then “)” Each node takes 2 bits CPM June 2008

Heap-like Notation for a Binary Tree Add external nodes Enumerate level by level Store vector 11110111001000000 length 2n+1 (Here don’t know size of subtrees; can be overcome. Could use isomorphism to flip between notations) 1 1 1 1 1 1 1 1 CPM June 2008

How do we Navigate? Jacobson’s key suggestion: Operations on a bit vector rank(x) = # 1’s up to & including x select(x) = position of xth 1 So in the binary tree leftchild(x) = 2 rank(x) rightchild(x) = 2 rank(x) + 1 parent(x) = select(x/2) CPM June 2008

Rank & Select Rank: Auxiliary storage ~ 2nlglg n / lg n bits #1’s up to each (lg n)2 rd bit #1’s within these too each lg nth bit Table lookup after that Select: More complicated (especially to get this lower order term) but similar notions Key issue: Rank & Select take O(1) time with lg n bit word (M. et al) CPM June 2008

Aside: Dynamic Rank & Select Rank/Select Structures: Raw data plus some “cumulative arrays” Model: We keep a finger at a position and can insert/delete change at that spot or move 1 spot left/right When at position i maintain structures up to i and backwards from n down to i+1. Problem: in most (tree) applications rank/select updates are “all over” CPM June 2008

Lower Bound: for Rank & for Select Theorem (Golynski): Given a bit vector of length n and an “index” (extra data) of size r bits, let t be the number of bits probed to perform rank (or select) then: r=Ω(n (lg t)/t). Proof idea: Argue to reconstructing the entire string with too few rank queries (similarly for select) Corollary (Golynski): Under the lg n bit RAM model, an index of size (n lglg n/ lg n) is necessary and sufficient to perform the rank and the select operations. CPM June 2008

More on Trees Updating trees: simple mapping plus rank/select does not work well Other kinds of trees: free trees (no root or ordering on children), a simple mapping may not exist So break tree into little hunks (say (1-ε) lg n size), small enough to explicitly keep in a table, with special constraints (e.g. few edges going out of a hunk) CPM June 2008

More on Trees Keep most nodes in these little hunks (or a couple of levels of hunk size classes), a limited number can be in a core tree with “real pointers” CPM June 2008

Hunks Lead to Updates on binary trees (M., Raman & Storm), & more general trees (Farzan & M.) Also representing special classes of trees optimally (Farzan & M.) e.g. free trees 1.56..n bits, free binary trees 1.31..n bits CPM June 2008

Other Combinatorial Objects Planar Graphs (Lu et al, Barbay et al)) Permutations [n]→ [n] Or more generally Functions [n] → [n] But what operations? Clearly π(i), but also π -1(i) And then π k(i) and π -k(i) Suffix Arrays (special permutations) in linear space Arbitrary Graphs (Farzan & M.) CPM June 2008

Permutations: Backpointer Notation Let P be a simple array giving π; P[i] = π[i] Also have B[i] be a pointer t positions back in (the cycle of) the permutation; B[i]= π-t[i] .. But only define B for every tth position in cycle. (t is a constant; ignore cycle length “round-off”) So array representation P = [8 4 12 5 13 x x 3 x 2 x 10 1] 1 2 3 4 5 6 7 8 9 10 11 12 13 2 4 5 13 1 8 3 12 10 CPM June 2008

Representing Shortcuts In a cycle there is a B every t positions … But these positions can be in arbitrary order Which i’s have a B, and how do we store it? Keep a vector of all positions: 0 = no B 1 = B Rank gives the position of B[“i”] in B array So: π(i) & π -1(i) in O(1) time & (1+ε)n lg n bits Theorem: Under a pointer machine model with space (1+ ε) n references, we need time 1/ε to answer π and π -1 queries; i.e. this is as good as it gets … in the pointer model. CPM June 2008

Aside: Extending to powers of π Consider the cycles of π ( 2 6 8)( 3 5 9 10)( 4 1 7) Bit vector indicates start of each cycle ( 2 6 8 3 5 9 10 4 1 7) Ignore parens, view as new permutation, ψ. Note: ψ-1(i) is position containing i … So we have ψ and ψ-1 as before Use ψ-1(i) to find i, then bit vector (rank, select) to find πk or π-k CPM June 2008

Aside: Functions Consider an arbitrary function, f:[n]→[n] Note: f-1(i) is a set All tree edges lead to a cycle “A function is just a hairy permutation” Deal with level ancestors, result holds CPM June 2008

Back to π & π-1 in Fewer Bits This is the best we can do for O(1) operations But using Benes networks: 1-Benes network is a 2 input/2 output switch r+1-Benes network … join tops to tops #bits(n)=2#bits(n/2)+n=n lg n-n+1=min+O(n) 1 2 3 4 5 6 7 8 3 5 7 8 1 6 4 2 R-Benes Network R-Benes Network CPM June 2008

A Benes Network Realizing the permutation (std π(i) notation) (3 5 7 8 1 6 4 2) Note: O(n) bits more than “necessary” 1 2 3 4 5 6 7 8 3 5 7 8 1 6 4 2 CPM June 2008

What can we do with it? Divide into blocks of lg lg n gates … encode their actions in a word. Taking advantage of regularity of address mechanism and also Modify approach to avoid power of 2 issue Can trace a path in time O(lg n/(lg lg n) Beats previous “lower bound” by using “micro pointers” CPM June 2008

Backpointers & Benes: Both are Best Recall: Benes method “violates” the pointer machine lower bound by using “micropointers”. Indeed: With (a lot of) care, space required is lg(n!) + O(n (lg lg n)2/lg n) bits But more general… Lower Bound (Golynski): Both methods are optimal for their respective extra space constraints CPM June 2008

Permutation Lower Bound Operations: π(i), π-1(i) with times t and t’ Backpointers: “natural” + index Benes: just a pile of bits, in lg n bit words General Model: memory (lg(n!)+r bits in words Lower bound: r = extra space = Ω(lg n!/tt’) It works out both Backpointers and Benes are optimal CPM June 2008

Proof of Lower Bound: Model Model: Tree program Separate tree for each π(i) or π-1(i) Start at root, look at memory location (word) based on value required At depth d take appropriate branch based on which of n values is read CPM June 2008

Proof of Lower Bound: Set up Fix the permutation (for now) Consider table of locations inspected at every step for every query location m 5 3 4 9 2 o 7 5 9 1 8 p 9 8 3 2 q 8 1 3 4 7 r 4 6 9 3 8 s 3 7 8 t 3 7 1 2 4 π(1) π(2) π(3) π(4) . π-1(n) query CPM June 2008

Proof of Lower Bound: cont’d Take the least used cell (over all queries for this permutation location m 5 3 4 9 2 o 7 5 9 1 8 p 9 8 3 2 q 8 1 3 4 7 r 4 6 9 3 8 s 3 7 8 t 3 7 1 2 4 π(1) π(2) π(3) π(4) . π-1(n) query CPM June 2008

Proof of Lower Bound: cont’d Take the least used cell (over all queries for this permutation And NUKE (eliminate) it location m 5 3 4 9 2 o 7 5 9 1 8 p 9 8 3 2 q 8 1 3 4 7 r 4 6 9 3 8 s 3 7 8 t 3 7 1 2 4 π(1) π(2) π(3) π(4) . π-1(n) query CPM June 2008

Proof of Lower Bound: cont’d And continuing removing cells for a while .. This means some queries may become unanswerable (no matter how many probes made) … but other are still OK e.g. removing a cell for π(6) (=56) & π-1 (56) (=6) makes these unanswerable, versus cell for π(9) (=52) but not π(52)-1 (=9), We do have to remember what we removed (though not the “order”) CPM June 2008

Proof of Lower Bound: Saving Space So … we save the space for the values we no longer “need”, but we do have to remember which are destroyed d locations destroyed, order doesn’t matter: d lg(n/d) bits used to say what is gone But d lg(n) bits saved CPM June 2008

Proof of Lower Bound: Finishing Now … some queries don’t work … π(is) s=1,..c & π-1(js) s=1,..c We know is & js but not their correspondence … encode it After reduction we still need lg (n!) bits (averaging over all permutations) So reduce to that point .. Do arithmetic, bound follows CPM June 2008

Text Search Lower Bound Key point : reciprocal relation Text search operations F = access substring length p starting in ip+1, i=0,n/p I = search(X,j) jth (aligned) occurrence of X Theorem(Golynski): rtt’ = O(np(lg σ)2/ω2) r=extra space in words; σ=alphabet; ω=word size For lg n substring linear extra space needed … same as Demaine & Lopez-Ortiz, but better model CPM June 2008

Conclusion Interesting, and useful, combinatorial objects can be: Stored succinctly … lower bound +o() So that Natural queries are performed in O(1) time (or at least very close) Indeed our o() terms are often optimal But border on operations is subtle CPM June 2008