Summer School '131 Succinct Data Structures Ian Munro.

Slides:



Advertisements
Similar presentations
Succinct Representations of Dynamic Strings Meng He and J. Ian Munro University of Waterloo.
Advertisements

Succinct Data Structures for Permutations, Functions and Suffix Arrays
Succinct Representation of Balanced Parentheses, Static Trees and Planar Graphs J. Ian Munro & Venkatesh Raman.
An Improved Succinct Dynamic k-Ary Tree Representation (work in progress) Diego Arroyuelo Department of Computer Science, Universidad de Chile.
Advanced Databases: Lecture 2 Query Optimization (I) 1 Query Optimization (introduction to query processing) Advanced Databases By Dr. Akhtar Ali.
Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
Succinct Data Structures Ian Munro University of Waterloo Joint work with David Benoit, Andrej Brodnik, D, Clark, F. Fich, M. He, J. Horton, A. López-Ortiz,
1 Data structures for Pattern Matching Suffix trees and suffix arrays are a basic data structure in pattern matching Reported by: Olga Sergeeva, Saint.
The Trie Data Structure Basic definition: a recursive tree structure that uses the digital decomposition of strings to represent a set of strings for searching.
Tries Standard Tries Compressed Tries Suffix Tries.
Succinct Representations of Trees S. Srinivasa Rao Seoul National University.
Advanced Algorithm Design and Analysis (Lecture 4) SW5 fall 2004 Simonas Šaltenis E1-215b
BTrees & Bitmap Indexes
FALL 2006CENG 351 Data Management and File Structures1 External Sorting.
CS 206 Introduction to Computer Science II 12 / 01 / 2008 Instructor: Michael Eckmann.
B + -Trees (Part 1) Lecture 20 COMP171 Fall 2006.
6/26/2015 7:13 PMTries1. 6/26/2015 7:13 PMTries2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3) Huffman encoding.
B + -Trees (Part 1). Motivation AVL tree with N nodes is an excellent data structure for searching, indexing, etc. –The Big-Oh analysis shows most operations.
Tirgul 6 B-Trees – Another kind of balanced trees Problem set 1 - some solutions.
CS 206 Introduction to Computer Science II 12 / 10 / 2008 Instructor: Michael Eckmann.
B + -Trees (Part 1) COMP171. Slide 2 Main and secondary memories  Secondary storage device is much, much slower than the main RAM  Pages and blocks.
Compact Representations of Separable Graphs From a paper of the same title submitted to SODA by: Dan Blandford and Guy Blelloch and Ian Kash.
Important Problem Types and Fundamental Data Structures
Mike 66 Sept Succinct Data Structures: Techniques and Lower Bounds Ian Munro University of Waterloo Joint work with/ work of Arash Farzan, Alex Golynski,
ICS 220 – Data Structures and Algorithms Week 7 Dr. Ken Cosh.
B-trees (Balanced Trees) A B-tree is a special kind of tree, similar to a binary tree. However, It is not a binary search tree. It is not a binary tree.
Succinct Representations of Trees
Space Efficient Data Structures for Dynamic Orthogonal Range Counting Meng He and J. Ian Munro University of Waterloo.
A Summary of XISS and Index Fabric Ho Wai Shing. Contents Definition of Terms XISS (Li and Moon, VLDB2001) Numbering Scheme Indices Stored Join Algorithms.
Introduction n – length of text, m – length of search pattern string Generally suffix tree construction takes O(n) time, O(n) space and searching takes.
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
CSE AU B-Trees1 B-Trees CSE 373 Data Structures.
12.1 Chapter 12: Indexing and Hashing Spring 2009 Sections , , Problems , 12.7, 12.8, 12.13, 12.15,
P p Chapter 10 has several programming projects, including a project that uses heaps. p p This presentation shows you what a heap is, and demonstrates.
Succinct Data Structures Ian Munro University of Waterloo Joint work with David Benoit, Andrej Brodnik, D, Clark, F. Fich, M. He, J. Horton, A. López-Ortiz,
CMSC 341 B- Trees D. Frey with apologies to Tom Anastasio.
An experimental study of priority queues By Claus Jensen University of Copenhagen.
Succinct Dynamic Cardinal Trees with Constant Time Operations for Small Alphabet Pooya Davoodi Aarhus University May 24, 2011 S. Srinivasa Rao Seoul National.
Tries1. 2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3)
Lecture 11COMPSCI.220.FS.T Balancing an AVLTree Two mirror-symmetric pairs of cases to rebalance the tree if after the insertion of a new key to.
Agenda Review: –Planar Graphs Lecture Content:  Concepts of Trees  Spanning Trees  Binary Trees Exercise.
Sets of Digital Data CSCI 2720 Fall 2005 Kraemer.
Succinct Ordinal Trees Based on Tree Covering Meng He, J. Ian Munro, University of Waterloo S. Srinivasa Rao, IT University of Copenhagen.
Lecture 10 Page 1 CS 111 Summer 2013 File Systems Control Structures A file is a named collection of information Primary roles of file system: – To store.
B-TREE. Motivation for B-Trees So far we have assumed that we can store an entire data structure in main memory What if we have so much data that it won’t.
Internal and External Sorting External Searching
Week 15 – Wednesday.  What did we talk about last time?  Review up to Exam 1.
Exam 3 Review Data structures covered: –Hashing and Extensible hashing –Priority queues and binary heaps –Skip lists –B-Tree –Disjoint sets For each of.
Onlinedeeneislam.blogspot.com1 Design and Analysis of Algorithms Slide # 1 Download From
8/3/2007CMSC 341 BTrees1 CMSC 341 B- Trees D. Frey with apologies to Tom Anastasio.
Navigation Piles with Applications to Sorting, Priority Queues, and Priority Deques Jyrki Katajainen and Fabio Vitale Department of Computing, University.
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
Succinct Data Structures
Tries 07/28/16 11:04 Text Compression
Data Structures: Disjoint Sets, Segment Trees, Fenwick Trees
Succinct Data Structures
Tries 5/27/2018 3:08 AM Tries Tries.
Succinct Data Structures
Experimental evaluation of Navigation piles
Succinct Data Structures: Upper, Lower & Middle Bounds
Hashing Exercises.
Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
Discrete Methods in Mathematical Informatics
Paolo Ferragina Dipartimento di Informatica, Università di Pisa
Database Design and Programming
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
Paolo Ferragina Dipartimento di Informatica, Università di Pisa
Succinct Data Structures
Presentation transcript:

Summer School '131 Succinct Data Structures Ian Munro

Summer School '132 General Motivation In Many Computations... Storage Costs of Pointers and Other Structures Dominate that of Real Data Often this information is not “just random pointers” How do we encode a combinatorial object (e.g. a tree) of specialized information … even a static one in a small amount of space & still perform queries in constant time ???

Summer School '133 Representation of a combinatorial object: Space requirement of representation “close to” information theoretic lower bound and Time for operations required of the data type comparable to that of representation without such space constraints (O(1)) Succinct Data Structure

Example : Static Bounded Subset Summer School '134

5 Beame-Fich: Find largest less than i is tough in some ranges of m(e.g. m≈2 √lg n ) But OK if i is present this can be added (Raman, Raman, Rao etc) Careful.. Lower Bounds

Summer School '136 Focus on Trees.. Because Computer Science is.. Arbophilic Directories (Unix, all the rest) Search trees (B-trees, binary search trees, digital trees or tries) Graph structures (we do a tree based search) and a key application Search indices for text (including DNA)

Summer School '137 Preprocess Text for Search A Big Patricia Trie/Suffix Trie Given a large text file; treat it as bit vector Construct a trie with leaves pointing to unique locations in text that “match” path in trie (paths must start at character boundaries) Skip the nodes where there is no branching ( n-1 internal nodes)

So the basic story on text search A suffix tree (40 years old this year) permits search for any arbitrary query string in time proportional to the query string. But the usual space for the tree can be prohibitive Most users, especially in Bioinformatics as well as Open Text and Manber & Myers went to suffix arrays instead. Suffix array: reference to each index point in order by what is pointed to Summer School '138

The Issue Suffix tree/ array methods remain extremely effective, especially for single user, single machine searches. So, can we represent a tree (e.g. a binary tree) in substantially less space? Summer School '139

10 Abstract data type: binary tree Size: n-1 internal nodes, n leaves Operations: child, parent, subtree size, leaf data Motivation: “Obvious” representation of an n node tree takes about 6 n lg n bit words (up, left, right, size, memory manager, leaf reference) i.e. full suffix tree takes about 5 or 6 times the space of suffix array (i.e. leaf references only) Space for Trees

Succinct Representations of Trees Summer School '1311

Summer School '1312 Use parenthesis notation Represent the tree As the binary string (((())())((())()())): traverse tree as “(“ for node, then subtrees, then “)” Each node takes 2 bits Arbitrary Order Trees

Summer School '1313 Only 1 heap (shape) on n nodes Balanced tree, bottom level pushed left number nodes row by row; lchild(i)=2i; rchild(i)=2i+1 What you learned about Heaps

Summer School '1314 Only 1 heap (shape) on n nodes Balanced tree, bottom level pushed left number nodes row by row; lchild(i)=2i; rchild(i)=2i+1 Data: Parent value > child This gives an implicit data structure for priority queue What you learned about Heaps

Summer School '1315 Add external nodes Enumerate level by level Store vector length 2n+1 (Here we don’t know size of subtrees; can be overcome. Could use isomorphism to flip between notations) Generalizing: Heap-like Notation for any Binary Tree

Summer School '1316 Add external nodes Enumerate level by level Store vector length 2n+1 (Here we don’t know size of subtrees; can be overcome. Could use isomorphism to flip between notations) What you didn’t know about Heaps

Summer School '1317 How do we Navigate? Jacobson’s key suggestion: Operations on a bit vector rank(x) = # 1’s up to & including x select(x) = position of x th 1 So in the binary tree leftchild(x) = 2 rank(x) rightchild(x) = 2 rank(x) + 1 parent(x) = select(x/2)

Summer School '1318 Add external nodes Enumerate level by level Store vector length 2n+1 (Here don’t know size of subtrees; can be overcome. Could use isomorphism to flip between notations) Heap-like Notation for a Binary Tree Rank 5 Node 11

Summer School '1319 Rank & Select Rank: Auxiliary storage ~ 2nlglg n / lg n bits #1’s up to each (lg n) 2 rd bit #1’s within these too each lg n th bit Table lookup after that Select: More complicated (especially to get this lower order term) but similar notions Key issue: Rank & Select take O(1) time with lg n bit word (M. et al)… as detailed on the board

Summer School '1320 Lower Bound: for Rank & for Select Theorem (Golynski): Given a bit vector of length n and an “index” (extra data) of size r bits, let t be the number of bits probed to perform rank (or select) then:r=Ω(n (lg t)/t). Proof idea: Argue to reconstructing the entire string with too few rank queries (similarly for select) Corollary (Golynski): Under the lg n bit RAM model, an index of size (n lglg n/ lg n) is necessary and sufficient to perform the rank and the select operations in O(lg n) bit probes, so in) O(1) time.

Summer School '1321 Planar Graphs (Jacobson; Lu et al; Barbay et al) Subset of [n] (Brodnik & M) Permutations [n] → [n] Or more generally Functions [n] → [n] But what operations? Clearly π(i), but also π -1 (i) And then π k (i) and π -k (i) Other Combinatorial Objects

More Data Types Suffix Arrays (special permutations; references to positions in text sorted lexicographically) in linear space Arbitrary Graphs (Farzan & M) “Arbitrary” Classes of Trees (Farzan & M) Partial Orders (M & Nicholson) Summer School '1322

But first … how about integers Of “arbitrary” size Clearly lg n bits … if we take n as an upper bound But what if we have “no idea” Elias: lglg n 0’s, lg n in lglg n bits, n in lg n bits Can we do better? A useful trick in many representations Summer School '1323

Dictionary over n elements [m] Summer School '1324

More on Trees Summer School '1325

Ordinal Trees Children ordered, no bound on number of children, i th cannot exist without i-1 st These correspond to balanced parentheses expressions, Catalan number of forests on n nodes A variety of representations ….. Summer School '1326

But first we need: Indexable Dictionaries  Summer School '1327

Trees Key rule … nodes numbered 1 to n, but data structure gets to choose “names” of nodes Would like ordinal operations: parent, i th child, degree, subtree size Plus child i for cardinal Summer School '1328

Ordinals Many orderings: L evel O rder U nary D egree S equence Node: d 1’s (child birth announcements) then a 0 (death of the node) Write in level order: root has a “1 in the sky”, then birth order = death order Gives O(1) time for parent, i th child, degree Balanced parens gives others, DFUDS … all Summer School '1329

 Summer School '1330

Another approach  Summer School '1331

More on Trees Dynamic trees: Tough going, mainly memory management M, Storm and Raman and Raman, Raman & Rao Other classes: Decomposition into big tree (o(n) nodes); minitrees hanging off (again o(n) in total); and microtrees (most nodes here)microtrees small enough to be coded in table of size o(n) If micotrees have “special feature”, encoding can be optimal.. Even if you don’t know what that means. (Farzan & M) Summer School '1332

Permutations and Functions Permutation π, write in natural form: π (i) i = 1,…n: space n lg n bits, good! Great for computing π, but how about π -1 or π k Other option: write in cycles, mildly worse for space, much worse for any calculations above Summer School '1333

Summer School '1334 Let P be a simple array giving π ; P[i] = π [i] Also have B[i] be a pointer t positions back in (the cycle of) the permutation; B[i]= π -t [i].. But only define B for every t th position in cycle. (t is a constant; ignore cycle length “round-off”) So array representation P = [ x x 3 x 2 x 10 1] Permutations: a Shortcut Notation

Summer School '1335 In a cycle there is a B every t positions … But these positions can be in arbitrary order Which i’s have a B, and how do we store it? Keep a vector of all positions: 0 = no B1 = B Rank gives the position of B[“i”] in B array So: π (i) & π -1 (i) in O(1) time & (1+ε)n lg n bits Theorem: Under a pointer machine model with space (1+ ε) n references, we need time 1/ε to answer π and π -1 queries; i.e. this is as good as it gets … in the pointer model. Representing Shortcuts

Summer School '1336 This is the best we can do for O(1) operations But using Benes networks: 1-Benes network is a 2 input/2 output switch r+1-Benes network … join tops to tops #bits(n)=2#bits(n/2)+n=n lg n-n+1=min+(n) R-Benes Network Getting it n lg n Bits

Summer School '1337 Realizing the permutation (std π (i) notation) π = ( ) ; π -1 = ( ) Note: (n) bits more than “necessary” A Benes Network

Summer School '1338 Divide into blocks of lg lg n gates … encode their actions in a word. Taking advantage of regularity of address mechanism and also Modify approach to avoid power of 2 issue Can trace a path in time O(lg n/(lg lg n) This is the best time we are able get for π and π -1 in nearly minimum space. What can we do with it?

Summer School '1339 Observe: This method “violates” the pointer machine lower bound by using “micropointers”. But … More general Lower Bound (Golynski): Both methods are optimal for their respective extra space constraints Both are Best

Summer School '1340 Consider the cycles of π ( 2 6 8)( )( 4 1 7) Bit vector indicates start of each cycle ( ) Ignore parens, view as new permutation, ψ. Note: ψ -1 (i) is position containing i … So we have ψ and ψ -1 as before Use ψ -1 (i) to find i, then n bit vector (rank, select) to find π k or π -k Back to the main track: Powers of π

Summer School '1341 Now consider arbitrary functions [n] → [n] “A function is just a hairy permutation” All tree edges lead to a cycle Functions

Summer School '1342 Essentially write down the components in a convenient order and use the n lg n bits to describe the mapping (as per permutations) To get f k (i): Find the level ancestor (k levels up) in a tree Or Go up to root and apply f the remaining number of steps around a cycle Challenges here

Summer School '1343 There are several level ancestor techniques using O(1) time and O(n) WORDS. Adapt Bender & Farach-Colton to work in O(n) bits But going the other way … Level Ancestors

Summer School '1344 Moving Down the tree requires care f -3 ( ) = ( ) The trick: Report all nodes on a given level of a tree in time proportional to the number of nodes, and Don’t waste time on trees with no answers Level Ancestors

Summer School '1345 Given an arbitrary function f: [n] → [n] With an n lg n + O(n) bit representation we can compute f k (i) in O(1) time and f -k (i) in time O(1 + size of answer). f & f-1 are very useful in several applications … then on to binary relations (HTML markup) Final Function Result

Summer School '1346 Interesting, and useful, combinatorial objects can be: Stored succinctly … O(lower bound) +o() So that Natural queries are performed in O(1) time (or at least very close) Programs: This can make the difference between using the data type and not … Conclusion