Succinct Data Structures Ian Munro University of Waterloo Joint work with David Benoit, Andrej Brodnik, D. Clark, F. Fich, M. He, J. Horton, A. López-Ortiz,

Presentation transcript:

Succinct Data Structures. Ian Munro, University of Waterloo. Joint work with David Benoit, Andrej Brodnik, D. Clark, F. Fich, M. He, J. Horton, A. López-Ortiz, S. Srinivasa Rao, Rajeev Raman, Venkatesh Raman, Adam Storm … How do we encode a large tree, or another combinatorial object carrying specialized information (even a static one), in a small amount of space and still perform queries in constant time?

Example of a Succinct Data Structure: The (Static) Bounded Subset. Given: a universe of n elements [0, …, n-1] and m arbitrary elements from this universe. Create: a static structure to support search in constant time (lg n-bit words and the usual operations). Using: essentially the minimum possible number of bits, i.e. about lg C(n, m). Operation: member query in O(1) time (Brodnik & M.)

Focus on Trees.. Because Computer Science is.. Arbophilic - Directories (Unix, all the rest) - Search trees (B-trees, binary search trees, digital trees or tries) - Graph structures (we do a tree based search) - Search indices for text (including DNA)

A Big Patricia Trie / Suffix Trie - Given a large text file, treat it as a bit vector - Construct a trie with leaves pointing to unique locations in the text that “match” the path in the trie (paths must start at character boundaries) - Skip the nodes where there is no branching (n-1 internal nodes)

Space for Trees. Abstract data type: binary tree. Size: n-1 internal nodes, n leaves. Operations: child, parent, subtree size, leaf data. Motivation: the “obvious” representation of an n-node tree takes about 6n words, i.e. about 6 n lg n bits (up, left, right, size, memory manager, leaf reference), so a full suffix tree takes about 5 or 6 times the space of a suffix array (which holds leaf references only).

Succinct Representations of Trees. Start with Jacobson, then others: there are about 4^n / (√π · n^(3/2)) ordered rooted trees on n nodes, and the same number of binary trees, so the lower bound on specifying one is about 2n bits. What are the natural representations?

Arbitrary Ordered Trees - Use parenthesis notation - Represent the tree as the binary string (((())())((())()())): traverse the tree writing “(” for a node, then its subtrees, then “)” - Each node takes 2 bits
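The traversal described on this slide can be sketched in a few lines of Python. This is only an illustration: the nested-list tree representation and the helper names `to_parens`, `from_parens` and `count_nodes` are assumptions made for the example, not part of the talk.

```python
def to_parens(node):
    """Encode an ordered tree (a node is the list of its children)
    as balanced parentheses: "(" on entry, subtrees, ")" on exit."""
    return "(" + "".join(to_parens(child) for child in node) + ")"


def from_parens(s):
    """Rebuild the nested-list tree from its parenthesis string."""
    holder = []            # will receive the root as its only element
    stack = [holder]
    for ch in s:
        if ch == "(":
            node = []
            stack[-1].append(node)   # attach to the current parent
            stack.append(node)
        else:
            stack.pop()              # close the current node
    return holder[0]


def count_nodes(node):
    """Number of nodes in the nested-list tree."""
    return 1 + sum(count_nodes(child) for child in node)


slide_string = "(((())())((())()()))"
tree = from_parens(slide_string)
```

Each node contributes exactly one “(” and one “)”, which is where the 2 bits per node come from.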

Heap-like Notation for a Binary Tree. Add external nodes. Enumerate level by level. Store a bit vector of length 2n+1. (Here we don’t know the size of subtrees; this can be overcome. Could use the isomorphism to flip between notations.)

How do we Navigate? Jacobson’s key suggestion: operations on a bit vector. rank(x) = # 1’s up to & including x; select(x) = position of the x-th 1. So in the binary tree: leftchild(x) = 2 rank(x); rightchild(x) = 2 rank(x) + 1; parent(x) = select(⌊x/2⌋)
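A minimal sketch of this navigation, assuming 1-indexed positions and a level-order bit vector in which 1 marks an original node and 0 an added external node. The linear-time `rank` and `select` below are deliberately naive stand-ins for the O(1) structures, and the example tree (`BITS`) is hypothetical.

```python
def rank(bits, x):
    """Number of 1s in positions 1..x (bits[0] is an unused dummy)."""
    return sum(bits[1:x + 1])


def select(bits, k):
    """Position of the k-th 1 (1-indexed)."""
    seen = 0
    for pos in range(1, len(bits)):
        seen += bits[pos]
        if seen == k:
            return pos
    raise ValueError("fewer than k ones")


def left_child(bits, x):
    return 2 * rank(bits, x)


def right_child(bits, x):
    return 2 * rank(bits, x) + 1


def parent(bits, x):
    return select(bits, x // 2)


# Hypothetical 4-node tree: the root has children L and R, L has no
# children, R has a left child C. Level order with external nodes added:
# position: 1 2 3 4 5 6 7 8 9
BITS = [0,  1, 1, 1, 0, 0, 1, 0, 0, 0]
```

With n = 4 nodes the vector has 2n + 1 = 9 bits; node R sits at position 3, its children at positions 6 and 7.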

Rank & Select. Rank: auxiliary storage ~ 2n lg lg n / lg n bits: # 1’s up to each (lg n)^2-th bit; # 1’s within these, up to each (lg n)-th bit; table lookup after that. Select: more complicated, but similar notions. Key issue: Rank & Select take O(1) time with a lg n-bit word (M. et al). Aside: an interesting data type by itself.
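The two-level directory for rank might be sketched as follows. This is an illustration under assumptions: the class name `RankDirectory` is made up, positions are 0-indexed, block lengths are rounded to integers, and a direct count over the last partial block stands in for the table lookup mentioned on the slide.

```python
import math


class RankDirectory:
    """Two-level rank index: counts per superblock of about (lg n)^2 bits
    and, within each superblock, counts per block of about lg n bits."""

    def __init__(self, bits):
        self.bits = list(bits)
        n = max(len(self.bits), 2)
        self.b = max(1, int(math.log2(n)))   # block length ~ lg n
        self.sb = self.b * self.b            # superblock length ~ (lg n)^2
        self.super_counts = []   # 1s strictly before each superblock
        self.block_counts = []   # 1s before each block, within its superblock
        total = 0
        within = 0
        for start in range(0, len(self.bits), self.b):
            if start % self.sb == 0:         # a new superblock begins here
                self.super_counts.append(total)
                within = 0
            self.block_counts.append(within)
            ones = sum(self.bits[start:start + self.b])
            total += ones
            within += ones

    def rank(self, x):
        """Number of 1s in positions 0..x inclusive."""
        block = x // self.b
        tail = sum(self.bits[block * self.b:x + 1])  # stands in for table lookup
        return self.super_counts[x // self.sb] + self.block_counts[block] + tail


sample = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
directory = RankDirectory(sample)
```

A query adds three numbers: one superblock count, one block count, and the count inside at most one block.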

Other Combinatorial Objects. Planar graphs (Lu et al). Permutations [n] → [n], or more generally functions [n] → [n]. But what are the operations? Clearly π(i), but also π^-1(i), and then π^k(i) and π^-k(i). Suffix arrays (special permutations) in linear space.

Permutations: a Shortcut Notation. Let P be a simple array giving π; P[i] = π(i). Also have B[i] be a pointer t positions back in (the cycle of) the permutation; B[i] = π^-t(i). But only define B for every t-th position in a cycle. (t is a constant; ignore cycle length “round-off”.) So array representation P = [ x x 3 x 2 x 10 1]
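The shortcut idea can be sketched as below, assuming a 0-indexed permutation stored as a Python list. The function names and the dictionary used to hold the back pointers are illustrative choices; cycles of length at most t get no pointers here, since walking them forward already costs at most t steps.

```python
def build_back_pointers(P, t):
    """For every t-th element along each cycle of P, store the element
    t positions earlier in that cycle (the B pointers of the slide)."""
    n = len(P)
    back = {}
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        cycle = []
        i = start
        while not seen[i]:
            seen[i] = True
            cycle.append(i)
            i = P[i]
        length = len(cycle)
        if length > t:
            for idx in range(0, length, t):
                back[cycle[idx]] = cycle[(idx - t) % length]
    return back


def inverse_via_shortcuts(P, back, i):
    """Return j with P[j] == i: walk forward to the first back pointer,
    jump t positions back, then walk forward again until reaching
    the predecessor of i."""
    j = i
    jumped = False
    while P[j] != i:
        if not jumped and j in back:
            j = back[j]
            jumped = True
        else:
            j = P[j]
    return j


example_perm = [3, 7, 0, 5, 1, 9, 8, 4, 2, 6]
example_back = build_back_pointers(example_perm, 3)
```

Each query follows at most about t forward steps before the jump and at most about t after it, so the inverse costs O(t) with roughly n/t extra pointers.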

Representing Shortcuts. In a cycle there is a B every t positions … but these positions can be in arbitrary order. Which i’s have a B, and how do we store it? Keep a bit vector over all positions: 0 indicates no B, 1 indicates a B. Rank gives the position of B[“i”] in the B array. So: π(i) and π^-1(i) in O(1) time & (1+ε) n lg n bits. Theorem: under a pointer machine model with space (1+ε) n references, we need time 1/ε to answer π and π^-1 queries; i.e. this is as good as it gets.
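A small sketch of how the marker vector and rank index a packed array of back pointers. The names `pack_shortcuts` and `lookup_shortcut`, and the sample pointer values, are hypothetical; rank is computed by a plain sum here rather than by an O(1) structure.

```python
def pack_shortcuts(n, back):
    """Split sparse back pointers into a marker bit vector over all
    positions and a dense array holding only the pointers that exist."""
    marks = [1 if i in back else 0 for i in range(n)]
    packed = [back[i] for i in range(n) if i in back]
    return marks, packed


def lookup_shortcut(marks, packed, i):
    """Return the back pointer stored for position i, or None."""
    if not marks[i]:
        return None
    r = sum(marks[:i + 1])       # rank(i): number of marked positions <= i
    return packed[r - 1]


sparse = {2: 7, 5: 0, 9: 4}      # made-up back pointers for illustration
marks, packed = pack_shortcuts(10, sparse)
```

Only marked positions consume a full pointer; every position pays one bit plus the lower-order rank overhead.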

Getting n lg n Bits: an Aside. This is the best we can do for O(1) operations. But using Benes networks: a 1-Benes network is a 2-input/2-output switch; an (r+1)-Benes network … join tops to tops. R-Benes Network

A Benes Network Realizing the permutation ( )

What can we do with it? Divide into blocks of lg lg n gates … encode their actions in a word, taking advantage of the regularity of the address mechanism, and modify the approach to avoid the power-of-2 issue. Can trace a path in time O(lg n / lg lg n). This is the best time we are able to get for π and π^-1 in minimum space. Observe: this method “violates” the pointer machine lower bound by using “micropointers”.

Back to the main track: Powers of π. Consider the cycles of π: ( 2 6 8)( )( 4 1 7). Keep a bit vector to indicate the start of each cycle ( ). Ignoring parentheses, view this as a new permutation, ψ. Note: ψ^-1(i) is the position containing i … so we have ψ and ψ^-1 as before. Use ψ^-1(i) to find i, then the bit vector (rank, select) to find π^k or π^-k.
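A sketch of the cycle layout for powers, assuming a 0-indexed permutation given as a list. The function names are illustrative, and the linear scans for the boundaries of a cycle stand in for the rank/select operations on the start-marker bit vector.

```python
def build_cycle_layout(P):
    """Write the cycles of P one after another (this sequence is psi),
    mark where each cycle starts, and record psi's inverse."""
    n = len(P)
    psi, starts, seen = [], [0] * n, [False] * n
    for s in range(n):
        if seen[s]:
            continue
        starts[len(psi)] = 1         # a new cycle begins at this position
        i = s
        while not seen[i]:
            seen[i] = True
            psi.append(i)
            i = P[i]
    psi_inv = [0] * n
    for pos, val in enumerate(psi):
        psi_inv[val] = pos
    return psi, psi_inv, starts


def perm_power(psi, psi_inv, starts, i, k):
    """Compute pi^k(i) for any integer k, positive or negative."""
    pos = psi_inv[i]                 # where i sits in the cycle layout
    begin = pos
    while starts[begin] == 0:        # stands in for select(rank(pos))
        begin -= 1
    end = pos + 1
    while end < len(psi) and starts[end] == 0:
        end += 1
    length = end - begin
    return psi[begin + (pos - begin + k) % length]


demo_perm = [3, 7, 0, 5, 1, 9, 8, 4, 2, 6]
demo_layout = build_cycle_layout(demo_perm)
```

Because the offset is reduced modulo the cycle length, the cost does not depend on k, and negative k gives π^-k with the same code.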

Functions Now consider arbitrary functions [n] → [n] “A function is just a hairy permutation” All tree edges lead to a cycle

Challenges here. Essentially write down the components in a convenient order and use the n lg n bits to describe the mapping (as per permutations). To get f^k(i): find the level ancestor (k levels up) in a tree, or go up to the root and apply f the remaining number of steps around a cycle.

Level Ancestors There are several level ancestor techniques using O(1) time and O(n) WORDS. Adapt Bender & Farach-Colton to work in O(n) bits But going the other way …

f^-k is a set. Moving down the tree requires care: f^-3( ) = ( ). The trick: report all nodes on a given level of a tree in time proportional to the number of nodes, and don’t waste time on trees with no answers.

Final Function Result. Given an arbitrary function f: [n] → [n], with an n lg n + O(n) bit representation we can compute f^k(i) in O(1) time and f^-k(i) in time O(1 + size of answer).
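To make the two queries concrete, here is a naive reference sketch of their semantics. It uses ordinary lists and sets and takes time proportional to k, so it is not the succinct structure of the result above; the function names and the sample function are assumptions for illustration.

```python
def func_power(f, i, k):
    """f^k(i) by applying f k times (k >= 0)."""
    for _ in range(k):
        i = f[i]
    return i


def func_preimages(f, i, k):
    """f^-k(i): the set of all x with f^k(x) == i, found by walking
    k levels down the reversed edges."""
    preds = [[] for _ in f]
    for x, y in enumerate(f):
        preds[y].append(x)
    level = {i}
    for _ in range(k):
        level = {p for y in level for p in preds[y]}
    return level


# A made-up function on [0..6]: 0 -> 1 -> 2 -> 0 is a cycle, the rest hang off it.
sample_f = [1, 2, 0, 0, 3, 3, 5]
```

The forward query returns a single value, while the backward query returns a set that may be empty, which is why its cost is stated in terms of the size of the answer.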

Back to Text … And Suffix Arrays. Text T[1..n] over {a,b}*# (a < # < b). There are 2^(n-1) such texts; which of the n! permutations are valid suffix arrays? SA= is abbaaba# SA^-1 = M= isn’t … why?

Ascending to Max. M is a permutation, so M^-1 is its inverse, i.e. M^-1[i] says where i is in M. Ascending-to-Max: ∀ 1 ≤ i ≤ n-2: i) M^-1[i] < M^-1[n] and M^-1[i+1] < M^-1[n] ⇒ M^-1[i] < M^-1[i+1]; ii) M^-1[i] > M^-1[n] and M^-1[i+1] > M^-1[n] ⇒ M^-1[i] > M^-1[i+1].

Non-Nesting: ∀ 1 ≤ i, j ≤ n-1 with M^-1[i] < M^-1[j]: i) M^-1[i] < M^-1[i+1] and M^-1[j] < M^-1[j+1] ⇒ M^-1[i+1] < M^-1[j+1]; ii) M^-1[i] > M^-1[i+1] and M^-1[j] > M^-1[j+1] ⇒ M^-1[i+1] < M^-1[j+1].

Characterization Theorem for Suffix Arrays on Binary Texts. Theorem: Ascending to Max & Non-nesting ⇒ Suffix Array. Corollary: clean method of breaking SA into segments. Corollary: linear time algorithm to check whether SA is valid.
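The two conditions can be checked directly from their definitions. The sketch below is a quadratic-time illustration, not the linear-time algorithm of the corollary; it assumes 1-indexed permutations, the ordering a < # < b, and a naive sorting-based suffix array builder, and all function names are made up for the example.

```python
ORDER = {"a": 0, "#": 1, "b": 2}     # the ordering a < # < b used in the talk


def suffix_array(text):
    """1-indexed suffix array of a text ending in '#', by naive sorting."""
    n = len(text)
    return sorted(range(1, n + 1), key=lambda i: [ORDER[c] for c in text[i - 1:]])


def inverse_positions(M):
    """inv[i] = position of i in M (1-indexed; inv[0] is unused)."""
    inv = [0] * (len(M) + 1)
    for pos, val in enumerate(M, start=1):
        inv[val] = pos
    return inv


def ascending_to_max(M):
    n, inv = len(M), inverse_positions(M)
    for i in range(1, n - 1):        # 1 <= i <= n-2
        if inv[i] < inv[n] and inv[i + 1] < inv[n] and not inv[i] < inv[i + 1]:
            return False
        if inv[i] > inv[n] and inv[i + 1] > inv[n] and not inv[i] > inv[i + 1]:
            return False
    return True


def non_nesting(M):
    n, inv = len(M), inverse_positions(M)
    for i in range(1, n):            # 1 <= i, j <= n-1
        for j in range(1, n):
            if not inv[i] < inv[j]:
                continue
            both_up = inv[i] < inv[i + 1] and inv[j] < inv[j + 1]
            both_down = inv[i] > inv[i + 1] and inv[j] > inv[j + 1]
            if (both_up or both_down) and not inv[i + 1] < inv[j + 1]:
                return False
    return True


def satisfies_both(M):
    return ascending_to_max(M) and non_nesting(M)
```

For example, `satisfies_both(suffix_array("abbaaba#"))` checks both conditions on the suffix array of the text from the earlier slide.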

Cardinality Queries
T = a b a a a b b a a a b a a b b #
Remember the lengths of the longest runs of a’s and of b’s.
SA (broken by runs, but not stored explicitly): 8 3 | | |16 | |6 14
B_a is a bit vector: if SA^-1[i-1] is in an “a” section, store 1 in B_a[SA^-1[i]], else 0.
Create a rank structure on B_a, and similarly on B_b. (Note these are complements of each other except at #.)
Algorithm Count(T, P), where n_a is the number of a’s in T and m is the length of P:
  s ← 1; e ← n; i ← m
  while i > 0 and s ≤ e do
    if P[i] = a then s ← rank_1(B_a, s-1) + 1; e ← rank_1(B_a, e)
    else s ← n_a + 2 + rank_1(B_b, s-1); e ← n_a + 1 + rank_1(B_b, e)
    i ← i - 1
  return max(e-s+1, 0)
Time: O(length of query)
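A runnable sketch of Count, under stated assumptions: the suffix array is built by naive sorting only to derive the two bit vectors, rank is served from prefix-sum lists rather than a succinct structure, and the update for b uses s = n_a + 2 + rank_1(B_b, s-1), which is the standard backward-search step for the ordering a < # < b (the a-suffixes occupy positions 1..n_a and # occupies n_a + 1). The names `build_count_index` and `count_occurrences` are illustrative.

```python
CHAR_ORDER = {"a": 0, "#": 1, "b": 2}    # a < # < b


def build_count_index(T):
    """Return (n, n_a, rank_a, rank_b), where rank_c[x] is the number of
    1s in B_c[1..x] and B_c[j] = 1 iff the j-th smallest suffix is
    preceded in T by the character c."""
    n = len(T)
    sa = sorted(range(n), key=lambda i: [CHAR_ORDER[c] for c in T[i:]])

    def prefix_ranks(ch):
        acc = [0]
        for p in sa:
            acc.append(acc[-1] + (1 if p > 0 and T[p - 1] == ch else 0))
        return acc

    return n, T.count("a"), prefix_ranks("a"), prefix_ranks("b")


def count_occurrences(index, P):
    """Number of occurrences of P in T, processing P right to left."""
    n, n_a, rank_a, rank_b = index
    s, e = 1, n
    i = len(P)
    while i > 0 and s <= e:
        if P[i - 1] == "a":
            s = rank_a[s - 1] + 1
            e = rank_a[e]
        else:
            s = n_a + 2 + rank_b[s - 1]
            e = n_a + 1 + rank_b[e]
        i -= 1
    return max(e - s + 1, 0)


slide_text = "abaaabbaaabaabb#"
slide_index = build_count_index(slide_text)
```

Each pattern character costs two rank queries, which gives the O(length of query) bound once rank is O(1).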

Listing Queries. Complex methods. Key idea: for queries of length at least d, index every d-th position, for T and for T reversed. So we have matches for T[i..n] and T[1..i-1]. View these as points in 2-space (Ferragina & Manzini and Grossi & Vitter). Do a range query (Alstrup et al). A variety of results follow.

General Conclusion Interesting, and useful, combinatorial objects can be: Stored succinctly … O(lower bound) +o() So that Natural queries are performed in O(1) time (or at least very close) This can make the difference between using them and not …