The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space. Roberto Grossi, Giuseppe Ottaviano*, Università di Pisa. * Part of the work done while at Microsoft Research Cambridge.


The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space
Roberto Grossi, Giuseppe Ottaviano*
Università di Pisa
* Part of the work done while at Microsoft Research Cambridge

Disclaimer THIS HAS NOTHING TO DO WITH WAVELETS!

Indexed String Sequences
Example sequence: (foo, bar, foobar, foo, bar, bar, foo)
Queries
– Access(i): access the i-th element
  Access(2) = foobar
– Rank(s, pos): count occurrences of s before pos
  Rank(bar, 5) = 2
– Select(s, i): find the i-th occurrence of s
  Select(foo, 2) =

Prefix operations
Example sequence: (foo, bar, foobar, foo, bar, bar, foo)
Queries
– RankPrefix(p, pos): count strings prefixed by p before pos
  RankPrefix(foo, 5) = 3
– SelectPrefix(p, i): find the i-th string prefixed by p
  SelectPrefix(foo, 2) =
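The five queries can be pinned down with a naive reference implementation over a plain Python list. This is only a sketch of the semantics, not of the compressed structure: every query here scans the list. Positions are 0-based, as Access(2) = foobar implies; counting occurrences from 0 in Select and SelectPrefix is an assumption, since the transcript does not preserve those two results.

```python
from typing import Callable, List

S = ["foo", "bar", "foobar", "foo", "bar", "bar", "foo"]

def _nth_match(seq: List[str], matches: Callable[[str], bool], i: int) -> int:
    """Position of the i-th (0-based, assumed convention) element satisfying `matches`."""
    seen = 0
    for pos, x in enumerate(seq):
        if matches(x):
            if seen == i:
                return pos
            seen += 1
    raise IndexError("fewer than i + 1 matching elements")

def access(seq: List[str], i: int) -> str:
    return seq[i]

def rank(seq: List[str], s: str, pos: int) -> int:
    # Occurrences of s strictly before position pos.
    return sum(1 for x in seq[:pos] if x == s)

def select(seq: List[str], s: str, i: int) -> int:
    return _nth_match(seq, lambda x: x == s, i)

def rank_prefix(seq: List[str], p: str, pos: int) -> int:
    # Strings having p as a prefix, strictly before position pos.
    return sum(1 for x in seq[:pos] if x.startswith(p))

def select_prefix(seq: List[str], p: str, i: int) -> int:
    return _nth_match(seq, lambda x: x.startswith(p), i)
```

Rank and Select are inverse to each other in the usual sense: under this convention, rank(S, s, select(S, s, i)) == i for every valid i.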

Example: storing relations
Write the columns as string sequences
– Store them separately
– Reduce relational operations to sequence queries
Example columns:
  User: Leonard, Penny, Sheldon, Penny, Leonard, Sheldon
  Likes URL: battle.net/wow/, tmz.com, battle.net/wow/, thecheesecakefactory.com, wikipedia.org/Star_Trek, wikipedia.org/String_theory, marvel.com
What does Sheldon like?
Who likes pages from domain wikipedia.org?
Other operations: range counting, …
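A minimal sketch of how the two questions reduce to sequence queries, using naive list scans in place of the compressed index. The row pairing below is an assumption made for illustration (the transcript does not preserve which user goes with which URL), and the helper names `likes_of` and `users_liking_domain` are hypothetical.

```python
from typing import Callable, List

# Two columns stored separately; row i of one column pairs with row i of the other.
# The pairing itself is assumed, not taken from the slide.
users = ["Leonard", "Penny", "Sheldon", "Penny", "Leonard", "Sheldon"]
urls = ["battle.net/wow/", "tmz.com", "battle.net/wow/",
        "thecheesecakefactory.com", "wikipedia.org/Star_Trek",
        "wikipedia.org/String_theory"]

def _nth_match(seq: List[str], matches: Callable[[str], bool], i: int) -> int:
    seen = 0
    for pos, x in enumerate(seq):
        if matches(x):
            if seen == i:
                return pos
            seen += 1
    raise IndexError("not enough matches")

def rank(seq, s, pos):
    return sum(1 for x in seq[:pos] if x == s)

def select(seq, s, i):
    return _nth_match(seq, lambda x: x == s, i)

def rank_prefix(seq, p, pos):
    return sum(1 for x in seq[:pos] if x.startswith(p))

def select_prefix(seq, p, i):
    return _nth_match(seq, lambda x: x.startswith(p), i)

def likes_of(user: str) -> List[str]:
    """'What does <user> like?': Rank counts the rows, Select finds them, Access reads the URL."""
    count = rank(users, user, len(users))
    return [urls[select(users, user, i)] for i in range(count)]

def users_liking_domain(domain: str) -> List[str]:
    """'Who likes pages from <domain>?': the same pattern with the prefix variants."""
    count = rank_prefix(urls, domain, len(urls))
    return sorted({users[select_prefix(urls, domain, i)] for i in range(count)})
```

With the assumed pairing, `likes_of("Sheldon")` reads rows 2 and 5, and `users_liking_domain("wikipedia.org")` reads rows 4 and 5.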

Dynamic sequences
We want to support the following operations:
– Insert(s, pos): insert the string s immediately before position pos
– Append(s): append the string s at the end of the sequence (special case of Insert)
– Delete(pos): delete the string at position pos
If a data structure supports only Append, we call it append-only; otherwise we call it dynamic (or fully dynamic)

Requirements
Store the sequence in as little space as possible
– Close to the information-theoretic lower bound
But still be able to support all the described operations (query and update) efficiently
– Aim for worst-case polylog operations

Some notation
Example sequence: (foo, bar, foobar, foo, bar, bar, foo)
Sequence S, |S| = n
– In the example, n = 7
The string set S_set is the unordered set of distinct strings appearing in S
– In the example, S_set = {foo, bar, foobar}, |S_set| = 3
– Also called the alphabet
Sequence symbols can also be integers, characters, …
– As long as they are binarized to strings

Wavelet Trees
Introduced in 2003 to represent Compressed Suffix Arrays
Support Access/Rank/Select on sequences over a finite alphabet (of integers)
– Queries reduce to operations on bitvectors by recursively partitioning the alphabet
String sequences can be reduced to integer sequences

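Wavelet Trees
S = (a, b, r, a, c, a, d, a, b, r, a), S_set = {a, b, c, d, r}
[Figure: wavelet tree for abracadabra. The root splits the alphabet into {a, b} and {c, d, r}. The {a, b} child stores abaaaba and has leaves a and b. The {c, d, r} child stores rcdr with bitvector 1011 and splits into leaf c and {d, r}, which stores rdr with bitvector 101 and has leaves d and r.]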
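The recursive partitioning in the figure can be sketched as follows. This is a pointer-based toy with plain Python lists as bitvectors and linear-time bit rank, so it illustrates the structure only, not the compressed representation. Splitting the sorted alphabet at its midpoint is an assumption chosen because it reproduces the partition shown in the figure.

```python
from typing import List, Optional

class WaveletTree:
    def __init__(self, seq: List[str], alphabet: Optional[List[str]] = None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.bits: Optional[List[int]] = None  # None marks a leaf (single symbol)
        self.left: Optional["WaveletTree"] = None
        self.right: Optional["WaveletTree"] = None
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            low = set(self.alphabet[:mid])
            # Bit 0: symbol belongs to the lower half of the alphabet; bit 1: upper half.
            self.bits = [0 if c in low else 1 for c in seq]
            self.left = WaveletTree([c for c in seq if c in low], self.alphabet[:mid])
            self.right = WaveletTree([c for c in seq if c not in low], self.alphabet[mid:])

    def _rank_bit(self, b: int, pos: int) -> int:
        # Naive O(pos) scan; a succinct bitvector would answer this in constant time.
        return sum(1 for x in self.bits[:pos] if x == b)

    def access(self, i: int) -> str:
        if self.bits is None:
            return self.alphabet[0]
        b = self.bits[i]
        child = self.right if b else self.left
        return child.access(self._rank_bit(b, i))

    def rank(self, c: str, pos: int) -> int:
        """Occurrences of c strictly before position pos."""
        if self.bits is None:
            return pos if c == self.alphabet[0] else 0
        mid = len(self.alphabet) // 2
        b = 0 if c in self.alphabet[:mid] else 1
        child = self.right if b else self.left
        return child.rank(c, self._rank_bit(b, pos))

wt = WaveletTree(list("abracadabra"))
```

Each query visits one node per level, which is where the O(log |S_set|) bound for a balanced tree comes from.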

Wavelet Trees
Space equal to the entropy of the sequence
– Plus negligible terms
Supports Access/Rank/Select in O(log |S_set|)
Later extended to support Insert/Delete…
– … but the tree structure is fixed a priori
– The string set S_set cannot be changed!
– Unrealistic restriction in many database applications

The Wavelet Trie
The Wavelet Trie is a Wavelet Tree on sequences of binary strings (S_set ⊂ {0, 1}*)
Supports Access/Rank(Prefix)/Select(Prefix)
Fully dynamic…
… or append-only (with better bounds)
The string set need not be known in advance

Wavelet Trie: Construction
Common prefix: α
Branching bit: β
[Figure: a sequence of binary strings and the root node built from it, with α: 010 and its bitvector β]

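Wavelet Trie: Construction
[Figure: the complete Wavelet Trie for the example sequence. The root has α: 010; internal nodes carry a common prefix α and a bitvector β (for example α: ε, β: 101), and leaves carry only an α (for example α: 01).]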
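The construction can be sketched recursively: take the longest common prefix α of the current subsequence, record in β the bit that follows α in each string, and recurse on the two subsequences of remaining suffixes. The sketch below uses an illustrative sequence of its own, not the one in the figure, and assumes the string set is prefix-free (no string is a proper prefix of another). The names `Node` and `build` are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    alpha: str                        # common prefix stored at this node
    beta: Optional[List[int]] = None  # branching bits; None marks a leaf
    left: Optional["Node"] = None     # subsequence whose branching bit is 0
    right: Optional["Node"] = None    # subsequence whose branching bit is 1

def build(seq: List[str]) -> Node:
    """Build a Wavelet Trie for a non-empty sequence over a prefix-free set (assumed)."""
    alpha = os.path.commonprefix(seq)
    if all(s == alpha for s in seq):
        return Node(alpha)  # every string in this subsequence is identical: leaf
    k = len(alpha)
    # Prefix-freeness guarantees every string is longer than alpha here,
    # so s[k] is the branching bit of s.
    beta = [int(s[k]) for s in seq]
    left = build([s[k + 1:] for s in seq if s[k] == "0"])
    right = build([s[k + 1:] for s in seq if s[k] == "1"])
    return Node(alpha, beta, left, right)

# Illustrative sequence (an assumption, not the slide's example).
seq = ["0001", "0011", "0100", "00100", "0100", "00100", "0100"]
root = build(seq)
```

For this sequence the root stores α = 0 and β = 0010101; its right child is a leaf with α = 00, because all three strings routed there are equal.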

Wavelet Trie: Access
[Figure: Access(5) evaluated on the example trie by descending from the root]
Rank is similar

Wavelet Trie: Select
[Figure: a Select query with i = 1 evaluated on the example trie; the value shown as the result is 4]
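A sketch of how the queries walk the trie. Access and Rank descend from the root, translating the position at each node with a bit rank. Select descends to the leaf of the string and then climbs back up, translating the index at each node with a bit select. The example sequence, the naive linear-time bit operations, the 0-based conventions, and all function names are assumptions made for illustration. The string set is assumed prefix-free.

```python
import os
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    alpha: str
    beta: Optional[List[int]] = None  # None marks a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(seq: List[str]) -> Node:
    alpha = os.path.commonprefix(seq)
    if all(s == alpha for s in seq):
        return Node(alpha)
    k = len(alpha)
    return Node(alpha, [int(s[k]) for s in seq],
                build([s[k + 1:] for s in seq if s[k] == "0"]),
                build([s[k + 1:] for s in seq if s[k] == "1"]))

def rank_bit(bits: List[int], b: int, pos: int) -> int:
    return sum(1 for x in bits[:pos] if x == b)

def select_bit(bits: List[int], b: int, i: int) -> int:
    seen = 0
    for pos, x in enumerate(bits):
        if x == b:
            if seen == i:
                return pos
            seen += 1
    raise IndexError("fewer than i + 1 occurrences")

def access(node: Node, i: int) -> str:
    out = node.alpha
    while node.beta is not None:
        b = node.beta[i]
        i = rank_bit(node.beta, b, i)           # position inside the child's subsequence
        node = node.right if b else node.left
        out += str(b) + node.alpha
    return out

def rank(node: Node, s: str, pos: int) -> int:
    while True:
        if not s.startswith(node.alpha):
            return 0
        s = s[len(node.alpha):]
        if node.beta is None:
            return pos if s == "" else 0
        if s == "":
            return 0                            # s ends at an internal node: not stored
        b = int(s[0])
        pos = rank_bit(node.beta, b, pos)
        node, s = (node.right if b else node.left), s[1:]

def rank_prefix(node: Node, p: str, pos: int) -> int:
    while True:
        if node.alpha.startswith(p):
            return pos                          # p ends here: all strings below match
        if node.beta is None or not p.startswith(node.alpha):
            return 0
        p = p[len(node.alpha):]
        b = int(p[0])
        pos = rank_bit(node.beta, b, pos)
        node, p = (node.right if b else node.left), p[1:]

def select(node: Node, s: str, i: int) -> int:
    path = []                                   # (node, branching bit) pairs, root first
    while True:
        if not s.startswith(node.alpha):
            raise KeyError("string not in the sequence")
        s = s[len(node.alpha):]
        if node.beta is None:
            if s != "":
                raise KeyError("string not in the sequence")
            break
        if s == "":
            raise KeyError("string not in the sequence")
        b = int(s[0])
        path.append((node, b))
        node, s = (node.right if b else node.left), s[1:]
    for parent, b in reversed(path):            # climb back up to the root
        i = select_bit(parent.beta, b, i)
    return i

seq = ["0001", "0011", "0100", "00100", "0100", "00100", "0100"]
root = build(seq)
```

Each query touches one node per level of the path spelled by s, which matches the shape of the O(|s| + h_s) bound once the bit operations are constant-time.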

Wavelet Trie: Append
[Figure: a string appended to the example trie, updating the bitvectors along its path and adding new nodes]
Insert/Delete are similar
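A sketch of Append on a pointer-based trie with plain lists as bitvectors. The interesting case is a string that diverges inside a node's α: the node is split, and the new internal node needs a bitvector made of one constant bit repeated for every string already below it, followed by the new string's bit. That is the 0^n or 1^n initialization that the dynamic bitvectors must support. The per-node `count` field, the names, and the example sequence are assumptions for illustration. The string set is assumed prefix-free.

```python
import os
from typing import List, Optional

class Node:
    def __init__(self, alpha: str, count: int):
        self.alpha = alpha
        self.count = count                      # number of sequence elements below this node
        self.beta: Optional[List[int]] = None   # None marks a leaf
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None

def append(root: Optional[Node], s: str) -> Node:
    """Append s to the sequence and return the (possibly new) root."""
    if root is None:
        return Node(s, 1)
    node = root
    while True:
        l = len(os.path.commonprefix([node.alpha, s]))
        if l < len(node.alpha):
            if l == len(s):
                raise ValueError("string set must be prefix-free")
            # s diverges inside alpha: split this node in place.
            old = Node(node.alpha[l + 1:], node.count)
            old.beta, old.left, old.right = node.beta, node.left, node.right
            new = Node(s[l + 1:], 1)
            old_bit, new_bit = int(node.alpha[l]), int(s[l])
            # Constant run for the existing strings, then the new string's bit.
            node.beta = [old_bit] * node.count + [new_bit]
            node.left, node.right = (old, new) if old_bit == 0 else (new, old)
            node.alpha = node.alpha[:l]
            node.count += 1
            return root
        if node.beta is None:
            if len(s) != l:
                raise ValueError("string set must be prefix-free")
            node.count += 1                     # s already present: the leaf just grows
            return root
        node.count += 1
        b = int(s[l])
        node.beta.append(b)                     # extend this node's branching bits
        node, s = (node.right if b else node.left), s[l + 1:]

root = None
for x in ["0001", "0011", "0100", "00100", "0100", "00100", "0100"]:
    root = append(root, x)
```

Appending the strings one by one yields the same trie that a batch construction over the whole sequence would, so the string set does not have to be known in advance.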

Space analysis
Information-theoretic lower bound
– LB(S) = LT(S_set) + nH_0(S)
– LT is the information-theoretic lower bound for storing a set of strings
Static WT: LB(S) + o(ĥn)
Append-only WT: LB(S) + PT(S_set) + o(ĥn)
– PT(S_set): space taken by the Patricia Trie
Fully dynamic WT: LB(S) + PT(S_set) + O(nH_0(S))
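The nH_0(S) term can be computed directly from symbol frequencies, taking H_0 to be the usual zero-order empirical entropy (an assumption about notation the slide does not spell out): if string s occurs n_s times among n elements, then nH_0(S) is the sum over distinct s of n_s · log2(n / n_s), in bits. The sketch below covers only this term. LT(S_set) and PT(S_set) are not modeled, and the function name `n_h0` is hypothetical.

```python
from collections import Counter
from math import log2
from typing import Sequence

def n_h0(seq: Sequence[str]) -> float:
    """n * H_0(seq) in bits: sum over distinct symbols of n_s * log2(n / n_s)."""
    n = len(seq)
    return sum(c * log2(n / c) for c in Counter(seq).values())

S = ["foo", "bar", "foobar", "foo", "bar", "bar", "foo"]
bits = n_h0(S)  # foo and bar occur 3 times each, foobar once, n = 7
```

For the running example this gives 6 · log2(7/3) + log2(7) bits, roughly 10.1 bits for the order of the seven elements, on top of whatever the string set itself costs. A sequence that repeats a single string has nH_0 = 0.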

Operations time complexity
Need new dynamic bitvectors to support initialization (create a bitvector 0^n or 1^n)
Static and Append-only Wavelet Trie
– All supported operations in O(|s| + h_s)
– h_s is the number of nodes traversed by string s
Fully dynamic Wavelet Trie
– All supported operations in O(|s| + h_s log n)
– Deletion may take O(|ŝ| + h_s log n), where ŝ is the longest string in the trie

Thanks for your attention! Questions?