Fast Compressed Tries through Path Decompositions Roberto Grossi Giuseppe Ottaviano* Università di Pisa * Part of the work done while at Microsoft Research.

Slides:



Advertisements
Similar presentations
Binary Trees CSC 220. Your Observations (so far data structures) Array –Unordered Add, delete, search –Ordered Linked List –??
Advertisements

Space-Efficient Data Structures for Top-k Completion Giuseppe Ottaviano Università di Pisa Bo-June (Paul) Hsu Microsoft Research WWW 2013.
Paolo Ferragina, Università di Pisa Compressed Permuterm Index Paolo Ferragina Dipartimento di Informatica, Università di Pisa.
An Improved Succinct Dynamic k-Ary Tree Representation (work in progress) Diego Arroyuelo Department of Computer Science, Universidad de Chile.
Partitioned Elias-Fano Indexes
TREES Chapter 6. Trees - Introduction  All previous data organizations we've studied are linear—each element can have only one predecessor and successor.
The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space Roberto GrossiGiuseppe Ottaviano * Università di Pisa * Part of the work.
A New Compressed Suffix Tree Supporting Fast Search and its Construction Algorithm Using Optimal Working Space Dong Kyue Kim 1 andHeejin Park 2 1 School.
The course Project #1: Dictionary search with 1 error The problem consists of building a data structure to index a large dictionary of strings.
Compressed Compact Suffix Arrays Veli Mäkinen University of Helsinki Gonzalo Navarro University of Chile compact compress.
A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes Meng He, J. Ian Munro, and S. Srinivasa Rao University of Waterloo.
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
The Trie Data Structure Basic definition: a recursive tree structure that uses the digital decomposition of strings to represent a set of strings for searching.
Tries Standard Tries Compressed Tries Suffix Tries.
Hierarchy-conscious Data Structures for String Analysis Carlo Fantozzi PhD Student (XVI ciclo) Bioinformatics Course - June 25, 2002.
Advanced Algorithm Design and Analysis (Lecture 4) SW5 fall 2004 Simonas Šaltenis E1-215b
Modern Information Retrieval Chapter 8 Indexing and Searching.
1 More Specialized Data Structures String data structures Spatial data structures.
Data Structures, Search and Sort Algorithms Kar-Hai Chu
Modern Information Retrieval
Higher Order Tries Key = Social Security Number.   9 decimal digits. 10-way trie (order 10 trie) Height
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
CSC 213 Lecture 18: Tries. Announcements Quiz results are getting better Still not very good, however Average score on last quiz was 5.5 Every student.
Department of Computer Eng. & IT Amirkabir University of Technology (Tehran Polytechnic) Data Structures Lecturer: Abbas Sarraf Search.
6/26/2015 7:13 PMTries1. 6/26/2015 7:13 PMTries2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3) Huffman encoding.
COMP 171 Data Structures and Algorithms Tutorial 10 Hash Tables.
Lecture 06: Tree Structures Topics: Trees in general Binary Search Trees Application: Huffman Coding Other types of Trees.
IP Address Lookup Masoud Sabaei Assistant professor
 Divide the encoded file into blocks of size b  Use an auxiliary bit vector to indicate the beginning of each block  Time – O(b)  Time vs. Memory.
Basics of Data Compression Paolo Ferragina Dipartimento di Informatica Università di Pisa.
A Summary of XISS and Index Fabric Ho Wai Shing. Contents Definition of Terms XISS (Li and Moon, VLDB2001) Numbering Scheme Indices Stored Join Algorithms.
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
Data : The Small Forwarding Table(SFT), In general, The small forwarding table is the compressed version of a trie. Since SFT organizes.
Data Structures and Algorithms Lecture (BinaryTrees) Instructor: Quratulain.
Succinct Data Structures Ian Munro University of Waterloo Joint work with David Benoit, Andrej Brodnik, D, Clark, F. Fich, M. He, J. Horton, A. López-Ortiz,
Lossless Compression CIS 465 Multimedia. Compression Compression: the process of coding that will effectively reduce the total number of bits needed to.
1 Searching Searching in a sorted linked list takes linear time in the worst and average case. Searching in a sorted array takes logarithmic time in the.
Random access to arrays of variable-length items
Tries1. 2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3)
Higher Order Tries Key = Social Security Number.   9 decimal digits. 10-way trie (order 10 trie) Height
1 Tries When searching for the name “Smith” in a phone book, we first locate the group of names starting with “S”, then within those we search for “m”,
Compressed Prefix Sums O’Neil Delpratt Naila Rahman Rajeev Raman.
Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }
Sets of Digital Data CSCI 2720 Fall 2005 Kraemer.
Joint Advanced Student School Compressed Suffix Arrays Compression of Suffix Arrays to linear size Fabian Pache.
Index construction: Compression of postings Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 5.3 and a paper.
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Basics
Contents What is a trie? When to use tries
18-1 Chapter 18 Binary Trees Data Structures and Design in Java © Rick Mercer.
Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
A simple storage scheme for strings achieving entropy bounds Paolo Ferragina and Rossano Venturini Dipartimento di Informatica University of Pisa, Italy.
Generic Trees—Trie, Compressed Trie, Suffix Trie (with Analysi
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
Succinct Data Structures
Tries 07/28/16 11:04 Text Compression
Tries 5/27/2018 3:08 AM Tries Tries.
Higher Order Tries Key = Social Security Number.
Succinct Data Structures
Succinct Data Structures
Reducing the Space Requirement of LZ-index
Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
HEXA: Compact Data Structures for Faster Packet Processing
Chapter 8 – Binary Search Tree
Higher Order Tries Key = Social Security Number.
Data Structures and Analysis (COMP 410)
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
Tries 2/27/2019 5:37 PM Tries Tries.
Presentation transcript:

Fast Compressed Tries through Path Decompositions Roberto Grossi Giuseppe Ottaviano* Università di Pisa * Part of the work done while at Microsoft Research Cambridge

t three trial triangle trie triple triply iree εε l εεεgle hr e p a l n Node label Branching character Compacted tries ye

Applications String dictionaries – With prefix lookup, predecessor, … – Exploit prefix compression Monotone perfect hash functions – “Hollow” or “Blind” tries [ALENEX 09] – Binary tree (no need store branching chars) – No need to store node labels, just lengths (skips)

Height vs. performance Tries can be deep – no guarantee on height Bad with pointer-based trees – ~1 cache miss per child operation Worse with succinct tree encodings – Need to access several directories – Many cache misses per child operation – Large constants hidden in the O(1)

Path decomposition t iree εε l εεεgle hr e p a l n ye he p l Query: triple Recurse here with suffix le

Centroid path decomposition Decompose along the heavy paths – choose the edge that has most descendants Height of the decomposed tree: O(log n) – Usually lower Average height Web QueriesURLsSynthetic Compacted trie Centroid trie Hollow trie Centroid hollow trie

Succinct encoding [PODS 08] presents a succinct data structure for centroid path-decomposed tries Not practical: need complex operations on succinct trees We introduce a simpler and practical encoding This encoding enables also simple compression of the labels

Succinct encoding he p l L : t1ri2a1ngle BP: ( ((( ) B : h epl (spaces added for clarity) Node label written literally, interleaved with number of other branching characters at that point in array L Corresponding branching characters in array B Tree encoded with DFUDS in bitvector BP – Variant of Range Min-Max tree [ALENEX 10] to support Find{Close,Open}, more space-efficient (Range Min tree)

Compression of L...$...index.html$....html$....html$...index.html$...$...35$...5$...5$...35$ … 3 index … 5.html … Dictionary Dictionary codewords shared among labels Codewords do not cross label boundaries ( $ ) Use vbyte to compress the codeword ids

Compression of L Node labels ( t1ri2a1ngle, l1e, …): – each label is suffix of a string in the set – interleaved with few “special characters” 1, 2, 3,… Compressible if strings are compressible Dictionary and parsing computed with modified Re-Pair – Domain-specific compression can be used instead Decompression overhead negligible

Experimental results (time) Experiments show gains in time comparable to the gains in height Confirm that bottleneck is traversal operations Web QueriesURLsSynthetic Trie Centroid trie Hollow trie [ALENEX 09] Hollow trie Centroid hollow trie (microseconds, lower is better) Code available at

Experimental results (space) For strings with many common prefixes, even non-compressed trie is space-efficient Labels compression considerably increases space-efficiency Decompression time overhead: ~10% Web QueriesURLsSynthetic Hu-Tucker Front Coding40.9%24.4%19.1% Centroid trie55.6%22.4%17.9% Centroid trie + compression31.5%13.6%0.4% (compression ratio, lower is better) Code available at

Thanks for your attention! Questions?

References [ALENEX 10] D. Arroyuelo, R. Cánovas, G. Navarro, and K. Sadakane. Succinct trees in practice. In ALENEX, pages 84–97, [ALENEX 09] D. Belazzougui, P. Boldi, R. Pagh, and S. Vigna. Monotone minimal perfect hashing: searching a sorted table with O(1) accesses. In SODA, pages 785– 794, [PODS 08] P. Ferragina, R. Grossi, A. Gupta, R. Shah, and J. S. Vitter. On searching compressed string collections cache-obliviously. In PODS, pages 181–190, 2008.