Interplay between Stringology and Data Structure Design Roberto Grossi.

Presentation transcript:

Interplay between Stringology and Data Structure Design Roberto Grossi

Interplay between Stringology and Data Structure Design (limited view: my own experience) Roberto Grossi


4 Interaction between stringology and data structures
Case studies:
- Compressed text indexing [G., Gupta, Vitter]
- Multi-key data structures [Crescenzi, G., Italiano] [Franceschini, G.] [G., Italiano]
- Order vs. disorder in searching [Franceschini, G.]
- In-place vector sorting [Franceschini, G.]

6 Compressed text indexing
Replace a text in $\Sigma^n$ by a self-indexing binary string [Ferragina, Manzini] [Sadakane] [G., Gupta, Vitter]: $n \log |\Sigma|$ bits $\Rightarrow$ $n H_h$ + … bits (where $H_h$ = $h$-order empirical entropy).
Unique algorithmic framework: wavelet tree + finite set model + succinct dictionaries + …
- Compression: new analysis of BWT (Burrows-Wheeler transform)
- Text indexing: new implementation of CSA (compressed suffix array)

7 Suffix arrays, BWT, and $H_h$ (high-order empirical entropy)
Equivalently, use contexts x of order h for cx instead of xc.
T = mississippi#
Sorted suffixes: #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
BWT: ipssm#pissii
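The example on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: it sorts suffixes naively (quadratic work in the worst case) rather than with a proper suffix-sorting algorithm, and the function names `suffix_array` and `bwt` are illustrative, not taken from the talk. It relies on `#` being smaller than every letter, which holds in ASCII.

```python
def suffix_array(text):
    """Return the starting positions (0-based) of the suffixes of text in sorted order."""
    return sorted(range(len(text)), key=lambda p: text[p:])


def bwt(text):
    """Burrows-Wheeler transform: the character preceding each sorted suffix.

    Assumes text ends with a unique terminator that is smaller than all
    other characters; text[-1] is used for the suffix starting at 0.
    """
    return "".join(text[p - 1] for p in suffix_array(text))


if __name__ == "__main__":
    T = "mississippi#"
    for p in suffix_array(T):
        print(T[p - 1], T[p:])   # BWT character, then the suffix it precedes
    print(bwt(T))
```
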

8 Suffix arrays, BWT, and $H_h$ (high-order empirical entropy)
Context x = i, h = 1. Chars c = p, s, m.
Store "pssm" using just [formula missing in the transcript] bits. Get $n H_h$ bits! Add [formula missing in the transcript] bits to encode the partition.
Sorted suffixes: #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
BWT: ipssm#pissii
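To see where "pssm" comes from, the BWT can be split according to the order-h context that follows each character. The sketch below is an illustration under stated assumptions: it sorts cyclic rotations naively (equivalent to sorting suffixes when the text ends with a unique smallest terminator), and `bwt_by_context` is a hypothetical helper name.

```python
def bwt_by_context(text, h):
    """Group the BWT characters of text by the h characters that follow them.

    Assumes text ends with a unique terminator smaller than all other
    characters, so sorting rotations agrees with sorting suffixes.
    """
    n = len(text)
    starts = sorted(range(n), key=lambda p: text[p:] + text[:p])
    groups = {}
    for p in starts:
        rotation = text[p:] + text[:p]
        context = rotation[:h]            # order-h context x
        groups.setdefault(context, []).append(text[p - 1])   # character c preceding x
    return {ctx: "".join(chars) for ctx, chars in groups.items()}


if __name__ == "__main__":
    for ctx, chars in bwt_by_context("mississippi#", 1).items():
        print(repr(ctx), chars)
```

Concatenating the groups in context order gives back the BWT, so encoding each group separately (plus the partition itself) encodes the whole transform.
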

9 Incremental representation
Example:
- mark p: pssm → 1000
- remove p; mark m: ssm → 001
- remove m; mark s: ss → 11
We obtain 3 subsets. Encode each subset, containing t items out of n, using about $\log \binom{n}{t}$ bits.
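The marking steps of the example can be sketched as follows. The function name `incremental_bitvectors` and the explicit `order` argument are assumptions made for illustration; the slide only fixes the order p, m, s for this one example.

```python
def incremental_bitvectors(chars, order):
    """Mark one symbol at a time, then remove it, as in the slide's example.

    Returns a list of (symbol, bitvector) pairs, where the bitvector marks
    the positions of the symbol among the characters not yet removed.
    """
    remaining = list(chars)
    result = []
    for symbol in order:
        bits = "".join("1" if ch == symbol else "0" for ch in remaining)
        result.append((symbol, bits))
        remaining = [ch for ch in remaining if ch != symbol]
    return result


if __name__ == "__main__":
    for symbol, bits in incremental_bitvectors("pssm", "pms"):
        print(symbol, bits)
```
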

10 Getting the multinomial coefficient
Sum of the log binomial coefficients of the subsets' sizes.
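A quick numeric check of this identity on the "pssm" example (subset sizes 1, 1, 2 out of 4 items): multiplying the binomial coefficients of the incremental steps gives the multinomial coefficient, so summing their logarithms gives its logarithm. The helper names below are illustrative only.

```python
from math import comb, factorial, log2


def product_of_binomials(sizes):
    """Product of C(remaining, t) over the incremental steps."""
    remaining = sum(sizes)
    product = 1
    for t in sizes:
        product *= comb(remaining, t)   # choose t positions among those left
        remaining -= t
    return product


def multinomial(sizes):
    """n! / (t_1! * ... * t_r!) with n = t_1 + ... + t_r."""
    result = factorial(sum(sizes))
    for t in sizes:
        result //= factorial(t)
    return result


if __name__ == "__main__":
    sizes = [1, 1, 2]                    # p, m, s in "pssm"
    print(product_of_binomials(sizes), multinomial(sizes))
    print(log2(multinomial(sizes)))      # bits needed, before rounding
```
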

11 Wavelet trees
Generalize the idea from the linear list to any tree shape.
Cost is independent of the shape (e.g., assign access frequencies).
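As a rough illustration of the tree-shaped generalization, here is a small wavelet tree with a balanced shape (one possible shape among many). It is a sketch under simplifying assumptions: bitvectors are plain Python lists and `rank` counts bits by scanning, whereas a succinct implementation would use compressed dictionaries with constant-time rank. The class and method names are hypothetical.

```python
class WaveletTree:
    """Balanced wavelet tree over the symbols of a sequence (illustrative sketch)."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.bits = []
        self.left = self.right = None
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            self.left_syms = set(self.alphabet[:mid])
            # 0 = symbol goes to the left child, 1 = symbol goes to the right child
            self.bits = [0 if ch in self.left_syms else 1 for ch in seq]
            self.left = WaveletTree([c for c in seq if c in self.left_syms],
                                    self.alphabet[:mid])
            self.right = WaveletTree([c for c in seq if c not in self.left_syms],
                                     self.alphabet[mid:])

    def rank(self, ch, i):
        """Number of occurrences of ch among the first i symbols."""
        if len(self.alphabet) <= 1:
            return i if self.alphabet and self.alphabet[0] == ch else 0
        ones = sum(self.bits[:i])        # naive; a real structure answers this in O(1)
        if ch in self.left_syms:
            return self.left.rank(ch, i - ones)
        return self.right.rank(ch, ones)


if __name__ == "__main__":
    wt = WaveletTree("ipssm#pissii")
    print(wt.rank("s", 12), wt.rank("i", 12), wt.rank("p", 7))
```
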

13 Bound on bits of space
Crucial encoding for the BWT partition: r (positive) subsets' sizes, $t_1, \ldots, t_r$, with $\sum_i t_i = n$.
Let $\mathrm{enc}(t_1, \ldots, t_r)$ be the number of bits for encoding the sequence of these r sizes.
Let $g' = |\Sigma|^{h+1}$ and $g = |\Sigma|^{h+1} \log |\Sigma|$, both independent of $n \to \infty$.
Then $r \le g'$ and storing BWT takes
$$nH_h + \left[\mathrm{enc}(t_1, \ldots, t_r) - \tfrac{1}{2} \sum_i \log t_i\right] + O(r \log |\Sigma|) \;\le\; nH_h + g' \log(n/g') + O(g) \text{ bits.}$$

14 Interaction between stringology and data structures
Case studies:
- Compressed text indexing
- Multi-key data structures [Crescenzi, G., Italiano] [Franceschini, G.] [G., Italiano]
- Order vs. disorder in searching
- In-place vector sorting

15 Why multi-key data?
Strings are everywhere… Keys are arbitrarily long:
- Multi-dimensional points
- Multiple precision numbers
- Textual data
- XML paths
- URLs and IP addresses
- …
Modeled as strings in $\Sigma^k$, for unbounded alphabets.
Q: How to avoid the O(k) slowdown factor in the cost of the operations supported by known data structures?

16 I. Ad hoc data structures
Some examples:
- ternary search trees [Clampett] [Bentley, Sedgewick]
- tries […]
- lexicographic D-trees [Mehlhorn]
- multi-dimensional B-trees [Gueting, Kriegel]
- multi-dimensional AVL trees [Vaishnavi]
- lexicographic splay trees [Sleator, Tarjan]
- multi-dimensional BST [Gonzalez] [Roura]
- multi-BB-trees [Vaishnavi]
- …
Search, insert, delete in O(k + log n) time.
Split and concatenate in O(k + log n) time.

17 II. Augmenting access paths
Reuse many data structures for 1-dim keys:
- AVL trees, red-black trees
- skip lists
- (a,b)-trees
- BB[α]-trees
- self-adjusting trees
- random search trees (treaps, …)
- …
Inherit their combinatorial properties.
Traversing is driven by comparisons.

18 III. Using an oracle for strings
Data structure D = black box performing comparisons on pairs of 1-dim keys.
General theorem for transforming D into a data structure D' for strings (no efficiency loss).
Oracle $DS_{lcp}$ for maintaining order in a linked list of strings, along with their lcps (extends the Dietz-Sleator list).

19 The general technique
New data structure D' = old data structure D + oracle $DS_{lcp}$
Method:
- comparison is O(1)-time if we know $lcp(x, y) = \min \{ j \mid x[j+1] \ne y[j+1] \}$ ($x < y$ iff $x[lcp+1] < y[lcp+1]$)
- use $DS_{lcp}$ for storing and comparing pairs of strings in D' in constant time
- use predecessors and lcps computed so far to insert a new string y into D' (and $DS_{lcp}$)
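The first point of the method can be illustrated with a short sketch: once the lcp of two strings is known, a single character comparison decides their order. The code uses 0-based indexing (so the slide's $x[lcp+1]$ becomes `x[l]`), handles the case where one string is a prefix of the other as an added assumption, and computes the lcp naively only to demonstrate the idea; in the technique above the lcp would come from the oracle, not from a scan. Function names are hypothetical.

```python
def lcp(x, y):
    """Length of the longest common prefix of x and y (naive scan, O(k) time)."""
    limit = min(len(x), len(y))
    j = 0
    while j < limit and x[j] == y[j]:
        j += 1
    return j


def less_given_lcp(x, y, l):
    """Decide x < y in O(1) time, assuming l == lcp(x, y) is already known."""
    if l == len(x) or l == len(y):       # one string is a prefix of the other
        return len(x) < len(y)
    return x[l] < y[l]                   # first mismatching character decides


if __name__ == "__main__":
    x, y = "syzygy", "syzygial"
    l = lcp(x, y)
    print(l, less_given_lcp(x, y, l), less_given_lcp(y, x, l))
```
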

20 Theorem for general transformation
Comparison-driven data structure D for n keys: ins identifies pred or succ.
This does not necessarily imply $\Omega(\log n)$ per ins in a sequence of operations; e.g., finger search trees.

24 Theorem for general transformation
Comparison-driven data structure D for n keys (ins identifies pred or succ) | String data structure D' for n strings
Space S(n) | Space S(n) + O(n)
Operation op on O(1) keys in D, in T(n) time | Operation op on O(1) strings in D', in O(T(n)) time
Operation op involving y not in D, in T(n) time | Operation op involving y not in D', in O(T(n) + k) time

25 Some features
- No need to reinvent the wheel for data structure designers.
- Better than using compact trie + Dietz-Sleator list + dynamic LCA when T(n) = o(log n), e.g.:
  weighted search: $O(\log(\sum_i w_i / w))$
  finger search: $O(\log d)$
  set manipulation: $O(n \log(m/n))$

26 Interaction between stringology and data structures
Case studies:
- Compressed text indexing
- Multi-key data structures
- Order vs. disorder in searching [Franceschini, G.]
- In-place vector sorting

27 Searching In-Place a Sorted(?) Array of Strings “Imagine how hard it would be to use a dictionary if its words were not alphabetized!” -- D.E. Knuth, The Art of Comp. Prog., vol. 3, 1998

28 Order vs. Disorder: An experiment
Think of your desk…
1. Are the papers on your desk in sorted order?
2. Probably not!
3. Unsorted data seems to provide more informative content than sorted data…
4. Can we formalize this intuition in the comparison model?

29 Preprocessing by sorting
In-place search of the lexicographically sorted array in [Andersson, Hagerup, Håstad, Petersson, '94, '95, '01]: time [formula missing in the transcript].
Upper/lower bounds. The classical $\Theta(\log n)$ when k = 1.

30 Permuting is better!
For any key length k, there exists an "unsorted" permutation attaining simultaneously
- $\Theta(k + \log n)$ time
- O(1) extra space
Optimal among all possible permutations, better than those resulting from sorting.
Warning: suffix array search is not in-place (since LCP takes more than O(1) extra cells).

31 Basic tool: Bit stealing
Simple, yet effective, idea on pairwise sorted keys.
For keys of length k $\Rightarrow$ O(k) slowdown factor.
Q: Can we get O(1) decoding time?
Implicit bits encoded by pairwise exchanging keys!
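The pairwise-exchange idea can be sketched as follows: each pair of distinct keys encodes one bit, 0 if the pair is in sorted order and 1 if it is exchanged. The function names and the list-of-bits interface are assumptions made for illustration. Note that decoding here compares two whole keys, which is exactly the O(k) slowdown the slide asks how to avoid.

```python
def encode_bits(keys, bits):
    """Encode bits into consecutive pairs of distinct keys by exchanging them.

    Pair (keys[2t], keys[2t+1]) stores bits[t]: sorted order means 0,
    exchanged order means 1. No extra memory is used for the bits themselves.
    """
    out = list(keys)
    for t, bit in enumerate(bits):
        small, large = sorted((out[2 * t], out[2 * t + 1]))
        out[2 * t], out[2 * t + 1] = (large, small) if bit else (small, large)
    return out


def decode_bits(keys, count):
    """Read back count bits; each read is one key comparison (O(k) for length-k keys)."""
    return [1 if keys[2 * t] > keys[2 * t + 1] else 0 for t in range(count)]


if __name__ == "__main__":
    keys = ["ant", "bee", "cat", "dog", "eel", "fox"]
    stored = encode_bits(keys, [1, 0, 1])
    print(stored, decode_bits(stored, 3))
```
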

32 K-dimensional bit stealing: Digging a ditch!
Using $d = lcp(x_i, x_j) + 1$, decode a bit in O(1) time by checking the mismatching characters $x_i[d]$ and $x_j[d]$.
Idea exploited for digging a ditch, in O(k + r) time:
DIGGING(x_1 … x_r)
    d ← 1, i ← 1, j ← r
    while i < j do    // twin positions i and j
        while d ≤ k and x_i[d] = x_j[d] do
            d ← d + 1
        i ← i + 1, j ← j − 1
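A direct Python rendering of the DIGGING loop is sketched below. Two things are assumptions added for illustration: the keys are taken to be sorted strings of equal length k, and the depth reached at each pair of twin positions is recorded in a list (the pseudocode itself does not say what is stored). Positions are reported 0-based, while the depth d stays 1-based as on the slide.

```python
def digging(keys, k):
    """Scan twin positions (i, j) from the outside in, tracking the digging depth d.

    d never decreases, so the total work is O(k + r) character comparisons
    for r keys of length k. Returns (i, j, d) for every twin pair visited.
    """
    d = 1
    i, j = 0, len(keys) - 1
    depths = []
    while i < j:                                   # twin positions i and j
        while d <= k and keys[i][d - 1] == keys[j][d - 1]:
            d += 1                                 # dig deeper while characters agree
        depths.append((i, j, d))                   # recording d is an illustrative addition
        i += 1
        j -= 1
    return depths


if __name__ == "__main__":
    keys = ["aaab", "aabb", "abba", "abbb", "baaa"]   # sorted, all of length 4
    print(digging(keys, 4))
```

In this small run the depth reported for each twin pair equals lcp + 1 of the two keys, matching the formula for d given above.
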

33 Ditch: twin positions and twin intervals
Create twin intervals with the same digging depth; bit stealing takes O(1) time with keys in twin positions.

34 Large DITCH
Encode information for the twin intervals in O(k log n) distinct keys (which are still searchable).
These twin positions can encode 3 bits.

35 Inside each twin interval T
Searching A reduces to searching in a specific twin interval T.
Use a modified Manber-Myers search for accessing just O(log n) stolen bits in T for lcp information (instead of $O(\log n) \times O(\log n)$ bits).
It is provably more efficient to keep data "unsorted" rather than "sorted" for in-place searching.

36 Interaction between stringology and data structures
Case studies:
- Compressed text indexing
- Multi-key data structures
- Order vs. disorder in searching
- In-place vector sorting [Franceschini, G.]

37 Logical order ≡ physical layout
Knuth's indirect addressing:
1. permute the records' pointers to find their ranks
2. permute the records according to the ranks
What if records are scrambled during merging? Irregular access pattern to records.
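The two steps of indirect addressing can be sketched as follows. This is an illustration, not the talk's algorithm: it uses Python's built-in `sorted` for step 1 and an explicit rank array, so it needs O(n) extra cells, which is precisely what the in-place model of the next slides rules out. The function name is hypothetical.

```python
def sort_by_indirect_addressing(records):
    """Sort records in two steps: rank via pointers, then permute by rank."""
    n = len(records)
    # Step 1: permute pointers (indices) instead of records to find the ranks.
    order = sorted(range(n), key=lambda p: records[p])
    rank = [0] * n
    for r, p in enumerate(order):
        rank[p] = r                      # record at position p belongs at position r
    # Step 2: move each record to its rank by following permutation cycles.
    for start in range(n):
        while rank[start] != start:
            target = rank[start]
            records[start], records[target] = records[target], records[start]
            rank[start], rank[target] = rank[target], rank[start]
    return records


if __name__ == "__main__":
    print(sort_by_indirect_addressing(["pear", "apple", "fig"]))
```

Each swap in step 2 puts at least one record in its final place, so the number of record moves is linear in n.
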

38 In-place model for vector sorting: GVSP( )
Comparison model extended to keys of length k, using O(1) extra memory cells:
- m vectors of length k to be sorted
- p vectors for internal buffering
- h stolen bits with 2h vectors
Initially, m = n and p = h = 0.

39 Optimal time-space bounds
Reduce recursively GVSP( ) to simpler instances.
Use internal implicit data structures for strings in some of the instances.
Sorting cost is time-space optimal:
- O(nk + n log n) time/comparisons
- O(n) vector moves
- O(1) words of memory for extra space

40 Conclusions
Joint work on the "reverse" contribution, from stringology to data structure/algorithm design.
Fruitful interplay between the two areas:
- Compressed text indexing
- Multi-key data structures
- Order vs. disorder in searching
- In-place vector sorting