Random access to arrays of variable-length items

Presentation transcript:

Random access to arrays of variable-length items. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval", Dipartimento di Informatica, Università di Pisa

A basic problem! T = Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#.... Solution: an array A of pointers into T, where n is the number of strings and m is the length of T. It takes (log m) bits per string = (n log m) bits in total, i.e. 32n bits with 32-bit pointers. We could drop the separating NULL. The space is independent of the string-length distribution: it is effective for few strings, but bad for medium/large sets of strings.
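The pointer-array solution can be sketched as follows. This is a minimal illustration, not the lecture's code: the helper names `build_pointers` and `access` are hypothetical, and offsets are 0-based Python integers rather than packed log m-bit fields.

```python
def build_pointers(T, sep="#"):
    """Return the starting offset in T of every stored string."""
    starts, pos = [], 0
    for s in T.split(sep)[:-1]:      # T ends with a separator, so drop the last empty piece
        starts.append(pos)
        pos += len(s) + 1            # +1 skips the separator
    return starts


def access(T, starts, i, sep="#"):
    """Return the i-th string (0-based) by jumping directly to its offset."""
    begin = starts[i]
    end = T.index(sep, begin)
    return T[begin:end]


T = "Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#"
A = build_pointers(T)
print(access(T, A, 3))
```

Every pointer costs the same number of bits regardless of how long its string is, which is why the scheme is wasteful for large sets of short strings.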

A basic problem! We aim at achieving ≈ n log(m/n) bits ≤ n log m.
Strings: T = Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#.... becomes X = AbacoBattleCarColdCodDefenseGoogleYahoo.... (the strings without separators) plus B = 10000100000100100010010000001000010000.... (a 1 marks the first character of each string).
Integers: A = 10#2#5#6#20#31#3#3#.... becomes X = 1010101011101010111111111.... (the binary representations, concatenated) plus B = 1000101001001000100001010.... (a 1 marks the first bit of each item). We could drop the msb.
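A small sketch of the X + B layout, assuming non-empty strings. `select1` here is a naive linear scan that stands in for the constant-time Select structure; the function names are illustrative only.

```python
def build(strings):
    """Concatenate the strings into X and mark each string start with a 1 in B."""
    X = "".join(strings)
    B = []
    for s in strings:
        B.append(1)
        B.extend([0] * (len(s) - 1))
    return X, B


def select1(B, i):
    """0-based position of the i-th 1 in B (i >= 1); naive scan."""
    seen = 0
    for pos, bit in enumerate(B):
        seen += bit
        if bit and seen == i:
            return pos
    raise IndexError(i)


def access(X, B, n, i):
    """Return the i-th string (1-based) out of n, using two Select1 queries."""
    begin = select1(B, i)
    end = select1(B, i + 1) if i < n else len(X)
    return X[begin:end]


words = ["Abaco", "Battle", "Car", "Cold", "Cod", "Defense", "Google", "Yahoo"]
X, B = build(words)
print(access(X, B, len(words), 4))
```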

Another textDB: Labeled Graph

Rank/Select. We wish to index the bit vector B (possibly compressed), where m = |B| and n = #1s. B = 00101001010101011111110000011010101.... Rankb(i) = number of b in B[1,i]. Selectb(i) = position of the i-th b in B. Example: Rank1(6) = 2. There exist data structures that solve this problem in O(1) query time and very small extra space (i.e. +o(m) bits).
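The two definitions can be written down directly. The sketch below uses linear scans and 1-based positions as on the slide; it only fixes the semantics and makes no attempt at O(1) time.

```python
def rank(B, b, i):
    """Number of occurrences of bit b (as '0' or '1') in B[1, i]."""
    return B[:i].count(b)


def select(B, b, i):
    """1-based position of the i-th occurrence of bit b in B."""
    pos = -1
    for _ in range(i):
        pos = B.index(b, pos + 1)    # raises ValueError if there is no i-th b
    return pos + 1


B = "00101001010101011111110000011010101"
print(rank(B, "1", 6), select(B, "1", 2))
```

The two operations are inverse to each other in the sense that `rank(B, b, select(B, b, i)) == i`.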

The Bit-Vector Index: B + o(m), with m = |B| and n = #1s. Goal: B is read-only, and the additional index takes o(m) bits.
Rank: B = 00101001010101011 1111100010110101 0101010111000.... is split into superblocks of Z bits, each storing the (absolute) Rank1 at its boundary (8 and 18 in the figure). Every superblock is split into blocks of z bits, each storing the (bucket-relative) Rank1 (4, 5, 8 in the figure). A table with columns block, pos, #1 answers Rank1 inside any block of z bits.
Setting Z = poly(log m) and z = (1/2) log m, the extra space is (m/Z) log m + (m/z) log Z + o(m) = O(m loglog m / log m) = o(m) bits, and Rank time is O(1). The o(m) term is crucial in practice. B is untouched (not compressed).
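A hedged sketch of the two-level Rank directory. The class name `RankIndex` and the small default sizes `Z=16`, `z=4` are assumptions for illustration, and the precomputed block table is replaced by a direct count over at most z bits.

```python
class RankIndex:
    """Two-level rank directory over a read-only list of bits."""

    def __init__(self, bits, Z=16, z=4):
        assert Z % z == 0
        self.bits, self.Z, self.z = bits, Z, z
        self.super = []    # absolute Rank1 at the start of each superblock
        self.block = []    # Rank1 relative to the superblock start, per block
        total = rel = 0
        for start in range(0, len(bits), z):
            if start % Z == 0:
                self.super.append(total)
                rel = 0
            self.block.append(rel)
            ones = sum(bits[start:start + z])
            total += ones
            rel += ones

    def rank1(self, i):
        """Number of 1s among the first i bits."""
        if i == 0:
            return 0
        blk = (i - 1) // self.z                        # block holding the last counted bit
        inside = sum(self.bits[blk * self.z:i])        # stands in for the table lookup
        return self.super[blk * self.z // self.Z] + self.block[blk] + inside


bits = [int(c) for c in "0010100101010101111111000001101010101"]
index = RankIndex(bits)
print(index.rank1(6))
```

A query touches one superblock counter, one block counter and one block of z bits, which is what makes the time constant once the in-block count is a table lookup.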

The Bit-Vector Index: Select. B = 0010100101010101111111000001101010101010111001...., with m = |B| and n = #1s. Split B into buckets of k consecutive 1s each; the size r of a bucket is variable. Sparse case: if r > k^2, store explicitly the positions of the k 1s. Dense case: if k ≤ r ≤ k^2, recurse... One level is enough!! ... we still need a table of size o(m). Setting k ≈ polylog m, the extra space is +o(m), B is not touched, and Select time is O(1).
There exists a Bit-Vector Index taking o(m) extra bits and constant time for Rank/Select. B is read-only!
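The sparse/dense split might look like the sketch below. It is a simplification under stated assumptions: `SelectIndex` and the default `k=4` are hypothetical, and the dense case scans the short bucket (at most k^2 bits) instead of recursing or using a lookup table.

```python
class SelectIndex:
    """Select1 via buckets of k consecutive 1s over a read-only list of bits."""

    def __init__(self, bits, k=4):
        self.bits, self.k = bits, k
        ones = [p for p, b in enumerate(bits) if b]    # used only during construction
        self.buckets = []
        for g in range(0, len(ones), k):
            chunk = ones[g:g + k]
            r = chunk[-1] - chunk[0] + 1               # bucket size in bits
            if r > k * k:
                self.buckets.append(("sparse", chunk))      # explicit positions
            else:
                self.buckets.append(("dense", chunk[0]))    # only the bucket start

    def select1(self, i):
        """0-based position of the i-th 1 (i >= 1)."""
        kind, data = self.buckets[(i - 1) // self.k]
        offset = (i - 1) % self.k
        if kind == "sparse":
            return data[offset]
        pos = data
        while True:                                    # scans at most k*k bits
            if self.bits[pos]:
                if offset == 0:
                    return pos
                offset -= 1
            pos += 1


bits = [int(c) for c in "0010100101010101111111000001101010101010111001"]
sel = SelectIndex(bits)
print(sel.select1(3))
```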

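Elias-Fano index&compress. Let m = |B| and n = #1s. Each position of a 1 is split into z = log n high bits and w = log (m/n) low bits. Then L (the low parts) takes n w = n log (m/n) bits, and H (the high parts, written in unary) takes n 1s + n 0s = 2n bits. Example in the figure: z = 3, w = 2, with the buckets 0 1 2 3 4 5 6 7 encoded in unary in H. Select1(i) on B uses L and (Select1(H,i) - i), in +o(n) space for Select1 on H. Actually you can do binary search over B, but compressed!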
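A minimal Elias-Fano sketch, assuming 0-based positions and m ≥ n ≥ 1. The example positions are made up to give z = 3 and w = 2; they are not taken from the slide's figure. Select1 on H is a naive scan here, and trailing 0s of H are omitted.

```python
def ef_encode(positions, m):
    """Encode the sorted 0-based positions of the 1s of a bit vector of length m."""
    n = len(positions)
    w = (m // n).bit_length() - 1                  # floor(log2(m/n)) low bits per item
    L = [p & ((1 << w) - 1) for p in positions]    # low parts, w bits each
    H, bucket = [], 0
    for p in positions:
        high = p >> w
        H.extend([0] * (high - bucket))            # one 0 per bucket boundary crossed
        H.append(1)                                # one 1 per item
        bucket = high
    return w, L, H


def ef_select1(w, L, H, i):
    """0-based position in B of the i-th 1 (i >= 1)."""
    seen = 0
    for pos, bit in enumerate(H):
        seen += bit
        if bit and seen == i:
            high = pos - (i - 1)                   # number of 0s before the i-th 1 of H
            return (high << w) | L[i - 1]
    raise IndexError(i)


positions = [1, 4, 7, 18, 24, 26, 30, 31]          # hypothetical example, m = 32, n = 8
w, L, H = ef_encode(positions, 32)
print(w, [ef_select1(w, L, H, i) for i in range(1, 9)])
```

Because the encoded positions come out in increasing order, a binary search over `ef_select1` answers membership and predecessor queries directly on the compressed form.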

If you wish to play with Rank and Select: m/10 + n log (m/n) bits, Rank in 0.4 msec, Select in < 1 msec, vs 32n bits of explicit pointers.

Generalised Rank and Select. Rank(c,i) = #c in L[1,i]. Select(c,i) = position of the i-th c in L. Example: L = a b a a a c b c d a b e c d ..., Select(a, 2) = 3, Rank(a, 7) = 4.

Generalised Rank and Select. If Σ is small (i.e. constant): build a binary Rank data structure per symbol of Σ; Rank takes O(1) time and o(|T|) space [even entropy bounded]. If Σ is large (words?): we need a smarter solution, the Wavelet Tree data structure. Algorithmic reduction: reduce Rank&Select over arbitrary strings ... to Rank&Select over binary strings.
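For a small alphabet, the per-symbol reduction can be sketched with one indicator bit vector per symbol. The names are illustrative and the binary Rank/Select are naive scans; positions are 1-based.

```python
def build_indicators(L):
    """One bit vector per symbol c: bit j is 1 iff L[j] == c."""
    return {c: [1 if x == c else 0 for x in L] for c in set(L)}


def rank(ind, c, i):
    """Occurrences of c in L[1, i], answered as a binary Rank1 on c's bit vector."""
    return sum(ind[c][:i])


def select(ind, c, i):
    """1-based position of the i-th c, answered as a binary Select1."""
    seen = 0
    for pos, bit in enumerate(ind[c], start=1):
        seen += bit
        if bit and seen == i:
            return pos
    raise IndexError(i)


L = "abaaacbcdabecd"
ind = build_indicators(L)
print(select(ind, "a", 2), rank(ind, "a", 7))
```

The space grows with the alphabet size times the text length, which is why this approach only suits constant-size alphabets.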

The Wavelet Tree. String: abracadabra. Alphabetic tree with leaves a, b, c, d, r.

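The Wavelet Tree. The root holds abracadabra and splits its alphabet into {a, b} (left) and {c, d, r} (right), giving the children abaaaba and rcdr. abaaaba splits into aaaaa and bb; rcdr splits into cd and rr; cd splits into c and d. You do not need the leaves, because of the {0,1} in their parent.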

The Wavelet Tree. Each node stores only a binary string (0 = the symbol goes left, 1 = the symbol goes right): abracadabra → 00101010010, abaaaba → 0100010, rcdr → 1001, cd → 01. Total space may be estimated as O(|S| log |Σ|) bits, where S is the indexed string and Σ its alphabet. Fact. Given the alphabetic tree and the binary strings, we can recover the original string!!

The Wavelet Tree: Rank(c,8). At the root (abracadabra, 00101010010), c belongs to the right subtree: reduce to the right symbols, so Rank(c,8) becomes Rank(c,3) on rcdr (1001). There c belongs to the left subtree: reduce to the left symbols, so Rank(c,3) becomes Rank(c,2) on cd (01).

The Wavelet Tree: right move = Rank1, left move = Rank0. Rank(c,8): at abracadabra (00101010010), Rank1(8) = 3; at rcdr (1001), Rank0(3) = 2; at cd (01), Rank0(2) = 1, so Rank(c,8) = 1. Select is similar. Generalised R&S → Binary R&S with log |Σ| slowdown.
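The Rank walk can be sketched as below. The tree shape is passed explicitly as nested pairs so that it matches the slide's alphabetic tree; the class `WaveletTree` and helper `symbols` are hypothetical names, binary Rank is a naive sum, and the queried symbol is assumed to belong to the alphabet.

```python
def symbols(shape):
    """Symbols under a shape: a leaf is a symbol, an internal node a (left, right) pair."""
    if isinstance(shape, tuple):
        return symbols(shape[0]) | symbols(shape[1])
    return {shape}


class WaveletTree:
    def __init__(self, s, shape):
        self.leaf = not isinstance(shape, tuple)
        if self.leaf:
            return                                   # leaves store nothing
        self.right_syms = symbols(shape[1])
        self.bits = [1 if c in self.right_syms else 0 for c in s]
        self.left = WaveletTree([c for c in s if c not in self.right_syms], shape[0])
        self.right = WaveletTree([c for c in s if c in self.right_syms], shape[1])

    def rank(self, c, i):
        """Occurrences of c among the first i symbols."""
        if self.leaf:
            return i
        ones = sum(self.bits[:i])                    # binary Rank1(i)
        if c in self.right_syms:
            return self.right.rank(c, ones)          # right move = Rank1
        return self.left.rank(c, i - ones)           # left move = Rank0


shape = (("a", "b"), (("c", "d"), "r"))
wt = WaveletTree("abracadabra", shape)
print(wt.rank("c", 8))
```

Each query performs one binary Rank per level, so the slowdown over a plain bit vector is the depth of the tree.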

Generalised Rank and Select. If Σ is large, the Wavelet Tree data structure guarantees that Rank and Select take O(log |Σ|) time and nH0 + n bits of space (like Huffman). Other bounds are possible with d-ary trees: log_d |Σ| time and n log |Σ| + o(n) bits.

WT vs 2D-range search. WT + Rank&Select solves 2D-range search.
[Figure: points on a 16 × 16 grid (axis ticks 2, 4, ..., 16), sorted by one coordinate and written as the sequence T = 2 3 8 7 13 1 14 6 11 10 16 15 12 9 5 4. The query with ranges [4,10] and [5,12] first selects T[4,10] = 7 13 1 14 6 11 10 and then keeps the values in [5,12], i.e. 7 6 11 10.]
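One way to read the figure is as range counting over T: count the entries whose position lies in one range and whose value lies in the other. The sketch below follows the wavelet-tree recursion but rebuilds each node's bit vector at query time, which a real index would precompute; `range_count` is a hypothetical name and positions are 1-based.

```python
def range_count(T, lo, hi, x1, x2, y1, y2):
    """Count entries T[p] with x1 <= p <= x2 and y1 <= T[p] <= y2; values lie in [lo, hi]."""
    if x1 > x2 or hi < y1 or lo > y2:
        return 0                                   # empty position range or disjoint values
    if y1 <= lo and hi <= y2:
        return x2 - x1 + 1                         # node fully inside the value range
    mid = (lo + hi) // 2
    bits = [1 if v > mid else 0 for v in T]        # this node's bit vector
    r1 = sum(bits[:x1 - 1])                        # Rank1(x1 - 1)
    r2 = sum(bits[:x2])                            # Rank1(x2)
    left = [v for v in T if v <= mid]
    right = [v for v in T if v > mid]
    return (range_count(left, lo, mid, x1 - r1, x2 - r2, y1, y2)      # Rank0 maps positions left
            + range_count(right, mid + 1, hi, r1 + 1, r2, y1, y2))    # Rank1 maps positions right


T = [2, 3, 8, 7, 13, 1, 14, 6, 11, 10, 16, 15, 12, 9, 5, 4]
print(range_count(T, 1, 16, 4, 10, 5, 12))
```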

String search vs 2D-range search. T = a b r a c a d r a b r a (positions 1 2 3 4 5 6 7 8 9 10 11 12).

Pos  SA  suffix        point
1    12  a             <1,12>
2    9   abra          <2,9>
3    1   abracadrabra  <3,1>
4    4   acadrabra     <4,4>
5    6   adrabra       <5,6>
6    10  bra           <6,10>
7    2   bracadrabra   <7,2>
8    5   cadrabra      <8,5>
9    7   drabra        <9,7>
10   11  ra            <10,11>
11   8   rabra         <11,8>
12   3   racadrabra    <12,3>

Build the suffix array for T. For each suffix T[i,n] stored at SA[j], build a point <j,i>. To search for P[1,p] (= ra) in T[s,e] (= T[3,8]): search P in the Suffix Array and find the range [L,R] of suffixes which are prefixed by P (= [10,12]); then perform a 2D-range search in [L,R] x [s, e-p+1], here [10,12] x [3, 7 = 8-2+1], which returns (12,3).

Prefix search over multi-attributes
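The reduction can be replayed on the slide's example with a naive sketch: the suffix array is built by sorting suffixes, and the 2D-range search is a plain filter over the points rather than a wavelet tree. The function names are illustrative.

```python
def suffix_array(T):
    """1-based starting positions of the suffixes of T, in lexicographic order."""
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])


def search(T, P, s, e):
    """Points <j,i> for the occurrences of P that lie fully inside T[s, e] (1-based)."""
    SA = suffix_array(T)
    ranks = [j for j, i in enumerate(SA, start=1) if T[i - 1:].startswith(P)]
    if not ranks:
        return []
    L, R = ranks[0], ranks[-1]                     # suffixes prefixed by P are contiguous
    points = list(enumerate(SA, start=1))          # point <j,i> for SA[j] = i
    return [(j, i) for j, i in points
            if L <= j <= R and s <= i <= e - len(P) + 1]


T = "abracadrabra"
print(suffix_array(T))
print(search(T, "ra", 3, 8))
```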

Prefix search vs 2D-range search. Given a dictionary of records <s1[i], s2[i]>: construct two tries, one for the s1's and one for the s2's strings, and number the leaves from left to right. Example records: <ugo, rossi>, <uto, blu>, <caio, rod>, <ivo, bleu>.

Prefix search vs 2D-range search. For every record, create a 2D-point <a,b>, where a and b are the leaf numbers of its two strings. Two-prefix search <P,Q> = <u*, ro*>: search P and Q in the tries, identify the ranges of leaves (ints) delimited by P and Q, and perform a 2D-range search over the ranges [PL, PR] x [QL, QR]. Example records: <ugo, rossi>, <uto, bla>, <caio, rod>, <ivo, bleu>.
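A sketch of the two-prefix search on the example records. Sorted arrays with binary search stand in for the two tries (the rank in sorted order plays the role of the leaf number), and the 2D-range search is a plain filter; these substitutions and the function names are assumptions for illustration. Keys are assumed not to contain the sentinel character U+FFFF.

```python
from bisect import bisect_left


def prefix_range(sorted_keys, P):
    """Inclusive range of 0-based ranks of the keys prefixed by P (empty if lo > hi)."""
    lo = bisect_left(sorted_keys, P)
    hi = bisect_left(sorted_keys, P + "\uffff") - 1
    return lo, hi


def two_prefix_search(records, P, Q):
    k1 = sorted({a for a, _ in records})           # stands in for the trie on s1
    k2 = sorted({b for _, b in records})           # stands in for the trie on s2
    points = [(bisect_left(k1, a), bisect_left(k2, b), (a, b)) for a, b in records]
    PL, PR = prefix_range(k1, P)
    QL, QR = prefix_range(k2, Q)
    return [rec for x, y, rec in points if PL <= x <= PR and QL <= y <= QR]


records = [("ugo", "rossi"), ("uto", "bla"), ("caio", "rod"), ("ivo", "bleu")]
print(two_prefix_search(records, "u", "ro"))
```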