Rank and Select data structures

Presentation transcript:

Rank and Select data structures. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

A basic problem: how do you retrieve the k-th string?
The strings are concatenated into D = Abaco$Battle$Car$Cold$Cod ....
First option: an array of n string pointers into D, whose total length is m. It takes n log m bits = 32n bits with 32-bit pointers. It depends on the number of strings; it is independent of string length.
Second option: a bit vector B that marks where each string starts.
D = Abaco Battle Car Cold Cod ....
B = 10000 100000 100 1000 100 ....
Spaces are introduced for simplicity. You could drop the $.
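The bit-vector option can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `select1` and `kth_string` are hypothetical, positions are 1-based as on the slides, and `select1` is a plain linear scan rather than a succinct structure.

```python
# Strings from the slide, concatenated with the '$' separators dropped.
D = "AbacoBattleCarColdCod"
# B has a 1 at the first character of each string, 0 elsewhere.
B = [int(c) for c in "10000" "100000" "100" "1000" "100"]


def select1(bits, i):
    """Return the 1-based position of the i-th 1 in bits (linear scan)."""
    count = 0
    for pos, bit in enumerate(bits, start=1):
        count += bit
        if bit and count == i:
            return pos
    raise IndexError("fewer than i ones in the bit vector")


def kth_string(text, bits, k):
    """Return the k-th string (1-based): it spans Select1(k) .. Select1(k+1) - 1."""
    start = select1(bits, k)
    n = sum(bits)
    end = select1(bits, k + 1) - 1 if k < n else len(text)
    return text[start - 1:end]


print([kth_string(D, B, k) for k in range(1, 6)])
```

The k-th string is delimited by two Select1 calls, which is why a fast Select on B replaces the pointer array.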

Rank/Select
Wish to index the bit vector B[1,m] (possibly compressed), where m = |B| and n = #1s.
B = 00101001010101011111110000011010101....
Rankb(i) = number of b in B[1,i]. Example: Rank1(6) = 2.
Selectb(i) = position of the i-th b in B.
Two approaches:
(1) Takes |B| + o(|B|) bits of space.
(2) Aims at achieving n log(m/n) bits, by deploying Elias-Fano + point (1).
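As a reference point, the two definitions can be written directly as linear scans. This is a naive sketch, not one of the two approaches: the function names `rank` and `select` are chosen here for illustration, and positions are 1-based to match B[1,m].

```python
def rank(bits, b, i):
    """Number of occurrences of bit b in B[1..i] (1-based, inclusive)."""
    return sum(1 for x in bits[:i] if x == b)


def select(bits, b, i):
    """1-based position of the i-th occurrence of bit b in B."""
    seen = 0
    for pos, x in enumerate(bits, start=1):
        if x == b:
            seen += 1
            if seen == i:
                return pos
    raise IndexError("B contains fewer than i occurrences of b")


# Prefix of the bit vector shown on the slide.
B = [int(c) for c in "00101001010101011111110000011010101"]
print(rank(B, 1, 6))    # the slide's example, Rank1(6)
print(select(B, 1, 2))  # position of the second 1
```

Both scans take time linear in |B|; the structures on the following slides aim at constant time with little extra space.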

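The Bit-Vector Index: |B| + o(|B|) bits, where m = |B| and n = #1s.
Goal: B is read-only, and the additional index takes o(m) bits.
Rank: B = 00101001010101011 1111100010110101 0101010111000.... is split into big blocks of size Z, and the (absolute) Rank1 is stored at each big-block boundary (8 and 18 in the figure). Each big block is split into small blocks of size z, and the (bucket-relative) Rank1 is stored for each of them (values such as 4, 5, 8 in the figure). A precomputed table with columns block, pos, #1 answers Rank1 inside a small block, with rows such as (0000, 1, ...) up to (1011, 2, ...).
Setting Z = poly(log m) and z = (1/2) log m, the extra space is (m/Z) log m + (m/z) log Z + o(m) = O(m loglog m / log m) = o(m) bits.
Rank time is O(1).
The o(m) term is crucial in practice. B is untouched (not compressed).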
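A minimal sketch of the two-level Rank directory follows. Several things here are assumptions made for illustration: the class name `RankIndex`, the toy block sizes Z = 16 and z = 4 (the slide sets them from log m), and the use of a short scan inside the small block where the slide uses a precomputed table.

```python
class RankIndex:
    """Two-level Rank1 directory over a read-only list of bits."""

    def __init__(self, bits, Z=16, z=4):
        assert Z % z == 0, "big blocks must contain whole small blocks"
        self.bits, self.Z, self.z = bits, Z, z
        self.abs_rank = []  # number of 1s before each big block
        self.rel_rank = []  # number of 1s from big-block start to each small block
        total = rel = 0
        for start in range(0, len(bits), z):
            if start % Z == 0:
                self.abs_rank.append(total)
                rel = 0
            self.rel_rank.append(rel)
            ones = sum(bits[start:start + z])
            total += ones
            rel += ones

    def rank1(self, i):
        """Number of 1s in B[1..i] (1-based, inclusive), for 0 <= i <= m."""
        if i == 0:
            return 0
        last = i - 1                      # 0-based index of the last counted bit
        small = last // self.z
        big = last // self.Z
        start = small * self.z
        inside = sum(self.bits[start:i])  # stands in for the table lookup
        return self.abs_rank[big] + self.rel_rank[small] + inside


B = [int(c) for c in "00101001010101011111110000011010101"]
index = RankIndex(B)
print(index.rank1(6))
```

Each query adds one stored absolute count, one stored relative count, and the count inside a single small block, so the work per query does not grow with m once z is fixed.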

The Select operation, where m = |B| and n = #1s.
B = 0010100101010101111111000001101010101010111001....
B is split into blocks whose size r is variable: each block extends until the subarray includes k = (log m)^2 1s.
Sparse case: if r > k^2 = (log m)^4, we store explicitly the positions of the k = (log m)^2 1s. We have at most m/r blocks of this type, so together they take (m/r) * k * log m bits = O(m / log m) = o(m) bits.
Dense case: if k ≤ r ≤ k^2, recurse by repeating the argument, now with k' = (log log m)^2. If a sub-block of size r' including k' 1s is longer than log m bits, then store the k' positions explicitly using O(log log m) bits each, thus O(m / log log m) = o(m) bits in total. Otherwise r' < log m, and thus a precomputed table is enough.
Extra space is o(m), and B is not touched! Select time is O(1).
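The sampling idea behind Select can be illustrated with a much simpler, single-level variant. This is a swapped-in simplification, not the two-level scheme of the slide: it stores the position of every k-th 1 and then scans forward, so the time is bounded by the distance between samples rather than O(1). The class name `SampledSelect` and the toy value k = 4 are assumptions.

```python
class SampledSelect:
    """Select1 using sampled positions of the 1s plus a short forward scan."""

    def __init__(self, bits, k=4):
        self.bits, self.k = bits, k
        self.samples = []  # 1-based positions of the 1st, (k+1)-th, (2k+1)-th, ... 1
        self.ones = 0
        for pos, bit in enumerate(bits, start=1):
            if bit:
                if self.ones % k == 0:
                    self.samples.append(pos)
                self.ones += 1

    def select1(self, i):
        """1-based position of the i-th 1, for 1 <= i <= number of 1s."""
        if not 1 <= i <= self.ones:
            raise IndexError("i is out of range")
        pos = self.samples[(i - 1) // self.k]  # nearest sampled 1 at or before the target
        remaining = (i - 1) % self.k           # 1s still to skip after the sample
        while remaining:
            pos += 1
            if self.bits[pos - 1]:
                remaining -= 1
        return pos


B = [int(c) for c in "0010100101010101111111000001101010101010111001"]
sel = SampledSelect(B)
print(sel.select1(3))
```

The slide's sparse/dense split exists precisely to remove the forward scan: long blocks store all positions, and short blocks fall back to a table.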

Via Elias-Fano: |L| + |H| + o(|H|) bits, therefore B is not needed.
Recall that by setting w = log (m/n) and z = log n, where m = |B| and n = #1s, then Space = n log (m/n) bits + 2n bits. We build Select1 on H, so we need |H| + o(|H|) bits = 2n + o(n) bits.
Figure: example with z = 3, w = 2 and buckets 0 1 2 3 4 5 6 7.
Select1(i) on B → uses L and (Select1(H,i) - i), in +o(n) space.
Rank1(i) on B → needs binary search over B.
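The following sketch shows how Select1 on B can be answered from L and H alone. It rests on several assumptions: the function names are hypothetical, the positions of the 1s are treated as 0-based integers below 2^(z+w), the sample values are made up, H is a plain list scanned linearly instead of carrying a constant-time Select1 index, and trailing empty buckets are not written out. Rank is done by binary search over the Select values, which is one way to read the last line of the slide.

```python
def ef_build(values, w):
    """Split each sorted value into w low bits (L) and its high part (H, unary)."""
    lows = [v & ((1 << w) - 1) for v in values]
    H = []
    bucket = 0
    for v in values:
        high = v >> w
        while bucket < high:  # close every bucket that ends before this one
            H.append(0)
            bucket += 1
        H.append(1)           # one 1 per value falling in the current bucket
    H.append(0)               # close the last non-empty bucket
    return lows, H


def select1(bits, i):
    """1-based position of the i-th 1 in bits (linear scan)."""
    count = 0
    for pos, bit in enumerate(bits, start=1):
        count += bit
        if bit and count == i:
            return pos
    raise IndexError("fewer than i ones")


def ef_select(lows, H, w, i):
    """i-th smallest value: the high part is Select1(H, i) - i, the low part is L[i]."""
    high = select1(H, i) - i
    return (high << w) | lows[i - 1]


def ef_rank(lows, H, w, x):
    """Number of stored values <= x, by binary search over ef_select."""
    lo, hi = 0, len(lows)
    while lo < hi:
        mid = (lo + hi) // 2
        if ef_select(lows, H, w, mid + 1) <= x:
            lo = mid + 1
        else:
            hi = mid
    return lo


# Hypothetical example with z = 3 high bits and w = 2 low bits (universe of size 32).
values = [1, 4, 7, 18, 24, 26, 30, 31]
L, H = ef_build(values, w=2)
print([ef_select(L, H, 2, i) for i in range(1, len(values) + 1)])
```

Select1(H, i) - i counts the 0s that precede the i-th 1 in H, which is exactly the bucket number, so no access to B is needed.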

If you wish to play with Rank and Select
Space: m/10 + n log (m/n) bits, vs 32n bits of explicit pointers.
Time: Rank in 0.4 msec, Select in < 1 msec.
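To get a feel for the two space figures, the formulas can be evaluated on made-up numbers. The values of n and m below are hypothetical (one million strings with an average length of 8 characters) and are not taken from the slides; only the two formulas are.

```python
import math


def rank_select_bits(m, n):
    """Space of the structure quoted on the slide: m/10 + n log2(m/n) bits."""
    return m / 10 + n * math.log2(m / n)


def pointer_bits(n):
    """Space of n explicit 32-bit pointers."""
    return 32 * n


n = 1_000_000  # hypothetical number of strings
m = 8 * n      # hypothetical total length
print(rank_select_bits(m, n), pointer_bits(n))
```

With these assumed inputs the compressed structure uses 3.8 bits per string against 32 bits per pointer; the gap narrows as strings get longer, since both m/10 and log(m/n) grow with m.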