Index construction: Compression of postings Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 5.3 and a paper.


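γ code for integer encoding. For x > 0, let Length = ⌊log₂ x⌋ + 1 be the number of bits of x in binary. The γ code of x is (Length − 1) zeros followed by x in binary, e.g., 9 is represented as 0001001 (three zeros, then 1001). The γ code for x takes 2⌊log₂ x⌋ + 1 bits (i.e. a factor of 2 from optimal). Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.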
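As a minimal sketch of the rule above (the function name `gamma_encode` and the use of a '0'/'1' string instead of packed bits are illustrative assumptions, not part of the slides), the encoder can be written as:

```python
def gamma_encode(x: int) -> str:
    """Return the gamma code of x > 0 as a string of '0'/'1' characters."""
    if x <= 0:
        raise ValueError("gamma codes are defined only for x > 0")
    binary = bin(x)[2:]                 # x in binary, Length = len(binary) bits
    return "0" * (len(binary) - 1) + binary  # (Length - 1) zeros, then x


print(gamma_encode(9))  # 0001001
```

The output has 2⌊log₂ x⌋ + 1 characters, which matches the bit count stated on the slide.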

It is a prefix-free encoding, so a concatenation of γ-coded integers can be decoded left to right without separators. Exercise: given a sequence of γ-coded integers, reconstruct the original sequence.
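Decoding can be sketched as follows, assuming the codes are concatenated into one '0'/'1' string (the name `gamma_decode` and the string representation are assumptions made for illustration). Count the leading zeros, then read that many bits plus one as the binary value:

```python
def gamma_decode(bits: str) -> list:
    """Decode a concatenation of gamma codes given as a '0'/'1' string."""
    out = []
    i = 0
    while i < len(bits):
        zeros = 0
        while i < len(bits) and bits[i] == "0":  # unary part: Length - 1 zeros
            zeros += 1
            i += 1
        if i + zeros + 1 > len(bits):
            raise ValueError("truncated gamma code")
        out.append(int(bits[i:i + zeros + 1], 2))  # binary part: Length bits
        i += zeros + 1
    return out


print(gamma_decode("0001001" + "1" + "011"))  # [9, 1, 3]
```

Because no codeword is a prefix of another, the decoder never needs to look ahead or backtrack.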

δ code for integer encoding. Use γ-coding to reduce the length of the first field. Useful for medium-sized integers, e.g., 19 represented as … δ coding x takes about ⌊log₂ x⌋ + 2⌊log₂(⌊log₂ x⌋)⌋ + 2 bits. Optimal for Pr(x) = 1/(2x (log x)²), and i.i.d. integers.
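The sketch below implements the textbook Elias δ code: the Length of x is written in γ, followed by x in binary without its leading 1. This is an assumption about the layout; the slide's own variant may keep the leading 1, which would cost one more bit. The function names are illustrative.

```python
def gamma_encode(x: int) -> str:
    """Gamma code of x > 0: (Length - 1) zeros, then x in binary."""
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def delta_encode(x: int) -> str:
    """Textbook Elias delta code of x > 0 (leading 1 of x is dropped)."""
    if x <= 0:
        raise ValueError("delta codes are defined only for x > 0")
    binary = bin(x)[2:]
    # First field: Length in gamma (instead of unary); second: x minus its msb.
    return gamma_encode(len(binary)) + binary[1:]


print(delta_encode(19))  # 001010011
```

For x = 19 the first field is γ(5) = 00101 and the second is 0011, for 9 bits in total, against 9 bits for γ(19) as well; the saving over γ shows up for larger x.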

Variable-byte codes [10.2 bits per TREC12]. Wish to get very fast (de)compression → byte-align. Given the binary representation of an integer: append 0s to the front, to get a multiple-of-7 number of bits; form groups of 7 bits each; append to the last group the bit 0, and to the other groups the bit 1 (tagging). e.g., v = … → binary(v) = … Note: we waste 1 bit per byte, and on average 4 bits in the first byte. But it is a prefix code, and it also encodes the value 0!
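The tagging steps above can be sketched as follows. Placing the tag in the most significant bit of each byte is an assumption (the slide only says the tag bit is appended to each group), and the function names are illustrative.

```python
def vbyte_encode(v: int) -> bytes:
    """Variable-byte code of v >= 0: 7 payload bits per byte plus a tag bit."""
    if v < 0:
        raise ValueError("v must be non-negative")
    groups = []
    while True:
        groups.append(v & 0x7F)  # lowest 7 bits
        v >>= 7
        if v == 0:
            break
    groups.reverse()             # most significant group first
    # Tag 1 on every group except the last, which gets tag 0.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def vbyte_decode(data: bytes) -> list:
    """Decode a concatenation of variable-byte codes."""
    out, cur = [], 0
    for byte in data:
        cur = (cur << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:     # tag 0 marks the last byte of an integer
            out.append(cur)
            cur = 0
    return out


stream = vbyte_encode(0) + vbyte_encode(127) + vbyte_encode(2**14 + 1)
print(vbyte_decode(stream))  # [0, 127, 16385]
```

Decoding touches whole bytes only, which is where the speed advantage over bit-aligned codes such as γ and δ comes from.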

PForDelta coding. Take a block of 128 numbers and use b bits (e.g. b = 2, so the block takes 256 bits = 32 bytes) to encode each number, or create an exception for a number that does not fit in b bits. Encode exceptions with an escape symbol (ESC) or with pointers. Choose b to encode 90% of the values, or trade off: a larger b wastes more bits, a smaller b creates more exceptions. Translate data: [base, base + 2^b − 1] → [0, 2^b − 1].
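A simplified sketch of this scheme is given below. It is not the bit-packed layout of a real PForDelta implementation: slots are kept in a Python list, exceptions in a dictionary keyed by position (a stand-in for the "pointers" option), base is taken as the block minimum, and the names `pfor_encode`, `pfor_decode`, and `coverage` are assumptions.

```python
import math


def pfor_encode(block, coverage=0.9):
    """Encode a non-empty block of integers; returns (base, b, slots, exceptions)."""
    base = min(block)
    shifted = [v - base for v in block]          # translate to [0, ...]
    need = math.ceil(coverage * len(shifted))    # how many values must fit in b bits
    b = sorted(x.bit_length() for x in shifted)[need - 1]
    limit = 1 << b
    slots = [x if x < limit else 0 for x in shifted]  # b-bit slots, 0 as placeholder
    exceptions = {i: x for i, x in enumerate(shifted) if x >= limit}
    return base, b, slots, exceptions


def pfor_decode(base, b, slots, exceptions):
    """Rebuild the original block from slots and patched exceptions."""
    return [base + exceptions.get(i, s) for i, s in enumerate(slots)]


block = [10, 11, 12, 13, 10, 11, 12, 13, 10, 1000]
base, b, slots, exceptions = pfor_encode(block)
print(b, exceptions)  # 2 {9: 990}
```

Raising `coverage` towards 1.0 increases b and removes exceptions; lowering it shrinks the slots at the price of more patching during decoding.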

Random access to postings lists and other data types (e.g. encoding skips?) Paolo Ferragina Dipartimento di Informatica Università di Pisa

A basic problem! T = Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#.... Here m = |T| and n is the number of strings. An array of pointers takes (log m) bits per string = (n log m) bits = 32n bits with 32-bit pointers. We could drop the separating NULL. The space is independent of the string-length distribution. It is effective for few strings; it is bad for medium/large sets of strings.

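A basic problem! Replace the pointers with a bit vector B that marks where each item starts. For the strings T = Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#.... the separators can then be dropped, giving X = AbacoBattleCarColdCodDefenseGoogleYahoo.... with B defined over X. The same holds for an array of variable-length integers A = 10#2#5#6#20#31#3#3#.... written in binary, where we could also drop the msb. We aim at achieving ≈ n log(m/n) bits < n log m.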
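A small sketch of this idea, assuming B holds a 1 at the first character of every string in X and a 0 elsewhere (the helper names and the linear-scan `select1` are illustrative; a real index would answer Select in constant time):

```python
def build(strings):
    """Concatenate non-empty strings into X and mark each start with a 1 in B."""
    X = "".join(strings)
    B = []
    for s in strings:
        B.extend([1] + [0] * (len(s) - 1))
    return X, B


def select1(B, i):
    """1-based position of the i-th 1 in B (naive linear scan)."""
    seen = 0
    for pos, bit in enumerate(B, start=1):
        seen += bit
        if bit and seen == i:
            return pos
    raise IndexError("fewer than i ones in B")


def get_string(X, B, i, n):
    """Return the i-th string (1-based) out of n, using only X and B."""
    start = select1(B, i)
    end = select1(B, i + 1) if i < n else len(X) + 1
    return X[start - 1:end - 1]


words = ["Abaco", "Battle", "Car", "Cold", "Cod", "Defense", "Google", "Yahoo"]
X, B = build(words)
print(get_string(X, B, 3, len(words)))  # Car
```

Random access to the i-th string thus reduces to two Select queries on B, which is why compressing B while keeping Select fast is the core problem.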

Another textDB: Labeled Graph

Rank/Select on a bit vector B. Rank_b(i) = number of b in B[1,i]. Select_b(i) = position of the i-th b in B. In the slide's example, Rank_1(6) = 2 and Select_1(3) = 8. Let m = |B| and n = #1. There exist data structures that solve this problem in O(1) query time and efficient space (i.e. +o(m) additional bits). We wish to index the bit vector B in compressed form.
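The two definitions can be checked with a naive linear-time sketch. The bit vector below is hypothetical: the slide's own B did not survive, so this one was chosen only to reproduce the two values quoted above. Positions are 1-based, as on the slide.

```python
def rank(B: str, b: str, i: int) -> int:
    """Number of occurrences of bit b in B[1, i] (1-based, inclusive)."""
    return B[:i].count(b)


def select(B: str, b: str, i: int) -> int:
    """1-based position of the i-th occurrence of bit b in B."""
    seen = 0
    for pos, bit in enumerate(B, start=1):
        if bit == b:
            seen += 1
            if seen == i:
                return pos
    raise IndexError("B contains fewer than i occurrences of b")


B = "0010100101"  # hypothetical example vector
print(rank(B, "1", 6), select(B, "1", 3))  # 2 8
```

Both functions scan B, so they take O(m) time; the indexes discussed next bring this down to O(1).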

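The Bit-Vector Index: m + o(m) bits. m = |B|, n = #1s. Goal: B is read-only, and the additional index takes o(m) bits. B is split into blocks of Z bits, and for each block we store the (absolute) Rank_1 up to its start (8 and 18 in the slide's figure). Each block is split into buckets of z bits, and for each bucket we store the (bucket-relative) Rank_1 counted from the start of its block. A small table that maps a z-bit bucket content and a position pos to #1 answers Rank inside a bucket. Setting Z = poly(log m) and z = (1/2) log m, the space is |B| + (m/Z) log m + (m/z) log Z + o(m) = m + O(m loglog m / log m) bits. Rank time is O(1). The o(m) term is crucial in practice; B is untouched (not compressed).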
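A sketch of the two-level directory follows. Fixed small values of Z and z replace the poly(log m) and (1/2) log m settings, and the in-bucket lookup table is replaced by a direct count over at most z bits; both are simplifications made for readability, and the class name is an assumption.

```python
class RankIndex:
    """Two-level rank directory over a read-only list of 0/1 bits (sketch)."""

    def __init__(self, bits, Z=16, z=4):
        if Z % z != 0:
            raise ValueError("Z must be a multiple of z")
        self.bits, self.Z, self.z = bits, Z, z
        self.block_rank = []   # absolute count of 1s before each Z-bit block
        self.bucket_rank = []  # count of 1s from the block start to each z-bit bucket
        total = in_block = 0
        # The +1 adds a sentinel entry so that rank1(len(bits)) is also answered.
        for start in range(0, len(bits) + 1, z):
            if start % Z == 0:
                self.block_rank.append(total)
                in_block = 0
            self.bucket_rank.append(in_block)
            ones = sum(bits[start:start + z])
            total += ones
            in_block += ones

    def rank1(self, i):
        """Number of 1s among the first i bits, for 0 <= i <= len(bits)."""
        bucket = i // self.z
        start = bucket * self.z
        # In the real structure this term is a table lookup on a z-bit word.
        inside = sum(self.bits[start:i])
        return self.block_rank[start // self.Z] + self.bucket_rank[bucket] + inside


bits = [0, 0, 1, 0, 1, 0, 0, 1, 0, 1] * 5
index = RankIndex(bits)
print(index.rank1(6))  # 2
```

A query reads one block counter, one bucket counter, and at most z − 1 bits of B, so the work per query does not grow with m.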

The Bit-Vector Index (Select). m = |B|, n = #1s. B is split into segments of variable size r, each containing k consecutive 1s. Sparse case: if r > k², store explicitly the positions of the k 1s. Dense case: if k ≤ r ≤ k², recurse... One level is enough!! ... but we still need a table of size o(m). Setting k ≈ polylog m, the space is m + o(m), and B is not touched! Select time is O(1). There exists a Bit-Vector Index taking o(m) extra bits and constant time for Rank/Select. B is read-only!

Elias-Fano index & compress (in the slide's figure, z = 3 and w = 2). Each position of a 1 in B is split into its w low bits, stored verbatim in L, and its z high bits, which are encoded in unary in H. If w = log(m/n) and z = log n, where m = |B| and n = #1, then L takes n·w = n log(m/n) bits and H takes n 1s + n 0s = 2n bits. Select_1(i) on B uses L and (Select_1(H, i) − i), in +o(n) space (for Select_1 on H). Actually you can do binary search over B, but compressed!
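The sketch below follows this split under a few assumptions: positions are 0-based values in [0, m), H writes one 0 for every increment of the high part and one 1 per element, and Select on H is a naive scan instead of an o(n)-bit index. Function names are illustrative.

```python
def ef_encode(positions, m):
    """Elias-Fano encode a sorted, non-empty list of integers in [0, m)."""
    n = len(positions)
    w = max(0, (m // n).bit_length() - 1)   # w = floor(log2(m / n)) low bits
    L = [p & ((1 << w) - 1) for p in positions]
    H, prev_high = [], 0
    for p in positions:
        high = p >> w
        H.extend([0] * (high - prev_high))  # one 0 per step of the high part
        H.append(1)                         # one 1 per element
        prev_high = high
    return w, L, H


def select1(H, i):
    """1-based position of the i-th 1 in H (naive linear scan)."""
    seen = 0
    for pos, bit in enumerate(H, start=1):
        seen += bit
        if bit and seen == i:
            return pos
    raise IndexError("fewer than i ones in H")


def ef_select(w, L, H, i):
    """Value of the i-th element (1-based): high part from H, low part from L."""
    high = select1(H, i) - i                # number of 0s before the i-th 1
    return (high << w) | L[i - 1]


positions = [1, 4, 7, 18, 24, 26, 30, 31]
w, L, H = ef_encode(positions, 32)
print(w, ef_select(w, L, H, 4))  # 2 18
```

With m = 32 and n = 8 this gives w = 2 and 3 high bits, the same parameters as the slide's figure, and H stays within 2n bits.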

If you wish to play with Rank and Select: m/10 + n log(m/n) bits, Rank in 0.4 μsec, Select in < 1 μsec, vs 32n bits of explicit pointers.