Index construction: Compression of postings

Prof. Paolo Ferragina, Algoritmi per "Information Retrieval", Dipartimento di Informatica, Università di Pisa

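Sec. 3.1 Gap encoding
Store each posting list as the gaps (differences) between consecutive docIDs rather than the docIDs themselves, e.g., 1 1 2 7 20 14 …
Then you compress the resulting integers with variable-length prefix-free codes, as follows…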
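As a minimal sketch of gap encoding, the snippet below converts a sorted docID list into gaps and back. The function names `to_gaps` and `from_gaps` are hypothetical, and the docID list is an assumption obtained as the prefix sums of the gaps 1 1 2 7 20 14 shown on the slide.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Turn a strictly increasing docID list into gaps (the first gap is the first docID)."""
    previous = [0] + doc_ids[:-1]
    return [cur - prev for prev, cur in zip(previous, doc_ids)]

def from_gaps(gaps):
    """Rebuild the docIDs as the running sum of the gaps."""
    return list(accumulate(gaps))

doc_ids = [1, 2, 4, 11, 31, 45]   # assumed docIDs, chosen to give the gaps on the slide
gaps = to_gaps(doc_ids)           # [1, 1, 2, 7, 20, 14]
```

The gaps are small whenever the docIDs are dense, which is what makes the variable-length codes that follow effective.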

Variable-byte codes
Goal: very fast (de)compression, hence byte alignment.
Given the binary representation of an integer:
1. Prepend 0s to get a multiple-of-7 number of bits.
2. Form groups of 7 bits each.
3. Tag each group with one extra leading bit: 0 for the last group, 1 for all the other groups.
e.g., $v = 2^{14} + 1$, binary(v) = 100000000000001, which is encoded as 10000001 10000000 00000001.
Note: we waste 1 bit per byte, and on average 4 bits in the first byte. But it is a prefix code, and it also encodes the value 0!
T-nibble: we could design this code over t-bit groups, not just t = 8.
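The tagging steps above can be sketched as follows. This is an illustrative implementation under the slide's convention (tag bit 1 on every byte except the last); the names `vbyte_encode` and `vbyte_decode` are hypothetical.

```python
def vbyte_encode(v):
    """Encode a non-negative integer as variable-byte: 7 payload bits per byte."""
    groups = [v & 0x7F]
    v >>= 7
    while v:
        groups.append(v & 0x7F)
        v >>= 7
    groups.reverse()  # most significant group first
    # Tag bit 1 (0x80) on all groups but the last, tag bit 0 on the last one.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])

def vbyte_decode(data):
    """Decode a concatenation of variable-byte codewords into a list of integers."""
    out, cur = [], 0
    for byte in data:
        cur = (cur << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:  # tag bit 0 closes the current codeword
            out.append(cur)
            cur = 0
    return out

code = vbyte_encode(2**14 + 1)  # bytes 10000001 10000000 00000001
```

Decoding needs only shifts and masks on whole bytes, which is where the speed of this code comes from.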

PForDelta coding
[Figure: a block of 128 numbers stored in 2 bits each (10 01 …), i.e., 256 bits = 32 bytes, with exception values such as 42 and 23 stored separately.]
Use b (e.g., 2) bits to encode each of the 128 numbers of a block (32 bytes), or to mark it as an exception.
Translate data: $[\mathit{base}, \mathit{base} + 2^b - 2] \to [0, 2^b - 2]$.
Encode exceptions with the escape value $2^b - 1$.
Choose b so as to encode 90% of the values, or trade off: a larger b wastes more bits, a smaller b yields more exceptions.
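A simplified sketch of the scheme described above: values inside the range are translated by `base`, everything else is replaced by the escape code and moved to an exception list. The names `pfor_encode` and `pfor_decode`, the choice of `base` as the block minimum, and the sample block are assumptions; the b-bit codes are kept as Python integers rather than being bit-packed.

```python
def pfor_encode(block, b):
    """Split a block into b-bit codes plus a list of exceptions."""
    base = min(block)            # assumption: the base is the smallest value in the block
    escape = (1 << b) - 1        # the code 2^b - 1 marks an exception
    codes, exceptions = [], []
    for x in block:
        d = x - base
        if d < escape:           # d fits in [0, 2^b - 2]
            codes.append(d)
        else:
            codes.append(escape)
            exceptions.append(x)  # stored verbatim, in order of appearance
    return base, codes, exceptions

def pfor_decode(base, codes, exceptions, b):
    """Invert pfor_encode by consuming exceptions in order."""
    escape = (1 << b) - 1
    exc = iter(exceptions)
    return [next(exc) if c == escape else base + c for c in codes]

block = [2, 3, 3, 1, 1, 3, 3, 23, 13, 42, 2]   # hypothetical sample data
base, codes, exceptions = pfor_encode(block, b=2)
```

With b = 2 the three large values become exceptions, while the remaining eight values cost 2 bits each.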

γ-code
The γ-code of x > 0 consists of (Length - 1) 0s followed by the binary representation of x, where $\text{Length} = \lfloor \log_2 x \rfloor + 1$.
e.g., 9 is represented as 0001001 (000 followed by 1001).
The γ-code for x takes $2\lfloor \log_2 x \rfloor + 1$ bits (i.e., a factor of 2 from optimal).
Optimal for $\Pr(x) = 1/(2x^2)$ and i.i.d. integers.
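A minimal sketch of the γ encoder defined above, producing the codeword as a string of '0'/'1' characters for readability. The name `gamma_encode` is hypothetical.

```python
def gamma_encode(x):
    """Return the gamma code of x > 0 as a bit string."""
    if x <= 0:
        raise ValueError("the gamma code is defined only for x > 0")
    binary = bin(x)[2:]                        # binary representation, no leading zeros
    return "0" * (len(binary) - 1) + binary    # (Length - 1) zeros, then the binary

codeword = gamma_encode(9)   # "0001001"
```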

It is a prefix-free encoding…
Given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8 6 3 59 7 (the codewords are 0001000, 00110, 011, 00000111011 and 00111).
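Because the code is prefix-free, a stream can be decoded left to right with no separators: count the leading 0s, then read that many bits plus one. The sketch below applies this to the bit string of the exercise; the name `gamma_decode_stream` is hypothetical and the input is assumed to be well formed.

```python
def gamma_decode_stream(bits):
    """Decode a concatenation of gamma codewords given as a '0'/'1' string."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":       # unary part: Length - 1 zeros
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))   # binary part: Length bits
        i += zeros + 1
    return out

decoded = gamma_decode_stream("0001000001100110000011101100111")
# decoded == [8, 6, 3, 59, 7]
```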

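Elias-Fano
Let $w = \log(m/n)$ and $z = \log n$, where $m = |B|$ and $n = \#1$. Each number is split into its z high bits and its w low bits.
L (the low parts, w bits each) takes $n w = n \log(m/n)$ bits.
H (the high parts, written in unary) takes n 1s + n 0s = 2n bits.
[Figure: example with z = 3, w = 2; H lists the buckets 0 1 2 3 4 5 6 7 in unary, and Select1 operates on H.]
How to get the i-th number? Take the i-th group of w bits in L, and then represent the value (Select1(H,i) - i) in z bits: these are its high bits.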
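The access rule above can be sketched as follows. H is built by writing, for each bucket of high bits, one 1 per element followed by a closing 0, so that Select1(H,i) - i counts the 0s before the i-th 1, i.e. the bucket index. The names `ef_encode`, `select1`, `ef_access` and the sample values are assumptions, chosen so that z = 3 and w = 2 as in the example; `select1` is a linear scan standing in for a constant-time Select structure.

```python
def ef_encode(values, m):
    """Elias-Fano encode sorted integers in [0, m); returns (w, L, H)."""
    n = len(values)
    w = max((m // n).bit_length() - 1, 0)        # w = floor(log2(m/n))
    L = [v & ((1 << w) - 1) for v in values]     # low w bits of each value
    H, bucket = [], 0
    for v in values:
        high = v >> w
        while bucket < high:                     # close the buckets before this one
            H.append(0)
            bucket += 1
        H.append(1)                              # one 1 per element of the bucket
    H.append(0)                                  # close the last bucket
    return w, L, H

def select1(H, i):
    """1-based position of the i-th 1 in H (linear scan, for illustration only)."""
    seen = 0
    for pos, bit in enumerate(H, start=1):
        seen += bit
        if bit and seen == i:
            return pos
    raise IndexError("fewer than i ones in H")

def ef_access(w, L, H, i):
    """Return the i-th encoded number (1-based)."""
    high = select1(H, i) - i                     # number of 0s before the i-th 1
    return (high << w) | L[i - 1]

values = [1, 4, 7, 18, 24, 26, 30, 31]           # hypothetical data: n = 8, m = 32
w, L, H = ef_encode(values, m=32)
```

On this input H has 16 bits (8 ones and 8 zeros), matching the 2n bound.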

Rank and Select data structures

A basic problem!
D: Abaco, Battle, Car, Cold, Cod, ....
An array of n pointers to strings of total length m takes (n log m) bits = 32n bits with 32-bit pointers: it depends on the number of strings, and it is independent of the string length.
Alternative: concatenate the strings into D and mark the first character of each string with a 1 in a bit vector B:
D: Abaco Battle Car Cold Cod ....
B: 10000 100000 100 1000 100 ....
Spaces are introduced for simplicity.
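To illustrate how B can replace the pointer array, here is a small sketch in which the i-th string starts at position Select1(B, i) of the concatenation. The names `build`, `select1` and `get_string` are hypothetical, `select1` is a plain scan rather than a succinct structure, and empty strings are assumed not to occur.

```python
def build(strings):
    """Concatenate the strings and mark each starting character with a 1."""
    D = "".join(strings)
    B = []
    for s in strings:
        B.extend([1] + [0] * (len(s) - 1))
    return D, B

def select1(B, i):
    """1-based position of the i-th 1 in B (linear scan, for illustration only)."""
    seen = 0
    for pos, bit in enumerate(B, start=1):
        seen += bit
        if bit and seen == i:
            return pos
    raise IndexError("fewer than i ones in B")

def get_string(D, B, i, n):
    """Return the i-th string (1-based) out of n, using Select1 instead of a pointer."""
    start = select1(B, i) - 1
    end = select1(B, i + 1) - 1 if i < n else len(D)
    return D[start:end]

words = ["Abaco", "Battle", "Car", "Cold", "Cod"]
D, B = build(words)
```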

Rank/Select
Wish to index the bit vector B (possibly compressed), where m = |B| and n = #1.
B = 00101001010101011111110000011010101....
Rank_b(i) = number of b in B[1,i], e.g., Rank_1(6) = 2.
Select_b(i) = position of the i-th b in B.
Two approaches:
(1) Takes |B| + o(|B|) bits of space.
(2) Aims at achieving n log(m/n) bits, by deploying Elias-Fano + point (1).
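The two definitions can be checked with a naive reference implementation on the bit vector of the slide (positions are 1-based, as in the definitions). The names `rank` and `select` are hypothetical; both run in linear time and only serve to pin down the semantics.

```python
B = "00101001010101011111110000011010101"   # prefix of B shown on the slide

def rank(B, b, i):
    """Number of occurrences of bit b in B[1, i]."""
    return B[:i].count(b)

def select(B, b, i):
    """1-based position of the i-th occurrence of bit b in B."""
    seen = 0
    for pos, bit in enumerate(B, start=1):
        if bit == b:
            seen += 1
            if seen == i:
                return pos
    raise ValueError("fewer than i occurrences of b")

r = rank(B, "1", 6)      # 2, as on the slide
s = select(B, "1", 3)    # 8: the 1s of B are at positions 3, 5, 8, ...
```

Rank and Select are inverse in the sense that Rank_b(Select_b(i)) = i.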

The Bit-Vector Index: |B| + o(|B|)
m = |B|, n = #1s
Goal. B is read-only, and the additional index takes o(m) bits.
Rank
[Figure: B = 00101001010101011 1111100010110101 0101010111000.... is split into superblocks of Z bits, which store the (absolute) Rank1 at their boundaries (8, 18, …), and into blocks of z bits, which store the (bucket-relative) Rank1 (4, 5, 8, …). A precomputed table with columns block, pos, #1 (rows from 0000, 1 to 1011, 2) answers Rank1 inside a block.]
Setting Z = poly(log m) and z = (1/2) log m, the extra space is
$\frac{m}{Z}\log m + \frac{m}{z}\log Z + o(m) = O\!\left(\frac{m \log\log m}{\log m}\right) = o(m)$ bits,
where the three terms account for the absolute ranks, the bucket-relative ranks and the precomputed table.
Rank time is O(1).
The term o(m) is crucial in practice; B is untouched (not compressed).
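A sketch of the two-level Rank directory described above, under simplifying assumptions: Z and z are small fixed constants rather than poly(log m) and (1/2) log m, and the final in-block count is done by summing at most z bits instead of looking up the precomputed table. The class name `RankIndex` and its fields are hypothetical.

```python
class RankIndex:
    """Two-level rank directory over a read-only list of 0/1 bits."""

    def __init__(self, bits, Z=16, z=4):
        assert Z % z == 0, "blocks must tile superblocks exactly"
        self.bits, self.Z, self.z = bits, Z, z
        self.abs_rank = []   # 1s before each superblock (absolute)
        self.rel_rank = []   # 1s before each block, inside its superblock (relative)
        total = rel = 0
        for start in range(0, len(bits), z):
            if start % Z == 0:           # a new superblock starts here
                self.abs_rank.append(total)
                rel = 0
            self.rel_rank.append(rel)
            ones = sum(bits[start:start + z])
            rel += ones
            total += ones

    def rank1(self, i):
        """Number of 1s in B[1, i] (1-based, inclusive)."""
        if i == 0:
            return 0
        last = i - 1                     # 0-based index of the last counted bit
        sb, blk = last // self.Z, last // self.z
        in_block = sum(self.bits[blk * self.z:i])   # stands in for the table lookup
        return self.abs_rank[sb] + self.rel_rank[blk] + in_block

bits = [int(c) for c in "00101001010101011111110000011010101"]
index = RankIndex(bits)
```

A query touches one absolute counter, one relative counter and at most z bits of B, which is what gives constant time once the in-block step is tabulated.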

The Select operation
m = |B|, n = #1s
B = 0010100101010101111111000001101010101010111001....
B is split into subarrays of variable size r: each one is extended until it includes $k = (\log m)^2$ 1s.
Sparse case: if $r > k^2 = (\log m)^4$, we store explicitly the positions of the k 1s. We have at most m/r blocks of this type, each taking k log m bits, so in total $(m/r) \cdot k \cdot \log m = O(m/\log m) = o(m)$ bits.
Dense case: $k \le r \le k^2$; recurse by repeating the argument with $k' = (\log\log m)^2$. If a subarray of size r' including k' 1s is longer than log m bits, then store the k' positions explicitly using O(log log m) bits each, thus O(m/log log m) = o(m) bits in total. Otherwise r' < log m, and thus a precomputed table is enough.
Extra space is + o(m), and B is not touched! Select time is O(1).
There exists a Bit-Vector Index taking o(m) extra bits and constant time for Rank/Select. B is needed and read-only!
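The sketch below is not the two-level constant-time scheme described above but a deliberately simpler stand-in that keeps its first idea: sample the position of one 1 out of every k and leave B untouched. A query jumps to the nearest sample and then scans B, so its time depends on the local density of 1s. The class name `SampledSelect` and the value k = 4 are assumptions.

```python
class SampledSelect:
    """Select1 via one-level sampling of the 1s, plus a scan of B."""

    def __init__(self, bits, k=4):
        self.bits, self.k = bits, k
        self.samples = []     # 1-based positions of the 1st, (k+1)-th, (2k+1)-th, ... 1
        seen = 0
        for pos, bit in enumerate(bits, start=1):
            if bit:
                if seen % k == 0:
                    self.samples.append(pos)
                seen += 1

    def select1(self, i):
        """1-based position of the i-th 1 (raises IndexError if there is none)."""
        pos = self.samples[(i - 1) // self.k]   # sampled 1 at or before the i-th one
        remaining = (i - 1) % self.k            # further 1s to skip by scanning
        while remaining:
            pos += 1
            if self.bits[pos - 1]:
                remaining -= 1
        return pos

bits = [int(c) for c in "0010100101010101111111000001101010101010111001"]
sel = SampledSelect(bits)
```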

Via Elias-Fano (B is not needed)
Recall that by setting w = log (m/n) and z = log n, where m = |B| and n = #1, we get
Space = n log (m/n) bits + 2n bits
(Build Select1 on H, so we need extra |H| + o(|H|) bits = 2n + o(n) bits.)
[Figure: example with z = 3, w = 2; buckets 0 1 2 3 4 5 6 7.]
Select1(i) on B → uses L and (Select1(H,i) - i), in + o(n) space.
Rank1(i) on B → needs binary search over B.
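One way to read the last line is that Rank1 is answered by binary searching over the answers of Select1. The sketch below assumes only a `select1(j)` oracle for 1 <= j <= n; here a sorted Python list of positions stands in for the Elias-Fano structure. The names `rank1_via_select` and `ones` are hypothetical.

```python
def rank1_via_select(select1, n, i):
    """Number of 1s in B[1, i], using O(log n) calls to select1."""
    lo, hi = 0, n            # the answer is the largest j with select1(j) <= i
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if select1(mid) <= i:
            lo = mid         # the mid-th 1 is within the first i positions
        else:
            hi = mid - 1
    return lo

ones = [3, 5, 8, 10]                     # hypothetical positions of the 1s in B
select1 = lambda j: ones[j - 1]          # stand-in for Select1 on Elias-Fano
r = rank1_via_select(select1, len(ones), 6)   # 2
```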

If you wish to play with Rank and Select
m/10 + n log (m/n) bits, Rank in 0.4 msec, Select in < 1 msec
vs 32n bits of explicit pointers