Index construction: Compression of postings Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 5.3 and a paper.


1 Index construction: Compression of postings Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 5.3 and a paper

2 γ-code for integer encoding. For x > 0, let Length = ⌊log₂ x⌋ + 1. The γ-code of x is Length-1 zeros followed by the Length bits of x in binary; e.g., 9 is represented as <000,1001>. The γ-code for x takes 2⌊log₂ x⌋ + 1 bits (i.e., a factor of 2 from optimal). It is optimal for Pr(x) = 1/(2x²) and i.i.d. integers.
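A minimal Python sketch of the γ encoder described on this slide. The function names are illustrative, not taken from the slides, and codewords are returned as strings of '0'/'1' characters for readability rather than as packed bits.

```python
def gamma_encode(x: int) -> str:
    """Gamma code of x > 0: (Length - 1) zeros, then x in binary."""
    if x <= 0:
        raise ValueError("gamma codes are defined only for x > 0")
    binary = bin(x)[2:]                       # Length = floor(log2 x) + 1 bits
    return "0" * (len(binary) - 1) + binary


def gamma_length(x: int) -> int:
    """Codeword length stated on the slide: 2 * floor(log2 x) + 1 bits."""
    return 2 * (x.bit_length() - 1) + 1


print(gamma_encode(9))   # 0001001
```

The codeword for 9 has 7 bits, which agrees with 2⌊log₂ 9⌋ + 1 = 2·3 + 1.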

3 It is a prefix-free encoding… Given the following sequence of γ-coded integers, reconstruct the original sequence: 0001000001100110000011101100111. Answer: 8, 6, 3, 59, 7.
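The exercise can be checked mechanically. The sketch below decodes a concatenation of γ codewords by counting leading zeros and then reading that many bits plus one; it assumes the input is a well-formed string of '0'/'1' characters, and the function name is illustrative.

```python
from typing import List


def gamma_decode_stream(bits: str) -> List[int]:
    """Decode a concatenation of gamma codewords into the list of integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                 # unary part: Length - 1 zeros
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))   # Length bits of x
        i += zeros + 1
    return values


print(gamma_decode_stream("0001000001100110000011101100111"))
# [8, 6, 3, 59, 7]
```

No separators are needed between codewords because no codeword is a prefix of another.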

4 δ-code for integer encoding. Use γ-coding to reduce the length of the first field: encode Length with a γ-code, then write x in binary. Useful for medium-sized integers; e.g., 19 is represented as <00,101,10011>. δ-coding x takes about ⌊log₂ x⌋ + 2⌊log₂(⌊log₂ x⌋)⌋ + 2 bits. It is optimal for Pr(x) = 1/(2x(log x)²) and i.i.d. integers.
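A sketch of δ-coding under one stated assumption: the length field is γ-coded and is followed by all Length bits of x, which is the variant consistent with the "+ 2" in the bit count above. The textbook Elias δ code drops the leading 1 of x and so saves one bit. Function names are illustrative.

```python
def gamma_encode(x: int) -> str:
    """Gamma code: (Length - 1) zeros followed by x in binary."""
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def delta_encode(x: int) -> str:
    """Assumed variant: gamma(Length) followed by the full binary of x."""
    if x <= 0:
        raise ValueError("delta codes are defined only for x > 0")
    binary = bin(x)[2:]
    return gamma_encode(len(binary)) + binary


def delta_decode(bits: str) -> int:
    """Decode a single codeword produced by delta_encode."""
    zeros = 0
    while bits[zeros] == "0":                 # unary part of gamma(Length)
        zeros += 1
    length = int(bits[zeros:2 * zeros + 1], 2)
    start = 2 * zeros + 1
    return int(bits[start:start + length], 2)


print(delta_encode(19))   # 0010110011
```

For x = 19 this gives 10 bits, matching ⌊log₂ 19⌋ + 2⌊log₂ 4⌋ + 2 = 4 + 4 + 2.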

5 Variable-byte codes [10.2 bits per TREC12]. We wish to get very fast (de)compression, so we byte-align. Given the binary representation of an integer: (1) prepend 0s to get a multiple-of-7 number of bits; (2) form groups of 7 bits each; (3) tag each group with one extra leading bit, 0 for the last group and 1 for all the others. E.g., v = 2^14 + 1 → binary(v) = 100000000000001 → 10000001 10000000 00000001. Note: we waste 1 bit per byte, and on average 4 bits in the first byte. But it is a prefix code, and it also encodes the value 0!
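The three steps can be sketched in Python as follows. The function names are illustrative; the tag convention (1 on every byte except the last) follows the slide's example.

```python
from typing import List


def vbyte_encode(v: int) -> bytes:
    """Split v into 7-bit groups, most significant first; tag all but the last with 1."""
    if v < 0:
        raise ValueError("only non-negative integers are supported")
    groups = [v & 0x7F]                       # least significant 7 bits
    v >>= 7
    while v:
        groups.append(v & 0x7F)
        v >>= 7
    groups.reverse()                          # most significant group first
    tagged = [g | 0x80 for g in groups[:-1]]  # continuation bytes: tag bit 1
    tagged.append(groups[-1])                 # last byte: tag bit 0
    return bytes(tagged)


def vbyte_decode_stream(data: bytes) -> List[int]:
    """Decode a concatenation of variable-byte codewords."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if (byte & 0x80) == 0:                # tag 0 closes the codeword
            values.append(current)
            current = 0
    return values


code = vbyte_encode(2**14 + 1)
print(" ".join(f"{b:08b}" for b in code))     # 10000001 10000000 00000001
```

Decoding touches whole bytes only, which is where the speed advantage over bit-aligned codes such as γ and δ comes from.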

6 PForDelta coding. [Figure: a block of 128 numbers = 256 bits = 32 bytes.] Use b (e.g., 2) bits to encode each of the 128 numbers, or create exceptions. Encode exceptions with ESC codes or pointers. Choose b so as to encode 90% of the values, or trade off: a larger b wastes more bits, a smaller b creates more exceptions. Translate data: [base, base + 2^b - 1] → [0, 2^b - 1].
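A simplified sketch of the idea, not the actual PForDelta layout: values are shifted by the block minimum (the base), b is the smallest width that fits roughly 90% of the shifted values, and the rest are kept as explicit (position, value) exceptions instead of ESC codes or chained pointers. The slots are left as a Python list rather than packed into b-bit fields, and all names are illustrative.

```python
import math
from typing import List, Tuple


def pfor_encode(block: List[int], coverage: float = 0.9):
    """Return (base, b, slots, exceptions) for one block of integers."""
    base = min(block)
    shifted = [v - base for v in block]       # [base, base + 2^b - 1] -> [0, 2^b - 1]
    ranked = sorted(shifted)
    threshold = ranked[math.ceil(coverage * len(ranked)) - 1]
    b = max(1, threshold.bit_length())        # smallest width covering `coverage`
    limit = 1 << b
    slots: List[int] = []
    exceptions: List[Tuple[int, int]] = []
    for pos, v in enumerate(shifted):
        if v < limit:
            slots.append(v)                   # fits in b bits
        else:
            slots.append(0)                   # placeholder slot
            exceptions.append((pos, v))       # stored separately, uncompressed
    return base, b, slots, exceptions


def pfor_decode(base: int, b: int, slots: List[int], exceptions) -> List[int]:
    values = list(slots)
    for pos, v in exceptions:                 # patch the exceptions back in
        values[pos] = v
    return [v + base for v in values]


block = [2, 3, 3, 1, 1, 3, 3, 2, 3, 1, 3, 42, 2, 1, 2, 3, 1, 2, 3, 2]
base, b, slots, exceptions = pfor_encode(block)
print(base, b, exceptions)                    # 1 2 [(11, 41)]
```

Raising `coverage` pushes b up and wastes bits on the small values; lowering it shrinks b but grows the exception list, which is the trade-off the slide describes.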

7 Random access to postings lists and other data types (e.g. encoding skips?) Paolo Ferragina Dipartimento di Informatica Università di Pisa

8 A basic problem! T = Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#.... An array of pointers into T takes (log m) bits per string = (n log m) bits = 32n bits. We could drop the separating NULL. It is independent of the string-length distribution. It is effective for few strings. It is bad for medium/large sets of strings.

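9 A basic problem! Strings: T = Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#.... becomes X = AbacoBattleCarColdCodDefenseGoogleYahoo.... plus B = 10000100000100100010010000001000010000...., where a 1 marks the start of each string. Integers: A = 10#2#5#6#20#31#3#3#.... becomes X = 1010101011101010111111111.... (binary representations, concatenated) plus B = 1000101001001000100001010...., where a 1 marks the start of each integer. We could drop the msb. We aim at achieving ≈ n log(m/n) bits < n log m.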
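A small sketch of the pointer-free layout: the strings are concatenated into X, and a bit vector B carries a 1 at the first character of each string. The i-th string is then the slice of X between the i-th and (i+1)-th 1 of B. The linear scan below only illustrates the layout; it assumes non-empty strings, and the function names are illustrative.

```python
from typing import List, Tuple


def build(strings: List[str]) -> Tuple[str, str]:
    """Concatenate the strings into X and mark each string start with a 1 in B."""
    X = "".join(strings)
    B = "".join("1" + "0" * (len(s) - 1) for s in strings)
    return X, B


def ith_start(B: str, i: int) -> int:
    """0-based position of the i-th 1 in B (i >= 1), or len(B) if there is none."""
    seen = 0
    for pos, bit in enumerate(B):
        if bit == "1":
            seen += 1
            if seen == i:
                return pos
    return len(B)


def access(X: str, B: str, i: int) -> str:
    """Return the i-th string (1-based) without any explicit pointer array."""
    return X[ith_start(B, i):ith_start(B, i + 1)]


X, B = build(["Abaco", "Battle", "Car", "Cold", "Cod", "Defense"])
print(B)                 # 1000010000010010001001000000
print(access(X, B, 3))   # Car
```

The bit vector B costs one bit per character instead of one pointer per string, and it is the object that the compressed representations aim to shrink to about n log(m/n) bits.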

10 Another textDB: Labeled Graph

11 Rank/Select. B = 00101001010101011111110000011010101.... Rank_b(i) = number of b in B[1,i]. Select_b(i) = position of the i-th b in B. E.g., Rank_1(6) = 2 and Select_1(3) = 8. Let m = |B| and n = #1s. There exist data structures that solve this problem in O(1) query time and efficient space (i.e., o(m) additional bits). We wish to index the bit vector B in compressed form.
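The two definitions can be pinned down with a naive reference implementation. It runs in linear time, so it says nothing about the O(1) structures, but it fixes the 1-based indexing used on the slide. The function names are illustrative.

```python
def rank(B: str, b: str, i: int) -> int:
    """Number of occurrences of bit b in B[1, i] (1-based, inclusive)."""
    return B[:i].count(b)


def select(B: str, b: str, i: int) -> int:
    """1-based position of the i-th occurrence of bit b in B."""
    seen = 0
    for pos, bit in enumerate(B, start=1):
        if bit == b:
            seen += 1
            if seen == i:
                return pos
    raise IndexError("B contains fewer than i occurrences of b")


B = "00101001010101011111110000011010101"
print(rank(B, "1", 6), select(B, "1", 3))   # 2 8
```

The two operations are inverse in the sense that rank(B, b, select(B, b, i)) = i.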

12 The Bit-Vector Index: m + o(m). m = |B|, n = #1s. Goal: B is read-only, and the additional index takes o(m) bits. [Figure: B = 00101001010101011 1111100010110101 0101010111000.... split into blocks of Z bits with their (absolute) Rank_1 values, e.g., 8 and 18; each block split into buckets of z bits with their (bucket-relative) Rank_1 values, e.g., 4, 5, 8; and a lookup table with columns block, pos, #1 that answers Rank inside a bucket.] Setting Z = poly(log m) and z = (1/2) log m, the space is |B| + (m/Z) log m + (m/z) log Z + o(m) = m + O(m loglog m / log m) bits. Rank time is O(1). The o(m) term is crucial in practice; B is untouched (not compressed).
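A toy sketch of the two-level directory. Block size Z and bucket size z are small fixed values here rather than poly(log m) and (1/2) log m, and the precomputed lookup table for the last step is replaced by a direct count over at most z bits. The class and attribute names are illustrative.

```python
class RankIndex:
    """Two-level rank directory over a read-only bit string B."""

    def __init__(self, B: str, Z: int = 16, z: int = 4):
        assert Z % z == 0, "bucket size must divide block size"
        self.B, self.Z, self.z = B, Z, z
        self.block_rank = []    # 1s before each block (absolute counts)
        self.bucket_rank = []   # 1s before each bucket, counted from its block start
        total = in_block = 0
        for start in range(0, len(B), z):
            if start % Z == 0:              # a new block begins here
                self.block_rank.append(total)
                in_block = 0
            self.bucket_rank.append(in_block)
            ones = B[start:start + z].count("1")
            total += ones
            in_block += ones

    def rank1(self, i: int) -> int:
        """Number of 1s in B[1, i] (1-based, inclusive), for 0 <= i <= len(B)."""
        if i == 0:
            return 0
        block = (i - 1) // self.Z
        bucket = (i - 1) // self.z
        # Stand-in for the lookup table: count inside one bucket (at most z bits).
        inside = self.B[bucket * self.z:i].count("1")
        return self.block_rank[block] + self.bucket_rank[bucket] + inside


B = "00101001010101011111110000011010101"
index = RankIndex(B)
print(index.rank1(6))   # 2
```

A query adds three numbers: one absolute count, one relative count, and one in-bucket count. That is why Rank takes constant time once the in-bucket count comes from a table.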

13 The Bit-Vector Index. m = |B|, n = #1s. B = 0010100101010101111111000001101010101010111001.... is split into pieces of variable size r, each containing k consecutive 1s. Sparse case: if r > k², store explicitly the positions of the k 1s. Dense case: if k ≤ r ≤ k², recurse... One level is enough!! ... we still need a table of size o(m). Setting k ≈ polylog m, the space is m + o(m), and B is not touched! Select time is O(1). There exists a Bit-Vector Index taking o(m) extra bits and constant time for Rank/Select. B is read-only!
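The sketch below is a deliberately simpler technique than the sparse/dense scheme on the slide: it samples the position of every k-th 1 and scans forward from the nearest sample. It keeps B untouched and illustrates the grouping by k consecutive 1s, but the scan is not constant time when 1s are far apart, which is the case the sparse/dense split is designed to handle. Names are illustrative.

```python
class SampledSelect:
    """Select_1 via sampling every k-th 1, then scanning (not O(1) in general)."""

    def __init__(self, B: str, k: int = 4):
        self.B, self.k = B, k
        self.samples = []   # 0-based positions of the 1st, (k+1)-th, (2k+1)-th, ... 1
        seen = 0
        for pos, bit in enumerate(B):
            if bit == "1":
                if seen % k == 0:
                    self.samples.append(pos)
                seen += 1
        self.ones = seen

    def select1(self, i: int) -> int:
        """1-based position of the i-th 1, for 1 <= i <= number of 1s."""
        if not 1 <= i <= self.ones:
            raise IndexError("i is out of range")
        pos = self.samples[(i - 1) // self.k]   # jump to the nearest sampled 1
        remaining = (i - 1) % self.k            # 1s still to pass after the sample
        while remaining:
            pos += 1
            if self.B[pos] == "1":
                remaining -= 1
        return pos + 1


B = "00101001010101011111110000011010101"
print(SampledSelect(B).select1(3))   # 8
```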

14 Elias-Fano index & compress (example with z = 3, w = 2). If w = log(m/n) and z = log n, where m = |B| and n = #1s, then: L takes n·w = n log(m/n) bits; H takes n 1s + n 0s = 2n bits, with the buckets 0 1 2 3 4 5 6 7 written in unary. Actually you can do binary search over B, but compressed! Select_1(i) on B uses L and (Select_1(H,i) - i), in +o(n) space (for Select_1 on H).
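A sketch of the encoding with the slide's parameters (z = 3, w = 2, hence n = 8 and m = 32). The input values are made up for illustration, and Select_1 on H is a linear scan standing in for the o(n)-space structure. Each value keeps its low w bits verbatim in L; H writes, bucket by bucket, one 1 per value in the bucket followed by a closing 0.

```python
from typing import List, Tuple


def ef_encode(values: List[int], w: int, z: int) -> Tuple[List[int], str]:
    """Encode sorted values < 2^(w+z): L holds the low w bits, H the unary buckets."""
    low_mask = (1 << w) - 1
    L = [v & low_mask for v in values]
    H = []
    bucket = 0
    for v in values:
        high = v >> w
        H.append("0" * (high - bucket))       # close the buckets we move past
        bucket = high
        H.append("1")                         # one 1 per value in this bucket
    H.append("0" * ((1 << z) - bucket))       # close the remaining buckets
    return L, "".join(H)


def select1(H: str, i: int) -> int:
    """1-based position of the i-th 1 in H (naive scan)."""
    seen = 0
    for pos, bit in enumerate(H, start=1):
        if bit == "1":
            seen += 1
            if seen == i:
                return pos
    raise IndexError("H contains fewer than i ones")


def ef_select(L: List[int], H: str, w: int, i: int) -> int:
    """i-th smallest value: high part is Select_1(H, i) - i, low part is L[i]."""
    high = select1(H, i) - i                  # number of 0s before the i-th 1
    return (high << w) | L[i - 1]


values = [1, 4, 7, 18, 24, 26, 30, 31]        # hypothetical sorted input, n = 8, m = 32
L, H = ef_encode(values, w=2, z=3)
print(H)                                      # 1011000100110110
print([ef_select(L, H, 2, i) for i in range(1, 9)])
```

H has 16 = 2n bits and L has 8·2 = n log(m/n) bits, as stated on the slide, and every value is recovered from L and H alone.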

15 If you wish to play with Rank and Select: space is m/10 + n log(m/n) bits, Rank takes 0.4 μsec and Select takes < 1 μsec, vs 32n bits of explicit pointers.

