Index construction: Compression of postings. Paolo Ferragina, Dipartimento di Informatica, Università di Pisa. Reading: 5.3 and a paper.
γ code for integer encoding. For x > 0, let Length = ⌊log₂ x⌋ + 1; the γ code writes Length − 1 zeros followed by the binary representation of x. E.g., 9 = 1001 in binary is represented as 000 1001. The γ code for x takes 2⌊log₂ x⌋ + 1 bits (i.e., a factor of 2 from optimal). It is optimal for Pr(x) = 1/(2x²) and i.i.d. integers.
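A minimal Python sketch of γ-coding as just described; the function names are illustrative, not from the slides:

```python
def gamma_encode(x: int) -> str:
    """Elias gamma: (length-1) zeros, then the binary representation of x > 0."""
    assert x > 0
    b = bin(x)[2:]                        # binary of x, e.g. 9 -> '1001'
    return '0' * (len(b) - 1) + b         # '000' + '1001' = '0001001'

def gamma_decode(bits: str, pos: int = 0):
    """Decode one integer starting at bits[pos]; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == '0':
        zeros += 1
    length = zeros + 1
    value = int(bits[pos + zeros: pos + zeros + length], 2)
    return value, pos + zeros + length

assert gamma_encode(9) == '0001001'       # 2*floor(log2 9) + 1 = 7 bits
assert gamma_decode(gamma_encode(9))[0] == 9
```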
It is a prefix-free encoding… Given the following sequence of coded integers, reconstruct the original sequence:
δ code for integer encoding. Use γ-coding to reduce the length of the first field: the length of x is γ-coded instead of unary-coded. Useful for medium-sized integers. E.g., 19 = 10011 in binary (length 5) is represented as γ(5) = 00101 followed by 0011, i.e., 00101 0011. δ-coding x takes about log₂ x + 2 log₂(log₂ x) + 2 bits. It is optimal for Pr(x) = 1/(2x (log x)²) and i.i.d. integers.
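A matching sketch for δ-coding, reusing gamma_encode/gamma_decode from the previous sketch (names are again illustrative):

```python
def delta_encode(x: int) -> str:
    """Elias delta: gamma-code the length of x, then the binary of x without its leading 1."""
    assert x > 0
    b = bin(x)[2:]                        # e.g. 19 -> '10011' (length 5)
    return gamma_encode(len(b)) + b[1:]   # gamma(5) = '00101', tail = '0011'

def delta_decode(bits: str, pos: int = 0):
    length, pos = gamma_decode(bits, pos)
    value = int('1' + bits[pos: pos + length - 1], 2) if length > 1 else 1
    return value, pos + length - 1

assert delta_encode(19) == '001010011'    # about log2(19) + 2 log2(log2 19) + 2 bits
assert delta_decode(delta_encode(19))[0] == 19
```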
Variable-byte codes [10.2 bits per TREC12]. We wish to get very fast (de)compression, so we byte-align. Given the binary representation of an integer: prepend 0s to reach a multiple-of-7 number of bits; form groups of 7 bits each; tag the last group with the bit 0 and every other group with the bit 1. E.g., v = …, binary(v) = … Note: we waste 1 bit per byte, and on average 4 bits in the first byte. But it is a prefix code, and it also encodes the value 0!
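A minimal sketch of the variable-byte scheme just described, assuming the tagging convention stated above (last byte tagged 0 in its top bit, earlier bytes tagged 1); names are illustrative:

```python
def vb_encode(x: int) -> bytes:
    """Variable-byte: 7 payload bits per byte, top bit 1 = 'more bytes follow', 0 = 'last byte'."""
    assert x >= 0                         # note: 0 is encodable, unlike gamma/delta
    groups = []
    while True:
        groups.append(x & 0x7F)           # take 7 low-order bits
        x >>= 7
        if x == 0:
            break
    groups.reverse()                      # most-significant group first
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])

def vb_decode(data: bytes):
    values, cur = [], 0
    for byte in data:
        cur = (cur << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:              # tag 0 marks the last byte of a value
            values.append(cur)
            cur = 0
    return values

assert vb_decode(vb_encode(5) + vb_encode(130) + vb_encode(0)) == [5, 130, 0]
```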
PForDelta coding. A block of 128 numbers encoded with b = 2 bits each takes 256 bits = 32 bytes. Use b (e.g., 2) bits to encode each of the 128 numbers, or create exceptions for the values that do not fit. Exceptions are encoded separately, via an escape value (ESC) or pointers. Choose b so as to encode about 90% of the values, or trade off: a larger b wastes more bits, a smaller b produces more exceptions. Translate the data from [base, base + 2^b − 1] to [0, 2^b − 1].
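A simplified, illustrative sketch of the PForDelta idea: it keeps the translate-to-base step, the "b bits cover ~90% of the block" rule, and the slot/exception split, but does not reproduce any real on-disk bit layout:

```python
def pfor_encode(block, coverage=0.90):
    """Pick b so ~coverage of the block fits in b bits; the rest become exceptions."""
    base = min(block)
    shifted = [v - base for v in block]              # translate [base, ...] -> [0, ...]
    threshold = sorted(shifted)[int(coverage * (len(block) - 1))]
    b = max(threshold.bit_length(), 1)               # b bits cover ~90% of the values
    slots, exceptions = [], []
    for v in shifted:
        if v < (1 << b):
            slots.append(v)                          # fits in b bits
        else:
            slots.append(0)                          # placeholder; real value stored as exception
            exceptions.append((len(slots) - 1, v))   # (position in block, value)
    return base, b, slots, exceptions

def pfor_decode(base, b, slots, exceptions):
    out = list(slots)
    for pos, v in exceptions:
        out[pos] = v
    return [v + base for v in out]

block = [3, 5, 4, 6, 300, 5, 7, 4]                   # a small demo "block"
assert pfor_decode(*pfor_encode(block)) == block
```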
Random access to postings lists and other data types (e.g., encoding skips?). Paolo Ferragina, Dipartimento di Informatica, Università di Pisa.
A basic problem! T = Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#.... Store an array of pointers into T: Θ(log m) bits per string, i.e., n log m bits ≈ 32n bits overall; we could then drop the separating NULLs. This cost is independent of the string-length distribution: it is effective for few strings, but bad for medium/large sets of strings.
A basic problem! Instead, store the plain concatenation X = AbacoBattleCarColdCodDefenseGoogleYahoo.... (no separators) plus a binary array B with a 1 marking the start of each item in X. The same idea applies to a sequence of integers A = 10, 2, 5, 6, 20, 31, 3, 3, ...., whose binary representations are concatenated; there we could even drop the msb, which is always 1 and is recoverable from B. We aim at achieving ≈ n log(m/n) bits < n log m.
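A toy sketch contrasting the two layouts of the last two slides: explicit pointers into T versus the plain concatenation X plus a bit vector B marking where each string starts (purely illustrative):

```python
strings = ["Abaco", "Battle", "Car", "Cold", "Cod", "Defense", "Google", "Yahoo"]

# Layout 1: concatenation with separators + one pointer (offset) per string.
T = "#".join(strings) + "#"
pointers, off = [], 0
for s in strings:
    pointers.append(off)          # each pointer costs ~log2(|T|) bits (32 in practice)
    off += len(s) + 1

# Layout 2: plain concatenation X + bit vector B with a 1 at each string start.
X = "".join(strings)
B, off = [0] * len(X), 0
for s in strings:
    B[off] = 1
    off += len(s)

# Retrieving string i: with pointers it is T[pointers[i]: T.index("#", pointers[i])];
# with (X, B) it needs Select_1(B, i) and Select_1(B, i+1), which is exactly where
# Rank/Select enters the picture (next slides).
assert T[pointers[2]: T.index("#", pointers[2])] == "Car"
```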
Another text DB: a labeled graph.
Rank/Select on a bit vector B, with m = |B| and n = #1s. Rank_b(i) = number of occurrences of bit b in B[1, i]; Select_b(i) = position of the i-th occurrence of bit b in B. E.g., on the slide's example bit vector, Rank_1(6) = 2 and Select_1(3) = 8. There exist data structures solving this problem in O(1) query time and efficient space (i.e., only +o(m) additional bits). We also wish to index the bit vector B in compressed form.
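A naive, linear-time sketch that just pins down the Rank/Select semantics used here (1-based positions); the example bit vector is mine, chosen to reproduce Rank_1(6) = 2 and Select_1(3) = 8:

```python
def rank(B, b, i):
    """Number of occurrences of bit b in B[1..i]."""
    return sum(1 for x in B[:i] if x == b)

def select(B, b, i):
    """Position (1-based) of the i-th occurrence of bit b in B, or None."""
    count = 0
    for pos, x in enumerate(B, start=1):
        if x == b:
            count += 1
            if count == i:
                return pos
    return None

B = [0, 1, 0, 1, 0, 0, 0, 1, 1, 0]        # an example bit vector (not the slide's)
assert rank(B, 1, 6) == 2
assert select(B, 1, 3) == 8
```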
The Bit-Vector Index: m + o(m) bits, with m = |B| and n = #1s. Goal: B is read-only, and the additional index takes o(m) bits. Partition B into blocks of Z bits and buckets of z bits: store the (absolute) Rank_1 up to the start of each block and the (bucket-relative) Rank_1 up to the start of each bucket; a precomputed table answers Rank inside the final z-bit bucket. Setting Z = poly(log m) and z = (1/2) log m: space is |B| + (m/Z) log m + (m/z) log Z + o(m) = m + O(m loglog m / log m) bits, and Rank time is O(1). The o(m) term is crucial in practice; B is untouched (not compressed).
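A sketch of the two-level Rank directory described above, with illustrative parameters Z = 64 and z = 8 (z dividing Z); a real o(m)-bit solution would replace the final scan with a precomputed table over z-bit chunks:

```python
class RankIndex:
    def __init__(self, B, Z=64, z=8):                # assumes z divides Z
        self.B, self.Z, self.z = B, Z, z
        self.abs_rank = []                           # rank before each Z-block:   (m/Z) log m bits
        self.rel_rank = []                           # rank before each z-bucket,
        total = 0                                    # relative to its block:      (m/z) log Z bits
        for i, bit in enumerate(B):
            if i % Z == 0:
                self.abs_rank.append(total)
            if i % z == 0:
                self.rel_rank.append(total - self.abs_rank[-1])
            total += bit

    def rank1(self, i):
        """Number of 1s in B[1..i] (1-based prefix length)."""
        if i == 0:
            return 0
        p = i - 1                                    # last included position, 0-based
        r = self.abs_rank[p // self.Z] + self.rel_rank[p // self.z]
        start = (p // self.z) * self.z
        return r + sum(self.B[start:p + 1])          # the table-lookup step, done by scanning here

B = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1] * 10
idx = RankIndex(B)
assert all(idx.rank1(i) == sum(B[:i]) for i in range(len(B) + 1))
```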
The Bit-Vector Index for Select, with m = |B| and n = #1s. Partition B into buckets of k consecutive 1s each; the bucket size r (in bits) is variable. Sparse case: if r > k², store explicitly the positions of the k 1s. Dense case: if k ≤ r ≤ k², recurse... one level is enough!! ... we still need a lookup table of size o(m). Setting k ≈ polylog m: space is m + o(m) bits, B is not touched, and Select time is O(1). Hence there exists a Bit-Vector Index taking o(m) extra bits and constant time for Rank/Select, with B read-only.
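A simplified sketch of the Select index idea, keeping only the bucket-of-k-ones partition and the sparse/dense distinction; the dense case just scans instead of recursing plus table lookup, so it is not the O(1)-time construction of the slide:

```python
class SelectIndex:
    def __init__(self, B, k=4):
        self.B, self.k = B, k
        ones = [p for p, bit in enumerate(B) if bit == 1]    # 0-based positions of the 1s
        self.buckets = []
        for i in range(0, len(ones), k):
            chunk = ones[i:i + k]
            if chunk[-1] - chunk[0] + 1 > k * k:
                self.buckets.append(("sparse", chunk))        # bucket spans > k^2 bits: store positions
            else:
                self.buckets.append(("dense", chunk[0]))      # short bucket: keep its start, scan later

    def select1(self, i):
        """Position (1-based) of the i-th 1 in B."""
        kind, data = self.buckets[(i - 1) // self.k]
        j = (i - 1) % self.k
        if kind == "sparse":
            return data[j] + 1
        pos, seen = data, -1
        while True:                                           # scan inside a dense (short) bucket
            if self.B[pos] == 1:
                seen += 1
                if seen == j:
                    return pos + 1
            pos += 1

B = [0] * 32
for p in (1, 10, 12, 13, 30, 31):                             # 1s at these (0-based) positions
    B[p] = 1
idx = SelectIndex(B, k=2)
ones_1based = [p + 1 for p, bit in enumerate(B) if bit == 1]
assert all(idx.select1(i) == ones_1based[i - 1] for i in range(1, 7))
```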
Elias-Fano: index & compress. Split each value into z high bits and w low bits (in the slide's figure, z = 3 and w = 2). If w = log(m/n) and z = log n, where m = |B| and n = #1s, then: L (the low parts, stored verbatim) takes n·w = n log(m/n) bits; H (the high parts, in unary) takes n 1s + n 0s = 2n bits. Actually you can even do binary search over B, but in compressed form! Select_1(i) on B is obtained from L and (Select_1(H, i) − i), using a Select_1 structure on H in +o(n) extra space.
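A minimal sketch of Elias-Fano over a sorted sequence of n values in [0, m) (e.g., the positions of the 1s of B), with a naive Select_1 on H standing in for the o(n)-space constant-time structure mentioned above; the demo values are mine:

```python
from math import ceil, log2

def ef_encode(values, m):
    """Keep w = ceil(log2(m/n)) low bits of each value in L; encode high parts in H in unary."""
    n = len(values)
    w = max(1, ceil(log2(m / n)))
    L = [v & ((1 << w) - 1) for v in values]
    H, prev_high = [], 0
    for v in values:                          # for each value: 0s up to its high part, then a 1
        high = v >> w
        H.extend([0] * (high - prev_high))
        H.append(1)
        prev_high = high
    return w, L, H                            # |L| = n*w bits, |H| <= 2n bits

def ef_access(w, L, H, i):
    """Return the i-th value (1-based): high part = Select_1(H, i) - i, low part = L[i-1]."""
    ones, pos = 0, 0
    while ones < i:                           # naive Select_1 on H; a real index makes this O(1)
        ones += H[pos]
        pos += 1
    high = pos - i                            # number of 0s before the i-th 1
    return (high << w) | L[i - 1]

values = [3, 4, 7, 13, 14, 15, 21, 43]        # n = 8 sorted values, universe m = 64
w, L, H = ef_encode(values, 64)
assert [ef_access(w, L, H, i) for i in range(1, 9)] == values
```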
If you wish to play with Rank and Select: space ≈ m/10 + n log(m/n) bits; Rank in 0.4 sec, Select in < 1 sec, vs. 32n bits of explicit pointers.