Published by Harriet Shields. Modified over 8 years ago.
1
Index construction: Compression of postings Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 5.3 and a paper
2
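γ-code for integer encoding
For x > 0, let Length = ⌊log2 x⌋ + 1 be the number of bits of x in binary. The γ-code of x is Length-1 zeros followed by the binary representation of x; e.g., 9 is represented as <000,1001>.
The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).
Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers.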
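The encoding rule above can be sketched in a few lines. This is a minimal illustration, not code from the slides: the function name `gamma_encode` and the choice to return a string of '0'/'1' characters (rather than packed bits) are assumptions made for readability.

```python
def gamma_encode(x: int) -> str:
    """Return the gamma code of x > 0 as a string of '0'/'1' characters."""
    if x <= 0:
        raise ValueError("gamma codes are defined only for x > 0")
    binary = bin(x)[2:]                       # binary(x); always starts with 1
    return "0" * (len(binary) - 1) + binary   # Length-1 zeros, then binary(x)


if __name__ == "__main__":
    print(gamma_encode(9))
```

The output length is always 2⌊log2 x⌋ + 1 characters, matching the bit count stated on the slide.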
3
It is a prefix-free encoding…
Given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
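Because the code is prefix-free, the sequence can be decoded left to right with no separators: count the leading zeros, then read that many bits plus one. A minimal sketch, assuming a well-formed input given as a string of '0'/'1' characters (the name `gamma_decode` is illustrative):

```python
def gamma_decode(bits: str) -> list:
    """Decode a concatenation of gamma codes into the list of integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":        # unary part: Length-1 zeros
            zeros += 1
            i += 1
        # binary part: the next zeros+1 bits, starting with the leading 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


if __name__ == "__main__":
    print(gamma_decode("0001000001100110000011101100111"))
```

On the exercise string this yields the integers 8, 6, 3, 59, 7.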
4
δ-code for integer encoding
Use γ-coding to reduce the length of the first field (the Length of x).
Useful for medium-sized integers; e.g., 19 is represented as <00,101,10011>, i.e. the γ-code of Length = 5 followed by the binary representation of 19.
δ-coding x takes about log2 x + 2 log2(log2 x) + 2 bits.
Optimal for Pr(x) = 1/(2x (log x)^2), and i.i.d. integers.
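A minimal sketch of this construction follows. It assumes the variant implied by the bit count above, in which the full binary representation of x (including its leading 1) follows the γ-coded length; the classical Elias δ-code drops that leading 1 and saves one bit. The function names are illustrative.

```python
def gamma_encode(x: int) -> str:
    """Gamma code of x > 0: Length-1 zeros, then binary(x)."""
    if x <= 0:
        raise ValueError("x must be positive")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def delta_encode(x: int) -> str:
    """Delta code of x > 0: gamma code of Length, then binary(x).

    Assumption: binary(x) is kept whole, leading 1 included.
    """
    if x <= 0:
        raise ValueError("x must be positive")
    binary = bin(x)[2:]
    return gamma_encode(len(binary)) + binary


if __name__ == "__main__":
    print(delta_encode(19))
```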
5
Variable-byte codes [10.2 bits per TREC12]
Wish to get very fast (de)compression: byte-align.
Given the binary representation of an integer:
Pad it with 0s at the front, to get a multiple-of-7 number of bits.
Form groups of 7 bits each.
Prepend a tag bit to each group: 0 for the last group, 1 for the other groups (tagging).
e.g., v = 2^14 + 1, binary(v) = 100000000000001, encoded as 10000001 10000000 00000001
Note: We waste 1 bit per byte, and on average 4 bits in the first byte. But it is a prefix code, and it also encodes the value 0!
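The tagging scheme above can be sketched as follows. This is an illustrative implementation under the slide's convention (tag bit 1 on every byte except the last byte of each integer); the function names are hypothetical, and other variable-byte variants use the opposite tag convention.

```python
def vbyte_encode(v: int) -> bytes:
    """Encode v >= 0 into 7-bit groups, most significant group first."""
    if v < 0:
        raise ValueError("v must be non-negative")
    groups = [v & 0x7F]                  # lowest 7 bits
    v >>= 7
    while v > 0:
        groups.append(v & 0x7F)
        v >>= 7
    groups.reverse()                     # most significant group first
    # Tag bit 1 on all groups except the last one, which keeps tag bit 0.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def vbyte_decode(data: bytes) -> list:
    """Decode a concatenation of variable-byte codes."""
    values, cur = [], 0
    for byte in data:
        cur = (cur << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:             # tag 0: last byte of this integer
            values.append(cur)
            cur = 0
    return values


if __name__ == "__main__":
    code = vbyte_encode(2**14 + 1)
    print(" ".join(format(b, "08b") for b in code))
```

Decoding touches whole bytes only, which is where the speed advantage over bit-aligned codes such as γ and δ comes from.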
6
PForDelta coding
[Figure: a block of 128 numbers, each stored in b = 2 bits; a block of 128 numbers = 256 bits = 32 bytes.]
Use b (e.g. 2) bits to encode each of the 128 numbers, or create exceptions.
Encode exceptions: ESC or pointers.
Choose b to encode 90% of the values, or trade off: a larger b wastes more bits, a smaller b creates more exceptions.
Translate data: [base, base + 2^b - 1] → [0, 2^b - 1]
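The block layout can be illustrated with a simplified sketch. Several choices here are assumptions not stated on the slide: the all-ones b-bit codeword is reserved as the ESC marker (so only base .. base + 2^b - 2 are stored inline), exceptions are kept in a separate list in order of appearance, and slots are kept as Python integers rather than bit-packed.

```python
def pfor_encode(block, b, base):
    """Split a block into b-bit slots plus a list of exceptions."""
    esc = (1 << b) - 1                  # assumption: all-ones slot means ESC
    slots, exceptions = [], []
    for v in block:
        d = v - base                    # translate to [0, 2^b - 1]
        if 0 <= d < esc:
            slots.append(d)             # fits in b bits
        else:
            slots.append(esc)           # mark the slot as an exception
            exceptions.append(v)        # store the original value aside
    return slots, exceptions


def pfor_decode(slots, exceptions, b, base):
    """Rebuild the block from its slots and exceptions."""
    esc = (1 << b) - 1
    exc = iter(exceptions)
    return [next(exc) if s == esc else s + base for s in slots]


if __name__ == "__main__":
    block = [1, 2, 3, 1, 2, 40, 1, 3]
    slots, exceptions = pfor_encode(block, b=2, base=1)
    print(slots, exceptions)
```

With b = 2 and base = 1, only the value 40 becomes an exception; raising b would remove the exception at the cost of wider slots for every value.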
7
Random access to postings lists and other data types (e.g. encoding skips?) Paolo Ferragina Dipartimento di Informatica Università di Pisa
8
A basic problem!
T = Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#....
Array of pointers to the strings: (log m) bits per string = (n log m) bits = 32n bits.
We could drop the separating NULL.
The space is independent of the string-length distribution.
It is effective for few strings.
It is bad for medium/large sets of strings.
9
A basic problem!
Strings:
T = Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#....
X = AbacoBattleCarColdCodDefenseGoogleYahoo....
B = 10000100000100100010010000001000010000....
Integers:
A = 10#2#5#6#20#31#3#3#....
X = 1010101011101010111111111....
B = 1000101001001000100001010....
In both cases B marks with a 1 the position in X where each item starts. For the integers, we could drop the msb.
We aim at achieving ≈ n log(m/n) bits < n log m.
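The role of the marker bit vector can be sketched as follows: concatenate the strings into X, set a 1 in B at the start of each string, and recover the i-th string from the positions of the i-th and (i+1)-th 1. The function names are illustrative, non-empty strings are assumed, and the linear scan in `select1` merely stands in for a constant-time Select structure.

```python
def build(strings):
    """Return (X, B): the concatenation and its start-marker bit vector."""
    x = "".join(strings)
    b = "".join("1" + "0" * (len(s) - 1) for s in strings)  # assumes len(s) >= 1
    return x, b


def select1(b: str, i: int) -> int:
    """1-based position of the i-th 1 in b (naive linear scan)."""
    seen = 0
    for pos, bit in enumerate(b, start=1):
        if bit == "1":
            seen += 1
            if seen == i:
                return pos
    raise IndexError("fewer than i ones in b")


def get_string(x: str, b: str, i: int, n: int) -> str:
    """Return the i-th string (1-based) out of n."""
    start = select1(b, i)
    end = select1(b, i + 1) if i < n else len(x) + 1
    return x[start - 1:end - 1]


if __name__ == "__main__":
    words = ["Abaco", "Battle", "Car", "Cold", "Cod", "Defense", "Google", "Yahoo"]
    x, b = build(words)
    print(get_string(x, b, 4, len(words)))
```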
10
Another textDB: Labeled Graph
11
Rank/Select
B = 00101001010101011111110000011010101....
m = |B|, n = #1
Rank_b(i) = number of b in B[1,i]
Select_b(i) = position of the i-th b in B
Rank_1(6) = 2
Select_1(3) = 8
There exist data structures that solve this problem in O(1) query time and efficient space (i.e. +o(m) additional bits).
We also wish to index the bit vector B in compressed form.
12
The Bit-Vector Index: m + o(m)
m = |B|, n = #1s
Goal. B is read-only, and the additional index takes o(m) bits.
[Figure: B = 00101001010101011 1111100010110101 0101010111000.... is split into blocks of size Z, each storing its (absolute) Rank_1, e.g. 8 and 18; each block is split into sub-blocks of size z storing their (bucket-relative) Rank_1, e.g. 4, 5, 8; a lookup table with columns block, pos, #1 answers Rank inside a sub-block.]
Setting Z = poly(log m) and z = (1/2) log m:
Space is |B| + (m/Z) log m + (m/z) log Z + o(m) = m + O(m loglog m / log m) bits.
Rank time is O(1).
The term o(m) is crucial in practice; B is untouched (not compressed).
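The two-level directory can be sketched as below. This is an illustration under simplifying assumptions: the class name and the default sizes Z = 16, z = 4 are arbitrary (not the asymptotic settings above), z must divide Z, and the final count inside a sub-block uses a direct count over at most z bits in place of the precomputed lookup table.

```python
class RankIndex:
    """Two-level Rank_1 directory over a read-only bit string."""

    def __init__(self, bits: str, Z: int = 16, z: int = 4):
        assert Z % z == 0, "sketch assumes z divides Z"
        self.bits, self.Z, self.z = bits, Z, z
        self.super_rank = []   # absolute Rank_1 before each block of size Z
        self.block_rank = []   # Rank_1 before each sub-block, relative to its block
        total = rel = 0
        for start in range(0, len(bits), z):
            if start % Z == 0:
                self.super_rank.append(total)
                rel = 0
            self.block_rank.append(rel)
            ones = bits[start:start + z].count("1")
            total += ones
            rel += ones

    def rank1(self, i: int) -> int:
        """Number of 1s in B[1, i] (1-based, inclusive)."""
        if i <= 0:
            return 0
        i = min(i, len(self.bits))
        blk = (i - 1) // self.z                 # sub-block containing position i
        sup = (blk * self.z) // self.Z          # block containing that sub-block
        inside = self.bits[blk * self.z:i].count("1")   # at most z bits scanned
        return self.super_rank[sup] + self.block_rank[blk] + inside


if __name__ == "__main__":
    B = "00101001010101011111110000011010101"
    print(RankIndex(B).rank1(6))
```

Each query reads one entry from each directory level and scans at most z bits, so its cost does not grow with m.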
13
The Bit-Vector Index
m = |B|, n = #1s
[Figure: B = 0010100101010101111111000001101010101010111001.... is partitioned into buckets, each containing k consecutive 1s; the bucket size r is variable.]
Sparse case: if r > k^2, store explicitly the positions of the k 1s.
Dense case: if k ≤ r ≤ k^2, recurse... One level is enough!! ... still need a table of size o(m).
Setting k ≈ polylog m:
Space is m + o(m), and B is not touched!
Select time is O(1).
There exists a Bit-Vector Index taking o(m) extra bits and constant time for Rank/Select. B is read-only!
14
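Elias-Fano index&compress
Each position of a 1 in B is split into its z most significant bits and its w least significant bits (in the example, z = 3 and w = 2). L stores the low parts in binary; H stores the high parts, i.e. the buckets 0 1 2 3 4 5 6 7, in unary.
If w = log(m/n) and z = log n, where m = |B| and n = #1, then
- L takes n w = n log(m/n) bits
- H takes n 1s + n 0s = 2n bits
Select_1(i) on B uses L and (Select_1(H,i) - i), in +o(n) space (for Select_1 on H).
Actually you can do binary search over B, but compressed!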
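A compact sketch of the encoding and of Select_1 follows. The function names, the 0-based positions in [0, m), the rounding of w down to an integer, and the naive linear-scan `select1` on H (standing in for an o(n)-space constant-time structure) are all assumptions made for illustration.

```python
import math


def ef_encode(positions, m):
    """Encode sorted positions in [0, m); returns (w, L, H)."""
    n = len(positions)
    w = max(0, math.floor(math.log2(m / n)))      # low bits per element
    L = [p & ((1 << w) - 1) for p in positions]   # low parts, w bits each
    H, bucket = [], 0
    for p in positions:
        high = p >> w
        H.extend("0" * (high - bucket))           # one 0 per bucket boundary crossed
        bucket = high
        H.append("1")                             # one 1 per element
    return w, L, "".join(H)


def select1(h: str, i: int) -> int:
    """1-based position of the i-th 1 in h (naive linear scan)."""
    seen = 0
    for pos, bit in enumerate(h, start=1):
        if bit == "1":
            seen += 1
            if seen == i:
                return pos
    raise IndexError("fewer than i ones")


def ef_select(w, L, H, i):
    """Return the i-th encoded position (1-based)."""
    high = select1(H, i) - i                      # zeros before the i-th 1
    return (high << w) | L[i - 1]


if __name__ == "__main__":
    pos = [1, 4, 7, 18, 24, 26, 30, 31]
    w, L, H = ef_encode(pos, 32)
    print(w, L, H, [ef_select(w, L, H, i) for i in range(1, 9)])
```

With m = 32 and n = 8 this reproduces the parameters of the example, w = 2 and z = 3, and H never exceeds 2n bits.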
15
If you wish to play with Rank and Select:
Space is m/10 + n log(m/n) bits, vs 32n bits of explicit pointers.
Rank in 0.4 sec, Select in < 1 sec.