IR IL Compression
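γ-code for integer encoding
For x > 0, let Length = $\lfloor \log_2 x \rfloor + 1$. The γ-code of x is (Length - 1) zeros followed by x in binary, e.g., 9 is represented as 000 1001.
The γ-code for x takes $2\lfloor \log_2 x \rfloor + 1$ bits (i.e., a factor of 2 from optimal).
Optimal for $\Pr(x) = 1/(2x^2)$ and i.i.d. integers.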
It is a prefix-free encoding… Given the following sequence of coded integers, reconstruct the original sequence:
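Because no γ-codeword is a prefix of another, a concatenation of codewords can be decoded left to right without separators. The following is a minimal Python sketch of that idea, not a reference implementation. The function names `gamma_encode` and `gamma_decode_all`, and the use of '0'/'1' strings instead of packed bits, are assumptions made for readability; the input values in the usage note are made up.

```python
def gamma_encode(x: int) -> str:
    """Return the gamma code of x > 0 as a string of '0'/'1' characters."""
    if x <= 0:
        raise ValueError("gamma code is defined only for x > 0")
    binary = bin(x)[2:]                      # x in binary, Length bits
    return "0" * (len(binary) - 1) + binary  # Length-1 zeros, then x


def gamma_decode_all(bits: str) -> list[int]:
    """Decode a concatenation of gamma codewords (assumed well-formed)."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the leading zeros
            zeros += 1
            i += 1
        # The next zeros+1 bits are x in binary (they start with a 1).
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values
```

For example, `gamma_encode(9)` gives `0001001`, and `gamma_decode_all` applied to the concatenated codes of 4, 1, 9 returns `[4, 1, 9]`.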
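δ-code for integer encoding
Use γ-coding to reduce the length of the first field (the Length, which the γ-code writes in unary): the δ-code of x is the γ-code of Length followed by x in binary.
Useful for medium-sized integers, e.g., 19 is represented as 00 101 10011.
δ-coding x takes about $\lfloor \log_2 x \rfloor + 2\lfloor \log_2(\lfloor \log_2 x \rfloor) \rfloor + 2$ bits.
Optimal for $\Pr(x) = 1/(2x(\log x)^2)$ and i.i.d. integers.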
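A minimal Python sketch of δ-coding follows. It assumes the variant in which the γ-code of Length is followed by all Length bits of x, which is the layout consistent with the bit count quoted above; the textbook Elias δ-code drops the leading 1 of x and is one bit shorter. The names `gamma_encode`, `delta_encode`, and `delta_decode` are illustrative, and bits are kept as '0'/'1' strings for clarity.

```python
def gamma_encode(x: int) -> str:
    """Gamma code: (Length-1) zeros followed by x in binary."""
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def delta_encode(x: int) -> str:
    """Delta code (assumed variant): gamma(Length) followed by x in binary."""
    if x <= 0:
        raise ValueError("delta code is defined only for x > 0")
    binary = bin(x)[2:]
    return gamma_encode(len(binary)) + binary


def delta_decode(bits: str, i: int = 0) -> tuple[int, int]:
    """Decode one codeword starting at position i; return (x, next position)."""
    zeros = 0
    while bits[i] == "0":                    # unary part of gamma(Length)
        zeros += 1
        i += 1
    length = int(bits[i:i + zeros + 1], 2)   # Length, in zeros+1 bits
    i += zeros + 1
    x = int(bits[i:i + length], 2)           # x itself, in Length bits
    return x, i + length
```

Under this assumption, `delta_encode(19)` is 10 bits long, matching 4 + 2·2 + 2.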
Variable-byte codes [10.2 bits per TREC12]
Goal: very fast (de)compression, so codewords are byte-aligned.
Given the binary representation of an integer:
Prepend 0s to get a multiple-of-7 number of bits.
Form groups of 7 bits each.
Append the tag bit 0 to the last group and the tag bit 1 to all other groups (tagging).
e.g., v = …, binary(v) = …
Note: we waste 1 bit per byte, and on average 4 bits in the first byte. But it is a prefix code, and it also encodes the value 0.
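The tagging steps can be sketched in a few lines of Python. This is a sketch under assumptions: the tag is placed in the high bit of each byte, and the 7-bit groups are written most significant first. Other layouts (tag in the low bit, least significant group first) are also used in practice. The names `vbyte_encode` and `vbyte_decode` are illustrative.

```python
def vbyte_encode(v: int) -> bytes:
    """Encode v >= 0 into 7-bit groups, one tagged group per byte."""
    if v < 0:
        raise ValueError("variable-byte code is defined for v >= 0")
    groups = []
    while True:
        groups.append(v & 0x7F)      # take the lowest 7 bits
        v >>= 7
        if v == 0:
            break
    groups.reverse()                 # most significant group first
    out = bytearray()
    for g in groups[:-1]:
        out.append(0x80 | g)         # tag 1: more bytes follow
    out.append(groups[-1])           # tag 0: last byte of this integer
    return bytes(out)


def vbyte_decode(data: bytes) -> list[int]:
    """Decode a concatenation of variable-byte codewords."""
    values = []
    v = 0
    for byte in data:
        v = (v << 7) | (byte & 0x7F)
        if (byte & 0x80) == 0:       # tag 0 closes the current integer
            values.append(v)
            v = 0
    return values
```

Decoding needs only shifts and masks on whole bytes, which is where the speed comes from; the cost is the tag bit in every byte plus the padding in the first one.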
Rice code (simplification of Golomb code)
It is a parametric code: it depends on k.
Quotient $q = \lfloor (v-1)/k \rfloor$, and the remainder is $r = v - k \cdot q - 1$.
Codeword: q zeros, then a 1, then r written in $\log_2 k$ bits.
Useful when the integers are concentrated around k.
How do we choose k? Usually $k \approx 0.69 \cdot \mathrm{mean}(v)$ [Bernoulli model].
Optimal for $\Pr(v) = p(1-p)^{v-1}$, where mean(v) = 1/p, and i.i.d. integers.
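A minimal Python sketch of the codeword layout (q zeros, a 1, then the remainder in log2 k bits) follows. It assumes k is a power of two, which is the Rice restriction that makes the remainder field fixed-width; the names `rice_encode` and `rice_decode` and the '0'/'1' string representation are illustrative choices.

```python
def rice_encode(v: int, k: int) -> str:
    """Rice code of v >= 1 with parameter k (assumed to be a power of two)."""
    if v < 1:
        raise ValueError("v must be >= 1")
    if k < 1 or (k & (k - 1)) != 0:
        raise ValueError("k must be a power of two")
    q = (v - 1) // k                 # quotient, written in unary
    r = v - k * q - 1                # remainder, 0 <= r < k
    nbits = k.bit_length() - 1       # log2(k)
    rem = format(r, f"0{nbits}b") if nbits else ""
    return "0" * q + "1" + rem


def rice_decode(bits: str, k: int, i: int = 0) -> tuple[int, int]:
    """Decode one codeword starting at position i; return (v, next position)."""
    q = 0
    while bits[i] == "0":            # unary quotient
        q += 1
        i += 1
    i += 1                           # skip the terminating 1
    nbits = k.bit_length() - 1
    r = int(bits[i:i + nbits], 2) if nbits else 0
    return k * q + r + 1, i + nbits
```

With k = 4, the value 9 has q = 2 and r = 0, so `rice_encode(9, 4)` is `00100`.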
Interpolative coding
M = …
Recursive coding, following a preorder traversal of a balanced binary tree.
At every step we know (initially, they are encoded): num = |M| = 12, Lidx = 1, low = 1, Ridx = 12, hi = 21.
Take the middle element: $h = \lfloor (\text{Lidx} + \text{Ridx})/2 \rfloor = 6$, M[6] = 9, num_left = h - Lidx = 5, num_right = Ridx - h = 6.
Then low + num_left = 1 + 5 = 6 ≤ M[h] ≤ hi - num_right = 21 - 6 = 15.
We can encode 9 - 6 = 3 in $\lceil \log_2(15 - 6 + 1) \rceil = 4$ bits.
Recurse on the left half with low = 1, hi = 9 - 1 = 8, num = 5, and on the right half with low = 9 + 1 = 10, hi = 21, num = 6.
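The recursion can be sketched in Python as below. This is an illustrative sketch, not a tuned implementation: it uses plain fixed-width binary for each offset, 1-based indices as in the text, and '0'/'1' strings for the output. The names `ic_encode` and `ic_decode` are assumptions, and the list `M` in the usage note is a hypothetical example chosen only to satisfy the parameters quoted above (12 elements in [1, 21] with M[6] = 9).

```python
def _width(lo: int, hi: int) -> int:
    """Bits needed for one value in [lo, hi]: ceil(log2(hi - lo + 1))."""
    return (hi - lo).bit_length()


def ic_encode(M: list[int], low: int, hi: int) -> str:
    """Encode a strictly increasing list M with all values in [low, hi]."""
    bits = []

    def rec(lidx, ridx, low, hi):
        if lidx > ridx:
            return
        h = (lidx + ridx) // 2
        lo_b = low + (h - lidx)          # low + num_left
        hi_b = hi - (ridx - h)           # hi - num_right
        w = _width(lo_b, hi_b)
        if w:                            # w == 0: the value is forced
            bits.append(format(M[h - 1] - lo_b, f"0{w}b"))
        rec(lidx, h - 1, low, M[h - 1] - 1)
        rec(h + 1, ridx, M[h - 1] + 1, hi)

    rec(1, len(M), low, hi)
    return "".join(bits)


def ic_decode(bits: str, num: int, low: int, hi: int) -> list[int]:
    """Rebuild the list from its code, given num, low and hi."""
    M = [0] * num
    pos = 0

    def rec(lidx, ridx, low, hi):
        nonlocal pos
        if lidx > ridx:
            return
        h = (lidx + ridx) // 2
        lo_b = low + (h - lidx)
        hi_b = hi - (ridx - h)
        w = _width(lo_b, hi_b)
        off = int(bits[pos:pos + w], 2) if w else 0
        pos += w
        M[h - 1] = lo_b + off
        rec(lidx, h - 1, low, M[h - 1] - 1)
        rec(h + 1, ridx, M[h - 1] + 1, hi)

    rec(1, num, low, hi)
    return M
```

With the hypothetical list `M = [1, 2, 3, 5, 7, 9, 11, 15, 18, 19, 20, 21]`, the first four bits of `ic_encode(M, 1, 21)` are `0011`, the offset 9 - 6 = 3 from the worked step. Runs of consecutive values such as 1, 2, 3 can cost zero bits, because the bounds leave only one possible value.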
PForDelta coding
Works on a block of 128 numbers.
Use b (e.g., 2) bits to encode each of the 128 numbers, or create exceptions for the values that do not fit.
Encode exceptions with an ESC symbol or with pointers.
Choose b to encode 90% of the values, or as a trade-off: a larger b wastes more bits, a smaller b creates more exceptions.
Translate data: $[\text{base}, \text{base} + 2^b - 1] \to [0, 2^b - 1]$.
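A simplified Python sketch of the idea follows. It is not the actual PForDelta layout: exceptions are kept in a separate list of (position, value) pairs rather than as ESC symbols or chained pointers, the b-bit slots are left as a list of small integers instead of being bit-packed, and base is taken to be the block minimum. The names `choose_b`, `pfor_encode`, and `pfor_decode` are hypothetical.

```python
def choose_b(block: list[int], percent: int = 90) -> int:
    """Smallest b such that about `percent`% of the offsets fit in b bits."""
    base = min(block)
    offsets = sorted(v - base for v in block)
    # Index of the offset at the requested percentile (integer arithmetic).
    k = max(0, (len(offsets) * percent + 99) // 100 - 1)
    return max(1, offsets[k].bit_length())


def pfor_encode(block: list[int], b: int):
    """Translate values to [0, 2^b - 1]; values that do not fit are exceptions."""
    base = min(block)
    limit = (1 << b) - 1                 # largest offset that fits in b bits
    slots, exceptions = [], []
    for i, v in enumerate(block):
        off = v - base
        if off <= limit:
            slots.append(off)
        else:
            slots.append(0)              # placeholder slot
            exceptions.append((i, v))    # patched in at decode time
    return base, slots, exceptions


def pfor_decode(base: int, slots: list[int], exceptions) -> list[int]:
    """Undo the translation, then patch the exceptions back in."""
    out = [base + s for s in slots]
    for i, v in exceptions:
        out[i] = v
    return out
```

For the made-up block `[3, 4, 5, 6, 3, 4, 5, 6, 3, 100]`, `choose_b` returns 2, so nine values fit in 2-bit slots and only 100 becomes an exception.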