Presentation is loading. Please wait.

Presentation is loading. Please wait.

IR IL Compression.  code for integer encoding x > 0 and Length =  log 2 x  +1 e.g., 9 represented as.   code for x takes 2  log 2 x  +1 bits (ie.

Similar presentations


Presentation on theme: "IR IL Compression.  code for integer encoding x > 0 and Length =  log 2 x  +1 e.g., 9 represented as.   code for x takes 2  log 2 x  +1 bits (ie."— Presentation transcript:

1 IR IL Compression

2  code for integer encoding x > 0 and Length =  log 2 x  +1 e.g., 9 represented as.   code for x takes 2  log 2 x  +1 bits (ie. factor of 2 from optimal) Length-1 Optimal for Pr(x) = 1/2x 2, and i.i.d integers

3 It is a prefix-free encoding… Given the following sequence of  coded integers, reconstruct the original sequence: 0001000001100110000011101100111 8 63 597

4  code for integer encoding Use  -coding to reduce the length of the first field Useful for medium-sized integers e.g., 19 represented as.  coding x takes about  log 2 x  + 2  log 2 (  log 2 x  )  + 2 bits. Optimal for Pr(x) = 1/2x(log x) 2, and i.i.d integers

5 Variable-byte  codes [10.2 bits per TREC12] Wish to get very fast (de)compress  byte-align Given a binary representation of an integer Append 0s to front, to get a multiple-of-7 number of bits Form groups of 7-bits each Append to the last group the bit 0, and to the other groups the bit 1 (tagging) e.g., v=2 14 +1  binary(v) = 100000000000001 10000001 10000000 00000001 Note: We waste 1 bit per byte, and avg 4 for the first byte. But it is a prefix code, and encodes also the value 0 !!

6 Rice code (simplification of Golomb code) It is a parametric code: depends on k Quotient q=  (v-1)/k , and the rest is r= v – k * q – 1 Useful when integers concentrated around k How do we choose k ? Usually k  0.69 * mean(v) [Bernoulli model] Optimal for Pr(v) = p (1-p) v-1, where mean(v)=1/p, and i.i.d ints [q times 0s] 1 Log k bits

7 Interpolative coding  = 1 1 1 2 2 2 2 4 3 1 1 1 M = 1 2 3 5 7 9 11 15 18 19 20 21 Recursive coding  preorder traversal of a balanced binary tree At every step we know (initially, they are encoded): num = |M| = 12, Lidx=1, low = 1, Ridx=12, hi = 21 Take the middle element: h= (Lidx+Ridx)/2=6  M[6]=9, num_left= h – Lidx = 5, num_right= Ridx-h = 6 low + left_size =1+5 = 6 ≤ M[h] ≤ hi – right_size = (21 – 6) = 15 We can encode 9-6=3 in log 2 (15-6+1) = 4 bits lo=1, hi=9-1=8, num=5 lo=9+1=10, hi=21, num=6

8 PForDelta coding 1011 …01 11 0142231110 233…11332313422 a block of 128 numbers Use b (e.g. 2) bits to encode 128 numbers or create exceptions Encode exceptions: ESC or pointers Choose b to encode 90% values, or trade-off: b  waste more bits, b  more exceptions Translate data: [base, base + 2 b -1]  [0,2 b -1]


Download ppt "IR IL Compression.  code for integer encoding x > 0 and Length =  log 2 x  +1 e.g., 9 represented as.   code for x takes 2  log 2 x  +1 bits (ie."

Similar presentations


Ads by Google