Download presentation
Presentation is loading. Please wait.
Published byAdam Cummings Modified over 9 years ago
1
Basics of Data Compression Paolo Ferragina Dipartimento di Informatica Università di Pisa
2
Uniquely Decodable Codes A variable length code assigns a bit string (codeword) of variable length to every symbol e.g. a = 1, b = 01, c = 101, d = 011 What if you get the sequence 1011 ? A uniquely decodable code can always be uniquely decomposed into their codewords.
3
Prefix Codes A prefix code is a variable length code in which no codeword is a prefix of another one e.g a = 0, b = 100, c = 101, d = 11 Can be viewed as a binary trie 0 1 a bc d 0 01 1
4
Average Length For a code C with codeword length L[s], the average length is defined as p(A) =.7 [0], p(B) = p(C) = p(D) =.1 [1--] L a =.7 * 1 +.3 * 3 = 1.6 bit (Huffman achieves 1.5 bit) We say that a prefix code C is optimal if for all prefix codes C’, L a (C) L a (C’)
5
Entropy (Shannon, 1948) For a source S emitting symbols with probability p(s), the self information of s is: bits Lower probability higher information Entropy is the weighted average of i(s) 0-th order empirical entropy of string T i(s) 0 <= H <= log | | H -> 0, skewed distribution H max, uniform distribution
6
Performance: Compression ratio Compression ratio = #bits in output / #bits in input Compression performance: We relate entropy against compression ratio. p(A) =.7, p(B) = p(C) = p(D) =.1 H ≈ 1.36 bits Huffman ≈ 1.5 bits per symb Shannon In practice Avg cw length Empirical H vs Compression ratio An optimal code is surely one that…
7
Index construction: Compression of postings Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 5.3 and a paper
8
code for integer encoding x > 0 and Length = log 2 x +1 e.g., 9 represented as. code for x takes 2 log 2 x +1 bits (ie. factor of 2 from optimal) Length-1 Optimal for Pr(x) = 1/2x 2, and i.i.d integers
9
It is a prefix-free encoding… Given the following sequence of coded integers, reconstruct the original sequence: 0001000001100110000011101100111 8 63 597
10
code for integer encoding Use -coding to reduce the length of the first field Useful for medium-sized integers e.g., 19 represented as. coding x takes about log 2 x + 2 log 2 ( log 2 x ) + 2 bits. Optimal for Pr(x) = 1/2x(log x) 2, and i.i.d integers
11
Variable-byte codes [10.2 bits per TREC12] Wish to get very fast (de)compress byte-align Given a binary representation of an integer Append 0s to front, to get a multiple-of-7 number of bits Form groups of 7-bits each Append to the last group the bit 0, and to the other groups the bit 1 (tagging) e.g., v=2 14 +1 binary(v) = 100000000000001 10000001 10000000 00000001 Note: We waste 1 bit per byte, and avg 4 for the first byte. But it is a prefix code, and encodes also the value 0 !!
12
PForDelta coding 1011 …01 11 0142231110 233…11332313422 a block of 128 numbers = 256 bits = 32 bytes Use b (e.g. 2) bits to encode 128 numbers or create exceptions Encode exceptions: ESC or pointers Choose b to encode 90% values, or trade-off: b waste more bits, b more exceptions Translate data: [base, base + 2 b -1] [0,2 b -1]
13
A basic problem ! Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#.... T Array of pointers (log m) bits per string, (n log m) bits overall. We could drop the separating NULL Independent of string-length distribution It is effective for few strings It is bad for medium/large sets of strings Space = 32 * n bits
14
A basic problem ! 10000100000100100010010000001000010000.... B Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#.... T 10#2#5#6#20#31#3#3#.... T 1010101011101010111111111.... X AbacoBattleCarColdCodDefenseGoogleYahoo.... X 1000101001001000100001010.... B We could drop msb We aim at achieving ≈n log(m/n) bits
15
Rank/Select 00101001010101011111110000011010101.... B Rank b (i) = number of b in B[1,i] Select b (i) = position of the i-th b in B Rank 1 (6) = 2 Select 1 (3) = 8 m = |B| n = #1 Do exist data structures that solve this problem in O(1) query time and n log(m/n) + o(m) bits of space
16
z = 3, w=2 Elias-Fano useful for Rank/Select If w = log (m/n) and z = log n, where m = |B| and n = #1 then - L takes n log (m/n) bits - H takes 2n bits 0 1 2 3 4 5 6 7 In unary Actually you can do binary search over B, but compressed ! Select 1 on B uses L and Select 1 on H taking +o(n) space
17
If you wish to play with Rank and Select A next lecture… m/10 + n log m/n Rank in 0.4 sec, Select in < 1 sec For binary search cfr 2n + n log (m/n) only select vs 32n bits of explicit pointers
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.