Basics of Data Compression Paolo Ferragina Dipartimento di Informatica Università di Pisa
Uniquely Decodable Codes
A variable-length code assigns a bit string (codeword) of variable length to every symbol, e.g. a = 1, b = 01, c = 101, d = 011. What if you get the sequence 1011? It could be a·d (1·011) or c·a (101·1). A code is uniquely decodable if every coded sequence can be decomposed into codewords in exactly one way.
Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one, e.g. a = 0, b = 100, c = 101, d = 11. It can be viewed as a binary trie whose leaves are the symbols: branch 0 to the left, 1 to the right, so each codeword is a root-to-leaf path.
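To make the trie view concrete, here is a minimal Python sketch (the function name decode and the dictionary-based lookup are illustrative choices, not from the slides) for the code above: since no codeword prefixes another, scanning left to right and emitting on the first match is unambiguous.

```python
# Decoding a prefix code: emit a symbol as soon as the accumulated
# bits match a codeword; unique decodability guarantees correctness.
CODE = {"0": "a", "100": "b", "101": "c", "11": "d"}  # the slide's example code

def decode(bits):
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in CODE:            # a leaf: no codeword extends another
            out.append(CODE[buf])
            buf = ""
    assert buf == "", "input ended in the middle of a codeword"
    return "".join(out)

print(decode("010011"))  # -> 'abd'  (0 | 100 | 11)
```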
Average Length
For a code C with codeword lengths L[s], the average length is defined as L_a(C) = Σ_s p(s) · L[s]. Example: p(A) = .7 coded as [0], p(B) = p(C) = p(D) = .1 coded with 3 bits each [1--], so L_a = .7 * 1 + .3 * 3 = 1.6 bits (Huffman achieves 1.5 bits). We say that a prefix code C is optimal if L_a(C) ≤ L_a(C') for all prefix codes C'.
Entropy (Shannon, 1948)
For a source S emitting symbols with probability p(s), the self-information of s is i(s) = log2(1/p(s)) bits: lower probability means higher information. Entropy is the weighted average of i(s): H(S) = Σ_s p(s) · log2(1/p(s)). The 0-th order empirical entropy of a string T, H0(T), uses the empirical frequencies occ(s)/|T| in place of p(s). It holds 0 ≤ H ≤ log2 |Σ|: H → 0 for a skewed distribution, and H is maximal (= log2 |Σ|) for the uniform one.
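As a small illustration, a Python sketch of the 0-th order empirical entropy (the function name h0 is ours):

```python
from collections import Counter
from math import log2

# 0-th order empirical entropy: frequencies occ(s)/n play the role of p(s).
def h0(text):
    n = len(text)
    return sum((c / n) * log2(n / c) for c in Counter(text).values())

print(h0("aaaaaaaab"))  # ~0.503: skewed, well below the 1-bit max for 2 symbols
print(h0("abcdabcd"))   # 2.0 = log2(4): uniform over 4 symbols is the maximum
```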
Performance: Compression ratio
Compression ratio = #bits in output / #bits in input. To evaluate compression performance we relate the empirical entropy to the compression ratio: Shannon's bound vs. the average codeword length achieved in practice. Example: p(A) = .7, p(B) = p(C) = p(D) = .1 gives H ≈ 1.36 bits, while Huffman achieves ≈ 1.5 bits per symbol. An optimal code is one whose average codeword length approaches the entropy.
Index construction: Compression of postings Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 5.3 and a paper
γ code for integer encoding
For x > 0, let Length = ⌊log2 x⌋ + 1 be the number of bits of binary(x). γ(x) is the unary encoding of Length − 1 (that many 0s) followed by binary(x); e.g., 9 is represented as ⟨000, 1001⟩. The γ code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from the optimal log2 x). It is optimal for Pr(x) = 1/(2x²) and i.i.d. integers.
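A minimal Python sketch of the γ code as defined above (function names are ours):

```python
# Elias' gamma code: Length-1 zeros, then binary(x), which starts with 1.
def gamma_encode(x):
    assert x > 0
    b = bin(x)[2:]                  # binary(x); Length = len(b)
    return "0" * (len(b) - 1) + b   # unary(Length - 1) + binary(x)

def gamma_decode(bits, i=0):
    """Decode one integer starting at position i; return (value, next position)."""
    j = i
    while bits[j] == "0":           # count the unary prefix
        j += 1
    zeros = j - i                   # Length - 1
    return int(bits[j:j + zeros + 1], 2), j + zeros + 1

print(gamma_encode(9))              # '0001001': 2*floor(log2 9) + 1 = 7 bits
print(gamma_decode("0001001"))      # (9, 7)
```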
It is a prefix-free encoding… Given the following sequence of γ-coded integers, reconstruct the original sequence: 0001000001100110000011101100111 → 8, 6, 3, 59, 7
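Reusing gamma_decode from the sketch above, prefix-freeness lets us peel integers off the bit string one at a time and check the exercise:

```python
# Decode the exercise's bit string into its sequence of integers.
bits = "0001000001100110000011101100111"
i, values = 0, []
while i < len(bits):
    v, i = gamma_decode(bits, i)   # each call consumes exactly one codeword
    values.append(v)
print(values)  # [8, 6, 3, 59, 7]
```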
δ code for integer encoding
Use γ-coding to reduce the length of the first field: δ(x) is γ(Length) followed by binary(x) without its leading 1. Useful for medium-sized integers; e.g., 19 is represented as ⟨00101, 0011⟩. δ-coding x takes about log2 x + 2 log2(log2 x) + 2 bits. It is optimal for Pr(x) = 1/(2x (log x)²) and i.i.d. integers.
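A sketch of the δ code, reusing gamma_encode from the earlier sketch:

```python
# Delta code: gamma-code the length field, then the bits of x after its leading 1.
def delta_encode(x):
    assert x > 0
    b = bin(x)[2:]                       # binary(x); Length = len(b)
    return gamma_encode(len(b)) + b[1:]  # gamma(Length) + x without its leading 1

print(delta_encode(19))  # '001010011' = gamma(5)='00101' followed by '0011'
```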
Variable-byte codes [10.2 bits per TREC12]
Goal: very fast (de)compression via byte alignment. Given the binary representation of an integer: prepend 0s to reach a multiple-of-7 number of bits; form groups of 7 bits each; append the tag bit 0 to the last group and the tag bit 1 to every other group. E.g., v = 2^14 + 1, binary(v) = 100000000000001 → 10000001 10000000 00000001. Note: we waste 1 bit per byte, and on average 4 in the first byte; but it is a prefix code, and it also encodes the value 0!
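A Python sketch following the slide's tagging convention (continuation bit 1 on all bytes except the last; note that some real implementations tag the last byte instead):

```python
# Variable-byte coding: 7 payload bits per byte plus 1 tag bit.
def vb_encode(v):
    groups = []
    while True:
        groups.append(v & 0x7F)   # low 7 bits
        v >>= 7
        if v == 0:
            break
    groups.reverse()
    # tag bit 1 on every group but the last, which gets 0
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])

def vb_decode(data):
    out, v = [], 0
    for byte in data:
        v = (v << 7) | (byte & 0x7F)
        if not (byte & 0x80):     # tag bit 0: last byte of this integer
            out.append(v)
            v = 0
    return out

enc = vb_encode(2**14 + 1)
print([f"{b:08b}" for b in enc])  # ['10000001', '10000000', '00000001']
print(vb_decode(enc))             # [16385]
```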
PForDelta coding
A block of 128 numbers is encoded with b bits each (e.g., b = 2 gives 128 numbers in 256 bits = 32 bytes); values that do not fit in b bits become exceptions. Exceptions are encoded separately (escape values or pointers to them). Choose b so that ~90% of the values are encoded directly, or trade off: a larger b wastes more bits per value, a smaller b produces more exceptions. Translate the data from [base, base + 2^b − 1] to [0, 2^b − 1].
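A deliberately simplified sketch of the idea (no bit packing, invented function name; real PForDelta stores the slots in b-bit fields and chains exceptions):

```python
# Split a block into b-bit-encodable slots and a patch list of exceptions.
def pfor_split(block, b):
    limit = 1 << b
    slots, exceptions = [], []
    for pos, v in enumerate(block):
        if v < limit:
            slots.append(v)              # fits in b bits
        else:
            slots.append(0)              # placeholder in the slot array
            exceptions.append((pos, v))  # patched back in at decode time
    return slots, exceptions

block = [2, 3, 3, 1, 1, 3, 3, 2, 3, 1, 3, 422, 2, 3, 1, 3]
slots, exc = pfor_split(block, b=2)
print(exc)  # [(11, 422)]: the one value that does not fit in 2 bits
```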
Index construction: Compression of documents Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading Managing-Gigabytes: pg 21-36, 52-56, 74-79
Raw docs are needed
Various Approaches
Statistical coding: Huffman codes, Arithmetic codes. Dictionary coding: LZ77, LZ78, LZSS, … (gzip, zippy, snappy, …). Text transforms: Burrows-Wheeler Transform (bzip).
Document Compression Huffman coding
Huffman Codes
Invented by Huffman as a class assignment in the '50s. Used in most compression algorithms: gzip, bzip, jpeg (as an option), fax compression, … Properties: it generates optimal prefix codes and is cheap to encode and decode; L_a(Huff) = H if the probabilities are powers of 2, otherwise L_a(Huff) < H + 1, i.e., less than 1 extra bit per symbol on average!
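A minimal sketch of the construction (heap-based; function names and the tie-breaking counter are our choices):

```python
import heapq
from itertools import count

# Huffman's algorithm: repeatedly merge the two least probable trees;
# label left branches 0 and right branches 1.
def huffman(probs):
    tie = count()  # breaks ties so tuples never compare tree nodes
    heap = [(p, next(tie), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tie), (t1, t2)))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):          # internal node: recurse
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                                # leaf: full root-to-leaf path
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes

print(huffman({"a": .1, "b": .2, "c": .2, "d": .5}))
# {'d': '0', 'c': '10', 'a': '110', 'b': '111'} -- same codeword lengths
# (1, 2, 3, 3) as the slide's a=000, b=001, c=01, d=1; ties yield many
# "equivalent" trees.
```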
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5. Merge the two least probable trees repeatedly: a(.1) + b(.2) → (.3); (.3) + c(.2) → (.5); (.5) + d(.5) → (1). Resulting codes: a = 000, b = 001, c = 01, d = 1. There are 2^(n−1) "equivalent" Huffman trees (swap the children of any internal node). What about ties (and thus, tree depth)?
Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded. Decoding: start at the root and take the branch for each bit received; when at a leaf, output its symbol and return to the root. With the codes above: abc… → 00000101, and 101001… decodes to dcb…
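The same walk in code, using the slide's codes a = 000, b = 001, c = 01, d = 1 (function names are ours):

```python
# Encoding emits each symbol's root-to-leaf path; decoding scans bits and
# restarts at the root after each emitted symbol.
CODES = {"a": "000", "b": "001", "c": "01", "d": "1"}

def encode(text):
    return "".join(CODES[s] for s in text)

def decode(bits):
    inv, out, buf = {v: k for k, v in CODES.items()}, [], ""
    for bit in bits:
        buf += bit
        if buf in inv:     # reached a leaf: emit and return to the root
            out.append(inv[buf])
            buf = ""
    return "".join(out)

print(encode("abc"))     # 00000101, as on the slide
print(decode("101001"))  # dcb, as on the slide
```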
Huffman in practice
The compressed file for n symbols consists of: Preamble: tree encoding + symbols in the leaves, Θ(|Σ| log |Σ|) bits. Body: the compressed text, at least nH and at most nH + n bits. The extra +n is bad for very skewed distributions, namely ones for which H → 0. Example: p(a) = 1/n, p(b) = (n−1)/n.
There are better choices
T = aaaaaaaaab. Huffman = {a,b}-encoding + 10 bits. RLE = γ(9) + γ(1) + {a,b}-encoding = 0001001 · 1 + {a,b}-encoding, i.e., 8 bits. So RLE saves 2 bits over Huffman, because it is not a prefix code: unlike Huffman, it does not map each symbol to a fixed bit string; the mapping may change and, moreover, it effectively uses fractions of bits.
Idea on Huffman?
Goal: reduce the impact of the +1 bit. Solution: divide the text into blocks of k symbols; the +1 is spread over k symbols, so the loss is 1/k per symbol. Caution: the alphabet becomes Σ^k, so the preamble gets larger. At the limit, the preamble is 1 block equal to the input text, and the compressed text is 1 bit only: no compression!
Document Compression Arithmetic coding
Introduction
It uses "fractional" parts of bits!! It gets nH(T) + 2 bits vs. the nH(T) + n of Huffman. Used in JPEG/MPEG (as an option) and in bzip. More time-costly than Huffman, but an integer implementation is not too bad. That is the ideal performance; in practice the overhead is about 0.02 · n bits.
Symbol interval
Assign each symbol an interval within [0, 1) whose width equals its probability: with p(a) = .2, p(b) = .5, p(c) = .3 we get cum[a] = 0, cum[b] = p(a) = .2, cum[c] = p(a) + p(b) = .7, and symbol s gets [cum[s], cum[s] + p(s)). The interval for a particular symbol is called the symbol interval (e.g., for b it is [.2, .7)).
Sequence interval
Coding the message sequence bac, with p(a) = .2, p(b) = .5, p(c) = .3. Start from [0, 1), split as a → [0, .2), b → [.2, .7), c → [.7, 1). After b the interval is [.2, .7), of size .5; splitting it gives a → [.2, .3) (size .5·.2 = .1), b → [.3, .55) (size .5·.5 = .25), c → [.55, .7) (size .5·.3 = .15). After a the interval is [.2, .3); splitting it gives a → [.2, .22) (size .02), b → [.22, .27) (size .05), c → [.27, .3) (size .03). After c, the final sequence interval is [.27, .3).
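The narrowing in code, as a minimal sketch for the slide's model (function name is ours):

```python
# Interval narrowing for arithmetic coding: shift into the symbol's
# sub-interval, then shrink by the symbol's probability.
P   = {"a": .2, "b": .5, "c": .3}
CUM = {"a": .0, "b": .2, "c": .7}

def sequence_interval(text):
    l, s = 0.0, 1.0              # start from [0, 1)
    for ch in text:
        l = l + s * CUM[ch]      # shift to the symbol's sub-interval
        s = s * P[ch]            # shrink by the symbol's probability
    return l, l + s

print(sequence_interval("bac"))  # ~(0.27, 0.3): the slide's [.27, .3)
```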
The algorithm
To code a sequence T = T_1 … T_n of symbols with probabilities p(s), keep an interval [l, l + s), starting from l_0 = 0, s_0 = 1, and for each symbol update: l_i = l_{i−1} + s_{i−1} · cum[T_i] and s_i = s_{i−1} · p[T_i]. For bac with p(a) = .2, p(b) = .5, p(c) = .3 this ends at [0.27, 0.3).
The algorithm
Each symbol narrows the interval by a factor p[T_i], so the final interval size is s_n = ∏_{i=1..n} p(T_i). The sequence interval is [l_n, l_n + s_n); to encode, take a number inside it.
Decoding Example
Decoding the number .49, knowing the message is of length 3. Start with a → [0, .2), b → [.2, .7), c → [.7, 1): .49 ∈ [.2, .7) → b. Split [.2, .7) into a → [.2, .3), b → [.3, .55), c → [.55, .7): .49 ∈ [.3, .55) → b. Split [.3, .55) into a → [.3, .35), b → [.35, .475), c → [.475, .55): .49 ∈ [.475, .55) → c. The message is bbc.
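The same steps as a sketch, reusing P and CUM from the earlier interval sketch (function name is ours):

```python
# Decode: find which symbol interval contains x, emit the symbol,
# then rescale x into [0, 1) and repeat (message length n is known).
def arith_decode(x, n):
    out = []
    for _ in range(n):
        for ch in "abc":                      # interval containing x
            if CUM[ch] <= x < CUM[ch] + P[ch]:
                out.append(ch)
                x = (x - CUM[ch]) / P[ch]     # rescale into [0, 1)
                break
    return "".join(out)

print(arith_decode(0.49, 3))  # 'bbc', as on the slide
```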
How do we encode that number?
If x = v/2^k (a dyadic fraction) then the encoding is bin(v) over k digits (possibly padded with 0s in front).
How do we encode that number?
Binary fractional representation, generated incrementally: FractionalEncode(x): repeat { x = 2·x; if x < 1 output 0; else output 1 and set x = x − 1 }. Example for x = 1/3: 2·(1/3) = 2/3 < 1, output 0; 2·(2/3) = 4/3 ≥ 1, output 1 and 4/3 − 1 = 1/3; so the expansion repeats: 0101…
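A runnable version of the slide's FractionalEncode, truncated to a requested number of bits (the nbits parameter is our addition, since the expansion may not terminate):

```python
# Emit the binary expansion of x in [0, 1), one bit per doubling step.
def fractional_encode(x, nbits):
    bits = []
    for _ in range(nbits):
        x *= 2
        if x < 1:
            bits.append("0")
        else:
            bits.append("1")
            x -= 1
    return "".join(bits)

print(fractional_encode(1/3, 6))  # '010101': the expansion repeats
```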
Which number do we encode?
Take x = l_n + s_n/2, the midpoint of the sequence interval, and truncate its encoding to the first d = ⌈log2(2/s_n)⌉ bits. Truncation gets a smaller number… how much smaller? By less than 2^−d ≤ s_n/2, so the truncated number is still ≥ l_n, hence still inside [l_n, l_n + s_n).
Bound on code length
Theorem: For a text T of length n, the arithmetic encoder generates at most ⌈log2(2/s_n)⌉ bits, and
⌈log2(2/s_n)⌉ < 1 + log2(2/s_n) = 2 − log2 s_n = 2 − log2(∏_{i=1..n} p(T_i)) = 2 − log2(∏_σ p(σ)^occ(σ)) = 2 − Σ_σ occ(σ) · log2 p(σ) ≈ 2 + Σ_σ (n · p(σ)) · log2(1/p(σ)) = 2 + n·H(T) bits.
Example: T = acabc, s_n = p(a)·p(c)·p(a)·p(b)·p(c) = p(a)² · p(b) · p(c)².
Document Compression Dictionary-based compressors
LZ77
Algorithm's step: output a triple ⟨dist, len, next-char⟩, then advance by len + 1. A buffer "window" of fixed length slides over the text; the dictionary consists of all substrings starting inside the window. Example text: aacaacabcaaaaaa (a sketch of the parse is shown below).
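A brute-force sketch of the parse, simplified to an unbounded window (function name is ours; real implementations bound the window and index it):

```python
# At each position, find the longest match starting in the already-seen
# prefix, output (distance, length, next-char), and advance by length + 1.
def lz77_parse(text):
    out, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(i):          # candidate match starts (unbounded window)
            l = 0
            # the match may run past i: self-overlapping copies are allowed
            while i + l < len(text) - 1 and text[j + l] == text[i + l]:
                l += 1
            if l > best_len:
                best_len, best_dist = l, i - j
        out.append((best_dist, best_len, text[i + best_len]))
        i += best_len + 1
    return out

print(lz77_parse("aacaacabcaaaaaa"))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (6, 3, 'a'), (12, 2, 'a')]
```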
LZ77 Decoding
The decoder keeps the same dictionary window as the encoder: it finds the referenced substring and appends a copy of it. What if len > dist (the copy overlaps the text still to be written)? E.g., seen = abcd, next codeword is (2, 9, e). Simply copy one character at a time starting at the cursor:
for (i = 0; i < len; i++) out[cursor + i] = out[cursor - d + i];
The output is correct: abcdcdcdcdcdce.
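The same idea as a Python sketch (function name is ours); copying left to right, one character at a time, makes the overlap case work with no special handling:

```python
# Decode a list of (distance, length, next-char) triples.
def lz77_decode(triples):
    out = []
    for dist, length, nxt in triples:
        start = len(out) - dist
        for k in range(length):          # may read characters written by
            out.append(out[start + k])   # this same copy (dist < length is fine)
        out.append(nxt)
    return "".join(out)

print(lz77_decode([(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (0, 0, 'd'),
                   (2, 9, 'e')]))        # abcdcdcdcdcdce, as on the slide
```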
LZ77 optimizations used by gzip
LZSS: output one of the two formats (0, position, length) or (1, char); typically the second format is used when length < 3. Special greedy parsing: possibly use a shorter match so that the next match is better. A hash table speeds up the search for matches on triplets. The triples are finally coded with Huffman's code.
You find this at: www.gzip.org/zlib/
Google’s solution