Index construction: Compression of documents
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading: Managing Gigabytes, pp. 21-36, 52-56, 74-79
Raw docs are needed
Various Approaches
Statistical coding: Huffman codes, Arithmetic codes
Dictionary coding: LZ77, LZ78, LZSS, … (gzip, zippy, snappy, …)
Text transforms: Burrows-Wheeler Transform (bzip)
Basics of Data Compression
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Uniquely Decodable Codes
A variable-length code assigns a bit string (codeword) of variable length to every symbol, e.g. a = 1, b = 01, c = 101, d = 011.
What if you get the sequence 1011? It can be decomposed both as 101·1 (= ca) and as 1·011 (= ad).
A uniquely decodable code is one whose encoded sequences can always be decomposed into codewords in exactly one way.
Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one, e.g. a = 0, b = 100, c = 101, d = 11.
It can be viewed as a binary trie: edges are labeled 0/1 and the symbols sit at the leaves (a at path 0, b at 100, c at 101, d at 11).
Average Length
For a code C with codeword lengths L[s], the average length is defined as L_a(C) = Σ_s p(s) · L[s].
Example: p(A) = .7 (codeword 0), p(B) = p(C) = p(D) = .1 (3-bit codewords of the form 1--), so L_a = .7 * 1 + .3 * 3 = 1.6 bits (Huffman achieves 1.5 bits).
We say that a prefix code C is optimal if L_a(C) ≤ L_a(C') for all prefix codes C'.
Entropy (Shannon, 1948)
For a source S emitting symbols with probability p(s), the self-information of s is i(s) = log2(1/p(s)) bits: the lower the probability, the higher the information.
Entropy is the weighted average of the self-information: H(S) = Σ_s p(s) · log2(1/p(s)).
The 0-th order empirical entropy of a string T is obtained by replacing p(s) with the empirical frequency occ(s)/|T|: H0(T) = Σ_s (occ(s)/|T|) · log2(|T|/occ(s)).
It holds 0 ≤ H ≤ log2 |Σ|: H → 0 for a skewed distribution, H is maximal for the uniform distribution.
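To make the definition concrete, here is a minimal C sketch (not from the slides; the function name is illustrative) computing the 0-th order empirical entropy of a byte string:

    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    /* H0(T) = sum over symbols s of (occ(s)/n) * log2(n/occ(s)) */
    double empirical_entropy(const unsigned char *T, size_t n) {
        size_t occ[256] = {0};
        for (size_t i = 0; i < n; i++) occ[T[i]]++;
        double H = 0.0;
        for (int s = 0; s < 256; s++) {
            if (occ[s] == 0) continue;
            double p = (double)occ[s] / (double)n;
            H -= p * log2(p);              /* p(s) * log2(1/p(s)) */
        }
        return H;                          /* bits per symbol */
    }

    int main(void) {
        const char *T = "aaaaaaabbc";      /* skewed distribution */
        printf("H0 = %.3f bits/symbol\n",
               empirical_entropy((const unsigned char *)T, strlen(T)));
        return 0;
    }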
Performance: Compression ratio
Compression ratio = #bits in output / #bits in input.
To evaluate compression performance, we relate the entropy to the achieved compression ratio.
Example: p(A) = .7, p(B) = p(C) = p(D) = .1; here H ≈ 1.36 bits, while Huffman uses ≈ 1.5 bits per symbol.
Shannon's theory gives the entropy as the benchmark; in practice we measure the average codeword length, i.e. we compare the empirical H against the compression ratio. An optimal code is surely one that minimizes the average codeword length.
Document Compression: Huffman coding
Huffman Codes
Invented by Huffman as a class assignment in '50. Used in many compression algorithms: gzip, bzip, jpeg (as option), fax compression, …
Properties: it generates optimal prefix codes and is cheap to encode and decode.
L_a(Huff) = H if the probabilities are powers of 2; otherwise L_a(Huff) < H + 1, i.e. less than 1 extra bit per symbol on average!
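As a sketch of how these optimal codes are built (the textbook greedy algorithm, not code from the slides), repeatedly merge the two least-probable nodes; an O(k^2) selection is used here for brevity, a heap would give O(k log k):

    #include <stdio.h>

    #define K 4                         /* alphabet size of the running example */

    int main(void) {
        /* leaves 0..K-1 hold p(a), p(b), p(c), p(d); internal nodes appended */
        double p[2*K] = { .1, .2, .2, .5 };
        int left[2*K], right[2*K], alive[2*K] = {0};
        int n = K;
        for (int i = 0; i < K; i++) { left[i] = right[i] = -1; alive[i] = 1; }

        for (int merges = 0; merges < K - 1; merges++) {
            int m1 = -1, m2 = -1;       /* indices of the two smallest weights */
            for (int i = 0; i < n; i++) {
                if (!alive[i]) continue;
                if (m1 < 0 || p[i] < p[m1]) { m2 = m1; m1 = i; }
                else if (m2 < 0 || p[i] < p[m2]) { m2 = i; }
            }
            p[n] = p[m1] + p[m2];       /* merge into a new internal node */
            left[n] = m1; right[n] = m2;
            alive[m1] = alive[m2] = 0;
            alive[n] = 1;
            n++;
        }
        /* root is node n-1; codeword lengths come out as 3,3,2,1, matching
           a=000, b=001, c=01, d=1 of the running example below            */
        printf("root weight = %.1f\n", p[n - 1]);   /* prints: 1.0 */
        return 0;
    }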
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Merge a(.1) and b(.2) into a node of weight (.3); merge it with c(.2) into a node of weight (.5); merge that with d(.5) into the root (1).
Resulting codewords: a = 000, b = 001, c = 01, d = 1.
There are 2^(n-1) "equivalent" Huffman trees (swap the two children of any internal node). What about ties (and thus, tree depth)?
Encoding and Decoding
Encoding: emit the bits on the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch for each bit received; when a leaf is reached, output its symbol and return to the root.
Example (tree above): abc → 000 001 01 = 00000101; conversely, 101001 decodes to dcb.
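A minimal decoding sketch in C for the running example's tree (the table layout is an assumption made for illustration):

    #include <stdio.h>

    int main(void) {
        /* tree for a=000, b=001, c=01, d=1                               */
        /* child[node][bit] = next node; negative values mark leaf symbols */
        int child[3][2] = {
            {  1,   -'d' },   /* root: 1 -> d          */
            {  2,   -'c' },   /* after 0: 01 -> c      */
            { -'a', -'b' }    /* after 00: 000/001     */
        };
        const char *bits = "00000101";          /* encodes "abc" */
        int node = 0;
        for (const char *b = bits; *b; b++) {
            int next = child[node][*b - '0'];
            if (next < 0) { putchar(-next); node = 0; }  /* leaf: emit, restart */
            else node = next;
        }
        putchar('\n');                          /* prints: abc */
        return 0;
    }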
Huffman in practice
The compressed file of n symbols consists of:
Preamble: an encoding of the tree plus the symbols in its leaves, Θ(|Σ| log |Σ|) bits.
Body: the compressed text of n symbols, at least nH bits and at most nH + n bits.
The extra +n is bad for very skewed distributions, namely ones for which H → 0. Example: p(a) = 1/n, p(b) = (n-1)/n.
There are better choices
T = aaaaaaaaab
Huffman = {a,b}-encoding + 10 bits
RLE = (9) + (1) + {a,b}-encoding = 0001001 + 1 + {a,b}-encoding, i.e. 8 bits for the two run lengths.
So RLE saves 2 bits over Huffman, because it is not a prefix code: it does not map each symbol to a fixed bit string, as Huffman does; the mapping may change along the text and, moreover, it can spend fractions of a bit per symbol. Fax, bzip, … use RLE.
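The run-length bits above, 0001001 and 1, match the Elias γ-codes of 9 and 1 (the slide does not name the integer code, so γ is an assumption consistent with those bits). A minimal C sketch:

    #include <stdio.h>

    void gamma_code(unsigned v) {       /* requires v >= 1 */
        int len = 0;
        for (unsigned t = v; t > 1; t >>= 1) len++;   /* len = floor(log2 v) */
        for (int i = 0; i < len; i++) putchar('0');   /* len zeros           */
        for (int i = len; i >= 0; i--)                /* v in binary, MSB first */
            putchar('0' + ((v >> i) & 1));
    }

    int main(void) {
        gamma_code(9); putchar(' '); gamma_code(1); putchar('\n');
        /* prints: 0001001 1 */
        return 0;
    }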
Idea on Huffman?
Goal: reduce the impact of the +1 bit.
Solution: divide the text into blocks of k symbols and Huffman-code the blocks. The +1 bit is spread over k symbols, so the loss is 1/k bits per symbol.
Caution: the alphabet becomes the set of k-grams, of size |Σ|^k, the tree gets larger, and so does the preamble. At the limit, the preamble contains one k-gram equal to the whole input text, and the compressed body is 1 bit only. This means no compression at all!
Document Compression: Arithmetic coding
Introduction
Arithmetic coding uses "fractional" parts of bits!
It gets < nH(T) + 2 bits, vs. < nH(T) + n for Huffman.
Used in JPEG/MPEG (as option) and in bzip. More time-costly than Huffman, but an integer implementation is not too bad.
Its performance is essentially ideal: in practice the overhead is about 0.02·n bits, i.e. 0.02 bits per symbol.
Symbol interval
Assign each symbol an interval within [0, 1) (0 inclusive, 1 exclusive) of length equal to its probability. E.g. with p(a) = .2, p(b) = .5, p(c) = .3:
cum[a] = 0,  cum[b] = p(a) = .2,  cum[c] = p(a) + p(b) = .7
The interval [cum[s], cum[s] + p(s)) for a particular symbol s will be called the symbol interval (e.g. for b it is [.2, .7)).
Sequence interval
Coding the message sequence bac, with p(a) = .2, p(b) = .5, p(c) = .3:
- Start from [0, 1), split into a: [0, .2), b: [.2, .7), c: [.7, 1). Symbol b selects [.2, .7), of size .5.
- Split [.2, .7) proportionally: a: [.2, .3) (size (0.7-0.2)*0.2 = 0.1), b: [.3, .55) (size (0.7-0.2)*0.5 = 0.25), c: [.55, .7) (size (0.7-0.2)*0.3 = 0.15). Symbol a selects [.2, .3).
- Split [.2, .3): a: [.2, .22) (size (0.3-0.2)*0.2 = 0.02), b: [.22, .27) (size (0.3-0.2)*0.5 = 0.05), c: [.27, .3) (size (0.3-0.2)*0.3 = 0.03). Symbol c selects [.27, .3).
The final sequence interval is [.27, .3).
The algorithm
To code a sequence of symbols T_1 … T_n with probabilities p(T_i), repeatedly narrow the current interval, starting from [0, 1). E.g. with p(a) = .2, p(b) = .5, p(c) = .3, coding bac narrows [0, 1) to [0.2, 0.7), then to [0.2, 0.3), then to [0.27, 0.3).
The algorithm
Each symbol narrows the interval by a factor p(T_i): starting from l_0 = 0, s_0 = 1, set
l_i = l_{i-1} + s_{i-1} · cum[T_i]
s_i = s_{i-1} · p(T_i)
The final interval size is s_n = ∏_{i=1..n} p(T_i), and the sequence interval is [l_n, l_n + s_n). The encoder then takes a number inside it.
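A minimal C sketch of this interval narrowing for bac (floating point is used only for illustration; production coders use integer arithmetic):

    #include <stdio.h>

    int main(void) {
        const double p[3]   = { .2, .5, .3 };   /* p(a), p(b), p(c)       */
        const double cum[3] = { .0, .2, .7 };   /* cum[a], cum[b], cum[c] */
        const char *T = "bac";
        double l = 0.0, s = 1.0;                /* current interval [l, l+s) */
        for (const char *c = T; *c; c++) {
            int i = *c - 'a';
            l = l + s * cum[i];     /* l_i = l_{i-1} + s_{i-1} * cum[T_i] */
            s = s * p[i];           /* s_i = s_{i-1} * p(T_i)             */
            printf("%c -> [%.2f, %.2f)\n", *c, l, l + s);
        }
        /* last line printed: c -> [0.27, 0.30) */
        return 0;
    }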
Decoding Example
Decoding the number .49, knowing the message has length 3 (p(a) = .2, p(b) = .5, p(c) = .3):
- .49 lies in b's interval [.2, .7) → output b; split [.2, .7) into a: [.2, .3), b: [.3, .55), c: [.55, .7)
- .49 lies in [.3, .55) → output b; split [.3, .55) into a: [.3, .35), b: [.35, .475), c: [.475, .55)
- .49 lies in [.475, .55) → output c
The message is bbc.
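The symmetric decoding step can be sketched as follows (again floating point, for illustration only): find the symbol interval containing x, output the symbol, and rescale x into that interval.

    #include <stdio.h>

    int main(void) {
        const double p[3]   = { .2, .5, .3 };
        const double cum[3] = { .0, .2, .7 };
        double x = 0.49;
        for (int step = 0; step < 3; step++) {  /* message length known: 3 */
            int i = 2;
            while (x < cum[i]) i--;             /* symbol interval holding x */
            putchar('a' + i);
            x = (x - cum[i]) / p[i];            /* rescale x into [0, 1) */
        }
        putchar('\n');                          /* prints: bbc */
        return 0;
    }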
How do we encode that number?
If x = v/2^k (a dyadic fraction), then its encoding is bin(v) written over k digits (possibly padded with 0s in front). E.g. x = 3/8 is encoded as 011.
How do we encode that number?
Binary fractional representation, generated incrementally:
FractionalEncode(x):
  repeat:
    x = 2 * x
    if x < 1: output 0
    else: output 1; x = x - 1
Example, x = 1/3: 2 * (1/3) = 2/3 < 1, output 0; 2 * (2/3) = 4/3 ≥ 1, output 1 and set x = 4/3 - 1 = 1/3; and so on, giving 0101…
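The same procedure in C, truncated to a fixed number of output bits (a sketch, not the slides' code):

    #include <stdio.h>

    /* emit the first `bits` binary digits of x in [0, 1) */
    void fractional_encode(double x, int bits) {
        for (int i = 0; i < bits; i++) {
            x = 2 * x;
            if (x < 1) putchar('0');
            else { putchar('1'); x = x - 1; }
        }
    }

    int main(void) {
        fractional_encode(1.0 / 3.0, 8);    /* prints: 01010101 */
        putchar('\n');
        return 0;
    }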
Which number do we encode?
Take the midpoint l_n + s_n/2 of the sequence interval and truncate its binary expansion to the first d = ⌈log2(2/s_n)⌉ bits.
Truncation yields a smaller number… how much smaller? At most 2^(-d) ≤ s_n/2, so the truncated number is still ≥ l_n and thus still lies inside [l_n, l_n + s_n).
Bound on code length
Theorem: for a text T of length n, the arithmetic encoder generates at most
⌈log2(2/s_n)⌉ < 1 + log2(2/s_n) = 1 + (1 - log2 s_n)
  = 2 - log2( ∏_{i=1..n} p(T_i) )
  = 2 - log2( ∏_σ p(σ)^{occ(σ)} )
  = 2 - Σ_σ occ(σ) · log2 p(σ)
  ≈ 2 + Σ_σ (n · p(σ)) · log2(1/p(σ))   [since occ(σ) ≈ n · p(σ)]
  = 2 + n H(T) bits.
Example: T = acabc, so s_n = p(a) · p(c) · p(a) · p(b) · p(c) = p(a)^2 · p(b) · p(c)^2.
Document Compression: Dictionary-based compressors
LZ77
Algorithm's step: output a triple ⟨dist, len, next-char⟩, where ⟨dist, len⟩ locates the longest match of the upcoming text inside the dictionary and next-char is the first mismatching symbol; then advance by len + 1.
The dictionary is a buffer "window" of fixed length that moves along the text: it consists of all substrings starting inside the window.
Example on aacaacabcaaaaaa: the first steps emit ⟨0, 0, a⟩ (no match, literal a), then ⟨1, 1, c⟩ (copy a, then c), then ⟨3, 4, b⟩ (copy aaca, then b), and so on.
LZ77 Decoding
The decoder keeps the same dictionary window as the encoder: for each triple it finds the referenced substring and inserts a copy of it, followed by the explicit char.
What if len > dist (the copy overlaps the text still to be decompressed)? E.g. seen = abcd, next codeword is (2, 9, e). Simply copy byte by byte, left to right, starting at the cursor:
for (i = 0; i < len; i++) out[cursor + i] = out[cursor - dist + i];
The output is correct: abcdcdcdcdcdce.
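Putting the copy rule into a self-contained C sketch (the Triple layout is an assumption made for illustration):

    #include <stdio.h>

    typedef struct { int dist, len; char ch; } Triple;

    int main(void) {
        char out[64];
        int cursor = 4;
        /* assume "abcd" was already decoded; next codeword is (2, 9, 'e') */
        out[0] = 'a'; out[1] = 'b'; out[2] = 'c'; out[3] = 'd';
        Triple t = { 2, 9, 'e' };

        for (int i = 0; i < t.len; i++)     /* left-to-right, byte-by-byte:  */
            out[cursor + i] = out[cursor - t.dist + i];  /* source may overlap */
        cursor += t.len;
        out[cursor++] = t.ch;
        out[cursor] = '\0';

        printf("%s\n", out);                /* prints: abcdcdcdcdcdce */
        return 0;
    }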
LZ77 Optimizations used by gzip
LZSS: output one of two token formats, (0, position, length) or (1, char); typically the second format is used when the match length is < 3.
Special greedy parsing ("lazy matching"): possibly settle for a shorter match now so that the next match is longer.
A hash table over character triplets speeds up the search for matches.
The emitted tokens are finally coded with Huffman's code.
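The two token formats can be pictured as a tagged record; the field widths below are illustrative assumptions, not gzip's actual on-disk layout (gzip additionally Huffman-codes these fields):

    #include <stdio.h>

    typedef struct {
        unsigned flag;                 /* 0 = copy token, 1 = literal */
        union {
            struct { unsigned pos, len; } match;   /* (0, position, length) */
            unsigned char ch;                      /* (1, char)             */
        } u;
    } LzssToken;

    int main(void) {
        LzssToken lit, cpy;
        lit.flag = 1; lit.u.ch = 'a';
        cpy.flag = 0; cpy.u.match.pos = 42; cpy.u.match.len = 5;
        printf("literal '%c'; copy (pos=%u, len=%u)\n",
               lit.u.ch, cpy.u.match.pos, cpy.u.match.len);
        return 0;
    }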
You can find this at: www.gzip.org/zlib/
Google's solution: Snappy (a.k.a. zippy, mentioned among the approaches above)