Index construction: Compression of documents
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading: Managing Gigabytes, pp. 21-36, 52-56, 74-79
Raw docs are needed
Various Approaches
Statistical coding: Huffman codes, Arithmetic codes
Dictionary coding: LZ77, LZ78, LZSS, … (gzip, zippy, snappy, …)
Text transforms: Burrows-Wheeler Transform (bzip)
Basics of Data Compression
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Uniquely Decodable Codes
A variable-length code assigns a bit string (codeword) of variable length to every symbol, e.g. a = 1, b = 01, c = 101, d = 011.
What if you get the sequence 1011? It can be decomposed both as 101·1 (= ca) and as 1·011 (= ad).
A uniquely decodable code is one whose encoded sequences can always be decomposed into codewords in exactly one way.
Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one, e.g. a = 0, b = 100, c = 101, d = 11.
It can be viewed as a binary trie: edges are labeled 0/1 and the symbols sit at the leaves (a at path 0, b at 100, c at 101, d at 11).
Average Length
For a code C with codeword lengths L[s], the average length is defined as L_a(C) = Σ_s p(s) · L[s].
Example: p(A) = .7 (codeword 0), p(B) = p(C) = p(D) = .1 (3-bit codewords of the form 1--), so L_a = .7 * 1 + .3 * 3 = 1.6 bits (Huffman achieves 1.5 bits).
We say that a prefix code C is optimal if L_a(C) ≤ L_a(C') for all prefix codes C'.
Entropy (Shannon, 1948)
For a source S emitting symbols with probability p(s), the self-information of s is i(s) = log2(1/p(s)) bits: the lower the probability, the higher the information.
Entropy is the weighted average of the self-information: H(S) = Σ_s p(s) · log2(1/p(s)).
The 0-th order empirical entropy of a string T is obtained by replacing p(s) with the empirical frequency occ(s)/|T|: H0(T) = Σ_s (occ(s)/|T|) · log2(|T|/occ(s)).
It holds 0 ≤ H ≤ log2 |Σ|: H → 0 for a skewed distribution, H is maximal for the uniform distribution.
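To make the definition concrete, here is a minimal C sketch (not from the slides; the function name is illustrative) computing the 0-th order empirical entropy of a byte string:

    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    /* H0(T) = sum over symbols s of (occ(s)/n) * log2(n/occ(s)) */
    double empirical_entropy(const unsigned char *T, size_t n) {
        size_t occ[256] = {0};
        for (size_t i = 0; i < n; i++) occ[T[i]]++;
        double H = 0.0;
        for (int s = 0; s < 256; s++) {
            if (occ[s] == 0) continue;
            double p = (double)occ[s] / (double)n;
            H -= p * log2(p);              /* p(s) * log2(1/p(s)) */
        }
        return H;                          /* bits per symbol */
    }

    int main(void) {
        const char *T = "aaaaaaabbc";      /* skewed distribution */
        printf("H0 = %.3f bits/symbol\n",
               empirical_entropy((const unsigned char *)T, strlen(T)));
        return 0;
    }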
Performance: Compression ratio
Compression ratio = #bits in output / #bits in input.
To evaluate compression performance, we relate the entropy to the achieved compression ratio.
Example: p(A) = .7, p(B) = p(C) = p(D) = .1; here H ≈ 1.36 bits, while Huffman uses ≈ 1.5 bits per symbol.
Shannon's theory gives the entropy as the benchmark; in practice we measure the average codeword length, i.e. we compare the empirical H against the compression ratio. An optimal code is surely one that minimizes the average codeword length.
Document Compression: Huffman coding
Huffman Codes
Invented by Huffman as a class assignment in '50. Used in many compression algorithms: gzip, bzip, jpeg (as option), fax compression, …
Properties: it generates optimal prefix codes and is cheap to encode and decode.
L_a(Huff) = H if the probabilities are powers of 2; otherwise L_a(Huff) < H + 1, i.e. less than 1 extra bit per symbol on average!
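As a sketch of how these optimal codes are built (the textbook greedy algorithm, not code from the slides), repeatedly merge the two least-probable nodes; an O(k^2) selection is used here for brevity, a heap would give O(k log k):

    #include <stdio.h>

    #define K 4                         /* alphabet size of the running example */

    int main(void) {
        /* leaves 0..K-1 hold p(a), p(b), p(c), p(d); internal nodes appended */
        double p[2*K] = { .1, .2, .2, .5 };
        int left[2*K], right[2*K], alive[2*K] = {0};
        int n = K;
        for (int i = 0; i < K; i++) { left[i] = right[i] = -1; alive[i] = 1; }

        for (int merges = 0; merges < K - 1; merges++) {
            int m1 = -1, m2 = -1;       /* indices of the two smallest weights */
            for (int i = 0; i < n; i++) {
                if (!alive[i]) continue;
                if (m1 < 0 || p[i] < p[m1]) { m2 = m1; m1 = i; }
                else if (m2 < 0 || p[i] < p[m2]) { m2 = i; }
            }
            p[n] = p[m1] + p[m2];       /* merge into a new internal node */
            left[n] = m1; right[n] = m2;
            alive[m1] = alive[m2] = 0;
            alive[n] = 1;
            n++;
        }
        /* root is node n-1; codeword lengths come out as 3,3,2,1, matching
           a=000, b=001, c=01, d=1 of the running example below            */
        printf("root weight = %.1f\n", p[n - 1]);   /* prints: 1.0 */
        return 0;
    }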
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Merge a(.1) and b(.2) into a node of weight (.3); merge it with c(.2) into a node of weight (.5); merge that with d(.5) into the root (1).
Resulting codewords: a = 000, b = 001, c = 01, d = 1.
There are 2^(n-1) "equivalent" Huffman trees (swap the two children of any internal node). What about ties (and thus, tree depth)?
Encoding and Decoding
Encoding: emit the bits on the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch for each bit received; when a leaf is reached, output its symbol and return to the root.
Example (tree above): abc → 000 001 01 = 00000101; conversely, 101001 decodes to dcb.
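A minimal decoding sketch in C for the running example's tree (the table layout is an assumption made for illustration):

    #include <stdio.h>

    int main(void) {
        /* tree for a=000, b=001, c=01, d=1                               */
        /* child[node][bit] = next node; negative values mark leaf symbols */
        int child[3][2] = {
            {  1,   -'d' },   /* root: 1 -> d          */
            {  2,   -'c' },   /* after 0: 01 -> c      */
            { -'a', -'b' }    /* after 00: 000/001     */
        };
        const char *bits = "00000101";          /* encodes "abc" */
        int node = 0;
        for (const char *b = bits; *b; b++) {
            int next = child[node][*b - '0'];
            if (next < 0) { putchar(-next); node = 0; }  /* leaf: emit, restart */
            else node = next;
        }
        putchar('\n');                          /* prints: abc */
        return 0;
    }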
Huffman in practice
The compressed file of n symbols consists of:
Preamble: an encoding of the tree plus the symbols in its leaves, Θ(|Σ| log |Σ|) bits.
Body: the compressed text of n symbols, at least nH bits and at most nH + n bits.
The extra +n is bad for very skewed distributions, namely ones for which H → 0. Example: p(a) = 1/n, p(b) = (n-1)/n.
There are better choices
T = aaaaaaaaab
Huffman = {a,b}-encoding + 10 bits
RLE = (9) + (1) + {a,b}-encoding = 0001001 + 1 + {a,b}-encoding, i.e. 8 bits for the two run lengths.
So RLE saves 2 bits over Huffman, because it is not a prefix code: it does not map each symbol to a fixed bit string, as Huffman does; the mapping may change along the text and, moreover, it can spend fractions of a bit per symbol. Fax, bzip, … use RLE.
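The run-length bits above, 0001001 and 1, match the Elias γ-codes of 9 and 1 (the slide does not name the integer code, so γ is an assumption consistent with those bits). A minimal C sketch:

    #include <stdio.h>

    void gamma_code(unsigned v) {       /* requires v >= 1 */
        int len = 0;
        for (unsigned t = v; t > 1; t >>= 1) len++;   /* len = floor(log2 v) */
        for (int i = 0; i < len; i++) putchar('0');   /* len zeros           */
        for (int i = len; i >= 0; i--)                /* v in binary, MSB first */
            putchar('0' + ((v >> i) & 1));
    }

    int main(void) {
        gamma_code(9); putchar(' '); gamma_code(1); putchar('\n');
        /* prints: 0001001 1 */
        return 0;
    }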
Idea on Huffman?
Goal: reduce the impact of the +1 bit.
Solution: divide the text into blocks of k symbols and Huffman-code the blocks. The +1 bit is spread over k symbols, so the loss is 1/k bits per symbol.
Caution: the alphabet becomes the set of k-grams, of size |Σ|^k, the tree gets larger, and so does the preamble. At the limit, the preamble contains one k-gram equal to the whole input text, and the compressed body is 1 bit only. This means no compression at all!
Document Compression: Arithmetic coding
Introduction
Arithmetic coding uses "fractional" parts of bits!
It gets < nH(T) + 2 bits, vs. < nH(T) + n for Huffman.
Used in JPEG/MPEG (as option) and in bzip. More time-costly than Huffman, but an integer implementation is not too bad.
Its performance is essentially ideal: in practice the overhead is about 0.02·n bits, i.e. 0.02 bits per symbol.
Symbol interval
Assign each symbol an interval within [0, 1) (0 inclusive, 1 exclusive) of length equal to its probability. E.g. with p(a) = .2, p(b) = .5, p(c) = .3:
cum[a] = 0,  cum[b] = p(a) = .2,  cum[c] = p(a) + p(b) = .7
The interval [cum[s], cum[s] + p(s)) for a particular symbol s will be called the symbol interval (e.g. for b it is [.2, .7)).
Sequence interval
Coding the message sequence bac, with p(a) = .2, p(b) = .5, p(c) = .3:
- Start from [0, 1), split into a: [0, .2), b: [.2, .7), c: [.7, 1). Symbol b selects [.2, .7), of size .5.
- Split [.2, .7) proportionally: a: [.2, .3) (size (0.7-0.2)*0.2 = 0.1), b: [.3, .55) (size (0.7-0.2)*0.5 = 0.25), c: [.55, .7) (size (0.7-0.2)*0.3 = 0.15). Symbol a selects [.2, .3).
- Split [.2, .3): a: [.2, .22) (size (0.3-0.2)*0.2 = 0.02), b: [.22, .27) (size (0.3-0.2)*0.5 = 0.05), c: [.27, .3) (size (0.3-0.2)*0.3 = 0.03). Symbol c selects [.27, .3).
The final sequence interval is [.27, .3).
The algorithm
To code a sequence of symbols T_1 … T_n with probabilities p(T_i), repeatedly narrow the current interval, starting from [0, 1). E.g. with p(a) = .2, p(b) = .5, p(c) = .3, coding bac narrows [0, 1) to [0.2, 0.7), then to [0.2, 0.3), then to [0.27, 0.3).
The algorithm
Each symbol narrows the interval by a factor p(T_i): starting from l_0 = 0, s_0 = 1, set
l_i = l_{i-1} + s_{i-1} · cum[T_i]
s_i = s_{i-1} · p(T_i)
The final interval size is s_n = ∏_{i=1..n} p(T_i), and the sequence interval is [l_n, l_n + s_n). The encoder then takes a number inside it.
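A minimal C sketch of this interval narrowing for bac (floating point is used only for illustration; production coders use integer arithmetic):

    #include <stdio.h>

    int main(void) {
        const double p[3]   = { .2, .5, .3 };   /* p(a), p(b), p(c)       */
        const double cum[3] = { .0, .2, .7 };   /* cum[a], cum[b], cum[c] */
        const char *T = "bac";
        double l = 0.0, s = 1.0;                /* current interval [l, l+s) */
        for (const char *c = T; *c; c++) {
            int i = *c - 'a';
            l = l + s * cum[i];     /* l_i = l_{i-1} + s_{i-1} * cum[T_i] */
            s = s * p[i];           /* s_i = s_{i-1} * p(T_i)             */
            printf("%c -> [%.2f, %.2f)\n", *c, l, l + s);
        }
        /* last line printed: c -> [0.27, 0.30) */
        return 0;
    }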
Decoding Example
Decoding the number .49, knowing the message has length 3 (p(a) = .2, p(b) = .5, p(c) = .3):
- .49 lies in b's interval [.2, .7) → output b; split [.2, .7) into a: [.2, .3), b: [.3, .55), c: [.55, .7)
- .49 lies in [.3, .55) → output b; split [.3, .55) into a: [.3, .35), b: [.35, .475), c: [.475, .55)
- .49 lies in [.475, .55) → output c
The message is bbc.
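The symmetric decoding step can be sketched as follows (again floating point, for illustration only): find the symbol interval containing x, output the symbol, and rescale x into that interval.

    #include <stdio.h>

    int main(void) {
        const double p[3]   = { .2, .5, .3 };
        const double cum[3] = { .0, .2, .7 };
        double x = 0.49;
        for (int step = 0; step < 3; step++) {  /* message length known: 3 */
            int i = 2;
            while (x < cum[i]) i--;             /* symbol interval holding x */
            putchar('a' + i);
            x = (x - cum[i]) / p[i];            /* rescale x into [0, 1) */
        }
        putchar('\n');                          /* prints: bbc */
        return 0;
    }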
How do we encode that number?
If x = v/2^k (a dyadic fraction), then its encoding is bin(v) written over k digits (possibly padded with 0s in front). E.g. x = 3/8 is encoded as 011.
How do we encode that number?
Binary fractional representation, generated incrementally:
FractionalEncode(x):
  repeat:
    x = 2 * x
    if x < 1: output 0
    else: output 1; x = x - 1
Example, x = 1/3: 2 * (1/3) = 2/3 < 1, output 0; 2 * (2/3) = 4/3 ≥ 1, output 1 and set x = 4/3 - 1 = 1/3; and so on, giving 0101…
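The same procedure in C, truncated to a fixed number of output bits (a sketch, not the slides' code):

    #include <stdio.h>

    /* emit the first `bits` binary digits of x in [0, 1) */
    void fractional_encode(double x, int bits) {
        for (int i = 0; i < bits; i++) {
            x = 2 * x;
            if (x < 1) putchar('0');
            else { putchar('1'); x = x - 1; }
        }
    }

    int main(void) {
        fractional_encode(1.0 / 3.0, 8);    /* prints: 01010101 */
        putchar('\n');
        return 0;
    }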
Which number do we encode?
Take the midpoint l_n + s_n/2 of the sequence interval and truncate its binary expansion to the first d = ⌈log2(2/s_n)⌉ bits.
Truncation yields a smaller number… how much smaller? At most 2^(-d) ≤ s_n/2, so the truncated number is still ≥ l_n and thus still lies inside [l_n, l_n + s_n).
Bound on code length
Theorem: for a text T of length n, the arithmetic encoder generates at most
⌈log2(2/s_n)⌉ < 1 + log2(2/s_n) = 1 + (1 - log2 s_n)
  = 2 - log2( ∏_{i=1..n} p(T_i) )
  = 2 - log2( ∏_σ p(σ)^{occ(σ)} )
  = 2 - Σ_σ occ(σ) · log2 p(σ)
  ≈ 2 + Σ_σ (n · p(σ)) · log2(1/p(σ))   [since occ(σ) ≈ n · p(σ)]
  = 2 + n H(T) bits.
Example: T = acabc, so s_n = p(a) · p(c) · p(a) · p(b) · p(c) = p(a)^2 · p(b) · p(c)^2.
Document Compression: Dictionary-based compressors
LZ77
Algorithm's step: output a triple ⟨dist, len, next-char⟩, where ⟨dist, len⟩ locates the longest match of the upcoming text inside the dictionary and next-char is the first mismatching symbol; then advance by len + 1.
The dictionary is a buffer "window" of fixed length that moves along the text: it consists of all substrings starting inside the window.
Example on aacaacabcaaaaaa: the first steps emit ⟨0, 0, a⟩ (no match, literal a), then ⟨1, 1, c⟩ (copy a, then c), then ⟨3, 4, b⟩ (copy aaca, then b), and so on.
LZ77 Decoding
The decoder keeps the same dictionary window as the encoder: for each triple it finds the referenced substring and inserts a copy of it, followed by the explicit char.
What if len > dist (the copy overlaps the text still to be decompressed)? E.g. seen = abcd, next codeword is (2, 9, e). Simply copy byte by byte, left to right, starting at the cursor:
for (i = 0; i < len; i++) out[cursor + i] = out[cursor - dist + i];
The output is correct: abcdcdcdcdcdce.
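Putting the copy rule into a self-contained C sketch (the Triple layout is an assumption made for illustration):

    #include <stdio.h>

    typedef struct { int dist, len; char ch; } Triple;

    int main(void) {
        char out[64];
        int cursor = 4;
        /* assume "abcd" was already decoded; next codeword is (2, 9, 'e') */
        out[0] = 'a'; out[1] = 'b'; out[2] = 'c'; out[3] = 'd';
        Triple t = { 2, 9, 'e' };

        for (int i = 0; i < t.len; i++)     /* left-to-right, byte-by-byte:  */
            out[cursor + i] = out[cursor - t.dist + i];  /* source may overlap */
        cursor += t.len;
        out[cursor++] = t.ch;
        out[cursor] = '\0';

        printf("%s\n", out);                /* prints: abcdcdcdcdcdce */
        return 0;
    }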
LZ77 Optimizations used by gzip
LZSS: output one of two token formats, (0, position, length) or (1, char); typically the second format is used when the match length is < 3.
Special greedy parsing ("lazy matching"): possibly settle for a shorter match now so that the next match is longer.
A hash table over character triplets speeds up the search for matches.
The emitted tokens are finally coded with Huffman's code.
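The two token formats can be pictured as a tagged record; the field widths below are illustrative assumptions, not gzip's actual on-disk layout (gzip additionally Huffman-codes these fields):

    #include <stdio.h>

    typedef struct {
        unsigned flag;                 /* 0 = copy token, 1 = literal */
        union {
            struct { unsigned pos, len; } match;   /* (0, position, length) */
            unsigned char ch;                      /* (1, char)             */
        } u;
    } LzssToken;

    int main(void) {
        LzssToken lit, cpy;
        lit.flag = 1; lit.u.ch = 'a';
        cpy.flag = 0; cpy.u.match.pos = 42; cpy.u.match.len = 5;
        printf("literal '%c'; copy (pos=%u, len=%u)\n",
               lit.u.ch, cpy.u.match.pos, cpy.u.match.len);
        return 0;
    }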
You can find this at: www.gzip.org/zlib/
Google's solution: Snappy (a.k.a. zippy, mentioned among the approaches above)