Data Compression Section 4.8 of [KT]
Data formats
Space runs out very fast: we need to compress files.
Encodings
Fixed-length encodings: 8 bits per character (ASCII). Decoding is simple: every 8 bits forms a character.
Morse code: an encoding using dots (0) and dashes (1), e.g. e: 0, t: 1, a: 01. Frequent letters are encoded by shorter strings.
Ambiguity: 0101 could translate to eta, aa, etet, or aet. Problem: the encoding of one letter (e) is a prefix of the encoding of another (a).
Morse solves this by adding pauses between letters, so it is actually an encoding using dots, dashes, and pauses.
We need a code in which decoding is unambiguous.
Prefix Codes
Code: a function γ: S → {0,1}*, where S is the alphabet and {0,1}* is the set of all possible 0/1 strings.
Prefix code: γ is prefix-free if, for all x, y in S, γ(x) is not a prefix of γ(y).
Encoding: the string x1x2x3… is encoded as γ(x1)γ(x2)γ(x3)…
Decoding: read the shortest prefix that matches some letter's code, output that letter, delete the prefix, and repeat.
Prefix Code: Example
γ1(a) = 11, γ1(b) = 01, γ1(c) = 001, γ1(d) = 10, γ1(e) = 000. This is a prefix code.
The string cecab is encoded as 0010000011101 and can be decoded unambiguously.
Multiple prefix codes are possible: which one is better?
γ2(a) = 11, γ2(b) = 10, γ2(c) = 01, γ2(d) = 001, γ2(e) = 000.
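The "read the shortest matching prefix" decoding rule can be sketched directly. Note that the slide's table omits γ1(e); it must be 000 for the example bit string 0010000011101 to decode to cecab, so that value is filled in below.

```python
# Prefix-code table gamma1 from the slide; gamma1["e"] = "000" is inferred,
# since the example bit string only decodes to "cecab" with that codeword.
gamma1 = {"a": "11", "b": "01", "c": "001", "d": "10", "e": "000"}

def encode(text, code):
    # Concatenate the codewords of the letters.
    return "".join(code[x] for x in text)

def decode(bits, code):
    inverse = {w: x for x, w in code.items()}  # codeword -> letter
    out, buf = [], ""
    for b in bits:
        buf += b                    # extend the current prefix
        if buf in inverse:          # shortest matching codeword found
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

print(decode("0010000011101", gamma1))  # -> cecab
```

Because no codeword is a prefix of another, the first codeword that matches is the only one that can, which is why the greedy loop never backtracks.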
Cost of a Prefix Code
For each x in S, fx = frequency of x = fraction of times x appears in an average text, so ∑x in S fx = 1.
ABL(γ) = average number of bits per letter for prefix code γ = ∑x in S fx |γ(x)|.
Average encoding length for a text of n letters = n·ABL(γ) = ∑x in S n fx |γ(x)|.
Example: fa = 0.32, fb = 0.25, fc = 0.20, fd = 0.18, fe = 0.05.
ABL(γ1) = 0.32·2 + 0.25·2 + 0.20·3 + 0.18·2 + 0.05·3 = 2.25
Cost of a fixed-length encoding? 3 bits per letter, since ⌈log2 5⌉ = 3.
ABL(γ2) = 0.32·2 + 0.25·2 + 0.20·2 + 0.18·3 + 0.05·3 = 2.23
γ is optimal if ABL(γ) is the minimum possible.
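The ABL arithmetic can be checked directly, using the frequencies and the two example codes from the slides:

```python
# ABL(gamma) = sum over x of f_x * |gamma(x)|.
freq = {"a": 0.32, "b": 0.25, "c": 0.20, "d": 0.18, "e": 0.05}
gamma1 = {"a": "11", "b": "01", "c": "001", "d": "10", "e": "000"}
gamma2 = {"a": "11", "b": "10", "c": "01", "d": "001", "e": "000"}

def abl(code, freq):
    # Average number of bits per letter under the given code.
    return sum(freq[x] * len(code[x]) for x in freq)

print(abl(gamma1, freq))  # ≈ 2.25
print(abl(gamma2, freq))  # ≈ 2.23, the optimum for these frequencies
```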
Prefix Codes to Binary Trees
Constructing a tree corresponding to a prefix code, recursively:
All letters x whose encoding γ(x) starts with 0 go in the left subtree of the root.
All letters x whose encoding γ(x) starts with 1 go in the right subtree of the root.
Recursively construct the left and right subtrees.
Prefix codes to binary trees
[Figure: recursive construction of the tree for γ1, splitting S into {b, c, e} (codes starting with 0) and {a, d} (codes starting with 1), then splitting each side again.]
Deriving a prefix code from a binary tree
Let T be a binary tree with |S| leaves. Label each leaf with a letter x in S.
For each x in S, follow the path from the root to the leaf labeled x:
each time the path goes from a node to its left child, write 0;
each time the path goes from a node to its right child, write 1.
Example: a = 1, b = 011, c = 010, d = 001, e = 000.
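This root-to-leaf labeling can be sketched as a traversal over nested pairs. The specific tree shape below is an assumption, chosen so the traversal reproduces the slide's example table:

```python
# Derive a prefix code from a binary tree: 0 for a left branch, 1 for a
# right branch. A tree is either a letter (leaf) or a (left, right) pair.
def codes_from_tree(tree, prefix=""):
    if isinstance(tree, str):            # leaf: emit the accumulated bits
        return {tree: prefix}
    left, right = tree
    out = codes_from_tree(left, prefix + "0")
    out.update(codes_from_tree(right, prefix + "1"))
    return out

# Assumed shape matching the slide: a at depth 1, b/c at depth 3, d/e at depth 3.
tree = ((("e", "d"), ("c", "b")), "a")
print(codes_from_tree(tree))  # a=1, b=011, c=010, d=001, e=000
```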
Different codes from different trees
Codes constructed from a binary tree
Lemma: The encoding of S constructed from a binary tree T is a prefix code.
Proof: Suppose the encoding of x were a prefix of the encoding of y. Then the root-to-x path is a prefix of the root-to-y path, so x is not a leaf. This contradicts the fact that every letter labels a leaf.
ABL(T)
Length of the encoding of x in S = length of the path from the root to the leaf labeled x = depthT(x).
ABL(γ) = ∑x in S fx depthT(x) = ABL(T)
Choosing an optimal code is equivalent to choosing a tree T with minimum ABL(T).
Structure of optimal trees
Full binary tree: a binary tree T is full if every non-leaf node in T has two children.
Lemma: The binary tree corresponding to the optimal code is full.
Proof: Suppose T is not full. Then there is a non-leaf node u with only one child v.
If u is the root, form T' by deleting u and making v the root.
If u is not the root, let w be u's parent; form T' by bypassing u, making v a child of w.
In either case every leaf below u gets strictly closer to the root, so ABL(T') < ABL(T), contradicting optimality.
Attempt I: Top-down approach (Shannon-Fano codes)
Intuition: produce a tree whose leaves are as close to the root as possible, i.e. with low average depth.
Split S into sets S1 and S2 so that the total frequency in each set is as close to 1/2 as possible.
Recursively form subtrees T1 and T2 for S1 and S2, respectively.
Make T1 and T2 the children of a root node.
This performs fairly well in practice, but is not necessarily optimal.
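The recursion above can be sketched as follows. The slide does not specify how to balance the split, so the heuristic here (assign letters, heaviest first, to whichever half is currently lighter) is an assumption:

```python
# Shannon-Fano sketch: split the alphabet into two halves of roughly
# equal total frequency, recurse, and prepend 0/1 for the two halves.
def shannon_fano(freq):
    if len(freq) == 1:
        (x,) = freq                   # single letter: empty codeword here
        return {x: ""}
    s1, s2, w1, w2 = {}, {}, 0.0, 0.0
    for x in sorted(freq, key=freq.get, reverse=True):
        if w1 <= w2:                  # put the letter in the lighter half
            s1[x] = freq[x]; w1 += freq[x]
        else:
            s2[x] = freq[x]; w2 += freq[x]
    code = {x: "0" + w for x, w in shannon_fano(s1).items()}
    code.update({x: "1" + w for x, w in shannon_fano(s2).items()})
    return code

freq = {"a": 0.32, "b": 0.25, "c": 0.20, "d": 0.18, "e": 0.05}
print(shannon_fano(freq))
```

On these frequencies the first split is {a, d} versus {b, c, e} (0.50 each), and the resulting code has ABL 2.25, slightly worse than the optimum of 2.23.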
Attempt I: Example
S = {a, b, c, d, e}; fa = 0.32, fb = 0.25, fc = 0.20, fd = 0.18, fe = 0.05.
First split: S1 = {a, d} (total 0.50) and S2 = {b, c, e} (total 0.50).
Resulting tree: a, d, b at depth 2, c, e at depth 3, with ABL = 2.25.
The optimal tree (the code γ2) achieves ABL = 2.23.
Structure of the optimal tree
Suppose we knew the optimal binary tree T*. How would we assign letters to the leaves?
Lemma: Suppose u, v are leaves of T* with depth(u) < depth(v), and suppose that in the optimal labeling of T*, leaf u is labeled with letter y and leaf v with letter z. Then fy ≥ fz.
Proof (exchange argument): Suppose fy < fz. Consider the new code obtained by exchanging y and z:
new ABL − old ABL = depth(u)·fz + depth(v)·fy − depth(u)·fy − depth(v)·fz = (depth(v) − depth(u))(fy − fz) < 0,
contradicting the optimality of the labeling.
So high-frequency letters must be at low-depth leaves in T*.
Optimal labeling of the optimal tree
Order the leaves of T* in non-decreasing order of depth.
Order the letters in non-increasing order of frequency.
Match letters to leaves in this order.
By the exchange lemma this order cannot be suboptimal, and letters assigned to leaves at the same depth can be interchanged freely.
Example: fa = 0.32, fb = 0.25, fc = 0.20, fd = 0.18, fe = 0.05. [Figure: optimal labeling of an example tree.]
Properties of Optimal Prefix Codes
Lemma: There is an optimal prefix code, with tree T*, in which the two lowest-frequency letters are assigned to leaves that are siblings in T*.
Proof sketch: Let v be a leaf at maximum depth in T*. Since T* is full, v has a sibling w. By the exchange argument, the two lowest-frequency letters y, z may be assigned to v and w. So it is safe to "lock up" y and z together.
Huffman’s algorithm
Repeat until one letter remains: take the two lowest-frequency letters y*, z* and replace them with a new meta-letter ω of frequency fω = fy* + fz*. Recursively build the optimal tree T' for the smaller alphabet, then expand the leaf labeled ω back into the two sibling leaves y*, z* to obtain T.
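A compact sketch of the merging loop using Python's heapq priority queue (the O(n log n) variant discussed on the running-time slide); the tuple-based tree representation and names are this sketch's own:

```python
import heapq
import itertools

# Huffman's algorithm: repeatedly merge the two lowest-frequency
# subtrees; then read codewords off the final tree (0 = left, 1 = right).
def huffman(freq):
    tie = itertools.count()          # tie-breaker so trees are never compared
    heap = [(f, next(tie), x) for x, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # two lowest frequencies
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (t1, t2)))
    _, _, tree = heap[0]

    def walk(t, prefix=""):
        if isinstance(t, str):
            return {t: prefix or "0"}     # degenerate one-letter alphabet
        return {**walk(t[0], prefix + "0"), **walk(t[1], prefix + "1")}

    return walk(tree)

freq = {"a": 0.32, "b": 0.25, "c": 0.20, "d": 0.18, "e": 0.05}
code = huffman(freq)
print(code)
print(sum(freq[x] * len(code[x]) for x in freq))  # ≈ 2.23, the optimal ABL
```

On the example frequencies this merges e and d first (0.05 + 0.18 = 0.23), then c with that subtree, producing codeword lengths 2, 2, 2, 3, 3 for a, b, c, d, e.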
Huffman’s algorithm: example
S = {a, b, c, d, e}; fa = 0.32, fb = 0.25, fc = 0.20, fd = 0.18, fe = 0.05.
Merge the two lowest-frequency letters d and e into a meta-letter (de) with frequency 0.18 + 0.05 = 0.23, giving S' = {a, b, c, (de)}.
Recursively build the optimal tree T' for S', then expand the leaf (de) into the two sibling leaves d and e to obtain T*.
Proving optimality of Huffman’s algorithm
Lemma: ABL(T') = ABL(T) − fω, where ω is the meta-letter replacing the two lowest-frequency letters y*, z*, and fω = fy* + fz*.
Proof: The depth of every letter x ≠ y*, z* is the same in T and T', and depthT(y*) = depthT(z*) = 1 + depthT'(ω). Hence
ABL(T) = ∑x in S fx depthT(x)
= fy* depthT(y*) + fz* depthT(z*) + ∑x≠y*,z* fx depthT(x)
= (fy* + fz*)(1 + depthT'(ω)) + ∑x≠y*,z* fx depthT(x)
= fω(1 + depthT'(ω)) + ∑x≠y*,z* fx depthT'(x)
= fω + fω depthT'(ω) + ∑x≠y*,z* fx depthT'(x)
= fω + ABL(T').
Proving optimality of Huffman’s algorithm
Lemma: The Huffman code for a given alphabet achieves the minimum average number of bits per letter of any prefix code.
Proof: By induction on the size of the alphabet. Let T be the tree produced by Huffman’s algorithm and Z an optimal tree, and suppose ABL(Z) < ABL(T). Let y*, z* be the two lowest-frequency letters; w.l.o.g. the leaves labeled y*, z* are siblings in both T and Z.
Form Z' by deleting the leaves labeled y*, z* from Z and labeling their parent with the new letter ω; form T' from T in the same way. Then ABL(T') = ABL(T) − fω and ABL(Z') = ABL(Z) − fω, so ABL(Z') < ABL(T'). This contradicts the optimality of T' for S' = S − {y*, z*} ∪ {ω}, which holds by the induction hypothesis.
Implementation and Running Time
Main operations per iteration: identify the two lowest-frequency letters and merge them.
Array implementation: O(k) time per iteration with k letters remaining, O(n²) time overall, where n = |S|.
Priority-queue implementation: O(log k) time per iteration, O(n log n) time overall.
Extensions
Encode selective information: a 1000 × 1000 image with very few black pixels can be stored by listing the coordinates of the black pixels explicitly.
Adaptive coding: the frequencies of letters may change over the text, so change the encoding locally depending on the current frequencies.
Lempel-Ziv Coding
Basis of zip, gzip, compress, etc.
Maintain a dictionary D of some of the patterns seen so far.
Encode the largest possible pattern W by its index in the dictionary, if it exists.
When pattern W is coded, add Wa to D if there is space, where a is the letter that follows W in the text.
LZ encodes variable-length blocks with fixed-length codes; Huffman encodes fixed-length blocks with variable-length codes.
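The dictionary scheme described above is essentially LZ78. A toy encoder/decoder sketch follows; the token format and unlimited-dictionary policy are simplifying assumptions, and production tools like zip and gzip actually use the related LZ77 variant combined with Huffman coding:

```python
# LZ78-style sketch: emit (index of longest known pattern W, next letter a)
# and add Wa to the dictionary.
def lz78_encode(text):
    dictionary = {"": 0}              # index 0 = empty pattern
    out, w = [], ""
    for a in text:
        if w + a in dictionary:       # grow the current pattern
            w += a
        else:
            out.append((dictionary[w], a))
            dictionary[w + a] = len(dictionary)
            w = ""
    if w:                             # leftover pattern at end of text
        out.append((dictionary[w[:-1]], w[-1]))
    return out

def lz78_decode(tokens):
    patterns = [""]                   # rebuild the dictionary in emit order
    for i, a in tokens:
        patterns.append(patterns[i] + a)
    return "".join(patterns[i] + a for i, a in tokens)

print(lz78_encode("ABABABA"))  # -> [(0, 'A'), (0, 'B'), (1, 'B'), (3, 'A')]
```

Each token's index refers only to patterns created by earlier tokens, which is why the decoder can rebuild the dictionary in the same order.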
MP3 format
Stands for MPEG (Motion Picture Experts Group) audio layer 3.
Raw audio format (e.g. on a CD): sample the signal 44,100 times per second, giving a sequence of real numbers s1, s2, …, sT.
Quantization: approximate each sample by one of 2^B values, i.e. B bits per sample (B = 16 for CDs).
One such sequence per channel, and two channels for stereo: 44,100 × 16 × 2 = 1,411,200 bits per second.
MP3: a fixed-length to variable-length encoding, with further compression from properties of the human ear:
some sounds cannot be heard by the ear;
some sounds are heard much better than others;
when two sounds are played simultaneously, we hear only the louder one.
JPEG format
Designed by the Joint Photographic Experts Group. Uses Huffman coding.
Forward Discrete Cosine Transform
Like the Fourier transform: separates the image into sub-bands of differing importance.
[Figure: sample image shown at the highest and lowest JPEG compression settings.]