Huffman codes
Binary character code: each character is represented by a unique binary string.
A data file (here, 100 characters) can be coded in two ways:

                        a     b     c     d     e     f
frequency (%)           45    13    12    16    9     5
fixed-length code       000   001   010   011   100   101
variable-length code    0     101   100   111   1101  1100

The first way needs 100×3 = 300 bits.
The second way needs 45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4 = 224 bits.
2019/7/4 CS3335 Design and Analysis of Algorithms/WANG Lusheng
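The two totals can be checked with a few lines of Python. This is only a sketch: the name `total_bits` is illustrative, and the codewords a=0, b=101, c=100 are taken from the encoding example that uses this table.

```python
# Frequencies (per 100 characters) and the two codes from the table.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
fixed = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "100", "f": "101"}
variable = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

def total_bits(freq, code):
    """Sum of frequency times codeword length over all characters."""
    return sum(freq[ch] * len(code[ch]) for ch in freq)

print(total_bits(freq, fixed))     # 300
print(total_bits(freq, variable))  # 224
```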
Variable-length code
Some care is needed to read the code.
Example: 001011101 with codewords a=0, b=00, c=01, d=11. Where to cut? 00 can be read as either aa or b.
Prefixes of 0011: 0, 00, 001, and 0011.
Prefix codes (prefix-free codes): no codeword is a prefix of any other codeword.
Prefix codes are simple to encode and decode.
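The prefix-free condition is easy to test mechanically. The sketch below uses a hypothetical helper `is_prefix_free` (not part of the slides) that compares every pair of codewords; it is quadratic, which is fine for small alphabets.

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword."""
    return not any(
        u != v and v.startswith(u)
        for u in codewords
        for v in codewords
    )

# The ambiguous code from the slide: 0 is a prefix of 00 and 01.
print(is_prefix_free(["0", "00", "01", "11"]))
# The variable-length code used in the running example.
print(is_prefix_free(["0", "101", "100", "111", "1101", "1100"]))
```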
Using the codewords in the table to encode and decode
Variable-length codewords: a=0, b=101, c=100, d=111, e=1101, f=1100.
Encode: abc = 0.101.100 = 0101100 (just concatenate the codewords).
Decode: 001011101 = 0.0.101.1101 = aabe
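Both directions can be sketched in Python. The names `CODE`, `encode`, and `decode` are illustrative. The decoder relies on the code being prefix-free: it reads bits left to right and emits a character as soon as the bits read so far form a codeword.

```python
# Variable-length code from the running example.
CODE = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

def encode(text, code=CODE):
    """Concatenate the codewords of the characters in text."""
    return "".join(code[ch] for ch in text)

def decode(bits, code=CODE):
    """Cut the bit string into codewords, scanning left to right."""
    inverse = {word: ch for ch, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:      # a complete codeword has been read
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

print(encode("abc"))        # 0101100
print(decode("001011101"))  # aabe
```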
Decode: 001011101 = 0.0.101.1101 = aabe (use the variable-length tree below: start at the root, follow one edge per bit, and output a character on reaching a leaf).
In both trees each internal node shows the total frequency of the leaves below it; 0 labels the edge to the left child and 1 the edge to the right child.

Tree for the fixed-length codewords:
100
  0: 86
    0: 58
      0: a:45
      1: b:13
    1: 28
      0: c:12
      1: d:16
  1: 14
    0: 14
      0: e:9
      1: f:5

Tree for the variable-length codewords:
100
  0: a:45
  1: 55
    0: 25
      0: c:12
      1: b:13
    1: 30
      0: 14
        0: f:5
        1: e:9
      1: d:16
Binary tree
In the tree of an optimal code, every nonleaf node has two children.
The fixed-length code in our example is not optimal: its tree has a nonleaf node with only one child.
The total number of bits required to encode a file is
$$B(T) = \sum_{c \in C} f(c) \, d_T(c)$$
where f(c) is the frequency (number of occurrences) of c in the file and d_T(c) is the depth of c’s leaf in the tree T.
Constructing an optimal code
Formal definition of the problem:
Input: a set of characters C = {c_1, c_2, …, c_n}, where each c ∈ C has frequency f[c].
Output: a binary tree representing codewords such that the total number of bits required for the file is minimized.
Huffman proposed a greedy algorithm to solve the problem.
Figure: building the tree for the example. Each line lists the queue in increasing order of frequency; [n: x, y] is a subtree of total frequency n with left child x and right child y.
(a) f:5 e:9 c:12 b:13 d:16 a:45
(b) c:12 b:13 [14: f:5, e:9] d:16 a:45
(c) [14: f:5, e:9] d:16 [25: c:12, b:13] a:45
(d) [25: c:12, b:13] [30: [14: f:5, e:9], d:16] a:45
(e) a:45 [55: [25: c:12, b:13], [30: [14: f:5, e:9], d:16]]
(f) [100: a:45, [55: [25: c:12, b:13], [30: [14: f:5, e:9], d:16]]]
HUFFMAN(C)
1  n := |C|
2  Q := C
3  for i := 1 to n-1 do
4      z := ALLOCATE_NODE()
5      x := left[z] := EXTRACT_MIN(Q)
6      y := right[z] := EXTRACT_MIN(Q)
7      f[z] := f[x] + f[y]
8      INSERT(Q, z)
9  return EXTRACT_MIN(Q)
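One possible Python rendering of HUFFMAN(C) is sketched below, using the standard `heapq` module as the priority queue Q. Several details are assumptions of this sketch rather than part of the pseudocode: the function name `huffman`, the representation of internal nodes as `(left, right)` tuples, the counter used to break ties, the convention that a left edge is 0 and a right edge is 1, and a non-empty input.

```python
import heapq
from itertools import count

def huffman(freq):
    """Return {character: codeword} for a dict {character: frequency}."""
    tie = count()  # unique tie-breaker so the heap never compares tree nodes
    # A leaf is a character (str); an internal node is a (left, right) tuple.
    heap = [(f, next(tie), ch) for ch, f in freq.items()]
    heapq.heapify(heap)                       # line 2: Q := C
    for _ in range(len(freq) - 1):            # line 3
        fx, _tx, x = heapq.heappop(heap)      # line 5
        fy, _ty, y = heapq.heappop(heap)      # line 6
        heapq.heappush(heap, (fx + fy, next(tie), (x, y)))  # lines 7-8
    root = heap[0][2]                         # line 9

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")       # left edge
            walk(node[1], prefix + "1")       # right edge
        else:
            codes[node] = prefix or "0"       # single-character alphabet
    walk(root, "")
    return codes

codes = huffman({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
print(codes)
```

On the running example this produces a=0, b=101, c=100, d=111, e=1101, f=1100. With other tie-breaking or left/right conventions the codewords can differ while the total number of bits stays the same.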
The Huffman Algorithm
This algorithm builds the tree T corresponding to the optimal code in a bottom-up manner.
C is a set of n characters, and each character c in C has a defined frequency f[c].
Q is a priority queue, keyed on f, used to identify the two least-frequent objects to merge.
The result of a merger is a new object (an internal node) whose frequency is the sum of the frequencies of the two merged objects.
Time complexity
Lines 4-8 are executed n-1 times.
With Q implemented as a binary min-heap, each heap operation in Lines 4-8 takes O(lg n) time.
The total time required is O(n lg n).
Note: the details of the heap operations will not be tested. The time complexity O(n lg n) should be remembered.
Another example: e:4 a:6 c:6 b:9 d:11
First, merge e:4 and a:6 into a node of frequency 10. Queue: c:6 b:9 [10: e:4, a:6] d:11
Next, merge c:6 and b:9 into a node of frequency 15. Queue: [10: e:4, a:6] d:11 [15: c:6, b:9]
Then merge the node 10 and d:11 into a node of frequency 21. Queue: [15: c:6, b:9] [21: [10: e:4, a:6], d:11]
Finally, merge the nodes 15 and 21 into the root of frequency 36: [36: [15: c:6, b:9], [21: [10: e:4, a:6], d:11]]
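The sequence of merges in this example can be reproduced with a short trace. This is a sketch with an illustrative name, `merge_trace`. It stores `(frequency, name)` pairs in the heap, so the tie between a:6 and c:6 is broken alphabetically; that is an assumption of the sketch, chosen because it happens to match the merges in the example.

```python
import heapq

def merge_trace(freq):
    """Return the list of merges as (left name, right name, merged frequency)."""
    heap = [(f, name) for name, f in freq.items()]
    heapq.heapify(heap)
    steps = []
    while len(heap) > 1:
        fx, x = heapq.heappop(heap)   # least frequent object
        fy, y = heapq.heappop(heap)   # second least frequent object
        steps.append((x, y, fx + fy))
        heapq.heappush(heap, (fx + fy, x + "+" + y))
    return steps

for step in merge_trace({"e": 4, "a": 6, "c": 6, "b": 9, "d": 11}):
    print(step)
```

The merged frequencies come out as 10, 15, 21, and 36, in that order.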
Correctness of Huffman’s Greedy Algorithm (Fun Part, not required)
Again, we use our general strategy. Let x and y be the two characters in C having the lowest frequencies (the first two characters selected by the greedy algorithm). We will show two properties:
1. There exists an optimal solution Topt (a binary tree representing codewords) such that x and y are siblings in Topt.
2. Let z be a new character with frequency f[z] = f[x] + f[y], and let C’ = (C - {x, y}) ∪ {z}. Let T’ be an optimal tree for C’. Then we can get Topt from T’ by replacing the leaf z with an internal node whose two children are x and y.
Proof of Property 1
Figure: Topt, in which b and c are the deepest sibling leaves; Tnew, obtained from Topt by exchanging x with b and y with c.
Look at the deepest siblings in Topt, say b and c. Exchange x with b and y with c to obtain Tnew.
B(Topt) - B(Tnew) ≥ 0, since f[x] and f[y] are the smallest frequencies and b and c are at the greatest depth.
Because Topt is optimal, Tnew is optimal too, and x and y are siblings in Tnew. Property 1 is proved.
Proof of Property 2
Property 2: Let z be a new character with frequency f[z] = f[x] + f[y], and let C’ = (C - {x, y}) ∪ {z}. Let T’ be an optimal tree for C’. Then we can get Topt from T’ by replacing the leaf z with an internal node whose two children are x and y.
Proof: Let T be the tree obtained from T’ by this replacement.
B(T) = B(T’) + f[x] + f[y]. … (1) (The codewords for x and y are each 1 bit longer than that of z.)
Now prove that T is optimal by contradiction. If T is not optimal, then B(T) > B(Topt). … (2)
By Property 1, we may take x and y to be siblings in Topt. Thus, we can delete x and y from Topt and label their parent z, which gives another tree T’’ for C’.
B(T’’) = B(Topt) - f[x] - f[y] < B(T) - f[x] - f[y] = B(T’), where the inequality uses (2) and the last equality uses (1).
Thus B(T’’) < B(T’), contradicting the assumption that T’ is optimal for C’.
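Equation (1) can be checked numerically on the first example. The sketch below uses a hypothetical helper `optimal_cost` that runs the greedy merges and adds up the merged frequencies. This sum equals B(T) because each leaf's frequency is counted once for every internal node above it, which is exactly its depth.

```python
import heapq

def optimal_cost(freqs):
    """Cost B(T) of the greedy tree: the sum of all merged frequencies."""
    heap = list(freqs)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

C = [45, 13, 12, 16, 9, 5]
x, y = sorted(C)[:2]                  # the two lowest frequencies: 5 and 9
C_prime = sorted(C)[2:] + [x + y]     # C' = (C - {x, y}) ∪ {z}, f[z] = 14

print(optimal_cost(C))                # 224
print(optimal_cost(C_prime))          # 210
```

Here 224 = 210 + 5 + 9, in agreement with (1).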