Huffman Codes

- Message consisting of five characters: a, b, c, d, e
- Probabilities: .12, .40, .15, .08, .25
- Encode each character into a sequence of 0's and 1's so that no code for a character is a prefix of the code for any other character (the "prefix property")
- The prefix property lets us decode a string of 0's and 1's by repeatedly deleting prefixes of the string that are codes for characters
Example

  Symbol   Probability   Code 1   Code 2
  a        .12           000      000
  b        .40           001      11
  c        .15           010      01
  d        .08           011      001
  e        .25           100      10

- Both codes have the prefix property
- Decode Code 1: "grab" 3 bits at a time and translate each group into a character
- Ex.: 001010011 -> bcd
Example Cont'd

  Symbol   Probability   Code 1   Code 2
  a        .12           000      000
  b        .40           001      11
  c        .15           010      01
  d        .08           011      001
  e        .25           100      10

- Decode Code 2: repeatedly "grab" prefixes that are codes for characters and remove them from the input
- Only difference: we cannot slice up the input all at once, because how many bits to grab depends on which character was encoded
- Ex.: 1101001 -> bcd
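The Code 2 decoding procedure above can be sketched in a few lines of Python (not from the slides, just an illustration): scan the bits, and whenever the buffer matches a code, emit that character and start over.

```python
# Decoding a prefix code by repeatedly grabbing code-valued prefixes.
# Code 2 from the table above.
CODE2 = {"a": "000", "b": "11", "c": "01", "d": "001", "e": "10"}
DECODE2 = {code: sym for sym, code in CODE2.items()}

def decode(bits: str, table: dict) -> str:
    """Strip the shortest prefix that is a valid code, repeatedly."""
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in table:       # prefix property guarantees this match is unambiguous
            out.append(table[buf])
            buf = ""
    return "".join(out)

print(decode("1101001", DECODE2))  # -> bcd, matching the slide's example
```

Because no code is a prefix of another, the first match found is always the right one; no backtracking is needed.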
Big Deal?

- Huffman coding results in a shorter average length of the compressed (encoded) message
- Average length: multiply the length of the code for each symbol by that symbol's probability of occurrence, and sum
- Code 1 has average length 3 (every code is 3 bits long)
- Code 2 has average length 2.2: (3 * .12) + (2 * .40) + (2 * .15) + (3 * .08) + (2 * .25)
- Can we do better?
- Problem: given a set of characters and their probabilities, find a code with the prefix property such that the average length of a code for a character is minimum
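The two averages quoted above can be checked directly (a small Python sketch, not part of the slides):

```python
# Average code length = sum over symbols of (code length * probability).
probs = {"a": .12, "b": .40, "c": .15, "d": .08, "e": .25}
len1  = {s: 3 for s in probs}                      # Code 1: fixed 3-bit codes
len2  = {"a": 3, "b": 2, "c": 2, "d": 3, "e": 2}   # lengths from the Code 2 column

avg1 = sum(probs[s] * len1[s] for s in probs)      # 3.0
avg2 = sum(probs[s] * len2[s] for s in probs)      # 2.2
print(avg1, avg2)
```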
Representation

- Think of prefix codes as paths in binary trees: following a path from a node to its left child appends a 0 to the code, and proceeding from a node to its right child appends a 1
- Label the leaves of the tree with the characters represented
- Any prefix code can be represented as a binary tree; the prefix property guarantees that no character's code can end at an interior node
- Conversely, labeling the leaves of any binary tree with characters gives us a code with the prefix property
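The "no code ends at an interior node" condition is equivalent to saying that no code is a prefix of another. A compact way to check this (a sketch, not from the slides) is to sort the codes: if any code is a prefix of another, the two end up adjacent in sorted order.

```python
def has_prefix_property(codes):
    """True iff no code is a prefix of any other code."""
    ordered = sorted(codes)
    # After sorting, a code and any code it is a prefix of are adjacent.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))

print(has_prefix_property(["000", "11", "01", "001", "10"]))  # Code 2 -> True
print(has_prefix_property(["0", "01", "11"]))                 # "0" prefixes "01" -> False
```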
Sample Binary Trees

[Figure: two binary trees with edges labeled 0 (left) and 1 (right). The Code 1 tree has all five leaves a, b, c, d, e at depth 3; the Code 2 tree has its leaves at depths matching the code lengths (b, c, e at depth 2; a, d at depth 3).]
Huffman's Algorithm

- Select the two characters a and b having the lowest probabilities and replace them with a single (imaginary) character, say x
- x's probability of occurrence is the sum of the probabilities for a and b
- Now find an optimal prefix code for this smaller set of characters, using the above procedure recursively
- The code for the original character set is obtained by using the code for x with a 0 appended for a and with a 1 appended for b
Steps in the Construction of a Huffman Tree

- Sort the input characters by frequency: d (.08), a (.12), c (.15), e (.25), b (.40)
Merge a and d

- Combine the two lowest-frequency characters, d (.08) and a (.12), into a subtree of weight .20; remaining weights: .20, .15, .25, .40
Merge a, d with c

- Combine the {d, a} subtree (.20) with c (.15) into a subtree of weight .35; remaining weights: .35, .25, .40
Merge a, c, d with e

- Combine the {c, d, a} subtree (.35) with e (.25) into a subtree of weight .60; remaining weights: .60, .40
Final Tree

- Merging the {a, c, d, e} subtree (.60) with b (.40) gives the root, with weight 1.00
- Codes: a = 1111, b = 0, c = 110, d = 1110, e = 10
- Average code length: 2.15
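Once the tree is built, encoding is a straight table lookup, and the quoted average of 2.15 follows from weighting each code's length by its probability (a quick Python check, not from the slides):

```python
codes = {"a": "1111", "b": "0", "c": "110", "d": "1110", "e": "10"}
probs = {"a": .12, "b": .40, "c": .15, "d": .08, "e": .25}

# Average length: (4 * .12) + (1 * .40) + (3 * .15) + (4 * .08) + (2 * .25)
avg = sum(probs[s] * len(codes[s]) for s in probs)
print(avg)  # 2.15

# Encoding a message is just concatenating code-table lookups.
bits = "".join(codes[ch] for ch in "bead")   # "0" + "10" + "1111" + "1110"
print(bits)
```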
Huffman Algorithm

- An example of a greedy algorithm: combine nodes whenever possible without considering potential drawbacks inherent in making such a move
- I.e., at each individual stage, select the option that is "locally optimal"
- Recall the vertex coloring problem: a greedy strategy does not always yield an optimal solution
- Huffman coding, however, is optimal; see the textbook for a proof
Finishing Remarks

- Huffman coding works well in theory, but rests on several restrictive assumptions:
- (1) The frequency of a letter is assumed to be independent of that letter's context in the message; this is not true of the English language
- (2) Huffman coding works better when there is large variation in letter frequencies, and the actual frequencies must match the expected ones
- Examples: DEED encodes in 8 bits (12 bits ASCII); FUZZ encodes in 20 bits (12 bits ASCII)