Data Compression and Huffman Trees (HW 4) Data Structures Fall 2008 Modified by Eugene Weinstein
Representing Text (ASCII) A way of representing characters as bits –Characters are ‘a’, ‘b’, ‘1’, ‘%’, ‘\n’, ‘\t’… Each character is represented by a unique 7-bit code, so there are 2^7 = 128 possible characters. –STATIC (FIXED) LENGTH ENCODING To encode a long text, we encode it character by character.
Inefficiency of ASCII Realization: In many natural files, we are much more likely to see the letter ‘e’ than the character ‘&’, yet they are both encoded using 7 bits! Solution: Use variable length encoding! The encoding for ‘e’ should be shorter than the encoding for ‘&’.
Variable Length Coding Assume we know the distribution of characters (‘e’ appears 1000 times, ‘&’ appears 1 time). Each character will be encoded using a number of bits that shrinks as its frequency grows (made precise later). Need a ‘prefix free’ encoding: if ‘e’ = 001, then no other character, such as ‘&’, can be assigned a code that begins with 001. Since the encoding is variable length, the decoder needs to know where each code stops.
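The prefix-free requirement can be checked mechanically. The following is a minimal sketch, not part of the course material: the class name `PrefixFreeCheck` and the sample code `0010` are made up for illustration.

```java
import java.util.List;

class PrefixFreeCheck {
    // Returns true when no code in the list is a prefix of a different code.
    // Duplicate codes also count as a violation.
    static boolean isPrefixFree(List<String> codes) {
        for (int i = 0; i < codes.size(); i++) {
            for (int j = 0; j < codes.size(); j++) {
                if (i != j && codes.get(j).startsWith(codes.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A valid variable-length code.
        System.out.println(isPrefixFree(List.of("0", "10", "11")));
        // Hypothetical clash: 001 is a prefix of 0010.
        System.out.println(isPrefixFree(List.of("001", "0010")));
    }
}
```

The quadratic loop is fine for an alphabet of 128 characters; a trie would scale better but is not needed here.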
Encoding Trees Think of the encoding as an (unbalanced) tree. Data is in leaf nodes only (prefix free). ‘e’ = 0, ‘a’ = 10, ‘b’ = 11. How to decode ‘01110’? [Figure: encoding tree with leaves e, a, b]
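Decoding with such a tree means walking from the root, going one way on a 0 and the other way on a 1, and emitting a character each time a leaf is reached. The sketch below hard-codes the slide's tree; the class and method names (`TreeDecoder`, `slideTree`, `decode`) are illustrative assumptions, not names from the course.

```java
class TreeDecoder {
    // A node is either a leaf holding a symbol or an internal node with two children.
    static final class Node {
        final Character symbol; // null for internal nodes
        final Node zero;        // child followed on bit 0
        final Node one;         // child followed on bit 1

        Node(char symbol) { this.symbol = symbol; this.zero = null; this.one = null; }
        Node(Node zero, Node one) { this.symbol = null; this.zero = zero; this.one = one; }

        boolean isLeaf() { return symbol != null; }
    }

    // The tree from the slide: 'e' = 0, 'a' = 10, 'b' = 11.
    static Node slideTree() {
        return new Node(new Node('e'), new Node(new Node('a'), new Node('b')));
    }

    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node node = root;
        for (int i = 0; i < bits.length(); i++) {
            char bit = bits.charAt(i);
            if (bit != '0' && bit != '1') {
                throw new IllegalArgumentException("Not a bit: " + bit);
            }
            node = (bit == '0') ? node.zero : node.one;
            if (node.isLeaf()) {
                out.append(node.symbol); // a full code was read
                node = root;             // restart for the next character
            }
        }
        if (node != root) {
            throw new IllegalArgumentException("Input ends in the middle of a code");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode(slideTree(), "01110")); // prints "eba"
    }
}
```

Because every character sits in a leaf, the decoder never has to look ahead: 0 gives e, 11 gives b, 10 gives a.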
Cost of a Tree For each character c_i, let f_i be its frequency in the file. Given an encoding tree T, let d_i be the depth of c_i in the tree (the number of bits needed to encode the character). The length of the file after encoding it with the coding scheme defined by T will be C(T) = Σ_i d_i f_i.
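As a small worked instance of the cost formula, the sketch below multiplies each character's depth by its frequency and sums the results. The depths come from the earlier example code (‘e’ = 0, ‘a’ = 10, ‘b’ = 11); the frequencies 5, 2 and 1 are made up for illustration, as is the class name `TreeCost`.

```java
import java.util.Map;

class TreeCost {
    // Sums depth * frequency over all characters, i.e. the encoded length in bits.
    static long cost(Map<Character, Integer> depth, Map<Character, Integer> freq) {
        long total = 0;
        for (Map.Entry<Character, Integer> entry : freq.entrySet()) {
            Integer d = depth.get(entry.getKey());
            if (d == null) {
                throw new IllegalArgumentException("No code for " + entry.getKey());
            }
            total += (long) d * entry.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Character, Integer> depth = Map.of('e', 1, 'a', 2, 'b', 2);
        Map<Character, Integer> freq = Map.of('e', 5, 'a', 2, 'b', 1); // hypothetical counts
        System.out.println(cost(depth, freq)); // 1*5 + 2*2 + 2*1 = 11
    }
}
```

For comparison, a fixed-length code for three characters needs 2 bits each, so the same 8 characters would take 16 bits.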
Creating an Optimal T Problem: Find a tree T with minimal C(T). Solution (Huffman 1952): –Create a single-node tree for each character. The weight W(T) of the tree is the frequency of the character. –Repeat n-1 times (n = number of chars): select the two trees T’, T’’ with the lowest weights, merge them under a new root to form T, and set W(T) = W(T’) + W(T’’). –Implement using a min-heap. What is the running time?
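The merge loop can be sketched as follows. This is not the homework solution: it uses `java.util.PriorityQueue` as a stand-in for the course's `BinaryHeap`, and the names `HuffmanSketch`, `Node`, `build` and `cost` as well as the sample frequencies are assumptions made for illustration.

```java
import java.util.Map;
import java.util.PriorityQueue;

class HuffmanSketch {
    static final class Node implements Comparable<Node> {
        final char symbol;   // meaningful only in leaves
        final int weight;    // frequency, or the sum of the children's weights
        final Node left;
        final Node right;

        Node(char symbol, int weight) { this(symbol, weight, null, null); }

        Node(char symbol, int weight, Node left, Node right) {
            this.symbol = symbol;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() { return left == null && right == null; }

        @Override
        public int compareTo(Node other) { return Integer.compare(weight, other.weight); }
    }

    // Builds the tree by repeatedly merging the two lightest trees.
    static Node build(Map<Character, Integer> freq) {
        if (freq.isEmpty()) {
            throw new IllegalArgumentException("Need at least one character");
        }
        PriorityQueue<Node> heap = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> entry : freq.entrySet()) {
            heap.add(new Node(entry.getKey(), entry.getValue()));
        }
        while (heap.size() > 1) {
            Node first = heap.poll();   // lightest tree
            Node second = heap.poll();  // second lightest tree
            heap.add(new Node('\0', first.weight + second.weight, first, second));
        }
        return heap.poll();
    }

    // C(T): sum over the leaves of depth * frequency.
    static long cost(Node node, int depth) {
        if (node.isLeaf()) {
            return (long) depth * node.weight;
        }
        return cost(node.left, depth + 1) + cost(node.right, depth + 1);
    }

    public static void main(String[] args) {
        Node root = build(Map.of('e', 5, 'a', 2, 'b', 1)); // hypothetical counts
        System.out.println(root.weight);   // total frequency: 8
        System.out.println(cost(root, 0)); // encoded length in bits: 11
    }
}
```

With a binary heap, each of the n-1 rounds performs two deleteMin operations and one insert, each O(log n), so the construction runs in O(n log n) time.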
Optimality Intuition Need to show that Huffman’s algorithm indeed results in a tree T with optimal C(T) = Σ_i d_i f_i. The two least-weight letters should be at the bottom of the tree as siblings (otherwise we could lower the cost by swapping one of them with a heavier letter that sits deeper). Intuitively, when we combine two trees we can think of the result as a new letter with the combined weight.
Homework Implement: –public class HuffmanTree: has a traversal/code printing method –public class HuffmanNode (Comparable): contains a letter and an integer frequency; has accessor (getter) methods –public class BinaryHeap (given in class) Read a file ‘huff.txt’ which includes letters and frequencies: –A 20 E 24 G 3 H 4 I 17 L 6 N 5 O 10 S 8 V 1 W 2 Create a Huffman tree (algorithm: see the book). Print the “legend”: the code of each character.
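One way to produce a legend is a recursive traversal that appends 0 when going left and 1 when going right, and records the accumulated string at each leaf. The sketch below runs on the small example tree with leaves e, a, b rather than on the homework data; `LegendSketch`, its `Node` type and the left = 0, right = 1 convention are assumptions, not requirements of the assignment.

```java
import java.util.Map;
import java.util.TreeMap;

class LegendSketch {
    static final class Node {
        final char letter;  // meaningful only in leaves
        final Node left;
        final Node right;

        Node(char letter) { this.letter = letter; this.left = null; this.right = null; }
        Node(Node left, Node right) { this.letter = '\0'; this.left = left; this.right = right; }

        boolean isLeaf() { return left == null && right == null; }
    }

    // Depth-first traversal: the path from the root to a leaf is that leaf's code.
    static void collect(Node node, String prefix, Map<Character, String> out) {
        if (node.isLeaf()) {
            // A tree with a single character still needs one bit.
            out.put(node.letter, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        collect(node.left, prefix + "0", out);
        collect(node.right, prefix + "1", out);
    }

    static Map<Character, String> legend(Node root) {
        Map<Character, String> out = new TreeMap<>(); // sorted by letter for printing
        collect(root, "", out);
        return out;
    }

    public static void main(String[] args) {
        Node root = new Node(new Node('e'), new Node(new Node('a'), new Node('b')));
        for (Map.Entry<Character, String> entry : legend(root).entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
        // prints: a = 10, b = 11, e = 0 (one per line)
    }
}
```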
Tips and Implementation Notes HuffmanNode should be Comparable to work with BinaryHeap –How to implement the compareTo method? Implement a toString method in BinaryHeap –Print the heap after every rearrangement Understand binary heap operations: –insert –deleteMin
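One possible answer to the compareTo question is to order nodes by frequency, so that the heap's minimum is always the lightest tree. The sketch below covers only the leaf-level fields listed in the homework (letter, integer frequency, getters); the child links an actual tree node needs are omitted, and the `toString` format is an arbitrary choice.

```java
class HuffmanNode implements Comparable<HuffmanNode> {
    private final char letter;
    private final int frequency;

    HuffmanNode(char letter, int frequency) {
        this.letter = letter;
        this.frequency = frequency;
    }

    public char getLetter() { return letter; }

    public int getFrequency() { return frequency; }

    // Lower frequency sorts first, which is what a min-heap needs.
    @Override
    public int compareTo(HuffmanNode other) {
        return Integer.compare(frequency, other.frequency);
    }

    // Handy when printing the heap after each rearrangement.
    @Override
    public String toString() {
        return letter + ":" + frequency;
    }

    public static void main(String[] args) {
        HuffmanNode v = new HuffmanNode('V', 1);
        HuffmanNode w = new HuffmanNode('W', 2);
        System.out.println(v.compareTo(w) < 0); // true: V is lighter than W
        System.out.println(v);                  // V:1
    }
}
```

`Integer.compare` is used instead of subtracting the two frequencies because subtraction can overflow for large values.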