File Compression Even though disks have gotten bigger, we are still running short on disk space A common technique is to compress files so that they take up less space on the disk We can save space by taking advantage of the fact that most files have a relatively low “information content” 4/7/2019 Compression
Run Length Encoding The simplest type of redundancy in a file is long runs of repeated characters AAAABBBAABBBBBCCCCCCCC This string can be represented more compactly by replacing each repeated string with a single occurrence of the character and a count 4A3B2A5B8C For binary files a refined version of this method can yield dramatic savings 4/7/2019 Compression
Variable Length Encoding Suppose we wish to encode ABRACADABRA Instead of using the standard 8 (or 16) bits to represent these letters, why not use 3? A = 000 000001100000010000011000001100000 B = 001 C = 010 D = 011 R = 100 4/7/2019 Compression
We can do better!! Why use the same number of bits for each letter This is not really a code because it depends on the blanks 011100101001110 4/7/2019 Compression
Consider this Tree B D A C R 4/7/2019 Compression
More Formally Start with a frequency table A 5 B 2 R C 1 D 4/7/2019 Compression
More Formally Create a binary tree out of the two elements with the lowest frequencies New frequency is the sum of the frequencies Add new node to the frequency table 2 C, 1 D, 1 4/7/2019 Compression
More Formally Repeat until only one element is left in the table 11 6 2 4 C, 1 D, 1 B,2 R, 2 4/7/2019 Compression
Huffman Coding The general method for finding this code was developed by D. Huffman in 1952 4/7/2019 Compression