Published by Norman Thornton. Modified over 9 years ago.
1
Representation of Strings: Background, Huffman Encoding
2
Representing Strings. How much space do we need? Assume we represent every character. How many bits are needed to represent each character? It depends on |Σ|, the size of the alphabet Σ.
3
Bits to encode a character. Two-character alphabet {A, B}: one bit per character (0 = A, 1 = B). Four-character alphabet {A, B, C, D}: two bits per character (00 = A, 01 = B, 10 = C, 11 = D). Six-character alphabet {A, B, C, D, E, F}: three bits per character (000 = A, 001 = B, 010 = C, 011 = D, 100 = E, 101 = F; 110 and 111 are unused).
4
More generally: the bit sequence representing a character is called the encoding of the character. There are 2^n different bit sequences of length n, so ceil(lg |Σ|) bits are required to represent each character in Σ if we use the same number of bits for each character. The length of the encoding of a word w is then |w| * ceil(lg |Σ|).
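The fixed-length rule above can be sketched in a few lines of Python. The function names `bits_per_char` and `fixed_length_bits` are illustrative choices, not from the slides; the integer `bit_length` trick is used instead of a floating-point logarithm so that the ceiling is exact.

```python
def bits_per_char(alphabet_size: int) -> int:
    """Smallest n with 2**n >= alphabet_size, i.e. ceil(lg |alphabet|)."""
    if alphabet_size < 1:
        raise ValueError("alphabet must contain at least one character")
    # (k - 1).bit_length() equals ceil(log2(k)) for every integer k >= 1.
    return (alphabet_size - 1).bit_length()


def fixed_length_bits(word: str, alphabet: str) -> int:
    """Length in bits of `word` when every character gets the same width."""
    return len(word) * bits_per_char(len(alphabet))


print(bits_per_char(2), bits_per_char(4), bits_per_char(6))
print(fixed_length_bits("FADED", "ABCDEF"))
```

For the six-character alphabet of the earlier slide this gives three bits per character, so a five-character word takes 15 bits.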
5
Can we do better? If Σ is very small, we might use run-length encoding.
6
Taking a step back … Why do we need compression? Consider the rate of creation of image and video data. Image data from a digital camera: today 1k by 1.5k is common (= 1.5 mbytes); 2k by 3k is needed to equal a 35mm slide (= 6 mbytes). Video: consider even a low resolution of 512 by 512 at 3 bytes per pixel and 30 frames/second.
7
Compression basics. Video data rate: 23.6 mbytes/second, so 2 hours of video = 169 gigabytes. MPEG-1 compresses 23.6 mbytes down to 187 kbytes per second, and so 169 gigabytes down to 1.3 gigabytes. Compression is essential for both storage and transmission of data.
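The figures on this slide follow from the video parameters given earlier (512 by 512 pixels, 3 bytes per pixel, 30 frames/second). A quick check, assuming decimal units (1 mbyte = 10^6 bytes, 1 gigabyte = 10^9 bytes), which is an assumption about how the slide rounds:

```python
# Raw video parameters from the slides.
width, height = 512, 512
bytes_per_pixel = 3
frames_per_second = 30

raw_rate = width * height * bytes_per_pixel * frames_per_second  # bytes/second
two_hours = 2 * 60 * 60                                          # seconds

raw_total = raw_rate * two_hours

# MPEG-1 rate quoted on the slide: 187 kbytes per second.
mpeg1_rate = 187_000
mpeg1_total = mpeg1_rate * two_hours

print(f"raw rate:     {raw_rate / 1e6:.1f} mbytes/second")
print(f"raw 2 hours:  {raw_total / 1e9:.1f} gigabytes")
print(f"mpeg-1 total: {mpeg1_total / 1e9:.2f} gigabytes")
print(f"ratio:        {raw_rate / mpeg1_rate:.0f} to 1")
```

Under these assumptions the raw stream is reduced by a factor of roughly 126.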
8
Compression basics. Compression is very widely used: jpeg, gif for single images; mpeg1, 2, 3, 4 for video sequences; zip for computer data; mp3 for sound. It is based on two fundamental principles: spatial coherence (similarity with the spatial neighbor) and temporal coherence (similarity with the temporal neighbor).
9
Basics of compression. character = basic data unit in the input stream (represents a byte, bit, etc.); strings = sequences of characters; encoding = compression; decoding = decompression; codeword = data element used to represent input characters or character strings; codetable = list of codewords.
10
Codeword. The encoder/compressor takes characters/strings as input and uses the codetable to decide which codewords to produce. The decoder/decompressor takes codewords as input and uses the same codetable to decide which characters/strings to produce. [Figure: Input Data Stream → Encoder → Data Storage or Transmission → Decoder → Output Data Stream]
11
Codetable. Clearly the encoder must pass the encoded data to the decoder as a series of codewords. It must also pass the codetable. The codetable can be passed explicitly or implicitly; that is, we either pass it across, agree on it beforehand (hard wired), or recreate it from the codewords (clever!).
12
Basic definitions. Compression ratio = size of original data / size of compressed data; basically, the higher the compression ratio the better. Lossless compression: output data is exactly the same as input data; essential for encoding computer-processed data. Lossy compression: output data is not the same as input data; acceptable for data that is only viewed or heard.
13
Lossless versus lossy. The human visual system is less sensitive to high-frequency losses and to losses in color, so lossy compression is acceptable for visual data. The degree of loss is usually a parameter of the compression algorithm. Tradeoff, loss versus compression: higher compression => more loss; lower compression => less loss.
14
Symmetric versus asymmetric. Symmetric: encoding time == decoding time; essential for real-time applications (e.g. video or audio on demand). Asymmetric: encoding time >> decoding time; ok for write-once, read-many situations.
15
Entropy encoding. Compression that does not take into account what is being compressed; normally it is also lossless encoding. Most common types of entropy encoding: run length encoding, Huffman encoding, modified Huffman (fax…), Lempel Ziv.
16
Source encoding. Takes into account the type of data (e.g. visual); normally is lossy but can also be lossless. Most common types in use: JPEG, GIF = single images; MPEG = sequence of images (video); MP3 = sound sequence. Often uses entropy encoding as a subroutine.
17
Run length encoding. One of the simplest and earliest types of compression. Takes account of repeating data (called runs): runs are represented by a count along with the original data, e.g. AAAABB => 4A2B. Do you run length encode a single character? No; use a special prefix character to represent the start of runs.
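A minimal sketch of the count-plus-character form shown on this slide (AAAABB => 4A2B). It deliberately omits the prefix-character refinement, so it also illustrates why that refinement is wanted: isolated characters double in size. The function names are illustrative, and the decoder assumes the input text contains no digits.

```python
import re
from itertools import groupby


def rle_encode(text: str) -> str:
    """Replace each run with <count><character>, e.g. AAAABB -> 4A2B."""
    return "".join(f"{len(list(run))}{ch}" for ch, run in groupby(text))


def rle_decode(encoded: str) -> str:
    """Invert rle_encode; assumes the original text contained no digits."""
    pairs = re.findall(r"(\d+)(\D)", encoded)
    return "".join(ch * int(count) for count, ch in pairs)


print(rle_encode("AAAABB"))   # runs compress
print(rle_encode("ABC"))      # single characters expand
print(rle_decode("4A2B"))
```

Because every single character is written as a run of length one, text without repeats grows, which is the problem the special prefix character is meant to avoid.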
18
Run length encoding runs are represented as prefix char itself becomes 1 want a prefix char that is not too common an example early use is MacPaint file format run length encoding is lossless and has fixed length codewords
19
MacPaint File Format
20
Run length encoding. Works best for images with a solid background; a good example of such an image is a cartoon. Does not work as well for natural images, and does not work well for English text. However, it is almost always a part of a larger compression system.
21
What if … the string we encode doesn’t use all the letters in the alphabet? Then ceil(log2(|set_of_characters_used|)) bits per character are enough. But then we also need to store / transmit the mapping from encodings to characters … and |set_of_characters_used| is typically close to the size of the alphabet.
22
Huffman Encoding: assumes encoding on a per-character basis. Observation: assigning shorter codes to frequently used characters can result in overall shorter encodings of strings, but requires assigning longer codes to rarely used characters.
23
Huffman Encoding Problem: when decoding, need to know how many bits to read off for each character. Solution: Choose an encoding that ensures that no character encoding is the prefix of any other character encoding. An encoding tree has this property.
24
Huffman encoding. Assume we know the frequency of each character in the input stream. Then encode each character as a variable-length bit string, with more frequent characters getting shorter strings (length roughly inversely related to frequency). Variable-length codewords are an old idea; an early example is Morse code. Huffman produced an algorithm for assigning codewords optimally.
25
Huffman encoding. Input = probabilities of occurrence of each input character (frequencies of occurrence). Output is a binary tree: each leaf node is an input character; each branch is a zero or one bit; the codeword for a leaf is the concatenation of bits along the path from the root to the leaf. A codeword is a variable-length bit string. A very good compression ratio (optimal)?
26
Huffman encoding: basic algorithm. (1) Mark all characters as free tree nodes. (2) While there is more than one free node: take the two nodes with the lowest frequency of occurrence; create a new tree node with these nodes as children and with frequency equal to the sum of their frequencies; remove the two children from the free node list; add the new parent to the free node list. (3) The last remaining free node is the root of the binary tree used for encoding/decoding.
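The basic algorithm can be sketched with a priority queue standing in for the free node list. This is one possible implementation, not the one used in the course: the function name `huffman_codes`, the tuple representation of internal nodes, and the tie-breaking counter are all choices made for this sketch. The frequencies are those of the A, T, R, N, E tree shown on the following slides.

```python
import heapq
from itertools import count


def huffman_codes(freqs: dict[str, int]) -> dict[str, str]:
    """Build a Huffman tree from symbol frequencies and return the codes."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    # Free node list: (frequency, tiebreak, node). Leaves are symbols,
    # internal nodes are (left, right) tuples.
    free = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(free)
    if len(free) == 1:  # degenerate case: a single symbol still needs one bit
        return {free[0][2]: "0"}
    while len(free) > 1:
        f1, _, left = heapq.heappop(free)    # two lowest-frequency nodes
        f2, _, right = heapq.heappop(free)
        heapq.heappush(free, (f1 + f2, next(tiebreak), (left, right)))
    root = free[0][2]

    codes: dict[str, str] = {}

    def walk(node, prefix: str) -> None:
        if isinstance(node, tuple):          # internal node: 0 left, 1 right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: record the path as its code
            codes[node] = prefix

    walk(root, "")
    return codes


freqs = {"A": 3, "T": 2, "R": 3, "N": 4, "E": 9}
codes = huffman_codes(freqs)
for sym in sorted(codes, key=lambda s: (len(codes[s]), s)):
    print(sym, codes[sym])
```

The exact bit patterns depend on how ties are broken and on which child is labeled 0, so they may differ from the slides; the code lengths (one bit for E, three bits for the others) do not.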
27
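A Huffman Encoding Tree. [Figure: binary tree with each left branch labeled 0 and each right branch labeled 1. The root (21) has children 12 and E (9); node 12 has children 5 and 7; node 5 has leaves A (3) and T (2); node 7 has leaves R (3) and N (4).]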
28
[Figure: the same tree, with the resulting codes.] A = 000, T = 001, R = 010, N = 011, E = 1
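Because no code in this table is a prefix of another, a decoder can read bits one at a time and emit a character as soon as the bits read so far match a codeword. A small sketch using the table from the slide; the function names and the sample word are illustrative.

```python
CODES = {"A": "000", "T": "001", "R": "010", "N": "011", "E": "1"}


def encode(text: str, codes: dict[str, str] = CODES) -> str:
    """Concatenate the codeword of each character."""
    return "".join(codes[ch] for ch in text)


def decode(bits: str, codes: dict[str, str] = CODES) -> str:
    """Read bits left to right, emitting a character at each codeword match."""
    by_code = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in by_code:      # safe only because the code is prefix-free
            out.append(by_code[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


bits = encode("TREE")
print(bits)
print(decode(bits))
```

The word TREE takes 8 bits here, against 12 bits with a fixed three-bit code for the five-letter alphabet.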
29
Weighted path length. Codes: A = 000, T = 001, R = 010, N = 011, E = 1. Weighted path = Len(code(A)) * f(A) + Len(code(T)) * f(T) + Len(code(R)) * f(R) + Len(code(N)) * f(N) + Len(code(E)) * f(E) = (3 * 3) + (3 * 2) + (3 * 3) + (3 * 4) + (1 * 9) = 9 + 6 + 9 + 12 + 9 = 45. Claim (proof in text): no other encoding can result in a shorter weighted path length.
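The weighted path length is simply the total number of bits needed to encode a text with these character counts. A short check of the arithmetic, with a fixed three-bit code added for comparison (the comparison is not on the slide):

```python
codes = {"A": "000", "T": "001", "R": "010", "N": "011", "E": "1"}
freqs = {"A": 3, "T": 2, "R": 3, "N": 4, "E": 9}

# Sum over characters of Len(code(c)) * f(c).
weighted_path = sum(len(codes[c]) * freqs[c] for c in codes)

# A fixed-length code for five characters needs three bits per character.
fixed_length_total = 3 * sum(freqs.values())

print(weighted_path)
print(fixed_length_total)
```

For these frequencies the variable-length code needs 45 bits, where the fixed-length code needs 63.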
30
Building the Huffman Tree. [Figure: four free nodes: A (3), T (4), R (4), E (5).]
31
[Figure: A (3) and T (4) are joined under a new node of weight 7; R (4) and E (5) are still free.]
32
[Figure: the free nodes are now R (4), E (5), and the subtree of weight 7 over A (3) and T (4).]
33
[Figure: R (4) and E (5) are joined under a new node of weight 9.]
34
[Figure: the free nodes are now the subtree of weight 7 over A and T and the subtree of weight 9 over R and E.]
35
[Figure: the subtrees of weight 7 and 9 are joined under the root, of weight 16.]
36
Building the Huffman Tree. [Figure: the finished tree with each branch labeled 0 or 1, giving the codes A = 00, T = 01, R = 10, E = 11.]
37
Huffman example. A series of colors in an 8 by 8 screen; the colors are red, green, cyan, blue, magenta, yellow, and black. The sequence is: rkkkkkkk gggmcbrr kkkrrkkk bbbmybbr kkrrrrgg gggggggr kkbcccrr grrrrgrr
38
Another Huffman example
Color         Frequency
Black (K)     19
Red (R)       17
Green (G)     16
Blue (B)      5
Cyan (C)      4
Magenta (M)   2
Yellow (Y)    1
39
Another Huffman Example
40
Another Huffman example, cont’d
41
Huffman example, cont’d. Red = 00, Black = 01, Green = 10, Blue = 111, Cyan = 1100, Magenta = 11010, Yellow = 11011
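As a check on this code table, the sketch below combines it with the frequency table from the earlier slide (Black 19, Red 17, Green 16, Blue 5, Cyan 4, Magenta 2, Yellow 1), verifies the prefix property, and compares the total size with a fixed three-bit code for seven colors. The single-letter keys and the helper name `is_prefix_free` are choices made for this sketch.

```python
codes = {"R": "00", "K": "01", "G": "10", "B": "111",
         "C": "1100", "M": "11010", "Y": "11011"}
freqs = {"K": 19, "R": 17, "G": 16, "B": 5, "C": 4, "M": 2, "Y": 1}


def is_prefix_free(table: dict[str, str]) -> bool:
    """True if no codeword is a prefix of a different codeword."""
    words = list(table.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)


huffman_bits = sum(len(codes[c]) * freqs[c] for c in freqs)
fixed_bits = 3 * sum(freqs.values())      # seven colors need 3 bits each

print(is_prefix_free(codes))
print(huffman_bits, fixed_bits)
print(f"compression ratio: {fixed_bits / huffman_bits:.2f}")
```

With these numbers the 64 pixels take 150 bits instead of 192, a compression ratio of 1.28 relative to the fixed-length code.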
42
Fixed versus variable length codewords. Run length codewords are fixed length; Huffman codewords are variable length, with length inversely related to frequency. A variable-length code needs the prefix property to be decoded unambiguously without separators: one code cannot be the prefix of another. The binary tree structure guarantees that this is the case (a leaf node is a leaf node!).
43
Huffman encoding. Advantages: maximum compression ratio among per-character codes, assuming correct probabilities of occurrence; easy to implement and fast. Disadvantages: need two passes for both encoder and decoder, one to create the frequency distribution and one to encode/decode the data; can avoid this by sending the tree (takes time) or by having unchanging frequencies.
44
Modified Huffman encoding. If we know the frequency of occurrences, then Huffman works very well. Consider the case of a fax: mostly long white spaces with short bursts of black. Do the following: run length encode each string of bits on a line; Huffman encode these run length codewords; use a predefined frequency distribution. Combination: run length, then Huffman.
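The first stage of this combination can be sketched as follows. The scan line is made up for illustration (0 = white, 1 = black), and the sketch stops at the run-length stage: a real fax coder would then look each run up in a predefined code table, which is not reproduced here.

```python
from collections import Counter
from itertools import groupby


def run_lengths(line: str) -> list[tuple[str, int]]:
    """Split a line of bits into (bit, run length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(line)]


# Hypothetical scan line: long white runs with short bursts of black.
line = "0" * 20 + "1" * 3 + "0" * 20 + "1" * 3 + "0" * 18

runs = run_lengths(line)
run_counts = Counter(runs)   # these runs are the "characters" Huffman encodes

print(runs)
print(run_counts.most_common())
```

The runs, not the individual bits, become the symbols for the Huffman stage, so common run lengths receive short codewords.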
45
Beyond Huffman Coding … 1977 – Lempel & Ziv, Israeli information theorists, develop a dictionary-based compression method (LZ77) 1978 – they develop another dictionary-based compression method (LZ78) … coming soon ….