Source Coding
Hafiz Malik
Dept. of Electrical & Computer Engineering
The University of Michigan-Dearborn
Overview
[Block diagram of a digital communication transmitter: Information Source, Source Encode, Channel Encode, Pulse Modulate, Encrypt, Bandpass Modulate, Multiplex, Multiple Access, Frequency Spreading, XMT]
Source Coding – eliminate redundancy in the data; send the same information in fewer bits
Channel Coding – detect/correct errors in signaling and improve BER
What is information?
– A student slips on ice in winter in Southfield, MI
– A professor slips on ice, breaks an arm, class is cancelled
– A student slips on ice in July in Southfield, MI
Which statement contains the most information? The statement with the greatest unpredictability conveys the most new information and surprise.
Information is quantified as I = -log(Pr{event}): as Pr{event} → 1, I → 0; and as Pr{event} → 0, I → ∞.
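A minimal sketch of this idea in Python, assuming illustrative probabilities for the three events (the slide gives no numeric values):

    import math

    def self_information(p):
        """Self-information in bits: I = -log2(p)."""
        return -math.log2(p)

    # Hypothetical probabilities, chosen only to illustrate the trend.
    events = {
        "slips on ice in winter": 0.1,                  # common: little surprise
        "slips, breaks arm, class cancelled": 0.01,
        "slips on ice in July": 0.0001,                 # rare: highly informative
    }
    for desc, p in events.items():
        print(f"{desc}: I = {self_information(p):.2f} bits")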
Measure of Randomness
Entropy: average information per source output, H = -Σ_i p_i log2(p_i)
[Plot: entropy H (bits) versus probability p for a binary source]
Suppose we have two events E1 and E2 with Pr{E1} = p and Pr{E2} = q, where E1 ∪ E2 = S, so p = 1 - q. Then H(p) = -p log2(p) - (1-p) log2(1-p).
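A short sketch of the binary entropy function, showing it peaks at p = 0.5 as in the plot described above:

    import math

    def binary_entropy(p):
        """H(p) = -p log2 p - (1-p) log2 (1-p), in bits/event."""
        if p in (0.0, 1.0):
            return 0.0  # limit of x log x as x -> 0 is 0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    for p in (0.01, 0.1, 0.25, 0.5, 0.75, 0.99):
        print(f"p = {p:4.2f}: H = {binary_entropy(p):.3f} bits")
    # Maximum H = 1 bit occurs at p = 0.5 (greatest uncertainty).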
Comments on Entropy
When the log is taken to base 2, the unit of H is bits/event; for base e, the unit is nats.
Entropy is maximized by a uniformly distributed random variable.
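A quick sketch comparing a uniform and a skewed four-symbol source; the uniform case attains the maximum log2(4) = 2 bits:

    import math

    def entropy(probs):
        """H = -sum of p log2 p over nonzero probabilities, in bits."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.000 bits (maximum)
    print(entropy([0.70, 0.15, 0.10, 0.05]))  # ~1.319 bits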
Example: Average Information Content in the English Language
Calculate the average information in bits/character in English, assuming each letter is equally likely.
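With 26 equally likely letters, each with probability 1/26:

    H = -Σ (1/26) log2(1/26) = log2(26) ≈ 4.70 bits/character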
Real-world English
English characters do not appear with equal frequency in real-world English text. Assume the character probabilities are:
Pr{a} = Pr{e} = Pr{o} = Pr{t} = 0.10
Pr{h} = Pr{i} = Pr{n} = Pr{r} = Pr{s} = 0.07
Pr{c} = Pr{d} = Pr{f} = Pr{l} = Pr{m} = Pr{p} = Pr{u} = Pr{y} = 0.02
Pr{b} = Pr{g} = Pr{j} = Pr{k} = Pr{q} = Pr{v} = Pr{w} = Pr{x} = Pr{z} = 0.01
(These sum to 4(0.10) + 5(0.07) + 8(0.02) + 9(0.01) = 1.)
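A sketch computing the entropy of this distribution; it comes out near 4.17 bits/character, noticeably below the 4.70 bits of the equally-likely case:

    import math

    probs = [0.10] * 4 + [0.07] * 5 + [0.02] * 8 + [0.01] * 9
    assert abs(sum(probs) - 1.0) < 1e-9   # valid probability distribution

    H = -sum(p * math.log2(p) for p in probs)
    print(f"H = {H:.2f} bits/character")  # ~4.17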
Example
Binary sequence X with P(X=1) = P(X=0) = 1/2
Pb = 0.01 (1 error per 100 bits)
R = 1000 binary symbols/sec
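The slide gives only the setup; a plausible completion (an assumption here, since the worked solution is not shown) computes the source information rate and the loss due to the bit-error rate:

    import math

    def H2(p):
        """Binary entropy in bits."""
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    R, Pb = 1000, 0.01
    Hx = H2(0.5)              # 1 bit/symbol for an equiprobable binary source
    print(R * Hx)             # source information rate: 1000 bits/s
    print(R * (Hx - H2(Pb)))  # rate after equivocation from errors: ~919 bits/s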
Source Coding
The goal is to find an efficient description of information sources:
– Reduce required bandwidth
– Reduce memory needed for storage
Memoryless – symbols from the source are independent; one symbol does not depend on the next
Memory – elements of a sequence depend on one another, e.g. UNIVERSIT_?; a 10-tuple with memory contains less information than 10 independent symbols, since the symbols are dependent
Source Coding (II)
This means it is more efficient to code information sources with memory as groups of symbols.
Desirable Properties
Length
– Fixed length – ASCII
– Variable length – Morse code, JPEG
Uniquely decodable – allows the user to invert the mapping back to the original
Prefix-free – no codeword is a prefix of any other codeword
Average code length: n̄ = Σ_i P(X_i) n_i, where n_i is the code length of the i-th symbol
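A small sketch of the average-code-length formula (the symbol probabilities and codeword lengths here are illustrative):

    def avg_code_length(probs, lengths):
        """n_bar = sum over i of P(X_i) * n_i."""
        return sum(p * n for p, n in zip(probs, lengths))

    # e.g. three symbols with codeword lengths 1, 2, 2
    print(avg_code_length([0.73, 0.25, 0.02], [1, 2, 2]))  # 1.27 bits/symbol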
Uniquely Decodable and Prefix-Free Codes
Uniquely decodable?
– Not code 1
– If "10111" is sent, is code 3 'babbb' or 'bacb'? Not code 3 or 6
Prefix-free?
– Not code 4 – the codeword '1' is a prefix of another codeword
Average code length
– Code 2: n̄ = 2
– Code 5: n̄ = 1.23
[Table: symbols a (P = 0.73), b (P = 0.25), c (P = 0.02) and their codewords under Codes 1–6; the codeword entries are not recoverable from this transcript]
Huffman Code
Characteristics of Huffman codes:
– A prefix-free, variable-length code that achieves the shortest average code length for an alphabet
– The most frequent symbols have the shortest codes
Procedure:
– List all symbols and probabilities in descending order
– Merge the two branches with the lowest probabilities, combining their probabilities
– Repeat until one branch is left
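A compact sketch of this procedure in Python (heap-based; the bit assignments may differ from the examples below by relabeling, since Huffman codes are not unique):

    import heapq
    import itertools

    def huffman_code(probs):
        """Build a Huffman code. probs: dict symbol -> probability.
        Returns dict symbol -> codeword string."""
        counter = itertools.count()  # tie-breaker so heapq never compares dicts
        # Each heap entry: (branch probability, tiebreak, {symbol: codeword so far})
        heap = [(p, next(counter), {sym: ""}) for sym, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, code1 = heapq.heappop(heap)  # two lowest-probability branches
            p2, _, code2 = heapq.heappop(heap)
            # Prepend a bit to every codeword in each merged branch
            merged = {s: "0" + c for s, c in code1.items()}
            merged.update({s: "1" + c for s, c in code2.items()})
            heapq.heappush(heap, (p1 + p2, next(counter), merged))
        return heap[0][2]

    print(huffman_code({"a": 0.73, "b": 0.25, "c": 0.02}))
    # {'a': '1', 'b': '01', 'c': '00'} -- codeword lengths 1, 2, 2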
Huffman Code Example
Symbols: a, b, c, d, e, f
The code:
a → 11
b → 00
c → 101
d → 100
e → 011
f → 010
Compression ratio: 3.0/2.4 = 1.25
Entropy: 2.32 bits
Example: Consider a random vector X = {a, b, c} with the associated probabilities listed in the table.
– Calculate the entropy of this symbol set
– Find the Huffman code for this symbol set
– Find the compression ratio and efficiency of this code

Symbol  P(X_i)
a       0.73
b       0.25
c       0.02
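A sketch of the solution: the Huffman procedure on these three symbols yields codeword lengths 1, 2, 2, and the uncompressed baseline is taken as the 2-bit fixed-length code needed for 3 symbols (an assumption, since the slide does not show the worked answer):

    import math

    probs = [0.73, 0.25, 0.02]
    lengths = [1, 2, 2]          # Huffman lengths: a -> 1 bit; b, c -> 2 bits

    H = -sum(p * math.log2(p) for p in probs)
    n_bar = sum(p * n for p, n in zip(probs, lengths))

    print(f"entropy     H     = {H:.3f} bits")      # ~0.944
    print(f"avg length  n_bar = {n_bar:.2f} bits")  # 1.27
    print(f"compression ratio = {2 / n_bar:.2f}")   # ~1.57 vs 2-bit fixed code
    print(f"efficiency  H/n   = {H / n_bar:.2f}")   # ~0.74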
Extension Codes
Combine alphabet symbols to increase variability. Try to combine very common 2- and 3-letter combinations, e.g.: th, sh, ed, and, the, ing, ion
[Table: pair symbols aa, ab, ba, bb, ac, ca, bc, cb, cc with columns P(X_i), Code, n_i, and n_i·P(X_i); the values are not recoverable from this transcript]
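A sketch of the idea for the {a, b, c} source above, assuming independent symbols so pair probabilities multiply; Huffman-coding the pairs brings the per-symbol average length closer to the entropy:

    import math

    p = {"a": 0.73, "b": 0.25, "c": 0.02}
    # Pair probabilities under the independence assumption
    pairs = {x + y: p[x] * p[y] for x in p for y in p}

    H_pair = -sum(q * math.log2(q) for q in pairs.values())
    print(f"H per pair   = {H_pair:.3f} bits")   # = 2 * 0.944
    print(f"H per symbol = {H_pair / 2:.3f} bits")
    # Huffman-coding these 9 pair symbols gives roughly 0.97 bits per
    # original symbol, below the 1.27 bits of the single-symbol code.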
Lempel-Ziv (ZIP) Codes
Huffman codes have some shortcomings:
– Symbol probabilities must be known a priori
– The coding tree must be known at both coder and decoder
The Lempel-Ziv algorithm uses the text itself to iteratively construct a sequence of variable-length codewords.
Used in gzip, UNIX compress, and the LZW algorithm.
Lempel-Ziv Algorithm
Parse the input against a dictionary of already-coded segments (a code sketch follows the example on the next slide):
– If the current segment matches a dictionary entry, extend it with the next source character; store the extended segment in the dictionary and send (address of matched segment, new character)
– If there is no match, store the single character in the dictionary and send (0, character)
LZ Coding: Example
Encode [a b a a b a b b b b b b b a b b b b b a]

Code Dictionary
Address  Contents
1        a
2        b
3        aa
4        ba
5        bb
6        bbb
7        bba
8        bbbb

Encoded packets (prefix address, new character):
(0,a) (0,b) (1,a) (2,a) (2,b) (5,b) (5,a) (6,b) (4,–)

Note: 9 codewords, each with a 3-bit address and 1 bit for the new character
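A sketch of an LZ78-style encoder that reproduces this example (emitting the final, already-seen segment as an address with no new character is one common convention, assumed here):

    def lz78_encode(symbols):
        """LZ78 parse: emit (prefix address, new char) packets.
        Address 0 means the empty prefix."""
        dictionary = {}          # segment -> address
        packets = []
        segment = ""
        for ch in symbols:
            candidate = segment + ch
            if candidate in dictionary:
                segment = candidate          # keep extending the match
            else:
                dictionary[candidate] = len(dictionary) + 1
                packets.append((dictionary.get(segment, 0), ch))
                segment = ""
        if segment:                          # leftover segment already in dictionary
            packets.append((dictionary[segment], None))
        return packets, dictionary

    packets, d = lz78_encode("abaababbbbbbbabbbbba")
    print(packets)  # [(0,'a'), (0,'b'), (1,'a'), (2,'a'), (2,'b'),
                    #  (5,'b'), (5,'a'), (6,'b'), (4, None)]
    print(d)        # {'a':1, 'b':2, 'aa':3, 'ba':4, 'bb':5,
                    #  'bbb':6, 'bba':7, 'bbbb':8}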
Notes on Lempel-Ziv Codes
Compression gains occur only for long sequences; on the short example above, the 9 packets × 4 bits = 36 bits exceed the 20 uncompressed bits. Other variations exist that encode data using packets and run-length encoding, as sketched below.
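A minimal run-length encoding sketch, included only to illustrate the variation the slide mentions (not taken from the slides):

    def run_length_encode(s):
        """Collapse runs of repeated symbols into (symbol, count) packets."""
        packets = []
        i = 0
        while i < len(s):
            j = i
            while j < len(s) and s[j] == s[i]:
                j += 1                       # extend the current run
            packets.append((s[i], j - i))
            i = j
        return packets

    print(run_length_encode("abaababbbbbbbabbbbba"))
    # [('a',1), ('b',1), ('a',2), ('b',1), ('a',1),
    #  ('b',7), ('a',1), ('b',5), ('a',1)]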