Presentation on theme: "Coding and Entropy" (Harvard QR48, February 3, 2010). Presentation transcript:

1 Coding and Entropy (Harvard QR48, February 3, 2010)

2 Squeezing out the "Air"
Suppose you want to ship pillows in boxes and are charged by the size of the box.
Lossless data compression.
Entropy = lower limit of compressibility.

3 Claude Shannon (1916-2001)
"A Mathematical Theory of Communication" (1948)

4 Communication over a Channel
Source S (symbols) -> coded bits X -> channel -> received bits Y -> decoded message T (symbols)
Encode bits before putting them in the channel; decode bits when they come out of the channel.
E.g., the transformation from S into X changes "yea" -> 1 and "nay" -> 0; changing Y into T does the reverse.
For now, assume no noise in the channel, i.e. X = Y.

5 Example: Telegraphy
Source: English letters -> Morse code, e.g. D -> -..
The letter D entered in Baltimore is sent as -.. over the wire and decoded back to D in Washington.

6 Low and High Information Content Messages
The more frequent a message is, the less information it conveys when it occurs.
Two weather forecast messages, one for Boston and one for LA. [Figure: forecast frequencies for Boston vs. LA]
In LA, "sunny" is a low-information message and "cloudy" is a high-information message.

7 Harvard Grades
Less information in Harvard grades now than in the recent past.

  Year   A    A-   B+   B    B-   C+   (% of grades)
  2005   24   25   21   13   6    2
  1995   21   23   20   14   8    3
  1986   14   19   21   17   10   5

8 Fixed Length Codes (Block Codes)
Example: 4 symbols A, B, C, D: A=00, B=01, C=10, D=11.
In general, with n symbols, codes need to be of length lg n, rounded up.
For English text, 26 letters + space = 27 symbols, so the length is 5, since 2^4 < 27 < 2^5 (replace all punctuation marks by space).
AKA "block codes".
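A minimal sketch (not from the slides) of generating such a fixed-length code; the function name `block_code` is made up for illustration:

```python
import math

def block_code(symbols):
    """Assign every symbol a fixed-length binary codeword of length ceil(lg n)."""
    length = math.ceil(math.log2(len(symbols)))
    return {s: format(i, f"0{length}b") for i, s in enumerate(symbols)}

print(block_code("ABCD"))                 # {'A': '00', 'B': '01', 'C': '10', 'D': '11'}
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # 26 letters + space = 27 symbols
print(len(block_code(alphabet)["A"]))     # 5, since 2^4 < 27 < 2^5
```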

9 Modeling the Message Source
Characteristics of the stream of messages coming from the source affect the choice of the coding method.
We need a model for a source of English text that can be described and analyzed mathematically. [Figure: Source -> Destination]

10 How can we improve on block codes?
Simple 4-symbol example: A, B, C, D.
If that is all we know, we need 2 bits/symbol.
What if we know symbol frequencies? Use shorter codes for more frequent symbols. Morse code does something like this.
Example:

  Symbol     A    B    C    D
  Frequency  .7   .1   .1   .1
  Code       0    100  101  110

11 Prefix Codes
No codeword is a prefix of another, so there is only one way to decode left to right.

  Symbol     A    B    C    D
  Frequency  .7   .1   .1   .1
  Code       0    100  101  110
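The left-to-right decoding rule can be sketched in a few lines of Python (an illustration, not the course's code; the function name `decode_prefix` is invented):

```python
def decode_prefix(bits, code):
    """Decode a bit string left to right; `code` maps symbol -> codeword."""
    reverse = {cw: sym for sym, cw in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in reverse:      # in a prefix code this match is unambiguous
            out.append(reverse[current])
            current = ""
    return "".join(out)

code = {"A": "0", "B": "100", "C": "101", "D": "110"}   # the code from this slide
print(decode_prefix("0100101110", code))                # -> ABCD
```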

12 Minimum Average Code Length?
Average bits per symbol:

  Symbol     A    B    C    D
  Frequency  .7   .1   .1   .1
  Code 1     0    100  101  110   average = .7·1 + .1·3 + .1·3 + .1·3 = 1.6
  Code 2     0    10   110  111   average = .7·1 + .1·2 + .1·3 + .1·3 = 1.5

1.5 bits/symbol (down from 2).
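A small sketch (an illustration, not part of the slides) that just checks the two averages above:

```python
freqs = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
code1 = {"A": "0", "B": "100", "C": "101", "D": "110"}
code2 = {"A": "0", "B": "10",  "C": "110", "D": "111"}

def average_length(freqs, code):
    """Expected bits per symbol: sum over symbols of frequency * codeword length."""
    return sum(p * len(code[s]) for s, p in freqs.items())

print(average_length(freqs, code1))   # ~1.6
print(average_length(freqs, code2))   # ~1.5
```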

13 Entropy of this source <= 1.5 bits/symbol, since this code achieves that average:

  Symbol     A    B    C    D
  Frequency  .7   .1   .1   .1
  Code       0    10   110  111

Average = .7·1 + .1·2 + .1·3 + .1·3 = 1.5
Possibly lower? How low?

14 Self-Information
If a symbol S has frequency p, its self-information is H(S) = lg(1/p) = -lg p.

  S      A     B     C     D
  p      .25   .25   .25   .25
  H(S)   2     2     2     2

  S      A     B     C     D
  p      .7    .1    .1    .1
  H(S)   .51   3.32  3.32  3.32

15 First-Order Entropy of Source = Average Self-Information

  S         A      B      C      D      -∑ p·lg p
  p         .25    .25    .25    .25
  -lg p     2      2      2      2
  -p·lg p   .5     .5     .5     .5     2

  S         A      B      C      D
  p         .7     .1     .1     .1
  -lg p     .51    3.32   3.32   3.32
  -p·lg p   .357   .332   .332   .332   1.353
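A short sketch (not from the course) of this computation; the function name is arbitrary:

```python
import math

def first_order_entropy(freqs):
    """First-order entropy H = -sum(p * lg p), in bits per symbol."""
    return -sum(p * math.log2(p) for p in freqs.values())

print(first_order_entropy({"A": .25, "B": .25, "C": .25, "D": .25}))  # 2.0
print(first_order_entropy({"A": .7, "B": .1, "C": .1, "D": .1}))      # ~1.357 (1.353 on the slide, which rounds -lg p first)
```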

16 Entropy, Compressibility, Redundancy
Lower entropy -> more redundant -> more compressible -> less information.
Higher entropy -> less redundant -> less compressible -> more information.
A source of "yea"s and "nay"s takes 24 bits per symbol but contains at most one bit per symbol of information:
010110010100010101000001 = yea
010011100100000110101001 = nay

17 Entropy and Compression

  Symbol     A    B    C    D
  Frequency  .7   .1   .1   .1
  Code       0    10   110  111

Average length for this code = .7·1 + .1·2 + .1·3 + .1·3 = 1.5.
No code taking only symbol frequencies into account can be better than the first-order entropy.
First-order entropy of this source = .7·lg(1/.7) + .1·lg(1/.1) + .1·lg(1/.1) + .1·lg(1/.1) = 1.353.
First-order entropy of English is about 4 bits/character, based on "typical" English texts.
"Efficiency" of code = (entropy of source) / (average code length) = 1.353/1.5 = 90%.
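Putting the two previous sketches together gives the efficiency figure directly (illustrative only, with the same hypothetical names):

```python
import math

freqs = {"A": .7, "B": .1, "C": .1, "D": .1}
code  = {"A": "0", "B": "10", "C": "110", "D": "111"}

entropy    = -sum(p * math.log2(p) for p in freqs.values())
avg_length = sum(p * len(code[s]) for s, p in freqs.items())
print(f"efficiency = {entropy / avg_length:.0%}")   # ~90%
```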

18 A Simple Prefix Code: Huffman Codes
Suppose we know the symbol frequencies. We can calculate the (first-order) entropy. Can we design a code to match?
There is an algorithm that transforms a set of symbol frequencies into a variable-length prefix code whose average code length is approximately equal to the entropy.
David Huffman, 1951.

19 Huffman Code Example
Symbol frequencies: A .35, B .05, C .2, D .15, E .25.
Repeatedly merge the two least frequent groups:
B (.05) + D (.15) = BD (.2); BD (.2) + C (.2) = BCD (.4); A (.35) + E (.25) = AE (.6); BCD (.4) + AE (.6) = ABCDE (1.0).

20 Huffman Code Example (continued)
Labeling the two branches of each merge 0 and 1 gives the codewords:
A = 00, B = 100, C = 11, D = 101, E = 01
Entropy = 2.12, average length = 2.20.
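A compact sketch of the merging procedure using Python's heapq (an illustration, not the course's code; ties are broken arbitrarily, so the exact codewords may differ from the slide's while the code lengths, and hence the average length, come out the same):

```python
import heapq

def huffman_code(freqs):
    """Build a Huffman code by repeatedly merging the two least frequent subtrees."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)               # unique tie-breaker so dicts are never compared
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}    # 0 branch
        merged.update({s: "1" + c for s, c in right.items()})  # 1 branch
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

freqs = {"A": .35, "B": .05, "C": .2, "D": .15, "E": .25}
code = huffman_code(freqs)
avg = sum(p * len(code[s]) for s, p in freqs.items())
print(code, avg)   # average length 2.20, matching the slide
```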

21 Efficiency of Huffman Codes
Huffman codes are as efficient as possible if only first-order information (symbol frequencies) is taken into account.
The average length of a Huffman code is always within 1 bit/symbol of the entropy.

22 Second-Order Entropy
The second-order entropy of a source is calculated by treating digrams (pairs of adjacent symbols) as single symbols, according to their frequencies.
Occurrences of q and u are not independent, so it is helpful to treat qu as one symbol.
The second-order entropy of English is about 3.3 bits/character.
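An empirical sketch (not from the course) of estimating first- and second-order entropy per character from a sample string; the sample text and function name are arbitrary:

```python
import math
from collections import Counter

def entropy_per_char(text, block=1):
    """Empirical entropy in bits per character, treating `block` adjacent
    characters as one symbol (block=2 uses digram frequencies)."""
    chunks = [text[i:i + block] for i in range(len(text) - block + 1)]
    counts = Counter(chunks)
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / block

sample = "the quick brown fox jumps over the lazy dog " * 50
print(entropy_per_char(sample, 1))   # first-order estimate
print(entropy_per_char(sample, 2))   # digram estimate, lower when adjacent letters are correlated
```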

23 How English Would Look Based on frequencies alone
0: xfoml rxkhrjffjuj zlpwcfwkcyj ffjeyvkcqsghyd qpaamkbzaacibzlhjqd
1: ocroh hli rgwr nmielwis eu ll nbnesebya th eei alhenhttpa oobttva
2: On ie antsoutinys are t inctore st be s deamy achin d ilonasive tucoowe at
3: IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA

24 How English Would Look Based on word frequencies
1) REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE
2) THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED

25 What is the entropy of English?
Entropy is the "limit" of the information per symbol using single symbols, digrams, trigrams, ...
Not really calculable, because English is a finite language!
Nonetheless it can be determined experimentally using Shannon's game.
Answer: a little more than 1 bit/character.

26 Shannon's Remarkable 1948 Paper

27 Shannon's Source Coding Theorem
No code can achieve efficiency greater than 1, but for any source, there are codes with efficiency as close to 1 as desired.
The proof does not give a method to find the best codes. It just sets a limit on how good they can be.

28 Huffman coding used widely
E.g., JPEGs use Huffman codes for the pixel-to-pixel changes in color values.
Colors usually change gradually, so there are many small numbers (0, 1, 2) in this sequence.
JPEGs sometimes use a fancier compression method called "arithmetic coding".
Arithmetic coding produces 5% better compression.

29 Why don't JPEGs use arithmetic coding?
Because it is patented by IBM.
United States Patent 4,905,297 (Langdon, Jr., et al., February 27, 1990): "Arithmetic coding encoder and decoder system."
Abstract: Apparatus and method for compressing and de-compressing binary decision data by arithmetic coding and decoding wherein the estimated probability Qe of the less probable of the two decision events, or outcomes, adapts as decisions are successively encoded. To facilitate coding computations, an augend value A for the current number line interval is held to approximate …
What if Huffman had patented his code?

