1
Information Theory Rong Jin
2
Outline: Information; Entropy; Mutual information; Noisy channel model
3
Information
Information ≠ knowledge. Information is a reduction in uncertainty.
Example: 1. flip a coin; 2. roll a die. Outcome #2 is more uncertain than #1, so more information is provided by the outcome of #2 than by #1.
4
Definition of Information
Let E be some event that occurs with probability P(E). If we are told that E has occurred, then we say we have received I(E) = log2(1/P(E)) bits of information.
Example: result of a fair coin flip (log2 2 = 1 bit); result of a fair die roll (log2 6 = 2.585 bits).
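A minimal sketch of this definition in Python (the helper name self_information is ours, for illustration only):

```python
import math

def self_information(p: float) -> float:
    """I(E) = log2(1 / P(E)): bits of information received when an event of probability p occurs."""
    return math.log2(1.0 / p)

print(self_information(1 / 2))   # fair coin flip -> 1.0 bit
print(self_information(1 / 6))   # fair die roll  -> 2.585 bits
```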
5
Information is Additive
I(k fair coin tosses) = log2(2^k) = k bits.
Example: information conveyed by words.
A random word from a 100,000-word vocabulary: I(word) = log2(100,000) = 16.6 bits.
A 1000-word document from the same source: I(document) = 16,600 bits.
A 480x640-pixel, 16-greyscale picture: I(picture) = 307,200 * log2(16) = 1,228,800 bits.
A picture is worth more than a 1000 words!
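To see where the slide's numbers come from, here is a small back-of-the-envelope check in Python (variable names are ours):

```python
import math

bits_per_word = math.log2(100_000)            # ~16.6 bits per random word
bits_per_document = 1_000 * bits_per_word     # ~16,600 bits for a 1000-word document
bits_per_picture = 480 * 640 * math.log2(16)  # 307,200 pixels * 4 bits = 1,228,800 bits

# How many "words" is a picture worth at this rate?
print(bits_per_picture / bits_per_word)       # ~74,000, i.e. far more than 1000
```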
9
Outline: Information; Entropy; Mutual Information; Cross Entropy and Learning
10
Entropy
A zero-memory information source S is a source that emits symbols from an alphabet {s1, s2, …, sk} with probabilities {p1, p2, …, pk}, respectively, where the symbols emitted are statistically independent.
What is the average amount of information in observing the output of the source S? Call this the entropy:
H(S) = Σi pi log2(1/pi) = −Σi pi log2 pi
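A short sketch of the entropy of a zero-memory source, assuming the usual convention that symbols with zero probability contribute nothing:

```python
import math

def entropy(probs):
    """H(S) = sum_i p_i * log2(1 / p_i): average bits of information per emitted symbol."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([1/6] * 6))    # fair die:  2.585 bits
```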
12
Explanation of Entropy
1. The average amount of information provided per symbol.
2. The average number of bits needed to communicate each symbol.
13
Properties of Entropy
1. Non-negative: H(P) ≥ 0.
2. For any other probability distribution {q1, …, qk}: H(P) = Σi pi log2(1/pi) ≤ Σi pi log2(1/qi) (Gibbs' inequality).
3. H(P) ≤ log2 k, with equality iff pi = 1/k for all i.
4. The further P is from uniform, the lower the entropy.
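A quick numerical check of properties 2 and 3, using a distribution P and a comparison distribution Q chosen only for illustration:

```python
import math

def cross_entropy(p, q):
    """sum_i p_i * log2(1 / q_i): average code length when source P is coded with a code built for Q."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.2, 0.1]
Q = [1/3, 1/3, 1/3]

print(cross_entropy(P, P))   # H(P) ~ 1.157 bits
print(cross_entropy(P, Q))   # ~ 1.585 bits >= H(P)                    (property 2)
print(math.log2(len(P)))     # log2 k = 1.585, the maximum possible H  (property 3)
```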
14
Entropy: k = 2 (the binary entropy function H(p) = −p log2 p − (1 − p) log2(1 − p))
Notice: zero information at the edges (p = 0 or 1); maximum information at p = 0.5 (1 bit); the curve drops off more quickly near the edges than in the middle.
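A few sample values of the binary entropy function (our own check of the curve's shape):

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p) for a two-symbol source."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(p, round(binary_entropy(p), 3))
# H values: 0.0, 0.469, 0.881, 1.0 (maximum), 0.881, 0.469, 0.0 -- steep near the edges, flat near 0.5
```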
15
The Entropy of English
27 characters (A-Z, space); 100,000 words (average 6.5 characters each).
Assuming independence between successive characters:
Uniform character distribution: log2 27 = 4.75 bits/character.
True character distribution: 4.03 bits/character.
Assuming independence between successive words:
Uniform word distribution: (log2 100,000)/6.5 = 2.55 bits/character.
True word distribution: 9.45/6.5 = 1.45 bits/character.
The true entropy of English is much lower!
17
Entropy of Two Sources
Temperature T: P(T = hot) = 0.3, P(T = mild) = 0.5, P(T = cold) = 0.2; H(T) = H(0.3, 0.5, 0.2) = 1.485
Humidity M: P(M = low) = 0.6, P(M = high) = 0.4; H(M) = H(0.6, 0.4) = 0.971
The random variables T, M are not independent: P(T = t, M = m) ≠ P(T = t) P(M = m)
18
Joint Entropy
Joint probability P(T, M) (rows T, columns M):
         M = low   M = high
  hot      0.1       0.2
  mild     0.4       0.1
  cold     0.1       0.1
H(T) = 1.485, H(M) = 0.971, H(T) + H(M) = 2.456
Joint entropy: H(T, M) = H(0.1, 0.2, 0.4, 0.1, 0.1, 0.1) = 2.321
H(T, M) < H(T) + H(M)
19
Conditional Entropy
Conditional probability P(T | M): P(T | M = low) = (1/6, 2/3, 1/6), P(T | M = high) = (0.5, 0.25, 0.25)
H(T | M = low) = 1.252, H(T | M = high) = 1.5
Average conditional entropy: H(T | M) = P(M = low) H(T | M = low) + P(M = high) H(T | M = high) = 0.6 * 1.252 + 0.4 * 1.5 = 1.351
How much is M telling us on average about T? H(T) − H(T | M) = 1.485 − 1.351 = 0.134 bits
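The sketch below reproduces the temperature/humidity numbers from the last few slides, using the joint table shown on the joint-entropy slide (the dictionary layout is ours):

```python
import math

# Joint distribution P(T, M); marginals are P(T) = (0.3, 0.5, 0.2) and P(M) = (0.6, 0.4).
joint = {('hot', 'low'): 0.1,  ('hot', 'high'): 0.2,
         ('mild', 'low'): 0.4, ('mild', 'high'): 0.1,
         ('cold', 'low'): 0.1, ('cold', 'high'): 0.1}

def H(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

p_T = {t: sum(p for (ti, _), p in joint.items() if ti == t) for t in ('hot', 'mild', 'cold')}
p_M = {m: sum(p for (_, mi), p in joint.items() if mi == m) for m in ('low', 'high')}

H_T  = H(p_T.values())     # 1.485
H_M  = H(p_M.values())     # 0.971
H_TM = H(joint.values())   # 2.321
H_T_given_M = H_TM - H_M   # 1.351, the average conditional entropy (chain rule)
I_TM = H_T - H_T_given_M   # 0.134 bits: how much M tells us about T on average
print(H_T, H_M, H_TM, H_T_given_M, I_TM)
```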
20
Mutual Information
I(X; Y) = H(X) − H(X|Y) = H(X) + H(Y) − H(X, Y) = Σx,y p(x, y) log2 [ p(x, y) / (p(x) p(y)) ]
Properties:
Indicates the amount of information one random variable provides about another.
Symmetric: I(X; Y) = I(Y; X).
Non-negative; zero iff X, Y are independent.
21
Relationship (Venn diagram of the quantities H(X, Y), H(X), H(Y), H(X|Y), H(Y|X), I(X; Y)):
H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X, Y)
22
A Distance Measure Between Distributions
Kullback-Leibler distance: KL(P_D || P_M) = Σx P_D(x) log2 [ P_D(x) / P_M(x) ]
Properties of the Kullback-Leibler distance:
Non-negative: KL(P_D || P_M) ≥ 0, with KL(P_D || P_M) = 0 iff P_D = P_M.
Minimizing the KL distance brings P_M close to P_D.
Non-symmetric: KL(P_D || P_M) ≠ KL(P_M || P_D).
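A minimal sketch of the KL distance, assuming P_M(x) > 0 wherever P_D(x) > 0 (the example distributions are made up):

```python
import math

def kl_distance(p_d, p_m):
    """KL(P_D || P_M) = sum_x P_D(x) * log2(P_D(x) / P_M(x)); terms with P_D(x) = 0 contribute 0."""
    return sum(pd * math.log2(pd / pm) for pd, pm in zip(p_d, p_m) if pd > 0)

P_D = [0.7, 0.2, 0.1]
P_M = [0.5, 0.3, 0.2]

print(kl_distance(P_D, P_M))   # ~0.123 >= 0
print(kl_distance(P_M, P_D))   # ~0.133: not the same, KL is non-symmetric
print(kl_distance(P_D, P_D))   # 0.0, since the two distributions are equal
```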
23
Bregman Distance
D_φ(x, y) = φ(x) − φ(y) − ∇φ(y)·(x − y), where φ(x) is a convex function.
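A hedged sketch of this definition (we take the standard form D_φ(x, y) = φ(x) − φ(y) − ∇φ(y)·(x − y)); choosing φ(x) = ‖x‖² gives the squared Euclidean distance, and choosing φ(x) = Σ x_i log x_i (negative entropy) recovers the KL distance on probability vectors:

```python
import math

def bregman(x, y, phi, grad_phi):
    """D_phi(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>, for a convex function phi."""
    return phi(x) - phi(y) - sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))

# phi(x) = ||x||^2  ->  squared Euclidean distance
sq_norm = lambda v: sum(vi * vi for vi in v)
grad_sq = lambda v: [2 * vi for vi in v]

# phi(x) = sum_i x_i * log(x_i)  (negative entropy)  ->  KL distance (in nats) on probability vectors
neg_ent = lambda v: sum(vi * math.log(vi) for vi in v)
grad_ne = lambda v: [math.log(vi) + 1 for vi in v]

x, y = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
print(bregman(x, y, sq_norm, grad_sq))   # ||x - y||^2 = 0.06
print(bregman(x, y, neg_ent, grad_ne))   # KL(x || y) in nats, ~0.085
```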
25
Compression Algorithm for Text Categorization (TC)
Compress the Sports and Politics training examples separately (109K and 116K).
Append the new document to each class's training examples and compress again (129K and 126K).
Assign the new document the topic whose compressed file grows the least. Topic: Sports.
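A hedged sketch of this idea with zlib (the compressor, helper names, and toy data are ours; the slide's actual compressor and file sizes may differ):

```python
import zlib

def compressed_size(text: str) -> int:
    return len(zlib.compress(text.encode('utf-8'), 9))

def classify_by_compression(training_texts: dict, new_doc: str) -> str:
    """Assign the class whose compressed training text grows least when the new document is appended."""
    growth = {label: compressed_size(text + ' ' + new_doc) - compressed_size(text)
              for label, text in training_texts.items()}
    return min(growth, key=growth.get)

# Toy usage; in practice the values would be the concatenated Sports / Politics training examples.
training = {'Sports': 'the team won the game in overtime ...',
            'Politics': 'the senate passed the budget bill ...'}
print(classify_by_compression(training, 'the final score of the championship game ...'))
```

With strings this short the compressor's overhead dominates, so a meaningful decision needs realistically sized training sets, as in the 109K/116K example above.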
26
The Noisy Channel
Prototypical case: Input 0,1,1,1,0,1,0,1,... → the channel (adds noise) → Output (noisy) 0,1,1,0,0,1,1,0,...
Model: probability of error (noise), p(output | input). Example: p(0|1) = 0.3, p(1|1) = 0.7, p(1|0) = 0.4, p(0|0) = 0.6
The task: known: the noisy output; want to know: the input (decoding).
Related results: the source coding theorem and the channel coding theorem.
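A small sketch of decoding a bit with this channel model: combine the error probabilities quoted above with a prior over inputs (the prior here is an assumption for illustration) and pick the input with the highest posterior:

```python
# Channel model from the slide: p(output | input)
p_out_given_in = {(0, 1): 0.3, (1, 1): 0.7,   # a transmitted 1 arrives as 0 with probability 0.3
                  (1, 0): 0.4, (0, 0): 0.6}   # a transmitted 0 arrives as 1 with probability 0.4
p_in = {0: 0.7, 1: 0.3}                       # assumed prior over input bits (not from the slide)

def decode_bit(observed: int) -> int:
    """MAP decoding: argmax over inputs x of p(observed | x) * p(x)."""
    return max(p_in, key=lambda x: p_out_given_in[(observed, x)] * p_in[x])

noisy_output = [0, 1, 1, 0, 0, 1, 1, 0]
print([decode_bit(y) for y in noisy_output])
# With this strong prior on 0, even an observed 1 decodes to 0 (0.4*0.7 > 0.7*0.3):
# the decoder weighs the channel's noise model against which inputs are likely a priori.
```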
27
Noisy Channel Applications
OCR (straightforward): text → print (adds noise) → scan → image
Handwriting recognition: text → neurons, muscles ("noise") → scan/digitize → image
Speech recognition (dictation, commands, etc.): text → conversion to acoustic signal ("noise") → acoustic waves
Machine Translation: text in target language → translation ("noise") → source language