ENTROPY
Entropy measures the uncertainty in a random experiment. Let X be a discrete random variable with range $S_X = \{1, 2, 3, \dots, K\}$ and pmf $p_k = P_X(X = k)$.
The uncertainty of an outcome with probability $p_k$ is $-\log p_k$: if $p_k = 1$, the uncertainty $= 0$; as $p_k \to 0$, the uncertainty $\to \infty$.
Entropy of X ≡ expected uncertainty of the outcomes:
$$H_X = E[-\log P_X(X)] = -\sum_{k=1}^{K} p_k \log p_k$$
If $\log_2$ is used the units are bits; with $\ln$, the units are nats.
By convention, an outcome with $P(X = x) = 0$ contributes $-0 \log(0) = 0$; likewise, when $P(X = x) = 1$ the contribution is $-1 \log(1) = 0$.
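As a quick check of this definition, here is a minimal Python sketch (the helper name `entropy` is my own, not from the notes) computing the entropy of a pmf in bits and in nats:

```python
import math

def entropy(pmf, base=2):
    """Entropy of a pmf; base=2 gives bits, base=math.e gives nats.
    Terms with p = 0 are skipped, using the convention 0*log(0) = 0."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

pmf = [0.5, 0.25, 0.25]
print(entropy(pmf))          # 1.5 bits
print(entropy(pmf, math.e))  # ~1.0397 nats = 1.5 * ln(2)
```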
For a binary random variable $X \in \{0, 1\}$, let $p \equiv P(X = 1)$:
$$H_X = -p \log p - (1 - p) \log(1 - p)$$
$H_X$ is maximum when $p = 0.5$ ↔ 0 and 1 are equally probable ↔ maximum uncertainty.
For $p = 1$ or $p = 0$ there is no uncertainty → $H_X = 0$.
[Figure: the binary entropy $H_X$ plotted against $p$]
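A similarly minimal sketch (the helper `binary_entropy` is assumed, not from the notes) confirming that the binary entropy peaks at $p = 0.5$:

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

ps = [i / 100 for i in range(101)]
best = max(ps, key=binary_entropy)
print(best, binary_entropy(best))  # 0.5, 1.0 bit
print(binary_entropy(0.9))         # ~0.469 bits: less uncertain than p = 0.5
```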
In general, $H_X$ of $K$ equally probable outcomes $= \log_2 K$ bits; e.g., for $n$-bit equiprobable numbers ($K = 2^n$), $H_X = n$ bits.
As each bit is specified, $H_X$ decreases by 1 bit. When all $n$ bits are specified, $H_X = 0$.
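A small illustration of this bit-by-bit reduction, assuming 3-bit equiprobable numbers:

```python
import math

def entropy_bits(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# 3-bit equiprobable numbers: 8 outcomes, each with probability 1/8.
print(entropy_bits([1 / 8] * 8))  # 3.0 bits

# After the first bit is specified, 4 outcomes remain equally probable.
print(entropy_bits([1 / 4] * 4))  # 2.0 bits

# With all 3 bits specified, only one outcome is possible.
print(entropy_bits([1.0]))        # 0.0 bits
```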
Relative Entropy: If $p = (p_1, p_2, \dots, p_K)$ with $X \sim p$ and $q = (q_1, q_2, \dots, q_K)$ with $Y \sim q$ are two pmf's ($K$ outcomes for both $X$ and $Y$), the relative entropy of $q$ with respect to $p$ is
$$H(p; q) \equiv \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k}$$
$H(p; q)$ is often used as a measure of distance between probability distributions and is called the Kullback-Leibler distance. It satisfies $H(p; q) \ge 0$, with equality iff $p = q$. To prove these assertions, use the inequality $\ln x \le x - 1$ (with equality iff $x = 1$).
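A brief numerical sketch (the helper `relative_entropy` and the two pmf's are my own choices) checking that $H(p;q) \ge 0$, with equality only when the pmf's coincide:

```python
import math

def relative_entropy(p, q):
    """H(p;q) = sum_k p_k * log2(p_k / q_k); requires q_k > 0 wherever p_k > 0."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
print(relative_entropy(p, q))  # > 0 (about 0.085 bits)
print(relative_entropy(p, p))  # 0.0: equality iff p = q
```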
Thus
$$0 \le H_X \le \log K,$$
where $H_X = 0$ ↔ only one possible outcome, and $H_X = \log K$ ↔ $K$ equally probable outcomes. The equally probable case is called the maximum entropy (ME) or the minimum relative entropy (MRE) situation.
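To see the bounds numerically, a short sketch with K = 4 (the skewed pmf is an arbitrary example of mine):

```python
import math

def entropy_bits(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

K = 4
print(math.log2(K))                        # 2.0 = upper bound log2(K)
print(entropy_bits([0.25] * 4))            # 2.0 bits: uniform pmf attains the bound
print(entropy_bits([0.7, 0.1, 0.1, 0.1]))  # ~1.357 bits: skewed pmf has less entropy
print(entropy_bits([1.0, 0, 0, 0]))        # 0.0 bits: only one possible outcome
```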
Differential Entropy:
For a continuous random variable, every individual outcome has probability 0 and is therefore maximally uncertain, so entropy cannot be defined as for discrete random variables. Instead, the differential entropy is used:
$$H_X = -\int f_X(x) \log f_X(x)\, dx$$
In fact, the integral extends only over the region where $f_X(x) > 0$.
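A hedged numerical sketch, assuming an Exponential($\lambda$) density, whose differential entropy has the known closed form $1 - \ln \lambda$ nats:

```python
import math

# Differential entropy of an Exponential(lam) density f(x) = lam * exp(-lam*x), x > 0.
lam = 2.0

def f(x):
    return lam * math.exp(-lam * x)

# Riemann-sum (midpoint) approximation of -∫ f(x) ln f(x) dx over the region where
# f(x) > 0, truncated at x = 20/lam where the density is negligibly small.
n, upper = 200_000, 20.0 / lam
dx = upper / n
H = -sum(f(x) * math.log(f(x)) * dx for x in (i * dx + dx / 2 for i in range(n)))
print(H, 1 - math.log(lam))  # both ~0.3069 nats
```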
Information Theory
Let X be a random variable with $S_X = \{x_1, \dots, x_K\}$. Information about the outcomes of X is to be sent over a channel. How can the outcomes $\{x_1, \dots, x_K\}$ be coded so that all information is carried with maximal efficiency?
[Block diagram: Source X → Channel → Receiver → Destination]
Best code → minimum expected codeword length. The code must be instantaneously decodable, i.e. no codeword is a prefix of any other → construct a code tree.
e.g. $S = \{x_1, x_2, x_3, x_4, x_5\}$: $x_1 = 00$, $x_2 = 01$, $x_3 = 10$, $x_4 = 110$, $x_5 = 111$.
[Figure: binary code tree with leaves $x_1, \dots, x_5$]
If $l_k$ = length of the codeword for $x_k$, then
$$E(\text{codeword length}) = \sum_{k} p_k l_k.$$
For instantaneous binary codes, the Kraft inequality holds:
$$\sum_{k} 2^{-l_k} \le 1.$$
For a D-ary code: $\sum_{k} D^{-l_k} \le 1$.
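Tying the example code above to these formulas, a small sketch; the pmf used for the expected-length computation is my own assumption, since none is given:

```python
from itertools import combinations

code = {"x1": "00", "x2": "01", "x3": "10", "x4": "110", "x5": "111"}

# Instantaneously decodable: no codeword is a prefix of any other.
prefix_free = all(not a.startswith(b) and not b.startswith(a)
                  for a, b in combinations(code.values(), 2))
print(prefix_free)  # True

# Kraft inequality for a binary code: sum of 2^(-l_k) <= 1.
print(sum(2 ** -len(w) for w in code.values()))  # 1.0 (complete tree)

# Expected codeword length for an assumed pmf (not given in the notes).
pmf = {"x1": 0.3, "x2": 0.25, "x3": 0.2, "x4": 0.15, "x5": 0.1}
print(sum(pmf[x] * len(w) for x, w in code.items()))  # 2.25 bits
```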
i.e.
1. The minimum average codeword length = the entropy of X; equivalently, the bits of information in X = the entropy of X.
2. The most efficient code is obtained when $\text{length}(x_k) = -\log_2 p_k$, i.e. less probable outcomes get longer codewords.
A maximally efficient code (with $E(L) = H_X$ exactly) can always be found when all $p_k$ are negative powers of 2; otherwise $H_X \le E(L) < H_X + 1$. One such optimal code is the Huffman code, constructed by a Huffman tree.
e.g. Let $S_X = \{A, B, C, D, E\}$ with pmf $\{0.1, 0.3, 0.25, 0.2, 0.15\}$.
At every step, combine the two nodes with the minimal probability sum:
1) A (0.1) + E (0.15) → 0.25
2) D (0.2) + C (0.25) → 0.45
3) {A, E} (0.25) + B (0.3) → 0.55
4) 0.45 + 0.55 → 1.0
Resulting code: A = 000, E = 001, B = 01, C = 10, D = 11.
[Figure: Huffman tree built from the nodes A, B, C, D, E]
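A hedged Python sketch of the same Huffman construction using a heap; tie-breaking may assign different bit patterns, but the codeword lengths (and hence the expected length) match the example:

```python
import heapq
import math

def huffman_code(pmf):
    """Build a binary Huffman code for pmf = {symbol: probability}."""
    # Each heap entry: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)  # smallest probability -> bit 0
        p1, _, c1 = heapq.heappop(heap)  # next smallest        -> bit 1
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, count, merged))
        count += 1
    return heap[0][2]

pmf = {"A": 0.1, "B": 0.3, "C": 0.25, "D": 0.2, "E": 0.15}
code = huffman_code(pmf)
print(code)  # lengths: 3 bits for A and E, 2 bits for B, C, D (bits may differ)

avg_len = sum(pmf[s] * len(w) for s, w in code.items())
entropy = -sum(p * math.log2(p) for p in pmf.values())
print(avg_len, entropy)  # 2.25 bits vs. ~2.228 bits, so H_X <= E(L) < H_X + 1
```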
Proof of the Kraft inequality: consider any binary tree with each leaf a codeword. Let $l_{max}$ = length of the longest codeword, i.e. the deepest leaf is at level $l_{max}$ (root = level 0). If all leaves were at level $l_{max}$, the number of leaves would be $2^{l_{max}}$. If a leaf is at level $l_k < l_{max}$, it eliminates $2^{\,l_{max} - l_k}$ leaves from the full tree.
Hence $\sum_k 2^{\,l_{max} - l_k} \le 2^{\,l_{max}}$, and dividing by $2^{\,l_{max}}$ gives $\sum_k 2^{-l_k} \le 1$. (Remember, each leaf of the full tree is eliminated by exactly one codeword.)
[Figure: example code tree with leaves A, B, C, D]
In general, $\sum_k 2^{-l_k} = 1$ if the tree is complete; if not, it is $< 1$.
e.g. [Figure: code tree with leaves A, B, C]
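A tiny sketch of the complete vs. incomplete cases (the two sets of codeword lengths are illustrative choices, not from the notes):

```python
def kraft_sum(lengths):
    """Sum of 2^(-l) over binary codeword lengths l."""
    return sum(2 ** -l for l in lengths)

# Complete tree (every internal node has two children), e.g. A = 0, B = 10, C = 11.
print(kraft_sum([1, 2, 2]))  # 1.0

# Incomplete tree, e.g. A = 0, B = 10 (the branch 11 is unused).
print(kraft_sum([1, 2]))     # 0.75 < 1
```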
Maximum Entropy Method
Given a random variable X with $S_X = \{x_1, \dots, x_K\}$, an unknown pmf $p_k = p(x_k)$, and the constraint
$$E(g(X)) = r \qquad (1)$$
estimate $p_k$.
Hypothesis: $p_k = C e^{-\lambda g(x_k)}$, with $C$ and $\lambda$ chosen so that (1) and $\sum_k p_k = 1$ hold, is the maximum entropy pmf.
Proof: Suppose a pmf $q \ne p$ also satisfies (1). Then $H(q; p) \ge 0$ gives $H_q \le -\sum_k q_k \log p_k = -\log C + \lambda E_q(g(X)) = -\log C + \lambda r = H_p$, so no such $q$ can have larger entropy than $p$.
In general, given n constraints
$$E(g_1(X)) = r_1 \qquad (1\text{-}1)$$
$$\vdots$$
$$E(g_n(X)) = r_n \qquad (1\text{-}n)$$
the ME pmf has the form
$$p_k = C \exp\!\Big(-\sum_{i=1}^{n} \lambda_i g_i(x_k)\Big),$$
where $C$ and $\lambda_1, \dots, \lambda_n$ are chosen so that the constraints (1-1)-(1-n) and $\sum_k p_k = 1$ are satisfied.
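A hedged numerical sketch of the single-constraint case: a maximum entropy pmf on $\{1, \dots, 6\}$ with an assumed constraint $E(X) = 4.5$, solving for $\lambda$ by bisection:

```python
import math

# Maximum entropy pmf on S_X = {1, ..., 6} subject to E(X) = r.
# ME form: p_k = C * exp(-lam * x_k); solve for lam by bisection, C by normalization.
xs = list(range(1, 7))
r = 4.5  # assumed constraint value (illustrative)

def mean_for(lam):
    w = [math.exp(-lam * x) for x in xs]
    return sum(x * wk for x, wk in zip(xs, w)) / sum(w)

lo, hi = -10.0, 10.0    # mean_for is decreasing in lam on this bracket
for _ in range(100):    # simple bisection
    mid = (lo + hi) / 2
    if mean_for(mid) > r:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2

w = [math.exp(-lam * x) for x in xs]
C = 1.0 / sum(w)
p = [C * wk for wk in w]
print([round(pk, 4) for pk in p])           # pmf increasing toward large faces
print(sum(x * pk for x, pk in zip(xs, p)))  # ~4.5: the constraint is met
```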