Basics of Entropy
CS 621 Artificial Intelligence, Lecture /09/05
Prof. Pushpak Bhattacharyya, IIT Bombay
Entropy
A measure of uncertainty/structure in the data; used in classification.
Tabular data
A_1     A_2     A_3     A_4     ...     D
V_11    V_21    V_31    ...             Y
V_12    V_22    V_32    ...             N
Concentrate on one attribute, and partition the data on the values of that attribute.
Information theory, through entropy, paves the way for classification.
E(S) = - P_+ log(P_+) - P_- log(P_-)
P_+ = proportion of positive examples (D = Y)
P_- = proportion of negative examples (D = N)
Focus on attribute A_i:
Gain(S, A_i) = E(S) - Σ_{v ∈ Values(A_i)} (|S_v| / |S|) E(S_v)
The decision tree is constructed from this gain using the ID3 algorithm, as in the sketch below.
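To make the two formulas concrete, here is a minimal sketch (not the lecture's code) of E(S) and Gain(S, A_i) over a small table; the toy rows, attribute values and decision labels are invented for illustration, and log is taken base 2.

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = -P+ log2(P+) - P- log2(P-); generalises to any number of classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Gain(S, A) = E(S) - sum over v in Values(A) of (|S_v| / |S|) * E(S_v)."""
    n = len(rows)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

rows   = [('hot', 'high'), ('hot', 'low'), ('cold', 'high'), ('cold', 'low')]
labels = ['N', 'Y', 'Y', 'Y']                           # decision column D
print(gain(rows, labels, 0), gain(rows, labels, 1))     # compare attributes A_1 and A_2
```

ID3 then splits on the attribute with the highest gain and recurses on each partition.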
Entropy: terminology
S = {s_1, s_2, ..., s_q}   symbols emitted by a source
P(s_i) = emission probability of s_i = P_i
Notion of "information" from S:
If P(s_i) = P_i = 1, no information is conveyed: "no surprise".
Convention: if P_i = 0, I = ∞.
I is the "information" function:
1) I(s_i) = 0 if P(s_i) = P_i = 1
2) I(s_i s_j) = I(s_i) + I(s_j) is wanted, assuming the emissions of s_i and s_j are independent events, i.e. P(s_i s_j) = P(s_i) * P(s_j)
The form of I satisfying these requirements is
I(s_i) = log(1/P_i), where P_i = P(s_i)
I(s_i) = ∞ for P_i = 0
Information is the amount of surprise, and also the length of the code needed to send the message.
Average information = expected value of the information
= E(I(s_i)) = Σ_i P_i log(1/P_i)
This is the entropy of S.
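As a small sketch (assuming log base 2, i.e. information measured in bits), the two quantities just defined can be written directly:

```python
# Information (surprise) of a single symbol, and entropy as its expected value.
import math

def information(p):
    """I(s_i) = log2(1/P_i); by the convention above, infinite when P_i = 0."""
    return float('inf') if p == 0 else math.log2(1.0 / p)

def entropy(probs):
    """E(S) = sum_i P_i * log2(1/P_i); terms with P_i = 0 contribute 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)
```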
Properties of Entropy
Minimum value: if any P_i = 1 (so P_j = 0 for all j ≠ i), then E(S) = 0, the minimum value.
Maximum value
By convention, a term with P_i = 0 contributes 0 to E(S), since
lim_{P_i → 0} P_i log(1/P_i) = 0.
Example: S = {s_1, s_2} with probabilities P_1, P_2, P_1 + P_2 = 1 (tossing of a coin).
E(S) = - [ P_1 log P_1 + P_2 log P_2 ]
With P_1 = P_2 = 0.5, E(S) = 1.0 bit.
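Using the entropy sketch above, the coin example (and a biased variant, added here only for contrast) works out as:

```python
print(entropy([0.5, 0.5]))   # 1.0 bit for an unbiased coin
print(entropy([0.9, 0.1]))   # ~0.47 bits for a biased coin: less uncertainty
```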
If all events are equally likely, entropy is maximal: for S = {s_1, s_2, ..., s_q}, we expect E(S) to be maximum when each P_i = 1/q.
Theorem: E(S) is maximum when each P_i = 1/q, for S = {s_1, s_2, ..., s_q}.
Lemma: ln(x) = log_e(x) <= x - 1.
Proof: consider f(x) = x - 1 - ln(x); f(1) = 0.
df/dx = 1 - 1/x; equating to 0 gives x = 1, so f(x) has an extremum at x = 1.
d²f/dx² = 1/x² > 0 for x > 0, so x = 1 is a minimum.
Since f(x) = x - 1 - ln(x) has its minimum at x = 1 and f(1) = 0, we get f(x) >= 0 for all x > 0, i.e. ln(x) <= x - 1.
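A quick numerical check of the lemma (only a sanity check, not part of the proof):

```python
import math
for x in [0.1, 0.5, 1.0, 2.0, 10.0]:
    assert math.log(x) <= x - 1 + 1e-12        # ln(x) <= x - 1, equality at x = 1
    print(x, math.log(x), x - 1)
```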
Corollary
Let Σ_{i=1 to m} x_i = 1 and Σ_{i=1 to m} y_i = 1, with x_i >= 0 and y_i > 0, so the x_i and y_i are probability distributions. Then
Σ_{i=1 to m} x_i ln(1/x_i) <= Σ_{i=1 to m} x_i ln(1/y_i)
Proof
Σ_{i=1 to m} x_i ln(1/x_i) - Σ_{i=1 to m} x_i ln(1/y_i)
= Σ_{i=1 to m} x_i ln(y_i/x_i)
<= Σ_{i=1 to m} x_i (y_i/x_i - 1)      [by the lemma]
= Σ_{i=1 to m} y_i - Σ_{i=1 to m} x_i = 0
This proves that Σ_{i=1 to m} x_i ln(1/x_i) <= Σ_{i=1 to m} x_i ln(1/y_i).
The proof of the theorem follows from this corollary.
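The corollary can also be checked numerically; the sketch below (random distributions chosen purely for illustration, working in nats) asserts it over a few trials.

```python
import math, random

def normalize(raw):
    total = sum(raw)
    return [v / total for v in raw]

def cross_term_nats(x, y):
    """sum_i x_i * ln(1/y_i); terms with x_i = 0 are dropped by convention."""
    return sum(xi * math.log(1.0 / yi) for xi, yi in zip(x, y) if xi > 0)

random.seed(0)
for _ in range(5):
    x = normalize([random.random() for _ in range(4)])
    y = normalize([random.random() for _ in range(4)])
    # the left-hand side (y = x) never exceeds the right-hand side for any y
    assert cross_term_nats(x, x) <= cross_term_nats(x, y) + 1e-12
print("corollary held in all trials")
```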
S = {s_1, s_2, ..., s_q}, P(s_i) = P_i, i = 1 ... q.
Choose x_i = P_i and y_i = 1/q. Then
Σ_{i=1 to q} x_i ln(1/x_i) <= Σ_{i=1 to q} x_i ln(1/y_i)
i.e. Σ_{i=1 to q} P_i ln(1/P_i) <= Σ_{i=1 to q} P_i ln(q)
So E(S) <= Σ_{i=1 to q} P_i ln(q) = ln(q) · Σ_{i=1 to q} P_i = ln(q).
E(S) is therefore upper bounded by ln(q), and this value is reached when each P_i = 1/q (then y_i = x_i and the corollary holds with equality). This establishes the maximum value of the entropy.
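A short illustrative check of the theorem (uniform distribution attains ln q, any other distribution stays below):

```python
import math, random

def entropy_nats(probs):
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)

q = 8
print(entropy_nats([1.0 / q] * q), math.log(q))    # both equal ln 8 ≈ 2.079
random.seed(1)
raw = [random.random() for _ in range(q)]
p = [v / sum(raw) for v in raw]
print(entropy_nats(p) <= math.log(q))              # True
```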
E(S) is defined as Σ_{i=1 to q} P_i log_r(1/P_i).
The transformation from log_r(x) to ln(x) is just a constant multiplying factor:
log_r(x) = log_r(e) · log_e(x) = log_r(e) · ln(x)
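A one-line illustration of the base change (here with r = 2, i.e. converting nats to bits):

```python
import math
x = 5.0
# log_r(x) = log_r(e) * ln(x); with r = 2 this turns natural logs into base-2 logs.
print(math.log2(x), math.log2(math.e) * math.log(x))   # identical values
```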
Summary / Review
Established the intuition for the information function I(s_i); it is related to "surprise".
Average information is called entropy.
Minimum value of E is 0.
Maximum of E?
- Lemma: ln(x) <= x - 1
- Corollary: Σ x_i ln(1/x_i) <= Σ x_i ln(1/y_i) when Σ x_i = Σ y_i = 1
Max E is k · ln(q) = log_r(q), where k = log_r(e) is the base-conversion constant, and is reached when P_i = 1/q for each i.
Shannon asked: what is the "entropy of the English language"?
S = {a, b, c, ..., ',', ':', ...}
P(a) = relative frequency of 'a' in a large corpus, P(b) = ..., and so on; this gives the P_i's.
E(English) = Σ P_i log(1/P_i) = 4.08 bits per symbol
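A sketch of that estimate: count relative symbol frequencies and plug them into the entropy formula. The short string below merely stands in for a large corpus; on real English text the result comes out near 4 bits per symbol.

```python
import math
from collections import Counter

corpus = "the quick brown fox jumps over the lazy dog, again and again"
counts = Counter(corpus)                       # frequency of each symbol
n = sum(counts.values())
entropy_bits = sum((c / n) * math.log2(n / c) for c in counts.values())
print(entropy_bits)
```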
The maximum entropy for the tossing of a coin is 1.0 bit, attained when the coin is unbiased.
Interest of the reader/listener: a novel is more interesting than a scientific paper for some people.
Summary
Application of the noisy channel model to ASR.
Formulated as Bayesian decision making.
Studied phonetic problems.
Why is a probability model needed?