
Slide 1: Entropy and Information Gain. Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University. www.cs.cmu.edu/~awm, awm@cs.cmu.edu, 412-268-7599. Copyright © 2001, 2003, Andrew W. Moore. Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

Slide 2: Bits. You are watching a set of independent random samples of X. You see that X has four possible values, each equally likely: P(X=A) = 1/4, P(X=B) = 1/4, P(X=C) = 1/4, P(X=D) = 1/4. So you might see: BAACBADCDADDDA… You transmit data over a binary serial link. You can encode each reading with two bits (e.g. A = 00, B = 01, C = 10, D = 11), giving 0100001001001110110011111100…

Slide 3: Fewer Bits. Someone tells you that the probabilities are not equal: P(X=A) = 1/2, P(X=B) = 1/4, P(X=C) = 1/8, P(X=D) = 1/8. It's possible to invent a coding for your transmission that uses only 1.75 bits on average per symbol. How?

Slide 4: Fewer Bits. With P(X=A) = 1/2, P(X=B) = 1/4, P(X=C) = 1/8, P(X=D) = 1/8, here is one coding that uses only 1.75 bits on average per symbol (this is just one of several ways): A = 0, B = 10, C = 110, D = 111.
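
A quick check of the 1.75-bits figure, as an R sketch (R is used here because later slides already call R's persp and filled.contour; the probabilities and code lengths are taken straight from the slide):

    # Probabilities from the slide and the lengths of the codewords A=0, B=10, C=110, D=111
    p   <- c(A = 1/2, B = 1/4, C = 1/8, D = 1/8)
    len <- c(A = 1,   B = 2,   C = 3,   D = 3)
    # Expected bits per symbol = sum over symbols of P(symbol) * codeword length
    sum(p * len)   # 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75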

Slide 5: Fewer Bits. Suppose there are three equally likely values: P(X=A) = 1/3, P(X=B) = 1/3, P(X=C) = 1/3. Here's a naïve coding, costing 2 bits per symbol: A = 00, B = 01, C = 10. Can you think of a coding that would need only 1.6 bits per symbol on average? In theory, it can in fact be done with 1.58496 bits per symbol.

Slide 6: General Case. Suppose X can have one of m values V1, V2, … Vm, with P(X=V1) = p1, P(X=V2) = p2, …, P(X=Vm) = pm. What's the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It's H(X) = −p1 log2 p1 − p2 log2 p2 − … − pm log2 pm = −Σj pj log2 pj, the entropy of X (Shannon, 1948). "High entropy" means X is from a uniform (boring) distribution; "low entropy" means X is from a varied (peaks and valleys) distribution.

Slide 7: General Case (continued). H(X) = −Σj pj log2 pj is the entropy of X. "High entropy" means X is from a uniform (boring) distribution: a histogram of the frequency distribution of values of X would be flat. "Low entropy" means X is from a varied (peaks and valleys) distribution: a histogram of the frequency distribution of values of X would have many lows and one or two highs.

Slide 8: General Case (continued). High entropy: the histogram of X's values would be flat, and so the values sampled from it would be all over the place. Low entropy: the histogram would have many lows and one or two highs, and so the values sampled from it would be more predictable.
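
The entropy formula on slides 6–8 can be wrapped in a small helper and checked against the earlier coding examples (an R sketch; entropy_bits is just an illustrative name):

    # Entropy in bits: H(X) = -sum_j p_j * log2(p_j), treating 0*log2(0) as 0
    entropy_bits <- function(p) {
      p <- p[p > 0]
      -sum(p * log2(p))
    }

    entropy_bits(c(1/4, 1/4, 1/4, 1/4))   # 2 bits        (slide 2)
    entropy_bits(c(1/2, 1/4, 1/8, 1/8))   # 1.75 bits     (slides 3-4)
    entropy_bits(c(1/3, 1/3, 1/3))        # 1.58496 bits  (slide 5)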

Slide 9: Entropy in a nutshell. [Two illustrations: a low-entropy distribution and a high-entropy distribution.]

Slide 10: Entropy in a nutshell. High entropy: the values (locations of soup) are unpredictable, almost uniformly sampled throughout our dining room. Low entropy: the values (locations of soup) are sampled entirely from within the soup bowl.

Slide 11: The Meaning of Entropy. Entropy is a measure of disorder. If entropy is high there is more disorder (a blurrier image); if entropy is low there is more order (a clearer image). If entropy is high, more information is needed to describe the data; in other words, losing entropy is the same as gaining information. Entropy is widely used in the construction of decision trees.

Slide 12: Entropy of a PDF. For a continuous distribution with density p(x), the (differential) entropy uses the natural log (ln, or log e): h(X) = −∫ p(x) ln p(x) dx. The larger the entropy of a distribution, the harder it is to predict, the harder it is to compress, and the less spiky the distribution.

Slide 13: The "box" distribution. [Figure: a uniform density of height 1/w on the interval from −w/2 to w/2.]

Slide 14: Unit-variance box distribution. [Figure: the box distribution centered at 0, with its width chosen so that the variance is 1.]

Slide 15: The hat distribution. [Figure: a triangular ("hat") density centered at 0.]

Slide 16: Unit-variance hat distribution. [Figure: the hat distribution centered at 0, with its width chosen so that the variance is 1.]

Slide 17: The "2 spikes" distribution. [Figure: a density consisting of two Dirac delta spikes.]

Slide 18: Entropies of unit-variance distributions (in nats, using the natural log).
Distribution    Entropy
Box             1.242
Hat             1.396
2 spikes        −infinity
???             1.4189  (the largest possible entropy of any unit-variance distribution)
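
The table can be reproduced numerically (an R sketch using integrate; the unit-variance widths used below — a full width of sqrt(12) for the box and a half-width of sqrt(6) for the hat — are standard results rather than values stated on the slides):

    # Differential entropy in nats: h = -integral of p(x) * ln p(x) dx
    diff_entropy <- function(pdf, lower, upper) {
      integrate(function(x) {
        fx <- pdf(x)
        ifelse(fx > 0, -fx * log(fx), 0)
      }, lower, upper)$value
    }

    w <- sqrt(12)   # box on [-w/2, w/2] has variance w^2/12 = 1
    box <- function(x) ifelse(abs(x) <= w / 2, 1 / w, 0)
    a <- sqrt(6)    # triangular hat on [-a, a] has variance a^2/6 = 1
    hat <- function(x) ifelse(abs(x) <= a, (a - abs(x)) / a^2, 0)

    diff_entropy(box, -w / 2, w / 2)   # 1.242
    diff_entropy(hat, -a, a)           # 1.396
    diff_entropy(dnorm, -Inf, Inf)     # 1.4189 = 0.5 * log(2 * pi * exp(1))
    # The "2 spikes" distribution is a pair of Dirac deltas, so its differential entropy is -infinity.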

Slide 19: Unit-variance Gaussian. [Figure: the unit-variance Gaussian density — the mystery "???" distribution from the table, with entropy 1.4189 nats, the largest possible for any unit-variance distribution.]

Slide 20: Specific Conditional Entropy H(Y|X=v). Suppose I'm trying to predict output Y and I have input X. X = College Major, Y = Likes "Gladiator". The data:
X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes
Let's assume this reflects the true probabilities. E.g., from this data we estimate P(LikeG = Yes) = 0.5, P(Major = Math & LikeG = No) = 0.25, P(Major = Math) = 0.5, P(LikeG = Yes | Major = History) = 0. Note: H(X) = 1.5 and H(Y) = 1.

Slide 21: Specific Conditional Entropy H(Y|X=v). Definition of Specific Conditional Entropy: H(Y|X=v) = the entropy of Y among only those records in which X has value v. (X = College Major, Y = Likes "Gladiator"; same eight-record table as slide 20.)

Slide 22: Specific Conditional Entropy H(Y|X=v). Definition: H(Y|X=v) = the entropy of Y among only those records in which X has value v. Example (from the table on slide 20): H(Y|X=Math) = 1, H(Y|X=History) = 0, H(Y|X=CS) = 0.

Slide 23: Conditional Entropy H(Y|X). Definition of Conditional Entropy: H(Y|X) = the average specific conditional entropy of Y = if you choose a record at random, what will be the conditional entropy of Y, conditioned on that row's value of X = the expected number of bits needed to transmit Y if both sides will know the value of X = Σj P(X=vj) H(Y|X=vj). (X = College Major, Y = Likes "Gladiator"; same table as slide 20.)

Slide 24: Conditional Entropy. Definition: H(Y|X) = the average conditional entropy of Y = Σj P(X=vj) H(Y|X=vj). Example (X = College Major, Y = Likes "Gladiator"):
vj        P(X=vj)   H(Y|X=vj)
Math      0.5       1
History   0.25      0
CS        0.25      0
H(Y|X) = 0.5 × 1 + 0.25 × 0 + 0.25 × 0 = 0.5

Slide 25: Information Gain. Definition of Information Gain: IG(Y|X) = I must transmit Y; how many bits on average would it save me if both ends of the line knew X? IG(Y|X) = H(Y) − H(Y|X). Example (X = College Major, Y = Likes "Gladiator"): H(Y) = 1, H(Y|X) = 0.5, thus IG(Y|X) = 1 − 0.5 = 0.5.
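
The whole worked example on slides 20–25 can be reproduced from the eight records (an R sketch, reusing the entropy_bits helper sketched after slide 8):

    major <- c("Math", "History", "CS", "Math", "Math", "CS", "History", "Math")
    likes <- c("Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes")

    H <- function(y) entropy_bits(table(y) / length(y))   # empirical entropy of a vector

    H(likes)                                # H(Y) = 1
    sapply(split(likes, major), H)          # H(Y|X=v): CS = 0, History = 0, Math = 1
    HYgX <- sum(table(major) / length(major) * sapply(split(likes, major), H))
    HYgX                                    # H(Y|X) = 0.5
    H(likes) - HYgX                         # IG(Y|X) = 0.5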

Slide 26: Relative Entropy: Kullback–Leibler distance. Let p(x) and q(x) be two probability distributions; the Kullback–Leibler distance between them is given by KL(p || q) = Σx p(x) log( p(x) / q(x) ).
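
A discrete version in R (a sketch; it uses log base 2, so the answer is in bits — the slide itself does not fix the base):

    # Kullback-Leibler distance: KL(p || q) = sum_x p(x) * log2(p(x) / q(x))
    kl_div <- function(p, q) {
      keep <- p > 0
      sum(p[keep] * log2(p[keep] / q[keep]))
    }

    p <- c(1/2, 1/4, 1/8, 1/8)
    q <- c(1/4, 1/4, 1/4, 1/4)
    kl_div(p, q)   # 0.25: the extra bits per symbol paid for coding p as if it were q
    kl_div(p, p)   # 0: the distance from a distribution to itself is zero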

Slide 27: Mutual Information. This is the same as information gain, but viewed from another angle. Suppose we have two random variables X and Y with joint distribution r(x,y) and marginal distributions p(x) and q(y). The mutual information is defined by I(X,Y) = Σx Σy r(x,y) log( r(x,y) / (p(x) q(y)) ). Mutual information is the relative entropy between the joint distribution and the product of the marginal distributions.

Slide 28: Mutual Information. I(X,Y) = H(Y) − H(Y|X). Indeed, writing r(x,y) = p(x) P(y|x), we get I(X,Y) = Σx,y r(x,y) log P(y|x) − Σx,y r(x,y) log q(y) = −H(Y|X) + H(Y).

Slide 29: Mutual Information. I(X,Y) = H(Y) − H(Y|X); I(X,Y) = H(X) − H(X|Y); I(X,Y) = H(X) + H(Y) − H(X,Y); I(X,Y) = I(Y,X); I(X,X) = H(X).
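
These identities are easy to check numerically on a small joint distribution (an R sketch; the 2x2 joint table is made up purely for illustration):

    # A made-up joint distribution r(x, y) over two binary variables (rows: x, columns: y)
    r <- matrix(c(0.30, 0.10,
                  0.20, 0.40), nrow = 2, byrow = TRUE)
    p <- rowSums(r)   # marginal of X
    q <- colSums(r)   # marginal of Y

    Hxy <- -sum(r * log2(r))                  # joint entropy H(X,Y)
    Hx  <- -sum(p * log2(p))
    Hy  <- -sum(q * log2(q))
    I   <- sum(r * log2(r / outer(p, q)))     # relative-entropy form of I(X,Y)

    all.equal(I, Hx + Hy - Hxy)               # TRUE: I(X,Y) = H(X) + H(Y) - H(X,Y)
    all.equal(I, Hy - (Hxy - Hx))             # TRUE: I(X,Y) = H(Y) - H(Y|X), since H(Y|X) = H(X,Y) - H(X)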

Slide 30: Information Gain Example. [Worked example shown as a figure on the original slide.]

Slide 31: Another example. [Worked example shown as a figure on the original slide.]

Slide 32: Relative Information Gain. Definition of Relative Information Gain: RIG(Y|X) = I must transmit Y; what fraction of the bits on average would it save me if both ends of the line knew X? RIG(Y|X) = [H(Y) − H(Y|X)] / H(Y). Example (X = College Major, Y = Likes "Gladiator"): H(Y|X) = 0.5, H(Y) = 1, thus RIG(Y|X) = (1 − 0.5)/1 = 0.5.

Slide 33: What is Information Gain used for? Suppose you are trying to predict whether someone is going to live past 80 years. From historical data you might find: IG(LongLife | HairColor) = 0.01, IG(LongLife | Smoker) = 0.2, IG(LongLife | Gender) = 0.25, IG(LongLife | LastDigitOfSSN) = 0.00001. IG tells you how interesting a 2-d contingency table is going to be.

Slide 34: Cross Entropy. Let X be a random variable with known distribution p(x) and estimated distribution q(x). The cross entropy measures the difference between the two distributions and is defined by H(p, q) = H(X) + KL(p || q), where H(X) is the entropy of X with respect to the distribution p and KL is the Kullback–Leibler distance between p and q. If p and q are discrete this reduces to H(p, q) = −Σx p(x) log q(x), and for continuous p and q we have H(p, q) = −∫ p(x) log q(x) dx.
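
A discrete check that the two forms agree (an R sketch reusing the entropy_bits and kl_div helpers sketched earlier):

    p <- c(1/2, 1/4, 1/8, 1/8)
    q <- c(1/4, 1/4, 1/4, 1/4)

    cross_entropy <- function(p, q) -sum(p[p > 0] * log2(q[p > 0]))

    cross_entropy(p, q)              # 2 bits
    entropy_bits(p) + kl_div(p, q)   # also 2: H(p, q) = H(p) + KL(p || q)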

Slide 35: Bivariate Gaussians. Write X ~ N(μ, Σ) to mean p(x) = (1 / (2π |Σ|^(1/2))) exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) ), where the Gaussian's parameters are the mean vector μ and the 2×2 covariance matrix Σ, and where we insist that Σ is symmetric non-negative definite.

Slide 36: Bivariate Gaussians (continued). With X ~ N(μ, Σ) defined as on slide 35, it turns out that E[X] = μ and Cov[X] = Σ. (Note that this is a resulting property of Gaussians, not a definition.)

Slide 37: Evaluating p(x): Step 1. 1. Begin with vector x. [Figure: x and μ plotted in the plane.]

Slide 38: Evaluating p(x): Step 2. 1. Begin with vector x. 2. Define Δ = x − μ. [Figure: x, μ, and the difference vector Δ.]

Slide 39: Evaluating p(x): Step 3. 1. Begin with vector x. 2. Define Δ = x − μ. 3. Count the number of contours crossed of the ellipsoids formed by Σ⁻¹; D = this count = sqrt(Δᵀ Σ⁻¹ Δ) = the Mahalanobis distance between x and μ. [Figure: contours defined by sqrt(Δᵀ Σ⁻¹ Δ) = constant.]

Slide 40: Evaluating p(x): Step 4. 1. Begin with vector x. 2. Define Δ = x − μ. 3. D = sqrt(Δᵀ Σ⁻¹ Δ) = the Mahalanobis distance between x and μ. 4. Define w = exp(−D²/2). [Figure: plot of exp(−D²/2) against D².] A point x close to μ in squared Mahalanobis distance gets a large weight; one far away gets a tiny weight.

Slide 41: Evaluating p(x): Step 5. 1. Begin with vector x. 2. Define Δ = x − μ. 3. D = sqrt(Δᵀ Σ⁻¹ Δ) = the Mahalanobis distance between x and μ. 4. Define w = exp(−D²/2). 5. Multiply w by the normalizing constant from slide 35, 1/(2π |Σ|^(1/2)), to obtain p(x). [Figure: plot of exp(−D²/2) against D².]
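
The five steps as an R sketch (the mean vector, covariance matrix, and query point below are made-up example values, not numbers from the slides):

    mu    <- c(1, 2)
    Sigma <- matrix(c(2.0, 0.6,
                      0.6, 1.0), nrow = 2)
    x     <- c(2.5, 2.0)

    delta <- x - mu                                       # step 2
    D2    <- drop(t(delta) %*% solve(Sigma) %*% delta)    # squared Mahalanobis distance (step 3)
    w     <- exp(-D2 / 2)                                 # step 4
    p_x   <- w / (2 * pi * sqrt(det(Sigma)))              # step 5: apply the bivariate normalizer
    p_x
    # Cross-check against the closed-form density, if the mvtnorm package is installed:
    # mvtnorm::dmvnorm(x, mean = mu, sigma = Sigma)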

Slide 42: Bivariate normal NB(0,0,1,1,0). persp(x,y,a,theta=30,phi=10,zlab="f(x,y)",box=FALSE,col=4)

Slide 43: Bivariate normal NB(0,0,1,1,0). filled.contour(x,y,a,nlevels=4,col=2:5)
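
The grid objects x, y, and a are not shown in the transcript; one way to construct them for NB(0,0,1,1,0) — means 0, unit variances, zero correlation — before calling the two plotting commands is:

    # Grid of points and the standard bivariate normal density evaluated on it
    x <- seq(-3, 3, length.out = 50)
    y <- seq(-3, 3, length.out = 50)
    a <- outer(x, y, function(u, v) dnorm(u) * dnorm(v))   # zero correlation: f(x,y) = f(x) * f(y)

    persp(x, y, a, theta = 30, phi = 10, zlab = "f(x,y)", box = FALSE, col = 4)
    filled.contour(x, y, a, nlevels = 4, col = 2:5)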

Slide 44: Multivariate Gaussians. Write the m-dimensional X ~ N(μ, Σ) to mean p(x) = (1 / ((2π)^(m/2) |Σ|^(1/2))) exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) ), where the Gaussian's parameters are the m-dimensional mean vector μ and the m×m covariance matrix Σ, and where we insist that Σ is symmetric non-negative definite. Again, E[X] = μ and Cov[X] = Σ. (Note that this is a resulting property of Gaussians, not a definition.)

Slide 45: General Gaussians. [Figure: elliptical contours in the (x1, x2) plane with arbitrary orientation.]

Slide 46: Axis-Aligned Gaussians. [Figure: elliptical contours in the (x1, x2) plane whose axes line up with the coordinate axes (diagonal covariance).]

Slide 47: Spherical Gaussians. [Figure: circular contours in the (x1, x2) plane (covariance proportional to the identity).]

Slide 48: Subsets of variables. This will be our standard notation for breaking an m-dimensional distribution into subsets of variables: write X as a stacked vector of two sub-vectors U and V, with μ partitioned into (μu, μv) and Σ partitioned into the corresponding blocks Σuu, Σuv, Σvu, Σvv.

Slide 49: Gaussian Marginals are Gaussian. Marginalize: if (U, V) is jointly Gaussian with the partitioned parameters of slide 48, then U is also distributed as a Gaussian, U ~ N(μu, Σuu).

Slide 50: Gaussian Marginals are Gaussian. Marginalize: U is also distributed as a Gaussian. The fact that the marginal is Gaussian is not immediately obvious; that its parameters are μu and Σuu is obvious once we know it's a Gaussian (why?).

Slide 51: Gaussian Marginals are Gaussian. Marginalize: U is also distributed as a Gaussian. How would you prove this?

Slide 52: Linear Transforms remain Gaussian. Multiply by matrix A: assume X is an m-dimensional Gaussian random variable, X ~ N(μ, Σ). Define Y to be a p-dimensional random variable Y = AX, where A is a p × m matrix. Then Y ~ N(Aμ, A Σ Aᵀ).
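
A quick Monte Carlo sanity check of the Y = AX result (an R sketch; it assumes the MASS package is available for its mvrnorm sampler, and every number below is a made-up example):

    library(MASS)   # for mvrnorm
    set.seed(1)

    mu    <- c(0, 1, 2)
    Sigma <- diag(c(1, 2, 3))
    A     <- matrix(c(1, 1,  0,
                      0, 1, -1), nrow = 2, byrow = TRUE)   # a 2 x 3 matrix

    X <- mvrnorm(n = 100000, mu = mu, Sigma = Sigma)       # one sample of X per row
    Y <- X %*% t(A)                                        # the corresponding samples of Y = AX

    colMeans(Y)   # close to A %*% mu:             (1, -1)
    cov(Y)        # close to A %*% Sigma %*% t(A): rows (3, 2) and (2, 5)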

Slide 53: Adding samples of 2 independent Gaussians is Gaussian. If X ~ N(μx, Σx) and Y ~ N(μy, Σy) are independent, then X + Y ~ N(μx + μy, Σx + Σy). Why doesn't this hold if X and Y are dependent? Which of the statements below is true? (a) If X and Y are dependent, then X+Y is Gaussian but possibly with some other covariance. (b) If X and Y are dependent, then X+Y might be non-Gaussian.

Slide 54: Conditional of Gaussian is Gaussian. Conditionalize: if (U, V) is jointly Gaussian with the partitioned parameters of slide 48, then the conditional distribution of U given V = v is also Gaussian, with mean μu + Σuv Σvv⁻¹ (v − μv) and covariance Σuu − Σuv Σvv⁻¹ Σvu.

Slide 55: The bivariate normal case. In two dimensions the conditioning formulas reduce to: u | v ~ N( μu + (σuv/σvv)(v − μv), σuu − σuv²/σvv ).

Slide 56: [Figure-only slide; no text.]

Slide 57: [Figure: the marginal density P(w) together with the conditional densities P(w|m=82) and P(w|m=76).]

Slide 58: [Same figure: P(w), P(w|m=82), P(w|m=76).] Note: the conditional variance is independent of the given value of v. Note: the conditional variance can only be equal to or smaller than the marginal variance. Note: the conditional mean is a linear function of v. Note: when the given value of v is μv, the conditional mean of u is μu.
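
The bivariate conditioning formulas as an R sketch (the means, variances, and covariance below are made-up illustrative numbers; only the formulas come from slides 54–58):

    # Joint Gaussian over (w, m) with made-up parameters
    mu_w <- 70;  mu_m <- 79
    var_w <- 36; var_m <- 25; cov_wm <- 20

    # Conditional of w given m = v: the mean is linear in v, the variance does not depend on v
    cond_w_given_m <- function(v) {
      c(mean = mu_w + (cov_wm / var_m) * (v - mu_m),
        var  = var_w - cov_wm^2 / var_m)
    }

    cond_w_given_m(82)   # mean 72.4, variance 20
    cond_w_given_m(76)   # mean 67.6, variance 20 (same variance, smaller than the marginal 36)
    cond_w_given_m(79)   # conditioning at v = mu_m gives conditional mean mu_w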

Slide 59: Gaussians and the chain rule. Chain rule: let A be a constant matrix. [Formulas shown as images on the slide.]

Slide 60: Available Gaussian tools. Chain rule; conditionalize; add (+); multiply by matrix A; marginalize.

Slide 61: Assume… You are an intellectual snob. You have a child.

Slide 62: Intellectual snobs with children …are obsessed with IQ. In the world as a whole, IQs are drawn from a Gaussian N(100, 15²).

Slide 63: IQ tests. If you take an IQ test you'll get a score that, on average (over many tests), will be your IQ. But because of noise on any one test, the score will often be a few points lower or higher than your true IQ: SCORE | IQ ~ N(IQ, 10²).

Slide 64: Assume… You drag your kid off to get tested. She gets a score of 130. "Yippee," you screech, and start deciding how to casually refer to her membership of the top 2% of IQs in your Christmas newsletter. P(X < 130 | μ = 100, σ² = 15²) = P(X < 2 | μ = 0, σ² = 1) = Φ(2) = 0.977.

Slide 65: Assume… (continued). P(X < 130 | μ = 100, σ² = 15²) = P(X < 2 | μ = 0, σ² = 1) = Φ(2) = 0.977. You are thinking: well, sure, the test isn't accurate, so she might have an IQ of 120 or she might have an IQ of 140, but the most likely IQ given the evidence "score = 130" is, of course, 130. Can we trust this reasoning?

Slide 66: What we really want. IQ ~ N(100, 15²); S | IQ ~ N(IQ, 10²); S = 130. Question: what is IQ | (S = 130)? This is called the posterior distribution of IQ.

Slide 67: Which tool or tools? IQ ~ N(100, 15²); S | IQ ~ N(IQ, 10²); S = 130. Question: what is IQ | (S = 130)? Available tools: chain rule; conditionalize; add (+); multiply by matrix A; marginalize.

Slide 68: Plan. IQ ~ N(100, 15²); S | IQ ~ N(IQ, 10²); S = 130. Question: what is IQ | (S = 130)? Plan: chain rule; conditionalize; swap.

Slide 69: Working… IQ ~ N(100, 15²); S | IQ ~ N(IQ, 10²); S = 130. Question: what is IQ | (S = 130)? [The derivation is worked through as equations on the slide.]
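
Carrying the plan through numerically (an R sketch; the Gaussian prior-times-Gaussian-likelihood posterior formula is standard, and only the prior N(100, 15²), the test model S | IQ ~ N(IQ, 10²), and the observed score 130 come from the slides):

    prior_mean <- 100; prior_var <- 15^2   # IQ ~ N(100, 15^2)
    noise_var  <- 10^2                     # S | IQ ~ N(IQ, 10^2)
    s          <- 130                      # observed score

    # Posterior IQ | S = s is Gaussian; combine the two sources of information by precision
    post_var  <- 1 / (1 / prior_var + 1 / noise_var)
    post_mean <- post_var * (prior_mean / prior_var + s / noise_var)

    post_mean        # about 120.8 -- pulled back toward 100 rather than sitting at 130
    sqrt(post_var)   # posterior standard deviation, about 8.3
    # So "the most likely IQ given score = 130 is 130" is not quite right.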

