
1 CS B351 Statistical Learning

2 Agenda
- Learning coin flips, learning Bayes net parameters
- Likelihood functions, maximum likelihood estimation (MLE)
- Priors, maximum a posteriori estimation (MAP)
- Bayesian estimation

3 Learning Coin Flips
- Observe that c out of N draws are cherries (data)
- Intuition: c/N might be a good hypothesis for the fraction of cherries in the bag
- "Intuitive" parameter estimate: the empirical distribution, P(cherry) ≈ c/N
- (Why is this reasonable? Perhaps we got a bad draw!)

4 Learning Coin Flips
- Observe that c out of N draws are cherries (data)
- Let the unknown fraction of cherries be θ (hypothesis)
- Probability of drawing a cherry is θ
- Assumption: draws are independent and identically distributed (i.i.d.)

5 Learning Coin Flips
- Probability of drawing a cherry is θ
- Assumption: draws are independent and identically distributed (i.i.d.)
- Probability of drawing 2 cherries is θ^2
- Probability of drawing 2 limes is (1-θ)^2
- Probability of drawing 1 cherry and 1 lime: θ(1-θ)

6 Likelihood Function
- Likelihood: the probability of the data d = {d_1, …, d_N} given the hypothesis θ
- P(d|θ) = Π_j P(d_j|θ)   (i.i.d. assumption)

7 Likelihood Function
- Likelihood: the probability of the data d = {d_1, …, d_N} given the hypothesis θ
- P(d|θ) = Π_j P(d_j|θ), where P(d_j|θ) = θ if d_j = Cherry, and 1-θ if d_j = Lime
- (Probability model, assuming θ is given)

8 Likelihood Function
- Likelihood: the probability of the data d = {d_1, …, d_N} given the hypothesis θ
- P(d|θ) = Π_j P(d_j|θ) = θ^c (1-θ)^(N-c)
- (θ if d_j = Cherry, 1-θ if d_j = Lime; gather the c cherry terms together, then the N-c lime terms)
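As a concrete illustration (not part of the original slides), here is a minimal Python sketch that evaluates this likelihood for a sequence of draws; the function name likelihood and the 'C'/'L' encoding of cherry/lime draws are assumptions made for the example.

```python
def likelihood(draws, theta):
    """Likelihood P(d | theta) of an i.i.d. sequence of 'C' (cherry) / 'L' (lime) draws."""
    c = sum(1 for d in draws if d == "C")       # number of cherries
    n = len(draws)
    return theta**c * (1 - theta)**(n - c)      # theta^c (1-theta)^(N-c)

draws = ["C", "L", "C"]                         # 2 cherries, 1 lime
for theta in (0.25, 0.5, 0.75):
    print(theta, likelihood(draws, theta))
```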

9 Maximum Likelihood
- Likelihood of data d = {d_1, …, d_N} given θ: P(d|θ) = θ^c (1-θ)^(N-c)

(Slides 10-15 repeat this likelihood function alongside plots for different observed datasets; the figures are not reproduced in the transcript.)

16 Maximum Likelihood
- Peaks of the likelihood function seem to hover around the fraction of cherries…
- Sharpness indicates some notion of certainty…

17 Maximum Likelihood
- P(d|θ) is the likelihood function
- The quantity argmax_θ P(d|θ) is known as the maximum likelihood estimate (MLE)
- For the data shown, θ = 1 is the MLE

18 Maximum Likelihood (same definitions; for this data, θ = 1 is the MLE)

19 Maximum Likelihood (for this data, θ = 2/3 is the MLE)

20 Maximum Likelihood (for this data, θ = 1/2 is the MLE)

21 Maximum Likelihood (for this data, θ = 2/5 is the MLE)
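The argmax can be checked numerically. Here is a small sketch (not from the slides) assuming NumPy is available; mle_by_grid is an illustrative helper name, and the (c, N) pairs mirror the examples above.

```python
import numpy as np

def mle_by_grid(c, n, grid_size=1001):
    """Approximate argmax_theta of theta^c (1-theta)^(n-c) by evaluating it on a grid."""
    thetas = np.linspace(0.0, 1.0, grid_size)
    likelihoods = thetas**c * (1 - thetas)**(n - c)
    return thetas[np.argmax(likelihoods)]

print(mle_by_grid(3, 3))   # all cherries -> MLE near 1
print(mle_by_grid(2, 3))   # -> near 2/3
print(mle_by_grid(2, 5))   # -> near 2/5
```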

22 Proof: Empirical Frequency is the MLE
- l(θ) = log P(d|θ) = log[θ^c (1-θ)^(N-c)]

23 Proof: Empirical Frequency is the MLE
- l(θ) = log P(d|θ) = log[θ^c (1-θ)^(N-c)] = log[θ^c] + log[(1-θ)^(N-c)]

24 Proof: Empirical Frequency is the MLE
- l(θ) = log P(d|θ) = log[θ^c] + log[(1-θ)^(N-c)] = c log θ + (N-c) log(1-θ)

25 Proof: Empirical Frequency is the MLE
- l(θ) = log P(d|θ) = c log θ + (N-c) log(1-θ)
- Setting dl/dθ(θ) = 0 gives the maximum likelihood estimate

26 Proof: Empirical Frequency is the MLE
- dl/dθ(θ) = c/θ − (N-c)/(1-θ)
- At the MLE, c/θ − (N-c)/(1-θ) = 0, so c(1-θ) = (N-c)θ, i.e. c = Nθ  =>  θ = c/N
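The same stationary point can be recovered symbolically. A short sketch assuming SymPy is available; solving dl/dθ = 0 should return c/N.

```python
import sympy as sp

theta, c, N = sp.symbols("theta c N", positive=True)
log_lik = c * sp.log(theta) + (N - c) * sp.log(1 - theta)   # c log(theta) + (N-c) log(1-theta)
stationary = sp.solve(sp.Eq(sp.diff(log_lik, theta), 0), theta)
print(stationary)   # [c/N]
```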

27 Maximum Likelihood for BN
- For any BN, the ML parameters of any CPT can be derived from the fraction of observed values in the data, conditioned on matching parent values
- Example network: Earthquake and Burglar are parents of Alarm
- Data (N = 1000): E: 500, B: 200, so P(E) = 0.5 and P(B) = 0.2
- Alarm counts: A|E,B: 19/20; A|B: 188/200; A|E: 170/500; A|¬E,¬B: 1/380
- Resulting CPT P(A|E,B):
  E B   P(A|E,B)
  T T   0.95
  F T   0.94
  T F   0.34
  F F   0.003

28 Fitting CPTs via MLE
- M examples D = (d[1], …, d[M])
- Each d[i] is a complete example of all variables in the Bayes net
- Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN

29 Fitting CPTs via MLE
- M examples D = (d[1], …, d[M]), each a complete i.i.d. sample from the BN's joint distribution
- Suppose the BN has a single variable X
- Estimate X's CPT, P(X)

30 Fitting CPTs via MLE
- M examples D = (d[1], …, d[M]), each a complete i.i.d. sample from the BN's joint distribution
- Suppose the BN has a single variable X; estimate its CPT, P(X)
- (Just learning a coin flip as usual)
- P_MLE(X) = empirical distribution of D
  P_MLE(X=T) = Count(X=T) / M
  P_MLE(X=F) = Count(X=F) / M

31 Fitting CPTs via MLE
- M examples D = (d[1], …, d[M]), each a complete i.i.d. sample from the BN's joint distribution
- Suppose the BN has two variables with edge X → Y
- Estimate P(X) and P(Y|X)
- Estimate P_MLE(X) as usual

32 Fitting CPTs via MLE
- M examples D = (d[1], …, d[M]), each a complete i.i.d. sample from the BN's joint distribution
- BN: X → Y
- Estimate P_MLE(Y|X) with… (table filled in on the next slide)

33 Fitting CPTs via MLE
- M examples D = (d[1], …, d[M]), each a complete i.i.d. sample from the BN's joint distribution
- Estimate P_MLE(Y|X) with:
  P(Y|X)   X=T                            X=F
  Y=T      Count(Y=T,X=T) / Count(X=T)    Count(Y=T,X=F) / Count(X=F)
  Y=F      Count(Y=F,X=T) / Count(X=T)    Count(Y=F,X=F) / Count(X=F)
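A minimal Python sketch of this counting procedure (not from the slides); fit_cpt_mle and the (x, y) tuple encoding of examples are assumptions made for the illustration.

```python
from collections import Counter

def fit_cpt_mle(data):
    """MLE of P(Y|X) from complete (x, y) samples by conditional counting.
    data: list of (x, y) pairs with boolean values."""
    joint = Counter(data)                      # Count(X=x, Y=y)
    parent = Counter(x for x, _ in data)       # Count(X=x)
    cpt = {}
    for x in (True, False):
        for y in (True, False):
            cpt[(y, x)] = joint[(x, y)] / parent[x]   # P(Y=y | X=x)
    return cpt

samples = [(True, True), (True, False), (True, True), (False, False), (False, False)]
print(fit_cpt_mle(samples))   # e.g. P(Y=T|X=T) = 2/3, P(Y=T|X=F) = 0
```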

34 Fitting CPTs via MLE
(Figure: a Bayes net with parents X1, X2, X3 and child Y; not reproduced)

35 Other MLE Results
- Categorical distributions (non-binary discrete variables): the empirical distribution is the MLE
  (make a histogram of counts, divide by N)
- Continuous Gaussian distributions:
  Mean = average of the data
  Standard deviation = standard deviation of the data
(Figures: a histogram and a fitted Gaussian (normal) distribution)
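A short sketch of both results, assuming NumPy is available; the data values are made up for the example.

```python
import numpy as np

data = np.array([2.1, 1.9, 2.5, 2.2, 1.8, 2.4])

# Gaussian MLE: sample mean and (population) standard deviation of the data
mu_mle = data.mean()
sigma_mle = data.std()            # ddof=0 is the MLE, not the unbiased estimator
print(mu_mle, sigma_mle)

# Categorical MLE: histogram of counts divided by N
draws = ["a", "b", "a", "c", "a", "b"]
values, counts = np.unique(draws, return_counts=True)
p_mle = dict(zip(values, counts / len(draws)))
print(p_mle)                      # {'a': 0.5, 'b': 1/3, 'c': 1/6}
```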

36 Nice Properties of MLE
- Easy to compute (for certain probability models)
- With enough data, the estimate θ_MLE will approach the true unknown value of θ

37 Problems with MLE
- The MLE was easy to compute… but what happens when we don't have much data?
- Motivation: You hand me a coin from your pocket; 1 flip, and it turns up tails. What's the MLE?

38 Problems with MLE
- The MLE was easy to compute… but what happens when we don't have much data?
- Motivation: You hand me a coin from your pocket; 1 flip, and it turns up tails. What's the MLE?
- θ_MLE has a high variance with small sample sizes

39 Variance of an Estimator: Intuition
- The dataset D is just a sample of the underlying distribution; if we could "do over" the sample, we might get a new dataset D'
- With D', our MLE estimate θ_MLE' might be different. How much? How often?
- Assume all values of θ are equally likely
- In the case of 1 draw, D could just as likely have been a lime; in that case θ_MLE = 0
- So with probability 0.5, θ_MLE would be 1, and with the same probability θ_MLE would be 0
- High variance: typical "do overs" give drastically different results! (see the simulation sketch below)
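A small simulation of this "do over" experiment (mine, not from the slides), with an assumed true fraction of cherries θ = 0.6; it shows the spread of the MLE shrinking as the sample size grows.

```python
import random

random.seed(0)
theta_true = 0.6     # hypothetical true fraction of cherries
for n in (1, 5, 100):
    # "Do over" the dataset many times and look at the spread of the MLE c/n
    estimates = []
    for _ in range(10000):
        c = sum(random.random() < theta_true for _ in range(n))
        estimates.append(c / n)
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    print(f"n={n:3d}  mean of MLE={mean:.3f}  variance={var:.4f}")
```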

40 Fitting Bayesian Network CPTs with MLE
- Potential problem: for large k, very few datapoints will share the parent values x_1, …, x_k!
- The expected number of matching datapoints is O(M · P(x_1, …, x_k)), and some value combinations may be even rarer
- This is known as data fragmentation

41 Is There a Better Way? Bayesian Learning

42 An Alternative Approach: Bayesian Estimation
- P(D|θ) is the likelihood
- P(θ) is the hypothesis prior
- P(θ|D) = α P(D|θ) P(θ) is the posterior: the distribution of hypotheses given the data
(Figure: θ as parent of d[1], d[2], …, d[M])

43 Bayesian Prediction
- For a new draw Y: use the hypothesis posterior to predict P(Y|D)
(Figure: θ as parent of d[1], d[2], …, d[M] and of the new draw Y)

44 Candy Example
- Candy comes in 2 flavors, cherry and lime, with identical wrappers
- Manufacturer makes 5 indistinguishable bags
- Suppose we draw… What bag are we holding? What flavor will we draw next?
- Hypotheses:
  h1: 100% cherry, 0% lime
  h2: 75% cherry, 25% lime
  h3: 50% cherry, 50% lime
  h4: 25% cherry, 75% lime
  h5: 0% cherry, 100% lime

45 Bayesian Learning
- Main idea: compute the probability of each hypothesis given the data
- Data D: (the sequence of draws, shown on the slide)
- Hypotheses: h1, …, h5 (the five bags above, from 100% cherry to 100% lime)

46 Bayesian Learning
- Main idea: compute the probability of each hypothesis given the data
- Data D: (the sequence of draws); Hypotheses: h1, …, h5
- We want P(h_i|D)… but all we have is P(D|h_i)!

47 Using Bayes' Rule
- P(h_i|D) = α P(D|h_i) P(h_i) is the posterior
  (Recall, 1/α = P(D) = Σ_i P(D|h_i) P(h_i))
- P(D|h_i) is the likelihood
- P(h_i) is the hypothesis prior

48 Computing the Posterior
- Assume draws are independent
- Let P(h1), …, P(h5) = (0.1, 0.2, 0.4, 0.2, 0.1)
- D = {10 limes in a row}
- Likelihoods: P(D|h1) = 0, P(D|h2) = 0.25^10, P(D|h3) = 0.5^10, P(D|h4) = 0.75^10, P(D|h5) = 1^10
- Unnormalized posteriors: P(D|h1)P(h1) = 0, P(D|h2)P(h2) = 9e-8, P(D|h3)P(h3) = 4e-4, P(D|h4)P(h4) = 0.011, P(D|h5)P(h5) = 0.1
- Sum = 1/α = 0.1114
- Posteriors: P(h1|D) = 0, P(h2|D) = 0.00, P(h3|D) = 0.00, P(h4|D) = 0.10, P(h5|D) = 0.90
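A small sketch (not from the slides) reproducing this computation; the exact values differ slightly in the last digits from the rounded numbers above.

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]            # P(h1) ... P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]          # per-hypothesis probability of drawing a lime
n_limes = 10                                   # D = 10 limes in a row

unnormalized = [p**n_limes * prior for p, prior in zip(p_lime, priors)]   # P(D|h_i) P(h_i)
evidence = sum(unnormalized)                                              # P(D) = 1/alpha
posterior = [u / evidence for u in unnormalized]
print([round(p, 3) for p in posterior])        # roughly [0.0, 0.0, 0.003, 0.101, 0.896]
```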

49 Posterior Hypotheses
(Figure not reproduced in the transcript)

50 Predicting the Next Draw
- P(Y|D) = Σ_i P(Y|h_i, D) P(h_i|D) = Σ_i P(Y|h_i) P(h_i|D)
- Posteriors: P(h1|D) = 0, P(h2|D) = 0.00, P(h3|D) = 0.00, P(h4|D) = 0.10, P(h5|D) = 0.90
- Probability that the next candy drawn is a lime under each hypothesis: P(Y|h1) = 0, P(Y|h2) = 0.25, P(Y|h3) = 0.5, P(Y|h4) = 0.75, P(Y|h5) = 1
- P(Y|D) = 0.975
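Continuing the previous sketch (recomputing the posterior so the snippet is self-contained), the posterior-predictive probability that the next draw is a lime:

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]
unnormalized = [p**10 * prior for p, prior in zip(p_lime, priors)]   # 10 limes observed
posterior = [u / sum(unnormalized) for u in unnormalized]

# P(Y|D) = sum_i P(Y|h_i) P(h_i|D)
p_next_lime = sum(p * post for p, post in zip(p_lime, posterior))
print(round(p_next_lime, 3))   # about 0.97 (0.975 on the slide, which uses rounded posteriors)
```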

51 P(Next Candy is Lime | D)
(Figure not reproduced in the transcript)

52 Back to Coin Flips: Uniform Prior, Bernoulli Distribution
(Figure: θ as parent of d[1], d[2], …, d[M] and the new draw Y)

53 Assumption: Uniform Prior, Bernoulli Distribution
- (Figure: θ as parent of d[1], d[2], …, d[M] and Y)
- Can think of this as a "correction" using "virtual counts"

54 Nonuniform Priors
- P(θ|D) ∝ P(D|θ) P(θ) = θ^c (1-θ)^(N-c) P(θ)
- Define, for all θ in [0, 1], the probability P(θ) that I believe in θ

55 Beta Distribution
- Beta_{α,β}(θ) = γ θ^(α-1) (1-θ)^(β-1)
- α, β are hyperparameters > 0
- γ is a normalization constant
- α = β = 1 is the uniform distribution

56 Posterior with Beta Prior
- Posterior ∝ θ^c (1-θ)^(N-c) P(θ) = γ θ^(c+α-1) (1-θ)^(N-c+β-1) = Beta_{α+c, β+N-c}(θ)
- Prediction = mean: E[θ] = (c+α)/(N+α+β)
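A minimal sketch of this conjugate update (the helper name beta_posterior is illustrative); it also shows how the prediction differs from the raw MLE for the one-flip example discussed earlier.

```python
def beta_posterior(c, n, alpha=1.0, beta=1.0):
    """Posterior Beta parameters and posterior-mean prediction after seeing
    c cherries in n draws, with a Beta(alpha, beta) prior."""
    post_alpha = alpha + c
    post_beta = beta + (n - c)
    mean = post_alpha / (post_alpha + post_beta)     # (c + alpha) / (n + alpha + beta)
    return post_alpha, post_beta, mean

# Uniform prior (alpha = beta = 1): 0 cherries in 1 draw predicts 1/3, not the MLE of 0
print(beta_posterior(0, 1))            # (1.0, 2.0, 0.333...)
# A stronger prior centered at 0.5 pulls the estimate further toward 1/2
print(beta_posterior(0, 1, 10, 10))    # (10.0, 11.0, ~0.476)
```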

57 Posterior with Beta Prior
- What does this mean?
- The prior specifies a "virtual count" of a = α-1 heads and b = β-1 tails
- See heads, increment a; see tails, increment b
- The effect of the prior diminishes with more data

58 Choosing a Prior
- Part of the design process; must be chosen according to your intuition
- Uninformed belief => low α, β (e.g. uniform); strong belief => α, β high

59 Fitting CPTs via MAP
- M examples D = (d[1], …, d[M]), virtual counts a, b
- Estimate P(Y|X) by assuming we've also seen a examples of Y=T and b examples of Y=F for each parent value:
  P(Y|X)   X=T                                        X=F
  Y=T      (Count(Y=T,X=T)+a) / (Count(X=T)+a+b)      (Count(Y=T,X=F)+a) / (Count(X=F)+a+b)
  Y=F      (Count(Y=F,X=T)+b) / (Count(X=T)+a+b)      (Count(Y=F,X=F)+b) / (Count(X=F)+a+b)
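A sketch of this smoothed counting (not from the slides), extending the earlier fit_cpt_mle example with the virtual counts a and b; names and data are illustrative.

```python
from collections import Counter

def fit_cpt_map(data, a=1.0, b=1.0):
    """Smoothed estimate of P(Y|X) using virtual counts: a extra Y=T and b extra Y=F
    examples for each value of the parent X. data: list of (x, y) boolean pairs."""
    joint = Counter(data)
    parent = Counter(x for x, _ in data)
    cpt = {}
    for x in (True, False):
        denom = parent[x] + a + b
        cpt[(True, x)] = (joint[(x, True)] + a) / denom    # P(Y=T | X=x)
        cpt[(False, x)] = (joint[(x, False)] + b) / denom  # P(Y=F | X=x)
    return cpt

samples = [(True, True), (True, False), (True, True), (False, False), (False, False)]
print(fit_cpt_map(samples))   # no zero probabilities, unlike the raw MLE counts
```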

60 Properties of MAP
- Approaches the MLE as the dataset grows large (the effect of the prior diminishes in the face of evidence)
- More stable estimates than MLE with small sample sizes: lower variance, but added bias
- Needs a designer's judgment to set the prior

61 Extensions of Beta Priors
- Parameters of multi-valued (categorical) distributions, e.g. histograms: Dirichlet prior
- Mathematical derivation is more complex, but in practice it still takes the form of "virtual counts"
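A sketch of the categorical "virtual count" version (not from the slides), with an illustrative helper categorical_map and a uniform pseudocount for each category.

```python
from collections import Counter

def categorical_map(draws, categories, pseudocount=1.0):
    """Dirichlet-smoothed ("virtual count") estimate of a categorical distribution:
    add `pseudocount` to the count of every category before normalizing."""
    counts = Counter(draws)
    total = len(draws) + pseudocount * len(categories)
    return {cat: (counts[cat] + pseudocount) / total for cat in categories}

print(categorical_map(["a", "a", "b"], ["a", "b", "c"]))
# {'a': 0.5, 'b': 1/3, 'c': 1/6} -- 'c' gets nonzero mass despite never being observed
```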

62 Recap
- Parameter learning via coin flips
- Maximum likelihood
- Bayesian learning with a Beta prior
- Learning Bayes net parameters

63 Next Time
- Introduction to machine learning
- R&N 18.1-3

