CS B351: STATISTICAL LEARNING
AGENDA
- Learning coin flips, learning Bayes net parameters
- Likelihood functions, maximum likelihood estimation (MLE)
- Priors, maximum a posteriori estimation (MAP)
- Bayesian estimation
LEARNING COIN FLIPS
Observe that c out of N draws are cherries (data)
Intuition: c/N might be a good hypothesis for the fraction of cherries in the bag
"Intuitive" parameter estimate: the empirical distribution, P(cherry) ≈ c/N
(Why is this reasonable? Perhaps we got a bad draw!)
LEARNING COIN FLIPS
Observe that c out of N draws are cherries (data)
Let the unknown fraction of cherries be θ (hypothesis)
Probability of drawing a cherry is θ
Assumption: draws are independent and identically distributed (i.i.d.)
LEARNING COIN FLIPS
Probability of drawing a cherry is θ
Assumption: draws are independent and identically distributed (i.i.d.)
Probability of drawing 2 cherries is θ²
Probability of drawing 2 limes is (1-θ)²
Probability of drawing 1 cherry and 1 lime: θ(1-θ)
LIKELIHOOD FUNCTION
Likelihood: the probability of the data d = {d_1, …, d_N} given the hypothesis θ
P(d|θ) = ∏_j P(d_j|θ)   (by the i.i.d. assumption)
LIKELIHOOD FUNCTION
Likelihood: the probability of the data d = {d_1, …, d_N} given the hypothesis θ
P(d|θ) = ∏_j P(d_j|θ), where P(d_j|θ) = θ if d_j = Cherry and 1-θ if d_j = Lime
(This is the probability model, assuming θ is given)
LIKELIHOOD FUNCTION
Likelihood: the probability of the data d = {d_1, …, d_N} given the hypothesis θ
P(d|θ) = ∏_j P(d_j|θ) = θ^c (1-θ)^(N-c)
(Gather the c cherry terms together, then the N-c lime terms)
MAXIMUM LIKELIHOOD
Likelihood of data d = {d_1, …, d_N} given θ:  P(d|θ) = θ^c (1-θ)^(N-c)
MAXIMUM LIKELIHOOD
Peaks of the likelihood function seem to hover around the fraction of cherries…
Sharpness indicates some notion of certainty…
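To make these likelihood curves concrete, here is a minimal Python sketch (my own, not from the slides; it assumes NumPy is available) that evaluates the likelihood θ^c (1-θ)^(N-c) on a grid for two datasets with the same cherry fraction. The peak sits at c/N in both cases and is much sharper for the larger dataset.

```python
# A minimal sketch: evaluate the Bernoulli likelihood L(theta) = theta^c * (1-theta)^(N-c)
# on a grid and locate its peak for a small and a large dataset.
import numpy as np

def likelihood(theta, c, N):
    """Probability of observing c cherries in N i.i.d. draws, given theta."""
    return theta**c * (1.0 - theta)**(N - c)

thetas = np.linspace(0.0, 1.0, 1001)
for c, N in [(3, 5), (60, 100)]:              # same fraction c/N, more data
    L = likelihood(thetas, c, N)
    peak = thetas[np.argmax(L)]
    print(f"c={c}, N={N}: peak at theta={peak:.2f} (c/N={c/N:.2f}), max L={L.max():.3e}")
# The peak sits at c/N in both cases; with more data the curve around the
# peak is much sharper (neighboring thetas have far smaller likelihood).
```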
MAXIMUM LIKELIHOOD
P(d|θ) is the likelihood function
The quantity argmax_θ P(d|θ) is known as the maximum likelihood estimate (MLE)
Examples (reading the peak off the likelihood plots for successive datasets): θ = 1, θ = 2/3, θ = 1/2, θ = 2/5
PROOF: EMPIRICAL FREQUENCY IS THE MLE
l(θ) = log P(d|θ)
     = log [θ^c (1-θ)^(N-c)]
     = log [θ^c] + log [(1-θ)^(N-c)]
     = c log θ + (N-c) log (1-θ)
PROOF: EMPIRICAL FREQUENCY IS THE MLE
l(θ) = log P(d|θ) = c log θ + (N-c) log (1-θ)
Setting dl/dθ = 0 gives the maximum likelihood estimate
PROOF: EMPIRICAL FREQUENCY IS THE MLE
dl/dθ = c/θ – (N-c)/(1-θ)
At the MLE, c/θ – (N-c)/(1-θ) = 0  =>  c(1-θ) = (N-c)θ  =>  c = Nθ  =>  θ = c/N
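As a sanity check on this result, here is a small numerical sketch (my own, assuming NumPy and SciPy are available) that maximizes the log-likelihood directly and compares the maximizer to c/N:

```python
# Maximize l(theta) = c log(theta) + (N-c) log(1-theta) numerically and
# confirm the maximizer matches the closed form theta = c/N.
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(theta, c, N):
    return -(c * np.log(theta) + (N - c) * np.log(1.0 - theta))

c, N = 7, 10
res = minimize_scalar(neg_log_likelihood, bounds=(1e-9, 1 - 1e-9),
                      method="bounded", args=(c, N))
print(f"numerical MLE = {res.x:.4f}, closed form c/N = {c/N:.4f}")
```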
MAXIMUM LIKELIHOOD FOR BN
For any BN, the ML parameters of any CPT can be derived from the fraction of observed values in the data, conditioned on matching parent values
Example: Burglar and Earthquake are parents of Alarm; N = 1000 samples
E: 500, B: 200  =>  P(E) = 0.5, P(B) = 0.2
A|E,B: 19/20    A|¬E,B: 188/200    A|E,¬B: 170/500    A|¬E,¬B: 1/380
E  B  P(A|E,B)
T  T  0.95
F  T  0.94
T  F  0.34
F  F  0.003
FITTING CPTS VIA MLE
M examples D = (d[1], …, d[M])
Each d[i] is a complete example of all variables in the Bayes net
Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN
Suppose the BN has a single variable X; estimate X's CPT, P(X)
(Just learning a coin flip as usual)
P_MLE(X) = empirical distribution of D
P_MLE(X=T) = Count(X=T) / M
P_MLE(X=F) = Count(X=F) / M
FITTING CPTS VIA MLE
M examples D = (d[1], …, d[M])
Each d[i] is a complete example of all variables in the Bayes net
Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN
Suppose the BN is X → Y: estimate P(X), P(Y|X)
Estimate P_MLE(X) as usual
FITTING CPTS VIA MLE
M examples D = (d[1], …, d[M])
Each d[i] is a complete example of all variables in the Bayes net
Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN
Estimate P_MLE(Y|X) with…  (BN: X → Y)
P(Y|X)    X = T                                  X = F
Y = T     Count(Y=T,X=T) / Count(X=T)            Count(Y=T,X=F) / Count(X=F)
Y = F     Count(Y=F,X=T) / Count(X=T)            Count(Y=F,X=F) / Count(X=F)
FITTING CPTS VIA MLE
(Figure: a Bayes net over X1, X2, X3, and Y — the same counting procedure gives the ML parameters of each CPT)
OTHER MLE RESULTS
Categorical distributions (non-binary discrete variables): the empirical distribution is the MLE
  Make a histogram, divide by N
Continuous Gaussian distributions:
  Mean = average of the data
  Standard deviation = standard deviation of the data
(Figures: a histogram and a Gaussian (normal) distribution)
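Both results are one-liners in practice; a minimal sketch (my own, assuming NumPy is available, with made-up toy data):

```python
# Categorical and Gaussian MLE on small toy datasets.
import numpy as np
from collections import Counter

# Categorical: empirical frequencies are the MLE.
draws = ["cherry", "lime", "cherry", "cherry", "lime"]
counts = Counter(draws)
cat_mle = {v: c / len(draws) for v, c in counts.items()}
print(cat_mle)                      # {'cherry': 0.6, 'lime': 0.4}

# Gaussian: sample mean and standard deviation are the MLE.
x = np.array([2.1, 1.9, 2.4, 2.0, 1.6])
mu_mle = x.mean()
sigma_mle = x.std()                 # ddof=0 (divide by N) is the MLE
print(mu_mle, sigma_mle)
```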
NICE PROPERTIES OF MLE
Easy to compute (for certain probability models)
With enough data, the MLE estimate will approach the true unknown value of θ
PROBLEMS WITH MLE
The MLE was easy to compute… but what happens when we don't have much data?
Motivation: you hand me a coin from your pocket; 1 flip, turns up tails. What's the MLE?
MLE has high variance with small sample sizes
VARIANCE OF AN ESTIMATOR: INTUITION
The dataset D is just a sample of the underlying distribution, and if we could "do over" the sample, then we might get a new dataset D'
With D', our MLE estimate θ_MLE' might be different. How much? How often?
Assume all values of θ are equally likely
In the case of 1 draw, D would just as likely have been a Lime; in that case, θ_MLE = 0
So with probability 0.5, θ_MLE would be 1, and with the same probability, θ_MLE would be 0
High variance: typical "do overs" give drastically different results!
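The variance claim can be checked with a small simulation, sketched below (my own code, not from the slides; the true θ value is assumed): repeat the "sample a dataset of size N, compute the MLE" experiment many times and look at the spread of the resulting estimates.

```python
# Simulate many "do overs" of an N-draw dataset and measure the spread
# of the MLE c/N across them, for small and large N.
import numpy as np

rng = np.random.default_rng(0)
true_theta = 0.5                     # assumed true fraction of cherries
for N in (1, 5, 100):
    counts = rng.binomial(N, true_theta, size=10_000)   # 10,000 datasets
    mles = counts / N                                    # MLE = c/N for each
    print(f"N={N:3d}: mean of MLE = {mles.mean():.3f}, std of MLE = {mles.std():.3f}")
# The mean of the MLE is ~0.5 for every N, but for N=1 its std is ~0.5:
# typical "do overs" give drastically different estimates.
```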
FITTING BAYESIAN NETWORK CPTS WITH MLE
Potential problem: for large k, very few datapoints will share the parent values x_1, …, x_k!
The expected number of such datapoints is O(M·P(x_1,…,x_k)), and some value combinations may be even rarer
This is known as data fragmentation
IS THERE A BETTER WAY? BAYESIAN LEARNING
AN ALTERNATIVE APPROACH: BAYESIAN ESTIMATION
P(D|θ) is the likelihood
P(θ) is the hypothesis prior
P(θ|D) ∝ P(D|θ) P(θ) is the posterior: the distribution of hypotheses given the data
(Figure: Bayes net with θ as the parent of d[1], d[2], …, d[M])
BAYESIAN PREDICTION
For a new draw Y: use the hypothesis posterior to predict P(Y|D)
(Figure: Bayes net with θ as the parent of d[1], …, d[M] and of the new draw Y)
CANDY EXAMPLE
Candy comes in 2 flavors, cherry and lime, with identical wrappers
Manufacturer makes 5 indistinguishable bags:
  h1: C 100%, L 0%    h2: C 75%, L 25%    h3: C 50%, L 50%    h4: C 25%, L 75%    h5: C 0%, L 100%
Suppose we draw … (a sequence of candies, shown on the slide)
What bag are we holding? What flavor will we draw next?
BAYESIAN LEARNING
Main idea: compute the probability of each hypothesis, given the data
Data D: (the drawn candies, shown on the slide)
Hypotheses h1, …, h5 (the five bags above)
We want P(h_i|D)… but all we have is P(D|h_i)!
USING BAYES' RULE
P(h_i|D) = α P(D|h_i) P(h_i) is the posterior
  (Recall, 1/α = P(D) = Σ_i P(D|h_i) P(h_i))
P(D|h_i) is the likelihood
P(h_i) is the hypothesis prior
(Hypotheses h1, …, h5 as above)
COMPUTING THE POSTERIOR
Assume draws are independent
Let P(h1), …, P(h5) = (0.1, 0.2, 0.4, 0.2, 0.1)
D = { 10 × lime }
Likelihoods: P(D|h1) = 0^10 = 0,  P(D|h2) = 0.25^10,  P(D|h3) = 0.5^10,  P(D|h4) = 0.75^10,  P(D|h5) = 1^10 = 1
Likelihood × prior: P(D|h1)P(h1) = 0,  P(D|h2)P(h2) ≈ 2e-7,  P(D|h3)P(h3) ≈ 4e-4,  P(D|h4)P(h4) ≈ 0.011,  P(D|h5)P(h5) = 0.1
Sum = 1/α ≈ 0.112
Posteriors: P(h1|D) = 0,  P(h2|D) ≈ 0.00,  P(h3|D) ≈ 0.00,  P(h4|D) ≈ 0.10,  P(h5|D) ≈ 0.90
POSTERIOR HYPOTHESES
(Figure: the posterior probability of each hypothesis as more candies are observed)
PREDICTING THE NEXT DRAW
P(Y|D) = Σ_i P(Y|h_i, D) P(h_i|D) = Σ_i P(Y|h_i) P(h_i|D)
Posteriors: P(h1|D) = 0,  P(h2|D) ≈ 0.00,  P(h3|D) ≈ 0.00,  P(h4|D) ≈ 0.10,  P(h5|D) ≈ 0.90
Probability that the next candy drawn is a lime, under each hypothesis: P(Y|h1) = 0,  P(Y|h2) = 0.25,  P(Y|h3) = 0.5,  P(Y|h4) = 0.75,  P(Y|h5) = 1
P(Y|D) ≈ 0.975
(Figure: Bayes net with H as the parent of D and Y)
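The posterior and the prediction above can be reproduced with a few lines of Python (my own sketch, not from the slides; small differences from the slide's numbers come from the slide rounding its posteriors):

```python
# Candy example: posterior over the five bag hypotheses after observing
# 10 limes, then the Bayesian predictive probability that the next is lime.
import numpy as np

priors = np.array([0.1, 0.2, 0.4, 0.2, 0.1])       # P(h1), ..., P(h5)
lime_prob = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # P(lime | h_i)

n_limes = 10
likelihood = lime_prob ** n_limes                   # P(D | h_i) for 10 limes
unnorm = likelihood * priors
posterior = unnorm / unnorm.sum()                   # P(h_i | D)
print(np.round(posterior, 3))                       # ~[0, 0, 0.003, 0.101, 0.896]

# Bayesian prediction: average each hypothesis's prediction by its posterior
p_next_lime = np.dot(lime_prob, posterior)
print(round(p_next_lime, 3))                        # ~0.973 (0.975 with rounded posteriors)
```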
P(NEXT CANDY IS LIME | D)
(Figure: the predictive probability that the next candy is lime as more candies are observed)
BACK TO COIN FLIPS: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
(Figure: Bayes net with θ as the parent of d[1], d[2], …, d[M] and of the new draw Y; equations shown on the slide)
ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
(Figure: same model; equations shown on the slide)
Can think of this as a "correction" using "virtual counts": with a uniform prior, the prediction works out to (c+1)/(N+2)
NONUNIFORM PRIORS
P(θ|D) ∝ P(D|θ) P(θ) = θ^c (1-θ)^(N-c) P(θ)
Define, for all θ, the probability that I believe in θ
(Figure: a prior density P(θ) over θ ∈ [0, 1])
BETA DISTRIBUTION
Beta_{α,β}(θ) = γ θ^(α-1) (1-θ)^(β-1)
α, β > 0 are hyperparameters; γ is a normalization constant
α = β = 1 is the uniform distribution
POSTERIOR WITH BETA PRIOR
Posterior ∝ θ^c (1-θ)^(N-c) P(θ) ∝ θ^(c+α-1) (1-θ)^(N-c+β-1), i.e., the posterior is Beta_{α+c, β+N-c}(θ)
Prediction = posterior mean: E[θ] = (c+α)/(N+α+β)
POSTERIOR WITH BETA PRIOR
What does this mean?
The prior specifies a "virtual count" of a = α-1 heads and b = β-1 tails
See heads, increment a; see tails, increment b
The effect of the prior diminishes with more data
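A minimal sketch of this update rule in Python (my own; the hyperparameter values and the observed counts are assumed for illustration):

```python
# Beta-Bernoulli updating: start from a Beta(alpha, beta) prior, observe
# c cherries out of N draws, then read off the posterior and its mean.
alpha, beta = 2.0, 2.0      # prior hyperparameters (assumed), i.e. one
                            # "virtual" cherry and one "virtual" lime
c, N = 1, 4                 # observed: 1 cherry in 4 draws

post_a, post_b = alpha + c, beta + N - c       # posterior is Beta(alpha+c, beta+N-c)
posterior_mean = post_a / (post_a + post_b)    # (c+alpha)/(N+alpha+beta)
posterior_mode = (post_a - 1) / (post_a + post_b - 2)

print("posterior = Beta(%g, %g)" % (post_a, post_b))
print("predictive P(next = cherry) =", posterior_mean)   # 0.375
print("MAP estimate (mode)         =", posterior_mode)   # 0.333...
print("MLE for comparison          =", c / N)            # 0.25
```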
CHOOSING A PRIOR
Part of the design process; must be chosen according to your intuition
Uninformed belief => low α, β (e.g., the uniform prior α = β = 1); strong belief => high α, β
FITTING CPTS VIA MAP
M examples D = (d[1], …, d[M]), virtual counts a, b
Estimate P_MAP(Y|X) by assuming we've already seen a examples of Y=T and b examples of Y=F for each parent value
P(Y|X)    X = T                                            X = F
Y = T     (Count(Y=T,X=T) + a) / (Count(X=T) + a + b)      (Count(Y=T,X=F) + a) / (Count(X=F) + a + b)
Y = F     (Count(Y=F,X=T) + b) / (Count(X=T) + a + b)      (Count(Y=F,X=F) + b) / (Count(X=F) + a + b)
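Extending the earlier counting sketch (again my own code, not from the slides), the MAP estimate just adds the virtual counts before normalizing:

```python
# MAP / "virtual count" estimate of P(Y|X): the same counting as before,
# but with a pseudo-observations of Y=T and b of Y=F for every parent value.
from collections import Counter

def map_cpt_y_given_x(data, a=1.0, b=1.0):
    """Return {(y, x): P_MAP(Y=y | X=x)} using virtual counts a (Y=T) and b (Y=F)."""
    joint = Counter((d["Y"], d["X"]) for d in data)
    marg = Counter(d["X"] for d in data)
    virtual = {True: a, False: b}
    return {(y, x): (joint[(y, x)] + virtual[y]) / (marg[x] + a + b)
            for x in (True, False) for y in (True, False)}

# With no data at all, the estimate falls back on the prior a/(a+b):
print(map_cpt_y_given_x([], a=1.0, b=1.0))   # every entry is 0.5
```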
PROPERTIES OF MAP
Approaches the MLE as the dataset grows large (effect of the prior diminishes in the face of evidence)
More stable estimates than MLE with small sample sizes
  Lower variance, but added bias
Needs a designer's judgment to set the prior
EXTENSIONS OF BETA PRIORS
Parameters of multi-valued (categorical) distributions, e.g. histograms: Dirichlet prior
Mathematical derivation more complex, but in practice still takes the form of "virtual counts"
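For categorical variables, the "virtual count" form amounts to adding a pseudocount to every value before normalizing; a minimal sketch (my own, with an assumed uniform pseudocount of 1 and made-up data):

```python
# Dirichlet-style "virtual count" estimate for a categorical variable.
from collections import Counter

def dirichlet_virtual_count(draws, values, pseudocount=1.0):
    """Estimate a categorical distribution with a uniform pseudocount per value."""
    counts = Counter(draws)
    total = len(draws) + pseudocount * len(values)
    return {v: (counts[v] + pseudocount) / total for v in values}

draws = ["cherry", "cherry", "lime"]
print(dirichlet_virtual_count(draws, values=["cherry", "lime", "grape"]))
# grape was never observed but still gets nonzero probability:
# cherry 3/6, lime 2/6, grape 1/6
```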
RECAP
- Parameter learning via coin flips
- Maximum likelihood
- Bayesian learning with a Beta prior
- Learning Bayes net parameters
NEXT TIME
Introduction to machine learning
Reading: R&N