CS B351: Statistical Learning

Agenda
- Learning coin flips, learning Bayes net parameters
- Likelihood functions, maximum likelihood estimation (MLE)
- Priors, maximum a posteriori estimation (MAP)
- Bayesian estimation

Learning Coin Flips
- Observe that c out of N draws are cherries (the data).
- Intuition: c/N might be a good hypothesis for the fraction of cherries in the bag.
- "Intuitive" parameter estimate: the empirical distribution, P(cherry) ≈ c/N.
- (Why is this reasonable? Perhaps we got a bad draw!)

Learning Coin Flips
- Observe that c out of N draws are cherries (the data).
- Let the unknown fraction of cherries be θ (the hypothesis).
- The probability of drawing a cherry is θ.
- Assumption: draws are independent and identically distributed (i.i.d.).

Learning Coin Flips
- The probability of drawing a cherry is θ; draws are i.i.d.
- Probability of drawing 2 cherries: θ^2.
- Probability of drawing 2 limes: (1-θ)^2.
- Probability of drawing 1 cherry and 1 lime (in that order): θ(1-θ).

Likelihood Function
- Likelihood: the probability of the data d = {d_1, …, d_N} given the hypothesis θ.
- P(d | θ) = Π_j P(d_j | θ)   (by the i.i.d. assumption)

Likelihood Function
- P(d | θ) = Π_j P(d_j | θ), where P(d_j | θ) = θ if d_j = Cherry and (1-θ) if d_j = Lime.
- This is the probability model, assuming θ is given.

Likelihood Function
- P(d | θ) = Π_j P(d_j | θ) = θ^c (1-θ)^(N-c)
- (Gather the c cherry terms together, then the N-c lime terms.)

Maximum Likelihood
- Likelihood of data d = {d_1, …, d_N} given θ: P(d | θ) = θ^c (1-θ)^(N-c)
- [A sequence of plots shows P(d | θ) as a function of θ for successively larger datasets.]

Maximum Likelihood
- The peaks of the likelihood function seem to hover around the observed fraction of cherries…
- The sharpness of the peak indicates some notion of certainty…

Maximum Likelihood
- P(d | θ) is the likelihood function.
- The quantity argmax_θ P(d | θ) is known as the maximum likelihood estimate (MLE).
- [Example plots for successive datasets; the MLEs are θ=1, θ=1, θ=2/3, θ=1/2, and θ=2/5.]

Proof: The Empirical Frequency is the MLE
- Work with the log-likelihood:
    l(θ) = log P(d | θ)
         = log [θ^c (1-θ)^(N-c)]
         = log [θ^c] + log [(1-θ)^(N-c)]
         = c log θ + (N-c) log (1-θ)
- Setting dl/dθ = 0 gives the maximum likelihood estimate:
    dl/dθ = c/θ - (N-c)/(1-θ) = 0  ⇒  θ = c/N

Maximum Likelihood for BNs
- For any BN, the ML parameters of any CPT are obtained as the fraction of observed values in the data, conditioned on matching parent values.
- Example: Earthquake and Burglar are parents of Alarm. With N = 1000 examples, E is observed 500 times and B is observed 200 times, so P(E) = 0.5 and P(B) = 0.2.
- Conditional counts for the alarm: A given E,B: 19/20; A given B only: 188/200; A given E only: 170/500; A given neither: 1/380.
- Resulting CPT for P(A | E, B):
    E=T, B=T: 0.95
    E=F, B=T: 0.94
    E=T, B=F: 0.34
    E=F, B=F: 0.003

Fitting CPTs via MLE
- M examples D = (d[1], …, d[M]); each d[i] is a complete example of all variables in the Bayes net.
- Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN.
- Suppose the BN has a single variable X, and we estimate X's CPT, P(X).
- This is just learning a coin flip as usual: P_MLE(X) is the empirical distribution of D.
    P_MLE(X=T) = Count(X=T) / M
    P_MLE(X=F) = Count(X=F) / M

Fitting CPTs via MLE
- Same setup: M complete i.i.d. examples D = (d[1], …, d[M]).
- Now suppose the BN is X → Y. Estimate P(X) and P(Y|X).
- Estimate P_MLE(X) as usual.
- Estimate P_MLE(Y|X) with:
    P(Y=T | X=T) = Count(Y=T, X=T) / Count(X=T)
    P(Y=T | X=F) = Count(Y=T, X=F) / Count(X=F)
    P(Y=F | X=T) = Count(Y=F, X=T) / Count(X=T)
    P(Y=F | X=F) = Count(Y=F, X=F) / Count(X=F)

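As a concrete illustration, here is a small Python sketch of these counting formulas (the dataset and variable names are invented for the example, not taken from the slides):

    # Hypothetical dataset: each example is a complete assignment (x, y) of booleans.
    data = [(True, True), (True, False), (True, True), (False, False),
            (False, False), (True, True), (False, True), (False, False)]
    M = len(data)

    # P_MLE(X): empirical distribution of X.
    count_x_true = sum(1 for x, _ in data if x)
    p_x = {True: count_x_true / M, False: (M - count_x_true) / M}

    # P_MLE(Y | X): fraction of Y values among the examples matching each parent value.
    p_y_given_x = {}
    for x_val in (True, False):
        ys = [y for x, y in data if x == x_val]
        p_y_given_x[(True, x_val)] = sum(ys) / len(ys)        # P(Y=T | X=x_val)
        p_y_given_x[(False, x_val)] = 1.0 - p_y_given_x[(True, x_val)]

    print(p_x)
    print(p_y_given_x)
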
Fitting CPTs via MLE
- [Diagram: a BN in which Y has parents X1, X2, X3; the CPT P(Y | X1, X2, X3) is estimated the same way, conditioning on each combination of parent values.]

Other MLE Results
- Categorical distributions (non-binary discrete variables): the empirical distribution is the MLE. Make a histogram and divide by N.
- Continuous Gaussian distributions: the MLE mean is the average of the data, and the MLE standard deviation is the standard deviation of the data.
- [Figures: a histogram and a Gaussian (normal) distribution.]

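For reference, a short Python sketch of these two cases (the sample values are made up; note that the Gaussian MLE for the standard deviation divides by N, not N-1):

    import numpy as np

    # Categorical MLE: histogram of counts divided by N.
    draws = np.array([0, 2, 1, 0, 2, 2, 1, 0, 0, 2])        # hypothetical category labels
    p_mle = np.bincount(draws) / len(draws)                  # e.g. [0.4, 0.2, 0.4]

    # Gaussian MLE: mean and standard deviation of the data.
    samples = np.array([2.1, 1.9, 2.5, 2.2, 1.8, 2.4])
    mu_mle = samples.mean()
    sigma_mle = samples.std(ddof=0)                          # ddof=0 gives the 1/N (MLE) estimator

    print(p_mle, mu_mle, sigma_mle)
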
Nice Properties of MLE
- Easy to compute (for certain probability models).
- With enough data, the MLE will approach the true unknown value of θ.

Problems with MLE
- The MLE was easy to compute… but what happens when we don't have much data?
- Motivation: you hand me a coin from your pocket; 1 flip turns up tails. What's the MLE?
- The MLE has high variance with small sample sizes.

Variance of an Estimator: Intuition
- The dataset D is just a sample of the underlying distribution; if we could "do over" the sample, we might get a new dataset D'.
- With D', our MLE estimate θ_MLE' might be different. How much? How often?
- Assume all values of θ are equally likely. In the case of 1 draw, D would just as likely have been a lime; in that case θ_MLE = 0.
- So with probability 0.5, θ_MLE would be 1, and with the same probability, θ_MLE would be 0.
- High variance: typical "do overs" give drastically different results!

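This "do over" experiment is easy to simulate. The sketch below (my own, with an assumed true θ of 0.5) resamples a tiny dataset many times and measures how much the MLE jumps around:

    import random

    random.seed(0)
    true_theta = 0.5    # assumed true fraction of cherries
    N = 1               # tiny sample size, as in the one-flip example

    # "Do over" the dataset many times and record the MLE each time.
    mles = []
    for _ in range(10000):
        c = sum(1 for _ in range(N) if random.random() < true_theta)
        mles.append(c / N)

    mean = sum(mles) / len(mles)
    variance = sum((m - mean) ** 2 for m in mles) / len(mles)
    print(mean, variance)   # with N = 1 the MLE is 0 or 1, so the variance is about 0.25
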
Fitting Bayesian Network CPTs with MLE
- Potential problem: for large k, very few datapoints will share the parent values x_1, …, x_k!
- The expected number of matching datapoints is O(M · P(x_1, …, x_k)), and some value combinations may be even rarer.
- This is known as data fragmentation.

Is There a Better Way? Bayesian Learning

An Alternative Approach: Bayesian Estimation
- P(D | θ) is the likelihood.
- P(θ) is the hypothesis prior.
- P(θ | D) = α P(D | θ) P(θ) is the posterior: the distribution over hypotheses given the data.
- [Diagram: θ is the parent of the observations d[1], d[2], …, d[M].]

Bayesian Prediction
- For a new draw Y, use the hypothesis posterior to predict P(Y | D).
- [Diagram: θ is the parent of d[1], d[2], …, d[M] and of the new draw Y.]

Candy Example
- Candy comes in 2 flavors, cherry and lime, with identical wrappers.
- The manufacturer makes 5 indistinguishable kinds of bags:
    h1: 100% cherry, 0% lime
    h2: 75% cherry, 25% lime
    h3: 50% cherry, 50% lime
    h4: 25% cherry, 75% lime
    h5: 0% cherry, 100% lime
- Suppose we draw [a sequence of candies]. What bag are we holding? What flavor will we draw next?

Bayesian Learning
- Main idea: compute the probability of each hypothesis given the data.
- Data D: [the observed sequence of candies]. Hypotheses: h_1, …, h_5 (the five bags above).
- We want P(h_i | D), but all we have is P(D | h_i)!

Using Bayes' Rule
- P(h_i | D) = α P(D | h_i) P(h_i) is the posterior.
- (Recall, 1/α = P(D) = Σ_i P(D | h_i) P(h_i).)
- P(D | h_i) is the likelihood.
- P(h_i) is the hypothesis prior.

Computing the Posterior
- Assume draws are independent, and let the prior be P(h_1), …, P(h_5) = (0.1, 0.2, 0.4, 0.2, 0.1).
- D = {10 limes in a row}.
- Likelihoods: P(D|h_1) = 0; P(D|h_2) = 0.25^10; P(D|h_3) = 0.5^10; P(D|h_4) = 0.75^10; P(D|h_5) = 1^10.
- Unnormalized: P(D|h_1)P(h_1) = 0; P(D|h_2)P(h_2) ≈ 2e-7; P(D|h_3)P(h_3) ≈ 4e-4; P(D|h_4)P(h_4) ≈ 0.011; P(D|h_5)P(h_5) = 0.1. Sum = 1/α ≈ 0.112.
- Posteriors: P(h_1|D) = 0; P(h_2|D) ≈ 0.00; P(h_3|D) ≈ 0.00; P(h_4|D) ≈ 0.10; P(h_5|D) ≈ 0.90.

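These numbers are easy to reproduce; here is a small Python sketch of the computation (variable names are mine, not from the slides):

    priors = [0.1, 0.2, 0.4, 0.2, 0.1]            # P(h_1), ..., P(h_5)
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]          # P(lime | h_i) for each bag

    n_limes = 10                                  # D = 10 limes in a row
    likelihoods = [p ** n_limes for p in p_lime]  # P(D | h_i)

    unnormalized = [lk * pr for lk, pr in zip(likelihoods, priors)]
    alpha = 1.0 / sum(unnormalized)               # 1/alpha = P(D), about 0.112
    posterior = [alpha * u for u in unnormalized]
    print(posterior)   # approximately [0, 0.0000017, 0.0035, 0.101, 0.896]
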
Posterior Hypotheses
- [Plot: the posterior probability of each hypothesis as a function of the number of observed draws.]

Predicting the Next Draw
- P(Y | D) = Σ_i P(Y | h_i, D) P(h_i | D) = Σ_i P(Y | h_i) P(h_i | D)
- With posteriors P(h_i | D) ≈ (0, 0.00, 0.00, 0.10, 0.90) and per-bag lime probabilities P(Y | h_i) = (0, 0.25, 0.5, 0.75, 1):
- Probability that the next candy drawn is a lime: P(Y | D) ≈ 0.975.

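Continuing the sketch above (the posterior values below are carried over from that computation), the prediction is simply a posterior-weighted average of the per-hypothesis predictions:

    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]                   # P(Y = lime | h_i)
    posterior = [0.0, 0.0000017, 0.0035, 0.1009, 0.8956]   # P(h_i | D) from the previous sketch

    p_next_lime = sum(p * q for p, q in zip(p_lime, posterior))
    print(p_next_lime)   # about 0.973 (0.975 when using the rounded posteriors from the slide)
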
P(Next Candy is Lime | D)
- [Plot: the predicted probability that the next candy is a lime, as a function of the number of observed draws.]

Back to Coin Flips: Uniform Prior, Bernoulli Distribution
- [Diagram: θ is the parent of the observations d[1], d[2], …, d[M] and of the next draw Y.]

Assumption: Uniform Prior, Bernoulli Distribution
- With a uniform prior over θ, the predicted probability of a cherry after observing c cherries in N draws is (c+1)/(N+2).
- Can think of this as a "correction" to the empirical frequency c/N using "virtual counts".

Nonuniform Priors
- P(θ | D) ∝ P(D | θ) P(θ) = θ^c (1-θ)^(N-c) P(θ)
- Define, for all θ in [0, 1], the probability P(θ) that I believe in θ. [Plot: an example prior density P(θ) over θ.]

Beta Distribution
- Beta_{α,β}(θ) = γ θ^(α-1) (1-θ)^(β-1)
- α, β are hyperparameters (> 0); γ is a normalization constant.
- α = β = 1 is the uniform distribution.

Posterior with Beta Prior
- Posterior ∝ θ^c (1-θ)^(N-c) P(θ) ∝ θ^(c+α-1) (1-θ)^(N-c+β-1), i.e. the posterior is Beta_{α+c, β+N-c}(θ).
- Prediction = posterior mean: E[θ] = (c+α) / (N+α+β)

Posterior with Beta Prior
- What does this mean? The prior specifies a "virtual count" of a = α-1 heads and b = β-1 tails.
- See a head: increment a. See a tail: increment b.
- The effect of the prior diminishes with more data.

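A minimal Python sketch of this update, assuming made-up hyperparameters α = β = 2 and the one-draw example from earlier:

    alpha, beta = 2.0, 2.0       # hypothetical hyperparameters: virtual counts a = alpha-1, b = beta-1
    c, N = 1, 1                  # one draw, one cherry

    theta_mle = c / N                                # = 1.0, the high-variance MLE
    theta_bayes = (c + alpha) / (N + alpha + beta)   # posterior mean = 0.6, pulled toward the prior

    # The posterior is Beta(alpha + c, beta + N - c); each observation increments one hyperparameter.
    post_alpha, post_beta = alpha + c, beta + N - c
    print(theta_mle, theta_bayes, (post_alpha, post_beta))
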
Choosing a Prior
- Part of the design process; must be chosen according to your intuition.
- Uninformed belief ⇒ low α and β (e.g., the uniform α = β = 1); strong belief ⇒ high α and β.

Fitting CPTs via MAP
- M examples D = (d[1], …, d[M]) and virtual counts a, b.
- Estimate P_MAP(Y|X) by assuming we have also seen a examples of Y=T and b examples of Y=F for each parent value (see the sketch below):
    P(Y=T | X=T) = (Count(Y=T, X=T) + a) / (Count(X=T) + a + b)
    P(Y=T | X=F) = (Count(Y=T, X=F) + a) / (Count(X=F) + a + b)
    P(Y=F | X=T) = (Count(Y=F, X=T) + b) / (Count(X=T) + a + b)
    P(Y=F | X=F) = (Count(Y=F, X=F) + b) / (Count(X=F) + a + b)

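A short Python sketch of this smoothed estimate (the dataset and the default virtual counts are assumptions of mine):

    def cpt_map(data, a=1.0, b=1.0):
        """data: list of (x, y) boolean pairs; returns P_MAP(Y=T | X=x) for x in {True, False}."""
        cpt = {}
        for x_val in (True, False):
            count_x = sum(1 for x, _ in data if x == x_val)
            count_xy = sum(1 for x, y in data if x == x_val and y)
            # Virtual counts keep the estimate defined and stable even when count_x is tiny.
            cpt[x_val] = (count_xy + a) / (count_x + a + b)
        return cpt

    data = [(True, True), (True, False), (False, False)]
    print(cpt_map(data))   # {True: 0.5, False: 0.333...}
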
Properties of MAP
- Approaches the MLE as the dataset grows large (the effect of the prior diminishes in the face of evidence).
- More stable estimates than MLE with small sample sizes: lower variance, but added bias.
- Needs a designer's judgment to set the prior.

Extensions of Beta Priors
- Parameters of multi-valued (categorical) distributions, e.g. histograms: the Dirichlet prior.
- The mathematical derivation is more complex, but in practice it still takes the form of "virtual counts".

Recap
- Parameter learning via coin flips
- Maximum likelihood
- Bayesian learning with a Beta prior
- Learning Bayes net parameters

Next Time
- Introduction to machine learning
- R&N 18.1-3