CS B351: Statistical Learning

Agenda
- Learning coin flips, learning Bayes net parameters
- Likelihood functions, maximum likelihood estimation (MLE)
- Priors, maximum a posteriori estimation (MAP)
- Bayesian estimation

Learning Coin Flips
- Observe that c out of N draws are cherries (data)
- Intuition: c/N might be a good hypothesis for the fraction of cherries in the bag
- "Intuitive" parameter estimate: the empirical distribution, P(cherry) ≈ c/N
- (Why is this reasonable? Perhaps we got a bad draw!)

Learning Coin Flips
- Observe that c out of N draws are cherries (data)
- Let the unknown fraction of cherries be θ (hypothesis)
- The probability of drawing a cherry is θ
- Assumption: draws are independent and identically distributed (i.i.d.)

Learning Coin Flips
- The probability of drawing a cherry is θ
- Assumption: draws are independent and identically distributed (i.i.d.)
- Probability of drawing 2 cherries: θ²
- Probability of drawing 2 limes: (1-θ)²
- Probability of drawing 1 cherry and 1 lime: θ(1-θ)

Likelihood Function
- Likelihood: the probability of the data d = {d_1,…,d_N} given the hypothesis θ
- P(d|θ) = Π_j P(d_j|θ)   (i.i.d. assumption)

Likelihood Function
- Likelihood: the probability of the data d = {d_1,…,d_N} given the hypothesis θ
- P(d|θ) = Π_j P(d_j|θ) = Π_j [θ if d_j = Cherry; 1-θ if d_j = Lime]
- This is the probability model, assuming θ is given

Likelihood Function
- Likelihood: the probability of the data d = {d_1,…,d_N} given the hypothesis θ
- P(d|θ) = Π_j P(d_j|θ) = Π_j [θ if d_j = Cherry; 1-θ if d_j = Lime] = θ^c (1-θ)^(N-c)
- Gather the c cherry terms together, then the N-c lime terms
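
As a concrete illustration, here is a minimal Python sketch (not from the original slides) that evaluates this likelihood for a small candy dataset; the function name and the example data are assumptions made for illustration.

```python
def likelihood(theta, data):
    """P(d | theta) for i.i.d. Cherry/Lime draws, where theta = P(Cherry)."""
    c = sum(1 for d in data if d == "Cherry")   # number of cherries
    n = len(data)
    return theta**c * (1 - theta)**(n - c)

# Hypothetical dataset: 2 cherries and 1 lime out of 3 draws.
data = ["Cherry", "Lime", "Cherry"]
for theta in [0.25, 0.5, 2/3, 0.75]:
    print(f"theta={theta:.2f}  P(d|theta)={likelihood(theta, data):.4f}")
```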

Maximum Likelihood
- Likelihood of the data d = {d_1,…,d_N} given θ: P(d|θ) = θ^c (1-θ)^(N-c)
- [Figures: plots of this likelihood function for several example datasets]

Maximum Likelihood
- The peak of the likelihood function hovers around the observed fraction of cherries…
- Its sharpness indicates some notion of certainty…

Maximum Likelihood
- P(d|θ) is the likelihood function
- The quantity argmax_θ P(d|θ) is known as the maximum likelihood estimate (MLE)
- [Example figures: for successive datasets, the MLE is θ = 1, then 1, then 2/3, then 1/2, then 2/5]

Proof: The Empirical Frequency Is the MLE
- l(θ) = log P(d|θ) = log[θ^c (1-θ)^(N-c)]
-      = log[θ^c] + log[(1-θ)^(N-c)]
-      = c log θ + (N-c) log(1-θ)

Proof: The Empirical Frequency Is the MLE
- l(θ) = log P(d|θ) = c log θ + (N-c) log(1-θ)
- Setting dl/dθ(θ) = 0 gives the maximum likelihood estimate

Proof: The Empirical Frequency Is the MLE
- dl/dθ(θ) = c/θ - (N-c)/(1-θ)
- At the MLE, c/θ - (N-c)/(1-θ) = 0, so c(1-θ) = (N-c)θ, which gives θ = c/N
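
To sanity-check this result, the following sketch (an illustrative addition, not part of the slides) compares the closed-form MLE c/N against a brute-force maximization of the log-likelihood over a grid of θ values.

```python
import numpy as np

c, N = 3, 10                                # e.g., 3 cherries in 10 draws (hypothetical counts)
thetas = np.linspace(0.001, 0.999, 999)
log_lik = c * np.log(thetas) + (N - c) * np.log(1 - thetas)

theta_grid = thetas[np.argmax(log_lik)]     # brute-force argmax over the grid
theta_mle = c / N                           # closed-form MLE
print(theta_grid, theta_mle)                # both ≈ 0.3
```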

Maximum Likelihood for Bayes Nets
- For any BN, the ML parameters of any CPT are given by the fraction of observed values in the data, conditioned on matching parent values
- Example: Earthquake and Burglar are parents of Alarm; N = 1000 examples with counts E: 500 and B: 200, so P(E) = 0.5 and P(B) = 0.2
- Conditional counts for the alarm: A|E,B: 19/20, A|¬E,B: 188/200, A|E,¬B: 170/500, A|¬E,¬B: 1/380

   E  B  P(A|E,B)
   T  T  0.95
   F  T  0.94
   T  F  0.34
   F  F  0.003

Fitting CPTs via MLE
- M examples D = (d[1],…,d[M])
- Each d[i] is a complete example (an observed setting of all variables in the Bayes net)
- Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN

Fitting CPTs via MLE
- Same setup: M complete examples, sampled i.i.d. from the joint distribution of the BN
- Suppose the BN has a single variable X
- Estimate X's CPT, P(X)

Fitting CPTs via MLE
- Single-variable BN, continued (just learning a coin flip as usual)
- P_MLE(X) = empirical distribution of D
- P_MLE(X=T) = Count(X=T) / M
- P_MLE(X=F) = Count(X=F) / M

Fitting CPTs via MLE
- Now suppose the BN is X → Y (shown on the slide): estimate P(X) and P(Y|X)
- Estimate P_MLE(X) as usual

Fitting CPTs via MLE
- Estimate P_MLE(Y|X) with conditional frequencies, one entry per (Y, X) value pair:

              X = T                             X = F
  Y = T   Count(Y=T,X=T) / Count(X=T)    Count(Y=T,X=F) / Count(X=F)
  Y = F   Count(Y=F,X=T) / Count(X=T)    Count(Y=F,X=F) / Count(X=F)
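
A rough Python sketch of this counting procedure is shown below (not from the slides; the data layout, using dictionaries with keys "X" and "Y", is an assumption for illustration).

```python
from collections import Counter

def fit_cpt_mle(data):
    """Estimate P(Y|X) by conditional frequencies from complete examples."""
    joint = Counter((d["X"], d["Y"]) for d in data)   # Count(X=x, Y=y)
    marg = Counter(d["X"] for d in data)              # Count(X=x)
    cpt = {}
    for x in (True, False):
        for y in (True, False):
            # Fails if Count(X=x) = 0 -- the MAP estimate later avoids this.
            cpt[(y, x)] = joint[(x, y)] / marg[x]     # P(Y=y | X=x)
    return cpt

# Hypothetical complete dataset
data = [{"X": True, "Y": True}, {"X": True, "Y": False},
        {"X": False, "Y": False}, {"X": True, "Y": True}]
print(fit_cpt_mle(data))
```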

Fitting CPTs via MLE
- [Figure: a Bayes net with nodes X1, X2, X3 and Y]

Other MLE Results
- Categorical distributions (non-binary discrete variables): the empirical distribution is the MLE (make a histogram of counts, divide by N)
- Continuous Gaussian distributions: MLE mean = average of the data; MLE standard deviation = standard deviation of the data
- [Figures: a histogram and a fitted Gaussian (normal) distribution]
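
For concreteness, a brief Python sketch of these two estimators is given below (an illustrative addition, not part of the slides); note that the MLE standard deviation divides by N, not N-1.

```python
import numpy as np
from collections import Counter

# Categorical MLE: normalized histogram (hypothetical draws)
draws = ["cherry", "lime", "cherry", "cherry", "lime"]
counts = Counter(draws)
p_mle = {k: v / len(draws) for k, v in counts.items()}
print(p_mle)                       # {'cherry': 0.6, 'lime': 0.4}

# Gaussian MLE: sample mean and (biased) sample standard deviation
x = np.array([2.1, 1.9, 2.4, 2.0, 2.6])
mu_mle = x.mean()
sigma_mle = x.std(ddof=0)          # ddof=0 divides by N, giving the MLE
print(mu_mle, sigma_mle)
```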

Nice Properties of MLE
- Easy to compute (for certain probability models)
- With enough data, the estimate θ_MLE will approach the true unknown value of θ

Problems with MLE
- The MLE was easy to compute… but what happens when we don't have much data?
- Motivation: you hand me a coin from your pocket; 1 flip, turns up tails. What's the MLE?
- The estimate θ_MLE has high variance with small sample sizes

Variance of an Estimator: Intuition
- The dataset D is just a sample of the underlying distribution; if we could "do over" the sample, we might get a new dataset D'
- With D', our MLE estimate θ_MLE' might be different. How much? How often?
- Assume all values of θ are equally likely. In the case of 1 draw, D would just as likely have been a Lime; in that case θ_MLE = 0
- So with probability 0.5, θ_MLE would be 1, and with the same probability θ_MLE would be 0
- High variance: typical "do overs" give drastically different results!
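
The following small simulation (an illustrative addition, not from the slides) makes this concrete: it repeats the "do over" many times for different sample sizes and reports how much the MLE varies; the true parameter value of 0.5 is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.5                          # hypothetical true fraction of cherries

for n in (1, 5, 100):
    # Each row is one "do over": n i.i.d. draws, then the MLE c/n
    draws = rng.random((10_000, n)) < theta_true
    mles = draws.mean(axis=1)
    print(f"N={n:3d}  mean={mles.mean():.3f}  std={mles.std():.3f}")
# The standard deviation of the MLE shrinks as N grows.
```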

Fitting Bayesian Network CPTs with MLE
- Potential problem: when a variable has many parents X_1,…,X_k, very few datapoints will share any particular parent assignment x_1,…,x_k!
- The expected count is O(M · P(x_1,…,x_k)), and some assignments may be even rarer
- This is known as data fragmentation

Is There a Better Way? Bayesian Learning

An Alternative Approach: Bayesian Estimation
- P(D|θ) is the likelihood
- P(θ) is the hypothesis prior
- P(θ|D) = α P(D|θ) P(θ) is the posterior: the distribution over hypotheses given the data
- [Figure: graphical model with θ as parent of d[1], d[2], …, d[M]]

Bayesian Prediction
- For a new draw Y: use the hypothesis posterior to predict P(Y|D)
- [Figure: graphical model with θ as parent of d[1], …, d[M] and Y]

Candy Example
- Candy comes in 2 flavors, cherry and lime, with identical wrappers
- The manufacturer makes 5 indistinguishable types of bag:
  h1: 100% cherry, 0% lime
  h2: 75% cherry, 25% lime
  h3: 50% cherry, 50% lime
  h4: 25% cherry, 75% lime
  h5: 0% cherry, 100% lime
- Suppose we draw some candies (pictured on the slide). What bag are we holding? What flavor will we draw next?

Bayesian Learning
- Main idea: compute the probability of each hypothesis, given the data
- Data D: the observed candy draws; hypotheses: h1,…,h5 (the five bag types above)
- We want P(h_i|D)… but all we have is the likelihood P(D|h_i)!

Using Bayes' Rule
- P(h_i|D) = α P(D|h_i) P(h_i) is the posterior
- (Recall, 1/α = P(D) = Σ_i P(D|h_i) P(h_i))
- P(D|h_i) is the likelihood
- P(h_i) is the hypothesis prior

Computing the Posterior
- Assume draws are independent
- Let P(h1),…,P(h5) = (0.1, 0.2, 0.4, 0.2, 0.1)
- D = {10 × lime}
- Likelihoods: P(D|h1) = 0, P(D|h2) = 0.25^10, P(D|h3) = 0.5^10, P(D|h4) = 0.75^10, P(D|h5) = 1^10 = 1
- Unnormalized posteriors: P(D|h1)P(h1) = 0, P(D|h2)P(h2) ≈ 2e-7, P(D|h3)P(h3) ≈ 4e-4, P(D|h4)P(h4) ≈ 0.011, P(D|h5)P(h5) = 0.1
- Sum = 1/α ≈ 0.111
- Normalized posterior: P(h1|D) = 0, P(h2|D) ≈ 0.00, P(h3|D) ≈ 0.00, P(h4|D) ≈ 0.10, P(h5|D) ≈ 0.90
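
A minimal Python sketch of this calculation (an illustrative addition, not part of the slides) is shown below.

```python
import numpy as np

priors = np.array([0.1, 0.2, 0.4, 0.2, 0.1])     # P(h1),...,P(h5)
p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # P(lime | h_i)

n_limes = 10                                      # D = 10 limes in a row
likelihoods = p_lime ** n_limes                   # P(D | h_i), i.i.d. draws
unnormalized = likelihoods * priors
posterior = unnormalized / unnormalized.sum()     # P(h_i | D)
print(posterior.round(3))   # ≈ [0, 0, 0.003, 0.101, 0.896] -- matches the slide's 0.10 and 0.90
```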

Posterior Hypotheses
- [Figure: plot of the posterior probabilities P(h_i|D)]

Predicting the Next Draw
- P(Y|D) = Σ_i P(Y|h_i, D) P(h_i|D) = Σ_i P(Y|h_i) P(h_i|D)
- Posterior: P(h1|D) = 0, P(h2|D) ≈ 0.00, P(h3|D) ≈ 0.00, P(h4|D) ≈ 0.10, P(h5|D) ≈ 0.90
- Let Y = "the next candy drawn is a lime": P(Y|h1) = 0, P(Y|h2) = 0.25, P(Y|h3) = 0.5, P(Y|h4) = 0.75, P(Y|h5) = 1
- P(Y|D) ≈ 0.975
- [Figure: graphical model with H as parent of D and Y]
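
Continuing the sketch above (again illustrative, not from the slides), the Bayesian prediction is just each hypothesis's prediction weighted by its posterior probability.

```python
import numpy as np

p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])            # P(lime | h_i)
posterior = np.array([0.0, 0.0, 0.0035, 0.1009, 0.8956])  # P(h_i | D) from the previous sketch
p_next_lime = float(np.sum(p_lime * posterior))           # Σ_i P(Y|h_i) P(h_i|D)
print(round(p_next_lime, 3))   # ≈ 0.973 (the slide rounds the posterior and reports 0.975)
```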

P(Next Candy Is Lime | D)
- [Figure: plot of the predictive probability]

Back to Coin Flips: Uniform Prior, Bernoulli Distribution
- [Figure: graphical model with θ as parent of the i.i.d. draws d[1],…,d[M] and the next draw Y]

Assumption: Uniform Prior, Bernoulli Distribution
- [Figure: the same graphical model, with the resulting prediction worked out on the slide]
- With a uniform prior, the prediction is P(next draw = Cherry | D) = (c+1)/(N+2) (the Beta posterior mean below with α = β = 1)
- Can think of this as a "correction" using "virtual counts"

Nonuniform Priors
- P(θ|D) ∝ P(D|θ) P(θ) = θ^c (1-θ)^(N-c) P(θ)
- Define, for all θ, the probability that I believe in θ
- [Figure: a prior density P(θ) over θ ∈ [0, 1]]

Beta Distribution
- Beta_{α,β}(θ) = γ θ^(α-1) (1-θ)^(β-1)
- α, β are hyperparameters > 0
- γ is a normalization constant
- α = β = 1 is the uniform distribution

Posterior with Beta Prior
- Posterior ∝ θ^c (1-θ)^(N-c) P(θ) = γ θ^(c+α-1) (1-θ)^(N-c+β-1) = Beta_{α+c, β+N-c}(θ)
- Prediction = posterior mean: E[θ] = (c+α)/(N+α+β)
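
The sketch below (an illustrative addition, not from the slides) contrasts the MLE with the Beta-posterior mean for a small dataset; the prior pseudo-counts α = β = 2 are chosen arbitrarily for the example.

```python
def mle(c, n):
    return c / n

def beta_posterior_mean(c, n, alpha=2.0, beta=2.0):
    """Posterior mean (c + alpha) / (n + alpha + beta) under a Beta(alpha, beta) prior."""
    return (c + alpha) / (n + alpha + beta)

# One flip, came up tails (c = 0 heads out of n = 1)
print(mle(0, 1))                   # 0.0  -- extreme estimate
print(beta_posterior_mean(0, 1))   # 0.4  -- pulled toward the prior mean 0.5

# With more data the two estimates converge
print(mle(40, 100), beta_posterior_mean(40, 100))   # 0.4 vs. ~0.404
```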

Posterior with Beta Prior
- What does this mean?
- The prior specifies a "virtual count" of a = α-1 heads and b = β-1 tails
- See heads, increment a; see tails, increment b
- The effect of the prior diminishes with more data

Choosing a Prior
- Part of the design process; must be chosen according to your intuition
- Uninformed belief => α, β low (e.g., the uniform prior α = β = 1); strong belief => α, β high

Fitting CPTs via MAP
- M examples D = (d[1],…,d[M]), virtual counts a, b
- Estimate P_MAP(Y|X) by acting as if we had already seen a extra examples of Y=T and b extra examples of Y=F:

              X = T                                         X = F
  Y = T   (Count(Y=T,X=T) + a) / (Count(X=T) + a + b)   (Count(Y=T,X=F) + a) / (Count(X=F) + a + b)
  Y = F   (Count(Y=F,X=T) + b) / (Count(X=T) + a + b)   (Count(Y=F,X=F) + b) / (Count(X=F) + a + b)
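
Here is a rough Python sketch of this smoothed counting (illustrative only; the data layout mirrors the earlier MLE sketch, and the virtual counts a and b are assumptions you would choose yourself).

```python
from collections import Counter

def fit_cpt_map(data, a=1.0, b=1.0):
    """Estimate P(Y|X) with virtual counts: a extra Y=T and b extra Y=F per parent value."""
    joint = Counter((d["X"], d["Y"]) for d in data)
    marg = Counter(d["X"] for d in data)
    cpt = {}
    for x in (True, False):
        denom = marg[x] + a + b
        cpt[(True, x)] = (joint[(x, True)] + a) / denom    # P(Y=T | X=x)
        cpt[(False, x)] = (joint[(x, False)] + b) / denom  # P(Y=F | X=x)
    return cpt

# Even with no X=False examples at all, the estimate stays well-defined:
data = [{"X": True, "Y": True}, {"X": True, "Y": False}]
print(fit_cpt_map(data))
```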

Properties of MAP
- Approaches the MLE as the dataset grows large (the effect of the prior diminishes in the face of evidence)
- More stable estimates than MLE with small sample sizes: lower variance, but added bias
- Needs a designer's judgment to set the prior

Extensions of Beta Priors
- Parameters of multi-valued (categorical) distributions, e.g. histograms: Dirichlet prior
- The mathematical derivation is more complex, but in practice it still takes the form of "virtual counts"
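
A minimal sketch of the categorical case (illustrative, not from the slides): with a symmetric Dirichlet prior, the posterior-mean estimate simply adds the same virtual count to every category.

```python
from collections import Counter

def dirichlet_estimate(draws, categories, virtual_count=1.0):
    """Posterior-mean estimate under a symmetric Dirichlet prior ("add virtual_count to each bin")."""
    counts = Counter(draws)
    total = len(draws) + virtual_count * len(categories)
    return {k: (counts[k] + virtual_count) / total for k in categories}

draws = ["cherry", "lime", "cherry"]                     # hypothetical data
print(dirichlet_estimate(draws, ["cherry", "lime", "mint"]))
# "mint" never appeared, but still gets nonzero probability
```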

Recap
- Parameter learning via coin flips
- Maximum likelihood
- Bayesian learning with a Beta prior
- Learning Bayes net parameters

Next Time
- Introduction to machine learning
- Reading: R&N