STATISTICAL LEARNING (FROM DATA TO DISTRIBUTIONS)

STATISTICAL LEARNING (FROM DATA TO DISTRIBUTIONS)

AGENDA: Learning a discrete probability distribution from data. Maximum likelihood estimation (MLE). Maximum a posteriori (MAP) estimation.

MOTIVATION: The agent has made observations (data), and now it must make sense of them (hypotheses). Hypotheses alone may be important (e.g., in basic science), or they may be used for inference (e.g., forecasting) or to take sensible actions (decision making). This is a basic component of economics, the social and hard sciences, engineering, …

MACHINE LEARNING VS. STATISTICS: Machine learning ≈ automated statistics. This lecture: Bayesian learning, maximum likelihood (ML) learning, maximum a posteriori (MAP) learning, learning Bayes nets (R&N). Future lectures try to do more with even less data: decision tree learning, neural nets, support vector machines, …

PUTTING GENERAL PRINCIPLES TO PRACTICE: Note: this lecture covers general principles of statistical learning on toy problems. It is grounded in some of the most theoretically principled approaches to learning, but the techniques are far more general. Practical application to larger problems requires a bit of mathematical savvy and “elbow grease”.

BEAUTIFUL RESULTS IN MATH GIVE RISE TO SIMPLE IDEAS, OR: HOW TO JUMP THROUGH HOOPS IN ORDER TO JUSTIFY SIMPLE IDEAS…

CANDY EXAMPLE: Candy comes in 2 flavors, cherry and lime, with identical wrappers. The manufacturer makes 5 indistinguishable bags: h1 (C: 100%, L: 0%), h2 (C: 75%, L: 25%), h3 (C: 50%, L: 50%), h4 (C: 25%, L: 75%), h5 (C: 0%, L: 100%). Suppose we draw a sequence of candies from one bag. Which bag are we holding? What flavor will we draw next?

BAYESIAN LEARNING: Main idea: compute the probability of each hypothesis, given the data. Data d: the sequence of candies drawn so far. Hypotheses: h1, …, h5 (the five bags above). We want P(h_i | d)… but all we can compute directly is P(d | h_i)!

USING BAYES’ RULE: P(h_i | d) = α P(d | h_i) P(h_i) is the posterior (recall, 1/α = P(d) = Σ_i P(d | h_i) P(h_i)). P(d | h_i) is the likelihood. P(h_i) is the hypothesis prior.

LIKELIHOOD AND PRIOR: The likelihood is the probability of observing the data, given the hypothesis model. The hypothesis prior is the probability of a hypothesis, before having observed any data.

COMPUTING THE POSTERIOR: Assume draws are independent. Let P(h1), …, P(h5) = (0.1, 0.2, 0.4, 0.2, 0.1) and d = {10 limes in a row}. Likelihoods: P(d|h1) = 0, P(d|h2) = 0.25^10, P(d|h3) = 0.5^10, P(d|h4) = 0.75^10, P(d|h5) = 1^10 = 1. Weighting by the priors: P(d|h1)P(h1) = 0, P(d|h2)P(h2) ≈ 2e-7, P(d|h3)P(h3) ≈ 4e-4, P(d|h4)P(h4) ≈ 0.011, P(d|h5)P(h5) = 0.1, and their sum is 1/α ≈ 0.112. Normalizing gives the posterior: P(h1|d) = 0, P(h2|d) ≈ 0.00, P(h3|d) ≈ 0.00, P(h4|d) ≈ 0.10, P(h5|d) ≈ 0.90.
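A minimal Python sketch of this computation (the lists and variable names below are ours, mirroring the numbers on the slide):

```python
# Posterior over the five candy-bag hypotheses after observing 10 limes.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1)..P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | h_i)
n_limes = 10

# Likelihood of 10 i.i.d. lime draws under each hypothesis
likelihoods = [p ** n_limes for p in p_lime]

# Unnormalized posterior, then normalize (the sum is 1/alpha)
unnorm = [lik * pr for lik, pr in zip(likelihoods, priors)]
alpha_inv = sum(unnorm)
posterior = [u / alpha_inv for u in unnorm]

print([round(p, 3) for p in posterior])  # ~[0.0, 0.0, 0.003, 0.101, 0.896]
```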

POSTERIOR HYPOTHESES (figure: posterior probability of each hypothesis as more limes are observed).

PREDICTING THE NEXT DRAW: P(X | d) = Σ_i P(X | h_i, d) P(h_i | d) = Σ_i P(X | h_i) P(h_i | d) (diagram: hypothesis H with the data D and the next draw X as its children). With the posterior P(h1|d) = 0, P(h2|d) ≈ 0.00, P(h3|d) ≈ 0.00, P(h4|d) ≈ 0.10, P(h5|d) ≈ 0.90 and the per-hypothesis lime probabilities P(X|h1) = 0, P(X|h2) = 0.25, P(X|h3) = 0.5, P(X|h4) = 0.75, P(X|h5) = 1, the probability that the next candy drawn is a lime is P(X|d) = 0.975.
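The prediction is just the posterior-weighted average of the per-hypothesis lime probabilities; a short sketch reusing the (approximate) numbers from above:

```python
# Predictive probability that the next candy is lime, marginalizing over hypotheses.
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]         # P(lime | h_i)
posterior = [0.0, 0.0, 0.003, 0.101, 0.896]  # P(h_i | d) from the sketch above

p_next_lime = sum(px * ph for px, ph in zip(p_lime, posterior))
print(round(p_next_lime, 3))  # ~0.973 (the slide's 0.975 uses the rounded 0.10/0.90)
```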

P(NEXT CANDY IS LIME | d) (figure: the predicted probability of drawing a lime as more limes are observed).

PROPERTIES OF BAYESIAN LEARNING: If exactly one hypothesis is correct, then the posterior probability of the correct hypothesis will tend toward 1 as more data is observed. The effect of the prior distribution decreases as more data is observed.

HYPOTHESIS SPACES OFTEN INTRACTABLE: To learn a probability distribution, a hypothesis would have to be a joint probability table over the state variables. With n boolean variables that is 2^n entries, so the hypothesis space is (2^n − 1)-dimensional, and there are 2^(2^n) deterministic hypotheses; 6 boolean variables already give 2^64 ≈ 1.8×10^19 hypotheses. And what the heck would a prior even be?

LEARNING COIN FLIPS: Let the unknown fraction of cherries be θ (the hypothesis). The probability of drawing a cherry is θ. Suppose draws are independent and identically distributed (i.i.d.). We observe that c out of N draws are cherries (the data).

LEARNING COIN FLIPS: Intuition: θ = c/N might be a good hypothesis (or it might not, depending on the draw!).

MAXIMUM LIKELIHOOD: Likelihood of the data d = {d1, …, dN} given θ: P(d | θ) = Π_j P(d_j | θ) = θ^c (1−θ)^(N−c). (The i.i.d. assumption gives the product; then gather the c cherry terms together, followed by the N−c lime terms.)
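A small sketch (with hypothetical draws and θ) checking that the product over i.i.d. draws matches the closed form:

```python
import math

# Likelihood of an i.i.d. candy sequence two ways: as a product over draws,
# and in the closed form theta^c * (1-theta)^(N-c).
draws = ['C', 'C', 'L', 'C', 'L', 'C', 'C', 'L', 'C', 'C']  # made-up data
theta = 0.7                                                  # P(cherry)

product_form = math.prod(theta if d == 'C' else (1 - theta) for d in draws)

c, N = draws.count('C'), len(draws)
closed_form = theta ** c * (1 - theta) ** (N - c)

assert abs(product_form - closed_form) < 1e-12
```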

MAXIMUM LIKELIHOOD: The peaks of the likelihood function seem to hover around the fraction of cherries in the data… Their sharpness indicates some notion of certainty…

MAXIMUM LIKELIHOOD: Let P(d | θ) be the likelihood function. The quantity argmax_θ P(d | θ) is known as the maximum likelihood estimate (MLE).

MAXIMUM LIKELIHOOD: The maximum of P(d | θ) is obtained at the same place that the maximum of log P(d | θ) is obtained, because log is a monotonically increasing function. So the MLE is the same as the maximizer of the log likelihood… but: multiplications turn into additions, and we don’t have to deal with such tiny numbers.

MAXIMUM LIKELIHOOD: The maximum of P(d | θ) is obtained at the same place that the maximum of log P(d | θ) is obtained (log is a monotonically increasing function). So write l(θ) = log P(d | θ) = log [θ^c (1−θ)^(N−c)] = log [θ^c] + log [(1−θ)^(N−c)] = c log θ + (N−c) log (1−θ). At a maximum of a function, its derivative is 0, so dl/dθ(θ) = 0 at the maximum likelihood estimate: 0 = c/θ − (N−c)/(1−θ) => θ = c/N.
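A quick numerical check of this result, using made-up counts and a simple grid search over θ:

```python
import math

# Confirm that c*log(theta) + (N-c)*log(1-theta) is maximized at theta = c/N.
c, N = 7, 10

def log_likelihood(theta):
    return c * math.log(theta) + (N - c) * math.log(1 - theta)

# Grid search over (0, 1), avoiding the endpoints where log blows up
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=log_likelihood)

print(theta_mle, c / N)  # both ~0.7
```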

MAXIMUM LIKELIHOOD FOR BNs: For any BN, the ML parameters of any CPT can be derived from the fraction of observed values in the data, conditioned on matching parent values. Example (Earthquake, Burglar → Alarm network, N = 1000 observations): E occurs 500 times and B 200 times, so P(E) = 0.5 and P(B) = 0.2. Counting alarms within each parent configuration gives the CPT P(A | E, B): given E and B, 19/20 = 0.95; given B only, 188/200 = 0.94; given E only, 170/500 = 0.34; given neither, 1/380 ≈ 0.003.
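A generic counting sketch for one such CPT (the `data` list here is a stand-in, not the slide's dataset):

```python
from collections import Counter

# ML estimate of P(A | E, B) by counting, from a list of (e, b, a) observations.
data = [(True, True, True), (True, False, False), (False, False, False),
        (False, True, True), (True, True, True), (False, False, False)]

counts = Counter()        # occurrences of each (e, b) parent configuration
alarm_counts = Counter()  # occurrences of that configuration with a == True

for e, b, a in data:
    counts[(e, b)] += 1
    if a:
        alarm_counts[(e, b)] += 1

cpt = {parents: alarm_counts[parents] / n for parents, n in counts.items()}
print(cpt)  # e.g. {(True, True): 1.0, (True, False): 0.0, ...}
```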

OTHER MLE RESULTS: Multi-valued variables: take the fraction of counts for each value. Continuous Gaussian distributions: take the average value as the mean and the standard deviation of the data as the standard deviation.
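For the Gaussian case, a minimal sketch with made-up samples (note the ML estimate divides by N, not the unbiased N−1):

```python
import math

# MLE for a Gaussian: sample mean and (1/N) standard deviation of the data.
samples = [2.1, 1.9, 2.4, 2.0, 1.8, 2.3]

mu = sum(samples) / len(samples)
sigma = math.sqrt(sum((x - mu) ** 2 for x in samples) / len(samples))

print(mu, sigma)
```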

MAXIMUM LIKELIHOOD PROPERTIES: As the number of data points approaches infinity, the MLE approaches the true parameter value. With little data, however, MLEs can vary wildly.

MAXIMUM LIKELIHOOD IN THE CANDY BAG EXAMPLE: h_ML = argmax_{h_i} P(d | h_i), and the prediction P(X | d) is approximated by P(X | h_ML). (Figure: h_ML is undefined before any data and quickly becomes h5 after a few limes, so the plotted P(X | h_ML) is a coarser prediction than the Bayesian P(X | d).)

BACK TO COIN FLIPS: The MLE is easy to compute… but what about those small sample sizes? Motivation: you hand me a coin from your pocket; 1 flip, and it turns up tails. What’s the MLE? (θ = c/N = 0/1 = 0, i.e., the estimate says the coin will never come up heads!)

MAXIMUM A POSTERIORI (MAP) ESTIMATION: Idea: use the hypothesis prior to get a better initial estimate than ML, without resorting to full Bayesian estimation. “Most coins I’ve seen have been fair coins, so I won’t let the first few tails sway my estimate much.” “Now that I’ve seen 100 tails in a row, I’m pretty sure it’s not a fair coin anymore.”

MAXIMUM A POSTERIORI: P(θ | d) is the posterior probability of the hypothesis, given the data; argmax_θ P(θ | d) is known as the maximum a posteriori (MAP) estimate. Posterior of hypothesis θ given data d = {d1, …, dN}: P(θ | d) = 1/α P(d | θ) P(θ). The max over θ doesn’t affect α, so the MAP estimate is argmax_θ P(d | θ) P(θ).

MAXIMUM A POSTERIORI: h_MAP = argmax_{h_i} P(h_i | d), and the prediction P(X | d) is approximated by P(X | h_MAP). (Figure: h_MAP moves from h3 to h4 to h5 as limes accumulate, so P(X | h_MAP) tracks the Bayesian prediction P(X | d) more closely than the ML estimate did.)

ADVANTAGES OF MAP AND MLE OVER BAYESIAN ESTIMATION: They involve an optimization rather than a large summation, so local search techniques apply. And for some types of distributions, there are closed-form solutions that are easily computed.

BACK TO COIN FLIPS: We need some prior distribution P(θ): P(θ | d) ∝ P(d | θ) P(θ) = θ^c (1−θ)^(N−c) P(θ). Define, for all θ in [0, 1], the probability that I believe in θ. (Figure: a candidate prior density P(θ) over θ from 0 to 1.)

MAP ESTIMATE: We could maximize θ^c (1−θ)^(N−c) P(θ) using some optimization procedure. It turns out that for some families of P(θ) the MAP estimate is easy to compute: Beta distributions. (Figure: Beta prior densities over θ from 0 to 1.)

BETA DISTRIBUTION: Beta_{α,β}(θ) = γ θ^(α−1) (1−θ)^(β−1), with hyperparameters α, β > 0 and γ a normalization constant. α = β = 1 is the uniform distribution.
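A small sketch evaluating a few Beta densities (assumes scipy is available; the hyperparameter choices are ours):

```python
from scipy.stats import beta

# Evaluate some Beta densities over theta in [0, 1]; alpha = beta = 1 is uniform.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
for a, b in [(1, 1), (2, 2), (5, 2)]:
    densities = [round(beta.pdf(t, a, b), 3) for t in thetas]
    print(f"Beta({a},{b}):", densities)
```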

POSTERIOR WITH A BETA PRIOR: Posterior ∝ θ^c (1−θ)^(N−c) P(θ) = γ θ^(c+α−1) (1−θ)^(N−c+β−1). The posterior is also a Beta distribution! MAP estimate: θ = (c+α−1)/(N+α+β−2).
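Applying this formula to the one-flip example above (the Beta(2, 2) prior is our illustrative choice):

```python
# MAP estimate with a Beta(alpha, beta) prior vs. the MLE, for one flip of tails.
alpha, beta_ = 2.0, 2.0  # mild prior belief in a fair coin
c, N = 0, 1              # heads observed, total flips

theta_mle = c / N
theta_map = (c + alpha - 1) / (N + alpha + beta_ - 2)

print(theta_mle, theta_map)  # 0.0 vs ~0.333
```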

POSTERIOR WITH A BETA PRIOR: What does this mean? The prior specifies a “virtual count” of a = α−1 heads and b = β−1 tails. See heads, increment a; see tails, increment b. The MAP estimate is a/(a+b). The effect of the prior diminishes with more data.

MAP FOR BNs: For any BN, the MAP parameters of any CPT can be derived from the fraction of observed values in the data plus the prior virtual counts. (Figure: the same Earthquake/Burglar/Alarm network, with virtual counts added to the numerator and denominator of each CPT entry, e.g. P(A | E, B) = (19+100)/(20+100) ≈ 0.99.)
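The counting sketch from before extends directly; the virtual counts below are illustrative choices, not values from the slide:

```python
from collections import Counter

# MAP estimate of P(A | E, B): ML counting plus virtual counts from the prior.
data = [(True, True, True), (True, False, False), (False, False, False),
        (False, True, True), (True, True, True), (False, False, False)]

virtual_alarm, virtual_total = 1, 2  # e.g. a uniform Beta(2, 2)-style prior per entry

counts, alarm_counts = Counter(), Counter()
for e, b, a in data:
    counts[(e, b)] += 1
    if a:
        alarm_counts[(e, b)] += 1

cpt_map = {parents: (alarm_counts[parents] + virtual_alarm) / (n + virtual_total)
           for parents, n in counts.items()}
print(cpt_map)
```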

EXTENSIONS OF BETA PRIORS: Multiple discrete values: Dirichlet prior. Its mathematical expression is more complex, but in practice it still takes the form of “virtual counts”. Mean and standard deviation of Gaussian distributions: Gamma prior (on the precision). Conjugate priors preserve the representation of prior and posterior distributions, but do not necessarily exist for general distributions.

RECAP: Bayesian learning: infer the entire distribution over hypotheses; robust but expensive for large problems. Maximum likelihood (ML): optimize the likelihood; good for large datasets, not robust for small ones. Maximum a posteriori (MAP): weight the likelihood by a hypothesis prior during optimization; a compromise solution for small datasets, though, like Bayesian learning, it leaves the question of how to specify the prior.

APPLICATION TO STRUCTURE LEARNING: So far we’ve assumed a BN structure, which assumes we have enough domain knowledge to declare strict independence relationships. Problem: how do we learn a sparse probabilistic model whose structure is unknown?

BAYESIAN STRUCTURE LEARNING: But what if the Bayesian network has unknown structure? Are Earthquake and Burglar independent, i.e., is P(E,B) = P(E)P(B)? (Figure: Earthquake and Burglar nodes joined by a question-marked edge, next to a table of observed B, E samples.)

BAYESIAN STRUCTURE LEARNING: Hypothesis 1: dependent. P(E,B) as a full probability table has 3 free parameters. Hypothesis 2: independent. The individual tables P(E) and P(B) have 2 free parameters in total. Maximum likelihood will always prefer hypothesis 1!

BAYESIAN STRUCTURE LEARNING: Occam’s razor: prefer simpler explanations over complex ones, by penalizing the free parameters in the probability tables.
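One common way to implement such a penalty (not named on the slide) is a BIC-style score; the numbers in this sketch are hypothetical:

```python
import math

# BIC-style score: log-likelihood minus (k/2) * log(N) for k free parameters
# estimated from N samples.
def bic_score(log_likelihood, num_free_params, num_samples):
    return log_likelihood - 0.5 * num_free_params * math.log(num_samples)

# Hypothetical comparison: the dependent model fits slightly better but pays for
# its extra parameter, so the independent model can win on the penalized score.
print(bic_score(-520.0, 3, 1000))  # dependent P(E,B): 3 free parameters
print(bic_score(-521.5, 2, 1000))  # independent P(E)P(B): 2 free parameters
```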

ITERATIVE GREEDY ALGORITHM: Fit a simple model (ML) and compute its likelihood. Repeat: pick a candidate edge to add to the BN; compute new ML probabilities; if the new likelihood (usually log likelihood) exceeds the old likelihood plus a threshold, add the edge; otherwise, repeat with a different edge.
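A toy sketch of this loop on a tiny discrete network, scoring structures by their ML log-likelihood (the data, candidate edges, and threshold are all illustrative; a real learner would also check for cycles and penalize parameters):

```python
import math
from collections import Counter
from itertools import product

def node_log_likelihood(data, child, parents):
    """ML log-likelihood contribution of one node given its parent set."""
    joint = Counter(tuple(row[p] for p in parents) + (row[child],) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    return sum(n * math.log(n / parent_counts[key[:-1]]) for key, n in joint.items())

def structure_log_likelihood(data, parents_of):
    return sum(node_log_likelihood(data, v, ps) for v, ps in parents_of.items())

# Hypothetical data over variables E, B, A (rows are dicts of 0/1 values)
data = [{'E': e, 'B': b, 'A': (e or b)} for e, b in product([0, 1], repeat=2)] * 25

parents_of = {'E': (), 'B': (), 'A': ()}           # start with no edges
candidates = [('E', 'A'), ('B', 'A'), ('E', 'B')]  # edges we are willing to add
threshold = 1.0                                     # required log-likelihood gain

score = structure_log_likelihood(data, parents_of)
for src, dst in candidates:
    trial = dict(parents_of)
    trial[dst] = trial[dst] + (src,)                # tentatively add the edge
    new_score = structure_log_likelihood(data, trial)
    if new_score > score + threshold:               # accept only a clear improvement
        parents_of, score = trial, new_score

print(parents_of, round(score, 2))
```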

OTHER TOPICS IN LEARNING DISTRIBUTIONS: Learning continuous distributions using kernel density estimation / nearest-neighbor smoothing. Expectation Maximization for hidden variables. Learning Bayes nets with hidden variables. Learning HMMs from observations. Learning Gaussian mixture models of continuous data.

NEXT TIME: Introduction to machine learning (R&N).