Statistical Learning (From data to distributions)


Reminders: HW5 deadline extended to Friday.

Agenda
–Learning a probability distribution from data
–Maximum likelihood estimation (MLE)
–Maximum a posteriori (MAP) estimation
–Expectation Maximization (EM)

Motivation
An agent has made observations (data) and must now make sense of them (hypotheses):
–Hypotheses alone may be important (e.g., in basic science)
–For inference (e.g., forecasting)
–To take sensible actions (decision making)
A basic component of economics, the social and hard sciences, engineering, …

Candy Example
Candy comes in 2 flavors, cherry and lime, with identical wrappers. The manufacturer makes 5 kinds of (indistinguishable) bags:
–H1: 100% cherry, 0% lime
–H2: 75% cherry, 25% lime
–H3: 50% cherry, 50% lime
–H4: 25% cherry, 75% lime
–H5: 0% cherry, 100% lime
Suppose we draw a sequence of candies, say 10 limes in a row. Which bag are we holding? What flavor will we draw next?

Machine Learning vs. Statistics
Machine Learning ≈ automated statistics
This lecture:
–Bayesian learning, the more “traditional” statistics (R&N)
–Learning Bayes nets

Bayesian Learning
Main idea: consider the probability of each hypothesis given the data, P(h_i|d).
Here the data d is the sequence of candies drawn so far, and the hypotheses h1,…,h5 are the five bag types above (h1: 100% cherry … h5: 100% lime).

Using Bayes’ Rule
P(h_i|d) = α P(d|h_i) P(h_i) is the posterior
–(Recall, 1/α = Σ_i P(d|h_i) P(h_i))
P(d|h_i) is the likelihood
P(h_i) is the hypothesis prior

Computing the Posterior
Assume draws are independent. Let (P(h_1),…,P(h_5)) = (0.1, 0.2, 0.4, 0.2, 0.1) and d = {10 limes in a row}.
Likelihoods: P(d|h_1) = 0, P(d|h_2) = 0.25^10, P(d|h_3) = 0.5^10, P(d|h_4) = 0.75^10, P(d|h_5) = 1^10 = 1
Unnormalized posteriors: P(d|h_1)P(h_1) = 0, P(d|h_2)P(h_2) ≈ 1.9e-7, P(d|h_3)P(h_3) ≈ 4e-4, P(d|h_4)P(h_4) ≈ 0.011, P(d|h_5)P(h_5) = 0.1
Sum = 1/α ≈ 0.112
Posteriors: P(h_1|d) = 0, P(h_2|d) ≈ 0.00, P(h_3|d) ≈ 0.00, P(h_4|d) ≈ 0.10, P(h_5|d) ≈ 0.90

Posterior Hypotheses
[Figure: the posterior probability of each hypothesis as a function of the number of candies observed]

Predicting the Next Draw
P(X|d) = Σ_i P(X|h_i, d) P(h_i|d) = Σ_i P(X|h_i) P(h_i|d)
With posteriors P(h_1|d) = 0, P(h_2|d) ≈ 0.00, P(h_3|d) ≈ 0.00, P(h_4|d) ≈ 0.10, P(h_5|d) ≈ 0.90,
and per-hypothesis predictions P(X|h_1) = 0, P(X|h_2) = 0.25, P(X|h_3) = 0.5, P(X|h_4) = 0.75, P(X|h_5) = 1,
the probability that the next candy drawn is a lime is P(X|d) ≈ 0.975.
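
As a sanity check on these numbers, here is a small NumPy sketch (mine, not from the slides) that recomputes the posterior over the five bag hypotheses after 10 lime draws and then the predictive probability that the next candy is lime:

```python
import numpy as np

# Probability that a single drawn candy is lime under each hypothesis h1..h5
p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
prior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])      # P(h1),...,P(h5)

likelihood = p_lime ** 10                        # d = 10 i.i.d. lime draws
posterior = likelihood * prior                   # unnormalized P(h_i|d)
print(posterior.sum())                           # 1/alpha, about 0.112
posterior /= posterior.sum()
print(posterior.round(3))                        # about [0, 0, 0.003, 0.101, 0.896]

# Predictive probability that the next candy is lime
print((p_lime * posterior).sum())                # about 0.97 (0.975 if the posteriors are rounded first)
```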

P(Next Candy is Lime | d)
[Figure: the predictive probability that the next candy is lime, as a function of the number of candies observed]

Other properties of Bayesian Estimation
Any learning technique trades off between good fit and hypothesis complexity.
The prior can penalize complex hypotheses:
–There are many more complex hypotheses than simple ones
–Ockham’s razor

Hypothesis Spaces often Intractable
A hypothesis is a joint probability table over the state variables:
–2^n entries => the hypothesis space is [0,1]^(2^n)
–2^(2^n) deterministic hypotheses
6 boolean variables => 2^(2^6) = 2^64, i.e., over 10^19 deterministic hypotheses
Summing over hypotheses is expensive!

Some Common Simplifications
Maximum a posteriori estimation (MAP):
–h_MAP = argmax_{h_i} P(h_i|d)
–P(X|d) ≈ P(X|h_MAP)
Maximum likelihood estimation (ML):
–h_ML = argmax_{h_i} P(d|h_i)
–P(X|d) ≈ P(X|h_ML)
Both approach the true Bayesian predictions as the amount of data grows large.
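
Continuing the candy example, both approximations are one-line argmaxes (a sketch reusing the arrays from the snippet above; the variable names are my own):

```python
import numpy as np

p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # P(lime | h_i)
prior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])      # P(h_i)
likelihood = p_lime ** 10                        # d = 10 limes

h_ml = int(np.argmax(likelihood))                # maximum likelihood hypothesis (index 4 = h5)
h_map = int(np.argmax(likelihood * prior))       # maximum a posteriori hypothesis (also h5 here)
print(h_ml + 1, h_map + 1)
print(p_lime[h_map])                             # P(X|h_MAP) = 1.0, vs. the full Bayesian 0.975
```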

Maximum a Posteriori
h_MAP = argmax_{h_i} P(h_i|d), and P(X|d) ≈ P(X|h_MAP)
[Figure: as more candies are observed, h_MAP shifts from h3 to h4 to h5; P(X|h_MAP) compared with the full Bayesian P(X|d)]

Maximum a Posteriori
For large amounts of data, P(incorrect hypothesis | d) approaches 0.
For small sample sizes, MAP predictions are “overconfident”.
[Figure: P(X|h_MAP) compared with P(X|d)]

Maximum Likelihood
h_ML = argmax_{h_i} P(d|h_i), and P(X|d) ≈ P(X|h_ML)
[Figure: h_ML is undefined before any data are seen, then becomes h5; P(X|h_ML) compared with P(X|d)]

Maximum Likelihood
h_ML = h_MAP with a uniform prior.
The influence of the prior diminishes as more data arrive.
Preferred by some statisticians:
–Are priors “cheating”?
–What is a prior anyway?

Advantages of MAP and MLE over Bayesian estimation
They involve an optimization rather than a large summation:
–Local search techniques apply
For some types of distributions, there are closed-form solutions that are easily computed.

Learning Coin Flips (Bernoulli distribution)
Let the unknown fraction of cherries be θ.
Suppose draws are independent and identically distributed (i.i.d.).
Observe that c out of N draws are cherries.

Maximum Likelihood
Likelihood of the data d = {d_1,…,d_N} given θ:
–P(d|θ) = Π_j P(d_j|θ) = θ^c (1−θ)^(N−c)
(The product form is the i.i.d. assumption; gathering the c cherries together and then the N−c limes gives the closed form.)

Maximum Likelihood
Maximizing P(d|θ) is the same as maximizing the log likelihood L(d|θ) = log P(d|θ) = c log θ + (N−c) log(1−θ).
max_θ L(d|θ) => dL/dθ = 0 => c/θ − (N−c)/(1−θ) = 0 => θ = c/N
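
The closed form θ = c/N is easy to verify numerically. This is an illustrative sketch (the counts c = 7, N = 10 are made up) that maximizes the Bernoulli log likelihood with SciPy and compares the optimum to c/N:

```python
import numpy as np
from scipy.optimize import minimize_scalar

c, N = 7, 10   # hypothetical data: 7 cherries out of 10 draws

def neg_log_likelihood(theta):
    # negative of L(d|theta) = c log(theta) + (N - c) log(1 - theta)
    return -(c * np.log(theta) + (N - c) * np.log(1.0 - theta))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, c / N)   # the numerical optimum matches the closed form 0.7
```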

Maximum Likelihood for BN
For any Bayes net, the ML parameters of any CPT can be derived from the fractions of observed values in the data.
Example (Earthquake, Burglar -> Alarm), N = 1000 records:
–E occurs in 500 records => P(E) = 0.5; B occurs in 200 records => P(B) = 0.2
–A given E,B: 19/20 => P(A|E,B) = 0.95
–A given B only: 188/200 => P(A|¬E,B) = 0.94
–A given E only: 170/500 => P(A|E,¬B) = 0.34
–A given neither: 1/380 => P(A|¬E,¬B) ≈ 0.003
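
A minimal sketch of this counting procedure, assuming the data arrive as (E, B, A) boolean triples; the tiny data set below is made up purely to illustrate that each CPT row is just the fraction of matching records in which the child variable is true:

```python
from collections import Counter

# Hypothetical records of (earthquake, burglar, alarm)
data = [(True, True, True), (True, False, False), (False, True, True),
        (False, False, False), (True, False, True)]

parent_counts = Counter((e, b) for e, b, _ in data)        # N(E=e, B=b)
alarm_counts = Counter((e, b) for e, b, a in data if a)    # N(E=e, B=b, A=true)

# ML estimate of each CPT row P(A | E=e, B=b)
cpt = {eb: alarm_counts[eb] / n for eb, n in parent_counts.items()}
print(cpt)
```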

Maximum Likelihood for Gaussian Models
Observe a continuous variable x_1,…,x_N.
Fit a Gaussian with mean μ and standard deviation σ:
–Standard procedure: write the log likelihood L = N(C − log σ) − Σ_j (x_j − μ)² / (2σ²), where C is a constant
–Set the derivatives with respect to μ and σ to zero

Maximum Likelihood for Gaussian Models
Observe a continuous variable x_1,…,x_N. Results:
–μ = (1/N) Σ_j x_j (sample mean)
–σ² = (1/N) Σ_j (x_j − μ)² (sample variance)
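
A short NumPy sketch of these closed-form estimates on synthetic data (the true mean 3.0 and standard deviation 2.0 are arbitrary choices for illustration):

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=1000)   # synthetic sample

mu_ml = x.mean()                        # sample mean
var_ml = ((x - mu_ml) ** 2).mean()      # ML variance: divides by N, not N-1
print(mu_ml, np.sqrt(var_ml))           # close to 3.0 and 2.0
```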

Maximum Likelihood for Conditional Linear Gaussians
Y is a child of X; the data are pairs (x_j, y_j).
X is Gaussian, and Y is a linear Gaussian function of X:
–Y(x) ~ N(ax + b, σ)
The ML estimates of a and b are given by least-squares regression, and σ by the standard errors.
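
A sketch of that least-squares fit on synthetic data (the “true” values a = 2, b = 1 and noise level 0.5 are assumptions for illustration); np.linalg.lstsq returns the sum of squared residuals, from which an ML estimate of the noise standard deviation is sqrt(SSR / N):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)    # synthetic linear-Gaussian data

A = np.column_stack([x, np.ones_like(x)])              # design matrix for y = a*x + b
(a_hat, b_hat), ssr, *_ = np.linalg.lstsq(A, y, rcond=None)
sigma_hat = np.sqrt(ssr[0] / len(x))                   # ML estimate of the noise std
print(a_hat, b_hat, sigma_hat)
```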

Back to Coin Flips
What about Bayesian or MAP learning? Motivation:
–I pick a coin out of my pocket
–1 flip turns up heads
–What’s the MLE? (θ = P(heads) = 1, i.e., the coin always lands heads, which is far too confident after a single flip)

Back to Coin Flips
We need some prior distribution P(θ): for every θ in [0,1], P(θ) expresses how strongly I believe in that value of θ.
P(θ|d) ∝ P(d|θ) P(θ) = θ^c (1−θ)^(N−c) P(θ)
[Figure: a prior density P(θ) over θ ∈ [0,1]]

MAP estimate
We could maximize θ^c (1−θ)^(N−c) P(θ) using some numerical optimization.
It turns out that for some families of priors P(θ), the MAP estimate is easy to compute: Beta distributions (the conjugate prior).
[Figure: a Beta-shaped prior P(θ) over θ ∈ [0,1]]

Beta Distribution
Beta_{a,b}(θ) = γ θ^(a−1) (1−θ)^(b−1)
–a, b are hyperparameters
–γ is a normalization constant
–The mean is a/(a+b)

Posterior with Beta Prior
Posterior ∝ θ^c (1−θ)^(N−c) P(θ) = γ θ^(c+a−1) (1−θ)^(N−c+b−1)
The posterior is also a Beta distribution, Beta_{c+a, N−c+b}:
–See heads, increment a
–See tails, increment b
–The prior specifies a “virtual count” of a heads and b tails
MAP estimate (posterior mode): θ = (c+a−1)/(N+a+b−2); the posterior mean is (c+a)/(N+a+b).
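
A small sketch of the conjugate update for the pocket-coin example; the hyperparameters a = b = 2 are an assumed weak, symmetric prior rather than values from the slides:

```python
from scipy.stats import beta

a, b = 2.0, 2.0    # assumed Beta prior hyperparameters
c, N = 1, 1        # one flip, one head

posterior = beta(a + c, b + (N - c))          # posterior is Beta(a+c, b+N-c)
theta_mle = c / N                             # 1.0: overconfident after a single flip
theta_map = (c + a - 1) / (N + a + b - 2)     # posterior mode, 2/3
theta_mean = (c + a) / (N + a + b)            # posterior mean, 3/5
print(theta_mle, theta_map, theta_mean, posterior.mean())
```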

Does this work in general?
Only specific distributions have a conjugate prior of this convenient form:
–Bernoulli, Poisson, geometric, Gaussian, exponential, …
Otherwise, MAP estimation needs an (often expensive) numerical optimization.

How to deal with missing observations?
A very difficult statistical problem in general. E.g., surveys:
–Did the person leave political affiliation blank at random?
–Or do independents do this more often than someone with a strong affiliation?
The situation is better if a variable is completely hidden.

Expectation Maximization for Gaussian Mixture models
Each data point has a label saying which Gaussian it belongs to, but that label is a hidden variable.
Clustering with a mixture of Gaussian distributions:
–E-step: compute the probability that each data point belongs to each Gaussian
–M-step: compute ML estimates of each Gaussian, weighting each sample by the probability that it belongs to that Gaussian
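
A minimal 1-D sketch of these two steps (my own illustration, not the course’s code: the function name, fixed iteration count, and random initialization are all arbitrary choices, and a real implementation would monitor convergence and guard against degenerate components):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, k, iters=50, seed=0):
    """Fit a 1-D mixture of k Gaussians to data x with plain EM."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)      # initialize means at random data points
    sigma = np.full(k, x.std())                    # common initial spread
    w = np.full(k, 1.0 / k)                        # mixing weights
    for _ in range(iters):
        # E-step: responsibility of each component for each point, shape (n, k)
        dens = w * norm.pdf(x[:, None], mu, sigma)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted ML estimates of the weights, means, and std devs
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sigma
```

Called with k = 2 on data drawn from two well-separated Gaussians, this recovers means close to the true ones.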

Learning HMMs
We want to find the transition and observation probabilities.
Data: many observation sequences {O_{1:t}^(j) for 1 ≤ j ≤ N}.
Problem: we don’t observe the X’s!
[Figure: HMM with hidden states X_0 → X_1 → X_2 → X_3 and observations O_1, O_2, O_3]

Learning HMMs
Assume a stationary Markov chain with discrete states x_1,…,x_m:
–Transition parameters θ_ij = P(X_{t+1} = x_j | X_t = x_i)
–Observation parameters φ_i = P(O | X_t = x_i)
[Figure: the same HMM as above]

Learning HMMs
Assume a stationary Markov chain with discrete states x_1,…,x_m:
–Transition parameters θ_ij = P(X_{t+1} = x_j | X_t = x_i)
–Observation parameters φ_i = P(O | X_t = x_i)
–Initial-state parameters π_i = P(X_0 = x_i)
[Figure: a 3-state HMM (x1, x2, x3) with observation O and transition parameters such as θ_13, θ_31, θ_33, θ_22]

Expectation Maximization
–Initialize the parameters randomly
–E-step: infer the expected probabilities of the hidden variables over time, given the current parameters
–M-step: maximize the likelihood of the data over the parameters
[Figure: the 3-state HMM; the full parameter vector θ = (π, θ_11,…,θ_33, φ_1, φ_2, φ_3) collects P(initial state), P(transition i→j), and P(emission)]

Expectation Maximization
–Initialize θ^(0)
–E-step: compute E[P(Z = z | θ^(0), O)], where Z ranges over all combinations of hidden state sequences (e.g., x1 x2 x3 x2, …). The result is a probability distribution over the hidden state at each time t.
–M-step: compute θ^(1) = the ML estimate of the transition/observation distributions under that distribution
[Figure: the same 3-state HMM and parameter vector]

Expectation Maximization
–Initialize θ^(0)
–E-step: compute E[P(Z = z | θ^(0), O)] over all combinations of hidden state sequences, giving a probability distribution over the hidden state at each time t. This is the hard part…
–M-step: compute θ^(1) = the ML estimate of the transition/observation distributions
[Figure: the same 3-state HMM and parameter vector]

E-Step on HMMs
Computing the expectations can be done by:
–Sampling
–Using the forward/backward algorithm on the unrolled HMM (R&N pp. 546)
The latter gives the classic Baum-Welch algorithm.
Note that EM can still get stuck in local optima or even saddle points.
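
For concreteness, here is a compact sketch of the scaled forward/backward E-step for a single discrete observation sequence (my own illustrative code; pi, A, and B play the roles of the initial-state, transition θ_ij, and observation φ_i parameters above, and the function name is made up):

```python
import numpy as np

def hmm_e_step(pi, A, B, obs):
    """Scaled forward/backward pass for one observation sequence.

    pi:  (m,)   initial-state probabilities
    A:   (m, m) transitions, A[i, j] = P(X_{t+1}=x_j | X_t=x_i)
    B:   (m, k) emissions,   B[i, o] = P(O_t=o | X_t=x_i)
    obs: (T,)   observed symbol indices
    Returns gamma (T, m), the per-time state posteriors, and
    xi (T-1, m, m), the pairwise transition posteriors.
    """
    T, m = len(obs), len(pi)
    alpha = np.zeros((T, m))            # scaled forward messages
    beta = np.zeros((T, m))             # scaled backward messages
    scale = np.zeros(T)

    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) / scale[t + 1]

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)

    xi = np.zeros((T - 1, m, m))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()
    return gamma, xi
```

The M-step would then renormalize these expectations: new transition parameters proportional to the sum of xi over time, and new observation parameters from the observed symbols weighted by gamma.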

Next Time: machine learning