Maximum Likelihood (ML), Expectation Maximization (EM)

Maximum Likelihood (ML), Expectation Maximization (EM) Pieter Abbeel, UC Berkeley EECS. Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics.

Outline Maximum likelihood (ML) Priors, and maximum a posteriori (MAP) Cross-validation Expectation Maximization (EM)

Thumbtack Let θ = P(up), 1-θ = P(down). How to determine θ? Empirical estimate: 8 up, 2 down ⇒ θ ≈ 8/10 = 0.8

http://web.me.com/todd6ton/Site/Classroom_Blog/Entries/2009/10/7_A_Thumbtack_Experiment.html

Maximum Likelihood θ = P(up), 1-θ = P(down). Observe: 8 up, 2 down. Likelihood of the observation sequence depends on θ: L(θ) = θ^8 (1-θ)^2. Maximum likelihood finds extrema at θ = 0, θ = 1, θ = 0.8. Inspection of each extremum yields θ_ML = 0.8

Maximum Likelihood More generally, consider a binary-valued random variable with θ = P(1), 1-θ = P(0), and assume we observe n1 ones and n0 zeros. Likelihood: θ^{n1} (1-θ)^{n0}. Derivative of the log-likelihood: n1/θ - n0/(1-θ). Hence the extremum is at θ = n1/(n0+n1), which is the maximum: the empirical fraction of ones.
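In detail (log-likelihood, its derivative, and the resulting estimate):
\log L(\theta) = n_1 \log\theta + n_0 \log(1-\theta)
\frac{d}{d\theta}\log L(\theta) = \frac{n_1}{\theta} - \frac{n_0}{1-\theta} = 0
\;\Rightarrow\; n_1(1-\theta) = n_0\,\theta
\;\Rightarrow\; \theta_{\rm ML} = \frac{n_1}{n_0+n_1}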

Log-likelihood The function log(x) is a monotonically increasing function of x. Hence for any (positive-valued) function f: argmax_x f(x) = argmax_x log f(x). In practice it is often more convenient to optimize the log-likelihood rather than the likelihood itself. Example: log(θ^{n1} (1-θ)^{n0}) = n1 log θ + n0 log(1-θ)

Log-likelihood ↔ Likelihood Reconsider thumbtacks: 8 up, 2 down. Likelihood: θ^8 (1-θ)^2 (not concave). Log-likelihood: 8 log θ + 2 log(1-θ) (concave). Definition: A function f is concave if and only if f(λ x1 + (1-λ) x2) ≥ λ f(x1) + (1-λ) f(x2) for all x1, x2 and λ ∈ [0,1]. Concave functions are generally easier to maximize than non-concave functions.

Concavity and Convexity f is concave if and only if f(λ x1 + (1-λ) x2) ≥ λ f(x1) + (1-λ) f(x2) for all x1, x2 and λ ∈ [0,1]; concave functions are "easy" to maximize. f is convex if and only if f(λ x1 + (1-λ) x2) ≤ λ f(x1) + (1-λ) f(x2) for all x1, x2 and λ ∈ [0,1]; convex functions are "easy" to minimize.

ML for Multinomial Consider having received samples x(1), …, x(m), each taking one of K discrete values. θ_ML can be found by solving a linear system. Perhaps this derivation is easier with Lagrange multipliers; but those are to be covered later in the course only…
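A sketch of the Lagrange-multiplier route (standard result; n_k denotes the number of samples equal to k):
\max_\theta \sum_k n_k \log\theta_k \quad\text{s.t.}\quad \sum_k \theta_k = 1
\mathcal{L}(\theta,\lambda) = \sum_k n_k \log\theta_k + \lambda\Big(1 - \sum_k \theta_k\Big),\qquad \frac{\partial \mathcal{L}}{\partial \theta_k} = \frac{n_k}{\theta_k} - \lambda = 0 \;\Rightarrow\; \theta_k = \frac{n_k}{\lambda}
\text{The constraint gives } \lambda = \sum_j n_j, \text{ hence } \theta_k^{\rm ML} = \frac{n_k}{\sum_j n_j}.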

ML for Fully Observed HMM Given samples of both the state sequence and the observation sequence. Dynamics model: P(x_{t+1} | x_t). Observation model: P(z_t | x_t). ⇒ Independent ML problems for each conditional distribution of the dynamics model and each conditional distribution of the observation model.

ML for Exponential Distribution Source: Wikipedia. Consider having received samples 3.1, 8.2, 1.7.

ML for Exponential Distribution Source: Wikipedia. Consider having received samples x(1), …, x(m).
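Assuming the usual density p(x; λ) = λ e^{-λx}, the ML estimate is the reciprocal of the sample mean; a minimal MATLAB check on the samples from the previous slide:
samples = [3.1 8.2 1.7];                        % observed samples
lambda_ML = 1 / mean(samples);                  % ML estimate of the exponential rate
ll = sum(log(lambda_ML) - lambda_ML*samples);   % log-likelihood at the estimate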

Uniform Consider having received samples x(1), …, x(m).
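For reference, a brief sketch assuming a uniform distribution on [a, b]: the likelihood of m samples is (b-a)^{-m} when all samples lie in [a, b], and 0 otherwise, so it is maximized by the tightest interval containing the data:
\hat a_{\rm ML} = \min_i x^{(i)}, \qquad \hat b_{\rm ML} = \max_i x^{(i)}
(If the slide instead considers a uniform on [0, θ], the same argument gives θ_ML = max_i x^{(i)}.)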

ML for Gaussian Consider having received samples x(1), …, x(m).
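The closed-form answer is the sample mean and the 1/m sample variance; a minimal MATLAB sketch on hypothetical data:
x = randn(1, 100)*2 + 3;             % hypothetical samples, true distribution N(3, 2^2)
mu_ML = mean(x);                     % ML estimate of the mean
sigma2_ML = mean((x - mu_ML).^2);    % ML estimate of the variance (divides by m, not m-1)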

ML for Conditional Gaussian Equivalently: More generally:

ML for Conditional Gaussian
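In matrix form, assuming the model y^{(i)} = wᵀx^{(i)} + ε^{(i)} with ε ~ N(0, σ²) (the usual conditional-Gaussian setup), maximizing the likelihood over w reduces to least squares:
w_{\rm ML} = (X^\top X)^{-1} X^\top y, \qquad \sigma^2_{\rm ML} = \tfrac{1}{m}\,\|y - X\,w_{\rm ML}\|^2
where the rows of X are the x^{(i)\top} and y stacks the y^{(i)}.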

ML for Conditional Multivariate Gaussian To find C: notice that there are effectively two copies of the same equation: the equation formed by terms 1 and 4, and the equation formed by terms 2 and 3, which are transposes of each other. So if we can solve one of these equations, we are done. Solving 1 + 4 is easy (assuming Σ and XᵀX are full rank, of course).
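For reference, the resulting estimate, a standard result for the model z = Cx + ε with ε ~ N(0, Σ) (which appears to be the setting here), and which does not depend on Σ:
C_{\rm ML} = \Big(\sum_i z^{(i)} x^{(i)\top}\Big)\Big(\sum_i x^{(i)} x^{(i)\top}\Big)^{-1}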

Aside: Key Identities for Derivation on Previous Slide
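The matrix-calculus identities commonly used for this derivation (presumably the ones intended here):
\frac{\partial}{\partial A}\,\mathrm{tr}(AB) = B^\top, \qquad \frac{\partial}{\partial A}\,\mathrm{tr}(A B A^\top C) = C A B + C^\top A B^\top, \qquad \frac{\partial}{\partial A}\,\log|A| = (A^{-1})^\top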

ML Estimation in Fully Observed Linear Gaussian Bayes Filter Setting Consider the Linear Gaussian setting: Fully observed, i.e., both the state sequence and the observation sequence are given. ⇒ Two separate ML estimation problems for conditional multivariate Gaussians: 1: the dynamics model, 2: the observation model.

Priors --- Thumbtack Let θ = P(up), 1-θ = P(down). How to determine θ? ML estimate: 5 up, 0 down ⇒ θ_ML = 1. Laplace estimate: add a fake count of 1 for each outcome ⇒ θ = (5+1)/(5+0+2) = 6/7.

Priors --- Thumbtack Alternatively, consider θ to be a random variable. Prior: P(θ) ∝ θ(1-θ). Measurements: P(x | θ). Posterior: P(θ | x) ∝ P(x | θ) P(θ). Maximum A Posteriori (MAP) estimation = find the θ that maximizes the posterior.
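Concretely, with this prior and n1 "up" / n0 "down" observations:
P(\theta \mid x) \;\propto\; \theta^{\,n_1+1}(1-\theta)^{\,n_0+1} \;\Rightarrow\; \theta_{\rm MAP} = \frac{n_1+1}{n_1+n_0+2}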

Priors --- Beta Distribution Figure source: Wikipedia
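For reference, the Beta(α, β) density is
p(\theta;\alpha,\beta) \;\propto\; \theta^{\,\alpha-1}(1-\theta)^{\,\beta-1}, \qquad \theta\in[0,1]
Used as a prior on θ, it behaves like α-1 fake "up" counts and β-1 fake "down" counts in the MAP estimate.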

Priors --- Dirichlet Distribution Generalizes Beta distribution MAP estimate corresponds to adding fake counts n1, …, nK

MAP for Mean of Univariate Gaussian Assume variance known. (Can be extended to also find MAP for variance.) Prior:
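The standard result, assuming a prior μ ~ N(μ0, σ0²) and m observations x^{(i)} ~ N(μ, σ²) with σ² known, is:
\mu_{\rm MAP} \;=\; \frac{\frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma^2}\sum_{i=1}^m x^{(i)}}{\frac{1}{\sigma_0^2} + \frac{m}{\sigma^2}}
i.e., a precision-weighted average of the prior mean and the data; for a Gaussian posterior the MAP estimate coincides with the posterior mean.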

MAP for Univariate Conditional Linear Gaussian Assume variance known. (Can be extended to also find MAP for variance.) Prior: [Interpret!]

MAP for Univariate Conditional Linear Gaussian: Example
for run=1:4
  % generate a random true line and 5 noisy samples from it
  a = randn; b = randn;
  x = (rand(5,1) - 0.5);
  y = a*x + b + randn(5,1);
  X = [ones(5,1) x];
  % ML (least-squares) estimate and MAP estimate with a unit Gaussian prior on [b; a]
  ba_ML = (X'*X)^(-1)*X'*y;
  ba_MAP = (eye(2) + X'*X)^(-1)*(X'*y);
  figure; plot(x, y, '.'); hold on;
  plot(x, ba_ML(1) + ba_ML(2)*x, 'r-');
  plot(x, ba_MAP(1) + ba_MAP(2)*x, 'k-');
  plot(x, b + a*x, 'g-');
end
Plot legend: TRUE line (green), samples (dots), ML fit (red), MAP fit (black).

Prior for Multivariate Conditional Linear Gaussian

Cross Validation Choice of prior will heavily influence the quality of the result. Fine-tune the choice of prior through cross-validation: 1. Split the data into a "training" set and a "validation" set. 2. For a range of priors: Train: compute θ_MAP on the training set. Cross-validate: evaluate performance on the validation set by evaluating the likelihood of the validation data under the θ_MAP just found. 3. Choose the prior with the highest validation score; for this prior, compute θ_MAP on the (training + validation) set. Typical training/validation splits: 1-fold: 70/30, random split. 10-fold: partition into 10 sets; average performance over the 10 runs in which each set in turn is the validation set and the other 9 form the training set.
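A minimal MATLAB sketch of this procedure for the earlier MAP linear-regression example; lambda is a hypothetical name for a scalar prior strength multiplying the identity, and the noise variance is assumed to be 1 when scoring the validation likelihood:
% generate a random true line and noisy samples (as in the earlier example)
a = randn; b = randn;
x = rand(20,1) - 0.5;  y = a*x + b + randn(20,1);
X = [ones(20,1) x];
tr = 1:14;  va = 15:20;                      % roughly a 70/30 split
lambdas = [0.01 0.1 1 10 100];               % candidate prior strengths
val_ll = zeros(size(lambdas));
for k = 1:length(lambdas)
  ba_MAP = (lambdas(k)*eye(2) + X(tr,:)'*X(tr,:)) \ (X(tr,:)'*y(tr));
  resid = y(va) - X(va,:)*ba_MAP;
  val_ll(k) = sum(-0.5*log(2*pi) - resid.^2/2);   % validation log-likelihood
end
[~, best] = max(val_ll);
best_lambda = lambdas(best);                 % refit on all the data with this prior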

Outline Maximum likelihood (ML) Priors, and maximum a posteriori (MAP) Cross-validation Expectation Maximization (EM)

Mixture of Gaussians Generally: Example: ML Objective: given data z(1), …, z(m). Setting derivatives w.r.t. θ, μ, Σ equal to zero does not allow solving for their ML estimates in closed form. We can evaluate the function ⇒ we can in principle perform local optimization, see future lectures. In this lecture: the "EM" algorithm, which is typically used to efficiently optimize the objective (locally).

Expectation Maximization (EM) Example: Model: Goal: Given data z(1), …, z(m) (but no x(i) observed), find maximum likelihood estimates of μ1, μ2. EM basic idea: if x(i) were known ⇒ two easy-to-solve separate ML problems. EM iterates over: E-step: for i = 1, …, m, fill in the missing data x(i) according to what is most likely given the current model μ. M-step: run ML for the completed data, which gives a new model μ.

EM Derivation EM solves a Maximum Likelihood problem of the form: θ: parameters of the probabilistic model we try to find; x: unobserved variables; z: observed variables. Jensen's Inequality

Jensen's inequality Illustration: P(X=x1) = 1-λ, P(X=x2) = λ, so E[X] = λ x2 + (1-λ) x1, and for a concave f: f(E[X]) ≥ E[f(X)].

EM Derivation (ctd) Jensen's Inequality: equality holds when the quantity inside the expectation is constant (so the log is effectively affine over its support). This is achieved for q(x) = P(x | z; θ). EM Algorithm: Iterate 1. E-step: Compute q(x) = P(x | z; θ^(t)). 2. M-step: Compute θ^(t+1) by maximizing the resulting lower bound over θ. M-step optimization can be done efficiently in most cases. E-step is usually the more expensive step. It does not fill in the missing data x with hard values, but finds a distribution q(x).
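Written out, the bound being maximized (standard derivation in the notation θ, x, z above):
\log P(z;\theta) \;=\; \log \sum_x P(x, z;\theta) \;=\; \log \sum_x q(x)\,\frac{P(x, z;\theta)}{q(x)} \;\ge\; \sum_x q(x)\,\log \frac{P(x, z;\theta)}{q(x)}
with equality when q(x) = P(x | z; θ).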

EM Derivation (ctd) The M-step objective is upper-bounded by the true objective. The M-step objective is equal to the true objective at the current parameter estimate. ⇒ The improvement in the true objective is at least as large as the improvement in the M-step objective.

EM 1-D Example --- 2 iterations Estimate a 1-d mixture of two unit-variance Gaussians with a single parameter μ; μ1 = μ - 7.5, μ2 = μ + 7.5.
% 1-d EM demo. N and logN are the Gaussian pdf and log-pdf (defined inline here
% as anonymous functions; in the original they were presumably helper files).
clear; clc; close all;
N = @(x, mu, sigma) exp(-(x - mu).^2 / (2*sigma^2)) / (sqrt(2*pi)*sigma);
logN = @(x, mu, sigma) -(x - mu).^2 / (2*sigma^2) - log(sqrt(2*pi)*sigma);
% true distribution: theta1*N(mu1,1) + theta2*N(mu2,1)
offset = 5; middle = 0;
mu1 = middle - offset/2;  mu2 = middle + offset/2;
sigma1 = 1;  sigma2 = 1;
theta1 = 1/2;  theta2 = 1/2;
x_for_plotting = middle-offset:0.01:middle+offset;
y_for_plotting = theta1*N(x_for_plotting, mu1, sigma1) ...
    + theta2*N(x_for_plotting, mu2, sigma2);
figure; plot(x_for_plotting, y_for_plotting);
xlabel('x'); ylabel('probability density');
% draw samples from the true mixture
N_samples = 20;
samples1 = randn(1, theta1*N_samples)*sigma1 + mu1;
samples2 = randn(1, theta2*N_samples)*sigma2 + mu2;
samples = [samples1 samples2];
figure; plot(samples, zeros(size(samples)), 'x');
axis([min(samples) max(samples) 0 1]);
% parameterized distribution we fit: theta1*N(mu-offset0/2,1) + theta2*N(mu+offset0/2,1)
% ML problem: find the best mu
offset0 = offset + 10;
xx_for_plotting = -offset0:0.1:offset0;
yy_for_plotting = theta1*N(xx_for_plotting, -offset0/2, sigma1) ...
    + theta2*N(xx_for_plotting, offset0/2, sigma2);
figure; plot(xx_for_plotting, yy_for_plotting);
% plot the overall log-likelihood landscape as a function of mu
mu_for_plotting = 4*x_for_plotting;
for i = 1:length(mu_for_plotting)
  ML_objective(i) = sum(log(theta1*N(samples, mu_for_plotting(i)-offset0/2, sigma1) ...
      + theta2*N(samples, mu_for_plotting(i)+offset0/2, sigma2)));
end
figure; plot(mu_for_plotting, ML_objective); xlabel('mu'); ylabel('log-likelihood'); hold on;
axis([min(mu_for_plotting) max(mu_for_plotting) min(ML_objective) max(ML_objective)]);
% EM iterations
mu = middle;                       % init
for em_iter = 1:2                  % increase (e.g., 1:20) for more iterations
  mus(em_iter) = mu;
  % E-step: responsibilities of the two components for each sample
  p1 = theta1*N(samples, mu-offset0/2, sigma1);
  p2 = theta2*N(samples, mu+offset0/2, sigma2);
  z = p1 + p2;  p1 = p1./z;  p2 = p2./z;
  % EM lower bound as a function of the candidate mu (for visualization)
  for i = 1:length(mu_for_plotting)
    EM_bound(i) = sum(p1.*(log(theta1)+logN(samples, mu_for_plotting(i)-offset0/2, sigma1)) ...
        + p2.*(log(theta2)+logN(samples, mu_for_plotting(i)+offset0/2, sigma2))) ...
        - sum(p1.*log(p1) + p2.*log(p2));
  end
  plot(mu_for_plotting, EM_bound, 'r');
  % log-likelihood at the current mu
  ll(em_iter) = sum(log(theta1*N(samples, mu-offset0/2, sigma1) ...
      + theta2*N(samples, mu+offset0/2, sigma2)));
  plot(mu, ll(em_iter), 'kx');
  % M-step: closed-form update for mu
  mu = (p1*(samples+offset0/2)' + p2*(samples-offset0/2)')/N_samples;
end
figure; plot(ll);

EM for Mixture of Gaussians X ~ Multinomial distribution, P(X = k; θ) = θ_k. Z ~ N(μ_k, Σ_k). Observed: z(1), z(2), …, z(m)

EM for Mixture of Gaussians E-step: M-step:
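In standard form (notation as on the previous slide; w_k^{(i)} denotes the responsibility of component k for sample i), the updates are:
\text{E-step: } w_k^{(i)} = P\big(x^{(i)}{=}k \mid z^{(i)}\big) = \frac{\theta_k\,\mathcal{N}(z^{(i)};\mu_k,\Sigma_k)}{\sum_j \theta_j\,\mathcal{N}(z^{(i)};\mu_j,\Sigma_j)}
\text{M-step: } \theta_k = \frac{1}{m}\sum_i w_k^{(i)}, \quad \mu_k = \frac{\sum_i w_k^{(i)} z^{(i)}}{\sum_i w_k^{(i)}}, \quad \Sigma_k = \frac{\sum_i w_k^{(i)}\,(z^{(i)}-\mu_k)(z^{(i)}-\mu_k)^\top}{\sum_i w_k^{(i)}}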

EM Run for Mixture of Gaussians Debugging tip: check log-likelihood score for each step! should increase, otherwise BUG! TODO

ML Objective HMM Given samples of the observation sequence only (the state sequence is not observed). Dynamics model: P(x_{t+1} | x_t). Observation model: P(z_t | x_t). No simple decomposition into independent ML problems for each conditional distribution, and no closed-form solution from setting derivatives equal to zero.

ML Objective HMM ML objective: Can efficiently evaluate it by running a (forward) filter plus a minor additional computation. Forward filter: for t = 0, 1, …, T-1. ⇒ We can evaluate the function ⇒ we can in principle perform local optimization, see future lectures. In this lecture: the "EM" algorithm, which is typically used to efficiently optimize the objective (locally).
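Concretely, the standard forward recursion (presumably what the formulas on this slide spell out), with α_t(x_t) = P(z_{0:t}, x_t):
\alpha_0(x_0) = P(x_0)\,P(z_0\mid x_0), \qquad \alpha_{t+1}(x_{t+1}) = P(z_{t+1}\mid x_{t+1})\sum_{x_t} P(x_{t+1}\mid x_t)\,\alpha_t(x_t)
\log P(z_{0:T};\theta) = \log \sum_{x_T} \alpha_T(x_T)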

EM for HMM --- M-step θ and γ are computed from "soft" counts.

EM for HMM --- E-step No need to find the conditional distribution over the full joint state sequence; run the smoother to find the required marginals, P(x_t | z_{0:T}) and P(x_t, x_{t+1} | z_{0:T}).

EM Example for HMM TODO: make this slide

ML Objective for Linear Gaussians Linear Gaussian setting: Given ML objective: EM-derivation: same as HMM

EM Derivation for Linear Gaussians TODO type this out (perhaps)

EM for Linear Gaussians --- E-Step Forward: Backward:

EM for Linear Gaussians --- M-step [Updates for A, B, C, d. TODO: Fill in once found/derived.]

EM for Linear Gaussians --- The Log-likelihood When running EM, it can be good to keep track of the log-likelihood score --- it is supposed to increase every iteration.

EM for Extended Kalman Filter Setting As the linearization is only an approximation, when performing the updates we might end up with parameters that result in a lower (rather than higher) log-likelihood score. ⇒ Solution: instead of updating the parameters to the newly estimated ones, interpolate between the previous parameters and the newly estimated ones. Perform a "line-search" to find the setting that achieves the highest log-likelihood score.

EM Example for Linear Gaussians TODO