EE-148 Expectation Maximization Markus Weber 5/11/99

Overview
Expectation Maximization is a technique for estimating probability densities in the presence of missing (unobserved) data.
– Density Estimation
– Observed vs. Missing Data
– EM

Probability Density Estimation: Why is it important?
It is the essence of
– Pattern Recognition: estimate p(class|observations) from training data
– Learning Theory (which includes pattern recognition)
Many other methods rely on density estimation:
– HMMs
– Kalman Filters

Probability Density Estimation: How does it work?
Given: samples {x_i}. There are two major philosophies:

Parametric
Provide a parametrized class of density functions, e.g.
– Gaussian: p(x) = f(x; mean, covariance)
– Mixture of Gaussians: p(x) = f(x; means, covariances, mixing weights)
Estimation means finding the parameters which best model the data. Measure? Maximum likelihood! The choice of class reflects prior knowledge.

Non-Parametric
The density is modeled explicitly through the samples, e.g.
– Parzen windows (Rosenblatt, '56; Parzen, '62): make a histogram and convolve it with a kernel (which could be Gaussian)
– K-nearest-neighbor
Prior knowledge is less prominent.
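To make the contrast concrete, here is a minimal sketch of both philosophies on hypothetical 1-D data (my own illustration in Python; the sample data and the kernel width h are assumptions, not from the lecture):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=500)      # 1-D samples {x_i}

    def gaussian_pdf(t, mu, var):
        return np.exp(-0.5 * (t - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

    # Parametric: fit a single Gaussian by maximum likelihood (sample mean/variance)
    mu_hat, var_hat = x.mean(), x.var()

    # Non-parametric: Parzen-window estimate, a Gaussian kernel bump on every sample
    def parzen_pdf(t, samples, h=0.3):
        return gaussian_pdf(t[:, None], samples[None, :], h ** 2).mean(axis=1)

    grid = np.linspace(-3.0, 7.0, 200)
    p_parametric = gaussian_pdf(grid, mu_hat, var_hat)
    p_parzen = parzen_pdf(grid, x)

The parametric estimate is summarized by two numbers and encodes a strong prior assumption (Gaussianity); the Parzen estimate keeps all 500 samples and only assumes smoothness through the kernel width.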

Maximum Likelihood
The standard method (besides Bayesian inference) for parametric density estimation.
Definition: the likelihood of a parameter θ given independent samples {x_i} is
L(θ) = ∏_i p(x_i | θ).
Often one uses the negative log-likelihood instead,
E(θ) = -log L(θ) = -Σ_i log p(x_i | θ),
which we can use as an error function while estimating θ.
For a Gaussian, the sample mean and covariance are the ML estimates!
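As a small sketch (mine, not from the slides) of the negative log-likelihood as an error function, the following evaluates E(θ) for a 2-D Gaussian and checks that the sample mean and covariance score better than perturbed parameters:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.multivariate_normal(mean=[0.0, 3.0],
                                cov=[[2.0, 0.8], [0.8, 1.0]], size=1000)

    def gaussian_nll(X, mu, cov):
        """Negative log-likelihood of i.i.d. samples under N(mu, cov)."""
        d = X.shape[1]
        diff = X - mu
        quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        logdet = np.linalg.slogdet(cov)[1]
        return 0.5 * np.sum(quad + logdet + d * np.log(2 * np.pi))

    # ML estimates: sample mean and (biased) sample covariance
    mu_ml = X.mean(axis=0)
    cov_ml = (X - mu_ml).T @ (X - mu_ml) / len(X)

    print(gaussian_nll(X, mu_ml, cov_ml))          # the ML estimates...
    print(gaussian_nll(X, mu_ml + 0.5, cov_ml))    # ...beat any perturbed parameters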

Missing Data Problems
Occur whenever part of the data is unknown:
– intrinsically inaccessible, e.g.:
  Constellation models: p(X,N,d,h,O), where d and h are not known. How do we obtain p(O|X,N)?
  Gaussian mixture models: which cluster does a data point belong to?
– data is lost or erroneous, e.g.:
  Some faulty/noisy process has generated the data.
  You erased the wrong file and part of your data is gone.
If the missing data is correlated in any way with the observed data, we can hope to extract information about the missing data from the observed data. If the missing data is independent of the observed data, everything is lost.

Example Problem
Samples x_i ∈ R^N from a joint Gaussian p(y,z|θ). In some x_i, some dimensions are lost/unobserved, and we know where the data is missing.
(Figure: the joint density p(y,z|θ), with the conditional p(y|z,θ) when y is missing and p(z|y,θ) when z is missing.)
How can we estimate the parameters of the Gaussian? How can we replace the missing data?
Essential EM ideas:
– If we had an estimate of the joint density, the conditional densities would tell us how the missing data is distributed.
– If we had an estimate of the missing data distribution, we could use it to estimate the joint density.
– There is a way to iterate these two steps which steadily improves the overall likelihood.
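A short sketch of the first idea (my own illustration; the numbers are arbitrary): given an estimate of the joint Gaussian over (y, z), the conditional density tells us how a missing y is distributed once z is observed.

    import numpy as np

    # Hypothetical current estimate of the joint Gaussian over (y, z)
    mu = np.array([1.0, -2.0])                 # [mu_y, mu_z]
    cov = np.array([[2.0, 1.2],
                    [1.2, 3.0]])               # [[S_yy, S_yz], [S_zy, S_zz]]

    def conditional_y_given_z(z, mu, cov):
        """Mean and variance of p(y | z) for a 2-D Gaussian."""
        mu_y, mu_z = mu
        s_yy, s_yz, s_zz = cov[0, 0], cov[0, 1], cov[1, 1]
        cond_mean = mu_y + (s_yz / s_zz) * (z - mu_z)
        cond_var = s_yy - s_yz ** 2 / s_zz
        return cond_mean, cond_var

    # If y is missing in a sample where z = 0.5 was observed:
    print(conditional_y_given_z(0.5, mu, cov))
    # Note: if S_yz = 0 (independence), the conditional equals the marginal of y,
    # so the observed z carries no information about the missing value.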

Expectation Maximization
Task: Estimate the Gaussian p(y,z|μ,Σ) from the available data. We want to do maximum likelihood density estimation, even though we do not know all the data.
If we knew the missing data, we would minimize the negative log-likelihood of the complete data (o = observed, m = missing):
E(θ) = -log p(X_o, X_m | θ).
Since we do not know the missing data X_m, we instead minimize the negative log-likelihood of the observed data X_o:
E(θ) = -log p(X_o | θ) = -log ∫ p(X_o, X_m | θ) dX_m.
We propose an iterative solution (θ_1, θ_2, ...). But for now we are still stuck with the log of an integral.

Some Rewriting
We write p_n(.) for p(.|θ_n). We can rewrite the expression for the negative log-likelihood of the observed data:
-log p(X_o|θ) = -log ∫ p(X_o, X_m|θ) dX_m = -log ∫ p_n(X_m|X_o) [ p(X_o, X_m|θ) / p_n(X_m|X_o) ] dX_m
and similarly, at the current estimate θ_n,
-log p(X_o|θ_n) = -∫ p_n(X_m|X_o) log [ p(X_o, X_m|θ_n) / p_n(X_m|X_o) ] dX_m.
Now we use Jensen's inequality,
-log ∫ q(x) f(x) dx ≤ -∫ q(x) log f(x) dx for any density q,
and obtain
-log p(X_o|θ) + log p(X_o|θ_n) ≤ Q(θ_n, θ) - Q(θ_n, θ_n),
where Q(θ_n, θ) = -∫ p_n(X_m|X_o) log p(X_o, X_m|θ) dX_m.

The Gist
This shows that if we minimize Q(θ_n, θ) with respect to its second argument, the likelihood of the observed data can only increase. We still need to show that after convergence (at a minimum of Q) we have reached a minimum of the negative log-likelihood; we will skip this part here.
To summarize: at every iteration, we need to minimize the following expression with respect to θ:
θ_{n+1} = argmin_θ Q(θ_n, θ) = argmin_θ [ -∫ p(X_m|X_o, θ_n) log p(X_o, X_m|θ) dX_m ].
This corresponds to our initial intuition: we minimize the expectation of the joint negative log-likelihood, where the expectation is computed using the conditional of the previously estimated density.

Specific Example
This is what we need to minimize: Q(θ_n, θ), made concrete for the Gaussian example, with each sample partitioned as z^T = (x_o, x_m)^T.
Update rule for μ: a standard form is given below. The update rule for Σ is equally simple to derive.
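A standard form of this update, written in the notation above (stated here for reference as the textbook result, not as a quote from the slides): each sample's missing block is replaced by its conditional expectation under the current parameters, and the new mean is the average of the completed samples.

    x̂_i = ( x_{i,o}, E[ x_{i,m} | x_{i,o}, μ_n, Σ_n ] )
    μ_{n+1} = (1/M) Σ_i x̂_i

The update for Σ averages the outer products of the completed samples and additionally adds the conditional covariance of each missing block, so the uncertainty about the missing values is not ignored.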

Solution to Example Problem
(For the proof, see the separate handout.) Compute at each iteration the conditional statistics of the missing components and the resulting parameter updates, as sketched below.
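Below is a compact sketch of the whole iteration in Python (my own implementation of the idea, not the original demo code; the data, missing rate, and iteration count are assumptions). The E-step fills in conditional means and accumulates conditional covariances; the M-step re-estimates μ and Σ from the completed statistics.

    import numpy as np

    # EM for a joint Gaussian with values missing at random
    rng = np.random.default_rng(2)
    true_mu = np.array([1.0, -1.0, 2.0])
    true_cov = np.array([[2.0, 0.9, 0.3],
                         [0.9, 1.5, 0.4],
                         [0.3, 0.4, 1.0]])
    X = rng.multivariate_normal(true_mu, true_cov, size=400)
    mask = rng.random(X.shape) < 0.25              # True where a value is missing
    X_obs = np.where(mask, np.nan, X)

    def em_gaussian(X_obs, n_iter=50):
        n, d = X_obs.shape
        missing = np.isnan(X_obs)
        # Initialize from the observed values only
        mu = np.nanmean(X_obs, axis=0)
        cov = np.diag(np.nanvar(X_obs, axis=0))
        for _ in range(n_iter):
            X_hat = np.zeros((n, d))
            C = np.zeros((d, d))                   # accumulated conditional covariances
            for i in range(n):
                m = missing[i]                     # missing entries of sample i
                o = ~m                             # observed entries
                x_hat = X_obs[i].copy()
                if m.any():
                    # E-step: conditional distribution of missing given observed
                    coef = cov[np.ix_(m, o)] @ np.linalg.inv(cov[np.ix_(o, o)])
                    x_hat[m] = mu[m] + coef @ (X_obs[i, o] - mu[o])
                    C[np.ix_(m, m)] += cov[np.ix_(m, m)] - coef @ cov[np.ix_(o, m)]
                X_hat[i] = x_hat
            # M-step: re-estimate mean and covariance from the completed data
            mu = X_hat.mean(axis=0)
            diff = X_hat - mu
            cov = (diff.T @ diff + C) / n
        return mu, cov

    mu_est, cov_est = em_gaussian(X_obs)
    print(mu_est)
    print(cov_est)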

Demo

Other Applications of EM
– Estimating mixture densities
– Learning constellation models from unlabeled data
Many problems can be formulated in an EM framework:
– HMMs
– PCA
– Latent variable models
– the "condensation" algorithm (learning complex motions)
– many computer vision problems
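As an illustration of the first item, here is a minimal sketch (mine, not part of the lecture) of EM for a two-component 1-D Gaussian mixture, where the missing data are the component labels:

    import numpy as np

    rng = np.random.default_rng(3)
    # Data drawn from two 1-D Gaussians; the component labels are the missing data
    x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.7, 200)])

    # Initial guesses for the mixture parameters
    w = np.array([0.5, 0.5])           # mixing weights
    mu = np.array([-1.0, 1.0])         # component means
    var = np.array([1.0, 1.0])         # component variances

    def normal_pdf(t, mu, var):
        return np.exp(-0.5 * (t - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

    for _ in range(100):
        # E-step: responsibilities, i.e. the conditional distribution of the
        # missing labels given the data and the current parameters
        lik = w * normal_pdf(x[:, None], mu, var)        # shape (N, 2)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log-likelihood
        Nk = resp.sum(axis=0)
        w = Nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

    print(w, mu, var)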