Lecture 17 Gaussian Mixture Models and Expectation Maximization


Lecture 17: Gaussian Mixture Models and Expectation Maximization (Machine Learning)

Last Time: Logistic Regression

Today: Gaussian Mixture Models, Expectation Maximization

The Problem You have data that you believe is drawn from n populations. You want to identify the parameters of each population. You know nothing about the populations a priori, except that you believe they are Gaussian…

Gaussian Mixture Models Rather than identifying clusters by their “nearest” centroids, fit a set of k Gaussians to the data by maximum likelihood over a mixture model: p(x) = \pi_1 f_1(x) + \pi_2 f_2(x) + \ldots + \pi_k f_k(x), where the mixing weights \pi_i sum to one.
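As a concrete sketch (not from the slides), here is how such a mixture density could be evaluated in Python; the weights, means, and variances below are made-up illustrative values:

import numpy as np

# Hypothetical 1-D mixture with two Gaussian components (all values made up).
weights = np.array([0.4, 0.6])     # mixing weights pi_k, sum to 1
means = np.array([-2.0, 3.0])      # component means mu_k
variances = np.array([1.0, 0.25])  # component variances sigma_k^2

def gaussian_pdf(x, mu, var):
    # N(x | mu, var) for scalar x
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def mixture_pdf(x):
    # p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

print(mixture_pdf(0.0))  # mixture density at x = 0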

GMM example

Mixture Models Formally, a mixture model is a weighted sum of a number of pdfs, where the weights are determined by a distribution over the components.
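In symbols (a standard formulation; the slide's own equation image is not reproduced in this transcript):

p(x) = \sum_{k=1}^{K} \pi_k f_k(x), \qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1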

Gaussian Mixture Models GMM: a weighted sum of a number of Gaussians, where the weights are determined by a distribution over the components.
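For the Gaussian case this reads (standard notation, again standing in for the slide's equation image):

p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)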

Graphical Models with unobserved variables What if you have variables in a graphical model that are never observed? These are latent variables. Training latent variable models is an unsupervised learning application. (Slide figure: an example model with latent states such as “uncomfortable” and “amused” and observed behaviors such as “sweating” and “laughing”.)

Latent Variable HMMs We can cluster sequences using an HMM with unobserved state variables. We will train latent variable models using Expectation Maximization.

Expectation Maximization Both the training of GMMs and of graphical models with latent variables can be accomplished using Expectation Maximization. Step 1: Expectation (E-step): evaluate the “responsibilities” of each cluster under the current parameters. Step 2: Maximization (M-step): re-estimate the parameters using the current “responsibilities”. This is similar to k-means training.

Latent Variable Representation We can represent a GMM in terms of a latent variable. What does this give us? TODO: plate notation
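One standard way to write this (following the usual textbook treatment, e.g. Bishop): introduce a 1-of-K indicator z with

p(z_k = 1) = \pi_k, \qquad p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)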

GMM data and Latent variables

One last bit We have representations of the joint p(x,z) and the marginal p(x)… The conditional p(z|x) can be derived using Bayes' rule; it is the responsibility that a mixture component takes for explaining an observation x.
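In the usual notation this responsibility is

\gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)}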

Maximum Likelihood over a GMM As usual: identify a likelihood function and set its partial derivatives to zero…
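For data points x_1, …, x_N the GMM log likelihood being maximized is

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}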

Maximum Likelihood of a GMM Optimization of means.
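Setting the derivative with respect to \mu_k to zero yields the familiar responsibility-weighted mean (standard result):

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})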

Maximum Likelihood of a GMM Optimization of covariance. Note the similarity to the regular MLE without responsibility terms.
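The corresponding covariance update is a responsibility-weighted version of the usual MLE:

\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^{\top}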

Maximum Likelihood of a GMM Optimization of the mixing term, using a Lagrange multiplier to enforce \sum_{k=1}^K \pi_k = 1: \frac{\partial}{\partial \pi_k}\left[\ln p(x|\pi, \mu,\Sigma) + \lambda\left(\sum_{k=1}^K \pi_k - 1\right)\right] = 0
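Solving this stationarity condition together with the constraint gives

\pi_k = \frac{N_k}{N}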

MLE of a GMM

EM for GMMs Initialize the parameters. Evaluate the log likelihood. Expectation step: evaluate the responsibilities. Maximization step: re-estimate the parameters. Check for convergence.

EM for GMMs E-step: Evaluate the Responsibilities

EM for GMMs M-Step: Re-estimate Parameters
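To make the loop concrete, here is a minimal NumPy sketch of EM for a one-dimensional GMM. It is illustrative only: the function and variable names and the synthetic data are made up, and a practical implementation would work in log space and guard against collapsing (singular) components.

import numpy as np

def gaussian_pdf(x, mu, var):
    # N(x | mu, var) for scalar data, vectorized over x
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, K, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N = len(x)
    # Initialize the parameters
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, np.var(x))
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: responsibilities gamma[n, k] = pi_k N(x_n | mu_k, var_k) / p(x_n)
        weighted = np.stack([pi[k] * gaussian_pdf(x, mu[k], var[k])
                             for k in range(K)], axis=1)   # shape (N, K)
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi_k, mu_k, var_k from the responsibilities
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / N
        # Log likelihood (under the parameters used in the E-step); check convergence
        ll = np.log(weighted.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, var

# Usage on synthetic data drawn from two made-up Gaussians
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(x, K=2))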

Visual example of EM

Potential Problems Incorrect number of mixture components. Singularities.

Incorrect Number of Gaussians

Incorrect Number of Gaussians

Singularities A minority of the data can have a disproportionate effect on the model likelihood. For example…

GMM example

Singularities When a mixture component collapses on a given point, the mean becomes the point, and the variance goes to zero. Consider the likelihood function as the covariance goes to zero. The likelihood approaches infinity.
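Concretely, if one component's mean sits exactly on a data point x_n and its covariance is \sigma^2 I, that point contributes a factor

\mathcal{N}(x_n \mid x_n, \sigma^2 I) = \frac{1}{(2\pi\sigma^2)^{M/2}} \to \infty \quad \text{as } \sigma^2 \to 0

so the likelihood can be driven arbitrarily high by shrinking that component's variance.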

Relationship to K-means K-means makes hard decisions: each data point gets assigned to a single cluster. GMM/EM makes soft decisions: each data point yields a posterior p(z|x). Soft K-means is a special case of EM.

Soft K-Means as GMM/EM Assume equal, isotropic covariance matrices \Sigma_k = \epsilon I for every mixture component. Likelihood: p(x|\mu_k,\Sigma_k) = \frac{1}{(2\pi\epsilon)^{M/2}}\exp\left\{-\frac{1}{2\epsilon}\lVert x-\mu_k\rVert^2\right\} Responsibilities: as epsilon approaches zero, the responsibility of the component with the nearest mean approaches unity and the others approach zero.
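Under this assumption the responsibilities take the form (standard result)

\gamma(z_{nk}) = \frac{\pi_k \exp\left\{-\lVert x_n - \mu_k \rVert^2 / 2\epsilon\right\}}{\sum_{j} \pi_j \exp\left\{-\lVert x_n - \mu_j \rVert^2 / 2\epsilon\right\}}

and as \epsilon \to 0 the term with the smallest \lVert x_n - \mu_j \rVert^2 dominates, so \gamma(z_{nk}) \to 1 for the nearest mean and \to 0 for all others, recovering hard k-means assignments.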

Soft K-Means as GMM/EM Overall log likelihood as epsilon approaches zero: the expected complete-data log likelihood reduces to the (negative) within-cluster distortion that k-means minimizes. Note: only the means are re-estimated in soft k-means; the covariance matrices are all tied.
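Following the usual textbook derivation (e.g. Bishop), in this limit the expected complete-data log likelihood becomes, up to an additive constant,

\mathbb{E}_{Z}[\ln p(X, Z \mid \mu, \Sigma, \pi)] \to -\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \lVert x_n - \mu_k \rVert^2 + \text{const}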

General form of EM Given a joint distribution over observed and latent variables, we want to maximize the likelihood of the observed data. Initialize the parameters. E-step: evaluate the posterior over the latent variables under the current parameters. M-step: re-estimate the parameters (by maximizing the expectation of the complete-data log likelihood). Check for convergence of the parameters or the likelihood.
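In the usual notation, with parameters \theta and latent variables Z, the E-step evaluates p(Z \mid X, \theta^{\text{old}}) and the M-step maximizes the expected complete-data log likelihood:

Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \ln p(X, Z \mid \theta), \qquad \theta^{\text{new}} = \arg\max_{\theta} Q(\theta, \theta^{\text{old}})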

Next Time: Proof of Expectation Maximization in GMMs, Hidden Markov Models