CS 2750: Machine Learning Expectation Maximization

CS 2750: Machine Learning Expectation Maximization Prof. Adriana Kovashka University of Pittsburgh April 11, 2017

Plan for this lecture
- EM for Hidden Markov Models (last lecture)
- EM for Gaussian Mixture Models
- EM in general
- K-means ~ GMM
- EM for Naïve Bayes

The Gaussian Distribution Chris Bishop
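The density itself appears only in the slide's figure; for reference, the standard univariate and multivariate forms are:

```latex
% Univariate Gaussian
\mathcal{N}(x \mid \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}

% Multivariate Gaussian in D dimensions
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}}
    \exp\left\{ -\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\}
```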

Mixtures of Gaussians: the Old Faithful data set modeled with a single Gaussian vs. a mixture of two Gaussians. Figures from Chris Bishop

Mixtures of Gaussians: combine simple models into a complex model. The slide shows a mixture with K = 3 components, labeling the component densities and the mixing coefficients. Figures from Chris Bishop
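The mixture density appears only in the figure; in standard notation (consistent with the later slides) it is:

```latex
% Each N(x | mu_k, Sigma_k) is a component; pi_k is its mixing coefficient
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad 0 \le \pi_k \le 1, \quad \sum_{k=1}^{K} \pi_k = 1
```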

Mixtures of Gaussians: introduce latent variables z, one for each x, with a 1-of-K representation (z_k ∈ {0, 1}, Σ_k z_k = 1) and prior p(z_k = 1) = π_k. Also: p(x | z_k = 1) = N(x | μ_k, Σ_k). Then p(x) = Σ_z p(x, z) = Σ_z p(z) p(x | z) = Σ_k π_k N(x | μ_k, Σ_k). Figures from Chris Bishop

Mixtures of Gaussians Responsibility of component k for explaining x: Figures from Chris Bishop
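The responsibility formula is shown only in the figure; the standard expression (Bayes' rule on the latent variable) is:

```latex
\gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x})
  = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
         {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
```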

Generating samples: sample a value z* from p(z), then sample a value for x from p(x | z*). Left: generated samples colored by z*. Right: samples colored by their responsibilities. Figures from Chris Bishop
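A minimal NumPy sketch of this ancestral sampling procedure (the parameter values below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D mixture with K = 3 components
pis    = np.array([0.5, 0.3, 0.2])                                 # mixing coefficients p(z)
mus    = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])           # component means
sigmas = np.array([np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)])   # component covariances

def sample_gmm(n):
    """Ancestral sampling: draw z* from p(z), then x from p(x | z*)."""
    zs = rng.choice(len(pis), size=n, p=pis)        # sample component labels z*
    xs = np.array([rng.multivariate_normal(mus[z], sigmas[z]) for z in zs])
    return xs, zs                                   # coloring points by zs reproduces the left plot

xs, zs = sample_gmm(500)
```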

Finding parameters of mixture Want to maximize: Set derivative with respect to means to 0: Figures from Chris Bishop
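The log likelihood and the resulting condition for the means are shown only in the figures; in standard form they are:

```latex
% Log likelihood of the observed data under the mixture
\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}

% Setting the derivative with respect to mu_k to zero gives
\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n,
\qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})
```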

Finding parameters of mixture Set derivative wrt covariances to 0: Set derivative wrt mixing coefficients to 0: Figures from Chris Bishop
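Again the equations are only in the figures; the standard updates are:

```latex
% Setting the derivative with respect to Sigma_k to zero gives
\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})
  \, (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top}

% Maximizing with respect to the mixing coefficients, subject to sum_k pi_k = 1, gives
\pi_k = \frac{N_k}{N}
```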

Reminder Responsibilities: So parameters of GMM depend on responsibilities and vice versa… What can we do? Figures from Chris Bishop

Reminder: K-means clustering. Basic idea: randomly initialize the k cluster centers and iterate:
1. Randomly initialize the cluster centers c1, ..., cK
2. Given cluster centers, determine the points in each cluster: for each point p, find the closest ci and put p into cluster i
3. Given the points in each cluster, solve for ci: set ci to be the mean of the points in cluster i
4. If the ci have changed, repeat from Step 2
Steve Seitz (CS 376 Lecture 7, Segmentation)
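A minimal NumPy sketch of these steps (function and variable names are my own, not from the slides):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: X is (N, D); returns centers (K, D) and labels (N,)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]        # step 1: random init
    for _ in range(n_iters):
        # Step 2: assign each point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its points (keep old center if empty)
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):                     # step 4: stop if unchanged
            break
        centers = new_centers
    return centers, labels
```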

Iterative algorithm Figures from Chris Bishop

Figures from Chris Bishop

Illustration of the algorithm: initialization, E step, M step. Figures from Chris Bishop
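A compact sketch of the full iteration for a GMM, matching the initialization / E step / M step structure (a simplified illustration, not code from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a Gaussian mixture model. X is (N, D)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # Initialization: random means, identity covariances, uniform mixing coefficients
    mus = X[rng.choice(N, size=K, replace=False)]
    sigmas = np.array([np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E step: responsibilities gamma(z_nk)
        dens = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
            for k in range(K)
        ])                                        # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the current responsibilities
        Nk = gamma.sum(axis=0)                    # effective number of points per component
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / N
    return pis, mus, sigmas, gamma
```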

EM in general: for models with hidden variables, the probability of the data X = {xn} depends on the hidden variables Z = {zn}. The complete data set is {X, Z}; for the incomplete (observed) data set X we have p(X | θ) = Σ_Z p(X, Z | θ). We want to maximize the log likelihood ln p(X | θ) = ln { Σ_Z p(X, Z | θ) }. Figures from Chris Bishop

EM in general: we don't know the values of the hidden variables Z, and we also don't know the model parameters θ. Set one, infer the other:
- Keep the model parameters θ fixed, infer the hidden variables Z (compute a conditional probability)
- Keep Z fixed, infer the model parameters θ (maximize an expectation)

EM in general Figures from Chris Bishop
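The algorithm in the figure is not reproduced here; in the standard formulation the two steps alternate as follows:

```latex
% E step: with parameters fixed at theta_old, evaluate the posterior over the hidden variables
p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})

% M step: maximize the expected complete-data log likelihood
Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})
  = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})
    \, \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}),
\qquad
\boldsymbol{\theta}^{\text{new}} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})
```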

GMM in the context of general EM Complete data log likelihood: Expectation of this likelihood: Choose initial values μold, Σold, πold, evaluate responsibilities γ(znk) Fix responsibilities, maximize expectation wrt μ, Σ, π Figures from Chris Bishop
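The two quantities named on this slide appear only in the figures; in standard form they are:

```latex
% Complete-data log likelihood
\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi})
  = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}
    \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}

% Its expectation under the posterior over Z (each z_nk replaced by its responsibility)
\mathbb{E}_{\mathbf{Z}}\!\left[ \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) \right]
  = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})
    \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}
```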

K-means ~ GMM: consider a GMM where all components have covariance matrix εI (I = the identity matrix). The responsibilities are then as given below. As ε → 0, most responsibilities → 0, except γ(znj) → 1 for whichever j has the smallest ‖xn − μj‖². Figures from Chris Bishop
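Written out (standard form), the component density and the responsibilities under the shared covariance εI become:

```latex
p(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k = \epsilon \mathbf{I})
  = \frac{1}{(2\pi\epsilon)^{D/2}}
    \exp\left\{ -\frac{\lVert \mathbf{x} - \boldsymbol{\mu}_k \rVert^2}{2\epsilon} \right\}

\gamma(z_{nk})
  = \frac{\pi_k \exp\left\{ -\lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2 / 2\epsilon \right\}}
         {\sum_{j} \pi_j \exp\left\{ -\lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2 / 2\epsilon \right\}}
```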

K-means ~ GMM Expectation of the complete-data log likelihood, which we want to maximize: Recall in K-means we wanted to minimize: where rnk = 1 if instance n belongs to cluster k, 0 otherwise Figures from Chris Bishop
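In the ε → 0 limit the two objectives coincide up to sign and an additive constant; the standard expressions are:

```latex
% Expected complete-data log likelihood in the limit epsilon -> 0
\mathbb{E}_{\mathbf{Z}}\!\left[ \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) \right]
  \;\rightarrow\; -\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}
     \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2 + \text{const}

% K-means distortion measure to be minimized
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2
```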

Example: EM for Naïve Bayes. We have a graphical model: flavor = class (hidden), shape + color = features (observed); the graph has a flavor node with edges to the shape and color nodes. Observed data (1000 candies total):
P(round, brown) = 0.282
P(round, red) = 0.139
P(square, brown) = 0.141
P(square, red) = 0.438
Log likelihood P(X|Θ) = 282*log(0.282) + … = -1827.17
Example from Rebecca Hwa

Example: EM for Naïve Bayes. Model parameters = conditional probability table (CPT); initialize it randomly:
P(straw) = 0.59, P(choc) = 0.41
P(round|straw) = 0.96, P(square|straw) = 0.04
P(brown|straw) = 0.08, P(red|straw) = 0.92
P(round|choc) = 0.66, P(square|choc) = 0.34
P(brown|choc) = 0.72, P(red|choc) = 0.28
Example from Rebecca Hwa

Example: EM for Naïve Bayes. Expectation step: keeping the model (CPT) fixed, estimate values for the joint probability distribution by multiplying P(flavor) * P(shape|flavor) * P(color|flavor). Then use these to compute the conditional probability of the hidden variable flavor: P(straw|shape,color) = P(straw,shape,color) / P(shape,color) = P(straw,shape,color) / Σ_flavor P(flavor,shape,color). Example from Rebecca Hwa
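For example, with the randomly initialized CPT above (these numbers match the tables on the next slide):
P(straw, round, brown) = 0.59 * 0.96 * 0.08 ≈ 0.045
P(choc, round, brown) = 0.41 * 0.66 * 0.72 ≈ 0.195
P(straw | round, brown) = 0.045 / (0.045 + 0.195) ≈ 0.19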

Example: EM for Naïve Bayes. Expectation step results:
Joint probabilities:
p(straw, round, brown) = 0.05    p(choc, round, brown) = 0.19
p(straw, square, brown) = 0.00   p(choc, square, brown) = 0.10
p(straw, round, red) = 0.52      p(choc, round, red) = 0.08
p(straw, square, red) = 0.02     p(choc, square, red) = 0.04
Responsibilities:
p(straw | round, brown) = 0.19   p(choc | round, brown) = 0.81
p(straw | square, brown) = 0.02  p(choc | square, brown) = 0.98
p(straw | round, red) = 0.87     p(choc | round, red) = 0.13
p(straw | square, red) = 0.36    p(choc | square, red) = 0.64
Marginals under the current model:
P(round, brown) = 0.24, P(round, red) = 0.60, P(square, brown) = 0.10, P(square, red) = 0.06
Log likelihood P(X|Θ) = 282*log(0.24) + … = -2912.63
Example from Rebecca Hwa

Example: EM for Naïve Bayes. Maximization step: expected # "strawberry, round, brown" candies = P(strawberry | round, brown) (from the E step) * #(round, brown) = 0.19 * 282 = 53.6. The other expected counts needed for the CPT can be found in the same way. Example from Rebecca Hwa

Example: EM for Naïve Bayes. Maximization step results:
Expected counts:
exp # straw, round = 174.56    exp # choc, round = 246.44
exp # straw, square = 159.16   exp # choc, square = 419.84
exp # straw, red = 277.91      exp # choc, red = 299.09
exp # straw, brown = 55.81     exp # choc, brown = 367.19
exp # straw = 333.72           exp # choc = 666.28
Updated CPT:
P(straw) = 0.33, P(choc) = 0.67
P(round|straw) = 0.52, P(square|straw) = 0.48
P(brown|straw) = 0.17, P(red|straw) = 0.83
P(round|choc) = 0.37, P(square|choc) = 0.63
P(brown|choc) = 0.55, P(red|choc) = 0.45
Example from Rebecca Hwa

Example: EM for Naïve Bayes. Maximization step (continued): compute P(round, brown), P(round, red), P(square, brown), and P(square, red) with the new parameters (CPT), compute the likelihood P(X|Θ), and repeat until convergence. Example from Rebecca Hwa
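A compact end-to-end sketch of this EM loop on the candy data (variable names and the stopping tolerance are my own; the observed counts and initial CPT are the ones from the slides, and the log base may differ from the slides' likelihood numbers):

```python
import numpy as np

# Observed counts of (shape, color) out of 1000 candies
counts = {("round", "brown"): 282, ("round", "red"): 139,
          ("square", "brown"): 141, ("square", "red"): 438}
shapes, colors, flavors = ["round", "square"], ["brown", "red"], ["straw", "choc"]

# Randomly initialized CPT (initial values from the slides)
p_flavor = {"straw": 0.59, "choc": 0.41}
p_shape  = {"straw": {"round": 0.96, "square": 0.04}, "choc": {"round": 0.66, "square": 0.34}}
p_color  = {"straw": {"brown": 0.08, "red": 0.92},    "choc": {"brown": 0.72, "red": 0.28}}

prev_ll = -np.inf
for it in range(100):
    # E step: joint p(flavor, shape, color) and responsibilities p(flavor | shape, color)
    joint = {(f, s, c): p_flavor[f] * p_shape[f][s] * p_color[f][c]
             for f in flavors for s in shapes for c in colors}
    marg = {(s, c): sum(joint[(f, s, c)] for f in flavors) for s in shapes for c in colors}
    resp = {(f, s, c): joint[(f, s, c)] / marg[(s, c)]
            for f in flavors for s in shapes for c in colors}

    # Log likelihood of the observed data under the current CPT (natural log here)
    ll = sum(n * np.log(marg[sc]) for sc, n in counts.items())
    if abs(ll - prev_ll) < 1e-6:        # repeat until convergence
        break
    prev_ll = ll

    # M step: expected counts -> updated CPT
    exp_f = {f: sum(resp[(f, s, c)] * counts[(s, c)] for s in shapes for c in colors)
             for f in flavors}
    p_flavor = {f: exp_f[f] / sum(counts.values()) for f in flavors}
    p_shape = {f: {s: sum(resp[(f, s, c)] * counts[(s, c)] for c in colors) / exp_f[f]
                   for s in shapes} for f in flavors}
    p_color = {f: {c: sum(resp[(f, s, c)] * counts[(s, c)] for s in shapes) / exp_f[f]
                   for c in colors} for f in flavors}
```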