
Expectation-Maximization

News o’ the day: First “3-D” picture of the Sun. Anybody got red/green sunglasses?

Administrivia No noose is good noose

Where we’re at
Last time: E^3; finished up (our brief survey of) RL
Today: intro to unsupervised learning; the expectation-maximization “algorithm”

What’s with this EM thing? Nobody expects...

Unsupervised learning
EM is (one form of) unsupervised learning. Given: data. Find: “structure” of that data.
Clusters -- what points “group together”? (we’ll do this one today)
Taxonomies -- what’s descended from/related to what?
Parses -- grammatical structure of a sentence
Hidden variables -- “behind the scenes”

Example task

We can see the clusters easily, but the computer can’t. How can we get the computer to identify the clusters? Need: an algorithm that takes data and returns a label (cluster ID) for each data point.

Parsing example
What’s the grammatical structure of this sentence? “He never claimed to be a god.”
[Parse tree on slide: part-of-speech tags N, V, Adv, Det; phrases NP, VP; sentence node S]
Note: entirely hidden information! Need to infer (guess) it in an ~unsupervised way.

EM assumptions
All learning algorithms require assumptions about the data.
EM assumes a generative model: a description of the process that generates your data.
Assumes hidden (latent) variables.
Probability model: assigns probability to data + hidden variables.
Often think: generate the hidden var, then generate the data based on that hidden var.

Classic latent var model
Data generator looks like this. Behind a curtain:
I flip a weighted coin
Heads: I roll a 6-sided die
Tails: I roll a 4-sided die
I show you: the outcome of the die
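As a concrete illustration, here is a minimal Python sketch of that generator. The coin weight (0.7) and the function name are made-up illustration choices, not values from the slides.

```python
import random

def generate(n, p_heads=0.7):
    """Sample n die outcomes from the curtained process:
    flip a weighted coin; heads -> roll a 6-sided die, tails -> roll a 4-sided die."""
    data, hidden = [], []
    for _ in range(n):
        heads = random.random() < p_heads          # the hidden coin flip
        roll = random.randint(1, 6) if heads else random.randint(1, 4)
        hidden.append('H' if heads else 'T')       # the generator knows this...
        data.append(roll)                          # ...but the learner only sees this
    return data, hidden

data, hidden = generate(20)
print(data)    # e.g. [6, 3, 3, 1, 5, ...] -- the sequence you are shown
```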

Your mission
Data you get is a sequence of die outcomes: 6, 3, 3, 1, 5, 4, 2, 1, 6, 3, 1, 5, 2, ...
Your task: figure out what the coin flip was for each of these numbers.
Hidden variable: c ≡ outcome of coin flip
What makes this hard?

A more “practical” example
Robot navigating in the physical world. Locations in the world can be occupied or unoccupied.
The robot wants an occupancy map (so it doesn’t bump into things), but its sensors are imperfect (noise, object variation, etc.).
Given: sensor data. Infer: occupied/unoccupied for each location.

Classic latent var model
This process describes (generates) a probability distribution over numbers.
Hidden state: outcome of the coin flip.
Observed state: outcome of the die, given (conditioned on) the coin flip result.

Probability of observations
Final probability of an outcome x is a mixture of the probability for each possible coin result:
Pr[x] = p · Pr[x | c = heads] + (1 − p) · Pr[x | c = tails]
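A small Python sketch of that mixture, together with the posterior Pr[c | x] we ultimately want. The parameter values (a 0.7 coin and fair dice) are placeholders for illustration, not values given on the slides.

```python
def pr_x(x, p, theta_heads, theta_tails):
    """Mixture probability of a die outcome x:
    Pr[x] = p * Pr[x|heads] + (1-p) * Pr[x|tails]."""
    return p * theta_heads.get(x, 0.0) + (1 - p) * theta_tails.get(x, 0.0)

def pr_heads_given_x(x, p, theta_heads, theta_tails):
    """Posterior over the hidden coin: Pr[c=heads | x], by Bayes' rule."""
    return p * theta_heads.get(x, 0.0) / pr_x(x, p, theta_heads, theta_tails)

# placeholder parameters: weighted coin, fair dice
p = 0.7
theta_heads = {i: 1/6 for i in range(1, 7)}   # 6-sided die
theta_tails = {i: 1/4 for i in range(1, 5)}   # 4-sided die

print(pr_x(5, p, theta_heads, theta_tails))              # only the 6-sided die can roll a 5
print(pr_heads_given_x(5, p, theta_heads, theta_tails))  # so this posterior is 1.0
print(pr_heads_given_x(2, p, theta_heads, theta_tails))  # ambiguous outcome: strictly less than 1
```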

Your goal
Given the model, and data x_1, x_2, ..., x_n: find Pr[c_i | x_i].
So we need the model. The model is given by parameters Θ = ⟨p, θ_heads, θ_tails⟩, where θ_heads and θ_tails are the die-outcome probabilities and p is the probability of heads.

Where’s the problem?
To get Pr[c_i | x_i], you need Pr[x_i | c_i].
To get Pr[x_i | c_i], you need the model parameters.
To get the model parameters, you need Pr[c_i | x_i].
Oh oh...

EM to the rescue!
Turns out that you can run this “chicken and egg” process in a loop and eventually get the right* answer:
Make an initial guess about coin assignments
Repeat:
  Use guesses to get parameters (M step)
  Use parameters to update coin guesses (E step)
Until converged

EM to the rescue!
function [Prc, Theta] = EM(X)
  // initialization
  Prc = pick_random_values()
  // the EM loop
  repeat {
    // M step: pick maximum-likelihood parameters:
    //   Theta = argmax_theta Pr[x, c | theta]
    Theta = get_params_from_c(X, Prc)
    // E step: use the completed model to get the posterior over labels:
    //   Pr[c|x] = 1/Z * Pr[x | c, theta] * Pr[c | theta]
    Prc = get_labels_from_params(X, Theta)
  } until (converged)
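Here is one way to make that loop concrete for the coin/die example, as a Python sketch. The function and variable names are mine, and the fixed iteration count stands in for a real convergence test.

```python
import random

def em_coin_dice(xs, n_iters=50):
    """EM for the weighted-coin / two-dice mixture.
    Learns p (prob. of heads) and each die's outcome probabilities,
    plus Pr[c_i = heads | x_i] for every observation."""
    faces = sorted(set(xs))
    # initial guess: random soft coin assignments (the 'Prc' of the pseudocode)
    pr_heads = [random.random() for _ in xs]

    for _ in range(n_iters):
        # M step: maximum-likelihood parameters given the soft assignments
        p = sum(pr_heads) / len(xs)
        w_h = sum(pr_heads)
        w_t = sum(1 - r for r in pr_heads)
        theta_h = {f: sum(r for r, x in zip(pr_heads, xs) if x == f) / w_h for f in faces}
        theta_t = {f: sum(1 - r for r, x in zip(pr_heads, xs) if x == f) / w_t for f in faces}

        # E step: posterior over the hidden coin given the new parameters
        new = []
        for x in xs:
            h = p * theta_h[x]
            t = (1 - p) * theta_t[x]
            new.append(h / (h + t))
        pr_heads = new

    return p, theta_h, theta_t, pr_heads
```

On data from the generator sketched earlier, the responsibilities for 5s and 6s tend toward 1 (only the 6-sided die can produce them), while outcomes 1-4 stay ambiguous -- which is exactly what makes the problem hard.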

Weird, but true
This is counterintuitive, but it works. Essentially, you’re improving guesses on each step:
M step “maximizes” parameters Θ, given data
E step finds “expectation” of hidden data, given Θ
Both are driving toward the maximum-likelihood joint solution.
Guaranteed to converge. Not guaranteed to find the global optimum...

Very easy example Two Gaussian (“bell curve”) clusters Well separated in space Two dimensions

In more detail
Gaussian mixture w/ k “components” (clusters/blobs). Mixture probability:
Pr[x | Θ] = Σ_{i=1..k} α_i · N(x; μ_i, Σ_i)
One term for each component; α_i is the weight (probability) of each component.
N(x; μ_i, Σ_i) is the Gaussian distribution for each component, w/ mean vector μ_i and covariance matrix Σ_i:
N(x; μ_i, Σ_i) = (2π)^{−d/2} |Σ_i|^{−1/2} exp( −(1/2) (x − μ_i)^T Σ_i^{−1} (x − μ_i) )
The leading factor is the normalizing term for the Gaussian; the exponent contains the squared distance of data point x from the mean μ_i (with respect to Σ_i).
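A direct numpy transcription of that density, as a sketch; the function names, the explicit inverse/determinant (rather than something numerically sturdier), and the example parameter values are choices for readability, not from the slides.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """N(x; mu, Sigma): normalizing term times exp(-1/2 * squared Mahalanobis distance)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    maha = diff @ np.linalg.inv(Sigma) @ diff
    return norm * np.exp(-0.5 * maha)

def mixture_pdf(x, alphas, mus, Sigmas):
    """Pr[x | Theta] = sum_i alpha_i * N(x; mu_i, Sigma_i)."""
    return sum(a * gaussian_pdf(x, m, S) for a, m, S in zip(alphas, mus, Sigmas))

# two well-separated 2-D blobs with equal weights (illustration values)
print(mixture_pdf(np.array([0.1, -0.2]),
                  alphas=[0.5, 0.5],
                  mus=[np.array([0.0, 0.0]), np.array([5.0, 5.0])],
                  Sigmas=[np.eye(2), np.eye(2)]))
```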

Hidden variables
Introduce the “hidden variable”, c_i(x) (or just c_i for short).
Denotes the “amount by which data point x belongs to cluster i”.
Sometimes called “cluster ownership”, “salience”, “relevance”, etc.

M step
Need: parameters (Θ) given hidden variables (c_i) and N data points, x_1, x_2, ..., x_N.
Q: what are the parameters of the model? (What do we need to learn?)

M step
Need: parameters (Θ) given hidden variables (c_i) and N data points, x_1, x_2, ..., x_N.
A: Θ = ⟨α_i, μ_i, Σ_i⟩, i = 1..k
With “soft” memberships c_i(x_j), the maximum-likelihood updates are membership-weighted averages:
α_i = (1/N) Σ_j c_i(x_j)
μ_i = Σ_j c_i(x_j) x_j / Σ_j c_i(x_j)
Σ_i = Σ_j c_i(x_j) (x_j − μ_i)(x_j − μ_i)^T / Σ_j c_i(x_j)

E step
Need: probability of the hidden variable (c_i) given fixed parameters (Θ) and observed data (x_1, ..., x_N). By Bayes’ rule:
c_i(x) = Pr[cluster i | x, Θ] = α_i N(x; μ_i, Σ_i) / Σ_{j=1..k} α_j N(x; μ_j, Σ_j)
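Putting the E and M steps together for the Gaussian mixture, as a numpy sketch. The random-responsibility initialization, the small ridge added to each covariance, and the fixed iteration count are my own simple choices, not prescriptions from the slides.

```python
import numpy as np

def em_gmm(X, k, n_iters=100, seed=0):
    """EM for a k-component Gaussian mixture on data X (shape N x d).
    Returns the parameters Theta = (alphas, mus, Sigmas) and the N x k
    responsibilities c_i(x) (cluster ownerships)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # init: random soft responsibilities, normalized per data point
    C = rng.random((N, k))
    C /= C.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # --- M step: membership-weighted MLE of alpha_i, mu_i, Sigma_i ---
        Nk = C.sum(axis=0)                        # effective count per component
        alphas = Nk / N
        mus = (C.T @ X) / Nk[:, None]
        Sigmas = []
        for i in range(k):
            diff = X - mus[i]
            cov = (C[:, i, None] * diff).T @ diff / Nk[i]
            Sigmas.append(cov + 1e-6 * np.eye(d))  # small ridge for numerical stability

        # --- E step: c_i(x) = alpha_i N(x; mu_i, Sigma_i) / sum_j alpha_j N(x; mu_j, Sigma_j) ---
        dens = np.empty((N, k))
        for i in range(k):
            diff = X - mus[i]
            inv = np.linalg.inv(Sigmas[i])
            maha = np.einsum('na,ab,nb->n', diff, inv, diff)
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigmas[i]))
            dens[:, i] = alphas[i] * np.exp(-0.5 * maha) / norm
        C = dens / dens.sum(axis=1, keepdims=True)

    return (alphas, mus, Sigmas), C
```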

Another example k=3 Gaussian clusters Different means, covariances Well separated

Restart
Problem: EM has found a “minimum energy” solution, but it’s only “locally” optimal.
B/c of a poor starting choice, it ended up in the wrong local optimum -- not the global optimum.
Default answer: pick a new random start and re-run.
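That default answer can be sketched as a small wrapper, assuming the em_gmm and mixture_pdf sketches above; the standard criterion is to keep the restart with the highest data log-likelihood, and the number of restarts here is arbitrary.

```python
import numpy as np

def em_with_restarts(X, k, n_restarts=10):
    """Run EM from several random starts and keep the highest-likelihood fit."""
    best, best_ll = None, -np.inf
    for seed in range(n_restarts):
        (alphas, mus, Sigmas), C = em_gmm(X, k, seed=seed)
        # data log-likelihood under this fit: sum_n log Pr[x_n | Theta]
        ll = sum(np.log(mixture_pdf(x, alphas, mus, Sigmas)) for x in X)
        if ll > best_ll:
            best, best_ll = ((alphas, mus, Sigmas), C), ll
    return best, best_ll
```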

Final example More Gaussians. How many clusters here?

Note...
Doesn’t always work out this well in practice. Sometimes the machine is smarter than humans, but usually, if it’s hard for us, it’s hard for the machine too...
First ~7-10 times I ran this one, it lost one cluster altogether (α_3 → 0).

Unresolved issues
Notice: different cluster IDs (colors) end up on different blobs of points in each run.
The answer is “unique only up to permutation”: I can swap around cluster IDs without changing the solution.
Can’t tell what the “right” cluster assignment is.

Unresolved issues
“Order” of the model -- i.e., what k should you use? Hard to know, in general.
Can just try a bunch and find one that “works best”.
Problem: the answer tends to get monotonically better w/ increasing k.
Best answer to date: Chinese restaurant process
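One quick way to see the “monotonically better” problem is to fit a range of k values and watch the training log-likelihood keep climbing. The sketch below uses scikit-learn’s GaussianMixture purely as a convenient stand-in for the EM code above, on made-up blob data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# three well-separated 2-D blobs (illustration data only)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

for k in range(1, 8):
    gm = GaussianMixture(n_components=k, n_init=5).fit(X)
    # average training log-likelihood tends to keep creeping up with k,
    # even past the true number of clusters (3 here)
    print(k, round(gm.score(X), 3))
```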