Machine Learning Expectation Maximization and Gaussian Mixtures CSE 473 Chapter 20.3.

Feedback in Learning
- Supervised learning: correct answers for each example
- Unsupervised learning: correct answers not given
- Reinforcement learning: occasional rewards

The problem of finding labels for unlabeled data
So far we have solved "supervised" classification problems where a teacher told us the label of each example. In nature, items often do not come with labels. How can we learn labels without a teacher? (Figure: labeled vs. unlabeled data; from Shadmehr & Diedrichsen)

Example: image segmentation
Identify pixels that are white matter, gray matter, or outside of the brain. (Figure: histogram of normalized pixel values with three clusters: outside the brain, gray matter, white matter; from Shadmehr & Diedrichsen)

Raw Proximity Sensor Data
Measured distances for an expected distance of 300 cm. (Panels: Sonar, Laser)

Gaussians --   Univariate  Multivariate

© D. Weld and D. Fox 8 Fitting a Gaussian PDF to Data (Figure panels: poor fit, good fit) From Russell

© D. Weld and D. Fox 9 Fitting a Gaussian PDF to Data
Suppose y = y_1, …, y_n, …, y_N is a set of N data values. Given a Gaussian PDF p with mean μ and variance σ², define the likelihood of the data as p(y | μ, σ) = ∏_n p(y_n | μ, σ). How do we choose μ and σ to maximise this probability? From Russell

© D. Weld and D. Fox 10 Maximum Likelihood Estimation
Define the best fitting Gaussian to be the one such that p(y | μ, σ) is maximised. Terminology: p(y | μ, σ), thought of as a function of y, is the probability (density) of y; p(y | μ, σ), thought of as a function of μ, σ, is the likelihood of μ, σ. Maximizing p(y | μ, σ) with respect to μ, σ is called Maximum Likelihood (ML) estimation of μ, σ. From Russell
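To make the "likelihood as a function of μ, σ" view concrete, here is a small illustrative Python/NumPy sketch (not from the slides); the data values and parameter choices are arbitrary.

```python
import numpy as np

def log_likelihood(y, mu, sigma):
    """log p(y | mu, sigma) for i.i.d. data y under a univariate Gaussian."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2))

y = np.array([2.1, 1.9, 2.4, 2.0, 1.6])
print(log_likelihood(y, mu=2.0, sigma=0.3))   # parameters that fit the data well
print(log_likelihood(y, mu=0.0, sigma=0.3))   # a much lower likelihood
```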

© D. Weld and D. Fox 11 ML estimation of μ, σ
Intuitively: the maximum likelihood estimate of μ should be the average value of y_1, …, y_N (the sample mean), and the maximum likelihood estimate of σ² should be the variance of y_1, …, y_N (the sample variance). This turns out to be true: p(y | μ, σ) is maximised by setting μ = (1/N) Σ_n y_n and σ² = (1/N) Σ_n (y_n − μ)². From Russell
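A minimal sketch (my own, not from the slides) of these ML estimates on synthetic data, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=1000)   # synthetic data, true mu=5, sigma=2

# ML estimates for a Gaussian: sample mean and (biased) sample variance
mu_ml = y.sum() / len(y)
var_ml = ((y - mu_ml) ** 2).sum() / len(y)

print(mu_ml, np.sqrt(var_ml))   # should be close to 5.0 and 2.0
```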

Mixtures
If our data is not labeled, we can hypothesize that:
1. There are exactly m classes in the data: y ∈ {1, …, m}
2. Each class y occurs with a specific frequency: P(y) = π_y (the mixing proportions)
3. Examples of class y are governed by a specific distribution: p(x | y)
According to our hypothesis, each example x(i) must have been generated from a specific "mixture" distribution: p(x) = Σ_y π_y p(x | y). We might hypothesize that the distributions are Gaussian: p(x | y) = N(x; μ_y, Σ_y), where the means and covariances are the parameters of the distributions.
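To illustrate this generative story, here is a rough Python/NumPy sketch (not from the slides) that samples from a hypothetical two-component univariate mixture and evaluates its density; all parameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-component univariate mixture (all values are illustrative)
pis    = np.array([0.3, 0.7])    # mixing proportions P(y)
mus    = np.array([-2.0, 3.0])   # class means
sigmas = np.array([0.5, 1.0])    # class standard deviations

def sample_mixture(n):
    """Generative process: pick a class y ~ P(y), then x ~ N(mu_y, sigma_y^2)."""
    ys = rng.choice(len(pis), size=n, p=pis)
    return rng.normal(mus[ys], sigmas[ys]), ys

def mixture_density(x):
    """p(x) = sum_y pi_y * N(x; mu_y, sigma_y^2)."""
    comps = np.exp(-(x[:, None] - mus) ** 2 / (2 * sigmas ** 2)) / (sigmas * np.sqrt(2 * np.pi))
    return comps @ pis

x, y = sample_mixture(5)
print(x)                      # unlabeled draws from the mixture
print(mixture_density(x))     # their densities under p(x)
```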

Graphical Representation of Gaussian Mixtures (figure: hidden class variable, measured variable y, class prior P(y))

Learning of mixture models

Learning Mixtures from Data
Consider a fixed K = 2, e.g., unknown parameters Θ = {μ_1, σ_1, μ_2, σ_2, α}. Given data D = {x_1, …, x_N}, we want to find the parameters Θ that "best fit" the data.

Early Attempts
Weldon's data: n = 1000 crabs from the Bay of Naples
- ratio of forehead to body length
- suspected existence of 2 separate species

Early Attempts
Karl Pearson, 1894:
- JRSS paper
- proposed a mixture of 2 Gaussians
- 5 parameters Θ = {μ_1, σ_1, μ_2, σ_2, α}
- parameter estimation -> method of moments
- involved solution of 9th order equations! (see Chapter 10, Stigler (1986), The History of Statistics)

“The solution of an equation of the ninth degree, where almost all powers, to the ninth, of the unknown quantity are existing, is, however, a very laborious task. Mr. Pearson has indeed possessed the energy to perform his heroic task…. But I fear he will have few successors…..” Charlier (1906)

Maximum Likelihood Principle (Fisher, 1922)
- assume a probabilistic model
- likelihood = p(data | parameters, model)
- find the parameters that make the data most likely

1977: The EM Algorithm (Dempster, Laird, and Rubin)
General framework for likelihood-based parameter estimation with missing data:
- start with initial guesses of the parameters
- E-step: estimate memberships given the parameters
- M-step: estimate the parameters given the memberships
- repeat until convergence
Converges to a (local) maximum of the likelihood. The E-step and M-step are often computationally simple. Generalizes to maximum a posteriori estimation (with priors).

© D. Weld and D. Fox 24 EM for Mixture of Gaussians
E-step: Compute the probability that point x_j was generated by component i:
  p_ij = w_i N(x_j; μ_i, Σ_i) / Σ_k w_k N(x_j; μ_k, Σ_k)
M-step: Compute new mean, covariance, and component weights:
  μ_i = Σ_j p_ij x_j / Σ_j p_ij
  Σ_i = Σ_j p_ij (x_j − μ_i)(x_j − μ_i)ᵀ / Σ_j p_ij
  w_i = (1/N) Σ_j p_ij
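For concreteness, here is a minimal sketch of these two steps for a univariate K-component mixture in Python/NumPy. It is illustrative code, not the slides' implementation, and the variable names (resp, Ni, ws) and toy data are my own assumptions.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def em_gmm(x, K, iters=100, seed=0):
    """EM for a univariate K-component Gaussian mixture (means, sigmas, weights)."""
    rng = np.random.default_rng(seed)
    mus = rng.choice(x, size=K)            # initial guesses: random data points
    sigmas = np.full(K, x.std())
    ws = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibility p_ij that point x_j came from component i
        dens = ws[:, None] * gaussian_pdf(x[None, :], mus[:, None], sigmas[:, None])  # (K, N)
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: re-estimate parameters from the soft memberships
        Ni = resp.sum(axis=1)                        # effective counts per component
        mus = (resp @ x) / Ni
        sigmas = np.sqrt((resp * (x[None, :] - mus[:, None]) ** 2).sum(axis=1) / Ni)
        ws = Ni / len(x)
    return mus, sigmas, ws

# Toy data: two well-separated clusters
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])
print(em_gmm(data, K=2))
```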

Example (figure): Anemia Group vs. Control Group

Mixture Density: How can we determine the model parameters?

Raw Sensor Data
Measured distances for an expected distance of 300 cm. (Panels: Sonar, Laser)

Approximation Results (figure panels: Sonar and Laser, at 300 cm and 400 cm)

© Daniel S. Weld 37 Hidden Variables (Figure: a Bayes net with a hidden disease variable and observed symptoms.) But we can't observe the disease variable. Can't we learn without it?

© Daniel S. Weld 38 We could, but we'd get a fully-connected network with 708 parameters (vs. 78). Much harder to learn!

© Daniel S. Weld 39 Chicken & Egg Problem
If we knew that a training instance (patient) had the disease, it would be easy to learn P(symptom | disease); but we can't observe the disease, so we don't. If we knew the parameters, e.g. P(symptom | disease), then it'd be easy to estimate whether the patient had the disease; but we don't know these parameters.

© Daniel S. Weld 40 Expectation Maximization (EM) (high-level version)
- Pretend we do know the parameters: initialize them randomly
- [E step] Compute the probability of each instance having each possible value of the hidden variable
- [M step] Treating each instance as fractionally having both values, compute the new parameter values
- Iterate until convergence!
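As a rough illustration of this "fractional counts" idea for a discrete hidden variable, here is a sketch (not from the slides) of EM for a naive Bayes model with a hidden binary disease variable and three observed symptoms; all names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a hidden binary "disease" variable and 3 observed binary
# symptoms, i.e. a naive Bayes model whose class is never observed.
true_p_d = 0.3
true_p_s_given_d = np.array([[0.1, 0.2, 0.05],    # P(symptom_k = 1 | disease = 0)
                             [0.8, 0.7, 0.60]])   # P(symptom_k = 1 | disease = 1)
d = rng.random(1000) < true_p_d
S = (rng.random((1000, 3)) < true_p_s_given_d[d.astype(int)]).astype(float)

# EM with the disease variable hidden: only the symptom matrix S is observed.
p_d = 0.5                                  # initial guesses
p_s = rng.uniform(0.3, 0.7, size=(2, 3))
for _ in range(200):
    # E-step: P(disease | symptoms) for every instance (the "fractional" membership)
    lik = np.stack([np.prod(np.where(S == 1, p_s[c], 1 - p_s[c]), axis=1) for c in (0, 1)], axis=1)
    joint = lik * np.array([1 - p_d, p_d])
    resp = joint[:, 1] / joint.sum(axis=1)          # P(disease = 1 | symptoms)
    # M-step: re-estimate parameters from fractional counts
    p_d = resp.mean()
    p_s[1] = (resp[:, None] * S).sum(axis=0) / resp.sum()
    p_s[0] = ((1 - resp)[:, None] * S).sum(axis=0) / (1 - resp).sum()

print(p_d)        # typically near 0.3 (or 0.7 if the two hidden labels swap)
print(p_s)
```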