The EM algorithm for Mixture of Gaussians & Classification with Generative models
Jakob Verbeek, December 2, 2011
Course website:

Clustering with k-means and MoG

Assignment:
– K-means: hard assignment, discontinuity at the cluster border
– MoG: soft assignment, 50/50 assignment at the midpoint between clusters

Cluster representation:
– K-means: center only
– MoG: center, covariance matrix, mixing weight

If all covariance matrices are constrained to be $\epsilon I$ and $\epsilon \to 0$, then the EM algorithm reduces to the k-means algorithm.

For both k-means and MoG clustering:
– The number of clusters needs to be fixed in advance
– Results depend on initialization; no optimal learning algorithm is known
– Can be generalized to other types of distances or densities

Mixture of Gaussian (MoG) density

The mixture density is a weighted sum of Gaussians:
$p(x) = \sum_{k=1}^{K} \pi_k \, N(x \mid \mu_k, \Sigma_k)$
– Mixing weight $\pi_k$: importance of each cluster

The density has to integrate to 1, so we require
$\sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \geq 0$
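
A minimal NumPy sketch of evaluating this density; the names (mog_density, weights, means, covs) and the 2-component example are illustrative, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mog_density(x, weights, means, covs):
    """Evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k) at a single point x."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

# Example: a 2-component mixture in 2D; weights must be non-negative and sum to 1.
weights = [0.3, 0.7]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
print(mog_density(np.array([1.0, 1.0]), weights, means, covs))
```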

Clustering with Gaussian mixture density

Given: a data set of N points $x_n$, n = 1, …, N. Find the mixture of Gaussians (MoG) that best explains the data:
– Maximize the log-likelihood of the fixed data set X with respect to the parameters of the MoG
– Assume data points are drawn independently from the MoG, so $\log p(X) = \sum_{n=1}^{N} \log p(x_n)$

MoG clustering is very similar to k-means clustering:
– In addition to the centers it also represents the cluster shape: the covariance matrix
– Also an iterative algorithm to find the parameters
– Also sensitive to the initialization of the parameters

Maximum likelihood estimation of MoG

There is no simple closed-form solution as in the case of a single Gaussian. Use the EM algorithm:
– Initialize the MoG: parameters or soft-assignments
– E-step: soft-assign data points to clusters
– M-step: update the cluster parameters
– Repeat the EM steps, terminate when converged (convergence of the parameters or of the assignments)

E-step: compute the posterior on $z_n$ given $x_n$:
$q_{nk} = p(z_n = k \mid x_n) = \frac{\pi_k \, N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j} \pi_j \, N(x_n \mid \mu_j, \Sigma_j)}$

M-step: update the Gaussians from the data points, weighted by the posterior:
$\pi_k = \frac{1}{N}\sum_n q_{nk}, \quad \mu_k = \frac{\sum_n q_{nk}\, x_n}{\sum_n q_{nk}}, \quad \Sigma_k = \frac{\sum_n q_{nk}\,(x_n - \mu_k)(x_n - \mu_k)^\top}{\sum_n q_{nk}}$
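
A compact NumPy sketch of these E- and M-steps; purely illustrative (the initialization strategy, the small regularization added to the covariances, and the fixed iteration count are assumptions, not from the slides).

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, K, n_iter=100, seed=0):
    """Fit a K-component Gaussian mixture to data X (N x D) with EM."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialize: K random data points as means, shared data covariance, uniform weights.
    means = X[rng.choice(N, K, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    weights = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities q_nk = p(z_n = k | x_n).
        resp = np.column_stack([
            weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
            for k in range(K)])
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update weights, means, covariances from the weighted data points.
        Nk = resp.sum(axis=0)                      # effective number of points per cluster
        weights = Nk / N
        means = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            # Small diagonal term keeps the covariance non-singular.
            covs[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return weights, means, covs, resp
```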

Maximum likelihood estimation of MoG: example of several EM iterations (figures not shown).

Entropy of a distribution

Entropy captures the uncertainty in a distribution: $H(p) = -\sum_x p(x) \log_2 p(x)$
– Maximum for the uniform distribution
– Minimum, zero, for a delta peak on a single value

Connection to information coding (noiseless coding theorem, Shannon 1948):
– Frequent messages get short codes; the optimal code length for an outcome with probability p is (at least) $-\log_2 p$ bits
– Entropy: the expected code length

Suppose a uniform distribution over 8 outcomes: 3-bit code words.
Suppose the distribution 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64: entropy 2 bits!
Code words: 0, 10, 110, 1110, 111100, 111101, 111110, 111111.
Codewords are "self-delimiting": a codeword either starts with four ones and then has length 6, or it stops after the first 0.

(Figure: example of a low-entropy vs. a high-entropy distribution.)
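
The 2-bit entropy claimed above can be checked directly; a small sketch using the values from the slide:

```python
import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
entropy = -np.sum(p * np.log2(p))   # expected code length in bits
print(entropy)                      # 2.0
print(-np.log2(p))                  # optimal code lengths: 1, 2, 3, 4, 6, 6, 6, 6
```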

Kullback-Leibler divergence

An asymmetric dissimilarity between distributions: $D(q \,\|\, p) = \sum_x q(x) \log \frac{q(x)}{p(x)}$
– Minimum, zero, if the distributions are equal
– Maximum, infinity, if p has a zero where q is non-zero

Interpretation in coding theory:
– Sub-optimality when messages are distributed according to q, but coded with codeword lengths derived from p
– D(q||p) is the difference of the two expected code lengths
– Suppose distribution q: 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64
– Coding with a uniform 3-bit code, i.e. p = uniform
– Expected code length using p: 3 bits
– Optimal expected code length: entropy H(q) = 2 bits
– KL divergence D(q||p) = 1 bit
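
The 1-bit value from this example can also be checked numerically, assuming the definition above with base-2 logarithms:

```python
import numpy as np

q = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
p = np.full(8, 1/8)              # uniform distribution, i.e. a 3-bit code
kl = np.sum(q * np.log2(q / p))  # expected extra bits when coding q with p's code
print(kl)                        # 1.0
```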

EM bound on the log-likelihood

Define the Gaussian mixture p(x) as the marginal of a joint distribution over data and cluster assignment: $p(x) = \sum_z p(x, z)$, with $p(x, z = k) = \pi_k \, N(x \mid \mu_k, \Sigma_k)$

Posterior distribution on the latent cluster assignment: $p(z = k \mid x) = \frac{\pi_k \, N(x \mid \mu_k, \Sigma_k)}{p(x)}$

Let q(z) be an arbitrary distribution over the cluster assignment. Bound the log-likelihood by subtracting the KL divergence D(q(z) || p(z|x)):
$\log p(x) \;\ge\; \log p(x) - D\big(q(z)\,\|\,p(z \mid x)\big) = \mathcal{L}(q, \theta)$
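
Written out, the bound and its gap to the true log-likelihood are (a standard decomposition; the symbol $\mathcal{L}(q,\theta)$ for the bound is a notational choice made here):

$$\log p(x) \;=\; \underbrace{\sum_z q(z)\log\frac{p(x,z)}{q(z)}}_{\mathcal{L}(q,\theta)} \;+\; \underbrace{\sum_z q(z)\log\frac{q(z)}{p(z\mid x)}}_{D(q(z)\,\|\,p(z\mid x)) \;\ge\; 0}$$

Since the KL term is non-negative, $\mathcal{L}(q,\theta) \le \log p(x)$ for any q, with equality when $q(z) = p(z \mid x)$.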

Maximizing the EM bound on the log-likelihood

E-step: fix the model parameters, update the distributions $q_n$ to make the bound tight
– The KL divergence is zero if the distributions are equal
– Thus set $q_n(z_n) = p(z_n \mid x_n)$
– After updating the $q_n$, the bound equals the true log-likelihood

Maximizing the EM bound on the log-likelihood

M-step: fix the $q_n$, update the model parameters. The bound is
$\mathcal{L} = \sum_n \sum_k q_{nk}\big[\log \pi_k + \log N(x_n \mid \mu_k, \Sigma_k)\big] - \sum_n \sum_k q_{nk} \log q_{nk}$
The terms for each Gaussian are decoupled from the rest!

Maximizing the EM bound on the log-likelihood

Derive the optimal values for the mixing weights:
– Maximize $\sum_n \sum_k q_{nk} \log \pi_k$
– Take into account that the weights sum to one: define $\pi_1 = 1 - \sum_{j>1} \pi_j$
– Take the derivative with respect to each mixing weight $\pi_j$, j > 1, and set it to zero (see the worked step below)
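
Carrying the derivative through (a worked version of the last step; the second term comes from $\partial \pi_1 / \partial \pi_j = -1$):

$$\frac{\partial}{\partial \pi_j}\sum_{n,k} q_{nk}\log\pi_k \;=\; \frac{\sum_n q_{nj}}{\pi_j} - \frac{\sum_n q_{n1}}{\pi_1} \;=\; 0$$

The ratio $\frac{\sum_n q_{nk}}{\pi_k}$ is therefore the same for every component, so $\pi_k \propto \sum_n q_{nk}$; normalizing with $\sum_{n,k} q_{nk} = N$ gives $\pi_k^{*} = \frac{1}{N}\sum_{n=1}^{N} q_{nk}$.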

Maximizing the EM bound on the log-likelihood

Derive the optimal values for the Gaussian parameters:
– For each Gaussian, maximize $\sum_n q_{nk} \log N(x_n \mid \mu_k, \Sigma_k)$
– Compute the gradients with respect to $\mu_k$ and $\Sigma_k$ and set them to zero to find the optimal parameters (results below)
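
Setting those gradients to zero gives the familiar weighted-average updates (a standard result, stated here for completeness):

$$\mu_k^{*} = \frac{\sum_n q_{nk}\, x_n}{\sum_n q_{nk}}, \qquad \Sigma_k^{*} = \frac{\sum_n q_{nk}\,(x_n - \mu_k^{*})(x_n - \mu_k^{*})^\top}{\sum_n q_{nk}}$$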

EM bound on the log-likelihood

$\mathcal{L}(q, \theta)$ is a bound on the data log-likelihood for any distribution q. EM is iterative coordinate ascent on this bound:
– E-step: optimize q, which makes the bound tight
– M-step: optimize the parameters

Classification

Given: training images and their categories. What are the categories of these test images?
(Example categories in the figure: apple, pear, tomato, cow, dog, horse.)

Classification

The goal is to predict, for a test data input, the corresponding class label.
– Data input x: e.g. an image, but it could be anything; the format may be a vector or something else
– Class label y: takes one out of two or more discrete values

In binary classification we often refer to one class as "positive" and the other as "negative".

Classifier: a function f(x) that assigns a class to x, or probabilities over the classes.
Training data: pairs (x, y) of inputs x and corresponding class labels y.
Learning a classifier: determine the function f(x), from some family of functions, based on the available training data.

The classifier partitions the input space into regions where data is assigned to a given class.
– The specific form of these decision boundaries depends on the family of classifiers used

Discriminative vs. generative methods

Generative probabilistic methods:
– Model the density of inputs x from each class, p(x|y)
– Estimate the class prior probability p(y)
– Use Bayes' rule to infer the distribution over classes given the input

Discriminative (probabilistic) methods:
– Directly estimate the class probability given the input: p(y|x)
– Some methods have no probabilistic interpretation, e.g. they fit a function f(x) and assign to class 1 if f(x) > 0 and to class 2 if f(x) < 0

Generative classification methods

Generative probabilistic methods:
– Model the density of inputs x from each class, p(x|y)
– Estimate the class prior probability p(y)
– Use Bayes' rule to infer the distribution over classes given the input

1. Select a model class:
– Parametric models: Gaussian (for continuous data), Bernoulli (for binary data), …
– Semi-parametric models: mixtures of Gaussians / Bernoullis / …
– Non-parametric models: histograms, nearest-neighbor methods, …

2. Estimate the parameters of the density of each class to obtain p(x|class)
– E.g. run EM to learn a Gaussian mixture on the data of each class

3. Estimate the prior probability of each class
– If a data point is equally likely under each class, assign it to the a priori most probable class
– The prior probability may differ from the relative number of available training examples!

A code sketch of these steps follows below.
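
A minimal sketch of steps 1–3 using a single Gaussian per class; the names fit_generative and predict_proba are illustrative, and a mixture fitted with EM could be plugged in for p(x|class) in the same way.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_generative(X, y):
    """Fit p(x|class) as one Gaussian per class and p(class) as empirical frequencies."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    models = {c: (X[y == c].mean(axis=0), np.cov(X[y == c].T)) for c in classes}
    return classes, priors, models

def predict_proba(x, classes, priors, models):
    """Bayes' rule: p(class|x) is proportional to p(x|class) * p(class)."""
    scores = np.array([priors[c] * multivariate_normal.pdf(x, *models[c]) for c in classes])
    return scores / scores.sum()
```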

Generative classification methods

Generative probabilistic methods:
– Model the density of inputs x from each class, p(x|y)
– Estimate the class prior probability p(y)
– Use Bayes' rule to predict classes given the input

Given the class-conditional models, classification is trivial: just apply Bayes' rule
– Compute p(x|class) for each class
– Multiply by the class prior probability
– Normalize to obtain the class probabilities:
$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}$

Adding a new class amounts to adding a new class-conditional model
– Existing class-conditional models stay as they are
– Just estimate p(x | new class) from the training examples of the new class
– Plug the new class model into Bayes' rule when predicting the class

Generative classification methods

Three-class example in 2D with a parametric model:
– A single Gaussian model per class
(Figure: the class-conditional densities p(x|class) and the resulting posteriors p(class|x).)

Histogram methods

Suppose we
– have N data points
– use a histogram with C cells

How do we set the density level in each cell?
– Maximum likelihood estimator
– Proportional to the number of points n in the cell
– Inversely proportional to the volume V of the cell
– This gives $p = \frac{n}{N\,V}$ (see the sketch below)

Problems with the histogram method:
– The number of cells scales exponentially with the dimension of the data
– Discontinuous density estimate
– How to choose the cell size?
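
A sketch of the maximum-likelihood histogram estimate p ≈ n / (N·V) in 1D; the bin count, the clipping of out-of-range queries to the boundary bins, and the example data are arbitrary choices made here.

```python
import numpy as np

def histogram_density(x_query, data, n_bins=20):
    """Density level in each cell: (points in cell) / (total points * cell volume)."""
    counts, edges = np.histogram(data, bins=n_bins)
    volume = edges[1] - edges[0]                        # 1D cell volume = bin width
    idx = np.clip(np.digitize(x_query, edges) - 1, 0, n_bins - 1)
    return counts[idx] / (len(data) * volume)

data = np.random.default_rng(0).normal(size=1000)
print(histogram_density(np.array([0.0, 2.0]), data))
```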

The 'curse of dimensionality'

The number of bins increases exponentially with the dimensionality of the data.
– Fine division of each dimension: many empty bins
– Rough division of each dimension: poor density model

The number of parameters can be reduced by assuming independence between the dimensions of x: the naïve Bayes model.
– E.g. for the histogram model: we estimate one histogram per dimension
– Still $C^D$ cells, but only $D \times C$ parameters to estimate instead of $C^D$

The model is "naïve" since it assumes that all variables are independent, which is unrealistic for high-dimensional data, where variables tend to be dependent.
– Typically a poor density estimator for p(x|y)
– Classification performance may still be good when using the derived p(y|x)

The same idea applies to other distributions, e.g. the multivariate Gaussian: instead of a full covariance matrix with $D^2$ parameters, we estimate one variance per dimension.

Example of a naïve Bayes model

Hand-written digit classification
– Input: binary 28×28 scanned digit images, collected in a 784-dimensional vector
– Desired output: class label of the image

Generative model over 28×28 pixel images ($2^{784}$ possible images)
– Independent Bernoulli model for each class
– One probability per pixel per class: $p(x \mid y = c) = \prod_{d=1}^{784} \theta_{cd}^{\,x_d} (1 - \theta_{cd})^{1 - x_d}$
– The maximum likelihood estimator of $\theta_{cd}$ is the average value of pixel d over the training images of class c

Classify using Bayes' rule: $p(y = c \mid x) \propto p(y = c)\, p(x \mid y = c)$
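
A sketch of this independent-Bernoulli model (per-pixel probabilities per class, ML estimate = per-pixel mean); the small clipping constant used to avoid log(0) is an assumption, not from the slides.

```python
import numpy as np

def fit_bernoulli_nb(X, y, eps=1e-3):
    """X: N x 784 binary images, y: class labels. Returns per-class pixel probabilities and log priors."""
    classes = np.unique(y)
    theta = np.array([X[y == c].mean(axis=0) for c in classes])  # ML estimate: mean per pixel per class
    theta = np.clip(theta, eps, 1 - eps)                         # avoid log(0)
    log_prior = np.log([np.mean(y == c) for c in classes])
    return classes, theta, log_prior

def predict(X, classes, theta, log_prior):
    """Bayes' rule with log p(x|c) = sum_d [x_d log theta_cd + (1 - x_d) log(1 - theta_cd)]."""
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    return classes[np.argmax(log_lik + log_prior, axis=1)]
```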

k-nearest-neighbor estimation method

Idea: put a cell around the test sample for which we want to know p(x)
– Fix the number of samples in the cell, find the right cell size

The probability of finding a point in a sphere A centered on $x_0$ with volume v is $P = \int_A p(x)\, dx$.
A smooth density is approximately constant in a small region, and thus $P \approx p(x_0)\, v$.
Alternatively, estimate P from the fraction of training data in A:
– With N data points in total and k of them in the sphere A: $P \approx k / N$
Combine the above to obtain the estimate $p(x_0) \approx \frac{k}{N\, v}$
– The same formula as used in the histogram method to set the density level in each cell
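
A sketch of the estimate p(x₀) ≈ k / (N·v), using the distance to the k-th nearest neighbour and the sphere-volume formula given on the next slide; the Euclidean distance and the example data are assumptions made here.

```python
import numpy as np
from scipy.special import gamma

def knn_density(x0, data, k):
    """Estimate p(x0) as k / (N * volume of the smallest sphere around x0 containing k points)."""
    N, d = data.shape
    r = np.sort(np.linalg.norm(data - x0, axis=1))[k - 1]   # radius to the k-th nearest neighbour
    volume = np.pi ** (d / 2) / gamma(d / 2 + 1) * r ** d
    return k / (N * volume)

data = np.random.default_rng(0).normal(size=(500, 2))
print(knn_density(np.zeros(2), data, k=10))
```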

k-nearest-neighbor estimation method

Method in practice:
– Choose k
– For a given x, compute the volume v of the sphere that contains k samples
– Estimate the density with $p(x) \approx \frac{k}{N\, v}$

The volume of a sphere with radius r in d dimensions is $v = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}\, r^d$

What effect does k have?
– (Figure: data sampled from a mixture of Gaussians, true density plotted in green)
– Larger k: larger region, smoother estimate
– Selection of k is typically done by cross-validation

k-nearest-neighbor classification rule

Use k-nearest-neighbor density estimation to find p(x|category), then apply Bayes' rule for classification:
– Find the sphere volume v that captures k data points, for the estimate $p(x) \approx \frac{k}{N\, v}$
– Use the same sphere for each class c, for the estimates $p(x \mid c) \approx \frac{k_c}{N_c\, v}$
– Estimate the class prior probabilities as $p(c) = \frac{N_c}{N}$
– Calculate the class posterior distribution: $p(c \mid x) = \frac{p(x \mid c)\, p(c)}{p(x)} \approx \frac{k_c}{k}$
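
A sketch of the resulting rule: with the same sphere for all classes, the posterior reduces to p(c|x) = k_c / k, i.e. a vote among the k nearest neighbours. The Euclidean distance and the simple argmax tie-breaking are assumptions made here.

```python
import numpy as np

def knn_classify(x, X_train, y_train, k):
    """Posterior p(c|x) = k_c / k: fraction of the k nearest training points belonging to class c."""
    idx = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    classes, counts = np.unique(y_train[idx], return_counts=True)
    posterior = dict(zip(classes, counts / k))
    return classes[np.argmax(counts)], posterior
```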

k-nearest-neighbor classification rule

Effect of k on the classification boundary:
– A larger number of neighbors gives larger regions and smoother class boundaries

Pros:
– Very simple: just set k and choose a distance measure, no learning
– Generic: applies to almost anything, as long as you have a distance

Cons: very costly for large training data sets
– Need to store all the data (memory)
– Need to compute distances to all the data (time)

Summary of generative classification methods

(Semi-)parametric models, e.g. p(data | class) is a Gaussian or a mixture of Gaussians:
– Pros: no need to store the training data, just the class-conditional models
– Cons: may fit the data poorly, and might therefore lead to poor classification results

Non-parametric models:
– Advantage: flexibility, no assumption on the shape of the data distribution
– Histograms: only practical in low-dimensional spaces (< 5 dimensions or so); in high-dimensional spaces there are exponentially many cells, most of which will be empty; use naïve Bayes modeling in higher-dimensional cases
– K-nearest neighbor and kernel density estimation: expensive to store all training data (memory), and to compute the nearest neighbors or the points with non-zero kernel evaluation (computation)

References

A good and very accessible book that covers all machine learning aspects of the course:
– Pattern Recognition and Machine Learning, Chris Bishop, Springer, 2006
– Buy it if you are interested in machine learning!

For clustering with k-means and mixtures of Gaussians, read:
– Section
– Chapter 9, except
– Optionally, Section 1.6 on information theory

For non-parametric classification, read:
– Section 2.5, except 2.5.1