Machine Learning – Lecture 4


Machine Learning – Lecture 4: Mixture Models and EM
31.10.2013
Bastian Leibe, RWTH Aachen
http://www.vision.rwth-aachen.de/
leibe@vision.rwth-aachen.de
Many slides adapted from Bernt Schiele

Announcements
Exercise 1 available:
- Bayes decision theory
- Maximum Likelihood
- Kernel density estimation / k-NN
- Gaussian Mixture Models
- Submit your results to Ishrat by the evening of 10.11.
Exercise modalities:
- Please form teams of up to 3 people!
- Make it easy for Ishrat to correct your solutions: turn in everything as a single zip archive.
- Use the provided Matlab framework. For each exercise, you need to implement the corresponding apply function.
- If the screen output matches the expected output, you will get the points for the exercise; otherwise, no points.
B. Leibe

Course Outline
Fundamentals (2 weeks)
- Bayes Decision Theory
- Probability Density Estimation
Discriminative Approaches (5 weeks)
- Linear Discriminant Functions
- Support Vector Machines
- Ensemble Methods & Boosting
- Randomized Trees, Forests & Ferns
Generative Models (4 weeks)
- Bayesian Networks
- Markov Random Fields
B. Leibe

Recap: Maximum Likelihood Approach
Computation of the likelihood:
- Single data point: p(x_n|θ)
- Assumption: all data points are independent
- Log-likelihood
Estimation of the parameters θ (Learning):
- Maximize the likelihood (= minimize the negative log-likelihood)
- Take the derivative and set it to zero (see the equations below).
Slide credit: Bernt Schiele B. Leibe
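The equations referred to above, in their standard form (cf. Bishop, Ch. 2), for a parametric density p(x|θ) and i.i.d. data X = {x_1, ..., x_N}:

  L(\theta) = p(X \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta), \qquad
  E(\theta) = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x_n \mid \theta), \qquad
  \frac{\partial E(\theta)}{\partial \theta} \overset{!}{=} 0.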

Recap: Bayesian Learning Approach
Bayesian view:
- Consider the parameter vector θ as a random variable.
- When estimating the parameters, what we compute is p(x|X), the probability of x given the whole data set X.
- Assumption: given θ, this does not depend on X anymore; it is entirely determined by the parameter θ (i.e. by the parametric form of the pdf).
Slide adapted from Bernt Schiele B. Leibe

Recap: Bayesian Learning Approach
Discussion:
- The more uncertain we are about θ, the more we average over all possible parameter values.
- Likelihood of the parametric form θ given the data set X.
- Estimate for x based on the parametric form θ.
- Prior for the parameters θ.
- Normalization: integrate over all possible values of θ.
B. Leibe

Recap: Histograms
Basic idea: Partition the data space into distinct bins with widths Δ_i and count the number of observations, n_i, in each bin.
- Often, the same width is used for all bins, Δ_i = Δ.
- This can be done, in principle, for any dimensionality D…
- …but the required number of bins grows exponentially with D!
B. Leibe Image source: C.M. Bishop, 2006

Recap: Kernel Density Estimation (see Exercise 1.4)
Approximation formula: see below.
Kernel methods: place a kernel window k at location x and count how many data points fall inside it (fixed V, determine K).
K-Nearest Neighbor: increase the volume V until the K next data points are found (fixed K, determine V).
Slide adapted from Bernt Schiele B. Leibe
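The approximation formula referred to above, in its standard form (cf. the previous lecture and Bishop, Ch. 2.5):

  p(\mathbf{x}) \simeq \frac{K}{N V},

where N is the total number of data points, V the volume of the region around x, and K the number of points falling inside it. Fixing V and determining K gives kernel methods; fixing K and determining V gives K-nearest neighbors.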

Topics of This Lecture
Mixture distributions
- Mixture of Gaussians (MoG)
- Maximum Likelihood estimation attempt
K-Means Clustering
- Algorithm
- Applications
EM Algorithm
- Credit assignment problem
- MoG estimation
- Interpretation of K-Means
- Technical advice
B. Leibe

Mixture Distributions
A single parametric distribution is often not sufficient, e.g. for multimodal data. (Figure: a single Gaussian vs. a mixture of two Gaussians.)
B. Leibe Image source: C.M. Bishop, 2006

Mixture of Gaussians (MoG)
Sum of M individual Normal distributions. In the limit, every smooth distribution can be approximated this way (if M is large enough).
Slide credit: Bernt Schiele B. Leibe

Mixture of Gaussians
The mixture density is written out below.
Notes:
- The mixture density integrates to 1.
- The mixture parameters are the component means μ_j, covariances Σ_j, and priors π_j.
- p(x|θ_j): likelihood of measurement x given mixture component j.
- p(j) = π_j: prior of component j.
Slide adapted from Bernt Schiele B. Leibe
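The mixture density defined on this slide, in its standard form (cf. Bishop, Ch. 9):

  p(\mathbf{x} \mid \theta) = \sum_{j=1}^{M} p(\mathbf{x} \mid \theta_j)\, p(j)
  = \sum_{j=1}^{M} \pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j),
  \qquad \sum_{j=1}^{M} \pi_j = 1, \quad 0 \le \pi_j \le 1.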

Mixture of Gaussians (MoG): "Generative model"
(Figure: three mixture components (1, 2, 3) with their "weights", and the resulting mixture density.)
Slide credit: Bernt Schiele B. Leibe

Mixture of Multivariate Gaussians B. Leibe Image source: C.M. Bishop, 2006

Mixture of Multivariate Gaussians
Mixture weights / mixture coefficients: π_j, with π_1 + … + π_M = 1 and 0 ≤ π_j ≤ 1.
Parameters: θ = (π_1, μ_1, Σ_1, …, π_M, μ_M, Σ_M).
Slide credit: Bernt Schiele B. Leibe Image source: C.M. Bishop, 2006

Mixture of Multivariate Gaussians: "Generative model"
(Figure: samples drawn from a mixture of three multivariate Gaussian components, labeled 1, 2, 3.)
Slide credit: Bernt Schiele B. Leibe Image source: C.M. Bishop, 2006

Mixture of Gaussians – 1st Estimation Attempt
Maximum Likelihood: minimize the negative log-likelihood E(θ) = -ln L(θ).
Let's first look at μ_j. We can already see that this will be difficult, since the sum over components sits inside the logarithm, so the derivative with respect to μ_j couples all mixture components. This will cause problems!
Slide adapted from Bernt Schiele B. Leibe

Mixture of Gaussians – 1st Estimation Attempt
Minimization: setting the derivative ∂E/∂μ_j to zero, we obtain the equations below, in which γ_j(x_n) is the "responsibility" of component j for x_n.
B. Leibe
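The equations referred to above, in their standard form (cf. Bishop, Ch. 9):

  \gamma_j(\mathbf{x}_n) = p(j \mid \mathbf{x}_n)
  = \frac{\pi_j\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
         {\sum_{k=1}^{M} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)},
  \qquad
  \boldsymbol{\mu}_j = \frac{\sum_{n=1}^{N} \gamma_j(\mathbf{x}_n)\, \mathbf{x}_n}
                            {\sum_{n=1}^{N} \gamma_j(\mathbf{x}_n)}.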

Mixture of Gaussians – 1st Estimation Attempt
But… there is no direct analytical solution!
- The gradient is a complex function with non-linear mutual dependencies.
- The optimization of one Gaussian depends on all the other Gaussians!
- It is possible to apply iterative numerical optimization here, but in the following we will see a simpler method.
B. Leibe

Mixture of Gaussians – Other Strategy
Observed data: the samples x_n.
Unobserved data: which component (1 or 2) each sample came from.
Unobserved = "hidden variable" j|x: a 1/0 indicator per sample for membership in each component.
(Figure: 1-D data points with their component labels and the corresponding indicator values.)
Slide credit: Bernt Schiele B. Leibe

Mixture of Gaussians – Other Strategy
Assuming we knew the values of the hidden variable j (assumed known), we could run ML estimation for Gaussian #1 and for Gaussian #2 separately, each on the samples assigned to it.
(Figure: the same data points, split by their labels.)
Slide credit: Bernt Schiele B. Leibe

Mixture of Gaussians – Other Strategy
Assuming we knew the mixture components (assumed known), the Bayes decision rule gives the labels: decide j = 1 if p(j=1|x_n) > p(j=2|x_n).
Slide credit: Bernt Schiele B. Leibe

Mixture of Gaussians – Other Strategy
Chicken and egg problem – what comes first? We don't know either of them!
In order to break the loop, we need an estimate for j, e.g. by clustering…
Slide credit: Bernt Schiele B. Leibe

Clustering with Hard Assignments Let’s first look at clustering with “hard assignments” Slide credit: Bernt Schiele B. Leibe

Topics of This Lecture
Mixture distributions
- Mixture of Gaussians (MoG)
- Maximum Likelihood estimation attempt
K-Means Clustering
- Algorithm
- Applications
EM Algorithm
- Credit assignment problem
- MoG estimation
- Interpretation of K-Means
- Technical advice
B. Leibe

K-Means Clustering
Iterative procedure:
1. Initialization: pick K arbitrary centroids (cluster means).
2. Assign each sample to the closest centroid.
3. Adjust the centroids to be the means of the samples assigned to them.
4. Go to step 2 (until no change).
The algorithm is guaranteed to converge after a finite number of iterations, but only to a local optimum: the final result depends on the initialization. (A minimal implementation sketch follows below.)
Slide credit: Bernt Schiele B. Leibe
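As an illustration of the procedure, here is a minimal plain-Matlab sketch. The function name simple_kmeans is made up for this example and is not part of the provided exercise framework; it assumes implicit expansion (Matlab R2016b or later).

function [mu, idx] = simple_kmeans(X, K)
% Minimal K-means sketch. X: N x D data matrix, K: number of clusters.
N  = size(X, 1);
mu = X(randperm(N, K), :);              % step 1: pick K random samples as centroids
idx = zeros(N, 1);
for iter = 1:100
    % step 2: assign each sample to the closest centroid (squared distances, N x K)
    D2 = sum(X.^2, 2) + sum(mu.^2, 2)' - 2 * X * mu';
    [~, newIdx] = min(D2, [], 2);
    if isequal(newIdx, idx), break; end % no change -> converged
    idx = newIdx;
    % step 3: move each centroid to the mean of the samples assigned to it
    for k = 1:K
        if any(idx == k)
            mu(k, :) = mean(X(idx == k, :), 1);
        end
    end
end
end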

K-Means – Example with K=2 B. Leibe Image source: C.M. Bishop, 2006

K-Means Clustering
K-Means optimizes the objective function written out below, where r_nk is a binary indicator that sample x_n is assigned to cluster k. In practice, this procedure usually converges quickly to a local optimum.
B. Leibe Image source: C.M. Bishop, 2006
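The objective function in its standard form (cf. Bishop, Ch. 9):

  J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2,
  \qquad
  r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_l \|\mathbf{x}_n - \boldsymbol{\mu}_l\|^2, \\ 0 & \text{otherwise.} \end{cases}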

Example Application: Image Compression
Take each pixel as one data point, run K-Means clustering on the pixel colors, and set each pixel's color to its cluster mean.
B. Leibe Image source: C.M. Bishop, 2006

Example Application: Image Compression
(Figure: compression results for K = 2, K = 3, and K = 10, next to the original image.)
B. Leibe Image source: C.M. Bishop, 2006
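A rough sketch of this application (an illustration only; it assumes the kmeans function from the Statistics and Machine Learning Toolbox, imshow from the Image Processing Toolbox, and one of Matlab's bundled demo images):

% Compress an image by replacing each pixel with its cluster mean.
img    = double(imread('peppers.png')) / 255;   % any RGB image, values in [0,1]
pixels = reshape(img, [], 3);                   % one data point per pixel (RGB)
K      = 10;                                    % number of colors to keep
[idx, C]   = kmeans(pixels, K);                 % cluster the pixel colors
compressed = reshape(C(idx, :), size(img));     % set each pixel to its cluster mean
imshow(compressed);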

Summary K-Means
Pros:
- Simple, fast to compute.
- Converges to a local minimum of the within-cluster squared error.
Problem cases:
- Setting k?
- Sensitive to initial centers.
- Sensitive to outliers.
- Detects spherical clusters only.
Extensions:
- Speed-ups possible through efficient search structures.
- General distance measures: k-medoids.
Slide credit: Kristen Grauman B. Leibe

Topics of This Lecture
Mixture distributions
- Mixture of Gaussians (MoG)
- Maximum Likelihood estimation attempt
K-Means Clustering
- Algorithm
- Applications
EM Algorithm
- Credit assignment problem
- MoG estimation
- Interpretation of K-Means
- Technical advice
B. Leibe

EM Clustering
Clustering with "soft assignments": Expectation step of the EM algorithm.
(Figure: each data point carries a soft assignment to each component, e.g. 0.01, 0.2, 0.8, 0.99 for one Gaussian and the complementary values for the other.)
Slide credit: Bernt Schiele B. Leibe

EM Clustering
Clustering with "soft assignments": Maximization step of the EM algorithm. The Maximum Likelihood estimate of each component's parameters is computed from all samples, weighted by their soft assignments.
(Figure: the same soft assignment values as on the previous slide.)
Slide credit: Bernt Schiele B. Leibe

Credit Assignment Problem
If we are just given x, we don't know which mixture component this example came from. We can, however, evaluate the posterior probability that an observed x was generated from the first mixture component.
Slide credit: Bernt Schiele B. Leibe

Mixture Density Estimation Example
Assume we want to estimate a 2-component MoG model. If each sample in the training set were labeled j ∈ {1,2} according to which mixture component (1 or 2) had generated it, then the estimation would be easy. Labeled examples = no credit assignment problem.
Slide credit: Bernt Schiele B. Leibe

Mixture Density Estimation Example
When examples are labeled, we can estimate the Gaussians independently, using Maximum Likelihood estimation for single Gaussians.
Notation:
- Let l_i be the label for sample x_i.
- Let N be the number of samples.
- Let N_j be the number of samples labeled j.
Then for each j ∈ {1,2} we set the parameters as written out below.
Slide credit: Bernt Schiele B. Leibe
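The per-component ML estimates referred to above, in their standard single-Gaussian form:

  \pi_j = \frac{N_j}{N}, \qquad
  \boldsymbol{\mu}_j = \frac{1}{N_j} \sum_{i:\, l_i = j} \mathbf{x}_i, \qquad
  \boldsymbol{\Sigma}_j = \frac{1}{N_j} \sum_{i:\, l_i = j} (\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^{\mathrm{T}}.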

Mixture Density Estimation Example
Of course, we don't have such labels l_i… But we can guess what the labels might be based on our current mixture distribution estimate (credit assignment problem). We get soft labels, or posterior probabilities, of which Gaussian generated which example. When the Gaussians are almost identical (as in the figure), then p(j=1|x_i) ≈ p(j=2|x_i) ≈ 0.5 for almost any given sample x_i. Even small differences can help to determine how to update the Gaussians.
Slide credit: Bernt Schiele B. Leibe

EM Algorithm
Expectation-Maximization (EM) Algorithm:
- E-Step: softly assign samples to mixture components.
- M-Step: re-estimate the parameters (separately for each mixture component) based on the soft assignments; N̂_j = Σ_n γ_j(x_n) is the "soft number of samples" labeled j.
(A minimal implementation sketch follows below.)
Slide adapted from Bernt Schiele B. Leibe
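A minimal plain-Matlab sketch of EM for a Mixture of Gaussians (an illustration only, not the provided exercise framework; the function name simple_em_mog is made up here, and mvnpdf requires the Statistics and Machine Learning Toolbox):

function [pi_k, mu, Sigma] = simple_em_mog(X, K, numIter)
% X: N x D data matrix, K: number of mixture components.
[N, D] = size(X);
mu    = X(randperm(N, K), :);                 % initialization: random samples as means
Sigma = repmat(eye(D) * var(X(:)), [1 1 K]);  % broad isotropic covariances
pi_k  = ones(1, K) / K;                       % uniform priors
gamma = zeros(N, K);
for iter = 1:numIter
    % E-step: responsibilities gamma(n,k) = p(k | x_n)
    for k = 1:K
        gamma(:, k) = pi_k(k) * mvnpdf(X, mu(k, :), Sigma(:, :, k));
    end
    gamma = gamma ./ sum(gamma, 2);
    % M-step: re-estimate parameters from the soft assignments
    Nk = sum(gamma, 1);                       % "soft number of samples" per component
    for k = 1:K
        mu(k, :) = gamma(:, k)' * X / Nk(k);
        Xc = X - mu(k, :);
        Sigma(:, :, k) = (Xc' * (Xc .* gamma(:, k))) / Nk(k) ...
                         + 1e-6 * eye(D);     % small regularizer against singularities
    end
    pi_k = Nk / N;
end
end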

EM Algorithm – An Example B. Leibe Image source: C.M. Bishop, 2006

EM – Technical Advice
When implementing EM, we need to take care to avoid singularities in the estimation!
- Mixture components may collapse onto single data points.
- E.g. consider the isotropic case Σ_j = σ_j² I (this also holds in general). Assume component j is exactly centered on data point x_n. This data point will then contribute the term written out below to the likelihood function.
- For σ_j → 0, this term goes to infinity!
Need to introduce regularization:
- Enforce a minimum width for the Gaussians.
- E.g., instead of Σ^(-1), use (Σ + σ_min I)^(-1).
B. Leibe Image source: C.M. Bishop, 2006
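The collapsing term, written out for the isotropic case (standard argument, cf. Bishop, Sec. 9.2.1):

  \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j = \mathbf{x}_n, \sigma_j^2 \mathbf{I})
  = \frac{1}{(2\pi)^{D/2}\, \sigma_j^{D}}
  \;\to\; \infty \quad \text{as } \sigma_j \to 0.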

EM – Technical Advice (2)
EM is very sensitive to the initialization:
- It will converge to a local optimum of E.
- Convergence is relatively slow.
Initialize with k-Means to get better results!
- k-Means is itself initialized randomly and will also only find a local optimum, but its convergence is much faster.
Typical procedure:
1. Run k-Means M times (e.g. M = 10-100).
2. Pick the best result (lowest error J).
3. Use this result to initialize EM: set μ_j to the corresponding cluster mean from k-Means, and initialize Σ_j to the sample covariance of the associated data points.
(A short sketch of this initialization follows below.)
B. Leibe
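A short sketch of that initialization procedure (assumptions: X and K are already defined; kmeans is the Statistics and Machine Learning Toolbox function, whose third output holds the per-cluster sums of distances):

% Run k-Means several times, keep the best run, and derive EM start values from it.
bestJ = inf;  D = size(X, 2);
for m = 1:10                                  % e.g. M = 10 restarts
    [idx_m, C_m, sumd_m] = kmeans(X, K);
    if sum(sumd_m) < bestJ
        bestJ = sum(sumd_m);  idx = idx_m;  mu = C_m;
    end
end
Sigma = zeros(D, D, K);  pi_k = zeros(1, K);
for k = 1:K
    Sigma(:, :, k) = cov(X(idx == k, :));     % sample covariance of the cluster
    pi_k(k)        = mean(idx == k);          % fraction of samples in the cluster
end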

K-Means Clustering Revisited
Interpreting the procedure:
1. Initialization: pick K arbitrary centroids (cluster means).
2. Assign each sample to the closest centroid. (E-Step)
3. Adjust the centroids to be the means of the samples assigned to them. (M-Step)
4. Go to step 2 (until no change).
B. Leibe

K-Means Clustering Revisited
K-Means clustering essentially corresponds to Gaussian Mixture Model (MoG or GMM) estimation with EM whenever the covariances of the K Gaussians are set to Σ_j = σ²I, for some small, fixed σ² (see below).
Slide credit: Bernt Schiele B. Leibe
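Concretely (standard argument, cf. Bishop, Sec. 9.3.2): with Σ_j = σ²I the responsibilities become

  \gamma_j(\mathbf{x}_n)
  = \frac{\pi_j \exp\!\left(-\|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2 / 2\sigma^2\right)}
         {\sum_k \pi_k \exp\!\left(-\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2 / 2\sigma^2\right)}
  \;\to\;
  \begin{cases} 1 & \text{if } j = \arg\min_k \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2, \\ 0 & \text{otherwise,} \end{cases}
  \quad \text{as } \sigma^2 \to 0,

i.e. the soft assignments of EM turn into the hard assignments of K-Means.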

Summary: Gaussian Mixture Models
Properties:
- Very general, can represent any (continuous) distribution.
- Once trained, very fast to evaluate.
- Can be updated online.
Problems / Caveats:
- Some numerical issues in the implementation: need to apply regularization in order to avoid singularities.
- EM for MoG is computationally expensive, especially for high-dimensional problems! More computational overhead and slower convergence than k-Means.
- Results very sensitive to initialization: run k-Means for some iterations as initialization!
- Need to select the number of mixture components K: model selection problem (see Lecture 16).
B. Leibe

Topics of This Lecture
Mixture distributions
- Mixture of Gaussians (MoG)
- Maximum Likelihood estimation attempt
K-Means Clustering
- Algorithm
- Applications
EM Algorithm
- Credit assignment problem
- MoG estimation
- Interpretation of K-Means
- Technical advice
B. Leibe

Applications
Mixture models are used in many practical applications: wherever distributions with complex or unknown shapes need to be represented.
Popular application in Computer Vision: model distributions of pixel colors.
- Each pixel is one data point in, e.g., RGB space.
- Learn a MoG to represent the class-conditional densities.
- Use the learned models to classify other pixels.
B. Leibe Image source: C.M. Bishop, 2006

Application: Background Model for Tracking
Train a background MoG for each pixel:
- Model the "common" appearance variation for each background pixel.
- Initialization with an empty scene.
- Update the mixtures over time to adapt to lighting changes, etc.
Used in many vision-based tracking applications:
- Anything that cannot be explained by the background model is labeled as foreground (= object).
- Easy segmentation if the camera is fixed.
C. Stauffer, E. Grimson, Learning Patterns of Activity Using Real-Time Tracking, IEEE Trans. PAMI, 22(8):747-757, 2000.
B. Leibe Image Source: Daniel Roth, Tobias Jäggli

Application: Image Segmentation
User-assisted image segmentation:
- The user marks two regions for foreground and background.
- Learn a MoG model for the color values in each region.
- Use those models to classify all other pixels.
Simple segmentation procedure (building block for more complex applications).
B. Leibe

Application: Color-Based Skin Detection
- Collect training samples for skin/non-skin pixels.
- Estimate a MoG to represent the skin and non-skin densities.
- Classify skin-color pixels in novel images.
M. Jones and J. Rehg, Statistical Color Models with Application to Skin Detection, IJCV 2002.

Interested to Try It? Here's how you can access a webcam in Matlab (uses the Image Acquisition Toolbox; the adaptor name and video format are system-dependent, check imaqhwinfo for your machine):

function out = webcam
    adaptorName = 'winvideo';
    vidFormat   = 'I420_320x240';
    vidObj1 = videoinput(adaptorName, 1, vidFormat);  % create a video input object
    set(vidObj1, 'ReturnedColorSpace', 'rgb');        % return RGB frames
    set(vidObj1, 'FramesPerTrigger', 1);              % grab one frame per trigger
    out = vidObj1;
end

% Usage: grab a single frame
cam = webcam();
img = getsnapshot(cam);

B. Leibe

References and Further Reading
More information about EM and MoG estimation is available in Section 2.3.9 and the entire Chapter 9 of Bishop's book (recommended reading):
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
Additional information:
- Original EM paper: A.P. Dempster, N.M. Laird, D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, Vol. 39, 1977.
- EM tutorial: J.A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models", TR-97-021, ICSI, U.C. Berkeley, CA, USA.
B. Leibe