EM and model selection CS8690 Computer Vision University of Missouri at Columbia Missing variable problems In many vision problems, if some variables.

Slides:



Advertisements
Similar presentations
Mixture Models and the EM Algorithm
Advertisements

Unsupervised Learning
Clustering Beyond K-means
Computer Vision - A Modern Approach Set: Probability in segmentation Slides by D.A. Forsyth Missing variable problems In many vision problems, if some.
K Means Clustering , Nearest Cluster and Gaussian Mixture
Segmentation and Fitting Using Probabilistic Methods
Machine Learning and Data Mining Clustering
Hidden Variables, the EM Algorithm, and Mixtures of Gaussians Computer Vision CS 143, Brown James Hays 02/22/11 Many slides from Derek Hoiem.
Hidden Variables, the EM Algorithm, and Mixtures of Gaussians Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem 03/15/12.
Visual Recognition Tutorial
Content Based Image Clustering and Image Retrieval Using Multiple Instance Learning Using Multiple Instance Learning Xin Chen Advisor: Chengcui Zhang Department.
Overview Full Bayesian Learning MAP learning
EE 290A: Generalized Principal Component Analysis Lecture 6: Iterative Methods for Mixture-Model Segmentation Sastry & Yang © Spring, 2011EE 290A, University.
Lecture 6 Image Segmentation
Motion Detection And Analysis Michael Knowles Tuesday 13 th January 2004.
Resampling techniques Why resampling? Jacknife Cross-validation Bootstrap Examples of application of bootstrap.
First introduced in 1977 Lots of mathematical derivation Problem : given a set of data (data is incomplete or having missing values). Goal : assume the.
Region Segmentation. Find sets of pixels, such that All pixels in region i satisfy some constraint of similarity.
Announcements  Project proposal is due on 03/11  Three seminars this Friday (EB 3105) Dealing with Indefinite Representations in Pattern Recognition.
Computer Vision - A Modern Approach Set: Fitting Slides by D.A. Forsyth Fitting Choose a parametric object/some objects to represent a set of tokens Most.
Problem Sets Problem Set 3 –Distributed Tuesday, 3/18. –Due Thursday, 4/3 Problem Set 4 –Distributed Tuesday, 4/1 –Due Tuesday, 4/15. Probably a total.
Segmentation Divide the image into segments. Each segment:
Announcements Project 2 more signup slots questions Picture taking at end of class.
Prénom Nom Document Analysis: Data Analysis and Clustering Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.
Clustering Color/Intensity
Clustering.
Expectation Maximization Algorithm
Image Segmentation Some slides: courtesy of O. Capms, Penn State, J.Ponce and D. Fortsyth, Computer Vision Book.
Some slides and illustrations from D. Forsyth, …
Fitting. Choose a parametric object/some objects to represent a set of tokens Most interesting case is when criterion is not local –can’t tell whether.
Visual Recognition Tutorial
Tracking with Linear Dynamic Models. Introduction Tracking is the problem of generating an inference about the motion of an object given a sequence of.
Clustering with Bregman Divergences Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh Presented by Rohit Gupta CSci 8980: Machine Learning.
Lecture 10: Robust fitting CS4670: Computer Vision Noah Snavely.
Computer Vision - A Modern Approach Set: Segmentation Slides by D.A. Forsyth Segmentation and Grouping Motivation: not information is evidence Obtain a.
Image Segmentation Image segmentation is the operation of partitioning an image into a collection of connected sets of pixels. 1. into regions, which usually.
Methods in Medical Image Analysis Statistics of Pattern Recognition: Classification and Clustering Some content provided by Milos Hauskrecht, University.
Segmentation Techniques Luis E. Tirado PhD qualifying exam presentation Northeastern University.
EM and expected complete log-likelihood Mixture of Experts
CSC321: Neural Networks Lecture 12: Clustering Geoffrey Hinton.
CPSC 502, Lecture 15Slide 1 Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 16 Nov, 3, 2011 Slide credit: C. Conati, S.
CHAPTER 7: Clustering Eick: K-Means and EM (modified Alpaydin transparencies and new transparencies added) Last updated: February 25, 2014.
ECE 8443 – Pattern Recognition ECE 8423 – Adaptive Signal Processing Objectives: ML and Simple Regression Bias of the ML Estimate Variance of the ML Estimate.
MACHINE LEARNING 8. Clustering. Motivation Based on E ALPAYDIN 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2  Classification problem:
Made by: Maor Levy, Temple University Many data points, no labels 2.
Lecture 6 Spring 2010 Dr. Jianjun Hu CSCE883 Machine Learning.
EECS 274 Computer Vision Model Fitting. Fitting Choose a parametric object/some objects to represent a set of points Three main questions: –what object.
Flat clustering approaches
Visual Tracking by Cluster Analysis Arthur Pece Department of Computer Science University of Copenhagen
Model fitting ECE 847: Digital Image Processing Stan Birchfield Clemson University.
Advanced Artificial Intelligence Lecture 8: Advance machine learning.
Hidden Variables, the EM Algorithm, and Mixtures of Gaussians Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem 02/22/11.
Machine Learning Expectation Maximization and Gaussian Mixtures CSE 473 Chapter 20.3.
Computational Intelligence: Methods and Applications Lecture 26 Density estimation, Expectation Maximization. Włodzisław Duch Dept. of Informatics, UMK.
Machine Learning Expectation Maximization and Gaussian Mixtures CSE 473 Chapter 20.3.
Gaussian Mixture Model classification of Multi-Color Fluorescence In Situ Hybridization (M-FISH) Images Amin Fazel 2006 Department of Computer Science.
Unsupervised Learning Part 2. Topics How to determine the K in K-means? Hierarchical clustering Soft clustering with Gaussian mixture models Expectation-Maximization.
K-Means Segmentation.
Fitting.
Classification of unlabeled data:
Latent Variables, Mixture Models and EM
Advanced Artificial Intelligence
Hidden Markov Models Part 2: Algorithms
Computer Vision - A Modern Approach
Bayesian Models in Machine Learning
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
Missing variable problems
EM Algorithm and its Applications
Calibration and homographies
Presentation transcript:

EM and model selection

CS8690 Computer Vision University of Missouri at Columbia Missing variable problems In many vision problems, if some variables were known the maximum likelihood inference problem would be easy –fitting; if we knew which line each token came from, it would be easy to determine line parameters –segmentation; if we knew the segment each pixel came from, it would be easy to determine the segment parameters –fundamental matrix estimation; if we knew which feature corresponded to which, it would be easy to determine the fundamental matrix –etc. This sort of thing happens in statistics, too In many vision problems, if some variables were known the maximum likelihood inference problem would be easy –fitting; if we knew which line each token came from, it would be easy to determine line parameters –segmentation; if we knew the segment each pixel came from, it would be easy to determine the segment parameters –fundamental matrix estimation; if we knew which feature corresponded to which, it would be easy to determine the fundamental matrix –etc. This sort of thing happens in statistics, too

CS8690 Computer Vision University of Missouri at Columbia Missing variable problems Strategy –estimate appropriate values for the missing variables –plug these in, now estimate parameters –re-estimate appropriate values for missing variables, continue eg –guess which line gets which point –now fit the lines –now reallocate points to lines, using our knowledge of the lines –now refit, etc. We’ve seen this line of thought before (k means) Strategy –estimate appropriate values for the missing variables –plug these in, now estimate parameters –re-estimate appropriate values for missing variables, continue eg –guess which line gets which point –now fit the lines –now reallocate points to lines, using our knowledge of the lines –now refit, etc. We’ve seen this line of thought before (k means)

CS8690 Computer Vision University of Missouri at Columbia Missing variables - strategy We have a problem with parameters, missing variables This suggests: Iterate until convergence –replace missing variable with expected values, given fixed values of parameters –fix missing variables, choose parameters to maximize likelihood given fixed values of missing variables We have a problem with parameters, missing variables This suggests: Iterate until convergence –replace missing variable with expected values, given fixed values of parameters –fix missing variables, choose parameters to maximize likelihood given fixed values of missing variables e.g., iterate till convergence –allocate each point to a line with a weight, which is the probability of the point given the line –refit lines to the weighted set of points Converges to local extremum Somewhat more general form is available

CS8690 Computer Vision University of Missouri at Columbia K-Means Choose a fixed number of clusters Choose cluster centers and point-cluster allocations to minimize error can’t do this by search, because there are too many possible allocations. Choose a fixed number of clusters Choose cluster centers and point-cluster allocations to minimize error can’t do this by search, because there are too many possible allocations. Algorithm –fix cluster centers; allocate points to closest cluster –fix allocation; compute best cluster centers x could be any set of features for which we can compute a distance (careful about scaling) * From Marc Pollefeys COMP

CS8690 Computer Vision University of Missouri at Columbia K-Means

CS8690 Computer Vision University of Missouri at Columbia K-Means * From Marc Pollefeys COMP

CS8690 Computer Vision University of Missouri at Columbia Image Segmentation by K-Means Select a value of K Select a feature vector for every pixel (color, texture, position, or combination of these etc.) Define a similarity measure between feature vectors (Usually Euclidean Distance). Apply K-Means Algorithm. Apply Connected Components Algorithm. Merge any components of size less than some threshold to an adjacent component that is most similar to it. Select a value of K Select a feature vector for every pixel (color, texture, position, or combination of these etc.) Define a similarity measure between feature vectors (Usually Euclidean Distance). Apply K-Means Algorithm. Apply Connected Components Algorithm. Merge any components of size less than some threshold to an adjacent component that is most similar to it. * From Marc Pollefeys COMP

CS8690 Computer Vision University of Missouri at Columbia K-Means Is an approximation to EM –Model (hypothesis space): Mixture of N Gaussians –Latent variables: Correspondence of data and Gaussians We notice: –Given the mixture model, it’s easy to calculate the correspondence –Given the correspondence it’s easy to estimate the mixture models Is an approximation to EM –Model (hypothesis space): Mixture of N Gaussians –Latent variables: Correspondence of data and Gaussians We notice: –Given the mixture model, it’s easy to calculate the correspondence –Given the correspondence it’s easy to estimate the mixture models

CS8690 Computer Vision University of Missouri at Columbia Generalized K-Means (EM)

CS8690 Computer Vision University of Missouri at Columbia Generalized K-Means Converges! Proof [Neal/Hinton, McLachlan/Krishnan] : –E/M step does not decrease data likelihood –Converges at saddle point Converges! Proof [Neal/Hinton, McLachlan/Krishnan] : –E/M step does not decrease data likelihood –Converges at saddle point

CS8690 Computer Vision University of Missouri at Columbia Idea Data generated from mixture of Gaussians Latent variables: Correspondence between Data Items and Gaussians Data generated from mixture of Gaussians Latent variables: Correspondence between Data Items and Gaussians

CS8690 Computer Vision University of Missouri at Columbia Learning a Gaussian Mixture (with known covariance) M-Step E-Step

CS8690 Computer Vision University of Missouri at Columbia In general we may imagine that an image comprises L segments (labels) Within segment l the pixels (feature vectors) have a probability distribution represented by represents the parameters of the data in segment l –Mean and variance of the greylevels –Mean vector and covariance matrix of the colours –Texture parameters 14 The Expectation/Maximization (EM) algorithm

CS8690 Computer Vision University of Missouri at Columbia 15 The Expectation/Maximization (EM) algorithm

CS8690 Computer Vision University of Missouri at Columbia The Expectation/Maximization (EM) algorithm Once again a chicken and egg problem arises –If we knew then we could obtain a labelling for each by simply choosing that label which maximizes –If we knew the label for each we could obtain by using a simple maximum likelihood estimator The EM algorithm is designed to deal with this type of problem but it frames it slightly differently –It regards segmentation as a missing (or incomplete) data estimation problem 16

CS8690 Computer Vision University of Missouri at Columbia The Expectation/Maximization (EM) algorithm The incomplete data are just the measured pixel greylevels or feature vectors –We can define a probability distribution of the incomplete data as The complete data are the measured greylevels or feature vectors plus a mapping function which indicates the labelling of each pixel –Given the complete data (pixels plus labels) we can easily work out estimates of the parameters –But from the incomplete data no closed form solution exists

CS8690 Computer Vision University of Missouri at Columbia The Expectation/Maximization (EM) algorithm Once again we resort to an iterative strategy and hope that we get convergence The algorithm is as follows: 18 Initialize an estimate of Repeat Step 1: (E step) Obtain an estimate of the labels based on the current parameter estimates Step 2: (M step) Update the parameter estimates based on the current labelling Until Convergence

CS8690 Computer Vision University of Missouri at Columbia The Expectation/Maximization (EM) algorithm A recent approach to applying EM to image segmentation is to assume the image pixels or feature vectors follow a mixture model –Generally we assume that each component of the mixture model is a Gaussian –A Gaussian mixture model (GMM) 19

CS8690 Computer Vision University of Missouri at Columbia The Expectation/Maximization (EM) algorithm Our parameter space for our distribution now includes the mean vectors and covariance matrices for each component in the mixture plus the mixing weights We choose a Gaussian for each component because the ML estimate of each parameter in the E-step becomes linear 20

CS8690 Computer Vision University of Missouri at Columbia The Expectation/Maximization (EM) algorithm Define a posterior probability as the probability that pixel j belongs to region l given the value of the feature vector Using Bayes rule we can write the following equation This actually is the E-step of our EM algorithm as allows us to assign probabilities to each label at each pixel 21

CS8690 Computer Vision University of Missouri at Columbia The Expectation/Maximization (EM) algorithm The M step simply updates the parameter estimates using MLL estimation 22

CS8690 Computer Vision University of Missouri at Columbia Figure from “Color and Texture Based Image Segmentation Using EM and Its Application to Content Based Image Retrieval”,S.J. Belongie et al., Proc. Int. Conf. Computer Vision, 1998, c1998, IEEE Segmentation with EM

CS8690 Computer Vision University of Missouri at Columbia EM Clustering: Results

CS8690 Computer Vision University of Missouri at Columbia Lines and robustness We have one line, and n points Some come from the line, some from “noise” This is a mixture model: We have one line, and n points Some come from the line, some from “noise” This is a mixture model: We wish to determine –line parameters –p(comes from line)

CS8690 Computer Vision University of Missouri at Columbia Estimating the mixture model Introduce a set of hidden variables, , one for each point. They are one when the point is on the line, and zero when off. If these are known, the negative log-likelihood becomes (the line’s parameters are  c): Introduce a set of hidden variables, , one for each point. They are one when the point is on the line, and zero when off. If these are known, the negative log-likelihood becomes (the line’s parameters are  c): Here K is a normalising constant, kn is the noise intensity (we’ll choose this later).

CS8690 Computer Vision University of Missouri at Columbia Substituting for delta We shall substitute the expected value of , for a given  recall  =( , c, ) E(  _i)=1. P(  _i=1|  ) We shall substitute the expected value of , for a given  recall  =( , c, ) E(  _i)=1. P(  _i=1|  ) Notice that if kn is small and positive, then if distance is small, this value is close to 1 and if it is large, close to zero

CS8690 Computer Vision University of Missouri at Columbia Algorithm for line fitting Obtain some start point Now compute  ’s using formula above Now compute maximum likelihood estimate of – , c come from fitting to weighted points – comes by counting Obtain some start point Now compute  ’s using formula above Now compute maximum likelihood estimate of – , c come from fitting to weighted points – comes by counting Iterate to convergence

CS8690 Computer Vision University of Missouri at Columbia

CS8690 Computer Vision University of Missouri at Columbia The expected values of the deltas at the maximum (notice the one value close to zero).

CS8690 Computer Vision University of Missouri at Columbia Closeup of the fit

CS8690 Computer Vision University of Missouri at Columbia Choosing parameters What about the noise parameter, and the sigma for the line? –several methods from first principles knowledge of the problem (seldom really possible) play around with a few examples and choose (usually quite effective, as precise choice doesn’t matter much) –notice that if kn is large, this says that points very seldom come from noise, however far from the line they lie usually biases the fit, by pushing outliers into the line rule of thumb; its better to fit to the better fitting points, within reason; if this is hard to do, then the model could be a problem What about the noise parameter, and the sigma for the line? –several methods from first principles knowledge of the problem (seldom really possible) play around with a few examples and choose (usually quite effective, as precise choice doesn’t matter much) –notice that if kn is large, this says that points very seldom come from noise, however far from the line they lie usually biases the fit, by pushing outliers into the line rule of thumb; its better to fit to the better fitting points, within reason; if this is hard to do, then the model could be a problem

CS8690 Computer Vision University of Missouri at Columbia Other examples Segmentation –a segment is a gaussian that emits feature vectors (which could contain colour; or colour and position; or colour, texture and position). –segment parameters are mean and (perhaps) covariance –if we knew which segment each point belonged to, estimating these parameters would be easy –rest is on same lines as fitting line Segmentation –a segment is a gaussian that emits feature vectors (which could contain colour; or colour and position; or colour, texture and position). –segment parameters are mean and (perhaps) covariance –if we knew which segment each point belonged to, estimating these parameters would be easy –rest is on same lines as fitting line Fitting multiple lines –rather like fitting one line, except there are more hidden variables –easiest is to encode as an array of hidden variables, which represent a table with a one where the i’th point comes from the j’th line, zeros otherwise –rest is on same lines as above

CS8690 Computer Vision University of Missouri at Columbia Issues with EM Local maxima –can be a serious nuisance in some problems –no guarantee that we have reached the “right” maximum Starting –k means to cluster the points is often a good idea Local maxima –can be a serious nuisance in some problems –no guarantee that we have reached the “right” maximum Starting –k means to cluster the points is often a good idea

CS8690 Computer Vision University of Missouri at Columbia Local maximum

CS8690 Computer Vision University of Missouri at Columbia which is an excellent fit to some points

CS8690 Computer Vision University of Missouri at Columbia and the deltas for this maximum

CS8690 Computer Vision University of Missouri at Columbia A dataset that is well fitted by four lines

CS8690 Computer Vision University of Missouri at Columbia Result of EM fitting, with one line (or at least, one available local maximum).

CS8690 Computer Vision University of Missouri at Columbia Result of EM fitting, with two lines (or at least, one available local maximum).

CS8690 Computer Vision University of Missouri at Columbia Seven lines can produce a rather logical answer

CS8690 Computer Vision University of Missouri at Columbia Motion segmentation with EM Model image pair (or video sequence) as consisting of regions of parametric motion –affine motion is popular Now we need to –determine which pixels belong to which region –estimate parameters Model image pair (or video sequence) as consisting of regions of parametric motion –affine motion is popular Now we need to –determine which pixels belong to which region –estimate parameters Likelihood –assume Straightforward missing variable problem, rest is calculation

CS8690 Computer Vision University of Missouri at Columbia Three frames from the MPEG “flower garden” sequence Figure from “Representing Images with layers,”, by J. Wang and E.H. Adelson, IEEE Transactions on Image Processing, 1994, c 1994, IEEE

CS8690 Computer Vision University of Missouri at Columbia Grey level shows region no. with highest probability Segments and motion fields associated with them Figure from “Representing Images with layers,”, by J. Wang and E.H. Adelson, IEEE Transactions on Image Processing, 1994, c 1994, IEEE

CS8690 Computer Vision University of Missouri at Columbia If we use multiple frames to estimate the appearance of a segment, we can fill in occlusions; so we can re-render the sequence with some segments removed. Figure from “Representing Images with layers,”, by J. Wang and E.H. Adelson, IEEE Transactions on Image Processing, 1994, c 1994, IEEE

CS8690 Computer Vision University of Missouri at Columbia Some generalities Many, but not all problems that can be attacked with EM can also be attacked with RANSAC –need to be able to get a parameter estimate with a manageably small number of random choices. –RANSAC is usually better Many, but not all problems that can be attacked with EM can also be attacked with RANSAC –need to be able to get a parameter estimate with a manageably small number of random choices. –RANSAC is usually better Didn’t present in the most general form –in the general form, the likelihood may not be a linear function of the missing variables –in this case, one takes an expectation of the likelihood, rather than substituting expected values of missing variables –Issue doesn’t seem to arise in vision applications.

CS8690 Computer Vision University of Missouri at Columbia Model Selection We wish to choose a model to fit to data –e.g. is it a line or a circle? –e.g is this a perspective or orthographic camera? –e.g. is there an aeroplane there or is it noise? We wish to choose a model to fit to data –e.g. is it a line or a circle? –e.g is this a perspective or orthographic camera? –e.g. is there an aeroplane there or is it noise? Issue –In general, models with more parameters will fit a dataset better, but are poorer at prediction –This means we can’t simply look at the negative log-likelihood (or fitting error)

CS8690 Computer Vision University of Missouri at Columbia Top is not necessarily a better fit than bottom (actually, almost always worse)

CS8690 Computer Vision University of Missouri at Columbia

CS8690 Computer Vision University of Missouri at Columbia We can discount the fitting error with some term in the number of parameters in the model.

CS8690 Computer Vision University of Missouri at Columbia Discounts AIC (an information criterion) –choose model with smallest value of –p is the number of parameters AIC (an information criterion) –choose model with smallest value of –p is the number of parameters BIC (Bayes information criterion) –choose model with smallest value of –N is the number of data points Minimum description length –same criterion as BIC, but derived in a completely different way

CS8690 Computer Vision University of Missouri at Columbia Cross-validation Split data set into two pieces, fit to one, and compute negative log- likelihood on the other Average over multiple different splits Choose the model with the smallest value of this average Split data set into two pieces, fit to one, and compute negative log- likelihood on the other Average over multiple different splits Choose the model with the smallest value of this average The difference in averages for two different models is an estimate of the difference in KL divergence of the models from the source of the data

CS8690 Computer Vision University of Missouri at Columbia Model averaging Very often, it is smarter to use multiple models for prediction than just one e.g. motion capture data –there are a small number of schemes that are used to put markers on the body –given we know the scheme S and the measurements D, we can estimate the configuration of the body X Very often, it is smarter to use multiple models for prediction than just one e.g. motion capture data –there are a small number of schemes that are used to put markers on the body –given we know the scheme S and the measurements D, we can estimate the configuration of the body X We want If it is obvious what the scheme is from the data, then averaging makes little difference If it isn’t, then not averaging underestimates the variance of X --- we think we have a more precise estimate than we do.