Machine Learning – Lecture 4


1 Machine Learning – Lecture 4
Mixture Models and EM. Bastian Leibe, RWTH Aachen. Many slides adapted from B. Schiele.

2 Announcements Exercise 1 available Exercise modalities
Exercise 1 available, covering Bayes decision theory, Maximum Likelihood, kernel density estimation / k-NN, and Gaussian Mixture Models. Submit your results to Ishrat until the evening of … Exercise modalities: please form teams of up to 3 people! Make it easy for Ishrat to correct your solutions: turn in everything as a single zip archive and use the provided Matlab framework. For each exercise, you need to implement the corresponding apply function. If the screen output matches the expected output, you will get the points for the exercise; otherwise, no points. B. Leibe

3 Course Outline Fundamentals (2 weeks)
Bayes Decision Theory Probability Density Estimation Discriminative Approaches (5 weeks) Linear Discriminant Functions Support Vector Machines Ensemble Methods & Boosting Randomized Trees, Forests & Ferns Generative Models (4 weeks) Bayesian Networks Markov Random Fields B. Leibe

4 Recap: Maximum Likelihood Approach
Computation of the likelihood. Single data point: p(xn | θ). Assumption: all data points are independent. Log-likelihood. Estimation of the parameters θ (learning): maximize the likelihood (= minimize the negative log-likelihood) → take the derivative and set it to zero. Slide credit: Bernt Schiele B. Leibe
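The slide's own equations are not reproduced in the transcript; as a reminder, the standard Maximum Likelihood setup (in Bishop's notation, with θ the parameter vector) reads:

    L(\theta) = p(X \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta),
    \qquad
    E(\theta) = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x_n \mid \theta),
    \qquad
    \frac{\partial E(\theta)}{\partial \theta} \stackrel{!}{=} 0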

5 Recap: Bayesian Learning Approach
Bayesian view: consider the parameter vector θ as a random variable. When estimating the parameters, what we compute is p(x | X). Assumption: given θ, this doesn't depend on X anymore; it is entirely determined by the parameter θ (i.e. by the parametric form of the pdf). Slide adapted from Bernt Schiele B. Leibe

6 Recap: Bayesian Learning Approach
Discussion: the more uncertain we are about θ, the more we average over all possible parameter values. The terms involved are: the likelihood of the parametric form θ given the data set X; the estimate for x based on the parametric form θ; the prior for the parameters θ; and the normalization, which integrates over all possible values of θ. B. Leibe
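The integral behind these annotations is not shown in the transcript; written out (standard Bayesian density estimation, matching the terms named above):

    p(x \mid X) = \int p(x \mid \theta)\, p(\theta \mid X)\, d\theta,
    \qquad
    p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta')\, p(\theta')\, d\theta'}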

7 Recap: Histograms Basic idea:
Partition the data space into distinct bins with widths Δi and count the number of observations, ni, in each bin. Often, the same width is used for all bins, Δi = Δ. This can be done, in principle, for any dimensionality D… …but the required number of bins grows exponentially with D! B. Leibe Image source: C.M. Bishop, 2006

8 Recap: Kernel Density Estimation
see Exercise 1.4. Approximation formula (see below). Kernel methods: place a kernel window k at location x and count how many data points K fall inside it (fixed V, determine K). K-Nearest Neighbor: increase the volume V until the K nearest data points are found (fixed K, determine V). Slide adapted from Bernt Schiele B. Leibe
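The approximation formula referred to above is not reproduced in the transcript; the standard form (as in Bishop, Ch. 2.5) is

    p(x) \approx \frac{K}{N V},

where N is the total number of data points, V the volume of the region around x, and K the number of points falling inside it. Kernel methods fix V and determine K; K-nearest-neighbor methods fix K and determine V.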

9 Topics of This Lecture Mixture distributions K-Means Clustering
Mixture of Gaussians (MoG) Maximum Likelihood estimation attempt K-Means Clustering Algorithm Applications EM Algorithm Credit assignment problem MoG estimation Interpretation of K-Means Technical advice B. Leibe

10 Mixture Distributions
A single parametric distribution is often not sufficient, e.g. for multimodal data (figure: a single Gaussian vs. a mixture of two Gaussians). B. Leibe Image source: C.M. Bishop, 2006

11 Mixture of Gaussians (MoG)
Sum of M individual Normal distributions In the limit, every smooth distribution can be approximated this way (if M is large enough) Slide credit: Bernt Schiele B. Leibe

12 Likelihood of measurement x given mixture component j
Mixture of Gaussians: a weighted sum of Normal components, where p(x|θj) is the likelihood of measurement x given mixture component j and p(j) is the prior of component j. Notes: the mixture density integrates to 1, and the mixture parameters consist of the component parameters together with the component priors. Slide adapted from Bernt Schiele B. Leibe
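Written out in a standard formulation (the transcript omits the slide's equations; the notation θj, p(j) follows the text above), the mixture density and its constraints are:

    p(x \mid \theta) = \sum_{j=1}^{M} p(x \mid \theta_j)\, p(j),
    \qquad
    \sum_{j=1}^{M} p(j) = 1, \quad 0 \le p(j) \le 1,

with p(x | θj) = N(x | μj, σj²) in the one-dimensional case and mixture parameters θ = (μ1, σ1, p(1), …, μM, σM, p(M)).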

13 Mixture of Gaussians (MoG)
“Generative model” (figure: the “weight” of each mixture component, the individual mixture components 1–3, and the resulting mixture density). Slide credit: Bernt Schiele B. Leibe

14 Mixture of Multivariate Gaussians
B. Leibe Image source: C.M. Bishop, 2006

15 Mixture of Multivariate Gaussians
Mixture weights / mixture coefficients πj, which sum to 1 and each lie in [0, 1]. Parameters: the component means, covariances, and mixture weights. Slide credit: Bernt Schiele B. Leibe Image source: C.M. Bishop, 2006

16 Mixture of Multivariate Gaussians
“Generative model” (figure: the three mixture components, 1–3). Slide credit: Bernt Schiele B. Leibe Image source: C.M. Bishop, 2006

17 Mixture of Gaussians – 1st Estimation Attempt
Maximum Likelihood: minimize the negative log-likelihood. Let's first look at μj. We can already see that this will be difficult, since the log-likelihood contains a sum over all components inside the logarithm. This will cause problems! Slide adapted from Bernt Schiele B. Leibe

18 Mixture of Gaussians – 1st Estimation Attempt
Minimization: setting the derivative with respect to μj to zero, we thus obtain an update in which each data point xn is weighted by the “responsibility” of component j for xn. B. Leibe
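For reference (the slide's equations are not in the transcript), the standard result of this derivation, e.g. Bishop Ch. 9.2, is

    \frac{\partial E}{\partial \mu_j} = 0
    \;\Rightarrow\;
    \mu_j = \frac{\sum_{n=1}^{N} \gamma_j(x_n)\, x_n}{\sum_{n=1}^{N} \gamma_j(x_n)},
    \qquad
    \gamma_j(x_n) = \frac{\pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{M} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}.

Note that μj appears on both sides (inside γj), which is exactly why there is no closed-form solution.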

19 Mixture of Gaussians – 1st Estimation Attempt
But… there is no direct analytical solution! The gradient is a complex function (non-linear mutual dependencies), and the optimization of one Gaussian depends on all other Gaussians. It is possible to apply iterative numerical optimization here, but in the following, we will see a simpler method. B. Leibe

20 Mixture of Gaussians – Other Strategy
Observed data: the data points xn. Unobserved data: which mixture component j generated each data point. Unobserved = “hidden variable”: j|x. Slide credit: Bernt Schiele B. Leibe

21 Mixture of Gaussians – Other Strategy
Assuming we knew the values of the hidden variable j (assumed known), we could estimate the parameters by ML for Gaussian #1 and ML for Gaussian #2 separately. Slide credit: Bernt Schiele B. Leibe

22 Mixture of Gaussians – Other Strategy
Assuming we knew the mixture components (assumed known), we could assign each data point with the Bayes decision rule: decide j = 1 if p(j = 1 | xn) > p(j = 2 | xn). Slide credit: Bernt Schiele B. Leibe

23 Mixture of Gaussians – Other Strategy
Chicken-and-egg problem: what comes first? We don't know either of them! In order to break the loop, we need an estimate for j, e.g. by clustering… Slide credit: Bernt Schiele B. Leibe

24 Clustering with Hard Assignments
Let’s first look at clustering with “hard assignments” Slide credit: Bernt Schiele B. Leibe

25 Topics of This Lecture Mixture distributions K-Means Clustering
Mixture of Gaussians (MoG) Maximum Likelihood estimation attempt K-Means Clustering Algorithm Applications EM Algorithm Credit assignment problem MoG estimation Interpretation of K-Means Technical advice B. Leibe

26 K-Means Clustering Iterative procedure
1. Initialization: pick K arbitrary centroids (cluster means). 2. Assign each sample to the closest centroid. 3. Adjust the centroids to be the means of the samples assigned to them. 4. Go to step 2 (until no change). The algorithm is guaranteed to converge after a finite number of iterations, but only to a local optimum: the final result depends on the initialization. Slide credit: Bernt Schiele B. Leibe
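A minimal MATLAB sketch of this procedure (illustrative only, not the exercise framework; the function and variable names are my own, and MATLAB's Statistics Toolbox ships its own kmeans):

    function [mu, idx] = kmeans_sketch(X, K, maxIter)
    % Minimal K-Means sketch: X is N-by-D, K the number of clusters.
    N = size(X, 1);
    perm = randperm(N);
    mu = X(perm(1:K), :);                 % 1. pick K arbitrary centroids
    for iter = 1:maxIter
        % 2. assign each sample to the closest centroid
        dist = zeros(N, K);
        for j = 1:K
            diffs = X - repmat(mu(j, :), N, 1);
            dist(:, j) = sum(diffs.^2, 2);
        end
        [~, idx] = min(dist, [], 2);
        % 3. adjust the centroids to be the means of their assigned samples
        muOld = mu;
        for j = 1:K
            if any(idx == j)
                mu(j, :) = mean(X(idx == j, :), 1);
            end
        end
        % 4. stop when the centroids no longer change
        if isequal(mu, muOld)
            break;
        end
    end
    end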

27 K-Means – Example with K=2
B. Leibe Image source: C.M. Bishop, 2006

28 K-Means Clustering K-Means optimizes the following objective function:
where the binary indicator variables encode the hard assignment of each data point to its closest cluster mean. In practice, this procedure usually converges quickly to a local optimum. B. Leibe Image source: C.M. Bishop, 2006
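The objective referred to above, written in Bishop's notation (the slide's equation is not reproduced in the transcript):

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert x_n - \mu_k \rVert^2,
    \qquad
    r_{nk} =
    \begin{cases}
    1 & \text{if } k = \arg\min_{k'} \lVert x_n - \mu_{k'} \rVert^2 \\
    0 & \text{otherwise}
    \end{cases}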

29 Example Application: Image Compression
Take each pixel as one data point, run K-Means clustering in color space, and set each pixel's color to its cluster mean. B. Leibe Image source: C.M. Bishop, 2006
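A hypothetical usage example reusing the kmeans_sketch function sketched above (assuming the Image Processing Toolbox for imshow and the bundled peppers.png demo image):

    img = double(imread('peppers.png')) / 255;    % H-by-W-by-3 RGB image in [0,1]
    X = reshape(img, [], 3);                      % take each pixel as one data point
    [mu, idx] = kmeans_sketch(X, 10, 50);         % K = 10 colour clusters
    compressed = reshape(mu(idx, :), size(img));  % set each pixel to its cluster mean
    imshow(compressed);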

30 Example Application: Image Compression
Figure panels: K = 2, K = 3, K = 10, and the original image. B. Leibe Image source: C.M. Bishop, 2006

31 Summary K-Means Pros Problem cases Extensions Simple, fast to compute
Converges to local minimum of within-cluster squared error Problem cases Setting k? Sensitive to initial centers Sensitive to outliers Detects spherical clusters only Extensions Speed-ups possible through efficient search structures General distance measures: k-medoids Slide credit: Kristen Grauman B. Leibe

32 Topics of This Lecture Mixture distributions K-Means Clustering
Mixture of Gaussians (MoG) Maximum Likelihood estimation attempt K-Means Clustering Algorithm Applications EM Algorithm Credit assignment problem MoG estimation Interpretation of K-Means Technical advice B. Leibe

33 EM Clustering Clustering with “soft assignments”
Expectation step of the EM algorithm Slide credit: Bernt Schiele B. Leibe

34 Maximum Likelihood estimate
EM Clustering Clustering with “soft assignments” Maximization step of the EM algorithm Maximum Likelihood estimate Slide credit: Bernt Schiele B. Leibe

35 Credit Assignment Problem
If we are just given x, we don't know which mixture component this example came from. We can, however, evaluate the posterior probability that an observed x was generated from the first mixture component. Slide credit: Bernt Schiele B. Leibe
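This posterior is just Bayes' rule applied to the mixture components (the slide's equation is not in the transcript; πk and N(·) denote the mixture weights and Gaussian components):

    p(j = 1 \mid x)
    = \frac{p(x \mid j = 1)\, p(j = 1)}{\sum_{k} p(x \mid j = k)\, p(j = k)}
    = \frac{\pi_1\, \mathcal{N}(x \mid \mu_1, \Sigma_1)}{\sum_{k} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}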

36 Mixture Density Estimation Example
Assume we want to estimate a 2-component MoG model. If each sample in the training set were labeled j ∈ {1,2} according to which mixture component (1 or 2) had generated it, then the estimation would be easy. Labeled examples = no credit assignment problem. Slide credit: Bernt Schiele B. Leibe

37 Mixture Density Estimation Example
When examples are labeled, we can estimate the Gaussians independently, using Maximum Likelihood estimation for single Gaussians. Notation: let li be the label for sample xi, let N be the number of samples, and let Nj be the number of samples labeled j. Then for each j ∈ {1,2} we set the component parameters from the samples labeled j (see below). Slide credit: Bernt Schiele B. Leibe
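The per-component ML estimates referred to above, in standard single-Gaussian form (the slide's equations are not reproduced in the transcript):

    N_j = \left|\{\, i : l_i = j \,\}\right|, \qquad
    \pi_j = \frac{N_j}{N}, \qquad
    \mu_j = \frac{1}{N_j} \sum_{i:\, l_i = j} x_i, \qquad
    \Sigma_j = \frac{1}{N_j} \sum_{i:\, l_i = j} (x_i - \mu_j)(x_i - \mu_j)^{\mathsf{T}}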

38 Mixture Density Estimation Example
Of course, we don't have such labels li… But we can guess what the labels might be based on our current mixture distribution estimate (credit assignment problem). We get soft labels, or posterior probabilities, of which Gaussian generated which example. When the Gaussians are almost identical (as in the figure), both posteriors are close to 0.5 for almost any given sample xi. → Even small differences can help to determine how to update the Gaussians. Slide credit: Bernt Schiele B. Leibe

39 EM Algorithm Expectation-Maximization (EM) Algorithm
E-Step: softly assign samples to mixture components. M-Step: re-estimate the parameters (separately for each mixture component) based on the soft assignments, where the effective (“soft”) number of samples assigned to component j takes the place of Nj. Slide adapted from Bernt Schiele B. Leibe
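Written out (the slide's equations are not in the transcript; these are the standard EM updates for a MoG, e.g. Bishop Ch. 9.2):

    \text{E-step:}\quad
    \gamma_j(x_n) = \frac{\pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}{\sum_{k} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}

    \text{M-step:}\quad
    \hat{N}_j = \sum_{n=1}^{N} \gamma_j(x_n), \qquad
    \mu_j^{\text{new}} = \frac{1}{\hat{N}_j} \sum_{n=1}^{N} \gamma_j(x_n)\, x_n, \qquad
    \Sigma_j^{\text{new}} = \frac{1}{\hat{N}_j} \sum_{n=1}^{N} \gamma_j(x_n)\, (x_n - \mu_j^{\text{new}})(x_n - \mu_j^{\text{new}})^{\mathsf{T}}, \qquad
    \pi_j^{\text{new}} = \frac{\hat{N}_j}{N}

with N̂j the “soft number of samples” assigned to component j.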

40 EM Algorithm – An Example
B. Leibe Image source: C.M. Bishop, 2006

41 EM – Technical Advice When implementing EM, we need to take care to avoid singularities in the estimation! Mixture components may collapse on single data points. E.g. consider the case of spherical covariances Σj = σj²I (this also holds in general). Assume component j is exactly centered on data point xn. This data point will then contribute a term to the likelihood function that, for σj → 0, goes to infinity! → Need to introduce regularization: enforce a minimum width for the Gaussians, e.g., instead of Σ⁻¹, use (Σ + σmin I)⁻¹. B. Leibe Image source: C.M. Bishop, 2006

42 EM – Technical Advice (2)
EM is very sensitive to the initialization: it will converge to a local optimum of E, and convergence is relatively slow. → Initialize with k-Means to get better results! k-Means is itself initialized randomly and will also only find a local optimum, but its convergence is much faster. Typical procedure: run k-Means M times (e.g. M = …), pick the best result (lowest error J), and use this result to initialize EM: set μj to the corresponding cluster mean from k-Means and initialize Σj to the sample covariance of the associated data points. B. Leibe
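A minimal MATLAB sketch combining the two pieces of advice above, k-Means initialization and a σmin regularizer on the covariances (illustrative only; all names are my own, kmeans_sketch is the earlier sketch, and mvnpdf requires the Statistics Toolbox):

    function [priors, mu, Sigma] = em_mog_sketch(X, K, nIter, sigmaMin)
    % Minimal EM sketch for a Mixture of Gaussians. X is N-by-D.
    [N, D] = size(X);
    % --- initialize with k-Means (assumes every cluster received >= 1 point) ---
    [mu, idx] = kmeans_sketch(X, K, 100);
    priors = zeros(1, K); Sigma = zeros(D, D, K);
    for j = 1:K
        Xj = X(idx == j, :);
        priors(j) = size(Xj, 1) / N;
        Sigma(:, :, j) = cov(Xj) + sigmaMin * eye(D);   % enforce a minimum width
    end
    resp = zeros(N, K);
    for it = 1:nIter
        % --- E-step: soft assignments (responsibilities) ---
        for j = 1:K
            resp(:, j) = priors(j) * mvnpdf(X, mu(j, :), Sigma(:, :, j));
        end
        resp = resp ./ repmat(sum(resp, 2), 1, K);
        % --- M-step: re-estimate the parameters from the soft assignments ---
        Nj = sum(resp, 1);                              % soft counts
        priors = Nj / N;
        for j = 1:K
            mu(j, :) = resp(:, j)' * X / Nj(j);
            Xc = X - repmat(mu(j, :), N, 1);
            Sigma(:, :, j) = (Xc' * (Xc .* repmat(resp(:, j), 1, D))) / Nj(j) ...
                             + sigmaMin * eye(D);       % regularization against collapse
        end
    end
    end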

43 K-Means Clustering Revisited
Interpreting the procedure: 1. Initialization: pick K arbitrary centroids (cluster means). 2. Assign each sample to the closest centroid (E-Step). 3. Adjust the centroids to be the means of the samples assigned to them (M-Step). 4. Go to step 2 (until no change). B. Leibe

44 K-Means Clustering Revisited
K-Means clustering essentially corresponds to Gaussian Mixture Model (MoG or GMM) estimation with EM whenever the covariances of the K Gaussians are set to Σj = σ²I, for some small, fixed σ². Slide credit: Bernt Schiele B. Leibe
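To see the correspondence (following Bishop, Sec. 9.3.2; the derivation is not on the slide as transcribed): with Σj = σ²I the responsibilities become

    \gamma_j(x_n)
    = \frac{\pi_j \exp\!\left(-\lVert x_n - \mu_j \rVert^2 / 2\sigma^2\right)}
           {\sum_{k} \pi_k \exp\!\left(-\lVert x_n - \mu_k \rVert^2 / 2\sigma^2\right)}
    \;\xrightarrow{\;\sigma^2 \to 0\;}\; r_{nj} \in \{0, 1\},

i.e. in the limit each point is assigned entirely to the closest mean, which is exactly the hard assignment of k-Means.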

45 Summary: Gaussian Mixture Models
Properties: very general, can represent any (continuous) distribution; once trained, very fast to evaluate; can be updated online. Problems / caveats: some numerical issues in the implementation → need to apply regularization in order to avoid singularities; EM for MoG is computationally expensive, especially for high-dimensional problems; more computational overhead and slower convergence than k-Means; results are very sensitive to initialization → run k-Means for some iterations as initialization; need to select the number of mixture components K → model selection problem (see Lecture 16). B. Leibe

46 Topics of This Lecture Mixture distributions K-Means Clustering
Mixture of Gaussians (MoG) Maximum Likelihood estimation attempt K-Means Clustering Algorithm Applications EM Algorithm Credit assignment problem MoG estimation Interpretation of K-Means Technical advice B. Leibe

47 Applications Mixture models are used in many practical applications.
Wherever distributions with complex or unknown shapes need to be represented… A popular application in Computer Vision: model distributions of pixel colors. Each pixel is one data point in, e.g., RGB space. → Learn a MoG to represent the class-conditional densities. → Use the learned models to classify other pixels. B. Leibe Image source: C.M. Bishop, 2006

48 Application: Background Model for Tracking
Train a background MoG for each pixel: model the “common” appearance variation for each background pixel; initialization with an empty scene. Update the mixtures over time to adapt to lighting changes, etc. Used in many vision-based tracking applications: anything that cannot be explained by the background model is labeled as foreground (= object); easy segmentation if the camera is fixed. C. Stauffer, E. Grimson, Learning Patterns of Activity Using Real-Time Tracking, IEEE Trans. PAMI, 22(8), 2000. B. Leibe Image Source: Daniel Roth, Tobias Jäggli

49 Application: Image Segmentation
User-assisted image segmentation: the user marks two regions for foreground and background; learn a MoG model for the color values in each region; use those models to classify all other pixels. → Simple segmentation procedure (building block for more complex applications). B. Leibe

50 Application: Color-Based Skin Detection
Collect training samples for skin / non-skin pixels. Estimate a MoG to represent the skin and non-skin densities. Classify skin-color pixels in novel images. M. Jones and J. Rehg, Statistical Color Models with Application to Skin Detection, IJCV 2002.
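A common decision rule for this setup (used, e.g., by Jones & Rehg) is a likelihood-ratio test on the two learned densities:

    \text{classify } x \text{ as skin} \iff \frac{p(x \mid \text{skin})}{p(x \mid \text{non-skin})} > \theta,

where the threshold θ trades off false positives against false negatives.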

51 Interested to Try It? Here’s how you can access a webcam in Matlab:
function out = webcam
% uses "Image Acquisition Toolbox"
adaptorName = 'winvideo';
vidFormat = 'I420_320x240';
vidObj1 = videoinput(adaptorName, 1, vidFormat);
set(vidObj1, 'ReturnedColorSpace', 'rgb');
set(vidObj1, 'FramesPerTrigger', 1);
out = vidObj1;

cam = webcam();
img = getsnapshot(cam);

B. Leibe

52 References and Further Reading
More information about EM and MoG estimation is available in Chapter … and the entire Chapter 9 of Bishop's book (recommended reading). Additional information: Original EM paper: A.P. Dempster, N.M. Laird, D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, Vol. 39, 1977. EM tutorial: J.A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models", Technical Report, ICSI, U.C. Berkeley, CA, USA. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. B. Leibe

