
1 EE462 MLCV 1 Lecture 3-4 Clustering (1hr) Gaussian Mixture and EM (1hr) Tae-Kyun Kim

2 EE462 MLCV 2 Vector Clustering: Data points (green), 2D vectors, are grouped into two homogeneous clusters (blue and red). Clustering is achieved by an iterative algorithm (left to right). The cluster centers are marked x.

3 EE462 MLCV 3 Pixel Clustering (Image Quantisation): Image pixels are represented by 3D vectors of R, G, B values. The vectors are grouped into K = 10, 3, 2 clusters, and each pixel is represented by the mean value of its cluster.
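
As an illustration (not part of the original slides), a minimal sketch of pixel clustering with scikit-learn's KMeans, assuming an RGB image held in a NumPy array; the function name quantise is a placeholder of my own.

```python
# Sketch of image quantisation: cluster RGB pixels with K-means
# and replace each pixel by its cluster mean. Assumes scikit-learn is available.
import numpy as np
from sklearn.cluster import KMeans

def quantise(image, k):
    """image: (H, W, 3) uint8 array; returns the quantised image."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(np.float64)    # N x 3 vectors of R, G, B
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    means = km.cluster_centers_[km.labels_]             # each pixel -> its cluster mean
    return means.reshape(h, w, 3).astype(np.uint8)

# Example with a random "image"; in practice load a real photo instead.
img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
for k in (10, 3, 2):
    print(k, quantise(img, k).shape)
```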

4 EE462 MLCV 4 Patch Clustering: Image patches are harvested around interest points from a large number of images. Each patch is represented by a finite-dimensional vector (e.g. a SIFT descriptor, or the raw pixels of a 20x20 patch, giving dimension D = 400), and the vectors are clustered to form a visual dictionary of K codewords. See Lecture 9-10 (BoW).
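
A hypothetical sketch of building such a visual dictionary by clustering patch descriptors; the random descriptors below are stand-ins for real SIFT or raw-pixel vectors, and the variable names are my own.

```python
# Sketch: cluster patch descriptors into K codewords (a visual dictionary).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors = rng.standard_normal((5000, 400))   # N patches, D = 400 (placeholder data)
K = 100
codebook = KMeans(n_clusters=K, n_init=4, random_state=0).fit(descriptors)
codewords = codebook.cluster_centers_            # K x D dictionary

# A new patch is encoded by the index of its nearest codeword (used later for BoW).
new_patch = rng.standard_normal((1, 400))
print(codebook.predict(new_patch))               # index of the nearest codeword
```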

5 EE462 MLCV 5 Image Clustering: Whole images are represented as finite-dimensional vectors. Homogeneous vectors are grouped together in Euclidean space. See Lecture 9-10 (BoW).

6 EE462 MLCV 6 K-means vs GMM: Two standard methods are K-means and the Gaussian Mixture Model (GMM). K-means performs hard clustering: each data point is assigned to a single cluster (the nearest one). GMM performs soft clustering: a data point is explained probabilistically by a mixture of multiple Gaussian densities.

7 EE462 MLCV 7 Matrix and Vector Derivatives: Matrix and vector derivatives are obtained by taking element-wise derivatives first and then re-assembling them into matrices and vectors.

8 EE462 MLCV 8 Matrix and Vector Derivatives
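
The worked examples on slides 7-8 did not survive the transcript; as a stand-in, two standard identities obtained by this element-wise approach (for a constant vector a, a constant matrix A, and a variable vector x) are:

```latex
\frac{\partial}{\partial \mathbf{x}} \left( \mathbf{a}^{\top}\mathbf{x} \right) = \mathbf{a},
\qquad
\frac{\partial}{\partial \mathbf{x}} \left( \mathbf{x}^{\top} A\, \mathbf{x} \right) = (A + A^{\top})\,\mathbf{x}
```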

9 EE462 MLCV 9 K-means Clustering: Given a data set {x_1, ..., x_N} of N observations in a D-dimensional space, our goal is to partition the data set into K clusters or groups. The vectors μ_k, k = 1, ..., K, represent the k-th cluster, e.g. the centers of the clusters. For each data point x_n we define binary indicator variables r_nk ∈ {0, 1}, k = 1, ..., K (a 1-of-K coding scheme): if x_n is assigned to cluster k then r_nk = 1 and r_nj = 0 for j ≠ k.

10 EE462 MLCV 10 The objective function, which measures distortion, is J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2. We need to find values of the {r_nk} and {μ_k} that minimise J.

11 EE462 MLCV 11 Iterative solution: First we choose some initial values for μ_k, then we repeat the following two steps until convergence.
Step 1: Minimise J with respect to the r_nk, keeping the μ_k fixed. J is a linear function of r_nk, so we have a closed-form solution: r_{nk} = 1 if k = \arg\min_j \| x_n - \mu_j \|^2, and r_{nk} = 0 otherwise.
Step 2: Minimise J with respect to the μ_k, keeping the r_nk fixed. J is a quadratic function of μ_k; setting its derivative with respect to μ_k to zero gives \mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}.
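
A minimal NumPy sketch of this two-step iteration (illustrative only; the function name kmeans and its defaults are my own, the notation follows the slide):

```python
# K-means: alternate assignment (Step 1) and mean update (Step 2) until convergence.
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """X: (N, D) data; returns (mu, r) with means mu (K, D) and hard assignments r (N,)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]            # initial values for mu_k
    for _ in range(n_iters):
        # Step 1: assign each x_n to the nearest mean (minimise J w.r.t. r_nk).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        r = d2.argmin(axis=1)
        # Step 2: set each mu_k to the mean of its assigned points (minimise J w.r.t. mu_k).
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, r

X = np.vstack([np.random.randn(100, 2) + 3, np.random.randn(100, 2) - 3])
mu, r = kmeans(X, K=2)
print(mu)
```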

12 EE462 MLCV 12 [Figure: K-means iterations on a 2D data set with K = 2, showing the cluster means μ_1, μ_2 and the assignments r_nk.]

13 EE462 MLCV 13 Each step reduces the value of J, which provides a proof of convergence. However, the algorithm converges only to a local minimum: its result depends on the initial values of μ_k.

14 EE462 MLCV 14 Generalisation of K-means: we use a more generic dissimilarity measure V(x_n, μ_k). The objective function to minimise is \tilde{J} = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, V(x_n, \mu_k), for example V(x_n, \mu_k) = (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k), where Σ_k denotes the covariance matrix of cluster k. The choice of Σ_k determines the cluster shapes: Σ_k = I gives circles of the same size.

15 EE462 MLCV 15 Generalisation of K-means (cont.): Σ_k an isotropic matrix gives circles of different sizes; Σ_k a diagonal matrix gives axis-aligned ellipses; Σ_k a full matrix gives rotated ellipses.
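
A small sketch of the assignment step under this generalised dissimilarity, assuming per-cluster covariances Σ_k are given (the function name assign and the example values are my own):

```python
# Assignment with the Mahalanobis dissimilarity V(x, mu_k) = (x - mu_k)^T Sigma_k^{-1} (x - mu_k).
import numpy as np

def assign(X, mus, Sigmas):
    """X: (N, D); mus: (K, D); Sigmas: (K, D, D). Returns hard assignments (N,)."""
    V = np.empty((len(X), len(mus)))
    for k, (mu, Sigma) in enumerate(zip(mus, Sigmas)):
        diff = X - mu                                              # (N, D)
        V[:, k] = np.einsum('nd,de,ne->n', diff, np.linalg.inv(Sigma), diff)
    return V.argmin(axis=1)

# Sigma_k = I recovers standard K-means (equal-sized circular clusters).
X = np.random.randn(10, 2)
mus = np.array([[0.0, 0.0], [2.0, 2.0]])
Sigmas = np.array([np.eye(2), np.diag([0.5, 2.0])])
print(assign(X, mus, Sigmas))
```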

16 EE462 MLCV 16 Statistical Pattern Recognition Toolbox for Matlab http://cmp.felk.cvut.cz/cmp/software/stprtool/ …\stprtool\probab\cmeans.m

17 EE462 MLCV 17 Mixture of Gaussians: Denote by z a 1-of-K representation: z_k ∈ {0, 1} and Σ_k z_k = 1. Here z is the hidden variable and x is the observable variable (the data). We define the joint distribution p(x, z) by a marginal distribution p(z) and a conditional distribution p(x|z). See Lecture 11-12 (Probabilistic Graphical Models).

18 EE462 MLCV 18 The marginal distribution over z is written in terms of the mixing coefficients π_k, where p(z_k = 1) = \pi_k, with 0 \le \pi_k \le 1 and \sum_{k=1}^{K} \pi_k = 1. The marginal distribution is in the form p(z) = \prod_{k=1}^{K} \pi_k^{z_k}. Similarly, the conditional distribution is p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k), i.e. p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}.

19 EE462 MLCV 19 The marginal distribution of x is p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), which is a linear superposition of Gaussians.
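
A short sketch evaluating this superposition with SciPy; the mixture parameters below are made-up illustrative values, not from the slides.

```python
# Evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k) for a 1D two-component mixture.
import numpy as np
from scipy.stats import multivariate_normal

pis = np.array([0.3, 0.7])
mus = [np.array([-2.0]), np.array([1.5])]
Sigmas = [np.array([[0.5]]), np.array([[1.0]])]

def gmm_density(x):
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))

xs = np.linspace(-6, 6, 5)
print([gmm_density(x) for x in xs])
```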

20 EE462 MLCV 20 The conditional probability p(z_k = 1 | x), denoted γ(z_k), is obtained by Bayes' theorem: \gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}. We view π_k as the prior probability of z_k = 1 and γ(z_k) as the corresponding posterior probability: γ(z_k) is the responsibility that component k takes for explaining the observation x.
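
A corresponding sketch of the responsibilities for a single observation x, repeating the same made-up parameters as the previous snippet so that it is self-contained:

```python
# Responsibilities gamma(z_k) = pi_k N(x|mu_k,Sigma_k) / sum_j pi_j N(x|mu_j,Sigma_j).
import numpy as np
from scipy.stats import multivariate_normal

pis = np.array([0.3, 0.7])
mus = [np.array([-2.0]), np.array([1.5])]
Sigmas = [np.array([[0.5]]), np.array([[1.0]])]

def responsibilities(x):
    weighted = np.array([pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
                         for pi, mu, Sigma in zip(pis, mus, Sigmas)])
    return weighted / weighted.sum()      # posterior over components, sums to 1

print(responsibilities(0.0))              # here x = 0 is mostly explained by component 2
```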

21 EE462 MLCV 21 Maximum Likelihood Estimation: Given a data set X = {x_1, ..., x_N}, the log of the likelihood function is \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}, which we maximise subject to \sum_{k=1}^{K} \pi_k = 1.

22 EE462 MLCV 22 Setting the derivative of ln p(X|π, μ, Σ) with respect to μ_k to zero, we obtain \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, where N_k = \sum_{n=1}^{N} \gamma(z_{nk}) is the effective number of points assigned to component k.

23 EE462 MLCV 23 Finally, we maximise ln p(X|π, μ, Σ) with respect to the mixing coefficients π_k. Since the π_k must sum to one, we use a Lagrange multiplier and maximise \ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right). (For an objective function f(x) and constraint g(x), this handles max f(x) s.t. g(x) = 0; refer to an optimisation course or http://en.wikipedia.org/wiki/Lagrange_multiplier.)

24 EE462 MLCV 24 Setting the derivative with respect to π_k to zero gives 0 = \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda. Multiplying by π_k and summing over k, we find \lambda = -N and \pi_k = \frac{N_k}{N}.

25 EE462 MLCV 25 EM (Expectation Maximisation) for Gaussian Mixtures
1. Initialise the means μ_k, covariances Σ_k and mixing coefficients π_k.
2. E step: Evaluate the responsibilities using the current parameter values: \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.
3. M step: Re-estimate the parameters using the current responsibilities: \mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \quad \Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T, \quad \pi_k^{new} = \frac{N_k}{N}, where N_k = \sum_{n=1}^{N} \gamma(z_{nk}).

26 EE462 MLCV 26 EM (Expectation Maximisation) for Gaussian Mixtures (cont.)
4. Evaluate the log likelihood \ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.
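
A compact NumPy/SciPy sketch of the full loop (steps 1-4), for illustration only; the function name em_gmm, the initialisation choices, and the small covariance regularisation term are assumptions of mine, not from the slides.

```python
# EM for a Gaussian mixture: E step (responsibilities), M step (parameter updates),
# and a log-likelihood convergence check. Illustrative sketch, not production code.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=200, tol=1e-6, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # 1. Initialise the means, covariances and mixing coefficients.
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # 2. E step: responsibilities gamma(z_nk) for the current parameters.
        dens = np.column_stack([multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                                for k in range(K)])              # (N, K)
        weighted = dens * pi                                     # pi_k N(x_n | mu_k, Sigma_k)
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # 3. M step: re-estimate the parameters from the responsibilities.
        Nk = gamma.sum(axis=0)                                   # effective counts N_k
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # 4. Log likelihood (from the E-step densities) and convergence check.
        ll = np.log(weighted.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma

X = np.vstack([np.random.randn(150, 2) + 3, np.random.randn(150, 2) - 3])
print(em_gmm(X, K=2)[0])      # mixing coefficients, roughly [0.5, 0.5]
```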

27 EE462 MLCV 27

28 EE462 MLCV 28 Statistical Pattern Recognition Toolbox for Matlab http://cmp.felk.cvut.cz/cmp/software/stprtool/ …\stprtool\visual\pgmm.m …\stprtool\demos\demo_emgmm.m

29 EE462 MLCV 29 Information Theory: The amount of information h(x) can be viewed as the degree of surprise on learning the value of x. If we have two events x and y that are unrelated, h(x, y) = h(x) + h(y). Since p(x, y) = p(x)p(y), h(x) takes the form of the logarithm of p(x): h(x) = -\log_2 p(x), where the minus sign ensures that information is positive or zero. See Lecture 7 (Random Forest).

30 EE462 MLCV 30 The average amount of information (called entropy) is given by H[x] = -\sum_{x} p(x) \log_2 p(x). The differential entropy for a multivariate continuous variable x is H[x] = -\int p(x) \ln p(x) \, dx.
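
A tiny sketch computing the entropy of a discrete distribution in bits, matching the log_2 convention above (the function name entropy is my own):

```python
# Entropy H[x] = -sum_x p(x) log2 p(x) of a discrete distribution, in bits.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # the limit 0 * log 0 is taken as 0
    return -(p * np.log2(p)).sum()

print(entropy([0.5, 0.5]))            # 1.0 bit (maximal surprise for two outcomes)
print(entropy([0.9, 0.1]))            # about 0.47 bits
print(entropy([1.0, 0.0]))            # 0.0 bits (no surprise)
```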

