ECE 5984: Introduction to Machine Learning
Dhruv Batra, Virginia Tech
Topics:
– Unsupervised Learning: K-means, GMM, EM
Readings: Barber
Midsem Presentations
Graded: mean 8/10 = 80%
– Min: 3
– Max: 10
Tasks
Supervised Learning:
– Classification: x → y (y discrete)
– Regression: x → y (y continuous)
Unsupervised Learning:
– Clustering: x → c (c discrete cluster ID)
– Dimensionality Reduction: x → z (z continuous)
Learning only with X
– Y is not present in the training data
Some example unsupervised learning problems:
– Clustering / Factor Analysis
– Dimensionality Reduction / Embeddings
– Density Estimation with Mixture Models
New Topic: Clustering (Slide credit: Carlos Guestrin)
Synonyms: Clustering, Vector Quantization, Latent Variable Models, Hidden Variable Models, Mixture Models
Algorithms:
– K-means
– Expectation Maximization (EM)
Some Data (Slide credit: Carlos Guestrin)
K-means, illustrated on the data above (Slide credit: Carlos Guestrin):
1. Ask user how many clusters they'd like (e.g. k = 5)
2. Randomly guess k cluster center locations
3. Each datapoint finds out which center it's closest to (thus each center "owns" a set of datapoints)
4. Each center finds the centroid of the points it owns...
5. ...and jumps there
6. ...Repeat until terminated!
K-means (Slide credit: Carlos Guestrin)
Randomly initialize k centers:
– \mu^{(0)} = \mu_1^{(0)}, \ldots, \mu_k^{(0)}
Assign:
– Assign each point i \in \{1, \ldots, n\} to the nearest center:
– C(i) \leftarrow \arg\min_j \|x_i - \mu_j\|^2
Recenter:
– \mu_j becomes the centroid of its points:
– \mu_j \leftarrow \frac{1}{|\{i : C(i) = j\}|} \sum_{i : C(i) = j} x_i
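A minimal NumPy sketch of these two alternating steps. This is an illustration, not the course's reference code; the initialization by sampling data points and the convergence check are my own choices.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # Lloyd's algorithm: alternate the Assign and Recenter steps above.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iters):
        # Assign: each point goes to its nearest center (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (n, k)
        assign = dists.argmin(axis=1)
        # Recenter: each center jumps to the centroid of the points it owns
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):   # no center moved: converged
            break
        centers = new_centers
    return centers, assign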
K-means Demo
– http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
What is K-means optimizing?
Objective F(\mu, C): a function of the centers \mu and the point allocations C
– F(\mu, a) = \sum_{i=1}^{n} \sum_{j=1}^{k} a_{ij} \|x_i - \mu_j\|^2
– a_i is a 1-of-k encoding of point i's cluster assignment
Optimal K-means:
– \min_{\mu} \min_{a} F(\mu, a)
Coordinate Descent Algorithms (Slide credit: Carlos Guestrin)
Want: \min_a \min_b F(a, b)
Coordinate descent:
– fix a, minimize over b
– fix b, minimize over a
– repeat
Converges
– if F is bounded below
– to a (often good) local optimum, as we saw in the applet (play with it!)
K-means is a coordinate descent algorithm!
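A one-line argument for why the alternation converges (my addition, not spelled out on the slide): each half-step can only decrease the objective,
F(a^{(t+1)}, b^{(t)}) = \min_a F(a, b^{(t)}) \le F(a^{(t)}, b^{(t)}) \quad \text{and} \quad F(a^{(t+1)}, b^{(t+1)}) = \min_b F(a^{(t+1)}, b) \le F(a^{(t+1)}, b^{(t)}),
so the sequence F(a^{(t)}, b^{(t)}) is non-increasing and, if F is bounded below, it converges (though not necessarily to the global minimum).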
K-means as Coordinate Descent (Slide credit: Carlos Guestrin)
Optimize the objective function: fix \mu, optimize a (or C)
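For fixed centers, the minimizer of F over the assignments (filling in the standard solution to this subproblem, which is not in the extracted text) puts each point with its nearest center:
a_{ij} = \begin{cases} 1 & \text{if } j = \arg\min_{j'} \|x_i - \mu_{j'}\|^2 \\ 0 & \text{otherwise} \end{cases}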
K-means as Coordinate Descent (Slide credit: Carlos Guestrin)
Optimize the objective function: fix a (or C), optimize \mu
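For fixed assignments, setting \nabla_{\mu_j} F(\mu, a) = 0 (again filling in the standard derivation for this subproblem) gives
\mu_j = \frac{\sum_i a_{ij} x_i}{\sum_i a_{ij}},
i.e. each center moves to the centroid of the points assigned to it, which is exactly the Recenter step.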
One important use of K-means: bag-of-words models in computer vision
Bag of Words model: represent a document by its word counts, e.g.
aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, ..., Zaire 0
(Slide credit: Carlos Guestrin)
Object → Bag of 'words' (Slide credit: Fei-Fei Li)
Interest Point Features (Slide credit: Josef Sivic)
1. Detect patches [Mikolajczyk and Schmid '02] [Matas et al. '02] [Sivic et al. '03]
2. Normalize patch
3. Compute SIFT descriptor [Lowe '99]
Patch Features (Slide credit: Josef Sivic)
Dictionary Formation (Slide credit: Josef Sivic)
Clustering (usually k-means) → Vector quantization (Slide credit: Josef Sivic)
Clustered Image Patches (Figure credit: Fei-Fei et al. 2005)
Visual Synonyms and Polysemy (Slide credit: Andrew Zisserman)
– Visual Polysemy: a single visual word occurring on different (but locally similar) parts of different object categories
– Visual Synonyms: two different visual words representing a similar part of an object (the wheel of a motorbike)
Image representation: histogram of codeword frequencies (figure axes: frequency vs. codewords; Slide credit: Fei-Fei Li)
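A minimal sketch of the bag-of-visual-words pipeline sketched on these slides, assuming local descriptors (e.g. SIFT) have already been extracted. The function names are mine, and it reuses the kmeans sketch from earlier.

import numpy as np

def build_vocabulary(all_descriptors, k):
    # Dictionary formation: cluster training descriptors; centers become codewords
    codewords, _ = kmeans(all_descriptors, k)   # kmeans sketch from above
    return codewords

def bow_histogram(image_descriptors, codewords):
    # Vector quantization: map each descriptor to its nearest codeword ...
    d = ((image_descriptors[:, None, :] - codewords[None, :, :]) ** 2).sum(-1)
    nearest = d.argmin(axis=1)
    # ... then represent the image as a normalized codeword-frequency histogram
    hist = np.bincount(nearest, minlength=len(codewords)).astype(float)
    return hist / hist.sum()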
(One) bad case for K-means
– Clusters may overlap
– Some clusters may be "wider" than others
GMM to the rescue!
(Slide credit: Carlos Guestrin)
GMM (Figure credit: Kevin Murphy)
Recall: Multivariate Gaussians
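For reference, the multivariate Gaussian density being recalled here is the standard form
N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right).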
GMM (Figure credit: Kevin Murphy)
Hidden Data Causes Problems #1
– Fully observed (log) likelihood factorizes
– Marginal (log) likelihood doesn't factorize
– All parameters coupled!
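To make the contrast concrete (my notation, with mixing weights \pi_k): if the component labels z_i were observed, the joint log-likelihood
\log p(x, z) = \sum_i \left[ \log \pi_{z_i} + \log N(x_i \mid \mu_{z_i}, \Sigma_{z_i}) \right]
splits into separate terms per component, but after marginalizing out z,
\log p(x) = \sum_i \log \sum_{k=1}^{K} \pi_k \, N(x_i \mid \mu_k, \Sigma_k),
the log of a sum no longer splits, so all (\pi_k, \mu_k, \Sigma_k) are coupled.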
GMM vs. Gaussian Joint Bayes Classifier (on board)
– Observed Y vs. unobserved Z
– Likelihood vs. marginal likelihood
Hidden Data Causes Problems #2: Identifiability (Figure credit: Kevin Murphy)
– The likelihood is unchanged under permutations of the component labels, so the parameters are not uniquely identified
Hidden Data Causes Problems #3
– Likelihood has singularities if one Gaussian "collapses"
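Why the collapse is a problem (a standard argument, not spelled out on the slide): if one component with covariance \sigma_k^2 I centers exactly on a data point, that point's contribution
\pi_k \, N(x_i \mid x_i, \sigma_k^2 I) = \frac{\pi_k}{(2\pi\sigma_k^2)^{d/2}} \to \infty \quad \text{as } \sigma_k \to 0,
so the marginal likelihood is unbounded above and the MLE is degenerate.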
Special case: spherical Gaussians and hard assignments (Slide credit: Carlos Guestrin)
If P(x \mid z = k) is spherical, with the same variance \sigma^2 for all classes:
– P(x_i \mid z_i = k) \propto \exp\!\left( -\tfrac{1}{2\sigma^2} \|x_i - \mu_k\|^2 \right)
If each x_i belongs to exactly one class C(i) (hard assignment), the marginal likelihood is:
– P(x_1, \ldots, x_n \mid \mu_1, \ldots, \mu_k) \propto \prod_i \exp\!\left( -\tfrac{1}{2\sigma^2} \|x_i - \mu_{C(i)}\|^2 \right)
M(M)LE same as K-means!!!
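Taking logs makes the K-means connection explicit (a one-line step added here):
\log P(x_{1:n} \mid \mu_{1:k}) = -\frac{1}{2\sigma^2} \sum_i \|x_i - \mu_{C(i)}\|^2 + \text{const},
so maximizing over the centers \mu and the hard assignments C is exactly minimizing the K-means objective F(\mu, C).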
The K-means GMM assumption (Slide credit: Carlos Guestrin)
– There are k components
– Component i has an associated mean vector \mu_i
– Each component generates data from a Gaussian with mean \mu_i and covariance matrix \sigma^2 I
Each data point is generated according to the following recipe:
1. Pick a component at random: choose component i with probability P(y = i)
2. Datapoint \sim N(\mu_i, \sigma^2 I)
The General GMM assumption (Slide credit: Carlos Guestrin)
– There are k components
– Component i has an associated mean vector \mu_i
– Each component generates data from a Gaussian with mean \mu_i and covariance matrix \Sigma_i
Each data point is generated according to the following recipe:
1. Pick a component at random: choose component i with probability P(y = i)
2. Datapoint \sim N(\mu_i, \Sigma_i)
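A minimal NumPy sketch of this two-step generative recipe; the 2-D weights, means, and covariances below are made-up illustrative values, not from the slides.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D GMM parameters
weights = np.array([0.5, 0.3, 0.2])                        # P(y = i)
means = np.array([[0.0, 0.0], [4.0, 4.0], [-3.0, 3.0]])    # mu_i
covs = np.array([np.eye(2), 0.5 * np.eye(2),
                 [[1.0, 0.8], [0.8, 1.0]]])                # Sigma_i

def sample_gmm(n):
    # 1. Pick a component at random with probability P(y = i)
    comps = rng.choice(len(weights), size=n, p=weights)
    # 2. Draw each datapoint from N(mu_i, Sigma_i) of its chosen component
    X = np.array([rng.multivariate_normal(means[i], covs[i]) for i in comps])
    return X, comps

X, y = sample_gmm(500)   # 500 points plus their (normally hidden) labels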
K-means vs. GMM
K-means demo:
– http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
GMM demo:
–