K-MEANS Michael Jones ENEE698Q Fall 2011
Overview Introduction Problem Formulation How K-Means Works Pros and Cons of Using K-Means How to Improve K-Means K-Means on a Manifold Vector Quantization
Introduction K-means was first proposed by Stuart Lloyd in 1957 as a technique for pulse-code modulation. “Least square quantization in PCM”, Bell Telephone Laboratories Paper. Groups data into K clusters and attempts to group data points to minimize the sum of squares distance to their central mean. Algorithm works by iterating between two stages until the data points converge.
Problem Formulation Given a data set of {x 1,…,x N } which consists of N random instances of a random D-dimensional Euclidean variable x. Introduce a set of K prototype vectors, µ k where k=1,…,K and µ k corresponds to the mean of the k th cluster. Goal is to find a grouping of data points and prototype vectors that minimizes the sum of squares distance of each data point.
Problem Formulation (cont.) This can be formalized by introduce a indicator variable for each data point: r nk is {0,1}, and k=1,…,K Our objective function becomes:
How K-Means works Algorithm initializes the K prototype vectors to K distinct random data points. Cycles between two stages until convergence is reached. 1. For each data point, determine r nk where: 2. Update µ k :
How K-Means works (cont) K-Means follows the Expectation Maximization algorithm. Stage 1 is the E step. Stage 2 is the M step. If K and D are fixed, the clustering can be performed in time.
How K-Means works (example
Pros and Cons of K-Means Convergence: J may converge to a local minima and not the global minimum. May have to repeat algorithm multiple times. Inter-Vector Relationships: Works well for Euclidian data but cannot make use of inter-vector relationships with each x. With a large data set, the Euclidian distance calculations can be slow. K is an input parameter. If K is inappropriately chosen it may yield poor results.
How to Improve K-Means The E step can modified to have a general dissimilarity measure which leads to the K- medoids algorithm. Can speed up K-means through various methods: Pre-compute a tree where near by points are in the same sub tree. (Ramas. And Paliwal, 1990) Use triangle inequality for computing distances. (Hodgson,1998).
Vector Quantization Proposed by Robert M. Gray Algorithm is nearly identical to K-Means “Step 0. Given: A training sequence and an initial decoder. Step 1. Encode the training sequence into a sequence of channel symbols using the given decoder minimum distortion rule. If the average distortion is small enough, quit. Step 2. Replace the old reproduction codeword of the decoder for each channel symbol v by the centriod of all training vectors which mapped into v in Step 1. Go to Step 1.”
K-Means on a Manifold K-Means can be performed on a manifold if one can compute the mean of the data. Fletcher et al. introduced the notion of computing means on Riemannian manifolds. Turaga et al. performed such an experiment applying K-Means on Riemannian manifolds. Used iterative algorithm to find the sample Karcher mean Used the dissimilarity measure:
Sources Bishop C., “K-Means Clustering” in Pattern Recognition and Machine Learning, 2006, Fletcher, P., Lu, C., Pitzer, M., Joshi, & S., “Principal Geodesic Analysis for the Study of Nonlinear Statistics of Shape” from IEEE Transactions on Medical Imaging, VOL. 23, NO. 8, August 2004, Gray, M., “Vector Quantization” in IEEE ASSP Magazine, pp , April Turaga, P., Veeraraghavan, A., Srivastava, A. & Chellappa, R., “Statistical Computations on Grassman and Stiefel manifolds for Image and Video-Based Recognition” in IEEE PAMI, accepted 2010.
Questions? July, 2010