Data Mining – Algorithms: K Means Clustering Chapter 4, Section 4.8
K Means Clustering
- K is the number of clusters; K must be specified in advance (an option or parameter to the algorithm)
- Develops “cluster centers” (centroids)
- Starts with random center points
- Puts each instance into the “closest” cluster, based on Euclidean distance
- Creates new centers based on the instances included in each cluster
- Refines iteratively until no change
Example See bankrawnumericKMeansVersion2.xls
Pseudo-code for K Means Clustering

Loop through K times
    current centroid = randomly generate values for each attribute
Done = false
All instances’ cluster = none
WHILE not Done
    Total distance = 0
    Done = true
    For each instance
        instance’s previous cluster = instance’s cluster
        Measure Euclidean distance to each centroid
        Find the smallest distance and assign the instance to that cluster
        If new cluster != previous cluster
            Done = false
        Add smallest distance to total distance
    Report total distance
    For each cluster
        Loop through attributes
            Loop through instances assigned to the cluster
                Update totals
            Calculate the average for the attribute for the cluster – producing a new centroid
END WHILE
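The pseudocode above can be sketched as runnable Python. This is an illustrative implementation, not the chapter's or WEKA's code; the function name, parameters, and the fallback for empty clusters are my own choices.

```python
import random
import math

def k_means(instances, k, seed=0, max_iter=100):
    """Basic K-Means mirroring the pseudocode above (illustrative sketch)."""
    rng = random.Random(seed)
    n_attrs = len(instances[0])
    # Randomly generate starting centroid values within each attribute's observed range
    lo = [min(x[a] for x in instances) for a in range(n_attrs)]
    hi = [max(x[a] for x in instances) for a in range(n_attrs)]
    centroids = [[rng.uniform(lo[a], hi[a]) for a in range(n_attrs)]
                 for _ in range(k)]
    clusters = [None] * len(instances)

    for _ in range(max_iter):
        done = True
        total_distance = 0.0
        # Assign each instance to the centroid with the smallest Euclidean distance
        for i, x in enumerate(instances):
            dists = [math.dist(x, c) for c in centroids]
            best = dists.index(min(dists))
            if best != clusters[i]:
                done = False   # an instance changed cluster, so keep iterating
            clusters[i] = best
            total_distance += min(dists)
        if done:
            break
        # Recompute each centroid as the attribute-wise average of its instances
        for j in range(k):
            members = [x for i, x in enumerate(instances) if clusters[i] == j]
            if members:  # keep the old centroid if the cluster is empty (an assumption)
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return clusters, centroids, total_distance
```

Because the initial centroids are random, a single run can land in a poor local minimum; running with several seeds and keeping the smallest total distance is the usual workaround.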
K Means Clustering
- Simple and effective
- The minimum reached is a local minimum; there is no guarantee that the total Euclidean distance is a global minimum
- Final clusters are quite sensitive to the initial (random) cluster centers
- This is true for all practical clustering techniques (since they are greedy hill climbers)
- Common to run several times and manually choose the best final result (the one with the smallest total Euclidean distance)
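The restart strategy above can be sketched as follows. This is a hypothetical helper, not from the chapter; it also seeds centroids from randomly sampled instances, a common variant of the random-value initialization in the pseudocode.

```python
import random
import math

def kmeans_once(points, k, rng, max_iter=100):
    # Seed centroids from k randomly sampled instances (a common variant
    # of generating random attribute values)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_assign == assign:   # no instance changed cluster: converged
            break
        assign = new_assign
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    total = sum(math.dist(p, centroids[a]) for p, a in zip(points, assign))
    return total, assign

def kmeans_restarts(points, k, runs=10, seed=0):
    """Run K-Means several times; keep the run with the smallest total distance."""
    rng = random.Random(seed)
    return min(kmeans_once(points, k, rng) for _ in range(runs))
```

Each run is a greedy hill climber from a different random start, so taking the minimum over runs is what the slide means by "choose the best final result".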
Let’s run WEKA on this …
WEKA - Take-Home

Number of iterations: 2
Within cluster sum of squared errors: 7.160553788978607
Cluster centroids:
Cluster 0
  Mean/Mode:  40.2      9215       1013.65   22       8.4     24002.221
  Std Devs:   10.4019   4607.5     206.0537  4.4721   6.5038  21098.1457
Cluster 1
  Mean/Mode:  27.6      2795.2167  423.89    15.2667  4.5333  4224.9115
  Std Devs:   11.5437   3493.6652  227.8601  4.8766   6.7387  5836.1117
Clustered Instances
  0    5 (25%)
  1   15 (75%)

- This was with the default, k = 2 (2 clusters)
- It only had to loop twice
- The within-cluster sum of squared errors is shown
- Means (and SDs) for each attribute in each cluster are shown
- The number of instances in each cluster is shown
- You can visualize the clusters (right-click on the result list) <DO>
- You can change the number of clusters generated <DO>
- You can change the random seed to see how results differ <DO>
- WEKA doesn’t give you a list of which instance is in which cluster – but you can add one to the arff file using the Preprocess tab: Filters.Unsupervised.Attribute.AddCluster
Numeric Attributes
- Simple K Means is designed for numeric attributes
- For nominal attributes, the similarity measurement has to be all or nothing (values either match or they don’t)
- The centroid uses the mode instead of the mean for nominal attributes
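The all-or-nothing handling of nominal attributes can be sketched like this. The helper names and the convention that a mismatch contributes 1 to the squared distance are illustrative assumptions, not from the chapter.

```python
import math
from collections import Counter

def mixed_distance(x, y, nominal):
    """Euclidean-style distance where a nominal attribute contributes
    0 if the values match and 1 otherwise ("all or nothing")."""
    total = 0.0
    for a, (xv, yv) in enumerate(zip(x, y)):
        if a in nominal:
            total += 0.0 if xv == yv else 1.0   # all-or-nothing match
        else:
            total += (xv - yv) ** 2             # usual squared difference
    return math.sqrt(total)

def mixed_centroid(members, nominal):
    """Centroid: mean for numeric attributes, mode for nominal ones."""
    centroid = []
    for a, column in enumerate(zip(*members)):
        if a in nominal:
            centroid.append(Counter(column).most_common(1)[0][0])  # mode
        else:
            centroid.append(sum(column) / len(column))             # mean
    return centroid
```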
End Section 4.8