A Genetic Algorithm Approach to K-Means Clustering
Craig Stanek
CS401, November 17, 2004
What Is Clustering?
“partitioning the data being mined into several groups (or clusters) of data instances, in such a way that:
- Each cluster has instances that are very similar (or ‘near’) to each other, and
- The instances in each cluster are very different (or ‘far away’) from the instances in the other clusters”
--Alex A. Freitas, Data Mining and Knowledge Discovery with Evolutionary Algorithms
Why Cluster? Segmentation and Differentiation
Why Cluster? Outlier Detection
Why Cluster? Classification
K-Means Clustering
1. Specify K, the number of clusters
2. Randomly initialize K “centroids”
3. Assign each data instance to the closest cluster according to its distance from each centroid
4. Recalculate the cluster centroids
5. Repeat steps (3) and (4) until no data instance moves to a different cluster
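A minimal sketch of the five steps above, assuming the data is an (n, d) NumPy array and K and the iteration cap are supplied by the caller (this is an illustration, not the author's implementation):

```python
import numpy as np

def kmeans(data, k, max_iters=100, seed=0):
    """Plain k-means, following the five steps above."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    labels = None
    for _ in range(max_iters):
        # Step 3: assign each instance to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop once no instance changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its members.
        for i in range(k):
            members = data[labels == i]
            if len(members) > 0:
                centroids[i] = members.mean(axis=0)
    return labels, centroids
```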
Drawbacks of the K-Means Algorithm
- Finds a local rather than a global optimum
- Sensitive to the initial choice of centroids
- K must be chosen a priori
- Minimizes intra-cluster distance but does not consider inter-cluster distance
Problem Statement
- Can a genetic algorithm approach do better than the standard K-means algorithm?
- Is there an alternative fitness measure that takes into account both intra-cluster similarity and inter-cluster differentiation?
- Can a GA be used to find the optimal number of clusters for a given data set?
Representation of Individuals
- Randomly generated number of clusters
- Medoid-based integer string (each gene is the index of a distinct data instance)
- Example: 58 113 162 23 244
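A sketch of how such an individual could be generated; the bounds k_min and k_max are assumptions for illustration, not values from the talk:

```python
import numpy as np

def random_individual(n_instances, k_min=2, k_max=10, rng=None):
    # Each gene is the index of a distinct data instance (a medoid),
    # e.g. [58, 113, 162, 23, 244]; the string length is the number of clusters.
    rng = rng or np.random.default_rng()
    k = int(rng.integers(k_min, k_max + 1))  # randomly generated number of clusters
    return [int(g) for g in rng.choice(n_instances, size=k, replace=False)]
```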
Genetic Algorithm Approach: Why Medoids?
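The slides' figures are not reproduced here, but one property worth noting is that a medoid is an actual data instance, so it can be encoded as a single integer gene, whereas a centroid is an averaged point that generally corresponds to no instance. A small sketch contrasting the two (illustrative only):

```python
import numpy as np

def centroid(points):
    # A centroid is the coordinate-wise mean; it is generally NOT an actual
    # data instance, so it cannot be stored as a single integer gene.
    return points.mean(axis=0)

def medoid_index(points):
    # A medoid IS an actual data instance: the member with minimum total
    # distance to the rest of its cluster, representable by one integer index.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return int(dists.sum(axis=1).argmin())
```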
Recombination
Parent #1: 36 108 82
Parent #2: 5 80 147
Child #1: 5 82 80
Child #2: 82 108 6 36 6
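The slide's exact operator is not fully recoverable from the text above, so the following is an assumed one-point crossover on variable-length medoid strings, with duplicate genes dropped so each child remains a set of distinct medoid indices:

```python
import numpy as np

def crossover(parent1, parent2, rng=None):
    # Assumes each parent has at least two genes.
    rng = rng or np.random.default_rng()
    # Cut each parent's medoid string at a random point and swap the tails.
    c1 = int(rng.integers(1, len(parent1)))
    c2 = int(rng.integers(1, len(parent2)))
    child1 = parent1[:c1] + parent2[c2:]
    child2 = parent2[:c2] + parent1[c1:]
    # Drop duplicate genes; note the children may differ in length from the
    # parents, which is what lets the GA search over the number of clusters.
    dedupe = lambda genes: list(dict.fromkeys(genes))
    return dedupe(child1), dedupe(child2)
```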
Fitness Function
Let $r_{ij}$ represent the $j$th data instance of the $i$th cluster and $M_i$ be the medoid of the $i$th cluster.
Let $X = \sum_i \sum_j d(r_{ij}, M_i)$ (total intra-cluster distance)
Let $Y = \sum_{i < k} d(M_i, M_k)$ (total distance between medoids)
Fitness $= Y / X$, so a fit individual has well-separated clusters (large $Y$) that are each internally compact (small $X$).
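A sketch of this fitness computation, assuming the forms of X and Y given above (the slide's equation bodies did not survive extraction) and assuming each instance belongs to the cluster of its nearest medoid:

```python
import numpy as np

def fitness(data, individual):
    # individual: list of medoid indices, e.g. [58, 113, 162, 23, 244].
    data = np.asarray(data, dtype=float)
    medoids = data[individual]                                    # (k, d)
    # Assign every instance to its nearest medoid; X sums those distances.
    d = np.linalg.norm(data[:, None, :] - medoids[None, :, :], axis=2)
    X = d.min(axis=1).sum()              # total intra-cluster distance
    # Y: total pairwise distance between medoids (inter-cluster separation).
    k = len(medoids)
    Y = sum(float(np.linalg.norm(medoids[i] - medoids[j]))
            for i in range(k) for j in range(i + 1, k))
    return Y / X
```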
Experimental Setup
- Iris Plant Data (UCI Repository)
- 150 data instances
- 4 dimensions
- Known classifications: 3 classes, 50 instances of each
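For anyone reproducing the setup, the same UCI Iris data ships with scikit-learn (an assumption of tooling on my part; the talk predates this convenience):

```python
from sklearn.datasets import load_iris

iris = load_iris()                            # the UCI Iris Plant data
data, true_labels = iris.data, iris.target   # 150 instances, 4 features, 3 classes

# e.g. run the k-means sketch from earlier on it:
# labels, centroids = kmeans(data, k=3)
```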
Experimental Setup: Iris Data Set [plots of the data; figures not reproduced]
Standard K-Means vs. Medoid-Based EA (30 trials each)

                 Standard K-Means   Medoid-Based EA
Avg. Correct     120.1              134.9
Avg. % Correct   80.1%              89.9%
Min. Correct     77                 133
Max. Correct     134                135
Avg. Fitness     78.94              84.00
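The talk does not spell out how “correct” is counted; one plausible reading, sketched below, scores the best one-to-one mapping of cluster ids to class ids (feasible here since K = 3 gives only 3! = 6 mappings):

```python
from itertools import permutations
import numpy as np

def num_correct(pred_labels, true_labels, k=3):
    # Try every one-to-one mapping of cluster ids to class ids; keep the best.
    pred = np.asarray(pred_labels)
    true = np.asarray(true_labels)
    best = 0
    for perm in permutations(range(k)):
        mapped = np.array([perm[c] for c in pred])
        best = max(best, int((mapped == true).sum()))
    return best
```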
Standard K-Means Clustering: Iris Data Set [result plot not reproduced]
Medoid-Based EA: Iris Data Set [result plot not reproduced]
Standard Fitness EA vs. Proposed Fitness EA (30 trials each)

                   Standard Fitness   Proposed Fitness
Avg. Correct       134.9              134.0
Avg. % Correct     89.9%              89.3%
Min. Correct       133                134
Max. Correct       135
Avg. Generations   82.7               24.9
Fixed vs. Variable Number of Clusters EA (30 trials each)

                     Fixed K   Variable K
Avg. Correct         134.0
Avg. % Correct       89.3%
Min. Correct         134
Max. Correct
Avg. # of Clusters   3         7
Variable Number of Clusters EA: Iris Data Set [result plot not reproduced]
Conclusions
- The GA is better at obtaining a globally optimal solution
- The proposed fitness function shows promise
- Letting the GA determine the “correct” number of clusters on its own remains difficult
Future Work
- Other data sets
- Alternative fitness functions
- Scalability
- Comparison of the GA to simulated annealing