A Genetic Algorithm Approach to K-Means Clustering
Craig Stanek CS401 November 17, 2004
What Is Clustering?
“partitioning the data being mined into several groups (or clusters) of data instances, in such a way that: Each cluster has instances that are very similar (or “near”) to each other, and The instances in each cluster are very different (or “far away”) from the instances in the other clusters”
--Alex A. Freitas, “Data Mining and Knowledge Discovery with Evolutionary Algorithms”
Why Cluster? Segmentation and Differentiation
Why Cluster? Outlier Detection
Why Cluster? Classification
K-Means Clustering
1. Specify K clusters
2. Randomly initialize K “centroids”
3. Classify each data instance to the closest cluster according to its distance from each centroid
4. Recalculate the cluster centroids
5. Repeat steps (3) and (4) until no data instance moves to a different cluster
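The steps above can be sketched in a few lines of Python. This is a minimal NumPy illustration, not the author's implementation; the toy data in the usage note is made up.

```python
import numpy as np

def k_means(data, k, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose K and randomly pick K initial centroids from the data
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    assignment = np.full(len(data), -1)
    while True:
        # Step 3: classify each instance to the closest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        # Step 5: stop when no instance moves to a different cluster
        if np.array_equal(new_assignment, assignment):
            return centroids, assignment
        assignment = new_assignment
        # Step 4: recalculate each cluster's centroid as the mean of its members
        for i in range(k):
            members = data[assignment == i]
            if len(members):
                centroids[i] = members.mean(axis=0)
```

For example, on two well-separated pairs of points with k=2, the loop converges in a couple of iterations and each pair ends up in its own cluster.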
Drawbacks of K-Means Algorithm
- Finds a local rather than a global optimum
- Sensitive to the initial choice of centroids
- K must be chosen a priori
- Minimizes intra-cluster distance but does not consider inter-cluster distance
Problem Statement
- Can a Genetic Algorithm approach do better than the standard K-means algorithm?
- Is there an alternative fitness measure that takes into account both intra-cluster similarity and inter-cluster differentiation?
- Can a GA be used to find the optimal number of clusters for a given data set?
Representation of Individuals
- Randomly generated number of clusters
- Medoid-based integer string (each gene is a distinct data instance)
- Example: 58 113 162 23 244
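A minimal sketch of this encoding: an individual is a list of distinct data-instance indices (one medoid per cluster) with a randomly chosen cluster count. The `min_k`/`max_k` bounds here are illustrative assumptions, not values from the talk.

```python
import random

def random_individual(n_instances, min_k=2, max_k=10, seed=None):
    """Build one medoid-based individual for a data set of n_instances rows."""
    rng = random.Random(seed)
    k = rng.randint(min_k, max_k)             # randomly generated number of clusters
    return rng.sample(range(n_instances), k)  # k distinct data-instance indices
```

Because genes are indices into the data set, every gene is guaranteed to be an actual data instance, which is what makes the medoid representation work without computing new points.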
Genetic Algorithm Approach
Why Medoids?
Recombination
Parent #1: 36 108 82
Parent #2: 5 80 147
Child #1: 5 82 80
Child #2: 82 108 6 36 6
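The slides do not spell out the recombination operator, so the sketch below is an assumption: a single-point crossover over two medoid strings that then drops duplicate indices, so each child remains a valid individual with distinct medoids. The parent values reuse the example above.

```python
import random

def crossover(p1, p2, rng=None):
    """One-point crossover for variable-length medoid strings (an assumed operator)."""
    rng = rng or random.Random(0)
    # Each parent needs at least two genes so a cut point exists
    cut1 = rng.randint(1, len(p1) - 1)
    cut2 = rng.randint(1, len(p2) - 1)

    def dedup(genes):
        # Keep first occurrence of each medoid index so genes stay distinct
        seen, out = set(), []
        for g in genes:
            if g not in seen:
                seen.add(g)
                out.append(g)
        return out

    child1 = dedup(p1[:cut1] + p2[cut2:])
    child2 = dedup(p2[:cut2] + p1[cut1:])
    return child1, child2

c1, c2 = crossover([36, 108, 82], [5, 80, 147])
```

Note that with variable-length parents the children may also differ in length, which is how crossover alone can change an individual's cluster count.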
Fitness Function
Let r_ij represent the jth data instance of the ith cluster and M_i be the medoid of the ith cluster.
Let X = (intra-cluster distance term)
Let Y = (inter-cluster distance term)
Fitness = Y / X
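The exact formulas for X and Y did not survive in this transcript. One reading consistent with Fitness = Y / X and with the stated goal of balancing intra-cluster similarity against inter-cluster differentiation is sketched below as an assumption, not the author's exact definition: X sums each instance's distance to its medoid, Y sums the pairwise distances between medoids, so larger fitness rewards tight, well-separated clusters.

```python
import numpy as np

def fitness(data, medoid_idx):
    """Assumed fitness: (inter-medoid distance) / (intra-cluster distance)."""
    medoids = data[medoid_idx]
    # Assign every instance r_ij to its nearest medoid M_i
    d = np.linalg.norm(data[:, None, :] - medoids[None, :, :], axis=2)
    x = d.min(axis=1).sum()  # X: total distance from instances to their medoids
    # Y: total distance between each unordered pair of medoids
    pair = np.linalg.norm(medoids[:, None, :] - medoids[None, :, :], axis=2)
    y = pair[np.triu_indices(len(medoids), k=1)].sum()
    return y / x
```

Under this reading, a medoid string whose medoids sit in genuinely different groups scores higher than one whose medoids crowd into a single group.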
Experimental Setup
- Iris Plant Data (UCI Repository)
- 150 data instances
- 4 dimensions
- Known classifications: 3 classes, 50 instances of each
Experimental Setup: Iris Data Set
Standard K-Means vs. Medoid-Based EA
                  K-Means   Medoid-Based EA
Total Trials      30        30
Avg. Correct      120.1     134.9
Avg. % Correct    80.1%     89.9%
Min. Correct      77        133
Max. Correct      134       135
Avg. Fitness      78.94     84.00
Standard K-Means Clustering: Iris Data Set
Medoid-Based EA: Iris Data Set
Standard Fitness EA vs. Proposed Fitness EA
                  Standard Fitness EA   Proposed Fitness EA
Total Trials      30                    30
Avg. Correct      134.9                 134.0
Avg. % Correct    89.9%                 89.3%
Min. Correct      133                   134
Max. Correct      135
Avg. Generations  82.7                  24.9
Fixed vs. Variable Number of Clusters EA
                    Fixed (K = 3)   Variable K
Total Trials        30              30
Avg. Correct        134.0
Avg. % Correct      89.3%
Min. Correct        134
Max. Correct
Avg. # of Clusters  3               7
Variable Number of Clusters EA: Iris Data Set
Conclusions
- The GA is better at obtaining a globally optimal solution
- The proposed fitness function shows promise
- Letting the GA determine the “correct” number of clusters on its own remains difficult
Future Work
- Other data sets
- Alternative fitness functions
- Scalability
- Comparison of the GA to simulated annealing