A Genetic Algorithm Approach to K-Means Clustering
Craig Stanek CS401 November 17, 2004
What Is Clustering?
“partitioning the data being mined into several groups (or clusters) of data instances, in such a way that: Each cluster has instances that are very similar (or “near”) to each other, and The instances in each cluster are very different (or “far away”) from the instances in the other clusters”
--Alex A. Freitas, “Data Mining and Knowledge Discovery with Evolutionary Algorithms”
Why Cluster? Segmentation and Differentiation
Why Cluster? Outlier Detection
Why Cluster? Classification
K-Means Clustering
1. Specify K clusters
2. Randomly initialize K “centroids”
3. Classify each data instance to the closest cluster according to its distance from each centroid
4. Recalculate the cluster centroids
5. Repeat steps (3) and (4) until no data instance moves to a different cluster
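The steps above can be sketched in a few lines of Python. This is a minimal NumPy illustration, not the author's implementation; the toy data in the usage note is made up.

```python
import numpy as np

def k_means(data, k, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose K and randomly pick K initial centroids from the data
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    assignment = np.full(len(data), -1)
    while True:
        # Step 3: classify each instance to the closest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        # Step 5: stop when no instance moves to a different cluster
        if np.array_equal(new_assignment, assignment):
            return centroids, assignment
        assignment = new_assignment
        # Step 4: recalculate each cluster's centroid as the mean of its members
        for i in range(k):
            members = data[assignment == i]
            if len(members):
                centroids[i] = members.mean(axis=0)
```

For example, on two well-separated pairs of points with k=2, the loop converges in a couple of iterations and each pair ends up in its own cluster.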
Drawbacks of K-Means Algorithm
- Finds a local rather than a global optimum
- Sensitive to the initial choice of centroids
- K must be chosen a priori
- Minimizes intra-cluster distance but does not consider inter-cluster distance
Problem Statement
- Can a Genetic Algorithm approach do better than the standard K-means algorithm?
- Is there an alternative fitness measure that takes into account both intra-cluster similarity and inter-cluster differentiation?
- Can a GA be used to find the optimal number of clusters for a given data set?
Representation of Individuals
- Randomly generated number of clusters
- Medoid-based integer string (each gene is a distinct data instance)
- Example: 58 113 162 23 244
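A minimal sketch of this encoding: an individual is a list of distinct data-instance indices (one medoid per cluster) with a randomly chosen cluster count. The `min_k`/`max_k` bounds here are illustrative assumptions, not values from the talk.

```python
import random

def random_individual(n_instances, min_k=2, max_k=10, seed=None):
    """Build one medoid-based individual for a data set of n_instances rows."""
    rng = random.Random(seed)
    k = rng.randint(min_k, max_k)             # randomly generated number of clusters
    return rng.sample(range(n_instances), k)  # k distinct data-instance indices
```

Because genes are indices into the data set, every gene is guaranteed to be an actual data instance, which is what makes the medoid representation work without computing new points.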
Genetic Algorithm Approach
Why Medoids?
Recombination
Parent #1: 36 108 82
Parent #2: 5 80 147
Child #1: 5 82 80
Child #2: 82 108 6 36 6
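The slides do not spell out the recombination operator, so the sketch below is an assumption: a single-point crossover over two medoid strings that then drops duplicate indices, so each child remains a valid individual with distinct medoids. The parent values reuse the example above.

```python
import random

def crossover(p1, p2, rng=None):
    """One-point crossover for variable-length medoid strings (an assumed operator)."""
    rng = rng or random.Random(0)
    # Each parent needs at least two genes so a cut point exists
    cut1 = rng.randint(1, len(p1) - 1)
    cut2 = rng.randint(1, len(p2) - 1)

    def dedup(genes):
        # Keep first occurrence of each medoid index so genes stay distinct
        seen, out = set(), []
        for g in genes:
            if g not in seen:
                seen.add(g)
                out.append(g)
        return out

    child1 = dedup(p1[:cut1] + p2[cut2:])
    child2 = dedup(p2[:cut2] + p1[cut1:])
    return child1, child2

c1, c2 = crossover([36, 108, 82], [5, 80, 147])
```

Note that with variable-length parents the children may also differ in length, which is how crossover alone can change an individual's cluster count.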
Fitness Function
Let r_ij represent the jth data instance of the ith cluster and M_i be the medoid of the ith cluster.
Let X = (intra-cluster distance term)
Let Y = (inter-cluster distance term)
Fitness = Y / X
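The exact formulas for X and Y did not survive in this transcript. One reading consistent with Fitness = Y / X and with the stated goal of balancing intra-cluster similarity against inter-cluster differentiation is sketched below as an assumption, not the author's exact definition: X sums each instance's distance to its medoid, Y sums the pairwise distances between medoids, so larger fitness rewards tight, well-separated clusters.

```python
import numpy as np

def fitness(data, medoid_idx):
    """Assumed fitness: (inter-medoid distance) / (intra-cluster distance)."""
    medoids = data[medoid_idx]
    # Assign every instance r_ij to its nearest medoid M_i
    d = np.linalg.norm(data[:, None, :] - medoids[None, :, :], axis=2)
    x = d.min(axis=1).sum()  # X: total distance from instances to their medoids
    # Y: total distance between each unordered pair of medoids
    pair = np.linalg.norm(medoids[:, None, :] - medoids[None, :, :], axis=2)
    y = pair[np.triu_indices(len(medoids), k=1)].sum()
    return y / x
```

Under this reading, a medoid string whose medoids sit in genuinely different groups scores higher than one whose medoids crowd into a single group.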
Experimental Setup
- Iris Plant Data (UCI Repository)
- 150 data instances
- 4 dimensions
- Known classifications: 3 classes, 50 instances of each
Experimental Setup: Iris Data Set
Standard K-Means vs. Medoid-Based EA
                  K-Means   Medoid-Based EA
Total Trials      30        30
Avg. Correct      120.1     134.9
Avg. % Correct    80.1%     89.9%
Min. Correct      77        133
Max. Correct      134       135
Avg. Fitness      78.94     84.00
Standard K-Means Clustering: Iris Data Set
Medoid-Based EA: Iris Data Set
Standard Fitness EA vs. Proposed Fitness EA
                  Standard Fitness EA   Proposed Fitness EA
Total Trials      30                    30
Avg. Correct      134.9                 134.0
Avg. % Correct    89.9%                 89.3%
Min. Correct      133                   134
Max. Correct      135
Avg. Generations  82.7                  24.9
Fixed vs. Variable Number of Clusters EA
                    Fixed (K = 3)   Variable K
Total Trials        30              30
Avg. Correct        134.0
Avg. % Correct      89.3%
Min. Correct        134
Max. Correct
Avg. # of Clusters  3               7
Variable Number of Clusters EA: Iris Data Set
Conclusions
- The GA is better at obtaining a globally optimal solution
- The proposed fitness function shows promise
- Letting the GA determine the “correct” number of clusters on its own remains difficult
Future Work
- Other data sets
- Alternative fitness functions
- Scalability
- Comparison of the GA to simulated annealing