A Genetic Algorithm Approach to K-Means Clustering

A Genetic Algorithm Approach to K-Means Clustering Craig Stanek CS401 November 17, 2004

What Is Clustering? “partitioning the data being mined into several groups (or clusters) of data instances, in such a way that: Each cluster has instances that are very similar (or “near”) to each other, and The instances in each cluster are very different (or “far away”) from the instances in the other clusters” --Alex A. Freitas, “Data Mining and Knowledge Discovery with Evolutionary Algorithms”

Why Cluster? Segmentation and Differentiation

Why Cluster? Outlier Detection

Why Cluster? Classification

K-Means Clustering: (1) Specify K clusters; (2) randomly initialize K “centroids”; (3) classify each data instance to the closest cluster according to its distance from the centroid; (4) recalculate the cluster centroids; (5) repeat steps (3) and (4) until no data instances move to a different cluster.
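The five steps above can be sketched in plain Python. This is a minimal illustration, not the presenter's code: the function name `kmeans`, the `seed` and `max_iter` parameters, and the choice to draw the initial centroids from the data instances themselves are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 2 (assumption): use K distinct data instances as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Step 3: assign each instance to the closest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once no instance moves to a different cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

if __name__ == "__main__":
    data = [(0, 0), (1, 0), (2, 0), (100, 0), (101, 0), (102, 0)]
    labels, centroids = kmeans(data, k=2)
    print(labels, centroids)
```

Because the result depends on the initial centroids, running the sketch with different `seed` values on less well-separated data can give different clusterings.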

Drawbacks of K-Means Algorithm: finds a local rather than global optimum; sensitive to the initial choice of centroids; K must be chosen a priori; minimizes intra-cluster distance but does not consider inter-cluster distance.

Problem Statement: Can a Genetic Algorithm approach do better than the standard K-means algorithm? Is there an alternative fitness measure that can take into account both intra-cluster similarity and inter-cluster differentiation? Can a GA be used to find the optimum number of clusters for a given data set?

Representation of Individuals: randomly generated number of clusters; medoid-based integer string (each gene is a distinct data instance). Example: 58 113 162 23 244
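As a rough illustration of this representation, the sketch below builds an individual as a list of distinct instance indices and decodes it by assigning every instance to its nearest medoid. The function names and the `min_k` and `max_k` bounds on the number of clusters are hypothetical; the slides do not state what range was used.

```python
import math
import random

def random_individual(n_instances, min_k, max_k, rng):
    """Draw a chromosome: a list of distinct instance indices used as medoids."""
    k = rng.randint(min_k, max_k)             # randomly generated number of clusters
    return rng.sample(range(n_instances), k)  # sampling without replacement keeps genes distinct

def decode(individual, points):
    """Return, for each instance, the position of its nearest medoid in the chromosome."""
    return [
        min(range(len(individual)), key=lambda c: math.dist(p, points[individual[c]]))
        for p in points
    ]

if __name__ == "__main__":
    rng = random.Random(1)
    print(random_individual(150, 2, 10, rng))  # genes are indices into a 150-instance data set
    pts = [(0, 0), (1, 0), (10, 0), (11, 0)]
    print(decode([0, 3], pts))                 # medoids are instances 0 and 3
```

Since every gene is an index into the data set, a chromosome such as 58 113 162 23 244 describes five clusters whose medoids are those five data instances.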

Genetic Algorithm Approach Why Medoids?

Recombination Parent #1: 36 108 82 Parent #2: 5 80 147 82 108 6 36 6 Child #1: 5 82 80 Child #2:

Fitness Function: Let r_ij represent the j-th data instance of the i-th cluster and M_i be the medoid of the i-th cluster. Let X = Let Y = Fitness = Y / X
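The expressions for X and Y are not reproduced on this slide, so the following is only a hypothetical instance of a Y / X fitness, chosen to match the stated goal of rewarding both intra-cluster similarity and inter-cluster differentiation. It assumes X is the total distance from each instance to its nearest medoid and Y is the total distance between every pair of medoids; the presenter's actual definitions may differ.

```python
import math

def ratio_fitness(individual, points):
    """Hypothetical Y / X fitness for a medoid chromosome (higher is better)."""
    medoids = [points[g] for g in individual]
    # X (assumed): total distance from each instance to its nearest medoid.
    x = sum(min(math.dist(p, m) for m in medoids) for p in points)
    # Y (assumed): total distance between every pair of medoids.
    y = sum(
        math.dist(medoids[i], medoids[j])
        for i in range(len(medoids))
        for j in range(i + 1, len(medoids))
    )
    # Guard against division by zero when every instance is itself a medoid.
    return y / x if x > 0 else float("inf")

if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (10, 0), (11, 0)]
    print(ratio_fitness([0, 3], pts))  # medoids far apart, one per group
    print(ratio_fitness([0, 1], pts))  # both medoids in the same group
```

Under these assumed definitions, a chromosome whose medoids sit in different groups scores higher than one whose medoids sit in the same group, which is the behaviour a ratio of inter-cluster to intra-cluster distance is meant to reward.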

Experimental Setup: Iris Plant Data (UCI Repository); 150 data instances; 4 dimensions; known classifications (3 classes, 50 instances of each).

Experimental Setup Iris Data Set

Standard K-Means vs. Medoid-Based EA
Total Trials: 30
                  Standard K-Means   Medoid-Based EA
Avg. Correct      120.1              134.9
Avg. % Correct    80.1%              89.9%
Min. Correct      77                 133
Max. Correct      134                135
Avg. Fitness      78.94              84.00

Standard K-Means Clustering Iris Data Set

Medoid-Based EA Iris Data Set

Standard Fitness EA vs. Proposed Fitness EA
Total Trials: 30
                   Standard Fitness EA   Proposed Fitness EA
Avg. Correct       134.9                 134.0
Avg. % Correct     89.9%                 89.3%
Min. Correct       133                   134
Max. Correct       135
Avg. Generations   82.7                  24.9

Fixed vs. Variable Number of Clusters EA
Total Trials: 30
                     Fixed    Variable
Avg. Correct         134.0
Avg. % Correct       89.3%
Min. Correct         134
Max. Correct
Avg. # of Clusters   3        7

Variable Number of Clusters EA Iris Data Set

Conclusions: GA better at obtaining a globally optimal solution; proposed fitness function shows promise; difficulty letting the GA determine the “correct” number of clusters on its own.

Future Work: other data sets; alternative fitness function; scalability; GA comparison to simulated annealing.