Centroid index: Cluster-level quality measure


Pasi Fränti, 9.2.2017

Clustering accuracy

Classification accuracy
With known class labels, solutions A and B can be evaluated directly:
Oranges: precision = 5/7 = 71%, recall = 5/5 = 100%
Apples: precision = 3/3 = 100%, recall = 3/5 = 60%
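
As a quick sanity check of these figures, precision and recall follow directly from the true/false positive and false negative counts. A minimal plain-Python sketch (the function name is ours, values from the slide):

```python
def precision_recall(tp, fp, fn):
    """Precision = tp/(tp+fp); recall = tp/(tp+fn)."""
    return tp / (tp + fp), tp / (tp + fn)

# Oranges: 5 of 7 predictions correct, no true orange missed
print(precision_recall(tp=5, fp=2, fn=0))  # (0.714..., 1.0)
# Apples: all 3 predictions correct, 2 of 5 true apples missed
print(precision_recall(tp=3, fp=0, fn=2))  # (1.0, 0.6)
```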

Clustering accuracy
No class labels! How should solutions A and B be compared?

Internal index: Sum-of-squared error (MSE)
[Figure: solutions A and B scored by per-cluster squared errors such as 11.3, 31.2, 18.7 and 8.8]
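
A minimal sketch of how such an internal index could be computed, assuming data points `X`, one label per point, and one centroid per cluster (normalization conventions vary; here MSE is taken per attribute value):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(np.sum((X[labels == k] - c) ** 2)
               for k, c in enumerate(centroids))

def mse(X, labels, centroids):
    """SSE normalized by the number of values (N points x d dimensions)."""
    return sse(X, labels, centroids) / X.size
```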

External index
Compares two solutions: either two clusterings (A and B) against each other, or a clustering against ground truth.
[Figure: matched clusters with similarities 5/7 = 71%, 3/5 = 60%, 2/5 = 40%]

Set-matching based methods
M. Rezaei and P. Fränti, "Set-matching methods for external cluster validity", IEEE Trans. on Knowledge and Data Engineering, 28(8), 2173-2186, August 2016.

What about this…?
[Figure: three solutions to be compared]

External index: Selection of existing methods
Pair-counting measures:
  Rand index (RI) [Rand, 1971]
  Adjusted Rand index (ARI) [Hubert & Arabie, 1985]
Information-theoretic measures:
  Mutual information (MI) [Vinh, Epps & Bailey, 2010]
  Normalized mutual information (NMI) [Kvalseth, 1987]
Set-matching based measures:
  Normalized van Dongen (NVD) [Kvalseth, 1987]
  Criterion H (CH) [Meila & Heckerman, 2001]
  Purity [Rendon et al., 2011]
  Centroid index (CI) [Fränti, Rezaei & Zhao, 2014]
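
Several of these indices are available off the shelf; for instance, recent versions of scikit-learn implement RI, ARI and NMI. A small sketch on two label vectors:

```python
from sklearn.metrics import (rand_score, adjusted_rand_score,
                             normalized_mutual_info_score)

a = [0, 0, 1, 1, 2, 2]  # clustering A (label ids are arbitrary)
b = [1, 1, 0, 0, 2, 2]  # clustering B: same partition, labels renamed

print(rand_score(a, b))                    # 1.0
print(adjusted_rand_score(a, b))           # 1.0
print(normalized_mutual_info_score(a, b))  # 1.0
```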

Cluster level measure

Point level vs. cluster level
[Figure: point-level differences vs. cluster-level mismatches]

Point level vs. cluster level
[Figure: Agglomerative clustering (AC), Random Swap (RS) and K-means results, with ARI = 0.91, 0.88 and 0.82]

Point level vs. cluster level

Centroid index

Pigeonhole principle
15 pigeons (prototypes of clustering A) and 15 pigeonholes (clusters of clustering B): in a correct result there is a bijective (1:1) mapping between them.

Centroid index (CI) [Fränti, Rezaei & Zhao, Pattern Recognition 2014]
[Figure: 15 prototypes (pigeons) mapped onto 15 real clusters (pigeonholes); four holes remain empty, so CI = 4]

Definitions
Find nearest centroids (A→B): $q_i = \arg\min_{1 \le j \le K} \lVert c_i - c'_j \rVert^2$
Detect prototypes with no mapping: $\mathrm{orphan}(c'_j) = 1$ if $q_i \neq j$ for all $i$, otherwise $0$
Number of orphans: $\mathrm{CI}_1(A \to B) = \sum_{j=1}^{K} \mathrm{orphan}(c'_j)$
Centroid index, mapping both ways: $\mathrm{CI}_2(A, B) = \max\{\mathrm{CI}_1(A \to B),\ \mathrm{CI}_1(B \to A)\}$
The index counts the number of zero-mappings (orphans).
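
Following these definitions, CI is straightforward to compute; a NumPy sketch assuming two centroid arrays of shape (K, d):

```python
import numpy as np

def ci1(A, B):
    """Orphan count in B: centroids of B that no centroid of A maps to."""
    # q[i] = index of the centroid of B nearest to A[i]
    dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    q = dist.argmin(axis=1)
    return len(B) - len(np.unique(q))  # B-centroids never chosen = orphans

def centroid_index(A, B):
    """CI2: map both ways and take the larger orphan count."""
    return max(ci1(A, B), ci1(B, A))
```

In the S2 example below, two prototypes receive no mapping, so this function would return 2.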

Example (S2)
[Figure: mappings and in-degree counts; in-degree 0 indicates an orphan cluster. CI = 2]

Example (Birch1)
[Figure: mappings; CI = 7]

Example (Birch2)
[Figure: mappings; two clusters but only one allocated; three mapped into one. CI = 18]

Example (Birch2)
[Figure: close-up of the mappings; two clusters but only one allocated; three mapped into one. CI = 18]

Unbalanced example: K-means result (KM→GT)
[Figure: mappings with cluster sizes]
K-means tends to put too many clusters where the data is dense, and too few where it is sparse. CI = 4

Unbalanced example: K-means result (GT→KM)
[Figure: mappings; CI = 4]

Experiments

Mean Squared Errors
Clustering quality (MSE); methods: KM, RKM, KM++, XM, AC, RS, GKM, GA.
Bridge: 179.76, 176.92, 173.64, 179.73, 168.92, 164.64, 164.78, 161.47
House: 6.67, 6.43, 6.28, 6.20, 6.27, 5.96, 5.91, 5.87
Miss America: 5.95, 5.83, 5.52, 5.92, 5.36, 5.28, 5.21, 5.10
3.61, 3.28, 2.50, 3.57, 2.62, 2.83, -, 2.44
Birch1: 5.47, 5.01, 4.88, 5.12, 4.73, 4.64
Birch2: 7.47, 5.65, 3.07, 6.29, 2.28
Birch3: 2.51, 2.07, 1.92, 1.96, 1.86
S1: 19.71, 8.92, 8.93
S2: 20.58, 13.28, 15.87, 13.44
S3: 19.57, 16.89, 17.70
S4: 17.73, 15.70, 15.71, 17.52
Green = correct clustering structure. Raw numbers don't tell much.

Adjusted Rand Index (ARI) [Hubert & Arabie, 1985]
Methods: KM, RKM, KM++, XM, AC, RS, GKM, GA.
Bridge: 0.38, 0.40, 0.39, 0.37, 0.43, 0.52, 0.50, 1
House: 0.44, 0.47, 0.53
Miss America: 0.19, 0.18, 0.20, 0.23, 0.46, 0.49, -
Birch1: 0.85, 0.93, 0.98, 0.91, 0.96, 1.00
Birch2: 0.81, 0.86, 0.95
Birch3: 0.74, 0.82, 0.87
S1: 0.83
S2: 0.80, 0.99, 0.89
S3: 0.92
S4: 0.94, 0.77
How high is good?

Normalized Mutual Information (NMI) [Kvalseth, 1987]
Methods: KM, RKM, KM++, XM, AC, RS, GKM, GA.
Bridge: 0.77, 0.78, 0.80, 0.83, 0.82, 1.00
House: 0.81, 0.84
Miss America: 0.64, 0.63, 0.66, -
Birch1: 0.95, 0.97, 0.99, 0.96, 0.98
Birch2:
Birch3: 0.90, 0.94, 0.93
S1:
S2:
S3: 0.92
S4: 0.88, 0.85

Normalized van Dongen (NVD) [Kvalseth, 1987]
Methods: KM, RKM, KM++, XM, AC, RS, GKM, GA.
Bridge: 0.45, 0.42, 0.43, 0.46, 0.38, 0.32, 0.33, 0.00
House: 0.44, 0.40, 0.37, 0.31
Miss America: 0.60, 0.61, 0.59, 0.57, 0.55, 0.53, 0.34, 0.39, -
Birch1: 0.09, 0.04, 0.01, 0.06, 0.02
Birch2: 0.12, 0.08, 0.03
Birch3: 0.19, 0.10, 0.13
S1:
S2: 0.11
S3: 0.05
S4:
Lower is better.

Centroid Similarity Index (CSI) [Fränti, Rezaei & Zhao, 2014]
Methods: KM, RKM, KM++, XM, AC, RS, GKM, GA.
Bridge: 0.47, 0.51, 0.49, 0.45, 0.57, 0.62, 0.63, 1.00
House: 0.50, 0.54, 0.55, 0.66
Miss America: 0.32, 0.33, 0.38, 0.40, 0.42, ---
Birch1: 0.87, 0.94, 0.98, 0.93, 0.99
Birch2: 0.76, 0.84, 0.83
Birch3: 0.71, 0.82, 0.81, 0.86
S1:
S2: 0.91
S3: 0.89
S4: 0.85
OK, but lacks a threshold.

Centroid Index (CI2) [Fränti, Rezaei & Zhao, 2014]
Methods: KM, RKM, KM++, XM, AC, RS, GKM, GA.
Bridge: 74, 63, 58, 81, 33, 35
House: 56, 45, 40, 37, 31, 22, 20
Miss America: 88, 91, 67, 38, 43, 36, 39, 47, 26, 23, ---
Birch1: 7, 3, 1, 4
Birch2: 18, 11, 12
Birch3: 10, 2
S1:
S2:
S3:
S4:

Going deeper…

Accurate clustering (Bridge): GAIS-2002 similar to GAIS-2012?
GKM, Global K-means: 164.78
RS, Random swap (5k): 164.64
GA, Genetic algorithm: 161.47
RS8M, Random swap (8M): 161.02
GAIS-2002: 160.72
GAIS-2002 + RS (1M): 160.49
GAIS-2002 + RS (8M): 160.43
GAIS-2012: 160.68
GAIS-2012 + RS (1M): 160.45
GAIS-2012 + RS (8M): 160.39
GAIS-2012 + PRS: 160.33
GAIS-2012 + RS (8M) + PRS: 160.28

GAIS'02 and GAIS'12 the same? Virtually the same MSE values:
GAIS 2002: 160.72 vs GAIS 2012: 160.68 (difference 0.04)
GAIS 2002 + RS(8M): 160.43 vs GAIS 2012 + RS(8M): 160.39 (difference 0.04)
Tuning by RS(8M) improves both by 0.29.

GAIS'02 and GAIS'12 the same? But different structure!
CI between GAIS 2002 (160.72) and GAIS 2012 (160.68) is 17; after + RS(8M) tuning (160.43 vs 160.39) it is 18.
Within each variant the structure barely changes: CI = 0 from GAIS 2002 to its tuned version, CI = 1 for GAIS 2012.

Seemingly the same solutions
Solutions with the same structure form a "same family": main algorithm + tuning 1 + tuning 2.
[Table: pairwise CI between RS8M, GAIS (2002) with + RS1M / + RS8M, and GAIS (2012) with + RS1M / + RS8M / + PRS / + RS8M + PRS; within a family CI is 0-1, across families 13-25]

But why?
The real cluster structure is missing: clusters are allocated like a well-optimized "grid". Several different grids lead to different allocations, yet the overall clustering quality can still be the same.

RS runs
[Table: pairwise CI between runs A1, A8, B1, B8, C1, C8; values range from 1 to 26]

RS runs: 1M vs 8M iterations
After 1M iterations: MSE 161.37 (A1), 161.48 (B1), 161.44 (C1).
After 8M iterations: MSE 161.22 (A8), 161.24 (B8), 160.99 (C8).
Extending a run from 1M to 8M changes the structure only slightly (CI = 1-3, MSE change 0.09-0.28%), whereas independent runs differ by CI = 24-26 while their MSE differs by only 0.01-0.16%.

Generalization

Three alternatives
1. Prototype similarity (prototype must exist)
2. Model similarity (derived from the model)
3. Partition similarity

Partition similarity
Cluster similarity using Jaccard, calculated from the contingency table.
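
A sketch of this idea, assuming two NumPy integer label vectors with cluster ids 0…k-1: build the contingency table, score every cluster pair by Jaccard, map each cluster of A to its best match in B, and count the B-clusters left orphaned (the same logic as CI, but without prototypes):

```python
import numpy as np

def jaccard_matrix(a, b):
    """J[i, j] = |A_i intersect B_j| / |A_i union B_j|."""
    cont = np.zeros((a.max() + 1, b.max() + 1), dtype=int)
    np.add.at(cont, (a, b), 1)                # contingency table
    size_a = cont.sum(axis=1, keepdims=True)  # |A_i|
    size_b = cont.sum(axis=0, keepdims=True)  # |B_j|
    return cont / (size_a + size_b - cont)    # intersection / union

def partition_ci(a, b):
    """Partition-based CI: orphan clusters counted in both directions."""
    J = jaccard_matrix(a, b)
    orphans_b = J.shape[1] - len(np.unique(J.argmax(axis=1)))
    orphans_a = J.shape[0] - len(np.unique(J.argmax(axis=0)))
    return max(orphans_a, orphans_b)
```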

Gaussian mixture model
[Figure: Split-and-Merge EM vs Random Swap EM; RI = 0.98, ARI = 0.84, MI = 3.60, NMI = 0.94, NVD = 0.08, CH = 0.16, CI = 2]

Gaussian mixture model
[Figure: solutions with CI = 2, 1, 3, 0, 1, 0]

Arbitrary-shape data

KMGT 1 1 2 CI = 2 2 1

GTKM 1 1 1 CI = 1 1 2 1

SLGT 1 1 3 1 CI = 2

GTSL 2 2 2 CI = 3

Summary of experiments: prototype similarity

Summary of experiments: partition similarity

Conclusions
A simple cluster-level measure.
Generalizes to GMMs and arbitrary-shaped data.
The value has a clear interpretation: CI = 0 means the clustering structure is correct; CI > 0 gives the number of wrong clusters.