
1 Cluster validation Pasi Fränti Clustering methods: Part 3 Speech and Image Processing Unit School of Computing University of Eastern Finland 15.4.2014

2 Part I: Introduction

3 Cluster validation: Supervised classification has class labels (ground truth) and is evaluated with accuracy, precision and recall. Cluster analysis has no class labels, so validation is needed to compare clustering algorithms, to solve the number of clusters, and to avoid finding patterns in noise. Example: oranges: precision = 5/5 = 100%, recall = 5/7 = 71%; apples: precision = 3/5 = 60%, recall = 3/3 = 100%.

4 Measuring clustering validity: Internal index: validates without external information, across different numbers of clusters; used to solve the number of clusters. External index: validates against ground truth; compares two clusterings (how similar they are).

5 Clustering of random data (figure): random points clustered with K-means, DBSCAN and complete link.

6 Cluster validation process: 1. Distinguishing whether non-random structure actually exists in the data (one cluster). 2. Comparing the results of a cluster analysis to externally known results, e.g. externally given class labels. 3. Evaluating how well the results of a cluster analysis fit the data without reference to external information. 4. Comparing the results of two different sets of cluster analyses to determine which is better. 5. Determining the number of clusters.

7 Cluster validation process: Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion [Jain & Dubes, 1988]. How to be quantitative: employ the measures. How to be objective: validate the measures! Pipeline: INPUT DataSet(X) → clustering algorithm run with different numbers of clusters m → partitions P and codebook C → validity index → selected m*.

8 Part II: Internal indexes


11 Internal indexes: Ground truth is rarely available, but unsupervised validation must still be done by minimizing (or maximizing) an internal index: variance within and between clusters, rate-distortion method, F-ratio, Davies-Bouldin index (DBI), Bayesian information criterion (BIC), Silhouette coefficient, minimum description length (MDL), stochastic complexity (SC).

12 Mean square error (MSE): The more clusters, the smaller the MSE, so the raw value does not reveal the correct number of clusters. A small knee-point appears near the correct value, here between 14 and 15 clusters. But how to detect it?

13 Mean square error (MSE): example with 5 clusters and with 10 clusters (figure).

14 From MSE to cluster validity: minimize the within-cluster (intra-cluster) variance (MSE) and maximize the between-cluster (inter-cluster) variance.

15 Jump point of MSE (rate-distortion approach): take the first derivative (successive differences) of the powered MSE values; the biggest jump occurs at 15 clusters.
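The jump detection itself is easy to automate. Below is a minimal Python sketch (not from the slides) that, given MSE values already computed for each number of clusters, raises them to the power -d/2 in the rate-distortion style and returns the k with the largest jump; the function and variable names are illustrative.

```python
import numpy as np

def jump_point(mse, d):
    """Rate-distortion style jump detection (a sketch).

    mse : dict of MSE values keyed by number of clusters k
    d   : dimensionality of the data
    Returns the k at which the transformed distortion jumps the most.
    """
    k_values = np.array(sorted(mse.keys()))
    distortion = np.array([mse[k] for k in k_values], dtype=float)
    powered = distortion ** (-d / 2.0)          # "powered MSE" values
    jumps = np.diff(powered)                    # discrete first derivative
    return int(k_values[1:][np.argmax(jumps)])  # biggest jump -> chosen k

# Hypothetical usage: mse_per_k = {1: 95.1, 2: 61.3, ..., 20: 7.9}
# print(jump_point(mse_per_k, d=2))
```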

16 Sum-of-squares based indexes:
SSW / k — Ball and Hall (1965)
k² |W| — Marriot (1971)
(SSB / (k-1)) / (SSW / (N-k)) — Calinski & Harabasz (1974)
log(SSB/SSW) — Hartigan (1975)
Xu (1997)
(d is the dimension of the data; N is the size of the data; k is the number of clusters; SSW = sum of squares within the clusters (= MSE); SSB = sum of squares between the clusters)

17 Variances: Within clusters: SSW = sum over all points of the squared distance to their cluster centroid. Between clusters: SSB = sum over clusters of n_j times the squared distance of centroid c_j to the overall mean. Total variance of the data set = SSW + SSB.

18 F-ratio variance test: the variance-ratio F-test measures the ratio of the between-groups variance to the within-groups variance (original F-test). The F-ratio (WB-index) used here is of the form m · SSW / SSB, where m is the number of clusters (smaller is better).

19 Calculation of F-ratio

20 F-ratio for dataset S1

21 F-ratio for dataset S2

22 F-ratio for dataset S3

23 F-ratio for dataset S4

24 Extension of the F-ratio for S3

25 Sum-of-squares based indexes compared (figure): SSW / m, log(SSB/SSW), m · SSW/SSB, and SSW/SSB together with MSE; the selected m* is marked.

26 Davies-Bouldin index (DBI): minimize the intra-cluster variance and maximize the distance between clusters. The cost function combines the two for each pair of clusters: DBI = (1/k) · sum over clusters i of max over j ≠ i of (s_i + s_j) / d(c_i, c_j), where s_i is the dispersion of cluster i and d(c_i, c_j) the distance between the centroids.
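A minimal Python sketch of the index, assuming Euclidean distances, non-empty clusters, and the mean distance to the centroid as the dispersion s_i (the exact dispersion used on the slides may differ); scikit-learn also ships a ready-made davies_bouldin_score.

```python
import numpy as np

def davies_bouldin(X, labels, centroids):
    """Davies-Bouldin index (a sketch): average, over clusters, of the worst
    ratio (s_i + s_j) / d(c_i, c_j)."""
    k = len(centroids)
    # intra-cluster dispersion s_i = mean distance of the members to the centroid
    s = np.array([np.mean(np.linalg.norm(X[labels == i] - centroids[i], axis=1))
                  for i in range(k)])
    dbi = 0.0
    for i in range(k):
        ratios = [(s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(k) if j != i]
        dbi += max(ratios)          # worst (most similar) neighbouring cluster
    return dbi / k
```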

27 Davies-Bouldin index (DBI)

28 Measured values for S2

29 Silhouette coefficient [Kaufman & Rousseeuw, 1990]: Cohesion measures how closely related the objects in a cluster are. Separation measures how distinct or well-separated a cluster is from the other clusters.

30 Silhouette coefficient: Cohesion a(x): average distance of x to all other vectors in the same cluster. Separation b(x): average distance of x to the vectors of each other cluster, taking the minimum over those clusters. Silhouette s(x) = (b(x) − a(x)) / max{a(x), b(x)}, with s(x) in [−1, +1]: −1 = bad, 0 = indifferent, +1 = good. Silhouette coefficient SC = average of s(x) over all vectors.

31 Silhouette coefficient (illustration): a(x) is the average distance of x within its own cluster (cohesion); b(x) is the average distance of x to each other cluster, taking the minimal one (separation).
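Before the performance results, a short Python sketch showing how the mean silhouette coefficient could be evaluated for different numbers of clusters; the toy data and parameter choices are illustrative only, and scikit-learn's silhouette_score / silhouette_samples handle the per-point computation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Synthetic three-cluster data (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ((0, 0), (3, 0), (0, 3))])

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sc = silhouette_score(X, labels)        # mean s(x) over all points
    print(f"k={k}: silhouette coefficient = {sc:.3f}")
# silhouette_samples(X, labels) returns the per-point s(x) values.
```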

32 Performance of Silhouette coefficient

33 Bayesian information criterion (BIC): BIC combines the log-likelihood L(θ) of the model with a penalty for model complexity that grows with log n (n = size of the data set, m = number of clusters). Under a spherical Gaussian assumption, a closed-form BIC for partitioning-based clustering is obtained, expressed in terms of d (dimension of the data set), n_i (size of the i-th cluster) and Σ_i (covariance of the i-th cluster).

34 Knee point detection on BIC: with the original BIC denoted F(m), the knee is detected from the second difference SD(m) = F(m−1) + F(m+1) − 2·F(m).
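A minimal sketch of this knee-point rule in Python, assuming the BIC values F(m) have already been computed for each candidate m; names are illustrative, and the sign convention of BIC determines whether the largest or smallest second difference marks the knee.

```python
import numpy as np

def knee_point(bic):
    """Second-difference knee detection, SD(m) = F(m-1) + F(m+1) - 2*F(m).

    bic : dict of BIC values F(m) keyed by the number of clusters m.
    Assumes the BIC is formulated so that the knee maximizes SD(m);
    flip to argmin for the opposite sign convention.
    """
    m_values = np.array(sorted(bic.keys()))
    F = np.array([bic[m] for m in m_values], dtype=float)
    sd = F[:-2] + F[2:] - 2.0 * F[1:-1]         # SD(m) for interior m
    return int(m_values[1:-1][np.argmax(sd)])   # m with the sharpest knee

# Hypothetical usage: bic_per_m = {2: -5400.0, 3: -5120.0, ..., 20: -4985.0}
# print(knee_point(bic_per_m))
```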

35 Internal indexes

36 Soft partitions

37 Comparison of the indexes (K-means).

38 Comparison of the indexes (Random Swap).

39 Part III: Stochastic complexity for binary data

40 Stochastic complexity: principle of minimum description length (MDL): find the clustering C that can be used for describing the data with minimum information. Data = clustering + description of the data. The clustering is defined by the centroids; the data is defined by which cluster each vector belongs to (partition index) and where in the cluster it lies (difference from the centroid).

41 Solution for binary data: the stochastic complexity formula for binary data and its simplified form (equations shown on the slide).

42 Number of clusters by stochastic complexity (SC)

43 Part IV: External indexes

44 Pair-counting measures: count the number of pairs of objects that are in the same class both in P and in G (a), in the same class in P but different in G (b), in different classes in P but the same in G (c), and in different classes both in P and G (d).
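For small data sets the four pair counts can be computed directly by enumerating all pairs, as in the following Python sketch (a brute-force O(N²) illustration, not an optimized implementation):

```python
from itertools import combinations

def pair_counts(P, G):
    """Count pairs of objects that are (a) together in both P and G,
    (b) together in P only, (c) together in G only, (d) apart in both."""
    a = b = c = d = 0
    for i, j in combinations(range(len(P)), 2):
        same_p = P[i] == P[j]
        same_g = G[i] == G[j]
        if same_p and same_g:
            a += 1
        elif same_p:
            b += 1
        elif same_g:
            c += 1
        else:
            d += 1
    return a, b, c, d
```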

45 Rand and adjusted Rand index [Rand, 1971; Hubert and Arabie, 1985]: pairs a and d count as agreement, pairs b and c as disagreement.

46 External indexes: if true class labels (ground truth) are known, the validity of a clustering can be verified by comparing the class labels and the clustering labels; n_ij = number of objects in class i and cluster j.

47 Rand statistics Visual example

48 Pointwise measures

49 Rand index (example):
Vectors assigned to:                  Same cluster   Different clusters
Same cluster in ground truth               20               24
Different clusters in ground truth         20               72
Rand index = (20 + 72) / (20 + 24 + 20 + 72) = 92/136 = 0.68
Adjusted Rand = (to be calculated) = 0.xx
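The same counts can be plugged into a short Python sketch; the adjusted Rand index is written here in its pair-count form, which is algebraically equivalent to the usual contingency-table formula of Hubert and Arabie (and symmetric in b and c, so the orientation of the table does not matter):

```python
def rand_index(a, b, c, d):
    """Rand index: fraction of agreeing pairs."""
    return (a + d) / (a + b + c + d)

def adjusted_rand(a, b, c, d):
    """Adjusted Rand index expressed in pair counts (Hubert & Arabie, 1985)."""
    return 2.0 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))

# Pair counts read off the table above:
a, b, c, d = 20, 24, 20, 72
print(rand_index(a, b, c, d))     # 92/136 = 0.68
print(adjusted_rand(a, b, c, d))  # the chance-corrected value
```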

50 External indexes: pair-counting, information-theoretic, and set-matching measures.

51 Pair-counting measures: agreement = a, d; disagreement = b, c. Rand index: RI = (a + d) / (a + b + c + d). Adjusted Rand index: ARI = (RI − E[RI]) / (max(RI) − E[RI]), i.e. the Rand index corrected for chance agreement.

52 Information-theoretic measures: based on the concept of entropy. Mutual information (MI) measures the information that two clusterings share, and variation of information (VI) measures the information that they do not share (VI = H(P) + H(G) − 2·MI).
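A small Python sketch of these quantities, using scikit-learn for MI and NMI and computing VI as H(P) + H(G) − 2·MI; the example labelings are hypothetical.

```python
import numpy as np
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

def variation_of_information(P, G):
    """VI(P, G) = H(P) + H(G) - 2*MI(P, G), in nats (a sketch)."""
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))
    return entropy(P) + entropy(G) - 2.0 * mutual_info_score(P, G)

# Hypothetical labelings of the same 8 objects:
P = [0, 0, 0, 1, 1, 1, 2, 2]
G = [0, 0, 1, 1, 1, 2, 2, 2]
print(mutual_info_score(G, P))             # shared information (nats)
print(normalized_mutual_info_score(G, P))  # NMI in [0, 1]
print(variation_of_information(P, G))      # 0 only for identical partitions
```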

53 Set-matching measures: Categories: point-level and cluster-level. Three problems: how to measure the similarity of two clusters, how to pair the clusters, and how to calculate the overall similarity.

54 Similarity of two clusters: example with clusters P1 (n1 = 1000), P2 (n2 = 250) and P3 (n3 = 200):
Criterion                      (P2, P3)   (P2, P1)
Shared objects (H/NVD/CSI)        200        250
Jaccard (J)                      0.80       0.25
Sorensen-Dice (SD)               0.89       0.40
Braun-Banquet (BB)               0.80       0.25

55 Pairing: matching problem in a weighted bipartite graph between the clusters of G (G1, G2, G3) and the clusters of P (P1, P2, P3).

56 Matching or pairing? Algorithms: greedy or optimal pairing.

57 Normalized van Dongen (NVD): matching based on the number of shared objects; in the illustration, clustering P is shown by the big circles and clustering G by the shape of the objects.
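A Python sketch under one common formulation of the criterion, NVD = (2n − Σ_i max_j n_ij − Σ_j max_i n_ij) / (2n), where n_ij is the contingency table of the two partitions; treat this formulation as an assumption rather than the exact definition used on the slides.

```python
from sklearn.metrics.cluster import contingency_matrix

def normalized_van_dongen(G, P):
    """Normalized van Dongen criterion (one common formulation, a sketch).
    0 means identical partitions; larger values mean more mismatch."""
    n_ij = contingency_matrix(G, P)     # shared-object counts
    n = n_ij.sum()
    return (2 * n - n_ij.max(axis=1).sum() - n_ij.max(axis=0).sum()) / (2 * n)
```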

58 Pair Set Index (PSI): based on the similarity of two clusters, where j is the index of the cluster paired with P_i; the total similarity is accumulated over the pairs, and the optimal pairing is found with the Hungarian algorithm (illustration: S = 100% for a perfect match, S = 50% for a half match).
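The optimal pairing step can be done with the Hungarian algorithm on the contingency table of shared objects, as in this sketch; the cluster-to-cluster similarity actually used inside PSI may differ, and shared counts are used here only to illustrate the pairing itself.

```python
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix

def optimal_pairing(G, P):
    """Pair the clusters of two partitions by maximizing the total number of
    shared objects with the Hungarian algorithm (a sketch). Returns a list
    of (cluster-of-G, cluster-of-P) index pairs."""
    n_ij = contingency_matrix(G, P)             # shared-object counts
    rows, cols = linear_sum_assignment(-n_ij)   # negate to maximize
    return list(zip(rows.tolist(), cols.tolist()))
```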

59 Pair Set Index (PSI), adjustment for chance: sizes of the clusters in P: n1 > n2 > … > nK; sizes of the clusters in G: m1 > m2 > … > mK'.

60 Properties of PSI: symmetric; normalized to the number of clusters; normalized to the size of clusters; adjusted for chance; range in [0, 1]; the number of clusters can be different.

61 Random partitioning (experiment): ground truth G has three clusters of sizes 1000, 2000 and 3000; P randomly partitions the data into two clusters, and the number of clusters in P is then varied from 1 to 20.

62 Linearity property (experiments): ground truth G has clusters of sizes 1000, 2000 and 3000. Two experiments: enlarging the first cluster step by step (P1: 1250/2000/3000, up to P4), and wrongly labeling a growing part of each cluster (P1: 900/1900/2900; P2: 800/1800/2800; P3: 500/1500/2500; P4: 334/1333/2333).

63 Cluster size imbalance (experiment): G1 with clusters 1000/2000/3000 vs. P1 with 800/1800/3000, and G2 with 1000/2000/2500 vs. P2 with 800/1800/2500.

64 Number of clusters (experiment): G1 with clusters 1000/2000 vs. P1 with 800/1800, and G2 with 1000/2000/3000 vs. P2 with 800/1800/2800.

65 Part V: Cluster-level measure

66 Comparing partitions of centroids: point-level differences vs. cluster-level mismatches (illustration).

67 Centroid index (CI) [Fränti, Rezaei, Zhao, Pattern Recognition, 2014]: given two sets of centroids C and C', map every centroid of C to its nearest neighbor in C' (C → C'), and detect the prototypes of C' that receive no mapping. The centroid index is the number of these zero mappings.
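A minimal NumPy sketch of the one-directional index as just defined, together with a symmetric variant taken as the maximum of the two directions (one natural way to symmetrize it, anticipating slide 71):

```python
import numpy as np

def centroid_index(C1, C2):
    """One-directional centroid index CI(C1 -> C2) (a sketch): map every
    centroid of C1 to its nearest centroid in C2 and count the centroids
    of C2 that receive no mapping ("orphans")."""
    C1 = np.asarray(C1, dtype=float)
    C2 = np.asarray(C2, dtype=float)
    dists = np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)          # nearest neighbour in C2
    orphan = np.ones(len(C2), dtype=bool)
    orphan[nearest] = False                 # mapped at least once
    return int(orphan.sum())

def centroid_index_sym(C1, C2):
    """Symmetric variant: maximum of the two directions (an assumption)."""
    return max(centroid_index(C1, C2), centroid_index(C2, C1))
```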

68 Example of the centroid index on data set S2: mapping counts are shown for each prototype; value 1 indicates a correctly matched cluster, and the index value equals the count of zero mappings; here CI = 2.

69 Example of the centroid index: two clusters but only one prototype allocated; three prototypes mapped into one cluster.

70 Adjusted Rand vs. centroid index: K-means: ARI = 0.88, CI = 1; Random Swap: ARI = 0.82, CI = 1; Merge-based (PNN): ARI = 0.91, CI = 0.

71 Centroid index properties: the mapping is not symmetric (C → C' ≠ C' → C); the symmetric centroid index combines the two directions. A pointwise variant, the centroid similarity index (CSI), matches the clusters based on CI and then measures the similarity of the matched clusters.

72 Centroid index (example): distance of four clustering solutions to a ground truth of 2 clusters; all four give CI = 1 and CSI = 0.50.

73 Mean squared errors (clustering quality, MSE) for KM, RKM, KM++, XM, AC, RS, GKM and GA:
Bridge: 179.76, 176.92, 173.64, 179.73, 168.92, 164.64, 164.78, 161.47
House: 6.67, 6.43, 6.28, 6.20, 6.27, 5.96, 5.91, 5.87
Miss America: 5.95, 5.83, 5.52, 5.92, 5.36, 5.28, 5.21, 5.10
House: 3.61, 3.28, 2.50, 3.57, 2.62, 2.83, -, 2.44
Birch 1: 5.47, 5.01, 4.88, 5.12, 4.73, 4.64, -
Birch 2: 7.47, 5.65, 3.07, 6.29, 2.28, -
Birch 3: 2.51, 2.07, 1.92, 2.07, 1.96, 1.86, -
S1: 19.71, 8.92, 8.93, 8.92
S2: 20.58, 13.28, 15.87, 13.44, 13.28
S3: 19.57, 16.89, 17.70, 16.89
S4: 17.73, 15.70, 15.71, 17.52, 15.70, 15.71, 15.70

74 Adjusted Rand index (ARI) for KM, RKM, KM++, XM, AC, RS, GKM and GA:
Bridge: 0.38, 0.40, 0.39, 0.37, 0.43, 0.52, 0.50, 1
House: 0.40, 0.44, 0.47, 0.43, 0.53, 1
Miss America: 0.19, 0.18, 0.20, 0.23, 1
House: 0.46, 0.49, 0.52, 0.46, 0.49, -, 1
Birch 1: 0.85, 0.93, 0.98, 0.91, 0.96, 1.00, -, 1
Birch 2: 0.81, 0.86, 0.95, 0.86, 1, 1, -, 1
Birch 3: 0.74, 0.82, 0.87, 0.82, 0.86, 0.91, -, 1
S1: 0.83, 1.00
S2: 0.80, 0.99, 0.89, 0.98, 0.99
S3: 0.86, 0.96, 0.92, 0.96
S4: 0.82, 0.93, 0.94, 0.77, 0.93

75 Normalized mutual information (NMI) for KM, RKM, KM++, XM, AC, RS, GKM and GA:
Bridge: 0.77, 0.78, 0.77, 0.80, 0.83, 0.82, 1.00
House: 0.80, 0.81, 0.82, 0.81, 0.83, 0.84, 1.00
Miss America: 0.64, 0.63, 0.64, 0.66, 1.00
House: 0.81, 0.82, 0.81, 0.82, -, 1.00
Birch 1: 0.95, 0.97, 0.99, 0.96, 0.98, 1.00, -
Birch 2: 0.96, 0.97, 0.99, 0.97, 1.00, -
Birch 3: 0.90, 0.94, 0.93, 0.96, -, 1.00
S1: 0.93, 1.00
S2: 0.90, 0.99, 0.95, 0.99, 0.93, 0.99
S3: 0.92, 0.97, 0.94, 0.97
S4: 0.88, 0.94, 0.95, 0.85, 0.94

76 Normalized van Dongen (NVD) for KM, RKM, KM++, XM, AC, RS, GKM and GA:
Bridge: 0.45, 0.42, 0.43, 0.46, 0.38, 0.32, 0.33, 0.00
House: 0.44, 0.43, 0.40, 0.37, 0.40, 0.33, 0.31, 0.00
Miss America: 0.60, 0.61, 0.59, 0.57, 0.55, 0.53, 0.00
House: 0.40, 0.37, 0.34, 0.39, 0.34, -, 0.00
Birch 1: 0.09, 0.04, 0.01, 0.06, 0.02, 0.00, -
Birch 2: 0.12, 0.08, 0.03, 0.09, 0.00, -
Birch 3: 0.19, 0.12, 0.10, 0.13, 0.06, -, 0.00
S1: 0.09, 0.00
S2: 0.11, 0.00, 0.06, 0.01, 0.04, 0.00
S3: 0.08, 0.02, 0.05, 0.00, 0.02
S4: 0.11, 0.04, 0.03, 0.13, 0.04

77 Centroid index (CI2) for KM, RKM, KM++, XM, AC, RS, GKM and GA:
Bridge: 74, 63, 58, 81, 33, 35, 0
House: 56, 45, 40, 37, 31, 22, 20, 0
Miss America: 88, 91, 67, 88, 38, 43, 36, 0
House: 43, 39, 22, 47, 26, 23, -, 0
Birch 1: 7, 3, 1, 4, 0, 0, -, 0
Birch 2: 18, 11, 4, 12, 0, 0, -, 0
Birch 3: 23, 11, 7, 10, 7, 2, -, 0
S1: 2, 0, 0, 0, 0, 0, 0, 0
S2: 2, 0, 0, 1, 0, 0, 0, 0
S3: 1, 0, 0, 0, 0, 0, 0, 0
S4: 1, 0, 0, 0, 1, 0, 0, 0

78 Centroid similarity index (CSI) for KM, RKM, KM++, XM, AC, RS, GKM and GA:
Bridge: 0.47, 0.51, 0.49, 0.45, 0.57, 0.62, 0.63, 1.00
House: 0.49, 0.50, 0.54, 0.57, 0.55, 0.63, 0.66, 1.00
Miss America: 0.32, 0.33, 0.38, 0.40, 0.42, 1.00
House: 0.54, 0.57, 0.63, 0.54, 0.57, 0.62, -, 1.00
Birch 1: 0.87, 0.94, 0.98, 0.93, 0.99, 1.00, -, 1.00
Birch 2: 0.76, 0.84, 0.94, 0.83, 1.00, -, 1.00
Birch 3: 0.71, 0.82, 0.87, 0.81, 0.86, 0.93, -, 1.00
S1: 0.83, 1.00
S2: 0.82, 1.00, 0.91, 1.00
S3: 0.89, 0.99, 0.98, 0.99
S4: 0.87, 0.98, 0.99, 0.85, 0.98

79 High quality clustering (Bridge data set):
Method                            MSE
GKM    Global K-means             164.78
RS     Random swap (5k)           164.64
GA     Genetic algorithm          161.47
RS 8M  Random swap (8M)           161.02
GAIS-2002:
       GAIS                       160.72
       + RS (1M)                  160.49
       + RS (8M)                  160.43
GAIS-2012:
       GAIS                       160.68
       + RS (1M)                  160.45
       + RS (8M)                  160.39
       + PRS                      160.33
       + RS (8M) +                160.28

80 Centroid index values between the solutions (main algorithms RS 8M, GAIS 2002 and GAIS 2012 with their + RS 1M / + RS 8M / + PRS tunings; '-' marks the solution itself):
RS 8M: -, 19, 23, 24, 23, 22
GAIS (2002): 23, -, 0, 0, 14, 15, 14, 16
+ RS 1M: 23, 0, -, 0, 14, 15, 14, 13
+ RS 8M: 23, 0, 0, -, 14, 15, 14, 13
GAIS (2012): 25, 17, 18, -, 1, 1, 1, 1
+ RS 1M: 25, 17, 18, 1, -, 0, 0, 1
+ RS 8M: 25, 17, 18, 1, 0, -, 0, 1
+ PRS: 25, 17, 18, 1, 0, 0, -, 1
+ RS 8M + PRS: 24, 17, 18, 1, 1, 1, 1, -

81 Summary of external indexes (existing measures)


83 Part VI: Efficient implementation

84 Strategies for efficient search: Brute force: solve the clustering separately for every possible number of clusters. Stepwise: as in brute force, but start from the previous solution and iterate less. Criterion-guided search: integrate the validity criterion directly into the cost function being optimized.

85 Brute force search strategy: search for each number of clusters separately (100 % of the work).

86 Stepwise search strategy: start from the previous result (30-40 % of the work).
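A simplified Python sketch of the stepwise idea: instead of re-solving each number of clusters from scratch, k-means for k clusters is warm-started from the (k−1)-cluster centroids plus one randomly chosen data point; the growing rule and parameters here are illustrative, not the exact method on the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def stepwise_search(X, k_max):
    """Stepwise search sketch: warm-start each k from the previous solution.
    Returns a dict of MSE values per number of clusters k."""
    rng = np.random.default_rng(0)
    mse = {}
    centroids = X[rng.choice(len(X), 1)]            # initial centroid, k = 1
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, init=centroids, n_init=1).fit(X)
        centroids = km.cluster_centers_
        mse[k] = km.inertia_ / len(X)
        # grow: reuse the previous centroids and add one random data point
        centroids = np.vstack([centroids, X[rng.choice(len(X), 1)]])
    return mse
```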

87 Criterion-guided search: integrate with the cost function (3-6 % of the work).

88 Stopping criterion for stepwise search strategy

89 Comparison of search strategies

90 Open questions: an iterative algorithm (K-means or Random Swap) with criterion-guided search, or a hierarchical algorithm? Potential topic for an MSc or PhD thesis!

91 Literature
1. G.W. Milligan and M.C. Cooper, "An examination of procedures for determining the number of clusters in a data set", Psychometrika, Vol. 50, 1985, pp. 159-179.
2. E. Dimitriadou, S. Dolnicar and A. Weingessel, "An examination of indexes for determining the number of clusters in binary data sets", Psychometrika, Vol. 67, No. 1, 2002, pp. 137-160.
3. D.L. Davies and D.W. Bouldin, "A cluster separation measure", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2), 224-227, 1979.
4. J.C. Bezdek and N.R. Pal, "Some new indexes of cluster validity", IEEE Transactions on Systems, Man and Cybernetics, 28(3), 302-315, 1998.
5. H. Bischof, A. Leonardis and A. Selb, "MDL principle for robust vector quantization", Pattern Analysis and Applications, 2(1), 59-72, 1999.
6. P. Fränti, M. Xu and I. Kärkkäinen, "Classification of binary vectors by using DeltaSC-distance to minimize stochastic complexity", Pattern Recognition Letters, 24(1-3), 65-73, January 2003.

92 Literature
7. G.M. James and C.A. Sugar, "Finding the number of clusters in a dataset: an information-theoretic approach", Journal of the American Statistical Association, Vol. 98, 397-408, 2003.
8. P.K. Ito, "Robustness of ANOVA and MANOVA test procedures", in: P.R. Krishnaiah (ed.), Handbook of Statistics 1: Analysis of Variance, North-Holland Publishing Company, 1980.
9. I. Kärkkäinen and P. Fränti, "Dynamic local search for clustering with unknown number of clusters", Int. Conf. on Pattern Recognition (ICPR'02), Québec, Canada, Vol. 2, 240-243, August 2002.
10. D. Pelleg and A. Moore, "X-means: extending k-means with efficient estimation of the number of clusters", Int. Conf. on Machine Learning (ICML), 727-734, San Francisco, 2000.
11. S. Salvador and P. Chan, "Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms", IEEE Int. Conf. Tools with Artificial Intelligence (ICTAI), 576-584, Boca Raton, Florida, November 2004.
12. M. Gyllenberg, T. Koski and M. Verlaan, "Classification of binary vectors by stochastic complexity", Journal of Multivariate Analysis, 63(1), 47-72, 1997.

93 Literature
14. X. Hu and L. Xu, "A comparative study of several cluster number selection criteria", Int. Conf. Intelligent Data Engineering and Automated Learning (IDEAL), 195-202, Hong Kong, 2003.
15. L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, London, 1990. ISBN 0471878766.
16. M. Halkidi, Y. Batistakis and M. Vazirgiannis, "Cluster validity methods: part 1", SIGMOD Record, Vol. 31, No. 2, pp. 40-45, 2002.
17. R. Tibshirani, G. Walther and T. Hastie, "Estimating the number of clusters in a data set via the gap statistic", J. R. Statist. Soc. B, Vol. 63, Part 2, pp. 411-423, 2001.
18. T. Lange, V. Roth, M. Braun and J.M. Buhmann, "Stability-based validation of clustering solutions", Neural Computation, Vol. 16, pp. 1299-1323, 2004.

94 Literature
19. Q. Zhao, M. Xu and P. Fränti, "Sum-of-squares based clustering validity index and significance analysis", Int. Conf. on Adaptive and Natural Computing Algorithms (ICANNGA'09), Kuopio, Finland, LNCS 5495, 313-322, April 2009.
20. Q. Zhao, M. Xu and P. Fränti, "Knee point detection on Bayesian information criterion", IEEE Int. Conf. Tools with Artificial Intelligence (ICTAI), Dayton, Ohio, USA, 431-438, November 2008.
21. W.M. Rand, "Objective criteria for the evaluation of clustering methods", Journal of the American Statistical Association, 66, 846-850, 1971.
22. L. Hubert and P. Arabie, "Comparing partitions", Journal of Classification, 2(1), 193-218, 1985.
23. P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: cluster level similarity measure", Pattern Recognition, 2014 (accepted).

