Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology

Model-based evaluation of clustering validation measures
Advisor: Dr. Hsu
Presenter: Zih-Hui Lin
Authors: Marcel Brun, Chao Sima, Jianping Hua, James Lowey, Brent Carroll, Edward Suh, Edward R. Dougherty
Pattern Recognition, 2007

Outline
─ Motivation
─ Objective
─ Model-based analysis
─ Conclusions

Motivation
─ Historically, a host of “validity” measures have been proposed for evaluating clustering results based on a single realization of the random-point-set process.
─ No doubt one would like to measure the accuracy of a cluster operator based on a single application. But is this feasible?

Objective
─ In this paper we consider a number of proposed validity measures and we examine how well they correlate with error rates across a number of clustering algorithms and random-point-set models.

Model-based analysis
1. Specification of labeled point processes
2. Generation of samples from the processes
3. Application of clustering algorithms to the data
4. Estimation of the error of several algorithms from these samples
5. Computation of the several validation measures for these algorithms on the same samples
6. Quantification of the quality of the indices

Model-based analysis (1/2)
1. Specification of labeled point processes
─ requires determining some labeled point process with sufficient variability to obtain a broad range of error values
─ avoids overly simple models that may be beneficial for some specific measures
2. Generation of samples from the processes: this step involves generating 100 sample sets (sets with their labels) for each process.
3. Application of clustering algorithms to the data

Model-based analysis (2/2)
4. Estimation of the error of several algorithms from these samples
5. Computation of the several validation measures for these algorithms on the same samples:
─ Internal indices
─ Relative indices
─ External indices
6. Quantification of the quality of the indices
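
Step 6 can be illustrated with a rank correlation between the index values and the clustering errors measured on the same samples. The sketch below uses Kendall's tau (the tau-a variant, which ignores ties in the numerator); the choice of statistic and the function name `kendall_tau` are assumptions for illustration, since the slides do not name the statistic used.

```python
from itertools import combinations


def kendall_tau(index_values, errors):
    """Kendall's tau-a between a validity index and the clustering error.

    Assumption: one (index, error) pair per generated sample set.
    Tied pairs count as neither concordant nor discordant.
    """
    n = len(index_values)
    if n != len(errors) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(index_values, errors), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)
```

A value near +1 or -1 means the index orders the samples almost exactly as the true error does (in the same or the opposite direction), while a value near 0 means the index carries little information about the error.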

Conclusions
─ When investigating the performance of a proposed clustering algorithm, it is best to consider varied models and use the true clustering error.

Data set

Clustering algorithms
Code        Algorithm       Parameters                                              Notes
km          K-means
fcm         Fuzzy C-means   b = 2                                                   a,b
so[eu,b]    SOM             Distance = Euclidean, Neighborhood = bubble             b,c
hi[eu,co]   Hierarchical    Distance = Euclidean, Linkage = Complete
hi[c,co]    Hierarchical    Distance = 1-abs(Pearson Corr), Linkage = Complete
hi[eu,si]   Hierarchical    Distance = Euclidean, Linkage = Single
hi[c,si]    Hierarchical    Distance = 1-abs(Pearson Corr), Linkage = Single
em[diag]    EM              Mixing Model = Diagonal                                 a,b

Error measure
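
This slide gives only a title, so the following is a sketch under an assumption rather than the authors' formula. It uses one common definition of clustering error for labeled samples: the fraction of misassigned points, minimized over all ways of matching cluster labels to true labels (cluster labels are arbitrary, so the best matching is used). The name `clustering_error` is hypothetical.

```python
from itertools import permutations


def clustering_error(true_labels, cluster_labels):
    """Misclassification rate minimized over label matchings (assumed definition).

    Brute force over permutations, so only suitable for a small number
    of clusters.
    """
    n = len(true_labels)
    if n == 0 or n != len(cluster_labels):
        raise ValueError("label sequences must be non-empty and equally long")
    true_ids = list(set(true_labels))
    cluster_ids = list(set(cluster_labels))
    # Pad the shorter side with dummy labels so every cluster label
    # can be matched to something even when the counts differ.
    k = max(len(true_ids), len(cluster_ids))
    true_ids += [object() for _ in range(k - len(true_ids))]
    cluster_ids += [object() for _ in range(k - len(cluster_ids))]
    best = n
    for perm in permutations(true_ids):
        mapping = dict(zip(cluster_ids, perm))
        wrong = sum(1 for t, c in zip(true_labels, cluster_labels)
                    if mapping[c] != t)
        best = min(best, wrong)
    return best / n
```

With this definition a clustering that recovers the true groups under different label names has error 0, which is what makes the error usable as the reference against which validity indices are compared.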

Validation measures
─ Internal validation is based on calculating properties of the resulting clusters.
─ Relative validation is based on comparisons of partitions generated by the same algorithm with different parameters or different subsets of the data.
─ External validation compares the partition generated by the clustering algorithm with a given partition of the data.

Internal validation
─ Trace criterion, determinant criterion and invariant criterion
[Figure panels: Model 1, Model 2, Model 3, Model 4, Model 5]

Internal validation – Dunn’s indices
─ The ratio between the minimum distance between two clusters and the size of the largest cluster
[Figure panels: Model 2, Model 5, Model 1]
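
A minimal sketch of one Dunn-type index follows. Dunn's indices differ in how "distance between two clusters" and "size of a cluster" are defined; this version assumes the smallest pairwise Euclidean distance between clusters and the largest pairwise distance within a cluster (its diameter). The function name `dunn_index` is hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def dunn_index(points, labels):
    """Min between-cluster distance divided by max cluster diameter."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    # Assumed cluster distance: closest pair of points across two clusters.
    min_between = min(
        dist(p, q)
        for ca, cb in combinations(clusters.values(), 2)
        for p in ca
        for q in cb
    )
    # Assumed cluster size: diameter, the farthest pair inside one cluster.
    max_diameter = max(
        (dist(p, q) for c in clusters.values() for p, q in combinations(c, 2)),
        default=0.0,
    )
    if max_diameter == 0:
        raise ValueError("all clusters have zero diameter")
    return min_between / max_diameter


# Two tight, well-separated clusters give a large value.
print(dunn_index([(0, 0), (0, 1), (5, 0), (5, 1)], [0, 0, 1, 1]))  # 5.0
```

Larger values indicate compact clusters that are far apart relative to their size.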

Internal validation – Silhouette index
─ The silhouette is the average, over all clusters, of the silhouette width of their points
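
The averaging described above can be sketched as follows, assuming the usual silhouette width s(x) = (b − a) / max(a, b), where a is the mean distance from x to the other points of its own cluster and b is the smallest mean distance from x to the points of any other cluster. Euclidean distance and a width of 0 for single-point clusters are further assumptions; `silhouette_index` is a hypothetical name.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_index(points, labels):
    """Average over clusters of the mean silhouette width of their points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    def mean_dist(p, members):
        return sum(dist(p, q) for q in members) / len(members)

    cluster_scores = []
    for label, members in clusters.items():
        widths = []
        for i, p in enumerate(members):
            others = members[:i] + members[i + 1:]
            if not others:  # single-point cluster: width 0 by convention
                widths.append(0.0)
                continue
            a = mean_dist(p, others)
            b = min(mean_dist(p, m)
                    for other, m in clusters.items() if other != label)
            widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
        cluster_scores.append(sum(widths) / len(widths))
    return sum(cluster_scores) / len(cluster_scores)
```

The result lies in [-1, 1]; values close to 1 indicate points that are much closer to their own cluster than to the nearest other cluster.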

Internal validation – Hubert’s correlation with distance matrix
─ Let M = n(n − 1)/2 be the number of pairs of different vectors.
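
As a rough illustration, the index can be computed as the Pearson correlation, taken over the M = n(n − 1)/2 pairs, between the distance of each pair and an indicator of whether the pair is split across clusters. The Euclidean distance, the coding of the indicator (1 for different clusters, 0 for the same cluster) and the name `hubert_internal` are assumptions made for this sketch.

```python
from itertools import combinations
from math import dist, sqrt


def hubert_internal(points, labels):
    """Correlation between pairwise distances and a 'different cluster' indicator."""
    pairs = list(combinations(range(len(points)), 2))  # M = n(n - 1)/2 pairs
    if not pairs:
        raise ValueError("need at least two points")
    d = [dist(points[i], points[j]) for i, j in pairs]
    y = [0.0 if labels[i] == labels[j] else 1.0 for i, j in pairs]
    m = len(pairs)
    mean_d, mean_y = sum(d) / m, sum(y) / m
    cov = sum((a - mean_d) * (b - mean_y) for a, b in zip(d, y))
    var_d = sum((a - mean_d) ** 2 for a in d)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_d == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(var_d * var_y)
```

With this coding, a partition that separates distant points and groups nearby ones yields a value close to 1.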

Relative validation indices – Figure of merit
─ When used on microarray data, the clusters represent different biological groups, and therefore, points (genes) in the same cluster will possess similar pattern vectors (expression profiles) for additional features (arrays).
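
One way to turn this idea into a number is a leave-one-feature-out scheme: cluster the points without feature e, then measure how tightly the left-out feature groups inside the resulting clusters. The sketch below assumes the root-mean-square deviation from the cluster mean, summed over all left-out features; the slide does not give the formula, and `figure_of_merit`, `cluster_fn` and the toy clusterer `split_on_sign` are hypothetical names.

```python
from math import sqrt


def figure_of_merit(points, cluster_fn):
    """Aggregate leave-one-feature-out figure of merit (assumed RMS form).

    points: list of equal-length tuples.
    cluster_fn: any callable mapping a list of points to a list of labels.
    """
    n, n_features = len(points), len(points[0])
    total = 0.0
    for e in range(n_features):
        reduced = [p[:e] + p[e + 1:] for p in points]  # drop feature e
        labels = cluster_fn(reduced)
        held_out = {}
        for p, label in zip(points, labels):
            held_out.setdefault(label, []).append(p[e])
        squared = 0.0
        for values in held_out.values():
            mean = sum(values) / len(values)
            squared += sum((v - mean) ** 2 for v in values)
        total += sqrt(squared / n)
    return total


def split_on_sign(rows):
    """Toy clusterer for illustration only: split on the sign of the first feature."""
    return [0 if row[0] < 0 else 1 for row in rows]
```

Lower values mean the clusters found without a feature still predict that feature well, which is the behaviour the slide describes for genes in the same biological group.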

Relative validation indices – Stability
─ The ability of a clustered data set to predict the clustering of another data set sampled from the same source.

External validation indices – Hubert’s correlation
─ The Hubert statistic is based on the fact that the more similar the partitions, the more similar the matrices would be, and this similarity can be measured by their correlation.
─ If x_i and x_j belong to the same cluster, d(i,j) = 1.
─ If x_i and x_j belong to different clusters, d(i,j) = 0.
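
The construction can be sketched directly: build d(i,j) for the true partition and for the clustering partition over all pairs i < j, then correlate the two. Pearson correlation over the upper triangle is an assumption here, and `comembership` and `hubert_external` are hypothetical names.

```python
from itertools import combinations
from math import sqrt


def comembership(labels):
    """d(i,j) for all pairs i < j: 1 if same cluster, 0 otherwise."""
    return [1 if labels[i] == labels[j] else 0
            for i, j in combinations(range(len(labels)), 2)]


def hubert_external(true_labels, cluster_labels):
    """Pearson correlation between the two co-membership matrices."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must be equally long")
    x, y = comembership(true_labels), comembership(cluster_labels)
    if not x:
        raise ValueError("need at least two points")
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined when a partition has one cluster or only singletons")
    return cov / sqrt(var_x * var_y)
```

Because only co-membership is compared, the result does not depend on how the clusters are named.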

External validation indices – Rand statistic, Jaccard coefficient and Fowlkes and Mallows index
                                   A (true partition)
                                   True    False
B (clustering partition)   True    a       c
                           False   b       d
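
Reading "True" as "the pair of points is in the same cluster" (an interpretation of the table, not stated on the slide), the three indices follow from the pair counts with their standard formulas: Rand = (a + d)/M, Jaccard = a/(a + b + c) and Fowlkes and Mallows = a/√((a + b)(a + c)), where M = a + b + c + d. The function names below are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pair_counts(true_labels, cluster_labels):
    """Count pairs by co-membership in A (true) and B (clustering)."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must be equally long")
    a = b = c = d = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_a = true_labels[i] == true_labels[j]
        same_b = cluster_labels[i] == cluster_labels[j]
        if same_a and same_b:
            a += 1  # together in both partitions
        elif same_a:
            b += 1  # together in A only
        elif same_b:
            c += 1  # together in B only
        else:
            d += 1  # separated in both partitions
    return a, b, c, d


def external_indices(true_labels, cluster_labels):
    """Return (Rand, Jaccard, Fowlkes-Mallows) for two partitions."""
    a, b, c, d = pair_counts(true_labels, cluster_labels)
    m = a + b + c + d
    if m == 0 or a + b == 0 or a + c == 0:
        raise ValueError("indices undefined: a partition has no co-clustered pair")
    rand = (a + d) / m
    jaccard = a / (a + b + c)
    fowlkes_mallows = a / sqrt((a + b) * (a + c))
    return rand, jaccard, fowlkes_mallows
```

The Rand statistic rewards pairs separated in both partitions (d), whereas the Jaccard and Fowlkes and Mallows indices ignore them, so the latter two are less inflated when there are many small clusters.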