Estimating the Number of Clusters (k)
Clustering error cannot be used as a criterion for deciding on the number of clusters: the minimum clustering error never increases as k grows, and it reaches zero when every point is its own cluster.
Selection approaches:
- Use a criterion to select among the solutions obtained for several values of k (k-means or GMMs are used)
- Criterion(k) = Training Objective(k) + Model Complexity(k)
- Model complexity:
  - Bayesian arguments (BIC): L(k) - (M(k)/2) ln N, where L(k) is the log-likelihood of the k-cluster solution, M(k) the number of free parameters, and N the number of data points
  - Information theory (MDL, MML)
- Other criteria (available in MATLAB):
  - Variance ratio criterion (VRC)
  - Davies-Bouldin criterion
  - Silhouette criterion
  - Gap statistic
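
As a rough illustration of selecting k with a penalized criterion, the sketch below scores the solutions for k = 1..6 on a toy 1D dataset. Everything specific here is an assumption made for the example and not part of the slides: the data, the use of an exact 1D dynamic program in place of k-means, the shared-variance Gaussian log-likelihood with hard assignments, BIC written as L(k) - (M(k)/2) ln N with M(k) = 2k in 1D, and the hypothetical helper names `optimal_1d_partitions` and `bic`.

```python
import math

def optimal_1d_partitions(xs, k_max):
    """Exact minimum clustering error (SSE) in 1D for k = 1..k_max.

    Returns {k: (sse, sizes)}; sizes are cluster sizes from left to right.
    """
    xs = sorted(xs)
    n = len(xs)
    # Prefix sums give the SSE of any contiguous segment in O(1).
    s1, s2 = [0.0] * (n + 1), [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def seg_sse(i, j):  # SSE of xs[i:j] around its own mean
        s = s1[j] - s1[i]
        return max(0.0, (s2[j] - s2[i]) - s * s / (j - i))

    inf = float("inf")
    # cost[k][j]: best SSE for the first j points using exactly k clusters
    cost = [[inf] * (n + 1) for _ in range(k_max + 1)]
    cut = [[0] * (n + 1) for _ in range(k_max + 1)]
    cost[0][0] = 0.0
    for k in range(1, k_max + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + seg_sse(i, j)
                if c < cost[k][j]:
                    cost[k][j], cut[k][j] = c, i
    results = {}
    for k in range(1, k_max + 1):
        sizes, j = [], n
        for kk in range(k, 0, -1):  # walk the stored cut points backwards
            i = cut[kk][j]
            sizes.append(j - i)
            j = i
        results[k] = (cost[k][n], sizes[::-1])
    return results

def bic(sse, sizes):
    """L(k) - (M(k)/2) ln N for 1D Gaussians with one shared variance (higher is better)."""
    n, k = sum(sizes), len(sizes)
    var = max(sse / (n - k), 1e-12)  # guard against log(0)
    loglik = (sum(m * math.log(m / n) for m in sizes)
              - 0.5 * n * math.log(2 * math.pi * var)
              - 0.5 * (n - k))
    n_params = 2 * k  # (k - 1) mixing weights + k means + 1 variance
    return loglik - 0.5 * n_params * math.log(n)

# Toy data (assumed): three tight groups of 20 points around 0, 10 and 20.
offsets = [(i - 9.5) * 0.05 for i in range(20)]
data = [c + o for c in (0.0, 10.0, 20.0) for o in offsets]

results = optimal_1d_partitions(data, k_max=6)
scores = {k: bic(sse, sizes) for k, (sse, sizes) in results.items()}
best_k = max(scores, key=scores.get)
for k in sorted(results):
    print(k, round(results[k][0], 4), round(scores[k], 2))
print("selected k:", best_k)
```

On this toy data the clustering error keeps shrinking as k grows, while the penalized score is highest at k = 3.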
Estimating the Number of Clusters (k)
Optimal solutions with respect to clustering error do not always reveal the true clustering structure.
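
A small numeric check can make this concrete. The sketch below uses an assumed toy dataset (one large, spread-out group and one small, tight group) and compares the clustering error of the intended two-cluster partition with that of an alternative partition that cuts through the large group. If the alternative has lower error, the error-optimal solution for k = 2 cannot be the intended one. The data and the names `sse` and `clustering_error` are illustrative assumptions.

```python
def sse(cluster):
    """Sum of squared distances of the points to their cluster mean."""
    mean = sum(cluster) / len(cluster)
    return sum((x - mean) ** 2 for x in cluster)

def clustering_error(partition):
    """Clustering error of a partition: total SSE over its clusters."""
    return sum(sse(c) for c in partition)

# Assumed toy data: a large, wide group and a small, tight group.
big = [0.1 * i for i in range(100)]          # 100 points spread over [0, 9.9]
small = [12.0 + 0.01 * j for j in range(5)]  # 5 points packed near 12

intended = [big, small]                      # the structure we would call "true"
alternative = [big[:60], big[60:] + small]   # cuts through the large group

err_intended = clustering_error(intended)
err_alternative = clustering_error(alternative)
print("intended:", round(err_intended, 3))
print("alternative:", round(err_alternative, 3))
```

Here the alternative partition has less than half the clustering error of the intended one, because splitting the wide group removes far more squared distance than isolating the small group does.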
Estimating the Number of Clusters (k)
Top-down (incremental):
- Start from one component
- Iteratively add components (usually through splitting)
- Stop when no component can be split further based on a criterion (one cluster is preferable to two clusters)
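
The loop above can be sketched as follows for 1D data. The slides do not fix the split criterion, so this example assumes one possible choice: a cluster is split when the BIC of its best two-way split exceeds the BIC of keeping it whole. The exact 1D two-way split, the shared-variance Gaussian model, the `min_size` parameter, and all function names are assumptions for illustration.

```python
import math

def sse(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def best_two_way_split(xs):
    """Exact 2-means in 1D: try every cut of the sorted points, keep the lowest SSE."""
    xs = sorted(xs)
    best = None
    for cut in range(1, len(xs)):
        left, right = xs[:cut], xs[cut:]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, left, right)
    return best[1], best[2]

def bic(parts):
    """L - (M/2) ln N for 1D Gaussians with one shared variance (higher is better)."""
    n, k = sum(len(p) for p in parts), len(parts)
    var = max(sum(sse(p) for p in parts) / (n - k), 1e-12)  # guard against log(0)
    loglik = (sum(len(p) * math.log(len(p) / n) for p in parts)
              - 0.5 * n * math.log(2 * math.pi * var)
              - 0.5 * (n - k))
    return loglik - 0.5 * (2 * k) * math.log(n)  # M = 2k parameters in 1D

def top_down(xs, min_size=4):
    """Start from one cluster; keep splitting while two clusters beat one."""
    done, queue = [], [sorted(xs)]
    while queue:
        cluster = queue.pop()
        if len(cluster) >= 2 * min_size:
            left, right = best_two_way_split(cluster)
            if (min(len(left), len(right)) >= min_size
                    and bic([left, right]) > bic([cluster])):
                queue += [left, right]  # accept the split and examine both halves
                continue
        done.append(cluster)            # one cluster is preferable to two
    return sorted(done)

# Assumed toy data: three tight groups of 20 points around 0, 4 and 20.
offsets = [(i - 9.5) * 0.05 for i in range(20)]
data = [c + o for c in (0.0, 4.0, 20.0) for o in offsets]

clusters = top_down(data)
print(len(clusters), [len(c) for c in clusters])
```

Because each decision is local (one cluster versus its two halves), the outcome depends on the order and quality of the splits; this is a design trade-off of top-down schemes rather than a property of this particular sketch.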
Estimating the Number of Clusters (k)
Top-down (incremental) methods:
- X-means (BIC criterion for 2 clusters) (Pelleg & Moore, ICML 2000)
- G-means (1D test for Gaussianity, PCA-based projection) (Hamerly & Elkan, NIPS 2003)
- Dip-means (test for unimodality) (Kalogeratos & Likas, NIPS 2012)
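
To give a feel for a Gaussianity-based split test of the kind G-means relies on, the sketch below computes a corrected Anderson-Darling statistic on 1D samples and compares it with a threshold. This is only an illustration under stated assumptions: the data are already one-dimensional, so the projection step is omitted; the threshold `CRITICAL` is an assumed value for a very strict significance level; and the function name and toy samples are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def anderson_darling(xs):
    """Corrected Anderson-Darling statistic for normality (mean and variance estimated)."""
    n = len(xs)
    mu, sigma = mean(xs), stdev(xs)
    std_normal = NormalDist()
    # Standard normal CDF of the standardized, sorted sample;
    # clamped away from 0 and 1 so the logarithms stay finite.
    cdf = [min(max(std_normal.cdf((x - mu) / sigma), 1e-12), 1 - 1e-12)
           for x in sorted(xs)]
    total = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
                for i in range(n))
    a2 = -n - total / n
    return a2 * (1 + 4 / n - 25 / n ** 2)  # small-sample correction

CRITICAL = 1.8692  # assumed threshold for a strict significance level; tune as needed

# Assumed toy samples: evenly spaced normal quantiles vs. two well-separated groups.
gaussian_like = [NormalDist().inv_cdf((i + 0.5) / 100) for i in range(100)]
bimodal = [-5 + 0.01 * i for i in range(50)] + [5 + 0.01 * i for i in range(50)]

for name, sample in (("gaussian_like", gaussian_like), ("bimodal", bimodal)):
    stat = anderson_darling(sample)
    decision = "split" if stat > CRITICAL else "keep"
    print(name, round(stat, 3), decision)
```

A statistic above the threshold rejects Gaussianity, which in a top-down scheme would be read as evidence that the cluster should be split.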