1
K-means properties Pasi Fränti 11.10.2017
K-means properties on six clustering benchmark datasets Pasi Fränti and Sami Sieranoja Algorithms, 2017.
2
Goal of k-means
Input N points: X = {x1, x2, …, xN}
Output partition and k centroids: P = {p1, p2, …, pk}, C = {c1, c2, …, ck}
Objective function: SSE = sum of squared errors
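The objective above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a partition stored as one centroid index per point, so that SSE = Σ ‖x_i − c_{p_i}‖². The function name `sse` and the toy data are illustrative and do not come from the slides.

```python
def sse(points, centroids, partition):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for x, p in zip(points, partition):
        c = centroids[p]
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total

# Toy data (hypothetical): two clusters of two points each.
X = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
C = [(1.0, 0.0), (11.0, 0.0)]
P = [0, 0, 1, 1]  # index of the centroid assigned to each point

print(sse(X, C, P))  # 4.0: every point lies at distance 1 from its centroid
```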
3
Goal of k-means
Input N points: X = {x1, x2, …, xN}
Output partition and k centroids: P = {p1, p2, …, pk}, C = {c1, c2, …, ck}
Objective function: SSE = sum of squared errors
Assumptions:
- SSE is suitable
- k is known
4
K-means algorithm
X = Data set, C = Cluster centroids, P = Partition

K-Means(X, C) → (C, P)
REPEAT
    Cprev ← C;
    FOR i=1 TO N DO pi ← FindNearest(xi, C);              (Assignment step)
    FOR j=1 TO k DO cj ← Average of all xi with pi = j;   (Centroid step)
UNTIL C = Cprev
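The pseudocode above translates almost line by line into Python. The following is a sketch rather than the authors' implementation: it assumes Euclidean distance, points given as tuples, and it keeps the old centroid when a cluster becomes empty (the slide does not say how that case is handled).

```python
import math

def find_nearest(x, centroids):
    """Index of the centroid closest to x (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(x, centroids[j]))

def k_means(points, centroids):
    """Iterate assignment and centroid steps until the centroids stop changing."""
    centroids = [tuple(c) for c in centroids]
    while True:
        prev = centroids
        # Assignment step: each point goes to its nearest centroid.
        partition = [find_nearest(x, centroids) for x in points]
        # Centroid step: each centroid becomes the mean of its points.
        new = []
        for j in range(len(centroids)):
            members = [x for x, p in zip(points, partition) if p == j]
            if members:
                new.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new.append(centroids[j])  # assumption: an empty cluster keeps its centroid
        centroids = new
        if centroids == prev:
            return centroids, partition

X = [(0, 0), (2, 0), (10, 0), (12, 0)]
C, P = k_means(X, [(0, 0), (2, 0)])  # deliberately poor initial centroids
print(C, P)
```

On this toy input the centroids move from the two leftmost points to the means of the left and right pairs in two iterations.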
5
K-means optimization steps
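Assignment step: $p_i \leftarrow \arg\min_{1 \le j \le k} \lVert x_i - c_j \rVert^2$ for $i = 1, \dots, N$
Centroid step: $c_j \leftarrow \frac{1}{|\{i : p_i = j\}|} \sum_{i : p_i = j} x_i$ for $j = 1, \dots, k$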
6
Problems of k-means Distance of clusters
K-means cannot move centroids between clusters that are far apart
7
Problems of k-means Dependency on the initial solution
After k-means:
8
K-means performance: how is it affected by the following?
1. Overlap
2. Number of clusters
3. Dimensionality
4. Unbalance of cluster sizes
9
Basic Benchmark
10
Data sets statistics
Dataset | Varying | Size | Clusters | Per cluster
A | Number of clusters | 3000 – 7500 | 20 – 50 | 150
S | Overlap | 5000 | 15 | 333
Dim | Dimensions | 1024 | 16 | 64
G2 | Dimensions + overlap | 2048 | 2 |
Birch | Structure | 100,000 | 100 | 1000
Unbalance | Balance | 6500 | 8 | 100 – 2000
11
A sets
Spherical clusters
Number of clusters changing from k=20 to 50: A1 (K=20), A2 (K=35), A3 (K=50)
Subsets of each other: A1 ⊂ A2 ⊂ A3
Other parameters fixed:
- Cluster size = 150
- Deviation = 1402
- Overlap = 0.30
- Dimensionality = 2
12
S sets
K=15 Gaussian clusters (few truncated)
Overlap increases: S1 9%, S2 22%, S3 41%, S4 44%
Least overlap in S1; strong overlap in S4, but the clusters are still recognizable
13
Unbalance
K=8, areas well-separated
Dense clusters: st.dev=2043, three clusters of 2000 points
Sparse clusters: st.dev=6637, five clusters of 100 points
*Correct clustering can be obtained by minimizing SSE
14
DIM sets
K=16
Well-separated clusters in high-dimensional spaces
Dimensions vary: 32, 64, 128, 256, 512, 1024
Example shown: DIM32
15
G2 Datasets
K=2
Dataset name: G2-dim-sd (examples: G2-2-30, G2-2-50, G2-2-70)
Centroid 1: [500,500, ...]
Centroid 2: [600,600, ...]
Dimensions: 2, 4, 8, 16, …
St.dev.: 10, 20, 30, …
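A G2-style dataset can be imitated with two Gaussian clusters centered at 500 and 600 in every dimension. The sketch below is not the original generator: the function name `make_g2`, the equal split of 1024 points per cluster (2048 points in total), and the seeding are assumptions made for illustration.

```python
import random

def make_g2(dim, sd, per_cluster=1024, seed=0):
    """Two Gaussian clusters with centroids [500, ...] and [600, ...].

    per_cluster=1024 is an assumption (2048 points split evenly).
    """
    rng = random.Random(seed)
    points, labels = [], []
    for label, center in enumerate((500.0, 600.0)):
        for _ in range(per_cluster):
            points.append(tuple(rng.gauss(center, sd) for _ in range(dim)))
            labels.append(label)
    return points, labels

# Imitates the naming scheme G2-dim-sd with dim=2 and sd=30.
g2_points, g2_labels = make_g2(dim=2, sd=30)
print(len(g2_points), len(g2_points[0]))
```

Increasing `sd` makes the two clusters overlap more, while increasing `dim` pushes them apart again because the centroids differ in every coordinate.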
16
Birch
Birch1: regular 10x10 grid, constant variance
Birch2: y(x) = amplitude * sin(2*π*frequency*x + phaseshift) + offset
Parameters: offset, amplitude, phaseshift, frequency
17
Birch2 subsets
B2-random: random subsampling
B2-sub: cutting off the last cluster
N=100 000, k=100
N=99 000, k=99
N=98 000, …
…, k=3
N=2 000, k=2
N=1 000, k=1
18
Properties
19
Measured properties
- Overlap
- Contrast
- Intrinsic dimensionality
- H-index
- Distance profiles
20
Misclassification probability
Points from blue cluster that are closer to red centroid.
Points from red cluster that are closer to blue centroid.
Points = 2048
Incorrect = 20
Overlap = 20 / 2048 ≈ 1.0 %
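The misclassification count described above can be sketched as follows. The function name and the one-dimensional toy data are hypothetical; the code only assumes that each point has a true cluster label and that the cluster centroids are given.

```python
import math

def misclassification_rate(points, labels, centroids):
    """Fraction of points whose nearest centroid is not the centroid of their own cluster."""
    wrong = 0
    for x, label in zip(points, labels):
        nearest = min(range(len(centroids)), key=lambda j: math.dist(x, centroids[j]))
        if nearest != label:
            wrong += 1
    return wrong / len(points)

# Toy data (hypothetical): cluster 0 has mean 3, cluster 1 has mean 9.
mc_points = [(0,), (2,), (7,), (10,), (12,), (5,)]
mc_labels = [0, 0, 0, 1, 1, 1]
mc_centroids = [(3.0,), (9.0,)]

# Points 7 and 5 are closer to the other cluster's centroid: 2 of 6 points.
mc_rate = misclassification_rate(mc_points, mc_labels, mc_centroids)
print(mc_rate)
```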
21
Overlap
Points in blue cluster whose red neighbor is closer than its centroid.
Points in red cluster whose blue neighbor is closer than its centroid.
Points = 2048
Evidence = 332
Overlap = 332 / 2048 ≈ 16 %
d1 = distance to nearest centroid
d2 = distance to 2nd nearest
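One way to read the definition above: a point counts as evidence of overlap when its nearest point from a different cluster is closer than the centroid of its own cluster. The sketch below follows that reading only; it is a brute-force O(N²) illustration with hypothetical names, and it assumes at least two clusters are present.

```python
import math

def overlap(points, labels, centroids):
    """Fraction of points that have a neighbor from another cluster closer than their own centroid."""
    evidence = 0
    for x, label in zip(points, labels):
        d_centroid = math.dist(x, centroids[label])
        # Distance to the nearest point that belongs to a different cluster.
        d_other = min(math.dist(x, y) for y, other in zip(points, labels) if other != label)
        if d_other < d_centroid:
            evidence += 1
    return evidence / len(points)

# Toy data (hypothetical): cluster 0 has mean 3, cluster 1 has mean 9.
ov_points = [(0,), (2,), (7,), (10,), (12,), (5,)]
ov_labels = [0, 0, 0, 1, 1, 1]
ov_centroids = [(3.0,), (9.0,)]

# Points 7 and 5 each have a foreign neighbor at distance 2, closer than their centroid at 4.
ov_rate = overlap(ov_points, ov_labels, ov_centroids)
print(ov_rate)
```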
22
Contrast
23
Intrinsic dimensionality
Computed from the average of distances and the variance of distances
[Figure: intrinsic dimensionality of Unbalance, DIM, Birch1, Birch2, S sets, A1, A2, A3; values shown: 0.4, 2.6, 8.3, 1.5, 2.0, 2.5, 2.2]
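The slide names the two ingredients, the average and the variance of distances. A common way to combine them is the estimate of Chavez and Navarro, ρ = μ² / (2σ²), computed over all pairwise distances. The sketch below assumes that formula and uses the population variance; both choices are assumptions, since the slide does not spell out the exact formula.

```python
import math
from itertools import combinations
from statistics import mean, pvariance

def intrinsic_dimensionality(points):
    """Estimate rho = mu^2 / (2 * sigma^2) from all pairwise Euclidean distances."""
    distances = [math.dist(a, b) for a, b in combinations(points, 2)]
    mu = mean(distances)
    sigma2 = pvariance(distances)  # assumption: population variance
    return mu ** 2 / (2 * sigma2)

# Toy data (hypothetical): four equally spaced points on a line.
id_value = intrinsic_dimensionality([(0.0,), (1.0,), (2.0,), (3.0,)])
print(id_value)
```

The estimate grows when distances concentrate around their mean, which is the typical behavior of high-dimensional data.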
24
H-index
Hubness values sorted by rank:
Rank: 1 2 3 4 5 6 7
Hub: 4 2 2 2 2 2 0
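Assuming the H-index is the usual Hirsch-style index applied to hubness values (the largest h such that at least h points have hubness of at least h), it can be computed from the ranked list above. The function name is illustrative, and the definition is an assumption that the slide does not state explicitly.

```python
def h_index(hubness):
    """Largest h such that at least h items have a hubness value >= h."""
    ranked = sorted(hubness, reverse=True)
    h = 0
    for rank, value in enumerate(ranked, start=1):
        if value >= rank:
            h = rank
        else:
            break
    return h

# Hubness values from the slide, ranks 1 to 7.
print(h_index([4, 2, 2, 2, 2, 2, 0]))  # 2
```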
25
Distance profiles
Data that contains clusters tends to have two peaks:
- Local distances: distances inside the clusters
- Global distances: distances across different clusters
[Figure: distance profiles of A1, A2, A3, S1, S2, S3, S4, Birch1, Birch2, DIM32, Unbalance]
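A distance profile is a histogram of all pairwise distances. The sketch below uses a hypothetical helper with equal-width bins and assumes the data contains at least two distinct points. On two tight, well-separated clusters it shows the two peaks described above.

```python
import math
from itertools import combinations

def distance_profile(points, bins=10):
    """Histogram of all pairwise distances, using equal-width bins up to the maximum distance."""
    distances = [math.dist(a, b) for a, b in combinations(points, 2)]
    top = max(distances)  # assumption: not all points are identical, so top > 0
    hist = [0] * bins
    for d in distances:
        hist[min(int(d / top * bins), bins - 1)] += 1
    return hist

# Toy data (hypothetical): two tight clusters about 100 units apart.
dp_points = [(0, 0), (1, 0), (0, 1), (100, 0), (101, 0), (100, 1)]
profile = distance_profile(dp_points)
print(profile)  # [6, 0, 0, 0, 0, 0, 0, 0, 0, 9]
```

The first bin holds the six local distances and the last bin holds the nine global ones.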
26
Distance profiles: G2 datasets
[Figure: G2 distance profiles as dimension increases (D = 2, 4, 8, 16, 32, 1024) and as overlap increases (sd = 10, 20, 30, 40, 50, 60, shown for D = 2 and D = 128)]
27
Summary of the properties
28
G2 overlap
[Figure annotations: overlap decreases; overlap increases]
29
G2 contrast
[Figure annotations: contrast decreases; contrast decreases]
30
G2 intrinsic dimensionality
[Figure annotations: ID increases (if overlap); ID increases; most significant]
31
G2 H-index
[Figure annotations: H-index increases; no change; most significant]
32
Evaluation
33
Internal measures
- Sum of squared distances (SSE)
- Normalized mean square error (nMSE)
- Approximation ratio
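The slide lists the internal measures without their formulas. A common convention, assumed here and not confirmed by the slide, is nMSE = SSE / (N · D) for N points in D dimensions, and an approximation ratio of (SSE − SSE_opt) / SSE_opt relative to the SSE of a reference (optimal or best known) solution. The function names are illustrative.

```python
def nmse(sse, n_points, n_dims):
    """Normalized mean square error: SSE divided by the number of scalar values (assumed definition)."""
    return sse / (n_points * n_dims)

def approximation_ratio(sse, sse_opt):
    """Relative excess of SSE over a reference solution's SSE (assumed definition)."""
    return (sse - sse_opt) / sse_opt

# Hypothetical numbers: 4 points in 2 dimensions.
print(nmse(4.0, n_points=4, n_dims=2))          # 0.5
print(approximation_ratio(6.0, sse_opt=4.0))    # 0.5, i.e. 50% worse than the reference
```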
34
External measures: Centroid index
P. Fränti, M. Rezaei, Q. Zhao, "Centroid index: cluster level similarity measure", Pattern Recognition, 2014
[Figure: example with CI=4, showing missing centroids and too many centroids]
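The centroid index counts cluster-level errors: each centroid of one solution is mapped to its nearest centroid in the other solution, and the centroids that receive no mapping ("orphans") are counted. CI is the larger of the two orphan counts. The sketch below follows that description with hypothetical function names and toy centroids.

```python
import math

def orphans(a, b):
    """Number of centroids in b that are not the nearest neighbor of any centroid in a."""
    mapped = {min(range(len(b)), key=lambda j: math.dist(c, b[j])) for c in a}
    return len(b) - len(mapped)

def centroid_index(c1, c2):
    """Symmetric centroid index: the larger orphan count of the two mapping directions."""
    return max(orphans(c1, c2), orphans(c2, c1))

# Toy data (hypothetical): the solution puts two centroids in the first cluster
# and none in the middle one.
truth = [(0, 0), (10, 0), (20, 0)]
solution = [(0, 1), (0, -1), (20, 0)]
print(centroid_index(solution, truth))  # 1
```

A value of CI=0 means the two solutions have the same cluster-level structure, which is the success criterion used for the success rate.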
35
External measures: Success rate
Example of six runs: CI=1, CI=2, CI=1, CI=0, CI=2, CI=2
One run out of six reaches CI=0, so the success rate is 17%
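The success rate is the share of runs that end with CI=0. A minimal sketch using the six CI values shown on the slide (the function name is illustrative):

```python
def success_rate(ci_values):
    """Fraction of runs whose centroid index is zero."""
    return sum(1 for ci in ci_values if ci == 0) / len(ci_values)

runs = [1, 2, 1, 0, 2, 2]  # CI values of the six example runs
print(f"{success_rate(runs):.0%}")  # 17%
```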
36
Results
37
Summary of results
38
Dependency on overlap: S datasets
Success rates and CI-values as overlap increases:
Dataset | Success | CI
S1 | 3% | 1.8
S2 | 11% | 1.4
S3 | 12% | 1.3
S4 | 26% | 0.9
39
Dependency on overlap G2 datasets
40
Why does overlap help?
[Figure: k-means with overlap = 7% and overlap = 22%; annotation: 13 iterations]
41
Main observation 1. Overlap Overlap is good!
42
Dependency on clusters (k): A datasets
Number of clusters increases:
Dataset | K | Success | CI | Relative CI
A1 | 20 | 1% | 2.5 | 13%
A2 | 35 | 0% | 4.5 | 13%
A3 | 50 | 0% | 6.6 | 13%
43
Dependency on clusters (k)
44
Dependency on data size (N)
45
Main observation 2. Number of clusters Linear increase with k!
46
Dependency on dimensions: DIM datasets
Dimensions increase:
Dimensions | 32 | 64 | 128 | 256 | 512 | 1024
CI | 3.6 | 3.5 | 3.8 | 3.8 | 3.9 | 3.7
Success rate: 0%
47
Dependency on dimensions G2 datasets
Success degrades Success improves
48
Lack of overlap is the cause!
Correlation: 0.91 Success
49
Main observation 3. Dimensionality No direct effect!
50
Effect of unbalance DIM datasets
Success: 0%
Average CI: 3.9
Problem originates from the random initialization.
51
Main observation 4. Unbalance of cluster sizes Unbalance is bad!
52
Improving k-means
53
Better initialization technique
Simple initializations:
- Random centroids (Random) [Forgy][MacQueen]
- Furthest point heuristic (max) [Gonzalez]
More complex:
- K-means++ [Vassilvitskii]
- Luxburg [Luxburg]
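Three of the initializations listed above can be sketched in a few lines each. These are simplified illustrations with hypothetical function names, not the implementations used in the experiments. They assume Euclidean distance, k no larger than the number of distinct points, and a random first centroid for the furthest point heuristic.

```python
import math
import random

def random_centroids(points, k, rng):
    """Random initialization: k distinct data points chosen uniformly."""
    return rng.sample(points, k)

def furthest_point(points, k, rng):
    """Furthest point heuristic: repeatedly add the point farthest from its nearest chosen centroid."""
    centroids = [rng.choice(points)]  # assumption: the first centroid is picked at random
    while len(centroids) < k:
        centroids.append(max(points, key=lambda x: min(math.dist(x, c) for c in centroids)))
    return centroids

def kmeans_pp(points, k, rng):
    """K-means++ seeding: sample each new centroid with probability proportional to squared distance."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(math.dist(x, c) for c in centroids) ** 2 for x in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

init_points = [(0, 0), (2, 0), (10, 0), (12, 0)]
rng = random.Random(1)
print(random_centroids(init_points, 2, rng))
print(furthest_point(init_points, 2, rng))
print(kmeans_pp(init_points, 2, rng))
```

The furthest point heuristic always places the second centroid in the other cluster of this toy data, whereas random initialization may pick both centroids from the same cluster.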
54
Initialization techniques Varying N
55
Initialization techniques Varying k
56
Repeated k-means (RKM)
- Repeat 100 times
- Can increase the chances of success significantly
- In principle, running forever would solve the problem
- Limitations if k is large
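Repeated k-means restarts k-means from different random initial centroids and keeps the solution with the lowest SSE. The sketch below is self-contained and simplified: it assumes points given as tuples, random data points as initial centroids, Euclidean distance, and that an empty cluster keeps its old centroid. Only the default of 100 repeats comes from the slide.

```python
import math
import random

def k_means(points, centroids):
    """Plain k-means; returns (centroids, partition, sse)."""
    while True:
        # Assignment step.
        partition = [min(range(len(centroids)), key=lambda j: math.dist(x, centroids[j]))
                     for x in points]
        # Centroid step (an empty cluster keeps its old centroid: assumption).
        new = []
        for j, old in enumerate(centroids):
            members = [x for x, p in zip(points, partition) if p == j]
            new.append(tuple(sum(col) / len(members) for col in zip(*members)) if members else old)
        if new == centroids:
            break
        centroids = new
    sse = sum(math.dist(x, centroids[p]) ** 2 for x, p in zip(points, partition))
    return centroids, partition, sse

def repeated_k_means(points, k, repeats=100, seed=0):
    """Run k-means from `repeats` random initializations and keep the lowest-SSE result."""
    rng = random.Random(seed)
    best = None
    for _ in range(repeats):
        result = k_means(points, rng.sample(points, k))
        if best is None or result[2] < best[2]:
            best = result
    return best

rkm_points = [(0, 0), (2, 0), (10, 0), (12, 0)]
rkm_centroids, rkm_partition, rkm_sse = repeated_k_means(rkm_points, k=2)
print(sorted(rkm_centroids), rkm_sse)
```

Each repeat is independent, so the probability that at least one run succeeds grows with the number of repeats. This helps less when k is large, because a single run then rarely succeeds.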
57
A better algorithm
- Random Swap (RS)
- Genetic Algorithm (GA)
cs.uef.fi/pages/franti/research/ga.txt
58
Overall comparison CI-values
59
Conclusions
60
Conclusions: how did k-means perform?
1. Overlap: Good!
2. Number of clusters: Bad!
3. Dimensionality: No change
4. Unbalance of cluster sizes: Bad!
61
References
J. MacQueen, "Some methods for classification and analysis of multivariate observations", Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, University of California Press, Berkeley, Calif., 1967.
S.P. Lloyd, "Least squares quantization in PCM", IEEE Trans. on Information Theory, 28 (2), 129–137, 1982.
E. Forgy, "Cluster analysis of multivariate data: Efficiency vs. interpretability of classification", Biometrics, 21, 768, 1965.
M. Steinbach, L. Ertöz and V. Kumar, "The challenges of clustering high dimensional data", New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition, Springer-Verlag, 2003.
U. Luxburg, R.C. Williamson and I. Guyon, "Clustering: Science or Art?", J. Machine Learning Research, 27: 65–79, 2012.
P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61–68, 2000.
P. Fränti and J. Kivijärvi, "Randomized local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 2000.
P. Fränti, O. Virmajoki and V. Hautamäki, "Fast agglomerative clustering using a k-nearest neighbor graph", IEEE Trans. on Pattern Analysis and Machine Intelligence, 28 (11), November 2006.
T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: A new data clustering algorithm and its applications", Data Mining and Knowledge Discovery, 1 (2), 1997.
I. Kärkkäinen and P. Fränti, "Dynamic local search algorithm for the clustering problem", Research Report A
P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), May 2006.
P. Fränti, R. Mariescu-Istodor and C. Zhong, "XNN graph", IAPR Joint Int. Workshop on Structural, Syntactic, and Statistical Pattern Recognition, Merida, Mexico, LNCS 10029, November 2016.
M. Rezaei and P. Fränti, "Set-matching methods for external cluster validity", IEEE Trans. on Knowledge and Data Engineering, 28 (8), August 2016.
E. Chavez and G. Navarro, "A probabilistic spell for the curse of dimensionality", Workshop on Algorithm Engineering and Experimentation, LNCS 2153, 2001.
N. Tomasev, M. Radovanović, D. Mladenić and M. Ivanović, "The role of hubness in clustering high-dimensional data", IEEE Trans. on Knowledge and Data Engineering, 26 (3), March 2014.
D. Steinley, "Local optima in k-means clustering: what you don't know may hurt you", Psychological Methods, 8, 294–304, 2003.
P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: Cluster level similarity measure", Pattern Recognition, 47 (9), 2014.
T. Gonzalez, "Clustering to minimize the maximum intercluster distance", Theoretical Computer Science, 38 (2–3), 293–306, 1985.