1 How much k-means can be improved by using better initialization and repeats?
Pasi Fränti and Sami Sieranoja
P. Fränti and S. Sieranoja, "How much k-means can be improved by using better initialization and repeats?", Pattern Recognition, 2019.

2 Introduction

3 Goal of k-means
Input: N points X = {x1, x2, …, xN}
Output: partition P = {p1, p2, …, pk} and k centroids C = {c1, c2, …, ck}
Objective function: SSE = sum-of-squared errors

4 Goal of k-means (continued)
Same input, output, and SSE objective as above, with two assumptions: SSE is a suitable objective function, and k is known.
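Written out in the notation above (a standard formulation; c_j denotes the centroid of cluster p_j):

SSE(C, P) = \sum_{j=1}^{k} \sum_{x_i \in p_j} \lVert x_i - c_j \rVert^2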

5 Using SSE objective function

6 K-means algorithm

7 K-means optimization steps
Assignment step: assign each point to its nearest centroid.
Centroid step: move each centroid to the mean of the points assigned to it.
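The two steps can be stated as a minimal NumPy sketch (illustrative only; the function and variable names are not taken from the paper):

import numpy as np

def kmeans(X, C, max_iter=100):
    """Plain k-means: alternate assignment and centroid steps until the centroids stop moving."""
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)   # (N, k) distance matrix
        labels = dist.argmin(axis=1)
        # Centroid step: each centroid moves to the mean of the points assigned to it.
        new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(len(C))])
        if np.allclose(new_C, C):
            break
        C = new_C
    return C, labels

def sse(X, C, labels):
    """Sum-of-squared errors of a clustering (the objective defined above)."""
    return float(((X - C[labels]) ** 2).sum())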

8 Examples

9 Iterate by k-means 1st iteration

10 Iterate by k-means 2nd iteration

11 Iterate by k-means 3rd iteration

12 Iterate by k-means 16th iteration

13 Iterate by k-means 17th iteration

14 Iterate by k-means 18th iteration

15 Iterate by k-means 19th iteration

16 Final result (25 iterations)

17 Problems of k-means Distance of clusters
K-means cannot move centroids between clusters that are far away from each other.

18 Data and methodology

19 Clustering basic benchmark
Fränti and Sieranoja, "K-means properties on six clustering benchmark datasets", Applied Intelligence, 2018.
Datasets: S1-S4, A1-A3, Unbalance, Birch1, Birch2, DIM32.

20 Centroid index Requires ground truth
CI = Centroid index. P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: cluster level similarity measure", Pattern Recognition, 47 (9), September 2014.
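A sketch of how CI could be computed against ground-truth centroids, based on my reading of the cited paper: map each solution centroid to its nearest ground-truth centroid, count the ground-truth centroids that receive no mapping (orphans), and take the larger of the two one-way counts.

import numpy as np

def ci_one_way(A, B):
    """Map each centroid in A to its nearest centroid in B and count
    the centroids of B that were never chosen (orphans)."""
    dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    mapped = np.zeros(len(B), dtype=bool)
    mapped[nearest] = True
    return int((~mapped).sum())

def centroid_index(A, B):
    """CI = max of the two one-way orphan counts; CI == 0 means the cluster structure is correct."""
    return max(ci_one_way(A, B), ci_one_way(B, A))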

21 Centroid index example
Example: CI = 4. Some clusters have missing centroids while others have too many.

22 Success rate How often CI=0?
Example: six runs give CI = 1, 2, 1, 0, 2, 2; only one run reaches CI = 0, so the success rate is 17%.

23 Properties of k-means

24 Cluster overlap Definition
d1 = distance to the nearest centroid (own cluster)
d2 = distance to the nearest centroid of another cluster
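A sketch of an overlap estimate built from d1 and d2; the exact counting rule appears only as a figure in the slides, so the threshold used here (a point counts as overlapping when d2 < 2*d1) is an assumption for illustration:

import numpy as np

def overlap_estimate(X, C, labels, factor=2.0):
    """Fraction of points whose nearest other-cluster centroid is within
    'factor' times the distance to their own centroid (factor is assumed)."""
    dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    d1 = dist[np.arange(len(X)), labels]     # distance to own centroid
    dist[np.arange(len(X)), labels] = np.inf
    d2 = dist.min(axis=1)                    # distance to nearest centroid of another cluster
    return float((d2 < factor * d1).mean())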

25 Dependency on overlap (S datasets)
Success rates and CI-values as overlap increases from S1 to S4:
S1: 3% (CI = 1.8); S2: 11% (CI = 1.4); S3: 12% (CI = 1.3); S4: 26% (CI = 0.9)

26 Why does overlap help? Example comparing overlap = 7% and overlap = 22% (13 iterations).

27 Linear dependency on the number of clusters (k): Birch2 dataset
Success rate: 16%

28 Dependency on dimensions DIM datasets
As dimensionality increases:
Dimensions:    32   64   128  256  512  1024
CI:            3.6  3.5  3.8  3.8  3.9  3.7
Success rate:  0% in every case

29 Lack of overlap is the cause! G2 datasets
On the G2 datasets, success follows cluster overlap rather than dimensionality: when overlap increases, the success rate increases (correlation 0.91).

30 Effect of unbalance (Unbalance datasets)
Success rate: 0%. Average CI: 3.9. The problem originates from the random initialization.

31 Summary of k-means properties
1. Overlap: Good!
2. Number of clusters: Linear dependency
3. Dimensionality: No direct effect
4. Unbalance of cluster sizes: Bad!

32 How to improve?

33 Repeated k-means (RKM)
Repeat 100 times: initialize (the initialization must include randomness) and run k-means; keep the result with the lowest SSE.
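A sketch of RKM, reusing kmeans() and sse() from the sketch under slide 7; random centroids supply the required randomness, and the lowest-SSE run is kept (names are illustrative):

import numpy as np

def random_centroids(X, k, rng):
    """Pick k distinct data points as initial centroids."""
    return X[rng.choice(len(X), size=k, replace=False)]

def repeated_kmeans(X, k, repeats=100, seed=0):
    """Run k-means from many random initializations and keep the lowest-SSE result."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(repeats):
        C, labels = kmeans(X, random_centroids(X, k, rng))
        err = sse(X, C, labels)
        if best is None or err < best[0]:
            best = (err, C, labels)
    return best[1], best[2]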

34 How to initialize?
Some obvious heuristics: furthest point, sorting, density, projection.
A clear state-of-the-art is missing:
No single technique outperforms the others in all cases.
Initialization is not significantly easier than the clustering problem itself.
K-means can be used as a fine-tuner with almost anything.
Another desperate effort: repeat it LOTS OF times.

35 Initialization techniques Criteria
Simple to implement
Lower (or equal) time complexity than k-means
No additional parameters
Includes randomness

36 Requirements for initialization
1. Simple to implement
Random centroids: 2 functions + 26 lines of code.
Repeated k-means: 5 functions.
The initialization should be simpler than k-means itself.
2. Faster than k-means
K-means is real-time (<1 s) up to N ≈ 10,000.
O(IkN) ≈ 25N√N, assuming I ≈ 25 and k ≈ √N.
The initialization must be faster than O(N²).

37 Simplicity of algorithms
Compared in terms of parameters, functions, and lines of code.
Kinnunen, Sidoroff, Tuononen and Fränti, "Comparison of clustering methods: a case study of text-independent speaker modeling", Pattern Recognition Letters, 2011.

38 Alternative: a better algorithm
Random Swap (RS): reaches CI = 0. P. Fränti, "Efficiency of random swap clustering", Journal of Big Data, 2018.
Genetic Algorithm (GA): reaches CI = 0. P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 2000.

39 Initialization techniques

40 Techniques considered

41 Random centroids

42 Random partitions

43 Random partitions

44 Furthest points (maxmin)

45 Furthest points (maxmin)
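A sketch of the furthest point (maxmin) heuristic as it is commonly described: the first centroid is a random data point and each further centroid is the point lying furthest from all centroids chosen so far (these details are an assumption, not read off the slides):

import numpy as np

def maxmin_init(X, k, rng):
    """Furthest point (maxmin) initialization."""
    centroids = [X[rng.integers(len(X))]]            # first centroid: a random point
    d = np.linalg.norm(X - centroids[0], axis=1)     # distance to nearest chosen centroid
    for _ in range(k - 1):
        centroids.append(X[d.argmax()])              # furthest point becomes the next centroid
        d = np.minimum(d, np.linalg.norm(X - centroids[-1], axis=1))
    return np.array(centroids)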

46 Projection-based initialization
Most common projection axes: diagonal, principal axis (PCA), principal curves.
Initialization: uniform partition along the axis.
Used in divisive clustering (iterative split).
PCA costs O(DN) to O(D²N).
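A sketch of projection-based initialization using the principal axis: points are projected onto the first PCA direction, the order along the axis is cut into k segments, and each segment's mean becomes an initial centroid. Cutting into equal-size segments is an assumption; a uniform partition of the axis range would work similarly:

import numpy as np

def pca_projection_init(X, k):
    """Project onto the principal axis and take one centroid per segment of the order."""
    Xc = X - X.mean(axis=0)
    # Principal axis = first right singular vector of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    t = Xc @ Vt[0]                        # scalar projection of every point
    order = np.argsort(t)
    groups = np.array_split(order, k)     # k consecutive segments along the axis
    return np.array([X[g].mean(axis=0) for g in groups])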

47 Furthest point projection

48 Projection example (1): Birch2. Good projection!

49 Projection example (2): Birch1. Bad projection!

50 Projection example (3): Unbalance. Bad projection!

51 More complex projections
I. Cleju, P. Fränti, X. Wu, "Clustering based on principal curve", Scandinavian Conf. on Image Analysis, LNCS vol. 3540, June 2005.

52 More complex projections: travelling salesman problem!
I. Cleju, P. Fränti, X. Wu, "Clustering based on principal curve", Scandinavian Conf. on Image Analysis, LNCS vol. 3540, June 2005.

53 Sorting heuristic: order the points by their distance to a reference point.
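A sketch of one way to realize the sorting heuristic: order the points by distance to a reference point and pick k evenly spaced points from that order as initial centroids. Both the choice of reference point (the data mean here) and the selection rule are assumptions for illustration:

import numpy as np

def sorting_init(X, k, ref=None):
    """Sort points by distance to a reference point; pick k evenly spaced ones."""
    if ref is None:
        ref = X.mean(axis=0)                                  # reference point (assumed)
    order = np.argsort(np.linalg.norm(X - ref, axis=1))
    picks = order[np.linspace(0, len(X) - 1, k).round().astype(int)]
    return X[picks]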

54 Density-based heuristics Luxburg
Luxburg's technique:
Select k log(k) preliminary clusters using k-means.
Eliminate the smallest ones.
Use the furthest point heuristic to select the final k centroids.
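A sketch following the three steps above, reusing kmeans(), random_centroids() and maxmin_init() from the earlier sketches; how many of the smallest clusters are eliminated is not specified on the slide, so the "keep the largest half, at least k" rule below is an assumption:

import numpy as np

def luxburg_init(X, k, rng):
    """k*log(k) preliminary clusters, drop the smallest, maxmin-select k centroids."""
    L = max(k, int(np.ceil(k * np.log(k))))
    C, labels = kmeans(X, random_centroids(X, L, rng))
    sizes = np.bincount(labels, minlength=L)
    keep = C[np.argsort(sizes)[-max(k, L // 2):]]   # eliminate the smallest preliminary clusters
    return maxmin_init(keep, k, rng)                # furthest point heuristic on the survivors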

55 Splitting algorithm
Repeat until k clusters are reached:
Select the biggest cluster.
Select two random points as new centroids (split).
Re-allocate the cluster's points between them.
Tune by k-means locally.
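A sketch following the listed steps, reusing kmeans() from the earlier sketch: grow from one cluster to k by repeatedly splitting the biggest cluster with two random points and tuning only that cluster's points with a couple of k-means iterations (the local scope and the iteration count are assumptions):

import numpy as np

def splitting_init(X, k, rng, local_iters=2):
    """Grow from 1 to k clusters by repeatedly splitting the biggest cluster."""
    centroids = [X.mean(axis=0)]
    labels = np.zeros(len(X), dtype=int)
    while len(centroids) < k:
        big = np.bincount(labels, minlength=len(centroids)).argmax()  # biggest cluster
        members = np.where(labels == big)[0]
        two = X[rng.choice(members, size=2, replace=False)]           # two random points
        C2, sub = kmeans(X[members], two, max_iter=local_iters)       # local re-allocation + tuning
        centroids[big] = C2[0]
        centroids.append(C2[1])
        labels[members[sub == 1]] = len(centroids) - 1
    return np.array(centroids)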

56 Results

57 Success rates K-means (without repeats)
Reported: average success rate and the number of datasets never solved.
Most problems with: A2, A3, Unbalance, Birch1, Birch2.

58 Cluster overlap: high cluster overlap
Figure: initial centroids and the result after k-means.

59 Cluster overlap: low cluster overlap (G2)
Figure: initial centroids and the result after k-means.

60 Number of clusters Birch2 subsets

61 Dimensions

62 Unbalance

63 Success rates Repeated k-means
Reported: average success rate and the number of datasets never solved.
Furthest point approaches solve Unbalance.
Still problems: Birch1, Birch2.

64 How many repeats needed?

65 How many repeats needed?
Unbalance

66 Summary of results CI-values
K-means: CI = 4.5. Repeated k-means: CI = … Maxmin initialization: CI = … Both combined: CI = 0.7.
For most applications: good enough!
If accuracy is vital: find a better method (random swap).
Cluster overlap is the most important factor.

67 Effect of different factors

68 Conclusions
How effective: repeats + Maxmin initialization reduce the error from CI = 4.5 to CI = 0.7.
Is it enough? For most applications: YES. If accuracy is important: NO.
Important factors: cluster overlap is critical for k-means; dimensionality does not matter.

69 Random swap

70 Random swap (RS)
Initialize, then repeat 5000 times: swap one centroid and fine-tune with k-means (only 2 iterations). Reaches CI = 0.
P. Fränti, "Efficiency of random swap clustering", Journal of Big Data, 2018.
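A sketch of random swap as described in the cited paper, reusing kmeans(), sse() and random_centroids() from the earlier sketches: each trial replaces one randomly chosen centroid with a randomly chosen data point, fine-tunes with two k-means iterations, and keeps the swap only if the SSE improves:

import numpy as np

def random_swap(X, k, trials=5000, seed=0):
    """Random swap clustering: trial-and-error centroid swaps plus short k-means."""
    rng = np.random.default_rng(seed)
    C, labels = kmeans(X, random_centroids(X, k, rng))
    best_err = sse(X, C, labels)
    for _ in range(trials):
        trial = C.copy()
        trial[rng.integers(k)] = X[rng.integers(len(X))]       # swap one centroid to a random point
        trial, trial_labels = kmeans(X, trial, max_iter=2)      # only 2 k-means iterations
        err = sse(X, trial, trial_labels)
        if err < best_err:                                      # accept only improving swaps
            C, labels, best_err = trial, trial_labels, err
    return C, labels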

71 How many repeats? P. Fränti, "Efficiency of random swap clustering", Journal of Big Data, 2018

72 The end

