
How much k-means can be improved by using better initialization and repeats?
Pasi Fränti and Sami Sieranoja, 11.4.2019
P. Fränti and S. Sieranoja, "How much k-means can be improved by using better initialization and repeats?", Pattern Recognition, 2019.

Introduction

Goal of k-means
Input: N points X = {x1, x2, …, xN}
Output: partition P = {p1, p2, …, pk} and k centroids C = {c1, c2, …, ck}
Objective function: SSE = sum-of-squared errors,
SSE = Σ_{i=1..N} ||x_i − c_j(i)||², where c_j(i) is the centroid nearest to x_i.

Goal of k-means (continued)
Input: N points X = {x1, x2, …, xN}
Output: partition P = {p1, p2, …, pk} and k centroids C = {c1, c2, …, ck}
Objective function: SSE
Assumptions: SSE is a suitable objective function; k is known.
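As a concrete illustration of the objective, here is a minimal sketch in Python/NumPy (not from the original slides; function and variable names are mine) that evaluates the SSE of a given set of centroids:

```python
import numpy as np

def sse(X, centroids):
    """Sum-of-squared errors: each point is charged the squared
    distance to its nearest centroid."""
    # Pairwise squared distances, shape (N, k)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

# Tiny example with two obvious clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
C = np.array([[0.05, 0.1], [5.1, 4.95]])
print(sse(X, C))
```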

Using SSE objective function

K-means algorithm (animator: http://cs.uef.fi/sipu/clustering/animator/)

K-means optimization steps
Assignment step: assign each point to its nearest centroid.
Centroid step: recompute each centroid as the mean of the points assigned to it.
(Figure: partition before and after the two steps.)
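A minimal sketch of the two steps in Python/NumPy (names are mine, not from the slides; empty clusters are simply left at their old centroid here, which is one of several possible conventions):

```python
import numpy as np

def assignment_step(X, centroids):
    """Assign every point to its nearest centroid (returns one label per point)."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def centroid_step(X, labels, k, old_centroids):
    """Move each centroid to the mean of the points assigned to it."""
    centroids = old_centroids.copy()
    for j in range(k):
        members = X[labels == j]
        if len(members) > 0:            # keep the old centroid if the cluster is empty
            centroids[j] = members.mean(axis=0)
    return centroids

# One before/after iteration on toy data
X = np.array([[0, 0], [0, 1], [5, 5], [6, 5]], dtype=float)
C = np.array([[0.0, 0.5], [4.0, 4.0]])
labels = assignment_step(X, C)
C_new = centroid_step(X, labels, k=2, old_centroids=C)
print(labels, C_new)
```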

Examples

Iterate by k-means 1st iteration

Iterate by k-means 2nd iteration

Iterate by k-means 3rd iteration

Iterate by k-means 16th iteration

Iterate by k-means 17th iteration

Iterate by k-means 18th iteration

Iterate by k-means 19th iteration

Final result 25 iterations

Problems of k-means: distance between clusters. K-means cannot move centroids between clusters that are far apart.

Data and methodology

Clustering basic benchmark
Fränti and Sieranoja, "K-means properties on six clustering benchmark datasets", Applied Intelligence, 2018.
Datasets: S1, S2, S3, S4, A1, A2, A3, Unbalance (cluster sizes 100 and 2000), Birch1, Birch2, DIM32.

Centroid index (CI): requires ground truth.
P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: cluster level similarity measure", Pattern Recognition, 47 (9), 3034-3045, September 2014.

Centroid index example: CI=4. (Figure: some regions have missing centroids, others have too many centroids.)
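The slide does not spell out the computation, so the sketch below (Python/NumPy, my naming) follows the usual description of the centroid index: map every centroid of one solution to its nearest centroid in the other, count the centroids that receive no mapping ("orphans"), and take the maximum of the two directions. Treat it as an illustration rather than the reference implementation from the cited paper.

```python
import numpy as np

def ci_one_way(A, B):
    """Orphan count: map each centroid in A to its nearest centroid in B
    and count how many centroids in B are never chosen."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    mapped = np.zeros(len(B), dtype=bool)
    mapped[nearest] = True
    return int((~mapped).sum())

def centroid_index(A, B):
    """Symmetric centroid index: CI = max of the two one-way orphan counts."""
    return max(ci_one_way(A, B), ci_one_way(B, A))

# Example: two found centroids share one true cluster, so one cluster is "missing"
truth = np.array([[0, 0], [10, 0], [0, 10]], dtype=float)
found = np.array([[0.1, 0.1], [0.2, -0.1], [9.8, 0.3]], dtype=float)
print(centroid_index(truth, found))   # -> 1
```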

Success rate: how often is CI=0? Example: 17%. (Figure: example runs with CI=1, CI=2, CI=1, CI=0, CI=2, CI=2.)

Properties of k-means

Cluster overlap: definition. For every point, d1 = distance to the nearest centroid (its own cluster), d2 = distance to the nearest centroid of another cluster.
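The slide gives only the two distances, not the rule that turns them into an overlap percentage. The sketch below (Python/NumPy, my naming) computes d1 and d2 for every point and, purely as an assumed illustration, counts a point as overlapping when d2 < 2*d1; the actual threshold used in the cited paper may differ.

```python
import numpy as np

def overlap_fraction(X, centroids, labels, factor=2.0):
    """Fraction of points whose nearest foreign centroid is 'too close':
    d1 = distance to own centroid, d2 = distance to nearest other centroid,
    a point counts as overlapping when d2 < factor * d1 (factor is an assumption)."""
    d = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
    n = len(X)
    d1 = d[np.arange(n), labels]
    d_other = d.copy()
    d_other[np.arange(n), labels] = np.inf   # mask out the own centroid
    d2 = d_other.min(axis=1)
    return float((d2 < factor * d1).mean())

# Toy usage
X = np.array([[0, 0], [1, 0], [1.5, 0], [3, 0]], dtype=float)
C = np.array([[0.5, 0.0], [2.25, 0.0]])
labels = np.array([0, 0, 1, 1])
print(overlap_fraction(X, C, labels))
```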

Dependency on overlap (S datasets)
Success rates and CI values as overlap increases:
S1: 3% (CI=1.8), S2: 11% (CI=1.4), S3: 12% (CI=1.3), S4: 26% (CI=0.9)

Why does overlap help? (Figure: two example runs, overlap = 7% vs. overlap = 22%; 13 iterations.)

Linear dependency on the number of clusters (k): Birch2 dataset (16%).

Dependency on dimensions (DIM datasets), dimensionality increasing:
Dimensions:    32   64   128  256  512  1024
CI:            3.6  3.5  3.8  3.8  3.9  3.7
Success rate: 0% in every case

Lack of overlap is the cause! (G2 datasets)
(Figures: overlap as dimensionality increases, and success rate as overlap increases; correlation between overlap and success: 0.91.)

Effect of unbalance (Unbalance datasets)
Success rate: 0%. Average CI: 3.9.
The problem originates from the random initialization.

Summary of k-means properties
1. Overlap: good!
2. Number of clusters: linear dependency
3. Dimensionality: no direct effect
4. Unbalance of cluster sizes: bad!

How to improve?

Repeated k-means (RKM)
Repeat 100 times: initialize, then run k-means. The initialization must include randomness.
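A minimal sketch of the repeated k-means wrapper in Python/NumPy (names are mine): run k-means from several random initializations and keep the solution with the lowest SSE.

```python
import numpy as np

def kmeans(X, k, iters=100, rng=None):
    """Plain k-means with random-centroid initialization."""
    rng = np.random.default_rng() if rng is None else rng
    C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        C_new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(k)])
        if np.allclose(C_new, C):          # converged
            C = C_new
            break
        C = C_new
    labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    sse = ((X - C[labels]) ** 2).sum()
    return C, labels, sse

def repeated_kmeans(X, k, repeats=100, rng=None):
    """Keep the best (lowest-SSE) of `repeats` independent k-means runs."""
    rng = np.random.default_rng() if rng is None else rng
    best = None
    for _ in range(repeats):
        C, labels, sse = kmeans(X, k, rng=rng)
        if best is None or sse < best[2]:
            best = (C, labels, sse)
    return best

# Toy usage
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.2, size=(50, 2)) for m in ([0, 0], [3, 0], [0, 3])])
C, labels, sse = repeated_kmeans(X, 3, repeats=20, rng=rng)
print(sse)
```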

How to initialize?
Some obvious heuristics: furthest point, sorting, density, projection.
A clear state-of-the-art is missing:
- No single technique outperforms the others in all cases.
- Initialization is not significantly easier than the clustering itself.
- K-means can be used as a fine-tuner with almost anything.
Another desperate effort: repeat it LOTS OF times.

Initialization techniques: criteria
- Simple to implement
- Lower (or equal) time complexity than k-means
- No additional parameters
- Includes randomness

Requirements for initialization
1. Simple to implement: random centroids needs 2 functions + 26 lines of code, repeated k-means 5 functions + 162 lines of code; the initialization should be simpler than k-means itself.
2. Faster than k-means: k-means runs in real time (<1 s) up to N ≈ 10,000, since O(IkN) ≈ 25·N^1.5 assuming I ≈ 25 iterations and k ≈ √N. The initialization must therefore be faster than O(N²).

Simplicity of algorithms
(Table: number of parameters, functions, and lines of code per algorithm.)
Kinnunen, Sidoroff, Tuononen and Fränti, "Comparison of clustering methods: a case study of text-independent speaker modeling", Pattern Recognition Letters, 2011.

Alternative: a better algorithm
Random Swap (RS): reaches CI = 0. P. Fränti, "Efficiency of random swap clustering", Journal of Big Data, 2018.
Genetic Algorithm (GA): reaches CI = 0. P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 2000.

Initialization techniques

Techniques considered

Random centroids

Random partitions

Random partitions
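Sketches of the two simplest initializations named above (Python/NumPy, my naming): random centroids picks k distinct data points, while random partition assigns every point to a random cluster and takes the cluster means as the initial centroids.

```python
import numpy as np

def init_random_centroids(X, k, rng):
    """Pick k distinct data points as the initial centroids."""
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx].copy()

def init_random_partition(X, k, rng):
    """Assign every point to a random cluster; initial centroids = cluster means."""
    labels = rng.integers(0, k, size=len(X))
    return np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                     else X[rng.integers(len(X))]          # re-seed an empty cluster
                     for j in range(k)])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
print(init_random_centroids(X, 3, rng))
print(init_random_partition(X, 3, rng))
```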

Furthest points (maxmin)

Furthest points (maxmin)
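A minimal sketch of the furthest point (maxmin) heuristic (Python/NumPy, names are mine): start from one random centroid, then repeatedly add the point whose distance to its nearest already-chosen centroid is largest.

```python
import numpy as np

def init_maxmin(X, k, rng):
    """Furthest-point (maxmin) initialization."""
    n = len(X)
    centroids = [X[rng.integers(n)]]                 # first centroid: a random point
    d = ((X - centroids[0]) ** 2).sum(axis=1)        # squared distance to nearest chosen centroid
    for _ in range(1, k):
        far = int(d.argmax())                        # point furthest from all chosen centroids
        centroids.append(X[far])
        d = np.minimum(d, ((X - X[far]) ** 2).sum(axis=1))
    return np.array(centroids)

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
print(init_maxmin(X, 4, rng))
```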

Projection-based initialization
Most common projection axes: diagonal, principal axis (PCA), principal curves.
Initialization: uniform partition along the axis.
Also used in divisive clustering (iterative split).
Cost of PCA: O(DN) to O(D²N).
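A sketch of one projection-based variant (Python/NumPy, my construction): project the data onto the principal axis, split the projected order into k equal-size groups, and use the group means as the initial centroids. The slide lists several possible axes (diagonal, PCA, principal curves); only the PCA axis is shown here.

```python
import numpy as np

def init_pca_projection(X, k):
    """Project onto the principal axis and partition the order uniformly."""
    Xc = X - X.mean(axis=0)
    # Principal axis = first right singular vector of the centered data
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    t = Xc @ vt[0]                          # 1-D projection of every point
    order = np.argsort(t)
    groups = np.array_split(order, k)       # k consecutive groups along the axis
    return np.array([X[g].mean(axis=0) for g in groups])

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(100, 2)) for m in ([0, 0], [3, 1], [6, 2])])
print(init_pca_projection(X, 3))
```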

Furthest point projection

Projection example (1): good projection! (Birch2)

Projection example (2): bad projection! (Birch1)

Projection example (3): bad projection! (Unbalance)

More complex projections
I. Cleju, P. Fränti, X. Wu, "Clustering based on principal curve", Scandinavian Conf. on Image Analysis, LNCS vol. 3540, June 2005.

More complex projections: travelling salesman problem!
I. Cleju, P. Fränti, X. Wu, "Clustering based on principal curve", Scandinavian Conf. on Image Analysis, LNCS vol. 3540, June 2005.

Sorting heuristic: sort the points by their distance to a reference point. (Figure: example with points numbered 1-4.)

Density-based heuristics: Luxburg's technique. Select k·log(k) preliminary clusters using k-means, eliminate the smallest ones, and apply the furthest point heuristic to select the final k centroids.

Splitting algorithm
Repeat until k clusters: select the biggest cluster, select two random points of it as new centroids, re-allocate its points between them, and tune the result by k-means locally.
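A rough sketch of this splitting idea (Python/NumPy, my structuring; the refinement below is a couple of global k-means iterations, whereas the slide tunes locally):

```python
import numpy as np

def kmeans_iterations(X, C, iters=2):
    """A few standard k-means iterations (assignment + centroid steps)."""
    for _ in range(iters):
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                      for j in range(len(C))])
    return C, labels

def init_by_splitting(X, k, rng, tune_iters=2):
    """Grow from 1 to k centroids by repeatedly splitting the biggest cluster."""
    C = X[[rng.integers(len(X))]].astype(float)
    labels = np.zeros(len(X), dtype=int)
    while len(C) < k:
        biggest = np.bincount(labels, minlength=len(C)).argmax()
        members = np.where(labels == biggest)[0]
        a, b = rng.choice(members, size=2, replace=False)   # two random points of that cluster
        C = np.vstack([np.delete(C, biggest, axis=0), X[a], X[b]])
        C, labels = kmeans_iterations(X, C, iters=tune_iters)  # refinement step
    return C

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 2))
print(init_by_splitting(X, 5, rng))
```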

Results

Success rates: k-means (without repeats)
(Table: average success rate and number of datasets never solved, per initialization technique.)
Most problems with: A2, A3, Unbalance, Birch1, Birch2.

Cluster overlap: high cluster overlap. (Figure: initial centroids and result after k-means.)

Cluster overlap: low cluster overlap (G2). (Figure: initial centroids and result after k-means.)

Number of clusters (Birch2 subsets)

Dimensions

Unbalance

Success rates: repeated k-means
(Table: average success rate and number of datasets never solved.)
Furthest point approaches solve Unbalance. Still problems with: Birch1, Birch2.

How many repeats needed?

How many repeats needed? Unbalance

Summary of results (CI values)
K-means: CI=4.5 (15%)
Repeated k-means: CI=2.0 (6%)
Maxmin initialization: CI=2.1 (6%)
Both: CI=0.7 (1%)
For most applications: good enough! If accuracy is vital: find a better method (random swap).
Cluster overlap is the most important factor.

Effect of different factors

Conclusions
How effective: repeats + maxmin initialization reduce the error from CI=4.5 to CI=0.7.
Is it enough? For most applications: YES. If accuracy is important: NO.
Important factors: cluster overlap is critical for k-means; dimensionality does not matter.

Random swap

Random swap (RS)
Initialize, then repeat 5000 times: swap one centroid, run k-means for only 2 iterations, and keep the swap if it improves the result. Reaches CI=0.
P. Fränti, "Efficiency of random swap clustering", Journal of Big Data, 2018.
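A minimal sketch of the random swap loop (Python/NumPy, my naming; consult the cited paper for the actual algorithm): replace one randomly chosen centroid with a randomly chosen data point, run two k-means iterations, and accept the swap only if the SSE improved.

```python
import numpy as np

def kmeans_iters(X, C, iters=2):
    """Run a fixed number of k-means iterations and return centroids + SSE."""
    for _ in range(iters):
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                      for j in range(len(C))])
    labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return C, ((X - C[labels]) ** 2).sum()

def random_swap(X, k, swaps=5000, rng=None):
    """Random swap clustering: trial-and-error swaps refined by 2 k-means iterations."""
    rng = np.random.default_rng() if rng is None else rng
    C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    C, best_sse = kmeans_iters(X, C)
    for _ in range(swaps):
        trial = C.copy()
        trial[rng.integers(k)] = X[rng.integers(len(X))]   # swap one centroid to a random point
        trial, sse = kmeans_iters(X, trial)                # only 2 k-means iterations
        if sse < best_sse:                                 # accept only improving swaps
            C, best_sse = trial, sse
    return C, best_sse

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=m, scale=0.2, size=(50, 2))
               for m in ([0, 0], [2, 0], [0, 2], [2, 2])])
C, sse = random_swap(X, 4, swaps=200, rng=rng)
print(C, sse)
```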

How many repeats? P. Fränti, "Efficiency of random swap clustering", Journal of Big Data, 2018

The end