A COMPLEMENTARY SQUARE-ERROR CLUSTERING CRITERION AND INITIALISATION OF K-MEANS

Boris Mirkin
Data Analysis & Machine Intelligence, National Research University Higher School of Economics, Moscow, RF
Computer Science & Information Systems, Birkbeck, University of London, UK

Support of the NRU HSE Academic Fund Program grant (a subsidy granted to the NRU HSE by the RF Government for the implementation of the Global Competitiveness Program in –) is acknowledged.

BigData and DM: Paris, 7-8 September 2017
CONTENTS

1. Batch k-means algorithm and criterion
2. Data scatter decomposition and the complementary clustering criterion (CCC)
3. A review of k-means initialization
4. Extracting anomalous clusters one-by-one
5. Experiments with ik-means clustering
6. Ward agglomeration (WA) preceded by ik-means
7. Affinity Propagation with CCC
8. PAC = AP-CCC + WA: capturing the right number of clusters
Centers ck and sets Sk (k=1,…, K):
(a) Initialize
(b) Assign entities to the nearest center
(c) Cluster update
(d) Center update
K-MEANS

Cluster k: center ck and set Sk (k=1,…, K).

0. Initialize: specify K, the number of clusters, and initial centers ck (k=1,…, K).
1. Cluster update: update sets Sk (k=1,…, K) using the Minimum distance rule.
2. Center update: update centers ck (k=1,…, K) as the means of Sk.
3. Halt condition: if the new centers coincide with the previous ones, stop. Else go to 1.
K-MEANS CRITERION

Find partition S = {S1,…, SK} and centers c = (c1,…, cK) to minimize:

W(S, c) = Σk=1..K Σi∈Sk d(yi, ck)

Criterion: the sum of distances between entities and the centers of their clusters.

Distance d (squared Euclidean):
X = [1, 2, -2], Y = [1, -1, -1]
X - Y = [1-1, 2-(-1), -2-(-1)] = [0, 3, -1]
d(X, Y) = <X - Y, X - Y> = 0² + 3² + (-1)² = 10
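The slide's distance computation, reproduced as a two-line check:

```python
import numpy as np

# The slide's example: X = [1, 2, -2], Y = [1, -1, -1]
X = np.array([1, 2, -2])
Y = np.array([1, -1, -1])
diff = X - Y              # [0, 3, -1]
d = int(diff @ diff)      # squared Euclidean: 0 + 9 + 1 = 10
```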
PYTHAGOREAN DECOMPOSITION

W(S, c) = Σk=1..K Σi∈Sk Σv=1..V (yiv - ckv)²
        = Σk=1..K Σi∈Sk Σv=1..V (yiv² - 2 yiv ckv + ckv²)
        = Σi Σv yiv² - Σk=1..K Nk <ck, ck>
        = T - D(S, c)

Hence T = D(S, c) + W(S, c):

Data_Scatter = "Explained" + "Unexplained"

where T = Σi Σv yiv² is the data scatter and D(S, c) = Σk Nk <ck, ck>.
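A quick numerical check that T = D(S, c) + W(S, c) when each ck is its cluster's mean (a sketch on synthetic data; the partition is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 3))
labels = np.arange(20) % 3              # an arbitrary partition into K = 3 clusters

T = (Y ** 2).sum()                      # data scatter
W = 0.0                                 # "unexplained" part
D = 0.0                                 # "explained" part
for k in range(3):
    Sk = Y[labels == k]
    ck = Sk.mean(axis=0)                # center = within-cluster mean
    W += ((Sk - ck) ** 2).sum()
    D += len(Sk) * (ck @ ck)            # Nk <ck, ck>
```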
COMPLEMENTARY CRITERION

K-Means minimizes:

W(S, c) = Σk=1..K Σi∈Sk d(yi, ck)

Complementary criterion: maximize

D(S, c) = Σk=1..K Nk <ck, ck>

over S and c, where Nk is the number of entities in Sk.

Data scatter = Σi Σv yiv² = W(S, c) + D(S, c);
<ck, ck> is the squared Euclidean distance between 0 and ck.
The data scatter is constant while partitioning.
COMPLEMENTARY CRITERION

Pre-center the data: 0 is the grand mean.

Maximize D(S, c) = Σk=1..K #Sk <ck, ck>,
where <ck, ck> is the squared Euclidean distance from 0 to ck.

Look for anomalous and populated clusters: further away from the origin!
DETERMINING THE NUMBER OF CLUSTERS: TWO WAYS

A) Extract clusters one-by-one: find an anomalous cluster, then remove it (first, second, …).
B) Determine objects that are both most distant and representative, in parallel; then run k-means.
ANOMALOUS CLUSTER

0. Initialize: the center c is the object farthest away from 0.
1. Cluster update: if d(yi, c) < d(yi, 0), assign yi to S.
2. Centroid update: c' is the within-S mean; if c' ≠ c, go to 1 with c ← c'. Otherwise, halt.
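The three steps can be sketched as follows (a minimal illustration on pre-centered data; the function name is my own):

```python
import numpy as np

def anomalous_cluster(Y):
    """Extract one Anomalous cluster from pre-centered data Y
    (grand mean at the origin)."""
    d0 = (Y ** 2).sum(axis=1)           # squared distances to the origin
    c = Y[d0.argmax()].copy()           # 0. farthest entity is the initial center
    while True:
        dc = ((Y - c) ** 2).sum(axis=1)
        S = dc < d0                     # 1. yi joins S iff closer to c than to 0
        c_new = Y[S].mean(axis=0)       # 2. centroid update: within-S mean
        if np.allclose(c_new, c):       # halt when the center is stable
            return S, c
        c = c_new

# Toy usage: three entities near the grand mean, two far away
Y = np.array([[0.1, 0.], [-0.1, 0.], [0., 0.1], [5., 5.], [5.2, 4.9]])
Y = Y - Y.mean(axis=0)
S, c = anomalous_cluster(Y)
```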
FINDING AN ANOMALOUS CLUSTER (MIRKIN 1998, CHIANG & MIRKIN 2010)

Anomalous cluster S with center c: maximize #S <c, c>,
where 0 is the reference point (grand mean).

Build the Anomalous cluster S with center c.
Anomalous Cluster is (almost) K-Means, up to:

(i) the number of clusters K = 2: the "anomalous" cluster and the "main body" of entities around 0;
(ii) the center of the "main body" cluster is forcibly always at 0;
(iii) a natural initialization: c0 is at the entity which is the farthest away from 0.
IK-MEANS

1. Pre-center the data matrix to the grand mean; set the threshold t (= 1 by default).
2. Find an Anomalous cluster and store its center and size.
3. Remove the Anomalous cluster from the data set. Halt if the set gets empty; else go to 2.
4. Initialize k-means with the centers of those anomalous clusters whose size > t.
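Steps 1-4 can be sketched as below (a minimal illustration; it reuses the Anomalous-cluster procedure from the earlier slide, and the function names are my own). Note that the reference point stays at the original grand mean throughout the peeling:

```python
import numpy as np

def anomalous_cluster(Y):
    """One Anomalous cluster from data referenced to the origin."""
    d0 = (Y ** 2).sum(axis=1)
    c = Y[d0.argmax()].copy()
    while True:
        S = ((Y - c) ** 2).sum(axis=1) < d0
        c_new = Y[S].mean(axis=0)
        if np.allclose(c_new, c):
            return S, c
        c = c_new

def ik_means_seeds(Y, t=1):
    """Peel off Anomalous clusters one by one; keep the centers of
    clusters with more than t entities as seeds for k-means."""
    Y = Y - Y.mean(axis=0)                  # 1. pre-center to the grand mean
    idx = np.arange(len(Y))
    found = []
    while len(idx) > 0:                     # 3. halt when the set gets empty
        S, c = anomalous_cluster(Y[idx])
        found.append((c, int(S.sum())))     # 2. store center and size
        idx = idx[~S]                       # 3. remove the cluster
    return [c for c, n in found if n > t]   # 4. seeds for k-means

# Toy usage: two groups of three entities
Y = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.], [10., 11.], [11., 10.]])
seeds = ik_means_seeds(Y, t=1)
```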
CLUSTERING EXPERIMENTS (CHIANG & MIRKIN 2010)

iK-Means (the LS and LM versions below) is superior in cluster recovery (Chiang, Mirkin, Journal of Classification, 2010) over the other methods compared:

Method                                   Acronym
Calinski and Harabasz index              CH
Hartigan rule                            HK
Gap statistic                            GS
Jump statistic                           JS
Silhouette width                         SW
Consensus distribution area              CD
Average distance between partitions      DD
Square error iK-Means                    LS
Absolute error iK-Means                  LM
ACCELERATING WARD CLUSTERING

Agglomerative Ward clustering:
1. Start with the trivial partition S = {{1}, {2},…, {N}}.
2. Given S, form S(k,l) by merging the Sk and Sl that minimize
   Δ(k,l) = W(S(k,l), c(k,l)) - W(S, c).
3. Stop when S = {1, 2,…, N}.
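The merge increment Δ(k,l) need not be recomputed from scratch: for the square-error criterion it equals the well-known closed form NkNl/(Nk+Nl) · <ck - cl, ck - cl>. A quick numerical check of this equality on synthetic data (a sketch, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
Sk = rng.normal(size=(5, 2))            # cluster k, Nk = 5
Sl = rng.normal(loc=3.0, size=(8, 2))   # cluster l, Nl = 8

def W(cluster):
    """Within-cluster square error around the cluster mean."""
    return ((cluster - cluster.mean(axis=0)) ** 2).sum()

# Direct merge increment: Δ(k,l) = W(merged) - W(Sk) - W(Sl)
delta = W(np.vstack([Sk, Sl])) - W(Sk) - W(Sl)

# Closed form: Nk*Nl/(Nk+Nl) * squared distance between the centers
ck, cl = Sk.mean(axis=0), Sl.mean(axis=0)
closed_form = 5 * 8 / (5 + 8) * ((ck - cl) ** 2).sum()
```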
ACCELERATING WARD CLUSTERING (AMORIM, MAKARENKOV, MIRKIN, 2016)

A-Ward clustering:
1. Start with the partition S resulting from ik-means at t = 1.
2. Given S, form S(k,l) by merging the Sk and Sl that minimize
   Δ(k,l) = W(S(k,l), c(k,l)) - W(S, c).
3. Stop when S = {1, 2,…, N}.

Experimental result: A-Ward always gives a larger K and is about 10-15 times faster than Ward, at similar cluster recovery.
PARALLEL CLUSTER CENTERS, 1

Affinity Propagation (Frey & Dueck, 2008):

1. Similarity field s(i,j) = -d²(i,j); preference s(i,i) = const; "Responsibility" r and "Availability" a, with a(i,i) = 0.
2. Exchange process:
   r(i,j) ← (r(i,j) + s(i,j) - max_{l≠j} [a(i,l) + s(i,l)]) / 2
   a(i,j) ← (a(i,j) + min[0, r(j,j) + Σ_{k≠i,j} max(0, r(k,j))]) / 2
3. Choose i with maximal responsibility.
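A compact sketch of the exchange process (vectorized; the division by 2 implements the damping in the updates above, and the exemplar rule r(i,i) + a(i,i) > 0 is the usual Frey-Dueck choice, an assumption here rather than the slide's own rule):

```python
import numpy as np

def affinity_propagation(s, n_iter=200):
    """Responsibility/availability exchange on a similarity matrix s;
    the preferences sit on the diagonal of s."""
    n = len(s)
    r = np.zeros((n, n))
    a = np.zeros((n, n))
    for _ in range(n_iter):
        # r(i,j) <- (r(i,j) + s(i,j) - max_{l != j} [a(i,l) + s(i,l)]) / 2
        m = a + s
        first = m.max(axis=1)
        top = m.argmax(axis=1)
        m[np.arange(n), top] = -np.inf          # mask each row's maximum
        second = m.max(axis=1)
        max_other = np.where(np.arange(n)[None, :] == top[:, None],
                             second[:, None], first[:, None])
        r = (r + s - max_other) / 2
        # a(i,j) <- (a(i,j) + min[0, r(j,j) + sum_{k != i,j} max(0, r(k,j))]) / 2
        rp = np.maximum(r, 0)
        np.fill_diagonal(rp, r.diagonal())
        col = rp.sum(axis=0)
        a_new = np.minimum(0.0, col[None, :] - rp)
        np.fill_diagonal(a_new, col - r.diagonal())
        a = (a + a_new) / 2
    return r, a

# Toy usage: two groups of three points, s(i,j) = -d^2(i,j),
# shared preference on the diagonal
pts = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
s = -((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
np.fill_diagonal(s, np.median(s))
r, a = affinity_propagation(s)
exemplars = np.flatnonzero(np.diag(r) + np.diag(a) > 0)
```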
PARALLEL CLUSTER CENTERS, 2

Affinity Propagation at the Complementary criterion (APC):

1. Similarity field s(i,j) = <yi, yj>; preference s(i,i) = <yi, yi>; a(i,i) = 0.
2. Exchange process:
   r(i,j) ← (r(i,j) + s(i,j) - max_{l≠j} [a(i,l) + s(i,l)]) / 2
   a(i,j) ← (a(i,j) + min[0, r(j,j) + Σ_{k≠i,j} max(0, r(k,j))]) / 2
3. Choose i with responsibility r(i,i) > mean.
PARALLEL CLUSTER CENTERS, 3

Affinity Propagation at the Complementary criterion (APC):
similarity field s(i,j) = <yi, yj>, preference s(i,i) = <yi, yi>.

Why? Because:

D(S, c) = Σk=1..K (1/Nk) Σ_{i,j∈Sk} <yi, yj>
PARALLEL ANOMALOUS CLUSTERS (PAC) (MIRKIN, TOKMAKOV, AMORIM, MAKARENKOV, IN PROGRESS)

1. Affinity Propagation at the Complementary criterion (APC), followed by K-Means.
2. A-Ward till a stop-condition based on Δ(k,l) = W(S(k,l), c(k,l)) - W(S, c).
3. K-Means reiterated.
PARALLEL ANOMALOUS CLUSTERS

[Figure: generated Gaussian clusters in the (u1, u2) plane at spread factors sf = .75, .50, .25]

Experimental computation with K* generated clusters at spread factor sf:
at K* = 7, 15, 25 and sf = 0.75, 0.50, PAC finds K* exactly;
at sf = 0.25, PAC finds K* exactly at K* = 15, 25.
CONCLUSION

The complementary criterion (CCC) has a natural meaning: big anomalous clusters!

Two ways to advance:
• Sequential:
  • ik-means, with a good cluster recovery
  • A-Ward, an effective agglomerative method
• Parallel:
  • CCC-based Affinity Propagation
  • PAC, a k-means method definitively capturing K