K-means++ and K-means|| (K-means Parallel)
Jun Wang
Review of k-means
Simple and fast:
Choose k centers randomly
Assign each point to its nearest center
Update the centers
Repeat the assignment and update steps until convergence
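The review steps above can be sketched in a few lines of NumPy (an illustrative sketch; the function name `kmeans` and its defaults are my own):

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """Plain k-means (Lloyd's algorithm) with random seeding."""
    rng = np.random.default_rng(rng)
    # 1. Choose k centers uniformly at random from the data.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to its nearest center (squared distances).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 3. Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # Final assignment against the converged centers.
    labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return centers, labels
```

The random seeding in step 1 is exactly what k-means++ and k-means|| replace.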
K-means++
You are probably already using it.
Spend more time choosing the k centers (seeding); save time during clustering.
K-means++ algorithm
Seeding:
Choose the first center from X at random
For the remaining k-1 centers: sample one center at a time from X with probability p, then update the center matrix
Clustering: run standard k-means from those centers
d_i^2 = min_j ||x_i - c_j||^2, the squared Euclidean distance from x_i to its nearest chosen center c_j
How to choose the k centers
Choose the first center from X uniformly at random
Calculate all d_i^2
Calculate P_i:
D = d_1^2 + d_2^2 + d_3^2 + ... + d_n^2
P_i = d_i^2 / D, so that the P_i sum to 1
Points farther away from the red point (the current center) have a better chance of being chosen
Pick the next center with probability P_i
Keep doing the following until k centers are found:
Update the center matrix
Recalculate d_i^2
Recalculate P_i
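The seeding loop above can be sketched as follows (an illustrative NumPy sketch; `kmeans_pp_seed` is my own name):

```python
import numpy as np

def kmeans_pp_seed(X, k, rng=None):
    """k-means++ seeding: each new center is drawn with P_i = d_i^2 / D."""
    rng = np.random.default_rng(rng)
    n = len(X)
    # First center: uniform over X.
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # d_i^2: squared distance from each point to its nearest chosen center.
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # P_i = d_i^2 / D; points far from all current centers are favored.
        p = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=p)])
    return np.array(centers)
```

The returned centers are then handed to the usual Lloyd iterations.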
K-means|| algorithm
Seeding:
Choose a small subset C from X
Assign a weight to each point in C
Cluster the weighted C into k centers
Clustering: run standard k-means from those centers
Choose subset C from X:
Let D = sum of squared distances = d_1^2 + d_2^2 + d_3^2 + ... + d_n^2
Let L be a function of k, e.g. 0.2k or 1.5k
For about ln(D) rounds: pick each point in X independently (a Bernoulli trial) with P(chosen) = L * d_i^2 / D, then update C
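The oversampling rounds above can be sketched as follows (illustrative; `kmeans_parallel_seed` is my own name, and the probability is clipped at 1 since L * d_i^2 / D can exceed it):

```python
import numpy as np

def kmeans_parallel_seed(X, k, l=None, rng=None):
    """k-means|| oversampling: build a small candidate set C from X."""
    rng = np.random.default_rng(rng)
    l = l if l is not None else 2 * k            # oversampling factor L, a function of k
    n = len(X)
    C = [X[rng.integers(n)]]                     # start from one random point
    d2 = ((X - C[0]) ** 2).sum(axis=1)
    D = d2.sum()
    rounds = int(np.ceil(np.log(D)))             # about ln(D) rounds
    for _ in range(rounds):
        if D == 0:                               # every point already in C
            break
        # Each point joins C independently with probability min(1, L*d_i^2/D).
        p = np.minimum(1.0, l * d2 / D)
        chosen = rng.random(n) < p
        C.extend(X[chosen])
        # Update d_i^2 and D against the enlarged C.
        d2 = ((X[:, None, :] - np.array(C)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        D = d2.sum()
    return np.array(C)
```

Once a point joins C its d_i^2 drops to zero, so it is never picked again.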
How many points end up in C?
About ln(D) iterations; in each iteration the expected number of chosen points is 1*P_1 + 1*P_2 + ... + 1*P_n = L
So about ln(D) * L points in total
Cluster the subset C (the red points are the members of C)
Cluster the sample C: calculate the distances from point A to the other points in C, and find the smallest; in this case it is d_c1
Cluster the sample C: calculate the distances between point A and all points in X, giving the d_x_i
Cluster the sample C: compare each d_x_i to d_c1, and let W_A = the number of points with d_x_i < d_c1
This gives the weight matrix W
Cluster the weighted C into k clusters to get the k centers
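The weighting and final reduction can be sketched as below. This uses the standard k-means|| weighting (each candidate is weighted by the number of points of X it is nearest to), a close relative of the count described above; `weight_and_reduce` and the weighted Lloyd loop are my own illustrative names and choices:

```python
import numpy as np

def weight_and_reduce(X, C, k, rng=None):
    """Weight each candidate in C by how many points of X it is nearest to,
    then cluster the weighted candidates down to k centers."""
    rng = np.random.default_rng(rng)
    # w_c = number of points in X whose nearest candidate is c.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    w = np.bincount(d2.argmin(axis=1), minlength=len(C)).astype(float)
    # Weighted k-means on the small set C (Lloyd iterations with weights).
    centers = C[rng.choice(len(C), size=k, replace=False, p=w / w.sum())]
    for _ in range(50):
        dc = ((C[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        lab = dc.argmin(axis=1)
        for j in range(k):
            m = lab == j
            if w[m].sum() > 0:
                centers[j] = np.average(C[m], axis=0, weights=w[m])
    return centers
```

Since C is small, this final weighted clustering is cheap even though it runs on a single machine.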
Differences among the three methods

|            | K-means                   | K-means++                                | K-means\|\|                              |
| ---------- | ------------------------- | ---------------------------------------- | ---------------------------------------- |
| seeding    | choose k centers randomly | choose k centers proportionally to d_i^2 | choose subset C and get k centers from C |
| clustering | same Lloyd iterations in all three                                                                              |
Hypothesis

|            | K-means                          | K-means++                              | K-means\|\|                                       |
| ---------- | -------------------------------- | -------------------------------------- | ------------------------------------------------- |
| seeding    | choose k centers randomly (fast) | choose k centers proportionally (slow) | choose subset C and get k centers from C (slower) |
| clustering | slow                             | fast                                   | faster                                            |
Test the hypothesis on four data sets:
Toy data one: very small
Cloud data: small
Spam data: moderate
Toy data two: very large
Simple data set: N=100; d=2; k=2; iterations=100
Execution time
Cloud data: 1024 points in 10 dimensions; k=6
Execution time (in seconds)
Total scatter (within-cluster sum of squared distances)
Spambase data: features available to an e-mail spam detection system; 4601 points in 58 dimensions; k=10
Execution time
Total scatter
Complex data set: N=20,000; d=40; k=40
Execution time
Clustered figure with true labels
Clustered figure with computed labels
Summary

|                         | K-means | K-means++ | K-means\|\| |
| ----------------------- | ------- | --------- | ----------- |
| Small data              | Fast    | Very fast | Slow        |
| Moderate and large data | Slow    | Very slow | Fast        |
Selecting L: the choice does not matter much when the data set is small; try different values on a large data set
Questions