
1 The Effectiveness of Lloyd-type Methods for the k-means Problem
Chaitanya Swamy (University of Waterloo)
Joint work with Rafi Ostrovsky (UCLA), Yuval Rabani (Technion), and Leonard Schulman (Caltech)

2 The k-means Problem
Given: a point set X ⊆ R^d with |X| = n; d is the L_2 distance.
Partition X into k clusters X_1, …, X_k, assigning each point in X_i to a common center c_i ∈ R^d.
Goal: minimize ∑_i ∑_{x ∈ X_i} d(x, c_i)^2.
(Figure: a point set partitioned into clusters X_1, X_2, X_3 with centers c_1, c_2, c_3.)
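
As a concrete reference (not part of the original slides), here is a minimal NumPy sketch of this objective; the names kmeans_cost, X, and centers are illustrative.

```python
import numpy as np

def kmeans_cost(X: np.ndarray, centers: np.ndarray) -> float:
    """k-means objective: each point contributes its squared L2 distance
    to its nearest center (points and centers are rows)."""
    # dists[i, j] = squared distance from point i to center j
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).sum())
```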

3 k-means (contd.)
Given the c_i's, the best clustering assigns each point to its nearest center: X_i = {x ∈ X : c_i is the center nearest to x}.
Given the X_i's, the best choice of centers is c_i = center of mass of X_i = ctr(X_i) = ∑_{x ∈ X_i} x / |X_i|.
An optimal solution satisfies both properties.
The problem is NP-hard even for k = 2 (when n and d are not fixed).

4 Related Work
The k-means problem dates back to Steinhaus (1956).
a) Approximation algorithms: algorithms with provable guarantees.
PTASs with varying runtime dependence on n, d, k: poly/linear in n, but possibly exponential in d and/or k.
– Matousek: poly(n), exp(d, k)
– Kumar, Sabharwal & Sen (KSS04): lin(n, d), exp(k)
O(1)-approximation algorithms for k-median: any point set in any metric, runtime poly(n, d, k); the guarantees also translate to k-means.
– Charikar, Guha, Tardos & Shmoys
– Arya et al. + Kanungo et al.: (9+ε)-approximation

5 b) Heuristics: Lloyd's method
Invented in 1957, and it remains an extremely popular heuristic even today.
1) Start with k initial / "seed" centers c_1, …, c_k.
2) Iterate the following Lloyd step:
   a) Assign each point to its nearest center c_i to obtain a clustering X_1, …, X_k.
   b) Update c_i ← ctr(X_i) = ∑_{x ∈ X_i} x / |X_i|.
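
A minimal sketch of this procedure in NumPy (illustrative, not the authors' code; the function name lloyd and the stopping rule are my own):

```python
import numpy as np

def lloyd(X: np.ndarray, centers: np.ndarray, iters: int = 100) -> np.ndarray:
    """Iterate Lloyd steps from the given seed centers; return the final centers."""
    for _ in range(iters):
        # (a) assign each point to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # (b) move each center to the mean of its cluster (keep it if the cluster is empty)
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers
```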


8 Lloyd's method: What's known?
Some bounds on the number of iterations of Lloyd-type methods: Inaba-Katoh-Imai; Har-Peled-Sadri; Arthur-Vassilvitskii ('06).
Performance is very sensitive to the choice of seed centers; there is a lot of literature on finding "good" seeding methods for Lloyd.
But there is almost no analysis that proves performance guarantees on the quality of the final solution for arbitrary k and dimension.
Our Goal: analyze Lloyd and try to prove rigorous performance guarantees for Lloyd-type methods.

9 Our Results
We introduce a clusterability or separation condition, and give a novel, efficient sampling process for seeding Lloyd's method with initial centers.
We show that if the data satisfies our clusterability condition:
– seeding + 1 Lloyd step yields a constant-factor approximation in time linear in n and d and poly in k: potentially faster than Lloyd variants that require multiple reseedings
– seeding + KSS04-sampling gives a PTAS: the algorithm is faster and simpler than the PTAS in KSS04
Main Theorem: If the data has a "meaningful k-clustering", then there is a simple, efficient seeding method such that Lloyd-type methods return a near-optimal solution.

10 "Meaningful k-Clustering"
Settings where one would NOT consider the data to possess a meaningful k-clustering:
1) If near-optimum cost can be achieved by two very distinct k-partitions of the data, then the identity of an optimal k-partition carries little meaning – it provides an ambiguous classification.
2) If the cost of the best k-clustering ≈ the cost of the best (k-1)-clustering, then a k-clustering yields only a marginal benefit over the best (k-1)-clustering – a smaller value of k should be used here. (Example in the figure: k = 3.)

11 We formalize 2). Let Δ_k^2(X) = cost of the best k-clustering of X.
X is ε-separated for k-means iff Δ_k^2(X) / Δ_{k-1}^2(X) ≤ ε^2.
This is a simple condition: the drop in k-clustering cost is already used by practitioners to choose the right k.
One can show that (roughly): X is ε-separated for k-means ⟺ any two low-cost k-clusterings disagree on only a small fraction of the data.
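
One rough way to check this condition empirically (a sketch, not from the paper): estimate Δ_k^2(X) and Δ_{k-1}^2(X) with an off-the-shelf k-means solver and compare the ratio to ε^2. The solver returns a heuristic (local-optimum) cost, so this only approximates the true Δ values; scikit-learn's KMeans is assumed to be available.

```python
import numpy as np
from sklearn.cluster import KMeans

def separation_ratio(X: np.ndarray, k: int) -> float:
    """Estimate Delta_k^2(X) / Delta_{k-1}^2(X); a small ratio suggests X is
    epsilon-separated for k-means, i.e., has a meaningful k-clustering."""
    cost = lambda kk: KMeans(n_clusters=kk, n_init=10, random_state=0).fit(X).inertia_
    return cost(k) / cost(k - 1)
```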

12 Some basic facts
Notation: n = |X|, c = ctr(X).
Fact: For all p ∈ R^d, ∑_{x ∈ X} d(x, p)^2 = Δ_1^2(X) + n·d(p, c)^2. Hence ∑_{{x,y} ⊆ X} d(x, y)^2 = n·Δ_1^2(X).
[Write d(x, p)^2 = (x - c + c - p)^T (x - c + c - p) and expand.]
Lemma: Let X = X_1 ∪ X_2 be a partition of X with c_i = ctr(X_i), so that c = (|X_1|·c_1 + |X_2|·c_2) / n. Then
Δ_1^2(X) = Δ_1^2(X_1) + Δ_1^2(X_2) + (|X_1|·|X_2| / n)·d(c_1, c_2)^2.
Proof: Δ_1^2(X) = ∑_{x ∈ X_1} d(x, c)^2 + ∑_{x ∈ X_2} d(x, c)^2
 = (Δ_1^2(X_1) + |X_1|·d(c_1, c)^2) + (Δ_1^2(X_2) + |X_2|·d(c_2, c)^2)
 = Δ_1^2(X_1) + Δ_1^2(X_2) + (|X_1|·|X_2| / n)·d(c_1, c_2)^2.
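
A quick numerical sanity check of the partition lemma on random data (a sketch, not from the slides; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X1, X2 = X[:40], X[40:]          # an arbitrary two-way partition of X

def delta1_sq(P: np.ndarray) -> float:
    """Cost of the best 1-clustering of P: squared distances to the centroid."""
    return float(((P - P.mean(axis=0)) ** 2).sum())

c1, c2 = X1.mean(axis=0), X2.mean(axis=0)
lhs = delta1_sq(X)
rhs = delta1_sq(X1) + delta1_sq(X2) + (len(X1) * len(X2) / len(X)) * ((c1 - c2) ** 2).sum()
assert np.isclose(lhs, rhs)      # the lemma holds exactly (up to floating point)
```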

13 The 2-means problem (k = 2)
Assume X is ε-separated for 2-means.
X_1*, X_2*: optimal clusters; c_i* = ctr(X_i*); D* = d(c_1*, c_2*); n_i = |X_i*|;
(r_i*)^2 = ∑_{x ∈ X_i*} d(x, c_i*)^2 / n_i = Δ_1^2(X_i*) / n_i = avg. squared distance within cluster X_i*.
Lemma: For i = 1, 2, (r_i*)^2 ≤ (ε^2 / (1 - ε^2))·D*^2.
Proof: Δ_2^2(X) / ε^2 ≤ Δ_1^2(X) = Δ_2^2(X) + (n_1·n_2 / n)·D*^2.
(Figure: optimal clusters X_1*, X_2* with centers c_1*, c_2*, radii r_1*, r_2*, and inter-center distance D*.)
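
Filling in the algebra behind this lemma (my reconstruction of the omitted step, using n_i·(r_i*)^2 = Δ_1^2(X_i*) ≤ Δ_2^2(X) and n_1·n_2/n ≤ n_i):

```latex
\Delta_2^2(X) \le \epsilon^2 \Delta_1^2(X)
  = \epsilon^2\Bigl(\Delta_2^2(X) + \tfrac{n_1 n_2}{n} D^{*2}\Bigr)
\;\Longrightarrow\;
\Delta_2^2(X) \le \frac{\epsilon^2}{1-\epsilon^2}\cdot\frac{n_1 n_2}{n}\, D^{*2},
\qquad
(r_i^*)^2 \le \frac{\Delta_2^2(X)}{n_i}
  \le \frac{\epsilon^2}{1-\epsilon^2}\cdot\frac{n_1 n_2}{n\, n_i}\, D^{*2}
  \le \frac{\epsilon^2}{1-\epsilon^2}\, D^{*2}.
```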

14 The 2-means algorithm
Assume: X is ε-separated for 2-means.
1) Sampling-based seeding procedure:
– Pick two seed centers c_1, c_2 by randomly picking a pair x, y ∈ X with probability proportional to d(x, y)^2.
2) Lloyd step, or a simpler "ball k-means" step:
– For each c_i, let B_i = {x ∈ X : d(x, c_i) ≤ d(c_1, c_2)/3}.
– Update c_i ← ctr(B_i); return these as the final centers.
The sampling can be implemented in O(nd) time, so the entire algorithm runs in O(nd) time.
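
A compact sketch of this algorithm (illustrative names, not the authors' code; the seeding below is the naive quadratic-time version rather than the O(nd) implementation mentioned above):

```python
import numpy as np

def two_means(X: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    n = len(X)
    # 1) Seeding: pick a pair (x, y) with probability proportional to d(x, y)^2.
    pair_d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    idx = rng.choice(n * n, p=(pair_d2 / pair_d2.sum()).ravel())
    c = [X[idx // n].copy(), X[idx % n].copy()]
    # 2) Ball-k-means step: average the points within distance d(c1, c2)/3 of each seed.
    radius = np.linalg.norm(c[0] - c[1]) / 3.0
    for i in range(2):
        ball = X[np.linalg.norm(X - c[i], axis=1) <= radius]
        if len(ball) > 0:
            c[i] = ball.mean(axis=0)
    return np.array(c)
```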

15 2-means: Analysis
Let core(X_i*) = {x ∈ X_i*: d(x, c_i*)^2 ≤ (r_i*)^2 / ρ}, where ρ = Θ(ε^2) < 1.
Seeding lemma: With probability 1 – O(ρ), c_1 and c_2 lie in the cores of X_1* and X_2*.
Proof: By Markov's inequality, |core(X_i*)| ≥ (1 - ρ)·n_i for i = 1, 2. Let
A = ∑_{x ∈ core(X_1*), y ∈ core(X_2*)} d(x, y)^2 ≈ (1 - ρ)^2·n_1·n_2·D*^2,
B = ∑_{{x,y} ⊆ X} d(x, y)^2 = n·Δ_1^2(X) ≈ n_1·n_2·D*^2.
Probability of success = A / B ≈ (1 - ρ)^2 = 1 – O(ρ).
(Figure: seeds c_1, c_2 landing in core(X_1*) and core(X_2*).)

16 2-means analysis (contd.)
Recall that B_i = {x ∈ X : d(x, c_i) ≤ d(c_1, c_2)/3}.
Ball-k-means lemma: For i = 1, 2, core(X_i*) ⊆ B_i ⊆ X_i*. Therefore d(ctr(B_i), c_i*)^2 ≤ ρ·(r_i*)^2 / (1 – ρ).
Intuitively, since B_i ⊆ X_i* and B_i contains almost all of the mass of X_i*, ctr(B_i) must be close to ctr(X_i*) = c_i*.
(Figure: balls B_1, B_2 around the seeds, each containing the corresponding core.)


19 2-means analysis (contd.)
Recall that B_i = {x ∈ X : d(x, c_i) ≤ d(c_1, c_2)/3}.
Ball-k-means lemma: For i = 1, 2, core(X_i*) ⊆ B_i ⊆ X_i*. Therefore d(ctr(B_i), c_i*)^2 ≤ ρ·(r_i*)^2 / (1 – ρ).
Proof: Applying the partition lemma (slide 12) to X_i* = B_i ∪ (X_i* \ B_i),
Δ_1^2(X_i*) ≥ (|B_i|·|X_i* \ B_i| / n_i)·d(ctr(B_i), ctr(X_i* \ B_i))^2.
Also, d(ctr(B_i), c_i*) = (|X_i* \ B_i| / n_i)·d(ctr(B_i), ctr(X_i* \ B_i)),
so n_i·(r_i*)^2 ≥ (n_i·|B_i| / |X_i* \ B_i|)·d(ctr(B_i), c_i*)^2.
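
Combining the displayed relations with core(X_i*) ⊆ B_i and |core(X_i*)| ≥ (1 - ρ)·n_i, the bound in the lemma follows (my reconstruction of the final step):

```latex
d\bigl(\mathrm{ctr}(B_i), c_i^*\bigr)^2
  \;\le\; \frac{|X_i^* \setminus B_i|}{|B_i|}\,(r_i^*)^2
  \;\le\; \frac{\rho\, n_i}{(1-\rho)\, n_i}\,(r_i^*)^2
  \;=\; \frac{\rho\,(r_i^*)^2}{1-\rho}.
```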

20 2-means analysis (contd.)
Theorem: With probability 1 – O(ρ), the cost of the final clustering is at most Δ_2^2(X) / (1 – ρ), i.e., we get a (1/(1 – ρ))-approximation algorithm.
Since ρ = O(ε^2), both the approximation ratio and the probability of success tend to 1 as ε → 0.

21 Arbitrary k
The algorithm and analysis follow the same outline as in the 2-means case. If X is ε-separated for k-means, one can again show that all clusters are well separated, that is, cluster radius << inter-cluster distance: r_i* = O(ε)·d(c_i*, c_j*) for all i, j.
1) Seeding stage: choose k initial centers and ensure that they lie in the "cores" of the k optimal clusters.
– exploits the fact that clusters are well separated
– after the seeding stage, each optimal center has a distinct seed center very "near" it
2) Then run either a Lloyd step or a ball-k-means step.
Theorem: If X is ε-separated for k-means, then one obtains an α(ε)-approximation algorithm, where α(ε) → 1 as ε → 0.

22 Schematic of the entire algorithm
Seeding (each route below produces k well-placed seeds):
– Simple sampling: pick k centers as follows – first pick 2 centers c_1, c_2 as in 2-means; to pick center c_{i+1}, pick x ∈ X with probability proportional to min_{j ≤ i} d(x, c_j)^2 (see the sketch after this list). Success probability = exp(-k).
– Oversampling + deletion: sample O(k) centers, then greedily delete until k remain. O(1) success probability, O(nkd + k^3·d) time.
– Greedy deletion: start with n centers and keep deleting the center that causes the least cost increase until k centers remain. O(n^3·d) time.
From the k well-placed seeds:
– Ball-k-means or Lloyd step: gives an O(1)-approximation.
– KSS04-sampling: gives a PTAS.
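
A minimal sketch of the simple-sampling stage only (illustrative, not the authors' code; the oversampling/deletion and ball-k-means stages are omitted, and the pair-seeding is the naive quadratic-time version):

```python
import numpy as np

def sample_seeds(X: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    n = len(X)
    # First two seeds: a pair (x, y) with probability proportional to d(x, y)^2.
    pair_d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    idx = rng.choice(n * n, p=(pair_d2 / pair_d2.sum()).ravel())
    seeds = [X[idx // n], X[idx % n]]
    # Remaining seeds: pick x with probability proportional to min_j d(x, c_j)^2.
    while len(seeds) < k:
        d2 = ((X[:, None, :] - np.array(seeds)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        seeds.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(seeds)
```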

23 Simple sampling: analysis sketch
Assume X is ε-separated for k-means.
X_1*, …, X_k*: optimal clusters; c_i* = ctr(X_i*); n_i = |X_i*|; (r_i*)^2 = ∑_{x ∈ X_i*} d(x, c_i*)^2 / n_i = Δ_1^2(X_i*) / n_i.
core(X_i*) = {x ∈ X_i*: d(x, c_i*)^2 ≤ (r_i*)^2 / ρ}, where ρ = Θ(ε^{2/3}).
Lemma: With probability (1 – O(ρ))^k, all sampled centers lie in the cores of distinct optimal clusters.
Proof: We show inductively that if c_1, …, c_i lie in distinct cores, then with probability 1 – O(ρ), so does center c_{i+1}.
Base case: X is ε-separated for k-means ⟹ X_i* ∪ X_j* is ε-separated for 2-means for every i ≠ j (because merging two clusters causes a huge increase in cost). So by the 2-means analysis, the first two centers c_1, c_2 lie in distinct cores.

25 Simple sampling: analysis (contd.)
Inductive step: Assume c_1, …, c_i lie in the cores of X_1*, …, X_i*. Let C = {c_1, …, c_i}.
A = ∑_{j ≥ i+1} ∑_{x ∈ core(X_j*)} d(x, C)^2 ≈ ∑_{j ≥ i+1} (1 - ρ)·n_j·d(c_j*, C)^2
B = ∑_{j ≤ k} ∑_{x ∈ X_j*} d(x, C)^2 ≈ ∑_{j ≤ i} Δ_1^2(X_j*) + ∑_{j ≥ i+1} (Δ_1^2(X_j*) + n_j·d(c_j*, C)^2) ≈ ∑_{j ≥ i+1} n_j·d(c_j*, C)^2
Probability that c_{i+1} lands in the core of a new optimal cluster = A / B = 1 – O(ρ).
(Figure: seeds c_1, …, c_i in the cores of X_1*, …, X_i*; clusters X_{i+1}*, …, X_k* not yet covered.)

26 Open Questions
Deeper analysis of Lloyd: are there weaker conditions under which one can prove performance guarantees for Lloyd-type methods?
Is there a PTAS for k-means with polynomial runtime dependence on n, k, and d? Is the problem APX-hard in the geometric setting?
Is there a PTAS for k-means under our separation condition?
Other applications of the separation condition?

27 Thank You.

