The Effectiveness of Lloyd-type Methods for the k-means Problem
Chaitanya Swamy, University of Waterloo
Joint work with Rafi Ostrovsky (UCLA), Yuval Rabani (Technion), Leonard Schulman (Caltech)
The k-means Problem
Given: a point set $X \subseteq \mathbb{R}^d$ with $|X| = n$ (n points in d-dimensional space)
– partition X into k clusters $X_1, \ldots, X_k$
– assign each point in $X_i$ to a common center $c_i \in \mathbb{R}^d$
Goal: Minimize $\sum_i \sum_{x \in X_i} d(x, c_i)^2$, where d is the $L_2$ distance.
[Figure: three clusters $X_1, X_2, X_3$ with centers $c_1, c_2, c_3$]
k-means (contd.)
Given the $c_i$’s, the best clustering assigns each point to its nearest center: $X_i = \{x \in X : c_i \text{ is the center nearest to } x\}$.
Given the $X_i$’s, the best choice of centers is to set $c_i$ = center of mass of $X_i$ = $\mathrm{ctr}(X_i) = \sum_{x \in X_i} x / |X_i|$.
An optimal solution satisfies both properties.
The problem is NP-hard even for k = 2 (n, d not fixed).
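The two properties above can be illustrated with a minimal pure-Python sketch. The helper names (`centroid`, `nearest`, `kmeans_cost`) and the toy point set are assumptions made for illustration; they are not from the talk.

```python
from math import dist

def centroid(points):
    """Center of mass ctr(X_i): the coordinate-wise mean of the points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def nearest(x, centers):
    """Index of the center nearest to x (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda i: dist(x, centers[i]))

def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(dist(x, centers[nearest(x, centers)]) ** 2 for x in points)

# Toy data: two obvious clusters of two points each.
X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Centers placed at the centers of mass of the two clusters.
centers = [centroid(X[:2]), centroid(X[2:])]
cost_at_centroids = kmeans_cost(X, centers)

# Moving one center away from its cluster's center of mass raises the cost.
cost_shifted = kmeans_cost(X, [(0.0, 0.5), (10.0, 1.0)])
```

On this toy input the shifted center gives a strictly larger cost, which is the second property in action.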
Related Work
The k-means problem dates back to Steinhaus (1956).
a) Approximation algorithms: algorithms with provable guarantees
PTAS’s with varying runtime dependence on n, d, k: poly/linear in n, could be exponential in d and/or k
– Matousek (poly(n), exp(d,k))
– Kumar, Sabharwal & Sen (KSS04) (lin(n,d), exp(k))
O(1)-approximation algorithms for k-median: any point set with any metric, runtime poly(n,d,k); guarantees also translate to k-means
– Charikar, Guha, Tardos & Shmoys
– Arya et al. + Kanungo et al.: $(9+\epsilon)$-approximation
b) Heuristics: Lloyd’s method, invented in … and remains an extremely popular heuristic even today
1) Start with k initial / “seed” centers $c_1, \ldots, c_k$.
2) Iterate the following Lloyd step:
a) Assign each point to the nearest center $c_i$ to obtain a clustering $X_1, \ldots, X_k$.
b) Update $c_i \leftarrow \mathrm{ctr}(X_i) = \sum_{x \in X_i} x / |X_i|$.
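The Lloyd step described on this slide can be sketched as follows. This is a minimal pure-Python illustration, not a tuned implementation; the function names, the `max_iters` cap, and the rule of keeping the old center when a cluster becomes empty are assumptions added here.

```python
from math import dist

def centroid(points):
    """Center of mass of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def lloyd_step(points, centers):
    """One Lloyd step: assign points to nearest centers, then recenter."""
    clusters = [[] for _ in centers]
    for x in points:
        i = min(range(len(centers)), key=lambda j: dist(x, centers[j]))
        clusters[i].append(x)
    # Assumption: an empty cluster keeps its previous center.
    return [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]

def lloyd(points, seeds, max_iters=100):
    """Iterate Lloyd steps until the centers stop changing."""
    centers = list(seeds)
    for _ in range(max_iters):
        new_centers = lloyd_step(points, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers

X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = lloyd(X, seeds=[(0.0, 0.0), (10.0, 2.0)])   # one seed per true cluster
stuck = lloyd(X, seeds=[(0.0, 0.0), (0.0, 2.0)])   # both seeds in one cluster
```

With one seed in each cluster the iteration settles on the two cluster centroids. With both seeds in the left cluster it stops at a much worse local optimum, which shows why the choice of seeds matters.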
Lloyd’s method: What’s known?
Some bounds on the number of iterations of Lloyd-type methods: Inaba-Katoh-Imai; Har-Peled-Sadri; Arthur-Vassilvitskii (’06).
Performance is very sensitive to the choice of seed centers; there is a large literature on finding “good” seeding methods for Lloyd.
But there is almost no analysis that proves performance guarantees on the quality of the final solution for arbitrary k and dimension.
Our Goal: to analyze Lloyd and try to prove rigorous performance guarantees for Lloyd-type methods.
Our Results
Introduce a clusterability or separation condition.
Give a novel, efficient sampling process for seeding Lloyd’s method with initial centers.
Show that if the data satisfies our clusterability condition:
– seeding + 1 Lloyd step yields a constant-approximation in time linear in n and d, poly(k); this is potentially faster than Lloyd variants that require multiple reseedings
– seeding + KSS04-sampling gives a PTAS; the algorithm is faster and simpler than the PTAS in KSS04.
Main Theorem: If data has a “meaningful k-clustering”, then there is a simple, efficient seeding method s.t. Lloyd-type methods return a near-optimal solution.
“Meaningful k-Clustering”
Settings where one would NOT consider data to possess a meaningful k-clustering:
1) If near-optimum cost can be achieved by two very distinct k-partitions of the data, then the identity of an optimal k-partition carries little meaning: it provides an ambiguous classification.
2) If cost of best k-clustering ≈ cost of best (k-1)-clustering, then a k-clustering yields only marginal benefit over the best (k-1)-clustering: one should use a smaller value of k here.
[Figure: example with k = 3]
We formalize 2). Let $\Delta_k^2(X)$ = cost of the best k-clustering of X.
X is $\epsilon$-separated for k-means iff $\Delta_k^2(X) / \Delta_{k-1}^2(X) \le \epsilon^2$.
Simple condition. The drop in k-clustering cost is already used by practitioners to choose the right k.
Can show that (roughly), X is $\epsilon$-separated for k-means $\Leftrightarrow$ any two low-cost k-clusterings disagree on only a small fraction of the data.
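To make the separation condition concrete, here is a brute-force sketch that computes $\Delta_k^2(X)$ exactly on a tiny input by trying every labeling of the points. It is exponential in n and meant only as an illustration; `opt_cost` and `is_separated` are hypothetical helper names, and the toy data is assumed.

```python
from itertools import product
from math import dist

def one_means_cost(points):
    """Delta_1^2: sum of squared distances to the center of mass."""
    n = len(points)
    c = tuple(sum(coords) / n for coords in zip(*points))
    return sum(dist(x, c) ** 2 for x in points)

def opt_cost(points, k):
    """Delta_k^2 by exhaustive search over all k^n labelings (tiny n only)."""
    best = float("inf")
    for labels in product(range(k), repeat=len(points)):
        clusters = [[x for x, lab in zip(points, labels) if lab == i]
                    for i in range(k)]
        # Empty clusters contribute nothing; using fewer clusters never helps.
        cost = sum(one_means_cost(c) for c in clusters if c)
        best = min(best, cost)
    return best

def is_separated(points, k, eps):
    """Check Delta_k^2(X) <= eps^2 * Delta_{k-1}^2(X); requires k >= 2."""
    return opt_cost(points, k) <= eps ** 2 * opt_cost(points, k - 1)

X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
two_ok = is_separated(X, 2, eps=0.2)    # going from 1 to 2 clusters helps a lot
three_ok = is_separated(X, 3, eps=0.2)  # going from 2 to 3 helps only a little
```

On this input the data is 0.2-separated for k = 2 but not for k = 3, matching the intuition that k = 2 is the right number of clusters.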
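Some basic facts
Notation: $n = |X|$, $c = \mathrm{ctr}(X)$.
Fact: $\forall p \in \mathbb{R}^d$, $\sum_{x \in X} d(x, p)^2 = \Delta_1^2(X) + n \cdot d(p, c)^2$ $\Rightarrow$ $\sum_{\{x,y\} \subseteq X} d(x, y)^2 = n \cdot \Delta_1^2(X)$.
[Write $d(x, p)^2 = (x - c + c - p)^T (x - c + c - p)$ and expand.]
Lemma: Let $X = X_1 \cup X_2$ be a partition of X with $c_i = \mathrm{ctr}(X_i)$. Then $\Delta_1^2(X) = \Delta_1^2(X_1) + \Delta_1^2(X_2) + (|X_1| |X_2| / n) \cdot d(c_1, c_2)^2$.
Proof: $\Delta_1^2(X) = \sum_{x \in X_1} d(x, c)^2 + \sum_{x \in X_2} d(x, c)^2$
$= (\Delta_1^2(X_1) + |X_1| \cdot d(c_1, c)^2) + (\Delta_1^2(X_2) + |X_2| \cdot d(c_2, c)^2)$
$= \Delta_1^2(X_1) + \Delta_1^2(X_2) + (|X_1| |X_2| / n) \cdot d(c_1, c_2)^2$,
using $c = (|X_1| \cdot c_1 + |X_2| \cdot c_2) / n$.
[Figure: X split into $X_1$, $X_2$ with centers $c_1$, $c_2$ and overall center c]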
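These identities are easy to sanity-check numerically. The sketch below evaluates both sides of the Fact, its pairwise-distance consequence, and the Lemma on an arbitrary small point set; the data, the point `p`, and the split into `X1` and `X2` are made up for illustration.

```python
from itertools import combinations
from math import dist, isclose

def centroid(points):
    """Center of mass of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def one_means_cost(points):
    """Delta_1^2: sum of squared distances to the center of mass."""
    c = centroid(points)
    return sum(dist(x, c) ** 2 for x in points)

X = [(0.0, 0.0), (1.0, 3.0), (4.0, 1.0), (6.0, 5.0), (7.0, 2.0)]
n = len(X)
c = centroid(X)
delta1 = one_means_cost(X)

# Fact: sum_x d(x,p)^2 = Delta_1^2(X) + n * d(p,c)^2 for any point p.
p = (2.0, -1.0)
lhs_fact = sum(dist(x, p) ** 2 for x in X)
rhs_fact = delta1 + n * dist(p, c) ** 2

# Consequence: sum over unordered pairs of d(x,y)^2 = n * Delta_1^2(X).
pair_sum = sum(dist(x, y) ** 2 for x, y in combinations(X, 2))

# Lemma: cost of X in terms of the costs of the two parts of a partition.
X1, X2 = X[:2], X[2:]
rhs_lemma = (one_means_cost(X1) + one_means_cost(X2)
             + len(X1) * len(X2) / n * dist(centroid(X1), centroid(X2)) ** 2)

checks = (isclose(lhs_fact, rhs_fact),
          isclose(pair_sum, n * delta1),
          isclose(delta1, rhs_lemma))
```

All three entries of `checks` should come out true up to floating-point rounding, for any choice of `p` and any split of `X`.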
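The 2-means problem (k=2)
$X_1^*, X_2^*$: optimal clusters
$c_i^* = \mathrm{ctr}(X_i^*)$, $D^* = d(c_1^*, c_2^*)$
$n_i = |X_i^*|$, $(r_i^*)^2 = \sum_{x \in X_i^*} d(x, c_i^*)^2 / n_i = \Delta_1^2(X_i^*) / n_i$ = avg. squared distance in cluster $X_i^*$
Lemma: For i = 1, 2, $(r_i^*)^2 \le \frac{\epsilon^2}{1 - \epsilon^2} \cdot D^{*2}$.
Proof: $\Delta_2^2(X) / \epsilon^2 \le \Delta_1^2(X) = \Delta_2^2(X) + (n_1 n_2 / n) \cdot D^{*2}$, where the inequality holds because X is $\epsilon$-separated for 2-means.
[Figure: optimal clusters $X_1^*$, $X_2^*$ with centers $c_1^*$, $c_2^*$ at distance $D^*$ and radii $r_1^*$, $r_2^*$]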
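The one-line proof compresses two steps. One possible expansion, using only the definitions on this slide, is given below; the observation $n_i (r_i^*)^2 = \Delta_1^2(X_i^*) \le \Delta_2^2(X)$ is the step spelled out here.

```latex
\begin{align*}
\Delta_2^2(X) \le \epsilon^2 \Delta_1^2(X)
  = \epsilon^2\Bigl(\Delta_2^2(X) + \tfrac{n_1 n_2}{n} D^{*2}\Bigr)
  &\;\Longrightarrow\;
  \Delta_2^2(X) \le \frac{\epsilon^2}{1-\epsilon^2}\cdot\frac{n_1 n_2}{n}\, D^{*2}, \\
n_i (r_i^*)^2 = \Delta_1^2(X_i^*) \le \Delta_2^2(X)
  &\;\Longrightarrow\;
  (r_i^*)^2 \le \frac{\epsilon^2}{1-\epsilon^2}\cdot\frac{n_{3-i}}{n}\, D^{*2}
  \le \frac{\epsilon^2}{1-\epsilon^2}\, D^{*2}.
\end{align*}
```

The last inequality only uses $n_{3-i} \le n$, so the bound in the lemma is somewhat loose for the larger cluster.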
The 2-means algorithm
Assume: X is $\epsilon$-separated for 2-means.
1) Sampling-based seeding procedure:
– Pick two seed centers $c_1, c_2$ by randomly picking the pair $x, y \in X$ with probability $\propto d(x, y)^2$.
2) Lloyd step or simpler “ball k-means step”:
– For each $c_i$, let $B_i = \{x \in X : d(x, c_i) \le d(c_1, c_2)/3\}$.
– Update $c_i \leftarrow \mathrm{ctr}(B_i)$; return these as the final centers.
Sampling can be implemented in O(nd) time, so the entire algorithm runs in O(nd) time.
[Figure: optimal clusters $X_1^*$, $X_2^*$ with centers $c_1^*$, $c_2^*$ at distance $D^*$]
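A minimal sketch of the two stages follows. The slide does not say how the O(nd) sampling is done; the two-stage scheme below (pick x with weight $\Delta_1^2(X) + n \cdot d(x, c)^2$, then y with weight $d(x, y)^2$) is one way to get it from the identity $\sum_y d(x, y)^2 = \Delta_1^2(X) + n \cdot d(x, c)^2$, and should be read as an assumption. Function names and data are illustrative, and the input is assumed to contain at least two distinct points.

```python
import random
from math import dist

def centroid(points):
    """Center of mass of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def seed_two_centers(points, rng):
    """Pick a pair (x, y) with probability proportional to d(x, y)^2."""
    n = len(points)
    c = centroid(points)
    sq = [dist(x, c) ** 2 for x in points]
    delta1 = sum(sq)
    # sum_y d(x,y)^2 equals delta1 + n * d(x,c)^2, so no O(n^2) pass is needed.
    x = rng.choices(points, weights=[delta1 + n * s for s in sq])[0]
    y = rng.choices(points, weights=[dist(x, z) ** 2 for z in points])[0]
    return x, y

def ball_kmeans_step(points, c1, c2):
    """Recenter each seed on the ball of radius d(c1, c2)/3 around it."""
    radius = dist(c1, c2) / 3
    new_centers = []
    for c in (c1, c2):
        ball = [x for x in points if dist(x, c) <= radius]
        new_centers.append(centroid(ball))  # ball contains c, so it is non-empty
    return new_centers

X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
     (20.0, 0.0), (20.0, 1.0), (21.0, 0.0), (21.0, 1.0)]
rng = random.Random(0)
c1, c2 = seed_two_centers(X, rng)
final_centers = ball_kmeans_step(X, c1, c2)
```

Both sampling passes touch each point once, which is where the O(nd) bound comes from. On well-separated data like this the two seeds almost always land in different clusters, and the ball step then returns the two cluster centroids.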
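2-means: Analysis
Let $\mathrm{core}(X_i^*) = \{x \in X_i^* : d(x, c_i^*)^2 \le (r_i^*)^2 / \rho\}$, where $\rho = \Theta(\epsilon^2) < 1$.
Seeding lemma: With prob. $1 - O(\rho)$, $c_1, c_2$ lie in the cores of $X_1^*, X_2^*$.
Proof: $|\mathrm{core}(X_i^*)| \ge (1 - \rho) n_i$ for i = 1, 2.
Let $A = \sum_{x \in \mathrm{core}(X_1^*),\, y \in \mathrm{core}(X_2^*)} d(x, y)^2 \approx (1 - \rho)^2 n_1 n_2 D^{*2}$.
$B = \sum_{\{x,y\} \subseteq X} d(x, y)^2 = n \cdot \Delta_1^2(X) \approx n_1 n_2 D^{*2}$.
Probability $= A / B \approx (1 - \rho)^2 = 1 - O(\rho)$.
[Figure: seeds $c_1$, $c_2$ inside $\mathrm{core}(X_1^*)$, $\mathrm{core}(X_2^*)$; optimal centers $c_1^*$, $c_2^*$ at distance $D^*$]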
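The bound on the core size is stated without proof. A standard way to obtain it, sketched here as a supplement, is Markov's inequality applied to the squared distance of a uniformly random point of $X_i^*$ from $c_i^*$, whose mean is $(r_i^*)^2$ by definition.

```latex
\begin{align*}
\bigl|X_i^* \setminus \mathrm{core}(X_i^*)\bigr|
  &= \bigl|\{x \in X_i^* : d(x, c_i^*)^2 > (r_i^*)^2/\rho\}\bigr| \\
  &\le n_i \cdot \frac{(r_i^*)^2}{(r_i^*)^2/\rho} = \rho\, n_i
  \quad\Longrightarrow\quad
  |\mathrm{core}(X_i^*)| \ge (1-\rho)\, n_i .
\end{align*}
```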
2-means analysis (contd.)
Recall that $B_i = \{x \in X : d(x, c_i) \le d(c_1, c_2)/3\}$.
Ball-k-means lemma: For i = 1, 2, $\mathrm{core}(X_i^*) \subseteq B_i \subseteq X_i^*$. Therefore $d(\mathrm{ctr}(B_i), c_i^*)^2 \le \rho\,(r_i^*)^2 / (1 - \rho)$.
Intuitively, since $B_i \subseteq X_i^*$ and $B_i$ contains almost all of the mass of $X_i^*$, $\mathrm{ctr}(B_i)$ must be close to $\mathrm{ctr}(X_i^*) = c_i^*$.
[Figure: balls $B_1$, $B_2$ around seeds $c_1$, $c_2$, each containing the core of its optimal cluster]
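2-means analysis (contd.)
Recall that $B_i = \{x \in X : d(x, c_i) \le d(c_1, c_2)/3\}$.
Ball-k-means lemma: For i = 1, 2, $\mathrm{core}(X_i^*) \subseteq B_i \subseteq X_i^*$. Therefore $d(\mathrm{ctr}(B_i), c_i^*)^2 \le \rho\,(r_i^*)^2 / (1 - \rho)$.
Proof: Applying the partition lemma from “Some basic facts” to $X_i^* = B_i \cup (X_i^* \setminus B_i)$:
$\Delta_1^2(X_i^*) \ge (|B_i| \, |X_i^* \setminus B_i| / n_i) \cdot d(\mathrm{ctr}(B_i), \mathrm{ctr}(X_i^* \setminus B_i))^2$.
Also $d(\mathrm{ctr}(B_i), c_i^*) = (|X_i^* \setminus B_i| / n_i) \cdot d(\mathrm{ctr}(B_i), \mathrm{ctr}(X_i^* \setminus B_i))$
$\Rightarrow n_i (r_i^*)^2 \ge (n_i |B_i| / |X_i^* \setminus B_i|) \cdot d(\mathrm{ctr}(B_i), c_i^*)^2$.
[Figure: balls $B_1$, $B_2$ around seeds $c_1$, $c_2$, each containing the core of its optimal cluster]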
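The slide stops one step short of the stated bound. A possible way to finish, assuming $|B_i| \ge |\mathrm{core}(X_i^*)| \ge (1 - \rho) n_i$ (from the first part of the lemma together with the core-size bound), and hence $|X_i^* \setminus B_i| \le \rho\, n_i$:

```latex
\begin{align*}
d(\mathrm{ctr}(B_i), c_i^*)^2
  \;\le\; \frac{|X_i^* \setminus B_i|}{|B_i|}\,(r_i^*)^2
  \;\le\; \frac{\rho\, n_i}{(1-\rho)\, n_i}\,(r_i^*)^2
  \;=\; \frac{\rho}{1-\rho}\,(r_i^*)^2 .
\end{align*}
```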
2-means analysis (contd.)
Theorem: With probability $1 - O(\rho)$, the cost of the final clustering is at most $\Delta_2^2(X) / (1 - \rho)$ $\Rightarrow$ we get a $(1/(1 - \rho))$-approximation algorithm.
Since $\rho = O(\epsilon^2)$, we have
approximation ratio $\to 1$ as $\epsilon \to 0$,
probability of success $\to 1$ as $\epsilon \to 0$.
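The cost bound in the theorem can be derived from the ball-k-means lemma and the Fact $\sum_{x \in X} d(x, p)^2 = \Delta_1^2(X) + n \cdot d(p, c)^2$. The sketch below writes $\hat{c}_i = \mathrm{ctr}(B_i)$ (notation introduced here) and charges every point of $X_i^*$ to $\hat{c}_i$; assigning each point to its nearest final center can only lower the cost.

```latex
\begin{align*}
\sum_{i=1}^{2} \sum_{x \in X_i^*} d(x, \hat{c}_i)^2
  &= \sum_{i=1}^{2} \Bigl( \Delta_1^2(X_i^*) + n_i\, d(\hat{c}_i, c_i^*)^2 \Bigr) \\
  &\le \sum_{i=1}^{2} n_i (r_i^*)^2 \Bigl( 1 + \frac{\rho}{1-\rho} \Bigr)
  = \frac{1}{1-\rho} \sum_{i=1}^{2} \Delta_1^2(X_i^*)
  = \frac{\Delta_2^2(X)}{1-\rho} .
\end{align*}
```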
Arbitrary k
Algorithm and analysis follow the same outline as in 2-means.
If X is $\epsilon$-separated for k-means, one can again show that all clusters are well separated, that is, cluster radius << inter-cluster distance: $r_i^* = O(\epsilon) \cdot d(c_i^*, c_j^*)$ $\forall i \ne j$.
1) Seeding stage: we choose k initial centers and ensure that they lie in the “cores” of the k optimal clusters.
– exploits the fact that clusters are well separated
– after the seeding stage, each optimal center has a distinct seed center very “near” it
2) Now, can run either a Lloyd step or a ball-k-means step.
Theorem: If X is $\epsilon$-separated for k-means, then one can obtain an $\alpha(\epsilon)$-approximation algorithm where $\alpha(\epsilon) \to 1$ as $\epsilon \to 0$.
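The slides do not spell out the ball-k-means step for general k. One natural generalization of the 2-means rule, used below purely as an assumption, takes the ball radius around each seed to be one third of the distance to the nearest other seed. The function name, the fallback for an empty ball, and the data are illustrative.

```python
from math import dist

def centroid(points):
    """Center of mass of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def ball_kmeans_step(points, seeds):
    """Recenter each seed on the points within a ball around it.

    Assumed radius rule: one third of the distance to the nearest other seed.
    """
    new_centers = []
    for i, c in enumerate(seeds):
        radius = min(dist(c, other)
                     for j, other in enumerate(seeds) if j != i) / 3
        ball = [x for x in points if dist(x, c) <= radius]
        # Assumption: keep the seed itself if no data point falls in its ball.
        new_centers.append(centroid(ball) if ball else c)
    return new_centers

# Three well-separated clusters of four points each.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
     (20.0, 0.0), (20.0, 1.0), (21.0, 0.0), (21.0, 1.0),
     (40.0, 0.0), (40.0, 1.0), (41.0, 0.0), (41.0, 1.0)]
seeds = [(0.0, 0.0), (21.0, 1.0), (40.0, 0.0)]  # one seed per cluster
centers = ball_kmeans_step(X, seeds)
```

With one seed in each cluster, every ball captures exactly its own cluster, so the step returns the three cluster centroids.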
Schematic of entire algorithm
Stage 1 (seeding): one of three alternatives, each producing k well-placed seeds
– Simple sampling: Pick k centers as follows. First pick 2 centers $c_1, c_2$ as in 2-means; to pick center $c_{i+1}$, pick $x \in X$ with probability $\propto \min_{j \le i} d(x, c_j)^2$. Success probability = exp(-k).
– Oversampling + deletion: sample O(k) centers, then greedily delete till k remain. O(1) success probability, $O(nkd + k^3 d)$ time.
– Greedy deletion: Start with n centers and keep deleting the center that causes the least cost increase till k centers remain. $O(n^3 d)$ time.
Stage 2 (from the k well-placed seeds):
– Ball k-means or Lloyd step: gives an O(1)-approx.
– KSS04-sampling: gives a PTAS.
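The simple sampling stage can be sketched as below. For brevity the first pair is drawn by enumerating all pairs, which costs O(n²d) rather than the O(nd) the slides mention for 2-means; that shortcut, the function name, and the toy data are assumptions of this sketch. It expects k ≥ 2 and at least k distinct points.

```python
import random
from itertools import combinations
from math import dist

def simple_sampling(points, k, rng):
    """Pick k seed centers from the data by distance-weighted sampling."""
    # First two centers: a pair chosen with probability proportional to d(x,y)^2
    # (naive enumeration of all pairs, for clarity only).
    pairs = list(combinations(points, 2))
    c1, c2 = rng.choices(pairs, weights=[dist(x, y) ** 2 for x, y in pairs])[0]
    centers = [c1, c2]
    # Each further center: x chosen with probability proportional to the
    # squared distance from x to its nearest already-chosen center.
    while len(centers) < k:
        weights = [min(dist(x, c) ** 2 for c in centers) for x in points]
        centers.append(rng.choices(points, weights=weights)[0])
    return centers

# Three well-separated clusters of four points each.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
     (20.0, 0.0), (20.0, 1.0), (21.0, 0.0), (21.0, 1.0),
     (40.0, 0.0), (40.0, 1.0), (41.0, 0.0), (41.0, 1.0)]
seeds = simple_sampling(X, 3, random.Random(0))
```

Points that are already centers get weight zero, so the k seeds are distinct. On data this well separated, the seeds usually fall in three different clusters; the oversampling and deletion variants on the slide exist to improve that success probability when k is large.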
Simple sampling: analysis sketch
$X_1^*, \ldots, X_k^*$: optimal clusters
$c_i^* = \mathrm{ctr}(X_i^*)$, $n_i = |X_i^*|$, $(r_i^*)^2 = \sum_{x \in X_i^*} d(x, c_i^*)^2 / n_i = \Delta_1^2(X_i^*) / n_i$
$\mathrm{core}(X_i^*) = \{x \in X_i^* : d(x, c_i^*)^2 \le (r_i^*)^2 / \rho\}$, where $\rho = \Theta(\epsilon^{2/3})$
Lemma: With probability $(1 - O(\rho))^k$, all sampled centers lie in the cores of distinct optimal clusters.
Proof: Will show inductively that if $c_1, \ldots, c_i$ lie in distinct cores, then with probability $1 - O(\rho)$, so does center $c_{i+1}$.
Base case: X is $\epsilon$-separated for k-means $\Rightarrow$ $X_i^* \cup X_j^*$ is $\epsilon$-separated for 2-means for every $i \ne j$ (because merging two clusters causes a huge increase in cost). So by the 2-means analysis, the first two centers $c_1, c_2$ lie in distinct cores.
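Simple sampling: analysis (contd.)
Inductive step: Assume $c_1, \ldots, c_i$ lie in the cores of $X_1^*, \ldots, X_i^*$. Let $C = \{c_1, \ldots, c_i\}$.
$A = \sum_{j \ge i+1} \sum_{x \in \mathrm{core}(X_j^*)} d(x, C)^2 \approx \sum_{j \ge i+1} (1 - \rho)\, n_j\, d(c_j^*, C)^2$
$B = \sum_{j \le k,\ x \in X_j^*} d(x, C)^2 \approx \sum_{j \le i} \Delta_1^2(X_j^*) + \sum_{j \ge i+1} \bigl(\Delta_1^2(X_j^*) + n_j\, d(c_j^*, C)^2\bigr) \approx \sum_{j \ge i+1} n_j\, d(c_j^*, C)^2$
Probability $= A / B = 1 - O(\rho)$
[Figure: clusters $X_1^*, \ldots, X_i^*$ already holding seeds $c_1, \ldots, c_i$ in their cores; clusters $X_{i+1}^*, \ldots, X_k^*$ with centers $c_{i+1}^*, \ldots, c_k^*$ not yet covered]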
Open Questions
Deeper analysis of Lloyd: are there weaker conditions under which one can prove performance guarantees for Lloyd-type methods?
PTAS for k-means with polytime dependence on n, k and d?
– Is it APX-hard in the geometric setting?
– PTAS for k-means under our separation condition?
Other applications of the separation condition?
Thank You.