The Effectiveness of Lloyd-type Methods for the k-means Problem
Chaitanya Swamy, University of Waterloo
Joint work with Rafi Ostrovsky (UCLA), Yuval Rabani (Technion), and Leonard Schulman (Caltech)

The k-means Problem
Given: X ⊆ R^d, a point set of n points in d-dimensional space, |X| = n; d(·,·) denotes the L2 distance.
Partition X into k clusters X_1, …, X_k, and assign each point in X_i to a common center c_i ∈ R^d.
Goal: minimize ∑_i ∑_{x∈X_i} d(x, c_i)².
(Figure: three clusters X_1, X_2, X_3 with centers c_1, c_2, c_3.)

k-means (contd.)
Given the c_i's, the best clustering assigns each point to its nearest center: X_i = {x ∈ X : c_i is the center nearest to x}.
Given the X_i's, the best choice of centers is c_i = center of mass of X_i = ctr(X_i) = ∑_{x∈X_i} x / |X_i|.
⇒ An optimal solution satisfies both properties.
The problem is NP-hard even for k = 2 (when n and d are not fixed).

Related Work
The k-means problem dates back to Steinhaus (1956).
a) Approximation algorithms: algorithms with provable guarantees.
PTASs with varying runtime dependence on n, d, k: poly/linear in n, but possibly exponential in d and/or k:
– Matoušek: poly(n), exp(d, k)
– Kumar, Sabharwal & Sen (KSS04): lin(n, d), exp(k)
O(1)-approximation algorithms for k-median: any point set with any metric, runtime poly(n, d, k); the guarantees also translate to k-means:
– Charikar, Guha, Tardos & Shmoys
– Arya et al. + Kanungo et al.: (9+ε)-approximation

b) Heuristics: Lloyd's method, invented in 1957, remains an extremely popular heuristic even today.
1) Start with k initial / "seed" centers c_1, …, c_k.
2) Iterate the following Lloyd step:
a) Assign each point to its nearest center c_i to obtain a clustering X_1, …, X_k.
b) Update c_i ← ctr(X_i) = ∑_{x∈X_i} x / |X_i|.
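To make the iteration concrete, here is a minimal NumPy sketch of the Lloyd step above (the function name and interface are illustrative; the choice of seed centers, which is the focus of this talk, is left to the caller):

```python
import numpy as np

def lloyd(X, centers, num_iters=100):
    """X: (n, d) array of points; centers: (k, d) array of seed centers."""
    for _ in range(num_iters):
        # Step a): assign each point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step b): move each center to the center of mass of its cluster
        # (keep the old center if its cluster happens to be empty).
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```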

Lloyd's method: What's known?
There are some bounds on the number of iterations of Lloyd-type methods: Inaba-Katoh-Imai; Har-Peled-Sadri; Arthur-Vassilvitskii ('06).
Performance is very sensitive to the choice of seed centers; there is a lot of literature on finding "good" seeding methods for Lloyd.
But there is almost no analysis that proves performance guarantees on the quality of the final solution for arbitrary k and dimension.
Our goal: analyze Lloyd and try to prove rigorous performance guarantees for Lloyd-type methods.

Our Results
We introduce a clusterability, or separation, condition and give a novel, efficient sampling process for seeding Lloyd's method with initial centers.
We show that if the data satisfies our clusterability condition:
– seeding + 1 Lloyd step yields a constant-factor approximation in time linear in n and d and polynomial in k; this is potentially faster than Lloyd variants that require multiple reseedings.
– seeding + KSS04-sampling gives a PTAS; the algorithm is faster and simpler than the PTAS in KSS04.
Main theorem: If the data has a "meaningful k-clustering", then there is a simple, efficient seeding method such that Lloyd-type methods return a near-optimal solution.

"Meaningful k-Clustering"
Settings where one would NOT consider the data to possess a meaningful k-clustering:
1) If near-optimum cost can be achieved by two very distinct k-partitions of the data, then the identity of an optimal k-partition carries little meaning: it provides an ambiguous classification.
2) If the cost of the best k-clustering ≈ the cost of the best (k-1)-clustering, then a k-clustering yields only marginal benefit over the best (k-1)-clustering; one should use a smaller value of k here. (Example: k = 3.)

We formalize 2). Let Δ_k²(X) = cost of the best k-clustering of X.
X is ε-separated for k-means iff Δ_k²(X) / Δ_{k-1}²(X) ≤ ε².
This is a simple condition; the drop in k-clustering cost is already used by practitioners to choose the right k.
One can show that, roughly, X is ε-separated for k-means ⇔ any two low-cost k-clusterings disagree on only a small fraction of the data.
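As an illustration of how this condition might be inspected in practice, here is a rough sketch that approximates the ratio Δ_k²(X)/Δ_{k-1}²(X) using a heuristic solver (scikit-learn's KMeans, an assumption made for this example, not part of the talk) as a proxy for the optimal clustering costs:

```python
import numpy as np
from sklearn.cluster import KMeans

def separation_ratio(X, k, n_init=20, seed=0):
    """Heuristic proxy for Delta_k^2(X) / Delta_{k-1}^2(X).

    KMeans.inertia_ only upper-bounds the optimal k-clustering cost, so this
    is an estimate of the ratio, not the exact quantity in the definition.
    """
    cost_k = KMeans(n_clusters=k, n_init=n_init, random_state=seed).fit(X).inertia_
    cost_k_minus_1 = KMeans(n_clusters=k - 1, n_init=n_init, random_state=seed).fit(X).inertia_
    return cost_k / cost_k_minus_1  # a small ratio (<= eps^2) suggests eps-separation
```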

Some basic facts
Notation: n = |X|, c = ctr(X).
Fact: For all p ∈ R^d, ∑_{x∈X} d(x, p)² = Δ_1²(X) + n·d(p, c)². Hence ∑_{{x,y}⊆X} d(x, y)² = n·Δ_1²(X).
[Write d(x, p)² = (x−c + c−p)^T (x−c + c−p) and expand.]
Lemma: Let X = X_1 ∪ X_2 be a partition of X with c_i = ctr(X_i), so that c = (|X_1|·c_1 + |X_2|·c_2)/n. Then
Δ_1²(X) = Δ_1²(X_1) + Δ_1²(X_2) + (|X_1||X_2|/n)·d(c_1, c_2)².
Proof: Δ_1²(X) = ∑_{x∈X_1} d(x, c)² + ∑_{x∈X_2} d(x, c)²
= (Δ_1²(X_1) + |X_1|·d(c_1, c)²) + (Δ_1²(X_2) + |X_2|·d(c_2, c)²)
= Δ_1²(X_1) + Δ_1²(X_2) + (|X_1||X_2|/n)·d(c_1, c_2)².
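The lemma can be sanity-checked numerically; below is a small, self-contained example (the helper delta1_sq and the random data are illustrative, not from the talk):

```python
import numpy as np

def delta1_sq(X):
    # Cost of the best 1-clustering: sum of squared distances to the centroid.
    return ((X - X.mean(axis=0)) ** 2).sum()

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(40, 3))   # cluster around the origin
X2 = rng.normal(5.0, 1.0, size=(60, 3))   # cluster around (5, 5, 5)
X = np.vstack([X1, X2])
n = len(X)

lhs = delta1_sq(X)
cross_term = len(X1) * len(X2) / n * ((X1.mean(axis=0) - X2.mean(axis=0)) ** 2).sum()
rhs = delta1_sq(X1) + delta1_sq(X2) + cross_term
print(np.isclose(lhs, rhs))  # True: the decomposition holds exactly
```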

The 2-means problem (k = 2)
X*_1, X*_2: the optimal clusters; c*_i = ctr(X*_i); D* = d(c*_1, c*_2); n_i = |X*_i|.
(r*_i)² = ∑_{x∈X*_i} d(x, c*_i)² / n_i = Δ_1²(X*_i) / n_i = the average squared distance in cluster X*_i.
Lemma: If X is ε-separated for 2-means, then for i = 1, 2, (r*_i)² ≤ ε²/(1−ε²)·D*².
Proof: Since X is ε-separated for 2-means, Δ_2²(X)/ε² ≤ Δ_1²(X) = Δ_2²(X) + (n_1 n_2/n)·D*².
Hence n_i·(r*_i)² ≤ Δ_2²(X) ≤ ε²/(1−ε²)·(n_1 n_2/n)·D*² ≤ ε²/(1−ε²)·n_i·D*².

The 2-means algorithm
Assume X is ε-separated for 2-means.
1) Sampling-based seeding procedure:
– Pick two seed centers c_1, c_2 by randomly picking the pair {x, y} ⊆ X with probability proportional to d(x, y)².
2) Lloyd step, or the simpler "ball k-means step":
– For each c_i, let B_i = {x ∈ X : d(x, c_i) ≤ d(c_1, c_2)/3}.
– Update c_i ← ctr(B_i); return these as the final centers.
The sampling can be implemented in O(nd) time, so the entire algorithm runs in O(nd) time.
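A minimal sketch of this procedure, assuming NumPy and a naive O(n²d) implementation of the pair sampling (the talk notes the sampling can be done in O(nd) time; the function name is illustrative):

```python
import numpy as np

def two_means(X, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n = len(X)
    # Seeding: pick a pair {x, y} with probability proportional to d(x, y)^2.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    idx = rng.choice(n * n, p=sq_dists.ravel() / sq_dists.sum())
    centers = [X[idx // n].copy(), X[idx % n].copy()]
    # Ball-k-means step: recenter each seed on its ball B_i of radius d(c1, c2)/3.
    radius = np.linalg.norm(centers[0] - centers[1]) / 3.0
    for i in range(2):
        ball = X[np.linalg.norm(X - centers[i], axis=1) <= radius]
        if len(ball) > 0:
            centers[i] = ball.mean(axis=0)
    return np.array(centers)
```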

2-means: Analysis
Let core(X*_i) = {x ∈ X*_i : d(x, c*_i)² ≤ (r*_i)²/ρ}, where ρ = Θ(ε²) < 1.
Seeding lemma: With probability 1 − O(ρ), c_1 and c_2 lie in the cores of X*_1 and X*_2.
Proof: |core(X*_i)| ≥ (1−ρ)·n_i for i = 1, 2 (by Markov's inequality). Let
A = ∑_{x∈core(X*_1), y∈core(X*_2)} d(x, y)² ≈ (1−ρ)²·n_1 n_2·D*²,
B = ∑_{{x,y}⊆X} d(x, y)² = n·Δ_1²(X) ≈ n_1 n_2·D*².
Probability = A/B ≈ (1−ρ)² = 1 − O(ρ).
(Figure: the two optimal clusters with their cores and the sampled seeds c_1, c_2.)

2-means analysis (contd.)
Recall that B_i = {x ∈ X : d(x, c_i) ≤ d(c_1, c_2)/3}.
Ball-k-means lemma: For i = 1, 2, core(X*_i) ⊆ B_i ⊆ X*_i. Therefore d(ctr(B_i), c*_i)² ≤ ρ·(r*_i)²/(1−ρ).
Intuitively, since B_i ⊆ X*_i and B_i contains almost all of the mass of X*_i, ctr(B_i) must be close to ctr(X*_i) = c*_i.

2-means analysis (contd.)
Recall that B_i = {x ∈ X : d(x, c_i) ≤ d(c_1, c_2)/3}.
Ball-k-means lemma: For i = 1, 2, core(X*_i) ⊆ B_i ⊆ X*_i. Therefore d(ctr(B_i), c*_i)² ≤ ρ·(r*_i)²/(1−ρ).
Proof: Δ_1²(X*_i) ≥ (|B_i|·|X*_i \ B_i| / n_i)·d(ctr(B_i), ctr(X*_i \ B_i))².
Also, d(ctr(B_i), c*_i) = (|X*_i \ B_i| / n_i)·d(ctr(B_i), ctr(X*_i \ B_i)).
⇒ n_i·(r*_i)² ≥ (n_i·|B_i| / |X*_i \ B_i|)·d(ctr(B_i), c*_i)².

2-means analysis (contd.)
Theorem: With probability 1 − O(ρ), the cost of the final clustering is at most Δ_2²(X)/(1−ρ), i.e., we get a (1/(1−ρ))-approximation algorithm.
Since ρ = O(ε²), the approximation ratio → 1 as ε → 0, and the probability of success → 1 as ε → 0.

Arbitrary k
The algorithm and analysis follow the same outline as in the 2-means case.
If X is ε-separated for k-means, one can again show that all clusters are well separated, i.e., cluster radius << inter-cluster distance: r*_i = O(ε)·d(c*_i, c*_j) for all i, j.
1) Seeding stage: we choose k initial centers and ensure that they lie in the "cores" of the k optimal clusters.
– This exploits the fact that the clusters are well separated.
– After the seeding stage, each optimal center has a distinct seed center very "near" it.
2) Now we can run either a Lloyd step or a ball-k-means step.
Theorem: If X is ε-separated for k-means, then one can obtain an α(ε)-approximation algorithm, where α(ε) → 1 as ε → 0.

Schematic of the entire algorithm
Three options for the seeding stage, each producing k well-placed seeds:
1) Simple sampling: pick k centers as follows (a code sketch follows this slide).
– First pick 2 centers c_1, c_2 as in 2-means.
– To pick center c_{i+1}, pick x ∈ X with probability proportional to min_{j≤i} d(x, c_j)².
Success probability: exp(−k).
2) Oversampling + deletion: sample O(k) centers, then greedily delete until k remain.
O(1) success probability; runtime O(nkd + k³d).
3) Greedy deletion: start with n centers and keep deleting the center that causes the least cost increase until k centers remain.
Runtime O(n³d).
Given the k well-placed seeds:
– a ball-k-means or Lloyd step gives an O(1)-approximation;
– KSS04-sampling gives a PTAS.
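A sketch of the simple-sampling seeding branch of the schematic, assuming NumPy (again with a naive O(n²d) implementation of the first pair sample; function names are illustrative):

```python
import numpy as np

def simple_sampling_seeds(X, k, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n = len(X)
    # First two seeds: a pair {x, y} drawn with probability proportional to d(x, y)^2.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    idx = rng.choice(n * n, p=sq_dists.ravel() / sq_dists.sum())
    centers = [X[idx // n], X[idx % n]]
    # Remaining seeds: pick x with probability proportional to min_{j<=i} d(x, c_j)^2.
    while len(centers) < k:
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```

These seeds would then be handed to a ball-k-means or Lloyd step (e.g. the lloyd sketch earlier) to obtain the final centers.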

Simple sampling: analysis sketch
Assume X is ε-separated for k-means.
X*_1, …, X*_k: the optimal clusters; c*_i = ctr(X*_i); n_i = |X*_i|; (r*_i)² = ∑_{x∈X*_i} d(x, c*_i)² / n_i = Δ_1²(X*_i) / n_i.
core(X*_i) = {x ∈ X*_i : d(x, c*_i)² ≤ (r*_i)²/ρ}, where now ρ = Θ(ε^{2/3}).
Lemma: With probability (1 − O(ρ))^k, all sampled centers lie in the cores of distinct optimal clusters.
Proof: We show inductively that if c_1, …, c_i lie in distinct cores, then with probability 1 − O(ρ), so does center c_{i+1}.
Base case: X is ε-separated for k-means ⇒ X*_i ∪ X*_j is ε-separated for 2-means for every i ≠ j (because merging two clusters causes a huge increase in cost). So by the 2-means analysis, the first two centers c_1, c_2 lie in distinct cores.

Simple sampling: analysis (contd.)
Inductive step: Assume c_1, …, c_i lie in the cores of X*_1, …, X*_i. Let C = {c_1, …, c_i}.
A = ∑_{j≥i+1} ∑_{x∈core(X*_j)} d(x, C)² ≈ ∑_{j≥i+1} (1−ρ)·n_j·d(c*_j, C)².
B = ∑_{j≤k} ∑_{x∈X*_j} d(x, C)² ≈ ∑_{j≤i} Δ_1²(X*_j) + ∑_{j≥i+1} (Δ_1²(X*_j) + n_j·d(c*_j, C)²) ≈ ∑_{j≥i+1} n_j·d(c*_j, C)².
Probability that c_{i+1} lies in the core of a new cluster = A/B = 1 − O(ρ).

Open Questions
A deeper analysis of Lloyd: are there weaker conditions under which one can prove performance guarantees for Lloyd-type methods?
Is there a PTAS for k-means with polynomial-time dependence on n, k, and d? Or is the problem APX-hard in the geometric setting?
Is there a PTAS for k-means under our separation condition?
Are there other applications of the separation condition?

Thank You.