Stability Yields a PTAS for k-Median and k-Means Clustering Pranjal Awasthi, Avrim Blum, Or Sheffet Carnegie Mellon University November 3rd, 2010
Stability Yields a PTAS for k-Median and k-Means Clustering
Introduce k-Median / k-Means problems.
Define stability:
  Previous notion [ORSS06]
  Weak Deletion Stability
  β-distributed instances
The algorithm for k-Median
Conclusion + open problems.
Clustering In Real Life
Clustering: come up with a desired partition.
You're Comcast, looking to build an infrastructure in MV.
Clustering in a Metric Space
Clustering: come up with a desired partition.
Input:
  n points
  A distance function d: n × n → R≥0 satisfying:
    Reflexivity: ∀p, d(p,p) = 0
    Symmetry: ∀p,q, d(p,q) = d(q,p)
    Triangle inequality: ∀p,q,r, d(p,q) ≤ d(p,r) + d(r,q)
  k-partition
k is large, e.g. k = polylog(n)
k-Median
Input: 1. n points in a finite metric space 2. k
Goal: Partition into k disjoint subsets C*_1, C*_2, …, C*_k and choose a center c*_i per subset.
Cost: $\mathrm{cost}(C^*_i) = \sum_{x \in C^*_i} d(x, c^*_i)$
Cost of partition: $\sum_i \mathrm{cost}(C^*_i)$
Given centers ⇒ easy to get the best partition.
Given partition ⇒ easy to get the best centers.
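The two "easy" directions on this slide can be illustrated with a minimal sketch. It assumes points are coordinate tuples, that Euclidean distance (`math.dist`) stands in for the metric d, and that candidate centers are passed in explicitly; the function names are illustrative and not from the talk.

```python
import math

def assign_to_centers(points, centers, dist=math.dist):
    """Given centers: the best partition sends each point to its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def best_median_center(cluster, candidates, dist=math.dist):
    """Given a cluster: the best center minimizes the summed distance to its points."""
    return min(candidates, key=lambda c: sum(dist(x, c) for x in cluster))

def kmedian_cost(clusters, centers, dist=math.dist):
    """Cost of a partition: sum over clusters of distances to that cluster's center."""
    return sum(dist(x, c) for cluster, c in zip(clusters, centers) for x in cluster)

# Toy instance (an assumption for illustration): two well-separated pairs on a line.
points = [(0, 0), (1, 0), (10, 0), (11, 0)]
centers = [(0, 0), (10, 0)]
clusters = assign_to_centers(points, centers)
total = kmedian_cost(clusters, centers)
```

The hard part of k-median is choosing centers and partition simultaneously; each half alone is a simple minimization, as the sketch shows.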
k-Means
Input: 1. n points in Euclidean space 2. k
Goal: Partition into k disjoint subsets C*_1, C*_2, …, C*_k and choose a center c*_i per subset.
Cost: $\mathrm{cost}(C^*_i) = \sum_{x \in C^*_i} d^2(x, c^*_i)$
Cost of partition: $\sum_i \mathrm{cost}(C^*_i)$
Given centers ⇒ easy to get the best partition.
Given partition ⇒ easy to get the best centers.
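For k-means, "given partition ⇒ best centers" has a closed form: the center of mass of a cluster minimizes the sum of squared Euclidean distances. A small sketch under the assumption that points are coordinate tuples; the helper names are illustrative.

```python
import math

def centroid(cluster):
    """Center of mass of a non-empty list of equal-length coordinate tuples."""
    n, dim = len(cluster), len(cluster[0])
    return tuple(sum(x[j] for x in cluster) / n for j in range(dim))

def kmeans_cluster_cost(cluster, center):
    """Sum of squared Euclidean distances from the cluster's points to `center`."""
    return sum(math.dist(x, center) ** 2 for x in cluster)

# Toy cluster (an assumption for illustration).
cluster = [(0, 0), (2, 0), (4, 0)]
mu = centroid(cluster)
cost_at_centroid = kmeans_cluster_cost(cluster, mu)
cost_elsewhere = kmeans_cluster_cost(cluster, (1, 0))
```

Moving the center away from the centroid can only increase the squared-distance cost, which is what the comparison between the two costs illustrates.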
Polynomial Time Approximation Scheme
We would like to solve k-median / k-means problems.
It is NP-hard to get OPT (= cost of the optimal partition).
Find a c-approximation algorithm: a poly-time algorithm guaranteed to output a clustering whose cost is ≤ c·OPT.
Ideally, find a PTAS: a c-approximation algorithm where c = (1+ε), for any ε > 0. Runtime can be exp(1/ε).
[Figure: algorithm outputs at cost c·OPT, 2·OPT, 1.5·OPT, 1.1·OPT, approaching OPT]
Related Work
We focus on large k (e.g. k = polylog(n)). Runtime goal: poly(n,k).
Small k:
  k-Median: easy (try all centers) in time n^k
  k-Means: PTAS, exponential in (k/ε) [KSS04]
General k:
  k-Median: (3+ε)-approximation [GK98, CGTS99, AGKMMP01, JMS02, dlVKKR03]; (1.367...)-approximation hardness [GK98, JMS02]
  k-Means: 9-approximation [OR00, BHPI02, dlVKKR03, ES04, HPM04, KMNPSW02]; no PTAS!
Special case:
  Euclidean k-Median [ARR98], PTAS if the dimension is small (loglog(n)^c)
  [ORSS06]
World All possible instances
ORSS Result (k-Means)
You're Comcast, looking to build an infrastructure in MV. Why use 5 sites?
ORSS Result (k-Means)
  Instance is stable if OPT(k-1) > (1/α)²·OPT(k) (require 1/α > 10).
  Give a (1+O(α))-approximation.
Our Result (k-Means)
  Instance is stable if OPT(k-1) > (1+α)·OPT(k) (require α > 0).
  Give a PTAS ((1+ε)-approximation).
  Runtime: poly(n,k) · exp(1/α, 1/ε)
Philosophical Note
Stable instances: ∃α > 0 s.t. OPT(k-1) > (1+α)·OPT(k)
Not stable instances: ∀α > 0, OPT(k-1) ≤ (1+α)·OPT(k)
  A (1+α)-approximation can return a (k-1)-clustering.
  Any PTAS can return a (k-1)-clustering.
  It is not a k-clustering problem, it is a (k-1)-clustering problem!
If we believe our instance inherently has k clusters, stability is a "necessary condition" to guarantee that a PTAS will return a "meaningful" clustering.
Our result: it's a sufficient condition to get a PTAS.
World
[Figure: ORSS Stable instances inside the set of all possible instances]
ORSS Stable: any (k-1)-clustering is significantly costlier than OPT(k).
A Weaker Guarantee
You're Comcast, looking to build an infrastructure in MV. Why use 5 sites?
(1+α)-Weak Deletion Stability
Consider OPT(k). Take any cluster C*_i and associate its points with another center c*_j (j ≠ i).
This increases the cost to at least (1+α)·OPT(k).
An obvious relaxation of ORSS-stability.
Our result: it suffices to get a PTAS.
[Figure: the points of C*_i reassigned from c*_i to c*_j]
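The condition on this slide can be checked directly once a clustering and its centers are in hand. The sketch below is illustrative only: it assumes the supplied clustering is the optimal one (which the code cannot verify), uses Euclidean distance as a stand-in metric, and the function name is hypothetical.

```python
import math

def is_weak_deletion_stable(clusters, centers, alpha, dist=math.dist):
    """Check: reassigning any whole cluster C_i to another center c_j
    raises the total k-median cost to at least (1 + alpha) * OPT.
    Assumes (clusters, centers) is an optimal solution."""
    own_costs = [sum(dist(x, c) for x in cluster)
                 for cluster, c in zip(clusters, centers)]
    opt = sum(own_costs)
    for i, cluster in enumerate(clusters):
        for j, cj in enumerate(centers):
            if i == j:
                continue
            moved = sum(dist(x, cj) for x in cluster)
            new_cost = opt - own_costs[i] + moved
            if new_cost < (1 + alpha) * opt:
                return False
    return True

# Toy instance (an assumption for illustration): OPT = 2, and the cheapest
# reassignment costs 20, so the instance is stable exactly when 1 + alpha <= 10.
clusters = [[(0, 0), (1, 0)], [(10, 0), (11, 0)]]
centers = [(0, 0), (10, 0)]
stable_5 = is_weak_deletion_stable(clusters, centers, alpha=5)
stable_9_5 = is_weak_deletion_stable(clusters, centers, alpha=9.5)
```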
World
[Figure: ORSS Stable inside Weak-Deletion Stable, inside the set of all possible instances]
Weak-Deletion Stable: merging any two clusters in OPT(k) increases the cost significantly.
β-Distributed Instances
For every cluster C*_i, and every p not in C*_i, we have: d(p, c*_i) ≥ β·(OPT/|C*_i|)
We show that:
  k-median: (1+α)-weak deletion stability ⇒ (α/2)-distributed.
  k-means: (1+α)-weak deletion stability ⇒ (α/4)-distributed.
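A small check of the β-distributed condition, as a sketch. It assumes the k-median form d(p, c*_i) ≥ β·OPT/|C*_i| for every point p outside C*_i (the form consistent with r = β(OPT/|C*_i|) used later in the talk), assumes the supplied clustering is optimal, and uses Euclidean distance; the function name is hypothetical.

```python
import math

def is_beta_distributed(clusters, centers, beta, dist=math.dist):
    """Check that every point outside cluster C_i is at distance at least
    beta * OPT / |C_i| from the center of C_i (k-median cost)."""
    opt = sum(dist(x, c) for cluster, c in zip(clusters, centers) for x in cluster)
    for i, (cluster, c) in enumerate(zip(clusters, centers)):
        threshold = beta * opt / len(cluster)
        for j, other in enumerate(clusters):
            if j != i and any(dist(p, c) < threshold for p in other):
                return False
    return True

# Toy instance (an assumption for illustration): OPT = 2, both clusters have
# size 2, and the closest outside point is 9 away from a center.
clusters = [[(0, 0), (1, 0)], [(10, 0), (11, 0)]]
centers = [(0, 0), (10, 0)]
ok_4_5 = is_beta_distributed(clusters, centers, beta=4.5)
ok_9_5 = is_beta_distributed(clusters, centers, beta=9.5)
```

On this toy instance, reassigning a whole cluster raises the cost to at least 10·OPT, i.e. it is (1+9)-weak deletion stable, and the check confirms it is (9/2)-distributed, in line with the k-median implication stated on the slide.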
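Claim: (1+α)-Weak Deletion Stability ⇒ (α/2)-Distributed
Let p be a point outside C*_i, say p ∈ C*_j. Reassigning the points of C*_i to c*_j raises the cost by at least α·OPT, so:
$$
\begin{aligned}
\alpha\,\mathrm{OPT} &\le \sum_{x \in C^*_i} d(x, c^*_j) - \sum_{x \in C^*_i} d(x, c^*_i) \\
&\le \sum_{x \in C^*_i} \left[ d(x, c^*_i) + d(c^*_i, c^*_j) \right] - \sum_{x \in C^*_i} d(x, c^*_i) \\
&= \sum_{x \in C^*_i} d(c^*_i, c^*_j) = |C^*_i|\, d(c^*_i, c^*_j)
\end{aligned}
$$
$$
\Rightarrow\quad \alpha\,\frac{\mathrm{OPT}}{|C^*_i|} \le d(c^*_i, c^*_j) \le d(c^*_i, p) + d(p, c^*_j) \le 2\, d(c^*_i, p)
$$
The last step holds because p belongs to C*_j, so d(p, c*_j) ≤ d(p, c*_i).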
World
[Figure: ORSS Stable inside Weak-Deletion Stable, inside β-Distributed, inside the set of all possible instances]
β-Distributed: in the optimal solution, there is a large distance between a center and any "outside" point.
Main Result
We give a PTAS for β-distributed k-median and k-means instances.
Running time:
There are NP-hard β-distributed instances. (Superpolynomial dependence on 1/ε is unavoidable!)
Stability Yields a PTAS for k-Median and k-Means Clustering
Introduce k-Median / k-Means problems.
Define stability
PTAS for k-Median:
  High level description
  Intuition ("had only we known more…")
  Description
Conclusion + open problems.
k-Median Algorithm's Overview
Input: Metric, k, β, OPT
1. Initialization stage
   Initialize a list, L. L = set of "suspected" clusters' "cores".
2. Population stage
   Populate L with subsets of points.
3. Center-Retrieving stage
   For each component in L: choose a center (the point that minimizes cost).
   For any choice of k components in L: evaluate the k-median cost with the respective k centers.
Right definition of "core". Get the core of each cluster. L can't get too big.
k-Median Algorithm's Overview
Input: Metric, k, β, OPT
L := list of "suspected" clusters' "cores"
0. Handle "extreme" clusters (brute-force guessing of some clusters' centers)
1. Populate L with components
2. Pick the best center in each component
3. Try all possible k-centers
Right definition of "core". Get the core of each cluster. L can't get too big.
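Steps 2 and 3 of the overview (pick the best center in each component, then try all choices of k centers) can be sketched as below. This is a simplified illustration, not the talk's implementation: it assumes each component's center is chosen among the component's own points, ignores Stage 0, and uses Euclidean distance; the names are hypothetical.

```python
import itertools
import math

def best_center(component, dist=math.dist):
    """Step 2: the point of the component minimizing summed distance to it."""
    return min(component, key=lambda c: sum(dist(x, c) for x in component))

def retrieve_centers(points, components, k, dist=math.dist):
    """Step 3: try every choice of k component centers and keep the cheapest.
    Returns (cost, centers), or None if there are fewer than k components."""
    candidates = [best_center(comp, dist) for comp in components]
    best = None
    for centers in itertools.combinations(candidates, k):
        cost = sum(min(dist(p, c) for c in centers) for p in points)
        if best is None or cost < best[0]:
            best = (cost, centers)
    return best

# Toy instance (an assumption for illustration): two true cores plus one
# spurious component straddling both clusters.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
components = [points[:3], points[3:], [(2, 0), (10, 0)]]
best_cost, best_centers = retrieve_centers(points, components, k=2)
```

The number of k-subsets grows quickly with the length of L, which is why the overview stresses that L can't get too big.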
Intuition: "Mind the Gap"
We know: every point p outside C*_i satisfies d(p, c*_i) ≥ β·(OPT/|C*_i|).
In contrast, an "average" cluster contributes OPT/k to the cost. So for an "average" point p in an "average" cluster C*_i, d(p, c*_i) ≈ OPT/(k·|C*_i|).
Denote r = β·(OPT/|C*_i|). The core of a cluster C*_i: the points of C*_i within distance r/8 of c*_i.
Formally, call cluster C*_i cheap if cost(C*_i) ≤ (εβ/32)·OPT.
Assume all clusters are cheap. In general: we brute-force guess O(1/(βε)) centers of expensive clusters in Stage 0.
Markov: at most an (ε/4) fraction of the points of a cheap cluster lie outside the core.
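Markov Inequality
Claim: At most an (ε/4) fraction of the points of a cheap cluster lie outside the core.
Proof by contradiction: if more than an (ε/4) fraction of the points of C*_i lay outside the core, their distances to c*_i alone would push cost(C*_i) above the cheapness threshold ⇒ the cluster isn't cheap.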
Intuition: "Mind the Gap"
Markov: at least half of the points of a cheap cluster lie inside the core.
Magic (r/4) Ball
Denote r = β·(OPT/|C*_i|).
[Figure: core within distance ≤ r/8 of c*_i; points of other clusters at distance > r from c*_i]
Draw a ball of radius r/4 around all points. "Heavy": mass ≥ |C*_i|/2.
If p belongs to the core ⇒ B(p, r/4) contains ≥ |C*_i|/2 points: any two core points are within r/8 + r/8 = r/4 of each other, and at least half of the cluster lies in the core.
Unite "heavy" balls whose centers overlap. All points in the core are merged into one set!
Could we merge core points with points from other clusters?
Take a ball centered at p with r/2 ≤ d(p, c*_i) ≤ 3r/4. Every x in B(p, r/4) satisfies r/4 = r/2 - r/4 ≤ d(x, c*_i) ≤ 3r/4 + r/4 = r.
So x falls outside the core, yet x belongs to C*_i.
If this ball were heavy, more than |C*_i|/2 points would fall outside the core! Contradiction.
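The "heavy" test used in this argument is a simple count. A minimal sketch, assuming Euclidean distance, a closed ball (distance ≤ radius), and that the cluster size is supplied by the caller; the helper names are illustrative.

```python
import math

def ball(points, p, radius, dist=math.dist):
    """All input points within `radius` of p (closed ball, includes p itself)."""
    return [x for x in points if dist(x, p) <= radius]

def is_heavy(points, p, r, cluster_size, dist=math.dist):
    """B(p, r/4) is heavy if it holds at least half a cluster's worth of points."""
    return len(ball(points, p, r / 4, dist)) >= cluster_size / 2

# Toy instance (an assumption for illustration): two tight groups and a stray point.
points = [(0, 0), (1, 0), (2, 0), (50, 0), (100, 0), (101, 0), (102, 0)]
r = 8
core_point_heavy = is_heavy(points, (1, 0), r, cluster_size=3)
stray_point_heavy = is_heavy(points, (50, 0), r, cluster_size=3)
```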
Finding the Right Radius
Draw a ball of radius r/4 around all points. Unite "heavy" balls whose centers overlap. Denote r = β·(OPT/|C*_i|).
Problem: we don't know |C*_i|.
Solution: try all sizes, in order!
  Set s = n, n-1, n-2, …, 1
  Set r_s = β·(OPT/s)
Complication: when s gets small (s = 4, 3, 2, 1) we collect many "leftovers" of one cluster.
Solution: once we add a subset to L, we remove close-by points.
Population Stage
Set s = n, n-1, n-2, …, 1
  Set r_s = β·(OPT/s)
  Draw a ball of radius r_s/4 around each point
  Unite balls containing ≥ s/2 points whose centers overlap
  Once a set of ≥ s/2 points is found:
    Put this set in L
    Remove all points in an (r_s/2) "buffer zone" around this set.
Population Stage: Remainder of the Proof
Even with "buffer zones", we still collect cores.
#{components without core points} in L is O(1/β).
cost(k centers from cores) ≤ (1+ε)·OPT
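The population stage described above can be sketched in a few lines. This is a simplified reading of the slides, not the authors' implementation, and it makes several assumptions that are labeled in the comments: two heavy balls are united when their centers lie within r_s/4 of each other, the set added to L is the union of the united balls, the buffer zone is every remaining point within r_s/2 of that set, and Euclidean distance stands in for the metric.

```python
import math

def find_component(remaining, r, s, dist):
    """Return one united set of heavy balls among `remaining`, or None."""
    def ball(p):
        return [x for x in remaining if dist(p, x) <= r / 4]
    heavy = [p for p in remaining if len(ball(p)) >= s / 2]
    if not heavy:
        return None
    # Assumption: heavy balls are united when their centers are within r/4.
    group, frontier, rest = [heavy[0]], [heavy[0]], heavy[1:]
    while frontier:
        p = frontier.pop()
        near = [q for q in rest if dist(p, q) <= r / 4]
        rest = [q for q in rest if dist(p, q) > r / 4]
        group.extend(near)
        frontier.extend(near)
    # Assumption: the set put in L is the union of the united balls.
    return [x for x in remaining if any(dist(x, p) <= r / 4 for p in group)]

def populate(points, beta, opt, dist=math.dist):
    """Try sizes s = n, ..., 1 with radius r_s = beta * opt / s and collect sets in L."""
    remaining, L = list(points), []
    for s in range(len(points), 0, -1):
        r = beta * opt / s
        while remaining:
            component = find_component(remaining, r, s, dist)
            if component is None:
                break
            L.append(component)
            # Buffer zone: drop every point within r/2 of the new component.
            remaining = [x for x in remaining
                         if min(dist(x, y) for y in component) > r / 2]
    return L

# Toy instance (an assumption for illustration): OPT = 4 with centers (1,0), (101,0).
points = [(0, 0), (1, 0), (2, 0), (100, 0), (101, 0), (102, 0)]
L = populate(points, beta=12, opt=4)
```

On this toy input both clusters are recovered at s = 6 and nothing is left over for smaller s; on less tidy inputs, small values of s turn leftover points into extra components, which is the case the buffer zones are meant to limit.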
A Note About k-Means
Roughly the same algorithm, constants ≈ squared.
Problem: can't guess centers for expensive clusters!
Solution: a random sample of O(1/ε) points from each cluster approximates the center of mass. Brute-force guess O(1/ε) points from O(1/(βε)) expensive clusters.
Better solution: randomly sample O(1/ε) points from expensive clusters whose size is ≥ a poly(1/k) fraction of the instance.
Slight complication: introduce intervals.
Expected runtime:
Conclusion
∀ε > 0, a (1+ε)-approximation algorithm for β-distributed instances of k-median / k-means.
Improve constants?
Other clustering objectives (k-centers)?
[Figure: World: ORSS Stable inside Weak-Deletion Stable, inside β-Distributed]
Take Home Message
Life gives you a k-median instance.
Stability = a belief that a PTAS is meaningful. This allows us to introduce a PTAS!
To which other NP-hard problems does similar logic apply?
- "Can you solve it?" - "NO!!!"
Stability gives us an "Archimedean Point" that allows us to bypass NP-hardness. But that's not new!
Thank you!
World
[Figure: classes of instances: BBG Stable+, ORSS Stable, Weak-Deletion Stable, β-Distributed, all possible instances]
BBG Result
We have a target clustering. k-median is a proxy: the target is close to OPT(k).
Problem: k-median is NP-hard. Solution: use an approximation algorithm.
We would like: our (1+α)-approximation algorithm outputs a meaningful k-clustering.
Implicit assumption: any k-clustering with cost at most (1+α)·OPT is δ-close (pointwise) to the target.
Proxy: your goal is to retrieve the target clustering. The fact that k-Median and the target are close is something you assume.
BBG Result
Instance is (BBG) stable: any two k-partitions with cost ≤ (1+α)·OPT(k) differ over no more than a (2δ)-fraction of the input.
Give an algorithm to get O(δ/α)-close to the target.
Additionally (k-median): if all clusters' sizes are Ω(δn/α) then get δ-close to the target.
Our result: if BBG-stability holds and clusters are of size > 2δn, then PTAS for k-median (implies: get δ-close to the target).
Mention that in k-means they get (δ/α)-close, whereas we get δ-close.
Claim: BBG-Stability & Large Clusters ⇒ (1+α)-Weak Deletion Stability
We know:
  Any two k-partitions with cost ≤ (1+α)·OPT(k) differ over ≤ a 2δ fraction of the input.
  All clusters contain > 2δn points.
Take the optimal k-clustering. Take C*_i, move all points but c*_i to C*_j.
The new partition and OPT differ on > 2δn points
⇒ cost(OPT_{i→j}) ≥ cost(new partition) ≥ (1+α)·OPT(k)
* Because clusters are large this counting argument is possible. (Any (k-1)-clustering is far from the target clustering.)
* Might skip this slide.
[Figure: before and after moving the points of C*_i to C*_j]