Stability Yields a PTAS for k-Median and k-Means Clustering

1 Stability Yields a PTAS for k-Median and k-Means Clustering
Pranjal Awasthi, Avrim Blum, Or Sheffet (Carnegie Mellon University). November 3rd, 2010.

2 Stability Yields a PTAS for k-Median and k-Means Clustering
Introduce the k-median / k-means problems. Define stability: the previous notion [ORSS06], weak deletion stability, β-distributed instances. The algorithm for k-median. Conclusion + open problems.

3 Clustering In Real Life
Clustering: come up with the desired partition. You're Comcast, looking to build an infrastructure in MV.

4 Clustering in a Metric Space
Clustering: come up with the desired partition. Input: n points and a distance function d : n×n → ℝ≥0 satisfying: Reflexivity: ∀p, d(p,p) = 0. Symmetry: ∀p,q, d(p,q) = d(q,p). Triangle inequality: ∀p,q,r, d(p,q) ≤ d(p,r) + d(r,q). Output: a k-partition.
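A minimal executable rendering of these axioms (a sketch; `points` is any finite collection and `d` any two-argument distance function, both hypothetical):

```python
from itertools import combinations, permutations

def is_metric(points, d, tol=1e-9):
    # Reflexivity: d(p, p) = 0 for every point p.
    if any(abs(d(p, p)) > tol for p in points):
        return False
    # Symmetry: d(p, q) = d(q, p) for every pair of points.
    if any(abs(d(p, q) - d(q, p)) > tol for p, q in combinations(points, 2)):
        return False
    # Triangle inequality: d(p, q) <= d(p, r) + d(r, q) for every ordered triple.
    return all(d(p, q) <= d(p, r) + d(r, q) + tol
               for p, q, r in permutations(points, 3))
```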

5 Clustering in a Metric Space
Same input as the previous slide. Here k is large, e.g. k = polylog(n).

6 k-Median
Input: (1) n points in a finite metric space, (2) k. Goal: partition into k disjoint subsets C*1, C*2, …, C*k and choose a center c*i per subset. Cost: cost(C*i) = Σ_{x∈C*i} d(x, c*i); cost of the partition: Σ_i cost(C*i). Given centers ⇒ easy to get the best partition. Given a partition ⇒ easy to get the best centers.

7 k-Means
Input: (1) n points in Euclidean space, (2) k. Goal: partition into k disjoint subsets C*1, C*2, …, C*k and choose a center c*i per subset. Cost: cost(C*i) = Σ_{x∈C*i} d²(x, c*i); cost of the partition: Σ_i cost(C*i). Given centers ⇒ easy to get the best partition. Given a partition ⇒ easy to get the best centers.
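The two "easy" directions on both slides can be made concrete; a sketch (the helper names are ours, not the paper's):

```python
import numpy as np

def nearest_center_partition(points, centers, d):
    # Given centers, the best partition assigns every point to its nearest center.
    return {x: min(centers, key=lambda c: d(x, c)) for x in points}

def best_kmedian_center(cluster, d):
    # Given a cluster in a finite metric, a best center is a point of the
    # cluster minimizing the summed distance to the others (try them all).
    return min(cluster, key=lambda c: sum(d(x, c) for x in cluster))

def best_kmeans_center(cluster):
    # Given a cluster of Euclidean points, the best center for squared
    # distances is the centroid (the mean).
    return np.asarray(cluster).mean(axis=0)
```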

8 Polynomial Time Approximation Scheme
We would like to solve the k-median / k-means problems, but it is NP-hard to get OPT (= the cost of the optimal partition). So: find a c-approximation algorithm, a poly-time algorithm guaranteed to output a clustering of cost ≤ c·OPT. Ideally, find a PTAS: a c-approximation with c = (1+ε), for any ε > 0; the runtime may be exponential in 1/ε. (Figure: algorithms achieving 2·OPT, 1.5·OPT, 1.1·OPT, …, approaching OPT.)

9 Related Work
Small k (both problems): easy in time n^k (try all centers); PTAS with runtime exponential in k/ε [KSS04].
General k, k-median: (3+ε)-apx [GK98, CGTS99, AGKMMP01, JMS02, dlVKKR03]; apx hardness [GK98, JMS02].
General k, k-means: 9-apx [OR00, BHPI02, dlVKKR03, ES04, HPM04, KMNPSW02]; no PTAS!
Euclidean k-median: PTAS if the dimension is small (loglog(n)^c) [ARR98]; [ORSS06] handles a special case.
We focus on large k (e.g. k = polylog(n)). Runtime goal: poly(n, k).

10 World: all possible instances.

11 ORSS Result (k-Means) You're Comcast, looking to build an infrastructure in MV.

12 ORSS Result (k-Means)
You're Comcast, looking to build an infrastructure in MV. Why use 5 sites?

16 ORSS Result (k-Means) vs. Our Result (k-Means)
ORSS: the instance is stable if OPT(k-1) > (1/α)²·OPT(k) (requires 1/α > 10); they give a (1+O(α))-approximation.
Ours: the instance is stable if OPT(k-1) > (1+α)·OPT(k) (requires only α > 0); we give a PTAS (a (1+ε)-approximation). Runtime: poly(n,k)·exp(1/α, 1/ε).
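Side by side as predicates (a sketch; both take the two optimal costs as given, which is exactly what an algorithm cannot assume):

```python
def orss_stable(opt_k_minus_1, opt_k, alpha):
    # ORSS06 condition: OPT(k-1) > (1/alpha)^2 * OPT(k), requiring 1/alpha > 10.
    return 1 / alpha > 10 and opt_k_minus_1 > (1 / alpha) ** 2 * opt_k

def weakly_stable(opt_k_minus_1, opt_k, alpha):
    # This talk's condition: OPT(k-1) > (1 + alpha) * OPT(k), any alpha > 0.
    return alpha > 0 and opt_k_minus_1 > (1 + alpha) * opt_k
```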

17 Philosophical Note
Stable instances: ∃α > 0 s.t. OPT(k-1) > (1+α)·OPT(k). Non-stable instances: ∀α > 0, OPT(k-1) ≤ (1+α)·OPT(k), so a (1+α)-approximation can return a (k-1)-clustering; indeed, any PTAS can return a (k-1)-clustering. It is not a k-clustering problem, it is a (k-1)-clustering problem! If we believe our instance inherently has k clusters, stability is a "necessary condition" to guarantee that a PTAS returns a "meaningful" clustering. Our result: it is also a sufficient condition to get a PTAS.

18 World (Venn diagram)
All possible instances ⊇ ORSS-stable instances: those where any (k-1)-clustering is significantly costlier than OPT(k).

19 A Weaker Guarantee
You're Comcast, looking to build an infrastructure in MV. Why use 5 sites?

22 (1+α)-Weak Deletion Stability
Consider OPT(k). Take any cluster C*i and reassign all of its points to some other center c*j. This must increase the cost to at least (1+α)·OPT(k). An obvious relaxation of ORSS-stability. Our result: it suffices to get a PTAS.
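A direct check of the definition, given an optimal clustering (a sketch; all inputs are assumed known, so this is a verifier, not a solver):

```python
def weak_deletion_stable(clusters, centers, d, alpha, opt):
    k = len(clusters)
    for i in range(k):
        for j in range(k):
            if j == i:
                continue
            # Cost after deleting c*_i and reassigning all of C*_i to c*_j.
            merged = opt + sum(d(x, centers[j]) - d(x, centers[i])
                               for x in clusters[i])
            if merged < (1 + alpha) * opt:
                return False  # some single deletion is too cheap
    return True
```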

23 World (Venn diagram)
All possible instances; ORSS Stable ⊂ Weak-Deletion Stable. Weak deletion stability: merging any two clusters in OPT(k) increases the cost significantly.

24 β-Distributed Instances
For every cluster C*i and every point p not in C*i, we have d(p, c*i) ≥ β·OPT/|C*i| (for k-means, with d² in place of d). We show that: k-median: (1+α)-weak deletion stability ⇒ (α/2)-distributed. k-means: (1+α)-weak deletion stability ⇒ (α/4)-distributed.
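The same kind of verification sketch for this definition (k-median form; the names are ours):

```python
def beta_distributed(clusters, centers, d, beta, opt):
    for i, center in enumerate(centers):
        threshold = beta * opt / len(clusters[i])
        for j, other in enumerate(clusters):
            if j == i:
                continue
            # Every point outside C*_i must be far from c*_i.
            if any(d(p, center) < threshold for p in other):
                return False
    return True
```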

25 Claim: (1+α)-Weak Deletion Stability ⇒ (α/2)-Distributed
Take p ∉ C*i and let c*j be the center serving p. By weak deletion stability, reassigning all of C*i to c*j adds at least α·OPT:
α·OPT ≤ Σ_{x∈C*i} d(x, c*j) − Σ_{x∈C*i} d(x, c*i) ≤ Σ_{x∈C*i} [d(x, c*i) + d(c*i, c*j)] − Σ_{x∈C*i} d(x, c*i) = Σ_{x∈C*i} d(c*i, c*j) = |C*i|·d(c*i, c*j)
⇒ α·(OPT/|C*i|) ≤ d(c*i, c*j) ≤ d(c*i, p) + d(p, c*j) ≤ 2·d(c*i, p), using that d(p, c*j) ≤ d(p, c*i).

26 World (Venn diagram)
All possible instances; ORSS Stable ⊂ Weak-Deletion Stable ⊂ β-Distributed. In the optimal solution there is a large distance between a center and any "outside" point.

27 Main Result We give a PTAS for β-distributed k-median and k-means instances. Running time: poly(n, k), superpolynomial in 1/ε. There are NP-hard β-distributed instances, so superpolynomial dependence on 1/ε is unavoidable!

28 Stability Yields a PTAS for k-Median and k-Means Clustering
Introduce the k-median / k-means problems. Define stability. PTAS for k-median: high-level description, intuition ("had we only known more…"), full description. Conclusion + open problems.

29 k-Median Algorithm’s Overview
Input: metric, k, β, OPT. 1. Initialization stage: initialize a list L (the set of "suspected" cluster cores). 2. Population stage: populate L with subsets of points. 3. Center-retrieving stage: for each component in L, choose a center (the point minimizing the cost); then, for every choice of k components from L, evaluate the k-median cost with the respective k centers. Key points: the right definition of "core"; we get the core of each cluster; L can't get too big.

30 k-Median Algorithm’s Overview
Input: metric, k, β, OPT. L := list of "suspected" clusters' "cores". 0. Handle "extreme" clusters (brute-force guessing of some clusters' centers). 1. Populate L with components. 2. Pick the best center in each component. 3. Try all possible choices of k centers.

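As a skeleton in code (a loose sketch: Stage 0's guessing is elided, `populate` is the population stage sketched later in the talk, and the exhaustive Stage 3 is written naively):

```python
from itertools import combinations

def kmedian_ptas_skeleton(points, d, k, beta, opt):
    # Stage 0 (omitted here): brute-force guess the centers of "extreme" clusters.
    # Stage 1: populate L, the list of suspected cluster cores.
    L = populate(points, d, beta, opt)
    # Stage 2: in each component, the best center is the point minimizing cost.
    candidates = [min(comp, key=lambda c: sum(d(x, c) for x in comp))
                  for comp in L]
    # Stage 3: try every choice of k candidate centers, keep the cheapest.
    best_cost, best_centers = float("inf"), None
    for centers in combinations(candidates, k):
        cost = sum(min(d(x, c) for c in centers) for x in points)
        if cost < best_cost:
            best_cost, best_centers = cost, centers
    return best_centers, best_cost
```

Since L cannot get too big (it has roughly k plus O(1/β) components, per the proof sketch later), the Stage 3 enumeration stays polynomial for fixed β.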

32 Intuition: “Mind the Gap”
We know: any point p outside C*i has d(p, c*i) ≥ β·OPT/|C*i|. In contrast, an "average" cluster contributes only OPT/k to the cost, so an "average" point p in an "average" cluster C*i sits much closer to its center. There is a gap.

33 Intuition: “Mind the Gap”
We know: any point p outside C*i has d(p, c*i) ≥ β·OPT/|C*i| =: r. Denote by the core of a cluster C*i the ball of radius r/8 around c*i.

34 Intuition: “Mind the Gap”
We know: d(p, c*i) ≥ r := β·OPT/|C*i| for p outside C*i; the core of C*i is the ball of radius r/8 around c*i. Formally, call cluster C*i cheap if cost(C*i) ≤ (εβ/32)·OPT. Assume for now that all clusters are cheap. In general: we brute-force guess the O(1/(βε)) centers of the expensive clusters in Stage 0.

35 Intuition: “Mind the Gap”
Same setup as above. Markov: at most an (ε/4)-fraction of the points of a cheap cluster lie outside the core.

36 Markov Inequality Claim: at most an (ε/4)-fraction of the points of a cheap cluster lie outside the core. Proof by contradiction: if more than (ε/4)·|C*i| points lay outside the core, each would contribute more than r/8 = β·OPT/(8|C*i|), so cost(C*i) > (ε/4)·|C*i|·β·OPT/(8|C*i|) = (εβ/32)·OPT ⇒ the cluster isn't cheap.

37 Intuition: “Mind the Gap”
Same setup as above. In particular (Markov): at least half of the points of a cheap cluster lie inside the core.

38 Magic (r/4) Ball
Denote r = β·(OPT/|C*i|). Call a ball "heavy" if its mass is ≥ |C*i|/2. If p belongs to the core ⇒ B(p, r/4) contains ≥ |C*i|/2 points, since any two core points are within 2·(r/8) = r/4 of each other. (Figure: core of radius ≤ r/8 around c*i; points outside C*i at distance > r.)

39 Magic (r/4) Ball
Draw a ball of radius r/4 around every point; unite "heavy" balls whose centers overlap. Denote r = β·(OPT/|C*i|). Result: all points in the core are merged into one set!

40 Magic (r/4) Ball
Draw a ball of radius r/4 around every point; unite "heavy" balls whose centers overlap. Denote r = β·(OPT/|C*i|). But: could we merge core points with points from other clusters?

41 Magic (r/4) Ball
Suppose a heavy ball around some point p, with r/2 ≤ d(p, c*i) ≤ 3r/4, merges with the core. Then every x in B(p, r/4) satisfies r/4 = r/2 − r/4 ≤ d(x, c*i) ≤ 3r/4 + r/4 = r.

42 Magic (r/4) Ball
So r/4 ≤ d(x, c*i) ≤ r for every such x: x falls outside the core (its distance exceeds r/8), yet x belongs to C*i (every point outside C*i is at distance > r from c*i).

43 Magic (r/4) Ball
But the ball around p is heavy, so more than |C*i|/2 points of C*i fall outside the core, contradicting the Markov bound. ⇒⇐

44 Finding the Right Radius
Draw a ball of radius r/4 around every point; unite heavy balls whose centers overlap, with r = β·(OPT/|C*i|). Problem: we don't know |C*i|. Solution: try all sizes, in order! For s = n, n−1, n−2, …, 1, set r_s = β·(OPT/s). Complication: when s gets small (s = 4, 3, 2, 1) we collect many "leftovers" of one cluster. Solution: once we add a subset to L, we remove all close-by points.

45 Population Stage
For s = n, n−1, n−2, …, 1: set r_s = β·(OPT/s); draw a ball of radius r_s/4 around each point; unite balls containing ≥ s/2 points whose centers overlap. Once a set of size ≥ s/2 is found: put this set in L, and remove all points in an (r_s/2)-"buffer zone" around it.
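A sketch of this stage (one reading of "unite balls whose centers overlap": connect two heavy-ball centers when one lies within r_s/4 of the other and take connected components; all names are ours):

```python
from collections import deque

def populate(points, d, beta, opt):
    L, active = [], set(points)
    for s in range(len(points), 0, -1):
        r = beta * opt / s
        # Heavy centers: at least s/2 active points within radius r/4.
        heavy = [p for p in active
                 if sum(1 for q in active if d(p, q) <= r / 4) >= s / 2]
        done = set()
        for p in heavy:
            if p in done or p not in active:
                continue
            # Grow a connected component of nearby heavy centers (BFS).
            comp, queue = {p}, deque([p])
            while queue:
                u = queue.popleft()
                for v in heavy:
                    if v not in comp and d(u, v) <= r / 4:
                        comp.add(v)
                        queue.append(v)
            done |= comp
            united = {q for c in comp for q in active if d(c, q) <= r / 4}
            if len(united) >= s / 2:
                L.append(united)
                # Buffer zone: discard all points within r/2 of the new set.
                active -= {q for q in active
                           if any(d(q, x) <= r / 2 for x in united)}
    return L
```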

55 Population Stage: Remainder of the Proof
Even with the "buffer zones", we still collect the cores. The number of components in L without core points is O(1/β). And cost(k centers chosen from the cores) ≤ (1+ε)·OPT.

56 A Note About k-Means
Roughly the same algorithm, with constants roughly squared. Problem: we can't guess centers for expensive clusters! Solution: a random sample of O(1/ε) points from each cluster approximates its center of mass, so brute-force guess O(1/ε) points from each of the O(1/(βε)) expensive clusters. Better solution: randomly sample O(1/ε) points only from expensive clusters whose size is at least a poly(1/k) fraction of the instance. Slight complication: introduce intervals. Expected runtime:

57 Conclusion ∀ε > 0: a (1+ε)-approximation algorithm for β-distributed instances of k-median / k-means. Open: improve the constants? other clustering objectives (k-center)? (Venn diagram: all possible instances ⊇ β-Distributed ⊇ Weak-Deletion Stable ⊇ ORSS Stable.)

58 Take Home Message
Life gives you a k-median instance. "Can you solve it?" "NO!!!" Stability = a belief that a PTAS is meaningful, and this belief alone allows us to obtain a PTAS! Stability gives us an "Archimedean point" that lets us bypass NP-hardness. But that's not new! To what other NP-hard problems does similar logic apply?

59 Thank you!

60 World (Venn diagram)
All possible instances; BBG Stable+; ORSS Stable; Weak-Deletion Stable; β-Distributed.

61 BBG Result
We have a target clustering; k-median is a proxy: the target is close to OPT(k). Problem: k-median is NP-hard. Solution: use an approximation algorithm. We would like: our (1+α)-approximation algorithm outputs a meaningful k-clustering. (Speaker note: the proxy means your goal is to retrieve the target clustering; that k-median and the target are close is something you assume.)

63 BBG Result We have a target clustering; k-median is a proxy: the target is close to OPT(k). Problem: k-median is NP-hard. Solution: use an approximation algorithm. Implicit assumption: any k-clustering with cost at most (1+α)·OPT is δ-close (pointwise) to the target.

64 BBG Result
The instance is (BBG) stable: any two k-partitions with cost ≤ (1+α)·OPT(k) differ over no more than a (2δ)-fraction of the input. They give an algorithm that gets O(δ/α)-close to the target; additionally (k-median), if all clusters have size Ω(δn/α), it gets δ-close to the target. Our result: given BBG-stability and clusters of size > 2δn, there is a PTAS for k-median (which implies getting δ-close to the target). (Speaker note: mention that for k-means they get δ/α-close, whereas we get δ-close.)

65 Claim: BBG-Stability & Large Clusters ⇒ (1+α)-Weak Deletion Stability
We know: any two k-partitions with cost ≤ (1+α)·OPT(k) differ over at most a 2δ-fraction of the input, and all clusters contain > 2δn points. Take the optimal k-clustering; take C*i and move all of its points except c*i to C*j. The new partition and OPT differ on > 2δn points ⇒ cost(OPT_{i→j}) ≥ (1+α)·OPT(k). (Speaker notes: because the clusters are large, this counting argument is possible; any (k-1)-clustering is far from the target clustering. Might skip this slide.)
