Improved approximation for k-median Shi Li Department of Computer Science Princeton University Princeton, NJ, /20/2013
[Figure: facility location example with maintenance costs ($100, $130) and transportation costs ($10, $20, $50, $30); minimize maintenance cost + transportation cost.]
Facility Location Problem
BALINSKI, M. L. On finding integer solutions to linear programs. In Proceedings of the IBM Scientific Computing Symposium on Combinatorial Problems. IBM, New York, pp. 225–248.
KUEHN, A. A., AND HAMBURGER, M. J. A heuristic program for locating warehouses.
STOLLSTEIMER, J. F. The effect of technical change and output expansion on the optimum number, size and location of pear marketing facilities in a California pear producing region. Ph.D. thesis, Univ. California at Berkeley, Berkeley, Calif.
STOLLSTEIMER, J. F. A working model for plant numbers and locations. J. Farm Econom. 45, 631–645.
Uncapacitated Facility Location (UFL)
F : potential facility locations
C : set of clients
f_i, i ∈ F : cost for opening i
d : metric over F ∪ C
find S ⊆ F, minimize Σ_{i∈S} f_i + Σ_{j∈C} d(j, S) (facility cost + connection cost), where d(j, S) = min_{i∈S} d(j, i)
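The UFL objective can be evaluated directly from its definition. The sketch below is only an illustration: the function name `ufl_cost`, the facility and client labels, and all numbers are made up for this example and are not taken from the talk.

```python
def ufl_cost(opened, opening_cost, dist, clients):
    """Facility cost plus connection cost when the set `opened` is opened."""
    facility = sum(opening_cost[i] for i in opened)
    # Each client connects to its nearest open facility.
    connection = sum(min(dist[j][i] for i in opened) for j in clients)
    return facility + connection

# Hypothetical toy instance: two facilities "u", "v" and three clients.
opening_cost = {"u": 100, "v": 130}
dist = {
    "c1": {"u": 10, "v": 50},
    "c2": {"u": 20, "v": 30},
    "c3": {"u": 50, "v": 10},
}
clients = list(dist)

print(ufl_cost({"u"}, opening_cost, dist, clients))       # 100 + (10 + 20 + 50) = 180
print(ufl_cost({"u", "v"}, opening_cost, dist, clients))  # 230 + (10 + 20 + 10) = 270
```

Opening the second facility lowers the connection cost but the extra opening cost outweighs it here, which is the trade-off the objective captures.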
Wal-mart Stores in New Jersey
Question : Suppose you have a budget for 50 stores. How would you select the 50 locations?
k-median
F : potential facility locations
C : set of clients
d : metric over F ∪ C
k : number of facilities to open (replaces the opening costs f_i, i ∈ F, of UFL)
find S ⊆ F with |S| = k, minimize Σ_{j∈C} d(j, S)
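For very small instances the k-median objective can be minimized by trying every set of k facilities. This is a minimal sketch for intuition only (exponential time, not an approximation algorithm); the names `kmedian_cost` and `brute_force_kmedian` and the points on a line are assumptions made for the example.

```python
from itertools import combinations

def kmedian_cost(opened, clients, d):
    """Total distance from each client to its nearest open facility."""
    return sum(min(d(j, i) for i in opened) for j in clients)

def brute_force_kmedian(facilities, clients, d, k):
    """Try every k-subset of facilities; only feasible for tiny instances."""
    return min(combinations(facilities, k),
               key=lambda S: kmedian_cost(S, clients, d))

# Hypothetical instance: facilities and clients are the same points on a line.
points = [0, 1, 10, 11]
d = lambda x, y: abs(x - y)

best = brute_force_kmedian(points, points, d, k=2)
print(best, kmedian_cost(best, points, d))  # (0, 10) 2
```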
k-median clustering
Known Results: UFL
O(log n)-approximation [Hoc82]
constant approximations : 3.16 [STA98]; 2.41 [GK99]; 3 [JV99]; [CG99]; [CG99]; 5+ε [Kor00]; [MMSV01]; [CS03]; 1.61 [JMS02]; [Svi02]; 1.52 [MYZ02]; 1.50 [Byr07]; [Li11]
hardness of approx. [GK98]
4 Deterministic rounding of linear programs
  4.5 The uncapacitated facility location problem
5 Random sampling and randomized rounding of linear programs
  5.8 The uncapacitated facility location problem
7 The primal-dual method
  7.6 The uncapacitated facility location problem
9 Further uses of greedy and local search algorithms
  9.1 A local search algorithm for the uncapacitated facility location problem
  9.4 A greedy algorithm for the uncapacitated facility location problem
12 Further uses of random sampling and randomized rounding of linear programs
  12.1 The uncapacitated facility location problem
Known Results: k-median
pseudo-approximation
  1-approx. with O(k log n) facilities [Hoc82]
  2(1+ε)-approx. with (1+1/ε)k facilities [LV92]
super-constant approximation
  O(log n loglog n) [Bar96, Bar98]
  O(log k loglog k) [CCGS98]
Known Results: k-median
constant approximation
  LP rounding : [CGTS99]; 3.25 [CL12]
  Primal-Dual : 6 [JV99]; 4 [CG99]; 4 [JMS03]
  Local Search : 3+ε [AGK+01]
  1+√3+ε [LS13]
(1+2/e)-hardness of approximation [JMS03]
Lloyd Algorithm [Lloyd82]
k-means clustering : minimize total squared distances
k-means vs k-median clustering :
  k-means is more often used
  Walmart example : k-median is more appropriate
  approximation : k-median is “easier”
Local Search
Can we improve the solution by p swaps?
No : stop
Yes : swap and repeat
Approximation :
  k-median : 3+2/p [AGK+01]
  k-means : (3+2/p)^2 [KMN+02]
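The swap loop above can be sketched for the simplest case p = 1 (one facility swapped at a time). This is an illustrative sketch, not the algorithm as analysed in [AGK+01]: it accepts any strictly improving swap, whereas a polynomial running time needs a minimum-improvement threshold. The function name and the test instance are assumptions.

```python
def local_search_kmedian(facilities, clients, d, k):
    """Single-swap (p = 1) local search; returns (open set, cost)."""
    def cost(S):
        return sum(min(d(j, i) for i in S) for j in clients)

    current = set(facilities[:k])  # arbitrary starting solution
    while True:
        better = None
        for out in current:
            for inn in facilities:
                if inn in current:
                    continue
                candidate = (current - {out}) | {inn}
                if cost(candidate) < cost(current):
                    better = candidate
                    break
            if better is not None:
                break
        if better is None:          # no improving swap: local optimum
            return current, cost(current)
        current = better            # swap and repeat

# Hypothetical instance: points on a line, facilities = clients.
points = [0, 1, 10, 11]
solution, value = local_search_kmedian(points, points, lambda x, y: abs(x - y), k=2)
print(sorted(solution), value)
```

On this instance every pair with one point from each cluster costs 2 and both same-cluster pairs cost 19, so any local optimum has cost 2.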
LP for k-median
y_i : whether to open i
x_{i,j} : whether to connect j to i
constraints :
  open at most k facilities
  client j must be connected
  client j can only be connected to an open facility
integrality gap is at least 2
integrality gap is at most 3 (proof non-constructive)
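The three constraints described above correspond to the standard k-median LP relaxation. The formulation below is the usual textbook form, written out as an assumption about what the slide displayed, using the symbols y_i, x_{i,j}, F, C, d and k defined in the talk.

```latex
\begin{align*}
\min\quad & \sum_{i \in F} \sum_{j \in C} d(i,j)\, x_{i,j} \\
\text{s.t.}\quad
  & \sum_{i \in F} y_i \le k
      && \text{(open at most $k$ facilities)} \\
  & \sum_{i \in F} x_{i,j} = 1 \quad \forall j \in C
      && \text{(client $j$ must be connected)} \\
  & x_{i,j} \le y_i \quad \forall i \in F,\ j \in C
      && \text{(only connect to an open facility)} \\
  & x_{i,j},\, y_i \ge 0 \quad \forall i \in F,\ j \in C
\end{align*}
```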
(1+√3+ε)-approximation on k-median
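k-median and UFL
f = cost of a facility; as f increases, the number of open facilities decreases
Given a black-box α-approximation A for UFL
Naïve try : find an f such that A opens k facilities
α-approximation for k-median? No.
Proof : α ≤ 1.50 is achievable for UFL [Byr07], while k-median is hard to approximate within 1+2/e ≈ 1.736 [JMS03]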
k-median and UFL
Naïve try : find an f such that A opens k facilities
2 issues with naïve try :
1. need LMP α-approximation for UFL
   α-approximation : facility cost + connection cost ≤ α · OPT
   LMP α-approximation : α · (facility cost) + connection cost ≤ α · OPT
   LMP = Lagrangian Multiplier Preserving
k-median and UFL
Naïve try : find an f such that A opens k facilities
2 issues with naïve try :
1. need LMP α-approximation for UFL
2. cannot find f s.t. A opens exactly k facilities
   S_1 : set of k_1 < k facilities
   S_2 : set of k_2 > k facilities
   together : bi-point solution
k-median and UFL
2 issues with naïve try :
1. need LMP α-approximation for UFL
2. cannot find f s.t. A opens exactly k facilities
LMP approx. factor : 3 [JV], 2 [JMS], 2 (our result); do not know how to improve
bi-point → integral : × 2 [JV], [JMS]; this factor of 2 is tight!
final ratio for k-median : 6 [JV], 4 [JMS]
bi-point solution
k_1 = |S_1| < k ≤ |S_2| = k_2
a, b : a k_1 + b k_2 = k, a + b = 1
bi-point solution : a S_1 + b S_2
cost(a S_1 + b S_2) = a cost(S_1) + b cost(S_2)
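The two equations a k_1 + b k_2 = k and a + b = 1 determine the weights uniquely: b = (k − k_1)/(k_2 − k_1) and a = 1 − b. A small sketch with exact fractions follows; the function names and the numbers in the example are hypothetical.

```python
from fractions import Fraction

def bipoint_weights(k1, k2, k):
    """Solve a*k1 + b*k2 = k, a + b = 1 for (a, b)."""
    if not (k1 < k <= k2):
        raise ValueError("need k1 < k <= k2")
    b = Fraction(k - k1, k2 - k1)
    return 1 - b, b

def bipoint_cost(a, b, cost_s1, cost_s2):
    """cost(a*S1 + b*S2) = a*cost(S1) + b*cost(S2)."""
    return a * cost_s1 + b * cost_s2

# Hypothetical numbers: |S1| = 3, |S2| = 8, target k = 5.
a, b = bipoint_weights(3, 8, 5)
print(a, b)                          # 3/5 2/5
print(bipoint_cost(a, b, 100, 40))   # 76
```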
gap-2 instance
[Figure: k+1 clients, each at distance 1 from the single facility of S_1 and at distance 0 from its own facility in S_2.]
k_1 = 1, k_2 = k+1
cost(S_1) = k+1, cost(S_2) = 0
cost of integral solution = 2
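The gap can be checked by plugging the instance's numbers into the bi-point definitions (a k_1 + b k_2 = k, a + b = 1). The short calculation below is a worked verification added for the reader, using only the values stated on the slide.

```latex
\begin{align*}
a \cdot 1 + b\,(k+1) = k,\quad a + b = 1
  &\;\Longrightarrow\; b = \frac{k-1}{k},\qquad a = \frac{1}{k}, \\
\mathrm{cost}(a S_1 + b S_2)
  &= a\,(k+1) + b \cdot 0 = \frac{k+1}{k}, \\
\frac{\text{cost of integral solution}}{\mathrm{cost}(a S_1 + b S_2)}
  &= \frac{2}{(k+1)/k} = \frac{2k}{k+1} \;\longrightarrow\; 2
  \quad (k \to \infty).
\end{align*}
```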
k-median and UFL
Main Lemma 1 : suffices to give an α-approximate solution with k+O(1) facilities
Main Lemma 2 : bi-point solution of cost C → solution of cost ((1+√3+ε)/2) · C with k+O(1/ε) facilities
LMP approx. factor : 3 [JV], 2 [JMS], 2 (our result)
bi-point → integral : × 2 [JV], [JMS]; this factor of 2 is tight!
bi-point → pseudo-integral : × (1+√3+ε)/2 (our result)
final ratio for k-median : 6 [JV], 4 [JMS], 1+√3+ε (our result)
Main Lemma 1
A : black-box α-approximation with k+c open facilities
A' : (α+ε)-approximation with k open facilities
A' calls A n^{O(c/ε)} times.
bad instance : with k+1 open facilities, cost = 0; with k open facilities, cost huge
Dense Facility
B_i : set of clients in a small ball around i
i is A-dense if the connection cost of B_i in OPT is ≥ A
this instance : i is A-dense for A ≈ opt
Dense Facility
Reduction component works directly if there are no opt/t-dense facilities, t = O(c/ε)
can reduce to such an instance in n^{O(t)} time
Lemma 1 from [ABS]
[Awasthi-Blum-Sheffet] : ε, δ > 0 constants, OPT_{k-1} ≥ (1+δ) OPT_k ⟹ can find (1+ε)-approximation
k-median clustering is easy in practice; reason : there is a “meaningful” clustering
Main Lemma 1 : suffices to give an α-approximate solution with k+O(1) facilities
A : α-approximation algorithm for k-median with k+c medians
[ABS] : OPT_{k-1} ≥ (1+δ) OPT_k ⟹ (1+ε)-approximation
Algorithm
Apply A to (k-c, F, C, d) → solution with k facilities of cost ≤ α OPT_{k-c}
Apply [ABS] to each (k-i, F, C, d) for i = 0, 1, 2, …, c-1
Output the best of the c+1 solutions
Proof
If OPT_{k-c} ≤ (1+ε) OPT_k, then done.
Otherwise, consider the smallest i s.t. OPT_{k-i-1} ≥ (1+ε)^{1/c} OPT_{k-i}
By minimality of i, OPT_{k-i} ≤ (1+ε)^{i/c} OPT_k ≤ (1+ε) OPT_k
[ABS] on (k-i, F, C, d) → solution of cost (1+ε) OPT_{k-i} ≤ (1+ε)^2 OPT_k
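The three steps of the algorithm can be sketched as a small driver that treats A and the [ABS] procedure as black boxes. Everything below is hypothetical scaffolding: `pseudo_approx(m)` stands for A run with parameter m (it may open up to m+c facilities), `abs_solver(m)` stands for [ABS] run with parameter m, and in the demo both are replaced by brute-force stand-ins on a toy instance.

```python
from itertools import combinations

def best_of_candidates(k, c, pseudo_approx, abs_solver, cost):
    """Run A on k-c, [ABS] on k, k-1, ..., k-c+1, return the cheapest solution."""
    candidates = [pseudo_approx(k - c)]                   # at most (k-c)+c = k facilities
    candidates += [abs_solver(k - i) for i in range(c)]   # k-i <= k facilities each
    return min(candidates, key=cost)                      # best of the c+1 solutions

# Toy stand-ins (assumptions for the demo only): exact search on points of a line.
points = [0, 1, 10, 11, 20]

def cost(S):
    return sum(min(abs(j - i) for i in S) for j in points)

def exact(m):
    return min(combinations(points, m), key=cost)

k, c = 3, 1
solution = best_of_candidates(
    k, c,
    pseudo_approx=lambda m: exact(m + c),  # pretends to be A: opens m+c facilities
    abs_solver=exact,                      # pretends to be [ABS]
    cost=cost,
)
print(solution, cost(solution))  # (0, 10, 20) 2
```

The driver only encodes the "output the best of the c+1 solutions" step; the approximation guarantee comes from the case analysis in the proof, not from this code.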
Main Lemma 2 : bi-point solution of cost C → solution of cost ((1+√3+ε)/2) · C with k+O(1/ε) facilities
[JV] : bi-point solution of cost C → solution of cost 2C
based on improving [JV] algorithm
JV algorithm
given : bi-point solution a S_1 + b S_2
τ_i = nearest facility of i
select S'_2 ⊆ S_2, |S'_2| = |S_1| = k_1
with prob. a, open S_1; with prob. b, open S'_2
randomly open k-k_1 facilities in S_2 \ S'_2
guarantee : either i is open, or τ_i is open
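The rounding steps above can be sketched as follows. This is a minimal sketch under stated assumptions: S_1 and S_2 are disjoint, τ_i for i ∈ S_1 is taken to be the nearest facility in S_2, S'_2 is the set of these τ_i padded arbitrarily to size k_1, and the function name and toy instance are invented for the example.

```python
import random

def jv_round(S1, S2, k, d, rng=random):
    """Round a bi-point solution a*S1 + b*S2 to exactly k open facilities."""
    k1, k2 = len(S1), len(S2)
    if not (k1 < k <= k2):
        raise ValueError("need |S1| < k <= |S2|")
    b = (k - k1) / (k2 - k1)
    a = 1 - b
    # tau[i]: nearest facility of i (assumed here: nearest one in S2).
    tau = {i: min(S2, key=lambda i2: d(i, i2)) for i in S1}
    S2_prime = set(tau.values())
    for i2 in S2:                      # pad S'_2 arbitrarily up to size k1
        if len(S2_prime) == k1:
            break
        S2_prime.add(i2)
    rest = [i2 for i2 in S2 if i2 not in S2_prime]
    # With prob. a open S1, with prob. b open S'_2.
    opened = set(S1) if rng.random() < a else set(S2_prime)
    # Additionally open k - k1 random facilities of S2 \ S'_2.
    opened |= set(rng.sample(rest, k - k1))
    return opened, tau

# Hypothetical instance on a line.
S1, S2, k = [0, 100], [1, 2, 50, 99, 101], 3
d = lambda x, y: abs(x - y)
opened, tau = jv_round(S1, S2, k, d, random.Random(0))
print(sorted(opened))
```

Each facility of S_2 \ S'_2 is opened with probability (k − k_1)/(k_2 − k_1) = b, which is what the analysis of the expected cost relies on.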
Analysis of JV algorithm
i_1 ∈ S_1, i_3 ∈ S'_2; d_1 = d(j, i_1), d_2 = d(j, i_2); d(i_1, i_3) ≤ d_1 + d_2
If i_2 is open, connect j to i_2
Otherwise, if i_1 is open, connect j to i_1
Otherwise connect j to i_3 (either i_1 or i_3 is open)
E[cost of j] ≤ 2 × [cost of j in a S_1 + b S_2]
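The factor 2 can be verified with a short calculation. The derivation below is a sketch added for the reader; it assumes the harder case i_2 ∉ S'_2, where i_2 is opened with probability b independently of the choice between S_1 and S'_2, and it writes d_1 = d(j, i_1), d_2 = d(j, i_2) with i_3 = τ_{i_1}.

```latex
\begin{align*}
d(j, i_3) &\le d_1 + d(i_1, i_3) \le d_1 + d(i_1, i_2) \le 2 d_1 + d_2, \\
\mathbb{E}[\text{cost of } j]
  &\le b\, d_2 + (1-b)\bigl(a\, d_1 + b\,(2 d_1 + d_2)\bigr) \\
  &= a(a + 2b)\, d_1 + b(1 + a)\, d_2
   = a(1+b)\, d_1 + b(1+a)\, d_2 \\
  &\le 2\,(a\, d_1 + b\, d_2)
   = 2 \times [\text{cost of } j \text{ in } a S_1 + b S_2].
\end{align*}
```

The second line uses 1 − b = a, and the third uses a + b = 1 together with a, b ≤ 1.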
Our Algorithm
on average, d_1 >> d_2
If i_2 is open, connect j to i_2
Otherwise, if i_1 is open, connect j to i_1
Otherwise connect j to i_3
d(i_2, i_3) ≤ d_1 + d_2, so d(j, i_3) ≤ d_1 + 2 d_2 (instead of 2 d_1 + d_2 in JV)
JV bound for comparison : E[cost of j] ≤ 2 × [cost of j in a S_1 + b S_2]
Our Algorithm
need to guarantee : either i is open, or τ_i is open
for a star, either the center is open, or all leaves are open
first try : open each star independently? with prob. a, open the center; with prob. b, open the leaves
problem : cannot bound the number of open facilities
idea :
  big stars : always open the center, open each leaf with prob. ≈ b
  group small stars of the same size, dependent rounding
  for each group, open 3 more facilities than expected
small stars
small star : star of size ≤ 2/(abε)
M_h : set of stars of size h, m = |M_h|
Roughly : for am stars, open the center; for bm stars, open the leaves
More accurately : permute the stars and the facilities; open top centers, open bottom leaves
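The "roughly" version of the small-star step can be sketched as below. This is a simplified illustration only, not the dependent rounding of the talk (which also permutes facilities and opens a few extra ones): it shuffles one group of m stars of equal size h, opens the center for ⌈am⌉ of them and all leaves for the rest. The function name, the (center, leaves) representation and the example data are assumptions.

```python
import math
import random

def round_small_star_group(stars, a, rng=random):
    """stars: list of (center, leaves) pairs, all with the same number of leaves."""
    order = list(stars)
    rng.shuffle(order)                        # random permutation of the stars
    n_centers = math.ceil(a * len(order))     # roughly a*m stars open their center
    opened = set()
    for idx, (center, leaves) in enumerate(order):
        if idx < n_centers:
            opened.add(center)                # this star opens its center
        else:
            opened.update(leaves)             # this star opens all of its leaves
    return opened

# Hypothetical group: m = 4 stars of size h = 2, with a = b = 1/2.
stars = [("c1", ["l1", "l2"]), ("c2", ["l3", "l4"]),
         ("c3", ["l5", "l6"]), ("c4", ["l7", "l8"])]
opened = round_small_star_group(stars, a=0.5, rng=random.Random(1))
print(len(opened))  # 6 = m * (a + b*h)
```

By construction every star has either its center or all of its leaves open, and the group opens fewer than m(a + bh) + 1 facilities.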
big stars
size h > 2/(abε)
always open the center
randomly open ≈ bh leaves for a big star
Lemma : we open at most k + 6/(abε) facilities.
for a big star of size h, compare FRAC : a + bh with ALG
for a group of m small stars of size h, compare FRAC : m(a + bh) with ALG
there are at most 2/(abε) groups
Summary
Main Lemma 1 : suffices to give an α-approximate solution with k+O(1) facilities
Main Lemma 2 : bi-point solution of cost C → solution of cost ((1+√3+ε)/2) · C with k+O(1/ε) facilities
LMP approx. factor : 3 [JV], 2 [JMS], 2 (our result)
bi-point → integral : × 2 [JV], [JMS]
bi-point → pseudo-integral : × (1+√3+ε)/2 (our result)
final ratio for k-median : 6 [JV], 4 [JMS], 1+√3+ε (our result)
Open Problems
gap between integral solution with k+1 open facilities and LP value (with k open facilities)?
tight analysis?
does the algorithm work for k-means?