Stream-based Geometric Algorithms
  Piotr Indyk, MIT
Streaming Algorithms for Geometric Problems
  Input: a stream $S = p_1 \dots p_n$ of points in $\mathbb{R}^d$
  Goal: compute a certain geometric quantity and/or structure
  Variations:
    Dynamic case: points can be deleted
    Sliding window: points disappear after some time $t$
Minimum Spanning Tree
  The tree has representation size $\Omega(n)$
  We only estimate the cost of the MST
Minimum Weight Matching
Minimum Weight Bichromatic Matching
Facility Location
  Goal: choose a set $F$ of facilities to minimize the sum of the distances to the nearest facility plus the number of facilities times $f$
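The objective on this slide can be written out directly. The sketch below is only an illustration of the cost being minimized, not of the streaming algorithm; it assumes Euclidean distance, and the function name `facility_location_cost` and the sample points are hypothetical.

```python
import math

def facility_location_cost(points, facilities, f):
    """Sum of distances to the nearest facility, plus f per chosen facility."""
    # Service cost: every point pays the distance to its closest facility.
    service = sum(min(math.dist(p, q) for q in facilities) for p in points)
    # Opening cost: the number of facilities times f, as on the slide.
    return service + f * len(facilities)

points = [(0, 0), (0, 2), (10, 0)]
one = facility_location_cost(points, [(0, 0)], f=3)
two = facility_location_cost(points, [(0, 0), (10, 0)], f=3)
print(one, two)  # opening a second facility pays off here because f is small
```

Setting `f` larger than the distance saved would make the single-facility solution the cheaper one, which is the trade-off the objective encodes.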
K-median
  $K$ is given
  Goal: choose $K$ medians to minimize the sum of the distances to the nearest median
Known Results
  Computing $L_p$ norms of a stream (Graham’s talk)
  Clustering of points in metric spaces
    Charikar et al ’97, ’03; Guha et al ’00: K-center and K-median, (K) space, no deletions
    Meyerson ’02: Facility location, (|F|) space, no deletions
More Known Results
  Approximate diameter etc
  Convex hulls etc
  Indyk’03: high dimensions
  Feigenbaum et al, Hershberger et al, Cormode et al’03: low dimensions
Our Results
  Problem    Type   Delete   Space                  Appr.
  MST        Cost   Yes      polylog(D,n)           log D
  MWM
  MWBM*
  Fac.Loc.          No                              log^2 D
  K-median   Full            poly(K,log D,log n)
  *follows Charikar’02; also Varadarajan’02 and Indyk-Thaper’02
Applications
  MST, MWM: ?
  MWBM: similarity of low-dim data sets
  Fac. Loc.: “clusterability” of a data set
  K-median: allocation of servers to clients (Muthu’03)
  log D might not be so bad in practice (1.1 in Indyk-Thaper’03)
Approach
  Impose square grids $G_0 \dots G_k$, with side lengths $2^0, 2^1, \dots, 2^k$, shifted at random.
  For each square cell $c$ in $G_i$, let $n_P(c)$ be the number of points from $P$ in $c$.
  The algorithms will maintain certain statistics over $n_P(\cdot)$, which will allow them to approximately solve the problems.
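A minimal sketch of the grid bookkeeping described on this slide. It keeps exact per-cell counters in memory, which a real small-space algorithm would replace with compact statistics; the class name `ShiftedGrids`, the choice of one shift vector shared by all levels, and the shift range are assumptions made for illustration.

```python
import math
import random
from collections import Counter

class ShiftedGrids:
    """Exact cell counts n_P(c) for grids with side lengths 2^0 .. 2^k."""

    def __init__(self, k, dim, seed=None):
        rng = random.Random(seed)
        self.k = k
        # Assumption: a single random shift, shared by all levels,
        # drawn uniformly from [0, 2^k) in each coordinate.
        self.shift = [rng.uniform(0, 2 ** k) for _ in range(dim)]
        # counts[i] maps a cell of grid G_i to the number of points in it.
        self.counts = [Counter() for _ in range(k + 1)]

    def cell(self, i, p):
        """Index of the cell of G_i (side length 2^i) that contains p."""
        side = 2 ** i
        return tuple(math.floor((x + s) / side) for x, s in zip(p, self.shift))

    def update(self, p, delta=1):
        """Insert p (delta=+1) or delete a previously inserted p (delta=-1)."""
        for i in range(self.k + 1):
            c = self.cell(i, p)
            self.counts[i][c] += delta
            if self.counts[i][c] == 0:
                del self.counts[i][c]  # keep only non-empty cells

grids = ShiftedGrids(k=4, dim=2, seed=0)
for p in [(1, 1), (2, 3), (9, 9)]:
    grids.update(p)
grids.update((2, 3), delta=-1)  # a deletion in the dynamic case
print([sum(level.values()) for level in grids.counts])  # each level counts 2 points
```

Because a deletion touches exactly the cells the insertion touched, the counters stay consistent under any interleaving of inserts and deletes of existing points.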
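Estimators
  MST: $\sum_i 2^i \sum_{c \in G_i} [n_P(c) > 0]$
  MWM: $\sum_i 2^i \sum_{c \in G_i} [n_P(c) \text{ is odd}]$
  MWBM: $\sum_i 2^i \sum_{c \in G_i} |n_G(c) - n_B(c)|$
  Fac. Loc.: $\sum_i 2^i \sum_{c \in G_i} \min[n_P(c), T_i]$
  K-median: $\sum_{c \in B_j} n_P(c)$ for $B_1 \dots B_l$ sampled from the $G_i$’s with density $1/K$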
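The first four sums can be evaluated literally once the cell counts are available. The sketch below assumes the counts are given exactly as a list of dictionaries, one per grid level, mapping a cell to its count; the function names, the toy counts, and the threshold values passed for T_i are all hypothetical, since the slide does not specify how the thresholds are chosen.

```python
def mst_estimate(counts):
    """Sum over levels i of 2^i times the number of non-empty cells."""
    return sum(2 ** i * sum(1 for n in level.values() if n > 0)
               for i, level in enumerate(counts))

def mwm_estimate(counts):
    """Sum over levels i of 2^i times the number of cells with an odd count."""
    return sum(2 ** i * sum(1 for n in level.values() if n % 2 == 1)
               for i, level in enumerate(counts))

def mwbm_estimate(counts_g, counts_b):
    """Sum over levels i of 2^i times the per-cell color imbalance."""
    total = 0
    for i, (g, b) in enumerate(zip(counts_g, counts_b)):
        cells = set(g) | set(b)  # cells empty in both colors contribute 0
        total += 2 ** i * sum(abs(g.get(c, 0) - b.get(c, 0)) for c in cells)
    return total

def facility_estimate(counts, thresholds):
    """Sum over levels i of 2^i times min(count, T_i); T_i supplied by caller."""
    return sum(2 ** i * sum(min(n, thresholds[i]) for n in level.values())
               for i, level in enumerate(counts))

# Toy counts for levels 0..3 (made up for illustration).
counts = [
    {(0, 0): 1, (1, 0): 1, (5, 5): 1},
    {(0, 0): 2, (2, 2): 1},
    {(0, 0): 2, (1, 1): 1},
    {(0, 0): 3},
]
print(mst_estimate(counts))                     # 3*1 + 2*2 + 2*4 + 1*8 = 23
print(mwm_estimate(counts))                     # 3*1 + 1*2 + 1*4 + 1*8 = 17
print(facility_estimate(counts, [1, 1, 2, 2]))  # 3*1 + 2*2 + 3*4 + 2*8 = 35
```

Each estimator reads only the counts, so it can be recomputed after any insertion or deletion without revisiting the points themselves.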
Proofs
  View the grids as a probabilistic embedding of $P$ into a tree (HSTs)
  Show how to solve the problem in HSTs
  Show how to express the solution using just the $n_P(c)$’s
  First application of this kind of embedding to streaming
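One way to picture the tree: if all grid levels share the same shift, each cell of side 2^i lies inside exactly one cell of side 2^(i+1), so the cells form a tree with the finest cells as leaves. The sketch below computes the resulting tree distance between two points. The edge weight of 2^i between a level-i cell and its parent is an assumption chosen to mirror the 2^i factors in the estimators, and the names `cell` and `tree_distance` are hypothetical.

```python
import math

def cell(i, p, shift):
    """Index of the level-i cell (side length 2^i) containing p."""
    side = 2 ** i
    return tuple(math.floor((x + s) / side) for x, s in zip(p, shift))

def tree_distance(p, q, shift, k):
    """Tree distance between the level-0 cells of p and q.

    Assumes an edge of weight 2^i between each level-i cell and its parent,
    so climbing from level 0 to level m costs 2^0 + ... + 2^(m-1) = 2^m - 1.
    """
    for m in range(k + 1):
        if cell(m, p, shift) == cell(m, q, shift):
            return 2 * (2 ** m - 1)  # up from p's leaf, down to q's leaf
    raise ValueError("points do not share a cell at the top level k")

shift = (0.0, 0.0)  # fixed shift, to keep the example deterministic
print(tree_distance((0, 0), (3, 0), shift, k=4))  # first shared cell at level 2
print(tree_distance((1, 0), (2, 0), shift, k=4))  # close points split by a grid line
```

The second call shows why the shift is random: for a fixed shift, two nearby points can be separated by a coarse grid line and end up far apart in the tree, whereas under a random shift this happens only with small probability.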
Conclusions and Open Problems
  Replace log D by O(1)
  Other applications?