1 Streaming Algorithms for Geometric Problems Piotr Indyk MIT.

Slides:



Advertisements
Similar presentations
Estimating Distinct Elements, Optimally
Advertisements

Rectangle-Efficient Aggregation in Spatial Data Streams Srikanta Tirthapura David Woodruff Iowa State IBM Almaden.
Optimal Approximations of the Frequency Moments of Data Streams Piotr Indyk David Woodruff.
1+eps-Approximate Sparse Recovery Eric Price MIT David Woodruff IBM Almaden.
The Data Stream Space Complexity of Cascaded Norms T.S. Jayram David Woodruff IBM Almaden.
The Complexity of Linear Dependence Problems in Vector Spaces David Woodruff IBM Almaden Joint work with Arnab Bhattacharyya, Piotr Indyk, and Ning Xie.
Optimal Space Lower Bounds for All Frequency Moments David Woodruff MIT
Numerical Linear Algebra in the Streaming Model Ken Clarkson - IBM David Woodruff - IBM.
Optimal Space Lower Bounds for all Frequency Moments David Woodruff Based on SODA 04 paper.
Sublinear-time Algorithms for Machine Learning Ken Clarkson Elad Hazan David Woodruff IBM Almaden Technion IBM Almaden.
Xiaoming Sun Tsinghua University David Woodruff MIT
Tight Lower Bounds for the Distinct Elements Problem David Woodruff MIT Joint work with Piotr Indyk.
Nearest Neighbor Search in High Dimensions Seminar in Algorithms and Geometry Mica Arie-Nachimson and Daniel Glasner April 2009.
Fast Algorithms For Hierarchical Range Histogram Constructions
Maintaining Variance and k-Medians over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O’Callaghan Stanford University.
Searching on Multi-Dimensional Data
1 CS 361 Lecture 5 Approximate Quantiles and Histograms 9 Oct 2002 Gurmeet Singh Manku
Navigating Nets: Simple algorithms for proximity search Robert Krauthgamer (IBM Almaden) Joint work with James R. Lee (UC Berkeley)
Uncertainty Principles, Extractors, and Explicit Embeddings of L 2 into L 1 Piotr Indyk MIT.
Yi Wu (CMU) Joint work with Parikshit Gopalan (MSR SVC) Ryan O’Donnell (CMU) David Zuckerman (UT Austin) Pseudorandom Generators for Halfspaces TexPoint.
Approximate Range Searching in the Absolute Error Model Guilherme D. da Fonseca CAPES BEX Advisor: David M. Mount.
Network Design Adam Meyerson Carnegie-Mellon University.
Approximate Nearest Neighbors and the Fast Johnson-Lindenstrauss Transform Nir Ailon, Bernard Chazelle (Princeton University)
Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,
Computing Sketches of Matrices Efficiently & (Privacy Preserving) Data Mining Petros Drineas Rensselaer Polytechnic Institute (joint.
Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.
CS 591 A11 Algorithms for Data Streams Dhiman Barman CS 591 A1 Algorithms for the New Age 2 nd Dec, 2002.
Embedding and Sketching Alexandr Andoni (MSR). Definition by example  Problem: Compute the diameter of a set S, of size n, living in d-dimensional ℓ.
Data Structures for Computer Graphics Point Based Representations and Data Structures Lectured by Vlastimil Havran.
Minimal Spanning Trees What is a minimal spanning tree (MST) and how to find one.
Approximation Algorithms for Stochastic Combinatorial Optimization Part I: Multistage problems Anupam Gupta Carnegie Mellon University.
1 Introduction to Approximation Algorithms. 2 NP-completeness Do your best then.
B-trees and kd-trees Piotr Indyk (slides partially by Lars Arge from Duke U)
Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.
1 Streaming Algorithms for Geometric Problems Piotr Indyk MIT.
A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.
PODC Distributed Computation of the Mode Fabian Kuhn Thomas Locher ETH Zurich, Switzerland Stefan Schmid TU Munich, Germany TexPoint fonts used in.
Randomized Composable Core-sets for Submodular Maximization Morteza Zadimoghaddam and Vahab Mirrokni Google Research New York.
Open Problem: Dynamic Planar Nearest Neighbors CSCE 620 Problem 63 from the Open Problems Project
Data reduction for weighted and outlier-resistant clustering
Geometric Problems in High Dimensions: Sketching Piotr Indyk.
Data in Motion Michael Hoffman (Leicester) S Muthukrishnan (Google) Rajeev Raman (Leicester)
Facility Location in Dynamic Geometric Data Streams Christiane Lammersen Christian Sohler.
Data Stream Algorithms Lower Bounds Graham Cormode
A light metric spanner Lee-Ad Gottlieb. Graph spanners A spanner for graph G is a subgraph H ◦ H contains vertices, subset of edges of G Some qualities.
Komplexitätstheorie und effiziente Algorithmen Christian Sohler, TU Dortmund Algorithms for geometric data streams.
1 Approximations and Streaming Algorithms for Geometric Problems Piotr Indyk MIT.
The Message Passing Communication Model David Woodruff IBM Almaden.
Sketching complexity of graph cuts Alexandr Andoni joint work with: Robi Krauthgamer, David Woodruff.
Sparse RecoveryAlgorithmResults  Original signal x = x k + u, where x k has k large coefficients and u is noise.  Acquire measurements Ax = y. If |x|=n,
Clustering Data Streams
Stream-based Geometric Algorithms
New Characterizations in Turnstile Streams with Applications
Approximating the MST Weight in Sublinear Time
Estimating L2 Norm MIT Piotr Indyk.
Sublinear Algorithmic Tools 2
Lecture 10: Sketching S3: Nearest Neighbor Search
Sketching and Embedding are Equivalent for Norms
Near(est) Neighbor in High Dimensions
Lecture 16: Earth-Mover Distance
CIS 700: “algorithms for Big Data”
Y. Kotidis, S. Muthukrishnan,
Near-Optimal (Euclidean) Metric Compression
Overview Massive data sets Streaming algorithms Regression
CSCI B609: “Foundations of Data Science”
Range-Efficient Computation of F0 over Massive Data Streams
Lecture 15: Least Square Regression Metric Embeddings
Clustering.
President’s Day Lecture: Advanced Nearest Neighbor Search
Routing in Networks with Low Doubling Dimension
Presentation transcript:

1 Streaming Algorithms for Geometric Problems Piotr Indyk MIT

2 Data Streams A data stream is a (massive) sequence of data Too large to store (on disk, memory, cache, etc.) Examples: Network traffic (source/destination) Sensor networks Satellite data feed, etc. Approaches: Ignore it Develop algorithms for dealing with such data

3 Talk Overview Computational model Example problems (Short) history of streaming algorithms Streaming algorithms for geometric problems Insertions only Insertions and deletions Open problems

4 Computational Model Single pass over the data: e 1, e 2, …,e n Bounded storage Fast processing time per element

5 Related Models External Memory: Bounded Storage Data Stored on Disk Random Access to Blocks of Data Compact Representations of Data and Communication Complexity Read-Once Branching Programs Memory Disk Alice: xBob: y F(x,y)=? e 1 =1 ? YN

6 Classic Examples Compute the number of distinct elements: Exactly:  (n) bits of space (1+  ) -approximation: O(1/  2 *log n) bits [Flajolet-Martin, JCSS’85],… Compute the median Exactly:  (n) (50%   ) -approximation: O(1/  *polylog n) [Paterson-Munro, TCS’80],…

7 Brief History of Streaming Algorithms Ancient times [MP’80,FM’85,Morris,..] Middle Ages Renaissance [Alon-Matias-Szegedy, STOC’96] Theory DB (Aqua project in Bell Labs) Networking … Streaming became mainstream

8 Theoretical History Vector problems: Stream defines an array of numbers Maintain stats of the array, e.g., median Metric problems Clustering Graph problems, Text problems Geometric Problems [this talk]

9 Geometric Data Stream Algorithms as Data Structures Data structures that support: Insert(p) to P Possibly: Delete(p) from P Compute(P) Use space that is sub-linear in |P|

10 Insertions-only

11 Metric clustering problems k-center [Charikar-Chekuri-Feder-Motwani, STOC’97] k-median [Guha-Mishra-Motwani-O’Callaghan, FOCS’00, Meyerson, FOCS’01, Charikar-O’Callaghan- Panigrahy, STOC’03] Bounds: Poly(K,log n) space O(1)-approximation

12 k-median/k-center k is given Goal: choose k medians/centers to minimize: k-median: the sum of the distances k-center: the max distance

13 Geometric Problems Diameter, Minimum Enclosing Ball [Agarwal-Har-Peled, SODA’01, Feigenbaum-Kannan-Zhang’02 (Algorithmica), Hershberger-Suri, PODS’04] K-center [AHP, SODA’01] K-median [Har-Peled-Mazumdar, STOC’04] Range searching via  -approximations: [Suri-Toth-Zhou, SoCG’04] [Bagchi-Chaudhary-Eppstein-Goodrich, SoCG’04]

14 Dominant Approach: Merge and Reduce Main ideas: Design an (off-line) algorithm that computes a “sketch” of the input Small size Sufficient to solve the problem A sketch of sketches is a sketch

15 Tree Computation p1p1 p2p2 p3p3 p4p4 p5p5 p7p7 p6p6 p8p8 p9p9 p 10 p 11 p 12 p 13 p 15 p 14 p 16

16 Algorithm Space: (sketch size)*log n Time: sketch computation time Question: Where do sketches come from ?

17 Idea I: solution=sketch Consider k-median [GMMO’00] : approximate k- median of approximate weighted k-medians is an approximate k-median Result: Constant depth tree Space: kn ,  >0 O(1) -approximation Works for any metric space       k=3

18 Use the solution, ctd.  -Approximations: find a subset S  P, such that for any rectangle/halfspace/etc R, |R  S|/|S| = |R  P|/|P|  [Matousek] : approximation of a union of approximations is an approximation [BCEG’04] : convert it into streaming algorithm, applications  1/  2 space [STZ’04] : better/optimal bounds for rectangles and halfspaces

19 Idea 2: Core-Sets [AHP’01] Assume we want to minimize C P (o) S  P is an  -core-set for P, if for any o, and a set T: C P  T (o) < (1+  ) C S  T (o) Note: this must hold for all o, not just the optimal one o

20 Example: Core-set for MEB Compute extremal points: Choose “densely” spaced direction v 1 …v k I.e., for any u there is v i such that u*v i ≥ ||u|| 2 / (1+  ) For each direction maintain extremal point k=O(1/  ) (d-1)/2 suffice

21 Stream Algorithms via Core-sets Diameter/MEB/width: O(1/  ) (d-1)/2 log n space [AHP ’ 01] k-center: O(k/  d ) log n [HP ’ 01] k-median: O(k/  d ) log n [HPM ’ 04] Faster algorithms and other results: [Chan, SoCG ’ 04], [Suri-Hershberger ’ 03]

22 Limitations Small core-sets might not exist (see next slide) Do not support deletions

23 Minimum Weight Bi-chromatic Matching Estimate the cost of MWBM

24 Insertions and Deletions

25 Streaming Algorithms for Vector Problems Norm estimation: Stream elements: (i,b), i=1…m Interpretation: x i =x i +b Want to maintain ||x|| p Why ? Examples: ||x|| p p =Σ i x i p = #non-zero elements in x, as p  0 …

26 Dimensionality reduction L 2 : Johnson-Lindenstrauss Lemma: x is an m-dimensional vector A is a random m times k matrix, each entry independently drawn from e.g. Gaussian distribution, k=O(log N/  2 ) Then with probability 1-1/N ||x|| 2 ≤||Ax|| 2 ≤(1+  )||x|| 2 A can be pseudo-random [AMS’96] * *Using slightly different method for norm estimation

27 What it means To know ||x|| 2, suffices to know Ax Can maintain Ax when the coordinates are incremented: A(x+ be i )=Ax+ bA e i Ax Can maintain approximate L 2 -norm of x Similar approach works for p  (0,2] [Indyk, FOCS’00] Ax

28 Histograms View x as a function x:[1…n]  [1…M] Approximate it using piecewise constant function h, with B pieces (buckets) Problem can be formulated in 2D as well (buckets become rectangular tiles)

29 Results: 1D [Gilbert-Guha-Indyk-Kotidis-Muthukrishnan-Strauss, STOC’02] : Maintains h with B pieces such that ||x-h|| 2 ≤ (1+  )||x-h OPT || 2 Under increments/decrements of x Space: poly(B,1/ ,log n) Time: poly(B,1/ ,log n)

30 Results: 2D [Thaper-Guha-Indyk-Koudas, SIGMOD’02] : Maintains h with B log (nM) tiles such that ||x-h|| 2 ≤ (1+  )||x-h OPT || 2 Under increments/decrements of x Space/Update time: poly(B,1/ ,log n) Histogram reconstruction time: poly(B,1/ , n) [Muthukrishnan-Strauss, FSTTCS’03] : Maintains h with 4B tiles Time: poly(B,1/ , log(nM))

31 General Approach Maintain sketches Ax of x This allows us to estimate the error of any given h, via ||x-h||  ||Ax-Ah|| Construct h: Enumeration Greedy Dynamic Programming

32 Minimum Weight Matching Estimate the cost of MWM

33 Minimum Spanning Tree Estimate the cost of MST

34 Facility Location Goal: choose a set F of facilities to minimize the sum of the distances to nearest facility plus the number of facilities times f Again, report the cost

35 Approach Assume P  {1…  } 2 Reduce to vector problems Impose square grids G 0 …G k, with side lengths 2 0,2 1, …, 2 k, shifted at random. For each square cell c in G i, let n P (c) be the number of points from P in c. The algorithms will maintain certain statistics over n P (.), which will allow it to approximately solve the problems

36 Estimators MST:∑ i 2 i ∑ c  G i [n P (c)>0] MWM:∑ i 2 i ∑ c  G i [n P (c) is odd] MWBM:∑ i 2 i ∑ c  G i |n G (c)-n B (c)| Fac. Loc.:∑ i 2 i ∑ c  G i min[n P (c), T i ] K-median:∑ i 2 i ∑ c  G i - B(Q, 2^i) n P (c) (const. factor) Maintain #non-zero entries in n P [FM’85] Maintain L 1 difference [I’00]

37 Results [Indyk’04] ProblemAppr. MST log  MWM log  MWBM * log  Fac.Loc. log 2  *follows from Charikar, STOC’02; also Agarwal-Varadarajan, SoCG’04 and Indyk-Thaper’02 Space: (log  +log n) O(1)

38 Results: K-median Space: (K+log +  log n) O(1) Computation TimeApproximation  O(k) poly(log n+1/  )1+   2 poly(log n+log +k) O(1) poly(log n+log  +k)[ 1+ , log n log  ]

39 Probabilistic embeddings into HST’s Known [Bartal, FOCS’96, Charikar-Chekuri-Goel-Guha-Plotkin,STOC’98]: ||p-q|| ≤ D tree (p,q) E[ D tree (p,q) ] ≤ ||p-q|| * O(log  ) T 5

40 MST E[Cost(MST in T)] ≤ O(log  ) Cost(MST) Cost(MST in T)  Cost(T) How to compute Cost(T) ? Sum over all levels i, of the #nodes at i, times 2 i Node c exists iff n i (c)>

41 Matching Algorithm: Match what you can at the current level Odd leftovers wait for the next level Repeat Optimal on the HST Cost=∑ i 2 i ∑ c  Gi [n P (c) is odd]

42 Conclusions Algorithms for geometric data streams Insertions-only: merge and reduce Insertions and deletions: randomized linear embeddings

43 Open Problems High dimensions: Diameter: 2 1/2 -approx, O(d 2 n 1/2 ) space, follows from [Goel-Indyk-Varadarajan, SODA’01] c-approx, O( dn 1/(c 2 - 1) ) [Indyk, SODA’03] Conjecture:  2 1/2 -approx, O(d polylog n) space Min-width cylinder: 18-approx, O(d) space [Chan’04] Other problems ?

44 Open Problems Range queries: General lower bounds ? (Not just for  - approximations)  (1/  2 ) -bit bound for general queries follows from LB for dot product [Indyk-Woodruff, FOCS’03], and is tight (for randomized algorithms) What about e.g., half-space queries ? O(1/  4/3 ) is known [STZ’04] Other problems [STZ’04]

45 Open Problems Matchings, Facility Location, etc: Replace log  by O(1) or even 1+  Possible for MST [Frahling-Indyk-Sohler’??] Related to computing bi-chromatic matching [Agarwal-Varadarajan’04] Min-sum clustering ?

46 Open Problems Better core-sets k-median: 1/  d  1/  (d-1)/2 ? Possible for d=1 [Indyk] k-center: 1/  d  1/  (d-1)/2 Possible for k=1 (this is minimum enclosing ball) Insertions and deletions ? k-median: poly(log n+log  +k+1/  ) space/time, (1+  ) –approximation ?

47 The End – Thank you !