Rectangle-Efficient Aggregation in Spatial Data Streams. Srikanta Tirthapura (Iowa State) and David Woodruff (IBM Almaden)

The Data Stream Model
– Stream S of additive updates (i, Δ) to an underlying vector v: v_i <- v_i + Δ
– Dimension of v is n; v is initialized to 0^n
– Number of updates is m
– Δ is an integer in {-M, -M+1, …, M}
– Assume M, m ≤ poly(n)
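To make the model concrete, here is a minimal sketch (our own illustration, not from the talk; the name apply_update is ours) that maintains the vector v explicitly under a stream of point updates:

```python
# Minimal illustration of the stream model: maintain v explicitly.
# (A real streaming algorithm never stores v in full.)
n = 8
v = [0] * n  # v initialized to 0^n

def apply_update(v, i, delta):
    """Apply one additive update (i, delta): v[i] <- v[i] + delta."""
    v[i] += delta

# A small stream S of m = 3 updates, with each delta in {-M, ..., M}.
for (i, delta) in [(3, 5), (1, -2), (3, 1)]:
    apply_update(v, i, delta)

print(v)  # [0, -2, 0, 6, 0, 0, 0, 0]
```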

Applications
– Coordinates of v are associated with items; v_i is the number of times (frequency) item i occurs in S
– Number of non-zero entries of v = number of distinct items in S, denoted |v|_0
– |v|_1 = Σ_i |v_i| is the sum of frequencies in S
– |v|_2^2 = Σ_i v_i^2 is the self-join size
– |v|_p = (Σ_i v_i^p)^(1/p) is the p-norm of v
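For reference, the statistics above computed exactly on a small explicit vector (in the streaming setting they can only be approximated, since v cannot be stored):

```python
v = [3, 0, -2, 5, 0, 1]
p = 3

F0 = sum(1 for x in v if x != 0)               # |v|_0: number of distinct items
F1 = sum(abs(x) for x in v)                    # |v|_1: sum of frequencies
F2 = sum(x * x for x in v)                     # |v|_2^2: self-join size
Lp = sum(abs(x) ** p for x in v) ** (1.0 / p)  # |v|_p: p-norm

print(F0, F1, F2, Lp)  # 4 11 39 5.44 (approx.)
```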

Lots of Known Results
– (ε, δ)-approximation: output an estimator E with Pr[|v|_p ≤ E ≤ (1+ε)|v|_p] ≥ 1 - δ
– Let O~(1) denote poly(1/ε, log n/δ)
– Optimal estimators for |v|_p:
  – 0 ≤ p ≤ 2: O~(1) memory
  – p > 2: O~(n^(1-2/p)) memory
  – Both results have O~(1) update time

Range Updates
– Sometimes more efficient to represent additive updates in the form ([i, j], Δ): ∀ k ∈ {i, i+1, …, j}: v_k <- v_k + Δ
– Useful for representing updates to time intervals
– Many reductions: triangle counting, distinct summation, max-dominance norm

A Trivial Solution?
– Given update ([i, j], Δ), treat it as j-i+1 point updates (i, Δ), (i+1, Δ), (i+2, Δ), …, (j, Δ)
– Run a memory-optimal algorithm on the resulting j-i+1 updates
– Main problem: update time could be as large as n
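The trivial reduction, written out only to pin down the baseline (its update cost is proportional to j - i + 1, which can be as large as n):

```python
def expand_range_update(i, j, delta):
    """Expand a range update ([i, j], delta) into j - i + 1 point updates."""
    return [(k, delta) for k in range(i, j + 1)]

print(expand_range_update(2, 5, 3))  # [(2, 3), (3, 3), (4, 3), (5, 3)]
```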

Scattered Known Results
– An algorithm is range-efficient if its update time is O~(1) and its memory is optimal up to O~(1) factors
– Range-efficient algorithms exist for:
  – positive-only updates to |v|_0
  – |v|_2
– For |v|_p, p > 2, can get O~(n^(1-1/p)) time and memory
– Many questions open (we will resolve some)

Talk Outline
– General framework: rectangle updates
– Problems:
  – frequent points (heavy hitters)
  – norm estimation
– Results
– Techniques
– Conclusion

From Ranges to Rectangles
– Vector v has coordinates indexed by pairs (i, j) ∈ {1, 2, …, n} × {1, 2, …, n}
– Rectangle updates ([i_1, j_1] × [i_2, j_2], Δ): ∀ (i, j) ∈ [i_1, j_1] × [i_2, j_2]: v_{i,j} <- v_{i,j} + Δ
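A rectangle update spelled out on an explicit n × n array (purely illustrative; the point of the paper is to avoid touching every covered cell):

```python
n = 4
v = [[0] * n for _ in range(n)]  # v indexed by pairs (i, j)

def rectangle_update(v, i1, j1, i2, j2, delta):
    """For all (i, j) in [i1, j1] x [i2, j2]: v[i][j] <- v[i][j] + delta."""
    for i in range(i1, j1 + 1):
        for j in range(i2, j2 + 1):
            v[i][j] += delta

rectangle_update(v, 0, 2, 1, 3, 1)  # touches (j1-i1+1)*(j2-i2+1) cells
```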

Spatial Datasets
– A natural extension of 1-dimensional ranges is to axis-aligned rectangles
– Can approximate polygons by rectangles
– Spatial databases such as OpenGIS
– No previous work on streams of rectangles
– What statistics do we care about? …

Klee's Measure Problem
– What is the volume of the union of three rectangles? [0, 2] × [0, 4], [1, 3] × [1, 3], [2, 4] × [0, 2]
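A brute-force check of this example (our own illustration): since the rectangles have integer coordinates, the volume of the union equals the number of covered unit cells, i.e. |v|_0 after the three rectangle updates with Δ = 1.

```python
# Rectangles given as ((x1, x2), (y1, y2)); cell (x, y) covers [x, x+1) x [y, y+1).
rects = [((0, 2), (0, 4)), ((1, 3), (1, 3)), ((2, 4), (0, 2))]

covered = set()
for (x1, x2), (y1, y2) in rects:
    for x in range(x1, x2):
        for y in range(y1, y2):
            covered.add((x, y))

print(len(covered))  # 13: the volume of the union
```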

Other Important Statistics
– max_{i,j} v_{i,j} is the depth
– Can approximate the depth by approximating |v|_p for large p > 0 (sometimes easier than computing the depth)
– Also of interest: find heavy hitters, namely those (i, j) with v_{i,j} ≥ ε |v|_1 or v_{i,j}^2 ≥ ε |v|_2^2

Notation
– An algorithm is rectangle-efficient if its update time is O~(1) and its memory is optimal up to O~(1) factors
– For (ε, δ)-approximating |v|_p we want:
  – for 0 ≤ p ≤ 2: O~(1) time and memory
  – for p > 2: O~(1) time and O~((n^2)^(1-2/p)) memory

Our Results
– For |v|_p, 0 ≤ p ≤ 2, we obtain the first rectangle-efficient algorithms
  – First solution to Klee's Measure Problem on streams
– For finding heavy hitters, we obtain the first rectangle-efficient algorithms
– For p > 2 we achieve:
  – O~((n^2)^(1-2/p)) memory and time, OR
  – O~((n^2)^(1-1/p)) memory and O~(1) time

Our Results
– For any number d of dimensions:
  – For |v|_p, 0 ≤ p ≤ 2, and heavy hitters: O~(1) memory and O~(d) time
  – For |v|_p, p > 2: O~((n^d)^(1-2/p)) memory and time, OR O~((n^d)^(1-1/p)) memory and O~(d) time
– Only a mild dependence on d
– Improves previous results even for d = 1

Our Techniques
– Main idea:
  – Leverage a technique in streaming algorithms for estimating |v|_p for any p ≥ 0
  – Replace the random hash functions in that technique with hash functions having very special properties

Indyk/Woodruff Methodology
– To estimate |v|_p of a vector v of dimension n:
– Choose O(log n) random subsets of the coordinates of v, denoted S_0, S_1, …, S_{log n}
  – S_i is of size n/2^i
– For each S_i: find those coordinates j ∈ S_i for which v_j^2 ≥ γ · |v_{S_i}|_2^2
  – Use CountSketch:
    – Assign each coordinate j a random sign σ(j) ∈ {-1, 1}
    – Randomly hash the coordinates into 1/γ buckets; maintain Σ_{j : h(j) = k, j ∈ S_i} σ(j) · v_j in the k-th bucket
– Our observation: the S_i and the hash functions in CountSketch can be chosen to be pairwise-independent
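A minimal one-row CountSketch, as described above (for readability the hash and sign functions below are fully random arrays; the observation in the talk is that pairwise-independent choices suffice, which is what later makes rectangle updates tractable):

```python
import random

random.seed(0)
n, num_buckets = 16, 4                                  # 1/gamma buckets
h = [random.randrange(num_buckets) for _ in range(n)]   # bucket of coordinate j
sigma = [random.choice([-1, 1]) for _ in range(n)]      # random sign of coordinate j
buckets = [0] * num_buckets

def update(j, delta):
    """Process update (j, delta): bucket h(j) accumulates sigma(j) * v_j."""
    buckets[h[j]] += sigma[j] * delta

def estimate(j):
    """Estimate v_j; accurate for coordinates that dominate their bucket."""
    return sigma[j] * buckets[h[j]]

for (j, delta) in [(3, 10), (7, -4), (3, 2)]:
    update(j, delta)

print(estimate(3))  # close to 12 unless another heavy coordinate collides
```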

Special Hash Functions
– Let A be a random k × r binary matrix, and b a random binary vector of length k
– Let x ∈ GF(2)^r
– x ↦ Ax + b is a pairwise-independent function:
  – For x ≠ x' ∈ GF(2)^r and y, y' ∈ GF(2)^k: Pr[Ax + b = y and Ax' + b = y'] = 1/2^(2k)
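A small sketch of this hash family, with binary vectors packed into Python ints (bit i of x is coordinate i); the packing and the names are ours:

```python
import random

random.seed(0)
r, k = 8, 3
A = [random.getrandbits(r) for _ in range(k)]  # row i of the k x r binary matrix A
b = random.getrandbits(k)                      # random binary vector of length k

def parity(x):
    return bin(x).count("1") & 1

def h(x):
    """h(x) = Ax + b over GF(2): bit i of the output is <row_i(A), x> xor b_i."""
    y = 0
    for i in range(k):
        y |= parity(A[i] & x) << i
    return y ^ b

print([h(x) for x in range(8)])  # hash values of the first few x
```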

Special Hash Functions (cont'd)
– Given a stream update ([i_1, j_1] × [i_2, j_2], Δ), we can decompose [i_1, j_1] and [i_2, j_2] into O(log n) disjoint dyadic intervals:
  – [i_1, j_1] = [a_1, b_1] ∪ … ∪ [a_s, b_s]
  – [i_2, j_2] = [c_1, d_1] ∪ … ∪ [c_s, d_s]
– Dyadic interval: [u·2^q, (u+1)·2^q) for integers u, q
– Then [i_1, j_1] × [i_2, j_2] = ∪_{r, r'} [a_r, b_r] × [c_{r'}, d_{r'}]
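One standard way to carry out this decomposition for a single interval (a sketch under our own conventions: intervals are returned half-open as [a, b)):

```python
def dyadic_decompose(i, j):
    """Write {i, ..., j} as a disjoint union of dyadic intervals [u*2^q, (u+1)*2^q)."""
    out, lo, hi = [], i, j + 1        # work with the half-open range [lo, hi)
    while lo < hi:
        q = 1                         # length of the next dyadic block, a power of 2
        while lo % (2 * q) == 0 and lo + 2 * q <= hi:
            q *= 2
        out.append((lo, lo + q))
        lo += q
    return out

print(dyadic_decompose(3, 12))  # [(3, 4), (4, 8), (8, 12), (12, 13)]
```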

Special Hash Functions (cont'd)
– A property of the function Ax + b: we can quickly compute the number of x in the rectangle [a_r, b_r] × [c_{r'}, d_{r'}] for which Ax + b = e
– By the structure of dyadic intervals, this corresponds to fixing a subset of the bits of x and letting the remaining variables be free: the number of x ∈ [a_r, b_r] × [c_{r'}, d_{r'}] with Ax + b = e equals the number of z with A'z = e', where A' is A restricted to the free bits, e' folds the fixed bits into the right-hand side, and z is unconstrained
– Can now use Gaussian elimination to count the number of z
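A sketch of this counting step under the conventions of the earlier block (rows of A packed into ints; `fixed` maps a bit position of x to the value forced by the dyadic box; all names are ours). The restricted system is solved by Gaussian elimination over GF(2), and the count is 2^(#free bits - rank) when the system is consistent:

```python
def count_solutions(A, b, e, r, fixed):
    """Count x in GF(2)^r with the bits in `fixed` preset and Ax + b = e."""
    target = b ^ e                               # Ax + b = e  <=>  Ax = b xor e
    free = [pos for pos in range(r) if pos not in fixed]
    rows = []
    for i in range(len(A)):
        rhs = (target >> i) & 1
        for pos, val in fixed.items():           # fold fixed bits into the RHS
            rhs ^= ((A[i] >> pos) & 1) & val
        coeffs = sum((((A[i] >> pos) & 1) << c) for c, pos in enumerate(free))
        rows.append([coeffs, rhs])
    rank = 0                                     # Gaussian elimination over GF(2)
    for col in range(len(free)):
        piv = next((t for t in range(rank, len(rows)) if (rows[t][0] >> col) & 1), None)
        if piv is None:
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        for t in range(len(rows)):
            if t != rank and (rows[t][0] >> col) & 1:
                rows[t] = [rows[t][0] ^ rows[rank][0], rows[t][1] ^ rows[rank][1]]
        rank += 1
    if any(c == 0 and rhs == 1 for c, rhs in rows):
        return 0                                 # inconsistent system: no solutions
    return 2 ** (len(free) - rank)               # each non-pivot free bit doubles the count

# Example: a 2 x 4 system over GF(2), with bit 3 of x fixed to 1.
A = [0b1011, 0b0110]
print(count_solutions(A, 0b00, 0b01, 4, {3: 1}))  # 2
```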

Techniques Wrap-Up
– Step 1: Modify the Indyk/Woodruff analysis to use only pairwise independence
– Step 2: Given an update ([i_1, j_1] × [i_2, j_2], Δ), decompose it into disjoint products of dyadic intervals ∪_{r, r'} [a_r, b_r] × [c_{r'}, d_{r'}]
– Step 3: For each [a_r, b_r] × [c_{r'}, d_{r'}], find the number of items in each S_i, each bucket of CountSketch, and with each sign by solving a system of linear equations
– Step 4: Update all buckets

Conclusion
– Contributions:
  – Initiated the study of rectangle-efficiency
  – Gave rectangle-efficient algorithms for estimating |v|_p, 0 ≤ p ≤ 2, and for heavy hitters
  – Tradeoffs for estimating |v|_p for p > 2
  – Improve previous work even for d = 1
– Open questions:
  – Get O~((n^d)^(1-2/p)) memory and O~(1) time for p > 2
  – Other applications of the model and technique