Fast Moment Estimation in Data Streams in Optimal Space
Daniel Kane (Harvard), Jelani Nelson (MIT), Ely Porat (Bar-Ilan), David Woodruff (IBM)

ℓ_p-estimation: Problem Statement
Model
– x = (x_1, x_2, …, x_n) starts off as 0^n
– Stream of m updates (j_1, v_1), …, (j_m, v_m)
– Update (j, v) causes change x_j ← x_j + v, where v ∈ {-M, -M+1, …, M}
Problem
– Output ℓ_p = Σ_{j=1}^n |x_j|^p = |x|_p^p
– Want small space and fast update time
– For simplicity: n, m, M are polynomially related
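
To make the model concrete, here is a tiny Python baseline (mine, not from the talk) that stores x explicitly in Θ(n) space and computes Σ_j |x_j|^p exactly; everything that follows is about matching this output to within (1 ± ε) in roughly ε^{-2} log n bits.

def exact_lp(n, updates, p):
    x = [0] * n                       # x starts off as 0^n
    for j, v in updates:              # stream of updates (j, v)
        x[j] += v                     # x_j <- x_j + v
    return sum(abs(xj) ** p for xj in x)

# Example: updates (0, 3), (0, -1), (2, 5) give x = (2, 0, 5, 0)
print(exact_lp(4, [(0, 3), (0, -1), (2, 5)], p=1.5))   # |2|^1.5 + |5|^1.5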

Some Bad News
[Alon, Matias, Szegedy]: no sublinear-space algorithms unless we allow both
– Approximation (allow output to be (1 ± ε)·ℓ_p)
– Randomization (allow 1% failure probability)
New goal
– Output (1 ± ε)·ℓ_p with probability 99%

Some More Bad News
Estimating ℓ_p for p > 2 in a stream requires n^{1-2/p} space [AMS, IW, SS]
We focus on the feasible regime, when p ∈ (0, 2)
p = 0 and p = 2 are well understood
– p = 0 is the number of distinct elements
– p = 2 is the Euclidean norm

Applications for p ∈ [1, 2)
The ℓ_p norm for p ∈ [1, 2) is less sensitive to outliers
– Nearest neighbor
– Regression
– Subspace approximation
Nearest-neighbor example: query point a ∈ R^d, database points b_1, b_2, …, b_n; want argmin_j |a - b_j|_p
Less likely to be spoiled by noise in each coordinate
Can quickly replace d-dimensional points with small sketches
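
A toy illustration of the outlier point (mine): below, b2 agrees with the query a on every coordinate except one noisy entry, yet the ℓ_2 distance is dominated by that single entry while the ℓ_1 distance is not.

def lp_dist(a, b, p):
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

a  = [0.0, 0.0, 0.0, 0.0]
b1 = [1.0, 1.0, 1.0, 1.0]      # moderately far from a in every coordinate
b2 = [0.0, 0.0, 0.0, 3.5]      # equals a except one corrupted coordinate

for p in (1, 2):
    print(p, lp_dist(a, b1, p), lp_dist(a, b2, p))
# p = 1: distances 4.0 vs 3.5, so b2 (the "really close" point) wins
# p = 2: distances 2.0 vs 3.5, so the single outlier flips the answer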

Applications for p ∈ (0, 1)
Best entropy estimation in a stream [HNO]
– Empirical entropy = Σ_j q_j log(1/q_j), where q_j = |x_j|/|x|_1
– Estimates |x|_p for O(log 1/ε) different p ∈ (0, 1)
– Interpolates a polynomial through these values to estimate entropy
– Entropy used for detecting DoS attacks, etc.
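
Here is a small numerical sketch of why moments at several p determine the entropy (mine; it uses exact moments rather than streamed estimates, and numpy for the interpolation): with q_j = |x_j|/|x|_1 and f(p) = Σ_j q_j^p, we have f'(1) = Σ_j q_j ln q_j = -H, so a polynomial interpolated through a few values of f recovers H from its derivative.

import numpy as np

x = np.array([8.0, 4.0, 2.0, 1.0, 1.0])
q = np.abs(x) / np.abs(x).sum()

f = lambda p: np.sum(q ** p)       # in a stream this would come from |x|_p sketches
ps = np.linspace(0.5, 1.0, 6)      # a few sample points; [HNO] uses O(log 1/ε)
poly = np.polyfit(ps, [f(p) for p in ps], deg=len(ps) - 1)

H_est  = -np.polyval(np.polyder(poly), 1.0)   # -f'(1)
H_true = -np.sum(q * np.log(q))
print(H_est, H_true)               # agree closely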

Previous Work for p ∈ (0, 2)
Many players
– FKSV, I, KNW, GC, NW, AOK
Tradeoffs possible
– Can get optimal ε^{-2} log n bits of space, but then the update time is at least 1/ε^2
– BIG difference in practice between ε^{-2} update time and O(1) (e.g., AMS vs. TZ for p = 2)
– No way to get close to optimal space with less than poly(1/ε) update time

Our Results
For every p ∈ (0, 2)
– estimate ℓ_p with optimal ε^{-2} log n bits of space
– log^2(1/ε) · log log(1/ε) update time
– an exponential improvement over the previous update time
For entropy
– Exponential improvement over the previous update time (polylog(1/ε) versus poly(1/ε))

Our Algorithm
Split coordinates into head and tail
– j ∈ head if |x_j|^p ≥ ε^2 |x|_p^p
– j ∈ tail if |x_j|^p < ε^2 |x|_p^p
Estimate |x|_p^p = |x_head|_p^p + |x_tail|_p^p separately
Two completely different procedures
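
An offline version of the split, for intuition only (the streaming algorithm must find the head without ever storing x):

def split_head_tail(x, p, eps):
    F_p = sum(abs(xj) ** p for xj in x)            # |x|_p^p
    head = [j for j, xj in enumerate(x) if abs(xj) ** p >= eps ** 2 * F_p]
    tail = [j for j in range(len(x)) if j not in head]
    return head, tail

head, tail = split_head_tail([10, 1, -1, 2, 9], p=1.5, eps=0.5)
print(head, tail)    # [0, 4] [1, 2, 3]: only the large coordinates are in the head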

Outline
Estimating |x_head|_p^p
Estimating |x_tail|_p^p
Putting it all together

Simplifications
We can assume we know the set of head coordinates, as well as their signs
– Can be found using known algorithms [CountSketch]
Challenge
– Need Σ_{j ∈ head} |x_j|^p
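
For reference, a minimal CountSketch (the standard construction; this simplified version of mine uses Python's built-in hash where the paper needs limited-independence hash families):

import random, statistics

class CountSketch:
    def __init__(self, d, w, seed=0):
        self.d, self.w = d, w
        self.C = [[0] * w for _ in range(d)]
        self.salt = random.Random(seed).getrandbits(64)

    def _h(self, r, j):        # bucket of coordinate j in row r
        return hash((self.salt, r, j, 0)) % self.w

    def _s(self, r, j):        # random sign of coordinate j in row r
        return 1 if hash((self.salt, r, j, 1)) % 2 else -1

    def update(self, j, v):    # process stream update (j, v)
        for r in range(self.d):
            self.C[r][self._h(r, j)] += self._s(r, j) * v

    def estimate(self, j):     # estimate x_j as a median over rows
        return statistics.median(self._s(r, j) * self.C[r][self._h(r, j)]
                                 for r in range(self.d))

Coordinates whose estimated |x_j|^p clears the ε^2·|x|_p^p threshold are declared head, and the sign of the estimate gives sign(x_j).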

Estimating |x_head|_p^p
(Figure: a table with log(1/ε) rows and 1/ε^2 columns; coordinate x_j is hashed into it.)
Hash each coordinate to one column in each row
We DO NOT maintain the sum of values in each cell
We DO NOT maintain the inner product of values in a cell with a random sign vector
Key idea: for each cell c, if S is the set of items hashed to c, let
V(c) = Σ_{j ∈ S} x_j · exp(2πi·h(j)/r)
where r is a parameter and i = sqrt(-1)
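
A sketch (mine) of this data structure; the hash functions are placeholders for the limited-independence families the paper actually uses:

import cmath, random

class HeadSketch:
    def __init__(self, R, W, r, seed=0):   # R ~ log(1/ε) rows, W ~ 1/ε^2 columns
        self.R, self.W, self.r = R, W, r
        self.V = [[0j] * W for _ in range(R)]   # complex-valued cells
        self.salt = random.Random(seed).getrandbits(64)

    def col(self, row, j):      # column that coordinate j hashes to in this row
        return hash((self.salt, 0, row, j)) % self.W

    def root(self, j):          # exp(2πi·h(j)/r) for h(j) uniform in {0,…,r-1}
        return cmath.exp(2j * cmath.pi * (hash((self.salt, 1, j)) % self.r) / self.r)

    def update(self, j, v):     # add v·exp(2πi·h(j)/r) to one cell per row
        for row in range(self.R):
            self.V[row][self.col(row, j)] += v * self.root(j)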

Our Algorithm
To estimate |x_head|_p^p
– For each j in the head, find an arbitrary cell c(j) containing j and no other head coordinates
– Compute y_j = sign(x_j) · exp(-2πi·h(j)/r) · V(c(j))
(Recall V(c) = Σ_{j ∈ S} x_j · exp(2πi·h(j)/r))
– Expected value of y_j is |x_j|
– What can we say about y_j^p?
– What does it mean?

Our Algorithm
Recall y_j = sign(x_j) · exp(-2πi·h(j)/r) · V(c)
What is y_j^{1/2} if y_j = -4?
– -4 = 4·exp(πi)
– (-4)^{1/2} = 2·exp(πi/2) = 2i, or 2·exp(-πi/2) = -2i
By y_j^p we mean |y_j|^p · exp(i·p·arg(y_j)), where arg(y_j) ∈ (-π, π] is the angle of y_j in the complex plane
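
In code, this principal-branch power is one line (cmath.phase returns the angle of y in [-π, π], matching the slide's convention up to the boundary):

import cmath

def cpow(y, p):     # y^p := |y|^p · exp(i·p·arg(y))
    return abs(y) ** p * cmath.exp(1j * p * cmath.phase(y))

print(cpow(-4, 0.5))    # 2i: the principal square root of -4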

Our Algorithm
Wishful thinking: Estimator = Σ_{j ∈ head} y_j^p
Intuitively, when p = 1, since E[y_j] = |x_j| we have an unbiased estimator
For general p, this may be complex, so how about Estimator = Re[Σ_{j ∈ head} y_j^p]?
Almost correct, but we want optimal space, and we're ignoring most of the cells
Better: y_j = Mean over cells c isolating j of sign(x_j) · exp(-2πi·h(j)/r) · V(c)
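
Putting the pieces together, a head-estimator sketch (mine), built on the HeadSketch and cpow defined above and assuming the head set and signs were already recovered (e.g., via CountSketch) and that each head coordinate is isolated in at least one row:

def estimate_head(sk, head, signs, p):
    total = 0.0
    for j in head:
        readings = []
        for row in range(sk.R):
            c = sk.col(row, j)
            # use this cell only if no other head coordinate lands in it
            if all(sk.col(row, k) != c for k in head if k != j):
                readings.append(signs[j] * sk.root(j).conjugate() * sk.V[row][c])
        y_j = sum(readings) / len(readings)   # mean over cells isolating j
        total += cpow(y_j, p).real            # Re[Σ_j y_j^p]
    return total

Note that exp(-2πi·h(j)/r) is just the conjugate of root(j), since the root has modulus 1.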

Analysis
Why did we use roots of unity?
Estimator is the real part of Σ_{j ∈ head} y_j^p
Σ_{j ∈ head} y_j^p = Σ_{j ∈ head} |x_j|^p · (1 + z_j)^p for z_j = (y_j - |x_j|)/|x_j|
Can apply the generalized binomial theorem:
E[|x_j|^p (1 + z_j)^p] = |x_j|^p · Σ_{k=0}^∞ {p choose k} E[z_j^k] = |x_j|^p + small, since E[z_j^k] = 0 if 0 < k < r
Generalized binomial coefficient: {p choose k} = p·(p-1)⋯(p-k+1)/k! = O(1/k^{1+p})
Intuitively the variance is small because head coordinates don't collide
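
Why E[z_j^k] = 0 for 0 < k < r (my expansion of the slide's claim, ignoring conditioning details): z_j is a linear combination of independent random roots of unity ω^{h(j')} with ω = e^{2πi/r}, and

\mathbb{E}\left[\omega^{m\,h(j')}\right] \;=\; \frac{1}{r}\sum_{t=0}^{r-1}\omega^{m t} \;=\; 0
\qquad\text{whenever } m \not\equiv 0 \pmod{r}.

Expanding z_j^k for k < r produces monomials in which every h(j') appears with exponent m satisfying 0 < m ≤ k < r, so each term's expectation vanishes.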

Outline
Estimating |x_head|_p^p
Estimating |x_tail|_p^p
Putting it all together

Our Algorithm: Estimating |x_tail|_p^p
(Figure: tail coordinates x_j hashed into buckets; x(b) denotes the restriction of x to bucket b.)
In each bucket b, maintain an unbiased estimator of the p-th power of the p-norm in the bucket, |x(b)|_p^p [Li]
– If Z_1, …, Z_s are p-stable, then for any vector a = (a_1, …, a_s), Σ_{j=1}^s Z_j·a_j ∼ |a|_p·Z, for Z also p-stable
Add up the estimators in all buckets not containing a head coordinate (variance is small)
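
A sketch (mine) of the p-stable machinery: the Chambers–Mallows–Stuck sampler for symmetric p-stable variables, plus an empirical check of the stability property quoted above. (Li's actual unbiased estimator for |x(b)|_p^p built from these variables is omitted here.)

import math, random

def p_stable(p, rng):
    # Chambers–Mallows–Stuck: symmetric p-stable, for 0 < p < 2, p != 1
    theta = rng.uniform(-math.pi / 2, math.pi / 2)
    W = rng.expovariate(1.0)
    return (math.sin(p * theta) / math.cos(theta) ** (1 / p)
            * (math.cos((1 - p) * theta) / W) ** ((1 - p) / p))

# Check Σ_j Z_j·a_j ∼ |a|_p·Z by comparing sample medians of absolute values:
rng = random.Random(1)
p, a = 0.5, [3.0, 1.0, 2.0]
norm_a = sum(abs(ai) ** p for ai in a) ** (1 / p)

lhs = sorted(abs(sum(p_stable(p, rng) * ai for ai in a)) for _ in range(20000))
rhs = sorted(norm_a * abs(p_stable(p, rng)) for _ in range(20000))
print(lhs[10000], rhs[10000])    # the two medians agree up to sampling error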

Outline
Estimating |x_head|_p^p
Estimating |x_tail|_p^p
Putting it all together

Complexity
Bag of tricks. Example:
For optimal space, in buckets in the light estimator, we prove (1/ε^p)-wise independent p-stable variables suffice
– Rewrite Li's estimator so that [KNW] can be applied
Need to evaluate a degree-(1/ε^p) polynomial per update
Instead: batch 1/ε^p updates together and do fast multipoint evaluation
– Can be deamortized
– Use that different buckets are pairwise independent
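
A minimal multipoint-evaluation sketch (mine): reduce the polynomial down a subproduct tree so the work of one evaluation is shared by the whole batch. Schoolbook polynomial multiplication is used below, so this shows the structure only; the asymptotic speedup needs FFT-based arithmetic. The modulus q is a hypothetical choice.

q = 2 ** 61 - 1    # a prime field, as for the k-wise independent hashes [KNW]

def poly_mul(f, g):        # coefficient lists, lowest degree first
    out = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for k, gk in enumerate(g):
            out[i + k] = (out[i + k] + fi * gk) % q
    return out

def poly_mod(f, m):        # remainder of f modulo a monic polynomial m
    f = f[:]
    while len(f) >= len(m):
        c = f[-1]
        for i in range(len(m)):
            f[len(f) - len(m) + i] = (f[len(f) - len(m) + i] - c * m[i]) % q
        f.pop()
    return f or [0]

def subproduct(points):    # the polynomial Π (x - t) over the given points
    m = [1]
    for t in points:
        m = poly_mul(m, [(-t) % q, 1])
    return m

def eval_tree(f, points):  # evaluate f at all points via recursive reduction
    if len(points) == 1:
        return [poly_mod(f, [(-points[0]) % q, 1])[0]]
    mid = len(points) // 2
    left, right = points[:mid], points[mid:]
    return (eval_tree(poly_mod(f, subproduct(left)), left)
            + eval_tree(poly_mod(f, subproduct(right)), right))

print(eval_tree([3, 2, 1], [0, 1, 2, 3]))   # 3 + 2x + x^2 at 0..3 -> [3, 6, 11, 18]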

Complexity
Example #2
Finding head coordinates requires ε^{-2} log^2 n space
– Reduce the universe size to poly(1/ε) by hashing
– Now requires ε^{-2} log n log(1/ε) space
– Replace ε with ε·log^{1/2}(1/ε)
– Head estimator is okay, but slightly adjust the light estimator

Conclusion
For every p ∈ (0, 2)
– estimate ℓ_p with optimal ε^{-2} log n bits of space
– log^2(1/ε) · log log(1/ε) update time
– an exponential improvement over the previous update time
For entropy
– Exponential improvement over the previous update time (polylog(1/ε) versus poly(1/ε))