Beating CountSketch for Heavy Hitters in Insertion Streams Vladimir Braverman (JHU) Stephen R. Chestnut (ETH) Nikita Ivkin (JHU) David P. Woodruff (IBM)

Slides:

Advertisements

Similar presentations

Estimating Distinct Elements, Optimally

Advertisements

Rectangle-Efficient Aggregation in Spatial Data Streams Srikanta Tirthapura David Woodruff Iowa State IBM Almaden.

Fast Moment Estimation in Data Streams in Optimal Space Daniel Kane, Jelani Nelson, Ely Porat, David Woodruff Harvard MIT Bar-Ilan IBM.

Optimal Approximations of the Frequency Moments of Data Streams Piotr Indyk David Woodruff.

1+eps-Approximate Sparse Recovery Eric Price MIT David Woodruff IBM Almaden.

The Data Stream Space Complexity of Cascaded Norms T.S. Jayram David Woodruff IBM Almaden.

Tight Bounds for Distributed Functional Monitoring David Woodruff IBM Almaden Qin Zhang Aarhus University MADALGO Based on a paper in STOC, 2012.

The Complexity of Linear Dependence Problems in Vector Spaces David Woodruff IBM Almaden Joint work with Arnab Bhattacharyya, Piotr Indyk, and Ning Xie.

Optimal Space Lower Bounds for All Frequency Moments David Woodruff MIT

Numerical Linear Algebra in the Streaming Model Ken Clarkson - IBM David Woodruff - IBM.

Optimal Space Lower Bounds for all Frequency Moments David Woodruff Based on SODA 04 paper.

The Average Case Complexity of Counting Distinct Elements David Woodruff IBM Almaden.

An Optimal Algorithm for the Distinct Elements Problem

Optimal Bounds for Johnson- Lindenstrauss Transforms and Streaming Problems with Sub- Constant Error T.S. Jayram David Woodruff IBM Almaden.

Sublinear-time Algorithms for Machine Learning Ken Clarkson Elad Hazan David Woodruff IBM Almaden Technion IBM Almaden.

Xiaoming Sun Tsinghua University David Woodruff MIT

Subspace Embeddings for the L1 norm with Applications Christian Sohler David Woodruff TU Dortmund IBM Almaden.

Fast Johnson-Lindenstrauss Transform(s) Nir Ailon Edo Liberty, Bernard Chazelle Bertinoro Workshop on Sublinear Algorithms May 2011.

An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode, S. Muthukrishnan 2003.

Finding Frequent Items in Data Streams Moses CharikarPrinceton Un., Google Inc. Kevin ChenUC Berkeley, Google Inc. Martin Franch-ColtonRutgers Un., Google.

Analysis of Algorithms

3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.

ABSTRACT We consider the problem of computing information theoretic functions such as entropy on a data stream, using sublinear space. Our first result.

Sketching for M-Estimators: A Unified Approach to Robust Regression

Turnstile Streaming Algorithms Might as Well Be Linear Sketches Yi Li Huy L. Nguyen David Woodruff.

Algorithms for data streams Foundations of Data Science 2014 Indian Institute of Science Navin Goyal.

Yi Wu (CMU) Joint work with Parikshit Gopalan (MSR SVC) Ryan O’Donnell (CMU) David Zuckerman (UT Austin) Pseudorandom Generators for Halfspaces TexPoint.

1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006

Hash Tables With Finite Buckets Are Less Resistant to Deletions Yossi Kanizo (Technion, Israel) Joint work with David Hay (Columbia U. and Hebrew U.) and.

Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,

Data Streams and Applications in Computer Science David Woodruff IBM Almaden Presburger lecture, ICALP, 2014.

Sketching for M-Estimators: A Unified Approach to Robust Regression Kenneth Clarkson David Woodruff IBM Almaden.

How Robust are Linear Sketches to Adaptive Inputs? Moritz Hardt, David P. Woodruff IBM Research Almaden.

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.

Tight Bounds for Graph Problems in Insertion Streams Xiaoming Sun and David P. Woodruff Chinese Academy of Sciences and IBM Research-Almaden.

Finding Frequent Items in Data Streams [Charikar-Chen-Farach-Colton] Paper report By MH, 2004/12/17.

Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.

A PTAS for Computing the Supremum of Gaussian Processes Raghu Meka (IAS/DIMACS)

PODC Distributed Computation of the Mode Fabian Kuhn Thomas Locher ETH Zurich, Switzerland Stefan Schmid TU Munich, Germany TexPoint fonts used in.

Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.

Data in Motion Michael Hoffman (Leicester) S Muthukrishnan (Google) Rajeev Raman (Leicester)

Calculating frequency moments of Data Stream

The bin packing problem. For n objects with sizes s 1, …, s n where 0 < s i ≤1, find the smallest number of bins with capacity one, such that n objects.

Low Rank Approximation and Regression in Input Sparsity Time David Woodruff IBM Almaden Joint work with Ken Clarkson (IBM Almaden)

The Message Passing Communication Model David Woodruff IBM Almaden.

Big Data Lecture 5: Estimating the second moment, dimension reduction, applications.

Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo

A Story of Principal Component Analysis in the Distributed Model David Woodruff IBM Almaden Based on works with Christos Boutsidis, Ken Clarkson, Ravi.

Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo

New Algorithms for Heavy Hitters in Data Streams David Woodruff IBM Almaden Joint works with Arnab Bhattacharyya, Vladimir Braverman, Stephen R. Chestnut,

An Optimal Algorithm for Finding Heavy Hitters

Algorithms for Big Data: Streaming and Sublinear Time Algorithms

Stochastic Streams: Sample Complexity vs. Space Complexity

Tell Me Who I Am: An Interactive Recommendation System

New Characterizations in Turnstile Streams with Applications

Finding Frequent Items in Data Streams

Estimating L2 Norm MIT Piotr Indyk.

Streaming & sampling.

Sublinear Algorithmic Tools 2

Lecture 10: Sketching S3: Nearest Neighbor Search

Lecture 4: CountSketch High Frequencies

Lecture 7: Dynamic sampling Dimension Reduction

Turnstile Streaming Algorithms Might as Well Be Linear Sketches

Y. Kotidis, S. Muthukrishnan,

CSCI B609: “Foundations of Data Science”

Streaming Symmetric Norms via Measure Concentration

Lecture 6: Counting triangles Dynamic graphs & sampling

Joint work with Morteza Monemizadeh

Sublinear Algorihms for Big Data

(Learned) Frequency Estimation Algorithms

Presentation transcript:

Beating CountSketch for Heavy Hitters in Insertion Streams Vladimir Braverman (JHU) Stephen R. Chestnut (ETH) Nikita Ivkin (JHU) David P. Woodruff (IBM)

Streaming Model Stream of elements a 1, …, a m in [n] = {1, …, n}. Assume m = poly(n) One pass over the data Minimize space complexity (in bits) for solving a task Let f j be the number of occurrences of item j Heavy Hitters Problem: find those j for which f j is large …

Guarantees f 1, f 2 f 3 f 4 f 5 f 6

CountSketch achieves the l 2 –guarantee [CCFC] Assign each coordinate i a random sign ¾ (i) 2 {-1,1} Randomly partition coordinates into B buckets, maintain c j = Σ i: h(i) = j ¾ (i) ¢ f i in j- th bucket.Σ i: h(i) = 2 ¾(i)¢f i.. f1f1 f2f2 f3f3 f4f4 f5f5 f6f6 f7f7 f8f8 f9f9 f 10 Estimate f i as ¾(i) ¢ c h(i)

Known Space Bounds for l 2 – heavy hitters CountSketch achieves O(log 2 n) bits of space If the stream is allowed to have deletions, this is optimal [DPIW] What about insertion-only streams? This is the model originally introduced by Alon, Matias, and Szegedy Models internet search logs, network traffic, databases, scientific data, etc. The only known lower bound is Ω(log n) bits, just to report the identity of the heavy hitter

Our Results

Simplifications

Intuition

Only gives 1 bit of information. Can’t repeat log n times in parallel, but can repeat log n times sequentially!

Repeating Sequentially

Gaussian Processes

Chaining Inequality [Fernique, Talagrand]

… atat a5a5 a4a4 a3a3 a2a2 a1a1 amam … Apply the chaining inequality!

Applying the Chaining Inequality Same behavior as for random walks!

Removing Frequency Assumptions

Amplification Create O(log log n) pairs of streams from the input stream (stream L 1, stream R 1 ), (stream L 2, stream R 2 ), …, (stream L log log n, stream R log log n ) For each j in O(log log n), choose a hash function h j :{1, …, n} -> {0,1} stream L j is the original stream restricted to items i with h j (i) = 0 stream R j is the remaining part of the input stream maintain counters c L = Σ i: h j (i) = 0 g (i) ¢ f i and c R = Σ i: h j (i) = 1 g (i) ¢ f i (Chaining Inequality + Chernoff) the larger counter is usually the substream with i* The larger counter stays larger forever if the Chaining Inequality holds Run algorithm on items with counts which are larger a 9/10 fraction of the time Expected F 2 value of items, excluding i*, is F 2 /poly(log n), so i* is heavier

Derandomization We don’t have an infinitely long random tape We need to (1)derandomize a Gaussian process (2)derandomize the hash functions used to sequentially learn bits of i* We achieve (1) by (Derandomized Johnson Lindenstrauss) defining our counters by first applying a Johnson-Lindenstrauss (JL) transform [KMN] to the frequency vector, reducing n dimensions to log n, then taking the inner product with fully independent Gaussians (Slepian’s Lemma) counters don’t change much because a Gaussian process is determined by its covariances and all covariances are roughly preserved by JL For (2), derandomize an auxiliary algorithm via a reordering argument and Nisan’s PRG [I]

Conclusions