New Algorithms for Heavy Hitters in Data Streams David Woodruff IBM Almaden Joint works with Arnab Bhattacharyya, Vladimir Braverman, Stephen R. Chestnut,

Slides:



Advertisements
Similar presentations
Estimating Distinct Elements, Optimally
Advertisements

Rectangle-Efficient Aggregation in Spatial Data Streams Srikanta Tirthapura David Woodruff Iowa State IBM Almaden.
Fast Moment Estimation in Data Streams in Optimal Space Daniel Kane, Jelani Nelson, Ely Porat, David Woodruff Harvard MIT Bar-Ilan IBM.
Optimal Approximations of the Frequency Moments of Data Streams Piotr Indyk David Woodruff.
1+eps-Approximate Sparse Recovery Eric Price MIT David Woodruff IBM Almaden.
The Data Stream Space Complexity of Cascaded Norms T.S. Jayram David Woodruff IBM Almaden.
Tight Bounds for Distributed Functional Monitoring David Woodruff IBM Almaden Qin Zhang Aarhus University MADALGO Based on a paper in STOC, 2012.
The Complexity of Linear Dependence Problems in Vector Spaces David Woodruff IBM Almaden Joint work with Arnab Bhattacharyya, Piotr Indyk, and Ning Xie.
Optimal Space Lower Bounds for All Frequency Moments David Woodruff MIT
Numerical Linear Algebra in the Streaming Model Ken Clarkson - IBM David Woodruff - IBM.
The Average Case Complexity of Counting Distinct Elements David Woodruff IBM Almaden.
An Optimal Algorithm for the Distinct Elements Problem
Optimal Bounds for Johnson- Lindenstrauss Transforms and Streaming Problems with Sub- Constant Error T.S. Jayram David Woodruff IBM Almaden.
Sublinear-time Algorithms for Machine Learning Ken Clarkson Elad Hazan David Woodruff IBM Almaden Technion IBM Almaden.
Xiaoming Sun Tsinghua University David Woodruff MIT
Subspace Embeddings for the L1 norm with Applications Christian Sohler David Woodruff TU Dortmund IBM Almaden.
Fast Johnson-Lindenstrauss Transform(s) Nir Ailon Edo Liberty, Bernard Chazelle Bertinoro Workshop on Sublinear Algorithms May 2011.
An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode, S. Muthukrishnan 2003.
Finding Frequent Items in Data Streams Moses CharikarPrinceton Un., Google Inc. Kevin ChenUC Berkeley, Google Inc. Martin Franch-ColtonRutgers Un., Google.
3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.
Vladimir(Vova) Braverman UCLA Joint work with Rafail Ostrovsky.
ABSTRACT We consider the problem of computing information theoretic functions such as entropy on a data stream, using sublinear space. Our first result.
Sketching for M-Estimators: A Unified Approach to Robust Regression
Turnstile Streaming Algorithms Might as Well Be Linear Sketches Yi Li Huy L. Nguyen David Woodruff.
Algorithms for data streams Foundations of Data Science 2014 Indian Institute of Science Navin Goyal.
Yi Wu (CMU) Joint work with Parikshit Gopalan (MSR SVC) Ryan O’Donnell (CMU) David Zuckerman (UT Austin) Pseudorandom Generators for Halfspaces TexPoint.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
Reverse Hashing for Sketch Based Change Detection in High Speed Networks Ashish Gupta Elliot Parsons with Robert Schweller, Theory Group Advisor: Yan Chen.
Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,
CS591A1 Fall Sketch based Summarization of Data Streams Manish R. Sharma and Weichao Ma.
Data Streams and Applications in Computer Science David Woodruff IBM Almaden Presburger lecture, ICALP, 2014.
Sketching for M-Estimators: A Unified Approach to Robust Regression Kenneth Clarkson David Woodruff IBM Almaden.
How Robust are Linear Sketches to Adaptive Inputs? Moritz Hardt, David P. Woodruff IBM Research Almaden.
Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.
PODC Distributed Computation of the Mode Fabian Kuhn Thomas Locher ETH Zurich, Switzerland Stefan Schmid TU Munich, Germany TexPoint fonts used in.
Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.
Data Stream Algorithms Lower Bounds Graham Cormode
Calculating frequency moments of Data Stream
Low Rank Approximation and Regression in Input Sparsity Time David Woodruff IBM Almaden Joint work with Ken Clarkson (IBM Almaden)
The Message Passing Communication Model David Woodruff IBM Almaden.
Beating CountSketch for Heavy Hitters in Insertion Streams Vladimir Braverman (JHU) Stephen R. Chestnut (ETH) Nikita Ivkin (JHU) David P. Woodruff (IBM)
Big Data Lecture 5: Estimating the second moment, dimension reduction, applications.
A Story of Principal Component Analysis in the Distributed Model David Woodruff IBM Almaden Based on works with Christos Boutsidis, Ken Clarkson, Ravi.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
An Optimal Algorithm for Finding Heavy Hitters
Chapter 11 Sorting Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and Mount.
Upper and Lower Bounds on the cost of a Map-Reduce Computation
Stochastic Streams: Sample Complexity vs. Space Complexity
New Characterizations in Turnstile Streams with Applications
Finding Frequent Items in Data Streams
Estimating L2 Norm MIT Piotr Indyk.
Streaming & sampling.
A special case of calibration
Sublinear Algorithmic Tools 2
Lecture 10: Sketching S3: Nearest Neighbor Search
Randomized Algorithms CS648
Turnstile Streaming Algorithms Might as Well Be Linear Sketches
Hidden Markov Models Part 2: Algorithms
Qun Huang, Patrick P. C. Lee, Yungang Bao
Y. Kotidis, S. Muthukrishnan,
Overview Massive data sets Streaming algorithms Regression
The Communication Complexity of Distributed Set-Joins
Searching CLRS, Sections 9.1 – 9.3.
CSCI B609: “Foundations of Data Science”
Streaming Symmetric Norms via Measure Concentration
Dimension versus Distortion a.k.a. Euclidean Dimension Reduction
Lecture 6: Counting triangles Dynamic graphs & sampling
Joint work with Morteza Monemizadeh
Sublinear Algorihms for Big Data
(Learned) Frequency Estimation Algorithms
Presentation transcript:

New Algorithms for Heavy Hitters in Data Streams David Woodruff IBM Almaden Joint works with Arnab Bhattacharyya, Vladimir Braverman, Stephen R. Chestnut, Palash Dey, Nikita Ivkin, Jelani Nelson, and Zhengyu Wang

Streaming Model Stream of elements a 1, …, a m in [n] = {1, …, n}. Assume m = poly(n) Arbitrary order One pass over the data Minimize memory usage (space complexity) in bits for solving a task Let f j be the number of occurrences of item j Heavy Hitters Problem: find those j for which f j is large …

Guarantees f 1, f 2 f 3 f 4 f 5 f 6

Outline Optimal algorithm in all parameters φ, ε for l 1 -guarantee Optimal algorithm for l 2 -guarantee for constant φ, ε

Misra-Gries key value

Space Complexity of Misra-Gries

Our Results

A Simple Initial Improvement

Why Sampling Works?

Misra-Gries after Hashing

Maintaining Identities hashed key value \\30903\\ actual key value

Summary of Initial Improvement

An Optimal Algorithm

Dealing with Item Identifiers

Dealing with Item Counts

Conclusions on l 1 -guarantee

Outline Optimal algorithm in all parameters φ, ε for l 1 -guarantee Optimal algorithm for l 2 -guarantee for constant φ, ε

CountSketch achieves the l 2 –guarantee [CCFC] Assign each coordinate i a random sign ¾ (i) 2 {-1,1} Randomly partition coordinates into B buckets, maintain c j = Σ i: h(i) = j ¾ (i) ¢ f i in j-th bucket.Σ i: h(i) = 2 ¾ (i) ¢ f i.. f1f1 f2f2 f3f3 f4f4 f5f5 f6f6 f7f7 f8f8 f9f9 f 10 Estimate f i as ¾ (i) ¢ c h(i)

Known Space Bounds for l 2 – heavy hitters CountSketch achieves O(log 2 n) bits of space If the stream is allowed to have deletions, this is optimal [DPIW] What about insertion-only streams? This is the model originally introduced by Alon, Matias, and Szegedy Models internet search logs, network traffic, databases, scientific data, etc. The only known lower bound is Ω(log n) bits, just to report the identity of the heavy hitter

Our Results [BCIW]

Simplifications

Intuition

Only gives 1 bit of information. Can’t repeat log n times in parallel, but can repeat log n times sequentially!

Repeating Sequentially

Gaussian Processes

Chaining Inequality [Fernique, Talagrand]

… atat a5a5 a4a4 a3a3 a2a2 a1a1 amam … Apply the chaining inequality!

Applying the Chaining Inequality Same behavior as for random walks!

Removing Frequency Assumptions

Amplification Create O(log log n) pairs of streams from the input stream (stream L 1, stream R 1 ), (stream L 2, stream R 2 ), …, (stream L log log n, stream R log log n ) For each j in O(log log n), choose a hash function h j :{1, …, n} -> {0,1} stream L j is the original stream restricted to items i with h j (i) = 0 stream R j is the remaining part of the input stream maintain counters c L = Σ i: h j (i) = 0 g (i) ¢ f i and c R = Σ i: h j (i) = 1 g (i) ¢ f i (Chaining Inequality + Chernoff) the larger counter is usually the substream with i* The larger counter stays larger forever if the Chaining Inequality holds Run algorithm on items corresponding to the larger counts Expected F 2 value of items, excluding i*, is F 2 /poly(log n), so i* is heavier

Derandomization We have to account for the randomness in our algorithm We need to (1)derandomize a Gaussian process (2)derandomize the hash functions used to sequentially learn bits of i* We achieve (1) by (Derandomized Johnson Lindenstrauss) defining our counters by first applying a Johnson-Lindenstrauss (JL) transform [KMN] to the frequency vector, reducing n dimensions to log n, then taking the inner product with fully independent Gaussians (Slepian’s Lemma) counters don’t change much because a Gaussian process is determined by its covariances and all covariances are roughly preserved by JL For (2), derandomize an auxiliary algorithm via Nisan’s pseudorandom generator [I]

An Optimal Algorithm [BCINWW] Want O(log n) bits instead of O(log n log log n) bits Multiple sources where the O(log log n) factor is coming from Amplification Use a tree-based scheme and that the heavy hitter becomes heavier! Derandomization Show O(1)-wise independence suffices for derandomizing a Gaussian process!

Conclusions on l 2 -guarantee

Accelerated Counters