An Optimal Algorithm for Finding Heavy Hitters

An Optimal Algorithm for Finding Heavy Hitters
David Woodruff, IBM Almaden
Based on works with Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, Jelani Nelson, and Zhengyu Wang

Streaming Model
- Stream of elements a1, …, am in [n] = {1, …, n}; assume m = poly(n). Example stream: 4 3 7 3 1 1 2 …
- Arbitrary order, one pass over the data
- Minimize memory usage (space complexity), in bits, for solving a task
- Let fj be the number of occurrences of item j
- Heavy Hitters Problem: find those j for which fj is large

Guarantees
- ℓ1 guarantee:
  - output a set containing all items j for which fj ≥ φm
  - the set should not contain any j with fj ≤ (φ − ε)m
- ℓ2 guarantee, where F2 = Σj fj²:
  - output a set containing all items j for which fj² ≥ φF2
  - the set should not contain any j with fj² ≤ (φ − ε)F2
- The ℓ2 guarantee can be much stronger than the ℓ1 guarantee:
  - suppose the frequency vector is (√n, 1, 1, 1, …, 1)
  - item 1 is an ℓ2-heavy hitter for constant φ, ε, but not an ℓ1-heavy hitter
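To make the gap concrete, here is a minimal Python check of the (√n, 1, 1, …, 1) example from the slide; the choices n = 10^6 and φ = 1/4 are illustrative.

```python
import math

# Frequency vector (sqrt(n), 1, 1, ..., 1) from the slide; n is illustrative.
n = 10**6
f = [math.isqrt(n)] + [1] * (n - 1)

m = sum(f)                      # stream length = sum of frequencies
F2 = sum(x * x for x in f)      # second frequency moment

phi = 0.25
print(f[0] >= phi * m)          # False: item 1 is not an l1-heavy hitter
print(f[0] ** 2 >= phi * F2)    # True: f1^2 = n is about half of F2 ~ 2n
```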

CountSketch achieves the ℓ2 guarantee [CCFC]
- Assign each coordinate i a random sign σ(i) ∈ {−1, 1}
- Randomly partition coordinates into B buckets; maintain cj = Σi: h(i) = j σ(i)·fi in the j-th bucket
- Estimate fi as σ(i)·ch(i)
- This is unbiased: E[σ(i)·ch(i)] = σ(i) · Σi′: h(i′) = h(i) σ(i′)·fi′ = fi
- The noise in a bucket is σ(i) · Σi′ ≠ i: h(i′) = h(i) σ(i′)·fi′, which ensures every fj is approximated up to an additive (F2/B)^{1/2}
- Repeat this hashing scheme O(log n) times and output the median of the estimates
- Gives O(log² n) bits of space
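A minimal sketch of CountSketch along the lines of the slide. Python's built-in hash with a random salt stands in for the pairwise-independent hash and sign functions of [CCFC]; class and parameter names are illustrative.

```python
import random
import statistics

class CountSketch:
    def __init__(self, rows, buckets, seed=0):
        # rows = O(log n) independent repetitions, each with B buckets
        self.rows, self.buckets = rows, buckets
        self.salts = [random.Random(seed + r).getrandbits(64) for r in range(rows)]
        self.c = [[0] * buckets for _ in range(rows)]

    def _bucket_sign(self, r, i):
        # bucket h(i) and sign sigma(i) for row r
        x = hash((self.salts[r], i))
        return (x >> 1) % self.buckets, (1 if x & 1 else -1)

    def update(self, i):
        # one insertion of item i: add sigma(i) to bucket h(i) in every row
        for r in range(self.rows):
            b, s = self._bucket_sign(r, i)
            self.c[r][b] += s

    def estimate(self, i):
        # median over rows of sigma(i) * c_{h(i)}
        ests = []
        for r in range(self.rows):
            b, s = self._bucket_sign(r, i)
            ests.append(s * self.c[r][b])
        return statistics.median(ests)

cs = CountSketch(rows=15, buckets=64)
for a in [1] * 1000 + list(range(2, 5000)):   # item 1 is heavy
    cs.update(a)
print(cs.estimate(1))                          # close to 1000
```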

Known Space Bounds for ℓ2-Heavy Hitters
- CountSketch achieves O(log² n) bits of space
- If the stream is allowed to have deletions, this is optimal [DPIW]
- What about insertion-only streams? This is the model originally introduced by Alon, Matias, and Szegedy, and it models internet search logs, network traffic, databases, scientific data, etc.
- The only known lower bound is Ω(log n) bits, just to report the identity of the heavy hitter

Our Results [BCIW]
- We give an algorithm using O(log n · log log n) bits of space!
- The same techniques give a number of other results:
  - (F2 at all times) Estimate F2 at all times in a stream with O(log n · log log n) bits of space; this improves on the union bound, which would take O(log² n) bits
  - (L∞ estimation) Compute maxi fi up to additive (εF2)^{1/2} using O(log n · log log n) bits of space (resolves IITK Open Question 3)

Simplifications
- Output a set containing all items i for which fi² ≥ φF2, for constant φ
- There are at most O(1/φ) = O(1) such items i
- Hash items into O(1) buckets; all items i with fi² ≥ φF2 go to different buckets with good probability
- The problem reduces to having a single i* in {1, 2, …, n} with fi* ≥ (φF2)^{1/2}
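A sketch of this reduction step, assuming constant φ; the bucket count and salting scheme are illustrative stand-ins for the slide's O(1)-bucket hashing.

```python
import random

def split_into_substreams(stream, num_buckets=8, seed=1):
    """Hash items into O(1/phi) = O(1) buckets; with good probability every
    l2-heavy hitter lands in its own bucket, leaving one heavy item each."""
    salt = random.Random(seed).getrandbits(64)
    subs = [[] for _ in range(num_buckets)]
    for a in stream:
        subs[hash((salt, a)) % num_buckets].append(a)
    return subs
```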

Intuition
- Suppose first that fi* ≥ n^{1/2} log n and fi ∈ {0,1} for all i in {1, 2, …, n} \ {i*}
- For the moment, let us also not count the space to store random hash functions
- Assign each coordinate i a random sign σ(i) ∈ {−1, 1}
- Randomly partition items into 2 buckets
- Maintain c1 = Σi: h(i) = 1 σ(i)·fi and c2 = Σi: h(i) = 2 σ(i)·fi
- Suppose h(i*) = 1. What do the values c1 and c2 look like?
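A quick empirical look at the two counters with only light items (each fi = 1): both behave like random walks whose maximum deviation is on the order of n^{1/2}. All parameter choices here are illustrative.

```python
import random

def run_two_buckets(stream, seed=5):
    # random sign and random bucket per item, derived from one salted hash
    salt = random.Random(seed).getrandbits(64)
    c, max_dev = [0, 0], [0, 0]
    for a in stream:
        x = hash((salt, a))
        b, s = x % 2, (1 if (x >> 1) & 1 else -1)
        c[b] += s
        max_dev[b] = max(max_dev[b], abs(c[b]))
    return c, max_dev

# n light items, each appearing once: the max deviations come out around
# sqrt(n) = 100, matching the C * n^(1/2) random-walk bound on the slide.
print(run_two_buckets(list(range(10**4))))
```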

- c1 = σ(i*)·fi* + Σi ≠ i*, h(i) = 1 σ(i)·fi and c2 = Σi: h(i) = 2 σ(i)·fi
- c1 − σ(i*)·fi* and c2 evolve as random walks as the stream progresses
- (Random Walks) There is a constant C > 0 so that, with probability 9/10, at all times |c1 − σ(i*)·fi*| < Cn^{1/2} and |c2| < Cn^{1/2}
- Eventually fi* > 2Cn^{1/2}, and then we know which bucket contains i*!
- This only gives 1 bit of information. We can't repeat log n times in parallel, but we can repeat log n times sequentially!

Repeating Sequentially
- Wait until either |c1| or |c2| exceeds Cn^{1/2}
- If |c1| > Cn^{1/2} then h(i*) = 1, otherwise h(i*) = 2
- This gives 1 bit of information about i*
- (Repeat) Initialize 2 new counters to 0 and perform the procedure again!
- Assuming fi* = Ω(n^{1/2} log n), we get at least 10 log n repetitions and are correct in a 2/3 fraction of them
- (Chernoff) With high probability, only a single value of i* has hash values matching a 2/3 fraction of the repetitions
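A toy end-to-end simulation of the sequential scheme, under the slide's assumption of a single very heavy item; the constant C, the stream, and the brute-force decoder are illustrative choices, not the paper's construction.

```python
import math
import random

def learn_bits_sequentially(stream, n, C=3.0, seed=3):
    rng = random.Random(seed)
    threshold = C * math.sqrt(n)
    salt = rng.getrandbits(64)      # fresh hash function per repetition
    c = [0, 0]
    rounds = []                     # (salt, winning bucket): 1 bit each
    for a in stream:
        x = hash((salt, a))
        c[x % 2] += 1 if (x >> 1) & 1 else -1
        if max(abs(c[0]), abs(c[1])) > threshold:
            rounds.append((salt, 0 if abs(c[0]) > abs(c[1]) else 1))
            salt = rng.getrandbits(64)   # repeat: new counters, new hash
            c = [0, 0]
    # decode: the item whose hash values agree with the most recorded bits
    # (brute force over [n] in this toy; the real algorithm never does this)
    return max(range(n), key=lambda i: sum(hash((s, i)) % 2 == b
                                           for s, b in rounds))

n = 4096
heavy = [0] * (16 * int(3 * math.sqrt(n)))   # item 0 is very heavy
stream = heavy + list(range(1, n))
random.Random(0).shuffle(stream)
print(learn_bits_sequentially(stream, n))    # usually prints 0
```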

Gaussian Processes
- We don't actually have fi* ≥ n^{1/2} log n and fi ∈ {0,1} for all i in {1, 2, …, n} \ {i*}
- We fix both problems using Gaussian processes
- (Gaussian Process) A collection {Xt}t in T of random variables, for an index set T, for which every finite linear combination of the variables is Gaussian
  - Assume E[Xt] = 0 for all t
  - The process is entirely determined by the covariances E[XsXt]
  - The distance function d(s,t) = (E[|Xs − Xt|²])^{1/2} is a pseudo-metric on T
- (Connection to Data Streams) Suppose we replace the signs σ(i) with standard normal random variables g(i), and consider a counter c at time t: c(t) = Σi g(i)·fi(t)
  - fi(t) is the frequency of item i after processing t stream insertions
  - c(t) is a Gaussian process!

Chaining Inequality [Fernique, Talagrand]
Let {Xt}t in T be a Gaussian process, and let T0 ⊆ T1 ⊆ T2 ⊆ ⋯ ⊆ T be such that |T0| = 1 and |Ti| ≤ 2^{2^i} for i ≥ 1. Then
E[supt in T |Xt|] ≤ O(1) · supt in T Σi ≥ 0 2^{i/2} · d(t, Ti)
How can we apply this to c(t) = Σi g(i)·fi(t)?
- Let F2(t) be the value of F2 after t stream insertions
- Let the Ti be a recursive partitioning of the stream in which F2(t) changes by a factor of 2

Apply the chaining inequality!
- Let at be the first point in the stream for which F2(m)/2 ≤ F2(t), and set T0 = {at}
- For i ≥ 1, let Ti be the set of 2^{2^i} times t1, t2, …, t_{2^{2^i}} in the stream such that tj is the first point with j · F2(m)/2^{2^i} ≤ F2(tj)
- Then T0 ⊆ T1 ⊆ T2 ⊆ ⋯ ⊆ T, |T0| = 1, and |Ti| ≤ 2^{2^i} for i ≥ 1, so the chaining inequality applies
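A small helper that constructs these sets from the prefix values F2(t), following the slide's definition; the function and argument names are illustrative, and since |Ti| is doubly exponential, levels should stay around log log m.

```python
import bisect

def chaining_sets(prefix_F2, levels):
    """prefix_F2[t] = F2 after t insertions (nondecreasing, prefix_F2[0] = 0).
    Returns [T_0, T_1, ...] where T_i collects the first times t_j with
    j * F2(m) / 2^(2^i) <= F2(t_j)."""
    m = len(prefix_F2) - 1
    total = prefix_F2[m]

    def first_time(target):
        return bisect.bisect_left(prefix_F2, target)

    Ts = [[first_time(total / 2)]]           # |T_0| = 1
    for i in range(1, levels):
        k = 2 ** (2 ** i)                    # |T_i| <= 2^(2^i)
        Ts.append(sorted({first_time(j * total / k) for j in range(1, k + 1)}))
    return Ts
```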

Applying the Chaining Inequality
Let {Xt}t in T be a Gaussian process, and let T0 ⊆ T1 ⊆ T2 ⊆ ⋯ ⊆ T be such that |T0| = 1 and |Ti| ≤ 2^{2^i} for i ≥ 1. Then
E[supt in T |Xt|] ≤ O(1) · supt in T Σi ≥ 0 2^{i/2} · d(t, Ti)
- d(t, Ti) = min_{tj in Ti} (E[|c(t) − c(tj)|²])^{1/2} ≤ (F2/2^{2^i})^{1/2}
- Hence, E[supt in T |Xt|] ≤ O(1) · supt in T Σi ≥ 0 2^{i/2} · (F2/2^{2^i})^{1/2} = O(F2^{1/2})
- Same behavior as for random walks!
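Spelling out the sum (a LaTeX rendering of the slide's calculation): the series converges because the exponent i/2 − 2^{i−1} decreases doubly exponentially.

```latex
\mathbb{E}\,\sup_{t \in T} |X_t|
  \;\le\; O(1)\,\sup_{t \in T}\sum_{i \ge 0} 2^{i/2}\, d(t, T_i)
  \;\le\; O(1)\sum_{i \ge 0} 2^{i/2} \left(\frac{F_2}{2^{2^i}}\right)^{1/2}
  \;=\; O(1)\, F_2^{1/2} \sum_{i \ge 0} 2^{\,i/2 - 2^{i-1}}
  \;=\; O\!\left(F_2^{1/2}\right)
```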

Removing Frequency Assumptions
- We don't actually have fi* ≥ n^{1/2} log n and fj ∈ {0,1} for all j in {1, 2, …, n} \ {i*}
- The Gaussian process removes the restriction that fj ∈ {0,1} for all j in {1, 2, …, n} \ {i*}: the random walk bound of Cn^{1/2} we needed on the counters holds without it
- But we still need fi* ≥ F2^{1/2} log n to learn log n bits about the heavy hitter
- How do we replace this restriction with fi* ≥ (φF2)^{1/2}?
- Assume φ ≥ 1/log log n, by hashing into log log n buckets and incurring a log log n factor in space

Amplification
- Create O(log log n) pairs of streams from the input stream: (streamL1, streamR1), (streamL2, streamR2), …, (streamL_{log log n}, streamR_{log log n})
- For each j in {1, …, O(log log n)}, choose a hash function hj: {1, …, n} → {0,1}
  - streamLj is the original stream restricted to items i with hj(i) = 0
  - streamRj is the remaining part of the input stream
  - maintain counters cL = Σi: hj(i) = 0 g(i)·fi and cR = Σi: hj(i) = 1 g(i)·fi
- (Chaining Inequality + Chernoff) The larger counter usually belongs to the substream containing i*, and it stays larger forever if the chaining inequality holds
- Run the algorithm on the items corresponding to the larger counts
- The expected F2 value of these items, excluding i*, is F2/poly(log n), so i* is relatively heavier
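One round of this, as a hedged Python sketch: a fresh hash hj splits the stream, Gaussian counters are kept on both sides, and the side with the larger counter magnitude (usually the one containing i*) survives to the next round. The function name and the lazy Gaussian table are illustrative; the real algorithm does not store per-item Gaussians explicitly.

```python
import random

def amplification_round(stream, rng):
    salt = rng.getrandbits(64)          # the hash function h_j for this round
    g = {}                              # lazily drawn Gaussians g(i)
    cL = cR = 0.0
    for a in stream:
        if a not in g:
            g[a] = rng.gauss(0.0, 1.0)
        if hash((salt, a)) % 2 == 0:    # h_j(a) = 0 -> streamL
            cL += g[a]
        else:                           # h_j(a) = 1 -> streamR
            cR += g[a]
    keep = 0 if abs(cL) >= abs(cR) else 1
    return [a for a in stream if hash((salt, a)) % 2 == keep]

# O(log log n) rounds: each round roughly halves the F2 of the non-i* items,
# so i* ends up heavier relative to the surviving substream.
rng = random.Random(11)
substream = [0] * 500 + list(range(1, 2000))   # item 0 plays the role of i*
for _ in range(4):
    substream = amplification_round(substream, rng)
```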

Derandomization
We have to account for the randomness in our algorithm. We need to:
1. derandomize the Gaussian process
2. derandomize the hash functions used to sequentially learn the bits of i*
We achieve (1) as follows:
- (Derandomized Johnson-Lindenstrauss) Define the counters by first applying a Johnson-Lindenstrauss (JL) transform [KMN] to the frequency vector, reducing n dimensions to log n, and then taking the inner product with fully independent Gaussians
- (Slepian's Lemma) The counters don't change much, because a Gaussian process is determined by its covariances, and all covariances are roughly preserved by JL
For (2), we derandomize an auxiliary algorithm via Nisan's pseudorandom generator [I]
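A rough sketch of the JL idea for (1): since S·f is linear, it can be maintained as the stream arrives, and the counter is then an inner product with only k = O(log n) truly independent Gaussians. The sign-based S below is an illustrative JL map, not the specific [KMN] construction, and regenerating columns from a seeded PRG stands in for storing a small seed.

```python
import random

def jl_gaussian_counter(stream, k=32, seed=7):
    def column(i):
        # column i of the JL map S: k random signs scaled by 1/sqrt(k),
        # regenerated on demand so no n-by-k matrix is ever stored
        r = random.Random(hash((seed, i)))
        return [(1 if r.getrandbits(1) else -1) / k ** 0.5 for _ in range(k)]

    Sf = [0.0] * k
    for a in stream:                     # S*f is linear in the stream updates
        for r, v in enumerate(column(a)):
            Sf[r] += v

    grng = random.Random(seed + 1)       # k fully independent Gaussians
    g = [grng.gauss(0.0, 1.0) for _ in range(k)]
    return sum(gi * x for gi, x in zip(g, Sf))   # counter c = <g, S*f>
```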

An Optimal Algorithm [BCINWW]
- We want O(log n) bits instead of O(log n · log log n) bits
- The O(log log n) factor comes from multiple sources:
  - Amplification: use a tree-based scheme, and the fact that the heavy hitter becomes heavier along the way!
  - Derandomization: show that 4-wise independence suffices for derandomizing a Gaussian process!

Conclusions
- Beat CountSketch for finding ℓ2-heavy hitters in a data stream: O(log n) bits of space instead of O(log² n) bits
- New results for estimating F2 at all points and for L∞ estimation
Questions:
- Is this a significant practical improvement over CountSketch as well?
- Can we use Gaussian processes for other insertion-only stream problems?