New Algorithms for Heavy Hitters in Data Streams
David Woodruff, IBM Almaden
Joint work with Arnab Bhattacharyya, Vladimir Braverman, Stephen R. Chestnut, Palash Dey, Nikita Ivkin, Jelani Nelson, and Zhengyu Wang
Streaming Model
Stream of elements a_1, …, a_m in [n] = {1, …, n}; assume m = poly(n)
Arbitrary order
One pass over the data
Minimize memory usage (space complexity) in bits for solving a task
Let f_j be the number of occurrences of item j
Heavy Hitters Problem: find those j for which f_j is large
Example stream: 2 1 1 3 7 3 4
Guarantees
l1-guarantee: output a set containing every item j with f_j ≥ φ·m and no item j with f_j ≤ (φ − ε)·m
l2-guarantee: output every item j with f_j² ≥ φ·F_2 and no item j with f_j² ≤ (φ − ε)·F_2, where F_2 = Σ_j f_j²
(figure: example frequency histogram f_1, …, f_6)
The l2-guarantee is stronger: an item can satisfy f_j² ≥ φ·F_2 while f_j is only an O(1/√n) fraction of the stream length
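To make the two guarantees concrete, here is a minimal offline reference in Python (a sketch, not a streaming algorithm: it stores a counter per distinct item, exactly the Θ(n) cost the algorithms below avoid; the function name is illustrative):

```python
from collections import Counter

def exact_heavy_hitters(stream, phi):
    """Offline reference: compute the items that the l1-guarantee
    (f_j >= phi*m) and the l2-guarantee (f_j^2 >= phi*F_2) must report."""
    f = Counter(stream)
    m = sum(f.values())
    F2 = sum(v * v for v in f.values())
    l1 = {j for j, v in f.items() if v >= phi * m}
    l2 = {j for j, v in f.items() if v * v >= phi * F2}
    return l1, l2

# On the example stream, items 1 and 3 qualify under both guarantees.
print(exact_heavy_hitters([2, 1, 1, 3, 7, 3, 4], phi=0.25))
```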
Outline
1. Optimal algorithm in all parameters φ, ε for the l1-guarantee
2. Optimal algorithm for the l2-guarantee for constant φ, ε
Misra-Gries
(figure: the key/value table of counters maintained by the algorithm)
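For reference, a minimal Python sketch of the classic Misra-Gries algorithm (the function name and interface are illustrative):

```python
def misra_gries(stream, k):
    """Maintain at most k-1 (item, count) pairs; any item occurring
    more than m/k times is guaranteed to survive in the table."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No room: decrement every counter, evicting zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Each stored count underestimates the true frequency by at most m/k, so taking k ≈ 1/ε and reporting items whose counter is at least (φ − ε)·m yields the (φ, ε) l1-guarantee.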
Space Complexity of Misra-Gries
With O(1/ε) counters, each holding an item identifier (log n bits) and a count (log m bits), Misra-Gries uses O(ε⁻¹(log n + log m)) bits.
Our Results
A Simple Initial Improvement
Why Sampling Works
A uniform sample of poly(1/ε)·log n stream positions preserves every relative frequency f_j/m up to additive ε with high probability (Chernoff + union bound), so it suffices to find heavy hitters in the sample.
(figure: a sampled substream of the example stream)
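A minimal sketch of the sampling step (constants and the fixed failure probability are illustrative):

```python
import math
import random

def sample_stream(stream, eps, delta=0.01):
    """Keep each element independently with probability p, chosen so the
    sample holds Theta(eps^-2 * log(1/delta)) elements in expectation;
    by Chernoff, relative frequencies in the sample match those in the
    full stream up to +-eps."""
    target = math.ceil(eps ** -2 * math.log(1 / delta))
    p = min(1.0, target / len(stream))
    return [x for x in stream if random.random() < p]
```

Running Misra-Gries on the (much shorter) sample rather than the full stream shrinks the counts that must be stored.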
Misra-Gries after Hashing
Idea: hash the item identifiers down to O(log(1/ε)) bits before feeding them to Misra-Gries; with poly(1/ε) hash buckets, the O(1/ε) stored keys avoid collisions with good probability, so each stored key costs O(log(1/ε)) rather than log n bits.
Maintaining Identities
(figure: the Misra-Gries table keyed by hashed identifiers, shown next to the actual key recorded for each stored hashed key)
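A minimal sketch combining the last two slides: Misra-Gries keyed on short hashes, with the actual identifier recorded when a hashed key is inserted. The hash family and the insertion-time bookkeeping are illustrative simplifications; in particular this sketch still keeps a full identifier per stored key, so it shows the bookkeeping rather than the final space bound.

```python
import random

MERSENNE_P = 2 ** 61 - 1  # modulus for a simple illustrative hash family

def mg_hashed(stream, k, hash_range):
    """Misra-Gries on hashed keys: counts are keyed by short hashes, and
    a full identifier is remembered per stored hashed key so candidates
    can be reported by actual identity."""
    a = random.randrange(1, MERSENNE_P)
    h = lambda x: (a * x % MERSENNE_P) % hash_range
    counts, identity = {}, {}
    for item in stream:
        key = h(item)
        if key in counts:
            counts[key] += 1          # collisions inflate counts (rare
                                      # when hash_range = poly(1/eps))
        elif len(counts) < k - 1:
            counts[key] = 1
            identity[key] = item      # record the actual key on insertion
        else:
            for kk in list(counts):
                counts[kk] -= 1
                if counts[kk] == 0:
                    del counts[kk]
                    del identity[kk]
    return {identity[kk]: c for kk, c in counts.items()}
```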
Summary of Initial Improvement
An Optimal Algorithm
Dealing with Item Identifiers
Dealing with Item Counts
Conclusions on l1-guarantee
Outline
1. Optimal algorithm in all parameters φ, ε for the l1-guarantee
2. Optimal algorithm for the l2-guarantee for constant φ, ε
CountSketch achieves the l2-guarantee [CCFC]
Assign each coordinate i a random sign σ(i) ∈ {−1, +1}
Randomly partition coordinates into B buckets via a hash function h; in the j-th bucket, maintain c_j = Σ_{i: h(i) = j} σ(i)·f_i
(figure: coordinates f_1, …, f_10 hashed into buckets)
Estimate f_i as σ(i)·c_{h(i)}
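A minimal Python sketch of one CountSketch row. The full sketch keeps O(log n) independent rows and reports the median of the per-row estimates; the lazily sampled fully random sign and bucket functions below stand in for the pairwise-independent hash functions used in practice.

```python
import random

class CountSketchRow:
    """One row of CountSketch: B signed buckets, where bucket j holds
    c_j = sum over items i with h(i) = j of sigma(i) * f_i."""
    def __init__(self, B, seed=0):
        self.B = B
        self.rng = random.Random(seed)
        self.sign = {}     # sigma(i) in {-1, +1}, sampled on first use
        self.bucket = {}   # h(i) in {0, ..., B-1}, sampled on first use
        self.c = [0] * B

    def _params(self, i):
        if i not in self.sign:
            self.sign[i] = self.rng.choice((-1, 1))
            self.bucket[i] = self.rng.randrange(self.B)
        return self.sign[i], self.bucket[i]

    def update(self, i, delta=1):
        s, j = self._params(i)
        self.c[j] += s * delta       # item i contributes sigma(i)*f_i to c_h(i)

    def estimate(self, i):
        s, j = self._params(i)
        return s * self.c[j]         # sigma(i) * c_h(i)
```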
Known Space Bounds for l2-heavy hitters
CountSketch achieves O(log² n) bits of space
If the stream is allowed to have deletions, this is optimal [DPIW]
What about insertion-only streams? This is the model originally introduced by Alon, Matias, and Szegedy
Models internet search logs, network traffic, databases, scientific data, etc.
The only known lower bound is Ω(log n) bits, just to report the identity of the heavy hitter
Our Results [BCIW]
For constant φ and ε, the l2-guarantee can be achieved in insertion-only streams with O(log n · log log n) bits of space.
Simplifications
Assume for now that there is a single l2-heavy hitter i*; the assumptions on the frequency vector are removed later (see "Removing Frequency Assumptions").
Intuition
Maintain a single signed counter c = Σ_i g(i)·f_i with random signs g(i): the heavy hitter i* dominates the sum, so the sign of c reveals g(i*).
This only gives 1 bit of information about i*. We can't repeat log n times in parallel, but we can repeat log n times sequentially!
Repeating Sequentially
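A heavily simplified Python sketch of the sequential scheme. Illustrative assumptions: identifiers are log2(n)-bit integers, the stream is cut into equal epochs, and i* dominates every epoch; the real algorithm uses Gaussian signs and the chaining argument below to decide when each bit has converged.

```python
import math
import random

def learn_heavy_id(stream, n):
    """Learn the identity of the dominant item bit by bit: in epoch b,
    items consistent with the already-learned low-order bits are split
    by their b-th bit into two signed counters, and the side with the
    larger |counter| determines bit b."""
    bits = (n - 1).bit_length()
    epoch = math.ceil(len(stream) / bits)
    known = 0                                  # learned low-order bits of i*
    for b in range(bits):
        sign = {}
        c = [0.0, 0.0]
        for item in stream[b * epoch:(b + 1) * epoch]:
            if item % (1 << b) != known:       # inconsistent with known bits
                continue
            if item not in sign:
                sign[item] = random.choice((-1.0, 1.0))
            c[(item >> b) & 1] += sign[item]
        if abs(c[1]) > abs(c[0]):
            known |= 1 << b
    return known
```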
Gaussian Processes
With i.i.d. Gaussian signs g(i), the counter X_t = Σ_i g(i)·f_i(t), viewed across all times t (where f_i(t) is the count of item i after t updates), is a Gaussian process; its maximum over time can be controlled by chaining.
Chaining Inequality [Fernique, Talagrand]
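For reference, one standard statement of the bound (Talagrand's generic chaining upper bound, quoted from the literature rather than from the slides):

```latex
% Gaussian process (X_t)_{t \in T} with canonical metric
% d(s,t) = (\mathbb{E}(X_s - X_t)^2)^{1/2}:
\[
  \mathbb{E}\,\sup_{t \in T} X_t
    \;\le\; C \,\inf_{(T_k)} \sup_{t \in T} \sum_{k \ge 0} 2^{k/2}\, d(t, T_k),
\]
% where the infimum ranges over sequences of subsets T_k \subseteq T
% with |T_0| = 1 and |T_k| \le 2^{2^k}, and C is a universal constant.
```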
(figure: the stream a_1, a_2, …, a_t, …, a_m) Apply the chaining inequality!
Applying the Chaining Inequality
The counter exhibits the same behavior as a random walk: its maximum over all m prefixes of the stream is of the same order as its typical value at a single point in time.
Removing Frequency Assumptions
Amplification
Create O(log log n) pairs of streams from the input stream: (stream L_1, stream R_1), (stream L_2, stream R_2), …, (stream L_{log log n}, stream R_{log log n})
For each j in O(log log n), choose a hash function h_j: {1, …, n} -> {0,1}
stream L_j is the original stream restricted to items i with h_j(i) = 0; stream R_j is the remaining part of the input stream
Maintain counters c_L = Σ_{i: h_j(i) = 0} g(i)·f_i and c_R = Σ_{i: h_j(i) = 1} g(i)·f_i
(Chaining Inequality + Chernoff) the larger counter usually belongs to the substream containing i*, and the larger counter stays larger forever if the Chaining Inequality holds
Run the algorithm on the items corresponding to the larger counters (a sketch follows below)
The expected F_2 value of the surviving items, excluding i*, is F_2/poly(log n), so i* is relatively heavier
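A minimal Python sketch of the amplification step. Illustrative assumptions: identifiers are 0..n-1, and the fully random splits h_j and the Gaussians g(i) are stored explicitly per item, which a low-space algorithm would instead generate from short seeds.

```python
import random

def amplify(stream, n, reps):
    """Maintain `reps` independent (c_L, c_R) counter pairs, each from a
    random 2-way split of the items; keep only the items landing on the
    larger-|counter| side of every split. The expected F_2 of survivors
    other than i* drops geometrically with reps (to F_2/poly(log n) for
    reps = O(log log n))."""
    h = [[random.randrange(2) for _ in range(n)] for _ in range(reps)]
    g = [random.gauss(0.0, 1.0) for _ in range(n)]
    c = [[0.0, 0.0] for _ in range(reps)]
    for item in stream:                        # one pass: update every pair
        for j in range(reps):
            c[j][h[j][item]] += g[item]
    win = [0 if abs(pair[0]) >= abs(pair[1]) else 1 for pair in c]
    return {i for i in range(n) if all(h[j][i] == win[j] for j in range(reps))}
```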
Derandomization
We have to account for the randomness in our algorithm. We need to
(1) derandomize a Gaussian process
(2) derandomize the hash functions used to sequentially learn the bits of i*
We achieve (1) as follows (a sketch follows below):
(Derandomized Johnson-Lindenstrauss) define the counters by first applying a Johnson-Lindenstrauss (JL) transform [KMN] to the frequency vector, reducing n dimensions to log n, then taking the inner product with fully independent Gaussians
(Slepian's Lemma) the counters don't change much, because a Gaussian process is determined by its covariances, and all covariances are roughly preserved by JL
For (2), derandomize an auxiliary algorithm via Nisan's pseudorandom generator [I]
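A minimal Python sketch of the derandomized counter idea. The dense random-sign matrix below is an illustrative stand-in for the short-seed JL transform of [KMN], and the sketch stores S and the per-item increments explicitly, which a low-space implementation would generate on the fly.

```python
import random

def jl_gaussian_counter(stream, n, d, seed=0):
    """Counter <g, S f>: compress the frequency vector f with a JL-style
    sign matrix S (d = O(log n) rows), then take the inner product with
    d fully independent Gaussians g. The streaming update on item i
    adds <g, S[:, i]> to the counter."""
    rng = random.Random(seed)
    S = [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(d)]
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # Precompute the per-item increment <g, S[:, i]>.
    incr = [sum(g[r] * S[r][i] for r in range(d)) for i in range(n)]
    counter = 0.0
    for item in stream:
        counter += incr[item]
    return counter
```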
An Optimal Algorithm [BCINWW]
Want O(log n) bits instead of O(log n · log log n) bits
There are multiple sources of the O(log log n) factor:
Amplification: use a tree-based scheme and exploit the fact that the heavy hitter becomes heavier!
Derandomization: show that O(1)-wise independence suffices for derandomizing a Gaussian process!
Conclusions on l2-guarantee
Accelerated Counters