
1 New Algorithms for Heavy Hitters in Data Streams. David Woodruff, IBM Almaden. Joint work with Arnab Bhattacharyya, Vladimir Braverman, Stephen R. Chestnut, Palash Dey, Nikita Ivkin, Jelani Nelson, and Zhengyu Wang

2 Streaming Model
Stream of elements a_1, …, a_m in [n] = {1, …, n}; assume m = poly(n)
Arbitrary order, one pass over the data
Minimize memory usage (space complexity) in bits for solving a task
Let f_j be the number of occurrences of item j
Heavy Hitters Problem: find those j for which f_j is large
(Figure: example stream 2 1 1 3 7 3 4)
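As a point of reference before the streaming algorithms, the heavy hitters problem has a trivial offline solution that stores one counter per distinct item. A minimal baseline (the function name and example stream are illustrative, not from the talk):

```python
from collections import Counter

def exact_heavy_hitters(stream, phi):
    """Offline baseline: keep one counter per distinct item (up to n
    counters) and report every j with f_j > phi * m. The streaming
    algorithms in the talk aim for roughly 1/phi counters instead."""
    m = len(stream)
    freq = Counter(stream)
    return {j for j, f in freq.items() if f > phi * m}

# On the example stream 2 1 1 3 7 3 4, items 1 and 3 each occur
# twice out of m = 7, so they are the phi = 0.25 heavy hitters.
print(exact_heavy_hitters([2, 1, 1, 3, 7, 3, 4], 0.25))
```

The space cost of this baseline, up to n counters, is exactly what the streaming algorithms below avoid.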

3 Guarantees (figure: example frequencies f_1, …, f_6)

4 Outline. Optimal algorithm in all parameters φ, ε for the l_1-guarantee. Optimal algorithm for the l_2-guarantee for constant φ, ε

5 Misra-Gries (table: summary of key/value counter pairs)
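A minimal sketch of the Misra-Gries summary with k-1 counters (a standard rendering of the algorithm; the variable names are mine): every item with frequency above m/k survives in the table, with its count underestimated by at most m/k.

```python
def misra_gries(stream, k):
    """Misra-Gries summary with at most k-1 counters. Any item with
    frequency > m/k is guaranteed to remain in the table; stored
    counts undercount true frequencies by at most m/k."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # decrement every counter; drop those that hit zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Item 5 occurs 5 times out of m = 9 > m/k = 3, so it must survive.
print(misra_gries([5, 1, 5, 2, 5, 3, 5, 4, 5], 3))
```

Each decrement step removes k distinct stream occurrences, which can happen at most m/k times; that is the source of the undercount bound.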

6 Space Complexity of Misra-Gries

7 Our Results

8 A Simple Initial Improvement

9 Why Sampling Works? (figure: items sampled from the example stream)
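The intuition can be checked numerically: uniformly sampling positions of the stream preserves the relative frequency of a heavy item up to small error, by Chernoff bounds. A hedged demo (the sample size and the helper name are illustrative):

```python
import random
from collections import Counter

def sampled_relative_freq(stream, item, s, seed=0):
    """Draw s positions of the stream uniformly with replacement and
    return the item's relative frequency in the sample; this
    concentrates around f_item / m, so heavy hitters of the stream
    remain heavy hitters of the sample."""
    rng = random.Random(seed)
    sample = rng.choices(stream, k=s)
    return Counter(sample)[item] / s

# Item 1 has true relative frequency 0.5 in this stream; its sampled
# frequency over 10,000 draws should land close to 0.5.
stream = [1] * 500 + list(range(2, 502))
print(sampled_relative_freq(stream, 1, 10_000))
```

This is why one can afford to run a heavy-hitters algorithm on a much shorter sampled stream.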

10 Misra-Gries after Hashing

11 Maintaining Identities (tables: hashed key/value pairs alongside the actual key/value pairs)

12 Summary of Initial Improvement

13 An Optimal Algorithm


15 Dealing with Item Identifiers

16 Dealing with Item Counts

17 Conclusions on the l_1-guarantee

18 Outline. Optimal algorithm in all parameters φ, ε for the l_1-guarantee. Optimal algorithm for the l_2-guarantee for constant φ, ε

19 CountSketch achieves the l_2-guarantee [CCFC]
Assign each coordinate i a random sign σ(i) ∈ {-1, 1}
Randomly partition coordinates into B buckets; maintain c_j = Σ_{i: h(i) = j} σ(i) · f_i in the j-th bucket
Estimate f_i as σ(i) · c_{h(i)}
(Figure: frequency vector f_1, …, f_10 hashed into buckets)
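A compact rendering of CountSketch with the usual median-over-repetitions estimator (fully independent hash and sign tables are used here for brevity; the actual analysis only needs limited independence):

```python
import random
import statistics

class CountSketch:
    """rows independent repetitions of (sign table sigma, bucket hash h);
    f_i is estimated as the median over rows of sigma(i) * c[h(i)]."""
    def __init__(self, n, B, rows, seed=0):
        rng = random.Random(seed)
        self.sign = [[rng.choice((-1, 1)) for _ in range(n)]
                     for _ in range(rows)]
        self.bucket = [[rng.randrange(B) for _ in range(n)]
                       for _ in range(rows)]
        self.c = [[0] * B for _ in range(rows)]

    def update(self, i, delta=1):
        # add the signed increment to item i's bucket in every row
        for r in range(len(self.c)):
            self.c[r][self.bucket[r][i]] += self.sign[r][i] * delta

    def estimate(self, i):
        return statistics.median(
            self.sign[r][i] * self.c[r][self.bucket[r][i]]
            for r in range(len(self.c))
        )

# Item 0 occurs 1000 times and items 1..99 once each; the colliding
# light items contribute only small signed noise per row.
cs = CountSketch(n=100, B=20, rows=7)
for _ in range(1000):
    cs.update(0)
for i in range(1, 100):
    cs.update(i)
print(cs.estimate(0))
```

The random signs make colliding items cancel in expectation, and the median over rows suppresses the occasional bad bucket.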

20 Known Space Bounds for l_2-heavy hitters
CountSketch achieves O(log^2 n) bits of space
If the stream is allowed to have deletions, this is optimal [DPIW]
What about insertion-only streams? This is the model originally introduced by Alon, Matias, and Szegedy
Models internet search logs, network traffic, databases, scientific data, etc.
The only known lower bound is Ω(log n) bits, just to report the identity of the heavy hitter

21 Our Results [BCIW]

22 Simplifications

23 Intuition

24 Only gives 1 bit of information. Can’t repeat log n times in parallel, but can repeat log n times sequentially!

25 Repeating Sequentially
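The one-bit comparisons above can be strung together to recover the identity of a dominant item i* bit by bit. A heavily simplified toy version (my own names; it re-reads the stream once per bit, whereas the actual algorithm spends a fresh segment of the stream on each bit, which is exactly why the repetition must be sequential rather than parallel):

```python
import random

def learn_heavy_id(stream, n, nbits, seed):
    """For each bit position b, compare the signed Gaussian mass of
    items whose b-th bit is 0 against those whose b-th bit is 1; the
    side with the larger |counter| reveals bit b of the dominant
    item i*."""
    rng = random.Random(seed)
    g = [rng.gauss(0, 1) for _ in range(n)]
    ident = 0
    for b in range(nbits):
        c0 = sum(g[i] for i in stream if not (i >> b) & 1)
        c1 = sum(g[i] for i in stream if (i >> b) & 1)
        if abs(c1) > abs(c0):
            ident |= 1 << b
    return ident

# Item 13 dominates, so each bit comparison should go its way for
# most random seeds.
stream = [13] * 1000 + [2, 5, 9]
print([learn_heavy_id(stream, 16, 4, s) for s in range(5)])
```

Learning log n bits this way, one after another, is the sequential repetition the slide refers to.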

26 Gaussian Processes

27 Chaining Inequality [Fernique, Talagrand]

28 (figure: stream a_1, a_2, …, a_t, …, a_m) Apply the chaining inequality!

29 Applying the Chaining Inequality Same behavior as for random walks!

30 Removing Frequency Assumptions

31 Amplification
Create O(log log n) pairs of streams from the input stream: (stream L_1, stream R_1), (stream L_2, stream R_2), …, (stream L_{log log n}, stream R_{log log n})
For each j in O(log log n), choose a hash function h_j: {1, …, n} -> {0, 1}
Stream L_j is the original stream restricted to items i with h_j(i) = 0; stream R_j is the remaining part of the input stream
Maintain counters c_L = Σ_{i: h_j(i) = 0} g(i) · f_i and c_R = Σ_{i: h_j(i) = 1} g(i) · f_i
(Chaining Inequality + Chernoff) The larger counter usually belongs to the substream containing i*, and the larger counter stays larger forever if the Chaining Inequality holds
Run the algorithm on the items corresponding to the larger counters
The expected F_2 value of the items, excluding i*, is F_2/poly(log n), so i* is relatively heavier
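One round of this amplification step can be sketched as follows (a toy illustration; the names h, g, c_L, c_R follow the slide, and full independence is used for simplicity):

```python
import random

def amplification_round(stream, n, seed):
    """Split items into substreams L (h(i) = 0) and R (h(i) = 1) and
    keep signed Gaussian counters c_L, c_R; the side holding a
    dominant item i* usually ends with the larger |counter|."""
    rng = random.Random(seed)
    h = [rng.randrange(2) for _ in range(n)]   # substream assignment
    g = [rng.gauss(0, 1) for _ in range(n)]    # Gaussian signs
    c_L = c_R = 0.0
    for i in stream:
        if h[i] == 0:
            c_L += g[i]
        else:
            c_R += g[i]
    winner = 0 if abs(c_L) >= abs(c_R) else 1
    return winner, h

# A stream dominated by item 7: the chosen side should be the one
# containing item 7 for most random seeds.
stream = [7] * 1000 + [3, 12, 40]
hits = 0
for s in range(5):
    winner, h = amplification_round(stream, 50, s)
    if winner == h[7]:
        hits += 1
print(hits)
```

Repeating this O(log log n) times and keeping only the winning side each time strips away most of the light items' F_2 mass while retaining i*.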

32 Derandomization
We have to account for the randomness in our algorithm
We need to (1) derandomize a Gaussian process, and (2) derandomize the hash functions used to sequentially learn the bits of i*
We achieve (1) as follows: (Derandomized Johnson-Lindenstrauss) define the counters by first applying a Johnson-Lindenstrauss (JL) transform [KMN] to the frequency vector, reducing n dimensions to log n, then taking the inner product with fully independent Gaussians; (Slepian's Lemma) the counters don't change much, because a Gaussian process is determined by its covariances and all covariances are roughly preserved by JL
For (2), we derandomize an auxiliary algorithm via Nisan's pseudorandom generator [I]

33 An Optimal Algorithm [BCINWW]
Want O(log n) bits instead of O(log n log log n) bits
There are multiple sources of the O(log log n) factor:
Amplification: use a tree-based scheme and the fact that the heavy hitter becomes heavier!
Derandomization: show that O(1)-wise independence suffices for derandomizing a Gaussian process!

34 Conclusions on the l_2-guarantee

35 Accelerated Counters


