Published by Cassie Haig. Modified over 9 years ago.
1
Finding Frequent Items in Data Streams
Moses Charikar (Princeton University, Google Inc.), Kevin Chen (UC Berkeley, Google Inc.), Martin Farach-Colton (Rutgers University, Google Inc.)
Presented by Amir Rothschild
2
Presenting:
A 1-pass algorithm for estimating the most frequent items in a data stream using very limited storage space. The algorithm achieves especially good space bounds for Zipfian distributions.
A 2-pass algorithm for estimating the items with the largest change in frequency between two data streams.
3
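Definitions: Data stream: $S = q_1, q_2, \ldots, q_n$, where each $q_i \in O = \{o_1, \ldots, o_m\}$. Object $o_i$ appears $n_i$ times in $S$. Order the $o_i$ so that $n_1 \ge n_2 \ge \cdots \ge n_m$, and let $f_i = n_i / n$.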
4
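The first problem: FindApproxTop(S,k,ε)
Input: stream S, int k, real ε.
Output: k elements from S such that:
for every element $o_i$ in the output: $n_i > (1-\varepsilon) n_k$.
Contains every item with: $n_i > (1+\varepsilon) n_k$.
(Figure: item counts $n_1, n_2, \ldots, n_k$ in decreasing order.)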
5
Clarifications: This is not the problem discussed last week! The sampling algorithm does not give any bounds for this version of the problem.
6
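Hash functions: We say that h is a pairwise independent hash function if h is chosen uniformly at random from a family H such that, for every two distinct inputs $x \ne y$ and every two values $a, b$ in the range $R$ of h: $\Pr_{h \in H}[h(x) = a \wedge h(y) = b] = 1/|R|^2$.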
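As an illustration of the definition, here is a small sketch that checks pairwise independence by brute force. The family h(x) = (a·x + c) mod P over a small prime P is an assumption chosen for the example (the slides do not name a specific family), and the names `P`, `h` and `pair_counts` are illustrative.

```python
from itertools import product

P = 7  # small prime, so the whole family (all pairs a, c) can be enumerated

def h(a, c, x):
    """One member of the assumed family: h(x) = (a*x + c) mod P."""
    return (a * x + c) % P

def pair_counts(x, y):
    """Count how often each value pair (h(x), h(y)) occurs over the whole family."""
    counts = {}
    for a, c in product(range(P), repeat=2):
        key = (h(a, c, x), h(a, c, y))
        counts[key] = counts.get(key, 0) + 1
    return counts

counts = pair_counts(2, 5)
# For distinct inputs, every one of the P*P value pairs occurs exactly once,
# so each pair has probability 1/P**2 when (a, c) is drawn uniformly.
print(len(counts), set(counts.values()))
```

Enumerating the family is only feasible because P is tiny; in practice one would draw a single (a, c) at random.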
7
Let's start with some intuition… Idea: Let s be a hash function from objects to {+1,-1}, and let c be a counter. For each $q_i$ in the stream, update $c \mathrel{+}= s(q_i)$. Estimate $n_i$ by $c \cdot s(o_i)$ (since $E[c \cdot s(o_i)] = n_i$).
8
Realization (figure): an example stream $o_1, o_2, o_2, o_2, o_3, o_2$ evaluated under several sign functions $s_1, \ldots, s_4$, with the expectation E in the last row.
9
Claim: $E[c \cdot s(o_i)] = n_i$.
Proof: For each element $o_j$ other than $o_i$, $s(o_j) \cdot s(o_i) = -1$ w.p. 1/2 and $s(o_j) \cdot s(o_i) = +1$ w.p. 1/2. So $o_j$ adds $+n_j$ to the estimate w.p. 1/2 and $-n_j$ w.p. 1/2, and so has no influence on the expectation. $o_i$, on the other hand, adds $+n_i$ w.p. 1 (since $s(o_i) \cdot s(o_i) = +1$). So the expectation (average) is $+n_i$.
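The claim can be checked on a toy stream by enumerating every possible sign function. This is a minimal sketch: the stream and the item names "a", "b", "c" are made up for the example, and the signs are taken to be fully independent, which is stronger than the pairwise independence the argument needs.

```python
from itertools import product

stream = ["a"] * 5 + ["b"] * 3 + ["c"] * 2   # n_a = 5, n_b = 3, n_c = 2
objects = sorted(set(stream))

def estimate(signs, target):
    """Run the single-counter scheme: c += s(q) per item, then return c * s(target)."""
    c = 0
    for q in stream:
        c += signs[q]
    return c * signs[target]

# Enumerate all 2**3 assignments of +1/-1 to the three objects.
estimates = []
for values in product([+1, -1], repeat=len(objects)):
    signs = dict(zip(objects, values))
    estimates.append(estimate(signs, "a"))

mean = sum(estimates) / len(estimates)
print(mean, sorted(set(estimates)))
```

The mean over all sign functions equals n_a = 5, but individual estimates range from 0 to 10, which is the high variance the next slide complains about.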
10
That's not enough: The variance is very high. O(m) objects have estimates that are wrong by more than the variance.
11
First attempt to fix the algorithm…
t independent hash functions $S_j$
t different counters $C_j$
For each element $q_i$ in the stream:
    For each j in {1,2,…,t} do $C_j \mathrel{+}= S_j(q_i)$
Take the mean or the median of the estimates $C_j \cdot S_j(o_i)$ to estimate $n_i$.
(Figure: six counters $C_1, \ldots, C_6$, each with its own sign function $S_1, \ldots, S_6$.)
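The t-counter variant can be sketched as follows. The sign functions here are lazily filled random tables rather than true hash functions, which is a simplification assumed for the example, and the helper names are illustrative.

```python
import random
from statistics import mean, median

def make_sign_function(rng):
    """Return a function s(q) in {+1, -1}; values are drawn on first use and then fixed."""
    signs = {}
    def s(q):
        if q not in signs:
            signs[q] = rng.choice((+1, -1))
        return signs[q]
    return s

def estimates(stream, target, t, seed=0):
    """Maintain t counters C_j += S_j(q) and return the t estimates C_j * S_j(target)."""
    rng = random.Random(seed)
    sign_functions = [make_sign_function(rng) for _ in range(t)]
    counters = [0] * t
    for q in stream:
        for j in range(t):
            counters[j] += sign_functions[j](q)
    return [counters[j] * sign_functions[j](target) for j in range(t)]

stream = ["a"] * 100 + ["b"]
ests = estimates(stream, "a", t=6)
print(mean(ests), median(ests))
```

With one item of count 100 and one of count 1, every individual estimate is either 99 or 101, so both the mean and the median land in that interval.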
12
Still not enough: Collisions with high-frequency elements like $o_1$ can spoil most estimates of lower-frequency elements, such as $o_k$.
13
The solution !!! Divide & Conquer: Don't let each element update every counter. More precisely: replace each counter $C_i$ with a hash table $T_i$ of b counters, and have each item update one counter per hash table, chosen by a hash function $h_i$ and signed by $S_i$.
14
Presenting the CountSketch algorithm… Let’s start working…
15
CountSketch data structure (figure): t hash tables $T_1, \ldots, T_t$ of b buckets each; table $T_i$ has its own bucket hash $h_i$ and sign hash $S_i$.
16
The CountSketch data structure
Define the CountSketch d.s. as follows:
Let t and b be parameters with values determined later.
$h_1, \ldots, h_t$: hash functions from objects O to {1,2,…,b}.
$T_1, \ldots, T_t$: arrays of b counters.
$S_1, \ldots, S_t$: hash functions from objects O to {+1,-1}.
From now on, define: $h_i[o_j] := T_i[h_i(o_j)]$
17
The d.s. supports 2 operations:
Add(q): for each i in {1,…,t}: $h_i[q] \mathrel{+}= S_i(q)$.
Estimate(q): return $\mathrm{median}_i \{ h_i[q] \cdot S_i(q) \}$.
Why median and not mean? In order to show the median is close to reality it's enough to show that ½ of the estimates are good. The mean, on the other hand, is very sensitive to outliers.
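A minimal sketch of the data structure with its two operations. Several choices here are assumptions made for the example rather than part of the slides: items are non-negative integers, the hash functions have the form ((a·x + c) mod P) mod b (which is only approximately pairwise independent), bucket indices are 0-based, and the class, method and parameter names are illustrative.

```python
import random
from statistics import median

P = 2_147_483_647  # prime modulus (2**31 - 1), assumed larger than any item id

class CountSketch:
    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.b = b
        self.tables = [[0] * b for _ in range(t)]  # T_1..T_t, b counters each
        # One (a, c) pair per table for the bucket hash h_i and one for the sign hash S_i.
        self.h_params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.s_params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]

    def _h(self, i, q):
        a, c = self.h_params[i]
        return ((a * q + c) % P) % self.b

    def _s(self, i, q):
        a, c = self.s_params[i]
        return 1 if ((a * q + c) % P) % 2 == 0 else -1

    def add(self, q, count=1):
        """Add(q): in every table, move q's counter by S_i(q) * count."""
        for i, table in enumerate(self.tables):
            table[self._h(i, q)] += self._s(i, q) * count

    def estimate(self, q):
        """Estimate(q): median over the tables of h_i[q] * S_i(q)."""
        return median(table[self._h(i, q)] * self._s(i, q)
                      for i, table in enumerate(self.tables))

sketch = CountSketch(t=5, b=16)
for _ in range(40):
    sketch.add(7)
print(sketch.estimate(7))  # a lone item is estimated exactly, so this prints 40
```

The optional `count` argument is an addition for convenience; calling `add(q)` with the default matches the single-increment update on the slide.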
18
Finally, the algorithm:
Keep a CountSketch d.s. C, and a heap of the top k elements.
Given a data stream $q_1, \ldots, q_n$:
For each j=1,…,n:
    C.Add($q_j$);
    If $q_j$ is in the heap, increment its count.
    Else, if C.Estimate($q_j$) > smallest estimated count in the heap, add $q_j$ to the heap.
    (If the heap is full, evict the object with the smallest estimated count from it.)
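The loop above can be sketched end to end. This is an illustration under stated assumptions, not the authors' implementation: items are non-negative integers, the hash functions are of the form ((a·x + c) mod P) mod b, a plain dictionary with a linear scan for the minimum stands in for the heap, and all names (`CountSketch`, `find_approx_top`, `top`) are illustrative.

```python
import random
from statistics import median

P = 2_147_483_647  # prime modulus, assumed larger than any item id

class CountSketch:
    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.b = b
        self.tables = [[0] * b for _ in range(t)]
        self.h_params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.s_params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]

    def _h(self, i, q):
        a, c = self.h_params[i]
        return ((a * q + c) % P) % self.b

    def _s(self, i, q):
        a, c = self.s_params[i]
        return 1 if ((a * q + c) % P) % 2 == 0 else -1

    def add(self, q):
        for i, table in enumerate(self.tables):
            table[self._h(i, q)] += self._s(i, q)

    def estimate(self, q):
        return median(table[self._h(i, q)] * self._s(i, q)
                      for i, table in enumerate(self.tables))

def find_approx_top(stream, k, t=5, b=64, seed=0):
    sketch = CountSketch(t, b, seed)
    top = {}  # item -> estimated count; stands in for the heap of the top k elements
    for q in stream:
        sketch.add(q)
        if q in top:
            top[q] += 1                      # already tracked: increment its count
            continue
        est = sketch.estimate(q)
        if len(top) < k:
            top[q] = est                     # room left: track it
        else:
            smallest = min(top, key=top.get)
            if est > top[smallest]:          # beats the weakest tracked item: evict and add
                del top[smallest]
                top[q] = est
    return top

result = find_approx_top([1, 1, 1, 2, 2, 1, 2], k=2)
print(result)
```

A real heap (for example via `heapq` with an index for membership tests) would make the eviction step logarithmic instead of linear in k.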
19
And now for the hard part: Algorithm analysis
20
Definitions
21
Claims & Proofs
23
The CountSketch algorithm space complexity:
24
Zipfian distribution: analysis of the CountSketch algorithm for the Zipfian distribution
25
Zipfian distribution: Zipfian(z): $n_q = c / q^z$ for some constant c. This distribution is very common in human languages (useful in search engines).
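To make the definition concrete, here is a small sketch that computes the relative frequencies of a Zipfian(z) distribution over m ranked items, with the constant chosen so the frequencies sum to 1. The function name and the normalization are choices made for the example.

```python
def zipf_frequencies(m, z):
    """Return f_1..f_m with f_q proportional to 1 / q**z, normalized to sum to 1."""
    weights = [1.0 / q ** z for q in range(1, m + 1)]
    total = sum(weights)
    return [w / total for w in weights]

freqs = zipf_frequencies(m=1000, z=1.0)
# With z = 1 the most frequent item is twice as frequent as the second,
# three times as frequent as the third, and so on.
print(freqs[:3])
```

Larger z concentrates more of the mass on the first few ranks, which is the regime where the sketch needs the least space.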
26
(Figure: plot labeled Pr(o_i = q) against q.)
27
Observations: The k most frequent elements can only be preceded by elements j with $n_j > (1-\varepsilon) n_k$. => Choosing l instead of k so that $n_{l+1} < (1-\varepsilon) n_k$ will ensure that our list includes the k most frequent elements. (Figure: counts $n_1, n_2, \ldots, n_k, \ldots, n_{l+1}$ in decreasing order.)
28
Analysis for Zipfian distribution For this distribution the space complexity of the algorithm is where:
29
Proof of the space bounds: Part 1, l=O(k)
30
Proof of the space bounds: Part 2
31
Comparison of space requirements for random sampling vs. our algorithm
32
Yet another algorithm which uses the CountSketch d.s.: finding items with the largest frequency change
33
The problem: Let $n_o^S$ be the number of occurrences of o in S. Given 2 streams $S_1, S_2$, find the items o such that $|n_o^{S_1} - n_o^{S_2}|$ is maximal. 2-pass algorithm.
34
The algorithm – first pass First pass – only update the counters:
35
The algorithm – second pass Pass over S1 and S2 and:
36
Explanation Though A can change, items once removed are never added back. Thus accurate exact counts can be maintained for all objects currently in A. Space bounds for this algorithm are similar to those of the former with replaced by
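One way the two passes could fit together is sketched below. The details are assumptions made for the example: the first pass subtracts each item of S1 from the sketch and adds each item of S2, and the second pass keeps a set A of the l items with the largest absolute estimate while counting exact occurrences for the items currently in A. Items are assumed to be non-negative integers, and all names are illustrative.

```python
import random
from statistics import median

P = 2_147_483_647  # prime modulus, assumed larger than any item id

class CountSketch:
    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.b = b
        self.tables = [[0] * b for _ in range(t)]
        self.h_params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.s_params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]

    def _h(self, i, q):
        a, c = self.h_params[i]
        return ((a * q + c) % P) % self.b

    def _s(self, i, q):
        a, c = self.s_params[i]
        return 1 if ((a * q + c) % P) % 2 == 0 else -1

    def add(self, q, count=1):
        for i, table in enumerate(self.tables):
            table[self._h(i, q)] += self._s(i, q) * count

    def estimate(self, q):
        return median(table[self._h(i, q)] * self._s(i, q)
                      for i, table in enumerate(self.tables))

def max_change(s1, s2, k, l, t=5, b=64, seed=0):
    sketch = CountSketch(t, b, seed)
    # First pass: only update the counters (assumed: -1 for S1, +1 for S2).
    for q in s1:
        sketch.add(q, -1)
    for q in s2:
        sketch.add(q, +1)
    # Second pass: A maps item -> [abs estimate, exact count in S1, exact count in S2].
    A = {}
    for which, stream in ((1, s1), (2, s2)):
        for q in stream:
            if q not in A:
                est = abs(sketch.estimate(q))
                if len(A) >= l:
                    weakest = min(A, key=lambda o: A[o][0])
                    if est <= A[weakest][0]:
                        continue             # rejected now, and on every later occurrence
                    del A[weakest]           # evicted items can never get back in
                A[q] = [est, 0, 0]
            A[q][which] += 1
    ranked = sorted(A, key=lambda o: abs(A[o][2] - A[o][1]), reverse=True)
    return [(o, A[o][2] - A[o][1]) for o in ranked[:k]]

s1 = [1] * 10 + [2] * 5
s2 = [1] * 2 + [2] * 5 + [3]
print(max_change(s1, s2, k=1, l=3))
```

Because the sketch is frozen during the second pass, each item's estimate never changes while the admission threshold only rises, which is why an item is either tracked from its first occurrence or never tracked, and the counts kept for A are exact.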