Finding Frequent Items in Data Streams


1 Finding Frequent Items in Data Streams
Moses Charikar, Kevin Chen & Martin Farach-Colton

2 Problem Statement
Goal: Given a data stream, return an approximate list of the k most frequent items in one pass and sub-linear space. Application: search engine query streams.

3 Problem Statement Our Solution:
A data structure called a Count Sketch that gives good estimates of the counts of the high frequency elements at every point in the stream

4 Notation
n_i is the frequency of the i-th most frequent element in the stream; n is the total number of elements in the stream; m is the total number of distinct elements.

5 FindCandidateTop
Input: stream S, int k, int p. Output: list of p elements containing the top k. Naive sampling gives a solution with p = (n log n / n_k).

6 FindApproxTop
Input: stream S, int k, real ε. Output: list of k elements, each of frequency n_i > (1-ε) n_k. Naive sampling gives no solution for this problem.

7 FindMaxChange
Input: data streams S1 and S2, int k, real ε. Output: list of k elements with n_i > (1-ε) n_k. Previously unconsidered in the literature. We consider only absolute change, not percentage change.

8 Related Work
Iceberg Queries (Fang et al.); Concise and Counting Samples (Gibbons and Matias); Frequency Moments (Alon, Matias and Szegedy)

9 Lower Bounds
Lower bound of Ω(m) on computing n_1 (Alon, Matias and Szegedy).
[Figure: plots of n_i against i]

10 Summary of Results
First known theoretical bounds for FindApproxTop and FindMaxChange. Bounds for FindCandidateTop that beat naive sampling for certain interesting distributions (Zipfian).

11 Intuition
Consider a single counter c with a single hash function s: {q} → {+1, -1}. On seeing each element q_i, update the counter with c += s(q_i). Claim: E[c · s(q_i)] = n_i. Proof idea: cross-terms cancel because of pairwise independence.
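The claim can be checked by expanding the counter. A short derivation, assuming the distinct elements are indexed q_1, ..., q_m with frequencies n_1, ..., n_m (as on the notation slide) and that s is drawn from a pairwise independent family with E[s(q)] = 0:

```latex
\begin{align*}
c &= \sum_{j=1}^{m} n_j \, s(q_j), \\
\mathbb{E}[c \cdot s(q_i)]
  &= \sum_{j=1}^{m} n_j \, \mathbb{E}[s(q_j)\, s(q_i)] \\
  &= n_i \, \mathbb{E}[s(q_i)^2]
     + \sum_{j \neq i} n_j \, \mathbb{E}[s(q_j)]\, \mathbb{E}[s(q_i)] \\
  &= n_i .
\end{align*}
```

The last step uses s(q_i)^2 = 1 and the fact that each cross-term factors into a product of zero-mean signs.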

12 Finding the max element
Problem with the single counter scheme: variance is too high. Replace with an array of t counters, using independent hash functions s_1, ..., s_t, each s_j: q → {+1, -1}.

13 Analysis of “array of counters” data structure
Expectation still correct. Claim: variance of each estimate < Σ n_i^2, so variance of final estimate < Σ n_i^2 / t. Proof idea: cross-terms cancel. Set t = O(log n · Σ n_i^2 / (n_1)^2) to get the answer with high probability. Proof idea: “median of averages”.
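A minimal Python sketch of the array of counters, for illustration only. The class name `SignCounterArray` is hypothetical, the sign functions are lazily drawn fully random signs standing in for the pairwise independent hash functions, and the estimate is a plain average rather than the "median of averages" used in the proof.

```python
import random


class SignCounterArray:
    """t counters, each with its own +/-1 sign function (illustrative only)."""

    def __init__(self, t, seed=0):
        self.t = t
        self.counters = [0] * t
        self._rng = random.Random(seed)
        # Assumption: per-item random signs stored in dicts stand in for the
        # hash functions s_1..s_t. Storing them takes linear space, so this
        # is a teaching aid, not a sub-linear-space implementation.
        self._signs = [dict() for _ in range(t)]

    def _sign(self, j, item):
        table = self._signs[j]
        if item not in table:
            table[item] = self._rng.choice((-1, 1))
        return table[item]

    def update(self, item):
        # c_j += s_j(item) for every counter j.
        for j in range(self.t):
            self.counters[j] += self._sign(j, item)

    def estimate(self, item):
        # Average of the t single-counter estimates c_j * s_j(item).
        total = sum(self.counters[j] * self._sign(j, item) for j in range(self.t))
        return total / self.t


arr = SignCounterArray(t=8)
for _ in range(5):
    arr.update("a")
print(arr.estimate("a"))  # exact here because "a" is the only item seen
```

With a single distinct item there are no cross-terms, so the estimate is exact; with many items each counter mixes all of them, which is the variance problem the next slide addresses.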

14 Problem with “array of counters” data structure
Sum will be dominated by large elements, and important elements such as the k-th most frequent (frequency n_k) will be corrupted by larger elements. To avoid collisions, replace each counter with a hashtable of b counters to spread out the large elements.

15 Count Sketch data structure
2t pairwise independent hash functions, h_1, ..., h_t and s_1, ..., s_t. All hash functions are independent of each other. The data structure is an array of hashtables of counters, with h_i: q → {1, ..., b} and s_i: q → {+1, -1} for i = 1, ..., t.

16 Algorithm
On seeing each item q, for each hashtable i add s_i(q) to the counter in bucket h_i(q). Estimate(q) = median_i { (counter in bucket h_i(q)) · s_i(q) }. Maintain a heap of the k top elements seen so far.
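The update and estimate steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `CountSketch` and `approx_top_k` are hypothetical, the hashes of the form (a·x + c) mod p only approximate pairwise independent families, and a small dict with min-eviction plays the role of the heap.

```python
import random
from statistics import median

_P = (1 << 61) - 1  # Mersenne prime used as the hash modulus


class CountSketch:
    """t rows of b counters; row i has a bucket hash h_i and a sign hash s_i."""

    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.b = b
        self.rows = [[0] * b for _ in range(t)]
        # Assumption: random linear hashes modulo a prime stand in for the
        # pairwise independent functions h_1..h_t and s_1..s_t.
        self.h_params = [(rng.randrange(1, _P), rng.randrange(_P)) for _ in range(t)]
        self.s_params = [(rng.randrange(1, _P), rng.randrange(_P)) for _ in range(t)]

    def _cells(self, item):
        x = hash(item) % _P
        for row, (ha, hc), (sa, sc) in zip(self.rows, self.h_params, self.s_params):
            bucket = ((ha * x + hc) % _P) % self.b
            sign = 1 if ((sa * x + sc) % _P) & 1 else -1
            yield row, bucket, sign

    def update(self, item):
        # Add s_i(item) to bucket h_i(item) in every row.
        for row, bucket, sign in self._cells(item):
            row[bucket] += sign

    def estimate(self, item):
        # Median over rows of (counter in bucket h_i(item)) * s_i(item).
        return median(row[bucket] * sign for row, bucket, sign in self._cells(item))


def approx_top_k(stream, k, t=5, b=256):
    """Track the k items with the largest current estimates (dict as a heap stand-in)."""
    sketch = CountSketch(t, b)
    top = {}
    for item in stream:
        sketch.update(item)
        top[item] = sketch.estimate(item)
        if len(top) > k:
            del top[min(top, key=top.get)]  # evict the smallest estimate
    return sorted(top.items(), key=lambda kv: -kv[1])


print(approx_top_k(["a"] * 5, k=1))
```

The defaults t=5 and b=256 are arbitrary illustration values; the later slides discuss how t and b should scale.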

17 Note on taking the median
We have not completely eliminated the problem of colliding with high frequency elements. The median is not sensitive to these poor estimates, whereas the mean is.
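A small numeric illustration of this point, using made-up per-row estimates (the values are hypothetical, chosen so that one row has collided with a very frequent item):

```python
from statistics import mean, median

# Hypothetical per-row estimates for an item whose true count is about 10;
# the last row collided with a high frequency element.
row_estimates = [9, 10, 11, 10, 500]

mean_estimate = mean(row_estimates)      # pulled far off by the one bad row
median_estimate = median(row_estimates)  # stays at a typical row value

print(mean_estimate, median_estimate)
```

As long as fewer than half of the rows are corrupted, the median lands on an uncorrupted row's estimate.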

18 Avoiding Large Items
If each hashtable has > O(k) buckets, then with constant probability a particular item does not collide with any of the top k elements. The t hashtables represent independent trials, and by Chernoff bounds we need log(n/δ) trials to get high-probability results.

19 Analysis of b
In addition to the requirement that b > O(k), we also need the variance contributed by the small elements we collide with to be small. Claim: E[Var of each estimate] < (Σ n_i^2)/b. Final bound on b = Σ n_i^2 / (n_k)^2.

20 Error bound analysis
[Figure: plot of n_i against i, with n_k marked]

21 Final Results
FindApproxTop: log(n/δ) (k + Σ n_i^2 / (n_k)^2). FindMaxChange: 2-pass algorithm in same space bounds. FindCandidateTop, Zipf parameter 0.5: our bound O(k log m log n); sampling bound O((km)^0.5 log n).

22 Further Work
Lower bounds; max percent change problem

