1 Efficient Computation of Frequent and Top-k Elements in Data Streams
2 Motivation
Motivated by Internet advertising commissioners. Before rendering an advertisement for a user, query the click stream to decide which advertisements to display.
–If the user's profile is not a frequent “clicker”, then s/he will probably not click any displayed advertisement: show Pay-Per-Impression advertisements.
–If the user's profile is a frequent “clicker”, then s/he may click a displayed advertisement: show Pay-Per-Click advertisements, and retrieve the top advertisements to choose what to display.
3 Problem Definition
Given an alphabet A and a stream S of size N:
–A frequent element, E, is an element whose frequency, F, exceeds a user-specified support, φN.
–The top-k elements are the k elements with the highest frequency.
Both problems:
–Are closely related, yet no integrated solution has been proposed.
–Need O(min(N, |A|)) space for an exact solution, which motivates approximate variations.
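As a point of reference, the exact versions of both problems can be sketched with one counter per distinct element, which is what costs O(min(N, |A|)) space. The function names `exact_frequent` and `exact_top_k` are illustrative, not from the slides; the stream is the example stream used on the Space-Saving slides.

```python
from collections import Counter


def exact_frequent(stream, phi):
    """Return every element whose frequency F exceeds phi * N."""
    counts = Counter(stream)          # one counter per distinct element
    n = sum(counts.values())          # N, the stream size
    return {e for e, f in counts.items() if f > phi * n}


def exact_top_k(stream, k):
    """Return the k elements with the highest frequency, most frequent first."""
    return [e for e, _ in Counter(stream).most_common(k)]


stream = "ABBACABBDDBEC"              # N = 13
frequent = exact_frequent(stream, phi=0.2)   # threshold is 0.2 * 13 = 2.6
top2 = exact_top_k(stream, k=2)
print(sorted(frequent), top2)
```

Both queries share the same table of counts, which hints at why an integrated solution is natural.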
4 Practical Frequent Elements
ε-Deficient Frequent Elements [Manku ‘02]:
–All frequent elements output should have F > (φ - ε)N, where ε is the user-defined error.
[Figure: thresholds φN and (φ - ε)N]
5 Practical Top-k
FindApproxTop(S, k, ε) [Charikar ‘02]:
–Retrieve a list of k elements such that every element, E_i, in the list has F_i > (1 - ε) F_k, where E_k is the k-th ranked element.
[Figure: thresholds F_4 and (1 - ε) F_4]
6 The Space-Saving Algorithm
–Space-Saving is counter-based.
–Monitors only m elements.
–Makes only over-estimation errors.
–Frequency estimation is more accurate for significant elements.
–Keeps track of the maximum possible error of each counter.
7 Space-Saving By Example
Space-Saving Algorithm:
–For every element in the stream S:
–If a monitored element is observed, increment its Count.
–If a non-monitored element is observed, replace the element with minimum hits, min, and set the new element's Count to min + 1; its maximum possible over-estimation, error, is min.

Example with m = 3 counters on S = ABBACABBDDBEC:
Stream so far | Element | Count | error (max possible)
(empty) | - | - | -
ABBAC | A B C | 2 2 1 | 0 0 0
ABBACA | A B C | 3 2 1 | 0 0 0
ABBACABB | B A C | 4 3 1 | 0 0 0
ABBACABBD | B A D | 4 3 2 | 0 0 1
ABBACABBDDB | B A D | 5 3 3 | 0 0 1
ABBACABBDDBE | B E A | 5 4 3 | 0 3 0
ABBACABBDDBEC | B E C | 5 4 4 | 0 3 3
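The steps above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it keeps the counters in a plain dictionary and finds the minimum by scanning, so it is O(m) per replacement rather than constant time. The slide does not specify how ties for the minimum are broken; the rule used here (prefer the element with the larger error) is an assumption chosen because it reproduces the example table.

```python
def space_saving(stream, m):
    """Return {element: [count, error]} while monitoring at most m elements."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x][0] += 1               # monitored: increment its Count
        elif len(counters) < m:
            counters[x] = [1, 0]              # free counter: exact so far
        else:
            # Evict the element with minimum hits. Tie-break (assumption):
            # among equal counts, evict the one with the larger error.
            victim = min(counters, key=lambda e: (counters[e][0], -counters[e][1]))
            min_count = counters.pop(victim)[0]
            # New element gets Count = min + 1 and error = min.
            counters[x] = [min_count + 1, min_count]
    return counters


summary = space_saving("ABBACABBDDBEC", m=3)
for element, (count, error) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
    print(element, count, error)
```

On the example stream this ends with B (Count 5, error 0), E (Count 4, error 3), and C (Count 4, error 3), matching the last row of the table.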
8 Space-Saving Observations
For S = ABBACABBDDBEC, N = 13, m = 3, the final Stream-Summary is:
Element | Count | error (max possible)
B | 5 | 0
E | 4 | 3
C | 4 | 3
Observations:
–The summation of the Counts is N (5 + 4 + 4 = 13).
–The minimum number of hits, min ≤ N/m. In this example, min = 4 ≤ 13/3.
–The minimum number of hits, min, is an upper bound on the error of any element.
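The three observations can be checked directly against the example summary. The snippet below hard-codes the final table from the slide (it does not rerun the algorithm), and the variable names are illustrative.

```python
stream = "ABBACABBDDBEC"
m = 3
# (element, Count, error) rows of the final Stream-Summary from the example.
summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]

n = len(stream)
counts = [count for _, count, _ in summary]
min_count = min(counts)

sum_is_n = sum(counts) == n                                    # Counts sum to N
min_bounded = min_count <= n / m                               # min <= N/m
error_bounded = all(err <= min_count for _, _, err in summary) # error <= min

print(n, min_count, sum_is_n, min_bounded, error_bounded)
```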
9 Space-Saving Proved Properties
For S = ABBACABBDDBEC, N = 13, with Stream-Summary B (Count 5, error 0), E (Count 4, error 3), C (Count 4, error 3):
1. If element E has frequency F > min, then E must be in Stream-Summary. Example: F(B) = F_1 = 5, min = 4.
2. The Count at position i in Stream-Summary is no less than F_i, the frequency of the i-th ranked element. Example: F(A) = F_2 = 3, Count_2 = 4.
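Both properties can be verified on the example by comparing the summary with exact frequencies. This is only a check on one stream, not a proof; the summary rows are copied from the slide and the variable names are illustrative.

```python
from collections import Counter

stream = "ABBACABBDDBEC"
# (element, Count, error) rows, ordered by Count as in Stream-Summary.
summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]

true_freq = Counter(stream)                       # exact F for every element
min_count = min(count for _, count, _ in summary)
monitored = {e for e, _, _ in summary}

# Property 1: every element with F > min is monitored.
prop1 = all(e in monitored for e, f in true_freq.items() if f > min_count)

# Property 2: Count at position i is at least F_i, the i-th largest frequency.
ranked = sorted(true_freq.values(), reverse=True)  # F_1 >= F_2 >= ...
prop2 = all(count >= ranked[i] for i, (_, count, _) in enumerate(summary))

print(prop1, prop2, ranked[:3])
```

Note that A, the true second-ranked element with F_2 = 3, is no longer monitored; Property 2 only bounds the Count at each position, it does not say which element sits there.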
10 Space-Saving Data Structure
We need a data structure that:
–Increments counters in constant time.
–Keeps elements sorted by their counters.
We propose the Stream-Summary structure, similar to the data structure in [Demaine ’02].
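One way to meet both requirements is a doubly linked list of buckets, one bucket per distinct counter value, kept in increasing order. The sketch below is an assumption about how such a structure could look, not the authors' code: the class names `Bucket` and `StreamSummary`, the method `offer`, and the choice to evict the most recently added element of the minimum bucket are all illustrative.

```python
class Bucket:
    """All monitored elements that share one counter value."""
    def __init__(self, count):
        self.count = count
        self.elements = {}          # element -> error (dict keeps insertion order)
        self.prev = None
        self.next = None


class StreamSummary:
    def __init__(self, m):
        self.m = m
        self.head = None            # bucket with the smallest count
        self.where = {}             # element -> bucket that holds it

    def _unlink_if_empty(self, bucket):
        if bucket.elements:
            return
        if bucket.prev:
            bucket.prev.next = bucket.next
        else:
            self.head = bucket.next
        if bucket.next:
            bucket.next.prev = bucket.prev

    def _place(self, element, error, count, after):
        """Put element in the bucket for `count`, located right after `after`."""
        nxt = after.next if after else self.head
        if nxt is None or nxt.count != count:
            new = Bucket(count)     # splice a new bucket between after and nxt
            new.prev, new.next = after, nxt
            if after:
                after.next = new
            else:
                self.head = new
            if nxt:
                nxt.prev = new
            nxt = new
        nxt.elements[element] = error
        self.where[element] = nxt

    def offer(self, element):
        if element in self.where:               # monitored: move up one bucket
            old = self.where[element]
            error = old.elements.pop(element)
            self._place(element, error, old.count + 1, old)
            self._unlink_if_empty(old)
        elif len(self.where) < self.m:          # free counter available
            self._place(element, 0, 1, None)
        else:                                   # replace an element with min hits
            old = self.head
            victim, _ = old.elements.popitem()  # most recently added to the bucket
            del self.where[victim]
            self._place(element, old.count, old.count + 1, old)
            self._unlink_if_empty(old)

    def items(self):
        """Return (element, count, error) triples, largest count first."""
        out, bucket = [], self.head
        while bucket:
            out.extend((e, bucket.count, err) for e, err in bucket.elements.items())
            bucket = bucket.next
        return out[::-1]


ss = StreamSummary(m=3)
for x in "ABBACABBDDBEC":
    ss.offer(x)
print(ss.items())
```

Every update touches at most two neighbouring buckets and two dictionary entries, so `offer` runs in constant expected time, and walking the list yields the elements already sorted by counter.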
11 Frequent Elements Queries
–Traverse Stream-Summary and report all elements whose Count satisfies the user support, φN.
–Any element whose guaranteed hits = (Count – error) > φN is guaranteed to be a frequent element.
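The query can be sketched as a single pass over the summary. The function name `frequent_elements` is illustrative, and the input is the small three-counter summary from the Space-Saving example with a hypothetical support φ = 0.3, chosen only to make the two output lists differ.

```python
def frequent_elements(summary, n, phi):
    """Split monitored elements into candidates and guaranteed frequent elements.

    summary: iterable of (element, count, error) rows.
    """
    threshold = phi * n
    candidates = [e for e, count, _ in summary if count > threshold]
    guaranteed = [e for e, count, error in summary if count - error > threshold]
    return candidates, guaranteed


summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
candidates, guaranteed = frequent_elements(summary, n=13, phi=0.3)  # threshold 3.9
print(candidates, guaranteed)
```

Here all three monitored elements are candidates, but only B has enough guaranteed hits (5 - 0 > 3.9) to be reported with certainty.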
12 Frequent Elements Example
For N = 73, m = 8, φ = 0.15:
–Frequent elements should have a support of 11 hits (φN = 10.95).
–Candidate frequent elements are B, D, and G.
–Guaranteed frequent elements are B and D, since their guaranteed hits > 11.
[Table: Element (B, D, G, A, Q, F, C, E), Count, error, Guaranteed Hits = Count - error]
13 Top-k Elements Queries
Traverse the Stream-Summary and report the top-k elements. From Property 2, we assert:
–Guaranteed top-k elements: any element whose guaranteed hits = (Count – error) ≥ Count_{k+1} is guaranteed to be in the top-k.
–Guaranteed top-k’ (where k’ ≈ k): the top-k’ elements reported are guaranteed to be the correct top-k’ iff, for every element in the top-k’, guaranteed hits = (Count – error) ≥ Count_{k’+1}.
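Both guarantees reduce to comparing guaranteed hits against Count_{k+1}. The sketch below uses the hypothetical name `top_k_query` and, as an assumption, rejects k equal to the number of monitored elements, because Count_{k+1} is then not available in the summary. The input is the three-counter summary from the Space-Saving example.

```python
def top_k_query(summary, k):
    """Return (reported, guaranteed, exact) for a top-k query.

    summary: iterable of (element, count, error) rows.
    reported:   the k elements with the largest Counts.
    guaranteed: reported elements with Count - error >= Count_{k+1}.
    exact:      True when every reported element is guaranteed.
    """
    ranked = sorted(summary, key=lambda row: row[1], reverse=True)
    if not 0 < k < len(ranked):
        raise ValueError("k must satisfy 0 < k < number of monitored elements")
    next_count = ranked[k][1]                     # Count_{k+1}
    top = ranked[:k]
    reported = [e for e, _, _ in top]
    guaranteed = [e for e, count, error in top if count - error >= next_count]
    return reported, guaranteed, len(guaranteed) == k


summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
print(top_k_query(summary, 1))
print(top_k_query(summary, 2))
```

For k = 1 the answer B is guaranteed; for k = 2 only B is guaranteed, so a caller looking for a guaranteed top-k’ could fall back to k’ = 1.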
14 Top-k Elements Example
For k = 3, m = 8:
–B, D, and G are the top-3 candidates.
–B and D are guaranteed to be in the top-3.
–B, D, G, and A are guaranteed to be the top-4. Here k’ = 4.
–B and D are guaranteed to be the top-2. Another k’ = 2.
[Table: Element (B, D, G, A, Q, F, C, E), Count, error, Guaranteed Hits = Count - error]
15 Frequent Elements Precision
16 Frequent Elements Run Time
17 Frequent Elements Space Used
18 Max freq. element in stream: can we promise to find it with fewer than m buckets?