Efficient Computation of Frequent and Top-k Elements in Data Streams Ahmed Metwally Divyakant Agrawal Amr El Abbadi Department of Computer Science University of California, Santa Barbara
Outline Problem Definition Space-Saving: Summarizing the Data Stream Answering Frequent Elements Queries Answering Top-k Queries Experimental Results Conclusion
Motivation Motivated by Internet advertising commissioners. Before rendering an advertisement for a user, query the click stream to decide which advertisements to display. If the user's profile is not a frequent “clicker”, then s/he will probably not click any displayed advertisement: show Pay-Per-Impression advertisements. If the user's profile is a frequent “clicker”, then s/he may click a displayed advertisement: show Pay-Per-Click advertisements. Retrieve the top advertisements to choose what to display.
Problem Definition Given an alphabet A and a stream S of size N, a frequent element, E, is an element whose frequency, F, exceeds a user-specified support, φN. The top-k elements are the k elements with the highest frequencies. The two problems are closely related, yet no integrated solution has been proposed. An exact solution requires O(min(N, A)) space, hence the approximate variations.
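As a point of reference, the exact versions of both queries can be sketched with one counter per distinct element, which is where the O(min(N, A)) space cost comes from. The function names below are illustrative and do not come from the talk.

```python
from collections import Counter

def exact_frequent(stream, phi):
    """Return every element whose frequency F exceeds phi * N."""
    counts = Counter(stream)      # one counter per distinct element
    n = sum(counts.values())      # N, the stream size
    return {e for e, f in counts.items() if f > phi * n}

def exact_top_k(stream, k):
    """Return the k (element, frequency) pairs with the highest frequencies."""
    return Counter(stream).most_common(k)

stream = "ABBACABBDDBEC"
print(exact_frequent(stream, 0.3))
print(exact_top_k(stream, 2))
```

The approximate algorithms discussed in the rest of the talk trade this per-element counter for a fixed number of counters.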
Practical Frequent Elements ε-Deficient Frequent Elements [Manku ‘02]: All frequent elements output should have F > (φ - ε)N, where ε is the user-defined error. Figure: frequency thresholds φN and (φ - ε)N.
Practical Top-k FindApproxTop(S, k, ε) [Charikar ‘02]: Retrieve a list of k elements such that every element, Ei, in the list has Fi > (1 - ε) Fk, where Ek is the kth ranked element. Figure: frequency thresholds F4 and (1 - ε) F4.
Related Work Algorithms Classification Counter-Based techniques: Keep an individual counter for each monitored element. If the observed ID is monitored, its counter is updated. If the observed ID is not monitored, the action taken depends on the algorithm. Sketch-Based techniques: Estimate frequencies for all elements using bit-maps of counters. Each element is hashed into the counters’ space using a family of hash functions. The hashed-to counters are queried for the frequencies.
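Recent Work (Comparison)
CountSketch [Charikar ‘02]. Nature: Sketch. Space bound: O((k/ε^2) log(N/δ)), where δ is the failure probability. Handles: FindApproxTop(S, k, ε).
GroupTest [Cormode ’03]. Nature: Sketch. Space bound: O(φ^-1 log(φ^-1) log(|A|)). Handles: Hot Items.
Frequent [Demaine ’02]. Nature: Counter. Space bound: O(1/ε), proved by [Bose ‘03]. Handles: FE.
Probabilistic-Inplace [Demaine ’02]. Nature: Counter. Space bound: O(m), where m is the available memory. Handles: FindCandidateTop(S, k, m/2).
Lossy Counting [Manku ’02]. Nature: Counter. Space bound: (1/ε) log(εN). Handles: ε-Deficient FE.
Sticky Sampling [Manku ’02]. Nature: Counter. Space bound: (2/ε) log(φ^-1 δ^-1). Handles: ε-Deficient FE.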
The Space-Saving Algorithm Space-Saving is counter-based Monitor only m elements Only over-estimation errors Frequency estimation is more accurate for significant elements Keep track of max. possible errors
Space-Saving By Example
Space-Saving Algorithm
For every element in the stream S:
  If a monitored element is observed, increment its Count.
  If a non-monitored element is observed, replace the element with minimum hits, min, and set the new element's Count to min + 1. Its maximum possible over-estimation, min, is recorded as its error.
Example with three counters on the stream A B B A C A B B D D B E C (error is 0 unless shown):
  After A B B A C: A 2, B 2, C 1
  After A: A 3, B 2, C 1
  After B B: B 4, A 3, C 1
  After D: B 4, A 3, D 2 (error 1)
  After D B: B 5, A 3, D 3 (error 1)
  After E: B 5, E 4 (error 3), A 3
  After C: B 5, E 4 (error 3), C 4 (error 3)
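The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it finds the minimum counter with a linear scan and breaks ties arbitrarily, and the function name `space_saving` is an assumption made for the example.

```python
def space_saving(stream, m):
    """Summarize `stream` with at most m counters.

    Returns {element: (count, error)}, where error is the maximum
    possible over-estimation of that element's count.
    """
    count, error = {}, {}
    for x in stream:
        if x in count:                          # monitored element
            count[x] += 1
        elif len(count) < m:                    # a counter is still free
            count[x], error[x] = 1, 0
        else:                                   # replace the minimum element
            victim = min(count, key=count.get)  # O(m) scan, ties arbitrary
            floor = count.pop(victim)
            error.pop(victim)
            count[x], error[x] = floor + 1, floor
    return {e: (count[e], error[e]) for e in count}

summary = space_saving("ABBACABBDDBEC", m=3)
print(summary)
```

On the example stream with m = 3, the summary ends with B at Count 5 (error 0) and E and C at Count 4 (error 3), whichever tied element is replaced along the way.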
Space-Saving Observations S = ABBACABBDDBEC, N = 13
Final Stream-Summary: B (Count 5, error 0), E (Count 4, error 3), C (Count 4, error 3)
Observations:
The summation of the Counts is N.
The minimum number of hits, min, satisfies min ≤ N/m. In this example, min = 4.
The minimum number of hits, min, is an upper bound on the error of any element.
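The bound on min follows from the first observation in one step. Every stream element increments exactly one counter, so the m Counts sum to N, and the smallest of them cannot exceed their average. Written out for the example (m = 3, N = 13):

```latex
\sum_{i=1}^{m} \mathrm{Count}_i = N
\quad\Longrightarrow\quad
m \cdot \mathit{min} \;\le\; \sum_{i=1}^{m} \mathrm{Count}_i = N
\quad\Longrightarrow\quad
\mathit{min} \;\le\; \frac{N}{m},
\qquad \text{e.g. } 4 \le \tfrac{13}{3} \approx 4.33 .
```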
Space-Saving Proved Properties S = ABBACABBDDBEC, N = 13
Final Stream-Summary: B (Count 5, error 0), E (Count 4, error 3), C (Count 4, error 3)
Property 1: If element E has frequency F > min, then E must be in Stream-Summary. Example: F(B) = F1 = 5, min = 4.
Property 2: The Count at position i in Stream-Summary is no less than Fi, the frequency of the ith ranked element. Example: F(A) = F2 = 3, Count2 = 4.
Property 2 is important to guarantee the correctness and order of top-k.
Space-Saving Intuition Make use of the skewed property of the data A minority of the elements, the more frequent ones, gets the majority of the hits. Frequent elements will reside in the counters of bigger values. They will not be distorted by the ineffective hits of the infrequent elements. Numerous infrequent elements reside on the smaller counters.
Space-Saving Intuition (Cont’d) If the skew remains, but the popular elements change over time: The elements that are growing more popular will gradually be pushed to the top of the list. If one of the previously popular elements loses its popularity, its relative position will decline as other counters get incremented.
Space-Saving Data Structure We need a data structure that Increments counters in constant time Keeps elements sorted by their counters We propose the Stream-Summary structure, similar to the data structure in [Demaine ’02]
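One way to get constant-time increments is to group elements into buckets of equal Count. The sketch below is a simplification made for illustration, not the Stream-Summary structure itself: it keeps the buckets in a hash map rather than a sorted linked list, so updates are O(1) on average but a ranked traversal still sorts the counters. The class and method names are assumptions.

```python
class StreamSummary:
    """Space-Saving counters grouped into buckets of equal Count."""

    def __init__(self, m):
        self.m = m
        self.count = {}      # element -> Count
        self.error = {}      # element -> maximum possible over-estimation
        self.buckets = {}    # Count value -> set of elements with that Count
        self.min_count = 0   # smallest Count among monitored elements

    def observe(self, x):
        if x in self.count:                 # monitored: leave current bucket
            old, err = self.count[x], self.error[x]
            self.buckets[old].remove(x)
        elif len(self.count) < self.m:      # free counter: start from zero
            old, err = 0, 0
        else:                               # evict any element of the minimum bucket
            old = err = self.min_count
            victim = self.buckets[old].pop()
            del self.count[victim], self.error[victim]
        if old == 0:
            self.min_count = 1              # the new element holds the minimum
        elif not self.buckets[old]:
            del self.buckets[old]           # drop the emptied bucket
            if old == self.min_count:
                self.min_count = old + 1    # x itself now sits at old + 1
        self.count[x], self.error[x] = old + 1, err
        self.buckets.setdefault(old + 1, set()).add(x)

    def ranked(self):
        """(element, Count, error) triples in decreasing Count order."""
        return sorted(((e, c, self.error[e]) for e, c in self.count.items()),
                      key=lambda t: -t[1])

summary = StreamSummary(m=3)
for x in "ABBACABBDDBEC":
    summary.observe(x)
print(summary.ranked())
```

Tracking `min_count` explicitly works because a counter only ever grows by one: when the minimum bucket empties, the element that just left it is the new minimum at old + 1.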
Frequent Elements Queries Traverse Stream-Summary, and report all elements that satisfy the user support Any element whose guaranteed hits = (Count – error) > φN is guaranteed to be a frequent element
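The traversal can be sketched as follows, assuming the summary is available as (element, Count, error) triples. The function name and the triple layout are choices made for this example; the sample data is the three-counter summary of the stream ABBACABBDDBEC (N = 13).

```python
def frequent_elements(summary, n, phi):
    """Split monitored elements into candidates and guaranteed answers.

    summary: iterable of (element, count, error) triples.
    A candidate has Count > phi * n; a guaranteed frequent element
    has Count - error > phi * n.
    """
    support = phi * n
    candidates = [e for e, c, _ in summary if c > support]
    guaranteed = [e for e, c, err in summary if c - err > support]
    return candidates, guaranteed

summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
candidates, guaranteed = frequent_elements(summary, n=13, phi=0.3)
print(candidates, guaranteed)
```

With φ = 0.3 the support is 3.9 hits, so B, E, and C are all candidates, but only B is guaranteed: the guaranteed hits of E and C are 4 - 3 = 1.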
Frequent Elements Example
Element: B D G A Q F C E
Count: 20 14 12 9 7 5 3 3
error: B 1, G 4 (a further error of 2 is also listed)
Guaranteed Hits = Count - error: B 19, G 8
For N = 73, m = 8, φ = 0.15: Frequent elements should have support of 11 hits (φN = 10.95). Candidate frequent elements are B, D, and G. Guaranteed frequent elements are B and D, since their guaranteed hits > 11.
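Frequent Elements Space Bounds
Space-Saving. General distribution: O(1/ε). Zipf(α): (1/ε)^(1/α).
GroupTest. General distribution: O(φ^-1 log(φ^-1) log(|A|)).
Frequent. General distribution: O(1/ε), proved by [Bose ’03].
Lossy Counting. General distribution: (1/ε) log(εN).
Sticky Sampling. General distribution: (2/ε) log(φ^-1 δ^-1).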
FE: Quantitative Comparison Example: N = 10^6, |A| = 10^4, φ = 10^-1, ε = 10^-2, and δ, the failure probability, = 10^-1. Uniform data: Space-Saving and Frequent: 100 counters. Sticky Sampling: 700 counters. Lossy Counting: 1000 counters. GroupTest: C*930 counters, C ≥ 1. Zipfian data with α = 2: Space-Saving: 10 counters.
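The two Space-Saving figures follow directly from its bounds, 1/ε counters in general and (1/ε)^(1/α) counters for Zipf(α) data. A small sketch of that arithmetic, with constant factors ignored and a function name chosen for this example:

```python
def space_saving_counters(epsilon, alpha=None):
    """Counter bound for Space-Saving, constants ignored.

    General distribution: 1/epsilon.
    Zipf(alpha) data:     (1/epsilon) ** (1/alpha).
    """
    bound = 1.0 / epsilon
    if alpha is not None:
        bound **= 1.0 / alpha
    return round(bound)

print(space_saving_counters(1e-2))           # general distribution
print(space_saving_counters(1e-2, alpha=2))  # Zipfian data, alpha = 2
```

For ε = 10^-2 this gives 100 counters in the general case and 10 counters for α = 2, matching the numbers quoted for Space-Saving.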
FE: Qualitative Comparison Frequent: It has a bound similar to Space-Saving in the general distribution case. It is built and queried in a way that does not allow the user to specify an error threshold. There is no feasible extension to track under-estimation errors. Every observation of a non-monitored element increases the errors for all the monitored elements, since their counters get decremented.
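The decrement behaviour can be seen in a short sketch of Frequent in its common Misra-Gries-style formulation. This is an illustration under that assumption, not code from [Demaine ’02], and the function name is made up for the example.

```python
def frequent(stream, m):
    """Keep at most m counters; decrement all of them on a miss."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < m:
            counters[x] = 1
        else:
            # A non-monitored element with no free counter:
            # every monitored counter loses one hit.
            for e in list(counters):
                counters[e] -= 1
                if counters[e] == 0:
                    del counters[e]
    return counters

print(frequent("ABBACABBDDBEC", m=3))
```

On the stream ABBACABBDDBEC with m = 3, the surviving counters are A = 1, B = 3, and C = 1, against true frequencies of 3, 5, and 2, so every monitored element is under-estimated.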
FE: Qualitative Comparison (Cont’d) GroupTest: It does not output frequencies at all. It reveals nothing about the relative order of the elements. It assumes that IDs are 1 … |A|. This can only be enforced by building an indexed lookup table. Thus, practically it needs O(|A|) space.
FE: Qualitative Comparison (Cont’d) Lossy Counting and Sticky Sampling: The theoretical space bound of Space-Saving is much tighter than those of Lossy Counting and Sticky Sampling.
Top-k Elements Queries Traverse the Stream-Summary, and report the top-k elements. From Property 2, we assert: Guaranteed top-k elements: any element whose guaranteed hits = (Count - error) ≥ Count_{k+1} is guaranteed to be in the top-k. Guaranteed top-k’ (where k’ ≈ k): the top-k’ elements reported are guaranteed to be the correct top-k’ iff for every element in the top-k’, guaranteed hits = (Count - error) ≥ Count_{k’+1}.
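Both checks compare guaranteed hits against the Count of the element ranked just below the cut. A minimal sketch, assuming the summary is a list of (element, Count, error) triples already sorted by decreasing Count and that k is smaller than the number of monitored elements; the function name is hypothetical.

```python
def top_k_query(summary, k):
    """Report the top-k candidates and which of them are guaranteed.

    summary: (element, count, error) triples in decreasing count order.
    Returns (top_k, guaranteed, all_guaranteed), where all_guaranteed is
    True when every reported element passes the check.
    """
    if not 0 < k < len(summary):
        raise ValueError("sketch assumes 0 < k < number of monitored elements")
    threshold = summary[k][1]                  # Count of the (k+1)th element
    top_k = [e for e, _, _ in summary[:k]]
    guaranteed = [e for e, c, err in summary[:k] if c - err >= threshold]
    return top_k, guaranteed, len(guaranteed) == k

summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
print(top_k_query(summary, 1))
print(top_k_query(summary, 2))
```

On the three-counter summary of ABBACABBDDBEC, B is a guaranteed top-1, while for k = 2 only B is guaranteed because E has just 1 guaranteed hit against a threshold of 4.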
Top-k Elements Example
Element: B D G A Q F C E
Count: 20 14 12 9 7 5 3 3
error: B 1, G 4 (a further error of 2 is also listed)
Guaranteed Hits = Count - error: B 19, G 8
For k = 3, m = 8: B, D, and G are the top-3 candidates. B and D are guaranteed to be in the top-3. B, D, G, and A are guaranteed to be the top-4; here k’ = 4. B and D are guaranteed to be the top-2; another k’ = 2.
Top-k Elements Space Bounds
Space-Saving. General distribution, FindApproxTop(S, k, ε): O((k/ε) log(N)). Zipf(α), exact top-k problem: α = 1: O(k^2 log(A)); α > 1: O((k/α)^(1/α) k).
CountSketch. General distribution: O((k/ε^2) log(N/δ)). Zipf(α), α ≥ 1: O(k log(N/δ)).
Top-k: Quantitative Comparison For N = 10^6, |A| = 10^4, k = 100, ε = 10^-1, and δ = 10^-1. Uniform data: Space-Saving: 1000 counters. CountSketch: C*2.3*10^7 counters, C >> 1. Zipfian data with α = 2: Space-Saving: 66 counters. CountSketch: C*230 counters, C >> 1.
Top-k: Qualitative Comparison CountSketch: General distribution: Space-Saving has a tighter theoretical space bound. Zipf(α) distribution: Space-Saving solves the exact problem, while CountSketch solves the approximate problem. Space-Saving has a tighter bound in cases of skewed data and long streams. Space-Saving has zero probability of failure.
Top-k: Qualitative Comparison (Cont’d) Probabilistic-Inplace: Outputs m/2 elements, which is too many. Zipf(α) distribution: Probabilistic-Inplace does not offer space analysis in case of Zipfian data.
Experimental Results - Setup Synthetic data: Zipf(α), α varied: 0.0, 0.5, 1.0, …, 2.5, 3.0; N = 10^7 hits. Real data (ValueClick, Inc.): similar results. Precision: number of correct elements found / size of the entire output. Recall: number of correct elements found / number of actual correct elements. Run time: processing the stream + query time. Space used: including the hash table.
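The two quality measures can be written down directly. The sketch below assumes the reported and the actually correct elements are given as collections of IDs; returning 1.0 for an empty denominator is a convention chosen for this example, not something stated in the talk.

```python
def precision_recall(reported, correct):
    """Precision and recall of a reported answer set.

    precision = correct elements found / size of the entire output
    recall    = correct elements found / number of actual correct elements
    """
    reported, correct = set(reported), set(correct)
    hits = len(reported & correct)
    precision = hits / len(reported) if reported else 1.0  # assumed convention
    recall = hits / len(correct) if correct else 1.0       # assumed convention
    return precision, recall

print(precision_recall(["B", "E", "C"], ["B"]))
```

An answer that reports B, E, and C when only B is correct has a recall of 1 but a precision of 1/3.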
Frequent Elements Results Query: φ = 10^-2, ε = 10^-4, and δ = 10^-2. We compared with GroupTest and Frequent. All algorithms had a recall of 1; that is, every correct element appeared in their output. Space-Saving was able to guarantee all its output to be correct.
Frequent Elements Precision
Frequent Elements Run Time
Frequent Elements Space Used
Top-k Elements Results Query: k = 100, ε = 10^-4, and δ = 10^-2. We compared with CountSketch and Probabilistic-InPlace. CountSketch was re-run several times; the hidden constant was estimated to be 16 in order to have output of competitive quality. Probabilistic-InPlace was allowed the same number of counters as Space-Saving. Space-Saving was able to guarantee all its output to be correct.
Top-k Elements Precision
Top-k Elements Recall
Top-k Elements Run Time
Top-k Elements Space Used
Conclusion Contributions: An integrated approach to solve an interesting family of problems. Strict error bounds using little space. Guarantees on results. Special attention was given to Zipfian data. Experimental validation. Future Work: Incremental frequent and top-k elements reporting.