Published by Jasmine Rose. Modified over 10 years ago.
1
Efficient Computation of Frequent and Top-k Elements in Data Streams
Ahmed Metwally, Divyakant Agrawal, Amr El Abbadi
Department of Computer Science, University of California, Santa Barbara
2
Outline
- Problem Definition
- Space-Saving: Summarizing the Data Stream
- Answering Frequent Elements Queries
- Answering Top-k Queries
- Experimental Results
- Conclusion
3
Motivation
- Motivated by Internet advertising commissioners.
- Before rendering an advertisement for a user, query the clicks stream to decide which advertisements to display.
- If the user's profile is not a frequent "clicker", then s/he will probably not click any displayed advertisement. Show Pay-Per-Impression advertisements.
- If the user's profile is a frequent "clicker", then s/he may click a displayed advertisement. Show Pay-Per-Click advertisements.
- Retrieve the top advertisements to choose what to display.
4
Problem Definition
- Given an alphabet A and a stream S of size N, a frequent element, E, is an element whose frequency, F, exceeds a user-specified support, φN.
- The top-k elements are the k elements with the highest frequencies.
- The two problems are closely related, yet no integrated solution has been proposed.
- The exact solution requires O(min(N, |A|)) space, so practical approaches solve approximate variations.
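The exact versions of both queries can be sketched in a few lines, which also shows why they cost one counter per distinct element. This is only an illustration: the function names and the support φ = 0.3 are assumptions, and the stream is the one used in the Space-Saving example later in the deck.

```python
from collections import Counter


def exact_frequent(stream, phi):
    """Elements whose frequency exceeds phi * N, using one counter per distinct element."""
    counts = Counter(stream)
    n = sum(counts.values())
    return {element for element, freq in counts.items() if freq > phi * n}


def exact_top_k(stream, k):
    """The k elements with the highest frequencies."""
    return [element for element, _ in Counter(stream).most_common(k)]


stream = "ABBACABBDDBEC"               # N = 13
frequent = exact_frequent(stream, 0.3)  # support phi * N = 3.9 (phi is an assumed value)
top2 = exact_top_k(stream, 2)
```

Here B occurs 5 times and A 3 times, so only B exceeds the support, and the exact top-2 is B followed by A.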
5
Practical Frequent Elements
ε-Deficient Frequent Elements [Manku '02]: every frequent element in the output should have F > (φ - ε)N, where ε is the user-defined error.
[Figure: the thresholds φN and (φ - ε)N]
6
Practical Top-k
FindApproxTop(S, k, ε) [Charikar '02]: retrieve a list of k elements such that every element, Ei, in the list has Fi > (1 - ε)Fk, where Ek is the kth ranked element.
[Figure: F4 and (1 - ε)F4]
7
Related Work
Algorithms Classification
Counter-based techniques:
- Keep an individual counter for each monitored element.
- If the observed ID is monitored, its counter is updated.
- If the observed ID is not monitored, the action taken depends on the algorithm.
Sketch-based techniques:
- Estimate the frequency of all elements using bit-maps of counters.
- Each element is hashed into the counters' space using a family of hash functions.
- The hashed-to counters are queried for the frequencies.
8
Recent Work (Comparison)
Algorithm | Nature | Space Bound | Handles
CountSketch [Charikar '02] | Sketch | O(k/ε^2 log N/δ), δ is the failure probability | FindApproxTop(S, k, ε)
GroupTest [Cormode '03] | Sketch | O(φ^-1 log(φ^-1) log(|A|)) | Hot Items
Frequent [Demaine '02] | Counter | O(1/ε), proved by [Bose '03] | FE
Probabilistic-Inplace [Demaine '02] | Counter | O(m), m is the available memory | FindCandidateTop(S, k, m/2)
Lossy Counting [Manku '02] | Counter | (1/ε) log(εN) | ε-Deficient FE
Sticky Sampling [Manku '02] | Counter | (2/ε) log(φ^-1 δ^-1) | ε-Deficient FE
10
The Space-Saving Algorithm
- Space-Saving is counter-based.
- It monitors only m elements.
- It makes only over-estimation errors.
- Frequency estimation is more accurate for significant elements.
- It keeps track of the maximum possible error of each element.
11
Space-Saving By Example
Space-Saving Algorithm
For every element in the stream S:
- If a monitored element is observed, increment its Count.
- If a non-monitored element is observed, replace the element with the minimum hits, min, set the new element's Count to min + 1, and record min as its error, the maximum possible over-estimation.

Example with 3 counters on the stream A B B A C A B B D D B E C. Each entry is shown as Element (Count, error):
Stream read so far | Stream-Summary
(nothing) | (empty)
ABBAC | A (2, 0), B (2, 0), C (1, 0)
ABBACA | A (3, 0), B (2, 0), C (1, 0)
ABBACABB | B (4, 0), A (3, 0), C (1, 0)
ABBACABBD | B (4, 0), A (3, 0), D (2, 1)
ABBACABBDDB | B (5, 0), A (3, 0), D (3, 1)
ABBACABBDDBE | B (5, 0), E (4, 3), A (3, 0)
ABBACABBDDBEC | B (5, 0), E (4, 3), C (4, 3)
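The update rule above can be sketched as follows. This is a minimal version that keeps the counters in a plain dictionary and scans for the minimum, so each replacement costs O(m); the function name and the tie-breaking rule (the first minimum found) are assumptions, and the slide's trace breaks one tie differently while reaching the same final summary.

```python
def space_saving(stream, m):
    """Return {element: [count, error]} after processing the stream with m counters."""
    summary = {}
    for item in stream:
        if item in summary:
            # Monitored element: increment its Count.
            summary[item][0] += 1
        elif len(summary) < m:
            # A free counter is still available: start at 1 with no error.
            summary[item] = [1, 0]
        else:
            # Replace the element with the minimum hits (ties broken arbitrarily).
            victim = min(summary, key=lambda e: summary[e][0])
            min_count = summary[victim][0]
            del summary[victim]
            # The newcomer gets min + 1 and may be over-estimated by at most min.
            summary[item] = [min_count + 1, min_count]
    return summary


result = space_saving("ABBACABBDDBEC", 3)
```

For this stream the result is B with Count 5 and error 0, and E and C each with Count 4 and error 3.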
12
Space-Saving Observations
S = ABBACABBDDBEC, N = 13
Element | Count | error (max possible)
B | 5 | 0
E | 4 | 3
C | 4 | 3
Observations:
- The summation of the Counts is N.
- The minimum number of hits, min, satisfies min ≤ N/m. In this example, min = 4.
- The minimum number of hits, min, is an upper bound on the error of any element.
13
Space-Saving Proved Properties
S = ABBACABBDDBEC, N = 13
Stream-Summary, shown as Element (Count, error): B (5, 0), E (4, 3), C (4, 3)
1. If an element E has frequency F > min, then E must be in Stream-Summary. Example: F(B) = F1 = 5, min = 4.
2. The Count at position i in Stream-Summary is no less than Fi, the frequency of the ith ranked element. Example: F(A) = F2 = 3, Count2 = 4.
Property 2 is important for guaranteeing the correctness and order of the top-k.
14
Space-Saving Intuition
- Make use of the skewed property of the data: a minority of the elements, the more frequent ones, gets the majority of the hits.
- Frequent elements will reside in the counters with bigger values. They will not be distorted by the ineffective hits of the infrequent elements.
- The numerous infrequent elements reside in the smaller counters.
15
Space-Saving Intuition (Cont’d)
If the skew remains but the popular elements change over time:
- The elements that are growing more popular will gradually be pushed to the top of the list.
- If one of the previously popular elements loses its popularity, its relative position will decline as other counters get incremented.
16
Space-Saving Data Structure
We need a data structure that:
- Increments counters in constant time
- Keeps elements sorted by their counters
We propose the Stream-Summary structure, similar to the data structure in [Demaine '02].
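One way to meet both requirements is sketched below. The slides do not spell out the layout of Stream-Summary, so the design here is an assumption: elements with equal counts share a bucket, buckets form a doubly linked list sorted by count, and a hash table maps each element to its bucket. All class and method names are hypothetical, and error tracking is left out to keep the sketch short.

```python
class Bucket:
    """Holds every element that currently has the same count."""

    def __init__(self, value):
        self.value = value
        self.items = set()
        self.prev = None  # neighbouring bucket with a smaller value
        self.next = None  # neighbouring bucket with a larger value


class StreamSummary:
    """Buckets kept in a doubly linked list, sorted by ascending count."""

    def __init__(self):
        self.head = None     # bucket with the minimum count
        self.bucket_of = {}  # element -> its bucket, for constant-time lookup

    def insert(self, item):
        """Start monitoring item with a count of 1."""
        if self.head is None or self.head.value != 1:
            bucket = Bucket(1)
            bucket.next = self.head
            if self.head is not None:
                self.head.prev = bucket
            self.head = bucket
        self.head.items.add(item)
        self.bucket_of[item] = self.head

    def increment(self, item):
        """Move item to the bucket for its count + 1, creating it if needed."""
        old = self.bucket_of[item]
        new = old.next
        if new is None or new.value != old.value + 1:
            new = Bucket(old.value + 1)
            new.prev, new.next = old, old.next
            if old.next is not None:
                old.next.prev = new
            old.next = new
        old.items.remove(item)
        new.items.add(item)
        self.bucket_of[item] = new
        if not old.items:  # unlink the emptied bucket
            new.prev = old.prev
            if old.prev is not None:
                old.prev.next = new
            else:
                self.head = new

    def replace_min(self, item):
        """Evict an element with the minimum count; return that minimum."""
        minimum = self.head.value
        victim = next(iter(self.head.items))
        self.head.items.remove(victim)
        del self.bucket_of[victim]
        self.head.items.add(item)
        self.bucket_of[item] = self.head
        self.increment(item)
        return minimum

    def ranked(self):
        """Return (element, count) pairs, highest count first."""
        buckets, bucket = [], self.head
        while bucket is not None:
            buckets.append(bucket)
            bucket = bucket.next
        return [(e, b.value) for b in reversed(buckets) for e in sorted(b.items)]


def summarize(stream, m):
    summary = StreamSummary()
    for item in stream:
        if item in summary.bucket_of:
            summary.increment(item)
        elif len(summary.bucket_of) < m:
            summary.insert(item)
        else:
            summary.replace_min(item)
    return summary


ranking = summarize("ABBACABBDDBEC", 3).ranked()
```

Because an increment only ever moves an element to the adjacent bucket, no sorting step is needed, and the minimum is always available at the head of the list.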
18
Frequent Elements Queries
- Traverse Stream-Summary and report all elements that satisfy the user support.
- Any element whose guaranteed hits = (Count - error) > φN is guaranteed to be a frequent element.
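The query can be sketched as a single pass over the summary. The function name, the dictionary layout, and the support φ = 0.3 are assumptions for illustration; the summary is the final state of the ABBACABBDDBEC example (N = 13).

```python
def frequent_elements(summary, phi, n):
    """summary maps element -> (count, error). Returns (candidates, guaranteed)."""
    threshold = phi * n
    # Candidates: the (possibly over-estimated) Count exceeds the support.
    candidates = [e for e, (count, _) in summary.items() if count > threshold]
    # Guaranteed: even after subtracting the maximum error, the support is exceeded.
    guaranteed = [e for e, (count, error) in summary.items() if count - error > threshold]
    return candidates, guaranteed


summary = {"B": (5, 0), "E": (4, 3), "C": (4, 3)}
candidates, guaranteed = frequent_elements(summary, phi=0.3, n=13)
```

With a support of 3.9 hits, all three monitored elements are candidates, but only B is guaranteed, since E and C each have just 1 guaranteed hit.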
19
Frequent Elements Example
Element B D G A Q F C E
Count 20 14 12 9 7 5 3
error 1 4 2
Guaranteed Hits = Count - error 19 8
For N = 73, m = 8, φ = 0.15:
- Frequent elements should have a support of 11 hits.
- Candidate frequent elements are B, D, and G.
- Guaranteed frequent elements are B and D, since their guaranteed hits > 11.
20
Frequent Elements Space Bounds
Algorithm | General Distribution | Zipf(α)
Space-Saving | O(1/ε) | (1/ε)^(1/α)
GroupTest | O(φ^-1 log(φ^-1) log(|A|)) |
Frequent | O(1/ε), proved by [Bose '03] |
Lossy Counting | (1/ε) log(εN) |
Sticky Sampling | (2/ε) log(φ^-1 δ^-1) |
21
FE: Quantitative Comparison
Example: N = 10^6, |A| = 10^4, φ = 10^-1, ε = 10^-2, and δ (the failure probability) = 10^-1.
Uniform data:
- Space-Saving and Frequent: 100 counters
- Sticky Sampling: 700 counters
- Lossy Counting: 1000 counters
- GroupTest: C*930 counters, C ≥ 1
Zipfian data with α = 2:
- Space-Saving: 10 counters
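The two Space-Saving figures follow from its bounds of 1/ε counters for a general distribution and (1/ε)^(1/α) counters for Zipf(α) data. The short check below assumes the constant hidden in the O() notation is 1, which is what the slide's numbers imply; the function name is hypothetical.

```python
def space_saving_counters(epsilon, alpha=None):
    """Counters suggested by the bounds, taking the hidden constant as 1 (an assumption)."""
    if alpha is None:
        return round(1 / epsilon)               # general distribution: 1/epsilon
    return round((1 / epsilon) ** (1 / alpha))  # Zipf(alpha): (1/epsilon)^(1/alpha)


uniform = space_saving_counters(1e-2)        # epsilon = 10^-2
zipf = space_saving_counters(1e-2, alpha=2)  # epsilon = 10^-2, alpha = 2
```

This gives 100 counters for uniform data and 10 counters for Zipfian data with α = 2.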
22
FE: Qualitative Comparison
Frequent:
- It has a bound similar to that of Space-Saving in the general distribution case.
- It is built and queried in a way that does not allow the user to specify an error threshold.
- There is no feasible extension to track under-estimation errors.
- Every observation of a non-monitored element increases the errors for all the monitored elements, since their counters get decremented.
23
FE: Qualitative Comparison (Cont’d)
GroupTest:
- It does not output frequencies at all.
- It reveals nothing about the relative order of the elements.
- It assumes that IDs are 1 … |A|. This can only be enforced by building an indexed lookup table, so in practice it needs O(|A|) space.
24
FE: Qualitative Comparison (Cont’d)
Lossy Counting and Sticky Sampling: The theoretical space bound of Space-Saving is much tighter than those of Lossy Counting and Sticky Sampling.
26
Top-k Elements Queries
Traverse the Stream-Summary and report the top-k elements. From Property 2, we assert:
- Guaranteed top-k elements: any element whose guaranteed hits = (Count - error) ≥ Count_{k+1} is guaranteed to be in the top-k.
- Guaranteed top-k' (where k' ≈ k): the top-k' elements reported are guaranteed to be the correct top-k' iff, for every element in the top-k', guaranteed hits = (Count - error) ≥ Count_{k'+1}.
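Both checks can be sketched as follows. The function names and the tuple layout are assumptions, the sketch only handles k smaller than the number of monitored elements (so that Count_{k+1} exists), and it tests only the first k entries. The summary is the final state of the ABBACABBDDBEC example.

```python
def guaranteed_in_top_k(summary, k):
    """summary: list of (element, count, error). Elements guaranteed to be in the top-k."""
    ranked = sorted(summary, key=lambda row: row[1], reverse=True)
    if k >= len(ranked):
        raise ValueError("k must be smaller than the number of monitored elements")
    next_count = ranked[k][1]  # Count_{k+1}, using 1-based ranks
    return [e for e, count, error in ranked[:k] if count - error >= next_count]


def top_k_is_guaranteed(summary, k):
    """True iff every reported top-k element passes the guaranteed-hits test."""
    return len(guaranteed_in_top_k(summary, k)) == k


summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
top1 = guaranteed_in_top_k(summary, 1)
top2 = guaranteed_in_top_k(summary, 2)
```

B has 5 guaranteed hits against Count_2 = 4, so the top-1 is guaranteed. For k = 2, E has only 1 guaranteed hit against Count_3 = 4, so the reported top-2 cannot be guaranteed.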
27
Top-k Elements Example
Element B D G A Q F C E
Count 20 14 12 9 7 5 3
error 1 4 2
Guaranteed Hits = Count - error 19 8
For k = 3, m = 8:
- B, D, and G are the top-3 candidates.
- B and D are guaranteed to be in the top-3.
- B, D, G, and A are guaranteed to be the top-4. Here k' = 4.
- B and D are guaranteed to be the top-2. Another k' = 2.
28
Top-k Elements Space Bounds
Algorithm | General Distribution | Zipf(α)
Space-Saving | FindApproxTop(S, k, ε): O(k/ε * log(N)) | Exact Top-k Problem: α = 1: O(k^2 log(|A|)); α > 1: O((k/α)^(1/α) k)
CountSketch | O(k/ε^2 * log(N/δ)) | α ≥ 1: O(k * log(N/δ))
29
Top-k: Quantitative Comparison
For N = 10^6, |A| = 10^4, k = 100, ε = 10^-1, and δ = 10^-1:
Uniform data:
- Space-Saving: 1000 counters
- CountSketch: C*2.3*10^7 counters, C >> 1
Zipfian data with α = 2:
- Space-Saving: 66 counters
- CountSketch: C*230 counters, C >> 1
30
Top-k: Qualitative Comparison
CountSketch:
- General distribution: Space-Saving has a tighter theoretical space bound.
- Zipf(α) distribution: Space-Saving solves the exact problem, while CountSketch solves the approximate problem.
- Space-Saving has a tighter bound in cases of skewed data and long streams.
- Space-Saving has a 0-probability of failure.
31
Top-k: Qualitative Comparison (Cont’d)
Probabilistic-Inplace:
- It outputs m/2 elements, which is too many.
- Zipf(α) distribution: Probabilistic-Inplace does not offer a space analysis for Zipfian data.
33
Experimental Results - Setup
- Synthetic data: Zipf(α), with α varied over 0.0, 0.5, 1.0, …, 2.5, 3.0; N = 10^7 hits.
- Real data (ValueClick, Inc.): similar results.
- Precision: number of correct elements found / size of the entire output.
- Recall: number of correct elements found / number of actual correct elements.
- Run time: stream processing time + query time.
- Space used: includes the hash table.
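The two quality metrics can be sketched as set operations. The function name and the convention of returning 1.0 for an empty denominator are assumptions; the example inputs reuse the ABBACABBDDBEC stream with an assumed support of φ = 0.3, for which only B is truly frequent.

```python
def precision_recall(reported, actual):
    """Precision and recall of a reported answer set against the true answer set."""
    reported, actual = set(reported), set(actual)
    correct = len(reported & actual)
    # Conventions for empty sets are an assumption of this sketch.
    precision = correct / len(reported) if reported else 1.0
    recall = correct / len(actual) if actual else 1.0
    return precision, recall


# Reporting all three candidates when only B is actually frequent.
precision, recall = precision_recall(reported=["B", "E", "C"], actual=["B"])
```

Reporting every candidate finds the one correct element (recall 1) but two of the three outputs are wrong (precision 1/3).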
34
Frequent Elements Results
- Query: φ = 10^-2, ε = 10^-4, and δ = 10^-2.
- We compared with GroupTest and Frequent.
- All algorithms had a recall of 1. That is, they all included every correct element in their output.
- Space-Saving was able to guarantee all its output to be correct.
35
Frequent Elements Precision
36
Frequent Elements Run Time
37
Frequent Elements Space Used
38
Top-k Elements Results
- Query: k = 100, ε = 10^-4, and δ = 10^-2.
- We compared with:
  - CountSketch: it was re-run several times, and the hidden constant was estimated to be 16 in order to have output of competitive quality.
  - Probabilistic-InPlace: it was allowed the same number of counters as Space-Saving.
- Space-Saving was able to guarantee all its output to be correct.
39
Top-k Elements Precision
40
Top-k Elements Recall
41
Top-k Elements Run Time
42
Top-k Elements Space Used
44
Conclusion
Contributions:
- An integrated approach to solve an interesting family of problems
- Strict error bounds using little space
- Guarantees on results
- Special attention was given to Zipfian data
- Experimental validation
Future Work:
- Incremental frequent and top-k elements reporting