Efficient Computation of Frequent and Top-k Elements in Data Streams

Efficient Computation of Frequent and Top-k Elements in Data Streams Ahmed Metwally Divyakant Agrawal Amr El Abbadi Department of Computer Science University of California, Santa Barbara

Outline Problem Definition Space-Saving: Summarizing the Data Stream Answering Frequent Elements Queries Answering Top-k Queries Experimental Results Conclusion

Motivation Motivated by Internet advertising commissioners. Before rendering an advertisement for a user, query the click stream to decide which advertisements to display. If the user's profile is not a frequent “clicker”, then s/he will probably not click any displayed advertisement: show Pay-Per-Impression advertisements. If the user's profile is a frequent “clicker”, then s/he may click a displayed advertisement: show Pay-Per-Click advertisements. Retrieve the top advertisements to choose what to display.

Problem Definition Given an alphabet A and a stream S of size N, a frequent element, E, is an element whose frequency, F, exceeds a user-specified support, φN. The top-k elements are the k elements with the highest frequencies. The two problems are closely related, yet no integrated solution has been proposed. An exact solution requires O(min(N, |A|)) space, which motivates approximate variations.

Practical Frequent Elements ε-Deficient Frequent Elements [Manku ‘02]: All frequent elements output should have F > (φ - ε)N, where ε is the user-defined error.

Practical Top-k FindApproxTop(S, k, ε) [Charikar ‘02]: Retrieve a list of k elements such that every element, Ei, in the list has Fi > (1 - ε) Fk, where Ek is the kth ranked element.

Related Work Algorithms Classification Counter-Based techniques Keep an individual counter for each element If the observed ID is monitored, its counter is updated If the observed ID is not monitored, algorithm dependent action Sketch-Based techniques Estimate frequency for all elements using bit-maps of counters Each element is hashed into the counters’ space using a family of hash functions. Hashed-to counters are queried for the frequencies

Recent Work (Comparison)
Algorithm | Nature | Space Bound | Handles
CountSketch [Charikar ‘02] | Sketch | O(k/ε^2 log(N/δ)), δ is the failure probability | FindApproxTop(S, k, ε)
GroupTest [Cormode ’03] | Sketch | O(φ^-1 log(φ^-1) log(|A|)) | Hot Items
Frequent [Demaine ’02] | Counter | O(1/ε), proved by [Bose ‘03] | FE
Probabilistic-Inplace [Demaine ’02] | Counter | O(m), m is the available memory | FindCandidateTop(S, k, m/2)
Lossy Counting [Manku ’02] | Counter | (1/ε) log(εN) | ε-Deficient FE
Sticky Sampling [Manku ’02] | Counter | (2/ε) log(φ^-1 δ^-1) | ε-Deficient FE

The Space-Saving Algorithm Space-Saving is counter-based Monitor only m elements Only over-estimation errors Frequency estimation is more accurate for significant elements Keep track of max. possible errors

Space-Saving By Example
Space-Saving Algorithm: For every element in the stream S: if a monitored element is observed, increment its Count. If a non-monitored element is observed, replace the element with minimum hits, min; set the new element's Count to min + 1; and record min as its error, the maximum possible over-estimation.
Stream S = A B B A C A B B D D B E C, summarized with three counters. Each state lists Element and Count, with any nonzero error (max possible) in parentheses:
Start: empty
After A B B A C: A 2, B 2, C 1
After A B B A C A: A 3, B 2, C 1
After A B B A C A B B: B 4, A 3, C 1
After A B B A C A B B D: B 4, A 3, D 2 (error 1)
After A B B A C A B B D D B: B 5, A 3, D 3 (error 1)
After A B B A C A B B D D B E: B 5, E 4 (error 3), A 3
After A B B A C A B B D D B E C: B 5, E 4 (error 3), C 4 (error 3)
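
The update rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `space_saving` and the dictionary layout are assumptions, the minimum is found by a linear scan rather than in constant time, and ties for the minimum are broken arbitrarily, so intermediate states may differ from the slide while the final summary for this stream is the same.

```python
def space_saving(stream, m):
    """Summarize `stream` with at most `m` counters.

    Returns {element: [count, error]}, where error is the maximum
    possible over-estimation of that element's count.
    """
    counters = {}
    for x in stream:
        if x in counters:
            # Monitored element: increment its Count.
            counters[x][0] += 1
        elif len(counters) < m:
            # Free counter available (assumed start-up behaviour).
            counters[x] = [1, 0]
        else:
            # Non-monitored element: replace an element with minimum hits.
            victim = min(counters, key=lambda e: counters[e][0])
            min_count = counters.pop(victim)[0]
            # New Count is min + 1; the error is bounded by min.
            counters[x] = [min_count + 1, min_count]
    return counters


summary = space_saving("ABBACABBDDBEC", 3)
for element, (count, error) in summary.items():
    print(element, count, error)
```

For the example stream the final summary holds B with Count 5 and error 0, and E and C with Count 4 and error 3 each, so the Counts sum to N = 13.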

Space-Saving Observations S = ABBACABBDDBEC, N = 13
Element | B | E | C
Count | 5 | 4 | 4
error (max possible) | 0 | 3 | 3
Observations: The summation of the Counts is N. The minimum number of hits, min, satisfies min ≤ N/m; in this example, min = 4. The minimum number of hits, min, is an upper bound on the error of any element.

Space-Saving Proved Properties S = ABBACABBDDBEC, N = 13
Element | B | E | C
Count | 5 | 4 | 4
error (max possible) | 0 | 3 | 3
Property 1: If an element E has frequency F > min, then E must be in Stream-Summary. Example: F(B) = F1 = 5, min = 4.
Property 2: The Count at position i in Stream-Summary is no less than Fi, the frequency of the ith ranked element. Example: F(A) = F2 = 3, Count2 = 4.
Property 2 is important to guarantee the correctness and order of top-k.

Space-Saving Intuition Make use of the skewed property of the data A minority of the elements, the more frequent ones, gets the majority of the hits. Frequent elements will reside in the counters of bigger values. They will not be distorted by the ineffective hits of the infrequent elements. Numerous infrequent elements reside on the smaller counters.

Space-Saving Intuition (Cont’d) If the skew remains, but the popular elements change over time: The elements that are growing more popular will gradually be pushed to the top of the list. If one of the previously popular elements loses its popularity, its relative position will decline as other counters get incremented.

Space-Saving Data Structure We need a data structure that Increments counters in constant time Keeps elements sorted by their counters We propose the Stream-Summary structure, similar to the data structure in [Demaine ’02]
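
The slides do not spell out the layout of Stream-Summary, so the following is only one plausible sketch that meets the two stated requirements: a doubly linked list of buckets in ascending count order, where each bucket holds the elements sharing that count. All class and method names here are assumptions, and error tracking is omitted to keep the sketch short. An increment touches only a bucket and its immediate neighbour, and the list stays sorted by construction.

```python
class Bucket:
    """Elements that share one counter value, linked to neighbouring values."""

    def __init__(self, count):
        self.count = count
        self.elements = set()
        self.prev = None  # bucket with the next smaller count
        self.next = None  # bucket with the next larger count


class StreamSummary:
    """Buckets kept in ascending count order; head is the minimum."""

    def __init__(self):
        self.head = None
        self.bucket_of = {}  # element -> its current Bucket

    def _drop_if_empty(self, bucket):
        if bucket.elements:
            return
        if bucket.prev is None:
            self.head = bucket.next
        else:
            bucket.prev.next = bucket.next
        if bucket.next is not None:
            bucket.next.prev = bucket.prev

    def add(self, element):
        """Start monitoring a new element with a count of 1."""
        if self.head is None or self.head.count != 1:
            first = Bucket(1)
            first.next = self.head
            if self.head is not None:
                self.head.prev = first
            self.head = first
        self.head.elements.add(element)
        self.bucket_of[element] = self.head

    def increment(self, element):
        """Move element to the bucket for count + 1, creating it if needed."""
        bucket = self.bucket_of[element]
        target = bucket.next
        if target is None or target.count != bucket.count + 1:
            target = Bucket(bucket.count + 1)
            target.prev, target.next = bucket, bucket.next
            if bucket.next is not None:
                bucket.next.prev = target
            bucket.next = target
        bucket.elements.remove(element)
        target.elements.add(element)
        self.bucket_of[element] = target
        self._drop_if_empty(bucket)

    def replace_min(self, element):
        """Hand a minimum counter to element, then increment it."""
        victim = self.head.elements.pop()  # arbitrary choice among ties
        del self.bucket_of[victim]
        self.head.elements.add(element)
        self.bucket_of[element] = self.head
        self.increment(element)

    def counts(self):
        """Return {element: count} by walking the bucket list."""
        result, bucket = {}, self.head
        while bucket is not None:
            for element in bucket.elements:
                result[element] = bucket.count
            bucket = bucket.next
        return result


def summarize(stream, m):
    """Run the Space-Saving update rule over stream with m counters."""
    summary = StreamSummary()
    for x in stream:
        if x in summary.bucket_of:
            summary.increment(x)
        elif len(summary.bucket_of) < m:
            summary.add(x)
        else:
            summary.replace_min(x)
    return summary.counts()
```

Grouping equal counts into one bucket is what avoids re-sorting: an incremented element can only move to the adjacent bucket, so no scan of the list is needed.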

Frequent Elements Queries Traverse Stream-Summary, and report all elements that satisfy the user support Any element whose guaranteed hits = (Count – error) > φN is guaranteed to be a frequent element
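
The query above can be sketched as a single pass over the summary. This is an illustration under assumptions: the function name `frequent_elements` and the `(element, count, error)` tuple layout are hypothetical, and the sample summary is the three-counter result for S = ABBACABBDDBEC with an arbitrarily chosen φ = 0.3.

```python
def frequent_elements(summary, n, phi):
    """Split the summary into candidate and guaranteed frequent elements.

    summary: iterable of (element, count, error) tuples.
    n: stream length N; phi: user-specified support fraction.
    """
    summary = list(summary)
    threshold = phi * n
    # Candidates: the over-estimated Count exceeds the support.
    candidates = [e for e, count, _ in summary if count > threshold]
    # Guaranteed: even the lower bound (Count - error) exceeds the support.
    guaranteed = [e for e, count, error in summary
                  if count - error > threshold]
    return candidates, guaranteed


# Summary for S = ABBACABBDDBEC with three counters, N = 13.
example = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
candidates, guaranteed = frequent_elements(example, n=13, phi=0.3)
print(candidates, guaranteed)
```

With φN = 3.9, all three monitored elements are candidates, but only B is guaranteed, since E and C each have guaranteed hits of 1.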

Frequent Elements Example
Element: B D G A Q F C E
Count: 20 14 12 9 7 5 3 3
error: 1 4 2
Guaranteed Hits = Count - error: 19 8
For N = 73, m = 8, φ = 0.15: Frequent elements should have a support of 11 hits. Candidate frequent elements are B, D, and G. Guaranteed frequent elements are B and D, since their guaranteed hits > 11.

Frequent Elements Space Bounds
Algorithm | General Distribution | Zipf(α)
Space-Saving | O(1/ε) | (1/ε)^(1/α)
GroupTest | O(φ^-1 log(φ^-1) log(|A|)) |
Frequent | O(1/ε), proved by [Bose ’03] |
Lossy Counting | (1/ε) log(εN) |
Sticky Sampling | (2/ε) log(φ^-1 δ^-1) |

FE: Quantitative Comparison Example: N = 10^6, |A| = 10^4, φ = 10^-1, ε = 10^-2, and δ, the failure probability, = 10^-1. Uniform data: Space-Saving and Frequent: 100 counters; Sticky Sampling: 700 counters; Lossy Counting: 1000 counters; GroupTest: C*930 counters, C ≥ 1. Zipfian with α = 2: Space-Saving: 10 counters.

FE: Qualitative Comparison Frequent: It has a bound similar to Space-Saving in the general distribution case. It is built and queried in a way that does not allow the user to specify an error threshold. There is no feasible extension to track under-estimation errors. Every observation of a non-monitored element increases the errors for all the monitored elements, since their counters get decremented.

FE: Qualitative Comparison (Cont’d) GroupTest: It does not output frequencies at all. It reveals nothing about the relative order of the elements. It assumes that IDs are 1 … |A|. This can only be enforced by building an indexed lookup table. Thus, practically it needs O(|A|) space.

FE: Qualitative Comparison (Cont’d) Lossy Counting and Sticky Sampling: The theoretical space bound of Space-Saving is much tighter than those of Lossy Counting and Sticky Sampling.

Top-k Elements Queries Traverse the Stream-Summary, and report the top-k elements. From Property 2, we assert: Guaranteed top-k elements: any element whose guaranteed hits = (Count – error) ≥ Count_(k+1) is guaranteed to be in the top-k. Guaranteed top-k’ (where k’ ≈ k): the top-k’ elements reported are guaranteed to be the correct top-k’ iff for every element in the top-k’, guaranteed hits = (Count – error) ≥ Count_(k’+1).
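
The two guarantees above can be checked with a short routine. This is a sketch under assumptions: the function name `top_k`, the `(element, count, error)` tuple layout, and the restriction to k smaller than the number of monitored elements (so that a (k+1)th Count exists) are choices made for this illustration. Only the first k positions are tested for the guarantee.

```python
def top_k(summary, k):
    """Report top-k candidates and which of them are guaranteed.

    summary: iterable of (element, count, error) tuples.
    Returns (candidates, guaranteed, exact), where `exact` is True
    when every reported element is guaranteed, i.e. the reported
    list is the correct top-k.
    """
    ranked = sorted(summary, key=lambda item: item[1], reverse=True)
    if not 0 < k < len(ranked):
        raise ValueError("k must be between 1 and m - 1")
    next_count = ranked[k][1]  # Count of the (k+1)th monitored element
    candidates = [e for e, _, _ in ranked[:k]]
    guaranteed = [e for e, count, error in ranked[:k]
                  if count - error >= next_count]
    return candidates, guaranteed, len(guaranteed) == k


# Summary for S = ABBACABBDDBEC with three counters.
example = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
print(top_k(example, 1))
print(top_k(example, 2))
```

For k = 1, B has guaranteed hits 5 ≥ 4, so the top-1 is guaranteed. For k = 2, E has guaranteed hits 1 < 4, so only B is guaranteed and a caller could fall back to k’ = 1.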

Top-k Elements Example
Element: B D G A Q F C E
Count: 20 14 12 9 7 5 3 3
error: 1 4 2
Guaranteed Hits = Count - error: 19 8
For k = 3, m = 8: B, D, and G are the top-3 candidates. B and D are guaranteed to be in the top-3. B, D, G, and A are guaranteed to be the top-4; here k’ = 4. B and D are guaranteed to be the top-2; another k’ = 2.

Top-k Elements Space Bounds
Algorithm | General Distribution | Zipf(α)
Space-Saving | FindApproxTop(S, k, ε): O(k/ε * log(N)) | Exact Top-k Problem: α = 1: O(k^2 log(A)); α > 1: O((k/α)^(1/α) k)
CountSketch | O(k/ε^2 * log(N/δ)) | α ≥ 1: O(k * log(N/δ))

Top-k: Quantitative Comparison For N = 10^6, |A| = 10^4, k = 100, ε = 10^-1, and δ = 10^-1. Uniform data: Space-Saving: 1000 counters; CountSketch: C*2.3*10^7 counters, C >> 1. Zipfian data with α = 2: Space-Saving: 66 counters; CountSketch: C*230 counters, C >> 1.

Top-k: Qualitative Comparison CountSketch: General distribution: Space-Saving has a tighter theoretical space bound. Zipf(α) distribution: Space-Saving solves the exact problem, while CountSketch solves the approximate problem. Space-Saving has a tighter bound for skewed data and for long streams, and it has zero probability of failure.

Top-k: Qualitative Comparison (Cont’d) Probabilistic-Inplace: Outputs m/2 elements, which is too many. Zipf(α) distribution: Probabilistic-Inplace does not offer space analysis in case of Zipfian data.

Experimental Results - Setup Synthetic data: Zipf(α), α varied: 0.0, 0.5, 1.0, …, 2.5, 3.0; N = 10^7 hits. Real data (ValueClick, Inc.): similar results. Precision: number of correct elements found / entire output. Recall: number of correct elements found / number of actual correct elements. Run time: processing stream + query time. Space used: including hash table.
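
The precision and recall definitions above translate directly into code. A minimal sketch, assuming both the reported output and the set of actually correct elements are non-empty sets; the function names are hypothetical.

```python
def precision(output, correct):
    """Correct elements found divided by the size of the entire output."""
    output, correct = set(output), set(correct)
    return len(output & correct) / len(output)  # assumes non-empty output


def recall(output, correct):
    """Correct elements found divided by the number of actual correct elements."""
    output, correct = set(output), set(correct)
    return len(output & correct) / len(correct)  # assumes non-empty truth


# Hypothetical query result: three elements reported, two of them correct.
reported = {"B", "E", "C"}
actual = {"B", "C"}
print(precision(reported, actual), recall(reported, actual))
```

In this made-up case every correct element is reported, so recall is 1, while one false positive out of three reported elements lowers precision to 2/3.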

Frequent Elements Results Query: φ = 10^-2, ε = 10^-4, and δ = 10^-2. We compared with GroupTest and Frequent. All algorithms had a recall of 1. That is, they all output the correct elements among their output. Space-Saving was able to guarantee all its output to be correct.

Frequent Elements Precision

Frequent Elements Run Time

Frequent Elements Space Used

Top-k Elements Results Query: k = 100, ε = 10^-4, and δ = 10^-2. We compared with: CountSketch: CountSketch was re-run several times; the hidden constant was estimated to be 16, in order to have output of competitive quality. Probabilistic-InPlace: was allowed the same number of counters as Space-Saving. Space-Saving was able to guarantee all its output to be correct.

Top-k Elements Precision

Top-k Elements Recall

Top-k Elements Run Time

Top-k Elements Space Used

Conclusion Contributions: An integrated approach to solve an interesting family of problems; strict error bounds using little space; guarantees on results; special attention given to Zipfian data; experimental validation. Future Work: Incremental frequent and top-k elements reporting.