1 Efficient Computation of Frequent and Top-k Elements in Data Streams.

Presentation transcript:

1 Efficient Computation of Frequent and Top-k Elements in Data Streams

2 Motivation
Motivated by Internet advertising commissioners.
Before rendering an advertisement for a user, query the click stream to decide which advertisements to display.
If the user's profile is not a frequent “clicker”, then s/he will probably not click any displayed advertisement.
– Show Pay-Per-Impression advertisements.
If the user's profile is a frequent “clicker”, then s/he may click a displayed advertisement.
– Show Pay-Per-Click advertisements.
– Retrieve top advertisements to choose what to display.

3 Problem Definition
Given an alphabet A and a stream S of size N, a frequent element, E, is an element whose frequency, F, exceeds a user-specified support, φN.
Top-k elements are the k elements with the highest frequency.
Both problems:
– Are closely related, though no integrated solution has been proposed.
– Need O(min(N, A)) space for an exact solution, hence the approximate variations.
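As a point of reference for the two definitions above, here is a minimal sketch of the exact (non-streaming) solution. It keeps one counter per distinct element, which is the O(min(N, A)) space cost that motivates the approximate variations. The function names, the sample stream (the one used in the Space-Saving example of this presentation), and φ = 0.2 are illustrative assumptions, not part of the slides.

```python
from collections import Counter


def exact_frequent(stream, phi):
    """Return the set of elements whose frequency F exceeds phi * N."""
    counts = Counter(stream)          # one counter per distinct element
    n = sum(counts.values())          # stream size N
    return {e for e, f in counts.items() if f > phi * n}


def exact_top_k(stream, k):
    """Return the k (element, frequency) pairs with the highest frequency."""
    return Counter(stream).most_common(k)


# Hypothetical inputs chosen for illustration only.
stream = "ABBACABBDDBEC"
frequent = exact_frequent(stream, 0.2)
top2 = exact_top_k(stream, 2)
```

With this stream, A occurs 3 times and B occurs 5 times, so both exceed 0.2 × 13 = 2.6 hits.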

4 Practical Frequent Elements
ε-Deficient Frequent Elements [Manku ’02]:
– All frequent elements output should have F > (φ - ε)N, where ε is the user-defined error.
[Figure labels: φN and (φ - ε)N]

5 Practical Top-k
FindApproxTop(S, k, ε) [Charikar ’02]:
– Retrieve a list of k elements such that every element, E_i, in the list has F_i > (1 - ε) F_k, where E_k is the k-th ranked element.
[Figure labels: F_4 and (1 - ε) F_4]

6 The Space-Saving Algorithm
– Space-Saving is counter-based.
– Monitor only m elements.
– Only over-estimation errors.
– Frequency estimation is more accurate for significant elements.
– Keep track of max. possible errors.

7 Space-Saving By Example
Space-Saving Algorithm:
– For every element in the stream S:
– If a monitored element is observed, increment its Count.
– If a non-monitored element is observed, replace the element with minimum hits, min, set the new element's Count to min + 1, and record min as its error (the maximum possible over-estimation).
Example with three monitored elements and stream S = ABBACABBDDBEC. Each entry is Element (Count, error), where error is the max possible error:
After ABBAC: A (2, 0), B (2, 0), C (1, 0)
After ABBACA: A (3, 0), B (2, 0), C (1, 0)
After ABBACABB: B (4, 0), A (3, 0), C (1, 0)
After ABBACABBD: B (4, 0), A (3, 0), D (2, 1)
After ABBACABBDDB: B (5, 0), A (3, 0), D (3, 1)
After ABBACABBDDBE: B (5, 0), E (4, 3), A (3, 0)
After ABBACABBDDBEC: B (5, 0), E (4, 3), C (4, 3)
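The update rule in this example can be sketched in a few lines of Python. This is a simplified illustration that uses a plain dictionary and a linear scan for the minimum, not the constant-time Stream-Summary structure. The function name `space_saving` and the tie-breaking rule (evict the most recently inserted element among those with the minimum Count) are assumptions; the tie-break was chosen only because it reproduces the tables of this example.

```python
def space_saving(stream, m):
    """Process the stream with m counters; return {element: (count, error)}."""
    counters = {}  # element -> [count, error]; dict keeps insertion order
    for x in stream:
        if x in counters:
            # Monitored element: increment its Count.
            counters[x][0] += 1
        elif len(counters) < m:
            # Free counter available: start monitoring with no error.
            counters[x] = [1, 0]
        else:
            # Assumption: among elements tied at the minimum Count,
            # evict the most recently inserted one.
            victim = min(reversed(list(counters)), key=lambda e: counters[e][0])
            min_count = counters.pop(victim)[0]
            # New element gets Count = min + 1 and error = min.
            counters[x] = [min_count + 1, min_count]
    return {e: (count, error) for e, (count, error) in counters.items()}


summary = space_saving("ABBACABBDDBEC", 3)
total = sum(count for count, _ in summary.values())
```

On the example stream this yields B (5, 0), E (4, 3), C (4, 3), and the Counts add up to N = 13.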

8 Space-Saving Observations
S = ABBACABBDDBEC, N = 13
Final summary, as Element (Count, error): B (5, 0), E (4, 3), C (4, 3)
Observations:
– The summation of the Counts is N.
– Minimum number of hits, min ≤ N/m.
– In this example, min = 4.
– The minimum number of hits, min, is an upper bound on the error of any element.
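The three observations can be checked directly against the final summary of the example. The snippet below is only a sanity check on the numbers shown in the slides; the variable names are illustrative.

```python
# Final summary from the example: element -> (Count, error).
summary = {"B": (5, 0), "E": (4, 3), "C": (4, 3)}
n = 13  # stream size N
m = 3   # number of monitored elements

total = sum(count for count, _ in summary.values())
min_count = min(count for count, _ in summary.values())

checks = {
    # The Counts add up to N.
    "sum_is_n": total == n,
    # min <= N/m: here 4 <= 13/3.
    "min_le_n_over_m": min_count <= n / m,
    # No recorded error exceeds min.
    "errors_le_min": all(error <= min_count for _, error in summary.values()),
}
```

The bound min ≤ N/m follows from the first observation: all m Counts are at least min and they sum to N, so m × min ≤ N.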

9 Space-Saving Proved Properties
S = ABBACABBDDBEC, N = 13
Final summary, as Element (Count, error): B (5, 0), E (4, 3), C (4, 3)
1. If Element E has frequency F > min, then E must be in Stream-Summary. F(B) = F_1 = 5, min = 4.
2. The Count at position i in Stream-Summary is no less than F_i, the frequency of the i-th ranked element. F(A) = F_2 = 3, Count_2 = 4.
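Both properties can be verified on the example by comparing the summary with the true frequencies. This is a check on one stream, not a proof; the variable names are illustrative assumptions.

```python
from collections import Counter

stream = "ABBACABBDDBEC"
# Final summary from the example, sorted by Count: (element, Count, error).
summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]

true_freq = Counter(stream)
min_count = min(count for _, count, _ in summary)
monitored = {element for element, _, _ in summary}

# Property 1: every element with true frequency F > min is monitored.
property_1 = all(e in monitored for e, f in true_freq.items() if f > min_count)

# Property 2: Count at position i is no less than F_i,
# the frequency of the i-th ranked element.
ranked_freqs = sorted(true_freq.values(), reverse=True)
property_2 = all(
    count >= ranked_freqs[i] for i, (_, count, _) in enumerate(summary)
)
```

Note that Property 2 compares positions, not elements: position 2 holds E, while the second-ranked element is A with F_2 = 3 ≤ Count_2 = 4.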

10 Space-Saving Data Structure
We need a data structure that:
– Increments counters in constant time.
– Keeps elements sorted by their counters.
We propose the Stream-Summary structure, similar to the data structure in [Demaine ’02].

11 Frequent Elements Queries
Traverse Stream-Summary and report all elements that satisfy the user support.
Any element whose guaranteed hits = (Count - error) > φN is guaranteed to be a frequent element.
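A minimal sketch of this query, assuming the summary is available as a list of (element, Count, error) tuples sorted by Count. The function name, the reuse of the Space-Saving example summary, and φ = 0.3 are illustrative assumptions.

```python
def frequent_elements(summary, n, phi):
    """Return (candidates, guaranteed) for support phi over a stream of size n.

    summary: list of (element, count, error) tuples.
    """
    threshold = phi * n
    # Candidates: elements whose Count satisfies the user support.
    candidates = [e for e, count, _ in summary if count > threshold]
    # Guaranteed: elements whose guaranteed hits (Count - error) exceed phi * N.
    guaranteed = [e for e, count, error in summary if count - error > threshold]
    return candidates, guaranteed


# Hypothetical query on the Space-Saving example summary (N = 13).
summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
candidates, guaranteed = frequent_elements(summary, n=13, phi=0.3)
```

Here the threshold is 3.9 hits, so all three monitored elements are candidates, but only B has enough guaranteed hits to be reported with certainty.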

12 Frequent Elements Example
For N = 73, m = 8, φ = 0.15:
– Frequent elements should have a support of 11 hits (φN = 10.95).
– Candidate frequent elements are B, D, and G.
– Guaranteed frequent elements are B and D, since their guaranteed hits > 11.
[Table columns: Element (B, D, G, A, Q, F, C, E), Count, error, Guaranteed Hits = Count - error]

13 Top-k Elements Queries
Traverse the Stream-Summary and report the top-k elements.
From Property 2, we assert:
– Guaranteed top-k elements: any element whose guaranteed hits = (Count - error) ≥ Count_(k+1) is guaranteed to be in the top-k.
– Guaranteed top-k’ (where k’ ≈ k): the top-k’ elements reported are guaranteed to be the correct top-k’ iff, for every element in the top-k’, guaranteed hits = (Count - error) ≥ Count_(k’+1).
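The two guarantees can be sketched as follows, assuming the summary is a list of (element, Count, error) tuples sorted by Count in descending order. The function name and the choice to reject k ≥ m (where Count_(k+1) does not exist) are assumptions made for this illustration.

```python
def top_k(summary, k):
    """Return (reported, guaranteed, all_guaranteed) for a top-k query.

    summary: list of (element, count, error), sorted by count descending.
    """
    if not 0 < k < len(summary):
        # Assumption: Count_(k+1) must exist for the test to be applied.
        raise ValueError("k must satisfy 0 < k < number of monitored elements")
    next_count = summary[k][1]  # Count_(k+1)
    reported = [e for e, _, _ in summary[:k]]
    # Elements whose guaranteed hits reach Count_(k+1) are surely in the top-k.
    guaranteed = [e for e, c, err in summary[:k] if c - err >= next_count]
    return reported, guaranteed, len(guaranteed) == k


# Hypothetical queries on the Space-Saving example summary.
summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
top1 = top_k(summary, 1)
top2 = top_k(summary, 2)
```

For k = 1 the answer B is guaranteed. For k = 2 the reported pair is B and E, but E has only 1 guaranteed hit, so the pair is not guaranteed; in the example stream the true second-ranked element is A.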

14 Top-k Elements Example
For k = 3, m = 8:
– B, D, and G are the top-3 candidates.
– B and D are guaranteed to be in the top-3.
– B, D, G, and A are guaranteed to be the top-4. Here k’ = 4.
– B and D are guaranteed to be the top-2. Another k’ = 2.
[Table columns: Element (B, D, G, A, Q, F, C, E), Count, error, Guaranteed Hits = Count - error]

15 Frequent Elements Precision

16 Frequent Elements Run Time

17 Frequent Elements Space Used

18 Max freq. element in stream
Can we promise to find it with fewer than m buckets?