Heavy hitter computation over data streams

Presentation transcript:

Heavy hitter computation over data streams. Slides modified from Rajeev Motwani (Stanford University) and Subhash Suri (University of California).

Frequency-Related Problems

[Figure: histogram of element frequencies for elements 1 through 20]

Top-k most frequent elements.
Find all elements with frequency > 0.1%.
What is the total frequency of elements between 8 and 14?
What is the frequency of element 3?
How many elements have non-zero frequency?

An Old Chestnut: Majority

Given a sequence of N items and only constant memory, decide in one pass whether some item is in the majority, i.e., occurs more than N/2 times. Example: a stream of N = 12 items in which item 9 occurs more than 6 times; item 9 is the majority.

Misra-Gries Algorithm ('82)

Keep a single counter and a stored ID.
If the counter is 0, store the new item with count = 1.
Else, if the new item is the same as the stored ID, increment the counter.
Otherwise, decrement the counter.
When the stream ends, if the counter is > 0, its item is the only candidate for majority.

[Figure: an example stream with the evolving (ID, count) pair]
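
A minimal Python sketch of this single-counter scheme (the function name is my own):

    def majority_candidate(stream):
        """One pass, constant memory. Returns the only possible majority
        item, or None if the counter ends at zero. A second pass is still
        needed to confirm the candidate really occurs > N/2 times."""
        candidate, count = None, 0
        for item in stream:
            if count == 0:
                candidate, count = item, 1
            elif item == candidate:
                count += 1
            else:
                count -= 1
        return candidate if count > 0 else None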

A generalization: Frequent Items (Karp 03)

Find the items occurring more than N/(k+1) times (there can be at most k of them). Algorithm (a sketch follows below): maintain up to k items and their counters.
If the next item x is one of the k tracked items, increment its counter.
Else, if some counter is zero (fewer than k items are tracked), store x there with count = 1.
Else (all k counters are non-zero), decrement all k counters.

[Figure: table of k (ID, count) pairs]
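
A Python sketch of the k-counter version (names are my own); note that with k = 1 it reduces to the single-counter scheme above:

    def misra_gries(stream, k):
        """Track at most k (item, count) pairs. Any item occurring more
        than N/(k+1) times in a stream of N items ends with count > 0."""
        counters = {}
        for x in stream:
            if x in counters:
                counters[x] += 1
            elif len(counters) < k:
                counters[x] = 1          # a free (zero) counter exists
            else:
                # all k counters are non-zero: decrement them all and
                # drop any counter that reaches zero
                for y in list(counters):
                    counters[y] -= 1
                    if counters[y] == 0:
                        del counters[y]
        return counters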

Frequent Elements: Analysis

A tracked item's count is decremented only when all k counters are non-zero, and each such decrement step erases k+1 item occurrences: one from each of the k counters plus the arriving item. The stream contains only N occurrences in total, so there can be at most N/(k+1) decrement steps. Hence, if x occurs more than N/(k+1) times, it cannot be completely erased. Similarly, x must get inserted at some point, because there are not enough other items to keep it out.
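
To see the counting argument with concrete numbers (the values of N and k are my own): let N = 100 and k = 3, so the threshold is N/(k+1) = 25. Each decrement step erases k+1 = 4 occurrences, one from each of the 3 counters plus the arriving item, so at most 100/4 = 25 decrement steps can ever happen. An item occurring 26 times therefore cannot have all of its occurrences erased and must finish with a positive counter.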

Problem of False Positives

The Misra-Gries (MG) algorithm identifies all true heavy hitters, but not all reported items are necessarily heavy hitters. How can we tell whether the non-zero counters correspond to true heavy hitters? A second pass is needed to verify them. False positives are problematic if heavy hitters are used for billing or punishment. What guarantees can we achieve in one pass?

Approximation Guarantees

Goal: find heavy hitters with a guaranteed approximation error [MM02] (Manku-Motwani, Lossy Counting).
Suppose you want s-heavy hitters: items with frequency > sN.
Pick an approximation parameter ε with ε << s (e.g., s = .01 and ε = .0001, i.e., s = 1% and ε = .01%).
Identify all items with frequency > sN.
No reported item has frequency < (s-ε)N.
The algorithm uses O(1/ε log(εN)) memory.

MM02 Algorithm 1: Lossy Counting

Step 1: divide the stream into windows (Window 1, Window 2, Window 3, ...). The window size W is a function of the support s; it is specified later.

Lossy Counting in Action

[Figure: frequency counts, starting from empty, accumulated over the first window]

Counters are created and incremented as the items of the first window arrive. At the window boundary, decrement all counters by 1.

Lossy Counting continued

[Figure: frequency counts updated with the next window]

The same step repeats for every subsequent window: add the new window's counts, then decrement all counters by 1 at the window boundary, dropping any counter that reaches 0.
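
A minimal Python sketch of the windowed scheme described on these two slides; the class and method names are my own, and the version in the MM02 paper additionally stores a per-entry error term (Delta) that this sketch omits:

    import math

    class LossyCounter:
        def __init__(self, eps):
            self.eps = eps
            self.window = math.ceil(1.0 / eps)    # window size W = 1/eps
            self.counts = {}
            self.n = 0                            # items seen so far

        def add(self, x):
            self.counts[x] = self.counts.get(x, 0) + 1
            self.n += 1
            if self.n % self.window == 0:
                # window boundary: decrement all counters, drop zeros
                for y in list(self.counts):
                    self.counts[y] -= 1
                    if self.counts[y] == 0:
                        del self.counts[y]

        def heavy_hitters(self, s):
            # report elements whose counter exceeds (s - eps) * n
            threshold = (s - self.eps) * self.n
            return [x for x, c in self.counts.items() if c > threshold]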

Error Analysis

How much do we undercount? If the current stream length is N and the window size is W = 1/ε, then the number of windows is εN. A counter is decremented at most once per window boundary, so the frequency error is at most εN.
Rule of thumb: set ε to 10% of the support s. Example: given support frequency s = 1%, set error frequency ε = 0.1%.
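
Plugging the rule of thumb into the bound with illustrative numbers: for s = 1% and ε = 0.1%, the window size is W = 1/ε = 1000. After N = 1,000,000 items the stream has crossed εN = 1000 window boundaries, so each counter has been decremented at most 1000 times; every reported frequency is low by at most 1000 = 0.1% of N.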

Putting it all together…

Output: elements with counter values exceeding (s-ε)N.
Approximation guarantees:
Frequencies are underestimated by at most εN.
No false negatives.
False positives have true frequency at least (s-ε)N.
How many counters do we need? Worst-case bound: 1/ε log(εN) counters.
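
Using the LossyCounter sketch above on a toy stream (all values here are illustrative, and ε is far larger than the rule of thumb suggests, just to keep the example tiny):

    stream = [3, 9, 9, 4, 9, 9, 6, 9, 9, 9, 2, 9]   # hypothetical data
    lc = LossyCounter(eps=0.25)      # window size W = 4
    for x in stream:
        lc.add(x)
    print(lc.heavy_hitters(s=0.5))   # prints [9]: count 5 > (0.5 - 0.25) * 12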

Misra-Gries revisited

Running the MG algorithm with k = 1/ε counters also achieves the ε-approximation: it undercounts any item by at most εN, and in fact MG uses only O(1/ε) memory. Lossy Counting is slightly better in per-item processing cost: MG needs an extra data structure to decrement all k counters efficiently, while Lossy Counting is O(1) amortized per item.
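
In code this is just the misra_gries sketch from earlier, run with k = ceil(1/ε) counters (parameter values are illustrative):

    import math

    eps = 0.25                                       # toy value; the rule of thumb would use 10% of s
    stream = [3, 9, 9, 4, 9, 9, 6, 9, 9, 9, 2, 9]   # hypothetical data
    summary = misra_gries(stream, k=math.ceil(1.0 / eps))
    # each count in `summary` undercounts the true frequency by at most eps * len(stream)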

Algorithm 2: Sticky Sampling

[Figure: stream 34 15 30 28 31 41 23 35 19, with counters created for sampled items]

Create counters by sampling; maintain exact counts thereafter. What should the sampling rate be?

Sticky Sampling contd...

For a finite stream of length N, the sampling rate is (2/εN) log(1/(sδ)), where δ is the probability of failure.
Output: elements with counter values exceeding (s-ε)N.
This gives the same error guarantees as Lossy Counting, but probabilistic. Approximation guarantees (holding with probability at least 1-δ):
Frequencies are underestimated by at most εN.
No false negatives.
False positives have true frequency at least (s-ε)N.
Same rule of thumb: set ε to 10% of the support s. Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%.

Number of counters?

For a finite stream of length N, the sampling rate is (2/εN) log(1/(sδ)). For an infinite stream with unknown N, gradually adjust the sampling rate; at each rate change, each tracked element is removed with a certain probability. In either case, the expected number of counters is (2/ε) log(1/(sδ)), independent of N.
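
A simplified Python sketch of Sticky Sampling for the known-N case (names are my own). The unknown-N version grows the sampling period geometrically and, at each rate change, tosses coins that may evict existing counters, which is the "remove the element with certain probability" step above:

    import math
    import random

    def sticky_sampling(stream, N, s, eps, delta):
        """Each untracked item starts a counter with probability p;
        once tracked, an item is counted exactly."""
        p = min(1.0, (2.0 / (eps * N)) * math.log(1.0 / (s * delta)))
        counts = {}
        for x in stream:
            if x in counts:
                counts[x] += 1               # exact counting once sampled
            elif random.random() < p:
                counts[x] = 1                # sampled into the summary
        # output: elements with counter values exceeding (s - eps) * N
        return [x for x, c in counts.items() if c > (s - eps) * N]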