Finding the Frequent Items in Streams of Data
By Graham Cormode and Marios Hadjieleftheriou
Presented by Ankur Agrawal (09305002)


 Many data generation processes produce huge numbers of pieces of data, each of which is simple in isolation, but which taken together lead to a complex whole.
 Examples:
◦ The sequence of queries posed to an Internet search engine.
◦ The collection of transactions across all branches of a supermarket chain.
 Data can arrive at enormous rates - hundreds of gigabytes per day or higher.

 Storing and indexing such a large amount of data is costly.
 It is important to process the data as it arrives.
 Provide up-to-the-minute analysis and statistics.
 Algorithms take only a single pass over their input.
 They compute various functions using resources that are sublinear in the size of the input.

Problems range from simple to very complex.
 Given a stream of transactions, find the mean and standard deviation of the bill totals.
◦ Requires only a few sufficient statistics to be stored.
 Determine whether a search query has already appeared in the stream of queries.
◦ Requires a large amount of information to be stored.
 Algorithms must:
◦ Respond quickly to new information.
◦ Use far fewer resources than the total quantity of data.

 Given a stream of items, the problem is simply to find those items which occur most frequently.
 Formalized as finding all items whose frequency exceeds a specified fraction of the total number of items.
 Variations arise when the items are given weights, and further when these weights can also be negative.

 The problem is important both in itself and as a subroutine in more advanced computations. For example:
◦ If items represent packets on the Internet, it can help in routing decisions, in-network caching, etc.
◦ If items represent queries made to an Internet search engine, it can help in finding popular terms.
◦ Mining frequent itemsets inherently builds on this problem as a basic building block.
 Algorithms for the problem have been applied by large corporations such as AT&T and Google.

 Given a stream S of n items t1…tn, the exact ɸ-frequent items comprise the set {i | fi > ɸn}, where fi is the frequency of item i.
 Solving the exact frequent items problem requires Ω(n) space.
 An approximate version is defined based on a tolerance for error, parameterized by ɛ.

 Given a stream S of n items, the ɛ-approximate frequent items problem is to return a set of items F so that for all items i ∈ F, fi > (ɸ−ɛ)n, and there is no i ∉ F such that fi > ɸn.

 Given a stream S of n items, the frequency estimation problem is to process the stream so that, given any i, an estimate fi* is returned satisfying fi ≤ fi* ≤ fi + ɛn.

 Two main classes of algorithms:
◦ Counter-based algorithms.
◦ Sketch algorithms.
 Other solutions:
◦ Quantile- and sampling-based methods, built on various notions of randomly sampling items from the input and of summarizing the distribution of items.
◦ These are less effective for this problem and have attracted less interest.

 Track a subset of items from the input and monitor their counts.
 For each new arrival, decide whether to store it or not, and what count to associate with it.
 Cannot handle negative weights.

 The problem was posed by J. S. Moore in the Journal of Algorithms.

 Start with a counter set to zero. For each item:
◦ If the counter is zero, store the item and set the counter to 1.
◦ Else, if the item is the same as the stored item, increment the counter.
◦ Else, decrement the counter.
 If there is a majority item, it is the stored item.
 Proof outline:
◦ Each decrement pairs up two different items and cancels them out.
◦ Since the majority item occurs more than n/2 times, not all of its occurrences can be canceled out.
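The steps above are the classic Boyer–Moore majority vote; a minimal Python sketch (the function name is illustrative):

```python
def majority_candidate(stream):
    """Boyer-Moore majority vote: one stored item and one counter."""
    item, count = None, 0
    for x in stream:
        if count == 0:
            item, count = x, 1      # counter is zero: store the new item
        elif x == item:
            count += 1              # same as stored item: increment
        else:
            count -= 1              # different item: cancel one occurrence
    # If a majority item exists it must be `item`; a second pass over
    # the stream is needed to verify that it truly is a majority.
    return item
```

Note that when no majority exists the returned candidate is arbitrary, which is why the verification pass matters.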

 First proposed by Misra and Gries in 1982.
 Finds all items in a sequence whose frequency exceeds a 1/k fraction of the total count.
 Stores k−1 (item, counter) pairs.
 A generalization of the Majority algorithm.

 For each new item:
◦ Increment the counter if the item is already stored.
◦ Otherwise, if fewer than k−1 items are stored, store the new item with its counter set to 1.
◦ Otherwise, decrement all the counters.
◦ If any counter reaches 0, delete the corresponding item.
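The update rule can be sketched in Python as follows (a simplified illustration: the decrement step is written as a naive loop rather than in the O(1) amortized form discussed below):

```python
def misra_gries(stream, k):
    """Track at most k-1 (item, counter) pairs; any item whose true
    frequency exceeds n/k is guaranteed to survive in the output."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1                 # already stored: increment
        elif len(counters) < k - 1:
            counters[x] = 1                  # room left: store the item
        else:
            for item in list(counters):      # full: decrement everything
                counters[item] -= 1
                if counters[item] == 0:
                    del counters[item]       # drop items that reach zero
    return counters
```

The stored counts are lower bounds on the true frequencies; each undercounts by at most n/k.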
 The time cost involves:
◦ O(1) dictionary operations per update.
◦ The cost of decrementing counts, which can be performed in O(1) time.
 Also solves the frequency estimation problem if executed with k = 1/ɛ.

 Proposed by Manku and Motwani in 2002.
 Stores tuples comprising:
◦ An item.
◦ A lower bound on its count.
◦ A delta (Δ) value which records the difference between the upper bound and the lower bound.

 For processing the i-th item:
◦ Increment the lower bound if the item is already stored.
◦ Else create a new tuple for the item, with the lower bound set to 1 and Δ set to ⌊i/k⌋.
 Periodically delete tuples whose upper bound is less than ⌊i/k⌋.
 Uses O(k log(n/k)) space in the worst case.
 Also solves the frequency estimation problem with k = 1/ɛ.
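A sketch of the algorithm in Python, using the standard bucket formulation with bucket width k (a new tuple's Δ is written as b−1 for the current bucket b, which matches ⌊i/k⌋ up to the bucket-boundary convention; pruning happens at bucket boundaries):

```python
import math

def lossy_counting(stream, k):
    """Lossy Counting with bucket width k: each stored tuple keeps a
    lower bound on the count and a delta bounding the undercount."""
    entries = {}                      # item -> [lower_bound, delta]
    for i, x in enumerate(stream, start=1):
        b = math.ceil(i / k)          # current bucket number
        if x in entries:
            entries[x][0] += 1        # already stored: raise lower bound
        else:
            entries[x] = [1, b - 1]   # could have been missed b-1 times
        if i % k == 0:                # at a bucket boundary, prune
            for item in list(entries):
                low, delta = entries[item]
                if low + delta <= b:  # upper bound too small: delete
                    del entries[item]
    return entries
```

Every item with true frequency above n/k survives, and each stored lower bound undercounts by at most Δ ≤ n/k.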

* Note: the CACM 2009 version of the paper erroneously shows the if block (line 8 of its pseudocode) outside the for loop.

 Introduced by Metwally et al. in 2005.
 Stores k (item, count) pairs.
 Initialized with the first k distinct items and their exact counts.
 If a new item is not already stored, replace the stored item with the least count, and set the counter to 1 more than that least count.
 Items which are stored by the algorithm early in the stream and are not removed have very accurate estimated counts.

 The space required is O(k).
 The time cost involves the cost of:
◦ the dictionary operation of finding whether an item is stored.
◦ finding and maintaining the item with the minimum count.
 Simple heap implementations can track the smallest-count item in O(log k) time per update.
 As with other counter-based algorithms, it also solves the frequency estimation problem with k = 1/ɛ.
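The replacement rule above can be sketched as follows (for brevity, a linear scan finds the minimum; a heap would give the O(log k) update just noted):

```python
def space_saving(stream, k):
    """SpaceSaving: keep exactly k (item, count) pairs; on a miss,
    evict the minimum-count item and credit its count to the newcomer."""
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1                   # already stored: increment
        elif len(counts) < k:
            counts[x] = 1                    # still initializing: store
        else:
            # replace the item with the smallest count; the new item
            # gets that count plus one (an overestimate of its frequency)
            victim = min(counts, key=counts.get)
            counts[x] = counts.pop(victim) + 1
    return counts
```

The stored counts are upper bounds on the true frequencies, each overestimating by at most n/k.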

 The term “sketch” denotes:
◦ a data structure.
◦ a linear projection of the input frequency vector: the sketch is the product of the frequency vector with a matrix.
 Updates with negative values can easily be accommodated.
◦ This allows us to model the removal of items.
 Sketches primarily solve the frequency estimation problem.
 The algorithms are randomized.
◦ They also take a failure probability δ, so that the probability of failure is at most δ.

 Introduced by Charikar et al. in 2002.
 Each update only affects a small subset of the sketch.
 The sketch consists of a two-dimensional array C with d rows of w counters each.
 Uses two hash functions for each row j:
◦ hj, which maps input items onto [w].
◦ gj, which maps input items onto {−1, +1}.
 For each input item i:
◦ Add gj(i) to entry C[j, hj(i)] in row j, for 1 ≤ j ≤ d.

 For any row j, the value gj(i)·C[j, hj(i)] is an unbiased estimator for fi.
 The estimate fi* is the median of these estimates over the d rows.

 Setting d = log(4/δ) and w = O(1/ɛ²) ensures that fi* has error at most ɛn with probability at least 1−δ.
◦ Requires the hash functions to be chosen randomly from a family of “four-wise independent” hash functions.
 The total space used is O((1/ɛ²) log(1/δ)).
 The time per update is O(log(1/δ)) in the worst case.
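A simplified Python illustration of the sketch (Python's built-in hash, salted per row, stands in for the four-wise independent hash functions the analysis requires):

```python
import random

class CountSketch:
    """d rows of w counters; per row j, h_j places an item in a bucket
    and g_j gives it a random sign. Estimates take the median over rows."""
    def __init__(self, w, d, seed=42):
        rng = random.Random(seed)
        self.w, self.d = w, d
        # one (h, g) salt pair per row; real implementations would use
        # hash functions drawn from a four-wise independent family
        self.salts = [(rng.getrandbits(32), rng.getrandbits(32))
                      for _ in range(d)]
        self.C = [[0] * w for _ in range(d)]

    def _h(self, j, x):
        return hash((self.salts[j][0], x)) % self.w   # bucket in [w]

    def _g(self, j, x):
        return 1 if hash((self.salts[j][1], x)) % 2 == 0 else -1

    def update(self, x, weight=1):
        # add g_j(x) * weight to C[j, h_j(x)] in every row
        for j in range(self.d):
            self.C[j][self._h(j, x)] += self._g(j, x) * weight

    def estimate(self, x):
        # g_j(x) * C[j, h_j(x)] is unbiased per row; take the median
        ests = sorted(self._g(j, x) * self.C[j][self._h(j, x)]
                      for j in range(self.d))
        return ests[len(ests) // 2]
```

Because updates are signed additions, negative weights (item removals) work unchanged.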

 Introduced by Cormode and Muthukrishnan.
 Similarly uses a sketch consisting of a 2-D array of d rows of w counters each.
 Uses d hash functions hj, one for each row.
 Each update is mapped onto d entries in the array, each of which is incremented.
 The frequency estimate for any item i is fi* = min 1≤j≤d C[j, hj(i)].

 The expected overestimate for an item in each row is less than n/w.
◦ By the Markov inequality, the overestimate in a row exceeds a small multiple of n/w with at most constant probability.
 Setting d = log(1/δ) and w = O(1/ɛ) ensures that fi* has error at most ɛn with probability at least 1−δ.
 Count-Min is similar to CountSketch with gj(i) always equal to 1.
 The total space used is O((1/ɛ) log(1/δ)).
 The time per update is O(log(1/δ)) in the worst case.
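A minimal Python sketch of Count-Min (again, salted built-in hashing stands in for pairwise independent hash functions):

```python
import random

class CountMinSketch:
    """d rows of w counters, one hash function per row; the estimate is
    the minimum counter an item maps to, which never undercounts."""
    def __init__(self, w, d, seed=7):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.salts = [rng.getrandbits(32) for _ in range(d)]
        self.C = [[0] * w for _ in range(d)]

    def update(self, x, weight=1):
        # increment the one mapped counter in each of the d rows
        for j in range(self.d):
            self.C[j][hash((self.salts[j], x)) % self.w] += weight

    def estimate(self, x):
        # every row overestimates, so the minimum is the best bound
        return min(self.C[j][hash((self.salts[j], x)) % self.w]
                   for j in range(self.d))
```

Taking the minimum rather than a median is what makes the one-sided guarantee fi ≤ fi* possible.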

 Sketches estimate the frequency of a single item.
◦ How can we find the frequent items without explicitly querying the sketch for every possible item?
 A divide-and-conquer approach limits the search cost:
◦ Impose a binary tree over the domain.
◦ Keep a sketch of each level of the tree.
◦ Descend if a node is heavy, else stop.
 Correctness: all ancestors of a frequent item are also frequent.
 Assumes that negative updates are possible but no negative frequencies are allowed.
 Updates take one hashing operation per level of the tree, and O(1) counter updates for each hash.
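The descend logic can be illustrated as follows. For brevity, exact per-level counts stand in for the per-level sketches (the function name, the 8-bit domain, and the assumption that items are small non-negative integers are all illustrative):

```python
def hierarchical_heavy_hitters(stream, phi, domain_bits=8):
    """Divide-and-conquer search: maintain a count for every dyadic
    prefix at every level of a binary tree over the domain, then
    descend from the root, expanding only the heavy nodes.
    Items must be ints in [0, 2**domain_bits)."""
    levels = [dict() for _ in range(domain_bits + 1)]
    n = 0
    for x in stream:                       # one update per tree level
        n += 1
        for lvl in range(domain_bits + 1):
            prefix = x >> (domain_bits - lvl)
            levels[lvl][prefix] = levels[lvl].get(prefix, 0) + 1

    heavy_frontier = [0]                   # start at the root prefix
    for lvl in range(1, domain_bits + 1):
        next_frontier = []
        for node in heavy_frontier:
            for child in (2 * node, 2 * node + 1):
                if levels[lvl].get(child, 0) > phi * n:
                    next_frontier.append(child)   # heavy: descend
        heavy_frontier = next_frontier
    return heavy_frontier                  # heavy leaves = frequent items
```

In the real construction each `levels[lvl]` dictionary is replaced by a sketch, so the space stays sublinear while the search still touches only the heavy branches.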

 Randomly divides the input into buckets.
 Expects at most one frequent item in each bucket.
 Within each bucket, the items are divided into subgroups so that the “weight” of each group indicates the identity of the frequent item.
 Repeating this for all bit positions reveals the full identity of the item.

 Experiments were performed using 10 million packets of HTTP traffic, representing 24 hours of traffic from a backbone router in a major network.
 The frequency threshold ɸ was varied up to 0.01, and the error guarantee ɛ was set to ɸ.
 The algorithms are compared with respect to:
◦ Update throughput, measured in number of updates per millisecond.
◦ Space consumed, measured in bytes.
◦ Recall, measured as the number of true heavy hitters reported divided by the number of true heavy hitters given by an exact algorithm.
◦ Precision, measured as the number of true heavy hitters reported divided by the total number of answers reported.

 Overall, the SpaceSaving algorithm appears conclusively better than the other counter-based algorithms.
 SSH (SpaceSaving with a heap) yields very good estimates, typically achieving 100% recall and precision; it consumes very little space and is fairly fast to update.
 SSL (SpaceSaving with linked lists) is the fastest algorithm, with all the good characteristics of SSH, but consumes twice as much space on average.

 There is no clear winner among the sketch algorithms.
 CMH has a small size and high update throughput, but is only accurate for highly skewed distributions.
 CGT consumes a lot of space, but it is the fastest sketch and is very accurate in all cases, with high precision and good frequency estimation accuracy.
 CS has low space consumption and is very accurate in most cases, but has a slow update rate and exhibits some random behavior.