What's Hot and What's Not: Tracking Most Frequent Items Dynamically. G. Cormode and S. Muthukrishnan, Rutgers University. ACM Principles of Database Systems.

What's Hot and What's Not: Tracking Most Frequent Items Dynamically. G. Cormode and S. Muthukrishnan, Rutgers University. ACM Principles of Database Systems 2003; ACM Transactions on Database Systems 2005.

Introduction
- Find "hot" items, where the set of hot items changes over time
- Applications: caching, load balancing, sensor networks, data mining, etc.
- Prior work usually focuses on inserts only; this paper also takes deletes into account

Prior works
- Streams with a sliding window (*)
- Flajolet-Martin approach (*): estimates the number of distinct elements
- Majority voting algorithm: uses only one counter to identify the majority item
- Lossy counting
[Figure: elements vs. arrival time] *

Contribution of the paper
- Dynamically maintains the hot items: both insert and delete transactions are supported
- Randomized algorithm
  - Uses a hash table
  - Uses randomness to defeat an omniscient adversary
- Small space required
- Short processing time

Finding the majority item
- Keep $\log_2 m + 1$ counters
  - $C_0$: counts how many items are "live"
  - $C_j$ ($j \neq 0$): increased on an insert of item $x$, or decreased on a delete, if bit(x, j) = 1
- Search: if there is a majority item, it is given by $x = \sum_{j=1}^{\log_2 m} 2^{j-1} \cdot [C_j > C_0/2]$, that is, bit $j$ is set exactly when $C_j$ exceeds half of $C_0$
- No false negatives, but false positives are possible

Algorithms to find the majority element in a sequence of updates

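Example: space of 8 items
- Counter 0: number of live items; Counter 1: bit $2^0$; Counter 2: bit $2^1$; Counter 3: bit $2^2$
- Find majority: start with x = 0; for each counter, is its value > (Counter 0)/2? If so, add its power of two (in the slide's figure the result is $2^2$)
- False positive is possible!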
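The counter scheme above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the class name `MajorityTracker` is hypothetical, and item identifiers are assumed to run from 0 to m-1 so that three bit counters cover a space of 8 items.

```python
class MajorityTracker:
    """Bit-counter majority finder (hypothetical name); ids are assumed to be 0 .. m-1."""

    def __init__(self, m):
        self.bits = max(1, (m - 1).bit_length())   # number of bit counters
        self.counters = [0] * (self.bits + 1)      # counters[0] = live items

    def update(self, x, delta):
        """Apply an insert (delta = +1) or a delete (delta = -1) of item x."""
        self.counters[0] += delta
        for j in range(self.bits):
            if (x >> j) & 1:
                self.counters[j + 1] += delta

    def candidate(self):
        """Return the only item that can be the majority (may be a false positive)."""
        live = self.counters[0]
        x = 0
        for j in range(self.bits):
            if 2 * self.counters[j + 1] > live:    # counter > live / 2
                x |= 1 << j
        return x


tracker = MajorityTracker(8)
for item in [5, 5, 5, 2, 3]:
    tracker.update(item, +1)
majority = tracker.candidate()          # item 5 holds 3 of 5 live items

no_majority = MajorityTracker(8)
for item in [1, 2, 4]:
    no_majority.update(item, +1)
false_positive = no_majority.candidate()  # no bit passes the test, so 0 is reported
```

The second run shows the false-positive case: no item has a majority, yet the search still outputs an identifier, so a reported candidate needs separate verification when that matters.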

Finding hot items
- Sequence of length $n$
- Item identifiers: $1..m$
- $n_x(t)$: (# of inserts) - (# of deletes) of item $x$ before time $t$
- $f_x(t) = n_x(t) / \sum_{y=1}^{m} n_y(t)$
- Hot item: given $k$, $f_x(t) > 1/(k+1)$
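As a reference point for these definitions, the hot items can be computed exactly by keeping one net count per item. The sketch below is a naive baseline that uses space linear in the number of distinct items, which is what the paper's method avoids; the function name `hot_items` and the `(item, op)` transaction format are assumptions made for illustration.

```python
from collections import Counter


def hot_items(transactions, k):
    """Exact hot items: x is hot when n_x / sum_y n_y > 1 / (k + 1).

    transactions: iterable of (item, op) pairs with op "insert" or "delete"
    (an assumed format, not taken from the slides).
    """
    net = Counter()
    for item, op in transactions:
        net[item] += 1 if op == "insert" else -1
    total = sum(net.values())
    if total <= 0:
        return set()
    # Compare n_x * (k + 1) > total to stay in integer arithmetic.
    return {x for x, n_x in net.items() if n_x * (k + 1) > total}


log = [(1, "insert")] * 3 + [(2, "insert")] * 2 + [(3, "insert"), (2, "delete")]
hot_k2 = hot_items(log, k=2)   # net counts 1:3, 2:1, 3:1; threshold is 5/3
hot_k5 = hot_items(log, k=5)   # threshold is 5/6, so every live item is hot
```

Since each hot item holds more than a 1/(k+1) share of the live items, at most k items can be hot at any time.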

Process Item (insert or delete)
- Classify items into groups with universal hash functions
- Initialize c[0..2Tk][0..log m] = 0, c = 0
  - T: # of hash functions, each mapping items to 2k groups (2Tk groups in total)
  - k: frequency threshold (f_x(t) > 1/(k+1))
- for all (i, transType) do
      if (transType == insert) c = c + 1 else c = c - 1
      for x = 1 to T do
          index = 2k(x - 1) + h_x(i)   // h_x: x-th hash function, uniformly distributed over 0..2k-1
          UpdateCounters(i, transType, c[index])

Find hot sets
- n: current number of live items (the counter c)
- for i = 1 to 2Tk do   // for each group
      if c[i][0] > n/(k+1)
          position = 0; t = 1
          for j = 1 to log m do
              if (c[i][j] > n/(k+1)) position = position + t
              t = t * 2
          output(position)
- Similar to the algorithm to find the majority
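The update and search steps can be combined into a small runnable sketch. It follows the slides' pseudocode only; the class name `HotItemSketch`, the hash family ((a*x + b) mod p) mod 2k, the fixed seed, and item identifiers in 0 .. m-1 are all assumptions made for illustration and should not be read as the authors' implementation.

```python
import math
import random


class HotItemSketch:
    """Group-testing sketch for hot items under inserts and deletes (illustrative)."""

    PRIME = (1 << 61) - 1  # assumed modulus for the hash family; ids must be smaller

    def __init__(self, k, delta, m, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.bits = max(1, (m - 1).bit_length())
        self.width = 2 * k                                         # groups per hash function
        self.tests = max(1, math.ceil(math.log2(k / delta)))      # T hash functions
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.tests)]
        # groups[t][g] holds one total counter plus one counter per bit.
        self.groups = [[[0] * (self.bits + 1) for _ in range(self.width)]
                       for _ in range(self.tests)]
        self.live = 0                                              # the counter c

    def _group(self, t, item):
        a, b = self.hashes[t]
        return ((a * item + b) % self.PRIME) % self.width

    def update(self, item, delta):
        """delta = +1 for an insert, -1 for a delete."""
        self.live += delta
        for t in range(self.tests):
            counters = self.groups[t][self._group(t, item)]
            counters[0] += delta
            for j in range(self.bits):
                if (item >> j) & 1:
                    counters[j + 1] += delta

    def hot_candidates(self):
        """Decode one candidate from every group whose total exceeds live / (k + 1)."""
        found = set()
        for test in self.groups:
            for counters in test:
                if counters[0] * (self.k + 1) <= self.live:
                    continue
                item = 0
                for j in range(self.bits):
                    if counters[j + 1] * (self.k + 1) > self.live:
                        item |= 1 << j
                found.add(item)
        return found


sketch = HotItemSketch(k=3, delta=0.1, m=16)
for _ in range(10):
    sketch.update(9, +1)
for other in (1, 2, 3, 4):
    sketch.update(other, +1)
before = sketch.hot_candidates()   # item 9 holds 10 of 14 live items
for _ in range(10):
    sketch.update(9, -1)
after = sketch.hot_candidates()    # item 9 is fully deleted
```

As in the slides, the search can report items that are not hot, so the output is best treated as a candidate set.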

Error probability
- Choosing $|h| \ge 2k$ and $T = \log_2(k/\delta)$, the algorithm ensures that the probability of all hot items being output is at least $1 - \delta$
- Details of the proof: (*, **)
* Universal classes of hash functions, J. Comput. Syst
** the two papers currently presented
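To get a feel for these parameter choices, the short calculation below turns k, δ and m into counter totals. The function name `sketch_parameters`, the rounding of T up to an integer, and the sample values δ = 0.01 and m = 2^32 are assumptions for illustration; only k = 50 appears in the slides.

```python
import math


def sketch_parameters(k, delta, m):
    """Return (groups per hash, number of hashes T, counters per group, total counters)."""
    groups_per_hash = 2 * k                              # |h| >= 2k
    num_hashes = math.ceil(math.log2(k / delta))         # T = log2(k / delta), rounded up
    counters_per_group = 1 + math.ceil(math.log2(m))     # one total plus one per bit
    total = groups_per_hash * num_hashes * counters_per_group
    return groups_per_hash, num_hashes, counters_per_group, total


params = sketch_parameters(k=50, delta=0.01, m=2**32)
```

With these sample values the sketch needs 100 groups per hash function, 13 hash functions and 33 counters per group, or 42,900 counters in total, independent of the length of the transaction sequence.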

Experiments
- Synthetic data:
  - Uniform inserts
  - Zipf inserts
  - Uniform deletes
  - 1,000,000 items
  - k = 50 (hot items: f > 1/(k+1))
- Real data:
  - Telephone connections (from AT&T)
  - 3.5 million transactions
  - Every 100,000 transactions, query (src, dest) pairs with frequency greater than 1%

Results of synthetic data
- Recall: proportion of the hot items that are found by the method
- Precision: proportion of the items identified by the algorithm that are hot items
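Both measures are simple set ratios. The helper below is a sketch with a hypothetical name, `recall_precision`; returning 1.0 when a set is empty is a convention chosen here, not something stated in the slides.

```python
def recall_precision(reported, truly_hot):
    """Recall = hits / |truly hot|, precision = hits / |reported|."""
    reported, truly_hot = set(reported), set(truly_hot)
    hits = len(reported & truly_hot)
    recall = hits / len(truly_hot) if truly_hot else 1.0      # assumed convention
    precision = hits / len(reported) if reported else 1.0     # assumed convention
    return recall, precision


# The algorithm reports four items; three items are truly hot and two of them were found.
recall, precision = recall_precision({1, 2, 3, 4}, {1, 2, 5})
```

In this example recall is 2/3 (item 5 was missed) and precision is 1/2 (items 3 and 4 are not hot).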

Results of real data

Conclusion
- Proposes a new method for identifying hot items
- Copes with dynamic datasets

Majority voting algorithm
- Initialize the counter to zero
- For each element in the stream:
  - If the counter is zero, define the current element to be the monitored element of the counter
  - If the current element is the monitored element, increment the counter; otherwise, decrement the counter
- Example: [Figure: table of monitored element and count]
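The steps above translate almost line for line into Python. This is a minimal sketch for insert-only streams; the function name `majority_candidate` is chosen here for illustration.

```python
def majority_candidate(stream):
    """Single-counter majority vote: returns the only possible majority element."""
    monitored, count = None, 0
    for element in stream:
        if count == 0:
            monitored = element      # adopt the current element as the monitored one
        if element == monitored:
            count += 1
        else:
            count -= 1
    return monitored


winner = majority_candidate(["a", "b", "a", "c", "a"])   # "a" appears 3 times out of 5
```

If the stream has no majority element, the function still returns some element, so a second pass that counts the candidate is needed to confirm it.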

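Lossy counting: divide the stream into "buckets" (Bucket 1, Bucket 2, Bucket 3, ...)

Start with an empty summary and add the counts from the first bucket of the stream. At the bucket boundary, decrease all counters by 1.

Add the counts from the next bucket of the stream to the summary. At the bucket boundary, again decrease all counters by 1.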
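The bucket procedure described in these slides can be sketched as follows. This is the simplified variant from the slides (decrement every counter at each bucket boundary); the function name `lossy_count` and the bucket width of ceil(1/ε) elements are assumptions for illustration.

```python
import math


def lossy_count(stream, epsilon):
    """Simplified lossy counting: returns approximate counts that never overestimate."""
    width = math.ceil(1 / epsilon)            # assumed bucket width
    counts = {}
    for position, element in enumerate(stream, start=1):
        counts[element] = counts.get(element, 0) + 1
        if position % width == 0:
            # Bucket boundary: decrease all counters by 1 and drop those that reach zero.
            counts = {e: c - 1 for e, c in counts.items() if c > 1}
    return counts


summary = lossy_count(list("aabcaadea"), epsilon=0.25)
```

Here the bucket width is 4, so the stream crosses two boundaries; the infrequent elements are dropped and "a" keeps a count of 3 against a true count of 5. Each counter undercounts by at most the number of completed buckets.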