What's Hot and What's Not: Tracking Most Frequent Items Dynamically
G. Cormode and S. Muthukrishnan, Rutgers University
ACM Principles of Database Systems 2003; ACM Transactions on Database Systems 2005

Introduction
• Find “hot” items; the set of hot items changes over time
• Applications: caching, load balancing, sensor networks, data mining, etc.
• Prior work usually handles inserts only; this paper also takes deletes into account

Prior works
• Stream with sliding window (*)
• Flajolet-Martin approach (*): estimates the number of distinct elements
• Majority voting algorithm: uses only one counter to identify the majority item
• Lossy counting
[Figure: a stream of 0/1 elements plotted against arrival times 20 to 46]
* http://vc.cs.nthu.edu.tw/ezLMS/show.php?id=385

Contribution of the paper
• Dynamically maintains the hot items; both insert and delete transactions are supported
• Randomized algorithm
  - Uses a hash table
  - Uses randomness to confuse an omniscient adversary
• Small space required
• Short processing time

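Finding the majority item
• Keep log2(m) + 1 counters
  - C_0: the number of items that are currently “live”
  - C_j (j ≠ 0): increased on an insert of x and decreased on a delete of x if bit(x, j) = 1
• Search: if there is a majority item, it is given by x = sum over j = 1..log2(m) of 2^(j-1) · [C_j > C_0/2]
• No false negatives, but false positives are possible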

Algorithms to find the majority element in a sequence of updates

Example
• Space of 8 items: Counter 0, Counter 1 (2^0), Counter 2 (2^1), Counter 3 (2^2)
• Input: 1 2 2 2 7 2 4 6
• Find majority: for each counter, test whether it exceeds (Counter 0)/2, giving x = 0 + 2^1 + 0 = 2
• False positive is possible!
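The counter scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `MajorityTracker`, its method names, and the choice of item identifiers 0..m-1 are hypothetical and do not come from the slides.

```python
class MajorityTracker:
    """Track a possible majority item under inserts and deletes (illustrative sketch)."""

    def __init__(self, m):
        # Assumption: items are integers in 0..m-1, so (m-1).bit_length() bits suffice.
        self.bits = max(1, (m - 1).bit_length())
        # counters[0] is the live-item count; counters[j] follows bit j of the item ID.
        self.counters = [0] * (self.bits + 1)

    def update(self, x, delta):
        """Apply an insert (delta=+1) or a delete (delta=-1) of item x."""
        self.counters[0] += delta
        for j in range(1, self.bits + 1):
            if (x >> (j - 1)) & 1:
                self.counters[j] += delta

    def candidate(self):
        """Return the only item that could be a majority; it may be a false positive."""
        half = self.counters[0] / 2
        return sum(1 << (j - 1)
                   for j in range(1, self.bits + 1)
                   if self.counters[j] > half)


tracker = MajorityTracker(8)
for item in [5, 5, 3, 5]:
    tracker.update(item, +1)
tracker.update(3, -1)          # the delete cancels the earlier insert of 3
print(tracker.candidate())     # 5

# With items 1, 2, 3 inserted once each there is no majority,
# yet both bit counters exceed half of the live count, so 3 is reported.
no_majority = MajorityTracker(4)
for item in [1, 2, 3]:
    no_majority.update(item, +1)
print(no_majority.candidate())  # 3 (a false positive)
```

Because a reported candidate can be a false positive, a caller that needs certainty has to verify it against exact counts.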

Finding hot items
• Sequence of length n
• Item identifiers: 1..m
• n_x(t): (number of inserts of x) - (number of deletes of x) before time t
• f_x(t) = n_x(t) / Σ_{y=1..m} n_y(t)
• Hot item: given k, x is hot if f_x(t) > 1/(k+1)
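To make the definition concrete, here is a small exact (non-streaming) baseline that keeps one counter per item and applies the threshold directly. It is a reference point for checking a sketch, not the paper's method; the function name `exact_hot_items` and the `(item, kind)` transaction format are assumptions made for this example.

```python
from collections import Counter


def exact_hot_items(transactions, k):
    """Return the items x with f_x > 1/(k+1), using exact net counts."""
    net = Counter()
    for item, kind in transactions:
        net[item] += 1 if kind == "insert" else -1
    total = sum(net.values())
    if total <= 0:
        return set()
    # f_x > 1/(k+1) is tested as n_x * (k+1) > total to stay in integer arithmetic.
    return {x for x, n_x in net.items() if n_x * (k + 1) > total}


log = [(1, "insert"), (1, "insert"), (1, "insert"),
       (2, "insert"), (2, "insert"), (3, "insert"),
       (2, "delete")]
print(exact_hot_items(log, k=2))  # {1}
```

After the delete, the net counts are 3, 1 and 1 out of 5 live items, so only item 1 exceeds the 1/3 threshold.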

Process Item (insert or delete)
• Classify items into groups with universal hash functions
• Initialize c[0..2Tk][0..log m] = 0 and n = 0
  - T: number of hash functions, each mapping items to 2k groups
  - k: frequency threshold (f_x(t) > 1/(k+1))
  - n: number of live items
• for all (i, transType) do
    if transType == insert then n = n + 1 else n = n - 1
    for x = 1 to T do
      index = 2k(x - 1) + h_x(i)   // h_x: x-th hash function, uniformly distributed over 0..2k-1
      UpdateCounters(i, transType, c[index])

Find hot sets
• for each group i do
    if c[i][0] ≥ n/(k+1) then
      position = 0; t = 1
      for j = 1 to log m do
        if c[i][j] ≥ n/(k+1) then position = position + t
        t = t * 2
      output(position)
• Similar to the algorithm for finding the majority
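The two procedures above can be combined into a minimal Python sketch. Several details are assumptions made for illustration rather than facts from the slides: the class and method names, the hash family ((a·i + b) mod P) mod 2k with P = 2^31 - 1, and the use of a strict comparison against n/(k+1) to match the definition f_x(t) > 1/(k+1).

```python
import random


class HotItemSketch:
    """Group-testing sketch for hot items under inserts and deletes (illustrative)."""

    P = 2**31 - 1  # prime modulus; assumes item IDs are smaller than P

    def __init__(self, k, T, m, seed=None):
        rng = random.Random(seed)
        self.k, self.T = k, T
        self.bits = m.bit_length()   # enough bits for item IDs 1..m
        self.groups = 2 * k          # groups per hash function
        self.hashes = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(T)]
        # One row per group: row[0] counts the group, row[j] follows bit j of the ID.
        self.c = [[0] * (self.bits + 1) for _ in range(T * self.groups)]
        self.n = 0                   # number of live items

    def _index(self, x, item):
        a, b = self.hashes[x]
        return x * self.groups + ((a * item + b) % self.P) % self.groups

    def process(self, item, delta):
        """Apply an insert (delta=+1) or a delete (delta=-1)."""
        self.n += delta
        for x in range(self.T):
            row = self.c[self._index(x, item)]
            row[0] += delta
            for j in range(1, self.bits + 1):
                if (item >> (j - 1)) & 1:
                    row[j] += delta

    def hot_candidates(self):
        """Decode one candidate from every group whose count exceeds n/(k+1)."""
        threshold = self.n / (self.k + 1)
        found = set()
        for row in self.c:
            if row[0] > threshold:
                position, t = 0, 1
                for j in range(1, self.bits + 1):
                    if row[j] > threshold:
                        position += t
                    t *= 2
                found.add(position)
        return found


sketch = HotItemSketch(k=2, T=4, m=1000, seed=1)
for _ in range(200):
    sketch.process(12, +1)
for _ in range(200):
    sketch.process(12, -1)        # item 12 is fully deleted again
for _ in range(90):
    sketch.process(7, +1)
for noise in range(100, 110):
    sketch.process(noise, +1)
print(sketch.hot_candidates())    # {7}
```

In this example the noise items total only 10 of the 100 live items, so they can never push a group or bit counter past the threshold, and the result does not depend on the random hash choice. In general, two hot items that collide in a group can decode to a wrong identifier, which is why several hash functions are used.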

Error probability
• Choosing |h| ≥ 2k and T = log2(k/δ), the algorithm ensures that all hot items are output with probability at least 1 - δ
• Details of the proof: (*, **)
* Universal classes of hash functions, J. Comput. Syst. Sci., 1979
** The two papers presented here
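As a worked example of these parameter choices, the following sketch computes T and the resulting number of counters. Rounding T up to an integer, rounding log2(m) up, and the function name `sketch_parameters` are assumptions made for the example.

```python
import math


def sketch_parameters(k, delta, m):
    """Return (T, total groups, total counters) for threshold k and failure probability delta."""
    T = math.ceil(math.log2(k / delta))               # number of hash functions
    groups = 2 * k * T                                # 2k groups per hash function
    counters_per_group = math.ceil(math.log2(m)) + 1  # log2(m) bit counters plus the group count
    return T, groups, groups * counters_per_group


print(sketch_parameters(k=50, delta=0.01, m=2**20))   # (13, 1300, 27300)
```

With k = 50 and δ = 0.01, log2(5000) is about 12.3, so 13 hash functions are used; the space depends on k, δ and log m but not on the length of the stream.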

Experiments
• Synthetic data:
  - Uniform inserts
  - Zipf-distributed inserts
  - Uniform deletes
  - 1,000,000 items
  - k = 50 (hot items: f > 1/(k+1))
• Real data:
  - Telephone connections (from AT&T)
  - 3.5 million transactions
  - Every 100,000 transactions, query the (src, dest) pairs with frequency greater than 1%

Results of synthetic data
• Recall: proportion of the hot items that are found by the method
• Precision: proportion of the items identified by the algorithm that are hot items
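Both measures are simple set ratios. The helper below is a hypothetical illustration (the name `recall_precision` and the convention of returning 1.0 for an empty denominator are assumptions), not the evaluation code used in the paper.

```python
def recall_precision(reported, truly_hot):
    """Compute (recall, precision) of a reported set against the true hot items."""
    reported, truly_hot = set(reported), set(truly_hot)
    hits = len(reported & truly_hot)
    recall = hits / len(truly_hot) if truly_hot else 1.0
    precision = hits / len(reported) if reported else 1.0
    return recall, precision


recall, precision = recall_precision(reported={1, 2, 3, 4}, truly_hot={1, 2, 5})
print(recall, precision)  # recall is 2/3, precision is 0.5
```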

Results of real data

Conclusion
• Proposes a new method for identifying hot items
• Copes with dynamic datasets

Majority voting algorithm
• Initialize the counter to zero
• For each element in the stream:
  - If the counter is zero, make the current element the monitored element of the counter
  - If the current element is the monitored element, increment the counter; otherwise, decrement the counter
• Ex: stream 1 2 2 2 3 2 1
[Table: monitored element and count after each arrival, with ψ for an empty entry]
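The steps above translate almost directly into Python. The sketch below also records a trace of (monitored element, counter) after each arrival; the function name `majority_vote` and the use of `None` for an empty counter are assumptions made for the example.

```python
def majority_vote(stream):
    """One-counter majority voting; returns (candidate, trace)."""
    monitored, count = None, 0
    trace = []
    for element in stream:
        if count == 0:
            monitored = element       # start monitoring the current element
        if element == monitored:
            count += 1
        else:
            count -= 1
        # Record None as the monitored element whenever the counter is empty.
        trace.append((monitored if count else None, count))
    candidate = monitored if count > 0 else None
    return candidate, trace


candidate, trace = majority_vote([1, 2, 2, 2, 3, 2, 1])
print(candidate)  # 2
print(trace)      # [(1, 1), (None, 0), (2, 1), (2, 2), (2, 1), (2, 2), (2, 1)]
```

The candidate is guaranteed to be the majority element only if a majority exists, so a second pass over the data is needed to confirm it. The algorithm also has no way to handle deletes, which is the gap the counter-per-bit scheme addresses.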

Lossy counting
• Divide the stream into “buckets” (Bucket 1, Bucket 2, Bucket 3, ...)

• Start with an empty summary and add the counts from the first bucket of the stream
• At the bucket boundary, decrease all counters by 1

• Add the counts from the next bucket of the stream to the summary
• At the bucket boundary, decrease all counters by 1
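The bucket procedure can be sketched as follows. This is the simplified variant described on these slides (count, then decrement everything at each bucket boundary); the function name `lossy_count`, the `bucket_width` parameter, and the choice to drop counters that reach zero are assumptions made for the example.

```python
def lossy_count(stream, bucket_width):
    """Simplified lossy counting: decrement all counters at every bucket boundary."""
    summary = {}
    for position, item in enumerate(stream, start=1):
        summary[item] = summary.get(item, 0) + 1
        if position % bucket_width == 0:
            # Bucket boundary: decrease every counter by 1 and drop those that reach zero.
            summary = {x: c - 1 for x, c in summary.items() if c > 1}
    return summary


print(lossy_count(list("aabacaad"), bucket_width=4))  # {'a': 3}
```

Here "a" truly occurs 5 times and is reported with count 3: each counter is undercounted by at most the number of bucket boundaries passed (2 in this run), while infrequent items are evicted from the summary.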
