1
Approximate Frequency Counts over Data Streams
Loo Kin Kong, 4th Oct., 2002
2
Plan
- Motivation
- Paper review: Approximate Frequency Counts over Data Streams
- Finding frequent items
- Finding frequent itemsets
- Performance
- Conclusion
3
Motivation
In some new applications, data come as a continuous "stream":
- The sheer volume of a stream over its lifetime is huge
- Queries require timely answers
Examples: stock ticks, network traffic measurements
4
Frequent itemset mining on offline databases vs data streams
- Offline databases are often mined with level-wise algorithms, e.g., the Apriori algorithm and its variants, which need at least 2 database scans
- Level-wise algorithms cannot be applied to mine data streams, because a stream cannot be scanned multiple times
5
Paper review: Approximate Frequency Counts over Data Streams
By G. S. Manku and R. Motwani, published in VLDB 02.
Main contributions of the paper:
- Proposed 2 algorithms to find frequent items appearing in a data stream of items
- Extended the algorithms to find frequent itemsets
6
Notations
Some notations:
- Let N denote the current length of the stream
- Let s ∈ (0,1) denote the support threshold
- Let ε ∈ (0,1) denote the error tolerance
7
Goals of the paper
The algorithms ensure that:
- All itemsets whose true frequency exceeds sN are reported
- No itemset whose true frequency is less than (s - ε)N is output
- Estimated frequencies are less than the true frequencies by at most εN
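To make these guarantees concrete, here is a small Python sketch with hypothetical parameter values (the particular s, eps, and N below are illustrative choices, not numbers from the paper):

```python
# Illustrative parameters (not from the paper): support s = 0.1%,
# error tolerance eps = 0.01% (a common rule of thumb is eps = s/10),
# and a stream of N = 10 million transactions seen so far.
s = 0.001
eps = 0.0001
N = 10_000_000

report_above = s * N         # every itemset with true frequency > ~10,000 is reported
never_below = (s - eps) * N  # nothing with true frequency < ~9,000 is output
max_undercount = eps * N     # estimated counts are low by at most ~1,000
```

Itemsets with true frequency between (s - ε)N and sN may or may not be reported; that band is exactly the slack the approximation buys.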
8
The simple case: finding frequent items
Each transaction in the stream contains only 1 item.
2 algorithms were proposed, namely:
- Sticky Sampling Algorithm
- Lossy Counting Algorithm
Features of the algorithms:
- Sampling techniques are used
- Frequency counts found are approximate, but the error is guaranteed not to exceed a user-specified tolerance level
- For Lossy Counting, all frequent items are reported
9
Sticky Sampling Algorithm
User input includes 3 values, namely:
- Support threshold s
- Error tolerance ε
- Probability of failure δ
Counts are kept in a data structure S. Each entry in S is of the form (e, f), where:
- e is the item
- f is the frequency of e in the stream since the entry was inserted into S
When queried about the frequent items, all entries (e, f) such that f ≥ (s - ε)N are reported.
10
Sticky Sampling Algorithm (cont'd)
1. S ← ∅; N ← 0; t ← (1/ε) log(1/(sδ)); r ← 1
2. e ← next transaction; N ← N + 1
3. if (e, f) exists in S do
4.     increment the count f
5. else if random(0,1) < 1/r do
6.     insert (e, 1) into S
7. endif
8. if N = 2t·2^n (for some integer n) do
9.     r ← 2r
10.    halfSampRate(S)
11. endif
12. goto 2
(S: the set of all counts; e: transaction (item); N: current length of stream; r: sampling rate; t: (1/ε) log(1/(sδ)))
11
Sticky Sampling Algorithm: halfSampRate()
1. function halfSampRate(S)
2. for every entry (e, f) in S do
3.     while random(0,1) < 0.5 do    (toss a fair coin until it succeeds)
4.         f ← f - 1
5.         if f = 0 do
6.             remove the entry from S
7.         endif
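The two routines above can be sketched in Python as follows. This is an illustrative reimplementation under the slides' definitions (function and variable names are my own), not the authors' code:

```python
import math
import random

def half_samp_rate(S):
    """For each entry, toss a fair coin until it comes up heads,
    decrementing f once per tail; drop entries that reach zero."""
    for e in list(S):
        while random.random() < 0.5:
            S[e] -= 1
            if S[e] == 0:
                del S[e]
                break

def sticky_sampling(stream, s, eps, delta):
    """Sketch of Sticky Sampling: keep approximate counts in S,
    inserting unseen items with probability 1/r, and double r
    (adjusting the kept counts) at geometrically spaced boundaries."""
    S = {}                                 # item -> count f
    t = (1.0 / eps) * math.log(1.0 / (s * delta))
    r = 1                                  # current sampling rate
    boundary = 2 * t                       # first 2t elements use r = 1
    N = 0
    for e in stream:
        N += 1
        if e in S:
            S[e] += 1
        elif random.random() < 1.0 / r:
            S[e] = 1
        if N >= boundary:                  # next 2t use r = 2, next 4t use r = 4, ...
            r *= 2
            boundary *= 2
            half_samp_rate(S)
    # answer a query: report entries clearing the adjusted threshold
    return {e: f for e, f in S.items() if f >= (s - eps) * N}
```

Because insertion is probabilistic, the output is correct only with probability at least 1 - δ, which is the "non-deterministic" behaviour noted later in these slides.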
12
Lossy Counting Algorithm
The incoming data stream is conceptually divided into buckets of ⌈1/ε⌉ transactions each.
Counts are kept in a data structure D. Each entry in D is of the form (e, f, Δ), where:
- e is the item
- f is the frequency of e in the stream since the entry was inserted into D
- Δ is the maximum possible count of e in the stream before e was added to D
13
Lossy Counting Algorithm (cont'd)
1. D ← ∅; N ← 0
2. w ← ⌈1/ε⌉; b ← 1
3. e ← next transaction; N ← N + 1
4. if (e, f, Δ) exists in D do
5.     f ← f + 1
6. else do
7.     insert (e, 1, b - 1) into D
8. endif
9. if N mod w = 0 do
10.    prune(D, b); b ← b + 1
11. endif
12. goto 3
(D: the set of all counts; N: current length of stream; e: transaction (item); w: bucket width; b: current bucket id)
14
Lossy Counting Algorithm – prune()
1. function prune(D, b)
2. for each entry (e, f, Δ) in D do
3.     if f + Δ ≤ b do
4.         remove the entry from D
5.     endif
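Putting the main loop and prune() together, a compact Python sketch (an illustrative reimplementation under the slides' definitions, with my own names, not the authors' code) looks like this:

```python
import math

def lossy_counting(stream, s, eps):
    """Sketch of Lossy Counting: deterministic approximate counts kept
    as (f, delta) pairs, pruned at every bucket boundary."""
    D = {}                       # item -> (f, delta)
    w = math.ceil(1.0 / eps)     # bucket width
    b = 1                        # current bucket id
    N = 0
    for e in stream:
        N += 1
        if e in D:
            f, d = D[e]
            D[e] = (f + 1, d)
        else:
            D[e] = (1, b - 1)    # delta = max possible count before insertion
        if N % w == 0:           # bucket boundary: prune low-count entries
            for item, (f, d) in list(D.items()):
                if f + d <= b:
                    del D[item]
            b += 1
    # answer a query: report entries clearing the adjusted threshold
    return {e: f for e, (f, d) in D.items() if f >= (s - eps) * N}
```

Unlike Sticky Sampling, every step here is deterministic, so all truly frequent items are guaranteed to be reported.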
15
Lossy Counting
Lossy Counting guarantees that:
- When a deletion occurs, b ≤ εN
- If an entry (e, f, Δ) is deleted, f_e ≤ b, where f_e is the true frequency count of e
- Hence, if an entry (e, f, Δ) is deleted, f_e ≤ εN
- Finally, f ≤ f_e ≤ f + εN
16
Sticky Sampling vs Lossy Counting
- Sticky Sampling is non-deterministic, while Lossy Counting is deterministic
- Experimental results show that Lossy Counting requires fewer entries than Sticky Sampling
17
The more complex case: finding frequent itemsets
The Lossy Counting algorithm is extended to find frequent itemsets. Transactions in the data stream may contain any number of items.
18
Overview of the algorithm
The incoming data stream is conceptually divided into buckets of ⌈1/ε⌉ transactions each.
Counts are kept in a data structure D, and multiple buckets (β of them, say) are processed in a batch.
Each entry in D is of the form (set, f, Δ), where:
- set is the itemset
- f is the frequency of set in the stream since the entry was inserted into D
- Δ is the maximum possible count of set in the stream before set was added to D
19
Overview of the algorithm (cont'd)
D is updated by the operations UPDATE_SET and NEW_SET.
UPDATE_SET updates and deletes entries in D:
- For each entry (set, f, Δ), count the occurrences of set in the batch and update the entry
- If an updated entry satisfies f + Δ ≤ b_current, the entry is removed from D
NEW_SET inserts new entries into D:
- If a set set has frequency f ≥ β in the batch and set does not occur in D, create a new entry (set, f, b_current - β)
20
Implementation
Challenges:
- Not to enumerate all subsets of a transaction
- The data structure must be compact for better space efficiency
3 major modules:
- Buffer
- Trie
- SetGen
21
Implementation (cont'd)
- Buffer: repeatedly reads a batch of buckets of transactions, where each transaction is a set of item-ids, into available main memory
- Trie: maintains the data structure D
- SetGen: generates subsets of item-ids along with their frequency counts in the current batch
Not all possible subsets need to be generated: if a subset S is not inserted into D after application of both UPDATE_SET and NEW_SET, then no superset of S needs to be considered (an Apriori-like pruning rule)
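A much-simplified sketch of one batch step (UPDATE_SET followed by NEW_SET) is shown below. Unlike the paper's Buffer/Trie/SetGen machinery, this toy version simply enumerates every subset of each (small) transaction and skips the superset-pruning optimization; the function and variable names are my own:

```python
from itertools import combinations

def process_batch(D, batch, b_current, beta):
    """One batch step of the itemset extension of Lossy Counting.
    D maps frozenset -> (f, delta); batch holds beta buckets' worth
    of transactions, each transaction a set of item-ids."""
    # count occurrences of every non-empty subset in the batch
    counts = {}
    for trans in batch:
        items = sorted(trans)
        for k in range(1, len(items) + 1):
            for sub in combinations(items, k):
                key = frozenset(sub)
                counts[key] = counts.get(key, 0) + 1
    # UPDATE_SET: refresh existing entries, deleting those with f + delta <= b_current
    for key in list(D):
        f, d = D[key]
        f += counts.pop(key, 0)    # pop so NEW_SET skips already-updated sets
        if f + d <= b_current:
            del D[key]
        else:
            D[key] = (f, d)
    # NEW_SET: insert sets seen at least beta times in this batch
    for key, f in counts.items():
        if f >= beta:
            D[key] = (f, b_current - beta)
    return D
```

Full subset enumeration is exponential in transaction size, which is exactly why the paper's SetGen generates subsets incrementally and prunes supersets of dropped sets.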
22
Performance IBM dataset (T10 I4 D1000K / 10K items)
23
Performance (cont ’ d) Compared with Apriori IBM dataset (T10 I4 D1000K / 10K items)
24
Conclusion
- Sticky Sampling and Lossy Counting are 2 approximate algorithms that can find frequent items
- Both algorithms produce frequency counts within a user-specified error tolerance level, though Sticky Sampling is non-deterministic
- Lossy Counting can be extended to find frequent itemsets
25
Reference
G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In VLDB '02, Hong Kong, 2002.
26
Q & A