Approximate Frequency Counts over Data Streams. Gurmeet Singh Manku, Rajeev Motwani. Stanford University. VLDB 2002.

Presentation transcript:

Approximate Frequency Counts over Data Streams
Gurmeet Singh Manku, Rajeev Motwani
Stanford University
VLDB 2002

Introduction
- Data arrive as a continuous "stream"
- A stream differs from a traditional stored database:
  - The sheer volume of a stream over its lifetime is huge
  - Queries require timely answers

Frequent itemset mining on offline databases vs. data streams
- Level-wise algorithms are often used to mine offline databases
  - At least 2 database scans are needed
  - Example: the Apriori algorithm
- Level-wise algorithms cannot be applied to data streams
  - The stream cannot be scanned multiple times

Challenges of streaming
- Single pass
- Limited memory
- Enumeration of itemsets

Purpose
- Present algorithms for computing frequency counts that exceed a user-specified threshold
  - Simple
  - Low memory footprint
  - The output is approximate, but the error is guaranteed not to exceed a user-specified error parameter
  - Can be deployed for streams of singleton items and can handle streams of variable-sized sets of items
- Main contributions of the paper:
  - Two algorithms that find the frequent items in a data stream of items
  - An extension of Lossy Counting that finds frequent itemsets

Notations
- Let N denote the current length of the stream
- Let s ∈ (0,1) denote the support threshold
- Let ε ∈ (0,1) denote the error tolerance
- ε << s

Approximation guarantees
- Property 1: all itemsets whose true frequency exceeds sN are output
- Property 2: no itemset whose true frequency is less than (s - ε)N is output
- Property 3: estimated frequencies are less than the true frequencies by at most εN

Example
- s = 0.1%
- As a rule of thumb, ε should be one-tenth or one-twentieth of s; here ε = 0.01%
- Property 1: all elements with frequency exceeding 0.1% are output
- Property 2: no element with frequency below 0.09% is output
- Elements with frequency between 0.09% and 0.1% may or may not be output
- Property 3: estimated frequencies are less than the true frequencies by at most 0.01% of N
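The three properties in this example can be turned into a small worked check. The sketch below is only an illustration: the function name `classify` and the stream length N = 1,000,000 are assumptions chosen to make the percentages concrete, not part of the original slides.

```python
def classify(true_count: int, n: int, s: float, eps: float) -> str:
    """Say what an epsilon-deficient synopsis may do with an item
    whose true frequency count is `true_count` in a stream of length n."""
    if true_count > s * n:
        return "must be output"           # Property 1
    if true_count < (s - eps) * n:
        return "must not be output"       # Property 2
    return "may or may not be output"     # the grey zone between the two


# Hypothetical stream length; s and eps are the values from the example.
N = 1_000_000
s, eps = 0.001, 0.0001

for count in (1500, 950, 800):
    print(count, "->", classify(count, N, s, eps))

# Property 3: any reported count undershoots the true count by at most eps * N.
print("maximum undercount:", eps * N)
```

With these numbers the thresholds are 1,000 occurrences (s·N) and 900 occurrences ((s - ε)·N), so only counts strictly between them fall in the grey zone.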

Problem definition
- An algorithm maintains an ε-deficient synopsis if its output satisfies the three properties above
- Goal: devise algorithms that maintain an ε-deficient synopsis using as little main memory as possible

The algorithms for frequent items
- Each transaction contains only 1 item
- Two algorithms are proposed:
  - Sticky Sampling
  - Lossy Counting
- Features:
  - Sticky Sampling is sampling-based; Lossy Counting is deterministic
  - The frequencies found are approximate, but the error is guaranteed not to exceed a user-specified tolerance level
  - For Lossy Counting, all frequent items are reported

Sticky Sampling Algorithm
- Create counters by sampling the stream

Sticky Sampling Algorithm
- User input:
  - Support threshold s
  - Error tolerance ε
  - Probability of failure δ
- Counts are kept in a data structure S
- Each entry in S has the form (e, f), where:
  - e: the item
  - f: the estimated frequency of e, counted since the entry was inserted into S
- Output the entries in S where f ≥ (s - ε)N

Sticky Sampling Algorithm
- r: the sampling rate
- Sampling an element with rate r means selecting the element with probability 1/r

Sticky Sampling Algorithm
- Initially, S is empty and r = 1
- For each incoming element e:
    if (e exists in S)
        increment the corresponding f
    else {
        sample the element with rate r
        if (sampled)
            add entry (e, 1) to S
        else
            ignore e
    }

Sampling rate
- Let t = (1/ε) log(s⁻¹ δ⁻¹), where δ is the probability of failure
- The first 2t elements are sampled at rate r = 1
- The next 2t elements are sampled at rate r = 2
- The next 4t elements are sampled at rate r = 4, and so on
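The schedule above can be written as a function of the stream position. This is a minimal sketch: the names `window_size` and `sampling_rate` are hypothetical, and the natural logarithm is an assumption, since the slide does not state the base of the log.

```python
import math


def window_size(s: float, eps: float, delta: float) -> float:
    """t = (1/eps) * log(1 / (s * delta)); natural log is assumed here."""
    return math.log(1.0 / (s * delta)) / eps


def sampling_rate(n: int, t: float) -> int:
    """Sampling rate in force for the n-th stream element (1-based).

    Rate 1 covers elements 1..2t, rate 2 covers the next 2t,
    rate 4 the next 4t, and so on: rate r ends at position 2*r*t.
    """
    r = 1
    while n > 2 * r * t:
        r *= 2
    return r


t = 100  # a hypothetical small window, chosen so the boundaries are easy to read
for n in (1, 200, 201, 400, 401, 800, 801):
    print(n, sampling_rate(n, t))
```

The rate doubles each time the stream length doubles past 2t, so an element is kept with probability 1/r, where r grows in proportion to N/t.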

Sticky Sampling Algorithm
- Whenever the sampling rate r changes:
    for each entry (e, f) in S
        repeat {
            toss an unbiased coin
            if (toss is not successful)
                diminish f by one
            if (f == 0) {
                delete the entry from S
                break
            }
        } until toss is successful
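Putting the insertion rule, the rate schedule, and the coin-toss adjustment together gives the following sketch. It is an illustration under assumptions, not the authors' implementation: the class and method names are hypothetical, the log is taken as natural, and the rate change is applied just before processing the first element of each new window.

```python
import math
import random


class StickySampling:
    def __init__(self, s, eps, delta, seed=None):
        self.s, self.eps = s, eps
        self.t = math.log(1.0 / (s * delta)) / eps  # window parameter t
        self.rate = 1        # current sampling rate r
        self.n = 0           # stream length N seen so far
        self.counts = {}     # data structure S: item -> f
        self.rng = random.Random(seed)

    def _adjust_counts(self):
        """Run when r changes: per entry, toss a fair coin until it succeeds,
        diminishing f by one for every unsuccessful toss."""
        for item in list(self.counts):
            while self.rng.random() < 0.5:      # unsuccessful toss
                self.counts[item] -= 1
                if self.counts[item] == 0:
                    del self.counts[item]
                    break

    def add(self, item):
        self.n += 1
        while self.n > 2 * self.rate * self.t:  # entered the next window
            self.rate *= 2
            self._adjust_counts()
        if item in self.counts:
            self.counts[item] += 1
        elif self.rng.random() < 1.0 / self.rate:  # sampled with probability 1/r
            self.counts[item] = 1

    def frequent(self):
        """Entries with f >= (s - eps) * N."""
        threshold = (self.s - self.eps) * self.n
        return {e: f for e, f in self.counts.items() if f >= threshold}


ss = StickySampling(s=0.1, eps=0.01, delta=0.1, seed=0)
for item in ["a"] * 300 + ["b"] * 200 + ["c"] * 10:
    ss.add(item)
print(ss.frequent())
```

In this short usage example the stream is shorter than 2t, so the rate stays at 1 and every count is exact; the probabilistic behaviour only appears on longer streams.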

Lossy Counting
- The data stream is conceptually divided into buckets of w = ⌈1/ε⌉ transactions each
- Buckets are labeled with bucket ids, starting from 1
- The current bucket id is b_current, whose value is ⌈N/w⌉
- f_e: the true frequency of an element e in the stream seen so far
- Each entry in the data structure D has the form (e, f, Δ):
  - e: the item
  - f: the estimated frequency of e
  - Δ: the maximum possible error in f

Lossy Counting
- Δ is the maximum number of times e could have occurred in the first b_current - 1 buckets (this value is exactly b_current - 1)
- Once an entry is inserted into D, its Δ never changes

Lossy Counting
- Initially, D is empty
- On receiving an element e:
    if (e exists in D)
        increment its frequency f by 1
    else
        create a new entry (e, 1, b_current - 1)
- At a bucket boundary, prune D by the following rule: (e, f, Δ) is deleted if f + Δ ≤ b_current
- When the user requests a list of items with threshold s, output those entries in D where f ≥ (s - ε)N

Lossy Counting
    function prune(D, b)
        for each entry (e, f, Δ) in D do
            if f + Δ ≤ b then
                remove the entry from D
            endif
        endfor
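The insertion rule and the prune function translate almost line for line into Python. The following is a minimal sketch, assuming a dictionary for D and hypothetical class and method names; the tiny ε in the usage example is chosen only so that a bucket holds 4 elements and the pruning is easy to follow by hand.

```python
import math


class LossyCounting:
    def __init__(self, eps):
        self.eps = eps
        self.width = math.ceil(1 / eps)  # bucket width w
        self.n = 0                       # stream length N seen so far
        self.entries = {}                # D: item -> [f, delta]

    def add(self, item):
        self.n += 1
        b_current = math.ceil(self.n / self.width)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            self.entries[item] = [1, b_current - 1]
        if self.n % self.width == 0:     # bucket boundary reached
            self.prune(b_current)

    def prune(self, b):
        """Delete every entry (e, f, delta) with f + delta <= b."""
        doomed = [e for e, (f, d) in self.entries.items() if f + d <= b]
        for e in doomed:
            del self.entries[e]

    def frequent(self, s):
        """Entries with f >= (s - eps) * N."""
        threshold = (s - self.eps) * self.n
        return {e: f for e, (f, d) in self.entries.items() if f >= threshold}


lc = LossyCounting(eps=0.25)             # w = 4 elements per bucket
for item in "abacadae":
    lc.add(item)
print(lc.entries)
print(lc.frequent(s=0.5))
```

After the first bucket (`a b a c`) the singletons `b` and `c` are pruned; after the second (`a d a e`) the entries for `d` and `e`, inserted with Δ = 1, are pruned as well, leaving only `a`.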

Lossy Counting (illustration)
- D starts out empty; the counts from each incoming window are added to D
- At each window boundary, remove the entries for which f + Δ ≤ b_current

Lossy Counting
- Lossy Counting guarantees that:
  - Whenever deletions occur, b_current ≤ εN
  - Whenever an entry (e, f, Δ) is deleted, f_e ≤ b_current (f_e: the actual frequency count of e)
  - Hence, if an entry (e, f, Δ) is deleted, f_e ≤ εN
  - Finally, for every entry in D, f ≤ f_e ≤ f + εN
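These guarantees can be checked empirically against exact counts. The sketch below re-implements Lossy Counting compactly so that it is self-contained; the function names and the synthetic skewed stream are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def lossy_count(stream, eps):
    """Compact Lossy Counting: returns D as item -> [f, delta]."""
    width = math.ceil(1 / eps)
    entries = {}
    for n, item in enumerate(stream, start=1):
        b_current = math.ceil(n / width)
        if item in entries:
            entries[item][0] += 1
        else:
            entries[item] = [1, b_current - 1]
        if n % width == 0:  # bucket boundary: prune
            for e in [e for e, (f, d) in entries.items() if f + d <= b_current]:
                del entries[e]
    return entries


def check_bounds(stream, eps):
    """True if f <= f_e <= f + eps*N for kept items and f_e <= eps*N for the rest."""
    true = Counter(stream)
    entries = lossy_count(stream, eps)
    slack = eps * len(stream) + 1e-9  # tiny tolerance for float rounding
    for item, f_e in true.items():
        if item in entries:
            f = entries[item][0]
            if not (f <= f_e <= f + slack):
                return False
        elif f_e > slack:
            return False
    return True


# Hypothetical skewed stream: item i is drawn with weight 1 / (i + 1).
rng = random.Random(0)
stream = rng.choices(range(50), weights=[1 / (i + 1) for i in range(50)], k=10_000)
print(check_bounds(stream, eps=0.01))
```

Because Lossy Counting is deterministic, the check holds for any input stream, not just on average over random ones.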

Sticky Sampling vs. Lossy Counting
- Sticky Sampling is non-deterministic, while Lossy Counting is deterministic
- Experimental results show that Lossy Counting requires fewer entries than Sticky Sampling

Sticky Sampling vs. Lossy Counting
- In the experiments, Lossy Counting is superior by a large factor
- Sticky Sampling performs worse because of its tendency to remember every unique element that gets sampled
- Lossy Counting is good at pruning low-frequency elements quickly

The more complex case: finding frequent itemsets
- The Lossy Counting algorithm is extended to find frequent itemsets
- Each transaction in the data stream contains a set of items

Finding frequent itemsets
[Figure: a stream of transactions]

Finding frequent itemsets
- Input: a stream of transactions, where each transaction is a set of items from I
- N: the length of the stream
- The user specifies two parameters: support s and error ε
- Challenges:
  - Handling variable-sized transactions
  - Avoiding explicit enumeration of all subsets of any transaction

Finding frequent itemsets
- Data structure D: a set of entries of the form (set, f, Δ)
  - set: a subset of items
- Transactions are divided into buckets
  - w = ⌈1/ε⌉: the number of transactions in each bucket
  - b_current: the current bucket id

Finding frequent itemsets
- Transactions are not processed one by one
  - Main memory is filled with as many transactions as possible
  - Processing is done on a batch of transactions
- β: the number of buckets in main memory in the current batch being processed

Finding frequent itemsets
- Operations on D:
  - UPDATE_SET updates and deletes entries in D
    - For each entry (set, f, Δ), count the occurrences of set in the batch and update f
    - If the updated entry satisfies f + Δ ≤ b_current, remove it from D
  - NEW_SET inserts new entries into D
    - If a set has frequency f ≥ β in the batch and does not occur in D, create a new entry (set, f, b_current - β)
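The two operations can be sketched as one batch-processing step. This sketch deliberately enumerates every subset of every transaction, which is exactly what the paper sets out to avoid, so it is only suitable for tiny illustrative inputs. The function name `process_batch`, the dictionary layout of D, and the example transactions are assumptions.

```python
from collections import Counter
from itertools import combinations


def process_batch(D, batch, beta, b_current):
    """Apply UPDATE_SET, then NEW_SET, for one batch of beta buckets.

    D: dict mapping frozenset -> [f, delta].
    b_current: the bucket id after the batch has been read.
    """
    # Naive subset counting (a simplification, exponential in transaction size).
    batch_counts = Counter()
    for txn in batch:
        items = sorted(set(txn))
        for k in range(1, len(items) + 1):
            for subset in combinations(items, k):
                batch_counts[frozenset(subset)] += 1

    existing = set(D)  # itemsets that were in D before this batch

    # UPDATE_SET: refresh counts, then drop entries with f + delta <= b_current.
    for itemset in existing:
        D[itemset][0] += batch_counts.get(itemset, 0)
        if D[itemset][0] + D[itemset][1] <= b_current:
            del D[itemset]

    # NEW_SET: admit itemsets seen at least beta times in this batch.
    for itemset, f in batch_counts.items():
        if itemset not in existing and f >= beta:
            D[itemset] = [f, b_current - beta]
    return D


# Hypothetical setup: w = 2 transactions per bucket, beta = 2 buckets per batch.
D = {}
process_batch(D, [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"b"}], beta=2, b_current=2)
print(D)
process_batch(D, [{"a"}, {"a", "d"}, {"c"}, {"d"}], beta=2, b_current=4)
print(D)
```

In the usage example, `{a}`, `{b}` and `{a, b}` are admitted by the first batch; the second batch keeps `{a}`, prunes `{b}` and `{a, b}`, and admits `{d}` with Δ = b_current - β = 2.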

Finding frequent itemsets
- If f_set ≥ εN, then set has an entry in D
- If (set, f, Δ) ∈ D, then the true frequency f_set satisfies f ≤ f_set ≤ f + Δ
- When the user requests a list of itemsets with threshold s, output those entries in D where f ≥ (s - ε)N
- β needs to be a large number: any subset of I that occurs β + 1 times or more contributes an entry to D

- Buffer: repeatedly reads a batch of buckets of transactions into available main memory
- Trie: maintains the data structure D
- SetGen: generates subsets of item-ids along with their frequency counts in the current batch
  - Not all possible subsets need to be generated
  - If a subset S is not inserted into D after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered

Three modules
- BUFFER: repeatedly reads a batch of transactions into available main memory
- TRIE: maintains the data structure D
- SUBSET-GEN: operates on the current batch of transactions
- Together, the modules implement UPDATE_SET and NEW_SET

Module 1 - Buffer
- Reads a batch of transactions
- Transactions are laid out one after the other in a big array
- A bitmap is used to remember transaction boundaries
- After reading in a batch, BUFFER sorts each transaction by its item-ids
[Figure: Windows 1 to 6 held in main memory]
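The array-plus-bitmap layout can be illustrated as follows. This is a sketch under assumptions: the slide does not say which bit marks a boundary, so the code sets the bit on the last item of each transaction, and the names `pack` and `unpack` are hypothetical.

```python
def pack(transactions):
    """Lay transactions end to end in one flat list of item-ids.

    Returns (items, is_last), where is_last[i] is True when items[i]
    is the final item of its transaction (the boundary bitmap).
    Each transaction is sorted by item-id, as BUFFER does after reading.
    """
    items, is_last = [], []
    for txn in transactions:
        ordered = sorted(set(txn))
        if not ordered:
            continue  # an empty transaction contributes nothing
        items.extend(ordered)
        is_last.extend([False] * (len(ordered) - 1) + [True])
    return items, is_last


def unpack(items, is_last):
    """Recover the list of (sorted) transactions from the flat layout."""
    transactions, current = [], []
    for item, last in zip(items, is_last):
        current.append(item)
        if last:
            transactions.append(current)
            current = []
    return transactions


items, is_last = pack([[3, 1, 2], [5, 4], [7]])
print(items)
print(is_last)
print(unpack(items, is_last))
```

A flat layout like this avoids one allocation per transaction, which matters when the goal is to fit as many transactions as possible into main memory.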

Module 2 - TRIE
- Stores sets together with their frequency counts

Module 2 - TRIE (cont.)
- Nodes are labeled {item-id, f, Δ, level}
- The children of any node are ordered by their item-ids
- Root nodes are also ordered by their item-ids
- A node represents the itemset consisting of the item-ids in that node and all its ancestors
- TRIE is maintained as an array of entries of the form {item-id, f, Δ, level}, stored in the pre-order of the trees; this is equivalent to a lexicographic ordering of the subsets it encodes
- No pointers are needed: the levels compactly encode the underlying tree structure
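The pointer-free array layout can be sketched as an encode/decode pair. The sketch assumes that root nodes have level 0, that itemsets are given as sorted tuples, and that every prefix of a stored itemset is itself stored; the names `Node`, `encode` and `decode` and the example counts are hypothetical.

```python
from typing import NamedTuple


class Node(NamedTuple):
    item: int    # item-id
    f: int       # frequency count
    delta: int   # maximum possible error
    level: int   # depth in the forest; roots are assumed to be level 0


def encode(entries):
    """entries: dict mapping a sorted tuple of item-ids -> (f, delta).

    Sorting the tuples lexicographically yields the pre-order of the forest.
    Assumes the key set is prefix-closed.
    """
    return [Node(k[-1], f, d, len(k) - 1) for k, (f, d) in sorted(entries.items())]


def decode(trie):
    """Rebuild (itemset, f, delta) triples using only the level field."""
    path, out = [], []
    for node in trie:
        del path[node.level:]   # climb back up to this node's parent
        path.append(node.item)
        out.append((tuple(path), node.f, node.delta))
    return out


entries = {
    (1,): (5, 0), (1, 2): (3, 0), (1, 2, 3): (2, 1), (1, 3): (4, 0),
    (2,): (6, 0), (2, 3): (2, 1),
}
trie = encode(entries)
for node in trie:
    print(node)
print(decode(trie))
```

Decoding keeps a single running path and truncates it to the node's level, which is why no parent or child pointers need to be stored.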

Module 3 - SetGen
- Input: the current batch of transactions from BUFFER
- Output: frequency counts of subsets, in lexicographic order
- SetGen uses the following pruning rule: if a subset S does not make its way into TRIE after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered

Overall Algorithm
[Figure: BUFFER feeds SUBSET-GEN; the output of SUBSET-GEN is combined with the existing TRIE to produce the new TRIE]

Conclusion
- Sticky Sampling and Lossy Counting are two approximate algorithms that find frequent items
- Both algorithms produce frequency counts within a user-specified error tolerance level, though Sticky Sampling is non-deterministic
- Lossy Counting can be extended to find frequent itemsets

Thank you very much…