Approximate Frequency Counts over Data Streams


Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Proceedings of the 28th VLDB Conference, 2002 2019/1/1 Presenter: 吳建良

Motivation In some new applications, data arrive as a continuous "stream" The sheer volume of a stream over its lifetime is huge Query response times should be small Examples: Network traffic measurements Market data

Network Traffic Management ALERT: "Red flow exceeds 1% of all traffic through me, check it!" Frequent items here are frequent flows identified at an IP router, for short-term monitoring and long-term management

Mining Market Data Frequent itemsets at a supermarket guide store layout and catalog design Among 100 million records: (1) at least 1% of customers buy beer and diapers at the same time (2) 51% of customers who buy beer also buy diapers!

Challenges Single pass Limited memory (network management) Enumeration of itemsets (market data mining)

General Solution Data Streams → Stream Processing Engine (Summary in Memory) → (Approximate) Answer

Approximate Algorithms Two algorithms are proposed for frequent items: Sticky Sampling Lossy Counting One algorithm is proposed for frequent itemsets: Lossy Counting extended to itemsets

Properties of the proposed algorithms All item(set)s whose true frequency exceeds sN are output No item(set) whose true frequency is less than (s − ε)N is output Estimated frequencies are less than the true frequencies by at most εN

Sticky Sampling Algorithm User input includes three parameters, namely: Support threshold s Error parameter ε Probability of failure δ Counts are kept in a data structure S Each entry in S is of the form (e, f), where: e is the item f is the estimated frequency of e in the stream When queried about the frequent items, all entries (e, f) such that f ≥ (s − ε)N are output, where N denotes the current length of the stream

Sticky Sampling Algorithm (cont'd) Example: S starts empty as the stream arrives

Sticky Sampling Algorithm (cont'd)
1. S ← ∅; N ← 0; t ← (1/ε) log(1/(sδ)); r ← 1
2. e ← next item; N ← N + 1
3. if (e, f) exists in S then increment the count f
   else insert (e, 1) into S with probability 1/r
4. if N reaches the next boundary 2t·2^i then r ← 2r; Prune(S)
5. goto 2
S: the set of all counts; e: item; N: current length of stream; r: sampling rate; t: (1/ε) log(1/(sδ))
When to prune S: whenever the sampling rate changes

Sticky Sampling Algorithm: Prune S
function Prune(S)
  for every entry (e, f) in S do
    while random(0,1) < 0.5 and f > 0 do f ← f − 1
    if f = 0 then remove the entry from S
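Putting the insertion loop and Prune together, a minimal Python sketch of Sticky Sampling might look as follows (the function name, the dict-based S, and the final filtering step are illustrative assumptions, not the authors' code):

```python
import math
import random

def sticky_sampling(stream, s, eps, delta):
    """Illustrative sketch: returns items whose estimated count f
    satisfies f >= (s - eps) * N, where N is the stream length."""
    S = {}                                        # item -> estimated count f
    t = (1.0 / eps) * math.log(1.0 / (s * delta))
    r = 1                                         # current sampling rate
    boundary = 2 * t                              # first 2t items at rate 1
    n = 0
    for e in stream:
        n += 1
        if e in S:
            S[e] += 1                             # tracked items always count
        elif random.random() < 1.0 / r:           # new items enter with prob 1/r
            S[e] = 1
        if n >= boundary:                         # boundaries at 2t, 4t, 8t, ...
            r *= 2
            boundary *= 2
            for item in list(S):                  # Prune: diminish each count by
                f = S[item]                       # a run of unbiased coin tosses
                while f > 0 and random.random() < 0.5:
                    f -= 1
                if f == 0:
                    del S[item]
                else:
                    S[item] = f
    return {e: f for e, f in S.items() if f >= (s - eps) * n}
```

For short streams (N below 2t) the sampling rate stays at 1, so every item is tracked exactly; randomness only matters once the rate starts doubling.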

Lossy Counting Algorithm The incoming data stream is conceptually divided into buckets of w = ⌈1/ε⌉ transactions The current bucket id is bcurrent = ⌈N/w⌉ fe: the true frequency of e in the stream Counts are kept in a data structure D Each entry in D is of the form (e, f, Δ), where: e is the item f is the estimated frequency of e in the stream Δ is the maximum possible error in f

Lossy Counting Algorithm (cont’d) Example: =0.2, w=5, N=17, bcurrent=4 Bucket 1 Bucket 2 Bucket 3 bcurrent= 4 A B C A B E A C C D D A B E D F C D D D D (A,2,0) (B,2,0) (C,1,0) (A,3,0) (B,2,0) (C,2,1) (E,1,1) (D,1,1) (A,4,0) (B,1,2) (C,2,1) (D,2,2) (E,1,2) (A,4,0) (C,1,3) (D,2,2) (F,1,3) Prune D Prune D Prune D D D D (A,2,0) (B,2,0) (A,3,0) (C,2,1) (A,4,0) (D,2,2)

Lossy Counting Algorithm (cont'd)
1. D ← ∅; N ← 0; w ← ⌈1/ε⌉; bcurrent ← 1
2. e ← next item; N ← N + 1
3. if (e, f, Δ) exists in D then f ← f + 1
   else insert (e, 1, bcurrent − 1) into D
4. if N mod w = 0 then prune(D, bcurrent); bcurrent ← bcurrent + 1
5. goto 2
D: the set of all counts; e: item; N: current length of stream; w: bucket width; bcurrent: current bucket id
When to prune D: at each bucket boundary

Lossy Counting Algorithm: prune D
function prune(D, bcurrent)
  for each entry (e, f, Δ) in D do
    if f + Δ ≤ bcurrent then remove the entry from D
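The five steps and the prune function combine into a short Python sketch (illustrative; the dict keyed by item holding (f, Δ) pairs and the final filtering step are assumptions):

```python
import math

def lossy_counting(stream, s, eps):
    """Illustrative sketch of Lossy Counting for single items."""
    w = math.ceil(1.0 / eps)             # bucket width
    D = {}                               # item -> (f, max possible error delta)
    n = 0
    for e in stream:
        n += 1
        b_current = math.ceil(n / w)
        if e in D:
            f, d = D[e]
            D[e] = (f + 1, d)
        else:
            # e may have been counted and pruned in the first
            # b_current - 1 buckets, hence the error term
            D[e] = (1, b_current - 1)
        if n % w == 0:                   # bucket boundary: prune light entries
            for item in list(D):
                f, d = D[item]
                if f + d <= b_current:
                    del D[item]
    return {e: f for e, (f, d) in D.items() if f >= (s - eps) * n}
```

Running this on the first two buckets of the slide's example (A B C A B, then E A C C D) leaves exactly the entries (A,3,0) and (C,2,1) after the second prune, matching the walkthrough above.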

Lossy Counting Algorithm (cont'd) Four Lemmas Lemma 1: Whenever deletions occur, bcurrent ≤ εN Lemma 2: Whenever an entry (e, f, Δ) gets deleted, fe ≤ bcurrent Lemma 3: If e does not appear in D, then fe ≤ εN Lemma 4: If (e, f, Δ) ∈ D, then f ≤ fe ≤ f + εN

Extended Lossy Counting for Frequent Itemsets The incoming data stream is conceptually divided into buckets of w = ⌈1/ε⌉ transactions Counts are kept in a data structure D Multiple buckets (β of them, say) are processed as a batch Each entry in D is of the form (set, f, Δ), where: set is the itemset f is the approximate frequency of set in the stream Δ is the maximum possible error in f

Extended Lossy Counting for Frequent Itemsets (cont'd) Bucket 1 | Bucket 2 | Bucket 3: put β (here 3) buckets of data into main memory at a time

Overview of the algorithm D is updated by the operations UPDATE_SET and NEW_SET UPDATE_SET updates and deletes entries in D: for each entry (set, f, Δ), count the occurrences of set in the batch and update the entry; if an updated entry satisfies f + Δ ≤ bcurrent, the entry is removed from D NEW_SET inserts new entries into D: if a set set has frequency f ≥ β in the batch and set does not occur in D, create a new entry (set, f, bcurrent − β)
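A simplified Python sketch of the batched itemset variant (illustrative assumptions: subsets are enumerated exhaustively up to a small max_size instead of using the Trie/SetGen machinery described later, and any trailing partial batch is ignored):

```python
import math
from itertools import combinations

def lossy_counting_itemsets(transactions, s, eps, beta=2, max_size=2):
    """Illustrative sketch: Lossy Counting over itemsets, processing
    beta buckets of w transactions at a time."""
    w = math.ceil(1.0 / eps)                      # bucket width
    D = {}                                        # frozenset -> (f, delta)
    n, b_current, batch = 0, 0, []

    def subsets(t):
        for k in range(1, max_size + 1):
            for c in combinations(sorted(t), k):
                yield frozenset(c)

    def process_batch():
        counts = {}
        for t in batch:
            for ss in subsets(t):
                counts[ss] = counts.get(ss, 0) + 1
        # UPDATE_SET: bump existing entries, drop those that became light
        for st in list(D):
            f, d = D[st]
            f += counts.pop(st, 0)
            if f + d <= b_current:
                del D[st]
            else:
                D[st] = (f, d)
        # NEW_SET: insert sets that are frequent within this batch
        for st, c in counts.items():
            if c >= beta:
                D[st] = (c, b_current - beta)
        batch.clear()

    for t in transactions:
        n += 1
        batch.append(t)
        if n % w == 0:                            # bucket boundary
            b_current = n // w
            if b_current % beta == 0:             # beta buckets ready
                process_batch()
    return {st: f for st, (f, d) in D.items() if f >= (s - eps) * n}
```

The batch threshold f ≥ β for NEW_SET is what keeps infrequent itemsets from ever entering D, at the cost of the extra β term in Δ.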

Implementation Challenges: Avoid enumerating all subsets of a transaction The data structure must be compact for better space efficiency 3 major modules: Buffer Trie SetGen

Implementation (cont'd) Buffer: repeatedly reads a batch of buckets of transactions into available main memory Trie: maintains the data structure D SetGen: generates subsets of item-ids along with their frequency counts in the current batch Not all possible subsets need to be generated: if a subset S is not inserted into D after applying both UPDATE_SET and NEW_SET, then no superset of S need be considered

Example
Batch in main memory (buckets 3 and 4): ACE, BCD, AB, ABC, AD, BCE
SetGen output: ACE: AC, A, C; BCD: BC, B, C; AB: AB, A, B; ABC: AB, AC, BC, A, B, C; AD: A; BCE: BC, B, C
D before: (A,5,0) (B,3,0) (C,3,0) (D,2,0) (AB,2,0) (AC,3,0) (AD,2,0) (BC,2,0)
After UPDATE_SET: (A,9,0) (B,7,0) (C,7,0) (AC,5,0) (BC,5,0)
NEW_SET: add (AB,2,2) into D

Experiments
IBM synthetic dataset T10.I4.1000K: N = 1 million, avg transaction size = 10, input size = 49 MB
IBM synthetic dataset T15.I6.1000K: N = 1 million, avg transaction size = 15, input size = 69 MB
Frequent word pairs in 100K web documents: N = 100K, avg transaction size = 134, input size = 54 MB
Frequent word pairs in 806K Reuters news reports: N = 806K, avg transaction size = 61, input size = 210 MB

Varying support s and BUFFER B [Charts: running time in seconds vs BUFFER size in MB, for several support thresholds s (0.001 to 0.020), on IBM 1M transactions and Reuters 806K docs] Fixed: stream length N Varying: BUFFER size B, support threshold s

Varying length N and support s [Charts: running time in seconds vs stream length in thousands, for s = 0.002 and s = 0.004, on IBM 1M transactions and Reuters 806K docs] Fixed: BUFFER size B Varying: stream length N, support threshold s

Varying BUFFER B and support s [Charts: running time in seconds vs support threshold s, for B = 4, 16, 28, 40 MB, on IBM 1M transactions and Reuters 806K docs] Fixed: stream length N Varying: BUFFER size B, support threshold s

Comparison with fast A-priori (Dataset: IBM T10.I4.1000K, 1M transactions, average size 10; cells not reported are marked n/a)

Support   A-priori (Time / Memory)   Ours, 4 MB buffer (Time / Memory)   Ours, 44 MB buffer (Time / Memory)
0.001     99 s / 82 MB               111 s / 12 MB                       27 s / 45 MB
0.002     25 s / 53 MB               94 s / 10 MB                        15 s / n/a
0.004     14 s / 48 MB               65 s / 7 MB                         8 s / n/a
0.006     13 s / n/a                 46 s / 6 MB                         6 s / n/a
0.008     n/a                        34 s / 5 MB                         4 s / n/a
0.010     n/a                        26 s / n/a                          n/a

Number of counters required Sticky Sampling, expected: (2/ε) log(1/(sδ)) Lossy Counting, worst case: (1/ε) log(εN) [Charts: number of counters vs N (stream length) and vs log10 of N, for support s = 1%, error ε = 0.1%]
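The two bounds can be compared numerically; the chart's s = 1% and ε = 0.1% are given, while δ = 0.01% is an assumed value for illustration:

```python
import math

def sticky_counters(s, eps, delta):
    """Expected Sticky Sampling counters: (2/eps) * ln(1/(s*delta))."""
    return (2.0 / eps) * math.log(1.0 / (s * delta))

def lossy_counters(eps, n):
    """Worst-case Lossy Counting counters: (1/eps) * ln(eps*n)."""
    return (1.0 / eps) * math.log(eps * n)

# With s = 0.01, eps = 0.001, delta = 0.0001 (delta assumed):
# sticky_counters is about 27,631 regardless of N, while lossy_counters
# grows with N: about 6,908 at N = 10^6 and 13,816 at N = 10^9.
```

This is why the charts plot Sticky Sampling as a flat line in N while Lossy Counting grows logarithmically, yet Lossy Counting uses fewer counters at practical stream lengths.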