False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams. Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Aoying Zhou.

Similar presentations
Association Rule Mining
Recap: Mining association rules from large datasets
Huffman Codes and Association Rules (II) Prof. Sin-Min Lee Department of Computer Science.
Association Analysis (2). Example: TID / list of item IDs: T1: I1, I2, I5; T2: I2, I4; T3: I2, I3; T4: I1, I2, I4; T5: I1, I3; T6: I2, I3; T7: I1, I3; T8: I1, I2, I3, I5; T9: I1, I2,
Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
Data Mining Techniques Association Rule
Association Rule Mining. 2 The Task Two ways of defining the task General –Input: A collection of instances –Output: rules to predict the values of any.
Adaptive Frequency Counting over Bursty Data Streams Bill Lin, Wai-Shing Ho, Ben Kao and Chun-Kit Chui From CIDM07.
Resource-oriented Approximation for Frequent Itemset Mining from Bursty Data Streams SIGMOD’14 Toshitaka Yamamoto, Koji Iwanuma, Shoshi Fukuda.
Mining Frequent Patterns in Data Streams at Multiple Time Granularities CS525 Paper Presentation Presented by: Pei Zhang, Jiahua Liu, Pengfei Geng and.
Sampling Large Databases for Association Rules (Toivonen's Approach, 1996) Farzaneh Mirzazadeh Fall 2007.
Data Mining Association Analysis: Basic Concepts and Algorithms
Adaptive Load Shedding for Mining Frequent Patterns from Data Streams Xuan Hong Dang, Wee-Keong Ng, and Kok-Leong Ong (DaWaK 2006) 2008/3/191Yi-Chun Chen.
Rakesh Agrawal Ramakrishnan Srikant
COMP5331 Data Stream Prepared by Raymond Wong Presented by Raymond Wong
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Mining Frequent Itemsets from Uncertain Data Presented by Chun-Kit Chui, Ben Kao, Edward Hung Department of Computer Science, The University of Hong Kong.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Heavy hitter computation over data stream
Evaluation.
Maintenance of Discovered Association Rules S.D. Lee, David W. Cheung Presentation: Pablo Gazmuri.
What's Hot and What's Not: Tracking Most Frequent Items Dynamically G. Cormode and S. Muthukrishnan, Rutgers University, ACM Principles of Database Systems.
© Vipin Kumar. CSci 8980: Data Mining (Fall 2002). Vipin Kumar, Army High Performance Computing Research Center, Department of Computer.
Performance and Scalability: Apriori Implementation.
Association Discovery from Databases Association rules are a simple formalism for expressing positive connections between columns in a 0/1 matrix. A classical.
VLDB 2012 Mining Frequent Itemsets over Uncertain Databases Yongxin Tong 1, Lei Chen 1, Yurong Cheng 2, Philip S. Yu 3 1 The Hong Kong University of Science.
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani, Stanford University, VLDB 2002.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Module 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
1 Approximating Quantiles over Sliding Windows Srimathi Harinarayanan CMPS 565.
August 21, 2002, VLDB. Gurmeet Singh Manku. Frequency Counts over Data Streams. Stanford University, USA.
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
Privacy-preserving rule mining. Outline  A brief introduction to association rule mining  Privacy preserving rule mining Single party  Perturbation.
Association Rule Mining Data Mining and Knowledge Discovery Prof. Carolina Ruiz and Weiyang Lin Department of Computer Science Worcester Polytechnic Institute.
Mining Frequent Itemsets from Uncertain Data Presenter : Chun-Kit Chui Chun-Kit Chui [1], Ben Kao [1] and Edward Hung [2] [1] Department of Computer Science.
© Tan, Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Association Analysis This lecture note is modified based on Lecture Notes for.
Space-Efficient Online Computation of Quantile Summaries SIGMOD 01 Michael Greenwald & Sanjeev Khanna Presented by ellery.
Mining Frequent Patterns, Associations, and Correlations Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Security in Outsourced Association Rule Mining. Agenda  Introduction  Approximate randomized technique  Encryption  Summary and future work.
M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining ARM: Improvements March 10, 2009 Slide.
Reducing Number of Candidates Apriori principle: – If an itemset is frequent, then all of its subsets must also be frequent Apriori principle holds due.
An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets Adam Kirsch, Michael Mitzenmacher, Harvard University; Andrea.
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
Improvement of Apriori Algorithm in Log mining Junghee Jaeho Information and Communications University,
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Frequency Counts over Data Streams
Reducing Number of Candidates
Frequent Pattern Mining
False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams Jeffrey Xu Yu, Zhihong Chong (崇志宏), Hongjun Lu.
Mining Frequent Itemsets over Uncertain Databases
A Parameterised Algorithm for Mining Association Rules
Farzaneh Mirzazadeh Fall 2007
COMP5331 FP-Tree Prepared by Raymond Wong Presented by Raymond Wong
Maintaining Frequent Itemsets over High-Speed Data Streams
Association Analysis: Basic Concepts
Dynamically Maintaining Frequent Items Over A Data Stream
Presentation transcript:

False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Aoying Zhou VLDB 2004

Introduction Mining data streams: – Data items arrive continuously – Only one scan of the data – Limited memory – Bounded error

Introduction This paper develops an algorithm that effectively mines frequent itemsets with a bound on memory consumption, using a false-negative approach.

False Positive Most existing algorithms for mining frequent itemsets are false-positive oriented: – They control memory consumption by an error parameter ε – They allow itemsets whose support is below the minimum support s but above s − ε to be reported as frequent. Example: Approximate frequency counts over data streams (VLDB 02).

False Positive Memory bound : O ( . log (εN)) Memory bound : O ( . log (εN)) Dilemma of false-positive approach Dilemma of false-positive approach –ε smaller, less # of false-positive item included –Memory consumption increase reciprocally in terms of ε –In Apriori, k-th frequent itemset generate (k+1)-th candidate itemset 1l ε

False Positive & False Negative On a support axis marked s − ε, s, and s + ε: – False positive: all itemsets with support ≥ s will be output, and some itemsets with support in [s − ε, s) will also be output – False negative: all itemsets with support ≥ s + ε will be output, but only some itemsets with support in [s, s + ε) will be output.

False Negative Error control and pruning: – ε: used to prune data and to control the error bound; it is not fixed: ε decreases and approaches zero as the number of observations increases (ε is computed as a decreasing function of n) – s: minimum support – n: number of observations.

False Negative Memory control: – δ: a reliability parameter; δ, rather than ε, controls memory consumption – Memory consumption is related to ln(1/δ). This approach does not allow 1-itemsets with support below s to be reported as frequent.

Comparison: False Positive & False Negative Recall and precision, where A is the set of true frequent itemsets and B is the set of obtained frequent itemsets: – Recall = |A ∩ B| / |A| – Precision = |A ∩ B| / |B|
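
Read literally, the two measures can be computed as below (a tiny illustrative helper, not code from the paper):

```python
def recall_precision(true_frequent, mined):
    """Recall = |A ∩ B| / |A|, Precision = |A ∩ B| / |B|, where A is the set
    of true frequent itemsets and B is the set of obtained frequent itemsets."""
    a, b = set(true_frequent), set(mined)
    inter = len(a & b)
    recall = inter / len(a) if a else 1.0
    precision = inter / len(b) if b else 1.0
    return recall, precision

# A false-positive miner returns every true frequent itemset plus extras,
# so recall is 1 while precision drops.
print(recall_precision({"A", "B"}, {"A", "B", "C", "D"}))  # (1.0, 0.5)
```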

Comparison: False Positive ε = s/10, δ = 0.1. Table columns: s (%), True Size, Mined Size, Recall, Precision.

Comparison: False Negative s + ε is used as the minimum support. Table columns: s (%), True Size, Mined Size, Recall, Precision.

Chernoff Bound The Chernoff bound gives a probabilistic guarantee on estimates of statistics about the underlying data: Pr{T ≥ e·E[T]} ≤ e^(−E[T]). For example, picking a lottery number: the numbers range over 0000, 0001, …, 9999, and 1,000,000 people each buy a $1 ticket, so E[#winners] = 1,000,000 / 10,000 = 100. Since e·100 ≈ 272, the bound gives Pr{T ≥ 273} ≤ e^(−100).
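
A quick check of the arithmetic in the lottery example (illustrative only):

```python
import math

n_players = 1_000_000          # each picks one of the 10,000 numbers
p_win = 1 / 10_000
expected_winners = n_players * p_win     # E[T] = 100
threshold = math.e * expected_winners    # e * E[T] ~= 271.8, so 273 qualifies
bound = math.exp(-expected_winners)      # e^(-100) ~= 3.7e-44
print(expected_winners, threshold, bound)
```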

Chernoff Bound Bernoulli trials (coin flips): – Pr[o_i = 1] = p, Pr[o_i = 0] = 1 − p – r: number of heads in n coin flips – np: expectation of r. For any γ > 0: Pr{|r − np| ≥ npγ} ≤ 2e^(−npγ²/2).

Chernoff Bound Substitute into the bound: write r̄ = r/n, take the minimum support s as p, replace sγ with ε, and set the right-hand side equal to δ. This yields – Pr{| RunningSupport − TrueSupport | ≥ ε_n} ≤ δ, with ε_n = sqrt(2·s·ln(2/δ) / n).

Frequent or Infrequent A pattern X is potentially infrequent at observation n if count(X) / n < s − ε_n. A pattern X is potentially frequent if it is not potentially infrequent at observation n.
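
A small helper that captures ε_n and the potential-infrequency test. The ε_n formula is the one obtained from the Chernoff-bound substitution on the previous slide; treat the function names and the exact expression as this sketch's assumptions rather than the paper's code.

```python
import math

def epsilon_n(s, delta, n):
    """Error bound after n observations; it shrinks toward 0 as n grows."""
    return math.sqrt(2 * s * math.log(2 / delta) / n)

def potentially_infrequent(count, n, s, delta):
    """A pattern X is potentially infrequent if count(X) / n < s - epsilon_n."""
    return count / n < s - epsilon_n(s, delta, n)

# With s = 1% and delta = 0.1, the tolerated slack tightens as n grows.
for n in (10_000, 100_000, 1_000_000):
    print(n, round(epsilon_n(0.01, 0.1, n), 5))
```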

FDPM-1(s, δ)

Running example of FDPM-1: items (A, C, D, B, …) arrive one at a time and are counted in an in-memory table of (item, count) entries. When memory is full, a new ε_n is computed and the potentially infrequent items are deleted.

FDPM-1(s, δ) The algorithm ensures: – Every item whose true frequency exceeds sN is output with probability at least 1 − δ – No item whose true frequency is less than sN is output – With probability at least 1 − δ, the estimated support of an output item equals its true support.
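
Below is a minimal Python sketch of a false-negative frequent-item counter in the spirit of FDPM-1, built only from what the slides state: count incoming items, compute a new ε_n when memory is full, delete potentially infrequent entries, and report only items whose count reaches s·n. The memory cap max_entries and the ε_n expression are assumptions of this sketch, not the paper's pseudocode.

```python
import math
from collections import defaultdict

class FDPM1Sketch:
    """Illustrative false-negative counter for a stream of single items."""

    def __init__(self, s, delta, max_entries=10_000):
        self.s = s                    # minimum support threshold
        self.delta = delta            # reliability parameter
        self.max_entries = max_entries
        self.n = 0                    # items observed so far
        self.counts = defaultdict(int)

    def _eps(self):
        # epsilon_n shrinks as n grows (Chernoff-style bound, as sketched above).
        return math.sqrt(2 * self.s * math.log(2 / self.delta) / self.n)

    def process(self, item):
        self.n += 1
        self.counts[item] += 1
        if len(self.counts) > self.max_entries:        # memory is full
            threshold = (self.s - self._eps()) * self.n
            self.counts = defaultdict(int, {x: c for x, c in self.counts.items()
                                            if c >= threshold})

    def output(self):
        # Running counts never overestimate the true frequency, so reporting
        # only counts >= s*n means no item with true frequency < s*n is output;
        # a truly frequent item is missed only if it was pruned, which the
        # Chernoff bound makes unlikely (probability at most delta).
        return {x: c for x, c in self.counts.items() if c >= self.s * self.n}
```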

Memory Bound Every item X kept in memory satisfies sup(X) ≥ (s − ε_n)·n. Since the stored counts sum to at most n, the number of entries is |P| ≤ n / ((s − ε_n)·n) = 1 / (s − ε_n), when s − ε_n > 0.

FDPM-2(s, δ)

Mining Frequent Itemsets from a Data Stream Running example of FDPM-2: transactions such as {A, B}, …, {E, F, G} arrive one by one, and the itemsets they contain (e.g., {A}, {B}, {AB}, {E}, {F}, {EF}, …) are counted in an in-memory table P. When memory is full, a new ε_n is computed and the potentially infrequent itemsets (e.g., {C}, {D}, {F}, {EF}) are deleted.
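
A rough sketch of the per-transaction loop suggested by this example. It is a simplification: it enumerates all subsets of a transaction up to a small size cap, whereas FDPM-2 controls itemset generation more carefully; max_entries and max_len are hypothetical parameters of this sketch only.

```python
import math
from itertools import combinations

class FDPM2Sketch:
    """Illustrative itemset counter over a stream of transactions."""

    def __init__(self, s, delta, max_entries=50_000, max_len=3):
        self.s, self.delta = s, delta
        self.max_entries = max_entries   # hypothetical memory cap
        self.max_len = max_len           # cap on enumerated itemset size
        self.n = 0                       # transactions seen so far
        self.counts = {}                 # frozenset itemset -> count

    def _eps(self):
        return math.sqrt(2 * self.s * math.log(2 / self.delta) / self.n)

    def process(self, transaction):
        self.n += 1
        items = sorted(set(transaction))
        for r in range(1, min(self.max_len, len(items)) + 1):
            for combo in combinations(items, r):
                key = frozenset(combo)
                self.counts[key] = self.counts.get(key, 0) + 1
        if len(self.counts) > self.max_entries:        # memory is full
            threshold = (self.s - self._eps()) * self.n
            self.counts = {x: c for x, c in self.counts.items()
                           if c >= threshold}

    def output(self):
        # Report itemsets whose (never over-counted) count reaches s * n.
        return {x: c for x, c in self.counts.items() if c >= self.s * self.n}
```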

Conclusion – A false-negative approach – Limited memory – Error bound that holds with some probability