Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules S.D. Lee, David W. Cheung, Ben Kao The University of Hong.

Presentation transcript:

Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules S.D. Lee, David W. Cheung, Ben Kao The University of Hong Kong Data Mining and Knowledge Discovery, 1998 Revised and Presented by Matthew Starbuck : April 2,

Outline
• Definitions
• Old algorithms: Apriori, FUP2
• New algorithm: DELI (design, pseudo code, sampling techniques)
• Experiments (comparisons) showing DELI is better
• Consecutive runs, conclusions, exam questions

Definitions (1)
• $D$ = transaction set (the database)
• $I$ = full item set
• [Figure: example transactions $T_1$, $T_2$, $T_3$]

Definitions (2)
• For the example itemset $X$ shown on the slide: $\sigma_X$ = support count = 4
• Support = 4/5 = 80%
• Support threshold: $s\%$

Definitions (3)
• $k$-itemset: an itemset containing $k$ items.
• Large itemset: an itemset whose support is larger than the support threshold; $L_k$ denotes the set of large $k$-itemsets.
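The definitions above can be illustrated with a minimal Python sketch. The database `D` and itemset `X` below are hypothetical (the slide's own transactions are not preserved); they are chosen so that the support count is 4 out of 5 transactions, matching the 80% figure on the Definitions (2) slide.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Support as a fraction of the database size |D|."""
    return support_count(itemset, transactions) / len(transactions)


# Hypothetical database D with five transactions (not the slide's data).
D = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk", "beer"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs", "beer"},
)]
X = frozenset({"bread", "milk"})

print(support_count(X, D), support(X, D))
```

With a support threshold of, say, 40%, `X` would count as a large 2-itemset here.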

Old Algorithm (1): Apriori

Pseudo code of Apriori

    get C_1; k = 1;
    until (C_k is empty || L_k is empty) do {
        get L_k from C_k using the minimum support count;
        use apriori_gen() to generate C_{k+1} from L_k;
        k++;
    };
    return union(all L_k);
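The pseudo code above might be sketched in Python roughly as follows. This is an illustrative version, not the authors' implementation: the function bodies, the `min_count` parameter, and the choice to treat an itemset as large when its count reaches `min_count` are assumptions made for the example.

```python
from itertools import combinations


def apriori_gen(large_k):
    """Join large k-itemsets into (k+1)-candidates and prune any
    candidate that has a k-subset which is not large."""
    large_k = set(large_k)
    candidates = set()
    for a in large_k:
        for b in large_k:
            union = a | b
            if len(union) == len(a) + 1 and all(
                frozenset(sub) in large_k
                for sub in combinations(union, len(a))
            ):
                candidates.add(union)
    return candidates


def apriori(transactions, min_count):
    """Return {itemset: support count} for all large itemsets."""
    items = {item for t in transactions for item in t}
    candidates = {frozenset([i]) for i in items}  # C_1
    large = {}
    while candidates:
        # One pass over the database to count every candidate in C_k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        large_k = {c: n for c, n in counts.items() if n >= min_count}  # L_k
        if not large_k:
            break
        large.update(large_k)
        candidates = apriori_gen(large_k.keys())  # C_{k+1}
    return large


# Hypothetical example database.
transactions = [frozenset(t) for t in ("abc", "ab", "ac", "bc", "abc")]
result = apriori(transactions, min_count=3)
```

Each loop iteration costs one full scan of the database, which is why rerunning Apriori after every update is expensive.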

[Figure: worked Apriori example with $s\% = 40\%$; $C_k$ = candidate set, $L_k$ = large set, with apriori_gen() linking one level to the next]

Using Apriori for Maintenance
• Simplest approach: apply the algorithm to the updated database again.
• Not efficient: it fails to reuse the results of previous mining.
• Very costly.

Old Algorithm (2): FUP2
• FUP2 works similarly to Apriori, generating large itemsets iteratively.
• For itemsets that were large in the old database, it scans only the updated part of the database.
• For the rest, it scans the whole database.

• $\Delta^-$: set of deleted transactions
• $\Delta^+$: set of added transactions
• $D$: old database
• $D'$: updated database
• $D^*$: set of unchanged transactions
• $\sigma_X$: support count of itemset $X$
• $\sigma'_X$: new support count of itemset $X$
• $\delta_X^-$: support count of itemset $X$ in $\Delta^-$
• $\delta_X^+$: support count of itemset $X$ in $\Delta^+$

Pseudo code of FUP2

    get C_1; k = 1;
    until (C_k is empty || L_k' is empty) do {
        divide C_k into two partitions: P_k = C_k ∩ L_k and Q_k = C_k - P_k;
        for X in P_k, calculate σ'_X = σ_X - δ_X^- + δ_X^+ and get part 1 of L_k';
        for X in Q_k, eliminate candidates with δ_X^+ - δ_X^- < (|Δ^+| - |Δ^-|) s%;
        for the remaining candidates X in Q_k, scan D* to get part 2 of L_k';
        use apriori_gen() to generate C_{k+1} from L_k';
        k++;
    };
    return union(all L_k');

[Figure: diagram relating $C_k$, $L_k$, $P_k$, $Q_k$ and the database parts $\Delta^-$ ($\delta_X^-$), $\Delta^+$ ($\delta_X^+$), $D^*$, $D$ ($\sigma_X$), $D'$ ($\sigma'_X$)]
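One level of the FUP2 loop could look like the following Python sketch. The function name `fup2_level`, its parameters, and the convention that an itemset is large when its count is at least `s * |D'|` are assumptions for illustration; only the partition into $P_k$ and $Q_k$, the update formula, and the pruning test come from the pseudo code above.

```python
def fup2_level(candidates, old_large, delta_minus, delta_plus, unchanged, s):
    """One level of a FUP2-style update (illustrative sketch).

    candidates  : set of frozensets (C_k)
    old_large   : dict {itemset: support count in the old database D} (L_k)
    delta_minus : deleted transactions, delta_plus: added transactions
    unchanged   : transactions of D* (kept from D)
    s           : support threshold as a fraction, e.g. 0.02 for 2%
    """
    def count(x, txns):
        return sum(1 for t in txns if x <= t)

    min_count = s * (len(unchanged) + len(delta_plus))  # threshold in D'
    new_large = {}
    for x in candidates:
        d_minus = count(x, delta_minus)
        d_plus = count(x, delta_plus)
        if x in old_large:
            # P_k: old count is known, so only the update needs scanning.
            new_count = old_large[x] - d_minus + d_plus
        else:
            # Q_k: prune when the net gain cannot lift X over the threshold.
            if d_plus - d_minus < (len(delta_plus) - len(delta_minus)) * s:
                continue
            new_count = count(x, unchanged) + d_plus  # needs a scan of D*
        if new_count >= min_count:
            new_large[x] = new_count
    return new_large


# Hypothetical data: old D = unchanged + delta_minus, new D' = unchanged + delta_plus.
unchanged = [frozenset(t) for t in ("ab", "a", "b", "ab")]
delta_minus = [frozenset("a")]
delta_plus = [frozenset("bc"), frozenset("bc")]
old_large = {frozenset("a"): 4, frozenset("b"): 3}
level_1 = fup2_level(
    {frozenset("a"), frozenset("b"), frozenset("c")},
    old_large, delta_minus, delta_plus, unchanged, s=0.5,
)
```

Only the `Q_k` branch touches `unchanged`; that scan of $D^*$ is the expensive step.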

An Example of FUP2

DELI Algorithm
• DELI: Difference Estimation for Large Itemsets
• Key idea: as long as the update is not too large, DELI examines only a sample of the database rather than scanning it in full.

Basic pseudo code of DELI

    get C_1; k = 1;
    until (C_k is empty || L_k' is empty) do {
        divide C_k into two partitions: P_k = C_k ∩ L_k and Q_k = C_k - P_k;
        for X in P_k, calculate σ'_X = σ_X - δ_X^- + δ_X^+ and get part 1 of L_k';
        for X in Q_k, eliminate candidates with δ_X^+ - δ_X^- < (|Δ^+| - |Δ^-|) s%;
        for the remaining candidates X in Q_k, scan D* (in DELI: a sample subset of D*) to get part 2 of L_k';
        use apriori_gen() to generate C_{k+1} from L_k';
        k++;
    };
    return union(all L_k');

Binomial Distribution
• Assume 5% of the population is green-eyed.
• You pick 500 people at random, with replacement.
• The number of green-eyed people you pick is a random variable $X$ that follows a binomial distribution with $n = 500$ and $p = 0.05$ (the 5% stated above).
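For the green-eyed example, the binomial mean is $np = 25$ and the variance is $np(1-p) = 23.75$. A small Python sketch of these quantities follows; the helper name `binomial_pmf` is introduced here only for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 500, 0.05            # 500 draws, 5% green-eyed
mean = n * p                # expected number of green-eyed people
variance = n * p * (1 - p)  # spread of that number around the mean
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(mean, variance, total)
```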

Binomial Distribution [figure]

Sampling Techniques (1)
• Consider an arbitrary itemset $X$.
• Randomly select $m$ transactions from $D$, with replacement.
• $T_X$ = the number of sampled transactions (out of $m$) that contain $X$.
• $T_X$ is binomially distributed with $p = \sigma_X / |D|$ and $n = m$, so its mean is $np = (m/|D|)\,\sigma_X$ and its variance is $np(1-p)$.

Sampling Techniques (2)
• $T_X$ is approximately normally distributed with mean $(m/|D|)\,\sigma_X$ and variance $mp(1-p)$.
• Define the estimator $\hat{\sigma}_X = (|D|/m)\,T_X$.
• $\hat{\sigma}_X$ is approximately normally distributed with mean $\sigma_X$ and variance $\sigma_X(|D| - \sigma_X)/m$.
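The estimator can be tried out with a short simulation. The sketch below is an assumption-laden illustration: the function names, the toy database of 1000 transactions, and the seeded `random.Random` generator are all introduced here and do not come from the slides.

```python
import random


def estimate_support_count(transactions, itemset, m, rng):
    """Draw m transactions with replacement and scale T_X by |D| / m."""
    t_x = sum(1 for _ in range(m) if itemset <= rng.choice(transactions))
    return len(transactions) / m * t_x


def estimator_variance(sigma_x, db_size, m):
    """Variance of the estimator: sigma_X * (|D| - sigma_X) / m."""
    return sigma_x * (db_size - sigma_x) / m


# Hypothetical database: 200 of 1000 transactions contain item "x".
database = [frozenset({"x"})] * 200 + [frozenset({"y"})] * 800
rng = random.Random(0)
estimate = estimate_support_count(database, frozenset({"x"}), m=500, rng=rng)
variance = estimator_variance(200, len(database), 500)

print(estimate, variance)
```

The estimate should land near the true count of 200, with a standard deviation of about 18 (the square root of the variance).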

Confidence Interval [Figure: normal curve with mean $\sigma_X$, interval endpoints $a_X$ and $b_X$, and tail area $\alpha/2$]

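Sampling Techniques (3)
• We can obtain a $100(1-\alpha)\%$ confidence interval $[a_X, b_X]$ for $\sigma_X$ from the estimate $\hat{\sigma}_X$ and the critical value $z_{\alpha/2}$.
• Typical values: for $\alpha = 0.1$, $z_{\alpha/2} = 1.645$; for $\alpha = 0.05$, $z_{\alpha/2} = 1.960$; for $\alpha = 0.01$, $z_{\alpha/2} =$ …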
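The interval formula itself is not preserved in this transcript. A common normal-approximation construction, given here as a sketch rather than as the authors' exact formula, is $\hat{\sigma}_X \pm z_{\alpha/2}\sqrt{\hat{\sigma}_X(|D| - \hat{\sigma}_X)/m}$, where the unknown $\sigma_X$ in the variance is replaced by its estimate (an assumption of this example). The function names below are hypothetical; `statistics.NormalDist` is from the Python standard library.

```python
from math import sqrt
from statistics import NormalDist


def z_critical(alpha):
    """Two-sided critical value z_{alpha/2} of the standard normal."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def support_confidence_interval(t_x, m, db_size, alpha):
    """Approximate 100(1 - alpha)% interval [a_X, b_X] for sigma_X.

    Assumption: the estimate is plugged into the variance formula
    in place of the unknown sigma_X.
    """
    estimate = db_size / m * t_x
    half_width = z_critical(alpha) * sqrt(estimate * (db_size - estimate) / m)
    return estimate - half_width, estimate + half_width


# Hypothetical sample: 100 of m = 500 sampled transactions contain X, |D| = 1000.
a_x, b_x = support_confidence_interval(t_x=100, m=500, db_size=1000, alpha=0.05)
print(a_x, b_x)
```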

Sampling Techniques (4)
• The width of this interval is …
• The widths of all confidence intervals are no more than …
• Suppose we want the widths not to exceed …
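The formulas on this slide are not preserved, but a bound of the kind it describes follows from the variance $\sigma_X(|D| - \sigma_X)/m$ given on the Sampling Techniques (2) slide. The derivation below is a reconstruction under that assumption; the symbols $w$ (a target width) and $c$ (that width as a fraction of $|D|$) are introduced here for illustration and are not the slide's notation.

```latex
\begin{align*}
b_X - a_X &= 2\,z_{\alpha/2}\sqrt{\frac{\sigma_X\,(|D| - \sigma_X)}{m}} \\
\sigma_X\,(|D| - \sigma_X) &\le \frac{|D|^2}{4}
  \qquad \text{(equality at } \sigma_X = |D|/2\text{)} \\
b_X - a_X &\le \frac{z_{\alpha/2}\,|D|}{\sqrt{m}} \\
\frac{z_{\alpha/2}\,|D|}{\sqrt{m}} \le w
  &\iff m \ge \left(\frac{z_{\alpha/2}\,|D|}{w}\right)^{2}
\end{align*}
```

If the target width is a fixed fraction of the database size, $w = c\,|D|$, the requirement becomes $m \ge (z_{\alpha/2}/c)^2$, in which $|D|$ no longer appears.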

Sampling Techniques (5)
• If $s = 2$ and $\alpha = 0.05$, then $z_{\alpha/2} = 1.96$.
• Solving the above inequality gives $m \ge$ …
• This value is independent of the size of the database $D$.
• Note: $D$ may contain billions of transactions, yet a sample of around 19 thousand is large enough for the desired accuracy in this example.

Obtaining the Estimated Set of $L_k$
• $L_k^{\gg}$: large in $D$ and in $D'$.
• $L_k^{>}$: not large in $D$, large in $D'$ with a certain confidence.
• $L_k^{\approx}$: not large in $D$, maybe large in $D'$.
• $L_k'$: approximation of the new $L_k$, with $L_k' = L_k^{\gg} \cup L_k^{>} \cup L_k^{\approx}$.
• [Figure: diagram relating $L_k$, $L_k'$, $C_k$, $Q_k$, $P_k$, $L_k^{\gg}$, $L_k^{>}$, $L_k^{\approx}$]

Criteria for Performing a Full Update
• Degree of uncertainty: $u_k = |L_k^{\approx}| / |L_k'|$ is the uncertainty factor, and $u_k^-$ is a user-specified threshold. If $u_k \ge u_k^-$, then DELI halts and FUP2 is needed.
• Amount of change (symmetric difference): $\eta_k = |L_k - L_k'|$ and $\xi_k = |L_k^{>}| + |L_k^{\approx}|$; $d_k^-$ is a user-specified threshold. If $d_k \ge d_k^-$, then DELI halts and FUP2 is needed.
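The quantities named on this slide can be computed directly from the four sets. The sketch below uses hypothetical names and toy itemsets. It stops at $u_k$, $\eta_k$, and $\xi_k$, because the transcript does not preserve how $\eta_k$ and $\xi_k$ combine into $d_k$.

```python
def change_measures(old_large, still_large, newly_large, maybe_large):
    """Return (u_k, eta_k, xi_k) for one level k.

    old_large   : L_k, large itemsets of the old database
    still_large : itemsets large in both D and D'
    newly_large : itemsets not large in D, large in D' with some confidence
    maybe_large : itemsets not large in D, maybe large in D'
    """
    new_large = still_large | newly_large | maybe_large  # L_k'
    u_k = len(maybe_large) / len(new_large) if new_large else 0.0
    eta_k = len(old_large - new_large)           # itemsets that dropped out
    xi_k = len(newly_large) + len(maybe_large)   # itemsets that came in
    return u_k, eta_k, xi_k


# Hypothetical level-1 itemsets.
A, B, C, E, F = (frozenset(x) for x in "ABCEF")
u_k, eta_k, xi_k = change_measures(
    old_large={A, B, C}, still_large={A, B}, newly_large={E}, maybe_large={F}
)
needs_full_update = u_k >= 0.05  # 0.05 is the suggested uncertainty threshold
```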

Pseudo code of DELI

    get C_1; k = 1;
    until (C_k is empty || L_k' is empty) do {
        divide C_k into two partitions: P_k = C_k ∩ L_k and Q_k = C_k - P_k;
        for X in P_k, calculate σ'_X = σ_X - δ_X^- + δ_X^+ and get part 1 of L_k';
        for X in Q_k, eliminate candidates with δ_X^+ - δ_X^- < (|Δ^+| - |Δ^-|) s%;
        for the remaining candidates X in Q_k, scan a sample subset of D* to get part 2 of L_k';
        use apriori_gen() to generate C_{k+1} from L_k';
        if any criterion is met, then terminate and go to FUP2;
        k++;
    };
    return union(all L_k');
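The sampled step for a candidate in $Q_k$ might be sketched as follows. This is an interpretation, not the authors' code: it assumes the count in $D^*$ is estimated from $m$ draws with replacement, that $\delta_X^+$ is then added exactly, and that a candidate is labeled by comparing the resulting interval with the threshold $s \cdot |D'|$. The function name, labels, and classification rule are assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def classify_candidate(itemset, unchanged, delta_plus, s, m, alpha, rng):
    """Label a Q_k candidate as 'large', 'maybe large' or 'not large' in D'."""
    n_star = len(unchanged)
    # Estimate the support count in D* from a sample of size m.
    hits = sum(1 for _ in range(m) if itemset <= rng.choice(unchanged))
    est_star = n_star / m * hits
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sqrt(est_star * (n_star - est_star) / m)
    # The added transactions are small, so count them exactly.
    d_plus = sum(1 for t in delta_plus if itemset <= t)
    low, high = est_star - half + d_plus, est_star + half + d_plus
    threshold = s * (n_star + len(delta_plus))
    if low >= threshold:
        return "large"        # whole interval at or above the threshold
    if high < threshold:
        return "not large"    # whole interval below the threshold
    return "maybe large"      # interval straddles the threshold


# Hypothetical D*: "x" is very frequent, "y" is rare.
d_star = [frozenset({"x"})] * 900 + [frozenset({"y"})] * 10 + [frozenset()] * 90
rng = random.Random(1)
label_x = classify_candidate(frozenset({"x"}), d_star, [], 0.5, 400, 0.05, rng)
label_y = classify_candidate(frozenset({"y"}), d_star, [], 0.5, 400, 0.05, rng)
```

Under this reading, the three labels correspond to the "large with a certain confidence", "maybe large", and discarded candidates.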

An Improvement
• Store the support counts of all 1-itemsets.
• Extra storage: $O(|I|)$.

Experiment Preparation
• Synthetic databases: generate $D$, $\Delta^+$, $\Delta^-$.
• 1% to 18% of the large itemsets are changed by the updates.
• $u_k^- = \infty$ and $d_k^- = \infty$.

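Experimental Results (1) [plot]: $\alpha = 0.05$, $z_{\alpha/2} = 1.960$, $|\Delta^+| = |\Delta^-| = 5000$, $|D| =$ …, $s\% = 2\%$

Experimental Results (2) [plot]: $\alpha = 0.05$, $z_{\alpha/2} = 1.960$, $|\Delta^+| = |\Delta^-| = 5000$, $|D| =$ …, $s\% = 2\%$

Experimental Results (3) [plot]: $m = 20000$, $|\Delta^+| = |\Delta^-| = 5000$, $|D| =$ …, $s\% = 2\%$

Experimental Results (4) [plot]: $m = 20000$, $|\Delta^+| = |\Delta^-| = 5000$, $|D| =$ …, $s\% = 2\%$

Experimental Results (5) [plot]: $\alpha = 0.05$, $z_{\alpha/2} = 1.960$, $m = 20000$, $|D| =$ …, $s\% = 2\%$

Experimental Results (6) [plot]: $\alpha = 0.05$, $z_{\alpha/2} = 1.960$, $m = 20000$, $|D| =$ …, $s\% = 2\%$

Experimental Results (7) [plot]: $\alpha = 0.05$, $z_{\alpha/2} = 1.960$, $|\Delta^-| = 5000$, $m =$ …, $|D| =$ …, $s\% = 2\%$

Experimental Results (8) [plot]: $\alpha = 0.05$, $z_{\alpha/2} = 1.960$, $|\Delta^-| = 5000$, $m =$ …, $|D| =$ …, $s\% = 2\%$

Experimental Results (9) [plot]: $\alpha = 0.05$, $z_{\alpha/2} = 1.960$, $|\Delta^+| = 5000$, $m =$ …, $|D| =$ …, $s\% = 2\%$

Experimental Results (10) [plot]: $\alpha = 0.05$, $z_{\alpha/2} = 1.960$, $|\Delta^+| = 5000$, $m =$ …, $|D| =$ …, $s\% = 2\%$

Experimental Results (11) [plot]: $\alpha = 0.05$, $z_{\alpha/2} = 1.960$, $|\Delta^+| = |\Delta^-| = 5000$, $m =$ …, $|D| =$ …

Experimental Results (12) [plot]: $\alpha = 0.05$, $z_{\alpha/2} = 1.960$, $|\Delta^+| = |\Delta^-| = 5000$, $m =$ …, $|D| =$ …

Experimental Results (13) [plot]: $\alpha = 0.05$, $z_{\alpha/2} = 1.960$, $|\Delta^+| = |\Delta^-| = 5000$, $m =$ …, $s\% = 2\%$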

Experimental Summary
• $u_c^- < 0.036$, very low.
• When $|\Delta^-| < 10000$, $d_c^- < 0.1$.
• When $|\Delta^-| = 20000$, $d_c^- < 0.21$.
• Suggested thresholds: $u^- = 0.05$, $d^- =$ …

Consecutive Runs
• Say we use Apriori to find the association rules in a database.
• Later, the first batch of updates arrives; use DELI to decide whether the rules need to be updated (result $r$).
• If $r$ = F, keep using the old association rules.
• When the second batch arrives, check both batches together for significant changes.
• Since processing the second batch would repeat work already done for the first, we must spend some storage space to avoid it.
• For that, we keep every $\delta_X^+$ and $\delta_X^-$.
• Repeat for each later batch, so that every update can reuse what was stored from the previous batch.

Conclusions
• Real-world databases are updated constantly, so the knowledge extracted from them changes too.
• The authors proposed the DELI algorithm to determine whether the change is significant, so that it is known when to update the extracted association rules.
• The algorithm applies sampling techniques and statistical methods to efficiently estimate an approximation of the large itemsets.

Final Exam Questions
• Q1: Compare and contrast FUP2 and DELI.
• Both algorithms are used in association analysis.
• Goal: DELI decides when to update the association rules, while FUP2 provides an efficient way of updating them.
• Technique: DELI scans a small portion of the database (a sample) and approximates the large itemsets, whereas FUP2 scans the whole database and returns the large itemsets exactly.
• DELI saves machine resources and time.

Final Exam Questions
• Q2: DELI stands for Difference Estimation for Large Itemsets.
• Q3: Difference between Apriori and FUP2: Apriori scans the whole database to find association rules and does not use old data mining results. For most itemsets, FUP2 scans only the updated part of the database and takes advantage of the old association analysis results.

Thank you! Now it is discussion time!