Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules S.D. Lee, D. W. Cheung, B. Kao Department of Computer Science.

Similar presentations
Association Rules Evgueni Smirnov.

Association Rule Mining
Recap: Mining association rules from large datasets
Association Analysis (2). Example TIDList of item ID’s T1I1, I2, I5 T2I2, I4 T3I2, I3 T4I1, I2, I4 T5I1, I3 T6I2, I3 T7I1, I3 T8I1, I2, I3, I5 T9I1, I2,
Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
A distributed method for mining association rules
Frequent Closed Pattern Search By Row and Feature Enumeration
LOGO Association Rule Lecturer: Dr. Bo Yuan
1 Department of Information & Computer Education, NTNU SmartMiner: A Depth First Algorithm Guided by Tail Information for Mining Maximal Frequent Itemsets.
Association rules The goal of mining association rules is to generate all possible rules that exceed some minimum user-specified support and confidence.
FP (FREQUENT PATTERN)-GROWTH ALGORITHM ERTAN LJAJIĆ, 3392/2013 Elektrotehnički fakultet Univerziteta u Beogradu.
Sampling Large Databases for Association Rules ( Toivenon’s Approach, 1996) Farzaneh Mirzazadeh Fall 2007.
Mining Frequent Itemsets from Uncertain Data Presented by Chun-Kit Chui, Ben Kao, Edward Hung Department of Computer Science, The University of Hong Kong.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
1 Fast Algorithms for Mining Association Rules Rakesh Agrawal Ramakrishnan Srikant.
Association Rule Mining Part 2 (under construction!) Introduction to Data Mining with Case Studies Author: G. K. Gupta Prentice Hall India, 2006.
Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules S.D. Lee, David W. Cheung, Ben Kao The University of Hong.
Data Mining Association Analysis: Basic Concepts and Algorithms
Maintenance of Discovered Association Rules S.D.LeeDavid W.Cheung Presentation : Pablo Gazmuri.
2/8/00CSE 711 data mining: Apriori Algorithm by S. Cha 1 CSE 711 Seminar on Data Mining: Apriori Algorithm By Sung-Hyuk Cha.
Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules S.D. Lee David W. Cheung Ben Kao The University of Hong Kong.
Lecture14: Association Rules
Mining Association Rules
1 Fast Algorithms for Mining Association Rules Rakesh Agrawal Ramakrishnan Srikant Slides from Ofer Pasternak.
Mining Association Rules
© Vipin Kumar CSci 8980 Fall CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer.
Performance and Scalability: Apriori Implementation.
Mining Association Rules in Large Databases. What Is Association Rule Mining?  Association rule mining: Finding frequent patterns, associations, correlations,
CS 349: Market Basket Data Mining All about beer and diapers.
Abrar Fawaz AlAbed-AlHaq Kent State University October 28, 2011
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Association Rules. CS583, Bing Liu, UIC 2 Association rule mining Proposed by Agrawal et al in Initially used for Market Basket Analysis to find.
Mining Sequential Patterns Rakesh Agrawal Ramakrishnan Srikant Proc. of the Int ’ l Conference on Data Engineering (ICDE) March 1995 Presenter: Sam Brown.
Apriori Algorithms Feapres Project. Outline 1.Association Rules Overview 2.Apriori Overview – Apriori Advantage and Disadvantage 3.Apriori Algorithms.
Implementation of “A New Two-Phase Sampling Based Algorithm for Discovering Association Rules” Tokunbo Makanju Adan Cosgaya Faculty of Computer Science.
CS 8751 ML & KDDSupport Vector Machines1 Mining Association Rules KDD from a DBMS point of view –The importance of efficiency Market basket analysis Association.
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
Outline Introduction – Frequent patterns and the Rare Item Problem – Multiple Minimum Support Framework – Issues with Multiple Minimum Support Framework.
Privacy-preserving rule mining. Outline  A brief introduction to association rule mining  Privacy preserving rule mining Single party  Perturbation.
Part II - Association Rules © Prentice Hall1 DATA MINING Introductory and Advanced Topics Part II – Association Rules Margaret H. Dunham Department of.
Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules S.D. Lee, David W. Cheung, Ben Kao The University of Hong.
1 Efficient Algorithms for Incremental Update of Frequent Sequences Minghua ZHANG Dec. 7, 2001.
Mining Frequent Itemsets from Uncertain Data Presenter : Chun-Kit Chui Chun-Kit Chui [1], Ben Kao [1] and Edward Hung [2] [1] Department of Computer Science.
1 On Mining General Temporal Association Rules in a Publication Database Chang-Hung Lee, Cheng-Ru Lin and Ming-Syan Chen, Proceedings of the 2001 IEEE.
Data Mining Find information from data data ? information.
1 AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Hong.
CS Data Mining1 Data Mining The Extraction of useful information from data The automated extraction of hidden predictive information from (large)
M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining ARM: Improvements March 10, 2009 Slide.
A Scalable Association Rules Mining Algorithm Based on Sorting, Indexing and Trimming Chuang-Kai Chiou, Judy C. R Tseng Proceedings of the Sixth International.
1 The Strategies for Mining Fault-Tolerant Patterns Jia-Ling Koh Department of Information and Computer Education National Taiwan Normal University.
Data Mining Association Rules Mining Frequent Itemset Mining Support and Confidence Apriori Approach.
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
1 Mining the Smallest Association Rule Set for Predictions Jiuyong Li, Hong Shen, and Rodney Topor Proceedings of the 2001 IEEE International Conference.
Association Rules Repoussis Panagiotis.
Data Mining and Its Applications to Image Processing
Frequent Pattern Mining
Byung Joon Park, Sung Hee Kim
Dynamic Itemset Counting
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rule Mining
An Efficient Algorithm for Incremental Mining of Association Rules
Data Mining Association Analysis: Basic Concepts and Algorithms
Farzaneh Mirzazadeh Fall 2007
Unit 3 MINING FREQUENT PATTERNS ASSOCIATION AND CORRELATIONS
Association Analysis: Basic Concepts
Presentation transcript:

Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules
S.D. Lee, D. W. Cheung, B. Kao, Department of Computer Science, The University of Hong Kong
Data Mining and Knowledge Discovery, 1998
Presenter: Elena Zheleva, April 8, 2004

Introduction
- Data mining enables us to find out useful information from huge databases
- It enables marketers to develop and implement customized marketing programs and strategies
- Databases are not static, so maintenance of discovered association rules is an important problem
  - Example: an inventory database

Introduction (continued)
- To update the association rules, multiple scans of the database will be necessary
- The authors propose a method to determine when to update the association rules by scanning a sample from the database and its changes
- Lecture Week 3 (Association Analysis): this relates to the first step of association analysis, constructing the large itemsets

Outline
- Problem Descriptions and Solutions
  - Mining of Association Rules
  - Update of Association Rules
  - Scheduling Update of Association Rules
- DELI Algorithm
- Example
- Experimental Results
- Conclusion

Problem Descriptions and Solutions

Problem 1: Mining of Large Itemsets
- Given a database D of transactions and a set of possible items, find the large itemsets
- Large itemset: an itemset whose transaction support is above a pre-specified support%
- Transaction: a non-empty set of items
- Association rule: X => Y, where X and Y are itemsets
- Association rules are found by examining the large itemsets
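
To make the "large itemset" definition concrete, here is a minimal Python sketch; the transactions and the 20% minimum support are invented for illustration, not taken from the paper:

```python
# Support of an itemset: the number of transactions that contain it.
# The transactions and the 20% minimum support are made up for illustration.
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"},
    {"A", "B", "D"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]
min_support = 0.20  # pre-specified support%

def sup_count(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)

x = frozenset({"A", "B"})
count = sup_count(x, transactions)
print(f"SupCount = {count}, large = {count >= min_support * len(transactions)}")
```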

Solution: the Apriori Algorithm
- Finds the large itemsets iteratively
- At iteration k:
  - Use the large (k-1)-itemsets to find candidate itemsets of size k
  - Check which candidates have support above the pre-specified threshold and add them to the large k-itemsets
- At every iteration, it scans the database to count the transactions that contain each candidate itemset
- A large amount of time is spent in scanning the whole database
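
A compact, self-contained sketch of the Apriori iteration just described (candidate generation from the large (k-1)-itemsets, followed by a full database scan to count candidates). This is a simplified illustration of the general algorithm, not the authors' implementation; the example data at the end is made up:

```python
from itertools import combinations

def apriori(db, min_sup_frac):
    """Return {large itemset: support count} for every size k.
    db: list of transactions, each a set of items."""
    min_count = min_sup_frac * len(db)

    # Iteration 1: one scan of the database to count single items.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(large)

    k = 2
    while large:
        # Candidate generation: join large (k-1)-itemsets, then prune any
        # candidate that has a non-large (k-1)-subset (the Apriori property).
        prev = list(large)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(
                    frozenset(sub) in large for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)

        # Counting: one scan of the WHOLE database per iteration --
        # this is where most of the time is spent.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {s: c for s, c in counts.items() if c >= min_count}
        result.update(large)
        k += 1
    return result

# Made-up usage example:
db = [{"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
      {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"}]
print(apriori(db, 0.2))
```

The counting loop re-scans the entire database on every iteration, which is exactly the cost that FUP2 and DELI try to reduce.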

Problem 2: Update of Association Rules (the Association Rule Maintenance Problem)
- After some updates have been applied to a database, find the new large itemsets and their support counts in an efficient manner
- Efficient: by reusing the mining results from the old database
- All database updates are either insertions or deletions of transactions

Update of Association Rules: notation
- D: the old database
- D-: the set of deleted transactions
- D+: the set of added transactions
- D*: the set of unchanged transactions (D* = D \ D-)
- D': the updated database (D' = D* ∪ D+)

FUP2 Algorithm
- Addresses the maintenance problem
- Apriori fails to reuse the old data mining results; FUP2 reduces the amount of work that needs to be done
- FUP2 works similarly to Apriori, but for itemsets that were large in the old database it scans only the updated part of the database (D+ and D-)
- For the remaining itemsets, it scans the whole database
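
The slides give no FUP2 pseudocode; the sketch below only illustrates the counting shortcut they describe. For an itemset that was large in the old database, the new support count follows from the old count plus scans of the changed parts only, while other itemsets may still require a scan of the whole updated database. Function and variable names are my own:

```python
def new_count_for_old_large_itemset(old_count, itemset, d_plus, d_minus):
    """For an itemset that was large in the old database D (so its old count
    is known), only the changed parts need to be scanned:
    SupCount_D'(X) = SupCount_D(X) + SupCount_D+(X) - SupCount_D-(X)."""
    plus = sum(1 for t in d_plus if itemset <= t)
    minus = sum(1 for t in d_minus if itemset <= t)
    return old_count + plus - minus

def count_by_full_scan(itemset, d_prime):
    """An itemset with no old count (it was not large before) may still
    require a scan of the whole updated database D'."""
    return sum(1 for t in d_prime if itemset <= t)
```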

Problem 3: When to Update the Association Rules
- First idea: update after n transactions have been changed. Bad!
- Better: use the symmetric difference to measure how many large itemsets have been added or deleted by the database update
  - If there are too many changes, it is time to update the association rules
  - If there are only a few, the old association rules remain a good approximation for the updated database
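
One plausible way to formalize the symmetric-difference measure described above (the paper's exact normalization may differ): count the large itemsets gained or lost by the update and normalize by the size of the combined collection:

```python
def symmetric_difference_ratio(old_large, new_large):
    """Fraction of large itemsets gained or lost by the update:
    |old XOR new| / |old UNION new|  (0 = identical rule sets)."""
    old_large, new_large = set(old_large), set(new_large)
    union = old_large | new_large
    if not union:
        return 0.0
    return len(old_large ^ new_large) / len(union)

# If the ratio exceeds a user-chosen threshold, the (expensive) rule-update
# algorithm is re-run; otherwise the old rules are kept as an approximation.
```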

DELI Algorithm

Q2: Difference Estimation for Large Itemsets (DELI)
- Purpose: estimate the difference between the association rules in a database before and after it is updated
- Decides whether the association rules need to be updated
- Overview: it estimates the size of the change in the association rules by examining a sample
- Advantage: DELI saves machine resources and time

DELI Algorithm – basic notation
- Input: the old support counts, D, D+, and D-
- Output: a Boolean value indicating whether a rule update is needed
- Iterative algorithm, starting with k = 1
- Each iteration: 13 steps, which reduce to 5 logical steps

DELI Algorithm – Step 1
- Generate the candidate set C_k:
  - C_1 = I, the set of all 1-itemsets, when k = 1
  - C_k = apriori_gen(~L_(k-1)) when k > 1
- Partition C_k into P_k and Q_k
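
A sketch of Step 1 under my reading of the reconstructed notation: build C_k (all 1-itemsets when k = 1, otherwise an apriori_gen-style join over ~L_(k-1)) and split it into P_k and Q_k by membership in the old large k-itemsets. The data structures and names are assumptions:

```python
from itertools import combinations

def step1_candidates_and_partition(k, all_items, est_large_prev, old_large_k):
    """Step 1: build the candidate set C_k and split it into
    P_k (candidates that were large in the old database) and
    Q_k (candidates that were not).
    est_large_prev: ~L_(k-1), a set of frozensets; old_large_k: the old L_k."""
    if k == 1:
        c_k = {frozenset([item]) for item in all_items}
    else:
        # apriori_gen-style join over ~L_(k-1), pruned with the Apriori property.
        prev = list(est_large_prev)
        c_k = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                u = prev[i] | prev[j]
                if len(u) == k and all(
                    frozenset(s) in est_large_prev for s in combinations(u, k - 1)
                ):
                    c_k.add(u)
    p_k = c_k & set(old_large_k)
    q_k = c_k - set(old_large_k)
    return p_k, q_k
```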

DELI Algorithm – Step 2
- P_k: the itemsets of size k that were large in the old database and are potentially large in the new one
- For each itemset X ∈ P_k:
  - SupCount_D'(X) = SupCount_D(X) + SupCount_D+(X) - SupCount_D-(X)  (scan only D+ and D-)
  - If SupCount_D'(X) >= |D'| * support%, then add X to L_k^>> (the itemsets that are large in both the old and the new database)
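
Step 2 in code form, assuming the sets from Step 1: for itemsets in P_k the new count is exact, obtained from the old count and scans of D+ and D- only (the same identity used by FUP2), and itemsets that meet the new threshold go into L_k^>>. A sketch with invented names, not the paper's pseudocode:

```python
def step2_exact_counts(p_k, old_counts, d_plus, d_minus, new_db_size, s):
    """Step 2: for every X in P_k the new support count is exact,
    SupCount_D'(X) = SupCount_D(X) + SupCount_D+(X) - SupCount_D-(X),
    so only D+ and D- are scanned.  Itemsets meeting the new threshold
    |D'| * s go into L_k^>> (large in both the old and the new database)."""
    l_certain = set()
    new_counts = {}
    for x in p_k:
        plus = sum(1 for t in d_plus if x <= t)
        minus = sum(1 for t in d_minus if x <= t)
        new_counts[x] = old_counts[x] + plus - minus
        if new_counts[x] >= new_db_size * s:
            l_certain.add(x)
    return l_certain, new_counts
```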

DELI Algorithm – Step 3
- Q_k: the itemsets of size k that were not large in the old database but are potentially large in the new one
- For each itemset X ∈ Q_k:
  - If SupCount_D+(X) - SupCount_D-(X) <= (|D+| - |D-|) * support%, then delete X from Q_k
- Take a random sample S of size m from the old database D
- For each remaining itemset X ∈ Q_k:
  - Find SupCount_S(X) and obtain an interval [a, b] for SupCount_D(X) with 100(1-α)% confidence
  - Then SupCount_D'(X) ∈ [a + δ, b + δ], where δ = SupCount_D+(X) - SupCount_D-(X)
  - Reason: SupCount_D'(X) = SupCount_D(X) + SupCount_D+(X) - SupCount_D-(X)

DELI Algorithm – Step 3 (continued)
- For each itemset X ∈ Q_k:
  - Compare the estimated interval [a + δ, b + δ] for SupCount_D'(X) with |D'| * support%
  - L_k^>: itemsets that were not large in D but are large in D' with the chosen confidence
  - L_k^≈: itemsets that were not large in D and may be large in D'
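
The sampling core of Step 3, sketched with the usual normal approximation to the binomial (the paper may use a slightly different interval): estimate SupCount_D(X) from a random sample S of size m, form a 100(1-α)% interval [a, b], shift it by δ = SupCount_D+(X) - SupCount_D-(X), and compare with the threshold |D'| * s%. The classification follows the slide's description; the names and the interval formula are my assumptions:

```python
import math

Z = {0.90: 1.645, 0.95: 1.96}  # z-values for common confidence levels

def interval_from_sample(itemset, sample, db_size, confidence=0.95):
    """Normal-approximation interval [a, b] for SupCount_D(itemset),
    estimated from a simple random sample S of the old database D."""
    m = len(sample)
    p_hat = sum(1 for t in sample if itemset <= t) / m
    half = Z[confidence] * math.sqrt(p_hat * (1 - p_hat) / m)
    return db_size * (p_hat - half), db_size * (p_hat + half)

def classify_q_itemset(itemset, sample, db_size, delta, threshold,
                       confidence=0.95):
    """Step 3 classification of an itemset X in Q_k.
    delta = SupCount_D+(X) - SupCount_D-(X); threshold = |D'| * s%."""
    a, b = interval_from_sample(itemset, sample, db_size, confidence)
    lo, hi = a + delta, b + delta        # interval for SupCount_D'(X)
    if lo >= threshold:
        return "confident_large"         # goes into L_k^>
    if hi >= threshold:
        return "maybe_large"             # interval straddles the threshold: L_k^≈
    return "not_large"                   # confidently below the threshold
```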

DELI Algorithm – Step 4
- Obtain the estimated set of large itemsets of size k: ~L_k = L_k^>> ∪ L_k^> ∪ L_k^≈
  - L_k^>>: large in D and large in D' (Step 2)
  - L_k^>: not large in D, large in D' with the chosen confidence (Step 3)
  - L_k^≈: not large in D, maybe large in D' (Step 3)
- ~L_k is an approximation of the new L_k; however, misses are rare and false hits are also very rare

DELI Algorithm – Step 5
- Decide whether an association rule update is needed:
  - If the uncertainty |L_k^≈| / |~L_k| is too large, DELI halts: an update is needed
  - If the symmetric difference of the large itemsets is too large, DELI halts: an update is needed
  - If ~L_k is empty, DELI halts: no update is necessary
  - If ~L_k is non-empty, set k = k + 1 and go to Step 1
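
A sketch of the Step 5 halting test. The uncertainty ratio and the difference estimate are compared against user-chosen thresholds; the default values below are purely illustrative, not from the paper:

```python
def step5_decision(l_certain, l_confident, l_maybe, difference_estimate,
                   max_uncertainty=0.5, max_difference=0.1):
    """Return 'update', 'no_update', or 'continue'.
    l_certain    -> L_k^>>  (large in both D and D', Step 2)
    l_confident  -> L_k^>   (newly large with statistical confidence, Step 3)
    l_maybe      -> L_k^≈   (interval straddles the threshold, Step 3)
    difference_estimate: current estimate of the symmetric-difference measure.
    The two threshold defaults are illustrative only."""
    approx_large_k = l_certain | l_confident | l_maybe   # ~L_k from Step 4
    if not approx_large_k:
        return "no_update"                # ~L_k empty: old rules still fine
    uncertainty = len(l_maybe) / len(approx_large_k)
    if uncertainty > max_uncertainty:
        return "update"                   # too unsure about ~L_k: recompute rules
    if difference_estimate > max_difference:
        return "update"                   # rules have drifted too much
    return "continue"                     # k = k + 1, back to Step 1
```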

Example

Example setup: |D| = 10^6, |D-| = 9000, |D+| = 10000, s% = 2%

DELI Algorithm – Example, k = 1
1) C_1 = {A, B, C, D, E, F}; P_1 = {A, B, C, D, E}; Q_1 = {F}
2) P_1: every SupCount_D'(X) below is at least |D'| * s%, so L_1^>> = {A, B, C, D, E}

   Itemset   SupCount(D')
   A         24818
   B         31438
   C         24410
   D         27880
   E

3) Q_1: for F, SupCount_D+(F) - SupCount_D-(F) = 17, which is below (|D+| - |D-|) * s%, so F is dropped
4) ~L_1 = {A, B, C, D, E}
5) Update? No. Set k = 2 and proceed to Step 1.

DELI Algorithm – Example, k = 2
1) P_2 = {AB, AC, AD, AE, BC, BD, CD}; Q_2 = {BE, CE, DE}
2) P_2: comparing SupCount_D'(X) with |D'| * s% gives L_2^>> = {AB, AC, AD, BC, BD, CD}
3) Q_2: CE and DE are dropped. For BE: SupCount_S(BE) = 202, giving an estimate of SupCount_D(BE) and a confidence interval [a, b]; shifting by δ yields the confidence interval [17677, 23191] for SupCount_D'(BE), which straddles the threshold, so L_2^≈ = {BE}

DELI Algorithm – Example, k = 2 (continued)
4) ~L_2 = {AB, AC, AD, BC, BD, CD, BE}
5) Update? No (uncertainty = 1/7, difference = 2/15). Set k = 3 and proceed to Step 1.

k = 3: …
4) ~L_3 = {ABC, ACD, BCD}
5) Update? No (uncertainty = 0, difference = 2/15).
Returns: False (no update of the association rules is needed).

Experimental Results

- Synthetic databases: generate D, D+, and D-
- Use Apriori to find the large itemsets in D
- Invoke FUP2 to find the large itemsets in the updated database and record the time
- Run DELI and record the time
- Parameters: |D| = …, |D+| = |D-| = 5000, confidence = 95%, support% = 2%, sample size = 20000

Experimental Results (Figure 3)

Experimental Results (90% level of confidence)

Conclusion

Conclusion
- Real-world databases get updated constantly, so the knowledge extracted from them changes too
- We have to know when the change is significant
- By applying sampling techniques and statistical methods, we can efficiently determine when to update the extracted association rules
- Sampling really is useful in data mining

Final Exam Questions
Q1: Compare and contrast FUP2 and DELI
- Both algorithms are used in association analysis
- Goal: DELI decides when to update the association rules, while FUP2 provides an efficient way of updating them
- Technique: DELI scans only a small portion of the database, whereas FUP2 still scans the whole database for some itemsets
- DELI saves machine resources and time

Final Exam Questions (continued)
Q2: Difference Estimation for Large Itemsets
Q3: Difference between Apriori and FUP2
- Apriori scans the whole database to find association rules and does not use the old data mining results
- For most itemsets, FUP2 scans only the updated part of the database and takes advantage of the old association analysis results