Fast Algorithms for Association Rule Mining


Fast Algorithms for Association Rule Mining. Paper by R. Agrawal and R. Srikant. Presented by Muhammad Aurangzeb Ahmad and Nupur Bhatnagar.

Outline: Background and Motivation; Problem Definition; Major Contribution; Key Concepts; Validation; Assumptions; Possible Revisions.

Background & Motivation. Basket data: a collection of records, each consisting of a transaction identifier and the items bought in that transaction. The goal is to mine associations among items in a large database of sales transactions, i.e., to predict the occurrence of an item based on the occurrences of other items in the same transaction.

Terms and Notations. Items: I = {i1, i2, …, im}. Transaction T: a set of items such that T ⊆ I; items within a transaction are sorted lexicographically. TID: a unique identifier for each transaction. Association rule: an implication X → Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.

Terms and Notations. Confidence: a rule X → Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. Support: a rule X → Y has support s if s% of the transactions in D contain X ∪ Y. Large itemsets: itemsets whose support is at least the minimum support threshold (minsup); all other itemsets are called small. Candidate itemsets: itemsets generated from a seed set of itemsets found to be large in the previous pass; candidates meeting the minsup threshold become large itemsets, and rules over them are retained only if they meet the minconf threshold.
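These definitions can be sketched directly in code; the toy transactions below are assumed purely for illustration:

```python
# Toy basket data: each transaction is a set of item IDs (illustrative only).
transactions = [
    {1, 3, 4},
    {2, 3, 5},
    {1, 2, 3, 5},
    {2, 5},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule X -> Y: support(X union Y) / support(X)."""
    joint = set(antecedent) | set(consequent)
    return support(joint, transactions) / support(antecedent, transactions)
```

For example, {2, 5} appears in three of the four transactions (support 0.75), and two of those three also contain item 3, so the rule {2, 5} → {3} has confidence 2/3.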

Problem Definition. Input: a set of transactions D. Objective: generate all association rules whose support and confidence exceed the user-specified minimum support and minimum confidence, and minimize computation time by pruning. Constraint: items are kept in lexicographic order. Example association rules: {Diaper} → {Beer}; {Milk, Bread} → {Eggs, Coke}; {Beer, Bread} → {Milk}. Real-world applications: NCR (Teradata) performs association rule mining for more than 20 large retail organizations, including Walmart; the technique is also used for pattern discovery in biological databases.

Major Contribution. Proposed two new algorithms for fast association rule mining, Apriori and AprioriTid, along with a hybrid of the two (AprioriHybrid). Empirical evaluation of the proposed algorithms against contemporary algorithms. Completeness: the algorithms find all rules.

Related Work: SETM and AIS. The major difference lies in candidate itemset generation. In pass k, these algorithms read a database transaction t and determine which of the large itemsets in Lk-1 are present in t. Each such large itemset l is then extended with every large item that is present in t and occurs later in the lexicographic ordering than any item in l. Result: many candidate itemsets are generated that are later discarded.

Key Concepts: Support and Confidence. Why do we need support and confidence? Given a rule X → Y: support determines how often the rule applies to a given data set, while confidence determines how frequently items in Y appear in transactions that contain X. A rule with low support may occur by chance, and such rules tend to be uninteresting from a business perspective. Confidence measures the reliability of the inference made by a rule.

Key Concepts: the Association Rule Mining Problem. Given a set of transactions D, find all rules having support ≥ minsup and confidence ≥ minconf. The problem decomposes into two subproblems: 1. Frequent itemset generation: find all itemsets having transaction support above the minimum support; these are called the frequent (large) itemsets. 2. Rule generation: use the large itemsets to generate high-confidence rules extracted from the frequent itemsets found in the previous step.
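The second subproblem, rule generation, can be sketched as below. The `frequent` dictionary of support counts is illustrative data, not taken from the paper; by the Apriori property it contains every subset of each large itemset, so the confidence of any rule can be computed from two lookups:

```python
from itertools import combinations

def generate_rules(frequent, minconf):
    """Generate rules X -> (S - X) from each large itemset S.
    frequent: dict mapping frozenset itemsets to support counts; it must
    contain every subset of each large itemset (the Apriori property
    guarantees this). Returns (antecedent, consequent, confidence) triples."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue  # rules need a non-empty antecedent and consequent
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                ante = frozenset(ante)
                conf = count / frequent[ante]
                if conf >= minconf:
                    rules.append((ante, itemset - ante, conf))
    return rules

# Hypothetical large itemsets with support counts, for illustration only.
frequent = {
    frozenset([2]): 3, frozenset([3]): 3, frozenset([5]): 3,
    frozenset([2, 3]): 2, frozenset([2, 5]): 3, frozenset([3, 5]): 2,
    frozenset([2, 3, 5]): 2,
}
```

With minconf = 1.0, this toy input yields exactly four rules, e.g. {2, 3} → {5} and {3, 5} → {2}.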

Frequent Itemset Generation: Apriori. Apriori principle: given an itemset I = {a, b, c, d, e}, if an itemset is frequent, then all of its subsets must also be frequent; conversely, if an itemset is infrequent, then all of its supersets are infrequent.

Frequent Itemset Generation: Apriori. Apriori principle: if {c, d, e} is frequent, then all of its subsets must also be frequent.

Frequent Itemset Generation: Apriori. Apriori principle, applied to candidate pruning: if {a, b} is infrequent, then all of its supersets are infrequent and can be pruned.

Key Concepts: Frequent Itemset Generation with the Apriori Algorithm. Input: the market basket transaction dataset. Process: determine the large 1-itemsets; then repeat until no new large itemsets are identified: generate length-(k+1) candidate itemsets from the length-k large itemsets, prune candidates that contain a small k-subset, count the support of each remaining candidate, and eliminate candidates that are small. Output: all itemsets that are large, i.e., that meet the minimum support threshold.
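The steps above can be sketched in Python. This is a simplified sketch, not the paper's optimized implementation: the hash-tree structure used for fast subset counting is replaced by direct set containment checks:

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {itemset: support_count} for all large itemsets.
    transactions: list of sets of item IDs; minsup_count: absolute support."""
    # Pass 1: count items and keep the large 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: c for s, c in counts.items() if c >= minsup_count}
    k_large = set(large)
    k = 1
    while k_large:
        k += 1
        # Candidate generation: join large (k-1)-itemsets, then prune
        # any candidate that has a small (k-1)-subset.
        candidates = set()
        for a in k_large:
            for b in k_large:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in k_large
                    for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Count candidate support in one pass over the database.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        k_large = {c for c, n in counts.items() if n >= minsup_count}
        large.update({c: counts[c] for c in k_large})
    return large
```

On the four-transaction toy database used elsewhere in this deck, with a support count of 2, this finds four large 1-itemsets, four large 2-itemsets, and the single large 3-itemset {2, 3, 5}.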

Apriori example (minimum support = two transactions): generate candidate 1-itemsets and prune, then candidate 2-itemsets and prune, then candidate 3-itemsets.

Apriori Candidate Generation. Given the large k-itemsets, generate (k+1)-itemset candidates in two steps. Join step: join the set of large k-itemsets with itself, with the join condition that the first k-1 items of the two itemsets are the same. Prune step: delete all candidates that have a non-frequent k-subset. Example: if the candidates are {{1 3 5}, {2 3 5}} and {1 5} is not large, pruning leaves only {{2 3 5}}.
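A minimal sketch of the two steps (the name `apriori_gen` follows the paper, but the implementation here is a simplification):

```python
from itertools import combinations

def apriori_gen(prev_large):
    """Candidate (k+1)-itemsets from the set of large k-itemsets.
    prev_large: set of frozensets, all of the same size k."""
    k = len(next(iter(prev_large)))
    prev = [tuple(sorted(s)) for s in prev_large]
    candidates = set()
    # Join step: merge two itemsets that agree on their first k-1 items.
    for a in prev:
        for b in prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(frozenset(a + (b[-1],)))
    # Prune step: drop candidates having a k-subset that is not large.
    return {
        c for c in candidates
        if all(frozenset(s) in prev_large for s in combinations(c, k))
    }
```

Applied to the large 2-itemsets of the toy database, {1 3}, {2 3}, {2 5}, and {3 5}, the join combines {2 3} and {2 5} into {2 3 5}, and pruning keeps it because all of its 2-subsets are large.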

AprioriTid. Uses the same candidate generation function as Apriori, but does not use the database for counting support after the first pass. Instead, it uses an encoding of the candidate itemsets from the previous pass, which saves reading effort.

AprioriTid example (support count = 2):

Database:
TID 100: 1 3 4
TID 200: 2 3 5
TID 300: 1 2 3 5
TID 400: 2 5

C^1 (set-of-itemsets per TID):
100: {1}, {3}, {4}
200: {2}, {3}, {5}
300: {1}, {2}, {3}, {5}
400: {2}, {5}

L1: {1}:2, {2}:3, {3}:3, {5}:3

C2 (candidates with supports): {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2

C^2:
100: {1 3}
200: {2 3}, {2 5}, {3 5}
300: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
400: {2 5}

L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}

C^3: 200: {2 3 5}; 300: {2 3 5}

L3: {2 3 5}:2
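One AprioriTid pass can be sketched as below. The function name `aprioritid_pass` and the dict-based encoding are illustrative simplifications of the paper's C-bar structure; the key idea shown is that a candidate is contained in a transaction exactly when its two generator (k-1)-subsets appear in that transaction's entry:

```python
def aprioritid_pass(ck_candidates, prev_entries):
    """One AprioriTid pass: count candidate supports using the encoded
    set from the previous pass (TID -> set of (k-1)-itemsets present in
    that transaction) instead of rereading the database.
    Returns (counts, new_entries), where new_entries encodes C^k."""
    counts = {c: 0 for c in ck_candidates}
    new_entries = {}
    for tid, itemsets in prev_entries.items():
        present = set()
        for c in counts:
            # Dropping the last item and the second-to-last item gives the
            # two (k-1)-subsets whose union is c; both present in the
            # transaction implies the transaction contains all of c.
            items = sorted(c)
            s1 = frozenset(items[:-1])
            s2 = frozenset(items[:-2] + items[-1:])
            if s1 in itemsets and s2 in itemsets:
                counts[c] += 1
                present.add(c)
        if present:
            new_entries[tid] = present  # transactions with no candidates vanish
    return counts, new_entries
```

Feeding it C^2 from the example above with the single candidate {2 3 5} reproduces C^3: only TIDs 200 and 300 survive, giving a support count of 2.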

AprioriTid: Analysis. Advantages: if a transaction does not contain any candidate k-itemsets, then C^k has no entry for that transaction; for large k, each entry may be smaller than the transaction, because very few candidates may be present in it. Disadvantages: for small k, each entry may be larger than the corresponding transaction, since an entry includes all candidate k-itemsets contained in the transaction.

AprioriHybrid: uses Apriori in the initial passes and switches to AprioriTid when it expects that the candidate set C^k at the end of the pass will fit in memory.

Validation: Computer Experiments. Parameters for synthetic data generation: |D| = number of transactions; |T| = average transaction size; |I| = average size of the maximal potentially large itemsets; |L| = number of maximal potentially large itemsets; N = number of items. Parameter settings: six synthetic data sets.
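A toy stand-in for such a generator is sketched below. Note the simplifying assumptions: the paper draws transaction sizes from a Poisson distribution and fills transactions from potentially large itemsets, whereas this sketch uses a Gaussian size and uniformly random items, so it is only illustrative of the parameterization:

```python
import random

def synthetic_baskets(n_transactions, avg_size, n_items, seed=0):
    """Generate toy basket data: `n_transactions` transactions of roughly
    `avg_size` items each, drawn uniformly from `n_items` distinct items.
    A deliberately simplified stand-in for the paper's generator."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    data = []
    for _ in range(n_transactions):
        size = max(1, int(rng.gauss(avg_size, 1)))
        data.append(set(rng.sample(range(n_items), min(size, n_items))))
    return data
```

Varying the arguments mimics varying |D|, |T|, and N in the experiments.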

Results: Execution Time. Apriori always outperforms AIS and SETM; SETM's execution times were too large to show on the same graphs. Apriori is better than AprioriTid for large transaction sets.

Results: Analysis. AprioriTid uses C^k instead of the database. When C^k fits in memory, AprioriTid is faster than Apriori; when C^k is too big to fit in memory, the computation time is much longer, so Apriori is faster than AprioriTid.

Results: Execution Time of AprioriHybrid. The graphs show that AprioriHybrid performs better than Apriori in almost all cases.

Scale-up Experiments. AprioriHybrid scales up as the number of transactions increases from 100,000 to 10 million (minimum support 0.75%). It also scales up as the average transaction size increases; this experiment was done to see the effect on the data structures independent of the physical database size and the number of large itemsets.

Results: The Apriori algorithms are better than SETM and AIS. The algorithms perform their best when combined (AprioriHybrid). The algorithms show good results in scale-up experiments.

Validation Methodology: Strengths and Weaknesses. Strength: the authors use substantial basket data to guide the design of fast algorithms for association rule mining. Weakness: only synthetic data sets are used for validation; the data may be too artificial to give valuable information about real-world datasets.

Assumptions. A synthetic dataset is used; it is assumed that the algorithms' performance on synthetic data is indicative of their performance on real-world datasets. All items in the data are in lexicographic order. All data is assumed to be categorical. It is also assumed that all data resides at the same site or in the same table, with no cases requiring joins.

Possible Revisions. Real-world datasets should be used for the experiments. The number of large itemsets can grow exponentially with large databases, so a modified representation structure is needed that captures just a subset of the candidate large itemsets. Limitations of the support-confidence framework: support pruning can eliminate potentially interesting patterns involving low-support items, and confidence ignores the support of the itemset in the rule consequent. Improvement: an interestingness (lift) measure that computes the ratio between the rule's confidence and the support of the rule consequent: lift(A → B) = conf(A → B) / s(B) = s(A ∪ B) / (s(A) · s(B)). The effect of skewed support distributions is a further concern.
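Given supports as fractions, the lift measure is a one-line computation (a trivial sketch):

```python
def lift(s_ab, s_a, s_b):
    """Interest (lift) of a rule A -> B: s(A union B) / (s(A) * s(B)).
    Values > 1 suggest positive correlation, < 1 negative correlation,
    and a value near 1 suggests A and B are independent."""
    return s_ab / (s_a * s_b)
```

For instance, two items that each appear in half the transactions and co-occur in a quarter of them have lift 1, i.e., no correlation beyond independence.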

Questions?