M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining Association Rule Mining March 5, 2009.

Presentation transcript:

Slide 1: COMP527: Data Mining
Association Rule Mining
M. Sulaiman Khan, Dept. of Computer Science, University of Liverpool
March 5, 2009

Slide 2: COMP527: Data Mining (course outline)
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam

Slide 3: Today's Topics
Introduction to Association Rule Mining (ARM)
General Issues
Support
Confidence
Lift
Conviction
Complexity!
Frequent Itemsets

Slide 4: Introduction
We've spent a long time looking at various classification methods, but there's more to data mining than classification. Given a data set with no classes, just attributes, what might we want to do with it?
Association Rule Mining: Find patterns in the attribute values between instances. Instead of predicting an unknown value, we want to find interesting facts about the relationships between the known values.

Slide 5: Introduction
In ARM, these patterns take the form of rules about the co-occurrence of attributes. The easiest example to use is market basket analysis: finding patterns of things that are bought together in a supermarket.
Shopping at a supermarket, you typically buy many things together (as opposed to shopping for a television, say). Perhaps 30 different items. Under 10 items is pretty rare.
By comparing your shopping habits over time, the supermarket can learn about you and how best to make you spend more money, increasing its profits. It can also compare all shoppers' habits to find general rules, hopefully rules that show how to increase profits.

Slide 6: Introduction
Basket1: bread, butter, jam
Basket2: bread, butter
Basket3: bread, butter, milk
Basket4: beer, bread
Basket5: beer, milk
What can we find from this?
Some simple statistics: bread occurs 80% of the time. Butter appears 60% of the time.
Less simple: 100% of baskets containing butter also contain bread. 100% of baskets containing butter and jam also contain bread.

Slide 7: Finding Rules
Basket1: bread, butter, jam
Basket2: bread, butter
Basket3: bread, butter, milk
Basket4: beer, bread
Basket5: beer, milk
if (butter, jam) then bread
if butter then bread
if bread then butter
To find rules, we find sets of items which occur together. The more frequently they occur together, the better our rule is. There are some particular factors involved in determining the 'goodness' of a rule...

Slide 8: Support
Basket1: bread, butter, jam
Basket2: bread, butter
Basket3: bread, butter, milk
Basket4: beer, bread
Basket5: beer, milk
Support: Percentage of baskets in which the item(s) occur.
bread: 80%, butter: 60%, (bread, butter): 60%...
So the support for a rule X => Y is the percentage of instances which contain both X and Y.
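The support figures above can be checked with a few lines of Python. This is only a minimal sketch: the names `BASKETS` and `support` are made up for illustration and do not come from the slides.

```python
# The five example baskets from the slide, stored as sets of items.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


print(support({"bread"}, BASKETS))            # 0.8
print(support({"butter"}, BASKETS))           # 0.6
print(support({"bread", "butter"}, BASKETS))  # 0.6
```

Storing each basket as a set makes the "contains all of these items" check a single subset test.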

Slide 9: Confidence
Basket1: bread, butter, jam
Basket2: bread, butter
Basket3: bread, butter, milk
Basket4: beer, bread
Basket5: beer, milk
We also need a confidence for each rule: how strongly we believe that rule to be true.
Here, butter => bread is true 100% of the time, but bread => butter holds for only 3 of the 4 baskets that contain bread, so it is true 75% of the time.
Confidence for X => Y is the number of instances that contain both X and Y divided by the number of instances that contain X.
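As a rough sketch of the confidence definition above, the function below divides the number of baskets containing both sides of the rule by the number containing the left-hand side. The names `BASKETS` and `confidence` are illustrative, not from the slides.

```python
# The five example baskets from the slide.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]


def confidence(antecedent, consequent, baskets):
    """Confidence of antecedent => consequent, computed from raw counts."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_x = [b for b in baskets if antecedent <= b]
    if not with_x:
        raise ValueError("antecedent never occurs, confidence is undefined")
    with_both = [b for b in with_x if consequent <= b]
    return len(with_both) / len(with_x)


print(confidence({"butter"}, {"bread"}, BASKETS))  # 1.0
print(confidence({"bread"}, {"butter"}, BASKETS))  # 0.75
```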

Slide 10: Rule Mining
Basket1: bread, butter, jam
Basket2: bread, butter
Basket3: bread, butter, milk
Basket4: beer, bread
Basket5: beer, milk
ARM algorithms have a minimum threshold for both support and confidence and discard any rules below those thresholds.
For example, jam => (butter, bread) has 100% confidence but only 20% support, because jam, butter and bread occur together only once.
On the other hand, butter => bread has 60% support and 100% confidence, a much more interesting rule to us.
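The thresholding step can be sketched as follows. The threshold values (40% support, 80% confidence) are assumptions picked purely for illustration, since the slides do not give any; the helper names `count` and `passes` are likewise hypothetical.

```python
# The five example baskets from the slide.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]

# Hypothetical thresholds, chosen only for this example.
MIN_SUPPORT = 0.4
MIN_CONFIDENCE = 0.8


def count(itemset, baskets):
    """Number of baskets containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= b)


def passes(antecedent, consequent, baskets, min_support, min_confidence):
    """True if antecedent => consequent meets both thresholds."""
    seen = count(antecedent, baskets)
    if seen == 0:
        return False  # the rule never applies
    both = count(set(antecedent) | set(consequent), baskets)
    rule_support = both / len(baskets)
    rule_confidence = both / seen
    return rule_support >= min_support and rule_confidence >= min_confidence


# jam => (butter, bread): 100% confidence but only 20% support.
print(passes({"jam"}, {"butter", "bread"}, BASKETS, MIN_SUPPORT, MIN_CONFIDENCE))  # False
# butter => bread: 60% support and 100% confidence.
print(passes({"butter"}, {"bread"}, BASKETS, MIN_SUPPORT, MIN_CONFIDENCE))         # True
```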

Slide 11: Lift
Confidence and support are necessary but not sufficient to find interesting rules.
Suppose that X => Y has a confidence of 60%: s(X+Y) / s(X) = 0.6
Sure, that looks interesting... there's a correlation between buying X and buying Y. But what if the probability of Y was 70% overall? Then if you buy X, you're less likely than normal to buy Y... certainly not what the rule is implying!

Slide 12: Lift
Lift is measured in terms of support: s(X+Y) / (s(X) * s(Y))
This takes into account the likelihood of Y, and penalises 'obvious' rules where both X and Y are common.
For example, bread => milk... if 90% of baskets contain bread and 85% of baskets contain milk, then the lowest support that bread => milk could have is 75%.
(In that worst case, 10% of baskets contain milk but no bread and 15% contain bread but no milk, so 75% must contain both. The maximum is 85%, where all baskets with milk also have bread, 5% have just bread and 10% have neither.)
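The 75% and 85% figures follow from a simple counting argument, which can be sketched in code. The function name `joint_support_bounds` is hypothetical, and integer percentages are used only to keep the arithmetic exact.

```python
def joint_support_bounds(pct_x, pct_y):
    """Lowest and highest possible support (in percent) for X and Y together.

    pct_x and pct_y are the supports of X and Y as whole percentages.
    """
    # The baskets missing X and the baskets missing Y overlap as little as
    # possible in the worst case, so at most (100 - pct_x) + (100 - pct_y)
    # percent of baskets lack one of the two items.
    lowest = max(0, pct_x + pct_y - 100)
    # The two items cannot co-occur more often than the rarer one occurs.
    highest = min(pct_x, pct_y)
    return lowest, highest


print(joint_support_bounds(90, 85))  # (75, 85)
```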

Slide 13: Lift
Lift: s(X+Y) / (s(X) * s(Y))
If the support for X is 0.25, Y is 0.7, and X+Y is 0.15, then we have:
0.15 / (0.25 * 0.7) = 0.857
Because this is less than 1, there is a negative correlation.
For the bread and milk example:
0.75 / (0.85 * 0.90) = 0.980 => Negative lift
0.85 / (0.85 * 0.90) = 1.111 => Positive lift
The break-even point is a lift of 1, reached when s(X+Y) = s(X) * s(Y) (here 0.85 * 0.90 = 0.765).
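The lift arithmetic can be reproduced with a short sketch. The function name `lift` is illustrative; the 0.75 and 0.85 joint supports are the worst and best cases from the bread and milk example.

```python
def lift(s_xy, s_x, s_y):
    """Lift of X => Y from the joint support and the two individual supports."""
    return s_xy / (s_x * s_y)


# X has support 0.25, Y has 0.7, and they co-occur in 0.15 of baskets.
print(f"{lift(0.15, 0.25, 0.70):.3f}")  # 0.857, below 1: negative correlation

# Bread (0.90) and milk (0.85) at the lowest and highest possible joint support.
print(f"{lift(0.75, 0.90, 0.85):.3f}")  # 0.980
print(f"{lift(0.85, 0.90, 0.85):.3f}")  # 1.111
```

A lift of exactly 1 means X and Y co-occur as often as they would if they were independent.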

Slide 14: Conviction
We can express this purely in terms of baskets that contain A but not B.
"if A then B" implies "not (A and not B)"
So the formula for conviction is: s(A) * s(not B) / s(A and not B)
If every basket that contains A also contains B, the denominator will be 0. Splat. (Treat the result as infinite.)
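A minimal sketch of the conviction formula, applied to the five example baskets used earlier in the lecture. The function and variable names are made up for illustration, and the zero-denominator case is returned as infinity as the slide suggests.

```python
import math

# The five example baskets from the earlier slides.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]


def conviction(antecedent, consequent, baskets):
    """s(A) * s(not B) / s(A and not B), computed from counts."""
    a, b = set(antecedent), set(consequent)
    n = len(baskets)
    n_a = sum(1 for t in baskets if a <= t)
    n_not_b = sum(1 for t in baskets if not b <= t)
    n_a_not_b = sum(1 for t in baskets if a <= t and not b <= t)
    if n_a_not_b == 0:
        # No basket breaks the rule (this also covers an antecedent that
        # never occurs), so the ratio is treated as infinite.
        return math.inf
    # (n_a/n) * (n_not_b/n) / (n_a_not_b/n) simplifies to the line below.
    return (n_a * n_not_b) / (n * n_a_not_b)


print(conviction({"bread"}, {"butter"}, BASKETS))  # 1.6
print(conviction({"butter"}, {"bread"}, BASKETS))  # inf
```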

Slide 15: Other Evaluation Metrics

Slide 16: Back to Rule Mining
The most common approach to finding rules is:
1. Find sets of 2 or more attributes that occur together in more instances than a minimum support threshold.
2. Generate rules from those sets.
The most important thing to note is that any subset of a frequent item set is also frequent. If (bread, milk, butter, beer) is frequent, then (bread, butter, beer) is also frequent because it must occur at least as often as the full set.
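Step 2 above, generating rules from a frequent set, can be sketched by trying every way of splitting the set into a left-hand and a right-hand side. This is one straightforward way to do it, not necessarily the procedure used later in the course; the names and the 80% confidence threshold are assumptions.

```python
from itertools import combinations

# The five example baskets from the earlier slides.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]


def count(itemset, baskets):
    """Number of baskets containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= b)


def rules_from_itemset(itemset, baskets, min_confidence):
    """All rules A => (itemset - A) whose confidence meets the threshold."""
    itemset = frozenset(itemset)
    whole = count(itemset, baskets)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            seen = count(antecedent, baskets)
            if seen and whole / seen >= min_confidence:
                rules.append((set(antecedent), set(itemset - antecedent), whole / seen))
    return rules


# Hypothetical confidence threshold of 80%.
for lhs, rhs, conf in rules_from_itemset({"bread", "butter"}, BASKETS, 0.8):
    print(lhs, "=>", rhs, conf)  # {'butter'} => {'bread'} 1.0
```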

Slide 17: Naïve Approach
No problem. The algorithm is obvious: count every possible itemset across all transactions.
If our transactions are: BC, BD, AC, BCD, ABD, ABCD
We count: AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD
Uhh... And when you have as many different items as a supermarket?? Say 100,000 different products?
Ignoring the empty set and the single-item sets, that's 2^100,000 - 100,001 itemsets. You want to know how many that is?
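A small sketch of the naive approach on the six transactions above, together with the count of candidate itemsets for a given number of items. The function names are hypothetical. The size of the 100,000-item case is estimated with a logarithm rather than printed, since the number itself is far too long to be useful.

```python
import math
from itertools import combinations

# The six example transactions from the slide.
TRANSACTIONS = [set("BC"), set("BD"), set("AC"), set("BCD"), set("ABD"), set("ABCD")]


def naive_counts(transactions):
    """Count every itemset of two or more items by brute force."""
    items = sorted(set().union(*transactions))
    counts = {}
    for size in range(2, len(items) + 1):
        for candidate in combinations(items, size):
            counts["".join(candidate)] = sum(
                1 for t in transactions if set(candidate) <= t
            )
    return counts


def candidate_count(n_items):
    """Number of itemsets, excluding the empty set and the single-item sets."""
    return 2 ** n_items - n_items - 1


counts = naive_counts(TRANSACTIONS)
print(len(counts), candidate_count(4))  # 11 11
print(counts["BD"], counts["ABC"])      # 4 1

# Approximate number of decimal digits in 2^100,000.
print(math.floor(100_000 * math.log10(2)) + 1)  # 30103
```

With four items there are only 11 candidates, but each extra item roughly doubles the number, which is why the brute-force count is hopeless at supermarket scale.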

Slide 18: Naïve Approach: BAD!!!

Slide 19: Frequent Itemsets
Let's not try to work out the support for all possible combinations.
Subsets of frequent itemsets are frequent: all subsets of a set that meets the minimum support will also necessarily meet the minimum support. Conversely, if we know a set is infrequent, any superset of it must also be infrequent.
So, instead of trying all combinations, we'll generate itemsets of a particular size and scan the database to see which of them meet the support threshold.
We know that any subsets of frequent sets are also frequent and that supersets of infrequent sets are also infrequent, so we don't need to check them.
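The level-by-level idea can be sketched roughly as below. This is a simplified illustration of pruning by the subset property, not a faithful or efficient implementation of any particular published algorithm. The six transactions are the ones from the naive-approach example, and the minimum count of 3 is an assumption made for the example.

```python
from itertools import combinations

# The six example transactions from the naive-approach slide.
TRANSACTIONS = [set("BC"), set("BD"), set("AC"), set("BCD"), set("ABD"), set("ABCD")]


def frequent_itemsets(transactions, min_count):
    """Level-wise search: only extend itemsets that were frequent at the last level."""
    items = sorted(set().union(*transactions))
    current = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while current:
        # One scan of the data to count the candidates of this size.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        size += 1
        # Build larger candidates by joining pairs of surviving itemsets.
        candidates = {a | b for a in survivors for b in survivors if len(a | b) == size}
        # Prune any candidate that has an infrequent subset one item smaller.
        current = [
            c for c in candidates
            if all(frozenset(s) in survivors for s in combinations(c, size - 1))
        ]
    return frequent


# Hypothetical threshold: an itemset must appear in at least 3 transactions.
result = frequent_itemsets(TRANSACTIONS, min_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), n)
```

With this threshold the candidate BCD is never counted, because its subset CD is already known to be infrequent.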

Slide 20: Itemset Lattice
[Figure: itemset lattice with an infrequent itemset marked and its supersets pruned. Lattice borrowed from MSU.]

Slide 21: A Priori
The algorithm that does this is called A Priori (usually written Apriori in the literature), and most other ARM techniques are based on it.
We will look at it in more detail next week.

Slide 22: Issues with WEKA and ARM
ARFF is a horrible, horrible format for ARM. Most datasets are very sparse, with each attribute simply being present or not present: Bread 0/1, Milk 0/1, etc. We want to record this as {bread, milk, cheese}, not as a huge table of 1s and 0s.
Weka doesn't include many ARM algorithms... In fact it has three; thankfully, one is A Priori.
The book doesn't include much information, but Dunham has good coverage.
We'll also look at some other ARM applications built by Frans Coenen and Paul Leng here at Liverpool.
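The difference between the two representations can be sketched in plain Python. The attribute names and rows below are made-up illustration data, and the sketch does not read or write ARFF or call Weka.

```python
# Hypothetical attribute list and 0/1 rows, in the style of a dense table.
ATTRIBUTES = ["bread", "milk", "cheese", "beer"]
ROWS = [
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]


def row_to_itemset(row, attributes):
    """Keep only the attributes that are present (value 1) in this row."""
    return {name for name, value in zip(attributes, row) if value == 1}


itemsets = [row_to_itemset(row, ATTRIBUTES) for row in ROWS]
print(itemsets[0] == {"bread", "milk", "cheese"})  # True
print(itemsets[1] == {"beer"})                     # True
```

With thousands of attributes and only a handful present per basket, the itemset form stores just those few names instead of a long row of mostly zeros.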

Slide 23: Further Reading
Witten 4.5
Dunham 6.1, 6.2
Han 5.1
Berry and Browne
Berry and Linoff Chapter 9
Zhang, Association Rule Mining, Chapter 1, 2.1, 2.2
Pal and Mitra, 8.3