COMP527: Data Mining
Association Rule Mining
M. Sulaiman Khan, Dept. of Computer Science, University of Liverpool
March 5, 2009
Introduction to the Course; Introduction to Data Mining; Introduction to Text Mining; General Data Mining Issues; Data Warehousing; Classification: Challenges, Basics; Classification: Rules; Classification: Trees; Classification: Trees 2; Classification: Bayes; Classification: Neural Networks; Classification: SVM; Classification: Evaluation; Classification: Evaluation 2; Regression, Prediction; Input Preprocessing; Attribute Selection; Association Rule Mining; ARM: A Priori and Data Structures; ARM: Improvements; ARM: Advanced Techniques; Clustering: Challenges, Basics; Clustering: Improvements; Clustering: Advanced Algorithms; Hybrid Approaches; Graph Mining, Web Mining; Text Mining: Challenges, Basics; Text Mining: Text-as-Data; Text Mining: Text-as-Language; Revision for Exam
Today's Topics
Introduction to Association Rule Mining (ARM); General Issues; Support; Confidence; Lift; Conviction; Complexity!; Frequent Itemsets
Introduction
We've spent a long time looking at various classification methods, but there's more to data mining than classification. Given a data set with no classes, just attributes, what might we want to do with it? Association Rule Mining: find patterns in the attribute values between instances. Instead of predicting an unknown value, we want to find interesting facts about the relationships between the known values.
Introduction
In ARM, these patterns take the form of rules about the co-occurrence of attributes. The easiest example to use is market basket analysis -- finding patterns of things that are bought together in a supermarket. Shopping at a supermarket, you typically buy many things together (as opposed to shopping for a television, say). Perhaps 30 different items; under 10 items is pretty rare. By comparing your shopping habits over time, the supermarket can learn about you and how best to make you spend more money, increasing their profits. They can also compare all shoppers' habits to find general rules, hopefully ones that show how to increase profits.
Introduction
Basket 1: bread, butter, jam
Basket 2: bread, butter
Basket 3: bread, butter, milk
Basket 4: beer, bread
Basket 5: beer, milk
What can we find from this? Some simple statistics: bread occurs 80% of the time; butter appears 60% of the time. Less simple: 100% of baskets containing butter also contain bread, and 100% of baskets containing butter and jam also contain bread.
Finding Rules
Basket 1: bread, butter, jam
Basket 2: bread, butter
Basket 3: bread, butter, milk
Basket 4: beer, bread
Basket 5: beer, milk
if (butter and jam) then bread
if butter then bread
if bread then butter
To find rules we find sets of items which occur together. The more frequently they occur, the better our rule is. There are some particular factors involved in determining the 'goodness' of a rule...
Support
Basket 1: bread, butter, jam
Basket 2: bread, butter
Basket 3: bread, butter, milk
Basket 4: beer, bread
Basket 5: beer, milk
Support: the percentage of baskets in which the item(s) occur. bread: 80%, butter: 60%, (bread and butter): 60%... So the support for a rule X => Y is the percentage of instances which contain both X and Y.
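A minimal sketch of the calculation (Python, not from the slides; the basket list and the support function name are just for illustration):

```python
# Support of a rule X => Y: the fraction of baskets containing every item of X and Y.
baskets = [{"bread", "butter", "jam"}, {"bread", "butter"},
           {"bread", "butter", "milk"}, {"beer", "bread"}, {"beer", "milk"}]

def support(X, Y, baskets):
    both = set(X) | set(Y)
    return sum(both <= b for b in baskets) / len(baskets)

print(support({"butter"}, {"bread"}, baskets))   # 0.6
```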
Confidence
Basket 1: bread, butter, jam
Basket 2: bread, butter
Basket 3: bread, butter, milk
Basket 4: beer, bread
Basket 5: beer, milk
We also need a confidence for each rule -- how strongly we believe that rule to be true. Here, butter => bread is true 100% of the time, but bread => butter is only true for 3 of the 4 baskets that contain bread, so true 75% of the time. The confidence for X => Y is the number of instances that contain both X and Y divided by the number of instances that contain X.
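Continuing the sketch above (reusing the illustrative baskets list), confidence divides by the count of baskets containing the left-hand side:

```python
def confidence(X, Y, baskets):
    containing_x = [b for b in baskets if set(X) <= b]
    return sum(set(Y) <= b for b in containing_x) / len(containing_x)

print(confidence({"butter"}, {"bread"}, baskets))  # 1.0
print(confidence({"bread"}, {"butter"}, baskets))  # 0.75
```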
Rule Mining
Basket 1: bread, butter, jam
Basket 2: bread, butter
Basket 3: bread, butter, milk
Basket 4: beer, bread
Basket 5: beer, milk
ARM algorithms have a minimum threshold for both support and confidence, and discard any rules below those thresholds. For example, jam => (butter and bread) has 100% confidence but only 20% support, because jam, butter and bread occur together in only one basket. On the other hand, butter => bread has 60% support and 100% confidence -- a much more interesting rule to us.
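A sketch of the thresholding step, assuming the support and confidence helpers and the baskets list sketched above, with illustrative threshold values:

```python
MIN_SUPPORT, MIN_CONFIDENCE = 0.3, 0.8   # illustrative thresholds

candidate_rules = [({"jam"}, {"butter", "bread"}),   # 100% confidence, 20% support
                   ({"butter"}, {"bread"})]          # 100% confidence, 60% support

kept = [(X, Y) for X, Y in candidate_rules
        if support(X, Y, baskets) >= MIN_SUPPORT
        and confidence(X, Y, baskets) >= MIN_CONFIDENCE]
print(kept)   # only butter => bread survives
```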
Lift
Confidence and support are necessary but not sufficient to find interesting rules. Suppose that X => Y has a confidence of 60%: s(X+Y) / s(X) = 0.6. Sure, that looks interesting... there's a correlation between buying X and buying Y. But what if the probability of Y was 70% overall? Then if you buy X, you're less likely than normal to buy Y... certainly not what the rule is implying!
Lift
Lift is measured in terms of support: lift(X => Y) = s(X+Y) / (s(X) * s(Y)). This takes the likelihood of Y into account, and penalises 'obvious' rules where both X and Y are common. For example bread => milk: if 90% of baskets contain bread and 85% of baskets contain milk, then the worst that bread => milk could do is 75% support. (In the worst case, the 10% of baskets without bread all contain milk and the 15% without milk all contain bread, so at least 75% must contain both. The maximum is 85%, where every basket with milk also has bread, 5% have just bread and 10% have neither.)
Lift
lift(X => Y) = s(X+Y) / (s(X) * s(Y))
If the support for X is 0.25, for Y is 0.7 and for X+Y is 0.15, then we have 0.15 / (0.25 * 0.7) = 0.857. Because this is less than 1, there is a negative correlation between X and Y.
0.75 / (0.85 * 0.90) = 0.980 => negative lift
0.85 / (0.85 * 0.90) = 1.111 => positive lift
The break-even point is where s(X+Y) = s(X) * s(Y) (here 0.85 * 0.90 = 0.765), giving a lift of exactly 1.
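The same arithmetic as a short sketch (the lift function name is just illustrative):

```python
def lift(s_xy, s_x, s_y):
    return s_xy / (s_x * s_y)

print(lift(0.15, 0.25, 0.70))   # ~0.857 -> below 1, negative correlation
print(lift(0.75, 0.85, 0.90))   # ~0.980 -> negative lift
print(lift(0.85, 0.85, 0.90))   # ~1.111 -> positive lift
```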
Conviction
We can also express a rule's quality just in terms of baskets that contain A but not B. "if A then B" implies "not (A and not B)". So the formula for conviction is: conv(A => B) = s(A) * s(not B) / s(A and not B). If A and B always co-occur, the denominator will be 0. Splat. (Treat it as infinite.)
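A small sketch of conviction computed from support values, using the numbers from the earlier baskets; the function name and the infinite-result handling are illustrative:

```python
import math

def conviction(s_a, s_b, s_a_and_b):
    s_a_not_b = s_a - s_a_and_b            # baskets with A but without B
    if s_a_not_b == 0:
        return math.inf                    # A => B is never violated
    return (s_a * (1.0 - s_b)) / s_a_not_b

print(conviction(0.6, 0.8, 0.6))   # butter => bread: infinite (never violated)
print(conviction(0.8, 0.6, 0.6))   # bread => butter: 0.8 * 0.4 / 0.2 = 1.6
```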
Other Evaluation Metrics
Back to Rule Mining
The most common approach to finding rules is:
1. Find sets of 2 or more attributes that occur together in more instances than a minimum support threshold.
2. Generate rules from those sets.
The most important thing to note is that any subset of a frequent itemset is also frequent. If (bread, milk, butter, beer) is frequent, then (bread, butter, beer) is also frequent, because it must occur at least as often as the full set.
Naïve Approach
No problem. The algorithm is obvious: count all possible itemsets that appear in all transactions. If our transactions are: BC, BD, AC, BCD, ABD, ABCD, we count: AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD. Uhh... And what about when you have as many different items as a supermarket? Say 100,000 different products? Ignoring the empty set and the single-item sets, that's 2^100,000 - 100,001 itemsets. You want to know how many that is?
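To get a feel for the size without trying to print the number itself, a quick sketch:

```python
import math

n = 100_000                              # distinct products
# Number of itemsets of size 2 or more: 2^n minus the empty set and the n singletons.
# Count the decimal digits of 2^n rather than building the number as a string:
digits = int(n * math.log10(2)) + 1
print(digits)                            # 30103 -- the count is a ~30,000-digit number
```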
Naïve Approach: BAD!!!
Frequent Itemsets
Let's not try to work out the support for all possible combinations. Subsets of frequent itemsets are frequent: all subsets of a set that meets the minimum support will also necessarily meet the minimum support. Equally, if we know a set is infrequent, any superset of it must also be infrequent. So, instead of trying all combinations, we'll generate itemsets of a particular size and scan the database to see which of them meet the support threshold. We know that subsets of frequent sets are also frequent and supersets of infrequent sets are also infrequent, so we don't need to check them.
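A minimal level-wise sketch of this idea in Python (illustrative only; it grows candidates from survivors of the previous level, not the full A Priori candidate generation we'll see next week):

```python
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Grow itemsets one size at a time, keeping only those meeting min_support."""
    n = len(baskets)
    items = sorted({i for b in baskets for i in b})
    frequent = {}
    current = [frozenset([i]) for i in items]
    while current:
        # one scan of the database per level
        counts = {s: sum(s <= b for b in baskets) for s in current}
        level = {s: c / n for s, c in counts.items() if c / n >= min_support}
        frequent.update(level)
        # next level's candidates: unions of surviving sets that are one item larger
        survivors = list(level)
        current = {a | b for a, b in combinations(survivors, 2)
                   if len(a | b) == len(a) + 1}
    return frequent

baskets = [{"bread", "butter", "jam"}, {"bread", "butter"},
           {"bread", "butter", "milk"}, {"beer", "bread"}, {"beer", "milk"}]
print(frequent_itemsets(baskets, 0.4))   # singletons plus {bread, butter}
```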
Itemset Lattice
[Figure: the itemset lattice, with an infrequent itemset and its pruned supersets marked. Lattice borrowed from MSU.]
A Priori
The algorithm that does this is called A Priori, and most other ARM techniques are based on it. We will look at it in more detail next week.
Issues with WEKA and ARM
ARFF is a horrible, horrible format for ARM. Most datasets are very sparse, with each attribute either present or not present: Bread 0/1, Milk 0/1, etc. We want to record this as {bread, milk, cheese}, not as a huge table of 1s and 0s. Weka doesn't include many ARM algorithms... in fact it has three; thankfully one is A Priori. The book doesn't include much information, but Dunham has good coverage. We'll also look at some other ARM applications built by Frans Coenen and Paul Leng here at Liverpool.
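A tiny sketch of that dense-to-sparse transformation (illustrative column names; nothing Weka- or ARFF-specific):

```python
columns = ["bread", "butter", "jam", "beer", "milk"]
rows = [[1, 1, 1, 0, 0],     # dense 0/1 table, one row per basket
        [1, 1, 0, 0, 0],
        [1, 1, 0, 0, 1]]

# Sparse form: just the names of the items present in each basket.
baskets = [{name for name, flag in zip(columns, row) if flag} for row in rows]
print(baskets)   # three baskets as item sets, e.g. {'bread', 'butter', 'jam'}
```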
Further Reading
Witten 4.5
Dunham 6.1, 6.2
Han 5.1
Berry and Browne
Berry and Linoff, Chapter 9
Zhang, Association Rule Mining, Chapters 1, 2.1, 2.2
Pal and Mitra, 8.3