Association Rule Mining
Debapriyo Majumdar
Data Mining – Fall 2014, Indian Statistical Institute Kolkata
August 4 and 7, 2014



Market Basket Analysis

Scenario: customers shopping at a supermarket

| Transaction id | Items |
|---|---|
| 1 | Bread, Ham, Juice, Cheese, Salami, Lettuce |
| 2 | Rice, Dal, Coconut, Curry leaves, Coffee, Milk, Pickle |
| 3 | Milk, Biscuit, Bread, Salami, Fruit jam, Egg |
| 4 | Tea, Bread, Salami, Bacon, Ham, Sausage, Tomato |
| 5 | Rice, Egg, Pickle, Curry leaves, Coconut, Red chilly |

• What can we infer from the above data?
• An association rule: {Bread, Salami} → {Ham}, with confidence 2/3

Applications
• Information-driven marketing
• Catalog design
• Store layout
• Customer segmentation based on buying patterns
• Several papers by Rakesh Agrawal and others in the 1990s
  – Rakesh Agrawal and Ramakrishnan Srikant, Fast Algorithms for Mining Association Rules, VLDB 1994

The Market-Basket Model
• A (large) set of binary attributes, called items: $I = \{i_1, \dots, i_n\}$
  – e.g. milk, bread: the items sold at the market
• A transaction T consists of a (small) subset of I
  – e.g. the list of items (bill) bought by one customer at once
• The database D is a (large) set of transactions: $D = \{T_1, \dots, T_N\}$

The Market-Basket Model
• Goal: mining associations between the items
  – The transactions or customers may also have associations, but here we are interested only in relations between items
• Approach: find subsets of items that frequently occur together in transactions
• An itemset: any subset X of I

Support of an Itemset
• Let X be an itemset
• Support count σ(X) = number of transactions containing all items of X
• support(X) = fraction of transactions containing all items of X
• Makes sense (is statistically significant) only when
  – the support count is at least a few hundred
  – in a database of several thousand transactions

Example (the five transactions of the market-basket scenario):
  support({Bread, Salami}) = 0.6
  support({Rice, Pickle, Coconut}) = 0.4
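The two support values can be reproduced with a few lines of Python. This is a minimal sketch: the names `TRANSACTIONS`, `support_count` and `support` are illustrative choices, not part of the slides.

```python
# The five market-basket transactions from the slides, one set per T-ID.
TRANSACTIONS = [
    {"Bread", "Ham", "Juice", "Cheese", "Salami", "Lettuce"},
    {"Rice", "Dal", "Coconut", "Curry leaves", "Coffee", "Milk", "Pickle"},
    {"Milk", "Biscuit", "Bread", "Salami", "Fruit jam", "Egg"},
    {"Tea", "Bread", "Salami", "Bacon", "Ham", "Sausage", "Tomato"},
    {"Rice", "Egg", "Pickle", "Curry leaves", "Coconut", "Red chilly"},
]


def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset, transactions):
    """support(X): fraction of transactions that contain every item of X."""
    return support_count(itemset, transactions) / len(transactions)


print(support({"Bread", "Salami"}, TRANSACTIONS))            # 0.6
print(support({"Rice", "Pickle", "Coconut"}, TRANSACTIONS))  # 0.4
```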

Association Rule
• Association rule: an implication of the form X → Y, where $X, Y \subseteq I$ and $X \cap Y = \emptyset$
• $\mathrm{support}(X \to Y) = \dfrac{\sigma(X \cup Y)}{|D|}$
  – the fraction of transactions containing all items of both X and Y
• $\mathrm{confidence}(X \to Y) = \dfrac{\sigma(X \cup Y)}{\sigma(X)}$

Example (the five transactions of the market-basket scenario), R : {Bread, Salami} → {Ham}
  support(R) = 2/5 = 0.4
  confidence(R) = 2/3
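As a quick check of the rule R : {Bread, Salami} → {Ham}, the sketch below computes support and confidence exactly with `fractions.Fraction`. The function names are hypothetical, and the confidence function assumes the antecedent occurs in at least one transaction.

```python
from fractions import Fraction

# The five market-basket transactions from the slides.
TRANSACTIONS = [
    {"Bread", "Ham", "Juice", "Cheese", "Salami", "Lettuce"},
    {"Rice", "Dal", "Coconut", "Curry leaves", "Coffee", "Milk", "Pickle"},
    {"Milk", "Biscuit", "Bread", "Salami", "Fruit jam", "Egg"},
    {"Tea", "Bread", "Salami", "Bacon", "Ham", "Sausage", "Tomato"},
    {"Rice", "Egg", "Pickle", "Curry leaves", "Coconut", "Red chilly"},
]


def sigma(itemset, transactions):
    """Support count: transactions containing all items of the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def rule_support(lhs, rhs, transactions):
    """support(X -> Y) = sigma(X u Y) / |D|."""
    return Fraction(sigma(set(lhs) | set(rhs), transactions), len(transactions))


def rule_confidence(lhs, rhs, transactions):
    """confidence(X -> Y) = sigma(X u Y) / sigma(X); assumes sigma(X) > 0."""
    return Fraction(sigma(set(lhs) | set(rhs), transactions),
                    sigma(lhs, transactions))


print(rule_support({"Bread", "Salami"}, {"Ham"}, TRANSACTIONS))     # 2/5
print(rule_confidence({"Bread", "Salami"}, {"Ham"}, TRANSACTIONS))  # 2/3
```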

Association Rule Mining Task
• Given a set of items I, a set of transactions D, a minimum support threshold minsup and a minimum confidence threshold minconf
• Find all rules R such that
  – support(R) ≥ minsup
  – confidence(R) ≥ minconf

One Approach
• Observe: $\mathrm{support}(X \to Y) = \dfrac{\sigma(X \cup Y)}{|D|} = \dfrac{\sigma(Z)}{|D|} = \mathrm{support}(Z)$, where $Z = X \cup Y$
• If $Z = W \cup V$ is another partition of Z, then support(X → Y) = support(W → V)
  – Each binary partition of Z represents an association rule
  – All of them have the same support
  – However, the confidences may be different
• Approach: frequent itemset generation
  1. Find all itemsets Z with support(Z) ≥ minsup. Call such itemsets frequent itemsets.
  2. From each frequent itemset Z, generate the rules with confidence ≥ minconf

Finding Frequent Itemsets
• If | I | = n, then the number of possible itemsets is $2^n$
• For each itemset, compute the support by scanning the list of items of each transaction
  – $O(N \times w)$, where w is the average length of the transactions
• Overall complexity: $O(2^n \times N \times w)$
• Computationally very expensive!
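The brute-force procedure can be sketched as follows. To keep the enumeration tiny, the example restricts the item universe to four items picked for illustration (an assumption of this sketch); with all items of the database the loop would visit every one of the 2^n − 1 non-empty itemsets.

```python
from itertools import combinations

# The five market-basket transactions from the slides.
TRANSACTIONS = [
    {"Bread", "Ham", "Juice", "Cheese", "Salami", "Lettuce"},
    {"Rice", "Dal", "Coconut", "Curry leaves", "Coffee", "Milk", "Pickle"},
    {"Milk", "Biscuit", "Bread", "Salami", "Fruit jam", "Egg"},
    {"Tea", "Bread", "Salami", "Bacon", "Ham", "Sausage", "Tomato"},
    {"Rice", "Egg", "Pickle", "Curry leaves", "Coconut", "Red chilly"},
]


def brute_force_frequent(transactions, items, minsup_count):
    """Enumerate every non-empty itemset over `items` and scan D for each one."""
    frequent = {}
    examined = 0
    for k in range(1, len(items) + 1):
        for candidate in combinations(sorted(items), k):
            examined += 1
            wanted = set(candidate)
            count = sum(1 for t in transactions if wanted <= t)  # one scan of D
            if count >= minsup_count:
                frequent[frozenset(candidate)] = count
    return frequent, examined


# Hypothetical restricted universe of n = 4 items, minsup given as a count.
ITEMS = ["Bread", "Ham", "Salami", "Milk"]
frequent, examined = brute_force_frequent(TRANSACTIONS, ITEMS, minsup_count=2)
print(examined)       # 15 itemsets examined, i.e. 2**4 - 1
print(len(frequent))  # 8 of them are frequent
```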

Anti-monotone Property of Support
• If an itemset is frequent, all its subsets are also frequent
  – Because if X ⊆ Y, then support(X) ≥ support(Y)
  – For all transactions T such that Y ⊆ T, we have X ⊆ T

Example (the five transactions of the market-basket scenario):
  support({Bread, Salami}) ≥ support({Bread, Ham, Salami})
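The property is easy to confirm numerically on the example transactions. The helper `check_anti_monotone` below is a hypothetical name; it checks one itemset against all of its non-empty proper subsets, which illustrates the property but is not a proof.

```python
from itertools import combinations

# The five market-basket transactions from the slides.
TRANSACTIONS = [
    {"Bread", "Ham", "Juice", "Cheese", "Salami", "Lettuce"},
    {"Rice", "Dal", "Coconut", "Curry leaves", "Coffee", "Milk", "Pickle"},
    {"Milk", "Biscuit", "Bread", "Salami", "Fruit jam", "Egg"},
    {"Tea", "Bread", "Salami", "Bacon", "Ham", "Sausage", "Tomato"},
    {"Rice", "Egg", "Pickle", "Curry leaves", "Coconut", "Red chilly"},
]


def support(itemset, transactions):
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def check_anti_monotone(itemset, transactions):
    """True if every non-empty proper subset X of Y has support(X) >= support(Y)."""
    y = sorted(itemset)
    s_y = support(y, transactions)
    return all(
        support(x, transactions) >= s_y
        for k in range(1, len(y))
        for x in combinations(y, k)
    )


print(check_anti_monotone({"Bread", "Ham", "Salami"}, TRANSACTIONS))  # True
```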

The A-Priori Algorithm

Notation:
    L_k = the set of frequent (large) itemsets of size k
    C_k = the set of candidate frequent (large) itemsets of size k

Algorithm:
    L_1 = {frequent 1-itemsets};
    for ( k = 2; L_{k-1} ≠ ∅; k++ ) do begin
        C_k = apriori_gen(L_{k-1});      /* generate the new candidates */
        for all transactions T in D do begin
            C_T = subset(C_k, T);        /* keep only the candidates contained in T */
            for all candidates c in C_T do
                c.count++;
        end
        L_k = {c in C_k | c.count ≥ minsup}
    end
    Output = union of all L_k for k = 1, 2, …, n
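A compact Python rendering of the loop above might look as follows. It is a sketch under simplifying assumptions, not the paper's implementation: `minsup_count` is an absolute count, candidates are checked against each transaction by a plain subset test instead of a hash tree, and `apriori_gen` joins by set union rather than by the prefix-based join (the prune step yields the same candidates).

```python
from itertools import combinations

# The five market-basket transactions from the slides.
TRANSACTIONS = [
    {"Bread", "Ham", "Juice", "Cheese", "Salami", "Lettuce"},
    {"Rice", "Dal", "Coconut", "Curry leaves", "Coffee", "Milk", "Pickle"},
    {"Milk", "Biscuit", "Bread", "Salami", "Fruit jam", "Egg"},
    {"Tea", "Bread", "Salami", "Bacon", "Ham", "Sausage", "Tomato"},
    {"Rice", "Egg", "Pickle", "Curry leaves", "Coconut", "Red chilly"},
]


def apriori_gen(prev_frequent, k):
    """Candidates of size k whose (k-1)-subsets are all in L_{k-1}."""
    candidates = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            if len(union) == k and all(
                frozenset(s) in prev_frequent for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(transactions, minsup_count):
    """Return {frequent itemset: support count} for all sizes k >= 1."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # L_1: count single items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {c: n for c, n in counts.items() if n >= minsup_count}
    frequent = dict(current)
    k = 2
    while current:                              # stop when L_{k-1} is empty
        candidates = apriori_gen(set(current), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:                  # one pass over D per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent.update(current)
        k += 1
    return frequent


frequent = apriori(TRANSACTIONS, minsup_count=2)
print(len(frequent))  # 24 frequent itemsets at a support count of 2
```

With a support count of 2, the largest frequent itemset in this data is {Rice, Coconut, Curry leaves, Pickle}, shared by transactions 2 and 5.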

Generating Candidate Itemsets C_k
• Join step: a join of L_{k-1} with itself
    insert into C_k
    select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
• What does it do? Example:
    L_3 = { {1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4} }
    After the join: C_4 = { {1, 2, 3, 4}, {1, 3, 4, 5} }
• Prune step: {1, 3, 4, 5} is pruned because its subset {1, 4, 5} ∉ L_3
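The join and prune steps can be traced on the L_3 example with a short sketch. Itemsets are represented as sorted tuples, and the function name `apriori_gen` and its two-part return value are choices made for this illustration.

```python
from itertools import combinations


def apriori_gen(prev_frequent):
    """Join L_{k-1} with itself on the first k-2 items, then prune.

    prev_frequent: iterable of sorted tuples, all of length k-1.
    Returns (joined, pruned): the candidates before and after the prune step.
    """
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    joined = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Same first k-2 items, and p's last item smaller than q's last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.append(p + (q[-1],))
    # Prune: every (k-1)-subset of a candidate must itself be frequent.
    pruned = [
        c for c in joined
        if all(s in prev_set for s in combinations(c, len(c) - 1))
    ]
    return joined, pruned


L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
joined, pruned = apriori_gen(L3)
print(joined)  # [(1, 2, 3, 4), (1, 3, 4, 5)]
print(pruned)  # [(1, 2, 3, 4)]
```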

Checking Support for Candidates
• One approach:
    for each candidate itemset c ∈ C_k
        for each transaction T ∈ D do begin
            check if c ⊆ T
        end
• Complexity?

Using a Hash Tree
Let us have 12 candidate itemsets of size 3:
{1 2 5}, {1 2 7}, {1 3 9}, {2 4 5}, {2 8 9}, {3 5 7}, {4 5 9}, {4 7 8}, {5 6 7}, {5 7 9}, {6 7 8}, {6 7 9}

Hash function: three branches, one for the items 1, 4, 7, one for 2, 5, 8, and one for 3, 6, 9

The Hash Tree
[Figure: hash tree storing the 12 candidate itemsets {1 2 5}, {1 2 7}, {1 3 9}, {2 4 5}, {2 8 9}, {3 5 7}, {4 5 9}, {4 7 8}, {5 6 7}, {5 7 9}, {6 7 8}, {6 7 9}. Starting at the root, each internal node branches three ways by the hash function (1, 4, 7 / 2, 5, 8 / 3, 6, 9); the candidates are stored in the leaves.]

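Subsets of the Transaction
[Figure: enumeration tree of all subsets of size 3 of a transaction { }, ordered by item id. The first level fixes the first item (subsets starting with 1, and so on), the second level fixes the first two items (subsets starting with 1 2, and so on). Subsets legible at the leaves: {1 2 6}, {1 2 7}, {1 2 8}, {1 6 7}, {1 6 8}, {1 7 8}, {2 6 7}, {2 6 8}, {2 7 8}, {6 7 8}. The subsets are hashed in the same style as the candidates.]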

The Subset Operation using Hash Tree
Transaction: { }, ordered by item id
[Figure: the hash tree of the 12 candidates (hash function 1, 4, 7 / 2, 5, 8 / 3, 6, 9), with the size-3 subsets of the transaction routed down the branches; the sets {5 6 8} and {1 2 5} are legible at the leaves that are reached.]
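The idea of routing both candidates and transaction subsets through the same hash function can be sketched in a simplified, flattened form. Several things here are assumptions of this sketch: the hash function is encoded as `(item - 1) % 3`, which reproduces the groups 1, 4, 7 / 2, 5, 8 / 3, 6, 9; the tree is collapsed into a dictionary keyed by the full path of branch choices; and the transaction {1, 2, 5, 6, 8} is a hypothetical example. A real hash tree descends recursively and splits leaves only when they overflow, which avoids enumerating every subset.

```python
from itertools import combinations

CANDIDATES = [
    (1, 2, 5), (1, 2, 7), (1, 3, 9), (2, 4, 5), (2, 8, 9), (3, 5, 7),
    (4, 5, 9), (4, 7, 8), (5, 6, 7), (5, 7, 9), (6, 7, 8), (6, 7, 9),
]


def h(item):
    """Branch index: 0 for items 1, 4, 7; 1 for 2, 5, 8; 2 for 3, 6, 9."""
    return (item - 1) % 3


def build_hash_tree(candidates):
    """Flattened hash tree: map each path of branch choices to a leaf bucket."""
    tree = {}
    for c in candidates:
        path = tuple(h(i) for i in c)
        tree.setdefault(path, []).append(c)
    return tree


def candidates_in(transaction, tree, k=3):
    """Candidates contained in the transaction, found via their hash paths."""
    found = []
    for subset in combinations(sorted(transaction), k):
        bucket = tree.get(tuple(h(i) for i in subset), [])
        if subset in bucket:      # compare only against one leaf bucket
            found.append(subset)
    return found


tree = build_hash_tree(CANDIDATES)
print(candidates_in({1, 2, 5, 6, 8}, tree))  # [(1, 2, 5)]
```

Each subset is compared only with the candidates in one bucket rather than with all 12, which is the saving the hash tree is meant to provide.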

Where are we now?
• Computed the frequent itemsets, i.e. the itemsets with the required support minsup
• Each frequent k-itemset X gives rise to several association rules. How many?
  – Ignoring X → ∅ and ∅ → X: $2^k - 2$ rules
• Rules generated from different itemsets are also different
• The rules need to be checked for minimum confidence
• All these rules already satisfy the support condition
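The count 2^k − 2 can be confirmed by enumerating every binary partition of an itemset into a non-empty antecedent and a non-empty consequent. The function name below is illustrative.

```python
from itertools import combinations


def rules_from_itemset(itemset):
    """All rules X -> Y with X, Y non-empty, disjoint, and X u Y = itemset."""
    items = sorted(itemset)
    rules = []
    for r in range(1, len(items)):              # antecedent sizes 1 .. k-1
        for lhs in combinations(items, r):
            rhs = tuple(i for i in items if i not in lhs)
            rules.append((lhs, rhs))
    return rules


for k in (2, 3, 4):
    rules = rules_from_itemset(range(1, k + 1))
    print(k, len(rules), 2 ** k - 2)  # the last two numbers agree
```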

Rules Generated from the Same Itemset
• Let X ⊂ Y, for non-empty itemsets X and Y
• Then X → Y − X is an association rule
• Theorem: if X′ ⊂ X, then c(X → Y − X) ≥ c(X′ → Y − X′)
  – Example: c({1 2 3} → {4 5}) ≥ c({1 2} → {3 4 5})
• Proof. Observe:
    c(X → Y − X) = σ(Y) / σ(X)
    c(X′ → Y − X′) = σ(Y) / σ(X′)
  Since X′ ⊂ X, we have σ(X′) ≥ σ(X), so c(X → Y − X) ≥ c(X′ → Y − X′)
• Corollary: if X → Y − X is not a high-confidence association rule, then X′ → Y − X′ is not a high-confidence rule either

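Level-wise Approach for Rule Generation
Frequent itemset: {1 2 3 4}

Lattice of rules, from the largest antecedent to the smallest:
  {1 2 3 4} → {}
  {1 2 3} → {4}, {1 2 4} → {3}, {1 3 4} → {2}, {2 3 4} → {1}
  {1 2} → {3 4}, {1 3} → {2 4}, {1 4} → {2 3}, {2 3} → {1 4}, {2 4} → {1 3}, {3 4} → {1 2}
  {1} → {2 3 4}, {2} → {1 3 4}, {3} → {1 2 4}, {4} → {1 2 3}

• Suppose {1 2 4} → {3} fails the confidence bar
• The whole tree under {1 2 4} → {3} can be discarded, i.e. every rule whose antecedent is a subset of {1 2 4}: {1 2} → {3 4}, {1 4} → {2 3}, {2 4} → {1 3}, {1} → {2 3 4}, {2} → {1 3 4}, {4} → {1 2 3}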
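A level-wise generator with this pruning might be sketched as below. It runs on the market-basket transactions with the frequent itemset {Bread, Ham, Salami} and a hypothetical threshold minconf = 0.8. A smaller antecedent is tried only if all of its immediate supersets inside the itemset passed, which is the corollary applied level by level.

```python
# The five market-basket transactions from the slides.
TRANSACTIONS = [
    {"Bread", "Ham", "Juice", "Cheese", "Salami", "Lettuce"},
    {"Rice", "Dal", "Coconut", "Curry leaves", "Coffee", "Milk", "Pickle"},
    {"Milk", "Biscuit", "Bread", "Salami", "Fruit jam", "Egg"},
    {"Tea", "Bread", "Salami", "Bacon", "Ham", "Sausage", "Tomato"},
    {"Rice", "Egg", "Pickle", "Curry leaves", "Coconut", "Red chilly"},
]


def sigma(itemset):
    """Support count of an itemset in TRANSACTIONS."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def generate_rules(itemset, minconf):
    """Rules X -> (itemset - X) with confidence >= minconf, level by level."""
    itemset = frozenset(itemset)
    if len(itemset) < 2:
        return []
    sigma_y = sigma(itemset)
    rules = []
    level = {itemset - {i} for i in itemset}     # antecedents of size k-1
    while level:
        survivors = set()
        for lhs in level:
            conf = sigma_y / sigma(lhs)
            if conf >= minconf:
                rules.append((lhs, itemset - lhs, conf))
                survivors.add(lhs)
        # A smaller antecedent is kept only if every immediate superset
        # within the itemset survived; otherwise it is pruned untested.
        level = set()
        for lhs in survivors:
            if len(lhs) > 1:
                for i in lhs:
                    smaller = lhs - {i}
                    if all((smaller | {j}) in survivors for j in itemset - smaller):
                        level.add(smaller)
    return rules


for lhs, rhs, conf in generate_rules({"Bread", "Ham", "Salami"}, minconf=0.8):
    print(sorted(lhs), "->", sorted(rhs), conf)
```

Here {Bread, Salami} → {Ham} has confidence 2/3 and fails, so the rules with antecedents {Bread} and {Salami} are never evaluated.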


Maximal Frequent Itemsets
Maximal frequent itemset: a frequent itemset for which none of its immediate supersets is frequent
[Figure: lattice of all itemsets over the items 1 to 4, from {} up to {1 2 3 4}, with the itemsets that are not frequent and the maximal frequent itemsets marked.]

Maximal Frequent Itemsets
• Every frequent itemset is a subset of at least one maximal frequent itemset
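Given the collection of frequent itemsets, the maximal ones can be picked out directly from the definition. The set `FREQUENT` below is a hypothetical example over the items 1 to 4 and is not the marking from the lattice figure.

```python
def maximal_frequent(frequent):
    """Frequent itemsets none of whose immediate supersets is frequent."""
    frequent = {frozenset(s) for s in frequent}
    items = set().union(*frequent)
    return {
        s for s in frequent
        if not any((s | {i}) in frequent for i in items - s)
    }


# Hypothetical frequent itemsets: all non-empty subsets of {1 2 3} and of {3 4}.
FREQUENT = [
    {1}, {2}, {3}, {4},
    {1, 2}, {1, 3}, {2, 3}, {3, 4},
    {1, 2, 3},
]

maximal = maximal_frequent(FREQUENT)
print(sorted(sorted(s) for s in maximal))  # [[1, 2, 3], [3, 4]]

# Every frequent itemset is contained in some maximal frequent itemset.
print(all(any(set(f) <= m for m in maximal) for f in FREQUENT))  # True
```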

Maximal Frequent Itemsets
• A valuable compact representation of the frequent itemsets
• But they do not contain the support information of the subsets
  – A maximal frequent itemset says that all its supersets have lower support, but it does not say whether any subset has the same support

Closed Frequent Itemsets
• Closed itemset: an itemset X for which none of its immediate supersets has exactly the same support count as X
  – If X is not closed, at least one of its immediate supersets has the same support as X
• Closed frequent itemset: an itemset which is closed and frequent (support ≥ minsup)
• The support of a non-closed frequent itemset can be determined from the support information of the closed frequent itemsets
[Figure: nested sets: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets]
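The sketch below finds the closed frequent itemsets by brute force and then recovers the support of a non-closed itemset as the largest support among its closed supersets. The item universe is restricted to five items chosen for illustration (an assumption that keeps the enumeration small), and minsup is taken as a count of 2.

```python
from itertools import combinations

# The five market-basket transactions from the slides.
TRANSACTIONS = [
    {"Bread", "Ham", "Juice", "Cheese", "Salami", "Lettuce"},
    {"Rice", "Dal", "Coconut", "Curry leaves", "Coffee", "Milk", "Pickle"},
    {"Milk", "Biscuit", "Bread", "Salami", "Fruit jam", "Egg"},
    {"Tea", "Bread", "Salami", "Bacon", "Ham", "Sausage", "Tomato"},
    {"Rice", "Egg", "Pickle", "Curry leaves", "Coconut", "Red chilly"},
]


def sigma(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)


def closed_frequent(transactions, items, minsup_count):
    """{closed frequent itemset: support count} over the given item universe."""
    universe = set(items)
    closed = {}
    for k in range(1, len(universe) + 1):
        for cand in combinations(sorted(universe), k):
            x = frozenset(cand)
            n = sigma(x, transactions)
            if n < minsup_count:
                continue
            # Closed: every immediate superset has a strictly smaller count.
            if all(sigma(x | {i}, transactions) < n for i in universe - x):
                closed[x] = n
    return closed


def support_from_closed(itemset, closed):
    """Support count of a frequent itemset, read off its closed supersets."""
    return max(n for c, n in closed.items() if frozenset(itemset) <= c)


ITEMS = ["Bread", "Ham", "Salami", "Milk", "Egg"]  # hypothetical restriction
closed = closed_frequent(TRANSACTIONS, ITEMS, minsup_count=2)
for c, n in sorted(closed.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(c), n)
print(support_from_closed({"Ham"}, closed))  # 2, although {Ham} is not closed
```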

Evaluation of Association Rules
• Even from a small dataset, a very large number of rules can be generated
  – For example, as the support and confidence conditions are relaxed, the number of rules explodes
• An interestingness measure for patterns / rules is required
• Objective interestingness measure: a measure that uses statistics derived from the data
  – Support, confidence, correlation, …
  – Domain independent
  – Requires minimal human involvement

Subjective Measure of Interestingness
• The rule {Salami} → {Bread} is not so interesting because it is obvious!
• Rules such as {Salami} → {Dish washer detergent}, {Salami} → {Diaper}, etc. are less obvious
• Subjectively more interesting for marketing experts
  – Non-trivial cross-sell
• Methods for subjective measurement
  – Visualization aided: human in the loop
  – Template-based: constraints are provided for rules
  – Filter obvious and non-actionable rules

Contingency Table

|       | B          | B′         | Total      |
|-------|------------|------------|------------|
| A     | $f_{11}$   | $f_{10}$   | $f_{1+}$   |
| A′    | $f_{01}$   | $f_{00}$   | $f_{0+}$   |
| Total | $f_{+1}$   | $f_{+0}$   | $N$        |

• Frequencies tabulated for a pair of binary variables
• A useful evaluation and illustration tool
• Generally: A′ (or B′) denotes the transactions in which A (or B) is absent
• $f_{1+}$ = support count of A, $f_{+1}$ = support count of B

Limitations of Support & Confidence
• Tuning the support threshold is tricky
• Low threshold
  – Too many rules are generated!
• High threshold
  – Potentially interesting patterns may fall below the support threshold

Limitation of Confidence
[Table: contingency table of Tea against Coffee]
• Consider the rule {Tea} → {Coffee}: support = 15%, confidence = 75%
• But: overall, 80% of people have coffee
  – i.e., the rule {} → {Coffee} has confidence 80%
  – Among tea drinkers, the percentage actually drops to 75%!
• Where does it go wrong?
• The confidence measure ignores the support of Y for a rule X → Y

Interest Factor
• Lift: $\mathrm{Lift}(X \to Y) = \dfrac{c(X \to Y)}{s(Y)}$, where s denotes support as a fraction of the transactions
• For binary variables, lift is equivalent to the interest factor
• Interest factor: $I(X, Y) = \dfrac{s(X \cup Y)}{s(X)\, s(Y)} = \dfrac{N f_{11}}{f_{1+}\, f_{+1}}$
• Similar to a baseline frequency comparison under the assumption of statistical independence
  – If X and Y are statistically independent, their baseline frequency (the expected frequency of X and Y both occurring) is $f_{11} = \dfrac{f_{1+}\, f_{+1}}{N}$

Interest Factor
• $I = \dfrac{N f_{11}}{f_{1+}\, f_{+1}}$
• Intuitively, I(X, Y)
  – = 1, if X and Y are independent
  – > 1, if X and Y have a positive correlation
  – < 1, if X and Y have a negative correlation
• Verify for the tea-coffee example: I(Tea, Coffee) = 0.15 / (0.2 × 0.8) = 0.9375, slightly below 1
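The tea-coffee numbers can be checked with exact arithmetic. The inputs are the support fractions quoted on the slides (15% for tea and coffee together, 80% for coffee, and 20% for tea, which follows from 15% / 75%); the function names are illustrative.

```python
from fractions import Fraction


def confidence(s_xy, s_x):
    """c(X -> Y) = s(X u Y) / s(X)."""
    return s_xy / s_x


def interest_factor(s_xy, s_x, s_y):
    """I(X, Y) = s(X u Y) / (s(X) * s(Y)); equals lift = c(X -> Y) / s(Y)."""
    return s_xy / (s_x * s_y)


s_tea_coffee = Fraction(15, 100)
s_tea = Fraction(20, 100)
s_coffee = Fraction(80, 100)

c = confidence(s_tea_coffee, s_tea)
i = interest_factor(s_tea_coffee, s_tea, s_coffee)
print(c, float(c))  # 3/4 0.75
print(i, float(i))  # 15/16 0.9375
```

The confidence of 75% looks high on its own, but it is below the 80% baseline for coffee, which is exactly what an interest factor below 1 expresses.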

Limitation of Interest Factor
[Tables: contingency tables of Text against Analysis and of Graph against Mining]
• Observe: I(Text, Analysis) = 1.02, I(Graph, Mining) = 4.08
• Yet Text and Analysis are more related than Graph and Mining
• Confidence measure:
  – c(Text → Analysis) = 94.6%
  – c(Graph → Mining) = 28.6%
• What goes wrong here?

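More Measures
• Correlation coefficient for binary variables: $\phi = \dfrac{f_{11} f_{00} - f_{01} f_{10}}{\sqrt{f_{1+}\, f_{+1}\, f_{0+}\, f_{+0}}}$
• IS measure: the I and S (support) measures combined: $IS(X, Y) = \sqrt{I(X, Y) \times s(X \cup Y)} = \dfrac{s(X \cup Y)}{\sqrt{s(X)\, s(Y)}}$
  – Mathematically equivalent to the cosine measure for binary variables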
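Both measures are short functions of the four contingency counts. The sketch evaluates them on the tea-coffee example, scaled to a hypothetical N = 1000 so that the counts are integers (150 with both, 50 with tea only, 650 with coffee only, 150 with neither); only the percentages come from the slides.

```python
import math


def phi_coefficient(f11, f10, f01, f00):
    """Correlation coefficient for two binary variables."""
    f1p, f0p = f11 + f10, f01 + f00   # row totals f_{1+}, f_{0+}
    fp1, fp0 = f11 + f01, f10 + f00   # column totals f_{+1}, f_{+0}
    return (f11 * f00 - f01 * f10) / math.sqrt(f1p * fp1 * f0p * fp0)


def is_measure(f11, f10, f01, f00):
    """IS = f11 / sqrt(f_{1+} * f_{+1}); f00 is accepted but not used."""
    return f11 / math.sqrt((f11 + f10) * (f11 + f01))


# Tea-coffee counts under the assumed N = 1000: (f11, f10, f01, f00).
TEA_COFFEE = (150, 50, 650, 150)

print(phi_coefficient(*TEA_COFFEE))  # -0.0625
print(is_measure(*TEA_COFFEE))       # 0.375
```

The IS value agrees with the product form: sqrt(0.9375 × 0.15) = 0.375.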

Properties of Objective Measures

|       | B          | B′         | Total      |
|-------|------------|------------|------------|
| A     | $f_{11}$   | $f_{10}$   | $f_{1+}$   |
| A′    | $f_{01}$   | $f_{00}$   | $f_{0+}$   |
| Total | $f_{+1}$   | $f_{+0}$   | $N$        |

• Inversion property: invariant under the inversion operation
  – Exchange $f_{11}$ with $f_{00}$ and $f_{01}$ with $f_{10}$
  – The value of the measure remains the same
• Null addition property: invariant under the addition of counts for other variables, i.e. the value of the measure remains the same if $f_{00}$ is increased
• Which measures have which properties?
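One way to explore the closing question is to test each measure numerically. The sketch below applies the inversion and null-addition operations to a single hypothetical table, (f11, f10, f01, f00) = (60, 10, 20, 10), and reports whether each value is unchanged. A check on one table can refute a property but cannot prove it.

```python
import math


def interest(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))


def phi(f11, f10, f01, f00):
    den = math.sqrt((f11 + f10) * (f11 + f01) * (f01 + f00) * (f10 + f00))
    return (f11 * f00 - f01 * f10) / den


def is_measure(f11, f10, f01, f00):
    return f11 / math.sqrt((f11 + f10) * (f11 + f01))


def invert(table):
    """Inversion: exchange f11 with f00 and f01 with f10."""
    f11, f10, f01, f00 = table
    return (f00, f01, f10, f11)


def add_nulls(table, extra=1000):
    """Null addition: increase f00 only."""
    f11, f10, f01, f00 = table
    return (f11, f10, f01, f00 + extra)


def unchanged(measure, transform, table):
    return math.isclose(measure(*table), measure(*transform(table)))


TABLE = (60, 10, 20, 10)  # hypothetical counts
for name, m in [("interest", interest), ("phi", phi), ("IS", is_measure)]:
    print(name,
          "inversion:", unchanged(m, invert, TABLE),
          "null addition:", unchanged(m, add_nulls, TABLE))
```

On this table only φ survives inversion and only IS survives null addition.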

References
• Rakesh Agrawal and Ramakrishnan Srikant, Fast Algorithms for Mining Association Rules, VLDB 1994
• Introduction to Data Mining, by Tan, Steinbach, Kumar
  – The webpage: http://www-users.cs.umn.edu/~kumar/dmbook/index.php
  – Chapter 6 is available online: http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf