Association Rule Mining (some material adapted from: Mining Sequential Patterns by Karuna Pande Joshi)
An Example
Terminology
- Item
- Itemset
- Transaction
Association Rules
Let U be a set of items and let X, Y ⊆ U, with X ∩ Y = ∅.
An association rule is an expression of the form X → Y, whose meaning is:
If the elements of X occur in some context, then so do the elements of Y.
Quality Measures
Let T be the set of all transactions. The following statistical quantities are relevant to association rule mining:
- support(X) = |{t ∈ T : X ⊆ t}| / |T| — the percentage of all transactions containing itemset X
- support(X → Y) = |{t ∈ T : X ∪ Y ⊆ t}| / |T| — the percentage of all transactions containing both itemsets X and Y
- confidence(X → Y) = |{t ∈ T : X ∪ Y ⊆ t}| / |{t ∈ T : X ⊆ t}| — the percentage of transactions containing itemset X that also contain itemset Y, i.e., how good itemset X is at predicting itemset Y
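The definitions above translate directly into code. A minimal sketch, using a small hypothetical set of market-basket transactions (the items and baskets are illustrative, not from the slides):

```python
# Hypothetical market-basket transactions: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """support(X ∪ Y) / support(X): how well X predicts Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

print(support({"milk"}, transactions))                  # 0.8
print(support({"milk", "diapers"}, transactions))       # 0.6
print(confidence({"milk"}, {"diapers"}, transactions))  # 0.75
```

Note that `itemset <= t` is Python's subset test for sets, which matches the X ⊆ t condition in the definitions.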
Learning Associations
The purpose of association rule learning is to find "interesting" rules, i.e., rules that meet the following two user-defined conditions:
- support(X → Y) ≥ MinSupport
- confidence(X → Y) ≥ MinConfidence
Itemsets
- Frequent itemset: an itemset whose support is greater than MinSupport, i.e., a high percentage of transactions contain the full itemset (denoted L_k, where k is the size of the itemset)
- Candidate itemset: a potentially frequent itemset (denoted C_k, where k is the size of the itemset)
Basic Idea
- Generate all frequent itemsets satisfying the condition on minimum support
- Build all possible rules from these itemsets and check them against the condition on minimum confidence
- All the rules above the minimum confidence threshold are returned for further evaluation
AprioriAll (I)
L_1 ← ∅
For each item I_j ∈ I
    count({I_j}) ← |{T_i : I_j ∈ T_i}|        // the number of transactions containing item I_j
    If count({I_j}) ≥ MinSupport × m          // if this count is big enough, add the item and its count to L_1
        L_1 ← L_1 ∪ {({I_j}, count({I_j}))}
k ← 2
While L_{k-1} ≠ ∅
    L_k ← ∅
    For each (l_1, count(l_1)) ∈ L_{k-1}
        For each (l_2, count(l_2)) ∈ L_{k-1}
            If (l_1 = {j_1, …, j_{k-2}, x} ∧ l_2 = {j_1, …, j_{k-2}, y} ∧ x ≠ y)
                l ← {j_1, …, j_{k-2}, x, y}
                count(l) ← |{T_i : l ⊆ T_i}|
                If count(l) ≥ MinSupport × m
                    L_k ← L_k ∪ {(l, count(l))}
    k ← k + 1
Return L_1 ∪ L_2 ∪ … ∪ L_{k-1}
(Here m is the number of transactions, so MinSupport × m converts the support fraction into a count.)
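The pseudocode above can be sketched in Python. This is a minimal, unoptimized version of the level-wise search (the join step here simply unions pairs of frequent (k-1)-itemsets and keeps size-k results, which is equivalent to the shared-prefix join in the pseudocode; the sample transactions are illustrative assumptions):

```python
def apriori(transactions, min_support):
    """Return all frequent itemsets (frozensets) mapped to their counts."""
    n = len(transactions)
    min_count = min_support * n  # MinSupport × m from the pseudocode

    # L1: count single items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: union pairs of frequent (k-1)-itemsets, keep size-k results
        candidates = set()
        items = list(current)
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                union = items[i] | items[j]
                if len(union) == k:
                    candidates.add(union)
        # Count step: keep candidates meeting the minimum count
        current = {}
        for c in candidates:
            count = sum(c <= t for t in transactions)
            if count >= min_count:
                current[c] = count
        frequent.update(current)
        k += 1
    return frequent

# Illustrative transactions (hypothetical data)
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
freq = apriori(transactions, min_support=0.6)
```

With MinSupport = 0.6 on these five transactions, an itemset must appear in at least three of them; the result contains four frequent 1-itemsets and four frequent 2-itemsets, and no frequent 3-itemsets, so the loop terminates.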
Rule Generation
Look at the set {a,d,e}. It has six candidate association rules:
- {a} → {d,e}, confidence: support({a,d,e}) / support({a})
- {d,e} → {a}, confidence: support({a,d,e}) / support({d,e})
- {d} → {a,e}, confidence: support({a,d,e}) / support({d})
- {a,e} → {d}, confidence: support({a,d,e}) / support({a,e})
- {e} → {a,d}, confidence: support({a,d,e}) / support({e})
- {a,d} → {e}, confidence: support({a,d,e}) / support({a,d}) = 0.800
Confidence-Based Pruning
Rule Generation
Look at the set {a,d,e} and a given MinConfidence threshold. Of its six candidate association rules, the following are evaluated:
- {d,e} → {a}, confidence: support({a,d,e}) / support({d,e})
- {a,e} → {d}, confidence: support({a,d,e}) / support({a,e})
- {a,d} → {e}, confidence: support({a,d,e}) / support({a,d})
- {d} → {a,e}, confidence: support({a,d,e}) / support({d})
Selected rules: {d,e} → {a} and {a,d} → {e}
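Rule generation with confidence-based pruning can be sketched as follows. The support values below are illustrative assumptions (the slides do not give them), chosen so that, as in the example, only {d,e} → {a} and {a,d} → {e} survive a MinConfidence of 0.8:

```python
from itertools import combinations

def rules_from_itemset(itemset, support, min_confidence):
    """Enumerate rules X -> Y with X ∪ Y = itemset and both sides nonempty;
    keep those with confidence = support(itemset) / support(X) >= min_confidence.
    `support` maps frozensets to precomputed support values."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = support[itemset] / support[antecedent]
            if conf >= min_confidence:
                rules.append((antecedent, itemset - antecedent, conf))
    return rules

# Hypothetical support values for {a,d,e} and its subsets (illustrative only)
support = {
    frozenset("ade"): 0.4,
    frozenset("a"): 0.6, frozenset("d"): 0.9, frozenset("e"): 0.8,
    frozenset("ad"): 0.5, frozenset("ae"): 0.6, frozenset("de"): 0.4,
}
rules = rules_from_itemset("ade", support, min_confidence=0.8)
for x, y, conf in rules:
    print(sorted(x), "->", sorted(y), round(conf, 3))
```

With these numbers, {a,d} → {e} has confidence 0.4 / 0.5 = 0.800 (matching the value in the example) and {d,e} → {a} has confidence 0.4 / 0.4 = 1.0; the other four rules fall below 0.8 and are pruned.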
Summary
- Apriori is a rather simple algorithm that discovers useful and interesting patterns
- It is widely used
- It has been extended to create collaborative filtering algorithms that provide recommendations
References
- Rakesh Agrawal, Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. Proc. 20th Int. Conf. on Very Large Data Bases (VLDB), 1994.
- Rakesh Agrawal, Tomasz Imielinski, Arun Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data.
- P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson Education Inc., 2006, Chapter 6.