ASSOCIATION RULES (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining.
What is Association Mining? Discovering interesting relationships between variables in large databases ( Find out which items predict the occurrence of other items Also known as “affinity analysis” or “market basket” analysis
Examples of Association Mining Market basket analysis/affinity analysis What products are bought together? Where to place items on grocery store shelves? Amazon’s recommendation engine “People who bought this product also bought…” Telephone calling patterns Who do a set of people tend to call most often? Social network analysis Determine who you “may know”
Market-Basket Analysis Large set of items e.g., things sold in a supermarket Large set of baskets e.g., things one customer buys in one visit Supermarket chains keep terabytes of this data Informs store layout Suggests “tie-ins” Place spaghetti sauce in the pasta aisle Put diapers on sale and raise the price of beer
Market-Basket Transactions BasketItems 1 Bread, Milk 2 Bread, Diapers, Beer, Eggs 3 Milk, Diapers, Beer, Coke 4 Bread, Milk, Diapers, Beer 5 Bread, Milk, Diapers, Coke Association Rules from these transactions X Y (antecedent consequent) {Diapers} {Beer}, {Milk, Bread} {Diapers} {Beer, Bread} {Milk}, {Bread} {Milk, Diapers}
Core idea: The itemset Itemset: A group of items of interest {Milk, Coke, Diapers} This itemset is a “3 itemset” because it contains…3 items! An association rule expresses related itemsets X Y, where X and Y are two itemsets {Milk, Diapers} {Coke} means “when you have milk and diapers, you also have Coke) BasketItems 1 Bread, Milk 2 Bread, Diapers, Beer, Eggs 3 Milk, Diapers, Beer, Coke 4 Bread, Milk, Diapers, Beer 5 Bread, Milk, Diapers, Coke
Support Support count ( ) Frequency of occurrence of an itemset {Milk, Diapers, Coke} = 2 (i.e., it’s in baskets 3 and 5) Support (s) Fraction of transactions that contain all itemsets in the relationship X Y s({Milk, Diapers, Coke}) = 2/5 = 0.4 You can calculate support for both X and Y separately Support for X = 3/5 = 0.6; Support for Y = 2/5 = 0.4 BasketItems 1 Bread, Milk 2 Bread, Diapers, Beer, Eggs 3 Milk, Diapers, Beer, Coke 4 Bread, Milk, Diapers, Beer 5 Bread, Milk, Diapers, Coke X Y 2 baskets have milk, Coke, and diapers 5 baskets total
Confidence Confidence is the strength of the association Measures how often items in Y appear in transactions that contain X Given we have one thing, what is our confidence that we will see another thing? BasketItems 1 Bread, Milk 2 Bread, Diapers, Beer, Eggs 3 Milk, Diapers, Beer, Coke 4 Bread, Milk, Diapers, Beer 5 Bread, Milk, Diapers, Coke The first says: when you have milk and diapers in the itemset, 67% of the time you also have Coke! c must be between 0 and 1 1 is a complete association 0 is no association What does the second say?
{Milk, Diapers} N = 3 {Coke} N = 2 67% of Total Venn Diagram Representation
{Milk, Diapers} N = 3 {Coke} N = 2 100% of Total Venn Diagram Representation
Some sample rules Association RuleSupport (s)Confidence (c) {Milk, Diapers} {Beer} 2/5 = 0.42/3 = 0.67 {Milk,Beer} {Diapers} 2/5 = 0.42/2 = 1.0 {Diapers,Beer} {Milk} 2/5 = 0.42/3 = 0.67 {Beer} {Milk,Diapers} 2/5 = 0.42/3 = 0.67 {Diapers} {Milk,Beer} 2/5 = 0.42/4 = 0.5 {Milk} {Diapers,Beer} 2/5 = 0.42/4 = 0.5 BasketItems 1 Bread, Milk 2 Bread, Diapers, Beer, Eggs 3 Milk, Diapers, Beer, Coke 4 Bread, Milk, Diapers, Beer 5 Bread, Milk, Diapers, Coke All the above rules are binary partitions of the same itemset: {Milk, Diapers, Beer}
Amazon Recommendation Agent People who bought this, also bought this May be bidirectional association, or it may be unidirectional Why? What does this mean?
But don’t blindly follow the numbers Rules originating from the same itemset (X Y) will have identical support Since the total elements (X Y) are always the same But they can have different confidence Depending on what is contained in X and Y High confidence suggests a strong association But this can be deceptive Consider {Bread} {Diapers} Support for the total itemset is 0.6 (3/5) And confidence is 0.75 (3/4) – pretty high But is this just because both are frequently occurring items (s=0.8)? BasketItems 1 Bread, Milk 2 Bread, Diapers, Beer, Eggs 3 Milk, Diapers, Beer, Coke 4 Bread, Milk, Diapers, Beer 5 Bread, Milk, Diapers, Coke
Real Association or Coincidence? The Question… Is the association real (e.g., causal) or is it due to chance? Confidence can help you get at direction of association… c = {HouseFire Firemen} = 1.00 c = {Fireman HouseFire} = 0.10 But, How Can We Tell Co-occurrence from Real Association? E.g., Complementary Goods or Similar Goods? Bread & Butter: one increases the probability of the other Bread & Milk: both are just high probability items
Lift Takes into account how co-occurrence differs from what is expected by chance i.e., if items were selected independently from one another Independent events (no interdependence between A, B)… Based on the support metric Support for total itemset X and Y Support for X times support for Y
Lift Example What’s the lift of the association rule {Milk, Diapers} {Beer} So X = {Milk, Diapers} and Y = {Beer} s({Milk, Diapers, Beer}) = 2/5 = 0.4 s({Milk, Diapers}) = 3/5 = 0.6 s({Beer}) = 3/5 = 0.6 So BasketItems 1 Bread, Milk 2 Bread, Diapers, Beer, Eggs 3 Milk, Diapers, Beer, Coke 4 Bread, Milk, Diapers, Beer 5 Bread, Milk, Diapers, Coke When Lift > 1, the occurrence of X Y together is more likely than what you would expect by chance
Another example Checking Account Savings Account NoYes No Yes Are people more likely to have a checking account if they have a savings account? Support ({Savings} {Checking}) = 5000/10000 = 0.5 Support ({Savings}) = 6000/10000 = 0.6 Support ({Checking}) = 8500/10000 = 0.85 Confidence ({Savings} {Checking}) = 5000/6000 = 0.83 Answer: No In fact, it’s slightly less than what you’d expect by chance!
Selecting the rules We know how to calculate the measures for each rule Support Confidence Lift Then we set up thresholds for the minimum rule strength we want to accept For support – called minsup For confidence – called minconf The steps List all possible association rules Compute the support and confidence for each rule Drop rules that don’t make the minsup and minconf thresholds Use lift to double- check the association
But this can be overwhelming Imagine all the combinations possible in your local grocery store Every product matched with every combination of other products Tens of thousands of possible rule combinations So where do you start?
Once you are confident in a rule, take action {Milk, Diapers} {Beer} Possible Marketing Actions Lower the price of milk and diapers, raise it on beer Put beer next to one of those products in the store Put beer in a special section of the store away from milk and diapers Create “New Parent Coping Kits” of beer, milk, and diapers Target new beers to people who buy milk and diapers