Frequent Patterns and Association Rules
CSCI N317 Computation for Scientific Applications
Unit 3 - 3 Weka

Frequent Pattern Analysis
Frequent pattern: a pattern (a set of items, subsequences, etc.) that occurs frequently in a data set
Motivation: finding inherent regularities in data
- What products were often purchased together? Beer and diapers?!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
Market Basket Analysis
- Analyze customers' buying habits by finding associations between the different items that they place in their “shopping baskets”
- Help design store layouts
  - Items frequently purchased together can be placed together to further encourage the sale of both items
  - Items placed at opposite ends of the store may entice customers to pick up other items along the way
- Help retailers plan which items to put on sale at a reduced price

Association Rules
Patterns can be represented in the form of association rules
E.g. computer => antivirus_software [support = 2%, confidence = 60%]
- Support: 2% of all the transactions under analysis show that a computer and antivirus software are purchased together
- Confidence: 60% of the customers who purchased a computer also bought the software

Basic Concepts: Frequent Patterns and Association Rules
TID   Items bought
10    I1, I2, I4
20    I1, I3, I4
30    I1, I4, I5
40    I2, I5, I6
50    I2, I3, I4, I5, I6
Itemset I = {I1, …, Im}
Find all the rules A => B with minimum support and confidence
- support, s: the probability that a transaction contains A ∪ B (every item in A and every item in B)
- confidence, c: the conditional probability that a transaction containing A also contains B
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
Let sup_min = 50%, conf_min = 50%
Association rules:
- I1 => I4 (60%, 100%)
- I4 => I1 (60%, 75%)
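
The two rules on this slide can be checked by hand or with a few lines of code. The following is a minimal Python sketch, not part of the course materials; the function names support_count, support, and confidence are illustrative choices. It counts how many of the five transactions contain each itemset and derives the percentages from those counts.

```python
# Transactions from the slide's table (TID -> items bought).
transactions = {
    10: {"I1", "I2", "I4"},
    20: {"I1", "I3", "I4"},
    30: {"I1", "I4", "I5"},
    40: {"I2", "I5", "I6"},
    50: {"I2", "I3", "I4", "I5", "I6"},
}


def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for items in transactions.values() if itemset <= items)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """P(B | A): of the transactions containing A, the fraction also containing B."""
    both = set(antecedent) | set(consequent)
    return support_count(both, transactions) / support_count(antecedent, transactions)


print(support({"I1", "I4"}, transactions))       # 0.6
print(confidence({"I1"}, {"I4"}, transactions))  # 1.0
print(confidence({"I4"}, {"I1"}, transactions))  # 0.75
```

I1 and I4 appear together in 3 of the 5 transactions (60% support). I1 appears in 3 transactions and I4 in 4, which gives the 100% and 75% confidence values.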

Basic Concepts: Frequent Patterns and Association Rules
Itemset I = {I1, …, Im}
Let A and B be sets of items
An association rule is an implication of the form A => B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
- Support(A => B) = P(A ∪ B)
- Confidence(A => B) = P(B | A) = support(A ∪ B) / support(A)
Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong
The occurrence frequency of an itemset is the number of transactions that contain the itemset; it is also called the frequency, support count, or count
If the support of an itemset satisfies the minimum support threshold, the itemset is a frequent itemset
Looking for rules with high support
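
To make the definitions of frequent itemset and strong rule concrete, here is a hedged Python sketch that applies them to the five transactions from the earlier Basic Concepts slide. It uses plain brute-force enumeration of every candidate itemset, which is only practical for tiny examples; it is not the Apriori algorithm that tools such as Weka or the arules package use. The function names frequent_itemsets and strong_rules are assumptions made for illustration.

```python
from itertools import combinations

# The five transactions from the earlier Basic Concepts slide.
transactions = [
    {"I1", "I2", "I4"},
    {"I1", "I3", "I4"},
    {"I1", "I4", "I5"},
    {"I2", "I5", "I6"},
    {"I2", "I3", "I4", "I5", "I6"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Map each itemset whose support meets min_support to its support (brute force)."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            s = support_count(candidate, transactions) / n
            if s >= min_support:
                result[candidate] = s
    return result


def strong_rules(transactions, min_support, min_confidence):
    """List (A, B, support, confidence) for every rule A => B meeting both thresholds."""
    rules = []
    for itemset, s in frequent_itemsets(transactions, min_support).items():
        # Split each frequent itemset into a non-empty antecedent and consequent.
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                consequent = itemset - antecedent
                conf = (support_count(itemset, transactions)
                        / support_count(antecedent, transactions))
                if conf >= min_confidence:
                    rules.append((antecedent, consequent, s, conf))
    return rules


for a, b, s, c in strong_rules(transactions, min_support=0.5, min_confidence=0.5):
    print(sorted(a), "=>", sorted(b), f"support={s:.0%}", f"confidence={c:.0%}")
```

With both thresholds at 50%, the only frequent itemset with more than one item is {I1, I4}, so the sketch reports exactly the two strong rules I1 => I4 and I4 => I1.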

Interestingness Measure: Correlations (Lift)
Strong rules are not necessarily interesting
E.g. of 10,000 transactions, 6,000 include computer games, 7,500 include videos, and 4,000 include both computer games and videos
Association rule generated: buys(X, “computer games”) => buys(X, “videos”) [support = 40%, confidence = 66%]
- The rule is misleading because the overall probability of purchasing videos is 75%, which is even larger than 66%
- In fact, computer games and videos are negatively associated: the purchase of one decreases the likelihood of purchasing the other
A correlation measure can be used to augment the support-confidence framework
A => B [support, confidence, correlation]

Interestingness Measure: Correlations (Lift)
Measure of dependent/correlated events: lift
lift(A, B) = P(A ∪ B) / (P(A) P(B))
- P(A ∪ B): the likelihood of purchasing both
- P(A) P(B): the likelihood of purchasing both if the two purchases were independent
- lift > 1: A and B are positively correlated
- lift < 1: A and B are negatively correlated
- lift = 1: A and B are independent
Looking for a rule with lift > 1

Interestingness Measure: Correlations (Lift)
Contingency table:
            Computer Game   No Computer Game   Total
Video       4,000           3,500              7,500
No Video    2,000           500                2,500
Total       6,000           4,000              10,000
lift = 0.40 / (0.60 × 0.75) ≈ 0.89, which is smaller than 1. The two items are negatively correlated, thus the rule is not interesting.
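
The lift for this example can be computed directly from the counts in the contingency table. This is a small Python sketch under the definition of lift as the probability of purchasing both items divided by P(A) P(B); the function name lift and its parameter names are illustrative, not taken from any library.

```python
def lift(n_total, n_a, n_b, n_both):
    """Lift of A and B from raw transaction counts.

    n_total: all transactions; n_a, n_b: transactions containing A, B;
    n_both: transactions containing both A and B.
    """
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_both = n_both / n_total
    return p_both / (p_a * p_b)


# Counts from the slide: 10,000 transactions, 6,000 with computer games,
# 7,500 with videos, 4,000 with both.
value = lift(n_total=10_000, n_a=6_000, n_b=7_500, n_both=4_000)
print(round(value, 2))  # 0.89
```

A value below 1 means that buying computer games makes a video purchase less likely than it is overall, even though the rule passes the support and confidence thresholds.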

Association Rule Mining in R and Weka
https://www.youtube.com/watch?v=Z4VZsF96QfU
https://www.youtube.com/watch?v=4J3gX4ySw1s
R
https://www.youtube.com/watch?v=b5hgDPa7a2k
https://www.youtube.com/watch?v=Gy_nqzJMNrI
- Install the “arules” package before following the examples in the videos. You may need to install the package from a local zip file.
- Sample code: dataMiningExample.R