Download presentation
Published byWilfred Hubbard Modified over 9 years ago
1
Charles Tappert Seidenberg School of CSIS, Pace University
Data Science and Big Data Analytics Chap 5: Adv Analytical Theory and Methods: Association Rules Charles Tappert Seidenberg School of CSIS, Pace University
2
Chapter Sections 5.1 Overview 5.2 Apriori Algorithm
5.3 Evaluation of Candidate Rules 5.4 Example: Transactions in a Grocery Store 5.5 Validation and Testing 5.6 Diagnostics
3
5.1 Overview Association rules method
Unsupervised learning method Descriptive (not predictive) method Used to find hidden relationships in data The relationships are represented as rules Questions association rules might answer Which products tend to be purchased together What products do similar customers tend to buy
4
5.1 Overview Example – general logic of association rules
5
5.1 Overview Rules have the form X -> Y Itemset
When X is observed, Y is also observed Itemset Collection of items or entities k-itemset = {item 1, item 2,…,item k} Examples Items purchased in one transaction Set of hyperlinks clicked by a user in one session
6
5.1 Overview – Apriori Algorithm
Apriori is the most fundamental algorithm Given itemset L, support of L is the percent of transactions that contain L Frequent itemset – items appear together “often enough” Minimum support defines “often enough” (% transactions) If an itemset is frequent, then any subset is frequent
7
5.1 Overview – Apriori Algorithm
If {B,C,D} frequent, then all subsets frequent
8
5.2 Apriori Algorithm Frequent = minimum support
Bottom-up iterative algorithm Identify the frequent (min support) 1-itemsets Frequent 1-itemsets are paired into 2-itemsets, and the frequent 2-itemsets are identified, etc. Definitions for next slide D = transaction database d = minimum support threshold N = maximum length of itemset (optional parameter) Ck = set of candidate k-itemsets Lk = set of k-itemsets with minimum support
9
5.2 Apriori Algorithm
10
5.3 Evaluation of Candidate Rules Confidence
Frequent itemsets can form candidate rules Confidence measures the certainty of a rule Minimum confidence – predefined threshold Problem with confidence Given a rule X->Y, confidence considers only the antecedent (X) and the co-occurrence of X and Y Cannot tell if a rule contains true implication
11
5.3 Evaluation of Candidate Rules Lift
Lift measures how much more often X and Y occur together than expected if statistically independent Lift = 1 if X and Y are statistically independent Lift > 1 indicates the degree of usefulness of the rule Example – in 1000 transactions, If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400, then Lift(milk->eggs) = 0.3/(0.5*0.4) = 1.5 If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Lift(milk->bread) = 0.4/(0.5*0.4) = 2.0
12
5.3 Evaluation of Candidate Rules Leverage
Leverage measures the difference in the probability of X and Y appearing together compared to statistical independence Leverage = 0 if X and Y are statistically independent Leverage > 0 indicates degree of usefulness of rule Example – in 1000 transactions, If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400, then Leverage(milk->eggs) = *0.4 = 0.1 If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Leverage (milk->bread) = *0.4 = 0.2
13
5.4 Applications of Association Rules
The term market basket analysis refers to a specific implementation of association rules For better merchandising – products to include/exclude from inventory each month Placement of products within related products Association rules also used for Recommender systems – Amazon, Netflix Clickstream analysis from web usage log files Website visitors to page X click on links A,B,C more than on links D,E,F
14
5.5 Example: Grocery Store Transactions 5.5.1 The Groceries Dataset
> 5.5 Example: Grocery Store Transactions The Groceries Dataset Packages -> Install -> arules, arulesViz # don’t enter next line > install.packages(c("arules", "arulesViz")) # appears on console > library('arules') > library('arulesViz') > data(Groceries) > summary(Groceries) # indicates 9835 rows Class of dataset Groceries is transactions, containing 3 slots transactionInfo # data frame with vectors having length of transactions itemInfo # data frame storing item labels data # binary evidence matrix of labels in transactions > > "))
15
> summary(itemsets) # found 59 itemsets> inspect(head(sort(itemsets,by="support"),10)) # lists top 10 supported items 5.5 Example: Grocery Store Transactions Frequent Itemset Generation To illustrate the Apriori algorithm, the code below does each iteration separately. Assume minimum support threshold = 0.02 (0.02 * 9853 = 198 items), get 122 itemsets total First, get itemsets of length 1 > itemsets<-apriori(Groceries,parameter=list(minlen=1,maxlen=1,support=0.02,target="frequent itemsets")) > summary(itemsets) # found 59 itemsets > inspect(head(sort(itemsets,by="support"),10)) # lists top 10 Second, get itemsets of length 2 > itemsets<-apriori(Groceries,parameter=list(minlen=2,maxlen=2,support=0.02,target="frequent itemsets")) > summary(itemsets) # found 61 itemsets Third, get itemsets of length 3 > itemsets<-apriori(Groceries,parameter=list(minlen=3,maxlen=3,support=0.02,target="frequent itemsets")) > summary(itemsets) # found 2 itemsets
16
5. 5 Example: Grocery Store Transactions 5. 5
5.5 Example: Grocery Store Transactions Rule Generation and Visualization The Apriori algorithm will now generate rules. Set minimum support threshold to (allows more rules, presumably for the scatterplot) and minimum confidence threshold to 0.6 to generate 2,918 rules. > rules <- apriori(Groceries,parameter=list(support=0.001,confidence=0.6,target="rules")) > summary(rules) # finds 2918 rules > plot(rules) # displays scatterplot The scatterplot shows that the highest lift occurs at a low support and a low confidence.
17
> plot(rules) > plot(rules) 5.5 Example: Grocery Store Transactions Rule Generation and Visualization
18
5. 5 Example: Grocery Store Transactions 5. 5
5.5 Example: Grocery Store Transactions Rule Generation and Visualization Get scatterplot matrix to compare the support, confidence, and lift of the 2918 rules > # displays scatterplot matrix Lift is proportional to confidence with several linear groupings. Note that Lift = Confidence/Support(Y), so when support of Y remains the same, lift is proportional to confidence and the slope of the linear trend is the reciprocal of Support(Y).
19
> plot(rules) > plot(rules) 5.5 Example: Grocery Store Transactions Rule Generation and Visualization
20
5. 5 Example: Grocery Store Transactions 5. 5
5.5 Example: Grocery Store Transactions Rule Generation and Visualization Compute the 1/Support(Y) which is the slope > slope<- Display the number of times each slope appears in dataset > unlist(lapply(split(slope,f=slope),length)) Display the top 10 rules sorted by lift > inspect(head(sort(rules,by="lift"),10)) Rule {Instant food products, soda} -> {hamburger meat} has the highest lift of 19 (page 154)
21
5. 5 Example: Grocery Store Transactions 5. 5
5.5 Example: Grocery Store Transactions Rule Generation and Visualization Find the rules with confidence above 0.9 > confidentRules<-rules[quality(rules)$confidence>0.9] > confidentRules # set of 127 rules Plot a matrix-based visualization of the LHS v RHS of rules > plot(confidentRules,method="matrix",measure=c("lift","confid ence"),control=list(reorder=TRUE)) The legend on the right is a color matrix indicating the lift and the confidence to which each square in the main matrix corresponds
22
> plot(rules) > plot(rules) 5.5 Example: Grocery Store Transactions Rule Generation and Visualization
23
5. 5 Example: Grocery Store Transactions 5. 5
5.5 Example: Grocery Store Transactions Rule Generation and Visualization Visualize the top 5 rules with the highest lift. > highLiftRules<-head(sort(rules,by="lift"),5) > plot(highLiftRules,method="graph",control=list(type="items")) In the graph, the arrow always points from an item on the LHS to an item on the RHS. For example, the arrows that connects ham, processed cheese, and white bread suggest the rule {ham, processed cheese} -> {white bread} Size of circle indicates support and shade represents lift
24
5. 5 Example: Grocery Store Transactions 5. 5
5.5 Example: Grocery Store Transactions Rule Generation and Visualization
25
5.6 Validation and Testing
The frequent and high confidence itemsets are found by pre- specified minimum support and minimum confidence levels Measures like lift and/or leverage then ensure that interesting rules are identified rather than coincidental ones However, some of the remaining rules may be considered subjectively uninteresting because they don’t yield unexpected profitable actions E.g., rules like {paper} -> {pencil} are not interesting/meaningful Incorporating subjective knowledge requires domain experts Good rules provide valuable insights for institutions to improve their business operations
26
5.7 Diagnostics Although minimum support is pre-specified in phases 3&4, this level can be adjusted to target the range of the number of rules – variants/improvements of Apriori are available For large datasets the Apriori algorithm can be computationally expensive – efficiency improvements Partitioning Sampling Transaction reduction Hash-based itemset counting Dynamic itemset counting
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.