Market Basket Analysis and Association Rules Lantz Ch 8


Market Basket Analysis and Association Rules, Lantz Ch 8 (Wk 4, Part 2)

Left – The goal of computerized market basket analysis is to discover non-intuitive associations, the knowledge of which can improve profits. (Image from http://www.allanalytics.com/author.asp?section_id=2037)

How does it work? Usually you need some way of identifying the groups you might want to separate, like the IDs of the different products in a market basket analysis. If we don't have those concepts to work with, the separation problem is much trickier. (Image from http://en.wikipedia.org/wiki/Cluster_analysis. What if you don't have names for these 3 different groups of data?)

The grocery store is a good example. It is practical to know more about buying patterns, and easy to picture the underlying rationale: {peanut butter, jelly} → {bread}. Most often we are searching for associations that will be used later for some purpose – a fishing expedition. This is unsupervised learning (no dependent variable, per se); humans eyeball the results for usefulness.

Typical example – fraud detection. Where are unusual relationships repeated? E.g., store associate X has a lot more cash "voids" than average, and supervisor Y approves a high percentage of these. Voided transactions are a known source of "opportunity crimes" by retail clerks, because liquid assets change hands.

Another example – "intrusion detection" systems try to help the humans who are monitoring a system notice anomalies in the patterns: e.g., pay-per-view movie rentals, network security, ship movements. The analyst's dilemma – is this "normal"? (Image and examples from http://www.ercim.eu/publication/Ercim_News/enw56/holst.html.)

Basic principle – ignore the unimportant. In analyzing network performance at AT&T, we had to rule out most of the data that devices might report. For example, how many errors does a network device have to throw off, in a certain period of time, before we care? Probably a lot!

The Apriori algorithm uses prior beliefs about the properties of common itemsets, much as an analyst applies logic and subject-matter experience to what they see. Larger itemsets (e.g., associations) only deserve to be searched if their subsets are frequent: [A, B] is only interesting to look at if [A] and [B] are both frequently occurring.
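This "downward closure" property is just counting – a basket that contains {A, B} necessarily contains {A} and {B} separately. A quick sanity check on made-up numbers (hypothetical counts, not from Lantz):

  # Hypothetical counts in 100 baskets: {A} in 40, {B} in 30.
  # {A, B} can never appear more often than its rarest subset, so if
  # {A} or {B} fails the support threshold, {A, B} must fail it too.
  count_A <- 40; count_B <- 30
  max_count_AB <- min(count_A, count_B)   # {A, B} occurs at most 30 times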

How to use Apriori (p. 246)

Strengths:
Is ideally suited for working with very large amounts of transactional data.
Results in rules that are easy to understand.
Useful for "data mining" and discovering unexpected knowledge in databases.

Weaknesses:
Not very helpful for small datasets.
Takes effort to separate the insight from common sense.
Easy to draw spurious conclusions from random patterns.

Measuring rule interest. Apriori could come up with a huge number of proposed rules. Whether or not an itemset or rule is "interesting" is determined by:
Support – how frequently it occurs in the data.
Confidence – its predictive power or accuracy: the confidence that X leads to Y is the support for X and Y occurring together divided by the support for X alone, i.e. confidence(X → Y) = support(X, Y) / support(X).
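To make these definitions concrete, here is a minimal sketch in base R that computes them by hand (the toy transactions are made up for illustration, not Lantz's data):

  # Toy transactions (hypothetical)
  transactions <- list(
    c("peanut butter", "jelly", "bread"),
    c("peanut butter", "jelly"),
    c("bread", "milk"),
    c("peanut butter", "jelly", "bread", "milk")
  )

  # support(itemset) = fraction of transactions containing every item in the set
  support <- function(itemset)
    mean(sapply(transactions, function(t) all(itemset %in% t)))

  # confidence(X -> Y) = support(X and Y together) / support(X)
  confidence <- function(lhs, rhs) support(c(lhs, rhs)) / support(lhs)

  support(c("peanut butter", "jelly"))              # 0.75 (3 of 4 baskets)
  confidence(c("peanut butter", "jelly"), "bread")  # 0.667 (2 of those 3 also have bread)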

What Apriori does: (1) Identify all the itemsets that meet a minimum support threshold; in succeeding iterations, it combines frequent itemsets into larger and larger candidate itemsets. (2) Create rules from these itemsets that meet a minimum confidence threshold; at some point it begins considering candidate rules like [A] → [B], evaluated against that threshold.
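As a rough sketch of that level-wise search (illustrative only – the arules package's implementation is far more efficient), runnable on the toy transactions list from the previous sketch:

  # Level-wise frequent-itemset search, illustrative only
  apriori_sketch <- function(transactions, minsup) {
    support <- function(s) mean(sapply(transactions, function(t) all(s %in% t)))
    items <- sort(unique(unlist(transactions)))
    frequent <- as.list(items[sapply(items, function(i) support(i) >= minsup)])
    all_frequent <- frequent
    k <- 2
    while (length(frequent) > 1) {
      # Join pairs of frequent (k-1)-itemsets to form size-k candidates.
      # (A full Apriori would also prune candidates with any infrequent subset.)
      pairs <- combn(seq_along(frequent), 2, simplify = FALSE)
      cands <- unique(lapply(pairs, function(p) sort(union(frequent[[p[1]]], frequent[[p[2]]]))))
      cands <- Filter(function(s) length(s) == k, cands)
      frequent <- Filter(function(s) support(s) >= minsup, cands)
      all_frequent <- c(all_frequent, frequent)
      k <- k + 1
    }
    all_frequent   # every itemset meeting the minimum support
  }

  apriori_sketch(transactions, minsup = 0.5)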

Lantz's example – frequently purchased groceries. The lists in groceries.csv are grocery transactions of varying length, listing the items bought. We need a sparse matrix to put these items into: a column in each transaction for every item that might possibly appear. Lantz's data has 169 different items. (A large Walmart typically has 1,000,000 items, though not that many different types of items.) A sparse matrix is much more memory-efficient than a dense one full of zeroes.

Preparing the data:

> library(arules)
…
> groceries <- read.transactions("/Users/chenowet/Documents/Rstuff/groceries.csv", sep = ",")
> summary(groceries)
transactions as itemMatrix in sparse format with
 9835 rows (elements/itemsets/transactions) and
 169 columns (items) and a density of 0.02609146

most frequent items:
      whole milk other vegetables       rolls/buns             soda           yogurt
            2513             1903             1809             1715             1372
         (Other)
           34055

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46   29
  18   19   20   21   22   23   24   26   27   28   29   32
  14   14    9   11    4    6    1    1    1    1    3    1

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   3.000   4.409   6.000  32.000

includes extended item information - examples:
            labels
1 abrasive cleaner
2 artif. sweetener
3   baby cosmetics
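You can also peek at the raw baskets themselves: arules' inspect() works on transactions objects as well as on rules (a small usage sketch; output omitted):

> inspect(groceries[1:3])   # prints the first three baskets as lists of items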

Can inspect support for each item:

> itemFrequency(groceries[, 1:3])
abrasive cleaner artif. sweetener   baby cosmetics
    0.0035587189     0.0032536858     0.0006100661

> itemFrequencyPlot(groceries, support = 0.1)
> itemFrequencyPlot(groceries, topN = 20)
> image(groceries[1:5])
> image(sample(groceries, 100))

← See next 3 slides

Training a model on the data:

> apriori(groceries)
Parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE     0.1      1     10  rules FALSE
…
set of 0 rules

Using default confidence and support, we get no rules!
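The empty result makes sense if you do the arithmetic (a quick check, using numbers from the summary above):

> 0.1 * 9835   # default support 0.1 means an itemset must appear in ~984 baskets
[1] 983.5

Only a handful of single items (e.g., whole milk at 2513) are that common, and the default confidence of 0.8 is stricter still.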

What's a reasonable support level? Lantz argues that you can reason about this. E.g., if a pattern occurs in grocery transactions twice a day, it may be interesting to us. The dataset represents 30 days of data, so that would be 60 occurrences out of the 9835 transactions represented – about 0.006 as a trial support level. (p. 259)
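In R, the same back-of-the-envelope calculation:

> (2 * 30) / 9835   # twice a day for 30 days, out of 9835 transactions
[1] 0.006100661

which is where the trial value of 0.006 comes from.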

What's a reasonable confidence level? Lantz discusses batteries as an example. Set it too high and you'll only get the association with smoke detectors; set it too low and you'll get every item people commonly buy by chance, like celery. The goal is probably to learn "what to place next to the batteries." He suggests starting with 0.25, then lowering it if you are getting only obvious results.

Revised training model:

> groceryrules <- apriori(groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
Parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
       0.25    0.1    1 none FALSE            TRUE   0.006      2     10  rules FALSE
…
set of 463 rules

We got something this time!

More from the apriori call…

rule length distribution (lhs + rhs): sizes
  2   3   4
150 297  16

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   2.000   3.000   2.711   3.000   4.000

summary of quality measures:
    support           confidence          lift
 Min.   :0.006101   Min.   :0.2500   Min.   :0.9932
 1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:1.6229
 Median :0.008744   Median :0.3554   Median :1.9332
 Mean   :0.011539   Mean   :0.3786   Mean   :2.0351
 3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:2.3565
 Max.   :0.074835   Max.   :0.6600   Max.   :3.9565

E.g., {peanut butter, jelly} → {bread} is length 3.
lift(X → Y) = confidence(X → Y) / support(Y), and lift(X → Y) = lift(Y → X).

And what rules did we get? Actionable? Trivial? Inexplicable?

> inspect(groceryrules[1:3])
  lhs                rhs                 support     confidence lift
1 {potted plants} => {whole milk}        0.006914082 0.4000000  1.565460
2 {pasta}         => {whole milk}        0.006100661 0.4054054  1.586614
3 {herbs}         => {root vegetables}   0.007015760 0.4312500  3.956477
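You can verify the reported lift by hand (a quick check using counts from the earlier summary(groceries) output):

> 2513 / 9835                # support(whole milk): 2513 of 9835 baskets
[1] 0.255516
> 0.4000000 / (2513 / 9835)  # confidence / support(rhs) for rule 1
[1] 1.56546

matching the lift reported for {potted plants} => {whole milk}.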

Improving model performance: how about sorting the rules, to make them easier to analyze?

> inspect(sort(groceryrules, by = "lift")[1:5])
  lhs                     rhs                    support     confidence lift
1 {herbs}              => {root vegetables}      0.007015760 0.4312500  3.956477
2 {berries}            => {whipped/sour cream}   0.009049314 0.2721713  3.796886
3 {other vegetables,
   tropical fruit,
   whole milk}         => {root vegetables}      0.007015760 0.4107143  3.768074

More interesting?

Or, how about taking subsets of the rules?

> berryrules <- subset(groceryrules, items %in% "berries")
> inspect(berryrules)
  lhs          rhs                    support     confidence lift
1 {berries} => {whipped/sour cream}   0.009049314 0.2721713  3.796886
2 {berries} => {yogurt}               0.010574479 0.3180428  2.279848
3 {berries} => {other vegetables}     0.010269446 0.3088685  1.596280
4 {berries} => {whole milk}           0.011794611 0.3547401  1.388328
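arules' subset() supports several matching operators besides %in% (a sketch; the particular filters and thresholds here are arbitrary examples, not from Lantz):

> inspect(subset(groceryrules, lhs %in% "berries" & lift > 2))        # match on lhs only
> inspect(subset(groceryrules, items %pin% "vegetables"))             # partial name match
> inspect(subset(groceryrules, items %ain% c("berries", "yogurt")))   # must contain all listed items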

Can save rules for future analysis, etc.:

> write(groceryrules, file = "/Users/chenowet/Documents/Rstuff/groceryrules.csv", sep = ",", quote = TRUE, row.names = FALSE)
> groceryrules_df <- as(groceryrules, "data.frame")
> str(groceryrules_df)
'data.frame': 463 obs. of 4 variables:
 $ rules     : Factor w/ 463 levels "{baking powder} => {other vegetables}",..: 340 302 207 206 208 341 402 21 139 140 ...
 $ support   : num 0.00691 0.0061 0.00702 0.00773 0.00773 ...
 $ confidence: num 0.4 0.405 0.431 0.475 0.475 ...
 $ lift      : num 1.57 1.59 3.96 2.45 1.86 ...