Market Basket Analysis and Association Rules


MARKET BASKET ANALYSIS, FREQUENT ITEMSETS, ASSOCIATION RULES, APRIORI ALGORITHM, OTHER ALGORITHMS

Market Basket Analysis and Association Rules

Market Basket Analysis studies characteristics or attributes that "go together". It seeks to uncover associations between two or more attributes. Association rules have the form: IF antecedent THEN consequent.
For example, of 1,000 customers shopping, 200 bought milk, and of those 200 buying milk, 50 also bought bread. Thus, the rule "If buy milk, then buy bread" has support = 50/1,000 = 5% and confidence = 50/200 = 25%.
◦ Support is the number of records containing both antecedent and consequent, divided by the total number of records
◦ Confidence is the number of records containing both antecedent and consequent, divided by the number of records containing the antecedent
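As a quick check, the example's numbers can be plugged straight into these definitions (a minimal sketch; the variable names are illustrative):

```python
# Counts from the example: 1,000 customers, 200 bought milk,
# and 50 of those also bought bread.
total_customers = 1000
bought_milk = 200            # transactions containing the antecedent
bought_milk_and_bread = 50   # transactions containing antecedent and consequent

support = bought_milk_and_bread / total_customers  # 50 / 1000
confidence = bought_milk_and_bread / bought_milk   # 50 / 200

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# prints: support = 5%, confidence = 25%
```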

Market Basket Analysis (cont'd)

Applications:
Investigating the proportion of subscribers to a cell phone plan who respond to an offer for a service upgrade.
Examining the proportion of children whose parents read to them who are themselves good readers.
Finding out which items are purchased together in a supermarket.
Challenges:
Curse of dimensionality: the number of rules grows exponentially in the number of attributes. With k binary attributes, and only positive cases considered, there are k · 2^(k−1) possible association rules.

Market Basket Analysis (cont'd)

The A Priori algorithm reduces the search problem to a manageable size by leveraging the rule structure to its advantage.
Example: Consider a farmer selling crops at a roadside stand. Seven items are available for purchase, in the set I = {asparagus, beans, broccoli, corn, green peppers, squash, tomatoes}. Customers purchase different subsets of I.

Transaction  Items Purchased
1  Broccoli, green peppers, corn
2  Asparagus, squash, corn
3  Corn, tomatoes, beans, squash
4  Green peppers, corn, tomatoes, beans
5  Beans, asparagus, broccoli
6  Squash, asparagus, beans, tomatoes
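The six transactions can be written as Python sets to count itemset occurrences directly (an illustrative sketch, not part of the original slides):

```python
# The roadside-stand transactions from the table above.
transactions = [
    {"broccoli", "green peppers", "corn"},
    {"asparagus", "squash", "corn"},
    {"corn", "tomatoes", "beans", "squash"},
    {"green peppers", "corn", "tomatoes", "beans"},
    {"beans", "asparagus", "broccoli"},
    {"squash", "asparagus", "beans", "tomatoes"},
]

def itemset_frequency(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)

print(itemset_frequency({"beans", "squash"}, transactions))  # 2 (transactions 3 and 6)
```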

Support, Confidence, Frequent Itemsets, and the A Priori Property (cont'd)

Let D = the set of transactions {T1, T2, ..., T6} in the previous table. Each T represents a set of items contained in I. Suppose the itemset A = {beans, squash} and B = {asparagus}.
An association rule has the form IF A THEN B, written A → B; here, IF {beans, squash} THEN {asparagus}.
A and B are proper subsets of I, and A and B are mutually exclusive. Therefore, by definition, rules such as IF {beans, squash} THEN {beans} are excluded.

Support, Confidence, Frequent Itemsets, and the A Priori Property

◦ Support for the association rule A → B is the proportion of transactions in D containing both A and B:
Support = p(A ∩ B) = (number of transactions containing both A and B) / (total number of transactions)
◦ Confidence for the association rule A → B measures the rule's accuracy. It is determined by the percentage of transactions in D containing A that also contain B:
Confidence = p(B|A) = p(A ∩ B) / p(A) = (number of transactions containing both A and B) / (number of transactions containing A)
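The two formulas translate directly into code. Applied to the six roadside-stand transactions and the rule IF {beans, squash} THEN {asparagus}, they give support 1/6 and confidence 1/2 (an illustrative sketch; the function names are not from the slides):

```python
transactions = [
    {"broccoli", "green peppers", "corn"},
    {"asparagus", "squash", "corn"},
    {"corn", "tomatoes", "beans", "squash"},
    {"green peppers", "corn", "tomatoes", "beans"},
    {"beans", "asparagus", "broccoli"},
    {"squash", "asparagus", "beans", "tomatoes"},
]

def support(A, B, D):
    """Fraction of transactions in D containing both A and B."""
    return sum(1 for t in D if A <= t and B <= t) / len(D)

def confidence(A, B, D):
    """Of the transactions containing A, the fraction that also contain B."""
    containing_A = [t for t in D if A <= t]
    return sum(1 for t in containing_A if B <= t) / len(containing_A)

A, B = {"beans", "squash"}, {"asparagus"}
print(support(A, B, transactions), confidence(A, B, transactions))
# support 1/6: only transaction 6 has all three items;
# confidence 1/2: of the two transactions with beans and squash, one has asparagus
```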

Support, Confidence, Frequent Itemsets, and the A Priori Property

Rules are often preferred that have high support, high confidence, or both. Strong rules are those that meet a specified support and/or confidence threshold. For example, an analyst may look for supermarket items purchased together with minimum support = 20% and minimum confidence = 70%. However, fraud-detection analysts may set the minimum support much lower, to 1% or less, because very few transactions are fraud-related.

Support, Confidence, Frequent Itemsets, and the A Priori Property

An itemset is a set of items contained in I; a k-itemset contains k items. For example, {beans, squash} is a 2-itemset from the roadside-stand set I. The itemset frequency is the number of transactions containing the specific itemset. A frequent itemset is one whose frequency is greater than or equal to a minimum threshold ϕ, i.e., itemset frequency ≥ ϕ. We denote the set of frequent k-itemsets as Fk.

Support, Confidence, Frequent Itemsets, and the A Priori Property

Mining association rules is a two-step process:
(1) Find all frequent itemsets, i.e., those with itemset frequency ≥ ϕ.
(2) From the list of frequent itemsets, generate association rules satisfying the minimum support and confidence criteria.
A Priori Property: if an itemset Z is not frequent, then for any item A, Z ∪ {A} is not frequent either. In other words, no superset of Z (no itemset containing Z) will be frequent. The A Priori algorithm uses this property to significantly reduce the search space.
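The A Priori property is typically applied in reverse during candidate generation: a k-itemset can be pruned whenever any of its (k−1)-subsets is not frequent. A small sketch (the function and variable names are illustrative):

```python
from itertools import combinations

def can_prune(candidate, frequent_prev):
    """True if some (k-1)-subset of the candidate is not frequent,
    so by the A Priori property the candidate cannot be frequent."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))

# Suppose these 2-itemsets were found frequent:
frequent_2 = {frozenset({"corn", "tomatoes"}),
              frozenset({"corn", "beans"}),
              frozenset({"beans", "tomatoes"})}

print(can_prune(frozenset({"corn", "tomatoes", "beans"}), frequent_2))   # False: keep
print(can_prune(frozenset({"corn", "tomatoes", "squash"}), frequent_2))  # True: prune
```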

APRIORI ALGORITHM

Apriori is a classical algorithm in data mining, used for mining frequent itemsets and the relevant association rules. It is devised to operate on a database containing many transactions.
Principle of Apriori: if an itemset is frequent, then all of its nonempty subsets must also be frequent.

ALGORITHM
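A minimal, level-wise Python sketch of the procedure (illustrative only; this is not the slides' original pseudocode, and production code would also apply the subset-pruning step from the A Priori property before counting):

```python
def apriori(transactions, min_count):
    """Return {frozenset: count} for all itemsets appearing in at least
    min_count transactions, found level by level."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemsets):
        return {s: sum(1 for t in transactions if s <= t) for s in itemsets}

    # Level 1: frequent single items
    items = {frozenset([i]) for t in transactions for i in t}
    current = {s: c for s, c in count(items).items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        keys = list(current)
        # Join step: unions of frequent (k-1)-itemsets that form k-itemsets
        candidates = {a | b for i, a in enumerate(keys)
                      for b in keys[i + 1:] if len(a | b) == k}
        current = {s: c for s, c in count(candidates).items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent

# The roadside-stand transactions, with a minimum count of 2:
stand = [
    {"broccoli", "green peppers", "corn"}, {"asparagus", "squash", "corn"},
    {"corn", "tomatoes", "beans", "squash"},
    {"green peppers", "corn", "tomatoes", "beans"},
    {"beans", "asparagus", "broccoli"},
    {"squash", "asparagus", "beans", "tomatoes"},
]
result = apriori(stand, min_count=2)
print(result[frozenset({"beans", "tomatoes"})])  # 3: transactions 3, 4, and 6
```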

APPLICATIONS

The Apriori algorithm is used in examining drug-drug interactions and in finding Adverse Drug Reactions (ADRs). It is used in finding associations between the diabetic conditions of people. Mobile e-commerce sites can make use of it to improve their product recommendations.

Pros and Cons

Pros: Apriori is an easy-to-implement and easy-to-understand algorithm, and it can be applied to large datasets.
Cons: Generating a large number of candidate rules can be computationally expensive, and calculating support is also expensive because it requires passing through the entire database.

Process of Rule Selection

Generate all rules that meet the specified support and confidence:
First, find the frequent item sets (those with sufficient support). Support → the proportion of transactions in which an item set appears.
Then, from these item sets, generate rules with sufficient confidence. Confidence → how often the if/then statement has been found to be true, as a proportion of the transactions containing the antecedent.
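The second step, generating high-confidence rules from the frequent item sets, can be sketched as follows (illustrative; it assumes every subset of a frequent itemset is also present in the dictionary, which the A Priori property guarantees):

```python
from itertools import combinations

def rules_from(frequent, min_confidence):
    """frequent: {frozenset: support_count}. Returns (antecedent, consequent,
    confidence) triples for every rule meeting the confidence threshold."""
    out = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = count / frequent[antecedent]
                if conf >= min_confidence:
                    out.append((antecedent, itemset - antecedent, conf))
    return out

# Hypothetical counts: milk in 4 transactions, bread in 3, both in 3.
frequent = {frozenset({"milk"}): 4, frozenset({"bread"}): 3,
            frozenset({"milk", "bread"}): 3}
rules = rules_from(frequent, min_confidence=0.8)
print(rules)  # only {bread} -> {milk} survives, with confidence 3/3 = 1.0
```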

if/then…

So if/then can be associated with the two main components of association rules:
Antecedent → the item found in the dataset; can be viewed as the "if"
Consequent → the item found in combination with the antecedent; can be viewed as the "then"
E.g., if a customer buys bread, he/she is 80% likely to buy butter as well; if a customer buys a mouse, he/she is 95% likely to buy a keyboard.

Generating frequent itemsets: The Apriori Algorithm

For k products:
(1) Set the minimum support criterion.
(2) Generate the list of one-item sets that meet the support criterion.
(3) Use the list of one-item sets to generate the list of two-item sets that meet the support criterion.
(4) Use the list of two-item sets to generate the list of three-item sets that meet the support criterion.
(5) Continue up through k-item sets.
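The "use k-item sets to generate (k+1)-item sets" step is usually implemented as a join: union pairs of frequent itemsets and keep only results of the next size (a small illustrative sketch):

```python
def next_level_candidates(frequent_k, size):
    """Union pairs of frequent itemsets; keep only results with `size` items."""
    sets = list(frequent_k)
    return {a | b for i, a in enumerate(sets)
            for b in sets[i + 1:] if len(a | b) == size}

one_item = [frozenset({"milk"}), frozenset({"bread"}), frozenset({"butter"})]
two_item = next_level_candidates(one_item, 2)
print(len(two_item))  # 3 candidate pairs
```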

The Apriori Algorithm → Example

Support and Confidence

Support → the fraction of transactions that contain both X and Y
Confidence → measures how often items in Y appear in transactions that contain X

OTHER ALGORITHMS: FREQUENT PATTERN GROWTH ALGORITHM

A two-step approach:
Step I: Construct a compact data structure called the FP-tree, using two passes over the data set.
Step II: Extract frequent itemsets directly from the FP-tree by traversing it.

FP-TREE CONSTRUCTION

The FP-tree is constructed using two passes over the data set.
Pass I: From the set of given transactions, find the support for each item and sort the items in decreasing order of their support. In our example: d, b, e, a, c. Use this order when building the FP-tree, so common prefixes can be shared.

EXAMPLE TRANSACTIONS AND ITEM SUPPORT

TID  Items Bought
1  {a, b, d, e}
2  {b, c, d}
3  {a, b, d, e}
4  {a, c, d, e}
5  {b, c, d, e}
6  {b, d, e}
7  {c, d}
8  {a, b, c}
9  {a, d, e}
10  {b, d}

Item  Support
d  9
b  7
e  6
a  5
c  5

Support for each item

RE-ORDERING TRANSACTIONS BASED ON SUPPORT VALUE

TID  Items Bought → Reordered set
1  {a, b, d, e} → {d, b, e, a}
2  {b, c, d} → {d, b, c}
3  {a, b, d, e} → {d, b, e, a}
4  {a, c, d, e} → {d, e, a, c}
5  {b, c, d, e} → {d, b, e, c}
6  {b, d, e} → {d, b, e}
7  {c, d} → {d, c}
8  {a, b, c} → {b, a, c}
9  {a, d, e} → {d, e, a}
10  {b, d} → {d, b}
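Pass I can be sketched as counting item supports and re-sorting each transaction by that count (illustrative; TID 3 is blank in the transcript, so it is filled in here with {a, b, d, e}, the one set consistent with the slide's support table):

```python
from collections import Counter

transactions = [
    ['a', 'b', 'd', 'e'], ['b', 'c', 'd'],
    ['a', 'b', 'd', 'e'],  # TID 3: inferred; blank in the transcript
    ['a', 'c', 'd', 'e'], ['b', 'c', 'd', 'e'], ['b', 'd', 'e'],
    ['c', 'd'], ['a', 'b', 'c'], ['a', 'd', 'e'], ['b', 'd'],
]

# Count supports, then build a rank for decreasing-support order.
support = Counter(item for t in transactions for item in t)
order = sorted(support, key=lambda item: -support[item])
rank = {item: i for i, item in enumerate(order)}
reordered = [sorted(t, key=rank.__getitem__) for t in transactions]

print(order)         # ['d', 'b', 'e', 'a', 'c']
print(reordered[0])  # ['d', 'b', 'e', 'a'], matching the table
```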

FP-TREE CONSTRUCTION

insert_tree([p|P], T) {
    if (T has a child n where n.item = p)
        n.count = n.count + 1
    else {
        create a new child node n of T with n.item = p, n.count = 1
        link n into the node-link structure for item p (starting from the null root)
    }
    if (P is nonempty)
        insert_tree(P, n)
}
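A direct Python rendering of insert_tree (illustrative; the header-table node-links used later for mining are omitted for brevity):

```python
class FPNode:
    def __init__(self, item=None, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}  # item -> FPNode

def insert_tree(items, node):
    """Insert one support-ordered transaction [p | P] under `node`."""
    if not items:
        return
    p, rest = items[0], items[1:]
    if p in node.children:
        child = node.children[p]        # shared prefix: reuse the node
    else:
        child = FPNode(p, parent=node)  # new node linked under its parent
        node.children[p] = child
    child.count += 1
    insert_tree(rest, child)

root = FPNode()  # the null root
for t in [['d', 'b', 'e', 'a'], ['d', 'b', 'c'], ['d', 'b', 'e']]:
    insert_tree(t, root)
print(root.children['d'].count)  # 3: all three transactions share the prefix d
```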

FP-GROWTH TREE CONSTRUCTION AFTER REORDERING TRANSACTIONS

Each path represents one or more transactions, and nodes carry counts that track the original frequencies.

null
├── d:9
│   ├── b:6
│   │   ├── e:4
│   │   │   ├── a:2
│   │   │   └── c:1
│   │   └── c:1
│   ├── e:2
│   │   └── a:2
│   │       └── c:1
│   └── c:1
└── b:1
    └── a:1
        └── c:1

CONCEPT OF CONDITIONAL PATTERN BASE

Once the FP-tree is constructed, the next step is to traverse it to find all frequent itemsets for each item. For this we need the conditional pattern base for each item, starting from the frequent 1-itemsets. The conditional pattern base is defined as the set of prefix paths in the FP-tree that end at the suffix pattern. From the conditional pattern base, a conditional FP-tree is generated, which is mined recursively by the algorithm.

FREQUENT ITEMSET GENERATION BY MINING THE TREE

Suffix pattern: a
Conditional pattern base for a: (d, b, e : 2), (d, e : 2), (b : 1)

Item supports within the conditional pattern base:
d  4
b  3
e  4

Conditional FP-tree for a:
null
├── d:4
│   ├── b:2
│   │   └── e:2
│   └── e:2
└── b:1

Frequent itemsets for a (taking the minimum threshold to be 3):
{d, a} : 4, {e, a} : 4, {b, a} : 3, {d, e, a} : 4
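The conditional pattern base for suffix a can be reproduced directly from the reordered transactions (illustrative; TID 3 is again filled in as {d, b, e, a}, consistent with the support table):

```python
from collections import Counter

reordered = [
    ['d', 'b', 'e', 'a'], ['d', 'b', 'c'],
    ['d', 'b', 'e', 'a'],  # TID 3: inferred; blank in the transcript
    ['d', 'e', 'a', 'c'], ['d', 'b', 'e', 'c'], ['d', 'b', 'e'],
    ['d', 'c'], ['b', 'a', 'c'], ['d', 'e', 'a'], ['d', 'b'],
]

def conditional_pattern_base(suffix, reordered):
    """Prefix paths that precede the suffix item, with their counts."""
    base = Counter()
    for t in reordered:
        if suffix in t:
            base[tuple(t[:t.index(suffix)])] += 1
    return base

base = conditional_pattern_base('a', reordered)
print(dict(base))  # (d, b, e): 2, (d, e): 2, (b,): 1

# Item supports within the conditional pattern base
cond_support = Counter()
for path, n in base.items():
    for item in path:
        cond_support[item] += n
print(dict(cond_support))  # d: 4, b: 3, e: 4
```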

ADVANTAGES & DISADVANTAGES OF THE FP-GROWTH ALGORITHM

Advantages of FP-Growth:
Only two passes over the data set, versus the repeated database scans of Apriori.
Avoids candidate-set explosion by building a compact tree data structure (in Apriori, discovering a pattern of length 100 requires generating at least 2^100 candidate subsets).
Much faster than the Apriori algorithm.
Disadvantages of FP-Growth:
The FP-tree may not fit in memory.
The FP-tree is expensive to build. Trade-off: it takes time to build, but once it is built, frequent itemsets can be generated easily.