Knowledge discovery & data mining Association rules and market basket analysis--introduction UCLA CS240A Course Notes*


Market Basket Analysis: the context
Market basket analysis studies customer buying habits by finding associations and correlations between the different items that customers place in their “shopping basket”.

Customer1: milk, eggs, sugar, bread
Customer2: milk, eggs, cereal, bread
Customer3: eggs, sugar

Market Basket Analysis: the context
Given: a database of customer transactions, where each transaction is a set of items.
Find: groups of items which are frequently purchased together.

Goal of MBA
Extract information on purchasing behavior.
Actionable information: can suggest
- new store layouts
- new product assortments
- which products to put on promotion
MBA is applicable whenever a customer purchases multiple things in proximity:
- credit cards
- services of telecommunication companies
- banking services
- medical treatments

MBA: applicable to many other contexts
Telecommunication: each customer is a transaction containing the set of the customer’s phone calls.
Atmospheric phenomena: each time interval (e.g. a day) is a transaction containing the set of observed events (rain, wind, etc.).
Etc.

Association Rules
Express how products/services relate to each other and tend to group together.
Example: “if a customer purchases three-way calling, then they will also purchase call-waiting”
- simple to understand
- actionable information: bundle three-way calling and call-waiting in a single package

Useful, trivial, inexplicable
Useful: “On Thursdays, grocery store consumers often purchase diapers and beer together”.
Trivial: “Customers who purchase maintenance agreements are very likely to purchase large appliances”.
Inexplicable: “When a new hardware store opens, one of the most sold items is toilet rings.”

Basic Concepts
Transaction formats:

Relational format <Tid, item>     Compact format <Tid, itemset>
<1, item1>                        <1, {item1, item2}>
<1, item2>                        <2, {item3}>
<2, item3>

Item: a single element. Itemset: a set of items.
Support of an itemset I: the number of transactions containing I.
Minimum support σ: the threshold for support.
Frequent itemset: an itemset with support ≥ σ.
Frequent itemsets represent sets of items which are positively correlated.

Frequent Itemsets
Support({dairy}) = 3 (75%)
Support({fruit}) = 3 (75%)
Support({dairy, fruit}) = 2 (50%)
If the minimum support σ = 60%, then {dairy} and {fruit} are frequent while {dairy, fruit} is not.
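The two transaction formats and the support count defined above can be sketched in a few lines of Python. This is only an illustration: the helper names `to_compact` and `support` are hypothetical, and the rows are the three <Tid, item> pairs from the Basic Concepts slide.

```python
from collections import defaultdict

def to_compact(rows):
    """Group relational <Tid, item> rows into compact <Tid, itemset> form."""
    baskets = defaultdict(set)
    for tid, item in rows:
        baskets[tid].add(item)
    return dict(baskets)

def support(itemset, baskets):
    """Count the transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for items in baskets.values() if wanted <= items)

# Relational rows taken from the Basic Concepts slide.
rows = [(1, "item1"), (1, "item2"), (2, "item3")]
baskets = to_compact(rows)

single = support({"item1"}, baskets)          # contained in transaction 1 only
pair = support({"item1", "item2"}, baskets)   # contained in transaction 1 only
cross = support({"item1", "item3"}, baskets)  # never bought together
```

An itemset is then frequent when its support count (or the count divided by the number of transactions) reaches the chosen minimum support.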

Association Rules: Measures
Let A and B be a partition of I: A ⇒ B [s, c]
A and B are itemsets.
s = support of A ⇒ B = support(A ∪ B)
c = confidence of A ⇒ B = support(A ∪ B)/support(A)
Measures for rules: minimum support σ, minimum confidence γ.
The rule holds if s ≥ σ and c ≥ γ.

Association Rules: Meaning
A ⇒ B [s, c]
Support: denotes the frequency of the rule within transactions. A high value means that the rule involves a large part of the database.
support(A ⇒ B [s, c]) = p(A ∪ B), i.e. the probability that a transaction contains every item of A ∪ B.
Confidence: denotes the percentage of transactions containing A which also contain B. It is an estimate of the conditional probability.
confidence(A ⇒ B [s, c]) = p(B|A) = p(A & B)/p(A).
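As a rough illustration of the two measures, the sketch below computes s and c for a rule over the three shopping baskets listed on the opening slide. The function name `rule_measures` and the choice of the rules {milk} ⇒ {bread} and {eggs} ⇒ {sugar} are assumptions made for this example, not part of the course notes.

```python
def rule_measures(antecedent, consequent, transactions):
    """Return (support, confidence) of antecedent => consequent as fractions."""
    a = frozenset(antecedent)
    ab = a | frozenset(consequent)
    n = len(transactions)
    count_a = sum(1 for t in transactions if a <= t)    # transactions containing A
    count_ab = sum(1 for t in transactions if ab <= t)  # transactions containing A and B
    s = count_ab / n
    c = count_ab / count_a if count_a else 0.0
    return s, c

# The three customers' baskets from the opening slide.
transactions = [
    frozenset({"milk", "eggs", "sugar", "bread"}),
    frozenset({"milk", "eggs", "cereal", "bread"}),
    frozenset({"eggs", "sugar"}),
]

s1, c1 = rule_measures({"milk"}, {"bread"}, transactions)  # s = 2/3, c = 2/2
s2, c2 = rule_measures({"eggs"}, {"sugar"}, transactions)  # s = 2/3, c = 2/3
```

Both rules have the same support, but only the first holds in every basket that contains its left-hand side, which is exactly the difference confidence is meant to capture.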

Association Rules - Example
Min. support 50%, min. confidence 50%
For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C})/support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent.

Problem Statement
The database consists of a set of transactions. Each transaction consists of a transaction ID and a set of items bought in that transaction (as in a market basket).
An association rule is an implication of the form X ⇒ Y, which says that customers who buy item X are also likely to buy item Y. In practice we are only interested in relationships between high-volume items (aka frequent items).
Confidence: X ⇒ Y holds with confidence C% if C% of transactions that contain X also contain Y.
Support: X ⇒ Y has support S% if S% of transactions contain X ∪ Y.
Observe that the support level for X is ≥ that for X ∪ Y, and that the ratio of the latter to the former is the confidence of X ⇒ Y:
confidence(X ⇒ Y) = support(X ∪ Y)/support(X)

Algorithms: Apriori
A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994). Min_sup = 2.

Database D
TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e

Scan D: 1-candidates
Itemset   Sup
a         2
b         3
c         3
d         1
e         3

Freq 1-itemsets
Itemset   Sup
a         2
b         3
c         3
e         3

2-candidates
Itemset: ab, ac, ae, bc, be, ce

Scan D: counting
Itemset   Sup
ab        1
ac        2
ae        1
bc        2
be        3
ce        2

Freq 2-itemsets
Itemset   Sup
ac        2
bc        2
be        3
ce        2

3-candidates
Itemset: bce

Scan D: Freq 3-itemsets
Itemset   Sup
bce       2
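The level-wise trace above can be reproduced with a short Python sketch. It is a minimal, unoptimized illustration (one full scan per level, no hash tree or prefix tree), not the implementation from Agrawal & Srikant; the function name `apriori` and the dictionary layout of `D` are choices made for this example.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item that occurs in the database.
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while candidates:
        # Scan D once for this level and count every candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        k += 1
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: keep a candidate only if all of its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Database D from the example, keyed by TID; each string is a set of one-letter items.
D = {10: "acd", 20: "bce", 30: "abce", 40: "be"}
frequent = apriori(D.values(), min_sup=2)
```

On this database the join step proposes abc, ace and bce at level 3, and the prune step discards abc and ace because ab and ae are not frequent, leaving bce as the only 3-candidate, as in the trace.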

Performance Challenges of Frequent Sets (aka Frequent Pattern) Mining
Data structures: hash tables and prefix trees
- Multiple scans of transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates
Improving Apriori: many algorithms proposed. General ideas:
- Reduce number of transaction database scans
- Shrink number of candidates
- Facilitate support counting of candidates
Frequent pattern mining without candidate generation (FP-growth) [Han, Pei, Yin 2000].

Apriori Summary
- Scanning the database and counting occurrences.
- Pruning the itemsets below the minimum support level. [Particularly after the first step, we might want to prune the database D as well.]
- Combining frequent sets of size n into candidate larger sets of size n + 1 [or even larger].
Monotonicity condition: the support level of a set is never larger than that of any of its subsets.
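The reason the monotonicity condition justifies pruning can be written out in one line. In the sketch below, D is the transaction database, X and Y are itemsets, and \sigma denotes the minimum support threshold (the symbol is a notational assumption for this note).

```latex
\[
X \subseteq Y
\;\Longrightarrow\;
\{\, t \in D : Y \subseteq t \,\} \subseteq \{\, t \in D : X \subseteq t \,\}
\;\Longrightarrow\;
\mathrm{support}(Y) \le \mathrm{support}(X)
\]
\[
\text{hence}\quad
\mathrm{support}(X) < \sigma \;\Longrightarrow\; \mathrm{support}(Y) < \sigma
\quad\text{for every } Y \supseteq X .
\]
```

In words: every transaction that contains Y also contains X, so once an itemset falls below the threshold, none of its supersets needs to be generated or counted.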

Extracting the Rules
For rule A ⇒ C:
Support for the rule: support for the set of items = support({A, C}) = 50%
Confidence: support for the rule over support for its left side = support({A, C})/support({A}) = 66.6%

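Rule Implications
Lemma: If X ⇒ YZ, then XY ⇒ Z and XZ ⇒ Y.
This property can be used to limit the number of rules tested: a rule with a larger right side can hold only if the rules obtained by moving part of that right side to the left also hold.
Example: for frequent itemset ABCDE, if ACDE ⇒ B and ABCE ⇒ D are the only rules with a single item on the right that hold, then ACE ⇒ BD is the only rule we should test next.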
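One way to use the lemma is to grow the right-hand side level by level, in the same spirit as Apriori's candidate generation. The sketch below is an illustration under assumptions: the function name `rules_from_itemset`, the returned `tested` counter, and the 0.8 confidence threshold are all hypothetical, and the support counts are those of itemset bce and its subsets in the four-transaction database D of the Apriori example.

```python
from itertools import combinations

def rules_from_itemset(itemset, support, min_conf):
    """Generate rules from one frequent itemset, growing the consequent by level.

    `support` maps frozensets to support counts. Returns (rules, tested), where
    rules is a list of (antecedent, consequent, confidence) triples and tested
    is the number of confidence checks performed.
    """
    itemset = frozenset(itemset)
    rules, tested = [], 0
    consequents = [frozenset([i]) for i in itemset]
    size = 1
    while consequents and size < len(itemset):
        passed = []
        for cons in consequents:
            tested += 1
            conf = support[itemset] / support[itemset - cons]
            if conf >= min_conf:
                rules.append((itemset - cons, cons, conf))
                passed.append(cons)
        size += 1
        # By the lemma, a consequent with `size` items can hold only if every
        # consequent obtained by dropping one of its items already held.
        merged = {a | b for a in passed for b in passed if len(a | b) == size}
        consequents = [
            c for c in merged
            if all(frozenset(s) in passed for s in combinations(c, size - 1))
        ]
    return rules, tested

# Support counts for bce and its subsets, computed from database D.
support = {frozenset(k): v for k, v in
           {"b": 3, "c": 3, "e": 3, "bc": 2, "be": 3, "ce": 2, "bce": 2}.items()}
rules, tested = rules_from_itemset("bce", support, min_conf=0.8)
found = {(a, c) for a, c, _ in rules}
```

With these numbers, bc ⇒ e and ce ⇒ b pass while be ⇒ c fails, so c ⇒ be is the only rule with a two-item right side that gets tested: four confidence checks instead of the six possible rules.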

Apriori in DB2
S. Sarawagi, S. Thomas, R. Agrawal: "Integrating Association Rule Mining with Databases: Alternatives and Implications", Data Mining and Knowledge Discovery Journal, 4(2/3), July 2000.
Very difficult to integrate Apriori into a DBMS: the only approach that works is cache-mining. Similar conclusions apply to other KDD algorithms.
Lackluster commercial success of DBMS vendors with predictive analytics, in stark contrast with their success in descriptive analytics, i.e. rollups and data cubes.

Apriori in Datalog
