Published by Lawrence Walsh. Modified over 9 years ago.
1
Knowledge discovery & data mining
Association rules and market basket analysis: introduction
UCLA CS240A Course Notes*
* From an EDBT 2000 tutorial by Fosca Giannotti & Dino Pedreschi, Pisa KDD Lab
2
Market Basket Analysis: the context
Analyzes customer buying habits by finding associations and correlations between the different items that customers place in their “shopping basket”.
Customer 1: milk, eggs, sugar, bread
Customer 2: milk, eggs, cereal, bread
Customer 3: eggs, sugar
3
Market Basket Analysis: the context
Given: a database of customer transactions, where each transaction is a set of items.
Find: groups of items which are frequently purchased together.
4
Goal of MBA
- Extract information on purchasing behavior
- Actionable information: can suggest
  - new store layouts
  - new product assortments
  - which products to put on promotion
- MBA is applicable whenever a customer purchases multiple things in proximity
  - credit cards
  - services of telecommunication companies
  - banking services
  - medical treatments
5
MBA: applicable to many other contexts
Telecommunication: each customer is a transaction containing the set of the customer’s phone calls.
Atmospheric phenomena: each time interval (e.g. a day) is a transaction containing the set of observed events (rain, wind, etc.).
Etc.
6
Association Rules
- Express how products/services relate to each other and tend to group together
  “If a customer purchases three-way calling, then the customer will also purchase call-waiting”
- Simple to understand
- Actionable information: bundle three-way calling and call-waiting in a single package
7
Useful, trivial, inexplicable
Useful: “On Thursdays, grocery store consumers often purchase diapers and beer together.”
Trivial: “Customers who purchase maintenance agreements are very likely to purchase large appliances.”
Inexplicable: “When a new hardware store opens, one of the most sold items is toilet rings.”
8
Basic Concepts
Transaction: relational format or compact format
Item: single element
Itemset: set of items
Support of an itemset I: number of transactions containing I
Minimum support: threshold for support
Frequent itemset: itemset with support at least the minimum support
Frequent itemsets represent sets of items which are positively correlated
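As a rough illustration of the two transaction formats, the sketch below assumes that "relational format" means one (transaction ID, item) row per purchased item and "compact format" means one itemset per transaction ID. That reading is an assumption, as are the transaction IDs 1 to 3 and the helper names `to_compact` and `support_count`; the items are the three baskets from the context slide.

```python
from collections import defaultdict

# Assumed "relational format": one (tid, item) row per purchased item.
# Items come from the three customer baskets; the tids 1-3 are made up here.
relational_rows = [
    (1, "milk"), (1, "eggs"), (1, "sugar"), (1, "bread"),
    (2, "milk"), (2, "eggs"), (2, "cereal"), (2, "bread"),
    (3, "eggs"), (3, "sugar"),
]

def to_compact(rows):
    """Group (tid, item) rows into the assumed "compact format": tid -> set of items."""
    compact = defaultdict(set)
    for tid, item in rows:
        compact[tid].add(item)
    return dict(compact)

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for items in transactions.values() if wanted <= items)

compact = to_compact(relational_rows)
print(sorted(compact[3]))
print(support_count({"milk", "bread"}, compact))
```

Support is easier to count on the compact form because each transaction is already a set, so containment is a single subset test.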
9
Frequent Itemsets
Support({dairy}) = 3 (75%)
Support({fruit}) = 3 (75%)
Support({dairy, fruit}) = 2 (50%)
If the minimum support is 60%, then {dairy} and {fruit} are frequent while {dairy, fruit} is not.
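The threshold test can be checked with a few lines of Python. The four transactions below are hypothetical: the slide's transaction table is not available here, so they were chosen only to reproduce the stated counts (3, 3 and 2 out of 4). The names `relative_support` and `is_frequent` are likewise illustrative.

```python
# Hypothetical transactions, picked only so that the counts match the slide.
transactions = [
    {"dairy", "fruit"},
    {"dairy", "fruit"},
    {"dairy"},
    {"fruit"},
]

def relative_support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)

def is_frequent(itemset, transactions, min_support):
    """An itemset is frequent when its support reaches the minimum support."""
    return relative_support(itemset, transactions) >= min_support

MIN_SUPPORT = 0.60
for itemset in ({"dairy"}, {"fruit"}, {"dairy", "fruit"}):
    print(sorted(itemset),
          relative_support(itemset, transactions),
          is_frequent(itemset, transactions, MIN_SUPPORT))
```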
10
Association Rules: Measures
Let A and B be a partition of I: A ⇒ B [s, c]
A and B are itemsets
s = support of A ⇒ B = support(A ∪ B)
c = confidence of A ⇒ B = support(A ∪ B)/support(A)
Measures for rules:
- minimum support
- minimum confidence
The rule holds if s ≥ minimum support and c ≥ minimum confidence
11
Association Rules: Meaning
A ⇒ B [s, c]
Support: denotes the frequency of the rule within transactions. A high value means that the rule involves a large part of the database.
support(A ⇒ B [s, c]) = p(A ∪ B)
Confidence: denotes the percentage of transactions containing A which also contain B. It is an estimate of the conditional probability.
confidence(A ⇒ B [s, c]) = p(B|A) = p(A & B)/p(A)
Here p(A ∪ B) and p(A & B) both denote the fraction of transactions that contain all the items of A and all the items of B.
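A minimal sketch of both measures, computed on the three customer baskets from the context slide. The function names `p` and `rule_measures` are illustrative, and exact fractions are used only to avoid rounding noise; the code assumes the left-hand side occurs in at least one transaction.

```python
from fractions import Fraction

# The three baskets from the "context" slide.
baskets = [
    {"milk", "eggs", "sugar", "bread"},   # Customer 1
    {"milk", "eggs", "cereal", "bread"},  # Customer 2
    {"eggs", "sugar"},                    # Customer 3
]

def p(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return Fraction(hits, len(transactions))

def rule_measures(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent => consequent.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = set(antecedent) | set(consequent)
    s = p(both, transactions)
    c = s / p(antecedent, transactions)
    return s, c

print(rule_measures({"eggs"}, {"sugar"}, baskets))
print(rule_measures({"milk"}, {"bread"}, baskets))
```

On these baskets both rules have support 2/3, but eggs ⇒ sugar has confidence 2/3 while milk ⇒ bread has confidence 1, since every basket with milk also has bread.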
12
Association Rules - Example
Min. support 50%, min. confidence 50%
For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C})/support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent
13
Problem Statement
- The database consists of a set of transactions.
- Each transaction consists of a transaction ID and a set of items bought in that transaction (as in a market basket).
- An association rule is an implication of the form X ⇒ Y, which says that customers who buy item X are also likely to buy item Y. In practice we are only interested in relationships between high-volume items (aka frequent items).
- Confidence: X ⇒ Y holds with confidence C% if C% of transactions that contain X also contain Y.
- Support: X ⇒ Y has support S% if S% of transactions contain X ∪ Y. Observe that the support level for X is ≥ that for X ∪ Y, and that the ratio of the latter to the former is the confidence of X ⇒ Y: confidence(X ⇒ Y) = support(X ∪ Y)/support(X)
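The closing observation can be written as a short chain. The notation T(X) for the set of transactions containing X, and the abbreviations supp and conf, are introduced here for convenience and are not from the slides.

```latex
\begin{align*}
T(X \cup Y) \subseteq T(X)
  &\;\Longrightarrow\; \mathrm{supp}(X \cup Y) \le \mathrm{supp}(X), \\
\mathrm{conf}(X \Rightarrow Y)
  &= \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)} \le 1 .
\end{align*}
```

Every transaction that contains all of X ∪ Y also contains X, which is why the ratio can never exceed 100%.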
14
Algorithms: Apriori
A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994)

Database D (Min_sup = 2)
TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e

Scan D: 1-candidates
Itemset   Sup
a         2
b         3
c         3
d         1
e         3

Freq 1-itemsets
Itemset   Sup
a         2
b         3
c         3
e         3

2-candidates
Itemset
ab
ac
ae
bc
be
ce

Scan D: counting
Itemset   Sup
ab        1
ac        2
ae        1
bc        2
be        3
ce        2

Freq 2-itemsets
Itemset   Sup
ac        2
bc        2
be        3
ce        2

3-candidates
Itemset
bce

Scan D: Freq 3-itemsets
Itemset   Sup
bce       2
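The level-wise trace can be reproduced with a short sketch. This is a simplified illustration, not the implementation from Agrawal & Srikant: the join step here simply unions pairs of frequent (k-1)-itemsets rather than using a prefix-based join, and the name `apriori` and its signature are choices made for this example.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support count} for all itemsets with count >= min_sup."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the database per level: count each candidate, keep the frequent ones.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    items = {item for t in transactions for item in t}
    level = count({frozenset([i]) for i in items})
    frequent = dict(level)
    k = 2
    while level:
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: keep a candidate only if every (k-1)-subset is itself frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent

# Database D from the slide, with each item written as a single character.
D = {10: "acd", 20: "bce", 30: "abce", 40: "be"}
result = apriori(D.values(), min_sup=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print("".join(sorted(itemset)), result[itemset])
```

At level 3 the join also produces abc and ace, but both are pruned because ab and ae are not frequent, which leaves bce as the only 3-candidate, as on the slide.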
15
Performance Challenges of Frequent Sets (aka Frequent Pattern) Mining
- Challenges
  - Data structures: hash tables and prefix trees
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori: many algorithms proposed. General ideas
  - Reduce the number of transaction database scans
  - Shrink the number of candidates
  - Facilitate support counting of candidates
  - Frequent-pattern mining without candidate generation [Han, Pei, Yin 2000]
16
Apriori Summary
- Scanning the database and counting occurrences
- Pruning the itemsets below the minimum support level [particularly after the first step, we might want to prune the database D as well]
- Combining frequent sets of size n into larger candidate sets of size n + 1 [or even larger]
Monotonicity condition: the support level of a set is never larger than that of any of its subsets
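The remark about pruning the database D after the first step can be read in more than one way. The sketch below shows one possible interpretation: drop items that failed the minimum support (by monotonicity they cannot appear in any frequent itemset), then drop transactions left with fewer than two items, since they cannot support any 2-candidate. The function name `prune_database` is an assumption made for this example.

```python
from collections import Counter

def prune_database(transactions, min_sup):
    """Drop infrequent items, then drop transactions too short to hold a 2-itemset."""
    # First scan: count in how many transactions each item occurs.
    counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {item for item, n in counts.items() if n >= min_sup}
    # Keep only frequent items in each transaction.
    pruned = [set(t) & frequent_items for t in transactions]
    # A transaction with fewer than two items cannot contain any 2-candidate.
    return [t for t in pruned if len(t) >= 2]

# Database D from the Apriori slide, one character per item.
D = ["acd", "bce", "abce", "be"]
pruned = prune_database(D, min_sup=2)
print([sorted(t) for t in pruned])
```

On the slide's database only item d disappears, so the saving is small; on sparse real data the later scans can become noticeably cheaper.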
17
Apriori in DB2
S. Sarawagi, S. Thomas, R. Agrawal: "Integrating Association Rule Mining with Databases: Alternatives and Implications", Data Mining and Knowledge Discovery Journal, 4(2/3), July 2000.
18
Apriori in Datalog
19
Extracting the Rules
For rule A ⇒ C:
- Support for the rule: support for its set of items = support({A, C}) = 50%
- Confidence: support for the rule over support for its left side = support({A, C})/support({A}) = 66.6%
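Rule extraction from one frequent itemset can be sketched as follows. Because the table behind the A ⇒ C example is not available here, the sketch uses the four-transaction database from the Apriori slide and its frequent itemset bce instead; the function names and the 75% confidence threshold are assumptions chosen for illustration.

```python
from itertools import combinations

# Database D from the Apriori slide, one character per item.
D = [set("acd"), set("bce"), set("abce"), set("be")]

def support(itemset):
    """Fraction of transactions in D that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in D if wanted <= t) / len(D)

def rules_from(itemset, min_conf):
    """All rules lhs => rhs where lhs and rhs split `itemset` into two non-empty parts."""
    itemset = frozenset(itemset)
    found = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            # Confidence = support of the whole itemset over support of the left side.
            conf = support(itemset) / support(lhs)
            if conf >= min_conf:
                found.append((sorted(lhs), sorted(itemset - lhs), conf))
    return found

rules = rules_from("bce", min_conf=0.75)
for lhs, rhs, conf in rules:
    print(lhs, "=>", rhs, conf)
```

All six rules built from bce share the support of bce (50%); only the two with a two-item left side, bc ⇒ e and ce ⇒ b, reach the assumed confidence threshold.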
20
Rule Implications
- Lemma: if X ⇒ YZ, then XY ⇒ Z and XZ ⇒ Y
- This property can be used to limit the number of rules tested.
- Example: for frequent itemset ABCDE
- If ACDE ⇒ B and ABCE ⇒ D, then ACE ⇒ BD is the only rule we should test.
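A short justification of the lemma, using supp and conf as shorthand for support and confidence (abbreviations introduced here). All three rules are built from the same itemset XYZ, so they have the same support, and only the confidence needs comparing:

```latex
\begin{align*}
\mathrm{conf}(XY \Rightarrow Z)
  = \frac{\mathrm{supp}(XYZ)}{\mathrm{supp}(XY)}
  \;\ge\; \frac{\mathrm{supp}(XYZ)}{\mathrm{supp}(X)}
  = \mathrm{conf}(X \Rightarrow YZ),
  \qquad \text{since } \mathrm{supp}(XY) \le \mathrm{supp}(X).
\end{align*}
```

The same argument with Y and Z swapped gives XZ ⇒ Y. Read in the other direction, a rule with a two-item right side such as ACE ⇒ BD can only hold if both ACDE ⇒ B and ABCE ⇒ D hold, which is why it is the only such rule worth testing in the example.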