
1  Frequent Itemsets: Association Rules and Market Basket Analysis
CS240B, UCLA. Notes by Carlo Zaniolo; most slides borrowed from Jiawei Han, UIUC. May 2007

2  Association Rules & Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods: Apriori and improvements; FP-growth
- Rule derivation, visualization and validation
- Multi-level associations
- Temporal associations and frequent sequences
- Other association mining methods
- Summary

3  Market Basket Analysis: the Context
Analyze customer buying habits by finding associations and correlations between the different items that customers place in their "shopping baskets".
Example baskets: Customer 1: milk, eggs, sugar, bread. Customer 2: milk, eggs, cereal, bread. Customer 3: eggs, sugar.

4  Market Basket Analysis: the Context
Given a database of customer transactions, where each transaction is a set of items, find groups of items which are frequently purchased together.

5  Goal of MBA
- Extract information on purchasing behavior
- Actionable information: can suggest
  - new store layouts
  - new product assortments
  - which products to put on promotion
- MBA is applicable whenever a customer purchases multiple things in proximity:
  - credit cards
  - services of telecommunication companies
  - banking services
  - medical treatments

6  MBA: Applicable to Many Other Contexts
- Telecommunication: each customer is a transaction containing the set of the customer's phone calls
- Atmospheric phenomena: each time interval (e.g. a day) is a transaction containing the set of observed events (rain, wind, etc.)
- Etc.

7  Association Rules
- Express how products/services relate to each other and tend to group together
- "If a customer purchases three-way calling, then they will also purchase call-waiting"
- Simple to understand
- Actionable information: bundle three-way calling and call-waiting in a single package

8  Frequent Itemsets
- Transaction: a set of items (can be stored in relational or compact format)
- Item: a single element; itemset: a set of items
- Support of an itemset I: the number of transactions containing I
- Minimum support σ: a threshold on support
- Frequent itemset: an itemset with support ≥ σ
- Frequent itemsets represent sets of items which are positively correlated

9  Frequent Itemsets: Example
Support({dairy}) = 3 (75%)
Support({fruit}) = 3 (75%)
Support({dairy, fruit}) = 2 (50%)
If σ = 60%, then {dairy} and {fruit} are frequent while {dairy, fruit} is not.
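
(Not part of the original slides.) To make the definitions concrete, here is a minimal Python sketch that counts itemset support over a small transaction list; the slide's transaction table is not reproduced in the transcript, so the four transactions below are hypothetical ones chosen to match the counts quoted above.

    # Minimal sketch: itemset support over a list of transactions.
    # Hypothetical transactions, chosen so that the counts match the slide:
    # support({dairy}) = 3, support({fruit}) = 3, support({dairy, fruit}) = 2.
    transactions = [
        {"dairy", "fruit"},
        {"dairy", "fruit"},
        {"dairy"},
        {"fruit"},
    ]

    def support(itemset, transactions):
        """Number of transactions that contain every item of `itemset`."""
        return sum(1 for t in transactions if itemset <= t)

    n = len(transactions)
    for itemset in [{"dairy"}, {"fruit"}, {"dairy", "fruit"}]:
        s = support(itemset, transactions)
        print(sorted(itemset), s, f"{100 * s / n:.0f}%")
    # With minimum support sigma = 60% (3 of 4 transactions),
    # {dairy} and {fruit} are frequent while {dairy, fruit} is not.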

10  Itemset Support and Rule Confidence
Let A and B be disjoint itemsets and let s = support(A ∪ B) and c = support(A ∪ B)/support(A).
Then the rule A ⇒ B holds with support s and confidence c; we write A ⇒ B [s, c].
Objective of the mining task: find all rules with
- minimum support σ
- minimum confidence γ
Thus A ⇒ B [s, c] holds if s ≥ σ and c ≥ γ.

11  Association Rules: Meaning
A ⇒ B [s, c]
Support denotes the frequency of the rule within the transactions; a high value means that the rule involves a large part of the database: support(A ⇒ B [s, c]) = p(A ∪ B).
Confidence denotes the percentage of transactions containing A which also contain B; it is an estimate of the conditional probability: confidence(A ⇒ B [s, c]) = p(B|A) = p(A & B)/p(A).
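
(Not part of the original slides.) The relationship c = support(A ∪ B)/support(A) can be made explicit with a short sketch; the transaction list here is hypothetical and serves only the illustration.

    # Sketch: support and confidence of a rule A => B, computed from raw
    # transactions (a hypothetical list, only for illustration).
    transactions = [
        {"milk", "bread", "butter"},
        {"milk", "bread"},
        {"milk", "eggs"},
        {"bread", "butter"},
    ]

    def support_count(itemset, transactions):
        return sum(1 for t in transactions if itemset <= t)

    def rule_measures(A, B, transactions):
        """Return (support, confidence) of the rule A => B."""
        n = len(transactions)
        s_union = support_count(A | B, transactions)    # support(A u B)
        s_a = support_count(A, transactions)
        return s_union / n, (s_union / s_a if s_a else 0.0)

    s, c = rule_measures({"milk"}, {"bread"}, transactions)
    print(f"milk => bread  [support = {s:.0%}, confidence = {c:.0%}]")
    # support = 2/4 = 50%, confidence = 2/3, about 67%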

12  Association Rules: Example
Minimum support 50%, minimum confidence 50%.
For the rule A ⇒ C (computed over the slide's transaction table, not reproduced in this transcript):
support = support({A, C}) = 50%
confidence = support({A, C})/support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent.

13  Closed Patterns and Max-Patterns
- A long pattern contains very many subpatterns: a combinatorial explosion
  - Remedy: closed patterns and max-patterns
- An itemset is closed if none of its supersets has the same support
  - Closed patterns are a lossless compression of the frequent patterns, reducing the number of patterns and rules
- An itemset is maximal frequent if none of its supersets is frequent
  - But the support of its subsets is not known: additional DB scans are needed

14  Frequent Itemsets (lattice example)
Minimum support = 2; number of frequent itemsets = 13.
[Figure: the itemset lattice over items A-E, with the list of supporting transactions attached to each itemset.]

15  Maximal Frequent Itemset: an itemset none of whose supersets is frequent
Minimum support = 2; number of frequent itemsets = 13; number of maximal itemsets = 4.
[Figure: the same itemset lattice, with the maximal frequent itemsets highlighted.]

16  Closed Frequent Itemset: an itemset none of whose supersets has the same support
Minimum support = 2; number of frequent itemsets = 13; number of closed itemsets = 9; number of maximal itemsets = 4.
[Figure: the same lattice, with itemsets marked "closed and maximal" or "closed but not maximal".]

17  Maximal vs Closed Itemsets
As we move from an itemset A to its supersets, the support can:
1. Remain the same for some superset: A is not closed;
2. Drop for every superset, but stay above the threshold for some: A is closed but not maximal;
3. Drop below the threshold for every superset: A is maximal (and closed).
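
(Not part of the original slides.) A small sketch that, given all frequent itemsets with their support counts, flags the closed and the maximal ones; the supports below are a hypothetical but mutually consistent example, not the lattice from the previous slides.

    # Sketch: classify frequent itemsets as closed and/or maximal.
    # `frequent` maps every frequent itemset to its support count.
    frequent = {
        frozenset("A"): 3, frozenset("B"): 4, frozenset("C"): 3,
        frozenset("AB"): 3, frozenset("AC"): 2, frozenset("BC"): 3,
        frozenset("ABC"): 2,
    }

    def is_closed(itemset):
        """Closed: no frequent proper superset has the same support."""
        return not any(itemset < other and frequent[other] == frequent[itemset]
                       for other in frequent)

    def is_maximal(itemset):
        """Maximal: no proper superset is frequent at all."""
        return not any(itemset < other for other in frequent)

    for itemset, sup in sorted(frequent.items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        tags = [name for name, ok in [("closed", is_closed(itemset)),
                                      ("maximal", is_maximal(itemset))] if ok]
        print("".join(sorted(itemset)), sup, ", ".join(tags) or "-")
    # Only ABC is maximal; B, AB, BC and ABC are closed.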

18  Scalable Methods for Mining Frequent Patterns
- The downward closure property of frequent patterns: every subset of a frequent itemset must be frequent (the antimonotonic property)
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: three major approaches
  - Apriori (Agrawal & Srikant, VLDB'94)
  - Frequent-pattern growth (FP-growth: Han, Pei & Yin, SIGMOD'00)
  - Vertical data format approach (CHARM: Zaki & Hsiao, SDM'02)

19  Apriori: a Candidate Generation-and-Test Approach
- Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant, VLDB'94; Mannila et al., KDD'94)
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated

20  Association Rules & Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods:
  - Apriori, and improvements

21  The Apriori Algorithm: an Example (minimum support count = 2)
Database TDB: Tid 10: A, C, D; Tid 20: B, C, E; Tid 30: A, B, C, E; Tid 40: B, E
1st scan, C1: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3  =>  L1: {A}: 2, {B}: 3, {C}: 3, {E}: 3
C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}; 2nd scan: {A,B}: 1, {A,C}: 2, {A,E}: 1, {B,C}: 2, {B,E}: 3, {C,E}: 2  =>  L2: {A,C}: 2, {B,C}: 2, {B,E}: 3, {C,E}: 2
C3: {B,C,E}; 3rd scan  =>  L3: {B,C,E}: 2
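
(Not part of the original slides.) The level-wise computation on this example can be reproduced with a simplified Apriori sketch; with a minimum support count of 2 on the TDB above it yields exactly the L1, L2 and L3 shown.

    from itertools import combinations

    # Sketch of Apriori on the example database, minimum support count = 2.
    transactions = [
        {"A", "C", "D"},        # Tid 10
        {"B", "C", "E"},        # Tid 20
        {"A", "B", "C", "E"},   # Tid 30
        {"B", "E"},             # Tid 40
    ]
    min_sup = 2

    def frequent_itemsets(transactions, min_sup):
        items = sorted({i for t in transactions for i in t})
        # L1: frequent 1-itemsets.
        level = {frozenset([i]) for i in items
                 if sum(i in t for t in transactions) >= min_sup}
        k, result = 1, {}
        while level:
            for iset in level:
                result[iset] = sum(iset <= t for t in transactions)
            # Candidate (k+1)-itemsets: unions of frequent k-itemsets, kept only
            # if all their k-subsets are frequent (Apriori pruning) and the
            # candidate itself meets the minimum support.
            candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
            level = {c for c in candidates
                     if all(frozenset(s) in result for s in combinations(c, k))
                     and sum(c <= t for t in transactions) >= min_sup}
            k += 1
        return result

    for iset, sup in sorted(frequent_itemsets(transactions, min_sup).items(),
                            key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print("".join(sorted(iset)), sup)
    # A 2, B 3, C 3, E 3, AC 2, BC 2, BE 3, CE 2, BCE 2 -- as in L1, L2, L3 above.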

22  Important Details of Apriori
- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- How to count supports of candidates?
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining L3 * L3: abcd from abc and abd; acde from acd and ace
  - Pruning: acde is removed because ade is not in L3
  - C4 = {abcd}

23  How to Generate Candidates?
Suppose the items in Lk-1 are listed in some order.
Step 1: self-joining Lk-1
  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) then delete c from Ck
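
(Not part of the original slides.) The same join-and-prune step can be written procedurally; the sketch below keeps itemsets as sorted tuples, mirrors the SQL self-join on the first k-2 items, and reproduces the L3 to C4 example from the previous slide.

    from itertools import combinations

    def apriori_gen(L_prev, k):
        """Candidate k-itemsets from frequent (k-1)-itemsets (sorted tuples)."""
        # Step 1: self-join -- merge pairs that agree on their first k-2 items.
        candidates = set()
        for p in L_prev:
            for q in L_prev:
                if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
                    candidates.add(p[:k - 2] + (p[k - 2], q[k - 2]))
        # Step 2: pruning -- drop candidates with an infrequent (k-1)-subset.
        return {c for c in candidates
                if all(s in L_prev for s in combinations(c, k - 1))}

    # The example from the slide: L3 = {abc, abd, acd, ace, bcd}  =>  C4 = {abcd}.
    L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
          ("a", "c", "e"), ("b", "c", "d")}
    print(apriori_gen(L3, 4))   # {('a', 'b', 'c', 'd')}; acde is pruned since ade is not in L3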

24  How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be huge
  - One transaction may contain many candidates
- Data structures used:
  - Candidate itemsets can be stored in a hash-tree
  - or in a prefix-tree (trie)

25  Effect of Support Distribution (slide from Tan, Steinbach & Kumar, Introduction to Data Mining, 2004)
- Many real data sets have a skewed support distribution
[Figure: support distribution of a retail data set]

26  Effect of Support Distribution
- How to set the appropriate minsup threshold?
  - If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
  - If minsup is set too low, mining becomes computationally expensive and the number of itemsets is very large
- Using a single minimum support threshold may not be effective

27  Rule Generation
- How to efficiently generate rules from frequent itemsets?
  - In general, confidence does not have an anti-monotone property: c(ABC ⇒ D) can be larger or smaller than c(AB ⇒ D)
  - But the confidence of rules generated from the same itemset does have an anti-monotone property
  - e.g., for L = {A,B,C,D}: c(ABC ⇒ D) ≥ c(AB ⇒ CD) ≥ c(A ⇒ BCD)
  - Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

28  Rule Generation
- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f ⇒ L−f satisfies the minimum confidence requirement
- If |L| = k, then there are 2^k candidate association rules (including L ⇒ ∅ and ∅ ⇒ L)
  - Example: if L = {A,B,C,D} is the frequent itemset, the candidate rules are:
    ABC ⇒ D, ABD ⇒ C, ACD ⇒ B, BCD ⇒ A, A ⇒ BCD, B ⇒ ACD, C ⇒ ABD, D ⇒ ABC,
    AB ⇒ CD, AC ⇒ BD, AD ⇒ BC, BC ⇒ AD, BD ⇒ AC, CD ⇒ AB
- But antimonotonicity will make things converge fast.

29  Lattice of rules: confidence(f ⇒ L−f) = support(L)/support(f), for L = {A,B,C,D}
[Figure: the lattice of rules generated from L; once a low-confidence rule is found, all rules below it in the lattice are pruned.]

30  Rule Generation for the Apriori Algorithm
1. A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
2. join(CD ⇒ AB, BD ⇒ AC) would produce the candidate rule D ⇒ ABC
3. Prune rule D ⇒ ABC if its subset AD ⇒ BC does not have high confidence
4. Finally check the validity of rule D ⇒ ABC (this is not an expensive operation, so we might skip 3)
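
(Not part of the original slides.) A simplified sketch of this procedure: consequents are grown level by level, and a candidate consequent is tried only if all of its sub-consequents already gave confident rules, which is the anti-monotone property that the pruning in step 3 exploits. The transactions and the 0.75 confidence threshold are made up for the illustration.

    from itertools import combinations

    def gen_rules(L, support, minconf):
        """Confident rules f => L-f from the frequent itemset L."""
        L = frozenset(L)
        # Level 1: single-item consequents.
        consequents = [frozenset([i]) for i in L
                       if support[L] / support[L - frozenset([i])] >= minconf]
        rules = [(L - c, c) for c in consequents]
        # Grow consequents: a (k+1)-item consequent is tried only if all of its
        # k-item subsets already produced confident rules.
        while consequents and len(consequents[0]) < len(L) - 1:
            k = len(consequents[0])
            merged = {a | b for a in consequents for b in consequents if len(a | b) == k + 1}
            consequents = [c for c in merged
                           if all(frozenset(s) in set(consequents) for s in combinations(c, k))
                           and support[L] / support[L - c] >= minconf]
            rules += [(L - c, c) for c in consequents]
        return rules

    # Hypothetical data: five transactions and the supports of all subsets of ABCD.
    transactions = [set("ABCD"), set("ABCD"), set("ABD"), set("AD"), set("D")]
    support = {frozenset(c): sum(1 for t in transactions if set(c) <= t)
               for r in range(1, 5) for c in combinations("ABCD", r)}
    for antecedent, consequent in gen_rules("ABCD", support, minconf=0.75):
        print("".join(sorted(antecedent)), "=>", "".join(sorted(consequent)))
    # ABD => C has confidence 2/3 < 0.75, so no consequent containing C is ever tried.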

31  Rules: Some Useful, Some Trivial, Others Inexplicable
- Useful: "On Thursdays, grocery store consumers often purchase diapers and beer together."
- Trivial: "Customers who purchase maintenance agreements are very likely to purchase large appliances."
- Inexplicable: "When a new hardware store opens, one of the most sold items is toilet rings."
Conclusion: inferred rules must be validated by a domain expert before they can be used in the marketplace: post-mining of association rules.

32  Mining for Association Rules: the Main Steps in the Process
1. Select a minimum support/confidence level
2. Find the frequent itemsets
3. Find the association rules
4. Validate (post-mine) the rules so found

33  Mining for Association Rules: Checkpoint
- Apriori opened up a big commercial market for DM
  - association rules came from the database field, classifiers from AI, and clustering precedes both ... and DM
- Many open problem areas, including:
  1. Performance: faster algorithms are needed for frequent itemsets
  2. Improving the statistical/semantic significance of rules
  3. Data stream mining for association rules: even faster algorithms are needed, plus incremental computation, adaptability, etc. The post-mining process also becomes more challenging.

34  Performance: Efficient Implementation of Apriori in SQL
- Hard to get good performance out of pure SQL (SQL-92) based approaches alone
- Make use of object-relational extensions like UDFs, BLOBs, table functions, etc.
  - S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98
- A much better solution: use UDAs, native or imported.
  - Haixun Wang and Carlo Zaniolo. ATLaS: A Native Extension of SQL for Data Mining. SIAM International Conference on Data Mining 2003, San Francisco, CA, May 1-3, 2003

35  Performance for Apriori
- Challenges
  - Multiple scans of the transaction database [not possible for data streams]
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Many improvements suggested: general ideas
  - Reduce the number of transaction-database scans
  - Shrink the number of candidates
  - Facilitate the counting of candidates

36  Partition: Scan the Database Only Twice
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
  - Scan 1: partition the database and find the local frequent patterns
  - Scan 2: consolidate the global frequent patterns
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95
- Does this scale up to larger partitions?
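
(Not part of the original slides.) A toy version of the two-scan partition idea; the brute-force local miner is meant only for tiny partitions, and the data set is the small TDB used in the Apriori example earlier.

    from itertools import combinations

    # Sketch: any globally frequent itemset must be locally frequent
    # in at least one partition, so two scans suffice.
    def local_frequent(partition, min_frac):
        """Brute-force local mining; fine only for tiny partitions."""
        items = sorted({i for t in partition for i in t})
        found = set()
        for k in range(1, len(items) + 1):
            for c in combinations(items, k):
                if sum(set(c) <= t for t in partition) >= min_frac * len(partition):
                    found.add(frozenset(c))
        return found

    def partition_mine(transactions, min_frac, n_parts=2):
        size = (len(transactions) + n_parts - 1) // n_parts
        parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
        # Scan 1: the union of the local frequent itemsets is the global candidate set.
        candidates = set().union(*(local_frequent(p, min_frac) for p in parts))
        # Scan 2: count the candidates over the whole database.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_frac * len(transactions)}

    tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    for iset, sup in sorted(partition_mine(tdb, min_frac=0.5).items(),
                            key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print("".join(sorted(iset)), sup)
    # Same nine frequent itemsets as the Apriori example above.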

37  Sampling for Frequent Patterns
- Select a sample S of the original database, and mine frequent patterns within the sample using Apriori
- To avoid losing frequent patterns, mine with a support threshold lower than the one required
- Scan the rest of the database to find the exact counts
- H. Toivonen. Sampling large databases for association rules. In VLDB'96

38  DIC: Reduce the Number of Scans
[Figure: itemset lattice over the items A, B, C, D]
- Once both A and D are determined frequent, the counting of AD begins
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: unlike Apriori, which counts 1-itemsets, then 2-itemsets, etc. in separate passes, DIC starts counting 2- and 3-itemsets part-way through a scan of the transactions]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97

39  Improving Performance (cont.)
- Apriori: multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1 i2 ... i100:
    - number of scans: 100
    - number of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?

40  Mining Frequent Patterns Without Candidate Generation
- FP-Growth algorithm
  1. Build the FP-tree: items are listed by decreasing frequency
  2. For each suffix (recursively):
     - build its conditionalized subtree
     - and compute its frequent items
- An order of magnitude faster than Apriori

41  Frequent Patterns (FP) Algorithm
(These slides are based on those by Yousry Taha, Taghrid Al-Shallali, Ghada AL Modaifer, and Nesreen AL Boiez.)
The algorithm consists of two steps:
Step 1: build the FP-tree (Frequent Pattern Tree).
Step 2: use the FP-Growth algorithm to find frequent itemsets from the FP-tree.

42  Frequent Pattern Tree Algorithm: Example
- The first scan of the database is the same as in Apriori: it derives the set of 1-itemsets and their support counts.
- The set of frequent items is sorted in order of descending support count.
- An FP-tree is constructed.
- The FP-tree is conditionalized and mined for frequent itemsets.
Transactions: T-ID 101: milk, bread, cookies, juice; 792: milk, juice; 1130: milk, eggs; 1735: bread, cookies, coffee

43  FP-Tree for the Example Database
Item header table (item, support, node-link): milk: 3; bread: 2; cookies: 2; juice: 2
[Figure: the FP-tree for T-IDs 101 (milk, bread, cookies, juice), 792 (milk, juice), 1130 (milk, eggs), 1735 (bread, cookies, coffee): under the NULL root, the path milk:3 -> bread:1 -> cookies:1 -> juice:1, a second child juice:1 under milk, and a separate path bread:1 -> cookies:1.]
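
(Not part of the original slides.) A compact sketch of the construction: one scan to count item frequencies, a second scan to insert each transaction's frequent items in descending-frequency order so that shared prefixes collapse into shared nodes. Run on the four example transactions with a minimum support count of 2, it prints the same tree as in the figure.

    from collections import Counter

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}

    def build_fp_tree(transactions, min_sup):
        """Scan 1: count items. Scan 2: insert frequency-ordered frequent items."""
        counts = Counter(i for t in transactions for i in t)
        frequent = {i for i, c in counts.items() if c >= min_sup}

        def ordered(t):
            return sorted((i for i in t if i in frequent), key=lambda i: (-counts[i], i))

        root, header = FPNode(None, None), {}       # header: item -> node-links
        for t in transactions:
            node = root
            for item in ordered(t):
                if item not in node.children:
                    node.children[item] = FPNode(item, node)
                    header.setdefault(item, []).append(node.children[item])
                node = node.children[item]
                node.count += 1
        return root, header

    def show(node, depth=0):
        for child in node.children.values():
            print("  " * depth + f"{child.item}:{child.count}")
            show(child, depth + 1)

    tdb = [["milk", "bread", "cookies", "juice"], ["milk", "juice"],
           ["milk", "eggs"], ["bread", "cookies", "coffee"]]
    root, header = build_fp_tree(tdb, min_sup=2)
    show(root)   # milk:3 (-> bread:1 -> cookies:1 -> juice:1, and -> juice:1), bread:1 -> cookies:1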

44  FP-Growth Algorithm for Finding Frequent Itemsets
Steps:
1. Start from each frequent length-1 pattern (as an initial suffix pattern).
2. Construct its conditional pattern base, which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern.
3. Then construct its conditional FP-tree and perform mining on that tree.
4. The pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from a conditional FP-tree.
5. The union of all frequent patterns (generated by step 4) gives the required frequent itemsets.

45  FP-Growth: for each suffix find (1) its supporting paths, (2) its conditional FP-tree, and (3) the frequent patterns with that ending (suffix)
- Suffix juice: conditional pattern base {(milk, bread, cookies: 1), (milk: 1)}; conditional FP-tree {milk: 2}; frequent pattern generated {juice, milk: 2}
- Suffix cookies: conditional pattern base {(milk, bread: 1), (bread: 1)}; conditional FP-tree {bread: 2}; frequent pattern generated {cookies, bread: 2}
- Suffix bread: conditional pattern base {(milk: 1)}; no conditional FP-tree; no pattern generated
- Suffix milk: empty conditional pattern base; no pattern generated
... then expand the suffix and repeat these operations
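
(Not part of the original slides.) The table can be reproduced directly from the frequency-ordered transactions, since a conditional pattern base is just the set of non-empty prefixes that precede the suffix item on each path; the sketch below grows each suffix by one item, as in the table.

    from collections import Counter

    # Sketch: conditional pattern bases read off the frequency-ordered transactions.
    tdb = [["milk", "bread", "cookies", "juice"], ["milk", "juice"],
           ["milk", "eggs"], ["bread", "cookies", "coffee"]]
    min_sup = 2
    counts = Counter(i for t in tdb for i in t)
    frequent = [i for i, c in counts.items() if c >= min_sup]
    ordered = [sorted((i for i in t if i in frequent), key=lambda i: (-counts[i], i))
               for t in tdb]

    header_order = sorted(frequent, key=lambda i: (-counts[i], i))   # milk, bread, cookies, juice
    for suffix in reversed(header_order):                            # least frequent suffix first
        # Conditional pattern base: the non-empty prefixes preceding the suffix.
        base = [t[:t.index(suffix)] for t in ordered if suffix in t]
        base = [prefix for prefix in base if prefix]
        cond_counts = Counter(i for prefix in base for i in prefix)
        patterns = [(suffix, item, c) for item, c in cond_counts.items() if c >= min_sup]
        print(suffix, "base:", base, "->", patterns)
    # juice   base: [['milk', 'bread', 'cookies'], ['milk']] -> [('juice', 'milk', 2)]
    # cookies base: [['milk', 'bread'], ['bread']]           -> [('cookies', 'bread', 2)]
    # bread   base: [['milk']]                               -> []
    # milk    base: []                                       -> []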

46  Starting from the Least Frequent Suffix: Juice
[Figure: from the full FP-tree, the two paths ending in juice are selected: (milk -> bread -> cookies -> juice: 1) and (milk -> juice: 1), giving juice a total count of 2.]

47  Conditionalized Tree for the Suffix "Juice"
[Figure: NULL -> milk:2]
Thus (juice, milk: 2) is a frequent pattern.

48  Now Patterns with the Suffix "Cookies"
Header table status: milk: 3; bread: 2 (next); cookies: 2 (now); juice (done)
[Figure: the prefix paths ending in cookies, (milk -> bread: 1) and (bread: 1), conditionalize to the tree NULL -> bread:2.]
Thus (cookies, bread: 2) is frequent.

49  Why is Frequent Pattern Growth Fast?
Performance studies show that FP-growth is an order of magnitude faster than Apriori.
Reasoning:
- No candidate generation, no candidate test
- Uses a compact data structure
- Eliminates repeated database scans
- The basic operations are counting and FP-tree building

50  Other Types of Association Rules
- Association rules among hierarchies
- Multidimensional associations
- Negative associations

51  FP-growth vs. Apriori: Scalability with the Support Threshold
[Figure: run time vs. support threshold on data set T25I20D10K]

52  FP-growth vs. Apriori: Scalability with the Number of Transactions
[Figure: run time vs. number of transactions on data set T25I20D100K, support threshold 1.5%]

53  FP-Growth: Pros and Cons
- The FP-tree is complete
  - It preserves complete information for frequent pattern mining
  - It never breaks a long pattern of any transaction
- The FP-tree is compact
  - It reduces irrelevant info: infrequent items are gone
  - Items are in frequency-descending order: the more frequently occurring, the more likely to be shared
  - It is never larger than the original database (not counting node-links and the count fields)
- The FP-tree is generated in one scan of the database (data stream mining?)
  - However, deriving the frequent patterns from the FP-tree is still computationally expensive: improved algorithms are needed for data streams.

