Mining Generalized Association Rules
Ramakrishnan Srikant, Rakesh Agrawal
Data Mining Seminar, spring semester 2003, Prof. Amos Fiat
Student: Idit Haran
Outline: Motivation; Terms & Definitions; Interest Measure; Algorithms for mining generalized association rules; Comparison; Conclusions.
Motivation: We want to find association rules of the form Diapers → Beer. There are different kinds of diapers (Huggies/Pampers, S/M/L, etc.) and different kinds of beers (Heineken/Maccabi, in a bottle/in a can, etc.). The information on the bar-code is of the form: Huggies Diapers, M → Heineken Beer in bottle. A rule at this bar-code level is not interesting, and probably will not have minimum support.
Taxonomy: is-a hierarchies. Example taxonomy: Clothes is the parent of Outwear and Shirts; Outwear is the parent of Jackets and Ski Pants; Footwear is the parent of Shoes and Hiking Boots.
Taxonomy – Example: Suppose we found the rule Outwear → Hiking Boots with minimum support and confidence. The rule Jackets → Hiking Boots may not have minimum support, and the rule Clothes → Hiking Boots may not have minimum confidence.
Taxonomy: Users are interested in generating rules that span different levels of the taxonomy. Rules at lower levels may not have minimum support. The taxonomy can be used to prune uninteresting or redundant rules. Multiple taxonomies may be present, for example: category, price (cheap, expensive), "items-on-sale", etc. Multiple taxonomies may be modeled as a forest or a DAG.
Notations (taxonomy diagram): an edge denotes an is_a relationship between a child and its parent; ancestors are marked with ^ (e.g., x^ is an ancestor of x), and the nodes below an item are its descendants.
Notations: I = {i1, i2, …, im} is the set of items. T is a transaction, a set of items T ⊆ I (we expect the items in T to be leaves of the taxonomy). D is the set of transactions. A transaction T supports an item x if x is in T or x is an ancestor of some item in T. T supports X ⊆ I if it supports every item in X.
Notations: A generalized association rule is X → Y, where X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and no item in Y is an ancestor of any item in X. The rule X → Y has confidence c in D if c% of the transactions in D that support X also support Y. The rule X → Y has support s in D if s% of the transactions in D support X ∪ Y.
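As a concrete illustration of these definitions, here is a minimal Python sketch that computes generalized support and confidence over extended transactions (the taxonomy table, helper names, and the toy data mirror the example slides; they are illustrative, not from the paper):

```python
# Minimal sketch of generalized support and confidence. The child -> parent
# taxonomy table and the toy transactions mirror the example slides.
PARENT = {"Jacket": "Outwear", "Ski Pants": "Outwear", "Outwear": "Clothes",
          "Shirt": "Clothes", "Shoes": "Footwear", "Hiking Boots": "Footwear"}

def ancestors(item):
    """All ancestors of an item in the taxonomy."""
    result = set()
    while item in PARENT:
        item = PARENT[item]
        result.add(item)
    return result

def extend(transaction):
    """T' = T plus the ancestors of every item in T."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item)
    return extended

def support(itemset, transactions):
    """Fraction of transactions whose extended form contains the itemset."""
    return sum(1 for t in transactions if set(itemset) <= extend(t)) / len(transactions)

def confidence(x, y, transactions):
    """conf(X -> Y) = support(X u Y) / support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

transactions = [{"Shirt"}, {"Jacket", "Hiking Boots"}, {"Ski Pants", "Hiking Boots"},
                {"Shoes"}, {"Shoes"}, {"Jacket"}]
print(support({"Outwear", "Hiking Boots"}, transactions))      # 0.333...
print(confidence({"Outwear"}, {"Hiking Boots"}, transactions)) # 0.666...
```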
Problem Statement: Find all generalized association rules whose support and confidence are greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf), respectively.
Example: Recall the taxonomy: Clothes is the parent of Outwear and Shirts; Outwear is the parent of Jackets and Ski Pants; Footwear is the parent of Shoes and Hiking Boots.
Example (minsup = 30%, minconf = 60%)
Database D:
  Transaction 100: Shirt
  Transaction 200: Jacket, Hiking Boots
  Transaction 300: Ski Pants, Hiking Boots
  Transaction 400: Shoes
  Transaction 500: Shoes
  Transaction 600: Jacket
Frequent itemsets (support count):
  {Jacket} 2, {Outwear} 3, {Clothes} 4, {Shoes} 2, {Hiking Boots} 2, {Footwear} 4,
  {Outwear, Hiking Boots} 2, {Clothes, Hiking Boots} 2, {Outwear, Footwear} 2, {Clothes, Footwear} 2
Rules (support, confidence):
  Outwear → Hiking Boots (33%, 66.6%)
  Outwear → Footwear (33%, 66.6%)
  Hiking Boots → Outwear (33%, 100%)
  Hiking Boots → Clothes (33%, 100%)
Observation 1: If the set {x, y} has minimum support, so do {x^, y}, {x, y^} and {x^, y^}. For example: if {Jacket, Shoes} has minsup, so will {Outwear, Shoes}, {Jacket, Footwear}, and {Outwear, Footwear}.
Observation 2: If the rule x → y has minimum support and confidence, only x → y^ is guaranteed to have both minsup and minconf. For example: the rule Outwear → Hiking Boots has minsup and minconf, so the rule Outwear → Footwear also has both minsup and minconf.
Observation 2 – cont.: However, while the rules x^ → y and x^ → y^ will have minsup, they may not have minconf. For example: the rules Clothes → Hiking Boots and Clothes → Footwear have minsup, but not minconf.
Interesting Rules – Previous Work: A rule X → Y is not interesting if support(X ∪ Y) ≈ support(X) × support(Y). Previous work does not consider the taxonomy. This interest measure pruned less than 1% of the rules on a real database.
Interesting Rules – Using the Taxonomy: Suppose Milk → Cereal has 8% support and 70% confidence. Milk is the parent of Skim Milk, and 25% of the sales of Milk are Skim Milk. We then expect Skim Milk → Cereal to have 2% support and 70% confidence.
R-Interesting Rules: A rule X → Y is R-interesting w.r.t. an ancestor rule X^ → Y^ if:
real support(X → Y) > R × expected support of (X → Y) based on (X^ → Y^), or
real confidence(X → Y) > R × expected confidence of (X → Y) based on (X^ → Y^).
With R = 1.1, about 40-55% of the rules were pruned.
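A small Python sketch of the R-interest test, using the Milk/Skim Milk style of expected values (scaling the ancestor rule's support by the fraction of the ancestor's sales covered by the specialized item is the assumption here; all names are illustrative):

```python
# Sketch of the R-interest test. Expected values for the specialized rule are
# scaled from its ancestor rule, as in the Milk / Skim Milk example above.
def expected_from_ancestor(ancestor_support, ancestor_confidence, fraction_of_ancestor):
    """Expected support/confidence of X -> Y, given its ancestor rule X^ -> Y^
    and the fraction of X^'s sales accounted for by X."""
    return ancestor_support * fraction_of_ancestor, ancestor_confidence

def is_r_interesting(real_support, real_confidence,
                     expected_support, expected_confidence, r=1.1):
    return (real_support > r * expected_support or
            real_confidence > r * expected_confidence)

# Milk -> Cereal has 8% support and 70% confidence; Skim Milk is 25% of Milk sales.
exp_sup, exp_conf = expected_from_ancestor(0.08, 0.70, 0.25)
print(exp_sup, exp_conf)                                # 0.02 0.7
print(is_r_interesting(0.04, 0.70, exp_sup, exp_conf))  # True: support is twice the expected 2%
```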
Problem Statement (new): Find all generalized R-interesting association rules (R is a user-specified minimum interest, called min-interest) that have support and confidence greater than minsup and minconf, respectively.
Algorithms – 3 steps:
1. Find all itemsets whose support is greater than minsup. These itemsets are called frequent itemsets.
2. Use the frequent itemsets to generate the desired rules: if ABCD and AB are frequent, then conf(AB → CD) = support(ABCD) / support(AB).
3. Prune all uninteresting rules from this set.
* All presented algorithms implement only step 1.
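A short Python sketch of step 2, deriving rules from frequent-itemset supports (the support values and names are illustrative; a full implementation would enumerate every frequent itemset, not just one):

```python
from itertools import combinations

# Sketch of rule generation (step 2) from frequent-itemset supports.
# The support values below are illustrative.
supports = {
    frozenset({"Outwear"}): 0.50,
    frozenset({"Hiking Boots"}): 0.33,
    frozenset({"Outwear", "Hiking Boots"}): 0.33,
}

def rules_from_itemset(itemset, supports, minconf):
    """Emit X -> Y with conf(X -> Y) = support(X u Y) / support(X) >= minconf."""
    items = sorted(itemset)
    for k in range(1, len(items)):
        for antecedent in combinations(items, k):
            x = frozenset(antecedent)
            conf = supports[itemset] / supports[x]
            if conf >= minconf:
                yield set(x), set(itemset) - set(x), conf

for x, y, conf in rules_from_itemset(frozenset({"Outwear", "Hiking Boots"}), supports, 0.6):
    print(x, "->", y, f"(confidence {conf:.2f})")
```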
Algorithms (step 1): Input: database, taxonomy. Output: all frequent itemsets. Three algorithms (same output, different run-time): Basic, Cumulate, EstMerge.
Algorithm Basic – Main Idea: Is itemset X frequent? Does transaction T support X? (X may contain items from different levels of the taxonomy, while T contains only leaves.) Let T' = T ∪ ancestors(T); then T supports X if and only if X ⊆ T'.
Algorithm Basic: 1. Count item occurrences (first pass). 2. Generate new k-itemset candidates. 3. For each transaction t, add all ancestors of each item in t to t, removing any duplicates. 4. Find the support of all the candidates. 5. Keep only those with support over minsup. A sketch of this loop follows below.
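A minimal Python sketch of this loop, under the assumption that candidate generation is done by a naive join without the prune step (function and variable names are illustrative, not the authors' code):

```python
# Minimal sketch of algorithm Basic: extend each transaction with its items'
# ancestors, then run Apriori-style passes. All names here are illustrative.
def basic(transactions, parent, minsup_count):
    def ancestors(item):
        out = set()
        while item in parent:
            item = parent[item]
            out.add(item)
        return out

    # "Add all ancestors of each item in t to t, removing any duplicates."
    extended = [set(t) | {a for i in t for a in ancestors(i)} for t in transactions]

    # First pass: count single items (leaves and ancestors alike).
    counts = {}
    for t in extended:
        for i in t:
            counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    all_frequent, k = dict(frequent), 2

    while frequent:
        # Naive join of (k-1)-frequent itemsets (no prune step in this sketch).
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        counts = {c: sum(1 for t in extended if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= minsup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```

Running it on the six-transaction example database with a minimum support count of 2 reproduces the frequent itemsets listed on the example slide, such as {Outwear, Hiking Boots}.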
Candidate generation: Join step: p and q are two frequent (k-1)-itemsets that are identical in their first k-2 items; join them by adding the last item of q to p. Prune step: check all (k-1)-subsets of each candidate, and remove any candidate that has an infrequent ("small") subset.
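A compact Python sketch of this join-and-prune step, assuming each itemset is kept as a sorted tuple (the L2 values below are illustrative):

```python
from itertools import combinations

# Sketch of the candidate-generation step: join frequent (k-1)-itemsets that
# share their first k-2 items, then prune candidates with an infrequent subset.
def gen_candidates(frequent_k_minus_1):
    prev = sorted(frequent_k_minus_1)        # each itemset is a sorted tuple
    prev_set = set(prev)
    candidates = set()
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:          # join step
                cand = p + (q[-1],)
                # prune step: every (k-1)-subset must be frequent
                if all(sub in prev_set for sub in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
    return candidates

L2 = {("Clothes", "Footwear"), ("Clothes", "Hiking Boots"), ("Footwear", "Hiking Boots")}
print(gen_candidates(L2))   # {("Clothes", "Footwear", "Hiking Boots")}
```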
Optimization 1 – Filtering the ancestors added to transactions: We only need to add to transaction t those ancestors that appear in one of the candidates. If the original item does not appear in any candidate, it can be dropped from the transaction. Example: if the only candidate is {Clothes, Shoes}, a transaction t = {Jacket, …} can be replaced with {Clothes, …}.
Optimization 2 – Pre-computing ancestors: Rather than finding the ancestors of each item by traversing the taxonomy graph, we can pre-compute the ancestors of each item. At the same time, we can drop ancestors that are not contained in any of the candidates.
Optimization 3 – Pruning itemsets containing an item and its ancestor: If we have {Jacket} and {Outwear}, we will generate the candidate {Jacket, Outwear}, which is not interesting, since support({Jacket}) = support({Jacket, Outwear}). Deleting {Jacket, Outwear} at k = 2 ensures it will never reappear for k > 2 (because of the prune step of the candidate-generation method). Therefore, we only need to prune candidates containing an item and its ancestor at k = 2; in later passes, no candidate will include an item together with its ancestor.
Algorithm Cumulate: Optimization 2: compute T*, the set of ancestors of each item, from the taxonomy. Optimization 3: delete any candidate in C2 that consists of an item and its ancestor. Optimization 1: delete any ancestors in T* that are not present in any of the candidates in Ck. Then (Optimization 2 again): for each item x in transaction t, add all ancestors of x in T* to t, and remove any duplicates from t. A sketch combining these optimizations follows below.
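A Python sketch of the three optimizations as standalone helpers (names and data structures are illustrative assumptions, not the paper's pseudocode):

```python
# Sketch of Cumulate's optimizations; the child -> parent map and all
# function names are illustrative.
def precompute_ancestors(parent):
    """Optimization 2: build a table item -> set of all its ancestors."""
    table = {}
    for item in set(parent) | set(parent.values()):
        anc, cur = set(), item
        while cur in parent:
            cur = parent[cur]
            anc.add(cur)
        table[item] = anc
    return table

def prune_item_ancestor(c2, ancestors_of):
    """Optimization 3: drop 2-candidates consisting of an item and its ancestor."""
    return {c for c in c2
            if not any(b in ancestors_of.get(a, set()) for a in c for b in c)}

def add_filtered_ancestors(transaction, ancestors_of, items_in_candidates):
    """Optimizations 1 + 2: extend t only with ancestors that occur in some candidate."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors_of.get(item, set()) & items_in_candidates
    return extended
```

In a full pass, items_in_candidates would be the union of items appearing in the current candidate set Ck, recomputed before each pass over the database.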
Stratification: Consider the candidates {Clothes, Shoes}, {Outwear, Shoes}, {Jacket, Shoes}. If {Clothes, Shoes} does not have minimum support, we don't need to count either {Outwear, Shoes} or {Jacket, Shoes}. So we count in steps: step 1: count {Clothes, Shoes}, and if it has minsup, step 2: count {Outwear, Shoes}, and if it has minsup, step 3: count {Jacket, Shoes}.
Version 1: Stratify: Depth of an itemset: itemsets with no parents have depth 0; otherwise, depth(X) = max({depth(X^) | X^ is a parent of X}) + 1. The algorithm: count all itemsets C0 of depth 0; delete candidates that are descendants of the itemsets in C0 that did not have minsup; count the remaining itemsets at depth 1 (C1); delete candidates that are descendants of the itemsets in C1 that did not have minsup; count the remaining itemsets at depth 2 (C2); and so on. A sketch of the depth computation appears below.
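A Python sketch of the depth computation, where a parent of an itemset is obtained by generalizing one item one level up the taxonomy (the paper takes parents among the current candidate set; generalizing over the whole taxonomy here is a simplifying assumption, and the taxonomy table is illustrative):

```python
from functools import lru_cache

# Taxonomy as a child -> parent map (same example taxonomy as earlier sketches).
PARENT = {"Jacket": "Outwear", "Ski Pants": "Outwear", "Outwear": "Clothes",
          "Shirt": "Clothes", "Shoes": "Footwear", "Hiking Boots": "Footwear"}

def itemset_parents(itemset):
    """Itemsets obtained by replacing exactly one item with its parent."""
    items = tuple(sorted(itemset))
    for i, item in enumerate(items):
        if item in PARENT:
            yield frozenset(items[:i] + (PARENT[item],) + items[i + 1:])

@lru_cache(maxsize=None)
def depth(itemset):
    """Itemsets with no parents have depth 0; otherwise 1 + max parent depth."""
    parents = list(itemset_parents(itemset))
    return 0 if not parents else 1 + max(depth(p) for p in parents)

print(depth(frozenset({"Clothes", "Footwear"})))  # 0: both items are roots
print(depth(frozenset({"Clothes", "Shoes"})))     # 1: parent {Clothes, Footwear}
print(depth(frozenset({"Jacket", "Shoes"})))      # 3: via {Outwear, Shoes}, {Clothes, Shoes}
```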
Tradeoff & Optimizations: The tradeoff is between the number of candidates counted and the number of passes over the database; counting each depth in a separate pass minimizes the candidates counted but maximizes the passes (Cumulate counts them all in a single pass). Optimization 1: count multiple depths together, from a certain level onward. Optimization 2: count more than 20% of the candidates per pass.
Version 2: Estimate: Estimate candidate support using a sample. 1st pass (C'k): count candidates that are expected to have minsup (i.e., candidates that have at least 0.9 × minsup in the sample), plus candidates whose parents are expected to have minsup. 2nd pass (C"k): count children of candidates in C'k that were not expected to have minsup.
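A small Python sketch of how C'k might be selected from the sample counts (the 0.9 factor is from the slide; the function and its arguments are illustrative assumptions):

```python
# Sketch: split candidates into C'_k (counted over D in this pass) and C''_k
# (counted later), based on support estimates from a sample. Names are illustrative.
def split_candidates(candidates, sample_support, parents_of, minsup):
    expected = {c for c in candidates if sample_support.get(c, 0.0) >= 0.9 * minsup}
    # Keep a candidate if it is expected to be frequent, or if some parent of it is.
    c_prime = {c for c in candidates
               if c in expected or any(p in expected for p in parents_of(c))}
    c_second = set(candidates) - c_prime
    return c_prime, c_second
```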
Example for Estimate (minsup = 5%):
  {Clothes, Shoes}: support in sample 8%; support in database: 7% (scenario A), 9% (scenario B)
  {Outwear, Shoes}: support in sample 4%; support in database: 4% (scenario A), 6% (scenario B)
  {Jacket, Shoes}: support in sample 2%
Version 3: EstMerge: Motivation: eliminate the 2nd pass of algorithm Estimate. Implementation: count the candidates of C"k together with the candidates in C'k+1. Restriction: to create C'k+1 we assume that all candidates in C"k have minsup. The tradeoff: the extra candidates counted by EstMerge vs. the extra pass made by Estimate.
Algorithm EstMerge: In the first pass, count item occurrences and generate a sample Ds of the database. In each subsequent pass k: generate new k-itemset candidates Ck from Lk-1 ∪ C"k-1; estimate the support of the candidates in Ck by making a pass over the sample Ds; let C'k be the candidates that are expected to have minsup plus the candidates whose parents are expected to have minsup, and C"k the remaining candidates in Ck that are not in C'k; find the support of C'k ∪ C"k-1 by making a pass over D; delete candidates in Ck whose ancestors in C'k don't have minsup; all candidates in C'k with minsup, and all candidates in C"k found to have minsup (once counted in the next pass), are added to the frequent itemsets.
Stratify – Variants
Size of Sample: Pr[support in sample < a], where p is the real support:
                 p = 5%          p = 1%          p = 0.5%        p = 0.1%
                 a=.8p  a=.9p    a=.8p  a=.9p    a=.8p  a=.9p    a=.8p  a=.9p
  n = 1,000      0.32   0.76     0.80   0.95     0.89   0.97     0.98   0.99
  n = 10,000     0.00   0.07     0.11   0.59     0.34   0.77     0.80   0.95
  n = 100,000    0.00   0.00     0.00   0.01     0.00   0.07     0.12   0.60
  n = 1,000,000  0.00   0.00     0.00   0.00     0.00   0.00     0.00   0.01
Performance Evaluation: Compare the running time of the 3 algorithms: Basic, Cumulate and EstMerge. On synthetic data: the effect of each parameter on performance. On real data: supermarket data and department store data.
Synthetic Data Generation (parameter: default value):
  |D|  Number of transactions: 1,000,000
  |T|  Average size of the transactions: 10
  |I|  Average size of the maximal potentially frequent itemsets: 4
  |I|  Number of maximal potentially frequent itemsets: 10,000
  N    Number of items: 100,000
  R    Number of roots: 250
  L    Number of levels: 4-5
  F    Fanout: 5
  D    Depth-ratio (probability that an item in a rule comes from level i / probability that it comes from level i+1): 1
Results on synthetic data (graphs): Minimum Support; Number of Transactions; Fanout; Number of Items.
Reality Check: Supermarket data: 548,000 items; taxonomy: 4 levels, 118 roots; ~1.5 million transactions; an average of 9.6 items per transaction. Department store data: 228,000 items; taxonomy: 7 levels, 89 roots; 570,000 transactions; an average of 4.4 items per transaction.
Results
Conclusions: Cumulate and EstMerge were 2 to 5 times faster than Basic on all synthetic datasets; on the supermarket database they were 100 times faster! EstMerge was ~25-30% faster than Cumulate. Both EstMerge and Cumulate exhibit linear scale-up with the number of transactions.
Summary: A taxonomy is necessary for finding association rules between items at any level of the hierarchy. The obvious solution (algorithm Basic) is not very fast; the new algorithms that exploit the taxonomy are much faster. We can also use the taxonomy to prune uninteresting rules.