1
Association Rule Mining
2
Mining Association Rules in Large Databases
- Association rule mining
- Algorithms Apriori and FP-Growth
- Max and closed patterns
- Mining various kinds of association/correlation rules
3
Max-patterns & Closed Patterns
If there are frequent patterns with many items, enumerating all of them is costly. We may be interested in finding only the 'boundary' frequent patterns. Two types: max-patterns and closed patterns.
4
Max-patterns
A frequent pattern {a1, …, a100} implies C(100,1) + C(100,2) + … + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30 frequent sub-patterns!
Max-pattern: a frequent pattern without a proper frequent super-pattern.
Example (Min_sup = 2):
Tid | Items
10  | A, B, C, D, E
20  | B, C, D, E
30  | A, C, D, F
BCDE and ACD are max-patterns; BCD is not a max-pattern.
5
Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent.
[Figure: itemset lattice showing the border between frequent and infrequent itemsets, with the maximal itemsets sitting just inside the border]
6
Closed Itemset
An itemset is closed if none of its immediate supersets has the same support as the itemset.
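To make these definitions concrete, here is a small brute-force Python sketch (an illustration added for this example only; it assumes the three-transaction table and Min_sup = 2 from the max-pattern slide) that lists the frequent itemsets and flags which are closed and which are maximal.

```python
from itertools import combinations

# Transactions from the max-pattern example; Min_sup = 2.
transactions = [
    {"A", "B", "C", "D", "E"},
    {"B", "C", "D", "E"},
    {"A", "C", "D", "F"},
]
min_sup = 2
items = sorted(set().union(*transactions))

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Brute-force enumeration of all frequent itemsets (fine only for tiny examples).
frequent = {
    frozenset(c): support(set(c))
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if support(set(c)) >= min_sup
}

# Closed: no frequent proper superset has the same support.
closed = [X for X, s in frequent.items()
          if not any(X < Y and frequent[Y] == s for Y in frequent)]
# Maximal: no frequent proper superset at all.
maximal = [X for X in frequent if not any(X < Y for Y in frequent)]

print(sorted("".join(sorted(X)) for X in maximal))  # ['ACD', 'BCDE']
print(sorted("".join(sorted(X)) for X in closed))   # ['ACD', 'BCDE', 'CD']
```

Note how every maximal itemset is also closed, but not the other way around (CD is closed with support 3, yet not maximal because ACD is frequent).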
7
Maximal vs Closed Itemsets
[Figure: itemset lattice annotated with the transaction IDs supporting each itemset; itemsets not supported by any transaction are marked]
8
Maximal vs Closed Frequent Itemsets
Minimum support = 2; # Closed = 9, # Maximal = 4.
[Figure: itemset lattice highlighting which frequent itemsets are closed and maximal versus closed but not maximal]
9
Maximal vs Closed Itemsets
10
MaxMiner: Mining Max-patterns
Idea: generate the complete set-enumeration tree one level at a time, pruning where applicable.
Set-enumeration tree (each node shows its head, with the candidate tail in parentheses):
{}   (ABCD)
  A  (BCD)
    AB (CD)
      ABC (D)
        ABCD ()
      ABD ()
    AC (D)
      ACD ()
    AD ()
  B  (CD)
    BC (D)
      BCD ()
    BD ()
  C  (D)
    CD ()
  D  ()
11
Local Pruning Techniques (e.g. at node A)
Check the frequency of ABCD and of AB, AC, AD.
- If ABCD is frequent, prune the whole sub-tree.
- If AC is NOT frequent, remove C from the parenthesis (the tail) before expanding.
[Same set-enumeration tree as on the previous slide]
12
Algorithm MaxMiner
- Initially, generate one node N = <h(N), t(N)> with h(N) = ∅ and t(N) = {A, B, C, D}.
- When considering whether to expand N:
  - If h(N) ∪ t(N) is frequent, do not expand N.
  - If for some i ∈ t(N), h(N) ∪ {i} is NOT frequent, remove i from t(N) before expanding N.
- Apply global pruning techniques …
(Root node: {} (ABCD))
13
Global Pruning Technique (across sub-trees)
When a max pattern is identified (e.g. ABCD), prune all nodes (e.g. B, C and D) whose h(N) ∪ t(N) is a subset of it (e.g. of ABCD).
[Same set-enumeration tree as on the MaxMiner slide]
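Below is a compact Python sketch of this search (a simplified rendering written for these slides, not the original MaxMiner implementation): it walks the set-enumeration tree with the head/tail notation h(N), t(N), applies both pruning rules, and uses a final subsumption pass instead of MaxMiner's careful expansion order.

```python
def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def maxminer(transactions, min_sup):
    """Simplified MaxMiner-style search for max-patterns."""
    items = sorted({i for t in transactions for i in t
                    if support({i}, transactions) >= min_sup})
    max_patterns = []
    stack = [(frozenset(), items)]          # nodes as (head h(N), tail t(N))
    while stack:
        head, tail = stack.pop()
        # Global pruning: the node is covered by an already-found max pattern.
        if any(head | set(tail) <= m for m in max_patterns):
            continue
        # Local pruning 1: if h(N) ∪ t(N) is frequent, report it and stop expanding.
        if tail and support(head | set(tail), transactions) >= min_sup:
            max_patterns.append(head | set(tail))
            continue
        # Local pruning 2: drop items whose 1-extension h(N) ∪ {i} is infrequent.
        tail = [i for i in tail if support(head | {i}, transactions) >= min_sup]
        if not tail:
            if head:
                max_patterns.append(head)   # candidate; may be subsumed later
            continue
        for idx, i in enumerate(tail):      # expand; child i inherits the later items
            stack.append((head | {i}, tail[idx + 1:]))
    # Keep only patterns not contained in another reported pattern.
    return [m for m in max_patterns if not any(m < other for other in max_patterns)]

# The example from the following slides; expected max patterns: BCDE and ACD.
db = [{"A", "B", "C", "D", "E"}, {"B", "C", "D", "E"}, {"A", "C", "D", "F"}]
print([sorted(p) for p in maxminer(db, 2)])
```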
14
Example (Min_sup = 2)
Tid | Items
10  | A, B, C, D, E
20  | B, C, D, E
30  | A, C, D, F
Root node: {} (ABCDEF). Counted supports:
Items  | Frequency
ABCDEF | 0
A      | 2
B      | 2
C      | 3
D      | 3
E      | 2
F      | 1
ABCDEF is not frequent, so the root is expanded; F is infrequent, so it is removed from the tail. Children: A (BCDE), B (CDE), C (DE), D (E), E ().
Max patterns so far: (none)
15
Example (Min_sup = 2): node A
Items | Frequency
ABCDE | 1
AB    | 1
AC    | 2
AD    | 2
AE    | 1
ABCDE is not frequent, so the sub-tree cannot be pruned; AB and AE are infrequent, so B and E are removed from A's tail. Children: AC (D), AD ().
Max patterns so far: (none)
16
Example (Min_sup = 2): node B
Items | Frequency
BCDE  | 2
BC    | (not counted)
BD    | (not counted)
BE    | (not counted)
BCDE is frequent, so the whole sub-tree under B is pruned and BCDE is reported. By global pruning, nodes C, D and E are also pruned, since their h(N) ∪ t(N) is a subset of BCDE.
Max patterns so far: BCDE
17
Example (Min_sup = 2): node AC
Items | Frequency
ACD   | 2
ACD is frequent, so the sub-tree under AC is pruned and ACD is reported; node AD () is covered by ACD as well.
Max patterns: BCDE, ACD
18
Frequent Closed Patterns
For a frequent itemset X, if there exists no item y (outside X) such that every transaction containing X also contains y, then X is a frequent closed pattern.
Example (Min_sup = 2):
TID | Items
10  | a, b, c
20  | a, b, c
30  | a, b, d
40  | a, b, d
50  | e, f
"ab" is a frequent closed pattern.
Closed patterns are a concise representation of frequent patterns: they reduce the number of patterns and rules. (N. Pasquier et al., ICDT '99.)
19
Max Pattern vs. Frequent Closed Pattern
max pattern ⇒ closed pattern: if itemset X is a max pattern, adding any item to it would not give a frequent pattern; thus there exists no item y such that every transaction containing X also contains y.
closed pattern ⇏ max pattern: "ab" is a closed pattern, but not max, since "abc" (and "abd") is still frequent.
Example (Min_sup = 2):
TID | Items
10  | a, b, c
20  | a, b, c
30  | a, b, d
40  | a, b, d
50  | e, f
20
Mining Frequent Closed Patterns: CLOSET
Flist: list of all frequent items in support-ascending order. Here Flist = d-a-f-e-c.
Divide the search space: patterns having d; patterns having a but not d; etc.
Find frequent closed patterns recursively: among the transactions having d, cfa is frequent closed, so cfad is a frequent closed pattern.
Example (Min_sup = 2):
TID | Items
10  | a, c, d, e, f
20  | a, b, e
30  | c, e, f
40  | a, c, d, f
50  | c, e, f
J. Pei, J. Han & R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00.
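One way to see why cfad is closed (an added illustration, assuming the table and Min_sup = 2 above): the closure of an itemset X is the intersection of all transactions containing X, and X is closed exactly when that intersection adds nothing to X. A quick Python check:

```python
# Transactions from the CLOSET example (TIDs 10-50).
transactions = [
    {"a", "c", "d", "e", "f"},
    {"a", "b", "e"},
    {"c", "e", "f"},
    {"a", "c", "d", "f"},
    {"c", "e", "f"},
]

def closure(itemset):
    """Intersection of all transactions containing `itemset`, plus its support."""
    covering = [t for t in transactions if itemset <= t]
    return set.intersection(*covering), len(covering)

items, sup = closure({"d"})
print(sorted(items), sup)   # ['a', 'c', 'd', 'f'] 2 : cfad is a frequent closed pattern
```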
21
Multiple-Level Association Rules
- Items often form a hierarchy. Items at the lower level are expected to have lower support.
- Rules regarding itemsets at appropriate levels could be quite useful.
- A transactional database can be encoded based on dimensions and levels.
- We can explore shared multi-level mining.
[Figure: item hierarchy with Food at the top; milk (skim, 2% fat, e.g. Garelick) and bread (white, wheat, e.g. Wonder) below]
22
Mining Multi-Level Associations
A top-down, progressive deepening approach:
- First find high-level strong rules: milk → bread [20%, 60%].
- Then find their lower-level "weaker" rules: 2% fat milk → wheat bread [6%, 50%].
Variations in mining multiple-level association rules:
- Level-crossed association rules: skim milk → Wonder wheat bread
- Association rules with multiple, alternative hierarchies: full fat milk → Wonder bread
23
Multi-level Association: Uniform Support vs. Reduced Support
Uniform Support: the same minimum support for all levels.
+ One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support.
– Lower-level items do not occur as frequently, so if the support threshold is too high we miss low-level associations, and if it is too low we generate too many high-level associations.
24
Multi-level Association: Uniform Support vs. Reduced Support
Reduced Support: reduced minimum support at lower levels. There are 4 search strategies:
- Level-by-level independent: independent search at all levels (no misses)
- Level-cross filtering by k-itemset: prune a k-pattern if the corresponding k-pattern at the upper level is infrequent
- Level-cross filtering by single item: prune an item if its parent node is infrequent
- Controlled level-cross filtering by single item: also consider 'subfrequent' items that pass a passage threshold
25
Uniform Support
Multi-level mining with uniform support:
Level 1 (min_sup = 5%): Milk [support = 10%]: frequent
Level 2 (min_sup = 5%): full fat Milk [support = 6%]: frequent; Skim Milk [support = 4%]: pruned (below min_sup)
26
Reduced Support
Multi-level mining with reduced support:
Level 1 (min_sup = 5%): Milk [support = 10%]: frequent
Level 2 (min_sup = 3%): full fat Milk [support = 6%]: frequent; Skim Milk [support = 4%]: frequent
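A tiny sketch of the difference (illustration only; the support values and thresholds are the ones from these two slides, and the hierarchy level of each item is hard-coded):

```python
# Example supports and hierarchy levels from the Milk example.
supports = {"Milk": 0.10, "full fat Milk": 0.06, "Skim Milk": 0.04}
levels   = {"Milk": 1, "full fat Milk": 2, "Skim Milk": 2}

def frequent_items(min_sup_per_level):
    """Keep an item if its support reaches the threshold of its own level."""
    return [i for i, s in supports.items() if s >= min_sup_per_level[levels[i]]]

print(frequent_items({1: 0.05, 2: 0.05}))  # uniform support: Skim Milk is pruned
print(frequent_items({1: 0.05, 2: 0.03}))  # reduced support: Skim Milk survives
```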
27
Pattern Evaluation
- Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant.
- Redundant: e.g. {A,B,C} → {D} and {A,B} → {D} have the same support & confidence.
- Interestingness measures can be used to prune/rank the derived patterns.
- In the original formulation of association rules, support & confidence are the only measures used.
28
Computing Interestingness Measure
Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.
Contingency table for X → Y:
        Y     ¬Y
X       f11   f10   f1+
¬X      f01   f00   f0+
        f+1   f+0   |T|
f11: support of X and Y; f10: support of X and ¬Y; f01: support of ¬X and Y; f00: support of ¬X and ¬Y.
Used to define various measures: support, confidence, lift, Gini, J-measure, etc.
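As an added illustration, the following helper derives support, confidence, and lift for X → Y from the four counts f11, f10, f01, f00 using the standard formulas:

```python
def rule_measures(f11, f10, f01, f00):
    """Support, confidence and lift of the rule X -> Y from contingency counts."""
    n = f11 + f10 + f01 + f00
    support = f11 / n                    # P(X, Y)
    confidence = f11 / (f11 + f10)       # P(Y | X)
    p_y = (f11 + f01) / n                # P(Y)
    lift = confidence / p_y              # P(Y | X) / P(Y)
    return support, confidence, lift
```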
29
Drawback of Confidence
        Coffee   ¬Coffee
Tea     15       5        20
¬Tea    75       5        80
        90       10       100
Association Rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 15/20 = 0.75, but P(Coffee) = 0.9.
Although the confidence is high, the rule is misleading: P(Coffee|¬Tea) = 75/80 = 0.9375.
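Plugging the Tea/Coffee counts from this table into the rule_measures helper defined after the contingency-table slide reproduces these numbers (and the lift value used two slides later):

```python
support, confidence, lift = rule_measures(f11=15, f10=5, f01=75, f00=5)
print(support, confidence, round(lift, 4))   # 0.15 0.75 0.8333
```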
30
Statistical Independence
Population of 1000 students:
- 600 students know how to swim (S)
- 700 students know how to bike (B)
- 420 students know how to swim and bike (S,B)
P(S,B) = 420/1000 = 0.42 and P(S) × P(B) = 0.6 × 0.7 = 0.42
- P(S,B) = P(S) × P(B): statistical independence
- P(S,B) > P(S) × P(B): positively correlated
- P(S,B) < P(S) × P(B): negatively correlated
31
Statistical-based Measures
Measures that take into account statistical dependence.
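The measures typically listed at this point, in standard notation (these are the textbook definitions, given here since the slide presents them only as formulas):
Lift = P(Y|X) / P(Y)
Interest I = P(X,Y) / (P(X) P(Y))
PS = P(X,Y) - P(X) P(Y)
φ-coefficient = (P(X,Y) - P(X) P(Y)) / sqrt( P(X) [1 - P(X)] P(Y) [1 - P(Y)] )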
32
Example: Lift/Interest
        Coffee   ¬Coffee
Tea     15       5        20
¬Tea    75       5        80
        90       10       100
Association Rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9.
Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated).
33
Drawback of Lift & Interest
        Y    ¬Y
X       10   0     10
¬X      0    90    90
        10   90    100
Lift = 0.1 / (0.1 × 0.1) = 10

        Y    ¬Y
X       90   0     90
¬X      0    10    10
        90   10    100
Lift = 0.9 / (0.9 × 0.9) = 1.11
Lift is much higher for the first table even though X and Y co-occur in only 10% of the transactions there, versus 90% in the second.
Statistical independence: if P(X,Y) = P(X)P(Y), then Lift = 1.
34
There are lots of measures proposed in the literature. Some measures are good for certain applications, but not for others. What criteria should we use to determine whether a measure is good or bad? What about Apriori-style support-based pruning? How does it affect these measures?
35
Properties of a Good Measure
Piatetsky-Shapiro: 3 properties a good measure M must satisfy:
- M(A,B) = 0 if A and B are statistically independent
- M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged
- M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged
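A quick numeric check of the three properties, using PS(A,B) = P(A,B) - P(A)P(B) as an example of a measure that satisfies them (PS is chosen here purely for illustration), with the swim/bike probabilities from the statistical-independence slide:

```python
def ps(p_ab, p_a, p_b):
    """Piatetsky-Shapiro's measure PS = P(A,B) - P(A)P(B)."""
    return p_ab - p_a * p_b

print(ps(0.42, 0.6, 0.7))                        # 0.0 under independence (property 1)
print(ps(0.50, 0.6, 0.7) > ps(0.42, 0.6, 0.7))   # True: grows with P(A,B) (property 2)
print(ps(0.42, 0.7, 0.7) < ps(0.42, 0.6, 0.7))   # True: shrinks as P(A) grows (property 3)
```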
36
Comparing Different Measures
[Table: 10 example contingency tables]
[Table: rankings of the contingency tables under various interestingness measures]
37
Property under Variable Permutation
Does M(A,B) = M(B,A)?
- Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
- Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.
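For instance (an added check reusing the Tea/Coffee counts from the confidence slide), lift gives the same value in both directions while confidence does not:

```python
n, f11, f10, f01 = 100, 15, 5, 75                   # Tea/Coffee contingency counts
lift_tc = (f11 / (f11 + f10)) / ((f11 + f01) / n)   # lift(Tea -> Coffee)
lift_ct = (f11 / (f11 + f01)) / ((f11 + f10) / n)   # lift(Coffee -> Tea)
conf_tc = f11 / (f11 + f10)                          # confidence(Tea -> Coffee)
conf_ct = f11 / (f11 + f01)                          # confidence(Coffee -> Tea)
print(round(lift_tc, 4), round(lift_ct, 4))  # 0.8333 0.8333 (symmetric)
print(conf_tc, round(conf_ct, 4))            # 0.75 0.1667 (asymmetric)
```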