Association Rule Mining. Mining Association Rules in Large Databases  Association rule mining  Algorithms Apriori and FP-Growth  Max and closed patterns.

1 Association Rule Mining

2 Mining Association Rules in Large Databases
Association rule mining
Algorithms Apriori and FP-Growth
Max and closed patterns
Mining various kinds of association/correlation rules

3 Max-patterns & Close-patterns
If there are frequent patterns with many items, enumerating all of them is costly.
We may be interested in finding the 'boundary' frequent patterns.
Two types …

4 Max-patterns
A frequent pattern {a_1, …, a_100} has $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ frequent sub-patterns!
Max-pattern: a frequent pattern without a proper frequent super-pattern.
With Min_sup = 2 on the transactions below, BCDE and ACD are max-patterns; BCD is not a max-pattern.
Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F
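
The max-pattern definition on this slide can be checked by brute force on the three-transaction table. This is only a sketch for tiny inputs: it enumerates every candidate itemset, which is exactly the cost that max-pattern mining tries to avoid. The helper names (`support`, `frequent_itemsets`, `max_patterns`) are illustrative, not taken from the slides.

```python
from itertools import combinations

# Transactions from the slide (Tid -> items); min_sup is an absolute count.
transactions = {
    10: {"A", "B", "C", "D", "E"},
    20: {"B", "C", "D", "E"},
    30: {"A", "C", "D", "F"},
}
MIN_SUP = 2


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)


def frequent_itemsets(db, min_sup):
    """Enumerate all non-empty frequent itemsets by brute force."""
    universe = sorted(set().union(*db.values()))
    result = {}
    for k in range(1, len(universe) + 1):
        for combo in combinations(universe, k):
            candidate = frozenset(combo)
            sup = support(candidate, db)
            if sup >= min_sup:
                result[candidate] = sup
    return result


def max_patterns(frequent):
    """Frequent itemsets that have no proper frequent superset."""
    return [x for x in frequent if not any(x < y for y in frequent)]


frequent = frequent_itemsets(transactions, MIN_SUP)
maximal = sorted("".join(sorted(x)) for x in max_patterns(frequent))
print(maximal)  # prints ['ACD', 'BCDE']
```

BCD is frequent here but is not reported, because its superset BCDE is also frequent.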

5 Maximal Frequent Itemset
An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent.
[Figure: itemset lattice with the border separating the maximal itemsets from the infrequent itemsets]

6 Closed Itemset  An itemset is closed if none of its immediate supersets has the same support as the itemset

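7 Maximal vs Closed Itemsets
[Figure: itemset lattice annotated with transaction IDs; some itemsets are not supported by any transaction]

8 Maximal vs Closed Frequent Itemsets
Minimum support = 2; # Closed = 9; # Maximal = 4
[Figure: itemset lattice marking itemsets that are closed and maximal, and itemsets that are closed but not maximal]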

9 Maximal vs Closed Itemsets

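10 MaxMiner: Mining Max-patterns
Idea: generate the complete set-enumeration tree one level at a time, pruning where applicable.
Set-enumeration tree over {A, B, C, D}, each node written as head (tail):
Level 0: ∅ (ABCD)
Level 1: A (BCD), B (CD), C (D), D ()
Level 2: AB (CD), AC (D), AD (), BC (D), BD (), CD ()
Level 3: ABC (D), ABD (), ACD (), BCD ()
Level 4: ABCD ()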

11 Local Pruning Techniques (e.g. at node A)
Check the frequency of ABCD and of AB, AC, AD.
If ABCD is frequent, prune the whole sub-tree.
If AC is NOT frequent, remove C from the parentheses before expanding.
Set-enumeration tree, each node written as head (tail):
Level 0: ∅ (ABCD)
Level 1: A (BCD), B (CD), C (D), D ()
Level 2: AB (CD), AC (D), AD (), BC (D), BD (), CD ()
Level 3: ABC (D), ABD (), ACD (), BCD ()
Level 4: ABCD ()

12 Algorithm MaxMiner
Initially, generate one node N, where h(N) = ∅ and t(N) = {A, B, C, D}.
Consider expanding N:
  If h(N) ∪ t(N) is frequent, do not expand N.
  If for some i ∈ t(N), h(N) ∪ {i} is NOT frequent, remove i from t(N) before expanding N.
Apply global pruning techniques …
Root node: ∅ (ABCD)

13 Global Pruning Technique (across sub-trees)
When a max pattern is identified (e.g. ABCD), prune all nodes N (e.g. B, C and D) where h(N) ∪ t(N) is a subset of it (e.g. of ABCD).
Set-enumeration tree, each node written as head (tail):
Level 0: ∅ (ABCD)
Level 1: A (BCD), B (CD), C (D), D ()
Level 2: AB (CD), AC (D), AD (), BC (D), BD (), CD ()
Level 3: ABC (D), ABD (), ACD (), BCD ()
Level 4: ABCD ()

14 Example
Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F
Min_sup = 2
Root node: ∅ (ABCDEF)
Items   Frequency
ABCDEF  0
A       2
B       2
C       3
D       3
E       2
F       1
Tree after expanding the root: A (BCDE), B (CDE), C (DE), D (E), E ()
Max patterns: (none yet)

15 Example
Transactions: 10: A,B,C,D,E; 20: B,C,D,E; 30: A,C,D,F (Min_sup = 2)
Node A
Items   Frequency
ABCDE   1
AB      1
AC      2
AD      2
AE      1
Tree: ∅ (ABCDEF); A (BCDE), B (CDE), C (DE), D (E), E (); under A: AC (D), AD ()
Max patterns: (none yet)

16 Example
Transactions: 10: A,B,C,D,E; 20: B,C,D,E; 30: A,C,D,F (Min_sup = 2)
Node B
Items   Frequency
BCDE    2
BC, BD, BE  (frequencies not shown)
Tree: ∅ (ABCDEF); A (BCDE), B (CDE), C (DE), D (E), E (); under A: AC (D), AD ()
Max patterns: BCDE

17 Example
Transactions: 10: A,B,C,D,E; 20: B,C,D,E; 30: A,C,D,F (Min_sup = 2)
Node AC
Items   Frequency
ACD     2
Tree: ∅ (ABCDEF); A (BCDE), B (CDE), C (DE), D (E), E (); under A: AC (D), AD ()
Max patterns: BCDE, ACD
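
The walk-through on slides 14 to 17 can be reproduced with a simplified, level-by-level sketch of the MaxMiner idea. It keeps only the two local pruning rules and the subset-based global pruning described above; the item ordering heuristics and support lower-bounding of the published algorithm are left out. The names `max_miner`, `record` and `support` are illustrative, and items are assumed to be single-character strings.

```python
def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)


def max_miner(db, min_sup):
    """Simplified MaxMiner: breadth-first over head/tail nodes with pruning."""
    maximal = []  # max-pattern candidates found so far (kept as an antichain)

    def record(candidate):
        # Ignore the empty set and anything covered by a known candidate.
        if not candidate or any(candidate <= m for m in maximal):
            return
        # Drop earlier candidates that the new one strictly contains.
        maximal[:] = [m for m in maximal if not m < candidate]
        maximal.append(candidate)

    items = sorted(set().union(*db))
    level = [(frozenset(), items)]  # root: h(N) = {}, t(N) = all items
    while level:
        next_level = []
        for head, tail in level:
            full = head | frozenset(tail)
            # Global pruning: h(N) ∪ t(N) lies inside a known max pattern.
            if any(full <= m for m in maximal):
                continue
            # Local pruning 1: h(N) ∪ t(N) is frequent, so do not expand N.
            if full and support(full, db) >= min_sup:
                record(full)
                continue
            # Local pruning 2: remove tail items i with h(N) ∪ {i} infrequent.
            tail = [i for i in tail if support(head | {i}, db) >= min_sup]
            if not tail:
                record(head)  # nothing left to add to this head
                continue
            for pos, item in enumerate(tail):
                next_level.append((head | {item}, tail[pos + 1:]))
        level = next_level
    return sorted("".join(sorted(m)) for m in maximal)


db = [{"A", "B", "C", "D", "E"}, {"B", "C", "D", "E"}, {"A", "C", "D", "F"}]
print(max_miner(db, 2))  # prints ['ACD', 'BCDE']
```

On this database the sketch visits the same nodes as the slides: the root drops F, node A keeps only C and D in its tail, node B yields BCDE, nodes C, D and E are pruned globally, and node AC yields ACD.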

18 Frequent Closed Patterns
For a frequent itemset X, if there exists no item y ∉ X such that every transaction containing X also contains y, then X is a frequent closed pattern.
  Example: "ab" is a frequent closed pattern in the table below (Min_sup = 2).
Concise representation of frequent patterns.
Reduces the number of patterns and rules.
N. Pasquier et al. In ICDT '99.
TID  Items
10   a, b, c
20   a, b, c
30   a, b, d
40   a, b, d
50   e, f

19 Max Pattern vs. Frequent Closed Pattern
max pattern ⇒ closed pattern
  If itemset X is a max pattern, adding any item to it would not give a frequent pattern; thus there exists no item y ∉ X such that every transaction containing X also contains y.
closed pattern ⇏ max pattern
  "ab" is a closed pattern, but not max (Min_sup = 2).
TID  Items
10   a, b, c
20   a, b, c
30   a, b, d
40   a, b, d
50   e, f
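
A small sketch that applies both definitions to the five-transaction table, to make the "closed but not max" case concrete. It is brute force and only meant for toy data; `cover`, `is_closed` and `is_max` are illustrative names.

```python
from itertools import combinations

# Transactions 10..50 from the slide; min_sup is an absolute count.
db = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "d"}, {"e", "f"}]
MIN_SUP = 2
ALL_ITEMS = set().union(*db)


def cover(itemset, db):
    """Transactions that contain every item of `itemset`."""
    return [t for t in db if itemset <= t]


def is_closed(itemset, db):
    """No item y outside X occurs in every transaction containing X."""
    covering = cover(itemset, db)
    return not any(all(y in t for t in covering) for y in ALL_ITEMS - itemset)


def is_max(itemset, db, min_sup):
    """No one-item extension of X is frequent (hence no superset is)."""
    return not any(
        len(cover(itemset | {y}, db)) >= min_sup for y in ALL_ITEMS - itemset
    )


frequent = [
    frozenset(c)
    for k in range(1, len(ALL_ITEMS) + 1)
    for c in combinations(sorted(ALL_ITEMS), k)
    if len(cover(frozenset(c), db)) >= MIN_SUP
]
closed = sorted("".join(sorted(x)) for x in frequent if is_closed(x, db))
maximal = sorted("".join(sorted(x)) for x in frequent if is_max(x, db, MIN_SUP))
print(closed)   # prints ['ab', 'abc', 'abd']
print(maximal)  # prints ['abc', 'abd']
```

Every max pattern appears in the closed list, while "ab" (support 4) is closed without being max because "abc" and "abd" are still frequent.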

20 Mining Frequent Closed Patterns: CLOSET
Flist: list of all frequent items in support-ascending order
  Flist: d-a-f-e-c
Divide the search space:
  Patterns having d
  Patterns having a but not d, etc.
Find frequent closed patterns recursively:
  Among the transactions having d, cfa is frequent closed ⇒ cfad is a frequent closed pattern
J. Pei, J. Han & R. Mao. "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00.
TID  Items
10   a, c, d, e, f
20   a, b, e
30   c, e, f
40   a, c, d, f
50   c, e, f
Min_sup = 2
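
The first two steps of the slide (building the Flist and closing the patterns that contain d) can be sketched as follows. This is not the CLOSET algorithm itself, which works on FP-tree projections; it only illustrates the two quantities on this table. Ties in support are broken alphabetically here, which is an assumption: the slide lists the three support-4 items as f-e-c.

```python
from collections import Counter

# Transactions from the slide (TID -> items); min_sup is an absolute count.
db = {
    10: {"a", "c", "d", "e", "f"},
    20: {"a", "b", "e"},
    30: {"c", "e", "f"},
    40: {"a", "c", "d", "f"},
    50: {"c", "e", "f"},
}
MIN_SUP = 2

# Flist: frequent items in support-ascending order (ties: alphabetical).
counts = Counter(item for items in db.values() for item in items)
flist = sorted(
    (i for i in counts if counts[i] >= MIN_SUP), key=lambda i: (counts[i], i)
)


def closure(itemset, db):
    """Intersection of all transactions that contain `itemset`."""
    covering = [items for items in db.values() if itemset <= items]
    return set.intersection(*covering) if covering else set()


d_closure = closure({"d"}, db)
d_support = sum(1 for items in db.values() if {"d"} <= items)
print("-".join(flist))               # prints d-a-c-e-f
print(sorted(d_closure), d_support)  # prints ['a', 'c', 'd', 'f'] 2
```

The closure of {d} is {a, c, d, f} with support 2, which is the "cfad" pattern named on the slide.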

21 Multiple-Level Association Rules
Items often form a hierarchy.
Items at the lower level are expected to have lower support.
Rules regarding itemsets at appropriate levels could be quite useful.
A transactional database can be encoded based on dimensions and levels.
We can explore shared multi-level mining.
[Figure: concept hierarchy with Food at the root; bread and milk below it; skim and 2% fat under milk; white and wheat under bread; brands such as Garelick and Wonder at the lowest level, ...]

22 Mining Multi-Level Associations
A top-down, progressive deepening approach:
  First find high-level strong rules: milk → bread [20%, 60%].
  Then find their lower-level “weaker” rules: 2% fat milk → wheat bread [6%, 50%].
Variations on mining multiple-level association rules:
  Level-crossed association rules: skim milk → Wonder wheat bread
  Association rules with multiple, alternative hierarchies: full fat milk → Wonder bread

23 Multi-level Association: Uniform Support vs. Reduced Support
Uniform Support: the same minimum support for all levels
  + One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
  - Lower-level items do not occur as frequently. If the support threshold is
      too high ⇒ miss low-level associations
      too low ⇒ generate too many high-level associations

24 Multi-level Association: Uniform Support vs. Reduced Support
Reduced Support: reduced minimum support at lower levels
There are 4 search strategies:
  Level-by-level independent
    Independent search at all levels (no misses)
  Level-cross filtering by k-itemset
    Prune a k-pattern if the corresponding k-pattern at the upper level is infrequent
  Level-cross filtering by single item
    Prune an item if its parent node is infrequent
  Controlled level-cross filtering by single item
    Consider 'subfrequent' items that pass a passage threshold

25 Uniform Support
Multi-level mining with uniform support
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 5%): full fat Milk [support = 6%]; Skim Milk [support = 4%] X (below min_sup)

26 Reduced Support
Multi-level mining with reduced support
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 3%): full fat Milk [support = 6%]; Skim Milk [support = 4%]
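
The difference between the two policies on the milk example comes down to which threshold each level is compared against. A minimal sketch, using the supports and thresholds from slides 25 and 26; `frequent_by_level` is an illustrative helper, not part of any library.

```python
# Item supports per hierarchy level, as given on the slides.
supports = {
    1: {"milk": 0.10},
    2: {"full fat milk": 0.06, "skim milk": 0.04},
}


def frequent_by_level(supports, min_sup_by_level):
    """Keep, per level, the items whose support meets that level's threshold."""
    return {
        level: sorted(i for i, s in items.items() if s >= min_sup_by_level[level])
        for level, items in supports.items()
    }


uniform = frequent_by_level(supports, {1: 0.05, 2: 0.05})
reduced = frequent_by_level(supports, {1: 0.05, 2: 0.03})
print(uniform[2])  # prints ['full fat milk']
print(reduced[2])  # prints ['full fat milk', 'skim milk']
```

With a uniform 5% threshold skim milk is lost at level 2; lowering the level-2 threshold to 3% keeps it.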

27 Pattern Evaluation
Association rule algorithms tend to produce too many rules:
  many of them are uninteresting or redundant
  Redundant if {A,B,C} → {D} and {A,B} → {D} have the same support & confidence
Interestingness measures can be used to prune/rank the derived patterns.
In the original formulation of association rules, support & confidence are the only measures used.

28 Computing Interestingness Measure
Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.
Contingency table for X → Y:
        Y     ¬Y
X      f11   f10   f1+
¬X     f01   f00   f0+
       f+1   f+0   |T|
f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y
Used to define various measures: support, confidence, lift, Gini, J-measure, etc.

29 Drawback of Confidence
         Coffee  ¬Coffee  Total
Tea        15       5       20
¬Tea       75       5       80
Total      90      10      100
Association Rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 15/20 = 0.75, but P(Coffee) = 0.9
Although confidence is high, the rule is misleading: P(Coffee|¬Tea) = 75/80 = 0.9375

30 Statistical Independence
Population of 1000 students:
  600 students know how to swim (S)
  700 students know how to bike (B)
  420 students know how to swim and bike (S,B)
P(S ∧ B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42
P(S ∧ B) = P(S) × P(B) => Statistical independence
P(S ∧ B) > P(S) × P(B) => Positively correlated
P(S ∧ B) < P(S) × P(B) => Negatively correlated
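
The three cases can be told apart directly from the counts. The sketch below compares P(S ∧ B) with P(S) × P(B) after multiplying both sides by n², so that the test uses exact integer arithmetic; the function name `correlation` and the two extra counts in the usage lines (500 and 300) are hypothetical.

```python
def correlation(n_total, n_a, n_b, n_ab):
    """Classify A and B by comparing P(A∧B) with P(A)·P(B) on raw counts."""
    joint = n_ab * n_total  # P(A∧B) scaled by n_total**2
    product = n_a * n_b     # P(A)·P(B) scaled by n_total**2
    if joint == product:
        return "independent"
    return "positively correlated" if joint > product else "negatively correlated"


# Student example from the slide: 1000 students, 600 swim, 700 bike, 420 both.
print(correlation(1000, 600, 700, 420))  # prints independent
# Hypothetical variations of the joint count:
print(correlation(1000, 600, 700, 500))  # prints positively correlated
print(correlation(1000, 600, 700, 300))  # prints negatively correlated
```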

31 Statistical-based Measures  Measures that take into account statistical dependence

32 Example: Lift/Interest
         Coffee  ¬Coffee  Total
Tea        15       5       20
¬Tea       75       5       80
Total      90      10      100
Association Rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9
Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
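
The numbers on this slide follow directly from the four cells of the table. A short sketch, with the cell names f11, f10, f01, f00 borrowed from the contingency-table slide:

```python
# Contingency counts: rows Tea / not Tea, columns Coffee / not Coffee.
f11, f10 = 15, 5   # Tea & Coffee, Tea & not Coffee
f01, f00 = 75, 5   # not Tea & Coffee, not Tea & not Coffee
n = f11 + f10 + f01 + f00

confidence = f11 / (f11 + f10)      # P(Coffee | Tea)
p_coffee = (f11 + f01) / n          # P(Coffee)
lift = confidence / p_coffee        # P(Coffee | Tea) / P(Coffee)
conf_not_tea = f01 / (f01 + f00)    # P(Coffee | not Tea)

print(round(confidence, 4), round(p_coffee, 4))  # prints 0.75 0.9
print(round(lift, 4))                            # prints 0.8333
print(round(conf_not_tea, 4))                    # prints 0.9375
```

A lift below 1 says that knowing a customer drinks tea makes coffee less likely than it is overall, which confidence alone hides.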

33 Drawback of Lift & Interest
Table 1:
        Y    ¬Y   Total
X      10     0     10
¬X      0    90     90
Total  10    90    100
Table 2:
        Y    ¬Y   Total
X      90     0     90
¬X      0    10     10
Total  90    10    100
Statistical independence: If P(X,Y) = P(X)P(Y) => Lift = 1
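
Computing lift for the two tables shows the drawback. In both tables X and Y always occur together, yet the table where they co-occur in 90% of the transactions gets a lift much closer to 1, the value that indicates independence. The sketch uses lift = P(X,Y) / (P(X)P(Y)) written on raw counts; `lift` is an illustrative helper.

```python
def lift(f11, f10, f01, f00):
    """Lift of X -> Y from the cells of a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    return (f11 * n) / ((f11 + f10) * (f11 + f01))


lift_table_1 = lift(10, 0, 0, 90)  # X and Y co-occur in 10% of transactions
lift_table_2 = lift(90, 0, 0, 10)  # X and Y co-occur in 90% of transactions
print(lift_table_1)            # prints 10.0
print(round(lift_table_2, 2))  # prints 1.11
```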

34 There are lots of measures proposed in the literature.
Some measures are good for certain applications, but not for others.
What criteria should we use to determine whether a measure is good or bad?
What about Apriori-style support-based pruning? How does it affect these measures?

35 Properties of A Good Measure
Piatetsky-Shapiro: 3 properties a good measure M must satisfy:
  M(A,B) = 0 if A and B are statistically independent
  M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged
  M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged
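
As one concrete instance, the measure PS(A,B) = P(A,B) - P(A)P(B), often called leverage, can be spot-checked against the three properties. The slide does not name this measure, so its use here is an assumption, and the checks below are single numeric examples with made-up probabilities rather than a proof. Exact fractions avoid floating-point noise.

```python
from fractions import Fraction as F


def ps(p_ab, p_a, p_b):
    """Leverage / Piatetsky-Shapiro measure: P(A,B) - P(A) * P(B)."""
    return p_ab - p_a * p_b


p_a, p_b = F(6, 10), F(7, 10)

# Property 1: zero under independence, P(A,B) = P(A) * P(B) = 0.42.
independent = ps(p_a * p_b, p_a, p_b)

# Property 2: grows with P(A,B) while P(A) and P(B) stay fixed.
grows = ps(F(50, 100), p_a, p_b) > ps(F(42, 100), p_a, p_b)

# Property 3: shrinks as P(A) grows while P(A,B) and P(B) stay fixed.
shrinks = ps(F(42, 100), F(8, 10), p_b) < ps(F(42, 100), p_a, p_b)

print(independent, grows, shrinks)  # prints 0 True True
```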

36 Comparing Different Measures
10 examples of contingency tables: [table not recovered]
Rankings of contingency tables using various measures: [table not recovered]

37 Property under Variable Permutation
Does M(A,B) = M(B,A)?
Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.
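
Swapping A and B transposes the contingency table, that is, it exchanges f10 and f01. A minimal sketch checks three of the listed measures under that swap, reusing the Tea/Coffee counts from the earlier slides as sample data (A = Tea, B = Coffee); the function names are illustrative.

```python
def confidence(f11, f10, f01, f00):
    """P(B | A)."""
    return f11 / (f11 + f10)


def lift(f11, f10, f01, f00):
    """P(A,B) / (P(A) * P(B))."""
    n = f11 + f10 + f01 + f00
    return (f11 * n) / ((f11 + f10) * (f11 + f01))


def jaccard(f11, f10, f01, f00):
    """P(A,B) / P(A or B)."""
    return f11 / (f11 + f10 + f01)


def swap(f11, f10, f01, f00):
    """Exchange the roles of A and B (transpose the table)."""
    return f11, f01, f10, f00


table = (15, 5, 75, 5)  # A = Tea, B = Coffee
for measure in (lift, jaccard, confidence):
    symmetric = measure(*table) == measure(*swap(*table))
    print(measure.__name__, symmetric)
# prints: lift True, jaccard True, confidence False (one per line)
```

Confidence changes from 15/20 for Tea → Coffee to 15/90 for Coffee → Tea, while lift and Jaccard are unaffected.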

