Data Mining Association Analysis: Basic Concepts and Algorithms
Lecture Notes for Chapter 6, Introduction to Data Mining, by Tan, Steinbach, Kumar
Mining Association Rules
- Two-step approach:
  1. Frequent Itemset Generation – generate all itemsets whose support ≥ minsup
  2. Rule Generation – generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Apriori Algorithm
- Method:
  – Let k = 1
  – Generate frequent itemsets of length 1
  – Repeat until no new frequent itemsets are identified:
    - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
    - Prune candidate itemsets containing subsets of length k that are infrequent
    - Count the support of each candidate by scanning the DB
    - Eliminate candidates that are infrequent, leaving only those that are frequent
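The loop above maps directly onto code. Below is a minimal sketch of the generate-and-prune loop, assuming transactions are given as Python sets of items and minsup is an absolute support count; it uses a simple "frequent itemset plus one frequent item" candidate generation rather than the book's F(k-1) × F(k-1) join, so it illustrates the idea rather than reproducing the exact algorithm from the slides.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup):
    """Return {frozenset(itemset): support count} for every frequent itemset.

    transactions: list of sets of items; minsup: absolute support count.
    """
    # k = 1: count individual items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= minsup}
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        # Generate length-(k+1) candidates from length-k frequent itemsets
        frequent_items = {item for itemset in frequent for item in itemset}
        candidates = {itemset | {item}
                      for itemset in frequent
                      for item in frequent_items if item not in itemset}
        # Prune candidates that contain an infrequent length-k subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, k))}
        # Count support of the surviving candidates with one scan of the DB
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Eliminate infrequent candidates
        frequent = {c: n for c, n in counts.items() if n >= minsup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```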
Rule Generation
- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L – f satisfies the minimum confidence requirement
  – If {A,B,C,D} is a frequent itemset, the candidate rules are:
    ABC → D, ABD → C, ACD → B, BCD → A,
    A → BCD, B → ACD, C → ABD, D → ABC,
    AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
- If |L| = k, then there are 2^k – 2 candidate association rules (ignoring L → ∅ and ∅ → L)
- In a rule X → Y, X is the antecedent and Y is the consequent
Rule Generation
- If {A,B,C,D} is a frequent itemset, the candidate rules are: ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
- Confidence of ABC → D:
  – count(ABCD) / count(ABC)
  – Do we need to scan the whole DB?
- It is still computationally inefficient to consider all candidate rules
  – Is there any pruning strategy?
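The slide's first question has a simple answer: no rescan is needed, because every subset of a frequent itemset is itself frequent, so both counts were already recorded during frequent itemset generation. A small sketch, reusing the support-count dictionary produced by a function like apriori_frequent_itemsets above (a hypothetical helper, not part of the slides):

```python
def rule_confidence(antecedent, consequent, support_counts):
    """Confidence of the rule antecedent -> consequent from stored support counts."""
    antecedent = frozenset(antecedent)
    itemset = antecedent | frozenset(consequent)
    # Both counts are looked up, not recomputed: no pass over the database
    return support_counts[itemset] / support_counts[antecedent]

# Example: c(ABC -> D) = count(ABCD) / count(ABC)
# rule_confidence({"A", "B", "C"}, {"D"}, support_counts)
```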
Rule Generation
- Consider the rule X ∪ {a} → Y
  – Confidence = count(X ∪ {a} ∪ Y) / count(X ∪ {a})
- Assume the confidence of X ∪ {a} → Y is lower than the minimum threshold
- What about the rule X → Y ∪ {a}?
  – count(X ∪ {a} ∪ Y) / count(X) ≤ count(X ∪ {a} ∪ Y) / count(X ∪ {a})
  – since count(X ∪ {a}) ≤ count(X), this rule cannot meet the threshold either
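A compact restatement of the same argument, in notation not used on the slide (σ denotes support count):

```latex
c(X \cup \{a\} \rightarrow Y)
  = \frac{\sigma(X \cup \{a\} \cup Y)}{\sigma(X \cup \{a\})}
  \;\ge\;
  \frac{\sigma(X \cup \{a\} \cup Y)}{\sigma(X)}
  = c(X \rightarrow Y \cup \{a\}),
\qquad \text{since } \sigma(X \cup \{a\}) \le \sigma(X).
```

So if X ∪ {a} → Y already fails the confidence threshold, X → Y ∪ {a} can be pruned without computing its confidence.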
Rule Generation
- How to efficiently generate rules from frequent itemsets?
  – In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
  – But the confidence of rules generated from the same itemset does have an anti-monotone property
  – e.g., for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
  – Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
Rule Generation for Apriori Algorithm
[Figure: lattice of rules generated from a frequent itemset, showing a low-confidence rule and the rules pruned beneath it]
Rule Generation for Apriori Algorithm
- A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
- join(CD → AB, BD → AC) would produce the candidate rule D → ABC
- Prune rule D → ABC if its subset rule AD → BC does not have high confidence
Confidence-Based Pruning
- Initially, all the high-confidence rules that have only one item in the rule consequent are extracted
- These rules are then used to generate new candidate rules
- For example, if {acd} → {b} and {abd} → {c} are high-confidence rules, then the candidate rule {ad} → {bc} is generated by merging the consequents of both rules
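A sketch of this level-wise rule generation for a single frequent itemset, representing each rule by its consequent (the antecedent is the rest of the itemset) and reusing the hypothetical rule_confidence helper above; this is a simplified rendering of the idea, not the book's exact pseudocode.

```python
from itertools import combinations

def generate_rules(itemset, support_counts, minconf):
    """High-confidence rules (antecedent, consequent) from one frequent itemset."""
    itemset = frozenset(itemset)
    rules = []
    # Start with all high-confidence rules that have a single item in the consequent
    consequents = []
    for item in itemset:
        consequent = frozenset([item])
        if rule_confidence(itemset - consequent, consequent, support_counts) >= minconf:
            consequents.append(consequent)
            rules.append((itemset - consequent, consequent))
    # Level-wise: merge consequents of surviving rules to build larger consequents
    m = 1
    while consequents and m + 1 < len(itemset):
        prev = set(consequents)
        candidates = {c1 | c2 for c1, c2 in combinations(consequents, 2)
                      if len(c1 | c2) == m + 1}
        consequents = []
        for consequent in candidates:
            # Prune if any smaller consequent already gave a low-confidence rule
            if not all(frozenset(sub) in prev for sub in combinations(consequent, m)):
                continue
            if rule_confidence(itemset - consequent, consequent, support_counts) >= minconf:
                consequents.append(consequent)
                rules.append((itemset - consequent, consequent))
        m += 1
    return rules
```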
Confidence-Based Pruning
- Confidence threshold = 50%
- Rules with a single item in the consequent, from the frequent itemset {Bread, Milk, Diaper}:
  – {Bread, Milk} → {Diaper} (confidence = 3/3)
  – {Bread, Diaper} → {Milk} (confidence = 3/3)
  – {Diaper, Milk} → {Bread} (confidence = 3/3)
[Figure: support counts of the items (1-itemsets), pairs (2-itemsets), and triplets (3-itemsets); not preserved in this transcript]
Confidence-Based Pruning
- Merge {Bread, Milk} → {Diaper} and {Bread, Diaper} → {Milk} to obtain the candidate rule {Bread} → {Diaper, Milk} (confidence = 3/4)
- …
Case Study: Congressional Voting Records
- List of binary attributes from the 1984 US Congressional Voting Records (data source: UCI; 435 transactions with 34 items)
  1. democrat, republican
  2. handicapped-infants: 2 (y, n)
  3. water-project-cost-sharing: 2 (y, n)
  4. adoption-of-the-budget-resolution: 2 (y, n)
  5. physician-fee-freeze: 2 (y, n)
  6. el-salvador-aid: 2 (y, n)
  7. religious-groups-in-schools: 2 (y, n)
  8. anti-satellite-test-ban: 2 (y, n)
  9. aid-to-nicaraguan-contras: 2 (y, n)
  10. mx-missile: 2 (y, n)
  11. immigration: 2 (y, n)
  12. synfuels-corporation-cutback: 2 (y, n)
  13. education-spending: 2 (y, n)
  14. superfund-right-to-sue: 2 (y, n)
  15. crime: 2 (y, n)
  16. duty-free-exports: 2 (y, n)
  17. export-administration-act-south-africa: 2 (y, n)
Case Study: Congressional Voting Records
- Sample transactions
  – republican, n, y, n, y, y, y, n, n, n, y, ?, y, y, y, n, y
  – democrat, ?, y, y, ?, y, y, n, n, n, n, y, n, y, y, n, n
  – democrat, y, y, y, n, ?, y, n, n, n, n, y, n, y, n, n, y
- Data pre-processing
  – Map each attribute value to one of 34 items (e.g. 1 = republican, 2 = democrat, 3 = handicapped-infants=yes, 4 = handicapped-infants=no, …), skipping missing values (?)
  – Transaction 1: 1, 4, …
  – Transaction 2: 2, …
  – Transaction 3: 2, 3, …
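A sketch of this pre-processing step, assuming the raw records look like the sample lines above (party label followed by 16 y/n/? votes). The integer item IDs follow the scheme suggested by the slide (1 = republican, 2 = democrat, then a yes-item and a no-item per vote), but the exact encoding is an assumption.

```python
def votes_to_transactions(records):
    """Convert raw voting records into transactions of integer item IDs.

    records: list of lists such as ["republican", "n", "y", ..., "y"]
    with the party label followed by 16 votes; '?' entries are skipped.
    """
    transactions = []
    for record in records:
        party, votes = record[0], record[1:]
        items = {1 if party == "republican" else 2}
        for i, vote in enumerate(votes):
            if vote == "y":
                items.add(3 + 2 * i)   # item for "attribute i = yes"
            elif vote == "n":
                items.add(4 + 2 * i)   # item for "attribute i = no"
            # '?' (missing value) adds no item, as on the slide
        transactions.append(items)
    return transactions

# 2 party items + 16 votes x 2 values = 34 distinct items in total
```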
Case Study: Congressional Voting Records
- Association rules extracted using minsup = 30% and minconf = 90%:
  – {budget resolution = no, MX-missile = no, aid to El Salvador = yes} → {republican} (confidence = 91%)
  – {budget resolution = yes, MX-missile = yes, aid to El Salvador = no} → {democrat} (confidence = 97.5%)
  – {crime = yes, right-to-sue = yes, physician fee freeze = yes} → {republican} (confidence = 93.5%)
  – {crime = no, right-to-sue = no, physician fee freeze = no} → {democrat} (confidence = 100%)
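One way to reproduce this kind of extraction in practice is the third-party mlxtend library (not mentioned on the slides). The sketch below assumes a boolean one-hot DataFrame with one column per item, e.g. "republican", "mx-missile=yes":

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

def extract_rules(one_hot: pd.DataFrame, minsup: float = 0.30, minconf: float = 0.90):
    """Mine association rules from a boolean one-hot transaction DataFrame."""
    frequent = apriori(one_hot, min_support=minsup, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=minconf)
    return rules[["antecedents", "consequents", "support", "confidence"]]
```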
Pattern Evaluation
- Association rule algorithms tend to produce too many rules
  – many of them are uninteresting or redundant
  – e.g., {A,B,C} → {D} is redundant if {A,B} → {D} has the same support and confidence
- Interestingness measures can be used to prune or rank the derived patterns
- In the original formulation of association rules, support and confidence are the only measures used
Application of Interestingness Measure
[Figure: where interestingness measures are applied in the pattern-mining workflow; not preserved in this transcript]
Computing Interestingness Measure
- Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table

  Contingency table for X → Y:

              Y      ¬Y
     X       f11    f10    f1+
    ¬X       f01    f00    f0+
             f+1    f+0    |T|

  – f11: support count of X and Y
  – f10: support count of X and ¬Y
  – f01: support count of ¬X and Y
  – f00: support count of ¬X and ¬Y
- These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc.
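A small helper (my own, following the f-notation above) that computes a few of these measures straight from the four cell counts:

```python
def measures_from_contingency(f11, f10, f01, f00):
    """Support, confidence, and lift of X -> Y from a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00      # |T|
    p_xy = f11 / n                 # P(X, Y)
    p_x = (f11 + f10) / n          # P(X)
    p_y = (f11 + f01) / n          # P(Y)
    return {
        "support": p_xy,
        "confidence": p_xy / p_x,  # P(Y | X)
        "lift": (p_xy / p_x) / p_y,
    }

# Tea -> Coffee example on the next slide: f11=15, f10=5, f01=75, f00=5
# gives confidence = 0.75 and lift of roughly 0.83.
```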
Drawback of Confidence

              Coffee   ¬Coffee
     Tea         15         5     20
    ¬Tea         75         5     80
                 90        10    100

- Association rule: Tea → Coffee
- Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9
- Although confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 75/80 = 0.9375
Statistical Independence
- Population of 1000 students
  – 600 students know how to swim (S)
  – 700 students know how to bike (B)
  – 420 students know how to swim and bike (S, B)
  – P(S, B) = 420/1000 = 0.42
  – P(S) × P(B) = 0.6 × 0.7 = 0.42
  – P(S, B) = P(S) × P(B) ⇒ statistical independence
  – P(S, B) > P(S) × P(B) ⇒ positively correlated
  – P(S, B) < P(S) × P(B) ⇒ negatively correlated
Statistical-Based Measures
- Measures that take into account statistical dependence
[Figure: formulas of several statistical-based measures; not preserved in this transcript]
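The formulas on this slide are not preserved in the transcript. For reference, the standard definitions of the measures discussed on the following slides are (textbook definitions, not copied from the slide image):

```latex
\mathrm{Lift} = \frac{P(Y \mid X)}{P(Y)}, \qquad
\mathrm{Interest} = \frac{P(X,Y)}{P(X)\,P(Y)}, \qquad
PS = P(X,Y) - P(X)\,P(Y), \qquad
\phi = \frac{P(X,Y) - P(X)\,P(Y)}{\sqrt{P(X)\,[1 - P(X)]\,P(Y)\,[1 - P(Y)]}}
```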
Example: Lift/Interest

              Coffee   ¬Coffee
     Tea         15         5     20
    ¬Tea         75         5     80
                 90        10    100

- Association rule: Tea → Coffee
- Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9
- Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
Drawback of Lift & Interest

            Y    ¬Y
     X     10     0    10
    ¬X      0    90    90
           10    90   100

            Y    ¬Y
     X     90     0    90
    ¬X      0    10    10
           90    10   100

- X and Y always occur together in both tables, yet Lift = 0.1 / (0.1 × 0.1) = 10 for the first table and only 0.9 / (0.9 × 0.9) ≈ 1.11 for the second
- Statistical independence: if P(X,Y) = P(X)P(Y), then Lift = 1
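Checking the two tables with the measures_from_contingency helper sketched earlier (hypothetical helper):

```python
# X and Y always co-occur in both tables, but in 10% vs 90% of transactions
print(measures_from_contingency(10, 0, 0, 90)["lift"])  # 10.0
print(measures_from_contingency(90, 0, 0, 10)["lift"])  # about 1.11
```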
There are lots of measures proposed in the literature
- Some measures are good for certain applications, but not for others
- What criteria should we use to determine whether a measure is good or bad?
- What about Apriori-style support-based pruning? How does it affect these measures?
Properties of a Good Measure
- Piatetsky-Shapiro: three properties a good measure M must satisfy:
  – M(A,B) = 0 if A and B are statistically independent
  – M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged
  – M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged
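As a worked check (my own example, not on the slide), the PS measure PS(A,B) = P(A,B) − P(A)P(B) satisfies all three properties:

```latex
\begin{aligned}
&\text{(1) } P(A,B) = P(A)\,P(B) \;\Rightarrow\; PS = 0,\\
&\text{(2) } \partial PS / \partial P(A,B) = 1 > 0 \quad \text{for fixed } P(A), P(B),\\
&\text{(3) } \partial PS / \partial P(A) = -P(B) \le 0 \quad \text{for fixed } P(A,B), P(B).
\end{aligned}
```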
Comparing Different Measures
[Figure: 10 example contingency tables and their rankings under various interestingness measures; not preserved in this transcript]
Property under Variable Permutation
- Does M(A,B) = M(B,A)?
- Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
- Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.
Property under Row/Column Scaling
- Grade-Gender example (Mosteller, 1968):

            Male   Female
   High        2        3      5
   Low         1        4      5
               3        7     10

  After scaling the Male column by 2× and the Female column by 10×:

            Male   Female
   High        4       30     34
   Low         2       40     42
               6       70     76

- Mosteller: the underlying association should be independent of the relative number of male and female students in the samples
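A small numeric check (mine, not on the slide) contrasting a measure that is invariant under this scaling, the odds ratio, with one that is not, the φ-coefficient:

```python
import math

def odds_ratio(f11, f10, f01, f00):
    """Odds ratio of a 2x2 table; invariant under row/column scaling."""
    return (f11 * f00) / (f10 * f01)

def phi(f11, f10, f01, f00):
    """phi-coefficient of a 2x2 table; not invariant under row/column scaling."""
    num = f11 * f00 - f10 * f01
    den = math.sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return num / den

print(odds_ratio(2, 3, 1, 4), odds_ratio(4, 30, 2, 40))  # 2.67 and 2.67
print(phi(2, 3, 1, 4), phi(4, 30, 2, 40))                # about 0.22 vs 0.13
```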
Property under Inversion Operation
- The inversion operation flips every item's presence/absence bit in each transaction (0 ↔ 1)
[Figure: transaction-by-item bit vectors for Transaction 1 through Transaction N, before and after inversion; not preserved in this transcript]
Example: φ-Coefficient
- The φ-coefficient is analogous to the correlation coefficient for continuous variables

            Y    ¬Y
     X     60    10    70
    ¬X     10    20    30
           70    30   100

            Y    ¬Y
     X     20    10    30
    ¬X     10    60    70
           30    70   100

- The φ-coefficient is the same for both tables
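Using the phi helper sketched above (hypothetical), both tables indeed give the same coefficient:

```python
print(phi(60, 10, 10, 20))   # about 0.524
print(phi(20, 10, 10, 60))   # about 0.524
```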
Property under Null Addition
- Invariant measures: support, cosine, Jaccard, etc.
- Non-invariant measures: correlation, Gini, mutual information, odds ratio, etc.
Different Measures have Different Properties
[Table: summary of the properties (symmetry, scaling invariance, inversion invariance, null addition invariance, etc.) of the various measures; not preserved in this transcript]