Association Rules & Correlations


1. Association Rules & Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods:
  - Apriori, and improvements
  - FP-growth
- Rule postmining: visualization and validation
- Interesting association rules

2. Rule Validations
- Only a small subset of derived rules might be meaningful/useful
  - A domain expert must validate the rules
- Useful tools:
  - Visualization
  - Correlation analysis

3. Visualization of Association Rules: Plane Graph

4. Visualization of Association Rules (SGI/MineSet 3.0)

5. Pattern Evaluation
- Association rule algorithms tend to produce too many rules
  - Many of them are uninteresting or redundant
  - confidence(A → B) = P(B|A) = P(A & B)/P(A)
  - Confidence alone is not a discriminative enough criterion
  - We need to go beyond the original support and confidence
  - Interestingness measures can be used to prune or rank the derived patterns

6. Application of Interestingness Measure
[Figure: Interestingness Measures]

7. Computing Interestingness Measure
- Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X → Y:

        Y      ¬Y
X       f11    f10    f1+
¬X      f01    f00    f0+
        f+1    f+0    |T|

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

- The table is used to define various measures: support, confidence, lift, Gini, J-measure, etc.
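
As a rough illustration of how the four cells of the contingency table feed the measures named on this slide, the sketch below computes support, confidence, and lift. The class name `ContingencyTable` and the counts are hypothetical, chosen only to make the arithmetic easy to follow; lift is taken as P(Y|X)/P(Y).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContingencyTable:
    """2x2 counts for a rule X -> Y (hypothetical helper, not from the slides)."""
    f11: int  # transactions containing X and Y
    f10: int  # transactions containing X but not Y
    f01: int  # transactions containing Y but not X
    f00: int  # transactions containing neither

    @property
    def total(self) -> int:
        # |T|: total number of transactions
        return self.f11 + self.f10 + self.f01 + self.f00

    def support(self) -> float:
        # P(X, Y) = f11 / |T|
        return self.f11 / self.total

    def confidence(self) -> float:
        # P(Y | X) = f11 / f1+
        return self.f11 / (self.f11 + self.f10)

    def lift(self) -> float:
        # P(Y | X) / P(Y), where P(Y) = f+1 / |T|
        p_y = (self.f11 + self.f01) / self.total
        return self.confidence() / p_y


# Hypothetical counts, used only to illustrate the calculation.
table = ContingencyTable(f11=30, f10=10, f01=20, f00=40)
print(table.support(), table.confidence(), table.lift())
```

Keeping the raw counts rather than precomputed probabilities means any further measure defined on the same table can be added as one more method.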

8. Drawback of Confidence

         Coffee   ¬Coffee
Tea      15       5         20
¬Tea     75       5         80
         90       10        100

Association rule: Tea → Coffee
- Confidence = P(Coffee|Tea) = 15/20 = 0.75, but P(Coffee) = 0.9, which is greater than 0.75
- Although confidence is high, the rule is misleading
- P(Coffee|¬Tea) = 75/80 = 0.9375, which is much greater than 0.75

9. Statistical-Based Measures
- Measures that take into account statistical dependence:

$$\text{Lift} = \frac{P(Y \mid X)}{P(Y)}$$

$$\text{Interest} = \frac{P(X, Y)}{P(X)\,P(Y)}$$

$$PS = P(X, Y) - P(X)\,P(Y)$$

- Lift: does X lift the probability of Y? It is the probability of Y given X over the probability of Y.
- Lift is the same as the interest factor I: I = 1 indicates independence, I > 1 positive association, I < 1 negative association.
- PS is the Piatetsky-Shapiro measure. Many other measures exist.

10. Example: Lift/Interest

         Coffee   ¬Coffee
Tea      15       5         20
¬Tea     75       5         80
         90       10        100

Association rule: Tea → Coffee
- Confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9
- Lift = 0.75/0.9 = 0.8333 (< 1, so Tea and Coffee are negatively associated)

11. Drawback of Lift & Interest

        Y     ¬Y
X       10    0     10
¬X      0     90    90
        10    90    100

        Y     ¬Y
X       90    0     90
¬X      0     10    10
        90    10    100

- Statistical independence: if P(X,Y) = P(X)P(Y), then Lift = 1
- Lift favors infrequent items
- Other criteria have been proposed: Gini, J-measure, etc.
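
To see why lift favors infrequent items, it may help to evaluate it on the two tables of this slide. The sketch below is only an illustration: the function name `lift` and its argument order are assumptions, and `Fraction` is used to keep the arithmetic exact.

```python
from fractions import Fraction


def lift(f11: int, f10: int, f01: int, f00: int) -> Fraction:
    """Lift of X -> Y from 2x2 counts: P(X, Y) / (P(X) * P(Y))."""
    total = f11 + f10 + f01 + f00
    p_xy = Fraction(f11, total)
    p_x = Fraction(f11 + f10, total)
    p_y = Fraction(f11 + f01, total)
    return p_xy / (p_x * p_y)


# First table: X and Y co-occur in 10 of 100 transactions.
rare = lift(f11=10, f10=0, f01=0, f00=90)
# Second table: X and Y co-occur in 90 of 100 transactions.
common = lift(f11=90, f10=0, f01=0, f00=10)

print(float(rare), float(common))
```

In both tables X and Y always occur together, yet the rare pair gets a lift of 10 while the frequent pair gets only 10/9 (about 1.11).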

12. There are lots of measures proposed in the literature
- Some measures are good for certain applications, but not for others
- What criteria should we use to determine whether a measure is good or bad?
- What about Apriori-style support-based pruning? How does it affect these measures?

13. Association Rules & Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods:
  - Apriori, and improvements
  - FP-growth
- Rule derivation, visualization and validation
- Multi-level Associations
- Summary

14. Multiple-Level Association Rules
- Items often form a hierarchy.
- Items at the lower level are expected to have lower support.
- Rules regarding itemsets at appropriate levels could be quite useful.
- The transaction database can be encoded based on dimensions and levels.
- We can explore shared multi-level mining.
[Figure: item hierarchy with Food at the root; milk and bread below it; skim and 2% under milk; white and wheat under bread; Sunset and Fraser at the lowest level]

15. Mining Multi-Level Associations
- A top-down, progressive deepening approach:
  - First find high-level strong rules: milk → bread [20%, 60%].
  - Then find their lower-level "weaker" rules: 2% milk → wheat bread [6%, 50%].
- Variations of mining multiple-level association rules:
  - Level-crossed association rules: 2% milk → Wonder wheat bread
  - Association rules with multiple, alternative hierarchies: 2% milk → Wonder bread

16. Multi-level Association: Uniform Support vs. Reduced Support
- Uniform support: the same minimum support for all levels
  - (+) One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
  - (-) Lower-level items do not occur as frequently. If the support threshold is
    - too high: low-level associations are missed
    - too low: too many high-level associations are generated
- Reduced support: reduced minimum support at lower levels
  - There are 4 search strategies:
    - Level-by-level independent
    - Level-cross filtering by k-itemset
    - Level-cross filtering by single item
    - Controlled level-cross filtering by single item

17. Uniform Support
Multi-level mining with uniform support
- Level 1 (min_sup = 5%): Milk [support = 10%]
- Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]

18. Reduced Support
Multi-level mining with reduced support
- Level 1 (min_sup = 5%): Milk [support = 10%]
- Level 2 (min_sup = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%]
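
The difference between the uniform and reduced support slides can be checked with a few lines of code. This is a minimal sketch using the supports and thresholds shown on slides 17 and 18; the names `SUPPORT` and `frequent_items` are illustrative assumptions.

```python
# item -> (hierarchy level, support), taken from slides 17 and 18.
SUPPORT = {
    "Milk": (1, 0.10),
    "2% Milk": (2, 0.06),
    "Skim Milk": (2, 0.04),
}


def frequent_items(min_sup_by_level: dict[int, float]) -> list[str]:
    """Return items whose support meets the threshold of their own level."""
    return [
        item
        for item, (level, support) in SUPPORT.items()
        if support >= min_sup_by_level[level]
    ]


uniform = frequent_items({1: 0.05, 2: 0.05})  # same min_sup at every level
reduced = frequent_items({1: 0.05, 2: 0.03})  # lower min_sup at level 2
print(uniform)
print(reduced)
```

With a uniform 5% threshold, Skim Milk (4%) is dropped; lowering the level-2 threshold to 3% keeps it.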

19. Multi-level Association: Redundancy Filtering
- Some rules may be redundant due to "ancestor" relationships between items.
- Example:
  - milk → wheat bread [support = 8%, confidence = 70%]
  - Say that 2% milk is 25% of milk sales; the expected support of the next rule is then 8% × 25% = 2%:
  - 2% milk → wheat bread [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule.
- A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.
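
The redundancy test on this slide can be sketched as follows. The slide only says "close to" the expected value, so the relative tolerance `rel_tol` below is a hypothetical choice, as are the function names.

```python
def expected_support(ancestor_support: float, descendant_share: float) -> float:
    """Support expected for a descendant rule if it merely mirrors its ancestor."""
    return ancestor_support * descendant_share


def is_redundant(observed: float, expected: float, rel_tol: float = 0.1) -> bool:
    """Flag a rule whose observed support is within rel_tol of the expected value.

    rel_tol is an assumed threshold; the slide does not specify one.
    """
    return abs(observed - expected) <= rel_tol * expected


# milk -> wheat bread has support 8%; 2% milk is 25% of milk sales.
expected = expected_support(0.08, 0.25)
print(expected, is_redundant(observed=0.02, expected=expected))
```

Here the observed 2% support of the 2% milk rule matches the expected 2%, so the rule adds no information beyond its ancestor. A fuller check would likely compare confidence (72% vs. 70%) in the same way.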

20. Multi-Level Mining: Progressive Deepening
- A top-down, progressive deepening approach:
  - First mine high-level frequent items: milk (15%), bread (10%)
  - Then mine their lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%)
- Different min_support thresholds across levels lead to different algorithms:
  - If adopting the same min_support across all levels, then toss t if any of t's ancestors is infrequent.
  - If adopting reduced min_support at lower levels, then examine only those descendants whose ancestor's support is frequent/non-negligible.
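
A minimal sketch of the top-down traversal, restricted to single items, is given below. The supports for milk, bread, 2% milk, and wheat bread come from the slide; the juice and apple juice entries and both threshold settings are hypothetical, added only to show an infrequent ancestor being pruned.

```python
# Parent -> children in the item hierarchy.
CHILDREN = {
    "milk": ["2% milk"],
    "bread": ["wheat bread"],
    "juice": ["apple juice"],  # hypothetical branch
}
SUPPORT = {
    "milk": 0.15, "bread": 0.10, "juice": 0.03,
    "2% milk": 0.05, "wheat bread": 0.04, "apple juice": 0.03,
}
TOP_LEVEL = ["milk", "bread", "juice"]


def progressive_deepening(min_sup_by_level: list[float]) -> list[list[str]]:
    """Mine frequent items level by level, descending only from frequent ancestors."""
    frequent_per_level = []
    candidates = TOP_LEVEL
    for min_sup in min_sup_by_level:
        frequent = [item for item in candidates if SUPPORT[item] >= min_sup]
        frequent_per_level.append(frequent)
        # Only children of frequent items are examined at the next level.
        candidates = [child for item in frequent for child in CHILDREN.get(item, [])]
    return frequent_per_level


same = progressive_deepening([0.05, 0.05])     # same min_support at both levels
lowered = progressive_deepening([0.05, 0.03])  # reduced min_support at level 2
print(same)
print(lowered)
```

In the reduced setting, apple juice would meet the 3% level-2 threshold on its own, but it is never examined because its ancestor juice is infrequent at level 1.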

21. Association Rules & Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods:
  - Apriori, and improvements
  - FP-growth
- Rule derivation, visualization and validation
- Multi-level Associations
- Temporal associations and frequent sequences [later]
- Other association mining methods
- Summary

22. Other Association Mining Methods
- CHARM: mining frequent itemsets using a vertical data format
- Mining frequent closed patterns
- Mining max-patterns
- Mining quantitative associations [e.g., what is the implication between age and income?]
- Constraint-based association mining
- Frequent patterns in data streams: a very difficult problem, where performance is a real issue
- Constraint-based (query-directed) mining
- Mining sequential and structured patterns

23. Summary
- Association rule mining
  - Probably the most significant contribution from the database community in KDD
- New interesting research directions
  - Association analysis in other types of data: spatial data, multimedia data, time series data
- Association rule mining for data streams: a very difficult challenge

24. Statistical Independence
- Population of 1000 students
  - 600 students know how to swim (S)
  - 700 students know how to bike (B)
  - 420 students know how to swim and bike (S,B)
  - P(S ∧ B) = 420/1000 = 0.42
  - P(S) × P(B) = 0.6 × 0.7 = 0.42
  - P(S ∧ B) = P(S) × P(B) => Statistical independence
  - P(S ∧ B) > P(S) × P(B) => Positively correlated
  - P(S ∧ B) < P(S) × P(B) => Negatively correlated
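
The three cases on this slide can be checked directly from the counts. The sketch below is illustrative: the function name `correlation` is an assumption, and the comparison is done on integers (multiplying both sides by the squared population size) to avoid floating-point rounding.

```python
def correlation(n_total: int, n_a: int, n_b: int, n_ab: int) -> str:
    """Compare P(A and B) with P(A) * P(B) using exact integer arithmetic."""
    # P(A and B) vs. P(A) * P(B) is equivalent to n_ab * n_total vs. n_a * n_b.
    observed = n_ab * n_total
    expected = n_a * n_b
    if observed == expected:
        return "independent"
    return "positively correlated" if observed > expected else "negatively correlated"


# 1000 students: 600 swim, 700 bike, 420 do both.
result = correlation(1000, 600, 700, 420)
print(result)
```

For the slide's numbers, 420 × 1000 equals 600 × 700, so swimming and biking are statistically independent; a joint count above 420 would indicate positive correlation, and one below 420 negative correlation.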

