Association Rule Mining


1 Association Rule Mining
Instructor: Qiang Yang
Slides from Jiawei Han and Jian Pei, and from Introduction to Data Mining by Tan, Steinbach, and Kumar

2 What Is Frequent Pattern Mining?
Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]
Frequent pattern mining: finding regularities in data
- What products were often purchased together?
- What are the subsequent purchases after buying a PC?
Frequent-pattern mining methods

3 Why Is Frequent Pattern Mining an Essential Task in Data Mining?
Foundation for many essential data mining tasks:
- Association, correlation, causality
- Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
- Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
Broad applications:
- Basket data analysis, cross-marketing, catalog design, sale campaign analysis
- Web log (click stream) analysis, DNA sequence analysis, etc.

4 Basic Concepts: Frequent Patterns and Association Rules
Itemset X = {x1, …, xk}
Find all rules X → Y that meet the minimum support and minimum confidence:
- support, s: the probability that a transaction contains X ∪ Y
- confidence, c: the conditional probability that a transaction containing X also contains Y

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]

Let min_support = 50%, min_conf = 50%:
A → C (50%, 66.7%)
C → A (50%, 100%)
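The support and confidence figures for this slide can be checked with a minimal Python sketch. The function names `support` and `confidence` are illustrative choices, not part of the slides; the transactions are the four rows of the table above.

```python
# Transactions from the slide's example table (transaction-id -> items bought).
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values()) / len(db)

def confidence(lhs, rhs, db):
    """Estimate Pr(rhs | lhs) as support(lhs union rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(support({"A", "C"}, transactions))       # support shared by A -> C and C -> A
print(confidence({"A"}, {"C"}, transactions))  # confidence of A -> C
print(confidence({"C"}, {"A"}, transactions))  # confidence of C -> A
```

Both rules share the same support because support depends only on the combined itemset {A, C}; the confidences differ because the denominators support({A}) and support({C}) differ.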

5 Concept: Frequent Itemsets
[Table: weather data with columns Outlook, Temperature, Humidity, Play; the values that appear are sunny, overcast, rainy; hot, mild, cool; high, normal; yes, no]
Minimum support = 2: {sunny, hot, no}, {sunny, hot, high, no}, {rainy, normal}
Minimum support = 3: ?
How strong is {sunny, no}? Count = ? Percentage = ?

6 Concept: Itemset → Rules
{sunny, hot, no} = {Outlook=sunny, Temp=hot, Play=no}
Generate a rule: Outlook=sunny and Temp=hot → Play=no
How strong is this rule?
- Support of the rule = support of the itemset {sunny, hot, no} = 2 = Pr({sunny, hot, no}), expressed either as a count or as a percentage
- Confidence = Pr(Play=no | {Outlook=sunny, Temp=hot})
In general, for LHS → RHS: Confidence = Pr(RHS | LHS) = count(LHS and RHS) / count(LHS)
What is the confidence of Outlook=sunny → Play=no?

7 Frequent Patterns
Patterns = itemsets {i1, i2, …, in}, where each item is a pair (Attribute=value)
Frequent patterns: itemsets whose support >= minimum support
Support = count(itemset) / count(database)

8 Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets
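To make the 2^d count concrete, the following sketch enumerates every subset of a small item list. The five item names are hypothetical, and the count of 2^d includes the empty itemset (there are 2^d - 1 non-empty candidates).

```python
from itertools import combinations

def all_itemsets(items):
    """Yield every subset of `items`, including the empty set, as a tuple."""
    items = sorted(items)
    for k in range(len(items) + 1):
        yield from combinations(items, k)

items = ["A", "B", "C", "D", "E"]  # d = 5 hypothetical items
candidates = list(all_itemsets(items))
print(len(candidates), 2 ** len(items))
```

The exponential growth is why brute-force enumeration is only feasible for very small d, and why the pruning methods later in the deck matter.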

9 Max-patterns
Max-pattern: a frequent pattern with no proper frequent super-pattern
With Min_sup = 2, BCDE and ACD are max-patterns; BCD is not a max-pattern

Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F
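The max-pattern claim on this slide can be checked by brute force. This is a sketch for a three-transaction database only: it enumerates every itemset, which is not how a practical max-pattern miner works, and the helper names are illustrative.

```python
from itertools import combinations

# The three transactions from the slide, with Min_sup = 2.
db = {10: set("ABCDE"), 20: set("BCDE"), 30: set("ACDF")}
MIN_SUP = 2

def count(itemset, db):
    """Number of transactions containing every item of `itemset`."""
    return sum(set(itemset) <= t for t in db.values())

def frequent_itemsets(db, min_sup):
    """Brute-force enumeration of all frequent itemsets with their counts."""
    items = sorted(set().union(*db.values()))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            c = count(combo, db)
            if c >= min_sup:
                result[frozenset(combo)] = c
    return result

def maximal(frequent):
    """Frequent itemsets that have no proper frequent superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]

freq = frequent_itemsets(db, MIN_SUP)
max_patterns = sorted("".join(sorted(s)) for s in maximal(freq))
print(max_patterns)
```

BCD is frequent here but is a proper subset of the frequent itemset BCDE, so it is filtered out by `maximal`.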

10 Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: itemset lattice showing the maximal itemsets, the infrequent itemsets, and the border between frequent and infrequent itemsets]

11 Frequent Max Patterns
Succinct expression of frequent patterns:
- Let {a, b, c} be frequent
- Then {a, b}, {b, c}, {a, c} must also be frequent
- Then {a}, {b}, {c} must also be frequent
- By writing down {a, b, c} once, we save lots of computation
Max pattern: if {a, b, c} is a frequent max pattern, then {a, b, c, x} is NOT a frequent pattern for any other item x

12 Find Frequent Max Patterns
[Table: weather data with columns Outlook, Temperature, Humidity, Play; the values that appear are sunny, overcast, rainy; hot, mild, cool; high, normal; yes, no]
Minimum support = 2
{sunny, hot, no} ??

13 Closed Patterns
An itemset is closed if none of its immediate supersets has the same support as the itemset
{a, b}, {a, b, d}, {a, b, c} are closed patterns
But {a, b} is not a max pattern
See where changes happen
Reduce # of patterns and rules
N. Pasquier et al. In ICDT'99

TID   Items
10    a, b, c
20
30    a, b, d
40    a, b, d
50    c, e, f
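The closedness test can be illustrated with a brute-force sketch. It reuses the fully specified three-transaction database from the Max-patterns slide (Tid 10, 20, 30) rather than the table on this slide, and it checks all proper supersets, which gives the same answer as checking immediate supersets because support never increases when items are added. The helper names are illustrative.

```python
from itertools import combinations

# Database from the Max-patterns slide.
db = {10: set("ABCDE"), 20: set("BCDE"), 30: set("ACDF")}

def supports(db):
    """Support count of every itemset that occurs in at least one transaction."""
    items = sorted(set().union(*db.values()))
    out = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            c = sum(set(combo) <= t for t in db.values())
            if c > 0:
                out[frozenset(combo)] = c
    return out

def closed_itemsets(sup):
    """Itemsets with no proper superset of equal support."""
    return {s: c for s, c in sup.items()
            if not any(s < o and c == oc for o, oc in sup.items())}

sup = supports(db)
closed = closed_itemsets(sup)
# Closed frequent patterns: closed and meeting a minimum support count of 2.
closed_frequent = {s: c for s, c in closed.items() if c >= 2}
for s, c in sorted(closed_frequent.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(s)), c)
```

On this database CD is closed (support 3) but not maximal, since its supersets ACD and BCDE are still frequent with a lower support of 2.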

14 Maximal vs Closed Itemsets
[Figure: itemset lattice. The indices beside each itemset are the IDs of the transactions that contain it; some itemsets are not supported by any transaction.]

15 Maximal vs Closed Frequent Itemsets
[Figure: itemset lattice with minimum support = 2, marking the itemsets that are closed but not maximal and the itemsets that are closed and maximal]
# Closed = 9
# Maximal = 4

16 Note on Closed Patterns
Closed patterns do not require a minimum support to be specified: given a dataset, we can find its set of closed patterns once, so that for any minimum support value we can immediately find the corresponding set of patterns (a subset of the closed patterns)
Closed frequent patterns: patterns that are both closed and above the minimum support

17 Maximal vs Closed Itemsets

18 Mining Association Rules—an Example
Min. support 50%, min. confidence 50%

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A → C:
support = support({A} ∪ {C}) = 50%
confidence = support({A} ∪ {C}) / support({A}) = 66.7%
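The two-step procedure implied by this example (find the frequent itemsets, then derive rules from them) can be sketched as follows. The brute-force enumeration and the variable names are assumptions made for illustration; the thresholds and transactions are the ones on the slide.

```python
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
MIN_SUP, MIN_CONF = 0.5, 0.5

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: collect all frequent itemsets (brute force, fine for a toy database).
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(frozenset(combo))
        if s >= MIN_SUP:
            frequent[frozenset(combo)] = s

# Step 2: split each frequent itemset into LHS -> RHS and keep confident rules.
rules = []
for itemset, s in frequent.items():
    for k in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), k):
            lhs = frozenset(lhs)
            conf = s / frequent[lhs]  # every subset of a frequent itemset is frequent
            if conf >= MIN_CONF:
                rules.append((sorted(lhs), sorted(itemset - lhs), s, conf))

for lhs, rhs, s, conf in rules:
    print(lhs, "->", rhs, f"support={s:.0%}", f"confidence={conf:.1%}")
```

Only {A, C} has more than one item, so the only rules produced are A → C and C → A.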

19 Method 1: Apriori: A Candidate Generation-and-test Approach
Any subset of a frequent itemset must be frequent
- If {beer, diaper, nuts} is frequent, so is {beer, diaper}
- Every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Apriori pruning principle: if there is any itemset which is infrequent, its superset should not be generated/tested!
Method: generate length (k+1) candidate itemsets from length k frequent itemsets, and test the candidates against the DB
The performance studies show its efficiency and scalability
Agrawal & Srikant 1994, Mannila, et al. 1994

20 The Apriori Algorithm — An Example
Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan gives C1:
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1:
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan gives C2 with counts:
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2:
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3: {B, C, E}

3rd scan gives L3:
Itemset     sup
{B, C, E}   2
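The level-wise trace above can be reproduced with a compact sketch of Apriori. Two things are assumptions here: the minimum support count of 2 (the slide does not state it, but it is consistent with {D} being dropped from L1), and the candidate join, which unions pairs of frequent k-itemsets instead of using the prefix-based join of the original paper. The function names are illustrative.

```python
from itertools import combinations

tdb = {
    10: {"A", "C", "D"},
    20: {"B", "C", "E"},
    30: {"A", "B", "C", "E"},
    40: {"B", "E"},
}
MIN_SUP = 2  # assumed minimum support count

def count_support(candidates, db):
    """One database scan: count the transactions containing each candidate."""
    return {c: sum(c <= t for t in db.values()) for c in candidates}

def generate_candidates(frequent_k, k):
    """Join frequent k-itemsets into (k+1)-candidates, then prune by subsets."""
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) == k + 1:
                # Apriori pruning: every k-subset must itself be frequent.
                if all(frozenset(s) in frequent_k for s in combinations(union, k)):
                    candidates.add(union)
    return candidates

def apriori(db, min_sup):
    """Return [L1, L2, ...], each a dict mapping itemset -> support count."""
    candidates = {frozenset([i]) for i in set().union(*db.values())}
    levels, k = [], 1
    while candidates:
        counts = count_support(candidates, db)
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        if not frequent:
            break
        levels.append(frequent)
        candidates = generate_candidates(set(frequent), k)
        k += 1
    return levels

levels = apriori(tdb, MIN_SUP)
for k, level in enumerate(levels, start=1):
    print(f"L{k}:", {"".join(sorted(s)): n for s, n in level.items()})
```

The pruning step is what keeps {A, B, C} and {A, C, E} out of C3: each has a 2-subset ({A, B} or {A, E}) that is not in L2, so neither is ever counted against the database.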

21 Speeding up Association rules
Direct Hashing and Pruning (DHP) technique
Thanks to Cheng Hong & Hu Haibo

22 DHP: Reduce the Number of Candidates
A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
- Candidates: a, b, c, d, e
- Hash entries: {ab, ad, ae} {bd, be, de} …
- Frequent 1-itemsets: a, b, d, e
- ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold
J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95

23 Still challenging, the niche for DHP
DHP (Park '95): Direct Hashing and Pruning
- The set of candidate large 2-itemsets is huge. DHP: trim it using hashing
- The transaction database is so large that one scan per iteration is costly. DHP: prune both the number of transactions and the number of items in each transaction after each iteration

24 Hash Table Construction
Consider 2-itemsets, with all items numbered i1, i2, …, in. Any pair {x, y} is hashed according to the hash function
bucket # = h({x, y}) = ((order of x) * 10 + (order of y)) % 7
Example: Items = A, B, C, D, E with Order = 1, 2, 3, 4, 5
h({C, E}) = (3 * 10 + 5) % 7 = 0
Thus {C, E} belongs to bucket 0.
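The hash function from this slide is small enough to write out directly. In this sketch the `ORDER` mapping and the function name `bucket` are illustrative, and sorting the pair so that the lower-ordered item plays the role of x is an assumption based on how the slide writes its pairs.

```python
# Item order used by the slide's example: A=1, B=2, C=3, D=4, E=5.
ORDER = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

def bucket(x, y, n_buckets=7):
    """h({x, y}) = ((order of x) * 10 + (order of y)) % n_buckets."""
    # Assumption: x is always the item that comes first in the item order.
    x, y = sorted((x, y), key=ORDER.get)
    return (ORDER[x] * 10 + ORDER[y]) % n_buckets

print(bucket("C", "E"))  # (3 * 10 + 5) % 7
```

Several different pairs can land in the same bucket, which is why a high bucket count is only a necessary condition for a pair to be frequent.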

25 How to trim candidate itemsets
In iteration k, hash all candidate (k+1)-itemsets into a hash table and count the itemsets in each bucket. In iteration k+1, examine each candidate itemset to see whether its corresponding bucket count reaches the support threshold (a necessary condition for the itemset to be frequent).

26 Example
TID   Items
100   A C D
200   B C E
300   A B C E
400   B E
Figure 1. An example transaction database

27 Generation of C1 & L1(1st iteration)
C1:
Itemset   Sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1:
Itemset   Sup
{A}       2
{B}       3
{C}       3
{E}       3

28 Hash Table Construction
Find all 2-itemsets of each transaction

TID   2-itemsets
100   {A C} {A D} {C D}
200   {B C} {B E} {C E}
300   {A B} {A C} {A E} {B C} {B E} {C E}
400   {B E}

29 Hash Table Construction (2)
Hash function: h({x y}) = ((order of x) * 10 + (order of y)) % 7

Hash table:
Bucket   Itemsets hashed to the bucket   Count
0        {A D} {C E} {C E}               3
1        {A E}                           1
2        {B C} {B C}                     2
3                                        0
4        {B E} {B E} {B E}               3
5        {A B}                           1
6        {A C} {C D} {A C}               3

30 C2 Generation (2nd iteration)
C2 of Apriori (L1*L1): {A B} {A C} {A E} {B C} {B E} {C E}

Itemset   # in the bucket
{A B}     1
{A C}     3
{A E}     1
{B C}     2
{B E}     3
{C E}     3

Resulting C2: {A C} {B C} {B E} {C E}
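The bucket counts and the trimmed C2 can be reproduced with a short sketch of the first DHP pass. The minimum support count of 2 is an assumption (it matches {D} being absent from L1), a bucket is kept when its count is at least that threshold, and the variable names are illustrative. The sketch covers only candidate trimming, not DHP's database pruning.

```python
from collections import Counter
from itertools import combinations

ORDER = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
tdb = {100: "ACD", 200: "BCE", 300: "ABCE", 400: "BE"}
MIN_SUP = 2  # assumed minimum support count

def bucket(pair):
    """h({x y}) = ((order of x) * 10 + (order of y)) % 7."""
    x, y = pair
    return (ORDER[x] * 10 + ORDER[y]) % 7

# First pass: count single items and hash every 2-itemset of each transaction.
item_counts = Counter()
bucket_counts = Counter()
for items in tdb.values():
    item_counts.update(items)
    for pair in combinations(sorted(items, key=ORDER.get), 2):
        bucket_counts[bucket(pair)] += 1

l1 = sorted((i for i, n in item_counts.items() if n >= MIN_SUP), key=ORDER.get)
apriori_c2 = list(combinations(l1, 2))  # L1 * L1, as plain Apriori would generate
dhp_c2 = [p for p in apriori_c2 if bucket_counts[bucket(p)] >= MIN_SUP]

print("bucket counts:", dict(sorted(bucket_counts.items())))
print("Apriori C2:", ["".join(p) for p in apriori_c2])
print("DHP C2:", ["".join(p) for p in dhp_c2])
```

{A B} and {A E} are dropped before the second scan because their buckets hold only one itemset each, so neither can reach a support count of 2.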

31 Effective Database Pruning
Apriori: does not prune the database; prunes Ck by support counting on the original database
DHP: more efficient support counting can be achieved on the pruned database

32 Performance Comparison

33 Performance Comparison (2)

34 FP-growth Algorithm
Uses a compressed representation of the database, the FP-tree
Once the FP-tree has been constructed, a recursive divide-and-conquer approach is used to mine the frequent itemsets

35 FP-tree construction
After reading TID=1: null → A:1 → B:1
After reading TID=2: null → A:1 → B:1 and null → B:1 → C:1 → D:1
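The two insertion steps can be sketched with a minimal prefix tree. The transactions {A, B} and {B, C, D} are read off the node labels on this slide; the `Node` class and the `insert` and `show` helpers are hypothetical names, items are assumed to arrive already in the tree's fixed item order, and the header table with its node links is omitted.

```python
class Node:
    """One FP-tree node: an item label, a count, and child nodes keyed by item."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def insert(root, transaction):
    """Walk down from the root, reusing shared prefixes and bumping counts."""
    node = root
    for item in transaction:
        child = node.children.get(item)
        if child is None:
            child = Node(item, node)
            node.children[item] = child
        child.count += 1
        node = child

def show(node, depth=0):
    """Return the tree as indented 'item:count' lines."""
    label = "null" if node.item is None else f"{node.item}:{node.count}"
    lines = ["  " * depth + label]
    for child in node.children.values():
        lines.extend(show(child, depth + 1))
    return lines

root = Node(None)
insert(root, ["A", "B"])       # TID=1, as read off the slide
insert(root, ["B", "C", "D"])  # TID=2
print("\n".join(show(root)))
```

Because TID=2 does not start with A, it cannot share the existing A branch and opens a second branch under the root; transactions that share a prefix would instead increment the counts along the existing path.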

36 FP-Tree Construction
[Figure: the transaction database, the resulting FP-tree (root null, with node labels including A:7, B:5, B:3, C:3, C:1, D:1, E:1), and the header table]
Pointers are used to assist frequent itemset generation

37 FP-growth
Conditional pattern base for D: (PB | D) = {(A:1, B:1, C:1), (A:1, B:1), (A:1, C:1), (A:1), (B:1, C:1)}
Recursively apply FP-growth on PB, and then append D to each itemset found
Thus, the frequent itemsets found from PB | D (with min support = 2) are AD, BD, CD, ABD, ACD, BCD
[Figure: FP-tree fragment with node labels A:7, B:5, C:3, B:1, C:1 and the D:1 leaves]
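The result listed for D can be checked directly from the conditional pattern base. Note that this sketch counts every subset of every prefix path by brute force; it is a stand-in for the recursive step, not the conditional-FP-tree recursion that FP-growth actually performs. The variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

# Conditional pattern base for D from the slide: (prefix path, count).
pattern_base = [
    (("A", "B", "C"), 1),
    (("A", "B"), 1),
    (("A", "C"), 1),
    (("A",), 1),
    (("B", "C"), 1),
]
MIN_SUP = 2

# Count how often each sub-itemset of a prefix path co-occurs with D.
counts = Counter()
for path, n in pattern_base:
    for k in range(1, len(path) + 1):
        for subset in combinations(path, k):
            counts[subset] += n

# Append D to every itemset that is frequent within the pattern base.
frequent_with_d = sorted(
    ("".join(s) + "D" for s, n in counts.items() if n >= MIN_SUP),
    key=lambda s: (len(s), s),
)
print(frequent_with_d)
```

ABC occurs in only one prefix path, so ABCD falls below the minimum support and is not reported.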

