Association Rule Mining
Instructor: Qiang Yang
Thanks: Jiawei Han and Jian Pei
Frequent-pattern mining methods

What Is Frequent Pattern Mining?
Frequent pattern: a pattern (a set of items, a sequence, etc.) that occurs frequently in a database [AIS93].
Frequent pattern mining: finding regularities in data.
- What products were often purchased together? Beer and diapers?!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
- Can we automatically classify web documents?
Why Is Frequent Pattern Mining an Essential Task in Data Mining?
Foundation for many essential data mining tasks:
- Association, correlation, causality
- Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
- Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
Broad applications:
- Basket data analysis, cross-marketing, catalog design, sales campaign analysis
- Web log (click stream) analysis, DNA sequence analysis, etc.
Basic Concepts: Frequent Patterns and Association Rules
Itemset: X = {x1, …, xk}.
Find all rules X => Y with minimum support and confidence:
- support s: the probability that a transaction contains X ∪ Y
- confidence c: the conditional probability that a transaction containing X also contains Y
Let min_support = 50%, min_conf = 50%:
- A => C (support 50%, confidence 66.7%)
- C => A (support 50%, confidence 100%)

Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

(Venn diagram in the original: customers who buy beer, customers who buy diapers, and customers who buy both.)
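The support and confidence figures above can be checked with a short Python sketch of the two definitions (the function and variable names here are illustrative, not from the slides):

```python
# Toy transaction database from the slide.
transactions = [
    {"A", "B", "C"},  # TID 10
    {"A", "C"},       # TID 20
    {"A", "D"},       # TID 30
    {"B", "E", "F"},  # TID 40
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """Pr(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"A", "C"}, transactions))       # 0.5
print(confidence({"A"}, {"C"}, transactions))  # ≈ 0.667
print(confidence({"C"}, {"A"}, transactions))  # 1.0
```

Running it reproduces the rules on the slide: A => C holds with 50% support and 66.7% confidence, and C => A with 50% support and 100% confidence.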
Concept: Frequent Itemsets

Outlook  | Temperature | Humidity | Play
sunny    | hot  | high   | no
sunny    | hot  | high   | no
overcast | hot  | high   | yes
rainy    | mild | high   | yes
rainy    | cool | normal | yes
rainy    | cool | normal | no
overcast | cool | normal | yes
sunny    | mild | high   | no
sunny    | cool | normal | yes
rainy    | mild | normal | yes
sunny    | mild | normal | yes
overcast | mild | high   | yes
overcast | hot  | normal | yes
rainy    | mild | high   | no

With minimum support = 2, frequent itemsets include:
{sunny, hot, no}, {sunny, hot, high, no}, {rainy, normal}
What changes with minimum support = 3?
How strong is {sunny, no}? Count = ? Percentage = ?
Concept: Itemset Rules
{sunny, hot, no} = {Outlook=sunny, Temp=hot, Play=no}
Generate a rule: Outlook=sunny and Temp=hot => Play=no
How strong is this rule?
- Support of the rule = support of the itemset {sunny, hot, no} = 2 = Pr({sunny, hot, no}), expressed either as a count or as a percentage.
- Confidence = Pr(Play=no | Outlook=sunny, Temp=hot).
In general, for LHS => RHS:
Confidence = Pr(RHS | LHS) = count(LHS and RHS) / count(LHS)
What is the confidence of Outlook=sunny => Play=no?
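The closing question can be answered by counting directly over the weather table. A minimal sketch, keeping only the two attributes the rule uses:

```python
# (Outlook, Play) pairs from the weather table, in row order.
records = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]

# Confidence = count(LHS and RHS) / count(LHS)
lhs = sum(outlook == "sunny" for outlook, _ in records)
both = sum(outlook == "sunny" and play == "no" for outlook, play in records)
print(both, "/", lhs, "=", both / lhs)  # 3 / 5 = 0.6
```

So Outlook=sunny => Play=no has confidence 3/5 = 60%.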
6.1.3 Types of Association Rules
- Quantitative: Age(X, "30…39") and income(X, "42K…48K") => buys(X, TV)
- Single vs. multiple dimensions. Single: buys(X, computer) => buys(X, "financial soft"); multiple: the quantitative example above.
- Levels of abstraction: Age(X, ..) => buys(X, "laptop computer") vs. Age(X, ..) => buys(X, "computer")
- Extensions: max patterns, closed itemsets
Frequent Patterns
Patterns = itemsets {i1, i2, …, in}, where each item is a pair (Attribute = value).
Frequent patterns: itemsets whose support >= minimum support, where
support = count(itemset) / count(database)
Max-patterns
A max-pattern is a frequent pattern that has no proper frequent super-pattern.
With min_sup = 2: BCDE and ACD are max-patterns; BCD is not a max-pattern.

Tid | Items
10 | A, B, C, D, E
20 | B, C, D, E
30 | A, C, D, F
Frequent Max Patterns
A succinct expression of frequent patterns:
- Let {a, b, c} be frequent.
- Then {a, b}, {b, c}, {a, c} must also be frequent.
- Then {a}, {b}, {c} must also be frequent.
By writing down {a, b, c} once, we save a lot of computation.
Max pattern: if {a, b, c} is a frequent max pattern, then {a, b, c, x} is NOT a frequent pattern for any other item x.
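The definition above can be sketched as a filter over a set of frequent itemsets: keep only those with no frequent proper superset (the helper name is made up for illustration):

```python
def max_patterns(frequent):
    """Keep only frequent itemsets that have no frequent proper superset."""
    fs = [frozenset(x) for x in frequent]
    return {x for x in fs if not any(x < y for y in fs)}

# All frequent itemsets implied by {a, b, c} being frequent:
frequent = ["a", "b", "c", "ab", "ac", "bc", "abc"]
print(max_patterns(frequent))  # only {a, b, c} remains
```

All seven itemsets are summarized by the single max pattern {a, b, c}, which is exactly the savings the slide describes.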
Find Frequent Max Patterns
Using the weather table from "Concept: Frequent Itemsets" above, with minimum support = 2:
Is {sunny, hot, no} a frequent max pattern?
Closed Patterns
An itemset X is closed if it has no proper superset X' with the same support, i.e., no superset X' such that every transaction containing X also contains X'.
With min_sup = 2: {a, b}, {a, b, c}, and {a, b, d} are frequent closed patterns, but {a, b} is not a max pattern.
Closed patterns are a concise representation of frequent patterns, reducing the number of patterns and rules (N. Pasquier et al., ICDT'99).

TID | Items
10 | a, b, c
20 | a, b, c
30 | a, b, d
40 | a, b, d
50 | c, e, f
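The closed patterns for this database can be enumerated by brute force: list every frequent itemset, then keep those whose support is not matched by any proper superset. A sketch (note that {c}, support 3, also satisfies the definition, since its frequent supersets only reach support 2):

```python
from itertools import combinations

# Transaction database from the slide, min_sup = 2.
db = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "d"},
      {"a", "b", "d"}, {"c", "e", "f"}]
min_sup = 2

def sup(itemset):
    """Number of transactions containing `itemset`."""
    return sum(itemset <= t for t in db)

items = sorted(set().union(*db))
frequent = [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r) if sup(frozenset(c)) >= min_sup]

# Closed: no proper superset has the same support.
closed = {x for x in frequent
          if not any(x < y and sup(y) == sup(x) for y in frequent)}
print(sorted(map(sorted, closed)))  # [['a','b'], ['a','b','c'], ['a','b','d'], ['c']]
```

Every frequent itemset's support can be recovered from this closed set, which is why it is a lossless, concise representation.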
Mining Association Rules: an Example
Min. support 50%, min. confidence 50%.
For rule A => C:
- support = support({A} ∪ {C}) = 50%
- confidence = support({A} ∪ {C}) / support({A}) = 66.7%

Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

Frequent pattern | Support
{A}    | 75%
{B}    | 50%
{C}    | 50%
{A, C} | 50%
Method 1: Apriori, a Candidate Generation-and-Test Approach
Any subset of a frequent itemset must be frequent:
- If {beer, diaper, nuts} is frequent, so is {beer, diaper}.
- Every transaction containing {beer, diaper, nuts} also contains {beer, diaper}.
Apriori pruning principle: if any itemset is infrequent, its supersets need not be generated or tested!
Method: generate length-(k+1) candidate itemsets from length-k frequent itemsets, and test the candidates against the DB.
Performance studies show its efficiency and scalability (Agrawal & Srikant 1994; Mannila et al. 1994).
The Apriori Algorithm: An Example

Database TDB:
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan, C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan, C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B,C,E}
3rd scan, L3: {B,C,E}:2
The Apriori Algorithm
Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
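The pseudo-code above translates directly into a small Python sketch (a level-wise reference implementation for clarity, not an efficient one; support counting rescans the whole database each pass):

```python
from itertools import combinations

def apriori(db, min_sup):
    """Level-wise Apriori: returns all frequent itemsets as frozensets."""
    db = [frozenset(t) for t in db]

    def sup(c):
        return sum(c <= t for t in db)

    # L1 = frequent 1-itemsets
    items = sorted(set().union(*db))
    Lk = {frozenset([i]) for i in items if sup(frozenset([i])) >= min_sup}
    all_freq = set(Lk)
    k = 1
    while Lk:
        # Generate C_{k+1} by joining L_k with itself, then Apriori-prune:
        # every k-subset of a candidate must itself be frequent.
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = {c for c in Ck1 if sup(c) >= min_sup}
        all_freq |= Lk
        k += 1
    return all_freq

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(db, 2)
```

On the TDB from the worked example, `result` is exactly the union of L1, L2, and L3 shown on that slide: {A}, {B}, {C}, {E}, {A,C}, {B,C}, {B,E}, {C,E}, {B,C,E}.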
Important Details of Apriori
How to generate candidates?
- Step 1: self-joining Lk
- Step 2: pruning
How to count the supports of candidates?
Example of Candidate Generation
L3 = {abc, abd, acd, ace, bcd}
Self-joining L3*L3:
- abcd from abc and abd
- acde from acd and ace
Pruning:
- acde is removed because ade is not in L3
C4 = {abcd}
How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order.
Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
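The join-then-prune steps can be sketched in Python (the function name is illustrative); run on the L3 from the previous slide, it reproduces C4 = {abcd}:

```python
from itertools import combinations

def gen_candidates(Lk_minus_1, k):
    """Self-join L_{k-1} on the first k-2 items, then Apriori-prune."""
    L = sorted(tuple(sorted(x)) for x in Lk_minus_1)
    Lset = {frozenset(x) for x in L}
    Ck = set()
    for p in L:
        for q in L:
            # Join condition: equal on the first k-2 items, p's last < q's last.
            if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
                cand = frozenset(p) | {q[k - 2]}
                # Prune: every (k-1)-subset must be in L_{k-1}.
                if all(frozenset(s) in Lset
                       for s in combinations(sorted(cand), k - 1)):
                    Ck.add(cand)
    return Ck

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(gen_candidates(L3, 4))  # the only surviving candidate is {a, b, c, d}
```

The join produces abcd and acde; the prune step drops acde because its subset ade is missing from L3, just as in the example.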
How to Count Supports of Candidates?
Why is counting the supports of candidates a problem?
- The total number of candidates can be huge.
- One transaction may contain many candidates.
Method: candidate itemsets are stored in a hash tree.
- A leaf node of the hash tree contains a list of itemsets and counts.
- An interior node contains a hash table.
- A subset function finds all the candidates contained in a transaction.
Speeding Up Association Rules
The Dynamic Hashing and Pruning (DHP) technique
Thanks to Cheng Hong & Hu Haibo
DHP: Reduce the Number of Candidates
A k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent.
Example: candidates a, b, c, d, e; hash entries {ab, ad, ae}, {bd, be, de}, …; frequent 1-itemsets a, b, d, e.
Then ab is not a candidate 2-itemset if the total count of the bucket {ab, ad, ae} is below the support threshold.
J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95.
Still Challenging: the Niche for DHP
DHP (Park '95): Dynamic Hashing and Pruning.
- The set of candidate 2-itemsets is huge. DHP trims it using hashing.
- The transaction database is so large that even one scan per iteration is costly. DHP prunes both the number of transactions and the number of items in each transaction after each iteration.
What Does It Look Like?
Apriori iterates: generate candidate set, then count support.
DHP iterates: generate candidate set, count support, then build a new hash table for the next pass.
Hash Table Construction
Number all items i1, i2, …, in. Any pair (x, y) is hashed according to
bucket # = h({x, y}) = ((order of x) * 10 + (order of y)) mod 7
Example: items A, B, C, D, E with order 1, 2, 3, 4, 5.
h({C, E}) = (3*10 + 5) mod 7 = 0, so {C, E} belongs to bucket 0.
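The slide's hash function is a one-liner in Python; a quick sketch to sanity-check the worked example:

```python
order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

def bucket(x, y):
    """The slide's hash: h({x, y}) = (10 * order(x) + order(y)) mod 7."""
    return (order[x] * 10 + order[y]) % 7

print(bucket("C", "E"))  # (30 + 5) % 7 = 0
print(bucket("B", "E"))  # (20 + 5) % 7 = 4
```

So {C, E} indeed hashes to bucket 0, and {B, E} to bucket 4.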
How to Trim Candidate Itemsets
In iteration k, hash all (k+1)-itemsets that appear in the transactions into a hash table, counting the occurrences that fall into each bucket.
In iteration k+1, examine each candidate itemset: its bucket count must be at least the support threshold (a necessary condition).
Example

TID | Items
100 | A, C, D
200 | B, C, E
300 | A, B, C, E
400 | B, E

Figure 1. An example transaction database
Generation of C1 & L1 (1st iteration)

C1:
Itemset | Sup
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3

L1:
Itemset | Sup
{A} | 2
{B} | 3
{C} | 3
{E} | 3
Hash Table Construction
Find all 2-itemsets of each transaction:

TID | 2-itemsets
100 | {A C} {A D} {C D}
200 | {B C} {B E} {C E}
300 | {A B} {A C} {A E} {B C} {B E} {C E}
400 | {B E}
Hash Table Construction (2)
Hash function: h({x, y}) = ((order of x) * 10 + (order of y)) mod 7

Bucket | Contents          | Count
0      | {A D} {C E} {C E} | 3
1      | {A E}             | 1
2      | {B C} {B C}       | 2
3      | (empty)           | 0
4      | {B E} {B E} {B E} | 3
5      | {A B}             | 1
6      | {A C} {C D} {A C} | 3
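These bucket counts can be reproduced by hashing every 2-itemset of every transaction (a sketch; the variable names are illustrative):

```python
from itertools import combinations

order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
transactions = [{"A", "C", "D"}, {"B", "C", "E"},
                {"A", "B", "C", "E"}, {"B", "E"}]

def bucket(x, y):
    """h({x, y}) = (10 * order(x) + order(y)) mod 7, x before y in item order."""
    return (order[x] * 10 + order[y]) % 7

counts = [0] * 7
for t in transactions:
    for x, y in combinations(sorted(t), 2):
        counts[bucket(x, y)] += 1

print(counts)  # [3, 1, 2, 0, 3, 1, 3]
```

The result matches the bucket counts 3, 1, 2, 0, 3, 1, 3 shown in the table.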
C2 Generation (2nd iteration)
For each pair in L1*L1, look up its bucket count:

Pair  | # in its bucket
{A B} | 1
{A C} | 3
{A E} | 1
{B C} | 2
{B E} | 3
{C E} | 3

Resulting C2 (DHP): {A C}, {B C}, {B E}, {C E}
C2 of Apriori: {A B}, {A C}, {A E}, {B C}, {B E}, {C E}
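The whole trimming step can be sketched end to end: build the bucket counts, then keep only the L1*L1 pairs whose bucket count reaches the minimum support (names are illustrative):

```python
from itertools import combinations

order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
transactions = [{"A", "C", "D"}, {"B", "C", "E"},
                {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2

def bucket(x, y):
    return (order[x] * 10 + order[y]) % 7

# Bucket counts from hashing every 2-itemset of every transaction.
counts = [0] * 7
for t in transactions:
    for x, y in combinations(sorted(t), 2):
        counts[bucket(x, y)] += 1

# DHP keeps a pair from L1*L1 only if its bucket count reaches min_sup
# (a necessary, not sufficient, condition for the pair to be frequent).
L1 = ["A", "B", "C", "E"]
C2 = [(x, y) for x, y in combinations(L1, 2)
      if counts[bucket(x, y)] >= min_sup]
print(C2)  # [('A', 'C'), ('B', 'C'), ('B', 'E'), ('C', 'E')]
```

{A B} and {A E} are dropped because their buckets (counts 1 and 1) fall below the threshold, leaving the four-pair C2 from the slide instead of Apriori's six.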
Effective Database Pruning
Apriori: does not prune the database; it prunes Ck by support counting on the original database.
DHP: achieves more efficient support counting on a pruned database.
Performance Comparison
(performance charts from the original slides are not reproduced here)

Performance Comparison (2)
(performance charts from the original slides are not reproduced here)
Conclusion
- An effective hash-based algorithm for candidate itemset generation
- Two-phase transaction database pruning
- Much more efficient (in time and space) than the Apriori algorithm