The UNIVERSITY of KENTUCKY Association Rule Mining CS 685: Special Topics in Data Mining Spring 2009

Frequent Pattern Analysis
- Finding inherent regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What are the commonly occurring subsequences in a group of genes?
  - What are the shared substructures in a group of effective drugs?

What Is Frequent Pattern Analysis?
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- Applications
  - Identify motifs in bio-molecules: DNA sequence analysis, protein structure analysis
  - Identify patterns in micro-arrays
  - Business applications: market basket analysis, cross-marketing, catalog design, sales campaign analysis, etc.

Data
- An item is an element (a literal, a variable, a symbol, a descriptor, an attribute, a measurement, etc.)
- A transaction is a set of items
- A data set is a set of transactions
- A database is a data set

Transaction-id   Items bought
100              f, a, c, d, g, i, m, p
200              a, b, c, f, l, m, o
300              b, f, h, j, o
400              b, c, k, s, p
500              a, f, c, e, l, p, m, n

Association Rules
- Itemset X = {x_1, …, x_k}
- Find all the rules X → Y with minimum support and confidence
  - support, s, is the probability that a transaction contains X ∪ Y
  - confidence, c, is the conditional probability that a transaction having X also contains Y
- Let sup_min = 50%, conf_min = 50%
- Association rules:
  - A → C (60%, 100%)
  - C → A (60%, 75%)

[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]

Transaction-id   Items bought
100              f, a, c, d, g, i, m, p
200              a, b, c, f, l, m, o
300              b, f, h, j, o
400              b, c, k, s, p
500              a, f, c, e, l, p, m, n
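
The support and confidence figures quoted for A → C and C → A can be checked directly against the five example transactions. The following is a minimal sketch; the names `DB`, `support_count`, `support`, and `confidence` are illustrative and not part of the course material, and item names are written in lower case.

```python
# Transactions from the example database (TID -> set of items bought).
DB = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}

def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values())

def support(itemset, db):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, db) / len(db)

def confidence(x, y, db):
    """Estimate of P(Y | X): count(X ∪ Y) / count(X)."""
    return support_count(set(x) | set(y), db) / support_count(x, db)

print(support("ac", DB))          # support of {a, c}
print(confidence("a", "c", DB))   # confidence of a -> c
print(confidence("c", "a", DB))   # confidence of c -> a
```

On this database the three values come out as 60%, 100%, and 75%, matching the rules listed above.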

Apriori-based Mining
- Generate length-(k+1) candidate itemsets from length-k frequent itemsets, and
- Test the candidates against the database

Apriori Algorithm
- A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994)

Database D (min_sup = 2)
TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e

Scan D, 1-candidates (itemset: support):   a: 2, b: 3, c: 3, d: 1, e: 3
Freq 1-itemsets:                           a: 2, b: 3, c: 3, e: 3
2-candidates:                              ab, ac, ae, bc, be, ce
Scan D, counting:                          ab: 1, ac: 2, ae: 1, bc: 2, be: 3, ce: 2
Freq 2-itemsets:                           ac: 2, bc: 2, be: 3, ce: 2
3-candidates:                              bce
Scan D, freq 3-itemsets:                   bce: 2

Important Details of Apriori
- How to generate candidates?
  - Step 1: self-joining L_k
  - Step 2: pruning
- How to count supports of candidates?

How to Generate Candidates?
- Suppose the items in L_{k-1} are listed in an order
- Step 1: self-join L_{k-1}

    INSERT INTO C_k
    SELECT p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    FROM L_{k-1} p, L_{k-1} q
    WHERE p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2},
          p.item_{k-1} < q.item_{k-1}

- Step 2: pruning

    For each itemset c in C_k do
        For each (k-1)-subset s of c do
            if (s is not in L_{k-1}) then delete c from C_k

Example of Candidate-generation
- L_3 = {abc, abd, acd, ace, bcd}
- Self-joining: L_3 * L_3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning:
  - acde is removed because ade is not in L_3
- C_4 = {abcd}
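
The join and prune steps can be sketched in a few lines. This is one possible rendering, assuming each itemset is stored as a tuple of items in sorted order; the function name `apriori_gen` is chosen here for illustration.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build C_k from L_{k-1}; itemsets are sorted tuples of items."""
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = set()
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join step: identical first k-2 items, p's last item < q's last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune step: every (k-1)-subset of c must be in L_{k-1}.
                if all(s in prev_set for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates

L3 = {tuple(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
C4 = apriori_gen(L3)
print(C4)
```

With the L_3 from the example, the join produces abcd and acde, and the prune step drops acde because ade is missing from L_3, leaving only abcd.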

How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
  - The total number of candidates can be huge
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction

Apriori: Candidate Generation-and-test
- Any subset of a frequent itemset must also be frequent (an anti-monotone property)
  - A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
  - {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent
- No superset of any infrequent itemset should be generated or tested
  - Many item combinations can be pruned

The Apriori Algorithm
- C_k: candidate itemsets of size k
- L_k: frequent itemsets of size k

    L_1 = {frequent items};
    for (k = 1; L_k != ∅; k++) do
        C_{k+1} = candidates generated from L_k;
        for each transaction t in database do
            increment the count of all candidates in C_{k+1} that are contained in t
        L_{k+1} = candidates in C_{k+1} with min_support
    return ∪_k L_k;
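
The level-wise loop can be made concrete with a short sketch. It assumes itemsets are sorted tuples and counts candidates by a plain subset test on every transaction rather than with a hash-tree, so it illustrates the control flow only; the function name `apriori` and the variable `D` are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_sup}      # L_1
    frequent = dict(level)
    while level:
        prev = sorted(level)
        # Generate C_{k+1}: join on a shared prefix, then prune by subsets.
        candidates = set()
        for i, p in enumerate(prev):
            for q in prev[i + 1:]:
                if p[:-1] == q[:-1]:
                    c = p + (q[-1],)
                    if all(s in level for s in combinations(c, len(c) - 1)):
                        candidates.add(c)
        # One scan of the database to count every candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_sup}  # L_{k+1}
        frequent.update(level)
    return frequent

# The four-transaction database from the Apriori walkthrough, min_sup = 2.
D = ["acd", "bce", "abce", "be"]
result = apriori(D, min_sup=2)
for itemset in sorted(result, key=lambda s: (len(s), s)):
    print("".join(itemset), result[itemset])
```

Run on the walkthrough database, the sketch yields the frequent 1-itemsets a, b, c, e, the frequent 2-itemsets ac, bc, be, ce, and the single frequent 3-itemset bce.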

Challenges of Frequent Pattern Mining
- Challenges
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori: general ideas
  - Reduce the number of transaction database scans
  - Shrink the number of candidates
  - Facilitate support counting of candidates

DIC: Reduce Number of Scans
- Once both A and D are determined frequent, the counting of AD can begin
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD can begin
- S. Brin, R. Motwani, J. Ullman, and S. Tsur, 1997.

[Figure: itemset lattice over {A, B, C, D}, from {} up to ABCD]
[Figure: timeline over the transactions comparing when Apriori and DIC start counting 1-itemsets, 2-itemsets, and 3-itemsets]

DHP: Reduce the Number of Candidates
- A hashing bucket count < min_sup → every candidate in the bucket is infrequent
  - Candidates: a, b, c, d, e
  - Hash entries: {ab, ad, ae}, {bd, be, de}, …
  - Large 1-itemsets: a, b, d, e
  - The sum of counts of {ab, ad, ae} < min_sup → ab should not be a candidate 2-itemset
- J. Park, M. Chen, and P. Yu, 1995
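
The bucket test can be sketched as follows. This is only an illustration of the idea, not the published DHP implementation: the hash function `bucket_of`, the bucket count `n_buckets`, and the function name are assumptions made here, and items are assumed to be single characters.

```python
from itertools import combinations
from collections import Counter

def dhp_pair_candidates(transactions, min_sup, n_buckets=7):
    """First-scan sketch: count items and hash every pair into a bucket."""
    item_counts = Counter()
    buckets = [0] * n_buckets

    def bucket_of(pair):
        # Hypothetical hash function, chosen only for illustration.
        return (ord(pair[0]) * 31 + ord(pair[1])) % n_buckets

    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        for pair in combinations(items, 2):
            buckets[bucket_of(pair)] += 1

    frequent_items = sorted(i for i, n in item_counts.items() if n >= min_sup)
    # A pair stays a candidate only if both items are frequent
    # and the bucket it hashes to reaches min_sup.
    return {
        pair
        for pair in combinations(frequent_items, 2)
        if buckets[bucket_of(pair)] >= min_sup
    }

D = ["acd", "bce", "abce", "be"]
candidates = dhp_pair_candidates(D, min_sup=2)
print(sorted(candidates))
```

A bucket's count is at least the support of every pair hashed into it, so the filter never discards a truly frequent pair; it can only remove some infrequent ones before the counting scan.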

Partition: Scan Database Only Twice
- Partition the database into n partitions
- Itemset X is frequent → X is frequent in at least one partition
  - Scan 1: partition the database and find local frequent patterns
  - Scan 2: consolidate global frequent patterns
- A. Savasere, E. Omiecinski, and S. Navathe, 1995
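
A rough sketch of the two scans follows. It assumes the minimum support is given as a fraction of the transactions, splits the database into contiguous chunks, and uses a brute-force local miner that is only practical for tiny partitions; all function names are illustrative.

```python
from itertools import combinations
from math import ceil

def frequent_itemsets(transactions, min_count):
    """Brute-force local miner (suitable for tiny partitions only)."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            n = sum(set(c) <= t for t in transactions)
            if n >= min_count:
                result[c] = n
    return result

def partition_mine(transactions, min_frac, n_parts):
    transactions = [set(t) for t in transactions]
    size = ceil(len(transactions) / n_parts)
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Scan 1: collect every itemset that is locally frequent in some partition.
    candidates = set()
    for part in parts:
        candidates |= set(frequent_itemsets(part, ceil(min_frac * len(part))))
    # Scan 2: count each candidate against the whole database.
    global_min = ceil(min_frac * len(transactions))
    counts = {c: sum(set(c) <= t for t in transactions) for c in candidates}
    return {c: n for c, n in counts.items() if n >= global_min}

D = ["acd", "bce", "abce", "be"]
result = partition_mine(D, min_frac=0.5, n_parts=2)
print(sorted(result.items()))
```

The second scan is what removes itemsets that were frequent in one partition but fall short of the global threshold.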

Sampling for Frequent Patterns
- Select a sample of the original database and mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked
  - Example: check abcd instead of ab, ac, …, etc.
- Scan the database again to find missed frequent patterns
- H. Toivonen, 1996

Bottleneck of Frequent-pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i_1 i_2 … i_100
    - # of scans: 100
    - # of candidates: C(100, 1) + C(100, 2) + … + C(100, 100) = 2^100 - 1 ≈ 1.27 × 10^30
- Bottleneck: candidate-generation-and-test
- Can we avoid candidate generation?
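
Every nonempty subset of a frequent 100-itemset is itself frequent, so a level-wise miner has to generate each one as a candidate at some level. The size of that candidate space can be checked with a few lines of arithmetic; the variable names are illustrative.

```python
from math import comb

n = 100
# One candidate per nonempty subset: sizes 1 through n.
total = sum(comb(n, k) for k in range(1, n + 1))

# The binomial coefficients over all sizes sum to 2^n; dropping the empty set gives 2^n - 1.
assert total == 2**n - 1
print(f"{total:.2e}")
```

The total is on the order of 10^30, which is why avoiding explicit candidate generation matters for long patterns.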

Set Enumeration Tree
- Subsets of I can be enumerated systematically
  - I = {a, b, c, d}

[Figure: set enumeration tree over I, level by level]
    a    b    c    d
    ab   ac   ad   bc   bd   cd
    abc  abd  acd  bcd
    abcd
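
One way to enumerate the subsets systematically is to let each node be extended only with items that come after its last item in the chosen order, so that no subset is produced twice. A minimal sketch, with the function name `set_enumeration` chosen for illustration:

```python
def set_enumeration(items, prefix=()):
    """Yield every subset of `items` in depth-first set-enumeration order."""
    yield prefix
    for i, item in enumerate(items):
        # A node may only be extended with items that follow its last item.
        yield from set_enumeration(items[i + 1:], prefix + (item,))

nodes = ["".join(n) for n in set_enumeration("abcd")]
print(nodes)
```

The empty string in the output stands for the root (the empty set). Each of the 16 subsets of {a, b, c, d} appears exactly once, and the children of a node such as a are ab, ac, and ad.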

Borders of Frequent Itemsets
- Connected
  - X and Y are frequent and X is an ancestor of Y → all patterns between X and Y are frequent

[Figure: the set enumeration tree over {a, b, c, d}]

Projected Databases
- To find a child Xy of X, only the X-projected database is needed
  - The X-projected database is the sub-database of transactions containing X
  - Xy is frequent when item y is frequent in the X-projected database

[Figure: the set enumeration tree over {a, b, c, d}]
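
Projection turns pattern growth into a recursion: count items in the current projected database, and for each frequent item recurse on the transactions that contain it. The sketch below is one possible rendering, not the specific algorithm of any cited paper. It additionally drops items that precede the chosen item in the order, an assumption made here so that each itemset is reached along exactly one path of the set enumeration tree. The names `project`, `mine`, and `D` are illustrative.

```python
def project(db, item):
    """Transactions containing `item`, restricted to the items ordered after it."""
    return [[x for x in t if x > item] for t in db if item in t]

def mine(db, min_sup, prefix=()):
    """Depth-first mining: extend `prefix` using only its projected database."""
    counts = {}
    for t in db:
        for x in t:
            counts[x] = counts.get(x, 0) + 1
    patterns = {}
    for item in sorted(counts):
        if counts[item] >= min_sup:
            pattern = prefix + (item,)
            patterns[pattern] = counts[item]
            # Children of `pattern` are found in its own projected database.
            patterns.update(mine(project(db, item), min_sup, pattern))
    return patterns

# The four-transaction example database used earlier, min_sup = 2.
D = [sorted(set(t)) for t in ["acd", "bce", "abce", "be"]]
patterns = mine(D, min_sup=2)
print(patterns)
```

No candidate itemsets are generated here: the support of Xy is read off as the count of y inside the X-projected database.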

Tree-Projection Method
- Find frequent 2-itemsets
- For each frequent 2-itemset xy, form a projected database
  - The sub-database containing xy
- Recursive mining
  - If x'y' is frequent in the xy-projected database, then xyx'y' is a frequent pattern

Borders and Max-patterns
- Max-patterns: borders of frequent patterns
  - Any subset of a max-pattern is frequent
  - Any proper superset of a max-pattern is infrequent

[Figure: the set enumeration tree over {a, b, c, d}]

MaxMiner: Mining Max-patterns
- 1st scan: find frequent items
  - A, B, C, D, E
- 2nd scan: find support for
  - AB, AC, AD, AE, ABCDE
  - BC, BD, BE, BCDE
  - CD, CE, CDE
  - DE
  - (ABCDE, BCDE, and CDE are the potential max-patterns)
- Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan
- R. Bayardo, 1998

Min_sup = 2
Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F
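
The definition of a max-pattern can be checked on this three-transaction database by brute force. The sketch below is not MaxMiner: it enumerates every itemset and then keeps the frequent ones that have no frequent proper superset, which is feasible only for a handful of items. Function names are illustrative.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """All itemsets contained in at least `min_sup` transactions (brute force)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(frozenset().union(*transactions))
    frequent = set()
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            if sum(t.issuperset(c) for t in transactions) >= min_sup:
                frequent.add(frozenset(c))
    return frequent

def max_patterns(frequent):
    """Frequent itemsets with no frequent proper superset."""
    return {x for x in frequent if not any(x < y for y in frequent)}

DB = ["ABCDE", "BCDE", "ACDF"]
maximal = max_patterns(frequent_itemsets(DB, min_sup=2))
print(sorted("".join(sorted(m)) for m in maximal))
```

On this database the max-patterns are ACD and BCDE; every other frequent itemset is a subset of one of the two.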

Frequent Closed Patterns
- For a frequent itemset X, if there exists no item y outside X such that every transaction containing X also contains y, then X is a frequent closed pattern
  - "acdf" is a frequent closed pattern
- Concise representation of frequent patterns
- Reduces the number of patterns and rules
- N. Pasquier et al., in ICDT'99

Min_sup = 2
TID   Items
10    a, c, d, e, f
20    a, b, e
30    c, e, f
40    a, c, d, f
50    c, e, f
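
Closedness can be tested through the closure of an itemset, the set of items shared by every transaction that contains it: X is closed exactly when its closure equals X. The following brute-force sketch applies that test to the five example transactions; it is not the algorithm of the cited paper, and the names `closure` and `closed_patterns` are illustrative.

```python
from itertools import combinations

DB = {
    10: set("acdef"),
    20: set("abe"),
    30: set("cef"),
    40: set("acdf"),
    50: set("cef"),
}

def closure(itemset, db):
    """Items shared by every transaction that contains `itemset`."""
    covering = [t for t in db.values() if itemset <= t]
    return frozenset(set.intersection(*covering)) if covering else None

def closed_patterns(db, min_sup):
    """Map each frequent closed itemset to its support count (brute force)."""
    items = sorted(set().union(*db.values()))
    closed = {}
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            x = frozenset(c)
            sup = sum(x <= t for t in db.values())
            # X is closed exactly when its closure adds no new item.
            if sup >= min_sup and closure(x, db) == x:
                closed[x] = sup
    return closed

closed = closed_patterns(DB, min_sup=2)
for x in sorted(closed, key=lambda s: (len(s), sorted(s))):
    print("".join(sorted(x)), closed[x])
```

With min_sup = 2 the closed patterns are a, e, cf, ae, cef, and acdf. An itemset such as d is frequent but not closed, because both transactions containing d also contain a, c, and f.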

CLOSET: Mining Frequent Closed Patterns
- Flist: list of all frequent items in support-ascending order
  - Flist: d-a-f-e-c
- Divide the search space
  - Patterns having d
  - Patterns having a but no d, etc.
- Find frequent closed patterns recursively
  - Every transaction having d also has cfa → cfad is a frequent closed pattern
- J. Pei, J. Han, and R. Mao, 2000

Min_sup = 2
TID   Items
10    a, c, d, e, f
20    a, b, e
30    c, e, f
40    a, c, d, f
50    c, e, f

Closed and Max-patterns
- Closed pattern mining algorithms can be adapted to mine max-patterns
  - A max-pattern must be closed
- Depth-first search methods have advantages over breadth-first search ones

Multiple-level Association Rules
- Items often form a hierarchy
- Flexible support settings: items at the lower level are expected to have lower support
- The transaction database can be encoded based on dimensions and levels
- Explore shared multi-level mining

Item hierarchy:
  Level 1: Milk [support = 10%]
  Level 2: 2% Milk [support = 6%], Skim Milk [support = 4%]

Uniform support:  Level 1 min_sup = 5%, Level 2 min_sup = 5%
Reduced support:  Level 1 min_sup = 5%, Level 2 min_sup = 3%
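
The effect of the two threshold schemes on the milk hierarchy can be shown with a small sketch. The data layout and the function name `frequent_items` are assumptions made for illustration; the supports and thresholds are the ones given above.

```python
# (item, hierarchy level, support) from the milk example.
ITEMS = [("Milk", 1, 0.10), ("2% Milk", 2, 0.06), ("Skim Milk", 2, 0.04)]

def frequent_items(items, min_sup_by_level):
    """Keep the items whose support meets the threshold of their own level."""
    return [name for name, level, sup in items if sup >= min_sup_by_level[level]]

uniform = frequent_items(ITEMS, {1: 0.05, 2: 0.05})
reduced = frequent_items(ITEMS, {1: 0.05, 2: 0.03})
print("uniform:", uniform)
print("reduced:", reduced)
```

Under uniform support Skim Milk (4%) misses the 5% threshold and is dropped, while under reduced support the 3% threshold at level 2 keeps it.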

Multi-dimensional Association Rules
- Single-dimensional rules:
  - buys(X, "milk") → buys(X, "bread")
- Multi-dimensional rules: 2 or more dimensions or predicates
  - Inter-dimension association rules (no repeated predicates)
    - age(X, "19-25") ∧ occupation(X, "student") → buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates)
    - age(X, "19-25") ∧ buys(X, "popcorn") → buys(X, "coke")
- Categorical attributes: finite number of possible values, no order among values
- Quantitative attributes: numeric, implicit order

Quantitative/Weighted Association Rules
- age(X, "33-34") ∧ income(X, "30K - 50K") → buys(X, "high resolution TV")
- Numeric attributes are dynamically discretized to maximize the confidence or compactness of the rules
- 2-D quantitative association rules: A_quan1 ∧ A_quan2 → A_cat
- Cluster "adjacent" association rules to form general rules using a 2-D grid

[Figure: 2-D grid of Age versus Income, with income bins <20k, 20-30k, 30-40k, 40-50k, 50-60k, 60-70k, …]

Constraint-based Data Mining
- Find all the patterns in a database autonomously?
  - The patterns could be too many and not focused!
- Data mining should be interactive
  - The user directs what is to be mined
- Constraint-based mining
  - User flexibility: the user provides constraints on what is to be mined
  - System optimization: push constraints into the mining process for efficiency

Constraints in Data Mining
- Knowledge type constraint
  - classification, association, etc.
- Data constraint, using SQL-like queries
  - find product pairs sold together in stores in New York
- Dimension/level constraint
  - in relevance to region, price, brand, customer category
- Rule (or pattern) constraint
  - small sales (price $200)
- Interestingness constraint
  - strong rules: support and confidence

Potential Project Topics
- Novel data mining methods
  - Approximate frequent pattern mining
  - Mining from unstructured data, such as text and images
  - Semi-supervised learning

Potential Project Topics
- Bioinformatics applications
  - Mining functionally related genes across multiple gene expression datasets
  - Genome-wide association study
  - Co-evolution patterns among gene trees

Potential Project Topics
- Netflix Prize
- Information retrieval
- Medical information retrieval