Association Rule Mining
Zhenjiang Lin
Group Presentation
April 10, 2007

Overview
- Associations
- Market Basket Analysis
- Basic Concepts
- Frequent Itemsets
- Generating Frequent Itemsets
  - Apriori
  - FP-Growth
- Applications

Association Rule Learners
- Discover elements that co-occur frequently within a data set consisting of multiple independent selections of elements (such as purchasing transactions), and discover rules, such as implications or correlations, that relate co-occurring elements.
- Answer questions such as "If a customer purchases product A, how likely is he to purchase product B?" and "What products will a customer buy if he buys products C and D?"
- Reduce a potentially huge amount of information to a small, understandable set of statistically supported statements.
- Also known as "market basket analysis".

Associations
- Rules expressing relationships between items
- Example: {cereal, milk} => {fruit}
  - "People who bought cereal and milk also bought fruit."
  - Stores might want to offer specials on milk and cereal to get people to buy more fruit.

Market Basket Analysis
- Analyze tables of transactions
- Can we hypothesize?
  - Chips => Salsa
  - Lettuce => Spinach

  Person  Basket
  A       Chips, Salsa, Cookies, Crackers, Coke, Beer
  B       Lettuce, Spinach, Oranges, Celery, Apples, Grapes
  C       Chips, Salsa, Frozen Pizza, Frozen Cake
  D       Lettuce, Spinach, Milk, Butter

Market Baskets
- In general, the data consists of (TID, Basket) pairs:
  - TID: a transaction ID
  - Basket: a subset of items

Basic Concepts
- Set of items: I = {i1, i2, ..., im}
- Transaction: T, a subset of I
- D: a set of transactions (i.e., our data)
- Association rule: A => B, where A and B are subsets of I and A and B are disjoint

Measuring Interesting Rules
- Support s(A => B)
  - Ratio of the number of transactions containing both A and B to the total number of transactions
- Confidence c(A => B)
  - Ratio of the number of transactions containing both A and B to the number of transactions containing A

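As a concrete illustration (mine, not the slides'), both metrics can be computed directly over a list of transactions; the function names are assumed for this sketch:

    def support(transactions, itemset):
        """Fraction of transactions that contain every item in itemset."""
        itemset = set(itemset)
        return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

    def confidence(transactions, lhs, rhs):
        """support(lhs union rhs) divided by support(lhs)."""
        return (support(transactions, set(lhs) | set(rhs))
                / support(transactions, lhs))
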
Measuring Interesting Rules
- Rules are included/excluded based on two metrics:
  - Minimum support level: how frequently all of the items in a rule appear in transactions
  - Minimum confidence level: how frequently the left-hand side of a rule implies the right-hand side

Market Basket Analysis
- What is I?
- What is T for person B?
- What is s(Chips => Salsa)?
- What is c(Chips => Salsa)?

  Person  Basket
  A       Chips, Salsa, Cookies, Crackers, Coke, Beer
  B       Lettuce, Spinach, Oranges, Celery, Apples, Grapes
  C       Chips, Salsa, Frozen Pizza, Frozen Cake
  D       Lettuce, Spinach, Milk, Butter, Chips

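Working these out from the table (my answers, not on the slide): I is the set of all distinct items across the four baskets; T for person B is {Lettuce, Spinach, Oranges, Celery, Apples, Grapes}; Chips and Salsa occur together in baskets A and C, so s(Chips => Salsa) = 2/4 = 50%; Chips occurs in baskets A, C, and D, so c(Chips => Salsa) = 2/3, roughly 67%.
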
Frequent Itemsets
- Itemset: any set of items
- k-itemset: an itemset containing k items
- Frequent itemset: an itemset that satisfies a minimum support level
- If I contains m items, how many itemsets are there?

Strong Association Rules
- Given an itemset, it's easy to generate association rules
- Given the itemset {Chips, Salsa}:
  - {} => Chips, Salsa
  - Chips => Salsa
  - Salsa => Chips
  - Chips, Salsa => {}
- Strong rules are interesting
  - Generally defined as those rules satisfying minimum support and minimum confidence

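A small sketch (my own helper, not from the slides) that enumerates every left/right split of an itemset, reproducing the four rules above for {Chips, Salsa}:

    from itertools import combinations

    def rules_from_itemset(itemset):
        """Yield every (lhs, rhs) partition of itemset, including the
        two trivial rules with an empty side."""
        items = sorted(itemset)
        for r in range(len(items) + 1):
            for lhs in combinations(items, r):
                rhs = tuple(i for i in items if i not in lhs)
                yield set(lhs), set(rhs)
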
Association Rule Mining
- Two basic steps:
  1. Find all frequent itemsets
     - Satisfying minimum support
  2. Find all strong association rules
     - Generate association rules from the frequent itemsets
     - Keep the rules satisfying minimum confidence

Generating Frequent Itemsets
- Naïve algorithm:

  n <- |D|
  for each subset s of I do
      l <- 0
      for each transaction T in D do
          if s is a subset of T then
              l <- l + 1
      if minimum support <= l/n then
          add s to frequent subsets

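A direct, runnable transcription of this pseudocode in Python (a sketch: D is assumed to be a list of sets, I a set of items, and min_sup a fraction):

    from itertools import combinations

    def naive_frequent_itemsets(D, I, min_sup):
        """Count every non-empty subset s of I against all of D."""
        n = len(D)
        frequent = []
        for k in range(1, len(I) + 1):
            for s in combinations(sorted(I), k):  # enumerate subsets by size
                l = sum(1 for T in D if set(s) <= T)
                if min_sup <= l / n:
                    frequent.append(set(s))
        return frequent
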
Generating Frequent Itemsets
- Analysis of the naïve algorithm:
  - 2^m subsets of I
  - Scan n transactions for each subset
  - O(2^m * n) tests of s being a subset of T
- Growth is exponential in the number of items!
- Can we do better?

Generating Frequent Itemsets
- Frequent itemsets support the apriori property:
  - If A is not a frequent itemset, then any superset of A is not a frequent itemset.
- Proof: Let n be the number of transactions. Suppose A is a subset of l transactions. If A' is a superset of A, then A' is a subset of l' <= l transactions. Thus, if l/n < minimum support, so is l'/n.

Generating Frequent Itemsets
- Central idea: build candidate k-itemsets from frequent (k-1)-itemsets
- Approach:
  - Find all frequent 1-itemsets
  - Extend (k-1)-itemsets to candidate k-itemsets
  - Prune candidate itemsets that do not meet the minimum support

Generating Frequent Itemsets (Basic Apriori)

  L1 = {frequent 1-itemsets}
  for (k = 2; L(k-1) is not empty; k++) {
      Ck = generate k-itemset candidates from L(k-1)
      for each transaction t in D {
          // The candidates that are subsets of t
          Ct = subset(Ck, t)
          for each candidate c in Ct {
              c.count++
          }
      }
      Lk = {c in Ck | c.count >= min_sup}
  }
  The frequent itemsets are the union of the Lk

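A hedged Python sketch of the same loop (candidate generation by self-join plus apriori pruning is my assumption for the "generate" step; min_sup here is an absolute count):

    from itertools import combinations

    def apriori(D, min_sup):
        D = [set(t) for t in D]
        # L1: count single items and keep the frequent ones
        counts = {}
        for t in D:
            for i in t:
                counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
        Lk = {s for s, c in counts.items() if c >= min_sup}
        frequent = set(Lk)
        k = 2
        while Lk:
            # Generate k-itemset candidates by joining (k-1)-itemsets,
            # then prune any whose (k-1)-subsets are not all frequent
            Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
            Ck = {c for c in Ck
                  if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
            # Keep the candidates contained in enough transactions
            Lk = {c for c in Ck if sum(1 for t in D if c <= t) >= min_sup}
            frequent |= Lk
            k += 1
        return frequent
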
FP-Growth (Han, Pei, Yin 2000)
- One problematic aspect of Apriori is candidate generation
  - Source of exponential growth
- Another approach is to use a divide-and-conquer strategy
- Idea: compress the database into a frequent-pattern tree representing the frequent items

FP-Growth (Tree Construction)
- Initially, scan the database for frequent 1-itemsets
- Place the resulting set in a list L in descending order by frequency (support)
- Construct an FP-tree:
  - Create a root node labeled null
  - Scan the database
  - Process the items in each transaction in L order
  - From the root, add nodes in the order in which items appear in the transactions
  - Link nodes representing the same item along different branches

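A minimal construction sketch following these steps (the dict-based node layout and the header table of node lists are my assumptions, not the paper's representation); the worked example on the next slides follows this procedure:

    from collections import defaultdict

    def build_fp_tree(D, min_count):
        # First scan: count item frequencies and build the ordered list L
        freq = defaultdict(int)
        for t in D:
            for i in t:
                freq[i] += 1
        L = [i for i in sorted(freq, key=freq.get, reverse=True)
             if freq[i] >= min_count]
        # Second scan: insert each transaction's frequent items in L order
        root = {"item": None, "count": 0, "children": {}}
        header = defaultdict(list)  # item -> its nodes (the node-links)
        for t in D:
            node = root
            for i in [x for x in L if x in t]:
                child = node["children"].get(i)
                if child is None:
                    child = {"item": i, "count": 0, "children": {}}
                    node["children"][i] = child
                    header[i].append(child)
                child["count"] += 1
                node = child
        return root, header, L
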
Frequent 1-itemsets
- Minimum support of 20% (frequency of 2)
- Frequent 1-itemsets: I1, I2, I3, I4, I5 (I6 occurs only once and is dropped)
- Construct list L = {(I2,7), (I1,6), (I3,6), (I4,2), (I5,2)}

  TID  Items
  1    I1, I2, I5
  2    I2, I4
  3    I2, I3, I6
  4    I1, I2, I4
  5    I1, I3
  6    I2, I3
  7    I1, I3
  8    I1, I2, I3, I5
  9    I1, I2, I3

Build FP-Tree
- Create the root node null
- Scan the database
  - Transaction 1: I1, I2, I5
  - L order: I2, I1, I5
- Process the transaction
  - Add nodes in item order, labeling each with its item and count:
    null -> (I2,1) -> (I1,1) -> (I5,1)
- Maintain a header table: I2:1, I1:1, I3:0, I4:0, I5:1

Build FP-Tree
- Transaction 2 (I2, I4) shares the existing I2 node and adds a new branch:

  null
    (I2,2)
      (I1,1)
        (I5,1)
      (I4,1)

- Header table: I2:2, I1:1, I3:0, I4:1, I5:1

Mining the FP-Tree
- Start at the last item in the table
- Find all paths containing the item
  - Follow the node-links
- Identify the conditional patterns
  - Patterns in paths with the required frequency
- Build a conditional FP-tree C
- Append the item to all paths in C, generating frequent patterns
- Mine C recursively (appending the item)
- Remove the item from the table and the tree

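Continuing the dict-based sketch from the construction slide, one way (my helper, not the paper's) to collect an item's prefix paths, i.e., its conditional pattern base:

    def prefix_paths(node, item, path=()):
        """Return (prefix, count) pairs for every node carrying item,
        walking down from the root; the prefix excludes item itself."""
        out = []
        for child in node["children"].values():
            if child["item"] == item:
                if path:
                    out.append((path, child["count"]))
            else:
                out.extend(prefix_paths(child, item, path + (child["item"],)))
        return out

On the running example, prefix_paths(root, "I5") yields ((I2, I1), 1) and ((I2, I1, I3), 1), matching the prefix paths on the next slide.
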
Mining the FP-Tree
- The completed FP-tree:

  null
    (I2,7)
      (I1,4)
        (I5,1)
        (I4,1)
        (I3,2)
          (I5,1)
      (I4,1)
      (I3,2)
    (I1,2)
      (I3,2)

- Header table: I2:7, I1:6, I3:6, I4:2, I5:2
- Mining I5:
  - Prefix paths: (I2 I1, 1) and (I2 I1 I3, 1)
  - Conditional path: (I2 I1, 2)
  - Conditional FP-tree: null -> (I2,2) -> (I1,2)
  - Generated frequent pattern: (I2 I1 I5, 2)

Applications
- Web Personalization
- Genomic Data

Web Personalization
- "Effective Personalization Based on Association Rule Discovery from Web Usage Data," Mobasher et al., ACM Workshop on Web Information and Data Management (WIDM), 2001
- Personalization and recommendation systems
  - e.g., Amazon.com's recommended books

Data Preprocessing
- Identify the set of pageviews P
  - Which files result in a single browser display (complicated by frames, images, etc.)
  - P = {p1, ..., pn}
- Identify transactions T
  - From session IDs or cookies
  - T = {t1, ..., tm}

Data Preprocessing
- A transaction t consists of
  t = {(p1^t, w(p1^t)), ..., (pl^t, w(pl^t))}
- w is a weight associated with the pageview
  - Could be binary (purchase or non-purchase)
  - Could be related to the amount of time spent on the page

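A tiny illustration (pageview names and values invented) of this transaction model, with each transaction mapping pageviews to weights:

    # Time-on-page weights (seconds) for one hypothetical transaction
    t = {"/home": 12.0, "/products": 45.5, "/checkout": 8.2}
    # Binary weights: 1 if the pageview occurred with a purchase, else 0
    t_binary = {"/home": 1, "/products": 1, "/checkout": 0}
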
Data Preprocessing
- The paper only considered pageviews in a transaction with w(p) = 1
- The ordering of pageviews didn't matter

Recommendation Engine
- Has to run online, i.e., must be fast
  - Generate the frequent itemsets first and store them in a graph data structure for efficient searching
- Maintains a history of the user's current session
  - Sets a window size w (e.g., 3)
  - After pageviews A, B, C the window is {A, B, C}
  - If the user then visits D, the window becomes {B, C, D}
- (See the sketch below.)

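A sketch of such an engine (class and method names are mine; the paper stores itemsets in a graph for efficient lookup, whereas this simplification just scans a list):

    from collections import deque

    class RecommendationEngine:
        def __init__(self, frequent_itemsets, w=3):
            self.itemsets = [set(s) for s in frequent_itemsets]  # mined offline
            self.window = deque(maxlen=w)  # last w pageviews of the session

        def visit(self, pageview):
            self.window.append(pageview)  # oldest pageview falls out at size w

        def recommend(self):
            active = set(self.window)
            recs = set()
            for s in self.itemsets:
                if active < s:            # itemset strictly contains the window
                    recs |= s - active    # recommend its remaining pageviews
            return recs
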
Genomic Data
- "Finding Association Rules on Heterogeneous Genome Data," Satou et al.
- Combined data from PDB, SWISS-PROT, and PROSITE

  [Table: proteins (rows) against binary attributes (columns) such as sequence features, structure features, and functions, with 0/1 entries indicating presence.]

Genomic Data
- After mining, association rules were generated (minimum support = 5, minimum confidence = 65%)
- Post-processed the results with a maximum support of 30
  - Itemsets appearing too frequently aren't interesting
- Reduced to 381 rules

Genomic Data
- The rules generated were corroborated by biological background data
  - Found common substructures in serine endopeptidases
- The rules were not distributed well over protein families
- Still some work to be done on the data preprocessing stage

Association Rule Summary
- Association rule mining is a fundamental tool in data mining
- Several algorithms:
  - Apriori: uses a provable mathematical property to improve performance
  - FP-Growth: stops candidate generation, uses an effective data structure
  - Correlation Rules: evaluate interestingness based on statistics
  - Query Flocks: generalize the approach for query optimization (incorporation into database systems)

Association Rule Summary
- Several extensions exist:
  - Hierarchical attributes (e.g., year -> month -> week -> day, or computer -> luggable -> handheld -> palm)
  - Multilevel/multidimensional rules
  - Numerical attributes
  - Constraint-based mining