COMP5318 Knowledge Discovery and Data Mining


Week 8: Mining Association Rules
Reference: TSK pp. 328-353, 363-370; Dunham pp. 125-142

Outline
- What is Association Rule Mining?
- Basic concepts: item, itemset, transaction, support, confidence, ...
- Association rule problem definition
- Apriori principle: what, why, how
- Apriori algorithm
- FP-growth algorithm
- Discussion

What is Association Rule Mining?
- Association rule mining finds combinations of items that typically occur together in a database (market-basket analysis), and sequences of items that occur frequently in a database (sequential analysis).
- Originally introduced for market-basket analysis -- useful for analysing the purchasing behaviour of customers.

Market-Basket Analysis – Examples
- Where should strawberries be placed to maximize their sale?
- Services purchased together by telecommunication customers (e.g. broadband Internet, call forwarding, etc.) help determine how to bundle these services together to maximize revenue.
- Unusual combinations of insurance claims can be a sign of fraud.
- Medical histories can give indications of complications based on combinations of treatments.
- Sport: analyzing game statistics (shots blocked, assists, fouls) to gain a competitive advantage, e.g. "When player X is on the floor, player Y's shot accuracy decreases from 75% to 30%." Bhandari et al. (1997). Advanced Scout: data mining and knowledge discovery in NBA data, Data Mining and Knowledge Discovery, 1(1), pp. 121-125.

Basic Concepts
- Set of items: I = {i1, i2, ..., im}; set of transactions: T = {t1, t2, ..., tn}; each transaction ti is a subset of I.
- Example: 5 transactions T = {t1, t2, ..., t5} over 5 items I = {Bread, Jelly, PeanutButter, Milk, Beer}.
- Itemset – a collection (set) of 1 or more items. If an itemset contains k items, it is called a k-itemset, e.g. {Jelly, Milk, Bread} is a 3-itemset.
(Example dataset: the table of the 5 transactions is shown on the slide.)

Basic Concepts (cont.)
- We search for rules of the form X => Y, where X and Y are itemsets, e.g. {Bread} => {Jelly}, {Bread, Jelly} => {PeanutButter}.
- Formally: given I = {i1, i2, ..., im} and T = {t1, t2, ..., tn}, an association rule is an implication of the form X => Y, where X, Y ⊆ I and X ∩ Y = ∅ (i.e. X and Y are disjoint itemsets).
- Association rules have 2 important properties: support and confidence. These measure how "interesting" the rule is.

Support of an Itemset
- The support of an itemset X is the number (or percentage) of transactions containing that itemset.
- Example: What is support({Bread, PeanutButter})? Answer: 3 (or 3/5 = 60%).

Support and Confidence of an Association Rule
- Support of an association rule X => Y is the number (or percentage) of transactions that contain X ∪ Y: support(X => Y) = support(X ∪ Y).
  Measures how often the rule occurs in the dataset. Low support: "uninteresting rule; occurs by chance".
- Confidence of an association rule X => Y is the number of transactions that contain X ∪ Y divided by the number of transactions that contain X: confidence(X => Y) = support(X ∪ Y) / support(X).
  Measures the reliability (strength) of the rule. Confidence can be seen as approximating (estimating) P(Y|X).
- Intuition: given X, what is the most likely Y? The Y for which P(Y|X) is the highest.
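Both measures can be computed with a single scan over the transactions. Below is a minimal Python sketch (my own illustration, not from the slides); the small transactions list is a stand-in dataset, so its counts need not match the table shown on the slides.

    # Minimal sketch: computing support and confidence by scanning transactions.
    # The transactions below are illustrative only, not the lecture's dataset.
    transactions = [
        {"Bread", "Jelly", "PeanutButter"},
        {"Bread", "PeanutButter"},
        {"Bread", "Milk", "PeanutButter"},
        {"Beer", "Bread"},
        {"Beer", "Milk"},
    ]

    def support(itemset, transactions):
        """Fraction of transactions that contain every item in `itemset`."""
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(X, Y, transactions):
        """support(X union Y) / support(X)."""
        return support(set(X) | set(Y), transactions) / support(X, transactions)

    print(support({"Bread", "PeanutButter"}, transactions))       # 0.6
    print(confidence({"Bread"}, {"PeanutButter"}, transactions))  # 0.75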

Support and Confidence – Example
What is the support and confidence of the following rules?
- {Beer} => {Bread}
- {Bread, PeanutButter} => {Jelly}
Recall: support(X => Y) = support(X ∪ Y); confidence(X => Y) = support(X ∪ Y) / support(X).

Association Rule Mining – Problem Definition
- Given a set of transactions T = {t1, t2, ..., tn} and 2 thresholds, minsup and minconf, find all association rules X => Y with support ≥ minsup and confidence ≥ minconf.
- I.e. we want rules with high confidence and support; we call these rules interesting.
- We would like to: design an efficient algorithm for mining association rules in large data sets, and develop an effective approach for distinguishing interesting rules from spurious ones.

Generating Association Rules – Approach 1 (Naïve)
- Enumerate all possible rules and select those that satisfy the minimum support and confidence thresholds.
- Not practical for large databases: for a dataset with m items, the total number of possible rules is 3^m - 2^(m+1) + 1 (why? hint: use the inclusion-exclusion principle) -- and most of these will be discarded!
- We need a strategy for rule generation: generate only the promising rules (rules that are likely to be interesting) or, more accurately, don't generate rules that can't be interesting.
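As a sanity check on that count (not part of the lecture), the short Python sketch below enumerates every rule X => Y with X and Y non-empty and disjoint for small m and compares the result with the closed form 3^m - 2^(m+1) + 1.

    from itertools import combinations

    def count_rules_brute_force(m):
        """Count rules X => Y with X, Y non-empty, disjoint subsets of m items."""
        items = range(m)
        total = 0
        for k in range(1, m + 1):              # size of the left-hand side X
            for X in combinations(items, k):
                rest = [i for i in items if i not in X]
                total += 2 ** len(rest) - 1    # any non-empty subset of the rest can be Y
        return total

    for m in range(1, 7):
        closed_form = 3 ** m - 2 ** (m + 1) + 1
        assert count_rules_brute_force(m) == closed_form
        print(m, closed_form)                  # e.g. m=5 -> 180, m=6 -> 602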

Generating Association Rules – Approach 2
- What do these rules have in common? A,B => C; A,C => B; B,C => A. Answer: they have the same support, support({A,B,C}).
- The support of a rule X => Y depends only on the support of its itemset X ∪ Y.
- Hence, a better approach: find the frequent itemsets first, then generate the rules.
- A frequent itemset is an itemset whose support is at least minsup.
- If an itemset is infrequent, all the rules that contain it will have support < minsup and there is no need to generate them.

Generating Association Rules – Approach 2 (cont.)
2-step approach:
- Step 1: Generate the frequent itemsets -- frequent itemset mining (i.e. support ≥ minsup), e.g. {A,B,C} is frequent (so A,B => C, A,C => B and B,C => A all satisfy the minsup threshold).
- Step 2: From them, extract the rules that satisfy the confidence threshold (i.e. confidence ≥ minconf), e.g. maybe only A,B => C and C,B => A are confident.
Step 1 is the computationally difficult part (the next slides explain why, and a way to reduce the complexity).

Frequent Itemset Generation (Step 1) – Brute-Force Approach
- Enumerate all possible itemsets and scan the dataset to calculate the support of each of them.
- Example: I = {a, b, c, d, e}. (Figure: the search-space lattice showing superset/subset relationships.)
- Given d items, there are 2^d - 1 possible (non-empty) candidate itemsets => not practical for large d.
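A brute-force miner is easy to write but exponential in the number of items. The sketch below is my own illustration of the idea (the function name and the toy transactions are assumptions, not from the slides).

    from itertools import combinations

    def brute_force_frequent_itemsets(transactions, minsup):
        """Enumerate all 2^d - 1 non-empty itemsets, keep those with support >= minsup."""
        items = sorted(set().union(*transactions))
        n = len(transactions)
        frequent = {}
        for k in range(1, len(items) + 1):
            for candidate in combinations(items, k):
                count = sum(set(candidate) <= t for t in transactions)
                if count / n >= minsup:
                    frequent[candidate] = count
        return frequent

    transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}]  # toy data
    print(brute_force_frequent_itemsets(transactions, minsup=2/3))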

Frequent Itemset Generation (Step 1) – Apriori Principle (1)
- Any subset of a frequent itemset is also frequent.
- Example: if {c,d,e} is frequent then {c,d}, {c,e}, {d,e}, {c}, {d} and {e} are also frequent.

Frequent Itemset Generation (Step 1) – Apriori Principle (2)
- If an itemset is not frequent, any superset of it is also not frequent.
- Example: if we know that {a,b} is infrequent, the entire sub-graph above it can be pruned, i.e. {a,b,c}, {a,b,d}, {a,b,e}, {a,b,c,d}, {a,b,c,e}, {a,b,d,e} and {a,b,c,d,e} are infrequent.

Frequent Itemset Generation (Step 1) – Apriori Principle (3)
- That is: if an itemset is frequent then all its subsets are frequent.
- Equivalently (and more useful): if an itemset is not frequent, any superset of it is also not frequent.
- "Support is anti-monotonic" – support can only stay the same or decrease as we add items to an itemset.
- Use this to prune the search space – "support-based pruning".

Recall the 2-Step Process for Association Rule Mining
- Step 1: Find all frequent itemsets. So far: the main ideas and concepts (Apriori principle); the algorithms come later.
- Step 2: Generate the association rules from the frequent itemsets.

ARGen Algorithm (Step 2)
- Generates interesting rules from the frequent itemsets.
- We already know the rules meet the support threshold (why? because their itemsets are frequent), so we only need to check confidence.
- ARGen algorithm:
  for each frequent itemset F:
      generate all non-empty proper subsets S of F
      for each s in S:
          if confidence(s => F - s) >= minConf then output the rule s => F - s
- Example: F = {a,b,c}; S = {{a,b}, {a,c}, {b,c}, {a}, {b}, {c}}; rules output: {a,b} => {c}, etc.
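A direct Python rendering of this loop might look as follows (my own sketch; generate_rules and the toy support values at the bottom are illustrative assumptions, and the frequent-itemset dictionary is assumed to come from Step 1).

    from itertools import combinations

    def generate_rules(frequent, minconf):
        """ARGen sketch: frequent maps frozenset(itemset) -> support (count or fraction)."""
        rules = []
        for F, supp_F in frequent.items():
            if len(F) < 2:
                continue                          # a 1-itemset cannot be split into two parts
            for k in range(1, len(F)):            # all non-empty proper subsets s of F
                for s in map(frozenset, combinations(F, k)):
                    conf = supp_F / frequent[s]   # confidence(s => F - s)
                    if conf >= minconf:
                        rules.append((set(s), set(F - s), conf))
        return rules

    # Toy usage with made-up support fractions:
    frequent = {
        frozenset({"a"}): 0.5,
        frozenset({"b"}): 0.4,
        frozenset({"a", "b"}): 0.3,
    }
    for lhs, rhs, conf in generate_rules(frequent, minconf=0.5):
        print(lhs, "=>", rhs, f"(conf={conf:.2f})")   # a=>b (0.60) and b=>a (0.75)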

ARGen – Example
- minsup = 30%, minconf = 50%.
- The set of frequent itemsets is L = {{Beer}, {Bread}, {Milk}, {PeanutButter}, {Bread, PeanutButter}}.
- Only the last itemset in L can be split into 2 non-empty subsets of frequent itemsets – {Bread} and {PeanutButter} => 2 rules will be generated.
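For concreteness, here is the confidence check for the two candidate rules from that last itemset – a small standalone calculation assuming the support values quoted on the earlier example slides ({Bread} 60%, {PeanutButter} 60%, {Bread, PeanutButter} 60%):

    sup = {"Bread": 0.6, "PeanutButter": 0.6, ("Bread", "PeanutButter"): 0.6}

    conf_bread_pb = sup[("Bread", "PeanutButter")] / sup["Bread"]          # 1.0
    conf_pb_bread = sup[("Bread", "PeanutButter")] / sup["PeanutButter"]   # 1.0

    # Both confidences are >= minconf (50%), so both rules are output:
    # {Bread} => {PeanutButter} and {PeanutButter} => {Bread}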

Summary So Far
- Concepts: item, itemset, transaction, support, confidence, association rules.
- 2-step process for association rule mining:
  - Step 1: Frequent itemset mining – the most computationally difficult step in association rule mining; Apriori principle – support is anti-monotonic.
  - Step 2: Extract the rules from the frequent itemsets (ARGen).

What's Next?
Algorithms for finding frequent itemsets (i.e. Step 1):
- Apriori algorithm
- FP-Growth algorithm

Apriori Algorithm

Frequent Itemset Generation – Apriori Algorithm
1. Generate candidate itemsets of size 1 (all items).
2. Scan the database to see which of them are frequent (database scan step).
3. Use only the frequent itemsets to generate the set of candidates of size = size + 1 (candidate generation step – AprioriGen).
4. If any candidates were generated, go to step 2. Otherwise stop: all frequent itemsets have been found.
Recall that we would then generate the association rules from these frequent itemsets (ARGen).

Apriori Algorithm – Example
I = {Beer, Bread, Jelly, PeanutButter, Milk}; T = {t1, t2, ..., t5}; let minSup = 30%.

Level 1:
  Candidate itemsets: {Beer} (40%), {Bread} (60%), {Jelly} (20%), {Milk} (40%), {PeanutButter} (60%)
  Frequent itemsets: {Beer}, {Bread}, {Milk}, {PeanutButter}
Level 2:
  Candidate itemsets: {Beer, Bread} (20%), {Beer, Milk} (20%), {Beer, PeanutButter} (0%), {Bread, Milk} (20%), {Bread, PeanutButter} (60%), {Milk, PeanutButter} (20%)
  Frequent itemsets: {Bread, PeanutButter}
Level 3: no candidates can be generated => stop.

Why are there no candidates containing Jelly? Because {Jelly} is infrequent at level 1. (AprioriGen, the candidate generation procedure, is covered later.)

Note the Benefits of Using the Apriori Principle for Candidate Generation
- For our simple example (5 items):
  - Brute-force approach – generate all candidate itemsets of a given size.
  - Apriori – generate candidate itemsets only from frequent itemsets.
- => Apriori is much more efficient: with 4 frequent 1-itemsets it generates only 6 level-2 candidates, i.e. 5 + 6 = 11 candidates in total.
- Let's look into this in more detail...

Apriori-Gen (Candidate Generation)
- Apriori-Gen is the algorithm for generating candidate itemsets of size k from the frequent itemsets of size k-1.
- Initially (k = 1), all itemsets of size 1 are considered as candidate itemsets.
- From k = 2 onwards, different strategies:
  - Brute force (for comparison only – not part of Apriori)
  - Fk-1 x F1
  - Fk-1 x Fk-1

Brute-Force Approach for Candidate Generation (not part of Apriori)
- Generate all possible combinations of size k (from the original d items) and then prune the infrequent ones => generates C(d, k) candidate itemsets at level k, where d is the total number of items.
- Pruning so many itemsets is expensive: the total cost of generation and pruning is O(d * 2^(d-1)).
- And we would already know that most of these cannot be frequent.
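The d * 2^(d-1) figure follows from the identity sum_k k * C(d,k) = d * 2^(d-1), since each of the C(d,k) candidates at level k has k items to process. A quick illustrative check (my own snippet, assuming Python 3.8+ for math.comb):

    from math import comb

    d = 10
    total_work = sum(k * comb(d, k) for k in range(1, d + 1))
    print(total_work, d * 2 ** (d - 1))   # both print 5120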

Apriori-Gen – Fk-1 x F1
- Extend each frequent (k-1)-itemset with a frequent 1-itemset.
- Will generate all frequent itemsets of size k, as each frequent k-itemset consists of a frequent (k-1)-itemset and a frequent 1-itemset => the procedure is complete.
- Less computationally expensive than brute force.
- However, it may generate the same candidate itemset more than once, and it may generate candidates that we already know cannot be frequent (Apriori principle) – see the next slide.
- Solution to the duplicates: keep the frequent itemsets in lexicographic order and allow a (k-1)-itemset to be extended only with lexicographically larger items, e.g. {Bread, Diapers} can be extended with {Milk} but {Diapers, Milk} cannot be extended with {Bread}.

Apriori-Gen – Fk-1 x F1 (cont.)
- Although an improvement, it may still produce unnecessary candidate itemsets.
- E.g. merging {Beer, Diapers} with {Milk} is not necessary, as one of the subsets ({Beer, Milk}) is infrequent.
- This can be checked at candidate generation time and the itemset discarded, or the itemset can be generated and then pruned.

Apriori-Gen – Fk-1 x Fk-1
- Assumes lexicographic ordering of the items within itemsets.
- Merges a pair of frequent (k-1)-itemsets only if their first k-2 items are identical.
- E.g. for k=3, merging itemsets of size 2: {Bread, Diapers} and {Bread, Milk} will be merged into {Bread, Diapers, Milk}; {Beer, Diapers} and {Diapers, Milk} will not be merged.
- Complete procedure; will not generate duplicates.
- Does not guarantee that all generated candidate itemsets are frequent => pruning is still needed.
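A sketch of this merge-and-prune step (my own illustration; itemsets are represented as sorted tuples and apriori_gen is an assumed name, not from the slides):

    from itertools import combinations

    def apriori_gen(prev_frequent):
        """Fk-1 x Fk-1 candidate generation with support-based (subset) pruning.

        prev_frequent: set of frequent (k-1)-itemsets, each a sorted tuple of items.
        Returns the set of candidate k-itemsets.
        """
        prev = sorted(prev_frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                a, b = prev[i], prev[j]
                if a[:-1] == b[:-1]:               # merge only if the first k-2 items match
                    cand = tuple(sorted(set(a) | set(b)))
                    # prune: every (k-1)-subset of the candidate must itself be frequent
                    if all(sub in prev_frequent for sub in combinations(cand, len(cand) - 1)):
                        candidates.add(cand)
        return candidates

    f2 = {("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")}
    print(apriori_gen(f2))   # {('Bread', 'Diapers', 'Milk')}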

Frequent Itemset Generation in Apriori – Clothing Example
Given: 20 clothing transactions; minSup = 20% (i.e. 4 transactions), minConf = 50%.
Generate the frequent itemsets using the Apriori algorithm and the Fk-1 x Fk-1 strategy for candidate itemset generation.

1. Level 1 – generate all 1-itemsets and find the frequent ones:
Level 1 candidates: {Blouse} (3), {Jeans} (14), {Shoes} (10), {Shorts} (5), {Skirt} (6), {TShirt} (13)
Level 1 frequent: {Jeans} (14), {Shoes} (10), {Shorts} (5), {Skirt} (6), {TShirt} (13)

Frequent Itemset Generation in Apriori – Clothing Example (cont.)
2. Use AprioriGen to generate candidate 2-itemsets from the frequent 1-itemsets (F1 x F1):
Level 2 candidates: {Jeans, Shoes} (7), {Jeans, Shorts} (5), {Jeans, Skirt} (2), {Jeans, TShirt} (8), {Shoes, Shorts} (4), {Shoes, Skirt} (3), {Shoes, TShirt} (9), {Shorts, Skirt} (0), {Shorts, TShirt} (4), {Skirt, TShirt} (3)
Level 2 frequent: {Jeans, Shoes} (7), {Jeans, Shorts} (5), {Jeans, TShirt} (8), {Shoes, Shorts} (4), {Shoes, TShirt} (9), {Shorts, TShirt} (4)

3. Use AprioriGen to generate candidate 3-itemsets from the frequent 2-itemsets (F2 x F2; the 1st item must be identical):
Level 3 candidates: {Jeans, Shoes, Shorts} (4), {Jeans, Shoes, TShirt} (7), {Jeans, Shorts, TShirt} (4), {Shoes, Shorts, TShirt} (4)
Level 3 frequent: {Jeans, Shoes, Shorts} (4), {Jeans, Shoes, TShirt} (7), {Jeans, Shorts, TShirt} (4), {Shoes, Shorts, TShirt} (4)

Frequent Itemset Generation in Apriori – Clothing Example (cont.)
4. Use AprioriGen to generate candidate 4-itemsets from the frequent 3-itemsets (F3 x F3; the 1st and 2nd items must be identical):
Level 4 candidate: {Jeans, Shoes, Shorts, TShirt} (4) – frequent.

5. Use AprioriGen to generate candidate 5-itemsets from the frequent 4-itemsets (F4 x F4; the 1st, 2nd and 3rd items must be identical):
No 5-itemset candidates can be generated => stop.

Clothing Example – Generation of Association Rules
- The next step is to use the frequent itemsets to generate association rules with the ARGen algorithm (see the earlier ARGen slide), with minConf = 50%.
- The set of frequent itemsets is L = {{Jeans}, {Shoes}, {Shorts}, {Skirt}, {TShirt}, {Jeans, Shoes}, {Jeans, Shorts}, {Jeans, TShirt}, {Shoes, Shorts}, {Shoes, TShirt}, {Shorts, TShirt}, {Jeans, Shoes, Shorts}, {Jeans, Shoes, TShirt}, {Jeans, Shorts, TShirt}, {Shoes, Shorts, TShirt}, {Jeans, Shoes, Shorts, TShirt}}.
- We ignore the first 5 (the 1-itemsets), as they cannot be split into 2 non-empty subsets.
- We test all the others, e.g. confidence({Jeans, Shoes} => {TShirt}) = support({Jeans, Shoes, TShirt}) / support({Jeans, Shoes}) = 7/7 = 100% ≥ 50%, so this rule is output; and so on.

Frequent Itemset Generation in Apriori – Pseudo Code
(The pseudo-code figure is shown on the slide; the candidate generation step can use e.g. Fk-1 x F1 or Fk-1 x Fk-1, etc.)
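Since the pseudo-code figure is not reproduced in this transcript, here is a compact Python sketch of the level-wise loop under the Fk-1 x Fk-1 strategy (my own illustration; the function names and the toy transactions are assumptions, not from the slides):

    from itertools import combinations

    def apriori(transactions, minsup):
        """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
        n = len(transactions)
        min_count = minsup * n

        def count(cands):
            return {c: sum(set(c) <= t for t in transactions) for c in cands}

        # Level 1: every single item is a candidate
        items = sorted(set().union(*transactions))
        level = {c: s for c, s in count((i,) for i in items).items() if s >= min_count}
        frequent = dict(level)

        while level:
            prev = set(level)
            cands = set()
            for a, b in combinations(sorted(prev), 2):
                if a[:-1] == b[:-1]:                          # Fk-1 x Fk-1 merge
                    cand = tuple(sorted(set(a) | set(b)))
                    if all(s in prev for s in combinations(cand, len(cand) - 1)):
                        cands.add(cand)                       # subset pruning (Apriori principle)
            level = {c: s for c, s in count(cands).items() if s >= min_count}  # database scan
            frequent.update(level)
        return frequent

    transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}, {"a", "d", "e"}, {"a", "b", "c"}]
    print(apriori(transactions, minsup=0.4))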

FP-Growth Algorithm

Frequent Pattern Growth (FP-Growth) Algorithm
- Apriori: a generate-and-test approach – generates candidate itemsets and tests whether they are frequent.
  Problem: the generation of candidate itemsets is expensive.
- FP-growth – the first algorithm that allows frequent itemset discovery without candidate itemset generation.
- Uses a compact data structure called an FP-tree and extracts the frequent itemsets directly from the FP-tree.

FP-Tree
- Nodes correspond to items and have a counter.
- The algorithm reads 1 transaction at a time and maps it to a path in the tree.
- Pointers link nodes containing the same item.
- Paths may overlap when transactions share items => the counters are incremented and pointers added.
- The more paths overlap, the higher the compression => the FP-tree may fit in memory => the frequent itemsets can be extracted directly from the FP-tree instead of making many passes over data stored on disk.

FP-Tree Construction
Pass 1: Scan the data and find the support of each item. Discard infrequent items. Sort the frequent items in decreasing order of support; for our example: a, b, c, d, e.
Pass 2: Construct the FP-tree:
- Read transaction 1, {a, b}. Create 2 nodes, a and b, and the path null -> a -> b. Set the counts of a and b to 1.
- Read transaction 2, {b, c, d}. Create 3 nodes for b, c and d and the path null -> b -> c -> d. Set the counts to 1. Note that although transactions 1 and 2 share b, the paths are disjoint because they do not share a common prefix.
- Read transaction 3, {a, c, d, e}. It shares the common prefix item {a} with transaction 1 => the paths for transactions 1 and 3 overlap and the frequency count of node a is incremented by 1.
- Continue until all transactions are mapped to a path in the FP-tree.
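A minimal FP-tree construction sketch (my own illustration; FPNode, build_fp_tree and the simple header dict standing in for the node-link pointers are assumptions, not from the slides). It uses the first three transactions of the example:

    from collections import defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.count, self.parent = item, 0, parent
            self.children = {}                     # item -> FPNode

    def build_fp_tree(transactions, min_count):
        # Pass 1: item supports; keep frequent items, order by decreasing support
        support = defaultdict(int)
        for t in transactions:
            for item in t:
                support[item] += 1
        frequent = {i for i, c in support.items() if c >= min_count}
        order = lambda item: (-support[item], item)

        root = FPNode(None, None)
        header = defaultdict(list)                 # item -> nodes holding it (node links)
        # Pass 2: insert each transaction as a path, sharing common prefixes
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in frequent), key=order):
                if item not in node.children:
                    node.children[item] = FPNode(item, node)
                    header[item].append(node.children[item])
                node = node.children[item]
                node.count += 1                    # overlapping paths just bump the counter
        return root, header

    transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}]  # first 3 transactions
    root, header = build_fp_tree(transactions, min_count=1)
    print({item: [n.count for n in nodes] for item, nodes in header.items()})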

FP-Tree Size
- The FP-tree is smaller than the uncompressed data because many transactions share items.
- Best case: all transactions contain the same set of items => 1 path in the FP-tree.
- Worst case: every transaction has a unique set of items (no items in common) => the size of the FP-tree equals the size of the original data.
- However, the physical storage requirements of the FP-tree are higher – the pointers between the nodes and the counters must also be stored.
- The size of the FP-tree depends on how the items are ordered. Ordering by decreasing support is typically used, but it does not always lead to the smallest tree.

FP-Growth Algorithm
- Extracts the frequent itemsets from the FP-tree.
- Bottom-up algorithm – works from the leaves towards the root.
- For our example – first look for frequent itemsets ending in e, then in d, c, b and a (note: reverse lexicographic order!).
- Extract the paths ending in e, d, c, b and a (also called prefix paths).
(Figures: the complete FP-tree and the prefix paths ending in e, d, c, b and a.)

FP-Growth Algorithm (cont. 1)
- Each prefix-path sub-tree is processed recursively to extract the frequent itemsets, and the solutions are then merged.
- E.g. the prefix-path sub-tree for e is used to extract frequent itemsets ending in e, then in de, ce, be and ae. Each of these is decomposed into sub-problems, e.g. de into cde, bde and ade:
  e -> de -> cde, bde, ade
       ce -> bce, ace
       be -> abe
       ae
  d -> ...
  c -> ...
  b -> ...
  a -> ...
- End of recursion: no more frequent itemsets can be extracted, i.e. the tree is empty or contains a single item (where "tree" = prefix-path sub-tree or conditional FP-tree).
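The recursion is easier to see in code. The sketch below is a simplified illustration of my own: it follows the same suffix-growth recursion, but represents each conditional database as a plain list of (itemset, count) pairs rather than an actual compressed FP-tree, so it is not the full FP-growth algorithm. The toy transactions are the same ones used in the Apriori sketch above, so the two outputs can be compared.

    from collections import defaultdict

    def fp_growth(pattern_base, minsup_count, suffix=()):
        """Recursively mine frequent itemsets ending in `suffix`.

        pattern_base: list of (itemset_as_tuple, count) pairs (the conditional database).
        Yields (frequent_itemset, support_count) pairs.
        """
        # Count item supports within the current conditional database
        support = defaultdict(int)
        for items, count in pattern_base:
            for item in items:
                support[item] += count

        for item, count in sorted(support.items()):
            if count < minsup_count:
                continue
            new_suffix = (item,) + suffix
            yield new_suffix, count
            # Conditional database for `item`: items preceding it in the global item order
            conditional = [
                (tuple(i for i in items if i < item), c)
                for items, c in pattern_base
                if item in items
            ]
            yield from fp_growth(conditional, minsup_count, new_suffix)

    transactions = [("a", "b"), ("b", "c", "d"), ("a", "c", "d", "e"), ("a", "d", "e"), ("a", "b", "c")]
    base = [(t, 1) for t in transactions]
    for itemset, count in fp_growth(base, minsup_count=2):
        print(itemset, count)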

FP-Growth Algorithm – Example
Extract the frequent itemsets ending in e. Let minsup = 2.
1) Obtain the prefix-path sub-tree for e.
2) Check whether {e} is a frequent item by adding the counts of the e nodes. If so, extract it. Here the count is 3 => {e} is extracted as a frequent itemset.
3) As {e} is frequent, find the frequent itemsets ending in de, ce, be and ae. To do this, we first need to obtain the conditional FP-tree for e.
(Figure: the prefix paths ending in e, d, c, b and a.)

FP-Growth Example (cont. 2)
To obtain the conditional FP-tree for e:
1. Update the support counts along the prefix paths to reflect the number of transactions containing e => b and c are set to 1, a to 2.
2. Remove the nodes containing e – the information about node e is no longer needed because of the previous step.
3. Remove infrequent items (nodes) from the prefix paths, e.g. b has a support of 1 and appears once => there is only 1 transaction containing both b and e => {b,e} is infrequent => remove b.
(Figure: the final conditional FP-tree for e.)

FP-Growth Example (cont. 3)
4) Use the conditional FP-tree for e to find the frequent itemsets ending in de, ce and ae (be is not considered, since b is not in the conditional FP-tree).
- For each of them (e.g. de), find the prefix paths from the conditional tree for e, extract the frequent itemsets, generate the next conditional FP-tree, and so on, recursively.
- Frequent itemsets ending in de? Extract {d,e}. The conditional FP-tree for de contains a single item, so there is no need to generate prefix paths ending in ade (they would be the same as the conditional FP-tree for de); extract the frequent itemsets (if any) – here {a,d,e} – and stop this branch of the recursion.
- Continue with the itemsets ending in ce, and then ae.
(Frequent itemsets extracted so far in this branch: {e}, {d,e}, {a,d,e}.)

FP-Growth Example – Solution
The FP-Growth algorithm finds the following frequent itemsets: (table shown on the slide).

Discussion
- Association rules are typically sought in very large databases => efficient algorithms are needed.
- The Apriori algorithm makes 1 pass through the dataset for each itemset size; the maximum number of database scans is k+1, where k is the cardinality of the largest frequent itemset (4 in the clothing example).
  This potentially large number of scans is a weakness of Apriori; sometimes the database is too big to be kept in memory and must be kept on disk.
- The amount of computation also depends on the minimum support; the confidence threshold has less impact, as it does not affect the number of passes.
- Variations: using sampling of the database, using partitioning of the database, generation of incremental rules.

Discussion (2)
- FP-growth is typically an order of magnitude faster than Apriori:
  - No candidate generation
  - Uses a compact data structure
  - Only 2 scans of the database: one to count the support of each item and one to build the FP-tree
  - The basic operations are building the FP-tree and counting