Chapter 5 Mining Association Rules with FP Tree Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2010
Mining Frequent Itemsets without Candidate Generation In many cases, the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to a good performance gain. However, it suffers from two nontrivial costs: (1) It may generate a huge number of candidates (for example, 10^4 frequent 1-itemsets can produce more than 10^7 candidate 2-itemsets). (2) It may need to scan the database many times.
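The candidate count quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming every pair of frequent 1-itemsets becomes a candidate 2-itemset; the variable names are illustrative.

```python
from math import comb

# Number of frequent 1-itemsets, taken from the example on the slide.
n_frequent_items = 10**4

# Assumption: Apriori pairs up every two frequent items,
# so the number of candidate 2-itemsets is "n choose 2".
n_candidate_pairs = comb(n_frequent_items, 2)

print(n_candidate_pairs)          # 49995000
print(n_candidate_pairs > 10**7)  # True
```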
Association Rules with Apriori Minimum support=2/9 Minimum confidence=70%
Bottleneck of Frequent-pattern Mining Multiple database scans are costly. Mining long patterns needs many passes of scanning and generates lots of candidates. To find the frequent itemset $i_1 i_2 \dots i_{100}$: # of scans: 100; # of candidates: $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$! Bottleneck: candidate-generation-and-test. Can we avoid candidate generation?
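The size of that candidate count is easy to confirm numerically. The sketch below sums the binomial coefficients for a 100-item pattern and compares the result with the closed form 2^100 - 1; nothing here is specific to any mining library.

```python
from math import comb

# Every non-empty subset of a frequent 100-itemset is itself frequent,
# so Apriori must generate all of them as candidates along the way.
total_candidates = sum(comb(100, k) for k in range(1, 101))

# The sum of all binomial coefficients C(100, k) for k = 1..100 is 2^100 - 1.
assert total_candidates == 2**100 - 1

print(f"{total_candidates:.3e}")  # 1.268e+30
```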
Mining Frequent Patterns Without Candidate Generation Grow long patterns from short ones using local frequent items. “abc” is a frequent pattern. Get all transactions having “abc”: DB|abc. If “d” is a local frequent item in DB|abc, then “abcd” is a frequent pattern.
Process of FP growth Scan DB once, find frequent 1-itemset (single item pattern) Sort frequent items in frequency descending order Scan DB again, construct FP-tree
Association Rules Let’s have an example
TID    Items
T100   1, 2, 5
T200   2, 4
T300   2, 3
T400   1, 2, 4
T500   1, 3
T600   2, 3
T700   1, 3
T800   1, 2, 3, 5
T900   1, 2, 3
FP Tree
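The two-scan construction can be sketched in Python for the nine-transaction example (T100 to T900) with a minimum support count of 2 (2/9). The class name `FPNode`, the function `build_fp_tree`, and the tie-breaking rule (equal counts ordered by item id) are assumptions made for illustration, not part of the slides.

```python
from collections import Counter

class FPNode:
    """One FP-tree node: an item, its count, a parent link and children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Scan 1: count each item and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Frequency-descending order; ties broken by item id (an assumption).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    root = FPNode(None)
    header = {item: [] for item in order}  # node-links per item

    # Scan 2: insert each transaction's frequent items in rank order.
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, order

# Transactions T100 .. T900 from the example.
transactions = [[1, 2, 5], [2, 4], [2, 3], [1, 2, 4], [1, 3],
                [2, 3], [1, 3], [1, 2, 3, 5], [1, 2, 3]]
root, header, order = build_fp_tree(transactions, min_count=2)

print(order)                   # [2, 1, 3, 4, 5]
print(root.children[2].count)  # 7
```

With this ordering, the root has two children: item 2 with count 7, and item 1 with count 2 (from the two transactions that contain items 1 and 3 but not item 2).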
Mining the FP tree
Benefits of the FP-tree Structure Completeness: preserves complete information for frequent pattern mining; never breaks a long pattern of any transaction. Compactness: reduces irrelevant info (infrequent items are gone); items are in frequency-descending order, so the more frequently occurring items are more likely to be shared; the tree is never larger than the original database (not counting node-links and count fields); for the Connect-4 DB, the compression ratio could be over 100.
Exercise A dataset has five transactions; let min_support = 60% and min_confidence = 80%. Find all frequent itemsets using FP Tree.
TID   Items_bought
T1    M, O, N, K, E, Y
T2    D, O, N, K, E, Y
T3    M, A, K, E
T4    M, U, C, K, Y
T5    C, O, O, K, I, E
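Association Rules with Apriori
Frequent 1-itemsets:   K:5, E:4, M:3, O:3, Y:3
Candidate 2-itemsets:  KE:4, KM:3, KO:3, KY:3, EM:2, EO:3, EY:2, MO:1, MY:2, OY:2
Frequent 2-itemsets:   KE, KM, KO, KY, EO
Frequent 3-itemset:    KEO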
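The level-wise counts for the exercise can be reproduced with a small Apriori sketch. It assumes a minimum support count of 3 (60% of five transactions) and writes each transaction as a string with one letter per item; the function name `apriori` and its structure are illustrative rather than a reference implementation.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = sorted({i for t in transactions for i in t})
    level = {}
    for i in items:
        s = support(frozenset([i]))
        if s >= min_count:
            level[frozenset([i])] = s

    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Test step: count support with a pass over the transactions.
        level = {c: support(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_count}
    return frequent

# T1 .. T5 from the exercise, one letter per item (duplicates collapse in a set).
exercise = ["MONKEY", "DONKEY", "MAKE", "MUCKY", "COOKIE"]
frequent = apriori(exercise, min_count=3)

for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print("".join(sorted(itemset)), frequent[itemset])
```

The run yields eleven frequent itemsets: the five single items, the pairs KE, KM, KO, KY and EO, and the triple KEO.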
Association Rules with FP Tree K:5 E:4 M:3 O:3 Y:3
Association Rules with FP Tree
Item   Conditional pattern base   Conditional FP-tree   Frequent patterns
Y      KEMO:1, KEO:1, KM:1        K:3                   KY
O      KEM:1, KE:2                KE:3                  KO, EO, KEO
M      KE:2, K:1                  K:3                   KM
E      K:4                        K:4                   KE
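The table can be cross-checked with a compact pattern-growth sketch. To stay short, it recurses directly on conditional pattern bases (lists of prefix paths with counts) instead of materialising each conditional FP-tree, so it is a simplification of FP-growth rather than the full algorithm. The function names and the alphabetical tie-break among equally frequent items are assumptions.

```python
from collections import Counter

def conditional_pattern_base(pattern_base, item, rank):
    """Prefix paths (items ranked above `item`) of every path containing `item`."""
    base = Counter()
    for items, count in pattern_base:
        if item in items:
            prefix = tuple(sorted(
                (i for i in set(items) if i in rank and rank[i] < rank[item]),
                key=rank.get))
            if prefix:
                base[prefix] += count  # identical prefixes are merged
    return list(base.items())

def fp_growth(pattern_base, min_count, suffix=()):
    """Return {frozenset(pattern): count} for a list of (items, count) paths."""
    counts = Counter()
    for items, count in pattern_base:
        for item in set(items):
            counts[item] += count
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Frequency-descending order; ties broken alphabetically (an assumption).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    patterns = {}
    for item in order:
        pattern = (item,) + suffix
        patterns[frozenset(pattern)] = frequent[item]
        # Grow longer patterns from the projected (conditional) database.
        base = conditional_pattern_base(pattern_base, item, rank)
        patterns.update(fp_growth(base, min_count, pattern))
    return patterns

# T1 .. T5 from the exercise, one letter per item; min support 60% of 5 = 3.
exercise = ["MONKEY", "DONKEY", "MAKE", "MUCKY", "COOKIE"]
database = [(tuple(t), 1) for t in exercise]
patterns = fp_growth(database, min_count=3)

for p in sorted(patterns, key=lambda s: (len(s), sorted(s))):
    print("".join(sorted(p)), patterns[p])
```

For item Y, the helper returns the prefix paths KEMO:1, KEO:1 and KM:1, and the overall result is the same eleven frequent itemsets that Apriori finds on this dataset.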
FP-Growth vs. Apriori: Scalability With the Support Threshold Data set T25I20D10K
Why Is FP-Growth the Winner? Divide-and-conquer: decompose both the mining task and DB according to the frequent patterns obtained so far leads to focused search of smaller databases Other factors no candidate generation, no candidate test compressed database: FP-tree structure no repeated scan of entire database basic ops—counting local freq items and building sub FP-tree, no pattern search and matching
Strong Association Rules Are Not Necessarily Interesting
Example 5.8 Misleading “Strong” Association Rule Of the 10,000 transactions analyzed, the data show that 6,000 of the customer transactions included computer games, while 7,500 included videos, and 4,000 included both computer games and videos.
Misleading “Strong” Association Rule For this example: Support (Game & Video) = 4,000 / 10,000 = 40%. Confidence (Game => Video) = 4,000 / 6,000 ≈ 66.7%. Suppose the rule passes our minimum support and confidence thresholds (30% and 60%, respectively).
Misleading “Strong” Association Rule However, the truth is that computer games and videos are negatively associated: the purchase of one of these items actually decreases the likelihood of purchasing the other. (How do we reach this conclusion?)
Misleading “Strong” Association Rule If the two purchases were independent, then with 60% of customers buying the game and 75% buying the video, we would expect 60% * 75% = 45% of customers to buy both. That is 4,500 transactions, which is more than the 4,000 actually observed.
From Association Analysis to Correlation Analysis Lift is a simple correlation measure, defined as follows. The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events. Lift(A, B) = P(A ∪ B) / (P(A) P(B)), where P(A ∪ B) denotes the probability that a transaction contains both A and B. If the value is less than 1, the occurrence of A is negatively correlated with the occurrence of B. If the value is greater than 1, then A and B are positively correlated. If the value equals 1, A and B are independent.
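Applying the lift formula to the game and video counts from Example 5.8 makes the negative correlation concrete. The sketch below only restates the arithmetic; the variable names are illustrative.

```python
# Counts from Example 5.8.
n_total, n_games, n_videos, n_both = 10_000, 6_000, 7_500, 4_000

p_games = n_games / n_total    # P(Game)  = 0.60
p_videos = n_videos / n_total  # P(Video) = 0.75
support = n_both / n_total     # P(Game and Video) = 0.40
confidence = n_both / n_games  # P(Video | Game), about 0.667

# Lift compares the observed joint probability with the value
# expected under independence, P(Game) * P(Video) = 0.45.
lift = support / (p_games * p_videos)

print(round(confidence, 3))  # 0.667
print(round(lift, 3))        # 0.889
```

Because the lift is below 1, the rule Game => Video passes the support and confidence thresholds while the two items are in fact negatively correlated.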
Mining Multiple-Level Association Rules Items often form hierarchies
Mining Multiple-Level Association Rules Flexible support settings: items at the lower level are expected to have lower support.
Level 1: Milk [support = 10%]
Level 2: 2% Milk [support = 6%], Skim Milk [support = 4%]
Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
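The effect of the two threshold schemes on the milk example can be sketched as follows. The dictionary layout and the helper name `frequent_items` are assumptions made for illustration.

```python
# Item -> (hierarchy level, support), using the figures from the slide.
supports = {
    "Milk": (1, 0.10),
    "2% Milk": (2, 0.06),
    "Skim Milk": (2, 0.04),
}

def frequent_items(supports, min_sup_by_level):
    """Keep the items whose support meets the threshold of their own level."""
    return [name for name, (level, sup) in supports.items()
            if sup >= min_sup_by_level[level]]

uniform = frequent_items(supports, {1: 0.05, 2: 0.05})
reduced = frequent_items(supports, {1: 0.05, 2: 0.03})

print(uniform)  # ['Milk', '2% Milk']
print(reduced)  # ['Milk', '2% Milk', 'Skim Milk']
```

Under uniform support, Skim Milk (4%) is discarded at level 2; lowering the level-2 threshold to 3% keeps it.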
Multi-level Association: Redundancy Filtering Some rules may be redundant due to “ancestor” relationships between items. Example:
milk => wheat bread [support = 8%, confidence = 70%]
2% milk => wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.