Data Mining: Concepts and Techniques (3rd ed.) — Chapter 6 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2013 Han, Kamber & Pei. All rights reserved. 1 1
Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods Frequent Itemset Mining Methods Which Patterns Are Interesting?—Pattern Evaluation Methods Summary
What Is Frequent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining Motivation: Finding inherent regularities in data What products were often purchased together?— Beer and diapers?! What are the subsequent purchases after buying a PC? Can we automatically classify web documents? Applications Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
Market Basket Analysis
Why Is Freq. Pattern Mining Important? Freq. pattern: An intrinsic and important property of datasets Foundation for many essential data mining tasks Association, correlation, and causality analysis Sequential, structural (e.g., sub-graph) patterns Pattern analysis in spatiotemporal, multimedia, time-series, and stream data Classification: frequent pattern analysis Cluster analysis: frequent pattern-based clustering Data warehousing: iceberg cube and cube-gradient Semantic data compression: fascicles Broad applications
Basic Concepts: Frequent Patterns Tid Items bought 10 Beer, Nuts, Diaper 20 Beer, Coffee, Diaper 30 Beer, Diaper, Eggs 40 Nuts, Eggs, Milk 50 Nuts, Coffee, Diaper, Eggs, Milk itemset: A set of one or more items k-itemset X = {x1, …, xk} (absolute) support, or, support count of X: Frequency or occurrence of an itemset X (relative) support, s, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X) An itemset X is frequent if X’s support is no less than a minsup threshold Customer buys diaper buys both buys beer
Basic Concepts: Association Rules Tid Items bought Find all the rules X Y with minimum support and confidence support, s, probability that a transaction contains X Y confidence, c, conditional probability that a transaction having X also contains Y Let minsup = 50%, minconf = 50% Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3 10 Beer, Nuts, Diaper 20 Beer, Coffee, Diaper 30 Beer, Diaper, Eggs 40 Nuts, Eggs, Milk 50 Nuts, Coffee, Diaper, Eggs, Milk Customer buys both Customer buys diaper Customer buys beer Association rules: (many more!) Beer Diaper (60%, 100%) Diaper Beer (60%, 75%)
Closed Patterns and Max-Patterns A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (1001) + (1002) + … + (110000) = 2100 – 1 = 1.27*1030 sub-patterns! Solution: Mine closed frequent itemsets and maximal frequent itemsets instead An itemset X is closed if X is frequent and there exists no super-itemset Y כ X, with the same support as X (proposed by Pasquier, et al. @ ICDT’99) An itemset X is a maximal if X is frequent and there exists no super-itemset Y כ X and Y is frequent (proposed by Bayardo @ SIGMOD’98) Closed pattern is a lossless compression of freq. patterns
Closed Patterns and Max-Patterns Exercise: Suppose a DB contains only two transactions <a1, …, a100>, <a1, …, a50> Let min_sup = 1 What is the set of closed frequent itemsets? {a1, …, a100}: 1 {a1, …, a50}: 2 What is the set of maximal frequent itemsets? What is the set of all itemsets? {a1}: 2, …, {a1, a2}: 2, …, {a1, a51}: 1, …, {a1, a2, …, a100}: 1 A big number: 2100 - 1
Chapter 5: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods Frequent Itemset Mining Methods Which Patterns Are Interesting?—Pattern Evaluation Methods Summary
Scalable Frequent Itemset Mining Methods Apriori: A Candidate Generation-and-Test Approach Improving the Efficiency of Apriori FPGrowth: A Frequent Pattern-Growth Approach Frequent Pattern Mining with Vertical Data Format
Apriori Algorithm Proposed by R. Agrawal and R. Srikanth in 1994 Name of the algorithm is based on the fact that the algorithm uses prior knowledge of the frequent itemset properties Apriori employs an iterative approach known as a level-wise search where k-itemsets are used to explore (k+1)-itemsets
Apriori Method Method: Initially, scan DB once to get frequent 1-itemset Generate length (k+1) candidate itemsets from length k frequent itemsets Terminate when no frequent or candidate set can be generated To improve the efficiency of the level-wise generation of frequent itemsets, the apriori property is used to reduce the search space All non-empty subsets of a frequent itemset must also be frequent If {beer, diaper, nuts} is frequent, so is {beer, diaper}
The Apriori Algorithm—An Example Supmin = 2 Itemset sup {A} 2 {B} 3 {C} {D} 1 {E} Database TDB Itemset sup {A} 2 {B} 3 {C} {E} L1 C1 Tid Items 10 A, C, D 20 B, C, E 30 A, B, C, E 40 B, E 1st scan C2 Itemset sup {A, B} 1 {A, C} 2 {A, E} {B, C} {B, E} 3 {C, E} C2 Itemset {A, B} {A, C} {A, E} {B, C} {B, E} {C, E} L2 2nd scan Itemset sup {A, C} 2 {B, C} {B, E} 3 {C, E} C3 L3 Itemset {B, C, E} 3rd scan Itemset sup {B, C, E} 2
The Apriori Algorithm (Pseudo-Code) Ck: Candidate itemset of size k Lk : frequent itemset of size k L1 = {frequent items}; for (k = 1; Lk !=; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support end return k Lk;
Implementation of Apriori How to generate candidates? Step 1: self-joining Lk Step 2: pruning Example of Candidate-generation L3={abc, abd, acd, ace, bcd} Self-joining: L3*L3 abcd from abc and abd acde from acd and ace Pruning: acde is removed because ade is not in L3 C4 = {abcd}
Generating Association Rules from Frequent Itemsets Method For each frequent itemset l, generate all nonempty subsets of l For every nonempty subset of l, output the rule s -> (l - s) if (support_coutn(l) / support_count(s)) >=min_conf Because the rules are generated from frequent itemsets, each one automatically satisfies the minimum support
Generating Association Rules Example Example: Given the following table and the frequent itemset X = {I1, I2, I5}, generate the association rules. The nonempty subsets of X are: {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5} The resulting association rules are: If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules are output, because these are the only ones generated that are strong
Scalable Frequent Itemset Mining Methods Apriori: A Candidate Generation-and-Test Approach Improving the Efficiency of Apriori FPGrowth: A Frequent Pattern-Growth Approach Frequent Pattern Mining with Vertical Data Format 19 19
Further Improvement of the Apriori Method Major computational challenges Multiple scans of transaction database Huge number of candidates Tedious workload of support counting for candidates Improving Apriori: general ideas Reduce passes of transaction database scans Shrink number of candidates Facilitate support counting of candidates
Partition: Scan Database Only Twice Scan 1: partition database and find local frequent patterns Scan 2: consolidate global frequent patterns A. Savasere, E. Omiecinski and S. Navathe, VLDB’95
Hash-based Technique: Reduce the Number of Candidates A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD’95
Sampling for Frequent Patterns Select a sample of original database, mine frequent patterns within sample using Apriori H. Toivonen. Sampling large databases for association rules. In VLDB’96
Dynamic Itemset Counting: Reduce Number of Scans Database is partitioned into blocks marked by start point New candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only before each complete database scan The technique uses count-so-far as the lower bound of the actual count If the count-so-far passes the minimum support, the itemset is added into the frequent itemset S. Brin R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97
Apriori Example Consider the following database with five transactions. Let min_sup = 60% and min_conf = 80%. Find all the frequent itemsets using Apriori method List all the strong association rules matching the following metarule
Scalable Frequent Itemset Mining Methods Apriori: A Candidate Generation-and-Test Approach Improving the Efficiency of Apriori FPGrowth: A Frequent Pattern-Growth Approach Frequent Pattern Mining with Vertical Data Format 26 26
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation Apriori: uses a generate-and-test approach – generates candidate itemsets and tests if they are frequent Generation of candidate itemsets is expensive(in both space and time) Support counting is expensive Subset checking (computationally expensive) Multiple Database scans (I/O) FP-Growth: allows frequent itemset discovery without candidate itemset generation. Two step approach: Step 1: Build a compact data structure called the FP-tree Built using 2 passes over the data-set. Step 2: Extracts frequent itemsets directly from the FP-tree
Step 1: FP-Tree Construction FP-Tree is constructed using 2 passes over the data-set: Pass 1: Scan data and find support for each item. Discard infrequent items. Sort frequent items in decreasing order based on their support. Use this order when building the FP-Tree, so common prefixes can be shared.
Step 1: FP-Tree Construction Pass 2: Nodes correspond to items and have a counter FP-Growth reads 1 transaction at a time and maps it to a path Fixed order is used, so paths can overlap when transactions share items (when they have the same prefix ). In this case, counters are incremented Pointers are maintained between nodes containing the same item, creating singly linked lists (dotted lines) The more paths that overlap, the higher the compression. FP-tree may fit in memory. Frequent itemsets extracted from the FP-Tree.
Step 2: The Frequent Pattern Growth Mining Method For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree Repeat the process on each newly created conditional FP-tree Until the resulting FP-tree is empty, or it contains only one path—single path will generate all the combinations of its sub-paths, each of which is a frequent pattern
Scalable Frequent Itemset Mining Methods Apriori: A Candidate Generation-and-Test Approach Improving the Efficiency of Apriori FPGrowth: A Frequent Pattern-Growth Approach Frequent Pattern Mining with Vertical Data Format
Mining by Exploring Vertical Data Format Minimum support = 2
Chapter 5: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods Frequent Itemset Mining Methods Which Patterns Are Interesting?—Pattern Evaluation Methods Summary
Interestingness Measure: Correlations (Lift) play basketball eat cereal [40%, 66.7%] play basketball not eat cereal [20%, 33.3%] Measure of dependent/correlated events: lift If lift =1, occurrence of itemset A is independent of occurrence of itemset B, otherwise, they are dependent and correlated Basketball Not basketball Sum (row) Cereal 2000 1750 3750 Not cereal 1000 250 1250 Sum(col.) 3000 5000
Chapter 5: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods Frequent Itemset Mining Methods Which Patterns Are Interesting?—Pattern Evaluation Methods Summary
Summary Basic concepts: association rules, support-confident framework, closed and max-patterns Scalable frequent pattern mining methods Apriori (Candidate generation & test) Fpgrowth Vertical format approach (ECLAT, CHARM, ...) Which patterns are interesting? lift