
Data Mining Association Analysis: Basic Concepts and Algorithms
Based on Introduction to Data Mining by Tan, Steinbach, Kumar

Outline
- Association rules & mining
- Finding frequent itemsets
  – Apriori algorithm
  – compact representations of frequent itemsets
  – alternative discovery methods
  – FP-growth algorithm
- Generating association rules
- Evaluating discovered rules

Association Rule Mining
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
Example of association rules (from market-basket transactions):
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!

Definition: Frequent Itemset
- Itemset
  – a collection of one or more items, e.g., {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items
- Support count (σ)
  – frequency of occurrence of an itemset
  – e.g., σ({Milk, Bread, Diaper}) = 2
- Support (s)
  – fraction of transactions that contain an itemset
  – e.g., s({Milk, Bread, Diaper}) = 2/5
- Frequent itemset
  – an itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule
- Association rule
  – an implication expression of the form X → Y, where X and Y are itemsets
  – example: {Milk, Diaper} → {Beer}
- Rule evaluation metrics
  – Support (s): fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X
- Example: for {Milk, Diaper} → {Beer}, s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4 and c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
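As a concrete sketch of these definitions (an illustration, not part of the slides), both metrics can be computed directly from the textbook's standard five market-basket transactions, whose contents are assumed below:

```python
# Minimal sketch: support and confidence over the textbook's five
# market-basket transactions (transaction contents assumed here).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(lhs | rhs, transactions) / support_count(lhs, transactions)

print(support({"Milk", "Bread", "Diaper"}, transactions))      # 0.4 (count 2)
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...
```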

Association Rule Mining Task
- Given a set of transactions T, the goal of association rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold
- Brute-force approach:
  – list all possible association rules
  – compute the support and confidence for each rule
  – prune rules that fail the minsup and minconf thresholds
  ⇒ computationally prohibitive!

Computational Complexity
- Given d unique items:
  – total number of itemsets = 2^d
  – total number of possible association rules: R = 3^d − 2^(d+1) + 1
  – if d = 6, R = 602 rules
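A quick brute-force check of this count (an illustrative sketch, not from the slides): enumerate every rule X → Y with non-empty, disjoint X and Y over d items and compare against the closed form.

```python
# Count all rules X -> Y (X, Y non-empty and disjoint) over d items and
# compare with R = 3^d - 2^(d+1) + 1.
from itertools import combinations

def num_rules(d):
    count = 0
    for k in range(2, d + 1):            # size of X u Y (need >= 2 items)
        for itemset in combinations(range(d), k):
            count += 2 ** k - 2          # non-empty proper subsets as LHS
    return count

d = 6
print(num_rules(d), 3 ** d - 2 ** (d + 1) + 1)   # both print 602
```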

Mining Association Rules
Example of rules:
  {Milk, Diaper} → {Beer} (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper} (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk} (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper} (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer} (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements

Mining Association Rules
- Two-step approach:
  1. Frequent itemset generation: generate all itemsets whose support ≥ minsup
  2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still computationally expensive

Frequent Itemset Generation
(Figure: the itemset lattice.) Given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation
- Brute-force approach:
  – each itemset in the lattice is a candidate frequent itemset
  – count the support of each candidate by scanning the database
  – match each transaction against every candidate
  – complexity ~ O(MNw), where M = 2^d is the number of candidates, N is the number of transactions, and w is the maximum transaction width ⇒ expensive since M = 2^d!

Outline
- Association rules & mining
- Finding frequent itemsets
  – Apriori algorithm
  – compact representations of frequent itemsets
  – alternative discovery methods
  – FP-growth algorithm
- Generating association rules
- Evaluating discovered rules

Frequent Itemset Generation Strategies
- Reduce the number of candidates (M)
  – complete search: M = 2^d
  – use pruning techniques to reduce M
- Reduce the number of comparisons (MN)
  – use efficient data structures to store the candidates or transactions
  – no need to match every candidate against every transaction
- Reduce the number of transactions (N)
  – transactions that do not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets

Reducing Number of Candidates
- Apriori principle:
  – if an itemset is frequent, then all of its subsets must also be frequent
  – equivalently, if an itemset is NOT frequent, then none of its supersets can be frequent
- The Apriori principle holds due to the following property of support:
  – the support of an itemset never exceeds the support of its subsets, i.e., ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
  – this is known as the anti-monotone property of support: support can only decrease (or stay equal) as more items are added to the set

Illustrating Apriori Principle
(Figure: once an itemset in the lattice is found to be infrequent, all of its supersets are pruned.)

Apriori Algorithm
- Method (see the sketch below):
  1. Let k = 1
  2. Generate frequent itemsets of length 1
  3. Repeat until no new frequent itemsets are identified:
     a) generate (k+1)-candidate itemsets by joining pairs of frequent k-itemsets that differ in only one item
     b) prune candidate itemsets containing subsets of length k that are infrequent
     c) count the support of each candidate by scanning the DB
     d) prune candidates that are infrequent, leaving only those that are frequent
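The method above maps almost line for line onto code. Below is a minimal, unoptimized Apriori sketch, under the assumption that transactions are given as sets and minsup as a raw support count:

```python
# Minimal Apriori sketch: level-wise candidate generation and pruning.
from itertools import combinations

def apriori(transactions, minsup_count):
    transactions = [frozenset(t) for t in transactions]

    def count(c):
        return sum(1 for t in transactions if c <= t)

    # Steps 1-2: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if count(frozenset([i])) >= minsup_count}
    all_frequent = set(frequent)
    k = 1
    # Step 3: repeat until no new frequent itemsets are identified
    while frequent:
        # 3a: join frequent k-itemsets that differ in exactly one item
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # 3b: prune candidates that have an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        # 3c-3d: count supports and keep only the frequent candidates
        frequent = {c for c in candidates if count(c) >= minsup_count}
        all_frequent |= frequent
        k += 1
    return all_frequent
```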

Generating Frequent 1-Itemsets
- Consider a set of transactions T with 6 unique items; suppose the minimum support count is 3
- Apriori makes a first pass through T to obtain the support counts of the candidate 1-itemsets
(Figure: candidate 1-itemsets → obtain support counts → Coke and Eggs are infrequent → prune the infrequent itemsets → frequent 1-itemsets.)

Generating Frequent 2-Itemsets
- Step 3a: generate candidate 2-itemsets from the frequent 1-itemsets
- Step 3b: no pruning due to infrequent subsets (all 1-item subsets are frequent)
- Step 3c: obtain support counts; {Bread, Beer} and {Milk, Beer} turn out to be infrequent
- Step 3d: prune the infrequent itemsets, leaving the frequent 2-itemsets

Generating Frequent 3-Itemsets
- Step 3a: generate candidate 3-itemsets from the frequent 2-itemsets
- Step 3b: prune candidates with infrequent subsets; e.g., {Beer, Bread, Diaper} is pruned since {Beer, Bread} is infrequent
- Step 3c: obtain support counts
- Step 3d: no pruning needed due to insufficient support, leaving the frequent 3-itemsets
- Number of candidates considered:
  – without pruning: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 (there are 6C1 candidates for frequent 1-itemsets)
  – with pruning: 6C1 + 4C2 + 3 = 6 + 6 + 3 = 15

Reducing Number of Comparisons
- Candidate counting:
  – scan the database of transactions to determine the support of each candidate itemset
  – to reduce the number of comparisons, store the candidates in a hash structure
  – instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets

Generate Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
- Items in itemsets and transactions are ordered (in the same way)
- Hash function: h(x) = x mod 3, grouping items 1,4,7 / 2,5,8 / 3,6,9
- Max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)
- Hash on the first item at the root, on the second item at the next level, and on the third item below that; here the max leaf size is 3 (see the sketch below)
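The bucket assignment is easy to see in code. A hypothetical sketch of the root-level hashing, using the candidate list above:

```python
# Root level of the hash tree: candidates are distributed by hashing
# their FIRST item with h(x) = x mod 3; deeper levels hash the second
# and third items in the same way.
from collections import defaultdict

def h(x):
    return x % 3      # items 1,4,7 / 2,5,8 / 3,6,9 share a bucket

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9),
              (1,3,6), (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7),
              (6,8,9), (3,6,7), (3,6,8)]

root = defaultdict(list)
for c in candidates:
    root[h(c[0])].append(c)           # hash on the first item
for bucket, members in sorted(root.items()):
    print(bucket, members)
```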

Association Rule Discovery: Hash Tree
(Figure: the candidate hash tree; at the root, itemsets whose first item is 1, 4, or 7 follow the corresponding branch.)

Association Rule Discovery: Hash Tree
(Figure: within a subtree, hashing continues on the second item: branches for second items 1, 4, or 7; 2, 5, or 8; and 3, 6, or 9.)

Association Rule Discovery: Hash Tree
(Figure: the branch of the candidate hash tree for itemsets whose first item is 2, 5, or 8.)

Association Rule Discovery: Hash Tree
(Figure: the branch of the candidate hash tree for itemsets whose first item is 3, 6, or 9.)

Association Rule Discovery: Hash Tree
(Figure: an over-full leaf node is split by hashing on the second item.)

Association Rule Discovery: Hash Tree
(Figure: a leaf is not split further once its size is at most the max leaf size; if it were split by the third item, its itemsets would be distributed into child buckets.)

Subset Operation
Given a transaction t, what are its possible subsets of size 3?
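For example, a transaction t = {1, 2, 3, 5, 6} (the contents here are an assumption matching the textbook's figure) has C(5,3) = 10 subsets of size 3, each of which must be checked against the candidates:

```python
# Enumerate the candidate 3-subsets of a transaction (items kept sorted,
# consistent with the ordering used by the hash tree).
from itertools import combinations

t = [1, 2, 3, 5, 6]
subsets = list(combinations(t, 3))
print(len(subsets))   # C(5,3) = 10
print(subsets)
```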

Subset Operation Using Hash Tree
(Figure: matching a transaction against the hash tree starts by hashing on the transaction's first items at the root.)

Subset Operation Using Hash Tree
(Figure: the matching continues by hashing on the second and third items along each matching path.)

Subset Operation Using Hash Tree
(Figure: in this example, 5 leaves are visited and 9 candidates are compared against the itemsets of the transaction.)

Factors Affecting Complexity of Apriori
- Choice of minimum support threshold
  – lowering the support threshold results in more frequent itemsets
  – this may increase (1) the number of candidates and (2) the max length of frequent itemsets
- Dimensionality (number of unique items) of the data set
  – more candidates, and more space needed to store their support counts
  – if the number of frequent items also increases, both computation and I/O costs may increase
- Size of the database
  – since Apriori makes multiple passes, the run time may increase with the number of transactions
- Average transaction width
  – transaction width increases with denser data sets
  – this may increase the max length of frequent itemsets and the number of hash-tree traversals (the number of subsets in a transaction increases with its width)

Outline
- Association rules & mining
- Finding frequent itemsets
  – Apriori algorithm
  – compact representations of frequent itemsets
  – alternative discovery methods
  – FP-growth algorithm
- Generating association rules
- Evaluating discovered rules

Compact Representation of Frequent Itemsets
- Consider itemsets with positive support counts drawn from 3 groups: {A1, …, A10}, {B1, …, B10}, {C1, …, C10}
  – no transaction contains items from different groups
- All subsets of items within the same group have the same support count
  – e.g., the support counts of {A1}, {A1, A2}, and {A1, A2, A3} are all 5

Compact Representation of Frequent Itemsets
- Suppose the minimum support count is 3
- Number of frequent itemsets: 3 × Σ_{k=1..10} C(10, k) = 3 × (2^10 − 1) = 3069
⇒ we need a compact representation

Maximal Frequent Itemset
- An itemset is maximal frequent if it is frequent but none of its immediate supersets is frequent.
(Figure: the lattice with a border separating frequent from infrequent itemsets; the maximal itemsets lie just inside the border.)

Maximal Frequent Itemset
- Every frequent itemset is either maximal or a subset of some maximal frequent itemset.

Properties of Maximal Frequent Itemsets
- If level k is the highest level (itemset size) that contains maximal frequent itemsets, then level k contains only maximal frequent itemsets and infrequent itemsets.

Properties of Maximal Frequent Itemsets
- Furthermore, level k + 1 (if it exists) contains only infrequent itemsets.

Maximal Frequent Itemset
- Knowing the maximal frequent itemsets ⇒ knowing all frequent itemsets (each is a subset of some maximal one), but not their supports.

Closed Itemset
- An itemset is closed if none of its immediate supersets has the same support as the itemset.
(Figure: the itemset lattice over {A, B, C, D, E} annotated with support counts; closed itemsets such as BCE are highlighted, and some itemsets are not supported by any transaction.)
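A direct, if naive, test of closedness follows from the definition: compare an itemset's support against each of its immediate supersets. The helper below is an illustrative sketch, not part of the slides:

```python
# An itemset X is closed iff every immediate superset X u {i} has a
# strictly smaller support count (a superset can never have a larger one).
def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def is_closed(itemset, universe, transactions):
    s = support_count(itemset, transactions)
    return all(support_count(itemset | {i}, transactions) < s
               for i in universe - itemset)
```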

Properties of Closed Itemsets
- Every itemset is either a closed itemset or has the same support as one of its immediate supersets.

Properties of Closed Itemsets
- If X is not closed, X has the same support as its immediate superset with the largest support:
  – suppose Y1, …, Yk are X's immediate supersets, and Yi has the largest support among them
  – σ(X) ≥ σ(Yj) for every j, so σ(X) ≥ max{σ(Y1), …, σ(Yk)} = σ(Yi)
  – since X is not closed, σ(X) = σ(Yj) for some j; combined with σ(Yj) ≤ σ(Yi) ≤ σ(X), this gives σ(X) = σ(Yi)

Properties of Closed Itemsets
- Suppose the highest level of the lattice is level k
  – the single itemset at level k (the only k-itemset) must be closed, since it has no immediate supersets

Properties of Closed Itemsets
- The support counts of all itemsets can be computed from the support counts of the closed itemsets alone.

Properties of Closed Itemsets
- Compute the supports of itemsets at level k from those at level k + 1: if an itemset at level k is not closed, its support equals the largest support among its immediate supersets at level k + 1.

Closed Frequent Itemsets
- An itemset is a closed frequent itemset (shown as shaded ovals in the figure) if it is both frequent and closed.
- With minimum support count = 2: # frequent = 14, # closed frequent = 9.

Frequent Itemsets: Maximal vs Closed
- Every maximal frequent itemset is a closed frequent itemset.
- In the example (minimum support count = 2): # closed frequent = 9, # maximal frequent = 4; the figure distinguishes itemsets that are closed and maximal from those that are closed but not maximal.

Frequent Itemsets: Maximal vs Closed
(Figure: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets.)

Outline
- Association rules & mining
- Finding frequent itemsets
  – Apriori algorithm
  – compact representations of frequent itemsets
  – alternative discovery methods
  – FP-growth algorithm
- Generating association rules
- Evaluating discovered rules

Alternative Methods for Frequent Itemset Generation
- Traversal of the itemset lattice
  – breadth-first: e.g., the Apriori algorithm
  – depth-first: often used to search for maximal frequent itemsets

Alternative Methods for Frequent Itemset Generation
- Depth-first traversal finds maximal frequent itemsets more quickly
- It also enables substantial pruning
  – once abcd is found to be maximal frequent, none of its subsets (e.g., bc) can be maximal, so they can be skipped; but supersets of bc that are not subsets of abcd (e.g., bce) may still be maximal

Alternative Methods for Frequent Itemset Generation
- Traversal of the itemset lattice
  – general-to-specific, specific-to-general, and bidirectional

Alternative Methods for Frequent Itemset Generation
- General-to-specific: extending k-itemsets to (k+1)-itemsets
  – adopted by Apriori; effective when the border (between frequent and infrequent itemsets) is near the top of the lattice

Alternative Methods for Frequent Itemset Generation
- Specific-to-general: shrinking (k+1)-itemsets to k-itemsets
  – often used to find maximal frequent itemsets
  – effective when the border is near the bottom of the lattice

Alternative Methods for Frequent Itemset Generation
- Bidirectional: searching in both directions at the same time
  – needs more space for storing candidates
  – may discover the border quickly in situations like the one shown in panel (c) of the figure

Alternative Methods for Frequent Itemset Generation
- Traverse the lattice by equivalence classes, formed by
  – itemsets with the same number of items, e.g., Apriori
  – or itemsets with the same prefix (e.g., A, B, C, D) or the same suffix
(Figure: equivalence classes sharing prefix A and sharing suffix D.)

Storage Formats for Transactions
- Horizontal: store the set/list of items associated with each transaction (e.g., Apriori)
- Vertical: store the set/list of transactions (a TID-list) associated with each item

Vertical Data Layout
- Computing support = intersecting TID-lists, as sketched below
  – the length of a TID-list shrinks as the size of the itemset grows
  – problem: the initial lists may be too large to fit in memory
- A more compact representation is needed ⇒ FP-growth
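A minimal sketch of support counting in the vertical layout (the TID-lists below are made-up examples): the support count of an itemset is the size of the intersection of its items' TID-lists.

```python
# Vertical layout: each item maps to the set of transaction IDs (TIDs)
# that contain it; intersecting TID-lists yields support counts.
tid_lists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {2, 5, 7, 8, 10},
    "C": {3, 6, 8, 9},
}

def support_count(itemset, tid_lists):
    return len(set.intersection(*(tid_lists[i] for i in itemset)))

print(support_count({"A", "B"}, tid_lists))   # 3 transactions contain A and B
```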

Outline
- Association rules & mining
- Finding frequent itemsets
  – Apriori algorithm
  – compact representations of frequent itemsets
  – alternative discovery methods
  – FP-growth algorithm
- Generating association rules
- Evaluating discovered rules

FP-growth Algorithm: Key Ideas
- Limitations of Apriori
  – it needs to generate a large number of candidates, e.g., 1k frequent 1-itemsets can lead to ~1M candidate 2-itemsets
  – it needs to scan the database repeatedly
- FP-growth: frequent pattern growth
  – discovers frequent itemsets without generating candidates
  – often needs to scan the database only twice

FP-growth Algorithm
- Step 1:
  – scan the database once to discover the frequent 1-itemsets
  – scan the database a second time to build a compact representation of the transactions in the form of an FP-tree
- Step 2:
  – use the constructed FP-tree to recursively find frequent itemsets via divide-and-conquer: turn the problem of finding k-itemsets into a set of subproblems, each finding the k-itemsets that end in a different suffix

FP-tree & Construction
- Transactions are represented by paths from the root
  – items within a transaction are ordered
  – each node is labeled item:count, where count is the number of transactions whose path passes through that node
(Figure: after reading TID=1, the tree is null → A:1 → B:1.)

FP-tree & Construction
- Transactions with common prefixes (e.g., T1 and T3) share the paths for their prefixes.

FP-tree & Construction
- Items are often ordered by decreasing support count
  – s(A) = 8, s(B) = 7, s(C) = 5, s(D) = 5, s(E) = 3 ⇒ the order is A, B, C, D, E

FP-tree & Construction
- Items are often ordered by decreasing support count
  – a pass over the database is needed to obtain these counts

FP-tree & Construction
- With this ordering, transactions tend to share paths
  ⇒ a smaller tree: low branching factor, less bushy

FP-tree Construction
- Nodes for the same item on different paths are connected via node links
(Figure: after reading TID=2, the tree has two branches, e.g., null → A:1 → B:1 and null → B:1 → C:1 → D:1, with a node link connecting the two B nodes.)

FP-tree Construction
- TID=1 and TID=3 share a prefix path
(Figure: after reading TID=3: null → A:2, with children B:1 and C:1 → D:1 → E:1, plus the separate branch B:1 → C:1 → D:1.)

FP-tree Construction: Completed Tree
(Figure: the completed FP-tree for the transaction database, rooted at null with branches A:8 and B:2 and nodes such as B:5, C:3, D:1, E:1; a header table of pointers into the node links is used to assist frequent itemset generation.)
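The construction just described can be sketched compactly in code. This is an illustrative simplification (plain lists of node links stand in for the header table), not the textbook's exact implementation; the three example transactions are assumed from the figures above:

```python
# Sketch of FP-tree construction: items sorted by decreasing support,
# shared prefixes share a path, equal items linked across paths.
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                     # item -> FPNode

def build_fp_tree(transactions, item_order):
    root = FPNode(None, None)
    node_links = {}                            # item -> [FPNode, ...]
    for t in transactions:
        items = sorted((i for i in t if i in item_order),
                       key=item_order.get)     # decreasing-support order
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                node_links.setdefault(item, []).append(child)
            node = node.children[item]
            node.count += 1                    # one more transaction on path
    return root, node_links

# Order from the slides: s(A)=8, s(B)=7, s(C)=5, s(D)=5, s(E)=3 -> A,B,C,D,E
order = {item: rank for rank, item in enumerate("ABCDE")}
root, links = build_fp_tree([{"A", "B"}, {"B", "C", "D"},
                             {"A", "C", "D", "E"}], order)
```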

FP-growth: Finding Frequent Itemsets
- Take the FP-tree generated in step 1 as input
- Find frequent itemsets by common suffix
  – e.g., suppose the order of items is A, B, C, D, E
  – then find itemsets in reverse order: ending with E, then D, then C, …
  – start with the items with the lowest support, so pruning is more likely
- For each particular suffix, find frequent itemsets via divide-and-conquer; e.g., for itemsets ending with E:
  1. obtain the sub-tree of the FP-tree in which all paths end with E
  2. compute the support count of E; if E is frequent,
  3. check and prepare to solve the subproblems DE, CE, BE, AE
  4. recursively compute each subproblem in a similar fashion

FP-growth: Finding Frequent Itemsets
- For each particular suffix, find frequent itemsets via divide-and-conquer; e.g., for itemsets ending with E:
  1. obtain the sub-tree of the FP-tree in which all paths end with E
  2. compute the support count of E; if E is frequent,
  3. transform the sub-tree for suffix E into the conditional FP-tree for suffix E:
     – update the counts
     – remove the leaves (the E nodes)
     – prune infrequent itemsets
  4. use the conditional FP-tree for E to recursively solve DE, CE, BE, AE in the same fashion (as the FP-tree is used to solve E, D, C, B, A)

Traversal of Lattice by FP-growth
(Figure: the itemset lattice over {A, B, C, D, E}; the shaded region groups all itemsets sharing the common suffix E.)

Discover Frequent Itemsets via FP-tree
(Figure: the FP-tree, the sub-tree for suffix E, and the sub-tree for suffix DE, mapped onto the lattice region for the common suffix E.)

Obtaining Sub-Tree for Suffix E
- Find the first path containing E by looking up E in the header table; find the remaining paths by following the node links.

Sub-Tree for Suffix E
- Remove all of E's descendants and all nodes that are not E's ancestors (keeping the E nodes themselves).
- The paths in the resulting sub-tree are the prefix paths ending in E.

Current Position of Processing
(The sub-tree for suffix E has just been obtained from the FP-tree.)

Determine if E is Frequent
- Support count of E = the sum of the counts of all E nodes
  – suppose the minimum support count is 2
  – the support count of E is 3, so E is frequent

Prepare for Solving DE, CE, BE, and AE
- Turn the original sub-tree into the conditional FP-tree for E by:
  1. updating the counts on the prefix paths
  2. removing the E nodes
  3. pruning infrequent itemsets

Update Counts on Prefix Paths
- The counts of the E (leaf) nodes are correct
- The counts of internal nodes may not be correct
  – due to the removal of paths that do not contain E
  – e.g., the path BCD was removed, so the old count of 2 on its shared node is no longer correct

Update Counts on Prefix Paths
- Start from the leaves and go upward:
  – if node X has a single child Y, the count of X = the count of Y
  – otherwise, the count of X = the sum of the counts of X's children

Remove E Nodes
- The E nodes can now be removed:
  – the counts of the internal nodes have already been updated
  – the E nodes carry no further information for solving DE, CE, …

Prune Infrequent Itemsets
- If the sum of the counts of all X nodes is below the minimum support count, remove X
  – since XE cannot be frequent
  – e.g., B is removed here, yielding the conditional FP-tree for E

Current Position of Processing
(The sub-tree for suffix E has been turned into the conditional FP-tree for suffix E: counts updated, the E leaves removed, infrequent itemsets pruned.)

Current Position of Processing
(The conditional FP-tree for suffix E is then used to solve DE, CE, BE, and AE, just as the full FP-tree is used to solve E, D, C, B, A.)

Solving for DE
- From the conditional FP-tree for E, obtain the sub-tree for suffix DE.

Current Position of Processing
(The sub-tree for suffix DE has just been obtained.)

Solving for DE
- Support count of DE = 2 (the sum of the counts of all D nodes)
  – DE is frequent, so we now need to solve CDE, BDE, and ADE

Preparing for Solving CDE, BDE, ADE
- As before: update the counts, remove the leaves (the D nodes), and prune infrequent itemsets, turning the sub-tree for suffix DE into the conditional FP-tree for suffix DE.

Current Position of Processing
(The sub-tree for suffix DE has been turned into the conditional FP-tree for suffix DE; ready to solve CDE, BDE, and ADE.)

Solving CDE, BDE, ADE
- The sub-trees for both CDE and BDE are empty
  – no paths contain item C or B
- Work on ADE: ADE is frequent (support count = 2)
  – there are no further subproblems under ADE, so backtrack
  – and no more subproblems for DE, so backtrack ⇒ solve the next subproblem, CE

Current Position of Processing
(About to solve suffix CE.)

Solving for Suffix CE
- CE is frequent (support count = 2)
- There are no further subproblems for CE (its conditional FP-tree is empty; why?), so we are done with CE
- Check the next subproblem: BE

Solving for Suffix BE
- The sub-tree for BE is empty (no path in the conditional FP-tree for E contains B)
- Check AE: there are paths containing A

Current Position of Processing
(Preparing to solve suffix AE.)

Solving for Suffix AE
- AE is frequent (support count = 2)
- Done with AE, backtrack; done with E, backtrack ⇒ solve the next subproblem: suffix D

Current Position of Processing
(Ready to solve suffix D.)

Found Frequent Itemsets with Suffix E
- E, DE, ADE, CE, and AE were discovered, in this order
- Ready to solve suffix D

Outline
- Association rules & mining
- Finding frequent itemsets
  – Apriori algorithm
  – compact representations of frequent itemsets
  – alternative discovery methods
  – FP-growth algorithm
- Generating association rules
- Evaluating discovered rules

Rule Generation
- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
  – if {A,B,C,D} is a frequent itemset, the candidate rules are:
    ABC→D, ABD→C, ACD→B, BCD→A, A→BCD, B→ACD, C→ABD, D→ABC,
    AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB
- If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
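The 2^k − 2 candidate rules are exactly the non-empty proper subsets of L used as antecedents, which is simple to enumerate (an illustrative sketch):

```python
# Enumerate all candidate rules lhs -> rhs from one frequent itemset.
from itertools import combinations

def candidate_rules(itemset):
    items = sorted(itemset)
    for r in range(1, len(items)):               # antecedent sizes 1..k-1
        for lhs in combinations(items, r):
            rhs = tuple(i for i in items if i not in lhs)
            yield lhs, rhs

rules = list(candidate_rules({"A", "B", "C", "D"}))
print(len(rules))                                 # 2**4 - 2 = 14
```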

Rule Generation
- How do we efficiently generate rules from frequent itemsets?
  – in general, confidence does not have an anti-monotone property
  – e.g., c(ABC→D) can be larger or smaller than c(AB→D)

Rule Generation
- But the confidence of rules generated from the same itemset does have an anti-monotone property
  – e.g., for L = {A,B,C,D}: c(BCD→A) ≥ c(CD→AB) ≥ c(D→ABC)
- Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
  – more items on the right ⇒ lower or equal confidence

Rule Generation for Apriori Algorithm
(Figure: the lattice of rules generated from a frequent itemset; once a low-confidence rule is found, all rules below it in the lattice are pruned.)

Rule Generation for Apriori Algorithm
- A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
- E.g., join(CD→AB, BD→AC) produces the candidate rule D→ABC
- Prune rule D→ABC if its subset AD→BC does not have high confidence

Outline
- Association rules & mining
- Finding frequent itemsets
  – Apriori algorithm
  – compact representations of frequent itemsets
  – alternative discovery methods
  – FP-growth algorithm
- Generating association rules
- Evaluating discovered rules

Pattern Evaluation
- Association rule algorithms tend to produce too many rules
  – many of them are uninteresting
- In the original formulation of association rules, only support and confidence are used
- Additional interestingness measures can be used to prune or rank the derived patterns

Computing Interestingness Measures
- Given a rule X → Y, the information needed to compute the rule's interestingness can be obtained from a contingency table.

Contingency table for X → Y:

            Y      ¬Y
  X        f11    f10    f1+
  ¬X       f01    f00    f0+
           f+1    f+0    |T|

  – f11: support count of X and Y
  – f10: support count of X and ¬Y
  – f01: support count of ¬X and Y
  – f00: support count of ¬X and ¬Y
- These counts are used to define various measures: support, confidence, lift, cosine, Jaccard coefficient, etc.

Computing Interestingness Measures
- Given a rule X → Y, confidence can be read directly off the contingency table:
  confidence(X → Y) = f11 / (f11 + f10) = f11 / f1+

Drawback of Confidence

            Coffee   ¬Coffee
  Tea          15        5      20
  ¬Tea         75        5      80
               90       10     100

Association rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 15/20 = 0.75
But P(Coffee) = 0.9, and P(Coffee | ¬Tea) = 75/80 = 0.9375 > 0.75
⇒ Although the confidence is high, the rule is misleading

Computing Interestingness Measures
- confidence(X → Y) is based on P(X,Y) and P(X), but we also need to take P(Y) into account
⇒ we need to measure the correlation between X and Y

Measuring Correlation
- Population of 1000 students
  – 600 students know how to swim (S)
  – 700 students know how to bike (B)
  – 420 students know how to swim and bike (S, B)
  – P(S ∧ B) = 420/1000 = 0.42
  – P(S) × P(B) = 0.6 × 0.7 = 0.42
  – P(S ∧ B) = P(S) × P(B) ⇒ independent
  – P(S ∧ B) > P(S) × P(B) ⇒ positively correlated
  – P(S ∧ B) < P(S) × P(B) ⇒ negatively correlated

Measuring Correlation
- For the same population of 1000 students, define lift(S, B) = P(S ∧ B) / (P(S) × P(B)). Then:
  – P(S ∧ B) = P(S) × P(B) ⇔ lift = 1 ⇔ S and B are independent
  – P(S ∧ B) > P(S) × P(B) ⇔ lift > 1 ⇔ positively correlated
  – P(S ∧ B) < P(S) × P(B) ⇔ lift < 1 ⇔ negatively correlated

Measures of Correlation
- The larger the value, usually the more likely the two variables/events are correlated
- Such measures compare the joint probability P(X,Y) against what would be expected under independence, P(X) × P(Y)

Computing Lift from Contingency Table
- From the contingency table for X → Y:
  lift(X, Y) = (f11/N) / ((f1+/N) × (f+1/N)) = (N × f11) / (f1+ × f+1)

Computing Lift from Contingency Table

            Coffee   ¬Coffee
  Tea          15        5      20
  ¬Tea         75        5      80
               90       10     100

Association rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9
⇒ Lift = 0.75 / 0.9 = 0.833 (< 1, therefore Tea and Coffee are negatively associated)

Drawback of Lift

            Y    ¬Y                        Y    ¬Y
  X        10     0    10        X        90     0    90
  ¬X        0    90    90        ¬X        0    10    10
           10    90   100                 90    10   100

  Lift = (100 × 10)/(10 × 10) = 10        Lift = (100 × 90)/(90 × 90) ≈ 1.11

Statistical independence: if P(X,Y) = P(X) × P(Y), then lift = 1.
Yet lift = 10 when X and Y seldom occur together, and only 1.11 when they frequently occur together.

Computing Cosine from Contingency Table
- From the contingency table for X → Y:
  cosine(X, Y) = P(X,Y) / √(P(X) × P(Y)) = f11 / √(f1+ × f+1)

Cosine vs Lift
- If X and Y are independent, lift(X,Y) = 1 but cosine(X,Y) = √(P(X) × P(Y))
- Cosine does not depend on N (and f00), unlike lift

Computing Jaccard from Contingency Table
- From the contingency table for X → Y:
  Jaccard(X, Y) = f11 / (f1+ + f+1 − f11)
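All of the measures above can be computed from the four cells of the contingency table. A small illustrative sketch, checked against the Tea → Coffee example (f11 = 15, f10 = 5, f01 = 75, f00 = 5):

```python
# Interestingness measures from a 2x2 contingency table.
import math

def measures(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    f1p, fp1 = f11 + f10, f11 + f01              # row/column totals
    return {
        "support":    f11 / n,
        "confidence": f11 / f1p,
        "lift":       n * f11 / (f1p * fp1),
        "cosine":     f11 / math.sqrt(f1p * fp1),
        "jaccard":    f11 / (f1p + fp1 - f11),
    }

m = measures(15, 5, 75, 5)                       # Tea -> Coffee
print(m["confidence"])                           # 0.75
print(m["lift"])                                 # 0.833... (< 1: negative)
```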

Property under Null Addition
- Null addition: add s transactions that contain neither X nor Y (f00 → f00 + s, N → N + s; all other cells unchanged)
- Invariant measures: confidence, cosine, Jaccard, etc.
- Non-invariant measures: support (f11/N vs f11/(N+s)), lift, etc.

Property under Variable Permutation
- Does M(A,B) = M(B,A)? If yes, M is symmetric; otherwise, asymmetric.
- E.g., does c(A → B) = c(B → A)?

Property under Variable Permutation
- Does M(A,B) = M(B,A)? If yes, M is symmetric; otherwise, asymmetric.
- Does lift(A → B) = lift(B → A)?

Property under Variable Permutation
- Symmetric measures: support, lift, cosine, Jaccard, etc.
- Asymmetric measures: confidence, etc.

Applying Interestingness Measures
- An association rule (A → B) is directional
  – confidence (asymmetric) is intuitive
  – but confidence does not capture correlation
⇒ use additional measures such as lift to rank the discovered rules and further prune those with low ranks