
1 Data Mining Association Analysis: Basic Concepts and Algorithms Based on Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

2 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2 Outline l Association rules & mining l Finding frequent itemsets –Apriori algorithm –compact representations of frequent itemsets –alternative discovery methods –FP-growth algorithm l Generating association rules l Evaluating discovered rules

3 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 3 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction (figure: market-basket transactions) l Examples of association rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk} l Implication here means co-occurrence, not causality!

4 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 4 Definition: Frequent Itemset l Itemset –A collection of one or more items, e.g., {Milk, Bread, Diaper} –k-itemset: an itemset that contains k items l Support count (σ) –Frequency of occurrence of an itemset –E.g., σ({Milk, Bread, Diaper}) = 2 l Support (s) –Fraction of transactions that contain the itemset –E.g., s({Milk, Bread, Diaper}) = 2/5 l Frequent Itemset –An itemset whose support is greater than or equal to a minsup threshold

5 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 5 Definition: Association Rule l Association Rule –An implication expression of the form X → Y, where X and Y are disjoint itemsets –Example: {Milk, Diaper} → {Beer} l Rule Evaluation Metrics –Support (s): fraction of transactions that contain both X and Y, i.e., s(X → Y) = σ(X ∪ Y) / |T| –Confidence (c): how often items in Y appear in transactions that contain X, i.e., c(X → Y) = σ(X ∪ Y) / σ(X) –Example: for {Milk, Diaper} → {Beer}, s = 2/5 = 0.4 and c = 2/3 ≈ 0.67
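
To make the two metrics concrete, here is a small Python sketch. The five-transaction market-basket table is reconstructed from the counts quoted on these slides (σ({Milk, Bread, Diaper}) = 2, and s = 0.4, c ≈ 0.67 for {Milk, Diaper} → {Beer}); the helper names are ours, not from the textbook.

    # A minimal sketch of support and confidence over the five-transaction
    # market-basket table reconstructed from the counts quoted on these slides.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        """sigma(itemset): number of transactions containing every item of the itemset."""
        return sum(1 for t in transactions if itemset <= t)

    def support(itemset, transactions):
        """s(itemset): fraction of transactions containing the itemset."""
        return support_count(itemset, transactions) / len(transactions)

    def confidence(lhs, rhs, transactions):
        """c(lhs -> rhs) = sigma(lhs | rhs) / sigma(lhs)."""
        return support_count(lhs | rhs, transactions) / support_count(lhs, transactions)

    print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
    print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...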

6 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 6 Association Rule Mining Task l Given a set of transactions T, the goal of association rule mining is to find all rules having –support ≥ minsup threshold –confidence ≥ minconf threshold l Brute-force approach: –List all possible association rules –Compute the support and confidence for each rule –Prune rules that fail the minsup and minconf thresholds  Computationally prohibitive!

7 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 7 Computational Complexity l Given d unique items: –Total number of itemsets = 2^d –Total number of possible association rules: R = 3^d − 2^(d+1) + 1 –If d = 6, R = 602 rules
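
As a sanity check of the rule count, the brute-force enumeration below (an illustrative sketch, not part of the slides) counts every rule with a non-empty antecedent and a non-empty, disjoint consequent, and reproduces R = 602 for d = 6.

    # Brute-force check of R = 3**d - 2**(d + 1) + 1 for small d (illustrative).
    from itertools import combinations

    def count_rules(d):
        items = range(d)
        count = 0
        for k in range(1, d + 1):                    # size of the antecedent X
            for X in combinations(items, k):
                rest = [i for i in items if i not in X]
                for m in range(1, len(rest) + 1):    # size of the consequent Y
                    count += len(list(combinations(rest, m)))
        return count

    for d in range(1, 7):
        assert count_rules(d) == 3**d - 2**(d + 1) + 1
    print(count_rules(6))  # 602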

8 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 8 Mining Association Rules l Example of rules: {Milk, Diaper} → {Beer} (s=0.4, c=0.67); {Milk, Beer} → {Diaper} (s=0.4, c=1.0); {Diaper, Beer} → {Milk} (s=0.4, c=0.67); {Beer} → {Milk, Diaper} (s=0.4, c=0.67); {Diaper} → {Milk, Beer} (s=0.4, c=0.5); {Milk} → {Diaper, Beer} (s=0.4, c=0.5) l Observations: –All the above rules are binary partitions of the same itemset, {Milk, Diaper, Beer} –Rules originating from the same itemset have identical support but can have different confidence –Thus, we may decouple the support and confidence requirements

9 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 9 Mining Association Rules l Two-step approach: 1.Frequent Itemset Generation –generate all itemsets whose support ≥ minsup 2.Rule Generation –generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset l Frequent itemset generation is still computationally expensive

10 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 10 Frequent Itemset Generation l Given d items, there are 2^d possible candidate itemsets (figure: itemset lattice)

11 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 11 Frequent Itemset Generation l Brute-force approach: –Each itemset in the lattice is a candidate frequent itemset –Count the support of each candidate by scanning the database –Match each transaction against every candidate –Complexity ~ O(MNw), where M = number of candidates, N = number of transactions, and w = maximum transaction width ⇒ expensive, since M = 2^d !!!

12 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 12 Outline l Association rules & mining l Finding frequent itemsets –Apriori algorithm –compact representations of frequent itemsets –alternative discovery methods –FP-growth algorithm l Generating association rules l Evaluating discovered rules

13 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 13 Frequent Itemset Generation Strategies l Reduce the number of candidates (M) –Complete search: M = 2^d –Use pruning techniques to reduce M l Reduce the number of comparisons (MN) –Use efficient data structures to store the candidates or transactions –No need to match every candidate against every transaction l Reduce the number of transactions (N) –A transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset

14 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 14 Reducing Number of Candidates l Apriori principle: –If an itemset is frequent, then all of its subsets must also be frequent –Or equivalently, if an itemset is NOT frequent, then none of its supersets can be frequent l The Apriori principle holds due to the following property of support: –The support of an itemset never exceeds the support of its subsets –This is known as the anti-monotone property of support: support never increases as more items are added to the set

15 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 15 Illustrating the Apriori Principle (figure: itemset lattice in which an itemset found to be infrequent has all of its supersets pruned)

16 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 16 Apriori Algorithm l Method: 1.Let k = 1 2.Generate the frequent itemsets of length 1 3.Repeat until no new frequent itemsets are identified a) Generate candidate (k+1)-itemsets by joining pairs of frequent k-itemsets that differ in only one item b) Prune candidate itemsets that contain an infrequent subset of length k c) Count the support of each remaining candidate by scanning the DB d) Eliminate candidates that are infrequent, leaving only those that are frequent
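
The method above translates almost line by line into the following illustrative Python sketch (candidate generation, subset-based pruning, support counting). It is a teaching sketch, not the textbook's implementation: the hash-tree optimization discussed later is omitted, and it reuses the five-basket table reconstructed earlier.

    # Illustrative Apriori sketch following steps 1-3 above (not the textbook's
    # optimized implementation; hash-tree support counting is omitted).
    from itertools import combinations

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        return sum(1 for t in transactions if set(itemset) <= t)

    def apriori(transactions, minsup_count):
        items = sorted({i for t in transactions for i in t})
        # Steps 1-2: k = 1, frequent 1-itemsets
        Fk = [(i,) for i in items
              if support_count((i,), transactions) >= minsup_count]
        frequent, k = list(Fk), 1
        # Step 3: repeat until no new frequent itemsets are identified
        while Fk:
            Fk_set = set(Fk)
            # 3.a: join frequent k-itemsets that agree on their first k-1 items
            candidates = {tuple(sorted(set(a) | set(b)))
                          for a, b in combinations(sorted(Fk), 2)
                          if a[:-1] == b[:-1]}
            # 3.b: prune candidates containing an infrequent k-subset
            candidates = [c for c in candidates
                          if all(s in Fk_set for s in combinations(c, k))]
            # 3.c + 3.d: count supports, keep only the frequent candidates
            Fk = sorted(c for c in candidates
                        if support_count(c, transactions) >= minsup_count)
            frequent.extend(Fk)
            k += 1
        return frequent

    print(apriori(transactions, minsup_count=3))  # frequent 1- and 2-itemsets for this table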

17 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 17 Generating Frequent 1-Itemsets l Consider a set of transactions T with 6 unique items –suppose the minimum support count = 3 l Apriori makes a first pass through T to obtain support counts for the candidate 1-itemsets (figure: candidate 1-itemsets → obtain support counts → Coke & Eggs turn out to be infrequent → prune infrequent itemsets → frequent 1-itemsets)

18 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 18 Generating Frequent 2-Itemsets l Step 3.a: generate candidate 2-itemsets from the frequent 1-itemsets l Step 3.b: no pruning, since no candidate has an infrequent subset l Step 3.c: obtain support counts; {Bread, Beer} & {Milk, Beer} turn out to be infrequent l Step 3.d: prune the infrequent itemsets, leaving the frequent 2-itemsets

19 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 19 Generating Frequent 3-Itemsets l Step 3.a: generate candidate 3-itemsets from the frequent 2-itemsets l Step 3.b: prune candidates with infrequent subsets; e.g., {Beer, Bread, Diaper} is pruned since {Beer, Bread} is infrequent l Step 3.c: obtain support counts l Step 3.d: no pruning due to insufficient support; the surviving candidates are the frequent 3-itemsets l Number of candidates considered –without pruning: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 (there are C(6,1) candidates for frequent 1-itemsets) –with pruning: C(6,1) + C(4,2) + 3 = 6 + 6 + 3 = 15

20 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 20 Reducing Number of Comparisons l Candidate counting: –Scan the database of transactions to determine the support of each candidate itemset –To reduce the number of comparisons, store the candidates in a hash structure  Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets

21 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 21 Generate Hash Tree l Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} l Items in itemsets & transactions are ordered (in the same way) l Hash function: h(x) = x mod 3, so items 1, 4, 7 hash to one branch, 2, 5, 8 to another, and 3, 6, 9 to the third; hash on the first item at the root, on the second item one level down, and on the third item at the next level l Max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node); here max leaf size = 3 (figure: the resulting candidate hash tree)
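
The sketch below builds such a hash tree for the 15 candidates, with h(x) = x mod 3 and max leaf size 3. The Node class and insertion logic are our own simplified reconstruction, and the exact leaf layout may differ from the figure on the next slides.

    # Illustrative hash-tree construction for the 15 candidate 3-itemsets above
    # (a simplified sketch, not the textbook's exact data structure).
    MAX_LEAF = 3

    def h(item):
        return item % 3          # 1,4,7 -> 1; 2,5,8 -> 2; 3,6,9 -> 0

    class Node:
        def __init__(self, depth=0):
            self.depth = depth       # which item position is hashed at this node
            self.children = {}       # bucket -> Node (non-empty for internal nodes)
            self.itemsets = []       # candidates stored here (leaf nodes)

        def insert(self, itemset):
            if self.children:                      # internal node: route downward
                bucket = h(itemset[self.depth])
                self.children.setdefault(bucket, Node(self.depth + 1)).insert(itemset)
                return
            self.itemsets.append(itemset)          # leaf node
            # split an overfull leaf, as long as there are items left to hash on
            if len(self.itemsets) > MAX_LEAF and self.depth < len(itemset):
                stored, self.itemsets = self.itemsets, []
                for s in stored:
                    bucket = h(s[self.depth])
                    self.children.setdefault(bucket, Node(self.depth + 1)).insert(s)

    candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6),
                  (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]
    root = Node()
    for c in candidates:
        root.insert(c)

    def leaves(node):
        if node.children:
            return [l for child in node.children.values() for l in leaves(child)]
        return [node.itemsets]

    print(len(leaves(root)))   # number of leaf nodes produced by this insertion order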

22 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 22 Association Rule Discovery: Hash tree (figure: candidate hash tree for the 15 itemsets; at the root, itemsets whose first item is 1, 4, or 7 are hashed to the left branch)

23 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 23 Association Rule Discovery: Hash tree (figure: within that branch, itemsets are further hashed on their second item into sub-branches for 1, 4, or 7; for 2, 5, or 8; and for 3, 6, or 9)

24 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 24 Association Rule Discovery: Hash tree (figure: itemsets whose first item is 2, 5, or 8 are hashed to the middle branch)

25 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 25 Association Rule Discovery: Hash tree (figure: itemsets whose first item is 3, 6, or 9 are hashed to the right branch)

26 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 26 Association Rule Discovery: Hash tree (figure: a leaf that exceeds the max leaf size is split by hashing on the second item)

27 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 27 Association Rule Discovery: Hash tree (figure: the leaf holding {3 5 6}, {3 5 7}, {6 8 9} is not split further since it holds no more than the max leaf size; splitting it by the third item would give three separate leaves)

28 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 28 Subset Operation Given a transaction t, what are the possible subsets of size 3?
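
Concretely, the size-3 subsets of a transaction can be listed with itertools.combinations; for the transaction t = {1, 2, 3, 5, 6} used on the following slides there are C(5, 3) = 10 of them.

    # Enumerating the candidate 3-subsets of a transaction (illustrative).
    from itertools import combinations

    t = (1, 2, 3, 5, 6)                  # the transaction used on the next slides
    subsets = list(combinations(t, 3))   # C(5, 3) = 10 subsets
    print(subsets)
    # [(1, 2, 3), (1, 2, 5), (1, 2, 6), (1, 3, 5), (1, 3, 6),
    #  (1, 5, 6), (2, 3, 5), (2, 3, 6), (2, 5, 6), (3, 5, 6)]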

29 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 29 Subset Operation Using Hash Tree (figure: matching transaction {1, 2, 3, 5, 6} against the candidate hash tree; at the root, the transaction is split into 1 + {2 3 5 6}, 2 + {3 5 6}, and 3 + {5 6}, and each part is hashed on its first item)

30 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 30 Subset Operation Using Hash Tree (figure: at the next level, 1 + {2 3 5 6} is expanded into 1 2 + {3 5 6}, 1 3 + {5 6}, and 1 5 + {6}, each hashed on its second item)

31 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 31 Subset Operation Using Hash Tree (figure: in total, 5 leaves are visited and 9 candidates are compared against the ten 3-subsets of the transaction: 123, 125, 126, 135, 136, 156, 235, 236, 256, 356)

32 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 32 Factors Affecting Complexity of Apriori l Choice of minimum support threshold –lowering support threshold results in more frequent itemsets –this may increase (1) number of candidates and (2) max length of frequent itemsets l Dimensionality (number of unique items) of the data set –more candidates & space needed to store their support counts –if number of frequent items also increases, both computation and I/O costs may also increase l Size of database –since Apriori makes multiple passes, run time of algorithm may increase with number of transactions l Average transaction width –transaction width increases with denser data sets –this may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)

33 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 33 Outline l Association rules & mining l Finding frequent itemsets –Apriori algorithm –compact representations of frequent itemsets –alternative discovery methods –FP-growth algorithm l Generating association rules l Evaluating discovered rules

34 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 34 Compact Representation of Frequent Itemsets l Itemsets with positive support counts –3 groups: {A1, …, A10}, {B1, …, B10}, {C1, …, C10} –no transactions containing items from different groups l All subsets of items within the same group have the same support count –e.g., support counts of {A1}, {A1, A2}, {A1, A2, A3} all = 5

35 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 35 Compact Representation of Frequent Itemsets l Suppose the minimum support count is 3 l Number of frequent itemsets: 3 × (C(10,1) + C(10,2) + … + C(10,10)) = 3 × (2^10 − 1) = 3069 ⇒ we need a compact representation

36 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 36 Maximal Frequent Itemset l An itemset is maximal frequent if it is frequent but none of its immediate supersets is frequent (figure: itemset lattice showing the border between frequent and infrequent itemsets, with the maximal frequent itemsets adjacent to the border on the frequent side)

37 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 37 Maximal Frequent Itemset l Every frequent itemset is either maximal or a subset of some maximal frequent itemset

38 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 38 Properties of Maximal Frequent Itemsets l If level k is the highest level of the lattice that contains maximal frequent itemsets, then level k contains only maximal frequent itemsets or infrequent itemsets (figure: lattice levels 0 through 5 with the border marked)

39 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 39 Properties of Maximal Frequent Itemsets l Furthermore, level k + 1 (if it exists) contains only infrequent itemsets

40 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 40 Maximal Frequent Itemset l Knowing the maximal frequent itemsets ⇒ we know all frequent itemsets, but not their supports

41 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 41 Closed Itemset l An itemset is closed if none of its immediate supersets has the same support as the itemset (figure: itemset lattice over {A, B, C, D, E} annotated with support counts; the closed itemsets are marked, and itemsets with support count 0 are not supported by any transaction)

42 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 42 Properties of Closed Itemsets l Every itemset is either a closed itemset or has the same support as one of its immediate supersets

43 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 43 Properties of Closed Itemsets l If X is not closed, X has the same support as the immediate superset of X with the largest support l Suppose Y1, …, Yk are X's immediate supersets and Yi has the largest support among them l By the anti-monotone property of support, σ(X) ≥ σ(Y1), σ(X) ≥ σ(Y2), …, σ(X) ≥ σ(Yk), so σ(X) ≥ max{σ(Y1), …, σ(Yk)} = σ(Yi) l Since X is not closed, some immediate superset of X has the same support as X; combined with the inequality above, this gives σ(X) = σ(Yi)

44 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 44 Properties of Closed Itemsets l Suppose the highest level of the lattice is level k –the itemset at level k (there is only one k-itemset) must be closed, since it has no supersets

45 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 45 Properties of Closed Itemsets l We can compute the support counts of all itemsets from the support counts of the closed itemsets

46 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 46 Properties of Closed Itemsets l Compute the supports of itemsets at level k from those at level k + 1: if an itemset at level k is not closed, its support equals the largest support among its immediate supersets at level k + 1

47 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 47 Closed Frequent Itemsets l An itemset is a closed frequent itemset (shown in shaded ovals in the figure) if it is both frequent & closed (figure: the same lattice with minimum support count = 2; # frequent = 14, # closed frequent = 9)

48 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 48 Frequent Itemsets: Maximal vs Closed l Every maximal frequent itemset is a closed frequent itemset (figure: the same lattice with minimum support count = 2, marking itemsets that are closed and maximal vs closed but not maximal; # closed frequent = 9, # maximal frequent = 4)
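
Both definitions can be checked mechanically from a table of support counts. The sketch below classifies itemsets as frequent, closed, and maximal given such a table; the tiny support dictionary is a made-up example (not the lattice in the figure), and the helper names are ours.

    # Classifying itemsets as frequent, closed, and maximal from support counts
    # (a sketch; the support table below is a small made-up example that lists
    # every non-empty itemset over the universe {A, B, C}).
    support = {
        ("A",): 4, ("B",): 3, ("C",): 3,
        ("A", "B"): 3, ("A", "C"): 2, ("B", "C"): 2,
        ("A", "B", "C"): 2,
    }
    universe = {"A", "B", "C"}
    minsup = 2

    def immediate_supersets(itemset, universe):
        return [tuple(sorted(set(itemset) | {x})) for x in universe if x not in itemset]

    frequent = {s for s, c in support.items() if c >= minsup}
    closed = {s for s in support
              if all(support.get(sup, 0) < support[s]
                     for sup in immediate_supersets(s, universe))}
    maximal = {s for s in frequent
               if all(sup not in frequent for sup in immediate_supersets(s, universe))}

    print(sorted(closed & frequent))  # [('A',), ('A', 'B'), ('A', 'B', 'C'), ('C',)]
    print(sorted(maximal))            # [('A', 'B', 'C')]  (a subset of the closed frequent itemsets)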

49 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 49 Frequent Itemsets: Maximal vs Closed (figure: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets)

50 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 50 Outline l Association rules & mining l Finding frequent itemsets –Apriori algorithm –compact representations of frequent itemsets –alternative discovery methods –FP-growth algorithm l Generating association rules l Evaluating discovered rules

51 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 51 Alternative Methods for Frequent Itemset Generation l Traversal of itemset lattice –Breadth-first: e.g., the Apriori algorithm –Depth-first: often used to search for maximal frequent itemsets (figure: breadth-first vs depth-first traversal of the lattice)

52 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 52 Alternative Methods for Frequent Itemset Generation l Depth-first search often finds maximal frequent itemsets more quickly l It also enables substantial pruning –once abcd is found to be maximal frequent, none of its subsets (e.g., bc) can be maximal, so they need not be examined; however, supersets of bc that are not subsets of abcd (e.g., bce) may still be maximal

53 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 53 Alternative Methods for Frequent Itemset Generation l Traversal of itemset lattice –General-to-specific, specific-to-general, & bidirectional

54 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 54 Alternative Methods for Frequent Itemset Generation l General-to-specific: going from k to k+1-itemsets –Adopted by Apriori, effective when border (of frequent and infrequent itemsets) is near the top of lattice

55 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 55 Alternative Methods for Frequent Itemset Generation l Specific-to-general: going from k+1 to k-itemsets –often used to find maximal frequent itemsets –effective when border is near the bottom of lattice

56 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 56 Alternative Methods for Frequent Itemset Generation l Bidirectional: going in two directions at same time –need more space for storing candidates –may quickly discover border in situation shown in (c)

57 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 57 Alternative Methods for Frequent Itemset Generation l Traverse the lattice by equivalence classes, formed by –itemsets having the same number of items (e.g., Apriori) –or itemsets having the same prefix (e.g., A, B, C, D) or the same suffix (figure: equivalence classes defined by the prefix A and by the suffix D)

58 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 58 Storage Formats for Transactions l Horizontal: store a set/list of items associated with each transaction (e.g., Apriori) l Vertical: store a set/list of transactions (a TID-list) associated with each item

59 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 59 Vertical Data Layout l Computing support = intersecting TID-lists –the length of a TID-list shrinks as the size of the itemset grows –problem: the initial lists may be too large to fit in memory l Need a more compact representation → FP-growth
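
A minimal sketch of the vertical layout: build a TID-list per item and obtain the support count of an itemset by intersecting the lists. The five-transaction table is again the one reconstructed from the earlier slides, with TIDs 1-5 assumed.

    # Vertical (TID-list) layout: support counting by intersecting TID-lists.
    transactions = {
        1: {"Bread", "Milk"},
        2: {"Bread", "Diaper", "Beer", "Eggs"},
        3: {"Milk", "Diaper", "Beer", "Coke"},
        4: {"Bread", "Milk", "Diaper", "Beer"},
        5: {"Bread", "Milk", "Diaper", "Coke"},
    }

    tidlists = {}                        # item -> set of TIDs containing it
    for tid, items in transactions.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)

    def support_count(itemset):
        """Intersect the TID-lists of all items in the itemset."""
        return len(set.intersection(*(tidlists[i] for i in itemset)))

    print(tidlists["Diaper"])                          # {2, 3, 4, 5}
    print(support_count({"Milk", "Diaper", "Beer"}))   # 2 (TIDs 3 and 4)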

60 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 60 Outline l Association rules & mining l Finding frequent itemsets –Apriori algorithm –compact representations of frequent itemsets –alternative discovery methods –FP-growth algorithm l Generating association rules l Evaluating discovered rules

61 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 61 FP-growth Algorithm: Key Ideas l Limitations of Apriori –it needs to generate a large number of candidates, e.g., 1,000 frequent 1-itemsets give about half a million (C(1000, 2)) candidate 2-itemsets –it needs to scan the database repeatedly l FP-growth: frequent pattern growth –discovers frequent itemsets without generating candidates –often needs to scan the database only twice

62 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 62 FP-growth Algorithm l Step 1: –scan database to discover frequent 1-itemsets –scan database the second time to build a compact representation of transactions in form of FP-tree l Step 2: –use the constructed FP-tree to recursively find frequent itemsets via divide-and-conquer: turn problem of finding k-itemsets into a set of subproblems, each finding k-itemsets ending in a different suffix

63 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 63 FP-tree & Construction l Transactions are represented by paths from the root –items within a transaction are ordered –a node is labeled item:count, the number of transactions whose path passes through that node (figure: after reading TID=1, the tree is null → A:1 → B:1)

64 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 64 FP-tree & Construction l Transactions are represented by paths from the root –transactions with common prefixes (e.g., T1 and T3) share the paths for their prefixes

65 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 65 FP-tree & Construction l Items are often ordered in decreasing support counts –s(A) = 8, s(B) = 7, s(C) = 5, s(D) = 5, s(E) = 3 → the order is A, B, C, D, E

66 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 66 FP-tree & Construction l Items are often ordered in decreasing support counts –a pass over the database is needed to obtain these counts

67 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 67 FP-tree & Construction l Items are often ordered in decreasing support counts –with this order, transactions tend to share paths → a smaller tree: low branching factor, less bushy

68 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 68 FP-tree Construction l Nodes for the same item on different paths are connected via node links (figure: after reading TID=2, a second branch null → B:1 → C:1 → D:1 is added and the two B nodes are connected by a node link)

69 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 69 FP-tree Construction l Shared paths between TID=1 and TID=3 (figure: after reading TID=3, the count of A becomes 2 and a new path C:1 → D:1 → E:1 is added under A)

70 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 70 FP-tree Construction: Completed Tree (figure: the completed FP-tree for the transaction database, together with a header table; node-link pointers are used to assist frequent itemset generation)
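
The two passes described above can be sketched compactly in Python: pass 1 computes the global item counts used for ordering, and pass 2 inserts each transaction as a path, sharing prefixes and threading node links through a header table. This is an illustrative reconstruction, not the full FP-growth code; the example transactions in the usage line are read off the partial trees in the preceding figures and are an assumption.

    # Minimal FP-tree construction sketch (two passes, as described above).
    from collections import Counter, defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}               # item -> FPNode

    def build_fptree(transactions, min_count):
        # Pass 1: global support counts, used to order items inside transactions.
        counts = Counter(i for t in transactions for i in t)
        frequent = {i for i, c in counts.items() if c >= min_count}

        root = FPNode(None, None)
        header = defaultdict(list)           # header table: item -> node links

        # Pass 2: insert each transaction as a path of decreasing-support items.
        for t in transactions:
            items = sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i))
            node = root
            for item in items:
                child = node.children.get(item)
                if child is None:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header[item].append(child)    # thread the node link
                child.count += 1
                node = child
        return root, header

    # Example usage with three transactions read off the figures above (an assumption):
    tree, header = build_fptree([{"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}],
                                min_count=1)
    print({item: len(nodes) for item, nodes in header.items()})
    # {'A': 1, 'B': 2, 'C': 2, 'D': 2, 'E': 1}: e.g. item B occurs on two different paths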

71 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 71 FP-growth: Finding Frequent Itemsets l Take FP-tree generated in step 1 as input l Find frequent itemsets with common suffixes –e.g., suppose the order of items is: A, B, C, D, E –first find itemsets in reverse order: ending with E, then, D, C, … –start with items w/ fewer supports, so pruning is more likely l For each particular suffix, find frequent itemsets via divide-and-conquer, e.g., consider itemsets ending w/ E 1.obtain sub-tree of FP-tree where all paths end with E 2.compute support counts of E; if E is frequent, 3.check & prepare to solve subproblems: DE, CE, BE, AE 4.recursively compute each subproblem in a similar fashion

72 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 72 FP-growth: Finding Frequent Itemsets l For each particular suffix, find frequent itemsets via divide-and-conquer, e.g., consider itemsets ending w/ E 1.obtain sub-tree of FP-tree where all paths end with E 2.compute support counts of E; if E is frequent, 3.transform subtree for suffix E into conditional FP-tree for suffix E – update count – remove leaves (E’s) – prune infrequent itemsets 4.Use conditional FP-tree for E to recursively solve: DE, CE, BE, AE in a similar fashion (as using FP-tree to solve E, D, C, B, A)

73 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 73 Traversal of Lattice by FP-growth (figure: itemset lattice over {A, B, C, D, E}; FP-growth first visits all itemsets sharing the common suffix E, then those ending in D, C, B, and A)

74 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 74 Discover Frequent Itemsets via FP-tree (figure: the FP-tree is projected into a sub-tree for suffix E, which is in turn projected into a sub-tree for suffix DE, and so on)

75 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 75 Obtaining Sub-Tree for Suffix E l Find the first path containing E by looking up E in the header table; find the rest by following E's node links (figure: FP-tree with header table)

76 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 76 Sub-Tree for Suffix E l Remove all of E's descendants & all nodes that are not ancestors of an E node l The paths in the sub-tree are the prefix paths ending in E (figure: sub-tree for suffix E)

77 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 77 Current Position of Processing (figure: the sub-tree for suffix E has just been obtained from the FP-tree)

78 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 78 Determine if E is Frequent l Support count of E = sum of the counts of all E nodes –suppose the minimum support count is 2 –support count of E = 3, so E is frequent (figure: sub-tree for suffix E)

79 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 79 Prepare for Solving DE, CE, BE, and AE l Turn the original sub-tree into a conditional FP-tree by 1.updating the counts on the prefix paths 2.removing the E nodes 3.pruning infrequent itemsets (figure: the original sub-tree for suffix E)

80 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 80 Update Counts on Prefix Paths l Counts for the E (leaf) nodes are correct l Counts for internal nodes may not be correct –due to the removal of paths that do not contain E (figure note: path BCD was removed, so its old count of 2 is no longer correct)

81 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 81 Update Counts on Prefix Paths l Start from the leaves and go upward –if node X has only one child Y, count of X = count of Y –otherwise, count of X = sum of the counts of X's children (figure: counts updated on the prefix paths)

82 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 82 Remove E Nodes l The E nodes can be removed –counts of the internal nodes have been updated –the E nodes carry no more information for solving DE, CE, … (figure: E leaves removed)

83 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 83 Prune Infrequent Itemsets l If the sum of the counts of all X nodes < minimum support count, remove X –since XE cannot be frequent (figure: B is removed; the result is the conditional FP-tree for E)

84 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 84 Current Position of Processing (figure: the sub-tree for suffix E has been turned into the conditional FP-tree for suffix E: counts updated, leaves (E's) removed, infrequent itemsets pruned)

85 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 85 Current Position of Processing (figure: the conditional FP-tree for suffix E is then used to solve DE, CE, BE, AE, in the same way the FP-tree was used to solve E, D, C, B, A)

86 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 86 Solving for DE l From the conditional FP-tree for E, obtain the sub-tree for suffix DE (figure: conditional FP-tree for E and the sub-tree for suffix DE)

87 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 87 Current Position of Processing (figure: the sub-tree for suffix DE has just been obtained)

88 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 88 Solving for DE l Support count of DE = 2 (the sum of the counts of all D nodes) –DE is frequent, so we need to solve CDE, BDE, ADE (figure: sub-tree for suffix DE)

89 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 89 Preparing for Solving CDE, BDE, ADE l Update counts, remove the leaves (D's), and prune infrequent itemsets (figure: the resulting conditional FP-tree for suffix DE contains only the path null → A:2)

90 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 90 Current Position of Processing (figure: the sub-tree for suffix DE has been turned into the conditional FP-tree for suffix DE, ready to solve CDE, BDE, ADE)

91 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 91 Solving CDE, BDE, ADE l Sub-trees for both CDE and BDE are empty –no paths contain item C or B l Work on ADE l ADE (support count = 2) is frequent –but there is no further subproblem for ADE, so backtrack –& no further subproblem for DE, so backtrack → solve the next subproblem, CE (figure: conditional FP-tree for suffix DE and sub-tree for suffix ADE)

92 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 92 Current Position of Processing (figure: about to solve suffix CE)

93 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 93 Solving for Suffix CE l CE is frequent (support count = 2) l No more subproblems for CE (its conditional FP-tree is empty, why?), so we are done with CE l Checking the next subproblem: BE (figure: conditional FP-tree for E, sub-tree for suffix CE, and the empty conditional FP-tree for suffix CE)

94 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 94 Solving for Suffixes BE & AE l The sub-tree for BE is empty (no path in the conditional FP-tree for E contains B) l Checking AE: there are paths containing A (figure: conditional FP-tree for E)

95 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 95 Current Position of Processing (figure: preparing to solve suffix AE)

96 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 96 Solving for Suffix AE l AE is frequent (support count = 2) l Done with AE, backtrack; done with E, backtrack → solve the next subproblem: suffix D (figure: conditional FP-tree for E and sub-tree for suffix AE)

97 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 97 Current Position of Processing (figure: ready to solve suffix D)

98 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 98 Found Frequent Itemsets with Suffix E l E, DE, ADE, CE, AE were discovered, in this order; ready to solve suffix D

99 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 99 Outline l Association rules & mining l Finding frequent itemsets –Apriori algorithm –compact representations of frequent itemsets –alternative discovery methods –FP-growth algorithm l Generating association rules l Evaluating discovered rules

100 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 100 Rule Generation l Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement –If {A,B,C,D} is a frequent itemset, the candidate rules are: ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB l If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
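
A brute-force version of this step is easy to write: enumerate every binary partition of the frequent itemset and keep the rules that meet minconf. The sketch below reuses the support_count helper and transactions from the earlier examples; minconf = 0.6 in the usage comment is an arbitrary illustrative choice, not a value from the slides.

    # Brute-force rule generation from one frequent itemset (illustrative; reuses
    # the support_count helper and transactions from the earlier sketches).
    from itertools import combinations

    def rules_from_itemset(L, transactions, minconf):
        L = tuple(sorted(L))
        rules = []
        for r in range(1, len(L)):                   # size of the consequent Y
            for Y in combinations(L, r):
                X = tuple(i for i in L if i not in Y)
                conf = (support_count(set(L), transactions) /
                        support_count(set(X), transactions))
                if conf >= minconf:
                    rules.append((X, Y, conf))
        return rules

    # e.g. rules_from_itemset({"Milk", "Diaper", "Beer"}, transactions, minconf=0.6)
    # keeps {Milk,Diaper}->{Beer}, {Milk,Beer}->{Diaper}, {Diaper,Beer}->{Milk},
    # and {Beer}->{Milk,Diaper}, and drops the two rules with confidence 0.5.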

101 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 101 Rule Generation l How can rules be generated efficiently from frequent itemsets? –In general, confidence does not have an anti-monotone property –e.g., c(ABC → D) can be larger or smaller than c(AB → D)

102 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 102 Rule Generation l But the confidence of rules generated from the same itemset does have an anti-monotone property –e.g., for L = {A,B,C,D}: c(BCD → A) ≥ c(CD → AB) ≥ c(D → ABC) l Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule –more items on the right → lower or equal confidence

103 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 103 Rule Generation for Apriori Algorithm (figure: lattice of rules generated from one frequent itemset; once a low-confidence rule is found, all rules below it in the lattice are pruned)

104 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 104 Rule Generation for Apriori Algorithm l Candidate rule is generated by merging two rules that share the same prefix in the rule consequent l join(CD=>AB,BD=>AC) would produce the candidate rule D => ABC l Prune rule D=>ABC if its subset AD=>BC does not have high confidence

105 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 105 Outline l Association rules & mining l Finding frequent itemsets –Apriori algorithm –compact representations of frequent itemsets –alternative discovery methods –FP-growth algorithm l Generating association rules l Evaluating discovered rules

106 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 106 Pattern Evaluation l Association rule algorithms tend to produce too many rules –many of them are uninteresting l In the original formulation of association rules, support & confidence are used l Additional interestingness measures can be used to prune/rank the derived patterns

107 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 107 Computing Interestingness Measure l Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table l Contingency table for X → Y:
              Y      ¬Y
     X        f11    f10    f1+
     ¬X       f01    f00    f0+
              f+1    f+0    |T|
l f11: support count of X and Y; f10: support count of X and ¬Y; f01: support count of ¬X and Y; f00: support count of ¬X and ¬Y l Used to define various measures: support, confidence, lift, cosine, Jaccard coefficient, etc.

108 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 108 Computing Interestingness Measure l From the contingency table for X → Y (slide 107): confidence(X → Y) = f11 / (f11 + f10) = f11 / f1+

109 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 109 Drawback of Confidence l Contingency table:
              Coffee   ¬Coffee
     Tea      15       5         20
     ¬Tea     75       5         80
              90       10        100
l Association rule: Tea → Coffee l Confidence = P(Coffee|Tea) = 15/20 = 0.75, but P(Coffee) = 0.9 and P(Coffee|¬Tea) = 75/80 = 0.9375 > 0.75 ⇒ although the confidence is high, the rule is misleading

110 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 110 Computing Interestingness Measure l confidence(X → Y) is based on P(X,Y) & P(X), but we also need to consider P(Y) ⇒ we need to measure the correlation of X & Y

111 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 111 Measuring Correlation l Population of 1000 students –600 students know how to swim (S) –700 students know how to bike (B) –420 students know how to swim and bike (S,B) –P(S∧B) = 420/1000 = 0.42 –P(S) × P(B) = 0.6 × 0.7 = 0.42 –P(S∧B) = P(S) × P(B) ⇒ independent –P(S∧B) > P(S) × P(B) ⇒ positively correlated –P(S∧B) < P(S) × P(B) ⇒ negatively correlated

112 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 112 Measuring Correlation l Population of 1000 students –600 students know how to swim (S) –700 students know how to bike (B) –420 students know how to swim and bike (S,B) l Define lift = P(S∧B) / (P(S) × P(B)) –P(S∧B) = P(S) × P(B) ⇒ lift = 1 ⇒ S & B independent –P(S∧B) > P(S) × P(B) ⇒ lift > 1 ⇒ positively correlated –P(S∧B) < P(S) × P(B) ⇒ lift < 1 ⇒ negatively correlated

113 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 113 Measures on Correlation l The larger the value of such a correlation measure, usually the more likely the two variables/events are correlated (figure: formulas of the measures in terms of P(X), P(Y), and P(X,Y))

114 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 114 Computing Lift from Contingency Table l Using the contingency table of slide 107: lift(X → Y) = c(X → Y) / s(Y) = (f11 / f1+) / (f+1 / N) = N f11 / (f1+ f+1)

115 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 115 Computing Lift from Contingency Table l For the Tea/Coffee table of slide 109: association rule Tea → Coffee, confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9 ⇒ lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)

116 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 116 Drawback of Lift l Statistical independence: if P(X,Y) = P(X)P(Y) ⇒ lift = 1 l Consider two contingency tables with N = 100: –Table 1: f11 = 10, f10 = 0, f01 = 0, f00 = 90 ⇒ lift = (100 × 10)/(10 × 10) = 10, even though X & Y occur together in only 10% of the transactions –Table 2: f11 = 90, f10 = 0, f01 = 0, f00 = 10 ⇒ lift = (100 × 90)/(90 × 90) ≈ 1.11, even though X & Y occur together in 90% of the transactions –so 10 (X & Y seldom occur together) >> 1.11 (X & Y frequently occur together)

117 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 117 Compute Cosine from Contingency Table l Using the contingency table of slide 107: cosine(X, Y) = f11 / √(f1+ f+1) = P(X,Y) / √(P(X) P(Y))

118 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 118 Cosine vs Lift l If X and Y are independent, lift(X,Y) = 1, but cosine(X,Y) = √(P(X) P(Y)), which still depends on the marginals l Cosine does not depend on N (& f00), unlike lift

119 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 119 Compute Jaccard from Contingency Table l Using the contingency table of slide 107: Jaccard(X, Y) = f11 / (f1+ + f+1 − f11)
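
All of the measures discussed on the last few slides can be computed from the four cells of the contingency table. The helper below uses the standard definitions (lift = N·f11/(f1+·f+1), cosine = f11/√(f1+·f+1), Jaccard = f11/(f1+ + f+1 − f11)); applied to the Tea/Coffee table it reproduces confidence 0.75 and lift ≈ 0.83.

    # Interestingness measures from a 2x2 contingency table, using the standard
    # definitions (assumes f1+ and f+1 are nonzero).
    from math import sqrt

    def measures(f11, f10, f01, f00):
        N = f11 + f10 + f01 + f00
        f1p, fp1 = f11 + f10, f11 + f01     # row total for X, column total for Y
        return {
            "support":    f11 / N,
            "confidence": f11 / f1p,
            "lift":       (N * f11) / (f1p * fp1),
            "cosine":     f11 / sqrt(f1p * fp1),
            "jaccard":    f11 / (f1p + fp1 - f11),
        }

    # Tea -> Coffee example from slide 109: f11=15, f10=5, f01=75, f00=5
    print(measures(15, 5, 75, 5))
    # confidence = 0.75, lift = 100*15/(20*90) = 0.833..., support = 0.15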

120 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 120 Property under Null Addition l Null addition: add s transactions that contain neither X nor Y, so f00 becomes f00 + s and N becomes N + s, while f11, f10, f01 are unchanged l Invariant measures: confidence, cosine, Jaccard, etc. l Non-invariant measures: support (f11/N vs f11/(N+s)), lift, etc.

121 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 121 Property under Variable Permutation l Does M(A,B) = M(B,A)? If yes, M is symmetric; otherwise, M is asymmetric –e.g., does c(A → B) = c(B → A)?

122 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 122 Property under Variable Permutation l Does M(A,B) = M(B,A)? If yes, M is symmetric; otherwise, M is asymmetric –does lift(A → B) = lift(B → A)?

123 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 123 Property under Variable Permutation Does M(A,B) = M(B,A)? If yes, M is symmetric; otherwise asymmetric Symmetric measures: u support, lift, Cosine, Jaccard, etc Asymmetric measures: u confidence, etc

124 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 124 Applying Interestingness Measures l An association rule (A → B) is directional –confidence (asymmetric) is intuitive –but confidence does not capture correlation ⇒ use additional measures such as lift to rank the discovered rules & further prune those with low ranks

