Download presentation
Presentation is loading. Please wait.
Published byBrianne Dalton Modified over 9 years ago
1
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1
2
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction Market-Basket transactions Example of Association Rules {Diaper} {Beer}, {Milk, Bread} {Eggs,Coke}, {Beer, Bread} {Milk}, Implication means co-occurrence, not causality!
3
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 3 Association Rule: Basic Concepts l Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit) l Find: all rules that correlate the presence of one set of items with that of another set of items –E.g., 98% of people who purchase tires and auto accessories also get automotive services done l Applications –* Maintenance Agreement (What the store should do to boost Maintenance Agreement sales) –Home Electronics * (What other products should the store stocks up?) –Attached mailing in direct marketing –Detecting “ping-pong”ing of patients, faulty “collisions”
4
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 4 Definition: Frequent Itemset l Itemset –A collection of one or more items Example: {Milk, Bread, Diaper} –k-itemset An itemset that contains k items l Support count ( ) –Frequency of occurrence of an itemset –E.g. ({Milk, Bread,Diaper}) = 2 l Support –Fraction of transactions that contain an itemset –E.g. s({Milk, Bread, Diaper}) = 2/5 l Frequent Itemset –An itemset whose support is greater than or equal to a minsup threshold
5
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 5 Rule Measures: Support and Confidence l Find all the rules X & Y Z with minimum confidence and support –support, s, probability that a transaction contains {X U Y U Z} –confidence, c, conditional probability that a transaction having {X U Y} also contains Z Let minimum support 50%, and minimum confidence 50%, we have –A C (50%, 66.6%) –C A (50%, 100%) Customer buys diaper Customer buys both Customer buys beer
6
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 6 Definition: Association Rule Example: l Association Rule –An implication expression of the form X Y, where X and Y are itemsets –Example: {Milk, Diaper} {Beer} l Rule Evaluation Metrics –Support (s) Fraction of transactions that contain both X and Y –Confidence (c) Measures how often items in Y appear in transactions that contain X
7
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 7 Mining Association Rules—An Example For rule A C: support = P (A U C) = 50% confidence = P (C|A) = 66.6% Min. support 50% Min. confidence 50%
8
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 8 Association Rule Mining Task l Given a set of transactions T, the goal of association rule mining is to find all rules having –support ≥ minsup threshold –confidence ≥ minconf threshold l Brute-force approach: –List all possible association rules –Compute the support and confidence for each rule –Prune rules that fail the minsup and minconf thresholds Computationally prohibitive!
9
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 9 Mining Association Rules Example of Rules: {Milk,Diaper} {Beer} (s=0.4, c=0.67) {Milk,Beer} {Diaper} (s=0.4, c=1.0) {Diaper,Beer} {Milk} (s=0.4, c=0.67) {Beer} {Milk,Diaper} (s=0.4, c=0.67) {Diaper} {Milk,Beer} (s=0.4, c=0.5) {Milk} {Diaper,Beer} (s=0.4, c=0.5) Observations: All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} Rules originating from the same itemset have identical support but can have different confidence Thus, we may decouple the support and confidence requirements
10
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 10 Mining Association Rules l Two-step approach: 1.Frequent Itemset Generation – Generate all itemsets whose support minsup 2.Rule Generation – Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset l Frequent itemset generation is still computationally expensive
11
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 11 Frequent Itemset Generation Given d items, there are 2 d possible candidate itemsets
12
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 12 Frequent Itemset Generation l Brute-force approach: –Each itemset in the lattice is a candidate frequent itemset –Count the support of each candidate by scanning the database –Match each transaction against every candidate –Complexity ~ O(NMw) => Expensive since M = 2 d !!!
13
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 13 Computational Complexity l Given d unique items: –Total number of itemsets = 2 d –Total number of possible association rules: If d=6, R = 602 rules
14
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 14 Frequent Itemset Generation Strategies l Reduce the number of candidates (M) –Complete search: M=2 d –Use pruning techniques to reduce M l Reduce the number of transactions (N) –Reduce size of N as the size of itemset increases –Used by DHP and vertical-based mining algorithms l Reduce the number of comparisons (NM) –Use efficient data structures to store the candidates or transactions –No need to match every candidate against every transaction
15
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 15 Reducing Number of Candidates l Apriori principle: –If an itemset is frequent, then all of its subsets must also be frequent l Apriori principle holds due to the following property of the support measure: –Support of an itemset never exceeds the support of its subsets –This is known as the anti-monotone property of support
16
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 16 Found to be Infrequent Illustrating Apriori Principle Pruned supersets
17
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 17 Illustrating Apriori Principle Items (1-itemsets) Pairs (2-itemsets) (No need to generate candidates involving Coke or Eggs) Triplets (3-itemsets) Minimum Support = 3 If every subset is considered, 6 C 1 + 6 C 2 + 6 C 3 = 41 With support-based pruning, 6 + 6 + 1 = 13
18
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 18 Apriori Algorithm l Method: –Let k=1 –Generate frequent itemsets of length 1 –Repeat until no new frequent itemsets are identified Generate length (k+1) candidate itemsets from length k frequent itemsets Prune candidate itemsets containing subsets of length k that are infrequent Count the support of each candidate by scanning the DB Eliminate candidates that are infrequent, leaving only those that are frequent
19
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 19 Mining Frequent Itemsets: the Key Step l Find the frequent itemsets: the sets of items that have minimum support –A subset of a frequent itemset must also be a frequent itemset i.e., if {AB} is a frequent itemset, both {A} and {B} should be a frequent itemset –Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset) l Use the frequent itemsets to generate association rules.
20
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 20 The Apriori Algorithm l Join Step: C k is generated by joining L k-1 with itself l Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset l Pseudo-code: C k : Candidate itemset of size k L k : frequent itemset of size k L 1 = {frequent items}; for (k = 1; L k != ; k++) do begin C k+1 = candidates generated from L k ; for each transaction t in database do increment the count of all candidates in C k+1 that are contained in t L k+1 = candidates in C k+1 with min_support end return k L k ;
21
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 21 The Apriori Algorithm — Example Database D Scan D C1C1 L1L1 L2L2 C2C2 C2C2 C3C3 L3L3
22
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 22 How to Generate Candidates? l Suppose the items in L k-1 are listed in an order l Step 1: self-joining L k-1 insert into C k select p.item 1, p.item 2, …, p.item k-1, q.item k-1 from L k-1 p, L k-1 q where p.item 1 =q.item 1, …, p.item k-2 =q.item k-2, p.item k-1 < q.item k-1 l Step 2: pruning forall itemsets c in C k do forall (k-1)-subsets s of c do if (s is not in L k-1 ) then delete c from C k
23
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 23 Example of Generating Candidates l L 3 ={abc, abd, acd, ace, bcd} l Self-joining: L 3 *L 3 –abcd from abc and abd –acde from acd and ace l Pruning: –acde is removed because ade is not in L 3 l C 4 ={abcd}
24
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 24 Reducing Number of Comparisons l Candidate counting: –Scan the database of transactions to determine the support of each candidate itemset –To reduce the number of comparisons, store the candidates in a hash structure Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets
25
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 25 How to Count Supports of Candidates? l Why counting supports of candidates a problem? –The total number of candidates can be very huge – One transaction may contain many candidates l Method: –Candidate itemsets are stored in a hash-tree –Leaf node of hash-tree contains a list of itemsets and counts –Interior node contains a hash table –Subset function: finds all the candidates contained in a transaction
26
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 26 Generate Hash Tree 2 3 4 5 6 7 1 4 5 1 3 6 1 2 4 4 5 7 1 2 5 4 5 8 1 5 9 3 4 5 3 5 6 3 5 7 6 8 9 3 6 7 3 6 8 1,4,7 2,5,8 3,6,9 Hash function Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} You need: Hash function Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)
27
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 27 Association Rule Discovery: Hash tree 1 5 9 1 4 51 3 6 3 4 53 6 7 3 6 8 3 5 6 3 5 7 6 8 9 2 3 4 5 6 7 1 2 4 4 5 7 1 2 5 4 5 8 1,4,7 2,5,8 3,6,9 Hash Function Candidate Hash Tree Hash on 1, 4 or 7
28
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 28 Association Rule Discovery: Hash tree 1 5 9 1 4 51 3 6 3 4 53 6 7 3 6 8 3 5 6 3 5 7 6 8 9 2 3 4 5 6 7 1 2 4 4 5 7 1 2 5 4 5 8 1,4,7 2,5,8 3,6,9 Hash Function Candidate Hash Tree Hash on 2, 5 or 8
29
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 29 Association Rule Discovery: Hash tree 1 5 9 1 4 51 3 6 3 4 53 6 7 3 6 8 3 5 6 3 5 7 6 8 9 2 3 4 5 6 7 1 2 4 4 5 7 1 2 5 4 5 8 1,4,7 2,5,8 3,6,9 Hash Function Candidate Hash Tree Hash on 3, 6 or 9
30
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 30 Subset Operation Given a transaction t, what are the possible subsets of size 3?
31
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 31 Subset Operation Using Hash Tree 1 5 9 1 4 51 3 6 3 4 53 6 7 3 6 8 3 5 6 3 5 7 6 8 9 2 3 4 5 6 7 1 2 4 4 5 7 1 2 5 4 5 8 1 2 3 5 6 1 +2 3 5 6 3 5 62 + 5 63 + 1,4,7 2,5,8 3,6,9 Hash Function transaction
32
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 32 Subset Operation Using Hash Tree 1 5 9 1 4 5 1 3 6 3 4 5 3 6 7 3 6 8 3 5 6 3 5 7 6 8 92 3 4 5 6 7 1 2 4 4 5 7 1 2 5 4 5 8 1,4,7 2,5,8 3,6,9 Hash Function 1 2 3 5 6 3 5 61 2 + 5 61 3 + 61 5 + 3 5 62 + 5 63 + 1 +2 3 5 6 transaction
33
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 33 Subset Operation Using Hash Tree 1 5 9 1 4 5 1 3 6 3 4 5 3 6 7 3 6 8 3 5 6 3 5 7 6 8 92 3 4 5 6 7 1 2 4 4 5 7 1 2 5 4 5 8 1,4,7 2,5,8 3,6,9 Hash Function 1 2 3 5 6 3 5 61 2 + 5 61 3 + 61 5 + 3 5 62 + 5 63 + 1 +2 3 5 6 transaction Match transaction against 11 out of 15 candidates
34
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 34 Factors Affecting Complexity l Choice of minimum support threshold – lowering support threshold results in more frequent itemsets – this may increase number of candidates and max length of frequent itemsets l Dimensionality (number of items) of the data set – more space is needed to store support count of each item – if number of frequent items also increases, both computation and I/O costs may also increase l Size of database – since Apriori makes multiple passes, run time of algorithm may increase with number of transactions l Average transaction width – transaction width increases with denser data sets –This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)
35
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 35 Rule Generation l Given a frequent itemset L, find all non-empty subsets f L such that f L – f satisfies the minimum confidence requirement –If {A,B,C,D} is a frequent itemset, candidate rules: ABC D, ABD C, ACD B, BCD A, A BCD,B ACD,C ABD, D ABC AB CD,AC BD, AD BC, BC AD, BD AC, CD AB, l If |L| = k, then there are 2 k – 2 candidate association rules (ignoring L and L)
36
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 36 Rule Generation l How to efficiently generate rules from frequent itemsets? –In general, confidence does not have an anti- monotone property c(ABC D) can be larger or smaller than c(AB D) –But confidence of rules generated from the same itemset has an anti-monotone property –e.g., L = {A,B,C,D}: c(ABC D) c(AB CD) c(A BCD) Confidence is anti-monotone w.r.t. number of items on the RHS of the rule
37
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 37 Rule Generation for Apriori Algorithm Lattice of rules Pruned Rules Low Confidence Rule
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.