DCS 802 Data Mining: Apriori Algorithm
Spring 2002
Prof. Sung-Hyuk Cha, School of Computer Science & Information Systems
Association Rules
Definition: rules that state a statistical correlation between the occurrence of certain attributes in a database table.
Given a set of transactions, where each transaction is a set of items, X1, ..., Xn and Y, an association rule is an expression X1, ..., Xn → Y. This means that the attributes X1, ..., Xn predict Y.
Intuitive meaning of such a rule: transactions in the database which contain the items in X tend also to contain the items in Y.
Measures for an Association Rule
Support: given the association rule X1, ..., Xn → Y, the support is the percentage of records for which X1, ..., Xn and Y all hold. It measures the statistical significance of the rule.
Confidence: given the association rule X1, ..., Xn → Y, the confidence is the percentage of records for which Y holds, within the group of records for which X1, ..., Xn hold. It measures the degree of correlation between X and Y in the dataset, i.e. the rule's strength.
Quiz #2
Problem: given the transaction table D below, find the support and confidence for the association rule B, D → E.

Database D
TID   Items
01    A B E
02    A C D E
03    B C D E
04    A B D E
05    B D E
06    A B C
07    A B D

Answer: support = 3/7, confidence = 3/4
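To make the two measures concrete, here is a minimal Python sketch that checks the quiz answer; the function names and the set-based representation of transactions are illustrative choices, not part of the lecture.

    def support(D, items):
        """Fraction of transactions containing every item in `items`."""
        return sum(items <= t for t in D) / len(D)

    def confidence(D, lhs, rhs):
        """Fraction of transactions containing `lhs` that also contain `rhs`."""
        return support(D, lhs | rhs) / support(D, lhs)

    # Database D from the quiz, one set of items per transaction (TIDs 01-07).
    D = [{"A","B","E"}, {"A","C","D","E"}, {"B","C","D","E"}, {"A","B","D","E"},
         {"B","D","E"}, {"A","B","C"}, {"A","B","D"}]

    print(support(D, {"B","D","E"}))          # 3/7, about 0.43
    print(confidence(D, {"B","D"}, {"E"}))    # 3/4 = 0.75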
Apriori Algorithm
An efficient algorithm for finding association rules.

Procedure
1. Find all the frequent itemsets.
2. Use the frequent itemsets to generate the association rules.

A frequent itemset is a set of items whose support is greater than a user-defined minimum.
Notation
k-itemset : An itemset having k items.
Lk : Set of frequent k-itemsets (those with minimum support). Each member of this set has two fields: i) itemset and ii) support count.
Ck : Set of candidate k-itemsets (potentially frequent itemsets). Each member of this set has two fields: i) itemset and ii) support count.
D : The sample transaction database.
F : The set of all frequent itemsets.
Example

Database D
TID   Items
100   A C D
200   B C E
300   A B C E
400   B E

Pass k = 1
C1 (itemset, support): {A} .50, {B} .75, {C} .75, {D} .25, {E} .75
L1 (frequent?):        {A} Y,   {B} Y,   {C} Y,   {D} N,   {E} Y

Pass k = 2
C2 (itemset, support): {A,B} .25, {A,C} .50, {A,E} .25, {B,C} .50, {B,E} .75, {C,E} .50
L2 (frequent?):        {A,B} N,   {A,C} Y,   {A,E} N,   {B,C} Y,   {B,E} Y,   {C,E} Y

Pass k = 3
C3 (itemset, support): {B,C,E} .50
L3 (frequent?):        {B,C,E} Y

Pass k = 4
C4 (itemset, support): {A,B,C,E} .25
L4 (frequent?):        {A,B,C,E} N

* Suppose a user-defined minimum support = .49.
* n items implies O(2^n - 2) computational complexity?
Procedure
Apriorialgo() {
    F = ∅ ;
    L1 = {frequent 1-itemsets} ;
    k = 2 ; /* k represents the pass number. */
    while (Lk-1 != ∅ ) {
        F = F ∪ Lk-1 ;
        Ck = new candidates of size k generated from Lk-1 ;
        for all transactions t ∈ D
            increment the count of all candidates in Ck that are contained in t ;
        Lk = all candidates in Ck with minimum support ;
        k++ ;
    }
    return ( F ) ;
}
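The pseudocode above translates directly into a short Python sketch; this is a minimal illustration assuming the database is a list of item sets, with illustrative names (apriori, minsup) rather than the lecture's notation.

    from itertools import combinations

    def apriori(D, minsup):
        """Return every frequent itemset of D (a list of item sets) with its support."""
        support = lambda s: sum(s <= t for t in D) / len(D)
        items = sorted({i for t in D for i in t})
        L = {frozenset({i}) for i in items if support(frozenset({i})) >= minsup}  # L1
        F, k = {}, 2
        while L:
            F.update({s: support(s) for s in L})            # F = F U L(k-1)
            # Candidate generation: join L(k-1) with itself ...
            C = {a | b for a in L for b in L if len(a | b) == k}
            # ... and prune candidates having an infrequent (k-1)-subset.
            C = {c for c in C if all(frozenset(s) in L for s in combinations(c, k - 1))}
            # One pass over D counts the candidates; keep those with minimum support.
            L = {c for c in C if support(c) >= minsup}
            k += 1
        return F

    # The example database of the previous slide; minimum support .49.
    D = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
    print(apriori(D, 0.49))   # frequent 1-, 2- and 3-itemsets, ending with {B,C,E}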
Candidate Generation
Given Lk-1, the set of all frequent (k-1)-itemsets, generate a superset of the set of all frequent k-itemsets.
Idea: if an itemset X has minimum support, then so do all subsets of X.
1. Join Lk-1 with Lk-1.
2. Prune: delete all itemsets c ∈ Ck such that some (k-1)-subset of c is not in Lk-1.

ex) L2 = { {A,C}, {B,C}, {B,E}, {C,E} }
1. Join:  { {A,B,C}, {A,C,E}, {B,C,E} }
2. Prune: { {B,C,E} }, since {A,B} ∉ L2 and {A,E} ∉ L2.
Instead of 5C3 = 10 candidates, we have only 1 candidate.
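The join and prune steps can also be checked in isolation with a small, self-contained sketch (again with illustrative names); run on the L2 of the example above, it leaves exactly one candidate.

    from itertools import combinations

    def apriori_gen(L_prev, k):
        """Join L(k-1) with itself, then prune candidates with a (k-1)-subset not in L(k-1)."""
        joined = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
        return {c for c in joined
                if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

    L2 = {frozenset(s) for s in ({"A","C"}, {"B","C"}, {"B","E"}, {"C","E"})}
    print(apriori_gen(L2, 3))   # {frozenset({'B', 'C', 'E'})} is the only surviving candidate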
Thoughts
Association rules are always defined on binary attributes, so the tables need to be flattened.

ex) Phone Company DB
Original schema:  CID | Gender | Ethnicity | Call
Flattened schema: CID | M | F | W | B | H | A | D | I

- Support for the Asian ethnicity will never exceed .5.
- There is no need to consider itemsets such as {M,F}, {W,B}, or {D,I}, since the values of one attribute are mutually exclusive.
- Rules such as M → F or D → I are not of interest at all.
* Considering the original schema before flattening may be a good idea.
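A sketch of the flattening step using pandas one-hot encoding; the table values below are made up for illustration, and only the column layout mirrors the Phone Company example.

    import pandas as pd

    # Hypothetical rows of the Phone Company DB before flattening.
    df = pd.DataFrame({
        "CID":       [1, 2, 3],
        "Gender":    ["M", "F", "M"],
        "Ethnicity": ["W", "A", "H"],
        "Call":      ["D", "I", "D"],
    })

    # One-hot encode each categorical column into binary item columns
    # (Gender_M, Gender_F, Ethnicity_W, ..., Call_D, Call_I).
    flat = pd.get_dummies(df, columns=["Gender", "Ethnicity", "Call"])
    print(flat.columns.tolist())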
Finding Association Rules with Item Constraints
When item constraints are considered, the Apriori candidate generation procedure does not need to generate all the potentially frequent itemsets as candidates.

Procedure
1. Find all the frequent itemsets that satisfy the boolean expression B.
2. Find the support of all subsets of the frequent itemsets that do not satisfy B.
3. Generate the association rules from the frequent itemsets found in Step 1, computing confidences from the frequent itemsets found in Steps 1 and 2.
Additional Notation
B : Boolean expression with m disjuncts: B = D1 ∨ D2 ∨ ... ∨ Dm.
Di : Conjunction of ni conjuncts: Di = ai,1 ∧ ai,2 ∧ ... ∧ ai,ni.
S : Set of items such that any itemset that satisfies B contains an item from S.
Lb(k) : Set of frequent k-itemsets that satisfy B.
Ls(k) : Set of frequent k-itemsets that contain an item in S.
Cb(k) : Set of candidate k-itemsets that satisfy B.
Cs(k) : Set of candidate k-itemsets that contain an item in S.
Direct Algorithm
1. Scan the data and determine L1 and F.
2. Find Lb(1).
3. Generate Cb(k+1) from Lb(k):
   3-1. Ck+1 = Lb(k) × F.
   3-2. Delete all candidates in Ck+1 that do not satisfy B.
   3-3. Delete all candidates in Ck+1 below the minimum support.
   3-4. For each Di with exactly k+1 non-negated elements, add the itemset to Ck+1 if all the items are frequent.
Example
Given B = (A ∧ B) ∨ (C ∧ ¬E), the database D below, and minimum support .49 as before:

Database D
TID   Items
100   A C D
200   B C E
300   A B C E
400   B E

Steps 1 & 2
C1 = { {A}, {B}, {C}, {D}, {E} },  L1 = { {A}, {B}, {C}, {E} },  Lb(1) = { {C} }

First pass of step 3 (k = 1)
step 3-1: C2 = Lb(1) × F = { {A,C}, {B,C}, {C,E} }
step 3-2: Cb(2) = { {A,C}, {B,C} }
step 3-3: L2 = { {A,C}, {B,C} }
step 3-4: Lb(2) = { {A,B}, {A,C}, {B,C} }

Second pass of step 3 (k = 2)
step 3-1: C3 = Lb(2) × F = { {A,B,C}, {A,B,E}, {A,C,E}, {B,C,E} }
step 3-2: Cb(3) = { {A,B,C}, {A,B,E} }
step 3-3: L3 = ∅
step 3-4: Lb(3) = ∅
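One pass of the candidate generation above can be sketched in Python as follows, assuming B is encoded as a list of (positive items, negated items) pairs; the names (direct_pass, satisfies) are illustrative, not from the lecture.

    def satisfies(itemset, B):
        """True if the itemset satisfies at least one disjunct of B."""
        return any(pos <= itemset and not (neg & itemset) for pos, neg in B)

    def direct_pass(Lb_k, F, B, support, minsup, k):
        # 3-1. Join: extend every itemset in Lb(k) with each frequent item.
        C = {frozenset(x | {f}) for x in Lb_k for f in F if f not in x}
        # 3-2. Keep only candidates that satisfy B.
        Cb = {c for c in C if satisfies(c, B)}
        # 3-3. Keep only candidates with minimum support.
        L = {c for c in Cb if support(c) >= minsup}
        # 3-4. For each disjunct with exactly k+1 non-negated items, add it if all its items are frequent.
        Lb = set(L)
        for pos, _ in B:
            if len(pos) == k + 1 and pos <= F:
                Lb.add(frozenset(pos))
        return Cb, L, Lb

    D = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
    support = lambda s: sum(s <= t for t in D) / len(D)
    B = [({"A", "B"}, set()), ({"C"}, {"E"})]        # (A and B) or (C and not E)
    print(direct_pass({frozenset({"C"})}, {"A","B","C","E"}, B, support, 0.49, 1))
    # Cb(2) = {A,C},{B,C};  L2 = {A,C},{B,C};  Lb(2) = {A,B},{A,C},{B,C}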
The MultipleJoins and Reorder algorithms for finding association rules with item constraints will be added.
Mining Sequential Patterns
Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences, among all sequences, that have a certain user-specified minimum support.
- A transaction-time field is added to each transaction.
- An itemset within a sequence is denoted as (i1 i2 ... im).
Sequence Version of DB Conversion

Database D
CustomerID   Transaction Time   Items
1            Jun 25 '93         30
1            Jun 30 '93         90
2            Jun 10 '93         10, 20
2            Jun 15 '93         30
2            Jun 20 '93         40, 60, 70
3            Jun 25 '93         30, 50, 70
4            Jun 25 '93         30
4            Jun 30 '93         40, 70
4            Jul 25 '93         90
5            Jun 12 '93         90

Sequential version D'
CustomerID   Customer Sequence
1            <(30) (90)>
2            <(10 20) (30) (40 60 70)>
3            <(30 50 70)>
4            <(30) (40 70) (90)>
5            <(90)>

Answer set with support > .25 = { <(30) (90)>, <(30) (40 70)> }

* Customer sequence: all the transactions of a customer, ordered by increasing transaction time.
Definitions
- Ti is a transaction time.
- itemset(Ti) is the set of items in transaction Ti.
- litemset: an itemset with minimum support (a large itemset).

Def 1. A sequence <a1 a2 ... an> is contained in another sequence <b1 b2 ... bm> if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.
ex) <(3) (4 5) (8)> is contained in <(7) (3 8) (9) (4 5 6) (8)>.  Yes
    <(3) (5)> is contained in <(3 5)>.  No

Def 2. A sequence s is maximal if s is not contained in any other sequence.
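Def 1 can be written as a short greedy check; is_contained is an illustrative name, and sequences are represented here as lists of item sets.

    def is_contained(a, b):
        """Def 1: sequence a is contained in sequence b (both are lists of itemsets)."""
        i = 0
        for itemset in b:
            if i < len(a) and a[i] <= itemset:   # a[i] must be a subset of a later element of b
                i += 1
        return i == len(a)

    print(is_contained([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))  # True
    print(is_contained([{3}, {5}], [{3, 5}]))                                    # False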
Procedure
1. Convert D into a database D' of customer sequences.
2. Litemset mapping.
3. Transform each customer sequence into a litemset representation.
4. Find the desired sequences using the set of litemsets:
   4-1. AprioriAll
   4-2. AprioriSome
   4-3. DynamicSome
5. Find the maximal sequences among the set of large sequences S (a sketch follows):
   for (k = n; k > 1; k--)
       foreach k-sequence sk
           delete from S all subsequences of sk
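Step 5 is the only new piece beyond containment; a minimal sketch (with an inlined copy of the containment test from Def 1) could look like this, with illustrative names and data.

    def maximal_sequences(S):
        """Keep only the sequences of S that are not contained in another sequence of S."""
        def contained(a, b):                  # Def 1 containment, as on the previous slide
            i = 0
            for itemset in b:
                if i < len(a) and a[i] <= itemset:
                    i += 1
            return i == len(a)
        return [s for s in S
                if not any(s is not t and contained(s, t) for t in S)]

    S = [[{30}, {90}], [{30}, {40, 70}], [{30}], [{90}], [{40, 70}]]
    print(maximal_sequences(S))    # only <(30) (90)> and <(30) (40 70)> remain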
Example (steps 2 and 3)

Step 2: Litemset mapping
Large Itemsets   Mapped to
(30)             1
(40)             2
(70)             3
(40 70)          4
(90)             5

Step 3: Transformation
CID   Customer Sequence            Transformed Sequence                    Mapping
1     <(30) (90)>                  <{(30)} {(90)}>                         <{1} {5}>
2     <(10 20) (30) (40 60 70)>    <{(30)} {(40), (70), (40 70)}>          <{1} {2,3,4}>
3     <(30 50 70)>                 <{(30), (70)}>                          <{1,3}>
4     <(30) (40 70) (90)>          <{(30)} {(40), (70), (40 70)} {(90)}>   <{1} {2,3,4} {5}>
5     <(90)>                       <{(90)}>                                <{5}>
AprioriAll
AprioriAll() {
    F = ∅ ;
    L1 = {large 1-sequences} ;
    k = 2 ; /* k represents the pass number. */
    while (Lk-1 != ∅ ) {
        F = F ∪ Lk-1 ;
        Ck = new candidate k-sequences generated from Lk-1 ;
        for each customer sequence c ∈ D'
            increment the count of all candidates in Ck that are contained in c ;
        Lk = all candidates in Ck with minimum support ;
        k++ ;
    }
    return ( F ) ;
}
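A simplified, self-contained Python sketch of AprioriAll follows, assuming the customer sequences are already given as lists of itemsets and generating candidates by extending large (k-1)-sequences with large 1-sequences; the names and this candidate-generation shortcut are illustrative, not the exact procedure from the lecture.

    def contained(a, b):
        """Def 1: sequence a is contained in sequence b."""
        i = 0
        for itemset in b:
            if i < len(a) and a[i] <= itemset:
                i += 1
        return i == len(a)

    def apriori_all(db, minsup):
        """Return the maximal large sequences of db (a list of customer sequences)."""
        support = lambda s: sum(contained(s, c) for c in db) / len(db)
        items = sorted({x for seq in db for t in seq for x in t})
        L = {1: [[frozenset({x})] for x in items if support([frozenset({x})]) >= minsup]}
        answer, k = list(L[1]), 2
        while L[k - 1]:
            # Candidates: every large (k-1)-sequence extended by a large 1-sequence ...
            C = [s + e for s in L[k - 1] for e in L[1]]
            # ... pruned when some (k-1)-subsequence is not large.
            C = [c for c in C
                 if all([x for j, x in enumerate(c) if j != i] in L[k - 1]
                        for i in range(len(c)))]
            L[k] = [c for c in C if support(c) >= minsup]
            answer += L[k]
            k += 1
        # Keep only the maximal large sequences.
        return [s for s in answer if not any(s is not t and contained(s, t) for t in answer)]

    # Customer sequences of the example on the next slide (minimum support = .40).
    db = [[{1, 5}, {2}, {3}, {4}],
          [{1}, {3}, {4}, {3, 5}],
          [{1}, {2}, {3}, {4}],
          [{1}, {3}, {5}],
          [{4}, {5}]]
    print(apriori_all(db, 0.40))   # <1 2 3 4>, <1 3 5>, <4 5>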
Example

Customer Sequences (minimum support = .40, i.e. 2 customer sequences)
<(1 5) (2) (3) (4)>
<(1) (3) (4) (3 5)>
<(1) (2) (3) (4)>
<(1) (3) (5)>
<(4) (5)>

L1 (Supp.): <1> 4, <2> 2, <3> 4, <4> 4, <5> 4
L2 (Supp.): <1 2> 2, <1 3> 4, <1 4> 3, <1 5> 2, <2 3> 2, <2 4> 2, <3 4> 3, <3 5> 2, <4 5> 2
L3 (Supp.): <1 2 3> 2, <1 2 4> 2, <1 3 4> 3, <1 3 5> 2, <2 3 4> 2
C4: <1 2 3 4>
L4 (Supp.): <1 2 3 4> 2

The maximal large sequences are { <1 2 3 4>, <1 3 5>, <4 5> }.
The AprioriSome and DynamicSome algorithms for mining sequential patterns will be added.