Graduate Course DataMining Jun-Ki Min
Data Mining
Knowledge discovery in databases.
Association rule A ⇒ B: transactions containing A tend to also contain B.
Confidence: the percentage of transactions containing B among the transactions containing A.
Support: the percentage of transactions that contain both A and B.
Fast Algorithms for Mining Association Rules
Problem Statement
I = {i1, i2, …, im} is the set of items.
A general association rule has the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
The rule has confidence c if c% of the transactions in D that contain X also contain Y.
The rule has support s if s% of the transactions in D contain X ∪ Y.
Given a set of transactions D, the problem of mining association rules is to generate all association rules whose support and confidence are greater than minsup and minconf, respectively.
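
The two measures defined above can be computed directly from a transaction table. The following is a minimal sketch, assuming the four transactions from the Example slide; the function names `support` and `confidence` are illustrative and not part of the original material.

```python
from fractions import Fraction

# Transactions from the Example slide (TID -> items).
D = {
    100: {"A", "C", "D"},
    200: {"B", "C", "E"},
    300: {"A", "B", "C", "E"},
    400: {"B", "E"},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return Fraction(hits, len(transactions))

def confidence(x, y, transactions):
    """Among transactions containing X, the fraction that also contain Y.

    Assumes support(X) > 0; otherwise the ratio is undefined.
    """
    return support(set(x) | set(y), transactions) / support(x, transactions)

print(support({"B", "C", "E"}, D))       # 1/2
print(confidence({"B", "C"}, {"E"}, D))  # 1
```

Exact fractions are used here only so that the results can be compared with the x/4 values on the slides.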
Problem Decomposition
1. Find all sets of items (large itemsets) whose transaction support is above minsup.
2. Use the large itemsets to generate the desired rules: for each large itemset l, find all non-empty subsets of l; for every such subset a, output a rule of the form a ⇒ (l − a) if the ratio of support(l) to support(a) is at least minconf.
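
The second step can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the support values are the ones obtained from the transactions on the Example slide, the name `generate_rules` is hypothetical, and only proper subsets a are used so that the consequent l − a is never empty.

```python
from fractions import Fraction
from itertools import combinations

def generate_rules(supports, minconf):
    """Return (antecedent, consequent, confidence) for every rule a => (l - a).

    `supports` maps each large itemset (a frozenset) to its support. Because
    every subset of a large itemset is also large, supports[a] always exists.
    """
    rules = []
    for l, sup_l in supports.items():
        for size in range(1, len(l)):          # proper, non-empty subsets only
            for a in map(frozenset, combinations(sorted(l), size)):
                conf = sup_l / supports[a]
                if conf >= minconf:
                    rules.append((a, l - a, conf))
    return rules

# Supports of the large itemsets of the Example (min_sup = 0.4).
supports = {
    frozenset("A"): Fraction(2, 4), frozenset("B"): Fraction(3, 4),
    frozenset("C"): Fraction(3, 4), frozenset("E"): Fraction(3, 4),
    frozenset("AC"): Fraction(2, 4), frozenset("BC"): Fraction(2, 4),
    frozenset("BE"): Fraction(3, 4), frozenset("CE"): Fraction(2, 4),
    frozenset("BCE"): Fraction(2, 4),
}

rules = generate_rules(supports, minconf=1)
for a, c, conf in rules:
    print(sorted(a), "=>", sorted(c), conf)
```

With minconf set to 100%, the output includes the rule {B,C} ⇒ {E}.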
Discovering Large Itemsets
Multiple passes over the data are required.
In the first pass, find all large itemsets of size one.
In each subsequent pass, start with the seed set of itemsets found to be large in the previous pass, use it to generate the candidate set, and then compute the support of each candidate.
Anti-monotonicity: if sup(A) > minsup, then sup(A') > minsup for every A' ⊂ A.
Apriori Algorithm
L1 = {large 1-itemsets};
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = apriori-gen(Lk-1);          // new candidates
    forall transactions t ∈ D do begin
        Ct = subset(Ck, t);          // candidates contained in t
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = {c ∈ Ck | c.count ≥ minsup}
end
Answer = ∪k Lk;
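
The loop above can be sketched in Python as follows. Two simplifications are assumptions of this sketch rather than part of the slides: minsup is given as an absolute transaction count (`min_count`), and candidates are generated as unions of two large (k-1)-itemsets instead of by the ordered join used by apriori-gen. The result is the same candidate set.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every large itemset (itemsets are frozensets)."""
    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {c: n for c, n in counts.items() if n >= min_count}
    answer = dict(large)

    k = 2
    while large:
        prev = set(large)
        # Candidates: k-item unions of two large (k-1)-itemsets whose
        # (k-1)-subsets are all large.
        candidates = {
            p | q for p in prev for q in prev
            if len(p | q) == k
            and all(frozenset(s) in prev for s in combinations(p | q, k - 1))
        }
        # One full scan of the database per pass.
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= min_count}
        answer.update(large)
        k += 1
    return answer

# Transactions of the Example slide; min_sup = 0.4 means at least 2 of 4.
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
answer = apriori(D, min_count=2)
print(answer[frozenset("BCE")])  # 2
```

The inner scan tests every candidate against every transaction, which is the simplest possible stand-in for the subset(Ck, t) step.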
AprioriGen
Using Lk-1, generate the candidate k-itemsets (supersets of the large (k-1)-itemsets).
Join step:
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1;
Prune step:
    forall itemsets c ∈ Ck do
        forall (k-1)-subsets s of c do
            if (s ∉ Lk-1) then
                delete c from Ck;
That is, any c ∈ Ck that has at least one (k-1)-item subset not contained in Lk-1 is removed from Ck.
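
The join and prune steps can be mirrored closely in Python. This is a minimal sketch, assuming itemsets are stored as sorted tuples so that "the first k-2 items" is well defined; the function name `apriori_gen` follows the slides, while the second input below is a hypothetical L3 chosen only to show the prune step removing a candidate.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Generate candidate k-itemsets from the large (k-1)-itemsets.

    Each itemset is a sorted tuple; the result is a sorted list of tuples.
    """
    large_prev = set(large_prev)
    # Join step: pair itemsets that agree on their first k-2 items and
    # differ in the last one (p.item_{k-1} < q.item_{k-1}).
    candidates = [
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    ]
    # Prune step: delete c if any (k-1)-subset of c is not in L_{k-1}.
    k = len(candidates[0]) if candidates else 0
    return sorted(
        c for c in candidates
        if all(s in large_prev for s in combinations(c, k - 1))
    )

# L2 from the Example: the only candidate 3-itemset is {B, C, E}.
L2 = [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]
print(apriori_gen(L2))  # [('B', 'C', 'E')]

# Hypothetical L3: the join yields (1,2,3,4) and (1,3,4,5); the latter is
# pruned because its subset (1,4,5) is not large.
L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(apriori_gen(L3))  # [(1, 2, 3, 4)]
```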
Example
Item set I = {A, B, C, D, E}
min_sup = 0.4 (i.e., >= 2 transactions)
Database D:
TID   Items
100   A, C, D
200   B, C, E
300   A, B, C, E
400   B, E
Pass 1
C1:
itemset   support
{A}       2/4
{B}       3/4
{C}       3/4
{D}       1/4
{E}       3/4
L1 = {{A}, {B}, {C}, {E}}
Pass 2
C2:
itemset   support
{A,B}     1/4
{A,C}     2/4
{A,E}     1/4
{B,C}     2/4
{B,E}     3/4
{C,E}     2/4
L2 = {{A,C}, {B,C}, {B,E}, {C,E}}
Pass 3
C3 = L3:
itemset   support
{B,C,E}   2/4
sup({B,C,E}) = 2 and sup({B,C}) = 2. Thus, the rule {B,C} ⇒ {E} holds with confidence 100%.
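AprioriTid
The principle of Apriori is simple, but every time the itemset length grows by 1, the whole database must be scanned again.
AprioriTid uses an index instead: the per-transaction sets of candidate itemsets, written C̄k.
As the passes proceed, the size of the index C̄k shrinks.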
AprioriTid Algorithm
L1 = {large 1-itemsets};
C̄1 = database D;
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = apriori-gen(Lk-1);          // new candidates
    C̄k = ∅;
    forall entries t ∈ C̄k-1 do begin
        // (1) determine candidate itemsets in Ck contained
        //     in the transaction with identifier t.TID
        Ct = {c ∈ Ck | (c − c[k]) ∈ t.set-of-itemsets ∧ (c − c[k-1]) ∈ t.set-of-itemsets};
        // (2)
        forall candidates c ∈ Ct do
            c.count++;
        if (Ct ≠ ∅) then C̄k += <t.TID, Ct>;
    end
    Lk = {c ∈ Ck | c.count ≥ min_sup}
end
Answer = ∪k Lk;
C̄k is the set of entries <TID, set of candidate k-itemsets contained in that transaction>.
c[k] denotes the k-th item of c. For example, if c = {B,C,D}, then c[3] = D and c[2] = C.
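
A compact Python sketch of this algorithm is given below. It assumes itemsets are sorted tuples, minsup is an absolute count (`min_count`), and the index is a plain dictionary from TID to the set of candidates found in that transaction. These representation choices and the name `apriori_tid` are illustrative, not taken from the slides.

```python
from itertools import combinations

def apriori_tid(transactions, min_count):
    """Return {itemset: count} of large itemsets; itemsets are sorted tuples.

    `transactions` maps TID -> set of items. The database is read only to
    build the first index; later passes read the previous pass's index.
    """
    # First index: each transaction as the set of 1-itemsets it contains.
    index = {tid: {(i,) for i in items} for tid, items in transactions.items()}
    counts = {}
    for itemsets in index.values():
        for c in itemsets:
            counts[c] = counts.get(c, 0) + 1
    large = {c: n for c, n in counts.items() if n >= min_count}
    answer = dict(large)

    k = 2
    while large:
        prev = set(large)
        # apriori-gen: join on the first k-2 items, then prune.
        candidates = [
            p + (q[-1],) for p in prev for q in prev
            if p[:-1] == q[:-1] and p[-1] < q[-1]
        ]
        candidates = [c for c in candidates
                      if all(s in prev for s in combinations(c, k - 1))]
        counts = dict.fromkeys(candidates, 0)
        new_index = {}
        for tid, itemsets in index.items():
            # c is in the transaction iff both generating (k-1)-subsets are:
            # c - c[k] is c[:-1], and c - c[k-1] is c[:-2] + c[-1:].
            ct = {c for c in candidates
                  if c[:-1] in itemsets and c[:-2] + c[-1:] in itemsets}
            for c in ct:
                counts[c] += 1
            if ct:                   # transactions with no candidate drop out
                new_index[tid] = ct
        index = new_index
        large = {c: n for c, n in counts.items() if n >= min_count}
        answer.update(large)
        k += 1
    return answer

D = {100: {"A", "C", "D"}, 200: {"B", "C", "E"},
     300: {"A", "B", "C", "E"}, 400: {"B", "E"}}
result = apriori_tid(D, min_count=2)
print(result[("B", "C", "E")])  # 2
```

Because transactions that contain no candidate are dropped from the index, each pass typically reads less data than the one before it.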
Example
C̄1:
TID   Set-of-itemsets
100   {{A}, {C}, {D}}
200   {{B}, {C}, {E}}
300   {{A}, {B}, {C}, {E}}
400   {{B}, {E}}
L1:
itemset   support
{A}       2/4
{B}       3/4
{C}       3/4
{E}       3/4
C2:
itemset
{A,B}
{A,C}
{A,E}
{B,C}
{B,E}
{C,E}
Example
C̄2:
TID   Set-of-itemsets
100   {{A,C}}
200   {{B,C}, {B,E}, {C,E}}
300   {{A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}}
400   {{B,E}}
L2:
itemset   support
{A,C}     2/4
{B,C}     2/4
{B,E}     3/4
{C,E}     2/4
C3:
itemset
{B,C,E}
Example
C̄3:
TID   Set-of-itemsets
200   {{B,C,E}}
300   {{B,C,E}}
L3:
itemset   support
{B,C,E}   2/4
AprioriHybrid
Apriori and AprioriTid use the same candidate generation procedure and therefore count the same itemsets.
In the later passes, the number of candidate itemsets decreases. However, Apriori still examines every transaction in the database, whereas AprioriTid reads only its index C̄k.
Thus, AprioriHybrid runs Apriori in the initial passes and switches to AprioriTid once C̄k is small enough to fit in memory, in order to reduce disk I/O. [5]