Fast Algorithms for Mining Frequent Itemsets
Doctoral Dissertation Proposal
Advisor: Prof. 張真誠
Graduate student: 李育強
Dept. of Computer Science and Information Engineering, National Chung Cheng University
Date: January 20, 2005
2 Outline
  Introduction
  Background and Related Work
  A New FP-Tree for Mining Frequent Itemsets
  Efficient Algorithms for Mining Share-Frequent Itemsets
  Conclusions
3 Introduction
  Data mining techniques have been developed to find a small set of precious nuggets in reams of data
  Mining association rules constitutes one of the most important data mining problems
  It consists of two sub-problems:
    Identifying all frequent itemsets
    Using these frequent itemsets to generate association rules
  The first sub-problem plays an essential role in mining association rules
  Focus of this work: mining frequent itemsets and mining share-frequent itemsets
4 Background and Related Work
  Support-Confidence Framework
    Each item is a binary variable denoting whether the item was purchased
    Apriori (Agrawal & Srikant, 1994) and Apriori-like algorithms
    Pattern-growth algorithms (Han et al., 2000; Han et al., 2004)
  Share-Confidence Framework (Carter et al., 1997)
    The support-confidence framework does not analyze the exact number of products purchased
    The support count does not measure the profit or cost of an itemset
    Exhaustive search algorithms
    Fast algorithms (but with errors: they do not find all share-frequent itemsets)
5 Support-Confidence Framework (1/3) Apriori algorithm (Agrawal and Srikant, 1994): minSup = 40%
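The level-wise idea behind Apriori can be illustrated with a minimal sketch. The transactions below are a hypothetical toy database (not the one on the slide), and the function name `apriori` and its parameters are chosen here for illustration only.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for every frequent itemset.

    transactions: list of sets of items (hypothetical toy input)
    min_sup: minimum support as a fraction of the number of transactions
    """
    n = len(transactions)
    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {x: c for x, c in counts.items() if c / n >= min_sup}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count step: one scan over the transactions per pass.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {x: c for x, c in counts.items() if c / n >= min_sup}
        result.update(frequent)
        k += 1
    return result

# Hypothetical toy database with five transactions.
toy = [{"A", "B", "C"}, {"B", "C", "D"}, {"A", "D"}, {"A", "B", "C", "D"}, {"B", "C"}]
frequent_itemsets = apriori(toy, min_sup=0.4)
```

With minSup = 40% on this toy data, {A, B, D} is counted once and rejected, so {A, B, C, D} is pruned in the next pass without being counted.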
6 Support-Confidence Framework (2/3)
  FP-growth algorithm (Han et al., 2000; Han et al., 2004)
  Table columns: TID | Frequent 1-itemsets (sorted)
  C A B D C A C B D A B D C B D
7 Support-Confidence Framework (3/3)
  Conditional FP-tree of "D"
  Conditional FP-tree of "BD"
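As a rough illustration of the tree that FP-growth mines, the sketch below builds a plain FP-tree with a header table. It is not the author's implementation: the class `FPNode`, the function `build_fp_tree`, the tie-breaking by item name, and the toy transactions are all assumptions made for this example. Conditional trees are not built here.

```python
from collections import Counter

class FPNode:
    """One FP-tree node: an item, its count, and links to parent and children."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Scan 1: support count of every item; keep only the frequent ones.
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root = FPNode(None, None)
    header = {i: [] for i in freq}  # item -> list of tree nodes holding that item
    # Scan 2: insert each transaction with items in descending support order
    # (ties broken by item name, an arbitrary choice for this sketch).
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

# Hypothetical toy database.
toy = [{"A", "B", "C"}, {"B", "C", "D"}, {"A", "D"}, {"A", "B", "C", "D"}, {"B", "C"}]
root, header = build_fp_tree(toy, min_count=2)
```

Transactions that share a prefix share a path, so the counts stored along the header-table links of an item add up to that item's support.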
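8 Share-Confidence Framework (1/6)
  Measure value: $mv(i_p, T_q)$, e.g. mv({D}, T01) = 1, mv({C}, T03) = 3
  Transaction measure value: $tmv(T_q) = \sum_{i_p \in T_q} mv(i_p, T_q)$, e.g. tmv(T02) = 9
  Total measure value: $Tmv(DB) = \sum_{T_q \in DB} tmv(T_q)$, e.g. Tmv(DB) = 44
  Itemset measure value: $imv(X, T_q) = \sum_{i_p \in X} mv(i_p, T_q)$ for $X \subseteq T_q$, e.g. imv({A, E}, T02) = 4
  Local measure value: $lmv(X) = \sum_{T_q \in db_X} imv(X, T_q)$, where $db_X$ is the set of transactions containing X, e.g. lmv({BC}) = 2 + 4 + 5 = 11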
9 Share-Confidence Framework (2/6)
  minShare = 30%
  Itemset share: $SH(X) = lmv(X) / Tmv(DB)$, e.g. SH({BC}) = 11/44 = 25%
  SH-frequent: if SH(X) >= minShare, X is a share-frequent (SH-frequent) itemset; here 25% < 30%, so {BC} is not SH-frequent
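The share-measure definitions translate directly into code. The sketch below uses a hypothetical toy database (item quantities per transaction), not the one from the slides; the names `DB`, `total_mv`, and `share` are chosen here for illustration.

```python
# Hypothetical toy database: transaction id -> {item: measure value (e.g. quantity)}.
DB = {
    "T1": {"A": 1, "B": 2, "C": 1},
    "T2": {"B": 1, "C": 3, "D": 2},
    "T3": {"A": 2, "D": 1},
    "T4": {"A": 1, "B": 1, "C": 2, "D": 1},
}

def mv(item, tid):
    """Measure value of one item in one transaction (0 if absent)."""
    return DB[tid].get(item, 0)

def tmv(tid):
    """Transaction measure value: sum over all items of the transaction."""
    return sum(DB[tid].values())

def total_mv():
    """Total measure value Tmv(DB): sum of tmv over all transactions."""
    return sum(tmv(tid) for tid in DB)

def imv(itemset, tid):
    """Itemset measure value: defined only when the transaction contains the itemset."""
    if all(i in DB[tid] for i in itemset):
        return sum(DB[tid][i] for i in itemset)
    return 0

def lmv(itemset):
    """Local measure value: imv summed over the transactions containing the itemset."""
    return sum(imv(itemset, tid) for tid in DB)

def share(itemset):
    """Itemset share SH(X) = lmv(X) / Tmv(DB)."""
    return lmv(itemset) / total_mv()
```

On this toy data, `lmv("BC")` is 3 + 4 + 3 = 10 and `total_mv()` is 18, so the share of {B, C} is 10/18.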
10 Share-Confidence Framework (3/6)
  ZP (Zero Pruning), ZSP (Zero Subset Pruning)
    Variants of exhaustive search that prune only the candidate itemsets whose local measure values are exactly zero
  SIP (Share Infrequent Pruning)
    Like Apriori; with errors
  CAC (Combine All Counted), PCAC (Parametric CAC)
    Derived from ZSP; use a predictive function; with errors
  IAB (Item Add-Back), PIAB (Parametric IAB)
    Join each share-frequent itemset with each 1-itemset; with errors
  Existing algorithms are either inefficient or do not discover the complete set of share-frequent (SH-frequent) itemsets
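The zero-pruning idea can be sketched as a level-wise search that discards only candidates with a local measure value of exactly zero. This is a simplified illustration, not the published ZP or ZSP procedure; the function name `zero_pruning_search` and the toy database are assumptions.

```python
def lmv(db, itemset):
    """Local measure value of an itemset over a list of {item: value} transactions."""
    return sum(sum(t[i] for i in itemset) for t in db if all(i in t for i in itemset))

def zero_pruning_search(db, min_share):
    """Return {itemset: share} for all SH-frequent itemsets, pruning only lmv == 0."""
    total = sum(sum(t.values()) for t in db)
    level = [frozenset([i]) for i in sorted({i for t in db for i in t})]
    found = {}
    k = 1
    while level:
        kept = []
        for x in level:
            value = lmv(db, x)
            if value == 0:
                continue  # zero pruning: x occurs in no transaction, nor do its supersets
            kept.append(x)
            if value / total >= min_share:
                found[x] = value / total
        k += 1
        # Next level: unions of two surviving itemsets that have exactly k items.
        level = list({a | b for a in kept for b in kept if len(a | b) == k})
    return found

# Hypothetical toy database (item -> quantity per transaction).
toy = [
    {"A": 1, "B": 2, "C": 1},
    {"B": 1, "C": 3, "D": 2},
    {"A": 2, "D": 1},
    {"A": 1, "B": 1, "C": 2, "D": 1},
]
sh_frequent = zero_pruning_search(toy, min_share=0.3)
```

In this toy data {A, B} has share 5/18 and falls below the threshold, while {A, B, C} has share 8/18 and meets it. Share is therefore not downward closed, which is why Apriori-style pruning on share alone can miss itemsets and why the search above keeps every non-zero candidate.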
11 Share-Confidence Framework (4/6) ZP Algorithm SIP & IAB Algorithms
12 Share-Confidence Framework (5/6) ZSP Algorithm
13 Share-Confidence Framework (6/6)
  CAC Algorithm
  $PSH(XY) = SH(X) + SH(Y) \times |db_X| / |DB|$, if $|db_X| < |db_Y|$ … (1)
  $PSH(XY) = SH(Y) + SH(X) \times |db_Y| / |DB|$, if $|db_Y| < |db_X|$ … (2)
  $PSH(XY) = ((1) + (2)) / 2$, if $|db_X| = |db_Y|$
  PSH(AB) = ((22.7% + 18.2% × 4/6) + (18.2% + 22.7% × 4/6)) / 2 = 34.1%
  PSH(AE) = 9.1% + 22.7% × (2/6) = 16.7% < 30%
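The predictive function can be checked numerically. In the sketch below the function name `psh` and its parameter names are illustrative; the inputs lmv(A) = 10, lmv(B) = 8, and lmv(E) = 4 are inferred from the slide's percentages (22.7%, 18.2%, 9.1% of Tmv(DB) = 44) and are an assumption, as are the support counts read off the 4/6 and 2/6 factors.

```python
def psh(sh_x, sh_y, db_x, db_y, db_size):
    """Predicted share of XY from the shares and transaction counts of X and Y."""
    case_1 = sh_x + sh_y * db_x / db_size  # formula (1)
    case_2 = sh_y + sh_x * db_y / db_size  # formula (2)
    if db_x < db_y:
        return case_1
    if db_y < db_x:
        return case_2
    return (case_1 + case_2) / 2  # equal counts: average of (1) and (2)

TMV = 44  # Tmv(DB) from the slides
sh_a, sh_b, sh_e = 10 / TMV, 8 / TMV, 4 / TMV  # inferred from 22.7%, 18.2%, 9.1%

psh_ab = psh(sh_a, sh_b, db_x=4, db_y=4, db_size=6)
psh_ae = psh(sh_e, sh_a, db_x=2, db_y=4, db_size=6)
```

Rounded to one decimal, `psh_ab` gives 34.1% and `psh_ae` gives 16.7%, matching the two worked values on the slide. Because PSH is only a prediction, an itemset whose predicted share is below minShare is discarded even if its true share is not, which is the source of the errors noted for CAC.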
14 A New FP-Tree for Mining Frequent Itemsets (1/3) NFP-growth Algorithm NFP-tree construction
15 A New FP-Tree for Mining Frequent Itemsets (2/3)
  Table columns: TID | Frequent 1-itemsets (sorted)
  C A B D C A C B D A B D C B D
16 A New FP-Tree for Mining Frequent Itemsets (3/3)
  Conditional NFP-tree of "D(3,4)"
17 Experimental Results (1/4)
  PC: Pentium IV 1.5 GHz, 1 GB SDRAM, running Windows 2000 Professional
  All algorithms were coded in VC
  Datasets:
    Real: BMS-WebView-1, BMS-WebView-2, Connect-4
    Artificial: generated by the IBM synthetic data generator
  Parameters of the synthetic datasets:
    |D|  Number of transactions in DB
    |T|  Mean size of the transactions
    |I|  Mean size of the maximal potentially frequent itemsets
    |L|  Number of maximal potentially frequent itemsets
    N    Number of items
18 Experimental Results (2/4)
19 Experimental Results (3/4)
20 Experimental Results (4/4)
21 A Fast Algorithm for Mining Share-Frequent Itemsets
  FSM: Fast Share Measure algorithm
  ML: maximum transaction length in DB
  MV: maximum measure value in DB
  min_lmv = minShare × Tmv(DB)
  Level Closure Property: given a minShare and a k-itemset X
    Theorem 1. If lmv(X) + (lmv(X)/k) × MV < min_lmv, all supersets of X with length k+1 are infrequent
    Theorem 2. If lmv(X) + (lmv(X)/k) × MV × k' < min_lmv, all supersets of X with length k+k' are infrequent
    Corollary 1. If lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv, all supersets of X are infrequent
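The bound in Theorem 1 can be motivated by a short derivation. The following is a sketch, assuming every measure value is a positive integer, so each item of X contributes at least 1 in every transaction that contains X. The symbols $X'$, $i$, and $db_{X'}$ are notation introduced here for illustration, and $Sup(X) = |db_X|$ is the support count.

```latex
\begin{align*}
k \cdot Sup(X) &\le lmv(X)
  && \text{each of the } k \text{ items contributes at least 1 per transaction in } db_X \\
lmv(X') &= \sum_{T_q \in db_{X'}} \bigl( imv(X, T_q) + mv(i, T_q) \bigr)
  && X' = X \cup \{i\},\quad db_{X'} \subseteq db_X \\
&\le lmv(X) + Sup(X) \times MV
  && mv(i, T_q) \le MV,\quad |db_{X'}| \le Sup(X) \\
&\le lmv(X) + \frac{lmv(X)}{k} \times MV
\end{align*}
```

If the last expression is below min_lmv, no superset of length k+1 can reach min_lmv. Adding k' items instead of one multiplies the MV term by k', which gives the form used in Theorem 2.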
22 FSM: Fast Share Measure algorithm
  minShare = 30%
  Let CF(X) = lmv(X) + (lmv(X)/k) × MV × (ML - k)
  Prune X if CF(X) < min_lmv
  Example: CF({ABC}) = 3 + (3/3) × 3 × (6 - 3) = 12 < 14 = min_lmv, so {ABC} is pruned
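The pruning test is a one-line computation. The sketch below reproduces the slide's example for {ABC}; the function and parameter names are illustrative, and the values lmv = 3, k = 3, MV = 3, ML = 6, and min_lmv = 14 are read off the slide.

```python
def critical_function(lmv_x, k, mv_max, ml):
    """CF(X) = lmv(X) + (lmv(X)/k) * MV * (ML - k): upper bound on lmv of any superset."""
    return lmv_x + (lmv_x / k) * mv_max * (ml - k)

def should_prune(lmv_x, k, mv_max, ml, min_lmv):
    """X can be dropped from further joins when even the bound stays below min_lmv."""
    return critical_function(lmv_x, k, mv_max, ml) < min_lmv

# Slide example: X = {ABC}, lmv = 3, k = 3, MV = 3, ML = 6, min_lmv = 14.
cf_abc = critical_function(lmv_x=3, k=3, mv_max=3, ml=6)
prune_abc = should_prune(lmv_x=3, k=3, mv_max=3, ml=6, min_lmv=14)
```

Here `cf_abc` evaluates to 12, which is below 14, so `prune_abc` is `True`.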
23 Experimental Results (1/2)
  Dataset: T4.I2.D100k.N50.S10, minShare = 0.8%, ML = 14
  [Table: C_k, RC_k, and F_k for each pass k = 1, ..., 6 and k >= 7, plus running time (sec), for ZSP, FSM(1), FSM(2), FSM(3), and FSM(ML-k). Values present in the text: C_1 = 50, F_1 = 32, F_2 = 119, F_3 = 65]
24 Experimental Results (2/2)
25 Efficient Algorithms for Mining Share-Frequent Itemsets
  EFSM (Enhanced FSM): instead of joining two arbitrary itemsets in $RC_{k-1}$, EFSM joins each itemset of $RC_{k-1}$ with a single item in $RC_1$ to generate $C_k$ efficiently
  Reduces the time complexity from $O(n^{2k-2})$ to $O(n^k)$
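The EFSM join can be sketched as follows. This shows only the candidate-generation step described above; any additional checks in the actual algorithm are not covered, and the function name `efsm_generate` and the toy inputs are assumptions.

```python
def efsm_generate(rc_prev, rc_1):
    """Join every itemset in RC_{k-1} with every single item of RC_1 it does not contain."""
    items = sorted(i for single in rc_1 for i in single)
    candidates = set()
    for x in rc_prev:
        for i in items:
            if i not in x:
                candidates.add(x | {i})  # duplicates collapse in the set
    return candidates

# Hypothetical remaining candidates after pass 1 and pass 2.
rc_1 = {frozenset("A"), frozenset("B"), frozenset("C"), frozenset("D")}
rc_2 = {frozenset("AB"), frozenset("BC")}
c_3 = efsm_generate(rc_2, rc_1)
```

With n items, each pass pairs at most |RC_{k-1}| itemsets with n single items, rather than pairing itemsets of RC_{k-1} with one another.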
26 Efficient Algorithms for Mining Share-Frequent Itemsets
  $X^{k+1}$: an arbitrary superset of X with length k+1 in DB
  $S(X^{k+1})$: the set of all $X^{k+1}$ in DB
  $db_{S(X^{k+1})}$: the set of transactions that each contain at least one $X^{k+1}$
  SuFSM and ShFSM are derived from EFSM and prune candidates more efficiently than FSM
  SuFSM (Support-counted FSM):
    Theorem 3. If lmv(X) + $Sup(S(X^{k+1}))$ × MV × (ML - k) < min_lmv, all supersets of X are infrequent
27 SuFSM (Support-counted FSM)
  lmv(X)/k >= Sup(X) >= $Sup(S(X^{k+1}))$ >= $maxSup(X^{k+1})$
  Ex. lmv({BCD})/k = 15/3 = 5, Sup({BCD}) = 3, $Sup(S(\{BCD\}^{k+1}))$ = 2, $maxSup(X^{k+1})$ = 1
  If no superset of X is an SH-frequent itemset, then the following four inequalities hold:
    lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
    lmv(X) + Sup(X) × MV × (ML - k) < min_lmv
    lmv(X) + $Sup(S(X^{k+1}))$ × MV × (ML - k) < min_lmv
    lmv(X) + $maxSup(X^{k+1})$ × MV × (ML - k) < min_lmv
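To see how the four left-hand sides compare, the sketch below evaluates them for the {BCD} example. The counts 15, 3, 2, and 1 come from the slide; MV = 3 and ML = 6 are taken from the earlier {ABC} example on the assumption that both examples use the same database. The function name `upper_bounds` is illustrative.

```python
def upper_bounds(lmv_x, k, sup_x, sup_s, max_sup, mv_max, ml):
    """Left-hand sides of the four pruning inequalities, from loosest to tightest."""
    factor = mv_max * (ml - k)
    return [
        lmv_x + (lmv_x / k) * factor,  # uses lmv(X)/k (FSM)
        lmv_x + sup_x * factor,        # uses Sup(X)
        lmv_x + sup_s * factor,        # uses Sup(S(X^{k+1})) (SuFSM)
        lmv_x + max_sup * factor,      # uses maxSup(X^{k+1})
    ]

# X = {BCD}: lmv = 15, k = 3, Sup(X) = 3, Sup(S(X^{k+1})) = 2, maxSup(X^{k+1}) = 1.
bounds_bcd = upper_bounds(15, 3, 3, 2, 1, mv_max=3, ml=6)
```

Under these assumptions the four values are 60, 42, 33, and 24. A smaller multiplier gives a smaller left-hand side, so the later tests can prune candidates that the FSM test keeps.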
28 ShFSM (Share-counted FSM)
  Theorem 4. If $Tmv(db_{S(X^{k+1})})$ < min_lmv, all supersets of X are infrequent
  Pruning conditions compared:
    FSM: lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
    SuFSM: lmv(X) + $Sup(S(X^{k+1}))$ × MV × (ML - k) < min_lmv
    ShFSM: $Tmv(db_{S(X^{k+1})})$ < min_lmv
29 ShFSM (Share-counted FSM)
  Ex. X = {AB}: $Tmv(db_{S(X^{k+1})})$ = tmv(T01) + tmv(T05) = 6 + 6 = 12 < 14 = min_lmv
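The ShFSM quantity is the total measure value of the transactions that contain X together with at least one further item. A minimal sketch on a hypothetical toy database (not the one on the slide) follows; the function name `tmv_superset_db` is an assumption.

```python
def tmv_superset_db(db, itemset):
    """Tmv of the transactions that contain the itemset plus at least one more item."""
    x = set(itemset)
    return sum(
        sum(t.values())
        for t in db
        if x.issubset(t) and len(t) > len(x)  # t holds some superset of length k+1
    )

# Hypothetical toy database (item -> quantity per transaction); Tmv(DB) = 18.
toy = [
    {"A": 1, "B": 2, "C": 1},
    {"B": 1, "C": 3, "D": 2},
    {"A": 2, "D": 1},
    {"A": 1, "B": 1, "C": 2, "D": 1},
]
min_lmv = 0.3 * 18  # minShare = 30% on the toy data

bound_ad = tmv_superset_db(toy, "AD")
bound_bc = tmv_superset_db(toy, "BC")
```

For {A, D} only the last transaction qualifies, so `bound_ad` is 5, which is below `min_lmv`, and every superset of {A, D} can be skipped. For {B, C} the value is 15, so its supersets still have to be examined.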
30 Experimental Results (1/4)
31 Experimental Results (2/4) minShare=0.3%
32 Experimental Results (3/4) minShare=0.3%
33 Experimental Results (4/4)
  Dataset: T6.I4.D100k.N200.S10, minShare = 0.1%, ML = 20
  [Table: C_k and RC_k for each pass k = 1, ..., 8 and k >= 9, plus running time (sec), for FSM, EFSM, SuFSM, and ShFSM, with F_k per pass]
34 Conclusions
  Support measure
    The NFP-tree uses two counters per tree node to reduce the number of tree nodes
    A smaller tree and header table are used to discover frequent itemsets efficiently
    Future direction: develop superior data structures and extend the pattern-growth approach
35 Share measure
    The proposed algorithms efficiently reduce the number of candidates to be counted
    ShFSM performs best among the proposed algorithms
    Future direction: develop superior algorithms to accelerate the process of identifying all SH-frequent itemsets
36 ShFSM: $Tmv(db_{S(X^{k+1})})$ < min_lmv
Thank You!