Published by Quentin Henry. Modified over 9 years ago.
1
Fast Algorithms for Mining Frequent Itemsets
Ph.D. Dissertation Proposal (Chinese title: 挖掘頻繁項目集合之快速演算法研究)
Advisor: Prof. 張真誠 (Chin-Chen Chang)
Graduate student: 李育強 (Yu-Chiang Li)
Dept. of Computer Science and Information Engineering, National Chung Cheng University
Date: January 20, 2005
2
Outline
- Introduction
- Background and Related Work
- A New FP-Tree for Mining Frequent Itemsets
- Efficient Algorithms for Mining Share-Frequent Itemsets
- Conclusions
3
Introduction
- Data mining techniques have been developed to find a small set of precious nuggets in reams of data
- Mining association rules is one of the most important data mining problems
- It consists of two sub-problems:
  1. Identifying all frequent itemsets
  2. Using these frequent itemsets to generate association rules
- The first sub-problem plays an essential role in mining association rules
- Mining frequent itemsets and mining share-frequent itemsets
4
Background and Related Work
Support-Confidence Framework
- Each item is a binary variable denoting whether the item was purchased
- Apriori (Agrawal & Srikant, 1994) and Apriori-like algorithms
- Pattern-growth algorithms (Han et al., 2000; Han et al., 2004)
Share-Confidence Framework (Carter et al., 1997)
- The support-confidence framework does not analyze the exact number of products purchased
- The support count method does not measure the profit or cost of an itemset
- Exhaustive search algorithm
- Fast algorithms (but with errors: they may miss some share-frequent itemsets)
5
Support-Confidence Framework (1/3)
Apriori algorithm (Agrawal and Srikant, 1994), minSup = 40%
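The level-wise search that Apriori performs can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name `apriori` and the five-transaction database are assumptions made for the example (the slide's own database is a figure that did not survive), and only the threshold minSup = 40% comes from the slide.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for every frequent itemset."""
    min_count = min_sup * len(transactions)
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count step: one scan of the database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result

# Hypothetical database (NOT the one shown on the slide).
db = [frozenset(t) for t in (
    {"A", "C"}, {"B", "D"}, {"A", "C"}, {"B", "C", "D"}, {"A", "B", "D"},
)]
freq = apriori(db, 0.40)  # an itemset needs support count >= 2
```

With this hypothetical data, the four single items plus {A, C} and {B, D} are frequent, and no 3-itemset candidate survives the join step.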
6
Support-Confidence Framework (2/3)
FP-growth algorithm (Han et al., 2000; Han et al., 2004)
TID: 001 002 003 004 005 006
Frequent 1-itemsets (sorted): C A B D C A C B D A B D C B D
7
Support-Confidence Framework (3/3)
Conditional FP-tree of "D"
Conditional FP-tree of "BD"
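To make the FP-tree and the conditional step concrete, here is a small sketch of FP-tree construction and of collecting the prefix paths (the conditional pattern base) from which a conditional FP-tree is built. The class `FPNode`, the helper names, the tie-breaking rule for equally frequent items, and the six transactions are all assumptions for illustration; they are not taken from the slides.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count items and keep the frequent ones.
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root = FPNode(None)
    header = {i: [] for i in freq}  # item -> tree nodes that carry it
    # Pass 2: insert each transaction with items in descending frequency
    # (ties broken alphabetically, an arbitrary choice made here).
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for i in items:
            child = node.children.get(i)
            if child is None:
                child = FPNode(i, node)
                node.children[i] = child
                header[i].append(child)
            child.count += 1
            node = child
    return root, header

def prefix_paths(header, item):
    """Conditional pattern base of `item`: (prefix path, count) pairs."""
    paths = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        paths.append((path[::-1], node.count))
    return paths

# Hypothetical transactions (NOT the table on the slide).
db = [["C", "A"], ["B", "D"], ["C", "A"],
      ["C", "B", "D"], ["A", "B", "D"], ["C", "B", "D"]]
root, header = build_fp_tree(db, min_count=2)
base_d = sorted(prefix_paths(header, "D"))
```

Here `base_d` holds the two prefix paths that end in D; FP-growth would build the conditional FP-tree of "D" from exactly these weighted paths and recurse.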
8
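Share-Confidence Framework (1/6)
- Measure value: $mv(i_p, T_q)$, the value of item $i_p$ in transaction $T_q$; e.g. $mv(\{D\}, T01) = 1$, $mv(\{C\}, T03) = 3$
- Transaction measure value: $tmv(T_q) = \sum_{i_p \in T_q} mv(i_p, T_q)$; e.g. $tmv(T02) = 9$
- Total measure value: $Tmv(DB) = \sum_{T_q \in DB} tmv(T_q)$; e.g. $Tmv(DB) = 44$
- Itemset measure value: $imv(X, T_q) = \sum_{i_p \in X} mv(i_p, T_q)$ for $X \subseteq T_q$; e.g. $imv(\{A, E\}, T02) = 4$
- Local measure value: $lmv(X) = \sum_{T_q \in db_X} imv(X, T_q)$, where $db_X$ is the set of transactions containing X; e.g. $lmv(\{BC\}) = 2 + 4 + 5 = 11$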
9
Share-Confidence Framework (2/6)
minShare = 30%
Itemset share: $SH(X) = lmv(X) / Tmv(DB)$; e.g. $SH(\{BC\}) = 11/44 = 25\%$
SH-frequent: if $SH(X) \ge minShare$, X is a share-frequent (SH-frequent) itemset. At minShare = 30%, {BC} is therefore not SH-frequent.
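The measure values and the share of an itemset are straightforward to compute once each transaction records a quantity per item. The sketch below assumes a dictionary-based layout and a four-transaction database invented for illustration; the function names mirror the slide's notation (tmv, Tmv, imv, lmv, SH) but the data is not the slide's.

```python
def tmv(transaction):
    """Transaction measure value: sum of all item quantities."""
    return sum(transaction.values())

def total_mv(db):
    """Total measure value Tmv(DB)."""
    return sum(tmv(t) for t in db.values())

def imv(itemset, transaction):
    """Itemset measure value; 0 if the transaction does not contain X."""
    if not all(i in transaction for i in itemset):
        return 0
    return sum(transaction[i] for i in itemset)

def lmv(itemset, db):
    """Local measure value: imv summed over the transactions containing X."""
    return sum(imv(itemset, t) for t in db.values())

def share(itemset, db):
    """Itemset share SH(X) = lmv(X) / Tmv(DB)."""
    return lmv(itemset, db) / total_mv(db)

# Hypothetical database: TID -> {item: quantity purchased}.
db = {
    "T01": {"A": 1, "B": 2, "C": 1},
    "T02": {"B": 1, "C": 3, "D": 2},
    "T03": {"A": 2, "D": 1},
    "T04": {"B": 3, "C": 2, "E": 2},
}
MIN_SHARE = 0.30
sh_bc = share({"B", "C"}, db)   # lmv = 3 + 4 + 5 = 12, Tmv = 20
sh_ad = share({"A", "D"}, db)   # lmv = 3, Tmv = 20
```

In this toy database {B, C} is SH-frequent at minShare = 30% while {A, D} is not.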
10
Share-Confidence Framework (3/6)
- ZP (Zero Pruning), ZSP (Zero Subset Pruning): variants of exhaustive search; they prune the candidate itemsets whose local measure values are exactly zero
- SIP (Share Infrequent Pruning): like Apriori; with errors
- CAC (Combine All Counted), PCAC (Parametric CAC): derived from ZSP, using a prediction function; with errors
- IAB (Item Add-Back), PIAB (Parametric IAB): join each share-frequent itemset with each 1-itemset; with errors
- Existing algorithms are either inefficient or do not discover the complete set of share-frequent (SH-frequent) itemsets
11
Share-Confidence Framework (4/6)
ZP Algorithm; SIP & IAB Algorithms
12
Share-Confidence Framework (5/6)
ZSP Algorithm
13
Share-Confidence Framework (6/6)
CAC Algorithm
$PSH(XY) = SH(X) + SH(Y) \times |db_X| / |DB|$, if $|db_X| < |db_Y|$ ... (1)
$PSH(XY) = SH(Y) + SH(X) \times |db_Y| / |DB|$, if $|db_Y| < |db_X|$ ... (2)
$PSH(XY) = ((1) + (2)) / 2$, if $|db_Y| = |db_X|$
$PSH(AB) = (22.7\% + 18.2\% \times 4/6 + 18.2\% + 22.7\% \times 4/6) / 2 = 34.1\%$
$PSH(AE) = 9.1\% + 22.7\% \times (2/6) = 16.7\% < 30\%$
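The predicted-share function of CAC can be checked numerically. The sketch below transcribes formulas (1) and (2) into a hypothetical helper `predicted_share`; the input values (SH(A) = 22.7%, SH(B) = 18.2%, SH(E) = 9.1%, |db_A| = |db_B| = 4, |db_E| = 2, |DB| = 6) are read off the two worked examples on the slide rather than from the underlying database, which is not available here.

```python
def predicted_share(sh_x, size_x, sh_y, size_y, db_size):
    """Predicted share PSH(XY) from the shares and supports of X and Y.

    size_x and size_y are |db_X| and |db_Y|, the numbers of transactions
    containing X and Y; db_size is |DB|.
    """
    case_1 = sh_x + sh_y * size_x / db_size   # formula (1)
    case_2 = sh_y + sh_x * size_y / db_size   # formula (2)
    if size_x < size_y:
        return case_1
    if size_y < size_x:
        return case_2
    return (case_1 + case_2) / 2

psh_ab = predicted_share(0.227, 4, 0.182, 4, 6)   # equal supports: average
psh_ae = predicted_share(0.227, 4, 0.091, 2, 6)   # |db_E| < |db_A|: formula (2)
print(round(psh_ab * 100, 1), round(psh_ae * 100, 1))
```

Because PSH is only a prediction, an itemset such as AE that is predicted to fall below minShare is discarded without being counted, which is one way an algorithm of this kind can miss SH-frequent itemsets.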
14
A New FP-Tree for Mining Frequent Itemsets (1/3)
NFP-growth algorithm: NFP-tree construction
15
A New FP-Tree for Mining Frequent Itemsets (2/3)
TID: 001 002 003 004 005 006
Frequent 1-itemsets (sorted): C A B D C A C B D A B D C B D
16
A New FP-Tree for Mining Frequent Itemsets (3/3)
Conditional NFP-tree of "D(3,4)"
17
Experimental Results (1/4)
- PC: Pentium IV 1.5 GHz, 1 GB SDRAM, running Windows 2000 Professional
- All algorithms were coded in VC++ 6.0
- Datasets:
  - Real: BMS-WebView-1, BMS-WebView-2, Connect-4
  - Artificial: generated by the IBM synthetic data generator
- Parameters of the synthetic datasets:
  |D|  Number of transactions in DB
  |T|  Mean size of the transactions
  |I|  Mean size of the maximal potentially frequent itemsets
  |L|  Number of maximal potentially frequent itemsets
  N    Number of items
18
Experimental Results (2/4)
19
Experimental Results (3/4)
20
Experimental Results (4/4)
21
A Fast Algorithm for Mining Share-Frequent Itemsets
FSM: Fast Share Measure algorithm
- ML: maximum transaction length in DB
- MV: maximum measure value in DB
- $min\_lmv = minShare \times Tmv(DB)$
Level Closure Property: given a minShare and a k-itemset X,
- Theorem 1. If $lmv(X) + (lmv(X)/k) \times MV < min\_lmv$, all supersets of X with length k+1 are infrequent
- Theorem 2. If $lmv(X) + (lmv(X)/k) \times MV \times k' < min\_lmv$, all supersets of X with length k+k' are infrequent
- Corollary 1. If $lmv(X) + (lmv(X)/k) \times MV \times (ML - k) < min\_lmv$, all supersets of X are infrequent
22
FSM: Fast Share Measure algorithm
minShare = 30%
Let $CF(X) = lmv(X) + (lmv(X)/k) \times MV \times (ML - k)$
Prune X if $CF(X) < min\_lmv$
Example: $CF(\{ABC\}) = 3 + (3/3) \times 3 \times (6 - 3) = 12 < 14 = min\_lmv$
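The pruning test of FSM amounts to one arithmetic expression per candidate. The following sketch evaluates it for the slide's example; the function names `fsm_bound` and `critical_function` are assumptions, while the numbers (lmv = 3, k = 3, MV = 3, ML = 6, min_lmv = 14) are those of the example above.

```python
def fsm_bound(lmv_x, k, mv_max, extra_len):
    """Upper bound on lmv of any superset of X with k + extra_len items
    (Theorem 2, with extra_len playing the role of k')."""
    return lmv_x + (lmv_x / k) * mv_max * extra_len

def critical_function(lmv_x, k, mv_max, ml):
    """CF(X): the bound for the longest possible superset (Corollary 1)."""
    return fsm_bound(lmv_x, k, mv_max, ml - k)

MIN_LMV = 14  # value given on the slide for minShare = 30%

cf_abc = critical_function(lmv_x=3, k=3, mv_max=3, ml=6)
prune_abc = cf_abc < MIN_LMV
print(cf_abc, prune_abc)
```

Since CF({ABC}) = 12 is below min_lmv, no superset of {ABC} can be SH-frequent, so {ABC} is dropped from the set used to generate longer candidates.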
23
Experimental Results (1/2)
Dataset: T4.I2.D100k.N50.S10, minShare = 0.8%, ML = 14

Pass (k)   Count  ZSP      FSM(1)  FSM(2)  FSM(3)  FSM(ML-k)
k=1        C_k    50       50      50      50      50
           RC_k   50       49      49      49      50
           F_k    32       32      32      32      32
k=2        C_k    1225     1176    1176    1176    1225
           RC_k   1219     570     754     845     1085
           F_k    119      119     119     119     119
k=3        C_k    19327    4256    7062    8865    14886
           RC_k   17217    868     1685    2410    5951
           F_k    65       65      65      65      65
k=4        C_k    165077   1725    3233    5568    24243
           RC_k   107397   232     644     1236    6117
           F_k    9        9       9       9       9
k=5        C_k    406374   81      258     717     6309
           RC_k   266776   5       40      109     1199
           F_k    0        0       0       0       0
k=6        C_k    369341   0       1       4       287
           RC_k   310096   0       0       0       37
           F_k    0        0       0       0       0
k>=7       C_k    365975   0       0       0       0
           RC_k   359471   0       0       0       0
           F_k    0        0       0       0       0
Time(sec)         10349.9  2.30    2.98    3.31    11.24
24
Experimental Results (2/2)
25
Efficient Algorithms for Mining Share-Frequent Itemsets
EFSM (Enhanced FSM): instead of joining two arbitrary itemsets in $RC_{k-1}$, EFSM joins each itemset of $RC_{k-1}$ with a single item in $RC_1$ to generate $C_k$ efficiently
Reduces the time complexity from $O(n^{2k-2})$ to $O(n^k)$
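The join step described for EFSM can be sketched in a few lines. This covers only candidate generation as the slide describes it (each itemset of RC_{k-1} extended by one item of RC_1); the function name `efsm_candidates` and the sample sets are hypothetical, and any further checks the full algorithm performs are not modelled.

```python
def efsm_candidates(rc_prev, rc_1):
    """Generate C_k by extending each (k-1)-itemset in RC_{k-1}
    with a single item drawn from RC_1."""
    items = {i for s in rc_1 for i in s}
    # frozenset | set yields a frozenset, so duplicates collapse in the result.
    return {x | {i} for x in rc_prev for i in items if i not in x}

# Hypothetical retained candidates.
rc_1 = {frozenset("A"), frozenset("B"), frozenset("C"), frozenset("D")}
rc_2 = {frozenset("AB"), frozenset("CD")}
c_3 = efsm_candidates(rc_2, rc_1)
```

With |RC_{k-1}| itemsets and |RC_1| items the loop performs |RC_{k-1}| × |RC_1| extensions, instead of comparing every pair of itemsets in RC_{k-1}.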
26
Efficient Algorithms for Mining Share-Frequent Itemsets
- $X^{k+1}$: an arbitrary superset of X with length k+1 in DB
- $S(X^{k+1})$: the set that contains all $X^{k+1}$ in DB
- $db_{S(X^{k+1})}$: the set of transactions each of which contains at least one $X^{k+1}$
- SuFSM and ShFSM are derived from EFSM and prune the candidates more efficiently than FSM
SuFSM (Support-counted FSM):
- Theorem 3. If $lmv(X) + Sup(S(X^{k+1})) \times MV \times (ML - k) < min\_lmv$, all supersets of X are infrequent
27
SuFSM (Support-counted FSM)
$lmv(X)/k \ge Sup(X) \ge Sup(S(X^{k+1})) \ge maxSup(X^{k+1})$
Ex. $lmv(\{BCD\})/k = 15/3 = 5$, $Sup(\{BCD\}) = 3$, $Sup(S(\{BCD\}^{k+1})) = 2$, $maxSup(X^{k+1}) = 1$
If any one of the following four inequalities holds, no superset of X is an SH-frequent itemset; each test is at least as tight as the one before it:
- $lmv(X) + (lmv(X)/k) \times MV \times (ML - k) < min\_lmv$
- $lmv(X) + Sup(X) \times MV \times (ML - k) < min\_lmv$
- $lmv(X) + Sup(S(X^{k+1})) \times MV \times (ML - k) < min\_lmv$
- $lmv(X) + maxSup(X^{k+1}) \times MV \times (ML - k) < min\_lmv$
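The effect of replacing lmv(X)/k by progressively smaller support counts can be seen by evaluating the four left-hand sides for the {BCD} example. The sketch below uses the slide's values for {BCD} (lmv = 15, k = 3, and the multipliers 5, 3, 2, 1) and assumes MV = 3 and ML = 6, the values used in the earlier CF({ABC}) example; `upper_bound` is a hypothetical helper name.

```python
def upper_bound(lmv_x, multiplier, mv_max, ml, k):
    """Left-hand side shared by all four tests: only the multiplier changes."""
    return lmv_x + multiplier * mv_max * (ml - k)

LMV_BCD, K = 15, 3
MV, ML = 3, 6          # assumed: same database as the CF({ABC}) example

multipliers = {
    "lmv(X)/k": LMV_BCD / K,    # 5, used by FSM
    "Sup(X)": 3,
    "Sup(S(X^{k+1}))": 2,       # used by SuFSM
    "maxSup(X^{k+1})": 1,
}
bounds = [upper_bound(LMV_BCD, m, MV, ML, K) for m in multipliers.values()]
print(bounds)
```

The bounds shrink from left to right, so a test further down the list can prune candidates that the FSM test would have kept. None of them prunes {BCD} itself under these assumptions, since lmv({BCD}) = 15 alone already exceeds min_lmv = 14.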
28
ShFSM (Share-counted FSM)
Theorem 4. If $Tmv(db_{S(X^{k+1})}) < min\_lmv$, all supersets of X are infrequent
Pruning conditions compared:
- FSM: $lmv(X) + (lmv(X)/k) \times MV \times (ML - k) < min\_lmv$
- SuFSM: $lmv(X) + Sup(S(X^{k+1})) \times MV \times (ML - k) < min\_lmv$
- ShFSM: $Tmv(db_{S(X^{k+1})}) < min\_lmv$
29
ShFSM (Share-counted FSM)
Ex. X = {AB}: $Tmv(db_{S(X^{k+1})}) = tmv(T01) + tmv(T05) = 6 + 6 = 12 < 14 = min\_lmv$
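The ShFSM test needs only the transaction measure values of the transactions that contain X together with at least one further item. A minimal sketch, assuming the same dictionary layout as a quantity-per-item database; the four transactions are invented so that the result mirrors the slide's arithmetic (6 + 6 = 12 against min_lmv = 14), and `shfsm_bound` is a hypothetical name.

```python
def shfsm_bound(itemset, db):
    """Tmv over db_{S(X^{k+1})}: transactions that contain X plus at
    least one other item, i.e. that contain some (k+1)-superset of X."""
    x = set(itemset)
    return sum(
        sum(t.values())
        for t in db.values()
        if x.issubset(t) and len(t) > len(x)
    )

# Hypothetical database: TID -> {item: quantity}.
db = {
    "T01": {"A": 1, "B": 2, "C": 3},   # tmv = 6, contains AB plus C
    "T02": {"A": 2, "B": 1},           # contains AB only: no superset here
    "T03": {"C": 4, "D": 1},           # does not contain AB
    "T05": {"A": 1, "B": 1, "E": 4},   # tmv = 6, contains AB plus E
}
MIN_LMV = 14
bound_ab = shfsm_bound({"A", "B"}, db)
prune_ab = bound_ab < MIN_LMV
```

Any superset of X can only collect measure value from these transactions, so their total is a valid upper bound on the lmv of every superset, and it involves neither MV nor ML.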
30
Experimental Results (1/4)
31
Experimental Results (2/4), minShare = 0.3%
32
Experimental Results (3/4), minShare = 0.3%
33
Experimental Results (4/4)
Dataset: T6.I4.D100k.N200.S10, minShare = 0.1%, ML = 20

Pass (k)   Count  FSM      EFSM     SuFSM   ShFSM   F_k
k=1        C_k    200      200      200     200     159
           RC_k   200      200      199     197
k=2        C_k    19900    19900    19701   19306   1844
           RC_k   16214    16214    13312   7199
k=3        C_k    829547   829547   564324  190607  101
           RC_k   251877   251877   99765   9792
k=4        C_k    3290296  3290296  793042  20913   0
           RC_k   332877   332877   41057   1420
k=5        C_k    393833   393833   25003   1050    5
           RC_k   71420    71420    19720   959
k=6        C_k    26137    26137    11582   518     8
           RC_k   25562    25562    11045   506
k=7        C_k    11141    11141    5940    204     7
           RC_k   11099    11099    5827    196
k=8        C_k    4426     4426     2797    58      1
           RC_k   4423     4423     2750    54
k>=9       C_k    2036     2036     1567    12      0
           RC_k   2030     2030     1513    10
Time(sec)         13610.4  71.55    29.67   10.95
34
Conclusions
Support measure
- NFP-growth uses two counters per tree node to reduce the number of tree nodes
- It applies a smaller tree and header table to discover frequent itemsets efficiently
- Future work: consider developing superior data structures and extending the pattern-growth approach
35
Share measure
- The proposed algorithms efficiently reduce the number of candidates to be counted
- ShFSM performs best
- Future work: consider developing superior algorithms to accelerate the identification of all SH-frequent itemsets
36
ShFSM: $Tmv(db_{S(X^{k+1})}) < min\_lmv$
37
Thank You!