
1 Fast Algorithms for Mining Frequent Itemsets
Advisor: Prof. 張真誠
Graduate student: 李育強
Dept. of Computer Science and Information Engineering, National Chung Cheng University
Date: January 20, 2005
Ph.D. dissertation proposal: A Study of Fast Algorithms for Mining Frequent Itemsets

2 Outline
- Introduction
- Background and Related Work
- A New FP-Tree for Mining Frequent Itemsets
- Efficient Algorithms for Mining Share-Frequent Itemsets
- Conclusions

3 Introduction
- Data mining techniques have been developed to find a small set of precious nuggets from reams of data.
- Mining association rules constitutes one of the most important data mining problems.
- It consists of two sub-problems:
  - identifying all frequent itemsets
  - using these frequent itemsets to generate association rules
- The first sub-problem plays the essential role in mining association rules.
- This proposal addresses mining frequent itemsets and mining share-frequent itemsets.

4 Background and Related Work
- Support-confidence framework
  - Each item is a binary variable denoting whether the item was purchased.
  - Apriori (Agrawal & Srikant, 1994) and Apriori-like algorithms
  - Pattern-growth algorithms (Han et al., 2000; Han et al., 2004)
- Share-confidence framework (Carter et al., 1997)
  - The support-confidence framework does not analyze the exact number of products purchased; the support count does not measure the profit or cost of an itemset.
  - Exhaustive search algorithm
  - Fast algorithms (but with errors)

5 Support-Confidence Framework (1/3): Apriori algorithm (Agrawal and Srikant, 1994), minSup = 40%
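
As an aside, the level-wise Apriori flow shown on this slide can be sketched in a few lines of Python; the toy transactions and minSup value below are made up for illustration and are not the thesis's example database.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise mining of frequent itemsets (Apriori, Agrawal & Srikant, 1994)."""
    transactions = [frozenset(t) for t in transactions]
    min_count = min_sup * len(transactions)

    # F1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {iset for iset, c in counts.items() if c >= min_count}
    all_frequent = set(frequent)

    k = 2
    while frequent:
        # Candidate generation: join F_{k-1} with itself, keep size-k unions
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Count supports in one scan of the database
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= min_count}
        all_frequent |= frequent
        k += 1
    return all_frequent

# Hypothetical toy database; minSup = 40% as on the slide
db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}, {'A', 'C'}]
print(sorted(sorted(i) for i in apriori(db, 0.4)))
```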

6 Support-Confidence Framework (2/3): FP-growth algorithm (Han et al., 2000; Han et al., 2004)
[Table: the six example transactions (TID 001-006) with their frequent 1-itemsets listed in sorted order]
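
The sorted frequent 1-itemset lists on this slide correspond to FP-growth's preprocessing step before tree construction. A minimal Python sketch of that step, using made-up transactions and a made-up minimum support count rather than the slide's table:

```python
from collections import Counter

def sort_frequent_items(transactions, min_count):
    """FP-growth preprocessing: keep only globally frequent items and sort each
    transaction by descending global frequency (the header-table order)."""
    freq = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in freq.items() if c >= min_count}
    ordered = []
    for t in transactions:
        kept = [i for i in t if i in frequent]
        # descending frequency; ties broken lexicographically for determinism
        kept.sort(key=lambda i: (-freq[i], i))
        ordered.append(kept)
    return ordered

# Hypothetical example (not the slide's database)
db = [['C', 'A', 'B', 'D'], ['C', 'A'], ['C', 'B', 'D']]
print(sort_frequent_items(db, min_count=2))
```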

7 Support-Confidence Framework (3/3): conditional FP-tree of "D"; conditional FP-tree of "BD"

8 Share-Confidence Framework (1/6)
- Measure value mv(i_p, T_q): the quantity of item i_p purchased in transaction T_q, e.g., mv({D}, T01) = 1, mv({C}, T03) = 3
- Transaction measure value: tmv(T_q) = Σ_{i_p ∈ T_q} mv(i_p, T_q), e.g., tmv(T02) = 9
- Total measure value: Tmv(DB) = Σ_{T_q ∈ DB} tmv(T_q), e.g., Tmv(DB) = 44
- Itemset measure value: imv(X, T_q) = Σ_{i_p ∈ X} mv(i_p, T_q), e.g., imv({A, E}, T02) = 4
- Local measure value: lmv(X) = Σ_{T_q ∈ db_X} imv(X, T_q), e.g., lmv({BC}) = 2 + 4 + 5 = 11

9 Share-Confidence Framework (2/6)
- minShare = 30%
- Itemset share: SH(X) = lmv(X) / Tmv(DB), e.g., SH({BC}) = 11/44 = 25%
- SH-frequent: if SH(X) >= minShare, X is a share-frequent (SH-frequent) itemset
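
As a rough illustration of the definitions on slides 8-9, the sketch below computes tmv, Tmv, imv, lmv, and SH over a small quantity-annotated database. The transaction contents and quantities are invented; only tmv(T02) = 9 and imv({A, E}, T02) = 4 are arranged to match the slide's figures.

```python
# Each transaction maps item -> measure value (e.g., purchased quantity).
# Hypothetical mini-database; the slide's Tmv(DB) = 44 refers to the thesis's full example.
DB = {
    'T01': {'A': 1, 'D': 1},
    'T02': {'A': 2, 'B': 3, 'E': 2, 'F': 2},  # tmv(T02) = 9
    'T03': {'C': 3, 'D': 1},
}

def tmv(tid):                       # transaction measure value
    return sum(DB[tid].values())

def Tmv():                          # total measure value of the database
    return sum(tmv(t) for t in DB)

def imv(X, tid):                    # itemset measure value of X in one transaction
    t = DB[tid]
    return sum(t[i] for i in X) if all(i in t for i in X) else 0

def lmv(X):                         # local measure value: sum of imv over db_X
    return sum(imv(X, t) for t in DB)

def SH(X):                          # itemset share
    return lmv(X) / Tmv()

X = {'A', 'E'}
print(imv(X, 'T02'), lmv(X), round(SH(X), 3))
# X is SH-frequent iff SH(X) >= minShare
```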

10 Share-Confidence Framework (3/6)
- ZP (Zero Pruning) and ZSP (Zero Subset Pruning): variants of exhaustive search; they prune only the candidate itemsets whose local measure values are exactly zero.
- SIP (Share Infrequent Pruning): Apriori-like pruning, with errors.
- CAC (Combine All Counted) and PCAC (Parametric CAC): derived from ZSP, using a prediction function, with errors.
- IAB (Item Add-Back) and PIAB (Parametric IAB): join each share-frequent itemset with each 1-itemset, with errors.
- Existing algorithms are either inefficient or do not discover the complete set of share-frequent (SH-frequent) itemsets.

11 Share-Confidence Framework (4/6): ZP algorithm; SIP and IAB algorithms

12 Share-Confidence Framework (5/6): ZSP algorithm

13 Share-Confidence Framework (6/6): CAC algorithm
- PSH(XY) = SH(X) + SH(Y) × |db_X|/|DB|, if |db_X| < |db_Y| … (1)
- PSH(XY) = SH(Y) + SH(X) × |db_Y|/|DB|, if |db_Y| < |db_X| … (2)
- PSH(XY) = ((1) + (2))/2, if |db_X| = |db_Y|
- PSH(AB) = (22.7% + 18.2% × 4/6 + 18.2% + 22.7% × 4/6)/2 = 34.1%
- PSH(AE) = 9.1% + 22.7% × (2/6) = 16.7% < 30%
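
The PSH estimate above is simple arithmetic; the sketch below reproduces the slide's two example values. The share percentages and db sizes are read off the slide's example, while the function name psh and the argument layout are ours.

```python
def psh(sh_x, sh_y, db_x, db_y, db_size):
    """CAC's predicted share of the joined itemset XY (slide 13)."""
    if db_x < db_y:
        return sh_x + sh_y * db_x / db_size          # case (1)
    if db_y < db_x:
        return sh_y + sh_x * db_y / db_size          # case (2)
    # equal db sizes: average of (1) and (2)
    return (sh_x + sh_y * db_x / db_size +
            sh_y + sh_x * db_y / db_size) / 2

# Values taken from the slide's example
print(round(psh(0.227, 0.182, 4, 4, 6), 3))  # PSH(AB) ~= 0.341
print(round(psh(0.091, 0.227, 2, 6, 6), 3))  # PSH(AE) ~= 0.167 < minShare = 0.30
```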

14 A New FP-Tree for Mining Frequent Itemsets (1/3): the NFP-growth algorithm; NFP-tree construction

15 A New FP-Tree for Mining Frequent Itemsets (2/3)
[Table: the six example transactions (TID 001-006) with their frequent 1-itemsets in sorted order, as on slide 6]

16 A New FP-Tree for Mining Frequent Itemsets (3/3): conditional NFP-tree of "D(3,4)"

17 Experimental Results (1/4)
- PC: Pentium IV 1.5 GHz, 1 GB SDRAM, running Windows 2000 Professional
- All algorithms were coded in VC++ 6.0
- Datasets:
  - Real: BMS-WebView-1, BMS-WebView-2, Connect-4
  - Artificial: generated by the IBM synthetic data generator
- Generator parameters:
  - |D|: number of transactions in DB
  - |T|: mean size of the transactions
  - |I|: mean size of the maximal potentially frequent itemsets
  - |L|: number of maximal potentially frequent itemsets
  - N: number of items

18 Experimental Results (2/4)

19 Experimental Results (3/4)

20 Experimental Results (4/4)

21 A Fast Algorithm for Mining Share-Frequent Itemsets
- FSM: Fast Share Measure algorithm
- ML: maximum transaction length in DB
- MV: maximum measure value in DB
- min_lmv = minShare × Tmv(DB)
- Level Closure Property: given a minShare and a k-itemset X,
  - Theorem 1. If lmv(X) + (lmv(X)/k) × MV < min_lmv, all supersets of X with length k+1 are infrequent.
  - Theorem 2. If lmv(X) + (lmv(X)/k) × MV × k' < min_lmv, all supersets of X with length k+k' are infrequent.
  - Corollary 1. If lmv(X) + (lmv(X)/k) × MV × (ML-k) < min_lmv, all supersets of X are infrequent.

22 FSM: Fast Share Measure algorithm
- minShare = 30%
- Let CF(X) = lmv(X) + (lmv(X)/k) × MV × (ML-k); prune X if CF(X) < min_lmv
- CF({ABC}) = 3 + (3/3) × 3 × (6-3) = 12 < 14 = min_lmv
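
The critical function CF(X) and its pruning test can be expressed directly; a minimal sketch reproducing the CF({ABC}) computation, with MV = 3, ML = 6, and min_lmv = 14 taken from the slide's example:

```python
def critical_function(lmv_x, k, MV, ML):
    """FSM's critical function CF(X) = lmv(X) + (lmv(X)/k) * MV * (ML - k).
    If CF(X) < min_lmv, X and all of its supersets can be pruned (Corollary 1)."""
    return lmv_x + (lmv_x / k) * MV * (ML - k)

MV, ML, min_lmv = 3, 6, 14           # values from the slide's example
cf_abc = critical_function(lmv_x=3, k=3, MV=MV, ML=ML)
print(cf_abc, cf_abc < min_lmv)      # 12.0 True -> prune {A, B, C}
```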

23 Experimental Results (1/2)
- Dataset: T4.I2.D100k.N50.S10, minShare = 0.8%, ML = 14
- Methods compared: ZSP, FSM(1), FSM(2), FSM(3), FSM(ML-k)
- [Table: per-pass numbers of candidates C_k, counted candidates RC_k, and SH-frequent itemsets F_k, for k = 1 up to k >= 7]
- Time (sec): ZSP 10349.9, FSM(1) 2.30, FSM(2) 2.98, FSM(3) 3.31, FSM(ML-k) 11.24

24 Experimental Results (2/2)

25 Efficient Algorithms for Mining Share-Frequent Itemsets
- EFSM (Enhanced FSM): instead of joining arbitrary pairs of itemsets in RC_{k-1}, EFSM joins each itemset of RC_{k-1} with a single item in RC_1 to generate C_k efficiently.
- This reduces the time complexity of candidate generation from O(n^(2k-2)) to O(n^k).
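
A minimal sketch of the EFSM candidate-generation step described above; the contents of RC_1 and RC_2 below are hypothetical:

```python
def efsm_candidates(rc_prev, rc_1):
    """EFSM candidate generation: join each itemset in RC_{k-1} with each
    single item in RC_1, keeping only the size-k unions."""
    k = len(next(iter(rc_prev))) + 1 if rc_prev else 1
    candidates = set()
    for x in rc_prev:
        for item_set in rc_1:
            c = x | item_set
            if len(c) == k:
                candidates.add(c)
    return candidates

# Hypothetical remaining candidates after pass k-1 = 2
rc_1 = {frozenset({i}) for i in 'ABCD'}
rc_2 = {frozenset('AB'), frozenset('BC'), frozenset('BD')}
print(sorted(''.join(sorted(c)) for c in efsm_candidates(rc_2, rc_1)))
```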

26 Efficient Algorithms for Mining Share-Frequent Itemsets
- X^{k+1}: an arbitrary superset of X with length k+1 in DB
- S(X^{k+1}): the set that contains all X^{k+1} in DB
- db_{S(X^{k+1})}: the set of transactions each of which contains at least one X^{k+1}
- SuFSM and ShFSM are derived from EFSM and prune candidates more efficiently than FSM.
- SuFSM (Support-counted FSM): Theorem 3. If lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv, all supersets of X are infrequent.

27 SuFSM (Support-counted FSM)
- lmv(X)/k >= Sup(X) >= Sup(S(X^{k+1})) >= maxSup(X^{k+1})
- Example: lmv({BCD})/k = 15/3 = 5, Sup({BCD}) = 3, Sup(S({BCD}^{k+1})) = 2, maxSup(X^{k+1}) = 1
- If no superset of X is an SH-frequent itemset, then the following four inequalities hold:
  - lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
  - lmv(X) + Sup(X) × MV × (ML - k) < min_lmv
  - lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv
  - lmv(X) + maxSup(X^{k+1}) × MV × (ML - k) < min_lmv

28 ShFSM (Share-counted FSM)
- Theorem 4. If Tmv(db_{S(X^{k+1})}) < min_lmv, all supersets of X are infrequent.
- FSM:   lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
- SuFSM: lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv
- ShFSM: Tmv(db_{S(X^{k+1})}) < min_lmv

29 ShFSM (Share-counted FSM)
- Example: X = {AB}
- Tmv(db_{S(X^{k+1})}) = tmv(T01) + tmv(T05) = 6 + 6 = 12 < 14 = min_lmv
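
One way to read Theorem 4: a transaction contributes to Tmv(db_{S(X^{k+1})}) exactly when it contains X together with at least one additional item, i.e., when it properly contains X. Under that reading, the sketch below checks the slide's {AB} example; the per-item quantities are invented so that tmv(T01) = tmv(T05) = 6 as on the slide, and min_lmv = 14 is taken from the example.

```python
# Each transaction maps item -> measure value (hypothetical quantities).
DB = {
    'T01': {'A': 2, 'B': 1, 'C': 3},   # tmv = 6
    'T05': {'A': 1, 'B': 2, 'D': 3},   # tmv = 6
    'T06': {'A': 4, 'B': 1},           # contains exactly {A, B}: no (k+1)-superset
}

def shfsm_prunable(X, min_lmv):
    """ShFSM check (Theorem 4): if the total measure value of the transactions
    containing a proper superset of X is below min_lmv, every superset of X
    is SH-infrequent and can be pruned."""
    X = set(X)
    total = 0
    for items in DB.values():
        if X <= items.keys() and len(items) > len(X):   # proper superset present
            total += sum(items.values())                 # tmv of that transaction
    return total < min_lmv, total

prunable, tmv_sum = shfsm_prunable({'A', 'B'}, min_lmv=14)
print(tmv_sum, prunable)   # 12 True -> prune all supersets of {A, B}
```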

30 Experimental Results (1/4)

31 Experimental Results (2/4): minShare = 0.3%

32 Experimental Results (3/4): minShare = 0.3%

33 Experimental Results (4/4)
- Dataset: T6.I4.D100k.N200.S10, minShare = 0.1%, ML = 20
- Methods compared: FSM, EFSM, SuFSM, ShFSM
- [Table: per-pass numbers of candidates C_k, counted candidates RC_k, and SH-frequent itemsets F_k, for k = 1 up to k >= 9, together with total running times]

34 Conclusions
- Support measure
  - The proposed NFP-tree uses two counters per tree node to reduce the number of tree nodes.
  - It applies a smaller tree and header table to discover frequent itemsets efficiently.
  - Future work: consider the development of superior data structures and extend the pattern-growth approach.

35 Share measure
  - The proposed algorithms efficiently decrease the number of candidates to be counted.
  - ShFSM shows the best performance.
  - Future work: consider the development of superior algorithms to accelerate the process of identifying all SH-frequent itemsets.

36 ShFSM: Tmv(db_{S(X^{k+1})}) < min_lmv

37 Thank You!

