Fast Algorithms for Mining Frequent Itemsets. Advisor: Prof. Chin-Chen Chang (張真誠); Student: Yu-Chiang Li (李育強). Dept. of Computer Science and Information Engineering, National Chung Cheng University.

Similar presentations
Association Analysis (2). Example TIDList of item ID’s T1I1, I2, I5 T2I2, I4 T3I2, I3 T4I1, I2, I4 T5I1, I3 T6I2, I3 T7I1, I3 T8I1, I2, I3, I5 T9I1, I2,
Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
A distributed method for mining association rules
Data Mining (Apriori Algorithm)DCS 802, Spring DCS 802 Data Mining Apriori Algorithm Spring of 2002 Prof. Sung-Hyuk Cha School of Computer Science.
Mining Frequent Patterns in Data Streams at Multiple Time Granularities CS525 Paper Presentation Presented by: Pei Zhang, Jiahua Liu, Pengfei Geng and.
FP-Growth algorithm Vasiljevic Vladica,
FP (FREQUENT PATTERN)-GROWTH ALGORITHM ERTAN LJAJIĆ, 3392/2013 Elektrotehnički fakultet Univerziteta u Beogradu.
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rules l Mining Association Rules between Sets of Items in Large Databases (R. Agrawal, T. Imielinski & A. Swami) l Fast Algorithms for.
Rakesh Agrawal Ramakrishnan Srikant
Efficiently Mining Long Patterns from Databases Roberto J. Bayardo Jr. IBM Almaden Research Center.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
1 Efficient Algorithms for Mining Share-Frequent Itemsets Authors: Y. C. Li, J. S. Yeh and C. C. Chang Speaker: Yu-Chiang Li Date :July 28, 2005.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Association Analysis (2). Example TIDList of item ID’s T1I1, I2, I5 T2I2, I4 T3I2, I3 T4I1, I2, I4 T5I1, I3 T6I2, I3 T7I1, I3 T8I1, I2, I3, I5 T9I1, I2,
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Analysis: Basic Concepts and Algorithms.
Data Mining Association Analysis: Basic Concepts and Algorithms
Pattern Lattice Traversal by Selective Jumps Osmar R. Zaïane and Mohammad El-Hajj Department of Computing Science, University of Alberta Edmonton, AB,
Fast Algorithms for Mining Association Rules * CS401 Final Presentation Presented by Lin Yang University of Missouri-Rolla * Rakesh Agrawal, Ramakrishnam.
© Vipin Kumar CSci 8980 Fall CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer.
2/8/00CSE 711 data mining: Apriori Algorithm by S. Cha 1 CSE 711 Seminar on Data Mining: Apriori Algorithm By Sung-Hyuk Cha.
Fast Algorithms for Association Rule Mining
1 Fast Algorithms for Mining Association Rules Rakesh Agrawal Ramakrishnan Srikant Slides from Ofer Pasternak.
Performance and Scalability: Apriori Implementation.
SEG Tutorial 2 – Frequent Pattern Mining.
Mining Association Rules between Sets of Items in Large Databases presented by Zhuang Wang.
Apriori algorithm Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK Presentation Lauri Lahti.
USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM.
Fast Algorithms for Mining Frequent Itemsets
林俊宏 Parallel Association Rule Mining based on FI-Growth Algorithm Bundit Manaskasemsak, Nunnapus Benjamas, Arnon Rungsawang.
Mining High Utility Itemsets without Candidate Generation Date: 2013/05/13 Author: Mengchi Liu, Junfeng Qu Source: CIKM "12 Advisor: Jia-ling Koh Speaker:
VLDB 2012 Mining Frequent Itemsets over Uncertain Databases Yongxin Tong 1, Lei Chen 1, Yurong Cheng 2, Philip S. Yu 3 1 The Hong Kong University of Science.
Sequential PAttern Mining using A Bitmap Representation
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Efficient Data Mining for Calling Path Patterns in GSM Networks Information Systems, accepted 5 December 2002 SPEAKER: YAO-TE WANG ( 王耀德 )
Mining High Utility Itemset in Big Data
Pattern-Growth Methods for Sequential Pattern Mining Iris Zhang
Frequent Item Mining. What is data mining? =Pattern Mining? What patterns? Why are they useful?
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
Outline Introduction – Frequent patterns and the Rare Item Problem – Multiple Minimum Support Framework – Issues with Multiple Minimum Support Framework.
1 FINDING FUZZY SETS FOR QUANTITATIVE ATTRIBUTES FOR MINING OF FUZZY ASSOCIATE RULES By H.N.A. Pham, T.W. Liao, and E. Triantaphyllou Department of Industrial.
Mining Frequent Itemsets from Uncertain Data Presenter : Chun-Kit Chui Chun-Kit Chui [1], Ben Kao [1] and Edward Hung [2] [1] Department of Computer Science.
CloSpan: Mining Closed Sequential Patterns in Large Datasets Xifeng Yan, Jiawei Han and Ramin Afshar Proceedings of 2003 SIAM International Conference.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Association Analysis This lecture node is modified based on Lecture Notes for.
1 AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Hong.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Reducing Number of Candidates Apriori principle: – If an itemset is frequent, then all of its subsets must also be frequent Apriori principle holds due.
Graph Indexing: A Frequent Structure-based Approach. Advisor: Prof. Vincent S. Tseng (曾新穆); Team members: 李彥寬, 洪世敏, 丁鏘巽, 黃冠霖, 詹博丞; Date: 2013/11/14.
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
Chapter 3 Data Mining: Classification & Association Chapter 4 in the text box Section: 4.3 (4.3.1),
Fast Mining Frequent Patterns with Secondary Memory Kawuu W. Lin, Sheng-Hao Chung, Sheng-Shiung Huang and Chun-Cheng Lin Department of Computer Science.
Reducing Number of Candidates
Data Mining Association Analysis: Basic Concepts and Algorithms
Association rule mining
Frequent Pattern Mining
William Norris Professor and Head, Department of Computer Science
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Association Analysis: Basic Concepts and Algorithms
Mining Frequent Itemsets over Uncertain Databases
Data Mining Association Analysis: Basic Concepts and Algorithms
COMP5331 FP-Tree Prepared by Raymond Wong Presented by Raymond Wong
Presentation transcript:

Fast Algorithms for Mining Frequent Itemsets. Advisor: Prof. Chin-Chen Chang (張真誠); Student: Yu-Chiang Li (李育強). Dept. of Computer Science and Information Engineering, National Chung Cheng University. Date: January 20, 2005. Ph.D. dissertation proposal (博士論文計畫書): A Study of Fast Algorithms for Mining Frequent Itemsets (挖掘頻繁項目集合之快速演算法研究).

2 Outline
- Introduction
- Background and Related Work
- A New FP-Tree for Mining Frequent Itemsets
- Efficient Algorithms for Mining Share-Frequent Itemsets
- Conclusions

3 Introduction
- Data mining techniques have been developed to find a small set of precious nuggets from reams of data.
- Mining association rules constitutes one of the most important data mining problems.
- Two sub-problems:
  - Identifying all frequent itemsets
  - Using these frequent itemsets to generate association rules
- The first sub-problem plays an essential role in mining association rules.
- This proposal studies mining frequent itemsets and mining share-frequent itemsets.

4 Background and Related Work
- Support-confidence framework
  - Each item is a binary variable denoting whether the item was purchased.
  - Apriori (Agrawal & Srikant, 1994) and Apriori-like algorithms
  - Pattern-growth algorithms (Han et al., 2000; Han et al., 2004)
- Share-confidence framework (Carter et al., 1997)
  - The support-confidence framework does not analyze the exact number of products purchased; the support count does not measure the profit or cost of an itemset.
  - Exhaustive search algorithm
  - Fast algorithms (but with errors)

5 Support-Confidence Framework (1/3) Apriori algorithm (Agrawal and Srikant, 1994): minSup = 40%
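Slide 5 only names Apriori and its minSup threshold, so a small reference implementation may help. The following Python sketch of the level-wise candidate-generate-and-test procedure is not the thesis code (the experiments note that the algorithms were coded in VC); the toy transactions are the five reconstructed for the FP-growth slide below, and minSup = 40% corresponds to a support count of at least 2.

from itertools import combinations

def apriori(transactions, min_sup):
    """Minimal level-wise Apriori; returns {frequent itemset: support count}."""
    n = len(transactions)
    min_count = min_sup * n
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = []
    for i in items:                       # L1: frequent 1-itemsets
        c = sum(1 for t in transactions if i in t)
        if c >= min_count:
            frequent[frozenset([i])] = c
            level.append(frozenset([i]))
    k = 2
    while level:
        # Candidate generation: join frequent (k-1)-itemsets, then apply the
        # Apriori property (every (k-1)-subset of a candidate must be frequent).
        candidates = set()
        for a in level:
            for b in level:
                u = a | b
                if len(u) == k and all(frozenset(s) in frequent
                                       for s in combinations(u, k - 1)):
                    candidates.add(u)
        level = []
        for c in candidates:              # count supports and keep the frequent ones
            cnt = sum(1 for t in transactions if c <= t)
            if cnt >= min_count:
                frequent[c] = cnt
                level.append(c)
        k += 1
    return frequent

# Toy data: with minSup = 40%, {A, B, D} and {B, C, D} are the maximal frequent itemsets.
db = [{'C', 'A', 'B', 'D'}, {'C', 'A'}, {'C', 'B', 'D'}, {'A', 'B', 'D'}, {'C', 'B', 'D'}]
print(apriori(db, 0.4))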

6 Support-Confidence Framework (2/3)
FP-growth algorithm (Han et al., 2000; Han et al., 2004)

TID   Frequent 1-itemsets (sorted)
T01   C, A, B, D
T02   C, A
T03   C, B, D
T04   A, B, D
T05   C, B, D
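For context on the conditional-tree slides that follow, here is a minimal Python sketch of standard FP-tree construction in the style of Han et al. (2000); it is not the thesis's NFP-tree. Items are reordered by descending support (ties broken alphabetically), which need not match the ordering shown in the table above, and min_count = 2 mirrors minSup = 40% over the five transactions.

class FPNode:
    """A node of the FP-tree: an item, a count, a parent link and child links."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count item supports and keep the frequent items.
    support = {}
    for t in transactions:
        for i in t:
            support[i] = support.get(i, 0) + 1
    frequent = {i for i, c in support.items() if c >= min_count}
    root = FPNode(None, None)
    header = {}                        # header table: item -> list of tree nodes
    # Pass 2: insert each transaction's frequent items along a shared prefix path.
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-support[i], i))
        node = root
        for i in ordered:
            if i in node.children:
                node.children[i].count += 1
            else:
                node.children[i] = FPNode(i, node)
                header.setdefault(i, []).append(node.children[i])
            node = node.children[i]
    return root, header

db = [{'C', 'A', 'B', 'D'}, {'C', 'A'}, {'C', 'B', 'D'}, {'A', 'B', 'D'}, {'C', 'B', 'D'}]
root, header = build_fp_tree(db, min_count=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})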

7 Support-Confidence Framework (3/3) Conditional FP-tree of "D"; conditional FP-tree of "BD".

8 Share-Confidence Framework (1/6)
- Measure value mv(i_p, T_q): the measure value (e.g., purchased quantity) of item i_p in transaction T_q; e.g., mv({D}, T01) = 1, mv({C}, T03) = 3.
- Transaction measure value: tmv(T_q) = Σ_{i_p ∈ T_q} mv(i_p, T_q); e.g., tmv(T02) = 9.
- Total measure value: Tmv(DB) = Σ_{T_q ∈ DB} tmv(T_q); here Tmv(DB) = 44.
- Itemset measure value: imv(X, T_q) = Σ_{i_p ∈ X} mv(i_p, T_q), for X ⊆ T_q; e.g., imv({A, E}, T02) = 4.
- Local measure value: lmv(X) = Σ_{T_q ∈ db_X} imv(X, T_q), where db_X is the set of transactions containing X; e.g., lmv({BC}) = 2 + 4 + 5 = 11.

9 Share-Confidence Framework (2/6)
- minShare = 30%
- Itemset share: SH(X) = lmv(X) / Tmv(DB); e.g., SH({BC}) = 11/44 = 25%.
- SH-frequent: if SH(X) >= minShare, X is a share-frequent (SH-frequent) itemset.
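To make these definitions concrete, the following Python sketch computes tmv, Tmv, imv, lmv and SH directly from a per-transaction quantity table. The table is hypothetical (the slides' full example database is not reproduced here), so the printed numbers are not the 9, 44 and 11/44 quoted above.

DB = {                                   # TID -> {item: measure value (e.g., quantity)}
    'T01': {'A': 1, 'B': 1, 'D': 1},
    'T02': {'A': 2, 'C': 3, 'E': 4},
    'T03': {'B': 2, 'C': 3},
    'T04': {'B': 1, 'C': 2, 'D': 3},
}

def tmv(tid):
    """Transaction measure value: sum of the measure values in one transaction."""
    return sum(DB[tid].values())

def Tmv():
    """Total measure value of the whole database."""
    return sum(tmv(tid) for tid in DB)

def imv(itemset, tid):
    """Itemset measure value of X in T_q (0 if T_q does not contain X)."""
    t = DB[tid]
    if not set(itemset) <= set(t):
        return 0
    return sum(t[i] for i in itemset)

def lmv(itemset):
    """Local measure value: imv(X, T_q) summed over the transactions containing X."""
    return sum(imv(itemset, tid) for tid in DB)

def SH(itemset):
    """Itemset share: lmv(X) / Tmv(DB)."""
    return lmv(itemset) / Tmv()

print(tmv('T02'), Tmv(), lmv({'B', 'C'}), round(SH({'B', 'C'}), 3))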

10 Share-Confidence Framework (3/6)
- ZP (Zero Pruning) and ZSP (Zero Subset Pruning): variants of exhaustive search; prune the candidate itemsets whose local measure values are exactly zero.
- SIP (Share Infrequent Pruning): Apriori-like, but with errors.
- CAC (Combine All Counted) and PCAC (Parametric CAC): derived from ZSP; use a prediction function, with errors.
- IAB (Item Add-Back) and PIAB (Parametric IAB): join each share-frequent itemset with each 1-itemset, with errors.
- Existing algorithms are either inefficient or do not discover the complete set of share-frequent (SH-frequent) itemsets.

11 Share-Confidence Framework (4/6) ZP Algorithm SIP & IAB Algorithms

12 Share-Confidence Framework (5/6) ZSP Algorithm

13 Share-Confidence Framework (6/6)
CAC algorithm:
- PSH(XY) = SH(X) + SH(Y) × |db_X| / |DB|, if |db_X| < |db_Y| ... (1)
- PSH(XY) = SH(Y) + SH(X) × |db_Y| / |DB|, if |db_Y| < |db_X| ... (2)
- PSH(XY) = ((1) + (2)) / 2, if |db_X| = |db_Y|
- Example: PSH(AB) = (22.7% + 18.2% × 4/6 + 18.2% + 22.7% × 4/6) / 2 = 34.1%
- Example: PSH(AE) = 9.1% + 22.7% × (2/6) = 16.7% < 30%
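Because PSH is plain arithmetic over already-computed shares and db sizes, the slide's numbers can be checked with a few lines of Python (shares as fractions, db sizes as transaction counts):

def psh(sh_x, sh_y, db_x, db_y, db_size):
    """Predicted share of the joined itemset XY in CAC, per equations (1) and (2) above."""
    if db_x < db_y:                                   # equation (1)
        return sh_x + sh_y * db_x / db_size
    if db_y < db_x:                                   # equation (2)
        return sh_y + sh_x * db_y / db_size
    # |db_X| = |db_Y|: average of (1) and (2)
    return (sh_x + sh_y * db_x / db_size + sh_y + sh_x * db_y / db_size) / 2

# SH(A) = 22.7%, SH(B) = 18.2%, |db_A| = |db_B| = 4, |DB| = 6
print(round(psh(0.227, 0.182, 4, 4, 6) * 100, 1))   # about 34.1
# SH(A) = 22.7%, SH(E) = 9.1%, |db_A| = 4, |db_E| = 2
print(round(psh(0.227, 0.091, 4, 2, 6) * 100, 1))   # about 16.7, below minShare = 30%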

14 A New FP-Tree for Mining Frequent Itemsets (1/3) NFP-growth Algorithm NFP-tree construction

15 A New FP-Tree for Mining Frequent Itemsets (2/3)

TID   Frequent 1-itemsets (sorted)
T01   C, A, B, D
T02   C, A
T03   C, B, D
T04   A, B, D
T05   C, B, D

16 A New FP-Tree for Mining Frequent Itemsets (3/3) Conditional NFP-tree of "D(3,4)".

17 Experimental Results (1/4)
- PC: Pentium IV 1.5 GHz, 1 GB SDRAM, running Windows 2000 Professional
- All algorithms were coded in VC
- Datasets:
  - Real: BMS-WebView-1, BMS-WebView-2, Connect-4
  - Artificial: generated by the IBM synthetic data generator

|D|   Number of transactions in DB
|T|   Mean size of the transactions
|I|   Mean size of the maximal potentially frequent itemsets
|L|   Number of maximal potentially frequent itemsets
N     Number of items

18 Experimental Results (2/4)

19 Experimental Results (3/4)

20 Experimental Results (4/4)

21 A Fast Algorithm for Mining Share-Frequent Itemsets
- FSM: Fast Share Measure algorithm
- ML: maximum transaction length in DB
- MV: maximum measure value in DB
- min_lmv = minShare × Tmv(DB)
- Level Closure Property: given a minShare and a k-itemset X,
  - Theorem 1. If lmv(X) + (lmv(X)/k) × MV < min_lmv, all supersets of X with length k+1 are infrequent.
  - Theorem 2. If lmv(X) + (lmv(X)/k) × MV × k' < min_lmv, all supersets of X with length k+k' are infrequent.
  - Corollary 1. If lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv, all supersets of X are infrequent.

22 FSM: Fast Share Measure Algorithm
- minShare = 30%
- Let CF(X) = lmv(X) + (lmv(X)/k) × MV × (ML - k); prune X if CF(X) < min_lmv.
- Example: CF({ABC}) = 3 + (3/3) × 3 × (6 - 3) = 12 < 14 = min_lmv
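The FSM pruning rule reduces to one arithmetic test per candidate. A minimal Python sketch, reusing the slide's numbers for {ABC} (lmv = 3, k = 3, MV = 3, ML = 6, min_lmv = 14):

def cf(lmv_x, k, MV, ML):
    """Critical function of FSM: an upper bound on the lmv any superset of X can reach."""
    return lmv_x + (lmv_x / k) * MV * (ML - k)

def fsm_prune(lmv_x, k, MV, ML, min_lmv):
    """Corollary 1: prune X and all of its supersets when CF(X) < min_lmv."""
    return cf(lmv_x, k, MV, ML) < min_lmv

print(cf(3, 3, 3, 6))              # 12.0
print(fsm_prune(3, 3, 3, 6, 14))   # True: {ABC} and all of its supersets are pruned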

23 Experimental Results (1/2)
Table: candidate sets C_k, counted candidates RC_k, and SH-frequent itemsets F_k for each pass k (k = 1 through 6 and k >= 7), plus total running time, comparing ZSP with FSM(1), FSM(2), FSM(3), and FSM(ML-k) on T4.I2.D100k.N50.S10, minShare = 0.8%, ML = 14.

24 Experimental Results (2/2)

25 Efficient Algorithms for Mining Share-Frequent Itemsets
- EFSM (Enhanced FSM): instead of joining arbitrary pairs of itemsets in RC_{k-1}, EFSM joins each itemset in RC_{k-1} with a single item in RC_1 to generate C_k efficiently.
- Reduces the time complexity of candidate generation from O(n^{2k-2}) to O(n^k).
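The only difference between FSM's and EFSM's candidate generation is the join: each itemset in RC_{k-1} is extended by one item from RC_1 rather than joined with every other itemset in RC_{k-1}. A minimal sketch, with itemsets represented as frozensets and hypothetical RC_1 and RC_2 inputs used only to show the join:

def efsm_candidates(rc_prev, rc_1):
    """Generate C_k by joining each (k-1)-itemset in RC_{k-1} with the 1-itemsets in RC_1."""
    candidates = set()
    for x in rc_prev:                 # x is a (k-1)-itemset
        for single in rc_1:           # single is a 1-itemset
            if not single <= x:       # only add items not already in x
                candidates.add(x | single)
    return candidates

rc_1 = [frozenset({i}) for i in 'ABCD']
rc_2 = [frozenset({'A', 'B'}), frozenset({'B', 'C'})]
print(sorted(''.join(sorted(c)) for c in efsm_candidates(rc_2, rc_1)))
# ['ABC', 'ABD', 'BCD']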

26 Efficient Algorithms for Mining Share-Frequent Itemsets
- X_{k+1}: an arbitrary superset of X with length k+1 in DB
- S(X_{k+1}): the set of all X_{k+1} in DB
- db_{S(X_{k+1})}: the set of transactions each of which contains at least one X_{k+1}
- SuFSM and ShFSM, both derived from EFSM, prune candidates more efficiently than FSM.
- SuFSM (Support-counted FSM): Theorem 3. If lmv(X) + Sup(S(X_{k+1})) × MV × (ML - k) < min_lmv, all supersets of X are infrequent.

27 SuFSM (Support-counted FSM)
- lmv(X)/k >= Sup(X) >= Sup(S(X_{k+1})) >= maxSup(X_{k+1})
- Example: lmv({BCD})/k = 15/3 = 5, Sup({BCD}) = 3, Sup(S({BCD}_{k+1})) = 2, maxSup(X_{k+1}) = 1
- If no superset of X is an SH-frequent itemset, then the following four inequalities hold:
  - lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
  - lmv(X) + Sup(X) × MV × (ML - k) < min_lmv
  - lmv(X) + Sup(S(X_{k+1})) × MV × (ML - k) < min_lmv
  - lmv(X) + maxSup(X_{k+1}) × MV × (ML - k) < min_lmv

28 ShFSM (Share-counted FSM)
- ShFSM (Share-counted FSM): Theorem 4. If Tmv(db_{S(X_{k+1})}) < min_lmv, all supersets of X are infrequent.
- Comparison of the pruning conditions:
  - FSM: lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
  - SuFSM: lmv(X) + Sup(S(X_{k+1})) × MV × (ML - k) < min_lmv
  - ShFSM: Tmv(db_{S(X_{k+1})}) < min_lmv

29 ShFSM (Share-counted FSM)
- Example: X = {AB}
- Tmv(db_{S(X_{k+1})}) = tmv(T01) + tmv(T05) = 6 + 6 = 12 < 14 = min_lmv
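ShFSM's test needs only the transactions that contain X together with at least one additional item, i.e., the transactions containing some (k+1)-superset of X. A minimal sketch over the same hypothetical quantity-table representation used earlier (the slide's own example, where Tmv(db_S({AB}_{k+1})) = tmv(T01) + tmv(T05) = 12 < 14, uses a database not fully reproduced here):

def tmv_db_superset(DB, x):
    """Tmv(db_S(X_{k+1})): total measure value of the transactions that contain X
    plus at least one more item, i.e., at least one (k+1)-superset of X."""
    total = 0
    for items in DB.values():                         # items: {item: measure value}
        if set(x) <= set(items) and len(items) > len(x):
            total += sum(items.values())
    return total

def shfsm_prune(DB, x, min_lmv):
    """Theorem 4: if Tmv(db_S(X_{k+1})) < min_lmv, prune all supersets of X."""
    return tmv_db_superset(DB, x) < min_lmv

# Hypothetical database: per-item quantities in each transaction.
DB = {
    'T01': {'A': 1, 'B': 2, 'C': 3},
    'T02': {'A': 2, 'B': 1},
    'T03': {'B': 2, 'C': 1, 'D': 1},
}
print(tmv_db_superset(DB, {'A', 'B'}))   # 6: only T01 contains a (k+1)-superset of {A, B}
print(shfsm_prune(DB, {'A', 'B'}, 14))   # True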

30 Experimental Results (1/4)

31 Experimental Results (2/4) minShare=0.3%

32 Experimental Results (3/4) minShare=0.3%

33 Experimental Results (4/4)
Table: candidate sets C_k, counted candidates RC_k, and SH-frequent itemsets F_k for each pass k (k = 1 through 8 and k >= 9), plus total running time, comparing FSM, EFSM, SuFSM, and ShFSM on T6.I4.D100k.N200.S10, minShare = 0.1%, ML = 20.

34 Conclusions
Support measure:
- Uses two counters per tree node to reduce the number of tree nodes.
- Applies a smaller tree and header table to discover frequent itemsets efficiently.
- Consider the development of superior data structures and the extension of the pattern-growth approach.

35 Share measure
- The proposed algorithms efficiently decrease the number of candidates that must be counted.
- ShFSM shows the best performance.
- Consider the development of superior algorithms to accelerate the process of identifying all SH-frequent itemsets.

36 ShFSM: Tmv(db_{S(X_{k+1})}) < min_lmv

Thank You!