CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets

Jian Pei, Jiawei Han and Runying Mao
Intelligent Database Systems Research Lab., School of Computing Science, Simon Fraser University
{peijian, han, rmao}

Outline
Why mining frequent closed itemsets?
CLOSET: an efficient method
Performance study and experimental results
Conclusions

Mining Frequent Itemsets
Given a transaction database and a support threshold, mining frequent itemsets means finding the complete set of frequent itemsets
Mining frequent itemsets is essential for many data mining tasks, e.g. association rule mining
Mining frequent itemsets and the association rules over them often generates a large number of frequent itemsets and rules, which
  harms efficiency
  makes the results hard to understand

From Frequent Itemsets to Frequent Closed Itemsets
Mining frequent closed itemsets has the same power as mining the complete set of frequent itemsets, but it substantially reduces the number of redundant rules generated
Increases both efficiency and effectiveness
Example: TDB = {(a1a2…a100), (a1a2…a50)}, min_sup = 1, min_conf = 50%
Frequent itemsets: a1, …, a100, a1a2, …, a99a100, …, a1a2…a100
A tremendous number of association rules!
2 frequent closed itemsets: a1a2…a100 and a1a2…a50
1 rule: a1a2…a50 → a51a52…a100
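The counts in this example can be checked by hand. With min_sup = 1, every non-empty subset of {a1, …, a100} is contained in the first transaction and is therefore frequent, while only the two transactions themselves are closed. A short worked calculation:

```latex
\[
\#\{\text{frequent itemsets}\}
  = \sum_{k=1}^{100} \binom{100}{k}
  = 2^{100} - 1 \approx 1.27 \times 10^{30}
\]
\[
\operatorname{conf}\bigl(a_1 \dots a_{50} \rightarrow a_{51} \dots a_{100}\bigr)
  = \frac{\sup(a_1 \dots a_{100})}{\sup(a_1 \dots a_{50})}
  = \frac{1}{2} = 50\%
\]
```

The single rule therefore just meets the 50% confidence threshold.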

What Is a Frequent Closed Itemset?
An itemset X is a closed itemset if there exists no proper superset Y of X such that every transaction having X also contains Y
A closed itemset X is frequent if its support passes the given support threshold
The concept was first proposed by Pasquier et al. in ICDT’99 and in Information Systems, Vol. 24, No. 1, 1999
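The definition can be made concrete with a small sketch. It computes the closure of an itemset (the items shared by every transaction containing it) and calls an itemset closed when it equals its own closure. The three-transaction database and the function names are illustrative assumptions, not part of the slides.

```python
from typing import FrozenSet, List

# Illustrative toy database (an assumption, not the slides' example).
TDB: List[FrozenSet[str]] = [frozenset("abc"), frozenset("abc"), frozenset("ab")]

def support(itemset: FrozenSet[str], tdb: List[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in tdb if itemset <= t)

def closure(itemset: FrozenSet[str], tdb: List[FrozenSet[str]]) -> FrozenSet[str]:
    """Items shared by all transactions that contain `itemset`."""
    covering = [t for t in tdb if itemset <= t]
    if not covering:
        # No transaction contains the itemset; return it unchanged by convention.
        return frozenset(itemset)
    return frozenset.intersection(*covering)

def is_closed(itemset: FrozenSet[str], tdb: List[FrozenSet[str]]) -> bool:
    """X is closed when no proper superset occurs in exactly the same transactions."""
    return closure(itemset, tdb) == frozenset(itemset)
```

Here `a` is not closed because every transaction having `a` also contains `b`, whereas `ab` and `abc` are closed.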

How to Generate Rules on Frequent Closed Itemsets?
Rule XY is an association rule on frequent closed itemsets if Both X and XY are frequent closed itemsets There exists no frequent closed itemset Z such that XZ(XY) The confidence of the rule passes the given threshold Given rules XY and XYZ, the rule XYZ is redundant!

How to Mine Frequent Closed Itemsets?
A-Close [PBTL99]
  Using the A-priori framework
  Pruning redundancies in candidates
  Post-processing to generate a complete but non-duplicate result
ChARM [ZaHs00]
  Exploring a vertical data format
  Finding frequent closed itemsets by computing intersections of sets of transaction ids for itemsets
CLOSET: our method presented here

How CLOSET Works? An Example
Transaction ID   Items
10               a, c, d, e, f
20               a, b, e
30               c, e, f
40               a, c, d, f
50
Step 1. Find frequent items (min_sup = 2)
List of frequent items in support-descending order: f_list = <c:4, e:4, f:4, a:3, d:2>
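Step 1 can be sketched as a single counting pass. Two assumptions are made here and flagged in the code: the items of transaction 50 are missing from the transcript, so {c, e, f} is used because that is what the f_list counts imply, and ties in support are broken alphabetically, which happens to reproduce the order on the slide.

```python
from collections import Counter

# ASSUMPTION: transaction 50's items are not given in the transcript;
# {c, e, f} is the choice implied by the f_list counts.
TDB = {
    10: ["a", "c", "d", "e", "f"],
    20: ["a", "b", "e"],
    30: ["c", "e", "f"],
    40: ["a", "c", "d", "f"],
    50: ["c", "e", "f"],
}

def build_f_list(tdb, min_sup):
    """Frequent items with their supports, in support-descending order."""
    counts = Counter(item for items in tdb.values() for item in set(items))
    frequent = [(item, n) for item, n in counts.items() if n >= min_sup]
    # ASSUMPTION: ties are broken alphabetically.
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))

print(build_f_list(TDB, min_sup=2))
# -> [('c', 4), ('e', 4), ('f', 4), ('a', 3), ('d', 2)]
```

Item b occurs only once and is dropped as infrequent.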

Divide Search Space
All frequent closed itemsets can be divided into 5 non-overlapping subsets based on f_list
  The ones containing d
  The ones containing a but no d
  The ones containing f but no a nor d
  The ones containing e but no f, a nor d
  The ones containing only c
f_list = <c:4, e:4, f:4, a:3, d:2>

Find Subsets of Frequent Closed Itemsets by Constructing Conditional Databases
Let a be a frequent item in TDB. The a-conditional database, denoted as TDB|a, is the subset of transactions in TDB containing a, in which all occurrences of infrequent items, item a, and items following a in f_list are omitted
Let b be a frequent item in the X-conditional database TDB|X. The bX-conditional database, denoted as TDB|bX, is the subset of transactions in TDB|X containing b, in which all occurrences of local infrequent items, item b, and items following b in the local f_list of TDB|X are omitted
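A minimal sketch of the projection described above, assuming transactions are stored as strings already restricted to frequent items and written in f_list order. The function name and the `min_sup` parameter for local pruning are hypothetical, and the fifth transaction is an assumption implied by the f_list counts.

```python
F_LIST = ["c", "e", "f", "a", "d"]
# Transactions in f_list order; the fifth one ("cef") is an ASSUMPTION.
TDB = ["cefad", "ea", "cef", "cfad", "cef"]

def conditional_db(db, item, order, min_sup=1):
    """Project `db` on `item`.

    Keeps the transactions containing `item`, drops `item` and every item
    that follows it in `order`, then drops items whose local support is
    below `min_sup` (the default of 1 disables this local pruning).
    """
    keep = set(order[: order.index(item)])
    projected = [[i for i in t if i in keep] for t in db if item in t]
    counts = {}
    for t in projected:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    return ["".join(i for i in t if counts[i] >= min_sup) for t in projected]

tdb_a = conditional_db(TDB, "a", F_LIST)             # ['cef', 'e', 'cf']
tdb_ea = conditional_db(tdb_a, "e", ["c", "e", "f"])  # ['c', '']
```

Applying the function again to its own output, with the local item order, gives the nested conditional databases such as TDB|ea.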

Find Frequent Closed Itemsets Containing d
TDB: cefad, ea, cef, cfad
f_list: <c:4, e:4, f:4, a:3, d:2>
TDB|d (d:2): cefa, cfa     F.C.I.: cfad:2
TDB|a (a:3): cef, e, cf    F.C.I.: a:3
  TDB|ea (ea:2): c         F.C.I.: ea:2
TDB|f (f:4): ce:3, c       F.C.I.: cf:4, cef:3
TDB|e (e:4): c:3           F.C.I.: e:4
Local frequent items in TDB|d: c, f, a
Every transaction having d also contains c, f and a

Find Frequent Closed Itemsets Containing a but No d
Frequent closed itemsets containing a but no d can be further partitioned into subsets
  Ones having af but no d
  Ones having ae but no d nor f
  Ones having ac but no d, e nor f
TDB: cefad, ea, cef, cfad
f_list: <c:4, e:4, f:4, a:3, d:2>
TDB|d (d:2): cefa, cfa     F.C.I.: cfad:2
TDB|a (a:3): cef, e, cf    F.C.I.: a:3
  TDB|ea (ea:2): c         F.C.I.: ea:2
TDB|f (f:4): ce:3, c       F.C.I.: cf:4, cef:3
TDB|e (e:4): c:3           F.C.I.: e:4
sup(fa) = sup(ca) = sup(cfad), so there is no F.C.I. having fa or ca but no d

Find Frequent Closed Itemsets Containing f but No a Nor d
TDB: cefad, ea, cef, cfad
f_list: <c:4, e:4, f:4, a:3, d:2>
TDB|d (d:2): cefa, cfa     F.C.I.: cfad:2
TDB|a (a:3): cef, e, cf    F.C.I.: a:3
  TDB|ea (ea:2): c         F.C.I.: ea:2
TDB|f (f:4): ce:3, c       F.C.I.: cf:4, cef:3
TDB|e (e:4): c:3           F.C.I.: e:4

Find Frequent Closed Itemsets Containing e but No f, a Nor d
TDB: cefad, ea, cef, cfad
f_list: <c:4, e:4, f:4, a:3, d:2>
TDB|d (d:2): cefa, cfa     F.C.I.: cfad:2
TDB|a (a:3): cef, e, cf    F.C.I.: a:3
  TDB|ea (ea:2): c         F.C.I.: ea:2
TDB|f (f:4): ce:3, c       F.C.I.: cf:4, cef:3
TDB|e (e:4): c:3           F.C.I.: e:4

Find Frequent Closed Itemsets Containing Only c
sup(c) = sup(cf), so c is not a closed itemset
In summary, the set of frequent closed itemsets is {acdf:2, a:3, ae:2, cf:4, cef:3, e:4}
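The summary can be cross-checked with a brute-force enumeration. This is not CLOSET itself, only an exhaustive check that is feasible because the example has six items. The fifth transaction {c, e, f} is an assumption implied by the f_list counts, and the function name is hypothetical.

```python
from itertools import combinations

# The fifth transaction is an ASSUMPTION implied by the f_list counts.
TDB = [set("acdef"), set("abe"), set("cef"), set("acdf"), set("cef")]

def frequent_closed_itemsets(tdb, min_sup):
    """Enumerate every itemset and keep those that are frequent and closed."""
    items = sorted(set().union(*tdb))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = set(combo)
            covering = [t for t in tdb if x <= t]
            if len(covering) < min_sup:
                continue
            # Closed: X equals the intersection of all transactions containing it.
            if set.intersection(*covering) == x:
                result["".join(combo)] = len(covering)
    return result

print(frequent_closed_itemsets(TDB, min_sup=2))
```

Under these assumptions the result agrees with the six itemsets listed in the summary.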

Optimization 1: Compress Transactional & Conditional Databases Using FP-trees
FP-tree compresses databases for frequent itemset mining
Conditional databases can be derived from the FP-tree efficiently
Please refer to our SIGMOD’00 paper for details
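The compression comes from shared prefixes: transactions sorted in f_list order that begin with the same items share the same path. A minimal sketch of the prefix-tree part only, assuming the example transactions (the fifth is an assumption) and omitting the header table and node-links that the full FP-tree structure also maintains; the class and function names are hypothetical.

```python
class FPNode:
    """One node of the prefix tree: an item, its count, and its children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions):
    """Insert each transaction (items already in f_list order) as a path."""
    root = FPNode(None)
    for t in transactions:
        node = root
        for item in t:
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, node)
            child.count += 1
            node = child
    return root

def count_nodes(node):
    """Number of item nodes below `node`."""
    return sum(1 + count_nodes(c) for c in node.children.values())

# The fifth transaction ("cef") is an ASSUMPTION implied by the f_list counts.
root = build_fp_tree(["cefad", "ea", "cef", "cfad", "cef"])
```

In this sketch the 17 item occurrences of the example are stored in 10 nodes.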

Optimization 2: Extract Items Appearing in Every Transaction of Conditional Database
Let Y be the set of items appearing in every transaction of the X-conditional database; then XY is a potential frequent closed itemset
This optimization takes effect before constructing the FP-tree for the conditional database
Benefits
  Reduce the size of the FP-tree
  Reduce the levels of recursions
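As an illustration, the set Y is simply the intersection of all transactions in the conditional database. The sketch below applies this to TDB|d from the running example; the function name is hypothetical.

```python
def items_in_every_transaction(cond_db):
    """Items common to all transactions of a conditional database."""
    if not cond_db:
        return set()
    return set.intersection(*(set(t) for t in cond_db))

# TDB|d from the running example: transactions cefa and cfa.
common = items_in_every_transaction(["cefa", "cfa"])
candidate = "".join(sorted(common | {"d"}))  # X = d, Y = common items
```

Here Y = {c, f, a}, so the candidate XY is acdf, matching the closed itemset cfad:2 found for d.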

Optimization 3: Directly Extract Frequent Closed Itemsets From FP-tree
Benefits
  Identify frequent closed itemsets quickly
  Reduce the size of the remaining FP-tree to be examined
  Reduce the levels of recursions
Example: single-path FP-tree root → a:7 → b:7 → c:7 → d:5 → e:4 → f:4
F.C.I. extracted directly: abc:7, abcd:5, abcdef:4
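For a tree that consists of a single path, the extraction can be sketched as one pass: a prefix is emitted wherever the count drops at the next node, or at the end of the path. The function name and the `min_sup` value are assumptions (the slide states no threshold for this example), and within a conditional tree the results would still be candidates to be checked against itemsets found earlier.

```python
def closed_itemsets_on_single_path(path, min_sup):
    """path: list of (item, count) from the root down, counts non-increasing."""
    result = []
    prefix = ""
    for idx, (item, count) in enumerate(path):
        prefix += item
        is_last = idx == len(path) - 1
        # Emit the prefix when the next node has a strictly smaller count.
        if count >= min_sup and (is_last or path[idx + 1][1] < count):
            result.append((prefix, count))
    return result

path = [("a", 7), ("b", 7), ("c", 7), ("d", 5), ("e", 4), ("f", 4)]
print(closed_itemsets_on_single_path(path, min_sup=4))
# -> [('abc', 7), ('abcd', 5), ('abcdef', 4)]
```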

Optimization 4: Prune Search Branches
If XY, sup(X)=sup(Y) and Y is a frequent closed itemset, there is no need to search for X-conditional database for frequent closed itemset Any frequent closed itemset having X must contain Y-X as well Benefits Avoid search for subsumed frequent itemsets

Scaling up CLOSET in Large Database
Using projected databases in place of FP-trees
Partition-based projection
TDB: cefad, ea, cef, cfad
f_list: <c:4, e:4, f:4, a:3, d:2>
TDB|d (d:2): cefa, cfa     F.C.I.: cfad:2
TDB|a (a:3): cef, e, cf    F.C.I.: a:3
  TDB|ea (ea:2): c         F.C.I.: ea:2
TDB|f (f:4): ce:3, c       F.C.I.: cf:4, cef:3
TDB|e (e:4): c:3           F.C.I.: e:4

Performance Study
Test takers: A-Close, ChARM, CLOSET
Datasets
  Synthetic dataset T25I20D100k with 10k items
  Connect-4
  Pumsb

Compactness of Frequent Closed Itemsets
Example: Dataset Connect-4
Support        #FCI     #FI       #FI/#FCI
64179 (95%)    812      2205      2.72
60801 (90%)    3486     27127     7.78
54046 (80%)    15107    533975    35.35
47290 (70%)    35875              115.12

Scalability with Support Threshold on Dataset T25I20D100k

Scalability With Support Threshold on Dataset Connect-4

Scalability With Support Threshold on Dataset Pumsb

Size Scaleup on Datasets

Conclusions
CLOSET is an FP-tree-based database projection method for efficient mining of frequent closed itemsets in large databases
  Applying the FP-tree structure
  Developing techniques to identify frequent closed itemsets quickly
  Exploring a partition-based projection mechanism for scalable mining
CLOSET can be straightforwardly extended to mine max-patterns

References
R. Agarwal, C. Aggarwal and V.V.V. Prasad. A tree projection algorithm for generation of frequent itemsets. In Journal of Parallel and Distributed Computing (to appear), 2000.
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. VLDB’94, Chile, September 1994.
R.J. Bayardo. Efficiently mining long patterns from databases. In Proc. SIGMOD’98, WA, June 1998.
J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proc. SIGMOD’00, TX, May 2000.
H. Mannila, H. Toivonen and A.I. Verkamo. Efficient algorithms for discovering association rules. In Proc. KDD’94, WA, July 1994.
N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proc. ICDT’99, Israel, January 1999.
N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal. Efficient mining of association rules using closed itemset lattices. In Information Systems, Vol. 24, No. 1, 1999.
M.J. Zaki and C. Hsiao. ChARM: An efficient algorithm for closed association rule mining. Tech. Rep., Computer Science, Rensselaer Polytechnic Institute, 1999.
