CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets


CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets
Jian Pei, Jiawei Han and Runying Mao
Intelligent Database Systems Research Lab., School of Computing Science, Simon Fraser University
Email: {peijian, han, rmao}@cs.sfu.ca
http://www.cs.sfu.ca/~{peijian, han, rmao}

Outline
Why mine frequent closed itemsets?
CLOSET: an efficient method
Performance study and experimental results
Conclusions

Mining Frequent Itemsets
Given a transaction database and a support threshold, mining frequent itemsets means finding the complete set of frequent itemsets.
Mining frequent itemsets is essential for many data mining tasks, e.g. association.
Mining frequent itemsets and the association rules over them often generates a large number of frequent itemsets and rules, which:
  Harms efficiency
  Makes the results hard to understand

From Frequent Itemsets to Frequent Closed Itemsets
Mining frequent closed itemsets has the same power as mining the complete set of frequent itemsets, but it substantially reduces the number of redundant rules generated. This increases both efficiency and effectiveness.
Example: TDB contains two transactions, (a1a2…a100) and (a1a2…a50), with min_sup = 1 and min_conf = 50%.
  2^100 - 1 frequent itemsets: a1, …, a100, a1a2, …, a99a100, …, a1a2…a100. A tremendous number of association rules!
  2 frequent closed itemsets: a1a2…a100 and a1a2…a50
  1 rule: a1a2…a50 → a51a52…a100
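The counts on this slide can be checked by brute force on a scaled-down copy of the example. The sketch below is illustrative only: it assumes 6 items and a 3-item second transaction in place of 100 and 50 (enumerating 2^100 itemsets is infeasible), and the helper names `support`, `frequent_itemsets` and `closed_only` are hypothetical.

```python
from itertools import combinations

def support(itemset, db):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)

def frequent_itemsets(db, min_sup):
    """Enumerate all non-empty itemsets and keep the frequent ones."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            sup = support(candidate, db)
            if sup >= min_sup:
                result[candidate] = sup
    return result

def closed_only(freq):
    """Keep itemsets with no proper superset of equal support."""
    return {x: s for x, s in freq.items()
            if not any(x < y and s == sy for y, sy in freq.items())}

# Scaled-down stand-in for (a1...a100) and (a1...a50): an assumption.
n, half = 6, 3
db = [frozenset(range(1, n + 1)), frozenset(range(1, half + 1))]

freq = frequent_itemsets(db, min_sup=1)
closed = closed_only(freq)
print(len(freq), len(closed))
```

With n items the frequent itemsets number 2^n - 1, while only the two transactions themselves are closed, which mirrors the 2^100 - 1 versus 2 contrast on the slide.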

What Is a Frequent Closed Itemset?
An itemset X is a closed itemset if there exists no itemset Y such that Y is a proper superset of X and every transaction having X contains Y.
A closed itemset X is frequent if its support passes the given support threshold.
The concept was first proposed by Pasquier et al. in ICDT’99 and Information Systems Vol.24, No.1, 1999.
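One way to test this definition is through the closure of an itemset, i.e. the intersection of all transactions that contain it. The following is a minimal sketch, not the authors' code. It uses the five-transaction example from the later slides; the items of transaction 50 are missing in this transcript and are assumed to be c, e, f, the only combination of frequent items consistent with the stated f_list supports. The function names are hypothetical.

```python
# Running example; transaction 50 = "cef" is an ASSUMPTION (see lead-in).
TDB = [set("acdef"), set("abe"), set("cef"), set("acdf"), set("cef")]

def support(itemset, db):
    """Number of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t)

def closure(itemset, db):
    """Items shared by every transaction that contains the itemset."""
    covering = [t for t in db if itemset <= t]
    if not covering:
        # No transaction supports the itemset; return it unchanged.
        return set(itemset)
    return set.intersection(*covering)

def is_closed(itemset, db):
    """Closed means the closure adds no item."""
    return closure(itemset, db) == set(itemset)

print(sorted(closure({"c"}, TDB)))   # every transaction with c also has f
print(is_closed({"c", "f"}, TDB))
```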

How to Generate Rules on Frequent Closed Itemsets?
Rule X → Y is an association rule on frequent closed itemsets if:
  Both X and X ∪ Y are frequent closed itemsets
  There exists no frequent closed itemset Z such that X ⊂ Z ⊂ (X ∪ Y)
  The confidence of the rule passes the given threshold
Given rules X → Y and XY → Z, the rule X → YZ is redundant!
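The three conditions can be sketched as a filter over pairs of frequent closed itemsets. This is an illustrative sketch only: the input is the final result set listed on the summary slide of this deck, `rules_on_closed` is a hypothetical name, and the threshold of 0.6 is an arbitrary choice for the demonstration.

```python
# Frequent closed itemsets and supports from the summary slide of this deck.
FCI = {frozenset("acdf"): 2, frozenset("a"): 3, frozenset("ae"): 2,
       frozenset("cf"): 4, frozenset("cef"): 3, frozenset("e"): 4}

def rules_on_closed(fci, min_conf):
    """Return (X, Y, confidence) for rules X -> Y between closed itemsets."""
    rules = []
    for x, sup_x in fci.items():
        for xy, sup_xy in fci.items():
            if not x < xy:
                continue  # X must be a proper subset of X ∪ Y
            if any(x < z < xy for z in fci):
                continue  # a closed itemset lies strictly in between
            conf = sup_xy / sup_x
            if conf >= min_conf:
                rules.append((x, xy - x, conf))
    return rules

rules = rules_on_closed(FCI, min_conf=0.6)
for x, y, conf in rules:
    print("".join(sorted(x)), "->", "".join(sorted(y)), round(conf, 2))
```

The confidence of X → Y is computed as sup(X ∪ Y) / sup(X), the usual definition for association rules.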

How to Mine Frequent Closed Itemsets?
A-Close [PBTL99]
  Using the A-priori framework
  Pruning redundancies in candidates
  Post-processing to generate a complete but non-duplicate result
ChARM [ZaHs00]
  Exploring a vertical data format
  Finding frequent closed itemsets by computing intersections of sets of transaction ids for itemsets
CLOSET: our method presented here

How Does CLOSET Work? An Example
Transaction ID   Items
10               a, c, d, e, f
20               a, b, e
30               c, e, f
40               a, c, d, f
50
Step 1. Find frequent items (min_sup = 2)
List of frequent items in support descending order: f_list = <c:4, e:4, f:4, a:3, d:2>
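Step 1 can be sketched with a simple item count. Two things here are assumptions rather than part of the slide: the items of transaction 50 are missing in this transcript and are taken to be c, e, f (the only choice of frequent items that reproduces the stated supports), and ties in support are broken alphabetically, which happens to match the order shown. `build_f_list` is a hypothetical name.

```python
from collections import Counter

# Transaction 50 = "cef" is an ASSUMPTION; its items are missing in the slide.
TDB = {10: "acdef", 20: "abe", 30: "cef", 40: "acdf", 50: "cef"}

def build_f_list(db, min_sup):
    """Frequent items with counts, in support-descending order."""
    counts = Counter(item for items in db.values() for item in set(items))
    frequent = [(item, c) for item, c in counts.items() if c >= min_sup]
    # Assumed tie-break: alphabetical order among equal supports.
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))

f_list = build_f_list(TDB, min_sup=2)
print(f_list)
```

Item b occurs only once, so it falls below min_sup and is dropped.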

Divide Search Space
All frequent closed itemsets can be divided into 5 non-overlapping subsets based on f_list = <c:4, e:4, f:4, a:3, d:2>:
  The ones containing d
  The ones containing a but no d
  The ones containing f but no a nor d
  The ones containing e but no f, a nor d
  The ones containing only c

Find Subsets of Frequent Closed Itemsets by Constructing Conditional Databases
Let a be a frequent item in TDB. The a-conditional database, denoted TDB|a, is the subset of transactions in TDB containing a, with all occurrences of infrequent items, item a, and items following a in f_list omitted.
Let b be a frequent item in the X-conditional database TDB|X. The bX-conditional database, denoted TDB|bX, is the subset of transactions in TDB|X containing b, with all occurrences of local infrequent items, item b, and items following b in the local f_list_X omitted.
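The first definition can be sketched as a filter followed by a projection onto the items that precede the chosen item in f_list. This is a minimal sketch under stated assumptions: transaction 50 is assumed to be c, e, f (its items are missing in this transcript), and `conditional_db` is a hypothetical helper, not the authors' implementation.

```python
F_LIST = ["c", "e", "f", "a", "d"]  # order taken from the example slide
# Transaction 50 = "cef" is an ASSUMPTION; its items are missing in the slide.
TDB = [set("acdef"), set("abe"), set("cef"), set("acdf"), set("cef")]

def conditional_db(db, item, f_list):
    """Project transactions containing item onto the items before it in f_list.

    Infrequent items, the item itself, and items following it are omitted
    because only the f_list prefix is kept.
    """
    keep = f_list[:f_list.index(item)]
    return ["".join(i for i in keep if i in t) for t in db if item in t]

print(conditional_db(TDB, "d", F_LIST))
print(conditional_db(TDB, "a", F_LIST))
```

A nested conditional database such as TDB|ea would follow the same pattern, applied to TDB|a with a local f_list recomputed from TDB|a.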

Find Frequent Closed Itemsets Containing d
TDB: cefad, ea, cef, cfad
f_list: <c:4, e:4, f:4, a:3, d:2>
TDB|d (d:2): cefa, cfa
  Local frequent items: c, f, a
  Every transaction having d also contains c, f and a
  F.C.I.: cfad:2
TDB|a (a:3): e, cf. F.C.I.: a:3
TDB|ea (ea:2): c. F.C.I.: ea:2
TDB|f (f:4): ce:3. F.C.I.: cf:4, cef:3
TDB|e (e:4): c:3. F.C.I.: e:4

Find Frequent Closed Itemsets Containing a but No d
Frequent closed itemsets containing a but no d can be further partitioned into subsets:
  Ones having af but no d
  Ones having ae but no d nor f
  Ones having ac but no d, e nor f
sup(fa) = sup(ca) = sup(cfad), so there is no F.C.I. having fa or ca but no d.
TDB: cefad, ea, cef, cfad
f_list: <c:4, e:4, f:4, a:3, d:2>
TDB|d (d:2): cefa, cfa. F.C.I.: cfad:2
TDB|a (a:3): cef, e, cf. F.C.I.: a:3
TDB|ea (ea:2): c. F.C.I.: ea:2
TDB|f (f:4): ce:3, c. F.C.I.: cf:4, cef:3
TDB|e (e:4): c:3. F.C.I.: e:4

Find Frequent Closed Itemsets Containing f but No a Nor d
f_list: <c:4, e:4, f:4, a:3, d:2>
TDB|f (f:4): ce:3, c
c appears in every transaction of TDB|f, so cf:4 is a frequent closed itemset; e is locally frequent with support 3, which gives cef:3.
F.C.I.: cf:4, cef:3

Find Frequent Closed Itemsets Containing e but No f, a Nor d
f_list: <c:4, e:4, f:4, a:3, d:2>
TDB|e (e:4): c:3
c does not appear in every transaction of TDB|e, so e:4 is closed; sup(ce) = sup(cef) = 3, so ce is not closed.
F.C.I.: e:4

Find Frequent Closed Itemsets Containing Only c
sup(c) = sup(cf), so c is not a closed itemset.
In summary, the set of frequent closed itemsets is {acdf:2, a:3, ae:2, cf:4, cef:3, e:4}.
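The whole walkthrough can be reproduced with a short recursive sketch. It is a simplified illustration, not the authors' implementation: it uses plain projected lists of sets instead of FP-trees, keeps results in a dictionary for the subsumption check, breaks support ties alphabetically, and assumes transaction 50 is c, e, f (its items are missing in this transcript). The name `mine_closed` is hypothetical.

```python
from collections import Counter

def mine_closed(transactions, min_sup):
    """Divide-and-conquer mining of frequent closed itemsets (simplified)."""
    closed = {}  # frozenset -> support

    def subsumed(itemset, sup):
        # True if a found closed itemset contains itemset with equal support.
        return any(sup == s and itemset <= c for c, s in closed.items())

    def recurse(prefix, cond_db):
        n = len(cond_db)  # support of the prefix
        counts = Counter(i for t in cond_db for i in t)
        # Items present in every transaction join the prefix directly.
        always = {i for i, c in counts.items() if c == n}
        itemset = frozenset(prefix | always)
        if itemset and n >= min_sup:
            closed[itemset] = n
        # Local f_list: remaining frequent items, support-descending.
        local = sorted((i for i, c in counts.items()
                        if c >= min_sup and i not in always),
                       key=lambda i: (-counts[i], i))
        # Process from the end of the local f_list, as in the slides.
        for pos in range(len(local) - 1, -1, -1):
            item = local[pos]
            new_prefix = itemset | {item}
            if subsumed(new_prefix, counts[item]):
                continue  # nothing new can be found under this prefix
            earlier = set(local[:pos])
            recurse(new_prefix, [t & earlier for t in cond_db if item in t])

    recurse(frozenset(), [set(t) for t in transactions])
    return closed

# Transaction 50 = "cef" is an ASSUMPTION; its items are missing in the slide.
TDB = ["acdef", "abe", "cef", "acdf", "cef"]
result = mine_closed(TDB, min_sup=2)
for itemset, sup in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(itemset)), sup)
```

On this input the sketch visits d, a, f, e and c in that order and skips the prefixes fa, ca, ce and c, matching the cases discussed in the slides.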

Optimization 1: Compress Transactional & Conditional Databases Using FP-trees
FP-tree compresses databases for frequent itemsets
Conditional databases can be derived from an FP-tree efficiently
Please refer to our SIGMOD’00 paper for details

Optimization 2: Extract Items Appearing in Every Transaction of Conditional Database
Let Y be the set of items appearing in every transaction of the X-conditional database; then X ∪ Y is a potential frequent closed itemset.
This optimization takes effect before constructing the FP-tree for the conditional database.
Benefits
  Reduce the size of the FP-tree
  Reduce the levels of recursion
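A minimal sketch of this step, assuming the conditional database is held as a list of sets (the slides use FP-trees instead). `extract_common_items` is a hypothetical name, and the sample input is TDB|d from the example slides.

```python
def extract_common_items(prefix, cond_db):
    """Move items shared by all transactions of cond_db into the prefix."""
    common = set.intersection(*cond_db) if cond_db else set()
    candidate = set(prefix) | common           # potential closed itemset
    reduced = [t - common for t in cond_db]    # smaller database to recurse on
    return candidate, len(cond_db), reduced

# TDB|d from the example: transactions cefa and cfa, with d:2.
candidate, sup, reduced = extract_common_items({"d"}, [set("cefa"), set("cfa")])
print(sorted(candidate), sup, reduced)
```

Here c, f and a move into the prefix at once, so the remaining database holds only a single infrequent e and no further recursion is needed.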

Optimization 3: Directly Extract Frequent Closed Itemsets From FP-tree
Benefits
  Identify frequent closed itemsets quickly
  Reduce the size of the remaining FP-tree to be examined
  Reduce the levels of recursion
Example FP-tree path: root → a:7 → b:7 → c:7 → d:5 → e:4 → f:4
Frequent closed itemsets extracted: abc:7, abcd:5, abcdef:4
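For a single-path tree like the one on this slide, the extraction can be sketched as emitting a prefix wherever the count drops along the path. This is an illustrative sketch: the path is read from the slide's figure, `closed_from_single_path` is a hypothetical name, and the check against previously found closed itemsets that a full miner would also need is left out.

```python
def closed_from_single_path(path, min_sup):
    """Emit each path prefix that ends just before the count decreases."""
    result = []
    prefix = ""
    for idx, (item, count) in enumerate(path):
        prefix += item
        is_last = idx == len(path) - 1
        # A prefix is a candidate when the next node has a smaller count.
        if count >= min_sup and (is_last or path[idx + 1][1] < count):
            result.append((prefix, count))
    return result

# Single path read from the slide's figure.
path = [("a", 7), ("b", 7), ("c", 7), ("d", 5), ("e", 4), ("f", 4)]
print(closed_from_single_path(path, min_sup=2))
```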

Optimization 4: Prune Search Branches
If X ⊂ Y, sup(X) = sup(Y) and Y is a frequent closed itemset, there is no need to search the X-conditional database for frequent closed itemsets.
Any frequent closed itemset having X must contain Y - X as well.
Benefits
  Avoid searching for subsumed frequent itemsets
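The pruning condition amounts to a lookup over the closed itemsets found so far. A minimal sketch, assuming results are kept in a dictionary from itemset to support; `can_prune` is a hypothetical name and the sample values come from the example slides.

```python
def can_prune(x, sup_x, closed):
    """True if some found closed itemset Y has X ⊂ Y and sup(X) = sup(Y)."""
    return any(sup_x == sup_y and x < y for y, sup_y in closed.items())

# Closed itemsets found after processing d, a and f in the example.
found = {frozenset("acdf"): 2, frozenset("a"): 3, frozenset("ae"): 2,
         frozenset("cf"): 4, frozenset("cef"): 3}

print(can_prune(frozenset("af"), 2, found))  # sup(fa) = sup(cfad)
print(can_prune(frozenset("c"), 4, found))   # sup(c) = sup(cf)
print(can_prune(frozenset("e"), 4, found))   # no superset with support 4
```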

Scaling up CLOSET in Large Database
Using projected databases in place of FP-trees
Partition-based projection
TDB: cefad, ea, cef, cfad
f_list: <c:4, e:4, f:4, a:3, d:2>
TDB|d (d:2): cefa, cfa. F.C.I.: cfad:2
TDB|a (a:3): cef, e, cf. F.C.I.: a:3
TDB|ea (ea:2): c. F.C.I.: ea:2
TDB|f (f:4): ce:3, c. F.C.I.: cf:4, cef:3
TDB|e (e:4): c:3. F.C.I.: e:4

Performance Study
Algorithms tested: A-Close, ChARM, CLOSET
Datasets: synthetic dataset T25I20D100k with 10k items, Connect-4, Pumsb

Compactness of Frequent Closed Itemsets
Example: Dataset Connect-4
Support        #FCI     #FI       #FI/#FCI
64179 (95%)    812      2205      2.72
60801 (90%)    3486     27127     7.78
54046 (80%)    15107    533975    35.35
47290 (70%)    35875    4129839   115.12
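The last column is simply the ratio of the two counts, which a few lines can confirm. The figures below are copied from the table on this slide; the variable names are hypothetical.

```python
# (support, #FCI, #FI) rows copied from the Connect-4 table.
rows = [(64179, 812, 2205), (60801, 3486, 27127),
        (54046, 15107, 533975), (47290, 35875, 4129839)]

ratios = [round(fi / fci, 2) for _, fci, fi in rows]
for (sup, fci, fi), ratio in zip(rows, ratios):
    print(sup, fci, fi, ratio)
```

The ratio grows by well over an order of magnitude as the support threshold drops from 95% to 70%, which is the compactness argument the slide makes.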

Scalability with Support Threshold on Dataset T25I20D100k

Scalability With Support Threshold on Dataset Connect-4

Scalability With Support Threshold on Dataset Pumsb

Size Scaleup on Datasets

Conclusions
CLOSET is an FP-tree-based database projection method for efficient mining of frequent closed itemsets in large databases:
  Applying the FP-tree structure
  Developing techniques to identify frequent closed itemsets quickly
  Exploring a partition-based projection mechanism for scalable mining
CLOSET can be straightforwardly extended to mine max-patterns

References
R. Agarwal, C. Aggarwal and V.V.V. Prasad. A tree projection algorithm for generation of frequent itemsets. In Journal of Parallel and Distributed Computing, (to appear), 2000.
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. VLDB’94, Chile, September 1994.
R.J. Bayardo. Efficiently mining long patterns from databases. In Proc. SIGMOD’98, WA, June 1998.
J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In Proc. SIGMOD’00, TX, May 2000.
H. Mannila, H. Toivonen and A.I. Verkamo. Efficient algorithms for discovering association rules. In Proc. KDD’94, WA, July 1994.
N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proc. ICDT’99, Israel, January 1999.
N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal. Efficient mining of association rules using closed itemset lattices. In Information Systems, Vol.24, No.1, 1999.
M.J. Zaki and C. Hsiao. ChARM: An efficient algorithm for closed association rule mining. In Tech. Rep. 99-10, Computer Science, Rensselaer Polytechnic Institute, 1999.