Fast and Memory Efficient Mining of Frequent Closed Itemsets
Claudio Lucchese, Salvatore Orlando, Raffaele Perego
DB group seminar
Presenter: Leonidas


Abstract
- Frequent itemsets mining
- Closed itemsets
- Mining frequent closed itemsets
- Handling duplicates
- Brief introduction of the algorithm
- Experimental results

Frequent Itemsets Mining
- Given a set of items ℐ and a set of transactions D
  - Each transaction t in D is a set of items from ℐ
  - An itemset I is a set of items from ℐ; a k-itemset contains k items
- Support of an itemset I, supp(I): the number of transactions in D that include I
- Goal: discover all the itemsets from ℐ with support > min_supp (the frequent itemsets)
- Well-known algorithm: Apriori
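The definition of support can be made concrete with a small sketch. It uses the four-transaction example data set that appears later in the slides and enumerates every itemset by brute force, which is an illustrative assumption rather than Apriori itself. The names `supp` and `frequent_itemsets` are hypothetical, and the strict `>` comparison follows the wording of the slide.

```python
from itertools import combinations

# Example data set from the later slides (TID -> set of items).
D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}


def supp(itemset, transactions):
    """Number of transactions that include every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions.values() if items <= t)


def frequent_itemsets(transactions, min_supp):
    """Brute-force enumeration (not Apriori) of itemsets with supp > min_supp."""
    universe = sorted(set().union(*transactions.values()))
    result = {}
    for k in range(1, len(universe) + 1):
        for combo in combinations(universe, k):
            s = supp(combo, transactions)
            if s > min_supp:
                result["".join(combo)] = s
    return result


if __name__ == "__main__":
    print(frequent_itemsets(D, 1))
```

The enumeration is exponential in the number of items, which is exactly the cost that Apriori-style pruning and closed-itemset mining try to avoid.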

Weaknesses & Solutions
- The number of frequent itemsets grows quickly as min_supp decreases
  - The complexity of the mining task increases rapidly
  - The output becomes huge and hard to analyze
- Closed itemsets are one of the solutions
  - They are the unique maximal elements of the equivalence classes defined over the lattice of all the frequent itemsets

Weaknesses & Solutions
- Equivalence class
  - A distinct group of frequent itemsets
  - Supported by the same set of transactions
  - Represent the same knowledge
- Vertical bitwise representation of the data set
- Association rules extracted from closed itemsets are more meaningful [ZAKI04]
  - Redundancies are removed
- Suitable for dense data sets
  - Frequent closed itemsets are much fewer than frequent itemsets

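Closed Itemsets
- I is a subset of the items appearing in D; T is a subset of the transactions in D
- Define two functions:
  - f(T) = {i | ∀ t ∈ T, i ∈ t}: the items common to all the transactions in T
  - g(I) = {t ∈ D | ∀ i ∈ I, i ∈ t}: the transactions that contain every item of I
- Itemset I is closed iff c(I) = f(g(I)) = I
- The function c is called the Galois operator / closure operator

Example data set:
TID  Items
1    B D
2    A B C D
3    A C D
4    C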
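A minimal sketch of the closure operator on the example data set, assuming the usual definitions: g maps an itemset to the transactions that contain it, f maps a set of transactions to the items they all share, and c is their composition. The function names mirror those symbols; the dictionary layout is an assumption made for illustration.

```python
# Example data set (TID -> set of items).
D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}
ITEMS = set().union(*D.values())


def g(itemset):
    """TIDs of the transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return {tid for tid, t in D.items() if wanted <= t}


def f(tids):
    """Items common to all transactions in `tids` (all items if `tids` is empty)."""
    common = set(ITEMS)
    for tid in tids:
        common &= D[tid]
    return common


def c(itemset):
    """Closure of `itemset`: the largest itemset supported by the same transactions."""
    return f(g(itemset))


def is_closed(itemset):
    return c(itemset) == set(itemset)


if __name__ == "__main__":
    for x in ["A", "B", "C", "CD", "ACD"]:
        print(x, sorted(c(x)), is_closed(x))
```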

Equivalence classes
- Two itemsets belong to the same equivalence class iff
  - They have the same closure
  - They are supported by the same set of transactions
- An itemset I is closed iff no superset of I has the same support

[Figure: lattice of all the itemsets of the example data set (TID 1: BD, 2: ABCD, 3: ACD, 4: C), annotated with supports and grouped into equivalence classes]
- Supports: Ø 4; A 2, B 2, C 3, D 3; AB 1, AC 2, AD 2, BC 1, BD 2, CD 2; ABC 1, ABD 1, ACD 2, BCD 1; ABCD 1
- Closed itemsets highlighted in the figure: Ø 4, C 3, D 3, ACD 2, BD 2, ABCD 1
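The grouping into equivalence classes can be reproduced with a short sketch: every itemset of the example data set is keyed by the set of transactions that support it, and the longest member of each group is its closed itemset. This brute-force grouping is only an illustration of the definition, not part of the algorithm presented in the slides, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Example data set (TID -> set of items).
D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}


def tidset(itemset):
    """Frozen set of TIDs whose transactions contain every item of `itemset`."""
    wanted = set(itemset)
    return frozenset(tid for tid, t in D.items() if wanted <= t)


def equivalence_classes():
    """Group all itemsets (including the empty one) by their supporting transactions."""
    items = sorted(set().union(*D.values()))
    classes = defaultdict(list)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            classes[tidset(combo)].append("".join(combo))
    return classes


def closed_itemsets():
    """The unique maximal member of each class, mapped to its support."""
    return {max(members, key=len): len(tids)
            for tids, members in equivalence_classes().items()}


if __name__ == "__main__":
    for tids, members in equivalence_classes().items():
        print(sorted(tids), members)
    print(closed_itemsets())
```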

Mining Frequent Closed Itemsets
- Search space browsing
  - Traverse the lattice of frequent itemsets from one equivalence class to another
- Closure computation
  - Compute the closure of frequent itemsets
  - Determine the closed itemsets
- Closure generator: a single representative of an equivalence class
  - All the closed itemsets can be mined by computing the closure of the generator of each class

Browsing the Search Space
- Choose the key patterns (the minimal elements of each equivalence class) as generators
- Traverse the lattice formed by the key patterns with an Apriori-like algorithm [TAOU00]
- Unfortunately, the same closed itemset can be reached from more than one key pattern
[Figure: itemset lattice of the example data set, annotated with supports]

Browsing the Search Space
- Closure climbing
  - New generators are built as supersets of the closed itemsets discovered so far
  - The search jumps from one equivalence class to another
  - There is no guarantee that the equivalence class reached has not been visited already
[Figure: itemset lattice of the example data set, annotated with supports]

Problem of duplicates
- Duplicate checking is needed to avoid generating the same closed itemset more than once
- To avoid useless, expensive closure operations, use the following lemma:
  - Lemma 1: given two itemsets X and Y, if X ⊂ Y and supp(X) = supp(Y), then c(X) = c(Y)
- However, the check is still expensive in time and space
  - All the closed itemsets mined so far need to be kept in main memory
- Several algorithms are forced to adopt a strict lexicographic visiting order of the search space to ensure correct duplicate avoidance
  - CHARM [ZAKI02], CLOSET [PEI00], CLOSET+ [PEI03]

Computing Closures
- Besides the Galois operator, make use of the lemma:
  - Lemma 2: given an itemset X and an item i, g(X) ⊆ g(i) ⇔ i ∈ c(X)
- Perform the inclusion check for every item i
- The check benefits from the vertical representation of the data set as a list of tidlists
- The calculation can be either offline or online
  - Offline: compute the closures for the entire set of generators
    - Uses key patterns; generators are shorter
  - Online: compute the closure of each generator as it is discovered
    - Uses closure climbing; generators are longer
    - Fewer checks for longer generators, so more efficient

Vertical representation (the column of each item is its tidlist):
      A B C D
T1    0 1 0 1
T2    1 1 1 1
T3    1 0 1 1
T4    0 0 1 0
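The inclusion check on tidlists can be sketched with Python integers used as bitmaps. The encoding is an assumption for illustration: bit t-1 of an item's bitmap is set when the item occurs in transaction Tt, so g(X) is a bitwise AND and g(X) ⊆ g(i) holds when ANDing with g(i) leaves g(X) unchanged. The names `VD`, `g` and `closure` are hypothetical.

```python
# Vertical bitmaps for the example data set.
# Bit (t - 1) of VD[i] is set when item i occurs in transaction Tt.
VD = {
    "A": 0b0110,  # T2, T3
    "B": 0b0011,  # T1, T2
    "C": 0b1110,  # T2, T3, T4
    "D": 0b0111,  # T1, T2, T3
}
ALL_TIDS = 0b1111  # all four transactions


def g(itemset):
    """Tidlist of `itemset` as a bitmap: the AND of the tidlists of its items."""
    bits = ALL_TIDS
    for item in itemset:
        bits &= VD[item]
    return bits


def closure(itemset):
    """Items i with g(itemset) included in g(i); by the inclusion lemma this is c(itemset)."""
    gx = g(itemset)
    return {i for i in VD if (gx & VD[i]) == gx}


if __name__ == "__main__":
    for x in ["A", "B", "CD", "ABCD"]:
        print(x, "->", "".join(sorted(closure(x))))
```

Each inclusion check costs one AND and one comparison per machine word of the tidlist, which is why the vertical bitwise layout suits this computation.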

Handling duplicates
- Goal: identify the unique generator for each equivalence class
- Define the order-preserving property of a generator
- Check whether a given generator is order-preserving or not
- Compute the closure of order-preserving generators only
- Prune the other generators

Handling duplicates
- Order-preserving property of generators:
  - A generator gen = Y ∪ {i}, where Y is a closed itemset and i ∉ Y, is order-preserving iff i ≺ (c(gen) \ gen)
- That is, if items need to be added to an order-preserving generator to compute its closure, they must follow the item i in the item order
- Order-preserving generators are introduced to avoid generating the same closed itemset more than once

Example
- {A} = Ø ∪ {A} is an order-preserving generator: c({A}) = {A,C,D}, and the added items C and D follow A
- {C,D} = {C} ∪ {D} is not order-preserving: c({C,D}) = {A,C,D}, and the added item A precedes D
[Figure: itemset lattice of the example data set (T1 = BD, T2 = ABCD, T3 = ACD, T4 = C), annotated with supports]

Handling duplicates
- We need to check whether a generator is order-preserving or not
- Define a set called pre-set(gen) for a generator gen = Y ∪ {i}:
  - pre-set(gen) = {j | j ∉ gen and j ≺ i}
- We can now check whether a generator is order-preserving by testing:
  - ∃ j ∈ pre-set(gen) such that g(gen) ⊆ g(j)
  - If yes, then gen is not order-preserving
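The pre-set check can be sketched on the example data set, assuming lexicographic item order and the same integer-bitmap encoding of tidlists used for illustration (bit t-1 set when the item occurs in Tt). A generator built from a closed set and an item i is rejected as soon as its tidlist is included in the tidlist of an earlier item that it does not contain. The function name `is_order_preserving` is hypothetical.

```python
# Vertical bitmaps: bit (t - 1) of VD[i] is set when item i occurs in Tt.
VD = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
ALL_TIDS = 0b1111


def g(itemset):
    """Tidlist of `itemset` as a bitmap."""
    bits = ALL_TIDS
    for item in itemset:
        bits &= VD[item]
    return bits


def is_order_preserving(closed_set, i):
    """Check the generator closed_set ∪ {i} against its pre-set."""
    gen = set(closed_set) | {i}
    g_gen = g(gen)
    # Items that precede i in the (assumed lexicographic) order and are not in gen.
    pre_set = [j for j in sorted(VD) if j < i and j not in gen]
    # gen is NOT order-preserving if g(gen) is included in g(j) for some such j.
    return not any((g_gen & VD[j]) == g_gen for j in pre_set)


if __name__ == "__main__":
    print(is_order_preserving("", "A"))    # generator {A}
    print(is_order_preserving("C", "D"))   # generator {C, D}
    print(is_order_preserving("BD", "C"))  # generator {B, C, D}
```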

Handling duplicates
- The goal is to compute the closure of order-preserving generators only
- For any closed itemset Y other than c(Ø), there exists a sequence of order-preserving generators
  - Closure climbing follows the corresponding sequence of closed itemsets until it reaches Y
- For each closed itemset, this sequence of order-preserving generators is unique

Theorem 1 Corollary 1

Handling duplicates
- Example: [Figure: closure climbing on the example lattice through the nodes Ø 4, A 2, AC 2, ACD 2, ABCD 1, with the generators marked]

The DCI_CLOSED Algorithm
- Two different types of data sets: dense and sparse
- Dense data sets
  - Transactions are long
  - Contain strongly correlated items
- Sparse data sets
  - The number of closed itemsets may be nearly equal to the number of frequent itemsets
  - Mining closed itemsets becomes more expensive
- The algorithm is separated into two parts: DCI_CLOSED_s() and DCI_CLOSED_d()

The DCI_CLOSED Algorithm
- Discriminating between sparse and dense data sets:
  - Scan the data set to find the frequent single items F1 ⊆ ℐ
  - Build the bitwise vertical data set VD
    - Items are sorted by increasing frequency
  - Decide whether the data set is sparse or dense
    - Whether the percentage of 1s is large
    - Whether a large set of items is strongly correlated
    - Compute the percentage of the most frequent items that co-occur in the same transactions

The DCI_CLOSED Algorithm
- Three input parameters: CLOSED_SET = c(Ø), PRE_SET = Ø, POST_SET = F1 \ c(Ø)
- Get an item i from POST_SET (the minimum in the order)
- Add i to CLOSED_SET to build new_gen (closure climbing)
- Check the validity of the generator new_gen against PRE_SET
- Compute the closure of new_gen using Lemma 2 to obtain the new CLOSED_SET
- A new closed set is generated from new_gen

The DCI_CLOSED Algorithm
- PRE_SET is used to check the validity of new_gen
  - Guarantees that duplicate generators are correctly pruned
- POST_SET is used to guarantee that generators are produced according to Theorem 1
  - POST_SET contains the items j that follow i in lexicographic order and are not yet included in CLOSED_SET
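The recursion described on these slides can be sketched as follows. This is a simplified reading of the dense-data procedure, not the authors' implementation: tidlists are Python integers (bit t-1 set when the item occurs in Tt), items are visited in lexicographic order rather than by increasing frequency, none of the optimizations is applied, and an item is moved to PRE_SET as soon as it has been processed. The names `dci_closed` and `mine` are hypothetical.

```python
# Vertical bitmaps for the example data set (bit t - 1 <-> transaction Tt).
VD = {"A": 0b0110, "B": 0b0011, "C": 0b1110, "D": 0b0111}
N_TRANSACTIONS = 4


def popcount(bits):
    """Support of a tidlist bitmap."""
    return bin(bits).count("1")


def dci_closed(closed_set, g_closed, pre_set, post_set, min_supp, out):
    pre_set = list(pre_set)  # local copy, so sibling branches are unaffected
    for idx, i in enumerate(post_set):
        g_gen = g_closed & VD[i]  # tidlist of new_gen = closed_set ∪ {i}
        if popcount(g_gen) > min_supp:
            # Duplicate check: new_gen is valid only if g(new_gen) is not
            # included in g(j) for any j in PRE_SET.
            if not any((g_gen & VD[j]) == g_gen for j in pre_set):
                new_closed = closed_set | {i}
                new_post = []
                for j in post_set[idx + 1:]:
                    if (g_gen & VD[j]) == g_gen:
                        new_closed.add(j)   # j belongs to the closure
                    else:
                        new_post.append(j)  # j may extend the new closed set
                out["".join(sorted(new_closed))] = popcount(g_gen)
                dci_closed(new_closed, g_gen, pre_set, new_post, min_supp, out)
        pre_set.append(i)


def mine(min_supp):
    """Closed itemsets with support > min_supp (c(Ø) itself is not reported)."""
    all_tids = (1 << N_TRANSACTIONS) - 1
    frequent = [i for i in sorted(VD) if popcount(VD[i]) > min_supp]
    root = {i for i in frequent if VD[i] == all_tids}  # c(Ø)
    post = [i for i in frequent if i not in root]
    out = {}
    dci_closed(root, all_tids, [], post, min_supp, out)
    return out


if __name__ == "__main__":
    print(mine(0))
```

No previously mined closed itemset is consulted anywhere in the recursion; the duplicate check relies only on the tidlists of the PRE_SET items, which is the memory advantage the slides emphasize.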

Running example of DCI_CLOSED_d()
- Example data (rows are transactions, columns are items):
        A B C D
  T1    0 1 0 1
  T2    1 1 1 1
  T3    1 0 1 1
  T4    0 0 1 0
- [Figure on each slide of the example: the part of the itemset lattice visited so far, annotated with supports]
- CLOSED_SET = c(Ø) = Ø, PRE_SET = Ø, POST_SET = {A,B,C,D}
- Build the generator gen = Ø ∪ {A} = {A} and compute its closure
- Check against PRE_SET → gen is order-preserving
- Check whether g(A) ⊆ g(j), ∀ j ∈ POST_SET; if yes, include j in CLOSED_SET
- Result: C and D are included, so c({A}) = {A,C,D} with support 2

Running example of DCI_CLOSED_d()
- CLOSED_SET = {A,C,D}, PRE_SET = Ø, POST_SET = {B}
- New generator gen = {A,C,D} ∪ {B} = {A,B,C,D}
- Check against PRE_SET → gen is order-preserving
- gen is closed (support 1) since POST_SET is now empty
- Note: in the step {A,C,D} → {A,B,C,D}, the items need not be in order

Running example of DCI_CLOSED_d()
- gen = Ø ∪ {B}, PRE_SET = {A}, POST_SET = {C,D}
- gen is order-preserving, by checking g(B) against g(A)
- Checking g(B) against g(C) and g(D) gives c(B) = {B,D}
- {B,D} is closed (support 2), by checking against POST_SET

Running example of DCI_CLOSED_d()
- CLOSED_SET = {B,D}, PRE_SET = {A}, POST_SET = {C}
- gen is now {B,D} ∪ {C} = {B,C,D}
- Check g({B,C,D}) against g(A): g({B,C,D}) ⊂ g(A)
- gen is not order-preserving and can be pruned together with all its possible extensions

Running example of DCI_CLOSED_d()
- gen = Ø ∪ {C}, PRE_SET = {A,B}, POST_SET = {D}
- gen is order-preserving, by checking g(C) against g(A) and g(B)
- gen cannot be extended, by checking against POST_SET, so it is closed (support 3)

Running example of DCI_CLOSED_d()
- CLOSED_SET = {C}, PRE_SET = {A,B}, POST_SET = {D}
- gen is now {C} ∪ {D} = {C,D}
- Check g({C,D}) against g(A): g({C,D}) ⊆ g(A)
- gen is not order-preserving and can be pruned without considering its possible extensions

Running example of DCI_CLOSED_d()
- gen = Ø ∪ {D}, PRE_SET = {A,B,C}, POST_SET = Ø
- gen is order-preserving, by checking g(D) against g(A), g(B) and g(C)
- gen cannot be extended since POST_SET is empty, so it is closed (support 3)

Optimizations
- The vertical data set (frequent single items only) is represented by a bitmap matrix VD of size M×N
  - VD(i,j) = 1 iff frequent item i occurs in transaction j
  - Row i of the matrix represents g(i), the tidlist of item i
- Goal: optimize the bitwise AND operations used for
  - Tidlist intersections
  - Inclusion checks
- Three optimization techniques

Optimizations
- Data set projection ("projection")
  - For every closed itemset Z discovered from a closed set X, g(Z) ⊆ g(X)
  - So all the columns of VD that correspond to transactions not occurring in g(X) can be deleted
  - Since projection is expensive, it is limited to the generators of the first level of recursion
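A minimal sketch of the projection step on the example data set, assuming VD is stored as one list of 0/1 values per item (one column per transaction) purely for readability; a real implementation would work on packed bit words. The helper names `supporting_columns` and `project` are hypothetical.

```python
# Bitmap matrix VD: one row per item, one column per transaction T1..T4.
VD = {
    "A": [0, 1, 1, 0],
    "B": [1, 1, 0, 0],
    "C": [0, 1, 1, 1],
    "D": [1, 1, 1, 0],
}


def supporting_columns(itemset):
    """Column indexes of the transactions in g(itemset)."""
    n = len(next(iter(VD.values())))
    return [t for t in range(n) if all(VD[i][t] for i in itemset)]


def project(itemset):
    """Keep only the columns of VD that correspond to transactions in g(itemset)."""
    keep = supporting_columns(itemset)
    return {item: [row[t] for t in keep] for item, row in VD.items()}


if __name__ == "__main__":
    # Projecting on the closed set {A, C, D} keeps the columns of T2 and T3.
    print(project("ACD"))
```

Every tidlist intersection performed below the projected closed set then touches only the surviving columns.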

Optimizations
- Data sets with highly correlated items ("section eq")
  - The columns of VD are reordered to profit from data correlation
  - The reordering maximizes the submatrix VE of VD in which all rows are identical
  - VE is likely to be large and to include the most frequent items
  - Many frequent itemsets can be mined within VE

Example, before reordering:
     T1 T2 T3 T4
A     0  1  0  1
B     1  1  1  1
C     1  1  0  1
D     0  1  0  1

After reordering:
     T2 T4 T1 T3
A     1  1  0  0
B     1  1  1  1
C     1  1  1  0
D     1  1  0  0

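Optimizations
- Reusing the results of previous bitwise intersections ("included")
  - The duplicate check compares g(X) with g(j) for each item j in PRE_SET; X passes only if g(X) ⊄ g(j) for every such j
  - Even when the check passes, a large part of g(X) may be included in g(j)
  - If g_h(X) ⊆ g_h(j) for a portion h of the bitmap, then g_h(X ∪ Y) ⊆ g_h(j) for any extension X ∪ Y
  - So the check of g(X ∪ Y) against g(j) can be limited to the part complementary to h
[Figure: bitmaps of g(j), g(X) and g(X ∪ Y), with the portion h already known to be included and the remaining portion still to check]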
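The reuse idea can be sketched with tidlists split into fixed-size words. The data below is made up for illustration and is not taken from the slides, and the helper names are hypothetical. The sketch relies only on the fact that g(X ∪ Y) is a subset of g(X): any word of g(X) already found to be included in g(j) needs no second look when an extension of X is checked.

```python
def included_portions(g_x, g_j):
    """Indexes h of the words where g(X) is included in g(j)."""
    return {h for h, (x, j) in enumerate(zip(g_x, g_j)) if (x & j) == x}


def is_included_reusing(g_xy, g_j, known):
    """Check g(X ∪ Y) ⊆ g(j), skipping the words already known to be included."""
    return all(
        (xy & j) == xy
        for h, (xy, j) in enumerate(zip(g_xy, g_j))
        if h not in known
    )


# Hypothetical tidlists made of two 4-bit words each.
G_X = [0b1100, 0b0110]
G_J = [0b1110, 0b0010]
G_XY = [0b0100, 0b0010]  # tidlist of an extension of X, a subset of G_X

if __name__ == "__main__":
    known = included_portions(G_X, G_J)            # only word 0 is included
    print(known)
    print(is_included_reusing(G_XY, G_J, known))   # only word 1 is examined
```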

Optimizations
- [Figure: actual number of bitwise AND operations vs. support threshold]
- The optimizations “section eq” and “included” are the most effective

Performance Analysis
- Competitors: FP-CLOSE [GRAH03], CLOSET+ [PEI03]
- Environment: Windows XP, Pentium IV 2.8 GHz, 512 MB
- Sparse and dense data sets
  - Table columns: Dataset, Items, Avg. Trans. Size, Transactions
  - Data sets: T40I10D100K, Retail, Chess, Pumsb

Performance Analysis
- Data sets: T40I10D100K, Retail
- DCI_CLOSED is faster by one order of magnitude

Performance Analysis
- Data sets: Chess, Pumsb

Performance Analysis
- Time efficiency of duplicate checking
- Speedup of up to six when support thresholds are small
- [Figure: Chess data set]

References
[GRAH03] G. Grahne and J. Zhu, “Efficiently Using Prefix-Trees in Mining Frequent Itemsets,” Proc. ICDM Workshop Frequent Itemset Mining Implementations, Dec. 2003.
[PEI00] J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets,” Proc. ACM SIGMOD Int’l Workshop Data Mining and Knowledge Discovery, May 2000.
[PEI03] J. Pei, J. Han, and J. Wang, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets,” Proc. Ninth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, Aug. 2003.
[TAOU00] R. Taouil, N. Pasquier, Y. Bastide, L. Lakhal, and G. Stumme, “Mining Frequent Patterns with Counting Inference,” SIGKDD Explorations, vol. 2, no. 2, Dec. 2000.
[ZAKI02] M.J. Zaki and C.-J. Hsiao, “Charm: An Efficient Algorithm for Closed Itemsets Mining,” Proc. Second SIAM Int’l Conf. Data Mining, Apr. 2002.
[ZAKI04] M.J. Zaki, “Mining Non-Redundant Association Rules,” Data Mining and Knowledge Discovery, vol. 9, no. 3, 2004.