Fast and Memory Efficient Mining of Frequent Closed Itemsets
Claudio Lucchese, Salvatore Orlando, Raffaele Perego
DB group seminar. Presenter: Leonidas
Abstract
  Frequent Itemsets Mining
  Closed Itemsets Mining
  Frequent Closed Itemsets
  Handling duplicates
  Brief introduction of the algorithm
  Experimental results
Frequent Itemsets Mining
  Given a set of items ℐ and a set of transactions D
    Each transaction t in D is a set of items from ℐ
    An itemset I is a set of items from ℐ; a k-itemset contains k items
  Support supp(I): the number of transactions in D that include I
  Goal: discover all itemsets from ℐ with support > min_supp
  Well-known algorithm for discovering frequent itemsets: Apriori
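The support definition can be made concrete with a brute-force sketch. It uses the four-transaction example that appears on the later slides; the function names `supp` and `frequent_itemsets` are illustrative, and the enumeration is exhaustive rather than Apriori.

```python
from itertools import combinations

# Four-transaction example from the later slides (TID -> items).
D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}

def supp(itemset, transactions=D):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions.values() if items <= t)

def frequent_itemsets(min_supp, transactions=D):
    """Exhaustive enumeration (not Apriori) of itemsets with supp > min_supp."""
    universe = sorted(set().union(*transactions.values()))
    result = {}
    for k in range(1, len(universe) + 1):
        for combo in combinations(universe, k):
            s = supp(combo, transactions)
            if s > min_supp:  # strict inequality, as written on the slide
                result["".join(combo)] = s
    return result

frequent = frequent_itemsets(1)
```

With min_supp = 1 only itemsets supported by at least two transactions survive, so AB and BC (support 1) are dropped.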
Weaknesses & Solutions
  The number of frequent itemsets grows quickly as min_supp decreases
    Complexity of the mining task increases rapidly
    The output becomes huge and hard to analyze
  Closed itemsets are one of the solutions
    They are the unique maximal elements of the equivalence classes defined over the lattice of all frequent itemsets
Weaknesses & Solutions
  Equivalence class
    A distinct group of frequent itemsets
    Supported by the same set of transactions
    Representing the same knowledge
  Vertical bitwise representation of the data set
  Extracted association rules are more meaningful [ZAKI04]
    Redundancies are removed
  Suitable for dense data sets
    Frequent closed itemsets are far fewer than frequent itemsets
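Closed Itemsets
  Let I ⊆ ℐ be a set of items appearing in D, and T ⊆ D a set of transactions
  Define two functions:
    f(T) = { i ∈ ℐ | ∀ t ∈ T, i ∈ t }   (items common to all transactions in T)
    g(I) = { t ∈ D | ∀ i ∈ I, i ∈ t }   (transactions that contain all items of I)
  Itemset I is closed iff c(I) = f(g(I)) = I
  The function c = f ∘ g is called the Galois operator, or closure operator
  Example data set:
    TID  Items
    1    B D
    2    A B C D
    3    A C D
    4    C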
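A minimal sketch of the closure operator on the example data set, assuming the usual Galois definitions (f maps transactions to their common items, g maps an itemset to its supporting transactions). The names `f`, `g`, `closure` and `is_closed` are chosen for illustration.

```python
# Example data set from the slide (TID -> items).
D = {1: {"B", "D"}, 2: {"A", "B", "C", "D"}, 3: {"A", "C", "D"}, 4: {"C"}}
ITEMS = set().union(*D.values())

def g(itemset):
    """Tids of the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return {tid for tid, t in D.items() if items <= t}

def f(tids):
    """Items common to all transactions in `tids` (all items if `tids` is empty)."""
    common = set(ITEMS)
    for tid in tids:
        common &= D[tid]
    return common

def closure(itemset):
    """Galois (closure) operator c = f o g."""
    return f(g(itemset))

def is_closed(itemset):
    return closure(itemset) == set(itemset)
```

For instance, `closure("A")` adds C and D because every transaction containing A also contains them.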
Equivalence classes
  Two itemsets belong to the same equivalence class iff
    They have the same closure
    They are supported by the same set of transactions
  An itemset I is closed iff no superset of I has the same support
  [Figure: lattice of itemsets with supports for the example data set; legend: frequent closed itemset, frequent itemset, support, equivalence class]
    Ø 4
    A 2, B 2, C 3, D 3
    AB 1, AC 2, AD 2, BC 1, BD 2, CD 2
    ABC 1, ABD 1, ACD 2, BCD 1
    ABCD 1
  Closed itemsets in the figure: C 3, D 3, BD 2, ACD 2, ABCD 1
  Example data set:
    TID  Items
    1    B D
    2    A B C D
    3    A C D
    4    C
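The partition into equivalence classes can be reproduced by grouping every itemset under its closure. This is an exhaustive sketch over the example data set, not part of the presented algorithm; `classes` is an illustrative name.

```python
from collections import defaultdict
from itertools import combinations

D = [{"B", "D"}, {"A", "B", "C", "D"}, {"A", "C", "D"}, {"C"}]
ITEMS = sorted(set().union(*D))

def closure(itemset):
    """Intersection of all transactions that contain `itemset`."""
    covering = [t for t in D if itemset <= t]
    return frozenset(set.intersection(*covering)) if covering else frozenset(ITEMS)

# Map each closed itemset to the members of its equivalence class.
classes = defaultdict(list)
for k in range(1, len(ITEMS) + 1):
    for combo in combinations(ITEMS, k):
        classes[closure(frozenset(combo))].append("".join(combo))
```

The 15 non-empty itemsets collapse into 5 classes, one per closed itemset, which is the reduction the slide refers to.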
Mining Frequent Closed Itemsets
  Search space browsing
    Traverse the lattice of frequent itemsets from one equivalence class to another
  Closure computation
    Compute the closure of frequent itemsets
    Determine the closed itemsets
  Closure generator: a single representative of an equivalence class
    All closed itemsets can be mined by computing the closure of the generator of each class
Browsing the Search Space
  Choose the key patterns (minimal elements of the equivalence classes) as generators
  Traverse the lattice formed by the key patterns with an Apriori-like algorithm [TAOU00]
  Unfortunately, the same closed itemset can be reached from more than one key pattern
  [Figure: itemset lattice of the example data set, as on the Equivalence classes slide]
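The duplicate problem with key patterns can be seen on the example data set. The sketch below assumes the usual definition of a key pattern (no proper subset has the same support) and lists which key patterns lead to each closed itemset; it is exhaustive and only for illustration.

```python
from itertools import combinations

D = [set("BD"), set("ABCD"), set("ACD"), set("C")]
ITEMS = "ABCD"

def supp(x):
    return sum(1 for t in D if set(x) <= t)

def closure(x):
    covering = [t for t in D if set(x) <= t]
    return "".join(sorted(set.intersection(*covering)))

def is_key_pattern(x):
    """Dropping any single item must strictly increase the support."""
    return all(supp(x.replace(i, "")) > supp(x) for i in x)

key_patterns = [
    "".join(c)
    for k in range(1, len(ITEMS) + 1)
    for c in combinations(ITEMS, k)
    if is_key_pattern("".join(c))
]

# Closed itemset -> key patterns whose closure it is.
reached_from = {}
for kp in key_patterns:
    reached_from.setdefault(closure(kp), []).append(kp)
```

Here ACD is reached from both A and CD, and ABCD from both AB and BC, so a key-pattern traversal computes those closures twice.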
Browsing the Search Space
  Closure climbing
    New generators are built as supersets of the closed itemsets discovered so far
    Jump from one equivalence class to another
    Cannot ensure that the equivalence class has not been visited yet
  [Figure: itemset lattice of the example data set, as on the Equivalence classes slide]
Problem of duplicates
  Duplicate checking is needed to avoid generating the same closed itemset more than once
  To avoid useless, expensive closure operations, use the following lemma:
    Lemma 1: given two itemsets X and Y, if X ⊂ Y and supp(X) = supp(Y), then c(X) = c(Y)
  However, the check is still expensive in time and space
    All the closed itemsets mined so far need to be kept in main memory
  Several algorithms are forced to adopt a strict lexicographic visiting order of the search space to ensure correct duplicate avoidance
    CHARM [ZAKI02], CLOSET [PEI00], CLOSET+ [PEI03]
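A sketch of why this kind of duplicate check is memory-hungry. It assumes the lemma is the usual subsumption test (if X ⊂ Y and supp(X) = supp(Y), then c(X) = c(Y)); the dictionary `closed_so_far` and the function `is_duplicate` are hypothetical names, not taken from any of the cited algorithms.

```python
def is_duplicate(gen, gen_supp, closed_so_far):
    """True if some stored closed itemset contains `gen` with equal support.

    `closed_so_far` maps frozenset -> support and must hold every closed
    itemset mined so far, which is the memory cost the slide points out.
    """
    return any(gen <= y and gen_supp == s for y, s in closed_so_far.items())

# Closed itemsets already found on the example data set.
closed_so_far = {frozenset("ACD"): 2, frozenset("ABCD"): 1}

dup_cd = is_duplicate(frozenset("CD"), 2, closed_so_far)  # CD is subsumed by ACD
dup_b = is_duplicate(frozenset("B"), 2, closed_so_far)    # B is not subsumed
```

Each check scans the stored closed itemsets, so both time and space grow with the size of the output.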
Computing Closures
  Besides the Galois operator, make use of the lemma (Lemma 2):
    Given an itemset X and an item i, g(X) ⊆ g(i) ⇔ i ∈ c(X)
  Perform the inclusion check for all items in ℐ
    The check benefits from a vertical representation of the data set as a list of tidlists
  The computation can be either offline or online
    Offline: compute closures for the entire set of generators
      Uses key patterns, so generators are shorter
    Online: compute the closure of each generator as it is discovered
      Uses closure climbing, so generators are longer
      Longer generators need fewer checks, which is more efficient
  tidlist table (1 = item occurs in the transaction):
    Item  A B C D
    T1    0 1 0 1
    T2    1 1 1 1
    T3    1 0 1 1
    T4    0 0 1 0
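The inclusion test maps directly onto bitwise operations. The sketch below encodes the tidlist table as Python integers (leftmost bit = T1) and assumes the lemma is the test g(X) ⊆ g(i) ⇔ i ∈ c(X); `VD`, `tidlist`, `included` and `closure` are illustrative names.

```python
# One bitmap per item, read column-wise from the tidlist table (T1 T2 T3 T4).
VD = {"A": 0b0110, "B": 0b1100, "C": 0b0111, "D": 0b1110}
ALL_TIDS = 0b1111

def tidlist(itemset):
    """g(X): bitwise AND of the tidlists of all items in X."""
    bits = ALL_TIDS
    for i in itemset:
        bits &= VD[i]
    return bits

def included(a, b):
    """Bitmap a is a subset of bitmap b iff ANDing with b leaves a unchanged."""
    return (a & b) == a

def closure(gen):
    """c(gen): every item whose tidlist includes g(gen)."""
    g_gen = tidlist(gen)
    return {i for i in VD if included(g_gen, VD[i])}
```

One AND and one comparison per item replace the two set-valued passes of the Galois operator.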
Handling duplicates
  Goal: identify a unique generator for each equivalence class
  Define an order-preserving property of generators
  Check whether a given generator is order-preserving
  Compute the closure of order-preserving generators only
  Prune all other generators
Handling duplicates
  Order-preserving property of generators:
    A generator gen = Y ∪ i, where Y is a closed itemset and i ∉ Y, is order-preserving iff i ≺ (c(gen) \ gen)
  That is, any items that must be added to an order-preserving generator to obtain its closure have to follow item i in the item order
  Order-preserving generators are introduced to avoid generating the same closed itemset more than once
Example
  {A} = Ø ∪ {A} is an order-preserving generator
    c({A}) = {A,C,D}: the added items C and D both follow A
  {C,D} = {C} ∪ {D} is not order-preserving
    c({C,D}) = {A,C,D}: the added item A precedes D
  [Figure: itemset lattice and tidlist table of the example data set]
Handling duplicates
  We need to check whether a generator is order-preserving
  Define a set called pre-set(gen) for a generator gen = Y ∪ i:
    pre-set(gen) = { j ∈ ℐ | j ≺ i, j ∉ gen }
  A generator can now be tested by checking whether
    ∃ j ∈ pre-set(gen) such that g(gen) ⊆ g(j)
  If yes, then gen is not order-preserving
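A sketch of the pre-set test on the example bitmaps. It assumes that pre-set(gen) holds the items that precede i in the item order and are not in gen, and that the test is g(gen) ⊆ g(j) for some j in the pre-set; the function name `is_order_preserving` is illustrative.

```python
VD = {"A": 0b0110, "B": 0b1100, "C": 0b0111, "D": 0b1110}  # T1 is the leftmost bit
ORDER = "ABCD"      # assumed item order
ALL_TIDS = 0b1111

def tidlist(itemset):
    bits = ALL_TIDS
    for i in itemset:
        bits &= VD[i]
    return bits

def is_order_preserving(closed_set, i):
    """Test the generator closed_set ∪ {i} against its pre-set."""
    gen = set(closed_set) | {i}
    g_gen = tidlist(gen)
    pre_set = [j for j in ORDER if j < i and j not in gen]
    # If g(gen) is included in g(j), then j belongs to c(gen) but precedes i.
    return not any((g_gen & VD[j]) == g_gen for j in pre_set)
```

The test needs only the tidlists of the pre-set items, not the closed itemsets mined so far.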
Handling duplicates
  The goal is to compute the closure of order-preserving generators only
  For any closed itemset Y, there exists a sequence of order-preserving generators
    Closure climbing follows the corresponding sequence of closed itemsets until it reaches Y
  For each closed itemset, the sequence of order-preserving generators is unique
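Theorem 1: for every closed itemset Y ≠ c(Ø), there exists a sequence of order-preserving generators whose closures climb from c(Ø) up to Y
Corollary 1: for each closed itemset Y ≠ c(Ø), this sequence of order-preserving generators is unique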
Handling duplicates
  Example
  [Figure: lattice path Ø 4, A 2, AC 2, ACD 2, ABCD 1, with the generators marked]
The DCI_CLOSED Algorithm
  Two different types of data sets: dense and sparse
  Dense data sets
    Transactions are long
    Contain strongly correlated items
  Sparse data sets
    The number of closed itemsets may be nearly equal to the number of frequent itemsets
    Mining closed itemsets becomes more expensive
  The algorithm is separated into two parts: DCI_CLOSED_s() and DCI_CLOSED_d()
The DCI_CLOSED Algorithm
  Discriminate between sparse and dense data sets:
    Scan the data set to find the frequent single items F1 ⊆ ℐ
    Build the bitwise vertical data set VD
      Items are sorted by increasing frequency
    Decide whether the data set is sparse or dense
      Dense if the percentage of 1s in VD is large
      Dense if a large set of items is strongly correlated
        Compute the percentage of the most frequent items that co-occur in the same transactions
The DCI_CLOSED Algorithm
  3 input parameters: CLOSED_SET = c(Ø), PRE_SET = Ø, POST_SET = F1 \ c(Ø)
  Get an item i from POST_SET (the minimum in the item order)
  Add i to CLOSED_SET to build new_gen (closure climbing)
  Check the validity of generator new_gen against PRE_SET
  Compute the closure of new_gen using Lemma 2 to obtain the new CLOSED_SET
  A new closed set is generated from new_gen
The DCI_CLOSED Algorithm
  PRE_SET is used to check the validity of new_gen
    Guarantees that duplicate generators are correctly pruned
  POST_SET is used to guarantee that generators are produced according to Theorem 1
    POST_SET contains the items j that follow i in lexicographic order and are not yet included in CLOSED_SET
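The steps above can be put together in a short recursive sketch. It is a reading of the slides, not the authors' implementation: it runs on the four-transaction example, assumes c(Ø) = Ø, keeps the slide's strict test supp > min_supp, and omits all optimizations. The name `dci_closed_d` and its signature are illustrative.

```python
VD = {"A": 0b0110, "B": 0b1100, "C": 0b0111, "D": 0b1110}  # T1 is the leftmost bit
ALL_TIDS = 0b1111

def popcount(bits):
    return bin(bits).count("1")

def dci_closed_d(closed_set, g_closed, pre_set, post_set, min_supp, out):
    pre_set = list(pre_set)                       # local copy for this level
    for idx, i in enumerate(post_set):
        g_gen = g_closed & VD[i]                  # tidlist of new_gen = CLOSED_SET ∪ {i}
        if popcount(g_gen) <= min_supp:
            continue                              # new_gen is not frequent
        if any((g_gen & VD[j]) == g_gen for j in pre_set):
            continue                              # not order-preserving: prune
        new_closed = set(closed_set) | {i}
        new_post = []
        for j in post_set[idx + 1:]:
            if (g_gen & VD[j]) == g_gen:
                new_closed.add(j)                 # j belongs to the closure
            else:
                new_post.append(j)                # j may extend the closure later
        out.append(("".join(sorted(new_closed)), popcount(g_gen)))
        dci_closed_d(new_closed, g_gen, pre_set, new_post, min_supp, out)
        pre_set.append(i)
    return out

closed_itemsets = dci_closed_d(set(), ALL_TIDS, [], list("ABCD"), 0, [])
```

On the example data set this visits the closed itemsets in the order ACD, ABCD, BD, C, D, with BCD and CD pruned by the PRE_SET check, and no previously mined itemset is consulted.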
Running example of DCI_CLOSED_d()
  CLOSED_SET = c(Ø) = Ø, PRE_SET = Ø, POST_SET = {A,B,C,D}
  Compute the closure of generator gen = Ø ∪ {A} = {A}
    Check against PRE_SET: gen is order-preserving
    Check whether g(A) ⊆ g(j), ∀ j ∈ POST_SET
      If yes, include j in CLOSED_SET
  [Figure: lattice nodes Ø 4, A 2, AC 2, ACD 2, with generators marked, and the tidlist table]
Running example of DCI_CLOSED_d()
  CLOSED_SET = {A,C,D}, PRE_SET = Ø, POST_SET = {B}
  New generator gen = {A,C,D} ∪ {B} = {A,B,C,D}
    Check against PRE_SET: gen is order-preserving
    gen is closed since no items remain in POST_SET
  Note: {A,C,D} → {A,B,C,D}; the items need not be added in lexicographic order
  [Figure: lattice nodes Ø 4, A 2, AC 2, ACD 2, ABCD 1, with generators marked, and the tidlist table]
Running example of DCI_CLOSED_d()
  gen = Ø ∪ {B}, PRE_SET = {A}, POST_SET = {C,D}
  gen is order-preserving: g(B) is not included in g(A)
  Checking g(B) against g(C) and g(D) gives c({B}) = {B,D}
  {B,D} is closed by checking with POST_SET
  [Figure: lattice nodes Ø 4, A 2, AC 2, ACD 2, ABCD 1, B 2, BD 2, with generators marked, and the tidlist table]
Running example of DCI_CLOSED_d()
  CLOSED_SET = {B,D}, PRE_SET = {A}, POST_SET = {C}
  gen is now {B,D} ∪ {C} = {B,C,D}
  Check g({B,C,D}) against g(A): g({B,C,D}) ⊆ g(A)
  gen is not order-preserving and can be pruned together with all its possible extensions
  [Figure: lattice nodes Ø 4, A 2, AC 2, ACD 2, ABCD 1, B 2, BD 2, BCD 1, with generators marked, and the tidlist table]
Running example of DCI_CLOSED_d()
  gen = Ø ∪ {C}, PRE_SET = {A,B}, POST_SET = {D}
  gen is order-preserving by checking with g(A), g(B)
  gen cannot be extended by checking with POST_SET, so it is closed
  [Figure: lattice nodes Ø 4, A 2, AC 2, ACD 2, ABCD 1, B 2, BD 2, BCD 1, C 3, with generators marked, and the tidlist table]
Running example of DCI_CLOSED_d()
  CLOSED_SET = {C}, PRE_SET = {A,B}, POST_SET = {D}
  gen is now {C} ∪ {D} = {C,D}
  Check g({C,D}) against g(A): g({C,D}) ⊆ g(A)
  gen is not order-preserving and can be pruned without considering its possible extensions
  [Figure: lattice nodes Ø 4, A 2, AC 2, ACD 2, ABCD 1, B 2, BD 2, BCD 1, C 3, CD 2, with generators marked, and the tidlist table]
Running example of DCI_CLOSED_d()
  gen = Ø ∪ {D}, PRE_SET = {A,B,C}, POST_SET = Ø
  gen is order-preserving by checking with g(A), g(B), g(C)
  gen cannot be extended by checking with POST_SET, so it is closed
  [Figure: lattice nodes Ø 4, A 2, AC 2, ACD 2, ABCD 1, B 2, BD 2, BCD 1, C 3, CD 2, D 3, with generators marked, and the tidlist table]
Optimizations
  The vertical data set (frequent single items) is represented by a bitmap matrix VD of size M × N
    VD(i,j) = 1 when frequent item i occurs in transaction j
    Row i of the matrix represents g(i), the tidlist of item i
  Optimize the bitwise AND operations used for
    tidlist intersections
    inclusion checks
  3 optimization techniques
Optimizations
  Data Set Projection (projection)
    For every closed itemset Z derived from a closed itemset X, g(Z) ⊆ g(X)
    Delete from VD all columns corresponding to transactions that do not occur in g(X)
    Because it is expensive, projection is applied only to generators at the 1st level of recursion
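A sketch of projection on integer bitmaps: only the columns whose bit is set in g(X) are kept, so every later AND touches fewer bits. The helper `project` and its signature are illustrative and not taken from the paper.

```python
VD = {"A": 0b0110, "B": 0b1100, "C": 0b0111, "D": 0b1110}  # T1 is the leftmost bit

def project(vd, g_x, n_tids):
    """Drop every column (transaction) whose bit is 0 in g_x."""
    # Bit positions to keep, most significant (leftmost column) first.
    kept = [k for k in range(n_tids - 1, -1, -1) if (g_x >> k) & 1]
    projected = {}
    for item, bits in vd.items():
        row = 0
        for k in kept:
            row = (row << 1) | ((bits >> k) & 1)
        projected[item] = row
    return projected, len(kept)

# Project on g(A) = {T2, T3}: only two columns remain.
vd_a, width = project(VD, VD["A"], 4)
```

The compaction itself costs a pass over all rows, which is consistent with applying it only at the first level of recursion.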
Optimizations
  Data Sets with Highly Correlated Items (section eq)
    Columns of VD are reordered to profit from data correlation
    Maximize the submatrix VE of VD whose rows are all identical
    VE is likely to be large and to include the most frequent items
    Many frequent itemsets can be mined within VE
  VD before reordering:
        T1 T2 T3 T4
    A   0  1  0  1
    B   1  1  1  1
    C   1  1  0  1
    D   0  1  0  1
  VD after reordering:
        T2 T4 T1 T3
    A   1  1  0  0
    B   1  1  1  1
    C   1  1  1  0
    D   1  1  0  0
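One simple way to obtain the ordering shown on the slide is to move the columns on which all rows agree to the front. This is an assumed heuristic for illustration only; it considers all rows rather than a chosen subset of the most frequent items, and the paper's actual procedure may differ.

```python
# Matrix from the slide, one row per item, columns T1..T4.
VD = {
    "A": [0, 1, 0, 1],
    "B": [1, 1, 1, 1],
    "C": [1, 1, 0, 1],
    "D": [0, 1, 0, 1],
}
TIDS = ["T1", "T2", "T3", "T4"]

def reorder_columns(vd, tids):
    """Stable reorder: columns where every row has the same bit come first."""
    rows = list(vd.values())

    def uniform(c):
        return len({row[c] for row in rows}) == 1

    order = sorted(range(len(tids)), key=lambda c: not uniform(c))
    new_vd = {item: [row[c] for c in order] for item, row in vd.items()}
    new_tids = [tids[c] for c in order]
    width = sum(uniform(c) for c in range(len(tids)))  # number of columns in VE
    return new_vd, new_tids, width

new_vd, new_tids, ve_width = reorder_columns(VD, TIDS)
```

Inside the first `ve_width` columns every row is identical, so an intersection or inclusion check there can be answered once for all items.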
Optimizations
  Reusing Results of Previous Bitwise Intersections (included)
    To check whether an itemset X is closed, g(X) is compared with g(j) for every item j in its PRE_SET
    If X is closed, g(X) ⊄ g(j) for all such j
    Still, a large part of g(X) may be included in g(j)
    Let g_h(X) ⊆ g_h(j) for a portion h of the bitmaps; then g_h(X ∪ Y) ⊆ g_h(j) for any extension X ∪ Y
    For X ∪ Y, the check against g(j) can be limited to the complementary part of g_h(j)
  [Figure: bit-vectors g(j), g(X), and g(X ∪ Y), showing portion h and the part still to be checked]
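The reuse idea can be sketched with tidlists split into machine words. Since g(X ∪ Y) ⊆ g(X), any word h where g_h(X) ⊆ g_h(j) already holds can be skipped when the extension is checked. The function `included_words`, the word layout, and the sample bitmaps are assumptions made for illustration.

```python
def included_words(g_x, g_j, skip=frozenset()):
    """Word-wise inclusion test of g_x in g_j.

    Returns (is_included, ok), where ok is the set of word indexes h with
    g_x[h] included in g_j[h]. Words listed in `skip` are trusted from an
    earlier check on a subset of the itemset and are not recomputed.
    """
    ok = set(skip)
    for h, (wx, wj) in enumerate(zip(g_x, g_j)):
        if h in skip:
            continue
        if (wx & wj) == wx:
            ok.add(h)
    return len(ok) == len(g_x), ok

# Hypothetical two-word tidlists.
g_j = [0b1110, 0b0100]
g_x = [0b1010, 0b0110]    # word 0 is included in g_j, word 1 is not
g_xy = [0b1000, 0b0100]   # tidlist of an extension X ∪ Y, a subset of g_x

incl_x, ok_x = included_words(g_x, g_j)
incl_xy, ok_xy = included_words(g_xy, g_j, skip=ok_x)  # only word 1 is rechecked
```

The second call performs one AND instead of two, which is the saving this optimization aims for.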
Optimizations
  [Figure: actual number of bitwise AND operations vs. support threshold]
  Optimizations “section eq” & “included” are the most effective
Performance Analysis
  Competitors: FP-CLOSE [GRAH03], CLOSET+ [PEI03]
  Environment: Windows XP, Pentium IV 2.8GHz, 512MB
  Sparse & dense data sets
  [Table: columns Dataset, Items, Avg. Trans. Size, Transactions; rows T40I10D100K, Retail, Chess, Pumsb]
Performance Analysis
  Data sets: T40I10D100K, Retail
  DCI_CLOSED is faster by one order of magnitude
Performance Analysis
  Data sets: Chess, Pumsb
Performance Analysis
  Time efficiency of duplicate checking
  Speedup of up to six when support thresholds are small
  [Figure: Chess data set]
References
  [GRAH03] G. Grahne and J. Zhu, “Efficiently Using Prefix-Trees in Mining Frequent Itemsets,” Proc. ICDM Workshop Frequent Itemset Mining Implementations, Dec.
  [PEI00] J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets,” Proc. ACM SIGMOD Int’l Workshop Data Mining and Knowledge Discovery, May.
  [PEI03] J. Pei, J. Han, and J. Wang, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets,” Proc. Ninth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, Aug.
  [TAOU00] R. Taouil, N. Pasquier, Y. Bastide, L. Lakhal, and G. Stumme, “Mining Frequent Patterns with Counting Inference,” SIGKDD Explorations, vol. 2, no. 2, Dec.
  [ZAKI02] M.J. Zaki and C.-J. Hsiao, “Charm: An Efficient Algorithm for Closed Itemsets Mining,” Proc. Second SIAM Int’l Conf. Data Mining, Apr.
  [ZAKI04] M.J. Zaki, “Mining Non-Redundant Association Rules,” Data Mining and Knowledge Discovery, vol. 9, no. 3, 2004.