Fast Vertical Mining Using Diffsets
Mohammed J. Zaki, Karam Gouda
Outline
  - Introduction
  - Problem Setting and Notations
  - Equivalence Classes & Diffsets
  - Algorithms for Mining Frequent, Closed and Maximal Patterns
  - Experimental Results
  - Conclusions
Amir Epstein
Introduction
  - Horizontal methods (most are Apriori variants)
  - Mining maximal frequent patterns (All-MFS, MaxMiner, DepthProject, FP-growth)
  - Mining closed sets (A-Close, Closet, Charm)
  - Vertical methods
  - Problems of the vertical approach
  - Diffsets
Notations
  - $I$: set of items
  - $T$: database of transactions
  - Tid: transaction identifier
  - Itemset: a set $X \subseteq I$
  - Tidset: a set $Y \subseteq T$
  - k-itemset: an itemset with k items
  - Support of an itemset $X$, denoted $\sigma(X)$: the number of transactions in which $X$ occurs as a subset
Notation
  - Frequent itemset: $X$ is frequent if $\sigma(X) \ge min\_sup$
  - Powerset $P(I)$: the search space for enumeration
  - Maximal frequent itemset: a frequent itemset that is not a subset of any other frequent itemset
  - Closed frequent itemset $X$: a frequent itemset for which there exists no proper superset $Y \supset X$ with $\sigma(Y) = \sigma(X)$
  - Closure of an itemset $X$, denoted $c(X)$: the smallest closed set that contains $X$
The Problem
  - Find all frequent itemsets, that is, all itemsets whose support is at least the minimum support
Database Example
Frequent, Closed and Maximal Itemsets
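The definitions of frequent, closed and maximal itemsets can be checked by brute force on a small database. The sketch below is only illustrative: the six transactions are an assumption, reconstructed from the tidsets listed in the tidset intersection example of these slides, and the absolute threshold MIN_SUP = 3 is likewise assumed. The names DB, support, frequent, closed and maximal are hypothetical.

```python
from itertools import combinations

# Assumed example database (tid -> items), reconstructed from the tidsets
# that appear in the tidset intersection example.
DB = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
MIN_SUP = 3  # assumed absolute minimum support


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in DB.values() if set(itemset) <= set(items))


items = sorted(set("".join(DB.values())))

# Enumerate the whole powerset (fine for 5 items, not for real data).
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(combo)
        if s >= MIN_SUP:
            frequent["".join(combo)] = s

# Closed: no proper superset has the same support.
closed = {
    x for x, s in frequent.items()
    if not any(set(x) < set(y) and s == sy for y, sy in frequent.items())
}

# Maximal: no proper superset is frequent at all.
maximal = {x for x in frequent if not any(set(x) < set(y) for y in frequent)}

print(len(frequent), sorted(closed), sorted(maximal))
```

Restricting the superset search to frequent itemsets is safe here, because any superset with the same support as a frequent itemset is itself frequent.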
Data Formats
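As a rough illustration of the horizontal and vertical data formats, the following sketch converts between a tid-to-items layout and an item-to-tidset layout. The example transactions and the names HORIZONTAL, to_vertical and to_horizontal are assumptions made for this sketch, not taken from the slide.

```python
from collections import defaultdict

# Horizontal format: each transaction id maps to its items (assumed data).
HORIZONTAL = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}


def to_vertical(horizontal):
    """Invert tid -> items into item -> tidset."""
    vertical = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)


def to_horizontal(vertical):
    """Invert item -> tidset back into tid -> sorted item string."""
    horizontal = defaultdict(list)
    for item, tids in vertical.items():
        for tid in tids:
            horizontal[tid].append(item)
    return {tid: "".join(sorted(items)) for tid, items in horizontal.items()}


VERTICAL = to_vertical(HORIZONTAL)
print(sorted(VERTICAL["A"]))
```

In the vertical layout the support of a single item is simply the length of its tidset.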
Equivalence Classes
  - Define a function $p(X, k) = X[1:k]$, the k-length prefix of $X$
  - Define a prefix-based equivalence relation $\theta_k$: $\forall X, Y \subseteq I,\ X \equiv_{\theta_k} Y \iff p(X, k) = p(Y, k)$
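Example: subset search tree (each node is written as an itemset followed by the items that can still extend it)
  Level 0: {} {A,C,D,T,W}
  Level 1: A {C,D,T,W}   C {D,T,W}   D {T,W}   T {W}   W
  Level 2: AC {D,T,W}   AD {T,W}   AT {W}   AW   CD {T,W}   CT {W}   CW   DT {W}   DW   TW
  Level 3: ACD {T,W}   ACT {W}   ACW   ADT {W}   ADW   ATW   CDT {W}   CDW   CTW   DTW
  Level 4: ACDT {W}   ACDW   ACTW   ADTW   CDTW
  Level 5: ACDTW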
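The prefix-based grouping can be sketched in a few lines. This is a minimal illustration, assuming items are ordered lexicographically; the function names prefix and prefix_classes are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def prefix(itemset, k):
    """p(X, k): the first k items of X, with X kept in sorted order."""
    return "".join(sorted(itemset))[:k]


def prefix_classes(itemsets, k):
    """Group itemsets that share the same k-length prefix."""
    groups = defaultdict(list)
    for x in itemsets:
        groups[prefix(x, k)].append("".join(sorted(x)))
    return dict(groups)


# All 2-itemsets over the items A, C, D, T, W, grouped by 1-length prefix.
pairs = ["".join(c) for c in combinations("ACDTW", 2)]
by_prefix = prefix_classes(pairs, 1)
print(by_prefix)
```

Each group corresponds to one class of the subset search tree, for example the class with prefix A holds AC, AD, AT and AW.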
Compute Subset Class
  - Let $[P] = \{X_1, X_2, \ldots, X_n\}$ be a class with prefix $P$
  - Perform intersection of $PX_i$ with all $PX_j$ with $j > i$ to obtain a new class $[PX_i]$ with elements $X_j$, where $PX_iX_j$ is frequent
Tidset Intersections (example)
  1-itemsets
    A: 1 3 4 5
    C: 1 2 3 4 5 6
    D: 2 4 5 6
    T: 1 3 5 6
    W: 1 2 3 4 5
  2-itemsets
    AC: 1 3 4 5
    AD: 4 5
    AT: 1 3 5
    AW: 1 3 4 5
    CD: 2 4 5 6
    CT: 1 3 5 6
    CW: 1 2 3 4 5
    DT: 5 6
    DW: 2 4 5
    TW: 1 3 5
  3-itemsets
    ACT: 1 3 5
    ACW: 1 3 4 5
    ATW: 1 3 5
    CDW: 2 4 5
    CTW: 1 3 5
  4-itemsets
    ACTW: 1 3 5
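The class-by-class intersection step can be sketched as a short recursive routine. This is a simplified illustration under stated assumptions, not the authors' implementation: MIN_SUP = 3 is an assumed absolute threshold, itemsets are plain strings, and the name mine_class is hypothetical.

```python
# Tidsets of the single items, as listed in the example.
TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
MIN_SUP = 3  # assumed absolute minimum support


def mine_class(members, min_sup, out):
    """members: list of (itemset, tidset) pairs sharing a common prefix.

    Each member is intersected with the members that follow it; the
    frequent results form the next class, which is mined recursively.
    """
    for i, (xi, ti) in enumerate(members):
        out[xi] = len(ti)
        new_class = []
        for xj, tj in members[i + 1:]:
            tij = ti & tj  # tidset intersection
            if len(tij) >= min_sup:
                new_class.append((xi + xj[-1], tij))
        if new_class:
            mine_class(new_class, min_sup, out)
    return out


frequent = mine_class(sorted(TIDSETS.items()), MIN_SUP, {})
print(frequent["ACTW"])
```

With this threshold AD and DT (support 2) are dropped, so no itemset containing them is ever generated.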
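Diffsets
  - A diffset is the difference between the prefix tidset and a class member's tidset
  - Consider a class with prefix $P$
  - Let $t(X)$ denote the tidset of element $X$
  - Let $d(X)$ denote the diffset of element $X$ with respect to the prefix tidset
  - Let $PX$ and $PY$ be members of class $[P]$
  - Support: $\sigma(PXY) = |t(PX) \cap t(PY)| = |t(PXY)|$
Diffsets
  - Then $t(PX) \subseteq t(P)$ and $t(PY) \subseteq t(P)$
  - Define the diffset $d(PX) = t(P) - t(X)$, and likewise $d(PY) = t(P) - t(Y)$
  - The support is then $\sigma(PX) = \sigma(P) - |d(PX)|$
Diffsets
  - How to calculate $\sigma(PXY)$ using $d(PX)$ and $d(PY)$?
  - $\sigma(PXY) = \sigma(PX) - |d(PXY)|$
  - $d(PXY) = t(PX) - t(PY) = (t(P) - t(PY)) - (t(P) - t(PX)) = d(PY) - d(PX)$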
Example: relationship between t(P), t(X), t(Y), d(PX), d(PY), d(PXY) and t(PXY)
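Diffset Intersections (example)
  TIDSET database
    A: 1 3 4 5
    C: 1 2 3 4 5 6
    D: 2 4 5 6
    T: 1 3 5 6
    W: 1 2 3 4 5
  DIFFSET database
    A: 2 6
    C: (empty)
    D: 1 3
    T: 2 4
    W: 6
  Diffsets of 2-itemsets
    AC: (empty)   AD: 1 3   AT: 4   AW: (empty)   CD: 1 3
    CT: 2 4   CW: 6   DT: 2 4   DW: 6   TW: 6
  Diffsets of 3-itemsets
    ACT: 4   ACW: (empty)   ATW: (empty)   CDW: 6   CTW: 6
  Diffsets of 4-itemsets
    ACTW: (empty)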
Diffset Example
  - Diffset calculation
  - Support calculation
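A worked calculation may help here. The sketch below assumes the tidsets of the running example and applies the two rules, d(PXY) = d(PY) - d(PX) and support(PXY) = support(PX) - |d(PXY)|, to reach the itemset ACT. The helper name extend is hypothetical.

```python
ALL_TIDS = {1, 2, 3, 4, 5, 6}
TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}

# Level 1: diffsets relative to the empty prefix, d(X) = T - t(X).
d = {x: ALL_TIDS - t for x, t in TIDSETS.items()}
sup = {x: len(ALL_TIDS) - len(dx) for x, dx in d.items()}


def extend(d_px, sup_px, d_py):
    """Return (d(PXY), support(PXY)) from d(PX), support(PX) and d(PY)."""
    d_pxy = d_py - d_px
    return d_pxy, sup_px - len(d_pxy)


# Empty prefix, X = A: combine with Y = C and Y = T.
d_AC, sup_AC = extend(d["A"], sup["A"], d["C"])
d_AT, sup_AT = extend(d["A"], sup["A"], d["T"])

# Prefix A, X = C, Y = T: only the diffsets of AC and AT are needed.
d_ACT, sup_ACT = extend(d_AC, sup_AC, d_AT)
print(d_ACT, sup_ACT)
```

No tidset is touched after level 1; the supports are carried along and decremented by the diffset sizes.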
Diffset Example
  Database size
    - Tidset database size = 23
    - Diffset database size = 7
  Total size
    - Tidset database size = 76
    - Diffset database size = 22
  Size by length
    k    Avg. tidset length    Avg. diffset length
    2    3.8                   1
    3    3.2                   0.6
    4    3                     0
Experimental Study
  - Compare diffsets versus tidsets in terms of database sizes
  - Method
    - Real datasets (usually dense)
    - Synthetic datasets (sparse)
Size of Database
Average Diffset / Tidset Size by Length
Average Diffset / Tidset Size
  Database       Min_sup (%)   Max Length   Avg. Diffset Size   Avg. Tidset Size   Reduction Ratio
  chess          0.5           16           26                  1820               70
  connect        90            12           143                 62204              435
  mushroom       5             17           60                  622                10
  pumsb*         35            15           301                 18977              63
  pumsb                        8            330                 45036              136
  T10I4D100K     0.025         11           14                  86                 6
  T20I16D100K    0.1                        31                  230
  T40I10D100K                  18           96                  755
When to Use Diffsets
  - Usually there is a cross-over point
  - For dense datasets, start with the diffset format
  - For sparse datasets, start with the tidset format
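Reduction Ratio
  - Let $[P]$ be a class with prefix $P$
  - Let $PX$ and $PY$ be class members with tidsets $t(PX)$ and $t(PY)$
  - Consider a new itemset $PXY$ in class $[PX]$
  - $PXY$ can be stored as $t(PXY)$ or $d(PXY)$
  - Definition: reduction ratio $r = |t(PXY)| / |d(PXY)|$
  - Diffsets are beneficial if $r \ge 1$, or $|t(PXY)| / |d(PXY)| \ge 1$
Reduction Ratio
  - Since $d(PXY) = t(PX) - t(PY) = t(PX) - t(PXY)$: $|t(PXY)| / (|t(PX)| - |t(PXY)|) \ge 1$
  - Or $|t(PX)| / |t(PXY)| \le 2$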
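The switching condition can be expressed as a small check. The sketch below assumes the reduction ratio is the tidset size divided by the diffset size of the new itemset, which is equivalent to asking whether the support of PXY is at least half the support of PX. The function names are hypothetical.

```python
def reduction_ratio(t_px, t_py):
    """|t(PXY)| / |d(PXY)|, with d(PXY) = t(PX) - t(PY)."""
    t_pxy = t_px & t_py
    d_pxy = t_px - t_py
    if not d_pxy:
        return float("inf")  # empty diffset: nothing to store
    return len(t_pxy) / len(d_pxy)


def prefer_diffset(t_px, t_py):
    """True when |t(PX)| <= 2 * |t(PXY)|, i.e. the ratio is at least 1."""
    return len(t_px) <= 2 * len(t_px & t_py)


# A and T from the running example: t(AT) = {1, 3, 5}, d(AT) = {4}.
print(reduction_ratio({1, 3, 4, 5}, {1, 3, 5, 6}))
print(prefer_diffset({1, 3, 4, 5}, {1, 3, 5, 6}))
```

The second form avoids the division and never needs the diffset itself, only the two supports.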
Compressed Bitvectors
  - Classical way: run-length encoding (RLE), which is not appropriate for association mining
  - Skinning encoding scheme (used by Viper)
    - Worst-case compression ratio asymptotically reaches 2.91
    - Best-case compression ratio asymptotically reaches 32
GenMax: Mining Maximal Frequent Itemsets
  - Uses a backtracking search technique
  - Optimizations
    - Initially sort items in increasing order of their combine-set size and in increasing order of support (i. first explore items with small combine sets; ii. remove a node from the search tree as early as possible)
    - Superset checking
  - More optimizations
    - Progressive focusing to improve superset checking
    - Vertical database format to speed up frequency checking using tidsets, which is further improved by diffsets
  - Memory handling
    - Store at most k = m + l tidsets (diffsets) in memory, where m is the length of the longest combine-set and l is the length of the longest maximal itemset
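The backtracking search with superset checking can be sketched as follows. This is a heavily simplified illustration and not GenMax itself: it orders items by support only, uses plain tidsets, and omits progressive focusing and diffsets. The example tidsets, the threshold of 3 and the name maximal_itemsets are assumptions of the sketch.

```python
VERTICAL = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}


def maximal_itemsets(vertical, min_sup):
    all_tids = set().union(*vertical.values())
    # Frequent items in increasing order of support (ties broken by name).
    items = sorted(
        (x for x in vertical if len(vertical[x]) >= min_sup),
        key=lambda x: (len(vertical[x]), x),
    )
    maximal = []

    def backtrack(itemset, tids, candidates):
        # Superset check: if the largest set reachable from here is already
        # covered by a known maximal itemset, prune the whole subtree.
        if any(itemset | set(candidates) <= m for m in maximal):
            return
        extended = False
        for i, x in enumerate(candidates):
            new_tids = tids & vertical[x]
            if len(new_tids) >= min_sup:
                extended = True
                backtrack(itemset | {x}, new_tids, candidates[i + 1:])
        # A node with no frequent extension is maximal unless subsumed.
        if not extended and not any(itemset <= m for m in maximal):
            maximal.append(frozenset(itemset))

    backtrack(frozenset(), all_tids, items)
    return maximal


result = maximal_itemsets(VERTICAL, 3)
print(["".join(sorted(m)) for m in result])
```

On this example most branches are cut by the superset check as soon as the first maximal itemset has been found.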
dEclat: Mining All Frequent Itemsets
  - Performs a bottom-up search
  - The equivalence class lattice is traversed in BFS order
  - Input: class members
  - Frequent itemsets are generated by computing diffsets for all distinct pairs of itemsets and checking the support of the resulting itemset
  - Stores in memory the intermediate diffsets (tidsets) of at most two levels
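The pairwise diffset step can be sketched as a recursive routine over class members. This is an illustrative sketch, not the authors' pseudocode: it recurses class by class for brevity (a traversal choice made for this sketch), starts from diffsets taken relative to the full set of tids, and assumes an absolute threshold of 3. The name declat_sketch is hypothetical.

```python
ALL_TIDS = {1, 2, 3, 4, 5, 6}
TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}


def declat_sketch(members, min_sup, out):
    """members: list of (itemset, diffset, support) sharing a prefix."""
    for i, (xi, di, si) in enumerate(members):
        out[xi] = si
        new_class = []
        for xj, dj, _ in members[i + 1:]:
            dij = dj - di        # d(PXY) = d(PY) - d(PX)
            sij = si - len(dij)  # support(PXY) = support(PX) - |d(PXY)|
            if sij >= min_sup:
                new_class.append((xi + xj[-1], dij, sij))
        if new_class:
            declat_sketch(new_class, min_sup, out)
    return out


# Level 1: diffsets relative to the empty prefix.
level1 = [
    (x, ALL_TIDS - t, len(t))
    for x, t in sorted(TIDSETS.items())
    if len(t) >= 3
]
result = declat_sketch(level1, 3, {})
print(len(result), result["ACTW"])
```

Only the diffsets of the current class and the class being built are alive at any point in the recursion.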
dCharm: Mining Frequent Closed Itemsets
  - Performs a bottom-up search
  - Eliminates branches and grows itemsets using subset relationships
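Subset Relationships
  Theorem: Let $X_i$ and $X_j$ be any two members of class $[P]$, with $X_i \le_f X_j$, where $f$ is a total order (e.g., lexicographic or support-based). The following four properties hold, of which the first two are:
  1. If $t(X_i) = t(X_j)$, then $c(X_i) = c(X_j) = c(X_i \cup X_j)$
  2. If $t(X_i) \subset t(X_j)$, then $c(X_i) \ne c(X_j)$, but $c(X_i) = c(X_i \cup X_j)$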
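Two of the subset relationships can be checked numerically on a small database: equal tidsets imply equal closures, and a tidset that is a proper subset of another implies that the closure of the first equals the closure of the union. The sketch below assumes the example transactions of the running example and computes closures by brute force; the names tidset and closure are hypothetical.

```python
# Assumed example database (tid -> items).
DB = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}


def tidset(itemset):
    """t(X): tids of the transactions that contain every item of X."""
    return {tid for tid, items in DB.items() if set(itemset) <= set(items)}


def closure(itemset):
    """c(X): the items shared by every transaction that contains X."""
    tids = tidset(itemset)
    if not tids:
        raise ValueError("itemset does not occur in the database")
    common = set.intersection(*(set(DB[tid]) for tid in tids))
    return "".join(sorted(common))


# Equal tidsets: t(AC) == t(AW), so all three closures coincide.
print(closure("AC"), closure("AW"), closure("ACW"))

# Proper subset: t(A) is strictly contained in t(C).
print(closure("A"), closure("C"), closure("AC"))
```

In a closed-itemset miner these checks let a branch be merged into another one instead of being explored separately.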
Optimized Initialization
  - Computation
    - Let $n$ be the number of frequent items
    - Let $l$ be the average tidset size
    - Amount of data read is $l \cdot n(n-1)/2$
    - Number of intersections is $n(n-1)/2$
    - In the horizontal approach the amount of data read is $l \cdot n$
Improvement
  - Compute the frequent itemsets of length 2
  - Combine items $I_i$ and $I_j$ only if $I_i \cup I_j$ is frequent
  - The number of intersections in practice is now closer to $O(n)$ rather than $O(n^2)$, where $n$ is the number of frequent items
  - Computation of the frequent itemsets of length 2
    - Perform a vertical-to-horizontal transformation
    - Update the count of pairs of items
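The pair-counting step can be sketched as follows: the vertical database is turned back into transactions, and every pair of items in a transaction increments a counter. This is a minimal in-memory sketch; the example tidsets, the threshold of 3 and the name frequent_pairs are assumptions.

```python
from collections import Counter
from itertools import combinations

VERTICAL = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}


def frequent_pairs(vertical, min_sup):
    # Vertical-to-horizontal transformation: rebuild tid -> items.
    horizontal = {}
    for item, tids in vertical.items():
        for tid in tids:
            horizontal.setdefault(tid, []).append(item)

    # One pass over the transactions, updating the count of each pair.
    counts = Counter()
    for items in horizontal.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1

    return {pair: c for pair, c in counts.items() if c >= min_sup}


pairs = frequent_pairs(VERTICAL, 3)
print(sorted(pairs))
```

Only the pairs returned here would then be combined by intersection, so infrequent pairs such as AD and DT are never intersected.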
Experimental Results
  - Times include all costs, including horizontal-to-vertical database conversion
  - Method
    - Real datasets (usually dense)
    - Synthetic datasets (sparse)
Database Characteristics
  Database       # Items   Avg. Trans. Length   # Records
  chess          76        37                   3,196
  connect        130       43                   67,557
  mushroom       120       23                   8,124
  pumsb*         7117      50                   49,046
  pumsb                    74
  T10I4D100K     1000      10                   100,000
  T20I16D100K              40
Length of the Longest Itemset
Cardinality of F.I., C.F.I. and M.F.I.
Improvements Using Diffsets
Mining Frequent Itemsets
Mining Closed Itemsets
Mining Maximal Itemsets
Conclusions
  - Diffsets dramatically cut down the amount of memory required to store intermediate results
  - Diffsets significantly increase performance when incorporated into previous vertical mining methods
  - Diffsets can deliver over an order of magnitude performance improvement over the best previous methods