1
Fast Vertical Mining Using Diffsets
Mohammed J. Zaki, Karam Gouda
2
Outline
Introduction
Problem Setting and Notations
Equivalence Classes & Diffsets
Algorithms For Mining Frequent, Closed and Maximal Patterns
Experimental Results
Conclusions
Amir Epstein
3
Introduction
Horizontal methods (most are Apriori variants)
Mining maximal frequent patterns (All-MFS, MaxMiner, DepthProject, FP-Growth)
Mining closed sets (A-Close, Closet, Charm)
Vertical methods
Problems of the vertical approach
Diffsets
4
Notations
I – the set of items
T – the database of transactions
Tid – transaction identifier
Itemset – a set $X \subseteq I$
Tidset – a set $Y \subseteq T$
k-itemset – an itemset with k items
Support of an itemset X, denoted $\sigma(X)$ – the number of transactions in which X occurs as a subset
5
Notation
Frequent itemset – if $\sigma(X) \ge \mathit{min\_sup}$
Powerset P(I) – the search space for enumeration
Maximal frequent itemset – if it is not a subset of any other frequent itemset
Closed frequent itemset X – if there exists no proper superset $Y \supset X$ with $\sigma(Y) = \sigma(X)$
Closure of an itemset X, denoted c(X) – the smallest closed set that contains X
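The definitions above can be checked by brute force on a toy database. The sketch below is not the authors' method: it simply enumerates the powerset. The six transactions are reconstructed from the tidsets that appear later in the slides, and the threshold of 3 transactions (50%) is an assumption.

```python
from itertools import combinations

# Toy database (tid -> items), reconstructed from the tidsets in the slides.
DB = {
    1: {"A", "C", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "C", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}
MIN_SUP = 3  # assumed threshold: 3 of 6 transactions (50%)


def support(itemset):
    """sigma(X): number of transactions that contain X as a subset."""
    return sum(1 for items in DB.values() if itemset <= items)


def frequent_itemsets(min_sup):
    """Brute-force walk over the powerset P(I); only sensible for tiny I."""
    items = sorted(set().union(*DB.values()))
    found = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(frozenset(combo))
            if s >= min_sup:
                found[frozenset(combo)] = s
    return found


def closed_itemsets(freq):
    """Frequent X with no proper superset of equal support."""
    return {x for x, s in freq.items()
            if not any(x < y and s == freq[y] for y in freq)}


def maximal_itemsets(freq):
    """Frequent X that is not a subset of any other frequent itemset."""
    return {x for x in freq if not any(x < y for y in freq)}


freq = frequent_itemsets(MIN_SUP)
closed = closed_itemsets(freq)
maximal = maximal_itemsets(freq)
print(len(freq), len(closed), sorted("".join(sorted(m)) for m in maximal))
```

Under these assumptions the sketch finds 19 frequent, 7 closed and 2 maximal itemsets (ACTW and CDW), which illustrates that the maximal sets are a subset of the closed sets, which are a subset of the frequent sets.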
6
The Problem
Find all frequent itemsets, i.e. all itemsets whose support is at least the minimum support
7
Database Example
8
Frequent, Closed and Maximal Itemsets
9
Data Formats
10
Equivalence Classes
Define a function $p(X, k) = X[1:k]$, the k-length prefix of X
Define a (prefix-based) equivalence relation $\theta_k$: $\forall X, Y \subseteq I,\; X \equiv_{\theta_k} Y \iff p(X, k) = p(Y, k)$
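A minimal sketch of the prefix-based grouping, assuming itemsets are written as strings of single-letter items kept in lexicographic order. The helper names `prefix` and `equivalence_classes` are illustrative, not taken from the slides.

```python
from collections import defaultdict


def prefix(itemset, k):
    """p(X, k) = X[1:k]: the first k items of X in lexicographic order."""
    return "".join(sorted(itemset)[:k])


def equivalence_classes(itemsets, k):
    """Group itemsets whose k-length prefixes are equal (relation theta_k)."""
    classes = defaultdict(list)
    for x in itemsets:
        classes[prefix(x, k)].append(x)
    return dict(classes)


two_itemsets = ["AC", "AD", "AT", "AW", "CD", "CT", "CW", "DT", "DW", "TW"]
classes = equivalence_classes(two_itemsets, 1)
for p, members in classes.items():
    print(p, members)
```

With k = 1 the ten 2-itemsets fall into four classes, one per first item, and each class can then be mined independently of the others.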
11
Example (each prefix is followed by its class members in braces)
{} {A,C,D,T,W}
A {C,D,T,W}  C {D,T,W}  T {W}  W
AD {T,W}  AT {W}  AW  CD {T,W}  CT {W}  CW  DT {W}  DW  TW
ACD {T,W}  ACT {W}  ACW  ADT  ADW  ATW  CDT {W}  CDW  CTW  DTW
ACDT {W}  ACDW  ACTW  ADTW  CDTW
ACDTW
12
Compute Subset Class
Let $[P] = \{X_1, \dots, X_n\}$ be the class with prefix P
Perform the intersection of $t(PX_i)$ with all $t(PX_j)$ with $j > i$
to obtain a new class $[PX_i]$ with elements $X_j$, where $PX_iX_j$ is frequent
13
Tidset Intersections (example)
Itemset  Tidset
A        1 3 4 5
C        1 2 3 4 5 6
D        2 4 5 6
T        1 3 5 6
W        1 2 3 4 5

AC       1 3 4 5
AD       4 5
AT       1 3 5
AW       1 3 4 5
CD       2 4 5 6
CT       1 3 5 6
CW       1 2 3 4 5
DT       5 6
DW       2 4 5
TW       1 3 5

ACT      1 3 5
ACW      1 3 4 5
ATW      1 3 5
CDW      2 4 5
CTW      1 3 5

ACTW     1 3 5
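The class-by-class intersection can be sketched as a short recursive routine. This is a simplified illustration over the five item tidsets of the example, assuming a minimum support of 3 transactions; the function name `eclat` and the string encoding of itemsets are choices made for this sketch.

```python
# Item tidsets of the running example.
TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}


def eclat(prefix_class, min_sup, out):
    """prefix_class: list of (itemset, tidset) pairs that share a prefix."""
    for i, (xi, ti) in enumerate(prefix_class):
        out[xi] = ti
        new_class = []
        for xj, tj in prefix_class[i + 1:]:
            t = ti & tj  # t(P Xi Xj) = t(P Xi) intersected with t(P Xj)
            if len(t) >= min_sup:
                new_class.append((xi + xj[-1], t))
        if new_class:
            eclat(new_class, min_sup, out)  # mine the new class [P Xi]
    return out


initial = [(item, TIDSETS[item]) for item in sorted(TIDSETS)]
frequent = eclat(initial, 3, {})
print(sorted(frequent))
```

Every member of a class is combined only with the members that follow it, so each itemset is generated exactly once.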
14
Diffsets
Diffset: the difference of the prefix tidset and a class member tidset
Consider a class with prefix P
Let t(X) denote the tidset of element X
Let d(X) denote the diffset of element X, with respect to the prefix tidset
Let PX and PY be class members of P
Support: $\sigma(PXY) = |t(PX) \cap t(PY)| = |t(PXY)|$
15
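Diffsets
Then $t(PX) \subseteq t(P)$ and $t(PY) \subseteq t(P)$
Define the diffset $d(PX) = t(P) - t(PX)$, so that $\sigma(PX) = \sigma(P) - |d(PX)|$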
16
Diffsets
How to calculate d(PXY) using d(PX) and d(PY)?
$d(PXY) = t(PX) - t(PXY) = t(PX) - t(PY) = d(PY) - d(PX)$
$\sigma(PXY) = \sigma(PX) - |d(PXY)|$
17
Example (figure): the tidsets t(P), t(X) and t(Y), with the regions d(PX), d(PY), d(PXY) and t(PXY) marked
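The diffset identities can be checked numerically. The sketch below assumes the usual definitions d(PX) = t(P) - t(PX) and d(PXY) = d(PY) - d(PX), and uses P = C, X = D, Y = W from the running example; the helper name `extend` is hypothetical.

```python
def extend(d_px, sup_px, d_py):
    """Return (d(PXY), sigma(PXY)) from d(PX), sigma(PX) and d(PY)."""
    d_pxy = d_py - d_px                  # d(PXY) = d(PY) - d(PX)
    return d_pxy, sup_px - len(d_pxy)    # sigma(PXY) = sigma(PX) - |d(PXY)|


t_p = {1, 2, 3, 4, 5, 6}   # t(C)
t_px = {2, 4, 5, 6}        # t(CD)
t_py = {1, 2, 3, 4, 5}     # t(CW)

d_px = t_p - t_px          # d(CD) = {1, 3}
d_py = t_p - t_py          # d(CW) = {6}
sup_px = len(t_p) - len(d_px)

d_pxy, sup_pxy = extend(d_px, sup_px, d_py)
t_pxy = t_px & t_py        # tidset route, for comparison only
print(d_pxy, sup_pxy, t_pxy)
```

Both routes agree: the diffset of CDW is {6} and its support is 3, the size of t(CDW) = {2, 4, 5}, yet the diffset route never materialises the tidset.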
18
Diffset Intersections (example)
Itemset  Tidset        Diffset
A        1 3 4 5       2 6
C        1 2 3 4 5 6   (empty)
D        2 4 5 6       1 3
T        1 3 5 6       2 4
W        1 2 3 4 5     6

Itemset  Diffset
AC       (empty)
AD       1 3
AT       4
AW       (empty)
CD       1 3
CT       2 4
CW       6
DT       2 4
DW       6
TW       6

ACT      4
ACW      (empty)
ATW      (empty)
CDW      6
CTW      6

ACTW     (empty)
19
Diffset Example
Diffset calculation
Support calculation
20
Diffset Example
Database Size
Tidset database size = 23
Diffset database size = 7
Total Size
Tidset database size = 76
Diffset database size = 22
Size By Length
k-itemset (k)  Avg. tidset length  Avg. diffset length
2              3.8                 1
3              3.2                 0.6
4              3                   0
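The per-length averages can be recomputed from the item tidsets. This sketch assumes a minimum support of 3, averages over the frequent k-itemsets only, and measures each diffset against the tidset of the itemset's (k-1)-length prefix; these conventions are assumptions chosen because they reproduce the figures on the slide.

```python
from itertools import combinations

TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
MIN_SUP = 3  # assumed threshold


def tidset(itemset):
    """Intersect the item tidsets; the empty itemset covers all six tids."""
    tids = set(range(1, 7))
    for item in itemset:
        tids &= TIDSETS[item]
    return tids


def average_sizes(k):
    """Average tidset and diffset length over the frequent k-itemsets."""
    tid_sizes, diff_sizes = [], []
    for combo in combinations(sorted(TIDSETS), k):
        t = tidset(combo)
        if len(t) >= MIN_SUP:
            parent = tidset(combo[:-1])          # tidset of the prefix
            tid_sizes.append(len(t))
            diff_sizes.append(len(parent - t))   # diffset w.r.t. the prefix
    n = len(tid_sizes)
    return sum(tid_sizes) / n, sum(diff_sizes) / n


for k in (2, 3, 4):
    print(k, average_sizes(k))
```

For k = 2 the average tidset length comes out as 3.75, which the slide rounds to 3.8.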
21
Experimental Study
Compare diffsets versus tidsets in terms of database sizes
Method
Real datasets (usually dense)
Synthetic datasets (sparse)
22
Size Of Database
23
Average Diffset / Tidset Size By length
24
Average Diffset / Tidset Size
Database      Min_sup (%)  Max Length  Avg. Diffset Size  Avg. Tidset Size  Reduction Ratio
chess         0.5          16          26                 1820              70
connect       90           12          143                62204             435
mushroom      5            17          60                 622               10
Pumsb*        35           15          301                18977             63
pumsb         8 330 45036 136
T10I4D100K    0.025        11          14                 86                6
T20I16D100K   0.1 31 230
T40I10D100K   18 96 755
25
When To Use Diffsets
Usually there is a cross-over point
For dense datasets, start with the diffset format
For sparse datasets, start with the tidset format
26
Reduction Ratio
Let P be a class
Let PX and PY be class members, with tidsets t(PX) and t(PY)
Consider the new itemset PXY in class PX
PXY can be stored as t(PXY) or d(PXY)
Definition: reduction ratio $r = |t(PXY)| / |d(PXY)|$
Benefit if $r \ge 1$, or $|t(PXY)| / |d(PXY)| \ge 1$
27
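Reduction Ratio
Substituting $d(PXY) = t(PX) - t(PXY)$: $|t(PXY)| / (|t(PX)| - |t(PXY)|) \ge 1$
Or $|t(PX)| / |t(PXY)| \le 2$, i.e. diffsets pay off when the support of PXY is at least half the support of PX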
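A small sketch of the switching test, assuming the reduction ratio is the tidset size divided by the diffset size of the new itemset. The function names are illustrative, and the second pair of tidsets is made up to show a case where the tidset is the cheaper format.

```python
def reduction_ratio(t_px, t_py):
    """r = |t(PXY)| / |d(PXY)|, with d(PXY) = t(PX) - t(PY)."""
    t_pxy = t_px & t_py
    d_pxy = t_px - t_py
    if not d_pxy:
        return float("inf")  # empty diffset: storing it costs nothing
    return len(t_pxy) / len(d_pxy)


def prefer_diffset(t_px, t_py):
    """Division-free form of r >= 1: |t(PX)| <= 2 * |t(PXY)|."""
    return len(t_px) <= 2 * len(t_px & t_py)


# t(CD) and t(CW) from the running example: PXY keeps 3 of PX's 4 tids.
print(reduction_ratio({2, 4, 5, 6}, {1, 2, 3, 4, 5}),
      prefer_diffset({2, 4, 5, 6}, {1, 2, 3, 4, 5}))

# Hypothetical sparse case: PXY keeps only 1 of PX's 6 tids.
print(reduction_ratio({1, 2, 3, 4, 5, 6}, {1}),
      prefer_diffset({1, 2, 3, 4, 5, 6}, {1}))
```

The second function only needs the two supports, so the decision can be made before either representation is built.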
28
Compressed Bitvectors
Classical way: run-length encoding (RLE) – not appropriate for association mining
Skinning encoding scheme (used by Viper)
Worst-case compression ratio asymptotically reaches 2.91
Best-case compression ratio asymptotically reaches 32
29
GenMax: Mining Maximal Frequent Itemsets
Uses a backtracking search technique
Optimizations
Initially sort items in increasing order of their combine-set size and in increasing order of support (i. first explore items with small combine sets; ii. remove a node as early as possible from the search tree)
Superset checking
More Optimizations
Progressive focusing to improve superset checking
Vertical database format to improve frequency checking using tidsets, which is further improved by diffsets
Memory Handling
Store at most k = m + l tidsets (diffsets) in memory, where m is the length of the longest combine-set and l is the length of the longest maximal itemset
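The backtracking search with superset checking can be sketched as follows. This is a simplified illustration rather than GenMax itself: items are explored in lexicographic order instead of the combine-set ordering, frequency is checked with plain tidsets, and progressive focusing is omitted. The name `mfi_backtrack` and the minimum support of 3 are assumptions.

```python
TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}


def mfi_backtrack(itemset, tids, combine, min_sup, mfi):
    """Extend `itemset` with items from `combine` (all frequent extensions)."""
    for i, x in enumerate(combine):
        new_itemset = itemset | {x}
        new_tids = tids & TIDSETS[x]
        possible = combine[i + 1:]
        # Superset check: if everything reachable from here is already
        # covered by a known maximal set, the remaining branches are too.
        if any(new_itemset | set(possible) <= m for m in mfi):
            return
        # Combine set of the new node: items that keep the itemset frequent.
        new_combine = [y for y in possible
                       if len(new_tids & TIDSETS[y]) >= min_sup]
        if not new_combine:
            if not any(new_itemset <= m for m in mfi):
                mfi.append(new_itemset)
        else:
            mfi_backtrack(new_itemset, new_tids, new_combine, min_sup, mfi)


min_sup = 3
items = [x for x in sorted(TIDSETS) if len(TIDSETS[x]) >= min_sup]
mfi = []
mfi_backtrack(frozenset(), set(range(1, 7)), items, min_sup, mfi)
print(["".join(sorted(m)) for m in mfi])
```

On the running example the search reports ACTW and CDW; most other branches are cut by the superset check before any tidset intersection is needed.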
30
31
32
dEclat: Mining All Frequent Itemsets
Performs a bottom-up search
The equivalence class lattice is traversed in BFS order
Input: class members
Frequent itemsets (F.I.) are generated by computing diffsets for all distinct pairs of itemsets and checking the support of the resulting itemset
Stores in memory the intermediate diffsets (tidsets) of at most two levels
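A minimal sketch of diffset-based mining in the spirit of dEclat, assuming a minimum support of 3 and initial diffsets taken against the full set of tids. The recursion here finishes all pairs of a class and then descends into each new class; that traversal order, the function name `declat` and the string encoding of itemsets are choices of this sketch.

```python
TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
ALL_TIDS = set(range(1, 7))


def declat(prefix_class, min_sup, out):
    """prefix_class: list of (itemset, diffset, support) sharing a prefix."""
    for i, (xi, di, si) in enumerate(prefix_class):
        out[xi] = si
        new_class = []
        for xj, dj, _ in prefix_class[i + 1:]:
            d = dj - di          # d(P Xi Xj) = d(P Xj) - d(P Xi)
            s = si - len(d)      # sigma(P Xi Xj) = sigma(P Xi) - |d|
            if s >= min_sup:
                new_class.append((xi + xj[-1], d, s))
        if new_class:
            declat(new_class, min_sup, out)
    return out


initial = [(x, ALL_TIDS - TIDSETS[x], len(TIDSETS[x])) for x in sorted(TIDSETS)]
supports = declat(initial, 3, {})
print(supports)
```

Only set differences and subtractions of sizes are needed after the first level, and the stored sets shrink as the itemsets grow.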
33
34
dCharm: Mining Frequent Closed Itemsets
Performs a bottom-up search
Eliminates branches and grows itemsets using subset relationships
35
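Subset Relationships
Theorem: Let $X_i$ and $X_j$ be any two members of a class $[P]$, with $X_i \le_f X_j$, where $f$ is a total order (e.g., lexicographic or support-based). The following four properties hold:
1. If $d(X_i) = d(X_j)$, then $c(X_i) = c(X_j) = c(X_i \cup X_j)$
2. If $d(X_i) \supset d(X_j)$, then $c(X_i) \ne c(X_j)$, but $c(X_i) = c(X_i \cup X_j)$
3. If $d(X_i) \subset d(X_j)$, then $c(X_i) \ne c(X_j)$, but $c(X_j) = c(X_i \cup X_j)$
4. If $d(X_i) \ne d(X_j)$ (neither contains the other), then $c(X_i) \ne c(X_j) \ne c(X_i \cup X_j)$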
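The relationship between diffsets and closures can be checked on the toy database. The sketch assumes the properties take the following form: equal diffsets give equal closures; if d(Xi) strictly contains d(Xj), then c(Xi) equals the closure of the union; if d(Xi) is strictly contained in d(Xj), then c(Xj) equals the closure of the union; otherwise all three closures differ. Diffsets are taken against the empty prefix, and the transactions are reconstructed from the tidsets used elsewhere in the slides.

```python
DB = {
    1: {"A", "C", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "C", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}
ALL_ITEMS = set().union(*DB.values())


def tidset(itemset):
    return {tid for tid, items in DB.items() if set(itemset) <= items}


def diffset(itemset):
    """d(X) relative to the empty prefix, whose tidset is every tid."""
    return set(DB) - tidset(itemset)


def closure(itemset):
    """c(X): the items common to every transaction that contains X."""
    tids = tidset(itemset)
    if not tids:
        return set(ALL_ITEMS)
    return set.intersection(*(DB[tid] for tid in tids))


def relation(xi, xj):
    """Classify the pair by how d(xi) compares with d(xj)."""
    di, dj = diffset(xi), diffset(xj)
    if di == dj:
        return "equal"
    if di > dj:
        return "superset"
    if di < dj:
        return "subset"
    return "incomparable"


for xi, xj in [("A", "W"), ("C", "W"), ("A", "D")]:
    print(xi, xj, relation(xi, xj),
          sorted(closure(xi)), sorted(closure(xj)), sorted(closure(xi + xj)))
```

In a closed-set miner, the first three cases let one itemset absorb or replace another instead of opening a new branch; only the last case creates a genuinely new node.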
36
37
Optimized Initialization
Computation
Let $n$ be the number of frequent items
Let $l$ be the average tidset size
Amount of data read is $l \cdot n(n-1)/2$
Number of intersections: $n(n-1)/2$
In the horizontal approach, the amount of data read is $l \cdot n$
38
Improvement
Compute the frequent itemsets of length 2
Combine items $i$ and $j$ only if $\{i, j\}$ is frequent
Now the number of intersections in practice is closer to $O(n)$ rather than $O(n^2)$
To compute the frequent itemsets of length 2:
Perform a vertical-to-horizontal transformation
Update the counts of pairs of items
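The pair-counting step can be sketched as follows: the item tidsets are turned back into transactions, and each transaction then increments the count of every item pair it contains. This is an illustrative in-memory version under an assumed minimum support of 3; the function name `frequent_pairs` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}


def frequent_pairs(tidsets, min_sup):
    # Vertical -> horizontal: rebuild each transaction from the tidsets.
    horizontal = defaultdict(list)
    for item in sorted(tidsets):       # sorted, so pairs come out ordered
        for tid in tidsets[item]:
            horizontal[tid].append(item)
    # One pass over the transactions updates all pair counts.
    counts = defaultdict(int)
    for items in horizontal.values():
        for pair in combinations(items, 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_sup}


pairs = frequent_pairs(TIDSETS, 3)
print(sorted(pairs))
```

Only the pairs that survive this count need a tidset or diffset computed for them, which is where the saving in intersections comes from.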
39
Experimental Results
Times include all costs, including the horizontal-to-vertical database conversion
Method
Real datasets (usually dense)
Synthetic datasets (sparse)
40
Database Characteristics
Database      # Items  Avg. trans. Length  # Records
chess         76       37                  3,196
connect       130      43                  67,557
mushroom      120      23                  8,124
Pumsb*        7117     50                  49,046
pumsb         74
T10I4D100K    1000     10                  100,000
T20I16D100K   40
41
Length Of the Longest Itemset
42
Cardinality Of F.I , C.F.I and M.F.I
43
Improvements using Diffsets
44
Mining Frequent Itemsets
45
Mining Closed Itemsets
46
Mining Maximal Itemsets
47
Conclusions
Diffsets dramatically cut down the size of memory required to store intermediate results
Diffsets increase performance significantly when incorporated into previous vertical mining methods
Diffsets can deliver over an order of magnitude performance improvement over the best previous methods