Fast Vertical Mining Using Diffsets Mohammed J. Zaki Karam Gouda

1 Fast Vertical Mining Using Diffsets
Mohammed J. Zaki, Karam Gouda

2 Outline
- Introduction
- Problem Setting and Notations
- Equivalence Classes & Diffsets
- Algorithms for Mining Frequent, Closed and Maximal Patterns
- Experimental Results
- Conclusions
Amir Epstein

3 Introduction
- Horizontal methods (most are Apriori variants)
- Mining maximal frequent patterns (All-MFS, MaxMiner, DepthProject, FP-growth)
- Mining closed sets (A-Close, Closet, Charm)
- Vertical methods
- Vertical approach problems
- Diffsets

4 Notations
- I: set of items
- T: database of transactions
- Tid: transaction identifier
- Itemset: a set X ⊆ I
- Tidset: a set Y ⊆ T
- k-itemset: an itemset with k items
- Support of an itemset X, denoted σ(X): the number of transactions in which X occurs as a subset

5 Notation
- Frequent itemset: X is frequent if σ(X) ≥ min_sup
- Powerset P(I): the search space for enumeration
- Maximal frequent itemset: a frequent itemset that is not a subset of any other frequent itemset
- Closed frequent itemset: a frequent itemset X for which there is no proper superset Y ⊃ X with σ(Y) = σ(X)
- Closure of an itemset X, denoted c(X): the smallest closed set that contains X

6 The Problem
- Find all frequent itemsets, i.e., all itemsets with at least the minimum support

7 Database Example

8 Frequent, Closed and Maximal Itemsets
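The three families named on this slide can be enumerated by brute force on a tiny database. The following is a minimal sketch, not the method of the talk. It assumes the six-transaction database implied by the item tidsets shown on the Tidset Intersections slide, and an assumed minimum support of 3 transactions (the transcript does not state the threshold). The names `DB`, `MIN_SUP` and `support` are illustrative.

```python
from itertools import combinations

# Assumed example database (tid -> items), rebuilt from the item tidsets
# A={1,3,4,5}, C={1..6}, D={2,4,5,6}, T={1,3,5,6}, W={1,2,3,4,5}.
DB = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
MIN_SUP = 3  # assumption: absolute minimum support

def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for items in DB.values() if set(itemset) <= set(items))

items = sorted(set("".join(DB.values())))

# Enumerate the powerset P(I) and keep the frequent itemsets.
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(combo)
        if s >= MIN_SUP:
            frequent["".join(combo)] = s

# Closed: no proper superset has the same support.
closed = [x for x in frequent
          if not any(set(x) < set(y) and frequent[y] == frequent[x]
                     for y in frequent)]

# Maximal: no proper superset is frequent.
maximal = [x for x in frequent
           if not any(set(x) < set(y) for y in frequent)]

print(len(frequent), sorted(closed), sorted(maximal))
```

Under these assumptions the maximal sets are a subset of the closed sets, which are in turn a subset of the frequent sets.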

9 Data Formats

10 Equivalence Classes
- Define a function p(X, k) = X[1:k], the k-length prefix of X
- Define a prefix-based equivalence relation θ_k on itemsets: X ≡ Y under θ_k if and only if p(X, k) = p(Y, k)
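A prefix-based equivalence relation simply groups itemsets that agree on their first k items. The sketch below illustrates this, assuming itemsets are written as strings of single-character items in a fixed order. The function names `prefix` and `equivalence_classes` are hypothetical.

```python
from collections import defaultdict

def prefix(itemset, k):
    """Return the k-length prefix of an itemset given as an ordered string."""
    return itemset[:k]

def equivalence_classes(itemsets, k):
    """Group itemsets that share the same k-length prefix."""
    classes = defaultdict(list)
    for x in itemsets:
        classes[prefix(x, k)].append(x)
    return dict(classes)

# All 2-itemsets over the items A, C, D, T, W, grouped by 1-length prefix.
pairs = ["AC", "AD", "AT", "AW", "CD", "CT", "CW", "DT", "DW", "TW"]
classes = equivalence_classes(pairs, 1)
print(classes)
```

Each class can then be processed independently of the others, which is what makes the decomposition useful for mining.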

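11 Example
Prefix-based equivalence classes over the items A, C, D, T, W. Each itemset is followed by the set of items that can still extend it.
Level 0: {} {A,C,D,T,W}
Level 1: A {C,D,T,W}; C {D,T,W}; D {T,W}; T {W}; W
Level 2: AC {D,T,W}; AD {T,W}; AT {W}; AW; CD {T,W}; CT {W}; CW; DT {W}; DW; TW
Level 3: ACD {T,W}; ACT {W}; ACW; ADT {W}; ADW; ATW; CDT {W}; CDW; CTW; DTW
Level 4: ACDT {W}; ACDW; ACTW; ADTW; CDTW
Level 5: ACDTW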

12 Compute Subset Class
- Let [P] = {X_1, ..., X_n} be a class with prefix P
- Intersect t(PX_i) with every t(PX_j), j > i, to obtain a new class [PX_i] with elements X_j, where PX_iX_j is frequent

13 Tidset Intersections (example)
Itemset  Tidset
A        1 3 4 5
C        1 2 3 4 5 6
D        2 4 5 6
T        1 3 5 6
W        1 2 3 4 5

AC       1 3 4 5
AD       4 5
AT       1 3 5
AW       1 3 4 5
CD       2 4 5 6
CT       1 3 5 6
CW       1 2 3 4 5
DT       5 6
DW       2 4 5
TW       1 3 5

ACT      1 3 5
ACW      1 3 4 5
ATW      1 3 5
CDW      2 4 5
CTW      1 3 5

ACTW     1 3 5
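The tidset table can be reproduced by intersecting each class member with the members that follow it. This is a minimal sketch under two assumptions: a minimum support of 3 (not stated in the transcript) and single-character item names. `subset_classes` is a hypothetical helper, not a function from the original work.

```python
TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
MIN_SUP = 3  # assumption: absolute minimum support

def subset_classes(members, min_sup):
    """members: list of (itemset, tidset) pairs that share a prefix.

    Returns {itemset: [(extended itemset, tidset), ...]}, keeping only
    the frequent extensions, i.e. the new class rooted at each member.
    """
    new_classes = {}
    for i, (xi, ti) in enumerate(members):
        extensions = []
        for xj, tj in members[i + 1:]:
            tids = ti & tj                      # tidset intersection
            if len(tids) >= min_sup:            # support = tidset size
                extensions.append((xi + xj[-1], tids))
        new_classes[xi] = extensions
    return new_classes

level1 = [(item, TIDSETS[item]) for item in "ACDTW"]
level2 = subset_classes(level1, MIN_SUP)        # classes [A], [C], [D], [T], [W]
class_a = subset_classes(level2["A"], MIN_SUP)  # classes [AC], [AT], [AW]
print(class_a)
```

With this threshold AD and DT are dropped, so class [A] keeps only AC, AT and AW.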

14 Diffsets
- A diffset is the difference between the prefix tidset and a class member's tidset
- Consider a class with prefix P
- Let t(X) denote the tidset of element X
- Let d(X) denote the diffset of element X with respect to the prefix tidset
- Let PX and PY be members of class [P]
- Support: σ(PX) = σ(P) - |d(PX)|, where d(PX) = t(P) - t(X)

15 Diffsets
- Then σ(PXY) = σ(PX) - |d(PXY)|
- Define the diffset d(PXY) = t(PX) - t(PY)

16 Diffsets
- How can d(PXY) be calculated using d(PX) and d(PY)?
- d(PXY) = d(PY) - d(PX)

17 Example
[Figure: diagram relating t(P), t(X), t(Y), d(PX), d(PY), d(PXY) and t(PXY)]
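The relationships between these sets can be checked numerically. The sketch below assumes the usual diffset relations, d(PX) = t(P) - t(X), d(PXY) = t(PX) - t(PY) = d(PY) - d(PX), and σ(PXY) = σ(PX) - |d(PXY)|. It evaluates them for the assumed choice P = A, X = C, Y = T, with tidsets taken from the example database.

```python
# Item tidsets from the example database.
t = {"A": {1, 3, 4, 5}, "C": {1, 2, 3, 4, 5, 6}, "T": {1, 3, 5, 6}}

t_P = t["A"]             # prefix P = A
t_PX = t_P & t["C"]      # t(AC)
t_PY = t_P & t["T"]      # t(AT)

d_PX = t_P - t["C"]      # d(AC): tids of P that lack X
d_PY = t_P - t["T"]      # d(AT): tids of P that lack Y

# Diffset of PXY computed from diffsets only.
d_PXY = d_PY - d_PX
assert d_PXY == t_PX - t_PY   # agrees with the tidset-based definition

# Supports are carried along instead of being read off a tidset size.
support_P = len(t_P)
support_PX = support_P - len(d_PX)
support_PXY = support_PX - len(d_PXY)
assert support_PXY == len(t_PX & t_PY)

print(d_PX, d_PY, d_PXY, support_PXY)
```

Here d(AC) is empty and d(ACT) holds a single tid, while the corresponding tidsets hold four and three tids.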

18 Diffset Intersections (example)
TIDSET database
A        1 3 4 5
C        1 2 3 4 5 6
D        2 4 5 6
T        1 3 5 6
W        1 2 3 4 5

DIFFSET database
A        2 6
C        (empty)
D        1 3
T        2 4
W        6

Diffsets of larger itemsets
AC       (empty)
AD       1 3
AT       4
AW       (empty)
CD       1 3
CT       2 4
CW       6
DT       2 4
DW       6
TW       6

ACT      4
ACW      (empty)
ATW      (empty)
CDW      6
CTW      6

ACTW     (empty)

19 Diffset Example
- Diffset calculation
- Support calculation

20 Diffset Example
Database size
- Tidset database size = 23
- Diffset database size = 7
Total size
- Tidset database size = 76
- Diffset database size = 22
Size by length
k   Avg. tidset length   Avg. diffset length
2   3.8                  1
3   3.2                  0.6
4
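The database sizes and per-length averages on this slide can be recomputed from the item tidsets. The sketch below rests on several assumptions: a minimum support of 3, averages taken over frequent itemsets only, and each diffset measured against the tidset of the itemset's (k-1)-length prefix. It recovers 23 and 7 for the initial databases. The average tidset length for k = 2 comes out as 3.75, which the slide appears to round to 3.8.

```python
from itertools import combinations

TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
MIN_SUP = 3  # assumption: absolute minimum support
ALL_TIDS = set().union(*TIDSETS.values())

def tidset(itemset):
    """Tidset of an itemset: intersection of its items' tidsets."""
    tids = set(ALL_TIDS)
    for item in itemset:
        tids &= TIDSETS[item]
    return tids

# Sizes of the initial (single-item) databases.
tidset_db_size = sum(len(t) for t in TIDSETS.values())
diffset_db_size = sum(len(ALL_TIDS - t) for t in TIDSETS.values())

def average_lengths(k):
    """Average tidset and diffset length over the frequent k-itemsets."""
    tid_lens, diff_lens = [], []
    for combo in combinations(sorted(TIDSETS), k):
        tids = tidset(combo)
        if len(tids) < MIN_SUP:
            continue
        prefix_tids = tidset(combo[:-1])   # tidset of the (k-1)-prefix
        tid_lens.append(len(tids))
        diff_lens.append(len(prefix_tids - tids))
    return sum(tid_lens) / len(tid_lens), sum(diff_lens) / len(diff_lens)

print(tidset_db_size, diffset_db_size)
print(average_lengths(2), average_lengths(3))
```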

21 Experimental Study
- Compare diffsets versus tidsets in terms of database sizes
- Method
  - Real datasets (usually dense)
  - Synthetic datasets (sparse)

22 Size Of Database

23 Average Diffset / Tidset Size By Length

24 Average Diffset / Tidset Size
Database      Min_sup (%)  Max Length  Avg. Diffset Size  Avg. Tidset Size  Reduction Ratio
chess         0.5          16          26                 1820              70
connect       90           12          143                62204             435
mushroom      5            17          60                 622               10
Pumsb*        35           15          301                18977             63
pumsb                      8           330                45036             136
T10I4D100K    0.025        11          14                 86                6
T20I16D100K   0.1                      31                 230
T40I10D100K                18          96                 755

25 When To Use Diffsets
- Usually there is a cross-over point
- For a dense dataset, start with the diffset format
- For a sparse dataset, start with the tidset format

26 Reduction Ratio
- Let [P] be a class
- Let PX and PY be class members, with tidsets t(PX) and t(PY)
- Consider the new itemset PXY in class [PX]
- PXY can be stored as t(PXY) or d(PXY)
- Definition: reduction ratio r = |t(PXY)| / |d(PXY)|
- Diffsets are beneficial if r ≥ 1, or |t(PXY)| ≥ |d(PXY)|

27 Reduction Ratio
- Or, equivalently, |t(PXY)| / |t(PX)| ≥ 1/2
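The storage decision can be sketched as a small helper. It assumes the reduction ratio is the size of t(PXY) divided by the size of d(PXY), with d(PXY) = t(PX) - t(PY). The function names are hypothetical, and the handling of an empty diffset (ratio treated as infinite) is a choice made here for illustration.

```python
def reduction_ratio(t_px, t_py):
    """|t(PXY)| / |d(PXY)| for two class members given by their tidsets."""
    t_pxy = t_px & t_py
    d_pxy = t_px - t_py
    if not d_pxy:
        return float("inf")   # empty diffset: nothing to store at all
    return len(t_pxy) / len(d_pxy)

def prefer_diffset(t_px, t_py):
    """True when storing d(PXY) is no larger than storing t(PXY)."""
    return reduction_ratio(t_px, t_py) >= 1

# t(AC) and t(AT) from the example database: ACT keeps 3 of AC's 4 tids.
print(reduction_ratio({1, 3, 4, 5}, {1, 3, 5}))
# A sparse case: only 1 of 4 tids survives the intersection.
print(prefer_diffset({1, 2, 3, 4}, {4, 9}))
```

Because |d(PXY)| = |t(PX)| - |t(PXY)|, the test r ≥ 1 is the same as asking whether PXY keeps at least half of PX's support.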

28 Compressed Bitvectors
- The classical approach, run-length encoding (RLE), is not appropriate for association mining
- Skinning encoding scheme (used by Viper)
  - Worst-case compression ratio asymptotically reaches 2.91
  - Best-case compression ratio asymptotically reaches 32

29 GenMax: Mining Maximal Frequent Itemsets
- Uses a backtracking search technique
- Optimizations
  - Initially sort items in increasing order of their combine-set size and increasing order of support, in order to (i) explore items with small combine sets first and (ii) remove a node from the search tree as early as possible
  - Superset checking
- More optimizations
  - Progressive focusing to improve superset checking
  - Vertical database format to improve frequency checking using tidsets, which is improved further by diffsets
- Memory handling
  - Store at most k = m + l tidsets (diffsets) in memory, where m is the length of the longest combine set and l is the length of the longest maximal itemset
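The backtracking search with superset checking can be illustrated in a simplified form. The sketch below is not GenMax itself: it uses plain tidsets, keeps items in lexicographic order, and omits item reordering, progressive focusing and diffsets. The function name and its arguments are hypothetical.

```python
def maximal_frequent_itemsets(tidsets, min_sup):
    """Backtracking search for maximal frequent itemsets (simplified)."""
    mfi = []  # maximal itemsets found so far, as frozensets

    def has_superset(itemset):
        return any(itemset <= m for m in mfi)

    def backtrack(itemset, tids, combine):
        # Invariant: every item in `combine` is a frequent extension of `itemset`.
        for idx, item in enumerate(combine):
            new_itemset = itemset | {item}
            new_tids = tids & tidsets[item]
            possible = combine[idx + 1:]
            # Superset check: if the largest itemset reachable from here is
            # already covered, so is every remaining branch at this node.
            if has_superset(new_itemset | set(possible)):
                return
            # Combine set: items that still give a frequent extension.
            new_combine = [y for y in possible
                           if len(new_tids & tidsets[y]) >= min_sup]
            if not new_combine:
                if not has_superset(new_itemset):
                    mfi.append(new_itemset)
            else:
                backtrack(new_itemset, new_tids, new_combine)

    all_tids = set().union(*tidsets.values())
    frequent_items = sorted(i for i, t in tidsets.items() if len(t) >= min_sup)
    backtrack(frozenset(), all_tids, frequent_items)
    return mfi

TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
result = maximal_frequent_itemsets(TIDSETS, 3)  # min_sup = 3 is an assumption
print(["".join(sorted(m)) for m in result])
```

On the example database only two leaves are ever recorded, because the superset check cuts every branch that lies under an itemset already found.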

32 dEclat: Mining All Frequent Itemsets
- Performs a bottom-up search
- The equivalence class lattice is traversed in BFS order
- Input: class members
- Frequent itemsets (F.I) are generated by computing diffsets for all distinct pairs of itemsets and checking the support of the resulting itemset
- Stores in memory intermediate diffsets (tidsets) of at most two levels
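The pairwise diffset step can be sketched as a short recursive routine. This is a minimal illustration in the spirit of dEclat under stated assumptions, not the authors' implementation. Items are single characters, the top-level diffsets are taken against the set of all tids, and each new class is extended as soon as it is built, so the traversal order and memory behaviour are simplifications of what the slide describes.

```python
def declat(tidsets, min_sup):
    """Return {itemset: support} for all frequent itemsets, using diffsets."""
    all_tids = set().union(*tidsets.values())
    frequent = {}

    def extend(members):
        # members: list of (itemset, diffset, support) sharing a prefix
        for i, (xi, di, si) in enumerate(members):
            new_class = []
            for xj, dj, _ in members[i + 1:]:
                diff = dj - di              # d(PXY) = d(PY) - d(PX)
                sup = si - len(diff)        # support(PXY) = support(PX) - |d(PXY)|
                if sup >= min_sup:
                    name = xi + xj[-1]
                    frequent[name] = sup
                    new_class.append((name, diff, sup))
            if new_class:
                extend(new_class)

    # Root class: diffset of each item against all transactions.
    root = []
    for item in sorted(tidsets):
        diff = all_tids - tidsets[item]
        sup = len(all_tids) - len(diff)
        if sup >= min_sup:
            frequent[item] = sup
            root.append((item, diff, sup))
    extend(root)
    return frequent

TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
found = declat(TIDSETS, 3)  # min_sup = 3 is an assumption
print(sorted(found.items()))
```

No tidset is intersected after the root class is built. Every support is derived from the parent's support and the size of a set difference.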

34 dCharm: Mining Frequent Closed Itemsets
- Performs a bottom-up search
- Eliminates branches and grows itemsets using subset relationships

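35 Subset Relationships
Theorem: Let PX_i and PX_j be any two members of class [P], with X_i ≤_f X_j, where f is a total order (e.g., lexicographic or support-based). The following four properties hold:
- If d(PX_i) = d(PX_j), then c(PX_i) = c(PX_j) = c(PX_i ∪ PX_j)
- If d(PX_i) ⊃ d(PX_j), then c(PX_i) ≠ c(PX_j), but c(PX_i) = c(PX_i ∪ PX_j)
- If d(PX_i) ⊂ d(PX_j), then c(PX_i) ≠ c(PX_j), but c(PX_j) = c(PX_i ∪ PX_j)
- If d(PX_i) ≠ d(PX_j) and neither contains the other, then c(PX_i) ≠ c(PX_j) ≠ c(PX_i ∪ PX_j)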
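Two of the subset relationships can be checked by brute force on the example database. The sketch assumes the diffset form of the properties: equal diffsets imply equal closures, and a strictly larger diffset for the first member means its closure equals the closure of the union. The database `DB` is the one implied by the item tidsets, and `closure` is a hypothetical helper that intersects all transactions containing the itemset.

```python
# Assumed example database (tid -> items).
DB = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}

def tidset(itemset):
    return {tid for tid, items in DB.items() if set(itemset) <= set(items)}

def closure(itemset):
    """Intersection of all transactions containing the itemset.

    Only meaningful when the itemset occurs in at least one transaction.
    """
    rows = [set(DB[tid]) for tid in tidset(itemset)]
    return "".join(sorted(set.intersection(*rows)))

def diffset(prefix, item):
    """d(PX) = t(P) - t(X)."""
    return tidset(prefix) - tidset(item)

# Equal diffsets within class [A]: d(AC) = d(AW).
assert diffset("A", "C") == diffset("A", "W")
assert closure("AC") == closure("AW") == closure("ACW")

# Strict superset, with T ordered before C (a support-based order):
# d(AT) strictly contains d(AC).
assert diffset("A", "T") > diffset("A", "C")
assert closure("AT") != closure("AC")
assert closure("AT") == closure("ACT")

print(closure("AC"), closure("AT"))
```

In a closed-set miner these relationships let one member absorb another, or replace it by the union, without a separate closure computation.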

37 Optimized Initialization
- Computation
  - Let n be the number of frequent items
  - Let l be the average tidset size
  - Amount of data read is l * n(n-1)/2
  - Number of intersections is n(n-1)/2
- In the horizontal approach the amount of data read is l * n

38 Improvement
- Compute the frequent itemsets of length 2
- Combine items i and j only if {i, j} is frequent
- The number of intersections in practice is then closer to linear rather than quadratic in the number of frequent items
- Computing the frequent itemsets of length 2
  - Perform a vertical-to-horizontal transformation
  - Update the counts of pairs of items
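The pair-counting step can be sketched as follows: invert the vertical database back into transactions, then count item pairs in a single pass. This is an illustrative sketch under assumed names (`frequent_pairs`, `TIDSETS`) and an assumed minimum support of 3, not the implementation measured in the talk.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(tidsets, min_sup):
    """Count 2-itemsets with one pass over a reconstructed horizontal database."""
    # Only frequent items can take part in a frequent pair.
    frequent_items = {i: t for i, t in tidsets.items() if len(t) >= min_sup}

    # Vertical-to-horizontal transformation: tid -> list of items.
    horizontal = {}
    for item, tids in frequent_items.items():
        for tid in tids:
            horizontal.setdefault(tid, []).append(item)

    # Update the count of every pair of items that co-occur in a transaction.
    counts = Counter()
    for items in horizontal.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1

    return {pair: c for pair, c in counts.items() if c >= min_sup}

TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
pairs = frequent_pairs(TIDSETS, 3)
print(sorted(pairs.items()))
```

A miner can consult the resulting table before intersecting two items, skipping pairs such as AD and DT that are already known to be infrequent.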

39 Experimental Results
- Times include all costs, including horizontal-to-vertical database conversion
- Method
  - Real datasets (usually dense)
  - Synthetic datasets (sparse)

40 Database Characteristics
Database      # Items  Avg. Trans. Length  # Records
chess         76       37                  3,196
connect       130      43                  67,557
mushroom      120      23                  8,124
Pumsb*        7117     50                  49,046
pumsb                  74
T10I4D100K    1000     10                  100,000
T20I16D100K            40

41 Length Of The Longest Itemset

42 Cardinality Of F.I, C.F.I and M.F.I

43 Improvements using Diffsets

44 Mining Frequent Itemsets

45 Mining Closed Itemsets

46 Mining Maximal Itemsets

47 Conclusions
- Diffsets dramatically cut down the size of memory required to store intermediate results
- Diffsets increase performance significantly when incorporated into previous vertical mining methods
- Diffsets can deliver over an order of magnitude performance improvement over the best previous methods

