Fast Vertical Mining Using Diffsets Mohammed J. Zaki Karam Gouda

Uploaded: 2017-12-12T18:05:36+00:00
Duration: PTM17S47

Outline
  Introduction
  Problem Setting and Notations
  Equivalence Classes & Diffsets
  Algorithms for Mining Frequent, Closed and Maximal Patterns
  Experimental Results
  Conclusions
Amir Epstein

Introduction
  Horizontal methods (most are Apriori variants)
  Mining maximal frequent patterns (All-MFS, MaxMiner, DepthProject, FP-growth)
  Mining closed sets (A-Close, Closet, Charm)
  Vertical methods
  Problems with the vertical approach
  Diffsets

Notations
  I: the set of items
  T: the database of transactions
  Tid: transaction identifier
  Itemset: a set X ⊆ I
  Tidset: a set Y ⊆ T
  k-itemset: an itemset with k items
  Support of an itemset X, denoted σ(X): the number of transactions in which X occurs as a subset

Notation
  Frequent itemset: X is frequent if σ(X) ≥ min_sup
  Power set P(I): the search space for enumeration
  Maximal frequent itemset: a frequent itemset that is not a subset of any other frequent itemset
  Closed frequent itemset: a frequent itemset X for which no proper superset Y ⊃ X has σ(Y) = σ(X)
  Closure of an itemset X, denoted c(X): the smallest closed set that contains X

The Problem
  Find all frequent itemsets, that is, all itemsets whose support is at least the minimum support min_sup

Database Example

Frequent, Closed and Maximal Itemsets
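The three kinds of itemsets can be checked by brute force on the five-item example. The sketch below is only an illustration, not the method of the talk: the tidsets are the ones listed on the "Tidset Intersections" slide, while `MIN_SUP = 3` and all function names are assumptions made for this example.

```python
from itertools import combinations

# Vertical view of the example database: item -> tidset.
# Assumption: min_sup = 3 transactions (not stated in the transcript).
TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
ALL_TIDS = set().union(*TIDSETS.values())
MIN_SUP = 3


def tidset(itemset):
    """t(X): the transactions that contain every item of X."""
    tids = set(ALL_TIDS)
    for item in itemset:
        tids &= TIDSETS[item]
    return tids


def support(itemset):
    """sigma(X) = |t(X)|."""
    return len(tidset(itemset))


items = sorted(TIDSETS)
frequent = [
    frozenset(combo)
    for k in range(1, len(items) + 1)
    for combo in combinations(items, k)
    if support(combo) >= MIN_SUP
]
# Closed: no proper superset has the same support.
closed = [x for x in frequent
          if not any(x < y and support(y) == support(x) for y in frequent)]
# Maximal: no proper superset is frequent.
maximal = [x for x in frequent if not any(x < y for y in frequent)]


def show(itemsets):
    return sorted("".join(sorted(s)) for s in itemsets)


print(len(frequent), show(closed), show(maximal))
```

Every maximal itemset is also closed, so the maximal sets form the smallest of the three collections.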

Data Formats

Equivalence Classes
  Define a function p(X, k) = X[1:k], the k-length prefix of X
  Define a prefix-based equivalence relation θ_k on P(I): for all X, Y in P(I), X ≡ Y (under θ_k) if and only if p(X, k) = p(Y, k)
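A small sketch of the prefix function and the classes it induces. It assumes itemsets are written as strings of single-character items kept in lexicographic order; `prefix` and `classes` are names invented for this illustration.

```python
from collections import defaultdict


def prefix(itemset, k):
    """p(X, k): the first k items of X under lexicographic order."""
    return tuple(sorted(itemset))[:k]


def classes(itemsets, k):
    """Group itemsets that share the same k-length prefix."""
    groups = defaultdict(list)
    for x in itemsets:
        groups[prefix(x, k)].append("".join(sorted(x)))
    return dict(groups)


two_itemsets = ["AC", "AD", "AT", "AW", "CD", "CT", "CW", "DT", "DW", "TW"]
by_first_item = classes(two_itemsets, 1)
for p, members in by_first_item.items():
    print(p, members)
```

With k = 1 the ten 2-itemsets fall into four classes, one per leading item, and each class can then be mined independently of the others.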

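Example
  Each node is followed by the set of items that can still extend it
  Level 0: {} {A,C,D,T,W}
  Level 1: A {C,D,T,W}   C {D,T,W}   D {T,W}   T {W}   W
  Level 2: AC {D,T,W}   AD {T,W}   AT {W}   AW   CD {T,W}   CT {W}   CW   DT {W}   DW   TW
  Level 3: ACD {T,W}   ACT {W}   ACW   ADT {W}   ADW   ATW   CDT {W}   CDW   CTW   DTW
  Level 4: ACDT {W}   ACDW   ACTW   ADTW   CDTW
  Level 5: ACDTW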

Compute Subset Class
  Let [P] = {X_1, X_2, ..., X_n} be a class with prefix P
  Intersect t(PX_i) with all t(PX_j), j > i, to obtain a new class [PX_i] with elements X_j, where PX_iX_j is frequent

Tidset Intersections (example)
  A: 1 3 4 5
  C: 1 2 3 4 5 6
  D: 2 4 5 6
  T: 1 3 5 6
  W: 1 2 3 4 5

  AC: 1 3 4 5
  AD: 4 5
  AT: 1 3 5
  AW: 1 3 4 5
  CD: 2 4 5 6
  CT: 1 3 5 6
  CW: 1 2 3 4 5
  DT: 5 6
  DW: 2 4 5
  TW: 1 3 5

  ACT: 1 3 5
  ACW: 1 3 4 5
  ATW: 1 3 5
  CDW: 2 4 5
  CTW: 1 3 5

  ACTW: 1 3 5
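The class-extension step can be sketched as follows. This is a minimal illustration under stated assumptions: items are single characters, `MIN_SUP = 3` is assumed, and `extend_class` is a hypothetical helper name.

```python
MIN_SUP = 3  # assumption: minimum support as a transaction count

# Members of the top-level class (empty prefix), in lexicographic order.
TOP_LEVEL = [
    ("A", {1, 3, 4, 5}),
    ("C", {1, 2, 3, 4, 5, 6}),
    ("D", {2, 4, 5, 6}),
    ("T", {1, 3, 5, 6}),
    ("W", {1, 2, 3, 4, 5}),
]


def extend_class(members, min_sup):
    """For each member PXi, intersect its tidset with every later member PXj.

    members: ordered list of (itemset, tidset) pairs that share a prefix P.
    Returns {PXi: [(PXiXj, tidset), ...]}, keeping only frequent extensions.
    """
    new_classes = {}
    for i, (xi, ti) in enumerate(members):
        children = []
        for xj, tj in members[i + 1:]:
            tij = ti & tj                            # t(PXiXj)
            if len(tij) >= min_sup:
                children.append((xi + xj[-1], tij))  # append last item of PXj
        new_classes[xi] = children
    return new_classes


level2 = extend_class(TOP_LEVEL, MIN_SUP)
level3 = extend_class(level2["A"], MIN_SUP)
print([name for name, _ in level2["A"]])
print(level3["AC"])
```

AD is dropped from class [A] because its tidset {4, 5} has only two transactions, so no itemset containing AD is ever generated.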

Diffsets
  Difference of the prefix tidset and a class member tidset
  Consider a class with prefix P
  Let t(X) denote the tidset of element X
  Let d(X) denote the diffset of element X with respect to the prefix tidset
  Let PX and PY be class members of P
  Support: σ(PX) = |t(PX)| = |t(P) ∩ t(X)|

Diffsets
  Then σ(PX) = σ(P) - |t(P) - t(X)|
  Define the diffset d(PX) = t(P) - t(X), so that σ(PX) = σ(P) - |d(PX)|

Diffsets
  How to calculate σ(PXY) using d(PX) and d(PY)?
  d(PXY) = t(PX) - t(PY) = d(PY) - d(PX)
  σ(PXY) = σ(PX) - |d(PXY)|

Example (figure): the tidsets t(P), t(X) and t(Y), with the regions d(PX), d(PY), d(PXY) and t(PXY)

Diffset Intersections (example)
  TIDSET database
    A: 1 3 4 5
    C: 1 2 3 4 5 6
    D: 2 4 5 6
    T: 1 3 5 6
    W: 1 2 3 4 5
  DIFFSET database
    A: 2 6
    C: (empty)
    D: 1 3
    T: 2 4
    W: 6

  AC: (empty)
  AD: 1 3
  AT: 4
  AW: (empty)
  CD: 1 3
  CT: 2 4
  CW: 6
  DT: 2 4
  DW: 6
  TW: 6

  ACT: 4
  ACW: (empty)
  ATW: (empty)
  CDW: 6
  CTW: 6

  ACTW: (empty)
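The diffset recurrence can be traced on class [A] of the example. The sketch below assumes single-character items and uses a hypothetical helper `join`; it applies d(PXY) = d(PY) - d(PX) and σ(PXY) = σ(PX) - |d(PXY)|, starting from the diffsets of the single items taken against the full set of tids.

```python
ALL_TIDS = {1, 2, 3, 4, 5, 6}
TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}

# Root class: the prefix is empty, so the prefix tidset is all transactions.
diffset = {x: ALL_TIDS - t for x, t in TIDSETS.items()}
support = {x: len(t) for x, t in TIDSETS.items()}


def join(px, py):
    """Combine class members PX and PY (same prefix P) into PXY."""
    pxy = px + py[-1]
    diffset[pxy] = diffset[py] - diffset[px]          # d(PXY) = d(PY) - d(PX)
    support[pxy] = support[px] - len(diffset[pxy])    # s(PXY) = s(PX) - |d(PXY)|
    return pxy


for x in ("C", "T", "W"):
    join("A", x)            # AC, AT, AW
join("AC", "AT")            # ACT
join("AC", "AW")            # ACW
join("ACT", "ACW")          # ACTW

for name in ("AC", "AT", "AW", "ACT", "ACW", "ACTW"):
    print(name, sorted(diffset[name]), support[name])
```

No tidset is touched after the first level: each support is obtained from the parent's support and the size of a set difference.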

Diffset Example
  Diffset calculation
  Support calculation

Diffset Example
  Database Size
    Tidset database size = 23
    Diffset database size = 7
  Total Size
    Tidset database size = 76
    Diffset database size = 22
  Size By Length
    k   Avg. tidset length   Avg. diffset length
    2   3.8                  1
    3   3.2                  0.6
    4   3                    0
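The per-length averages can be recomputed from the item tidsets. This is a sketch under two assumptions: min_sup is 3 transactions, and the averages are taken over frequent k-itemsets only. `average_sizes` is a name made up for this illustration.

```python
from itertools import combinations
from statistics import mean

TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
ALL_TIDS = set().union(*TIDSETS.values())
MIN_SUP = 3  # assumption


def tidset(itemset):
    """t(X); for the empty itemset this is the set of all tids."""
    tids = set(ALL_TIDS)
    for item in itemset:
        tids &= TIDSETS[item]
    return tids


def average_sizes(k):
    """Average tidset and diffset length over the frequent k-itemsets."""
    t_sizes, d_sizes = [], []
    for combo in combinations(sorted(TIDSETS), k):
        t = tidset(combo)
        if len(t) < MIN_SUP:
            continue
        d = tidset(combo[:-1]) - t   # diffset relative to the (k-1)-prefix
        t_sizes.append(len(t))
        d_sizes.append(len(d))
    return mean(t_sizes), mean(d_sizes)


for k in (2, 3, 4):
    print(k, average_sizes(k))
```

For k = 2 the exact tidset average is 3.75 over the eight frequent pairs, which the slide rounds to 3.8.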

Experimental Study
  Compare diffsets versus tidsets in terms of database sizes
  Method
    Real datasets (usually dense)
    Synthetic datasets (sparse)

Size Of Database

Average Diffset / Tidset Size By Length

Average Diffset / Tidset Size
  Database      Min_sup (%)  Max Length  Avg. Diffset Size  Avg. Tidset Size  Reduction Ratio
  chess         0.5          16          26                 1820              70
  connect       90           12          143                62204             435
  mushroom      5            17          60                 622               10
  Pumsb*        35           15          301                18977             63
  pumsb         -            8           330                45036             136
  T10I4D100K    0.025        11          14                 86                6
  T20I16D100K   0.1          -           31                 230               -
  T40I10D100K   -            18          96                 755               -

When To Use Diffsets
  Usually there is a cross-over point
  For dense datasets, start with the diffset format
  For sparse datasets, start with the tidset format

Reduction Ratio
  Let P be a class
  Let PX and PY be class members with tidsets t(PX) and t(PY)
  Consider the new itemset PXY in class PX
  PXY can be stored as t(PXY) or d(PXY)
  Definition: reduction ratio r = |t(PXY)| / |d(PXY)|
  Diffsets are beneficial if r ≥ 1, or |t(PXY)| / |d(PY) - d(PX)| ≥ 1

Reduction Ratio
  Or |t(PXY)| / |t(PX) - t(PY)| ≥ 1, that is, |t(PX)| / |t(PXY)| ≤ 2
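The switch-over test follows from |t(PX) - t(PY)| = |t(PX)| - |t(PXY)|: the ratio is at least 1 exactly when PXY keeps at least half of the support of PX. A minimal sketch, with function names chosen for this illustration:

```python
def reduction_ratio(t_px, t_py):
    """r = |t(PXY)| / |d(PXY)|, with d(PXY) = t(PX) - t(PY)."""
    t_pxy = t_px & t_py
    d_pxy = t_px - t_py
    if not d_pxy:
        return float("inf")   # empty diffset: nothing to store
    return len(t_pxy) / len(d_pxy)


def prefer_diffset(t_px, t_py):
    """True when r >= 1, equivalently |t(PX)| / |t(PXY)| <= 2."""
    return 2 * len(t_px & t_py) >= len(t_px)


t_A, t_T = {1, 3, 4, 5}, {1, 3, 5, 6}
print(reduction_ratio(t_A, t_T), prefer_diffset(t_A, t_T))

# A sparse-looking pair (made-up tidsets) where the tidset is smaller.
print(reduction_ratio({1, 2, 3, 4}, {4, 5}), prefer_diffset({1, 2, 3, 4}, {4, 5}))
```

The second call shows the sparse case: when two members share few transactions, the intersection is shorter than the difference and the tidset remains the cheaper representation.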

Compressed Bitvectors
  Classical approach: run-length encoding (RLE), which is not appropriate for association mining
  Skinning encoding scheme (used by Viper)
    Worst-case compression ratio asymptotically reaches 2.91
    Best-case compression ratio asymptotically reaches 32

GenMax: Mining Maximal Frequent Itemsets
  Uses a backtracking search technique
  Optimizations
    Initially sort items in increasing order of combine-set size and then in increasing order of support, in order to (i) explore items with small combine sets first and (ii) remove a node from the search tree as early as possible
    Superset checking
  More optimizations
    Progressive focusing to improve superset checking
    Vertical database format to speed up frequency checking using tidsets, which is further improved by diffsets
  Memory handling
    Store at most k = m + l tidsets (diffsets) in memory, where m is the length of the longest combine set and l is the length of the longest maximal itemset
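The backtracking search with superset checking can be sketched on the five-item example. This is a simplified illustration, not the GenMax implementation: it uses plain tidset intersections, orders items by support only, omits progressive focusing, and assumes min_sup = 3. The function names are invented here.

```python
TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}


def maximal_itemsets(tidsets, min_sup):
    """Backtracking search for maximal frequent itemsets (simplified)."""
    maximal = []
    all_tids = set().union(*tidsets.values())
    # Explore low-support items first so branches die out early.
    order = sorted((x for x in tidsets if len(tidsets[x]) >= min_sup),
                   key=lambda x: (len(tidsets[x]), x))

    def backtrack(itemset, tids, candidates):
        # Superset check: if even itemset + all candidates is covered by a
        # known maximal set, nothing new can be found below this node.
        if any(itemset | set(candidates) <= m for m in maximal):
            return
        extended = False
        for i, x in enumerate(candidates):
            new_tids = tids & tidsets[x]
            if len(new_tids) >= min_sup:
                extended = True
                backtrack(itemset | {x}, new_tids, candidates[i + 1:])
        # A non-empty node with no frequent extension is a leaf of the search.
        if not extended and itemset:
            maximal.append(frozenset(itemset))

    backtrack(frozenset(), all_tids, order)
    return maximal


result = sorted("".join(sorted(m)) for m in maximal_itemsets(TIDSETS, 3))
print(result)
```

On this example the search starts from A, finds ACTW first, and then prunes the branches rooted at T, W and C with a single superset check each.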

dEclat: Mining All Frequent Itemsets
  Performs a bottom-up search
  The equivalence class lattice is traversed in BFS order
  Input: the members of a class
  Frequent itemsets are generated by computing diffsets for all distinct pairs of itemsets and checking the support of the resulting itemset
  Stores in memory the intermediate diffsets (tidsets) of at most two levels
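A compact recursive sketch of diffset-based mining of all frequent itemsets. It processes one class at a time and is meant as an illustration under stated assumptions (single-character items, min_sup given as a count, invented function names), not as the reference dEclat code.

```python
TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}


def declat(members, min_sup, out):
    """members: list of (itemset, diffset, support) for one class [P]."""
    for i, (xi, di, si) in enumerate(members):
        out[xi] = si
        children = []
        for xj, dj, _ in members[i + 1:]:
            dij = dj - di              # d(PXiXj) = d(PXj) - d(PXi)
            sij = si - len(dij)        # support of PXiXj
            if sij >= min_sup:
                children.append((xi + xj[-1], dij, sij))
        if children:
            declat(children, min_sup, out)


def mine(tidsets, min_sup):
    """Return {itemset: support} for all frequent itemsets."""
    all_tids = set().union(*tidsets.values())
    root = [(x, all_tids - tidsets[x], len(tidsets[x]))
            for x in sorted(tidsets) if len(tidsets[x]) >= min_sup]
    out = {}
    declat(root, min_sup, out)
    return out


result = mine(TIDSETS, 3)
print(len(result), result["ACTW"], result["CW"])
```

Only the diffsets of the current class and of the class being built are alive at any time, which is the point of the two-level memory bound mentioned above.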

dCharm: Mining Frequent Closed Itemsets
  Performs a bottom-up search
  Eliminates branches and grows itemsets using subset relationships

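Subset Relationships
  Theorem: Let X_i and X_j be any two members of a class [P], with X_i ≤_f X_j, where f is a total order (e.g., lexicographic or support-based). The following four properties hold:
    1. If t(X_i) = t(X_j), then c(X_i) = c(X_j) = c(X_i ∪ X_j)
    2. If t(X_i) ⊂ t(X_j), then c(X_i) ≠ c(X_j), but c(X_i) = c(X_i ∪ X_j)
    3. If t(X_i) ⊃ t(X_j), then c(X_i) ≠ c(X_j), but c(X_j) = c(X_i ∪ X_j)
    4. If neither tidset contains the other, then c(X_i) ≠ c(X_j) ≠ c(X_i ∪ X_j)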
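The case analysis behind the theorem reduces to comparing two sets. The sketch below assumes the usual four-case form of the property (equal tidsets, proper subset, proper superset, incomparable); the function name and the returned labels are invented for this illustration.

```python
def tidset_relation(t_i, t_j):
    """Classify the relationship between t(Xi) and t(Xj).

    1: equal        -> Xi and Xj have the same closure as Xi U Xj
    2: t_i subset   -> Xi has the same closure as Xi U Xj
    3: t_i superset -> Xj has the same closure as Xi U Xj
    4: incomparable -> all three closures differ
    """
    if t_i == t_j:
        return 1
    if t_i < t_j:      # proper subset
        return 2
    if t_i > t_j:      # proper superset
        return 3
    return 4


t = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
print(tidset_relation(t["A"], t["W"]))   # A occurs only together with W
print(tidset_relation(t["D"], t["T"]))   # neither contains the other
```

With diffsets taken against the same prefix, d(PX) = t(P) - t(PX), the inclusions flip: t(PX_i) ⊂ t(PX_j) holds exactly when d(PX_i) ⊃ d(PX_j), so the same test can be run on diffsets with cases 2 and 3 swapped.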

Optimized Initialization
  Computation
    Let n be the number of frequent items
    Let l be the average tidset size
    Amount of data read is l·n(n-1)/2
    Number of intersections is n(n-1)/2
    In the horizontal approach the amount of data read is l·n

Improvement
  Compute the frequent itemsets of length 2
  Combine items I_i and I_j only if I_i ∪ I_j is frequent
  Now the number of intersections in practice is closer to O(n) rather than O(n²)
  Computing the frequent itemsets of length 2
    Perform a vertical-to-horizontal transformation
    Update the counts of pairs of items
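The pair-counting step can be sketched as follows: rebuild each transaction from the item tidsets, then count every pair of items it contains. This is an illustration with invented names, assuming min_sup is given as a transaction count and that infrequent items have already been removed.

```python
from collections import Counter, defaultdict
from itertools import combinations

TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}


def frequent_pairs(tidsets, min_sup):
    """Count 2-itemsets through a vertical-to-horizontal transformation."""
    # Vertical -> horizontal: tid -> items, appended in sorted item order.
    transactions = defaultdict(list)
    for item in sorted(tidsets):
        for tid in tidsets[item]:
            transactions[tid].append(item)
    # One pass over the transactions updates all pair counts.
    counts = Counter()
    for items in transactions.values():
        counts.update(combinations(items, 2))
    return {pair: n for pair, n in counts.items() if n >= min_sup}


pairs = frequent_pairs(TIDSETS, 3)
print(sorted("".join(p) for p in pairs))
```

Only the eight surviving pairs need a tidset or diffset computation afterwards; AD and DT are never intersected.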

Experimental Results
  Times include all costs, including horizontal-to-vertical database conversion
  Method
    Real datasets (usually dense)
    Synthetic datasets (sparse)

Database Characteristics
  Database      # Items  Avg. Trans. Length  # Records
  chess         76       37                  3,196
  connect       130      43                  67,557
  mushroom      120      23                  8,124
  Pumsb*        7117     50                  49,046
  pumsb         -        74                  -
  T10I4D100K    1000     10                  100,000
  T20I16D100K   -        40                  -

Length Of the Longest Itemset

Cardinality Of F.I, C.F.I and M.F.I

Improvements using Diffsets

Mining Frequent Itemsets

Mining Closed Itemsets

Mining Maximal Itemsets

Conclusions
  Diffsets dramatically cut down the amount of memory required to store intermediate results
  Diffsets increase performance significantly when incorporated into previous vertical mining methods
  Diffsets can deliver more than an order of magnitude performance improvement over the best previous methods
