Fast Vertical Mining Using Diffsets Mohammed J. Zaki Karam Gouda


Outline
- Introduction
- Problem Setting and Notations
- Equivalence Classes & Diffsets
- Algorithms for Mining Frequent, Closed and Maximal Patterns
- Experimental Results
- Conclusions
Amir Epstein

Introduction
- Horizontal methods (most are Apriori variants)
- Mining maximal frequent patterns (All-MFS, MaxMiner, DepthProject, FP-Growth)
- Mining closed sets (A-Close, Closet, Charm)
- Vertical methods
- Problems of the vertical approach
- Diffsets

Notations
- I: the set of items
- T: the database of transactions
- Tid: a transaction identifier
- Itemset: a set X ⊆ I
- Tidset: a set Y ⊆ T
- k-itemset: an itemset with k items
- Support of an itemset X, denoted σ(X): the number of transactions in which X occurs as a subset

Notation
- Frequent itemset: X is frequent if σ(X) ≥ min_sup
- Powerset P(I): the search space for enumeration
- Maximal frequent itemset: a frequent itemset that is not a subset of any other frequent itemset
- Closed frequent itemset X: a frequent itemset for which there does not exist a superset Y ⊃ X with σ(Y) = σ(X)
- Closure of an itemset X, denoted c(X): the smallest closed set that contains X
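The definitions above can be checked by brute force on a small database. The sketch below is illustrative only: the six transactions are an assumption reconstructed from the tidsets shown later in this deck, the threshold `MIN_SUP = 3` is assumed, and the function names are hypothetical. It enumerates the whole powerset, which is only feasible for toy inputs.

```python
from itertools import combinations

# Assumed example database (tid -> items), reconstructed from the deck's tidsets.
DB = {
    1: {"A", "C", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "C", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}
MIN_SUP = 3  # assumed absolute minimum support


def support(itemset, db=DB):
    """Number of transactions that contain itemset as a subset."""
    return sum(1 for items in db.values() if itemset <= items)


def frequent_itemsets(db=DB, min_sup=MIN_SUP):
    """Enumerate the powerset of items and keep the frequent itemsets."""
    items = sorted(set().union(*db.values()))
    freq = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            s = support(x, db)
            if s >= min_sup:
                freq[x] = s
    return freq


def closed_itemsets(freq):
    """Frequent itemsets with no proper superset of equal support."""
    return {x for x, s in freq.items()
            if not any(x < y and s == sy for y, sy in freq.items())}


def maximal_itemsets(freq):
    """Frequent itemsets with no frequent proper superset."""
    return {x for x in freq if not any(x < y for y in freq)}


freq = frequent_itemsets()
print(len(freq))
print(sorted("".join(sorted(x)) for x in closed_itemsets(freq)))
print(sorted("".join(sorted(x)) for x in maximal_itemsets(freq)))
```

On this assumed database every maximal itemset is also closed, and every closed itemset is frequent, which is the containment the three definitions imply.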

The Problem
- Find all frequent itemsets, i.e. all itemsets whose support is at least the minimum support

Database Example

Frequent, Closed and Maximal Itemsets

Data Formats
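As a rough illustration of the two data formats, the sketch below converts a horizontal database (tid to items) into the vertical format (item to tidset). The database contents and the name `to_vertical` are assumptions made for this example, not taken from the slide's figure.

```python
from collections import defaultdict

# Assumed horizontal database: each transaction id maps to its items.
HORIZONTAL = {
    1: {"A", "C", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "C", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}


def to_vertical(horizontal):
    """Return the vertical format: each item maps to the set of tids containing it."""
    vertical = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)


VERTICAL = to_vertical(HORIZONTAL)
for item in sorted(VERTICAL):
    print(item, sorted(VERTICAL[item]))
```

In the vertical format the support of a single item is simply the length of its tidset, which is what makes intersection-based counting possible.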

Equivalence Classes
- Define a function p: P(I) × N → P(I), where p(X, k) = X[1:k] is the k-length prefix of X
- Define a (prefix-based) equivalence relation θ_k on itemsets: X and Y are equivalent under θ_k ⇔ p(X, k) = p(Y, k)
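A minimal sketch of prefix-based grouping, assuming itemsets are written as strings of items in sorted order. The helper names `prefix` and `equivalence_classes` are hypothetical; two itemsets land in the same class exactly when their first k items agree.

```python
from collections import defaultdict
from itertools import combinations


def prefix(itemset, k):
    """k-length prefix of an itemset given as a sorted string of items."""
    return itemset[:k]


def equivalence_classes(itemsets, k):
    """Group itemsets that share the same k-length prefix."""
    classes = defaultdict(list)
    for x in itemsets:
        classes[prefix(x, k)].append(x)
    return dict(classes)


# All 2-itemsets over the items A, C, D, T, W, grouped by their 1-length prefix.
pairs = ["".join(c) for c in combinations("ACDTW", 2)]
classes = equivalence_classes(pairs, 1)
for p in sorted(classes):
    print(p, classes[p])
```

Each class can then be mined independently of the others, since extending a member only requires combining it with later members of the same class.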

Example (figure): subset search tree over the items A, C, D, T, W, from the root {} {A,C,D,T,W} down to ACDTW. Each node shows an itemset (the prefix) followed by the items that can still extend it, e.g. A {C,D,T,W}, C {D,T,W}, T {W}, AT {W}, CD {T,W}, CT {W}, ACD {T,W}, ACT {W}, CDT {W}, ACDT {W}.

Compute Subset Class
- Let [P] = {l_1, ..., l_n} be a class
- Intersect t(P l_i) with every t(P l_j) with j > i to obtain a new class [P l_i] with elements l_j, where P l_i l_j is frequent

Tidset Intersections (example)
Items:
  A: 1 3 4 5
  C: 1 2 3 4 5 6
  D: 2 4 5 6
  T: 1 3 5 6
  W: 1 2 3 4 5
2-itemsets:
  AC: 1 3 4 5
  AD: 4 5
  AT: 1 3 5
  AW: 1 3 4 5
  CD: 2 4 5 6
  CT: 1 3 5 6
  CW: 1 2 3 4 5
  DT: 5 6
  DW: 2 4 5
  TW: 1 3 5
3-itemsets:
  ACT: 1 3 5
  ACW: 1 3 4 5
  ATW: 1 3 5
  CDW: 2 4 5
  CTW: 1 3 5
4-itemsets:
  ACTW: 1 3 5
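The intersections in this example can be reproduced with a short recursive routine in the style of tidset-based vertical mining. This is a sketch under assumptions: the minimum support of 3 is not stated on the slide, the function name `eclat` and its signature are hypothetical, and itemsets are represented as strings for brevity.

```python
# Tidsets of the single items, as listed in the example.
T = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
MIN_SUP = 3  # assumed threshold


def eclat(prefix_class, min_sup, out):
    """prefix_class: list of (itemset, tidset) pairs that share a common prefix."""
    for i, (xi, ti) in enumerate(prefix_class):
        out[xi] = len(ti)  # every class member is already known to be frequent
        new_class = []
        for xj, tj in prefix_class[i + 1:]:
            tij = ti & tj  # tidset of the combined itemset
            if len(tij) >= min_sup:
                new_class.append((xi + xj[-1], tij))
        if new_class:
            eclat(new_class, min_sup, out)


supports = {}
root = [(x, T[x]) for x in sorted(T) if len(T[x]) >= MIN_SUP]
eclat(root, MIN_SUP, supports)
for itemset in sorted(supports, key=lambda s: (len(s), s)):
    print(itemset, supports[itemset])
```

With this threshold AD and DT are pruned at the second level, so no itemset containing them is ever generated.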

Diffsets
- A diffset is the difference between the prefix tidset and a class member tidset
- Consider a class with prefix P
- Let t(X) denote the tidset of element X
- Let d(X) denote the diffset of element X, with respect to the prefix tidset
- Let PX and PY be class members of P
- Support: σ(PX) = |t(PX)|, where t(PX) = t(P) ∩ t(X)

Diffsets
- Define the diffset d(PX) = t(P) − t(X)
- Then σ(PX) = σ(P) − |d(PX)|

Diffsets
- How to calculate σ(PXY) using d(PX) and d(PY)?
- d(PXY) = d(PY) − d(PX)
- σ(PXY) = σ(PX) − |d(PXY)|
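The diffset relations can be sanity-checked numerically. The sketch below assumes the relations d(PX) = t(P) − t(X), d(PXY) = d(PY) − d(PX) and σ(PXY) = σ(PX) − |d(PXY)|, and tests them for every ordered triple of distinct items in an assumed five-item database; `check_identities` is a hypothetical helper.

```python
from itertools import permutations

# Assumed tidsets of the single items.
T = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}


def check_identities(tidsets):
    """Verify the diffset relations for every prefix item P and members PX, PY."""
    for p, x, y in permutations(tidsets, 3):
        t_p = tidsets[p]
        t_px = t_p & tidsets[x]
        t_py = t_p & tidsets[y]
        d_px = t_p - tidsets[x]   # d(PX) = t(P) - t(X)
        d_py = t_p - tidsets[y]   # d(PY) = t(P) - t(Y)
        d_pxy = d_py - d_px       # d(PXY) = d(PY) - d(PX)
        t_pxy = t_px & t_py
        if d_pxy != t_px - t_py:
            return False
        if len(t_px) != len(t_p) - len(d_px):      # sigma(PX) = sigma(P) - |d(PX)|
            return False
        if len(t_pxy) != len(t_px) - len(d_pxy):   # sigma(PXY) = sigma(PX) - |d(PXY)|
            return False
    return True


print(check_identities(T))
```

The first check also shows why the recursion is consistent: d(PY) − d(PX) is the same set as t(PX) − t(PY), so a diffset computed from two diffsets matches one computed from two tidsets.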

Example (figure): the regions t(P), t(X), t(Y), d(PX), d(PY), d(PXY) and t(PXY)

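Diffset Intersections (example)
TIDSET database:
  A: 1 3 4 5
  C: 1 2 3 4 5 6
  D: 2 4 5 6
  T: 1 3 5 6
  W: 1 2 3 4 5
DIFFSET database:
  A: 2 6
  C: (empty)
  D: 1 3
  T: 2 4
  W: 6
Diffsets of 2-itemsets:
  AC: (empty)
  AD: 1 3
  AT: 4
  AW: (empty)
  CD: 1 3
  CT: 2 4
  CW: 6
  DT: 2 4
  DW: 6
  TW: 6
Diffsets of 3-itemsets:
  ACT: 4
  ACW: (empty)
  ATW: (empty)
  CDW: 6
  CTW: 6
Diffsets of 4-itemsets:
  ACTW: (empty)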

Diffset Example
- Diffset calculation
- Support calculation

Diffset Example
- Database size: tidsets database size = 23, diffsets database size = 7
- Total size: tidsets database size = 76, diffsets database size = 22
- Size by length:
  k-itemset (k)   Avg. tidset length   Avg. diffset length
  2               3.8                  1
  3               3.2                  0.6
  4               -                    -
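The database and total sizes quoted in this example can be recomputed from the item tidsets. The sketch below assumes the itemsets counted are exactly those drawn in the example lattice (the list `SLIDE_ITEMSETS` is transcribed from the slide) and that the diffset of an itemset is taken with respect to the tidset of its prefix; the helper names are hypothetical.

```python
ALL_TIDS = set(range(1, 7))
# Assumed tidsets of the single items.
T = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
# Itemsets that appear in the example lattice, by increasing length.
SLIDE_ITEMSETS = ("A C D T W AC AD AT AW CD CT CW DT DW TW "
                  "ACT ACW ATW CDW CTW ACTW").split()


def tidset(itemset):
    """Intersection of the tidsets of all items; the empty prefix maps to all tids."""
    tids = set(ALL_TIDS)
    for item in itemset:
        tids &= T[item]
    return tids


def diffset(itemset):
    """Tids of the prefix (all items but the last) that do not contain the last item."""
    return tidset(itemset[:-1]) - T[itemset[-1]]


def sizes(itemsets):
    """Total number of stored tids in tidset format and in diffset format."""
    return (sum(len(tidset(s)) for s in itemsets),
            sum(len(diffset(s)) for s in itemsets))


print(sizes(SLIDE_ITEMSETS[:5]))  # single items only
print(sizes(SLIDE_ITEMSETS))      # every itemset in the lattice
```

The gap widens with itemset length because longer itemsets in a dense class differ from their prefix in only a few transactions.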

Experimental Study
- Compare diffsets versus tidsets in terms of database sizes
- Method:
  - Real datasets (usually dense)
  - Synthetic datasets (sparse)

Size Of Database

Average Diffset / Tidset Size By Length

Average Diffset / Tidset Size
Database        Min_sup (%)  Max Length  Avg. Diffset Size  Avg. Tidset Size  Reduction Ratio
chess           0.5          16          26                 1820              70
connect         90           12          143                62204             435
mushroom        5            17          60                 622               10
pumsb*          35           15          301                18977             63
pumsb           -            8           330                45036             136
T10I4D100K      0.025        11          14                 86                6
T20I16D100K     0.1          -           31                 230               -
T40I10D100K     -            18          96                 755               -

When To Use Diffsets
- Usually there is a cross-over point
- For dense datasets, start with the diffset format
- For sparse datasets, start with the tidset format

Reduction Ratio
- Let P be a class
- Let PX and PY be class members with tidsets t(PX) and t(PY)
- Consider the new itemset PXY in class PX
- PXY can be stored as t(PXY) or d(PXY)
- Definition: reduction ratio r = |t(PXY)| / |d(PXY)|
- There is a benefit if r ≥ 1, or |t(PXY)| / |d(PXY)| ≥ 1

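Reduction Ratio
- Or, since t(PXY) = t(PX) ∩ t(PY) and d(PXY) = t(PX) − t(PY): |t(PX) ∩ t(PY)| / |t(PX) − t(PY)| ≥ 1
- Because |t(PX) − t(PY)| = |t(PX)| − |t(PXY)|, this is the same as |t(PX)| / |t(PXY)| ≤ 2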
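A small sketch of the switching test, assuming the reduction ratio is |t(PXY)| / |d(PXY)| with d(PXY) = t(PX) − t(PY). The function `prefer_diffset` is a hypothetical helper; it compares set sizes directly instead of dividing, so an empty diffset does not cause a division by zero.

```python
def prefer_diffset(t_px, t_py):
    """True when storing d(PXY) takes no more tids than storing t(PXY)."""
    t_pxy = t_px & t_py   # tidset of the new itemset
    d_pxy = t_px - t_py   # diffset of the new itemset with respect to t(PX)
    return len(t_pxy) >= len(d_pxy)


def prefer_diffset_by_support(t_px, t_py):
    """Equivalent test on supports: |t(PX)| <= 2 * |t(PXY)|."""
    return len(t_px) <= 2 * len(t_px & t_py)


# Assumed example tidsets: AC = {1,3,4,5}, AT = {1,3,5}.
print(prefer_diffset({1, 3, 4, 5}, {1, 3, 5}))  # t(ACT) has 3 tids, d(ACT) has 1
print(prefer_diffset({1, 3, 4, 5}, {2, 6}))     # empty intersection, 4 tids in the diffset
```

Read this way, the diffset pays off whenever the new itemset keeps at least half of the support of the class member it extends.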

Compressed Bitvectors
- The classical way, run-length encoding (RLE), is not appropriate for association mining
- Skinning encoding scheme (used by Viper)
  - Worst-case compression ratio asymptotically reaches 2.91
  - Best-case compression ratio asymptotically reaches 32

GenMax: Mining Maximal Frequent Itemsets
- Uses a backtracking search technique
- Optimizations
  - Initially sort items in increasing order of their combine-set size and increasing order of support, in order to (i) first explore items with small combine sets and (ii) remove a node as early as possible from the search tree
  - Superset checking
- More optimizations
  - Progressive focusing to improve superset checking
  - Vertical database format to improve frequency checking using tidsets, which is further improved by diffsets
- Memory handling
  - Store at most k = m + l tidsets (diffsets) in memory, where m is the length of the longest combine-set and l is the length of the longest maximal itemset



dEclat: Mining All Frequent Itemsets
- Performs a bottom-up search
- The equivalence class lattice is traversed in BFS order
- Input: the members of a class
- Frequent itemsets are generated by computing diffsets for all distinct pairs of itemsets in the class and checking the support of the resulting itemset
- Stores in memory the intermediate diffsets (tidsets) of at most two levels
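The diffset-based pair combination can be sketched as follows. This is not the authors' pseudocode: it is a simplified recursive, depth-first variant that does not reproduce the traversal order or the memory handling described on the slide, it assumes a minimum support of 3, and the name `declat` and the (itemset, diffset, support) tuple layout are hypothetical.

```python
ALL_TIDS = set(range(1, 7))
# Assumed tidsets of the single items.
T = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
MIN_SUP = 3  # assumed threshold


def declat(prefix_class, min_sup, out):
    """prefix_class: list of (itemset, diffset, support) sharing a common prefix."""
    for i, (xi, di, si) in enumerate(prefix_class):
        out[xi] = si
        new_class = []
        for xj, dj, _ in prefix_class[i + 1:]:
            d = dj - di        # d(PXY) = d(PY) - d(PX)
            s = si - len(d)    # sigma(PXY) = sigma(PX) - |d(PXY)|
            if s >= min_sup:
                new_class.append((xi + xj[-1], d, s))
        if new_class:
            declat(new_class, min_sup, out)


# Root class: diffsets of single items are taken with respect to all tids.
root = [(x, ALL_TIDS - T[x], len(T[x])) for x in sorted(T) if len(T[x]) >= MIN_SUP]
out = {}
declat(root, MIN_SUP, out)
for itemset in sorted(out, key=lambda s: (len(s), s)):
    print(itemset, out[itemset])
```

No tidset is ever materialized below the root: supports are carried along and decremented by the size of each new diffset.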


dCharm: Mining Frequent Closed Itemsets
- Performs a bottom-up search
- Eliminates branches and grows itemsets using subset relationships

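Subset Relationships
Theorem: Let X_i and X_j be any two members of a class [P], with X_i ≤_f X_j, where f is a total order (e.g., lexicographic or support-based). Two of the four properties that hold:
- If t(X_i) = t(X_j), then c(X_i) = c(X_j) = c(X_i ∪ X_j)
- If t(X_i) ⊂ t(X_j), then c(X_i) ≠ c(X_j), but c(X_i) = c(X_i ∪ X_j)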
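The two relationships can be illustrated on a small database, assuming they read: equal tidsets imply equal closures (also equal to the closure of the union), and a proper-subset tidset implies that the closure of the smaller-tidset itemset equals the closure of the union but differs from the other closure. The database and the helpers `tidset` and `closure` are assumptions made for this sketch.

```python
# Assumed example database (tid -> items).
DB = {
    1: {"A", "C", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "C", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}


def tidset(itemset):
    """Tids of the transactions that contain every item of itemset."""
    return {tid for tid, items in DB.items() if set(itemset) <= items}


def closure(itemset):
    """Items common to all transactions that contain itemset (needs support > 0)."""
    tids = tidset(itemset)
    return frozenset(set.intersection(*(DB[tid] for tid in tids)))


# Equal tidsets: t(AC) = t(AW), so AC, AW and ACW share one closure.
print(tidset("AC") == tidset("AW"), closure("AC") == closure("AW") == closure("ACW"))

# Proper subset: t(A) is strictly inside t(C), so c(A) = c(AC) while c(A) != c(C).
print(tidset("A") < tidset("C"), closure("A") == closure("AC"), closure("A") != closure("C"))
```

In both cases one branch of the search can be folded into another, which is how closed-itemset mining avoids enumerating every frequent itemset.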


Optimized Initialization Computation
- Let n be the number of frequent items
- Let l be the average tidset size
- Amount of data read is l · n · (n − 1)/2
- Number of intersections: n(n − 1)/2
- In the horizontal approach the amount of data read is l · n

Improvement
- Compute the frequent itemsets of length 2
- Combine items i and j only if the pair ij is frequent
- Now the number of intersections equals the number of frequent pairs, which in practice is closer to O(n) rather than O(n²)
- To compute the frequent itemsets of length 2, perform a vertical-to-horizontal transformation and update the count of pairs of items
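Counting pairs directly from horizontal transactions can be sketched in a few lines. The database, the threshold of 3 and the name `frequent_pairs` are assumptions for illustration; the sketch starts from transactions that are already horizontal and does not model the on-the-fly vertical-to-horizontal conversion.

```python
from collections import Counter
from itertools import combinations

# Assumed horizontal database (tid -> items).
DB = {
    1: {"A", "C", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "C", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}


def frequent_pairs(db, min_sup):
    """Count every pair of items once per transaction and keep the frequent pairs."""
    counts = Counter()
    for items in db.values():
        counts.update(combinations(sorted(items), 2))
    return {pair: c for pair, c in counts.items() if c >= min_sup}


pairs = frequent_pairs(DB, 3)
for pair in sorted(pairs):
    print("".join(pair), pairs[pair])
```

Only the surviving pairs then need a tidset or diffset intersection at the first level, so the infrequent pairs never cost an intersection.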

Experimental Results
- Times include all costs, including horizontal-to-vertical database conversion
- Method:
  - Real datasets (usually dense)
  - Synthetic datasets (sparse)

Database Characteristics
Database        # Items  Avg. Trans. Length  # Records
chess           76       37                  3,196
connect         130      43                  67,557
mushroom        120      23                  8,124
pumsb*          7117     50                  49,046
pumsb           -        74                  -
T10I4D100K      1000     10                  100,000
T20I16D100K     -        40                  -

Length Of the Longest Itemset

Cardinality Of F.I, C.F.I and M.F.I

Improvements using Diffsets

Mining Frequent Itemsets

Mining Closed Itemsets

Mining Maximal Itemsets

Conclusions
- Diffsets dramatically cut down the amount of memory required to store intermediate results
- Diffsets increase performance significantly when incorporated into previous vertical mining methods
- Diffsets can deliver over an order of magnitude performance improvement over the best previous methods