Closed Itemset Mining. CSCI-7173: Computational Complexity & Algorithms, Final Project, Spring 16. Supervised by Dr. Tom Altman. Presented by Shahab Helmi.

Outline: Introduction, Related Work, Approach, Experimental Results, Complexity Analysis

Introduction

Frequent Pattern Mining: finding things that frequently happen together. Variants: frequent itemset mining, frequent sequence mining, frequent episode mining, …

Frequent Itemset Mining: finding all items that are frequently bought together at a supermarket, e.g. detergents and softeners. Such items could be placed in the same aisle.

Related Work

Frequent Itemset Mining: Definitions & Notations, Naïve Algorithm, Apriori Algorithm, ECLAT Algorithm, dECLAT Algorithm
Itemset Summarization: Maximal Patterns, Closed Patterns, CHARM Algorithm

Definitions & Notations
D: a database of transactions
I: an item, e.g. detergent
X: an itemset (set of items), e.g. detergent and softener
T: a transaction <tid, X>
f(X): frequency of X, i.e. how many transactions contain X
s(X): support of X, s(X) = f(X)/|D|
σ: minimum support
Example database:
tid  X
1    ABDE
2    BCE
3    ABDE
4    ABCE
5    ABCDE
6    BCD
There are |T| = |D| = 6 transactions, so σ = 0.5 corresponds to a minimum frequency of 3. For example, f(A) = 4 and f(ADE) = 3. For simplicity, we state σ as a frequency threshold instead of a minimum support.
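
As a small illustration of the notation, the following sketch computes f(X) and s(X) on the example database. It assumes that transaction 3 is ABDE (inferred from the tidsets listed on the later slides); the function names `frequency` and `support` are only illustrative.

```python
# Example database from the slide (tid -> itemset).
# Assumption: transaction 3 is ABDE, inferred from the tidsets on later slides.
DB = {
    1: set("ABDE"),
    2: set("BCE"),
    3: set("ABDE"),
    4: set("ABCE"),
    5: set("ABCDE"),
    6: set("BCD"),
}


def frequency(itemset, db=DB):
    """f(X): number of transactions that contain every item of X."""
    items = set(itemset)
    return sum(1 for transaction in db.values() if items <= transaction)


def support(itemset, db=DB):
    """s(X) = f(X) / |D|."""
    return frequency(itemset, db) / len(db)


print(frequency("A"))    # 4
print(frequency("ADE"))  # 3
print(support("ADE"))    # 0.5
```

With σ = 0.5 an itemset is frequent when `support(X) >= 0.5`, which on six transactions is the same as `frequency(X) >= 3`.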

Naïve Algorithm. Example database (tid: X): 1: ABDE, 2: BCE, 3: ABDE, 4: ABCE, 5: ABCDE, 6: BCD; σ = 3.

Naïve Algorithm Complexity
There are $2^{|I|}$ candidate itemsets: $2^5 = 32$ for the example, and 1,073,741,824 combinations for $|I| = 30$.
Each itemset needs one dataset scan, so there are $2^{|I|}$ scans.
Each scan takes $O(|T| \times |I|)$.
The complexity of the naïve algorithm is therefore $O(2^{|I|} \times |T| \times |I|)$.
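
The enumeration can be sketched as follows. This is a minimal illustration rather than the implementation used in the project: it tries every non-empty itemset and scans the whole database once per candidate. The database assumes transaction 3 is ABDE, and `naive_frequent_itemsets` is a hypothetical name.

```python
from itertools import combinations

# Example database; transaction 3 is assumed to be ABDE.
DB = [set("ABDE"), set("BCE"), set("ABDE"), set("ABCE"), set("ABCDE"), set("BCD")]


def naive_frequent_itemsets(db, min_freq):
    """Count every non-empty itemset with one full database scan each."""
    items = sorted(set().union(*db))
    frequent = {}
    scans = 0
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            scans += 1  # one scan per candidate itemset
            freq = sum(1 for t in db if set(candidate) <= t)
            if freq >= min_freq:
                frequent["".join(candidate)] = freq
    return frequent, scans


frequent, scans = naive_frequent_itemsets(DB, min_freq=3)
print(scans)          # 31 non-empty candidates (2^5 - 1)
print(len(frequent))  # 19 frequent itemsets
print(2 ** 30)        # 1073741824 candidates for 30 items
```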

Apriori Property (σ = 3). If an itemset is frequent, all of its subsets are frequent. If an itemset is not frequent, none of its supersets is frequent.

Apriori Algorithm. Example database (tid: X): 1: ABDE, 2: BCE, 3: ABDE, 4: ABCE, 5: ABCDE, 6: BCD; σ = 3.

Apriori Algorithm Complexity
The worst-case complexity of the Apriori algorithm is still $O(2^{|I|} \times |T| \times |I|)$.
It runs much faster in practice because of the pruning.
It needs $l$ dataset scans, one per level of the search, as opposed to $2^{|I|}$.
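
A level-wise sketch of this idea is shown below. It is a simplified illustration, not the reference implementation from the Apriori paper: candidates of size k are built from frequent (k-1)-itemsets, pruned with the Apriori property, and then counted with one database scan per level. The database assumes transaction 3 is ABDE.

```python
from itertools import combinations

# Example database; transaction 3 is assumed to be ABDE.
DB = [set("ABDE"), set("BCE"), set("ABDE"), set("ABCE"), set("ABCDE"), set("BCD")]


def apriori(db, min_freq):
    """Return (frequent itemsets with counts, number of database scans)."""
    items = sorted(set().union(*db))
    # Level 1: one scan to count the single items.
    level = {}
    for item in items:
        freq = sum(1 for t in db if item in t)
        if freq >= min_freq:
            level[frozenset([item])] = freq
    frequent = dict(level)
    scans, k = 1, 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        if not candidates:
            break
        scans += 1  # one scan counts all candidates of this level
        level = {}
        for c in candidates:
            freq = sum(1 for t in db if c <= t)
            if freq >= min_freq:
                level[c] = freq
        frequent.update(level)
        k += 1
    return frequent, scans


frequent, scans = apriori(DB, min_freq=3)
print(len(frequent))  # 19
print(scans)          # 4
```

On the example the longest frequent itemset is ABDE, so four scans suffice instead of one scan per candidate.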

ECLAT Algorithm. Example database (tid: X): 1: ABDE, 2: BCE, 3: ABDE, 4: ABCE, 5: ABCDE, 6: BCD; σ = 3.

ECLAT Algorithm Complexity
The worst-case complexity of the ECLAT algorithm is $O(2^{|I|} \times |T|)$: up to $2^{|I|}$ frequent itemsets, and $O(|T|)$ for the tidset intersection of each.
Only one dataset scan is needed.
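
The vertical approach can be sketched as below: a single scan builds one tidset per item, and longer itemsets are obtained by intersecting tidsets depth-first. This is a simplified illustration under the assumption that transaction 3 is ABDE; `eclat` and `extend` are hypothetical names.

```python
# Example database (tid -> itemset); transaction 3 is assumed to be ABDE.
DB = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}


def eclat(db, min_freq):
    """Mine frequent itemsets by intersecting tidsets (vertical format)."""
    # Single dataset scan: build the tidset of every item.
    tidsets = {}
    for tid, items in db.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix_class):
        # prefix_class: list of (itemset tuple, tidset) sharing a common prefix.
        for idx, (x, tids_x) in enumerate(prefix_class):
            frequent["".join(x)] = len(tids_x)
            new_class = []
            for y, tids_y in prefix_class[idx + 1:]:
                tids_xy = tids_x & tids_y  # intersection costs O(|T|)
                if len(tids_xy) >= min_freq:
                    new_class.append((x + (y[-1],), tids_xy))
            extend(new_class)

    root = [((item,), tids) for item, tids in sorted(tidsets.items())
            if len(tids) >= min_freq]
    extend(root)
    return frequent


frequent = eclat(DB, min_freq=3)
print(len(frequent))     # 19
print(frequent["ABDE"])  # 3
print(frequent["BE"])    # 5
```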

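dECLAT Algorithm
Example database (tid: X): 1: ABDE, 2: BCE, 3: ABDE, 4: ABCE, 5: ABCDE, 6: BCD; σ = 3.
Here d(X) is the diffset of X, i.e. the tids of the transactions that do not contain X.
Joining $X_1$ and $X_2$: $d(X_1 \cup X_2) = d(X_2) - d(X_1)$ and $f(X_1 \cup X_2) = f(X_1) - |d(X_1 \cup X_2)|$.
Joining A and B: $d(AB) = d(B) - d(A) = \emptyset - \{2, 6\} = \emptyset$, so $f(AB) = 4 - 0 = 4$.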
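
The join can be checked with a few lines of code. The sketch below assumes transaction 3 is ABDE and takes the frequency of the joined itemset as f(X1) minus the size of the new diffset d(X1 ∪ X2), which is the form used in the diffset paper; `diffset` and `join` are illustrative names.

```python
# Example database (tid -> itemset); transaction 3 is assumed to be ABDE.
DB = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}


def diffset(item, db=DB):
    """d(item): tids of the transactions that do NOT contain the item."""
    return {tid for tid, items in db.items() if item not in items}


def join(f_x1, d_x1, d_x2):
    """Join X1 and X2 (same prefix): return (f(X1 u X2), d(X1 u X2))."""
    d_joined = d_x2 - d_x1
    return f_x1 - len(d_joined), d_joined


d = {item: diffset(item) for item in "ABCDE"}
f = {item: len(DB) - len(d[item]) for item in "ABCDE"}

f_ab, d_ab = join(f["A"], d["A"], d["B"])
f_ad, d_ad = join(f["A"], d["A"], d["D"])
f_abd, d_abd = join(f_ab, d_ab, d_ad)  # AB and AD share the prefix A

print(f_ab, sorted(d_ab))    # 4 []
print(f_ad, sorted(d_ad))    # 3 [4]
print(f_abd, sorted(d_abd))  # 3 [4]
```

Joining A and D shows why the size of the new diffset matters: d(D) = {2, 4} has two tids, but only tid 4 is missing from the transactions that contain A, so f(AD) = 4 - 1 = 3.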

Itemset Summarization
There are many frequent itemsets, especially when the minimum support is low. They are costly to store and hard to analyze. Instead, we can store only the important itemsets: maximal itemsets or closed itemsets.

Maximal Itemsets
An itemset is a maximal itemset if it is frequent and none of its supersets is frequent.
Tidsets of the single items: A: 1,3,4,5; B: 1,2,3,4,5,6; C: 2,4,5,6; D: 1,3,5,6; E: 1,2,3,4,5.
Frequent itemsets in the lattice: AB, AD (1,3,5), AE, BC, BD, BE, CE, DE, ABD, ABE, ADE, BCE (2,4,5), BDE, ABDE.

Basic Idea
$M$ is the list of maximal frequent itemsets, which is initially empty. Each time we generate a new frequent itemset $X$, we perform two checks:
Subset check: $\nexists Y \in M$ such that $X \subset Y$. If such a $Y$ exists, then $X$ is clearly not maximal. Otherwise, we add $X$ to $M$ as a potentially maximal itemset.
Superset check: $\nexists Y \in M$ such that $Y \subset X$. If such a $Y$ exists, then $Y$ cannot be maximal, and we have to remove it from $M$.
These checks are time consuming, so we need to minimize them.
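
The two checks can be sketched as a single update step on M. This is a plain, unoptimized illustration of the idea, not the optimized checking used in the original algorithm; `update_maximal` is a hypothetical helper and the itemsets fed to it are frequent itemsets of the example database in an arbitrary generation order.

```python
def update_maximal(maximal, new_itemset):
    """Insert a newly generated frequent itemset X into the list M."""
    x = frozenset(new_itemset)
    # Subset check: if X is contained in some Y in M, X is not maximal.
    if any(x <= y for y in maximal):
        return maximal
    # Superset check: any Y in M strictly contained in X is no longer maximal.
    maximal = [y for y in maximal if not y < x]
    maximal.append(x)
    return maximal


M = []
for itemset in ["A", "AD", "ADE", "ABDE", "B", "BC", "BCE", "BD", "E"]:
    M = update_maximal(M, itemset)

print(sorted("".join(sorted(m)) for m in M))  # ['ABDE', 'BCE']
```

Each call compares X against every element of M, which is why the number of such checks is worth minimizing.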

MaxGen Algorithm
1-itemsets: A: 1,3,4,5; B: 1,2,3,4,5,6; C: 2,4,5,6; D: 1,3,5,6; E: 1,2,3,4,5
2-itemsets: AD: 1,3,5; AE: 1,3,4,5; BC: 2,4,5,6; BD: 1,3,5,6; BE: 1,2,3,4,5; CE: 2,4,5; DE: 1,3,5
3-itemsets: ABD: 1,3,5; ABE: 1,3,4,5; ADE: 1,3,5; BCE: 2,4,5; BDE: 1,3,5
4-itemsets: ABDE: 1,3,5
M: ABDE: 1,3,5; BCE: 2,4,5

Maximal Patterns Are Not Lossless. All frequent patterns can be regenerated from the maximal itemsets. However, only lower bounds on their frequency counts are preserved.
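
A short sketch makes the loss concrete. It expands the two maximal itemsets of the example (ABDE and BCE, both with frequency 3) into all of their subsets and keeps, for each subset, the best lower bound that the maximal itemsets provide. The name `regenerate` is illustrative, and the true frequency of B is taken from its tidset 1,2,3,4,5,6.

```python
from itertools import combinations

# Maximal frequent itemsets of the example database with their frequencies.
MAXIMAL = {"ABDE": 3, "BCE": 3}


def regenerate(maximal):
    """Regenerate all frequent itemsets with a lower bound on their frequency."""
    bounds = {}
    for itemset, freq in maximal.items():
        for k in range(1, len(itemset) + 1):
            for subset in combinations(sorted(itemset), k):
                key = "".join(subset)
                # A subset occurs at least as often as any superset.
                bounds[key] = max(bounds.get(key, 0), freq)
    return bounds


bounds = regenerate(MAXIMAL)
print(len(bounds))   # 19 frequent itemsets recovered
print(bounds["B"])   # 3, a lower bound only: the true frequency of B is 6
```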

Closed Itemsets
An itemset is a closed itemset if it is frequent and none of its supersets has the same frequency count.
Tidsets of the single items: A: 1,3,4,5; B: 1,2,3,4,5,6; C: 2,4,5,6; D: 1,3,5,6; E: 1,2,3,4,5.
Frequent itemsets in the lattice: AB, AD (1,3,5), AE, BC, BD, BE, CE, DE, ABD, ABE, ADE, BCE (2,4,5), BDE, ABDE.
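
The definition can be tested by brute force, as in the sketch below. It is not the CHARM algorithm: it simply uses the fact that an itemset has no superset with the same frequency exactly when it equals the intersection of all transactions that contain it. The database assumes transaction 3 is ABDE, and `closed_frequent_itemsets` is a hypothetical name.

```python
from itertools import combinations

# Example database (tid -> itemset); transaction 3 is assumed to be ABDE.
DB = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
      4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}


def closed_frequent_itemsets(db, min_freq):
    """Brute-force search for frequent itemsets that equal their closure."""
    items = sorted(set().union(*db.values()))
    closed = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            x = set(candidate)
            cover = [t for t in db.values() if x <= t]
            if len(cover) < min_freq:
                continue
            # Closure: items shared by every transaction that contains X.
            closure = set.intersection(*cover)
            if closure == x:
                closed["".join(candidate)] = len(cover)
    return closed


closed = closed_frequent_itemsets(DB, min_freq=3)
print(sorted(closed.items()))
# [('ABDE', 3), ('ABE', 4), ('B', 6), ('BC', 4), ('BCE', 3), ('BD', 4), ('BE', 5)]
```

Unlike the maximal itemsets, these seven closed itemsets keep exact counts: the frequency of any frequent itemset is the largest count among the closed itemsets that contain it.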

Charm Algorithm: the original paper is listed in the references under Itemset Summarization.

Approach

Proposed Approach
The original MaxGen algorithm is implemented using ECLAT. I implemented it with both ECLAT and dECLAT.
Idea: find a way to predict which one works faster for a given dataset and minimum support, for example from the average number of items per transaction, or from the sizes of the diffsets and tidsets after the first scan, …
Improving the maximality check using suffix trees (not implemented): there is no standard implementation of trees in C#, and both algorithms spend the same amount of time on maximality checking.

Experimental Results

Experimental Setup
System configuration: Operating system: Windows 10; RAM: 16 GB; CPU: Intel® Core™ i7-4770 @ 3.40 GHz
Programming language: C# [WPF] using Visual Studio 2015
Dataset [D1]: retail market basket dataset supplied by an anonymous Belgian retail supermarket store. Duration: 5 months, 3 periods; |T| = 88,163; Average(|X|) = 13

Software Modules: Sampler, ECLAT Algorithm, dECLAT Algorithm, E-MaxGen, dE-MaxGen

Failure or Success? [Figure: execution time (s) versus |T|]

Future Work
Can we conclude that dECLAT-based MaxGen always outperforms ECLAT-based MaxGen? According to these results, yes.
Can we generalize this to all datasets? Maybe. In the dataset used, the number of items per transaction was between 7 and 13 for more than 70% of the transactions. Other datasets have to be tested.

Complexity Analysis

Short Version [C1]
According to [C2] and [C3], if a counting problem is #P-complete (or #P-hard), then its associated problem of enumerating all solutions must be NP-hard. In [C1] it is proved that the problem of counting maximal frequent itemsets is #P-complete.

Maximal Patterns – Bipartite Graphs
[Figure: a database over transactions t1 to t5 and items A, B, C, D, with its corresponding bipartite graph]
Lemma. Let $D$ be a database of transactions and $G_D$ the bipartite graph corresponding to $D$. Then every maximal σ-occurrent itemset in $D$ corresponds to a unique maximal bipartite $(\sigma; *)$-clique in $G_D$.
Complexity label on the slide: #P-complete.
{A, B} is a maximal 3-occurrent itemset and corresponds to the unique maximal bipartite (3; 2)-clique ({t1, t2, t3}, {A, B}).
{C, D} is a maximal 3-occurrent itemset and corresponds to the unique maximal bipartite (3; 2)-clique ({t3, t4, t5}, {C, D}).
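
The correspondence can be illustrated with a small sketch. The database below is an assumption: the figure did not survive in the transcript, so the transactions are chosen to be consistent with the two cliques named on the slide (t1 and t2 contain A and B, t3 contains all four items, t4 and t5 contain C and D). The helper names are hypothetical.

```python
from itertools import combinations

# Assumed database, reconstructed to match the two cliques named on the slide.
DB = {"t1": set("AB"), "t2": set("AB"), "t3": set("ABCD"),
      "t4": set("CD"), "t5": set("CD")}


def tidset(itemset, db):
    """Transactions (one side of the bipartite graph) containing the itemset."""
    return {tid for tid, items in db.items() if set(itemset) <= items}


def maximal_occurrent_itemsets(db, sigma):
    """Itemsets occurring in >= sigma transactions with no such proper superset."""
    items = sorted(set().union(*db.values()))
    occurrent = [frozenset(c)
                 for k in range(1, len(items) + 1)
                 for c in combinations(items, k)
                 if len(tidset(c, db)) >= sigma]
    return [x for x in occurrent if not any(x < y for y in occurrent)]


def as_biclique(itemset, db):
    """The (transactions, items) pair forming the corresponding bipartite clique."""
    return sorted(tidset(itemset, db)), sorted(itemset)


for itemset in maximal_occurrent_itemsets(DB, sigma=3):
    print(as_biclique(itemset, DB))
# (['t1', 't2', 't3'], ['A', 'B'])
# (['t3', 't4', 't5'], ['C', 'D'])
```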

Counting vs. Enumeration
Counting is usually an easier problem than enumeration. One way of counting is to enumerate (find) all solutions and then count them. Sometimes there are less complex ways to count the solutions than finding them all: for a complete graph with $n$ vertices, Cayley's formula gives the number of spanning trees as $n^{n-2}$.
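
The difference between the two routes can be seen on small complete graphs. The sketch below counts spanning trees once by enumerating every set of n - 1 edges and testing it with a union-find structure, and once with Cayley's formula n^(n-2); the function names are illustrative.

```python
from itertools import combinations


def is_spanning_tree(n, edges):
    """True if the n - 1 given edges contain no cycle (then they span K_n)."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            return False  # this edge would close a cycle
        parent[root_u] = root_v
    return True


def count_by_enumeration(n):
    """Count spanning trees of K_n by testing every set of n - 1 edges."""
    all_edges = list(combinations(range(n), 2))
    return sum(1 for subset in combinations(all_edges, n - 1)
               if is_spanning_tree(n, subset))


def count_by_cayley(n):
    """Cayley's formula: K_n has n^(n-2) spanning trees."""
    return n ** (n - 2)


for n in range(2, 6):
    print(n, count_by_enumeration(n), count_by_cayley(n))
# 2 1 1
# 3 3 3
# 4 16 16
# 5 125 125
```

The enumeration has to look at every subset of n - 1 edges, whereas the formula is a single exponentiation.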

References

Complexity
[C1] Yang, Guizhen. "The complexity of mining maximal frequent itemsets and maximal frequent patterns." Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2004.
[C2] Garey, Michael R., and David S. Johnson. "Computers and Intractability: A Guide to the Theory of NP-Completeness." W. H. Freeman, New York, 1979.
[C3] Papadimitriou, Christos H. "Computational Complexity." Addison-Wesley, 1994.

Frequent Itemset Mining
APRIORI: Agrawal, Rakesh, and Ramakrishnan Srikant. "Fast algorithms for mining association rules." Proc. 20th Int. Conf. Very Large Data Bases, VLDB. Vol. 1215. 1994.
ECLAT: Zaki, Mohammed Javeed, et al. "New algorithms for fast discovery of association rules." KDD. Vol. 97. 1997.
dECLAT: Zaki, Mohammed J., and Karam Gouda. "Fast vertical mining using diffsets." Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2003.

Itemset Summarization
MaxGen: Gouda, Karam, and Mohammed J. Zaki. "GenMax: An efficient algorithm for mining maximal frequent itemsets." Data Mining and Knowledge Discovery 11.3 (2005): 223-242.
Charm: Zaki, Mohammed J., and Ching-Jui Hsiao. "Efficient algorithms for mining closed itemsets and their lattice structure." IEEE Transactions on Knowledge and Data Engineering 17.4 (2005): 462-478.

Dataset
[D1] Brijs, Tom. "Retail market basket data set." Workshop on Frequent Itemset Mining Implementations (FIMI'03). 2003.