Closed Itemset Mining
CSCI-7173: Computational Complexity & Algorithms, Final Project - Spring 16
Supervised by Dr. Tom Altman
Presented by Shahab Helmi


Outline
  Introduction
  Related Work
  Approach
  Experimental Results
  Complexity Analysis

Introduction

Frequent Pattern Mining
Things that frequently happen together!
  Frequent Itemset Mining
  Frequent Sequence Mining
  Frequent Episode Mining
  …

Frequent Itemset Mining
All items that are frequently bought together at a supermarket, e.g. detergents and softeners.
Those items could be placed in the same aisle.

Related Work

Related Work
  Frequent Itemset Mining
    Definitions & Notations
    Naïve Algorithm
    Apriori Algorithm
    ECLAT Algorithm
    dECLAT Algorithm
  Itemset Summarization
    Maximal Patterns
    Closed Patterns
    CHARM Algorithm


Definitions & Notations
  D     A database of transactions
  I     Item: detergent
  X     Itemset (set of items): detergent and softener
  T     A transaction <tid, X>
  f(X)  Frequency of X: how many transactions contain X?
  s(X)  Support of X = f(X)/|D|
  σ     Minimum support

  tid  X
  1    ABDE
  2    BCE
  3    ABDE
  4    ABCE
  5    ABCDE
  6    BCD

For simplicity, we use frequency instead of minimum support: with |T| = 6 and σ = 0.5, min(f(X)) = 3.
Examples: f(A) = 4, f(ADE) = 3.
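The definitions above can be checked on the example database with a few lines of Python. This is a minimal sketch: the names `DB`, `frequency` and `support` are illustrative, and transaction 3 is assumed to be ABDE, which is what the tidsets listed later in the deck imply.

```python
# Example database; transaction 3 is an assumption inferred from the tidsets
# shown later in the deck (A, B, D and E all contain tid 3).
DB = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}


def frequency(itemset, db=DB):
    """f(X): number of transactions that contain every item of X."""
    items = set(itemset)
    return sum(1 for transaction in db.values() if items <= set(transaction))


def support(itemset, db=DB):
    """s(X) = f(X) / |D|."""
    return frequency(itemset, db) / len(db)


print(frequency("A"), frequency("ADE"), support("ADE"))
```

With σ = 0.5, an itemset is frequent exactly when `frequency` returns at least 3.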

Naïve Algorithm
[Example on the database above, with σ = 3]

Naïve Algorithm Complexity
$2^{|I|}$ itemsets: $2^5 = 32$ in the example; for $|I| = 30$ there are 1,073,741,824 combinations.
For each itemset we need to do one dataset scan: $2^{|I|}$ scans.
Each scan takes $O(|T| \times |I|)$.
Complexity of the naïve algorithm is $O(2^{|I|} \times |T| \times |I|)$.
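The cost structure described above (one full scan per candidate itemset) can be sketched as follows. The function name and the dictionary layout are illustrative, and transaction 3 is assumed to be ABDE as implied by the tidsets elsewhere in the deck.

```python
from itertools import combinations

# Example database; transaction 3 is an inferred assumption.
DB = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}


def naive_frequent_itemsets(db, min_freq):
    """Enumerate every non-empty itemset and scan the database once for each."""
    items = sorted(set().union(*db.values()))
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            wanted = set(candidate)
            # One dataset scan per candidate: O(|T| * |I|) work.
            count = sum(1 for t in db.values() if wanted <= set(t))
            if count >= min_freq:
                frequent["".join(candidate)] = count
    return frequent


frequent = naive_frequent_itemsets(DB, 3)
print(len(frequent), frequent["ABDE"])
```

Every one of the 31 non-empty candidates triggers a scan, even those that could have been ruled out by an infrequent subset.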

Apriori Property
If an itemset is frequent, all of its subsets are frequent.
If an itemset is not frequent, none of its supersets are frequent.
[Figure: two transaction tables, one with transactions containing AB… and one with transactions containing C… and AB…; σ = 3]

Apriori Algorithm
[Example on the database above, with σ = 3]

Apriori Algorithm Complexity
Complexity of the Apriori algorithm in the worst case is still $O(2^{|I|} \times |T| \times |I|)$.
It works much faster in practice because of the pruning.
$l$ dataset scans (one per candidate itemset length) as opposed to $2^{|I|}$.
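A level-wise sketch of Apriori, assuming the example database with transaction 3 = ABDE, shows where the pruning and the reduced number of scans come from. It is a simplified illustration rather than the implementation from the original paper; `apriori` and its return shape are hypothetical names.

```python
from itertools import combinations

# Example database; transaction 3 is an inferred assumption.
DB = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}


def apriori(db, min_freq):
    """Return (frequent itemsets with counts, number of dataset scans)."""
    transactions = [frozenset(t) for t in db.values()]
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent, scans, k = {}, 0, 1
    while candidates:
        scans += 1  # one pass over the data counts all candidates of length k
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_freq}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent, scans


frequent, scans = apriori(DB, 3)
print(len(frequent), scans)
```

On this database the loop runs once per itemset length, because candidates with an infrequent subset (for example ABC, since AC is infrequent) are never counted.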

ECLAT Algorithm
[Example on the database above, with σ = 3]

ECLAT Algorithm Complexity
Complexity of the ECLAT algorithm in the worst case is $O(2^{|I|} \times |T|)$: up to $2^{|I|}$ frequent itemsets, and $O(|T|)$ for the tidset intersection of each.
Only 1 dataset scan.
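The single scan and the per-itemset intersection can be sketched as below. This is a minimal depth-first version under the assumption that transaction 3 is ABDE; the helper names `vertical` and `eclat` are illustrative.

```python
# Example database; transaction 3 is an inferred assumption.
DB = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}


def vertical(db):
    """The only dataset scan: build the tidset of every item."""
    tidsets = {}
    for tid, items in db.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def eclat(prefix_class, min_freq, out):
    """prefix_class: list of (itemset, tidset) pairs sharing a common prefix."""
    for i, (x, tx) in enumerate(prefix_class):
        out[x] = len(tx)
        new_class = []
        for y, ty in prefix_class[i + 1:]:
            txy = tx & ty  # intersection costs O(|T|)
            if len(txy) >= min_freq:
                new_class.append((x + y[-1], txy))
        if new_class:
            eclat(new_class, min_freq, out)
    return out


MIN_FREQ = 3
root = [(item, tids) for item, tids in sorted(vertical(DB).items())
        if len(tids) >= MIN_FREQ]
frequent = eclat(root, MIN_FREQ, {})
print(frequent["AD"], frequent["BCE"])
```

After `vertical` has run, the database itself is never touched again; all counting happens on tidsets.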

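dECLAT Algorithm
[Example on the database above, with σ = 3]
d(X) is the diffset of X: the tids of the transactions that do not contain X.
Joining $X_1$ and $X_2$:
  $d(X_1 \cup X_2) = d(X_2) - d(X_1)$
  $f(X_1 \cup X_2) = f(X_1) - |d(X_1 \cup X_2)|$
Joining A and B:
  $d(AB) = d(B) - d(A) = \emptyset - \{2, 6\} = \emptyset$
  $f(AB) = f(A) - |d(AB)| = 4 - 0 = 4$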
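The diffset join can be sketched in the same depth-first style as ECLAT. The version below assumes transaction 3 is ABDE and uses the frequency update f(X1 ∪ X2) = f(X1) − |d(X1 ∪ X2)|; the function name `declat` and the tuple layout are illustrative.

```python
# Example database; transaction 3 is an inferred assumption.
DB = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}


def declat(prefix_class, min_freq, out):
    """prefix_class: list of (itemset, diffset, frequency) sharing a prefix."""
    for i, (x, dx, fx) in enumerate(prefix_class):
        out[x] = fx
        new_class = []
        for y, dy, _ in prefix_class[i + 1:]:
            dxy = dy - dx          # d(X1 u X2) = d(X2) - d(X1)
            fxy = fx - len(dxy)    # f(X1 u X2) = f(X1) - |d(X1 u X2)|
            if fxy >= min_freq:
                new_class.append((x + y[-1], dxy, fxy))
        if new_class:
            declat(new_class, min_freq, out)
    return out


MIN_FREQ = 3
root = []
for item in sorted(set("".join(DB.values()))):
    # Diffset of a single item: tids of transactions that do not contain it.
    diff = {tid for tid, items in DB.items() if item not in items}
    freq = len(DB) - len(diff)
    if freq >= MIN_FREQ:
        root.append((item, diff, freq))

frequent = declat(root, MIN_FREQ, {})
print(frequent["AB"], frequent["AE"])
```

Joining A and E is a useful check: d(E) = {6} and d(A) = {2, 6}, so d(AE) is empty and f(AE) = 4, which matches the tidset 1,3,4,5.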

Itemset Summarization
There are many frequent itemsets (especially when the minimum support is too low):
  Costly to store
  Hard to analyze
We can store only the important itemsets:
  Maximal itemsets
  Closed itemsets

Maximal Itemsets
An itemset is a maximal itemset if:
  It is frequent
  None of its supersets are frequent
[Figure: lattice of the frequent itemsets of the example database. Tidsets shown: A: 1,3,4,5; B: 1,2,3,4,5,6; C: 2,4,5,6; D: 1,3,5,6; E: 1,2,3,4,5; AD: 1,3,5; BCE: 2,4,5. Other nodes: AB, AE, BC, BD, BE, CE, DE, ABD, ABE, ADE, BDE, ABDE.]

Basic Idea
M: the list of maximal frequent itemsets, which is initially empty.
Each time we generate a new frequent itemset X, we have to do:
  Subset Check: ∄ Y ∈ M such that X ⊂ Y. If such a Y exists, then clearly X is not maximal. Otherwise, we add X to M as a potentially maximal itemset.
  Superset Check: ∄ Y ∈ M such that Y ⊂ X. If such a Y exists, then Y cannot be maximal, and we have to remove it from M.
These checks are time consuming, so we need to minimize them.
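The two checks can be sketched as a single update routine. This is an unoptimized illustration of the basic idea only, not the optimized check used by the published algorithm; `update_maximal` is a hypothetical name.

```python
def update_maximal(M, X):
    """Return the list of candidate maximal itemsets after seeing frequent X."""
    X = frozenset(X)
    # Subset check: if X is contained in some Y in M, X is not maximal.
    if any(X <= Y for Y in M):
        return M
    # Superset check: any Y in M that is strictly inside X cannot be maximal.
    M = [Y for Y in M if not Y < X]
    M.append(X)
    return M


M = []
for itemset in ["A", "AB", "ABD", "ABDE", "B", "BC", "BCE", "E"]:
    M = update_maximal(M, itemset)
print(sorted("".join(sorted(Y)) for Y in M))
```

Each call compares X with every element of M, which is why the number of checks matters.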

MaxGen Algorithm (published as GenMax; see References)
1-itemsets: A: 1,3,4,5; B: 1,2,3,4,5,6; C: 2,4,5,6; D: 1,3,5,6; E: 1,2,3,4,5
2-itemsets: AD: 1,3,5; AE: 1,3,4,5; BC: 2,4,5,6; BD: 1,3,5,6; BE: 1,2,3,4,5; CE: 2,4,5; DE: 1,3,5
3-itemsets: ABD: 1,3,5; ABE: 1,3,4,5; ADE: 1,3,5; BCE: 2,4,5; BDE: 1,3,5
4-itemsets: ABDE: 1,3,5
M: ABDE: 1,3,5; BCE: 2,4,5

Maximal Patterns Are Not Lossless
All frequent patterns can be regenerated from the maximal itemsets, since every subset of a maximal itemset is frequent.
However, only lower bounds of the frequency counts are preserved.
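A short sketch makes the loss concrete. It starts from the two maximal itemsets of the example (ABDE and BCE, both with frequency 3) and regenerates every frequent itemset together with the best lower bound that can be derived; `regenerate` is an illustrative name.

```python
from itertools import combinations

# Maximal frequent itemsets of the example database with their frequencies.
maximal = {"ABDE": 3, "BCE": 3}


def regenerate(maximal):
    """Map every subset of a maximal itemset to a lower bound on its frequency."""
    bounds = {}
    for itemset, freq in maximal.items():
        for k in range(1, len(itemset) + 1):
            for subset in combinations(itemset, k):
                key = "".join(subset)
                # A subset occurs at least as often as any superset.
                bounds[key] = max(bounds.get(key, 0), freq)
    return bounds


bounds = regenerate(maximal)
print(len(bounds), bounds["B"])
```

All 19 frequent itemsets are recovered, but B gets the bound 3 although its tidset 1,2,3,4,5,6 gives a true frequency of 6.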

Closed Itemsets
An itemset is a closed itemset if:
  It is frequent
  None of its supersets has the same frequency count
[Figure: lattice of the frequent itemsets of the example database. Tidsets shown: A: 1,3,4,5; B: 1,2,3,4,5,6; C: 2,4,5,6; D: 1,3,5,6; E: 1,2,3,4,5; AD: 1,3,5; BCE: 2,4,5. Other nodes: AB, AE, BC, BD, BE, CE, DE, ABD, ABE, ADE, BDE, ABDE.]
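The definition can be applied directly to the example database. The sketch below is a brute-force filter, not CHARM: it first collects all frequent itemsets and then keeps those with no superset of equal frequency. Transaction 3 is assumed to be ABDE, and all names are illustrative.

```python
from itertools import combinations

# Example database; transaction 3 is an inferred assumption.
DB = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}


def frequent_itemsets(db, min_freq):
    items = sorted(set().union(*db.values()))
    result = {}
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            count = sum(1 for t in db.values() if set(c) <= set(t))
            if count >= min_freq:
                result["".join(c)] = count
    return result


def closed_itemsets(frequent):
    """Keep X only if no proper superset has the same frequency.

    Checking frequent supersets is enough: an infrequent superset has a
    frequency below the threshold, hence below f(X).
    """
    closed = {}
    for x, fx in frequent.items():
        if not any(set(x) < set(y) and fy == fx for y, fy in frequent.items()):
            closed[x] = fx
    return closed


closed = closed_itemsets(frequent_itemsets(DB, 3))
print(sorted(closed.items()))
```

Unlike the maximal itemsets, the closed itemsets keep exact counts: the frequency of any frequent itemset equals the largest frequency among the closed itemsets that contain it.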

CHARM Algorithm
The original paper is listed under References.

Approach

Proposed Approach
The original MaxGen algorithm is implemented using ECLAT.
I implemented it with both ECLAT and dECLAT.
IDEA: Find a way to predict which one works faster with a given dataset and minimum support:
  The average number of items in transactions?
  The size of diffsets and tidsets after the first scan
  …
Improving the maximality check using suffix trees (not implemented):
  There is no standard implementation of trees in C#
  Both algorithms spend the same amount of time on maximality checking
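The two candidate predictors named above are cheap to measure after the first scan. The sketch below only shows how they could be computed on the small example database (transaction 3 assumed to be ABDE); whether they actually predict the faster variant is the open question of the project, and `first_scan_sizes` is a hypothetical helper.

```python
# Example database; transaction 3 is an inferred assumption.
DB = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}


def first_scan_sizes(db):
    """Total size of all single-item tidsets and diffsets after one scan."""
    all_tids = set(db)
    tidsets = {}
    for tid, items in db.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    diffsets = {item: all_tids - tids for item, tids in tidsets.items()}
    tid_total = sum(len(t) for t in tidsets.values())
    diff_total = sum(len(d) for d in diffsets.values())
    return tid_total, diff_total


tid_total, diff_total = first_scan_sizes(DB)
average_length = sum(len(t) for t in DB.values()) / len(DB)
print(tid_total, diff_total, round(average_length, 2))
```

On this dense toy database the diffsets are much smaller than the tidsets; a sparse database would show the opposite.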

Experimental Results

Experimental Setup
System Configuration:
  Windows 10
  RAM: 16GB
  CPU: Intel® Core™ GHz
Programming Language: C# [WPF] using Visual Studio 2015
Dataset [D1]: Retail market basket dataset supplied by an anonymous Belgian retail supermarket store.
  Duration: 5 months, 3 periods
  |T| = 88,163
  Average(|X|) = 13

Software Modules:
  Sampler
  ECLAT Algorithm
  dECLAT Algorithm
  E-MaxGen
  dE-MaxGen

Failure or Success?
[Figure: execution time (s) versus |T|]

Future Work
Can we conclude that the dECLAT-based MaxGen always outperforms the ECLAT-based MaxGen? According to these results, yes!
Can we generalize it to all datasets? Maybe!
In the dataset used, the number of items per transaction was between 7 and 13 for more than 70% of the transactions.
Other datasets have to be tested!

Complexity Analysis

Short Version [C1]
According to [C2] and [C3], if a counting problem is #P-complete (or #P-hard), then its associated problem of enumerating all solutions must be NP-hard.
In [C1] it is proved that the problem of counting maximal frequent itemsets is #P-complete.

Maximal Patterns – Bipartite Graphs
[Figure: a database of transactions t1 to t5 over the items A, B, C, D and its corresponding bipartite graph; annotated "#P-complete"]
Lemma. Let D be a database of transactions and G_D the bipartite graph corresponding to D. Then every maximal σ-occurrent itemset in D corresponds to a unique maximal bipartite (σ, ∗)-clique in G_D.
{A, B} is a maximal 3-occurrent itemset and corresponds to the unique maximal bipartite (3, 2)-clique ({t1, t2, t3}, {A, B}).
{C, D} is a maximal 3-occurrent itemset and corresponds to the unique maximal bipartite (3, 2)-clique ({t3, t4, t5}, {C, D}).
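The correspondence in the lemma can be illustrated on a small database. The transactions below are an assumption: the original table did not survive, so the sketch uses one database that is consistent with the two cliques named above. The transactions side of each clique is simply the tidset of the itemset.

```python
from itertools import combinations

# Hypothetical database, chosen to be consistent with the two cliques above.
DB = {"t1": "AB", "t2": "AB", "t3": "ABCD", "t4": "CD", "t5": "CD"}


def tidset(itemset, db=DB):
    """Transactions adjacent to every item of the itemset in the bipartite graph."""
    wanted = set(itemset)
    return frozenset(tid for tid, items in db.items() if wanted <= set(items))


def maximal_occurrent_itemsets(db, sigma):
    items = sorted(set().union(*db.values()))
    frequent = [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if len(tidset(c, db)) >= sigma
    ]
    # Maximal: no frequent proper superset exists.
    return [x for x in frequent if not any(x < y for y in frequent)]


# Each maximal itemset X pairs with its tidset to form a bipartite clique.
cliques = {(tidset(x), x) for x in maximal_occurrent_itemsets(DB, 3)}
for tids, items in sorted(cliques, key=lambda c: sorted(c[1])):
    print(sorted(tids), sorted(items))
```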

Counting vs. Enumeration
Counting is usually an easier problem than enumeration:
  One way of counting is to enumerate (find) all solutions and then count them!
  Sometimes there are less complex ways to count the number of solutions than finding them: for a complete graph with n vertices, Cayley's formula gives the number of spanning trees as $n^{n-2}$.
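The spanning-tree example can be checked for small n by doing it both ways. The brute-force enumerator below is an illustrative sketch (it tries every set of n − 1 edges), while the count is a single expression.

```python
from itertools import combinations


def is_spanning_tree(n, edges):
    """n - 1 edges on n vertices form a spanning tree iff they contain no cycle."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            return False  # this edge would close a cycle
        parent[root_u] = root_v
    return True


def enumerate_spanning_trees(n):
    """Enumeration: test every subset of n - 1 edges of the complete graph."""
    all_edges = list(combinations(range(n), 2))
    return [t for t in combinations(all_edges, n - 1) if is_spanning_tree(n, t)]


def count_spanning_trees(n):
    """Counting with Cayley's formula, valid for n >= 2."""
    return n ** (n - 2)


for n in range(2, 6):
    print(n, len(enumerate_spanning_trees(n)), count_spanning_trees(n))
```

The enumerator's work grows with the number of edge subsets, while the formula stays a constant number of arithmetic operations.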

References

Complexity
[C1]: Yang, Guizhen. "The complexity of mining maximal frequent itemsets and maximal frequent patterns." Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004.
[C2]: Garey, Michael R., and David S. Johnson. "Computers and Intractability: A Guide to the Theory of NP-Completeness." W. H. Freeman, New York (1979).
[C3]: Papadimitriou, Christos H. "Computational Complexity." Addison-Wesley, 1994.

Frequent Itemset Mining
APRIORI: Agrawal, Rakesh, and Ramakrishnan Srikant. "Fast algorithms for mining association rules." Proc. 20th int. conf. very large data bases, VLDB. Vol
ECLAT: Zaki, Mohammed Javeed, et al. "New Algorithms for Fast Discovery of Association Rules." KDD. Vol
dECLAT: Zaki, Mohammed J., and Karam Gouda. "Fast vertical mining using diffsets." Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003.

Itemset Summarization
MaxGen: Gouda, Karam, and Mohammed J. Zaki. "GenMax: An efficient algorithm for mining maximal frequent itemsets." Data Mining and Knowledge Discovery 11.3 (2005):
Charm: Zaki, Mohammed J., and Ching-Jui Hsiao. "Efficient algorithms for mining closed itemsets and their lattice structure." Knowledge and Data Engineering, IEEE Transactions on 17.4 (2005):

Dataset
[D1]: Brijs, Tom. "Retail market basket data set." Workshop on Frequent Itemset Mining Implementations (FIMI’03)
