Closed Itemset Mining CSCI-7173: Computational Complexity & Algorithms, Final Project - Spring 16 Supervised By Dr. Tom Altman Presented By Shahab Helmi
Outline Introduction Related Work Approach Experimental Results Complexity Analysis
Introduction
Frequent Pattern Mining
Things that frequently happen together!
- Frequent Itemset Mining
- Frequent Sequence Mining
- Frequent Episode Mining
- …
Frequent Itemset Mining
All items that are frequently bought together at a supermarket, e.g. detergents and softeners.
Those items could be placed in the same aisle.
Related Work
Related Work
Frequent Itemset Mining: Definitions & Notations, Naïve Algorithm, Apriori Algorithm, ECLAT Algorithm, dECLAT Algorithm
Itemset Summarization: Maximal Patterns, Closed Patterns, CHARM Algorithm
Definitions & Notations
D: a database of transactions
I: an item, e.g. detergent
X: an itemset (a set of items), e.g. detergent and softener
T: a transaction <tid, X>
f(X): frequency of X, the number of transactions that contain X
s(X): support of X, s(X) = f(X)/|D|
σ: minimum support

tid  X
1    ABDE
2    BCE
3    ABDE
4    ABCE
5    ABCDE
6    BCD

Number of transactions: |D| = 6 (written |T| on the later slides). With σ = 0.5, the minimum frequency is min(f(X)) = 3.
For simplicity, the threshold is stated as a frequency count (σ = 3) instead of a relative minimum support.
Examples: f(A) = 4, f(ADE) = 3
Naïve Algorithm
Example database (tid: itemset): 1: ABDE, 2: BCE, 3: ABDE, 4: ABCE, 5: ABCDE, 6: BCD
Minimum frequency: σ = 3
Naïve Algorithm Complexity
There are $2^{|I|}$ candidate itemsets: $2^5 = 32$ in the example; for $|I| = 30$ there are 1,073,741,824 combinations.
Each itemset needs one dataset scan: $2^{|I|}$ scans.
Each scan takes $O(|T| \times |I|)$.
The complexity of the naïve algorithm is $O(2^{|I|} \times |T| \times |I|)$.
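The enumerate-and-scan procedure behind this bound can be sketched in a few lines of Python. This is only an illustration, not the project's C# code: the function name `naive_frequent_itemsets` is hypothetical, and tid 3 is assumed to be ABDE, which is what the tidsets on the later slides imply.

```python
from itertools import combinations

# Example database from the slides; tid 3 = ABDE is an assumption
# inferred from the tidsets shown on the later slides.
DB = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def naive_frequent_itemsets(db, min_freq):
    """Enumerate every non-empty itemset and count it with one full scan."""
    items = sorted(set().union(*db.values()))
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            cand = set(candidate)
            # One database scan per candidate: |T| subset tests of size <= |I|.
            freq = sum(1 for t in db.values() if cand <= set(t))
            if freq >= min_freq:
                frequent["".join(candidate)] = freq
    return frequent

result = naive_frequent_itemsets(DB, 3)
print(len(result), result["A"], result["ADE"])
```

With five items the loop visits all 31 non-empty candidates, which is why the approach stops being practical long before $|I| = 30$.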
Apriori Property
If an itemset is frequent, all of its subsets are frequent.
If an itemset is not frequent, none of its supersets is frequent.
Minimum frequency in the example: σ = 3
Apriori Algorithm
Example database (tid: itemset): 1: ABDE, 2: BCE, 3: ABDE, 4: ABCE, 5: ABCDE, 6: BCD
Minimum frequency: σ = 3
Apriori Algorithm Complexity
The worst-case complexity of the Apriori algorithm is still $O(2^{|I|} \times |T| \times |I|)$.
It works much faster in practice because of the pruning.
It needs $l$ dataset scans (one per itemset length) as opposed to $2^{|I|}$.
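A minimal level-wise sketch of Apriori on the example database is given here to make the pruning and the scan count concrete. The function name `apriori` and the returned scan counter are illustrative choices, and tid 3 is assumed to be ABDE as implied by the tidsets on the later slides.

```python
from itertools import combinations

# Example database from the slides; tid 3 = ABDE is an assumption.
DB = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def apriori(db, min_freq):
    """Level-wise search; returns (frequent itemsets, number of scans)."""
    transactions = [frozenset(t) for t in db.values()]
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items]  # candidate 1-itemsets
    frequent, scans = {}, 0
    while level:
        scans += 1  # one pass over the database per level
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: f for c, f in counts.items() if f >= min_freq}
        frequent.update(current)
        k = len(level[0]) + 1
        # Join step: merge frequent (k-1)-itemsets into k-candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return frequent, scans

frequent, scans = apriori(DB, 3)
print(len(frequent), scans)
```

On this database the search stops after the level that contains ABDE, so the number of scans follows the length of the longest frequent itemset rather than the number of candidates.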
ECLAT Algorithm
Example database (tid: itemset): 1: ABDE, 2: BCE, 3: ABDE, 4: ABCE, 5: ABCDE, 6: BCD
Minimum frequency: σ = 3
ECLAT Algorithm Complexity
The worst-case complexity of the ECLAT algorithm is $O(2^{|I|} \times |T|)$: up to $2^{|I|}$ frequent itemsets, and $O(|T|)$ for the tidset intersection of each.
Only 1 dataset scan is needed.
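The vertical (tidset) search can be sketched as follows. This is a simplified depth-first version written for illustration, not the project's implementation: the names `eclat` and `extend` are hypothetical, items are single characters, and tid 3 is assumed to be ABDE.

```python
# Example database from the slides; tid 3 = ABDE is an assumption.
DB = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def eclat(db, min_freq):
    """Depth-first search over tidsets; the database is read only once."""
    # Single scan: build the vertical layout item -> set of tids.
    tidsets = {}
    for tid, itemset in db.items():
        for item in itemset:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix_class):
        # prefix_class: list of (itemset, tidset) pairs sharing a prefix.
        for i, (x, tx) in enumerate(prefix_class):
            frequent[x] = len(tx)
            next_class = []
            for y, ty in prefix_class[i + 1:]:
                txy = tx & ty  # frequency comes from a tidset intersection
                if len(txy) >= min_freq:
                    next_class.append((x + y[-1], txy))
            extend(next_class)

    extend([(i, t) for i, t in sorted(tidsets.items()) if len(t) >= min_freq])
    return frequent

frequent = eclat(DB, 3)
print(frequent["ABDE"], frequent["BE"])
```

Every frequency after the initial scan is obtained from an intersection of at most $|T|$ tids, which is where the $|T|$ factor in the bound comes from.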
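dECLAT Algorithm
Example database (tid: itemset): 1: ABDE, 2: BCE, 3: ABDE, 4: ABCE, 5: ABCDE, 6: BCD
Minimum frequency: σ = 3
$d(X)$ is the diffset of $X$; for a single item it is the set of tids whose transactions do not contain $X$.
Joining $X_1$ and $X_2$:
$d(X_1 \cup X_2) = d(X_2) - d(X_1)$
$f(X_1 \cup X_2) = f(X_1) - |d(X_1 \cup X_2)|$
Joining A and B:
$d(AB) = d(B) - d(A) = \emptyset - \{2, 6\} = \emptyset$
$f(AB) = f(A) - |d(AB)| = 4 - 0 = 4$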
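The diffset join can be checked numerically on the example database. The sketch below assumes tid 3 is ABDE (as the tidsets on the later slides imply) and uses hypothetical helper names `diffset` and `join`; it also joins A with D, a case where the resulting diffset is not empty.

```python
# Example database from the slides; tid 3 = ABDE is an assumption.
DB = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def diffset(item):
    """First-level diffset: tids of the transactions that do NOT contain item."""
    return {tid for tid, itemset in DB.items() if item not in itemset}

def join(d1, f1, d2):
    """Join X1 (diffset d1, frequency f1) with X2 (diffset d2)."""
    d12 = d2 - d1              # d(X1 u X2) = d(X2) - d(X1)
    return d12, f1 - len(d12)  # f(X1 u X2) = f(X1) - |d(X1 u X2)|

d_a, d_b, d_d = diffset("A"), diffset("B"), diffset("D")
f_a = len(DB) - len(d_a)       # f(A) = |D| - |d(A)| = 4

d_ab, f_ab = join(d_a, f_a, d_b)  # the join of A and B from the slide
d_ad, f_ad = join(d_a, f_a, d_d)  # a join with a non-empty diffset
print(d_ab, f_ab, d_ad, f_ad)
```

For AD the diffset is {4} and the frequency is 3, which matches the tidset AD: 1,3,5 listed on the later slides.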
Itemset Summarization
There are many frequent itemsets (especially when the minimum support is too low):
- Costly to store
- Hard to analyze
We can store only the important itemsets:
- Maximal itemsets
- Closed itemsets
Maximal Itemsets
An itemset is a maximal itemset if:
- It is frequent
- None of its supersets is frequent
Itemset lattice, tidsets shown on the slide: A: 1,3,4,5; B: 1,2,3,4,5,6; C: 2,4,5,6; D: 1,3,5,6; E: 1,2,3,4,5; AD: 1,3,5; BCE: 2,4,5
Other lattice nodes (no tidsets shown): AB, AE, BC, BD, BE, CE, DE, ABD, ABE, ADE, BDE, ABDE
Basic Idea
M: the list of maximal frequent itemsets, which is initially empty.
Each time we generate a new frequent itemset X, we have to perform two checks:
- Subset check: ∄ Y ∈ M such that X ⊂ Y. If such a Y exists, then clearly X is not maximal. Otherwise, we add X to M as a potentially maximal itemset.
- Superset check: ∄ Y ∈ M such that Y ⊂ X. If such a Y exists, then Y cannot be maximal, and we have to remove it from M.
These checks are time-consuming, so we need to minimize them.
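The two checks can be sketched as one update function over M. This is only an illustration of the bookkeeping, not the optimized checks of the actual algorithm; `update_maximal` is a hypothetical name, and the input list holds the frequent itemsets of the running example at minimum frequency 3.

```python
def update_maximal(maximal, x):
    """Insert frequent itemset x into the list M, keeping only maximal sets."""
    x = frozenset(x)
    # Subset check: x is not maximal if some Y in M already contains it.
    if any(x <= y for y in maximal):
        return maximal
    # Superset check: drop every Y in M that x strictly contains, then add x.
    return [y for y in maximal if not y < x] + [x]

# Frequent itemsets of the running example (minimum frequency 3).
FREQUENT = ["A", "B", "C", "D", "E",
            "AB", "AD", "AE", "BC", "BD", "BE", "CE", "DE",
            "ABD", "ABE", "ADE", "BCE", "BDE", "ABDE"]

M = []
for itemset in FREQUENT:
    M = update_maximal(M, itemset)

print(sorted("".join(sorted(m)) for m in M))
```

Each insertion compares X against every member of M, which is the cost the slide refers to when it says the checks must be minimized.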
MaxGen Algorithm (published as GenMax)
Tidsets of the frequent itemsets:
A: 1,3,4,5; B: 1,2,3,4,5,6; C: 2,4,5,6; D: 1,3,5,6; E: 1,2,3,4,5
AD: 1,3,5; AE: 1,3,4,5; BC: 2,4,5,6; BD: 1,3,5,6; BE: 1,2,3,4,5; CE: 2,4,5; DE: 1,3,5
ABD: 1,3,5; ABE: 1,3,4,5; ADE: 1,3,5; BCE: 2,4,5; BDE: 1,3,5
ABDE: 1,3,5
Maximal itemsets M: ABDE: 1,3,5; BCE: 2,4,5
Maximal Patterns Are Not Lossless All frequent patterns can be regenerated from maximal itemsets However, only lower bounds of frequency counts are preserved
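A short sketch of why the summary is lossy: every subset of a maximal itemset is frequent, but the only frequency information it inherits is the count of the maximal itemset containing it. The function name `regenerate` is hypothetical, and the input is the set M from the running example (ABDE and BCE, each with frequency 3).

```python
from itertools import combinations

# Maximal itemsets of the running example with their frequencies.
MAXIMAL = {"ABDE": 3, "BCE": 3}

def regenerate(maximal):
    """Map every non-empty subset of a maximal itemset to a frequency lower bound."""
    lower_bound = {}
    for itemset, freq in maximal.items():
        for k in range(1, len(itemset) + 1):
            for subset in combinations(itemset, k):
                key = "".join(subset)
                # A subset occurs at least as often as any superset of it.
                lower_bound[key] = max(lower_bound.get(key, 0), freq)
    return lower_bound

bounds = regenerate(MAXIMAL)
print(len(bounds), bounds["B"])
```

All 19 frequent itemsets are recovered, but B is only known to occur at least 3 times, whereas its tidset on the slides has 6 entries.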
Closed Itemsets
An itemset is a closed itemset if:
- It is frequent
- None of its supersets has the same frequency count
Itemset lattice, tidsets shown on the slide: A: 1,3,4,5; B: 1,2,3,4,5,6; C: 2,4,5,6; D: 1,3,5,6; E: 1,2,3,4,5; AD: 1,3,5; BCE: 2,4,5
Other lattice nodes (no tidsets shown): AB, AE, BC, BD, BE, CE, DE, ABD, ABE, ADE, BDE, ABDE
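The definition can be tested by brute force on the example database. The sketch below is not CHARM; it simply uses the equivalent closure test (an itemset is closed when it equals the intersection of all transactions that contain it). The function name is hypothetical, tid 3 is assumed to be ABDE, and `min_freq` is assumed to be at least 1.

```python
from itertools import combinations

# Example database from the slides; tid 3 = ABDE is an assumption.
DB = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def closed_frequent_itemsets(db, min_freq):
    """Brute-force closed itemset search (assumes min_freq >= 1)."""
    transactions = [frozenset(t) for t in db.values()]
    items = sorted(set().union(*transactions))
    closed = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            x = frozenset(cand)
            covering = [t for t in transactions if x <= t]
            if not covering or len(covering) < min_freq:
                continue
            # Closure of x: the items shared by every transaction containing x.
            closure = frozenset.intersection(*covering)
            if closure == x:  # no superset has the same frequency
                closed["".join(cand)] = len(covering)
    return closed

closed = closed_frequent_itemsets(DB, 3)
print(closed)
```

On this database 7 of the 19 frequent itemsets are closed, and unlike the maximal summary each one carries its exact frequency.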
Charm Algorithm The original paper is listed under references (slide #46).
Approach
Proposed Approach
The original MaxGen algorithm is implemented using ECLAT.
I implemented it with both ECLAT and dECLAT.
Idea: find a way to predict which variant runs faster for a given dataset and minimum support, for example from:
- The average number of items in the transactions
- The sizes of the diffsets and tidsets after the first scan
- …
Improving the maximality check using suffix trees (not implemented):
- There is no standard implementation of trees in C#.
- Both algorithms spend the same amount of time on maximality checking.
Experimental Results
Experimental Setup
System configuration:
- Operating system: Windows 10
- RAM: 16 GB
- CPU: Intel® Core™ i7-4770 @ 3.40 GHz
- Programming language: C# [WPF] using Visual Studio 2015
Dataset [D1]: retail market basket dataset supplied by an anonymous Belgian retail supermarket store.
- Duration: 5 months, 3 periods
- |T| = 88,163
- Average(|X|) = 13
Software Modules: Sampler ECLAT Algorithm dECLAT Algorithm E-MaxGen dE-MaxGen
Failure or Success?
[Figure: execution time (s) plotted against |T|]
Future Work
Can we conclude that the dECLAT-based MaxGen always outperforms the ECLAT-based MaxGen? According to these results, yes!
Can we generalize this to all datasets? Maybe! In the dataset used, the number of items per transaction was between 7 and 13 for more than 70% of the transactions.
Other datasets have to be tested!
Complexity Analysis
Short Version [C1]
According to [C2] and [C3], if a counting problem is #P-complete (or #P-hard), then its associated problem of enumerating all solutions must be NP-hard.
In [C1] it is proved that the problem of counting maximal frequent itemsets is #P-complete.
Maximal Patterns – Bipartite Graphs
[Figure: a transaction table over items A, B, C, D for transactions t1 to t5, and the corresponding bipartite graph]
Lemma. Let $D$ be a database of transactions and $G_D$ the bipartite graph corresponding to $D$. Then every maximal $\sigma$-occurrent itemset in $D$ corresponds to a unique maximal bipartite $(\sigma, *)$-clique in $G_D$.
#P-complete
$\{A, B\}$ is a maximal 3-occurrent itemset and corresponds to the unique maximal bipartite $(3, 2)$-clique $(\{t_1, t_2, t_3\}, \{A, B\})$.
$\{C, D\}$ is a maximal 3-occurrent itemset and corresponds to the unique maximal bipartite $(3, 2)$-clique $(\{t_3, t_4, t_5\}, \{C, D\})$.
Counting vs. Enumeration
Counting is usually an easier problem than enumeration:
- One way of counting is to enumerate (find) all solutions and then count them!
- Sometimes there are less complex ways to count the solutions than finding them: for a complete graph with $n$ vertices, Cayley's formula gives the number of spanning trees as $n^{n-2}$.
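The contrast can be made concrete for small complete graphs. The sketch below counts spanning trees twice, once by enumerating every set of $n - 1$ edges and once with Cayley's formula; the function names are illustrative and the code assumes $n \geq 2$.

```python
from itertools import combinations

def is_spanning_tree(n, edges):
    """True if the given n - 1 edges contain no cycle (so they span all n vertices)."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:  # u and v are already connected: this edge closes a cycle
            return False
        parent[ru] = rv
    return True

def count_by_enumeration(n):
    """Count spanning trees of the complete graph K_n by testing every edge subset."""
    all_edges = list(combinations(range(n), 2))
    return sum(1 for subset in combinations(all_edges, n - 1)
               if is_spanning_tree(n, subset))

def count_by_formula(n):
    """Cayley's formula: K_n has n^(n-2) spanning trees."""
    return n ** (n - 2)

for n in range(2, 6):
    print(n, count_by_enumeration(n), count_by_formula(n))
```

The enumeration has to test a number of edge subsets that grows combinatorially with $n$, while the formula is a single exponentiation.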
References
Complexity
[C1] Yang, Guizhen. "The complexity of mining maximal frequent itemsets and maximal frequent patterns." Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004.
[C2] Garey, Michael R., and David S. Johnson. "Computers and Intractability: A Guide to the Theory of NP-Completeness." W. H. Freeman, New York (1979).
[C3] Papadimitriou, Christos H. "Computational Complexity." Addison-Wesley, 1994.
Frequent Itemset Mining
APRIORI: Agrawal, Rakesh, and Ramakrishnan Srikant. "Fast algorithms for mining association rules." Proc. 20th int. conf. very large data bases, VLDB. Vol. 1215. 1994.
ECLAT: Zaki, Mohammed Javeed, et al. "New Algorithms for Fast Discovery of Association Rules." KDD. Vol. 97. 1997.
dECLAT: Zaki, Mohammed J., and Karam Gouda. "Fast vertical mining using diffsets." Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003.
Itemset Summarization
MaxGen (GenMax): Gouda, Karam, and Mohammed J. Zaki. "GenMax: An efficient algorithm for mining maximal frequent itemsets." Data Mining and Knowledge Discovery 11.3 (2005): 223-242.
CHARM: Zaki, Mohammed J., and Ching-Jui Hsiao. "Efficient algorithms for mining closed itemsets and their lattice structure." Knowledge and Data Engineering, IEEE Transactions on 17.4 (2005): 462-478.
Dataset
[D1] Brijs, Tom. "Retail market basket data set." Workshop on Frequent Itemset Mining Implementations (FIMI'03). 2003.