Summarizing Itemset Patterns: A Profile-Based Approach. Xifeng Yan, Hong Cheng, Jiawei Han, Dong Xin. ACM KDD '05. Advisor: Jia-Ling Koh. Speaker: Yu-Jiun Liu. 2006/01/06
Introduction Ⅰ Closed frequent pattern: a frequent pattern with no super-pattern of the same support. Maximal frequent pattern: a frequent pattern with no frequent super-pattern. Top-K vs. K representatives.
Introduction Ⅱ The format of these representatives. How to find these representatives? The measure of their quality.
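Definition Bernoulli Distribution Vector: for items o1, ..., od with boolean random variables x1, ..., xd (xi = 1 when item oi is present), p = [p(x1), ..., p(xd)], where each p(xi) is a Bernoulli distribution. Pattern Profile: for patterns α1, ..., αl with D' = Dα1 ∪ ... ∪ Dαl, a profile is a triple M = (p, ψ, ρ), where p is the distribution vector learned from D', ψ = α1 ∪ ... ∪ αl is the master itemset, and ρ = |D'| / |D| is the support.
Equations The relative frequency of item oi in D': p(xi = 1) = (Σ_{tj ∈ D'} tij) / |D'|, where tij = 1 if transaction tj contains oi and 0 otherwise. Estimated support of a pattern αk under a profile M = (p, ψ, ρ): ŝ(αk) = ρ × Π_{oi ∈ αk} p(xi = 1).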
Pattern Profile Example Both of the above datasets can be summarized by <abcd>, but the quality is better for D1. p(a) = (50+1000)/(50+100+1000) = 0.91. Mabc = <[0.91, 0.96, 1], abcd, 0.87>. M = <[0.91, 0.96, 1, 1], abcd, 1>.
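The arithmetic on this slide can be reproduced with a small sketch. The transaction counts 50, 100 and 1000 come from the slide; assigning them to the itemsets acd, bcd and abcd is an assumption made for illustration (the original table is not available here), and `build_profile` and `estimated_support` are hypothetical helper names. The estimate multiplies the profile support by the per-item probabilities, treating items as independent.

```python
from math import prod


def build_profile(master, transactions, dataset_size):
    """Build a profile (p, master, rho) from the transactions in D'.

    transactions maps each distinct itemset in D' to how often it occurs.
    dataset_size is |D|, the size of the whole dataset.
    """
    size = sum(transactions.values())  # |D'|
    # p[item] is the relative frequency of the item within D'.
    p = {
        item: sum(n for t, n in transactions.items() if item in t) / size
        for item in sorted(master)
    }
    return p, frozenset(master), size / dataset_size


def estimated_support(pattern, profile):
    """Estimate support as rho times the product of the item probabilities."""
    p, master, rho = profile
    if not set(pattern) <= master:
        raise ValueError("pattern is not covered by the master itemset")
    return rho * prod(p[item] for item in pattern)


# Assumed layout of D1: which itemset carries which count is a guess.
d1 = {frozenset("acd"): 50, frozenset("bcd"): 100, frozenset("abcd"): 1000}
profile = build_profile("abcd", d1, dataset_size=1150)
p, master, rho = profile
print(round(p["a"], 2), round(p["b"], 2), p["c"], p["d"], rho)
print(estimated_support("ab", profile))
```

Under these assumed counts, p(a) and p(b) round to the 0.91 and 0.96 shown on the slide, and c and d occur in every transaction.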
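Pattern Summarization First, construct a special profile for each pattern that contains only that pattern itself. Then use the Kullback-Leibler divergence between profiles as the distance for merging similar patterns. KL-divergence: KL(p || q) = Σ_{i=1..d} Σ_{xi ∈ {0,1}} p(xi) log (p(xi) / q(xi)).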
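As a minimal sketch, the KL-divergence between two Bernoulli distribution vectors can be computed dimension by dimension, summing over the two outcomes xi = 1 and xi = 0. The function name `bernoulli_kl` is hypothetical, and clamping probabilities away from 0 and 1 with `eps` is an assumption added here to avoid taking the logarithm of zero; it is not necessarily the smoothing used by the authors.

```python
from math import log


def bernoulli_kl(p, q, eps=1e-9):
    """KL(p || q) for two vectors of Bernoulli parameters p(xi = 1), q(xi = 1)."""
    if len(p) != len(q):
        raise ValueError("both vectors must have the same number of dimensions")
    total = 0.0
    for pi, qi in zip(p, q):
        # Assumed smoothing: keep every probability strictly inside (0, 1).
        pi = min(max(pi, eps), 1 - eps)
        qi = min(max(qi, eps), 1 - eps)
        # Contribution of outcome xi = 1 plus contribution of outcome xi = 0.
        total += pi * log(pi / qi) + (1 - pi) * log((1 - pi) / (1 - qi))
    return total


same = bernoulli_kl([0.91, 0.96, 1.0], [0.91, 0.96, 1.0])
forward = bernoulli_kl([0.9], [0.5])
backward = bernoulli_kl([0.5], [0.9])
print(same, forward, backward)
```

The divergence is zero for identical vectors and is not symmetric, so a clustering procedure has to fix a direction or symmetrize it.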
Hierarchical Agglomerative Clustering
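A rough sketch of hierarchical agglomerative clustering over pattern profiles follows. It starts with one cluster per pattern and repeatedly merges the closest pair until K clusters remain. Several choices are assumptions made for illustration: the symmetrized KL distance, the clamping constant `eps`, recomputing each distribution from the merged transaction set, and the names `distribution`, `kl` and `summarize`. Every pattern is assumed to occur in at least one transaction, and k is assumed to be at least 1.

```python
from itertools import combinations
from math import log


def distribution(tids, transactions, items, eps=1e-3):
    """Relative frequency of each item over the given transactions, clamped to (0, 1)."""
    n = len(tids)
    freqs = (sum(item in transactions[t] for t in tids) / n for item in items)
    return [min(max(f, eps), 1 - eps) for f in freqs]


def kl(p, q):
    """KL-divergence between two Bernoulli distribution vectors."""
    return sum(a * log(a / b) + (1 - a) * log((1 - a) / (1 - b)) for a, b in zip(p, q))


def summarize(patterns, transactions, k):
    """Merge patterns bottom-up into k profiles (master itemset, p, rho)."""
    items = sorted(set().union(*transactions))
    # One initial cluster per pattern: (master itemset, supporting transaction ids).
    clusters = []
    for pattern in patterns:
        pattern = frozenset(pattern)
        tids = frozenset(i for i, t in enumerate(transactions) if pattern <= t)
        clusters.append((pattern, tids))

    def distance(pair):
        (_, tids_a), (_, tids_b) = pair
        p = distribution(tids_a, transactions, items)
        q = distribution(tids_b, transactions, items)
        return kl(p, q) + kl(q, p)  # assumed symmetrization

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=distance)
        clusters.remove(a)
        clusters.remove(b)
        # Merged cluster: union of master itemsets and of supporting transactions.
        clusters.append((a[0] | b[0], a[1] | b[1]))
    return [
        (master, distribution(tids, transactions, items), len(tids) / len(transactions))
        for master, tids in clusters
    ]


transactions = [{"a", "b"}] * 3 + [{"c", "d"}] * 3
profiles = summarize(["a", "b", "c", "d"], transactions, k=2)
for master, p, rho in profiles:
    print(sorted(master), p, rho)
```

This version recomputes all pairwise distances in every round, which is simple but quadratic per merge; caching distances would be the obvious refinement.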
K-means Clustering
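A k-means style alternative can be sketched in the same setting: pick K patterns as initial centers, assign every pattern to the center with the smallest KL-divergence, rebuild each center from the union of its members' transactions, and repeat until the assignment stops changing. The seeding strategy, the direction of the divergence, the clamping constant and the helper names are assumptions for illustration, not details taken from the slides.

```python
import random
from math import log


def distribution(tids, transactions, items, eps=1e-3):
    """Relative frequency of each item over the given transactions, clamped to (0, 1)."""
    n = len(tids)
    freqs = (sum(item in transactions[t] for t in tids) / n for item in items)
    return [min(max(f, eps), 1 - eps) for f in freqs]


def kl(p, q):
    """KL-divergence between two Bernoulli distribution vectors."""
    return sum(a * log(a / b) + (1 - a) * log((1 - a) / (1 - b)) for a, b in zip(p, q))


def kmeans_profiles(patterns, transactions, k, iterations=20, seed=0):
    """Cluster patterns into at most k profiles (master itemset, p, rho)."""
    items = sorted(set().union(*transactions))
    support = {}
    for pattern in patterns:
        pattern = frozenset(pattern)
        support[pattern] = frozenset(i for i, t in enumerate(transactions) if pattern <= t)
    own = {pat: distribution(tids, transactions, items) for pat, tids in support.items()}
    # Assumed seeding: k patterns drawn at random serve as the initial centers.
    seeds = random.Random(seed).sample(sorted(support, key=sorted), k)
    centers = [own[pat] for pat in seeds]
    assignment = {}
    for _ in range(iterations):
        updated = {
            pat: min(range(k), key=lambda j: kl(own[pat], centers[j])) for pat in support
        }
        if updated == assignment:
            break
        assignment = updated
        for j in range(k):
            tids = frozenset().union(*(support[pat] for pat in support if assignment[pat] == j))
            if tids:  # an empty cluster keeps its previous center
                centers[j] = distribution(tids, transactions, items)
    profiles = []
    for j in range(k):
        members = [pat for pat in support if assignment[pat] == j]
        if members:
            master = frozenset().union(*members)
            tids = frozenset().union(*(support[pat] for pat in members))
            profiles.append((master, centers[j], len(tids) / len(transactions)))
    return profiles


transactions = [{"a", "b"}] * 3 + [{"c", "d"}] * 3
profiles = kmeans_profiles(["a", "b", "c", "d"], transactions, k=2)
for master, p, rho in profiles:
    print(sorted(master), rho)
```

Unlike the agglomerative sketch, the result can depend on the initial seeds, so several restarts may be worthwhile on real data.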
Optimization Heuristics Closed Itemsets vs. Frequent Itemsets: given patterns α and β, if α ⊆ β and their supports are equal, then Dα = Dβ, so KL(Mα || Mβ) = KL(Mβ || Mα) = 0. Approximate Profiles: replace the original profile update with two approximate update equations, one for Algorithm 1 and one for Algorithm 2.
Quality Evaluation Definition (Restoration Error) T is a testing pattern set. T' is the collection of the itemsets that are generated by the master patterns in the profiles and estimated to be frequent.
Quality Evaluation J tests “frequent patterns”, some of which may be estimated as “infrequent”. Jc tests “estimated frequent patterns”, some of which are actually “infrequent”. Therefore J and Jc are complementary to each other.
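The two complementary measures can be sketched as average relative errors. The exact formulas are not recoverable from this slide, so the following is an assumption: J averages |s − ŝ| / s over the truly frequent patterns, and Jc averages |s − ŝ| / ŝ over the patterns estimated to be frequent. The function name, the dictionary inputs and the sample support values are all hypothetical.

```python
def restoration_errors(true_support, est_support, min_sup):
    """Return (J, Jc) for true and estimated supports keyed by pattern.

    J  is computed over patterns whose true support reaches min_sup.
    Jc is computed over patterns whose estimated support reaches min_sup.
    A pattern missing from a dictionary is treated as having support 0.
    """
    frequent = [a for a, s in true_support.items() if s >= min_sup]
    est_frequent = [a for a, s in est_support.items() if s >= min_sup]
    if not frequent or not est_frequent:
        raise ValueError("both testing sets must be non-empty")
    j = sum(
        abs(true_support[a] - est_support.get(a, 0.0)) / true_support[a]
        for a in frequent
    ) / len(frequent)
    jc = sum(
        abs(true_support.get(a, 0.0) - est_support[a]) / est_support[a]
        for a in est_frequent
    ) / len(est_frequent)
    return j, jc


# Made-up supports: "ac" is infrequent but estimated as frequent.
true_support = {"ab": 0.50, "cd": 0.40, "ac": 0.05}
est_support = {"ab": 0.45, "cd": 0.40, "ac": 0.20}
j, jc = restoration_errors(true_support, est_support, min_sup=0.1)
print(j, jc)
```

In this toy input the overestimated pattern "ac" does not affect J at all but dominates Jc, which is the reason the two measures are read together.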
Quality Evaluation Lemma For any frequent itemset π, there must exist a profile Mk such that π ⊆ ψk, where ψk is the master itemset of Mk.
Optimal Number of Profiles How to determine K? M = (p, ψ, ρ) Ex: require for any i such that p~q α~β Dα~Dβ~Dα∪Dβ Check the derivative of the quality over K: if J increases suddenly from K* to K* − 1, K* is likely to be a good choice.
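The heuristic on this slide can be sketched as a search for the largest jump in restoration error when one profile is removed. Picking the single largest jump is an assumption made for illustration (the slide only says the error increases "suddenly"), and the error values below are made up.

```python
def suggest_k(errors):
    """Return the K whose removal of one profile (K to K - 1) raises J the most.

    errors maps a number of profiles K to the restoration error J measured at K.
    Only K values whose predecessor K - 1 was also measured are considered.
    """
    jumps = {k: errors[k - 1] - errors[k] for k in errors if k - 1 in errors}
    if not jumps:
        raise ValueError("need J for at least two consecutive values of K")
    return max(jumps, key=jumps.get)


# Hypothetical measurements: J rises sharply when going from 3 profiles to 2.
errors = {2: 0.40, 3: 0.12, 4: 0.10, 5: 0.09}
best_k = suggest_k(errors)
print(best_k)
```

In practice the curve may show several jumps of similar size, in which case each of them is a candidate worth inspecting.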
Optimal Number of Profiles
Experiment Three real datasets and a series of synthetic datasets. Language: Visual C++. CPU: Intel 3.2 GHz. Memory: 1 GB. OS: Windows XP.
Mushroom ※688 closed patterns
BMS-Webview1 & Replace BMS-Webview1: ※threshold = 0.1% ※4195 closed patterns ※many small frequent itemsets. Replace: ※threshold = 3% ※4315 closed patterns ※many small frequent itemsets.
Synthetic Datasets Provided by IBM. 7 datasets, each with 10,000 transactions. Choose top-500. K = 50 and 100.