Summarizing Itemset Patterns: A Profile-Based Approach


1 Summarizing Itemset Patterns: A Profile-Based Approach
Xifeng Yan, Hong Cheng, Jiawei Han, Dong Xin. ACM KDD '05. Advisor: Jia-Ling Koh. Speaker: Yu-Jiun Liu. 2006/01/06

2 Introduction Ⅰ
Closed frequent pattern: a frequent pattern that has no super-pattern with the same support. Maximal frequent pattern: a frequent pattern that has no frequent super-pattern. Top-K vs. K representatives.

3 Introduction Ⅱ
What format should these representatives take? How can they be found? How should their quality be measured?

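4 Definition
Bernoulli Distribution Vector: p = [p(x1), ..., p(xd)], where each p(xi) is a Bernoulli distribution over the presence of item oi. Pattern Profile: a triple M = (p, ψ, ρ) built from a set of patterns, where p is the distribution vector learned from the transactions D' that support those patterns, ψ is the master pattern (the union of the patterns), and ρ = |D'|/|D| is the support.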

5 Equations
The relative frequency of item oi in D': p(xi = 1) = |{t ∈ D' : oi ∈ t}| / |D'|. Estimated support.
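The two quantities on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: transactions and patterns are modeled as `frozenset`s, the estimated support is taken to be the profile support multiplied by the per-item frequencies (treating items as independent), and the names `build_profile` and `estimated_support` are hypothetical, not from the slides.

```python
from math import prod


def build_profile(transactions, patterns):
    """Build a (p, master, rho) triple for a group of patterns.

    Assumes at least one transaction supports the group, so D' is non-empty.
    """
    # Master pattern: union of all patterns in the group.
    master = frozenset().union(*patterns)
    # D': transactions that contain at least one pattern of the group.
    d_prime = [t for t in transactions if any(a <= t for a in patterns)]
    # Relative frequency of each master-pattern item in D'.
    p = {item: sum(item in t for t in d_prime) / len(d_prime) for item in master}
    # Profile support: share of the whole dataset covered by D'.
    rho = len(d_prime) / len(transactions)
    return p, master, rho


def estimated_support(profile, pattern):
    """Estimate support as rho times the product of item frequencies (assumed form)."""
    p, master, rho = profile
    if not pattern <= master:
        return 0.0  # this profile cannot generate the pattern
    return rho * prod(p[item] for item in pattern)
```

For example, with transactions `abc, abc, ab, c` and the pattern group `{ab, abc}`, D' holds three transactions, so ρ = 0.75 and the frequency of `c` in D' is 2/3; the estimate for `abc` is then about 0.5, which matches its true support in this toy dataset.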

6 Pattern Profile Example
Both of the above datasets can be summarized by <abcd>, but the quality is better for D1. p(a) = ( )/( ) = 0.91. M_abc = <[0.91, 0.96, 1], abcd, 0.87>. M = <[0.91, 0.96, 1, 1], abcd, 1>.

7 Pattern Summarization
First, construct a special profile for each pattern that contains only that pattern itself. Then use the Kullback-Leibler (KL) divergence between profiles to merge similar patterns. KL-divergence: KL(p ‖ q) = Σ_i Σ_{xi ∈ {0,1}} p(xi) log(p(xi) / q(xi)).
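A minimal sketch of the KL divergence between two Bernoulli distribution vectors follows. It assumes each vector is a plain list of probabilities p(xi = 1) over the same items in the same order. The clipping constant `eps` is an added assumption to avoid log(0) and division by zero; the slides do not say how zero probabilities are handled.

```python
from math import log


def kl_bernoulli(p, q, eps=1e-6):
    """KL(p || q) summed over independent Bernoulli components."""
    total = 0.0
    for pi, qi in zip(p, q):
        # Clip into (0, 1) so that every logarithm is defined (assumed smoothing).
        pi = min(max(pi, eps), 1 - eps)
        qi = min(max(qi, eps), 1 - eps)
        # Contribution of x_i = 1 plus contribution of x_i = 0.
        total += pi * log(pi / qi) + (1 - pi) * log((1 - pi) / (1 - qi))
    return total
```

The divergence is zero for identical vectors and is not symmetric in general, so a clustering routine has to choose a direction or combine both.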

8 Hierarchical Agglomerative Clustering
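The slide body is not preserved in the transcript, so the following is only a rough sketch of how agglomerative clustering over profiles might look, not the presenters' algorithm. Assumptions: transactions and patterns are `frozenset`s, every pattern is supported by at least one transaction, distribution vectors are taken over all items in the dataset, and the merge criterion is the symmetric sum of the two KL directions. All function names are hypothetical.

```python
from itertools import combinations
from math import log


def freq_vector(transactions, patterns, items):
    """Item frequencies in D', the transactions containing any of the patterns."""
    d_prime = [t for t in transactions if any(a <= t for a in patterns)]
    return [sum(i in t for t in d_prime) / len(d_prime) for i in items]


def kl(p, q, eps=1e-6):
    """KL divergence between Bernoulli vectors, clipped to avoid log(0)."""
    total = 0.0
    for a, b in zip(p, q):
        a, b = min(max(a, eps), 1 - eps), min(max(b, eps), 1 - eps)
        total += a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))
    return total


def agglomerate(transactions, patterns, k):
    """Merge the two closest clusters repeatedly until only k remain."""
    items = sorted(set().union(*transactions))
    clusters = [[a] for a in patterns]  # start with one cluster per pattern
    while len(clusters) > k:
        vecs = [freq_vector(transactions, c, items) for c in clusters]
        # Pair with the smallest symmetric KL divergence (assumed criterion).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: kl(vecs[ij[0]], vecs[ij[1]]) + kl(vecs[ij[1]], vecs[ij[0]]),
        )
        merged = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is still valid
        clusters[i] = merged
    return clusters
```

Recomputing every vector on each round keeps the sketch short; a practical version would cache divergences and update only the merged cluster.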

9 K-means Clustering
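The slide body is likewise missing here, so this is a hedged sketch of a k-means style alternative rather than the presenters' algorithm. Assumptions: initial seed patterns are supplied by the caller, each pattern is assigned to the profile with the smallest KL divergence from its own single-pattern vector, and an emptied cluster keeps its previous vector. All names are hypothetical.

```python
from math import log


def freq_vector(transactions, patterns, items):
    """Item frequencies in D', the transactions containing any of the patterns."""
    d_prime = [t for t in transactions if any(a <= t for a in patterns)]
    return [sum(i in t for t in d_prime) / len(d_prime) for i in items]


def kl(p, q, eps=1e-6):
    """KL divergence between Bernoulli vectors, clipped to avoid log(0)."""
    total = 0.0
    for a, b in zip(p, q):
        a, b = min(max(a, eps), 1 - eps), min(max(b, eps), 1 - eps)
        total += a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))
    return total


def kmeans_profiles(transactions, patterns, seeds, rounds=10):
    """Alternate between assigning patterns to profiles and refitting profiles."""
    items = sorted(set().union(*transactions))
    # Single-pattern vector for every pattern.
    own = {a: freq_vector(transactions, [a], items) for a in patterns}
    centers = [own[s] for s in seeds]
    groups = []
    for _ in range(rounds):
        new_groups = [[] for _ in centers]
        for a in patterns:
            # Assign to the profile closest in KL divergence.
            best = min(range(len(centers)), key=lambda c: kl(own[a], centers[c]))
            new_groups[best].append(a)
        if new_groups == groups:
            break  # assignments are stable
        groups = new_groups
        # Refit each profile from its members; keep the old vector if empty.
        centers = [
            freq_vector(transactions, g, items) if g else c
            for g, c in zip(groups, centers)
        ]
    return groups
```

As with ordinary k-means, the result depends on the seeds, so in practice one would likely try several initializations.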

10 Optimization Heuristics
Closed Itemsets vs. Frequent Itemsets: given patterns α and β, if α ⊆ β and their supports are equal, then Dα = Dβ. Approximate Profiles: use the following two equations in place of the original profile update, one for Algorithm 1 and one for Algorithm 2.

11 Quality Evaluation
Definition (Restoration Error): T is a testing pattern set. T' is the collection of the itemsets generated by the master patterns in profiles and

12 Quality Evaluation J tests “frequent patterns”, some of which may be estimated as “infrequent”. Jc tests “estimated frequent patterns”, some of which are actually “infrequent”. Therefore J and Jc are complementary to each other.
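As a rough illustration of J, the sketch below assumes that restoration error is the average relative error between the true and the estimated support over a testing set T, with the true support in the denominator. The formula itself is not preserved in the transcript, so this form is an assumption, and the function name and dictionary-based inputs are hypothetical.

```python
def restoration_error(true_support, est_support):
    """Average relative error |s - s_hat| / s over the testing patterns.

    true_support: dict mapping each testing pattern to its real support (> 0).
    est_support:  dict mapping patterns to estimated support; a pattern that
                  no profile can generate is treated as having estimate 0.
    """
    total = 0.0
    for pattern, s in true_support.items():
        s_hat = est_support.get(pattern, 0.0)
        total += abs(s - s_hat) / s
    return total / len(true_support)
```

With true supports `{ab: 0.75, abc: 0.5}` and estimates `{ab: 0.75, abc: 0.4}`, the relative errors are 0 and 0.2, giving J of about 0.1.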

13 Quality Evaluation Lemma
For any frequent itemset π, there must exist a profile Mk such that π ⊆ ψk, where ψk is the master itemset of Mk.

14 Optimal Number of Profiles
How to determine K? M = (p, ψ, ρ). Ex: require for any i such that p ~ q, α ~ β ⇒ Dα ~ Dβ ~ Dα ∪ Dβ. Check the derivative of the quality over K: if J increases suddenly from K* to K* − 1, K* is likely to be a good choice.
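One simple reading of the "sudden increase" heuristic is to pick the K whose removal of one profile causes the largest jump in J. The sketch below assumes J has already been measured for a range of K values; the function name `pick_k` and the use of the single largest jump are illustrative assumptions, since the slide only says "suddenly".

```python
def pick_k(errors):
    """Return K* such that J(K* - step) - J(K*) is the largest jump.

    errors: dict mapping each tested K to its restoration error J.
    """
    ks = sorted(errors)
    best_k, best_jump = None, float("-inf")
    # Compare each K with the next smaller tested K.
    for smaller, larger in zip(ks, ks[1:]):
        jump = errors[smaller] - errors[larger]  # growth in J when K shrinks
        if jump > best_jump:
            best_k, best_jump = larger, jump
    return best_k
```

For instance, with J values `{2: 0.35, 3: 0.30, 4: 0.07, 5: 0.06, 6: 0.05}` the error rises sharply when going from 4 to 3 profiles, so the sketch returns 4.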

15 Optimal Number of Profiles

16 Experiment
Three real datasets and a series of synthetic datasets. Language: Visual C++; CPU: Intel 3.2GHz; Memory: 1GB; OS: Windows XP

17 Mushroom ※688 closed patterns

18 BMS-Webview1 & Replace
BMS-Webview1: ※threshold = 0.1% ※4195 closed patterns ※many small frequent itemsets
Replace: ※threshold = 3% ※4315 closed patterns ※many small frequent itemsets

19 Synthetic Datasets Provided by IBM
7 datasets, each has transactions. Choose top-500. K = 50 and 100

