Published by Shinta Salim. Modified over 5 years ago.
1
Summarizing Itemset Patterns: A Profile-Based Approach
Xifeng Yan, Hong Cheng, Jiawei Han, Dong Xin
ACM KDD '05
Advisor: Jia-Ling Koh
Speaker: Yu-Jiun Liu
2006/01/06
2
Introduction Ⅰ
Closed frequent pattern: a frequent pattern that has no super-pattern with the same support.
Maximal frequent pattern: a frequent pattern that has no frequent super-pattern.
Top-K vs. K representatives
3
Introduction Ⅱ
The format of these representatives.
How to find these representatives?
The measure of their quality.
4
Definition
Bernoulli Distribution Vector
Pattern Profile
5
Equations
The relative frequency of item oi in D'.
Estimated Support
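The two quantities named on this slide can be sketched in Python. This is a minimal sketch, assuming the usual definitions from the paper: p(oi) is the fraction of transactions in D' that contain item oi, the profile support is ρ = |D'|/|D|, and the estimated support of a pattern is ρ times the product of p(oi) over its items. The function names and the toy counts are hypothetical; the counts are made up so that p(a) and p(b) round to 0.91 and 0.96, and they are not the original dataset.

```python
from math import prod

def build_profile(patterns, database):
    """Return a hypothetical (p, master, rho) triple for a group of patterns."""
    # D': transactions that contain at least one pattern of the group
    # (assumed non-empty, which holds when every pattern is frequent)
    d_prime = [t for t in database if any(set(a) <= t for a in patterns)]
    # master pattern: union of all items in the group
    master = frozenset().union(*(set(a) for a in patterns))
    # p(oi): fraction of D' transactions that contain item oi
    p = {i: sum(i in t for t in d_prime) / len(d_prime) for i in master}
    rho = len(d_prime) / len(database)
    return p, master, rho

def estimated_support(pattern, profile):
    """Assumed estimate: rho * product of p(oi) over the items of the pattern."""
    p, _master, rho = profile
    return rho * prod(p[i] for i in pattern)

# Toy data (made-up counts, not taken from the slides)
database = [set("acd")] * 50 + [set("bcd")] * 100 + [set("abcd")] * 1000
profile = build_profile(["acd", "bcd", "abcd"], database)
print(round(profile[0]["a"], 2), round(estimated_support("abcd", profile), 2))
```

The estimate treats items as independent within D', so it is close to the true support only when the merged patterns are supported by nearly the same transactions.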
6
Pattern Profile Example
Both of the above datasets can be summarized by <abcd>, but the quality is better for D1.
p(a) = ( )/( ) = 0.91
Mabc = <[0.91, 0.96, 1], abcd, 0.87>
M = <[0.91, 0.96, 1, 1], abcd, 1>
7
Pattern Summarization
First, construct a special profile for each pattern that contains only that pattern.
Then use the Kullback-Leibler divergence to merge similar patterns.
KL-divergence
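The merge-by-KL idea can be illustrated with a small greedy procedure. This is a sketch under stated assumptions, not the authors' algorithm: it starts with one singleton profile per pattern, then repeatedly merges the pair whose Bernoulli vectors are closest. Using the symmetric sum KL(p||q) + KL(q||p) as the distance, clamping probabilities with `eps` to avoid log(0), and all function names are assumptions made for this illustration.

```python
from itertools import combinations
from math import log

def distribution(patterns, database):
    """Bernoulli vector over the master pattern, learned from D'."""
    d_prime = [t for t in database if any(a <= t for a in patterns)]
    master = frozenset().union(*patterns)
    return {i: sum(i in t for t in d_prime) / len(d_prime) for i in master}

def kl(p, q, eps=1e-6):
    """KL(p || q) between two Bernoulli vectors (dicts item -> P(item = 1))."""
    total = 0.0
    for i in set(p) | set(q):
        # items missing from a vector are treated as probability 0, then clamped
        a = min(max(p.get(i, 0.0), eps), 1 - eps)
        b = min(max(q.get(i, 0.0), eps), 1 - eps)
        total += a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))
    return total

def merge_patterns(patterns, database, k):
    """Greedily merge the closest pair of clusters until k clusters remain."""
    clusters = [[a] for a in patterns]  # one singleton profile per pattern
    while len(clusters) > k:
        dists = [distribution(c, database) for c in clusters]
        def dist(pair):
            i, j = pair
            return kl(dists[i], dists[j]) + kl(dists[j], dists[i])
        i, j = min(combinations(range(len(clusters)), 2), key=dist)
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

# Toy data (hypothetical): two unrelated groups of transactions
fs = frozenset
database = [fs("abc")] * 5 + [fs("ab")] + [fs("xyz")] * 5 + [fs("xy")]
patterns = [fs("ab"), fs("abc"), fs("xy"), fs("xyz")]
clusters = merge_patterns(patterns, database, k=2)
```

Recomputing every profile on each pass keeps the sketch short; it is quadratic per merge and is only meant for small inputs.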
8
Hierarchical Agglomerative Clustering
9
K-means Clustering
10
Optimization Heuristics
Closed Itemsets vs. Frequent Itemsets
Given patterns α and β, if α ⊆ β and their supports are equal, then Dα = Dβ, so the KL-divergence between their profiles is 0.
Approximate Profiles
Use two approximate update equations instead of the original profile updating: one for Algorithm 1 and one for Algorithm 2.
11
Quality Evaluation Definition (Restoration Error)
T is a testing pattern set. T’ is the collection of the itemsets generated by the master patterns in profiles and
12
Quality Evaluation
J tests “frequent patterns”, some of which may be estimated as “infrequent”.
Jc tests “estimated frequent patterns”, some of which are actually “infrequent”.
Therefore J and Jc are complementary to each other.
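The two measures can be sketched as follows. This assumes that the restoration error is the mean relative error between the true support s and the estimated support ŝ, with J normalized by s over the frequent patterns T, and Jc normalized by ŝ over T', the itemsets whose estimated support reaches the threshold. The function names, the `min_sup` parameter, and the numbers are hypothetical.

```python
def restoration_error(test_set, s, s_hat):
    """J: mean of |s - s_hat| / s over the testing patterns T."""
    return sum(abs(s[a] - s_hat[a]) / s[a] for a in test_set) / len(test_set)

def complementary_error(candidates, s, s_hat, min_sup):
    """Jc: mean of |s - s_hat| / s_hat over T' (estimated support >= min_sup)."""
    t_prime = [a for a in candidates if s_hat[a] >= min_sup]
    return sum(abs(s[a] - s_hat[a]) / s_hat[a] for a in t_prime) / len(t_prime)

# Hypothetical supports: "abd" is actually infrequent but estimated as frequent
s = {"ab": 0.5, "abc": 0.4, "abd": 0.1}
s_hat = {"ab": 0.45, "abc": 0.5, "abd": 0.3}
min_sup = 0.25

T = [a for a in s if s[a] >= min_sup]  # truly frequent patterns
J = restoration_error(T, s, s_hat)
Jc = complementary_error(s, s, s_hat, min_sup)
```

In this toy case J never looks at "abd", while Jc is dominated by it, which is the sense in which the two measures complement each other.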
13
Quality Evaluation Lemma
For any frequent itemset π, there must exist a profile Mk such that π ⊆ ψk, where ψk is the master itemset of Mk.
14
Optimal Number of Profiles
How to determine K?
M = (p, ψ, ρ)
Ex: require for any i such that p ~ q, α ~ β, Dα ~ Dβ ~ Dα ∪ Dβ
Check the derivative of the quality over K: if J increases suddenly from K* to K* - 1, K* is likely to be a good choice.
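The "sudden increase" check can be sketched as a search for the largest jump in J between consecutive values of K. Picking the single largest jump is a simplification assumed here; the slide only says that such a K* is likely to be a good choice. The function name and the error values are hypothetical.

```python
def pick_k(errors):
    """errors: dict mapping K -> restoration error J(K).

    Returns the K whose removal of one profile (K -> K - 1) raises J the most.
    """
    ks = sorted(k for k in errors if k - 1 in errors)
    return max(ks, key=lambda k: errors[k - 1] - errors[k])

# Hypothetical error curve: J rises sharply when going from 3 profiles to 2
errors = {1: 0.90, 2: 0.80, 3: 0.30, 4: 0.28, 5: 0.27}
k_star = pick_k(errors)
```

On a real error curve there may be several comparable jumps, so plotting J against K and inspecting the candidates is a reasonable complement to this check.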
15
Optimal Number of Profiles
16
Experiment
Three real datasets and a series of synthetic datasets.
Language: Visual C++
CPU: Intel 3.2GHz
Memory: 1GB
OS: Windows XP
17
Mushroom ※688 closed patterns
18
BMS-Webview1 & Replace
※threshold = 0.1% ※4195 closed patterns ※many small frequent itemsets
※threshold = 3% ※4315 closed patterns ※many small frequent itemsets
19
Synthetic Datasets Provided by IBM
7 datasets, each has transactions. Choose top-500. K = 50 and 100