Efficiently Clustering Transactional Data with Weighted Coverage Density
M. Hua Yan, Keke Chen, and Ling Liu
Proceedings of the 15th International Conference on Information and Knowledge Management (ACM CIKM), 2006
Presenter: 吳建良
2015/11/27
Outline
- Motivation
- SCALE Framework
- BkPlot Method
- WCD Clustering Algorithm
- Cluster Validity Evaluation
- Experimental Results
Motivation
- Transactional data is a special kind of categorical data
  - Example: t1 = {milk, bread, beer}, t2 = {milk, bread}
  - It can be transformed into a row-by-column table of Boolean values
- Large volume and high dimensionality make existing algorithms inefficient at processing the transformed data
- Existing algorithms for clustering transactional data: LargeItem, CLOPE, CCCD
  - They require users to manually tune at least one or two parameters
  - Suitable settings for these parameters differ from dataset to dataset
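The transformation into a Boolean table can be sketched as follows. This is a minimal illustration only; the function name `to_boolean_table` and the choice to sort the item columns alphabetically are assumptions, not something the slides specify.

```python
def to_boolean_table(transactions):
    """Return (items, rows): the column labels and one Boolean row per transaction."""
    # Columns are all distinct items, sorted only to make the layout deterministic.
    items = sorted(set().union(*transactions))
    # A cell is True when the transaction contains the column's item.
    rows = [[item in t for item in items] for t in transactions]
    return items, rows


t1 = {"milk", "bread", "beer"}
t2 = {"milk", "bread"}
items, rows = to_boolean_table([t1, t2])
for row in rows:
    print(dict(zip(items, row)))
```

With many distinct items the table gets one column per item, which is the high dimensionality the slide refers to.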
SCALE Framework
- ACE & BkPlot (SSDBM'05)
  - ACE: Agglomerative Categorical clustering with Entropy criterion
  - BkPlot: examines the entropy difference between clustering structures with varying K
    - Reports the Ks where the clustering structure changes dramatically
- Evaluation metrics
  - LISR: Large Item Size Ratio
  - AMI: Average pair-clusters Merging Index
ACE Algorithm
- Bottom-up process
  - Initially, each record is a cluster
  - Iteratively, find the most similar pair of clusters C_p and C_q, and merge them
- Incremental entropy I_m
  - The most similar pair of clusters is the pair whose I_m is minimum among all possible pairs
  - Each step that forms the K-cluster partition from the (K+1)-cluster partition has an associated I_m value
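One merge step of this bottom-up process might look like the sketch below. The slide text does not preserve the formula for I_m, so the code assumes it is the increase in size-weighted entropy caused by a merge, with a cluster's entropy taken as the sum of its per-column Shannon entropies. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cluster_entropy(records):
    """Sum over columns of the Shannon entropy (base 2) of that column's values."""
    n = len(records)
    total = 0.0
    for column in zip(*records):
        for count in Counter(column).values():
            p = count / n
            total -= p * math.log2(p)
    return total


def incremental_entropy(cp, cq):
    """Assumed I_m: size-weighted entropy after the merge minus before the merge."""
    merged = cp + cq
    return (len(merged) * cluster_entropy(merged)
            - len(cp) * cluster_entropy(cp)
            - len(cq) * cluster_entropy(cq))


def merge_most_similar(clusters):
    """Merge the pair with the smallest I_m; return the new partition and that I_m."""
    p, q = min(combinations(range(len(clusters)), 2),
               key=lambda pq: incremental_entropy(clusters[pq[0]], clusters[pq[1]]))
    i_m = incremental_entropy(clusters[p], clusters[q])
    rest = [c for i, c in enumerate(clusters) if i not in (p, q)]
    return rest + [clusters[p] + clusters[q]], i_m


# Each record starts as its own cluster; records are tuples of categorical values.
start = [[("a", "x")], [("a", "x")], [("b", "y")]]
partition, i_m = merge_most_similar(start)
```

Repeating the step until one cluster remains yields one I_m value per K, which is the sequence BkPlot works from.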
BkPlot
- Increasing rate of entropy I(K) (N: total number of records, d: number of columns)
- Small increasing rate
  - Merging does not introduce any impurity into the clusters
  - The clustering structure is not significantly changed
- Large increasing rate
  - Merging introduces considerable impurity into the partitions
  - The clustering structure can change significantly
BkPlot (contd.)
- Relative changes
  - Use relative changes to determine whether a globally significant clustering structure emerges
  - Example: I(K) ≈ I(K+1), but I(K-1) > I(K)
BkPlot (contd.)
- Entropy Characteristic Graph (ECG)
- Second-order differential of the ECG
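The second-order differential can be illustrated with a small sketch. The exact formula is not preserved in the slide text, so the code assumes the discrete form B(K) = I(K-1) - 2·I(K) + I(K+1), which is large exactly when I(K) ≈ I(K+1) but I(K-1) > I(K). The I(K) values are invented for illustration, and the function names are hypothetical.

```python
def second_order_differences(rate):
    """rate maps K -> I(K); return K -> I(K-1) - 2*I(K) + I(K+1) where both neighbours exist."""
    return {k: rate[k - 1] - 2 * rate[k] + rate[k + 1]
            for k in rate if k - 1 in rate and k + 1 in rate}


def candidate_ks(rate, top=3):
    """Return the Ks with the largest second-order difference, best first."""
    b = second_order_differences(rate)
    return sorted(b, key=b.get, reverse=True)[:top]


# Hypothetical increasing rates, not taken from the paper.
rate = {1: 0.30, 2: 0.12, 3: 0.02, 4: 0.018, 5: 0.015, 6: 0.014}
best = candidate_ks(rate, top=1)
```

Here the curve flattens after K = 3, so K = 3 ranks first among the candidates.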
WCD Clustering Algorithm
- Notations
  - D: transactional dataset
  - N: size of the dataset
  - I = {I_1, I_2, …, I_m}: a set of items
  - t_j = {I_j1, I_j2, …, I_jl}: a transaction
  - A transaction clustering result C^K = {C_1, C_2, …, C_K} is a partition of D, where C_1 ∪ C_2 ∪ … ∪ C_K = D, every C_i is non-empty, and C_i ∩ C_j = ∅ for i ≠ j
Intra-cluster Similarity Measure
- Coverage Density (CD)
  - Given a cluster C_k:
    - M_k: number of distinct items in C_k
    - N_k: number of transactions in C_k
    - S_k: sum of the occurrences of all items in C_k
  - The higher the CD, the more compact the cluster
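A minimal sketch of CD, assuming it is the fraction of filled cells in the N_k × M_k Boolean table of the cluster, that is CD(C_k) = S_k / (N_k × M_k). The formula itself is not preserved in the slide text, and the function name is hypothetical.

```python
from collections import Counter


def coverage_density(cluster):
    """Assumed CD: S_k / (N_k * M_k) for a cluster given as a list of item sets."""
    occur = Counter(item for t in cluster for item in t)
    s_k = sum(occur.values())   # total item occurrences (filled cells)
    n_k = len(cluster)          # number of transactions (rows)
    m_k = len(occur)            # number of distinct items (columns)
    return s_k / (n_k * m_k)


cluster = [{"a", "b", "c"}, {"a", "b"}, {"a"}]
cd = coverage_density(cluster)  # 6 filled cells in a 3 x 3 table
```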
Intra-cluster Similarity Measure (contd.)
- Drawback of CD
  - Insufficient to measure the density of frequent itemsets
  - Each item makes an equal contribution within a cluster
  - Two clusters may have the same CD but different filled-cell distributions
[Figure: two example clusters over items a, b, c]
Intra-cluster Similarity Measure (contd.)
- Weighted Coverage Density (WCD)
  - Focuses on high-frequency items
  - Defines a weight W_j for each item
[Figure: CD and WCD of two example clusters over items a, b, c]
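The difference between CD and WCD can be shown on two small clusters. The weight formula is missing from the slide text, so this sketch assumes W_j = occur(I_kj) / S_k and WCD(C_k) = Σ_j occur(I_kj) × W_j / N_k, which replaces the uniform weight 1/M_k implied by CD. Function names and the example clusters are hypothetical.

```python
from collections import Counter


def item_occurrences(cluster):
    """Count in how many transactions of the cluster each item occurs."""
    return Counter(item for t in cluster for item in t)


def coverage_density(cluster):
    occur = item_occurrences(cluster)
    return sum(occur.values()) / (len(cluster) * len(occur))


def weighted_coverage_density(cluster):
    """Assumed WCD: sum of occur^2 / (S_k * N_k), i.e. weights W_j = occur / S_k."""
    occur = item_occurrences(cluster)
    s_k = sum(occur.values())
    return sum(c * c for c in occur.values()) / (s_k * len(cluster))


skewed = [{"a", "b", "c"}, {"a", "b"}, {"a"}]        # occurrences 3, 2, 1
uniform = [{"a", "b"}, {"b", "c"}, {"a", "c"}]       # occurrences 2, 2, 2

print(coverage_density(skewed), coverage_density(uniform))
print(weighted_coverage_density(skewed), weighted_coverage_density(uniform))
```

Both clusters fill 6 of 9 cells, so their CD is equal, while the cluster with a dominant item gets the higher WCD under this assumed definition.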
Clustering Criterion
- Expected Weighted Coverage Density (EWCD)
  - The clustering algorithm tries to maximize the EWCD
  - When every individual transaction is treated as its own cluster, the EWCD reaches its maximum, EWCD = 1, so the number of clusters K has to be fixed before maximizing
- Use the BkPlot method to generate a set of candidate "best Ks"
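The claim that singleton clusters reach EWCD = 1 can be checked with a short sketch. It assumes EWCD is the size-weighted average of the per-cluster WCD, EWCD = Σ_k (N_k / N) × WCD(C_k), with WCD(C_k) = Σ_j occur(I_kj)² / (S_k × N_k). Both formulas are reconstructions, not taken verbatim from the slides, and the function names are hypothetical.

```python
from collections import Counter


def weighted_coverage_density(cluster):
    """Assumed WCD: sum of squared item occurrences / (S_k * N_k)."""
    occur = Counter(item for t in cluster for item in t)
    s_k = sum(occur.values())
    return sum(c * c for c in occur.values()) / (s_k * len(cluster))


def expected_wcd(clusters):
    """Assumed EWCD: average of WCD over clusters, weighted by cluster size."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * weighted_coverage_density(c) for c in clusters)


singletons = [[{"a", "b"}], [{"c"}]]    # every transaction is its own cluster
merged = [[{"a", "b"}, {"c"}]]          # the same transactions in one cluster

print(expected_wcd(singletons), expected_wcd(merged))
```

Under these assumptions a one-transaction cluster always has WCD = 1, which is why an unconstrained maximization would degenerate to N clusters.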
WCD Clustering Algorithm
Input: dataset D, number of clusters K, initial K seeds
Output: K clusters

/* Phase 1 - Initialization */
the K seeds form the initial K clusters;
while not end of D do
    read one transaction t from D;
    add t into the cluster C_i that maximizes EWCD;
    write <t, i> back to D;

/* Phase 2 - Iteration */
moveMark = true;
while moveMark = true do
    moveMark = false;
    randomly generate the access sequence R;
    while not all transactions have been checked do
        read <t, i> in the order given by R;
        if moving t to a cluster C_j with j ≠ i increases EWCD then
            move t to C_j;
            moveMark = true;
            write <t, j> back to D;
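The two phases can be sketched in memory as follows. This is an illustration under assumptions, not the authors' implementation: it keeps all transactions in a list instead of reading and writing D, recomputes EWCD from scratch for every trial move (simple but slow), and uses the assumed definitions WCD(C_k) = Σ_j occur(I_kj)² / (S_k × N_k) and EWCD = Σ_k (N_k / N) × WCD(C_k). The names `wcd_clustering`, `seed_indices`, and `max_passes` are hypothetical.

```python
import random
from collections import Counter


def wcd(cluster):
    """Assumed WCD; an empty cluster contributes 0."""
    if not cluster:
        return 0.0
    occur = Counter(item for t in cluster for item in t)
    return sum(c * c for c in occur.values()) / (sum(occur.values()) * len(cluster))


def wcd_clustering(transactions, seed_indices, rng=None, max_passes=100):
    rng = rng or random.Random(0)
    k = len(seed_indices)
    labels = [None] * len(transactions)
    for cluster_id, idx in enumerate(seed_indices):
        labels[idx] = cluster_id            # the K seeds form the initial clusters

    def ewcd(lbls):
        clusters = [[] for _ in range(k)]
        for t, lab in zip(transactions, lbls):
            if lab is not None:
                clusters[lab].append(t)
        n = sum(len(c) for c in clusters)
        return sum(len(c) * wcd(c) for c in clusters) / n

    def ewcd_if(idx, lab):
        trial = labels.copy()
        trial[idx] = lab
        return ewcd(trial)

    # Phase 1: assign each remaining transaction to the cluster maximizing EWCD.
    for idx in range(len(transactions)):
        if labels[idx] is None:
            labels[idx] = max(range(k), key=lambda lab: ewcd_if(idx, lab))

    # Phase 2: move transactions while some move strictly increases EWCD.
    for _ in range(max_passes):
        moved = False
        order = list(range(len(transactions)))
        rng.shuffle(order)                  # random access sequence R
        for idx in order:
            best = max(range(k), key=lambda lab: ewcd_if(idx, lab))
            if best != labels[idx] and ewcd_if(idx, best) > ewcd(labels):
                labels[idx] = best
                moved = True
        if not moved:
            break
    return labels


data = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "y", "z"}, {"a", "b"}, {"x", "y"}]
labels = wcd_clustering(data, seed_indices=[0, 2])
```

Because a move is accepted only when EWCD strictly increases, the loop cannot cycle; `max_passes` is an extra safety bound added for the sketch. An incremental update of the per-cluster counts would avoid the full recomputation.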
Cluster Validity Evaluation
- LISR (Large Item Size Ratio)
  - Measures how well frequent itemsets are preserved
  - LS_k is the number of large items in C_k
  - High co-occurrence of items implies a high possibility of finding more frequent itemsets at the user-specified minimum support
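The per-cluster quantities behind LISR can be sketched as below. The sketch assumes an item is large in C_k when the fraction of transactions in C_k containing it reaches the minimum support. How LS_k is aggregated into the final LISR value is not preserved in the slide text, so the code stops at the per-cluster counts. The function name is hypothetical.

```python
from collections import Counter


def large_item_counts(cluster, min_support):
    """Return (LS_k, M_k): number of large items and number of distinct items."""
    occur = Counter(item for t in cluster for item in t)
    threshold = min_support * len(cluster)
    # Assumption: an item is "large" if it occurs in at least min_support of the transactions.
    ls_k = sum(1 for count in occur.values() if count >= threshold)
    return ls_k, len(occur)


cluster = [{"a", "b", "c"}, {"a", "b"}, {"a"}]
ls_k, m_k = large_item_counts(cluster, min_support=0.6)
```

In this example items a and b reach the 60% support threshold and c does not.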
Cluster Validity Evaluation (contd.)
- Inter-cluster dissimilarity d(C_i, C_j) between C_i and C_j
  - M_ij is the number of distinct items after merging the two clusters, thus M_ij ≥ max{M_i, M_j}
  - d(C_i, C_j) is a real number between 0 and 1
Cluster Validity Evaluation (contd.)
- Example
  - If M_i = M_j = M_ij, then d(C_i, C_j) = 0
    [Figure: C_i and C_j both over items a, b, c]
  - M_i = M_j = 3, M_ij = 5
    [Figure: C_i over items a, b, c; C_j over items c, d, e]
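The two example cases can be reproduced with a sketch. The formula for d(C_i, C_j) is missing from the slide text, so the code assumes it is the size-weighted coverage density of the two clusters minus the coverage density of their union, with CD(C_k) = S_k / (N_k × M_k). This assumption agrees with the stated properties (d = 0 when M_i = M_j = M_ij, and 0 ≤ d ≤ 1). The function names and the single-transaction clusters are hypothetical.

```python
from collections import Counter


def coverage_density(cluster):
    """Assumed CD: S_k / (N_k * M_k)."""
    occur = Counter(item for t in cluster for item in t)
    return sum(occur.values()) / (len(cluster) * len(occur))


def dissimilarity(ci, cj):
    """Assumed d: weighted CD of the two clusters minus CD of the merged cluster."""
    n_i, n_j = len(ci), len(cj)
    weighted = (n_i * coverage_density(ci) + n_j * coverage_density(cj)) / (n_i + n_j)
    return weighted - coverage_density(ci + cj)


same_items = dissimilarity([{"a", "b", "c"}], [{"a", "b", "c"}])   # M_i = M_j = M_ij = 3
overlap_one = dissimilarity([{"a", "b", "c"}], [{"c", "d", "e"}])  # M_i = M_j = 3, M_ij = 5
print(same_items, overlap_one)
```

Merging clusters that share all their items leaves the table as dense as before, while merging clusters that share only item c adds empty cells and yields a positive dissimilarity.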
Cluster Validity Evaluation (contd.)
- AMI (Average pair-clusters Merging Index)
  - Evaluates the overall inter-cluster dissimilarity of a clustering result with K clusters
  - The larger the AMI, the better the clustering quality
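A possible way to compute AMI from pairwise dissimilarities is sketched below. The AMI formula is not preserved in the slide text; the code assumes that each cluster contributes its smallest dissimilarity to any other cluster and that AMI is the average of these K values. The function name and the matrix values are hypothetical.

```python
def average_merging_index(d):
    """Assumed AMI: mean over clusters of the minimum dissimilarity to another cluster.

    d is a symmetric K x K matrix of pairwise dissimilarities with K >= 2.
    """
    k = len(d)
    nearest = [min(d[i][j] for j in range(k) if j != i) for i in range(k)]
    return sum(nearest) / k


# Hypothetical pairwise dissimilarities for K = 3 clusters.
d = [
    [0.0, 0.4, 0.2],
    [0.4, 0.0, 0.3],
    [0.2, 0.3, 0.0],
]
ami = average_merging_index(d)
```

Using the nearest neighbour of each cluster makes the index sensitive to any pair of clusters that could be merged cheaply, which would indicate that K is too large.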
Experiments
- Datasets
  - Tc30a6r: 30 columns, 6 possible attribute values
  - Zoo: 101 records, 18 attributes
  - Mushroom: 8124 instances, 22 attributes
  - Mushroom100k: the Mushroom data sampled with duplicates, 100,000 instances
  - TxI4Dx: generated with the IBM Data Generator
Experimental Results
- Tc30a6r
  - The repulsion parameter r of CLOPE controls the number of clusters
[Figure panels: 5 clusters; 9 clusters]
Experimental Results (contd.)
- Zoo: K = 7 is the best
[Figure panels: 2 clusters; 4 clusters; 7 clusters]
Experimental Results (contd.)
- Mushroom: K = 19 is the best
Experimental Results (contd.)
- Performance evaluation on Mushroom100k
[Figure panels: r = 0.5~4.0; r = 2.0]
Experimental Results (contd.)
- Performance evaluation on TxI4Dx
[Figure panels: T10I4Dx; TxI4D100k]