2015/11/27
Efficiently Clustering Transactional Data with Weighted Coverage Density
M. Hua Yan, Keke Chen, and Ling Liu
Proceedings of the 15th International Conference on Information and Knowledge Management (ACM CIKM), 2006
Presenter: 吳建良

Outline
- Motivation
- SCALE Framework
- BkPlot Method
- WCD Clustering Algorithm
- Cluster Validity Evaluation
- Experimental Results

Motivation
- Transactional data is a special kind of categorical data, e.g. t1 = {milk, bread, beer}, t2 = {milk, bread}
- It can be transformed into a row-by-column table with Boolean values, but the large volume and high dimensionality make existing algorithms inefficient on the transformed data
- Existing transactional clustering algorithms (LargeItem, CLOPE, CCCD) require users to manually tune at least one or two parameters, and the appropriate settings differ from dataset to dataset

SCALE Framework
- ACE & BkPlot (SSDBM'05)
  - ACE: Agglomerative Categorical clustering with Entropy criterion
  - BkPlot: examines the entropy difference between clustering structures as K varies, and reports the Ks where the clustering structure changes dramatically
- Evaluation metrics
  - LISR: Large Item Size Ratio
  - AMI: Average pair-clusters Merging Index

ACE Algorithm
- Bottom-up process: initially, each record is a cluster; iteratively, find the most similar pair of clusters C_p and C_q, and then merge them
- Incremental entropy (with Ĥ the expected entropy and n_p, n_q the cluster sizes):
  I_m(C_p, C_q) = (n_p + n_q) · Ĥ(C_p ∪ C_q) − n_p · Ĥ(C_p) − n_q · Ĥ(C_q)
- The most similar pair of clusters is the one whose I_m is minimum among all possible pairs
- I(K) denotes the I_m value incurred in forming the K-cluster partition from the (K+1)-cluster partition
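The merge criterion can be sketched as follows. This is a minimal illustration, assuming Ĥ(C) is the sum of per-attribute entropies of a cluster (the paper's exact normalization over N records and d columns may differ); the function names are hypothetical.

```python
from collections import Counter
from math import log2

def cluster_entropy(records):
    """Entropy of a cluster of categorical records (tuples of attribute
    values), summed over attributes."""
    n, d = len(records), len(records[0])
    h = 0.0
    for j in range(d):
        for c in Counter(r[j] for r in records).values():
            h -= (c / n) * log2(c / n)
    return h

def incremental_entropy(cp, cq):
    """I_m(Cp, Cq) = (n_p + n_q)*H(Cp u Cq) - n_p*H(Cp) - n_q*H(Cq):
    the entropy increase caused by merging clusters Cp and Cq."""
    merged = cp + cq
    return (len(merged) * cluster_entropy(merged)
            - len(cp) * cluster_entropy(cp)
            - len(cq) * cluster_entropy(cq))

# ACE merges the pair with the minimum I_m:
a = [("mammal", "hair"), ("mammal", "hair")]
b = [("bird", "feathers"), ("bird", "feathers")]
print(incremental_entropy(a, a))      # identical clusters: 0.0
print(incremental_entropy(a, b))      # dissimilar clusters: 8.0
```

Merging two pure, identical clusters adds no impurity (I_m = 0), while merging dissimilar clusters is penalized, which is why the minimum-I_m pair is "most similar".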

BkPlot
- Increasing rate of entropy: I(K) = Ĥ(C_K) − Ĥ(C_{K+1}), where the expected entropy is normalized over the N total records and d columns
- Small increasing rate: merging does not introduce much impurity to the clusters; the clustering structure is not significantly changed
- Large increasing rate: merging introduces considerable impurity into the partitions; the clustering structure can change significantly

BkPlot (contd.)
- Relative changes: use the relative change of I(K) to determine whether a globally significant clustering structure emerges at K
- A significant K satisfies I(K) ≈ I(K+1), but I(K−1) > I(K)

BkPlot (contd.)
- Entropy Characteristic Graph (ECG): plots the expected entropy against K
- BkPlot plots the second-order differential of the ECG, δ²Ĥ(K) = I(K−1) − I(K); its peaks mark the candidate best Ks
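The candidate-K selection can be sketched as follows, assuming I(K) = Ĥ(K) − Ĥ(K+1) and the second-order difference δ²Ĥ(K) = I(K−1) − I(K) as above; the function name and toy entropy curve are illustrative.

```python
def bkplot_candidates(entropy):
    """entropy: dict mapping K -> expected entropy of the K-cluster
    partition.  Returns Ks sorted by decreasing second-order difference,
    so the first entries are the candidate best Ks."""
    # I(K): entropy increase when merging the (K+1)-partition into K clusters
    i_rate = {k: entropy[k] - entropy[k + 1]
              for k in entropy if k + 1 in entropy}
    # Second-order difference: a peak flags a sharp structure change at K
    d2 = {k: i_rate[k - 1] - i_rate[k] for k in i_rate if k - 1 in i_rate}
    return sorted(d2, key=d2.get, reverse=True)

# Toy entropy curve with a sharp knee at K = 3:
H = {1: 10.0, 2: 6.0, 3: 3.0, 4: 2.8, 5: 2.7}
print(bkplot_candidates(H)[0])   # 3
```

The knee at K = 3 (I(2) = 3.0 is large, I(3) = 0.2 is small) produces the peak δ²Ĥ(3) = 2.8, matching the I(K−1) > I(K) ≈ I(K+1) condition on the previous slide.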

WCD Clustering Algorithm
Notations:
- D: transactional dataset; N: size of the dataset
- I = {I_1, I_2, …, I_m}: the set of items
- t_j = {I_j1, I_j2, …, I_jl}: a transaction
- A transaction clustering result C^K = {C_1, C_2, …, C_K} is a partition of D, where C_1 ∪ … ∪ C_K = D and C_p ∩ C_q = ∅ for p ≠ q

Intra-cluster Similarity Measure
Coverage Density (CD): given a cluster C_k, with
- M_k: number of distinct items in the item set of C_k
- N_k: number of transactions in C_k
- S_k: sum of the occurrences of all items in C_k
CD(C_k) = S_k / (N_k × M_k); the higher the CD, the more compact the cluster
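CD is the fill ratio of the N_k × M_k transaction-by-item grid; a minimal sketch, using the motivation slide's transactions:

```python
def coverage_density(cluster):
    """cluster: list of transactions (sets of items).
    CD = S_k / (N_k * M_k): the filled-cell ratio of the grid whose rows
    are transactions and whose columns are the cluster's distinct items."""
    n_k = len(cluster)                       # N_k: number of transactions
    m_k = len(set().union(*cluster))         # M_k: distinct items
    s_k = sum(len(t) for t in cluster)       # S_k: total item occurrences
    return s_k / (n_k * m_k)

c = [{"milk", "bread", "beer"}, {"milk", "bread"}]
print(coverage_density(c))   # 5 filled cells in a 2 x 3 grid: 0.8333...
```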

Intra-cluster Similarity Measure (contd.)
Drawback of CD:
- Insufficient to measure the density of frequent itemsets, because each item contributes equally within a cluster
- Two clusters may therefore have the same CD but different filled-cell distributions (the slide's figure shows two such grids over items a, b, c)

Intra-cluster Similarity Measure (contd.)
Weighted Coverage Density (WCD): focus on high-frequency items by weighting each item I_j with W_j = Occur(I_j) / S_k, its share of all item occurrences in C_k
(The slide's figure contrasts two clusters over items a, b, c with equal CD but different WCD.)
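A sketch of the weighted measure, assuming W_j = occ_j / S_k as above, which makes WCD(C_k) = Σ_j occ_j² / (S_k · N_k), i.e. CD with the uniform item weight 1/M_k replaced by W_j; the example clusters are illustrative, not from the slide.

```python
from collections import Counter

def weighted_coverage_density(cluster):
    """WCD = sum_j occ_j^2 / (S_k * N_k): clusters whose occurrences
    concentrate on a few high-frequency items score higher than clusters
    with the same CD but occurrences spread evenly."""
    occ = Counter(item for t in cluster for item in t)
    s_k = sum(occ.values())
    return sum(c * c for c in occ.values()) / (s_k * len(cluster))

# Same CD (5 filled cells in a 3-item x 3-transaction grid),
# but b concentrates its occurrences on item "a":
a = [{"a", "b"}, {"b", "c"}, {"a"}]          # occurrences: a=2, b=2, c=1
b = [{"a", "b"}, {"a", "c"}, {"a"}]          # occurrences: a=3, b=1, c=1
print(weighted_coverage_density(a))          # 9/15  = 0.6
print(weighted_coverage_density(b))          # 11/15 = 0.7333...
```

With this definition a single-transaction cluster always has WCD = 1, which is consistent with the next slide's remark that EWCD is maximized when every transaction is its own cluster.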

Clustering Criterion
- Expected Weighted Coverage Density (EWCD): the size-weighted sum of per-cluster WCD, EWCD = Σ_k (N_k / N) · WCD(C_k); the clustering algorithm tries to maximize EWCD
- When every individual transaction is its own cluster, EWCD reaches its maximum of 1, so K cannot be chosen by EWCD alone; the BkPlot method generates the set of candidate "best Ks"

WCD Clustering Algorithm
Input: dataset D, number of clusters K, initial K seeds
Output: K clusters
/* Phase 1 – Initialization */
the K seeds form the initial K clusters;
while not end of D do
    read one transaction t from D;
    add t into the cluster C_i that maximizes EWCD;
    write back to D;
/* Phase 2 – Iteration */
moveMark = true;
while moveMark = true do
    moveMark = false;
    randomly generate the access sequence R;
    while not all transactions checked do
        read the next transaction t (currently in cluster C_i) following R;
        if moving t to cluster C_j increases EWCD and i ≠ j then
            moveMark = true;
        write back to D;
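The two phases can be turned into a small runnable sketch, assuming the WCD/EWCD formulas above. It is deliberately naive: EWCD is fully recomputed for every candidate placement (the paper maintains per-cluster summaries for fast delta updates), and the initial seeds are omitted, so clusters simply start empty.

```python
import random
from collections import Counter

def wcd(cluster):
    """WCD(C_k) = sum_j occ_j^2 / (S_k * N_k); 0 for an empty cluster."""
    if not cluster:
        return 0.0
    occ = Counter(i for t in cluster for i in t)
    s_k = sum(occ.values())
    return sum(c * c for c in occ.values()) / (s_k * len(cluster))

def ewcd(clusters, n):
    """EWCD = sum_k (N_k / n) * WCD(C_k)."""
    return sum(len(c) / n * wcd(c) for c in clusters)

def wcd_cluster(data, k, seed=0):
    """Phase 1: greedily place each transaction in the cluster that
    maximizes EWCD.  Phase 2: keep moving transactions while EWCD improves."""
    random.seed(seed)
    n = len(data)
    clusters = [[] for _ in range(k)]
    member = [0] * n

    def best_cluster(t):
        # index of the cluster whose gain of t maximizes overall EWCD
        return max(range(k),
                   key=lambda j: ewcd(clusters[:j] + [clusters[j] + [t]]
                                      + clusters[j + 1:], n))

    for idx, t in enumerate(data):                 # Phase 1
        member[idx] = best_cluster(t)
        clusters[member[idx]].append(t)
    moved = True
    while moved:                                   # Phase 2
        moved = False
        for idx in random.sample(range(n), n):     # random access sequence R
            i, t = member[idx], data[idx]
            clusters[i].remove(t)
            j = best_cluster(t)
            clusters[j].append(t)
            if j != i:
                member[idx], moved = j, True
    return clusters

data = [{"milk", "bread"}, {"milk", "bread", "beer"},
        {"apple", "pear"}, {"apple", "pear", "plum"}]
for c in wcd_cluster(data, 2):
    print(sorted(sorted(t) for t in c))
```

On this toy dataset the two milk/bread transactions end up in one cluster and the two apple/pear transactions in the other.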

Cluster Validity Evaluation
LISR (Large Item Size Ratio)
- Measures the preservation of frequent itemsets, where LS_k is the number of large items in C_k
- High concurrence of items implies a high possibility of finding more frequent itemsets at the user-specified minimum support
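The slide does not give the exact formula, so the sketch below encodes one plausible reading, loudly labeled as an assumption: LISR as the size-weighted fraction of item occurrences that fall on "large" items, i.e. items whose in-cluster support reaches the minimum support.

```python
from collections import Counter

def lisr(clusters, min_support):
    """Hypothetical LISR: size-weighted share of item occurrences that
    belong to large items (in-cluster support >= min_support).  This is an
    assumed reading of the metric, not the paper's verbatim formula."""
    n = sum(len(c) for c in clusters)
    ratio = 0.0
    for c in clusters:
        occ = Counter(i for t in c for i in t)
        s_k = sum(occ.values())
        ls = sum(v for v in occ.values() if v / len(c) >= min_support)
        ratio += (len(c) / n) * (ls / s_k)
    return ratio

# A clustering that keeps co-occurring items together preserves large items:
good = [[{"a", "b"}, {"a", "b"}], [{"c", "d"}, {"c", "d"}]]
poor = [[{"a", "b"}, {"c", "d"}], [{"a", "b"}, {"c", "d"}]]
print(lisr(good, 0.9))   # 1.0
print(lisr(poor, 0.9))   # 0.0
```

Either way, the intent matches the slide: the higher the LISR, the better the clusters preserve frequent itemsets at the given minimum support.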

Cluster Validity Evaluation (contd.)
Inter-cluster dissimilarity between C_i and C_j
- Simplified to d(C_i, C_j) = (2·M_ij − M_i − M_j) / M_ij, where M_ij is the number of distinct items after merging the two clusters; thus M_ij ≥ max{M_i, M_j}
- Because max{M_i, M_j} ≤ M_ij ≤ M_i + M_j, d(C_i, C_j) is a real number between 0 and 1

Cluster Validity Evaluation (contd.)
Example
- If M_i = M_j = M_ij (C_i and C_j cover the same items, e.g. both {a, b, c}), then d(C_i, C_j) = 0
- If C_i covers {a, b, c} and C_j covers {c, d, e}, then M_i = M_j = 3 and M_ij = 5, giving d(C_i, C_j) = 4/5
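A sketch of the dissimilarity, using a formula chosen to match the slide's constraints and example (d = 0 for identical item sets, and the {a, b, c} vs {c, d, e} case); the function name is illustrative.

```python
def dissimilarity(ci, cj):
    """d(Ci, Cj) = (2*Mij - Mi - Mj) / Mij, equivalently
    1 - |items(Ci) & items(Cj)| / Mij: 0 when the item sets coincide,
    1 when they are disjoint."""
    items_i = set().union(*ci)
    items_j = set().union(*cj)
    m_ij = len(items_i | items_j)
    return (2 * m_ij - len(items_i) - len(items_j)) / m_ij

# The slide's example: Mi = Mj = 3, Mij = 5 (one shared item)
ci = [{"a", "b", "c"}]
cj = [{"c", "d", "e"}]
print(dissimilarity(ci, cj))   # (10 - 6) / 5 = 0.8
print(dissimilarity(ci, ci))   # identical item sets: 0.0
```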

Cluster Validity Evaluation (contd.)
AMI (Average pair-clusters Merging Index)
- Evaluates the overall inter-cluster dissimilarity of a clustering result with K clusters by averaging d(C_i, C_j) over all cluster pairs
- The higher the AMI, the better the clustering quality
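A self-contained sketch of AMI as the average pairwise dissimilarity, reusing the same assumed d(C_i, C_j) formula from the previous slide; names and the toy clustering are illustrative.

```python
from itertools import combinations

def item_set(cluster):
    """Distinct items appearing in a cluster of transactions."""
    return set().union(*cluster)

def dissim(ci, cj):
    # d(Ci, Cj) = (2*Mij - Mi - Mj) / Mij
    m_ij = len(item_set(ci) | item_set(cj))
    return (2 * m_ij - len(item_set(ci)) - len(item_set(cj))) / m_ij

def ami(clusters):
    """Average d over all cluster pairs; larger means better separation."""
    pairs = list(combinations(clusters, 2))
    return sum(dissim(ci, cj) for ci, cj in pairs) / len(pairs)

# Three clusters with pairwise-disjoint item sets are maximally separated:
disjoint = [[{"a", "b"}], [{"c", "d"}], [{"e", "f"}]]
print(ami(disjoint))   # 1.0
```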

Experiments
Datasets
- Tc30a6r: synthetic data with 30 columns and 6 possible attribute values per column
- Zoo: 101 records, 18 attributes
- Mushroom: 8124 instances, 22 attributes
- Mushroom100k: the mushroom data sampled with duplicates to 100,000 instances
- TxI4Dx: generated with the IBM Data Generator

Experimental Results
Tc30a6r: the repulsion parameter r of CLOPE controls the number of clusters (figures show the 5-cluster and 9-cluster results)

Experimental Results (contd.)
Zoo: K = 7 is the best (figures show the 2-, 4-, and 7-cluster results)

Experimental Results (contd.)
Mushroom: K = 19 is the best

Experimental Results (contd.)
Performance evaluation on Mushroom100k (figures vary r from 0.5 to 4.0, with r = 2.0 shown separately)

Experimental Results (contd.)
Performance evaluation on TxI4Dx (figures: T10I4Dx and TxI4D100k)