COOLCAT: An Entropy-Based Algorithm for Categorical Clustering

Presentation transcript:

COOLCAT: An Entropy-Based Algorithm for Categorical Clustering Daniel Barbará, Julia Couto, Yi Li ACM CIKM, 2002, pp. 582-589 Advisor: 郭煌政 Student: 楊金龍

Outline Introduction Background and problem formulation Related work Algorithm Experiment Conclusions

Introduction Clustering of categorical attributes is a difficult, yet important task. COOLCAT is a method that uses the notion of entropy to group records. It is an incremental algorithm that aims to minimize the expected entropy of the clusters.

Background and Problem Formulation Entropy and clustering (1) Entropy function of a variable X: E(X) = - Σ_{x ∈ S(X)} p(x) log p(x), where X is a random variable, S(X) is the set of values that X can take, and p(x) is the probability function of X.
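
A minimal sketch of this entropy computation in Python; the function name and the use of the natural logarithm are choices of this sketch, not fixed by the slide:

```python
import math
from collections import Counter

def entropy(values):
    """Empirical entropy of a categorical variable from a list of observed values.

    E(X) = -sum over x in S(X) of p(x) * log p(x),
    where p(x) is the relative frequency of value x in the list.
    """
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# A two-valued attribute with a 50/50 split has entropy log 2 ~= 0.693.
print(entropy(["red", "blue", "red", "blue"]))
```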

Background and Problem Formulation Entropy and clustering (2) Entropy of a multivariate vector x̂ = {X1, …, Xn} (Equation 2): E(x̂) = - Σ_{x1 ∈ S(X1)} … Σ_{xn ∈ S(Xn)} p(x1, …, xn) log p(x1, …, xn)

Background and Problem Formulation Problem formulation (1) 1. A data set D of N points p1, …, pN. 2. Each point is a multidimensional vector of d categorical attributes, i.e., pi = (pi1, …, pid). 3. Separate the points into k groups C1, …, Ck. 4. Minimizing the entropy of such a partition is NP-Complete. 5. The problem is also NP-Complete for any distance function d(x, y).

Background and Problem Formulation Problem formulation (2) 6. Expected entropy (Equation 3): E̅(Č) = Σ_{i=1,…,k} (|Ci| / |D|) E(Ci). E(C1), …, E(Ck) represent the entropies of each cluster; Ci denotes the points assigned to cluster i, with Ci ∩ Cj = Ø for all i, j = 1, …, k, i ≠ j. The symbol Č = {C1, …, Ck} represents the clustering.
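
A minimal sketch of the expected-entropy computation of Equation 3, representing records as tuples of categorical values and using the per-attribute decomposition described on the next slide; the helper names are illustrative, not the paper's code:

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Entropy of one categorical attribute within a cluster."""
    counts, n = Counter(values), len(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def cluster_entropy(cluster):
    """E(C_i): sum of per-attribute entropies (attribute independence assumed)."""
    return sum(attribute_entropy(column) for column in zip(*cluster))

def expected_entropy(clustering, total_points):
    """Equation 3: weighted sum of the cluster entropies."""
    return sum(len(c) / total_points * cluster_entropy(c) for c in clustering)

# Two clusters of 2-attribute records: a pure cluster and a half-mixed one.
C1 = [("a", "x"), ("a", "x")]
C2 = [("b", "y"), ("c", "y")]
print(expected_entropy([C1, C2], 4))  # 0.5 * 0 + 0.5 * log 2 ~= 0.347
```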

Background and Problem Formulation Problem formulation (3) 7. Equation 3 allows us to implement an incremental algorithm. 8. Assuming independence between attributes, Equation 2 can be rewritten as Equations 4 and 5, so that the entropy of a cluster can be calculated as the sum of the entropies of its attributes.

Background and Problem Formulation Problem formulation (4)
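
The equations this slide presented do not survive in the transcript; below is a reconstruction of the per-attribute decomposition described above, assuming independence between the attributes (the exact form and numbering in the paper may differ slightly):

```latex
% Under attribute independence, the multivariate entropy of Equation 2 factorizes
% into a sum of per-attribute entropies:
E(\hat{x}) \;=\; \sum_{i=1}^{d} E(X_i)
           \;=\; -\sum_{i=1}^{d} \sum_{v \in S(X_i)} p(X_i = v)\,\log p(X_i = v)

% Applied to a cluster C_k, with probabilities estimated from the points in C_k:
E(C_k) \;=\; -\sum_{i=1}^{d} \sum_{v \in S(X_i)} p(X_i = v \mid C_k)\,\log p(X_i = v \mid C_k)
```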

Background and Problem Formulation Evaluating clustering results (1) Different clustering algorithms produce different solutions, which makes evaluating the solutions difficult. Two widely used methods of evaluating clustering results: the Significance Test on External Variables and the Category Utility function (CU).

Background and Problem Formulation Evaluating clustering results (2) Significance Test on External Variables: this technique compares the clusters on variables that were not used to generate them. The evaluation is performed by computing the expected entropy of the clustering with respect to those variables. The smaller the expected entropy, the better the clustering fares.

Background and Problem Formulation Evaluating clustering results (3) The Category Utility function (CU): CU attempts to maximize both the probability that two objects in the same cluster have attribute values in common and the probability that objects in different clusters have different attribute values. The function measures whether the clustering improves the likelihood that similar values fall in the same cluster. The higher the value of CU, the better the clustering fares.
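
A sketch of the category utility computation using the CU definition common in the conceptual-clustering literature (some formulations divide the result by the number of clusters; that choice, and all names here, belong to this sketch rather than to the paper):

```python
from collections import Counter

def category_utility(clustering):
    """CU = sum_k (|C_k|/N) * sum_{i,j} [ P(A_i=V_ij | C_k)^2 - P(A_i=V_ij)^2 ].

    Records are tuples of categorical values; clustering is a list of clusters.
    """
    all_points = [p for cluster in clustering for p in cluster]
    n, d = len(all_points), len(all_points[0])
    # Marginal term: sum over attributes and values of P(A_i = V_ij)^2.
    marginal = sum((cnt / n) ** 2
                   for i in range(d)
                   for cnt in Counter(p[i] for p in all_points).values())
    cu = 0.0
    for cluster in clustering:
        within = sum((cnt / len(cluster)) ** 2
                     for i in range(d)
                     for cnt in Counter(p[i] for p in cluster).values())
        cu += (len(cluster) / n) * (within - marginal)
    return cu  # higher is better

# Grouping identical records together yields a positive CU.
print(category_utility([[("a", "x"), ("a", "x")], [("b", "y"), ("b", "y")]]))
```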

Background and Problem Formulation Number of clusters It is not easy to compute a centroid for each cluster when the data are categorical. Choosing the number of clusters is outside the scope of this paper.

Related Work (1) ENCLUS: an entropy-based algorithm. It recursively divides the hyperspace, which is a completely different approach from COOLCAT's. Such a hyperspace has no intuitive meaning when the attributes are categorical.

Related Work (2) ROCK: an agglomerative algorithm that computes distances between records using the Jaccard coefficient. It uses links and neighbors to compute similarities and to decide how clusters are merged.
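
For reference, a minimal sketch of the Jaccard coefficient, treating each categorical record as a set of (attribute index, value) pairs; this set representation is an assumption for illustration, not ROCK's actual implementation:

```python
def jaccard(record_a, record_b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| over (attribute, value) pairs."""
    a, b = set(enumerate(record_a)), set(enumerate(record_b))
    return len(a & b) / len(a | b)

# Two records agreeing on 2 of 3 attributes share 2 of 4 distinct pairs.
print(jaccard(("red", "small", "round"), ("red", "large", "round")))  # 0.5
```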

Algorithm (1) Initialization A sample S is taken from the data set (|S| << N, where N is the size of the entire data set). Step 1: Find the two points ps1, ps2 that maximize E(ps1, ps2) and place them in two separate clusters (C1, C2), marking those records. Step 2: From there, proceed incrementally, i.e., to find the record that seeds the j-th cluster, choose an unmarked point psj that maximizes min_{i=1,…,j-1} E(psi, psj). The remaining unmarked sample points (|S| - k of them) are placed using the incremental step.
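
A sketch of this initialization under the entropy definitions above; the pairwise entropy E(p, q) is computed as the entropy of the two-record cluster {p, q}, and all names are illustrative rather than the paper's code:

```python
import math
from collections import Counter

def cluster_entropy(cluster):
    """Sum of per-attribute entropies of a cluster (attribute independence assumed)."""
    total = 0.0
    for column in zip(*cluster):
        counts, n = Counter(column), len(column)
        total -= sum((c / n) * math.log(c / n) for c in counts.values())
    return total

def initialize(sample, k):
    """Choose k seed records from the sample that are maximally dissimilar in entropy terms."""
    # Step 1: the pair with the largest pairwise entropy seeds clusters C1 and C2.
    pairs = [(cluster_entropy([p, q]), i, j)
             for i, p in enumerate(sample)
             for j, q in enumerate(sample) if i < j]
    _, i1, i2 = max(pairs)
    seeds = [i1, i2]
    # Step 2: greedily add the unmarked point with the largest minimum
    # pairwise entropy to the seeds chosen so far.
    while len(seeds) < k:
        best = max((i for i in range(len(sample)) if i not in seeds),
                   key=lambda i: min(cluster_entropy([sample[i], sample[s]]) for s in seeds))
        seeds.append(best)
    # Each seed starts its own cluster; the remaining |S| - k sample points
    # are placed by the incremental step described on the next slides.
    return [[sample[i]] for i in seeds]
```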

Algorithm (2) The size of the sample The sample must contain at least one member of each cluster with high probability. The bound on the sample size involves the average cluster size, a confidence parameter, and m, the size of the smallest cluster.

Algorithm (3) Incremental Step Each remaining point is placed by computing the expected entropy that results from putting the point in each of the clusters and selecting the cluster for which that expected entropy is minimum. Re-processing and re-clustering: a point that was a good fit when it was placed may become a poor fit as the clusters evolve, so the heuristic is enhanced by re-processing a fraction m of the points in each batch.

Algorithm (4) The pseudo code of the incremental step (the original figure is not reproduced in this transcript; see the sketch below).
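
Below is a hedged Python sketch of the incremental step and the batch re-processing as described on the previous slide; the choice of which points to re-process is simplified, and all names are illustrative:

```python
import math
from collections import Counter

def cluster_entropy(cluster):
    """Sum of per-attribute entropies of a cluster (attribute independence assumed)."""
    total = 0.0
    for column in zip(*cluster):
        counts, n = Counter(column), len(column)
        total -= sum((c / n) * math.log(c / n) for c in counts.values())
    return total

def place_point(point, clusters, total_points):
    """Put the point into the cluster whose choice minimizes the expected entropy."""
    best_idx, best_ee = None, float("inf")
    for idx in range(len(clusters)):
        trial = [cl + [point] if i == idx else cl for i, cl in enumerate(clusters)]
        ee = sum(len(cl) / (total_points + 1) * cluster_entropy(cl) for cl in trial)
        if ee < best_ee:
            best_idx, best_ee = idx, ee
    clusters[best_idx].append(point)
    return best_idx

def process_batch(batch, clusters, reprocess_fraction=0.2):
    """Place a batch of points incrementally, then re-process a fraction of them."""
    placements = []
    for point in batch:
        n_current = sum(len(c) for c in clusters)
        placements.append((point, place_point(point, clusters, n_current)))
    # A fraction m of the batch is re-processed, since early placements may turn
    # into poor fits; the paper picks which points based on how poorly they fit,
    # while this sketch simply re-places the earliest-placed ones.
    for point, idx in placements[:int(len(batch) * reprocess_fraction)]:
        clusters[idx].remove(point)
        place_point(point, clusters, sum(len(c) for c in clusters))
```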

Experimental Results (1) Archaeological data set

Experimental Results (2) Congressional Voting results

Experimental Results (3) KDD CUP 1999 data set

Experimental Results (4) Synthetic data set: Results

Experimental Results (5) Synthetic data set: Performance

Conclusions COOLCAT is an efficient algorithm, and it is stable across different samples and sample sizes. COOLCAT is easier to tune and more efficient than ROCK. The incremental nature of COOLCAT makes it well suited to data streams and large volumes of data.