COOLCAT: An Entropy-Based Algorithm for Categorical Clustering
Daniel Barbara, Julia Couto, Yi Li. ACM CIKM, 2002, pp. Advisor: 郭煌政. Student: 楊金龍
Outline
Introduction
Background and problem formulation
Related work
Algorithm
Experiment
Conclusions
Introduction
Clustering of categorical attributes is a difficult yet important task. COOLCAT is a method that uses the notion of entropy to group records. It is an incremental algorithm that aims to minimize the expected entropy of the clusters.
Background and Problem Formulation Entropy and clustering(1)
Entropy function of a variable X:
X: a random variable
S(X): the set of values that X can take
p(x): the probability function of X
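Written out from these definitions, this is the standard Shannon entropy:

```latex
E(X) = -\sum_{x \in S(X)} p(x) \log\bigl(p(x)\bigr)
```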
Background and Problem Formulation Entropy and clustering(2)
Entropy of a multivariate vector x̂ = {X1, …, Xn}
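Expanded, the entropy is taken over the joint distribution of the components (this is the Equation 2 the later slides refer to):

```latex
E(\hat{x}) = -\sum_{x_1 \in S(X_1)} \cdots \sum_{x_n \in S(X_n)} p(x_1, \ldots, x_n) \log\bigl(p(x_1, \ldots, x_n)\bigr)
```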
Background and Problem Formulation Problem formulation(1)
1. Given a data set D of N points.
2. Each point is a multidimensional vector of d categorical attributes, i.e., p̂ = (p1, …, pd).
3. Separate the points into k groups C1, …, Ck.
4. Minimizing the entropy of the clustering is NP-complete.
5. It is NP-complete for any distance function d(x, y), too.
Background and Problem Formulation Problem formulation(2)
6. Expected entropy (Equation 3):
E(C1), …, E(Ck): the entropies of each cluster
Ci: the points assigned to cluster i, with Ci ∩ Cj = Ø for all i, j = 1, …, k, i ≠ j
C̆ = {C1, …, Ck} denotes the clustering
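Equation 3, reconstructed from these definitions, is the size-weighted average of the per-cluster entropies:

```latex
\bar{E}(\breve{C}) = \sum_{i=1}^{k} \frac{|C_i|}{|D|} \, E(C_i)
```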
Background and Problem Formulation Problem formulation(3)
7. Equation 3 allows us to implement an incremental algorithm.
8. Assuming the attributes are independent, Equation 2 is translated into Equations 4 and 5, so that the entropy can be calculated as the sum of the entropies of the attributes.
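Under that independence assumption the joint probability factorizes and the entropy decomposes attribute by attribute; reconstructed, Equations 4 and 5 read:

```latex
p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i) \qquad \text{(Eq. 4)}

E(\hat{x}) = E(X_1) + E(X_2) + \cdots + E(X_n) \qquad \text{(Eq. 5)}
```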
Background and Problem Formulation Problem formulation(4)
Background and Problem Formulation Evaluating clustering results(1)
Different clustering algorithms produce different solutions, which makes the solutions difficult to evaluate. Two widely used methods of evaluating clustering results:
Significance tests on external variables
The category utility function (CU)
Background and Problem Formulation Evaluating clustering results(2)
Significance test on external variables: this technique compares the clusters on variables that were not used to generate them. The evaluation is performed by computing the expected entropy: the smaller its value, the better the clustering fares.
Background and Problem Formulation Evaluating clustering results(3)
The Category Utility Function (CU)
The CU function attempts to maximize both the probability that two objects in the same cluster have attribute values in common and the probability that objects in different clusters do not. The function measures whether the clustering improves the likelihood of similar values falling in the same cluster. The higher the value of CU, the better the clustering fares.
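For reference, a common form of the category utility function (the Gluck and Corter formulation, not copied from the slide; some definitions also divide by the number of clusters k):

```latex
CU(\breve{C}) = \sum_{m=1}^{k} \frac{|C_m|}{N} \sum_{i} \sum_{j} \Bigl[ P(A_i = V_{ij} \mid C_m)^2 - P(A_i = V_{ij})^2 \Bigr]
```

Here the A_i are the attributes and the V_ij the values each attribute can take.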
Background and Problem Formulation Number of clusters
It is not easy to compute a centroid for each cluster when the data are categorical. This issue is outside the scope of the paper.
Related Work(1) ENCLUS: An entropy-based algorithm
ENCLUS divides the hyperspace recursively, which is a completely different approach from COOLCAT's. Such a division has no intuitive meaning when the attributes are categorical.
Related Work(2) ROCK
ROCK is an agglomerative algorithm that computes distances between records using the Jaccard coefficient. It uses links and neighbors between records to compute the distances and decide how to merge clusters.
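A quick illustration of the Jaccard coefficient that ROCK relies on (a sketch treating each record as a set of attribute values; not ROCK's actual code):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)

# Two records sharing 2 of their 4 distinct attribute values:
print(jaccard({"red", "small", "round"}, {"red", "small", "square"}))  # 0.5
```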
Algorithm(1) Initialization
A sample S is taken from the data set (|S| << N, where N is the size of the entire data set).
Step 1: Find the two points ps1, ps2 that maximize E(ps1, ps2) and place them in two separate clusters (C1, C2), marking those records.
Step 2: From there, proceed incrementally: to find the record that seeds the j-th cluster, choose an unmarked point psj that maximizes min i=1,…,j-1 E(psi, psj).
The remaining |S| − k unmarked sample points are placed during the incremental step.
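A minimal sketch of this max-min seed selection, assuming records are tuples of categorical values; pair_entropy and select_seeds are illustrative names, not the paper's:

```python
import math
from itertools import combinations

def pair_entropy(p, q):
    """Entropy of the two-point cluster {p, q}: an attribute where the
    points differ has two equally likely values and contributes log(2);
    a matching attribute contributes 0."""
    return sum(math.log(2) for a, b in zip(p, q) if a != b)

def select_seeds(sample, k):
    """Start with the pair maximizing pairwise entropy, then repeatedly
    add the point whose minimum entropy to the chosen seeds is largest."""
    p1, p2 = max(combinations(sample, 2), key=lambda pq: pair_entropy(*pq))
    seeds = [p1, p2]
    while len(seeds) < k:
        rest = [p for p in sample if p not in seeds]
        nxt = max(rest, key=lambda p: min(pair_entropy(p, s) for s in seeds))
        seeds.append(nxt)
    return seeds
```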
Algorithm(2) The sample size
The sample must contain at least one member of each cluster with high probability. The bound on the sample size is expressed in terms of the average cluster size, a confidence parameter, and m, the size of the smallest cluster.
Algorithm(3) Incremental Step
Each point in the batch is placed by computing the expected entropy that results from placing it in each of the clusters, and selecting the cluster for which that expected entropy is minimal.
Re-processing and re-clustering: a point that was a good fit when it was placed can become a poor fit as later points arrive, so the heuristic is enhanced by re-processing a fraction m of the points in the batch.
Algorithm(4) The pseudocode of the incremental step
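A minimal Python sketch of the incremental step described above, assuming each cluster keeps one collections.Counter of value frequencies per attribute; the names here are illustrative, not the paper's actual pseudocode:

```python
import math
from collections import Counter

def cluster_entropy(counts, size):
    """Entropy of a cluster under attribute independence:
    the sum of the entropies of its attributes (Eq. 5)."""
    e = 0.0
    for attr_counts in counts:              # one Counter per attribute
        for c in attr_counts.values():
            if c > 0:
                p = c / size
                e -= p * math.log(p)
    return e

def expected_entropy(clusters):
    """Size-weighted average of the cluster entropies (Eq. 3)."""
    total = sum(size for _, size in clusters)
    return sum((size / total) * cluster_entropy(counts, size)
               for counts, size in clusters if size > 0)

def place_point(point, clusters):
    """Tentatively add the point to each cluster and commit the placement
    that minimizes the expected entropy of the whole clustering."""
    best_i, best_e = None, float("inf")
    for i, (counts, size) in enumerate(clusters):
        for attr_counts, v in zip(counts, point):   # tentative insert
            attr_counts[v] += 1
        clusters[i] = (counts, size + 1)
        e = expected_entropy(clusters)
        if e < best_e:
            best_i, best_e = i, e
        clusters[i] = (counts, size)                # undo the insert
        for attr_counts, v in zip(counts, point):
            attr_counts[v] -= 1
    counts, size = clusters[best_i]                 # commit the winner
    for attr_counts, v in zip(counts, point):
        attr_counts[v] += 1
    clusters[best_i] = (counts, size + 1)
    return best_i
```

Each seed cluster would be initialized as ([Counter({v: 1}) for v in seed], 1) before the batch is processed.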
Experimental Results(1) Archaeological data set
Experimental Results(2) Congressional voting results
Experimental Results(3) KDD CUP 1999 data set
Experimental Results(4) Synthetic data set: Results
Experimental Results(5) Synthetic data set: Performance
Conclusions
COOLCAT is an efficient algorithm, and it is stable across different samples and sample sizes.
COOLCAT is easier to tune and more efficient than ROCK.
The incremental nature of COOLCAT makes it well suited to data streams and large volumes of data.