HKU CS 11/8/ Scalable Clustering of Categorical Data HKU CS Database Research Seminar August 11th, 2004 Panagiotis Karras
HKU CS 11/8/20042 The Problem Clustering a problem of great importance. Partitioning data into groups so that similar objects are grouped together. Clustering of numerical data well-treated. Clustering of categorical data more challenging: no inherent distance measure.
HKU CS 11/8/20043 An Example A Movie Relation: Distance or similarity between values not immediately obvious
HKU CS 11/8/20044 Some Information Theory Mutual Information measure employed. Clusters should be informative about the data they contain. Given a cluster, we should be able to predict the attribute values of its objects accurately. Information loss should be minimized.
HKU CS 11/8/20045 An Example In the Movie Relation, clustering C is better then clustering D according to this measure (why?).
HKU CS 11/8/20046 The Information Bottleneck Method Formalized by Tishby et al. [1999] Clustering: the compression of one random variable that preserves as much information as possible about another. Conditional entropy of A given T: Captures the uncertainty of predicting the values of A given the values of T.
HKU CS 11/8/20047 The Information Bottleneck Method Mutual Information quantifies the amount of information that variables convey about each other [Shannon, 1948]:
HKU CS 11/8/20048 The Information Bottleneck Method A set of n tuples on m attributes. Let d be the size of the set of all possible attribute values. Then the data can be conceptualized as a n×d matrix M. M[t,a]=1 iff tuple t contains value a. Rows of normalized M contain conditional probability distributions p(A|t).
HKU CS 11/8/20049 The Information Bottleneck Method In the Movie Relation Example:
HKU CS 11/8/ The Information Bottleneck Method Clustering is a problem of maximizing the Mutual Information I(A;C) between attribute values and cluster identities, for a given number k of clusters [Tishby et al. 1999]. Finding optimal clustering NP-complete. Agglomerative Information Bottleneck proposed by Slonim and Tishby [1999]. Starts with n clusters, reduces one at each step so that Information Loss in I(A;C) be minimized.
HKU CS 11/8/ LIMBO Clustering scaLable InforMation Bottleneck Keeps only sufficient statistics in memory. Compact Summary Model. Clustering based on Model.
HKU CS 11/8/ What is a DCF? A Cluster is summarized in a Distributional Cluster Feature (DCF). Pair of probability of cluster c and conditional probability distribution of attribute values given c. Distance between DCFs is defined as the Information Loss incurred by merging the corresponding clusters (computed by the Jensen- Shannon divergence).
HKU CS 11/8/ The DCF Tree Height balanced tree of branching factor B. DCFs at leaves define clustering of tuples. Non-leaf nodes merge DCFs of children. Compact hierarchical summarization of data.
HKU CS 11/8/ The LIMBO algorithm Three phases. Phase 1: Insertion into DCF tree. o Each tuple t converted to DCF(t). o Follows path downward in tree along closest non-leaf DCFs. o At leaf level, let DCF(c) be entry closest to DCF(t). o If empty entry in leaf of DCF(c), then DCF(t) placed there. o If no empty entry and sufficient free space, leaf split in two halves, with two farthest DCFs as seeds for new leaves. Split moves upward as necessary. o Else, if no space, two closest DCF entries in {leaf, t} are merged.
HKU CS 11/8/ The LIMBO algorithm Phase 2: Clustering. o For a given value of k, the DCF tree is used to produce k DCFs that serve as representatives of k clusters, emplying the Agglomerative Information Bottleneck algorithm. Phase 3: Associating tuples with clusters. o A scan over the data set is performed and each tuple is assigned to the closest cluster.
HKU CS 11/8/ Intra-Attribute Value Distance How to define the distance between categorical attribute values of the same attribute? Values should be placed within a context. Similar values appear in similar contexts. What is a suitable context? The distribution an attribute values induces on the remaining attributes.
HKU CS 11/8/ Intra-Attribute Value Distance The distance between two values is then defined as the Information Loss incurred about the other attributes if we merge these values. In the Movie example, Scorsese and Coppola are the most similar directors. Distance between tuples = sum of distances between attributes.
HKU CS 11/8/ Experiments - Algorithms Four algorithms are compared: ROCK. An agglomerative algorithm by Guha et al. [1999] COOLCAT. A scalable non-hierarchical algorithm most similar to LIMBO by Barbará et al. [2002] STIRR. A dynamical systems approach using a hypergraph of weighted attribute values, by Gibson et al. [1998] LIMBO. In addition to the space-bound version, LIMBO was implemented in an accuracy-control version, where a distance threshold is imposed on the decision of merging two DCFs, as multiple φ of the average mutual information of all tuples. The two versions differ only in Phase 1.
HKU CS 11/8/ Experiments – Data Sets The following Data Sers are used: Congressional Votes (435 boolean tuples on 16 issues, from 1984, classified as Democrat or Republican). Mushroom (8,124 tuples with 22 attributes, classified as poisonous or edible). Database and Theory bibliography (8,000 tuples on research papers with 4 attributes). Synthetic Data Sets (5,000 tuples, 10 attributes, DS5 and DS10 for 5 and 10 classes). Web Data (web pages - a tuple set of authorities with the hubs they are linked to by as attributes)
HKU CS 11/8/ Experiments - Quality Measures Several measures are used to capture the subjectivity of clustering quality: Information Loss. The lower the better. Category Utility. Difference between expected correct guesses of attribute values with and without a clustering. Min Classification Error. For tuples already classified. Precision (P), Recall (R). P measures the accuracy with which a cluster reproduces a class and R the completeness with which this is done.
HKU CS 11/8/ Quality-Efficiency trade-offs for LIMBO Both controlling the size (S) or the accuracy (φ) of the model, there is a trade-off between expressiveness (large S, small φ) and compactness (small S, large φ). For Branching factor B=4 we obtain: For large S and small φ, the bottleneck is Phase 2.
HKU CS 11/8/ Quality-Efficiency trade-offs for LIMBO Still, in Phase 1 we can obtain significant compression of the data sets at no expense in the final quality. This consistency can be attributed in part to the effect of Phase 3, which assigns tuples to cluster representatives. Εven for large values of φ and small values of S, LIMBO obtains essentially the same clustering quality as AIB, but in linear time.
HKU CS 11/8/ Comparative Evaluations The table show the results for all algorithms and all quality measures for the Votes and Mushrooms data sets. LIMBO’s quality superior to ROCK and COOLCAT. COOLCAT comes closest to LIMBO.
HKU CS 11/8/ Web Data Authorities clustered into three clusters with information loss 61%. LIMBO accurately characterizes structure of web graph. Three clusters correspond to different viewpoints (pro, against, irrelevant).
HKU CS 11/8/ Scalability Evaluation Four data sets of size 500K, 1M, 5M, 10M (10 clusters, 10 attributes each). Phase 1 in detail for LIMBO φ For 1.0 < φ <1.5 manageable size, fast execution time.
HKU CS 11/8/ Scalability Evaluation We set φ = 1.2, 1.3, S = 1MB, 5MB. Time scales linearly with data set size. Varied number of attributes – linear behavior.
HKU CS 11/8/ Scalability - Quality Results Quality measures the same for different data set sizes.
HKU CS 11/8/ Conclusions LIMBO has advantages over other information theoretic clustering algorithms in terms of scalability and quality. LIMBO only hierarchical scalable categorical clustering algorithm – based on compact summary model.
HKU CS 11/8/ Main Reference P. Andritsos, P. Tsaparas, R. J. Miller, K. C. Sevcik. LIMBO: Scalable Clustering of Categorical Data, 9th International Conference on Extending DataBase Technology (EDBT), Heraklion, Greece, LIMBO: Scalable Clustering of Categorical Data LIMBO: Scalable Clustering of Categorical Data
HKU CS 11/8/ References D. Barbará, J. Couto, and Y. Li. COOLCAT: An entropy-based algorithm for categorical clustering. In CIKM, McLean, VA, D. Gibson, J. M. Kleinberg, and P. Raghavan. Clustering Categorical Data: An Approach Based on Dynamical Systems. In VLDB, New York, NY, S. Guha, R. Rastogi, and K. Shim. ROCK: A Robust Clustering Algorithm for Categorical Atributes. In ICDE, Sydney, Australia, C. Shannon. A Mathematical Theory of Communication, N. Slonim and N. Tishby. Agglomerative Information Bottleneck. In NIPS, Breckenridge, N. Tishby, F. C. Pereira, and W. Bialek. The Information Bottleneck Method. In 37th Annual Allerton Conference on Communication, Control and Computing, Urban-Champaign, IL, 1999.