
Slide 1: Scalable Clustering of Categorical Data
HKU CS Database Research Seminar, August 11th, 2004
Panagiotis Karras

Slide 2: The Problem
- Clustering is a problem of great importance.
- It partitions data into groups so that similar objects are grouped together.
- Clustering of numerical data is well treated.
- Clustering of categorical data is more challenging: there is no inherent distance measure.

Slide 3: An Example
- A Movie Relation (table shown on the slide).
- Distance or similarity between values is not immediately obvious.

Slide 4: Some Information Theory
- A Mutual Information measure is employed.
- Clusters should be informative about the data they contain.
- Given a cluster, we should be able to predict the attribute values of its objects accurately.
- Information loss should be minimized.

Slide 5: An Example
- In the Movie Relation, clustering C is better than clustering D according to this measure (why?).

Slide 6: The Information Bottleneck Method
- Formalized by Tishby et al. [1999].
- Clustering: the compression of one random variable that preserves as much information as possible about another.
- The conditional entropy of A given T captures the uncertainty of predicting the values of A given the values of T (see the formula below).
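The formula itself is not in the transcript; for reference, the standard definition in this notation is:

```latex
H(A \mid T) = -\sum_{t \in T} p(t) \sum_{a \in A} p(a \mid t) \log p(a \mid t)
```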

Slide 7: The Information Bottleneck Method
- Mutual Information quantifies the amount of information that two variables convey about each other [Shannon, 1948] (see the formula below).
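Again for reference, the standard definition, consistent with the conditional entropy above:

```latex
I(A;T) = \sum_{t \in T} \sum_{a \in A} p(a, t) \log \frac{p(a, t)}{p(a)\, p(t)} = H(A) - H(A \mid T)
```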

Slide 8: The Information Bottleneck Method
- A set of n tuples on m attributes.
- Let d be the size of the set of all possible attribute values.
- The data can then be conceptualized as an n×d matrix M.
- M[t,a] = 1 iff tuple t contains value a.
- The rows of the normalized M contain the conditional probability distributions p(A|t) (see the sketch below).
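A minimal sketch of this construction; the toy relation and value names are illustrative, not the relation used in the paper:

```python
import numpy as np

# Hypothetical toy relation: one value per attribute (director, actor, genre).
tuples = [
    ("Scorsese",  "De Niro", "Crime"),
    ("Coppola",   "De Niro", "Crime"),
    ("Hitchcock", "Stewart", "Thriller"),
]

# Index the set of all attribute values (size d).
values = sorted({v for t in tuples for v in t})
col = {v: j for j, v in enumerate(values)}

# Build the n x d matrix M with M[t, a] = 1 iff tuple t contains value a.
M = np.zeros((len(tuples), len(values)))
for i, t in enumerate(tuples):
    for v in t:
        M[i, col[v]] = 1.0

# Row-normalize: each row becomes p(A | t); every value present in a tuple
# gets probability 1/m.
P = M / M.sum(axis=1, keepdims=True)
```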

Slide 9: The Information Bottleneck Method
- In the Movie Relation example (the resulting matrix is shown on the slide).

Slide 10: The Information Bottleneck Method
- Clustering is a problem of maximizing the Mutual Information I(A;C) between attribute values and cluster identities, for a given number k of clusters [Tishby et al. 1999].
- Finding the optimal clustering is NP-complete.
- The Agglomerative Information Bottleneck (AIB) was proposed by Slonim and Tishby [1999].
- It starts with n clusters and reduces their number by one at each step, always performing the merge whose loss in I(A;C) is minimal (see the sketch below).
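A minimal sketch of AIB under these definitions, operating on the row-normalized matrix P from the previous slide; a plain O(n^3)-style illustration, not the authors' implementation:

```python
import numpy as np

def merge_loss(p1, d1, p2, d2):
    # Information loss of merging clusters c1, c2, each given as (p(c), p(A|c)):
    # (p(c1)+p(c2)) times the Jensen-Shannon divergence of the two distributions.
    w = p1 + p2
    pi1, pi2 = p1 / w, p2 / w
    mix = pi1 * d1 + pi2 * d2

    def kl(p, q):
        m = p > 0
        return float(np.sum(p[m] * np.log2(p[m] / q[m])))

    return w * (pi1 * kl(d1, mix) + pi2 * kl(d2, mix))

def aib(P, k):
    # Start with one cluster per tuple (p(c) = 1/n, p(A|c) = row of P) and
    # greedily merge the pair with the smallest information loss until k remain.
    n = P.shape[0]
    clusters = [(1.0 / n, P[i].copy(), {i}) for i in range(n)]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                loss = merge_loss(clusters[i][0], clusters[i][1],
                                  clusters[j][0], clusters[j][1])
                if best is None or loss < best[0]:
                    best = (loss, i, j)
        _, i, j = best
        (p1, d1, m1), (p2, d2, m2) = clusters[i], clusters[j]
        w = p1 + p2
        merged = (w, (p1 * d1 + p2 * d2) / w, m1 | m2)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters  # list of (p(c), p(A|c), member tuple indices)
```

With P from the sketch on slide 8, aib(P, 2) first merges the two Crime movies, whose rows are most alike.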

Slide 11: LIMBO Clustering
- scaLable InforMation BOttleneck.
- Keeps only sufficient statistics in memory.
- Builds a compact summary model.
- Clustering is then based on the model.

Slide 12: What is a DCF?
- A cluster c is summarized in a Distributional Cluster Feature (DCF).
- The DCF is the pair of the probability p(c) of cluster c and the conditional probability distribution p(A|c) of the attribute values given c.
- The distance between two DCFs is defined as the information loss incurred by merging the corresponding clusters, computed via the Jensen-Shannon divergence (see the formula below).
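Spelled out (a reconstruction consistent with the merge_loss sketch above, where D_JS denotes the Jensen-Shannon divergence taken with mixing weights proportional to the cluster probabilities):

```latex
d\bigl(DCF(c_i), DCF(c_j)\bigr) = \bigl[p(c_i) + p(c_j)\bigr]\, D_{JS}\bigl[p(A \mid c_i),\; p(A \mid c_j)\bigr]
```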

Slide 13: The DCF Tree
- A height-balanced tree with branching factor B.
- The DCFs at the leaves define the clustering of the tuples.
- Non-leaf nodes hold merged DCFs of their children.
- It provides a compact hierarchical summarization of the data.

Slide 14: The LIMBO Algorithm
- Three phases.
- Phase 1: insertion into the DCF tree.
  - Each tuple t is converted to DCF(t).
  - It follows a path downward in the tree along the closest non-leaf DCFs.
  - At the leaf level, let DCF(c) be the entry closest to DCF(t).
  - If there is an empty entry in the leaf of DCF(c), DCF(t) is placed there.
  - If there is no empty entry but sufficient free space, the leaf is split in two halves, with the two farthest DCFs as seeds for the new leaves. The split moves upward as necessary.
  - Else, if there is no space, the two closest DCF entries in {leaf, t} are merged.

Slide 15: The LIMBO Algorithm
- Phase 2: clustering.
  - For a given value of k, the DCF tree is used to produce k DCFs that serve as representatives of the k clusters, employing the Agglomerative Information Bottleneck algorithm.
- Phase 3: associating tuples with clusters.
  - A scan over the data set is performed and each tuple is assigned to the closest cluster.

Slide 16: Intra-Attribute Value Distance
- How do we define the distance between categorical values of the same attribute?
- Values should be placed within a context.
- Similar values appear in similar contexts.
- What is a suitable context?
- The distribution an attribute value induces on the remaining attributes.

Slide 17: Intra-Attribute Value Distance
- The distance between two values is then defined as the information loss incurred about the other attributes if we merge these values (see the sketch below).
- In the Movie example, Scorsese and Coppola are the most similar directors.
- Distance between tuples = sum of the distances between their attribute values.
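A hedged, self-contained sketch of this distance on a hypothetical toy Movie relation (the table values below are illustrative, not the paper's exact data):

```python
import numpy as np

# Toy relation: (director, actor, genre).
movies = [
    ("Scorsese",  "De Niro", "Crime"),
    ("Coppola",   "De Niro", "Crime"),
    ("Hitchcock", "Stewart", "Thriller"),
    ("Hitchcock", "Grant",   "Thriller"),
]

def context(value, col=0):
    # Distribution that a value of one attribute (default: director, column 0)
    # induces on the values of the remaining attributes, plus its probability.
    rows = [t for t in movies if t[col] == value]
    others = sorted({v for t in movies for i, v in enumerate(t) if i != col})
    counts = np.array([sum(v in t for t in rows) for v in others], dtype=float)
    return counts / counts.sum(), len(rows) / len(movies)

def value_distance(v1, v2, col=0):
    # Information loss of merging two values: (p(v1)+p(v2)) times the
    # Jensen-Shannon divergence of their context distributions.
    d1, p1 = context(v1, col)
    d2, p2 = context(v2, col)
    w = p1 + p2
    pi1, pi2 = p1 / w, p2 / w
    mix = pi1 * d1 + pi2 * d2

    def kl(p, q):
        m = p > 0
        return float(np.sum(p[m] * np.log2(p[m] / q[m])))

    return w * (pi1 * kl(d1, mix) + pi2 * kl(d2, mix))

# In this toy data Scorsese and Coppola induce identical contexts, so their
# distance is 0, while either of them vs. Hitchcock incurs a positive loss.
print(value_distance("Scorsese", "Coppola"))
print(value_distance("Scorsese", "Hitchcock"))
```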

Slide 18: Experiments - Algorithms
- Four algorithms are compared:
- ROCK: an agglomerative algorithm by Guha et al. [1999].
- COOLCAT: a scalable non-hierarchical algorithm, the one most similar to LIMBO, by Barbará et al. [2002].
- STIRR: a dynamical-systems approach using a hypergraph of weighted attribute values, by Gibson et al. [1998].
- LIMBO: in addition to the space-bounded version, LIMBO was implemented in an accuracy-controlled version, where a distance threshold is imposed on the decision of merging two DCFs, expressed as a multiple φ of the average mutual information of all tuples. The two versions differ only in Phase 1.

Slide 19: Experiments - Data Sets
- The following data sets are used:
- Congressional Votes (435 boolean tuples on 16 issues, from 1984, classified as Democrat or Republican).
- Mushroom (8,124 tuples with 22 attributes, classified as poisonous or edible).
- Database and Theory bibliography (8,000 tuples on research papers with 4 attributes).
- Synthetic data sets (5,000 tuples, 10 attributes; DS5 and DS10 with 5 and 10 classes).
- Web data (web pages: a set of tuples, one per authority, with the hubs that link to it as attributes).

Slide 20: Experiments - Quality Measures
- Several measures are used to capture the subjectivity of clustering quality:
- Information Loss: the lower, the better.
- Category Utility: the difference between the expected number of correctly guessed attribute values with and without a clustering.
- Min Classification Error: for tuples that are already classified.
- Precision (P) and Recall (R): P measures the accuracy with which a cluster reproduces a class, and R the completeness with which this is done.

Slide 21: Quality-Efficiency Trade-offs for LIMBO
- Whether we control the size (S) or the accuracy (φ) of the model, there is a trade-off between expressiveness (large S, small φ) and compactness (small S, large φ).
- Results are shown for branching factor B = 4.
- For large S and small φ, the bottleneck is Phase 2.

Slide 22: Quality-Efficiency Trade-offs for LIMBO
- Still, in Phase 1 we can obtain significant compression of the data sets at no expense in the final quality.
- This consistency can be attributed in part to the effect of Phase 3, which assigns tuples to cluster representatives.
- Even for large values of φ and small values of S, LIMBO obtains essentially the same clustering quality as AIB, but in linear time.

Slide 23: Comparative Evaluations
- The table shows the results for all algorithms and all quality measures on the Votes and Mushroom data sets.
- LIMBO's quality is superior to ROCK and COOLCAT.
- COOLCAT comes closest to LIMBO.

Slide 24: Web Data
- The authorities are clustered into three clusters with information loss 61%.
- LIMBO accurately characterizes the structure of the web graph.
- The three clusters correspond to different viewpoints (pro, against, irrelevant).

Slide 25: Scalability Evaluation
- Four data sets of size 500K, 1M, 5M, and 10M tuples (10 clusters, 10 attributes each).
- Phase 1 is examined in detail for the accuracy-controlled version, LIMBO with threshold φ.
- For 1.0 < φ < 1.5, the model has manageable size and execution time is fast.

Slide 26: Scalability Evaluation
- We set φ = 1.2, 1.3 and S = 1MB, 5MB.
- Time scales linearly with the data set size.
- The number of attributes was also varied; the behavior remains linear.

Slide 27: Scalability - Quality Results
- The quality measures remain the same for the different data set sizes.

Slide 28: Conclusions
- LIMBO has advantages over other information-theoretic clustering algorithms in terms of scalability and quality.
- LIMBO is the only hierarchical, scalable categorical clustering algorithm; it is based on a compact summary model.

Slide 29: Main Reference
- P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik. LIMBO: Scalable Clustering of Categorical Data. In 9th International Conference on Extending Database Technology (EDBT), Heraklion, Greece, 2004.

Slide 30: References
- D. Barbará, J. Couto, and Y. Li. COOLCAT: An Entropy-Based Algorithm for Categorical Clustering. In CIKM, McLean, VA, 2002.
- D. Gibson, J. M. Kleinberg, and P. Raghavan. Clustering Categorical Data: An Approach Based on Dynamical Systems. In VLDB, New York, NY, 1998.
- S. Guha, R. Rastogi, and K. Shim. ROCK: A Robust Clustering Algorithm for Categorical Attributes. In ICDE, Sydney, Australia, 1999.
- C. Shannon. A Mathematical Theory of Communication, 1948.
- N. Slonim and N. Tishby. Agglomerative Information Bottleneck. In NIPS, Breckenridge, CO, 1999.
- N. Tishby, F. C. Pereira, and W. Bialek. The Information Bottleneck Method. In 37th Annual Allerton Conference on Communication, Control and Computing, Urbana-Champaign, IL, 1999.

