Scalable Clustering of Categorical Data
HKU CS Database Research Seminar
August 11th, 2004
Panagiotis Karras

The Problem
- Clustering is a problem of great importance.
- It partitions data into groups so that similar objects are grouped together.
- Clustering of numerical data is well treated.
- Clustering of categorical data is more challenging: there is no inherent distance measure.

An Example
- A Movie relation serves as the running example (table on the slide).
- The distance or similarity between attribute values is not immediately obvious.

Some Information Theory
- A mutual-information measure is employed.
- Clusters should be informative about the data they contain.
- Given a cluster, we should be able to predict the attribute values of its objects accurately.
- Information loss should be minimized.

An Example
- In the Movie relation, clustering C is better than clustering D according to this measure (why?).

The Information Bottleneck Method
- Formalized by Tishby et al. [1999].
- Clustering: the compression of one random variable that preserves as much information as possible about another.
- Conditional entropy of A given T: $H(A|T) = -\sum_{t} p(t) \sum_{a} p(a|t) \log p(a|t)$.
- It captures the uncertainty of predicting the values of A given the values of T.

The Information Bottleneck Method
- Mutual Information quantifies the amount of information that variables convey about each other [Shannon, 1948]: $I(A;T) = \sum_{t}\sum_{a} p(t,a) \log \frac{p(t,a)}{p(t)\,p(a)} = H(A) - H(A|T)$.
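To make the two quantities concrete, here is a minimal sketch (function names and the toy distribution are my own, not from the slides) that computes H(A|T) and I(A;T) from a joint distribution table p(t, a) with numpy:

```python
# A minimal sketch (names and toy data are assumptions of this sketch) that
# computes H(A|T) and I(A;T) from a joint distribution table p(t, a).
import numpy as np

def conditional_entropy(p_ta):
    """H(A|T) = -sum_t p(t) sum_a p(a|t) log2 p(a|t)."""
    p_t = p_ta.sum(axis=1, keepdims=True)                     # marginal p(t)
    p_a_given_t = np.divide(p_ta, p_t, out=np.zeros_like(p_ta), where=p_t > 0)
    with np.errstate(divide="ignore"):
        logs = np.where(p_a_given_t > 0, np.log2(p_a_given_t), 0.0)
    return float(-(p_ta * logs).sum())

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(p_ta):
    """I(A;T) = H(A) - H(A|T)."""
    return entropy(p_ta.sum(axis=0)) - conditional_entropy(p_ta)

# Toy joint distribution: 2 tuples (rows) x 3 attribute values (columns).
p_ta = np.array([[0.25, 0.25, 0.0],
                 [0.0,  0.25, 0.25]])
print(mutual_information(p_ta))    # 0.5 bits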

The Information Bottleneck Method
- Consider a set of n tuples on m attributes.
- Let d be the size of the set of all possible attribute values.
- The data can then be conceptualized as an n×d matrix M.
- M[t,a] = 1 iff tuple t contains value a.
- The rows of the normalized M contain the conditional probability distributions p(A|t).
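The matrix construction can be sketched as follows; the toy (director, actor, genre) tuples are illustrative stand-ins for the Movie relation, and all names are assumptions of this sketch:

```python
# A sketch of building M and normalizing its rows to p(A|t).
import numpy as np

tuples = [
    ("Scorsese",  "De Niro", "Crime"),
    ("Coppola",   "De Niro", "Crime"),
    ("Hitchcock", "Stewart", "Thriller"),
]

# One column of M per distinct (attribute index, value) pair.
cols = sorted({(j, v) for t in tuples for j, v in enumerate(t)})
col = {av: i for i, av in enumerate(cols)}

M = np.zeros((len(tuples), len(cols)))
for i, t in enumerate(tuples):
    for j, v in enumerate(t):
        M[i, col[(j, v)]] = 1.0       # M[t, a] = 1 iff tuple t contains value a

p_A_given_t = M / M.sum(axis=1, keepdims=True)   # each row is p(A|t)
print(p_A_given_t)
```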

The Information Bottleneck Method
- In the Movie relation example, each tuple's row of the normalized matrix gives its distribution p(A|t) over attribute values.

The Information Bottleneck Method
- Clustering is the problem of maximizing the Mutual Information I(A;C) between attribute values and cluster identities, for a given number k of clusters [Tishby et al. 1999].
- Finding the optimal clustering is NP-complete.
- The Agglomerative Information Bottleneck was proposed by Slonim and Tishby [1999].
- It starts with n clusters and removes one at each step, so that the loss in I(A;C) is minimized.
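A compact sketch of the agglomerative merging, under the standard formulation in which the loss of merging clusters ci and cj is (p(ci) + p(cj)) times the weighted Jensen-Shannon divergence of their conditional distributions; all function names are my own:

```python
# Sketch of Agglomerative Information Bottleneck merging (names are mine).
#   dI(ci, cj) = (p(ci) + p(cj)) * JS_pi[p(A|ci), p(A|cj)],
# where the JS divergence is weighted by pi_i = p(ci) / (p(ci) + p(cj)).
import numpy as np

def kl(p, q):
    m = p > 0
    return float((p[m] * np.log2(p[m] / q[m])).sum())

def info_loss(p_i, d_i, p_j, d_j):
    w_i, w_j = p_i / (p_i + p_j), p_j / (p_i + p_j)
    merged = w_i * d_i + w_j * d_j
    return (p_i + p_j) * (w_i * kl(d_i, merged) + w_j * kl(d_j, merged))

def aib(probs, dists, k):
    """Greedily merge the least-costly pair of clusters until only k remain."""
    probs = list(probs)
    dists = [np.asarray(d, float) for d in dists]
    while len(probs) > k:
        pairs = [(info_loss(probs[i], dists[i], probs[j], dists[j]), i, j)
                 for i in range(len(probs)) for j in range(i + 1, len(probs))]
        _, i, j = min(pairs)
        w = probs[i] + probs[j]
        dists[i] = (probs[i] * dists[i] + probs[j] * dists[j]) / w
        probs[i] = w
        del probs[j], dists[j]
    return probs, dists

# E.g., starting from one cluster per tuple of the normalized matrix M:
#   probs, dists = aib([1.0 / len(tuples)] * len(tuples), p_A_given_t, k=2)
```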

LIMBO Clustering
- scaLable InforMation BOttleneck.
- Keeps only sufficient statistics in memory.
- Builds a compact summary model.
- Performs the clustering based on that model.

What is a DCF?
- A cluster is summarized in a Distributional Cluster Feature (DCF).
- A DCF is the pair of the probability of cluster c and the conditional probability distribution of attribute values given c.
- The distance between DCFs is defined as the Information Loss incurred by merging the corresponding clusters (computed through the Jensen-Shannon divergence).
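A minimal sketch of a DCF and of the merge-based distance between two DCFs (the dataclass and names are assumptions of this sketch, not the paper's code):

```python
# A minimal DCF sketch: cluster c is summarized by the pair (p(c), p(A|c));
# the distance between two DCFs is the information loss of merging them.
import numpy as np
from dataclasses import dataclass

@dataclass
class DCF:
    p: float              # p(c): probability mass of the cluster
    dist: np.ndarray      # p(A|c): conditional distribution over attribute values

    def merge(self, other: "DCF") -> "DCF":
        w = self.p + other.p
        return DCF(w, (self.p * self.dist + other.p * other.dist) / w)

def dcf_distance(a: DCF, b: DCF) -> float:
    """Information loss of merging a and b, via the weighted JS divergence."""
    merged = a.merge(b).dist
    def kl(p, q):
        m = p > 0
        return float((p[m] * np.log2(p[m] / q[m])).sum())
    w_a, w_b = a.p / (a.p + b.p), b.p / (a.p + b.p)
    return (a.p + b.p) * (w_a * kl(a.dist, merged) + w_b * kl(b.dist, merged))
```

Merging two DCFs is just the probability-weighted average of their distributions, which is also the kind of summary a non-leaf node of the DCF tree keeps for its children.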

The DCF Tree
- A height-balanced tree of branching factor B.
- The DCFs at the leaves define the clustering of the tuples.
- Non-leaf nodes merge the DCFs of their children.
- It is a compact, hierarchical summarization of the data.

The LIMBO algorithm
- Three phases.
- Phase 1: Insertion into the DCF tree (a simplified sketch follows below).
  o Each tuple t is converted to DCF(t).
  o It follows a path downward in the tree along the closest non-leaf DCFs.
  o At the leaf level, let DCF(c) be the entry closest to DCF(t).
  o If there is an empty entry in the leaf of DCF(c), DCF(t) is placed there.
  o If there is no empty entry but sufficient free space, the leaf is split into two halves, with the two farthest DCFs as seeds for the new leaves. The split moves upward as necessary.
  o Else, if there is no space, the two closest DCF entries in {leaf, t} are merged.
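Below is a deliberately simplified, tree-free sketch of the leaf-level decision only, reusing the `DCF` class and `dcf_distance()` from the sketch above; the actual Phase 1 maintains a height-balanced DCF tree with branching factor B, empty-entry placement, and node splits, none of which is modeled here:

```python
# Simplified, tree-free sketch of Phase 1 at leaf level only.
# Assumes the DCF class and dcf_distance() defined in the previous sketch.
import numpy as np

def phase1_flat(rows, n_tuples, capacity):
    """rows: iterable of p(A|t) vectors; returns a bounded list of leaf DCFs."""
    leaf = []
    for row in rows:
        t = DCF(1.0 / n_tuples, np.asarray(row, float))   # DCF(t) of one tuple
        if len(leaf) < capacity:          # an empty entry is still available
            leaf.append(t)
            continue
        # No space left: merge DCF(t) into the closest existing entry.
        i = min(range(len(leaf)), key=lambda i: dcf_distance(leaf[i], t))
        leaf[i] = leaf[i].merge(t)
    return leaf
```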

The LIMBO algorithm
- Phase 2: Clustering.
  o For a given value of k, the DCF tree is used to produce k DCFs that serve as representatives of the k clusters, employing the Agglomerative Information Bottleneck algorithm.
- Phase 3: Associating tuples with clusters.
  o A scan over the data set is performed and each tuple is assigned to the closest cluster.
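Continuing the same simplified sketch (and again reusing the assumed helpers above), Phase 2 merges the leaf DCFs down to k representatives and Phase 3 assigns every tuple to its closest representative:

```python
# Continuation of the simplified sketch; assumes DCF and dcf_distance() above.
import numpy as np

def phase2(leaf_dcfs, k):
    reps = list(leaf_dcfs)
    while len(reps) > k:
        # Merge the pair of representatives with the smallest information loss.
        pairs = [(dcf_distance(reps[i], reps[j]), i, j)
                 for i in range(len(reps)) for j in range(i + 1, len(reps))]
        _, i, j = min(pairs)
        reps[i] = reps[i].merge(reps[j])
        del reps[j]
    return reps

def phase3(rows, n_tuples, reps):
    """Assign each tuple (given as its p(A|t) vector) to its closest cluster."""
    labels = []
    for row in rows:
        t = DCF(1.0 / n_tuples, np.asarray(row, float))
        labels.append(min(range(len(reps)), key=lambda i: dcf_distance(reps[i], t)))
    return labels
```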

Intra-Attribute Value Distance
- How do we define the distance between categorical values of the same attribute?
- Values should be placed within a context.
- Similar values appear in similar contexts.
- What is a suitable context? The distribution an attribute value induces on the remaining attributes.

Intra-Attribute Value Distance
- The distance between two values is then defined as the Information Loss incurred about the other attributes if we merge these values.
- In the Movie example, Scorsese and Coppola are the most similar directors.
- The distance between tuples is the sum of the distances between their attributes.
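A small sketch of this intra-attribute distance; the helper names are mine, and the commented usage assumes the toy Movie-style tuples from the earlier matrix sketch:

```python
# Sketch of intra-attribute value distance (names are assumptions). Each value
# of attribute j is represented by the distribution it induces over the values
# of the remaining attributes; the distance between two values is the
# information loss (weighted JS divergence) incurred by merging them.
import numpy as np
from collections import Counter, defaultdict

def value_contexts(tuples, j):
    """Map each value v of attribute j to (p(v), its context distribution)."""
    n = len(tuples)
    counts = defaultdict(Counter)
    for t in tuples:
        for jj, v in enumerate(t):
            if jj != j:
                counts[t[j]][(jj, v)] += 1
    cols = sorted({key for c in counts.values() for key in c})
    ctx = {}
    for v, c in counts.items():
        d = np.array([c[key] for key in cols], float)
        ctx[v] = (sum(1 for t in tuples if t[j] == v) / n, d / d.sum())
    return ctx

def value_distance(ctx, v1, v2):
    def kl(p, q):
        m = p > 0
        return float((p[m] * np.log2(p[m] / q[m])).sum())
    p1, d1 = ctx[v1]
    p2, d2 = ctx[v2]
    w1, w2 = p1 / (p1 + p2), p2 / (p1 + p2)
    merged = w1 * d1 + w2 * d2
    return (p1 + p2) * (w1 * kl(d1, merged) + w2 * kl(d2, merged))

# With the toy Movie-style tuples from the matrix sketch above, directors that
# co-occur with the same actors and genres come out close:
#   ctx = value_contexts(tuples, 0)
#   value_distance(ctx, "Scorsese", "Coppola")   # small (here 0.0)
```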

Experiments - Algorithms
- Four algorithms are compared:
- ROCK, an agglomerative algorithm by Guha et al. [1999].
- COOLCAT, a scalable non-hierarchical algorithm most similar to LIMBO, by Barbará et al. [2002].
- STIRR, a dynamical-systems approach using a hypergraph of weighted attribute values, by Gibson et al. [1998].
- LIMBO. In addition to the space-bounded version, LIMBO was implemented in an accuracy-control version, where a distance threshold is imposed on the decision of merging two DCFs, set as a multiple φ of the average mutual information of all tuples. The two versions differ only in Phase 1.

Experiments – Data Sets
- The following data sets are used:
- Congressional Votes (435 boolean tuples on 16 issues, from 1984, classified as Democrat or Republican).
- Mushroom (8,124 tuples with 22 attributes, classified as poisonous or edible).
- Database and Theory bibliography (8,000 tuples on research papers with 4 attributes).
- Synthetic data sets (5,000 tuples, 10 attributes; DS5 and DS10 with 5 and 10 classes respectively).
- Web data (a set of authority web pages as tuples, with the hubs that link to them as attributes).

Experiments - Quality Measures
- Several measures are used to capture the subjectivity of clustering quality:
- Information Loss: the lower, the better.
- Category Utility: the difference between the expected number of correctly guessed attribute values with and without a clustering.
- Min Classification Error: for tuples that are already classified.
- Precision (P) and Recall (R): P measures the accuracy with which a cluster reproduces a class, and R the completeness with which this is done.
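One plausible way to instantiate the precision/recall bookkeeping (the paper's exact cluster-to-class matching may differ) is to match each cluster to its majority class:

```python
# Hedged sketch of precision/recall for clusters against known classes:
# each cluster is matched to its majority class; precision is the fraction of
# the cluster belonging to that class, recall the fraction of that class
# captured by the cluster, both averaged over the clusters.
from collections import Counter

def precision_recall(class_labels, cluster_labels):
    clusters = sorted(set(cluster_labels))
    per_cluster = []
    for c in clusters:
        members = [cls for cls, cl in zip(class_labels, cluster_labels) if cl == c]
        majority, hits = Counter(members).most_common(1)[0]
        class_size = sum(1 for cls in class_labels if cls == majority)
        per_cluster.append((hits / len(members), hits / class_size))
    P = sum(p for p, _ in per_cluster) / len(per_cluster)
    R = sum(r for _, r in per_cluster) / len(per_cluster)
    return P, R

print(precision_recall(["dem", "dem", "rep", "rep"], [0, 0, 0, 1]))  # (~0.83, 0.75)
```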

Quality-Efficiency trade-offs for LIMBO
- Whether controlling the size (S) or the accuracy (φ) of the model, there is a trade-off between expressiveness (large S, small φ) and compactness (small S, large φ).
- Results are reported for branching factor B = 4.
- For large S and small φ, the bottleneck is Phase 2.

Quality-Efficiency trade-offs for LIMBO
- Still, in Phase 1 we can obtain significant compression of the data sets at no expense in final quality.
- This consistency can be attributed in part to the effect of Phase 3, which assigns tuples to cluster representatives.
- Even for large values of φ and small values of S, LIMBO obtains essentially the same clustering quality as AIB, but in linear time.

Comparative Evaluations
- The tables show the results for all algorithms and all quality measures on the Votes and Mushroom data sets.
- LIMBO's quality is superior to that of ROCK and COOLCAT.
- COOLCAT comes closest to LIMBO.

Web Data
- The authorities are clustered into three clusters with information loss 61%.
- LIMBO accurately characterizes the structure of the web graph.
- The three clusters correspond to different viewpoints (pro, against, irrelevant).

Scalability Evaluation
- Four data sets of size 500K, 1M, 5M, and 10M tuples (10 clusters, 10 attributes each).
- Phase 1 is examined in detail for LIMBO(φ).
- For 1.0 < φ < 1.5, the summary has a manageable size and execution time is fast.

Scalability Evaluation
- We set φ = 1.2, 1.3 and S = 1MB, 5MB.
- Time scales linearly with the data set size.
- The number of attributes was also varied; the behavior remains linear.

Scalability - Quality Results
- The quality measures remain the same across the different data set sizes.

Conclusions
- LIMBO has advantages over other information-theoretic clustering algorithms in terms of scalability and quality.
- LIMBO is the only hierarchical, scalable categorical clustering algorithm, and it is based on a compact summary model.

Main Reference
- P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik. LIMBO: Scalable Clustering of Categorical Data. In 9th International Conference on Extending Database Technology (EDBT), Heraklion, Greece, 2004.

References
- D. Barbará, J. Couto, and Y. Li. COOLCAT: An Entropy-Based Algorithm for Categorical Clustering. In CIKM, McLean, VA, 2002.
- D. Gibson, J. M. Kleinberg, and P. Raghavan. Clustering Categorical Data: An Approach Based on Dynamical Systems. In VLDB, New York, NY, 1998.
- S. Guha, R. Rastogi, and K. Shim. ROCK: A Robust Clustering Algorithm for Categorical Attributes. In ICDE, Sydney, Australia, 1999.
- C. Shannon. A Mathematical Theory of Communication, 1948.
- N. Slonim and N. Tishby. Agglomerative Information Bottleneck. In NIPS, Breckenridge, 1999.
- N. Tishby, F. C. Pereira, and W. Bialek. The Information Bottleneck Method. In 37th Annual Allerton Conference on Communication, Control and Computing, Urbana-Champaign, IL, 1999.