Clustering I. Data Mining, Soongsil University
What is clustering?
What is a natural grouping among these objects?
What is a natural grouping among these objects? Clustering is subjective
What is Similarity? The quality or state of being similar; likeness, resemblance, as in a similarity of features. Similarity is hard to define, but we know it when we see it. The real meaning of similarity is a philosophical question; we will take a more pragmatic approach.
Defining Distance Measures Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1,O2)
Unsupervised Learning: Clustering. [Figure: unseen data fed into a black box that produces the clusters]
[Figure: 2-dimensional clustering showing three data clusters; axes are Income vs. Age]
What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups Inter-cluster distances are maximized Intra-cluster distances are minimized
What Is A Good Clustering? High intra-class similarity and low inter-class similarity, depending on the similarity measure used. The ability to discover some or all of the hidden patterns.
Requirements of Clustering Scalability Ability to deal with various types of attributes Discovery of clusters with arbitrary shape Minimal requirements for domain knowledge to determine input parameters
Requirements of Clustering: Ability to deal with noise and outliers. Insensitivity to the order of input records. Ability to handle high dimensionality. Incorporation of user-specified constraints. Interpretability and usability.
A technique demanded by many real-world tasks. Biology: taxonomy of living things such as kingdom, phylum, class, order, family, genus and species. Information retrieval: document/multimedia data clustering. Land use: identification of areas of similar land use in an earth observation database. Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs. City-planning: identify groups of houses according to their house type, value, and geographical location. Earthquake studies: observed earthquake epicenters should be clustered along continent faults. Climate: understand the earth's climate, find patterns in atmospheric and ocean data. Social network mining: special interest group discovery.
Data Matrix. For memory-based clustering; also called object-by-variable structure. Represents n objects with p variables (attributes, measures); a relational table.
Dissimilarity Matrix. For memory-based clustering; also called object-by-object structure. Stores the proximities of pairs of objects, where d(i,j) is the dissimilarity between objects i and j; values are nonnegative, and values close to 0 mean the objects are similar.
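To make the two structures concrete, here is a minimal sketch (with made-up values and variable names, not from the slides) that turns a small data matrix into its dissimilarity matrix using Euclidean distance:

```python
import numpy as np

# Toy data matrix: n = 4 objects described by p = 2 numeric variables
# (the values are invented purely for illustration).
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [8.0, 8.0],
              [9.0, 9.5]])

# Dissimilarity matrix: D[i, j] = Euclidean distance between objects i and j.
# It is symmetric, nonnegative, and zero on the diagonal; values near 0 mean similar objects.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

print(np.round(D, 2))
```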
How Good Is A Clustering? Dissimilarity/similarity depends on distance function Different applications have different functions Judgment of clustering quality is typically highly subjective
Types of Attributes There are different types of attributes Nominal Examples: ID numbers, eye color, zip codes Ordinal Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} Interval Examples: calendar dates, temperatures in Celsius or Fahrenheit. Ratio Examples: length, time, counts
Types of Data in Clustering Interval-scaled variables Binary variables Nominal, ordinal, and ratio variables Variables of mixed types
Similarity and Dissimilarity Between Objects. Distances are the normally used measures. Minkowski distance is a generalization: d(i,j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q). If q = 2, d is the Euclidean distance; if q = 1, d is the Manhattan distance. A weighted distance multiplies each term by a nonnegative weight.
Properties of Minkowski Distance. Nonnegative: d(i,j) ≥ 0. The distance of an object to itself is 0: d(i,i) = 0. Symmetric: d(i,j) = d(j,i). Triangular inequality: d(i,j) ≤ d(i,k) + d(k,j).
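A small sketch of the Minkowski distance and its common special cases, assuming plain numeric attributes; the helper name and sample points are mine, not from the slides:

```python
import numpy as np

def minkowski(x, y, q=2, w=None):
    """Minkowski distance between two p-dimensional points.
    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
    Optional nonnegative weights w give a weighted distance."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = np.ones_like(x) if w is None else np.asarray(w, float)
    return float((w * np.abs(x - y) ** q).sum() ** (1.0 / q))

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(a, b, q=1))                 # Manhattan: 3 + 4 + 0 = 7
print(minkowski(a, b, q=2))                 # Euclidean: sqrt(9 + 16) = 5
print(minkowski(a, b, q=2, w=[1, 1, 10]))   # weighted Euclidean
```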
Categories of Clustering Approaches (1) Partitioning algorithms Partition the objects into k clusters Iteratively reallocate objects to improve the clustering Hierarchy algorithms Agglomerative: each object is a cluster, merge clusters to form larger ones Divisive: all objects are in a cluster, split it up into smaller clusters
Partitional Clustering. [Figure: original points and a partitional clustering of them]
Hierarchical Clustering. [Figures: traditional hierarchical clustering with its traditional dendrogram; non-traditional hierarchical clustering with its non-traditional dendrogram]
Categories of Clustering Approaches (2) Density-based methods Based on connectivity and density functions Filter out noise, find clusters of arbitrary shape Grid-based methods Quantize the object space into a grid structure Model-based Use a model to find the best fit of data
Partitioning Algorithms: Basic Concepts. Partition n objects into k clusters. Optimize the chosen partitioning criterion. Global optimal: examine all (k^n - (k-1)^n - ... - 1) possible partitions; too expensive! Heuristic methods: k-means and k-medoids. K-means: a cluster is represented by its center. K-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster.
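As a side note on why exhaustive search is hopeless: the exact number of ways to split n objects into k non-empty clusters is the Stirling number of the second kind, which grows roughly like k^n / k!. The snippet below is standard combinatorics, not taken from the slides:

```python
from math import comb, factorial

def stirling2(n, k):
    """Number of ways to partition n objects into k non-empty clusters
    (Stirling number of the second kind, by inclusion-exclusion)."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1)) // factorial(k)

for n in (10, 20, 50):
    print(n, stirling2(n, 3))   # already astronomically large for modest n
```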
Overview of K-Means Clustering. K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters. Algorithm: initialize K cluster centers randomly, then repeat until convergence: Cluster assignment step: assign each data point x to the cluster X_l whose center is nearest to x in L2 distance. Center re-estimation step: re-estimate each cluster center as the mean of the points in that cluster.
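A minimal NumPy sketch of these two alternating steps (random initialization, L2 assignment, mean re-estimation). The function and variable names, the convergence test, and the toy data are my own illustrative choices:

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize K cluster centers by picking K distinct data points at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Cluster assignment step: each point goes to the nearest center (L2 distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Center re-estimation step: each center becomes the mean of its points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # converged: centers stopped moving
            break
        centers = new_centers
    return centers, labels

# Hypothetical usage on two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centers, labels = k_means(X, k=2)
```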
K-Means Objective Function. Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers: J = sum over clusters l of sum over points x in X_l of ||x - mu_l||^2, where mu_l is the center of cluster X_l. Initialization of the K cluster centers: totally random; random perturbation from the global mean; a heuristic to ensure well-separated centers. Source: J. Ye 2006
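The quantity being minimized is often called the SSE (or inertia); a one-line computation of it, assuming arrays shaped like those returned by the K-Means sketch above:

```python
import numpy as np

def kmeans_sse(X, centers, labels):
    """Sum of squared L2 distances from each point to its assigned cluster center."""
    return float(((X - centers[labels]) ** 2).sum())
```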
K Means Example
K Means Example Randomly Initialize Means
Semi-Supervised Clustering Example [figure omitted]
Second Semi-Supervised Clustering Example [figure omitted]
Pros and Cons of K-means. Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n. Often terminates at a local optimum. Applicable only when the mean is defined; what about categorical data? Needs the number of clusters to be specified. Unable to handle noisy data and outliers. Unsuitable for discovering clusters with non-convex shapes.
Variations of the K-means Aspects of variations Selection of the initial k means Dissimilarity calculations Strategies to calculate cluster means Handling categorical data: k-modes Use mode instead of mean Mode: the most frequent item(s) A mixture of categorical and numerical data: k-prototype method
Categorical Values. Handling categorical data: k-modes (Huang'98). Replaces the means of clusters with modes. Mode of an attribute: its most frequent value. Mode of a set of instances: each attribute takes its most frequent value. K-modes is analogous to K-means, using a frequency-based method to update the modes of clusters. A mixture of categorical and numerical data: the k-prototype method.
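A small sketch of the frequency-based mode update for categorical data; the attribute values are invented and the helper name is mine:

```python
import numpy as np
from collections import Counter

def cluster_mode(rows):
    """Mode of a set of categorical instances: for each attribute,
    take its most frequent value within the cluster."""
    rows = np.asarray(rows, dtype=object)
    return [Counter(rows[:, j]).most_common(1)[0][0] for j in range(rows.shape[1])]

cluster = [["red",  "small", "round"],
           ["red",  "large", "round"],
           ["blue", "small", "round"]]
print(cluster_mode(cluster))   # ['red', 'small', 'round']
```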
A Problem of K-means: sensitive to outliers. An outlier is an object with extremely large values, which may substantially distort the distribution of the data. K-medoids: use the most centrally located object in a cluster instead of the mean. [Figure omitted: effect of an outlier on the cluster center]
PAM: A K-medoids Method. PAM (Partitioning Around Medoids): arbitrarily choose k objects as the initial medoids; then, until no change occurs: (re)assign each object to the cluster with the nearest medoid; randomly select a non-medoid object o' and compute the total cost S of swapping a medoid o with o'; if S < 0, swap o with o' to form the new set of k medoids.
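A compact sketch of this swap-based loop: assign each object to its nearest medoid, try a random (medoid, non-medoid) swap, and keep the swap only when it lowers the total cost. It uses plain Euclidean distance, runs a fixed number of random trials instead of looping until no change, and all names are illustrative:

```python
import numpy as np

def total_cost(X, medoids):
    """Sum of distances from every object to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=-1)
    return d.min(axis=1).sum()

def pam(X, k, n_trials=200, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # arbitrary initial medoids
    cost = total_cost(X, medoids)
    for _ in range(n_trials):
        o_new = int(rng.choice([i for i in range(len(X)) if i not in medoids]))  # non-medoid o'
        o_old = int(rng.choice(medoids))                                         # medoid o to swap out
        candidate = [o_new if m == o_old else m for m in medoids]
        s = total_cost(X, candidate) - cost      # S: change in total cost caused by the swap
        if s < 0:                                # keep only improving swaps
            medoids, cost = candidate, cost + s
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=-1)
    return medoids, d.argmin(axis=1)             # final medoids and cluster assignment
```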
K-Medoids example. Data: 1, 2, 6, 7, 8, 10, 15, 17, 20; break into 3 clusters with initial medoids 6, 7, 8.
Initial assignment: Cluster 6: 1, 2; Cluster 7: (only the medoid); Cluster 8: 10, 15, 17, 20.
Random non-medoid: 15 replaces 7 (total cost = -13). Cluster 6: 1 (cost 0), 2 (cost 0), 7 (1-0 = 1); Cluster 8: 10 (cost 0); new Cluster 15: 17 (cost 2-9 = -7), 20 (cost 5-12 = -7).
Replace medoid 7 with the new medoid 15 and reassign: Cluster 6: 1, 2, 7; Cluster 8: 10; Cluster 15: 17, 20.
K-Medoids example (continued).
Random non-medoid: 1 replaces 6 (total cost = 2). Cluster 8: 7 (cost 6-1 = 5), 10 (cost 0); Cluster 15: 17 (cost 0), 20 (cost 0); new Cluster 1: 2 (cost 1-4 = -3).
2 replaces 6 (total cost = 1). Don't replace medoid 6: Cluster 6: 1, 2, 7; Cluster 8: 10; Cluster 15: 17, 20.
Random non-medoid: 7 replaces 6 (total cost = 2). Cluster 8: 10 (cost 0); Cluster 15: 17 (cost 0), 20 (cost 0); new Cluster 7: 6 (cost 1-0 = 1), 2 (cost 5-4 = 1).
K-Medoids example (continued).
Don't replace medoid 6: Cluster 6: 1, 2, 7; Cluster 8: 10; Cluster 15: 17, 20.
Random non-medoid: 10 replaces 8 (total cost = 2), don't replace. Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 0); Cluster 15: 17 (cost 0), 20 (cost 0); new Cluster 10: 8 (cost 2-0 = 2).
Random non-medoid: 17 replaces 15 (total cost = 0), don't replace. Cluster 8: 10 (cost 0); new Cluster 17: 15 (cost 2-0 = 2), 20 (cost 3-5 = -2).
K-Medoids example (continued).
Random non-medoid: 20 replaces 15 (total cost = 6), don't replace. Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 0); Cluster 8: 10 (cost 0); new Cluster 20: 15 (cost 5-0 = 5), 17 (cost 3-2 = 1).
Other possible changes (1 replaces 15, 2 replaces 15, 1 replaces 8, ...) all have high costs.
No changes remain, so the final clusters are: Cluster 6: 1, 2, 7; Cluster 8: 10; Cluster 15: 17, 20.
Semi-Supervised Clustering
Outline Overview of clustering and classification What is semi-supervised learning? Semi-supervised clustering Semi-supervised classification What is semi-supervised clustering? Why semi-supervised clustering? Semi-supervised clustering algorithms Source: J. Ye 2006
Supervised classification versus unsupervised clustering Unsupervised clustering Group similar objects together to find clusters Minimize intra-class distance Maximize inter-class distance Supervised classification Class label for each training sample is given Build a model from the training data Predict class label on unseen future data points Source: J. Ye 2006
What is clustering? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups Inter-cluster distances are maximized Intra-cluster distances are minimized Source: J. Ye 2006
What is Classification? Given training samples with known class labels, build a model that predicts the class label of unseen data points. Source: J. Ye 2006
Clustering algorithms K-Means Hierarchical clustering Graph based clustering (Spectral clustering) Bi-clustering Source: J. Ye 2006
Classification algorithms K-Nearest-Neighbor classifiers Naïve Bayes classifier Linear Discriminant Analysis (LDA) Support Vector Machines (SVM) Logistic Regression Neural Networks Source: J. Ye 2006
Supervised Classification Example [figure omitted]
Unsupervised Clustering Example [figure omitted]
Semi-Supervised Learning. Combines labeled and unlabeled data during training to improve performance. Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier. Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data. Spectrum: unsupervised clustering, semi-supervised learning, supervised classification.
Semi-Supervised Classification Example [figure omitted]
Semi-Supervised Classification. Algorithms: semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00]; co-training [Blum:COLT98]; transductive SVMs [Vapnik:98, Joachims:ICML99]; graph-based algorithms. Assumptions: a known, fixed set of categories is given in the labeled data; the goal is to improve classification of examples into these known categories.
Semi-supervised clustering: problem definition Input: A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical) A small amount of domain knowledge Output: A partitioning of the objects into k clusters (possibly with some discarded as outliers) Objective: Maximum intra-cluster similarity Minimum inter-cluster similarity High consistency between the partitioning and the domain knowledge
Why semi-supervised clustering? Why not clustering? The clusters produced may not be the ones required. Sometimes there are multiple possible groupings. Why not classification? Sometimes there are insufficient labeled data. Potential applications Bioinformatics (gene and protein clustering) Document hierarchy construction News/email categorization Image categorization
Semi-Supervised Clustering Domain knowledge Partial label information is given Apply some constraints (must-links and cannot-links) Approaches Search-based Semi-Supervised Clustering Alter the clustering algorithm using the constraints Similarity-based Semi-Supervised Clustering Alter the similarity measure based on the constraints Combination of both
Search-Based Semi-Supervised Clustering. Alter the clustering algorithm that searches for a good partitioning by: modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99]; enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]; using the labeled data to initialize clusters in an iterative refinement algorithm (e.g., K-Means) [Basu:ICML02]. Source: J. Ye 2006
K Means Example Assign Points to Clusters
K Means Example Re-estimate Means
K Means Example Re-assign Points to Clusters
K Means Example Re-estimate Means
K Means Example Re-assign Points to Clusters
K Means Example Re-estimate Means and Converge
Semi-Supervised K-Means. If partial label information is given: Seeded K-Means, Constrained K-Means. If constraints (must-link, cannot-link) are given: COP K-Means.
Semi-Supervised K-Means for partially labeled data Seeded K-Means: Labeled data provided by user are used for initialization: initial center for cluster i is the mean of the seed points having label i. Seed points are only used for initialization, and not in subsequent steps. Constrained K-Means: Labeled data provided by user are used to initialize K-Means algorithm. Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated.
Seeded K-Means Use labeled data to find the initial centroids and then run K-Means. The labels for seeded points may change. Source: J. Ye 2006
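A minimal sketch of Seeded K-Means under these assumptions (clusters are numbered 0..k-1, every cluster has at least one seed, and all names are mine): the seeds only set the initial centers, and afterwards every point, seed or not, is free to move:

```python
import numpy as np

def seeded_kmeans(X, k, seed_idx, seed_labels, n_iters=100):
    """Seeded K-Means: labeled seed points fix only the initial centers."""
    seed_idx, seed_labels = np.asarray(seed_idx), np.asarray(seed_labels)
    # Initial center of cluster j = mean of the seed points labeled j.
    centers = np.array([X[seed_idx][seed_labels == j].mean(axis=0) for j in range(k)])
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)          # seeds are NOT clamped: their labels may change
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```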
Seeded K-Means Example
Seeded K-Means Example Initialize Means Using Labeled Data
Seeded K-Means Example Assign Points to Clusters
Seeded K-Means Example Re-estimate Means
Seeded K-Means Example: assign points to clusters and converge (note that the label of one seeded point has changed).
Constrained K-Means Use labeled data to find the initial centroids and then run K-Means. The labels for seeded points will not change. Source: J. Ye 2006
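The constrained variant differs in one line only: after each assignment step the seed labels are reset to their given values, so seeds never change cluster. A sketch under the same assumptions as the seeded version above:

```python
import numpy as np

def constrained_kmeans(X, k, seed_idx, seed_labels, n_iters=100):
    """Constrained K-Means: seeds fix the initial centers AND keep their labels."""
    seed_idx, seed_labels = np.asarray(seed_idx), np.asarray(seed_labels)
    centers = np.array([X[seed_idx][seed_labels == j].mean(axis=0) for j in range(k)])
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        labels[seed_idx] = seed_labels     # clamp: seed labels never change
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```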
Constrained K-Means Example
Constrained K-Means Example Initialize Means Using Labeled Data
Constrained K-Means Example Assign Points to Clusters
Constrained K-Means Example Re-estimate Means and Converge
COP K-Means. COP K-Means [Wagstaff et al.: ICML01] is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points. Initialization: cluster centers are chosen randomly, but as each one is chosen, any must-link constraints it participates in are enforced (so that the linked points cannot later be chosen as the center of another cluster). Algorithm: during the cluster assignment step of COP-K-Means, a point is assigned to its nearest cluster without violating any of its constraints; if no such assignment exists, abort. Source: J. Ye 2006
COP K-Means Algorithm
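The original pseudocode is not reproduced here; below is a greedy sketch of just the constrained assignment step, assuming constraints are given as lists of index pairs and that points are processed in order (the full algorithm would alternate this step with the usual center re-estimation). Function names are mine:

```python
import numpy as np

def violates(point, cluster, labels, must_link, cannot_link):
    """Would putting `point` into `cluster` break a constraint, given the
    assignments made so far (label -1 means 'not assigned yet')?"""
    for a, b in must_link:
        other = b if a == point else (a if b == point else None)
        if other is not None and labels[other] not in (-1, cluster):
            return True
    for a, b in cannot_link:
        other = b if a == point else (a if b == point else None)
        if other is not None and labels[other] == cluster:
            return True
    return False

def cop_assign(X, centers, must_link, cannot_link):
    """COP-K-Means assignment step: each point goes to the nearest cluster that
    violates none of its constraints; return None (abort) if that is impossible."""
    labels = np.full(len(X), -1)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    for i in range(len(X)):
        for c in np.argsort(d[i]):                    # try the nearest clusters first
            if not violates(i, int(c), labels, must_link, cannot_link):
                labels[i] = c
                break
        else:
            return None                               # no feasible cluster exists: abort
    return labels
```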
Illustration: determining the label of a new point. A must-link constraint forces it to be assigned to the red class.
Illustration: determining the label of a new point. A cannot-link constraint rules out the other class, so it is assigned to the red class.
Illustration: determining the label of a new point. With conflicting must-link and cannot-link constraints, no valid assignment exists and the clustering algorithm fails.
Summary. Seeded and Constrained K-Means: partially labeled data. COP K-Means: constraints (must-link and cannot-link). Constrained K-Means and COP K-Means require all the constraints to be satisfied, so they may not be effective if the seeds contain noise. Seeded K-Means uses the seeds only in the first step to determine the initial centroids, so it is less sensitive to noise in the seeds. Experiments show that semi-supervised K-Means outperforms traditional K-Means.
References
Ye, Jieping. Introduction to Data Mining. Department of Computer Science and Engineering, Arizona State University, 2006.
Clifton, Chris. Introduction to Data Mining. Purdue University, 2006.
Zhu, Xingquan and Davidson, Ian. Knowledge Discovery and Data Mining, 2007.