Clustering… in General
In vector space, a cluster is the set of vectors found within a distance threshold of a cluster vector; techniques differ in how they determine the cluster vector and the threshold.
Clustering is unsupervised pattern classification.
Unsupervised means there is no correct answer and no feedback.
Patterns are typically samples of feature vectors or matrices.
Classification means collecting the samples into groups of similar members.
Clustering Decisions
Pattern representation: feature selection (e.g., stop word removal, stemming); number of categories.
Pattern proximity: distance measure on pairs of patterns.
Grouping: characteristics of clusters (e.g., fuzzy, hierarchical).
Clustering algorithms embody different assumptions about these decisions and about the form of clusters.
Formal Definitions
A feature vector x is a single datum of d measurements.
Hard clustering techniques assign a single class label to each x; cluster membership is mutually exclusive.
Fuzzy clustering techniques assign each x a fractional degree of membership in each label.
Proximity Measures
Generally, use Euclidean distance or mean squared distance.
In IR, use the similarity measure from retrieval (e.g., the cosine measure for TFIDF vectors).
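The two measures named above can be sketched in a few lines. This is a minimal illustration, not the slides' own code; the function names are hypothetical, and feature vectors are assumed to be plain sequences of numbers of equal length.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)
```

Note the opposite orientations: a smaller Euclidean distance means closer patterns, while a larger cosine value means more similar patterns, so an algorithm that looks for the "closest" pair has to minimize one and maximize the other.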
Taxonomy of Clustering [Jain, Murty & Flynn]
Clustering
  Hierarchical: Single Link, Complete Link, HAC
  Partitional: Square Error (k-means), Graph Theoretic, Mixture Resolving (Expectation Maximization), Mode Seeking
Clustering Issues
Agglomerative: begin with each sample in its own cluster and merge.
Divisive: begin with a single cluster and split.
Hard: mutually exclusive cluster membership.
Fuzzy: degrees of membership in clusters.
Deterministic vs. stochastic.
Incremental: samples may be added to clusters.
Batch: clusters are created over the entire sample space.
Hierarchical Algorithms
Produce a hierarchy of classes (a taxonomy) ranging from singleton clusters to a single all-inclusive cluster.
Select a level of the hierarchy from which to extract the cluster set.
The representation is a dendrogram.
Complete-Link Revisited
Used to create a statistical thesaurus.
Agglomerative, hard, deterministic, batch.
1. Start with one cluster per sample.
2. Find the two clusters with the lowest distance.
3. Merge the two clusters and add the merge to the hierarchy.
4. Repeat from 2 until a termination criterion is met or all clusters have merged.
Single-Link
Like complete-link, except that the distance between two clusters is the minimum of the distances between all pairs of samples drawn from the two clusters (complete-link uses the maximum).
Single-link suffers from a chaining effect that yields elongated clusters, but it can construct more complex shapes.
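The four complete-link steps and the single-link variation can be sketched as one routine with a switchable linkage. This is a naive illustration under stated assumptions: Euclidean distance between samples, a target cluster count as the termination criterion, and hypothetical names (`cluster_distance`, `agglomerate`). It recomputes all distances on every pass, so it is only suitable for small examples.

```python
import math
from itertools import combinations

def cluster_distance(a, b, linkage):
    """Distance between clusters a and b, each a list of points."""
    pair_distances = [math.dist(p, q) for p in a for q in b]
    # Single-link takes the closest pair, complete-link the farthest pair.
    return min(pair_distances) if linkage == "single" else max(pair_distances)

def agglomerate(points, linkage="complete", num_clusters=1):
    """Merge the two closest clusters until num_clusters remain.

    Returns (clusters, merges), where merges records each merged pair in order.
    """
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    clusters = [[p] for p in points]          # step 1: one cluster per sample
    merges = []
    while len(clusters) > num_clusters:       # step 4: termination criterion
        # Step 2: find the pair of clusters with the lowest distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        # Step 3: merge the pair and record the merge in the hierarchy.
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges
```

On the one-dimensional samples 0, 1, 2.5 and 4.5, asking for two clusters gives {0, 1, 2.5} and {4.5} under single-link (the chain keeps growing) but {0, 1} and {2.5, 4.5} under complete-link.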
Example: Plot
Example: Proximity Matrix
[Proximity matrix with rows and columns indexed by the samples (21,15), (26,25), (29,22), (31,15), (21,27), (23,32), (29,26), (33,21); cell values missing.]
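A proximity matrix over these eight samples can be recomputed as follows. The slide does not say which measure it used, so Euclidean distance is an assumption here, and `proximity_matrix` is a hypothetical helper name.

```python
import math

# The eight samples that index the rows and columns of the matrix.
samples = [(21, 15), (26, 25), (29, 22), (31, 15),
           (21, 27), (23, 32), (29, 26), (33, 21)]

def proximity_matrix(points):
    """Symmetric matrix of pairwise Euclidean distances (assumed measure)."""
    return [[math.dist(p, q) for q in points] for p in points]

matrix = proximity_matrix(samples)

# Print one row per sample, distances rounded to one decimal place.
for point, row in zip(samples, matrix):
    print(point, " ".join(f"{d:5.1f}" for d in row))
```

The diagonal is zero and the matrix is symmetric, so an implementation only needs to store one triangle.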
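Complete-Link Solution
[Figure: complete-link dendrogram over the 16 samples (1,28), (4,9), (9,16), (13,18), (21,15), (29,22), (31,15), (33,21), (35,35), (42,45), (45,42), (46,30), (23,32), (21,27), (29,26), (26,25), with merged clusters labelled C1 to C15.]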
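Single-Link Solution
[Figure: single-link dendrogram over the same 16 samples (1,28), (4,9), (9,16), (13,18), (21,15), (29,22), (31,15), (33,21), (35,35), (42,45), (45,42), (46,30), (23,32), (21,27), (29,26), (26,25), with merged clusters labelled C1 to C15.]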
Hierarchical Agglomerative Clustering (HAC)
Agglomerative, hard, deterministic, batch.
1. Start with one cluster per sample and compute a proximity matrix between pairs of clusters.
2. Merge the most similar pair of clusters and update the proximity matrix.
3. Repeat 2 until all clusters have merged.
Variants differ in how the proximity matrix is updated.
Can combine the benefits of the single-link and complete-link algorithms.
HAC for IR
Intra-cluster similarity:
$\mathrm{Sim}(X) = \sum_{d \in X} \cos(d, c)$
where S is the set of TFIDF vectors for the documents (X is a subset of S), c is the centroid of cluster X, and d is a document in X.
Proximity is the similarity of all documents in the cluster to the cluster centroid.
Merge the pair of clusters that produces the smallest decrease in similarity: if merge(X, Y) => Z, choose the pair that maximizes Sim(Z) - (Sim(X) + Sim(Y)).
HAC for IR: Alternatives
Centroid similarity: cosine similarity between the centroids of the two clusters.
UPGMA (unweighted pair group method with arithmetic mean).
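The three merge criteria for document clustering (intra-cluster similarity, centroid similarity, UPGMA) can be compared side by side in a short sketch. Assumptions: documents are dense TFIDF vectors held as lists of floats, clusters are non-empty lists of such vectors, all function names are hypothetical, and UPGMA is taken in its usual form as the average cosine similarity over all cross-cluster document pairs, which the slide itself does not spell out.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(cluster):
    """Component-wise mean of the document vectors in a cluster."""
    return [sum(column) / len(cluster) for column in zip(*cluster)]

def intra_cluster_similarity(cluster):
    """Sum of the cosine similarities of each document to the centroid."""
    c = centroid(cluster)
    return sum(cosine(d, c) for d in cluster)

def intra_cluster_gain(x, y):
    """Sim(Z) - (Sim(X) + Sim(Y)) for Z = merge(X, Y); merge the pair that maximizes it."""
    return intra_cluster_similarity(x + y) - (
        intra_cluster_similarity(x) + intra_cluster_similarity(y)
    )

def centroid_similarity(x, y):
    """Cosine similarity between the two cluster centroids."""
    return cosine(centroid(x), centroid(y))

def upgma_similarity(x, y):
    """Average cosine similarity over all cross-cluster document pairs."""
    return sum(cosine(d1, d2) for d1 in x for d2 in y) / (len(x) * len(y))
```

Each function scores a candidate pair of clusters; an agglomerative loop would evaluate the chosen score for every pair and merge the pair with the highest value.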
Partitional Algorithms
Result in a set of unrelated clusters.
Issues:
How many clusters are enough?
How should the space of possible partitions be searched?
What is an appropriate clustering criterion?
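K Means
The number of clusters, k, is set by the user.
Non-deterministic.
The clustering criterion is squared error:
$e^2(S, L) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \lVert x_i^{(j)} - c_j \rVert^2$
where S is the document set, L is a clustering, K is the number of clusters, $x_i^{(j)}$ is the ith document in the jth cluster, $c_j$ is the centroid of the jth cluster, and $n_j$ is the size of the jth cluster.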
k-Means Clustering Algorithm
1. Randomly select k samples as cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. Recompute centroids.
4. If the convergence criterion (e.g., minimal decrease in error or no change in cluster composition) is not met, return to 2.
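The four steps above can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes Euclidean distance, uses "no change in cluster composition" as the convergence criterion, and adds a hypothetical `seeds` parameter so that the initial centroids can be fixed for reproducible runs. The names `k_means`, `centroid` and `squared_error` are likewise illustrative.

```python
import math
import random

def centroid(cluster):
    """Component-wise mean of the points in a non-empty cluster."""
    return tuple(sum(column) / len(cluster) for column in zip(*cluster))

def squared_error(clusters, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(math.dist(x, c) ** 2
               for cluster, c in zip(clusters, centroids) for x in cluster)

def k_means(samples, k, seeds=None, max_iterations=100, rng=None):
    """Return (clusters, centroids) after iterating to a stable assignment."""
    if seeds is not None and len(seeds) != k:
        raise ValueError("seeds must contain exactly k points")
    rng = rng or random.Random()
    # Step 1: pick k samples as initial centroids (or use the given seeds).
    centroids = list(seeds) if seeds is not None else rng.sample(samples, k)
    assignment = None
    for _ in range(max(1, max_iterations)):
        # Step 2: assign each sample to the closest centroid.
        new_assignment = [min(range(k), key=lambda j: math.dist(x, centroids[j]))
                          for x in samples]
        # Step 4: stop when cluster composition no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute centroids; an empty cluster keeps its old centroid.
        clusters = [[x for x, a in zip(samples, assignment) if a == j]
                    for j in range(k)]
        centroids = [centroid(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    clusters = [[x for x, a in zip(samples, assignment) if a == j]
                for j in range(k)]
    return clusters, centroids
```

With the four corner points (0,0), (0,1), (4,0), (4,1) and k=2, seeding with (0,0) and (4,0) converges to the left and right pairs with squared error 1, whereas seeding with (0,0) and (0,1) converges to the bottom and top pairs with squared error 16. Both are stable, which is why the result depends on the starting centroids.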
Example: K-Means Solutions
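k-Means Sensitivity to Initialization
[Figure: seven samples labelled A to G, clustered with K=3. The red solution started with A, D, F as initial centroids; the yellow solution started with A, B, C.]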
k-Means for IR
Update centroids incrementally.
Calculate the centroid as with hierarchical methods.
Can be refined into a divisive hierarchical method (bisecting k-means): start with a single cluster and repeatedly split with k-means until it forms k clusters with the highest summed similarities.
Other Types of Clustering Algorithms
Graph Theoretic: construct the minimal spanning tree and delete the edges with the largest lengths.
Expectation Maximization (EM): assume clusters are drawn from distributions, and use maximum likelihood to estimate the parameters of the distributions.
Nearest Neighbors: iteratively assign each sample to the cluster of its nearest labelled neighbor, so long as the distance is below a set threshold.
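The graph-theoretic approach can be sketched as below. Assumptions: Euclidean edge lengths over a complete graph, Prim's algorithm for the spanning tree, and a target number of clusters deciding how many of the longest edges to delete. The function name `mst_clusters` is hypothetical.

```python
import math

def mst_clusters(points, num_clusters):
    """Build a minimal spanning tree, drop its longest edges, return the components."""
    n = len(points)
    if not 1 <= num_clusters <= n:
        raise ValueError("num_clusters must be between 1 and len(points)")

    # Prim's algorithm: grow the tree from point 0, one cheapest edge at a time.
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known edge linking each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []                 # (length, u, v) for each tree edge
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u

    # Delete the (num_clusters - 1) longest edges by keeping only the shortest ones.
    edges.sort()
    kept = edges[: n - num_clusters]

    # Connected components of the remaining forest, via union-find.
    root = list(range(n))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]   # path halving
            i = root[i]
        return i

    for _, u, v in kept:
        root[find(u)] = find(v)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())
```

Deleting the longest tree edges is closely related to single-link clustering, so this method shares single-link's tendency to follow chains of nearby points.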
Comparison of Clustering Algorithms [Steinbach et al.]
Implement 3 versions of HAC and 2 versions of k-means.
Compare performance on documents hand-labelled as relevant to one of a set of classes.
Well-known data sets (TREC).
Found that UPGMA is the best of the hierarchical methods, but bisecting k-means seems to do better when considered over many runs.
M. Steinbach, G. Karypis, V. Kumar. A Comparison of Document Clustering Techniques, KDD Workshop on Text Mining, 2000.
Evaluation Metrics 1
Evaluation: how to measure cluster quality?
Entropy:
$E_j = -\sum_{i} p_{ij} \log p_{ij}$
$E_{CS} = \sum_{j=1}^{m} \frac{n_j E_j}{n}$
where $p_{ij}$ is the probability that a member of cluster j belongs to class i, $n_j$ is the size of cluster j, m is the number of clusters, n is the number of docs, and CS is a clustering solution.
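The entropy of a clustering solution can be computed from two parallel label lists, as in this sketch. Assumptions: `cluster_labels[t]` and `class_labels[t]` give the cluster and the hand-assigned class of document t, the probability that a member of cluster j belongs to class i is estimated as the fraction of cluster j carrying class i, and the natural logarithm is used since the slide does not fix a base. The function name is hypothetical.

```python
import math
from collections import Counter

def clustering_entropy(cluster_labels, class_labels):
    """Size-weighted average of per-cluster class entropy (lower is better)."""
    n = len(class_labels)
    # Group the class labels of the documents by the cluster they fell into.
    members = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        members.setdefault(cluster, []).append(cls)
    total = 0.0
    for classes in members.values():
        n_j = len(classes)
        # Entropy of the class distribution inside this cluster.
        e_j = -sum((count / n_j) * math.log(count / n_j)
                   for count in Counter(classes).values())
        # Weight each cluster's entropy by its share of the documents.
        total += n_j * e_j / n
    return total
```

A clustering whose clusters each contain a single class scores 0; two clusters that each split evenly between two classes score log 2.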
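Comparison Measure 2
F measure: combines precision and recall.
Treat each cluster as the result of a query and each class as the relevant set of docs.
$\mathrm{Recall}(i,j) = n_{ij} / n_i$, $\mathrm{Precision}(i,j) = n_{ij} / n_j$
$F(i,j) = \frac{2 \cdot \mathrm{Recall}(i,j) \cdot \mathrm{Precision}(i,j)}{\mathrm{Precision}(i,j) + \mathrm{Recall}(i,j)}$
$F = \sum_{i} \frac{n_i}{n} \max_{j} F(i,j)$
where $n_{ij}$ is the number of members of class i in cluster j, $n_j$ is the number of members of cluster j, $n_i$ is the number of members of class i, and n is the number of docs.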
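The overall F measure can be sketched from the same two parallel label lists used for entropy. Assumptions: each class is scored by its best-matching cluster and the class scores are averaged with weights proportional to class size, as in the cited Steinbach et al. comparison; the function name is hypothetical.

```python
from collections import Counter

def overall_f_measure(cluster_labels, class_labels):
    """Class-size-weighted average of each class's best F score (higher is better)."""
    n = len(class_labels)
    class_sizes = Counter(class_labels)                     # n_i
    cluster_sizes = Counter(cluster_labels)                 # n_j
    overlap = Counter(zip(class_labels, cluster_labels))    # n_ij
    total = 0.0
    for i, n_i in class_sizes.items():
        best = 0.0
        for j, n_j in cluster_sizes.items():
            n_ij = overlap[(i, j)]
            if n_ij == 0:
                continue                                    # F(i, j) is 0 here
            recall = n_ij / n_i
            precision = n_ij / n_j
            best = max(best, 2 * recall * precision / (recall + precision))
        total += (n_i / n) * best
    return total
```

A perfect clustering scores 1.0. Unlike entropy, the F measure penalizes splitting one class across many clusters, because recall drops for every fragment.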