Information Retrieval – Lecture 6. Introduction to Information Retrieval (Manning et al. 2007), Chapter 16. For the MSc Computer Science Programme. Dell Zhang, Birkbeck, University of London.
What is text clustering? Text clustering – grouping a set of documents into classes of similar documents. Classification vs. Clustering Classification: supervised learning Labeled data are given for training Clustering: unsupervised learning Only unlabeled data are available
Why text clustering? To improve the user interface Navigation/analysis of a corpus or of search results To improve recall Cluster docs in the corpus a priori. When a query matches a doc d, also return the other docs in the cluster containing d. The hope is that a query for “car” will then also return docs containing “automobile”. To improve retrieval speed Cluster Pruning
What makes a clustering good? External criteria Consistent with the latent classes in gold standard (ground truth) data. Internal criteria High intra-cluster similarity Low inter-cluster similarity
Issues for Clustering Similarity between docs Ideal: semantic similarity Practical: statistical similarity, e.g., cosine. Number of clusters Fixed in advance, e.g., k-means. Flexible, e.g., Single-Link HAC. Structure of clusters Flat partition, e.g., k-means. Hierarchical tree, e.g., Single-Link HAC.
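The cosine measure mentioned above can be sketched in a few lines. This is a minimal illustration (not from the slides), assuming docs are represented as sparse term-frequency dicts; the example documents are hypothetical:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two sparse term-frequency vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d1 = {"car": 2, "engine": 1}
d2 = {"automobile": 1, "engine": 1}
print(cosine_sim(d1, d2))  # ≈ 0.316: only "engine" is shared
```

Note that cosine is a purely statistical measure: d1 and d2 are about the same topic, but their similarity comes only from the shared term "engine" — "car" and "automobile" contribute nothing, which is exactly the gap between statistical and semantic similarity.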
k-means Algorithm
Pick k docs {s1, s2, …, sk} at random as seeds.
Repeat until the clustering converges (or another stopping criterion is met):
  For each doc di: assign di to the cluster cj such that sim(di, sj) is maximal.
  For each cluster cj: update sj to the centroid (mean) of cluster cj.
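The algorithm above can be sketched directly in Python. This is a minimal illustration, not the book's reference implementation; it assumes docs are dense vectors (lists of floats) and uses cosine similarity for the assignment step, as the slides suggest:

```python
import random

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def kmeans(docs, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Pick k docs at random as seeds.
    centroids = [list(s) for s in rng.sample(docs, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign each doc to the cluster whose centroid is most similar.
        new = [max(range(k), key=lambda j: cosine(d, centroids[j]))
               for d in docs]
        if new == assignment:   # converged: no doc changed cluster
            break
        assignment = new
        # Update each centroid to the mean of its cluster's docs.
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:         # keep the old centroid if a cluster emptied
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignment, centroids
```

For example, `kmeans([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], 2)` groups the first two docs into one cluster and the last two into the other.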
k-means – Example (k = 2): pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged! [figure: docs and centroids, marked ×, at each step]
Convergence k-means is guaranteed to converge, i.e., to reach a state in which the clusters no longer change: each reassignment and centroid-update step can only decrease (or leave unchanged) the total distance from docs to their centroids, and there are only finitely many ways to partition the docs, so the algorithm cannot cycle forever. In practice k-means usually converges quickly, i.e., the number of iterations is small in most cases.
Seeds
Problem: results can vary with the random seed selection. Some seeds lead to a poor convergence rate, or to convergence to a sub-optimal clustering.
Solution: run k-means multiple times with different random seed selections, and keep the best resulting clustering.
……
Example showing sensitivity to seeds: with docs A–F, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.
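The multiple-restart remedy can be sketched as follows. This is an illustrative implementation, not from the slides: it uses Euclidean distance and scores each run by its residual sum of squares (RSS), keeping the run with the lowest RSS; the function names are my own:

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(docs, k, rng, max_iter=100):
    """One k-means run from one random seeding."""
    centroids = [list(s) for s in rng.sample(docs, k)]
    assignment = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda j: sq_dist(d, centroids[j]))
               for d in docs]
        if new == assignment:   # converged
            break
        assignment = new
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignment, centroids

def kmeans_restarts(docs, k, n_restarts=10, seed=0):
    """Run k-means several times; keep the clustering with the lowest RSS."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        assignment, centroids = kmeans_once(docs, k, rng)
        rss = sum(sq_dist(d, centroids[a]) for d, a in zip(docs, assignment))
        if best is None or rss < best[0]:
            best = (rss, assignment, centroids)
    return best
```

Because each restart draws fresh seeds, a bad seeding like the D-and-F example above is simply outvoted: some other restart will find the lower-RSS clustering, and that one is returned.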
Take Home Message: k-means