Information Retrieval – Lecture 6
Based on Introduction to Information Retrieval (Manning et al. 2007), Chapter 16
For the MSc Computer Science Programme
Dell Zhang, Birkbeck, University of London
What is text clustering?
Text clustering: grouping a set of documents into classes of similar documents.
Classification vs. clustering:
- Classification is supervised learning: labeled data are given for training.
- Clustering is unsupervised learning: only unlabeled data are available.
Why text clustering?
- To improve the user interface: navigation/analysis of a corpus or of search results.
- To improve recall: cluster the docs in the corpus a priori; when a query matches a doc d, also return the other docs in the cluster containing d. The hope is that a query for "car" will then also return docs containing "automobile".
- To improve retrieval speed: cluster pruning.
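Cluster pruning can be sketched as follows: pick roughly √N docs at random as "leaders", attach every doc to its nearest leader, and at query time compare the query only to the leaders, then search within the best leader's cluster. The sketch below is a toy pure-Python illustration; the vectors and the `cosine` helper are assumptions, not from the lecture.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_index(docs):
    """Pick sqrt(N) leaders; assign every doc to its nearest leader."""
    random.seed(0)  # fixed seed so the sketch is reproducible
    leaders = random.sample(range(len(docs)), max(1, int(math.sqrt(len(docs)))))
    clusters = {l: [] for l in leaders}
    for i, d in enumerate(docs):
        best = max(leaders, key=lambda l: cosine(d, docs[l]))
        clusters[best].append(i)
    return clusters

def search(query, docs, clusters):
    """Compare the query to leaders only, then search the best cluster."""
    best_leader = max(clusters, key=lambda l: cosine(query, docs[l]))
    return max(clusters[best_leader], key=lambda i: cosine(query, docs[i]))
```

The speed-up comes from comparing the query to √N leaders plus one cluster, instead of all N docs, at the cost of possibly missing the true best match if it sits in another cluster.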
Example: http://clusty.com/ (a clustering search engine)
What makes a clustering good?
- External criteria: consistency with the latent classes in gold-standard (ground-truth) data.
- Internal criteria: high intra-cluster similarity and low inter-cluster similarity.
Issues for clustering
- Similarity between docs. Ideal: semantic similarity. Practical: statistical similarity, e.g., cosine.
- Number of clusters. Fixed, e.g., k-means. Flexible, e.g., single-link HAC.
- Structure of clusters. Flat partition, e.g., k-means. Hierarchical tree, e.g., single-link HAC.
The k-means algorithm
Pick k docs {s1, s2, …, sk} at random as seeds.
Repeat until the clustering converges (or another stopping criterion is met):
- For each doc di: assign di to the cluster cj such that sim(di, sj) is maximal.
- For each cluster cj: update sj to the centroid (mean) of cluster cj.
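The two steps above can be sketched in a few lines of Python. This is a minimal sketch over dense points using squared Euclidean distance for the assignment step (the lecture's doc vectors would use cosine similarity instead); the toy points in the test are assumptions, not from the slides.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: points is a list of equal-length tuples."""
    rng = random.Random(seed)
    seeds = rng.sample(points, k)  # pick k points at random as seeds
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, seeds[c])))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_seeds = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else seeds[j]
            for j, c in enumerate(clusters)
        ]
        if new_seeds == seeds:  # converged: no centroid moved
            break
        seeds = new_seeds
    return seeds, clusters
```

Each iteration costs O(N·k) distance computations, and the loop stops as soon as no centroid moves.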
k-means – Example (k = 2)
[Figure: iterations on 2D points — pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged.]
k-means – Example [figure only]
k-means – Online demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
Convergence
k-means is proven to converge, i.e., to reach a state in which the clusters no longer change. It usually converges quickly: the number of iterations is small in most cases.
The seeds problem
Results can vary with the random seed selection: some seeds lead to a poor convergence rate, or to convergence to a sub-optimal clustering.
Solution: run k-means multiple times with different random seed selections and keep the best result.
Example showing sensitivity to seeds (points A–F, in a figure not reproduced here): starting with B and E as centroids converges to {A, B, C} and {D, E, F}; starting with D and F converges to {A, B, D, E} and {C, F}.
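The multiple-restart remedy can be sketched by running k-means several times from different random seeds and keeping the run with the lowest residual sum of squares (RSS), i.e., the total squared distance of points from their centroids. The toy points in the test are assumed data, not the A–F example from the lecture.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_once(points, k, rng, iters=50):
    """One k-means run; returns (RSS, clusters) so runs can be compared."""
    cents = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: sq_dist(p, cents[j]))].append(p)
        new = [tuple(sum(col) / len(c) for col in zip(*c)) if c else cents[j]
               for j, c in enumerate(clusters)]
        if new == cents:  # converged
            break
        cents = new
    rss = sum(sq_dist(p, cents[j]) for j, c in enumerate(clusters) for p in c)
    return rss, clusters

def kmeans_restarts(points, k, runs=10, seed=0):
    """Run k-means `runs` times; keep the run with the lowest RSS."""
    rng = random.Random(seed)
    return min(kmeans_once(points, k, rng) for _ in range(runs))
```

Because each restart is cheap and k-means converges quickly, a handful of restarts is usually enough to avoid the worst seed-dependent local optima.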
Take-home message: k-means