
1 Information Retrieval, Lecture 6. Based on Introduction to Information Retrieval (Manning et al. 2007), Chapter 16. For the MSc Computer Science Programme. Dell Zhang, Birkbeck, University of London.

2 What is text clustering?
Text clustering: grouping a set of documents into classes of similar documents.
Classification vs. clustering:
- Classification is supervised learning: labeled data are given for training.
- Clustering is unsupervised learning: only unlabeled data are available.

3 Why text clustering?
To improve the user interface:
- navigation/analysis of a corpus or of search results.
To improve recall:
- Cluster the docs in the corpus a priori; when a query matches a doc d, also return the other docs in the cluster containing d. The hope is that, for instance, the query "car" will then also return docs containing "automobile".
To improve retrieval speed:
- cluster pruning.

4 Example: http://clusty.com/ (Clusty, a web search engine that clusters its search results).

5 What makes a clustering good?
External criteria:
- consistency with the latent classes in gold-standard (ground-truth) data, e.g., as measured by purity (sketched below).
Internal criteria:
- high intra-cluster similarity;
- low inter-cluster similarity.
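Purity is one concrete external criterion from IIR Chapter 16: map each cluster to its majority class in the gold standard, and measure the fraction of docs that fall into their cluster's majority class. A minimal sketch in plain Python; the toy cluster assignments and gold labels are made up for illustration:

```python
from collections import Counter

def purity(clusters, labels):
    """Fraction of docs that belong to the majority gold class of
    their cluster. clusters[i] and labels[i] give the cluster id
    and gold class of doc i."""
    correct = 0
    for c in set(clusters):
        # gold classes of the docs placed in cluster c
        members = [labels[i] for i, cl in enumerate(clusters) if cl == c]
        # credit the cluster with its majority class
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(labels)

# hypothetical toy data: 6 docs in 2 clusters
print(purity([0, 0, 0, 1, 1, 1],
             ["car", "car", "auto", "bank", "bank", "bank"]))  # 5/6
```

Note that purity is trivially 1.0 when every doc gets its own cluster, so it is usually reported together with the number of clusters.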

6 Issues for Clustering
Similarity between docs:
- ideal: semantic similarity;
- practical: statistical similarity, e.g., cosine (see the sketch below).
Number of clusters:
- fixed, e.g., k-Means;
- flexible, e.g., single-link HAC.
Structure of clusters:
- flat partition, e.g., k-Means;
- hierarchical tree, e.g., single-link HAC.
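The cosine measure mentioned above is simple to implement. A minimal sketch with NumPy, assuming docs are already represented as term vectors (e.g., tf-idf); the three-term vocabulary is hypothetical:

```python
import numpy as np

def cosine_sim(d1, d2):
    """Cosine of the angle between two doc vectors:
    1.0 = same direction, 0.0 = no shared terms."""
    return np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

# hypothetical 3-term vocabulary: [car, automobile, bank]
doc_a = np.array([2.0, 1.0, 0.0])
doc_b = np.array([1.0, 2.0, 0.0])
doc_c = np.array([0.0, 0.0, 3.0])
print(cosine_sim(doc_a, doc_b))  # 0.8 (similar topic)
print(cosine_sim(doc_a, doc_c))  # 0.0 (nothing in common)
```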

7 k-Means Algorithm
Pick k docs {s1, s2, …, sk} at random as seeds.
Repeat until the clustering converges (or another stopping criterion is met):
- For each doc di: assign di to the cluster cj such that sim(di, sj) is maximal.
- For each cluster cj: update sj to the centroid (mean) of cluster cj.
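A runnable sketch of this loop in Python/NumPy. It uses Euclidean distance for the assignment step (for unit-length vectors, minimizing Euclidean distance and maximizing cosine similarity pick the same centroid); the function name and the toy data are illustrative, not from the slides:

```python
import numpy as np

def k_means(docs, k, max_iter=100, seed=0):
    """Minimal k-Means sketch. docs: (n, m) array of doc vectors.
    Returns (cluster assignment per doc, final centroids)."""
    rng = np.random.default_rng(seed)
    # Pick k docs at random as the initial centroids (seeds).
    centroids = docs[rng.choice(len(docs), size=k, replace=False)].copy()
    assign = None
    for _ in range(max_iter):
        # Assignment step: each doc joins its nearest centroid.
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :],
                               axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # clusters no longer change: converged
        assign = new_assign
        # Update step: move each centroid to the mean of its cluster.
        for j in range(k):
            members = docs[assign == j]
            if len(members):  # guard against an emptied cluster
                centroids[j] = members.mean(axis=0)
    return assign, centroids

docs = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
                 [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
print(k_means(docs, k=2)[0])  # e.g. [0 0 1 1 1 1]
```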

8 k-Means – Example (k = 2)
(Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged.)

9 k-Means – Example

11 k-Means – Online Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

12 Convergence
k-Means is guaranteed to converge, i.e., to reach a state in which the clusters no longer change: each iteration never increases the total doc-to-centroid distance (RSS), and there are only finitely many possible clusterings, so the algorithm cannot loop forever.
k-Means usually converges quickly, i.e., the number of iterations is small in most cases.

13 Seeds
Problem:
- Results can vary with the random seed selection; some seeds give a poor convergence rate, or convergence to a sub-optimal clustering.
Solution:
- Run k-Means multiple times with different random seed selections (see the sketch below).
- ……
Example showing sensitivity to seeds (the original slide shows six points A–F): starting with B and E as centroids converges to {A,B,C} and {D,E,F}; starting with D and F converges to {A,B,D,E} and {C,F}.
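The "multiple runs" remedy is usually implemented by keeping the run with the lowest residual sum of squares (RSS), i.e., the total squared distance of docs from their centroids. A sketch, reusing the hypothetical k_means function from the slide-7 example:

```python
import numpy as np

def rss(docs, assign, centroids):
    """Residual sum of squares: total squared doc-to-centroid
    distance (lower = tighter clusters)."""
    return sum(np.sum((docs[assign == j] - c) ** 2)
               for j, c in enumerate(centroids))

def best_of_n_runs(docs, k, n_runs=10):
    """Run k-Means with n_runs different random seeds and keep
    the clustering whose RSS is lowest."""
    best = None
    for s in range(n_runs):
        assign, cents = k_means(docs, k, seed=s)
        score = rss(docs, assign, cents)
        if best is None or score < best[0]:
            best = (score, assign, cents)
    return best[1], best[2]
```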

14 Take-Home Message: k-Means.

