Information Retrieval Lecture 6 Introduction to Information Retrieval (Manning et al. 2007) Chapter 16 For the MSc Computer Science Programme Dell Zhang.

Presentation transcript:

Information Retrieval Lecture 6 Introduction to Information Retrieval (Manning et al. 2007) Chapter 16 For the MSc Computer Science Programme Dell Zhang Birkbeck, University of London

What is text clustering?
Text clustering: grouping a set of documents into classes of similar documents.
Classification vs. Clustering
- Classification: supervised learning. Labeled data are given for training.
- Clustering: unsupervised learning. Only unlabeled data are available.

Why text clustering?
- To improve the user interface: navigation/analysis of the corpus or of search results.
- To improve recall: cluster the docs in the corpus a priori. When a query matches a doc d, also return the other docs in the cluster containing d. The hope is that the query "car" will then also return docs containing "automobile".
- To improve retrieval speed: cluster pruning.
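The recall idea above can be sketched in a few lines of Python. This is only an illustration: the function name `expand_with_clusters`, the toy document ids, and the two lookup tables are assumptions made here, not part of the lecture.

```python
def expand_with_clusters(matched_docs, doc_to_cluster, cluster_to_docs):
    """Return the matched docs plus every other doc sharing a cluster with one of them."""
    expanded = list(matched_docs)
    seen = set(matched_docs)
    for doc in matched_docs:
        for other in cluster_to_docs[doc_to_cluster[doc]]:
            if other not in seen:
                seen.add(other)
                expanded.append(other)
    return expanded


# Hypothetical toy corpus, clustered a priori: d1 and d2 are about cars, d3 is about cooking.
doc_to_cluster = {"d1": 0, "d2": 0, "d3": 1}
cluster_to_docs = {0: ["d1", "d2"], 1: ["d3"]}

# Suppose only d1 contains the literal query term "car", while d2 says "automobile".
results = expand_with_clusters(["d1"], doc_to_cluster, cluster_to_docs)
print(results)
```

Here d2 is returned even though it never mentions "car", because it sits in the same cluster as the matching doc d1.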

What makes a clustering good?
- External criteria: consistency with the latent classes in gold standard (ground truth) data.
- Internal criteria: high intra-cluster similarity and low inter-cluster similarity.
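One common way to quantify an external criterion is purity: label each cluster with its majority gold class and count the fraction of docs that agree. The slide does not name a specific measure, so purity is an assumption chosen here for illustration, and the labels below are made up.

```python
from collections import Counter


def purity(cluster_labels, gold_labels):
    """Fraction of docs that belong to the majority gold class of their cluster."""
    class_counts_by_cluster = {}
    for cluster, gold in zip(cluster_labels, gold_labels):
        class_counts_by_cluster.setdefault(cluster, Counter())[gold] += 1
    majority_total = sum(max(counts.values())
                         for counts in class_counts_by_cluster.values())
    return majority_total / len(gold_labels)


# Hypothetical clustering of six docs against hypothetical gold classes.
clusters = [0, 0, 0, 1, 1, 1]
gold = ["car", "car", "food", "food", "food", "car"]
print(purity(clusters, gold))  # (2 + 2) / 6
```

A purity of 1.0 means every cluster contains docs of a single gold class.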

Issues for Clustering
Similarity between docs
- Ideal: semantic similarity.
- Practical: statistical similarity, e.g., cosine.
Number of clusters
- Fixed, e.g., k Means.
- Flexible, e.g., Single-Link HAC.
Structure of clusters
- Flat partition, e.g., k Means.
- Hierarchical tree, e.g., Single-Link HAC.
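As a rough illustration of statistical similarity, the sketch below computes the cosine between two docs. It assumes whitespace tokenization and raw term frequencies as vector weights; the lecture does not specify a weighting scheme, and the example sentences are invented.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-frequency vectors of two docs."""
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty doc is treated as similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity("the car is fast", "the car is red"))
print(cosine_similarity("the car is fast", "cooking pasta at home"))
```

The first pair shares three of four terms and scores high; the second pair shares none and scores zero. Note that this measure sees "car" and "automobile" as unrelated, which is the gap between statistical and semantic similarity.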

k Means Algorithm
Pick k docs {s_1, s_2, …, s_k} randomly as seeds.
Repeat until the clustering converges (or another stopping criterion is met):
  For each doc d_i:
    Assign d_i to the cluster c_j such that sim(d_i, s_j) is maximal.
  For each cluster c_j:
    Update s_j to the centroid (mean) of cluster c_j.
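A minimal Python sketch of this loop follows. It makes two assumptions that are not on the slide: docs are tuples of floats, and "maximal similarity" is implemented as minimal squared Euclidean distance (the two agree for length-normalized vectors). The names `k_means`, `squared_distance`, and `max_iters` are illustrative.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(docs, k, rng=None, max_iters=100):
    """Return (assignment, centroids); assignment[i] is the cluster index of docs[i]."""
    rng = rng or random.Random()
    centroids = rng.sample(docs, k)  # pick k docs randomly as seeds
    assignment = None
    for _ in range(max_iters):
        # Reassignment step: each doc goes to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(d, centroids[j]))
            for d in docs
        ]
        if new_assignment == assignment:
            break  # converged: no doc changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    return assignment, centroids


# Hypothetical 2-D "docs" forming two loose groups.
docs = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
assignment, centroids = k_means(docs, k=2, rng=random.Random(0))
print(assignment)
print(centroids)
```

Passing a seeded `random.Random` makes a run repeatable; leaving `rng` unset gives a different seed selection each time.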

k Means – Example (k = 2)
[Figure: pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged.]

k Means – Example

k Means – Online Demo

Convergence
- k Means provably converges, i.e., it reaches a state in which the clusters no longer change.
- k Means usually converges quickly, i.e., the number of iterations is small in most cases.
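The usual argument for convergence can be sketched as follows. It assumes Euclidean distance and the residual sum of squares (RSS) as the objective; neither is named on the slide, which states assignment in terms of sim, so treat this as the standard textbook reasoning rather than the lecturer's own derivation.

```latex
\mathrm{RSS} \;=\; \sum_{j=1}^{k} \; \sum_{d_i \in c_j} \lVert d_i - s_j \rVert^2

% Reassignment step: each d_i moves to its nearest s_j,
% so its own term cannot increase.
\lVert d_i - s_{j^{\ast}} \rVert^2 \;\le\; \lVert d_i - s_j \rVert^2
\quad \text{where } j^{\ast} = \arg\min_{j'} \lVert d_i - s_{j'} \rVert^2

% Update step: the mean minimizes the sum of squared distances
% to the members of a cluster.
\arg\min_{s} \sum_{d_i \in c_j} \lVert d_i - s \rVert^2
\;=\; \frac{1}{\lvert c_j \rvert} \sum_{d_i \in c_j} d_i
```

Since neither step can increase RSS and there are only finitely many ways to partition the docs into k clusters, the procedure must reach a state where nothing changes, provided ties are broken consistently.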

Seeds
Problem
- Results can vary because of the random seed selection. Some seeds result in a poor convergence rate, or in convergence to a sub-optimal clustering.
Solution
- Run k Means multiple times with different random seed selections.
- ……
Example showing sensitivity to seeds (points A to F in the figure): if you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}; if you start with D and F, you converge to {A,B,D,E} and {C,F}.
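The sketch below reproduces this behaviour and shows the restart strategy. The coordinates of A to F are hypothetical, chosen only so that the two outcomes named on the slide occur; the original figure's coordinates are not given. Keeping the run with the lowest residual sum of squares (RSS) is likewise an assumed selection rule, and Euclidean distance stands in for sim.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def run_k_means(points, seeds, max_iters=100):
    """Run k Means from explicit seed points; return (assignment, rss)."""
    centroids = list(seeds)
    assignment = None
    for _ in range(max_iters):
        new_assignment = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    rss = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, rss


# Hypothetical layout: A, B, C in an upper row and D, E, F in a lower row.
points = {"A": (0.0, 1.0), "B": (1.5, 1.0), "C": (4.0, 1.0),
          "D": (0.0, 0.0), "E": (1.5, 0.0), "F": (4.0, 0.0)}
names = list(points)
coords = list(points.values())


def clusters_from(seed_names):
    """Return (sorted list of clusters as name lists, rss) for the given seeds."""
    assignment, rss = run_k_means(coords, [points[n] for n in seed_names])
    groups = {}
    for name, a in zip(names, assignment):
        groups.setdefault(a, []).append(name)
    return sorted(groups.values()), rss


print(clusters_from(["B", "E"])[0])  # seeds B and E
print(clusters_from(["D", "F"])[0])  # seeds D and F

# Restart strategy: try several random seed pairs and keep the lowest-RSS run.
rng = random.Random(0)
runs = [clusters_from(rng.sample(names, 2)) for _ in range(10)]
best = min(runs, key=lambda run: run[1])
print(best)
```

With these assumed coordinates the first call yields {A,B,C} and {D,E,F} and the second yields {A,B,D,E} and {C,F}, so the same data and the same k give different clusterings depending only on the seeds.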

Take Home Message
- k Means