1 Text Clustering

2 Clustering
Partition unlabeled examples into disjoint subsets called clusters, such that:
– Examples within a cluster are very similar.
– Examples in different clusters are very different.
Discover new categories in an unsupervised manner (no sample category labels are provided).

3 Clustering Example

4 Text Clustering
Clustering has been applied to text in a straightforward way, typically using normalized, TF/IDF-weighted vectors and cosine similarity.
Applications:
– During retrieval, add other documents in the same cluster as the initially retrieved documents to improve recall.
– Cluster the results of retrieval to present more organized results to the user (à la Vivisimo folders).
– Automatically produce hierarchical taxonomies of documents for browsing purposes (à la Yahoo & DMOZ).
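To make the representation concrete, here is a minimal sketch of TF/IDF weighting plus cosine similarity; scikit-learn and the toy documents are assumptions for illustration, not something the slides prescribe.

```python
# Minimal sketch: TF/IDF-weighted, L2-normalized document vectors and
# their pairwise cosine similarities. scikit-learn is an assumed library
# choice; the slides only describe the representation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat and a dog played outside",
    "stock markets fell sharply today",
]

vectors = TfidfVectorizer().fit_transform(docs)  # rows are unit vectors (norm="l2" is the default)
print(cosine_similarity(vectors))  # for unit vectors this is just the dot product
```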

5–7 (figure-only slides)

8 Non-Hierarchical Clustering
Typically must provide the number of desired clusters, k.
Randomly choose k instances as seeds, one per cluster.
Form initial clusters based on these seeds.
Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
Stop when the clustering converges or after a fixed number of iterations.

9 K-Means
Assumes instances are real-valued vectors.
Clusters are based on centroids (a.k.a. the center of gravity or mean) of the points in a cluster c:
$\mu(c) = \frac{1}{|c|} \sum_{x \in c} x$
Reassignment of instances to clusters is based on distance to the current cluster centroids.

10 Distance Metrics
Euclidean distance ($L_2$ norm): $L_2(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$
$L_1$ norm: $L_1(x, y) = \sum_{i=1}^{m} |x_i - y_i|$
Cosine similarity (transformed into a distance by subtracting from 1): $1 - \frac{x \cdot y}{\|x\| \, \|y\|}$
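A small NumPy sketch of the three measures (the helper names are illustrative):

```python
# The three measures from this slide in plain NumPy (helper names are
# illustrative, not from the slides).
import numpy as np

def l2_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def l1_distance(x, y):
    return np.sum(np.abs(x - y))

def cosine_distance(x, y):
    # Transform cosine similarity into a distance by subtracting from 1.
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x, y = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 2.0])
print(l2_distance(x, y), l1_distance(x, y), cosine_distance(x, y))
```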

11 K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s_1, s_2, …, s_k} as seeds.
Until the clustering converges or another stopping criterion is met:
  For each instance x_i:
    Assign x_i to the cluster c_j such that d(x_i, s_j) is minimal.
  (Update the seeds to the centroid of each cluster.)
  For each cluster c_j: $s_j = \mu(c_j)$
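A direct NumPy sketch of this loop; the function name, the iteration cap, and the empty-cluster fallback are assumptions added for illustration.

```python
# Sketch of the k-means loop above. The iteration cap and the
# empty-cluster fallback are assumptions added to keep the sketch robust.
import numpy as np

def kmeans(X, k, max_iters=100, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    # Select k random instances as seeds.
    seeds = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each x_i to the cluster c_j with minimal d(x_i, s_j).
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each seed to the centroid mu(c_j) of its cluster;
        # keep the old seed if a cluster ends up empty.
        new_seeds = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else seeds[j] for j in range(k)])
        if np.allclose(new_seeds, seeds):  # converged
            break
        seeds = new_seeds
    return labels, seeds
```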

12 K-Means Example (k = 2)
(Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!)

13 Seed Choice
Results can vary based on random seed selection.
Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
Select good seeds using a heuristic or the results of another method.
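One common remedy (an assumption here, not one the slide names) is to run k-means from several random seeds and keep the run with the lowest within-cluster sum of squared distances, reusing the kmeans() sketch from slide 11:

```python
# Run k-means from several random seeds and keep the clustering with the
# lowest within-cluster sum of squared distances (SSE). Reuses the
# kmeans() sketch from slide 11; a standard heuristic, not from the slide.
import numpy as np

def best_of_n_runs(X, k, n_runs=10):
    best_labels, best_sse = None, np.inf
    for run in range(n_runs):
        labels, centers = kmeans(X, k, rng=np.random.default_rng(run))
        sse = sum(np.sum((X[labels == j] - centers[j]) ** 2) for j in range(k))
        if sse < best_sse:
            best_labels, best_sse = labels, sse
    return best_labels
```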

14 Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.
Recursive application of a standard clustering algorithm can produce a hierarchical clustering.
(Example dendrogram: animal splits into vertebrate {fish, reptile, amphibian, mammal} and invertebrate {worm, insect, crustacean}.)

15 Agglomerative vs. Divisive Clustering
Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
Divisive (partitional, top-down) methods separate all examples immediately into clusters.

16 Hierarchical Agglomerative Clustering (HAC)
Assumes a similarity function for determining the similarity of two instances.
Starts with each instance in a separate cluster, then repeatedly joins the two clusters that are most similar until there is only one cluster.
The history of merging forms a binary tree or hierarchy.

17 HAC Algorithm
Start with all instances in their own clusters.
Until there is only one cluster:
  Among the current clusters, determine the two clusters, c_i and c_j, that are most similar.
  Replace c_i and c_j with a single cluster $c_i \cup c_j$.

18 Cluster Similarity
Assume a similarity function that determines the similarity of two instances: sim(x, y).
– E.g., cosine similarity of document vectors.
How do we compute the similarity of two clusters, each possibly containing multiple instances?
– Single link: similarity of the two most similar members.
– Complete link: similarity of the two least similar members.
– Group average: average similarity between members.
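All three criteria are available off the shelf; here is a sketch using SciPy (an assumed library choice), which works with distances rather than similarities, so "most similar" becomes "smallest distance":

```python
# Sketch: single-, complete-, and average-link agglomerative clustering
# via SciPy (an assumed library choice; the slides describe only the math).
# SciPy works with distances, so "most similar" means "smallest distance".
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(10, 2))

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut dendrogram into 2 clusters
    print(method, labels)
```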

19 Single Link Agglomerative Clustering
Use the maximum similarity of pairs:
$\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
Can result in "straggly" (long and thin) clusters due to the chaining effect.
– Appropriate in some domains, such as clustering islands.

20 Single Link Example

21 Complete Link Agglomerative Clustering
Use the minimum similarity of pairs:
$\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
Makes "tighter", more spherical clusters that are typically preferable.

22 Complete Link Example

23 Computing Cluster Similarity
After merging c_i and c_j, the similarity of the resulting cluster to any other cluster, c_k, can be computed by:
– Single link: $\mathrm{sim}((c_i \cup c_j), c_k) = \max(\mathrm{sim}(c_i, c_k),\, \mathrm{sim}(c_j, c_k))$
– Complete link: $\mathrm{sim}((c_i \cup c_j), c_k) = \min(\mathrm{sim}(c_i, c_k),\, \mathrm{sim}(c_j, c_k))$

24 Group Average Agglomerative Clustering
Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters:
$\mathrm{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in c_i \cup c_j} \sum_{y \in c_i \cup c_j,\, y \neq x} \mathrm{sim}(x, y)$
A compromise between single and complete link.
The average is taken over all ordered pairs in the merged cluster, rather than only over pairs between the two original clusters; this choice is what enables the constant-time computation on the next slide.

25 Computing Group Average Similarity
Assume cosine similarity and vectors normalized to unit length.
Always maintain the sum of vectors in each cluster: $s(c) = \sum_{x \in c} x$
Compute the similarity of clusters in constant time:
$\mathrm{sim}(c_i, c_j) = \frac{(s(c_i) + s(c_j)) \cdot (s(c_i) + s(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)\,(|c_i| + |c_j| - 1)}$
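A sketch of that bookkeeping (the function name is illustrative): each cluster carries its vector sum and size, so the merged similarity requires no pass over the members.

```python
# Constant-time group-average similarity for unit-length vectors, using
# maintained per-cluster vector sums (function name is illustrative).
# np.dot(s, s) counts every ordered pair's dot product plus n self-pairs
# (each equal to 1), hence the (s.s - n) / (n * (n - 1)) correction.
import numpy as np

def group_average_sim(sum_i, n_i, sum_j, n_j):
    s = sum_i + sum_j   # vector sum of the merged cluster
    n = n_i + n_j       # size of the merged cluster
    return (np.dot(s, s) - n) / (n * (n - 1))
```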

26 Hierarchical Divisive Clustering
Apply k-means iteratively:
– Initially, split all objects into two clusters using k-means.
– Apply k-means to the resulting clusters to generate subclusters.
– Repeat until a stopping criterion is met.
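A sketch of this bisecting scheme, reusing the kmeans() sketch from slide 11; using a target cluster count as the stopping criterion and always splitting the largest cluster are assumptions for illustration.

```python
# Bisecting (divisive) clustering built on the kmeans() sketch from
# slide 11. The stopping criterion (a target cluster count) and the
# split-the-largest policy are illustrative assumptions.
import numpy as np

def bisecting_kmeans(X, target_clusters=4):
    clusters = [X]
    while len(clusters) < target_clusters:
        # Pick the largest current cluster and split it in two with k-means.
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        data = clusters.pop(largest)
        labels, _ = kmeans(data, k=2)
        clusters.append(data[labels == 0])
        clusters.append(data[labels == 1])
    return clusters
```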