I256 Applied Natural Language Processing Fall 2009

1 I256 Applied Natural Language Processing Fall 2009
Lecture 15 (Text) clustering Barbara Rosario

2 Outline Motivation and applications for text clustering
Hard vs. soft clustering
Flat vs. hierarchical clustering
Similarity measures
Flat: K-means
Hierarchical: Agglomerative Clustering

3 Text Clustering Finds overall similarities among groups of documents
Finds overall similarities among groups of tokens (words, adjectives, …). The goal is to place similar objects in the same groups and to assign dissimilar objects to different groups.

4 Motivation Smoothing for statistical language models Generalization
Forming bins (by inducing the bins from the data) From Michael Collins’s slides (MIT NLP course)

5 Motivation Aid for Question-Answering and Information Retrieval
From Michael Collins’s slides (MIT NLP course)

6 Word Similarity Find semantically related words by combining similarity evidence from multiple indicators From Michael Collins’s slides (MIT NLP course)

7 Word clustering From Michael Collins’s slides (MIT NLP course)

8 Clustering of nouns Distributional Clustering of English Words - Pereira, Tishby and Lee, ACL 93

9 Distributional Clustering of English Words - Pereira, Tishby and Lee, ACL 93

10 Clustering of adjectives
Cluster adjectives based on the nouns they modify Multiple syntactic clues for modification Predicting the semantic orientation of adjectives, V Hatzivassiloglou, KR McKeown, EACL 1997

11 Document clustering Classification

12 Scatter/Gather: Clustering a Large Text Collection
Cutting, Pedersen, Tukey & Karger 92, 93 Hearst & Pedersen 95 Cluster sets of documents into general “themes”, like a table of contents Display the contents of the clusters by showing topical terms and typical titles User chooses subsets of the clusters and re-clusters the documents within Resulting new groups have different “themes”

13 From http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

14 S/G Example: query on “star”
Encyclopedia text: 14 sports, 8 symbols, 47 film/tv, 68 film/tv (p), music, 97 astrophysics, 67 astronomy (p), 12 stellar phenomena, 10 flora/fauna, galaxies/stars, 29 constellations, 7 miscellaneous. Clustering and re-clustering is entirely automated.

18 Motivation: Visualization & EDA
Exploratory data analysis (EDA), related to visualization: get a feeling for what the data look like, and try to find overall trends or patterns in text collections.

19 Visualization Use clustering to map the entire huge multidimensional document space into a huge number of small clusters. "Project" these onto a 2D graphical representation. Looks neat, but it is difficult to detect patterns; usefulness is debatable.

20 Motivation: Clustering for Information Retrieval
The cluster hypothesis states the fundamental assumption we make when using clustering in information retrieval. Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs. Clustering thus tends to place similar documents together.

21 Search result clustering
Instead of a ranked list, cluster the search results so that similar documents appear together. It is often easier to scan a few coherent groups than many individual documents. Particularly useful if a search term has different word senses. Example: the Vivísimo search engine.

23 Motivation: unsupervised classification
Classification when labeled data is not available, also called unsupervised classification. The results of clustering depend only on the natural divisions in the data, not on any pre-existing categorization scheme.

24 Classification Class1 Class2

25 Clustering

26 Clustering

27 Methods Hard/soft clustering Flat/hierarchical clustering
Similarity measures
Merging methods

28 Text Clustering Clustering is “The art of finding groups in data.”
-- Kaufman and Rousseeuw. [Scatter plot with axes Term 1 and Term 2]

29 Text Clustering Clustering is “The art of finding groups in data.”
-- Kaufman and Rousseeuw. [Scatter plot with axes Term 1 and Term 2]

30 Hard/soft Clustering
Hard clustering: each object belongs to a single cluster
Soft clustering: each object is probabilistically assigned to clusters

31 Soft clustering A variation of many clustering methods
Instead of assigning each data sample to one and only one cluster, it calculates probabilities of membership for all clusters. A sample might belong to cluster A with probability 0.4 and to cluster B with probability 0.6. More appropriate for NLP tasks.

32 Flat Vs. Hierarchical Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other. Hierarchical clustering produces a hierarchy of nodes: leaves are the single objects of the clustered set, and each node represents the cluster that contains all the objects of its descendants.

33 From http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

34 Flat Vs. Hierarchical Flat
Preferable if efficiency is a consideration or data sets are very large. K-means is a very simple method that should probably be tried first on a new data set, because its results are often sufficient. K-means assumes a simple Euclidean representation space, so it cannot be used for many data sets, for example nominal data like colors. In such cases, use EM (expectation-maximization).

35 Flat Vs. Hierarchical Hierarchical
Preferable for detailed data analysis. Provides more information than flat clustering. Does not require us to pre-specify the number of clusters. Less efficient: the most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents, compared to the linear complexity of most flat clustering methods.

36 Clustering issues Two main issues:
Similarity measure: how to decide whether to cluster data points together (or not)
Clustering algorithms: merging criteria

37 Similarity Vector-space representation and similarity computation
Select important distributional properties of a word. Create a vector of length n for each word to be classified. Viewing the n-dimensional vector as a point in an n-dimensional space, cluster points that are near one another, as in the sketch below.
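A minimal sketch (not from the slides) of this vector-space idea, with made-up context features and counts: each word becomes a point in n-dimensional space, and nearby points are candidates for the same cluster.

```python
import numpy as np

# Hypothetical distributional counts: how often each target word occurs
# near each of n context features (features and counts are invented).
context_features = ["drink", "eat", "drive", "road"]
word_vectors = {
    "wine":  np.array([20.0, 5.0, 0.0, 0.0]),
    "beer":  np.array([25.0, 3.0, 1.0, 0.0]),
    "car":   np.array([0.0, 0.0, 18.0, 12.0]),
    "truck": np.array([0.0, 1.0, 15.0, 14.0]),
}

def euclidean(u, v):
    """Distance between two words viewed as points in n-dimensional space."""
    return np.linalg.norm(u - v)

# Words with similar distributional properties end up near one another.
print(euclidean(word_vectors["wine"], word_vectors["beer"]))   # small
print(euclidean(word_vectors["wine"], word_vectors["truck"]))  # large
```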

38 Similarity From Michael Collins’s slides (MIT NLP course)

39 Similarity From Michael Collins’s slides (MIT NLP course)

40 Similarity From Michael Collins’s slides (MIT NLP course)

41 Similarity From Michael Collins’s slides (MIT NLP course)

42 Similarity From Michael Collins’s slides (MIT NLP course)

43 Pair-wise Document Similarity
[Table of per-document term counts over the terms nova, galaxy, heat, h'wood, film, role, diet, fur] How do we compute document similarity?

44 Pair-wise Document Similarity (no normalization for simplicity)
[Same term-count table as the previous slide]

45 Pair-wise Document Similarity (cosine normalization)
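A small sketch of the similarity computations behind slides 43-45, using hypothetical term counts for documents A-D; the dot product is the unnormalized similarity, and cosine normalization divides by the vector lengths.

```python
import numpy as np

# Hypothetical term-count vectors over the terms shown on the slide
# (nova, galaxy, heat, h'wood, film, role, diet, fur).
docs = {
    "A": np.array([3.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
    "B": np.array([0.0, 0.0, 0.0, 2.0, 3.0, 1.0, 0.0, 0.0]),
    "C": np.array([2.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]),
    "D": np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 3.0]),
}

def dot_similarity(u, v):
    """Unnormalized similarity: the inner product of the term vectors."""
    return float(np.dot(u, v))

def cosine_similarity(u, v):
    """Cosine-normalized similarity: inner product divided by vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(docs["A"], docs["C"]))  # documents sharing a topic
print(cosine_similarity(docs["A"], docs["D"]))  # unrelated documents -> 0.0
```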

46 Document/Document Matrix

47 Similarity And many other similarity measures!

48 Flat Clustering: K-means
K-means is the most important flat clustering algorithm. The objective is to minimize the average squared Euclidean distance of documents from their cluster centers, where a cluster center is defined as the mean or centroid μ of the documents in a cluster ω:
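The formula itself did not survive the transcript; from the definition above (and the IR book the slides cite), the centroid can be written as:

```latex
\vec{\mu}(\omega) \;=\; \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}
```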

49 K-Means Clustering Decide on a pair-wise similarity measure
1. Compute K centroids. 2. Assign each document to the nearest centroid, forming new clusters. 3. Unless the termination condition is met, repeat 1-2. (A minimal sketch follows.)
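A minimal sketch of this loop (random initialization, Euclidean distance; the function and variable names are my own, not from the course):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: X is an (n_docs, n_terms) array of document vectors."""
    rng = np.random.default_rng(seed)
    # Initialize the K centroids with K randomly chosen documents.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each document to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the documents assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Terminate when the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```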

50 K-means algorithm A K-means example for K = 2 in R^2
From

51 K-means algorithm Convergence of the position of the two centroids
From

52 K-means Residual sum of squares (RSS): a measure of how well the centroids represent the members of their clusters. RSS is the squared distance of each vector from its centroid, summed over all vectors. RSS is the objective function in K-means; our goal is to minimize it:
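A reconstruction of the RSS formula implied above (notation follows the IR book cited earlier):

```latex
\mathrm{RSS}_k \;=\; \sum_{\vec{x}\in\omega_k} \bigl\|\vec{x}-\vec{\mu}(\omega_k)\bigr\|^2,
\qquad
\mathrm{RSS} \;=\; \sum_{k=1}^{K} \mathrm{RSS}_k
```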

53 Model-based clustering
Model-based clustering (a flat method) assumes that the data were generated by a model and tries to recover the original model from the data. The model that we recover then defines clusters and an assignment of documents to clusters. The standard algorithm is EM (expectation-maximization).
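One common concrete instance is a Gaussian mixture fitted with EM; this sketch uses scikit-learn's GaussianMixture on toy 2-D data (the library and data are my choices for illustration, not the slides'):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 2-D data: two loose groups of points.
X = np.array([[0.9, 1.0], [1.1, 0.8], [1.0, 1.2],
              [4.0, 4.1], [3.9, 3.8], [4.2, 4.0]])

# Fit a 2-component Gaussian mixture with EM.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft clustering: each sample gets a probability of membership per cluster,
# rather than a single hard assignment.
print(gmm.predict_proba(X))
```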

54 Hierarchical Clustering
Agglomerative (bottom-up):
Initialization: start with each sample in its own cluster
Each iteration: find the two most similar (closest) clusters and merge them
Termination: all the objects are in the same cluster
Divisive (top-down):
Start with all elements in one cluster
Partition one of the current clusters in two
Repeat until all samples are in singleton clusters
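A sketch of the bottom-up procedure using SciPy's hierarchical clustering routines (my choice of tool, not the slides'): linkage performs the successive merges and fcluster cuts the resulting tree into flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy document vectors (rows are documents).
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]])

# Bottom-up merging: start with each sample in its own cluster and
# repeatedly merge the two closest clusters ("average" linkage here).
Z = linkage(X, method="average", metric="euclidean")

# Cut the hierarchy to obtain, e.g., 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```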

55 Agglomerative Clustering
A B C D E F G H I

56 Agglomerative Clustering
A B C D E F G H I

57 Agglomerative Clustering
A B C D E F G H I

58 Merging nodes/Clustering function
Each node is a combination of the documents below it. We represent the merged node as a vector of term weights; this vector is referred to as the cluster centroid.

59 Clustering functions aka Merging criteria
Extend the distance measure from samples to sets of samples (see the sketch below):
Similarity of the 2 most similar members (single-link)
Similarity of the 2 least similar members (complete-link)
Average similarity between members (average-link)
From Michael Collins's slides (MIT NLP course)
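The three criteria can be written directly as functions over a pairwise sample distance; a sketch assuming Euclidean distance between samples:

```python
import numpy as np
from itertools import product

def dist(a, b):
    """Euclidean distance between two samples."""
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

def single_link(A, B):
    """Distance between the two most similar members (closest pair)."""
    return min(dist(a, b) for a, b in product(A, B))

def complete_link(A, B):
    """Distance between the two least similar members (farthest pair)."""
    return max(dist(a, b) for a, b in product(A, B))

def average_link(A, B):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))
```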

60 Single-link merging criteria
Initially, each word type is a single-point cluster. Merge the closest pair of clusters. Single-link: clusters are close if any of their points are close: dist(A,B) = min dist(a,b) for a ∈ A, b ∈ B

61 Bottom-Up Clustering – Single-Link
Fast, but tends to produce long, stringy, meandering clusters.

62 Bottom-Up Clustering – Complete-Link
Again, merge the closest pair of clusters. Complete-link: dist(A,B) = max dist(a,b) for a ∈ A, b ∈ B

63 Bottom-Up Clustering – Complete-Link
Slow to find the closest pair: needs quadratically many distances.

64 Choosing k How to select an appropriate level of granularity?
Too small, and clusters provide insufficient generalization Too large, and they are inappropriately generalized

65 Choosing k In both hierarchical and k-means/medians clustering, we need to be told where to stop, i.e., how many clusters to form. This is partially alleviated by visual inspection of the hierarchical tree (the dendrogram). It would be nice if we could find an optimal k from the data. We can do this by trying different values of k and seeing which produces the best separation among the resulting clusters, as in the sketch below. There are also some theoretical measures.
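A sketch of "try different values of k and measure the separation", here using scikit-learn's KMeans and the silhouette score as the separation measure (both are my choices for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(60, 5))  # stand-in document vectors

best_k, best_score = None, -1.0
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)   # higher = better-separated clusters
    if score > best_score:
        best_k, best_score = k, score

print(best_k, best_score)
```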

66 How to evaluate clusters?
In practice, it is hard to do: different algorithms' results look good and bad in different ways, and it is difficult to distinguish their outcomes. In theory, define an evaluation function; typically choose something easy to measure (e.g., the sum of the average distance in each class), as in the sketch below.
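The easy-to-measure criterion mentioned above (sum over clusters of the average distance of members to their centroid) could be computed as in this sketch:

```python
import numpy as np

def sum_of_average_distances(X, labels):
    """Sum over clusters of the average distance of members to their centroid."""
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        total += np.linalg.norm(members - centroid, axis=1).mean()
    return total  # lower values indicate tighter clusters
```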

67 How to evaluate clusters?
Perform task-based evaluation.
Test the resulting clusters intuitively, i.e., inspect them and see if they make sense. Not advisable.
Have an expert generate clusters manually, and test the automatically generated ones against them.
Test the clusters against a predefined classification if there is one.
From Michael Collins's slides (MIT NLP course)

68 Resources FCLUSTER - A tool for fuzzy cluster analysis
LNKnet Pattern Classification Software
Principal Direction Divisive Partitioning
k-means clustering
Text Clustering (Chapters 16 and 17)

