1
Flat Clustering Adapted from Slides by
Prabhakar Raghavan, Christopher Manning, Ray Mooney and Soumen Chakrabarti. Flat clustering creates a flat set of clusters without any explicit structure relating the clusters. Overview: what clustering is and why it is useful; flat vs. tree-based clusters; issues in clustering (document representation, similarity metric, choice of K); the K-means algorithm (convergence, termination, complexity, choice of seeds and K); classification vs. clustering, i.e., supervised (hand-tagged training set) vs. unsupervised learning. (Research aside: cluster label generation; for a news dataset, label generation can be subsumed by picking a prototype from an authoritative source.)
2
Today’s Topic: Clustering
Document clustering: motivations, document representations, success criteria. Clustering algorithms: partitional (flat) and hierarchical (tree). What, why, how?
3
What is clustering? The commonest form of unsupervised learning
Clustering: the process of grouping a set of objects into classes of similar objects Documents within a cluster should be similar. Documents from different clusters should be dissimilar. The commonest form of unsupervised learning Unsupervised learning = learning from raw data, as opposed to supervised data where a classification of examples is given A common and important task that finds many applications in IR and other places
4
A data set with clear cluster structure
How would you design an algorithm for finding the three clusters in this case?
5
Why cluster documents? Whole corpus analysis/navigation
Better user interface. For improving recall in search applications: better search results (pseudo relevance feedback). For better navigation of search results: effective "user recall" will be higher. For speeding up vector space retrieval: faster search. Organization in the face of information overload (a la sense-based grouping). Analysis of data to understand it better: structure + label. Recent news generated as a cluster (Google News), intrinsically a candidate for unsupervised learning. 2) Organize the result set for easy navigation and enhanced precision: Clusty. 3) Novel browsing paradigm: Scatter/Gather. 4) Improve search efficiency by restricting comparisons to similar documents (engineering a search engine). User perspective: enhance recall. Computational perspective: improve efficiency by reducing comparisons.
6
Yahoo! Hierarchy isn’t clustering but is the kind of output you want from clustering
[Figure: sample of the hierarchy, with top-level categories such as agriculture, biology, physics, CS, and space, and subcategories such as dairy, crops, agronomy, forestry, botany, cell, evolution, magnetism, relativity, AI, HCI, courses, craft, and missions.] Yahoo! Science hierarchy, consisting of the text of web pages pointed to by Yahoo!, gathered in the summer of 1997. 264 classes; only a sample is shown here.
7
Google News: automatic clustering gives an effective news presentation metaphor
8
Scatter/Gather: Cutting, Karger, and Pedersen
Scatter the collection into clusters; select interesting clusters, take their union, and scatter the documents in them again; gather the resulting clusters; repeat. A novel navigation paradigm (cf. query reformulation vs. identifying interesting clusters). Doug Cutting later contributed Lucene, Hadoop, etc. to open-source software.
9
For visualizing a document collection and its themes
Wise et al, “Visualizing the non-visual” PNNL ThemeScapes, Cartia [Mountain height = cluster size]
10
For improving search recall
Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs. Therefore, to improve search recall: cluster docs in the corpus a priori; when a query matches a doc D, also return other docs in the cluster containing D. Example: the query "car" will also return docs containing "automobile". Why might this happen? Because clustering grouped docs containing "car" together with those containing "automobile" (adding precomputed similar docs). Cf. Latent Semantic Indexing (topics): "car" and "automobile" are related because they co-occur with the same set of words. Polysemy (precision) vs. synonymy (recall).
11
For better navigation of search results
For grouping search results thematically: clusty.com / Vivisimo (organizing the result set); a WordNet-based approach to query expansion and clustering based on synonyms. Jaguar: football team, animal, car, etc. Polysemy (precision) vs. synonymy (recall). Clusty is now called Yippy.
12
Issues for clustering Representation for clustering How many clusters?
Document representation: vector space? Normalization? Need a notion of similarity/distance. How many clusters? Fixed a priori? Completely data driven? Avoid "trivial" clusters (too large or too small). In an application, if a cluster is too large, then for navigation purposes you've wasted an extra user click without whittling down the set of documents much. Note that the purpose of clustering impacts document representation: for topic similarity, stop words can be safely omitted; on the other hand, language recognition may benefit from focusing on stop words! The eventual clustering may depend on the choice of the number of clusters and the initial seeds.
13
What makes docs “related”?
Ideal: semantic similarity. Practical: statistical similarity. Docs as vectors. For many algorithms it is easier to think in terms of a distance (rather than a similarity) between docs. We can use cosine similarity (or, alternatively, Euclidean distance). This directly addresses "internal criteria"; for "external criteria", however, we need a gold standard to really gauge the effectiveness of the distance/similarity measure. Jaguar -> car, animal, sports team, OS. (A small sketch of both measures follows.)
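As a concrete illustration (not from the slides), here is a minimal Python sketch of cosine similarity and Euclidean distance between document vectors; the five-term vocabulary and the counts are made up for illustration.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two non-zero document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical term-count vectors over a 5-term vocabulary.
doc_car = np.array([3.0, 0.0, 1.0, 2.0, 0.0])
doc_auto = np.array([2.0, 1.0, 0.0, 3.0, 0.0])

print(cosine_sim(doc_car, doc_auto))        # similarity (higher = more alike)
print(np.linalg.norm(doc_car - doc_auto))   # Euclidean distance (lower = more alike)
```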
14
Clustering Algorithms
Partitional (flat) algorithms: usually start with a random (partial) partition and refine it iteratively; K-means clustering; (model-based clustering). Hierarchical (tree) algorithms: bottom-up, agglomerative; (top-down, divisive). Flat: no relationship among clusters. Hierarchical: tree-structured clusters.
15
Hard vs. soft clustering
Hard clustering: each document belongs to exactly one cluster. More common and easier to do. Soft clustering: a document can belong to more than one cluster. Makes more sense for applications like creating browsable hierarchies; you may want to put a pair of sneakers in two clusters, (i) sports apparel and (ii) shoes. We won't do soft clustering today; see IIR 16.5, 18 (Latent Semantic Indexing). Using the hard clusters, we can get soft "intersecting" clusters if the distances of a document from two centroids are within a tolerance (e.g., 5%) of each other.
16
Partitioning Algorithms
Partitioning method: Construct a partition of n documents into a set of K clusters Given: a set of documents and the number K Find: a partition of K clusters that optimizes the chosen partitioning criterion Globally optimal: exhaustively enumerate all partitions Effective heuristic methods: K-means and K-medoids algorithms
17
K-Means Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster, c. Reassignment of instances to clusters is based on distance to the current cluster centroids. (One can equivalently phrase it in terms of similarities.) Centroid : medoid :: mean : median (not necessarily in the dataset vs. in the dataset). A medoid may be sparse while the centroid may be dense.
18
K-Means Algorithm Select K random docs {s1, s2,… sK} as seeds.
Until clustering converges (or another stopping criterion is met): For each doc di: assign di to the cluster cj such that dist(di, sj) is minimal. (Then update the seeds to the centroid of each cluster.) For each cluster cj: sj = μ(cj). (A minimal sketch of this loop follows.)
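A minimal sketch (ours, not from the slides) of the K-means loop above, assuming documents are rows of a dense real-valued matrix X, Euclidean distance, and termination when the doc partition no longer changes:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means over an (n, m) array of document vectors."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Select K random docs as seeds.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assign each doc to the cluster whose centroid is nearest.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):   # doc partition unchanged: converged
            break
        assign = new_assign
        # Update each seed to the centroid (mean) of its cluster.
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return assign, centroids
```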
19
K Means Example (K=2) Pick seeds Reassign clusters Compute centroids
Converged! Note that the × marking a centroid need not be a valid data point (centroid vs. medoid).
20
Termination conditions
Several possibilities, e.g.: a fixed number of iterations; the doc partition is unchanged; centroid positions don't change. [Figure: two snapshots of points in clusters a, b, and c, with capital letters marking the centroids.] In the long run, does an unchanged centroid imply that the documents in a cluster are unchanged? The last two conditions are equivalent (argue both directions). There is a danger of oscillation if a sample is equidistant from two centroids and we keep assigning it to a different cluster each time.
21
Convergence Why should the K-means algorithm ever reach a fixed point?
A state in which clusters don't change. K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm. EM is known to converge, though the number of iterations could be large. There is a danger of oscillation if a sample is equidistant from two centroids and we keep assigning it to a different cluster each time.
22
Convergence of K-Means
Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid: Gk = Σi (di – ck)² (summed over all di in cluster k), and G = Σk Gk. Intuition: reassignment monotonically decreases G, since each vector is assigned to the closest centroid. Recomputation of centroids also decreases G (next slide). K-means typically converges quickly.
23
Convergence of K-Means
Sec. 16.4. Recomputation monotonically decreases each Gk, since (with mk the number of members in cluster k) Σi (di – a)² reaches its minimum where Σi –2(di – a) = 0, i.e., Σi di = mk a, so a = (1/mk) Σi di = ck, the centroid. K-means typically converges quickly. (The derivation is restated in LaTeX below.)
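The same derivation, restated compactly in LaTeX (notation as above: ω_k is cluster k, m_k its size, c_k its centroid):

```latex
G_k = \sum_{\vec d_i \in \omega_k} \lVert \vec d_i - \vec a \rVert^2,
\qquad
\frac{\partial G_k}{\partial \vec a}
  = \sum_{\vec d_i \in \omega_k} -2\,(\vec d_i - \vec a) = \vec 0
\;\Longrightarrow\;
m_k\,\vec a = \sum_{\vec d_i \in \omega_k} \vec d_i
\;\Longrightarrow\;
\vec a = \frac{1}{m_k} \sum_{\vec d_i \in \omega_k} \vec d_i = \vec c_k .
```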
24
Time Complexity Computing distance between two docs is O(m) where m is the dimensionality of the vectors. Reassigning clusters: O(Kn) distance computations, or O(Knm). Computing centroids: Each doc gets added once to some centroid: O(nm). Assume these two steps are each done once for I iterations: O(IKnm).
25
Seed Choice Results can vary based on random seed selection.
Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings. Select good seeds using a heuristic (e.g., a doc least similar to any existing mean); try out multiple starting points; or initialize with the results of another method. Example showing sensitivity to seeds: in the figure, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}. (A multiple-restart sketch follows.)
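One way to act on "try out multiple starting points": run K-means from several random seeds and keep the run with the lowest within-cluster sum of squares. This sketch reuses the hypothetical kmeans function from the earlier sketch.

```python
import numpy as np

def kmeans_restarts(X, k, n_restarts=10):
    """Run K-means from several seeds; keep the clustering with the lowest RSS."""
    X = np.asarray(X, dtype=float)
    best = None
    for seed in range(n_restarts):
        assign, centroids = kmeans(X, k, seed=seed)   # from the earlier sketch
        rss = sum(np.sum((X[assign == j] - centroids[j]) ** 2) for j in range(k))
        if best is None or rss < best[0]:
            best = (rss, assign, centroids)
    return best[1], best[2]
```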
26
K-means issues, variations, etc.
Recomputing the centroid after every assignment (rather than after all points are re-assigned) can improve speed of convergence of K-means Assumes clusters are spherical in vector space Sensitive to coordinate changes, weighting etc. Disjoint and exhaustive Doesn't have a notion of "outliers" But can add outlier filtering
27
How Many Clusters? Number of clusters K is given
Partition n docs into predetermined number of clusters Finding the "right" number of clusters is part of the problem Given docs, partition into an "appropriate" number of subsets. E.g., for query results - ideal value of K not known up front - though UI may impose limits.
28
Hierarchical Clustering
Adapted from Slides by Prabhakar Raghavan, Christopher Manning, Ray Mooney and Soumen Chakrabarti. Tree-based clusters: top-down or bottom-up approach. Choices: cluster representatives for combining; definition of the closest cluster (criteria for merging). Evaluation criteria: against a gold standard: purity, Rand index. (Flat vs. tree: practical complexity: linear vs. quadratic.)
29
“The Curse of Dimensionality”
Why document clustering is difficult: while clustering looks intuitive in 2 dimensions, many of our applications involve 10,000 or more dimensions. High-dimensional spaces look different: the probability of random points being close drops quickly as the dimensionality grows. Furthermore, random pairs of vectors are almost all nearly perpendicular, since docs contain different words.
30
Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents. One approach: recursive application of a partitional clustering algorithm. [Example taxonomy: animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean).]
31
Dendrogram: Hierarchical Clustering
Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
32
Hierarchical Clustering algorithms
Agglomerative (bottom-up): Precondition: start with each document as a separate cluster. Postcondition: eventually all documents belong to the same cluster. Divisive (top-down): Precondition: start with all documents belonging to the same cluster. Postcondition: eventually each document forms a cluster of its own. Does not require the number of clusters k in advance, but needs a termination/readout condition. The final state in both agglomerative and divisive clustering is of no use by itself. The history of merging forms a binary tree or hierarchy.
33
Hierarchical Agglomerative Clustering (HAC) Algorithm
Start with all instances in their own cluster. Until there is only one cluster: among the current clusters, determine the two clusters, ci and cj, that are most similar; replace ci and cj with a single cluster ci ∪ cj. Exercise: consider agglomerative clustering on n points on a line. Explain how you could avoid n³ distance computations; how many will your scheme use? (A library-based sketch follows.)
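For the merge loop above, a library-based sketch using SciPy's agglomerative clustering (a standard implementation, not the slides' own code); the 20 random "documents" are made up for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((20, 50))   # 20 toy documents, 50 terms

# Build the full merge history (dendrogram); "average" = group-average HAC.
Z = linkage(X, method="average", metric="cosine")

# Read out a flat clustering by cutting the dendrogram into 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```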
34
Dendrogram: Document Example
As clusters agglomerate, docs are likely to fall into a hierarchy of "topics" or concepts. [Example dendrogram over d1–d5: d1 and d2 merge into {d1,d2}; d4 and d5 merge into {d4,d5}, which then merges with d3 into {d3,d4,d5}.]
35
Key notion: cluster representative
We want a notion of a representative point in a cluster, to represent the location of each cluster. The representative should be some sort of "typical" or central point in the cluster, e.g., the point inducing the smallest radius over the docs in the cluster, the smallest sum of squared distances, etc., or the point that is the "average" of all docs in the cluster: the centroid or center of gravity. Measure intercluster distances by distances between centroids. Cf. mean, median, mode …
36
Example: n=6, k=3, closest pair of centroids
[Figure: documents d1 and d2 are merged; the centroid after the first step and the centroid after the second step are shown.]
37
Outliers in centroid computation
Can ignore outliers when computing the centroid. What is an outlier? There are lots of statistical definitions, e.g., a point whose moment relative to the centroid exceeds M times some cluster moment, say M = 10. [Figure: a centroid with a distant outlier point.]
38
Closest pair of clusters
Many variants to defining the closest pair of clusters: Single-link: similarity of the most cosine-similar pair of points. Complete-link: similarity of the "furthest" points, i.e., the least cosine-similar pair. Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar. Average-link: average cosine between pairs of elements.
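If one uses SciPy's linkage (as in the earlier HAC sketch, with the same toy matrix X), these variants correspond to its method argument:

```python
from scipy.cluster.hierarchy import linkage

Z_single   = linkage(X, method="single",   metric="cosine")
Z_complete = linkage(X, method="complete", metric="cosine")
Z_average  = linkage(X, method="average",  metric="cosine")
Z_centroid = linkage(X, method="centroid")   # centroid linkage requires Euclidean distances
```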
39
Single Link Agglomerative Clustering
Use the maximum similarity over pairs: sim(ci, cj) = max over di in ci and dj in cj of sim(di, dj). Can result in "straggly" (long and thin) clusters due to a chaining effect. After merging ci and cj, the similarity of the resulting cluster to another cluster ck is sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck)). At each step, merge the pair of clusters with the highest inter-cluster similarity under this definition.
40
Single Link Example
41
Complete Link Agglomerative Clustering
Use the minimum similarity over pairs: sim(ci, cj) = min over di in ci and dj in cj of sim(di, dj). Makes "tighter," more spherical clusters that are typically preferable. After merging ci and cj, the similarity of the resulting cluster to another cluster ck is sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck)). [Figure: clusters Ci, Cj, Ck.]
42
Complete Link Example
43
Computational Complexity
In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²). In each of the subsequent n − 2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters. In order to maintain an overall O(n²) performance, computing similarity to each cluster must be done in constant time. Often O(n³) if done naively, or O(n² log n) if done more cleverly. Iteration one: n² distances. Computing distances from the newly formed cluster: O(n). Subsequent iterations: O(n) each, so the overall cost of the subsequent iterations is O(n²).
44
Group Average Agglomerative Clustering
Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters. Compromise between single and complete link. Two options: Averaged across all ordered pairs in the merged cluster Averaged over all pairs between the two original clusters No clear difference in efficacy Weighted average can be computed efficiently (remember avg and number of points)
45
Computing Group Average Similarity
Assume cosine similarity and vectors normalized to unit length. Always maintain the sum of vectors in each cluster. Then the similarity of two clusters can be computed in constant time (see the formula below).
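The constant-time formula implied here (as in IIR Chapter 17), assuming unit-length vectors, with s(ω) the maintained sum of vectors in cluster ω and N_i = |ω_i|:

```latex
\mathrm{sim}_{\mathrm{GA}}(\omega_i,\omega_j) =
\frac{1}{(N_i+N_j)(N_i+N_j-1)}
\Bigl[\bigl(\vec s(\omega_i)+\vec s(\omega_j)\bigr)\cdot\bigl(\vec s(\omega_i)+\vec s(\omega_j)\bigr) - (N_i+N_j)\Bigr]
```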
46
Medoid as Cluster Representative
The centroid does not have to be a document. Medoid: A cluster representative that is one of the documents (E.g., the document closest to the centroid) One reason this is useful Consider the representative of a large cluster (>1K docs) The centroid of this cluster will be a dense vector The medoid of this cluster will be a sparse vector Compare: mean/centroid vs. median/medoid
47
Efficiency: “Using approximations”
In the standard algorithm, we must find the closest pair of centroids at each step. Approximation: instead, find a nearly closest pair, using some data structure that makes this approximation easy to maintain. Simplistic example: maintain the closest pair based on distances of projections onto a random line. [Figure: points projected onto a random line.]
48
Major issue - labeling After the clustering algorithm finds clusters, how can they be made useful to the end user? We need a pithy label for each cluster: in search results, say "Animal" or "Car" in the jaguar example; in topic trees (Yahoo!), we need navigational cues. Often done by hand, a posteriori. Cf. the news-documents application and the Clusty search engine.
49
How to Label Clusters Show titles of typical documents
Titles are easy to scan, and authors create them for quick scanning! But you can only show a few titles, which may not fully represent the cluster. Show words/phrases prominent in the cluster: more likely to fully represent the cluster; use distinguishing words/phrases (differential labeling). News-document cluster labels (our NLDB-2008 paper): synthesize the cluster label using set-based comparison criteria, restoring word-sequence information in the final label.
50
Labeling Common heuristics: list the 5-10 most frequent terms in the centroid vector; drop stop-words; stem. Differential labeling by frequent terms: within a collection on "Computers", the clusters all have the word "computer" as a frequent term. Discriminant analysis of centroids. Perhaps better: a distinctive noun phrase. (A small frequent-terms sketch follows.)
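A minimal sketch of the frequent-terms heuristic; the vocabulary, weights, and stop-word list below are hypothetical:

```python
import numpy as np

def label_cluster(centroid, vocab, stop_words, top_n=5):
    """Return the top_n highest-weight non-stop-word terms of a cluster centroid."""
    order = np.argsort(centroid)[::-1]                       # indices by decreasing weight
    terms = [vocab[i] for i in order if vocab[i] not in stop_words]
    return terms[:top_n]

vocab = ["the", "engine", "car", "of", "automobile", "speed"]
centroid = np.array([0.9, 0.4, 0.8, 0.7, 0.6, 0.3])
print(label_cluster(centroid, vocab, {"the", "of"}, top_n=3))  # ['car', 'automobile', 'engine']
```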
51
What is a Good Clustering?
Internal criterion: A good clustering will produce high quality clusters in which: the intra-class (that is, intra-cluster) similarity is high the inter-class similarity is low The measured quality of a clustering depends on both the document representation and the similarity measure used
52
External criteria for clustering quality
Quality measured by its ability to discover some or all of the hidden patterns or latent classes in gold standard data Assesses a clustering with respect to ground truth … requires labeled data Assume documents with C gold standard classes, while our clustering algorithms produce K clusters, ω1, ω2, …, ωK with ni members.
53
External Evaluation of Cluster Quality
Simple measure: purity, the number of documents from the dominant class in cluster ωi divided by the size of cluster ωi. Other measures are the entropy of classes in clusters (or the mutual information between classes and clusters). Purity is biased because having n singleton clusters maximizes it.
54
Purity example
[Figure: three clusters over 17 documents from three classes; the per-cluster class counts are (5, 1, 0), (1, 4, 1), and (2, 0, 3).]
Cluster I: Purity = 1/6 × max(5, 1, 0) = 5/6. Cluster II: Purity = 1/6 × max(1, 4, 1) = 4/6. Cluster III: Purity = 1/5 × max(2, 0, 3) = 3/5. (A small purity computation follows.)
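The purity numbers above can be reproduced from the per-cluster class counts with a small helper (overall purity shown as well):

```python
import numpy as np

def purity(contingency):
    """contingency[k][c] = number of docs of gold class c in cluster k."""
    contingency = np.asarray(contingency)
    per_cluster = contingency.max(axis=1) / contingency.sum(axis=1)
    overall = contingency.max(axis=1).sum() / contingency.sum()
    return per_cluster, overall

counts = [[5, 1, 0],   # Cluster I
          [1, 4, 1],   # Cluster II
          [2, 0, 3]]   # Cluster III
per_cluster, overall = purity(counts)
print(per_cluster)   # [5/6, 4/6, 3/5]
print(overall)       # (5 + 4 + 3) / 17 ≈ 0.71
```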
55
Rand Index: number of document pairs.
                                     Same cluster in clustering | Different clusters in clustering
Same class in ground truth:                      A              |               C
Different classes in ground truth:               B              |               D
56
Rand index: symmetric version
RI = (A + D) / (A + B + C + D). Compare with standard precision, P = A / (A + B), and recall, R = A / (A + C).
57
Rand Index example: RI = 0.68. Number of document pairs:
                                     Same cluster in clustering | Different clusters in clustering
Same class in ground truth:                     20              |              24
Different classes in ground truth:              20              |              72
RI = (20 + 72) / 136 = 92 / 136 ≈ 0.68. (The different-classes/same-cluster cell follows from the total: 136 − 20 − 24 − 72 = 20.) (A small Rand-index computation follows.)
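A small sketch that recovers RI ≈ 0.68 by counting agreeing document pairs; the 17 cluster/class labels below are reconstructed from the purity example above:

```python
from itertools import combinations

def rand_index(clusters, classes):
    """Fraction of doc pairs on which the clustering and the gold standard agree."""
    agree = total = 0
    for (c1, g1), (c2, g2) in combinations(zip(clusters, classes), 2):
        total += 1
        if (c1 == c2) == (g1 == g2):   # A (both "same") or D (both "different")
            agree += 1
    return agree / total

clusters = [1]*6 + [2]*6 + [3]*5
classes  = ["x"]*5 + ["o"] + ["x"] + ["o"]*4 + ["d"] + ["x"]*2 + ["d"]*3
print(rand_index(clusters, classes))   # 92/136 ≈ 0.68
```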
58
Final word and resources
In clustering, clusters are inferred from the data without human input (unsupervised learning) However, in practice, it’s a bit less clear: there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, . . . Resources IIR 16 except 16.5 IIR 17.1–17.3
59
SKIP WHAT FOLLOWS
60
Term vs. document space So far, we clustered docs based on their similarities in term space. For some applications, e.g., topic analysis for inducing navigation structures, we can "dualize": use docs as axes; represent (some) terms as vectors; base proximity on co-occurrence of terms in docs; now we are clustering terms, not docs.
61
Term vs. document space Cosine computation Cluster labeling
Constant for docs in term space Grows linearly with corpus size for terms in doc space Cluster labeling Clusters have clean descriptions in terms of noun phrase co-occurrence Application of term clusters Cf. Latent Semantic Indexing
62
Multi-lingual docs E.g., Canadian government docs.
Every doc in English and equivalent French. Must cluster by concepts rather than language Simplest: pad docs in one language with dictionary equivalents in the other thus each doc has a representation in both languages Axes are terms in both languages
63
Feature selection Which terms to use as axes for vector space?
IDF is a form of feature selection Can exaggerate noise e.g., mis-spellings Better to use highest weight mid-frequency words – the most discriminating terms Pseudo-linguistic heuristics, e.g., drop stop-words stemming/lemmatization use only nouns/noun phrases Good clustering should "figure out" some of these
64
Evaluation of clustering
Perhaps the most substantive issue in data mining in general: how do you measure goodness? Most measures focus on computational efficiency Time and space For application of clustering to search: Measure retrieval effectiveness
65
Approaches to evaluating
Anecdotal Ground "truth" comparison Cluster retrieval Purely quantitative measures Probability of generating clusters found Average distance between cluster members Microeconomic / utility
66
Anecdotal evaluation Probably the commonest (and surely the easiest)
"I wrote this clustering algorithm and look what it found!" No benchmarks, no comparison possible Any clustering algorithm will pick up the easy stuff like partition by languages Generally, unclear scientific value.
67
Ground “truth” comparison
Take a union of docs from a taxonomy & cluster Yahoo!, ODP, newspaper sections … Compare clustering results to baseline e.g., 80% of the clusters found map "cleanly" to taxonomy nodes But is it the "right" answer? There can be several equally right answers For the docs given, the static prior taxonomy may be incomplete/wrong in places the clustering algorithm may have gotten right things not in the static taxonomy "Subjective"
68
Microeconomic viewpoint
Anything - including clustering - is only as good as the economic utility it provides. For clustering: the net economic gain produced by an approach (vs. another approach). Strive for a concrete optimization problem. Examples: recommendation systems; clock time for interactive search (which is expensive).
69
Evaluation example: Cluster retrieval
Ad-hoc retrieval Cluster docs in returned set Identify best cluster & only retrieve docs from it How do various clustering methods affect the quality of what's retrieved? Concrete measure of quality: Precision as measured by user judgements for these queries Done with TREC queries
70
Evaluation Compare two IR algorithms
1. send query, present ranked results 2. send query, cluster results, present clusters The experiment was simulated (no users). Results were clustered into 5 clusters. Clusters were ranked according to the percentage of relevant documents; documents within clusters were ranked according to similarity to the query.