Clustering
What is clustering? Grouping similar objects together while keeping dissimilar objects apart. In Information Retrieval, the cluster hypothesis states that if one document in a cluster is relevant to a query, the other documents in that cluster are likely to be relevant as well.
Similarity / Distance measures
Cosine similarity measure
Euclidean distance = sqrt((q1-d1)² + (q2-d2)² + … + (qn-dn)²)
Simple matching coefficient = number of features in common
Manhattan distance = |q1-d1| + |q2-d2| + … + |qn-dn|
Dice's similarity measure = 2 * number of matches / (number of features in a + number of features in b)
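As a rough illustration (not part of the original slides), these measures can be computed for simple feature vectors as follows, assuming binary 0/1 features for the matching and Dice coefficients:

```python
import math

def cosine(q, d):
    # dot product divided by the product of the vector lengths
    dot = sum(qi * di for qi, di in zip(q, d))
    return dot / (math.sqrt(sum(qi * qi for qi in q)) *
                  math.sqrt(sum(di * di for di in d)))

def euclidean(q, d):
    return math.sqrt(sum((qi - di) ** 2 for qi, di in zip(q, d)))

def manhattan(q, d):
    return sum(abs(qi - di) for qi, di in zip(q, d))

def simple_matching(q, d):
    # number of features the two binary vectors have in common
    return sum(1 for qi, di in zip(q, d) if qi == di == 1)

def dice(q, d):
    # 2 * matches / (features in q + features in d), for binary vectors
    return 2 * simple_matching(q, d) / (sum(q) + sum(d))
```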
Non-hierarchic clustering The data is partitioned into clusters of similar objects with no hierarchic relationship between the clusters. Clusters can be represented by their centroid, which is the "average" of all the cluster members, sometimes called a class exemplar. The similarity of each object being clustered to a cluster centroid is computed with one of the similarity or distance measures above.
User-defined parameters
The number of clusters desired (this may instead arise automatically as part of the clustering procedure)
The minimum and maximum size for each cluster
The vigilance parameter: a threshold value on the similarity measure, below which an object will not be included in a cluster
Control of the degree of overlap between clusters
Non-hierarchical algorithms can be transformed into hierarchical algorithms by using the clusters obtained at one level as the objects to be classified at the next level, thus producing a hierarchy of clusters.
Single pass algorithm (one version)
The objects to be clustered are processed one by one.
The first object description becomes the centroid of the first cluster.
Each subsequent object is matched against all cluster centroids existing at the time it is processed.
The object is assigned to one cluster (or more, if overlap is allowed) according to some condition on the similarity measure.
If an object fails to match any existing cluster sufficiently closely, it becomes the exemplar of a new cluster.
Single pass algorithm (example)
Set the vigilance parameter (VP) to 2.
Pattern 1 = [4 0 2] automatically goes into the first cluster, which will have centroid [4 0 2].
Pattern 2 = [4 0 1] is sufficiently close to the first cluster to join it, since the Manhattan distance from pattern 2 to the centroid of cluster 1 is <= VP. The centroid of cluster 1 is updated to [4 0 1.5].
Pattern 3 = [0 5 0] forms its own new cluster, since it is too far away from the first cluster (Manhattan distance = 10.5, VP = 2). The new (second) cluster starts with centroid [0 5 0].
Pattern 4 = [1 4 0]: Manhattan distance from centroid 1 = 8.5, Manhattan distance from centroid 2 = 2. So pattern 4 goes into cluster 2, which now has centroid [0.5 4.5 0].
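A minimal sketch of this single pass procedure, assuming Manhattan distance and centroids recomputed as the mean of the cluster members (the function names are ours, not from the slides). Run on the four example patterns, it reproduces the clusters above:

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def single_pass(patterns, vigilance):
    clusters = []   # each cluster: {"members": [...], "centroid": [...]}
    for p in patterns:
        # match the object against all existing cluster centroids
        best = min(clusters, key=lambda c: manhattan(p, c["centroid"]), default=None)
        if best is not None and manhattan(p, best["centroid"]) <= vigilance:
            best["members"].append(p)
            n = len(best["members"])
            best["centroid"] = [sum(col) / n for col in zip(*best["members"])]
        else:
            # too far from every existing cluster: start a new one
            clusters.append({"members": [p], "centroid": list(p)})
    return clusters

# patterns 1-2 end up in cluster 1, patterns 3-4 in cluster 2
print(single_pass([[4, 0, 2], [4, 0, 1], [0, 5, 0], [1, 4, 0]], vigilance=2))
```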
Two pass algorithm (MacQueen's k-means method)
Take the first k objects in the data set as clusters of one member each (seed points).
Assign each of the remaining m-k objects (where m is the total number of objects) to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
After all objects have been assigned, take the existing cluster centroids as seed points and make one more pass through the data set, assigning each object to its nearest seed point.
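A sketch of the two-pass procedure as described, assuming Euclidean distance between objects and centroids (the variable names are illustrative, not from the slides):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def macqueen_kmeans(objects, k):
    # Pass 1: the first k objects become one-member clusters (seed points);
    # each remaining object joins the nearest centroid, which is then updated.
    members = [[obj] for obj in objects[:k]]
    centroids = [list(obj) for obj in objects[:k]]
    for obj in objects[k:]:
        i = min(range(k), key=lambda j: euclidean(obj, centroids[j]))
        members[i].append(obj)
        n = len(members[i])
        centroids[i] = [sum(col) / n for col in zip(*members[i])]
    # Pass 2: keep the final centroids as fixed seed points and reassign
    # every object to its nearest one.
    assignments = [min(range(k), key=lambda j: euclidean(obj, centroids[j]))
                   for obj in objects]
    return centroids, assignments
```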
Hierarchic clustering methods
Hierarchical document clustering methods produce tree-like categorisations (dendrograms) in which small clusters of highly similar documents are nested within much larger clusters of less similar documents.
The individual objects (e.g. documents) are represented by the leaves of the tree, while the root of the tree represents the fact that all the objects ultimately combine into a single cluster.
Clustering may be agglomerative (inside out, bottom up) or divisive (outside in, top down).
Divisive clustering We start with a single cluster containing all the documents, and sequentially subdivide it until we are left with the individual documents. Divisive methods tend to produce monothetic categorisations, where all the documents in a cluster must share certain index terms.
[Figure slides: "Outside In" clustering (1)–(3), diagrams illustrating divisive, top-down clustering step by step]
Agglomerative clustering
Agglomerative methods are more common than divisive ones, especially in information retrieval.
They tend to produce polythetic categorisations, which are more useful in document retrieval.
In a polythetic categorisation, a document is placed in the cluster with which it shares the greatest number of index terms, but there is no single index term that is a prerequisite for cluster membership.
Types of hierarchical agglomerative clustering techniques
Single linkage (nearest neighbour)
Average linkage
Complete linkage (furthest neighbour)
All these methods start from a matrix containing the similarity value between every pair of documents in the collection. The following algorithm covers all three methods:
General hierarchical agglomerative clustering technique
For every document pair, find SIM[i,j], the entry in the similarity matrix, then repeat the following until only one cluster is left:
– Search the similarity matrix to identify the most similar remaining pair of clusters;
– Fuse this pair, K and L, to form a new cluster KL;
– Update SIM by calculating the similarity between the new cluster and each of the remaining clusters.
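A schematic version of this loop (an illustration, not a definitive implementation), assuming the similarity matrix is held as a dictionary keyed by pairs of cluster labels and that a `linkage` function, one of the three update rules below, combines the two old similarities when SIM is updated:

```python
def agglomerate(sim, linkage):
    # sim maps frozenset({cluster_a, cluster_b}) -> similarity value
    clusters = {c for pair in sim for c in pair}   # currently active clusters
    merges = []
    while len(clusters) > 1:
        # search SIM for the most similar remaining pair of clusters
        pair = max((p for p in sim if p <= clusters), key=sim.get)
        k, l = tuple(pair)
        kl = (k, l)                                # fuse K and L into KL
        merges.append(kl)
        clusters -= {k, l}
        # update SIM: similarity of KL to every remaining cluster
        for other in clusters:
            sim[frozenset({kl, other})] = linkage(sim[frozenset({k, other})],
                                                  sim[frozenset({l, other})])
        clusters.add(kl)
    return merges

# e.g. three documents, single linkage (most similar pair decides)
docs_sim = {frozenset({"d1", "d2"}): 0.9,
            frozenset({"d1", "d3"}): 0.2,
            frozenset({"d2", "d3"}): 0.4}
print(agglomerate(docs_sim, linkage=max))
```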
Differences between single, average and complete linkage methods
The methods differ only in how the similarity matrix is updated after a fusion.
Average linkage – when two clusters are fused, the similarity matrix is updated by averaging their similarities to every other document.
Single linkage – the similarity between two clusters is taken from their most similar pair of documents.
Complete linkage – the similarity between two clusters is taken from their least similar pair of documents.
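Phrased as functions that could be passed as the `linkage` argument to the sketch above (the names and the simple two-value average are our illustration; a full average-linkage update would weight the two entries by cluster size):

```python
single_linkage = max      # keep the similarity of the most similar pair
complete_linkage = min    # keep the similarity of the least similar pair

def average_linkage(sim_k_other, sim_l_other):
    # simple average of the two matrix entries; a full implementation would
    # weight them by the sizes of clusters K and L to average over documents
    return (sim_k_other + sim_l_other) / 2
```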
The validity of document clustering
Danger: clustering methods will find patterns even in random data (think of the constellations).
In general, methods which result in little modification of the original similarity data are better than those which distort the inter-object similarities.
The most common distortion measure is the cophenetic correlation coefficient, produced by comparing the values in the original similarity matrix with the inter-object similarities read off the resulting dendrogram.
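One practical way to compute the cophenetic correlation coefficient is with SciPy's hierarchical clustering utilities (this library example is an addition, not from the slides; the random document matrix is purely illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

docs = np.random.rand(20, 8)            # 20 documents, 8 term weights each
original = pdist(docs)                  # original inter-document distances
dendrogram = linkage(docs, method="average")
c, cophenetic_dists = cophenet(dendrogram, original)
print(f"cophenetic correlation: {c:.3f}")   # close to 1 = little distortion
```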