Text Clustering
E.G.M. Petrakis

Presentation transcript:

1. Clustering
- "Clustering is the unsupervised classification of patterns (observations, data items or feature vectors) into groups (clusters)" [ACM CS'99]
- Instances within a cluster are very similar
- Instances in different clusters are very different

2. Example
[Figure: points plotted in the term1 / term2 plane, falling into visible groups]

3. Applications
- Faster retrieval
- Faster and better browsing
- Structuring of search results
- Revealing classes and other data regularities
- Directory construction
- Better data organization in general

4. Cluster Searching
- Similar instances tend to be relevant to the same requests
- The query is mapped to the closest cluster by comparison with the cluster centroids
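A minimal sketch of this routing step, assuming documents and queries are already term-weight vectors; `nearest_cluster` is an illustrative helper, not from the slides:

```python
import numpy as np

def nearest_cluster(query, centroids):
    """Return the index of the centroid closest to the query (cosine similarity).

    query: 1-D term-weight vector; centroids: 2-D array, one centroid per row.
    """
    q = query / np.linalg.norm(query)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return int(np.argmax(c @ q))
```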

5. Notation
- N: number of elements
- Class: real-world grouping (ground truth)
- Cluster: grouping produced by the algorithm
- The ideal clustering algorithm will produce clusters equivalent to the real-world classes, with exactly the same members

6. Problems
- How many clusters?
- Complexity? N is usually large
- Quality of clustering
- When is one method better than another?
- Overlapping clusters
- Sensitivity to outliers

7. Example
[Figure: point scatter]

8. Clustering Approaches
- Divisive: build clusters "top-down", starting from the entire data set
  - K-means, Bisecting K-means
  - Hierarchical or flat clustering
- Agglomerative: build clusters "bottom-up", starting with individual instances and iteratively combining them into larger clusters at higher levels
  - Hierarchical clustering
- Combinations of the above
  - Buckshot algorithm

9. Hierarchical vs. Flat Clustering
- Flat: all clusters at the same level
  - K-means, Buckshot
- Hierarchical: nested sequence of clusters
  - A single cluster with all the data at the top and singleton clusters at the bottom
  - Intermediate levels are more useful
  - Every intermediate level combines two clusters from the next lower level
  - Agglomerative, Bisecting K-means

10. Flat Clustering
[Figure: points partitioned into flat, non-nested clusters]

11. Hierarchical Clustering
[Figure: points 1-7 and the dendrogram built over them]

12. Text Clustering
- Finds overall similarities among documents or groups of documents
- Faster searching, browsing, etc.
- Needs to know how to compute the similarity (or, equivalently, the distance) between documents

13. Query - Document Similarity
- Similarity is defined as the cosine of the angle θ between the document and query vectors:
  sim(d, q) = cos θ = (d · q) / (‖d‖ ‖q‖)
[Figure: two document vectors d1, d2 with the angle θ between them]
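For concreteness, a small sketch of the cosine measure over dense term-weight vectors (a real system would use sparse vectors):

```python
import numpy as np

def cosine_similarity(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    return float(np.dot(d, q) / (np.linalg.norm(d) * np.linalg.norm(q)))
```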

14. Document Distance
- Consider documents d1, d2 with vectors u1, u2
- Their distance is defined as the length of the segment AB joining the tips of the two vectors: dist(d1, d2) = ‖u1 − u2‖
- For unit-length vectors this equals √(2 − 2 cos θ)

15. Normalization by Document Length
- The longer a document is, the more likely it is for a given term to appear in it
- Normalize the term weights by document length, so that terms in long documents are not given more weight
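A sketch of the normalization step, assuming a dense document-term matrix with one row per document:

```python
import numpy as np

def length_normalize(doc_term_matrix):
    """Scale each row (document) to unit Euclidean length so that long
    documents do not dominate similarity scores."""
    norms = np.linalg.norm(doc_term_matrix, axis=1, keepdims=True)
    return doc_term_matrix / np.where(norms == 0, 1, norms)
```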

16. Evaluation of Cluster Quality
- Clusters can be evaluated using internal or external knowledge
- Internal measures: intra-cluster cohesion and cluster separability
  - intra-cluster similarity
  - inter-cluster similarity
- External measures: quality of clusters compared to real classes
  - Entropy (E), Harmonic Mean (F)

17. Intra-Cluster Similarity
- A measure of cluster cohesion
- Defined as the average pair-wise similarity of the documents in a cluster S:
  sim_intra(S) = (1/|S|²) Σ_{d,d' ∈ S} d · d' = c · c
- where c = (1/|S|) Σ_{d ∈ S} d is the cluster centroid
- Documents (not centroids) have unit length
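A sketch checking the identity above on unit-length vectors: the mean pairwise dot product (self-pairs included, as in the c·c formulation) equals the squared length of the centroid:

```python
import numpy as np

def intra_similarity(docs):
    """Average pairwise dot-product similarity of unit-length doc vectors.

    Equals c.c where c is the mean centroid; computed both ways as a check.
    """
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)  # unit length
    c = docs.mean(axis=0)
    pairwise = (docs @ docs.T).mean()  # all pairs, self-pairs included
    assert np.isclose(pairwise, c @ c)
    return float(c @ c)
```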

18. Inter-Cluster Similarity
a) Single Link: similarity of the two most similar members
b) Complete Link: similarity of the two least similar members
c) Group Average: average similarity between members
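The three criteria as one hedged sketch over unit document vectors, with `kind` selecting the linkage:

```python
import numpy as np

def linkage_similarity(A, B, kind="single"):
    """Similarity between clusters A and B (rows = unit doc vectors).

    single  : max pairwise similarity (two most similar members)
    complete: min pairwise similarity (two least similar members)
    average : mean pairwise similarity (group average)
    """
    sims = A @ B.T  # all pairwise dot products
    if kind == "single":
        return float(sims.max())
    if kind == "complete":
        return float(sims.min())
    return float(sims.mean())
```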

19. Example
[Figure: two clusters S and S' with centroids c and c'; the single-link, complete-link, and group-average member pairs are marked]

20. Entropy
- Measures the quality of flat clusters using external knowledge
  - Pre-existing classification
  - Assessment by experts
- P_ij: probability that a member of cluster j belongs to class i
- The entropy of cluster j is defined as E_j = −Σ_i P_ij log P_ij

21. Entropy (cont'd)
- Total entropy over all clusters: E = Σ_{j=1..m} (n_j / N) E_j
- where n_j is the size of cluster j, m is the number of clusters, and N is the number of instances
- The smaller the value of E, the better the quality of the clustering
- Note that the best (zero) entropy is obtained trivially when each cluster contains exactly one instance
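A small sketch of the entropy computation, assuming each cluster is given as the list of ground-truth class labels of its members (natural logarithm used):

```python
import math
from collections import Counter

def total_entropy(clusters):
    """Entropy of a clustering given external class labels.

    clusters: list of lists, each inner list holding the class labels of
    one cluster's members. E = sum_j (n_j / N) * E_j.
    """
    N = sum(len(c) for c in clusters)
    E = 0.0
    for labels in clusters:
        n_j = len(labels)
        E_j = -sum((n / n_j) * math.log(n / n_j)
                   for n in Counter(labels).values())
        E += (n_j / N) * E_j
    return E
```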

22. Harmonic Mean (F)
- Treats each cluster as a query result
- F combines precision (P) and recall (R)
- F_ij for cluster j and class i is defined as
  F_ij = 2 P_ij R_ij / (P_ij + R_ij), where P_ij = n_ij / n_j and R_ij = n_ij / n_i
- n_ij: number of instances of class i in cluster j; n_i: number of instances of class i; n_j: number of instances in cluster j

23. Harmonic Mean (cont'd)
- The F value of any class i is the maximum value it achieves over all clusters j: F_i = max_j F_ij
- The F value of a clustering solution is computed as the weighted average over all classes:
  F = Σ_i (n_i / N) F_i
- where N is the number of data instances
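A sketch of the overall F computation, assuming classes and clusters are given as sets of instance ids (an illustrative format, not from the slides):

```python
def f_measure(clusters, classes):
    """Overall F of a clustering against ground-truth classes.

    clusters, classes: iterables of sets of instance ids.
    F = sum_i (n_i / N) * max_j F_ij, with P = n_ij/n_j and R = n_ij/n_i.
    """
    N = sum(len(c) for c in classes)
    F = 0.0
    for cls in classes:
        best = 0.0
        for clu in clusters:
            n_ij = len(cls & clu)
            if n_ij:
                P, R = n_ij / len(clu), n_ij / len(cls)
                best = max(best, 2 * P * R / (P + R))
        F += (len(cls) / N) * best
    return F
```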

24. Quality of Clustering
- A good clustering method
  - Maximizes intra-cluster similarity
  - Minimizes inter-cluster similarity
  - Minimizes Entropy
  - Maximizes the Harmonic Mean
- It is difficult to achieve all of these simultaneously
  - Maximize some objective function of the above
- An algorithm is better than another if it scores better on most of these measures

25. K-means Algorithm
- Select K centroids
- Repeat I times, or until the centroids no longer change:
  - Assign each instance to the cluster represented by its nearest centroid
  - Compute new centroids from the assignments
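A plain K-means sketch of these steps, using Euclidean distance on an (N, d) array (a cosine variant would first length-normalize the rows):

```python
import numpy as np

def kmeans(X, K, iters=10, seed=0):
    """Plain K-means sketch: X is an (N, d) array; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(iters):
        # assign each instance to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster went empty
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```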

26-32. K-Means demo (Nikos Hourdakis, MSc Thesis): step-by-step screenshots of K-means in action, from http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html

33. Comments on K-Means (1)
- Generates a flat partition of K clusters
- K is the desired number of clusters and must be known in advance
- Starts with K random cluster centroids
- A centroid is the mean or the median of a group of instances
- The mean rarely corresponds to a real instance

34. Comments on K-Means (2)
- Up to I = 10 iterations
- Keep the clustering that achieves the best inter/intra-cluster similarity, or the final clusters after I iterations
- Complexity O(IKN)
- A repeated application of K-Means for K = 2, 4, … can produce a hierarchical clustering

35. Choosing Centroids for K-means
- The quality of the clustering depends on the selection of the initial centroids
- Random selection may result in a poor convergence rate, or in convergence to sub-optimal clusterings
- Select good initial centroids using a heuristic or the results of another method
  - Buckshot algorithm

36. Incremental K-Means
- Update each centroid immediately after each point is assigned to a cluster, rather than at the end of the iteration
- Reassign instances to clusters at the end of each iteration
- Converges faster than simple K-means
- Usually 2-5 iterations
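A hedged sketch of the incremental variant: each centroid is nudged by a running-mean update the moment a point is assigned to it (a simplification; the slide's end-of-iteration reassignment is folded into the next pass):

```python
import numpy as np

def incremental_kmeans(X, K, iters=5, seed=0):
    """Incremental K-means sketch: centroids are updated per assignment."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    counts = np.ones(K)            # points currently backing each centroid
    labels = np.full(len(X), -1)
    for _ in range(iters):
        for i, x in enumerate(X):
            k = int(np.linalg.norm(centroids - x, axis=1).argmin())
            labels[i] = k
            counts[k] += 1
            centroids[k] += (x - centroids[k]) / counts[k]  # running mean
    return labels, centroids
```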

37. Bisecting K-Means
- Starts with a single cluster containing all instances
- Select a cluster to split: the largest cluster, or the cluster with the lowest intra-cluster similarity
- The selected cluster is split into 2 partitions using K-means (K = 2)
- Repeat up to the desired depth h
- Produces a hierarchical clustering
- Complexity O(2hN)
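A sketch of the bisecting strategy, here always splitting the largest cluster; the `kmeans` argument stands for any 2-way splitter, such as the K-means sketch above:

```python
import numpy as np

def bisecting_kmeans(X, depth, kmeans):
    """Bisecting K-means sketch: repeatedly 2-split the largest cluster.

    Returns a flat list of index arrays, one per leaf cluster.
    """
    clusters = [np.arange(len(X))]   # start: one cluster with everything
    for _ in range(depth):
        big = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(big)
        labels, _ = kmeans(X[idx], 2)  # split the selected cluster in two
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters
```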

38. Agglomerative Clustering
- Compute the similarity matrix between all pairs of instances
- Start from singleton clusters
- Repeat until a single cluster remains:
  - Merge the two most similar clusters and replace them with a single cluster
  - Update the similarity matrix: the rows/columns of the merged clusters are replaced by one row/column for the new cluster
- Complexity O(N²)
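A naive sketch of this loop over a precomputed similarity matrix; for clarity it recomputes the linkage from cluster members instead of updating the matrix in place (the in-place update is what gives the O(N²) bound):

```python
def agglomerative(sim, linkage=max):
    """Agglomerative clustering sketch over an (N, N) similarity matrix.

    linkage: max = single link, min = complete link.
    Returns the merge history as [(members_a, members_b), ...].
    """
    clusters = {i: [i] for i in range(len(sim))}
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the highest linkage similarity
        a, b = max(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda p: linkage(sim[i][j] for i in clusters[p[0]]
                                         for j in clusters[p[1]]))
        merges.append((clusters[a][:], clusters[b][:]))
        clusters[a] += clusters.pop(b)  # replace the pair with their union
    return merges
```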

39. Similarity Matrix

          C1=d1   C2=d2   …   CN=dN
  C1=d1   1       0.8     …   0.3
  C2=d2   0.8     1       …   0.6
  …       …       …       1   …
  CN=dN   0.3     0.6     …   1

40. Update Similarity Matrix

          C1=d1   C2=d2   …   CN=dN
  C1=d1   1       0.8     …   0.3
  C2=d2   0.8     1       …   0.6
  …       …       …       1   …
  CN=dN   0.3     0.6     …   1

The most similar pair, C1 and C2 (similarity 0.8), is merged

41. New Similarity Matrix

              C12=d1∪d2   …   CN=dN
  C12=d1∪d2   1           …   0.4
  …           …           1   …
  CN=dN       0.4         …   1

42. Single Link
- Selects the most similar clusters for merging using single link
- Can result in long and thin clusters due to the "chaining effect"
- Appropriate in some domains, such as clustering islands

43. Complete Link
- Selects the most similar clusters for merging using complete link
- Results in compact, spherical clusters that are preferable

44. Group Average
- Selects the most similar clusters for merging using group average
- A fast compromise between single and complete link

45. Example
[Figure: two clusters with centroids c1 and c2 and points A, B; the single-link, complete-link, and group-average pairs are marked]

46. Inter-Cluster Similarity (using centroids)
- A new cluster is represented by its centroid
- The document-to-cluster similarity is computed against the centroid, e.g. as the cosine sim(d, c) = (d · c) / (‖d‖ ‖c‖)
- The cluster-to-cluster similarity can be computed as single, complete, or group-average similarity

47. Buckshot K-Means
- Combines Agglomerative and K-Means
- Agglomerative produces a good clustering solution but has O(N²) complexity
- Randomly select a sample of √N instances
- Apply Agglomerative on the sample, which takes O((√N)²) = O(N) time
- Use the centroids of the resulting clusters as input to K-Means
- Overall complexity is O(N)
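A sketch of the Buckshot seeding step; `agglomerative_centroids` stands for any routine (e.g. built on the agglomerative sketch above) that returns K centroids for the sample:

```python
import numpy as np

def buckshot_init(X, K, agglomerative_centroids, seed=0):
    """Buckshot seeding sketch: cluster a random sqrt(N) sample
    agglomeratively, then hand the resulting K centroids to K-means."""
    rng = np.random.default_rng(seed)
    n = int(np.sqrt(len(X)))                   # sample size ~ sqrt(N)
    sample = X[rng.choice(len(X), size=n, replace=False)]
    return agglomerative_centroids(sample, K)  # K seed centroids for K-means
```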

48. Example
[Figure: dendrogram over sample instances 1-15; the centroids of the top-level clusters become the initial centroids for K-Means]

49. More on Clustering
- Sound methods based on the document-to-document similarity matrix
  - Graph-theoretic methods
  - O(N²) time
- Iterative methods operating directly on the document vectors
  - O(N log N), O(N²/log N), O(mN) time

50. Soft Clustering
- Hard clustering: each instance belongs to exactly one cluster
  - Does not allow for uncertainty
  - An instance may belong to two or more clusters
- Soft clustering is based on the probabilities that an instance belongs to each of a set of clusters
  - The probabilities over all categories must sum to 1
- Expectation Maximization (EM) is the most popular approach

51. More Methods
- Two documents with similarity > T (threshold) are connected with an edge [Duda & Hart 73]
  - Clusters: the connected components (or maximal cliques) of the resulting graph
  - Problem: selection of an appropriate threshold T
- Zahn's method [Zahn71]
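A sketch of the threshold-graph method using connected components (the maximal-clique variant is harder and not shown):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def threshold_clusters(sim, T):
    """Graph-theoretic clustering sketch: connect documents whose similarity
    exceeds threshold T, then take connected components as clusters."""
    adj = (sim > T) & ~np.eye(len(sim), dtype=bool)  # drop self-loops
    n_components, labels = connected_components(adj, directed=False)
    return labels
```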

52. Zahn's Method [Zahn71]
1. Find the minimum spanning tree of the documents
2. For each doc, delete incident edges with length l > l_avg, where l_avg is the average length of its incident edges
3. Clusters: the connected components of the resulting graph
[Figure: minimum spanning tree in which the dashed (inconsistent) edge is deleted]
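A rough sketch of Zahn's idea with SciPy, assuming a dense symmetric distance matrix; the edge-deletion test here is a simplification of Zahn's full inconsistency criterion:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def zahn_clusters(dist):
    """Zahn-style clustering sketch: build an MST over the distance matrix,
    drop edges longer than the average edge length at either endpoint,
    and return the connected components."""
    mst = minimum_spanning_tree(dist).toarray()
    mst = np.maximum(mst, mst.T)                 # symmetrize the MST edges
    keep = mst.copy()
    for i, j in zip(*np.nonzero(np.triu(mst))):  # each undirected edge once
        for node in (i, j):
            incident = mst[node][mst[node] > 0]  # lengths of edges at node
            if len(incident) > 1 and mst[i, j] > incident.mean():
                keep[i, j] = keep[j, i] = 0      # delete the long edge
    n_components, labels = connected_components(keep > 0, directed=False)
    return labels
```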

53. References
- C. Faloutsos, "Searching Multimedia Databases by Content", Kluwer Academic Publishers, 1996
- M. Steinbach, G. Karypis, V. Kumar, "A Comparison of Document Clustering Techniques", KDD Workshop on Text Mining, 2000
- A.K. Jain, M.N. Murty, P.J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, Sept. 1999
- A.K. Jain, R.C. Dubes, "Algorithms for Clustering Data", Prentice-Hall, 1988, ISBN 0-13-022278-X
- G. Salton, "Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer", Addison-Wesley, 1989

