Data Clustering 2 – K-Means (continued) & Hierarchical Methods
Recap: K-Means Clustering
1. Place K points into the feature space. These points represent the initial cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the assignments do not change.
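A minimal NumPy sketch of the four steps above (illustrative only; the data array X, the function name and the random initialisation are assumptions, not the lecture's code):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k points from the data as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each pattern to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids (and hence the assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```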
K-Means Clustering: Interactive Demo
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
Discussions (4)
Intuitively, what is the ideal partition of the following problem? Can the K-Means clustering algorithm give a satisfactory answer to this problem?
(Figure: elongated clusters; clusters with different variation)
Limitations of K-Means
At each iteration of the K-Means clustering algorithm, a pattern can be assigned to one cluster only, i.e. the assignment is 'hard'.
Observe an extra pattern x that lies midway between the two cluster centroids m1 and m2. With K-Means it will either (1) drag m1 down or (2) drag m2 up, but intuitively it should contribute equally to both clusters.
(Figure: clusters 1 and 2 in the x1–x2 feature space, with pattern x between centroids m1 and m2)
Limitations of K-Means (2)
K-Means cannot effectively solve problems with elongated clusters or clusters with different variation.
Post-Processing
After running a clustering algorithm, different procedures can be used to improve the final assignment:
- Split the cluster with the highest SSE (see the sketch below)
- Merge clusters (e.g. those that are closest)
- Introduce a new centroid (often the point furthest from any cluster centre)
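A small sketch of the first option, splitting the cluster with the highest SSE by re-clustering its points into two (the names X, labels, centroids and the use of scikit-learn's KMeans are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def split_worst_cluster(X, labels, centroids):
    # Per-cluster SSE: sum of squared distances to the cluster's own centroid.
    sse = np.array([
        ((X[labels == j] - centroids[j]) ** 2).sum() for j in range(len(centroids))
    ])
    worst = int(sse.argmax())
    # Re-cluster the worst cluster's points into two sub-clusters.
    sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[labels == worst])
    new_labels = labels.copy()
    # Points in sub-cluster 1 get a brand-new label; sub-cluster 0 keeps the old one.
    new_labels[labels == worst] = np.where(sub.labels_ == 0, worst, len(centroids))
    new_centroids = np.vstack([centroids, sub.cluster_centers_[1]])
    new_centroids[worst] = sub.cluster_centers_[0]
    return new_labels, new_centroids
```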
Pros and Cons of KM
Advantages:
- May be computationally faster than hierarchical clustering (if K is small).
- May produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages:
- The fixed number of clusters makes it difficult to predict what K should be.
- Different initial partitions can result in different final clusters.
- Potential empty clusters (not always bad).
- Does not work well with non-globular clusters.
Outliers
SSE can be greatly affected by outliers. An outlier is a piece of data that does not fit within the distribution of the rest of the data. Outlier analysis is a large research topic in data mining. The key question: is the outlier noise, or is it correct and "interesting"?
Hierarchical (Agglomerative) Clustering
Hierarchical clustering produces a series of clustering results. The results start with each object in its own cluster and end with all of the objects in the same cluster. The intermediate clusterings are created by a series of merges. The resulting tree-like structure is called a dendrogram.
Dendrogram (figure)
The Hierarchical Clustering Algorithm
1. Each item is assigned to its own cluster (n clusters of size one).
2. Let the distances between the clusters equal the distances between the objects they contain.
3. Find the closest pair of clusters and merge them into a single cluster (one less cluster).
4. Re-compute the distances between the new cluster and each of the old clusters.
5. Repeat steps 3 and 4 until there is only one cluster left.
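A short sketch of these steps using SciPy's hierarchical clustering routines (an assumed library choice; the toy data X is also an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))   # 20 toy points in 2-D

# linkage() performs the repeated "find closest pair and merge" loop;
# method='single' recomputes distances with single linkage (see the next slide).
Z = linkage(X, method='single', metric='euclidean')

labels = fcluster(Z, t=3, criterion='maxclust')     # cut the tree into 3 clusters
dendrogram(Z)                                       # draw the full merge history
plt.show()
```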
Re-computing Distances
- Single linkage: the smallest distance between any pair of points, one from each of the two clusters being compared.
- Average linkage: the average distance over all such pairs.
- Complete linkage: the largest distance between any pair of points, one from each of the two clusters being compared.
Other methods include Ward, McQuitty, Median and Centroid.
Interactive demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
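The three linkage rules above can be computed directly between two small clusters; a sketch using assumed toy arrays A and B:

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster A: two points
B = np.array([[4.0, 0.0], [5.0, 1.0]])   # cluster B: two points

D = cdist(A, B)          # Euclidean distance between every A-point and every B-point

single   = D.min()       # single linkage: smallest pairwise distance
complete = D.max()       # complete linkage: largest pairwise distance
average  = D.mean()      # average linkage: mean over all pairwise distances
```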
Re-computing Distances (figure: single linkage, complete linkage and average linkage illustrated)
Pros and Cons of HC
Advantages:
- Can produce an ordering of the objects, which may be informative for data display.
- Smaller clusters are generated, which may be helpful for discovery.
Disadvantages:
- No provision can be made for a relocation of objects that may have been 'incorrectly' grouped at an early stage.
- Use of different distance metrics for measuring distances between clusters may generate different results.
Clustering Gene Expression Data
- Gene expression data often consists of thousands of genes observed over tens of conditions.
- Clustering on expression level will help towards identifying the functionality of unknown genes.
- Clustering can be used to reduce the dimensionality of the data, making it easier to model.
Clustering Gene Expression Data (figure)
Other Clustering Methods: Fuzzy Clustering
For example, fuzzy c-means. In real applications there is often no sharp boundary between clusters, so fuzzy clustering is often better suited. Fuzzy c-means is a fuzzification of K-Means and the most well-known fuzzy clustering method.
Fuzzy Clustering
Cluster membership is now a weight between zero and one. The distance to a centroid is multiplied by the membership weight.
Interactive demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletFCM.html
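A compact fuzzy c-means sketch along these lines, using one standard formulation of the membership and centroid updates (the fuzzifier m = 2, the fixed iteration count and the variable names are assumptions, not values given on the slide):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(max_iter):
        # Distance from every point to every centroid (small epsilon avoids /0).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        # Membership weights in [0, 1]: closer centroids get weights nearer to 1.
        ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=2)
        # Each centroid is a membership-weighted mean of all points.
        W = U ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
    return U, centroids
```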
DBSCAN (from Ranjay Sankar notes, University of Florida)
DBSCAN is a density-based clustering algorithm. Density = the number of points within a specified radius (Eps). A point is a core point if it has more than a specified number of points (MinPts) within Eps; a core point lies in the interior of a cluster.
DBSCAN (continued)
A border point has fewer than MinPts within Eps but is in the neighbourhood of a core point. A noise point is any point that is neither a core point nor a border point.
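A sketch using scikit-learn's DBSCAN (an assumed library choice; Eps and MinPts map to the eps and min_samples parameters, and the toy data is an assumption):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(100, 2))   # toy 2-D data

# eps = neighbourhood radius (Eps), min_samples = MinPts.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Core and border points receive a cluster id >= 0; noise points are labelled -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
```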
DBSCAN (figure)
Source: http://www.cise.ufl.edu/class/cis4930sp09dm/notes/dm5part4.pdf
DBSCAN: Density-Reachability
A point can be density-reachable directly or indirectly:
- p is directly density-reachable from p2
- p2 is directly density-reachable from p1
- p1 is directly density-reachable from q
- p <- p2 <- p1 <- q form a chain, so p is (indirectly) density-reachable from q
DBSCAN (figure, University of Buffalo)
Source: http://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf
DBSCAN (figure)
Other Clustering Methods: Clustering as Optimisation
Remember one flaw with hierarchical clustering? "No provision can be made for a relocation of objects that may have been 'incorrectly' grouped at an early stage." And with K-Means: "Different initial partitions can result in different final clusters." Both are forms of local search.
Clustering as Optimisation
We can overcome this by using optimisation techniques for finding global optima, e.g. evolutionary algorithms: represent a clustering as a chromosome and evolve better clusterings using a fitness function (SSE?).
Clustering as Optimisation
E.g. represent a clustering of ten objects a–j as a chromosome of integers: [1, 2, 1, 1, 2, 2, 3, 1, 1, 3]
This encodes Cluster 1 = {a, c, d, h, i}, Cluster 2 = {b, e, f} and Cluster 3 = {g, j}.
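A sketch of evaluating such a chromosome with SSE as the fitness function (the helper name and the toy data are assumptions; lower SSE means a fitter chromosome):

```python
import numpy as np

def sse_fitness(chromosome, X):
    # chromosome[i] is the cluster label assigned to object X[i].
    labels = np.asarray(chromosome)
    sse = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        centre = members.mean(axis=0)             # centroid implied by the grouping
        sse += ((members - centre) ** 2).sum()    # squared distances to that centroid
    return sse

# Example: 10 objects a..j encoded as on the slide.
X = np.random.default_rng(0).normal(size=(10, 2))
print(sse_fitness([1, 2, 1, 1, 2, 2, 3, 1, 1, 3], X))
```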
Feature Selection
- Recall that the clustering process involved selecting appropriate features.
- This process is known as feature selection.
- It is a very large research topic in data mining.
- It will affect the final clustering.
Feature Selection (figure: scatterplots from different features of the same dataset)
Evaluating Cluster Quality
- How do we know if the discovered clusters are any good?
- The choice of the correct metric for judging the worth of a clustering arrangement is vital for success.
- There are as many metrics as methods; each has its own merits and drawbacks.
Evaluating Cluster Quality
For example, K-Means judges the worth of a clustering arrangement by the square of how far each item in a cluster is from the cluster centre. This is the sum of squared Euclidean distances (SSE), where X is a cluster of size k, x_i is an element of the cluster and c is the centre of the cluster.
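The formula itself did not survive the slide export; reconstructed here from the description above, using the slide's notation (the total SSE of a clustering is this quantity summed over all clusters):

```latex
\mathrm{SSE}(X) = \sum_{i=1}^{k} \lVert x_i - c \rVert^{2},
\qquad c = \frac{1}{k} \sum_{i=1}^{k} x_i
```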
Evaluating Cluster Quality
- Unsupervised: cohesion/homogeneity and separation.
- Supervised: how close the result is to the "true" clustering.
- Relative: use the above to compare two or more clusterings, e.g. K-Means vs hierarchical.
Cohesion & Separation (figure)
Evaluating Cluster Quality
Other variations include the silhouette coefficient:
- +1 indicates points that are very distant from neighbouring clusters;
- 0 indicates points that are not distinctly in one cluster or another;
- -1 indicates points that are probably assigned to the wrong cluster.
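A sketch of computing silhouette values with scikit-learn (an assumed library choice; the toy data and the use of K-Means to produce labels are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X = np.random.default_rng(0).normal(size=(100, 2))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

per_point = silhouette_samples(X, labels)   # one value in [-1, +1] per point
overall = silhouette_score(X, labels)       # mean silhouette over all points
```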
Supervised
But what if we know something about the "true" clusters? Can we use this to test the effectiveness of different clustering algorithms?
Comparing Clusters
Metrics exist to measure how similar two clustering arrangements are. Thus, if a method produces a set of similar clustering arrangements (according to the metric), the method is consistent. We will consider the Weighted Kappa metric, which has been adapted from medical statistics.
Lab in 2 Weeks
In the lab:
- Explore some more complex data.
- Use Weighted Kappa to evaluate the clusters – essentially a value between 0 and 1 (more next week).
- Explore the effect of feature selection on the scatterplots and on clustering.
Lecture in 2 Weeks
More on Weighted Kappa, other cluster evaluation methods, and the coursework!