
1 Data Clustering 2 – K-Means (contd.) & Hierarchical Methods

2 Recap: K-Means Clustering
1. Place K points into the feature space. These points represent the initial cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the assignments do not change.
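
As a concrete illustration of the four steps above, here is a minimal NumPy sketch of K-Means. The function name, the random initialisation from the data points, and the iteration cap are choices made for this example, not anything prescribed by the slides.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch following the four steps on the slide."""
    rng = np.random.default_rng(seed)
    # Step 1: place K points in the feature space (here: K randomly chosen patterns).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each pattern to the closest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Step 3: recalculate the K centroid positions
        # (a cluster that ends up empty keeps its old centroid).
        new_centroids = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: stop once the centroids (and hence the assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign
```

A call such as `k_means(data, k=3)` returns the final centroids and the cluster index of each pattern.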

3 K-Means Clustering Interactive Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

4 Discussions (4): Intuitively, what is the ideal partition of the following problem? Can the K-Means clustering algorithm give a satisfactory answer to it? (Figures: elongated clusters; clusters with different variation.)

5 Limitations of K-Means: At each iteration of the K-Means clustering algorithm, a pattern can be assigned to one cluster only, i.e. the assignment is 'hard'. (Figure: two clusters, Cluster 1 and Cluster 2, with centroids m1 and m2 in a feature space x1, x2.) Now observe an extra pattern x located midway between the two cluster centroids. With the K-Means algorithm it will either (1) drag m1 down or (2) drag m2 up, but intuitively it should contribute equally to both clusters.

6 Limitations of K-Means (2): It cannot effectively solve problems with elongated clusters or clusters with different variation.

7 Post Processing: After running a clustering algorithm, different procedures can be used to improve the final assignment (a sketch of each follows):
- Split clusters with the highest SSE.
- Merge clusters (e.g. those that are closest).
- Introduce a new centroid (often the point furthest from any cluster centre).
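
A rough sketch of how each of these post-processing decisions could be computed, assuming the `assign` and `centroids` arrays produced by the K-Means sketch earlier; when and how often to apply each heuristic is left open, as on the slide.

```python
import numpy as np

def cluster_sse(X, assign, centroids):
    """Per-cluster SSE; the cluster with the largest value is a candidate for splitting."""
    return np.array([((X[assign == j] - centroids[j]) ** 2).sum()
                     for j in range(len(centroids))])

def closest_pair(centroids):
    """Indices of the two closest centroids: candidates for merging."""
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # ignore each centroid's distance to itself
    return np.unravel_index(d.argmin(), d.shape)

def furthest_point(X, centroids):
    """The pattern furthest from its nearest cluster centre: a candidate new centroid."""
    nearest = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).min(axis=1)
    return X[nearest.argmax()]
```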

8 Pros and Cons of KM
Advantages:
- May be computationally faster than hierarchical clustering (if K is small).
- May produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages:
- The fixed number of clusters makes it difficult to predict what K should be.
- Different initial partitions can result in different final clusters.
- Potential empty clusters (not always bad).
- Does not work well with non-globular clusters.

9 Pros and Cons of KM (figure only)

10 Outliers: SSE can be affected greatly by outliers. An outlier is a piece of data that does not fit within the distribution of the rest of the data. Outlier analysis is a large research topic in data mining. Key question: is the outlier noise, or is it correct and "interesting"?
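
A tiny made-up illustration of the first point: a single outlier can shift a cluster centre towards itself and dominate the cluster's SSE.

```python
import numpy as np

cluster = np.array([1.0, 1.1, 0.9, 1.0, 1.05])   # a tight one-dimensional cluster
with_outlier = np.append(cluster, 10.0)          # the same cluster plus one outlying value

for data in (cluster, with_outlier):
    centre = data.mean()
    sse = ((data - centre) ** 2).sum()
    print(f"centre={centre:.2f}  SSE={sse:.2f}")
# The outlier pulls the centre from about 1.0 to about 2.5 and inflates the SSE
# from roughly 0.02 to roughly 67.
```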

11 Hierarchical (Agglomerative) Clustering: Hierarchical clustering produces a series of clustering results. The results start off with each object in its own cluster and end with all of the objects in the same cluster. The intermediate clusterings are created by a series of merges. The resulting tree-like structure is called a dendrogram.

12 Dendrogram (figure)

13 The Hierarchical Clustering Algorithm
1) Each item is assigned to its own cluster (n clusters of size one).
2) Let the distances between the clusters equal the distances between the objects they contain.
3) Find the closest pair of clusters and merge them into a single cluster (one less cluster).
4) Re-compute the distances between the new cluster and each of the old clusters.
5) Repeat steps 3 and 4 until there is only one cluster left.
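
A minimal sketch of this procedure using SciPy (assuming SciPy and Matplotlib are available): `linkage` carries out steps 1-5 and `dendrogram` draws the resulting tree.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).random((12, 2))   # 12 items with 2 features each

# Start with 12 singleton clusters and repeatedly merge the closest pair.
Z = linkage(X, method='single')                # 'single' linkage; see the next slide

dendrogram(Z)                                   # the resultant tree-like structure
plt.show()
```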

14 Re-computing Distances
- Single linkage: the smallest distance between any pair of objects from the two clusters being compared (one from each).
- Average linkage: the average distance between all such pairs.
- Complete linkage: the largest distance between any pair of objects from the two clusters being compared (one from each).
Other methods include Ward, McQuitty, Median and Centroid.
Interactive demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
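
The three rules above can be written directly in terms of the pairwise distances between two clusters. A small sketch, assuming A and B are NumPy arrays holding the points of one cluster each:

```python
import numpy as np

def pairwise_distances(A, B):
    """All distances between a point of cluster A and a point of cluster B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_linkage(A, B):
    return pairwise_distances(A, B).min()    # smallest pair distance

def complete_linkage(A, B):
    return pairwise_distances(A, B).max()    # largest pair distance

def average_linkage(A, B):
    return pairwise_distances(A, B).mean()   # average over all pairs
```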

15 Re-computing Distances (figures illustrating single, complete and average linkage)

16 Pros and Cons of HC
Advantages:
- Can produce an ordering of the objects, which may be informative for data display.
- Smaller clusters are generated, which may be helpful for discovery.
Disadvantages:
- No provision can be made for relocating objects that may have been 'incorrectly' grouped at an early stage.
- Different distance metrics for measuring distances between clusters may generate different results.

17 Clustering Gene Expression Data
- Gene expression data often consists of thousands of genes observed over tens of conditions.
- Clustering on expression level will help towards identifying the functionality of unknown genes.
- Clustering can be used to reduce the dimensionality of the data, making it easier to model.

18 Clustering Gene Expression Data (figure)

19 Other Clustering Methods: Fuzzy Clustering (for example, fuzzy c-means). In real applications there is often no sharp boundary between clusters, so fuzzy clustering is often better suited. Fuzzy c-means is a fuzzification of K-Means and the most well-known fuzzy clustering method.

20 Fuzzy Clustering: Cluster membership is now a weight between zero and one, and the distance to a centroid is multiplied by the membership weight. Interactive demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletFCM.html
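
A minimal NumPy sketch of fuzzy c-means using the standard update equations; the fuzzifier m, the iteration count and the random initialisation are illustrative choices rather than values taken from the slides.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    """Fuzzy c-means sketch: soft membership weights instead of a hard assignment."""
    rng = np.random.default_rng(seed)
    # Random initial membership matrix U; each row sums to one.
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Centroids are membership-weighted means of the data.
        W = U ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distances from every point to every centroid (small constant avoids division by zero).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-10
        # New memberships: closer centroids receive larger weights.
        U = 1.0 / (d ** (2.0 / (m - 1.0)))
        U /= U.sum(axis=1, keepdims=True)
    return centroids, U
```

A point lying midway between two centroids, like x on the earlier slide, ends up with a membership of roughly 0.5 in each cluster rather than being forced into one of them.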

21 DBSCAN (from Ranjay Sankar notes, University of Florida): DBSCAN is a density-based clustering algorithm. Density = the number of points within a specified radius (Eps). A point is a core point if it has more than a specified number of points (MinPts) within Eps (core points are in the interior of a cluster).

22 DBSCAN (from Ranjay Sankar notes, University of Florida): A border point has fewer than MinPts within Eps but is in the neighbourhood of a core point. A noise point is any point that is neither a core point nor a border point.
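
A small sketch of these definitions (not the full DBSCAN algorithm): it labels every point as core, border or noise for a given Eps and MinPts. Here a point's Eps-neighbourhood is taken to include the point itself, which is one common convention.

```python
import numpy as np

def classify_points(X, eps=0.5, min_pts=4):
    """Label each point 'core', 'border' or 'noise' according to the DBSCAN definitions."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = d <= eps                      # Eps-neighbourhood (includes the point itself)
    core = neighbours.sum(axis=1) >= min_pts   # core: at least MinPts points within Eps
    labels = []
    for i in range(len(X)):
        if core[i]:
            labels.append("core")
        elif neighbours[i][core].any():        # non-core but within Eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels
```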

23 DBSCAN (from Ranjay Sankar notes, University of Florida): http://www.cise.ufl.edu/class/cis4930sp09dm/notes/dm5part4.pdf (figure)

24 DBSCAN: Density-reachable (directly and indirectly). A point p is directly density-reachable from p2, p2 is directly density-reachable from p1, and p1 is directly density-reachable from q; so p <- p2 <- p1 <- q form a chain and p is (indirectly) density-reachable from q.

25 DBSCAN (University of Buffalo): http://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf (figure)

26 DBSCAN (from Ranjay Sankar notes, University of Florida) (figure)

27 Other Clustering Methods: Clustering as Optimisation. Remember one flaw with hierarchical clustering? "No provision can be made for relocating objects that may have been 'incorrectly' grouped at an early stage." And with K-Means: "Different initial partitions can result in different final clusters." Both are forms of local search.

28 Clustering as Optimisation: We can overcome this by using optimisation techniques designed to find global optima, e.g. evolutionary algorithms. Represent a clustering as a chromosome and evolve better clusterings using a fitness function (SSE, for example).

29 Clustering as Optimisation: e.g. represent a clustering as a chromosome of integers, one gene per item. For items a-j the chromosome [1, 2, 1, 1, 2, 2, 3, 1, 1, 3] encodes Cluster 1 = {a, c, d, h, i}, Cluster 2 = {b, e, f} and Cluster 3 = {g, j}.
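
A sketch of how such a chromosome could be scored, assuming total SSE as the fitness (lower is better); the function name and the random example data are illustrative only.

```python
import numpy as np

def sse_fitness(chromosome, X):
    """Total SSE of the clustering encoded by an integer chromosome (lower is fitter)."""
    chromosome = np.asarray(chromosome)
    total = 0.0
    for label in np.unique(chromosome):
        members = X[chromosome == label]          # items assigned to this cluster
        centre = members.mean(axis=0)             # cluster centre
        total += ((members - centre) ** 2).sum()  # squared distances to the centre
    return total

# Example: the chromosome from the slide scored on 10 random two-dimensional items.
X = np.random.default_rng(0).random((10, 2))
print(sse_fitness([1, 2, 1, 1, 2, 2, 3, 1, 1, 3], X))
```

An evolutionary algorithm would then mutate and recombine such chromosomes, keeping those with lower SSE.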

30 Feature Selection: Recall that the clustering process involved selecting appropriate features. This process is known as feature selection, a very large research topic in data mining, and it will affect the final clustering.

31 Feature Selection: scatterplots from different features of the same dataset (figures)

32 Evaluating Cluster Quality: How do we know if the discovered clusters are any good? The choice of the correct metric for judging the worth of a clustering arrangement is vital for success. There are as many metrics as methods, and each has its own merits and drawbacks.

33 Evaluating Cluster Quality: For example, K-Means clustering judges the worth of a clustering arrangement by how far (squared) each item in a cluster is from the cluster centre. This is the sum of squared Euclidean distances, SSE(X) = \sum_{i=1}^{k} ||x_i - c||^2, where X is a cluster of size k, x_i is an element of the cluster and c is the centre of the cluster.

34 Evaluating Cluster Quality
- Unsupervised: cohesion/homogeneity and separation.
- Supervised: how close the result is to the "true" clustering.
- Relative: use the above to compare two or more clusterings, e.g. K-Means vs hierarchical.

35 Cohesion & Separation (figures illustrating cohesion and separation)

36 Evaluating Cluster Quality. Other variations: the silhouette score, where
- +1 indicates points that are very distant from neighbouring clusters,
- 0 indicates points that are not distinctly in one cluster or another,
- -1 indicates points that are probably assigned to the wrong cluster.
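
For instance, scikit-learn provides silhouette computations out of the box; a small sketch (the toy data and the ambiguous midpoint are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Two well-separated blobs plus one ambiguous point midway between them.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.2, size=(20, 2)),
               rng.normal([3, 3], 0.2, size=(20, 2)),
               [[1.5, 1.5]]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))        # mean silhouette over all points (close to +1 here)
print(silhouette_samples(X, labels)[-1])  # the ambiguous midpoint should score near 0
```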

37 Supervised: But what if we know something about the "true" clusters? Can we use this to test the effectiveness of different clustering algorithms?

38 Comparing Clusters: Metrics exist to measure how similar two clustering arrangements are. Thus, if a method produces a set of similar clustering arrangements (according to the metric), then the method is consistent. We will consider the Weighted-Kappa metric, which has been adapted from medical statistics.

39 Lab in 2 weeks. In the lab:
- Explore some more complex data.
- Use Weighted Kappa to evaluate the clusters; it is essentially a value between 0 and 1 (more next week).
- Explore the effect of feature selection on the scatterplots and on clustering.

40 Lecture in 2 weeks: More on Weighted Kappa, other cluster evaluation methods, and the coursework!

