SEEM4630 Tutorial 3 – Clustering
What is Cluster Analysis?
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. A good clustering method produces high-quality clusters:
High intra-class similarity: cohesive within clusters
Low inter-class similarity: distinctive between clusters
Notion of a Cluster can be Ambiguous
How many clusters? Depending on how the points are grouped, the same data set can be seen as two, four, or six clusters.
K-Means Clustering
The number of clusters, K, is fixed in advance. Each cluster is represented by its mean (centroid), and each point is assigned to the cluster with the closest mean, with closeness measured by Euclidean distance, etc. The algorithm repeats two steps until the means no longer change: assign every point to its nearest mean, then recompute each mean from the points assigned to it.
K-Means Clustering: Example
Given:
Mean of cluster Ki: mi = (ti1 + ti2 + … + tim) / m, where ti1, …, tim are the points in Ki
Data: {2, 4, 10, 12, 3, 20, 30, 11, 25}, K = 2
Solution (initial means m1 = 2, m2 = 4):
Step 1: K1 = {2, 3}, K2 = {4, 10, 12, 20, 30, 11, 25} → m1 = 2.5, m2 = 16
Step 2: K1 = {2, 3, 4}, K2 = {10, 12, 20, 30, 11, 25} → m1 = 3, m2 = 18
Step 3: K1 = {2, 3, 4, 10}, K2 = {12, 20, 30, 11, 25} → m1 = 4.75, m2 = 19.6
Step 4: K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25} → m1 = 7, m2 = 25
With means 7 and 25 the assignments no longer change, so the algorithm stops.
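Below is a minimal Python sketch (not the tutorial's own code) that reproduces this 1-D run; it assumes the first two data points are used as the initial means and that ties go to the first cluster, which matches the trace above.

```python
def kmeans_1d(data, k=2, max_iter=100):
    means = list(data[:k])                       # assumed initialization: first k points
    for _ in range(max_iter):
        # Assignment step: every point joins the cluster with the closest mean.
        clusters = [[] for _ in range(k)]
        for x in data:
            idx = min(range(k), key=lambda i: abs(x - means[i]))
            clusters[idx].append(x)
        # Update step: recompute each mean from its current members.
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:                   # converged: means stopped changing
            break
        means = new_means
    return clusters, means

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(data)
print(clusters)   # [[2, 4, 10, 12, 3, 11], [20, 30, 25]]
print(means)      # [7.0, 25.0]
```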
K-Means Clustering: Evaluation
Sum of Squared Error (SSE):
SSE = Σ_{i=1..K} Σ_{x in Ci} dist(x, mi)²
where x is a data point in cluster Ci and mi is the centroid of cluster Ci.
Given several clusterings, choose the one with the smallest error.
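As a quick illustration (a sketch, not the tutorial's code), SSE for 1-D clusters can be computed directly from the definition above and used to compare two candidate clusterings of the example data:

```python
def sse(clusters):
    total = 0.0
    for cluster in clusters:
        centroid = sum(cluster) / len(cluster)           # 1-D centroid = mean
        total += sum((x - centroid) ** 2 for x in cluster)
    return total

# The K-means result above versus an alternative split of the same data:
print(sse([[2, 3, 4, 10, 11, 12], [20, 25, 30]]))   # 150.0
print(sse([[2, 3, 4], [10, 11, 12, 20, 25, 30]]))   # 348.0  -> the first is better
```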
Limitations of K-means
It is hard to determine a good value of K.
The result depends on the choice of the initial K centroids.
K-means has problems when the data contains outliers.
Outliers can be handled better by hierarchical clustering and density-based clustering.
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree.
Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level (see the sketch below).
Two directions of partitioning:
Agglomerative: start with single elements and aggregate them into clusters.
Divisive: start with the complete data set and divide it into partitions.
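A brief sketch of the 'cutting' step using SciPy (the library choice is an assumption; the tutorial does not name one). Cutting the same tree at different levels yields different numbers of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The 1-D data from the K-means example, clustered hierarchically (single link).
points = np.array([[2], [4], [10], [12], [3], [20], [30], [11], [25]])
Z = linkage(points, method='single')

# Cut the dendrogram so that at most 2 (or 3) flat clusters remain.
print(fcluster(Z, t=2, criterion='maxclust'))   # groups {2,3,4,10,11,12} and {20,25,30}
print(fcluster(Z, t=3, criterion='maxclust'))   # groups {2,3,4}, {10,11,12}, {20,25,30}
```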
Agglomerative Hierarchical Clustering
Basic algorithm is straightforward:
Compute the proximity matrix
Let each data point be a cluster
Repeat:
  Merge the two closest clusters
  Update the proximity matrix
Until only a single cluster remains
The key operation is the computation of the proximity of two clusters; there are different approaches to defining the distance between clusters (a minimal sketch follows below).
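A minimal from-scratch sketch of this basic algorithm (assumed names, not the tutorial's code); the proximity function is passed in so that any of the definitions on the next slide can be plugged in:

```python
def agglomerative(dist, proximity):
    """dist: symmetric distance matrix (list of lists or ndarray);
    proximity(dist, a, b): distance between clusters a and b (lists of point indices).
    Returns the sequence of merges as (cluster, cluster, distance) tuples."""
    clusters = [[i] for i in range(len(dist))]     # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters that is closest under the chosen proximity.
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda p: proximity(dist, clusters[p[0]], clusters[p[1]]))
        merges.append((clusters[i], clusters[j],
                       proximity(dist, clusters[i], clusters[j])))
        clusters[i] = clusters[i] + clusters[j]    # merge the two closest clusters
        del clusters[j]                            # one fewer cluster remains
    return merges
```

Recomputing proximities directly from the original distance matrix keeps the sketch short; a real implementation would update a proximity matrix incrementally instead.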
Hierarchical Clustering
Define inter-cluster similarity by one of:
Min (single link)
Max (complete link)
Group average
Distance between centroids
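The first three definitions can be written directly against a precomputed distance matrix (a sketch with assumed names, usable with the agglomerative() sketch above):

```python
def single_link(dist, a, b):     # Min: distance of the closest cross-cluster pair
    return min(dist[i][j] for i in a for j in b)

def complete_link(dist, a, b):   # Max: distance of the farthest cross-cluster pair
    return max(dist[i][j] for i in a for j in b)

def group_average(dist, a, b):   # mean of all cross-cluster pairwise distances
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))
```

The centroid-based definition needs the raw coordinates rather than only the distance matrix, because the centroid of a merged cluster must be recomputed from its points.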
Hierarchical Clustering: Min or Single Link
Example: six points I1–I6 with pairwise Euclidean distances.
        I1    I2    I3    I4    I5    I6
I1    0.00  0.24  0.22  0.37  0.34  0.23
I2    0.24  0.00  0.15  0.20  0.14  0.25
I3    0.22  0.15  0.00  0.15  0.28  0.11
I4    0.37  0.20  0.15  0.00  0.29  0.22
I5    0.34  0.14  0.28  0.29  0.00  0.39
I6    0.23  0.25  0.11  0.22  0.39  0.00
Merge sequence under single link (minimum cross-cluster distance):
{I3, I6} merged at 0.11
{I2, I5} merged at 0.14
{I2, I5} and {I3, I6} merged at 0.15
{I2, I5, I3, I6} and {I4} merged at 0.15
{I2, I5, I3, I6, I4} and {I1} merged at 0.22
The dendrogram orders the leaves as I3, I6, I2, I5, I4, I1 and records each merge at its distance on the vertical axis.
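The merge sequence above can be checked with SciPy (a sketch; the library choice is an assumption, and the distance matrix is the one reconstructed above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Pairwise Euclidean distances for I1..I6 from the example above.
D = np.array([
    [0.00, 0.24, 0.22, 0.37, 0.34, 0.23],   # I1
    [0.24, 0.00, 0.15, 0.20, 0.14, 0.25],   # I2
    [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],   # I3
    [0.37, 0.20, 0.15, 0.00, 0.29, 0.22],   # I4
    [0.34, 0.14, 0.28, 0.29, 0.00, 0.39],   # I5
    [0.23, 0.25, 0.11, 0.22, 0.39, 0.00],   # I6
])

# linkage() expects the condensed (upper-triangle) form of the matrix.
Z = linkage(squareform(D), method='single')
print(Z[:, 2])   # merge heights: 0.11, 0.14, 0.15, 0.15, 0.22
```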