1
EE3J2 Data Mining Lecture 18 K-means and Agglomerative Algorithms Ali Al-Shahib
2
Today: Unsupervised Learning, Clustering, K-means
3
Distortion The distortion for the centroid set C = {c_1, …, c_M} is defined by Dist(C) = Σ_y min_{m=1..M} d(y, c_m), where the sum runs over all data points y. In other words, the distortion is the sum of distances between each data point and its nearest centroid. The task of clustering is to find a centroid set C such that the distortion Dist(C) is minimised.
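As a quick illustration, the distortion can be computed directly from this definition. The following is a minimal sketch in Python/NumPy (the function name and array shapes are my own choices, not from the lecture):

```python
import numpy as np

def distortion(points, centroids):
    """Sum, over all data points, of the distance to the nearest centroid."""
    # dists[t, m] = Euclidean distance between data point t and centroid m
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    # for each point take the nearest centroid, then sum over points
    return dists.min(axis=1).sum()
```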
4
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in 4 steps (a code sketch follows after this list):
1. Initialisation: define the number of clusters (k) and designate a cluster centre (a vector of the same dimensionality as the data) for each cluster.
2. Assign each data point to the closest cluster centre (centroid); that data point is now a member of that cluster.
3. Calculate the new cluster centre of each cluster as the mean (geometric average) of all of its members.
4. Calculate the within-cluster sum of squares. If this value has not changed significantly over a certain number of iterations, exit the algorithm; otherwise, go back to Step 2.
Remember: the algorithm has converged when the overall distance between the centroids and their assigned objects has reached a minimum.
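A minimal sketch of these four steps in Python/NumPy (random initialisation, the tolerance and the iteration cap are illustrative choices, not prescribed by the lecture):

```python
import numpy as np

def k_means(points, k, max_iters=100, tol=1e-6, seed=0):
    """Minimal k-means sketch; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialise the k cluster centres as randomly chosen data points
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    prev_wss = np.inf
    for _ in range(max_iters):
        # Step 2: assign each data point to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its current members
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
        # Step 4: stop once the within-cluster sum of squares stops changing
        wss = ((points - centroids[labels]) ** 2).sum()
        if abs(prev_wss - wss) < tol:
            break
        prev_wss = wss
    return centroids, labels
```

Note that, as a simplification of Step 4, this sketch stops as soon as the within-cluster sum of squares changes by less than a small tolerance, rather than watching the change over several iterations.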
5
The K-Means Clustering Method
Example
6
Let's watch an animation!
7
K-means Clustering Suppose that we have decided how many centroids we need; denote this number by K. Suppose that we have an initial estimate of suitable positions for our K centroids. K-means clustering is an iterative procedure for moving these centroids to reduce the distortion.
8
K-means clustering - notation
Suppose there are T data points, denoted by y_1, y_2, …, y_T. Suppose that the initial K centroids are denoted by c_1^0, c_2^0, …, c_K^0. One iteration of K-means clustering will produce a new set of centroids c_1^1, c_2^1, …, c_K^1 such that Dist(C^1) ≤ Dist(C^0).
9
K-means clustering (1) For each data point y_t, let c_i(t) be the closest centroid. In other words: d(y_t, c_i(t)) = min_m d(y_t, c_m). Now, for each centroid c_k^0, define Y_k^0 = { y_t : d(y_t, c_k^0) = min_m d(y_t, c_m^0) }. In other words, Y_k^0 is the set of data points which are closer to c_k^0 than to any other centroid.
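In code, this assignment step is just an argmin over the centroids; a small sketch (the function and variable names are illustrative):

```python
import numpy as np

def nearest_centroid_sets(points, centroids):
    """Return the sets Y_k: for each centroid c_k, the points closest to it."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)   # index i(t) of the closest centroid for each y_t
    return [points[nearest == k] for k in range(len(centroids))]
```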
10
K-means clustering (2) Now define a new kth centroid c_k^1 by: c_k^1 = (1 / |Y_k^0|) Σ_{y ∈ Y_k^0} y,
where |Y_k^0| is the number of samples in Y_k^0. In other words, c_k^1 is the average value of the samples which were closest to c_k^0.
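The corresponding update is a one-line mean per cluster; a sketch that assumes the Y_k sets produced by the assignment step above (and that none of them is empty):

```python
import numpy as np

def update_centroids(cluster_sets):
    """New centroid c_k^1 = average of the samples in Y_k^0 (assumed non-empty)."""
    return np.array([Y_k.mean(axis=0) for Y_k in cluster_sets])
```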
11
K-means clustering (3) Now repeat the same process starting with the new centroids c_1^1, …, c_K^1 to create a new set of centroids c_1^2, …, c_K^2, and so on until the process converges. Each new set of centroids has a distortion no greater than that of the previous set.
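A self-contained toy run of this iteration (the data and initial centroids are made up purely for illustration) shows the distortion never increasing from one pass to the next:

```python
import numpy as np

# Made-up 2-D data forming two obvious groups, plus deliberately poor initial centroids
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

for it in range(5):
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    print(f"iteration {it}: distortion = {dists.min(axis=1).sum():.3f}")
    labels = dists.argmin(axis=1)
    centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
```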
12
So… basically: Start with k randomly chosen data points (objects) as the initial centroids.
Find the set of data points that are closest to c_k^0 (this set is Y_k^0). Compute the average of these points to get the new centroid c_k^1. Now repeat the process: find the objects closest to c_k^1, compute their average to get the new centroid c_k^2, and so on… until convergence.
13
Comments on the K-Means Method
Strength
Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
It often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weakness
Applicable only when a mean is defined (so what about categorical data?).
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.
Not suitable for discovering clusters with non-convex shapes.
14
Hierarchical Clustering
Grouping data objects into a tree of clusters.
Agglomerative clustering: begin by assuming that every data point is a separate centroid; combine the closest centroids until the desired number of clusters is reached (a code sketch follows below).
Divisive clustering: begin by assuming that there is just one centroid/cluster; split clusters until the desired number of clusters is reached.
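A minimal agglomerative-clustering sketch using SciPy's hierarchical clustering (this assumes SciPy is available and uses made-up exam marks; note that SciPy's 'average' linkage averages pairwise distances, whereas the worked example on the following slides averages the member vectors themselves):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical exam-mark vectors, one row per student (values made up for illustration)
X = np.array([[9, 3, 7],
              [10, 2, 9],
              [1, 4, 4],
              [6, 5, 5],
              [2, 4, 5]])

# Agglomerative: start with every point as its own cluster, repeatedly merge the closest pair
Z = linkage(X, method='average', metric='euclidean')

# Cut the resulting tree once the desired number of clusters (here 2) is reached
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```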
15
Agglomerative Clustering - Example
Students  Exam1  Exam2  Exam3
Mike      9      3      7
Tom       10     2      9
Bill      1      4      –
T Ren     6      5      –
Ali       –      –      –
16
Distances between objects
Using the Euclidean distance measure, what is the distance between Mike and Tom? Mike: 9, 3, 7; Tom: 10, 2, 9.
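Working this out: d(Mike, Tom) = √((9−10)² + (3−2)² + (7−9)²) = √6 ≈ 2.45, which appears (rounded to 2.5) in the distance matrix on the next slide. A one-line check in Python:

```python
import math
print(math.dist([9, 3, 7], [10, 2, 9]))   # 2.449..., the Euclidean distance (shown as 2.5 in the matrix)
```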
17
Distance Matrix
        Mike   Tom    Bill   T Ren  Ali
Mike    -      2.5    10.44  4.12   11.75
Tom            -      12.5   6.4    13.93
Bill                  -      6.48   1.41
T Ren                        -      7.35
Ali                                 -
18
The Algorithm – Step 1 Identify the entities which are most similar; this can be easily discerned from the distance table constructed. In this example, Bill and Ali are most similar, with a distance value of 1.41. They are therefore the most 'related'. [Diagram: Bill and Ali joined as the first cluster.]
19
The Algorithm – Step 2 The two entities that are most similar can now be merged so that they represent a single cluster (or new entity). So Bill and Ali can now be considered to be a single entity. How do we compare this entity with others? We use the Average linkage between the two. So the new average vector is [1, 9.5, 3.5] – see first table and average the marks for Bill and Ali. We now need to redraw the distance table, including the merger of the two entities, and new distance calculations.
20
The Algorithm – Step 3
              Mike   Tom    T Ren  {Bill & Ali}
Mike          -      2.5    4.12   10.9
Tom                  -      6.4    9.1
T Ren                       -      6.9
{Bill & Ali}                       -
21
The next closest students are Mike and Tom, with a distance of 2.5!
So, now we have 2 clusters! [Diagram: the clusters {Bill, Ali} and {Mike, Tom}.]
22
The distance matrix now
              {Mike & Tom}  T Ren  {Bill & Ali}
{Mike & Tom}  -             3.7    9.2
T Ren                       -      6.9
{Bill & Ali}                       -
Now, T Ren is closest to Bill and Ali, so T Ren joins them in the cluster.
23
The final dendrogram [Diagram: dendrogram over Bill, Ali, Mike, Tom and T Ren; annotation: many 'sub-clusters' within one cluster.]
24
Conclusions K-Means Algorithm: memorise the equations and the algorithm.
Hierarchical Clustering: Agglomerative Clustering
25
On Tuesday Sequence Analysis: BLAST Algorithm
26
Some References and Acknowledgments
University College London: