Presentation is loading. Please wait.

Presentation is loading. Please wait.

Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 11: K-Means Clustering Martin Russell.

Similar presentations


Presentation on theme: "Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 11: K-Means Clustering Martin Russell."— Presentation transcript:

1 Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 11: K-Means Clustering Martin Russell

2 Slide 2 EE3J2 Data Mining Objectives  To explain the for K-means clustering  To understand the K-means clustering algorithm

3 Slide 3 EE3J2 Data Mining Clustering so far  Agglomerative clustering –Begin by assuming that every data point is a separate centroid –Combine closest centroids until the desired number of clusters is reached –See agglom.c on the course website  Divisive clustering –Begin by assuming that there is just one centroid/cluster –Split clusters until the desired number of clusters is reached

4 Slide 4 EE3J2 Data Mining Optimality  Neither agglomerative clustering nor divisive clustering is optimal  In other words, the set of centroids which they give is not guaranteed to minimise distortion

5 Slide 5 EE3J2 Data Mining Optimality continued  For example: –In agglomerative clustering, a dense cluster of data points will be combined into a single centroid –But to minimise distortion, need several centroids in a region where there are many data points –A single ‘outlier’ may get its own cluster  Agglomerative clustering provides a useful starting point, but further refinement is needed

6 Slide 6 EE3J2 Data Mining 12 centroids

7 Slide 7 EE3J2 Data Mining K-means Clustering  Suppose that we have decided how many centroids we need - denote this number by K  Suppose that we have an initial estimate of suitable positions for our K centroids  K-means clustering is an iterative procedure for moving these centroids to reduce distortion

8 Slide 8 EE3J2 Data Mining K-means clustering - notation  Suppose there are T data points, denoted by:  Suppose that the initial K clusters are denoted by:  One iteration of K-means clustering will produce a new set of clusters Such that

9 Slide 9 EE3J2 Data Mining K-means clustering (1)  For each data point y t let c i(t) be the closest centroid  In other words: d(y t, c i(t) ) = min m d(y t,c m )  Now, for each centroid c 0 k define:  In other words, Y 0 k is the set of data points which are closer to c 0 k than any other cluster

10 Slide 10 EE3J2 Data Mining K-means clustering (2)  Now define a new k th centroid c 1 k by: where |Y k 0 | is the number of samples in Y k 0  In other words, c 1 k is the average value of the samples which were closest to c 0 k

11 Slide 11 EE3J2 Data Mining K-means clustering (3)  Now repeat the same process starting with the new centroids: to create a new set of centroids: … and so on until the process converges  Each new set of centroids has smaller distortion than the previous set

12 Slide 12 EE3J2 Data Mining Initialisation  An outstanding problem is to choose the initial cluster set C 0  Possibilities include: –Choose C 0 randomly –Choose C 0 using agglomerative clustering –Choose C 0 using divisive clustering  Choice of C 0 can be important –K-means clustering is a “hill-climbing” algorithm –It only finds a local minimum of the distortion function –This local minimum is determined by C 0

13 Slide 13 EE3J2 Data Mining Local optimality Distortion Cluster set C 0 C 1..C n Dist(C 0 ) Local minimum Global minimum N.B: I’ve drawn the cluster set space as 1 dimensional for simplicity. In reality it is a very high dimensional space

14 Slide 14 EE3J2 Data Mining Example

15 Slide 15 EE3J2 Data Mining Example - distortion

16 Slide 16 EE3J2 Data Mining C programs on website  agglom.c –Agglomerative clustering –agglom dataFile centFile numCent –Runs agglomerative clustering on the data in dataFile until the number of centroids is numCent. Writes the centroid (x,y) coordinates to centFile

17 Slide 17 EE3J2 Data Mining C programs on website  k-means.c –K-means clustering –k-means dataFile centFile opFile –Runs 10 iterations of k-means clustering on the data in dataFile starting with the centroids in centFile. –After each iteration writes distortion and new centroids to opFile  sa1.txt –This is the data file that I have used in the demonstrations

18 Slide 18 EE3J2 Data Mining Summary  The need for k-means clustering  The k-means clustering algorithm  Example of k-means clustering


Download ppt "Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 11: K-Means Clustering Martin Russell."

Similar presentations


Ads by Google