Slide 1: EE3J2 Data Mining
Lecture 11: Clustering
Martin Russell
Slide 2: Objectives
To explain the motivation for clustering
To introduce the ideas of distance and distortion
To describe agglomerative and divisive clustering
To explain the relationships between clustering and decision trees
Slide 3: Example from speech processing
Plot of high-frequency energy vs low-frequency energy, for 25 ms speech segments, sampled every 10 ms
Slide 4: Structure of data
Typical real data is not uniformly distributed
It has structure
Variables might be correlated
The data might be grouped into natural ‘clusters’
The purpose of cluster analysis is to find this underlying structure automatically
Slide 5: Clusters and centroids
If we assume that the clusters are spherical, then they are determined by their centres
The cluster centres are called centroids
How many centroids do we need?
Where should we put them?
[Figure label: centroids]
Slide 6: Distance
A function d(x,y) defined on pairs of points x and y is called a distance or metric if it satisfies:
– d(x,x) = 0 for every point x
– d(x,y) = d(y,x) for all points x and y (d is symmetric)
– d(x,z) ≤ d(x,y) + d(y,z) for all points x, y and z (this is called the triangle inequality)
Slide 7: Example metrics
The most common metric is the Euclidean metric
In this case, if $x = (x_1, x_2, \ldots, x_N)$ and $y = (y_1, y_2, \ldots, y_N)$ then:
$d(x,y) = \sqrt{\sum_{n=1}^{N} (x_n - y_n)^2}$
This corresponds to the standard notion of distance in Euclidean space
There are many other metrics, but we focus on this one
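As a minimal sketch, the Euclidean metric can be written in a few lines of Python. The function name `euclidean` and the sample points are illustrative choices, not part of the lecture; the last lines spot-check the three metric properties from the previous slide on those sample points only.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same dimension")
    return math.sqrt(sum((xn - yn) ** 2 for xn, yn in zip(x, y)))

# Illustrative sample points (not from the lecture data).
x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)

print(euclidean(x, y))                                       # 5.0
print(euclidean(x, x) == 0.0)                                # d(x,x) = 0
print(euclidean(x, y) == euclidean(y, x))                    # symmetry
print(euclidean(x, z) <= euclidean(x, y) + euclidean(y, z))  # triangle inequality
```

Checking a handful of points does not prove that a function is a metric; it only illustrates what each property says.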
Slide 8: Distortion
Distortion is a measure of how well a set of centroids models a set of data
Suppose we have:
– data points $y_1, y_2, \ldots, y_T$
– centroids $c_1, \ldots, c_M$
For each data point $y_t$ let $c_{i(t)}$ be the closest centroid
In other words: $d(y_t, c_{i(t)}) = \min_m d(y_t, c_m)$
Slide 9: Distortion
The distortion for the centroid set $C = \{c_1, \ldots, c_M\}$ is defined by:
$\mathrm{Dist}(C) = \sum_{t=1}^{T} d(y_t, c_{i(t)})$
In other words, the distortion is the sum of distances between each data point and its nearest centroid
The task of clustering is to find a centroid set C such that the distortion Dist(C) is minimised
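The definition can be illustrated with a short Python sketch, assuming the Euclidean metric (via the standard library's `math.dist`, available from Python 3.8). The names `nearest_centroid` and `distortion` and the toy data are illustrative, not taken from the lecture.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index i(t) of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda m: math.dist(point, centroids[m]))

def distortion(data, centroids):
    """Sum of distances between each data point and its nearest centroid."""
    return sum(math.dist(y, centroids[nearest_centroid(y, centroids)])
               for y in data)

# Toy data: two well-separated pairs of points (illustrative only).
data = [(0, 0), (0, 2), (10, 0), (10, 2)]

two_centroids = [(0, 1), (10, 1)]
one_centroid = [(5, 1)]

print(distortion(data, two_centroids))  # 4.0
print(distortion(data, one_centroid))   # much larger: one centroid models this data poorly
```

A lower distortion means the centroid set models the data more closely, which is why the two-centroid set scores better here.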
Slide 10: Types of Clustering
Initially we will look at two types of cluster analysis:
– Agglomerative clustering, or ‘bottom-up’ clustering
– Divisive clustering, or ‘top-down’ clustering
Slide 11: Agglomerative clustering
Agglomerative clustering begins by assuming that each data point belongs to its own, unique, one-point cluster
Clusters are then combined until the required number of clusters is obtained
The simplest agglomerative clustering algorithm is one which, at each stage, combines the two closest centroids into a single centroid
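The simplest algorithm described above might be sketched as follows. The slide does not say how two centroids are combined, so this sketch assumes the merged centroid is the size-weighted mean of the pair; the function name `agglomerate` and the toy data are also illustrative.

```python
import math
from itertools import combinations

def agglomerate(data, num_clusters):
    """Greedy bottom-up clustering: repeatedly merge the two closest centroids.

    Assumption (not stated on the slide): a merged centroid is the
    size-weighted mean of the two centroids it replaces.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    # Each cluster is (centroid, number of points it represents).
    clusters = [(tuple(p), 1) for p in data]
    while len(clusters) > num_clusters:
        # Find the pair of centroids with the smallest Euclidean distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(clusters[ij[0]][0], clusters[ij[1]][0]),
        )
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        merged = tuple((ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj))
        # combinations() yields i < j, so delete the higher index first.
        del clusters[j]
        del clusters[i]
        clusters.append((merged, ni + nj))
    return [centroid for centroid, _ in clusters]

data = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(agglomerate(data, 2))  # [(0.0, 0.5), (10.0, 0.5)]
```

This naive version searches every pair of centroids at every stage, so it is only practical for small data sets.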
Slide 12: Original data (302 points)
Slide 13: 252 centroids
Slide 14: 152 centroids
Slide 15: 52 centroids
Slide 16: 12 centroids
Slide 17: Divisive Clustering
Divisive clustering begins by assuming that there is just one centroid, typically in the centre of the set of data points
That point is replaced with 2 new centroids
Then each of these is replaced with 2 new centroids
…
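The slide does not say how a centroid is replaced by two new ones, so the following is only one possible sketch. It assumes that each centroid is split by nudging it a small amount `eps` in two opposite directions, and that the new centroids are then refined by repeatedly assigning points to their nearest centroid and re-averaging (a k-means-style step). All names and parameters here (`mean`, `refine`, `divisive`, `eps`, `iterations`) are hypothetical.

```python
import math

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def refine(data, centroids, iterations=10):
    """Assumed refinement: assign points to the nearest centroid, then re-average."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in data:
            nearest = min(range(len(centroids)),
                          key=lambda m: math.dist(p, centroids[m]))
            groups[nearest].append(p)
        # Keep the old centroid if no point was assigned to it.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def divisive(data, levels, eps=0.01):
    """Top-down clustering: start from one centroid and split `levels` times."""
    centroids = [mean(data)]  # single centroid in the centre of the data
    for _ in range(levels):
        split = []
        for c in centroids:
            # Assumed splitting rule: nudge the centroid in two opposite directions.
            split.append(tuple(v + eps for v in c))
            split.append(tuple(v - eps for v in c))
        centroids = refine(data, split)
    return centroids

data = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(sorted(divisive(data, 1)))  # [(0.0, 0.5), (10.0, 0.5)]
```

Each level doubles the number of centroids, so `levels` splits give 2 to the power `levels` centroids. The direction of the nudge is arbitrary in this sketch, and other splitting rules are equally plausible.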
Slide 18: Original data (302 points)
Slide 19: Original data (302 points)
Slide 20: Decision tree interpretation
[Figure: decision tree. Labels: single centroid (whole set); multiple centroids (one per data point); top-down clustering (divisive); bottom-up clustering (agglomerative)]
Slide 21: Note on optimality
An ‘optimal’ set of centroids is one which minimises the distortion
None of these methods necessarily gives an optimal set of centroids
Instead they give locally optimal sets of centroids
Why?
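One way to see this is a tiny one-dimensional example. The sketch below assumes that greedy agglomerative merging replaces the two closest centroids with their size-weighted mean (an assumption, since the lecture does not specify the merge rule). Each merge is chosen by looking only at the current closest pair and is never revisited, so the final centroid set can have a higher distortion than another set of the same size. The data and the hand-picked alternative are illustrative.

```python
def distortion(data, centroids):
    """Sum of distances from each 1-D point to its nearest centroid."""
    return sum(min(abs(y - c) for c in centroids) for y in data)

data = [0.0, 3.0, 4.0, 8.0]

# Greedy merging down to two centroids, worked by hand under the
# size-weighted-mean assumption:
#   step 1: merge 3 and 4     -> centroids 0, 3.5, 8
#   step 2: merge 0 and 3.5   -> centroids 7/3, 8
greedy = [7.0 / 3.0, 8.0]

# A hand-picked alternative with the same number of centroids.
alternative = [3.0, 8.0]

print(distortion(data, greedy))       # about 4.67
print(distortion(data, alternative))  # 4.0
```

The alternative set has lower distortion, so the greedy result is not optimal for this data under these assumptions.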
Slide 22: Summary
Distance metrics and distortion
Agglomerative clustering
Divisive clustering
Decision tree interpretation