
1 Slide 1 EE3J2 Data Mining Lecture 16 Unsupervised Learning Ali Al-Shahib

2 Slide 2 Lectures on WebCT

3 Slide 3 Today
- Unsupervised learning
- Clustering
- K-means

4 Slide 4 What is Clustering?
- Clustering is an unsupervised learning method: there are no predefined classes (in our buy_computer example, there would be no 'yes' and 'no' labels).
- Imagine you are given a set of data objects for analysis; unlike in classification, the class label of each example is not known.
- Clustering is the process of grouping the data into classes or clusters so that examples within a cluster are highly similar to one another, but very dissimilar to examples in other clusters.
- Dissimilarities are assessed from the attribute values describing the examples; often, distance measures are used.

5 Slide 5 Clustering Note: you do not know which type each star is; they are unlabelled. You just use the information given in the attributes (or features) of each star.

6 Slide 6 Structure of data
- Typical real data is not uniformly distributed: it has structure
- Variables might be correlated
- The data might be grouped into natural 'clusters'
- The purpose of cluster analysis is to find this underlying structure automatically

7 Slide 7 Data Structures
Clustering algorithms typically operate on either:
- Data matrix: represents n objects (a.k.a. examples, e.g. persons) with p variables (e.g. age, height, gender, etc.); it is an n examples x p variables matrix
- Dissimilarity matrix: stores a collection of distances between examples; d(x, y) = difference or dissimilarity between examples x and y
How can dissimilarity d(x, y) be assessed?
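The two structures above can be sketched in code: starting from a small data matrix, a dissimilarity matrix of pairwise distances can be derived from it. This is a minimal illustration; the example rows and attributes are made up, and Euclidean distance (introduced on a later slide) is assumed as the dissimilarity measure.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dissimilarity_matrix(data):
    """n x n matrix of pairwise distances d(x, y) between the n examples."""
    n = len(data)
    return [[euclidean(data[i], data[j]) for j in range(n)] for i in range(n)]

# Data matrix: 3 examples (rows) x 2 variables (columns), e.g. age and height.
data = [[25, 1.70], [30, 1.80], [60, 1.60]]
d = dissimilarity_matrix(data)
```

Note that the dissimilarity matrix is symmetric with a zero diagonal, so in practice only the lower triangle needs to be stored.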

8 Slide 8 Clusters and centroids
- In other words: if we assume that the clusters are spherical, then they are determined by their centres
- The cluster centres are called centroids
- How many centroids do we need? Where should we put them?

9 Slide 9 Measuring dissimilarity (or similarity)
- To measure similarity, a distance function d is often used
- It measures the "dissimilarity" between pairs of objects x and y:
  - Small distance d(x, y): objects x and y are more similar
  - Large distance d(x, y): objects x and y are less similar

10 Slide 10 Properties of the distance function
A function d(x, y) defined on pairs of points x and y is called a distance (or metric) if it satisfies:
- d(x, y) >= 0: distance is a nonnegative number
- d(x, x) = 0: the distance of an object to itself is 0
- d(x, y) = d(y, x) for all points x and y (d is symmetric)
- d(x, z) <= d(x, y) + d(y, z) for all points x, y and z (this is called the triangle inequality)

11 Slide 11 Euclidean Distance
- The most popular distance measure is Euclidean distance.
- If x = (x_1, x_2, ..., x_N) and y = (y_1, y_2, ..., y_N) then:
  d(x, y) = sqrt( (x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_N - y_N)^2 )
- This corresponds to the standard notion of distance in Euclidean space
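As a minimal sketch, the Euclidean distance formula can be implemented directly:

```python
import math

def euclidean(x, y):
    # d(x, y) = sqrt( (x1 - y1)^2 + ... + (xN - yN)^2 )
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Classic 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
d = euclidean((0, 0), (3, 4))
```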

12 Slide 12 Distortion
- Distortion is a measure of how well a set of centroids models a set of data
- Suppose we have data points y_1, y_2, ..., y_T and centroids c_1, ..., c_M
- For each data point y_t, let c_i(t) be the closest centroid; in other words: d(y_t, c_i(t)) = min_m d(y_t, c_m)

13 Slide 13 Distortion
- The distortion for the centroid set C = c_1, ..., c_M is defined by:
  Dist(C) = sum over t = 1, ..., T of d(y_t, c_i(t))
- In other words, the distortion is the sum of distances between each data point and its nearest centroid
- The task of clustering is to find a centroid set C such that the distortion Dist(C) is minimised
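A sketch of the distortion computation, using Euclidean distance as d; the example points and centroids below are made up. Each point contributes the distance to whichever centroid is nearest to it.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def distortion(points, centroids):
    """Dist(C): sum over all points of the distance to the nearest centroid."""
    return sum(min(euclidean(y, c) for c in centroids) for y in points)

# (0,0) and (1,0) are nearest to centroid (0.5,0); (10,0) sits on centroid (10,0).
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
centroids = [(0.5, 0.0), (10.0, 0.0)]
dist = distortion(points, centroids)
```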

14 Slide 14 The K-Means Clustering Method
Given k, the k-means algorithm is implemented in 4 steps:
1. Partition the objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e. the mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to step 2; stop when no assignments change
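The four steps can be sketched as follows. This is a minimal illustration under simplifying assumptions (random seed centroids drawn from the data, Euclidean distance, a fixed iteration cap), not an optimised implementation; the sample points are made up.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(points, k, iters=100):
    # Step 1: form an initial partition via k seed centroids drawn from the data.
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 3: assign each object to the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda m: euclidean(p, centroids[m]))
            clusters[nearest].append(p)
        # Step 2: recompute each centroid as the mean point of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centroids[m]
            for m, c in enumerate(clusters)
        ]
        # Step 4: stop when the centroids (hence the assignments) no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated pairs of points; k-means with k=2 should find one
# centroid per pair.
random.seed(0)  # for reproducibility of this sketch
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
centroids, clusters = kmeans(points, k=2)
```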

15 Slide 15 The K-Means Clustering Method: Example

16 Slide 16 Let's watch an animation! http://r.yihui.name/stat/multivariate_stat/kmeans/index.htm

17 Slide 17 K-means Clustering
- Suppose that we have decided how many centroids we need; denote this number by K
- Suppose that we have an initial estimate of suitable positions for our K centroids
- K-means clustering is an iterative procedure for moving these centroids to reduce distortion

18 Slide 18 K-means clustering: notation
- Suppose there are T data points, denoted by y_1, y_2, ..., y_T
- Suppose that the initial K centroids are denoted by c_1^0, ..., c_K^0
- One iteration of K-means clustering will produce a new set of centroids c_1^1, ..., c_K^1 such that Dist(C^1) <= Dist(C^0)

19 Slide 19 K-means clustering (1)
- For each data point y_t, let c_i(t)^0 be the closest centroid; in other words: d(y_t, c_i(t)^0) = min_m d(y_t, c_m^0)
- Now, for each centroid c_k^0, define: Y_k^0 = { y_t : i(t) = k }
- In other words, Y_k^0 is the set of data points which are closer to c_k^0 than to any other centroid

20 Slide 20 K-means clustering (2)
- Now define a new k-th centroid c_k^1 by:
  c_k^1 = (1 / |Y_k^0|) * sum of y over all y in Y_k^0
  where |Y_k^0| is the number of samples in Y_k^0
- In other words, c_k^1 is the average value of the samples which were closest to c_k^0

21 Slide 21 K-means clustering (3)
- Now repeat the same process starting with the new centroids c_1^1, ..., c_K^1 to create a new set of centroids c_1^2, ..., c_K^2, and so on until the process converges
- Each new set of centroids has distortion no greater than the previous set

22 Slide 22 Comments on the K-Means Method
Strengths:
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally, k, t << n
- Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms
Weaknesses:
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes

23 Slide 23 Conclusions
- Unsupervised learning
- Clustering
- Distance metrics
- k-means clustering algorithm

24 Slide 24 On Tuesday  Sequence Analysis

25 Slide 25 Some References and Acknowledgments
- Data Mining: Concepts and Techniques, J. Han and M. Kamber
- J. Han's slides, University of Illinois at Urbana-Champaign

