
1 Slide 1 EE3J2 Data Mining Lecture 16 Unsupervised Learning Ali Al-Shahib

2 Slide 2 Lectures on WebCT

3 Slide 3 Today
- Unsupervised learning
- Clustering
- K-means

4 Slide 4 What is Clustering?
- Clustering is an unsupervised learning method: there are no predefined classes (in our buy_computer example, there would be no 'yes' and 'no' labels).
- Imagine you are given a set of data objects for analysis; unlike in classification, the class label of each example is not known.
- Clustering is the process of grouping the data into classes or clusters so that examples within a cluster are highly similar to one another, but very dissimilar to examples in other clusters.
- Dissimilarities are assessed from the attribute values describing the examples; often, distance measures are used.

5 Slide 5 Clustering Note: you do not know which type each star is; they are unlabelled. You just use the information given in the attributes (or features) of each star.

6 Slide 6 Structure of data
- Typical real data is not uniformly distributed: it has structure
- Variables might be correlated
- The data might be grouped into natural 'clusters'
- The purpose of cluster analysis is to find this underlying structure automatically

7 Slide 7 Data Structures
Clustering algorithms typically operate on either:
- Data matrix: represents n objects (a.k.a. examples, e.g. persons) with p variables (e.g. age, height, gender, etc.); it is an n examples x p variables matrix
- Dissimilarity matrix: stores a collection of distances between examples; d(x, y) = difference or dissimilarity between examples x and y
How can dissimilarity d(x, y) be assessed?
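The two structures above can be sketched in code: starting from a small data matrix, a dissimilarity matrix of pairwise distances can be derived from it. This is a minimal illustration; the example rows and attributes are made up, and Euclidean distance (introduced on a later slide) is assumed as the dissimilarity measure.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dissimilarity_matrix(data):
    """n x n matrix of pairwise distances d(x, y) between the n examples."""
    n = len(data)
    return [[euclidean(data[i], data[j]) for j in range(n)] for i in range(n)]

# Data matrix: 3 examples (rows) x 2 variables (columns), e.g. age and height.
data = [[25, 1.70], [30, 1.80], [60, 1.60]]
d = dissimilarity_matrix(data)
```

Note that the dissimilarity matrix is symmetric with a zero diagonal, so in practice only the lower triangle needs to be stored.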

8 Slide 8 Clusters and centroids
- In other words: if we assume that the clusters are spherical, then they are determined by their centres
- The cluster centres are called centroids
- How many centroids do we need? Where should we put them?

9 Slide 9 Measuring dissimilarity (or similarity)
- To measure similarity, a distance function d is often used
- It measures the "dissimilarity" between pairs of objects x and y:
  - Small distance d(x, y): objects x and y are more similar
  - Large distance d(x, y): objects x and y are less similar

10 Slide 10 Properties of the distance function
A function d(x, y) defined on pairs of points x and y is called a distance (or metric) if it satisfies:
- d(x, y) >= 0: distance is a nonnegative number
- d(x, x) = 0: the distance of an object to itself is 0
- d(x, y) = d(y, x) for all points x and y (d is symmetric)
- d(x, z) <= d(x, y) + d(y, z) for all points x, y and z (this is called the triangle inequality)

11 Slide 11 Euclidean Distance
- The most popular distance measure is Euclidean distance.
- If x = (x_1, x_2, ..., x_N) and y = (y_1, y_2, ..., y_N) then:
  d(x, y) = sqrt( (x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_N - y_N)^2 )
- This corresponds to the standard notion of distance in Euclidean space
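As a minimal sketch, the Euclidean distance formula can be implemented directly:

```python
import math

def euclidean(x, y):
    # d(x, y) = sqrt( (x1 - y1)^2 + ... + (xN - yN)^2 )
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Classic 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
d = euclidean((0, 0), (3, 4))
```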

12 Slide 12 Distortion
- Distortion is a measure of how well a set of centroids models a set of data
- Suppose we have data points y_1, y_2, ..., y_T and centroids c_1, ..., c_M
- For each data point y_t, let c_i(t) be the closest centroid; in other words: d(y_t, c_i(t)) = min_m d(y_t, c_m)

13 Slide 13 Distortion
- The distortion for the centroid set C = c_1, ..., c_M is defined by:
  Dist(C) = sum over t = 1, ..., T of d(y_t, c_i(t))
- In other words, the distortion is the sum of distances between each data point and its nearest centroid
- The task of clustering is to find a centroid set C such that the distortion Dist(C) is minimised
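A sketch of the distortion computation, using Euclidean distance as d; the example points and centroids below are made up. Each point contributes the distance to whichever centroid is nearest to it.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def distortion(points, centroids):
    """Dist(C): sum over all points of the distance to the nearest centroid."""
    return sum(min(euclidean(y, c) for c in centroids) for y in points)

# (0,0) and (1,0) are nearest to centroid (0.5,0); (10,0) sits on centroid (10,0).
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
centroids = [(0.5, 0.0), (10.0, 0.0)]
dist = distortion(points, centroids)
```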

14 Slide 14 The K-Means Clustering Method
Given k, the k-means algorithm is implemented in 4 steps:
1. Partition the objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e. the mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to step 2; stop when no assignments change
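The four steps can be sketched as follows. This is a minimal illustration under simplifying assumptions (random seed centroids drawn from the data, Euclidean distance, a fixed iteration cap), not an optimised implementation; the sample points are made up.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(points, k, iters=100):
    # Step 1: form an initial partition via k seed centroids drawn from the data.
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 3: assign each object to the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda m: euclidean(p, centroids[m]))
            clusters[nearest].append(p)
        # Step 2: recompute each centroid as the mean point of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centroids[m]
            for m, c in enumerate(clusters)
        ]
        # Step 4: stop when the centroids (hence the assignments) no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated pairs of points; k-means with k=2 should find one
# centroid per pair.
random.seed(0)  # for reproducibility of this sketch
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
centroids, clusters = kmeans(points, k=2)
```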

15 Slide 15 The K-Means Clustering Method: Example

16 Slide 16 Let's watch an animation! http://r.yihui.name/stat/multivariate_stat/kmeans/index.htm

17 Slide 17 K-means Clustering
- Suppose that we have decided how many centroids we need; denote this number by K
- Suppose that we have an initial estimate of suitable positions for our K centroids
- K-means clustering is an iterative procedure for moving these centroids to reduce distortion

18 Slide 18 K-means clustering: notation
- Suppose there are T data points, denoted by y_1, y_2, ..., y_T
- Suppose that the initial K centroids are denoted by c_1^0, ..., c_K^0
- One iteration of K-means clustering will produce a new set of centroids c_1^1, ..., c_K^1 such that Dist(C^1) <= Dist(C^0)

19 Slide 19 K-means clustering (1)
- For each data point y_t, let c_i(t)^0 be the closest centroid; in other words: d(y_t, c_i(t)^0) = min_m d(y_t, c_m^0)
- Now, for each centroid c_k^0, define: Y_k^0 = { y_t : i(t) = k }
- In other words, Y_k^0 is the set of data points which are closer to c_k^0 than to any other centroid

20 Slide 20 K-means clustering (2)
- Now define a new k-th centroid c_k^1 by:
  c_k^1 = (1 / |Y_k^0|) * sum of y over all y in Y_k^0
  where |Y_k^0| is the number of samples in Y_k^0
- In other words, c_k^1 is the average value of the samples which were closest to c_k^0

21 Slide 21 K-means clustering (3)
- Now repeat the same process starting with the new centroids c_1^1, ..., c_K^1 to create a new set of centroids c_1^2, ..., c_K^2, and so on until the process converges
- Each new set of centroids has distortion no greater than the previous set

22 Slide 22 Comments on the K-Means Method
Strengths:
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally, k, t << n
- Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms
Weaknesses:
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes

23 Slide 23 Conclusions
- Unsupervised learning
- Clustering
- Distance metrics
- k-means clustering algorithm

24 Slide 24 On Tuesday  Sequence Analysis

25 Slide 25 Some References and Acknowledgments
- Data Mining: Concepts and Techniques, J. Han and M. Kamber
- J. Han's slides, University of Illinois at Urbana-Champaign

