CZ5211 Topics in Computational Biology Lecture 4: Clustering Analysis for Microarray Data II Prof. Chen Yu Zong Tel: Room 07-24, level 7, SOC1, NUS
2 K-means clustering This method differs from hierarchical clustering in several ways. In particular: There is no hierarchy; the data are partitioned. You are presented only with the final cluster membership for each case. There is no role for the dendrogram in k-means clustering. You must supply the number of clusters (k) into which the data are to be grouped.
3 Example of K-means algorithm: Lloyd's algorithm Has been shown to converge to a locally optimal solution, but can converge to a solution arbitrarily bad compared to the optimal solution. [Figure: K=3 example showing data points, optimal centers and heuristic centers]
4 K-means clustering Given a set of n data points in d-dimensional space and an integer k, we want to find the set of k points (centers) in d-dimensional space that minimizes the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem.
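The objective stated above can be written down directly. The following is a minimal sketch in plain Python; the function names and the toy data are illustrative assumptions, not part of the lecture.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_squared_distance(points, centers):
    """Mean, over all points, of the squared distance to the nearest center."""
    total = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return total / len(points)


# Hypothetical toy data: four points forming two tight pairs.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good_centers = [(0.0, 1.0), (10.0, 1.0)]  # one center per pair
poor_centers = [(5.0, 0.0), (5.0, 2.0)]   # each center sits between the pairs

print(mean_squared_distance(points, good_centers))  # 1.0
print(mean_squared_distance(points, poor_centers))  # 25.0
```

In this toy case the poor centers are also stable under the assign-and-recompute steps of k-means (each center is already the mean of the points nearest to it), which illustrates how a locally optimal solution can be much worse than the best one.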
5 K-means clustering Usually uses Euclidean distance. Gives spherical clusters. How many clusters, K? The solution is not unique; the clustering can depend on your starting point.
6 K-means clustering Step 1: Transform the n (genes) * m (experiments) matrix into an n (genes) * n (genes) distance matrix. Step 2: Cluster genes using a k-means clustering algorithm.
7 K-means clustering To transform the n*m matrix into an n*n matrix, use a similarity (distance) metric (Tavazoie et al. Nature Genetics Jul;22:281-5). Euclidean distance: $d(X, Y) = \sqrt{\sum_{i=1}^{M} (X_i - Y_i)^2}$, where X and Y are any two genes observed over a series of M conditions.
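As a rough illustration of this transformation, the sketch below builds the n*n Euclidean distance matrix from an n*m expression matrix stored as a list of rows. The names `euclidean`, `distance_matrix` and the small `expression` table are assumptions made for the example.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two expression profiles of equal length M."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(expression):
    """Turn an n x m matrix (one row per gene) into an n x n distance matrix."""
    n = len(expression)
    return [[euclidean(expression[i], expression[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical data: 3 genes measured under 3 conditions.
expression = [
    [0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 0.0],
]
D = distance_matrix(expression)
print(D[0][1])  # 5.0
print(D[0][2])  # 0.0
```

The matrix is symmetric with zeros on the diagonal, so in practice only the upper triangle needs to be computed.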
8 K-means clustering
9 K-means clustering algorithm Step 1: Suppose the gene expression patterns are positioned in a two-dimensional space based on a distance matrix. Step 2: The first cluster center (red) is chosen randomly, and each subsequent center is chosen by finding the data point farthest from the centers already chosen. In this example, k=3.
10 K-means clustering algorithm Step 3: Each point is assigned to the cluster associated with the closest representative center. Step 4: Minimize the within-cluster sum of squared distances from the cluster mean by moving the centroid (star points), that is, by computing a new cluster representative.
11 K-means clustering algorithm Step 5: Repeat steps 3 and 4 with the new representatives. Run steps 3, 4 and 5 until no further changes occur, i.e. until self-consistency is reached.
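The center selection in step 2 can be sketched as follows. This is one possible reading, assuming that "farthest from the centers already chosen" means the point whose distance to its nearest chosen center is largest; the function name and the sample points are hypothetical.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def farthest_first_centers(points, k, rng=None):
    """Pick the first center at random, then repeatedly add the point whose
    distance to its nearest already-chosen center is largest."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    while len(centers) < k:
        next_center = max(
            points,
            key=lambda p: min(euclidean(p, c) for c in centers),
        )
        centers.append(next_center)
    return centers


# Hypothetical data: three well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (5.0, 9.0), (5.0, 10.0)]
centers = farthest_first_centers(points, k=3)
print(centers)
```

On data like this, the rule tends to place one starting center in each well-separated group, whichever point is drawn first.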
12 Basic Algorithm for K-Means
1. Choose K initial cluster centers at random.
2. Partition objects into k clusters by assigning objects to the closest centroid.
3. Calculate the centroid of each of the k clusters.
4. Assign each object to cluster i, by first calculating the distance from each object to all cluster centers, then choosing the closest.
5. If an object changes clusters, recalculate the centroids.
6. Repeat until objects are not moving anymore.
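The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a tuned implementation: the names `kmeans`, `assign` and `centroid`, the `seed` and `max_iter` parameters, and the choice to leave an empty cluster's center unchanged are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def assign(points, centers):
    """Index of the closest center for every point."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: random initial centers
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centers)  # steps 2 and 4: closest center
        if new_labels == labels:              # step 6: nothing moved, stop
            break
        labels = new_labels
        for j in range(k):                    # steps 3 and 5: new centroids
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                       # assumption: keep old center if empty
                centers[j] = centroid(members)
    return labels, centers


# Hypothetical data: two groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

Because the starting centers are random, different seeds can lead to different final clusterings; running the sketch with several seeds and comparing the results is a simple way to see this.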
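13 Euclidean Distance and Centroid Point Euclidean distance between two n-dimensional points x and y: $d(x, y) = \sqrt{\sum_{j=1}^{n} (x_j - y_j)^2}$. Centroid point of k n-dimensional points $x_1, \dots, x_k$: $CP(x_1, \dots, x_k) = \left( \frac{1}{k}\sum_{i=1}^{k} x_{i,1}, \; \dots, \; \frac{1}{k}\sum_{i=1}^{k} x_{i,n} \right)$. Simple and fast! Remember this when we consider the complexity!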
14 K-means 2nd example with k=2 1. We pick k=2 centers at random. 2. We cluster our data around these center points.
15 K-means 2nd example with k=2 3. We recalculate centers based on our current clusters.
16 K-means 2nd example with k=2 4. We re-cluster our data around our new center points.
17 K-means 2nd example with k=2 5. We repeat the last two steps until no more data points are moved into a different cluster.
18 K-means 3rd example: Initialization [figure: three x markers]
19 K-means 3rd example: Iteration 1 [figure: three x markers]
20 K-means 3rd example: Iteration 2 [figure: three x markers]
21 K-means 3rd example: Iteration 3 [figure: three x markers]
22 K-means 3rd example: Iteration 4 [figure: three x markers]
23 K-means clustering problems Random initialization means that you may get different clusters each time. Data points are assigned to only one cluster (hard assignment). There are implicit assumptions about the "shapes" of clusters. You have to pick the number of clusters.
24 K-means problem: always finds k clusters [figure: three x markers]
25 K-means problem: distance may not always accurately reflect relationship. Each data point is assigned to the correct cluster, but data points that seem to be far away from each other under the distance heuristic may in reality be very closely related to each other.
26 Tips on improving K-means clustering: to split/combine clusters Variations of the ISODATA algorithm: split clusters that are too large, increasing k by one; merge clusters that are too small, by merging clusters that are very close to one another. What is too close and too far?
27 Tips on improving K-means clustering: Use of K-medoids instead of centroids K-means uses centroids, the average of the samples in a cluster. Medoid: a "representative object" within a cluster. Less sensitive to outliers.
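To make the contrast concrete, the sketch below computes both a centroid and a medoid for one cluster containing an outlier. It assumes the medoid is the cluster member with the smallest total distance to all other members, which is a common definition; the names and data are illustrative.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(cluster):
    """Coordinate-wise mean; usually not one of the data points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def medoid(cluster):
    """Cluster member with the smallest total distance to all members."""
    return min(cluster, key=lambda p: sum(euclidean(p, q) for q in cluster))


# Hypothetical cluster: four nearby points plus one distant outlier.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (50.0, 50.0)]
print(centroid(cluster))  # pulled far toward the outlier
print(medoid(cluster))    # (2.0, 2.0), still inside the dense group
```

The centroid lands near (11.2, 11.2), far from every actual sample, while the medoid stays within the dense group.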
28 Tips on improving K-means clustering: How to choose k? Use another clustering method. Run the algorithm on the data with several different values of k, and look at the stability of the results. Use advance knowledge about the characteristics of your test.
29 Tips on improving K-means clustering: Choosing K by using Silhouettes The silhouette of a gene i is $s(i) = \frac{b_i - a_i}{\max(a_i, b_i)}$, where $a_i$ is the average distance of gene i to the other genes in the same cluster and $b_i$ is the average distance of gene i to the genes in its nearest neighbor cluster. The maximal average silhouette width can be used to select the number of clusters; genes with s(i) close to one are well classified.
30 Tips on improving K-means clustering: Choosing K by using Silhouettes [figure: panels for k=2 and k=3]
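A small sketch of the silhouette computation, using the standard definition s(i) = (b_i - a_i) / max(a_i, b_i). The function names, the convention of returning 0 for a singleton cluster, and the toy labelings are assumptions made for illustration; at least two clusters are required.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_distance(i, points, members):
    """Average distance from point i to the points whose indices are given."""
    return sum(euclidean(points[i], points[j]) for j in members) / len(members)


def silhouette(i, points, labels):
    own = [j for j, lab in enumerate(labels) if lab == labels[i] and j != i]
    if not own:
        return 0.0  # assumed convention for a cluster with a single member
    a = mean_distance(i, points, own)
    # b: smallest average distance to the members of any other cluster
    b = min(
        mean_distance(i, points, [j for j, lab in enumerate(labels) if lab == other])
        for other in set(labels) if other != labels[i]
    )
    if max(a, b) == 0.0:
        return 0.0  # all relevant points coincide
    return (b - a) / max(a, b)


def average_silhouette_width(points, labels):
    return sum(silhouette(i, points, labels) for i in range(len(points))) / len(points)


# Hypothetical one-dimensional data with an obvious two-cluster structure.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 1, 0, 1]
print(average_silhouette_width(points, good_labels))
print(average_silhouette_width(points, bad_labels))
```

To choose k, one would cluster the data for several values of k and keep the value with the largest average silhouette width.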
31 Tips on improving K-means clustering: Choosing K by using WADP (weighted average discrepant pairs) Add noise (perturbations) to the original data. Count the pairs of samples that cluster together in the original data but no longer cluster together after perturbation. Repeat for every cutoff level in hierarchical clustering (HC) or for each k in k-means. Estimate the proportion of pairs that changes for each k. Use different levels of noise (heuristic). Look for the largest k before WADP gets large.
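The pair-counting step can be sketched as below. This is a simplified, unweighted proportion of discrepant pairs, not the full WADP statistic; the Gaussian noise model, the function names and the example labelings are assumptions. The labels would come from clustering the original and the perturbed data with the same k.

```python
import random
from itertools import combinations


def add_noise(points, sigma, rng):
    """Perturb every coordinate with Gaussian noise of standard deviation sigma."""
    return [tuple(x + rng.gauss(0.0, sigma) for x in p) for p in points]


def discrepant_pair_fraction(original_labels, perturbed_labels):
    """Among pairs that share a cluster in the original clustering, the
    fraction that no longer share a cluster after perturbation."""
    together = [
        (i, j)
        for i, j in combinations(range(len(original_labels)), 2)
        if original_labels[i] == original_labels[j]
    ]
    if not together:
        return 0.0
    broken = sum(1 for i, j in together if perturbed_labels[i] != perturbed_labels[j])
    return broken / len(together)


# Hypothetical labelings of five samples before and after perturbation.
original = [0, 0, 0, 1, 1]
perturbed = [1, 1, 0, 0, 0]
print(discrepant_pair_fraction(original, perturbed))  # 0.5

noisy = add_noise([(0.0, 0.0), (1.0, 1.0)], sigma=0.1, rng=random.Random(0))
print(noisy)
```

The measure depends only on which samples share a cluster, so renaming the cluster labels does not change it.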
32 Tips on improving K-means clustering: Choosing K by using Cluster Quality Measures By introducing a measure of cluster quality Q, different values of k can be evaluated until an optimal value of Q is reached. But, since clustering is an unsupervised learning method, one can't really expect to find a "correct" measure Q. So, once again, there are different choices of Q, and our decision will depend on what dissimilarity measure is used and what types of clusters we want.
33 Tips on improving K-means clustering: Choosing K by using Cluster Quality Measures Jagota suggested a measure Q that emphasizes cluster tightness or homogeneity, where $|C_i|$ is the number of data points in cluster i. Q will be small if (on average) the data points in each cluster are close.
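The exact formula is not legible in this transcript. The sketch below assumes one form consistent with the description: for each cluster, the average distance of its points to the cluster centroid, summed over clusters. Both this form and the names used are assumptions, not a statement of Jagota's definition.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def tightness_q(clusters):
    """Assumed form: Q = sum over clusters of (1 / |C_i|) times the sum of
    distances from the points in C_i to the centroid of C_i."""
    q = 0.0
    for cluster in clusters:
        mu = centroid(cluster)
        q += sum(euclidean(x, mu) for x in cluster) / len(cluster)
    return q


# Hypothetical clusterings of the same four points.
tight = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
loose = [[(0.0, 0.0), (10.0, 0.0)], [(2.0, 0.0), (12.0, 0.0)]]
print(tightness_q(tight))  # 2.0
print(tightness_q(loose))  # 10.0
```

Under this assumed form, Q keeps shrinking as k grows (it reaches zero when every point is its own cluster), so it is usually read from a plot of Q against k rather than minimized outright.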
34 Tips on improving K-means clustering: Choosing K by using Cluster Quality Measures [Plot: Q versus k] This is a plot of the Q measure as given in Jagota for k-means clustering on the data shown earlier. How many clusters do you think there actually are?
35 Tips on improving K-means clustering: Choosing K by using Cluster Quality Measures The Q measure given in Jagota takes into account homogeneity within clusters, but not separation between clusters. Other measures try to combine these two characteristics (e.g., the Davies-Bouldin measure). An alternate approach is to look at cluster stability: add random noise to the data many times and count how many pairs of data points no longer cluster together. How much noise to add? It should reflect the estimated variance in the data.
36 What makes a clustering good? Clustering results can differ for different methods and distance metrics. Except in the simplest of cases, the result is sensitive to noise and outliers in the data. As in the case of differential genes, we are looking for homogeneity (similarity within a cluster) and separation (differences between clusters).
37 What makes a clustering good? Hypothesis testing approach: The null hypothesis is that the data has NO structure. Generate a reference data population under the null (random) hypothesis, i.e. data that models a random structure, and compare it to the actual data. Estimate a statistic that indicates data structure.
38 Cluster Quality Since any data can be clustered, how do we know our clusters are meaningful? Candidate measures: the size (diameter) of the cluster vs. the inter-cluster distance; the distance between the members of a cluster and the cluster's center; the diameter of the smallest sphere.
39 Cluster Quality [Figure labels: size=5, distance=20, distance=5] The quality of a cluster is assessed by the ratio of the distance to the nearest cluster to the cluster diameter.
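With the figure's numbers, a cluster of diameter 5 scores 20/5 = 4 when its nearest neighbor is at distance 20, and 5/5 = 1 when it is at distance 5. The sketch below reproduces that arithmetic, assuming the diameter is the largest pairwise distance within the cluster and the inter-cluster distance is measured between centroids; both choices and all names are assumptions.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def diameter(cluster):
    """Largest pairwise distance inside the cluster (0.0 for a single point)."""
    return max((euclidean(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def quality_ratio(cluster, other_clusters):
    """Centroid distance to the nearest other cluster divided by the diameter.
    Not defined for single-point clusters, whose diameter is zero."""
    nearest = min(euclidean(centroid(cluster), centroid(o)) for o in other_clusters)
    return nearest / diameter(cluster)


# Hypothetical clusters laid out to match the figure's numbers.
cluster = [(0.0, 0.0), (5.0, 0.0)]  # diameter 5, centroid (2.5, 0)
far_neighbor = [(22.5, 0.0)]        # centroid distance 20
near_neighbor = [(7.5, 0.0)]        # centroid distance 5

print(quality_ratio(cluster, [far_neighbor]))   # 4.0
print(quality_ratio(cluster, [near_neighbor]))  # 1.0
```

A larger ratio means the cluster is compact relative to its separation from its neighbors.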
40 Cluster Quality Quality can be assessed simply by looking at the diameter of a cluster. A cluster can be formed even when there is no similarity between the clustered patterns; this occurs because the algorithm forces k clusters to be created.
41 Characteristics of k-means clustering The random selection of initial center points creates the following properties: non-determinism, and the possibility of producing clusters without patterns. One solution is to choose the centers randomly from existing patterns.
42 K-means clustering algorithm complexity Linear relationship with the number of data points, N: the CPU time required is proportional to cN, where c does not depend on N but rather on the number of clusters, k. Low computational complexity. High speed.