CZ5211 Topics in Computational Biology
Lecture 4: Clustering Analysis for Microarray Data II
Prof. Chen Yu Zong
Tel: 6874-6877
Room 07-24, Level 7, SOC1, NUS

2 K-means clustering
This method differs from hierarchical clustering in several ways. In particular:
– There is no hierarchy; the data are partitioned. You are presented only with the final cluster membership for each case.
– There is no role for the dendrogram in k-means clustering.
– You must supply the number of clusters (k) into which the data are to be grouped.

3 Example of K-means algorithm: Lloyd's algorithm
– Has been shown to converge to a locally optimal solution
– But can converge to a solution arbitrarily bad compared to the optimal solution
[Figure: k=3 example showing the data points, the optimal centers, and the heuristic centers found by the algorithm]

4 K-means clustering
Given a set of n data points in d-dimensional space and an integer k, we want to find the set of k centers in d-dimensional space that minimizes the mean squared distance from each data point to its nearest center.
No exact polynomial-time algorithm is known for this problem; it is NP-hard in general.
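Stated formally (a standard formulation; the notation is added here, not taken from the slides), the objective is:

```latex
\min_{c_1,\dots,c_k \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \; \min_{1 \le j \le k} \lVert x_i - c_j \rVert^2
```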

5 K-means clustering
– Usually uses Euclidean distance, which gives spherical clusters
– How many clusters, K?
– The solution is not unique; the clustering can depend on your starting point

6 K-means clustering
Step 1: Transform the n (genes) × m (experiments) matrix into an n (genes) × n (genes) distance matrix
Step 2: Cluster the genes based on a k-means clustering algorithm

7 K-means clustering
To transform the n × m matrix into an n × n matrix, use a similarity (distance) metric (Tavazoie et al., Nature Genetics 1999 Jul;22(3):281-5).
Euclidean distance, for any two genes X and Y observed over a series of M conditions:
d(X, Y) = \sqrt{\sum_{i=1}^{M} (X_i - Y_i)^2}
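A minimal sketch of this transformation in Python/NumPy (the array names and the toy shapes are illustrative assumptions, not from the slides):

```python
import numpy as np

def euclidean_distance_matrix(expr):
    """Turn an n-genes x m-experiments matrix into an n x n distance matrix."""
    # Pairwise differences via broadcasting: (n, 1, m) - (1, n, m) -> (n, n, m)
    diff = expr[:, np.newaxis, :] - expr[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy example: 4 genes measured under 3 conditions
expr = np.array([[1.0, 2.0, 3.0],
                 [1.1, 2.1, 2.9],
                 [5.0, 0.5, 1.0],
                 [4.8, 0.4, 1.2]])
D = euclidean_distance_matrix(expr)   # D[i, j] = distance between genes i and j
print(D.round(2))
```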

8 K-means clustering
[Figure-only slide]

9 K-means clustering algorithm
Step 1: Suppose the gene expression patterns are positioned in a two-dimensional space based on the distance matrix.
Step 2: The first cluster center (red) is chosen randomly; subsequent centers are then chosen by finding the data point farthest from the centers already chosen. In this example, k=3.

10 K-means clustering algorithm
Step 3: Each point is assigned to the cluster associated with the closest representative center.
Step 4: Minimize the within-cluster sum of squared distances from the cluster mean by moving the centroid (star points), that is, by computing a new cluster representative.

11 K-means clustering algorithm
Step 5: Repeat steps 3 and 4 with the new representatives.
Run steps 3, 4, and 5 until no further changes occur – self-consistency is reached.

12 Basic Algorithm for K-Means
1. Choose K initial cluster centers at random.
2. Partition the objects into K clusters by assigning each object to the closest centroid.
3. Calculate the centroid of each of the K clusters.
4. Assign each object to cluster i by first calculating the distance from that object to all cluster centers, then choosing the closest.
5. If any object changes clusters, recalculate the centroids.
6. Repeat until no objects move anymore.
A runnable sketch of this loop follows.
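A minimal NumPy sketch of the algorithm above, assuming the data come as an (n, d) array; the function and variable names are illustrative:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic Lloyd-style k-means on an (n, d) array X: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose K initial centers at random from the existing points
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Steps 2 and 4: assign each point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # Step 6: no object changed cluster, so we stop
        labels = new_labels
        # Steps 3 and 5: recalculate each centroid as the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```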

13 Euclidean Distance and Centroid Point
Simple and fast! Remember this when we consider the complexity!
Euclidean distance between two n-dimensional points x and y:
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
The centroid of k n-dimensional points x^{(1)}, ..., x^{(k)} is their component-wise mean:
c = \frac{1}{k} \sum_{j=1}^{k} x^{(j)}
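In NumPy the centroid computation is a one-liner (a small illustrative sketch):

```python
import numpy as np

points = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 0.0]])   # k = 3 points in n = 2 dimensions
centroid = points.mean(axis=0)    # component-wise mean: array([3., 2.])
```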

14 K-means 2nd example with k=2
1. We pick k=2 centers at random.
2. We cluster our data around these center points.

15 K-means 2nd example with k=2 3.We recalculate centers based on our current clusters

16 K-means 2nd example with k=2 4.We re-cluster our data around our new center points

17 K-means 2nd example with k=2 5. We repeat the last two steps until no more data points are moved into a different cluster

18 K-means 3rd example: Initialization
[Figure: data points with three initial centers marked ×]

19 K-means 3rd example: Iteration 1
[Figure: centers and cluster assignments after iteration 1]

20 K-means 3rd example: Iteration 2
[Figure: centers and cluster assignments after iteration 2]

21 K-means 3rd example: Iteration 3
[Figure: centers and cluster assignments after iteration 3]

22 K-means 3rd example: Iteration 4
[Figure: centers and cluster assignments after iteration 4]

23 K-means clustering problems
– Random initialization means that you may get different clusters each time
– Data points are assigned to only one cluster (hard assignment)
– Implicit assumptions about the "shapes" of clusters
– You have to pick the number of clusters
A small demonstration of the initialization sensitivity follows.
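A sketch of the first problem using scikit-learn's KMeans (assuming scikit-learn is available; the toy data are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated 2-D blobs of 30 points each
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (0, 4, 8)])

# A single random initialization per run: different seeds can reach
# different local optima, visible as different final inertia values.
for seed in (1, 2):
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    print(seed, km.inertia_)
```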

24 K-means problem: always finds k clusters
[Figure: data with three fitted centers marked ×, even though the data show no clear cluster structure]

25 K-means problem: distance may not always accurately reflect relationship
– Each data point is assigned to the correct cluster
– But data points that seem far away from each other under the heuristic distance may in reality be closely related to each other

26 Tips on improving K-means clustering: split/combine clusters
Variations of the ISODATA algorithm:
– Split clusters that are too large by increasing k by one
– Merge clusters that are too small, by merging clusters that are very close to one another
What counts as too close and too far? A heuristic sketch follows.
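A hedged sketch of such split/merge heuristics; the thresholds split_thresh and merge_thresh are illustrative assumptions that would need tuning per dataset:

```python
import numpy as np

def split_or_merge(X, centers, labels, split_thresh=2.0, merge_thresh=0.5):
    """One ISODATA-style pass: split one loose cluster or merge one close pair."""
    k = len(centers)
    # Split: if a cluster's mean within-cluster distance is too large, add a center
    for j in range(k):
        members = X[labels == j]
        if len(members) > 1:
            radii = np.linalg.norm(members - centers[j], axis=1)
            if radii.mean() > split_thresh:
                # New center at the member farthest from the old centroid (k grows by one)
                return np.vstack([centers, members[radii.argmax()]])
    # Merge: if two centers are very close, replace them by their midpoint
    for a in range(k):
        for b in range(a + 1, k):
            if np.linalg.norm(centers[a] - centers[b]) < merge_thresh:
                merged = (centers[a] + centers[b]) / 2
                keep = [i for i in range(k) if i not in (a, b)]
                return np.vstack([centers[keep], merged])
    return centers  # no change needed
```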

27 Tips on improving K-means clustering: use k-medoids instead of centroids
– K-means uses centroids, the average of the samples in a cluster
– A medoid is a "representative object" within a cluster, i.e., an actual cluster member
– Less sensitive to outliers
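A minimal sketch of computing a cluster's medoid (the member minimizing total distance to the other members); the names and toy data are illustrative:

```python
import numpy as np

def medoid(members):
    """Return the cluster member with the smallest total distance to all others."""
    dists = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=-1)
    return members[dists.sum(axis=1).argmin()]

cluster = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [10.0, 10.0]])  # one outlier
print(medoid(cluster))  # the outlier drags the centroid but not the medoid
```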

28 Tips on improving K-means clustering: how to choose k?
– Use another clustering method
– Run the algorithm on the data with several different values of k, and look at the stability of the results
– Use advance knowledge about the characteristics of your data

29 Tips on improving K-means clustering: choosing K by using Silhouettes
The silhouette of a sample i is:
s(i) = (b_i - a_i) / max(a_i, b_i)
– a_i: average distance of sample i to the other samples in the same cluster
– b_i: average distance of sample i to the samples in the nearest neighboring cluster
The maximal average silhouette width can be used to select the number of clusters; samples with s(i) close to one are well classified.
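A sketch of choosing k by average silhouette width with scikit-learn (assuming scikit-learn is available and X is any (n, d) data array):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k_by_silhouette(X, k_values=range(2, 9)):
    """Return the k whose k-means labeling maximizes the average silhouette width."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores
```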

30 Tips on improving K-means clustering: choosing K by using Silhouettes
[Figure: silhouette plots for k=2 and k=3]

31 Tips on improving K-means clustering: choosing K by using WADP (weighted average discrepancy pairs)
– Add noise (perturbations) to the original data
– Count the pairs of samples that clustered together in the original data but no longer cluster together after perturbation
– Repeat for every cutoff level in hierarchical clustering, or for each k in k-means
– Estimate the proportion of pairs that changes for each k
– Use different levels of noise (heuristic)
– Look for the largest k before WADP gets large
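A hedged sketch of this perturbation idea; the noise level and the simple pair-discrepancy count are illustrative, not the exact WADP weighting:

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def discrepant_pair_rate(X, k, noise_sd=0.1, seed=0):
    """Fraction of originally co-clustered pairs that separate after noise is added."""
    rng = np.random.default_rng(seed)
    orig = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    pert = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
        X + rng.normal(scale=noise_sd, size=X.shape))
    together = [(i, j) for i, j in combinations(range(len(X)), 2) if orig[i] == orig[j]]
    broken = sum(1 for i, j in together if pert[i] != pert[j])
    return broken / max(len(together), 1)
```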

32 Tips on improving K-means clustering: choosing K by using Cluster Quality Measures
– By introducing a measure of cluster quality Q, different values of k can be evaluated until an optimal value of Q is reached
– But, since clustering is an unsupervised learning method, one can't really expect to find a "correct" measure Q
– So, once again, there are different choices of Q, and our decision will depend on what dissimilarity measure is used and what types of clusters we want

33 Tips on improving K-means clustering: choosing K by using Cluster Quality Measures
Jagota suggested a measure that emphasizes cluster tightness or homogeneity:
Q = \sum_{i=1}^{k} \frac{1}{|C_i|} \sum_{x \in C_i} d(x, \bar{x}_i)
– |C_i| is the number of data points in cluster i, and \bar{x}_i is its centroid
– Q will be small if (on average) the data points in each cluster are close
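A small sketch computing this Q for a k-means labeling (assuming Euclidean distance to each cluster's centroid, as in the reconstruction above):

```python
import numpy as np

def jagota_q(X, labels):
    """Homogeneity Q: summed mean distance of points to their cluster centroid."""
    q = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        centroid = members.mean(axis=0)
        q += np.linalg.norm(members - centroid, axis=1).mean()
    return q  # smaller Q means tighter clusters
```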

34 Tips on improving K-means clustering: choosing K by using Cluster Quality Measures
[Figure: plot of Q against k, as given in Jagota, for k-means clustering on the data shown earlier]
How many clusters do you think there actually are?

35 Tips on improving K-means clustering: choosing K by using Cluster Quality Measures
– The Q measure given in Jagota takes into account homogeneity within clusters, but not separation between clusters
– Other measures try to combine these two characteristics (e.g., the Davies-Bouldin measure)
– An alternative approach is to look at cluster stability:
–– Add random noise to the data many times and count how many pairs of data points no longer cluster together
–– How much noise to add? It should reflect the estimated variance in the data
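scikit-learn ships a Davies-Bouldin implementation, so a quick comparison across candidate k values might look like this (a sketch; lower scores indicate better-separated, tighter clusters):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def davies_bouldin_by_k(X, k_values=range(2, 9)):
    """Map each candidate k to its Davies-Bouldin score (lower is better)."""
    return {k: davies_bouldin_score(
                X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
            for k in k_values}
```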

36 What makes a clustering good?
– Clustering results can be different for different methods and distance metrics
– Except in the simplest of cases, the result is sensitive to noise and outliers in the data
– As in the case of differential genes, we are looking for:
–– Homogeneity: similarity within a cluster
–– Separation: differences between clusters

37 What makes a clustering good? Hypothesis Testing Approach
– The null hypothesis is that the data have NO structure
– Generate a reference data population under this random hypothesis (i.e., data modeling a random structure) and compare it to the actual data
– Estimate a statistic that indicates data structure

38 Cluster Quality
Since any data can be clustered, how do we know our clusters are meaningful?
– The size (diameter) of the cluster vs. the inter-cluster distance
– The distance between the members of a cluster and the cluster's center
– The diameter of the smallest sphere

39 Cluster Quality
[Figure: clusters of size (diameter) 5, with one pair of clusters separated by a distance of 20 and another by a distance of 5]
The quality of a cluster is assessed by the ratio of the distance to the nearest cluster to the cluster diameter.
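A minimal sketch of that ratio (names illustrative): with diameter 5, a nearest-cluster distance of 20 gives a ratio of 4, while a distance of 5 gives a ratio of 1, a much poorer separation.

```python
import numpy as np

def quality_ratio(cluster, nearest_other_center):
    """Ratio of distance-to-nearest-cluster to cluster diameter (bigger is better)."""
    diameter = max(np.linalg.norm(a - b) for a in cluster for b in cluster)
    centroid = cluster.mean(axis=0)
    return np.linalg.norm(centroid - nearest_other_center) / diameter
```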

40 Cluster Quality
– Quality can be assessed simply by looking at the diameter of a cluster
– A cluster can be formed even when there is no similarity between the clustered patterns; this occurs because the algorithm forces k clusters to be created

41 Characteristics of k-means clustering
The random selection of initial center points creates the following properties:
– Non-determinism
– May produce clusters without patterns
One solution is to choose the centers randomly from among the existing patterns.

42 K-means clustering algorithm complexity
– Linear relationship with the number of data points, N: the CPU time required is proportional to cN
– c does not depend on N, but rather on the number of clusters, k
– Low computational complexity, high speed