
More About Clustering Naomi Altman Nov '06

Assessing Clusters

Some things we might like to do:
1. Understand the within-cluster similarity and between-cluster differences in terms of distances and expression profiles.
2. Decide how many clusters "are" in the data.
3. Display the clusters graphically.
4. Determine cluster stability.
5. Determine stability of genes in a cluster.

Assessing Clusters

1. Understand the within-cluster similarity and between-cluster differences in terms of distances and expression profiles.
2. Decide how many clusters "are" in the data.
   - Distance similarity: genes in the same cluster should be closer together than genes in different clusters.
   - Expression similarity: genes in the same cluster should have expression profiles that are more similar (on features of interest) than genes in different clusters. (And if this is not true, maybe we need a different similarity measure.)
3. Display the clusters graphically.
4. Determine cluster stability.
5. Determine stability of genes in a cluster.

Distance similarity: Genes in the same cluster should be closer together than genes in different clusters. We need to be careful about how we define "closer".

Some criteria

Minimize the total within-cluster distances. (K-means clustering finds a local minimum of this criterion when the distances are raised to the power p = 2; we might also consider p = 1.) Clearly, the larger K is, the smaller this criterion, so we could examine a scree plot of the criterion against K.

The silhouette measures how much closer points in a cluster are to one another than to points outside the cluster. Observations with a large positive silhouette width (near 1) are well clustered, those with widths around 0 lie between clusters, and those with negative widths are placed in the "wrong" cluster.
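
For reference, the silhouette width of observation i is

s(i) = (b(i) - a(i)) / max(a(i), b(i)),

where a(i) is the average distance from i to the other members of its own cluster and b(i) is the average distance from i to the members of the nearest other cluster.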

# M.yeast is the data with 294 genes and 24 times (arrays)
library(cluster)
kmeans6.out = kmeans(M.yeast, 6)
sil6 = silhouette(kmeans6.out$cluster,
                  dist(yeast.M[, 2:25], method = "euclidean"))
plot(sil6)
summary(sil6)

Some diagnostics for the Cell Cycle Data

[summary(sil6) output for the 294 genes: cluster sizes with their average silhouette widths, plus the Min. / 1st Qu. / Median / Mean / 3rd Qu. / Max. of the individual silhouette widths; numeric values not preserved in the transcript.]

Using Height

In hierarchical clustering, we can use the cluster height to help determine where to define clusters.

complete.out = hclust(dist(M.yeast), method = "complete")
plot(1:293, complete.out$height, ylab = "Distance at Join",
     main = "Complete Linkage")

Using Height in Hierarchical Clustering Gaps might indicate heights at which to cut the tree to form clusters

Assessing Clusters

1. Understand the within-cluster similarity and between-cluster differences in terms of distances and expression profiles.
2. Decide how many clusters "are" in the data.
3. Display the clusters graphically.
   - How do the clusters relate to the treatments?
   - What are the "typical" expression profiles for genes in the clusters?
4. Determine cluster stability.
5. Determine stability of genes in a cluster.

Plotting Cluster Centers

# All of the clustering routines provide a vector, in the same order
# as the genes, giving each gene's cluster number.
# For hierarchical methods, you need to cut the tree to form the
# clusters; you can cut either at a given height or into a desired
# number of clusters.
clust6 = cutree(complete.out, k = 6)
par(mfrow = c(2, 3))
for (i in 1:6) {
  M = M.yeast[clust6 == i, ]
  meansi = apply(M, 2, mean)   # mean expression profile of cluster i
  plot(time, meansi, xlab = "Time", ylab = "Mean Expression",
       main = paste("Cluster", i), type = "l")
}

[Figure: cluster mean profiles under complete linkage vs. single linkage, panels labeled by cluster and n (cluster size).]

[Figure: cluster mean profiles under complete linkage vs. K-means, panels labeled by cluster and n (cluster size).]

The genes do not look much like the cluster means - wrong metric! We are interested in the cyclic pattern - we should use correlation distance.
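
In R, a correlation-based distance can be formed directly (a sketch; it uses 1 - r between gene profiles, with genes in the rows of M.yeast):

# correlation distance between gene profiles
cor.dist = as.dist(1 - cor(t(M.yeast)))
cor.out = hclust(cor.dist, method = "complete")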

Correlation Distance by Z-scores

zscore = function(v) (v - mean(v)) / sqrt(var(v))
yeast.Z = apply(yeast.M[, 2:25], 1, zscore)
# apply saves the results in the columns - we want genes in rows
yeast.Z = t(yeast.Z)
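
Why this works: for z-scored profiles z and z' of length n (with the variance computed using the n - 1 denominator, as var does), ||z - z'||^2 = 2(n - 1)(1 - r), where r is the Pearson correlation of the original profiles. So Euclidean distance on the z-scores is a monotone function of correlation distance, and the Euclidean-based routines above can be reused unchanged.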

[Figure: the data (left) and its z-scores (right).]

Assessing Clusters

1. Understand the within-cluster similarity and between-cluster differences in terms of distances and expression profiles.
2. Decide how many clusters "are" in the data.
3. Display the clusters graphically.
4. Determine cluster stability.
   - We already know that different clustering methods and metrics cluster differently.
   - How stable are the clusters if we: add noise; leave out a treatment, array, or gene?
   - Removing noise stabilizes real clusters.
5. Determine stability of genes in a cluster.

The Effect of Noise on Clustering Each data value could be anywhere inside the circle. The red gene might be clustered with the blue or green genes. The correlation between the black and red genes is 1.0. Correlation distance will blow up patterns that are just due to noise.

Assessing "Noise" When we have replicates for the treatments, we have several ways to assess noise. 1) Bootstrap arrays - i.e. resample with replacement within a treatment. 2) Subsample arrays - i.e. take subsets of arrays within a treatment 3) Take residuals from a fitted model a) Bootstrap the residuals b) Use the residuals to estimate an error distribution and resample from the distribution

Resampling from Arrays

We have treatments 1, ..., T, and for each treatment t there are n_t arrays.
- Bootstrap: sample n_t arrays with replacement within each treatment.
- Subsample: sample k_t arrays without replacement within each treatment.
Recluster. Look at changes in clusters or cluster summaries for each reclustering. (Note that cluster labels will change.)
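
A minimal R sketch of the array bootstrap, assuming M.yeast (genes x arrays) as before and a hypothetical factor trt giving each array's treatment:

# bootstrap arrays within each treatment, then recluster the genes
boot.cluster = function(M, trt, k = 6) {
  trt = as.factor(trt)
  cols = unlist(lapply(levels(trt), function(g) {
    idx = which(trt == g)
    sample(idx, length(idx), replace = TRUE)  # resample with replacement
  }))
  kmeans(M[, cols], k)$cluster
}
# 100 perturbed clusterings, one column per bootstrap replicate
boot.labels = replicate(100, boot.cluster(M.yeast, trt))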

Using Residuals

We have treatments 1, ..., T, and for each treatment there are n_t arrays. For each gene i and treatment t there is an expression mean m_it. For gene i, treatment t, array a, there is an expression value y_ita. The estimated noise is r_ita = y_ita - m_it.

Obtain bootstrap samples by sampling with replacement from the r_ita (for each gene, but ignoring t and a), giving residuals r*_ita. The bootstrap samples are y*_ita = m_it + r*_ita. (If you are worried about correlation among the genes on the same array, you can sample the entire vector of errors.)

Alternatively, use the histogram of the r_ita to estimate a distribution for each gene (e.g. N(0, sigma_i^2)) and sample from that distribution (but then it is harder to deal with correlation among genes).

Recluster. Look at changes in clusters or cluster summaries for each reclustering. (Note that cluster labels will change.)
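
A sketch of the residual bootstrap, with the same hypothetical trt factor:

# m_it: treatment means for each gene (genes x treatments)
m = t(apply(M.yeast, 1, function(y) tapply(y, trt, mean)))
fitted = m[, as.integer(trt)]        # m_it matched to each array
resid = M.yeast - fitted             # r_ita = y_ita - m_it
# resample residuals within each gene, ignoring t and a
r.star = t(apply(resid, 1, sample, replace = TRUE))
y.star = fitted + r.star             # y*_ita = m_it + r*_ita
kstar = kmeans(y.star, 6)$cluster    # recluster the perturbed data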

Removing genes breaks up clusters; removing arrays changes similarity.

Other Perturbations

Recluster using:
- Random subsamples of genes
- Random subsamples of treatments

Using the Simulated Samples

If the clusters are well-separated, we expect the changes due to the perturbations to be small.
If the clusters are just "zip-coding" - i.e. dividing a big cluster into nearby but not separated regions - we expect the changes due to the perturbations to be large.
If the clusters are due to noise, we expect the changes due to the perturbations to be large.

The original data and a slightly perturbed version lead to perturbed dendrograms and cluster centers

Reducing Noise to Find Structure

As for differential expression analysis, we hope to find replicable structure by clustering. Reducing noise improves our ability to find biologically meaningful structure - particularly when we are using scale-invariant similarity measures like correlation.

1. Replicate, replicate, replicate
We can cluster arrays just like we cluster genes. If an array does not cluster with its biological replicates, then either there is no differential expression or it is a bad array.
For each gene, the mean over the replicates is less noisy by a factor of the square root of the number of replicates - 2 reps give about 30% less noise, 4 reps give 50% less.
Instead of clustering each gene's profile over arrays, cluster its profile over the treatment means.
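
A sketch of clustering over treatment means (same hypothetical trt factor as above):

# average the arrays within each treatment, then cluster the mean profiles
trt.means = t(apply(M.yeast, 1, function(y) tapply(y, trt, mean)))
means.out = hclust(dist(trt.means), method = "complete")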

Reducing Noise to Find Structure

As for differential expression analysis, we hope to find replicable structure by clustering. Reducing noise improves our ability to find biologically meaningful structure - particularly when we are using scale-invariant similarity measures like correlation.

2. Filter genes
Remove non-differentially expressing genes from the data matrix before clustering - especially when using scale-free distances, which will blow up patterns that are just due to noise.
Best: if possible, do a differential expression analysis and use only the significant genes.

Reducing Noise to Find Structure

As for differential expression analysis, we hope to find replicable structure by clustering. Reducing noise improves our ability to find biologically meaningful structure - particularly when we are using scale-invariant similarity measures like correlation.

3. Reduce dimension
The initial dimension for clustering genes is the number of arrays. Using treatment means reduces this to the number of treatments. If there are no replicates, dimension reduction can be used instead: for example, regress each gene on the most significant eigengenes, or on a pattern of interest such as sine and cosine waves. The regression coefficients are then used as the input to the clustering routine.

library(limma)   # provides lmFit

svd.z = svd(yeast.Z)
# regress each gene's z-scored profile on the first 4 eigengenes
design.reg = model.matrix(~ svd.z$v[, 1:4])
fit.reg = lmFit(yeast.Z, design.reg)
# cluster the genes on their regression coefficients
pc.out = hclust(dist(fit.reg$coef))

Assessing Clusters

1. Understand the within-cluster similarity and between-cluster differences in terms of distances and expression profiles.
2. Decide how many clusters "are" in the data.
3. Display the clusters graphically.
4. Determine cluster stability.
5. Determine stability of genes in a cluster.
   - If the clusters stay distinguishable during the simulation, count the number of times each gene is in each cluster.
   - If the clusters are less stable, count the number of times the pair gene i and gene j are in the same cluster.

Summarizing the Perturbed Data

Dendrograms: compute the consensus tree. There are various methods based on the number of times each branch of the dendrogram appears in the perturbed data. (I use phylogenetics software for this.)

Clusters:
- If clusters are stable, put each gene into the cluster it joins most often.
- If clusters are less stable, use the pairwise co-clustering counts as a new distance measure and recluster based on this measure (see the sketch below).
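
A sketch of the pairwise-count approach, reusing the boot.labels matrix (genes x replicates) from the array-bootstrap sketch above:

# co[i, j] = number of perturbed clusterings in which genes i and j co-cluster
B = ncol(boot.labels)
n = nrow(boot.labels)
co = matrix(0, n, n)
for (b in 1:B)
  co = co + outer(boot.labels[, b], boot.labels[, b], "==")
# turn co-clustering frequency into a distance and recluster
co.dist = as.dist(1 - co / B)
consensus.out = hclust(co.dist, method = "average")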

Other Partitioning Methods

1. Partitioning around medoids (PAM): instead of averages, use multidimensional medians as centroids (cluster "prototypes"). Dudoit and Fridlyand (2002).
2. Self-organizing maps (SOM): add an underlying "topology" (a neighboring structure on a lattice) that relates cluster centroids to one another. Kohonen (1997), Tamayo et al. (1999).
3. Fuzzy k-means: allow for a "gradation" of points between clusters; soft partitions. Gasch and Eisen (2002).
4. Mixture-based clustering: implemented through an EM (Expectation-Maximization) algorithm. This provides soft partitioning, and allows for modeling of cluster centroids and shapes. Yeung et al. (2001), McLachlan et al. (2002).

F. Chiaromonte, Spring '06
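
PAM, for example, is available in the cluster package loaded earlier (a sketch on the same data):

library(cluster)
pam6.out = pam(M.yeast, k = 6)   # partitioning around medoids
plot(silhouette(pam6.out))       # silhouette plot of the PAM partition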

Assessing the Clusters Computationally

The bottom line is that the clustering is "good" if it is biologically meaningful (but this is hard to assess). Computationally we can:
1) Use a goodness-of-cluster measure, such as the within-cluster distances compared to the between-cluster distances.
2) Perturb the data and assess cluster changes:
   a) add noise (maybe residuals after ANOVA)
   b) resample (genes, arrays)
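
A sketch of one such measure - the ratio of average within-cluster to average between-cluster distance for a given labeling (here the clust6 labels from earlier; smaller is better):

d = as.matrix(dist(M.yeast))
diag(d) = NA                           # ignore self-distances
same = outer(clust6, clust6, "==")     # TRUE if two genes share a cluster
within = mean(d[same], na.rm = TRUE)
between = mean(d[!same], na.rm = TRUE)
within / between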