1
Canadian Bioinformatics Workshops www.bioinformatics.ca
3
Module 5: Clustering
Exploratory Data Analysis of Biological Data using R
Boris Steipe, Toronto, May 23-24, 2013
Department of Biochemistry, Department of Molecular Genetics
This workshop includes material originally developed by Raphael Gottardo (FHCRC) and by Sohrab Shah (UBC).
† Herakles and Iolaos battle the Hydra. Classical (450-400 BCE)
4
Outline
- Principles
- Hierarchical clustering
- Partitioning methods: centroid-based clustering (K-means etc.)
- Model-based clustering
5
Examples
- Complexes in interaction data
- Domains in protein structure
- Proteins of similar function (based on measured similar properties, e.g. coregulation)
- ...
6
Introduction to clustering
Clustering...
- is an example of unsupervised learning
- is useful for the analysis of patterns in data
- can lead to class discovery.
Clustering is the partitioning of a data set into groups of elements that are more similar to each other than to elements in other groups. Clustering is a completely general method that can be applied to genes, samples, or both.
7
Hierarchical clustering
Given N items and a distance metric:
1. Assign each item to its own "cluster". Initialize the distance matrix between clusters as the distance between items.
2. Find the closest pair of clusters and merge them into a single cluster.
3. Compute new distances between the merged cluster and the remaining clusters.
4. Repeat steps 2-3 until all items have been merged into a single cluster.
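The loop above, written out in R as a minimal single-linkage sketch (illustration only; in practice use hclust(), which implements this far more efficiently):

# Naive single-linkage agglomeration (illustrative sketch, not production code).
# x: numeric matrix, one item per row.
naive.hclust <- function(x) {
  d <- as.matrix(dist(x))                 # step 1: item-item distance matrix
  clusters <- as.list(seq_len(nrow(x)))   # step 1: each item is its own cluster
  merges <- list()
  while (length(clusters) > 1) {
    # step 2: find the closest pair of clusters
    # (single linkage: cluster distance = minimum item-item distance)
    best <- c(1, 2); best.d <- Inf
    for (i in 1:(length(clusters) - 1)) {
      for (j in (i + 1):length(clusters)) {
        d.ij <- min(d[clusters[[i]], clusters[[j]]])
        if (d.ij < best.d) { best <- c(i, j); best.d <- d.ij }
      }
    }
    # steps 2-3: merge the pair; distances are recomputed on the next pass
    merges[[length(merges) + 1]] <- list(members = clusters[best], height = best.d)
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
  }
  merges   # step 4: the sequence of merges defines the dendrogram
}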
8
Hierarchical clustering
"Given N items and a distance metric..." What is a metric? A metric has to fulfill three conditions: identity, symmetry, and the triangle inequality, written out below.
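In standard notation, for any items x, y, z:

\[
\begin{aligned}
&d(x,y) = 0 \iff x = y && \text{(identity)}\\
&d(x,y) = d(y,x) && \text{(symmetry)}\\
&d(x,z) \le d(x,y) + d(y,z) && \text{(triangle inequality)}
\end{aligned}
\]

(Non-negativity, d(x,y) ≥ 0, follows from these three.)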
9
Distance metrics
Common metrics include:
- Manhattan distance
- Euclidean distance
- 1-correlation (proportional to the squared Euclidean distance between standardized profiles, and therefore invariant to the range of measurement from one sample to the next)
The formulas are given below.
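Written out (standard definitions), for profiles x and y of length n:

\[
\begin{aligned}
d_{\text{Manhattan}}(x,y) &= \sum_{i=1}^{n} \lvert x_i - y_i \rvert\\
d_{\text{Euclidean}}(x,y) &= \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}\\
d_{\text{1-cor}}(x,y) &= 1 - r_{xy}, \qquad
r_{xy} = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}
              {\sqrt{\sum_i (x_i-\bar{x})^2}\,\sqrt{\sum_i (y_i-\bar{y})^2}}
\end{aligned}
\]

In R, dist() provides method = "manhattan" and method = "euclidean"; a 1-correlation distance for the rows of a matrix x can be built with as.dist(1 - cor(t(x))).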
10
Distance metrics compared
Figure: the same data clustered under Euclidean, Manhattan, and 1-correlation metrics give different results. Distance matters!
11
Other distance metrics
Hamming distance, for ordinal, binary or categorical data (written out below).
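The Hamming distance counts the positions in which two profiles differ (standard definition):

\[
d_{\text{Hamming}}(x,y) = \sum_{i=1}^{n} \mathbf{1}[\,x_i \neq y_i\,]
\]

In R this is simply sum(x != y) for two vectors x and y.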
12
Agglomerative hierarchical clustering
13
Hierarchical clustering
Anatomy of hierarchical clustering:
Input: a distance matrix and a linkage method.
Output: a dendrogram, a tree that defines the relationships between objects and the distance between clusters; equivalently, a nested sequence of clusters.
14
Linkage methods
- single: the distance between two clusters is the minimum distance between any pair of their members
- complete: the maximum distance between any pair of members
- average: the mean distance over all pairs of members
- distance between centroids
The sketch below compares them on the same data.
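A short comparison on a single distance matrix (a sketch; USArrests is a dataset that ships with R, used here only as a stand-in):

d <- dist(USArrests)                     # any Euclidean distance matrix will do
op <- par(mfrow = c(1, 3))               # three dendrograms side by side
plot(hclust(d, method = "single"),   main = "single")
plot(hclust(d, method = "complete"), main = "complete")
plot(hclust(d, method = "average"),  main = "average")
par(op)
# method = "centroid" is also available; note that hclust() expects
# squared Euclidean distances for centroid linkage.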
15
Example: cell cycle data
# read the cell-cycle expression set: 50 genes (rows), 17 time points (columns)
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip=1)[1:50, 3:19])
# Euclidean distances between gene expression profiles
D.cho <- dist(cho.data, method = "euclidean")
# agglomerative hierarchical clustering, single linkage
hc.single <- hclust(D.cho, method = "single", members = NULL)
16
Example: cell cycle data
Single linkage dendrogram:
plot(hc.single)
17
Example: cell cycle data
Be careful with the interpretation of dendrograms: they can suggest a proximity between elements that does not reflect the actual distance between those elements! Compare, for example, items #1 and #47. One way to check this is sketched below.
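The cophenetic correlation quantifies how faithfully a dendrogram preserves the original distances (a sketch; cophenetic() is part of base R's stats package):

coph <- cophenetic(hc.single)            # distances implied by the tree
cor(as.vector(D.cho), as.vector(coph))   # near 1: faithful; low: misleading proximities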
18
Example: cell cycle data
Single linkage, k=2:
rect.hclust(hc.single, k=2)
19
Example: cell cycle data
Single linkage, k=3:
rect.hclust(hc.single, k=3)
20
Example: cell cycle data
Single linkage, k=4:
rect.hclust(hc.single, k=4)
21
Example: cell cycle data
Single linkage, k=5:
rect.hclust(hc.single, k=5)
22
Example: cell cycle data
Single linkage, k=25:
rect.hclust(hc.single, k=25)
23
Example: cell cycle data
Properties of cluster members, single linkage, k=4:
class.single <- cutree(hc.single, k = 4)   # cut the tree into 4 clusters
par(mfrow = c(2,2))                        # one panel per cluster
matplot(t(cho.data[class.single==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.single==2,]), type="l", xlab="time", ylab="log expression value")
# no t() here, presumably because cluster 3 has a single member
matplot(as.matrix(cho.data[class.single==3,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.single==4,]), type="l", xlab="time", ylab="log expression value")
24
Example: cell cycle data
Single linkage, k=4: the four cluster-member panels (1-4) shown alongside the dendrogram.
25
Example: cell cycle data
hc.complete <- hclust(D.cho, method = "complete", members = NULL)
plot(hc.complete)
rect.hclust(hc.complete, k = 4)
class.complete <- cutree(hc.complete, k = 4)
par(mfrow = c(2,2))
matplot...   # four panels, as before
26
Example: cell cycle data
Complete linkage, k=4:
rect.hclust(hc.complete, k=4)
27
Example: cell cycle data
Properties of cluster members, complete linkage, k=4 (panels 1-4).
28
Example: cell cycle data
Complete linkage, k=4 vs. single linkage, k=4 (panels 1-4 for each).
29
Example: cell cycle data
Complete linkage, k=4 vs. single linkage, k=4. We could use this comparison to revise the analysis... is that a bimodal distribution? Analyse signal, not noise!
30
Example: cell cycle data
hc.average <- hclust(D.cho, method = "average", members = NULL)
plot(hc.average)
rect.hclust(hc.average, k = 4)
class.average <- cutree(hc.average, k = 4)
par(mfrow = c(2,2))
matplot(t(cho.data[class.average==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.average==2,]), type="l", xlab="time", ylab="log expression value")
matplot(as.matrix(cho.data[class.average==3,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.average==4,]), type="l", xlab="time", ylab="log expression value")
31
Hierarchical clustering analyzed
Advantages:
- There may be small clusters nested inside large ones.
- No need to specify the number of groups ahead of time.
- Flexible linkage methods.
Disadvantages:
- Clusters might not be naturally represented by a hierarchical structure.
- It is necessary to 'cut' the dendrogram in order to produce clusters.
- Bottom-up clustering can result in poor structure at the top of the tree; early joins cannot be 'undone'.
32
Partitioning methods
Anatomy of a partitioning-based method:
Input: a data matrix, a distance function, and the number of groups.
Output: a group assignment for every object.
33
Partitioning-based methods
1. Choose K groups.
2. Initialise the group centers (aka centroids or medoids).
3. Assign each object to the nearest centroid, according to the distance metric.
4. Reassign (or recompute) the centroids.
5. Repeat the last two steps until the assignment stabilizes.
34
K-means vs. K-medoids
K-means:
- Centroids are the 'mean' of the clusters.
- Centroids need to be recomputed at every iteration.
- Initialisation is difficult, as the notion of a centroid may be unclear before beginning.
K-medoids:
- Centroids are actual objects that minimize the total within-cluster distance.
- A centroid can be determined from a quick lookup into the distance matrix.
- Initialisation is simply K randomly selected objects.
In R: kmeans() and pam(); see the sketch below.
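A short sketch contrasting the two on the cell cycle data (kmeans() is base R; pam() is in the cluster package, assumed here to be installed):

library(cluster)                      # provides pam()
km <- kmeans(cho.data, centers = 4)   # k-means: centroids are cluster means
pm <- pam(D.cho, k = 4)               # k-medoids: works directly from the distance matrix
table(kmeans = km$cluster, pam = pm$clustering)   # compare the two assignments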
35
Partitioning-based methods
Advantages:
- The number of groups is well defined.
- A clear, deterministic assignment of an object to a group.
- Simple algorithms for inference.
Disadvantages:
- You have to choose the number of groups.
- Sometimes objects do not fit well into any cluster.
- Can converge on locally optimal solutions, and often require multiple restarts with random initializations.
36
K-means
N items, assume K clusters. The goal is to minimize the within-cluster sum of squares (below) over the possible assignments and centroids, where μ_k represents the location of cluster k.
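The objective, written out (the standard k-means criterion, with C(i) denoting the cluster assigned to item i):

\[
W(C, \mu) = \sum_{k=1}^{K} \;\sum_{i:\,C(i)=k} \lVert x_i - \mu_k \rVert^2
\]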
37
K-means
1. Divide the data into K clusters. Initialize the centroids with the mean of each cluster.
2. Assign each item to the cluster with the closest centroid.
3. When all objects have been assigned, recalculate the centroids (means).
4. Repeat steps 2-3 until the centroids no longer move.
A from-scratch sketch of this loop follows below.
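A minimal implementation of these steps (an illustrative sketch only; use kmeans() in practice; initialization from k random items is used here for simplicity, and empty clusters are not handled):

# x: numeric data matrix, one item per row; k: number of clusters
naive.kmeans <- function(x, k, max.iter = 100) {
  centroids <- x[sample(nrow(x), k), , drop = FALSE]  # step 1: start from k random items
  assignment <- rep(0, nrow(x))
  for (iter in 1:max.iter) {
    # step 2: assign each item to the closest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    new.assignment <- apply(d, 1, which.min)
    if (all(new.assignment == assignment)) break      # step 4: centroids no longer move
    assignment <- new.assignment
    # step 3: recalculate each centroid as the mean of its members
    for (j in 1:k) {
      centroids[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centroids)
}
# e.g.: naive.kmeans(cho.data, 4)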
38
K-means
# toy data: two Gaussian blobs
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
# run k-means from the same 5 random centers, stopping after 1, 2 and 3
# iterations, to watch the centroids converge
for (i in 1:3) {
  set.seed(100)
  cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = i)
  plot(x, col = cl$cluster)
  points(cl$centers, col = 1:5, pch = 8, cex = 2)
}
40
K-means
K-means, k=4:
set.seed(100)
km.cho <- kmeans(cho.data, 4)
par(mfrow = c(2,2))
matplot(t(cho.data[km.cho$cluster==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[km.cho$cluster==2,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[km.cho$cluster==3,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[km.cho$cluster==4,]), type="l", xlab="time", ylab="log expression value")
41
K-means
K-means, k=4 vs. single linkage, k=4 (panels 1-4 for each).
42
Summary
K-means and hierarchical clustering methods are simple, fast and useful techniques. Beware of the memory requirements for hierarchical clustering. Both are a bit "ad hoc": How many clusters? Which distance metric? What makes a good clustering?
43
Model based approaches
- Assume the data are 'generated' from a mixture of K distributions.
- What cluster assignment and parameters of the K distributions best explain the data?
- 'Fit' a model to the data; try to get the best fit.
- Classical example: mixture of Gaussians (mixture of normals).
- Takes advantage of probability theory and well-defined distributions in statistics.
A sketch follows below.
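As a concrete sketch, a Gaussian mixture can be fit with the mclust package (an assumption here: mclust is installed from CRAN; it is not otherwise used in this module):

library(mclust)                  # Gaussian mixture models, fit by EM
fit <- Mclust(cho.data, G = 4)   # fit a 4-component Gaussian mixture
table(fit$classification)        # the resulting cluster assignments
fit$parameters$mean              # the fitted component means
# leaving G unspecified lets Mclust() choose the number of components by BIC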
44
Model based clustering: array CGH
45
Model based clustering of aCGH
Problem: patient cohorts often exhibit molecular heterogeneity, making rarer shared CNAs hard to detect.
Approach: cluster the data by extending the profiling to the multi-group setting with a mixture of HMMs, HMM-Mix (Shah et al., Bioinformatics, 2009).
Figure: the model links raw data, CNA calls, the distribution of calls in a group, and sparse profiles (patient p, group g, states k and c).
46
Advantages of model based approaches
- In addition to clustering patients into groups, we output a 'model' that best represents the patients in a group.
- We can then associate each model with clinical variables, and simply output a classifier to be used on new patients.
- Choosing the number of groups becomes a model selection problem (cf. the Bayesian Information Criterion); see Yeung et al., Bioinformatics (2001).
47
Advanced topics in clustering
- Top-down clustering
- Bi-clustering or 'two-way' clustering
- Principal components analysis
- Choosing the number of groups: model selection (AIC, BIC), the silhouette coefficient, the gap curve (see the sketch below)
- Joint clustering and feature selection
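For instance, the silhouette coefficient and the gap curve are both available in the cluster package (a sketch, reusing hc.single and D.cho from the earlier slides):

library(cluster)
sil <- silhouette(cutree(hc.single, k = 4), D.cho)  # per-object silhouette widths
summary(sil)$avg.width   # near 1: well separated; near 0: ambiguous clustering
plot(sil)
# the gap statistic, for choosing the number of groups:
gap <- clusGap(cho.data, FUN = kmeans, K.max = 8, B = 50)
plot(gap)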
48
Best method?
What is the best clustering method? That depends on what you want to achieve. And sometimes clustering is not the best approach to begin with.
49
Density estimation
Clustering is a partition method. However, consider the following data:
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))    # 35 uniformly scattered points
x2 <- array(c(rnorm(30, 7, 0.7)), c(15, 2))   # 15 points concentrated around (7, 7)
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim = xrange, ylim = yrange, col = "black", xlab = "x", ylab = "y")
par(new = TRUE)
plot(x2, xlim = xrange, ylim = yrange, col = "red", axes = FALSE, xlab = "", ylab = "")
50
Density estimation
The same data, with the concentrated points moved into the middle of the scatter:
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))
x2 <- array(c(rnorm(30, 5, 0.7)), c(15, 2))   # now centred around (5, 5)
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim = xrange, ylim = yrange, col = "black", xlab = "x", ylab = "y")
par(new = TRUE)
plot(x2, xlim = xrange, ylim = yrange, col = "red", axes = FALSE, xlab = "", ylab = "")
51
(univariate) Density estimation
(From the density() documentation...)
> length(faithful$eruptions)
[1] 272
> head(faithful$eruptions, 8)
[1] 3.600 1.800 3.333 2.283 4.533 2.883 4.700 3.600
> hist(faithful$eruptions, col = rgb(0.9, 0.9, 0.9), main = "")
53
(univariate) Density estimation
(From the density() documentation...)
> length(faithful$eruptions)
[1] 272
> head(faithful$eruptions, 8)
[1] 3.600 1.800 3.333 2.283 4.533 2.883 4.700 3.600
> hist(faithful$eruptions, col = rgb(0.9, 0.9, 0.9), main = "")
> par(new = TRUE)
> plot(density(faithful$eruptions, bw = "sj"), main = "", xlab = "", ylab = "", axes = FALSE, col = "red", lwd = 3)
55
(univariate) Density estimation
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))
x2 <- array(c(rnorm(30, 5, 0.7)), c(15, 2))
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim = xrange, ylim = yrange, col = "black", xlab = "x", ylab = "y")
par(new = TRUE)
plot(x2, xlim = xrange, ylim = yrange, col = "red", axes = FALSE, xlab = "", ylab = "")
56
(univariate) Density estimation
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))
x2 <- array(c(rnorm(30, 7, 0.7)), c(15, 2))
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim = xrange, ylim = yrange, col = "black", xlab = "x", ylab = "y")
par(new = TRUE)
plot(x2, xlim = xrange, ylim = yrange, col = "red", axes = FALSE, xlab = "", ylab = "")
# overlay a kernel density estimate of the y coordinates
x3 <- rbind(x1, x2)
par(new = TRUE)
plot(density(x3[,2], bw = "sj"), main = "", xlab = "", ylab = "", axes = FALSE, col = "blue", lwd = 3)
58
boris.steipe@utoronto.ca