Canadian Bioinformatics Workshops www.bioinformatics.ca
MODULE 5: CLUSTERING
Exploratory Data Analysis and Essential Statistics using R
Boris Steipe
Department of Biochemistry, Department of Molecular Genetics, University of Toronto
outline
Principles
Hierarchical clustering
K-means
Model-based clustering
Leaning heavily on Raphael Gottardo's excellent 2008 material.
objectives
Understand the foundations of clustering: similarity, metricity.
Understand that there is usually no such thing as a "best" clustering; rather, this is a problem of trade-offs.
Learn what functions are available to cluster data in R.
Become competent in constructing synthetic datasets and performing cluster analysis.
Obtain a perspective on modern clustering techniques.
problem
problem
What is this data? What do "clustering", "partitioning", and "classification" mean?
examples
Complexes in interaction data
Domains in protein structure
Proteins of similar function (based on measured similar properties, e.g. coregulation)
...
examples
Define your own scenario, then analyse it as a thought experiment:
how many samples
how many dimensions
what are the elements
what are their properties
what could variation in the dimensions look like
what is the metric of similarity
what is the question we would like to ask
clustering
Clustering is the grouping of similar objects: partition a data set into subsets (clusters) so that the data in each subset are "close" to one another, where closeness is measured through some metric.
approaches
hierarchical clustering
Given N items and a distance metric...
1. Assign each item to its own cluster. Initialize the distance matrix between clusters as the distance between items.
2. Find the closest pair of clusters and merge them into a single cluster.
3. Compute new distances between the clusters.
4. Repeat 2-3 until all items are merged into a single cluster.
(A from-scratch sketch follows below.)
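For concreteness, a from-scratch sketch of this loop with single linkage (illustrative only; R's hclust(), used below, implements it efficiently):

naive_single_linkage <- function(x) {
  d <- as.matrix(dist(x))                  # step 1: item-item distances
  clusters <- as.list(seq_len(nrow(x)))    # every item starts as its own cluster
  merges <- list()
  while (length(clusters) > 1) {
    best <- list(dist = Inf, i = NA, j = NA)
    for (i in 2:length(clusters)) {
      for (j in 1:(i - 1)) {
        # step 2: single linkage = minimum over all cross-cluster item pairs
        dij <- min(d[clusters[[i]], clusters[[j]], drop = FALSE])
        if (dij < best$dist) best <- list(dist = dij, i = i, j = j)
      }
    }
    # step 3: merge the closest pair and record the merge height
    merged <- c(clusters[[best$i]], clusters[[best$j]])
    merges[[length(merges) + 1]] <- list(members = merged, height = best$dist)
    clusters[[best$j]] <- merged
    clusters[[best$i]] <- NULL             # step 4: repeat until one cluster remains
  }
  merges
}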
hierarchical clustering
"Given N items and a distance metric..." What is a metric? A metric d has to fulfill three conditions (checked numerically in the sketch below):
"identity": $d(x, y) = 0$ if and only if $x = y$
"symmetry": $d(x, y) = d(y, x)$
"triangle inequality": $d(x, z) \le d(x, y) + d(y, z)$
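These conditions can be checked numerically. A sketch for Euclidean distances between random points, which do form a metric, so all three tests should return TRUE:

set.seed(1)
x <- matrix(rnorm(20), ncol = 2)
d <- as.matrix(dist(x))
n <- nrow(d)
all(diag(d) == 0)                            # "identity":  d(a, a) = 0
all(d == t(d))                               # "symmetry":  d(a, b) = d(b, a)
all(sapply(1:n, function(k)                  # "triangle inequality":
  d <= outer(d[, k], d[k, ], "+") + 1e-12))  #  d(a, b) <= d(a, k) + d(k, b)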
single linkage
The distance between clusters is defined as the shortest distance from any member of one cluster to any member of the other cluster.
complete linkage
The distance between clusters is defined as the greatest distance from any member of one cluster to any member of the other cluster.
average linkage
The distance between clusters is defined as the average distance between members of one cluster and members of the other cluster (d = average of all pairwise distances).
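All three linkage rules are selected through the method argument of hclust(); a sketch contrasting them on the same synthetic data:

set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.5), ncol = 2))
D <- dist(x, method = "euclidean")
par(mfrow = c(1, 3))
for (m in c("single", "complete", "average")) {
  plot(hclust(D, method = m), main = m, labels = FALSE)  # same data, three trees
}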
example
Cell cycle dataset (Cho et al. 1998): expression levels of ~6000 genes during the cell cycle; 17 time points (2 cell cycles).
example
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip=1)[1:50, 3:19])
D.cho <- dist(cho.data, method = "euclidean")
hc.single <- hclust(D.cho, method = "single", members = NULL)
example
plot(hc.single)   # single linkage dendrogram
example
Careful with dendrograms: the leaf order can place elements next to one another even when they are far apart; apparent proximity does not necessarily correlate with the distance between elements!
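This caution can be quantified: cophenetic() (in the stats package) returns the distances implied by the tree, i.e. the merge height at which two elements first join; a correlation with the original distances well below 1 means the dendrogram distorts them. Using the objects computed above:

cor(D.cho, cophenetic(hc.single))   # cophenetic correlation of the single-linkage tree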
example
Single linkage, k=2: rect.hclust(hc.single, k=2)
Single linkage, k=3: rect.hclust(hc.single, k=3)
Single linkage, k=4: rect.hclust(hc.single, k=4)
Single linkage, k=5: rect.hclust(hc.single, k=5)
Single linkage, k=25: rect.hclust(hc.single, k=25)
example
Properties of cluster members, single linkage, k=4:
class.single <- cutree(hc.single, k = 4)
par(mfrow=c(2,2))
matplot(t(cho.data[class.single==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.single==2,]), type="l", xlab="time", ylab="log expression value")
matplot(as.matrix(cho.data[class.single==3,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.single==4,]), type="l", xlab="time", ylab="log expression value")
example
[Figure: single linkage, k=4; the four cluster profiles]
example
hc.complete <- hclust(D.cho, method = "complete", members = NULL)
plot(hc.complete)
rect.hclust(hc.complete, k=4)
class.complete <- cutree(hc.complete, k = 4)
par(mfrow=c(2,2))
matplot...
example
Complete linkage, k=4: rect.hclust(hc.complete, k=4)
example
[Figure: properties of cluster members, complete linkage, k=4]
example
[Figure: complete linkage, k=4, compared with single linkage, k=4]
example
Complete linkage, k=4: comparing the cluster profiles could be used to revise the analysis... a bimodal distribution? Analyse signal, not noise!
example
hc.average <- hclust(D.cho, method = "average", members = NULL)
plot(hc.average)
rect.hclust(hc.average, k=4)
class.average <- cutree(hc.average, k = 4)
par(mfrow=c(2,2))
matplot(t(cho.data[class.average==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.average==2,]), type="l", xlab="time", ylab="log expression value")
matplot(as.matrix(cho.data[class.average==3,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.average==4,]), type="l", xlab="time", ylab="log expression value")
K-means
N items, assume K clusters. The goal is to minimize
$\sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2$
over the possible assignments $C_k$ and centroids $\mu_k$; $\mu_k$ represents the location of cluster $k$.
K-means algorithm
1. Divide the data into K clusters. Initialize the centroids with the mean of each cluster.
2. Assign each item to the cluster with the closest centroid.
3. When all objects have been assigned, recalculate the centroids (means).
4. Repeat 2-3 until the centroids no longer move.
(A from-scratch sketch follows below.)
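A from-scratch sketch of this loop (illustrative only: no handling of empty clusters; kmeans(), used below, is the robust implementation):

naive_kmeans <- function(x, k, iter = 20) {
  centroids <- x[sample(nrow(x), k), , drop = FALSE]       # step 1: initialize
  for (it in seq_len(iter)) {
    # step 2: assign each item to the cluster with the closest centroid
    assign <- apply(x, 1, function(p)
      which.min(colSums((t(centroids) - p)^2)))
    # step 3: recalculate the centroids as the cluster means
    new_centroids <- t(sapply(seq_len(k), function(j)
      colMeans(x[assign == j, , drop = FALSE])))
    if (all(abs(new_centroids - centroids) < 1e-9)) break  # step 4: converged
    centroids <- new_centroids
  }
  list(cluster = assign, centers = centroids)
}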
K-means
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
for (i in 1:4) {
  set.seed(100)                # identical random start in every pass
  cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = i)
  plot(x, col = cl$cluster)    # watch the assignments converge as iter.max grows
  points(cl$centers, col = 1:5, pch = 8, cex = 2)
  Sys.sleep(2)
}
example
K-means, k=4 (compare complete linkage, k=4):
set.seed(100)
km.cho <- kmeans(cho.data, 4)
par(mfrow=c(2,2))
matplot(t(cho.data[km.cho$cluster==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[km.cho$cluster==2,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[km.cho$cluster==3,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[km.cho$cluster==4,]), type="l", xlab="time", ylab="log expression value")
example
[Figure: K-means, k=4, compared with complete linkage, k=4]
example
Why?
summary
K-means and hierarchical clustering methods are useful techniques: fast and easy to implement. Beware of memory requirements for HC. But they are a bit "ad hoc":
– Number of clusters?
– Distance metric?
– Good clustering? (one heuristic check is sketched below)
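One heuristic handle on "good clustering?" is the average silhouette width; a sketch using the cluster package (assumed to be installed) and the objects computed earlier:

library(cluster)
avg.sil <- sapply(2:8, function(k) {
  cl <- cutree(hc.single, k = k)               # one partition per choice of k
  mean(silhouette(cl, D.cho)[, "sil_width"])   # average silhouette width
})
plot(2:8, avg.sil, type = "b", xlab = "k", ylab = "average silhouette width")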
model based clustering
Based on probability models (e.g. normal mixture models). With an explicit model we can:
talk about what a good clustering is
compare several models
estimate the number of clusters!
model based clustering
Multivariate observations $x_1, \ldots, x_N$; K clusters. Assume observation $i$ belongs to cluster $k$; then $x_i \sim N(\mu_k, \Sigma_k)$, that is, each cluster can be represented by a multivariate normal distribution with mean $\mu_k$ and covariance $\Sigma_k$. Yeung et al. (2001)
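To make the model concrete, a sketch that draws synthetic data from a two-component bivariate normal mixture (mvrnorm() is from MASS, which ships with R):

library(MASS)
set.seed(1)
z <- rbinom(200, 1, 0.4) + 1                 # latent cluster labels, 1 or 2
mu    <- list(c(0, 0), c(3, 3))              # cluster means
Sigma <- list(diag(2), 0.5 * diag(2))        # cluster covariances
x <- t(sapply(z, function(k) mvrnorm(1, mu[[k]], Sigma[[k]])))
plot(x, col = z, xlab = "x", ylab = "y")     # colour by true cluster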
model based clustering
Banfield and Raftery (1993): eigenvalue decomposition of the cluster covariance,
$\Sigma_k = \lambda_k D_k A_k D_k^T$,
where $\lambda_k$ governs the volume, $D_k$ the orientation, and $A_k$ the shape of cluster $k$.
model based clustering
EII: spherical, equal volume
VII: spherical, unequal volume
EEE: equal volume, shape, and orientation
VVV: unconstrained
likelihood (mixture model) estimation
Given the number of clusters and the covariance structure, the EM algorithm can be used to fit the mixture model. The mclust R package is available from CRAN.
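To illustrate what EM does, a sketch of the E and M steps for the simplest case, a univariate two-component mixture with a common fixed variance; mclust performs this iteration in full multivariate generality:

em.two.normals <- function(x, iter = 50) {
  mu <- as.numeric(quantile(x, c(0.25, 0.75)))   # crude initialization
  p1 <- 0.5                                      # mixing weight of component 1
  s  <- sd(x)                                    # common, fixed standard deviation
  for (it in seq_len(iter)) {
    # E step: responsibility of component 1 for each observation
    d1 <- p1 * dnorm(x, mu[1], s)
    d2 <- (1 - p1) * dnorm(x, mu[2], s)
    r  <- d1 / (d1 + d2)
    # M step: re-estimate the weight and the means from the responsibilities
    p1 <- mean(r)
    mu <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
  }
  list(means = mu, weight = p1)
}
em.two.normals(faithful$eruptions)   # the bimodal eruption data used below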
model selection
Which model is appropriate? Which covariance structure? How many clusters? Compare the different models using BIC.
model selection
We wish to compare two models $M_1$ and $M_2$ with parameters $\theta_1$ and $\theta_2$ respectively. Given the observed data $D$, define the integrated likelihood
$p(D \mid M_k) = \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k$,
the probability to observe the data given model $M_k$. NB: $\theta_1$ and $\theta_2$ might have different dimensions.
model selection
To compare two models $M_1$ and $M_2$, use the integrated likelihoods. The integral is difficult to compute! The Bayesian information criterion approximates it:
$\mathrm{BIC}_k = 2 \log p(D \mid \hat{\theta}_k, M_k) - \nu_k \log N \approx 2 \log p(D \mid M_k)$,
where $p(D \mid \hat{\theta}_k, M_k)$ is the maximum likelihood and $\nu_k$ is the number of parameters in model $M_k$.
model selection
Bayesian information criterion:
$\mathrm{BIC}_k = 2 \log p(D \mid \hat{\theta}_k, M_k) - \nu_k \log N$
The first term is the measure of fit, the second the penalty term. A large BIC score indicates strong evidence for the corresponding model. BIC can be used to choose both the number of clusters and the covariance parametrization (Mclust).
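NB: the EMclust() calls on the next slide reflect the mclust interface of the time; in current mclust releases the same comparison is exposed as mclustBIC() and Mclust() (an assumption about the installed version). A sketch, using cho.data where the slides use a standardized cho.data.std:

library(mclust)
bic <- mclustBIC(cho.data, G = 1:6, modelNames = c("EII", "EEI"))
plot(bic)      # BIC traces over the number of clusters, one per model
summary(bic)   # best (model, number of clusters) combinations by BIC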
example revisited
install.packages("mclust")
library(mclust)
cho.mclust.bic <- EMclust(cho.data.std, modelNames=c("EII","EEI"))
plot(cho.mclust.bic)   # BIC curves: 1 = EII, 2 = EEI
cho.mclust <- EMclust(cho.data.std, 4, "EII")
sum.cho <- summary(cho.mclust, cho.data.std)
plot(cho.mclust)
example revisited
par(mfrow=c(2,2))
matplot(t(cho.data[sum.cho$classification==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[sum.cho$classification==2,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[sum.cho$classification==3,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[sum.cho$classification==4,]), type="l", xlab="time", ylab="log expression value")
example revisited
[Figure: EII, 4 clusters]
example revisited
cho.mclust <- EMclust(cho.data.std, 3, "EEI")
sum.cho <- summary(cho.mclust, cho.data.std)
par(mfrow=c(2,2))
matplot(t(cho.data[sum.cho$classification==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[sum.cho$classification==2,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[sum.cho$classification==3,]), type="l", xlab="time", ylab="log expression value")
example revisited
[Figure: EEI, 3 clusters]
summary
Model based clustering is an attractive alternative to heuristic clustering algorithms. BIC can be used for choosing the covariance structure and the number of clusters.
best method
What is the best clustering method?
best method
What is the best clustering method? That depends on what you want to achieve. And sometimes clustering is not the best approach to begin with.
density estimation
Clustering is a partition method. Consider:
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))
x2 <- array(c(rnorm(30, 7, 0.7)), c(15, 2))
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim=xrange, ylim=yrange, col="black", xlab="x", ylab="y")
par(new=TRUE)
plot(x2, xlim=xrange, ylim=yrange, col="red", axes=FALSE, xlab="", ylab="")
density estimation
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))
x2 <- array(c(rnorm(30, 5, 0.7)), c(15, 2))
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim=xrange, ylim=yrange, col="black", xlab="x", ylab="y")
par(new=TRUE)
plot(x2, xlim=xrange, ylim=yrange, col="red", axes=FALSE, xlab="", ylab="")
(univariate) density estimation
(From the density() documentation...)
> length(faithful$eruptions)
[1] 272
> head(faithful$eruptions, 8)
[1] 3.600 1.800 3.333 2.283 4.533 2.883 4.700 3.600
> hist(faithful$eruptions, col=rgb(0.9,0.9,0.9), main="")

(univariate) density estimation
> par(new=TRUE)
> plot(density(faithful$eruptions, bw = "sj"), main="", xlab="", ylab="", axes=FALSE, col="red", lwd=3)
structure space
[Figure: Ramachandran plot for alanine]
Averaged structures may be meaningless! Consequences for clustering...
Issues: metric; embedding; discrete or continuous.
density of structure space
Estimate density from discrete samples: place a Gaussian kernel on each discrete observation along a similarity axis (a 1-D projection is not required) and sum the kernels:
$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_\sigma(x - x_i)$
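A sketch of exactly this sum on univariate data (density() computes the same thing efficiently; sigma is the kernel bandwidth):

kde <- function(xs, obs, sigma) {
  # at each evaluation point, average the Gaussian kernels centred on the observations
  sapply(xs, function(x) mean(dnorm(x, mean = obs, sd = sigma)))
}
obs <- faithful$eruptions
xs  <- seq(min(obs) - 1, max(obs) + 1, length.out = 200)
plot(xs, kde(xs, obs, sigma = 0.2), type = "l",
     xlab = "eruption time", ylab = "estimated density")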
density
Issue: what is the best choice of the bandwidth parameter $\sigma$? Globally constant? Locally adaptive?
local maxima
Procedure (an R sketch follows below):
begin at a fragment c
for all fragments f within the neighbourhood of c: calculate density(f)
    if density(f) > maximum density so far: m = f
if m == c: return fragment c
else: c = m, repeat
Issue: definition of the "neighbourhood" parameter.
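A sketch of this procedure, taking the neighbourhood to be a fixed radius in a precomputed distance matrix (an assumed choice; the slide leaves the definition open):

climb <- function(start, D, dens, radius) {
  # D: distance matrix between fragments; dens: precomputed density per fragment
  cur <- start
  repeat {
    nb <- which(D[cur, ] <= radius)   # all fragments in the neighbourhood of cur
    m  <- nb[which.max(dens[nb])]     # the neighbour with the highest density
    if (m == cur) return(cur)         # cur is a local density maximum
    cur <- m                          # otherwise move uphill and repeat
  }
}
# usage sketch: D <- as.matrix(dist(fragments)); climb(1, D, dens, radius = 0.5)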
example: complex loop
Motif: 1icf 215; Length: 7; Support: 7; Rank: 399
density estimation
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))
x2 <- array(c(rnorm(30, 7, 0.7)), c(15, 2))
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim=xrange, ylim=yrange, col="black", xlab="x", ylab="y")
par(new=TRUE)
plot(x2, xlim=xrange, ylim=yrange, col="red", axes=FALSE, xlab="", ylab="")
x3 <- rbind(x1, x2)
par(new=TRUE)
plot(density(x3[,2], bw = "sj"), main="", xlab="", ylab="", axes=FALSE, col="blue", lwd=3)
conclusion
This is just scratching the surface of clustering algorithms; there are many others:
– Two-way clustering
– The plaid model...
If there are many, probably none is good for all situations. Clustering is a useful tool and... a dangerous weapon. To be consumed with moderation!
to ponder
"A twentieth century fluid dynamicist could hardly expect to advance knowledge in his field without first adopting a body of terminology and mathematical technique. In return, unconsciously, he would give up much freedom to question the foundations of his science."
Thomas S. Kuhn
http://bioinformatics.ca
boris.steipe@utoronto.ca
nuit blanche: tonight, one night, all night