Canadian Bioinformatics Workshops www.bioinformatics.ca
MODULE 5: CLUSTERING
Exploratory Data Analysis and Essential Statistics using R
Boris Steipe
Department of Biochemistry, Department of Molecular Genetics, University of Toronto
outline
Principles
Hierarchical clustering
K-means
Model-based clustering
Leaning heavily on Raphael Gottardo's excellent 2008 material.
objectives
Understand the foundations of clustering: similarity, metricity.
Understand that there is usually no such thing as a "best" clustering; rather, this is a problem of trade-offs.
Learn what functions are available to cluster data in R.
Become competent in constructing synthetic datasets and performing cluster analysis.
Obtain a perspective on modern clustering techniques.
problem
problem
What is this data? What do "clustering", "partitioning", and "classification" mean?
examples
Complexes in interaction data
Domains in protein structure
Proteins of similar function (based on measured similar properties, e.g. coregulation)
...
examples
Define your own scenario, then analyse it as a thought experiment:
how many samples
how many dimensions
what are the elements
what are their properties
what could variation in the dimensions look like
what is the metric of similarity
what is the question we would like to ask
clustering
Clustering is the grouping of similar objects: partition a data set into subsets (clusters) so that the data in each subset are "close" to one another, where closeness is measured through some metric.
approaches
hierarchical clustering
Given N items and a distance metric...
1. Assign each item to its own cluster. Initialize the distance matrix between clusters as the distance between items.
2. Find the closest pair of clusters and merge them into a single cluster.
3. Compute new distances between the clusters.
4. Repeat 2-3 until all items are merged into a single cluster.
(A from-scratch sketch follows below.)
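For concreteness, a from-scratch sketch of this loop with single linkage (illustrative only; R's hclust(), used below, implements it efficiently):

naive_single_linkage <- function(x) {
  d <- as.matrix(dist(x))                  # step 1: item-item distances
  clusters <- as.list(seq_len(nrow(x)))    # every item starts as its own cluster
  merges <- list()
  while (length(clusters) > 1) {
    best <- list(dist = Inf, i = NA, j = NA)
    for (i in 2:length(clusters)) {
      for (j in 1:(i - 1)) {
        # step 2: single linkage = minimum over all cross-cluster item pairs
        dij <- min(d[clusters[[i]], clusters[[j]], drop = FALSE])
        if (dij < best$dist) best <- list(dist = dij, i = i, j = j)
      }
    }
    # step 3: merge the closest pair and record the merge height
    merged <- c(clusters[[best$i]], clusters[[best$j]])
    merges[[length(merges) + 1]] <- list(members = merged, height = best$dist)
    clusters[[best$j]] <- merged
    clusters[[best$i]] <- NULL             # step 4: repeat until one cluster remains
  }
  merges
}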
hierarchical clustering
"Given N items and a distance metric..." What is a metric? A metric d has to fulfill three conditions (checked numerically in the sketch below):
"identity": $d(x, y) = 0$ if and only if $x = y$
"symmetry": $d(x, y) = d(y, x)$
"triangle inequality": $d(x, z) \le d(x, y) + d(y, z)$
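These conditions can be checked numerically. A sketch for Euclidean distances between random points, which do form a metric, so all three tests should return TRUE:

set.seed(1)
x <- matrix(rnorm(20), ncol = 2)
d <- as.matrix(dist(x))
n <- nrow(d)
all(diag(d) == 0)                            # "identity":  d(a, a) = 0
all(d == t(d))                               # "symmetry":  d(a, b) = d(b, a)
all(sapply(1:n, function(k)                  # "triangle inequality":
  d <= outer(d[, k], d[k, ], "+") + 1e-12))  #  d(a, b) <= d(a, k) + d(k, b)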
single linkage
The distance between clusters is defined as the shortest distance from any member of one cluster to any member of the other cluster.
complete linkage
The distance between clusters is defined as the greatest distance from any member of one cluster to any member of the other cluster.
average linkage
The distance between clusters is defined as the average distance between members of one cluster and members of the other cluster (d = average of all pairwise distances).
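All three linkage rules are selected through the method argument of hclust(); a sketch contrasting them on the same synthetic data:

set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.5), ncol = 2))
D <- dist(x, method = "euclidean")
par(mfrow = c(1, 3))
for (m in c("single", "complete", "average")) {
  plot(hclust(D, method = m), main = m, labels = FALSE)  # same data, three trees
}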
example
Cell cycle dataset (Cho et al. 1998): expression levels of ~6000 genes during the cell cycle; 17 time points (2 cell cycles).
example
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip=1)[1:50, 3:19])
D.cho <- dist(cho.data, method = "euclidean")
hc.single <- hclust(D.cho, method = "single", members = NULL)
example
plot(hc.single)   # single linkage dendrogram
example
Careful with dendrograms: the leaf order can place elements next to one another even when they are far apart; apparent proximity does not necessarily correlate with the distance between elements!
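This caution can be quantified: cophenetic() (in the stats package) returns the distances implied by the tree, i.e. the merge height at which two elements first join; a correlation with the original distances well below 1 means the dendrogram distorts them. Using the objects computed above:

cor(D.cho, cophenetic(hc.single))   # cophenetic correlation of the single-linkage tree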
example
Single linkage, k=2: rect.hclust(hc.single, k=2)
Single linkage, k=3: rect.hclust(hc.single, k=3)
Single linkage, k=4: rect.hclust(hc.single, k=4)
Single linkage, k=5: rect.hclust(hc.single, k=5)
Single linkage, k=25: rect.hclust(hc.single, k=25)
example
Properties of cluster members, single linkage, k=4:
class.single <- cutree(hc.single, k = 4)
par(mfrow=c(2,2))
matplot(t(cho.data[class.single==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.single==2,]), type="l", xlab="time", ylab="log expression value")
matplot(as.matrix(cho.data[class.single==3,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.single==4,]), type="l", xlab="time", ylab="log expression value")
example
[Figure: single linkage, k=4; the four cluster profiles]
example
hc.complete <- hclust(D.cho, method = "complete", members = NULL)
plot(hc.complete)
rect.hclust(hc.complete, k=4)
class.complete <- cutree(hc.complete, k = 4)
par(mfrow=c(2,2))
matplot...
example
Complete linkage, k=4: rect.hclust(hc.complete, k=4)
example
[Figure: properties of cluster members, complete linkage, k=4]
example
[Figure: complete linkage, k=4, compared with single linkage, k=4]
example
Complete linkage, k=4: comparing the cluster profiles could be used to revise the analysis... a bimodal distribution? Analyse signal, not noise!
example
hc.average <- hclust(D.cho, method = "average", members = NULL)
plot(hc.average)
rect.hclust(hc.average, k=4)
class.average <- cutree(hc.average, k = 4)
par(mfrow=c(2,2))
matplot(t(cho.data[class.average==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.average==2,]), type="l", xlab="time", ylab="log expression value")
matplot(as.matrix(cho.data[class.average==3,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[class.average==4,]), type="l", xlab="time", ylab="log expression value")
K-means
N items, assume K clusters. The goal is to minimize
$\sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2$
over the possible assignments $C_k$ and centroids $\mu_k$; $\mu_k$ represents the location of cluster $k$.
K-means algorithm
1. Divide the data into K clusters. Initialize the centroids with the mean of each cluster.
2. Assign each item to the cluster with the closest centroid.
3. When all objects have been assigned, recalculate the centroids (means).
4. Repeat 2-3 until the centroids no longer move.
(A from-scratch sketch follows below.)
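A from-scratch sketch of this loop (illustrative only: no handling of empty clusters; kmeans(), used below, is the robust implementation):

naive_kmeans <- function(x, k, iter = 20) {
  centroids <- x[sample(nrow(x), k), , drop = FALSE]       # step 1: initialize
  for (it in seq_len(iter)) {
    # step 2: assign each item to the cluster with the closest centroid
    assign <- apply(x, 1, function(p)
      which.min(colSums((t(centroids) - p)^2)))
    # step 3: recalculate the centroids as the cluster means
    new_centroids <- t(sapply(seq_len(k), function(j)
      colMeans(x[assign == j, , drop = FALSE])))
    if (all(abs(new_centroids - centroids) < 1e-9)) break  # step 4: converged
    centroids <- new_centroids
  }
  list(cluster = assign, centers = centroids)
}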
K-means
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
for (i in 1:4) {
  set.seed(100)                # identical random start in every pass
  cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = i)
  plot(x, col = cl$cluster)    # watch the assignments converge as iter.max grows
  points(cl$centers, col = 1:5, pch = 8, cex = 2)
  Sys.sleep(2)
}
example
K-means, k=4 (compare complete linkage, k=4):
set.seed(100)
km.cho <- kmeans(cho.data, 4)
par(mfrow=c(2,2))
matplot(t(cho.data[km.cho$cluster==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[km.cho$cluster==2,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[km.cho$cluster==3,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[km.cho$cluster==4,]), type="l", xlab="time", ylab="log expression value")
example
[Figure: K-means, k=4, compared with complete linkage, k=4]
example
Why?
summary
K-means and hierarchical clustering methods are useful techniques: fast and easy to implement. Beware of memory requirements for HC. But they are a bit "ad hoc":
– Number of clusters?
– Distance metric?
– Good clustering? (one heuristic check is sketched below)
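One heuristic handle on "good clustering?" is the average silhouette width; a sketch using the cluster package (assumed to be installed) and the objects computed earlier:

library(cluster)
avg.sil <- sapply(2:8, function(k) {
  cl <- cutree(hc.single, k = k)               # one partition per choice of k
  mean(silhouette(cl, D.cho)[, "sil_width"])   # average silhouette width
})
plot(2:8, avg.sil, type = "b", xlab = "k", ylab = "average silhouette width")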
model based clustering
Based on probability models (e.g. normal mixture models). With an explicit model we can:
talk about what a good clustering is
compare several models
estimate the number of clusters!
model based clustering
Multivariate observations $x_1, \ldots, x_N$; K clusters. Assume observation $i$ belongs to cluster $k$; then $x_i \sim N(\mu_k, \Sigma_k)$, that is, each cluster can be represented by a multivariate normal distribution with mean $\mu_k$ and covariance $\Sigma_k$. Yeung et al. (2001)
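To make the model concrete, a sketch that draws synthetic data from a two-component bivariate normal mixture (mvrnorm() is from MASS, which ships with R):

library(MASS)
set.seed(1)
z <- rbinom(200, 1, 0.4) + 1                 # latent cluster labels, 1 or 2
mu    <- list(c(0, 0), c(3, 3))              # cluster means
Sigma <- list(diag(2), 0.5 * diag(2))        # cluster covariances
x <- t(sapply(z, function(k) mvrnorm(1, mu[[k]], Sigma[[k]])))
plot(x, col = z, xlab = "x", ylab = "y")     # colour by true cluster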
model based clustering
Banfield and Raftery (1993): eigenvalue decomposition of the cluster covariance,
$\Sigma_k = \lambda_k D_k A_k D_k^T$,
where $\lambda_k$ governs the volume, $D_k$ the orientation, and $A_k$ the shape of cluster $k$.
model based clustering
EII: spherical, equal volume
VII: spherical, unequal volume
EEE: equal volume, shape, and orientation
VVV: unconstrained
likelihood (mixture model) estimation
Given the number of clusters and the covariance structure, the EM algorithm can be used to fit the mixture model. The mclust R package is available from CRAN.
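To illustrate what EM does, a sketch of the E and M steps for the simplest case, a univariate two-component mixture with a common fixed variance; mclust performs this iteration in full multivariate generality:

em.two.normals <- function(x, iter = 50) {
  mu <- as.numeric(quantile(x, c(0.25, 0.75)))   # crude initialization
  p1 <- 0.5                                      # mixing weight of component 1
  s  <- sd(x)                                    # common, fixed standard deviation
  for (it in seq_len(iter)) {
    # E step: responsibility of component 1 for each observation
    d1 <- p1 * dnorm(x, mu[1], s)
    d2 <- (1 - p1) * dnorm(x, mu[2], s)
    r  <- d1 / (d1 + d2)
    # M step: re-estimate the weight and the means from the responsibilities
    p1 <- mean(r)
    mu <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
  }
  list(means = mu, weight = p1)
}
em.two.normals(faithful$eruptions)   # the bimodal eruption data used below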
model selection
Which model is appropriate? Which covariance structure? How many clusters? Compare the different models using BIC.
model selection
We wish to compare two models $M_1$ and $M_2$ with parameters $\theta_1$ and $\theta_2$ respectively. Given the observed data $D$, define the integrated likelihood
$p(D \mid M_k) = \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k$,
the probability to observe the data given model $M_k$. NB: $\theta_1$ and $\theta_2$ might have different dimensions.
model selection
To compare two models $M_1$ and $M_2$, use the integrated likelihoods. The integral is difficult to compute! The Bayesian information criterion approximates it:
$\mathrm{BIC}_k = 2 \log p(D \mid \hat{\theta}_k, M_k) - \nu_k \log N \approx 2 \log p(D \mid M_k)$,
where $p(D \mid \hat{\theta}_k, M_k)$ is the maximum likelihood and $\nu_k$ is the number of parameters in model $M_k$.
model selection
Bayesian information criterion:
$\mathrm{BIC}_k = 2 \log p(D \mid \hat{\theta}_k, M_k) - \nu_k \log N$
The first term is the measure of fit, the second the penalty term. A large BIC score indicates strong evidence for the corresponding model. BIC can be used to choose both the number of clusters and the covariance parametrization (Mclust).
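NB: the EMclust() calls on the next slide reflect the mclust interface of the time; in current mclust releases the same comparison is exposed as mclustBIC() and Mclust() (an assumption about the installed version). A sketch, using cho.data where the slides use a standardized cho.data.std:

library(mclust)
bic <- mclustBIC(cho.data, G = 1:6, modelNames = c("EII", "EEI"))
plot(bic)      # BIC traces over the number of clusters, one per model
summary(bic)   # best (model, number of clusters) combinations by BIC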
example revisited
install.packages("mclust")
library(mclust)
cho.mclust.bic <- EMclust(cho.data.std, modelNames=c("EII","EEI"))
plot(cho.mclust.bic)   # BIC curves: 1 = EII, 2 = EEI
cho.mclust <- EMclust(cho.data.std, 4, "EII")
sum.cho <- summary(cho.mclust, cho.data.std)
plot(cho.mclust)
example revisited
par(mfrow=c(2,2))
matplot(t(cho.data[sum.cho$classification==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[sum.cho$classification==2,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[sum.cho$classification==3,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[sum.cho$classification==4,]), type="l", xlab="time", ylab="log expression value")
example revisited
[Figure: EII, 4 clusters]
example revisited
cho.mclust <- EMclust(cho.data.std, 3, "EEI")
sum.cho <- summary(cho.mclust, cho.data.std)
par(mfrow=c(2,2))
matplot(t(cho.data[sum.cho$classification==1,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[sum.cho$classification==2,]), type="l", xlab="time", ylab="log expression value")
matplot(t(cho.data[sum.cho$classification==3,]), type="l", xlab="time", ylab="log expression value")
example revisited
[Figure: EEI, 3 clusters]
summary
Model based clustering is an attractive alternative to heuristic clustering algorithms. BIC can be used for choosing the covariance structure and the number of clusters.
best method
What is the best clustering method?
best method
What is the best clustering method? That depends on what you want to achieve. And sometimes clustering is not the best approach to begin with.
density estimation
Clustering is a partition method. Consider:
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))
x2 <- array(c(rnorm(30, 7, 0.7)), c(15, 2))
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim=xrange, ylim=yrange, col="black", xlab="x", ylab="y")
par(new=TRUE)
plot(x2, xlim=xrange, ylim=yrange, col="red", axes=FALSE, xlab="", ylab="")
density estimation
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))
x2 <- array(c(rnorm(30, 5, 0.7)), c(15, 2))
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim=xrange, ylim=yrange, col="black", xlab="x", ylab="y")
par(new=TRUE)
plot(x2, xlim=xrange, ylim=yrange, col="red", axes=FALSE, xlab="", ylab="")
(univariate) density estimation
(From the density() documentation...)
> length(faithful$eruptions)
[1] 272
> head(faithful$eruptions, 8)
[1] 3.600 1.800 3.333 2.283 4.533 2.883 4.700 3.600
> hist(faithful$eruptions, col=rgb(0.9,0.9,0.9), main="")

(univariate) density estimation
> par(new=TRUE)
> plot(density(faithful$eruptions, bw = "sj"), main="", xlab="", ylab="", axes=FALSE, col="red", lwd=3)
structure space
[Figure: Ramachandran plot for alanine]
Averaged structures may be meaningless! Consequences for clustering...
Issues: metric; embedding; discrete or continuous.
density of structure space
Estimate density from discrete samples: place a Gaussian kernel on each discrete observation along a similarity axis (a 1-D projection is not required) and sum the kernels:
$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_\sigma(x - x_i)$
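A sketch of exactly this sum on univariate data (density() computes the same thing efficiently; sigma is the kernel bandwidth):

kde <- function(xs, obs, sigma) {
  # at each evaluation point, average the Gaussian kernels centred on the observations
  sapply(xs, function(x) mean(dnorm(x, mean = obs, sd = sigma)))
}
obs <- faithful$eruptions
xs  <- seq(min(obs) - 1, max(obs) + 1, length.out = 200)
plot(xs, kde(xs, obs, sigma = 0.2), type = "l",
     xlab = "eruption time", ylab = "estimated density")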
density
Issue: what is the best choice of the bandwidth parameter $\sigma$? Globally constant? Locally adaptive?
local maxima
Procedure (an R sketch follows below):
begin at a fragment c
for all fragments f within the neighbourhood of c: calculate density(f)
    if density(f) > maximum density so far: m = f
if m == c: return fragment c
else: c = m, repeat
Issue: definition of the "neighbourhood" parameter.
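A sketch of this procedure, taking the neighbourhood to be a fixed radius in a precomputed distance matrix (an assumed choice; the slide leaves the definition open):

climb <- function(start, D, dens, radius) {
  # D: distance matrix between fragments; dens: precomputed density per fragment
  cur <- start
  repeat {
    nb <- which(D[cur, ] <= radius)   # all fragments in the neighbourhood of cur
    m  <- nb[which.max(dens[nb])]     # the neighbour with the highest density
    if (m == cur) return(cur)         # cur is a local density maximum
    cur <- m                          # otherwise move uphill and repeat
  }
}
# usage sketch: D <- as.matrix(dist(fragments)); climb(1, D, dens, radius = 0.5)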
example: complex loop
Motif: 1icf 215; Length: 7; Support: 7; Rank: 399
density estimation
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))
x2 <- array(c(rnorm(30, 7, 0.7)), c(15, 2))
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim=xrange, ylim=yrange, col="black", xlab="x", ylab="y")
par(new=TRUE)
plot(x2, xlim=xrange, ylim=yrange, col="red", axes=FALSE, xlab="", ylab="")
x3 <- rbind(x1, x2)
par(new=TRUE)
plot(density(x3[,2], bw = "sj"), main="", xlab="", ylab="", axes=FALSE, col="blue", lwd=3)
conclusion
This is just scratching the surface of clustering algorithms; there are many others:
– Two-way clustering
– The plaid model...
If there are many, probably none is good for all situations. Clustering is a useful tool and... a dangerous weapon. To be consumed with moderation!
to ponder
"A twentieth century fluid dynamicist could hardly expect to advance knowledge in his field without first adopting a body of terminology and mathematical technique. In return, unconsciously, he would give up much freedom to question the foundations of his science."
Thomas S. Kuhn
http://bioinformatics.ca
boris.steipe@utoronto.ca
nuit blanche: tonight, one night, all night