Canadian Bioinformatics Workshops www.bioinformatics.ca
Lecture 7: ML & Data Visualization & Microarrays
MBP1010, Winter 2015
Dr. Paul C. Boutros
Department of Medical Biophysics
This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others.†
† Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)
Course Overview
Lecture 1: What is Statistics? Introduction to R
Lecture 2: Univariate Analyses I: continuous
Lecture 3: Univariate Analyses II: discrete
Lecture 4: Multivariate Analyses I: specialized models
Lecture 5: Multivariate Analyses II: general models
Lecture 6: Machine-Learning
Lecture 7: Microarray Analysis I: Pre-Processing
Lecture 8: Microarray Analysis II: Multiple-Testing
Lecture 9: Sequence Analysis
Final Exam (written)
House Rules
Cell phones to silent
No side conversations
Hands up for questions
Topics For This Week
Machine-learning 101 (Briefly)
Data visualization 101
Attendance
Microarrays 101
Example: cell cycle data

# read the cell cycle expression data (first 50 genes, 17 time points)
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])
D.cho <- dist(cho.data, method = "euclidean")                     # Euclidean distance matrix
hc.single <- hclust(D.cho, method = "single", members = NULL)     # single-linkage hierarchical clustering
Example: cell cycle data (single linkage)
plot(hc.single)
Example: cell cycle data
Careful with the interpretation of dendrograms: elements drawn next to each other acquire an apparent proximity that need not reflect their actual distance (cf. genes #1 and #47).
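One way to make that caveat concrete, assuming the D.cho and hc.single objects from the earlier slides, is to compare the input distances with the cophenetic distances, i.e. the merge heights that the dendrogram actually encodes:

coph <- cophenetic(hc.single)              # distances implied by the dendrogram
cor(as.vector(D.cho), as.vector(coph))     # how faithfully the tree preserves the input distances
as.matrix(D.cho)[1, 47]                    # original distance between genes #1 and #47
as.matrix(coph)[1, 47]                     # the height at which the dendrogram merges them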
Example: cell cycle data (single linkage, k=2)
rect.hclust(hc.single, k = 2)
Example: cell cycle data (single linkage, k=3)
rect.hclust(hc.single, k = 3)
Example: cell cycle data (single linkage, k=4)
rect.hclust(hc.single, k = 4)
Example: cell cycle data (single linkage, k=5)
rect.hclust(hc.single, k = 5)
Example: cell cycle data (single linkage, k=25)
rect.hclust(hc.single, k = 25)
Example: cell cycle data (properties of cluster members, single linkage, k=4)

class.single <- cutree(hc.single, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.single == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
# as.matrix() rather than t() keeps a one-member cluster plotted as a single time course
matplot(as.matrix(cho.data[class.single == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 4, ]), type = "l", xlab = "time", ylab = "log expression value")
Example: cell cycle data (single linkage, k=4; expression profiles of clusters 1-4)
Example: cell cycle data (complete linkage, k=4 vs. single linkage, k=4; cluster profiles compared)
Hierarchical clustering analyzed

Advantages:
There may be small clusters nested inside large ones
No need to specify the number of groups ahead of time
Flexible linkage methods

Disadvantages:
Clusters might not be naturally represented by a hierarchical structure
It is necessary to ‘cut’ the dendrogram in order to produce clusters (see the cutree() example below)
Bottom-up clustering can result in poor structure at the top of the tree; early joins cannot be ‘undone’
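In R the cut is a separate step after hclust(); a small illustration using the hc.single tree from the earlier slides (the height of 2 is an arbitrary example value, not from the slides):

table(cutree(hc.single, k = 4))   # cut the tree into a fixed number of clusters
table(cutree(hc.single, h = 2))   # or cut at a fixed merge height instead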
Partitioning methods
Anatomy of a partitioning-based method
Input: data matrix, distance function, number of groups
Output: group assignment of every object
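As one concrete mapping of that anatomy onto code, pam() from the cluster package (shown here only as an illustration of a partitioning method) takes a distance matrix and a number of groups and returns a group assignment:

library(cluster)
cl <- pam(D.cho, k = 4)   # D.cho: the distance matrix computed on the earlier slides
cl$clustering             # group assignment of every object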
Partitioning-based methods
Choose K groups
Initialise the group centres (aka centroids or medoids)
Assign each object to the nearest centroid according to the distance metric
Reassign (or recompute) the centroids
Repeat the last 2 steps until the assignment stabilizes
K-means vs. K-medoids

K-means (kmeans):
Centroids are the ‘mean’ of the clusters
Centroids need to be recomputed every iteration
Initialisation is difficult, as the notion of a centroid may be unclear before beginning

K-medoids (pam):
Centroids are an actual object that minimizes the total within-cluster distance
The centroid can be determined by a quick look-up into the distance matrix
Initialisation is simply K randomly selected objects
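A small illustration of the difference, assuming the cho.data matrix from earlier (kmeans() ships with R, pam() is in the cluster package): the k-means centres are synthetic averages, while the medoids are actual rows of the data.

library(cluster)
km <- kmeans(cho.data, centers = 4, nstart = 10)
pm <- pam(cho.data, k = 4)
km$centers    # four synthetic profiles: the cluster means
pm$id.med     # row indices of the four medoids: actual genes in the data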
Partitioning-based methods

Advantages:
The number of groups is well defined
A clear, deterministic assignment of an object to a group
Simple algorithms for inference

Disadvantages:
Have to choose the number of groups
Sometimes objects do not fit well to any cluster
Can converge on locally optimal solutions and often require multiple restarts with random initializations (see the nstart example below)
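In kmeans() the multiple-restart strategy is exposed as the nstart argument; a minimal sketch, assuming cho.data from the earlier slides:

km1  <- kmeans(cho.data, centers = 4, nstart = 1)    # a single random initialisation
km25 <- kmeans(cho.data, centers = 4, nstart = 25)   # best of 25 random initialisations
c(single_start = km1$tot.withinss, best_of_25 = km25$tot.withinss)   # lower is better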
K-means
N items, assume K clusters. The goal is to minimize the within-cluster sum of squares

$\sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2$

over the possible assignments $C_1, \dots, C_K$ and centroids $\mu_1, \dots, \mu_K$, where $\mu_k$ represents the location of the $k$-th cluster.
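This objective is what kmeans() reports as tot.withinss; a quick check, assuming cho.data from the earlier slides, recomputes it by hand:

km <- kmeans(cho.data, centers = 4, nstart = 10)
# sum, over clusters, of squared distances from each member to its cluster centroid
wss <- sum(sapply(1:4, function(k) {
    members <- cho.data[km$cluster == k, , drop = FALSE]
    sum(sweep(members, 2, km$centers[k, ])^2)
}))
wss                # matches km$tot.withinss up to numerical precision
km$tot.withinss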
K-means
1. Divide the data into K clusters; initialize the centroids with the mean of each cluster
2. Assign each item to the cluster with the closest centroid
3. When all objects have been assigned, recalculate the centroids (means)
4. Repeat steps 2-3 until the centroids no longer move
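A from-scratch sketch of those four steps, written only to make the loop explicit (in practice use kmeans()); it assumes a numeric matrix x, uses Euclidean distance, and does not handle the corner case of a cluster becoming empty:

simple_kmeans <- function(x, K, max.iter = 100) {
    # step 1: a random balanced assignment; the centroids are then the cluster means
    assign <- sample(rep(1:K, length.out = nrow(x)))
    for (iter in 1:max.iter) {
        centroids <- t(sapply(1:K, function(k) colMeans(x[assign == k, , drop = FALSE])))
        # step 2: assign each item to the cluster with the closest centroid
        d <- as.matrix(dist(rbind(centroids, x)))[-(1:K), 1:K]   # item-to-centroid distances
        new.assign <- max.col(-d)                                # column index of the minimum per row
        # step 4: stop when the assignment (and hence the centroids) no longer changes
        if (all(new.assign == assign)) break
        assign <- new.assign                                     # step 3: centroids are recomputed above
    }
    list(cluster = assign, centers = centroids)
}
simple_kmeans(cho.data, 4)$cluster   # e.g. on the cell cycle data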
K-means

# simulate two clusters of 2-d points
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# run k-means (k = 5) from the same random start, stopped after 1, 2 and 3 iterations
set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = 1)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)

set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = 2)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)

set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = 3)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)
K-means, k=4 (cell cycle data)

set.seed(100)
km.cho <- kmeans(cho.data, 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l", xlab = "time", ylab = "log expression value")
K-means, k=4 vs. single linkage, k=4 (cluster profiles compared)
Summary
K-means and hierarchical clustering methods are simple, fast and useful techniques
Beware of the memory requirements of hierarchical clustering (a rough estimate follows below)
Both are a bit “ad hoc”: Number of clusters? Distance metric? Good clustering?
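To put the memory caveat in rough numbers (the 50,000 here is a hypothetical array size, not from the slides): dist() stores one double per pair of objects, so the requirement grows quadratically.

n <- 50000                    # hypothetical number of genes/probes to cluster
n_pairs <- n * (n - 1) / 2    # dist() stores one value per pair
n_pairs * 8 / 1024^3          # roughly 9.3 GB for the distance matrix alone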
Meta-Analysis
Combining the results of multiple studies that address related hypotheses
Often used to merge data from different microarray platforms
Very challenging: it is unclear what the best approaches are, or how they should be adapted to the peculiarities of microarray data
Why Do Meta-Analysis?
Can identify publication biases
Appropriately weights diverse studies (one standard weighting scheme is sketched below):
  sample size
  experimental reliability
  similarity of the study-specific hypotheses to the overall one
Increases statistical power
Reduces information: a single meta-analysis vs. five large studies
Provides clearer guidance
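As an illustration of how weighting by study reliability can work, here is a minimal sketch of the standard fixed-effect (inverse-variance) approach; this particular scheme is not taken from the slides, and the effect sizes and standard errors below are made up:

effect <- c(0.80, 0.55, 1.20)   # hypothetical per-study effect sizes (e.g. log fold-changes)
se     <- c(0.40, 0.10, 0.60)   # hypothetical per-study standard errors

w <- 1 / se^2                            # inverse-variance weights: precise studies count more
pooled    <- sum(w * effect) / sum(w)    # pooled effect estimate
pooled_se <- sqrt(1 / sum(w))            # standard error of the pooled estimate
c(estimate = pooled, se = pooled_se)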
Challenges of Meta-Analysis
No control for bias: what happens if most studies are poorly designed?
File-drawer problem: publication bias can be detected, but not explicitly controlled for
How homogeneous is the data? Can it be fairly grouped?
Simpson’s Paradox
Simpson’s Paradox
Group-wise correlations can be inverted when the groups are merged.
A cautionary note for all meta-analyses!
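A toy illustration of the effect, with simulated data (none of these numbers come from the slides): within each group x and y are positively correlated, but the pooled correlation flips sign because the group means run in the opposite direction.

set.seed(1)
x1 <- rnorm(50);       y1 <- x1 + rnorm(50, sd = 0.5) + 5   # group 1
x2 <- rnorm(50) + 5;   y2 <- x2 + rnorm(50, sd = 0.5) - 5   # group 2, shifted the other way
cor(x1, y1)                   # positive within group 1
cor(x2, y2)                   # positive within group 2
cor(c(x1, x2), c(y1, y2))     # negative once the groups are pooled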
Topics For This Week
Machine-learning 101 (Focus: Unsupervised)
Data visualization 101
Topics For This Week
Machine-learning 101 (Briefly)
Data visualization 101
Attendance
Microarrays 101