Canadian Bioinformatics Workshops www.bioinformatics.ca
Lecture 7: ML & Data Visualization & Microarrays
MBP1010, Winter 2015
Dr. Paul C. Boutros
Department of Medical Biophysics
This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others.†
† Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)
Course Overview
Lecture 1: What is Statistics? Introduction to R
Lecture 2: Univariate Analyses I: continuous
Lecture 3: Univariate Analyses II: discrete
Lecture 4: Multivariate Analyses I: specialized models
Lecture 5: Multivariate Analyses II: general models
Lecture 6: Machine-Learning
Lecture 7: Microarray Analysis I: Pre-Processing
Lecture 8: Microarray Analysis II: Multiple-Testing
Lecture 9: Sequence Analysis
Final Exam (written)
House Rules
Cell phones to silent
No side conversations
Hands up for questions
Topics For This Week
Machine-learning 101 (Briefly)
Data visualization 101
Attendance
Microarrays 101
Example: cell cycle data

# read the cell cycle expression data (first 50 genes, 17 time points)
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])
D.cho <- dist(cho.data, method = "euclidean")                     # Euclidean distance matrix
hc.single <- hclust(D.cho, method = "single", members = NULL)     # single-linkage hierarchical clustering
Example: cell cycle data (single linkage)
plot(hc.single)
Example: cell cycle data
Careful with the interpretation of dendrograms: elements drawn next to each other acquire an apparent proximity that need not reflect their actual distance (cf. genes #1 and #47).
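One way to make that caveat concrete, assuming the D.cho and hc.single objects from the earlier slides, is to compare the input distances with the cophenetic distances, i.e. the merge heights that the dendrogram actually encodes:

coph <- cophenetic(hc.single)              # distances implied by the dendrogram
cor(as.vector(D.cho), as.vector(coph))     # how faithfully the tree preserves the input distances
as.matrix(D.cho)[1, 47]                    # original distance between genes #1 and #47
as.matrix(coph)[1, 47]                     # the height at which the dendrogram merges them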
Example: cell cycle data (single linkage, k=2)
rect.hclust(hc.single, k = 2)
Example: cell cycle data (single linkage, k=3)
rect.hclust(hc.single, k = 3)
Example: cell cycle data (single linkage, k=4)
rect.hclust(hc.single, k = 4)
Example: cell cycle data (single linkage, k=5)
rect.hclust(hc.single, k = 5)
Example: cell cycle data (single linkage, k=25)
rect.hclust(hc.single, k = 25)
Example: cell cycle data (properties of cluster members, single linkage, k=4)

class.single <- cutree(hc.single, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.single == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
# as.matrix() rather than t() keeps a one-member cluster plotted as a single time course
matplot(as.matrix(cho.data[class.single == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 4, ]), type = "l", xlab = "time", ylab = "log expression value")
Example: cell cycle data (single linkage, k=4; expression profiles of clusters 1-4)
Example: cell cycle data (complete linkage, k=4 vs. single linkage, k=4; cluster profiles compared)
Hierarchical clustering analyzed

Advantages:
There may be small clusters nested inside large ones
No need to specify the number of groups ahead of time
Flexible linkage methods

Disadvantages:
Clusters might not be naturally represented by a hierarchical structure
It is necessary to ‘cut’ the dendrogram in order to produce clusters (see the cutree() example below)
Bottom-up clustering can result in poor structure at the top of the tree; early joins cannot be ‘undone’
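In R the cut is a separate step after hclust(); a small illustration using the hc.single tree from the earlier slides (the height of 2 is an arbitrary example value, not from the slides):

table(cutree(hc.single, k = 4))   # cut the tree into a fixed number of clusters
table(cutree(hc.single, h = 2))   # or cut at a fixed merge height instead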
Partitioning methods
Anatomy of a partitioning-based method
Input: data matrix, distance function, number of groups
Output: group assignment of every object
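As one concrete mapping of that anatomy onto code, pam() from the cluster package (shown here only as an illustration of a partitioning method) takes a distance matrix and a number of groups and returns a group assignment:

library(cluster)
cl <- pam(D.cho, k = 4)   # D.cho: the distance matrix computed on the earlier slides
cl$clustering             # group assignment of every object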
Partitioning-based methods
Choose K groups
Initialise the group centres (aka centroids or medoids)
Assign each object to the nearest centroid according to the distance metric
Reassign (or recompute) the centroids
Repeat the last 2 steps until the assignment stabilizes
K-means vs. K-medoids

K-means (kmeans):
Centroids are the ‘mean’ of the clusters
Centroids need to be recomputed every iteration
Initialisation is difficult, as the notion of a centroid may be unclear before beginning

K-medoids (pam):
Centroids are an actual object that minimizes the total within-cluster distance
The centroid can be determined by a quick look-up into the distance matrix
Initialisation is simply K randomly selected objects
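A small illustration of the difference, assuming the cho.data matrix from earlier (kmeans() ships with R, pam() is in the cluster package): the k-means centres are synthetic averages, while the medoids are actual rows of the data.

library(cluster)
km <- kmeans(cho.data, centers = 4, nstart = 10)
pm <- pam(cho.data, k = 4)
km$centers    # four synthetic profiles: the cluster means
pm$id.med     # row indices of the four medoids: actual genes in the data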
Partitioning-based methods

Advantages:
The number of groups is well defined
A clear, deterministic assignment of an object to a group
Simple algorithms for inference

Disadvantages:
Have to choose the number of groups
Sometimes objects do not fit well to any cluster
Can converge on locally optimal solutions and often require multiple restarts with random initializations (see the nstart example below)
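In kmeans() the multiple-restart strategy is exposed as the nstart argument; a minimal sketch, assuming cho.data from the earlier slides:

km1  <- kmeans(cho.data, centers = 4, nstart = 1)    # a single random initialisation
km25 <- kmeans(cho.data, centers = 4, nstart = 25)   # best of 25 random initialisations
c(single_start = km1$tot.withinss, best_of_25 = km25$tot.withinss)   # lower is better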
K-means
N items, assume K clusters. The goal is to minimize the within-cluster sum of squares

$\sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2$

over the possible assignments $C_1, \dots, C_K$ and centroids $\mu_1, \dots, \mu_K$, where $\mu_k$ represents the location of the $k$-th cluster.
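This objective is what kmeans() reports as tot.withinss; a quick check, assuming cho.data from the earlier slides, recomputes it by hand:

km <- kmeans(cho.data, centers = 4, nstart = 10)
# sum, over clusters, of squared distances from each member to its cluster centroid
wss <- sum(sapply(1:4, function(k) {
    members <- cho.data[km$cluster == k, , drop = FALSE]
    sum(sweep(members, 2, km$centers[k, ])^2)
}))
wss                # matches km$tot.withinss up to numerical precision
km$tot.withinss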
K-means
1. Divide the data into K clusters; initialize the centroids with the mean of each cluster
2. Assign each item to the cluster with the closest centroid
3. When all objects have been assigned, recalculate the centroids (means)
4. Repeat steps 2-3 until the centroids no longer move
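A from-scratch sketch of those four steps, written only to make the loop explicit (in practice use kmeans()); it assumes a numeric matrix x, uses Euclidean distance, and does not handle the corner case of a cluster becoming empty:

simple_kmeans <- function(x, K, max.iter = 100) {
    # step 1: a random balanced assignment; the centroids are then the cluster means
    assign <- sample(rep(1:K, length.out = nrow(x)))
    for (iter in 1:max.iter) {
        centroids <- t(sapply(1:K, function(k) colMeans(x[assign == k, , drop = FALSE])))
        # step 2: assign each item to the cluster with the closest centroid
        d <- as.matrix(dist(rbind(centroids, x)))[-(1:K), 1:K]   # item-to-centroid distances
        new.assign <- max.col(-d)                                # column index of the minimum per row
        # step 4: stop when the assignment (and hence the centroids) no longer changes
        if (all(new.assign == assign)) break
        assign <- new.assign                                     # step 3: centroids are recomputed above
    }
    list(cluster = assign, centers = centroids)
}
simple_kmeans(cho.data, 4)$cluster   # e.g. on the cell cycle data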
K-means

# simulate two clusters of 2-d points
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# run k-means (k = 5) from the same random start, stopped after 1, 2 and 3 iterations
set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = 1)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)

set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = 2)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)

set.seed(100); cl <- NULL
cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = 3)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)
K-means, k=4 (cell cycle data)

set.seed(100)
km.cho <- kmeans(cho.data, 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l", xlab = "time", ylab = "log expression value")
K-means, k=4 vs. single linkage, k=4 (cluster profiles compared)
Summary
K-means and hierarchical clustering methods are simple, fast and useful techniques
Beware of the memory requirements of hierarchical clustering (a rough estimate follows below)
Both are a bit “ad hoc”: Number of clusters? Distance metric? Good clustering?
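To put the memory caveat in rough numbers (the 50,000 here is a hypothetical array size, not from the slides): dist() stores one double per pair of objects, so the requirement grows quadratically.

n <- 50000                    # hypothetical number of genes/probes to cluster
n_pairs <- n * (n - 1) / 2    # dist() stores one value per pair
n_pairs * 8 / 1024^3          # roughly 9.3 GB for the distance matrix alone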
Meta-Analysis
Combining the results of multiple studies that address related hypotheses
Often used to merge data from different microarray platforms
Very challenging: it is unclear what the best approaches are, or how they should be adapted to the peculiarities of microarray data
Why Do Meta-Analysis?
Can identify publication biases
Appropriately weights diverse studies (one standard weighting scheme is sketched below):
  sample size
  experimental reliability
  similarity of the study-specific hypotheses to the overall one
Increases statistical power
Reduces information: a single meta-analysis vs. five large studies
Provides clearer guidance
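As an illustration of how weighting by study reliability can work, here is a minimal sketch of the standard fixed-effect (inverse-variance) approach; this particular scheme is not taken from the slides, and the effect sizes and standard errors below are made up:

effect <- c(0.80, 0.55, 1.20)   # hypothetical per-study effect sizes (e.g. log fold-changes)
se     <- c(0.40, 0.10, 0.60)   # hypothetical per-study standard errors

w <- 1 / se^2                            # inverse-variance weights: precise studies count more
pooled    <- sum(w * effect) / sum(w)    # pooled effect estimate
pooled_se <- sqrt(1 / sum(w))            # standard error of the pooled estimate
c(estimate = pooled, se = pooled_se)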
Challenges of Meta-Analysis
No control for bias: what happens if most studies are poorly designed?
File-drawer problem: publication bias can be detected, but not explicitly controlled for
How homogeneous is the data? Can it be fairly grouped?
Simpson’s Paradox
Simpson’s Paradox
Group-wise correlations can be inverted when the groups are merged.
A cautionary note for all meta-analyses!
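A toy illustration of the effect, with simulated data (none of these numbers come from the slides): within each group x and y are positively correlated, but the pooled correlation flips sign because the group means run in the opposite direction.

set.seed(1)
x1 <- rnorm(50);       y1 <- x1 + rnorm(50, sd = 0.5) + 5   # group 1
x2 <- rnorm(50) + 5;   y2 <- x2 + rnorm(50, sd = 0.5) - 5   # group 2, shifted the other way
cor(x1, y1)                   # positive within group 1
cor(x2, y2)                   # positive within group 2
cor(c(x1, x2), c(y1, y2))     # negative once the groups are pooled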
Topics For This Week
Machine-learning 101 (Focus: Unsupervised)
Data visualization 101
Topics For This Week
Machine-learning 101 (Briefly)
Data visualization 101
Attendance
Microarrays 101