Canadian Bioinformatics Workshops www.bioinformatics.ca.

Lecture 7: ML & Data Visualization & Microarrays
MBP1010, Dr. Paul C. Boutros, Winter 2015
Department of Medical Biophysics
This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others
† Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)

Lecture 6: Machine Learning & Data Visualization (bioinformatics.ca)

Course Overview
Lecture 1: What is Statistics? Introduction to R
Lecture 2: Univariate Analyses I: continuous
Lecture 3: Univariate Analyses II: discrete
Lecture 4: Multivariate Analyses I: specialized models
Lecture 5: Multivariate Analyses II: general models
Lecture 6: Machine-Learning
Lecture 7: Microarray Analysis I: Pre-Processing
Lecture 8: Microarray Analysis II: Multiple-Testing
Lecture 9: Sequence Analysis
Final Exam (written)

House Rules
Cell phones to silent
No side conversations
Hands up for questions

Topics For This Week
Machine-learning 101 (Briefly)
Data visualization 101
Attendance
Microarrays 101

Example: cell cycle data

cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])
D.cho <- dist(cho.data, method = "euclidean")
hc.single <- hclust(D.cho, method = "single", members = NULL)

Example: cell cycle data (single linkage)

plot(hc.single)

Example: cell cycle data

Be careful when interpreting dendrograms: adjacency of leaves suggests a proximity between elements that need not reflect their actual distance (cf. genes #1 and #47).
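The caution above can be made quantitative with the cophenetic correlation: the correlation between the original distances and the heights at which pairs of leaves first merge in the dendrogram. A sketch on simulated data (the lecture's cho.data, once loaded, would work the same way):

```r
# How faithfully does a dendrogram preserve the original distances?
set.seed(1)
x <- matrix(rnorm(50 * 17), nrow = 50)   # 50 genes x 17 time points (simulated)
d <- dist(x, method = "euclidean")
hc <- hclust(d, method = "single")

# Cophenetic distance = merge height of each pair of leaves.
# Its correlation with the original distances measures dendrogram fidelity;
# single linkage often scores relatively poorly, which is this slide's caution.
coph <- cophenetic(hc)
cor(coph, d)
```

A correlation near 1 would mean the tree reproduces the pairwise distances well; values well below 1 mean visual proximity in the dendrogram should not be over-interpreted.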

Example: cell cycle data (single linkage, k=2)

rect.hclust(hc.single, k = 2)

Example: cell cycle data (single linkage, k=3)

rect.hclust(hc.single, k = 3)

Example: cell cycle data (single linkage, k=4)

rect.hclust(hc.single, k = 4)

Example: cell cycle data (single linkage, k=5)

rect.hclust(hc.single, k = 5)

Example: cell cycle data (single linkage, k=25)

rect.hclust(hc.single, k = 25)

Example: cell cycle data, properties of cluster members (single linkage, k=4)

class.single <- cutree(hc.single, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.single == 1, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l",
        xlab = "time", ylab = "log expression value")
# as.matrix() rather than t() here, likely because a single-member cluster
# is dropped to a vector, which t() would turn into 17 one-point series
matplot(as.matrix(cho.data[class.single == 3, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 4, ]), type = "l",
        xlab = "time", ylab = "log expression value")

Example: cell cycle data (single linkage, k=4; cluster profile plots)

Example: cell cycle data (complete linkage, k=4 vs. single linkage, k=4)

Hierarchical clustering analyzed

Advantages:
- There may be small clusters nested inside large ones
- No need to specify the number of groups ahead of time
- Flexible linkage methods

Disadvantages:
- Clusters might not be naturally represented by a hierarchical structure
- It is necessary to 'cut' the dendrogram in order to produce clusters
- Bottom-up clustering can result in poor structure at the top of the tree: early joins cannot be 'undone'

Partitioning methods

Anatomy of a partitioning-based method:
Input: data matrix, distance function, number of groups
Output: group assignment of every object

Partitioning-based methods

1. Choose K groups
2. Initialise group centers (aka centroids or medoids)
3. Assign each object to the nearest centroid according to the distance metric
4. Reassign (or recompute) centroids
5. Repeat the last two steps until the assignment stabilizes

K-means vs. K-medoids

K-means:
- Centroids are the 'mean' of the clusters
- Centroids need to be recomputed every iteration
- Initialisation is difficult, as the notion of a centroid may be unclear before beginning
- R function: kmeans

K-medoids:
- Centroids are actual objects that minimize the total within-cluster distance
- Centroids can be determined by a quick look-up into the distance matrix
- Initialisation is simply K randomly selected objects
- R function: pam
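The two functions named above can be run side by side. A sketch on the same simulated two-blob data used later in this lecture (pam() comes from the recommended 'cluster' package):

```r
library(cluster)  # provides pam() for K-medoids

set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

km <- kmeans(x, centers = 2)  # centroids are cluster means
pm <- pam(x, k = 2)           # medoids are actual observations

km$centers   # means: need not coincide with any observation
pm$medoids   # rows taken directly from x
```

The contrast in the table is visible in the output: km$centers are arbitrary points in the plane, while pm$medoids are rows of x itself.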

Partitioning-based methods

Advantages:
- The number of groups is well defined
- A clear, deterministic assignment of each object to a group
- Simple algorithms for inference

Disadvantages:
- You have to choose the number of groups
- Sometimes objects do not fit well into any cluster
- Can converge to locally optimal solutions, and often requires multiple restarts with random initializations

K-means

N items, assume K clusters. The goal is to minimize

\sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - m_k \rVert^2

over the possible assignments and centroids, where m_k represents the location of cluster k.
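The objective above is exactly what kmeans() reports as tot.withinss, which can be verified directly. A sketch on the simulated two-blob data used elsewhere in this lecture:

```r
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 10)

# sum_k sum_{i in C_k} ||x_i - m_k||^2, computed by hand:
# km$centers[km$cluster, ] expands each point's assigned centroid
obj <- sum((x - km$centers[km$cluster, ])^2)
all.equal(obj, km$tot.withinss)   # TRUE
```

Seeing the hand-computed sum match tot.withinss confirms that K-means is minimizing the within-cluster sum of squared distances to the centroids.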

K-means

1. Divide the data into K clusters; initialize the centroids with the mean of each cluster
2. Assign each item to the cluster with the closest centroid
3. When all objects have been assigned, recalculate the centroids (means)
4. Repeat steps 2-3 until the centroids no longer move
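The four steps above can be sketched from scratch in a few lines of R. This is teaching code only (in practice use stats::kmeans); it assumes Euclidean distance, and my_kmeans is a hypothetical helper name:

```r
my_kmeans <- function(x, k, max.iter = 100) {
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # step 1: random initial centroids
  cl <- rep(0L, nrow(x))
  for (iter in seq_len(max.iter)) {
    # step 2: squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) colSums((t(x) - centers[j, ])^2))
    new.cl <- apply(d, 1, which.min)
    if (all(new.cl == cl)) break                    # step 4: assignments stable
    cl <- new.cl
    # step 3: recompute each centroid as the mean of its members
    # (an empty cluster would yield NaN centroids in this simple sketch)
    for (j in seq_len(k))
      centers[j, ] <- colMeans(x[cl == j, , drop = FALSE])
  }
  list(cluster = cl, centers = centers)
}

set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
fit <- my_kmeans(x, k = 2)
table(fit$cluster)
```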

K-means

set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Run one, two, then three iterations from the same starting centers
# to watch the centroids move
for (n.iter in 1:3) {
  set.seed(100)
  cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = n.iter)
  plot(x, col = cl$cluster)
  points(cl$centers, col = 1:5, pch = 8, cex = 2)
}


K-means, k=4 (cell cycle data)

set.seed(100)
km.cho <- kmeans(cho.data, 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l",
        xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l",
        xlab = "time", ylab = "log expression value")

K-means, k=4 vs. single linkage, k=4 (cluster profile comparison)
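A quick way to compare two clusterings like these is to cross-tabulate their labels. A sketch on simulated data (with the lecture's data you would cross class.single against km.cho$cluster):

```r
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

hc <- cutree(hclust(dist(x), method = "single"), k = 2)
km <- kmeans(x, centers = 2)$cluster

# Rows: hierarchical labels; columns: k-means labels.
# Large off-diagonal counts (after relabelling) mean the methods disagree.
table(hierarchical = hc, kmeans = km)
```

Note that cluster labels are arbitrary (hierarchical cluster 1 may correspond to k-means cluster 2), so agreement shows up as one large count per row, not necessarily on the diagonal.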

Summary

K-means and hierarchical clustering are simple, fast and useful techniques.
Beware of the memory requirements of hierarchical clustering.
Both are a bit "ad hoc": How many clusters? Which distance metric? What makes a clustering good?
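One common, if imperfect, answer to the "how many clusters?" question is the average silhouette width (cluster::silhouette): run the clustering for several values of k and prefer the k with the highest average width. A sketch on simulated data:

```r
library(cluster)  # provides silhouette()

set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
d <- dist(x)

# Average silhouette width for k = 2..6; higher is better
avg.sil <- sapply(2:6, function(k) {
  cl <- kmeans(x, centers = k, nstart = 10)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg.sil) <- 2:6
avg.sil   # typically peaks at k = 2 for these two simulated blobs
```

This is a heuristic, not a definitive answer; the "ad hoc" caveat in the summary still applies.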

Meta-Analysis

Combining the results of multiple studies that address related hypotheses.
Often used to merge data from different microarray platforms.
Very challenging: it is unclear what the best approaches are, or how they should be adapted to the peculiarities of microarray data.

Why Do Meta-Analysis?

Can identify publication biases.
Appropriately weights diverse studies by:
- sample size
- experimental reliability
- similarity of the study-specific hypotheses to the overall one
Increases statistical power.
Reduces information (a single meta-analysis vs. five large studies).
Provides clearer guidance.

Challenges of Meta-Analysis

No control for bias: what happens if most studies are poorly designed?
The file-drawer problem: publication bias can be detected, but not explicitly controlled for.
How homogeneous is the data? Can it be fairly grouped?
Simpson's Paradox.

Simpson's Paradox

Group-wise correlations can be inverted when the groups are merged. A cautionary note for all meta-analyses!
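A minimal numeric illustration of the paradox: within each of two groups, x and y are negatively correlated, yet pooling the groups flips the sign. The data are constructed purely for illustration:

```r
set.seed(1)
n <- 100
g1 <- data.frame(x = rnorm(n),           group = 1)
g2 <- data.frame(x = rnorm(n, mean = 4), group = 2)
g1$y <- -g1$x + rnorm(n, sd = 0.5)       # negative trend within group 1
g2$y <- -g2$x + 8 + rnorm(n, sd = 0.5)   # negative trend within group 2
both <- rbind(g1, g2)

cor(g1$x, g1$y)       # negative
cor(g2$x, g2$y)       # negative
cor(both$x, both$y)   # positive once the groups are merged
```

The between-group trend (group 2 sits up and to the right of group 1) overwhelms the within-group trends, which is exactly the danger of naively pooling heterogeneous studies in a meta-analysis.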

Topics For This Week
Machine-learning 101 (Focus: Unsupervised)
Data visualization 101

Topics For This Week
Machine-learning 101 (Briefly)
Data visualization 101
Attendance
Microarrays 101