Essential Statistics in Biology: Getting the Numbers Right
Raphael Gottardo, Clinical Research Institute of Montreal (IRCM)

Day 1 Outline
Exploratory data analysis
One- and two-sample t-tests, multiple testing
Clustering
SVD/PCA
Frequentists vs. Bayesians

Clustering (Multivariate analysis)

Section Outline
Basics of clustering
Hierarchical clustering
K-means
Model-based clustering

What is it?
Clustering is the grouping of similar objects: a data set is partitioned into subsets (clusters) so that the data in each subset are "close" to one another, typically according to some defined distance measure.
Examples: clustering web documents, clustering genes by expression profile.
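Since every method below starts from pairwise distances, here is a minimal sketch of computing a Euclidean distance matrix in R; the toy matrix x is purely illustrative.

# Toy data: 5 observations, 3 variables (illustrative only)
x <- matrix(rnorm(15), nrow = 5)
D <- dist(x, method = "euclidean")  # pairwise Euclidean distances
as.matrix(D)                        # view as a full 5 x 5 matrix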

Hierarchical clustering
Given N items and a distance metric:
1. Assign each item to its own cluster. Initialize the distance matrix between clusters as the distance between items.
2. Find the closest pair of clusters and merge them into a single cluster.
3. Compute new distances between clusters.
4. Repeat 2-3 until all items are merged into a single cluster.
A from-scratch sketch of this loop appears below.
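The following is a minimal from-scratch sketch of the agglomerative loop with single linkage, for illustration only; in practice use hclust(), as in the example further down. The function name and inputs are hypothetical.

naive_single_linkage <- function(x) {
  d <- as.matrix(dist(x))                 # distances between items
  clusters <- as.list(seq_len(nrow(x)))   # step 1: one cluster per item
  merges <- list()
  while (length(clusters) > 1) {
    # Step 2: find the closest pair of clusters (single linkage = shortest
    # distance between any member of one cluster and any member of the other)
    best <- c(1, 2); best_d <- Inf
    for (i in seq_along(clusters)) for (j in seq_along(clusters)) {
      if (i < j) {
        dij <- min(d[clusters[[i]], clusters[[j]]])
        if (dij < best_d) { best_d <- dij; best <- c(i, j) }
      }
    }
    # Step 3: merge the pair and record the merge height
    merged <- c(clusters[[best[1]]], clusters[[best[2]]])
    merges[[length(merges) + 1]] <- list(members = merged, height = best_d)
    clusters <- c(clusters[-best], list(merged))   # step 4: repeat
  }
  merges
}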

Single linkage
The distance between clusters is defined as the shortest distance from any member of one cluster to any member of the other cluster.
[Figure: clusters 1 and 2 joined by the shortest inter-cluster distance d]

Complete linkage
The distance between clusters is defined as the greatest distance from any member of one cluster to any member of the other cluster.
[Figure: clusters 1 and 2 joined by the greatest inter-cluster distance d]

Average linkage
The distance between clusters is defined as the average of all pairwise distances between members of one cluster and members of the other cluster.
[Figure: clusters 1 and 2; d = average of all pairwise distances]
In R, the linkage criterion is selected with the method argument of hclust(), as sketched below.
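A minimal sketch (the built-in USArrests data set is purely illustrative):

d <- dist(USArrests[1:10, ])                   # toy distance matrix
hc.single   <- hclust(d, method = "single")    # shortest pairwise distance
hc.complete <- hclust(d, method = "complete")  # greatest pairwise distance
hc.average  <- hclust(d, method = "average")   # mean of all pairwise distances
plot(hc.average)                               # dendrogram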

Example
Cell cycle data set (Cho et al. 1998): expression levels of ~6000 genes during the cell cycle, measured at 17 time points (2 cell cycles).

Example (R code)

# Read the first 50 genes (columns 3-19 hold the 17 time points)
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])
D.cho <- dist(cho.data, method = "euclidean")

# Single linkage
hc.single <- hclust(D.cho, method = "single", members = NULL)
plot(hc.single)
rect.hclust(hc.single, k = 4)
class.single <- cutree(hc.single, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.single == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
# as.matrix() rather than t() because this cluster has a single gene,
# so the subset drops to a vector
matplot(as.matrix(cho.data[class.single == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

# Complete linkage
hc.complete <- hclust(D.cho, method = "complete", members = NULL)
plot(hc.complete)
rect.hclust(hc.complete, k = 4)
class.complete <- cutree(hc.complete, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.complete == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

# Average linkage
hc.average <- hclust(D.cho, method = "average", members = NULL)
plot(hc.average)
rect.hclust(hc.average, k = 4)
class.average <- cutree(hc.average, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.average == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.average == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(as.matrix(cho.data[class.average == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.average == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

# K-means with 4 clusters (see the K-means slides below)
set.seed(100)
km.cho <- kmeans(cho.data, 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

Example
[Figures: single-linkage dendrogram and cluster profiles for k = 2, 3, 4, 5, and 25; complete-linkage cluster profiles for k = 4]

K-means
N items, assume K clusters. The goal is to minimize, over all possible assignments and centroids, the within-cluster sum of squares

$$\sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - m_k \rVert^2,$$

where $m_k$ represents the location (centroid) of cluster $k$ and $C_k$ is the set of items assigned to it.
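This objective is exactly what kmeans() in R reports as tot.withinss; a toy check (data and seed are illustrative):

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)        # 20 toy points in 2-D
cl <- kmeans(x, centers = 2)
# Recompute the objective by hand and compare with kmeans()'s value
obj <- sum(sapply(1:2, function(k)
  sum(sweep(x[cl$cluster == k, , drop = FALSE], 2, cl$centers[k, ])^2)))
all.equal(obj, cl$tot.withinss)         # TRUE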

K-means: algorithm
1. Divide the data into K initial clusters. Initialize the centroids with the means of these clusters.
2. Assign each item to the cluster with the closest centroid.
3. When all items have been assigned, recalculate the centroids (means).
4. Repeat 2-3 until the centroids no longer move.
A from-scratch sketch follows.
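A minimal from-scratch sketch of the loop above (Lloyd's algorithm), for illustration only; in practice use kmeans(). The function name and defaults are hypothetical, and empty clusters are not handled.

naive_kmeans <- function(x, K, n_iter = 100) {
  assign <- sample(rep_len(1:K, nrow(x)))   # step 1: random initial partition
  for (it in 1:n_iter) {
    # Centroids = means of the current clusters
    centers <- t(sapply(1:K, function(k)
      colMeans(x[assign == k, , drop = FALSE])))
    # Step 2: squared distance of every item to every centroid
    d2 <- sapply(1:K, function(k) rowSums(sweep(x, 2, centers[k, ])^2))
    new_assign <- max.col(-d2)              # closest centroid per item
    if (all(new_assign == assign)) break    # step 4: centroids stopped moving
    assign <- new_assign                    # step 3: reassign and loop
  }
  list(cluster = assign, centers = centers)
}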

K-means: algorithm (R code)

set.seed(100)
# Simulated data: two Gaussian clouds in 2-D
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
# Run K-means from the same 5 random starting centers, allowing one more
# iteration each time, to visualize how the centroids move
for (i in 1:4) {
  set.seed(100)                       # same starting centers each time
  cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = i)
  plot(x, col = cl$cluster)
  points(cl$centers, col = 1:5, pch = 8, cex = 2)
  Sys.sleep(2)                        # pause so each step is visible
}

Example
[Figures: K-means iterations 1-4 on the simulated data. Why does the algorithm converge to this solution?]

Summary
K-means and hierarchical clustering are useful techniques: fast and easy to implement. Beware of the memory requirements of hierarchical clustering. Both are a bit "ad hoc": How many clusters? Which distance metric? What makes a clustering good?

Model-based clustering
Based on probability models (e.g. normal mixture models). This lets us say what a good clustering is, compare several models, and estimate the number of clusters!

Model-based clustering
Multivariate observations $x_1, \dots, x_N$, K clusters. Assume observation $i$ belongs to cluster $k$; then

$$x_i \sim N(\mu_k, \Sigma_k),$$

that is, each cluster can be represented by a multivariate normal distribution with mean $\mu_k$ and covariance $\Sigma_k$. Yeung et al. (2001)

Model-based clustering
Banfield and Raftery (1993): eigenvalue decomposition of the cluster covariance,

$$\Sigma_k = \lambda_k D_k A_k D_k^\top,$$

where $\lambda_k$ controls the volume, $D_k$ the orientation, and $A_k$ the shape of cluster $k$.
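A quick numerical illustration of this decomposition in base R (the 2 x 2 covariance matrix is made up):

Sigma <- matrix(c(2, 0.8, 0.8, 1), 2)        # toy covariance, p = 2
e <- eigen(Sigma)
D <- e$vectors                               # orientation
lambda <- prod(e$values)^(1/2)               # volume: det(Sigma)^(1/p)
A <- diag(e$values / lambda)                 # shape, with det(A) = 1
max(abs(lambda * D %*% A %*% t(D) - Sigma))  # ~ 0: decomposition recovers Sigma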

Model-based clustering
Common covariance parametrizations:
- Equal volume, spherical (EII)
- Unequal volume, spherical (VII)
- Equal volume, shape, and orientation (EEE)
- Unconstrained (VVV)

Estimation
Given the number of clusters and the covariance structure, the parameters can be estimated with the EM algorithm (Mclust R package, available from CRAN). The likelihood is that of a mixture model:

$$L(\theta) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k\, \phi(x_i \mid \mu_k, \Sigma_k),$$

where $\pi_k$ is the mixing proportion of cluster $k$ and $\phi$ the multivariate normal density. A sketch of evaluating this likelihood follows.
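A minimal sketch of evaluating the mixture log-likelihood directly, assuming the mvtnorm package is available; the function name and the list-based parameter layout are illustrative.

library(mvtnorm)
# Log-likelihood of data matrix x under a K-component normal mixture:
# pi_k is a vector of proportions, mu and Sigma are lists of length K
mix_loglik <- function(x, pi_k, mu, Sigma) {
  dens <- sapply(seq_along(pi_k), function(k)
    pi_k[k] * dmvnorm(x, mean = mu[[k]], sigma = Sigma[[k]]))
  sum(log(rowSums(dens)))   # sum over observations of the log mixture density
}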

Model selection
Which model is appropriate? Which covariance structure? How many clusters? Compare the different models using BIC.

Model selection
We wish to compare two models $M_1$ and $M_2$ with parameters $\theta_1$ and $\theta_2$ respectively. Given the observed data $D$, define the integrated likelihood

$$p(D \mid M_k) = \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k,$$

the probability of observing the data given model $M_k$. NB: $\theta_1$ and $\theta_2$ might have different dimensions.

Model selection
To compare two models, use the ratio of their integrated likelihoods. The integral is difficult to compute! The Bayesian information criterion approximates it:

$$\mathrm{BIC}_k = 2 \log p(D \mid \hat{\theta}_k, M_k) - \nu_k \log N,$$

where $\hat{\theta}_k$ is the maximum likelihood estimate and $\nu_k$ is the number of parameters in model $M_k$.

Model selection
BIC trades off a measure of fit (the first term) against a penalty term (the second). A large BIC score indicates strong evidence for the corresponding model. BIC can be used to choose both the number of clusters and the covariance parametrization (Mclust); a sketch follows.
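A minimal sketch of BIC-based selection with the current mclust interface (the slides below use the older EMclust() function from earlier package versions); the iris data set is purely illustrative.

library(mclust)
fit <- Mclust(iris[, 1:4], G = 1:5)  # try 1-5 clusters, all parametrizations
summary(fit)                         # best model and cluster count by BIC
plot(fit, what = "BIC")              # BIC curves per covariance structure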

Example revisited
[Figure: BIC curves for the EII and EEI models]

library(mclust)
# cho.data.std is assumed to be a standardized version of cho.data;
# EMclust() is the fitting function in the mclust version used here
cho.mclust.bic <- EMclust(cho.data.std, modelNames = c("EII", "EEI"))
plot(cho.mclust.bic)
cho.mclust <- EMclust(cho.data.std, 4, "EII")
sum.cho <- summary(cho.mclust, cho.data.std)

Example revisited

par(mfrow = c(2, 2))
matplot(t(cho.data[sum.cho$classification == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

Example revisited
[Figure: cluster expression profiles, EII model, 4 clusters]

Example revisited

# Refit with the EEI model and 3 clusters
cho.mclust <- EMclust(cho.data.std, 3, "EEI")
sum.cho <- summary(cho.mclust, cho.data.std)
par(mfrow = c(2, 2))
matplot(t(cho.data[sum.cho$classification == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 3, ]), type = "l", xlab = "time", ylab = "log expression value")

Example revisited
[Figure: cluster expression profiles, EEI model, 3 clusters]

Summary
Model-based clustering is a nice alternative to heuristic clustering algorithms. BIC can be used to choose both the covariance structure and the number of clusters.

Conclusion
We have seen a few clustering algorithms; there are many others (two-way clustering, the plaid model, ...). Clustering is a useful tool and... a dangerous weapon. To be consumed with moderation!