Cluster analysis for microarray data
Anja von Heydebreck

Aim of clustering: group objects according to their similarity.
Cluster: a set of objects that are similar to each other and separated from the other objects.
[Figure: example of green/red data points generated from two different normal distributions.]

Clustering microarray data
Genes and experiments/samples are given as the row and column vectors of a gene expression data matrix (p genes × n experiments). Clustering may be applied either to genes (regarded as vectors in R^n) or to experiments (vectors in R^p).
[Figure: schematic of the p × n gene expression data matrix.]

Why cluster genes?
- Identify groups of possibly co-regulated genes (e.g. in conjunction with sequence data).
- Identify typical temporal or spatial gene expression patterns (e.g. cell cycle data).
- Arrange a set of genes in a linear order that is at least not totally meaningless.

Why cluster experiments/samples?
- Quality control: detect experimental artifacts/bad hybridizations.
- Check whether samples are grouped according to known categories (though this might be better addressed using a supervised approach: statistical tests, classification).
- Identify new classes of biological samples (e.g. tumor subtypes).

[Figure: clustering of lymphoma gene expression data; Alizadeh et al., Nature 403:503-11, 2000.]

Cluster analysis
Generally, cluster analysis is based on two ingredients:
- Distance measure: quantification of (dis-)similarity of objects.
- Cluster algorithm: a procedure to group objects. Aim: small within-cluster distances, large between-cluster distances.

Some distance measures
Given vectors x = (x_1, …, x_n), y = (y_1, …, y_n):
- Euclidean distance: d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
- Manhattan distance: d_M(x, y) = \sum_{i=1}^{n} |x_i - y_i|
- Correlation distance: d_c(x, y) = 1 - r(x, y), where r(x, y) is the Pearson correlation of x and y.

Which distance measure to use?
Example: x = (1, 1, 1.5, 1.5), y = (2.5, 2.5, 3.5, 3.5) = 2x + 0.5, z = (1.5, 1.5, 1, 1).
Then d_c(x, y) = 0 and d_c(x, z) = 2, whereas d_E(x, z) = 1 and d_E(x, y) = \sqrt{12.5} ≈ 3.54.
The choice of distance measure should be based on the application area: what sort of similarities would you like to detect? Correlation distance d_c measures trends/relative differences: d_c(x, y) = d_c(ax + b, y) if a > 0.
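These numbers are easy to verify in R; the correlation distance is coded by hand as 1 - cor(x, y), following the definition above.

    x <- c(1, 1, 1.5, 1.5)
    y <- 2 * x + 0.5                        # an increasing affine function of x
    z <- c(1.5, 1.5, 1, 1)
    d.cor <- function(a, b) 1 - cor(a, b)   # correlation distance
    d.cor(x, y)                             # 0: same trend as x
    d.cor(x, z)                             # 2: perfectly anticorrelated with x
    dist(rbind(x, z))                       # Euclidean distance 1
    dist(rbind(x, y))                       # Euclidean distance sqrt(12.5), about 3.54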

Which distance measure to use? (continued)
Euclidean and Manhattan distance both measure absolute differences between vectors; Manhattan distance is more robust against outliers.
One may standardize the observations, subtracting the mean and dividing by the standard deviation: \tilde{x}_i = (x_i - \bar{x}) / s_x.
After standardization, Euclidean and correlation distance are equivalent: d_E(\tilde{x}, \tilde{y})^2 = 2(n-1) \, d_c(x, y) (with the sample standard deviation).
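A quick numerical check of this equivalence in R; scale() centers and scales a vector using the sample standard deviation.

    set.seed(1)
    x <- rnorm(10); y <- rnorm(10)
    xs <- as.vector(scale(x))               # mean 0, sd 1
    ys <- as.vector(scale(y))
    sum((xs - ys)^2)                        # squared Euclidean distance after standardization
    2 * (length(x) - 1) * (1 - cor(x, y))   # the same value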

K-means clustering
Input: N objects given as data points in R^p. Specify the number k of clusters and initialize k cluster centers. Iterate until convergence:
- Assign each object to the cluster with the closest center (w.r.t. Euclidean distance).
- Take the centroids/mean vectors of the resulting clusters as the new cluster centers.
K-means can be seen as an optimization problem: minimize the sum of squared within-cluster distances, W(C) = \sum_{j=1}^{k} \sum_{i: C(i)=j} \|x_i - \bar{x}_j\|^2, where \bar{x}_j is the centroid of cluster j.
Results depend on the initialization: use several starting points and choose the “best” solution (the one with minimal W(C)).
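A minimal sketch in R on simulated data; the nstart argument of kmeans() implements the several-starting-points advice, and tot.withinss is W(C).

    set.seed(1)
    x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
               matrix(rnorm(100, mean = 3), ncol = 2))   # two simulated groups
    km <- kmeans(x, centers = 2, nstart = 20)   # keep the best of 20 initializations
    km$tot.withinss                             # W(C) of the returned solution
    table(km$cluster)                           # cluster sizes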

Partitioning around medoids (PAM)
K-means clustering is based on Euclidean distance. Partitioning around medoids (PAM) generalizes the idea and can be used with any distance measure d (the objects x_i need not be vectors). The cluster centers/prototypes are required to be observations (medoids). PAM tries to minimize the sum of distances of the objects to their cluster centers, using an iterative procedure analogous to the one in K-means clustering.
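A sketch with the pam() function from the cluster package (listed on the final slide), here with a correlation distance between simulated genes as an example of a non-Euclidean d:

    library(cluster)
    set.seed(1)
    x <- matrix(rnorm(50 * 8), nrow = 50)   # 50 simulated genes, 8 conditions
    d <- as.dist(1 - cor(t(x)))             # correlation distance between genes
    fit <- pam(d, k = 3)                    # cluster centers are actual observations
    fit$medoids                             # the medoid genes
    table(fit$clustering)                   # cluster sizes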

K-means/PAM: how to choose K (the number of clusters)?
There is no easy answer. Many heuristic approaches compare the quality of clustering results for different values of K (for an overview, see Dudoit/Fridlyand 2002). The problem can be better addressed in model-based clustering, where each cluster represents a probability distribution and a likelihood-based framework can be used.
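One simple heuristic of this kind (not specifically endorsed by the slide) is to plot W(C) against K and look for an “elbow” where adding clusters stops paying off:

    set.seed(1)
    x <- rbind(matrix(rnorm(100, 0), ncol = 2),
               matrix(rnorm(100, 4), ncol = 2))
    wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
    plot(1:8, wss, type = "b", xlab = "K", ylab = "W(C)")   # look for a kink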

Self-organizing maps
K = r·s clusters are arranged as the nodes of a two-dimensional grid. The nodes represent cluster centers/prototype vectors; the grid layout makes it possible to represent similarity between clusters.
Algorithm: initialize the nodes at random positions, then iterate:
- Randomly pick one data point (gene) x.
- Move the nodes towards x: the closest node most, remote nodes (in terms of the grid) less.
The amount of movement is decreased with the number of iterations.
[Figure from Tamayo et al. 1999.]
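A minimal sketch with the SOM implementation in the class package mentioned on the final slide; the 3 × 2 grid size is an arbitrary choice for illustration.

    library(class)
    set.seed(1)
    x <- matrix(rnorm(200 * 6), nrow = 200)   # 200 simulated gene profiles
    grid <- somgrid(xdim = 3, ydim = 2, topo = "rectangular")   # K = 3*2 nodes
    fit <- SOM(x, grid)   # online SOM training
    fit$codes             # one prototype vector per grid node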

[Figure: self-organizing map of yeast cell cycle data, from Tamayo et al. 1999.]

Hierarchical clustering
Similarity of objects is represented in a tree structure (dendrogram). Advantage: there is no need to specify the number of clusters in advance, and nested clusters can be represented.
[Figure: dendrogram of the Golub data (different types of leukemia); clustering based on the 150 genes with highest variance across all samples.]

Agglomerative hierarchical clustering
Bottom-up algorithm (top-down, i.e. divisive, methods are less common). Start with the single objects as clusters. In each iteration, merge the two clusters with the minimal distance from each other, until you are left with a single cluster comprising all objects. But what is the distance between two clusters?

Distances between clusters used for hierarchical clustering
The distance between two clusters is calculated from the pairwise distances between members of the clusters:
- Complete linkage: largest pairwise distance.
- Average linkage: average pairwise distance.
- Single linkage: smallest pairwise distance.
Complete linkage gives preference to compact/spherical clusters; single linkage can produce long, stretched clusters (see the sketch below).
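In R's hclust() the linkage criterion is the method argument; a small sketch on simulated data:

    set.seed(1)
    x <- matrix(rnorm(30 * 5), nrow = 30)   # 30 simulated genes, 5 samples
    d <- dist(x)                            # Euclidean distances
    hc <- hclust(d, method = "complete")    # also: "average", "single"
    plot(hc)                                # dendrogram
    groups <- cutree(hc, k = 3)             # cut the tree into 3 clusters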

Hierarchical clustering (continued)
The height of a node in the dendrogram represents the distance between its two children clusters.
Loss of information: n objects have n(n-1)/2 pairwise distances, but the tree has only n-1 inner nodes.
The ordering of the leaves is not uniquely defined by the dendrogram: there are 2^{n-2} possible choices.
[Figure: dendrogram of the Golub data (different types of leukemia); clustering based on the 150 genes with highest variance across all samples.]

Alternative: direct visualization of similarity/distance matrices
Useful if one wants to investigate a specific factor (advantage: no loss of information). Sort the experiments according to that factor.
[Figure: heatmap of an array-by-array distance matrix, with arrays sorted into batch 1 and batch 2.]
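A sketch of such a display in R; the batch labels and the batch effect are simulated, and image() shows the reordered array-array correlation matrix.

    set.seed(1)
    batch <- rep(c(1, 2), each = 10)           # hypothetical array batches
    x <- matrix(rnorm(500 * 20), ncol = 20)    # 500 genes, 20 arrays
    x[, batch == 2] <- x[, batch == 2] + 0.5   # simulated batch effect
    ord <- order(batch)                        # sort arrays by batch
    image(cor(x[, ord]))                       # blocks along the diagonal reveal the batches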

The role of feature selection
Sometimes, people first select genes that appear to be differentially expressed between groups of samples, and then cluster the samples based on the expression levels of these genes.
Is it remarkable if the samples then cluster into the two groups? No, this doesn't prove anything, because the genes were selected with respect to the two groups! Such effects can even be obtained with a matrix of i.i.d. random numbers.
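The last claim is easy to reproduce by simulation: with pure noise, selecting the most “differential” genes for an arbitrary split of the samples and clustering on them makes the split reappear.

    set.seed(1)
    x <- matrix(rnorm(5000 * 20), nrow = 5000)   # i.i.d. noise: 5000 "genes", 20 "samples"
    group <- rep(c(1, 2), each = 10)             # an arbitrary split of the samples
    pvals <- apply(x, 1, function(g)
        t.test(g[group == 1], g[group == 2])$p.value)
    top <- order(pvals)[1:50]                    # the 50 most "differential" genes
    hc <- hclust(dist(t(x[top, ])))              # cluster samples on the selected genes
    table(cutree(hc, k = 2), group)              # the arbitrary groups reappear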

Clustering of time course data
Suppose we have expression data from different time points t_1, …, t_n and want to identify typical temporal expression profiles by clustering the genes. Usual clustering methods/distance measures don't take the ordering of the time points into account: the result would be the same if the time points were permuted.
Simple modification: consider the difference y_{ij} = x_{i(j+1)} - x_{ij} between consecutive time points as an additional observation. Then apply a clustering algorithm such as K-means to the augmented data matrix.
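A sketch of this augmentation in R, with x a genes-by-timepoints matrix and the consecutive differences appended as extra columns:

    set.seed(1)
    x <- matrix(rnorm(100 * 6), nrow = 100)   # 100 genes, 6 time points
    dx <- t(apply(x, 1, diff))                # differences y_ij between consecutive time points
    aug <- cbind(x, dx)                       # augmented data matrix
    km <- kmeans(aug, centers = 4, nstart = 10)
    table(km$cluster)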

Biclustering
Usual clustering algorithms are based on global similarities of rows or columns of an expression data matrix. But the similarity of the expression profiles of a group of genes may be restricted to certain experimental conditions. The goal of biclustering is to identify “homogeneous” submatrices.
Difficulties: computational complexity, and assessing the statistical significance of the results. Example: Tanay et al. 2002.

Conclusions
Clustering has been very popular in the analysis of microarray data. However, many typical questions in microarray studies (e.g. identifying expression signatures of tumor types) are better addressed by other methods (classification, statistical tests).
Clustering algorithms are easy to apply (you cannot really do it wrong), and they are useful for exploratory analysis. But it is difficult to assess the validity/significance of the results: even “random” data with no structure can yield clusters or exhibit interesting-looking patterns.

Clustering in R
library mva:
- hierarchical clustering: hclust, heatmap
- k-means: kmeans
library class:
- self-organizing maps: SOM
library cluster:
- pam and other functions

References
T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning (Chapter 14). Springer, 2001.
M. Eisen et al.: Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95, 14863-14868, 1998.
P. Tamayo et al.: Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96, 2907-2912, 1999.
S. Dudoit, J. Fridlyand: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3(7), research0036, 2002.
A. Tanay, R. Sharan, R. Shamir: Discovering statistically significant biclusters in gene expression data. Bioinformatics 18, Suppl. 1, S136-S144, 2002.