Cluster Analysis Class web site: Statistics for Microarrays.

Slides:



Advertisements
Similar presentations
Basic Gene Expression Data Analysis--Clustering
Advertisements

Hierarchical Clustering. Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram – A tree-like diagram that.
Cluster analysis for microarray data Anja von Heydebreck.
Data Mining Techniques: Clustering
Introduction to Bioinformatics
Important clustering methods used in microarray data analysis Steve Horvath Human Genetics and Biostatistics UCLA.
Agenda 1.Introduction to clustering 1.Dissimilarity measure 2.Preprocessing 2.Clustering method 1.Hierarchical clustering 2.K-means and K-memoids 3.Self-organizing.
Clustering II.
Analysis of microarray data. Gene expression database – a conceptual view Samples Genes Gene expression levels Sample annotations Gene annotations Gene.
Mutual Information Mathematical Biology Seminar
SocalBSI 2008: Clustering Microarray Datasets Sagar Damle, Ph.D. Candidate, Caltech  Distance Metrics: Measuring similarity using the Euclidean and Correlation.
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
L16: Micro-array analysis Dimension reduction Unsupervised clustering.
Clustering Algorithms Bioinformatics Data Analysis and Tools
‘Gene Shaving’ as a method for identifying distinct sets of genes with similar expression patterns Tim Randolph & Garth Tan Presentation for Stat 593E.
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman.
Dimension reduction : PCA and Clustering Christopher Workman Center for Biological Sequence Analysis DTU.
Introduction to Hierarchical Clustering Analysis Pengyu Hong 09/16/2005.
Microarray Data Analysis
Introduction to Bioinformatics - Tutorial no. 12
Cluster Analysis for Gene Expression Data Ka Yee Yeung Center for Expression Arrays Department of Microbiology.
Normalization Review and Cluster Analysis Class web site: Statistics for Microarrays.
Microarray analysis 2 Golan Yona. 2) Analysis of co-expression Search for similarly expressed genes experiment1 experiment2 experiment3 ……….. Gene i:
CLUSTERING (Segmentation)
Pitfalls in Cluster Analysis Darlene Goldstein Data Club 20 November 2002.
Statistics for Microarrays
Introduction to Bioinformatics Algorithms Clustering and Microarray Analysis.
B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical.
Clustering Unsupervised learning Generating “classes”
Gene expression profiling identifies molecular subtypes of gliomas
BIONFORMATIC ALGORITHMS Ryan Tinsley Brandon Lile May 9th, 2014.
Gene expression & Clustering (Chapter 10)
Analysis and Management of Microarray Data Dr G. P. S. Raghava.
Clustering of DNA Microarray Data Michael Slifker CIS 526.
CDNA Microarrays MB206.
More on Microarrays Chitta Baral Arizona State University.
Analysis of Microarray Data Analysis of images Preprocessing of gene expression data Normalization of data –Subtraction of Background Noise –Global/local.
START OF DAY 8 Reading: Chap. 14. Midterm Go over questions General issues only Specific issues: visit with me Regrading may make your grade go up OR.
Microarrays.
Microarray data analysis David A. McClellan, Ph.D. Introduction to Bioinformatics Brigham Young University Dept. Integrative Biology.
Clustering What is clustering? Also called “unsupervised learning”Also called “unsupervised learning”
Cluster Analysis Cluster Analysis Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups.
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman modified by Hanne Jarmer.
More About Clustering Naomi Altman Nov '06. Assessing Clusters Some things we might like to do: 1.Understand the within cluster similarity and between.
Clustering.
An Overview of Clustering Methods Michael D. Kane, Ph.D.
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Lecture 4 Clustering Algorithms Bioinformatics Data Analysis and Tools
CZ5225: Modeling and Simulation in Biology Lecture 3: Clustering Analysis for Microarray Data I Prof. Chen Yu Zong Tel:
Gene expression & Clustering. Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species –Dynamic.
CS 8751 ML & KDDData Clustering1 Clustering Unsupervised learning Generating “classes” Distance/similarity measures Agglomerative methods Divisive methods.
Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9.
Analyzing Expression Data: Clustering and Stats Chapter 16.
Compiled By: Raj Gaurang Tiwari Assistant Professor SRMGPC, Lucknow Unsupervised Learning.
Lecture 13 Clustering. What is Clustering? Clustering is an EXPLORATORY statistical technique used to break up a data set into smaller groups or “clusters”
Hierarchical Clustering
Computational Biology Group. Class prediction of tumor samples Supervised Clustering Detection of Subgroups in a Class.
1 Microarray Clustering. 2 Outline Microarrays Hierarchical Clustering K-Means Clustering Corrupted Cliques Problem CAST Clustering Algorithm.
CZ5211 Topics in Computational Biology Lecture 4: Clustering Analysis for Microarray Data II Prof. Chen Yu Zong Tel:
Hierarchical clustering approaches for high-throughput data Colin Dewey BMI/CS 576 Fall 2015.
C LUSTERING José Miguel Caravalho. CLUSTER ANALYSIS OR CLUSTERING IS THE TASK OF ASSIGNING A SET OF OBJECTS INTO GROUPS ( CALLED CLUSTERS ) SO THAT THE.
Unsupervised Learning
PREDICT 422: Practical Machine Learning
CZ5211 Topics in Computational Biology Lecture 3: Clustering Analysis for Microarray Data I Prof. Chen Yu Zong Tel:
K-means and Hierarchical Clustering
Multivariate Statistical Methods
Data Mining – Chapter 4 Cluster Analysis Part 2
Text Categorization Berlin Chen 2003 Reference:
Hierarchical Clustering
Unsupervised Learning
Presentation transcript:

Cluster Analysis Class web site: Statistics for Microarrays

Biological question Differentially expressed genes Sample class prediction etc. Testing Biological verification and interpretation Microarray experiment Estimation Experimental design Image analysis Normalization Clustering Discrimination R, G 16-bit TIFF files (Rfg, Rbg), (Gfg, Gbg)

cDNA gene expression data Data on p genes for n samples Genes mRNA samples Gene expression level of gene i in mRNA sample j = (normalized) Log( Red intensity / Green intensity) sample1sample2sample3sample4sample5 …

Cluster analysis Used to find groups of objects when not already known “Unsupervised learning” Associated with each object is a set of measurements (the feature vector) Aim is to identify groups of similar objects on the basis of the observed measurements

Clustering Gene Expression Data Can cluster genes (rows), e.g. to (attempt to) identify groups of co- regulated genes Can cluster samples (columns), e.g. to identify tumors based on profiles Can cluster both rows and columns at the same time

Clustering Gene Expression Data Leads to readily interpretable figures Can be helpful for identifying patterns in time or space Useful (essential?) when seeking new subclasses of samples Can be used for exploratory purposes

Similarity Similarity s ij indicates the strength of relationship between two objects i and j Usually 0 ≤ s ij ≤1 Correlation-based similarity ranges from –1 to 1

Problems using correlation

Dissimilarity and Distance Associated with similarity measures s ij bounded by 0 and 1 is a dissimilarity d ij = 1 - s ij Distance measures have the metric property (d ij +d ik ≥ d jk ) Many examples: Euclidean, Manhattan, etc. Distance measure has a large effect on performance Behavior of distance measure related to scale of measurement

Partitioning Methods Partition the objects into a prespecified number of groups K Iteratively reallocate objects to clusters until some criterion is met (e.g. minimize within cluster sums of squares) Examples: k-means, partitioning around medoids (PAM), self-organizing maps (SOM), model-based clustering

Hierarchical Clustering Produce a dendrogram Avoid prespecification of the number of clusters K The tree can be built in two distinct ways: –Bottom-up: agglomerative clustering –Top-down: divisive clustering

Agglomerative Methods Start with n mRNA sample (or p gene) clusters At each step, merge the two closest clusters using a measure of between-cluster dissimilarity which reflects the shape of the clusters Examples of between-cluster dissimilarities: –Unweighted Pair Group Method with Arithmetic Mean (UPGMA): average of pairwise dissimilarities –Single-link (NN): minimum of pairwise dissimilarities –Complete-link (FN): maximum of pairwise dissimilarities

Divisive Methods Start with only one cluster At each step, split clusters into two parts Advantage: Obtain the main structure of the data (i.e. focus on upper levels of dendrogram) Disadvantage: Computational difficulties when considering all possible divisions into two groups

Partitioning vs. Hierarchical Partitioning –Advantage: Provides clusters that satisfy some optimality criterion (approximately) –Disadvantages: Need initial K, long computation time Hierarchical –Advantage: Fast computation (agglomerative) –Disadvantages: Rigid, cannot correct later for erroneous decisions made earlier

Generic Clustering Tasks Estimating number of clusters Assigning each object to a cluster Assessing strength/confidence of cluster assignments for individual objects Assessing cluster homogeneity

Estimating Number of Clusters

Bittner et al. It has been proposed (by many) that a cancer taxonomy can be identified from gene expression experiments.

Dataset description 31 melanomas (from a variety of tissues/cell lines) 7 controls 8150 cDNAs 6971 unique genes 3613 genes ‘strongly detected’

This is why you need to take logs!

After logging…

How many clusters are present?

‘cluster’ unclustered Average linkage hierarchical clustering, melanoma only 1-  =.54

Issues in Clustering Pre-processing (Image analysis and Normalization) Which genes (variables) are used Which samples are used Which distance measure is used Which algorithm is applied How to decide the number of clusters K

Issues in Clustering Pre-processing (Image analysis and Normalization) Which genes (variables) are used Which samples are used Which distance measure is used Which algorithm is applied How to decide the number of clusters K

Filtering Genes All genes (i.e. don’t filter any) At least k (or a proportion p) of the samples must have expression values larger than some specified amount, A Genes showing “sufficient” variation –a gap of size A in the central portion of the data –a interquartile range of at least B Filter based on statistical comparison –t-test –ANOVA –Cox model, etc.

Issues in Clustering Pre-processing (Image analysis and Normalization) Which genes (variables) are used Which samples are used Which distance measure is used Which algorithm is applied How to decide the number of clusters K

‘cluster’ unclustered Average linkage hierarchical clustering, melanoma only

‘cluster’ control unclustered Average linkage hierarchical clustering, melanoma & controls

Issues in clustering Pre-processing Which genes (variables) are used Which samples are used Which distance measure is used Which algorithm is applied How to decide the number of clusters K

Complete linkage (FN) hierarchical clustering

Single linkage (NN) hierarchical clustering

Issues in clustering Pre-processing Which genes (variables) are used Which samples are used Which distance measure is used Which algorithm is applied How to decide the number of clusters K

Divisive clustering, melanoma only

Divisive clustering, melanoma & controls

Partitioning methods K-means and PAM, 2 groups BittnerK-meansPAM# samples

3 groups BittnerK-meansPAM# samples

Issues in clustering Pre-processing Which genes (variables) are used Which samples are used Which distance measure is used Which algorithm is applied How to decide the number of clusters K

How many clusters K? Many suggestions for how to decide this! Milligan and Cooper (Psychometrika 50: , 1985) studied 30 methods Some new methods include GAP (Tibshirani ) and clest (Fridlyand and Dudoit) Applying several methods yielded estimates of K = 2 (largest cluster has 27 members) to K = 8 (largest cluster has 19 members)

cluster unclustered Average linkage hierarchical clustering, melanoma only

Summary ‘Buyer beware’ – results of cluster analysis should be treated with GREAT CAUTION and ATTENTION TO SPECIFICS, because… Many things can vary in a cluster analysis If covariates/group labels are known, then clustering is usually inefficient

Acknowledgements IPAM Group, UCLA: Debashis Ghosh Erin Conlon Dirk Holste Steve Horvath Lei Li Henry Lu Eduardo Neves Marcia Salzano Xianghong Zhao Others: Sandrine Dudoit Jane Fridlyand Jose Correa Terry Speed William Lemon Fred Wright