An Overview of Clustering Methods Michael D. Kane, Ph.D.

Topics What is clustering? Clustering mechanics (how the computer does it). Parameter choices and their effect. Examples.

What is clustering? Grouping by similarity.

Similar genes. Group genes that have similar expression profiles when observed over multiple samples. [Figure: "Gene clustering", axes Genes and Samples.]

Similar samples. Group samples that are similar when observed over multiple genes. [Figure: "Sample clustering", axes Genes and Samples.]

Why cluster? Similar gene expression implies common biology. The function of uncharacterized genes may be deduced from co-expression with known genes. Associate expression patterns with: response to environmental change; disease pathology/progression.

Clustering Mechanics. For gene clustering, we must measure similarity between genes. [Figure: genes a to f plotted on axes E1 and E2.]

Distance (similarity) measure. Euclidean distance: d_be, the distance between genes b and e, shown between the points (4.6, 0.5) and (1.0, 1.7). [Figure: genes a to f plotted on axes E1 and E2.]
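As a worked example, the Euclidean distance between the two points quoted on the slide can be computed directly. This is a minimal sketch; the function name `euclidean` is illustrative and does not come from the presentation.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two expression profiles of equal length."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same number of measurements")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# The two (E1, E2) points quoted on the slide.
d = euclidean((4.6, 0.5), (1.0, 1.7))
print(round(d, 2))  # 3.79
```

The same function works unchanged for profiles measured over more than two experiments, since it sums the squared difference over every coordinate.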

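Distance Measure. Pearson correlation: $S(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$. Used in “Eisen” clustering.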
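The Pearson correlation compares the shape of two profiles rather than their magnitude. The sketch below uses the standard textbook definition; the function name `pearson` and the three example profiles are hypothetical, chosen only to show the two extremes.

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles (range -1 to 1)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two profiles of equal length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical profiles: same pattern at different magnitude, and a mirrored pattern.
gene_low = [1.0, 2.0, 3.0, 2.0]
gene_high = [10.0, 20.0, 30.0, 20.0]
gene_opposite = [3.0, 2.0, 1.0, 2.0]

print(pearson(gene_low, gene_high))      # 1.0: same pattern
print(pearson(gene_low, gene_opposite))  # -1.0: mirrored pattern
```

A common way to turn the correlation into a distance for clustering is 1 - S, so that perfectly correlated genes sit at distance 0.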

Hierarchical Clustering. [Figure: genes a to f plotted on axes E1 and E2, alongside a tree with leaves a, b, c, d, e, f.]
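One way to picture how the tree is built is bottom-up: start with every gene in its own cluster and repeatedly join the two closest clusters. The sketch below assumes Euclidean distance and average linkage; the gene coordinates and the function names (`agglomerate`, `average_linkage`) are hypothetical and not taken from the slides.

```python
import math
from itertools import combinations

def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def average_linkage(c1, c2, points):
    """Mean distance over all cross-cluster pairs of members."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points, linkage=average_linkage):
    """Join the two closest clusters until one remains.

    points maps a gene name to its coordinates. Returns the merges in
    order, each as (cluster_a, cluster_b, distance).
    """
    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: linkage(pair[0], pair[1], points))
        merges.append((a, b, linkage(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical (E1, E2) coordinates for six genes.
genes = {
    "a": (1.0, 1.0), "b": (1.5, 1.2),
    "c": (5.0, 5.0), "d": (5.3, 5.6),
    "e": (12.0, 1.0), "f": (12.3, 1.4),
}

merges = agglomerate(genes)
for a, b, dist in merges:
    print(a, "+", b, "at distance", round(dist, 2))
```

The order of the merges, read from first to last, is the tree from its leaves to its root; cutting the tree at a chosen distance yields a set of clusters.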

Measuring distance between clusters. Single linkage: the minimum distance between clusters, i.e. between the closest pair of members, one from each cluster. May form loose, “chained” clusters. Complete linkage: the maximum distance between clusters, i.e. between the farthest pair of members. Tends to form compact clusters.
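The difference between the two linkage rules is easiest to see on a small example. This is a sketch under the assumption of Euclidean distance; the two clusters are hypothetical points laid out along the E1 axis.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def single_linkage(c1, c2):
    """Distance between the closest pair of members, one from each cluster."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Distance between the farthest pair of members, one from each cluster."""
    return max(euclidean(p, q) for p in c1 for q in c2)

# Hypothetical clusters: two elongated runs of points along one axis.
left = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
right = [(3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]

print(single_linkage(left, right))    # 1.0 (points at 2.0 and 3.0)
print(complete_linkage(left, right))  # 5.0 (points at 0.0 and 5.0)
```

Single linkage sees these two clusters as only 1.0 apart because their ends touch, which is why it tends to string clusters together; complete linkage sees them as 5.0 apart and so favours compact groups.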

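Methods for joining clusters. UPGMA, the unweighted pair group method with arithmetic mean (average linkage): the average distance between clusters, taken over all pairs of members. Weighted pair group method (WPGMA): same as UPGMA, except that when two clusters are joined they contribute equally to the new distance, regardless of how many members each has. Use when clusters are expected to be significantly uneven in size!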
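The two averaging schemes differ only in how the distance to a newly joined cluster is updated. The sketch below uses the usual textbook update rules; the function names and the example numbers are hypothetical.

```python
def upgma_update(d_ki, d_kj, size_i, size_j):
    """Distance from cluster k to the joined cluster (i + j) under UPGMA.

    Every original member counts equally, so cluster sizes act as weights.
    """
    return (size_i * d_ki + size_j * d_kj) / (size_i + size_j)

def wpgma_update(d_ki, d_kj):
    """Distance from cluster k to the joined cluster (i + j) under WPGMA.

    The two joined clusters count equally, whatever their sizes.
    """
    return (d_ki + d_kj) / 2

# Hypothetical case: cluster i holds 9 genes, cluster j holds 1 gene.
d_ki, d_kj = 2.0, 10.0
print(upgma_update(d_ki, d_kj, size_i=9, size_j=1))  # 2.8
print(wpgma_update(d_ki, d_kj))                      # 6.0
```

With very uneven sizes the two rules diverge sharply: UPGMA lets the large cluster dominate the new distance, while WPGMA gives the single gene in cluster j as much say as the nine genes in cluster i.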

Effect of distance measure. [Figure panels: Euclidean, single linkage; Euclidean, complete linkage.]

Effect of distance measure. [Figure panels: Euclidean, UPGMA; Euclidean, Ward’s method.]

Alternatives to hierarchical clustering: k-means. Number of clusters specified by user. Good when prior knowledge is available.

k-means clustering. 1. Number of clusters specified by user. 2. Genes randomly assigned to clusters. 3. Assess inter- and intra-cluster similarity. 4. Move genes to an alternative cluster if distance is reduced. Steps 3 and 4 are then repeated until no gene changes cluster. [Figure: genes a to f plotted on axes E1 and E2.]
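The steps above can be sketched as a short loop. This is one common reading of the procedure (each cluster is summarised by the mean of its members, and each gene moves to the nearest mean), not necessarily the exact variant the slide had in mind. The function names, the `initial` parameter and the gene coordinates are all hypothetical.

```python
import random

def sq_dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def centroid(members):
    n = len(members)
    return tuple(sum(m[d] for m in members) / n for d in range(len(members[0])))

def kmeans(points, k, initial=None, seed=0, max_iter=100):
    """Return a cluster label (0..k-1) for each point.

    initial: optional starting labels. If omitted, genes are assigned
    to clusters at random (step 2).
    """
    rng = random.Random(seed)
    labels = list(initial) if initial is not None else [rng.randrange(k) for _ in points]
    centers = [None] * k
    for _ in range(max_iter):
        # Step 3: summarise each cluster by the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
            elif centers[c] is None:
                centers[c] = rng.choice(points)  # empty from the start: borrow a random gene
        # Step 4: move each gene to the cluster with the nearest centre.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # no gene moved, so the clustering is stable
        labels = new_labels
    return labels

# Hypothetical (E1, E2) coordinates for genes a to f.
genes = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

# A deliberately mixed starting assignment, fixed here so the run is repeatable.
print(kmeans(genes, k=2, initial=[0, 1, 0, 1, 0, 1]))  # [0, 0, 0, 1, 1, 1]
```

Because the result depends on the starting assignment, it is usual to run the procedure from several random starts and compare the outcomes.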

Alternatives to hierarchical clustering: SOM (self-organizing maps). Number of clusters specified by user. Good when prior knowledge is available.

SOM. User-specified number of clusters (here cluster 1, cluster 2, cluster 3). Each is initially given a random expression representation. For a gene, find the most similar cluster representation. Increase the similarity by adjusting the cluster representation (“training”). Iteratively train the cluster representations. After training, assign each gene to the most similar cluster. [Figure: genes a to f and the representations of clusters 1 to 3 plotted on axes E1 and E2.]
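The training loop described on this slide can be sketched as follows. Note the simplification: a full self-organizing map also adjusts the grid neighbours of the winning cluster, and that step is left out here because the slide describes only the adjustment of the most similar representation. The function names, the `initial` parameter, the learning-rate schedule and the gene coordinates are all assumptions made for illustration.

```python
import random

def sq_dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def nearest(prototypes, gene):
    """Index of the cluster representation most similar to the gene."""
    return min(range(len(prototypes)), key=lambda i: sq_dist(prototypes[i], gene))

def train(genes, k, initial=None, epochs=20, rate=0.5, seed=0):
    """Winner-only training of k cluster representations."""
    rng = random.Random(seed)
    dims = len(genes[0])
    if initial is not None:
        prototypes = [list(p) for p in initial]
    else:
        # Random starting representations within the range of the data.
        lows = [min(g[d] for g in genes) for d in range(dims)]
        highs = [max(g[d] for g in genes) for d in range(dims)]
        prototypes = [[rng.uniform(lows[d], highs[d]) for d in range(dims)]
                      for _ in range(k)]
    for epoch in range(epochs):
        step = rate * (1 - epoch / epochs)  # shrink the adjustment over time
        for gene in genes:
            w = nearest(prototypes, gene)
            # Pull the winning representation part of the way toward the gene.
            prototypes[w] = [p + step * (g - p) for p, g in zip(prototypes[w], gene)]
    return prototypes

# Hypothetical (E1, E2) coordinates for genes a to f.
genes = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

# Fixed starting representations so the run is repeatable.
prototypes = train(genes, k=2, initial=[(0.0, 0.0), (10.0, 10.0)])

# After training, assign each gene to the most similar cluster.
labels = [nearest(prototypes, g) for g in genes]
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Without the neighbourhood step the clusters have no spatial arrangement relative to one another; adding it is what gives a self-organizing map its "map" structure.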

Gene clustering. Eisen et al., Cluster analysis and display of genome-wide expression patterns. PNAS v95. Time course (in hours) after re-introduction of serum to serum-deprived human fibroblasts. Pearson correlation, average linkage. [Figure: clustered expression data with labelled gene groups: cholesterol biosynthesis; cell cycle; immediate-early response; signaling; wound healing.]

Sample clustering. Ross et al., Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics v24. Cancer cell lines clustered. 8,000 genes. Clustering performed with 2 different subsets of genes gave similar results. Pearson correlation, average linkage. Note the breast cancer cell lines derived from the same patient.

Summary Different methods often provide different clusters. No overall “best” clustering method. Clustering applied to unrelated data will still provide clusters. Use biological insight in method selection and interpretation.
