Introduction to Hierarchical Clustering Analysis Pengyu Hong 09/16/2005.

Slides:



Advertisements
Similar presentations
Different types of data e.g. Continuous data:height Categorical data ordered (nominal):growth rate very slow, slow, medium, fast, very fast not ordered:fruit.
Advertisements

Basic Gene Expression Data Analysis--Clustering
SEEM Tutorial 4 – Clustering. 2 What is Cluster Analysis?  Finding groups of objects such that the objects in a group will be similar (or.
Information Retrieval Lecture 7 Introduction to Information Retrieval (Manning et al. 2007) Chapter 17 For the MSc Computer Science Programme Dell Zhang.
Albert Gatt Corpora and Statistical Methods Lecture 13.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Introduction to Bioinformatics
Cluster Analysis.
UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL.
2004/05/03 Clustering 1 Clustering (Part One) Ku-Yaw Chang Assistant Professor, Department of Computer Science and Information.
The Broad Institute of MIT and Harvard Clustering.
Clustering (1) Clustering Similarity measure Hierarchical clustering Model-based clustering Figures from the book Data Clustering by Gan et al.
Unsupervised learning: Clustering Ata Kaban The University of Birmingham
Seeing the forest for the trees : using the Gene Ontology to restructure hierarchical clustering Dikla Dotan-Cohen, Simon Kasif and Avraham A. Melkman.
Clustering II.
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
Bio277 Lab 2: Clustering and Classification of Microarray Data Jess Mar Department of Biostatistics Quackenbush Lab DFCI
University of CreteCS4831 The use of Minimum Spanning Trees in microarray expression data Gkirtzou Ekaterini.
L16: Micro-array analysis Dimension reduction Unsupervised clustering.
Clustering Algorithms Bioinformatics Data Analysis and Tools
MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS
Cluster Analysis Class web site: Statistics for Microarrays.
Introduction to Bioinformatics - Tutorial no. 12
Cluster Analysis for Gene Expression Data Ka Yee Yeung Center for Expression Arrays Department of Microbiology.
Microarray analysis 2 Golan Yona. 2) Analysis of co-expression Search for similarly expressed genes experiment1 experiment2 experiment3 ……….. Gene i:
Cluster Analysis Hierarchical and k-means. Expression data Expression data are typically analyzed in matrix form with each row representing a gene and.
Clustering microarray data 09/26/07. Sub-classes of lung cancer types have signature genes (Bhattacharjee 2001)
Gene Expression Clustering. The Main Goal Gain insight into the gene’s function. Using: Sequence Transcription levels.
Graph-based consensus clustering for class discovery from gene expression data Zhiwen Yum, Hau-San Wong and Hongqiang Wang Bioinformatics, 2007.
Gene expression profiling identifies molecular subtypes of gliomas
COMP53311 Clustering Prepared by Raymond Wong Some parts of this notes are borrowed from LW Chan ’ s notes Presented by Raymond Wong
Clustering of DNA Microarray Data Michael Slifker CIS 526.
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
Microarray data analysis David A. McClellan, Ph.D. Introduction to Bioinformatics Brigham Young University Dept. Integrative Biology.
Clustering What is clustering? Also called “unsupervised learning”Also called “unsupervised learning”
Hierarchical Clustering of Gene Expression Data Author : Feng Luo, Kun Tang Latifur Khan Graduate : Chien-Ming Hsiao.
An Overview of Clustering Methods Michael D. Kane, Ph.D.
Course Work Project Project title “Data Analysis Methods for Microarray Based Gene Expression Analysis” Sushil Kumar Singh (batch ) IBAB, Bangalore.
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Lecture 4 Clustering Algorithms Bioinformatics Data Analysis and Tools
MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia Armstrong et al, Nature Genetics 30, (2002)
CZ5225: Modeling and Simulation in Biology Lecture 3: Clustering Analysis for Microarray Data I Prof. Chen Yu Zong Tel:
Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9.
Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram – A tree like diagram that.
Brad Windle, Ph.D Unsupervised Learning and Microarrays Web Site: Link to Courses and.
Compiled By: Raj Gaurang Tiwari Assistant Professor SRMGPC, Lucknow Unsupervised Learning.
Hierarchical Clustering
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Cluster Analysis This lecture node is modified based on Lecture Notes for Chapter.
1 baySeq homework HS analysis: Out of 7388 genes with data, 1995 genes were DE at FDR
Clustering Algorithms Sunida Ratanothayanon. What is Clustering?
Cluster Analysis, an Overview Laurie Heyer. Why Cluster? Data reduction – Analyze representative data points, not the whole dataset Hypothesis generation.
Hierarchical clustering approaches for high-throughput data Colin Dewey BMI/CS 576 Fall 2015.
Clustering Microarray Data based on Density and Shared Nearest Neighbor Measure CATA’06, March 23-25, 2006 Seattle, WA, USA Ranapratap Syamala, Taufik.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
C LUSTERING José Miguel Caravalho. CLUSTER ANALYSIS OR CLUSTERING IS THE TASK OF ASSIGNING A SET OF OBJECTS INTO GROUPS ( CALLED CLUSTERS ) SO THAT THE.
Data Mining and Text Mining. The Standard Data Mining process.
Flow cytometry data analysis: SPADE for cell population identification and sample clustering Narahara.
Graph clustering to detect network modules
Unsupervised Learning
Clustering CSC 600: Data Mining Class 21.
CZ5211 Topics in Computational Biology Lecture 3: Clustering Analysis for Microarray Data I Prof. Chen Yu Zong Tel:
Hierarchical clustering approaches for high-throughput data
Cluster Analysis of Microarray Data
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Multivariate Statistical Methods
(A) Hierarchical clustering was performed to identify groups of patients with similar RNASeq expression of 20 genes associated with reduced survivability.
Cluster Analysis.
Text Categorization Berlin Chen 2003 Reference:
Hierarchical Clustering
Unsupervised Learning
Presentation transcript:

Introduction to Hierarchical Clustering Analysis Pengyu Hong 09/16/2005

Background Cell/Tissue 1 Cell/Tissue 2 Cell/Tissue N … Data 1 Data 2 Data N … Put similar samples/entries together.

Background  Clustering is one of the most important unsupervised learning processes that organizing objects into groups whose members are similar in some way.  Clustering finds structures in a collection of unlabeled data.  A cluster is a collection of objects which are similar between them and are dissimilar to the objects belonging to other clusters.

Motivation I Microarray data quality checking –Does replicates cluster together? –Does similar conditions, time points, tissue types cluster together?

Data: Rat Schizophrenia Data (Allen Fienberg and Mayetri Gupta)  Two time points:35 days (PD 35) and 60 days (PD60) past birth.  Two brain regions: Prefrontal cortex (PFC) and Nucleus accumbens (NA).  Two replicates (Samples are from the same set of tissue split into different tubes so that replicates should be in close agreement.)  dChip was used to normalize the data and get model-based expression values, using the full PM/MM model. How to read this clustering result? Heat map Link length Sample IDs Gene IDs Clustering results Problem?

Motivation II Cluster genes  Prediction of functions of unknown genes by known ones

Functional significant gene clusters Two-way clustering Gene clusters Sample clusters

Motivation II Cluster genes  Prediction of functions of unknown genes by known ones Cluster samples  Discover clinical characteristics (e.g. survival, marker status) shared by samples.

Bhattacharjee et al. (2001) Human lung carcinomas mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA, Vol. 98,

Motivation II Cluster genes  Prediction of functions of unknown genes by known ones Cluster samples  Discover clinical characteristics (e.g. survival, marker status) shared by samples Promoter analysis of commonly regulated genes

David J. Lockhart & Elizabeth A. Winzeler, NATURE | VOL 405 | 15 JUNE 2000, p827 Promoter analysis of commonly regulated genes

Clustering Algorithms Start with a collection of n objects each represented by a p–dimensional feature vector x i, i=1, …n. The goal is to divide these n objects into k clusters so that objects within a clusters are more “similar” than objects between clusters. k is usually unknown. Popular methods: hierarchical, k-means, SOM, mixture models, etc.

Hierarchical Clustering Dendrogram Venn Diagram of Clustered Data From

Hierarchical Clustering (Cont.) Multilevel clustering: level 1 has n clusters  level n has one cluster. Agglomerative HC: starts with singleton and merge clusters. Divisive HC: starts with one sample and split clusters.

Nearest Neighbor Algorithm Nearest Neighbor Algorithm is an agglomerative approach (bottom-up). Starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached. From

Nearest Neighbor, Level 2, k = 7 clusters. From

Nearest Neighbor, Level 3, k = 6 clusters.

Nearest Neighbor, Level 4, k = 5 clusters.

Nearest Neighbor, Level 5, k = 4 clusters.

Nearest Neighbor, Level 6, k = 3 clusters.

Nearest Neighbor, Level 7, k = 2 clusters.

Nearest Neighbor, Level 8, k = 1 cluster.

Calculate the similarity between all possible combinations of two profiles Two most similar clusters are grouped together to form a new cluster Calculate the similarity between the new cluster and all remaining clusters. Hierarchical Clustering Keys Similarity Clustering

Similarity Measurements Pearson Correlation Two profiles (vectors) and +1  Pearson Correlation  – 1

Similarity Measurements Pearson Correlation: Trend Similarity

Similarity Measurements Euclidean Distance

Similarity Measurements Euclidean Distance: Absolute difference

Similarity Measurements Cosine Correlation +1  Cosine Correlation  – 1

Similarity Measurements Cosine Correlation: Trend + Mean Distance

Similarity Measurements

Similar?

Clustering C1C1 C2C2 C3C3 Merge which pair of clusters?

+ + Clustering Single Linkage C1C1 C2C2 Dissimilarity between two clusters = Minimum dissimilarity between the members of two clusters Tend to generate “long chains”

+ + Clustering Complete Linkage C1C1 C2C2 Dissimilarity between two clusters = Maximum dissimilarity between the members of two clusters Tend to generate “clumps”

+ + Clustering Average Linkage C1C1 C2C2 Dissimilarity between two clusters = Averaged distances of all pairs of objects (one from each cluster).

+ + Clustering Average Group Linkage C1C1 C2C2 Dissimilarity between two clusters = Distance between two cluster means.

Considerations What genes are used to cluster samples? –Expression variation –Inherent variation –Prior knowledge (irrelevant genes) –Etc.

Take Home Questions Which clustering method is better? How to cut the clustering tree to get relatively tight clusters of genes or samples?