CLUSTERING. José Miguel Caravalho.



Cluster analysis, or clustering, is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.
What is clustering?
A way of grouping together data samples that are similar in some way, according to criteria that you pick. A form of unsupervised learning: you generally don't have examples demonstrating how the data should be grouped together. So it is a method of data exploration, a way of looking for patterns or structure in the data that are of interest.

WHY CLUSTER?
Cluster genes (rows): measure expression at multiple time points, different conditions, etc. Similar expression patterns may suggest similar functions of genes.
Cluster samples (columns): expression levels of thousands of genes for each tumor sample. Similar expression patterns may suggest a biological relationship among samples.

EXAMPLE 1: CLUSTERING GENES
P. Tamayo et al., "Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation", PNAS 96.
Treatment of HL-60 cells (a myeloid leukemia cell line) with PMA leads to differentiation into macrophages. Gene expression was measured at 0, 0.5, 4 and 24 hours after PMA treatment.

EXAMPLE 1: CLUSTERING GENES
The SOM technique was used; shown are cluster averages. The clusters contain a number of known related genes involved in macrophage differentiation, for example late-induction cytokines and cell-cycle genes (down-regulated, since PMA induces terminal differentiation).

EXAMPLE 2: CLUSTERING SAMPLES
A. Alizadeh et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling", Nature 403.
The response to treatment of patients with diffuse large B-cell lymphoma (DLBCL) is heterogeneous. The authors tried to use expression data to discover finer distinctions among tumor types. They collected gene expression data for 42 DLBCL tumor samples, normal B-cells in various stages of differentiation, and various controls.

They found that some tumor samples have expression more similar to germinal center B-cells and others to peripheral blood activated B-cells. Patients with "germinal center type" DLBCL generally had higher five-year survival rates.

CHOOSING (DIS)SIMILARITY MEASURES: A CRITICAL STEP IN CLUSTERING
How do we define "similarity"? Recall that the goal is to group together "similar" data, but what does this mean? There is no single answer: it depends on what we want to find or emphasize in the data, and this is one reason why clustering is an "art". The similarity measure is often more important than the clustering algorithm used, so don't overlook this choice!

(DIS)SIMILARITY MEASURES
Instead of talking about similarity measures, we often equivalently refer to dissimilarity measures (I'll give an example of how to convert between them in a few slides…). Jagota defines a dissimilarity measure as a function f(x, y) such that f(x, y) > f(w, z) if and only if x is less similar to y than w is to z. This is always a pair-wise measure. Think of x, y, w, and z as gene expression profiles (rows or columns).

EUCLIDEAN DISTANCE
$d_{euc}(x, y) = \sqrt{\sum_{i=1}^{K} (x_i - y_i)^2}$
Here K is the number of dimensions in the data vector. For instance: the number of time points/conditions (when clustering genes), or the number of genes (when clustering samples).
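A minimal sketch of the Euclidean distance between two profiles, in plain Python. The function name and the example profiles are hypothetical and only serve as illustration; K is simply the length of the two lists.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two profiles with the same K dimensions."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same number of dimensions K")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical expression profiles of two genes over K = 3 time points.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 3.0, 4.0]

# Every coordinate differs by 1, so the distance is sqrt(3).
print(euclidean_distance(gene_a, gene_b))
```

When clustering genes, each list would hold one gene's values across conditions; when clustering samples, one sample's values across genes.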

EUCLIDEAN DISTANCE
[Figure: example pairs of expression profiles with their Euclidean distances; the recoverable values are d_euc = 1.41 and d_euc = 1.22.]

CORRELATION
We might care more about the overall shape of expression profiles than about the actual magnitudes. That is, we might want to consider genes similar when they are "up" and "down" together. When might we want this kind of measure? What experimental issues might make this appropriate?

PEARSON LINEAR CORRELATION
Pearson linear correlation (PLC) is a measure that is invariant to scaling and (vertical) shifting of the expression values. It always lies between -1 and +1 (perfectly anti-correlated and perfectly correlated). This is a similarity measure, but it can easily be turned into a dissimilarity measure.
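A small sketch of PLC and of one possible conversion to a dissimilarity. The correlation is the standard Pearson formula; the conversion d = 1 - ρ is an assumption made for illustration (other conversions, such as rescaling to the range 0 to 1, are equally possible), and all names and data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson linear correlation of two equal-length profiles.

    Undefined (division by zero) if either profile is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def pearson_dissimilarity(x, y):
    # Assumed conversion: 0 when perfectly correlated, 2 when anti-correlated.
    return 1.0 - pearson(x, y)

profile = [1.0, 3.0, 2.0, 5.0]
scaled_and_shifted = [10.0 * v + 7.0 for v in profile]  # same shape
flipped = [-v for v in profile]                         # opposite shape

print(pearson(profile, scaled_and_shifted))  # close to +1
print(pearson(profile, flipped))             # close to -1
```

The scaled and shifted copy illustrates the invariance: its magnitudes differ a lot from the original, but its shape is identical.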

PEARSON LINEAR CORRELATION
PLC only measures the degree of a linear relationship between two expression profiles!
[Figure: two expression profiles annotated with their ρ and d_p values; the green curve is the square of the blue curve.]
This relationship is not captured by PLC.
[Figure: more correlation examples.]
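The limitation can be checked numerically. In this sketch (hypothetical data, not the curves from the slide) the second profile is exactly the square of the first, yet the correlation is zero because the relationship is not linear.

```python
import math

def pearson(x, y):
    """Pearson linear correlation of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

blue = [-2.0, -1.0, 0.0, 1.0, 2.0]
green = [v ** 2 for v in blue]  # fully determined by blue, but not linearly

rho = pearson(blue, green)
print(rho)
```

The exact zero depends on the profile being symmetric around its mean; for other profiles the correlation with the square is usually small or moderate rather than zero.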

MISSING VALUES
A common problem with microarray data. One approach with Euclidean distance or PLC is simply to ignore missing values (i.e., pretend the data has fewer dimensions). There are more sophisticated approaches that use information such as the continuity of a time series or related genes to estimate missing values; it is better to use these if possible.

MISSING VALUES
The green profile is missing the point in the middle. If we just ignore the missing point, the green and blue profiles will be perfectly correlated (and will also have a smaller Euclidean distance than the red and blue profiles).
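A sketch of the "just ignore missing values" approach: only positions measured in both profiles are used. Representing a missing value as None is an assumption of this example, and the blue and green profiles are hypothetical numbers, not the ones plotted on the slide.

```python
import math

def shared_values(x, y):
    """Keep only the positions where both profiles have a measurement."""
    return [(a, b) for a, b in zip(x, y) if a is not None and b is not None]

def euclidean_ignoring_missing(x, y):
    pairs = shared_values(x, y)
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs))

def pearson_ignoring_missing(x, y):
    pairs = shared_values(x, y)
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_a * var_b)

blue = [1.0, 2.0, 6.0, 4.0, 5.0]    # has a peak in the middle
green = [2.0, 3.0, None, 5.0, 6.0]  # middle point is missing

print(pearson_ignoring_missing(blue, green))    # close to 1: the peak is never compared
print(euclidean_ignoring_missing(blue, green))  # computed over 4 dimensions, not 5
```

The example shows the risk described above: the one point where the profiles might disagree is exactly the one that gets dropped.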

HIERARCHICAL AGGLOMERATIVE CLUSTERING
We start with every data point in a separate cluster. We keep merging the most similar pairs of data points/clusters until we have one big cluster left. This is called a bottom-up or agglomerative method.

HIERARCHICAL CLUSTERING
This produces a binary tree or dendrogram. The final cluster is the root and each data item is a leaf. The height of the bars indicates how close the items are.

LINKAGE IN HIERARCHICAL CLUSTERING
We already know about distance measures between data items, but what about between a data item and a cluster, or between two clusters? We just treat a data point as a cluster with a single item, so our only problem is to define a linkage method between clusters. As usual, there are lots of choices…
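The bottom-up procedure can be sketched as a naive loop with a pluggable linkage function. This is an illustration under stated assumptions, not an efficient or reference implementation: the default linkage (minimum pairwise Euclidean distance), the function names and the four example points are all choices made for this example.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def min_pairwise_distance(cluster_a, cluster_b, points):
    """Assumed default linkage: smallest distance between any two members."""
    return min(euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b)

def agglomerate(points, linkage=min_pairwise_distance):
    """Merge the closest pair of clusters until a single cluster is left.

    Clusters are lists of point indices. Returns the merges in order as
    (members_a, members_b, distance); the distances correspond to the bar
    heights of the dendrogram.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.5]]
merges = agglomerate(points)
for members_a, members_b, distance in merges:
    print(members_a, "+", members_b, "at distance", round(distance, 2))
```

With n points the loop always performs n - 1 merges, which is why the resulting tree is binary.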

AVERAGE LINKAGE
Eisen's cluster program defines average linkage as follows: each cluster c_i is associated with a mean vector μ_i, which is the mean of all the data items in the cluster. The distance between two clusters c_i and c_j is then just d(μ_i, μ_j). This is somewhat non-standard: this method is usually referred to as centroid linkage, and average linkage is defined as the average of all pairwise distances between points in the two clusters.

SINGLE LINKAGE
The minimum of all pairwise distances between points in the two clusters. Tends to produce long, "loose" clusters.

COMPLETE LINKAGE
The maximum of all pairwise distances between points in the two clusters. Tends to produce very tight clusters.

CENTROID LINKAGE
Used only for Euclidean distance. The distance between two clusters is the Euclidean distance between their centroids, as calculated by the arithmetic mean.
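The four linkage definitions can be compared side by side on the same pair of clusters. This is a sketch with hypothetical data; "average linkage" here follows the standard definition (mean of all pairwise distances), not the mean-vector variant described for Eisen's program, which corresponds to the centroid function below.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise_distances(cluster_a, cluster_b):
    return [euclidean(p, q) for p in cluster_a for q in cluster_b]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

def centroid(cluster):
    """Arithmetic mean of the points, coordinate by coordinate."""
    return [sum(column) / len(cluster) for column in zip(*cluster)]

def centroid_linkage(cluster_a, cluster_b):
    return euclidean(centroid(cluster_a), centroid(cluster_b))

# Two hypothetical clusters of two points each.
cluster_a = [[0.0, 0.0], [0.0, 2.0]]
cluster_b = [[4.0, 0.0], [4.0, 2.0]]

for name, linkage in [("single", single_linkage), ("complete", complete_linkage),
                      ("average", average_linkage), ("centroid", centroid_linkage)]:
    print(name, round(linkage(cluster_a, cluster_b), 3))
```

Single linkage never exceeds average linkage, which never exceeds complete linkage, since they are the minimum, mean and maximum of the same set of distances.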

K-MEANS CLUSTERING
1. Choose a number of clusters k.
2. Initialize the cluster centers μ_1, …, μ_k. We could pick k data points and set the cluster centers to these points, or we could randomly assign points to clusters and take the means of the clusters.
3. For each data point, compute the cluster center it is closest to (using some distance measure) and assign the data point to this cluster.
4. Re-compute the cluster centers (mean of the data points in each cluster).
5. Repeat steps 3 and 4; stop when there are no new re-assignments.
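The procedure above can be sketched in plain Python as follows. Assumptions made for this example: centers are initialized by picking k data points at random, Euclidean distance is the distance measure, a cluster that becomes empty keeps its old center, and the names, the seed parameter and the toy data are hypothetical.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_point(points):
    return [sum(column) / len(points) for column in zip(*points)]

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Initialization: pick k data points as the starting centers.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no re-assignments, so we stop
        assignment = new_assignment
        # Update step: each center becomes the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = mean_point(members)
    return assignment, centers

# Two hypothetical, well-separated groups of three points each.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
assignment, centers = k_means(points, k=2)
print(assignment)
print(centers)
```

Which group ends up labelled 0 or 1 depends on the random initialization; on less clearly separated data the grouping itself can change with the seed.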

K-MEANS CLUSTERING
How many clusters do you think there are in this data? How might it have been generated?

K-MEANS CLUSTERING

K-MEANS CLUSTERING ISSUES
Random initialization means that you may get different clusters each time. Data points are assigned to only one cluster (hard assignment). There are implicit assumptions about the "shapes" of clusters. You have to pick the number of clusters…

THE K-MEANS CLUSTERING METHOD

EXTERNAL EVALUATION