
1 Canadian Bioinformatics Workshops


3 Module 5: Clustering
Exploratory Data Analysis and Essential Statistics using R
Boris Steipe, Toronto, September 8–
Image: Herakles and Iolaos battle the Hydra. Classical (450-400 BCE)
DEPARTMENT OF BIOCHEMISTRY & MOLECULAR GENETICS
Includes material originally developed by Sohrab Shah

4 Introduction to clustering
What is clustering?
- unsupervised learning
- discovery of patterns in data
- class discovery
Clustering groups together "objects" that are most similar (or least dissimilar); the objects may be genes, or samples, or both.
Example questions: Are there samples in my cohort that can be subgrouped based on molecular profiling? Do these groups correlate with clinical outcome?

5 Distance metrics
In order to perform clustering, we need a way to measure how similar (or dissimilar) two objects are.
- Euclidean distance: d(x, y) = sqrt(sum_i (x_i - y_i)^2)
- Manhattan distance: d(x, y) = sum_i |x_i - y_i|
- 1 - correlation: proportional to the Euclidean distance between standardized profiles, but invariant to the range of measurement from one sample to the next
(Figure: example heatmap, colour scale running from dissimilar to similar.)
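These metrics are easy to compare in base R. A minimal sketch on two made-up profiles (the vectors are illustrative, not from the slides):

```r
# Two toy profiles with the same pattern but different scale
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)

m <- rbind(x, y)
euclid    <- as.numeric(dist(m, method = "euclidean"))  # sqrt(sum((x - y)^2))
manhattan <- as.numeric(dist(m, method = "manhattan"))  # sum(|x - y|)
one.cor   <- 1 - cor(x, y)                              # invariant to scale

c(euclidean = euclid, manhattan = manhattan, one.minus.cor = one.cor)
```

Note that 1 - correlation is 0 here even though the Euclidean and Manhattan distances are large: the two profiles have the same pattern, just a different range of measurement.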

6 Distance metrics compared
(Figure: the same data clustered under Euclidean, Manhattan, and 1 - Pearson distances; the resulting dendrograms differ.)
Conclusion: distance matters!

7 Other distance metrics
Hamming distance, for ordinal, binary or categorical data: the number of positions at which two vectors differ.
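For example, a minimal Hamming distance in base R (the toy vectors are made up for illustration):

```r
# Hamming distance: the number of positions at which two vectors differ
hamming <- function(a, b) sum(a != b)

g1 <- c("A", "T", "G", "C", "A")
g2 <- c("A", "T", "C", "C", "T")
hamming(g1, g2)  # positions 3 and 5 differ, so the distance is 2
```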

8 Approaches to clustering
- Partitioning methods: K-means, K-medoids (partitioning around medoids)
- Model based approaches
- Hierarchical methods: nested clusters; start with pairs and build a tree up to the root

9 Partitioning methods
Anatomy of a partitioning based method:
Input: data matrix, distance function, number of groups
Output: group assignment of every object

10 Partitioning based methods
1. Choose K groups
2. Initialise group centers (aka centroids, or medoids)
3. Assign each object to the nearest centroid according to the distance metric
4. Reassign (or recompute) centroids
5. Repeat steps 3-4 until the assignment stabilizes
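This recipe is what base R's kmeans() implements. A small sketch on simulated data (the data and parameter choices here are illustrative):

```r
set.seed(42)  # initialisation is random, so fix the seed for reproducibility
# Two well-separated groups of 20 points each, in 2 dimensions
m <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
km <- kmeans(m, centers = 2, nstart = 10)  # nstart: multiple random restarts
km$cluster  # group assignment of every object
km$centers  # the final centroids
```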

11 K-means vs K-medoids
K-means (kmeans):
- centroids are the 'mean' of the clusters
- centroids need to be recomputed every iteration
- initialisation is difficult, as the notion of a centroid may be unclear before beginning
K-medoids (pam):
- centroids are actual objects that minimize the total within-cluster distance
- a centroid can be determined by a quick look-up in the distance matrix
- initialisation is simply K randomly selected objects
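The pam() function from the cluster package (shipped with standard R installations) is the K-medoids counterpart. A sketch, again on made-up data:

```r
library(cluster)  # provides pam(); part of the standard R distribution

set.seed(1)
# Two groups of 10 points each, in 2 dimensions
m <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 4), ncol = 2))
p <- pam(m, k = 2)
p$medoids  # unlike k-means centroids, each medoid is an actual data point
```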

12 Partitioning based methods
Advantages:
- the number of groups is well defined
- a clear, deterministic assignment of every object to a group
- simple algorithms for inference
Disadvantages:
- you have to choose the number of groups
- sometimes objects do not fit well into any cluster
- can converge on locally optimal solutions, and often require multiple restarts with random initializations

13 Agglomerative hierarchical clustering

14 Hierarchical clustering
Anatomy of hierarchical clustering:
Input: distance matrix, linkage method
Output: a dendrogram, i.e. a tree that defines the relationships between objects and the distances between clusters, giving a nested sequence of clusters

15 Linkage methods
Cluster-to-cluster distance can be defined in several ways: single linkage, complete linkage, distance between centroids, and average linkage. (Figure: schematic of each linkage method.)

16 Linkage methods: Ward (1963)
Form partitions that minimize the loss associated with each grouping, where loss is defined as the error sum of squares (ESS).
Consider 10 objects with scores (2, 6, 5, 6, 2, 2, 2, 0, 0, 0):
ESS_one group = (2 - 2.5)^2 + (6 - 2.5)^2 + ... + (0 - 2.5)^2 = 50.5
On the other hand, if the 10 objects are classified according to their scores into four sets, {0,0,0}, {2,2,2,2}, {5}, {6,6}, the ESS is the sum of four separate error sums of squares:
ESS_four groups = ESS_group1 + ESS_group2 + ESS_group3 + ESS_group4 = 0.0
Each group's members are identical, so each group's ESS is 0. Thus, clustering the 10 scores into 4 clusters results in no loss of information.
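The arithmetic on this slide can be verified in a few lines of base R:

```r
# Error sum of squares of a group of scores
ess <- function(x) sum((x - mean(x))^2)

scores <- c(2, 6, 5, 6, 2, 2, 2, 0, 0, 0)
ess(scores)  # all 10 scores in one group: 50.5

groups <- list(c(0, 0, 0), c(2, 2, 2, 2), c(5), c(6, 6))
sum(sapply(groups, ess))  # four homogeneous groups: 0
```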

17 Linkage methods in action
clustering based on single linkage:
single <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "single")
plot(single)

18 Linkage methods in action
clustering based on complete linkage:
complete <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "complete")
plot(complete)

19 Linkage methods in action
clustering based on centroid linkage:
centroid <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "centroid")
plot(centroid)

20 Linkage methods in action
clustering based on average linkage:
average <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "average")
plot(average)

21 Linkage methods in action
clustering based on Ward linkage:
ward <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "ward.D")  # "ward" in R < 3.1.0
plot(ward)

22 Linkage methods in action
Conclusion: linkage matters!

23 Hierarchical clustering analyzed
Advantages:
- small clusters can be nested inside large ones
- no need to specify the number of groups ahead of time
- flexible linkage methods
Disadvantages:
- clusters might not be naturally represented by a hierarchical structure
- it is necessary to 'cut' the dendrogram in order to produce clusters
- bottom-up clustering can result in poor structure at the top of the tree; early joins cannot be 'undone'

24 Model based approaches
Assume the data are 'generated' from a mixture of K distributions. What cluster assignment and parameters of the K distributions best explain the data? 'Fit' a model to the data and try to get the best fit. Classical example: a mixture of Gaussians (mixture of normals). This takes advantage of probability theory and well-defined distributions in statistics.
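To make the idea concrete, here is a deliberately minimal EM fit of a two-component 1-D Gaussian mixture in base R. This is a teaching sketch on simulated data; real analyses would typically use a dedicated package such as mclust:

```r
set.seed(7)
# Simulated data: two Gaussian components with means 0 and 6
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 6))

# Initial guesses for the weights, means, and standard deviations
w <- c(0.5, 0.5); mu <- c(min(x), max(x)); s <- c(1, 1)
for (iter in 1:50) {
  # E step: responsibility of component 1 for each point
  d1 <- w[1] * dnorm(x, mu[1], s[1])
  d2 <- w[2] * dnorm(x, mu[2], s[2])
  r  <- d1 / (d1 + d2)
  # M step: re-estimate the parameters from the soft assignments
  w  <- c(mean(r), 1 - mean(r))
  mu <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
  s  <- c(sqrt(sum(r * (x - mu[1])^2) / sum(r)),
          sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))
}
cluster <- ifelse(r > 0.5, 1, 2)  # hard cluster assignment from soft responsibilities
round(mu, 1)  # recovered component means, close to the true 0 and 6
```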

25 Model based clustering: array CGH

26 Model based clustering of aCGH
Problem: patient cohorts often exhibit molecular heterogeneity, making rarer shared CNAs hard to detect.
Approach: cluster the data by extending the profiling to the multi-group setting. Shah et al. (Bioinformatics, 2009): HMM-Mix, a mixture of HMMs.
(Figure: plate diagram of HMM-Mix, relating groups g, sparse profiles, profile states c, the distribution of calls in a group, patients p, states k, CNA calls, and raw data.)

27 Advantages of model based approaches
In addition to clustering patients into groups, we obtain a 'model' that best represents the patients in each group. We can then associate each model with clinical variables, and simply output a classifier to be used on new patients. Choosing the number of groups becomes a model selection problem (cf. the Bayesian Information Criterion); see Yeung et al., Bioinformatics (2001).

28 Clustering 106 follicular lymphoma patients with HMM-Mix
(Figure: profiles at initialisation and after convergence, with clinical annotations.)
The converged clustering recapitulates known FL subgroups, and the subgroups have clinical relevance.

29 Feature selection
Most features (genes, SNP probesets, BAC clones) in high dimensional datasets will be uninformative; examples include unexpressed genes, housekeeping genes, and 'passenger alterations'. Clustering (and classification) has a much higher chance of success if uninformative features are removed.
Simple approaches:
- select intrinsically variable genes
- require a minimum level of expression in a proportion of samples
See the genefilter package (Bioconductor): Lab 1. We return to feature selection in the context of classification.
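The 'intrinsically variable genes' approach is easy to sketch in base R. The matrix and the cut-off of 100 genes below are made up for illustration (the genefilter package provides more refined filters), and the name exprMatSub simply mirrors the variable used in the slides' clustering code:

```r
set.seed(3)
# Hypothetical expression matrix: 1000 genes (rows) x 20 samples (columns)
exprMat <- matrix(rnorm(1000 * 20), nrow = 1000,
                  dimnames = list(paste0("gene", 1:1000), NULL))

v    <- apply(exprMat, 1, var)                    # per-gene variance
keep <- names(sort(v, decreasing = TRUE))[1:100]  # 100 most variable genes
exprMatSub <- exprMat[keep, ]                     # reduced matrix for clustering
dim(exprMatSub)  # 100 x 20
```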

30 Advanced topics in clustering
- top down clustering
- bi-clustering, or 'two-way' clustering
- principal components analysis
- choosing the number of groups: model selection (AIC, BIC), the silhouette coefficient, the gap curve
- joint clustering and feature selection
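As one concrete illustration of the choosing-the-number-of-groups problem, the simple "elbow" heuristic inspects the total within-cluster sum of squares as K increases (simulated data; AIC/BIC, the silhouette coefficient, and the gap statistic listed above are the more principled alternatives):

```r
set.seed(11)
# Three well-separated groups of 30 points each, in 2 dimensions
m <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
           matrix(rnorm(60, mean = 5),  ncol = 2),
           matrix(rnorm(60, mean = 10), ncol = 2))
wss <- sapply(1:6, function(k) kmeans(m, centers = k, nstart = 10)$tot.withinss)
round(wss)  # drops sharply up to K = 3, then flattens: the "elbow" is at 3
```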

31 What Have We Learned?
There are three main types of clustering approaches: hierarchical, partitioning, and model based. Feature selection is important: it reduces computational time and makes it more likely that well-separated groups are identified. The distance metric matters. The linkage method matters in hierarchical clustering. Model based approaches offer principled probabilistic methods.

32 We are on a Coffee Break & Networking Session

