Canadian Bioinformatics Workshops www.bioinformatics.ca
Module 5: Clustering
Exploratory Data Analysis and Essential Statistics using R
Boris Steipe, Toronto, September 8–9, 2011 †
Cover image: Herakles and Iolaos battle the Hydra. Classical period (450–400 BCE)
Department of Biochemistry · Department of Molecular Genetics
† Includes material originally developed by Sohrab Shah
Introduction to clustering
What is clustering?
- unsupervised learning: discovery of patterns in data; class discovery
- grouping together "objects" that are most similar (or least dissimilar); objects may be genes, samples, or both
Example questions: Are there samples in my cohort that can be subgrouped based on molecular profiling? Do these subgroups correlate with clinical outcome?
Distance metrics
To perform clustering, we need a way to measure how similar (or dissimilar) two objects are.
- Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
- Manhattan distance: $d(x, y) = \sum_i |x_i - y_i|$
- 1 − correlation: $d(x, y) = 1 - r_{xy}$; for standardized data this is proportional to the squared Euclidean distance, but it is invariant to the range of measurement from one sample to the next
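For concreteness (my addition, not on the original slide), all three distances can be computed in base R; exprMat is assumed to be a genes-by-samples expression matrix, so we transpose to compare samples:

# toy genes-by-samples matrix; rows = genes, columns = samples
set.seed(1)
exprMat <- matrix(rnorm(20), nrow = 4,
                  dimnames = list(paste0("g", 1:4), paste0("s", 1:5)))
d.euc <- dist(t(exprMat), method = "euclidean")   # Euclidean distance between samples
d.man <- dist(t(exprMat), method = "manhattan")   # Manhattan distance between samples
d.cor <- as.dist(1 - cor(exprMat))                # 1 - Pearson correlation between samples

Note that cor() works column-wise, so it is applied to the untransposed matrix.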
Distance metrics compared
[Figure: the same data clustered under Euclidean, Manhattan, and 1 − Pearson distances]
Conclusion: distance matters!
Other distance metrics
Hamming distance, for ordinal, binary, or categorical data: $d(x, y) = \sum_i \mathbf{1}[x_i \neq y_i]$, i.e. the number of positions at which the two vectors differ.
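A minimal sketch (my addition): for two vectors of equal length, the Hamming distance simply counts mismatches:

hamming <- function(x, y) sum(x != y)   # number of differing positions
hamming(c(0, 1, 1, 0), c(1, 1, 0, 0))   # returns 2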
Approaches to clustering
- Partitioning methods: K-means, K-medoids (partitioning around medoids)
- Model-based approaches
- Hierarchical methods: nested clusters; start with pairs and build a tree up to the root
Partitioning methods
Anatomy of a partitioning-based method:
- Input: data matrix, distance function, number of groups
- Output: group assignment of every object
Partitioning-based methods
1. Choose K groups and initialise the group centers (aka centroids or medoids)
2. Assign each object to the nearest centroid according to the distance metric
3. Reassign (or recompute) the centroids
4. Repeat steps 2 and 3 until the assignment stabilizes
A from-scratch sketch of these steps follows below.
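A from-scratch sketch of the loop (illustration only; the function and variable names are mine, and in practice you would use kmeans() or pam()):

simpleKmeans <- function(X, K, maxIter = 100) {
  # X: numeric matrix, one object per row; assumes no cluster ever becomes empty
  centroids <- X[sample(nrow(X), K), , drop = FALSE]   # 1. initialise centers
  cluster <- rep(0, nrow(X))
  for (iter in seq_len(maxIter)) {
    # 2. assign each object to the nearest centroid (squared Euclidean distance)
    d <- sapply(seq_len(K), function(k) colSums((t(X) - centroids[k, ])^2))
    newCluster <- max.col(-d)
    if (all(newCluster == cluster)) break              # 4. assignment has stabilized
    cluster <- newCluster
    # 3. recompute each centroid as the mean of its cluster
    centroids <- apply(X, 2, function(col) tapply(col, cluster, mean))
  }
  list(cluster = cluster, centers = centroids)
}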
K-means vs. K-medoids
K-means (R: kmeans):
- centroids are the 'mean' of the clusters
- centroids need to be recomputed at every iteration
- initialisation is difficult, as the notion of a centroid may be unclear before beginning
K-medoids (R: pam):
- centroids are actual objects that minimize the total within-cluster distance
- centroids can be determined by a quick lookup into the distance matrix
- initialisation is simply K randomly selected objects
A usage sketch follows below.
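In R these are kmeans() in base and pam() from the cluster package; a usage sketch (three groups is an arbitrary choice), continuing with the transposed expression matrix:

library(cluster)                                     # provides pam()
km <- kmeans(t(exprMat), centers = 3, nstart = 25)   # K-means with 25 random restarts
pm <- pam(dist(t(exprMat)), k = 3)                   # K-medoids on a distance matrix
table(km$cluster, pm$clustering)                     # cross-tabulate the two groupings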
Partitioning-based methods
Advantages:
- the number of groups is well defined
- a clear, deterministic assignment of every object to a group
- simple algorithms for inference
Disadvantages:
- you have to choose the number of groups
- sometimes objects do not fit well into any cluster
- can converge on locally optimal solutions, often requiring multiple restarts with random initializations
Agglomerative hierarchical clustering
Hierarchical clustering
Anatomy of hierarchical clustering:
- Input: distance matrix, linkage method
- Output: dendrogram, a tree that defines the relationships between objects and the distance between clusters; a nested sequence of clusters
Linkage methods
- single: distance between the closest pair of objects in the two clusters
- complete: distance between the farthest pair of objects in the two clusters
- centroid: distance between the cluster centroids
- average: mean distance over all pairs of objects in the two clusters
Linkage methods
Ward (1963): form partitions that minimize the loss associated with each grouping, where loss is defined as the error sum of squares (ESS).
Consider 10 objects with scores (2, 6, 5, 6, 2, 2, 2, 0, 0, 0). Treated as a single group:
$ESS_{\text{one group}} = (2 - 2.5)^2 + (6 - 2.5)^2 + \dots + (0 - 2.5)^2 = 50.5$
On the other hand, if the 10 objects are classified according to their scores into the four sets {0, 0, 0}, {2, 2, 2, 2}, {5}, {6, 6}, the ESS is the sum of four separate error sums of squares:
$ESS_{\text{four groups}} = ESS_{1} + ESS_{2} + ESS_{3} + ESS_{4} = 0.0$
Thus, clustering the 10 scores into these 4 clusters results in no loss of information.
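The arithmetic can be verified directly in R (my addition):

ess <- function(x) sum((x - mean(x))^2)    # error sum of squares
scores <- c(2, 6, 5, 6, 2, 2, 2, 0, 0, 0)
ess(scores)                                # 50.5: all ten scores as one group
ess(c(0, 0, 0)) + ess(c(2, 2, 2, 2)) + ess(5) + ess(c(6, 6))   # 0: four groups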
Linkage methods in action
Clustering based on single linkage:
single <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "single")
plot(single)
Linkage methods in action
Clustering based on complete linkage:
complete <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "complete")
plot(complete)
Linkage methods in action
Clustering based on centroid linkage:
centroid <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "centroid")
plot(centroid)
Linkage methods in action
Clustering based on average linkage:
average <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "average")
plot(average)
Linkage methods in action
Clustering based on Ward linkage (note: R >= 3.1.0 names this method "ward.D" or "ward.D2"):
ward <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "ward")
plot(ward)
Linkage methods in action Conclusion: linkage matters!
Hierarchical clustering analyzed
Advantages:
- can reveal small clusters nested inside large ones
- no need to specify the number of groups ahead of time
- flexible linkage methods
Disadvantages:
- clusters might not be naturally represented by a hierarchical structure
- it is necessary to 'cut' the dendrogram to produce clusters (see the sketch below)
- bottom-up clustering can result in poor structure at the top of the tree; early joins cannot be 'undone'
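Producing groups from the tree is done with cutree(); a sketch reusing the complete-linkage tree from above:

groups <- cutree(complete, k = 3)     # cut the dendrogram into 3 clusters
table(groups)                         # cluster sizes
# alternatively, cut at a fixed height rather than a fixed number of clusters:
groups.h <- cutree(complete, h = 10)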
Model-based approaches
- Assume the data are 'generated' from a mixture of K distributions
- Ask: what cluster assignment and what parameters of the K distributions best explain the data?
- 'Fit' a model to the data and try to get the best fit
- Classical example: mixture of Gaussians (mixture of normals)
- Takes advantage of probability theory and well-defined distributions in statistics
A sketch with the mclust package follows below.
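A minimal sketch using the mclust package (my choice; the slide does not name a package), which fits Gaussian mixtures by EM over a range of K:

library(mclust)                  # model-based clustering via Gaussian mixtures
fit <- Mclust(t(exprMatSub))     # fits a family of models, selects the best by BIC
summary(fit)                     # chosen covariance model and number of components
head(fit$classification)         # cluster assignment of each sample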
Model based clustering: array CGH
Model-based clustering of aCGH
Problem: patient cohorts often exhibit molecular heterogeneity, making rarer shared CNAs hard to detect.
Approach: cluster the data by extending the profiling model to the multi-group setting, using a mixture of HMMs: HMM-Mix (Shah et al., Bioinformatics, 2009).
[Figure: HMM-Mix plate diagram. For each group g, a sparse profile with profile state c gives the distribution of calls in the group; for each patient p, hidden state k generates CNA calls from the raw data]
Advantages of model-based approaches
- In addition to clustering patients into groups, we output a 'model' that best represents the patients in a group
- We can then associate each model with clinical variables, and simply output a classifier to be used on new patients
- Choosing the number of groups becomes a model selection problem (cf. the Bayesian Information Criterion); see Yeung et al., Bioinformatics (2001)
A BIC-based sketch follows below.
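Continuing the mclust sketch, the BIC surface used for model selection can be inspected directly:

bic <- mclustBIC(t(exprMatSub), G = 1:9)   # BIC for 1 to 9 mixture components
plot(bic)                                  # higher BIC = better-supported model
summary(bic)                               # top models ranked by BIC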
Clustering 106 follicular lymphoma patients with HMM-Mix
[Figure: heatmaps at initialisation and after convergence, with the inferred profiles and clinical annotations]
- Recapitulates known FL subgroups
- Subgroups have clinical relevance
Feature selection
Most features (genes, SNP probesets, BAC clones) in high-dimensional datasets will be uninformative; examples: unexpressed genes, housekeeping genes, 'passenger' alterations.
Clustering (and classification) has a much higher chance of success if uninformative features are removed.
Simple approaches (sketched below):
- select intrinsically variable genes
- require a minimum level of expression in a proportion of samples
- genefilter package (Bioconductor): see Lab 1
We will return to feature selection in the context of classification.
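A base-R sketch of both filters, assuming exprMat is a full genes-by-samples matrix (the thresholds 1000, 100, and 25% are arbitrary placeholders); the genefilter package wraps the same ideas:

vars <- apply(exprMat, 1, var)                               # per-gene variance
topVar <- exprMat[order(vars, decreasing = TRUE)[1:1000], ]  # 1000 most variable genes
expressed <- rowMeans(exprMat > 100) >= 0.25                 # above 100 in >= 25% of samples
filtered <- exprMat[expressed, , drop = FALSE]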
Advanced topics in clustering
- Top-down clustering
- Bi-clustering, or 'two-way' clustering
- Principal components analysis
- Choosing the number of groups: model selection (AIC, BIC), the silhouette coefficient (sketched below), the gap curve
- Joint clustering and feature selection
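A sketch of the silhouette coefficient for choosing K, using pam() from the cluster package:

library(cluster)
d <- dist(t(exprMatSub))
avgSil <- sapply(2:8, function(k) pam(d, k = k)$silinfo$avg.width)  # mean silhouette width per K
plot(2:8, avgSil, type = "b", xlab = "K", ylab = "average silhouette width")
# the K with the largest average silhouette width gives the best-separated clustering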
What Have We Learned?
- There are three main types of clustering approaches: hierarchical, partitioning, and model-based
- Feature selection is important: it reduces computational time and makes it more likely to identify well-separated groups
- The distance metric matters
- The linkage method matters in hierarchical clustering
- Model-based approaches offer principled probabilistic methods
We are on a Coffee Break & Networking Session