Canadian Bioinformatics Workshops
Module 2: Clustering, Classification and Feature Selection
Sohrab Shah
Centre for Translational and Applied Genomics, Molecular Oncology Breast Cancer Research Program, BC Cancer Agency
Module Overview
Introduction to clustering
– distance metrics
– hierarchical, partitioning and model based clustering
Introduction to classification
– building a classifier
– avoiding overfitting
– cross validation
Feature selection in clustering and classification
Introduction to clustering
What is clustering?
– unsupervised learning
– discovery of patterns in data
– class discovery
Grouping together "objects" that are most similar (or least dissimilar)
– objects may be genes, or samples, or both
Example question: Are there samples in my cohort that can be subgrouped based on molecular profiling?
– Do these groups correlate with clinical outcome?
Distance metrics
In order to perform clustering, we need a way to measure how similar (or dissimilar) two objects are.
Euclidean distance: d(x, y) = sqrt( Σ_i (x_i − y_i)² )
Manhattan distance: d(x, y) = Σ_i |x_i − y_i|
1 − correlation: d(x, y) = 1 − r(x, y), where r is the Pearson correlation
– proportional to the squared Euclidean distance between standardized profiles, so it is invariant to the range of measurement from one sample to the next
Small distance = similar; large distance = dissimilar
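As a minimal sketch in R, all three distances between samples can be computed from an expression matrix (here assumed to be called exprMat, genes in rows, samples in columns):

d.euc <- dist(t(exprMat), method = "euclidean")   # Euclidean distance between samples
d.man <- dist(t(exprMat), method = "manhattan")   # Manhattan distance between samples
d.cor <- as.dist(1 - cor(exprMat))                # 1 - Pearson correlation between samples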
Distance metrics compared
[Figure: the same data clustered using Euclidean, Manhattan and 1 − Pearson distances]
Conclusion: distance matters!
Other distance metrics
Hamming distance for ordinal, binary or categorical data: the number of positions at which two vectors differ
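A minimal sketch of the Hamming distance in R, for two equal-length vectors x and y (assumed names for illustration):

hamming <- function(x, y) sum(x != y)   # count the positions where the two vectors disagree
hamming(c("A", "B", "B", "A"), c("A", "B", "A", "A"))   # returns 1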
Approaches to clustering
Partitioning methods
– K-means
– K-medoids (partitioning around medoids)
– model based approaches
Hierarchical methods
– nested clusters: start with pairs, build a tree up to the root
Partitioning methods
Anatomy of a partitioning based method
– data matrix
– distance function
– number of groups
Output
– group assignment of every object
Partitioning based methods
Choose K groups
– initialise group centers (aka centroids or medoids)
– assign each object to the nearest centroid according to the distance metric
– reassign (or recompute) centroids
– repeat the last two steps until the assignments stabilize
K-medoids in action
K-means vs K-medoids (R functions: kmeans and pam; see the sketch below)
K-means:
– centroids are the 'mean' of the clusters
– centroids need to be recomputed every iteration
– initialisation is difficult, as the notion of a centroid may be unclear before beginning
K-medoids:
– centroids are an actual object that minimizes the total within-cluster distance
– the centroid can be determined by a quick look-up into the distance matrix
– initialisation is simply K randomly selected objects
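A minimal sketch of both in R (assuming a numeric matrix exprMat with samples in columns; pam comes from the cluster package):

library(cluster)
km <- kmeans(t(exprMat), centers = 3)     # K-means: cluster samples into 3 groups
km$cluster                                # group assignment of every sample
pm <- pam(dist(t(exprMat)), k = 3)        # K-medoids (PAM) run directly on a distance matrix
pm$clustering                             # group assignment of every sample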
Partitioning based methods
Advantages:
– the number of groups is well defined
– a clear, deterministic assignment of an object to a group
– simple algorithms for inference
Disadvantages:
– have to choose the number of groups
– sometimes objects do not fit well into any cluster
– can converge on locally optimal solutions and often require multiple restarts with random initialisations
Agglomerative hierarchical clustering
Hierarchical clustering
Anatomy of hierarchical clustering
– distance matrix
– linkage method
Output
– dendrogram: a tree that defines the relationships between objects and the distance between clusters; a nested sequence of clusters
Linkage methods
– single: distance between the closest pair of objects in the two clusters
– complete: distance between the farthest pair of objects in the two clusters
– average: mean distance over all pairs of objects between the two clusters
– centroid: distance between the cluster centroids
Linkage methods
Ward (1963)
– form partitions that minimize the loss associated with each grouping
– loss defined as the error sum of squares (ESS)
– consider 10 objects with scores (2, 6, 5, 6, 2, 2, 2, 0, 0, 0)
Treated as one group (mean 2.5):
ESS_one group = (2 − 2.5)² + (6 − 2.5)² + … + (0 − 2.5)² = 50.5
If instead the 10 objects are classified according to their scores into four sets {0,0,0}, {2,2,2,2}, {5}, {6,6}, the ESS is the sum of the four separate within-group error sums of squares:
ESS_four groups = ESS_group1 + ESS_group2 + ESS_group3 + ESS_group4 = 0.0
Thus, clustering the 10 scores into 4 clusters results in no loss of information.
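This arithmetic is easy to check in R:

scores <- c(2, 6, 5, 6, 2, 2, 2, 0, 0, 0)
sum((scores - mean(scores))^2)                          # ESS treating all 10 scores as one group: 50.5
groups <- list(c(0, 0, 0), c(2, 2, 2, 2), c(5), c(6, 6))
sum(sapply(groups, function(g) sum((g - mean(g))^2)))   # ESS summed over the four groups: 0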
Linkage methods in action
clustering based on single linkage
single <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "single")
plot(single)
Linkage methods in action
clustering based on complete linkage
complete <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "complete")
plot(complete)
Linkage methods in action
clustering based on centroid linkage
centroid <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "centroid")
plot(centroid)
Linkage methods in action
clustering based on average linkage
average <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "average")
plot(average)
Linkage methods in action
clustering based on Ward linkage
ward <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "ward")   # newer versions of R call this method "ward.D"
plot(ward)
Linkage methods in action
Conclusion: linkage matters!
Hierarchical clustering analyzed
Advantages:
– there may be small clusters nested inside large ones
– no need to specify the number of groups ahead of time
– flexible linkage methods
Disadvantages:
– clusters might not be naturally represented by a hierarchical structure
– it is necessary to 'cut' the dendrogram in order to produce clusters (see the sketch below)
– bottom-up clustering can result in poor structure at the top of the tree: early joins cannot be 'undone'
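A minimal sketch of cutting a dendrogram into groups, continuing the hclust examples above:

cl <- cutree(complete, k = 4)   # cut the complete-linkage tree into 4 clusters
table(cl)                       # how many samples fall into each cluster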
Model based approaches
Assume the data are 'generated' from a mixture of K distributions
– What cluster assignment and parameters of the K distributions best explain the data?
'Fit' a model to the data and try to get the best fit
Classical example: mixture of Gaussians (mixture of normals)
Take advantage of probability theory and well-defined distributions in statistics
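A minimal sketch of fitting a mixture of Gaussians in R, assuming the mclust package and a numeric matrix dat with objects in rows:

library(mclust)
fit <- Mclust(dat, G = 1:6)   # fit Gaussian mixtures with 1 to 6 components
fit$G                         # number of components selected (by BIC)
fit$classification            # cluster assignment of every object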
Model based clustering: array CGH
Model based clustering of aCGH
Problem: patient cohorts often exhibit molecular heterogeneity, making rarer shared CNAs hard to detect
Approach: cluster the data by extending the profiling to the multi-group setting using a mixture of HMMs: HMM-Mix
Shah et al (Bioinformatics, 2009)
[Figure: graphical model over patient p, group g and hidden states, with data levels from raw data to CNA calls, the distribution of calls in a group, and sparse group profiles]
Advantages of model based approaches
In addition to clustering patients into groups, we output a 'model' that best represents the patients in a group
We can then associate each model with clinical variables and simply output a classifier to be used on new patients
Choosing the number of groups becomes a model selection problem (e.g. using the Bayesian Information Criterion)
– see Yeung et al, Bioinformatics (2001)
Clustering 106 follicular lymphoma patients with HMM-Mix
[Figure: initialisation, converged group profiles and clinical outcome]
Recapitulates known FL subgroups
Subgroups have clinical relevance
Feature selection
Most features (genes, SNP probesets, BAC clones) in high dimensional datasets will be uninformative
– examples: unexpressed genes, housekeeping genes, 'passenger' alterations
Clustering (and classification) has a much higher chance of success if uninformative features are removed
Simple approaches (a sketch follows below):
– select intrinsically variable genes
– require a minimum level of expression in a proportion of samples
– genefilter package (Bioconductor): Lab 1
We return to feature selection in the context of classification
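A minimal sketch of both ideas in base R, assuming a log2 expression matrix exprMat with genes in rows (the thresholds are arbitrary illustrative choices; the genefilter package provides ready-made filters for the same purpose):

gene.var <- apply(exprMat, 1, var)                   # variability of each gene across samples
keep.var <- gene.var > quantile(gene.var, 0.75)      # keep the top 25% most variable genes
keep.expr <- rowMeans(exprMat > log2(100)) >= 0.25   # expressed above a threshold in at least 25% of samples
exprMatSub <- exprMat[keep.var & keep.expr, ]        # filtered matrix used for clustering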
Advanced topics in clustering
– top-down clustering
– bi-clustering or 'two-way' clustering
– principal components analysis
– choosing the number of groups: model selection (AIC, BIC), the silhouette coefficient, the gap curve
– joint clustering and feature selection
What Have We Learned?
There are three main types of clustering approaches
– hierarchical
– partitioning
– model based
Feature selection is important
– reduces computational time
– more likely to identify well-separated groups
The distance metric matters
The linkage method matters in hierarchical clustering
Model based approaches offer principled probabilistic methods
Module Overview
Clustering
Classification
Feature Selection
Classification
What is classification?
– supervised learning
– discriminant analysis
Work from a set of objects with predefined classes
– e.g. basal vs luminal, or good responder vs poor responder
Task: learn from the features of the objects: what is the basis for discrimination?
Statistically and mathematically heavy
Classification
[Figure: learn a classifier from patients labelled 'good response' and 'poor response', then apply it to a new patient: what is the most likely response?]
Example: DLBCL subtypes
Wright et al, PNAS (2003)
DLBCL subtypes
Wright et al, PNAS (2003)
Classification approaches
Wright et al, PNAS (2003)
Weighted features in a linear predictor score (LPS):
LPS(X) = Σ_j a_j X_j
a_j: weight of gene j, determined by its t-test statistic
X_j: expression value of gene j
Assume there are 2 distinct distributions of LPS: 1 for ABC, 1 for GCB
Wright et al, DLBCL, cont'd
Use Bayes' rule to determine the probability that a sample comes from group 1:
P(group 1 | LPS(X)) = φ(LPS(X); μ1, σ1) / [ φ(LPS(X); μ1, σ1) + φ(LPS(X); μ2, σ2) ]
φ(·; μ1, σ1): probability density function (a normal density) that represents group 1
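A minimal sketch of this computation in R, assuming a vector of selected gene weights a, the new sample's expression values x for those genes, and group means and standard deviations (mu1, sd1, mu2, sd2) estimated from training data:

lps <- sum(a * x)                                           # linear predictor score for the new sample
p.group1 <- dnorm(lps, mu1, sd1) /
            (dnorm(lps, mu1, sd1) + dnorm(lps, mu2, sd2))   # Bayes' rule with normal densities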
Learning the classifier, Wright et al
Choosing the genes (feature selection): use cross validation
Leave-one-out cross validation (a sketch follows below):
– pick a set of samples
– use all but one of the samples as training, leaving one out for test
– fit the model using the training data
– can the classifier correctly pick the class of the remaining case?
– repeat exhaustively, leaving out each sample in turn
Repeat using different sets and numbers of genes based on the t-statistic
Pick the set of genes that gives the highest accuracy
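A minimal, generic sketch of leave-one-out cross validation in R; dat and labels are assumed inputs, and trainClassifier and predictClass are hypothetical placeholders for whatever model is being fit:

n <- nrow(dat)
correct <- logical(n)
for (i in seq_len(n)) {
  fit <- trainClassifier(dat[-i, ], labels[-i])      # train on all samples but one
  pred <- predictClass(fit, dat[i, , drop = FALSE])  # predict the held-out sample
  correct[i] <- (pred == labels[i])                  # was the held-out label recovered?
}
mean(correct)                                        # leave-one-out accuracy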
Overfitting
In many cases in biology, the number of features is much larger than the number of samples
Important features may not be represented in the training data
This can result in overfitting
– when a classifier discriminates well on its training data, but does not generalise to orthogonally derived data sets
Validation in at least one external cohort is required before the results can be believed
– example: the expression subtypes for breast cancer have been repeatedly validated in numerous data sets
Overfitting
To reduce the problem of overfitting, one can use Bayesian priors to 'regularize' the parameter estimates of the model
Some methods now integrate feature selection and classification in a unified analytical framework
– see Law et al, IEEE (2005): Sparse Multinomial Logistic Regression (SMLR)
Cross validation should always be used in training a classifier
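SMLR itself is not shown here; as a related illustration of regularized, sparse classification, an L1-penalized logistic regression can be fit in R with the glmnet package (assuming a feature matrix x and binary labels y), which similarly drives most gene weights to exactly zero:

library(glmnet)
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)   # lasso penalty; lambda chosen by cross validation
coef(cvfit, s = "lambda.min")                              # sparse gene weights; most are exactly 0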
Evaluating a classifier
The receiver operating characteristic (ROC) curve
– plots the true positive rate vs the false positive rate
Given ground truth and a probabilistic classifier:
– for some number of probability thresholds
– compute the TPR: the proportion of true positives that are predicted as positive
– compute the FPR: the proportion of true negatives that are incorrectly predicted as positive
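A minimal sketch of building an ROC curve by hand in R, assuming predicted probabilities p and 0/1 ground truth labels y; packages such as ROCR or pROC automate this:

thresholds <- seq(0, 1, by = 0.01)
tpr <- sapply(thresholds, function(t) sum(p >= t & y == 1) / sum(y == 1))   # true positive rate
fpr <- sapply(thresholds, function(t) sum(p >= t & y == 0) / sum(y == 0))   # false positive rate
plot(fpr, tpr, type = "l", xlab = "False positive rate", ylab = "True positive rate")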
Other methods for classification
– support vector machines
– linear discriminant analysis
– logistic regression
– random forests
See:
– Ma and Huang, Briefings in Bioinformatics (2008)
– Saeys et al, Bioinformatics (2007)
Questions?
Lab: Clustering and feature selection
Get familiar with clustering tools and plotting
– feature selection methods
– distance matrices
– linkage methods
– partition methods
Try to reproduce some of the figures from Chin et al using the freely available data
Module 2: Lab
Coffee break
Back at: 15:00