
Canadian Bioinformatics Workshops


Module 2: Clustering, Classification and Feature Selection
Sohrab Shah
Centre for Translational and Applied Genomics, Molecular Oncology, Breast Cancer Research Program, BC Cancer Agency

Module Overview
Introduction to clustering
– distance metrics
– hierarchical, partitioning and model-based clustering
Introduction to classification
– building a classifier
– avoiding overfitting
– cross validation
Feature selection in clustering and classification

Introduction to clustering
What is clustering?
– unsupervised learning
– discovery of patterns in data
– class discovery
Grouping together "objects" that are most similar (or least dissimilar)
– objects may be genes, or samples, or both
Example question: are there samples in my cohort that can be subgrouped based on molecular profiling?
– Do these groups correlate with clinical outcome?

Distance metrics
In order to perform clustering, we need a way to measure how similar (or dissimilar) two objects are.
Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
Manhattan distance: $d(x, y) = \sum_i |x_i - y_i|$
1 − correlation: $d(x, y) = 1 - r_{xy}$
– proportional to Euclidean distance, but invariant to the range of measurement from one sample to the next
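A rough sketch of how these metrics can be computed in R (exprMat is a made-up genes-by-samples matrix used only for illustration, not workshop data):

# Compare distance metrics between samples of a small expression matrix
set.seed(1)
exprMat <- matrix(rnorm(200), nrow = 20,
                  dimnames = list(paste0("gene", 1:20), paste0("sample", 1:10)))

d.euclid    <- dist(t(exprMat), method = "euclidean")   # Euclidean distance between samples
d.manhattan <- dist(t(exprMat), method = "manhattan")   # Manhattan distance
d.cor       <- as.dist(1 - cor(exprMat))                # 1 - Pearson correlation

round(as.matrix(d.euclid)[1:3, 1:3], 2)                 # peek at a corner of the distance matrix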

Distance metrics compared
[Figure: the same data clustered using Euclidean, Manhattan and 1 − Pearson distances]
Conclusion: distance matters!

Other distance metrics
Hamming distance for ordinal, binary or categorical data: $d(x, y) = \sum_i I(x_i \neq y_i)$, i.e. the number of positions at which the two objects differ.
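A minimal sketch of a Hamming distance in R for two categorical vectors (made-up genotypes, not from the slides):

# Hamming distance: count of positions where two vectors disagree
hamming <- function(x, y) sum(x != y)

genotype1 <- c("AA", "AB", "BB", "AA")
genotype2 <- c("AA", "BB", "BB", "AB")
hamming(genotype1, genotype2)   # 2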

Approaches to clustering
Partitioning methods
– K-means
– K-medoids (partitioning around medoids)
– model-based approaches
Hierarchical methods
– nested clusters: start with pairs and build a tree up to the root

Partitioning methods
Anatomy of a partitioning-based method
– data matrix
– distance function
– number of groups
Output
– group assignment of every object

Partitioning-based methods
Choose K groups (see the sketch below)
– initialise group centers, aka centroids or medoids
– assign each object to the nearest centroid according to the distance metric
– recompute the centroids
– repeat the last two steps until the assignment stabilizes
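A bare-bones illustration of this assign/recompute loop, written from scratch in R on synthetic data (in practice you would use kmeans() or cluster::pam()):

# Toy K-means: assign objects to nearest centroid, recompute centroids, repeat
simple_kmeans <- function(X, K, n.iter = 20) {
  centroids <- X[sample(nrow(X), K), , drop = FALSE]   # initialise with K random objects
  cl <- rep(1, nrow(X))
  for (i in seq_len(n.iter)) {
    # assign each object to its nearest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centroids, X)))[-(1:K), 1:K]
    cl <- apply(d, 1, which.min)
    # recompute each centroid as the mean of its assigned objects
    for (k in seq_len(K))
      if (any(cl == k))
        centroids[k, ] <- colMeans(X[cl == k, , drop = FALSE])
  }
  list(cluster = cl, centers = centroids)
}

set.seed(2)
X <- rbind(matrix(rnorm(40, 0), ncol = 2), matrix(rnorm(40, 3), ncol = 2))
simple_kmeans(X, K = 2)$cluster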

K-medoids in action

K-means vs K-medoids (a short sketch comparing kmeans() and pam() follows)
K-means:
– centroids are the 'mean' of the clusters
– centroids need to be recomputed every iteration
– initialisation is difficult, as the notion of a centroid may be unclear before beginning
K-medoids:
– centroids are actual objects that minimize the total within-cluster distance
– centroids can be determined by a quick look-up into the distance matrix
– initialisation is simply K randomly selected objects
(R functions: kmeans, pam)
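A hedged sketch contrasting the two in R, using the cluster package's pam() on synthetic data (not the workshop dataset):

library(cluster)   # provides pam()

set.seed(3)
X <- rbind(matrix(rnorm(60, 0), ncol = 2), matrix(rnorm(60, 4), ncol = 2))

km <- kmeans(X, centers = 2)        # centroids are cluster means
pm <- pam(dist(X), k = 2)           # medoids are actual objects; works directly from a distance matrix

table(kmeans = km$cluster, pam = pm$clustering)   # compare the two assignments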

Partitioning-based methods
Advantages:
– the number of groups is well defined
– a clear, deterministic assignment of each object to a group
– simple algorithms for inference
Disadvantages:
– have to choose the number of groups
– sometimes objects do not fit well into any cluster
– can converge on locally optimal solutions and often require multiple restarts with random initializations

Agglomerative hierarchical clustering

Hierarchical clustering
Anatomy of hierarchical clustering
– distance matrix
– linkage method
Output
– dendrogram: a tree that defines the relationships between objects and the distances between clusters; a nested sequence of clusters

Linkage methods
[Figure: single linkage, complete linkage, average linkage, and distance between centroids]

Linkage methods
Ward (1963)
– form partitions that minimize the loss associated with each grouping
– loss defined as the error sum of squares (ESS)
– consider 10 objects with scores (2, 6, 5, 6, 2, 2, 2, 0, 0, 0); a quick numeric check in R follows
Treated as one group (mean 2.5):
ESS_one group = (2 − 2.5)² + (6 − 2.5)² + … + (0 − 2.5)² = 50.5
On the other hand, if the 10 objects are classified according to their scores into four sets, {0,0,0}, {2,2,2,2}, {5}, {6,6}, the ESS can be evaluated as the sum of four separate error sums of squares:
ESS_four groups = ESS_group1 + ESS_group2 + ESS_group3 + ESS_group4 = 0.0
Thus, clustering the 10 scores into 4 clusters results in no loss of information.
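A quick sanity check of these numbers in R (just arithmetic, mirroring the slide's example):

ess <- function(x) sum((x - mean(x))^2)   # error sum of squares

scores <- c(2, 6, 5, 6, 2, 2, 2, 0, 0, 0)
ess(scores)                               # one group: 50.5

groups <- list(c(0, 0, 0), c(2, 2, 2, 2), 5, c(6, 6))
sum(sapply(groups, ess))                  # four groups: 0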

Linkage methods in action
Clustering based on single linkage:
single <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "single")
plot(single)

Linkage methods in action
Clustering based on complete linkage:
complete <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "complete")
plot(complete)

Linkage methods in action
Clustering based on centroid linkage:
centroid <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "centroid")
plot(centroid)

Linkage methods in action
Clustering based on average linkage:
average <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "average")
plot(average)

Linkage methods in action
Clustering based on Ward linkage:
ward <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "ward")   # called "ward.D" in newer versions of R
plot(ward)

Linkage methods in action
[Figure: dendrograms from the different linkage methods side by side]
Conclusion: linkage matters!

Hierarchical clustering analyzed
Advantages:
– there may be small clusters nested inside large ones
– no need to specify the number of groups ahead of time
– flexible linkage methods
Disadvantages:
– clusters might not be naturally represented by a hierarchical structure
– it is necessary to 'cut' the dendrogram in order to produce clusters (see the sketch below)
– bottom-up clustering can result in poor structure at the top of the tree; early joins cannot be 'undone'
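A small sketch of cutting a dendrogram into clusters with cutree(), using the same exprMatSub object as the earlier hclust examples (k = 3 is an arbitrary choice for illustration):

# Cut the average-linkage dendrogram into k = 3 clusters
hc <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "average")
clusters <- cutree(hc, k = 3)
table(clusters)   # how many samples fall in each cluster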

Model-based approaches
Assume the data are 'generated' from a mixture of K distributions
– what cluster assignment and what parameters of the K distributions best explain the data?
'Fit' a model to the data and try to get the best fit
Classical example: a mixture of Gaussians (mixture of normals); a sketch using the mclust package follows
Takes advantage of probability theory and well-defined distributions in statistics
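As a hedged illustration (not part of the original workshop material), the mclust package fits Gaussian mixture models and selects the number of components by BIC:

library(mclust)   # Gaussian mixture model-based clustering

set.seed(4)
X <- rbind(matrix(rnorm(100, 0), ncol = 2), matrix(rnorm(100, 3), ncol = 2))

fit <- Mclust(X, G = 1:5)    # try 1 to 5 mixture components, pick the best by BIC
summary(fit)
fit$classification           # cluster assignment of each object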

Model-based clustering: array CGH

Model-based clustering of aCGH
Problem: patient cohorts often exhibit molecular heterogeneity, making rarer shared CNAs hard to detect.
Approach: cluster the data by extending the profiling to the multi-group setting with a mixture of HMMs: HMM-Mix. Shah et al (Bioinformatics, 2009)
[Figure: HMM-Mix graphical model (patient p, group g, states) with raw data, CNA calls, the distribution of calls in a group, and sparse profiles]

Advantages of model-based approaches
In addition to clustering patients into groups, we output a 'model' that best represents the patients in a group
We can then associate each model with clinical variables and simply output a classifier to be used on new patients
Choosing the number of groups becomes a model selection problem (e.g. the Bayesian Information Criterion)
– see Yeung et al, Bioinformatics (2001)

Clustering 106 follicular lymphoma patients with HMM-Mix
[Figure: clinical profiles at initialisation and after convergence]
Recapitulates known FL subgroups
Subgroups have clinical relevance

Feature selection
Most features (genes, SNP probesets, BAC clones) in high-dimensional datasets will be uninformative
– examples: unexpressed genes, housekeeping genes, 'passenger alterations'
Clustering (and classification) has a much higher chance of success if uninformative features are removed
Simple approaches (a small sketch follows):
– select intrinsically variable genes
– require a minimum level of expression in a proportion of samples
– genefilter package (Bioconductor): Lab 1
We return to feature selection in the context of classification
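A minimal sketch of these simple filters in base R (the matrix and both thresholds are made-up illustrations; the Bioconductor genefilter package provides similar functionality):

# Made-up genes-by-samples expression matrix (log2 scale) for illustration
set.seed(7)
exprMat <- matrix(rnorm(5000, mean = 6, sd = 2), nrow = 500,
                  dimnames = list(paste0("gene", 1:500), paste0("sample", 1:10)))

# Filter 1: keep intrinsically variable genes (top 25% by variance)
vars <- apply(exprMat, 1, var)
keep.variable <- vars > quantile(vars, 0.75)

# Filter 2: require expression above a minimum level in at least 20% of samples
keep.expressed <- rowMeans(exprMat > 6) >= 0.20

exprMatSub <- exprMat[keep.variable & keep.expressed, ]
dim(exprMatSub)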

Advanced topics in clustering
Top-down clustering
Bi-clustering or 'two-way' clustering
Principal components analysis
Choosing the number of groups
– model selection: AIC, BIC
– silhouette coefficient
– the gap curve
Joint clustering and feature selection

What Have We Learned?
There are three main types of clustering approaches
– hierarchical
– partitioning
– model based
Feature selection is important
– reduces computational time
– more likely to identify well-separated groups
The distance metric matters
The linkage method matters in hierarchical clustering
Model-based approaches offer principled probabilistic methods

Module Overview
Clustering
Classification
Feature Selection

Classification
What is classification?
– supervised learning
– discriminant analysis
Work from a set of objects with predefined classes
– e.g. basal vs luminal, or good responder vs poor responder
Task: learn from the features of the objects what the basis for discrimination is
Statistically and mathematically heavy

Classification
[Figure: learn a classifier from patients labelled 'poor response' and 'good response'; for a new patient, what is the most likely response?]

Example: DLBCL subtypes
Wright et al, PNAS (2003)

DLBCL subtypes
Wright et al, PNAS (2003)

Classification approaches
Wright et al, PNAS (2003)
Weighted features in a linear predictor score (LPS): $\mathrm{LPS}(X) = \sum_j a_j X_j$
– $a_j$: weight of gene j, determined by a t-test statistic
– $X_j$: expression value of gene j
Assume there are two distinct distributions of LPS: one for ABC, one for GCB

Wright et al, DLBCL, cont'd
Use Bayes' rule to determine the probability that a sample comes from group 1:
$P(\text{group 1} \mid \mathrm{LPS}) = \dfrac{\phi(\mathrm{LPS}; \mu_1, \sigma_1)}{\phi(\mathrm{LPS}; \mu_1, \sigma_1) + \phi(\mathrm{LPS}; \mu_2, \sigma_2)}$
where $\phi(\cdot; \mu_1, \sigma_1)$ is the probability density function that represents group 1.

Learning the classifier, Wright et al
Choosing the genes (feature selection): use cross validation
Leave-one-out cross validation (a generic sketch follows this list):
– pick a set of samples
– use all but one of the samples for training, leaving one out as the test case
– fit the model using the training data
– can the classifier correctly pick the class of the remaining case?
– repeat exhaustively, leaving out each sample in turn
Repeat using different sets and numbers of genes, chosen by t-statistic
Pick the set of genes that gives the highest accuracy
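A generic leave-one-out cross-validation skeleton in R (not the Wright et al implementation: X is a samples-by-features matrix, y the class labels, and the simple nearest-centroid classifier is a stand-in for illustration):

# Leave-one-out cross validation with a toy nearest-centroid classifier
loocv_accuracy <- function(X, y) {
  correct <- logical(nrow(X))
  for (i in seq_len(nrow(X))) {
    train.X <- X[-i, , drop = FALSE]; train.y <- y[-i]
    # nearest-centroid classifier: mean profile per class, fit on the training data only
    centroids <- sapply(levels(factor(train.y)),
                        function(cl) colMeans(train.X[train.y == cl, , drop = FALSE]))
    d <- apply(centroids, 2, function(m) sqrt(sum((X[i, ] - m)^2)))
    correct[i] <- names(which.min(d)) == y[i]   # did the held-out sample get its true class?
  }
  mean(correct)   # proportion of held-out samples classified correctly
}

set.seed(5)
X <- rbind(matrix(rnorm(100, 0), ncol = 10), matrix(rnorm(100, 1), ncol = 10))
y <- rep(c("good", "poor"), each = 10)
loocv_accuracy(X, y)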

Overfitting
In many cases in biology, the number of features is much larger than the number of samples
Important features may not be represented in the training data
This can result in overfitting
– when a classifier discriminates well on its training data but does not generalise to orthogonally derived data sets
Validation in at least one external cohort is required before believing the results
– example: the expression subtypes for breast cancer have been repeatedly validated in numerous data sets

Overfitting
To reduce the problem of overfitting, one can use Bayesian priors to 'regularize' the parameter estimates of the model
Some methods now integrate feature selection and classification in a unified analytical framework
– see Law et al, IEEE (2005): Sparse Multinomial Logistic Regression (SMLR)
Cross validation should always be used when training a classifier

Evaluating a classifier
The receiver operating characteristic (ROC) curve plots the true positive rate vs the false positive rate
Given ground truth and a probabilistic classifier, for some number of probability thresholds (a small sketch follows):
– compute the TPR: the proportion of true positives that are predicted positive
– compute the FPR: the proportion of true negatives that are incorrectly predicted positive
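A minimal sketch of tracing out ROC points by hand in R (synthetic scores and labels; packages such as ROCR or pROC do this for you):

set.seed(6)
labels <- rep(c(1, 0), each = 50)                 # ground truth: 1 = positive
scores <- c(rnorm(50, 1.5), rnorm(50, 0))         # classifier scores / probabilities

thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))  # TP / all positives
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))  # FP / all negatives

plot(fpr, tpr, type = "l", xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # chance line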

Other methods for classification
Support vector machines
Linear discriminant analysis
Logistic regression
Random forests
See:
– Ma and Huang, Briefings in Bioinformatics (2008)
– Saeys et al, Bioinformatics (2007)

Questions?

Lab: Clustering and feature selection
Get familiar with clustering tools and plotting
– feature selection methods
– distance matrices
– linkage methods
– partitioning methods
Try to reproduce some of the figures from Chin et al using the freely available data

Module 2: Lab
Coffee break. Back at 15:00.