
Canadian Bioinformatics Workshops www.bioinformatics.ca


Module 5: Clustering †
Exploratory Data Analysis and Essential Statistics using R
Boris Steipe, Department of Biochemistry / Department of Molecular Genetics
Toronto, September 8–9, 2011
[Image: Herakles and Iolaos battle the Hydra. Classical (450–400 BCE)]
† Includes material originally developed by Sohrab Shah

Introduction to clustering
What is clustering?
- unsupervised learning: discovery of patterns in data; class discovery
- grouping together "objects" that are most similar (or least dissimilar)
- objects may be genes, or samples, or both
Example questions: Are there samples in my cohort that can be subgrouped based on molecular profiling? Do these groups correlate with clinical outcome?

Distance metrics
In order to perform clustering, we need a way to measure how similar (or dissimilar) two objects are.
- Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
- Manhattan distance: $d(x, y) = \sum_i |x_i - y_i|$
- 1 − correlation: $d(x, y) = 1 - r(x, y)$, where $r$ is Pearson's correlation; proportional to the Euclidean distance between standardized profiles, but invariant to the range of measurement from one sample to the next
[Figure: example expression profiles, ranging from dissimilar to similar]
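All three metrics are easy to compute in base R; a minimal sketch (the toy matrix m and its layout, objects in rows, are assumptions for illustration):

# Toy data: 4 objects (rows) measured under 5 conditions (columns)
set.seed(1)
m <- matrix(rnorm(20), nrow = 4)
dEuc <- dist(m, method = "euclidean")  # Euclidean distance between rows
dMan <- dist(m, method = "manhattan")  # Manhattan distance
# 1 - Pearson correlation: cor() correlates columns, so transpose first
dCor <- as.dist(1 - cor(t(m)))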

Distance metrics compared
[Figure: the same data clustered with Euclidean, Manhattan, and 1 − Pearson distances]
Conclusion: distance matters!

Other distance metrics
Hamming distance for ordinal, binary or categorical data: the number of positions at which two vectors differ, $d(x, y) = \sum_i I(x_i \neq y_i)$.
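In R this is a one-liner; a sketch (the vectors are made up for illustration):

x <- c("A", "B", "B", "C")   # made-up categorical vectors
y <- c("A", "B", "C", "C")
sum(x != y)                  # Hamming distance: 1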

Approaches to clustering
- Partitioning methods: K-means; K-medoids (partitioning around medoids)
- Model based approaches
- Hierarchical methods: nested clusters; start with pairs and build a tree up to the root

Partitioning methods
Anatomy of a partitioning based method
Input: data matrix, distance function, number of groups
Output: group assignment of every object

Partitioning based methods (see the sketch below)
1. Choose K groups and initialise the group centers (aka centroids or medoids)
2. Assign each object to the nearest centroid according to the distance metric
3. Reassign (or recompute) the centroids
4. Repeat steps 2 and 3 until the assignment stabilizes
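A bare-bones illustration of this loop in base R; the variable names and toy data are assumptions, there is no guard for empty clusters, and in practice you would use kmeans() instead:

set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))   # 40 objects, two obvious groups
K <- 2
centers <- x[sample(nrow(x), K), , drop = FALSE]    # step 1: K random objects as centers
assign <- rep(0, nrow(x))
repeat {
  # step 2: assign each object to the nearest centroid (squared Euclidean distance)
  d <- sapply(1:K, function(k) colSums((t(x) - centers[k, ])^2))
  newAssign <- max.col(-d)
  if (identical(newAssign, assign)) break            # step 4: assignment stabilized
  assign <- newAssign
  # step 3: recompute each centroid as the mean of its group
  centers <- t(sapply(1:K, function(k) colMeans(x[assign == k, , drop = FALSE])))
}
table(assign)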

K-means vs K-medoids
K-means:
- centroids are the 'mean' of the clusters
- centroids need to be recomputed every iteration
- initialisation is difficult, as the notion of a centroid may be unclear before beginning
- R function: kmeans
K-medoids:
- the centroid is an actual object that minimizes the total within-cluster distance
- the centroid can be determined from a quick look-up into the distance matrix
- initialisation is simply K randomly selected objects
- R function: pam
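A minimal usage sketch of both functions (pam lives in the cluster package, a standard choice not named on the slide; x is the toy data matrix from the sketch above):

library(cluster)                           # provides pam(); kmeans() is in base R
km <- kmeans(x, centers = 2, nstart = 10)  # K-means; nstart guards against bad initialisations
pm <- pam(x, k = 2)                        # K-medoids (partitioning around medoids)
km$cluster                                 # group assignment of every object
pm$medoids                                 # the actual objects chosen as centers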

Partitioning based methods
Advantages:
- the number of groups is well defined
- a clear, deterministic assignment of every object to a group
- simple algorithms for inference
Disadvantages:
- you have to choose the number of groups
- sometimes objects do not fit well into any cluster
- can converge on locally optimal solutions, often requiring multiple restarts with random initializations

Agglomerative hierarchical clustering

Hierarchical clustering
Anatomy of hierarchical clustering
Input: distance matrix, linkage method
Output: dendrogram, a tree that defines the relationships between objects and the distance between clusters; a nested sequence of clusters

Linkage methods
- single: distance between the closest pair of objects in two clusters
- complete: distance between the farthest pair of objects in two clusters
- centroid: distance between the centroids of the two clusters
- average: mean of all pairwise distances between objects in the two clusters
[Figure: schematic of the four linkage criteria]

Linkage methods
Ward (1963): form partitions that minimize the loss associated with each grouping, where loss is defined as the error sum of squares (ESS).
Consider 10 objects with scores (2, 6, 5, 6, 2, 2, 2, 0, 0, 0):
ESS_onegroup = (2 − 2.5)² + (6 − 2.5)² + ... + (0 − 2.5)² = 50.5
On the other hand, if the 10 objects are classified according to their scores into four sets, {0,0,0}, {2,2,2,2}, {5}, {6,6}, the ESS can be evaluated as the sum of four separate error sums of squares:
ESS_fourgroups = ESS_group1 + ESS_group2 + ESS_group3 + ESS_group4 = 0.0
Thus, clustering the 10 scores into 4 clusters results in no loss of information.
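A quick check of these numbers in R (a sketch; the grouping vector is an assumption matching the four sets above):

scores <- c(2, 6, 5, 6, 2, 2, 2, 0, 0, 0)
ess <- function(v) sum((v - mean(v))^2)
ess(scores)                                 # one group: 50.5
groups <- c(2, 3, 4, 3, 2, 2, 2, 1, 1, 1)   # {0,0,0}, {2,2,2,2}, {5}, {6,6}
sum(tapply(scores, groups, ess))            # four groups: 0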

Linkage methods in action
clustering based on single linkage
single <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "single")
plot(single)

Linkage methods in action
clustering based on complete linkage
complete <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "complete")
plot(complete)

Linkage methods in action
clustering based on centroid linkage
centroid <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "centroid")
plot(centroid)

Linkage methods in action
clustering based on average linkage
average <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "average")
plot(average)

Linkage methods in action
clustering based on Ward linkage
ward <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "ward")  # note: newer R versions name this method "ward.D"
plot(ward)

Linkage methods in action Conclusion: linkage matters!

Hierarchical clustering analyzed
Advantages:
- there may be small clusters nested inside large ones
- no need to specify the number of groups ahead of time
- flexible linkage methods
Disadvantages:
- clusters might not be naturally represented by a hierarchical structure
- it's necessary to 'cut' the dendrogram in order to produce clusters
- bottom-up clustering can result in poor structure at the top of the tree; early joins cannot be 'undone'

Model based approaches
- Assume the data are 'generated' from a mixture of K distributions
- What cluster assignment and parameters of the K distributions best explain the data?
- 'Fit' a model to the data and try to get the best fit
- Classical example: mixture of Gaussians (mixture of normals); see the sketch below
- Take advantage of probability theory and well-defined distributions in statistics
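As an illustration, a minimal Gaussian mixture clustering sketch using the mclust package (the package choice is an assumption, the slides do not prescribe an implementation; x is the toy data matrix from the K-means sketch above):

library(mclust)
fit <- Mclust(x, G = 1:5)   # fit Gaussian mixtures with 1 to 5 components
summary(fit)                # the number of components is chosen by BIC
head(fit$classification)    # cluster assignment of every object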

Model based clustering: array CGH

Model based clustering of aCGH
Problem: patient cohorts often exhibit molecular heterogeneity, making rarer shared CNAs hard to detect.
Approach: cluster the data by extending the profiling to the multi-group setting. Shah et al. (Bioinformatics, 2009): a mixture of HMMs, HMM-Mix.
[Figure: graphical model of HMM-Mix, linking raw data to CNA calls (patient p, state k), sparse profiles (group g, profile state c), and the distribution of calls in a group]

Advantages of model based approaches
- In addition to clustering patients into groups, we output a 'model' that best represents the patients in a group
- We can then associate each model with clinical variables, and simply output a classifier to be used on new patients
- Choosing the number of groups becomes a model selection problem (cf. the Bayesian Information Criterion); see Yeung et al., Bioinformatics (2001)

Clustering 106 follicular lymphoma patients with HMM-Mix
[Figure: initialisation and converged profiles, with clinical annotations]
- Recapitulates known FL subgroups
- Subgroups have clinical relevance

Feature selection
Most features (genes, SNP probesets, BAC clones) in high dimensional datasets will be uninformative; examples: unexpressed genes, housekeeping genes, 'passenger alterations'.
Clustering (and classification) has a much higher chance of success if uninformative features are removed.
Simple approaches (see the sketch below):
- select intrinsically variable genes
- require a minimum level of expression in a proportion of samples
- genefilter package (Bioconductor): see Lab 1
We will return to feature selection in the context of classification.
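A minimal filtering sketch along these lines, using genefilter (the matrix exprMat, with genes in rows, and the thresholds are assumptions for illustration):

library(genefilter)                 # Bioconductor
f1 <- kOverA(5, 100)                # expression above 100 in at least 5 samples
f2 <- cv(0.1, Inf)                  # coefficient of variation of at least 0.1
keep <- genefilter(exprMat, filterfun(f1, f2))
exprMatFiltered <- exprMat[keep, ]  # keep only the informative features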

Advanced topics in clustering
- Top down clustering
- Bi-clustering or 'two-way' clustering
- Principal components analysis
- Choosing the number of groups: model selection (AIC, BIC), the silhouette coefficient (see the sketch below), the Gap curve
- Joint clustering and feature selection
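For example, the average silhouette width can guide the choice of K; a sketch with the cluster package (x is the toy data matrix from the K-means sketch above):

library(cluster)
d <- dist(x)
for (k in 2:5) {
  km  <- kmeans(x, centers = k, nstart = 10)
  sil <- silhouette(km$cluster, d)
  cat(k, "groups: average silhouette width =",
      round(mean(sil[, "sil_width"]), 3), "\n")
}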

What Have We Learned?
- There are three main types of clustering approaches: hierarchical, partitioning, and model based
- Feature selection is important: it reduces computational time and makes it more likely to identify well-separated groups
- The distance metric matters
- The linkage method matters in hierarchical clustering
- Model based approaches offer principled probabilistic methods

We are on a Coffee Break & Networking Session