GLBIO ML workshop May 17, 2016 Ivan Kryukov and Jeff Wintersinger

Clustering GLBIO ML workshop, May 17, 2016 Ivan Kryukov and Jeff Wintersinger

Introduction

Why cluster? Goal: given data points, group them by common properties. What properties do they share? Clustering is an example of unsupervised learning -- there is no ground truth against which we can compare. Sometimes we want a small number of clusters that broadly summarize trends; sometimes we want a large number of homogeneous clusters, each with only a few members. Image source: Wikipedia

Our problem We have single-cell RNA-seq data for 271 cells across 575 genes. Cells were sampled at 0 h, 24 h, 48 h, and 72 h. Do cells from the same timepoint show the same gene expression? If each cluster consists only of cells from the same timepoint, then the answer is yes! Image source: Trapnell (2014)

K-means

K-means clustering K-means is an extremely simple clustering algorithm, but it can be quite effective. It is one of the two clustering algorithms we will discuss. You must define the number of clusters K you want. Image source: Wikipedia

K-means clustering: step 1 We’re going to create three clusters, so we randomly place three centroids amongst our data. Image source: Wikipedia

K-means clustering: step 2 Assign every data point to its closest centroid. Image source: Wikipedia

K-means clustering: step 3 Move each centroid to the centre of all the data points belonging to its cluster. Now go back to step 2 and iterate. Image source: Wikipedia

K-means clustering: step 4 When no data points change assignment, you’re done! Note that, depending on where you place your centroids at the start, your results may differ. Image source: Wikipedia
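
To make the loop concrete, here is a minimal NumPy sketch of the four steps above. This is my own illustration rather than the workshop notebook: the toy data X and the choice of K = 3 are assumptions, and in practice scikit-learn’s KMeans does the same thing with a smarter initialization.

```python
# Minimal k-means sketch (illustrative only; toy data and K are assumptions).
import numpy as np

rng = np.random.default_rng(0)
# Toy data: three 2-D blobs.
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(100, 2))
               for m in ((0, 0), (4, 0), (0, 4))])
K = 3

# Step 1: place K centroids at randomly chosen data points.
centroids = X[rng.choice(len(X), K, replace=False)]

for _ in range(100):
    # Step 2: assign every data point to its closest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # Step 3: move each centroid to the centre of the points assigned to it.
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

    # Step 4: stop once the assignments (and hence the centroids) no longer change.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```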

Gaussian mixture models

Gaussian mixture model clustering We will fit a mixture of Gaussians using expectation maximization (EM). Each Gaussian has parameters describing its mean and variance.

GMM step 1 Initialize with a Gaussian for each cluster, using random means and variances

GMM step 2 Calculate the expectation of cluster membership for each point. Not captured by the figure: these are soft assignments.

GMM step 3 Choose parameter values that maximize the likelihood of the observed assignment of points to clusters.

GMM step 4 Once you converge, you’re done!
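
As a rough illustration of steps 1-4 (my addition, with assumed 1-D toy data; this is not the workshop notebook, which would more likely use scikit-learn’s GaussianMixture), a from-scratch EM loop might look like this:

```python
# Minimal EM sketch for a 1-D Gaussian mixture (illustrative only).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1.0, 200), rng.normal(3, 0.5, 100)])  # toy data
K = 2

# Step 1: initialize each Gaussian with a random mean, unit variance, and equal weights.
means = rng.choice(x, K, replace=False)
stds = np.ones(K)
weights = np.full(K, 1.0 / K)

for _ in range(100):
    # Step 2 (E-step): soft assignments -- each component's responsibility for each point.
    dens = np.array([w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, stds)])
    resp = dens / dens.sum(axis=0)

    # Step 3 (M-step): re-estimate parameters to maximize the expected likelihood.
    nk = resp.sum(axis=1)
    means = (resp @ x) / nk
    stds = np.sqrt((resp * (x - means[:, None]) ** 2).sum(axis=1) / nk)
    weights = nk / len(x)

# Step 4: after convergence, `resp` holds the soft memberships and
# (means, stds, weights) describe the fitted mixture.
print(means, stds, weights)
```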

Let’s cluster simulated data using a GMM! Once more, to the notebook!

Evaluating clustering success

Evaluating clustering success How do we evaluate clustering? For supervised learning, we can examine accuracy, the precision-recall curve, etc. There are two types of evaluation: extrinsic and intrinsic. An extrinsic measure compares your clusters against ground-truth classes. This is similar to supervised learning, in which you know the “correct” answer for some of your data. For our RNA-seq data, we know what timepoint each cell came from. But if gene expression isn’t consistent between cells from the same timepoint, the data won’t cluster well -- this is a problem with the data, not with the clustering algorithm. An intrinsic measure examines the structure of the clusters without reference to external ground truth.

Extrinsic metric: V-measure V-measure: the average of homogeneity and completeness, both of which are desirable. Homogeneity: for a given cluster, do all the points in it come from the same class? Completeness: for a given class, are all its points placed in one cluster? Some limiting cases: perfect homogeneity and perfect completeness means your clustering matches your classes perfectly; perfect homogeneity with horrible completeness arises when every single point is placed in its own cluster; perfect completeness with horrible homogeneity arises when all your points are placed in just one cluster.

Calculating homogeneity Homogeneity and completeness are defined in terms of entropy, a numeric measure of uncertainty. Both values lie on the [0, 1] interval. If I tell you which points went into a given cluster -- e.g., “cluster 1 contains cells 19, 143, and 240” -- and you know with certainty the class of all points in that cluster -- “oh, that’s the T = 24 h timepoint” -- then the cluster is homogeneous.

Calculating completeness If I tell you which points are in a given class -- “the T = 48 h timepoint has cells 131, 179, and 221” -- and you know with certainty which cluster they belong to -- “oh, those cells are all in the second cluster” -- then that class is complete with respect to the clustering.
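
For reference (my addition, not on the original slides), the standard entropy-based definitions follow Rosenberg and Hirschberg (2007), with C the set of classes, K the set of clusters, and H(·) an entropy; by convention h = 1 when H(C) = 0, and likewise c = 1 when H(K) = 0:

```latex
h = 1 - \frac{H(C \mid K)}{H(C)}, \qquad
c = 1 - \frac{H(K \mid C)}{H(K)}, \qquad
V = \frac{2\,h\,c}{h + c}
```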

Now that we have homogeneity and completeness ... V-measure is just the (harmonic) mean of homogeneity and completeness. Why the harmonic mean rather than the arithmetic mean? If h = 1 and c = 0, the arithmetic mean is 0.5. This is the degenerate case where each point goes into its own cluster. But for the same values, the harmonic mean is 0, which better represents the quality of the clustering.
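
A small, hypothetical example (assumed labels, not from the workshop notebook) of computing these scores with scikit-learn:

```python
# Compare a clustering against ground-truth classes with extrinsic metrics.
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

classes  = [0, 0, 0, 1, 1, 1]   # ground-truth labels (e.g. timepoints)
clusters = [0, 0, 1, 1, 2, 2]   # labels assigned by a clustering algorithm

h = homogeneity_score(classes, clusters)
c = completeness_score(classes, clusters)
v = v_measure_score(classes, clusters)   # harmonic mean of h and c
print(f"homogeneity={h:.3f}  completeness={c:.3f}  v-measure={v:.3f}")
```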

Intrinsic measure: silhouette score
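
As a sketch of the idea (assumed toy data, my addition): the silhouette score compares, for each point, the mean distance to the other points in its own cluster with the mean distance to points in the nearest other cluster. Scores near 1 indicate compact, well-separated clusters; scores near 0 or below suggest overlapping clusters or misassigned points.

```python
# Silhouette score on toy data (illustrative; not the workshop notebook).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", silhouette_score(X, labels))   # close to 1 for well-separated blobs
```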

Example: a low silhouette score

Let’s see how well our simulated data are clustered! Notebook time! Hooray! Why does k-means do better than GMM? Our data were generated via Gaussians. Exercise: generate more complex simulated data and evaluate performance.

The curse of dimensionality

What is the curse of dimensionality? You have a straight line 100 metres long. Drop a penny on it. Easy to find! You have a square 100 m * 100 m. Drop a penny inside it. Harder to find -- like two football fields put next to each other. You have a building 100 m * 100 m * 100 m. Drop a penny in it. Now you’re searching inside a 30-storey building the size of a football field. Your life sucks. The point: intuition about what works in two or three dimensions breaks down as we move to much higher-dimensional spaces. Gene data: 575 differentially expressed genes -- we’re working in 575 dimensions! With so many dimensions, everything is “far” from everything else -- clustering based on distance breaks down.
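
A small illustration of this distance-concentration effect (my addition, using simulated uniform data rather than the expression matrix): as the number of dimensions grows, the nearest and farthest neighbours of a point end up at almost the same distance, so distance-based clustering has little to work with.

```python
# Distance concentration in high dimensions (illustrative simulation).
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 575):                       # 575 ~ our number of genes
    X = rng.random((1000, d))                     # random points in the unit hypercube
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from one point to all others
    print(f"d={d:4d}  nearest/farthest ratio = {dists.min() / dists.max():.2f}")
```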