Clustering. GLBIO ML workshop, May 17, 2016. Ivan Kryukov and Jeff Wintersinger
Introduction
Why cluster? Goal: given data points, group them by common properties. What properties do they share? This is an example of unsupervised learning -- there is no ground truth against which we can compare. Sometimes we want a small number of clusters broadly summarizing trends. Sometimes we want a large number of homogeneous clusters, each with only a few members. Image source: Wikipedia
Our problem: we have single-cell RNA-seq data for 271 cells across 575 genes. Cells were sampled at 0 h, 24 h, 48 h, and 72 h. Do cells at the same timepoint show the same gene expression? If each cluster consists only of cells from the same timepoint, then the answer is yes! Image source: Trapnell (2014)
K-means
K-means clustering: an extremely simple clustering algorithm, but it can be quite effective. One of two clustering algorithms we will discuss. You must define the number of clusters K you want. Image source: Wikipedia
K-means clustering: step 1. We’re going to create three clusters, so we randomly place three centroids amongst our data. Image source: Wikipedia
K-means clustering: step 2. Assign every data point to its closest centroid. Image source: Wikipedia
K-means clustering: step 3. Move each centroid to the centre of all the data points belonging to its cluster. Now go back to step 2 and iterate. Image source: Wikipedia
K-means clustering: step 4. When no data points change assignments, you’re done! Note that your results may differ depending on where you place your centroids at the start. Image source: Wikipedia
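The four k-means steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the workshop notebook's code: the function name `kmeans`, the `max_iter` and `seed` parameters, and the choice to start the centroids on randomly chosen data points are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length tuples into k groups."""
    rng = random.Random(seed)
    # Step 1: place k centroids (assumption: start them on k random data points).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign every data point to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 4: when no data point changes assignment, we are done.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the centre of its cluster's points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


# Two well-separated toy blobs (made-up data, not the RNA-seq set).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Changing `seed` changes the starting centroids, which is an easy way to see that results can depend on initialization when the data are less cleanly separated than this toy set.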
Gaussian mixture models
Gaussian mixture model clustering: we will fit a mixture of Gaussians using expectation maximization. Each Gaussian has parameters describing its mean and variance.
GMM step 1. Initialize with a Gaussian for each cluster, using random means and variances.
GMM step 2. Calculate the expectation of cluster membership for each point. Not captured by the figure: these are soft assignments.
GMM step 3. Choose parameter values that maximize the likelihood given the current (soft) assignment of points to clusters. Now go back to step 2 and iterate.
GMM step 4. Once you converge, you’re done!
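The GMM steps can be sketched for one-dimensional data as below. This is a rough illustration under several assumptions that are not in the slides: the function names `normal_pdf` and `em_gmm_1d` are made up, each Gaussian also gets a mixing weight, the starting means and variances are passed in by hand (the slides use random ones), convergence is judged by the change in log-likelihood, and variances are floored at a small value to keep the arithmetic stable.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(xs, means, variances, n_iter=200, tol=1e-9):
    """Fit a 1-D Gaussian mixture by EM, starting from the given guesses."""
    k = len(means)
    means, variances = list(means), list(variances)
    weights = [1.0 / k] * k  # assumption: equal starting mixing weights
    prev_ll = -math.inf
    resp = []
    for _ in range(n_iter):
        # Step 2 (E-step): soft assignment of every point to every Gaussian.
        resp, ll = [], 0.0
        for x in xs:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
            ll += math.log(total)
        # Step 4: stop once the log-likelihood no longer improves.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # Step 3 (M-step): re-estimate parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed Gaussian
    return weights, means, variances, resp


# Step 1: starting guesses, chosen by hand for this made-up toy data.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0, 9.0, 9.5, 10.0, 10.5, 11.0]
weights, means, variances, resp = em_gmm_1d(xs, means=[1.0, 8.0], variances=[1.0, 1.0])
print(means, variances)
```

Each row of `resp` sums to 1 and holds the soft assignment of one point, which is the part the figure does not capture.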
Let’s cluster simulated data using a GMM! Once more, to the notebook!
Evaluating clustering success
Evaluating clustering success: how do we evaluate clustering? For supervised learning, we can examine accuracy, precision-recall curves, etc. For clustering, there are two types of evaluation: extrinsic and intrinsic. Extrinsic measure: compare your clusters to ground-truth classes. This is similar to supervised learning, in which you know the “correct” answer for some of your data. For our RNA-seq data, we know what timepoint each cell came from. But if gene expression isn’t consistent between cells in the same timepoint, the data won’t cluster well -- this is a problem with the data, not the clustering algorithm. Intrinsic measure: examine the structure of clusters without reference to external ground truth.
Extrinsic metric: V-measure. V-measure: (harmonic) mean of homogeneity and completeness, both of which are desirable. Homogeneity: for a given cluster, do all the points in it come from the same class? Completeness: for a given class, are all its points placed in one cluster? Interpreting the scores: Perfect homogeneity, perfect completeness: your clustering matches your classes perfectly. Perfect homogeneity, horrible completeness: every single point is placed in its own cluster. Perfect completeness, horrible homogeneity: all your points are placed in just one cluster.
Calculating homogeneity: homogeneity and completeness are defined in terms of entropy, a numeric measure of uncertainty. Both values lie in the [0, 1] interval. If I tell you what points went in a given cluster -- e.g., “for cluster 1, cells 19, 143, and 240 are in it” -- and you know with certainty the class of all points in that cluster -- “oh, that’s the T = 24 h timepoint” -- then the cluster is homogeneous.
Calculating completeness: if I tell you what points are in a given class -- “the T = 48 h timepoint has cells 131, 179, and 221” -- and you know with certainty what cluster they belong to -- “oh, those cells are all in the second cluster” -- then that class is complete with respect to the clustering.
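To make the entropy idea concrete, here is a small sketch using the usual conditional-entropy definitions, which the slides do not spell out: homogeneity = 1 - H(class | cluster) / H(class), and completeness = 1 - H(cluster | class) / H(cluster), with a score of 1 when the denominator is zero. The function names and the toy labels are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of labels (natural log)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, given):
    """H(labels | given): uncertainty left in `labels` once `given` is known."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    marginal = Counter(given)
    return -sum((c / n) * math.log(c / marginal[g]) for (g, _), c in joint.items())


def homogeneity(classes, clusters):
    """1 when knowing a point's cluster tells you its class with certainty."""
    h_classes = entropy(classes)
    if h_classes == 0:
        return 1.0
    return 1.0 - conditional_entropy(classes, clusters) / h_classes


def completeness(classes, clusters):
    """1 when knowing a point's class tells you its cluster with certainty."""
    # Completeness is homogeneity with the roles of class and cluster swapped.
    return homogeneity(clusters, classes)


# Toy example: four cells from two timepoints (made-up labels).
classes = ["0h", "0h", "24h", "24h"]
for name, clusters in [("matches classes", [0, 0, 1, 1]),
                       ("one cluster per point", [0, 1, 2, 3]),
                       ("single cluster", [0, 0, 0, 0])]:
    print(name, homogeneity(classes, clusters), completeness(classes, clusters))
```

On this tiny example, putting every point in its own cluster gives homogeneity 1 but completeness of only 0.5; completeness drops further as the classes get larger.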
Now that we have homogeneity and completeness ... V-measure is just the (harmonic) mean of homogeneity and completeness. Why the harmonic mean rather than the arithmetic mean? If h = 1 and c = 0, then the arithmetic mean is 0.5. This is close to the degenerate case where each point goes to its own cluster. But under the same values, the harmonic mean is 0, which better represents the quality of the clustering.
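The difference between the two means is easy to check numerically. The sketch below assumes the equally weighted harmonic mean, 2hc / (h + c); the function names are made up for this example.

```python
def v_measure(h, c):
    """Harmonic mean of homogeneity h and completeness c."""
    if h + c == 0:
        return 0.0  # both scores are zero, so the mean is zero too
    return 2 * h * c / (h + c)


def arithmetic_mean(h, c):
    return (h + c) / 2


# Compare the two means on a few (h, c) pairs.
for h, c in [(1.0, 1.0), (1.0, 0.0), (0.9, 0.1)]:
    print(h, c, arithmetic_mean(h, c), v_measure(h, c))
```

The harmonic mean is pulled towards the smaller of the two scores, so a clustering cannot earn a good V-measure by doing well on only one of them.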
Intrinsic measure: silhouette score
Example low silhouette score
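For reference, the silhouette of a single point is usually defined as (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster; the overall score is the average over all points. Here is a minimal sketch under that definition. The function name, the Euclidean distance, and the convention that a point alone in its cluster scores 0 are assumptions of this example.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points; values near 1 mean tight, well-separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for a point alone in its cluster
            continue
        # a: mean distance to the other members of p's own cluster
        # (the sum includes p itself at distance 0, hence the len - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Made-up toy data: the same points under a sensible and a poor labelling.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(silhouette_score(points, [0, 0, 1, 1]))  # high: clusters are compact
print(silhouette_score(points, [0, 1, 0, 1]))  # low: clusters are mixed up
```

No ground-truth classes appear anywhere in the calculation, which is what makes this an intrinsic measure.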
Let’s see how well our simulated data is clustered! Notebook time! Hooray! Why does k-means do better than GMM? Our data were generated via Gaussians. Exercise: generate more complex simulated data and evaluate performance.
The curse of dimensionality
What is the curse of dimensionality? You have a straight line 100 metres long. Drop a penny on it. Easy to find! You have a square 100 m × 100 m. Drop a penny inside it. Harder to find -- like two football fields put next to each other. You have a building 100 m × 100 m × 100 m. Drop a penny in it. Now you’re searching inside a 30-storey building the size of a football field. Your life sucks. The point: intuition about what works in two or three dimensions breaks down as we move to much higher-dimensional spaces. Gene data: 575 differentially expressed genes -- we’re working in 575 dimensions! With so many dimensions, everything is “far” from everything else -- clustering based on distance breaks down.
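One way to see "everything is far from everything else" is to compare the nearest and farthest neighbours of a point as the number of dimensions grows. The sketch below is only an illustration on made-up data: it assumes points drawn uniformly at random from the unit cube (real expression data will not look like this), and the function name and parameters are invented for the example.

```python
import math
import random


def nearest_to_farthest_ratio(dim, n_points=100, seed=0):
    """Ratio of the nearest to the farthest distance from one random point."""
    rng = random.Random(seed)
    # Assumption: uniform random points in the unit cube of the given dimension.
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]
    return min(dists) / max(dists)


# 575 matches the number of genes in our data set.
for dim in (1, 2, 3, 575):
    print(dim, round(nearest_to_farthest_ratio(dim), 3))
```

A ratio near 0 means some neighbours are much closer than others, so "closest" is meaningful; a ratio near 1 means all points are at nearly the same distance, which is where distance-based clustering starts to struggle.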