Flat clustering approaches
Colin Dewey, BMI/CS 576, Fall 2015
Flat clustering
- Cluster objects (genes/samples) into K clusters
- In the following, we will consider clustering genes based on their expression profiles
- K: the number of clusters, a user-defined argument
- Two example algorithms:
  - K-means
  - Gaussian mixture model-based clustering
Notation for K-means clustering
- $K$: number of clusters
- $N_k$: number of elements in cluster k
- $x_i$: p-dimensional expression profile for the ith gene
- $X = \{x_1, \ldots, x_N\}$: the collection of N gene expression profiles to cluster
- $m_k$: center of the kth cluster
- $C(i)$: cluster assignment (1 to K) of the ith gene
K-means clustering
- A hard-clustering algorithm
- Dissimilarity measure is the Euclidean distance
- Minimizes the within-cluster scatter, defined as
  $$W(C) = \sum_{k=1}^{K} \sum_{i:\,C(i)=k} \lVert x_i - m_k \rVert^2$$
  where $m_k$ is the center of cluster k and the inner sum is over all genes assigned to cluster k
K-means clustering
Consider an example in which our vectors have 2 dimensions.
[Figure: expression profiles plotted as points in 2-D, with '+' marking a cluster center]
K-means clustering
Each iteration involves two steps:
- assignment of profiles to clusters
- re-computation of the cluster centers (means)
[Figure: points assigned to the nearest '+' center, then the centers re-computed]
K-means algorithm
Input: K, the number of clusters, and a set $X = \{x_1, \ldots, x_N\}$ of data points, where each $x_i$ is a p-dimensional vector
- Initialize: select initial cluster means $m_1, \ldots, m_K$
- Repeat until convergence:
  - Assign each $x_i$ to cluster $C(i)$ such that $C(i) = \arg\min_k \lVert x_i - m_k \rVert^2$
  - Re-estimate the mean of each cluster based on its new members
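The loop above translates directly to code. Below is a minimal NumPy sketch of the algorithm on this slide; the function name, the handling of empty clusters, and the convergence test are my own choices, not course code.

```python
import numpy as np

def kmeans(X, means, max_iter=100):
    """K-means: alternate assignment and mean re-estimation until convergence.

    X:     (N, p) array of data points
    means: (K, p) float array of initial cluster means
    Returns (assignments, means).
    """
    assignments = None
    for _ in range(max_iter):
        # Assignment step: C(i) = argmin_k ||x_i - m_k||^2
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_assignments = dists.argmin(axis=1)
        # Converged when no assignment changes
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Update step: m_k = mean of the points currently assigned to cluster k
        for k in range(len(means)):
            members = X[assignments == k]
            if len(members) > 0:  # leave an empty cluster's mean unchanged
                means[k] = members.mean(axis=0)
    return assignments, means
```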
K-means: updating the mean
To compute the mean of the kth cluster:
$$m_k = \frac{1}{N_k} \sum_{i:\,C(i)=k} x_i$$
where the sum is over all genes in cluster k and $N_k$ is the number of genes in cluster k
K-means stopping criteria
- The assignment of objects to clusters doesn't change
- A fixed maximum number of iterations is reached
- The optimization criterion changes by only a small value
K-means Clustering Example
Given the following 4 instances and 2 clusters initialized as shown.
[Figure: four instances x1–x4 plotted on features f1 and f2, with two initial cluster centers]
K-means Clustering Example (Continued)
The assignments remain the same, so the procedure has converged.
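The slide's numeric values are not shown here, so the points and initial centers below are hypothetical stand-ins; the snippet just illustrates running the procedure (here via scikit-learn) to convergence.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-ins for the slide's 4 instances (features f1, f2)
# and the 2 initial cluster centers
X = np.array([[1.0, 1.0], [1.5, 2.0], [4.0, 4.0], [5.0, 3.5]])
init = np.array([[1.0, 1.0], [5.0, 4.0]])

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)           # [0 0 1 1] -- assignments stop changing
print(km.cluster_centers_)  # converged cluster means
```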
Gaussian mixture model-based clustering
- K-means is hard clustering: at each iteration, a data point is assigned to one and only one cluster
- We can do soft clustering based on Gaussian mixture models
- Each cluster is represented by a distribution (in our case, a Gaussian)
- We assume the data is generated by a mixture of the Gaussians
Gaussian distribution
$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
- $\mu$: mean
- $\sigma$: standard deviation
Representation of Clusters
In the EM approach, we'll represent each cluster using a p-dimensional multivariate Gaussian
$$N(x \mid \mu_j, \Sigma_j)$$
where $\mu_j$ is the mean of the Gaussian and $\Sigma_j$ is the covariance matrix.
[Figure: a representation of a Gaussian in a 2-D space]
Cluster generation
We model our data points with a generative process. Each point is generated by:
- choosing a cluster j (from 1, 2, ..., K) by sampling from a probability distribution over the clusters, where Prob(cluster j) = $P_j$
- sampling a point from that cluster's Gaussian distribution $N_j$
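A minimal sketch of this two-stage generative process (the mixture parameters below are hypothetical; NumPy only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-cluster mixture: priors P_j, 2-D means, fixed shared covariance
P = np.array([0.6, 0.4])                  # Prob(cluster j)
mus = np.array([[0.0, 0.0], [3.0, 3.0]])  # mean of each Gaussian N_j
cov = np.eye(2)                           # fixed covariance matrix

def generate_point():
    j = rng.choice(len(P), p=P)                      # 1) choose a cluster j
    return rng.multivariate_normal(mus[j], cov), j   # 2) sample a point from N_j

points = [generate_point() for _ in range(5)]
```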
EM Clustering
The EM algorithm will try to set the parameters of the Gaussians, $\theta$, to maximize the log likelihood of the data X:
$$\log P(X \mid \theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{K} P_j \, N(x_i \mid \mu_j)$$
EM Clustering
The parameters of the model, $\theta$, include the means, the covariance matrix, and sometimes prior weights for each Gaussian. Here, we'll assume that the covariance matrix is fixed; we'll focus just on setting the means and the prior weights.
EM Clustering: Hidden Variables
- On each iteration of K-means clustering, we had to assign each instance to a cluster
- In the EM approach, we'll use hidden variables to represent this idea
- For each instance $x_i$ we have a set of hidden variables $Z_{i1}, \ldots, Z_{iK}$
- We can think of $Z_{ij}$ as being 1 if $x_i$ is a member of cluster j and 0 otherwise
EM Clustering: the E-step
Recall that $Z_{ij}$ is a hidden variable which is 1 if $N_j$ generated $x_i$ and 0 otherwise. In the E-step, we compute $E[Z_{ij}]$, the expected value of this hidden variable:
$$E[Z_{ij}] = \frac{P_j \, N(x_i \mid \mu_j)}{\sum_{k=1}^{K} P_k \, N(x_i \mid \mu_k)}$$
[Figure: a point softly assigned to two clusters, with expected values 0.3 and 0.7]
EM Clustering: the M-step
Given the expected values $E[Z_{ij}]$, we re-estimate the means of the Gaussians and the cluster probabilities:
$$\mu_j = \frac{\sum_i E[Z_{ij}] \, x_i}{\sum_i E[Z_{ij}]} \qquad\qquad P_j = \frac{\sum_i E[Z_{ij}]}{N}$$
We can also re-estimate the covariance matrix if we're varying it.
EM Clustering: Overall algorithm
- Initialize parameters (e.g., means)
- Loop until convergence:
  - E-step: compute the expected values $E[Z_{ij}]$ given the current parameters
  - M-step: update the parameters (means and cluster probabilities) using the $E[Z_{ij}]$ values
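A sketch of the whole loop for the one-dimensional, fixed-variance setting used in the example that follows (the function name, vectorization, and fixed iteration count are my own choices):

```python
import numpy as np

def em_gmm_1d(x, mus, priors, var, n_iter=50):
    """EM for a 1-D Gaussian mixture with a fixed, shared variance.

    x: (N,) data; mus: (K,) initial means; priors: (K,) initial P_j.
    """
    x = np.asarray(x, dtype=float)
    for _ in range(n_iter):
        # E-step: E[Z_ij] proportional to P_j * N(x_i | mu_j, var)
        dens = np.exp(-(x[:, None] - mus) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        ez = priors * dens
        ez /= ez.sum(axis=1, keepdims=True)
        # M-step: re-estimate means and cluster probabilities
        mus = (ez * x[:, None]).sum(axis=0) / ez.sum(axis=0)
        priors = ez.mean(axis=0)
    return mus, priors, ez
```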
EM Clustering Example
Consider a one-dimensional clustering problem in which the data given are:
x1 = -4, x2 = -3, x3 = -1, x4 = 3, x5 = 5
The initial mean of the first Gaussian is 0 and the initial mean of the second is 2. The Gaussians both have variance = 4; their density function is:
$$f(x) = \frac{1}{\sqrt{2\pi \cdot 4}} \exp\left(-\frac{(x-\mu)^2}{2 \cdot 4}\right)$$
where $\mu$ denotes the mean (center) of the Gaussian. Initially, we set $P_1 = P_2 = 0.5$.
EM Clustering Example: E Step
[Figure: the $E[Z_{ij}]$ values computed for each of the five data points under the two initial Gaussians]
EM Clustering Example: M-step
[Figure: the updated means and cluster probabilities computed from the $E[Z_{ij}]$ values]
EM Clustering Example
Here we've shown just one step of the EM procedure; we would continue the E- and M-steps until convergence.
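That single step can be checked numerically; a sketch of the computation on the example data (printed values are approximate):

```python
import numpy as np

x = np.array([-4.0, -3.0, -1.0, 3.0, 5.0])
mus = np.array([0.0, 2.0])     # initial means
priors = np.array([0.5, 0.5])  # P_1 = P_2 = 0.5
var = 4.0

# E-step: responsibilities of Gaussians 1 and 2 for each point
dens = np.exp(-(x[:, None] - mus) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
ez = priors * dens
ez /= ez.sum(axis=1, keepdims=True)
print(ez[:, 0])  # approx [0.92, 0.88, 0.73, 0.27, 0.12]

# M-step: updated means and cluster probabilities
mus = (ez * x[:, None]).sum(axis=0) / ez.sum(axis=0)
priors = ez.mean(axis=0)
print(mus)     # approx [-1.94, 2.73]
print(priors)  # approx [0.58, 0.42]
```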
Comparing K-means and GMMs
K-means:
- Hard clustering
- Optimizes within-cluster scatter
- Requires estimation of means
GMMs:
- Soft clustering
- Optimizes likelihood of the data
- Requires estimation of means (and possibly covariances) and mixture probabilities
Choosing the number of clusters
Picking the number of clusters by directly optimizing the clustering objective will result in K = N (the number of data points). Two common options:
- pick K based on a penalized clustering objective
- pick K based on cross-validation
Picking K based on cross-validation
- Split the data set into 3 sets
- For each split: cluster the training set, then compute the objective on the held-out test set
- Average the test-set objectives over the splits for each candidate K
- Run the method on all of the data once K has been determined
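One way to realize this scheme, sketched with scikit-learn (the fold count and candidate K values are illustrative; for K-means, `KMeans.score` returns the negative within-cluster scatter, so we negate it to get a held-out objective to minimize):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def pick_k(X, candidates=(2, 3, 4, 5), n_splits=3):
    """Pick K minimizing the average held-out within-cluster scatter."""
    avg_score = {}
    for k in candidates:
        fold_scores = []
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        for train, test in kf.split(X):
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[train])
            fold_scores.append(-km.score(X[test]))  # test-set scatter
        avg_score[k] = np.mean(fold_scores)
    return min(avg_score, key=avg_score.get)
```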
Evaluation of clusters
- Internal validation: how well does the clustering optimize intra-cluster similarity and inter-cluster dissimilarity?
- External validation: do genes in the same cluster have similar function?
Internal validation
One measure for assessing cluster quality is the Silhouette index (SI); the more positive the SI, the better the clusters:
$$SI = \frac{1}{K} \sum_{j=1}^{K} \frac{1}{|C_j|} \sum_{x_i \in C_j} \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}}$$
- $K$: number of clusters
- $C_j$: set representing the jth cluster
- $b(x_i)$: average distance of $x_i$ to instances in the next closest cluster
- $a(x_i)$: average distance of $x_i$ to other instances in the same cluster
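scikit-learn computes the silhouette directly (note it averages the per-point score over all points rather than per cluster); a minimal sketch on illustrative data:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Illustrative profiles and their cluster assignments C(i)
X = np.array([[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [3.2, 2.9]])
labels = np.array([0, 0, 1, 1])

# Mean of (b - a) / max(a, b) over all points; closer to +1 is better
print(silhouette_score(X, labels))
```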
External validation
- Are genes in the same cluster associated with similar function?
- The Gene Ontology (GO) is a controlled vocabulary of terms, organized by category, used to annotate genes
- One can use GO terms to test whether the genes in a cluster are associated with the same GO term more often than expected by chance
- One can also check whether genes in a cluster are associated with similar transcription factor binding sites
The Gene Ontology
A controlled vocabulary of more than 30K concepts describing molecular functions, biological processes, and cellular components.
Gene Ontology terms for yeast genes
Gene      GO process
YDR420W   1,3-beta-D-glucan_biosynthetic_process
YGR032W   1,3-beta-D-glucan_biosynthetic_process
YLR342W   1,3-beta-D-glucan_biosynthetic_process
YMR306W   1,3-beta-D-glucan_biosynthetic_process
YDR420W   1,3-beta-D-glucan_metabolic_process
YGR032W   1,3-beta-D-glucan_metabolic_process
YLR342W   1,3-beta-D-glucan_metabolic_process
YMR306W   1,3-beta-D-glucan_metabolic_process
YDL049C   1,6-beta-glucan_biosynthetic_process
YFR042W   1,6-beta-glucan_biosynthetic_process
YGR143W   1,6-beta-glucan_biosynthetic_process
YKL035W   1,6-beta-glucan_biosynthetic_process
YOR336W   1,6-beta-glucan_biosynthetic_process
YPR159W   1,6-beta-glucan_biosynthetic_process
YDL049C   1,6-beta-glucan_metabolic_process
YFR042W   1,6-beta-glucan_metabolic_process
YGR143W   1,6-beta-glucan_metabolic_process
YJL174W   1,6-beta-glucan_metabolic_process
YKL035W   1,6-beta-glucan_metabolic_process
YOR336W   1,6-beta-glucan_metabolic_process
YPR159W   1,6-beta-glucan_metabolic_process
YDR111C   1-aminocyclopropane-1-carboxylate_biosynthetic_process
…
Assessing significance of association
- Assume we have a term t with $N_t$ genes annotated with it in the genome
- For a cluster c with $N_c$ genes, let $H_{ct}$ be the number of genes in c annotated with term t
- We ask whether c's members are significantly associated with term t
- Specifically, we ask how likely we are to see $H_{ct}$ of $N_c$ genes annotated with t by random chance
- A common way to do this is using the hypergeometric distribution
Hypergeometric distribution
A probability distribution over discrete random variables that describes the probability of k successes in n draws, without replacement, from a finite population of size N containing exactly K successes.
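In symbols, the probability of exactly k successes is the standard hypergeometric PMF (stated here for reference, using the variable names above):
$$P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$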
Hypergeometric test example
Consider a crate of 100 wine bottles: 20 bottles of red wine and 80 bottles of white wine. Suppose you were blindfolded and allowed to select 10 bottles. What is the probability that exactly 5 of the 10 selected bottles are red wine?
- Total number of objects: N = 100
- Number of successes in the population: K = 20
- Number of draws: n = 10
- Number of successes in our draws: k = 5
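SciPy evaluates this directly; a quick check of the example (scipy's hypergeom takes the population size, the number of successes, and the number of draws, in that order):

```python
from scipy.stats import hypergeom

# P(exactly k=5 red bottles in n=10 draws from N=100 bottles, K=20 of them red)
print(hypergeom.pmf(5, 100, 20, 10))  # approx 0.0215
```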
Hypergeometric distribution to assess significance of cluster-term association
- $N_c$: number of genes in cluster c (draws)
- $N_t$: number of genes in our genome background that have term t (successes)
- $H_{ct}$: number of genes in cluster c with term t (successes in draws)
- $N$: total genes in the background
The probability of observing $H_{ct}$ of $N_c$ genes with term t by random chance is
$$P(H_{ct}) = \frac{\binom{N_t}{H_{ct}} \binom{N - N_t}{N_c - H_{ct}}}{\binom{N}{N_c}}$$
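In practice, significance is usually assessed with the upper tail: the probability of seeing at least $H_{ct}$ hits by chance. A sketch with scipy (the counts below are hypothetical):

```python
from scipy.stats import hypergeom

# Hypothetical counts: N genes in the background, N_t with term t,
# a cluster of N_c genes of which H_ct carry the term
N, N_t, N_c, H_ct = 6000, 40, 100, 8

# P(X >= H_ct); sf(k) gives P(X > k), so shift by one
p_value = hypergeom.sf(H_ct - 1, N, N_t, N_c)
print(p_value)
```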
Other commonly used annotation categories
- KEGG pathways
- BioCarta
- MSigDB (Molecular Signatures Database)
- REACTOME
- binding sites of transcription factors
Using Gene Ontology to assess the quality of a cluster
[Figure: heatmap of genes by conditions, with clusters annotated by enriched GO terms and by transcription factor binding sites for HAP4 and MSN2/4]
Summary of flat clustering
- Two methods: K-means (hard clustering) and Gaussian mixture models (soft clustering)
- Both need the user to specify K
- K can be picked based on a penalized objective or cross-validation
- Clusters can be evaluated based on internal and external validation