Flat clustering approaches
Colin Dewey, BMI/CS 576, Fall 2015
Flat clustering
- Cluster objects (genes/samples) into K clusters
- In the following, we will consider clustering genes based on their expression profiles
- K: the number of clusters, a user-defined argument
- Two example algorithms:
  - K-means
  - Gaussian mixture model-based clustering
Notation for K-means clustering
- $K$: number of clusters
- $N_k$: number of elements in cluster k
- $x_i$: p-dimensional expression profile for the ith gene
- $X = \{x_1, \ldots, x_N\}$: the collection of N gene expression profiles to cluster
- $m_k$: center of the kth cluster
- $C(i)$: cluster assignment (1 to K) of the ith gene
K-means clustering
- A hard-clustering algorithm
- Dissimilarity measure is the Euclidean distance
- Minimizes the within-cluster scatter, defined as
  $$W(C) = \sum_{k=1}^{K} \sum_{i:\,C(i)=k} \lVert x_i - m_k \rVert^2$$
  where $m_k$ is the center of cluster k and the inner sum is over all genes assigned to cluster k
K-means clustering
Consider an example in which our vectors have 2 dimensions.
[Figure: expression profiles plotted as points in 2-D, with '+' marking a cluster center]
K-means clustering
Each iteration involves two steps:
- assignment of profiles to clusters
- re-computation of the cluster centers (means)
[Figure: points assigned to the nearest '+' center, then the centers re-computed]
K-means algorithm
Input: K, the number of clusters, and a set $X = \{x_1, \ldots, x_N\}$ of data points, where each $x_i$ is a p-dimensional vector
- Initialize: select initial cluster means $m_1, \ldots, m_K$
- Repeat until convergence:
  - Assign each $x_i$ to cluster $C(i)$ such that $C(i) = \arg\min_k \lVert x_i - m_k \rVert^2$
  - Re-estimate the mean of each cluster based on its new members
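The loop above translates directly to code. Below is a minimal NumPy sketch of the algorithm on this slide; the function name, the handling of empty clusters, and the convergence test are my own choices, not course code.

```python
import numpy as np

def kmeans(X, means, max_iter=100):
    """K-means: alternate assignment and mean re-estimation until convergence.

    X:     (N, p) array of data points
    means: (K, p) float array of initial cluster means
    Returns (assignments, means).
    """
    assignments = None
    for _ in range(max_iter):
        # Assignment step: C(i) = argmin_k ||x_i - m_k||^2
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_assignments = dists.argmin(axis=1)
        # Converged when no assignment changes
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Update step: m_k = mean of the points currently assigned to cluster k
        for k in range(len(means)):
            members = X[assignments == k]
            if len(members) > 0:  # leave an empty cluster's mean unchanged
                means[k] = members.mean(axis=0)
    return assignments, means
```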
K-means: updating the mean
To compute the mean of the kth cluster:
$$m_k = \frac{1}{N_k} \sum_{i:\,C(i)=k} x_i$$
where the sum is over all genes in cluster k and $N_k$ is the number of genes in cluster k
K-means stopping criteria
- The assignment of objects to clusters doesn't change
- A fixed maximum number of iterations is reached
- The optimization criterion changes by only a small value
K-means Clustering Example
Given the following 4 instances and 2 clusters initialized as shown.
[Figure: four instances x1–x4 plotted on features f1 and f2, with two initial cluster centers]
K-means Clustering Example (Continued)
The assignments remain the same, so the procedure has converged.
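The slide's numeric values are not shown here, so the points and initial centers below are hypothetical stand-ins; the snippet just illustrates running the procedure (here via scikit-learn) to convergence.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-ins for the slide's 4 instances (features f1, f2)
# and the 2 initial cluster centers
X = np.array([[1.0, 1.0], [1.5, 2.0], [4.0, 4.0], [5.0, 3.5]])
init = np.array([[1.0, 1.0], [5.0, 4.0]])

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)           # [0 0 1 1] -- assignments stop changing
print(km.cluster_centers_)  # converged cluster means
```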
Gaussian mixture model-based clustering
- K-means is hard clustering: at each iteration, a data point is assigned to one and only one cluster
- We can do soft clustering based on Gaussian mixture models
- Each cluster is represented by a distribution (in our case, a Gaussian)
- We assume the data is generated by a mixture of the Gaussians
Gaussian distribution
$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
- $\mu$: mean
- $\sigma$: standard deviation
Representation of Clusters
In the EM approach, we'll represent each cluster using a p-dimensional multivariate Gaussian
$$N(x \mid \mu_j, \Sigma_j)$$
where $\mu_j$ is the mean of the Gaussian and $\Sigma_j$ is the covariance matrix.
[Figure: a representation of a Gaussian in a 2-D space]
Cluster generation
We model our data points with a generative process. Each point is generated by:
- choosing a cluster j (from 1, 2, ..., K) by sampling from a probability distribution over the clusters, where Prob(cluster j) = $P_j$
- sampling a point from that cluster's Gaussian distribution $N_j$
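A minimal sketch of this two-stage generative process (the mixture parameters below are hypothetical; NumPy only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-cluster mixture: priors P_j, 2-D means, fixed shared covariance
P = np.array([0.6, 0.4])                  # Prob(cluster j)
mus = np.array([[0.0, 0.0], [3.0, 3.0]])  # mean of each Gaussian N_j
cov = np.eye(2)                           # fixed covariance matrix

def generate_point():
    j = rng.choice(len(P), p=P)                      # 1) choose a cluster j
    return rng.multivariate_normal(mus[j], cov), j   # 2) sample a point from N_j

points = [generate_point() for _ in range(5)]
```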
EM Clustering
The EM algorithm will try to set the parameters of the Gaussians, $\theta$, to maximize the log likelihood of the data X:
$$\log P(X \mid \theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{K} P_j \, N(x_i \mid \mu_j)$$
EM Clustering
The parameters of the model, $\theta$, include the means, the covariance matrix, and sometimes prior weights for each Gaussian. Here, we'll assume that the covariance matrix is fixed; we'll focus just on setting the means and the prior weights.
EM Clustering: Hidden Variables
- On each iteration of K-means clustering, we had to assign each instance to a cluster
- In the EM approach, we'll use hidden variables to represent this idea
- For each instance $x_i$ we have a set of hidden variables $Z_{i1}, \ldots, Z_{iK}$
- We can think of $Z_{ij}$ as being 1 if $x_i$ is a member of cluster j and 0 otherwise
EM Clustering: the E-step
Recall that $Z_{ij}$ is a hidden variable which is 1 if $N_j$ generated $x_i$ and 0 otherwise. In the E-step, we compute $E[Z_{ij}]$, the expected value of this hidden variable:
$$E[Z_{ij}] = \frac{P_j \, N(x_i \mid \mu_j)}{\sum_{k=1}^{K} P_k \, N(x_i \mid \mu_k)}$$
[Figure: a point softly assigned to two clusters, with expected values 0.3 and 0.7]
EM Clustering: the M-step
Given the expected values $E[Z_{ij}]$, we re-estimate the means of the Gaussians and the cluster probabilities:
$$\mu_j = \frac{\sum_i E[Z_{ij}] \, x_i}{\sum_i E[Z_{ij}]} \qquad\qquad P_j = \frac{\sum_i E[Z_{ij}]}{N}$$
We can also re-estimate the covariance matrix if we're varying it.
EM Clustering: Overall algorithm
- Initialize parameters (e.g., means)
- Loop until convergence:
  - E-step: compute the expected values $E[Z_{ij}]$ given the current parameters
  - M-step: update the parameters (means and cluster probabilities) using the $E[Z_{ij}]$ values
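A sketch of the whole loop for the one-dimensional, fixed-variance setting used in the example that follows (the function name, vectorization, and fixed iteration count are my own choices):

```python
import numpy as np

def em_gmm_1d(x, mus, priors, var, n_iter=50):
    """EM for a 1-D Gaussian mixture with a fixed, shared variance.

    x: (N,) data; mus: (K,) initial means; priors: (K,) initial P_j.
    """
    x = np.asarray(x, dtype=float)
    for _ in range(n_iter):
        # E-step: E[Z_ij] proportional to P_j * N(x_i | mu_j, var)
        dens = np.exp(-(x[:, None] - mus) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        ez = priors * dens
        ez /= ez.sum(axis=1, keepdims=True)
        # M-step: re-estimate means and cluster probabilities
        mus = (ez * x[:, None]).sum(axis=0) / ez.sum(axis=0)
        priors = ez.mean(axis=0)
    return mus, priors, ez
```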
EM Clustering Example
Consider a one-dimensional clustering problem in which the data given are:
x1 = -4, x2 = -3, x3 = -1, x4 = 3, x5 = 5
The initial mean of the first Gaussian is 0 and the initial mean of the second is 2. The Gaussians both have variance = 4; their density function is:
$$f(x) = \frac{1}{\sqrt{2\pi \cdot 4}} \exp\left(-\frac{(x-\mu)^2}{2 \cdot 4}\right)$$
where $\mu$ denotes the mean (center) of the Gaussian. Initially, we set $P_1 = P_2 = 0.5$.
EM Clustering Example: E Step
[Figure: the $E[Z_{ij}]$ values computed for each of the five data points under the two initial Gaussians]
EM Clustering Example: M-step
[Figure: the updated means and cluster probabilities computed from the $E[Z_{ij}]$ values]
EM Clustering Example
Here we've shown just one step of the EM procedure; we would continue the E- and M-steps until convergence.
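That single step can be checked numerically; a sketch of the computation on the example data (printed values are approximate):

```python
import numpy as np

x = np.array([-4.0, -3.0, -1.0, 3.0, 5.0])
mus = np.array([0.0, 2.0])     # initial means
priors = np.array([0.5, 0.5])  # P_1 = P_2 = 0.5
var = 4.0

# E-step: responsibilities of Gaussians 1 and 2 for each point
dens = np.exp(-(x[:, None] - mus) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
ez = priors * dens
ez /= ez.sum(axis=1, keepdims=True)
print(ez[:, 0])  # approx [0.92, 0.88, 0.73, 0.27, 0.12]

# M-step: updated means and cluster probabilities
mus = (ez * x[:, None]).sum(axis=0) / ez.sum(axis=0)
priors = ez.mean(axis=0)
print(mus)     # approx [-1.94, 2.73]
print(priors)  # approx [0.58, 0.42]
```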
Comparing K-means and GMMs
K-means:
- Hard clustering
- Optimizes within-cluster scatter
- Requires estimation of means
GMMs:
- Soft clustering
- Optimizes likelihood of the data
- Requires estimation of means (and possibly covariances) and mixture probabilities
Choosing the number of clusters
Picking the number of clusters by directly optimizing the clustering objective will result in K = N (the number of data points). Two common options:
- pick K based on a penalized clustering objective
- pick K based on cross-validation
Picking K based on cross-validation
- Split the data set into 3 sets
- For each split: cluster the training set, then compute the objective on the held-out test set
- Average the test-set objectives over the splits for each candidate K
- Run the method on all of the data once K has been determined
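One way to realize this scheme, sketched with scikit-learn (the fold count and candidate K values are illustrative; for K-means, `KMeans.score` returns the negative within-cluster scatter, so we negate it to get a held-out objective to minimize):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def pick_k(X, candidates=(2, 3, 4, 5), n_splits=3):
    """Pick K minimizing the average held-out within-cluster scatter."""
    avg_score = {}
    for k in candidates:
        fold_scores = []
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        for train, test in kf.split(X):
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[train])
            fold_scores.append(-km.score(X[test]))  # test-set scatter
        avg_score[k] = np.mean(fold_scores)
    return min(avg_score, key=avg_score.get)
```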
Evaluation of clusters
- Internal validation: how well does the clustering optimize intra-cluster similarity and inter-cluster dissimilarity?
- External validation: do genes in the same cluster have similar function?
Internal validation
One measure for assessing cluster quality is the Silhouette index (SI); the more positive the SI, the better the clusters:
$$SI = \frac{1}{K} \sum_{j=1}^{K} \frac{1}{|C_j|} \sum_{x_i \in C_j} \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}}$$
- $K$: number of clusters
- $C_j$: set representing the jth cluster
- $b(x_i)$: average distance of $x_i$ to instances in the next closest cluster
- $a(x_i)$: average distance of $x_i$ to other instances in the same cluster
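scikit-learn computes the silhouette directly (note it averages the per-point score over all points rather than per cluster); a minimal sketch on illustrative data:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Illustrative profiles and their cluster assignments C(i)
X = np.array([[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [3.2, 2.9]])
labels = np.array([0, 0, 1, 1])

# Mean of (b - a) / max(a, b) over all points; closer to +1 is better
print(silhouette_score(X, labels))
```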
External validation
- Are genes in the same cluster associated with similar function?
- The Gene Ontology (GO) is a controlled vocabulary of terms, organized by category, used to annotate genes
- One can use GO terms to test whether the genes in a cluster are associated with the same GO term more often than expected by chance
- One can also check whether genes in a cluster are associated with similar transcription factor binding sites
The Gene Ontology
A controlled vocabulary of more than 30K concepts describing molecular functions, biological processes, and cellular components.
Gene Ontology terms for yeast genes
Gene      GO process
YDR420W   1,3-beta-D-glucan_biosynthetic_process
YGR032W   1,3-beta-D-glucan_biosynthetic_process
YLR342W   1,3-beta-D-glucan_biosynthetic_process
YMR306W   1,3-beta-D-glucan_biosynthetic_process
YDR420W   1,3-beta-D-glucan_metabolic_process
YGR032W   1,3-beta-D-glucan_metabolic_process
YLR342W   1,3-beta-D-glucan_metabolic_process
YMR306W   1,3-beta-D-glucan_metabolic_process
YDL049C   1,6-beta-glucan_biosynthetic_process
YFR042W   1,6-beta-glucan_biosynthetic_process
YGR143W   1,6-beta-glucan_biosynthetic_process
YKL035W   1,6-beta-glucan_biosynthetic_process
YOR336W   1,6-beta-glucan_biosynthetic_process
YPR159W   1,6-beta-glucan_biosynthetic_process
YDL049C   1,6-beta-glucan_metabolic_process
YFR042W   1,6-beta-glucan_metabolic_process
YGR143W   1,6-beta-glucan_metabolic_process
YJL174W   1,6-beta-glucan_metabolic_process
YKL035W   1,6-beta-glucan_metabolic_process
YOR336W   1,6-beta-glucan_metabolic_process
YPR159W   1,6-beta-glucan_metabolic_process
YDR111C   1-aminocyclopropane-1-carboxylate_biosynthetic_process
…
Assessing significance of association
- Assume we have a term t with $N_t$ genes annotated with it in the genome
- For a cluster c with $N_c$ genes, let $H_{ct}$ be the number of genes in c annotated with term t
- We ask whether c's members are significantly associated with term t
- Specifically, we ask how likely we are to see $H_{ct}$ of $N_c$ genes annotated with t by random chance
- A common way to do this is using the hypergeometric distribution
Hypergeometric distribution
A probability distribution over discrete random variables that describes the probability of k successes in n draws, without replacement, from a finite population of size N containing exactly K successes.
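In symbols, the probability of exactly k successes is the standard hypergeometric PMF (stated here for reference, using the variable names above):
$$P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$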
Hypergeometric test example
Consider a crate of 100 wine bottles: 20 bottles of red wine and 80 bottles of white wine. Suppose you were blindfolded and allowed to select 10 bottles. What is the probability that exactly 5 of the 10 selected bottles are red wine?
- Total number of objects: N = 100
- Number of successes in the population: K = 20
- Number of draws: n = 10
- Number of successes in our draws: k = 5
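SciPy evaluates this directly; a quick check of the example (scipy's hypergeom takes the population size, the number of successes, and the number of draws, in that order):

```python
from scipy.stats import hypergeom

# P(exactly k=5 red bottles in n=10 draws from N=100 bottles, K=20 of them red)
print(hypergeom.pmf(5, 100, 20, 10))  # approx 0.0215
```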
Hypergeometric distribution to assess significance of cluster-term association
- $N_c$: number of genes in cluster c (draws)
- $N_t$: number of genes in our genome background that have term t (successes)
- $H_{ct}$: number of genes in cluster c with term t (successes in draws)
- $N$: total genes in the background
The probability of observing $H_{ct}$ of $N_c$ genes with term t by random chance is
$$P(H_{ct}) = \frac{\binom{N_t}{H_{ct}} \binom{N - N_t}{N_c - H_{ct}}}{\binom{N}{N_c}}$$
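In practice, significance is usually assessed with the upper tail: the probability of seeing at least $H_{ct}$ hits by chance. A sketch with scipy (the counts below are hypothetical):

```python
from scipy.stats import hypergeom

# Hypothetical counts: N genes in the background, N_t with term t,
# a cluster of N_c genes of which H_ct carry the term
N, N_t, N_c, H_ct = 6000, 40, 100, 8

# P(X >= H_ct); sf(k) gives P(X > k), so shift by one
p_value = hypergeom.sf(H_ct - 1, N, N_t, N_c)
print(p_value)
```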
Other commonly used annotation categories
- KEGG pathways
- BioCarta
- MSigDB (Molecular Signatures Database)
- REACTOME
- binding sites of transcription factors
Using Gene Ontology to assess the quality of a cluster
[Figure: heatmap of genes by conditions, with clusters annotated by enriched GO terms and by transcription factor binding sites for HAP4 and MSN2/4]
Summary of flat clustering
- Two methods: K-means (hard clustering) and Gaussian mixture models (soft clustering)
- Both need the user to specify K
- K can be picked based on a penalized objective or cross-validation
- Clusters can be evaluated based on internal and external validation