Clustering approaches for high-throughput data
Sushmita Roy
BMI/CS 576
www.biostat.wisc.edu/bmi576
sroy@biostat.wisc.edu
Nov 12th, 2013

Key concepts
Hierarchical clustering
Determining the number of clusters
Ways to assess cluster quality

Hierarchical clustering
In K-means and GMMs we need to specify the number of clusters.
Hierarchical clustering instead requires us to specify how much dissimilarity we will tolerate.
We maintain a matrix of distance (or similarity) scores for all pairs of
  – expression vectors
  – clusters (formed so far)
  – expression vectors and clusters

Hierarchical clustering
[Figure: dendrogram drawn against a distance scale starting at 0]
  – leaves represent objects to be clustered (e.g. genes)
  – height of bar indicates degree of distance within cluster

Distance between two clusters
The distance between two clusters can be determined in several ways
  – single link: distance between the two most similar profiles, one from each cluster
  – complete link: distance between the two least similar profiles, one from each cluster
  – average link: average distance over all pairs of profiles, one from each cluster
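The three linkage definitions can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: it assumes Euclidean distance between profiles, and the function name `cluster_distance` and the toy clusters `a` and `b` are made up for the example.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def cluster_distance(a, b, linkage="average"):
    """Distance between clusters a and b, each a list of numeric profiles."""
    # All cross-cluster pairwise distances (one profile from each cluster).
    pairwise = [dist(x, y) for x, y in product(a, b)]
    if linkage == "single":      # two most similar profiles
        return min(pairwise)
    if linkage == "complete":    # two least similar profiles
        return max(pairwise)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


# Toy clusters (assumed data, two profiles each).
a = [(0.0, 0.0), (0.0, 1.0)]
b = [(3.0, 0.0), (4.0, 0.0)]
for name in ("single", "complete", "average"):
    print(name, cluster_distance(a, b, name))
```

Single link is never larger than average link, which is never larger than complete link, because they are the minimum, mean, and maximum of the same set of pairwise distances.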

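Updating distances efficiently
If we just merged $c_u$ and $c_v$ into $c_j$, we can determine the distance to each other cluster $c_k$ as follows
  – single link: $d(c_j, c_k) = \min\{d(c_u, c_k),\, d(c_v, c_k)\}$
  – complete link: $d(c_j, c_k) = \max\{d(c_u, c_k),\, d(c_v, c_k)\}$
  – average link: $d(c_j, c_k) = \dfrac{|c_u|\, d(c_u, c_k) + |c_v|\, d(c_v, c_k)}{|c_u| + |c_v|}$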
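A minimal Python sketch of such an update, using the standard constant-time rules: minimum for single link, maximum for complete link, and a size-weighted mean for average link. The names are assumptions chosen for illustration: `u` and `v` are the two clusters just merged, `k` is any other cluster, and `d_uk`, `d_vk` are the distances already stored in the matrix.

```python
def updated_distance(d_uk, d_vk, size_u, size_v, linkage):
    """Distance from the merged cluster (u + v) to another cluster k.

    d_uk, d_vk: previously stored distances from u and v to k.
    size_u, size_v: number of profiles in u and v (used by average link only).
    """
    if linkage == "single":
        return min(d_uk, d_vk)
    if linkage == "complete":
        return max(d_uk, d_vk)
    if linkage == "average":
        # Weighted by cluster size so every cross-cluster pair counts once.
        return (size_u * d_uk + size_v * d_vk) / (size_u + size_v)
    raise ValueError(f"unknown linkage: {linkage}")


# Toy 1-D example: u = {0}, v = {2, 3}, k = {10}.
# Average-link distances before the merge: d(u, k) = 10, d(v, k) = (8 + 7) / 2 = 7.5.
print(updated_distance(10.0, 7.5, 1, 2, "average"))
```

The point of the update is that no profile-level distances need to be recomputed after a merge; only the two stored rows for `u` and `v` are combined.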

Effect of different linkage methods
[Figure panels: Complete linkage, Average linkage, Single linkage]

Flat clustering from a hierarchical clustering
We can always generate a flat clustering from a hierarchical clustering by “cutting” the tree at some distance threshold.
[Figure: dendrogram with two cut levels, one resulting in 2 clusters and one resulting in 4 clusters]
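One way to see the effect of the threshold is a naive agglomerative loop that stops merging once the closest pair of clusters is farther apart than the threshold. For single, complete, and average link this gives the same flat clustering as cutting the finished tree at that height. The sketch below assumes single link and Euclidean distance; `agglomerate_until` and the 1-D toy data are hypothetical names and values for illustration only.

```python
from itertools import combinations
from math import dist


def single_link(a, b):
    """Single-link distance: closest pair of profiles across two clusters."""
    return min(dist(x, y) for x in a for y in b)


def agglomerate_until(points, threshold):
    """Merge the closest pair of clusters until none is within `threshold`."""
    clusters = [[p] for p in points]  # start with one cluster per object
    while len(clusters) > 1:
        # Naive search over all cluster pairs for the closest one.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        if single_link(clusters[i], clusters[j]) > threshold:
            break  # the "cut": remaining clusters are all farther apart
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
print(len(agglomerate_until(points, 0.5)))  # low threshold: many small clusters
print(len(agglomerate_until(points, 2.0)))
print(len(agglomerate_until(points, 5.0)))  # high threshold: few large clusters
```

Raising the threshold can only merge clusters further, so the flat clusterings obtained at different cut levels are nested.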

Computational complexity
The naïve implementation of hierarchical clustering has time complexity $O(n^3)$, where n is the number of objects
  – computing the initial distance matrix takes $O(n^2)$ time
  – there are $O(n)$ merging steps
  – on each step, we have to update the distance matrix, $O(n)$, and select the next pair of clusters to merge, $O(n^2)$
K-means and EM have time complexity $O(kn)$ for each iteration
  – reassignment step: compute k × n distances
  – recomputation step: loop through n profiles updating k means

Choosing the number of clusters
Picking the number of clusters by optimizing the clustering objective alone will result in k = N (the number of data points), because the objective keeps improving as clusters are added
Pick k based on a penalized clustering objective
Pick k based on cross-validation

Picking k based on cross-validation
  – Split the data set into 3 sets
  – For each split, cluster the training set
  – Evaluate: compute the objective based on the test data
  – Average the evaluations across the splits
  – Run the method on all data once k has been determined

Evaluation of clusters
Internal validation
  – How well does the clustering optimize intra-cluster similarity and inter-cluster dissimilarity?
External validation
  – Do genes in the same cluster have similar function?

Internal validation
One measure for assessing cluster quality is the Silhouette index (SI)
The more positive the SI, the better the clusters
$SI = \dfrac{1}{K} \sum_{j=1}^{K} \dfrac{1}{|C_j|} \sum_{x_i \in C_j} \dfrac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}}$
  – $K$: number of clusters
  – $C_j$: set representing the j-th cluster
  – $b(x_i)$: average distance of $x_i$ to instances in the next closest cluster
  – $a(x_i)$: average distance of $x_i$ to other instances in the same cluster
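A minimal Python sketch of the Silhouette index built from the a(x_i) and b(x_i) definitions. Several choices here are assumptions rather than something the slide states: Euclidean distance, averaging first within each cluster and then over the K clusters (suggested by the K and C_j symbols), and scoring a point in a singleton cluster as 0. The function name and toy data are hypothetical; at least two clusters are required.

```python
from math import dist


def silhouette_index(clusters):
    """clusters: list of at least two clusters, each a list of points (tuples)."""

    def avg_dist(x, members):
        return sum(dist(x, y) for y in members) / len(members)

    per_cluster = []
    for j, cluster in enumerate(clusters):
        scores = []
        for i, x in enumerate(cluster):
            others = cluster[:i] + cluster[i + 1:]
            if not others:
                scores.append(0.0)  # assumed convention for singleton clusters
                continue
            a = avg_dist(x, others)  # average distance within own cluster
            # Average distance to the next closest cluster.
            b = min(avg_dist(x, c) for m, c in enumerate(clusters) if m != j)
            scores.append((b - a) / max(a, b))
        per_cluster.append(sum(scores) / len(scores))
    return sum(per_cluster) / len(per_cluster)


good = [[(0.0,), (1.0,)], [(10.0,), (11.0,)]]   # tight, well-separated clusters
bad = [[(0.0,), (10.0,)], [(1.0,), (11.0,)]]    # same points, poor grouping
print(silhouette_index(good), silhouette_index(bad))
```

Each point's score lies between -1 and 1, so the index does too; the well-separated grouping scores close to 1 while the poor grouping scores below 0.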

External validation
Are genes in the same cluster associated with similar function?
Gene Ontology (GO) is a controlled vocabulary of terms used to annotate genes of a particular category
One can use GO terms to study whether the genes in a cluster are associated with the same GO term more than expected by chance
One can also see if genes in a cluster are associated with similar transcription factor binding sites
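"More than expected by chance" is commonly quantified with a hypergeometric tail probability, although the slide does not name a specific test, so treat that choice as an assumption. The sketch below computes the probability of seeing at least the observed number of annotated genes in a cluster if the cluster were drawn at random from all genes. The function and parameter names are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_genes, n_annotated, cluster_size, n_overlap):
    """P(X >= n_overlap) under the hypergeometric distribution.

    n_genes:      total number of genes considered
    n_annotated:  genes annotated with the GO term
    cluster_size: genes in the cluster
    n_overlap:    genes in the cluster that carry the GO term
    """
    total = comb(n_genes, cluster_size)  # all ways to draw the cluster
    upper = min(n_annotated, cluster_size)
    tail = sum(
        comb(n_annotated, x) * comb(n_genes - n_annotated, cluster_size - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers (assumed): 10 genes, 3 annotated, cluster of 3, all 3 annotated.
print(enrichment_pvalue(10, 3, 3, 3))
```

A small value suggests the cluster shares the term more often than random gene sets of the same size would. When many GO terms are tested, a multiple-testing correction is usually applied on top of this.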

The Gene Ontology
A controlled vocabulary of more than 30,000 concepts describing molecular functions, biological processes, and cellular components

Using Gene Ontology to assess the quality of a cluster
[Figure: a cluster shown as genes × conditions, alongside its GO terms and transcription factor binding sites for HAP4 and MSN2/4]

Summary of clustering
Many different methods to cluster
  – Flat clustering
  – Hierarchical clustering
Distance metrics among objects can influence clustering results a lot
Picking the number of clusters is difficult but there are some ways to do this
Evaluation of clusters is hard sometimes
  – Comparison with other sources of information can help assess cluster quality
