Published by Mervyn Fox. Modified over 9 years ago.
1
Clustering approaches for high-throughput data
Sushmita Roy
BMI/CS 576
www.biostat.wisc.edu/bmi576
sroy@biostat.wisc.edu
Nov 12th, 2013
2
Key concepts
– Hierarchical clustering
– Determining the number of clusters
– Ways to assess cluster quality
3
Hierarchical clustering
In K-means and GMMs, we need to specify the number of clusters.
Hierarchical clustering instead requires us to specify how much dissimilarity we will tolerate.
We maintain a matrix of distance (or similarity) scores for all pairs of
– expression vectors
– clusters (formed so far)
– expression vectors and clusters
4
Hierarchical clustering
[Figure: dendrogram drawn against a distance scale starting at 0]
– Leaves represent the objects to be clustered (e.g. genes)
– The height of a bar indicates the degree of distance within the cluster
5
Distance between two clusters
The distance between two clusters can be determined in several ways
– single link: distance between the two most similar profiles
– complete link: distance between the two least similar profiles
– average link: average distance between profiles
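The three linkage definitions can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names `euclidean` and `linkage_distance`, the choice of Euclidean distance, and the toy profiles are all assumptions made for the example.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two profiles (an assumed choice of metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def linkage_distance(cluster_a, cluster_b, method="single", dist=euclidean):
    """Distance between two clusters, each given as a list of profiles."""
    # All cross-cluster pairwise distances.
    pairwise = [dist(p, q) for p in cluster_a for q in cluster_b]
    if method == "single":      # two most similar profiles
        return min(pairwise)
    if method == "complete":    # two least similar profiles
        return max(pairwise)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


# Hypothetical toy clusters lying on a line.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]
for method in ("single", "complete", "average"):
    print(method, linkage_distance(cluster_a, cluster_b, method))
```

On the same pair of clusters, single link can never exceed average link, which in turn can never exceed complete link.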
6
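Updating distances efficiently
If we just merged clusters $c_u$ and $c_v$ into $c_j$, we can determine the distance from $c_j$ to each other cluster $c_k$ as follows
– single link: $d(c_j, c_k) = \min\{d(c_u, c_k),\, d(c_v, c_k)\}$
– complete link: $d(c_j, c_k) = \max\{d(c_u, c_k),\, d(c_v, c_k)\}$
– average link: $d(c_j, c_k) = \dfrac{|c_u|\, d(c_u, c_k) + |c_v|\, d(c_v, c_k)}{|c_u| + |c_v|}$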
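As a rough sketch of how such updates fit into the merging loop, the code below runs naive agglomerative clustering on a precomputed distance matrix. It assumes the standard update rules: the minimum of the two old distances for single link, the maximum for complete link, and the size-weighted mean for average link. The function name `agglomerative`, the merge-history format `(id_u, id_v, height, new_id)`, and the toy data are illustrative choices, not part of the lecture.

```python
def agglomerative(dist_matrix, method="average"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Returns the merge history as (id_u, id_v, height, new_id) tuples.
    Ids 0..n-1 are the original objects; each merge creates a new id.
    """
    n = len(dist_matrix)
    size = {i: 1 for i in range(n)}  # active cluster id -> number of members
    # d[(i, j)] with i < j is the current distance between clusters i and j.
    d = {(i, j): dist_matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(size) > 1:
        # Select the closest pair of active clusters.
        (u, v), height = min(d.items(), key=lambda item: item[1])
        # Distance from the merged cluster to every other active cluster,
        # computed only from the two old distances (no raw data needed).
        for k in size:
            if k in (u, v):
                continue
            d_uk = d.pop((min(u, k), max(u, k)))
            d_vk = d.pop((min(v, k), max(v, k)))
            if method == "single":
                new = min(d_uk, d_vk)
            elif method == "complete":
                new = max(d_uk, d_vk)
            elif method == "average":
                new = (size[u] * d_uk + size[v] * d_vk) / (size[u] + size[v])
            else:
                raise ValueError(f"unknown linkage method: {method}")
            d[(k, next_id)] = new  # k < next_id always holds
        del d[(u, v)]
        size[next_id] = size.pop(u) + size.pop(v)
        merges.append((u, v, height, next_id))
        next_id += 1
    return merges


# Hypothetical 1-D objects at positions 0, 1, 5, 6.
positions = [0.0, 1.0, 5.0, 6.0]
D = [[abs(a - b) for b in positions] for a in positions]
for merge in agglomerative(D, "average"):
    print(merge)
```

Each merge touches one row of the distance table, while the search for the closest pair scans the whole table, which is what makes this naive version slow on large inputs.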
7
Effect of different linkage methods
[Figure panels: Complete linkage, Average linkage, Single linkage]
8
Flat clustering from a hierarchical clustering
We can always generate a flat clustering from a hierarchical clustering by “cutting” the tree at some distance threshold
[Figure: dendrogram with two cut lines, annotated “cutting here results in 2 clusters” and “cutting here results in 4 clusters”]
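Cutting the tree can be sketched as replaying the merges in order and stopping at the first one whose height exceeds the threshold. The merge-history format `(id_u, id_v, height, new_id)`, the function name `cut_tree`, and the example tree are assumptions made for illustration; the sketch also assumes merge heights never decrease, which holds for single, complete, and average linkage.

```python
def cut_tree(n_leaves, merges, threshold):
    """Flat cluster labels obtained by cutting a dendrogram at `threshold`.

    `merges` is a list of (id_u, id_v, height, new_id) tuples in the order
    the merges were made, with non-decreasing heights (assumed).
    Returns a list with one integer label per leaf.
    """
    members = {i: [i] for i in range(n_leaves)}  # cluster id -> leaf ids
    for u, v, height, new_id in merges:
        if height > threshold:
            break  # every later merge is at least this high
        members[new_id] = members.pop(u) + members.pop(v)
    labels = [None] * n_leaves
    for label, leaves in enumerate(members.values()):
        for leaf in leaves:
            labels[leaf] = label
    return labels


# Hypothetical tree over 4 leaves: two tight pairs joined at height 5.
example_merges = [(0, 1, 1.0, 4), (2, 3, 1.0, 5), (4, 5, 5.0, 6)]
print(cut_tree(4, example_merges, threshold=2.0))
```

A low threshold leaves every object in its own cluster, and a threshold above the root height returns a single cluster.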
9
Computational complexity
The naïve implementation of hierarchical clustering has $O(n^3)$ time complexity, where n is the number of objects
– computing the initial distance matrix takes $O(n^2)$ time
– there are $O(n)$ merging steps
– on each step, we have to update the distance matrix, $O(n)$, and select the next pair of clusters to merge, $O(n^2)$
K-means and EM have $O(Kn)$ time complexity for each iteration
– reassignment step: compute K × n distances
– recomputation step: loop through n profiles updating K means
10
Choosing the number of clusters
Picking the number of clusters by optimizing the clustering objective alone will result in k = N (the number of data points), because the objective keeps improving as clusters are added
Pick k based on a penalized clustering objective
Pick k based on cross-validation
11
Picking k based on cross-validation
[Figure: the data set is split into 3 sets; for each split, cluster the training set, evaluate on the test set, then average]
– Compute the objective based on test data
– Run the method on all data once k has been determined
12
Evaluation of clusters
Internal validation
– How well does the clustering optimize intra-cluster similarity and inter-cluster dissimilarity?
External validation
– Do genes in the same cluster have similar function?
13
Internal validation
One measure for assessing cluster quality is the Silhouette index (SI)
$SI = \dfrac{1}{K} \sum_{j=1}^{K} \dfrac{1}{|C_j|} \sum_{x_i \in C_j} \dfrac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}}$
The more positive the SI, the better the clusters
– K: number of clusters
– $C_j$: set representing the j-th cluster
– $b(x_i)$: average distance of $x_i$ to instances in the next closest cluster
– $a(x_i)$: average distance of $x_i$ to other instances in the same cluster
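A minimal sketch of the silhouette computation follows, using the definitions of a(x_i) and b(x_i) given above. It assumes each instance scores (b − a) / max(a, b), that scores are averaged within each cluster and then across the K clusters, and that singleton clusters contribute 0. The function names, Euclidean distance, and the toy clusterings are illustrative assumptions.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_index(clusters, dist=euclidean):
    """Silhouette index of a clustering given as a list of K lists of points.

    Assumes K >= 2, no empty clusters, and not all points identical.
    """
    total = 0.0
    for j, cluster in enumerate(clusters):
        cluster_sum = 0.0
        for i, x in enumerate(cluster):
            same = [y for m, y in enumerate(cluster) if m != i]
            if not same:
                continue  # singleton cluster: score taken as 0 (assumption)
            # a: average distance to the other instances in the same cluster.
            a = sum(dist(x, y) for y in same) / len(same)
            # b: average distance to the instances of the next closest cluster.
            b = min(
                sum(dist(x, y) for y in other) / len(other)
                for l, other in enumerate(clusters)
                if l != j
            )
            cluster_sum += (b - a) / max(a, b)
        total += cluster_sum / len(cluster)
    return total / len(clusters)


# Hypothetical data: the same four points grouped sensibly and badly.
good = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]
print(silhouette_index(good), silhouette_index(bad))
```

The sensible grouping scores close to 1, while the grouping that pairs distant points scores below 0.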
14
External validation
Are genes in the same cluster associated with similar function?
Gene Ontology (GO) is a controlled vocabulary of terms used to annotate genes of a particular category
One can use GO terms to study whether the genes in a cluster are associated with the same GO term more than expected by chance
One can also see if genes in a cluster are associated with similar transcription factor binding sites
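A common way to quantify "more than expected by chance" is a hypergeometric tail probability. The lecture text does not name a specific test, so this choice, the function name `go_enrichment_pvalue`, and the gene counts in the example are assumptions for illustration.

```python
from math import comb


def go_enrichment_pvalue(n_genome, n_annotated, n_cluster, n_overlap):
    """P(X >= n_overlap) under the hypergeometric null.

    n_genome:    total number of genes considered
    n_annotated: genes in the genome annotated with the GO term
    n_cluster:   genes in the cluster
    n_overlap:   genes in the cluster annotated with the GO term
    """
    denom = comb(n_genome, n_cluster)
    upper = min(n_annotated, n_cluster)
    # Sum the probability of seeing n_overlap or more annotated genes when
    # n_cluster genes are drawn at random without replacement.
    tail = sum(
        comb(n_annotated, x) * comb(n_genome - n_annotated, n_cluster - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / denom


# Hypothetical counts: 6000 genes, 100 carry the term, and a 50-gene
# cluster contains 10 of them.
print(go_enrichment_pvalue(6000, 100, 50, 10))
```

When many GO terms are tested for each cluster, the resulting p-values would normally need a multiple-testing correction before being interpreted.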
15
The Gene Ontology
A controlled vocabulary of more than 30K concepts describing molecular functions, biological processes, and cellular components
16
Using Gene Ontology to assess the quality of a cluster
[Figure: genes by conditions, annotated with GO terms and transcription factor binding sites for HAP4 and MSN2/4]
17
Summary of clustering
Many different methods to cluster
– Flat clustering
– Hierarchical clustering
– The distance metric used between objects can strongly influence clustering results
Picking the number of clusters is difficult, but there are some ways to do this
Evaluating clusters can be hard
– Comparison with other sources of information can help assess cluster quality