Important clustering methods used in microarray data analysis
Steve Horvath
Human Genetics and Biostatistics, UCLA
Contents: multidimensional scaling plots (related to principal component analysis), k-means clustering, hierarchical clustering.
Introduction to clustering
MDS plot of clusters
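To make the MDS plot concrete, here is a minimal sketch (not from the original slides) that embeds samples in 2D from a precomputed dissimilarity matrix using scikit-learn; the random matrix `expr` is only a stand-in for real microarray data.

```python
# Minimal sketch of an MDS plot from a dissimilarity matrix (assumes
# scikit-learn and matplotlib; the toy data stand in for microarray samples).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
expr = rng.normal(size=(30, 100))           # 30 samples x 100 genes (toy data)

# Euclidean dissimilarities between samples
d = np.linalg.norm(expr[:, None, :] - expr[None, :, :], axis=-1)

# Embed the samples in 2D so that pairwise distances in the plane
# approximate the input dissimilarities.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(d)

plt.scatter(coords[:, 0], coords[:, 1])
plt.xlabel("MDS 1"); plt.ylabel("MDS 2")
plt.title("MDS plot of samples")
plt.show()
```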
Two references for clustering:
T. Hastie, R. Tibshirani, J. Friedman (2002). The Elements of Statistical Learning. Springer Series in Statistics.
L. Kaufman, P. Rousseeuw (1990). Finding Groups in Data. Wiley Series in Probability.
Introduction to clustering Cluster analysis aims to group or segment a collection of objects into subsets or "clusters", such that those within each cluster are more closely related to one another than objects assigned to different clusters. An object can be described by a set of measurements (e.g. covariates, features, attributes) or by its relation to other objects. Sometimes the goal is to arrange the clusters into a natural hierarchy, which involves successively grouping or merging the clusters themselves so that at each level of the hierarchy clusters within the same group are more similar to each other than those in different groups.
Proximity matrices are the input to most clustering algorithms. Proximity between pairs of objects is measured by a similarity or a dissimilarity. If the original data were collected as similarities, a monotone-decreasing function can be used to convert them to dissimilarities. Most algorithms use (symmetric) dissimilarities (e.g. distances), but the triangle inequality does *not* have to hold. Triangle inequality: $d_{ii'} \le d_{ik} + d_{i'k}$ for all $i, i', k$.
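A sketch of this conversion (my own illustration, assuming NumPy): correlation is a similarity, and the monotone-decreasing map $s \mapsto 1 - s$ turns it into a dissimilarity, which need not be a metric.

```python
# Sketch: building a (symmetric) dissimilarity matrix for clustering.
# 1 - correlation is a common dissimilarity for microarray data, but it
# is not a metric: the triangle inequality need not hold.
import numpy as np

rng = np.random.default_rng(1)
expr = rng.normal(size=(20, 50))        # 20 samples x 50 genes (toy data)

sim = np.corrcoef(expr)                 # similarity: pairwise correlation
dissim = 1.0 - sim                      # monotone-decreasing map to dissimilarity

assert np.allclose(dissim, dissim.T)    # symmetric, as most algorithms require
assert np.allclose(np.diag(dissim), 0)  # zero self-dissimilarity
```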
Different intergroup dissimilarities. Let G and H represent 2 groups. Common choices are single linkage (nearest neighbor), $d_{SL}(G,H) = \min_{i \in G,\, i' \in H} d_{ii'}$; complete linkage (furthest neighbor), $d_{CL}(G,H) = \max_{i \in G,\, i' \in H} d_{ii'}$; and group average, $d_{GA}(G,H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}$.
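These three definitions translate directly into code. In this sketch (function names are mine), `d` is a full symmetric dissimilarity matrix and `G`, `H` are lists of row indices:

```python
# Sketch of the three intergroup dissimilarities, computed directly
# from a pairwise dissimilarity matrix d.
import numpy as np

def single_linkage(d, G, H):
    return d[np.ix_(G, H)].min()        # closest pair across the groups

def complete_linkage(d, G, H):
    return d[np.ix_(G, H)].max()        # farthest pair across the groups

def group_average(d, G, H):
    return d[np.ix_(G, H)].mean()       # average over all N_G * N_H pairs
```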
Agglomerative clustering, hierarchical clustering and dendrograms
Hierarchical clustering plot
Agglomerative clustering. Agglomerative clustering algorithms begin with every observation representing a singleton cluster. At each of the N-1 steps the closest 2 (least dissimilar) clusters are merged into a single cluster. Therefore a measure of dissimilarity between 2 clusters must be defined.
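A sketch of this procedure using SciPy's `linkage`, which performs exactly these N-1 merges (the toy data are an assumption):

```python
# Sketch: agglomerative clustering with SciPy. linkage() starts from
# singleton clusters and performs the N-1 merges described above; each
# row of Z records one merge (cluster i, cluster j, height, size).
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))            # 10 toy observations

d = pdist(X)                            # condensed pairwise distances
Z = linkage(d, method="average")        # or "single", "complete"
print(Z)                                # (N-1) x 4 merge table
```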
Comparing different linkage methods. If there is a strong clustering tendency, all 3 methods produce similar results. Single linkage has a tendency to combine observations linked by a series of close intermediate observations ("chaining"); it is good for elongated clusters (like pearls on a string), as the sketch below illustrates. Complete linkage may lead to clusters where observations assigned to a cluster are much closer to members of other clusters than they are to some members of their own cluster; use it for very compact clusters. Group average clustering represents a compromise between the extremes of single and complete linkage; use it for ball-shaped clusters.
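A toy illustration of the chaining behavior (my own construction, not from the slides): two parallel elongated clusters that single linkage should recover by following the chain of close intermediate points, while complete linkage typically cuts the long clusters apart.

```python
# Two elongated clusters: points along two parallel lines.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

t = np.linspace(0, 10, 50)
line1 = np.column_stack([t, np.zeros_like(t)])       # elongated cluster 1
line2 = np.column_stack([t, 3 + np.zeros_like(t)])   # parallel cluster, offset 3
X = np.vstack([line1, line2])
true = np.repeat([0, 1], 50)                         # true cluster labels

for method in ["single", "complete", "average"]:
    labels = fcluster(linkage(pdist(X), method=method),
                      t=2, criterion="maxclust") - 1
    # fraction of points assigned consistently with the two lines
    agree = max(np.mean(labels == true), np.mean(labels == 1 - true))
    print(f"{method:>8}: agreement = {agree:.2f}")
```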
Dendrogram. Recursive binary splitting/agglomeration can be represented by a rooted binary tree. The root node represents the entire data set. The N terminal nodes of the tree represent individual observations. Each nonterminal node ("parent") has two daughter nodes. Thus the binary tree can be plotted so that the height of each node is proportional to the intergroup dissimilarity between its 2 daughters. A dendrogram provides a complete description of the hierarchical clustering in graphical format.
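A sketch that draws such a dendrogram with SciPy (toy data, assumed); the heights on the vertical axis are the intergroup dissimilarities at which the merges occurred:

```python
# Sketch: plotting the dendrogram of a linkage result.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

X = np.random.default_rng(3).normal(size=(12, 4))   # 12 toy observations
Z = linkage(pdist(X), method="average")

# Terminal nodes = individual observations; the root = the full data set.
dendrogram(Z)
plt.ylabel("intergroup dissimilarity")
plt.show()
```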
Comments on dendrograms. Caution: different hierarchical methods, as well as small changes in the data, can lead to different dendrograms. Hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data. In general, dendrograms are a description of the results of the algorithm, not a graphical summary of the data. They are a valid summary only to the extent that the pairwise *observation* dissimilarities obey the ultrametric inequality $d_{ii'} \le \max(d_{ik}, d_{i'k})$ for all $i, i', k$.
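To make the ultrametric condition concrete, here is a sketch (the helper `is_ultrametric` is mine): the cophenetic dissimilarities read off a dendrogram satisfy the inequality by construction, whereas raw pairwise distances usually do not.

```python
# Sketch: checking the ultrametric inequality.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

def is_ultrametric(D, tol=1e-10):
    """Check d(i,i') <= max(d(i,k), d(i',k)) for all i, i', k."""
    n = D.shape[0]
    for k in range(n):
        # M[i, j] = max(D[i, k], D[j, k]); compare against D[i, j]
        if np.any(D > np.maximum.outer(D[:, k], D[:, k]) + tol):
            return False
    return True

X = np.random.default_rng(4).normal(size=(8, 3))
d = pdist(X)
Z = linkage(d, method="complete")

print(is_ultrametric(squareform(d)))            # typically False
print(is_ultrametric(squareform(cophenet(Z))))  # True by construction
```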
Figure 1: dendrograms from average, complete, and single linkage.