Elements of cluster analysis

Purpose of cluster analysis
Various clustering techniques
Agglomerative clustering
Individual distances
Group distances
Other types of clustering
Examples
R commands
Purpose of cluster analysis
The purpose of cluster analysis is to divide the given data into groups on the basis of some characteristics. Usually the aim is to find “natural groups”, although it is difficult to define what a natural group is. A tentative definition is a region of high density in the n-dimensional space surrounded by regions of low density. Clusters can then be defined using the density function: if the density is f(x), then for a given level f* the clusters are the maximal connected regions of the space satisfying {x | f(x) > f*}. This definition cannot be considered fully satisfactory, since it applies only to continuous variables with a density, the number of modes is not invariant under non-linear transformations, and so on. Nevertheless, as in many data-analytic techniques, it is easier and useful to think in terms of natural clusters. If some projection of the data gives convincing evidence that there are several groups, then we can expect clustering to give sufficiently good results; if no projection reveals groups, the result of clustering may be misleading. Most clustering algorithms work best with clusters of a particular shape. The easiest clusters to identify are spherical ones. Elongated elliptical clusters may be difficult to identify, and L-shaped clusters are very difficult. In such cases non-parametric mode-hunting techniques may give more reasonable results.
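A minimal R sketch of the level-set idea above, using an artificial one-dimensional sample and an arbitrary level f*; none of this comes from the lecture notes, it only illustrates the definition:
set.seed(1)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 5))  # two artificial "natural groups"
d <- density(x)                                      # kernel estimate of f(x)
f.star <- 0.05                                       # an arbitrary level f*
above <- d$y > f.star                                # grid points where f(x) > f*
runs <- rle(above)                                   # consecutive runs above the level
sum(runs$values)                                     # number of connected regions (clusters)
For two well-separated modes this gives 2; for a unimodal sample it gives 1, and after a non-linear transformation of x the answer may change.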
Clustering techniques
There are many clustering techniques. Here are several of them:
1) Hierarchical clustering. This can be subdivided into two groups: a) divisive clustering starts from one group and divides it iteratively into two groups, then three groups, and so on; b) agglomerative clustering starts from n groups, joins the two closest to obtain n-1 groups, and continues until there is only one group. Both result in a tree-like structure called a dendrogram.
2) Optimal partitioning. The most popular method is k-means. It tries to find g groups in the data by optimising some criterion. If we then want g+1 groups, the whole procedure in general has to be restarted from the beginning.
3) Distribution mixtures. These techniques assume that the data come from a population whose distribution is a mixture of several component distributions. The end product is not a partition itself but a posterior distribution, which can of course be used to assign each individual to a group using the probability of the corresponding mixture component.
4) Non-parametric methods use local density estimation. They search directly at each point of the space and try to avoid distributional assumptions. They are usually very computer-intensive, and the resulting data structure is not always apparent.
Agglomerative clustering
Agglomerative clustering is very simple and (not) surprisingly very popular. It uses an n x n dissimilarity (distance) matrix between individuals. It starts with n groups, finds the minimum distance between groups and joins the corresponding two groups. When two groups are joined, a new distance matrix for the n-1 groups is defined, and the step is repeated. There are many variations of this technique; the difference between them lies in the definition of the between-group distances. Once new groups have been formed the original observations are forgotten. If the distances were Euclidean the original data could in principle be used, but in practice this is almost never done. Working with a dissimilarity matrix has the advantage that missing data, which would cause problems if the original data were used, no longer pose a problem once the dissimilarity matrix has been calculated. The disadvantage of dissimilarity matrices is that they have to be calculated before the groups are known. Moreover, dissimilarities depend on the scale of the original variables, which is usually unknown. In many cases it is useful to transform the data before calculating dissimilarities. Of course the transformation will affect the outcome of the clustering, so it should be done with care: a) some methods produce clusters of roughly equal size, and some transformations may cause two groups to join even if they are well separated in the original data; b) a few points lying between clusters may cause the clusters to join. This is not necessarily a bad thing; sometimes finding the missing link between two clusters may be important.
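As a hedged illustration of the steps just described, here is a minimal R sketch of agglomerative clustering on standardised data; the built-in iris measurements simply stand in for “given data”:
mydata <- scale(iris[, 1:4])           # standardise so that no variable dominates
d <- dist(mydata)                      # n x n dissimilarity matrix (Euclidean by default)
cc <- hclust(d, method = "complete")   # repeatedly join the two closest groups
plot(cc)                               # dendrogram of the n - 1 merges
cutree(cc, k = 3)                      # group labels for a chosen number of groups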
Dendrograms
The result of hierarchical clustering is usually represented by a tree-like structure. The top of the tree represents the single group (root) containing all data points, and at the bottom the individual items are given. The ordinate scale shows the dissimilarity at which groups are joined. At the end of this lecture there is an example of dendrograms produced by different agglomerative clustering techniques. Dendrograms are usually built from the bottom to the top, but it is natural to read them from top to bottom.
Individual dissimilarity
Since agglomerative clustering is based on individual dissimilarities, it is important to consider them carefully. The following points should be considered:
1) Transformation of the data. Variables with highly skewed distributions can be transformed to more or less symmetric ones. They can also be standardised to have more or less equal variances.
2) Choice of variables may be important. Correlated variables can dominate the clusters, and unwanted clusters should be avoided. For example, if males and females in a hospital are affected differently by the same illness, there may be two clusters for that illness; this does not mean that there are two different variants of the illness.
3) Weighting can also be considered. Weights should reflect the relative “importance” of the variables.
4) Distances. There are many different forms of distance, for example Euclidean, city-block (Manhattan) and Minkowski distances.
5) Missing data. When a distance is calculated, only the observed values of the variables can be used, and n can then be replaced by n - r, where r is the number of variables missing in one or the other observation.
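These points can be illustrated with dist(); the small data frame X below is hypothetical and only shows how transformation, choice of distance and missing values are handled:
X  <- data.frame(x1 = c(1, 2, 50), x2 = c(3, NA, 4))  # x1 is skewed, x2 has a missing value
Xs <- scale(log(X))                                   # transform, then standardise
dist(Xs, method = "euclidean")                        # Euclidean distance
dist(Xs, method = "manhattan")                        # city-block distance
# with missing values dist() uses only the variables observed in both rows and
# rescales the sum, in the spirit of replacing n by n - r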
Dissimilarity for discrete variables
It is usual to define a similarity for discrete variables and then convert it to a dissimilarity. For binary variables there are a number of similarity measures. For two individuals x_i and x_j we can define the counts a, b, c, d from the following 2 x 2 table over their variables:

             x_j = 1   x_j = 0
  x_i = 1       a         b
  x_i = 0       c         d

Here a and d are the numbers of matched variables (both 1 or both 0), and b and c are the numbers of mismatched variables. From these counts various similarity coefficients can be defined, for example the simple matching coefficient (a + d)/(a + b + c + d) and the Jaccard coefficient a/(a + b + c); there are many other types of similarity as well. Distances for mixed variables can be defined similarly.
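A small R sketch of these counts and the two coefficients named above; xi and xj are two hypothetical individuals scored on eight binary variables:
xi <- c(1, 1, 0, 0, 1, 0, 1, 0)
xj <- c(1, 0, 0, 1, 1, 0, 0, 0)
n11 <- sum(xi == 1 & xj == 1)            # a: both 1
n10 <- sum(xi == 1 & xj == 0)            # b: mismatch
n01 <- sum(xi == 0 & xj == 1)            # c: mismatch
n00 <- sum(xi == 0 & xj == 0)            # d: both 0
(n11 + n00) / (n11 + n10 + n01 + n00)    # simple matching similarity
n11 / (n11 + n10 + n01)                  # Jaccard similarity (ignores joint absences)
A dissimilarity can be taken as 1 - similarity; daisy() in the cluster package performs this kind of conversion for non-numeric and mixed variables.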
Distances between groups
During clustering the distances between individuals are converted to distances between the newly formed groups. Different definitions of the inter-group distance can produce very different clusters and groups. First consider the two extreme cases.
Single linkage. The distance between two groups is defined as the smallest distance between a member of one group and a member of the other. When single linkage is used, the distances between newly formed groups are either unchanged or reduced, and as a result large groups are formed. Single linkage handles tied (equal) distances well: if two distances are equal, the clustering does not depend on which of them is chosen, and this property is unique to single linkage. On the other hand this type of distance can produce chaining, i.e. if two groups are linked by a single data point they will be fused into one large cluster.
Complete linkage. Here the distance between groups is taken as the maximum distance between elements of the groups. As a group grows, the distance from this group to all others either increases or stays the same. Clustering based on this type of distance tends to produce smaller clusters; at each stage the group sizes grow much more slowly than with single linkage. The main aim of complete-linkage clustering is to produce convenient groups in unstructured data; single linkage would not be able to do that.
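A short sketch contrasting the two extremes on the same dissimilarity matrix, again with the iris measurements as stand-in data:
d  <- dist(scale(iris[, 1:4]))
cs <- hclust(d, method = "single")            # chaining-prone, tends to grow large groups
cp <- hclust(d, method = "complete")          # compact, slowly growing groups
table(cutree(cs, k = 3), cutree(cp, k = 3))   # compare the two partitions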
Flexible distances
A very general form for the distance between groups can be written after joining groups A and B (the Lance-Williams recurrence): the distance from the new group A∪B to any other group C is
d(A∪B, C) = α_A d(A, C) + α_B d(B, C) + β d(A, B) + γ |d(A, C) − d(B, C)|,
and the most popular methods correspond to particular choices of the coefficients α_A, α_B, β and γ:
Centroid distance is the distance between the centroids of the groups. When two groups are joined, the centroid of the new group is calculated and distances are measured from it. It gives reasonable clusterings.
Median distance uses the distance between the medians of the two groups. If two groups of very different sizes are joined, the median may be unrealistic.
Group average is the average over all inter-group pairs of member distances.
Ward's method merges the pair of groups that gives the smallest possible increase in the within-group sum of squares. It can be sensitive to outliers.
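In R these flexible schemes correspond to the method argument of hclust(); the sketch below simply runs several of them on the distance matrix d from the previous sketch. Note that Ward's method is called "ward" in old versions of R and "ward.D"/"ward.D2" in recent ones:
for (m in c("average", "centroid", "median", "ward.D2")) {
  cat(m, ":", head(hclust(d, method = m)$height, 3), "\n")   # first few merge heights
}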
Some other techniques of clustering
1) Distribution mixtures use a density of the form f(x) = p_1 f_1(x) + ... + p_g f_g(x), assuming that there are g groups. It is usually assumed that the distribution is a mixture of normal distributions, and the coefficients p_j are the probabilities (weights) of the components. In many cases it is also assumed that the variances or the correlations of the different components are equal to each other. The problem then becomes a maximum likelihood problem: find the mean values of the components and their probabilities. The attractive side of this technique is that tests from mathematical statistics, such as the likelihood ratio test, can be used to decide the number of components. Unfortunately the problem is highly non-linear and it is not always clear whether the maximum has been reached. There are also techniques that maximise the likelihood (or explore the posterior distribution) without fixing the number of components; a widely used one is reversible jump Markov chain Monte Carlo, usually applied in a Bayesian context.
2) Fuzzy clustering assumes that each member of the data can belong to several groups with varying degrees of “belongingness”. It is related to the theory of fuzzy logic and fuzzy sets, where classical set theory is extended so that each element of the space has a vector of “memberships” defining the degree to which it belongs to each particular set.
3) Overlapping clusters are another generalisation of cluster analysis: each point can belong to several groups. The difference between fuzzy and overlapping clusters is that in fuzzy clustering the membership expresses a degree of doubt, whereas in overlapping clusters points can genuinely belong to different clusters.
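As a hedged illustration of the mixture approach (not part of the original notes), the mclust package, assumed to be installed, fits normal mixtures by maximum likelihood and chooses the number of components by BIC:
library(mclust)
fit <- Mclust(iris[, 1:4], G = 1:5)    # fit mixtures with 1 to 5 normal components
summary(fit)                           # chosen model and number of components
head(fit$z)                            # posterior membership probabilities
head(fit$classification)               # hard assignment from the posterior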
Example
This example is taken from Krzanowski. There are seven canine species: Modern dog (Md), Golden jackal (Gj), Chinese wolf (Cw), Indian wolf (Iw), Cuon (C), Dingo (D) and Prehistoric dog (Pd). These species have been characterised by six variables x1-x6. [The 7 x 6 table of measurements and the table of inter-species distances, calculated after normalising each column to unit variance, are not reproduced here.]
Dendrograms
Dendrograms calculated using complete and single linkage distances (in this case, although the shape of the tree changes, the contents of the clusters do not change).
R commands for clustering
Clustering can be done using hclust from library(mva):
library(mva)
?hclust - it is always a good idea to look at the description of a command (or data set)
Then, if you have a distance matrix d, you can use
cc1 = hclust(d)
You can plot the result using plclust:
plclust(cc1)
rect.hclust(cc1, k=ncluster) – draws rectangles around the specified number of clusters; rectangles can also be drawn using heights.
There are many other clustering commands. I would first play with hclust using different distances and then go to the others. There are also functions to calculate dissimilarities:
library(cluster)
dist(data) – calculates distances between points (base R)
daisy(data) – calculates a dissimilarity matrix (package cluster); it can handle variables other than numeric ones
cophenetic(cc1) – gives the distances derived from the clusters themselves. It can be a good idea to calculate the correlation between the original distances and the cophenetic distances.
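Putting these commands together, one possible (hedged) workflow is:
library(cluster)
mydata <- iris[, 1:4]                           # any numeric data matrix would do
d   <- daisy(mydata)                            # or dist(mydata) for purely numeric data
cc1 <- hclust(d)                                # complete linkage by default
plot(cc1)                                       # plclust(cc1) in the old mva library
rect.hclust(cc1, k = 3)                         # rectangles around 3 clusters
cor(as.vector(d), as.vector(cophenetic(cc1)))   # cophenetic correlation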
R for cluster
The library cluster contains a further command for clustering:
Fa = fanny(data, k) – fuzzy clustering, where k is the number of desired clusters. The results can be plotted using the plot command:
plot(Fa)
k-means clustering is available in base R:
Kcl = kmeans(data, k)
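A minimal sketch of both commands, again with the iris measurements standing in for the data and k = 3 as an arbitrary choice:
library(cluster)
Fa <- fanny(iris[, 1:4], k = 3)           # fuzzy clustering; Fa$membership holds the degrees
plot(Fa)                                  # silhouette and cluster plots
Kcl <- kmeans(iris[, 1:4], centers = 3)   # k-means (base R)
Kcl$cluster                               # hard group labels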
Exercise
Take the data set eurodist from R and analyse it using metric scaling and cluster analysis.
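One possible starting point for the exercise (a sketch, not a full solution):
loc <- cmdscale(eurodist, k = 2)              # classical (metric) scaling to two dimensions
plot(loc, type = "n")
text(loc, labels = rownames(loc), cex = 0.7)  # cities in the scaled plane
ce <- hclust(eurodist, method = "complete")   # hierarchical clustering on the same distances
plot(ce)                                      # dendrogram of the cities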
References
1. Krzanowski, W.J. and Marriott, F.H.C. (1994) Multivariate Analysis. Kendall's Library of Statistics.
2. Mardia, K.V., Kent, J.T. and Bibby, J.M. (2003) Multivariate Analysis.