Lecture 6 Statistical Lecture ─ Cluster Analysis
Cluster Analysis
Grouping similar objects to produce a classification
Useful when, a priori, the structure of the data is unknown
Involves assessing the relative distances between points
Clustering Algorithms
Partitioning: divide the data set into k clusters, where k must be specified beforehand, e.g. k-means.
Clustering Algorithms
Hierarchical:
–Agglomerative methods: start with each object forming its own little cluster, then successively merge clusters until only one large cluster is left
–Divisive methods: start by considering the whole data set as one cluster, then split up clusters until each object is separate
Caution
Most users are interested in the main structure of their data, consisting of a few large clusters. When forming larger clusters, agglomerative methods may make wrong decisions in the first steps, and once one step is wrong, everything that follows is wrong. For divisive methods, the larger clusters are determined first, so they are less likely to suffer from earlier wrong steps.
Agglomerative Hierarchical Clustering Procedure
(1) Each observation begins in a cluster by itself
(2) The two closest clusters are merged to form a new cluster that replaces the two old clusters
(3) Repeat (2) until only one cluster is left
The various clustering methods differ in how the distance between two clusters is computed.
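The three-step procedure above can be sketched in a few lines of Python. This is a naive illustration only (quadratic pairwise scans, average linkage chosen arbitrarily as the between-cluster distance); it is not the implementation any particular package uses:

```python
from itertools import combinations

def euclid2(a, b):
    # squared Euclidean distance between two coordinate vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerate(points, dist=euclid2):
    """Naive agglomerative clustering: start with singleton clusters,
    repeatedly merge the two closest clusters (average linkage) until
    one cluster remains. Returns the merge history as (cluster_a,
    cluster_b) pairs of index lists."""
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # average linkage: mean pairwise distance between two clusters
        def d(ck, cl):
            return sum(dist(points[i], points[j])
                       for i in ck for j in cl) / (len(ck) * len(cl))
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: d(clusters[p[0]], clusters[p[1]]))
        history.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return history
```

With four points forming two tight pairs, the two pairs are merged first and the final step joins everything, giving n − 1 merges in total.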
Remarks
For coordinate data, variables with large variances tend to have more effect on the resulting clusters than those with small variances. Scaling or transforming the variables might be needed. Standardization (standardizing the variables to mean 0 and standard deviation 1) or principal components is useful but not always appropriate. Outliers should be removed before the analysis.
Remarks (cont.)
Nonlinear transformations of the variables may change the number of population clusters and should therefore be approached with caution. For most applications, the variables should be transformed so that equal differences are of equal practical importance. An interval scale of measurement is required if raw data are used as input; ordinal or ranked coordinate data are generally not appropriate.
Notation
n: number of observations
v: number of variables, if data are coordinates
G: number of clusters at any given level of the hierarchy
x_i: the i-th observation
C_k: the k-th cluster, a subset of {1, 2, …, n}
N_k: number of observations in C_k
Notation (cont.)
x̄: sample mean vector
x̄_k: mean vector for cluster C_k
||x||: Euclidean length of the vector x, that is, the square root of the sum of the squares of the elements of x
T = Σ_{i=1}^{n} ||x_i − x̄||²: total sum of squares
W_k = Σ_{i∈C_k} ||x_i − x̄_k||²: within-cluster sum of squares for cluster C_k
Notation (cont.)
P_G = Σ_j W_j, where the summation is over the G clusters at the G-th level of the hierarchy
B_kl = W_m − W_k − W_l, if C_m = C_k ∪ C_l
d(x, y): any distance or dissimilarity measure between observations or vectors x and y
D_kl: any distance or dissimilarity measure between clusters C_k and C_l
Clustering Method ─ Average Linkage
The distance between two clusters is defined by
D_kl = (1 / (N_k N_l)) Σ_{i∈C_k} Σ_{j∈C_l} d(x_i, x_j)
If d(x, y) = ||x − y||², then
D_kl = ||x̄_k − x̄_l||² + W_k/N_k + W_l/N_l
The combinatorial formula is
D_jm = (N_k D_jk + N_l D_jl) / N_m, if C_m = C_k ∪ C_l
Average Linkage The distance between clusters is the average distance between pairs of observations, one in each cluster It tends to join clusters with small variance and is slightly biased toward producing clusters with the same variance
Centroid Method
The distance between two clusters is defined by
D_kl = ||x̄_k − x̄_l||²
If d(x, y) = ||x − y||², then the combinatorial formula is
D_jm = (N_k D_jk + N_l D_jl)/N_m − N_k N_l D_kl / N_m²
Centroid Method The distance between two clusters is defined as the squared Euclidean distance between their centroids or means It is more robust to outliers than most other hierarchical methods but in other respects may not perform as well as Ward’s method or average linkage
Complete Linkage
The distance between two clusters is defined by
D_kl = max_{i∈C_k} max_{j∈C_l} d(x_i, x_j)
The combinatorial formula is
D_jm = max(D_jk, D_jl)
Complete Linkage
The distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster. It is strongly biased toward producing clusters with roughly equal diameters and can be severely distorted by moderate outliers.
Single Linkage
The distance between two clusters is defined by
D_kl = min_{i∈C_k} min_{j∈C_l} d(x_i, x_j)
The combinatorial formula is
D_jm = min(D_jk, D_jl)
Single Linkage The distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster It sacrifices performance in the recovery of compact clusters in return for the ability to detect elongated and irregular clusters
Ward’s Minimum-Variance Method
The distance between two clusters is defined by
D_kl = B_kl = ||x̄_k − x̄_l||² / (1/N_k + 1/N_l)
If d(x, y) = ||x − y||², then the combinatorial formula is
D_jm = [(N_j + N_k) D_jk + (N_j + N_l) D_jl − N_j D_kl] / (N_j + N_m)
Ward’s Minimum-Variance Method
The distance between two clusters is the ANOVA sum of squares between the two clusters, added up over all the variables. It tends to join clusters with a small number of observations. It is strongly biased toward producing clusters with roughly the same number of observations. It is also very sensitive to outliers.
Assumptions for WMVM Multivariate normal mixture Equal spherical covariance matrices Equal sampling probabilities
Remarks Single linkage tends to lead to the formation of long straggly clusters Average, complete linkage and Ward’s method often find spherical clusters even when the data appear to contain clusters of other shapes
McQuitty’s Similarity Analysis
The combinatorial formula is
D_jm = (D_jk + D_jl) / 2
Median Method
If d(x, y) = ||x − y||², then the combinatorial formula is
D_jm = (D_jk + D_jl)/2 − D_kl/4
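The combinatorial (Lance-Williams style) update formulas above all compute the distance from a cluster C_j to the merged cluster C_m = C_k ∪ C_l from D_jk, D_jl, and D_kl alone. A hedged sketch collecting the updates quoted in these slides (cluster sizes must be supplied by the caller for the size-weighted methods):

```python
def lance_williams(method, d_jk, d_jl, d_kl, n_j=1, n_k=1, n_l=1):
    """Update D(j, m) for the merge m = k ∪ l under several
    hierarchical methods; n_* are cluster sizes."""
    n_m = n_k + n_l
    if method == "single":
        return min(d_jk, d_jl)
    if method == "complete":
        return max(d_jk, d_jl)
    if method == "average":
        return (n_k * d_jk + n_l * d_jl) / n_m
    if method == "mcquitty":
        return (d_jk + d_jl) / 2
    if method == "median":
        return (d_jk + d_jl) / 2 - d_kl / 4
    if method == "centroid":
        return (n_k * d_jk + n_l * d_jl) / n_m - n_k * n_l * d_kl / n_m**2
    if method == "ward":
        return ((n_j + n_k) * d_jk + (n_j + n_l) * d_jl
                - n_j * d_kl) / (n_j + n_m)
    raise ValueError(method)
```

For example, with D_jk = 2, D_jl = 5, D_kl = 3, single linkage gives 2 and complete linkage gives 5, matching the min/max definitions directly.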
k-th Nearest Neighbor Method
Prespecify k
Let r_k(x) be the distance from point x to the k-th nearest observation
Consider a closed sphere centered at x with radius r_k(x), say S_k(x)
k-th Nearest Neighbor Method (cont.)
The estimated density at x is defined by
f(x) = k / (n · volume of S_k(x))
For any two observations x_i and x_j, the dissimilarity is
d*(x_i, x_j) = (1/2)(1/f(x_i) + 1/f(x_j)), if d(x_i, x_j) ≤ max(r_k(x_i), r_k(x_j)); d*(x_i, x_j) = ∞ otherwise
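A minimal sketch of the density estimate f(x) = k / (n · vol S_k(x)) for one-dimensional data, where the "sphere" S_k(x) is just the interval of length 2·r_k(x) (the 1-D case is my simplifying assumption; in v dimensions the sphere volume would involve r_k(x)^v):

```python
def knn_density(points, k):
    """k-th nearest neighbor density estimate for 1-D data:
    f(x) = k / (n * volume of S_k(x)), where r_k(x) is the distance
    from x to its k-th nearest other observation and the 1-D sphere
    volume is 2 * r_k(x)."""
    n = len(points)
    dens = []
    for i, x in enumerate(points):
        dists = sorted(abs(x - y) for j, y in enumerate(points) if j != i)
        r_k = dists[k - 1]               # distance to k-th nearest neighbor
        dens.append(k / (n * 2 * r_k))   # estimated density at x
    return dens
```

Points in a tight group (small r_k) get a high density estimate; an isolated point (large r_k) gets a low one, which is what the d* dissimilarity above exploits.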
K-Means Algorithm
It is intended for use with large data sets, from approximately 100 observations upward. With small data sets, the results may be highly sensitive to the order of the observations in the data set. It combines an effective method for finding initial clusters with a standard iterative algorithm for minimizing the sum of squared distances from the cluster means.
K-Means Algorithm
(1) Specify the number of clusters, say k
(2) A set of k points called cluster seeds is selected as a first guess of the means of the k clusters
(3) Each observation is assigned to the nearest seed to form temporary clusters
(4) The seeds are then replaced by the means of the temporary clusters
(5) Steps (3) and (4) are repeated until no further changes occur in the clusters
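The assign/update loop in steps (3)-(5) can be sketched as plain k-means. This is only the iterative core with the seeds handed in by the caller; it does not implement the seed-selection and seed-replacement rules described on the next slides:

```python
def kmeans(points, seeds, max_iter=10):
    """Plain k-means: assign each observation to its nearest seed,
    replace each seed by the mean of its temporary cluster, and
    repeat until the clusters no longer change."""
    seeds = [list(s) for s in seeds]
    labels = []
    for _ in range(max_iter):
        # assignment step: index of the nearest seed for each observation
        labels = [min(range(len(seeds)),
                      key=lambda c: sum((p - s) ** 2
                                        for p, s in zip(x, seeds[c])))
                  for x in points]
        # update step: each seed becomes the mean of its temporary cluster
        new_seeds = []
        for c in range(len(seeds)):
            members = [x for x, lab in zip(points, labels) if lab == c]
            if members:
                new_seeds.append([sum(col) / len(members)
                                  for col in zip(*members)])
            else:
                new_seeds.append(seeds[c])  # keep an empty cluster's seed
        if new_seeds == seeds:
            break
        seeds = new_seeds
    return labels, seeds
```

With two well-separated pairs of points and one seed near each pair, the loop converges in two passes to the pair means.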
Cluster Seeds Select the first complete (no missing values) observation as the first seed The next complete observation that is separated from the first seed by at least the prespecified distance becomes the second seed Later observations are selected as new seeds if they are separated from all previous seeds by at least the radius, as long as the maximum number of seeds is not exceeded
Cluster Seeds If an observation is complete but fails to qualify as a new seed, two tests can be made to see if the observation can replace one of the old seeds
Cluster Seeds (cont.)
An old seed is replaced if the distance between the observation and the closest seed is greater than the minimum distance between seeds. The seed that is replaced is selected from the two seeds that are closest to each other: of these two, the one replaced is the one with the shorter distance to the closest of the remaining seeds when the other seed is replaced by the current observation.
Cluster Seeds(cont.) If the observation fails the first test for seed replacement, a second test is made. The observation replaces the nearest seed if the smallest distance from the observation to all seeds other than the nearest one is greater than the shortest distance from the nearest seed to all other seeds. If this test is failed, go on to the next observation.
Dissimilarity Matrices
An n × n dissimilarity matrix, where the entry d(i, j) = d(j, i) measures the “difference” or dissimilarity between the objects i and j.
Dissimilarity Matrices
d usually satisfies
d(i, i) = 0
d(i, j) ≥ 0
d(i, j) = d(j, i)
Dissimilarity
Interval-scaled variables: continuous measurements on a (roughly) linear scale (temperature, height, weight, etc.)
Dissimilarity (cont.)
The choice of measurement units strongly affects the resulting clustering. The variable with the largest dispersion will have the largest impact on the clustering. If all variables are to be considered equally important, the data need to be standardized first.
Standardization
Standardize variable f by z_if = (x_if − m_f) / s_f, where m_f is the center and the spread s_f can be
Mean absolute deviation (robust): s_f = (1/n) Σ_i |x_if − m_f|, with m_f the mean
Median absolute deviation (robust): s_f = median_i |x_if − m_f|, with m_f the median
Usual standard deviation: s_f = [ (1/(n−1)) Σ_i (x_if − m_f)² ]^{1/2}, with m_f the mean
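A small sketch of the three standardization choices on a single variable (my own helper, illustrating the spread measures named above; the mean absolute deviation and median absolute deviation variants are the robust ones):

```python
import statistics

def standardize(xs, how="mad"):
    """Standardize one variable: 'mad' = mean absolute deviation about
    the mean (robust), 'median' = median absolute deviation about the
    median (robust), 'std' = usual standard deviation about the mean."""
    if how == "mad":
        center = statistics.mean(xs)
        spread = sum(abs(x - center) for x in xs) / len(xs)
    elif how == "median":
        center = statistics.median(xs)
        spread = statistics.median(abs(x - center) for x in xs)
    elif how == "std":
        center = statistics.mean(xs)
        spread = statistics.stdev(xs)
    else:
        raise ValueError(how)
    return [(x - center) / spread for x in xs]
```

Note how the outlier 10 in [1, 2, 3, 4, 10] inflates the mean-based spreads but barely moves the median absolute deviation, which is why the robust variants are preferred when outliers are suspected.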
Continuous Ordinal Variables
These are continuous measurements on an unknown scale, or where only the ordering is known but not the actual magnitude.
(1) Replace the x_if by their ranks r_if ∈ {1, …, M_f}
(2) Transform the scale to [0, 1] as follows: z_if = (r_if − 1) / (M_f − 1)
(3) Compute the dissimilarities as for interval-scaled variables
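Steps (1) and (2) can be sketched as follows (a simplified helper of my own: ties are handled by giving equal values the same rank among the sorted distinct values):

```python
def ordinal_to_unit_interval(xs):
    """Replace each value by its rank r among the M_f sorted distinct
    values, then map ranks 1..M_f to [0, 1] via z = (r - 1) / (M_f - 1)."""
    distinct = sorted(set(xs))
    m = len(distinct)
    rank = {v: i + 1 for i, v in enumerate(distinct)}
    return [(rank[x] - 1) / (m - 1) for x in xs]
```

The smallest value always maps to 0 and the largest to 1, after which the interval-scaled dissimilarities apply directly.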
Ratio-Scaled Variables
These are positive continuous measurements on a nonlinear scale, such as an exponential scale. One example would be the growth of a bacterial population (say, with a growth function Ae^{Bt}). Three ways to handle them:
–Treat them simply as interval-scaled variables, though this is not recommended as it can distort the measurement scale
–Treat them as continuous ordinal data
–First transform the data (perhaps by taking logarithms), and then treat the results as interval-scaled variables
Discrete Ordinal Variables A variable of this type has M possible values (scores) which are ordered. The dissimilarities are computed in the same way as for continuous ordinal variables.
Nominal Variables
Such a variable has M possible values, which are not ordered. The dissimilarity between objects i and j is usually defined as
d(i, j) = (p − m) / p,
where p is the total number of variables and m is the number of variables on which objects i and j match.
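The simple matching dissimilarity d(i, j) = (p − m)/p is a one-liner; a sketch with a hypothetical pair of three-variable objects:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching dissimilarity for nominal variables:
    d(i, j) = (p - m) / p, with p the number of variables and
    m the number of variables on which the two objects agree."""
    p = len(obj_i)
    m = sum(a == b for a, b in zip(obj_i, obj_j))
    return (p - m) / p
```

Two objects agreeing on 2 of 3 nominal variables get dissimilarity 1/3; identical objects get 0 and completely disagreeing ones get 1.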
Symmetric Binary Variables
Two possible values, coded 0 and 1, which are equally important (such as male and female). Consider the 2 × 2 contingency table of the objects i and j:
a = number of variables equal to 1 for both i and j
b = number equal to 1 for i and 0 for j
c = number equal to 0 for i and 1 for j
d = number equal to 0 for both
The dissimilarity is the simple matching coefficient: d(i, j) = (b + c) / (a + b + c + d)
Asymmetric Binary Variables
Two possible values, one of which carries more information than the other. The most meaningful outcome is coded as 1, and the less meaningful outcome as 0. Typically, 1 stands for the presence of a certain attribute, and 0 for its absence.
Asymmetric Binary Variables (cont.)
Since 0–0 matches carry no information about the attribute, they are dropped from the denominator, giving the Jaccard-type dissimilarity
d(i, j) = (b + c) / (a + b + c)
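Both binary dissimilarities come from the same 2×2 contingency counts; a sketch covering the symmetric (simple matching) and asymmetric (Jaccard-type) cases:

```python
def binary_dissimilarity(x, y, symmetric=True):
    """Dissimilarity between two 0/1 vectors via the 2x2 contingency:
       a = #(1,1), b = #(1,0), c = #(0,1), d = #(0,0).
    Symmetric variables: simple matching, (b + c) / (a + b + c + d).
    Asymmetric variables: drop the uninformative 0-0 matches,
    giving (b + c) / (a + b + c)."""
    a = sum(p == 1 and q == 1 for p, q in zip(x, y))
    b = sum(p == 1 and q == 0 for p, q in zip(x, y))
    c = sum(p == 0 and q == 1 for p, q in zip(x, y))
    d = sum(p == 0 and q == 0 for p, q in zip(x, y))
    if symmetric:
        return (b + c) / (a + b + c + d)
    return (b + c) / (a + b + c)
```

For x = (1,1,0,0) and y = (1,0,1,0), a = b = c = d = 1, so the symmetric dissimilarity is 2/4 while the asymmetric one rises to 2/3 because the single 0-0 match is discarded.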
Cluster Analysis of Flying Mileages Between 10 American Cities
[Lower-triangular matrix of flying mileages among ATLANTA, CHICAGO, DENVER, HOUSTON, LOS ANGELES, MIAMI, NEW YORK, SAN FRANCISCO, SEATTLE, and WASHINGTON D.C.; only the first entry, ATLANTA–CHICAGO = 587, survived extraction]
The CLUSTER Procedure: Average Linkage Cluster Analysis
Cluster History table (columns NCL, Clusters Joined, FREQ, PSF, PST2, Norm RMS Dist, Tie; numeric values lost in extraction). Join sequence from NCL = 9 down: NEW YORK + WASHINGTON D.C.; LOS ANGELES + SAN FRANCISCO; ATLANTA + CHICAGO; CL7 + an earlier cluster; CL8 + SEATTLE; DENVER + HOUSTON; CL6 + MIAMI; then the remaining clusters are merged.
Root-Mean-Square Distance Between Observations = [value lost]
Average Linkage Cluster Analysis
The CLUSTER Procedure: Centroid Hierarchical Cluster Analysis
Cluster History table (columns NCL, Clusters Joined, FREQ, PSF, PST2, Norm Cent Dist, Tie; numeric values lost in extraction). Join sequence from NCL = 9 down: NEW YORK + WASHINGTON D.C.; LOS ANGELES + SAN FRANCISCO; ATLANTA + CHICAGO; CL7 + an earlier cluster; CL8 + SEATTLE; DENVER + an earlier cluster; CL6 + MIAMI; CL3 + HOUSTON; then the remaining clusters are merged.
Root-Mean-Square Distance Between Observations = [value lost]
Centroid Hierarchical Cluster Analysis
The CLUSTER Procedure: Single Linkage Cluster Analysis
Cluster History table (columns NCL, Clusters Joined, FREQ, Norm Min Dist, Tie; numeric values lost in extraction). Join sequence from NCL = 9 down: NEW YORK + WASHINGTON D.C.; LOS ANGELES + SAN FRANCISCO; ATLANTA + an earlier cluster; CL7 + CHICAGO; CL6 + MIAMI; CL8 + SEATTLE; CL5 + HOUSTON; DENVER + an earlier cluster; then the remaining clusters are merged.
Mean Distance Between Observations = [value lost]
Single Linkage Cluster Analysis
The CLUSTER Procedure: Ward's Minimum Variance Cluster Analysis
Cluster History table (columns NCL, Clusters Joined, FREQ, SPRSQ, RSQ, PSF, PST2, Tie; numeric values lost in extraction). Join sequence from NCL = 9 down: NEW YORK + WASHINGTON D.C.; LOS ANGELES + SAN FRANCISCO; ATLANTA + CHICAGO; CL7 + an earlier cluster; DENVER + HOUSTON; CL8 + SEATTLE; CL6 + MIAMI; then the remaining clusters are merged.
Root-Mean-Square Distance Between Observations = [value lost]
Ward's Minimum Variance Cluster Analysis
Fisher (1936) Iris Data
The FASTCLUS Procedure: Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02
Initial Seeds table (Cluster, SepalLength, SepalWidth, PetalLength, PetalWidth; numeric values lost in extraction)
Minimum Distance Between Initial Seeds = [value lost]
Fisher (1936) Iris Data
The FASTCLUS Procedure: Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02
Iteration History table (Iteration, Criterion, Relative Change in Cluster Seeds; numeric values lost in extraction)
Convergence criterion is satisfied. Criterion Based on Final Seeds = 5.0417
Fisher (1936) Iris Data
The FASTCLUS Procedure
Cluster Summary table (Cluster, Frequency, RMS Std Deviation, Maximum Distance from Seed to Observation, Radius Exceeded, Nearest Cluster, Distance Between Cluster Centroids; numeric values lost in extraction)
Fisher (1936) Iris Data
The FASTCLUS Procedure
Statistics for Variables table (Variable, Total STD, Within STD, R-Square, RSQ/(1−RSQ)) for SepalLength, SepalWidth, PetalLength, PetalWidth, and OVER-ALL (numeric values lost in extraction)
Pseudo F Statistic = [value lost]; Approximate Expected Over-All R-Squared = [value lost]; Cubic Clustering Criterion = [value lost]
WARNING: The two above values are invalid for correlated variables.
Pseudo F statistic = [R² / (c − 1)] / [(1 − R²) / (n − c)]
c: number of clusters
n: number of observations
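The pseudo F statistic reported by the SAS output is a ratio of between-cluster to within-cluster mean squares expressed through R²; a one-line sketch:

```python
def pseudo_f(r_squared, c, n):
    """Pseudo F statistic: [R^2 / (c - 1)] / [(1 - R^2) / (n - c)],
    with c the number of clusters and n the number of observations."""
    return (r_squared / (c - 1)) / ((1 - r_squared) / (n - c))
```

For instance, with R² = 0.5, c = 2 clusters, and n = 102 observations, the statistic is (0.5/1)/(0.5/100) = 100; larger values indicate clusters that explain more of the total variation.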
Fisher (1936) Iris Data
The FASTCLUS Procedure: Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02
Cluster Means and Cluster Standard Deviations tables (Cluster, SepalLength, SepalWidth, PetalLength, PetalWidth; numeric values lost in extraction)
Fisher (1936) Iris Data
The FREQ Procedure
Table of CLUSTER by Species (cells show Frequency, Percent, Row Pct, Col Pct over Setosa, Versicolor, and Virginica; counts lost in extraction)
Fisher (1936) Iris Data
The FASTCLUS Procedure: Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02
Initial Seeds table (Cluster, SepalLength, SepalWidth, PetalLength, PetalWidth; numeric values lost in extraction)
Minimum Distance Between Initial Seeds = [value lost]
Fisher (1936) Iris Data
The FASTCLUS Procedure: Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02
Iteration History table (Iteration, Criterion, Relative Change in Cluster Seeds; numeric values lost in extraction)
Convergence criterion is satisfied. Criterion Based on Final Seeds = 3.6289
Fisher (1936) Iris Data
Cluster Summary table (Cluster, Frequency, RMS Std Deviation, Maximum Distance from Seed to Observation, Radius Exceeded, Nearest Cluster, Distance Between Cluster Centroids; numeric values lost in extraction)
Fisher (1936) Iris Data
Statistics for Variables table (Variable, Total STD, Within STD, R-Square, RSQ/(1−RSQ)) for SepalLength, SepalWidth, PetalLength, PetalWidth, and OVER-ALL (most numeric values lost in extraction)
Pseudo F Statistic = 561.63; Approximate Expected Over-All R-Squared = [value lost]; Cubic Clustering Criterion = [value lost]
WARNING: The two above values are invalid for correlated variables.
Fisher (1936) Iris Data
The FASTCLUS Procedure: Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02
Cluster Means and Cluster Standard Deviations tables (Cluster, SepalLength, SepalWidth, PetalLength, PetalWidth; numeric values lost in extraction)
Fisher (1936) Iris Data
The FREQ Procedure
Table of CLUSTER by Species (cells show Frequency, Percent, Row Pct, Col Pct over Setosa, Versicolor, and Virginica; counts lost in extraction)