Lecture 6 Statistical Lecture ─ Cluster Analysis.


1 Lecture 6 Statistical Lecture ─ Cluster Analysis

2 Cluster Analysis
Grouping similar objects to produce a classification
Useful when the structure of the data is unknown a priori
Involves assessing the relative distances between points

3 Clustering Algorithms
Partitioning: divide the data set into k clusters, where k must be specified beforehand (e.g., k-means)

4 Clustering Algorithms
Hierarchical:
–Agglomerative methods: start with the situation where each object forms its own little cluster, and then successively merge clusters until only one large cluster remains
–Divisive methods: start by considering the whole data set as one cluster, and then split up clusters until each object is separate

5 Caution
Most users are interested in the main structure of their data, consisting of a few large clusters
When forming larger clusters, agglomerative methods may make wrong decisions in their first steps; since a merge is never undone, an early mistake propagates through everything built on it
For divisive methods, the larger clusters are determined first, so they are less likely to suffer from bad early steps

6 Agglomerative Hierarchical Clustering Procedure
(1) Each observation begins in a cluster by itself
(2) The two closest clusters are merged to form a new cluster that replaces the two old clusters
(3) Repeat (2) until only one cluster is left
The various clustering methods differ in how the distance between two clusters is computed; see the sketch below.
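To make the procedure concrete, here is a minimal from-scratch sketch in Python (numpy assumed; the names agglomerate and single_link and the toy data are ours, not from the lecture). It recomputes cluster distances by brute force at each step; real implementations instead use the combinatorial update formulas given on the following slides.

import numpy as np

def agglomerate(X, cluster_dist):
    # Step 1: each observation starts in its own cluster.
    clusters = [[i] for i in range(len(X))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):             # brute-force scan for
            for b in range(a + 1, len(clusters)):  # the closest pair
                dist = cluster_dist(X[clusters[a]], X[clusters[b]])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        history.append((clusters[a], clusters[b], dist))
        clusters[a] = clusters[a] + clusters[b]    # step 2: the merged cluster
        del clusters[b]                            # replaces the two old ones
    return history                                 # step 3: loop until one left

def single_link(A, B):
    # one choice of between-cluster distance: minimum pairwise distance
    return min(np.linalg.norm(x - y) for x in A for y in B)

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
for left, right, dist in agglomerate(X, single_link):
    print(left, "+", right, "at", round(dist, 3))

Swapping single_link for another cluster-distance function yields the other hierarchical methods described below.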

7 Remarks
For coordinate data, variables with large variances tend to have more effect on the resulting clusters than those with small variances
Scaling or transforming the variables might be needed
Standardization (standardize the variables to mean 0 and standard deviation 1) or principal components is useful but not always appropriate
Outliers should be removed before analysis

8 Remarks (cont.)
Nonlinear transformations of the variables may change the number of population clusters and should therefore be approached with caution
For most applications, the variables should be transformed so that equal differences are of equal practical importance
An interval scale of measurement is required if raw data are used as input; ordinal or ranked coordinate data are generally not appropriate

9 Notation
n     number of observations
v     number of variables, if data are coordinates
G     number of clusters at any given level of the hierarchy
x_i   the i-th observation
C_k   the k-th cluster, a subset of {1, 2, …, n}
N_k   number of observations in C_k

10 Notation (cont.)
x̄     sample mean vector
x̄_k   mean vector for cluster C_k
||x||  Euclidean length of the vector x, that is, the square root of the sum of the squares of the elements of x
T     total sum of squares, T = Σ_{i=1..n} ||x_i − x̄||²
W_k   within-cluster sum of squares for C_k, W_k = Σ_{i∈C_k} ||x_i − x̄_k||²

11 Notation (cont.)
P_G     Σ W_j, where the summation is over the G clusters at the G-th level of the hierarchy
B_kl    W_m − W_k − W_l, if C_m = C_k ∪ C_l
d(x, y) any distance or dissimilarity measure between observations or vectors x and y
D_kl    any distance or dissimilarity measure between clusters C_k and C_l
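A small numeric illustration of this notation (our own toy data; numpy assumed). With two clusters covering the whole sample, T = P_G + B_kl:

import numpy as np

X = np.array([[1., 2.], [2., 1.], [8., 9.], [9., 8.]])  # n = 4, v = 2

def ss(rows):
    # sum of squared Euclidean distances to the mean vector of `rows`
    return np.sum((rows - rows.mean(axis=0)) ** 2)

T = ss(X)                             # total SS about the grand mean
C_k, C_l = [0, 1], [2, 3]             # two clusters as index sets
W_k, W_l = ss(X[C_k]), ss(X[C_l])     # within-cluster sums of squares
P_G = W_k + W_l                       # G = 2 clusters at this level
B_kl = ss(X[C_k + C_l]) - W_k - W_l   # between-cluster SS for the merge
print(T, W_k, W_l, P_G, B_kl)         # here T = P_G + B_kl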

12 Clustering Method ─ Average Linkage
The distance between two clusters is defined by
    D_kl = (1 / (N_k N_l)) Σ_{i∈C_k} Σ_{j∈C_l} d(x_i, x_j)
If d(x, y) = ||x − y||², then
    D_kl = ||x̄_k − x̄_l||² + W_k/N_k + W_l/N_l
The combinatorial formula, if C_m = C_k ∪ C_l, is
    D_jm = (N_k D_jk + N_l D_jl) / N_m

13 Average Linkage
The distance between clusters is the average distance between pairs of observations, one in each cluster
It tends to join clusters with small variances and is slightly biased toward producing clusters with the same variance
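The combinatorial formula can be checked numerically: the average-linkage distance from a third cluster C_j to the merge C_m = C_k ∪ C_l is a weighted mean of D_jk and D_jl. A quick verification (numpy/scipy assumed; the clusters J, K, L are arbitrary toy data of ours):

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
J = rng.normal(size=(4, 2))        # cluster C_j
K = rng.normal(size=(3, 2)) + 3    # cluster C_k
L = rng.normal(size=(5, 2)) + 6    # cluster C_l

def D(A, B):
    # average linkage by definition: mean of all pairwise distances
    return cdist(A, B).mean()

M = np.vstack([K, L])              # the merged cluster C_m = C_k U C_l
lhs = D(J, M)
rhs = (len(K) * D(J, K) + len(L) * D(J, L)) / len(M)
print(np.isclose(lhs, rhs))        # True: update formula matches definition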

14 Centroid Method
The distance between two clusters is defined by
    D_kl = ||x̄_k − x̄_l||²
If d(x, y) = ||x − y||², then the combinatorial formula is
    D_jm = (N_k D_jk + N_l D_jl) / N_m − (N_k N_l D_kl) / N_m²

15 Centroid Method
The distance between two clusters is defined as the squared Euclidean distance between their centroids or means
It is more robust to outliers than most other hierarchical methods, but in other respects may not perform as well as Ward's method or average linkage

16 Complete Linkage
The distance between two clusters is defined by
    D_kl = max_{i∈C_k} max_{j∈C_l} d(x_i, x_j)
The combinatorial formula is
    D_jm = max(D_jk, D_jl)

17 Complete Linkage
The distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster
It is strongly biased toward producing clusters with roughly equal diameters and can be severely distorted by moderate outliers

18 Single Linkage
The distance between two clusters is defined by
    D_kl = min_{i∈C_k} min_{j∈C_l} d(x_i, x_j)
The combinatorial formula is
    D_jm = min(D_jk, D_jl)

19 Single Linkage
The distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster
It sacrifices performance in the recovery of compact clusters in return for the ability to detect elongated and irregular clusters

20 Ward’s Minimum-Variance Method
The distance between two clusters is defined by
    D_kl = B_kl = ||x̄_k − x̄_l||² / (1/N_k + 1/N_l)
If d(x, y) = ||x − y||², then the combinatorial formula is
    D_jm = [(N_j + N_k) D_jk + (N_j + N_l) D_jl − N_j D_kl] / (N_j + N_m)

21 Ward’s Minimum-Variance Method
The distance between two clusters is the ANOVA sum of squares between the two clusters, added up over all the variables
It tends to join clusters with a small number of observations
It is strongly biased toward producing clusters with roughly the same number of observations
It is also very sensitive to outliers

22 Assumptions for WMVM
Multivariate normal mixture
Equal spherical covariance matrices
Equal sampling probabilities

23 Remarks
Single linkage tends to lead to the formation of long, straggly clusters
Average linkage, complete linkage, and Ward's method often find spherical clusters even when the data appear to contain clusters of other shapes (see the comparison sketch below)
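This contrast is easy to explore with scipy, which implements all of the methods above (scipy's name for McQuitty's method, covered on the next slide, is 'weighted'; 'centroid' and 'median' match the slides). The toy data below, two long parallel strips, are our own illustration of a shape the slides say favors single linkage; inspecting the two-cluster cut for each method shows which ones keep the strips intact:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
t = rng.uniform(0, 10, 50)
strip1 = np.column_stack([t, 0.1 * rng.normal(size=50)])   # elongated cluster
strip2 = np.column_stack([t, 3 + 0.1 * rng.normal(size=50)])
X = np.vstack([strip1, strip2])
y = pdist(X)                      # condensed Euclidean distances

for method in ['single', 'complete', 'average', 'weighted',
               'centroid', 'median', 'ward']:
    labels = fcluster(linkage(y, method), t=2, criterion='maxclust')
    # a 0.50 fraction means the two strips were recovered exactly
    frac = max(np.bincount(labels)[1:]) / len(X)
    print(f"{method:9s} larger-cluster fraction = {frac:.2f}")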

24 McQuitty’s Similarity Analysis
The combinatorial formula is
    D_jm = (D_jk + D_jl) / 2
Median Method
If d(x, y) = ||x − y||², then the combinatorial formula is
    D_jm = (D_jk + D_jl) / 2 − D_kl / 4

25 K-th-Nearest Neighbor Method
Prespecify k
Let r_k(x) be the distance from point x to the k-th nearest observation
Consider a closed sphere centered at x with radius r_k(x), say S_k(x)

26 K-th-Nearest Neighbor Method
The estimated density at x is defined by
    f(x) = k / (n · vol(S_k(x)))
where vol(S_k(x)) is the volume of the sphere S_k(x)
For any two observations x_i and x_j, the new dissimilarity is
    d*(x_i, x_j) = (1/2)(1/f(x_i) + 1/f(x_j)), if d(x_i, x_j) ≤ max(r_k(x_i), r_k(x_j))
    d*(x_i, x_j) = ∞, otherwise
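A direct sketch of the density estimate (numpy/scipy assumed; knn_density is our name). The volume of a sphere of radius r in v dimensions is π^(v/2) r^v / Γ(v/2 + 1):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.special import gamma

def knn_density(X, k):
    # f(x_i) = k / (n * volume of the sphere of radius r_k(x_i)),
    # where r_k(x_i) is the distance from x_i to its k-th nearest neighbor
    n, v = X.shape
    D = squareform(pdist(X))
    r_k = np.sort(D, axis=1)[:, k]   # column 0 is the point itself
    vol = np.pi ** (v / 2) / gamma(v / 2 + 1) * r_k ** v
    return k / (n * vol), r_k

X = np.random.default_rng(0).normal(size=(100, 2))
f, r_k = knn_density(X, k=5)
print(f[:5])                          # estimates for the first five points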

27 K-Means Algorithm
It is intended for use with large data sets, from approximately 100 to 100,000 observations
With small data sets, the results may be highly sensitive to the order of the observations in the data set
It combines an effective method for finding initial clusters with a standard iterative algorithm for minimizing the sum of squared distances from the cluster means

28 K-Means Algorithm
Specify the number of clusters, say k
A set of k points called cluster seeds is selected as a first guess of the means of the k clusters
Each observation is assigned to the nearest seed to form temporary clusters
The seeds are then replaced by the means of the temporary clusters
The process is repeated until no further changes occur in the clusters (see the sketch below)
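A minimal k-means in numpy following these steps, but with plain random seeding rather than the FASTCLUS seed rules described on the next slides (kmeans here is our sketch, not SAS code):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initial seeds: k observations chosen at random
    seeds = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign every observation to its nearest seed
        d = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # replace the seeds by the means of the temporary clusters
        # (assumes no cluster goes empty, fine for this toy example)
        new_seeds = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_seeds, seeds):   # no further change: stop
            break
        seeds = new_seeds
    return labels, seeds

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
print(np.bincount(labels), centers.round(2))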

29 Cluster Seeds
Select the first complete (no missing values) observation as the first seed
The next complete observation that is separated from the first seed by at least the prespecified distance becomes the second seed
Later observations are selected as new seeds if they are separated from all previous seeds by at least the radius, as long as the maximum number of seeds is not exceeded

30 Cluster Seeds If an observation is complete but fails to qualify as a new seed, two tests can be made to see if the observation can replace one of the old seeds

31 Cluster Seeds (cont.)
Test 1: an old seed is replaced if the distance between the observation and the seed closest to it is greater than the minimum distance between seeds. The seed to be replaced is chosen from the two seeds that are closest to each other: of these two, the one replaced is the one with the shorter distance to the closest of the remaining seeds when the other seed is replaced by the current observation.

32 Cluster Seeds (cont.)
Test 2: if the observation fails the first replacement test, a second test is made. The observation replaces the nearest seed if the smallest distance from the observation to all seeds other than the nearest one is greater than the shortest distance from the nearest seed to all other seeds. If this test also fails, the scan moves on to the next observation. (A simplified sketch of the seed scan follows.)
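A simplified sketch of the seed scan (our code, implementing the radius rule and the first replacement test only; the second test and missing-value handling are omitted):

import numpy as np

def select_seeds(X, k, radius=0.0):
    # X is assumed complete (no missing values), so every row is a candidate
    seeds = [X[0]]                            # first observation = first seed
    for x in X[1:]:
        S = np.array(seeds)
        d_x = np.linalg.norm(S - x, axis=1)   # distances obs -> current seeds
        if len(seeds) < k:
            if d_x.min() >= radius:           # far enough from every seed
                seeds.append(x)
            continue
        # Test 1: compare the obs-to-nearest-seed distance with the
        # minimum distance between the current seeds.
        D = np.linalg.norm(S[:, None] - S[None, :], axis=2)
        np.fill_diagonal(D, np.inf)
        i, j = np.unravel_index(D.argmin(), D.shape)   # closest seed pair
        if d_x.min() > D[i, j]:
            # replace whichever of seeds i, j sits closer to the rest of
            # the seed set once its partner is swapped for the observation
            rest = [s for t, s in enumerate(seeds) if t not in (i, j)]
            near_i = np.linalg.norm(np.array(rest + [x]) - S[i], axis=1).min()
            near_j = np.linalg.norm(np.array(rest + [x]) - S[j], axis=1).min()
            seeds[i if near_i < near_j else j] = x
    return np.array(seeds)

X = np.random.default_rng(0).normal(size=(200, 2))
print(select_seeds(X, k=3).round(2))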

33 Dissimilarity Matrices
An n × n dissimilarity matrix has entries d(i, j) = d(j, i), each measuring the "difference" or dissimilarity between objects i and j.

34 Dissimilarity Matrices
d usually satisfies
d(i, i) = 0
d(i, j) ≥ 0
d(i, j) = d(j, i)
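With scipy, pdist computes the condensed dissimilarities and squareform lays them out as the n × n matrix, making the three properties easy to verify (our toy data):

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1., 2.], [3., 4.], [0., 0.]])
D = squareform(pdist(X))           # n x n symmetric dissimilarity matrix
print(np.allclose(np.diag(D), 0))  # d(i, i) = 0
print(np.all(D >= 0))              # d(i, j) >= 0
print(np.allclose(D, D.T))         # d(i, j) = d(j, i)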

35 Dissimilarity
Interval-scaled variables: continuous measurements on a (roughly) linear scale (temperature, height, weight, etc.)

36 Dissimilarity (cont.)
The choice of measurement units strongly affects the resulting clustering
The variable with the largest dispersion will have the largest impact on the clustering
If all variables are considered equally important, the data need to be standardized first

37 Standardization
Standardize x_if to z_if = (x_if − m_f) / s_f, where m_f is a location measure for variable f and s_f is one of:
Mean absolute deviation, s_f = (1/n) Σ_i |x_if − m_f| with m_f the mean (robust)
Median absolute deviation, s_f = med_i |x_if − m_f| with m_f the median (robust)
The usual standard deviation
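A sketch of the three choices (our helper standardize; numpy assumed):

import numpy as np

def standardize(X, scale="meanAD"):
    # column-wise standardization with the three spread measures above
    if scale == "meanAD":                      # robust
        center = X.mean(axis=0)
        s = np.abs(X - center).mean(axis=0)
    elif scale == "medianAD":                  # robust
        center = np.median(X, axis=0)
        s = np.median(np.abs(X - center), axis=0)
    else:                                      # usual standard deviation
        center, s = X.mean(axis=0), X.std(axis=0, ddof=1)
    return (X - center) / s

rng = np.random.default_rng(0)
X = rng.normal([0, 100], [1, 50], size=(20, 2))   # very unequal scales
print(standardize(X, "meanAD").std(axis=0).round(2))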

38 Continuous Ordinal Variables
These are continuous measurements on an unknown scale, or where only the ordering is known but not the actual magnitude.
Replace the x_if by their ranks r_if ∈ {1, …, M_f}
Transform the scale to [0, 1] as follows: z_if = (r_if − 1) / (M_f − 1)
Compute the dissimilarities as for interval-scaled variables
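For example, with scipy.stats.rankdata (our snippet; note that tied values receive averaged ranks, so the transformed values need not hit 0 and 1 exactly):

import numpy as np
from scipy.stats import rankdata

x = np.array([3.1, 47.0, 0.2, 12.9, 12.9])
r = rankdata(x)            # ranks r_if, ties averaged
M = r.max()                # highest rank, standing in for M_f
z = (r - 1) / (M - 1)      # mapped onto [0, 1]
print(z)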

39 Ratio-Scaled Variables
These are positive continuous measurements on a nonlinear scale, such as an exponential scale. One example would be the growth of a bacterial population (say, with a growth function Ae^(Bt)). Three options:
Treat them simply as interval-scaled variables, though this is not recommended as it can distort the measurement scale
Treat them as continuous ordinal data
First transform the data (perhaps by taking logarithms), and then treat the results as interval-scaled variables

40 Discrete Ordinal Variables A variable of this type has M possible values (scores) which are ordered. The dissimilarities are computed in the same way as for continuous ordinal variables.

41 Nominal Variables
Such a variable has M possible values, which are not ordered. The dissimilarity between objects i and j is usually defined by simple matching,
    d(i, j) = (p − m) / p
where p is the total number of variables and m is the number of variables on which objects i and j take the same value.
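Equivalently, count the fraction of mismatching variables (our helper; numpy assumed):

import numpy as np

def nominal_dissim(a, b):
    # simple matching dissimilarity: d(i, j) = (p - m) / p,
    # with m the number of variables on which the two objects agree
    a, b = np.asarray(a), np.asarray(b)
    return np.mean(a != b)

print(nominal_dissim(["red", "round", "large"],
                     ["red", "square", "large"]))   # 1/3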

42 Symmetric Binary Variables
Two possible values, coded 0 and 1, which are equally important (e.g., male and female). Consider the contingency table of the objects i and j:
a = number of variables equal to 1 for both i and j
b = number equal to 1 for i and 0 for j
c = number equal to 0 for i and 1 for j
d = number equal to 0 for both
The dissimilarity is the simple matching coefficient, d(i, j) = (b + c) / (a + b + c + d)

43 Asymmetric Binary Variables
Two possible values, one of which carries more importance than the other. The most meaningful outcome is coded as 1, and the less meaningful outcome as 0. Typically, 1 stands for the presence of a certain attribute (e.g., a disease), and 0 for its absence.

44 Asymmetric Binary Variables
With the counts a, b, c, d as in the table above, negative matches (d) are ignored, giving the Jaccard coefficient
    d(i, j) = (b + c) / (a + b + c)
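Both binary coefficients from these two slides in one helper (our code; a, b, c, d follow the contingency-table notation above):

import numpy as np

def binary_dissim(x, y, asymmetric=False):
    # a = #(1,1), b = #(1,0), c = #(0,1), d = #(0,0)
    x, y = np.asarray(x), np.asarray(y)
    a = np.sum((x == 1) & (y == 1))
    b = np.sum((x == 1) & (y == 0))
    c = np.sum((x == 0) & (y == 1))
    d = np.sum((x == 0) & (y == 0))
    # symmetric: simple matching; asymmetric: Jaccard (ignores 0-0 matches)
    return (b + c) / (a + b + c) if asymmetric else (b + c) / (a + b + c + d)

x, y = [1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0]
print(binary_dissim(x, y), binary_dissim(x, y, asymmetric=True))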

45 Cluster Analysis of Flying Mileages Between 10 American Cities

    0                                                              ATLANTA
  587     0                                                        CHICAGO
 1212   920     0                                                  DENVER
  701   940   879     0                                            HOUSTON
 1936  1745   831  1374     0                                      LOS ANGELES
  604  1188  1726   968  2339     0                                MIAMI
  748   713  1631  1420  2451  1092     0                          NEW YORK
 2139  1858   949  1645   347  2594  2571     0                    SAN FRANCISCO
 2182  1737  1021  1891   959  2734  2408   678     0              SEATTLE
  543   597  1494  1220  2300   923   205  2442  2329     0        WASHINGTON D.C.
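These mileages can be fed to scipy as a ready-made distance matrix to reproduce the analyses that follow (our transcription of the table; note SAS reports distances normalized by the RMS distance, so scipy's raw merge heights differ from the SAS histories by that constant factor):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

cities = ["ATL", "CHI", "DEN", "HOU", "LA",
          "MIA", "NYC", "SF", "SEA", "DC"]
D = np.array([
    [   0,  587, 1212,  701, 1936,  604,  748, 2139, 2182,  543],
    [ 587,    0,  920,  940, 1745, 1188,  713, 1858, 1737,  597],
    [1212,  920,    0,  879,  831, 1726, 1631,  949, 1021, 1494],
    [ 701,  940,  879,    0, 1374,  968, 1420, 1645, 1891, 1220],
    [1936, 1745,  831, 1374,    0, 2339, 2451,  347,  959, 2300],
    [ 604, 1188, 1726,  968, 2339,    0, 1092, 2594, 2734,  923],
    [ 748,  713, 1631, 1420, 2451, 1092,    0, 2571, 2408,  205],
    [2139, 1858,  949, 1645,  347, 2594, 2571,    0,  678, 2442],
    [2182, 1737, 1021, 1891,  959, 2734, 2408,  678,    0, 2329],
    [ 543,  597, 1494, 1220, 2300,  923,  205, 2442, 2329,    0]])

Z = linkage(squareform(D), method="average")
for row in Z:     # each row: (cluster 1, cluster 2, average distance, size),
    print(row)    # analogous to the cluster history tables on the next slides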

46 The CLUSTER Procedure: Average Linkage Cluster Analysis

Cluster History
NCL  Clusters Joined                 FREQ   PSF   PST2  Norm RMS Dist
  9  NEW YORK        WASHINGTON D.C.    2   66.7    .      0.1297
  8  LOS ANGELES     SAN FRANCISCO      2   39.2    .      0.2196
  7  ATLANTA         CHICAGO            2   21.7    .      0.3715
  6  CL7             CL9                4   14.5   3.4     0.4149
  5  CL8             SEATTLE            3   12.4   7.3     0.5255
  4  DENVER          HOUSTON            2   13.9    .      0.5562
  3  CL6             MIAMI              5   15.5   3.8     0.6185
  2  CL3             CL4                7   16.0   5.3     0.8005
  1  CL2             CL5               10    .    16.0     1.2967

Root-Mean-Square Distance Between Observations = 1580.242

47 Average Linkage Cluster Analysis (tree diagram)

48 The CLUSTER Procedure: Centroid Hierarchical Cluster Analysis

Cluster History
NCL  Clusters Joined                 FREQ   PSF   PST2  Norm Cent Dist
  9  NEW YORK        WASHINGTON D.C.    2   66.7    .      0.1297
  8  LOS ANGELES     SAN FRANCISCO      2   39.2    .      0.2196
  7  ATLANTA         CHICAGO            2   21.7    .      0.3715
  6  CL7             CL9                4   14.5   3.4     0.3652
  5  CL8             SEATTLE            3   12.4   7.3     0.5139
  4  DENVER          CL5                4   12.4   2.1     0.5337
  3  CL6             MIAMI              5   14.2   3.8     0.5743
  2  CL3             HOUSTON            6   22.1   2.6     0.6091
  1  CL2             CL4               10    .    22.1     1.173

Root-Mean-Square Distance Between Observations = 1580.242

49 Centroid Hierarchical Cluster Analysis (tree diagram)

50 The CLUSTER Procedure: Single Linkage Cluster Analysis

Cluster History
NCL  Clusters Joined                 FREQ  Norm Min Dist
  9  NEW YORK        WASHINGTON D.C.    2     0.1447
  8  LOS ANGELES     SAN FRANCISCO      2     0.2449
  7  ATLANTA         CL9                3     0.3832
  6  CL7             CHICAGO            4     0.4142
  5  CL6             MIAMI              5     0.4262
  4  CL8             SEATTLE            3     0.4784
  3  CL5             HOUSTON            6     0.4947
  2  DENVER          CL4                4     0.5864
  1  CL3             CL2               10     0.6203

Mean Distance Between Observations = 1417.133

51 Single Linkage Cluster Analysis (tree diagram)

52 The CLUSTER Procedure: Ward's Minimum Variance Cluster Analysis

Cluster History
NCL  Clusters Joined                 FREQ  SPRSQ   RSQ   PSF   PST2
  9  NEW YORK        WASHINGTON D.C.    2  0.0019  .998  66.7    .
  8  LOS ANGELES     SAN FRANCISCO      2  0.0054  .993  39.2    .
  7  ATLANTA         CHICAGO            2  0.0153  .977  21.7    .
  6  CL7             CL9                4  0.0296  .948  14.5   3.4
  5  DENVER          HOUSTON            2  0.0344  .913  13.2    .
  4  CL8             SEATTLE            3  0.0391  .874  13.9   7.3
  3  CL6             MIAMI              5  0.0586  .816  15.5   3.8
  2  CL3             CL5                7  0.1488  .667  16.0   5.3
  1  CL2             CL4               10  0.6669  .000    .   16.0

Root-Mean-Square Distance Between Observations = 1580.242

53 Ward's Minimum Variance Cluster Analysis (tree diagram)

54 Fisher (1936) Iris Data
The FASTCLUS Procedure: Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02

Initial Seeds
Cluster  SepalLength  SepalWidth  PetalLength  PetalWidth
   1        43.0         30.0        11.0         1.0
   2        77.0         26.0        69.0        23.0

Minimum Distance Between Initial Seeds = 70.85196

55 Fisher (1936) Iris Data
The FASTCLUS Procedure: Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02

Iteration History
Iteration  Criterion  Relative Change in Cluster Seeds (1, 2)
    1       11.0638       0.1904    0.3163
    2        5.3780       0.0596    0.0264
    3        5.0718       0.0174    0.00766

Convergence criterion is satisfied.
Criterion Based on Final Seeds = 5.0417

56 Fisher (1936) Iris Data
The FASTCLUS Procedure

Cluster Summary
Cluster  Frequency  RMS Std Deviation  Max Dist Seed to Obs  Nearest Cluster  Distance Between Centroids
   1         53          3.7050              21.1621                2                 39.2879
   2         97          5.6779              24.6430                1                 39.2879

57 Fisher (1936) Iris Data
The FASTCLUS Procedure

Statistics for Variables
Variable      Total STD   Within STD   R-Square   RSQ/(1-RSQ)
SepalLength    8.28066     5.49313     0.562896    1.287784
SepalWidth     4.35866     3.70393     0.282710    0.394137
PetalLength   17.65298     6.80331     0.852470    5.778291
PetalWidth     7.62238     3.57200     0.781868    3.584390
OVER-ALL      10.69224     5.07291     0.776410    3.472463

Pseudo F Statistic = 513.92
Approximate Expected Over-All R-Squared = 0.51539
Cubic Clustering Criterion = 14.806
WARNING: The two values above are invalid for correlated variables.

58 The pseudo F statistic in the output above is
    pseudo F = [R² / (c − 1)] / [(1 − R²) / (n − c)]
c: number of clusters
n: number of observations
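Plugging the OVER-ALL row of the table above into these relations (n = 150, c = 2) recovers the printed statistics up to rounding; a quick check, using the standard degrees-of-freedom conventions (n − 1 for the total, n − c within):

# n = 150 observations, c = 2 clusters; STD values from the table above
n, c = 150, 2
total_std, within_std = 10.69224, 5.07291          # OVER-ALL row
r2 = 1 - ((n - c) * within_std**2) / ((n - 1) * total_std**2)
pseudo_f = (r2 / (c - 1)) / ((1 - r2) / (n - c))
print(round(r2, 6))              # ~0.776410, the printed R-Square
print(round(r2 / (1 - r2), 6))   # ~3.472463, the printed RSQ/(1-RSQ)
print(round(pseudo_f, 1))        # ~513.9, the printed Pseudo F Statistic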

59 Fisher (1936) Iris Data
The FASTCLUS Procedure: Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02

Cluster Means
Cluster  SepalLength   SepalWidth    PetalLength   PetalWidth
   1     50.05660377   33.69811321   15.60377358    2.90566038
   2     63.01030928   28.86597938   49.58762887   16.95876289

Cluster Standard Deviations
Cluster  SepalLength   SepalWidth    PetalLength   PetalWidth
   1     3.427350930   4.396611045   4.404279486   2.105525249
   2     6.336887455   3.267991438   7.800577673   4.155612484

60 Fisher (1936) Iris Data
The FREQ Procedure: Table of CLUSTER by Species (frequency, with row % and column % in parentheses)

Cluster 1: Setosa 50 (94.34%, 100.00%), Versicolor 3 (5.66%, 6.00%), Virginica 0; total 53 (35.33% of 150)
Cluster 2: Setosa 0, Versicolor 47 (48.45%, 94.00%), Virginica 50 (51.55%, 100.00%); total 97 (64.67%)
Total:     Setosa 50, Versicolor 50, Virginica 50; 150 (100.00%)
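The same analysis can be approximated with scikit-learn (our sketch; k-means seeding differs from FASTCLUS, so cluster numbering and exact counts may not match, and sklearn's iris is in cm where the SAS run used mm, which only rescales distances):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, species = load_iris(return_X_y=True)   # 150 observations, 4 variables

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# cross-tabulate cluster vs species, like the FREQ table above
for lab in range(2):
    counts = np.bincount(species[km.labels_ == lab], minlength=3)
    print(f"cluster {lab}: setosa={counts[0]} "
          f"versicolor={counts[1]} virginica={counts[2]}")

Setting n_clusters=3 gives the analogue of the three-cluster run on the following slides.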

61 Fisher (1936) Iris Data
The FASTCLUS Procedure: Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02

Initial Seeds
Cluster  SepalLength  SepalWidth  PetalLength  PetalWidth
   1        58.0         40.0        12.0         2.0
   2        77.0         38.0        67.0        22.0
   3        49.0         25.0        45.0        17.0

Minimum Distance Between Initial Seeds = 38.23611

62 Fisher (1936) Iris Data
The FASTCLUS Procedure: Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02

Iteration History
Iteration  Criterion  Relative Change in Cluster Seeds (1, 2, 3)
    1        6.7591      0.2652    0.3205    0.2985
    2        3.7097      0         0.0459    0.0317
    3        3.6427      0         0.0182    0.0124

Convergence criterion is satisfied.
Criterion Based on Final Seeds = 3.6289

63 Fisher (1936) Iris Data
The FASTCLUS Procedure

Cluster Summary
Cluster  Frequency  RMS Std Deviation  Max Dist Seed to Obs  Nearest Cluster  Distance Between Centroids
   1         50          2.7803              12.4803                3                 33.5693
   2         38          4.0168              14.9736                3                 17.9718
   3         62          4.0398              16.9272                2                 17.9718

64 Fisher (1936) Iris Data
The FASTCLUS Procedure

Statistics for Variables
Variable      Total STD   Within STD   R-Square   RSQ/(1-RSQ)
SepalLength    8.28066     4.39488     0.722096    2.598359
SepalWidth     4.35866     3.24816     0.452102    0.825156
PetalLength   17.65298     4.21431     0.943773   16.784895
PetalWidth     7.62238     2.45244     0.897872    8.791618
OVER-ALL      10.69224     3.66198     0.884275    7.641194

Pseudo F Statistic = 561.63
Approximate Expected Over-All R-Squared = 0.62728
Cubic Clustering Criterion = 25.021
WARNING: The two values above are invalid for correlated variables.

65 Fisher (1936) Iris Data
The FASTCLUS Procedure: Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02

Cluster Means
Cluster  SepalLength   SepalWidth    PetalLength   PetalWidth
   1     50.06000000   34.28000000   14.62000000    2.46000000
   2     68.50000000   30.73684211   57.42105263   20.71052632
   3     59.01612903   27.48387097   43.93548387   14.33870968

Cluster Standard Deviations
Cluster  SepalLength   SepalWidth    PetalLength   PetalWidth
   1     3.524896872   3.790643691   1.736639965   1.053855894
   2     4.941550255   2.900924461   4.885895746   2.798724562
   3     4.664100551   2.962840548   5.088949673   2.974997167

66 Fisher (1936) Iris Data
The FREQ Procedure: Table of CLUSTER by Species (frequency, with row % and column % in parentheses)

Cluster 1: Setosa 50 (100.00%, 100.00%), Versicolor 0, Virginica 0; total 50 (33.33% of 150)
Cluster 2: Setosa 0, Versicolor 2 (5.26%, 4.00%), Virginica 36 (94.74%, 72.00%); total 38 (25.33%)
Cluster 3: Setosa 0, Versicolor 48 (77.42%, 96.00%), Virginica 14 (22.58%, 28.00%); total 62 (41.33%)
Total:     Setosa 50, Versicolor 50, Virginica 50; 150 (100.00%)

