Clustering Procedure
Cheng Lei
Department of Electrical and Computer Engineering
University of Victoria
April 16, 2015
Outline ❖ Overview ❖ CLUSTER Procedure ❖ Clustering Methods
Overview
❖ Data: distances or coordinates
❖ Clustering methods: 11 methods supported
❖ FASTCLUS Procedure
   CPU time is proportional to the number of observations
   Use FASTCLUS for a preliminary cluster analysis
   Use CLUSTER to cluster the preliminary clusters hierarchically
❖ Principles
   Each observation begins in a cluster by itself
   The two closest clusters are merged to form a new one that replaces the two old ones
   The merging step is repeated until only one cluster is left
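The three principles can be sketched in a few lines of plain Python. This is only an illustration outside SAS, not how PROC CLUSTER is implemented. The function names are hypothetical, and using the squared distance between cluster means as the measure of "closest" is an assumption made for the sketch.

```python
from itertools import combinations

def centroid(cluster):
    """Mean point of a cluster given as a list of coordinate tuples."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerate(points):
    """Merge the two closest clusters until one is left; return the merge history."""
    clusters = [[p] for p in points]          # each observation starts in its own cluster
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest (assumed criterion).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: sq_dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # The new cluster replaces the two old ones.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
history = agglomerate(points)
```

With n observations the loop always performs n - 1 merges. Each pass compares every pair of remaining clusters, which is why this kind of procedure becomes expensive on large data sets.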
Overview
❖ CLUSTER Procedure
   Not practical for very large data sets, as CPU time is roughly proportional to the square or cube of the number of observations
   Displays a history of the clustering process
   Shows statistics for estimating the number of clusters: RMSSTD, Pseudo F, Pseudo T-square
   Creates a dendrogram
   Creates output data sets that the TREE procedure uses to output the cluster membership
CLUSTER Procedure
PROC CLUSTER METHOD=method-name;
   BY variables;
   COPY variables;
   FREQ variable;
   ID variable;
   RMSSTD variable;
   VAR variables;
Options
❖ RMSSTD: root mean squared standard deviation of a cluster
❖ Pseudo F: the ratio of between-cluster variance to within-cluster variance
❖ Pseudo T-square: a measure of how distinct the two clusters being merged are; a large value suggests they should stay separate
RMSSTD
$\mathrm{RMSSTD}_k = \sqrt{\dfrac{W_k}{v\,(N_k - 1)}}$
$W_k$: the within-group sum of squares of cluster $k$
$N_k$: the number of elements in cluster $k$
$v$: the number of variables
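As a small numeric check of the RMSSTD definition, the sketch below computes it for one cluster in plain Python. The function names are hypothetical. It assumes the standard form, the square root of the within-group sum of squares divided by v(N_k - 1).

```python
import math

def within_ss(cluster):
    """W_k: sum of squared deviations from the cluster mean, over all variables."""
    n = len(cluster)
    v = len(cluster[0])
    means = [sum(row[d] for row in cluster) / n for d in range(v)]
    return sum((row[d] - means[d]) ** 2 for row in cluster for d in range(v))

def rmsstd(cluster):
    """Root mean squared standard deviation of one cluster (assumed form)."""
    n_k = len(cluster)
    v = len(cluster[0])
    if n_k < 2:
        # The divisor v * (N_k - 1) is zero for a single-observation cluster.
        raise ValueError("RMSSTD is undefined for a cluster with one observation")
    return math.sqrt(within_ss(cluster) / (v * (n_k - 1)))

cluster = [(0.0, 0.0), (2.0, 2.0)]
value = rmsstd(cluster)   # W_k = 4, v = 2, N_k - 1 = 1
```

Here the cluster mean is (1, 1), so W_k = 4 and the result is the square root of 4/2.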
Pseudo F
$F = \dfrac{B / (G - 1)}{W / (n - G)}$
$B$: the between-group sum of squares
$W$: the within-group sum of squares
$G$: the number of clusters at a certain step
$n$: the number of observations
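The pseudo F statistic can be checked by hand on a toy partition. The plain-Python sketch below uses hypothetical function names. It assumes that the between-group sum of squares is the total sum of squares minus the within-group sum of squares.

```python
def sum_sq(points):
    """Sum of squared deviations of the points from their own mean."""
    n = len(points)
    v = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(v)]
    return sum((p[d] - means[d]) ** 2 for p in points for d in range(v))

def pseudo_f(clusters):
    """Pseudo F for a partition given as a list of clusters (lists of tuples)."""
    all_points = [p for c in clusters for p in c]
    n = len(all_points)                         # number of observations
    g = len(clusters)                           # number of clusters at this step
    w = sum(sum_sq(c) for c in clusters)        # within-group sum of squares
    b = sum_sq(all_points) - w                  # between-group sum of squares (assumed T - W)
    # Undefined when g == 1, n == g, or w == 0; no guard is added in this sketch.
    return (b / (g - 1)) / (w / (n - g))

clusters = [[(0.0,), (2.0,)], [(10.0,), (12.0,)]]
f_value = pseudo_f(clusters)   # B = 100, W = 4, G = 2, n = 4
```

For this partition the total sum of squares is 104 and the within-group part is 4, so the ratio is (100 / 1) / (4 / 2). Larger values indicate tighter, better separated clusters.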
Pseudo T-Square
$t^2 = \dfrac{B_{KL}}{(W_K + W_L) / (N_K + N_L - 2)}$
$W_K$, $W_L$: within-cluster sums of squares of clusters $K$ and $L$
$N_K$, $N_L$: numbers of observations in clusters $K$ and $L$
$B_{KL}$: between-cluster sum of squares
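A short plain-Python sketch of the pseudo t-square statistic for one merge follows. The function names are hypothetical. It assumes the between-cluster sum of squares is the within sum of squares of the merged cluster minus those of the two clusters being joined.

```python
def sum_sq(points):
    """Within sum of squares: squared deviations from the points' own mean."""
    n = len(points)
    v = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(v)]
    return sum((p[d] - means[d]) ** 2 for p in points for d in range(v))

def pseudo_t2(cluster_k, cluster_l):
    """Pseudo t-square for merging cluster K with cluster L."""
    w_k = sum_sq(cluster_k)
    w_l = sum_sq(cluster_l)
    # Assumed definition: B_KL = W_M - W_K - W_L, with M the merged cluster.
    b_kl = sum_sq(cluster_k + cluster_l) - w_k - w_l
    n_k, n_l = len(cluster_k), len(cluster_l)
    # Undefined when W_K + W_L is zero, e.g. when both clusters are single points.
    return b_kl / ((w_k + w_l) / (n_k + n_l - 2))

cluster_k = [(0.0,), (2.0,)]
cluster_l = [(5.0,), (7.0,), (9.0,)]
t2 = pseudo_t2(cluster_k, cluster_l)   # B_KL = 43.2, W_K + W_L = 10
```

In this example B_KL is 43.2 and the pooled within term is 10 / 3, giving 12.96. A large value at a given step suggests the two clusters were distinct before being merged.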
METHODS
❖ Average Linkage (AVE or AVERAGE)
❖ Centroid Method (CEN or CENTROID)
❖ Complete Linkage (COM or COMPLETE)
❖ Density Linkage (DEN or DENSITY)
❖ Maximum Likelihood (EML)
❖ Flexible-Beta Method (FLE or FLEXIBLE)
❖ McQuitty’s Similarity Analysis (MCQ or MCQUITTY)
❖ Median Method (MED or MEDIAN)
❖ Single Linkage (SIN or SINGLE)
❖ Two-Stage Density Linkage (TWO or TWOSTAGE)
❖ Ward’s Minimum-Variance Method (WAR or WARD)
Average Linkage
Idea: the distance between two clusters is defined as the average distance between pairs of observations, one in each cluster
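The average-linkage idea translates directly into a double loop over the two clusters. This plain-Python sketch uses a hypothetical function name. It assumes unsquared Euclidean distance between observations, since the slide does not say which distance is averaged.

```python
import math

def average_linkage(cluster_k, cluster_l):
    """Average distance over all pairs with one observation from each cluster."""
    total = sum(math.dist(x, y) for x in cluster_k for y in cluster_l)
    return total / (len(cluster_k) * len(cluster_l))

cluster_k = [(0.0, 0.0)]
cluster_l = [(3.0, 4.0), (6.0, 8.0)]
d_kl = average_linkage(cluster_k, cluster_l)   # pairwise distances are 5 and 10
```

Every pair contributes equally, so an outlying observation shifts the cluster distance less than it would under single or complete linkage, which use only the smallest or largest pairwise distance.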
Centroid Method
Idea: the distance between two clusters is defined as the Euclidean distance between their centroids (cluster means)
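A minimal plain-Python sketch of the centroid method follows. It reads the cluster distance as the Euclidean distance between cluster means, which is an interpretation of the slide's one-line statement. The function names are hypothetical.

```python
import math

def centroid(cluster):
    """Mean point of a cluster given as a list of coordinate tuples."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def centroid_distance(cluster_k, cluster_l):
    """Euclidean distance between the centroids of two clusters."""
    return math.dist(centroid(cluster_k), centroid(cluster_l))

cluster_k = [(0.0, 0.0), (2.0, 0.0)]   # centroid (1, 0)
cluster_l = [(4.0, 3.0), (4.0, 5.0)]   # centroid (4, 4)
d_kl = centroid_distance(cluster_k, cluster_l)
```

Only the two means enter the calculation, so the cost of one comparison does not grow with the number of pairs, unlike average linkage.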
Next week’s work
❖ Work through examples in the Base SAS language
❖ Read more about other procedures in SAS/STAT
Thank You!!!