Clustering
Jagdish Gangolly
State University of New York at Albany
Acc 522, Fall 2006
Clustering
- Clustering in S-Plus
- Objectives of Clustering
- Methods:
  - Hierarchical
  - Partitioning (iterative relocation)
  - Model-based methods
Clustering in S-Plus
- You need to load the S-Plus cluster library: library(cluster)
- Data can be either an n × p matrix of measurements on each of the p variables for each object, or an n × n matrix of dissimilarities, where d(i,j) represents the dissimilarity between objects i and j.
- daisy in the cluster library constructs the dissimilarity matrix, as in the sketch below.
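A minimal sketch of the two input forms, using R's cluster package (whose interface closely follows the S-Plus library); the small data frame here is made up purely for illustration:

  library(cluster)

  # n x p measurement matrix: 10 objects measured on p = 2 variables
  x <- data.frame(height = rnorm(10, 170, 10),
                  weight = rnorm(10, 70, 15))

  # n x n dissimilarities: daisy() builds a dissimilarity object;
  # stand = TRUE standardises each variable first
  d <- daisy(x, metric = "euclidean", stand = TRUE)
  as.matrix(d)[1:3, 1:3]   # entry (i, j) is the dissimilarity d(i, j)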
Objectives of Clustering
- To classify a data set into groups that are internally cohesive and externally isolated (loosely coupled).
- Inputs the user supplies to a clustering method:
  - dataset (matrix, dataframe)
  - distance measure
  - optimisation criterion
  - number of clusters (partitioning methods)
  - shape of clusters, probability distribution (model-based methods)
Distance Measures
- See the data mining text, Ch. 2, slides 47-56.
Clustering methods: Hierarchical I
- Agglomerative methods: start with each observation forming a separate group; observations close to each other are successively merged. The results are displayed in the form of a dendrogram.
- Divisive methods: the initial clustering consists of one cluster containing the whole dataset; clusters are successively split into two smaller clusters until each cluster contains exactly one object.
Clustering methods: Hierarchical II
- Agglomerative Nesting: agnes(x, diss, metric, stand, method,…)
- Methods:
  - average (group average)
  - single (single linkage): nearest-neighbour method
  - complete (complete linkage): furthest-neighbour method
  - ward (Ward's method)
  - weighted (weighted average linkage)
- Evaluation criterion: agglomerative coefficient (AC)
- Results display: dendrogram, banner plot (see the sketch below)
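A hedged sketch of agnes() on made-up data, again assuming R's cluster package; the ac component of the result holds the agglomerative coefficient named above:

  library(cluster)
  x <- data.frame(height = rnorm(10, 170, 10), weight = rnorm(10, 70, 15))

  ag <- agnes(x, metric = "euclidean", stand = TRUE, method = "average")
  ag$ac      # agglomerative coefficient (AC), between 0 and 1
  plot(ag)   # banner plot, then the dendrogram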
Clustering methods: Hierarchical III
- hclust: hierarchical clustering: hclust(dist, method, sim)
  - dist: distances
  - method: compact (complete linkage), average, connected (single linkage)
- Results are displayed using plclust, as sketched below.
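A sketch in R, where hclust() lives in the base stats package; the S-Plus method names compact and connected correspond to R's "complete" and "single", and plot() on the result plays the role of plclust():

  x <- matrix(rnorm(20), ncol = 2)           # 10 objects, 2 variables
  hc <- hclust(dist(x), method = "complete") # "compact" in S-Plus terms
  plot(hc)                                   # dendrogram, as plclust would draw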
Clustering methods: Hierarchical IV
- Divisive Analysis: diana(x, diss, metric, stand, …)
  - Evaluation criterion: divisive coefficient (DC)
  - Results display: dendrogram, banner plot
- Monothetic Analysis: for a binary data matrix; for each split, mona uses a single (well-chosen) variable: mona(x)
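A sketch of both functions under the same assumptions (R's cluster package, made-up data); the binary matrix fed to mona() is derived ad hoc by thresholding each variable at its median:

  library(cluster)
  x <- data.frame(height = rnorm(10, 170, 10), weight = rnorm(10, 70, 15))

  dv <- diana(x, metric = "euclidean", stand = TRUE)
  dv$dc      # divisive coefficient (DC)
  plot(dv)   # banner plot and dendrogram

  # mona() requires a binary (0/1) data matrix
  xb <- cbind(tall  = as.integer(x$height > median(x$height)),
              heavy = as.integer(x$weight > median(x$weight)))
  mona(xb)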
Clustering methods: Partitioning Methods I
- Methods for dividing the set of objects into k clusters; k needs to be specified by the user.
- k-means
- Partitioning Around Medoids (pam): accepts a dissimilarity matrix, minimises a sum of dissimilarities (rather than a sum of squared distances) and so is more robust, and displays a silhouette plot.
  pam(data, k, diss, metric, stand,…)
  - data: matrix or dataframe
  - diss: T or F
  - metric: euclidean or manhattan
  - stand: T or F
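A sketch of pam() with k = 2 on made-up data, assuming R's cluster package; plot() on the result produces the silhouette plot mentioned above:

  library(cluster)
  x <- data.frame(height = rnorm(10, 170, 10), weight = rnorm(10, 70, 15))

  pm <- pam(x, k = 2, metric = "euclidean", stand = TRUE)
  pm$medoids   # the k representative objects
  plot(pm)     # clusplot, then the silhouette plot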
Clustering methods: Partitioning Methods II
- Clustering Large Applications: considers data subsets of fixed size to cluster very large datasets.
  clara(x, k, metric, stand, samples, sampsize, …)
- Fanny: fuzzy clustering.
  fanny(x, k, diss, metric, stand,…)
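Sketches of both calls under the same assumptions; clara()'s samples and sampsize arguments control how many subsets it draws and how large each one is:

  library(cluster)
  x <- data.frame(height = rnorm(50, 170, 10), weight = rnorm(50, 70, 15))

  # clara: clusters via repeated pam() runs on subsamples
  cl <- clara(x, k = 2, metric = "euclidean", samples = 5, sampsize = 20)
  cl$clustering   # hard cluster label for each object

  # fanny: each object gets a degree of membership in every cluster
  fz <- fanny(x, k = 2, metric = "euclidean", stand = TRUE)
  fz$membership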
Clustering: Displays of Results
- Dendrograms: plot.agnes(), plot.diana(), plot.mona()
- Print: print.agnes(), print.diana(), print.mona(), print.pam(), print.fanny(), print.clara()
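Because these are methods for the generic plot() and print() functions, you rarely call them by their full names; a brief sketch, again assuming R's cluster package and made-up data:

  library(cluster)
  x <- data.frame(height = rnorm(10, 170, 10), weight = rnorm(10, 70, 15))

  ag <- agnes(x)
  print(ag)   # dispatches to print.agnes()
  plot(ag)    # dispatches to plot.agnes(): banner plot, then dendrogram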