Clustering: based on the book chapter "Cluster Analysis" in Multivariate Data Analysis by Hair, Anderson, Tatham, and Black
Cluster analysis: What? We group objects based on the characteristics they possess; also called numerical taxonomy or typology construction; often atheoretical: no statistical basis, lots of heuristics
Intuitive basis: ah, yes, the Gestalt laws
Clustering methods: nonhierarchical (vector quantization); hierarchical (agglomerative, divisive); fuzzy (probabilistic mixture models?)
Objectives: exploratory/confirmatory; taxonomy description (e.g. biology); data simplification (e.g. segmentation)
Select the variables: abracadabra, explicit theories, past research, suppositions, hopes, deadlines, ...
Research design: detect and remove outliers; choose a similarity measure (Householder norm, usually Euclidean; Mahalanobis; correlation); standardize the data (by variable; within case)
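The two standardization options can be sketched as follows. This is a minimal illustration, assuming z-scores with the population standard deviation; the function names and the data are made up for the example.

```python
from statistics import mean, pstdev

def zscores(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    # A constant list has zero spread; map it to zeros instead of dividing by 0.
    return [0.0 if s == 0 else (v - m) / s for v in values]

def standardize_by_variable(data):
    """Standardize each column (variable) across all cases."""
    columns = [zscores(list(col)) for col in zip(*data)]
    return [list(row) for row in zip(*columns)]

def standardize_within_case(data):
    """Standardize each row (case) across its own variables."""
    return [zscores(row) for row in data]

# Made-up data: rows are cases, columns are (age in years, income).
data = [[15.0, 1000.0], [20.0, 3000.0], [25.0, 2000.0]]
by_variable = standardize_by_variable(data)
within_case = standardize_within_case(data)
```

Standardizing by variable keeps a large-scale variable (here income) from dominating the distances. Standardizing within case removes each case's overall level and keeps only its profile shape.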
Similarity measures
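A small sketch of the three families of measures, assuming cases are plain lists of numbers. The function names are illustrative, and the inverse covariance matrix for the Mahalanobis distance is assumed to be supplied by the caller.

```python
import math

def minkowski(x, y, p=2):
    """L_p distance; p=2 gives the Euclidean distance, p=1 the city-block distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def correlation(x, y):
    """Pearson correlation between the profiles of two cases (a similarity)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mahalanobis(x, y, inv_cov):
    """sqrt(d^T S^-1 d), where inv_cov is the inverse covariance matrix S^-1."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return math.sqrt(sum(d[i] * inv_cov[i][j] * d[j]
                         for i in range(n) for j in range(n)))

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
euclid = minkowski(a, b)        # sqrt(1 + 4 + 9)
city_block = minkowski(a, b, 1)
r = correlation(a, b)           # same profile shape, different size

# With the identity as inverse covariance, Mahalanobis reduces to Euclidean.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
maha = mahalanobis(a, b, identity)
```

The two cases correlate perfectly although their Euclidean distance is clearly nonzero: correlation measures similarity of pattern, distance measures similarity of magnitude.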
Research design: representativeness of the sample (cf. outliers); multicollinearity?
How is this done? Abracadabra, explicit theories, past research, suppositions, hopes, deadlines, ...
Clustering procedure: single linkage; complete linkage; average linkage; centroid method; Ward’s method
Single linkage: easily results in snake-like clusters even if they don’t exist
Complete linkage: eliminates the snake formation; otherwise a big question mark
Average linkage: joins the clusters with the smallest average distance; not so outlier sensitive; tends to form clusters with small within-cluster variation; biased to form clusters with approximately the same variance; etc.
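The three linkage rules differ only in how the distance between two clusters is computed from the pairwise point distances. Below is a naive sketch (deliberately inefficient, with illustrative names and made-up points) that plugs each rule into the same agglomerative loop.

```python
import math
from itertools import combinations

def pair_distances(a, b):
    """All point-to-point distances between clusters a and b."""
    return [math.dist(p, q) for p in a for q in b]

def single_linkage(a, b):
    return min(pair_distances(a, b))      # nearest neighbours

def complete_linkage(a, b):
    return max(pair_distances(a, b))      # furthest neighbours

def average_linkage(a, b):
    d = pair_distances(a, b)
    return sum(d) / len(d)                # mean over all pairs

def agglomerate(points, linkage, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]   # i < j, so deleting j is safe
        del clusters[j]
    return clusters

# Made-up points along a line: four evenly spaced, one slightly further away.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.2, 0.0)]
single = agglomerate(points, single_linkage, 2)
complete = agglomerate(points, complete_linkage, 2)
```

On these points single linkage chains the four evenly spaced points into one elongated cluster and leaves the last point alone, while complete linkage splits the line into two more compact groups of two and three points.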
Centroid method: most outlier robust; confusing situations: intercentroid distances may become smaller than the distances between already joined pairs, which messes up the dendrogram
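The confusing situation can be reproduced with three points. The sketch below uses made-up coordinates forming a nearly equilateral triangle; the helper name is illustrative.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

a, b, c = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

# a and b are the closest pair, so they are joined first.
pairwise = {"ab": math.dist(a, b), "ac": math.dist(a, c), "bc": math.dist(b, c)}
first_merge = pairwise["ab"]

# The centroid of {a, b} is (1, 0), which lies closer to c than a was to b.
second_merge = math.dist(centroid([a, b]), c)
```

The second merge happens at a smaller distance than the first, so the dendrogram would have to step downward at that point.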
Ward’s method: the distance between two clusters is the increase in the within-cluster sum of squared deviations caused by merging them; tends to combine clusters with a small number of objects; biased toward clusters with an approximately equal number of objects
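One common way to write Ward's criterion is as the increase in the within-cluster sum of squares when two clusters are merged. The sketch below (illustrative names, made-up points) computes that increase directly and through the equivalent closed form n_a n_b / (n_a + n_b) times the squared distance between the centroids.

```python
def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)

def ward_cost(a, b):
    """Increase in total within-cluster sum of squares if a and b are merged."""
    return sse(a + b) - sse(a) - sse(b)

def ward_cost_closed_form(a, b):
    """Same quantity: n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2."""
    ca, cb = centroid(a), centroid(b)
    squared = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared

a = [(0.0, 0.0), (2.0, 0.0)]
b = [(5.0, 0.0)]
direct = ward_cost(a, b)
closed = ward_cost_closed_form(a, b)
```

The size factor in the closed form grows with the number of objects in both clusters, which is one way to see why small clusters tend to be combined first.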
Nonhierarchical: heuristic methods: sequential threshold / parallel threshold; objective-function based: VQs, e.g., the K-means procedure. Complexity: hierarchical O(N^2), K-means O(KN)
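A minimal sketch of the K-means procedure in its usual alternating form (assign each point to the nearest centroid, then recompute the centroids). The initial centroids are passed in explicitly to keep the example deterministic; names and data are made up.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: K distance evaluations per point, i.e. O(KN) per pass.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids, final_clusters = kmeans(points, [(0.0, 0.0), (9.0, 8.0)])
```

The result depends on the initial centroids, so in practice the procedure is usually run from several starting points.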
How many clusters? Open question; practical limits (it would be nice to have 3-6 clusters); dendrogram based (large increase in cluster distances)
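The dendrogram-based rule can be sketched as a search for the largest jump in the sequence of merge distances. This is only a heuristic, and it assumes the merge distances are listed in merge order and do not decrease; the function name and the numbers are made up.

```python
def clusters_from_largest_jump(merge_distances):
    """Suggest a cluster count by stopping just before the largest jump.

    merge_distances[i] is the distance at which the (i+1)-th merge happened,
    so N objects produce N - 1 entries.
    """
    n_objects = len(merge_distances) + 1
    jumps = [later - earlier
             for earlier, later in zip(merge_distances, merge_distances[1:])]
    i = max(range(len(jumps)), key=jumps.__getitem__)
    # Stopping after merge number i + 1 leaves n_objects - (i + 1) clusters.
    return n_objects - 1 - i

# Made-up agglomeration schedule for 7 objects.
heights = [1.0, 1.0, 1.2, 1.5, 6.0, 9.0]
suggested = clusters_from_largest_jump(heights)
```

Here the largest increase is from 1.5 to 6.0, so the heuristic stops after the fourth merge and suggests three clusters.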
Validation: exogenous variables; indexes, e.g. the Davies-Bouldin measure. [Figure: clusters annotated with exogenous variables: age 15, age 20, age 14; likes Donald Duck, doesn’t like DD]
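A sketch of the Davies-Bouldin index in its common form: for each cluster, take the worst ratio of summed within-cluster scatter to between-centroid distance, then average over clusters. Scatter is taken here as the mean Euclidean distance to the centroid; names and data are illustrative.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def scatter(cluster):
    """Mean distance from the points of a cluster to its centroid."""
    c = centroid(cluster)
    return sum(math.dist(p, c) for p in cluster) / len(cluster)

def davies_bouldin(clusters):
    """Average over clusters of the worst (S_i + S_j) / M_ij; lower is better."""
    cents = [centroid(c) for c in clusters]
    scat = [scatter(c) for c in clusters]
    k = len(clusters)
    worst = [
        max((scat[i] + scat[j]) / math.dist(cents[i], cents[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst) / k

# Two made-up solutions: well separated clusters versus clusters close together.
separated = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
close = [[(0.0, 0.0), (0.0, 2.0)], [(3.0, 0.0), (3.0, 2.0)]]
db_separated = davies_bouldin(separated)
db_close = davies_bouldin(close)
```

Both solutions have the same within-cluster scatter, so the index differs only because of the distance between the centroids; the well separated solution gets the lower (better) value.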
Key issues: similarity or dissimilarity measure ... and data standardization