Published by Erick Horton. Modified over 6 years ago.
1
Clustering
based on the book chapter "Cluster Analysis" in Multivariate Data Analysis by Hair, Anderson, Tatham, and Black
2
Cluster analysis
What? We group objects based on the characteristics they possess
also called numerical taxonomy or typology construction
often atheoretical: no statistical basis, lots of heuristics
3
Intuitive basis
Oh, yes: the Gestalt laws
4
Clustering methods
nonhierarchical: vector quantization
hierarchical: agglomerative, divisive
fuzzy: probabilistic mixture models?
5
Objectives
exploratory/confirmatory
taxonomy description (e.g. biology)
data simplification (e.g. segmentation)
6
Select the variables
abracadabra, explicit theories, past research, suppositions, hopes, deadlines,
7
Research design
detect and remove outliers
choose a similarity measure: Hölder norm (usually Euclidean), Mahalanobis, correlation
standardize the data: by variable, within case
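The measures and standardizations listed on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`minkowski`, `correlation`, `standardize_by_variable`, `standardize_within_case`) and the toy data are made up here, population standard deviations are used, and the Mahalanobis distance is left out because it additionally needs the inverse covariance matrix.

```python
import math

def minkowski(a, b, p=2.0):
    """Minkowski (p-norm) distance; p=2 gives the Euclidean distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def correlation(a, b):
    """Pearson correlation between two profiles (a similarity, not a distance)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def zscore(values):
    """Zero mean, unit (population) standard deviation; assumes non-constant values."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

def standardize_by_variable(rows):
    """Standardize each column (variable) across all cases."""
    cols = [zscore(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def standardize_within_case(rows):
    """Standardize each row (case) across its own variables."""
    return [zscore(list(row)) for row in rows]

# Hypothetical toy data: 3 cases, 2 variables on very different scales.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
by_variable = standardize_by_variable(data)
within_case = standardize_within_case(data)
```

Without standardization by variable, the second column would dominate any Euclidean distance computed on `data`. Standardizing within case instead removes level differences between cases and keeps only the shape of each profile.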
8
Similarity measures
9
Research design
representativeness of the sample (cf. outliers)
multicollinearity?
10
How is this done?
abracadabra, explicit theories, past research, suppositions, hopes, deadlines,
11
Clustering procedure
single linkage
complete linkage
average linkage
centroid method
Ward’s method
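The first three procedures differ only in how the distance between two clusters is derived from the pairwise object distances: the minimum (single), the maximum (complete), or the mean (average). The following is a minimal, deliberately naive sketch of agglomerative clustering with that rule as a parameter. The names `agglomerate` and `LINKAGES` and the sample points are assumptions made for illustration; the centroid method and Ward's method need cluster centroids and are not covered by this sketch.

```python
import math
from itertools import combinations

def euclid(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# How to turn all pairwise distances between two clusters into one number.
LINKAGES = {
    "single": min,                             # closest pair
    "complete": max,                           # farthest pair
    "average": lambda ds: sum(ds) / len(ds),   # mean over all pairs
}

def agglomerate(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a list of indices into points.
    """
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            ds = [euclid(points[i], points[j])
                  for i in clusters[a] for j in clusters[b]]
            d = combine(ds)
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: a < b, so index a is unaffected
    return clusters

# Hypothetical sample: two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerate(points, 3, linkage="complete")
```

This version recomputes every inter-cluster distance at each step, which is far from efficient but keeps the role of the linkage rule easy to see.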
12
Single linkage
easily results in snake-like clusters even if they don’t exist
13
Complete linkage
eliminates the snake formation, otherwise a big question mark
14
Average linkage
joins the clusters with the smallest average distance
not so sensitive to outliers
tends to form clusters with small within-cluster variation
biased toward clusters with approximately the same variance
etc.
15
Centroid method
16
Centroid method
most outlier robust
confusing situations: intercentroid distances may become smaller than the distances between already joined pairs, which messes up the dendrogram
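A small worked example of such a confusing situation, using three hypothetical points that form a nearly equilateral triangle (the coordinates and helper names are chosen only for illustration):

```python
import math

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def centroid(points):
    """Mean point of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

A, B, C = (0.0, 0.0), (1.0, 0.0), (0.5, 0.9)

# A and B are the closest pair, so they are joined first.
first_merge = dist(A, B)

# The centroid of {A, B} lies between them, closer to C than A or B were.
second_merge = dist(centroid([A, B]), C)

inversion = second_merge < first_merge
```

Here the second merge happens at a smaller distance than the first one, so the merge heights are not monotonically increasing and the dendrogram branches cross.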
17
Ward’s method
distance between two clusters is the increase in the within-cluster sum of squares caused by merging them
tends to combine clusters with a small number of objects
biased toward clusters with approximately equal numbers of objects
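To make the squared quantity concrete: under the usual formulation of Ward's criterion, the cost of merging clusters A and B is the increase in the within-cluster sum of squared distances to the centroid, which equals n_A n_B / (n_A + n_B) times the squared distance between the two centroids. The sketch below checks that identity on made-up data; the function names are assumptions for illustration.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points (tuples)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sse(points):
    """Within-cluster sum of squared distances to the centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)

def ward_cost(a, b):
    """Increase in total SSE if clusters a and b are merged (closed form)."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * sq_dist(centroid(a), centroid(b))

# Hypothetical clusters on a line.
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(10.0, 0.0)]

closed_form = ward_cost(a, b)
direct = sse(a + b) - sse(a) - sse(b)
```

The size factor n_A n_B / (n_A + n_B) is small when either cluster is small, which is one way to see why the method tends to merge small clusters first and ends up with clusters of similar size.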
18
Nonhierarchical
heuristic methods: sequential threshold/parallel threshold
objective-function based: VQs, e.g., the K-means procedure
Hierarchical: O(N^2), K-means: O(KN)
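A minimal sketch of the K-means procedure in the common Lloyd-style form (alternate between assigning objects to the nearest center and recomputing centers). The initialization by random sampling, the fixed seed, the iteration cap, and the sample points are all assumptions made for this illustration. Each pass over the data compares N objects with K centers, which is where the O(KN) cost per iteration comes from.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, iters=100, seed=0):
    """Return (centers, labels); labels come from the last assignment step."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct objects as initial centers
    labels = []
    for _ in range(iters):
        # Assignment step: nearest center for every object (K*N distances).
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, labels

# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
```

The result depends on the initial centers in general; on data this clearly separated the two groups are recovered, but that should not be expected on harder data.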
19
How many clusters?
open question
practical limits (it would be nice to have 3-6 clusters)
dendrogram based (large increase in cluster distances)
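The dendrogram-based heuristic can be sketched as follows: look at the sequence of merge distances and stop just before the largest jump. The function name `suggest_n_clusters` and the merge distances are hypothetical, and this is one simple reading of the heuristic rather than a definitive stopping rule.

```python
def suggest_n_clusters(merge_distances):
    """Suggest a cluster count from the merge distances of a hierarchy.

    merge_distances[i] is the distance at which the i-th merge happened,
    in merge order; N objects produce N - 1 merges.
    """
    n_objects = len(merge_distances) + 1
    # Increase in merge distance from each merge to the next one.
    jumps = [b - a for a, b in zip(merge_distances, merge_distances[1:])]
    i = max(range(len(jumps)), key=jumps.__getitem__)
    # Keep merges 0..i (that is, i + 1 merges) and stop before the big jump.
    return n_objects - (i + 1)

# Hypothetical merge distances for 6 objects.
heights = [1.0, 1.2, 1.5, 6.0, 6.5]
k = suggest_n_clusters(heights)
```

With these made-up heights the largest jump is between the third and fourth merge, so the suggestion is to stop after three merges, leaving three clusters.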
20
Validation
exogenous variables
indexes, e.g. the Davies-Bouldin measure
[Figure: example objects labeled age: 15, age: 20, age: 14; "likes Donald Duck" / "doesn’t like DD"]
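As an illustration of an index-based check, here is a minimal sketch of the Davies-Bouldin measure in its common form: for each cluster, take the worst ratio of summed within-cluster scatter to centroid separation against any other cluster, then average over clusters. Lower values indicate more compact, better separated clusters. The function names and the two toy partitions are assumptions for this example.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def davies_bouldin(clusters):
    """clusters: list of clusters, each a non-empty list of points; needs >= 2."""
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = [sum(dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k

# Hypothetical partitions: same cluster shapes, different separation.
far_apart = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
close = [[(0.0, 0.0), (0.0, 2.0)], [(3.0, 0.0), (3.0, 2.0)]]

db_far = davies_bouldin(far_apart)
db_close = davies_bouldin(close)
```

Both partitions have the same within-cluster scatter, so the index differs only through the distance between centroids, and the well-separated partition gets the lower value.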
21
Key issues
similarity or dissimilarity measure
...and data standardization