Published by Allyson Alison Carpenter. Modified over 9 years ago.
1
Applied Multivariate Statistics Cluster Analysis Fall 2015 Week 9
2
Cluster Analysis
Classification according to certain characteristics
Widely used technique
–Target marketing of groups
–Biological classification
–Classifying a number of observations into a smaller number of more manageable groups with minimal loss of information
3
Cluster Analysis
Used to identify groups or clusters of homogeneous individuals
Observations in each cluster are similar to each other: homogeneous within clusters
Observations from one cluster are different from those in other clusters: heterogeneous between clusters
4
Cluster Analysis - An Example.. 1
Income and Education are the clustering variables
[Figure: scatter plot of observations A to F; axes: Income and Education]
5
Cluster Analysis - An Example.. 2
Use squared Euclidean distances and the centroid to measure distance from a cluster
Similarity matrix based on squared distances, for example:
$(5-6)^2 + (5-6)^2 = 2$
$(15-5)^2 + (14-5)^2 = 181$
$(25-30)^2 + (20-19)^2 = 26$
6
Cluster Analysis - An Example.. 3
The pairs A-B and C-D are equally close together, so the first cluster could be formed by combining either pair. Choose A-B.
The centroid for this cluster is (5.5, 5.5). Use it to recalculate the similarity matrix.
Repeat the process, combining the next closest pair of observations or clusters.
7
Cluster Analysis - An Example.. 4
Agglomeration schedule and cluster solution
Step  Min. Dist²  Obs.         Clusters              No. of Clusters
0     -           -            (A)(B)(C)(D)(E)(F)    6
1     2           A-B          (A-B)(C)(D)(E)(F)     5
2     2           C-D          (A-B)(C-D)(E)(F)      4
3     26          E-F          (A-B)(C-D)(E-F)       3
4     169         (C-D-E-F)    (A-B)(C-D-E-F)        2
5     388         ALL          (A-B-C-D-E-F)         1
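The agglomeration steps of this example can be sketched in a few lines of Python. The slides do not list the raw data, so the coordinates below are an assumption inferred from the worked squared distances and the (5.5, 5.5) centroid; which point in each close pair carries which letter is also assumed. This is a minimal illustration of centroid clustering with squared Euclidean distances, not the software used to produce the slides.

```python
from itertools import combinations

# Coordinates are inferred from the slides' worked distances, not stated there;
# which point in each close pair is A vs. B (etc.) is an assumption.
POINTS = {"A": (5, 5), "B": (6, 6), "C": (15, 14),
          "D": (16, 15), "E": (25, 20), "F": (30, 19)}

def sq_dist(p, q):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(members):
    """Mean coordinates of the named observations."""
    coords = [POINTS[m] for m in members]
    return tuple(sum(axis) / len(coords) for axis in zip(*coords))

def centroid_agglomeration(labels):
    """Merge the two clusters with the closest centroids until one remains."""
    clusters = [(name,) for name in labels]
    schedule = []
    while len(clusters) > 1:
        # min() keeps the first of tied pairs, so A-B is merged before C-D.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: sq_dist(centroid(pair[0]), centroid(pair[1])))
        dist = sq_dist(centroid(a), centroid(b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        schedule.append((dist, a + b, len(clusters)))
    return schedule

for step, (dist, merged, n_clusters) in enumerate(centroid_agglomeration("ABCDEF"), 1):
    print(step, dist, "-".join(merged), n_clusters)
```

Under these assumed coordinates the merge distances come out as 2, 2, 26, 169 and 388.25, which agrees with the schedule once the last value is rounded to 388.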
8
Cluster Analysis - An Example.. 5
Graphical representation of the hierarchical clustering process
[Figure: dendrogram of observations A to F; axis: distance; merge steps numbered 1 to 5]
9
Cluster Analysis - An Example.. 6
Determining the ‘best’ number of clusters
Fairly subjective decision.
A rapid increase in the agglomeration index (Dist²) can be used as a guide.
For this example, there is a large increase between Step 3 (3 clusters) and Step 4 (2 clusters), from 26 to 169.
This suggests that 3 clusters are suitable for these observations.
The dendrogram also indicates 3 as a suitable number of clusters.
10
Stage 1.. The Problem
Objectives of Cluster Analysis
Taxonomical description
–Forming a taxonomy: an empirical classification
Data simplification
–Grouping similar observations to simplify the following analyses
Relationship identification
–Identifying relationships between observations
11
Stage 2.. Design the Analysis
Selection of the clustering variables
–The derived clusters reflect the inherent structure only as defined by the clustering variate
–Use theoretical, conceptual and practical considerations to select the clustering variate
Outliers
–Are they errors, or do they indicate that some groups are under-represented?
–Profile diagrams can be used to find them, but this is tedious
12
Stage 2.. Design the Analysis.. 2 Observation Profile
13
Stage 2.. Design the Analysis.. 3
Measures of similarity
–Correlation
–Distance (most common)
–Association (applicable with non-metric data)
[Figure: panels labelled Distance and Correlation]
14
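Stage 2.. Design the Analysis.. 3
Measures of Similarity - Distance
Euclidean distance: $d_{AB} = \sqrt{(A-O)^2 + (B-O)^2} = \sqrt{(X_{1A}-X_{1B})^2 + (X_{2A}-X_{2B})^2}$
Block distance: $d_{AB} = |A-O| + |B-O| = |X_{1A}-X_{1B}| + |X_{2A}-X_{2B}|$
General form: $d_{AB} = \left[\sum_{i=1}^{P} (X_{iA}-X_{iB})^{n}\right]^{1/k}$
[Figure: points A and B plotted against axes X1 and X2, with O the right-angle corner of the triangle A-O-B]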
17
P is the number of variables
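The distance measures named on this slide can be written out as a short sketch. The function names and the two sample observations are hypothetical and chosen only so the arithmetic is easy to check by hand; each function works for any number of variables P.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    """Summed squared differences, as used in the earlier worked example."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def block(a, b):
    """Block (city-block) distance: summed absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Two hypothetical observations measured on P = 2 variables.
obs_a, obs_b = (1.0, 2.0), (4.0, 6.0)
print(euclidean(obs_a, obs_b))          # 5.0
print(squared_euclidean(obs_a, obs_b))  # 25.0
print(block(obs_a, obs_b))              # 7.0
```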
18
Stage 2.. Design the Analysis.. 4
Standardizing the data
–Scaling alters the Euclidean distances and the relative importance of each characteristic (a time difference expressed in hours is numerically 60 times smaller than the same difference in minutes, so it carries far less weight in the distance)
–Whenever conceptually possible, variables should be standardized: expressed as the number of standard deviations from the mean
–Multicollinearity implicitly increases the weights of the multicollinear characteristics
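The effect of the measurement unit, and how standardization removes it, can be illustrated with a small sketch. The durations are made-up values, and the use of the sample standard deviation (dividing by n - 1) is an assumption; the slides do not say which form is intended.

```python
from statistics import mean, stdev

def z_scores(values):
    """Express each value as the number of sample s.d.'s from the mean."""
    centre, spread = mean(values), stdev(values)
    return [(v - centre) / spread for v in values]

# Hypothetical durations for four observations, recorded in two units.
minutes = [30.0, 60.0, 90.0, 240.0]
hours = [m / 60 for m in minutes]

# The raw difference between the first two observations depends on the unit.
print(abs(minutes[0] - minutes[1]), abs(hours[0] - hours[1]))  # 30.0 0.5

# The standardized values agree (up to rounding) whichever unit was recorded.
print(z_scores(minutes))
print(z_scores(hours))
```

After standardization, a distance computed on this variable no longer depends on whether time was recorded in minutes or hours.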
19
Standardizing the Data CASE-WISE STANDARDIZATION VARIABLE-WISE STANDARDIZATION
20
Stage 3.. Assumptions of Cluster Analysis
No important assumptions
It is mostly mathematical analysis
Statistical foundations are weak
21
Stage 4.. Deriving the Clusters
2 main clustering algorithms
–Hierarchical
–Non-hierarchical
Hierarchical algorithms (illustrated by the earlier example)
–Agglomerative or divisive procedures
–Several measures of the distance between clusters
22
Stage 4.. Deriving the Clusters Measuring the Distance Between Clusters
24
Centroid. Distance between the cluster centroids.
Single linkage or nearest neighbor. Minimum distance between members of the separate clusters.
Complete linkage or farthest neighbor. Maximum distance between members of the separate clusters.
Ward’s method. The within-cluster sum of squares is minimized over all clusters.
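The first three measures can be compared on a toy example. The sketch below assumes plain (unsquared) Euclidean distance between observations, and the two clusters are hypothetical points chosen so the results are easy to verify by hand.

```python
import math
from itertools import product

def centroid(cluster):
    """Mean coordinates of the points in a cluster."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def centroid_distance(c1, c2):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(c1), centroid(c2))

def single_linkage(c1, c2):
    """Nearest neighbor: smallest distance over all cross-cluster pairs."""
    return min(math.dist(p, q) for p, q in product(c1, c2))

def complete_linkage(c1, c2):
    """Farthest neighbor: largest distance over all cross-cluster pairs."""
    return max(math.dist(p, q) for p, q in product(c1, c2))

# Two hypothetical clusters of two observations each.
left = [(0, 0), (0, 2)]
right = [(3, 0), (5, 2)]
print(centroid_distance(left, right))  # 4.0
print(single_linkage(left, right))     # 3.0
print(complete_linkage(left, right))   # sqrt(29), about 5.39
```

The same pair of clusters is 3, 4 or about 5.39 units apart depending on the measure, which is why the choice of measure can change the order in which clusters are merged.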
25
Stage 4.. Measuring the Distance Between Clusters - Centroid
[Figure: two clusters with their centroids marked +]
26
Stage 4.. Measuring the Distance Between Clusters - Single Linkage
27
Stage 4. Measuring the Distance Between Clusters - Complete Linkage
28
Stage 4. Measuring the Distance Between Clusters - Ward’s Method
[Figure: two candidate groupings with within-cluster sums of squares $SS_1$, $SS_2$ and $SS_3$, $SS_4$]
$\min\{(SS_1 + SS_2),\ (SS_3 + SS_4)\}$
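One way to read the minimum on this slide: among the candidate merges, pick the one that leaves the smallest total within-cluster sum of squares. The sketch below is a brute-force illustration of that idea on hypothetical points; the function names are made up for this example, and practical implementations use an update formula rather than recomputing every sum.

```python
from itertools import combinations

def within_ss(cluster):
    """Sum of squared deviations of the points from their cluster centroid."""
    n = len(cluster)
    centre = [sum(axis) / n for axis in zip(*cluster)]
    return sum((x - c) ** 2 for point in cluster for x, c in zip(point, centre))

def ward_best_merge(clusters):
    """Indices of the two clusters whose merger gives the smallest total SS."""
    def total_ss_after(i, j):
        merged = clusters[i] + clusters[j]
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        return within_ss(merged) + sum(within_ss(c) for c in rest)
    return min(combinations(range(len(clusters)), 2),
               key=lambda pair: total_ss_after(*pair))

# Hypothetical current solution: one pair and two single observations.
clusters = [[(0, 0), (0, 2)], [(1, 1)], [(10, 10)]]
print(within_ss(clusters[0]))     # 2.0
print(ward_best_merge(clusters))  # (0, 1)
```

Here the nearby single observation joins the existing pair, because any merge involving the distant point at (10, 10) would inflate the total sum of squares.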
32
Stage 4.. Deriving the Clusters
Non-Hierarchical Clustering
Start by selecting ‘cluster seeds’ as cluster centres.
Sequential threshold. Cluster all observations within a specified distance of the seed, then add extra seeds.
Parallel threshold. Select several seeds and assign objects within the threshold distance to the closest seed.
Optimization. Allows observations to be moved to a cluster that has become closer.
Selection of cluster seeds alters the clusters obtained.
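The optimization procedure can be sketched as a loop that assigns each observation to its nearest centre, recomputes the centres, and repeats until no observation moves. This is a k-means-style illustration under assumptions: the points and seeds are hypothetical, Euclidean distance is assumed, and the slides do not specify this exact algorithm.

```python
import math

def assign(points, centres):
    """Index of the nearest centre for every point."""
    return [min(range(len(centres)), key=lambda k: math.dist(p, centres[k]))
            for p in points]

def update(points, labels, centres):
    """Move each centre to the mean of its members (unchanged if it has none)."""
    new_centres = []
    for k, old in enumerate(centres):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new_centres.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            new_centres.append(old)
    return new_centres

def optimise(points, seeds, max_iter=100):
    """Reassign observations until cluster membership stops changing."""
    centres = list(seeds)
    labels = assign(points, centres)
    for _ in range(max_iter):
        centres = update(points, labels, centres)
        new_labels = assign(points, centres)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centres

# Hypothetical data: two visible groups, with both seeds placed in the first.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
labels, centres = optimise(points, seeds=[(1, 1), (2, 1)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The second observation starts in cluster 1 because it coincides with that seed, then moves to cluster 0 once the centres have been recomputed, which is the reassignment the optimization procedure allows.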
33
Stage 4.. Deriving the Clusters Non-Hierarchical Clustering- Stage 1
34
Stage 4.. Deriving the Clusters Non-Hierarchical Clustering- Stage 2
[Figure: two points marked +]
35
Stage 4.. Deriving the Clusters
Choosing between algorithms
Problems with hierarchical methods
–Influenced by outliers
–Not amenable to analyzing very large samples (> 500)
Problems with non-hierarchical methods
–Solution depends on the choice of seeds
Perhaps a combination of methods gives the best result.
–Use a hierarchical method to find suitable seeds and then a non-hierarchical method
36
Stage 5. Interpreting the Clusters
Examine each cluster to assign a label describing the nature of the cluster.
Interpreting the clusters can confirm prior theories.
Can check a preconceived typology.
37
Stage 6. Validation
Ensure practical significance of clusters
Use profile analysis to examine the results
38
Summary
Cluster analysis is an art more than a science!
Different measures and different algorithms can affect the results.
Final selection of the clusters is based on both objective and subjective considerations.
39
LIFE STYLE SEGMENTATION AN APPLICATION
40
VARIABLES
41
DESCRIPTIVES
44
SEGMENT PROFILES NO STANDARDIZATION
46
EXAMPLE
47
1-Very suitable 2-Suitable 3-Not suitable 4-Not at all suitable