Applied Multivariate Statistics Cluster Analysis Fall 2015 Week 9.

Cluster Analysis
Classification according to certain characteristics
Widely used technique
–Target marketing of groups
–Biological classification
–Classifying a number of observations into a smaller number of more manageable groups with minimal loss of information

Cluster Analysis
Used to identify groups or clusters of homogeneous individuals
–Observations in each cluster are similar to each other: homogeneous within clusters
–Observations from one cluster are different from those in other clusters: heterogeneous between clusters

Cluster Analysis - An Example.. 1
Income and Education are the clustering variables
[Figure: scatter plot of observations A to F, with Income and Education as the axes]

Cluster Analysis - An Example.. 2
Use squared Euclidean distances and the centroid to measure distance from a cluster
Similarity matrix based on squared distances, with entries such as
$(5-6)^2 + (5-6)^2 = 2$
$(15-5)^2 + (14-5)^2 = 181$
$(25-30)^2 + (20-19)^2 = 26$
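
The squared distances on this slide can be reproduced with a short sketch. The coordinates are an assumption read off the worked terms and the agglomeration distances in this example: the order of the points within each pair, and the point D = (16, 15) in particular, are inferred rather than stated on the slides.

```python
from itertools import combinations

# Assumed coordinates, inferred from the worked terms on the slides.
# D = (16, 15) is not given anywhere; it is the value consistent with the
# agglomeration distances reported for this example.
points = {
    "A": (5, 5), "B": (6, 6), "C": (15, 14),
    "D": (16, 15), "E": (25, 20), "F": (30, 19),
}

def squared_euclidean(p, q):
    """Sum of squared coordinate differences between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Upper triangle of the matrix of squared distances.
dist2 = {
    (i, j): squared_euclidean(points[i], points[j])
    for i, j in combinations(points, 2)
}

for (i, j), d in dist2.items():
    print(f"{i}-{j}: {d}")
```

Only the upper triangle is stored because the matrix is symmetric and the diagonal is zero.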

Cluster Analysis - An Example.. 3
The pairs A-B and C-D are equally close together, so the 1st cluster could be formed by combining either pair. Choose A-B.
The centroid for this cluster is (5.5, 5.5). Use it to recalculate the similarity matrix.
Repeat the process, combining the next closest pair of observations or clusters.

Cluster Analysis - An Example.. 4
Agglomeration schedule and cluster solution
Step   Min. Dist²   Obs. combined   Clusters              No. of Clusters
0      -            -               (A)(B)(C)(D)(E)(F)    6
1      2            A-B             (A-B)(C)(D)(E)(F)     5
2      2            C-D             (A-B)(C-D)(E)(F)      4
3      26           E-F             (A-B)(C-D)(E-F)       3
4      169          (C-D-E-F)       (A-B)(C-D-E-F)        2
5      388          ALL             (A-B-C-D-E-F)         1
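
A minimal sketch of centroid agglomeration with squared Euclidean distances, offered as one way to reproduce the schedule for this example. The coordinates are assumed (D = (16, 15) is inferred, not stated on the slides), and ties are broken by taking the first pair found, which matches the choice of A-B over C-D. Under these assumptions the last merge comes out as 388.25, which the slide appears to report rounded to 388.

```python
# Assumed coordinates; D is inferred from the reported distances.
points = {
    "A": (5, 5), "B": (6, 6), "C": (15, 14),
    "D": (16, 15), "E": (25, 20), "F": (30, 19),
}

def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(members):
    """Mean of the member coordinates in each dimension."""
    coords = [points[m] for m in members]
    return tuple(sum(c) / len(coords) for c in zip(*coords))

def centroid_agglomeration(names):
    """Return a list of (distance, merged members, clusters remaining)."""
    clusters = [(n,) for n in names]
    schedule = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = squared_euclidean(centroid(clusters[i]), centroid(clusters[j]))
                # Strict "<" keeps the first pair found when distances tie.
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        schedule.append((d, merged, len(clusters)))
    return schedule

schedule = centroid_agglomeration(list(points))
for step, (d, merged, remaining) in enumerate(schedule, start=1):
    print(step, d, "-".join(merged), remaining)
```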

Cluster Analysis - An Example.. 5
Graphical representation of the hierarchical clustering process: the dendrogram
[Figure: dendrogram of observations A to F against Distance, with the merges numbered 1 to 5]

Cluster Analysis - An Example.. 6
Determining the ‘best’ number of clusters is a fairly subjective decision.
A rapid increase in the agglomeration index (Dist²) can be used as a guide.
For this example, there’s a large increase between Step 3 (3 clusters) and Step 4 (2 clusters), from 26 to 169.
This suggests 3 clusters are suitable for these observations. The dendrogram also indicates 3 as a suitable number of clusters.
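
As a rough aid, the step-to-step increases in Dist² can be tabulated. This sketch only lists the jumps, using the distances from the agglomeration table; it does not pick a number of clusters. The final merge shows an even larger absolute jump than the one between Steps 3 and 4, which is one reason the choice remains a judgment call.

```python
# Agglomeration distances (Dist²) for steps 1 to 5, as reported in the example.
dist2 = [2, 2, 26, 169, 388]
n_obs = 6

rows = []
for s in range(1, len(dist2)):
    increase = dist2[s] - dist2[s - 1]      # jump from step s to step s + 1
    clusters_before_merge = n_obs - s       # clusters kept if we stop after step s
    rows.append((s, s + 1, increase, clusters_before_merge))

for from_step, to_step, increase, k in rows:
    print(f"step {from_step} -> {to_step}: Dist² rises by {increase} "
          f"(stopping before this merge keeps {k} clusters)")
```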

Stage 1.. The Problem
Objectives of Cluster Analysis
Taxonomical description
–Forming a taxonomy, an empirical classification
Data simplification
–Grouping similar observations to simplify the following analyses
Relationship identification
–Identifying relationships between observations

Stage 2.. Design the Analysis
Selection of the clustering variables
–The derived clusters reflect the inherent structure of the data only as defined by the clustering variate
–Use theoretical, conceptual and practical considerations to select the clustering variate
Outliers
–Are they errors, or do they come from under-represented groups?
–Can be detected with profile diagrams, although this is tedious

Stage 2.. Design the Analysis.. 2 Observation Profile

Stage 2.. Design the Analysis.. 3
Measures of similarity
–Correlation
–Distance (most common)
–Association (applicable with non-metric data)
[Figure: Distance and Correlation]

Stage 2.. Design the Analysis.. 3
Measures of Similarity - Distance
[Figure: points A, B and O plotted against axes X1 and X2]
Euclidean Distance = $\sqrt{(A-O)^2 + (B-O)^2} = \sqrt{(X_{1A}-X_{1B})^2 + (X_{2A}-X_{2B})^2}$
Block Distance = $|A-O| + |B-O| = |X_{1A}-X_{1B}| + |X_{2A}-X_{2B}|$
General form: $\left( \sum_{i=1}^{P} |X_{iA}-X_{iB}|^{n} \right)^{1/k}$, where P is the number of variables
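
The distance measures on this slide can be sketched as small functions. Reading the general form as "raise each absolute difference to the power n, sum over the P variables, then take the k-th root" is an assumption about the garbled formula; the function name `power_distance` is hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def block(a, b):
    """Block (city-block) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def power_distance(a, b, n, k):
    """Assumed general form: (sum of |x - y|^n) ** (1 / k).

    n = k = 2 gives the Euclidean distance, n = k = 1 the block distance.
    """
    return sum(abs(x - y) ** n for x, y in zip(a, b)) ** (1 / k)

# Two observations measured on P = 2 variables (illustrative values).
obs_a = (15, 14)
obs_b = (5, 5)

print(euclidean(obs_a, obs_b))
print(block(obs_a, obs_b))
print(power_distance(obs_a, obs_b, n=2, k=2))
```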

Stage 2.. Design the Analysis.. 4
Standardizing the data
–Scaling alters the Euclidean distances and the relative importance of each characteristic (a time difference measured in hours is 60 times smaller than the same difference measured in minutes, so the variable carries far less weight)
–Whenever conceptually possible, variables should be standardized, i.e. expressed as the no. of s.d.’s from the mean
–Multicollinearity implicitly increases the weights of the multicollinear characteristics
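
A small sketch of why standardizing removes the effect of measurement units. The task times are hypothetical, and the sample standard deviation is an arbitrary choice here; the population version would work equally well.

```python
import statistics

def z_scores(values):
    """Express each value as the number of s.d.'s from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (a choice)
    return [(v - mean) / sd for v in values]

# Hypothetical times for four observations, in hours and in minutes.
hours = [1.0, 2.0, 4.0, 5.0]
minutes = [h * 60 for h in hours]

# The raw gap between the first and last observation depends on the unit...
gap_hours = abs(hours[0] - hours[3])
gap_minutes = abs(minutes[0] - minutes[3])

# ...but the standardized values do not.
z_hours = z_scores(hours)
z_minutes = z_scores(minutes)

print(gap_hours, gap_minutes)
print(z_hours)
print(z_minutes)
```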

Standardizing the Data CASE-WISE STANDARDIZATION VARIABLE-WISE STANDARDIZATION
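
The two directions of standardization can be contrasted on a small matrix. This is a sketch on hypothetical ratings, assuming rows are cases (respondents) and columns are variables: variable-wise standardization works down each column, case-wise standardization works across each row.

```python
import statistics

def standardize(values):
    """Centre on the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical ratings: rows are cases, columns are variables.
data = [
    [1, 2, 3],
    [2, 4, 6],
    [5, 3, 1],
]

# Variable-wise: standardize each column, then restore the row layout.
columns = [standardize(list(col)) for col in zip(*data)]
variable_wise = [list(row) for row in zip(*columns)]

# Case-wise: standardize each row across its own variables.
case_wise = [standardize(row) for row in data]

for row in variable_wise:
    print(row)
for row in case_wise:
    print(row)
```

In this example the first two cases have the same response pattern at different levels, so they become identical after case-wise standardization but stay distinct after variable-wise standardization.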

Stage 3.. Assumptions of Cluster Analysis
–No important assumptions
–It is mostly mathematical analysis
–Statistical foundations are weak

Stage 4.. Deriving the Clusters
2 main clustering algorithms
–Hierarchical
–Non-hierarchical
Hierarchical algorithms (illustrated by the earlier example)
–Agglomerative or divisive procedures
–Several measures of the distance between clusters

Stage 4.. Deriving the Clusters Measuring the Distance Between Clusters

Centroid. Distance between the cluster centroids.
Single linkage or nearest neighbor. Minimum distance between members of the separate clusters.
Complete linkage or farthest neighbor. Maximum distance between members of the separate clusters.
Ward’s method. Clusters are combined so that the within-cluster sum of squares, summed over all clusters, is minimized.
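
The first three measures can be compared on two small clusters. This is a sketch using squared Euclidean distance, as in the earlier example; the coordinates are assumed, with (16, 15) an inferred value rather than one stated on the slides.

```python
from itertools import product

def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    """Mean of the member coordinates in each dimension."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def centroid_distance(c1, c2):
    """Distance between the two cluster centroids."""
    return squared_euclidean(centroid(c1), centroid(c2))

def single_linkage(c1, c2):
    """Nearest neighbor: smallest distance over all cross-cluster pairs."""
    return min(squared_euclidean(p, q) for p, q in product(c1, c2))

def complete_linkage(c1, c2):
    """Farthest neighbor: largest distance over all cross-cluster pairs."""
    return max(squared_euclidean(p, q) for p, q in product(c1, c2))

# Two clusters of two observations each (assumed coordinates).
cluster_ab = [(5, 5), (6, 6)]
cluster_cd = [(15, 14), (16, 15)]

print(centroid_distance(cluster_ab, cluster_cd))
print(single_linkage(cluster_ab, cluster_cd))
print(complete_linkage(cluster_ab, cluster_cd))
```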

Stage 4.. Measuring the Distance Between Clusters - Centroid

Stage 4.. Measuring the Distance Between Clusters - Single Linkage

Stage 4. Measuring the Distance Between Clusters - Complete Linkage

Stage 4. Measuring the Distance Between Clusters - Ward’s Method
Within-cluster sums of squares of the candidate solutions: $SS_1$, $SS_2$, $SS_3$, $SS_4$
$\min\{(SS_1 + SS_2),\ (SS_3 + SS_4)\}$
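
One step of Ward’s method can be sketched as follows: try every possible merge, compute the total within-cluster sum of squares of the resulting solution, and keep the merge with the smallest total. The coordinates are assumed example values (the fourth point is inferred), and the function names are hypothetical.

```python
from itertools import combinations

def within_ss(cluster):
    """Sum of squared deviations of the members from the cluster centroid."""
    dims = list(zip(*cluster))
    means = [sum(d) / len(d) for d in dims]
    return sum((x - m) ** 2 for d, m in zip(dims, means) for x in d)

def ward_best_merge(clusters):
    """Return (total within-cluster SS, i, j) for the cheapest merge."""
    best = None
    for i, j in combinations(range(len(clusters)), 2):
        merged = clusters[i] + clusters[j]
        others = [c for k, c in enumerate(clusters) if k not in (i, j)]
        total = within_ss(merged) + sum(within_ss(c) for c in others)
        # Strict "<" keeps the first pair found when totals tie.
        if best is None or total < best[0]:
            best = (total, i, j)
    return best

# Six observations, each starting as its own cluster (assumed coordinates).
clusters = [[(5, 5)], [(6, 6)], [(15, 14)], [(16, 15)], [(25, 20)], [(30, 19)]]
total, i, j = ward_best_merge(clusters)
print(total, i, j)
```

Repeating this step on the updated list of clusters until one cluster remains gives the full hierarchy.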

Stage 4.. Deriving the Clusters
Non-Hierarchical Clustering
Start by selecting ‘cluster seeds’ as cluster centres
–Sequential threshold. Cluster all observations within a specified distance of the seed. Then add extra seeds.
–Parallel threshold. Select several seeds and assign objects within the threshold distance to the closest seed.
–Optimization. Allows observations to be moved to a cluster that has become closer.
Selection of cluster seeds alters the clusters obtained
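
A minimal sketch of the optimization idea, written as k-means-style reassignment. The slides do not name a specific algorithm, so this is one common choice rather than the method used in the course. The coordinates are assumed example values and the function names are hypothetical. Running it from two different sets of seeds illustrates that the seeds alter the clusters obtained.

```python
def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centres):
    """Index of the closest centre for every point."""
    return [
        min(range(len(centres)), key=lambda c: squared_euclidean(p, centres[c]))
        for p in points
    ]

def update(points, labels, centres):
    """Move each centre to the mean of its members (unchanged if it has none)."""
    new_centres = []
    for c, old in enumerate(centres):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            new_centres.append(tuple(sum(v) / len(members) for v in zip(*members)))
        else:
            new_centres.append(old)
    return new_centres

def optimise(points, seeds, max_iter=100):
    """Alternate assignment and centre updates until the labels stop changing."""
    centres = [tuple(s) for s in seeds]
    labels = assign(points, centres)
    for _ in range(max_iter):
        centres = update(points, labels, centres)
        new_labels = assign(points, centres)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centres

# Assumed example coordinates for six observations.
points = [(5, 5), (6, 6), (15, 14), (16, 15), (25, 20), (30, 19)]

# Seeds spread across the data versus seeds bunched at one end.
labels_spread, _ = optimise(points, seeds=[points[0], points[2], points[4]])
labels_bunched, _ = optimise(points, seeds=[points[0], points[1], points[2]])
print(labels_spread)
print(labels_bunched)
```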

Stage 4.. Deriving the Clusters Non-Hierarchical Clustering- Stage 1

Stage 4.. Deriving the Clusters Non-Hierarchical Clustering- Stage 2

Stage 4.. Deriving the clusters
Choosing between algorithms
Problems with hierarchical methods
–Influenced by outliers
–Not amenable to analyzing very large samples (> 500)
Problems with non-hierarchical methods
–Solution depends on the choice of seeds
Perhaps a combination of methods gives the best result
–Use a hierarchical method to find suitable seeds and then a non-hierarchical method

Stage 5. Interpreting the Clusters
–Examine each cluster to assign a label describing the nature of the cluster
–Interpreting the clusters can confirm prior theories
–Can check a preconceived typology

Stage 6. Validation
–Ensure practical significance of clusters
–Use profile analysis to examine the results

Summary
–Cluster analysis is an art more than a science!
–Different measures and different algorithms can affect the results
–Final selection of the clusters is based on both objective and subjective considerations

LIFE STYLE SEGMENTATION AN APPLICATION

VARIABLES

DESCRIPTIVES

SEGMENT PROFILES NO STANDARDIZATION

EXAMPLE

1-Very suitable 2-Suitable 3-Not suitable 4-Not suitable at all
