Applied Multivariate Statistics Cluster Analysis Fall 2015 Week 9.

1 Applied Multivariate Statistics Cluster Analysis Fall 2015 Week 9

2 Cluster Analysis
Classification according to certain characteristics.
Widely used technique:
– Target marketing of groups
– Biological classification
– Classifying a number of observations into a smaller number of more manageable groups with minimal loss of information

3 Cluster Analysis
Used to identify groups or clusters of homogeneous individuals.
– Observations in each cluster are similar to each other: homogeneous within clusters.
– Observations from one cluster are different from those from other clusters: heterogeneous between clusters.

4 Cluster Analysis - An Example.. 1
Income and Education are the clustering variables.
[Figure: scatter plot of observations A, B, C, D, E and F; axes: Income and Education]

5 Cluster Analysis - An Example.. 2
Use squared Euclidean distances, with the centroid representing each cluster, to measure the distance from a cluster.
Similarity matrix based on squared distances. Sample entries:
$(5-6)^2 + (5-6)^2 = 2$
$(15-5)^2 + (14-5)^2 = 181$
$(25-30)^2 + (20-19)^2 = 26$
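
The sample entries above can be reproduced with a few lines of Python. This is a minimal sketch: the six coordinates are not listed on the slide and are inferred here from its arithmetic and from the centroid (5.5, 5.5) quoted on the next slide, so treat them (and the order of the two variables) as assumptions.

```python
# Coordinates are ASSUMED: inferred from the arithmetic on the slide,
# not stated in the transcript. Variable order is also an assumption.
points = {
    "A": (5, 5), "B": (6, 6),
    "C": (15, 14), "D": (16, 15),
    "E": (25, 20), "F": (30, 19),
}

def sq_euclidean(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Full symmetric matrix of squared distances as a nested dict.
matrix = {i: {j: sq_euclidean(points[i], points[j]) for j in points}
          for i in points}

for name in points:
    print(name, [matrix[name][other] for other in points])
```

Under these assumed coordinates, the A-B, C-A and E-F entries come out as 2, 181 and 26, matching the three expressions on the slide.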

6 Cluster Analysis - An Example.. 3
– The pairs A-B and C-D are equally close, so the 1st cluster could be formed by combining either pair. Choose A-B.
– The centroid for this cluster is (5.5, 5.5). Use it to recalculate the similarity matrix.
– Repeat the process, combining the next closest pair of observations or clusters.

7 Cluster Analysis - An Example.. 4
Agglomeration schedule and cluster solution:
Step  Min. Dist²  Obs. merged  Clusters              No. of Clusters
0     -           -            (A)(B)(C)(D)(E)(F)    6
1     2           A-B          (A-B)(C)(D)(E)(F)     5
2     2           C-D          (A-B)(C-D)(E)(F)      4
3     26          E-F          (A-B)(C-D)(E-F)       3
4     169         (C-D-E-F)    (A-B)(C-D-E-F)        2
5     388         ALL          (A-B-C-D-E-F)         1
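
The schedule can be re-derived with a short centroid-method sketch. The coordinates below are assumptions inferred from the example's arithmetic (they are not given in the transcript), and ties are broken by taking the first closest pair found, which happens to reproduce the slide's choice of A-B before C-D.

```python
from itertools import combinations

# ASSUMED coordinates, inferred from the worked example's arithmetic.
points = {
    "A": (5, 5), "B": (6, 6),
    "C": (15, 14), "D": (16, 15),
    "E": (25, 20), "F": (30, 19),
}

def sq_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(members):
    """Mean point of a cluster given as a tuple of observation names."""
    coords = [points[m] for m in members]
    return tuple(sum(axis) / len(coords) for axis in zip(*coords))

def agglomerate():
    """Merge the two clusters with the closest centroids until one remains."""
    clusters = [(name,) for name in points]
    schedule = []
    while len(clusters) > 1:
        # min() keeps the first pair on ties
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: sq_euclidean(centroid(pair[0]), centroid(pair[1])),
        )
        dist = sq_euclidean(centroid(a), centroid(b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        schedule.append((dist, a + b, len(clusters)))
    return schedule

for step, (dist, merged, remaining) in enumerate(agglomerate(), start=1):
    print(step, dist, "-".join(merged), remaining)
```

With these coordinates the merge distances are 2, 2, 26, 169 and 388.25; the last value appears as 388 in the schedule.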

8 Cluster Analysis - An Example.. 5
Graphical representation of the hierarchical clustering process.
[Figure: dendrogram; axis: Distance; leaves A, B, C, D, E, F; merges numbered 1 to 5]

9 Cluster Analysis - An Example.. 6
Determining the ‘best’ number of clusters.
– Fairly subjective decision.
– Can use a rapid increase in the agglomeration index (Dist²) as a guide.
– For this example, there’s a large increase between Steps 3 (3 clusters) and 4 (2 clusters). This suggests 3 clusters are suitable for these observations.
– The dendrogram also indicates 3 as a suitable number of clusters.
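
As a rough aid to this reading, the increases in the agglomeration index can be tabulated step by step. The sketch below only lists the increases, using the squared distances from the example's schedule; it does not pick a number of clusters, since that remains a judgement call. The function name and output layout are illustrative.

```python
# (step, squared distance at the merge, clusters remaining after the merge),
# copied from the worked example's agglomeration schedule.
schedule = [(1, 2, 5), (2, 2, 4), (3, 26, 3), (4, 169, 2), (5, 388, 1)]

def distance_increases(schedule):
    """For consecutive steps, report the rise in merge distance and how many
    clusters would be kept if merging stopped just before the later step."""
    rows = []
    for (_, d_prev, k_prev), (step, d_next, _) in zip(schedule, schedule[1:]):
        rows.append({
            "before_step": step,
            "clusters_kept": k_prev,
            "increase": d_next - d_prev,
        })
    return rows

for row in distance_increases(schedule):
    print(row)
```

The increase going into Step 4 is 143, against 24 going into Step 3, which is the jump the slide points to. The final merge into a single cluster costs more again (219), one reason the decision cannot be fully automated.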

10 Stage 1.. The Problem
Objectives of Cluster Analysis
Taxonomical description.
– Forming a taxonomy: an empirical classification
Data simplification.
– Grouping similar observations to simplify the following analyses
Relationship identification.
– Identifying relationships between observations

11 Stage 2.. Design the Analysis
Selection of the clustering variables
– The derived clusters reflect the inherent structure only as defined by the clustering variate
– Use theoretical, conceptual and practical considerations to select the clustering variate
Outliers
– Are they errors, or do they indicate that some groups are under-represented?
– Can use profile diagrams to spot them, although this is tedious.

12 Stage 2.. Design the Analysis.. 2 Observation Profile

13 Stage 2.. Design the Analysis.. 3
Measures of similarity.
– Correlation
– Distance (most common)
– Association (applicable with non-metric data)
[Figure panels: Distance; Correlation]

14 Stage 2.. Design the Analysis.. 3
Measures of Similarity - Distance
[Figure: points A, O and B plotted on axes X1 and X2, with coordinates X1A, X1B, X2A and X2B marked]
Euclidean Distance = $\sqrt{(A-O)^2 + (B-O)^2} = \sqrt{(X_{1A}-X_{1B})^2 + (X_{2A}-X_{2B})^2}$
Block Distance = $|A-O| + |B-O| = |X_{1A}-X_{1B}| + |X_{2A}-X_{2B}|$
General form: $\sqrt[k]{\sum_{i=1}^{P} (X_{iA}-X_{iB})^{n}}$
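
The two named distances are easy to compare side by side. The sketch below is a plain-Python illustration for points with any number of variables; the function names and the sample points are illustrative, not taken from the slides.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def block(a, b):
    """Block (city-block) distance: sum of absolute differences per variable."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical observations A and B on two variables (X1, X2).
A = (1, 2)
B = (4, 6)
print(euclidean(A, B), block(A, B))
```

For any pair of points the block distance is at least as large as the Euclidean distance, because it travels along the two legs of the triangle rather than the hypotenuse.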

17 P is the number of variables

18 Stage 2.. Design the Analysis.. 4
Standardizing the data
– Scaling alters the Euclidean distances and the relative importance of each characteristic (time measured in hours produces differences 60 times smaller, and so is far less influential, than time measured in minutes)
– Whenever conceptually possible, variables should be standardized: expressed as the number of standard deviations from the mean
– Multicollinearity implicitly increases the weights of the multicollinear characteristics

19 Standardizing the Data
[Figure panels: case-wise standardization; variable-wise standardization]
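
A small sketch may help separate the two kinds of standardization. It assumes that variable-wise means z-scoring each column across cases and case-wise means z-scoring each row across its own variables, and it uses the sample standard deviation; the data are hypothetical.

```python
from statistics import mean, stdev

def zscores(values):
    """Express each value as the number of sample s.d.'s from the mean."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def standardize_variables(rows):
    """Variable-wise: z-score each column across all cases."""
    cols = [zscores(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def standardize_cases(rows):
    """Case-wise: z-score each row across its own variables."""
    return [zscores(row) for row in rows]

# Hypothetical data: (time, rating) for three cases, with time recorded
# once in minutes and once in hours.
rows_minutes = [(60, 3), (120, 5), (180, 4)]
rows_hours = [(1, 3), (2, 5), (3, 4)]

print(standardize_variables(rows_minutes))
print(standardize_variables(rows_hours))
```

After variable-wise standardization the two versions of the data coincide, so the unit of measurement no longer affects the distances. Case-wise standardization needs at least two variables per case and fails for a case whose values are all equal, since its standard deviation is zero.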

20 Stage 3.. Assumptions of Cluster Analysis
– No important assumptions
– It is mostly mathematical analysis
– Statistical foundations are weak

21 Stage 4.. Deriving the Clusters
2 main clustering algorithms
– Hierarchical
– Non-hierarchical
Hierarchical algorithms. Illustrated by the earlier example
– Agglomerative or divisive procedures
– Several measures of the distance between clusters

22 Stage 4.. Deriving the Clusters Measuring the Distance Between Clusters

24
– Centroid. Distance between the cluster centroids.
– Single linkage or nearest neighbor. Minimum distance between members of the separate clusters.
– Complete linkage or farthest neighbor. Maximum distance between members of the separate clusters.
– Ward’s method. Clusters are merged so that the within-cluster sum of squares, totalled over all clusters, is minimized.
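
The first three measures differ only in which pair of points they compare. The sketch below computes each for two small clusters, using squared Euclidean distance as in the earlier worked example; the function names and the sample coordinates are illustrative assumptions.

```python
from itertools import product

def sq_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    """Mean point of a cluster given as a list of coordinate tuples."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def centroid_distance(c1, c2):
    """Distance between the two cluster centroids."""
    return sq_euclidean(centroid(c1), centroid(c2))

def single_linkage(c1, c2):
    """Nearest neighbor: smallest distance over all cross-cluster pairs."""
    return min(sq_euclidean(p, q) for p, q in product(c1, c2))

def complete_linkage(c1, c2):
    """Farthest neighbor: largest distance over all cross-cluster pairs."""
    return max(sq_euclidean(p, q) for p, q in product(c1, c2))

# Hypothetical clusters of two observations each.
cluster_1 = [(5, 5), (6, 6)]
cluster_2 = [(15, 14), (16, 15)]

print(single_linkage(cluster_1, cluster_2),
      centroid_distance(cluster_1, cluster_2),
      complete_linkage(cluster_1, cluster_2))
```

For these two clusters the single-linkage value is the smallest and the complete-linkage value the largest, with the centroid distance in between.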

25 Stage 4.. Measuring the Distance Between Clusters - Centroid
[Figure: two clusters with their centroids marked +]

26 Stage 4.. Measuring the Distance Between Clusters - Single Linkage

27 Stage 4. Measuring the Distance Between Clusters - Complete Linkage

28 Stage 4. Measuring the Distance Between Clusters - Ward’s Method
[Figure: candidate groupings with within-cluster sums of squares $SS_1$, $SS_2$, $SS_3$, $SS_4$]
$\min\{(SS_1 + SS_2),\ (SS_3 + SS_4)\}$
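
One way to read this criterion: try every possible merge, total the within-cluster sums of squares of the resulting grouping, and keep the merge with the smallest total. The sketch below does exactly that for a single merge step; it is an illustrative brute-force version with hypothetical data, not an efficient implementation.

```python
from itertools import combinations

def centroid(cluster):
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def within_ss(cluster):
    """Sum of squared distances from each member to the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)

def ward_merge(clusters):
    """Return (total SS, i, j) for the merge of clusters i and j that gives
    the smallest total within-cluster sum of squares; first pair wins ties."""
    best = None
    for i, j in combinations(range(len(clusters)), 2):
        merged = clusters[i] + clusters[j]
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        total = within_ss(merged) + sum(within_ss(c) for c in rest)
        if best is None or total < best[0]:
            best = (total, i, j)
    return best

# Hypothetical starting point: four singleton clusters.
clusters = [[(5, 5)], [(6, 6)], [(15, 14)], [(16, 15)]]
print(ward_merge(clusters))
```

Singleton clusters have zero within-cluster sum of squares, so at the first step the criterion simply favours the closest pair.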

32 Stage 4.. Deriving the Clusters
Non-Hierarchical Clustering
Start by selecting ‘cluster seeds’ as cluster centres.
– Sequential threshold. Cluster all observations within a specified distance of the seed. Then add extra seeds.
– Parallel threshold. Select several seeds and assign objects within the threshold distance to the closest seed.
– Optimization. Allows observations to be moved to a cluster that has become closer.
Selection of cluster seeds alters the clusters obtained.
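
Two of these procedures can be sketched briefly. In the code below, the sequential threshold takes the first unassigned observation as the next seed (an assumption, since the slide does not say how seeds are chosen), and the optimization step is written as a k-means-style loop that reassigns observations to the nearest centre and then recomputes the centres (also an assumption about what the slide intends). The coordinates are inferred from the earlier worked example rather than quoted from it.

```python
# ASSUMED coordinates, inferred from the earlier worked example.
points = {
    "A": (5, 5), "B": (6, 6),
    "C": (15, 14), "D": (16, 15),
    "E": (25, 20), "F": (30, 19),
}

def sq_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sequential_threshold(points, threshold_sq):
    """One seed at a time: the first unassigned point becomes the seed and
    collects every unassigned point within the squared threshold distance."""
    unassigned = dict(points)
    clusters = []
    while unassigned:
        seed = next(iter(unassigned.values()))
        members = [n for n, p in unassigned.items()
                   if sq_euclidean(seed, p) <= threshold_sq]
        clusters.append(members)
        for n in members:
            del unassigned[n]
    return clusters

def optimize(points, seeds, max_iter=100):
    """k-means-style optimization: assign to nearest centre, recompute centres,
    and repeat until no observation changes cluster."""
    centres = list(seeds)
    assignment = None
    for _ in range(max_iter):
        new_assignment = {
            n: min(range(len(centres)),
                   key=lambda k: sq_euclidean(p, centres[k]))
            for n, p in points.items()
        }
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for k in range(len(centres)):
            members = [points[n] for n, c in assignment.items() if c == k]
            if members:  # keep the old centre if a cluster becomes empty
                centres[k] = tuple(sum(a) / len(members) for a in zip(*members))
    return assignment, centres

print(sequential_threshold(points, 30))
print(optimize(points, [points["A"], points["C"]])[0])
print(optimize(points, [points["A"], points["F"]])[0])
```

With two seeds, starting from A and C groups the data as (A-B) and (C-D-E-F), while starting from A and F gives (A-B-C) and (D-E-F): the same data and method, but different seeds and different clusters.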

33 Stage 4.. Deriving the Clusters Non-Hierarchical Clustering- Stage 1

34 Stage 4.. Deriving the Clusters Non-Hierarchical Clustering- Stage 2
[Figure: cluster centres marked +]

35 Stage 4.. Deriving the Clusters
Choosing between algorithms
Problems with hierarchical methods
– Influenced by outliers
– Not amenable to analyzing very large samples (> 500 observations)
Problems with non-hierarchical methods
– Solution depends on the choice of seeds
Perhaps a combination of methods gives the best result.
– Use a hierarchical method to find suitable seeds and then a non-hierarchical method
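
The combined approach can be sketched in two stages: a centroid-method agglomeration stopped at k clusters supplies the seeds, and a k-means-style reassignment loop then refines the solution. Both stages are illustrative choices (the slide does not name specific algorithms), and the coordinates are assumptions inferred from the earlier worked example.

```python
from itertools import combinations

# ASSUMED coordinates, inferred from the earlier worked example.
points = {
    "A": (5, 5), "B": (6, 6),
    "C": (15, 14), "D": (16, 15),
    "E": (25, 20), "F": (30, 19),
}

def sq_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def hierarchical_seeds(coords, k):
    """Stage 1: merge closest centroids until k clusters remain,
    then return the k centroids as seeds."""
    clusters = [[p] for p in coords]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: sq_euclidean(centroid(clusters[ij[0]]),
                                        centroid(clusters[ij[1]])),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return [centroid(c) for c in clusters]

def refine(points, centres, max_iter=100):
    """Stage 2: reassign to the nearest centre and update centres
    until the assignment stops changing."""
    centres = list(centres)
    labels = None
    for _ in range(max_iter):
        new_labels = {
            name: min(range(len(centres)),
                      key=lambda c: sq_euclidean(p, centres[c]))
            for name, p in points.items()
        }
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centres)):
            members = [points[n] for n, lab in labels.items() if lab == c]
            if members:
                centres[c] = centroid(members)
    return labels

seeds = hierarchical_seeds(list(points.values()), k=3)
print(seeds)
print(refine(points, seeds))
```

On this small data set the refinement changes nothing, because the hierarchical solution is already stable; on larger samples the second stage is where observations can move between clusters.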

36 Stage 5. Interpreting the Clusters
– Examine each cluster to assign a label describing the nature of the cluster.
– Interpreting the clusters can confirm prior theories.
– Can check a preconceived typology.

37 Stage 6. Validation Ensure practical significance of clusters Use profile analysis to examine the results
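
A cluster profile is, at its simplest, the mean of each clustering variable within each cluster. The sketch below computes such profiles for the three-cluster solution of the earlier worked example; the coordinates and the assignment of the two columns to Income and Education are assumptions, not values quoted from the slides.

```python
from statistics import mean

variables = ("Income", "Education")  # ASSUMED column order

# ASSUMED coordinates, inferred from the earlier worked example.
data = {
    "A": (5, 5), "B": (6, 6),
    "C": (15, 14), "D": (16, 15),
    "E": (25, 20), "F": (30, 19),
}

# Three-cluster solution from the worked example: (A-B)(C-D)(E-F).
labels = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3, "F": 3}

def cluster_profiles(data, labels):
    """Mean of every variable within each cluster."""
    profiles = {}
    for cluster in sorted(set(labels.values())):
        rows = [data[name] for name, lab in labels.items() if lab == cluster]
        profiles[cluster] = {
            var: mean(col) for var, col in zip(variables, zip(*rows))
        }
    return profiles

for cluster, profile in cluster_profiles(data, labels).items():
    print(cluster, profile)
```

Comparing the rows of this table shows which variables separate the clusters, which is also a natural starting point for labelling them.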

38 Summary
– Cluster analysis is an art more than a science!
– Different measures and different algorithms can affect the results.
– Final selection of the clusters is based on both objective and subjective considerations.

39 LIFE STYLE SEGMENTATION AN APPLICATION

40 VARIABLES

41 DESCRIPTIVES

44 SEGMENT PROFILES NO STANDARDIZATION

46 EXAMPLE

47 1-Very suitable 2-Suitable 3-Not suitable 4-Not suitable at all

