CLUSTER ANALYSIS. Cluster Analysis  Cluster analysis is a major technique for classifying a ‘mountain’ of information into manageable meaningful piles.


1 CLUSTER ANALYSIS

2 Cluster Analysis
- Cluster analysis is a major technique for classifying a ‘mountain’ of information into manageable, meaningful piles.
- It is a data reduction tool that creates subgroups that are more manageable than individual data points.
- It does not require any prior knowledge about which elements belong to which clusters.
Rahul Chandra

3 Purpose
- Cluster analysis (CA) is an exploratory data analysis tool for organizing observed data (e.g. people, things, events, brands, companies) into meaningful groups, or clusters, that are initially unknown. The grouping is based on combinations of independent variables (IVs) and maximizes the similarity of cases within each cluster while maximizing the dissimilarity between clusters.

4 Example

5 CLUSTERED PREFERENCE

6 Commercial applications
- A chain of radio stores uses cluster analysis to identify three different customer types with varying needs.
- An insurance company uses cluster analysis to classify customers into segments such as the “self-confident customer” and the “price-conscious customer”.
- A producer of copying machines classifies industrial customers into “satisfied” and “non-satisfied or quarrelling” customers.

7 Overview of clustering methods
Non-overlapping (Exclusive) Methods
  Hierarchical
    Agglomerative
      Linkage Methods
        - Average: Between (1), Within (2), Weighted
        - Single: Ordinary (3), Density, Two stage Density
        - Complete (4)
      Centroid Methods
        - Centroid (5)
        - Median (6)
      Variance Methods
        - Ward (7)
    Divisive
  Non-hierarchical / Partitioning / k-means
    - Sequential threshold
    - Parallel threshold
    - Neural Networks
    - Optimized partitioning (8)
Overlapping Methods
  Non-hierarchical
    - Overlapping k-centroids
    - Overlapping k-means
    - Latent class techniques
    - Fuzzy clustering
    - Q-type Factor analysis (9)
Name in SPSS: (1) Between-groups linkage; (2) Within-groups linkage; (3) Nearest neighbour; (4) Furthest neighbour; (5) Centroid clustering; (6) Median clustering; (7) Ward’s method; (8) K-means cluster; (9) (Factor)
Note: Methods in italics on the original slide are available in SPSS. Neural networks require SPSS’ data mining tool Clementine.

8 CONDUCTING CLUSTER ANALYSIS
- First, assemble the pool of observations or things that need to be grouped.
- Define the variables on which the clustering will be based.
- Collect data on the selected variables.
- Select a suitable clustering method.
- Measure the distance between respondents using a distance formula.
- Select a suitable linkage rule.

9 Major Clustering Approaches
- Partitioning approach: Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors.
- Hierarchical approach: Create a hierarchical decomposition of the set of data (or objects) using some criterion.

10 Partitioning Algorithms
- Partitioning method: Construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized.
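Written out, this criterion is usually taken to be the within-cluster sum of squared errors. The notation below (C_i for the i-th cluster, m_i for its centroid, E for the total error) is assumed here for illustration and does not appear on the slide:

```latex
E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^{2},
\qquad
m_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
```

A partitioning method searches for the assignment of the n objects to the k clusters that makes E as small as possible.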

11 The K-Means Clustering Method
Given k, the k-means algorithm is implemented as:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no assignment changes.
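The four steps can be sketched in plain Python as follows. This is a minimal illustration, not the SPSS implementation: the function name `kmeans`, the round-robin initial partition, the `max_iter` cap, and the rule that an emptied cluster keeps its previous centroid are all assumptions made for the example.

```python
import math

def kmeans(points, k, max_iter=100):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Step 1: partition the objects into k nonempty subsets
    # (round-robin assignment is an assumed, arbitrary choice).
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid (mean point) of each current cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an emptied cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Step 4: stop when no assignment changes, otherwise repeat from Step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Made-up data: two visibly separate groups of three points each.
data = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
labels, centroids = kmeans(data, k=2)
```

On this made-up data the first three points end up in one cluster and the last three in the other. Different initial partitions can lead to different final clusters, which is one reason the starting point matters in practice.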

12 The K-Means Clustering Method
Example (K = 2):
- Arbitrarily choose K objects as the initial cluster centers.
- Assign each object to the most similar center.
- Update the cluster means.
- Reassign the objects.
[Figure: scatter plots on axes running from 0 to 10, showing the successive assignment and update steps.]

13 Hierarchical Clustering
- Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
[Figure: objects a, b, c, d, e. Agglomerative, Step 0 to Step 4: a and b merge, d and e merge, c joins d e, and finally a b and c d e merge into a single cluster. Divisive runs the same steps in the opposite direction, Step 4 to Step 0.]

14 Agglomerative Nesting
- Merge the nodes that have the least dissimilarity.
- Continue in a non-descending fashion.
- Eventually all nodes belong to the same cluster.
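A minimal sketch of this merging loop is shown below. The slide does not say how dissimilarity between two clusters is measured, so the sketch assumes the closest-pair (single linkage) rule with Euclidean distance; the names `single_link` and `agglomerate` are hypothetical.

```python
import math
from itertools import combinations

def single_link(a, b):
    """Assumed dissimilarity: distance between the closest pair of members."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points):
    """Repeatedly merge the two least dissimilar clusters; return the merge history."""
    clusters = [[p] for p in points]  # start: every node is its own cluster
    history = []                      # (merge height, members of the merged cluster)
    while len(clusters) > 1:
        # Find the pair of clusters with the least dissimilarity.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        height = single_link(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        history.append((height, merged))
        # Replace the two merged clusters with their union.
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return history

# Made-up points: two tight pairs that are far from each other.
history = agglomerate([(0, 0), (0, 1), (5, 0), (5, 2)])
heights = [h for h, _ in history]
```

With this made-up data the merge heights come out in non-descending order, and the last entry of `history` contains all four points, matching the three bullets above.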

15 Divisive Analysis
- Proceeds in the inverse order of agglomerative nesting.
- Eventually each node forms a cluster on its own.

16 Distance Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects. The most commonly used measure is Euclidean distance. The distance between two objects i and j on p dimensions is given as $d(i, j) = \sqrt{\sum_{m=1}^{p} (x_{im} - x_{jm})^2}$, where $x_{im}$ is the value of object i on dimension m.

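17 Euclidean distance
Example of the Euclidean distance between two points A and B in two-dimensional space, with A = $(x_1, y_1)$ and B = $(x_2, y_2)$:
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
[Figure: points A and B plotted on X and Y axes, with the horizontal difference $x_2 - x_1$ and the vertical difference $y_2 - y_1$ marked.]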
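As a small worked illustration, the Euclidean distance can be computed for any number of dimensions as follows. The function name `euclidean` and the coordinates are made up for this example.

```python
import math

def euclidean(i, j):
    """Euclidean distance between two objects given as equal-length sequences."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of dimensions")
    # Sum the squared differences on every dimension, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

# Two-dimensional case: A = (1, 2), B = (4, 6).
d_ab = euclidean((1, 2), (4, 6))

# The same function handles p = 3 dimensions.
d_3d = euclidean((0, 0, 0), (1, 2, 2))
```

For A = (1, 2) and B = (4, 6) the differences are 3 and 4, so d = sqrt(9 + 16) = 5.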

18 Alternatives to Calculate the Distance between Clusters
- Single linkage
- Complete linkage
- Average linkage
- Ward's method

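19 Single linkage
- Smallest distance between an element in one cluster and an element in the other, i.e., $\mathrm{dis}(K_i, K_j) = \min_{p,q} \mathrm{dis}(t_{ip}, t_{jq})$, where $t_{ip}$ is an element of cluster $K_i$ and $t_{jq}$ is an element of cluster $K_j$.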

20 Single linkage
[Figure: points A, B, C, D, E, G and H, with the values 7,0 and 8,5 marked.]

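21 Complete linkage
- Largest distance between an element in one cluster and an element in the other, i.e., $\mathrm{dis}(K_i, K_j) = \max_{p,q} \mathrm{dis}(t_{ip}, t_{jq})$, where $t_{ip}$ is an element of cluster $K_i$ and $t_{jq}$ is an element of cluster $K_j$.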

22 Complete linkage
[Figure: points A, B, C, D, E, G and H, with the values 10,5 and 9,5 marked.]

23 Average Linkage
- It calculates the average of the distances between all possible pairs of elements contained in the two clusters being combined.
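The three linkage rules differ only in how they summarize the pairwise distances between two clusters, which a short sketch can make concrete. The function names and data are hypothetical, Euclidean distance is assumed, and the average is taken over cross-cluster pairs only (one element from each cluster), which is one common reading of average linkage.

```python
import math

def pair_distances(ki, kj):
    """All distances between an element of cluster ki and an element of cluster kj."""
    return [math.dist(p, q) for p in ki for q in kj]

def single_linkage(ki, kj):
    return min(pair_distances(ki, kj))   # smallest cross-cluster distance

def complete_linkage(ki, kj):
    return max(pair_distances(ki, kj))   # largest cross-cluster distance

def average_linkage(ki, kj):
    d = pair_distances(ki, kj)
    return sum(d) / len(d)               # mean of all cross-cluster distances

# Made-up clusters: the corners of a 4-by-3 rectangle, split left and right.
left = [(0, 0), (0, 3)]
right = [(4, 0), (4, 3)]
results = (single_linkage(left, right),
           complete_linkage(left, right),
           average_linkage(left, right))
```

Here the four cross-cluster distances are 4, 5, 5 and 4, so the single, complete and average linkage distances are 4, 5 and 4.5 respectively.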

24 Average linkage
[Figure: points A, B, C, D, E, G and H, with the values 9,0 and 8,5 marked.]

25 Ward's Method
This method is distinct from the other methods because it uses an analysis of variance approach to evaluate the distances between clusters. In general, this method is very efficient. Cluster membership is assessed by calculating the total sum of squared deviations from the mean of a cluster. The criterion for fusion is that it should produce the smallest possible increase in the error sum of squares.
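The fusion criterion can be illustrated with a short sketch that computes the error sum of squares (ESS) before and after a candidate merge. The function names (`ess`, `ward_increase`, `best_ward_merge`) and the data are assumptions for this example, not SPSS internals.

```python
from itertools import combinations

def centroid(cluster):
    """Mean point of a cluster given as a list of equal-length tuples."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def ess(cluster):
    """Error sum of squares: squared deviations of the members from the cluster mean."""
    m = centroid(cluster)
    return sum(sum((x - mx) ** 2 for x, mx in zip(p, m)) for p in cluster)

def ward_increase(a, b):
    """Increase in the total ESS if clusters a and b were fused."""
    return ess(a + b) - ess(a) - ess(b)

def best_ward_merge(clusters):
    """Index pair of the two clusters whose fusion increases the ESS the least."""
    return min(combinations(range(len(clusters)), 2),
               key=lambda ij: ward_increase(clusters[ij[0]], clusters[ij[1]]))

# Made-up data: four single-point clusters at the corners of a 4-by-2 rectangle.
clusters = [[(0, 0)], [(0, 2)], [(4, 0)], [(4, 2)]]
first_pair = best_ward_merge(clusters)
```

For these points, fusing the two left-hand points raises the ESS by 2, while fusing a left-hand point with a right-hand one raises it by at least 8, so a left-hand (or, equally, right-hand) pair is fused first.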

26 Step 0: Each observation is treated as a separate cluster
[Figure: dendrogram of observations OBS 1 to OBS 6 against a distance measure axis marked 0,2 to 1,0.]

27 k-means clustering
- This method of clustering is very different from hierarchical clustering and Ward's method, which are applied when there is no prior knowledge of how many clusters there may be or what characterizes them.
- K-means clustering is used when you already have hypotheses concerning the number of clusters in your cases or variables.

28 k-means clustering
- Very frequently, the hierarchical and the k-means techniques are used successively.
- The former (Ward’s method) is used to get some sense of the possible number of clusters and the way they merge, as seen from the dendrogram.
- Then the clustering is rerun with only the chosen optimum number of clusters in which to place all the cases (k-means clustering).

29 ANOVA Test in clustering
- The cluster centroids produced by SPSS are essentially the means of the cluster scores for the elements of each cluster.
- We then usually examine the means for each cluster on each dimension using ANOVA to assess how distinct the clusters are.
- Ideally, we would obtain significantly different means for most, if not all, of the dimensions used in the analysis.
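The check described above can be sketched for a single dimension as a one-way ANOVA F statistic across clusters. This is a hand-rolled illustration rather than SPSS output; the function name `one_way_anova_f` and the scores are made up, and the sketch assumes at least two clusters and some within-cluster variation.

```python
def one_way_anova_f(groups):
    """F statistic for one variable, given one list of values per cluster."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-cluster variation: how far each cluster mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-cluster variation: how far each case is from its own cluster mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)  # assumes ss_within > 0
    return ms_between / ms_within

# Made-up scores on one dimension for three clusters.
f_value = one_way_anova_f([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

For these scores the between-cluster mean square is 27 and the within-cluster mean square is 1, giving F = 27. Because the clusters were formed to differ on these same variables, such F values are best read descriptively, as a guide to which dimensions separate the clusters most.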

30 Example
- A keep-fit gym group wants to determine the best grouping of its customers with regard to the type of fitness work programs.
- A hierarchical analysis is run, and three major clusters stand out between everyone initially being in a separate cluster and the final single cluster.
- This is then quantified using a k-means cluster analysis with three clusters, which reveals that the means of the different physical fitness measures do indeed produce the three clusters (i.e. customers in cluster 1 are high on measure 1, low on measure 2, etc.).

31 SPSS Output

32 SPSS Outputs

33 SPSS Output

34 SPSS Output

