CLUSTER ANALYSIS
Cluster Analysis
Cluster analysis is a major technique for classifying a 'mountain' of information into manageable, meaningful piles. It is a data-reduction tool that creates subgroups which are easier to handle than individual data points. It does not require any prior knowledge about which elements belong to which clusters.
Rahul Chandra
Purpose
Cluster analysis (CA) is an exploratory data-analysis tool for organizing observed data (e.g. people, things, events, brands, companies) into meaningful groups, or clusters, based on combinations of independent variables. It maximizes the similarity of cases within each cluster while maximizing the dissimilarity between the groups, which are initially unknown.
Example
CLUSTERED PREFERENCE
Commercial applications
- A chain of radio stores uses cluster analysis to identify three customer types with varying needs.
- An insurance company uses cluster analysis to classify customers into segments such as the "self-confident customer" and "the price-conscious customer".
- A producer of copying machines succeeds in classifying industrial customers into "satisfied" and "non-satisfied or quarrelling" customers.
Overview of clustering methods

Non-overlapping (exclusive) methods
- Hierarchical
  - Agglomerative
    - Linkage methods
      - Average: between groups (1), within groups (2), weighted
      - Single: ordinary (3), density, two-stage density
      - Complete (4)
    - Centroid methods: centroid (5), median (6)
    - Variance methods: Ward (7)
  - Divisive
- Non-hierarchical / partitioning / k-means: sequential threshold, parallel threshold, neural networks, optimized partitioning, k-means (8)

Overlapping methods
- Non-hierarchical: overlapping k-centroids, overlapping k-means, latent class techniques, fuzzy clustering, Q-type factor analysis (9)

Names in SPSS: (1) between-groups linkage, (2) within-groups linkage, (3) nearest neighbour, (4) furthest neighbour, (5) centroid clustering, (6) median clustering, (7) Ward's method, (8) k-means cluster, (9) (factor).

Note: the numbered methods are available in SPSS; neural networks require SPSS's data-mining tool Clementine.
CONDUCTING CLUSTER ANALYSIS
- First, you need a pool of observations/objects to be grouped.
- Define the variables on which the clustering will be based.
- Collect data on the selected variables.
- Select a suitable clustering method.
- Measure the inter-respondent distances using a distance formula.
- Select a suitable linkage rule.
Major Clustering Approaches
- Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors.
- Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion.
Partitioning Algorithms
Partitioning method: construct a partition of a database D of n objects into a set of k clusters C_1, ..., C_k such that the sum of squared distances is minimized, i.e., minimize
E = sum_{i=1..k} sum_{p in C_i} d(p, m_i)^2,
where m_i is the centroid (mean point) of cluster C_i.
The K-Means Clustering Method
Given k, the k-means algorithm proceeds as follows:
1. Partition the objects into k non-empty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e. the mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when no assignments change.
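The steps above can be sketched in pure Python. This is a minimal illustration of the algorithm, not SPSS's implementation; the helper name `kmeans` and the sample points and seeds in the usage note are invented for the example.

```python
import math

def kmeans(points, seeds, max_iter=100):
    """Cluster points around k seed centroids (steps 2-4 above)."""
    centroids = [tuple(s) for s in seeds]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 2: recompute each centroid as the mean point of its cluster
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        # Step 4: stop when no assignments (and hence no centroids) change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

For example, `kmeans([(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)], seeds=[(0, 0), (10, 10)])` converges to two clusters of three points each.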
The K-Means Clustering Method: Example
[Figure: scatter plots on a 0-10 grid illustrating k-means with K = 2 -- arbitrarily choose K objects as initial cluster centres, assign each object to the most similar centre, update the cluster means, and reassign.]
Hierarchical Clustering
Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
[Figure: five objects a-e merged step by step (steps 0-4) into {a, b} and {c, d, e} and finally one cluster -- read left to right for agglomerative, right to left for divisive.]
Agglomerative Nesting
- Merge the nodes that have the least dissimilarity.
- Continue in a non-descending fashion.
- Eventually all nodes belong to the same cluster.
Divisive Analysis
- Works in the inverse order of agglomerative nesting.
- Eventually each node forms a cluster of its own.
Distance Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects. The most commonly used measure is Euclidean distance. The distance between two objects i and j on p dimensions is given as
d(i, j) = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ip - x_jp)^2 ).
Euclidean distance
Example of the Euclidean distance between two points A = (x1, y1) and B = (x2, y2) in two-dimensional space:
d = sqrt( (x2 - x1)^2 + (y2 - y1)^2 )
[Figure: A and B plotted in the X-Y plane, with the legs x2 - x1 and y2 - y1 forming a right triangle whose hypotenuse is d.]
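The formula translates directly into code; this is a quick sketch, and the points in the usage note are illustrative.

```python
import math

def euclidean(a, b):
    """d(a, b) = sqrt(sum over dimensions of (a_k - b_k)^2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For example, `euclidean((0, 0), (3, 4))` returns `5.0`, the classic 3-4-5 right triangle.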
Alternatives to Calculate the Distance between Clusters
- Single linkage
- Complete linkage
- Average linkage
- Ward's method
Single linkage
The smallest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = min{ d(t_ip, t_jq) : t_ip in K_i, t_jq in K_j }.
Single linkage
[Figure: points A, B, C, G, H, D, E grouped into clusters, with single-linkage distances 7.0 and 8.5 marked.]
Complete linkage
The largest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = max{ d(t_ip, t_jq) : t_ip in K_i, t_jq in K_j }.
Complete linkage
[Figure: the same points, with complete-linkage distances 10.5 and 9.5 marked.]
Average Linkage
It calculates the average of the distances between all possible pairs of elements drawn from the two clusters being combined.
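The three pairwise linkage rules can be sketched as follows; the function names and the example clusters in the usage note are invented for illustration.

```python
import math

def pairwise(ci, cj):
    """All distances between an element of cluster ci and one of cj."""
    return [math.dist(p, q) for p in ci for q in cj]

def single_linkage(ci, cj):    # smallest inter-cluster distance
    return min(pairwise(ci, cj))

def complete_linkage(ci, cj):  # largest inter-cluster distance
    return max(pairwise(ci, cj))

def average_linkage(ci, cj):   # mean over all possible pairs
    d = pairwise(ci, cj)
    return sum(d) / len(d)
```

For clusters `[(0, 0), (0, 1)]` and `[(0, 3), (0, 5)]`, the pairwise distances are 3, 5, 2, and 4, so single linkage gives 2.0, complete linkage 5.0, and average linkage 3.5.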
Average linkage
[Figure: the same points, with average-linkage distances 9.0 and 8.5 marked.]
Ward's Method
This method is distinct from the others because it uses an analysis-of-variance approach to evaluate the distances between clusters. In general, it is very efficient. Cluster membership is assessed by calculating the total sum of squared deviations from the mean of a cluster. The criterion for fusion is that it should produce the smallest possible increase in the error sum of squares.
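A minimal sketch of Ward's fusion criterion, assuming clusters are lists of coordinate tuples: the candidate merge chosen at each step is the one with the smallest increase in the error sum of squares (ESS). The function names are illustrative.

```python
def ess(cluster):
    """Total squared deviation of a cluster's points from its mean."""
    p = len(cluster[0])
    means = [sum(pt[k] for pt in cluster) / len(cluster) for k in range(p)]
    return sum((pt[k] - means[k]) ** 2 for pt in cluster for k in range(p))

def ward_increase(ci, cj):
    """Increase in the error sum of squares caused by merging ci and cj."""
    return ess(ci + cj) - ess(ci) - ess(cj)
```

Merging the singletons `[(0, 0)]` and `[(2, 0)]`, for instance, moves both points a distance of 1 from the new mean (1, 0), so the ESS increases by 2.0; Ward's method would prefer this merge over any with a larger increase.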
Step 0: each observation is treated as a separate cluster.
[Figure: dendrogram of six observations (OBS 1-6) against the distance measure (0.2 to 1.0).]
k-means clustering
This method differs from hierarchical clustering (e.g. Ward's method), which is applied when there is no prior knowledge of how many clusters there may be or what characterizes them. K-means clustering is used when you already have hypotheses about the number of clusters in your cases or variables.
k-means clustering
Very frequently, the hierarchical and k-means techniques are used in succession. The former (e.g. Ward's method) is used to get some sense of the likely number of clusters and of how they merge, as seen from the dendrogram. The clustering is then rerun with k-means, placing all the cases into the chosen optimal number of clusters.
ANOVA Test in clustering
The cluster centroids produced by SPSS are essentially the means of the clustering variables for the members of each cluster. We then usually examine the means for each cluster on each dimension using ANOVA to assess how distinct the clusters are. Ideally, we would obtain significantly different means on most, if not all, of the dimensions used in the analysis.
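The one-way ANOVA F statistic that SPSS reports per dimension can be sketched in pure Python; the scores in the usage note are illustrative, and in practice you would also consult the p-value for the resulting F.

```python
def anova_f(groups):
    """One-way ANOVA F statistic: between-group MS over within-group MS,
    for a list of per-cluster score lists on a single dimension."""
    n = sum(len(g) for g in groups)          # total number of cases
    k = len(groups)                          # number of clusters
    grand = sum(x for g in groups for x in g) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

For two well-separated clusters such as `[[1, 2, 3], [7, 8, 9]]`, the between-cluster variation dwarfs the within-cluster variation and F is large (54.0 here), signalling distinct cluster means on that dimension.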
Example
A keep-fit gym wants to determine the best grouping of its customers with regard to the type of fitness programme. A hierarchical analysis is run, and three major clusters stand out between the initial solution (everyone in a separate cluster) and the final single cluster. This is then quantified with a k-means cluster analysis using three clusters, which reveals that the means of the various physical-fitness measures do indeed produce the three clusters (e.g. customers in cluster 1 are high on measure 1, low on measure 2, etc.).
SPSS Output
[Figures: four slides of SPSS output screenshots, not recoverable from the transcript.]