CLUSTER ANALYSIS

Cluster Analysis
- Cluster analysis is a major technique for classifying a ‘mountain’ of information into manageable, meaningful piles.
- It is a data reduction tool that creates subgroups that are more manageable than individual data points.
- It does not require any prior knowledge about which elements belong to which clusters.
Rahul Chandra

Purpose
- Cluster analysis (CA) is an exploratory data analysis tool for organizing observed data (e.g. people, things, events, brands, companies) into meaningful groups, or clusters, that are initially unknown.
- The grouping is based on combinations of independent variables (IVs) and maximizes the similarity of cases within each cluster while maximizing the dissimilarity between clusters.

Example

CLUSTERED PREFERENCE

Commercial applications
- A chain of radio stores uses cluster analysis to identify three different customer types with varying needs.
- An insurance company uses cluster analysis to classify customers into segments such as “the self-confident customer”, “the price-conscious customer”, etc.
- A producer of copying machines succeeds in classifying industrial customers into “satisfied” and “non-satisfied or quarrelling” customers.

Overview of clustering methods
Name in SPSS:
1. Between-groups linkage
2. Within-groups linkage
3. Nearest neighbour
4. Furthest neighbour
5. Centroid clustering
6. Median clustering
7. Ward’s method
8. K-means cluster
9. (Factor)
Non-overlapping (Exclusive) Methods:
- Hierarchical
  - Agglomerative
    - Linkage Methods
      - Average: Between (1), Within (2), Weighted
      - Single: Ordinary (3), Density, Two stage Density
      - Complete (4)
    - Centroid Methods: Centroid (5), Median (6)
    - Variance Methods: Ward (7)
  - Divisive
- Non-hierarchical / Partitioning / k-means
  - Sequential threshold
  - Parallel threshold
  - Neural Networks
  - Optimized partitioning (8)
Overlapping Methods:
- Non-hierarchical
  - Overlapping k-centroids
  - Overlapping k-means
  - Latent class techniques
  - Fuzzy clustering
  - Q-type Factor analysis (9)
Note: Methods in italics (here, those carrying a number) are available in SPSS. Neural networks require SPSS’s data mining tool Clementine.

CONDUCTING CLUSTER ANALYSIS
- First, you need a pool of observations or things that need to be grouped.
- Define the variables on which the clustering will be based.
- Collect data on the selected variables.
- Select a suitable clustering method.
- Measure the distances between respondents using a distance formula.
- Select a suitable linkage rule.

Major Clustering Approaches
- Partitioning approach: Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors.
- Hierarchical approach: Create a hierarchical decomposition of the set of data (or objects) using some criterion.

Partitioning Algorithms
- Partitioning method: Construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances between the objects and their cluster centroids is minimized.

The K-Means Clustering Method
Given k, the k-means algorithm is implemented as:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no assignment changes any more.
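The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the SPSS procedure: the function name kmeans, the helper names, the seed and max_iter parameters, and the sample points are all made up for this sketch, and the initial partition is simply a shuffled round-robin split.

```python
import random


def mean_point(members):
    """Centroid (coordinate-wise mean) of a non-empty list of points."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) for points, a list of equal-length tuples."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)

    # Step 1: partition the objects into k nonempty subsets
    # (shuffled round-robin assignment, an assumption of this sketch).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % k

    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid of every cluster in the current partition.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # a cluster that lost all members keeps its old centroid
                centroids[c] = mean_point(members)

        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]

        # Step 4: stop when no assignment changes, otherwise go back to Step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical data: two visibly separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
```

The cluster numbers themselves are arbitrary; only which points share a label is meaningful.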

The K-Means Clustering Method: Example (K = 2)
- Arbitrarily choose K objects as the initial cluster centers.
- Assign each object to the most similar center.
- Update the cluster means.
- Reassign the objects and update the cluster means again.
[Figure: scatter plots on axes from 0 to 10 showing the successive assignments.]

Hierarchical Clustering
- Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
[Figure: objects a, b, c, d, e. Agglomerative clustering runs from Step 0 to Step 4, merging a and b, then d and e, then c with d e, and finally all five objects. Divisive clustering runs in the reverse direction, from Step 4 back to Step 0.]

Agglomerative Nesting
- Merge the nodes that have the least dissimilarity.
- Continue in a non-descending fashion, i.e., the dissimilarity at which clusters are merged never decreases.
- Eventually all nodes belong to the same cluster.
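The merge loop described on this slide can be sketched as follows. The sketch assumes single linkage as the dissimilarity between clusters, and the function name agglomerate and the one-dimensional positions given to the nodes a to e are hypothetical values chosen only for illustration.

```python
def agglomerate(names, dissimilarity):
    """Merge clusters until one remains; return (cluster_a, cluster_b, height) tuples.

    dissimilarity maps frozenset({p, q}) to the dissimilarity of nodes p and q.
    """
    clusters = [frozenset([name]) for name in names]
    merges = []

    def linkage(c1, c2):
        # Single linkage (an assumption of this sketch): closest pair of members.
        return min(dissimilarity[frozenset((p, q))] for p in c1 for q in c2)

    while len(clusters) > 1:
        # Find the pair of clusters with the least dissimilarity.
        candidates = [
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        height, i, j = min(candidates)
        merges.append((set(clusters[i]), set(clusters[j]), height))

        # Replace the two merged clusters by their union.
        merged = clusters[i] | clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical positions on a line; dissimilarity is the absolute difference.
positions = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 8.0, "e": 9.5}
names = sorted(positions)
dissimilarity = {
    frozenset((p, q)): abs(positions[p] - positions[q])
    for p in names
    for q in names
    if p < q
}
merges = agglomerate(names, dissimilarity)
heights = [height for _, _, height in merges]
```

With these made-up positions the merge order is a with b, d with e, c with d e, and finally everything, and the recorded heights never decrease.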

Divisive Analysis
- Proceeds in the inverse order of agglomerative nesting.
- Eventually each node forms a cluster on its own.

Distance Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects.
- The most commonly used distance measure is the Euclidean distance. The distance between two objects i and j on p dimensions is given as $d(i, j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$, where $x_{ik}$ is the value of object i on dimension k.

Euclidean distance
Example of the Euclidean distance between two points A = (x_1, y_1) and B = (x_2, y_2) in two-dimensional space, with horizontal leg x_2 - x_1 and vertical leg y_2 - y_1:
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
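As a small worked sketch, the Euclidean distance can be computed for any number of dimensions. The function name euclidean_distance and the sample coordinates are illustrative assumptions, not part of the original slides.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("both points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Two dimensions: legs of length 3 and 4 give a distance of 5.
d_2d = euclidean_distance((1, 2), (4, 6))

# Three dimensions: sqrt(2^2 + 3^2 + 6^2) = sqrt(49) = 7.
d_3d = euclidean_distance((0, 0, 0), (2, 3, 6))
```

From Python 3.8 onward the standard library's math.dist computes the same quantity.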

Alternatives to Calculate the Distance between Clusters
- Single linkage
- Complete linkage
- Average linkage
- Ward’s method

Single linkage
- Smallest distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \min(t_{ip}, t_{jq})$, the minimum taken over the distances between elements $t_{ip}$ of $K_i$ and $t_{jq}$ of $K_j$.

Single linkage
[Figure: points A, B, C, D, E, G, H with marked distances 7.0 and 8.5.]

Complete linkage
- Largest distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \max(t_{ip}, t_{jq})$, the maximum taken over the distances between elements $t_{ip}$ of $K_i$ and $t_{jq}$ of $K_j$.

Complete linkage
[Figure: points A, B, C, D, E, G, H with marked distances 10.5 and 9.5.]

Average Linkage
- Calculates the average of the distances between all possible pairs of elements, one from each of the two clusters being combined.

Average linkage
[Figure: points A, B, C, D, E, G, H with marked distances 9.0 and 8.5.]
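The three linkage rules differ only in how they summarize the pairwise distances between two clusters. The sketch below assumes Euclidean distance between elements; the function names and the two small clusters are hypothetical and do not correspond to the points in the figures.

```python
import math
from itertools import product


def pair_distances(cluster_a, cluster_b):
    """Euclidean distances for every pair with one point from each cluster."""
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        for p, q in product(cluster_a, cluster_b)
    ]


def single_linkage(cluster_a, cluster_b):
    """Smallest pairwise distance between the two clusters."""
    return min(pair_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Largest pairwise distance between the two clusters."""
    return max(pair_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Mean of all pairwise distances between the two clusters."""
    distances = pair_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


# Hypothetical clusters: the pairwise distances are 4, 5, 5 and 4.
cluster_a = [(0, 0), (0, 3)]
cluster_b = [(4, 0), (4, 3)]
single = single_linkage(cluster_a, cluster_b)
complete = complete_linkage(cluster_a, cluster_b)
average = average_linkage(cluster_a, cluster_b)
```

For any pair of clusters the average-linkage value lies between the single-linkage and complete-linkage values.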

Ward’s Method
- This method is distinct from the other methods because it uses an analysis of variance approach to evaluate the distances between clusters. In general, this method is very efficient.
- Cluster membership is assessed by calculating the total sum of squared deviations from the mean of a cluster.
- The criterion for fusion is that it should produce the smallest possible increase in the error sum of squares.
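The fusion criterion can be illustrated with a short sketch that computes, for every pair of clusters, how much the error sum of squares would grow if the pair were merged, and then picks the smallest increase. The function names and the one-dimensional sample clusters are assumptions made for this example.

```python
def error_sum_of_squares(cluster):
    """Sum of squared deviations of the points from their cluster mean."""
    centroid = [sum(coords) / len(cluster) for coords in zip(*cluster)]
    return sum(
        sum((x - m) ** 2 for x, m in zip(point, centroid)) for point in cluster
    )


def ward_increase(cluster_a, cluster_b):
    """Increase in the error sum of squares caused by merging two clusters."""
    return (
        error_sum_of_squares(cluster_a + cluster_b)
        - error_sum_of_squares(cluster_a)
        - error_sum_of_squares(cluster_b)
    )


def best_ward_merge(clusters):
    """Return (increase, i, j) for the pair of clusters that is cheapest to merge."""
    candidates = [
        (ward_increase(clusters[i], clusters[j]), i, j)
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    ]
    return min(candidates)


# Hypothetical one-dimensional clusters, each a list of 1-tuples.
clusters = [[(0.0,), (1.0,)], [(2.0,)], [(10.0,), (11.0,)]]
increase, i, j = best_ward_merge(clusters)
```

Here merging the first two clusters raises the error sum of squares from 0.5 to 2.0, an increase of 1.5, which is far smaller than for either merge involving the distant third cluster.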

Step 0: Each observation is treated as a separate cluster
[Figure: dendrogram of OBS 1 to OBS 6 against a distance measure axis running from 0.2 to 1.0.]

k-means clustering
- This method of clustering is very different from hierarchical clustering and Ward’s method, which are applied when there is no prior knowledge of how many clusters there may be or what characterizes them.
- K-means clustering is used when you already have hypotheses concerning the number of clusters in your cases or variables.

k-means clustering
- Very frequently, both the hierarchical and the k-means techniques are used successively.
- The hierarchical technique (Ward’s method) is used first to get some sense of the possible number of clusters and the way they merge, as seen from the dendrogram.
- Then the clustering is rerun with k-means, placing all the cases into only the chosen optimum number of clusters.

ANOVA Test in clustering
- The cluster centroids produced by SPSS are essentially the means of the scores of the elements in each cluster.
- We then usually examine the means for each cluster on each dimension using ANOVA to assess how distinct our clusters are.
- Ideally, we would obtain significantly different means for most, if not all, of the dimensions used in the analysis.
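For a single dimension, the one-way ANOVA comparison of cluster means reduces to an F statistic: the between-cluster mean square divided by the within-cluster mean square. The sketch below is an illustration only; the function name one_way_anova_f and the scores are made up, and it does not reproduce the SPSS output.

```python
def one_way_anova_f(groups):
    """F statistic for one dimension; groups holds one list of scores per cluster."""
    all_values = [value for group in groups for value in group]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = [sum(group) / len(group) for group in groups]

    # Variation of the cluster means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2 for group, mean in zip(groups, means)
    )
    # Variation of the scores around their own cluster mean.
    ss_within = sum(
        (value - mean) ** 2 for group, mean in zip(groups, means) for value in group
    )

    ms_between = ss_between / (k - 1)  # k - 1 degrees of freedom
    ms_within = ss_within / (n - k)    # n - k degrees of freedom
    return ms_between / ms_within


# Hypothetical scores on one dimension for three clusters.
scores_by_cluster = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
f_value = one_way_anova_f(scores_by_cluster)
```

A large F value indicates that the cluster means on that dimension are far apart relative to the spread inside the clusters. Turning F into a significance level requires the F distribution; if SciPy is available, scipy.stats.f_oneway returns both the statistic and the p-value.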

Example
- A keep-fit gym group wants to determine the best grouping of its customers with regard to the type of fitness work programs.
- A hierarchical analysis is run, and three major clusters stand out between the starting point, where everyone is in a separate cluster, and the final single cluster.
- This is then quantified using a k-means cluster analysis with three clusters, which reveals that the means of the different physical fitness measures do indeed distinguish the three clusters (i.e. customers in cluster 1 are high on measure 1, low on measure 2, etc.).

SPSS Output
[Figures: three slides of SPSS output, not reproduced here.]
