

Cluster Analysis

1. A cluster, by definition, is a group of similar objects. Cluster analysis is a technique for grouping individuals or objects into unknown groups. It differs from other methods of classification, such as discriminant analysis, in that the number and characteristics of the groups are derived from the data and are not usually known prior to the analysis.
2. There could be clusters of people, brands or other objects.
3. If clusters are formed of customers similar to one another, then cluster analysis can help marketers identify segments (clusters).
4. If clusters of brands are formed, this can be used to gain insights into brands that are perceived as similar to each other on a set of attributes.
5. Cluster analysis is best performed when the variables are interval or ratio-scaled.

1. There are two major classes of cluster analysis techniques: hierarchical and non-hierarchical.
2. In hierarchical clustering, some measure of distance is used to identify distances between all pairs of objects to be clustered. One popular distance measure is the Euclidean distance. Another is the squared Euclidean distance.
3. We begin with all objects in separate clusters. Say we have ten objects in separate clusters. The two closest objects are joined to form a cluster. The remaining 8 objects remain separate. This is stage 1 of hierarchical clustering.

4. In stage 2, suppose the two closest of the remaining objects form another cluster. Now we have two clusters and 6 unclustered objects. This means a total of eight clusters: two with two objects each, and six with one object each.
5. This process continues, with points joining existing clusters (because they are closest to an existing cluster) and clusters joining other clusters, based on the shortest distance criterion.
6. In this way, a range of possible solutions is formed, from a 10-cluster solution at the beginning to a single-cluster solution at the end.
7. We have to decide how many clusters the data seems to have, using either the agglomeration schedule or the dendrogram to help make the decision. Both of these are computer outputs that describe, in numbers or visually, the sequence of cluster formation. This decision is somewhat subjective, but there are some guidelines one can follow, as illustrated in the worked example.
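The stage-by-stage joining described above can be sketched in a few lines of Python. This is a minimal illustration and not the SPSS implementation: the five two-variable data points are hypothetical, the function names are our own, and the sketch assumes squared Euclidean distance with average (between-groups) linkage, where the distance between two clusters is the mean distance over all cross-cluster pairs.

```python
from itertools import combinations

def sq_euclidean(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def average_linkage(c1, c2, points):
    """Mean squared Euclidean distance over all cross-cluster pairs."""
    total = sum(sq_euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points):
    """Return one (stage, members_a, members_b, coefficient) tuple per stage."""
    clusters = [[i] for i in range(len(points))]  # every object starts alone
    schedule = []
    stage = 1
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: average_linkage(clusters[ab[0]], clusters[ab[1]], points),
        )
        coeff = average_linkage(clusters[a], clusters[b], points)
        schedule.append((stage, clusters[a], clusters[b], coeff))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        stage += 1
    return schedule

# Hypothetical data: five objects measured on two variables.
points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 6.0), (9.0, 1.0)]
schedule = agglomerate(points)
for stage, a, b, coeff in schedule:
    print(f"stage {stage}: join {a} and {b} at coefficient {coeff:.2f}")
```

With n objects the schedule always has n - 1 stages, so the last row corresponds to the single-cluster solution. The coefficient in each row is the distance at which that merge took place.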

1. In non-hierarchical clustering methods (of which k-means clustering is the most widely used), we need to specify the number of clusters we want the objects to be clustered into.
2. This can be done if we have a hypothesis that the objects will group into a certain number of clusters. Alternatively, we can first do a hierarchical clustering on the data, find the approximate number of clusters, and then perform a k-means clustering.
3. In our illustration, we have used hierarchical and non-hierarchical methods in combination with one another.
4. Let us move on to our worked example.
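Before the worked example, here is a minimal sketch of the k-means idea in Python, under stated assumptions: the six data points and the two starting centers are hypothetical, and the function names are our own. SPSS Quick Cluster chooses its starting centers by its own rules, so this shows only the basic assign-and-update loop and not that procedure.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points]

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and center updates until the centers stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                # New center = mean of each variable over the cluster members.
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)

# Hypothetical data: six objects on two variables, k = 2 clusters.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
initial_centers = [(1.0, 1.0), (9.0, 8.0)]  # assumed starting centers
final_centers, labels = kmeans(points, initial_centers)
print(labels)
print(final_centers)
```

The number of clusters is fixed by the number of starting centers, which is why a prior hypothesis or a preliminary hierarchical run is needed to choose it.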

Worked Out Example

Problem: A major FMCG company wants to map the profile of its target market in terms of lifestyle, attitudes and perceptions. The company's managers prepare, with the help of their marketing research team, a set of 15 statements which they feel measure many of the variables of interest. These 15 statements are given below. The respondent had to agree or disagree with each statement (1 = Strongly Agree, 2 = Agree, 3 = Neither Agree nor Disagree, 4 = Disagree, 5 = Strongly Disagree).

1. I prefer to use e-mail rather than write a letter.
2. I feel that quality products are always priced high.
3. I think twice before I buy anything.
4. Television is a major source of entertainment.
5. A car is a necessity rather than a luxury.
6. I prefer fast food and ready to use products.
7. People are more health conscious today.
8. Entry of foreign companies has increased the efficiency of companies.
9. Women are active participants in purchase decisions.
10. I believe politicians can play a positive role.
11. I enjoy watching movies.
12. If I get a chance, I would like to settle abroad.
13. I always buy branded products.
14. I frequently go out on weekends.
15. I prefer to pay by credit card rather than in cash.

Fig 1 contd...

Analyze – Classify – Hierarchical Cluster
Variables: select Var01 to Var15. Radio button Cluster: Cases. Display: Statistics, Plots.
Statistics: select Agglomeration Schedule; radio button Cluster membership: None; Continue.
Plots: select Dendrogram; radio button Icicle: All clusters; radio button Orientation: Vertical; Continue.
Method: Cluster Method: Between-groups linkage; Measure: radio button Interval: Squared Euclidean distance; Continue.
Save: None; Continue.
OK.

The computer output is obtained by first doing a hierarchical cluster analysis to find the number of clusters that exist in the data. These outputs are in Figs. 2 to 4 (Agglomeration schedule, vertical Icicle Plot and Dendrogram using Average Linkage, respectively). The second stage is a K-means (quick cluster) output with a pre-determined number of clusters to be specified. In this case, the output is for 4 clusters. We will look at both stage 1 and stage 2 outputs to understand the interpretation of both stages.

Fig. 2: Agglomeration Schedule (columns: Stage; Cluster Combined, Cluster 1 and Cluster 2; Coefficients; Stage Cluster First Appears, Cluster 1 and Cluster 2; Next Stage)

Fig.3: Vertical Icicle Plot

Fig. 4: Dendrogram using Average Linkage

1. A look at Fig. 2, the agglomeration schedule, can help us to identify large differences in the coefficient (4th column). Read from top to bottom (stage 1 to stage 19), the agglomeration schedule gives the sequence in which cases are combined with others (or one cluster is combined with another), until all 20 cases are combined in one cluster at the last stage (stage 19).
2. Therefore, stage 19 represents a 1-cluster solution, stage 18 a 2-cluster solution, stage 17 a 3-cluster solution, and so on, going up from the last row to the first row. We have to identify how many clusters are in the data. To do so, we use the difference between rows in the coefficient (also known as the fusion coefficient) in column 4.

3. We will look at this figure from the last row upwards, because we would like to have the lowest possible number of clusters, for reasons of economy and ease of interpretation. The first difference in the coefficients is between the 1-cluster solution (stage 19) and the 2-cluster solution (stage 18). The next difference is between stage 18, the 2-cluster solution, and stage 17, the 3-cluster solution, whose coefficient is 51.00. The one after that, between stage 17 and stage 16, is 5.167. After this comes the difference between stage 16 and stage 15, whose coefficient is 42. Thereafter, the differences between subsequent rows of coefficients are smaller.
4. A large difference in the coefficient values between any two rows indicates a solution with the number of clusters which the lower row represents. Ignoring the first difference, which would indicate only 1 cluster in the data, the next largest difference is the one between row 2 from the bottom and row 3 from the bottom, indicating a 2-cluster solution. The difference for a 3-cluster solution is 5.167, and the difference between stages 16 and 15 indicates a 4-cluster solution. At this point, it is the judgement of the researcher that decides whether to go for a 2-cluster or a 4-cluster solution. Just for illustration, we will choose both the 4-cluster and the 2-cluster solutions.
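The row-by-row comparison of coefficients can be sketched as follows. The coefficient values here are hypothetical (they are not the values from Fig. 2) and the function name is our own. The sketch follows the convention used in this example, where a jump between two rows is attributed to the number of clusters represented by the lower row.

```python
def coefficient_jumps(coefficients):
    """Pair each jump with the cluster count of the lower row, last stage first.

    coefficients[i] is the coefficient at stage i + 1; with n cases there are
    n - 1 stages, and stage s leaves n - s clusters.
    """
    n_cases = len(coefficients) + 1
    jumps = []
    for stage in range(len(coefficients), 1, -1):  # last stage down to stage 2
        diff = coefficients[stage - 1] - coefficients[stage - 2]
        jumps.append((n_cases - stage, diff))
    return jumps

# Hypothetical schedule for 6 cases (5 stages), not taken from Fig. 2.
coefficients = [1.0, 1.5, 2.5, 9.0, 30.0]
jumps = coefficient_jumps(coefficients)
for n_clusters, diff in jumps:
    print(f"{n_clusters}-cluster solution: jump of {diff:.2f}")

# Ignore the 1-cluster solution and pick the largest remaining jump.
candidates = [(diff, k) for k, diff in jumps if k > 1]
best = max(candidates)[1]
print("suggested number of clusters:", best)
```

The largest jump is only a guideline. As the text notes, the final choice remains a matter of the researcher's judgement, and more than one candidate solution may be worth examining.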

Final Cluster Centers (table: one row for each of the 15 variables, one column for each of Clusters 1 to 4)

Distances between Final Cluster Centers (table: Clusters 1 to 4 by Clusters 1 to 4)

Number of Cases in each Cluster (table: Clusters 1 to 4, Valid, Missing .000)

1. The final cluster centers (above) describe the mean value of each variable for each of the 4 clusters. For example, cluster 1 is described by the mean values of variable 1 = 1.6, variable 2 = 2.4, variable 3 = 3.6, variable 4 = 3.0 and so on.
2. We now go back to the original variables (in this case the 15 statements in our questionnaire), and interpret the clusters in terms of the 15 variables. For example, cluster 1 consists of people who use e-mail rather than writing conventional letters (variable 1 value = 1.60, which is equivalent to "agree" on the scale of 1 to 5). They also agree (variable 2 value = 2.40) that quality products are always priced high; that is, they associate a product's quality with its price.
3. On these same variables, cluster 2 shows people who prefer conventional mail to e-mail (variable 1 value = 3.5, or close to "disagree") and who do not necessarily associate high price with good quality (variable 2 value = 3.33). In this way, when we compare final cluster center values on each of the 15 variables, one cluster at a time, a complete picture of the clusters emerges.
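Reading a cluster center back onto the 1-to-5 response scale can be sketched as below. The center values are the ones quoted in the text for clusters 1 and 2; the helper name `nearest_label` and the rule of rounding to the nearest scale point are our own assumptions, used only to illustrate the interpretation step.

```python
SCALE = {
    1: "Strongly Agree",
    2: "Agree",
    3: "Neither Agree nor Disagree",
    4: "Disagree",
    5: "Strongly Disagree",
}

def nearest_label(mean):
    """Map a cluster mean between 1 and 5 to the nearest scale point (halves round up)."""
    return SCALE[int(mean + 0.5)]

# Cluster center values quoted in the text (variables 1 to 4 for cluster 1,
# variables 1 and 2 for cluster 2).
cluster_1 = {"variable 1": 1.6, "variable 2": 2.4, "variable 3": 3.6, "variable 4": 3.0}
cluster_2 = {"variable 1": 3.5, "variable 2": 3.33}

for name, centers in (("cluster 1", cluster_1), ("cluster 2", cluster_2)):
    for variable, mean in centers.items():
        print(f"{name}, {variable}: mean {mean} -> {nearest_label(mean)}")
```

A mean such as 3.33 sits between two scale points, so the verbal label is a reading aid and not a substitute for comparing the means themselves across clusters.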

In this case, we will briefly describe each of the 4 clusters as follows:

Cluster 1: E-mail users, feel quality comes at a price, not careful spenders, do not like television much, do not think a car is a necessity, do not like fast food and ready to use products, think people are more health-conscious today, think foreign companies have somewhat increased the efficiency of companies, disagree that women are active purchasing decision makers, feel that politicians can play an active role, do not enjoy watching movies, might consider settling abroad, tend to buy branded products, do go out much on weekends and like to pay cash rather than charging to their credit cards (if they have one). It is thus a cluster exhibiting many traditional values, except that they have adapted to e-mail use. They are also beginning to loosen their purse strings.

Cluster 2: Regular letter writers, bargain hunters or aggressive buyers, not too particular about thinking before spending, do not value TV highly, believe the car is a luxury, not too fond of fast food and convenience products, do not think people are very health conscious, feel foreign companies have done us good, think women are active purchasing decision makers, do not believe in politicians, do not like movies, do not want to settle abroad, do not insist on branded products, do not go out on weekends, but do prefer credit cards for payments. It is a group which likes to use credit, spends more freely, believes in woman power, believes in economics rather than politics, and feels quality products can be cheap. They also seem to have a patriotic streak, as they do not want to settle abroad.

Cluster 3: Not too particular about e-mail, measure quality by price, do not think twice before buying, like TV, consider a car a luxury, not too fond of fast food, agree that people are health conscious, do not think foreign companies have made us efficient, believe in woman power, neutral on politicians, do not enjoy watching movies, not willing to settle abroad, always buy branded products, go out on weekends, prefer credit cards to cash. This group is free-spending but health-conscious, loyal to branded products and more outgoing than the other groups, yet unwilling to settle abroad.

Cluster 4: Not too particular about e-mail, measure quality by price, free spending, do not enjoy watching TV, think a car is necessary, not fond of fast food, think people are not health conscious, think foreign companies have made us efficient, do not believe in woman power, somewhat positive about politicians, not movie watchers, do not want to settle abroad, indifferent to branding, not outgoing and prefer cash to credit cards. This group is optimistic, free spending and not a good target for TV advertising, particularly for consumer durables and entertainment. But they are not necessarily influenced by brands. They may want value for money, but if they see value, they may spend a lot.

In summary, the cluster analysis of this sample of respondents tells us a lot about the possible segments which exist in the target population.

ANOVA (table: one row for each of the 15 variables; columns: Cluster Mean Square, Cluster df, Error Mean Square, Error df, F, Sig.)

The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal.

The SPSS output for a 2-cluster analysis is presented below:

Additional Comments on Cluster Analysis

Objects

We have looked at an example of classifying people, with interval-scaled data. It is also possible to classify objects such as brands, products or cities with cluster analysis. For example, we can ask which brands are clustered together in terms of consumer perceptions for a positioning exercise, or which cities are clustered together in terms of the income, education and age profile of their residents.

Number of Clusters

One of the main decisions of a researcher is how many clusters are present in the data. In certain cases, for example if we have a prior hypothesis about how many clusters ought to be present, this decision may already be made. Otherwise, it tends to be a subjective decision. One criterion that can be used in addition to the ones we have described is that every cluster must have a reasonable or minimum number of objects. That is, if a cluster comes out with only one or two objects in it, look for another solution. It may be useful to experiment with two or three possible solutions before deciding on the number of clusters.

Variables

Once readers are aware of the basics of cluster analysis, they can begin to use it creatively. For example, a cluster analysis can be done on some of the measured variables, and then other variables can be checked to see if they also exhibit differences across clusters. In the worked out example discussed earlier, only behavioural variables were used to get the 4 clusters. We could then see if the cluster members belonged to different places, had different education levels, or whether one gender figured predominantly in any one of the clusters.

Scale

Cluster analysis is ideally suited to interval-scaled variables, because Euclidean distance is a commonly used distance measure in the clustering process. Nominal and ordinal level data can be used after standardization, if appropriate. This may also necessitate the use of other distance measures that are more appropriate to the scales of the variables being used, and it should be done with care. In general, it is a good idea to standardize the variables before clustering if the units of measurement are radically different.
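Standardizing before clustering can be sketched as a z-score transformation. The income and age values below are hypothetical and the function name is our own. The sketch uses the sample standard deviation, which is an assumption on our part; the text does not say which form of standardization to use.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize a variable to mean 0 and sample standard deviation 1."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical respondents: annual income (currency units) and age (years).
income = [250000.0, 400000.0, 320000.0, 900000.0, 610000.0]
age = [23.0, 35.0, 41.0, 52.0, 29.0]

income_z = z_scores(income)
age_z = z_scores(age)

# After standardization both variables are on the same unit-free scale,
# so neither dominates a Euclidean distance simply because of its units.
for iz, az in zip(income_z, age_z):
    print(f"income z = {iz:+.2f}, age z = {az:+.2f}")
```

Without this step, a distance computed on the raw values would be driven almost entirely by income, because its numbers are several orders of magnitude larger than the ages.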

Statistical Tests

As mentioned briefly earlier, some statistical tests for cluster analysis are available. Their validity is questionable, however, so caution is recommended in using either ANOVA or any other test. A general caution about cluster analysis itself is that it tends to produce different results with different methods, and some methods are quite vulnerable to errors in the data. The stability of the clusters can therefore be checked by splitting the sample and repeating the cluster analysis on each part.