Statistics for Marketing & Consumer Research Copyright © 2008 - Mario Mazzocchi 1 Correspondence Analysis Chapter 14.



2 Correspondence analysis
A multivariate statistical technique which examines the association between two or more categorical variables and displays their categories jointly on a bivariate graph.
It can be used to apply multidimensional scaling to categorical variables.

3 Correspondence analysis and data reduction techniques
Factor and principal component analyses are only applied to metric (interval or ratio) quantitative variables.
Traditional multidimensional scaling deals with non-metric preference and perceptual data when those are on an ordinal scale.
Correspondence analysis allows data reduction (and graphical representation of dissimilarities) on non-metric nominal (categorical) variables.
The issue with categorical (non-ordinal) variables is how to measure distances between two objects: correspondence analysis exploits contingency tables and association measures.

4 Example (Trust data)
Do consumers with different jobs (q55) show preferences for some specific type of chicken (q6)?

5 Independence
If the two characters are independent, the number in each cell of the table should depend only on the row and column totals (lecture 9).
Measure the distance between the expected frequency in each cell and the actual (observed) frequency.
Compute a statistic (the Chi-square statistic) which allows one to test whether the difference between the expected and actual values is statistically significant.
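The independence check described on this slide can be sketched in a few lines of NumPy. The counts below are hypothetical (they are not the Trust data), and the variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical 3x3 contingency table (e.g. rows = occupations, columns = chicken types).
observed = np.array([[20, 10, 5],
                     [10, 15, 10],
                     [5, 10, 25]], dtype=float)

n = observed.sum()
row_totals = observed.sum(axis=1, keepdims=True)   # column vector of row totals
col_totals = observed.sum(axis=0, keepdims=True)   # row vector of column totals

# Under independence the expected count in each cell depends only on the margins.
expected = row_totals @ col_totals / n

# Pearson Chi-square statistic: sum of the standardized squared gaps over all cells.
chi_square = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(round(chi_square, 2), dof)
```

The statistic would then be compared with a Chi-square distribution with `dof` degrees of freedom to judge significance.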

6 Reducing the number of dimensions
The elements composing the Chi-square statistic are standardized metric values, one for each of the cells.
They become larger as the association between two specific characters increases.
These elements can be interpreted as a metric measure of distance.
The resulting matrix is similar to a covariance matrix.
A method similar to principal component analysis can be applied to this matrix to reduce the number of dimensions.

7 Coordinates
The principal component scores provide standardized values that can be used as coordinates.
One may apply the same data reduction technique first by rows (synthesizing occupation as a function of types of chicken), then by columns (synthesizing types of chicken as a function of occupation).
The first two components for each application generate a bivariate plot which shows both the occupation and the type of chicken in the same space.

8 Output from Correspondence Analysis
Executives prefer “Luxury” chicken.
Unemployed are closer to “Value” chicken.

9 Applications
It is possible to represent on the same graph consumer preferences for different brands and characteristics of a specific product (e.g. car brands together with colour, power, size, etc.).
This allows one to explore brand choice in relation to characteristics, opening the way to product modifications and innovations to meet consumer preferences.
Correspondence analysis is particularly useful when the variables have many categories.
The application to metric (continuous) data is not ruled out, but data need to be categorized first.

10 Summary
Correspondence analysis is a compositional technique which starts from a set of product attributes to portray the overall preference for a brand.
This technique is very similar to PCA and can be employed for data reduction purposes or to plot perceptual maps.
Because of the way it is constructed, correspondence analysis can be applied to either the rows or the columns of the data matrix.
For example, if rows represent brands and columns are different attributes:
1. By applying the method by rows one obtains the coordinates for the brands
2. The application by columns allows one to represent the attributes in the same graph

11 Steps to run correspondence analysis
Represent the data in a contingency table.
Translate the frequencies of the contingency table into a matrix of metric (continuous) distances through a set of Chi-square association measures on the row and column profiles.
Extract the dimensions (in a similar fashion to PCA).
Evaluate the explanatory power of the selected number of dimensions.
Plot row and column objects in the same coordinate space.

12 The frequency table
Rows: categorical variable X (k categories). Columns: categorical variable Y (l categories).

                 y1     y2     ...    yj     ...    yl     Row masses
x1               f11    f12    ...    f1j    ...    f1l    f10
x2               f21    f22    ...    f2j    ...    f2l    f20
...
xi               fi1    fi2    ...    fij    ...    fil    fi0
...
xk               fk1    fk2    ...    fkj    ...    fkl    fk0
Column masses    f01    f02    ...    f0j    ...    f0l    1

Each row of cell frequencies defines a row profile and each column defines a column profile; the marginal totals are the row masses and column masses.
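As a rough illustration of the quantities in the frequency table, the sketch below derives relative frequencies, masses and profiles from a table of counts. The counts are hypothetical, and rescaling each profile so that it sums to one is the usual convention in correspondence analysis rather than something stated on the slide.

```python
import numpy as np

# Hypothetical counts: rows are the k categories of X, columns the l categories of Y.
counts = np.array([[20, 10, 5],
                   [10, 15, 10],
                   [5, 10, 25]], dtype=float)

f = counts / counts.sum()        # relative frequencies f_ij (they sum to 1)
row_masses = f.sum(axis=1)       # f_i0, the last column of the table
col_masses = f.sum(axis=0)       # f_0j, the last row of the table

# A profile is a row (or column) divided by its mass, so that it sums to 1.
row_profiles = f / row_masses[:, None]
col_profiles = f / col_masses[None, :]

print(row_profiles.round(3))
```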

13 Interpretation of coordinates
The categories of the x variable can be seen as different coordinates for the points identified by the y variable.
The categories of the y variable can be seen as different coordinates for the points identified by the x variable.
Thus it is possible to represent the x and y categories as points in space, imposing (as in multidimensional scaling) that they respect some distance measure.

14 Representations
Take the row profile (the categories of x) and plot the categories in a bi-dimensional graph, using the categories of y to define the distances.
This allows one to compare nominal categories within the same variable: those categories of x which show similar levels of association with a given category of y can be considered as closer than those with very different levels of association with the same category of y.
The same procedure is carried out transposing the table, which means that the categories of y can be represented using the categories of x to define the distances.

15 Computing the distances
When the coordinates are defined simultaneously for the categories of x and y, the Chi-square value can be computed for each cell as follows.
Obtain the expected table frequencies under independence:
$$n_{ij}^{*} = \frac{n_{i0}\, n_{0j}}{n_{00}}, \qquad f_{ij}^{*} = \frac{n_{ij}^{*}}{n_{00}} = f_{i0}\, f_{0j}$$
where n_ij and f_ij are the absolute and relative frequencies, respectively; n_i0 and n_0j (or f_i0 and f_0j) are the marginal totals for row i and column j (the row masses and column masses), respectively; and n_00 is the sample size (hence the total relative frequency f_00 equals one).
The Chi-square value can now be computed for each cell (i,j):
$$\chi_{ij}^{2} = \frac{\left(f_{ij} - f_{ij}^{*}\right)^{2}}{f_{ij}^{*}}$$
These are the quadratic distances between category i of the x variable and category j of the y variable.
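A minimal sketch of the cell-by-cell computation, working with relative frequencies as in the slide's notation. The counts are hypothetical; the names `f_star` and `chi2_cells` are illustrative.

```python
import numpy as np

# Hypothetical contingency table of absolute frequencies n_ij.
counts = np.array([[20, 10, 5],
                   [10, 15, 10],
                   [5, 10, 25]], dtype=float)

n = counts.sum()                 # n_00, the sample size
f = counts / n                   # relative frequencies f_ij

# Expected relative frequencies under independence: f*_ij = f_i0 * f_0j.
f_star = np.outer(f.sum(axis=1), f.sum(axis=0))

# One Chi-square value per cell, based on relative frequencies.
chi2_cells = (f - f_star) ** 2 / f_star

# Summing the cells and multiplying by n recovers the usual Pearson statistic.
pearson = n * chi2_cells.sum()
print(chi2_cells.round(4))
print(round(pearson, 2))
```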

16 The distance matrix
The matrix χ² measures all of the associations between the categories of the first variable and those of the second one.
A generalization to the multivariate case (multiple correspondence analysis, MCA) is possible by stacking the matrix.
Stacking: compose a large matrix by blocks, where each block is the contingency matrix for two variables (all possible associations are taken into consideration).
The stacked matrix is referred to as the Burt table.
To obtain similarity values from the χ² matrix:
- compute the square root of the elemental Chi-square values
- use the appropriate sign (the sign of the difference f_ij − f_ij*)
- large positive values correspond to strongly associated categories
- large negative values identify those categories where the association is strong but negative, indicating dissimilarity
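The two operations on this slide can be sketched as follows. The data are hypothetical, and building the Burt table as Z'Z from a dummy-coded indicator matrix Z is the usual construction, assumed here rather than taken from the slide.

```python
import numpy as np

# --- Part 1: signed square roots of the cell Chi-square values ---
counts = np.array([[20, 10, 5],
                   [10, 15, 10],
                   [5, 10, 25]], dtype=float)      # hypothetical two-way table
f = counts / counts.sum()
f_star = np.outer(f.sum(axis=1), f.sum(axis=0))
chi2_cells = (f - f_star) ** 2 / f_star
# Positive entries: categories that attract each other; negative: they repel.
similarity = np.sign(f - f_star) * np.sqrt(chi2_cells)

# --- Part 2: Burt table for three hypothetical categorical variables ---
job = np.array([0, 1, 2, 0, 1, 2, 2, 0])           # category codes per respondent
chicken = np.array([1, 0, 1, 1, 0, 0, 1, 1])
region = np.array([0, 0, 1, 1, 2, 2, 0, 1])

def indicator(codes):
    """Dummy coding: one column per category, one row per respondent."""
    return np.eye(codes.max() + 1, dtype=int)[codes]

Z = np.hstack([indicator(v) for v in (job, chicken, region)])
burt = Z.T @ Z    # blocks = all pairwise contingency tables, stacked

print(similarity.round(3))
print(burt)
```

In `burt`, the off-diagonal block in rows 0-2 and columns 3-4 is the job-by-chicken contingency table, while the diagonal blocks hold the category counts of each variable.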

17 Estimation
The resulting matrix D contains metric and continuous similarity data.
It is possible to apply PCA to translate such a matrix into coordinates for each of the categories, first those of x, then those of y.
Before PCA can be applied, some normalization is required so that the input matrix becomes similar to a correlation matrix.
The use of the square root of the row (column) masses for normalizing the values in D represents the key difference from PCA.
The rest of the estimation process follows the results of the PCA.
As for PCA, eigenvalues are computed, one for each dimension, which can be used to evaluate the proportion of dissimilarity maintained by that dimension.
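One common way to carry out this estimation is through a singular value decomposition of the signed-root matrix. The sketch below follows that textbook formulation; it is an assumption about how the slide's steps map onto linear algebra, not a reproduction of the SPSS algorithm, and the counts are hypothetical.

```python
import numpy as np

counts = np.array([[20, 10, 5],
                   [10, 15, 10],
                   [5, 10, 25]], dtype=float)      # hypothetical table

P = counts / counts.sum()
r = P.sum(axis=1)                                  # row masses
c = P.sum(axis=0)                                  # column masses

# Signed square roots of the cell Chi-square values (the matrix D of the slide).
D = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# The SVD plays the role of PCA: squared singular values are the eigenvalues.
U, sv, Vt = np.linalg.svd(D, full_matrices=False)

# Keep min(rows, columns) - 1 dimensions; the remaining one is trivially zero.
k = min(counts.shape) - 1
U, sv, Vt = U[:, :k], sv[:k], Vt[:k, :]
eigenvalues = sv ** 2

# Normalizing by the square roots of the masses turns the components into coordinates.
row_coords = (U * sv) / np.sqrt(r)[:, None]        # categories of x
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]     # categories of y

print(eigenvalues.round(4))
print(row_coords.round(3))
```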

18 Inertia
Inertia is a measure of association between two categorical variables based on the Chi-square statistic.
In correspondence analysis the proportion of inertia explained by each of the dimensions can be regarded as a measure of goodness-of-fit, because the effectiveness of correspondence analysis depends on the degree of association between x and y.
Total inertia
- is a measure of the overall association between x and y
- is equal to the sum of the eigenvalues
- corresponds to the Chi-square value divided by the number of observations
- a total inertia above 0.20 is expected for adequate representations
Inertia values can be computed for each of the dimensions and represent the contribution of that dimension to the association (Chi-square) between the two variables.
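The two characterizations of total inertia (Chi-square divided by the number of observations, and the sum of the eigenvalues) can be checked numerically. This is a sketch on hypothetical counts, using the SVD formulation as an assumed stand-in for the PCA step.

```python
import numpy as np

counts = np.array([[20, 10, 5],
                   [10, 15, 10],
                   [5, 10, 25]], dtype=float)      # hypothetical table

n = counts.sum()
P = counts / n
expected = np.outer(P.sum(axis=1), P.sum(axis=0))  # expected relative frequencies

# Route 1: Pearson Chi-square divided by the number of observations.
chi_square = n * ((P - expected) ** 2 / expected).sum()
total_inertia = chi_square / n

# Route 2: sum of the eigenvalues (squared singular values) of the signed-root matrix.
D = (P - expected) / np.sqrt(expected)
eigenvalues = np.linalg.svd(D, compute_uv=False) ** 2

# Share of inertia carried by each dimension, and the running total.
share = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(share)

print(round(total_inertia, 4), round(eigenvalues.sum(), 4))
print(share.round(3), cumulative.round(3))
```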

19 SPSS example
EFS data set:
economic position of the household reference person (a093)
type of tenure (a121)
Their Pearson Chi-square value is 274, which indicates a significant association at the 99.9% confidence level.

20 Analysis
Define the range, i.e. the categories for each variable that enter the analysis.
Some categories can be indicated as supplementary: they appear in the graphical representation, but do not influence the actual estimation of the scores.
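A supplementary category can be handled by estimating the solution on the active categories only and then projecting the extra profile onto it. The sketch below uses the standard transition formula (profile times standard column coordinates); this is an assumption about the mechanics, not a description of SPSS internals, and all counts are hypothetical.

```python
import numpy as np

active = np.array([[20, 10, 5],
                   [10, 15, 10],
                   [5, 10, 25]], dtype=float)      # hypothetical active rows
supplementary = np.array([8, 12, 10], dtype=float) # hypothetical supplementary row

# Estimate the solution from the active table only.
P = active / active.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
D = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(D, full_matrices=False)
k = min(active.shape) - 1
U, sv, Vt = U[:, :k], sv[:k], Vt[:k, :]

col_std = Vt.T / np.sqrt(c)[:, None]               # standard column coordinates
row_coords = (U * sv) / np.sqrt(r)[:, None]        # principal row coordinates

def project_rows(row_counts):
    """Place row profiles on the map without changing the estimated axes."""
    row_counts = np.atleast_2d(row_counts)
    profiles = row_counts / row_counts.sum(axis=1, keepdims=True)
    return profiles @ col_std

sup_coords = project_rows(supplementary)
print(sup_coords.round(3))
```

Projecting the active rows with the same function returns their estimated coordinates, which is a convenient consistency check.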

21 Model options
Choose the number of dimensions to be retained
Choice of distance measure
Standardization (only for Euclidean distance)
Normalization: which variable should be privileged?

22 Number of dimensions
The maximum number of dimensions for the analysis is equal to the number of rows minus one, or the number of columns minus one (whichever is smaller).
In our example, the maximum number of dimensions would be five, which reduces to four due to missing values in one row category.
As shown later in this section, one may then choose to graphically represent only a subset of the extracted dimensions (usually two or three) to make interpretation easier.

23 Distance measure
Chi-square distance (as discussed earlier).
Euclidean distance:
- uses the square root of the sum of squared differences between pairs of rows and pairs of columns
- this also requires one to choose a method for centering the data (see the SPSS manual for details)
For this example, standard correspondence analysis (with the Chi-square distance) is used, which does not require a standardization method.
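To make the contrast concrete, the sketch below computes both distances between pairs of row profiles on hypothetical counts. The Chi-square distance weights each squared difference by the inverse column mass; the Euclidean version here is the plain unweighted one and does not reproduce the centering and standardization options offered by SPSS.

```python
import numpy as np

counts = np.array([[20, 10, 5],
                   [10, 15, 10],
                   [5, 10, 25]], dtype=float)      # hypothetical table

P = counts / counts.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
profiles = P / r[:, None]                          # row profiles

# Differences between every pair of row profiles: shape (rows, rows, columns).
diff = profiles[:, None, :] - profiles[None, :, :]

# Chi-square distance: squared differences weighted by 1 / column mass.
chi2_dist = np.sqrt((diff ** 2 / c).sum(axis=2))

# Plain Euclidean distance: square root of the sum of squared differences.
euclid_dist = np.sqrt((diff ** 2).sum(axis=2))

print(chi2_dist.round(3))
print(euclid_dist.round(3))
```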

24 Normalization method
Defines how correspondence analysis is run: whether to give priority to comparisons between the categories for x (rows) or those for y (columns).
This choice influences the way distances are summarized by the first dimensions.
Row principal normalization: the Euclidean distances in the final bivariate plot of x and y are as close as possible to the Chi-square distances between the rows, that is, the categories of x.
The opposite is valid for the column principal method.
Symmetrical normalization: the distances on the graph resemble as much as possible the distances for both x and y, by spreading the total inertia symmetrically.
Principal normalization: inertia is first spread over the scores for x, then y.
Weighted normalization: defines a weighting value between minus one and plus one, where minus one is the column principal, zero is symmetrical and plus one is the row principal.
EFS example: the row principal method is more appropriate, as it is more relevant to see how differences in socio-economic conditions impact on the tenure type than to look at distances between tenure types.
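The normalization options can be read as different ways of sharing the singular values between row and column scores. The sketch below assumes the exponents (1 + q)/2 for rows and (1 - q)/2 for columns for a weighting value q; that parameterization is an assumption chosen to match the slide's description (minus one, zero, plus one) and should be checked against the SPSS documentation. Counts and names are hypothetical.

```python
import numpy as np

counts = np.array([[20, 10, 5],
                   [10, 15, 10],
                   [5, 10, 25]], dtype=float)      # hypothetical table

P = counts / counts.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
D = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(D, full_matrices=False)
k = min(counts.shape) - 1
U, sv, Vt = U[:, :k], sv[:k], Vt[:k, :]

row_std = U / np.sqrt(r)[:, None]                  # standard row scores
col_std = Vt.T / np.sqrt(c)[:, None]               # standard column scores

def weighted_scores(q):
    """Assumed weighting: q = 1 row principal, 0 symmetrical, -1 column principal."""
    rows = row_std * sv ** ((1 + q) / 2)
    cols = col_std * sv ** ((1 - q) / 2)
    return rows, cols

rows_rp, cols_rp = weighted_scores(1)              # row principal
rows_sym, cols_sym = weighted_scores(0)            # symmetrical
rows_pr, cols_pr = row_std * sv, col_std * sv      # principal: both sets carry the inertia

print(rows_rp.round(3))
print(cols_rp.round(3))
```

Under row principal normalization the plotted distances between row points equal the Chi-square distances between row profiles when all dimensions are kept, which is why it suits comparisons among the categories of x.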

25 Additional statistics
Although CA is a nonparametric method, it is possible to compute standard deviations and correlations under the assumption of a multinomial distribution of the cell frequencies (when data are obtained as a random sample from a normally distributed population).
It is also possible to order the categories of x and y using the scores obtained from CA.
E.g. the tenure types and the socio-economic conditions might follow some ordering, but this cannot be defined with sufficient precision to consider these variables as ordinal.
One can use the scores in the first dimension (or the first two) to order the categories and produce a permuted correspondence table.
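A permuted table can be sketched by sorting rows and columns on their first-dimension scores. The counts and labels below are hypothetical, and the scores come from the SVD formulation, which is assumed here as a stand-in for the SPSS estimation.

```python
import numpy as np

counts = np.array([[5, 25, 10],
                   [20, 5, 10],
                   [10, 10, 15]], dtype=float)     # hypothetical table
row_labels = np.array(["x1", "x2", "x3"])          # hypothetical category labels
col_labels = np.array(["y1", "y2", "y3"])

P = counts / counts.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
D = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(D, full_matrices=False)

# Scores on the first dimension for the categories of x and y.
row_scores = U[:, 0] * sv[0] / np.sqrt(r)
col_scores = Vt[0, :] * sv[0] / np.sqrt(c)

# Reorder rows and columns by increasing first-dimension score.
row_order = np.argsort(row_scores)
col_order = np.argsort(col_scores)
permuted = counts[np.ix_(row_order, col_order)]

print(row_labels[row_order], col_labels[col_order])
print(permuted)
```

The sign of an SVD dimension is arbitrary, so the ordering may come out reversed, but rows and columns are always reversed together.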

26 Plots
Three graphs:
- Biplot (both x & y)
- x only (rows)
- y only (columns)
One usually chooses to represent only the first two or three of the extracted dimensions.

27 Output
The singular value (SV) is the square root of the inertia (the eigenvalue).
The Chi-square statistic suggests a strong and significant association.
The first dimension explains 85%, and the first two 93%, of total inertia. However, note that total inertia does not correspond to total variability, but to the variability of the extracted dimensions.
Usually a value of total inertia above 0.2 is regarded as acceptable.
These precision measures are based on the multinomial distribution assumption.

28 Row scores
The mass column shows the relative weight of each category on the sample.
Scores are computed for each category but the supplemental one, provided there are no missing data.
Scores are the coordinates for the map.
Shows how total inertia has been distributed across rows (similar to communalities).
These categories have a higher relevance because they are more important categories in the original correspondence table.
These two categories (especially retirement) strongly contribute to explaining the first dimension.
The second dimension is characterized by unemployed and part-time employees.

29 Column scores
The same exercise is carried out on columns; however, the row principal method does not normalize by column.
By column, the first dimension is especially related to the “owned by mortgage” and “owned outright” categories.

30 Bi-plot
Employed individuals are closer to owned accommodations.
Retired individuals are also close to owned accommodations.
Part-time employees and unemployed individuals are closer to rented accommodations and other forms of accommodation.

31 Multiple Correspondence Analysis (MCA)
When all variables are multiple nominal, the optimal scaling procedure applies MCA.

32 Plot with 3 variables
The analysis now also includes the government office region.

33 SAS correspondence analysis
SAS procedure: proc CORRESP
- simple correspondence analysis
- multiple correspondence analysis (option MCA)
- same types of normalization as SPSS: option PROFILE (ROW, COLUMN or BOTH)