Statistics for Marketing & Consumer Research Copyright © Mario Mazzocchi 1 Correspondence Analysis Chapter 14

Correspondence analysis
- A multivariate statistical technique that examines the association between two or more categorical variables and displays their categories jointly on a bivariate graph
- It can be used to apply multidimensional scaling to categorical variables

Correspondence analysis and data reduction techniques
- Factor and principal component analyses are only applied to metric (interval or ratio) quantitative variables
- Traditional multidimensional scaling deals with non-metric preference and perceptual data when those are on an ordinal scale
- Correspondence analysis allows data reduction (and graphical representation of dissimilarities) on non-metric nominal (categorical) variables
- The issue with categorical (non-ordinal) variables is how to measure distances between two objects: correspondence analysis exploits contingency tables and association measures

Example (Trust data)
Do consumers with different jobs (q55) show preferences for some specific type of chicken (q6)?

Independence
- If the two characters are independent, then the numbers in the cells of the table should depend only on the row and column totals (lecture 9)
- Measure the distance between the expected frequency in each cell and the actual (observed) frequency
- Compute a statistic (the Chi-square statistic) which allows one to test whether the difference between the expected and actual values is statistically significant
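
The independence check described above can be sketched in a few lines of Python with NumPy. This is only an illustration: the 3x4 table of counts is hypothetical (it is not the Trust data), and the variable names are chosen here for readability.

```python
import numpy as np

# Hypothetical contingency table of counts (rows = categories of x,
# columns = categories of y); NOT the Trust data.
observed = np.array([[20, 10, 5, 5],
                     [10, 20, 10, 5],
                     [5, 10, 30, 15]], dtype=float)

n = observed.sum()                               # sample size
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# Expected counts under independence depend only on the margins.
expected = row_totals @ col_totals / n

# One Chi-square element per cell; their sum is the Pearson statistic.
cell_chi2 = (observed - expected) ** 2 / expected
chi2 = cell_chi2.sum()

# Degrees of freedom for the test: (rows - 1) * (columns - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(f"Chi-square = {chi2:.2f} with {dof} degrees of freedom")
```

The statistic would then be compared with a Chi-square distribution with `dof` degrees of freedom to judge significance.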

Reducing the number of dimensions
- The elements composing the Chi-square statistic are standardized metric values, one for each of the cells
- They become larger as the association between two specific characters increases
- These elements can be interpreted as a metric measure of distance
- The resulting matrix is similar to a covariance matrix
- A method similar to principal component analysis can be applied to this matrix to reduce the number of dimensions

Coordinates
- The principal component scores provide standardized values that can be used as coordinates
- One may apply the same data reduction technique first by rows (synthesizing occupation as a function of types of chicken), then by columns (synthesizing types of chicken as a function of occupation)
- The first two components for each application generate a bivariate plot which shows both the occupation and the type of chicken in the same space

Output from Correspondence Analysis
- Executives prefer “Luxury” chicken
- Unemployed are closer to “Value” chicken

Applications
- It is possible to represent on the same graph consumer preferences for different brands and characteristics of a specific product (e.g. car brands together with colour, power, size, etc.)
- This allows one to explore brand choice in relation to characteristics, opening the way to product modifications and innovations to meet consumer preferences
- Correspondence analysis is particularly useful when the variables have many categories
- The application to metric (continuous) data is not ruled out, but data need to be categorized first

Summary
- Correspondence analysis is a compositional technique which starts from a set of product attributes to portray the overall preference for a brand
- This technique is very similar to PCA and can be employed for data reduction purposes or to plot perceptual maps
- Because of the way it is constructed, correspondence analysis can be applied to either the rows or the columns of the data matrix
- For example, if rows represent brands and columns are different attributes:
  1. By applying the method by rows one obtains the coordinates for the brands
  2. The application by columns allows one to represent the attributes in the same graph

Steps to run correspondence analysis
1. Represent the data in a contingency table
2. Translate the frequencies of the contingency table into a matrix of metric (continuous) distances through a set of Chi-square association measures on the row and column profiles
3. Extract the dimensions (in a similar fashion to PCA)
4. Evaluate the explanatory power of the selected number of dimensions
5. Plot row and column objects in the same coordinate space

The frequency table
Rows: categorical variable X (k categories). Columns: categorical variable Y (l categories).

                 y1     y2     …    yj     …    yl     Row masses
x1               f11    f12    …    f1j    …    f1l    f10
x2               f21    f22    …    f2j    …    f2l    f20
…
xi               fi1    fi2    …    fij    …    fil    fi0
…
xk               fk1    fk2    …    fkj    …    fkl    fk0
Column masses    f01    f02    …    f0j    …    f0l    1

Each row of the table (x1, …, xk) gives a row profile and each column (y1, …, yl) a column profile. The row totals fi0 are the row masses and the column totals f0j are the column masses.
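
As a small illustration of the quantities in the table, the sketch below derives relative frequencies, masses and profiles from a hypothetical table of counts. It assumes the common definition in which a profile is a row (or column) of relative frequencies divided by its mass, so that each profile sums to one; the data and names are illustrative only.

```python
import numpy as np

# Hypothetical counts: rows = categories of X, columns = categories of Y.
counts = np.array([[20, 10, 5, 5],
                   [10, 20, 10, 5],
                   [5, 10, 30, 15]], dtype=float)

F = counts / counts.sum()        # relative frequencies f_ij (sum to 1)
row_masses = F.sum(axis=1)       # f_i0
col_masses = F.sum(axis=0)       # f_0j

# Profiles: conditional distributions within each row / each column.
row_profiles = F / row_masses[:, None]
col_profiles = F / col_masses[None, :]

print("Row masses:   ", np.round(row_masses, 3))
print("Column masses:", np.round(col_masses, 3))
print("Row profiles:\n", np.round(row_profiles, 3))
```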

Interpretation of coordinates
- The categories of the x variable can be seen as different coordinates for the points identified by the y variable
- The categories of the y variable can be seen as different coordinates for the points identified by the x variable
- Thus it is possible to represent the x and y categories as points in space, imposing (as in multidimensional scaling) that they respect some distance measure

Representations
- Take the row profile (the categories of x) and plot the categories in a two-dimensional graph, using the categories of y to define the distances
- This allows one to compare nominal categories within the same variable: those categories of x which show similar levels of association with a given category of y can be considered closer than those with very different levels of association with the same category of y
- The same procedure is carried out after transposing the table, which means that the categories of y can be represented using the categories of x to define the distances

Computing the distances
When the coordinates are defined simultaneously for the categories of x and y, the Chi-square value can be computed for each cell as follows.

Obtain the expected table frequencies:

$$f_{ij}^{*} = \frac{n_{i0}}{n_{00}} \cdot \frac{n_{0j}}{n_{00}} = f_{i0}\, f_{0j}$$

where $n_{ij}$ and $f_{ij}$ are the absolute and relative frequencies, respectively, $n_{i0}$ and $n_{0j}$ (or $f_{i0}$ and $f_{0j}$) are the marginal totals for row i and column j (the row masses and column masses), respectively, and $n_{00}$ is the sample size (hence the total relative frequency $f_{00}$ equals one).

The Chi-square value can now be computed for each cell (i,j):

$$\chi^{2}_{ij} = \frac{\left(f_{ij} - f_{ij}^{*}\right)^{2}}{f_{ij}^{*}}$$

These are the quadratic distances between category i of the x variable and category j of the y variable.

The distance matrix
- The matrix χ² measures all of the associations between the categories of the first variable and those of the second one
- A generalization to the multivariate case (MCA) is possible by stacking the matrix
- Stacking: compose a large matrix by blocks, where each block is the contingency matrix for two variables (all possible associations are taken into consideration)
- The stacked matrix is referred to as the Burt table
- To obtain similarity values from the χ² matrix:
  - compute the square root of the elemental Chi-square values
  - use the appropriate sign (the sign of the difference f_ij – f_ij*)
  - large positive values correspond to strongly associated categories
  - large negative values identify those categories where the association is strong but negative, indicating dissimilarity
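
The signed square-root step can be sketched as follows. The table of counts is hypothetical and the name `D` is used here only for convenience for the resulting similarity matrix; the sketch works with relative frequencies, so the expected value for cell (i,j) is the product of the row and column masses.

```python
import numpy as np

# Hypothetical counts, for illustration only.
counts = np.array([[20, 10, 5, 5],
                   [10, 20, 10, 5],
                   [5, 10, 30, 15]], dtype=float)

F = counts / counts.sum()                          # f_ij
expected = np.outer(F.sum(axis=1), F.sum(axis=0))  # f*_ij = f_i0 * f_0j

# Elemental Chi-square values, one per cell.
cell_chi2 = (F - expected) ** 2 / expected

# Square root with the sign of (f_ij - f*_ij):
# positive = categories occur together more often than expected,
# negative = less often than expected.
D = np.sign(F - expected) * np.sqrt(cell_chi2)

print(np.round(D, 3))
```

Algebraically this is the same as `(F - expected) / np.sqrt(expected)`, which is the form often used in practice.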

Estimation
- The resulting matrix D contains metric and continuous similarity data
- It is possible to apply PCA to translate such a matrix into coordinates for each of the categories, first those of x, then those of y
- Before PCA can be applied, some normalization is required so that the input matrix becomes similar to a correlation matrix
- The use of the square root of the row (column) masses for normalizing the values in D represents the key difference from PCA
- The rest of the estimation process follows that of PCA
- As for PCA, eigenvalues are computed, one for each dimension, which can be used to evaluate the proportion of dissimilarity maintained by that dimension
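
One common way to carry out this estimation is through a singular value decomposition (SVD) of the matrix of signed Chi-square roots, followed by a rescaling by the square roots of the masses. The sketch below follows that formulation under stated assumptions: the data are hypothetical, and this is not necessarily the exact routine implemented in SPSS or SAS.

```python
import numpy as np

# Hypothetical counts, for illustration only.
counts = np.array([[20, 10, 5, 5],
                   [10, 20, 10, 5],
                   [5, 10, 30, 15]], dtype=float)

F = counts / counts.sum()
row_masses = F.sum(axis=1)
col_masses = F.sum(axis=0)
expected = np.outer(row_masses, col_masses)

# Signed square roots of the elemental Chi-square values.
D = (F - expected) / np.sqrt(expected)

# SVD plays the role of the PCA step; squared singular values
# are the eigenvalues (one per dimension).
U, singular_values, Vt = np.linalg.svd(D, full_matrices=False)
eigenvalues = singular_values ** 2

# Coordinates: singular vectors divided by the square root of the
# masses and scaled by the singular values (principal coordinates).
row_coords = (U / np.sqrt(row_masses)[:, None]) * singular_values
col_coords = (Vt.T / np.sqrt(col_masses)[:, None]) * singular_values

# The last singular value is numerically zero: at most
# min(rows, columns) - 1 dimensions carry information.
print("Eigenvalues:", np.round(eigenvalues, 4))
print("Row coordinates (first two dimensions):\n", np.round(row_coords[:, :2], 3))
```

The first two columns of `row_coords` and `col_coords` are what would be plotted in a two-dimensional map.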

Inertia
- Inertia is a measure of association between two categorical variables based on the Chi-square statistic
- In correspondence analysis the proportion of inertia explained by each of the dimensions can be regarded as a measure of goodness-of-fit, because the effectiveness of correspondence analysis depends on the degree of association between x and y
- Total inertia
  - is a measure of the overall association between x and y
  - is equal to the sum of the eigenvalues
  - corresponds to the Chi-square value divided by the number of observations
  - a total inertia above 0.20 is expected for adequate representations
- Inertia values can be computed for each of the dimensions and represent the contribution of that dimension to the association (Chi-square) between the two variables
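
The relationships listed above (total inertia equals the sum of the eigenvalues and the Chi-square value divided by the number of observations) can be checked numerically. The sketch below uses a hypothetical table and the SVD formulation, which is an assumption about the computation rather than a description of any specific package.

```python
import numpy as np

# Hypothetical counts, for illustration only.
counts = np.array([[20, 10, 5, 5],
                   [10, 20, 10, 5],
                   [5, 10, 30, 15]], dtype=float)

n = counts.sum()
F = counts / n
expected = np.outer(F.sum(axis=1), F.sum(axis=0))

# Pearson Chi-square on the original counts.
chi2 = n * ((F - expected) ** 2 / expected).sum()

# Eigenvalues (inertia per dimension) from the signed-root matrix.
S = (F - expected) / np.sqrt(expected)
singular_values = np.linalg.svd(S, compute_uv=False)
eigenvalues = singular_values ** 2

total_inertia = eigenvalues.sum()
proportion = eigenvalues / total_inertia   # share of inertia per dimension
cumulative = np.cumsum(proportion)

print(f"Total inertia = {total_inertia:.4f}, Chi-square / n = {chi2 / n:.4f}")
print("Proportion of inertia:", np.round(proportion, 3))
print("Cumulative proportion:", np.round(cumulative, 3))
```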

SPSS example
EFS data set:
- economic position of the household reference person (a093)
- type of tenure (a121)
Their Pearson Chi-square value is 274, which indicates a significant association at the 99.9% confidence level.

Analysis
- Define the range, i.e. the categories for each variable that enter the analysis
- Some categories can be indicated as supplementary: they appear in the graphical representation, but do not influence the actual estimation of the scores

Model options
- Choose the number of dimensions to be retained
- Choice of distance measure
- Standardization (only for Euclidean distance)
- Normalization: which variable should be privileged?

Number of dimensions
- The maximum number of dimensions for the analysis is equal to the number of rows minus one or the number of columns minus one, whichever is smaller
- In our example, the maximum number of dimensions would be five, which reduces to four due to missing values in one row category
- As shown later in this section, one may then choose to graphically represent only a subset of the extracted dimensions (usually two or three) to make interpretation easier
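
The rule for the maximum number of dimensions can be written as a small helper. The function name `max_dimensions` and the 6x8 table are hypothetical (the actual dimensions of the EFS table are not given here); the sketch assumes that a category with no observations is dropped before counting.

```python
import numpy as np

def max_dimensions(table):
    """Maximum number of dimensions: min(rows, columns) - 1,
    counting only categories with at least one observation."""
    table = np.asarray(table, dtype=float)
    n_rows = int((table.sum(axis=1) > 0).sum())
    n_cols = int((table.sum(axis=0) > 0).sum())
    return min(n_rows, n_cols) - 1

# Hypothetical table with 6 row categories and 8 column categories.
table = np.ones((6, 8))
full = max_dimensions(table)

# Same table, but one row category has no observations.
table[3, :] = 0
reduced = max_dimensions(table)

print(full, reduced)
```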

Distance measure
- Chi-square distance (as discussed earlier)
- Euclidean distance
  - uses the square root of the sum of squared differences between pairs of rows and pairs of columns
  - this also requires one to choose a method for centering the data (see the SPSS manual for details)
- For this example we use standard correspondence analysis (with the Chi-square distance), which does not require a standardization method

Normalization method
- Defines how correspondence analysis is run: whether to give priority to comparisons between the categories of x (rows) or those of y (columns)
- This choice influences the way distances are summarized by the first dimensions
- Row principal normalization: the Euclidean distances in the final bivariate plot of x and y are as close as possible to the Chi-square distances between the rows, that is, the categories of x
- Column principal normalization: the opposite holds, so the plot reproduces the Chi-square distances between the columns, that is, the categories of y
- Symmetrical normalization: the distances on the graph resemble as much as possible the distances for both x and y, by spreading the total inertia symmetrically
- Principal normalization: inertia is first spread over the scores for x, then y
- Weighted normalization: defines a weighting value between minus one and plus one, where minus one is column principal, zero is symmetrical and plus one is row principal
- EFS example: the row principal method is more appropriate, as it is more relevant to see how differences in socio-economic conditions affect tenure type than to look at distances between tenure types
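
The effect of the weighting value can be illustrated with a short sketch. It assumes a commonly used convention in which, for a weight q between -1 and +1, row scores are scaled by the singular values raised to (1 + q)/2 and column scores by the singular values raised to (1 - q)/2; this convention, the function name `scores` and the data are assumptions made for illustration, not a statement of the exact SPSS algorithm. The final lines check the row principal property on the full set of dimensions.

```python
import numpy as np

# Hypothetical counts, for illustration only.
counts = np.array([[20, 10, 5, 5],
                   [10, 20, 10, 5],
                   [5, 10, 30, 15]], dtype=float)

F = counts / counts.sum()
r = F.sum(axis=1)                      # row masses
c = F.sum(axis=0)                      # column masses
S = (F - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, sv, Vt = np.linalg.svd(S, full_matrices=False)
k = min(F.shape) - 1                   # drop the trivial (zero) dimension
U, sv, Vt = U[:, :k], sv[:k], Vt[:k]

def scores(q):
    """q = +1 row principal, q = 0 symmetrical, q = -1 column principal."""
    rows = (U / np.sqrt(r)[:, None]) * sv ** ((1 + q) / 2)
    cols = (Vt.T / np.sqrt(c)[:, None]) * sv ** ((1 - q) / 2)
    return rows, cols

rows_rp, cols_rp = scores(1.0)         # row principal
rows_sym, cols_sym = scores(0.0)       # symmetrical

# Row principal: Euclidean distance between two row points equals the
# Chi-square distance between the corresponding row profiles.
profiles = F / r[:, None]
chi2_dist = np.sqrt(((profiles[0] - profiles[1]) ** 2 / c).sum())
plot_dist = np.linalg.norm(rows_rp[0] - rows_rp[1])

print(f"Chi-square distance = {chi2_dist:.4f}, distance on the map = {plot_dist:.4f}")
```

When only the first two dimensions are plotted, the distances on the map approximate rather than reproduce the Chi-square distances.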

Additional statistics
- Although CA is a nonparametric method, it is possible to compute standard deviations and correlations under the assumption of a multinomial distribution of the cell frequencies (when data are obtained as a random sample from a normally distributed population)
- A permuted correspondence table allows one to order the categories of x and y using scores obtained from CA
  - E.g. the tenure types and the socio-economic conditions might follow some ordering, but this cannot be defined with sufficient precision to consider these variables as ordinal
  - One can use the scores in the first dimension (or the first two) to order the categories and produce a permuted correspondence table

Plots
- Three graphs:
  - biplot (both x and y)
  - x only (rows)
  - y only (columns)
- One usually chooses to represent only the first two or three of the extracted dimensions

Output
- The singular value (SV) is the square root of the inertia (the eigenvalue)
- The Chi-square statistic suggests a strong and significant association
- The first dimension explains 85% of total inertia, the first two 93%
- However, note that total inertia does not correspond to total variability, but to the variability of the extracted dimensions
- Usually a value of total inertia above 0.2 is regarded as acceptable
- The precision measures (standard deviations and correlations) are based on the multinomial distribution assumption

Row scores
- The mass column shows the relative weight of each category in the sample
- Scores are computed for each category except the supplementary one, provided there are no missing data
- Scores are the coordinates for the map
- The output also shows how total inertia has been distributed across rows (similar to communalities)
- These categories have a higher relevance because they are more important categories in the original correspondence table
- These two categories (especially retirement) strongly contribute to explaining the first dimension
- The second dimension is characterized by unemployed and part-time employees

Column scores
- The same exercise is carried out on columns; however, the row principal method does not normalize by column
- By column, the first dimension is especially related to the “owned by mortgage” and “owned outright” categories

Bi-plot
- Employed individuals are closer to owned accommodations
- Retired individuals are also close to owned accommodations
- Part-time employees and unemployed individuals are closer to rented accommodations and other forms of accommodation

Multiple Correspondence Analysis (MCA)
When all variables are multiple nominal, then optimal scaling applies MCA
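
The stacked matrix used in the multivariate case (the Burt table) can be built from an indicator matrix with one dummy column per category. The sketch below shows only this construction step, not a full MCA; the five respondents, the integer coding of the three nominal variables and the helper name `indicator_matrix` are hypothetical.

```python
import numpy as np

# Hypothetical data: 5 respondents, 3 nominal variables coded as integers.
data = np.array([[0, 1, 2],
                 [1, 0, 2],
                 [0, 0, 1],
                 [1, 1, 0],
                 [0, 1, 1]])

def indicator_matrix(data):
    """One 0/1 column per category of each variable, blocks side by side."""
    blocks = []
    for column in data.T:
        categories = np.unique(column)
        blocks.append((column[:, None] == categories[None, :]).astype(float))
    return np.hstack(blocks)

Z = indicator_matrix(data)

# Burt table: all two-way contingency tables stacked by blocks.
# Diagonal blocks hold the category counts of each variable,
# off-diagonal blocks are the cross-tabulations of pairs of variables.
burt = Z.T @ Z

print(burt.astype(int))
```

A simple correspondence analysis applied to this stacked table is one way of obtaining coordinates for the categories of all variables at once.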

Plot with 3 variables
The analysis now also includes the government office region

SAS correspondence analysis
- SAS procedure: proc CORRESP
- simple correspondence analysis
- multiple correspondence analysis (option MCA)
- same types of normalization as SPSS: option PROFILE (ROW, COLUMN or BOTH)