Slide 1: Association, log-linear analysis and canonical correlation analysis
Statistics for Marketing & Consumer Research, Chapter 9. Copyright © 2008 Mario Mazzocchi.
Slide 2: Association between qualitative variables
Association is a generic term referring to the relationship between two variables. Correlation measures refer strictly to quantitative variables, so the term association is generally used for qualitative variables. Two qualitative variables are said to be associated when changes in one variable correspond to changes in the other (i.e. they are not independent); for example, education is generally associated with job position. Association measures for categorical variables are based on tables of frequencies, also termed contingency tables.
Slide 3: Contingency tables
Contingency tables show the joint frequencies of two categorical variables. The marginal totals, that is the row and column totals of the contingency table, represent the univariate frequency distribution of each of the two variables. If the variables are independent, one would expect the distribution of frequencies across the internal cells of the table to depend only on the marginal totals and the sample size.
Slide 4: Contingency table (frequencies)
Example contingency table of observed frequencies (shown on the slide).
Slide 5: Independent variables
In probability terms, two events are regarded as independent when their joint probability is the product of the probabilities of the two individual events: Prob(X=a, Y=b) = Prob(X=a)Prob(Y=b). Similarly, two categorical variables are independent when the joint probability of two categorical outcomes is equal to the product of the probabilities of the individual outcomes for each variable.
Slide 6: Expected frequencies under independence
Thus, the frequencies within the contingency table should not be too different from the expected values under independence (see the formula below), where:
n_ij and f_ij are the absolute and relative frequencies, respectively
n_i0 and n_0j (or f_i0 and f_0j) are the marginal totals for row i and column j, respectively
n_00 is the sample size (hence the total relative frequency f_00 equals one).
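Under this standard textbook notation, the expected cell frequencies under independence (the equation shown on the original slide) can be written as:

```latex
n^{*}_{ij} = \frac{n_{i0}\, n_{0j}}{n_{00}},
\qquad\text{or, in relative terms,}\qquad
f^{*}_{ij} = f_{i0}\, f_{0j}
```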
Slide 7: Independence and association – testing
The further the empirical frequencies are from the expected frequencies under independence, the more the two categorical variables are associated. Thus, a synthetic measure of association is given by the statistic written out below.
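In the notation introduced on slide 6, this synthetic measure is the Chi-square statistic (the formula on the original slide is not reproduced in the extracted text; this is its standard form):

```latex
\chi^{2} = \sum_{i}\sum_{j} \frac{\left(n_{ij} - n^{*}_{ij}\right)^{2}}{n^{*}_{ij}},
\qquad n^{*}_{ij} = \frac{n_{i0}\, n_{0j}}{n_{00}}
```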
Slide 8: The Chi-square statistic
The more distant the actual joint frequencies are from the expected ones, the larger the Chi-square statistic. The observed frequencies may differ from the expected values f*_ij because of random error, so the discrepancy can be tested using a statistical tool, the Chi-square distribution: under the independence assumption the statistic has a known probability distribution, so its empirical value can be associated with a probability value to test independence. As usual, the basic principle is to measure the probability that the discrepancy between the expected and observed values is due to randomness only. If this probability value (from the theoretical Chi-square distribution) is very low, i.e. below the significance threshold, then one rejects the null hypothesis of independence between the two variables and proceeds assuming some association.
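As an illustration of the test logic described above, here is a minimal Python sketch using SciPy on a made-up contingency table; the counts and variable labels are hypothetical and are not the book's data.

```python
# A minimal sketch: chi-square test of independence for a hypothetical
# 2x3 contingency table (rows = gender, columns = trust level).
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[52, 38, 30],
                     [45, 41, 44]])

chi2, p_value, dof, expected = chi2_contingency(observed)

print("Expected counts under independence:\n", expected.round(1))
print(f"Chi-square = {chi2:.2f}, df = {dof}, p-value = {p_value:.3f}")

# Decision rule used on the slides: reject independence at the 5% level
if p_value < 0.05:
    print("Reject independence: the variables appear associated.")
else:
    print("Independence cannot be rejected at the 5% significance level.")
```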
Slide 9: Other association measures
Contingency coefficient: ranges from zero (independence) to values close to one for strong association, but its value depends on the shape (number of rows and columns) of the contingency table.
Cramér's V: bound between zero and one and does not suffer from the above shortcoming (although strong associations may translate into relatively low values, below 0.5); a computational sketch follows below.
Goodman and Kruskal's lambda: for strictly nominal variables; compares predictions obtained for one of the variables using two different methods, one which only considers the marginal frequency distribution for that variable, the other which picks the most likely values after considering the distribution of the other variable.
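A minimal sketch of Cramér's V computed from the Chi-square statistic, using the same hypothetical table as above:

```python
# A minimal sketch: Cramér's V for a hypothetical contingency table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[52, 38, 30],
                     [45, 41, 44]])

chi2, _, _, _ = chi2_contingency(observed)
n = observed.sum()            # total sample size
r, c = observed.shape         # number of rows and columns

cramers_v = np.sqrt(chi2 / (n * (min(r, c) - 1)))
print(f"Cramér's V = {cramers_v:.3f}")  # 0 = independence, 1 = perfect association
```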
Slide 10: Other association measures (continued)
Uncertainty coefficient: like Goodman and Kruskal's lambda, but considers the reduction in the prediction error rather than the rate of correct predictions.
For ordinal variables (see the sketch after this list):
Gamma statistic (between minus one and one; zero indicates independence)
Somers' d statistic, an adjustment of the Gamma statistic that accounts for the direction of the relationship
Kendall's Tau-b and Tau-c statistics, for square and rectangular tables, respectively
These statistics check all pairs of values assumed by the two variables to see whether (a) a category increase in one variable leads to a category increase in the second one (positive association); (b) the opposite happens (negative association); or (c) the ordering of one variable is independent of the ordering of the other.
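A minimal sketch of two of these ordinal measures using SciPy (requires SciPy 1.7 or newer for somersd). The two ordered variables and their codes are hypothetical, and the Gamma statistic is not available in SciPy, so only tau-b and Somers' d are shown.

```python
# A minimal sketch of ordinal association measures for two hypothetical
# ordered variables (e.g. education level and job position, coded 1..k).
from scipy.stats import kendalltau, somersd

education = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]
job_level = [1, 2, 1, 2, 3, 2, 3, 3, 4, 3, 4, 4]

# Kendall's tau-b (scipy's default variant handles ties)
tau_b, p_tau = kendalltau(education, job_level)
print(f"Kendall's tau-b = {tau_b:.3f} (p = {p_tau:.3f})")

# Somers' d of job_level given education (directional measure)
res = somersd(education, job_level)
print(f"Somers' d = {res.statistic:.3f} (p = {res.pvalue:.3f})")
```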
Slide 11: Directional vs. symmetric measures
Directional measures (e.g. Somers' d) assume that the change in one variable depends on the change in the other variable (there is a direction); symmetric measures assume no direction.
Slide 12: Association measures in SPSS
Screenshot of the SPSS dialog listing the available association statistics.
Slide 13: Chi-square test in SPSS
SPSS output: contingency table and Chi-square test. As the p-value is above 0.05, the hypothesis of independence cannot be rejected at the 95% confidence level.
Slide 14: Other association measures (symmetric)
SPSS output for the symmetric association measures.
Slide 15: Directional association measures
SPSS output for the directional association measures.
Slide 16: More than two variables: three-way contingency tables
Example of a three-way contingency table.
Slide 17: Log-linear analysis
The objective of log-linear analysis is to explore the association between more than two categorical variables, to check whether the associations are significant, and to explore how the variables are associated. Log-linear analysis can be applied by considering a general log-linear model.
Slide 18: Log-linear analysis: saturated model
Consider the three variables of the three-way contingency table: Trust in government (T), Gender (G) and Country (C). The frequency of each cell in the table can be rewritten as a product of main and interaction effects (the equation is written out below), where:
n_ijk is the frequency for trust level i, gender j and country k
u_G is the main effect of Gender (and similarly u_T and u_C for Trust and Country)
u_GT is the interaction effect of Gender and Trust (and similarly u_GC and u_CT)
u_GCT is the interaction effect of Gender, Trust and Country
the scale parameter depends on the total number of observations
The frequency of each cell is fully explained when all of the main and interaction effects are considered (the model is saturated).
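A standard way of writing the saturated multiplicative model described above; the symbol η for the scale parameter and the parenthesised subscripts are notational choices, not taken from the slide. Here u_{GT(i,j)} denotes the Gender-Trust interaction for trust level i and gender j, and so on.

```latex
n_{ijk} = \eta \; u_{T(i)}\, u_{G(j)}\, u_{C(k)}\;
          u_{GT(i,j)}\, u_{GC(j,k)}\, u_{TC(i,k)}\; u_{GCT(i,j,k)}
```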
Slide 19: Interpretation of the model
The u terms represent the main and interaction effects and can be interpreted in terms of expected relative frequencies. For example, in a two-by-two contingency table with no interaction one would have n_ij = N f_i0 f_0j. If instead there is dependence (a relevant interaction), the frequencies of a two-by-two contingency table are exactly explained (this is in fact a saturated model) by multiplying this expression by an interaction term (shown in brackets on the slide), which reflects the frequency explained by the interaction and equals one under independence.
Slide 20: The log-linear model
By taking logarithms one moves from a multiplicative to a linear (additive) form. The saturated model is not very useful, as it fits the data perfectly and does not tell much about the relevance of each of the effects. Thus, log-linear analysis checks whether simplified log-linear models are as good as the saturated model in predicting the table frequencies.
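Taking logs of the saturated model written earlier (with λ denoting the logged effects; again the symbols are a notational choice) gives the additive form:

```latex
\log n_{ijk} = \lambda + \lambda_{T(i)} + \lambda_{G(j)} + \lambda_{C(k)}
             + \lambda_{GT(i,j)} + \lambda_{GC(j,k)} + \lambda_{TC(i,k)}
             + \lambda_{GCT(i,j,k)}
```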
Slide 21: What log-linear analysis does
1. Computes the main and interaction effects for the saturated model.
2. Simplifies the saturated model by deleting (according to a given rule) some of the main and interaction effects, and obtains estimates for all of the main and interaction effects left in the simplified model.
3. Compares the simplified model with the benchmark model. If the simplified model performs well, it goes back to step 2 and attempts further simplification; otherwise it stops.
Slide 22: Simplified models
For example, suppose that the three-way interaction term is omitted; the log-linear model then contains an error term, and the effects can no longer be computed exactly. They can, however, be estimated through a regression-like model (see the sketch after this list), where:
the dependent variable is the (logarithm of the) cell frequency
the explanatory variables are a set of dummy variables with value one when a main effect or interaction is relevant to that cell of the contingency table and zero otherwise
the estimated coefficients are the (logarithms of the) corresponding main or interaction effects.
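One way to estimate such a model outside SPSS is as a Poisson regression on the cell counts. The sketch below is an illustration under assumptions: the data frame, the column names trust, gender, country and count, and the counts themselves are hypothetical, and this is not the book's SPSS routine.

```python
# A minimal sketch: fit a log-linear model without the three-way interaction
# as a Poisson regression on the cell counts of a 3-way contingency table,
# then compare it to the saturated model with a likelihood-ratio test.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Hypothetical aggregated data: one row per cell of the 3-way table.
cells = pd.DataFrame({
    "trust":   ["low", "low", "low", "low", "high", "high", "high", "high"],
    "gender":  ["F", "F", "M", "M", "F", "F", "M", "M"],
    "country": ["UK", "DE", "UK", "DE", "UK", "DE", "UK", "DE"],
    "count":   [40, 35, 30, 28, 25, 33, 38, 31],
})

# Saturated model: all main effects and interactions, fits the counts exactly.
saturated = smf.glm("count ~ trust * gender * country",
                    data=cells, family=sm.families.Poisson()).fit()

# Simplified model: drop the three-way interaction, keep all two-way terms.
simplified = smf.glm("count ~ (trust + gender + country)**2",
                     data=cells, family=sm.families.Poisson()).fit()

# Likelihood-ratio test of the simplification (difference in deviance).
lr_stat = simplified.deviance - saturated.deviance
df_diff = saturated.df_model - simplified.df_model
p_value = chi2.sf(lr_stat, df_diff)
print(f"LR statistic = {lr_stat:.2f}, df = {df_diff}, p-value = {p_value:.3f}")
# A large p-value means the simplified model predicts the cell
# frequencies about as well as the saturated model.
```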
Slide 23: Hierarchical log-linear analysis
The procedure proceeds hierarchically backward:
Delete the highest-order interaction first (u_GCT)
Delete the lower-order interactions (u_GC, u_CT, u_GT), one by one, two together, or all three together
Delete the main effects (u_G, u_T, u_C), one by one, two together, or all three together
Slide 24: Hierarchical LLA in SPSS
Select the categorical variables and define their range of values; select backward elimination for hierarchical LLA.
Slide 25: Options
The Options dialog provides the (exact) estimates for the saturated model and the association table, which is useful for deciding which terms to delete.
Slide 26: Output
K-way effects: K=1 corresponds to the main effects, K=2 to the two-way interactions and K=3 to the three-way interaction. One test assesses the effect of deleting a given k-way effect and all effects of a higher order; a second test assesses the effect of deleting that k-way effect only. Here the three-way interaction can be omitted, but the other effects seem to be relevant.
Slide 27: Deletion of terms
It is now possible to look within a given k-way class (partial association table). Deletion of the flagged terms does not make the prediction of the contingency table cells worse compared to the model that includes them.
Slide 28: Specification search
At each step the output indicates which term can be eliminated; the search stops when no more terms can be eliminated.
Slide 29: Further steps
The hierarchical procedure stops when it cannot eliminate all the effects of a given order. However, the partial association table showed that the main effect for country might not be relevant, so it may be desirable to test another model in which that main effect is eliminated.
Slide 30: Further steps
SPSS screenshot: select the variables, then open the dialog to define the model.
Slide 31: The model
Specify the model by entering the two-way interaction terms retained from the hierarchical analysis and deleting the main effect for q64 (the country variable).
Slide 32: Output
The model is still acceptable after deleting the country main effect.
Slide 33: And the winner is...
This model explains the contingency table cells almost as well as the saturated (exact) model. Thus, (a) the interaction among country, trust level and gender and (b) the interaction between trust level and gender are not relevant.
Slide 34: Parameter estimates
These can be regarded as size effects (how relevant is each term? comparisons across terms are allowed): check the Z statistic. Check the SPSS output (click on Options to request the estimates). Odds ratios (the ratio between the Z estimates for different cells) indicate the ratio of the probabilities of ending up in a given cell relative to the cell chosen as benchmark.
Slide 35: Odds-ratio example
Compare UK females (Z=3.97) with German females (Z=1.02). The ratio is about four, which means that the interaction between being female and from the UK is about four times more important than the interaction between being female and from Germany in explaining departure from a flat distribution, and that the effect is positive (it increases frequencies). This suggests that in the contingency table it is more likely to find UK females than German females, after accounting for all other effects.
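The "about four" quoted above is simply the ratio of the two reported estimates:

```latex
\frac{Z_{\text{UK female}}}{Z_{\text{German female}}} = \frac{3.97}{1.02} \approx 3.9
```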
Slide 36: Canonical Correlation Analysis (CCA) (1)
This technique allows one to explore the relationship between a set of dependent variables and a set of explanatory variables. Multiple regression analysis can be seen as a special case of canonical correlation analysis in which there is a single dependent variable. CCA is applicable to both metric and non-metric variables.
Slide 37: CCA (2)
Link with correlation analysis: canonical correlation is the method that maximizes the correlation between two sets of variables rather than between individual variables.
Example: the relationship between attitudes towards chicken and general food lifestyles in the Trust data-set. Attitudes towards chicken are measured through a set of variables which include taste, perceived safety, value for money, etc. (items in q12). Lifestyle measurement is based on agreement with statements like "I purchase the best quality food I can afford" or "I am afraid of things that I have never eaten before" (items in q25).
Slide 38: Canonical variates and canonical correlation
CCA relates two sets of variables. To do so, it needs to combine the variables within each set into two composite measures which can then be correlated. As in standard correlation analysis, this synthesis consists of a linear combination of the original variables for each set, leading to the estimation of canonical variates (linear composites). The bivariate correlation between the two canonical variates is the canonical correlation.
Slide 39: Canonical correlation equations
1) m dependent variables, y_1, y_2, ..., y_m
2) k independent variables, x_1, x_2, ..., x_k
The objective is to estimate several (say c) pairs of canonical variates (written out below) such that the (canonical) correlation between the canonical variates YS_1 and XS_1 is the highest, followed by the correlation between YS_2 and XS_2, and so on. Furthermore, the extracted canonical variates are uncorrelated with each other, so that CORR(YS_i, YS_j)=0 and CORR(XS_i, XS_j)=0 for any i≠j, which also implies CORR(YS_i, XS_j)=0.
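The canonical variates referred to above are linear combinations of the two sets of variables; with a and b as generic symbols for the canonical coefficients (the exact symbols on the slide are not reproduced in the extracted text) they read:

```latex
YS_i = a_{i1}\, y_1 + a_{i2}\, y_2 + \dots + a_{im}\, y_m,
\qquad
XS_i = b_{i1}\, x_1 + b_{i2}\, x_2 + \dots + b_{ik}\, x_k,
\qquad i = 1, \dots, c
```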
Slide 40: Canonical functions
The bivariate linear relationship between variates, YS_i = f(XS_i), is the i-th canonical function. The maximum number of canonical functions c (canonical variates) is equal to m or k, whichever is smaller. CCA estimates the canonical coefficients (the a and b weights) in such a way that they maximize the canonical correlation between the two variates. The coefficients are usually normalized so that each canonical variate has a variance of one. The method can be generalized to deal with partial canonical correlation (controlling for other sets of variables) and non-linear canonical correlation (where the canonical variates show a non-linear relationship).
Slide 41: Output elements
Canonical loadings: linear correlations between each of the original variables and its respective canonical variate.
Cross-loadings: correlations with the opposite canonical variate.
Eigenvalues (or canonical roots): the squared canonical correlations; they represent how much of the original variability is shared by the two canonical variates of each canonical correlation.
Canonical scores: the value of the canonical function for each observation, based on the canonical variates.
Canonical redundancy index: measures how much of the variance in one of the canonical variates is explained by the other canonical variate.
A computational sketch of these quantities follows below.
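A minimal sketch of these output elements computed outside SPSS with scikit-learn's CCA. Everything here is illustrative: the data are simulated, the variable counts are made up, and the redundancy formula used (mean squared loading times the canonical root) is one common convention rather than the book's exact definition.

```python
# A minimal sketch: canonical correlation analysis with scikit-learn,
# recovering loadings, cross-loadings, canonical roots, scores and a
# redundancy index. Data are simulated for illustration only.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))                    # e.g. lifestyle items
Y = 0.5 * X[:, :3] + rng.normal(size=(n, 3))   # e.g. attitude items, related to X

c = min(X.shape[1], Y.shape[1])                # max number of canonical functions
cca = CCA(n_components=c).fit(X, Y)
X_scores, Y_scores = cca.transform(X, Y)       # canonical scores

# Canonical correlations and eigenvalues (canonical roots)
canon_corr = np.array([np.corrcoef(X_scores[:, i], Y_scores[:, i])[0, 1]
                       for i in range(c)])
roots = canon_corr ** 2
print("Canonical correlations:", canon_corr.round(3))
print("Canonical roots (squared correlations):", roots.round(3))

# Loadings: correlations between the Y variables and their own variates
y_loadings = np.corrcoef(Y.T, Y_scores.T)[:Y.shape[1], Y.shape[1]:]
# Cross-loadings: correlations between the Y variables and the X variates
y_cross_loadings = np.corrcoef(Y.T, X_scores.T)[:Y.shape[1], Y.shape[1]:]
print("Y loadings:\n", y_loadings.round(2))
print("Y cross-loadings:\n", y_cross_loadings.round(2))

# Redundancy index for the first canonical function:
# mean squared loading of the Y variables times the first canonical root
redundancy_y1 = (y_loadings[:, 0] ** 2).mean() * roots[0]
print(f"Redundancy of Y given X, first function: {redundancy_y1:.3f}")
```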
Slide 42: Canonical correlation analysis in SPSS
There is no menu-driven routine for CCA: a macro written through the command (syntax) editor is necessary.
Slide 43: Canonical correlation macro
Screenshot of the macro in the syntax editor: indicate the path to the SPSS directory, list the variables of the two sets, then run the program.
Slide 44: Output
Values of the canonical correlations. The first three correlations are different from zero at the 95% confidence level.
Slide 45: Canonical coefficients for the 1st set of canonical variates (SPSS output)
Slide 46: Canonical coefficients for the 2nd set of canonical variates (SPSS output)
Slide 47: Loadings (correlations) (SPSS output)
Slide 48: Cross-loadings (SPSS output)
Slide 49: Redundancy analysis
SPSS output showing the percentage of variance explained by each canonical variate.