Slide 1: Correlation and regression (Chapter 8)
Statistics for Marketing & Consumer Research, © 2008 Mario Mazzocchi
Slide 2: Correlation
- Measures the strength and direction of a relationship between two metric variables.
- Example: the relationship between weight and height, or between consumption and price. It is not a perfect (deterministic) relationship, but we expect to find one on average.
Slide 3: Correlation
- Correlation measures to what extent two (or more) variables are related.
- It expresses a relationship that is not necessarily precise (e.g. height and weight).
- Positive correlation indicates that the two variables move in the same direction; negative correlation indicates that they move in opposite directions.
- The question is: do the two variables move together?
Slide 4: Covariance
- Covariance measures the co-movement of two variables x and y across observations.
- Sample covariance estimate: $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
- For each observation i, a situation where both x and y are above or below their respective sample means increases the covariance, while a situation where one variable is above its sample mean and the other is below decreases it.
- Unlike variance, covariance can take both positive and negative values. If x and y always move in opposite directions, all terms in the summation are negative, leading to a large negative covariance; if they always move in the same direction, the covariance is large and positive.
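A minimal numpy sketch of the estimate above; the price/consumption numbers are invented purely for illustration:

```python
import numpy as np

# Illustrative data: price (x) and consumption (y) for n = 6 observations
x = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
y = np.array([9.0, 8.0, 8.5, 7.0, 6.5, 6.0])

n = len(x)
# Sample covariance: sum of cross-deviations divided by n - 1
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry matches
assert np.isclose(s_xy, np.cov(x, y)[0, 1])
print(s_xy)  # negative here: price and consumption move in opposite directions
```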
Slide 5: From covariance to correlation
- Covariance, like variance, depends on the measurement units: measuring prices in dollars and consumption in ounces yields a different covariance value than prices in euros and consumption in kilograms, even when both refer to exactly the same goods and observations.
- Some form of normalization is needed to avoid the measurement-unit problem.
- The usual approach is standardization, which requires subtracting the mean and dividing by the standard deviation.
Slide 6: Correlation
In the covariance expression the numerator is already based on differences from the means, so all that is required is dividing by the sample standard deviations of both x and y:
$r_{xy} = \frac{s_{xy}}{s_x s_y}$
Slide 7: Correlation coefficient
- The correlation coefficient r measures the relationship between two variables on a scale from -1 to +1.
- r = 0 means no correlation.
- r = +1 means perfect positive correlation.
- r = -1 means perfect negative correlation.
- Perfect correlation indicates that a p% variation in x always corresponds to a p% variation in y.
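A short sketch showing that r is simply the covariance of the standardized variables, reusing the illustrative data from the covariance example:

```python
import numpy as np

x = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
y = np.array([9.0, 8.0, 8.5, 7.0, 6.5, 6.0])

# Standardize: subtract the mean, divide by the sample standard deviation
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r = np.sum(zx * zy) / (len(x) - 1)

assert np.isclose(r, np.corrcoef(x, y)[0, 1])  # matches the built-in
print(r)  # always in [-1, 1]; close to -1 here (strong negative correlation)
```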
Slide 8: Correlation and causation
- Note that no assumption or consideration is made about causality: a positive correlation between x and y does not mean that an increase in x leads to an increase in y, only that the two variables move together (to some extent).
- Correlation is therefore symmetric: $r_{xy} = r_{yx}$.
Slide 9: Correlation as a sample statistic
- Correlation is more than an indicator; it can be regarded as a sample statistic, which allows hypothesis testing.
- The sample measure of correlation is affected by sampling error: for example, a small but positive correlation observed in a sample might hide a null (or negative) true correlation in the population.
Slide 10: Correlation and inference
Some assumptions (checks) are needed:
a) the relationship between the two variables should be linear (a scatterplot allows the identification of non-linear relationships);
b) the error variance should be similar for different correlation levels;
c) the two variables should come from similar statistical distributions;
d) if the two variables can be assumed to derive from normal distributions, it becomes possible to run hypothesis tests.
Slide 11: Hypothesis testing on correlations
- The fourth condition (d above) can be ignored when the sample is large enough (fifty or more observations). In these cases one can exploit the probabilistic nature of sampling to run a hypothesis test on the sample correlation coefficient.
- The null hypothesis to be tested is that the correlation coefficient in the population is zero.
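A sketch of this test on simulated data. Under the null hypothesis the statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$ follows a t distribution with n-2 degrees of freedom, which is the p-value scipy.stats.pearsonr reports:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=60)
y = 0.3 * x + rng.normal(size=60)   # weak positive relationship, n = 60

r, p = stats.pearsonr(x, y)          # r and two-sided p-value for H0: rho = 0

# The same test by hand
n = len(x)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)
assert np.isclose(p, p_manual)
print(f"r = {r:.3f}, p = {p:.4f}")   # reject H0 at the 5% level if p < 0.05
```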
Slide 12: Bivariate correlations
There are two elements to consider:
- the value of the correlation coefficient r, which indicates to what extent the two variables move together;
- the significance of the correlation (a p-value), which supports the decision whether the hypothesis that r = 0 (no correlation in the population) should be rejected.
Examples:
- A correlation coefficient r = 0.6 suggests a relatively strong relationship, but a p-value well above 0.05 means that the hypothesis of zero correlation in the population cannot be rejected at the 95% confidence level.
- With r = 0.1 and a p-value below 0.01 (thanks to a larger sample), one can be confident at the 99% level that there is a positive relationship between the two variables, although the relationship is weak.
Slide 13: The influence of third factors
- The correlation coefficient between x and y is only meaningful if one can safely assume that no other intervening variable affects the values of x and y.
- In the supermarket example, a negative correlation between prices and consumption is expected. However, suppose that one day the government introduces a new tax which reduces average available income by 10%. Consumers have less money and consume less. The supermarket tries to keep its customers by cutting all prices, so that the price reduction mitigates the impact of the new tax. If we only observe prices and consumption, we may observe lower prices together with lower consumption, and the bivariate correlation coefficient might return a positive value.
- Thus the correlation coefficient can only be used when the ceteris paribus condition holds (all other relevant variables being constant). This is rarely the case, so it is necessary to control for other influential variables, like income in the price-consumption relationship.
Slide 14: Partial correlation
- The partial correlation coefficient evaluates the relationship between two variables after controlling for the effects of one or more additional variables.
- For example, if x is price, y is consumption and z is income, the partial correlation coefficient corrects the correlation between x and y for the correlation between x and z and the correlation between y and z:
  $r_{xy\cdot z} = \frac{r_{xy} - r_{xz}\,r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}$
- This can be generalized to control for more variables.
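A sketch of the first-order formula above. The data-generating process is invented to mirror the tax example on the previous slide: income (z) drives both price and consumption, so the raw correlation between price and consumption comes out positive, while the partial correlation (holding income fixed) is negative:

```python
import numpy as np

def partial_corr(x, y, z):
    """First-order partial correlation r_xy.z: the correlation of x and y
    after removing the linear effect of z from both."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

rng = np.random.default_rng(0)
z = rng.normal(size=200)                       # income
x = z + rng.normal(size=200)                   # price tracks income
y = 1.5 * z - 0.5 * x + rng.normal(size=200)   # consumption: income up, price down

print(np.corrcoef(x, y)[0, 1])   # raw correlation: positive (confounded by z)
print(partial_corr(x, y, z))     # negative once income is controlled for
```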
Slide 15: Other correlation statistics
- Part (semi-partial) correlation: controls for the correlation between the influencing variable z and only one of the two variables x and y.
- Non-parametric correlation statistics (rank-order association): Spearman's rho and Kendall's tau-b.
- Multiple correlation coefficient (regression analysis): the joint relationship between one variable (the dependent variable) and a set of other variables.
Slide 16: Correlation and covariance in SPSS
Choose between bivariate and partial correlations.
Slide 17: Bivariate correlation
In the dialog box:
- select the variables you want to analyse;
- request the significance level (two-tailed);
- optionally request descriptive statistics (incl. covariance);
- optionally request non-parametric association measures.
Slide 18: Pearson bivariate correlation output (SPSS screenshot)
Slide 19: Non-parametric tests output (SPSS screenshot)
Slide 20: Partial correlations
In the dialog box, specify the list of variables to be analysed and the control variables.
Slide 21: Partial correlation output
Partial correlations still measure the correlation between two variables, but eliminate the effect of other variables; i.e. the correlation reflects the relationship between consumption and income for consumers facing the same price.
Slide 22: Bivariate linear regression
The model is $y_i = \alpha + \beta x_i + \varepsilon_i$, where y is the dependent variable, x the explanatory variable, $\alpha$ the intercept, $\beta$ the regression coefficient and $\varepsilon$ the (random) error term.
- Causality (from x to y) is now assumed.
- "Regressing" stands for going backwards from the dependent variable to its determinant.
- The error term embodies anything which is not accounted for by the linear relationship.
- The unknown parameters ($\alpha$ and $\beta$) need to be estimated, usually on sample data. We refer to the sample parameter estimates as a and b.
Slide 23: Least squares estimation of the unknown parameters
- For given values of the parameters, the error (residual) term for each observation is $e_i = y_i - (a + b x_i)$.
- The least squares parameter estimates are those that minimize the sum of squared errors: $\min_{a,b} \sum_{i=1}^{n} e_i^2$.
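A sketch of the closed-form least squares solution for the bivariate case, $b = s_{xy}/s_x^2$ and $a = \bar{y} - b\bar{x}$, reusing the illustrative data from the earlier examples (the prediction step anticipates slide 27):

```python
import numpy as np

x = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5])   # price
y = np.array([9.0, 8.0, 8.5, 7.0, 6.5, 6.0])   # consumption

# Closed-form least squares estimates
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()

e = y - (a + b * x)        # residuals; they sum to (numerically) zero
print(a, b, np.sum(e**2))  # intercept, slope, sum of squared errors

# Prediction for a new value of the explanatory variable
x_new = 5.0
print(a + b * x_new)

# Same answer from numpy's degree-1 polynomial fit
b_np, a_np = np.polyfit(x, y, 1)
assert np.allclose([a, b], [a_np, b_np])
```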
Slide 24: Assumptions on the error term (1)
1. The error term has a zero mean.
2. The variance of the error term does not vary across cases (homoskedasticity).
3. The error term for each case is independent of the error terms for other cases.
4. The error term is also independent of the values of the explanatory (independent) variable.
5. The error term is normally distributed.
Slide 25: Assumptions on the error term (2)
1. The error term has a zero mean (otherwise there would be a systematic bias, which should be captured by the intercept).
2. The variance of the error term does not vary across cases (homoskedasticity): for example, the error variability should not become larger for cases with very large values of the independent variable. Heteroskedasticity is the opposite condition.
3. The error term for each case is independent of the error terms for other cases. The omission of relevant explanatory variables would break this assumption, as an omitted independent variable (correlated across cases by definition) ends up in the residual term and induces correlation.
Slide 26: Assumptions on the error term (3)
4. The error term is also independent of the values of the explanatory (independent) variable; otherwise the variable would not be truly independent, being affected by changes in the dependent variable.
   - A frequent problem is sample selection bias, which occurs when non-probabilistic samples are used, i.e. the sample only includes units from a specific group.
   - Example: measuring response to advertising by sampling those who purchase a deodorant after seeing an advert. Those who decided not to buy the deodorant are not taken into account, even if they saw the advert; these correlated observations do not enter the analysis, and this leads to a correlated error term.
5. The error term is normally distributed. This corresponds to the assumption that the dependent variable is normally distributed for any value of the independent variable(s). Normality makes life easier with hypothesis testing, but there are ways to overcome the problem if the distribution is not normal.
Slide 27: Prediction
Once a and b have been estimated, it is possible to predict the value of the dependent variable for any given value of the explanatory variable: $\hat{y} = a + b x$.
Slide 28: Model evaluation
- An evaluation of model performance can be based on the residuals, which indicate how well the model predictions fit the original data (goodness-of-fit).
- Since the parameters a and b are estimated on a sample, just like a mean, they come with standard errors, which measure the precision of the estimates and depend on the sample size.
- Knowledge of the standard errors opens the way to hypothesis testing and confidence intervals for the regression coefficients (see lecture 6).
Slide 29: Hypothesis testing on regression coefficients
t-test on each individual coefficient:
- Null hypothesis: the corresponding population coefficient is zero.
- t statistic: simply divide the estimate (for example a, the estimate of $\alpha$) by its standard error ($s_a$).
- The p-value allows one to decide whether to reject the null hypothesis that $\alpha = 0$, depending on the confidence level.
F-test (multiple independent variables, as discussed later):
- It is run jointly on all coefficients of the regression model.
- Null hypothesis: all coefficients are zero.
- The F-test in linear regression corresponds to the ANOVA test (and the GLM is a regression model which can be adopted to run ANOVA techniques).
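A sketch of the t-test for the slope in the bivariate case, using the standard error formula $se(b) = \sqrt{s^2/\sum_i(x_i-\bar{x})^2}$ with residual variance $s^2 = SSE/(n-2)$, on the same illustrative data as before:

```python
import numpy as np
from scipy import stats

x = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
y = np.array([9.0, 8.0, 8.5, 7.0, 6.5, 6.0])
n = len(x)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()
e = y - (a + b * x)

s2 = np.sum(e**2) / (n - 2)                   # residual variance
se_b = np.sqrt(s2 / np.sum((x - x.mean())**2))  # standard error of b

t = b / se_b                                   # H0: beta = 0
p = 2 * stats.t.sf(abs(t), df=n - 2)           # two-sided p-value
print(f"b = {b:.3f}, t = {t:.2f}, p = {p:.4f}")
```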
Slide 30: Coefficient of determination
- The sum of squared errors, $SSE = \sum_i (y_i - \hat{y}_i)^2$, measures the variability which is not explained by the regression model.
- It is a portion of the total variation, $SST = \sum_i (y_i - \bar{y})^2$, which decomposes as $SST = SSR + SSE$, where $SSR = \sum_i (\hat{y}_i - \bar{y})^2$ is the portion of variability explained by the regression model.
- The natural candidate for measuring how well the model fits the data is the coefficient of determination, $R^2 = SSR/SST = 1 - SSE/SST$, which varies between 0 (the model does not explain any of the variability of the dependent variable) and 1 (the model fits the data perfectly).
- In a BIVARIATE REGRESSION, the coefficient of determination $R^2$ is the square of the correlation coefficient between y and x.
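A quick numerical check of the decomposition and of the bivariate identity $R^2 = r_{xy}^2$:

```python
import numpy as np

x = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
y = np.array([9.0, 8.0, 8.5, 7.0, 6.5, 6.0])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

sst = np.sum((y - y.mean())**2)       # total variation
sse = np.sum((y - y_hat)**2)          # unexplained variation
ssr = np.sum((y_hat - y.mean())**2)   # explained variation
assert np.isclose(sst, ssr + sse)     # SST = SSR + SSE

r2 = 1 - sse / sst
assert np.isclose(r2, np.corrcoef(x, y)[0, 1]**2)  # bivariate case only
print(r2)
```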
Slide 31: Bivariate regression in SPSS (dialog screenshot)
Slide 32: Regression output
- Only 5% of total variation is explained by the model (the correlation is 0.23).
- The F-test rejects the hypothesis that all coefficients are zero.
- Both parameters are statistically different from zero according to the t-test.
Slide 33: Multiple regression
The principle is identical to bivariate regression, but there are more explanatory variables: $y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i$.
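With several regressors, the least squares estimates minimize $\|y - X\beta\|^2$ over the design matrix X. A minimal sketch with simulated data (variable names and coefficients are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)          # e.g. household size
x2 = rng.normal(size=n)          # e.g. price
y = 1.0 + 0.5 * x1 - 0.8 * x2 + rng.normal(size=n)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), x1, x2])

# Least squares: solve min ||y - X beta||^2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # approximately [1.0, 0.5, -0.8]
```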
Slide 34: Additional assumption
6. The independent variables are also independent of each other. Otherwise we run into a double-counting problem and it becomes very difficult to separate their individual effects. Consequences:
- inefficient estimates;
- an apparently good model, but poor forecasts.
Slide 35: Collinearity
- Assumption 6 refers to the so-called collinearity (or multicollinearity) problem: collinearity exists when two explanatory variables are correlated.
- Perfect collinearity: one of the variables has a perfect (+1 or -1) correlation with another variable, or with a linear combination of more than one variable. This makes estimation impossible.
- Strong collinearity makes the coefficient estimates unstable and inefficient, which means that the standard errors of the estimates are inflated compared to the best possible solution. Furthermore, the solution becomes very sensitive to the choice of which variables to include in the model.
- When there is multicollinearity the model might look very good at first glance, but produces poor forecasts.
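Slide 40 below reads collinearity off tolerance and VIF values. A sketch of how the variance inflation factor can be computed (regress each column on the others), with invented data in which two columns are nearly collinear:

```python
import numpy as np

def vif(X):
    """Variance inflation factors: VIF_j = 1 / (1 - R_j^2), where R_j^2
    comes from regressing column j on the remaining columns plus an intercept.
    Tolerance is the reciprocal, 1 - R_j^2."""
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        e = X[:, j] - others @ coef
        r2 = 1 - e.var() / X[:, j].var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
print(vif(np.column_stack([x1, x2, x3])))  # very large VIFs for x1 and x2
```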
Slide 36: Goodness-of-fit
- The coefficient of determination $R^2$ (still computed as the ratio between SSR and SST) always increases with the inclusion of additional regressors.
- This conflicts with the parsimony principle: models with many explanatory variables are more demanding in terms of data (higher costs) and computation, and if alternative nested models are compared, those with more explanatory variables always show a better fit.
- Thus a proper indicator is the adjusted $R^2$, which accounts for the number of explanatory variables (k) in relation to the number of observations (n): $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}$.
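A one-function sketch of the adjustment. The sample size n = 100 and k = 2 regressors are assumptions chosen for illustration (the deck does not state them), but with these values the adjustment maps the R² of 0.193 reported on slide 39 to roughly the 0.176 reported there:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Penalise R^2 for the number of explanatory variables k,
    given n observations: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.193, n=100, k=2))  # about 0.176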
Slide 37: Multiple regression in SPSS
- Analyze / Regression / Linear.
- Simply select more than one explanatory variable.
- Click on STATISTICS for collinearity diagnostics and further statistics.
Slide 38: Additional statistics
Part and partial correlations are provided.
Slide 39: Output
- The model accounts for 19.3% of the variability in the dependent variable; after adjusting for the number of regressors, the $R^2$ is 0.176.
- The null hypothesis that all coefficients are zero is strongly rejected.
Slide 40: Output
- Only the household size and price parameters emerge as significantly different from 0.
- Tolerance values close to 1 and low VIFs indicate that multicollinearity is not an issue.
Slide 41: Coefficient interpretation (intercept)
- The constant represents the amount spent when all other variables are zero. The estimate is negative, but the hypothesis that the constant is zero is not rejected.
- A household with zero members and no income is unlikely to consume chicken.
- In general, estimates of the intercept are often unsatisfactory, because frequently there are no data points with values of the independent variables close or equal to zero.
Slide 42: Coefficient interpretation
The significant coefficients tell us that:
- each additional household member means an increase in consumption of 277 grams;
- a £1 increase in price leads to a decrease in consumption of 108 grams.
Slide 43: Stepwise regression procedure
Stepwise regression explores each single explanatory variable before entering it in the model. The procedure (see the sketch after this list):
1. Add the variable which shows the highest bivariate correlation with the dependent variable.
2. Explore the partial correlations of all remaining potential independent variables (after controlling for the independent variables already included in the model), and let the explanatory variable with the highest partial correlation coefficient enter the model.
3. Re-estimate the model with two explanatory variables; the decision whether to keep the second one is based on the variation of the F-value or other goodness-of-fit statistics such as the adjusted R-square, information criteria, etc. If the variation is not significant, the second variable is not included in the model; otherwise it stays, and the process continues with the inclusion of a third variable (go back to step 2).
At each step, the procedure may drop one of the variables already included in the model if this causes no significant decrease in the F-value (or whatever stepwise criterion is targeted).
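A simplified sketch of the forward part of this procedure. To stay self-contained it adds variables greedily by adjusted R² gain rather than by the F-test the slide describes, and the stopping threshold is an arbitrary illustration, not SPSS's default criterion:

```python
import numpy as np

def r2_of(X, y):
    """R^2 of an OLS regression of y on the columns of X (plus intercept)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    e = y - Xd @ coef
    return 1 - np.sum(e**2) / np.sum((y - y.mean())**2)

def forward_select(X, y, threshold=0.01):
    """Greedy forward selection: at each step add the column that raises
    adjusted R^2 the most; stop when no gain exceeds the threshold."""
    n, k_all = X.shape
    selected, best_adj = [], -np.inf
    while len(selected) < k_all:
        gains = []
        for j in range(k_all):
            if j in selected:
                continue
            cols = selected + [j]
            r2 = r2_of(X[:, cols], y)
            adj = 1 - (1 - r2) * (n - 1) / (n - len(cols) - 1)
            gains.append((adj, j))
        adj, j = max(gains)
        if adj - best_adj < threshold:
            break               # no worthwhile improvement: stop
        selected.append(j)
        best_adj = adj
    return selected, best_adj

# Invented data: only columns 0 and 2 actually matter
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 5))
y = 2.0 - 0.9 * X[:, 0] + 0.4 * X[:, 2] + rng.normal(size=150)
print(forward_select(X, y))   # typically selects columns 0 and 2
```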
Slide 44: Forward and backward
- Forward regression works exactly like stepwise regression, but variables are only entered, never dropped. The process stops when there is no further significant increase in the F-value.
- Backward regression starts by including all the independent variables and works backward (according to the stepwise approach): at each step the procedure drops the variable which causes the minimum decrease in the F-value, and it stops once any further removal would cause a significant decrease.
Slide 45: Stepwise regression in SPSS
Choose the variable selection procedure in the Method box of the dialog and proceed as usual.
Slide 46: Stepwise regression output
- The first model only includes the "average price" variable.
- In the second step, household size is included.
- No other variable enters the model.
Slide 47: Regression, ANOVA and the GLM
- (Factorial) ANOVA can be seen as a regression model where all explanatory variables are binary (dummy) variables.
- Each dummy variable indicates whether a given factor is present or not.
- The t-tests on the coefficients of the dummies are mean-comparison tests.
- The F-test on the regression model is the test for factorial (n-way) ANOVA.
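A sketch of this equivalence for the simplest case, one factor with two levels and invented data: regressing y on a 0/1 dummy reproduces the two-sample mean comparison, and the one-way ANOVA F statistic equals t²:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
g = np.repeat([0, 1], 50)                 # factor coded as a 0/1 dummy
y = 10 + 2.0 * g + rng.normal(size=100)   # group 1 mean is 2 units higher

# Regression on the dummy: intercept = group-0 mean, slope = mean difference
X = np.column_stack([np.ones(100), g])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.isclose(b, y[g == 1].mean() - y[g == 0].mean())

# The coefficient test matches the two-sample t-test, and F = t^2
t_test = stats.ttest_ind(y[g == 1], y[g == 0])   # pooled-variance t-test
f_test = stats.f_oneway(y[g == 0], y[g == 1])    # one-way ANOVA
print(b, t_test.statistic, f_test.statistic)     # F is approximately t^2
```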