CHAPTER 17: CORRELATION AND REGRESSION
PRODUCT MOMENT CORRELATION: A statistic summarizing the strength of association between two metric variables. Examples: How strongly are sales related to advertising expenditures? Is there any association between market share and size of the sales force? The product moment correlation, r, is the most widely used statistic for this purpose. It was originally proposed by Karl Pearson, so it is also known as the Pearson correlation coefficient.
From a sample of n observations on X and Y, the product moment correlation, r, can be calculated as follows:

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

Dividing the numerator and denominator by n - 1 gives:

$$r = \frac{COV_{xy}}{S_x S_y}$$

In these equations, $\bar{X}$ and $\bar{Y}$ denote the sample means, $S_x$ and $S_y$ the standard deviations, and $COV_{xy}$ the covariance between X and Y.
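As a hedged illustration of the two forms of r, the following Python sketch computes the coefficient once from the sums of deviations and once as the covariance divided by the product of the standard deviations. The data values and the function names `pearson_r` and `pearson_r_from_cov` are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math

# Hypothetical sample of n = 5 paired observations (illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def pearson_r(x, y):
    """Product moment correlation from sums of deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    sum_yy = sum((yi - mean_y) ** 2 for yi in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


def pearson_r_from_cov(x, y):
    """Same statistic written as COVxy / (Sx * Sy), using n - 1 divisors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    return cov_xy / (s_x * s_y)


r = pearson_r(x, y)
r_cov = pearson_r_from_cov(x, y)
print(round(r, 4), round(r_cov, 4))
```

The two functions agree because the n - 1 divisors in the covariance and in the two standard deviations cancel.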
Covariance: A systematic relationship between two variables in which a change in one implies a corresponding change in the other ($COV_{xy}$). The covariance may be either positive or negative. The statistical significance of the relationship between two variables measured by r can be conveniently tested. The hypotheses are:

$$H_0: \rho = 0 \qquad H_1: \rho \neq 0$$

The test statistic is:

$$t = r \left[ \frac{n-2}{1-r^2} \right]^{1/2}$$

which has a t distribution with n - 2 degrees of freedom.
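The significance test for r can be sketched in a few lines of Python. The inputs below (r = 0.7746, n = 5) are hypothetical, the helper name `t_statistic_for_r` is an assumption made for this example, and the critical value is the two-tailed 0.05 entry for 3 degrees of freedom from a standard t table.

```python
import math


def t_statistic_for_r(r, n):
    """t = r * sqrt((n - 2) / (1 - r**2)), with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("need more than two observations")
    if abs(r) >= 1:
        raise ValueError("t is undefined when |r| = 1")
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical inputs: r = 0.7746 from a sample of n = 5 observations.
r = 0.7746
n = 5
t = t_statistic_for_r(r, n)

# Two-tailed critical value for alpha = 0.05 and n - 2 = 3 degrees of
# freedom, taken from a standard t table.
critical_t = 3.182
reject_h0 = abs(t) > critical_t
print(round(t, 3), reject_h0)
```

With so few observations, even a fairly large r does not reach significance here, which is why the sample size enters the test statistic directly.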
PARTIAL CORRELATION COEFFICIENT: A measure of association between two variables after controlling or adjusting for the effects of one or more additional variables. The statistic is used to answer questions such as the following: How strongly are sales related to advertising expenditures when the effect of price is controlled? Is there any association between market share and size of the sales force after adjusting for the effects of sales promotion? Are consumers' perceptions of quality related to their perceptions of price when the effect of brand image is controlled?
PARTIAL CORRELATION COEFFICIENT: The association between Y and X after controlling for Z is:

$$r_{xy.z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{1 - r_{xz}^2}\, \sqrt{1 - r_{yz}^2}}$$

PART CORRELATION COEFFICIENT: A measure of the correlation between Y and X when the linear effects of the other independent variables have been removed from X but not from Y.

$$r_{y(x.z)} = \frac{r_{xy} - r_{yz}\, r_{xz}}{\sqrt{1 - r_{xz}^2}}$$
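A minimal Python sketch of the partial and part correlation formulas follows. The three simple correlations are hypothetical values, and the function names are assumptions made for this example.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of X and Y with Z partialled out of both variables."""
    return (r_xy - r_xz * r_yz) / (
        math.sqrt(1 - r_xz ** 2) * math.sqrt(1 - r_yz ** 2)
    )


def part_correlation(r_xy, r_xz, r_yz):
    """Correlation of Y with X after Z is removed from X only."""
    return (r_xy - r_yz * r_xz) / math.sqrt(1 - r_xz ** 2)


# Hypothetical simple correlations (illustration only).
r_xy, r_xz, r_yz = 0.6, 0.5, 0.4

partial = partial_correlation(r_xy, r_xz, r_yz)
part = part_correlation(r_xy, r_xz, r_yz)
print(round(partial, 4), round(part, 4))
```

The part correlation is never larger in absolute value than the partial correlation, because its denominator omits the factor involving the correlation between Y and Z, which is at most 1.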
REGRESSION ANALYSIS: A statistical procedure for analyzing associative relationships between a metric dependent variable and one or more independent variables. It can be used in the following ways: Determine whether the independent variables explain a significant variation in the dependent variable: whether a relationship exists. Determine how much of the variation in the dependent variable can be explained by the independent variables: strength of the relationship. Determine the structure or form of the relationship: the mathematical equation relating the independent and dependent variables. Predict the values of the dependent variable. Control for other independent variables when evaluating the contributions of a specific variable or set of variables.
BIVARIATE REGRESSION: A procedure for deriving a mathematical relationship, in the form of an equation, between a single metric dependent variable and a single metric independent variable. It addresses questions such as: Are consumers' perceptions of quality determined by their perceptions of price? Can the variation in market share be accounted for by the size of the sales force?
STATISTICS ASSOCIATED WITH BIVARIATE REGRESSION ANALYSIS: Bivariate regression model. Coefficient of determination. Estimated or predicted value. Regression coefficient. Scatter diagram. Standard error of estimate. Standard error. Standardized regression coefficient. Sum of squared errors. t statistic.
STEPS IN CONDUCTING BIVARIATE REGRESSION ANALYSIS: Plot the scatter diagram. Formulate the general model. Estimate the parameters. Estimate the standardized regression coefficient. Test for significance. Determine the strength and significance of the association. Check the prediction accuracy. Examine the residuals. Cross-validate the model.
PLOT THE SCATTER DIAGRAM: A scatter diagram is a plot of the values of two variables for all cases or observations. It is customary to plot the dependent variable on the vertical axis and the independent variable on the horizontal axis. The least-squares procedure is a technique for fitting a straight line to a scattergram by minimizing the sum of the squared vertical distances of all points from the line.
Formulate the general model: The general form of a deterministic straight-line model is:

$$Y = \beta_0 + \beta_1 X$$

In marketing research, very few relationships are deterministic, so an error term is added. The basic model becomes:

$$Y_i = \beta_0 + \beta_1 X_i + e_i$$
Estimate the parameters: In most cases, $\beta_0$ and $\beta_1$ are unknown and are estimated from the sample observations using the equation:

$$\hat{Y}_i = a + b X_i$$

The slope, b, may be computed in terms of the covariance between X and Y ($COV_{xy}$) and the variance of X as:

$$b = \frac{COV_{xy}}{S_x^2} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}}{\sum_{i=1}^{n} X_i^2 - n \bar{X}^2}$$

The intercept, a, may then be calculated as:

$$a = \bar{Y} - b \bar{X}$$
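The parameter estimates can be sketched in Python as below. The data are hypothetical, and `fit_bivariate` and `slope_computational` are names assumed for this example. The second function uses the raw-score form of the slope to show that it matches the deviation form.

```python
# Hypothetical sample of n = 5 paired observations (illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def fit_bivariate(x, y):
    """Return (a, b) for the least-squares line Y-hat = a + b * X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope from deviations about the means.
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    # Intercept: the fitted line passes through (mean_x, mean_y).
    a = mean_y - b * mean_x
    return a, b


def slope_computational(x, y):
    """Slope from raw scores: (sum XY - n*Xbar*Ybar) / (sum X^2 - n*Xbar^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    numerator = sum(xi * yi for xi, yi in zip(x, y)) - n * mean_x * mean_y
    denominator = sum(xi ** 2 for xi in x) - n * mean_x ** 2
    return numerator / denominator


a, b = fit_bivariate(x, y)
print(round(a, 4), round(b, 4), round(slope_computational(x, y), 4))
```

For these values the fitted line is approximately Y-hat = 2.2 + 0.6 X, and both slope formulas give the same result.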
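Estimate the standardized regression coefficient: Standardization is a process by which the raw data are transformed into new variables that have a mean of 0 and a variance of 1. When the data are standardized, the intercept is 0, and in the bivariate case the slope from regressing Y on X, $B_{yx}$, equals the slope from regressing X on Y, $B_{xy}$. Moreover, each of these standardized coefficients is equal to the simple correlation between X and Y:

$$B_{yx} = B_{xy} = r_{xy}$$

There is a simple relationship between the standardized and nonstandardized regression coefficients:

$$B_{yx} = b_{yx} \left( \frac{S_x}{S_y} \right)$$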
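The relationship between the standardized and nonstandardized coefficients can be checked numerically. This is a sketch on hypothetical data, and the variable names are assumptions made for this example. It converts the raw slope with the ratio of standard deviations and compares the result with the simple correlation.

```python
import math

# Hypothetical sample of n = 5 paired observations (illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sum_xx = sum((xi - mean_x) ** 2 for xi in x)
sum_yy = sum((yi - mean_y) ** 2 for yi in y)

# Nonstandardized slope of Y on X, and the sample standard deviations.
b_yx = sum_xy / sum_xx
s_x = math.sqrt(sum_xx / (n - 1))
s_y = math.sqrt(sum_yy / (n - 1))

# Standardized coefficient via B_yx = b_yx * (S_x / S_y).
beta_yx = b_yx * (s_x / s_y)

# Simple correlation, for comparison.
r_xy = sum_xy / math.sqrt(sum_xx * sum_yy)
print(round(beta_yx, 4), round(r_xy, 4))
```

The two printed values coincide, which is the bivariate identity between the beta weight and r.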
Determine the strength and significance of association: The strength of association may be calculated as follows:

$$r^2 = \frac{SS_{reg}}{SS_y} = \frac{SS_y - SS_{res}}{SS_y}$$

It may be recalled from the earlier calculation of the simple correlation coefficient that:

$$SS_y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
In the regression of attitudes towards the city on the duration of residence:

$$SS_{reg} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$

where $\hat{Y}_i$ is the value of attitude towards the city predicted using a and b, and

$$SS_{res} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

where $Y_i$ is the observed attitude towards the city. The statistical significance of the linear relationship between X (duration of residence) and Y (attitude towards the city) may be tested by examining the hypotheses:

$$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0$$

The appropriate test statistic is the F statistic, with 1 and n - 2 degrees of freedom:

$$F = \frac{SS_{reg}}{SS_{res} / (n-2)}$$
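The sums of squares, r², and the F statistic can be sketched as follows. The data are hypothetical and are not the attitude and duration-of-residence values from the example. The sketch takes r² as SSreg / SSy and F as SSreg divided by SSres / (n - 2).

```python
# Hypothetical sample of n = 5 paired observations (illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept.
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
a = mean_y - b * mean_x
y_hat = [a + b * xi for xi in x]

# Decomposition of the total variation in Y.
ss_y = sum((yi - mean_y) ** 2 for yi in y)
ss_reg = sum((yh - mean_y) ** 2 for yh in y_hat)
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r_squared = ss_reg / ss_y
f_statistic = ss_reg / (ss_res / (n - 2))
print(round(r_squared, 4), round(f_statistic, 4))
```

In the bivariate case, this F value equals the square of the t statistic used to test the correlation, so the two tests lead to the same conclusion.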
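Check prediction accuracy: The standard error of estimate, SEE, is the standard deviation of the actual Y values from the predicted $\hat{Y}$ values:

$$SEE = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}} = \sqrt{\frac{SS_{res}}{n-2}}$$

If there are k independent variables, then:

$$SEE = \sqrt{\frac{SS_{res}}{n-k-1}}$$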
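A short Python sketch of the standard error of estimate, computed from the residual sum of squares SSres = Σ(Yi − Ŷi)² with n − k − 1 in the denominator. The observed and predicted values are hypothetical, and the function name `standard_error_of_estimate` is an assumption made for this example.

```python
import math


def standard_error_of_estimate(y, y_hat, k=1):
    """SEE = sqrt(SSres / (n - k - 1)) for k independent variables."""
    n = len(y)
    if n - k - 1 <= 0:
        raise ValueError("not enough observations for k predictors")
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return math.sqrt(ss_res / (n - k - 1))


# Hypothetical observed values, and predictions from the hypothetical
# fitted line Y-hat = 2.2 + 0.6 * X evaluated at X = 1, ..., 5.
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

see = standard_error_of_estimate(y, y_hat, k=1)
print(round(see, 4))
```

Setting `k=1` reproduces the bivariate formula with n - 2 in the denominator.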
ASSUMPTIONS The error term is normally distributed. For each fixed value of X the distribution of Y is normal. The means of all these normal distributions of Y, given X, lie on a straight line with slope b. The mean of error term is 0. The variance of error term is constant. The error terms are uncorrelated. Observations have been drawn independently.
MULTIPLE REGRESSION: A statistical technique that simultaneously develops a mathematical relationship between two or more independent variables and an interval-scaled dependent variable. The general form of the multiple regression model is as follows:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + e$$

which is estimated by the following equation:

$$\hat{Y} = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$
Statistics associated with multiple regression: Adjusted R². Coefficient of multiple determination. F test. Partial F test. Partial regression coefficient.
CONDUCTING MULTIPLE REGRESSION ANALYSIS: Partial regression coefficients: With two independent variables, the estimated equation is:

$$\hat{Y} = a + b_1 X_1 + b_2 X_2$$
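Strength of association: The total variation is decomposed as follows:

$$SS_y = SS_{reg} + SS_{res}$$

where

$$SS_y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \qquad SS_{reg} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \qquad SS_{res} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

The strength of association is measured by the coefficient of multiple determination:

$$R^2 = \frac{SS_{reg}}{SS_y}$$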
Significance testing: The overall test can be conducted using an F statistic with k and n - k - 1 degrees of freedom:

$$F = \frac{SS_{reg} / k}{SS_{res} / (n-k-1)} = \frac{R^2 / k}{(1 - R^2) / (n-k-1)}$$

Examination of residuals: A residual is the difference between the observed value of $Y_i$ and the value predicted by the regression equation, $\hat{Y}_i$.
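The multiple regression calculations can be illustrated with a small, dependency-free Python sketch. It estimates the coefficients by solving the normal equations, then computes R² as SSreg / SSy and the overall F statistic both from the sums of squares, with SSres / (n - k - 1) in the denominator, and from R². The data and the helper names (`solve_linear`, `fit_multiple`, `predict`) are hypothetical. In practice a statistics package would normally be used instead.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(v)] for row, v in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][c] * solution[c] for c in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_multiple(predictors, y):
    """Least-squares estimates [a, b1, ..., bk] from the normal equations."""
    n = len(y)
    columns = [[1.0] * n] + [list(col) for col in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    return solve_linear(xtx, xty)


def predict(coefs, predictors):
    """Predicted values a + b1*X1 + ... + bk*Xk for every observation."""
    n = len(predictors[0])
    return [
        coefs[0] + sum(b * col[i] for b, col in zip(coefs[1:], predictors))
        for i in range(n)
    ]


# Hypothetical data: n = 6 observations, k = 2 independent variables.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [3, 4, 8, 7, 12, 10]

n, k = len(y), 2
coefs = fit_multiple([x1, x2], y)
y_hat = predict(coefs, [x1, x2])
mean_y = sum(y) / n

ss_y = sum((yi - mean_y) ** 2 for yi in y)
ss_reg = sum((yh - mean_y) ** 2 for yh in y_hat)
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r_squared = ss_reg / ss_y
f_from_ss = (ss_reg / k) / (ss_res / (n - k - 1))
f_from_r2 = (r_squared / k) / ((1 - r_squared) / (n - k - 1))
print([round(c, 4) for c in coefs], round(r_squared, 4), round(f_from_ss, 2))
```

Both ways of computing F agree up to rounding error, since dividing SSreg and SSres by SSy leaves the ratio unchanged.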
Stepwise Regression: A regression procedure in which the predictor variables enter or leave the regression equation one at a time. There are several approaches to stepwise regression: Forward inclusion. Backward elimination. Stepwise solution.
MULTICOLLINEARITY: A state of very high intercorrelations among independent variables. Multicollinearity results in several problems. The partial regression coefficients may not be estimated precisely, and their standard errors are likely to be high. The magnitudes as well as the signs of the partial regression coefficients may change from sample to sample. It becomes difficult to assess the relative importance of the independent variables. Predictor variables may be incorrectly included or removed in stepwise regression.
RELATIVE IMPORTANCE OF PREDICTORS: Several approaches are commonly used to assess the relative importance of predictor variables: Statistical significance. Square of the simple correlation coefficient. Square of the partial correlation coefficient. Square of the part correlation coefficient. Measures based on standardized coefficients or beta weights. Stepwise regression.
CROSS-VALIDATION: A test of validity that examines whether a model holds on comparable data not used in the original estimation.
DOUBLE CROSS-VALIDATION: A special form of validation in which the sample is split into halves. One half serves as the estimation sample and the other as the validation sample. The roles of the estimation and validation halves are then reversed, and the cross-validation process is repeated.
REGRESSION WITH DUMMY VARIABLES: With three dummy variables, the model can be estimated as follows:

$$\hat{Y}_i = a + b_1 D_1 + b_2 D_2 + b_3 D_3$$
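The dummy-variable model can be illustrated with a sketch that assumes a single categorical predictor with four categories, coded by three dummies with the fourth category as the reference. The data, the choice of reference category, and the helper names are all hypothetical. Instead of running a general regression routine, the sketch uses the standard result that, under this coding, a is the reference-category mean and each b is a difference from that mean. It then checks the least-squares normal equations directly.

```python
# Hypothetical data: (category, response) pairs for four categories.
# Category 4 is the (arbitrarily chosen) reference category.
observations = [
    (1, 4.0), (1, 6.0),
    (2, 7.0), (2, 9.0),
    (3, 2.0), (3, 4.0),
    (4, 1.0), (4, 3.0),
]


def dummy_code(category):
    """Return (D1, D2, D3); the reference category 4 is coded (0, 0, 0)."""
    return tuple(1 if category == j else 0 for j in (1, 2, 3))


def category_mean(category):
    """Mean response within one category."""
    values = [v for c, v in observations if c == category]
    return sum(values) / len(values)


# With dummy coding, the least-squares intercept is the reference-category
# mean, and each b_j is category j's mean minus the reference mean.
a = category_mean(4)
b = [category_mean(j) - a for j in (1, 2, 3)]

rows = [(dummy_code(c), v) for c, v in observations]
predicted = [a + sum(bj * dj for bj, dj in zip(b, d)) for d, _ in rows]
residuals = [v - p for (_, v), p in zip(rows, predicted)]

# Least-squares check: the residuals sum to zero overall and within each
# dummy variable, which is what the normal equations require.
normal_equation_sums = [sum(residuals)] + [
    sum(res * d[j] for res, (d, _) in zip(residuals, rows)) for j in range(3)
]
print(a, b, normal_equation_sums)
```

Only three dummies are needed for four categories. A fourth dummy would be perfectly collinear with the intercept.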