Correlations and simple regression analysis Data analysis and information management EUZC405 M.Bazarov m.bazarov@wiut.uz
Today’s Agenda Measuring association between the variables (covariance and coefficient of correlation) Simple regression analysis Summary in Excel
Learning Objectives After completion of this lecture you will be able to: Define and calculate correlation coefficient; Find the regression line and use it for regression analysis; Define and calculate coefficient of determination (R-squared); Understand and interpret regression output from Excel
Measuring association between the variables

Use of the term correlation implies:
- that there are two or more entities under consideration;
- that there is some common link which makes them related to a greater or lesser degree.

Consider: CA1 assessment scores and final exam results; height and weight; price of goods and wages paid to the producers.
Measuring association between the variables

Consider an example: Tim Newton is the sales manager of a firm which manufactures meat products and sells a large part of them directly to retail food stores via a large force of sales representatives. Recently, as the recession has begun to affect the business, Mr. Newton has become aware of the need to monitor representatives' performance more closely, but the trouble is that he does not have much idea what factors may influence that performance.
Measuring association between the variables

Rep. no. | Value of last quarter's sales ($000s) | Number of retail outlets visited regularly | Area covered (square miles)
1  | 50 | —  | 450
2  | 25 | 12 | 500
3  | 29 | 17 | 350
4  | 31 | 21 | 250
5  | 31 | 26 | 150
6  | 42 | 34 | 420
7  | 44 | 30 | 275
8  | 45 | 38 | 200
9  | 47 | 45 | 400
10 | 57 | 61 | 300

(Representative 1's outlet count did not survive extraction; representative 1 is excluded from the calculations that follow.)
Measuring association between the variables

[Scatter plot: last quarter's sales ($000s) plotted against number of outlets visited regularly.]
Measuring association between the variables

What can we say about this relationship? Notice the outlier in the scatter plot!
Measuring association between the variables

In general, one could observe that when the number of outlets visited (variable x) is above its mean, then sales (variable y) are also above their mean.

[Scatter plot with the mean of x and the mean of y marked.]
Measuring association between the variables

The covariance measures linear dependence between two variables:

Cov(x, y) = (1/n) Σ (xᵢ − x̄)(yᵢ − ȳ)

Cov > 0 indicates that the two variables move in the same direction (when x is above its mean, so is y).
Cov < 0 indicates that the two variables move in opposite directions (when x is above its mean, y is below its mean).
Measuring association between the variables

To standardize the covariance we divide it by the product of the two standard deviations:

r = Cov(x, y) / (s_x · s_y)

where r (also written R) is known as Pearson's product-moment correlation coefficient and always lies between −1 and +1. A computationally convenient form of the covariance is:

Cov(x, y) = Σxy/n − x̄·ȳ
The sales data revisited

Rep No | Sales y ($000s) | Outlets x | y²    | x²    | xy
2      | 25              | 12        | 625   | 144   | 300
3      | 29              | 17        | 841   | 289   | 493
4      | 31              | 21        | 961   | 441   | 651
5      | 31              | 26        | 961   | 676   | 806
6      | 42              | 34        | 1764  | 1156  | 1428
7      | 44              | 30        | 1936  | 900   | 1320
8      | 45              | 38        | 2025  | 1444  | 1710
9      | 47              | 45        | 2209  | 2025  | 2115
10     | 57              | 61        | 3249  | 3721  | 3477
Total  | 351             | 284       | 14571 | 10796 | 12300
Finding the coefficient of correlation

ȳ = 351/9 = 39, x̄ = 284/9 ≈ 31.56

Covariance = Σxy/n − x̄·ȳ = 12300/9 − 31.56 × 39 = 136

s_x = √(10796/9 − 31.56²) ≈ 14.28, s_y = √(14571/9 − 39²) ≈ 9.90

r = 136 / (14.28 × 9.90) ≈ 0.96, a strong positive linear relationship.
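The hand calculation above can be checked with a short script. This is an illustrative sketch, not part of the lecture; the variable names (mean_x, cov_xy, and so on) are my own.

```python
# Sales data for the 9 representatives used in the lecture's calculation:
# x = number of outlets visited, y = last quarter's sales ($000s).
x = [12, 17, 21, 26, 34, 30, 38, 45, 61]
y = [25, 29, 31, 31, 42, 44, 45, 47, 57]
n = len(x)

mean_x = sum(x) / n   # 284/9 ≈ 31.56
mean_y = sum(y) / n   # 351/9 = 39.0

# Covariance with the 1/n divisor, as in the lecture's formula
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

# Standard deviations (also with 1/n) and Pearson's r
sd_x = (sum((xi - mean_x) ** 2 for xi in x) / n) ** 0.5
sd_y = (sum((yi - mean_y) ** 2 for yi in y) / n) ** 0.5
r = cov_xy / (sd_x * sd_y)

print(round(cov_xy, 1), round(r, 3))  # 136.0 0.962
```

This reproduces the covariance of 136 found above, and gives the correlation coefficient (Multiple R in the Excel output discussed later) of about 0.96.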
Simple regression analysis

Hence, if a relationship between the variables exists (as we can see from the correlation coefficient), we would be interested in predicting the behaviour of one variable, say y, from the behaviour of the other, say x:
- the predictor or independent variable, denoted x;
- the dependent variable, denoted y.
Simple regression analysis

For example, the relationship between sales and the number of outlets visited could be well approximated by the line:

Sales = a + b × number of outlets visited

(where a is the level of sales when no outlets are visited, i.e. x = 0, and b is the extra sales per additional outlet visited). Or:

y = a + bx
Simple regression analysis The problem is we could draw many possible lines. Which one to choose?
Simple regression analysis

Well, we choose the line that minimizes the sum of squared vertical distances between the data points and the line (see the graph!) to ensure the best fit. This is the least-squares line.
Simple regression analysis

For example, let's estimate the regression line for our data on sales by minimizing the sum of squared differences between the data and the line:

Sales = a + b × number of outlets visited

The coefficient b of such a line is found using the formula:

b = (Σxy − n·x̄·ȳ) / (Σx² − n·x̄²)

The coefficient a of such a line is found using the formula:

a = ȳ − b·x̄
Simple regression analysis

Hence,

b = (12300 − 9 × 31.56 × 39) / (10796 − 9 × 31.56²) = 1224 / 1834.2 ≈ 0.6673

a = 39 − 0.6673 × 31.56 ≈ 17.94
Simple regression analysis

sales = 17.94 + 0.6673x

Wow, we can now predict the sales by looking at the number of outlets visited by a sales representative! In our case, if the number of outlets visited by a representative increases by one, sales increase by 0.6673 thousand dollars, or $667.30.
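As a quick check on the arithmetic, the least-squares coefficients can be recomputed directly from the column totals in the table above (an illustrative sketch, not part of the lecture; the variable names are my own):

```python
# Column totals from the sales table (n = 9 representatives)
n, sum_x, sum_y, sum_x2, sum_xy = 9, 284, 351, 10796, 12300

mean_x, mean_y = sum_x / n, sum_y / n

# Least-squares formulas from the lecture:
# b = (Σxy − n·x̄·ȳ) / (Σx² − n·x̄²),  a = ȳ − b·x̄
b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
a = mean_y - b * mean_x

print(round(b, 4), round(a, 2))  # 0.6673 17.94
```

The full-precision slope is about 0.66731; the slides round it to 0.6673.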
Simple regression analysis

After we have derived the regression line, you have to ask yourself how well the line actually fits the data: the "goodness of fit" of the regression. Consider an example: the average sales are 351/9 = 39. Take any one value, say representative #8, who visited x = 38 outlets and achieved sales of y = 45. The regression predicts:

ŷ = 17.94 + 0.6673 × 38 = 43.29
Simple regression analysis

Look at the graph for representative #8 (x = 38, y = 45):

du = 45 − 43.29 = 1.71 (unexplained: from the point down to the line)
de = 43.29 − 39 = 4.29 (explained: from the line down to the mean)
dt = y − ȳ = 45 − 39 = 6 (total: from the point down to the mean)

[Graph: sales against number of outlets visited, with the fitted line (a = 17.94, b = 0.6673) and the mean line ȳ = 39 marked.]
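The decomposition for representative #8 can be verified numerically (a sketch with my own variable names; note that with full-precision coefficients the fitted value comes out as about 43.30, while the slides use coefficients rounded to 17.94 and 0.6673 and so show 43.29):

```python
x = [12, 17, 21, 26, 34, 30, 38, 45, 61]
y = [25, 29, 31, 31, 42, 44, 45, 47, 57]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares coefficients at full precision
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
    sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x

# Representative #8: x = 38, y = 45
y_hat = a + b * 38
du = 45 - y_hat        # unexplained deviation (point to line)
de = y_hat - mean_y    # explained deviation (line to mean)
dt = 45 - mean_y       # total deviation (point to mean)

# The three deviations satisfy dt = de + du exactly
print(round(y_hat, 2), round(du, 2), round(de, 2), round(dt, 2))
```

Whatever the rounding, the identity dt = de + du holds exactly, since de and du are defined by splitting dt at the fitted line.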
Simple regression analysis

Hence, we could say that on average we generate 39 thousand dollars in sales. When representative #8 visits 38 outlets, the regression predicts sales of 43.29 thousand dollars. So the regression explains part of the deviation from the mean, de (the explained deviation), while du (the unexplained deviation) is the part left unexplained. The total deviation dt is simply the sum of both: dt = de + du. Summing such deviations across all observations gives us:

Σ(yᵢ − ȳ) = Σ(ŷᵢ − ȳ) + Σ(yᵢ − ŷᵢ)

But, as you probably remember from our previous lectures, deviations from the mean sum to zero, so this sum is not useful on its own.
Simple regression analysis

Hence, we use the sums of squared deviations to see how well our regression fits the data, and we denote:

Σ(yᵢ − ȳ)² — Total Sum of Squares (TSS)
Σ(ŷᵢ − ȳ)² — Regression (Explained) Sum of Squares (ESS)
Σ(yᵢ − ŷᵢ)² — Residual (Unexplained) Sum of Squares (RSS)

with TSS = ESS + RSS. The coefficient of determination (R-squared) is:

R² = ESS / TSS = 1 − RSS / TSS

R² lies between 0 and 1 and gives the proportion of the variation in y explained by the regression.
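The sums of squares and R² for the sales regression can be computed as follows (an illustrative sketch, not the lecture's own code; tss, ess, rss are my names for the three quantities defined above):

```python
x = [12, 17, 21, 26, 34, 30, 38, 45, 61]
y = [25, 29, 31, 31, 42, 44, 45, 47, 57]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Fit the least-squares line and compute fitted values
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
    sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x
y_hat = [a + b * xi for xi in x]

tss = sum((yi - mean_y) ** 2 for yi in y)              # total SS
ess = sum((yh - mean_y) ** 2 for yh in y_hat)          # explained SS
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual SS

r_squared = ess / tss
print(round(tss, 1), round(r_squared, 3))  # 882.0 0.926
```

Note that 0.926 ≈ 0.962², illustrating that in simple regression R² is the square of the correlation coefficient computed earlier.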
Simple regression analysis

Now, look at the regression output (from Excel) below:

[Excel regression output: Multiple R ≈ 0.96, R Square ≈ 0.93, Intercept = 17.94, slope = 0.6673, t Stat for the slope = 9.363761.]
Simple regression analysis

As you have probably noticed, the good thing is that we do not need to do all these calculations manually; Excel reports them to us! And you can easily identify all the components we looked at today: the correlation coefficient (Multiple R), R-squared, and the regression coefficients (a = 17.94 and b = 0.66). The only part left to explain, to finalize our discussion today, is what the reported t-statistic means.
Simple regression analysis

As you have probably noticed, the estimated coefficients (a = 17.94 and b = 0.66), or estimates, are obtained from a sample! The t-statistic tests the hypothesis that the population regression coefficient β is 0, that is, H0: β = 0, against the alternative hypothesis H1: β ≠ 0. So the t-statistic shows us whether the estimate is significantly different from zero. In our example, the t-statistic for the slope is equal to 9.363761.
Simple regression analysis

The t-statistic is computed as t = b / SE(b), where SE(b) is the standard error of the sample coefficient. Please note, the only difference from the z-statistic we used in our previous example is that we use the estimated standard error of the sample coefficient in the formula above! Should we reject H0 at the 5% level of significance? To decide, you can look either at the p-value or at the confidence interval reported in the Excel regression output. Using p-values: p = 2 × P(T > |t|). In other words, a p-value less than the significance level leads us to reject the null hypothesis H0: β = 0.
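The t-statistic Excel reports for the slope can be reproduced from first principles (a sketch; se_b and the other names are mine). With 9 observations the test has n − 2 = 7 degrees of freedom, for which the two-tailed 5% critical value is about 2.365, so H0: β = 0 is firmly rejected:

```python
x = [12, 17, 21, 26, 34, 30, 38, 45, 61]
y = [25, 29, 31, 31, 42, 44, 45, 47, 57]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Fit the least-squares line
sxx = sum((xi - mean_x) ** 2 for xi in x)
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
a = mean_y - b * mean_x

# Residual sum of squares and the standard error of the slope:
# SE(b) = sqrt( RSS / (n - 2) / Σ(x − x̄)² )
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se_b = (rss / (n - 2) / sxx) ** 0.5

t = b / se_b
print(round(t, 3))  # 9.364, matching Excel's 9.363761
```

Since 9.36 is far beyond the critical value 2.365, the p-value is tiny and we conclude that the number of outlets visited is a significant predictor of sales.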
Further Reading and Reference

Chapter S3, Swift & Piff (2001, 2nd edition) Quantitative Methods for Business, Management and Finance, Palgrave.
Chapters 15 & 16, Curwin, J. & Slater, R. (2002, 5th edition) Quantitative Methods for Business Decisions, Thomson.
Chapter 3, Burton, G., Carrol, G. & Wall, S. (2002, 2nd edition) Quantitative Methods for Business & Economics, Financial Times / Prentice Hall.
Chapter 11, Bancroft & O'Sullivan (2000) Foundations in Quantitative Business Techniques, McGraw-Hill Publishing.