Download presentation
Presentation is loading. Please wait.
1
Pearson Correlation and R2
We have discussed the major uses of regression analysis-estimation of parameters and means and prediction of new observation. In this topic, we will briefly discuss descriptive measures used in practice to describe the degree of linear association between X and Y. Many of the idea has been covered in a basic statistical course.
2
The correlation coefficient Ο (estimated by r) varies from β1 to 1:
π= Ξ£ π π β π π π β π Ξ£ π π β π 2 Ξ£ π π β π 2 = π 1 Ξ£ π π β π 2 Ξ£ π π β π = π 1 π π π π π 1 = Ξ£ X i β π π π β π Ξ£ π π β π 2 Correlation measures the strength of the linear relationship between two variables The correlation coefficient measures the strength of the linear relationship between two variables. Correlation coefficient is denoted by rho if measure from the population and r if measure from the sample. The correlation coefficient summarizes the regression line into a single value between -1 and 1. A strong linear relationship gets a value close to 1 or -1, while a weak linear relationship gets value close to 0. If the relationship is positive, the correlation coefficient is positive, other wise negative. The exact value of correlation of coefficient is b1 times the ration of the standard deviation of X and Y.
3
The Coefficient of Determination, R2
π
2 = SSR SST measures the proportion of variation in Y explained by the model . π= π 1 Ξ£ π π β π 2 Ξ£ π π β π 2 In SLR, R2 = r2: π 2 = π 1 2 Ξ£ π π β π 2 Ξ£ π π β π 2 = SSR SST The square of the correlation coefficient is called coefficient of determination, and denoted by R2. It measures the proportion of variation in Y explained by the linear regression model. This actually is easier to see that R2 is computed as SSR/SST. SST is a measure of the uncertainty in predicting Y, while SSR is the amount of variation in Y that can be explained based on its relationship with X. Or reduction of total variation associated with the use the X. When all observations fall on the fitted regression line, then SSE=0 and R2=1. Here, the predictor variable X accounts for all variation in the observation Y When the fitted regression line is horizontal so that slope =1 and all predicted value is the mean of Y. then SSE= SSTO and R2=0. Here, there is no linear association between X and Y in the sample data, and the predictor variable X is of no help in reducing the variation to predict Y. In practice, R2 is not likely to be 0 or 1 but somewhere between these limits.
4
The Toluca Company example
The Toluca company manufactures refrigeration equipment as well as many replacement parts. In the past, one of the Replacement parts has been produced periodically in lots of varying sizes. When a cost improvement program was undertaken, company officials wished to determine the optimum lot size (X) for producing this part. The production of this part involves setting up the production process and machining and assembly operations. One key input for the model to ascertain the optimum lot size was the relationship between lot size and labor hours (Y) required to produce the lot. To determine this relationship, data on lot size and work hours for 25 (n) recent production runs were utilized. Toluca company manufactures refrigeration equipment as well as many replacement parts. In the past, one of the replacement parts has been produced periodically in lots of varying sizes. When a cost improvement program was undertaken, company official wished to determine to optimum lot size (x) for producing this part. One key input for the model to ascertain the optimum lot size was the relationship between lot size and labor hours (Y) required to produce the lot. To determine this relationship, data on lot size and work hours for 25 recent production runs were utilized. The first 10 records are shown here.
5
R2 for Toluca Example π 2 =_________________ From SLR output:
Source of Variation SS π
π MS F Regression 252378 1 105.88 Error 54825 23 2384 Total 307203 24 π 2 =_________________ From SLR output: One can compute the correlation determination from an ANOVA table. Thus, the variation in (work load/ lot size) is (reduced/increased) by________ percent when lot size is considered. π
2 can be used as a criterion to access a linear regression model, if the relationship between X and Y is linear, then the _______(bigger/smaller) π
2 is, the better the model is.
6
R2 for Toluca Example Source of Variation SS π
π MS F Regression 252378 1 105.88 Error 54825 23 2384 Total 307203 24 π 2 = π 1 2 Ξ£ π π β π 2 Ξ£ π π β π 2 = SSR SST = /307203=0.8215 From SLR output: One can compute the correlation determination from an ANOVA table. Thus, the unexplained variation in (work load/ lot size) is (reduced/increased) by percent when lot size is considered. π
2 can be used as a criterion to access a linear regression model, if the relationship between X and Y is linear, then the _______(bigger/smaller) π
2 is, the better the model is.
7
The adjusted R 2 From SLR output:
Source of Variation SS π
π MS F Regression 252378 1 105.88 Error 54825 23 2384 Total 307203 24 From SLR output: Correlation determination measures the amount of variation in Y explained by the model. The more predictors used in the model, the more variation explained. This is a non-decreasing process. However, if the new added predictor is not related, it will cost some degree of freedom to estimated more beta parameters. The adjusted R-square uses a penalty item for every new added variables. The benefit of reduction in the variations in Y is adjusted accordingly. Usually the adjusted R-square is smaller than R-square, sometimes can be negative even. In which case the model is not a good fit for the data. We will discuss more on this in multiple linear regression section. Since π
2 usually can be made larger by including a larger number of predictor variables, the adjusted coefficient of Determination (π
π 2 ) is used to adjusts for the number of X values in the model. (more on this in MLR) Note: the adjusted π
2 can be negativeβ¦
8
Compute R or R 2
9
Inference on correlation coefficients
π»π:π=0 π»π:πβ 0 π»π: π½ 1 =0 π»π: π½ 1 β 0 is equivalent to π= π 1 π π π π π‘ π = π 1 π { π 1 } π‘ π = π πβ β π 2 is equivalent to
10
Inference on correlation coefficients
π»π:π=0 π»π:πβ 0 π»π: π½ 1 =0 π»π: π½ 1 β 0 is equivalent to π= π 1 π π π π π‘ π = π 1 π { π 1 } π‘ π = π πβ β π 2 is equivalent to
11
Inference on correlation coefficients
π»π:π= π β π»π:πβ π β π»π: π½ 1 = π½ β π»π: π½ 1 β π½ β is NOT equivalent to π‘ π = π 1 β π½ β π { π 1 }
12
Confidence interval on correlation coefficients
π§ β² = log π ( 1+π 1βπ ) is the Fisher π transformation. πΈ π§ β² = log π ( 1+π 1βπ ) π 2 π§ β² = 1 πβ3 (π§ β² βπΈ{π§β²})/π{π§β²} is approximately Norma (0,1) π β² Β±π(πβπΆ/π)π{ π β² } Remember, we still need to transform back to π! Because the sampling distribution of π is complicated, interval estimation of π is usually carried out by an approximation procedure based on a transformation, called Fisher z transformation. π= (π ππβ² βπ)/ (π π π β² +π)
13
What is a 90% CI for correlation coefficients in the Toluca example
Two methods here Because the sampling distribution of π is complicated, interval estimation of π is usually carried out by an approximation procedure based on a transformation, called Fisher z transformation. Conclusion: we are 90% confident that The correlation between hour and size is at least 0.82 and at most 0.95.
14
R2 is often expressed as a percentage instead of a proportion.
Notes on R2 R2 is often expressed as a percentage instead of a proportion. In MLR there will be a different r between Y and each predictor variable, but only one R2 for the whole model. In MLR, we often use adjusted R2 which has been adjusted to account for the number of variables in the model Low R2 does not imply no functional relationship exists between X and Y(it might be non-linear) High R2 does not ensure that the model can make precise predictions for E(YΛh) or Yh(new) πππΈ could still be large (for example, non linear pattern) Extrapolation
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.