1
Multiple regression - Inference for multiple regression - A case study IPS chapters 11.1 and 11.2 © 2006 W.H. Freeman and Company
2
Objectives (IPS chapters 11.1 and 11.2) Multiple regression
■ Multiple linear regression model
■ Confidence interval for βj
■ Significance test for βj
■ ANOVA F-test for multiple regression
■ Squared multiple correlation R²
3
Multiple linear regression model
For p explanatory variables, we can express the y-values as
y = β0 + β1x1 + … + βpxp + ε
where the error ε is the difference between each y-value and its mean. The statistical model for the sample data (i = 1, 2, …, n) is then:
DATA = FIT + RESIDUAL
yi = β0 + β1x1i + … + βpxpi + εi
The parameters of the model are β0, β1, …, βp, and σ, where σ is the standard deviation of the errors ε.
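To make the model concrete, here is a minimal Python sketch that simulates data from it. The sample size, coefficients, and σ are made-up illustration values, not taken from the case study:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                            # sample size, number of explanatory variables
beta = np.array([2.0, 0.5, -1.0, 0.3])   # beta_0, beta_1, ..., beta_p (illustrative)
sigma = 1.5                              # standard deviation of the errors

X = rng.normal(size=(n, p))              # explanatory variables x_1 ... x_p
eps = rng.normal(0.0, sigma, size=n)     # errors, mean 0 and standard deviation sigma
y = beta[0] + X @ beta[1:] + eps         # DATA = FIT + RESIDUAL
```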
4
We selected a random sample of n individuals for which p + 1 variables were measured (x1, …, xp, y). The least-squares regression method minimizes the sum of squared deviations, Σei² = Σ(yi − ŷi)², to express y as a linear function of the explanatory variables:
ŷi = b0 + b1x1i + … + bpxpi
■ As with simple linear regression, the constant b0 is the y intercept.
■ The regression coefficients (b1, …, bp) reflect the change in y for a 1-unit change in each explanatory variable, the others being held fixed. They are analogous to the slope in simple regression.
■ ŷ is an unbiased estimate of the mean response μy, and b0, b1, …, bp are unbiased estimates of the population parameters β0, β1, …, βp.
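Continuing the simulated example above, the least-squares estimates can be computed with numpy; lstsq minimizes exactly the sum of squared deviations described here:

```python
# Design matrix: a leading column of ones for the intercept b_0
X_design = np.column_stack([np.ones(n), X])

# Least squares: minimizes sum_i (y_i - yhat_i)^2
b, *_ = np.linalg.lstsq(X_design, y, rcond=None)

y_hat = X_design @ b          # fitted values yhat_i
residuals = y - y_hat         # e_i = y_i - yhat_i
```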
5
Confidence interval for βj
Estimating the regression parameters β0, …, βj, …, βp is a case of one-sample inference with an unknown population variance, so we use the t distribution with n − p − 1 degrees of freedom.
A level C confidence interval for βj is given by: bj ± t* SEbj
- SEbj is the standard error of bj.
- t* is the critical value of the t(n − p − 1) distribution with area C between −t* and +t*.
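Continuing the same sketch, the intervals can be computed with scipy. The standard errors below come from the usual least-squares formula (the diagonal of s²(XᵀX)⁻¹), which the slides leave to software:

```python
from scipy import stats

df_error = n - p - 1                     # degrees of freedom of the t distribution

# s: estimate of the regression standard deviation sigma
s = np.sqrt(np.sum(residuals**2) / df_error)

# Standard errors SE_bj: square roots of the diagonal of s^2 (X'X)^{-1}
cov_b = s**2 * np.linalg.inv(X_design.T @ X_design)
se_b = np.sqrt(np.diag(cov_b))

# Level C = 95% confidence intervals: b_j +/- t* SE_bj
t_star = stats.t.ppf(0.975, df_error)    # area C between -t* and +t*
ci_low, ci_high = b - t_star * se_b, b + t_star * se_b
```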
6
Significance test for βj
To test the hypothesis H0: βj = 0 versus a one- or two-sided alternative, we calculate the t statistic t = bj / SEbj, which has the t(n − p − 1) null distribution, and use it to find the P-value of the test.
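In the running sketch, the t statistics and their two-sided P-values are:

```python
# t statistic for each H0: beta_j = 0, with the t(n - p - 1) null distribution
t_stat = b / se_b

# Two-sided P-values: total area in both tails beyond |t|
p_vals = 2 * stats.t.sf(np.abs(t_stat), df_error)
```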
7
ANOVA F-test for multiple regression
For a multiple linear relationship, ANOVA tests the hypotheses H0: β1 = β2 = … = βp = 0 versus Ha: H0 is not true, by computing the F statistic F = MSM / MSE.
When H0 is true, F follows the F(p, n − p − 1) distribution, and the P-value is the area above the observed F, P(F(p, n − p − 1) ≥ F).
A significant P-value doesn't mean that all the explanatory variables have a significant influence on y, only that at least one of them does.
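Continuing the sketch, the F statistic and its P-value:

```python
# Sums of squares: SST = SSM + SSE
sse = np.sum(residuals**2)
sst = np.sum((y - y.mean())**2)
ssm = sst - sse

# Mean squares and the ANOVA F statistic
msm = ssm / p                  # model degrees of freedom: p
mse = sse / (n - p - 1)        # error degrees of freedom: n - p - 1
F = msm / mse

# P-value: area above F under the F(p, n - p - 1) distribution
p_value = stats.f.sf(F, p, n - p - 1)
```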
8
ANOVA table for multiple regression

Source   Sum of squares   df           Mean square      F          P-value
Model    SSM              p            MSM = SSM/DFM    MSM/MSE    Tail area above F
Error    SSE              n − p − 1    MSE = SSE/DFE
Total    SST              n − 1

SST = SSM + SSE and DFT = DFM + DFE.

The regression standard deviation s is calculated from the residuals ei = yi − ŷi:
s = √(Σei² / (n − p − 1)) = √MSE
s estimates σ, the standard deviation of the model errors (s² = MSE is an unbiased estimate of σ²).
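The quantities from the sketch can be arranged into the same table, and s recovered from MSE:

```python
# Print the ANOVA table from the quantities computed above
print(f"{'Source':<8}{'SS':>12}{'df':>6}{'MS':>12}{'F':>10}{'P':>12}")
print(f"{'Model':<8}{ssm:>12.3f}{p:>6}{msm:>12.3f}{F:>10.3f}{p_value:>12.4g}")
print(f"{'Error':<8}{sse:>12.3f}{n - p - 1:>6}{mse:>12.3f}")
print(f"{'Total':<8}{sst:>12.3f}{n - 1:>6}")

# s, the estimate of the regression standard deviation sigma
s = np.sqrt(mse)
```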
9
Squared multiple correlation R²
Just as with simple linear regression, R², the squared multiple correlation, is the proportion of the variation in the response variable y that is explained by the model. In the particular case of multiple linear regression, the model is a linear function of all p explanatory variables taken together.
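In the sketch, R² can be computed two equivalent ways:

```python
# Proportion of the variation in y explained by the model
r_squared = ssm / sst                         # equivalently: 1 - sse / sst

# R^2 is also the squared correlation between y and the fitted values
r2_check = np.corrcoef(y, y_hat)[0, 1] ** 2   # equal up to rounding
```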
10
We have data on 224 first-year computer science majors at a large university in a given year. The data for each student include:
* Cumulative GPA after 2 semesters at the university (y, response variable)
* SAT math score (SATM, x1, explanatory variable)
* SAT verbal score (SATV, x2, explanatory variable)
* Average high school grade in math (HSM, x3, explanatory variable)
* Average high school grade in science (HSS, x4, explanatory variable)
* Average high school grade in English (HSE, x5, explanatory variable)
Here are the summary statistics for these data, as produced by the SAS software:
11
The first step in multiple linear regression is to study all pairwise relationships between the p + 1 variables. Here is the SAS output for all pairwise correlation analyses (the value of r and the two-sided P-value for H0: ρ = 0). Scatterplots of all 15 pairwise relationships are also necessary to understand the data.
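The same first step can be done in Python with pandas. The data frame below is a random stand-in, since the actual data set is not reproduced here; with the real data, df.corr() would match the SAS output:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 224-student case-study data
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(224, 6)),
                  columns=["GPA", "SATM", "SATV", "HSM", "HSS", "HSE"])

# All pairwise correlations r (15 distinct pairs for 6 variables)
print(df.corr())

# Scatterplots for all 15 pairwise relationships (requires matplotlib)
pd.plotting.scatter_matrix(df, figsize=(10, 10))
```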
12
For simplicity, let's first run a multiple linear regression using only the three high school grade averages. In the output: the overall P-value is very significant; R² is fairly small (20%); HSM is significant, while HSS and HSE are not.
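The slides use SAS; a rough Python equivalent with statsmodels, assuming the df stand-in from the previous sketch (the printed summary would match the SAS results only with the real data):

```python
import statsmodels.api as sm

# Regress GPA on the three high school grade averages
X3 = sm.add_constant(df[["HSM", "HSS", "HSE"]])
fit3 = sm.OLS(df["GPA"], X3).fit()
print(fit3.summary())   # overall F test and P-value, R^2, per-coefficient t tests
```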
13
The ANOVA F-test for the multiple linear regression using only HSM, HSS, and HSE is very significant: at least one of the regression coefficients is significantly different from zero. But R² is fairly small (0.205): only about 20% of the variation in cumulative GPA can be explained by these high school scores. (Remember, a small P-value does not imply a large effect.)
14
The tests of hypotheses for each β within the multiple linear regression reach significance for HSM only. We found a significant correlation between HSS and GPA when they were analyzed by themselves, so why isn't bHSS significant in the multiple regression equation? HSS and HSM are themselves significantly correlated, which suggests redundancy: if all three high school averages are used, neither HSE nor HSS significantly helps us predict GPA over and above what HSM does all by itself.
15
We now drop the least significant variable from the previous model: HSS. In the output: the overall P-value is still very significant; R² is still small (20%); HSM is significant, HSE is not. The conclusions are about the same, but notice that the actual regression coefficients have changed.
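The corresponding refit in the statsmodels sketch (fit3 and df as above):

```python
# Refit after dropping HSS; the remaining coefficients change
X2 = sm.add_constant(df[["HSM", "HSE"]])
fit2 = sm.OLS(df["GPA"], X2).fit()
print(fit2.params)      # compare with fit3.params from the previous model
```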
16
Let's now run a multiple linear regression with the two SAT scores only. The ANOVA F-test for βSATM and βSATV is very significant: at least one is not zero. But R² is really small (0.06): only 6% of the variation in GPA is explained by these tests. When the two are taken together, only SATM is a significant predictor of GPA (P = 0.0007).
17
We finally run a multiple regression model with all five variables together. The overall P-value is very significant and R² is still fairly small (21%), but only the average high school math grade (HSM) makes a significant contribution in this model to predicting the cumulative GPA. This conclusion applies to computer science majors at this large university.
19
Warnings
Multiple regression is a dangerous tool. The significance or non-significance of any one predictor in a multiple-predictor model is largely a function of which other predictors are or are not included in the model. P-values arrived at after a selection process (where we drop seemingly unimportant X's) don't mean the same thing as P-values do in earlier parts of the textbook. And when the X's are highly correlated with each other, two equations with very different parameters can yield very similar predictions, so interpreting a slope as "if I change X by 1 unit, I will change Y by b units" may not be a good idea.
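That last warning is easy to demonstrate numerically. In this small sketch (all numbers made up), two equations with very different parameters give nearly identical predictions because the two predictors are almost collinear:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Two highly correlated predictors
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly identical to x1

# Two equations with very different parameters ...
pred_a = 1.0 + 3.0 * x1 + 0.0 * x2
pred_b = 1.0 + 0.0 * x1 + 3.0 * x2

# ... give almost the same prediction for every observation
print(np.max(np.abs(pred_a - pred_b)))     # small relative to the spread of pred_a
```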