9/14/ STATS 330: Lecture 6
Inference for the Regression model
  Aim of today’s lecture: to discuss how we assess the significance of variables in the regression
  Key concepts:
    Standard errors
    Confidence intervals for the coefficients
    Tests of significance
  Reference: Coursebook Section 3.2
Variability of the regression coefficients
  Imagine that we keep the x’s fixed, but resample the errors and refit the plane.
  How much would the plane (estimated coefficients) change?
  This gives us an idea of the variability (accuracy) of the estimated coefficients as estimates of the coefficients of the true regression plane.
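The thought experiment above can be tried directly by simulation. This is only a sketch: the x values, the "true" coefficients in beta, and sigma are all made up for illustration and do not come from the course data.

```r
set.seed(330)                     # arbitrary seed, for reproducibility only
n     <- 31
x1    <- runif(n, 8, 20)          # made-up explanatory variables, held fixed
x2    <- runif(n, 60, 90)
beta  <- c(10, 2, 0.5)            # made-up "true" plane
sigma <- 4                        # made-up error standard deviation

# Refit the plane on 1000 fresh sets of errors, keeping x1 and x2 fixed
coefs <- t(replicate(1000, {
  y <- beta[1] + beta[2] * x1 + beta[3] * x2 + rnorm(n, sd = sigma)
  coef(lm(y ~ x1 + x2))
}))

# Spread of the refitted coefficients across the resamples
sim.sd <- apply(coefs, 2, sd)
sim.sd

# Theoretical standard deviations for comparison: sigma * sqrt(diag((X'X)^-1))
X <- cbind(1, x1, x2)
theory.se <- sigma * sqrt(diag(solve(t(X) %*% X)))
theory.se
```

The two sets of numbers should be close. In practice we have only one sample, so sigma is replaced by its estimate and the result is the standard error.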
The regression model (cont)
  The data is scattered above and below the plane.
  The size of the “sticks” is random, controlled by $\sigma^2$, and doesn’t depend on $x_1$, $x_2$.
Variability of coefficients (2)
  Variability depends on
    The arrangement of the x’s (the more correlation, the more change; see Lecture 8)
    The error variance (the more scatter about the true plane, the more the fitted plane changes)
  Measure variability by the standard error of the coefficients
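Both ingredients (the arrangement of the x’s and the error variance) appear in the usual formula for the standard errors, the square roots of the diagonal of $s^2 (X^T X)^{-1}$. The slides do not state this formula; it is the standard least-squares result. The sketch below assumes that cherry.df holds the same data as R's built-in trees data set, with Girth renamed to diameter.

```r
# Assumption: cherry.df matches R's built-in 'trees' data (Girth = diameter)
cherry.df <- data.frame(diameter = trees$Girth,
                        height   = trees$Height,
                        volume   = trees$Volume)
cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)

X  <- model.matrix(cherry.lm)                            # columns: 1, diameter, height
s2 <- sum(resid(cherry.lm)^2) / df.residual(cherry.lm)   # estimate of the error variance

# Standard errors: sqrt of the diagonal of s2 * (X'X)^-1
se.by.hand <- sqrt(diag(s2 * solve(t(X) %*% X)))
se.by.hand

# Same numbers as the "Std. Error" column printed by summary()
coef(summary(cherry.lm))[, "Std. Error"]
```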
Cherries: standard errors of coefficients
  Call:
  lm(formula = volume ~ diameter + height, data = cherry.df)
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)                                  e-07 ***
  diameter                                  < 2e-16 ***
  height                                            *
  ---
  Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
  Residual standard error: on 28 degrees of freedom
  Multiple R-Squared: 0.948, Adjusted R-squared:
  F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Confidence interval
Confidence interval (2)
  A 95% confidence interval for a regression coefficient is of the form
    estimated coefficient ± t × standard error
  where t is the 97.5% point of the appropriate t-distribution.
  The degrees of freedom are n-k-1, where n is the number of cases (observations) in the regression and k is the number of explanatory variables (assuming we have a constant term).
Example: cherry trees
  Use function confint
  > confint(cherry.lm)
              2.5% 97.5%
  (Intercept)
  diameter
  height
  cherry.lm is the object created by lm
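The interval formula can also be applied by hand and compared with confint. A minimal sketch, again assuming that cherry.df matches R's built-in trees data:

```r
# Assumption: cherry.df matches R's built-in 'trees' data (Girth = diameter)
cherry.df <- data.frame(diameter = trees$Girth,
                        height   = trees$Height,
                        volume   = trees$Volume)
cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)

est <- coef(cherry.lm)
se  <- coef(summary(cherry.lm))[, "Std. Error"]

n <- nrow(cherry.df)                  # number of cases
k <- 2                                # number of explanatory variables
tcrit <- qt(0.975, df = n - k - 1)    # 97.5% point of t with n-k-1 df

# estimated coefficient +/- t * standard error
ci <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
ci

confint(cherry.lm)                    # built-in version, for comparison
```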
Hypothesis test
  Often we ask “do we need a particular variable, given the others are in the model?”
  Note that this is not the same as asking “is a particular variable related to the response?”
  Can test the former by examining the ratio of the coefficient to its standard error
Hypothesis test (2)
  This ratio is the t-statistic t
  The bigger t is in absolute value, the more we need the variable
  Equivalently, the smaller the p-value, the more we need the variable
Cherries: t-values and p-values
  Call:
  lm(formula = volume ~ diameter + height, data = cherry.df)
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)                                  e-07 ***
  diameter                                  < 2e-16 ***
  height                                            *
  ---
  Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
  Residual standard error: on 28 degrees of freedom
  Multiple R-Squared: 0.948, Adjusted R-squared:
  F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
  All variables are required, since the p-values are small (<0.05)
P-value
  Figure: density curve for t with 28 degrees of freedom. The p-value is the total area under the curve beyond +/- t.
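The t-statistics and the two-tailed areas can be computed directly. A short sketch, assuming that cherry.df matches R's built-in trees data:

```r
# Assumption: cherry.df matches R's built-in 'trees' data (Girth = diameter)
cherry.df <- data.frame(diameter = trees$Girth,
                        height   = trees$Height,
                        volume   = trees$Volume)
cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)

est <- coef(cherry.lm)
se  <- coef(summary(cherry.lm))[, "Std. Error"]

# t-statistic: ratio of each coefficient to its standard error
t.stat <- est / se

# p-value: total area under the t density beyond +/- t (both tails)
p.val <- 2 * pt(abs(t.stat), df = df.residual(cherry.lm), lower.tail = FALSE)

cbind(t.stat, p.val)
```

These should match the "t value" and "Pr(>|t|)" columns of summary(cherry.lm).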
Other hypotheses
  Overall significance of the regression: do none of the variables have a relationship with the response?
  Use the F statistic:
    the bigger F, the more evidence that at least one variable has a relationship
    equivalently, the smaller the p-value, the more evidence that at least one variable has a relationship
Cherries: F-value and p-value
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)                                  e-07 ***
  diameter                                  < 2e-16 ***
  height                                            *
  ---
  Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
  Residual standard error: on 28 degrees of freedom
  Multiple R-Squared: 0.948, Adjusted R-squared:
  F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
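The overall F statistic compares the fitted model with the model containing only a constant term. The sketch below computes it from the residual and total sums of squares, using the standard textbook form (which the slides do not spell out) and assuming that cherry.df matches R's built-in trees data.

```r
# Assumption: cherry.df matches R's built-in 'trees' data (Girth = diameter)
cherry.df <- data.frame(diameter = trees$Girth,
                        height   = trees$Height,
                        volume   = trees$Volume)
cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)

n <- nrow(cherry.df)
k <- 2                                                     # explanatory variables

rss <- sum(resid(cherry.lm)^2)                             # RSS of the fitted model
tss <- sum((cherry.df$volume - mean(cherry.df$volume))^2)  # RSS of the constant-only model

# Drop in RSS per variable, relative to the residual mean square
F.stat <- ((tss - rss) / k) / (rss / (n - k - 1))
p.val  <- pf(F.stat, k, n - k - 1, lower.tail = FALSE)     # upper-tail area

c(F = F.stat, p.value = p.val)
summary(cherry.lm)$fstatistic                              # built-in version, for comparison
```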
Testing if a subset is required
  Often we want to test if a subset of variables is unnecessary
  Terminology:
    Full model: model with all the variables
    Sub-model: model with a set of variables deleted
  Test is based on comparing the RSS of the sub-model with the RSS of the full model.
  Full model RSS is never larger than the sub-model RSS (why?)
Testing if a subset is adequate (2)
  If the full model RSS is not much smaller than the sub-model RSS, the sub-model is adequate: we don’t need the extra variables.
  To do the test, we
    Fit both models, get RSS for both
    Calculate the test statistic (see next slide)
  If the test statistic is large (equivalently, the p-value is small), the sub-model is not adequate
Test statistic
  Test statistic is
    $F = \dfrac{(\mathrm{RSS}_{\mathrm{sub}} - \mathrm{RSS}_{\mathrm{full}})/d}{s^2}$
  where d is the number of variables dropped and $s^2$ is the estimate of $\sigma^2$ from the full model (the residual mean square)
  R has a function anova to do the calculations
P-values
  When the smaller model is correct, the test statistic has an F-distribution with d and n-k-1 degrees of freedom
  We assess if the value of F calculated from the sample is a plausible value from this distribution by means of a p-value
  If the p-value is too small, we reject the hypothesis that the sub-model is adequate
P-values (cont)
  Figure: F density curve, marking the observed value of F and the p-value (the area to the right of F)
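The sub-model test can be worked through by hand and checked against anova. The data here are simulated purely for illustration: sim.df and its variables y, x1, x2, x3 are hypothetical and are not the course data. Only x2 truly affects y, so the sub-model is correct by construction.

```r
set.seed(330)                         # arbitrary seed, for reproducibility only
n <- 20
# Made-up data: three explanatory variables, only x2 matters
sim.df   <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim.df$y <- 1 + 2 * sim.df$x2 + rnorm(n)

model.full <- lm(y ~ x1 + x2 + x3, data = sim.df)
model.sub  <- lm(y ~ x2, data = sim.df)

rss.full <- sum(resid(model.full)^2)
rss.sub  <- sum(resid(model.sub)^2)

d  <- 2                                      # variables dropped: x1 and x3
s2 <- rss.full / df.residual(model.full)     # residual mean square, full model

F.stat <- ((rss.sub - rss.full) / d) / s2
p.val  <- pf(F.stat, d, df.residual(model.full), lower.tail = FALSE)
c(F = F.stat, p.value = p.val)

anova(model.sub, model.full)                 # same F and p-value
```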
Example
  Free fatty acid data: use physical measures to model a biochemical parameter in overweight children
  Variables are
    FFA: free fatty acid level in blood (response)
    Age (months)
    Weight (pounds)
    Skinfold thickness (inches)
Data
  Columns: ffa, age, weight, skinfold
  …
  20 observations in all
Analysis (1)
  > model.full <- lm(ffa ~ age + weight + skinfold, data = fatty.df)
  > summary(model.full)
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)                                      *
  age
  weight                                           **
  skinfold
  This suggests that
    age is not required if weight, skinfold retained
    skinfold is not required if weight, age retained
  Can we get away with just weight?
Analysis (2)
  > model.sub <- lm(ffa ~ weight, data = fatty.df)
  > anova(model.sub, model.full)
  Analysis of Variance Table
  Model 1: ffa ~ weight
  Model 2: ffa ~ age + weight + skinfold
    Res.Df RSS Df Sum of Sq F Pr(>F)
  Small F and large p-value suggest weight alone is adequate.
  But the test should be interpreted with caution, as we “pretested”: the sub-model was chosen after looking at the full-model fit.
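Testing a combination of coefficients
  Cherry trees: our model is $V = c\,D^{\beta_1} H^{\beta_2}$, or
    $\log(V) = \beta_0 + \beta_1 \log(D) + \beta_2 \log(H)$
  Dimension analysis suggests $\beta_1 + \beta_2 = 3$. How can we test this?
  Test statistic is
    $t = \dfrac{\hat\beta_1 + \hat\beta_2 - 3}{\mathrm{s.e.}(\hat\beta_1 + \hat\beta_2)}$
  P-value is the area under the t-curve beyond +/- t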
Testing a combination (cont)
  We can use the “R330” function test.lc to compute the value of t:
  > cherry.lm = lm(log(volume)~log(diameter)+log(height),data=cherry.df)
  > cc = c(0,1,1)
  > c = 3
  > test.lc(cherry.lm,cc,c)
  $est
  [1]
  $std.err
  [1]
  $t.stat
  [1]
  $df
  [1] 28
  $p.val
  [1]
The “R330” package
  A set of functions written for the course, in the form of an R package
  Install the package using the R packages menu (see coursebook for details).
  Then type library(R330)
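Testing a combination (cont)
  In general, we might want to test
    $c_0\beta_0 + c_1\beta_1 + c_2\beta_2 = c$
  (in our example $c_0 = 0$, $c_1 = 1$, $c_2 = 1$, $c = 3$)
  Estimate is
    $c_0\hat\beta_0 + c_1\hat\beta_1 + c_2\hat\beta_2$
  Test statistic is
    $t = \dfrac{c_0\hat\beta_0 + c_1\hat\beta_1 + c_2\hat\beta_2 - c}{\mathrm{s.e.}(c_0\hat\beta_0 + c_1\hat\beta_1 + c_2\hat\beta_2)}$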
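The same quantities can be computed in base R without the R330 package, using vcov for the estimated covariance matrix of the coefficients. This is a sketch of the calculation, not the source of test.lc; whether test.lc works this way internally is an assumption. The names cherry.log.lm and target are made up here, and cherry.df is assumed to match R's built-in trees data.

```r
# Assumption: cherry.df matches R's built-in 'trees' data (Girth = diameter)
cherry.df <- data.frame(diameter = trees$Girth,
                        height   = trees$Height,
                        volume   = trees$Volume)
cherry.log.lm <- lm(log(volume) ~ log(diameter) + log(height), data = cherry.df)

cc     <- c(0, 1, 1)     # weights on (intercept, log diameter, log height)
target <- 3              # hypothesised value of the combination

est     <- sum(cc * coef(cherry.log.lm))                         # estimated combination
std.err <- sqrt(drop(t(cc) %*% vcov(cherry.log.lm) %*% cc))      # its standard error
t.stat  <- (est - target) / std.err
df      <- df.residual(cherry.log.lm)
p.val   <- 2 * pt(abs(t.stat), df, lower.tail = FALSE)           # area beyond +/- t

list(est = est, std.err = std.err, t.stat = t.stat, df = df, p.val = p.val)
```

The standard error uses the fact that the variance of a linear combination $c^T\hat\beta$ is $c^T V c$, where $V$ is the covariance matrix of $\hat\beta$.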