Lecture 9: ANOVA Tables and F-tests
BMTRY 701 Biostatistical Methods II
ANOVA: Analysis of Variance
- Similar in derivation to the ANOVA that generalizes the two-sample t-test.
- Partitions the variation in Y into two parts:
  - that due to the ‘model’: SSR
  - that due to ‘error’: SSE
- The sum of the two parts is the total sum of squares: SST.
Total Deviations: $Y_i - \bar{Y}$, the deviation of each observation from the overall mean.
Regression Deviations: $\hat{Y}_i - \bar{Y}$, the deviation of each fitted value from the overall mean.
Error Deviations: $Y_i - \hat{Y}_i$, the deviation of each observation from its fitted value.
Definitions
- Each total deviation splits into the two pieces: $Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$
- SST = $\sum_i (Y_i - \bar{Y})^2$
- SSR = $\sum_i (\hat{Y}_i - \bar{Y})^2$
- SSE = $\sum_i (Y_i - \hat{Y}_i)^2$
- SST = SSR + SSE
Example: logLOS ~ BEDS
> ybar <- mean(data$logLOS)
> yhati <- reg$fitted.values
> sst <- sum((data$logLOS - ybar)^2)
> ssr <- sum((yhati - ybar)^2)
> sse <- sum((data$logLOS - yhati)^2)
> sst
[1]
> ssr
[1]
> sse
[1]
> sse + ssr
[1]
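The hospital dataset itself is not reproduced in these notes, so here is a minimal self-contained sketch of the same computation on simulated data (the simulation parameters are placeholders, not the course data; only the object names data, reg, BEDS, and logLOS follow the lecture):

# simulate stand-in data so the lecture code runs end to end
set.seed(701)
n <- 113
BEDS <- rnorm(n, mean = 250, sd = 150)
logLOS <- 2 + 0.001 * BEDS + rnorm(n, sd = 0.2)
data <- data.frame(logLOS = logLOS, BEDS = BEDS)
reg <- lm(logLOS ~ BEDS, data = data)

ybar  <- mean(data$logLOS)
yhati <- reg$fitted.values
sst <- sum((data$logLOS - ybar)^2)   # total sum of squares
ssr <- sum((yhati - ybar)^2)         # regression sum of squares
sse <- sum((data$logLOS - yhati)^2)  # error sum of squares
all.equal(sst, ssr + sse)            # TRUE: SSR + SSE adds up to SST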
Degrees of Freedom
- SST: n - 1 (one df is lost because it is used to estimate the mean of Y)
- SSR: 1 (only one df because all fitted values are based on the same fitted regression line)
- SSE: n - 2 (two df are lost in estimating the regression line: slope and intercept)
Mean Squares
- “Scaled” version of a Sum of Squares: Mean Square = SS/df
- MSR = SSR/1
- MSE = SSE/(n-2)
Notes: mean squares are not additive! That is, MSR + MSE ≠ SST/(n-1). The MSE is the same quantity we saw previously: the estimate of σ².
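Continuing the simulated sketch above, the mean squares and the ratio that will reappear below as the F statistic take one line each:

msr <- ssr / 1        # regression mean square, df = 1
mse <- sse / (n - 2)  # error mean square, df = n - 2
msr / mse             # MSR/MSE; compare with anova(reg)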
Standard ANOVA Table

Source       SS    df    MS
Regression   SSR   1     MSR = SSR/1
Error        SSE   n-2   MSE = SSE/(n-2)
Total        SST   n-1
ANOVA for logLOS ~ BEDS
> anova(reg)
Analysis of Variance Table

Response: logLOS
           Df  Sum Sq  Mean Sq  F value  Pr(>F)
BEDS        1                    24.44     e-06 ***
Residuals 111
Inference? What is of interest and how do we interpret it? We’d like to know if BEDS is related to logLOS. How do we do that using the ANOVA table? We need to know the expected values of the MSR and MSE:
- $E(\mathrm{MSE}) = \sigma^2$
- $E(\mathrm{MSR}) = \sigma^2 + \beta_1^2 \sum_i (X_i - \bar{X})^2$
Implications
- The mean of the sampling distribution of the MSE is σ², regardless of whether or not β1 = 0.
- If β1 = 0, E(MSE) = E(MSR).
- If β1 ≠ 0, E(MSE) < E(MSR).
To test the significance of β1, we can test whether the MSR and MSE are of the same magnitude.
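A small simulation (illustrative only, not from the lecture) makes the expected-mean-square argument concrete: with β1 = 0 the average MSR and MSE are both near σ², while with β1 ≠ 0 the average MSR is inflated.

set.seed(1)
sim_ms <- function(beta1, nsim = 2000, n = 113, sigma = 1) {
  ms <- replicate(nsim, {
    x <- rnorm(n)
    y <- beta1 * x + rnorm(n, sd = sigma)
    anova(lm(y ~ x))$"Mean Sq"   # c(MSR, MSE) for one simulated dataset
  })
  rowMeans(ms)                   # average MSR and MSE across simulations
}
sim_ms(beta1 = 0)    # both entries close to sigma^2 = 1
sim_ms(beta1 = 0.3)  # first entry (MSR) clearly exceeds the second (MSE)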
F-test
- Derived naturally from the arguments just made.
- Hypotheses: H0: β1 = 0 vs. H1: β1 ≠ 0
- Test statistic: F* = MSR/MSE
- Based on the earlier argument, we expect F* > 1 if H1 is true. This implies a one-sided test.
F-test
The distribution of F* under the null has two sets of degrees of freedom (df):
- numerator degrees of freedom
- denominator degrees of freedom
These correspond to the df shown in the ANOVA table: numerator df = 1, denominator df = n-2. The test is based on $F^* \sim F(1, n-2)$ under H0.
Implementing the F-test
The decision rule:
- If F* > F(1-α; 1, n-2), then reject H0.
- If F* ≤ F(1-α; 1, n-2), then fail to reject H0.
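In R, the critical value and p-value for this rule come from qf() and pf(); a sketch using the simulated reg and n from earlier, with α = 0.05:

alpha <- 0.05
fstar <- anova(reg)$"F value"[1]   # observed F* = MSR/MSE
fcrit <- qf(1 - alpha, 1, n - 2)   # critical value F(1 - alpha; 1, n-2)
fstar > fcrit                      # decision: reject H0 when TRUE
1 - pf(fstar, 1, n - 2)            # p-value for the F-test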
F-distributions [figure omitted]
ANOVA for logLOS ~ BEDS
> anova(reg)
Analysis of Variance Table

Response: logLOS
           Df  Sum Sq  Mean Sq  F value  Pr(>F)
BEDS        1                    24.44     e-06 ***
Residuals 111
> qf(0.95, 1, 111)
[1]
> 1-pf(24.44,1,111)
[1] e-06
More interesting: MLR
You can test that several coefficients are zero at the same time. Otherwise, the F-test gives the same result as a t-test. That is, for testing the significance of ONE covariate in a linear regression model, an F-test and a t-test give the same result (in fact, F* = (t*)²):
H0: β1 = 0 vs. H1: β1 ≠ 0
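Continuing the simulated sketch, the single-covariate equivalence is easy to verify numerically: the slope’s t statistic squared equals the ANOVA F statistic.

tstat <- summary(reg)$coefficients["BEDS", "t value"]  # t-test for the slope
fstat <- anova(reg)$"F value"[1]                       # F-test from the ANOVA table
all.equal(tstat^2, fstat)                              # TRUE: F* = (t*)^2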
General F-testing approach
The previous test seems simple. It is in this case, but it can be generalized to be more useful. Imagine a more general test:
- H0: small model
- Ha: large model
Constraint: the small model must be ‘nested’ in the large model. That is, the small model must be a ‘subset’ of the large model.
Example of ‘nested’ models
Model 1: LOS = β0 + β1·INFRISK + β2·ms + β3·NURSE + β4·nurse2 + ε
Model 2: LOS = β0 + β1·INFRISK + β3·NURSE + β4·nurse2 + ε
Model 3: LOS = β0 + β1·INFRISK + β2·ms + ε
Models 2 and 3 are nested in Model 1. Model 2 is not nested in Model 3, and Model 3 is not nested in Model 2.
Testing: Models must be nested!
To test Model 1 vs. Model 2, we are testing that β2 = 0:
H0: β2 = 0 vs. Ha: β2 ≠ 0
If β2 = 0 (that is, if we fail to reject the null hypothesis), we conclude that Model 2 is superior to Model 1; if we reject the null hypothesis, we conclude that Model 1 is superior.
Model 1: LOS = β0 + β1·INFRISK + β2·ms + β3·NURSE + β4·nurse2 + ε
Model 2: LOS = β0 + β1·INFRISK + β3·NURSE + β4·nurse2 + ε
R
reg1 <- lm(LOS ~ INFRISK + ms + NURSE + nurse2, data=data)
reg2 <- lm(LOS ~ INFRISK + NURSE + nurse2, data=data)
reg3 <- lm(LOS ~ INFRISK + ms, data=data)

> anova(reg1)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq  Mean Sq  F value  Pr(>F)
INFRISK     1                              e-10 ***
ms          1                                   *
NURSE       1
nurse2      1
Residuals 108
R
> anova(reg2)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq  Mean Sq  F value  Pr(>F)
INFRISK     1                              e-10 ***
NURSE       1
nurse2      1
Residuals 109

> anova(reg1, reg2)
Analysis of Variance Table

Model 1: LOS ~ INFRISK + ms + NURSE + nurse2
Model 2: LOS ~ INFRISK + NURSE + nurse2
  Res.Df  RSS  Df  Sum of Sq  F  Pr(>F)
1    108
2    109
R
> summary(reg1)
Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   6.355e          e              < 2e-16 ***
INFRISK       6.289e          e                 e-06 ***
ms            7.829e          e
NURSE         4.136e          e
nurse2              e          e
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: on 108 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 4 and 108 DF, p-value: 1.298e-08
Testing more than two covariates
To test Model 1 vs. Model 3, we are testing that β3 = 0 AND β4 = 0:
H0: β3 = β4 = 0 vs. Ha: β3 ≠ 0 or β4 ≠ 0
If β3 = β4 = 0 (that is, if we fail to reject the null hypothesis), we conclude that Model 3 is superior to Model 1; if we reject the null hypothesis, we conclude that Model 1 is superior.
Model 1: LOS = β0 + β1·INFRISK + β2·ms + β3·NURSE + β4·nurse2 + ε
Model 3: LOS = β0 + β1·INFRISK + β2·ms + ε
R
> anova(reg3)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq  Mean Sq  F value  Pr(>F)
INFRISK     1                              e-10 ***
ms          1                                   *
Residuals 110

> anova(reg1, reg3)
Analysis of Variance Table

Model 1: LOS ~ INFRISK + ms + NURSE + nurse2
Model 2: LOS ~ INFRISK + ms
  Res.Df  RSS  Df  Sum of Sq  F  Pr(>F)
1    108
2    110
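The same comparison can be computed by hand from the two fitted models; a sketch assuming reg1 and reg3 have been fit to the course data as above (it reproduces what anova(reg1, reg3) reports):

sse_small <- sum(resid(reg3)^2)   # SSE of the reduced model (INFRISK + ms)
sse_large <- sum(resid(reg1)^2)   # SSE of the full model (+ NURSE + nurse2)
df_small  <- reg3$df.residual     # 110
df_large  <- reg1$df.residual     # 108
fstar <- ((sse_small - sse_large) / (df_small - df_large)) /
  (sse_large / df_large)          # general F statistic for nested models
1 - pf(fstar, df_small - df_large, df_large)  # p-value, matches anova(reg1, reg3)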
R
> summary(reg3)

Call:
lm(formula = LOS ~ INFRISK + ms, data = data)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  <2e-16 ***
INFRISK                                        e-08 ***
ms                                                  *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: on 110 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 110 DF, p-value: 8.42e-10
Testing multiple coefficients simultaneously
REGION: it is a ‘factor’ variable with 4 categories.
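As a preview, testing a 4-category factor means testing its 3 dummy-variable coefficients simultaneously with one general F-test; a hedged sketch (the column name REGION and the choice of INFRISK as the other covariate are assumptions, following the naming in the earlier examples):

data$region <- factor(data$REGION)            # 4 categories -> 3 dummy variables
fit_small  <- lm(LOS ~ INFRISK, data = data)  # model without region
fit_large  <- lm(LOS ~ INFRISK + region, data = data)
anova(fit_small, fit_large)   # F-test of H0: all 3 region coefficients = 0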