Lecture 7: Multiple Linear Regression
Interpretation with different types of predictors
BMTRY 701 Biostatistical Methods II
Interpreting regression coefficients
So far, we have considered continuous covariates. Covariates can take other forms:
- binary
- nominal categorical
- quadratics (or other transforms)
- interactions
Interpretations vary depending on the nature of your covariate.
Binary covariates
Considered 'qualitative': the ordering of the numeric assignments does not matter.
Example: MEDSCHL: 1 = yes; 2 = no
Other common examples:
- gender
- mutation status
- pre- vs. post-menopausal
- two age categories
How is MEDSCHL related to LOS? How do we interpret β1?
Coding of variables: MEDSCHL is coded 2 vs. 1; I prefer 1 vs. 0. The difference? The intercept.
Let's make a new variable:
MS = 1 if MEDSCHL = 1 (yes)
MS = 0 if MEDSCHL = 2 (no)
The model: LOS = β0 + β1 MS + ε
How is MEDSCHL related to LOS? What does β1 mean?
The model has the same form as with a continuous covariate, yet the interpretation differs: β1 is the difference in mean LOS between hospitals with a medical school (MS = 1) and those without (MS = 0). What if we had used the old coding?
R code
> data$ms <- ifelse(data$MEDSCHL==2, 0, data$MEDSCHL)
> table(data$ms, data$MEDSCHL)

     1  2
  0  0 96
  1 17  0
>
> reg <- lm(LOS ~ ms, data=data)
> summary(reg)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.4105     0.1871  50.290  < 2e-16 ***
ms            1.5807     0.4824   3.276  0.00140 **
---
Residual standard error: 1.833 on 111 degrees of freedom
Multiple R-squared: 0.08818, Adjusted R-squared: 0.07997
F-statistic: 10.73 on 1 and 111 DF, p-value: 0.001404
Scatterplot? Residual plot?
res <- reg$residuals
plot(data$ms, res)
abline(h=0)
Fitted values
Only two fitted values:
MS = 0: LOS = 9.41
MS = 1: LOS = 9.41 + 1.58 = 10.99
Diagnostic plots are not as informative. Extrapolation and interpolation are meaningless! We can estimate LOS for MS = 0.5:
LOS = 9.41 + 1.58*0.5 = 10.20
Try to interpret the result…
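To see this concretely, a short sketch (assuming the fitted object reg from the R code above): the model yields exactly two distinct fitted values, and predict() will mechanically produce a number for MS = 0.5 even though no hospital is 'half' a medical school.

# only two distinct fitted values, one per group
unique(round(reg$fitted.values, 2))
# predict() happily interpolates to ms = 0.5, but the result is uninterpretable
predict(reg, newdata = data.frame(ms = c(0, 1, 0.5)))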
"Linear" regression?
But what about the 'linear' assumption? We still need to adhere to the model assumptions. Recall that they relate primarily to the residuals: the residuals are independent and identically distributed, εi ~ N(0, σ²). The model is still linear in the parameters!
MLR example: add infection risk to our model
> reg <- lm(LOS ~ ms + INFRISK, data=data)
> summary(reg)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.4547     0.5146  12.542   <2e-16 ***
ms            0.9717     0.4316   2.251   0.0263 *
INFRISK       0.6998     0.1156   6.054    2e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.595 on 110 degrees of freedom
Multiple R-squared: 0.3161, Adjusted R-squared: 0.3036
F-statistic: 25.42 on 2 and 110 DF, p-value: 8.42e-10
How does interpretation change?
What about more than two categories?
We looked briefly at region a few lectures back. How to interpret? You need to define a reference category:
- for med school, the reference was ms = 0
- almost 'subconscious' with only two categories
- with more than two categories, you need to be careful about interpretation
LOS ~ REGION
Note how the 'indicator' (or 'dummy') variable is defined:
I(condition) = 1 if condition is true
I(condition) = 0 if condition is false
With region 1 as the reference, the model is:
LOS = β0 + β1 I(REGION=2) + β2 I(REGION=3) + β3 I(REGION=4) + ε
Interpretation
β0 = mean LOS in the reference region (REGION = 1)
β1 = difference in mean LOS, region 2 vs. region 1
β2 = difference in mean LOS, region 3 vs. region 1
β3 = difference in mean LOS, region 4 vs. region 1
Hypothesis tests?
H0: β1 = 0 vs. Ha: β1 ≠ 0. What does that test (in words)?
H0: β2 = 0 vs. Ha: β2 ≠ 0. What does that test (in words)?
What if we want to test region, in general? One of our next topics!
R
> reg <- lm(LOS ~ factor(REGION), data=data)
> summary(reg)
Call:
lm(formula = LOS ~ factor(REGION), data = data)
Residuals:
     Min       1Q   Median       3Q      Max
-3.05893 -1.03135 -0.02344  0.68107  8.47107
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      11.0889     0.3165  35.040  < 2e-16 ***
factor(REGION)2  -1.4055     0.4333  -3.243  0.00157 **
factor(REGION)3  -1.8976     0.4194  -4.524 1.55e-05 ***
factor(REGION)4  -2.9752     0.5248  -5.669 1.19e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.675 on 109 degrees of freedom
Multiple R-squared: 0.2531, Adjusted R-squared: 0.2325
F-statistic: 12.31 on 3 and 109 DF, p-value: 5.376e-07
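As a preview of the general test (a later topic), a sketch of how R tests REGION as a whole by comparing nested models; with REGION as the only predictor, this reproduces the overall F-statistic printed by summary() above (12.31 on 3 and 109 df).

reg0 <- lm(LOS ~ 1, data=data)               # null model: intercept only
reg1 <- lm(LOS ~ factor(REGION), data=data)  # model with region
anova(reg0, reg1)                            # F-test that all three region coefficients are 0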
Interpreting
Is mean LOS different in region 2 vs. 1? What about region 3 vs. 1, and 4 vs. 1?
What about region 4 vs. 3? How to test that? Two options:
- recode the data so that 3 or 4 is the reference
- use knowledge about the variance of linear combinations to estimate the p-value for the difference in the coefficients
For now…we'll focus on the first (a sketch of the second follows below).
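For completeness, a sketch of the second option, assuming the fitted object reg from the output above. The region 4 vs. 3 difference is β3 - β2, with variance Var(β3) + Var(β2) - 2 Cov(β3, β2), which we can read off the estimated covariance matrix from vcov():

b <- coef(reg)
V <- vcov(reg)
# estimate of the region 4 vs. 3 difference
est <- b["factor(REGION)4"] - b["factor(REGION)3"]
# standard error via the variance of the linear combination
se <- sqrt(V["factor(REGION)4","factor(REGION)4"] +
           V["factor(REGION)3","factor(REGION)3"] -
           2*V["factor(REGION)4","factor(REGION)3"])
t.stat <- est/se
2*pt(-abs(t.stat), df = reg$df.residual)  # two-sided p-value

This should agree with the region 3 coefficient in the recoded model on the next slide (same magnitude, opposite sign).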
Make REGION=4 the reference
Our model then changes:
LOS = β0 + β1 I(REGION=3) + β2 I(REGION=2) + β3 I(REGION=1) + ε
where β0 is now the mean LOS in region 4.
R code: recoding so the last category is the reference
> data$rev.region <- factor(data$REGION, levels=rev(sort(unique(data$REGION))))
> reg <- lm(LOS ~ rev.region, data=data)
> summary(reg)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.1137     0.4186  19.381  < 2e-16 ***
rev.region3   1.0776     0.5010   2.151  0.03371 *
rev.region2   1.5697     0.5127   3.061  0.00277 **
rev.region1   2.9752     0.5248   5.669 1.19e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.675 on 109 degrees of freedom
Multiple R-squared: 0.2531, Adjusted R-squared: 0.2325
F-statistic: 12.31 on 3 and 109 DF, p-value: 5.376e-07
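An aside on idiom: relevel() changes only the reference level, without reversing the order of the other categories; a sketch assuming the same data frame:

data$region.f <- relevel(factor(data$REGION), ref="4")  # region 4 becomes the reference
reg <- lm(LOS ~ region.f, data=data)
summary(reg)  # coefficients for regions 1-3, each vs. region 4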
Quite a few differences:
Model with REGION=4 as the reference:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.1137     0.4186  19.381  < 2e-16 ***
rev.region3   1.0776     0.5010   2.151  0.03371 *
rev.region2   1.5697     0.5127   3.061  0.00277 **
rev.region1   2.9752     0.5248   5.669 1.19e-07 ***
Model with REGION=1 as the reference:
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      11.0889     0.3165  35.040  < 2e-16 ***
factor(REGION)2  -1.4055     0.4333  -3.243  0.00157 **
factor(REGION)3  -1.8976     0.4194  -4.524 1.55e-05 ***
factor(REGION)4  -2.9752     0.5248  -5.669 1.19e-07 ***
But the "model" is the same
Model 1: Residual standard error: 1.675 on 109 degrees of freedom
Model 2: Residual standard error: 1.675 on 109 degrees of freedom
The models represent the data equally well. However, the 'reparameterization' changes the interpretation of the model parameters.
Diagnostics
# residual plot
reg <- lm(LOS ~ factor(REGION), data=data)
res <- reg$residuals
fit <- reg$fitted.values
plot(fit, res)
abline(h=0, lwd=2)
Diagnostics (now with logLOS as the response)
# residual plot
reg <- lm(logLOS ~ factor(REGION), data=data)
res <- reg$residuals
fit <- reg$fitted.values
plot(fit, res)
abline(h=0, lwd=2)
Next type: polynomials
Most common: quadratic. What does that mean? Including a linear and a 'squared' term. Why? To adhere to the model assumptions! Example: last week we saw LOS ~ NURSE, where a quadratic actually made some sense.
Scatterplot (logLOS vs. NURSE)
Fitting the model
> data$nurse2 <- data$NURSE^2
> reg <- lm(logLOS ~ NURSE + nurse2, data=data)
> summary(reg)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.090e+00  3.645e-02  57.355  < 2e-16 ***
NURSE        1.430e-03  3.525e-04   4.058 9.29e-05 ***
nurse2      -1.789e-06  6.262e-07  -2.857  0.00511 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.161 on 110 degrees of freedom
Multiple R-squared: 0.1965, Adjusted R-squared: 0.1819
F-statistic: 13.45 on 2 and 110 DF, p-value: 5.948e-06
Interpretable?
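A side note on idiom: the same model can be fit without creating nurse2 by hand, using I() to protect the arithmetic inside the formula (a sketch; poly(NURSE, 2) is another common option, though it uses orthogonal polynomials, so its coefficients are scaled differently):

reg <- lm(logLOS ~ NURSE + I(NURSE^2), data=data)  # identical fit to NURSE + nurse2
summary(reg)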
How does it fit?
# make regression line
plot(data$NURSE, data$logLOS, pch=16)
coef <- reg$coefficients
nurse.values <- seq(15,650,5)
fit.line <- coef[1] + coef[2]*nurse.values + coef[3]*nurse.values^2
lines(nurse.values, fit.line, lwd=2)
Note: 'abline' will only work for simple linear regression. When there is more than one predictor, you need to construct the line another way.
Another approach to the same data
Does it make sense that logLOS increases and then decreases? Or would it make more sense for it to increase and then plateau? Which do you think makes more sense? How to tell? Use a data-driven approach, which tells us what the data suggest.
Smoothing
An empirical way to look at the relationship:
- the data are 'binned' by x
- for each 'bin', the average y is estimated
- but it is a little fancier: it is a 'moving average', so each x value is in multiple bins
Modern methods fit models within bins:
- lowess smoothing
- cubic spline smoothing
The specifics are not so important; the 'empirical' result is what matters (see the sketch below).
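A crude sketch of the binning idea, assuming the same data frame (the bin width of 50 is an arbitrary choice for illustration):

# bin NURSE into intervals of width 50 and average logLOS within each bin
bins <- cut(data$NURSE, breaks=seq(0, 700, by=50))
bin.means <- tapply(data$logLOS, bins, mean)
bin.mids <- seq(25, 675, by=50)
plot(data$NURSE, data$logLOS, pch=16)
points(bin.mids, bin.means, col=2, pch=17, cex=1.5)
# lowess() (next slide) refines this with overlapping, weighted neighborhoods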
smoother <- lowess(data$NURSE, data$logLOS)
plot(data$NURSE, data$logLOS, pch=16)
lines(smoother, lwd=2, col=2)
lines(nurse.values, fit.line, lwd=2)
legend(450, 3, c("Quadratic Model","Lowess Smooth"),
       lty=c(1,1), lwd=c(2,2), col=c(1,2))
Inference? What do the data say?
It looks like a plateau. How can we model that? One option: a spline (Zeger: "broken arrow" model).
Example: it looks like there is a "knot" at NURSE = 250:
- there is a linear increase in logLOS until about NURSE = 250
- then, the relationship is flat
- this implies one slope prior to NURSE = 250, and another after NURSE = 250
Implementing a spline
A little tricky. We need to define a new variable, NURSE*:
NURSE* = 0 if NURSE ≤ 250
NURSE* = NURSE - 250 if NURSE > 250
And then we write the model as follows:
logLOS = β0 + β1 NURSE + β2 NURSE* + ε
How to interpret?
When in doubt, condition on the different scenarios.
What is E(logLOS) when NURSE ≤ 250? NURSE* = 0, so E(logLOS) = β0 + β1 NURSE.
What is E(logLOS) when NURSE > 250? E(logLOS) = β0 + β1 NURSE + β2 (NURSE - 250) = (β0 - 250 β2) + (β1 + β2) NURSE.
So the slope is β1 before the knot and β1 + β2 after it.
R
data$nurse.star <- ifelse(data$NURSE<=250, 0, data$NURSE-250)
data$nurse.star
reg.spline <- lm(logLOS ~ NURSE + nurse.star, data=data)
# make regression line
coef.spline <- reg.spline$coefficients
nurse.values <- seq(15,650,5)
nurse.values.star <- ifelse(nurse.values<=250, 0, nurse.values-250)
spline.line <- coef.spline[1] + coef.spline[2]*nurse.values + coef.spline[3]*nurse.values.star
plot(data$NURSE, data$logLOS, pch=16)
lines(smoother, lwd=2, col=2)
lines(nurse.values, fit.line, lwd=2)
lines(nurse.values, spline.line, col=4, lwd=3)
legend(450, 3, c("Quadratic Model","Lowess Smooth","Spline Model"),
       lty=c(1,1,1), lwd=c(2,2,3), col=c(1,2,4))
Interpreting the output
> summary(reg.spline)
Call:
lm(formula = logLOS ~ NURSE + nurse.star, data = data)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.1073877  0.0332135  63.450  < 2e-16 ***
NURSE        0.0010278  0.0002336   4.399 2.52e-05 ***
nurse.star  -0.0011114  0.0004131  -2.690  0.00825 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1616 on 110 degrees of freedom
Multiple R-squared: 0.1902, Adjusted R-squared: 0.1754
F-statistic: 12.91 on 2 and 110 DF, p-value: 9.165e-06
How do we interpret the coefficient on nurse.star?
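Recall that the slope after the knot is β1 + β2; a sketch of its estimate and standard error, again via the variance of a linear combination (assuming reg.spline from above):

b <- coef(reg.spline)
V <- vcov(reg.spline)
slope.after <- b["NURSE"] + b["nurse.star"]  # about -0.0001: essentially flat after NURSE = 250
se.after <- sqrt(V["NURSE","NURSE"] + V["nurse.star","nurse.star"] +
                 2*V["NURSE","nurse.star"])
slope.after
se.after

So β2 itself is the change in slope at the knot: the slope drops from about 0.0010 before NURSE = 250 to roughly 0 after.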
Why subtract the 250 in defining nurse.star?
It 'calibrates' where the two pieces of the line meet. If it is not included, the two segments will not connect: the fitted line would jump at NURSE = 250.
Why a spline vs. the quadratic?
- it fits well!
- it is more interpretable
- it makes sense
- it is less sensitive to outliers
- it can be generalized to have more 'knots' (a sketch below)
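A sketch of what the generalization might look like with two knots. The knot locations (150 and 400) are hypothetical choices for illustration, not values suggested by these data; each knot simply adds one more 'hockey stick' term:

data$ns150 <- pmax(data$NURSE - 150, 0)  # 0 below the knot, NURSE-150 above
data$ns400 <- pmax(data$NURSE - 400, 0)
reg2 <- lm(logLOS ~ NURSE + ns150 + ns400, data=data)
# splines::bs() automates this with smoother (cubic) pieces:
# lm(logLOS ~ splines::bs(NURSE, knots=c(150, 400)), data=data)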
Next time: ANOVA F-tests