
1 Lecture 7: Multiple Linear Regression Interpretation with different types of predictors BMTRY 701 Biostatistical Methods II

2 Interpreting regression coefficients
- So far, we’ve considered continuous covariates
- Covariates can take other forms: binary, nominal categorical, quadratics (or other transforms), interactions
- Interpretations may vary depending on the nature of your covariate

3 Binary covariates
- Considered ‘qualitative’: the ordering of the numeric assignments does not matter
- Example: MEDSCHL: 1 = yes; 2 = no
- Other common examples: gender, mutation status, pre- vs. post-menopausal, two age categories

4 (figure slide)

5 How is MEDSCHL related to LOS?
- How do we interpret β1?
- Coding of variables: 2 vs. 1? I prefer 1 vs. 0. The difference? The intercept.
- Let’s make a new variable: MS = 1 if MEDSCHL = 1 (yes); MS = 0 if MEDSCHL = 2 (no)

6 How is MEDSCHL related to LOS?
- What does β1 mean?
- The model looks the same, but the interpretation differs from the continuous case
- What if we had used the old coding?

7 R code
> data$ms <- ifelse(data$MEDSCHL==2, 0, data$MEDSCHL)
> table(data$ms, data$MEDSCHL)
     1  2
  0  0 96
  1 17  0
> reg <- lm(LOS ~ ms, data=data)
> summary(reg)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.4105     0.1871  50.290  < 2e-16 ***
ms            1.5807     0.4824   3.276  0.00140 **
---
Residual standard error: 1.833 on 111 degrees of freedom
Multiple R-squared: 0.08818, Adjusted R-squared: 0.07997
F-statistic: 10.73 on 1 and 111 DF,  p-value: 0.001404

8 Scatterplot? Residual plot?
res <- reg$residuals
plot(data$ms, res)
abline(h=0)

9 Fitted values
- Only two fitted values: 9.41 (MS = 0) and 9.41 + 1.58 = 10.99 (MS = 1)
- Diagnostic plots are not as informative
- Extrapolation and interpolation are meaningless!
- We can estimate LOS for MS = 0.5: LOS = 9.41 + 1.58*0.5 = 10.20. Try to interpret the result…
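
A minimal sketch in R (assuming `reg` is the lm(LOS ~ ms) fit from slide 7):

# only two distinct fitted values, one per level of ms
sort(unique(round(reg$fitted.values, 2)))     # 9.41 (ms = 0) and 10.99 (ms = 1)
# nothing stops us from predicting at ms = 0.5...
predict(reg, newdata=data.frame(ms=0.5))      # 10.20, but uninterpretable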

10 “Linear” regression?
- But what about the ‘linear’ assumption?
- We still need to adhere to the model assumptions; recall that they relate primarily to the residuals
- Residuals are independent and identically distributed: ε_i ~ N(0, σ²)
- The model is still linear in the parameters!
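
A quick sketch of checking that assumption for the binary-covariate fit (base R graphics; `reg` from slide 7):

res <- reg$residuals
qqnorm(res)    # normal Q-Q plot of the residuals
qqline(res)    # reference line; large departures suggest non-normality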

11 MLR example: add infection risk to our model
> reg <- lm(LOS ~ ms + INFRISK, data=data)
> summary(reg)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.4547     0.5146  12.542   <2e-16 ***
ms            0.9717     0.4316   2.251   0.0263 *
INFRISK       0.6998     0.1156   6.054    2e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.595 on 110 degrees of freedom
Multiple R-squared: 0.3161, Adjusted R-squared: 0.3036
F-statistic: 25.42 on 2 and 110 DF,  p-value: 8.42e-10
How does the interpretation change?

12 What about more than two categories?
- We looked briefly at region a few lectures back
- How to interpret? You need to define a reference category
- For med school, the reference was ms = 0: almost ‘subconscious’ with only two categories
- With >2 categories, you need to be careful about interpretation

13 LOS ~ REGION
- Note how an ‘indicator’ (or ‘dummy’) variable is defined: I(condition) = 1 if the condition is true; I(condition) = 0 if the condition is false
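
You can inspect the indicator coding R builds for you (a sketch; with R’s default treatment contrasts, the first level of factor(REGION) is the reference):

# each row of the design matrix shows the intercept and the three dummies
# I(REGION==2), I(REGION==3), I(REGION==4)
head(model.matrix(~ factor(REGION), data=data))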

14 Interpretation (reference category: region 1)
- β0 = mean LOS in region 1
- β1 = difference in mean LOS, region 2 vs. region 1
- β2 = difference in mean LOS, region 3 vs. region 1
- β3 = difference in mean LOS, region 4 vs. region 1

15 Hypothesis tests?
- H0: β1 = 0 vs. Ha: β1 ≠ 0. What does that test (in words)?
- H0: β2 = 0 vs. Ha: β2 ≠ 0. What does that test (in words)?
- What if we want to test region, in general? One of our next topics!

16 R
> reg <- lm(LOS ~ factor(REGION), data=data)
> summary(reg)
Call:
lm(formula = LOS ~ factor(REGION), data = data)
Residuals:
     Min       1Q   Median       3Q      Max
-3.05893 -1.03135 -0.02344  0.68107  8.47107
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      11.0889     0.3165  35.040  < 2e-16 ***
factor(REGION)2  -1.4055     0.4333  -3.243  0.00157 **
factor(REGION)3  -1.8976     0.4194  -4.524 1.55e-05 ***
factor(REGION)4  -2.9752     0.5248  -5.669 1.19e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.675 on 109 degrees of freedom
Multiple R-squared: 0.2531, Adjusted R-squared: 0.2325
F-statistic: 12.31 on 3 and 109 DF,  p-value: 5.376e-07

17 Interpreting
- Is mean LOS different in region 2 vs. 1? What about region 3 vs. 1 and 4 vs. 1?
- What about region 4 vs. 3? How do we test that?
- Two options: (1) recode the data so that region 3 or 4 is the reference; (2) use knowledge about the variance of linear combinations to estimate the p-value for the difference in the coefficients
- For now, we’ll focus on the first; a sketch of the second appears below
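
A sketch of the second option, using the slide 16 fit reg <- lm(LOS ~ factor(REGION), data=data); coefficients 3 and 4 are the region 3 and region 4 effects:

b <- coef(reg)
V <- vcov(reg)                              # estimated covariance matrix of the coefficients
est <- b[4] - b[3]                          # region 4 vs. region 3
se  <- sqrt(V[4,4] + V[3,3] - 2*V[3,4])     # var(b4 - b3) = var(b4) + var(b3) - 2*cov(b4, b3)
tstat <- est/se
2*pt(-abs(tstat), df=reg$df.residual)       # two-sided p-value
# matches the rev.region3 line on slide 19 (1.0776, SE 0.5010) up to sign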

18 Make REGION = 4 the reference
- Our model then changes: E(LOS) = β0 + β1 I(REGION = 3) + β2 I(REGION = 2) + β3 I(REGION = 1), with region 4 as the reference

19 R code: recoding so the last category is the reference
> data$rev.region <- factor(data$REGION, levels=rev(sort(unique(data$REGION))))
> reg <- lm(LOS ~ rev.region, data=data)
> summary(reg)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.1137     0.4186  19.381  < 2e-16 ***
rev.region3   1.0776     0.5010   2.151  0.03371 *
rev.region2   1.5697     0.5127   3.061  0.00277 **
rev.region1   2.9752     0.5248   5.669 1.19e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.675 on 109 degrees of freedom
Multiple R-squared: 0.2531, Adjusted R-squared: 0.2325
F-statistic: 12.31 on 3 and 109 DF,  p-value: 5.376e-07
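
An equivalent, arguably tidier idiom is relevel(), which makes region 4 the reference without reversing the other levels (a sketch):

data$region.f <- relevel(factor(data$REGION), ref="4")
reg4 <- lm(LOS ~ region.f, data=data)
summary(reg4)    # coefficients are regions 1, 2, 3 each vs. region 4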

20 Quite a few differences:

Reference = region 4 (rev.region):
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.1137     0.4186  19.381  < 2e-16 ***
rev.region3   1.0776     0.5010   2.151  0.03371 *
rev.region2   1.5697     0.5127   3.061  0.00277 **
rev.region1   2.9752     0.5248   5.669 1.19e-07 ***

Reference = region 1 (factor(REGION)):
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      11.0889     0.3165  35.040  < 2e-16 ***
factor(REGION)2  -1.4055     0.4333  -3.243  0.00157 **
factor(REGION)3  -1.8976     0.4194  -4.524 1.55e-05 ***
factor(REGION)4  -2.9752     0.5248  -5.669 1.19e-07 ***

21 But the “model” is the same
- Model 1: Residual standard error: 1.675 on 109 degrees of freedom
- Model 2: Residual standard error: 1.675 on 109 degrees of freedom
- The models represent the data equally well
- However, the ‘reparameterization’ yields a different interpretation for the model parameters
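
A quick sketch confirming the two parameterizations give identical fits:

reg1 <- lm(LOS ~ factor(REGION), data=data)
reg2 <- lm(LOS ~ rev.region, data=data)
all.equal(reg1$fitted.values, reg2$fitted.values)    # TRUE: same fitted values
c(summary(reg1)$sigma, summary(reg2)$sigma)          # same residual SE: 1.675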

22 Diagnostics
# residual plot
reg <- lm(LOS ~ factor(REGION), data=data)
res <- reg$residuals
fit <- reg$fitted.values
plot(fit, res)
abline(h=0, lwd=2)

23 Diagnostics
# residual plot
reg <- lm(logLOS ~ factor(REGION), data=data)
res <- reg$residuals
fit <- reg$fitted.values
plot(fit, res)
abline(h=0, lwd=2)

24 Next type: polynomials
- Most common: quadratic. What does that mean? Including a linear and a ‘squared’ term
- Why? To adhere to model assumptions!
- Example: last week we saw LOS ~ NURSE; a quadratic actually made some sense

25 Scatterplot

26 Fitting the model
> data$nurse2 <- data$NURSE^2
> reg <- lm(logLOS ~ NURSE + nurse2, data=data)
> summary(reg)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.090e+00  3.645e-02  57.355  < 2e-16 ***
NURSE        1.430e-03  3.525e-04   4.058 9.29e-05 ***
nurse2      -1.789e-06  6.262e-07  -2.857  0.00511 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.161 on 110 degrees of freedom
Multiple R-squared: 0.1965, Adjusted R-squared: 0.1819
F-statistic: 13.45 on 2 and 110 DF,  p-value: 5.948e-06
Interpretable?
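
The same model can be fit without creating nurse2, using I() inside the formula (an equivalent sketch):

reg.quad <- lm(logLOS ~ NURSE + I(NURSE^2), data=data)   # I() protects the ^ inside the formula
reg.poly <- lm(logLOS ~ poly(NURSE, 2), data=data)       # same fit, orthogonal parameterization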

27 How does it fit?
# make regression line
plot(data$NURSE, data$logLOS, pch=16)
coef <- reg$coefficients
nurse.values <- seq(15, 650, 5)
fit.line <- coef[1] + coef[2]*nurse.values + coef[3]*nurse.values^2
lines(nurse.values, fit.line, lwd=2)
Note: ‘abline’ only works for simple linear regression; when there is more than one predictor, you need to draw the line another way.
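
An alternative sketch using predict(), which avoids assembling the polynomial by hand (assumes the slide 26 fit, with nurse2 as a column):

nurse.values <- seq(15, 650, 5)
new.data <- data.frame(NURSE=nurse.values, nurse2=nurse.values^2)
lines(nurse.values, predict(reg, newdata=new.data), lwd=2)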

28 Another approach to the same data
- Does it make sense that logLOS increases and then decreases?
- Or would it make more sense for it to increase and then plateau?
- Which do you think makes more sense?
- How to tell? Use a data-driven approach: ask “what do the data suggest?”

29 Smoothing
- An empirical way to look at the relationship
- The data are ‘binned’ by x, and for each bin the average y is estimated
- But it is a little fancier: it is a ‘moving average’, so each x value is in multiple bins
- Modern methods fit models within bins: lowess smoothing, cubic spline smoothing
- The specifics are not so important; the “empirical” result is

30 smoother <- lowess(data$NURSE, data$logLOS)
plot(data$NURSE, data$logLOS, pch=16)
lines(smoother, lwd=2, col=2)
lines(nurse.values, fit.line, lwd=2)
legend(450, 3, c("Quadratic Model","Lowess Smooth"),
       lty=c(1,1), lwd=c(2,2), col=c(1,2))

31 Inference?
- What do the data say? Looks like a plateau. How can we model that?
- One option: a spline; Zeger: the “broken arrow” model
- Example: looks like a “knot” at NURSE = 250
- There is a linear increase in logLOS until about NURSE = 250; then the relationship is flat
- This implies one slope prior to NURSE = 250 and another after NURSE = 250

32 Implementing a spline
- A little tricky: we need to define a new variable, NURSE* = 0 if NURSE ≤ 250, and NURSE* = NURSE − 250 if NURSE > 250
- We then write the model as: E(logLOS) = β0 + β1 NURSE + β2 NURSE*

33 How to interpret?
- When in doubt, condition on different scenarios
- When NURSE ≤ 250: NURSE* = 0, so E(logLOS) = β0 + β1 NURSE
- When NURSE > 250: E(logLOS) = β0 + β1 NURSE + β2 (NURSE − 250) = (β0 − 250β2) + (β1 + β2) NURSE

34 R
data$nurse.star <- ifelse(data$NURSE<=250, 0, data$NURSE-250)
reg.spline <- lm(logLOS ~ NURSE + nurse.star, data=data)
# make regression line
coef.spline <- reg.spline$coefficients
nurse.values <- seq(15, 650, 5)
nurse.values.star <- ifelse(nurse.values<=250, 0, nurse.values-250)
spline.line <- coef.spline[1] + coef.spline[2]*nurse.values + coef.spline[3]*nurse.values.star
plot(data$NURSE, data$logLOS, pch=16)
lines(smoother, lwd=2, col=2)
lines(nurse.values, fit.line, lwd=2)
lines(nurse.values, spline.line, col=4, lwd=3)
legend(450, 3, c("Quadratic Model","Lowess Smooth","Spline Model"),
       lty=c(1,1,1), lwd=c(2,2,3), col=c(1,2,4))

35 Interpreting the output
> summary(reg.spline)
Call:
lm(formula = logLOS ~ NURSE + nurse.star, data = data)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.1073877  0.0332135  63.450  < 2e-16 ***
NURSE        0.0010278  0.0002336   4.399 2.52e-05 ***
nurse.star  -0.0011114  0.0004131  -2.690  0.00825 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1616 on 110 degrees of freedom
Multiple R-squared: 0.1902, Adjusted R-squared: 0.1754
F-statistic: 12.91 on 2 and 110 DF,  p-value: 9.165e-06
How do we interpret the coefficient on nurse.star?
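
It is the change in slope at the knot, so the slope after NURSE = 250 is β1 + β2. A sketch of estimating that slope and its standard error from the fit:

b <- coef(reg.spline)
V <- vcov(reg.spline)
slope.after <- b["NURSE"] + b["nurse.star"]    # ≈ -0.00008: essentially flat, consistent with a plateau
se.after <- sqrt(V["NURSE","NURSE"] + V["nurse.star","nurse.star"] +
                 2*V["NURSE","nurse.star"])    # var of a sum of two coefficients
c(slope.after, se.after)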

36 Why subtract the 250 in defining nurse.star?
- It ‘calibrates’ where the two pieces of the line meet
- If the subtraction is not included, the two pieces will not connect at the knot

37 Why a spline vs. the quadratic?
- It fits well!
- It is more interpretable
- It makes sense
- It is less sensitive to the outliers
- It can be generalized to have more ‘knots’ (see the sketch below)
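
A sketch of adding a second knot; the location (NURSE = 400) is purely illustrative, not from the lecture:

# second hinge term: 0 below the (hypothetical) knot at 400
data$nurse.star2 <- ifelse(data$NURSE <= 400, 0, data$NURSE - 400)
reg.spline2 <- lm(logLOS ~ NURSE + nurse.star + nurse.star2, data=data)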

38 Next time
- ANOVA
- F-tests

