Lecture 7: Multiple Linear Regression Interpretation with different types of predictors BMTRY 701 Biostatistical Methods II.


Lecture 7: Multiple Linear Regression Interpretation with different types of predictors BMTRY 701 Biostatistical Methods II

Interpreting regression coefficients
- So far, we've considered continuous covariates
- Covariates can take other forms: binary, nominal categorical, quadratics (or other transforms), interactions
- Interpretations may vary depending on the nature of your covariate

Binary covariates
- Considered 'qualitative'
- The ordering of numeric assignments does not matter
- Example: MEDSCHL: 1 = yes; 2 = no
- More familiar examples: gender, mutation status, pre- vs. post-menopausal, two age categories

How is MEDSCHL related to LOS?
- How to interpret β1?
- Coding of the variable:
  - 2 vs. 1
  - I prefer 1 vs. 0
  - The difference? The intercept.
- Let's make a new variable:
  - MS = 1 if MEDSCHL = 1 (yes)
  - MS = 0 if MEDSCHL = 2 (no)

How is MEDSCHL related to LOS?
- What does β1 mean?
- The same model, yet a different interpretation than for a continuous covariate
- What if we had used the old coding?

R code
> data$ms <- ifelse(data$MEDSCHL==2, 0, data$MEDSCHL)
> table(data$ms, data$MEDSCHL)

> reg <- lm(LOS ~ ms, data=data)
> summary(reg)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               < 2e-16 ***
ms                                                 **
---
Residual standard error: on 111 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 111 DF, p-value:

Scatterplot? Residual plot?
res <- reg$residuals
plot(data$ms, res)
abline(h=0)

Fitted values
- Only two fitted values: β0 (for MS = 0) and β0 + β1 (for MS = 1)
- Diagnostic plots are not as informative
- Extrapolation and interpolation are meaningless!
- We could still estimate LOS for MS = 0.5: LOS = β0 + β1*0.5. Try to interpret the result…
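
As a quick check (a minimal sketch, not on the original slides, reusing the data and reg objects defined above), the two fitted values are simply the group-specific mean LOS:

# group means of LOS by medical-school status
tapply(data$LOS, data$ms, mean)

# only two distinct values appear among the fitted values
unique(fitted(reg))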

“Linear” regression?
- But what about the 'linear' assumption?
  - we still need to adhere to the model assumptions
  - recall that they relate primarily to the residuals
  - residuals are independent and identically distributed: ε_i ~ iid N(0, σ²)
- The model is still linear in the parameters!

MLR example: add infection risk to our model
> reg <- lm(LOS ~ ms + INFRISK, data=data)
> summary(reg)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                <2e-16 ***
ms                                                 *
INFRISK                                    e-08 ***
---
Signif. codes: 0 ‘***’ ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: on 110 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 110 DF, p-value: 8.42e-10

How does interpretation change?

What about more than two categories?
- We looked briefly at region a few lectures back.
- How to interpret?
- You need to define a reference category
- For med school: the reference was ms = 0
  - almost 'subconscious' with only two categories
- With >2 categories, need to be careful of interpretation

LOS ~ REGION
- Note how the 'indicator' or 'dummy' variable is defined:
  - I(condition) = 1 if condition is true
  - I(condition) = 0 if condition is false
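
The model equation shown on the original slide was lost in transcription; a reconstruction consistent with the R output on the next slide (REGION = 1 as the reference category, so β0 is the mean LOS in region 1) is:

E(LOS) = β0 + β1*I(REGION = 2) + β2*I(REGION = 3) + β3*I(REGION = 4)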

Interpretation
- β0 =
- β1 =
- β2 =
- β3 =

Hypothesis tests?
- H0: β1 = 0 vs. Ha: β1 ≠ 0
  - What does that test (in words)?
- H0: β2 = 0 vs. Ha: β2 ≠ 0
  - What does that test (in words)?
- What if we want to test region, in general?
  - One of our next topics!

R
> reg <- lm(LOS ~ factor(REGION), data=data)
> summary(reg)

Call:
lm(formula = LOS ~ factor(REGION), data = data)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)                                  < 2e-16 ***
factor(REGION)2                                       **
factor(REGION)3                              e-05 ***
factor(REGION)4                              e-07 ***
---
Signif. codes: 0 ‘***’ ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: on 109 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 109 DF, p-value: 5.376e-07

Interpreting
- Is mean LOS different in region 2 vs. 1?
- What about region 3 vs. 1 and 4 vs. 1?
- What about region 4 vs. 3? How to test that?
- Two options:
  - recode the data so that 3 or 4 is the reference
  - use knowledge about the variance of linear combinations to estimate the p-value for the difference in the coefficients
- For now… we'll focus on the first.

Make REGION = 4 the reference
- Our model then changes:
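
The re-parameterized equation on the original slide was also lost; assuming the level ordering used in the R code on the next slide (levels 4, 3, 2, 1, so region 4 is the reference), it becomes:

E(LOS) = β0 + β1*I(REGION = 3) + β2*I(REGION = 2) + β3*I(REGION = 1)

where β0 is now the mean LOS in region 4 and each remaining coefficient compares a region to region 4.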

R code: recoding so last category is reference
> data$rev.region <- factor(data$REGION, levels=rev(sort(unique(data$REGION))))
> reg <- lm(LOS ~ rev.region, data=data)
> summary(reg)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               < 2e-16 ***
rev.region3                                        *
rev.region2                                        **
rev.region1                               e-07 ***
---
Signif. codes: 0 ‘***’ ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: on 109 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 109 DF, p-value: 5.376e-07
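
An alternative that is not on the original slides (region.f and reg4 are names I have made up) is base R's relevel(), which keeps the level labels but picks the reference category directly:

# same fit with REGION = 4 as the reference; only the ordering of the coefficients differs
data$region.f <- relevel(factor(data$REGION), ref = "4")
reg4 <- lm(LOS ~ region.f, data = data)
summary(reg4)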

Quite a few differences:

With REGION = 4 as reference:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               < 2e-16 ***
rev.region3                                        *
rev.region2                                        **
rev.region1                               e-07 ***

With REGION = 1 as reference:
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)                                  < 2e-16 ***
factor(REGION)2                                       **
factor(REGION)3                              e-05 ***
factor(REGION)4                              e-07 ***

But the “model” is the same
- Model 1: Residual standard error: on 109 degrees of freedom
- Model 2: Residual standard error: on 109 degrees of freedom
- The two models represent the data equally well.
- However, the 'reparameterization' yields a different interpretation of the model parameters.
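
To see this in R (a sketch with made-up object names, assuming data$rev.region from the earlier slide is available):

# refit both parameterizations
fit.orig <- lm(LOS ~ factor(REGION), data = data)
fit.rev  <- lm(LOS ~ rev.region, data = data)

# identical fitted values and residual sums of squares: same model, different parameterization
all.equal(fitted(fit.orig), fitted(fit.rev))
all.equal(deviance(fit.orig), deviance(fit.rev))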

Diagnostics
# residual plot
reg <- lm(LOS ~ factor(REGION), data=data)
res <- reg$residuals
fit <- reg$fitted.values
plot(fit, res)
abline(h=0, lwd=2)

Diagnostics
# residual plot
reg <- lm(logLOS ~ factor(REGION), data=data)
res <- reg$residuals
fit <- reg$fitted.values
plot(fit, res)
abline(h=0, lwd=2)

Next type: polynomials
- Most common: quadratic
- What does that mean? Including a linear and a 'squared' term
- Why? To adhere to model assumptions!
- Example: last week we saw that for LOS ~ NURSE a quadratic actually made some sense

Scatterplot

Fitting the model
> data$nurse2 <- data$NURSE^2
> reg <- lm(logLOS ~ NURSE + nurse2, data=data)
> summary(reg)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.090e        e             < 2e-16 ***
NURSE        1.430e        e             e-05 ***
nurse2            e        e                      **
---
Signif. codes: 0 ‘***’ ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: on 110 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 110 DF, p-value: 5.948e-06

Interpretable?
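
For reference (not from the original slides; reg.q and reg.p are made-up names), the same model can be fit without creating the nurse2 column by wrapping the squared term in I() inside the formula; poly() gives an equivalent fit with orthogonal polynomial terms:

reg.q <- lm(logLOS ~ NURSE + I(NURSE^2), data = data)  # identical fit to the one above
reg.p <- lm(logLOS ~ poly(NURSE, 2), data = data)      # same fitted values, different coefficients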

How does it fit?
# make regression line
plot(data$NURSE, data$logLOS, pch=16)
coef <- reg$coefficients
nurse.values <- seq(15, 650, 5)
fit.line <- coef[1] + coef[2]*nurse.values + coef[3]*nurse.values^2
lines(nurse.values, fit.line, lwd=2)

Note: 'abline' will only work for simple linear regression. When there is more than one predictor, you need to make the line another way.
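
One other way to make the line, instead of assembling it from the coefficients by hand (a sketch that is not on the slides; newdat is a made-up name, and it reuses reg and nurse.values from above):

# predicted logLOS over a grid of NURSE values, added as a dashed curve
newdat <- data.frame(NURSE = nurse.values, nurse2 = nurse.values^2)
lines(nurse.values, predict(reg, newdata = newdat), lwd = 2, lty = 2)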

Another approach to the same data
- Does it make sense that it increases, and then decreases?
- Or, would it make more sense to increase, and then plateau?
- Which do you think makes more sense?
- How to tell? Use a data-driven approach: it tells us "what do the data suggest?"

Smoothing
- An empirical way to look at the relationship
- The data are 'binned' by x
- For each 'bin', the average y is estimated
- But it is a little fancier: it is a 'moving average', so each x value is in multiple bins
- Modern methods use models within bins: lowess smoothing, cubic spline smoothing
- Specifics are not so important; what matters is the "empirical" result

smoother <- lowess(data$NURSE, data$logLOS)
plot(data$NURSE, data$logLOS, pch=16)
lines(smoother, lwd=2, col=2)
lines(nurse.values, fit.line, lwd=2)
legend(450, 3, c("Quadratic Model", "Lowess Smooth"),
       lty=c(1,1), lwd=c(2,2), col=c(1,2))

Inference?
- What do the data say? Looks like a plateau
- How can we model that?
- One option: a spline
- Zeger: "broken arrow" model
- Example: looks like a "knot" at NURSE = 250
  - there is a linear increase in logLOS until about NURSE = 250
  - then, the relationship is flat
  - this implies one slope prior to NURSE = 250, and another after NURSE = 250

Implementing a spline
- A little tricky
- We need to define a new variable, NURSE*
- And then we write the model as follows:
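
The equation on the original slide did not survive transcription; a reconstruction consistent with the R code two slides below is:

NURSE* = 0 if NURSE ≤ 250, and NURSE* = NURSE − 250 if NURSE > 250

E(logLOS) = β0 + β1*NURSE + β2*NURSE*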

How to interpret?
- When in doubt, condition on different scenarios
- What is E(logLOS) when NURSE < 250?
- What is E(logLOS) when NURSE > 250?
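
Working the two scenarios out with the model above (the algebra is not spelled out on the slide):

- When NURSE ≤ 250: NURSE* = 0, so E(logLOS) = β0 + β1*NURSE; the slope is β1.
- When NURSE > 250: E(logLOS) = β0 + β1*NURSE + β2*(NURSE − 250) = (β0 − 250*β2) + (β1 + β2)*NURSE; the slope is β1 + β2.
- At NURSE = 250 both expressions equal β0 + 250*β1, so the two segments join, and β2 is the change in slope after the knot.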

R
data$nurse.star <- ifelse(data$NURSE<=250, 0, data$NURSE-250)
data$nurse.star
reg.spline <- lm(logLOS ~ NURSE + nurse.star, data=data)

# make regression line
coef.spline <- reg.spline$coefficients
nurse.values <- seq(15, 650, 5)
nurse.values.star <- ifelse(nurse.values<=250, 0, nurse.values-250)
spline.line <- coef.spline[1] + coef.spline[2]*nurse.values + coef.spline[3]*nurse.values.star

plot(data$NURSE, data$logLOS, pch=16)
lines(smoother, lwd=2, col=2)
lines(nurse.values, fit.line, lwd=2)
lines(nurse.values, spline.line, col=4, lwd=3)
legend(450, 3, c("Quadratic Model", "Lowess Smooth", "Spline Model"),
       lty=c(1,1,1), lwd=c(2,2,3), col=c(1,2,4))

Interpreting the output
> summary(reg.spline)

Call:
lm(formula = logLOS ~ NURSE + nurse.star, data = data)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               < 2e-16 ***
NURSE                                     e-05 ***
nurse.star                                         **
---
Signif. codes: 0 ‘***’ ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: on 110 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 110 DF, p-value: 9.165e-06

How do we interpret the coefficient on nurse.star?

Why subtract the 250 in defining nurse.star?
- It 'calibrates' where the two pieces of the line meet
- If the subtraction is not included, then the two pieces will not connect

Why a spline vs. the quadratic?
- it fits well!
- it is more interpretable
- it makes sense
- it is less sensitive to the outliers
- it can be generalized to have more 'knots'

Next time
- ANOVA
- F-tests