Lecture 11 Multicollinearity BMTRY 701 Biostatistical Methods II

Multicollinearity: Introduction

Some common questions we ask in MLR:
- What is the relative importance of the effects of the different covariates?
- What is the magnitude of the effect of a given covariate on the response?
- Can any covariate be dropped from the model because it has little or no effect on the outcome?
- Should any covariates not yet included in the model be considered for possible inclusion?

Easy answers?
- If the candidate covariates are uncorrelated with one another: yes, these are simple questions.
- If the candidate covariates are correlated with one another: no, these are not easy.
- Most commonly: observational studies have correlated covariates; we need to adjust for these when assessing relationships ("adjusting" for confounders).
- Experimental designs? Less problematic: patients are randomized in common designs, and no confounding exists because factors are 'balanced' across arms.

Multicollinearity
- Also called "intercorrelation".
- Refers to the situation in which the covariates are related to each other and to the outcome of interest.
- It is like confounding, but there is a statistical term for it because of the effects it has on regression modeling.

No multicollinearity: an example
Mouse experiment (table columns: Mouse, Dose A, Dose B, Diet, Tumor size).

Linear modeling
- Interested in seeing which factors influence tumor size in mice.
- Notice that the experiment is perfectly balanced. What does that mean? (See the sketch below.)
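A minimal sketch of what "perfectly balanced" means (the mouse data themselves are not reproduced here, so the values below are hypothetical): every combination of the factors appears equally often, so the predictors are uncorrelated with one another.

# Hypothetical balanced design: each Dose.A x Dose.B x Diet combination occurs equally often.
design <- expand.grid(Dose.A = c(0, 10), Dose.B = c(0, 5), Diet = c(0, 1))
design <- design[rep(1:8, 2), ]   # replicate once so there are 16 mice
round(cor(design), 2)             # all off-diagonal correlations are exactly 0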

Dose of Drug A on Tumor

> reg.a <- lm(Tumor.size ~ Dose.A, data=data)
> summary(reg.a)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                      *
Dose.A
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 10 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 10 DF, p-value:

Dose of Drug B on Tumor

> reg.b <- lm(Tumor.size ~ Dose.B, data=data)
> summary(reg.b)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                      ***
Dose.B                                           **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.4 on 10 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 10 DF, p-value:

Diet on Tumor

> reg.diet <- lm(Tumor.size ~ Diet, data=data)
> summary(reg.diet)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                      **
Diet
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 10 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 10 DF, p-value:

All in the model together

> reg.all <- lm(Tumor.size ~ Dose.A + Dose.B + Diet, data=data)
> summary(reg.all)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                e-05 ***
Dose.A
Dose.B                                          ***
Diet                                            *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 8 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 8 DF, p-value:

Correlation matrix of predictors and outcome

> cor(data[,-1])
           Dose.A Dose.B Diet Tumor.size
Dose.A
Dose.B
Diet
Tumor.size

Result
- For perfectly balanced designs, adjusting does not affect the coefficients.
- However, it can affect the significance.
- Why? The residual sum of squares is affected: if you explain more of the variance in the outcome, less is left to chance/error, so when you adjust for another factor related to the outcome, you will likely improve the significance.
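A hedged sketch of this point with simulated balanced data (hypothetical, not the lecture's mouse data): the Dose.A coefficient is identical whether or not the other factors are in the model, but its standard error shrinks once those factors soak up residual variance.

set.seed(42)
d <- expand.grid(Dose.A = c(0, 10), Dose.B = c(0, 5), Diet = c(0, 1))
d <- d[rep(1:8, 3), ]                                # 24 mice, perfectly balanced
d$Tumor <- 50 - 0.6*d$Dose.A - 2*d$Dose.B + 8*d$Diet + rnorm(24, sd = 4)
coef(summary(lm(Tumor ~ Dose.A, data = d)))["Dose.A", ]                   # Dose.A alone
coef(summary(lm(Tumor ~ Dose.A + Dose.B + Diet, data = d)))["Dose.A", ]   # Dose.A adjusted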

The other extreme: perfect collinearity
Mouse experiment (table columns: Mouse, Dose A, Dose C, Diet, Tumor size).

The model has infinitely many solutions
- Too much flexibility.
- What happens? The fitting algorithm usually gives you some indication of this: it either will not fit the model and gives an error, or it drops one of the predictors.
- "Perfectly collinear" = "perfect confounding".
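A minimal sketch of how R behaves with a perfectly collinear predictor (hypothetical values; here Dose.C is defined as an exact multiple of Dose.A):

set.seed(7)
Dose.A <- rep(c(0, 10), each = 6)
Dose.C <- 2 * Dose.A                           # exact linear function of Dose.A
Tumor  <- 50 - 0.8*Dose.A + rnorm(12, sd = 5)
coef(lm(Tumor ~ Dose.A + Dose.C))              # Dose.C coefficient is NA: lm() silently drops it
# lm(..., singular.ok = FALSE) would instead stop with an error.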

Effects of Multicollinearity
- Most common result:
  - each of the two covariates is associated with Y in its own simple linear regression model;
  - in the MLR model with both covariates, one or both is insignificant;
  - the magnitude of the regression coefficients is attenuated.
- Why? Recall the adjusted variable plot: if the two are related, removing the systematic part of one from Y may leave too little left for the other to explain.

Effects of Multicollinearity
- Other situations:
  - neither is significant alone, but both are significant together (somewhat rare);
  - both are significant alone and both retain significance in the model;
  - the regression coefficient for one of the covariates may change direction;
  - the magnitude of a coefficient may increase (in absolute value).
- It is usually hard to predict exactly what will happen when both are in the model (see the simulation sketch below).
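A small simulation sketch of the most common pattern (hypothetical data, not from the lecture): each covariate looks important on its own, but the standard errors grow, and one or both may lose significance once the correlated pair enters together.

set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- 0.95*x1 + rnorm(n, sd = 0.3)          # x1 and x2 are highly correlated
y  <- 1 + 0.5*x1 + 0.5*x2 + rnorm(n)
summary(lm(y ~ x1))$coefficients            # x1 clearly significant alone
summary(lm(y ~ x2))$coefficients            # so is x2
summary(lm(y ~ x1 + x2))$coefficients       # standard errors inflate in the joint model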

Implications in inference
- The interpretation of a regression coefficient, measuring the change in the expected value of Y when the covariate is increased while all others are held constant, is not quite applicable.
- It may be conceptually feasible to think of 'holding all others constant', but, practically, it may not be possible if the covariates are related.
- Example: amount of rainfall and hours of sunshine.

Implications in inference
- Multicollinearity tends to inflate the standard errors of the regression coefficients.
- When multicollinearity is present, you will see that the coefficient of partial determination increases very little with the addition of the collinear covariate.
- Predictions tend to be relatively unaffected, for better or worse, when a highly collinear covariate is added to the model.
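This standard-error inflation can be quantified with a variance inflation factor (VIF); a minimal sketch of computing it by hand, reusing the hypothetical correlated pair x1 and x2 from the simulation sketch above:

set.seed(1)
x1 <- rnorm(50)
x2 <- 0.95*x1 + rnorm(50, sd = 0.3)       # same highly correlated pair as above
r2 <- summary(lm(x1 ~ x2))$r.squared      # how well the other covariate predicts x1
1 / (1 - r2)                              # VIF for x1: near 1 is fine, large values signal trouble
# car::vif() on a fitted lm object reports the same quantity for every covariate.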

Implications in Inference
- Recall the interpretation of the t-statistics in MLR: they represent the significance of a variable, adjusting for all else in the model.
- If two covariates are highly correlated, then both are likely to end up insignificant.
- Marginal nature of t-tests!
- ANOVA can be more useful because of the conditional (sequential) nature of its tables.
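A brief sketch of that contrast (same hypothetical data as above): summary() gives marginal t-tests, each covariate adjusted for everything else, while anova() gives sequential sums of squares, so each term is tested given only the terms listed before it, and the order matters when the covariates are correlated.

set.seed(1)
x1 <- rnorm(50)
x2 <- 0.95*x1 + rnorm(50, sd = 0.3)
y  <- 1 + 0.5*x1 + 0.5*x2 + rnorm(50)
summary(lm(y ~ x1 + x2))$coefficients     # marginal (partial) t-tests
anova(lm(y ~ x1 + x2))                    # sequential SS: x1 first, then x2 given x1
anova(lm(y ~ x2 + x1))                    # reversed order gives a different table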

So, which is the 'correct' variable?
- Almost impossible to tell.
- Usually, people choose the one that is 'more' significant, but that does not mean it is the correct choice:
  - it could be the correct choice;
  - it could be the one that is actually less associated.
- Why might the truly relevant variable appear less associated? Measurement issues; the correct 'culprit' could be a variable that is related to the ones in the model but is not in the model itself.

Example
- Let's look at our classic example of logLOS.
- What variables are associated with logLOS?
- What variables have the potential to create multicollinearity?

SENIC

> data <- read.csv("senicfull.csv")
> data$logLOS <- log(data$LOS)
> data$nurse2 <- data$NURSE^2
> data$ms <- ifelse(data$MEDSCHL==2,0,data$MEDSCHL)
>
> data.cor <- data[,-1]
> round(cor(data.cor),2)
(14 x 14 correlation matrix of LOS, AGE, INFRISK, CULT, XRAY, BEDS, MEDSCHL, REGION, CENSUS, NURSE, FACS, logLOS, nurse2, and ms)

Let's try an example with serious multicollinearity
- To anticipate multicollinearity, it is ALWAYS good to look at scatterplots and correlation matrices of potential covariates (see the sketch below).
- What covariates would give rise to a good example?
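A sketch of that screening step on the SENIC data; the hospital-size variables from the correlation output above are natural candidates for serious multicollinearity, though the specific column selection here is an assumption, not part of the lecture.

# Screen candidate covariates before modeling: pairwise scatterplots
# and a correlation matrix make strong collinearity easy to spot.
data <- read.csv("senicfull.csv")
size.vars <- c("BEDS", "CENSUS", "NURSE", "FACS")   # hospital-size variables
pairs(data[, size.vars])
round(cor(data[, size.vars]), 2)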