Multicollinearity.


Multicollinearity

Overview This set of slides: What is multicollinearity? What are its effects on regression analyses? Next set of slides: How to detect multicollinearity? How to reduce some types of multicollinearity?

Multicollinearity Multicollinearity exists when two or more of the predictors in a regression model are moderately or highly correlated. Regression analyses most often take place on data obtained from observational studies. In observational studies, multicollinearity happens more often than not.

Types of multicollinearity Structural multicollinearity: a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor x² from the predictor x. Sample-based multicollinearity: a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected.
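As a quick illustration of structural multicollinearity, the sketch below (hypothetical data, not from these slides) shows that a predictor and its square are typically strongly correlated, and that centering the predictor before squaring usually reduces that correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(10, 20, size=50)  # hypothetical positive-valued predictor

# Raw x and x^2 are almost perfectly correlated (structural multicollinearity).
print("corr(x, x^2)   =", round(np.corrcoef(x, x**2)[0, 1], 3))

# Centering x before squaring typically reduces the correlation substantially.
xc = x - x.mean()
print("corr(xc, xc^2) =", round(np.corrcoef(xc, xc**2)[0, 1], 3))
```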

Example: n = 20 hypertensive individuals; p − 1 = 6 predictor variables.

Example Blood pressure (BP) is the response. Pairwise (Pearson) correlations:

           BP     Age  Weight    BSA  Duration  Pulse
Age       0.659
Weight    0.950  0.407
BSA       0.866  0.378   0.875
Duration  0.293  0.344   0.201  0.131
Pulse     0.721  0.619   0.659  0.465    0.402
Stress    0.164  0.368   0.034  0.018    0.312  0.506
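The lower-triangular table above is a standard pairwise-correlation printout. A minimal sketch of how such a matrix can be computed, assuming the measurements are available in a CSV file named bloodpress.csv with these column names (the file name and format are assumptions, not stated in the slides):

```python
import pandas as pd

# Assumed file and column names; adjust to wherever the data actually live.
df = pd.read_csv("bloodpress.csv")
cols = ["BP", "Age", "Weight", "BSA", "Duration", "Pulse", "Stress"]

# Pairwise Pearson correlations among the response and the six predictors.
print(df[cols].corr().round(3))
```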

What is the effect on regression analyses if the predictors are perfectly uncorrelated?

Example

x1  x2   y
 2   5  52
 2   5  51
 2   7  49
 2   7  46
 4   5  50
 4   5  48
 4   7  44
 4   7  43

Pearson correlation of x1 and x2 = 0.000
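A small check of this example, verifying the zero correlation between x1 and x2 and previewing that the slope estimates on the next slides do not change across models (statsmodels assumed available):

```python
import numpy as np
import statsmodels.api as sm

x1 = np.array([2, 2, 2, 2, 4, 4, 4, 4])
x2 = np.array([5, 5, 7, 7, 5, 5, 7, 7])
y = np.array([52, 51, 49, 46, 50, 48, 44, 43])

# The balanced 2x2 layout makes x1 and x2 exactly uncorrelated.
print("corr(x1, x2) =", np.corrcoef(x1, x2)[0, 1])

# Each predictor's slope is identical whether it is used alone or with the other.
for X in (x1.reshape(-1, 1), x2.reshape(-1, 1), np.column_stack([x1, x2])):
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    print(fit.params.round(3))
```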

Example

Regress y on x1

The regression equation is y = 48.8 - 0.63 x1

Predictor     Coef  SE Coef      T      P
Constant    48.750    4.025  12.11  0.000
x1          -0.625    1.273  -0.49  0.641

Analysis of Variance
Source      DF     SS     MS     F      P
Regression   1   3.13   3.13  0.24  0.641
Error        6  77.75  12.96
Total        7  80.88

Regress y on x2

The regression equation is y = 55.1 - 1.38 x2

Predictor     Coef  SE Coef      T      P
Constant    55.125    7.119   7.74  0.000
x2          -1.375    1.170  -1.17  0.285

Analysis of Variance
Source      DF     SS     MS     F      P
Regression   1  15.13  15.13  1.38  0.285
Error        6  65.75  10.96
Total        7  80.88

Regress y on x1 and x2

The regression equation is y = 57.0 - 0.63 x1 - 1.38 x2

Predictor     Coef  SE Coef      T      P
Constant    57.000    8.486   6.72  0.001
x1          -0.625    1.251  -0.50  0.639
x2          -1.375    1.251  -1.10  0.322

Analysis of Variance
Source      DF     SS     MS     F      P
Regression   2  18.25   9.13  0.73  0.528
Error        5  62.63  12.53
Total        7  80.88

Source  DF  Seq SS
x1       1    3.13
x2       1   15.13

Regress y on x2 and x1

The regression equation is y = 57.0 - 1.38 x2 - 0.63 x1

Predictor     Coef  SE Coef      T      P
Constant    57.000    8.486   6.72  0.001
x2          -1.375    1.251  -1.10  0.322
x1          -0.625    1.251  -0.50  0.639

Analysis of Variance
Source      DF     SS     MS     F      P
Regression   2  18.25   9.13  0.73  0.528
Error        5  62.63  12.53
Total        7  80.88

Source  DF  Seq SS
x2       1   15.13
x1       1    3.13

Summary of results

Model               b1      se(b1)   b2      se(b2)   Seq SS
x1 only             -0.625  1.273    ---     ---      SSR(X1) = 3.13
x2 only             ---     ---      -1.375  1.170    SSR(X2) = 15.13
x1, x2 (in order)   -0.625  1.251    -1.375  1.251    SSR(X2|X1) = 15.13
x2, x1 (in order)   -0.625  1.251    -1.375  1.251    SSR(X1|X2) = 3.13

If predictors are perfectly uncorrelated, then… You get the same slope estimates regardless of the first-order regression model used. That is, the effect on the response ascribed to a predictor doesn’t depend on the other predictors in the model.
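This behaviour follows from standard least-squares algebra (the matrix notation below is added here, not taken from the slides): with centered predictors, the cross-product matrix is diagonal when the predictors are uncorrelated, so the normal equations decouple and each multiple-regression slope equals its simple-regression counterpart.

```latex
\[
  \mathbf{b} = (X^{\top}X)^{-1}X^{\top}\mathbf{y},
  \qquad
  X^{\top}X \ \text{diagonal (centered, uncorrelated predictors)}
  \;\Longrightarrow\;
  b_k = \frac{\sum_i (x_{ik}-\bar{x}_k)(y_i-\bar{y})}{\sum_i (x_{ik}-\bar{x}_k)^2}.
\]
```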

If predictors are perfectly uncorrelated, then… The sum of squares SSR(X1) is the same as the sequential sum of squares SSR(X1|X2). The sum of squares SSR(X2) is the same as the sequential sum of squares SSR(X2|X1). That is, the marginal contribution of one predictor variable in reducing the error sum of squares doesn’t depend on the other predictors in the model.
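Concretely, using the extra (sequential) sum of squares decomposition and the outputs above (small discrepancies are rounding in the printed values):

```latex
\[
  SSR(X_2 \mid X_1) = SSR(X_1, X_2) - SSR(X_1) = 18.25 - 3.13 = 15.12 \approx SSR(X_2) = 15.13,
\]
\[
  SSR(X_1 \mid X_2) = SSR(X_1, X_2) - SSR(X_2) = 18.25 - 15.13 = 3.12 \approx SSR(X_1) = 3.13.
\]
```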

Do we see similar results for “real data” with nearly uncorrelated predictors?

Example

           BP     Age  Weight    BSA  Duration  Pulse
Age       0.659
Weight    0.950  0.407
Stress    0.164  0.368   0.034  0.018    0.312  0.506

The predictors used next, Stress and BSA, are nearly uncorrelated (r = 0.018).

Example

Regress BP on x1 = Stress

The regression equation is BP = 113 + 0.0240 Stress

Predictor      Coef  SE Coef      T      P
Constant    112.720    2.193  51.39  0.000
Stress      0.02399  0.03404   0.70  0.490

S = 5.502   R-Sq = 2.7%   R-Sq(adj) = 0.0%

Analysis of Variance
Source      DF      SS     MS     F      P
Regression   1   15.04  15.04  0.50  0.490
Error       18  544.96  30.28
Total       19  560.00

Regress BP on x2 = BSA

The regression equation is BP = 45.2 + 34.4 BSA

Predictor     Coef  SE Coef     T      P
Constant    45.183    9.392  4.81  0.000
BSA         34.443    4.690  7.34  0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   1  419.86  419.86  53.93  0.000
Error       18  140.14    7.79
Total       19  560.00

Regress BP on x1 = Stress and x2 = BSA

The regression equation is BP = 44.2 + 0.0217 Stress + 34.3 BSA

Predictor      Coef  SE Coef     T      P
Constant     44.245    9.261  4.78  0.000
Stress      0.02166  0.01697  1.28  0.219
BSA          34.334    4.611  7.45  0.000

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  432.12  216.06  28.72  0.000
Error       17  127.88    7.52
Total       19  560.00

Source  DF  Seq SS
Stress   1   15.04
BSA      1  417.07

Regress BP on x2 = BSA and x1 = Stress

The regression equation is BP = 44.2 + 34.3 BSA + 0.0217 Stress

Predictor      Coef  SE Coef     T      P
Constant     44.245    9.261  4.78  0.000
BSA          34.334    4.611  7.45  0.000
Stress      0.02166  0.01697  1.28  0.219

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  432.12  216.06  28.72  0.000
Error       17  127.88    7.52
Total       19  560.00

Source  DF  Seq SS
BSA      1  419.86
Stress   1   12.26

Summary of results

Model               b1      se(b1)   b2      se(b2)   Seq SS
x1 only             0.0240  0.0340   ---     ---      SSR(X1) = 15.04
x2 only             ---     ---      34.443  4.690    SSR(X2) = 419.86
x1, x2 (in order)   0.0217  0.0170   34.334  4.611    SSR(X2|X1) = 417.07
x2, x1 (in order)   0.0217  0.0170   34.334  4.611    SSR(X1|X2) = 12.26

If predictors are nearly uncorrelated, then… You get similar slope estimates regardless of the first-order regression model used. The sum of squares SSR(X1) is similar to the sequential sum of squares SSR(X1|X2). The sum of squares SSR(X2) is similar to the sequential sum of squares SSR(X2|X1).

What happens if the predictor variables are highly correlated?

Example

           BP     Age  Weight    BSA  Duration  Pulse
Age       0.659
Weight    0.950  0.407
BSA       0.866  0.378   0.875
Stress    0.164  0.368   0.034  0.018    0.312  0.506

The predictors used next, Weight and BSA, are highly correlated (r = 0.875).

Example

Regress BP on x1 = Weight

The regression equation is BP = 2.21 + 1.20 Weight

Predictor      Coef  SE Coef      T      P
Constant      2.205    8.663   0.25  0.802
Weight      1.20093  0.09297  12.92  0.000

S = 1.740   R-Sq = 90.3%   R-Sq(adj) = 89.7%

Analysis of Variance
Source      DF      SS      MS       F      P
Regression   1  505.47  505.47  166.86  0.000
Error       18   54.53    3.03
Total       19  560.00

Regress BP on x2 = BSA

The regression equation is BP = 45.2 + 34.4 BSA

Predictor     Coef  SE Coef     T      P
Constant    45.183    9.392  4.81  0.000
BSA         34.443    4.690  7.34  0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   1  419.86  419.86  53.93  0.000
Error       18  140.14    7.79
Total       19  560.00

Regress BP on x1 = Weight and x2 = BSA

The regression equation is BP = 5.65 + 1.04 Weight + 5.83 BSA

Predictor    Coef  SE Coef     T      P
Constant    5.653    9.392  0.60  0.555
Weight     1.0387   0.1927  5.39  0.000
BSA         5.831    6.063  0.96  0.350

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  508.29  254.14  83.54  0.000
Error       17   51.71    3.04
Total       19  560.00

Source  DF  Seq SS
Weight   1  505.47
BSA      1    2.81

Regress BP on x2 = BSA and x1 = Weight

The regression equation is BP = 5.65 + 5.83 BSA + 1.04 Weight

Predictor    Coef  SE Coef     T      P
Constant    5.653    9.392  0.60  0.555
BSA         5.831    6.063  0.96  0.350
Weight     1.0387   0.1927  5.39  0.000

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  508.29  254.14  83.54  0.000
Error       17   51.71    3.04
Total       19  560.00

Source  DF  Seq SS
BSA      1  419.86
Weight   1   88.43

Effect #1 of multicollinearity When predictor variables are correlated, the regression coefficient of any one variable depends on which other predictor variables are included in the model.

Variables in model   b1      b2
x1                   1.20    ----
x2                   ----    34.4
x1, x2               1.04    5.83
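A minimal sketch that reproduces this comparison, assuming the blood-pressure data sit in a local file named bloodpress.csv with columns BP, Weight, and BSA (the file name and availability are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bloodpress.csv")  # assumed file with BP, Weight, BSA columns

# With Weight and BSA highly correlated, the estimated coefficient of each
# predictor shifts noticeably depending on which other predictor is in the model.
for formula in ("BP ~ Weight", "BP ~ BSA", "BP ~ Weight + BSA"):
    fit = smf.ols(formula, data=df).fit()
    print(formula, "->", fit.params.round(3).to_dict())
```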

Even correlated predictors not in the model can have an impact! Consider a regression of territory sales on territory population, per capita income, etc. Against expectation, the coefficient of territory population turned out to be negative. The competitor's market penetration, which was strongly positively correlated with territory population, was not included in the model. But the competitor kept sales down in territories with large populations.

Effect #2 of multicollinearity When predictor variables are correlated, the precision of the estimated regression coefficients decreases as more predictor variables are added to the model.

Variables in model   se(b1)   se(b2)
x1                   0.093    ----
x2                   ----     4.69
x1, x2               0.193    6.06
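A standard result for the two-predictor linear model makes the loss of precision explicit (the notation below is introduced here, not in the slides): each slope's sampling variance is inflated by 1/(1 − r₁₂²), where r₁₂ is the correlation between the two predictors.

```latex
\[
  \operatorname{Var}(b_k) \;=\; \frac{\sigma^2}{\bigl(1 - r_{12}^{2}\bigr)\sum_i (x_{ik} - \bar{x}_k)^2},
  \qquad k = 1, 2.
\]
% With r(Weight, BSA) = 0.875, the inflation factor is 1/(1 - 0.875^2) ≈ 4.3,
% consistent with se(b1) roughly doubling (0.093 to 0.193), since sqrt(4.3) ≈ 2.1.
```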

Why not effects #1 and #2?

Why not effects #1 and #2?

Why effects #1 and #2?

Effect #3 of multicollinearity When predictor variables are correlated, the marginal contribution of any one predictor variable in reducing the error sum of squares varies, depending on which other variables are already in the model.

SSR(X1) = 505.47    SSR(X1|X2) = 88.43
SSR(X2) = 419.86    SSR(X2|X1) = 2.81
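These values follow directly from the fits above via the same extra sum of squares decomposition (the 0.01 discrepancy is rounding in the printed output):

```latex
\[
  SSR(X_1 \mid X_2) = SSR(X_1, X_2) - SSR(X_2) = 508.29 - 419.86 = 88.43,
\]
\[
  SSR(X_2 \mid X_1) = SSR(X_1, X_2) - SSR(X_1) = 508.29 - 505.47 = 2.82 \approx 2.81.
\]
```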

What is the effect on estimating the mean response or predicting a new response?

Effect #4 of multicollinearity on estimating the mean response or predicting Y

Weight   Fit    SE Fit   95.0% CI            95.0% PI
92       112.7  0.402    (111.85, 113.54)    (108.94, 116.44)

BSA   Fit    SE Fit   95.0% CI            95.0% PI
2     114.1  0.624    (112.76, 115.38)    (108.06, 120.08)

BSA  Weight   Fit    SE Fit   95.0% CI            95.0% PI
2    92       112.8  0.448    (111.93, 113.83)    (109.08, 116.68)

High multicollinearity among the predictor variables does not prevent good, precise predictions of the response (within the scope of the model).
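For reference, fitted-value confidence intervals and prediction intervals like those above can be obtained from a fitted model; a hedged sketch using statsmodels, again assuming a bloodpress.csv file with BP, Weight, and BSA columns:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bloodpress.csv")  # assumed file and column names
fit = smf.ols("BP ~ Weight + BSA", data=df).fit()

# 95% confidence interval for the mean response and prediction interval for a
# new individual with Weight = 92 and BSA = 2.
new = pd.DataFrame({"Weight": [92], "BSA": [2]})
print(fit.get_prediction(new).summary_frame(alpha=0.05))
```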

What is the effect on tests of individual slopes?

The regression equation is BP = 45.2 + 34.4 BSA

Predictor     Coef  SE Coef     T      P
Constant    45.183    9.392  4.81  0.000
BSA         34.443    4.690  7.34  0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   1  419.86  419.86  53.93  0.000
Error       18  140.14    7.79
Total       19  560.00

What is the effect on tests of individual slopes?

The regression equation is BP = 2.21 + 1.20 Weight

Predictor      Coef  SE Coef      T      P
Constant      2.205    8.663   0.25  0.802
Weight      1.20093  0.09297  12.92  0.000

S = 1.740   R-Sq = 90.3%   R-Sq(adj) = 89.7%

Analysis of Variance
Source      DF      SS      MS       F      P
Regression   1  505.47  505.47  166.86  0.000
Error       18   54.53    3.03
Total       19  560.00

What is the effect on tests of individual slopes?

The regression equation is BP = 5.65 + 1.04 Weight + 5.83 BSA

Predictor    Coef  SE Coef     T      P
Constant    5.653    9.392  0.60  0.555
Weight     1.0387   0.1927  5.39  0.000
BSA         5.831    6.063  0.96  0.350

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  508.29  254.14  83.54  0.000
Error       17   51.71    3.04
Total       19  560.00

Source  DF  Seq SS
Weight   1  505.47
BSA      1    2.81

Effect #5 of multicollinearity on slope tests When predictor variables are correlated, hypothesis tests for βk = 0 may yield different conclusions depending on which predictor variables are in the model.

Variables in model   b2     se(b2)   t      P-value
x2                   34.4   4.7      7.34   0.000
x1, x2               5.83   6.1      0.96   0.350
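The reversal in conclusions traces directly to the inflated standard error; the t statistics are simply the ratios of the values shown above:

```latex
\[
  t = \frac{b_2}{se(b_2)} = \frac{34.443}{4.690} = 7.34 \ \text{(BSA alone)},
  \qquad
  t = \frac{5.831}{6.063} = 0.96 \ \text{(BSA with Weight in the model)}.
\]
```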

The major impacts on model use In the presence of multicollinearity, it is okay to use an estimated regression model to predict y or estimate μY within the scope of the model. In the presence of multicollinearity, we can no longer interpret a slope coefficient as the change in the mean response for each additional unit increase in xk when all the other predictors are held constant.