Detecting and reducing multicollinearity. Detecting multicollinearity.

Detecting and reducing multicollinearity

Detecting multicollinearity

Common methods of detection
– Realized effects of multicollinearity: changes in coefficients, changes in standard errors of coefficients, changes in sequential sums of squares.
– Non-significant t-tests for all of the slopes but a significant overall F-test.
– Significant correlations among pairs of predictor variables (correlations, matrix scatter plots).
– Variance inflation factors (VIF).

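The first variance at issue
For the model
$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i$$
the variance of the estimated coefficient $b_k$ is
$$\mathrm{Var}(b_k) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2} \times \frac{1}{1 - R_k^2}$$
where $R_k^2$ is the $R^2$ value obtained by regressing the $k$th predictor on the remaining predictors.

The second variance at issue
For the model containing only the $k$th predictor
$$y_i = \beta_0 + \beta_k x_{ik} + \varepsilon_i$$
the variance of the estimated coefficient $b_k$ is
$$\mathrm{Var}(b_k)_{\min} = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}$$

The ratio of the two variances
$$\frac{\mathrm{Var}(b_k)}{\mathrm{Var}(b_k)_{\min}} = \frac{1}{1 - R_k^2}$$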

Variance inflation factors
The variance inflation factor for the $k$th predictor is
$$VIF_k = \frac{1}{1 - R_k^2}$$
where $R_k^2$ is the $R^2$ value obtained by regressing the $k$th predictor on the remaining predictors.

Variance inflation factors ($VIF_k$)
– A measure of how much the variance of the estimated regression coefficient $b_k$ is "inflated" by the existence of correlation among the predictor variables in the model.
– VIFs exceeding 4 warrant investigation.
– VIFs exceeding 10 are signs of serious multicollinearity.
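The definition above can be turned into a short computation. What follows is a minimal sketch, not the software used for the slides: it assumes NumPy, and the helper names `r_squared`, `vif` and `flag` as well as the synthetic data are made up for illustration. Each predictor is regressed on the remaining ones by ordinary least squares and its VIF is computed as 1 / (1 - R_k^2); the cutoffs 4 and 10 are the ones quoted above.

```python
import numpy as np

def r_squared(y, X):
    """R^2 from an ordinary least-squares fit of y on X (an intercept is added here)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sse = resid @ resid                      # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
    return 1.0 - sse / sst

def vif(X):
    """VIF_k = 1 / (1 - R_k^2) for each column k of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    values = []
    for k in range(X.shape[1]):
        others = np.delete(X, k, axis=1)     # all predictors except the k-th
        values.append(1.0 / (1.0 - r_squared(X[:, k], others)))
    return np.array(values)

def flag(v):
    """Apply the rule-of-thumb cutoffs from the slide (4 and 10)."""
    if v > 10:
        return "serious"
    if v > 4:
        return "investigate"
    return "ok"

# Synthetic illustration (not the blood pressure data): x2 is nearly a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)                    # unrelated to x1 and x2
vifs = vif(np.column_stack([x1, x2, x3]))
labels = [flag(v) for v in vifs]
print(dict(zip(["x1", "x2", "x3"], labels)))
```

In this made-up data set the two nearly identical predictors receive large VIFs while the unrelated one stays close to 1, which is the pattern the cutoffs are meant to catch.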

Blood pressure example
– n = 20 hypertensive individuals
– p - 1 = 6 predictor variables

Blood pressure example
Blood pressure (BP) is the response.

                  BP       Age    Weight       BSA  Duration     Pulse
Age            0.659
Weight         0.950     0.407
BSA            0.866     0.378     0.875
Duration       0.293     0.344     0.201     0.131
Pulse          0.721     0.619     0.659     0.465     0.402
Stress         0.164     0.368     0.034     0.018     0.312     0.506

Regress y = BP on all 6 predictors

Predictor      Coef   SE Coef       T      P   VIF
Constant    -12.870     2.557   -5.03  0.000
Age         0.70326   0.04961   14.18  0.000   1.8
Weight      0.96992   0.06311   15.37  0.000   8.4
BSA           3.776     1.580    2.39  0.033   5.3
Dur         0.06838   0.04844    1.41  0.182   1.2
Pulse      -0.08448   0.05161   -1.64  0.126   4.4
Stress     0.005572  0.003412    1.63  0.126   1.8

S = 0.4072   R-Sq = 99.6%   R-Sq(adj) = 99.4%

Analysis of Variance
Source            DF        SS        MS        F      P
Regression         6   557.844    92.974   560.64  0.000
Residual Error    13     2.156     0.166
Total             19   560.000

Regress $x_2$ = Weight on the 5 remaining predictors

Predictor      Coef   SE Coef       T      P   VIF
Constant     19.674     9.465    2.08  0.057
Age         -0.1446    0.2065   -0.70  0.495   1.7
BSA          21.422     3.465    6.18  0.000   1.4
Dur          0.0087    0.2051    0.04  0.967   1.2
Pulse        0.5577    0.1599    3.49  0.004   2.4
Stress     -0.02300   0.01308   -1.76  0.101   1.5

S = 1.725   R-Sq = 88.1%   R-Sq(adj) = 83.9%

Analysis of Variance
Source            DF        SS        MS        F      P
Regression         5   308.839    61.768    20.77  0.000
Residual Error    14    41.639     2.974
Total             19   350.478

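The variance inflation factor calculated by its definition
Regressing Weight on the other five predictors gives R-Sq = 88.1%, so
$$VIF_{Weight} = \frac{1}{1 - R_{Weight}^2} = \frac{1}{1 - 0.881} = 8.40$$
The variance of the weight coefficient is inflated by a factor of 8.40 due to the existence of correlation among the predictor variables in the model.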
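As a quick arithmetic check, the sketch below recomputes the Weight VIF in two ways from numbers reported in the output above: once from the rounded R-Sq of 88.1%, and once from the sums of squares in the analysis of variance table for the regression of Weight on the other predictors. The function name `vif_from_r2` is made up for this illustration.

```python
def vif_from_r2(r2):
    """VIF_k = 1 / (1 - R_k^2)."""
    return 1.0 / (1.0 - r2)

# R-Sq as printed (rounded to three decimals) for Weight regressed on the other predictors.
r2_reported = 0.881
vif_weight = vif_from_r2(r2_reported)

# Same quantity from the ANOVA table: R^2 = 1 - SSE / SSTO.
sse, ssto = 41.639, 350.478
r2_anova = 1.0 - sse / ssto
vif_weight_anova = vif_from_r2(r2_anova)     # algebraically equal to SSTO / SSE

print(round(vif_weight, 2), round(vif_weight_anova, 2))
```

The two results differ slightly only because the printed R-Sq is rounded; both agree with the VIF of 8.4 shown for Weight in the full regression output.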

The pairwise correlations
Blood pressure (BP) is the response.

                  BP       Age    Weight       BSA  Duration     Pulse
Age            0.659
Weight         0.950     0.407
BSA            0.866     0.378     0.875
Duration       0.293     0.344     0.201     0.131
Pulse          0.721     0.619     0.659     0.465     0.402
Stress         0.164     0.368     0.034     0.018     0.312     0.506

Regress y = BP on Age, Weight, Duration and Stress

Predictor      Coef   SE Coef       T      P   VIF
Constant    -15.870     3.195   -4.97  0.000
Age         0.68374   0.06120   11.17  0.000   1.5
Weight      1.03413   0.03267   31.65  0.000   1.2
Dur         0.03989   0.06449    0.62  0.545   1.2
Stress     0.002184  0.003794    0.58  0.573   1.2

S = 0.5505   R-Sq = 99.2%   R-Sq(adj) = 99.0%

Analysis of Variance
Source            DF        SS        MS        F      P
Regression         4    555.45    138.86   458.28  0.000
Residual Error    15      4.55      0.30
Total             19    560.00

Reducing data-based multicollinearity

Data-based multicollinearity Multicollinearity that results from a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which you collect the data.

Some methods
– Modify the regression model by eliminating one or more predictor variables.
– Collect additional data under different experimental or observational conditions.

(Modified!) Allen Cognitive Level (ACL) Study
Relationship of ACL test to level of pathology in a set of 23 patients in a hospital psychiatry unit:
– Response y = ACL score
– $x_1$ = vocabulary (Vocab) score on Shipley Institute of Living Scale
– $x_2$ = abstraction (Abstract) score on Shipley Institute of Living Scale
– $x_3$ = score on Symbol-Digit Modalities Test (SDMT)

Allen Cognitive Level (ACL) Study on 23 patients

Strong correlation between Vocab and Abstract Pearson correlation of Vocab and Abstract = 0.990

Regress y = ACL on SDMT, Vocab, and Abstract (23 patients)

Predictor      Coef   SE Coef       T      P   VIF
Constant      3.747     1.342    2.79  0.012
SDMT        0.02326   0.01273    1.83  0.083   1.7
Vocab        0.0283    0.1524    0.19  0.855  49.3
Abstract    -0.0138    0.1006   -0.14  0.892  50.6

S = 0.7344   R-Sq = 26.5%   R-Sq(adj) = 14.8%

Analysis of Variance
Source            DF        SS        MS        F      P
Regression         3    3.6854    1.2285     2.28  0.112
Residual Error    19   10.2476    0.5393
Total             22   13.9330

Allen Cognitive Level (ACL) Study on 69 patients

Plot after having collected more data Pearson correlation of Vocab and Abstract = 0.698

Regress y = ACL on SDMT, Vocab, and Abstract (69 patients)

Predictor      Coef   SE Coef       T      P   VIF
Constant     3.9463    0.3381   11.67  0.000
SDMT       0.027404  0.007168    3.82  0.000   1.6
Vocab      -0.01740   0.01808   -0.96  0.339   2.1
Abstract    0.01218   0.01159    1.05  0.297   2.2

S = 0.6878   R-Sq = 28.6%   R-Sq(adj) = 25.3%

Analysis of Variance
Source            DF        SS        MS        F      P
Regression         3   12.3009    4.1003     8.67  0.000
Residual Error    65   30.7487    0.4731
Total             68   43.0496

Reducing structural multicollinearity In context of polynomial regression models

Structural multicollinearity
Multicollinearity that is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor $x^2$ from the predictor $x$.

Example
– (General research question) What is the impact of exercise on the human immune system?
– (Specific research question) How is the amount of immunoglobulin in blood (y) related to maximal oxygen uptake (x)?

Scatter plot

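A quadratic polynomial regression function
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$$
where:
– $y_i$ = amount of immunoglobulin in blood (mg)
– $x_i$ = maximal oxygen uptake (ml/kg)
– typical assumptions about error terms ("INE": independent, normally distributed, equal variances)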

Estimated quadratic function

Interpretation of the regression coefficients
– $b_0$ is the predicted response at x = 0, provided 0 is a possible x value. Otherwise, $b_0$ has no meaningful interpretation.
– $b_1$ is the slope of the tangent line at x = 0.
– $b_2$ indicates the up/down direction of the curve:
  – $b_2 < 0$ means the curve is concave down
  – $b_2 > 0$ means the curve is concave up

Regress y = igg on oxygen and oxygen$^2$

The regression equation is
igg = - 1464 + 88.3 oxygen - 0.536 oxygensq

Predictor      Coef   SE Coef       T      P   VIF
Constant    -1464.4     411.4   -3.56  0.001
oxygen        88.31     16.47    5.36  0.000  99.9
oxygensq    -0.5362    0.1582   -3.39  0.002  99.9

S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Analysis of Variance
Source            DF        SS        MS        F      P
Regression         2   4602211   2301105   203.16  0.000
Residual Error    27    305818     11327
Total             29   4908029

Structural multicollinearity Pearson correlation of oxygen and oxygensq = 0.995

"Center" the predictors
Mean of oxygen = 50.637

oxygen    oxcent  oxcentsq
  34.6   -16.037   257.185
  45.0    -5.637    31.776
  62.3    11.663   136.026
  58.9     8.263    68.277
  42.5    -8.137    66.211
  44.3    -6.337    40.158
  67.9    17.263   298.011
  58.5     7.863    61.827
  35.6   -15.037   226.111
  49.6    -1.037     1.075
  33.0   -17.637   311.064

Wow! It really works! Pearson correlation of oxcent and oxcentsq = 0.219
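The effect of centering can be checked numerically. The sketch below assumes NumPy and uses only the eleven oxygen values listed in the centering table, centered at the reported mean of 50.637. Because the regression output implies 30 observations in total, the correlations it prints will not match the reported 0.995 and 0.219 exactly; it only illustrates the direction and rough size of the change.

```python
import numpy as np

# The eleven oxygen values listed in the centering table (a subset of the full data).
oxygen = np.array([34.6, 45.0, 62.3, 58.9, 42.5, 44.3, 67.9, 58.5, 35.6, 49.6, 33.0])
reported_mean = 50.637            # mean of oxygen as reported for the full data set

oxcent = oxygen - reported_mean   # centered predictor
oxcentsq = oxcent ** 2            # square of the centered predictor

# Correlation between the linear and quadratic terms, before and after centering.
r_raw = np.corrcoef(oxygen, oxygen ** 2)[0, 1]
r_centered = np.corrcoef(oxcent, oxcentsq)[0, 1]
print(f"raw: {r_raw:.3f}  centered: {r_centered:.3f}")
```

The first row reproduces the table entries (oxcent = -16.037, oxcentsq = 257.185), and the correlation between the linear and squared terms drops sharply once the predictor is centered.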

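A better quadratic polynomial regression function
$$y_i = \beta_0^* + \beta_1^* x_i^* + \beta_2^* (x_i^*)^2 + \varepsilon_i$$
where $x_i^* = x_i - \bar{x}$ denotes the centered predictor and:
– $y_i$ = amount of immunoglobulin in blood (mg)
– typical assumptions about error terms ("INE": independent, normally distributed, equal variances)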

Regress y = igg on oxcent and oxcent$^2$

The regression equation is
igg = 1632 + 34.0 oxcent - 0.536 oxcentsq

Predictor      Coef   SE Coef       T      P   VIF
Constant    1632.20     29.35   55.61  0.000
oxcent       34.000     1.689   20.13  0.000   1.1
oxcentsq    -0.5362    0.1582   -3.39  0.002   1.1

S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Analysis of Variance
Source            DF        SS        MS        F      P
Regression         2   4602211   2301105   203.16  0.000
Residual Error    27    305818     11327
Total             29   4908029

Interpretation of the regression coefficients
– $b_0$ is the predicted response at the predictor mean.
– $b_1$ is the estimated slope of the tangent line at the predictor mean, and is often similar to the estimated slope in the simple (first-order) linear model.
– $b_2$ indicates the up/down direction of the curve:
  – $b_2 < 0$ means the curve is concave down
  – $b_2 > 0$ means the curve is concave up

Estimated regression function

Similar estimates of coefficients from first-order linear model

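The relationship between the two forms of the model
Centered model:
$$y_i = \beta_0^* + \beta_1^* (x_i - \bar{x}) + \beta_2^* (x_i - \bar{x})^2 + \varepsilon_i$$
Original model:
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$$
where:
$$\beta_0 = \beta_0^* - \beta_1^* \bar{x} + \beta_2^* \bar{x}^2, \qquad \beta_1 = \beta_1^* - 2 \beta_2^* \bar{x}, \qquad \beta_2 = \beta_2^*$$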

Mean of oxygen = 50.637
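Expanding the centered quadratic in powers of x gives the original coefficients as b0 = b0* - b1* x̄ + b2* x̄², b1 = b1* - 2 b2* x̄ and b2 = b2*. The short sketch below applies these identities to the centered estimates reported above, using the reported mean of oxygen; the variable names are made up for this illustration, and small discrepancies are expected because the printed coefficients are rounded.

```python
x_bar = 50.637                                # mean of oxygen, as reported

# Estimates from the centered fit (igg on oxcent and oxcentsq).
b0_c, b1_c, b2_c = 1632.20, 34.000, -0.5362

# Convert back to the coefficients of the uncentered quadratic.
b0 = b0_c - b1_c * x_bar + b2_c * x_bar ** 2  # intercept at oxygen = 0
b1 = b1_c - 2 * b2_c * x_bar                  # slope of the tangent line at oxygen = 0
b2 = b2_c                                     # curvature is unchanged by centering

print(round(b0, 1), round(b1, 2), b2)
```

The results agree, up to rounding, with the uncentered output (Constant -1464.4, oxygen 88.31, oxygensq -0.5362), which shows that the two fits describe the same curve.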

Model evaluation

Model use: What is predicted IgG if maximal oxygen uptake is 90?
There is an even greater danger in extrapolation when modeling data with a polynomial function, because of changes in direction.

Predicted Values for New Observations
New Obs     Fit  SE Fit          95.0% CI          95.0% PI
      1  2139.6   219.2  (1689.8,2589.5)  (1639.6,2639.7) XX
X denotes a row with X values away from the center
XX denotes a row with very extreme X values

Values of Predictors for New Observations
New Obs  oxcent  oxcentsq
      1    39.4      1549
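The point prediction and the warning about changes in direction can both be checked from the centered coefficients. The sketch below is plain arithmetic on the reported estimates (it does not reproduce the standard error or the intervals); `predict_igg` and `turning_point` are names made up for this illustration.

```python
x_bar = 50.637                                # mean of oxygen, as reported
b0_c, b1_c, b2_c = 1632.20, 34.000, -0.5362   # estimates from the centered fit

def predict_igg(oxygen):
    """Fitted IgG from the centered quadratic at a given maximal oxygen uptake."""
    c = oxygen - x_bar
    return b0_c + b1_c * c + b2_c * c ** 2

fit_at_90 = predict_igg(90.0)

# A concave-down parabola peaks where its derivative is zero:
# b1_c + 2 * b2_c * (x - x_bar) = 0.
turning_point = x_bar - b1_c / (2 * b2_c)

print(round(fit_at_90, 1), round(turning_point, 1))
```

The fitted value is close to the reported 2139.6 (the difference comes from rounded coefficients). The fitted curve peaks at an oxygen uptake of roughly 82, so at 90 the quadratic has already turned downward, which is exactly the kind of change in direction that makes polynomial extrapolation risky.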

The hierarchical approach to model fitting
A widely accepted approach is to fit a higher-order model and then explore whether a lower-order (simpler) model is adequate.
Is a first-order linear model ("line") adequate?

The hierarchical approach to model fitting
But then, if a polynomial term of a given order is retained, all related lower-order terms are also retained. That is, if the quadratic term is significant, you would use this regression function:
$$\mu_Y = \beta_0 + \beta_1 x + \beta_2 x^2$$
and not this one:
$$\mu_Y = \beta_0 + \beta_2 x^2$$
