
1 Multicollinearity: an introductory example
A high-tech business wants to measure the effect of advertising on sales and would like to distinguish between traditional advertising (TV and newspapers) and advertising on the internet.
– Y: sales in $m
– X1: traditional advertising in $m
– X2: internet advertising in $m
Data: Sales3.sav

2 A matrix scatter plot of the data
Cor(y, x1) = 0.983
Cor(y, x2) = 0.986
Cor(x1, x2) = 0.990
x1 and x2 are strongly correlated, i.e. they have a substantial amount of common information: x1 = α0 + α1·x2 + ε
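
The same check can be run outside SPSS. Below is a minimal Python sketch, assuming Sales3.sav has been exported to a CSV file and that the columns are named sales, tv_news and internet (hypothetical names, not from the slides).

    # Pairwise correlations for the Sales3 data (hypothetical CSV export and column names).
    import pandas as pd

    df = pd.read_csv("sales3.csv")
    print(df[["sales", "tv_news", "internet"]].corr())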

3 Regression output (using x1 only; equivalent results are obtained when using x2 only)

Model summary: R = 0.983, R² = 0.965, Adj. R² = 0.962, Std. Error = 0.9764

ANOVA:
  Regression   SS = 265.466, df = 1,  MS = 265.466, F = 278.438, Sig. = 0.000
  Residual     SS = 9.534,   df = 10, MS = 0.953
  Total        SS = 275.000, df = 11

Coefficients:
  (Constant)                           B = 0.885, Std. Error = 0.696, t = 1.272,  Sig. = 0.232
  X1 = traditional advertising in $m   B = 2.254, Std. Error = 0.135, t = 16.686, Sig. = 0.000

4 Regression output (using both x1 and x2)

Model summary: R = 0.987, R² = 0.974, Adj. R² = 0.968, Std. Error = 0.8916

ANOVA:
  Regression   SS = 267.846, df = 2, MS = 133.923, F = 168.483, Sig. = 0.000
  Residual     SS = 7.154,   df = 9, MS = 0.795
  Total        SS = 275.000, df = 11

Coefficients:
  (Constant)                           B = 1.992, Std. Error = 0.902, t = 2.210, Sig. = 0.054
  X1 = traditional advertising in $m   B = 0.767, Std. Error = 0.868, t = 0.884, Sig. = 0.400
  X2 = internet advertising in $m      B = 1.275, Std. Error = 0.737, t = 1.730, Sig. = 0.118

x1 and x2 are not significant anymore.
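
As a rough illustration of slides 3 and 4 outside SPSS, the two fits can be reproduced with statsmodels; this is only a sketch, using the same hypothetical column names as above.

    # Fit sales on x1 alone, then on x1 and x2 together (hypothetical column names).
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("sales3.csv")
    fit_x1 = smf.ols("sales ~ tv_news", data=df).fit()
    fit_both = smf.ols("sales ~ tv_news + internet", data=df).fit()

    # With both predictors the slope standard errors inflate and the individual
    # t-tests lose significance, even though R-squared barely improves.
    print(fit_x1.summary())
    print(fit_both.summary())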

5 Multicollinearity
Multicollinearity exists when two or more of the independent variables are moderately or highly correlated with each other. In practice, if independent variables are (highly) correlated they contribute too much redundant information, which prevents isolating the effect of single independent variables on y. Confusion is often the result.
High levels of multicollinearity:
a) inflate the variance of the β estimates;
b) may make regression results misleading and confusing.
In the extreme case, if there exists perfect correlation among some of the independent variables, OLS estimates cannot be computed:
x_i = α0 + α1·x_j + … + αp·x_(j+p) + ε,   with j+p < k and i ≠ j, j+1, …, j+p
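
The inflation in (a) can be made explicit with the standard OLS variance formula (a textbook identity, not shown on the slide):

    Var(β̂_j) = σ² / [ (1 − R_j²) · Σ_i (x_ij − x̄_j)² ],    VIF_j = 1 / (1 − R_j²)

where R_j² is the R² obtained by regressing x_j on the other independent variables. As R_j² approaches 1 (perfect collinearity) the variance diverges, which is why OLS estimates cannot be computed in the extreme case.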

6 Detecting Multicollinearity
The following are indicators of multicollinearity:
1. Significant correlations between pairs of independent variables in the model (sufficient but not necessary).
2. Nonsignificant t-tests for all (or nearly all) the individual β parameters when the F test for model adequacy H0: β1 = β2 = … = βk = 0 is significant.
3. Opposite signs (from what is expected) in the estimated parameters.
4. A variance inflation factor (VIF) for a β parameter greater than 10.
The VIFs can be calculated in SPSS by selecting “Collinearity diagnostics” in the “Statistics” options in the “Regression” dialog box.
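
Outside SPSS, the VIFs can also be computed directly; here is a minimal sketch with statsmodels, again assuming the hypothetical CSV export and column names used above.

    # VIF for each predictor of the Sales3 model (hypothetical column names).
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    df = pd.read_csv("sales3.csv")
    X = sm.add_constant(df[["tv_news", "internet"]])   # predictors plus intercept

    for j, name in enumerate(X.columns):
        if name != "const":                            # skip the intercept column
            print(name, variance_inflation_factor(X.values, j))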

7 A typical situation
Multicollinearity can arise when transforming variables, e.g. using x1 and x1² in the regression equation if the range of values of x1 is limited. Here Cor(x, x²) = 0.987.
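
A small synthetic illustration of this point (not the course data): over a narrow range, x and x² are almost perfectly correlated.

    # Over a limited range x^2 is nearly a linear function of x.
    import numpy as np

    x = np.linspace(10, 12, 50)          # limited range of values
    print(np.corrcoef(x, x**2)[0, 1])    # prints a value very close to 1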

8 Remember: if multicollinearity is present but not excessive (no high correlations, no VIFs above 10), you can ignore it. Each variable provides enough independent information and one can assess its value.
If your main goal is explaining relationships, then multicollinearity may be a problem because the measured effects can be misleading.
If your main goal is prediction (using the available explanatory variables to predict the response), then you can safely ignore multicollinearity.

9 Some solutions to Multicollinearity
Get more data if you can.
Drop one or more of the correlated independent variables from the final model. A screening procedure like stepwise regression may be helpful in determining which variable to drop.
If you keep all independent variables, be cautious in interpreting parameter values and keep prediction within the range of your data.
Use ridge regression (we do not touch this subject in the course).

10 Some solutions to MC
If the multicollinearity is introduced by the use of higher-order models (e.g. using x and x², or x1, x2 and x1·x2), use the independent variables as deviations from their mean.
Example: suppose multicollinearity is present in E(Y) = β0 + β1x + β2x²
1) Compute x* = x − mean(x)
2) Run the regression E(Y) = β0 + β1x* + β2(x*)²
In most cases multicollinearity is greatly reduced. Clearly the parameters β of the new regression will have different values and meaning.
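
A small synthetic sketch of the centering step (not the course data), showing how replacing x by its deviation from the mean removes most of the correlation with the squared term:

    # Correlation between x and x^2 before and after centering.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(10, 12, size=100)    # limited range -> x and x^2 nearly collinear
    x_star = x - x.mean()

    print(np.corrcoef(x, x**2)[0, 1])              # close to 1
    print(np.corrcoef(x_star, x_star**2)[0, 1])    # much closer to 0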

11 Example: Shipping costs (continued)
A company conducted a study to investigate the relationship between the cost of shipment and the variables that control the shipping charge: weight and distance. It is suspected that nonlinear effects may be present, so let us analyze the model
Model 1: E(Y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2²
– Y: cost of shipment in dollars
– X1: package weight in pounds
– X2: distance shipped in miles
Data: Express.sav
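
Model 1 can also be fitted outside SPSS; here is a minimal sketch with the statsmodels formula API, assuming Express.sav has been exported to a CSV with hypothetical column names cost, weight and distance.

    # Model 1: quadratic terms and the weight-by-distance interaction (hypothetical names).
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("express.csv")
    model1 = smf.ols("cost ~ weight + distance + weight:distance"
                     " + I(weight**2) + I(distance**2)", data=df).fit()
    print(model1.summary())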

12 Matrix scatter plot
A matrix scatter plot shows at once the bivariate scatter plots for the selected variables. Use it as a preliminary screening. In SPSS choose the “Matrix” option from the “Scatter/Dot” Graph and input the variables of interest.
Note the obvious quadratic relation for some of the variables, which is very close to linear.
The matrix is symmetric, so just look at the lower triangle.
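
An equivalent preliminary screening outside SPSS is pandas' scatter_matrix (same hypothetical column names as above):

    # Matrix scatter plot of cost, weight and distance.
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix

    df = pd.read_csv("express.csv")
    scatter_matrix(df[["cost", "weight", "distance"]], figsize=(8, 8))
    plt.show()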

13 Correlation matrix (N = 20; ** = correlation significant at the 0.01 level, 2-tailed)

                      Weight   Distance   Cost     Weight²   Distance²   Weight·Distance
Weight of parcel      1
Distance shipped      0.182    1
Cost of shipment      0.774**  0.695**    1
Weight squared        0.967**  0.202      0.799**  1
Distance squared      0.151    0.980**    0.652**  0.160     1
Weight·Distance       0.820**  0.633**    0.989**  0.821**   0.590**     1

Each independent variable is individually strongly related to Y (cost of shipment).

14 Model 1: VIF statistics
A VIF statistic larger than 10 is usually considered an indicator of substantial collinearity. The VIFs can be calculated in SPSS by selecting “Collinearity diagnostics” in the “Statistics” options in the “Regression” dialog box.

Coefficients:
  (Constant)                 B = 0.827,    Std. Error = 0.702, t = 1.178,  Sig. = 0.259
  Weight of parcel in lbs.   B = -0.609,   Std. Error = 0.180, t = -3.386, Sig. = 0.004, VIF = 20.031
  Distance shipped           B = 0.004,    Std. Error = 0.008, t = 0.503,  Sig. = 0.623, VIF = 35.526
  Weight squared             B = 0.090,    Std. Error = 0.020, t = 4.442,  Sig. = 0.001, VIF = 17.027
  Distance squared           B = 1.507E-5, Std. Error = 0.000, t = 0.672,  Sig. = 0.513, VIF = 28.921
  Weight*Distance            B = 0.007,    Std. Error = 0.001, t = 11.495, Sig. = 0.000, VIF = 12.618
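
The same kind of VIF check can be reproduced outside SPSS by building the higher-order terms explicitly; a sketch under the same hypothetical column names:

    # VIFs for Model 1's design matrix (hypothetical column names).
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    df = pd.read_csv("express.csv")
    X = pd.DataFrame({
        "weight": df["weight"],
        "distance": df["distance"],
        "weight_sq": df["weight"] ** 2,
        "distance_sq": df["distance"] ** 2,
        "weight_x_distance": df["weight"] * df["distance"],
    })
    X = sm.add_constant(X)

    for j, name in enumerate(X.columns):
        if name != "const":
            print(name, variance_inflation_factor(X.values, j))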

15 Model 2: using the independent variables as deviations from their mean
Note: the problems of multicollinearity have disappeared.

Coefficients:
  (Constant)   B = 5.467,    Std. Error = 0.216, t = 25.252, Sig. = 0.000
  x1star       B = 1.263,    Std. Error = 0.042, t = 30.128, Sig. = 0.000, VIF = 1.087
  x2star       B = 0.038,    Std. Error = 0.001, t = 27.563, Sig. = 0.000, VIF = 1.081
  x1x2star     B = 0.007,    Std. Error = 0.001, t = 11.495, Sig. = 0.000, VIF = 1.095
  x1star²      B = 0.090,    Std. Error = 0.020, t = 4.442,  Sig. = 0.001, VIF = 1.113
  x2star²      B = 1.507E-5, Std. Error = 0.000, t = 0.672,  Sig. = 0.513, VIF = 1.120

The x2star² term seems actually irrelevant (Sig. = 0.513) and could be dropped.
Note: R-square (adjusted), the ANOVA table and the predictions are the same for the two models (check).
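
Model 2 corresponds to centering weight and distance before building the higher-order terms; a sketch under the same hypothetical column names:

    # Model 2: centered predictors; fitted values and R-squared match Model 1.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("express.csv")
    df["w_star"] = df["weight"] - df["weight"].mean()
    df["d_star"] = df["distance"] - df["distance"].mean()

    model2 = smf.ols("cost ~ w_star + d_star + w_star:d_star"
                     " + I(w_star**2) + I(d_star**2)", data=df).fit()
    print(model2.summary())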

