
1 Topic 16: Multicollinearity and Polynomial Regression

2 Outline
– Multicollinearity
– Polynomial regression

3 An example (KNNL p256)
– The P-value for the ANOVA F-test is <.0001
– The P-values for the individual regression coefficients are 0.1699, 0.2849, and 0.1896
– None of these is near our standard significance level of 0.05
– What is the explanation? Multicollinearity!!!
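For reference, a minimal sketch of the fit behind these numbers; it assumes the KNNL p256 body fat data have been read into a data set a1 with variables fat, skinfold, thigh, and midarm (the names used on slide 18):

proc reg data=a1;
  * assumes body fat data (KNNL p256) in a1: fat, skinfold, thigh, midarm;
  * overall F is highly significant while no individual coefficient is;
  model fat = skinfold thigh midarm;
run;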

4 Multicollinearity
– Numerical analysis problem: the matrix X′X is close to singular and is therefore difficult to invert accurately
– Statistical problem: there is too much correlation among the explanatory variables and it is therefore difficult to determine the regression coefficients
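One way to look at the numerical side, as a sketch: PROC REG's COLLIN model option prints eigenvalues and condition indices of X′X, and large condition indices flag near-singularity (again assuming the body fat data in a1):

proc reg data=a1;
  * COLLIN prints eigenvalues and condition indices of X'X;
  * large condition indices (a common rule of thumb: > 30) suggest near-singularity;
  model fat = skinfold thigh midarm / collin;
run;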

5 Multicollinearity
– Solve the statistical problem and the numerical problem will also be solved
– We want to refine a model that currently has redundancy in the explanatory variables
– Do this regardless of whether X′X can be inverted without difficulty

6 Multicollinearity
– Extreme cases can help us understand the problems caused by multicollinearity
– If the columns of the X matrix were uncorrelated, Type I and Type II SS would be the same: the contribution of each explanatory variable to the model would be the same whether or not the other explanatory variables are in the model

7 Multicollinearity
– Suppose a linear combination of the explanatory variables is a constant
– Example: X1 = X2 → X1 − X2 = 0
– Example: 3X1 − X2 = 5
– Example: SAT total = SATV + SATM
– The Type II SS for the X's involved will all be zero

8 An example: Part I
Data a1;
  infile '../data/csdata.dat';
  input id gpa hsm hss hse satm satv genderm1;
Data a1; set a1;
  hs = (hsm+hse+hss)/3; *average of the three HS grades;
Proc reg data=a1;
  model gpa = hsm hss hse hs;
run;

9 Output
Something is wrong: the model df is 3, but there are 4 X's

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3         27.71233       9.23744     18.86    <.0001
Error            220        107.75046       0.48977
Corrected Total  223        135.46279

10 Output
NOTE: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.

11 Output
NOTE: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.
hs = 0.33333*hsm + 0.33333*hss + 0.33333*hse

12 Output
Parameter Estimates
Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Type I SS   Type II SS
Intercept   1              0.58988          0.29424      2.00     0.0462   1555.5459      1.96837
hsm         B              0.16857          0.03549      4.75     <.0001     25.80989    11.04779
hss         B              0.03432          0.03756      0.91     0.3619      1.23708     0.40884
hse         B              0.04510          0.03870      1.17     0.2451      0.66536      0.66536
hs          0              0                      .         .          .           .            .

In this extreme case, SAS does not consider hs in the model

13 Extent of multicollinearity
– This example had one explanatory variable equal to a linear combination of other explanatory variables
– This is the most extreme case of multicollinearity and is detected by statistical software because (X′X) does not have an inverse
– We are concerned with less extreme cases

14 An example: Part II
*add a little noise to break up the perfect linear association;
Data a1; set a1;
  hs1 = hs + normal(612)*.05;
Proc reg data=a1;
  model gpa = hsm hss hse hs1;
run;

15 Output
Model seems to be good here

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              4         27.81586       6.95396     14.15    <.0001
Error            219        107.64693       0.49154
Corrected Total  223        135.46279

16 Output
Parameter Estimates
Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Type I SS   Type II SS
Intercept   1              0.56271          0.30066      1.87     0.0626   1555.5459      1.72182
hsm         1              0.02411          0.31677      0.08     0.9394     25.80989      0.00285
hss         1             -0.11093          0.31872     -0.35     0.7281      1.23708      0.05954
hse         1             -0.10038          0.31937     -0.31     0.7536      0.66536      0.04856
hs1         1              0.43805          0.95451      0.46     0.6467      0.10352      0.10352

None of the predictors is significant. Look at the differences between the Type I and Type II sums of squares, and note the signs of the coefficients.

17 Effects of multicollinearity
– Regression coefficients are not well estimated and may be meaningless, and similarly for the standard errors of these estimates
– Type I SS and Type II SS will differ
– R² and predicted values are usually OK
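A related diagnostic, sketched here with PROC REG's VIF (and TOL) model options, again assuming the body fat data in a1:

proc reg data=a1;
  * VIF_k = 1/(1 - R2_k), where R2_k is from regressing X_k on the other X's;
  * VIFs much larger than 10 are a common rule-of-thumb warning sign;
  model fat = skinfold thigh midarm / vif tol;
run;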

18 Pairwise Correlations
– Pairwise correlations can be used to check for "pairwise" collinearity
– Recall KNNL p256

proc reg data=a1 corr;
  model fat = skinfold thigh midarm;
  model midarm = skinfold thigh;
run;

19 Pairwise Correlations
– Cor(skinfold, thigh) = 0.9238
– Cor(skinfold, midarm) = 0.4578
– Cor(thigh, midarm) = 0.0847
– Cor(midarm, skinfold+thigh) = 0.9952!!!
– See KNNL p284 for the change in the coefficient values of skinfold and thigh depending on which variables are in the model

20 Polynomial regression
– We can fit a quadratic, cubic, etc. relationship by defining squares, cubes, etc., of a single X in a data step and using them as additional explanatory variables
– We can do this with more than one explanatory variable if needed
– Issue: when we do this we generally create a multicollinearity problem

21 KNNL Example p300
– Response variable is the life (in cycles) of a power cell
– Explanatory variables are charge rate (3 levels) and temperature (3 levels)
– This is a designed experiment

22 Input and check the data
Data a1;
  infile '../data/ch08ta01.txt';
  input cycles chrate temp;
run;
Proc print data=a1;
run;

23 Output
Obs   cycles   chrate   temp
  1      150      0.6     10
  2       86      1.0     10
  3       49      1.4     10
  4      288      0.6     20
  5      157      1.0     20
  6      131      1.0     20
  7      184      1.0     20
  8      109      1.4     20
  9      279      0.6     30
 10      235      1.0     30
 11      224      1.4     30

24 Create new variables and run the regression
Data a1; set a1;
  chrate2 = chrate*chrate;  *squared terms;
  temp2   = temp*temp;
  ct      = chrate*temp;    *cross-product (interaction) term;
Proc reg data=a1;
  model cycles = chrate temp chrate2 temp2 ct;
run;

25 Output
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              5        55366           11073        10.57   0.0109
Error              5         5240.4386       1048.0877
Corrected Total   10        60606

26 Output
Parameter Estimates
Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   1            337.72149        149.96163      2.25     0.0741
chrate      1           -539.51754        268.86033     -2.01     0.1011
temp        1              8.91711          9.18249      0.97     0.3761
chrate2     1            171.21711        127.12550      1.35     0.2359
temp2       1             -0.10605          0.20340     -0.52     0.6244
ct          1              2.87500          4.04677      0.71     0.5092

27 Conclusion
– Overall F significant, individual t's not significant → multicollinearity problem
– Look at the correlations (proc corr)
– There are some very high correlations: r(chrate, chrate2) = 0.99103 and r(temp, temp2) = 0.98609
– Correlation between powers of a variable
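A minimal sketch of that correlation check, using the a1 data set with the squared and cross-product terms created on slide 24:

proc corr data=a1;
  * pairwise correlations among the predictors and their powers;
  var chrate temp chrate2 temp2 ct;
run;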

28 A remedy
– We can remove the correlation between explanatory variables and their powers by centering
– Centering means that you subtract off the mean before squaring, etc.
– KNNL rescaled by standardizing (subtract the mean and divide by the standard deviation), but subtracting the mean is the key step here
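If you only want to center (subtract the mean without rescaling), a sketch: PROC STANDARD with MEAN=0 and no STD= option subtracts the mean and leaves the scale unchanged. This assumes the copy variables created on slide 30; the output data set name a3c is arbitrary:

proc standard data=a2 out=a3c mean=0;
  * centers schrate and stemp at 0 but keeps their original units;
  var schrate stemp;
run;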

29 A remedy
– Use Proc Standard to center the explanatory variables
– Recompute the squares, cubes, etc., using the centered variables
– Rerun the regression analysis

30 Proc standard
Data a2; set a1;
  schrate = chrate;
  stemp   = temp;
  keep cycles schrate stemp;
Proc standard data=a2 out=a3 mean=0 std=1;
  var schrate stemp;
Proc print data=a3;
run;

31 Output
Obs   cycles    schrate      stemp
  1      150   -1.29099   -1.29099
  2       86    0.00000   -1.29099
  3       49    1.29099   -1.29099
  4      288   -1.29099    0.00000
  5      157    0.00000    0.00000
  6      131    0.00000    0.00000
  7      184    0.00000    0.00000
  8      109    1.29099    0.00000
  9      279   -1.29099    1.29099
 10      235    0.00000    1.29099
 11      224    1.29099    1.29099

32 Recompute squares and cross product
Data a3; set a3;
  schrate2 = schrate*schrate;
  stemp2   = stemp*stemp;
  sct      = schrate*stemp;

33 Rerun regression
Proc reg data=a3;
  model cycles = schrate stemp schrate2 stemp2 sct;
run;

34 Output
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              5        55366           11073        10.57   0.0109
Error              5         5240.4386       1048.0877
Corrected Total   10        60606

Exact same ANOVA table as before

35 Output
Parameter Estimates
Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   1            162.84211         16.60761      9.81     0.0002
schrate     1            -43.24831         10.23762     -4.22     0.0083
stemp       1             58.48205         10.23762      5.71     0.0023
schrate2    1             16.43684         12.20405      1.35     0.2359
stemp2      1             -6.36316         12.20405     -0.52     0.6244
sct         1              6.90000          9.71225      0.71     0.5092

36 Conclusion
– Overall F significant
– Individual t's significant for chrate and temp
– It appears a first-order (linear) model will suffice
– Could do a formal general linear test to assess this (P-value is 0.5527)
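A sketch of that general linear test using PROC REG's TEST statement, which jointly tests the three second-order terms against zero (same a3 data set as above; the label SecondOrder is arbitrary):

proc reg data=a3;
  model cycles = schrate stemp schrate2 stemp2 sct;
  * extra-sum-of-squares F test: full second-order model vs. reduced first-order model;
  SecondOrder: test schrate2=0, stemp2=0, sct=0;
run;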

37 Last slide
We went over KNNL 7.6 and 8.1. We used the program Topic16.sas to generate the output for today.

