Anareg Week 10
Multicollinearity
Interesting special cases
Polynomial regression
Multicollinearity
The numerical analysis problem is that the matrix X’X is close to singular and is therefore difficult to invert accurately.
The statistical problem is that there is too much correlation among the explanatory variables, so it is difficult to determine the regression coefficients.
Multicollinearity (2)
Solve the statistical problem and the numerical problem will also be solved.
The statistical problem is more serious than the numerical problem.
We want to refine a model that has redundancy in the explanatory variables even if X’X can be inverted without difficulty.
Multicollinearity (3)
Extreme cases can help us understand the problem.
If all X’s are uncorrelated, Type I SS and Type II SS will be the same, i.e., the contribution of each explanatory variable to the model is the same whether or not the other explanatory variables are in the model.
If a linear combination of the explanatory variables is constant (e.g., X1 = X2, so X1 - X2 = 0), then the Type II SS for the X’s involved will be zero.
An example
Y = gpa
X1 = hsm, X3 = hss, X4 = hse, X5 = satm, X6 = satv, X7 = genderm
Define: sat = satm + satv;
We will regress Y on sat, satm, and satv.
Output
Source            DF
Model              2
Error            221
Corrected Total  223
Something is wrong: dfM = 2, but there are 3 X’s.
Output (2)
NOTE: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.
Output (3)
NOTE: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.
satv = sat - satm
Output (4)
Var    DF   Par Est   St Err   t   P
Int
sat    B
satm   B
satv
Extent of multicollinearity
Our CS example had one explanatory variable equal to a linear combination of other explanatory variables.
This is the most extreme case of multicollinearity, and statistical software detects it because X’X does not have an inverse.
We are concerned with less extreme cases.
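The extreme case can be checked by hand with a minimal sketch. It assumes a handful of invented scores (not the CS data) and uses exact rational arithmetic, so that a singular X’X shows up as a determinant of exactly zero. The helper names `gram` and `determinant` are hypothetical and exist only for this illustration.

```python
from fractions import Fraction

def gram(columns):
    """Return X'X for a design matrix given as a list of columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
            for ci in columns]

def determinant(matrix):
    """Exact determinant by Gaussian elimination over fractions."""
    m = [[Fraction(v) for v in row] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for k in range(n):
        # Find a row at or below k with a nonzero entry in column k.
        pivot = next((r for r in range(k, n) if m[r][k] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot available: the matrix is singular
        if pivot != k:
            m[k], m[pivot] = m[pivot], m[k]
            det = -det  # a row swap flips the sign
        det *= m[k][k]
        for r in range(k + 1, n):
            factor = m[r][k] / m[k][k]
            m[r] = [a - factor * b for a, b in zip(m[r], m[k])]
    return det

# Hypothetical scores, invented for illustration only.
satm = [520, 610, 480, 700, 650]
satv = [500, 580, 530, 640, 600]
sat = [m + v for m, v in zip(satm, satv)]
intercept = [1] * len(satm)

# With sat, satm and satv all present, the columns are linearly dependent.
det_full = determinant(gram([intercept, sat, satm, satv]))
# Dropping sat removes the exact dependency.
det_reduced = determinant(gram([intercept, satm, satv]))

print(det_full)          # exactly 0, so X'X has no inverse
print(det_reduced != 0)  # True: the reduced model is full rank
```

In the less extreme cases the determinant is not zero but is small relative to the scale of the data, which is where the numerical difficulty comes from.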
Effects of multicollinearity
Regression coefficients are not well estimated and may be meaningless.
The same holds for the standard errors of these estimates.
Type I SS and Type II SS will differ.
R^2 and predicted values are usually OK.
Two separate problems
Numerical accuracy
  X’X is difficult to invert.
  Need good software.
Statistical problem
  Results are difficult to interpret.
  Need a better model.
Polynomial regression
We can fit linear, quadratic, cubic, etc. models by defining squares, cubes, etc. in a data step and using these as predictors in a multiple regression.
We can do this with more than one explanatory variable.
When we do this we generally create a multicollinearity problem.
Polynomial Regression (2)
We can remove the correlation between explanatory variables and their squares.
Center (subtract the mean) before squaring.
NKNW rescale by standardizing (subtract the mean and divide by the standard deviation).
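A brief sketch of why centering helps, writing x_i = X_i - X̄ for the centered predictor (this notation is introduced here for illustration). The numerator of the sample correlation between x and x^2 is

```latex
\sum_{i=1}^{n} \left(x_i - \bar{x}\right)\left(x_i^2 - \overline{x^2}\right)
  = \sum_{i=1}^{n} x_i^3 \;-\; \overline{x^2} \sum_{i=1}^{n} x_i
  = \sum_{i=1}^{n} x_i^3 ,
\qquad \text{since } \bar{x} = 0 \text{ and } \sum_{i=1}^{n} x_i = 0 \text{ after centering.}
```

The sum of cubes is zero whenever the X values are symmetric about their mean, for example with equally spaced levels used equally often, so the correlation vanishes in that case. For asymmetric data, centering typically reduces the correlation without removing it completely.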
Interaction Models
With several explanatory variables, we need to consider the possibility that the effect of one variable depends on the value of another variable.
Special cases
  One independent variable, second order
  One independent variable, third order
  Two independent variables, second order
One Independent Variable – Second Order
The regression model:
  Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \varepsilon_i
The mean response is a parabola and is frequently called a quadratic response function.
β0 represents the mean response of Y when x = 0; β1 is often called the linear effect coefficient and β11 the quadratic effect coefficient.
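One Independent Variable – Third Order
The regression model:
  Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \beta_{111} x_i^3 + \varepsilon_i
The mean response is
  E\{Y\} = \beta_0 + \beta_1 x + \beta_{11} x^2 + \beta_{111} x^3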
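Two Independent Variables – Second Order
The regression model:
  Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{11} x_{i1}^2 + \beta_{22} x_{i2}^2 + \beta_{12} x_{i1} x_{i2} + \varepsilon_i
The mean response is the equation of a conic section.
The coefficient β12 is often called the interaction effect coefficient.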
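To see what the interaction effect coefficient does, here is a minimal sketch of the two-variable second-order mean response. The coefficient values are hypothetical, chosen only so that the arithmetic is easy to follow, and the function names are illustrative.

```python
def mean_response(x1, x2, b0, b1, b2, b11, b22, b12):
    """Second-order mean response in two predictors."""
    return (b0 + b1 * x1 + b2 * x2
            + b11 * x1 ** 2 + b22 * x2 ** 2
            + b12 * x1 * x2)

# Hypothetical coefficients, not estimates from any data set.
coef = dict(b0=10.0, b1=2.0, b2=-1.0, b11=0.5, b22=0.25, b12=3.0)

def x1_effect(x2):
    """Change in the mean response when x1 moves from 0 to 1 at a fixed x2."""
    return mean_response(1, x2, **coef) - mean_response(0, x2, **coef)

effect_low = x1_effect(-1)   # equals b1 + b11 - b12
effect_high = x1_effect(1)   # equals b1 + b11 + b12
print(effect_low, effect_high)
```

With a nonzero b12, the same change in x1 produces a different change in the mean response depending on x2. Setting b12 to 0 makes the two effects equal.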
NKNW Example p 330
Response variable is the life (in cycles) of a power cell.
Explanatory variables are
  Charge rate (3 levels)
  Temperature (3 levels)
This is a designed experiment.
Check the data
Obs   cycles   chrate   temp
Create the new variables and run the regression
Create new variables:
  chrate2=chrate*chrate;
  temp2=temp*temp;
  ct=chrate*temp;
Then regress cycles on chrate, temp, chrate2, temp2, and ct.
Output
a. Regression Coefficients
Var       b   s(b)   t   Pr>|t|
int                      <.0002
Chrate                   <0.01
Temp                     <0.005
Chrate2
Temp2
ct
Output (2)
b. ANOVA Table
Source                        df   SS   MS
Regression
  X1
  X2 | X1
  X1^2 | X1, X2
  X2^2 | X1, X2, X1^2
  X1X2 | X1, X2, X1^2, X2^2
Error
Total
Conclusion
We have a multicollinearity problem.
Let’s look at the correlations (use proc corr).
There are some very high correlations:
  r(chrate,chrate2) =
  r(temp,temp2) =
A remedy
We can remove the correlation between explanatory variables and their squares.
Center (subtract the mean) before squaring.
NKNW rescale by standardizing (subtract the mean and divide by the standard deviation).
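The effect of centering can be sketched with a small numerical check. It assumes a hypothetical factor with three equally spaced levels, each used three times. These are not the actual power cell settings, and the helper names are illustrative.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical three-level factor, each level used three times.
levels = [1, 2, 3]
x_raw = [v for v in levels for _ in range(3)]

# Square the raw values, then square after centering.
raw_square = [v * v for v in x_raw]
x_centered = [v - mean(x_raw) for v in x_raw]
centered_square = [v * v for v in x_centered]

r_raw = correlation(x_raw, raw_square)
r_centered = correlation(x_centered, centered_square)
print(round(r_raw, 3))        # close to 1
print(abs(r_centered) < 1e-9)  # True: essentially uncorrelated
```

Because the levels here are balanced and symmetric about their mean, centering removes the correlation entirely. With unbalanced or skewed data the correlation would usually shrink but not vanish.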
Last slide
Read NKNW 7.6 to 7.7 and the problems on pp
We used programs cs4.sas and NKNW302.sas to generate the output for today.
Last slide
Read NKNW 8.5 and Chapter 9.