Multiple Regression
From last time
There were questions about the bowed shape of the confidence limits (CLs) around the regression line. Both the limits around the mean and the limits around the individuals should be curved, because the regression line itself is an estimate. In theory, if you knew the population values, you would not need to bow the CLs around the individuals; you would just shift the density of y along the regression line.
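CLs
You do not have the same precision across the entire range of x. The CLs around the mean are based on the overall estimated variance in y, the sample size, and the distance from the mean of x. The CLs for a new person have the distance from the mean of x in the formula as well, plus the variance of an individual around the line.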
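The slide's formulas are not reproduced here, so the following is only a sketch using the standard textbook standard errors for simple regression: s * sqrt(1/n + (x0 - xbar)^2 / Sxx) for the mean and s * sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx) for a new individual. The data are simulated and all variable names are made up for illustration.

```python
import numpy as np

# Simulated data (an assumption for illustration, not from the lecture)
rng = np.random.default_rng(0)
n = 30
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)

# Fit the simple regression by least squares
xbar = x.mean()
sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * xbar
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))  # residual standard deviation


def se_mean(x0):
    """Standard error of the estimated mean of y at x0."""
    return s * np.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)


def se_new(x0):
    """Standard error for predicting one new individual at x0."""
    return s * np.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)


# Both standard errors grow as x0 moves away from the mean of x
for x0 in (xbar, xbar + 2, xbar + 4):
    print(f"x0 = {x0:5.2f}  SE(mean) = {se_mean(x0):.3f}  SE(new) = {se_new(x0):.3f}")
```

The only difference between the two is the extra 1 under the square root, which is the variance of an individual around the line. The bowing comes from the other two terms, which would vanish if the population line were known.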
Multiple Regression
You have seen that you can include polynomials in a regression model. What about including entirely different predictors? Say you want to predict the blood pressure of daft scientists before they give talks at an international conference. You could predict with many single predictors or with the predictors as a set:
– age
– the size of the audience
– number of “hawt” potential evil lackeys in the front row
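As a rough sketch of fitting the three predictors as a set, the example below simulates data for the scientists and solves the least-squares problem with NumPy. The effect sizes, sample size, and variable names are invented for illustration only.

```python
import numpy as np

# Simulated data; every number here is a made-up assumption
rng = np.random.default_rng(1)
n = 200
age = rng.normal(45, 10, n)
audience = rng.normal(300, 100, n)
lackeys = rng.poisson(3, n).astype(float)
bp = 100 + 0.4 * age + 0.05 * audience + 2.0 * lackeys + rng.normal(0, 5, n)

# Design matrix: a column of ones for the intercept plus one column per predictor
X = np.column_stack([np.ones(n), age, audience, lackeys])
coef, *_ = np.linalg.lstsq(X, bp, rcond=None)

for name, b in zip(["intercept", "age", "audience", "lackeys"], coef):
    print(f"{name:10s} {b:8.3f}")
```

Each slope is the expected change in blood pressure for a one-unit change in that predictor with the other predictors held fixed.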
Explaining variance
[Figure: the total variance in blood pressure, with regions labeled Age, Audience size, and Lackey Quality]
Explained Multivariate Variance
The total variance explained depends on how correlated the predictors are. You want a global R² to indicate the amount of variance explained by the model, and also measures of the contributions of the individual predictors.
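A small simulation can make the dependence on predictor correlation concrete. In this sketch (simulated, standardized data; all names hypothetical) the R² values of the single-predictor models roughly add up to the global R² when the predictors are uncorrelated, but overshoot it badly when they are correlated, because the shared variance is counted twice.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of a least-squares fit of y on X plus an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


# Simulated predictors; the 0.9 correlation is an assumption for illustration
rng = np.random.default_rng(2)
n = 500
audience = rng.normal(size=n)
noise = rng.normal(size=n)
cases = {
    "correlated": 0.9 * audience + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=n),
    "uncorrelated": rng.normal(size=n),
}

results = {}
for label, lackeys in cases.items():
    bp = audience + lackeys + noise
    r2_a = r_squared(audience, bp)
    r2_l = r_squared(lackeys, bp)
    r2_both = r_squared(np.column_stack([audience, lackeys]), bp)
    results[label] = (r2_a, r2_l, r2_both)
    print(f"{label:12s} audience {r2_a:.2f}  lackeys {r2_l:.2f}  both {r2_both:.2f}")
```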
Multicollinearity
Even though audience size and lackey quality each allow you to predict the blood pressure of the mad scientists, very little of the variance is uniquely associated with the lackey variable. When you have highly correlated predictors, they can’t uniquely explain variance. You can end up with a model that is statistically significant with a big R², but in which none of the individual predictors is statistically significant.
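The sketch below illustrates the mechanism with simulated data: when a second predictor is nearly a copy of the first, the overall fit stays strong, but the standard errors of the individual coefficients blow up, so the individual t statistics shrink. The degree of collinearity and the variable names are assumptions chosen for illustration.

```python
import numpy as np


def ols(X, y):
    """Coefficients, standard errors, and R^2 for y on X plus an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    n_obs, n_par = X1.shape
    xtx_inv = np.linalg.inv(X1.T @ X1)
    beta = xtx_inv @ X1.T @ y
    resid = y - X1 @ beta
    s2 = (resid @ resid) / (n_obs - n_par)  # residual variance
    se = np.sqrt(s2 * np.diag(xtx_inv))
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, se, r2


rng = np.random.default_rng(3)
n = 50
audience = rng.normal(size=n)
lackeys_collinear = audience + 0.05 * rng.normal(size=n)  # nearly a copy
lackeys_indep = rng.normal(size=n)                         # unrelated
noise = rng.normal(size=n)

beta_c, se_c, r2_c = ols(np.column_stack([audience, lackeys_collinear]),
                         audience + lackeys_collinear + noise)
beta_i, se_i, r2_i = ols(np.column_stack([audience, lackeys_indep]),
                         audience + lackeys_indep + noise)

# Overall F statistic for the collinear model (2 predictors)
f_c = (r2_c / 2) / ((1 - r2_c) / (n - 3))

print(f"collinear:   R^2 = {r2_c:.2f}, F = {f_c:.1f}, t = {beta_c[1:] / se_c[1:]}")
print(f"independent: R^2 = {r2_i:.2f}, t = {beta_i[1:] / se_i[1:]}")
print(f"SE of lackey slope: collinear {se_c[2]:.2f} vs independent {se_i[2]:.2f}")
```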
Stop it before it starts
Before you do a multiple regression, use subject matter knowledge to remove highly correlated predictors. Look at the bivariate correlation coefficients between the predictors. If you have highly correlated variables, use subject matter knowledge and pragmatism to decide which variable to put in the model.
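One way to screen the predictors is to compute their correlation matrix and list the pairs that look too similar. This is a sketch on simulated data; the 0.8 cutoff is an arbitrary assumption, not a rule from the lecture, and the decision about which variable to drop still belongs to subject matter knowledge.

```python
import numpy as np

# Simulated predictors; lackey count is built to track audience size
rng = np.random.default_rng(4)
n = 200
age = rng.normal(45, 10, n)
audience = rng.normal(300, 100, n)
lackeys = 0.03 * audience + rng.normal(0, 1, n)

names = ["age", "audience", "lackeys"]
predictors = np.column_stack([age, audience, lackeys])
corr = np.corrcoef(predictors, rowvar=False)  # columns are variables

THRESHOLD = 0.8  # hypothetical cutoff, chosen only for this example
flagged = [
    (names[i], names[j], corr[i, j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > THRESHOLD
]

print(np.round(corr, 2))
for a, b, r in flagged:
    print(f"{a} and {b} are highly correlated (r = {r:.2f})")
```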
Partitioning Variance
What is the unique contribution of each variable? There are different formulas for adding up the sums of squares (SS) in the variance.
– Sequential (aka Type 1 SS) lets the first variable explain as much variance as it can, then adds in the second to see if it can explain any of the remaining variance.
– Simultaneous (aka Type 2 SS) puts all the variables in at the same time, so each variable gets credit only for the variance that the other variables cannot explain.
– Simultaneous with interactions (aka Type 3 SS) also puts everything in at the same time, but adjusts each variable for the interactions it is used in as well.
– Type 4 SS makes my head hurt.
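The order dependence of sequential SS can be sketched with two correlated predictors and no interaction. Below, each variable's SS is the drop in residual SS when it enters the model; the data and names are simulated assumptions. The SS a variable gets when entered last is its unique contribution, which is what the simultaneous approach credits it with in a model without interactions.

```python
import numpy as np


def rss(cols, y):
    """Residual sum of squares for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid


def sequential_ss(order, data, y):
    """SS credited to each variable when entered in the given order."""
    out, used = {}, []
    prev = rss([], y)  # intercept-only model
    for name in order:
        used.append(data[name])
        cur = rss(used, y)
        out[name] = prev - cur
        prev = cur
    return out


# Simulated, correlated predictors (correlation about 0.8 by construction)
rng = np.random.default_rng(5)
n = 200
audience = rng.normal(size=n)
lackeys = 0.8 * audience + 0.6 * rng.normal(size=n)
bp = audience + lackeys + rng.normal(size=n)
data = {"audience": audience, "lackeys": lackeys}

ss_audience_first = sequential_ss(["audience", "lackeys"], data, bp)
ss_lackeys_first = sequential_ss(["lackeys", "audience"], data, bp)
print("audience first:", ss_audience_first)
print("lackeys first: ", ss_lackeys_first)
```

The two orders add up to the same model SS, but the split between the variables changes: whichever variable goes first is credited with the shared variance.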
SAS vs. R
S-Plus and SAS use the same formulas for SS, but R does not use the same formula for Type 3 SS (and I have never tested it on Type 4 SS).
Partial and Semipartial Correlations
If you want to look at the correlation between a predictor (a) and an outcome (z) while controlling for the impact of a second predictor (b), you can do a partial correlation, which removes the impact of b from both a and z. If you want to remove the impact of b from a only, you can do a semipartial correlation.
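Both quantities can be sketched by working with residuals: regress b out of a variable, then correlate what is left. The example uses simulated data with the slide's letters a, b, and z; the helper names are hypothetical. It also checks the residual approach against the usual closed-form expressions in terms of the three bivariate correlations.

```python
import numpy as np


def residualize(v, b):
    """What is left of v after a simple regression of v on b."""
    slope = np.cov(v, b)[0, 1] / np.var(b, ddof=1)
    return (v - v.mean()) - slope * (b - b.mean())


def corr(u, v):
    return np.corrcoef(u, v)[0, 1]


# Simulated data: a and b are correlated predictors, z is the outcome
rng = np.random.default_rng(8)
n = 300
b = rng.normal(size=n)
a = 0.6 * b + rng.normal(size=n)
z = a + b + rng.normal(size=n)

# Partial: b removed from both a and z
partial = corr(residualize(a, b), residualize(z, b))
# Semipartial: b removed from a only
semipartial = corr(residualize(a, b), z)

# Closed-form versions built from the bivariate correlations
r_az, r_ab, r_bz = corr(a, z), corr(a, b), corr(b, z)
partial_formula = (r_az - r_ab * r_bz) / np.sqrt((1 - r_ab ** 2) * (1 - r_bz ** 2))
semipartial_formula = (r_az - r_ab * r_bz) / np.sqrt(1 - r_ab ** 2)

print(f"partial     = {partial:.3f} (formula {partial_formula:.3f})")
print(f"semipartial = {semipartial:.3f} (formula {semipartial_formula:.3f})")
```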
Hierarchical Stepwise Regression
If you take 2nd and 3rd quarter statistics you will learn the details of how to compare two models. Hierarchical stepwise regression is the process of deciding in advance which variables matter and adding them to a model, to see if the quality of the model improves as you add them. People frequently look at the R² for the models and/or use AIC. This is a fine thing to do so long as you keep track of the comparisons you make and report them.
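A minimal sketch of such a comparison, assuming simulated data and a pre-specified order (age first, then audience size). The AIC here uses one common form for Gaussian linear models, n * ln(RSS / n) + 2k with k the number of coefficients; it omits an additive constant, so only differences between models fitted to the same data are meaningful.

```python
import numpy as np


def fit_stats(cols, y):
    """R^2 and AIC (up to an additive constant) for y on intercept + cols."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = resid @ resid
    n_obs, k = X.shape
    r2 = 1 - rss / np.sum((y - y.mean()) ** 2)
    aic = n_obs * np.log(rss / n_obs) + 2 * k
    return r2, aic


# Simulated data; effect sizes are assumptions for illustration
rng = np.random.default_rng(6)
n = 200
age = rng.normal(size=n)
audience = rng.normal(size=n)
bp = 0.5 * age + 1.0 * audience + rng.normal(size=n)

# The steps are decided in advance, and every comparison is kept for reporting
steps = [("age", [age]), ("age + audience", [age, audience])]
results = {label: fit_stats(cols, bp) for label, cols in steps}

for label, (r2, aic) in results.items():
    print(f"{label:16s} R^2 = {r2:.3f}  AIC = {aic:.1f}")
```

R² can never go down when a predictor is added, so on its own it always favors the bigger model. AIC adds a penalty of 2 per extra coefficient, and lower values are better.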
Automatic Stepwise Regression
These are BAD BAD BAD. You feed the software a set of variables and tell it to put them into the model one at a time to find the predictor that explains the most outcome variance. Once that predictor is in the model, the software adds each of the remaining ones, one at a time, to see which second variable reduces the residual variance the most. Repeat over and over. Some of the methods subtract variables instead of adding them (others do both). The Type 1 error rate is astronomical. These methods have horrible properties. Adding in completely random variables affects the model.
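The inflation of the Type 1 error can be sketched with a simulation of just the first step of a forward selection. Every candidate predictor and the outcome are pure noise, yet the procedure picks the best-looking predictor and tests it as if it had been chosen in advance. The sample size, the number of candidates, and the number of repetitions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_candidates, n_sims = 100, 20, 500
T_CRIT = 1.984  # approximate two-sided 5% critical value for t with 98 df

hits = 0
for _ in range(n_sims):
    # Outcome and all candidate predictors are unrelated random noise
    X = rng.normal(size=(n, n_candidates))
    y = rng.normal(size=n)

    # Correlation of y with each candidate, then the matching t statistic
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / np.sqrt(np.sum(Xc ** 2, axis=0) * np.sum(yc ** 2))
    t = r * np.sqrt((n - 2) / (1 - r ** 2))

    # Forward selection keeps the strongest candidate; is it "significant"?
    if np.max(np.abs(t)) > T_CRIT:
        hits += 1

rate = hits / n_sims
print(f"Noise-only runs with a 'significant' first pick: {rate:.0%}")
```

With 20 independent candidates and a nominal 5% test, the chance that at least one clears the bar is about 1 - 0.95^20, roughly 64%, and that is before any further steps are taken.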