Chapter 9 Variable Selection and Model Building
Ray-Bing Chen
Institute of Statistics, National University of Kaohsiung
9.1 Introduction
The Model-Building Problem
Ensure that the functional form of the model is correct and that the underlying assumptions are not violated.
Variable selection problem: choose an appropriate subset from a pool of candidate regressors.
Two conflicting objectives:
–Include as many regressors as possible: the information content in these factors can influence the predicted value of y.
–Include as few regressors as possible: the variance of the prediction increases as the number of regressors increases.
What is the “best” regression equation?
Several algorithms can be used for variable selection, but these procedures frequently specify different subsets of the candidate regressors as best.
An idealized setting:
–The correct functional forms of the regressors are known.
–There are no outliers or influential observations.
Residual analysis
Iterative approach:
1. Apply a variable selection strategy.
2. Check for correct functional forms, outliers, and influential observations.
None of the variable selection procedures is guaranteed to produce the best regression equation for a given data set.
Consequences of Model Misspecification
The full model
The subset model
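The two models compared here can be written out as follows. This is a sketch in the usual textbook notation, which is an assumption rather than something stated on the slide: K candidate regressors plus an intercept, p retained terms (including the intercept), and r deleted terms, so that K + 1 = p + r.

```latex
% Full model: all K candidate regressors plus the intercept,
% with the columns of X partitioned into retained (X_p) and deleted (X_r) terms
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}
           = \mathbf{X}_p\boldsymbol{\beta}_p
           + \mathbf{X}_r\boldsymbol{\beta}_r
           + \boldsymbol{\varepsilon}

% Subset model: the r terms in X_r are deleted
\mathbf{y} = \mathbf{X}_p\boldsymbol{\beta}_p + \boldsymbol{\varepsilon}
```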
Motivation for variable selection:
–Deleting variables from the model can improve the precision of the parameter estimates of the retained variables. This is also true for the variance of the predicted response.
–Deleting variables from the model introduces bias into the estimates.
–However, if the deleted variables have small effects, the MSE of the biased estimates will be less than the variance of the unbiased estimates.
Criteria for Evaluating Subset Regression Models
Coefficient of Multiple Determination:
–Aitkin (1974): $R^2$-adequate subset: a subset of regressor variables that produces $R^2 > R_0^2$
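For reference, the coefficient of multiple determination for a subset model is commonly written as below. The notation is assumed rather than taken from the slides: p counts the terms in the subset model (p − 1 regressors plus the intercept), $SS_R(p)$ and $SS_{Res}(p)$ are its regression and residual sums of squares, and $SS_T$ is the total sum of squares.

```latex
R_p^2 = \frac{SS_R(p)}{SS_T} = 1 - \frac{SS_{Res}(p)}{SS_T}
```

Since $R_p^2$ cannot decrease when a regressor is added, the full model always attains the largest value, so the criterion is used with a threshold rather than by simple maximization.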
Uses of Regression and Model Evaluation Criteria
–Data description: minimize $SS_{Res}$ while using as few regressors as possible.
–Prediction and estimation: minimize the mean square error of prediction; use the PRESS statistic.
–Parameter estimation: see Chapter 10.
–Control: minimize the standard errors of the regression coefficients.
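The PRESS statistic mentioned above is the sum of squared leave-one-out prediction errors, and it can be computed from a single fit through the leverages $h_{ii}$. The following is a minimal sketch, assuming NumPy is available; the function name `press_statistic` and the synthetic data are hypothetical and used for illustration only.

```python
import numpy as np

def press_statistic(X, y):
    """PRESS = sum of squared leave-one-out prediction errors.

    Uses the identity e_(i) = e_i / (1 - h_ii), so the model is fitted once.
    X holds the regressors without an intercept column; one is added here.
    """
    X1 = np.column_stack([np.ones(len(y)), X])
    H = X1 @ np.linalg.pinv(X1)       # hat matrix X (X'X)^{-1} X'
    residuals = y - H @ y             # ordinary residuals e_i
    leverage = np.diag(H)             # diagonal elements h_ii
    return float(np.sum((residuals / (1.0 - leverage)) ** 2))

# Synthetic data, for illustration only (not from the slides)
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((20, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=20)
press_value = press_statistic(X_demo, y_demo)
print("PRESS:", press_value)
```

Comparing `press_statistic` across candidate subsets favors the subset with the smallest value when prediction is the goal.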
Computational Techniques for Variable Selection
All Possible Regressions
Fit all possible regression equations, and then select the best one according to some suitable criterion.
Assume that every model includes the intercept term.
If there are K candidate regressors, there are $2^K$ total equations to be estimated and examined.
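The enumeration described above can be sketched as follows. This is a minimal illustration, assuming NumPy and synthetic data; the helper names `ss_res` and `all_possible_regressions` are hypothetical, and $R^2$ is used only as one example criterion.

```python
from itertools import combinations
import numpy as np

def ss_res(X, y, cols):
    """Residual sum of squares for the model with an intercept plus columns `cols`."""
    X1 = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

def all_possible_regressions(X, y):
    """Fit all 2^K subset models and return (subset, R^2) pairs."""
    K = X.shape[1]
    ss_total = float(np.sum((y - y.mean()) ** 2))
    results = []
    for size in range(K + 1):                 # subsets of size 0, 1, ..., K
        for cols in combinations(range(K), size):
            r2 = 1.0 - ss_res(X, y, cols) / ss_total
            results.append((cols, r2))
    return results

# Synthetic data, for illustration only: y depends on columns 0 and 2
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((30, 4))
y_demo = 2.0 + 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=30)

fits = all_possible_regressions(X_demo, y_demo)
print(len(fits), "models fitted")
for cols, r2 in sorted(fits, key=lambda t: -t[1])[:3]:
    print(cols, round(r2, 4))
```

The number of fits doubles with every extra candidate regressor, which is why this approach is practical only for moderate K.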
Example 9.1 The Hald Cement Data
$R_p^2$ criterion:
Stepwise Regression Methods
Three broad categories:
1. Forward selection
2. Backward elimination
3. Stepwise regression
Backward elimination
–Start with a model that contains all K candidate regressors.
–Compute the partial F-statistic for each regressor, and drop the regressor with the smallest partial F-statistic if that statistic is less than $F_{OUT}$.
–Refit the reduced model and repeat; stop when every partial F-statistic exceeds $F_{OUT}$.
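The backward elimination steps above can be sketched as follows. This is a minimal illustration under stated assumptions: NumPy is available, the data are synthetic, the helper names are hypothetical, and the cutoff `f_out=4.0` is an arbitrary illustrative value rather than one taken from the slides.

```python
import numpy as np

def ss_res(X, y, cols):
    """Residual sum of squares for the model with an intercept plus columns `cols`."""
    X1 = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

def partial_f(X, y, cols, j):
    """Partial F-statistic for regressor j, given the other regressors in `cols`."""
    full = ss_res(X, y, cols)
    reduced = ss_res(X, y, [c for c in cols if c != j])
    df_resid = len(y) - len(cols) - 1          # n minus number of fitted terms
    return (reduced - full) / (full / df_resid)

def backward_elimination(X, y, f_out=4.0):
    """Drop the weakest regressor until every partial F is at least f_out."""
    cols = list(range(X.shape[1]))             # start with all K regressors
    while cols:
        f_stats = {j: partial_f(X, y, cols, j) for j in cols}
        worst = min(f_stats, key=f_stats.get)
        if f_stats[worst] >= f_out:
            break                              # nothing left to drop
        cols.remove(worst)
    return cols

# Synthetic data, for illustration only: y depends on columns 0 and 2
rng = np.random.default_rng(2)
X_demo = rng.standard_normal((40, 4))
y_demo = 2.0 + 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=40)

kept = backward_elimination(X_demo, y_demo)
print("retained regressors:", kept)
```

Only one regressor is removed per pass, because the partial F-statistics of the remaining regressors change once the model is refitted.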
Stepwise Regression
A modification of forward selection.
A regressor added at an earlier step may become redundant once other regressors have entered; such a variable should be dropped from the model.
Two cutoff values: $F_{OUT}$ and $F_{IN}$
Usually choose $F_{IN} > F_{OUT}$: this makes it more difficult to add a regressor than to delete one.
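The add-then-recheck logic of stepwise regression can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes NumPy and synthetic data, the helper names are hypothetical, the cutoffs `f_in=4.0` and `f_out=3.9` are arbitrary illustrative values, and at most one regressor is dropped after each addition.

```python
import numpy as np

def ss_res(X, y, cols):
    """Residual sum of squares for the model with an intercept plus columns `cols`."""
    X1 = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

def partial_f(X, y, cols, j):
    """Partial F-statistic for regressor j, given the other regressors in `cols`."""
    full = ss_res(X, y, cols)
    reduced = ss_res(X, y, [c for c in cols if c != j])
    df_resid = len(y) - len(cols) - 1
    return (reduced - full) / (full / df_resid)

def stepwise(X, y, f_in=4.0, f_out=3.9):
    """Forward selection with a backward check after every addition."""
    K = X.shape[1]
    cols = []
    for _ in range(10 * K):                    # guard against endless cycling
        candidates = [j for j in range(K) if j not in cols]
        if not candidates:
            break
        # Forward step: partial F of each candidate if it were added now
        f_add = {j: partial_f(X, y, cols + [j], j) for j in candidates}
        best = max(f_add, key=f_add.get)
        if f_add[best] <= f_in:
            break                              # no candidate is worth adding
        cols.append(best)
        # Backward step: reassess every regressor currently in the model
        f_drop = {j: partial_f(X, y, cols, j) for j in cols}
        worst = min(f_drop, key=f_drop.get)
        if f_drop[worst] < f_out:
            cols.remove(worst)
    return sorted(cols)

# Synthetic data, for illustration only: y depends on columns 0 and 2
rng = np.random.default_rng(3)
X_demo = rng.standard_normal((40, 4))
y_demo = 2.0 + 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=40)

selected = stepwise(X_demo, y_demo)
print("selected regressors:", selected)
```

With `f_in` larger than `f_out`, a regressor that has just entered cannot be removed in the same step, which helps keep the procedure from cycling.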