Model selection and model building
Model selection
Selection of predictor variables
Statement of the problem
A common problem: there is a large set of candidate predictor variables. The goal is to choose a small subset from the larger set so that the resulting regression model is simple, yet has good predictive ability.
Example: Cement data
Response y: heat evolved, in calories per gram, during hardening of cement
Predictor x1: % of tricalcium aluminate
Predictor x2: % of tricalcium silicate
Predictor x3: % of tetracalcium alumino ferrite
Predictor x4: % of dicalcium silicate
Two basic methods of selecting predictors
Stepwise regression: enter and remove variables, in a stepwise manner, until there is no justifiable reason to enter or remove any more.
Best subsets regression: select the subset of variables that does the best job of meeting some well-defined objective criterion.
Stepwise regression: the idea
Start with no predictors in the model. At each step, enter or remove a variable based on partial F-tests. Stop when no more variables can be justifiably entered or removed.
Stepwise regression: the steps
Specify an Alpha-to-Enter (0.15) and an Alpha-to-Remove (0.15). Start with no predictors in the model. Enter the predictor with the smallest P-value based on the partial F-statistic (equivalently, a t-statistic). If that P-value > 0.15, stop: none of the predictors has good predictive ability. Otherwise …
Stepwise regression: the steps (cont'd)
Add the predictor with the smallest P-value (below 0.15) based on the partial F-statistic. If none of the remaining predictors yields a P-value < 0.15, stop. If the P-value for any predictor already in the model exceeds 0.15, remove that predictor. Continue these two steps until no more predictors can be entered or removed. (A minimal Python sketch of this loop follows.)
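The procedure can be sketched as a short loop. This is a minimal illustration using pandas and statsmodels, not the algorithm Minitab itself runs; it assumes the candidate predictors are columns of a pandas DataFrame X and that y is the response. The t-test P-values reported by OLS are equivalent to the partial F-test P-values described above.

    import pandas as pd
    import statsmodels.api as sm

    def stepwise_select(X, y, alpha_enter=0.15, alpha_remove=0.15):
        included = []
        while True:
            changed = False
            # Forward step: try each excluded predictor and keep the one
            # with the smallest P-value, if it is below Alpha-to-Enter.
            excluded = [c for c in X.columns if c not in included]
            pvals = pd.Series(dtype=float)
            for col in excluded:
                fit = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
                pvals[col] = fit.pvalues[col]
            if not pvals.empty and pvals.min() < alpha_enter:
                included.append(pvals.idxmin())
                changed = True
            # Backward step: refit and drop the worst predictor if its
            # P-value now exceeds Alpha-to-Remove.
            if included:
                fit = sm.OLS(y, sm.add_constant(X[included])).fit()
                in_model = fit.pvalues.drop('const')
                if in_model.max() > alpha_remove:
                    included.remove(in_model.idxmax())
                    changed = True
            if not changed:
                return included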
Example: Cement data
[Minitab output: Stepwise Regression: y versus x1, x2, x3, x4. Alpha-to-Enter: 0.15, Alpha-to-Remove: 0.15. Response is y on 4 predictors, with N = 13. The step-by-step coefficient tables and the summary of S, R-Sq, R-Sq(adj), and C-p are not shown.]
Drawbacks of stepwise regression
The final model is not guaranteed to be optimal in any specified sense. The procedure yields a single final model, although in practice there are often several almost equally good models.
Best subsets regression
If there are p−1 possible predictors, then there are 2^(p−1) possible regression models containing some subset of those predictors. For example, 10 predictors yields 2^10 = 1024 possible regression models. A best subsets algorithm determines the best subsets of each size, so that the choice of the final model can be made by the researcher (see the sketch below).
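As a sketch of the brute-force idea (hypothetical function name; pandas/statsmodels again, assuming X holds the candidate predictors and y the response), the following fits every non-empty subset and records the best subset of each size by adjusted R-squared:

    from itertools import combinations
    import statsmodels.api as sm

    def best_subsets(X, y):
        best = {}  # size -> (adjusted R-squared, predictor names)
        for k in range(1, X.shape[1] + 1):
            for subset in combinations(X.columns, k):
                fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
                if k not in best or fit.rsquared_adj > best[k][0]:
                    best[k] = (fit.rsquared_adj, subset)
        return best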
What is used to judge “best”?
R-squared
Adjusted R-squared
MSE (or S = square root of MSE)
Mallows' C_p
R-squared
Use the R-squared values to find the point where adding more predictors is not worthwhile, because doing so yields only a very small increase in R-squared. Note that R-squared can never decrease as predictors are added, so the largest R-squared alone cannot identify the best subset.
Adjusted R-squared or MSE
Adjusted R-squared increases only if MSE decreases, so adjusted R-squared and MSE provide equivalent information (the identity below makes this explicit). Find a few subsets for which MSE is smallest (or adjusted R-squared is largest), or so close to the smallest (largest) that adding more predictors is not worthwhile.
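The equivalence follows from writing adjusted R-squared in terms of MSE, with SSTO the total sum of squares, n the sample size, and p the number of parameters:

R^2_{adj} = 1 - \frac{SSE/(n-p)}{SSTO/(n-1)} = 1 - \frac{MSE}{SSTO/(n-1)}

Because SSTO/(n−1) is fixed for a given data set no matter which predictors are used, adjusted R-squared increases exactly when MSE decreases.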
Mallows' C_p criterion
Mallows' C_p statistic,

C_p = \frac{SSE_p}{MSE_{full}} - (n - 2p),

is an estimator of the total standardized mean square error of prediction,

\Gamma_p = \frac{1}{\sigma^2} \sum_{i=1}^{n} E\left[\left(\hat{y}_i - E(y_i)\right)^2\right],

which equals a (squared) bias term plus a variance term:

\Gamma_p = \frac{1}{\sigma^2} \left[ \sum_{i=1}^{n} \left(E(\hat{y}_i) - E(y_i)\right)^2 + \sum_{i=1}^{n} Var(\hat{y}_i) \right].
Using the C_p criterion
Subsets with small C_p values have a small total (standardized) mean square error of prediction. When the C_p value is also near p, the bias of the regression model is small.
Using the C_p criterion (cont'd)
So, identify subsets of predictors for which:
–the C_p value is smallest, and
–the C_p value is near p (if possible).
Note, though, that for the full model, C_p = p by construction. So the full model is always judged a good candidate by this criterion, and its C_p value carries no information. (A sketch of the computation follows.)
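For concreteness, here is a minimal sketch of the C_p computation in Python with statsmodels (hypothetical helper name; assumes X is a pandas DataFrame of all candidate predictors and y is the response), using the full-model MSE to estimate sigma-squared:

    import statsmodels.api as sm

    def mallows_cp(X, y, subset):
        n = len(y)
        full = sm.OLS(y, sm.add_constant(X)).fit()
        sub = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
        p = len(subset) + 1  # parameters, including the intercept
        # C_p = SSE_p / MSE_full - (n - 2p); for the full model this
        # reduces to p, as noted above.
        return sub.ssr / full.mse_resid - (n - 2 * p)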
[Minitab output: Best Subsets Regression: y versus x1, x2, x3, x4. For each number of predictors (Vars), the table lists R-Sq, R-Sq(adj), C-p, S, and which of x1–x4 are included; the numeric values are not shown.]
Example: Modeling PIQ
[Minitab output: Stepwise Regression: PIQ versus MRI, Height, Weight. Alpha-to-Enter: 0.15, Alpha-to-Remove: 0.15. Response is PIQ on 3 predictors, with N = 38. MRI enters at step 1 and Height at step 2; the coefficient and summary values (S, R-Sq, R-Sq(adj), C-p) are not shown.]
[Minitab output: Best Subsets Regression: PIQ versus MRI, Height, Weight, listing R-Sq, R-Sq(adj), C-p, and S for each subset; the numeric values are not shown.]
[Minitab output: the final regression of PIQ on MRI and Height, with R-Sq = 29.5% and R-Sq(adj) = 25.5%. The coefficient table, S, the analysis of variance table, and the sequential sums of squares for MRI and Height are not shown.]
Example: Modeling BP
[Minitab output: Stepwise Regression: BP versus Age, Weight, BSA, Duration, Pulse, Stress. Alpha-to-Enter: 0.15, Alpha-to-Remove: 0.15. Response is BP on 6 predictors, with N = 20. Weight, Age, and BSA enter the model (BSA coefficient 4.6, T = 3.04); the remaining values are not shown.]
[Minitab output: Best Subsets Regression: BP versus Age, Weight, BSA, Duration, Pulse, Stress, listing R-Sq, R-Sq(adj), C-p, and S for each subset; the numeric values are not shown.]
[Minitab output: the final regression of BP on Age, Weight, and BSA, with R-Sq = 99.5% and R-Sq(adj) = 99.4%. The coefficient table, S, the analysis of variance table, and the sequential sums of squares for Age, Weight, and BSA are not shown.]
Stepwise regression in Minitab
Stat >> Regression >> Stepwise …
Specify the response and all possible predictors. If desired, specify predictors that must be included in every model. Select OK. Results appear in the session window.
Best subsets regression in Minitab
Stat >> Regression >> Best subsets …
Specify the response and all possible predictors. If desired, specify predictors that must be included in every model. Select OK. Results appear in the session window.
Model building strategy
The first step
Decide on the type of model needed:
–Predictive: model used to predict the response variable from a chosen set of predictors.
–Theoretical: model based on a theoretical relationship between the response and the predictors.
–Control: model used to control a response variable by manipulating predictor variables.
The first step (cont'd)
Decide on the type of model needed:
–Inferential: model used to explore the strength of the relationships between the response and the predictors.
–Data summary: model used merely as a way to summarize a large set of data by a single equation.
The second step
Decide which predictor variables and which response variable to collect data on. Collect the data.
The third step
Explore the data:
–Check for outliers, gross data errors, and missing values on a univariate basis.
–Study bivariate relationships to reveal other outliers, to suggest possible transformations, and to identify possible multicollinearities.
The fourth step
Randomly divide the data into a training set and a test set (a minimal sketch of the split follows):
–The training set, which must retain enough error degrees of freedom to fit the model reliably, is used to fit the model.
–The test set is used for cross-validation of the fitted model.
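A minimal sketch of the random split, assuming the observations are rows of a pandas DataFrame named data; the 2/3 versus 1/3 split fraction here is illustrative, not prescribed:

    # Randomly assign 2/3 of the rows to the training set;
    # the remaining rows form the held-out test set.
    train = data.sample(frac=2/3, random_state=1)
    test = data.drop(train.index)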
The fifth step
Using the training set, fit several candidate models:
–Use best subsets regression.
–Use stepwise regression (which gives only a single model unless you specify different Alpha-to-Remove and Alpha-to-Enter values).
The sixth step
Select and evaluate a few “good” models:
–Select based on adjusted R-squared, Mallows' C_p, and the number and nature of the predictors.
–Evaluate the selected models for violations of the model assumptions.
–If none of the models provides a satisfactory fit, try something else, such as collecting more data, trying different predictors, or using a different class of model.
The final step
Select the final model:
–Compare the competing models by cross-validating them against the test data (a sketch of the cross-validation R-squared calculation follows).
–The model with the larger cross-validation R-squared is the better predictive model.
–Consider residual plots, outliers, parsimony, relevance, and ease of measurement of the predictors.
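A minimal sketch of the cross-validation R-squared (hypothetical function name; fit is a statsmodels OLS model fitted on the training set, and X_test and y_test are the test-set predictors and response):

    import numpy as np
    import statsmodels.api as sm

    def cv_rsquared(fit, X_test, y_test):
        # Predict the held-out responses from the training-set fit;
        # the test columns must match the training design.
        pred = fit.predict(sm.add_constant(X_test))
        sse = np.sum((y_test - pred) ** 2)
        ssto = np.sum((y_test - y_test.mean()) ** 2)
        return 1 - sse / ssto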