
1. Building the Regression Model I: Selection and Validation (KNN Ch. 9, pp. 343-375)

2. The Model Building Process
- Collect and prepare the data.
- Reduce the set of explanatory variables (for exploratory/observational studies).
- Refine the model and select the best model.
- Validate the model; if it passes the checks, adopt it.
- All four steps above have several intermediate stages; these are outlined in Fig. 9.1, page 344 of KNN.

3. The Model Building Process
- Data collection:
  - Controlled experiments (levels, treatments).
  - Controlled experiments with supplemental variables (uncontrollable variables are incorporated into the regression model rather than controlled in the experiment).
  - Confirmatory observational studies (hypothesis testing; primary variables and risk factors).
  - Exploratory observational studies (measurement errors/problems, duplication of variables, spurious variables, and sample size are but some of the issues here).

4. The Model Building Process
- Data preparation:
  - What are the standard techniques here? It is an easy guess: a rough-cut approach is to look at various plots and identify obvious problems such as outliers, spurious variables, etc.
- Preliminary model investigation (a plotting sketch follows this slide):
  - Scatter plots and residual plots (for what?).
  - Functional forms and transformations (of the entire data set, of some explanatory variables, or of the predicted variable?).
  - Interactions ... and intuition.
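As a concrete illustration of the rough-cut plotting described above (not taken from KNN), here is a minimal sketch; the pandas DataFrame df and the column names y, x1, x2 are illustrative assumptions.

# Preliminary plotting sketch: scatter plots of Y against candidate predictors,
# plus residuals vs. fitted values from a rough first-pass model.
# The DataFrame `df` and columns "y", "x1", "x2" are assumed, not from the source.
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

def preliminary_plots(df: pd.DataFrame) -> None:
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, col in zip(axes, ["x1", "x2"]):
        ax.scatter(df[col], df["y"], s=15)
        ax.set_xlabel(col)
        ax.set_ylabel("y")

    # Residuals vs. fitted values help reveal outliers, curvature,
    # and non-constant variance before any serious model building.
    X = sm.add_constant(df[["x1", "x2"]])
    fit = sm.OLS(df["y"], X).fit()
    plt.figure(figsize=(5, 4))
    plt.scatter(fit.fittedvalues, fit.resid, s=15)
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("fitted values")
    plt.ylabel("residuals")
    plt.show()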

5. The Model Building Process
- Reduction of explanatory variables:
  - Generally an issue for controlled experiments with supplemental variables and for exploratory observational studies.
  - It is not difficult to guess that this is more serious for exploratory observational studies.
  - Identifying good subsets of the explanatory variables, their functional forms, and any interactions is perhaps the most difficult problem in multiple regression analysis.
  - One needs to be careful of specification bias and latent explanatory variables.

6. The Model Building Process
- Model refinement and selection:
  - Diagnostics for candidate models.
  - Lack-of-fit tests if repeat observations are available.
  - The "best" model's number of variables should be used as a benchmark for investigating other models with a similar number of variables.
- Model validation:
  - Robustness and usability of the regression coefficients.
  - Usability of the regression function. Does it all make sense?

7. All Possible Regressions: Variable Reduction
- Usually many explanatory variables (P - 1 of them) are present at the outset.
- Select the best subset of these variables.
- "Best" means the smallest subset of variables that provides an adequate prediction of Y.
- Multicollinearity is usually a problem when all variables are in the model.
- Variable selection may be based on the coefficient of determination R_p^2 or on the SSE_p statistic (equivalent procedures).

8. All Possible Regressions: Variable Reduction
- R_p^2 is highest (and SSE_p is lowest) when all the variables are in the model.
- One intends to find the point at which adding more variables causes only a very small increase in R_p^2, or only a very small decrease in SSE_p.
- For a given value of p, we compute the maximum of R_p^2 (or the minimum of SSE_p) and then compare these several maxima (minima) across values of p.
- See the Surgical Unit example on page 350 of KNN.
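The all-possible-regressions search is easy to sketch in code. The following is a rough illustration rather than KNN's exact procedure; the pandas DataFrame df, the response column name, and the candidate column names are assumptions, and statsmodels supplies the fits.

# Enumerate every subset of candidate predictors and record SSE_p, R_p^2,
# and R_a^2 for each, so the "small gain from adding variables" point is visible.
from itertools import combinations
import pandas as pd
import statsmodels.api as sm

def all_subsets(df: pd.DataFrame, response: str, candidates: list[str]) -> pd.DataFrame:
    rows = []
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            fit = sm.OLS(df[response], sm.add_constant(df[list(subset)])).fit()
            rows.append({
                "p": k + 1,            # number of parameters, including the intercept
                "variables": subset,
                "SSE_p": fit.ssr,      # residual (error) sum of squares
                "R2_p": fit.rsquared,
                "R2_adj": fit.rsquared_adj,
            })
    return pd.DataFrame(rows).sort_values(["p", "SSE_p"])

# Example call: all_subsets(df, "y", ["x1", "x2", "x3"])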

9. A Simple Example

Regression Analysis (X1, X2, X3)
The regression equation is Y = 0.236 + 9.09 X1 - 0.330 X2 - 0.203 X3

Predictor   Coef       StDev     T       P
Constant    0.2361     0.2545    0.93    0.355
X1          9.090      1.718     5.29    0.000
X2          -0.3303    0.2229    -1.48   0.141
X3          -0.20286   0.05894   -3.44   0.001

S = 1.802   R-Sq = 95.7%   R-Sq(adj) = 95.6%

Regression Analysis (X1, X3)
The regression equation is Y = 0.408 + 6.55 X1 - 0.173 X3

Predictor   Coef       StDev     T       P
Constant    0.4078     0.2276    1.79    0.075
X1          6.5506     0.1201    54.54   0.000
X3          -0.17253   0.05551   -3.11   0.002

S = 1.810   R-Sq = 95.6%   R-Sq(adj) = 95.5%

Regression Analysis (X1 only)
The regression equation is Y = 0.014 + 6.50 X1

Predictor   Coef       StDev     T       P
Constant    0.0144     0.1949    0.07    0.941
X1          6.4957     0.1225    53.05   0.000

S = 1.866   R-Sq = 95.3%   R-Sq(adj) = 95.3%

10. All Possible Regressions: Variable Reduction
- R_p^2 does not take into account the number of parameters p and never decreases as p increases.
- This is a mathematical property, but it may not make sense practically: useless explanatory variables can actually worsen the predictive power of the model. How?
- The adjusted coefficient of multiple determination, R_a^2, always accounts for the increase in p.
- The R_a^2 and MSE_p criteria are equivalent.
- When can MSE_p actually increase with p?
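For reference, the standard definitions behind this equivalence can be written out (SSTO is the total sum of squares and n the sample size):

R_{a,p}^{2} \;=\; 1 - \frac{SSE_p/(n-p)}{SSTO/(n-1)} \;=\; 1 - \left(\frac{n-1}{SSTO}\right) MSE_p,
\qquad MSE_p \;=\; \frac{SSE_p}{n-p}.

Since n and SSTO are fixed, maximizing R_a^2 is the same as minimizing MSE_p; and MSE_p can increase with p whenever the drop in SSE_p is too small to offset the degree of freedom lost in the denominator, which answers the last question above.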

11. A Simple Example

Regression Analysis (X1, X2, X3)
The regression equation is Y = 21.7 + 12.8 X1 - 0.88 X2 - 5.93 X3

Predictor   Coef      StDev    T       P
Constant    21.69     14.77    1.47    0.381
X1          12.763    9.225    1.38    0.398
X2          -0.877    1.099    -0.80   0.571
X3          -5.927    2.033    -2.92   0.210

S = 2.878   R-Sq = 99.3%   R-Sq(adj) = 97.1%

Regression Analysis (X1, X3)
The regression equation is Y = 27.8 + 5.45 X1 - 6.37 X3

Predictor   Coef      StDev    T       P
Constant    27.76     11.45    2.43    0.136
X1          5.4534    0.9666   5.64    0.030
X3          -6.370    1.769    -3.60   0.069

S = 2.603   R-Sq = 98.8%   R-Sq(adj) = 97.7%

Regression Analysis (X1 only)
The regression equation is Y = -10.4 + 8.05 X1

Predictor   Coef       StDev   T       P
Constant    -10.363    9.738   -1.06   0.365
X1          8.049      1.439   5.59    0.011

S = 5.816   R-Sq = 91.2%   R-Sq(adj) = 88.3%

Interesting: here R-Sq(adj) is highest for the two-variable model (97.7%), not for the full model (97.1%).

12. All Possible Regressions: Variable Reduction
- The C_p criterion is concerned with the total mean squared error of the n fitted values.
- The total error of any fitted value is the sum of a bias component and a random error component: \hat{Y}_i - \mu_i is the total error, where \mu_i is the "true" mean response of Y when X = X_i.
- The bias is E(\hat{Y}_i) - \mu_i and the random error is \hat{Y}_i - E(\hat{Y}_i).
- The total mean squared error over the n fitted values is then shown to be
  \sum_{i=1}^{n} \left[ \left( E(\hat{Y}_i) - \mu_i \right)^2 + \sigma^2(\hat{Y}_i) \right].
- Dividing this by the variance of the actual Y values, \sigma^2, gives the criterion \Gamma_p.
- We work with C_p, the estimator of \Gamma_p.

13. All Possible Regressions: Variable Reduction
- The estimator is C_p = \frac{SSE_p}{MSE(X_1, \ldots, X_{P-1})} - (n - 2p), where the MSE in the denominator comes from the full model containing all P - 1 candidate variables.
- Choose a model with a small C_p.
- C_p should also be as close as possible to p. When all variables are included, C_p = p (= P) by construction.
- If the model has very little bias, then E(C_p) \approx p.
- Plot the (p, C_p) points together with the 45-degree line through the origin: for models with little bias the points fall almost on the line; for models with substantial bias the points fall well above the line; and points falling below the line indicate models with no bias, just some random sampling error.
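As a quick illustration of how C_p is computed in practice, here is a hedged sketch reusing the assumed DataFrame and column names from the earlier subset-search sketch; the MSE of the full model with all candidate variables serves as the estimate of sigma squared.

# C_p sketch: compare each subset's SSE against the MSE of the full model,
# C_p = SSE_p / MSE(full) - (n - 2p). Inputs are illustrative assumptions.
from itertools import combinations
import pandas as pd
import statsmodels.api as sm

def cp_table(df: pd.DataFrame, response: str, candidates: list[str]) -> pd.DataFrame:
    n = len(df)
    full_fit = sm.OLS(df[response], sm.add_constant(df[candidates])).fit()
    mse_full = full_fit.ssr / (n - len(candidates) - 1)   # MSE of the full model

    rows = []
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            fit = sm.OLS(df[response], sm.add_constant(df[list(subset)])).fit()
            p = k + 1                                      # parameters incl. intercept
            cp = fit.ssr / mse_full - (n - 2 * p)
            rows.append({"p": p, "variables": subset, "C_p": cp})
    return pd.DataFrame(rows).sort_values("C_p")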

14. All Possible Regressions: Variable Reduction
- The PRESS_p criterion: PRESS_p = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_{i(i)} \right)^2, where \hat{Y}_{i(i)} is the predicted value of Y_i when the i-th observation is left out of the data set used to fit the model.
- Choose models with small values of PRESS_p.
- It may seem that one would have to run n separate regressions to calculate PRESS_p. Not so, as we will see later.
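The shortcut alluded to above is the deleted-residual identity, under which each leave-one-out prediction error equals e_i / (1 - h_ii), so a single fit is enough. A minimal sketch, with the same assumed DataFrame and column names as before:

# PRESS_p via the deleted-residual shortcut e_i / (1 - h_ii): no refitting needed.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def press(df: pd.DataFrame, response: str, predictors: list[str]) -> float:
    fit = sm.OLS(df[response], sm.add_constant(df[predictors])).fit()
    h = fit.get_influence().hat_matrix_diag    # leverages h_ii
    deleted_resid = fit.resid / (1.0 - h)      # leave-one-out prediction errors
    return float(np.sum(deleted_resid ** 2))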

15. Best Subsets
- Best subsets algorithm:
  - A limited number of "best" subsets is identified according to pre-specified criteria.
  - Requires much less computational effort than evaluating all possible subsets.
  - Provides several "good" subsets along with the best one, which is quite useful.
  - When the pool of X variables is large, the algorithm can run out of steam. What then? We will see in the ensuing discussion.

16. A Simple Example

Best Subsets Regression (note: s is the square root of MSE_p)

Response variable is Y (candidates X1, X2, X3)
Vars   R-Sq   Adj. R-Sq   C-p    s        In model
1      95.3   95.3        11.9   1.8656   X
1      94.7   94.7        30.8   1.9801   X
2      95.6   95.5         4.2   1.8101   X X
2      95.3   95.2        13.8   1.8718   X X
3      95.7   95.6         4.0   1.8023   X X X

Response variable is Y (candidates X1, X2, X3, X4)
Vars   R-Sq   Adj. R-Sq   C-p    s        In model
1      95.3   95.3        13.4   1.8656   X
1      94.7   94.7        32.4   1.9801   X
2      95.6   95.5         5.6   1.8101   X X
2      95.5   95.4         9.8   1.8374   X X
3      95.7   95.6         3.9   1.7927   X X X
3      95.7   95.6         5.3   1.8023   X X X
4      95.7   95.6         5.0   1.7936   X X X X

17. Forward Stepwise Regression
- An iterative procedure.
- Based on the partial F* (or t*) statistic, one decides whether or not to add a variable.
- One variable at a time is considered.
- Before we see the actual algorithm, here are the levers:
  - Minimum acceptable F to enter (F_E)
  - Minimum acceptable F to remove (F_R)
  - Minimum acceptable tolerance (T_min)
  - Maximum number of iterations (N)
- The general form of the test statistic for a candidate variable X_k, given the variables already in the model, is
  F_k^* = \frac{MSR(X_k \mid \text{variables already in the model})}{MSE(X_k, \text{variables already in the model})}.

18. Forward Stepwise Regression
The procedure (a code sketch follows this slide):
1. Run a simple linear regression of Y on each candidate variable separately.
2. If none of the individual F values is larger than the cut-off value F_E, stop. Otherwise, enter the variable with the largest F.
3. Now regress Y on each remaining variable, given that the variable entered in step 2 is already in the model.
4. Repeat step 2. If a candidate is found, check its tolerance. If the tolerance (1 - R_k^2) is not larger than the cut-off value T_min, choose a different candidate; if none is available, terminate. Otherwise, add the candidate variable.
5. Calculate the partial F for the variable entered in step 2, given that the variable entered in step 4 is already in the model. If this F is less than F_R, remove the variable entered in step 2; otherwise keep it. If the number of iterations equals N, terminate; if not, proceed to step 6.
6. From the results of step 1, determine the next candidate variable to enter. If the number of iterations is exceeded, terminate.
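To make the procedure concrete, here is a simplified forward-selection sketch driven by the partial F statistic. It implements only the entry step with threshold F_E; the removal (F_R) and tolerance checks above are omitted for brevity, and the DataFrame, column names, and default threshold are illustrative assumptions.

# Simplified forward selection: at each pass, add the candidate with the largest
# partial F, as long as it exceeds the entry threshold f_enter (F_E).
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(df: pd.DataFrame, response: str,
                     candidates: list[str], f_enter: float = 4.0) -> list[str]:
    selected: list[str] = []
    remaining = list(candidates)
    y = df[response]

    while remaining:
        # SSE of the current (reduced) model; with nothing selected this is SSTO.
        exog_reduced = (sm.add_constant(df[selected]) if selected
                        else pd.Series(1.0, index=df.index, name="const"))
        sse_reduced = sm.OLS(y, exog_reduced).fit().ssr

        best_var, best_f = None, 0.0
        for var in remaining:
            fit = sm.OLS(y, sm.add_constant(df[selected + [var]])).fit()
            # Partial F for adding `var`: drop in SSE divided by the new model's MSE.
            f_stat = (sse_reduced - fit.ssr) / (fit.ssr / fit.df_resid)
            if f_stat > best_f:
                best_var, best_f = var, f_stat

        if best_var is None or best_f < f_enter:
            break                      # no candidate clears the F_E threshold
        selected.append(best_var)
        remaining.remove(best_var)
    return selected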

19. Other Stepwise Regression Procedures
- Backward stepwise regression:
  - The exact opposite of the forward procedure.
  - Sometimes preferred to forward stepwise.
  - Think about how this procedure would work, and why (or under which conditions) you would use it instead of forward stepwise.
- Forward selection:
  - Similar to forward stepwise, except that the variable-dropping step is not present.
- Backward elimination:
  - Similar to backward stepwise, except that the variable-adding step is not present.

20. An Example
Let us go through the example (Fig. 9.7) on page 366 of KNN.

21. Some Other Selection Criteria
- Akaike Information Criterion (AIC):
  - Imposes a penalty for adding regressors.
  - AIC = e^{2p/n} \cdot SSE_p / n, where e^{2p/n} is the penalty factor.
  - A harsher penalty than R_a^2 (how?).
  - The model with the lowest AIC is preferred.
  - AIC is used to measure both in-sample and out-of-sample forecasting performance.
  - Useful for nested and non-nested models and for determining lag length in autoregressive models (Ch. 12).

22. Some Other Selection Criteria
- Schwarz Information Criterion (SIC):
  - SIC = n^{p/n} \cdot SSE_p / n.
  - Similar in spirit to AIC.
  - Imposes a stricter penalty than AIC.
  - Has similar advantages to AIC.
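Both criteria are easy to compute in the multiplicative forms quoted on these two slides. Note that these are the forms given here, not the log-likelihood versions most software reports, so the numbers will differ from a package's built-in AIC/BIC; the DataFrame and column names are the same illustrative assumptions as before.

# AIC = e^(2p/n) * SSE_p / n and SIC = n^(p/n) * SSE_p / n,
# as defined on the slides. Smaller is better for both.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def slide_aic_sic(df: pd.DataFrame, response: str, predictors: list[str]) -> tuple[float, float]:
    n = len(df)
    fit = sm.OLS(df[response], sm.add_constant(df[predictors])).fit()
    p = int(fit.df_model) + 1          # parameters, including the intercept
    aic = np.exp(2 * p / n) * fit.ssr / n
    sic = n ** (p / n) * fit.ssr / n
    return aic, sic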

23. Model Validation
- Model validation checks the prediction ability of the model.
- Methods for model validation:
  1. Collection of new data: select a new sample of n* observations on the same variables and compute the mean squared prediction error,
     MSPR = \frac{\sum_{i=1}^{n^*} \left( Y_i - \hat{Y}_i \right)^2}{n^*},
     where \hat{Y}_i is the prediction for the new case from the model fitted to the original data.
  2. Comparison of results with theoretical expectations.
  3. Data splitting into two data sets: one for model building and one for validation.
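A minimal data-splitting sketch of method 3, computing MSPR on the hold-out half; the DataFrame, the column names, and the 50/50 split are illustrative assumptions.

# Data-splitting validation: fit on a model-building half, then compute the
# mean squared prediction error (MSPR) on the hold-out half.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def mspr_by_splitting(df: pd.DataFrame, response: str,
                      predictors: list[str], seed: int = 0) -> float:
    build = df.sample(frac=0.5, random_state=seed)
    validate = df.drop(build.index)

    fit = sm.OLS(build[response], sm.add_constant(build[predictors])).fit()
    preds = fit.predict(sm.add_constant(validate[predictors]))

    # MSPR = sum of squared prediction errors over the n* hold-out cases, divided by n*
    return float(np.mean((validate[response] - preds) ** 2))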

