BUILDING THE REGRESSION MODEL
Data preparation
Variable reduction
Model selection
Model validation
Procedures for variable reduction
List the independent variables that could conceivably be related to the dependent variable under study. Some of the independent variables can be screened out. An independent variable:
1. May not be fundamental to the problem
2. May be subject to large measurement errors, and/or
3. May effectively duplicate another independent variable in the list
Independent variables that cannot be measured may either be deleted or replaced by proxy variables that are highly correlated with them.
The number of cases to be collected depends on the size of the pool of independent variables (usually 6 to 10 cases for every variable in the pool). After the data have been collected, edit checks and plots should be performed to identify gross data errors as well as extreme outliers. The formal modeling process can then begin; a variety of diagnostics should be employed to identify important independent variables, the functional forms in which the independent variables should enter the regression model, and important relationships.
[Flowchart: Collect data; Preliminary checks on data quality; Diagnostics for relationships and strong interactions; Are remedial measures needed? (Yes / No); Remedial measures]
When selecting a few “good” subsets of X variables, the pool of candidates should include not only the potential independent variables in first-order form but also any needed quadratic and other curvature terms and any necessary interaction terms. Several reasons for reducing the number of independent variables:
1. A regression model with a large number of independent variables is difficult to maintain
2. Regression models with a limited number of independent variables are easier to work with and understand
3. The presence of many highly intercorrelated independent variables may add little to the predictive power of the model while substantially increasing the sampling variation of the regression coefficients, detracting from the model’s descriptive abilities, and increasing the problem of roundoff errors
After successfully reducing the number of independent variables, select a small number of potential “good” regression models, each of which contains those independent variables that are known to be essential. More detailed checks of curvature and interaction effects are desirable. Diagnostics, including residual analysis, are needed in order to identify influential outlying observations, multicollinearity, etc.
The final step in the model-building process is to validate the selected regression model. Model validity refers to the stability and reasonableness of the regression coefficients, the plausibility and usability of the regression function, and the ability to generalize inferences drawn from the regression analysis.
Three basic ways of validating a regression model are:
1. Collection of new data to check the model and its predictive ability.
2. Comparison of results with theoretical expectations, earlier empirical results, and simulation results.
3. Use of a hold-out sample to check the model and its predictive ability.
The purpose of collecting new data is to examine whether the regression model developed from the earlier data is still applicable to the new data. Some methods of examining the validity of the regression model against the new data:
1. Re-estimate the model form chosen earlier using the new data, then compare the estimated coefficients and various characteristics of the fitted values with those of the regression model based on the earlier data.
2. Re-estimate from the new data all of the “good” models that had been considered, to see whether the selected regression model is still the preferred model according to the new data.
3. Calibrate the predictive capability of the selected regression model on the new data.
Data splitting
Collecting new data, the preferred validation method, is often neither practical nor feasible. A reasonable alternative when the data set is large enough is to split the data into two sets: the model-building set and the validation (or prediction) set. This type of validation is often called cross-validation.
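A minimal sketch of such a split, assuming a purely random assignment of cases and an even split. The function name `split_cases`, the 50% fraction, the seed, and the 108-case size are illustrative assumptions, not values from the slides.

```python
import numpy as np

def split_cases(n_cases, validation_fraction=0.5, seed=0):
    """Randomly split case indices into model-building and validation sets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_cases)
    n_validation = int(round(n_cases * validation_fraction))
    validation_idx = np.sort(shuffled[:n_validation])
    building_idx = np.sort(shuffled[n_validation:])
    return building_idx, validation_idx

# Hypothetical data set of 108 cases, split evenly (assumed size).
building_idx, validation_idx = split_cases(108)
print(len(building_idx), len(validation_idx))
```

The model is then fit on the cases in `building_idx` only, and the cases in `validation_idx` are held back for checking its predictions.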
One means of measuring the actual predictive capability of the selected regression model is to use this model to predict each case in the new data set and then to calculate the mean of the squared prediction errors, denoted MSPR (mean squared prediction error):
\[ MSPR = \frac{\sum_{i=1}^{n^*} (Y_i - \hat{Y}_i)^2}{n^*} \]
where
Y_i: the value of the response in the ith validation case
Ŷ_i: the predicted value for the ith validation case based on the model-building data set
n*: the number of cases in the validation data set
If the MSPR is fairly close to the MSE based on the regression fit to the model-building data set, then the MSE for the selected regression model is not seriously biased and gives an appropriate indication of the predictive ability of the model.
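The comparison of MSPR with MSE can be sketched as follows. The data are synthetic (an assumption made only so the example runs), and the helper names `fit_ols`, `predict`, `mse`, and `mspr` are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y on X, with an intercept column added."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def predict(beta, X):
    """Fitted or predicted values for the rows of X."""
    return np.column_stack([np.ones(len(X)), X]) @ beta

def mse(beta, X, y):
    """SSE / (n - p), computed on the data used to fit the model."""
    resid = y - predict(beta, X)
    return resid @ resid / (len(y) - len(beta))

def mspr(beta, X_val, y_val):
    """Mean squared prediction error over the n* validation cases."""
    errors = y_val - predict(beta, X_val)
    return errors @ errors / len(y_val)

# Synthetic data: first 100 cases build the model, last 100 validate it.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_build, y_build, X_val, y_val = X[:100], y[:100], X[100:], y[100:]

beta = fit_ols(X_build, y_build)
mse_build = mse(beta, X_build, y_build)
mspr_val = mspr(beta, X_val, y_val)
print(mse_build, mspr_val)
```

An MSPR much larger than the model-building MSE would suggest that the MSE overstates the model's predictive ability.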
Some procedures for variable reduction are:
1. Forward procedure
2. Backward procedure
Some criteria for comparing the regression models: R^2_p, MSE_p, C_p, and PRESS_p. Here P is the number of potential parameters and p is the number of parameters in a given subset model, with 1 ≤ p ≤ P. The all-possible-regressions approach assumes that the number of observations n exceeds the maximum number of potential parameters, n > P.
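The slides do not spell out the entry rule for the forward procedure. The sketch below assumes one common variant: at each step, add the candidate that reduces SSE the most, and stop when its partial F statistic falls below a threshold. The threshold `f_to_enter=4.0`, the function names, and the synthetic data are all assumptions.

```python
import numpy as np

def sse(X, y, columns):
    """Error sum of squares for the model with an intercept plus `columns`."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return resid @ resid

def forward_select(X, y, f_to_enter=4.0):
    """Greedy forward procedure with an assumed F-to-enter stopping rule."""
    n, n_candidates = X.shape
    selected = []
    current_sse = sse(X, y, selected)
    while len(selected) < n_candidates:
        remaining = [j for j in range(n_candidates) if j not in selected]
        # Candidate whose addition gives the smallest SSE.
        best = min(remaining, key=lambda j: sse(X, y, selected + [j]))
        new_sse = sse(X, y, selected + [best])
        p = len(selected) + 2  # intercept + already selected + the candidate
        f_stat = (current_sse - new_sse) / (new_sse / (n - p))
        if f_stat < f_to_enter:
            break
        selected.append(best)
        current_sse = new_sse
    return selected

# Synthetic data: only columns 0 and 2 actually drive the response.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))
y = 1.0 + 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(scale=0.5, size=60)
selected = forward_select(X, y)
print(selected)
```

A backward procedure would run the same test in reverse, starting from all candidates and dropping the variable with the smallest partial F statistic.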
The R^2_p criterion examines the coefficient of multiple determination:
\[ R^2_p = \frac{SSM_p}{SST} = 1 - \frac{SSE_p}{SST} \]
Since SST is constant for all possible regression models, R^2_p varies inversely with SSE_p, and R^2_p is at its maximum when all P-1 potential X variables are included in the regression model. The aim is therefore not to maximize R^2_p but to find the point where adding more X variables increases it only slightly.
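A small illustration of why R^2_p cannot decrease as variables are added, using synthetic data and a hypothetical helper `r2_p`. The third column is pure noise by construction.

```python
import numpy as np

def r2_p(X, y, columns):
    """R^2_p = 1 - SSE_p / SST for an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    sse_p = np.sum((y - design @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse_p / sst

# Synthetic data: column 2 has no effect on the response.
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = 2.0 + 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=40)

nested = [[], [0], [0, 1], [0, 1, 2]]
r2_values = [r2_p(X, y, cols) for cols in nested]
for cols, value in zip(nested, r2_values):
    print(cols, round(value, 4))
```

Each nested model has an R^2_p at least as large as the one before it, even when the added column is noise.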
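The MSE_p criterion is equivalent to the adjusted coefficient of multiple determination R^2_a, which takes the number of parameters in the model into account through the degrees of freedom:
\[ R^2_a = 1 - \frac{n-1}{n-p}\cdot\frac{SSE_p}{SST} = 1 - \frac{MSE_p}{SST/(n-1)} \]
R^2_a increases if and only if MSE_p decreases, so seek min(MSE_p).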
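The equivalence of the two rankings can be checked numerically. This sketch enumerates every subset of three synthetic candidate variables; the helper `subset_stats` and the data are assumptions.

```python
import itertools
import numpy as np

def subset_stats(X, y, columns):
    """Return (MSE_p, adjusted R^2) for an intercept plus the given columns."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in columns])
    p = design.shape[1]
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    sse_p = np.sum((y - design @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    mse_p = sse_p / (n - p)
    r2_a = 1.0 - mse_p / (sst / (n - 1))
    return mse_p, r2_a

# Synthetic data with three candidate X variables.
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=50)

subsets = [c for k in range(4) for c in itertools.combinations(range(3), k)]
stats = [subset_stats(X, y, c) for c in subsets]
best_by_mse = subsets[int(np.argmin([s[0] for s in stats]))]
best_by_r2a = subsets[int(np.argmax([s[1] for s in stats]))]
print(best_by_mse, best_by_r2a)
```

Both criteria pick the same subset, because R^2_a is a decreasing linear function of MSE_p when SST and n are fixed.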
The C_p criterion is concerned with the total mean squared error of the n fitted values for each subset regression model. The model that includes all P-1 potential X variables is assumed to have been carefully chosen so that MSE(X_1, ..., X_{P-1}) is an unbiased estimator of σ², and SSE_p is the error sum of squares for the fitted subset regression model with p parameters. C_p is defined as follows:
\[ C_p = \frac{SSE_p}{MSE(X_1, \ldots, X_{P-1})} - (n - 2p) \]
The C_p criterion seeks subsets for which C_p is small and its value is near p.
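A sketch of the C_p computation over all subsets, again on synthetic data. The function names `sse` and `mallows_cp` are hypothetical.

```python
import itertools
import numpy as np

def sse(X, y, columns):
    """Error sum of squares for an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return resid @ resid

def mallows_cp(X, y, columns):
    """C_p = SSE_p / MSE(full model) - (n - 2p)."""
    n, n_candidates = X.shape
    full = tuple(range(n_candidates))
    # MSE of the model with all P - 1 candidate variables (P parameters).
    mse_full = sse(X, y, full) / (n - (n_candidates + 1))
    p = len(columns) + 1
    return sse(X, y, columns) / mse_full - (n - 2 * p)

# Synthetic data: column 2 is irrelevant to the response.
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))
y = 2.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=50)

subsets = [c for k in range(4) for c in itertools.combinations(range(3), k)]
cp_table = {c: mallows_cp(X, y, c) for c in subsets}
for columns, cp in cp_table.items():
    print(columns, "p =", len(columns) + 1, "C_p =", round(cp, 2))
```

For the full model, C_p equals P exactly by construction, so the criterion is only informative for the proper subsets.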
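The PRESS_p (prediction sum of squares) selection criterion is based on the deleted residuals d_i = Y_i - Ŷ_{i(i)}, where Ŷ_{i(i)} is the predicted value for case i from the model fitted without case i:
\[ PRESS_p = \sum_{i=1}^{n} d_i^2 \]
Models with small PRESS_p values are considered good candidate models.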
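PRESS does not require n separate refits, because each deleted residual can be obtained from the ordinary residual e_i and the leverage h_ii as d_i = e_i / (1 - h_ii). The sketch below checks this shortcut against explicit leave-one-out refitting on synthetic data; the function names are hypothetical.

```python
import numpy as np

def press_from_leverage(design, y):
    """PRESS from a single fit, using d_i = e_i / (1 - h_ii)."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    q, _ = np.linalg.qr(design)         # orthonormal basis of the column space
    leverage = np.sum(q ** 2, axis=1)   # diagonal of the hat matrix
    deleted = residuals / (1.0 - leverage)
    return np.sum(deleted ** 2)

def press_by_refitting(design, y):
    """PRESS by refitting the model n times, leaving out one case each time."""
    total = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        beta, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        total += (y[i] - design[i] @ beta) ** 2
    return total

# Synthetic data with an intercept and two X variables.
rng = np.random.default_rng(6)
X = rng.normal(size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=30)
design = np.column_stack([np.ones(30), X])

press_fast = press_from_leverage(design, y)
press_slow = press_by_refitting(design, y)
print(press_fast, press_slow)
```

The two values agree up to floating-point error, so the leverage form is the practical choice when many subset models are compared.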
Predicting survival in patients undergoing a particular type of liver operation:
X_1: blood clotting score
X_2: prognostic index, which includes the age of the patient
X_3: enzyme function test score
X_4: liver function test score
Y: survival time
n = 54 patients