Building the Regression Model – I: Selection and Validation. KNN Ch. 9 (pp. 343–375)


1 Building the Regression Model – I: Selection and Validation. KNN Ch. 9 (pp. 343–375)

2 The Model Building Process
- Collect and prepare data
- Reduction of explanatory variables (for exploratory/observational studies)
- Refine the model and select the best model
- Validate the model – if it passes the checks, then adopt it
- All four of the above have several intermediate steps; these are outlined in Fig. 9.1, page 344 of KNN

3 The Model Building Process
- Data collection
  - Controlled experiments (levels, treatments)
  - Controlled experiments with supplemental variables (incorporate uncontrollable variables in the regression model rather than in the experiment)
  - Confirmatory observational studies (hypothesis testing; primary variables and risk factors)
  - Exploratory observational studies (measurement errors/problems, duplication of variables, spurious variables, and sample size are but some of the issues here)

4 The Model Building Process
- Data preparation
  - What are the standard techniques here? It's an easy guess: a rough-cut approach is to look at various plots and identify obvious problems such as outliers, spurious variables, etc.
- Preliminary model investigation
  - Scatter plots and residual plots (for what?)
  - Functional forms and transformations (of the entire data, of some explanatory variables, or of the predicted variable?)
  - Interactions and ... intuition

5 The Model Building Process
- Reduction of explanatory variables
  - Generally an issue for controlled experiments with supplemental variables and for exploratory observational studies
  - It is not difficult to guess that this is more serious for exploratory observational studies
  - Identifying good subsets of the explanatory variables, their functional forms, and any interactions is perhaps the most difficult problem in multiple regression analysis
  - Need to be careful of specification bias and latent explanatory variables

6 The Model Building Process
- Model refinement and selection
  - Diagnostics for candidate models
  - Lack-of-fit tests if repeat observations are available
  - The "best" model's number of variables should be used as a benchmark for investigating other models with a similar number of variables
- Model validation
  - Robustness and usability of the regression coefficients
  - Usability of the regression function: does it all make sense?

7 All Possible Regressions: Variable Reduction
- Usually many explanatory variables (p − 1) are present at the outset
- Select the best subset of these variables
- "Best": the smallest subset of variables that provides an adequate prediction of Y
- Multicollinearity is usually a problem when all variables are in the model
- Variable selection may be based on the coefficient of multiple determination R_p^2 or on the SSE_p statistic (equivalent procedures)

8 All Possible Regressions: Variable Reduction
- R_p^2 is largest and SSE_p is smallest when all the variables are in the model.
- One intends to find the point at which adding more variables causes only a very small increase in R_p^2 or a very small decrease in SSE_p.
- Given a value of p, we compute the maximum of R_p^2 (or the minimum of SSE_p) and then compare the several maxima (minima).
- See the Surgical Unit example on page 350 of KNN.
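A minimal sketch of this brute-force search, assuming NumPy; the design matrix X holds the p − 1 candidate predictors, and all names here are illustrative rather than from KNN:

```python
from itertools import combinations
import numpy as np

def all_subsets_r2_sse(X, y):
    """Fit every non-empty subset of columns of X; report SSE_p and R^2_p."""
    n = len(y)
    ssto = np.sum((y - y.mean()) ** 2)                    # total sum of squares
    results = []
    for k in range(1, X.shape[1] + 1):
        for subset in combinations(range(X.shape[1]), k):
            Xs = np.column_stack([np.ones(n), X[:, subset]])  # intercept + subset
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            sse = np.sum((y - Xs @ beta) ** 2)            # SSE_p
            results.append((subset, sse, 1 - sse / ssto)) # (subset, SSE_p, R^2_p)
    return results
```

For each p one would keep the subset with the largest R_p^2 (equivalently, the smallest SSE_p) and then compare those winners across values of p, as the slide describes.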

9 A Simple Example
Minitab output for three candidate models (coefficient tables omitted):

Model        R-Sq     R-Sq(adj)
X1, X2, X3   95.7%    95.6%
X1, X3       95.6%    95.5%
X1           95.3%    95.3%

10 All Possible Regressions: Variable Reduction
- R_p^2 does not take into account the number of parameters (p) and never decreases as p increases.
- This is a mathematical property, but it may not make sense practically.
- However, useless explanatory variables can actually worsen the predictive power of the model. How?
- The adjusted coefficient of multiple determination, R_a^2, always accounts for the increase in p.
- The R_a^2 and MSE_p criteria are equivalent, as the derivation below shows.
- When can MSE_p actually increase with p?
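The equivalence follows in one line from the definitions (SSTO is the total sum of squares, p the number of parameters):

```latex
R^2_{a,p} = 1 - \frac{n-1}{n-p}\cdot\frac{SSE_p}{SSTO}
          = 1 - \frac{MSE_p}{SSTO/(n-1)},
\qquad \text{where } MSE_p = \frac{SSE_p}{n-p}.
```

Since SSTO/(n − 1) does not depend on the candidate model, maximizing R_a^2 is the same as minimizing MSE_p. And because n − p shrinks with every added parameter, MSE_p can increase with p whenever the drop in SSE_p is proportionally smaller than the drop in n − p, which answers the slide's closing question.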

11 A Simple Example
Minitab output for three candidate models (coefficient tables omitted):

Model        R-Sq     R-Sq(adj)
X1, X2, X3   99.3%    97.1%
X1, X3       98.8%    97.7%
X1           91.2%    88.3%

Interesting: the two-variable model beats the three-variable model on adjusted R-Sq.

12 All Possible Regressions: Variable Reduction
- The C_p criterion is concerned with the total MSE of the n fitted values.
- The total error for any fitted value is a sum of a bias component and a random-error component.
- Ŷ_i − μ_i is the total error, where μ_i is the "true" mean response of Y when X = X_i.
- The bias is E(Ŷ_i) − μ_i and the random error is Ŷ_i − E(Ŷ_i).
- The total mean squared error is then shown to be Σ_i [(E(Ŷ_i) − μ_i)^2 + σ^2(Ŷ_i)].
- When the above is divided by the variance of the actual Y values, i.e., by σ^2, we get the criterion Γ_p.
- The estimator of Γ_p is what we shall use: C_p = SSE_p / MSE(X_1, …, X_{P−1}) − (n − 2p).
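A small sketch of that estimator; mse_full is assumed to be the MSE of the model containing all P − 1 candidate predictors, which estimates σ²:

```python
def mallows_cp(sse_p, mse_full, n, p):
    """Mallows' C_p = SSE_p / MSE(full model) - (n - 2p).

    p counts all parameters in the candidate model, intercept included;
    mse_full is MSE(X_1, ..., X_{P-1}) from the full model.
    """
    return sse_p / mse_full - (n - 2 * p)
```

For the full model itself, SSE_P / MSE_full = n − P, so C_P = (n − P) − (n − 2P) = P exactly, matching the next slide.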

13 All Possible Regressions: Variable Reduction
- Choose a model with a small C_p.
- C_p should be as close as possible to p. When all variables are included, then obviously C_p = p (= P).
- If the model has very little bias, then E(C_p) ≈ p.
- When we draw a line through the origin at 45° and plot the (p, C_p) points: for models with little bias, the points fall almost on the line; for models with substantial bias, the points fall well above the line; and if the points fall below the line, such models have no bias, just some random sampling error.

14 All Possible Regressions: Variable Reduction
- The PRESS_p criterion: PRESS_p = Σ_{i=1}^{n} (Y_i − Ŷ_{i(i)})^2.
- Ŷ_{i(i)} is the predicted value of the i-th response when the i-th observation is not in the dataset used for the fit.
- Choose models with small values of PRESS_p.
- It may seem that one would have to run n separate regressions to calculate PRESS_p. Not so, as we will see below.
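The reason no refits are needed is the standard leave-one-out identity for the deleted residual, e_{i(i)} = e_i / (1 − h_ii), where h_ii is the i-th diagonal element of the hat matrix. A minimal NumPy sketch (names illustrative; X is assumed to already contain the intercept column):

```python
import numpy as np

def press(X, y):
    """PRESS_p computed from a single fit via the hat-matrix diagonal."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix H = X (X'X)^{-1} X'
    e = y - H @ y                          # ordinary residuals e_i
    h = np.diag(H)                         # leverages h_ii
    return np.sum((e / (1 - h)) ** 2)      # sum of squared deleted residuals
```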

15 Best Subsets
- Best subsets algorithm:
  - Best subsets (a limited number) are identified according to pre-specified criteria.
  - Requires much less computational effort than evaluating all possible subsets.
  - Provides "good" subsets along with the best, which is quite useful.
  - When the pool of X variables is large, this algorithm can run out of steam. What then? We will see in the ensuing discussion.

16 A Simple Example
Best subsets regression output (note: s is the square root of MSE_p). The two Minitab tables list, for each model size, R-Sq, adjusted R-Sq, C-p, s, and which predictors are included (X1–X3 in the first run, X1–X4 in the second); numeric values omitted.

17 Forward Stepwise Regression
- An iterative procedure.
- Based on the partial F* or t* statistic, one decides whether or not to add a variable.
- One variable at a time is considered.
- Before we see the actual algorithm, here are some levers:
  - Minimum acceptable F to enter (F_E)
  - Minimum acceptable F to remove (F_R)
  - Minimum acceptable tolerance (T_min)
  - Maximum number of iterations (N)
- And here is the general form of the test statistic for a candidate X_k: F*_k = MSR(X_k | X's already in the model) / MSE(X's already in the model, X_k).

18 Forward Stepwise Regression
The procedure:
1. Run a simple linear regression of each candidate variable with the Y variable.
2. If none of the individual F values is larger than the cut-off value F_E, then stop. Else, enter the variable with the largest F.
3. Now run the regression of each remaining variable with Y, given that the variable entered in step 2 is already in the model.
4. Repeat step 2. If a candidate is found, then check its tolerance. If the tolerance (1 − R_k^2) is not larger than the cut-off tolerance value T_min, then choose a different candidate. If none is available, then terminate. Else, add the candidate variable.
5. Calculate the partial F for the variable entered in step 2, given that the variable entered in step 4 is already in the model. Check whether this F is less than F_R. If so, then remove the variable entered in step 2; else keep it. Check whether the number of iterations equals N. If yes, terminate. If not, then proceed to step 6.
6. Check, from the results of step 1, which is the next candidate variable to enter. If the number of iterations is exceeded, then terminate.
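A simplified sketch of the entry step of this procedure in NumPy; the tolerance check and the F_R removal test are only marked in comments, and all names and the default F_E cut-off are assumptions:

```python
import numpy as np

def _sse(X, y):
    """SSE of an OLS fit of y on X (X includes the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

def forward_stepwise(X, y, f_enter=4.0):
    n, m = X.shape
    in_model, candidates = [], list(range(m))
    while candidates:
        Xr = np.column_stack([np.ones(n)] + [X[:, j] for j in in_model])
        sse_reduced = _sse(Xr, y)
        best_f, best_j = -np.inf, None
        for j in candidates:                       # partial F for each candidate
            Xf = np.column_stack([Xr, X[:, j]])
            sse_full = _sse(Xf, y)
            mse_full = sse_full / (n - Xf.shape[1])
            f_star = (sse_reduced - sse_full) / mse_full
            if f_star > best_f:
                best_f, best_j = f_star, j
        if best_f < f_enter:                       # no candidate clears F_E: stop
            break
        in_model.append(best_j)                    # tolerance check (T_min) and the
        candidates.remove(best_j)                  # F_R removal test would go here
    return in_model
```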

19 Other Stepwise Regression Procedures
- Backward stepwise regression
  - The exact opposite of the forward procedure.
  - Sometimes preferred to forward stepwise.
  - Think about how this procedure would work and why, or under which conditions, you would use it instead of forward stepwise.
- Forward selection
  - Similar to forward stepwise, except that the variable-dropping part is not present.
- Backward elimination
  - Similar to backward stepwise, except that the variable-adding part is not present.

20 An Example
Let us go through the example (Fig. 9.7) on page 366 of KNN.

21 Some Other Selection Criteria
- Akaike Information Criterion (AIC)
  - Imposes a penalty for adding regressors.
  - AIC = e^{2p/n} · SSE_p / n, where 2p/n drives the penalty factor e^{2p/n}.
  - A harsher penalty than R_a^2 (how?).
  - The model with the lowest AIC is preferred.
  - AIC is used for measuring both in-sample and out-of-sample forecasting performance.
  - Useful for nested and non-nested models and for determining lag length in autoregressive models (Ch. 12).

22 Some Other Selection Criteria
- Schwarz Information Criterion (SIC)
  - SIC = n^{p/n} · SSE_p / n.
  - Similar to AIC.
  - Imposes a stricter penalty than AIC.
  - Has advantages similar to AIC's.
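Both criteria in the multiplicative forms given on these two slides (a sketch; most software reports the log-likelihood-based versions instead, which are monotone transformations of these and therefore rank models identically):

```python
import numpy as np

def aic(sse_p, n, p):
    return np.exp(2 * p / n) * sse_p / n   # AIC_p = e^{2p/n} * SSE_p / n

def sic(sse_p, n, p):
    return n ** (p / n) * sse_p / n        # SIC_p = n^{p/n} * SSE_p / n
```

Since n^{p/n} = e^{(p/n) ln n} grows faster in p than e^{2p/n} once ln n > 2 (i.e., n > e² ≈ 7.4), SIC's penalty is indeed the stricter one for any realistic sample size.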

23 Model Validation
- Checking the prediction ability of the model.
- Methods for model validation:
  1. Collection of new data: select a new sample, with the same variables, of size n*, and compute the mean squared prediction error MSPR = Σ_{i=1}^{n*} (Y_i − Ŷ_i)^2 / n*.
  2. Comparison of results with theoretical expectations.
  3. Data splitting into two data sets: one for model building and one for validation.
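A minimal sketch covering methods 1 and 3: fit on the model-building data, then score with MSPR on the new or held-out observations (names and the 50/50 split are assumptions):

```python
import numpy as np

def mspr(X_build, y_build, X_valid, y_valid):
    """MSPR = sum (Y_i - Yhat_i)^2 / n*, with the fit frozen on the
    model-building set; X's include the intercept column."""
    beta, *_ = np.linalg.lstsq(X_build, y_build, rcond=None)
    return np.mean((y_valid - X_valid @ beta) ** 2)

# Example 50/50 random split for method 3 (X has an intercept column):
# rng = np.random.default_rng(0)
# idx = rng.permutation(len(y))
# half = len(y) // 2
# score = mspr(X[idx[:half]], y[idx[:half]], X[idx[half:]], y[idx[half:]])
```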