Variable selection and model building Part I

Statement of situation

A common situation is that there is a large set of candidate predictor variables. (Note: The examples herein are not really that large.) The goal is to choose a small subset from the larger set so that the resulting regression model is simple and useful:
– provides a good summary of the trend in the response
– and/or provides good predictions of the response
– and/or provides good estimates of the slope coefficients

What if the regression equation contains “wrong” variables?

When is an estimate unbiased?

An estimate is unbiased if the average of the values of the statistic, computed from all possible random samples, equals the parameter you’re trying to estimate.
– An estimated regression coefficient b_i is unbiased if the mean of all possible b_i equals β_i.
– A predicted response ŷ is unbiased if the mean of all possible ŷ equals the true mean response μ_Y.
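A small Python simulation sketch of this idea (the true values β0 = 5, β1 = 2 and the other settings are illustrative assumptions, not from the slides): averaging the fitted slope over many random samples lands very close to the true slope.

import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, n = 5.0, 2.0, 30
slopes = []
for _ in range(10_000):
    x = rng.uniform(0, 10, n)                     # a fresh random sample
    y = beta0 + beta1 * x + rng.normal(0, 3, n)   # true model plus noise
    slopes.append(np.polyfit(x, y, 1)[0])         # fitted slope b1 for this sample
print(np.mean(slopes))                            # approximately 2.0: b1 is unbiased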

One of four possible situations

The model is correctly specified:
– The regression equation contains all relevant predictors, including necessary interaction terms and transformations, and no redundant or extraneous predictors.
– Leads to unbiased regression coefficients and unbiased predictions of the response.
– MSE is an unbiased estimate of σ².

One of four possible situations

The model is underspecified:
– The regression equation is missing one or more important predictor variables.
– Leads to biased regression coefficients and biased predictions of the response.
– MSE is a biased (upward) estimate of σ².

A (likely) underspecified model

Two models fit to the same data:
– Weight = b0 + b1 Height + b2 Water, MSE = [value not preserved in this transcript]
– Weight = b0 + b1 Height, MSE = 0.653
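The slide's Weight/Height/Water numbers were lost, so here is an illustrative Python simulation (all values are assumptions, not the slide's data) of the same phenomenon: dropping a relevant predictor that is correlated with a retained one biases the retained slope and inflates MSE.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)     # x2 is correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
under = sm.OLS(y, sm.add_constant(x1)).fit()       # x2 omitted: underspecified
print(full.params[1], full.mse_resid)              # slope near 2, MSE near 1
print(under.params[1], under.mse_resid)            # slope near 4.1 (biased), MSE inflated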

One of four possible situations

The model contains two or more extraneous variables:
– The regression equation contains extraneous variables that are not related to the response or to any of the other predictors.
– Leads to unbiased regression coefficients and unbiased predictions of the response.
– MSE is an unbiased estimate of σ², but has fewer degrees of freedom associated with it.

One of four possible situations

The model is overspecified:
– The regression equation contains one or more redundant predictor variables.
– Leads to unbiased regression coefficients and unbiased predictions of the response.
– MSE is an unbiased estimate of σ².
– Because of multicollinearity, the standard errors of the regression coefficients are inflated.
– The model can be used, with caution, for prediction.

A goal, a strategy

Know your research question.
– Are there a few particular predictors of interest?
– Are you most interested in summary description, in prediction, or in the effects of the predictors?
Identify all possible candidate predictors.
– Don’t worry about functional form, such as x², log x, and interactions, yet.

A goal, a strategy (cont’d)

Use variable selection procedures to find the middle ground between an underspecified model and a model with extraneous variables.
Fine-tune the model to get a correctly specified model.
– If necessary, change the functional form of predictors and add interactions.
– Check the behavior of the residuals.

Two basic methods of selecting predictors

Stepwise regression: Enter and remove predictors, in a stepwise manner, until there is no justifiable reason to enter or remove any more.
Best subsets regression: Select the subset of predictors that does the best at meeting some well-defined objective criterion.

Two cautions!

The list of candidate predictor variables must include all the variables that actually predict the response.
There is no single criterion that will always be the best measure of the “best” regression equation.

Stepwise regression

Enter and remove predictors, in a stepwise manner, until there is no justifiable reason to enter or remove any more.

Example: Cement data

Response y: heat evolved, in calories per gram of cement, during hardening
Predictor x1: % of tricalcium aluminate
Predictor x2: % of tricalcium silicate
Predictor x3: % of tetracalcium aluminoferrite
Predictor x4: % of dicalcium silicate

Example: Cement data

Stepwise regression: the idea

Start with no predictors in the “stepwise model.” At each step, enter or remove a predictor based on partial F-tests (that is, the t-tests). Stop when no more predictors can be justifiably entered into or removed from the stepwise model.

Stepwise regression: Preliminary steps

1. Specify an Alpha-to-Enter significance level (αE = 0.15).
2. Specify an Alpha-to-Remove significance level (αR = 0.15).

Stepwise regression: Step #1

1. Fit each of the one-predictor models; that is, regress y on x1, regress y on x2, …, regress y on xp-1.
2. The first predictor put in the stepwise model is the one with the smallest t-test P-value (below αE = 0.15).
3. If no P-value is below 0.15, stop. (A code sketch of this step follows.)
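A minimal Python sketch of Step #1, assuming statsmodels is available, X is a pandas DataFrame of candidate predictors, and y is the response Series (the function name first_entry is illustrative, not from the slides):

import statsmodels.api as sm

def first_entry(X, y, alpha_enter=0.15):
    # Regress y on each candidate alone and record the slope's t-test P-value.
    pvals = {c: sm.OLS(y, sm.add_constant(X[[c]])).fit().pvalues[c]
             for c in X.columns}
    best = min(pvals, key=pvals.get)
    # Enter the predictor with the smallest P-value, if it clears alpha-to-enter.
    return best if pvals[best] < alpha_enter else None   # None means: stop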

Stepwise regression: Step #2

1. Suppose x1 was the “best” one predictor.
2. Fit each of the two-predictor models with x1 in the model; that is, regress y on (x1, x2), regress y on (x1, x3), …, and regress y on (x1, xp-1).
3. The second predictor put in the stepwise model is the one with the smallest t-test P-value (below αE = 0.15).
4. If no P-value is below 0.15, stop.

Stepwise regression: Step #2 (continued)

1. Suppose x2 was the “best” second predictor.
2. Step back and check the P-value for β1 = 0. If that P-value has become not significant (above αR = 0.15), remove x1 from the stepwise model.

Stepwise regression: Step #3

1. Suppose both x1 and x2 made it into the two-predictor stepwise model.
2. Fit each of the three-predictor models with x1 and x2 in the model; that is, regress y on (x1, x2, x3), regress y on (x1, x2, x4), …, and regress y on (x1, x2, xp-1).

Stepwise regression: Step #3 (continued)

1. The third predictor put in the stepwise model is the one with the smallest t-test P-value (below αE = 0.15).
2. If no P-value is below 0.15, stop.
3. Step back and check the P-values for β1 = 0 and β2 = 0. If either P-value has become not significant (above αR = 0.15), remove that predictor from the stepwise model.

Stepwise regression: Stopping the procedure

The procedure stops when adding an additional predictor does not yield a t-test P-value below αE = 0.15.
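Putting the steps together, here is a minimal Python sketch of the whole procedure (a hedged illustration, not Minitab's exact implementation; statsmodels is assumed, and the names stepwise and forced are invented for this sketch). It generalizes first_entry above: after each entry it re-checks the current model and removes any predictor whose P-value has drifted above alpha-to-remove.

import statsmodels.api as sm

def stepwise(X, y, alpha_enter=0.15, alpha_remove=0.15, forced=()):
    # X: pandas DataFrame of candidate predictors; y: response Series.
    # Predictors listed in 'forced' stay in every model (researcher's knowledge).
    selected = list(forced)
    candidates = [c for c in X.columns if c not in selected]
    while True:
        # Entry step: try each remaining candidate alongside the current model.
        pvals = {}
        for c in candidates:
            fit = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            pvals[c] = fit.pvalues[c]
        best = min(pvals, key=pvals.get) if pvals else None
        if best is None or pvals[best] >= alpha_enter:
            break                          # no candidate earns entry: stop
        selected.append(best)
        candidates.remove(best)
        # Removal step: drop any now-insignificant predictor (forced ones stay).
        while True:
            fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
            p = fit.pvalues.drop('const').drop(labels=list(forced), errors='ignore')
            if len(p) == 0 or p.max() <= alpha_remove:
                break
            worst = p.idxmax()
            selected.remove(worst)
            candidates.append(worst)
    return selected

Library routines exist as well (for example, scikit-learn's SequentialFeatureSelector, though it selects greedily by cross-validated score rather than by P-values).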

Example: Cement data

[Minitab output: Predictor / Coef / SE Coef / T / P tables for the four one-predictor models (y on x1, y on x2, y on x3, y on x4); the numeric values were not preserved in this transcript]

[Minitab output: coefficient tables for the three two-predictor models that add each remaining predictor to the first one entered; numeric values not preserved]

[Minitab output: coefficient tables for two three-predictor models; numeric values not preserved]

[Minitab output: coefficient table for a two-predictor model; numeric values not preserved]

[Minitab output: coefficient tables for two three-predictor models; numeric values not preserved]

[Minitab output: coefficient table for the resulting two-predictor model; numeric values not preserved]

Stepwise Regression: y versus x1, x2, x3, x4

Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15
Response is y on 4 predictors, with N = 13

[Minitab stepwise summary table: per-step Constant, entered predictors' coefficients with T-Values and P-Values, and S, R-Sq, R-Sq(adj), C-p; numeric values not preserved]
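To reproduce a run like this with the stepwise() sketch above, something like the following would do; the file name cement.csv and its column layout are assumptions, since the data values themselves are not shown in this transcript.

import pandas as pd

cement = pd.read_csv('cement.csv')     # hypothetical file with columns y, x1, x2, x3, x4
chosen = stepwise(cement[['x1', 'x2', 'x3', 'x4']], cement['y'],
                  alpha_enter=0.15, alpha_remove=0.15)
print(chosen)                          # the slides end with a two-predictor model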

Caution about stepwise regression!

Do not over-interpret the order in which predictors are entered into the model. Do not jump to the conclusion …
– that all the important predictor variables for predicting y have been identified, or
– that all the unimportant predictor variables have been eliminated.

Caution about stepwise regression! (cont’d)

Many t-tests for βk = 0 are conducted in a stepwise regression procedure. The probability is therefore high …
– that we included some unimportant predictors, and
– that we excluded some important predictors.

Drawbacks of stepwise regression

– The final model is not guaranteed to be optimal in any specified sense.
– The procedure yields a single final model, although there are often several equally good models.
– It doesn’t take into account a researcher’s knowledge about the predictors. If necessary, force the procedure to include important predictors.

Example: Modeling PIQ

Stepwise Regression: PIQ versus MRI, Height, Weight

Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15
Response is PIQ on 3 predictors, with N = 38

[Minitab stepwise summary: MRI enters at step 1 and Height at step 2, with per-step Constant, T-Values, P-Values, S, R-Sq, R-Sq(adj), C-p; numeric values not preserved]

The regression equation is

PIQ = b0 + b1 MRI + b2 Height  (coefficient values not preserved in this transcript)

S = [not preserved]  R-Sq = 29.5%  R-Sq(adj) = 25.5%

[Coefficient table (Predictor, Coef, SE Coef, T, P for Constant, MRI, Height) and Analysis of Variance table (Regression, Error, Total), plus sequential SS for MRI and then Height; numeric values not preserved]
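A sketch of fitting this final two-predictor model with statsmodels' formula API (the file piq.csv and its column names are assumptions, not from the slides):

import pandas as pd
import statsmodels.formula.api as smf

piq = pd.read_csv('piq.csv')             # hypothetical file: PIQ, MRI, Height, Weight
fit = smf.ols('PIQ ~ MRI + Height', data=piq).fit()
print(fit.summary())                     # R-squared should come out near 0.295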

Example: Modeling BP

Stepwise Regression: BP versus Age, Weight, BSA, Duration, Pulse, Stress

Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15
Response is BP on 6 predictors, with N = 20

[Minitab stepwise summary: Weight enters first, then Age, then BSA (Coef 4.6, T-Value 3.04); the remaining numeric values were not preserved]

The regression equation is

BP = b0 + b1 Age + b2 Weight + b3 BSA  (coefficient values not preserved in this transcript)

S = [not preserved]  R-Sq = 99.5%  R-Sq(adj) = 99.4%

[Coefficient table and Analysis of Variance table, plus sequential SS for Age, Weight, BSA; numeric values not preserved]

Stepwise regression in Minitab

– Stat >> Regression >> Stepwise …
– Specify the response and all possible predictors.
– If desired, specify predictors that must be included in every model. (This is where the researcher’s knowledge helps!)
– Select OK. Results appear in the session window.
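For comparison outside Minitab, the stepwise() sketch above already accommodates the "include in every model" option through its (invented) forced argument:

chosen = stepwise(X, y, forced=['x1'])   # x1 is kept in every model considered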