Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 1 Chapter 8 Variable Selection Terry Dielman Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 2 Introduction  Previously we discussed tests (the t-test and partial F test) that help determine whether certain variables belong in the regression.  Here we will look at several variable selection strategies that expand on this idea.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 3 Why is This Important?  If an important variable is omitted, the estimated regression coefficients can become biased (systematically too high or low).  Their standard errors can become inflated, leading to imprecise intervals and poor power in hypothesis tests.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 4 Strategies  All possible regressions: computer procedures that quickly examine every possible combination of Xs and report summary measures of fit.  Selection algorithms: rules for deciding when to drop or add variables 1. Backwards Elimination 2. Forward Selection 3. Stepwise Regression

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 5 Words of Caution  None guarantee you get the right model because they do not check assumptions or search for omitted factors like curvature.  None have the ability to use a researcher's knowledge about the business or economic situation being analyzed.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 6 All Possible Regressions  If there are k x variables to consider using, there are 2^k possible subsets. For example, with only k = 5, there are 32 regression equations.  Obtaining all of these sounds like a ton of work, but programs like SAS and Minitab have algorithms that can measure fit without actually producing each equation.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 7 Typical Output  The program will usually give you a summary table.  Each line of the table tells you which variables were in the model, plus measures of fit.  These measures include R², adjusted R², Se and a new one, Cp.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 8 The Cp Statistic  p = k + 1 is the number of terms in the model, including the intercept.  SSEp is the SSE of this model.  MSEF is the MSE of the "full model" (the one with all the variables).  Cp = SSEp/MSEF – (n – 2p)

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 9 Using The Cp Statistic  Theory says that in a model with bias, Cp will be large.  It also says that in a model with no bias, Cp should be approximately equal to p.  It is thus recommended that we consider models with a small Cp, particularly those with Cp near p = k + 1.
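To make the mechanics concrete, here is a minimal Python sketch (assuming the statsmodels, pandas and numpy libraries) of the all possible regressions computation: it enumerates every subset of predictors and reports R², adjusted R² and Cp for each, using the Cp formula above. The variable names echo the Meddicorp example on the next slide, but the data are simulated stand-ins, not the actual values.

```python
# Sketch: all-possible-regressions summary with R-squared, adjusted
# R-squared, and Cp. Simulated stand-in data, not the Meddicorp values.
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 25
X = pd.DataFrame({
    "ADV": rng.uniform(400, 600, n),
    "BONUS": rng.uniform(200, 300, n),
    "MKTSHR": rng.uniform(25, 35, n),
    "COMPET": rng.uniform(150, 250, n),
})
y = 50 + 2.5 * X["ADV"] + 1.9 * X["BONUS"] + rng.normal(0, 80, n)

# MSE of the full model is needed for every Cp value.
full = sm.OLS(y, sm.add_constant(X)).fit()
mse_full = full.mse_resid

rows = []
for size in range(1, X.shape[1] + 1):
    for subset in combinations(X.columns, size):
        fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
        p = size + 1                     # terms including the intercept
        cp = fit.ssr / mse_full - (n - 2 * p)
        rows.append({"vars": " ".join(subset), "R2": fit.rsquared,
                     "R2adj": fit.rsquared_adj, "Cp": cp})

summary = pd.DataFrame(rows).sort_values("Cp")
print(summary.to_string(index=False))
```

With k = 4 predictors the loop fits all 15 non-empty subsets; sorting by Cp mimics scanning a Best Subsets table for small Cp values.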

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 10 Example 8.1 Meddicorp Revisited
n = 25 sales territories
y = Sales (in $1000s) in each territory
x1 = Advertising ($100s) in the territory
x2 = Bonuses paid (in $100s) in the territory
x3 = Market share in the territory
x4 = Largest competitor's sales ($1000s)
x5 = Region code (1 = S, 2 = W, 3 = MW)
We are not using region here because it would first need to be converted to indicator variables, which should be examined as a group.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 11 Summary Results For All Possible Regressions

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 12 The Best Model?  The two-variable model with ADV and BONUS has the smallest Cp and highest adjusted R².  The three-variable models adding either MKTSHR or COMPET also have small Cp values but only modest increases in R².  The two-variable model is probably the best.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 13 Minitab Results  [Best Subsets output: one row per model, with columns Vars, R-Sq, R-Sq(adj), C-p and S, plus vertical indicator columns marking which of ADV, BONUS, MKTSHR and COMPET are in each model; the numeric entries did not survive transcription.]  By default, the Best Subsets procedure prints two models for each number of X variables. This can be increased up to 5.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 14 Limitations  With a large number of potential x variables, the all possible approach becomes unwieldy.  Minitab can use up to 31 predictors, but warns that computational time can be long when as few as 15 are used.  "Obviously good" predictors can be forced into the model, thus reducing search time, but this is not always what you want.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 15 Other Variable Selection Techniques  With a large number of potential x variables, it may be best to use one of the iterative selection methods.  These look only at the models their rules lead them to, so they may not yield a model as good as the one returned by the all possible regressions approach.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 16 Backwards Elimination 1. Start with all variables in the equation. 2. Examine the variables in the model for significance and identify the least significant one. 3. Remove this variable if it does not meet some minimum significance level. 4. Run a new regression and repeat until all remaining variables are significant.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 17 No Search Routine Needed?  Although most software packages have automatic procedures for backwards elimination, it is fairly easy to do interactively.  Run a model, check its t-tests for significance, and identify the variable to drop.  Run again with one less variable and repeat the steps.
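Here is a minimal Python sketch of that interactive loop, again assuming statsmodels and reusing the simulated y and X from the all-subsets sketch; the function name and the alpha-to-remove default are illustrative, not from the text.

```python
# Sketch of backwards elimination: refit, find the predictor with the
# largest p-value, drop it if it exceeds alpha_to_remove, and repeat.
import statsmodels.api as sm

def backward_eliminate(y, X, alpha_to_remove=0.15):
    """Return the list of predictors that survive backwards elimination."""
    kept = list(X.columns)
    while kept:
        fit = sm.OLS(y, sm.add_constant(X[kept])).fit()
        pvals = fit.pvalues.drop("const")  # never consider dropping the intercept
        worst = pvals.idxmax()             # least significant predictor
        if pvals[worst] <= alpha_to_remove:
            break                          # every survivor is significant
        kept.remove(worst)
    return kept

# Usage with the simulated data from the earlier sketch:
# survivors = backward_eliminate(y, X)
# print(sm.OLS(y, sm.add_constant(X[survivors])).fit().summary())
```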

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 18 Step 1 – All Variables  Regression Analysis: SALES versus ADV, BONUS, MKTSHR, COMPET  [The fitted equation and the Coef, SE Coef, T and P columns for Constant, ADV, BONUS, MKTSHR and COMPET did not survive transcription.]  R-Sq = 85.9%  R-Sq(adj) = 83.1%  COMPET is flagged as the least significant predictor.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 19 Step 2 – COMPET Eliminated  Regression Analysis: SALES versus ADV, BONUS, MKTSHR  [The coefficient table did not survive transcription.]  R-Sq = 85.8%  R-Sq(adj) = 83.8%

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 20 Step 3 – MKTSHR Eliminated  Regression Analysis: SALES versus ADV, BONUS  [The coefficient table did not survive transcription.]  R-Sq = 85.5%  R-Sq(adj) = 84.2%

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 21 Forward Selection  At each stage, it looks at the x variables not in the current equation and tests whether they would be significant if added.  In the first stage, the x with the highest correlation with y is added.  At later stages, the variable added is the one that would be most significant (smallest p-value) given the variables already in the equation.
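Here is a matching Python sketch of forward selection under the same assumptions (statsmodels, the simulated y and X). Choosing the candidate with the smallest entry p-value reproduces the highest-correlation rule at the first stage.

```python
# Sketch of forward selection: at each stage, try each variable not yet
# in the model and add the one with the smallest p-value, provided that
# p-value is below alpha_to_enter.
import statsmodels.api as sm

def forward_select(y, X, alpha_to_enter=0.25):
    """Return the list of predictors chosen by forward selection."""
    kept = []
    while True:
        best, best_p = None, 1.0
        for c in (c for c in X.columns if c not in kept):
            fit = sm.OLS(y, sm.add_constant(X[kept + [c]])).fit()
            if fit.pvalues[c] < best_p:
                best, best_p = c, fit.pvalues[c]
        if best is None or best_p > alpha_to_enter:
            return kept
        kept.append(best)

# Usage with the simulated data from the earlier sketch:
# chosen = forward_select(y, X)
# print(sm.OLS(y, sm.add_constant(X[chosen])).fit().summary())
```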

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 22 Minitab Output for Forward Selection  An option in the Stepwise procedure.  Forward selection. Alpha-to-Enter: 0.25  Response is SALES on 4 predictors, with N = 25  [Step-by-step table not fully preserved: ADV enters at Step 1; BONUS enters at Step 2 with coefficient 1.86 and T-Value 2.59; the Constant, S, R-Sq, R-Sq(adj) and C-p rows did not survive transcription.]

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 23 Same Model as Backwards  This data set is not too complex, so both procedures returned the same model.  With larger data sets, particularly when the x variables are correlated among themselves, results can be different.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 24 Stepwise Regression  A limitation of the backwards procedure is that a variable that gets eliminated is never considered again.  With forward selection, variables that enter stay in, even if they lose significance.  Stepwise regression corrects both flaws: a variable that has entered can later leave, and a variable that was eliminated can later go back in (see the sketch below).
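A minimal Python sketch of stepwise regression under the same assumptions, alternating a forward step with a backward check. Setting alpha-to-remove at least as large as alpha-to-enter is the usual guard against a variable entering and leaving endlessly.

```python
# Sketch of stepwise regression: a forward step (alpha_to_enter) followed
# by a backward check (alpha_to_remove), repeated until nothing changes.
import statsmodels.api as sm

def stepwise(y, X, alpha_to_enter=0.15, alpha_to_remove=0.15):
    """Return the predictors chosen by stepwise selection."""
    kept = []
    changed = True
    while changed:
        changed = False
        # Forward step: add the candidate with the smallest entry p-value.
        entry_p = {c: sm.OLS(y, sm.add_constant(X[kept + [c]])).fit().pvalues[c]
                   for c in X.columns if c not in kept}
        if entry_p:
            best = min(entry_p, key=entry_p.get)
            if entry_p[best] < alpha_to_enter:
                kept.append(best)
                changed = True
        # Backward step: drop the worst survivor if it fails alpha_to_remove.
        if kept:
            pvals = sm.OLS(y, sm.add_constant(X[kept])).fit().pvalues.drop("const")
            worst = pvals.idxmax()
            if pvals[worst] > alpha_to_remove:
                kept.remove(worst)
                changed = True
    return kept
```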

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 25 Minitab Output for Stepwise Regression  Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15  Response is SALES on 4 predictors, with N = 25  [Step-by-step table not fully preserved: ADV enters at Step 1; BONUS enters at Step 2 with coefficient 1.86 and T-Value 2.59; the Constant, S, R-Sq, R-Sq(adj) and C-p rows did not survive transcription.]

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 26 Selection Parameters  For backwards elimination, the user specifies "Alpha to Remove", the maximum p-value a variable can have and stay in the equation.  For forward selection, the user specifies "Alpha to Enter", the maximum p-value a variable can have and still enter the equation.  Stepwise regression uses both.  Often we use values like .15 or .20 because this encourages the procedures to look at models with more variables.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection Which Procedure is Best?  Unless there are too many x variables, the all possible models approach is favored because it looks at all combinations of variables.  Of the other strategies, stepwise regression is probably best.  If no search programs are available, backwards elimination can still provide a useful sifting of the data.

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. Variable Selection 28 No Guarantees  Because they do not check assumptions or examine the model residuals, there is no guarantee that these procedures return the right model.  Nonetheless, they can be effective tools for filtering the data and identifying which variables deserve closer attention.