Model selection and model building: Selection of predictor variables

Presentation transcript:

Model selection and model building

Model selection Selection of predictor variables

Statement of problem A common problem is that there is a large set of candidate predictor variables. The goal is to choose a small subset from the larger set so that the resulting regression model is simple yet has good predictive ability.

Example: Cement data
Response y: heat evolved, in calories per gram, during hardening of cement
Predictor x1: % of tricalcium aluminate
Predictor x2: % of tricalcium silicate
Predictor x3: % of tetracalcium alumino ferrite
Predictor x4: % of dicalcium silicate

Example: Cement data

Two basic methods of selecting predictors Stepwise regression: enter and remove variables, in a stepwise manner, until there is no justifiable reason to enter or remove any more. Best subsets regression: select the subset of variables that does best at meeting some well-defined objective criterion.

Stepwise regression: the idea Start with no predictors in the model. At each step, enter or remove a variable based on partial F-tests. Stop when no more variables can be justifiably entered or removed.

Stepwise regression: the steps Specify an Alpha-to-Enter (here 0.15) and an Alpha-to-Remove (here 0.15). Start with no predictors in the model. Put the predictor with the smallest P-value based on the partial F-statistic (equivalently, its t-statistic) into the model. If that P-value > 0.15, stop: none of the predictors has good predictive ability. Otherwise …

Stepwise regression: the steps (cont’d) Add the predictor with the smallest P-value (below 0.15) based on the partial F-statistic (a t-statistic). If none of the remaining predictors yields a P-value < 0.15, stop. If the P-value for any predictor already in the model rises above 0.15, remove the violating predictor. Continue these two steps until no more predictors can be entered or removed.
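Minitab automates this entry/removal loop (its session output follows). As a concrete illustration only, here is a minimal sketch of the same procedure in Python with statsmodels; the function name stepwise, the DataFrame df, and the column names are illustrative assumptions, not part of any package.

```python
import statsmodels.api as sm

def stepwise(df, response, candidates, alpha_enter=0.15, alpha_remove=0.15):
    """Forward entry with backward removal, as described on the slides."""
    included = []
    while True:
        changed = False
        # Entry: fit one candidate model per excluded predictor and find
        # the predictor with the smallest P-value (its t-test is the
        # partial F-test for a single coefficient).
        excluded = [c for c in candidates if c not in included]
        if excluded:
            pvals = {}
            for c in excluded:
                X = sm.add_constant(df[included + [c]])
                pvals[c] = sm.OLS(df[response], X).fit().pvalues[c]
            best = min(pvals, key=pvals.get)
            if pvals[best] < alpha_enter:
                included.append(best)
                changed = True
        # Removal: refit with everything currently included and drop the
        # predictor with the largest P-value if it exceeds alpha_remove.
        if included:
            X = sm.add_constant(df[included])
            pvals = sm.OLS(df[response], X).fit().pvalues.drop("const")
            worst = pvals.idxmax()
            if pvals[worst] > alpha_remove:
                included.remove(worst)
                changed = True
        if not changed:
            return included
```

Called as stepwise(df, "y", ["x1", "x2", "x3", "x4"]), the sketch should retrace the cement-data sequence summarized below.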

Example: Cement data

[Minitab coefficient tables (Predictor / Coef / SE Coef / T / P) for each stage of the stepwise procedure; the numeric values were lost in transcription. The sequence shown is the classic one for this data set:

Step 1: four simple regressions, y on x1, y on x2, y on x3, and y on x4; x4 has the smallest P-value and enters first.

Step 2: two-predictor fits adding x1, x2, or x3 to x4; x1 has the smallest P-value and enters.

Step 3: three-predictor fits adding x2 or x3 to (x4, x1); x2 enters, after which x4’s P-value exceeds 0.15 and x4 is removed, leaving the model y on x1 and x2.

Step 4: attempts to re-enter x3 or x4 alongside x1 and x2; neither achieves a P-value below 0.15, so the procedure stops with the final model y on x1 and x2.]

Stepwise Regression: y versus x1, x2, x3, x4
Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15
Response is y on 4 predictors, with N = 13
[Step-by-step summary table: one column per step, listing the constant, the T-value and P-value of each predictor as it enters or leaves, and S, R-Sq, R-Sq(adj), and C-p for each step’s model; the numeric values were lost in transcription.]

Drawbacks of stepwise regression The final model is not guaranteed to be optimal in any specified sense. The procedure yields a single final model, although in practice there are often several almost equally good models.

Best subsets regression If there are p − 1 possible predictors, then there are 2^(p−1) possible regression models containing subsets of them. For example, 10 predictors yield 2^10 = 1024 possible regression models. A best subsets algorithm determines the best subsets of each size, so that the choice of the final model can be made by the researcher.
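To make the enumeration concrete, here is a brute-force sketch in Python with statsmodels (production implementations typically use a more efficient branch-and-bound search rather than fitting everything); the function name and the DataFrame df are illustrative assumptions. It fits every non-empty subset and records the criteria discussed next.

```python
from itertools import combinations
import statsmodels.api as sm

def all_subsets(df, response, candidates):
    """Fit every non-empty subset of the candidates and collect criteria."""
    results = []
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            X = sm.add_constant(df[list(subset)])
            fit = sm.OLS(df[response], X).fit()
            results.append({"predictors": subset,
                            "R2": fit.rsquared,
                            "adjR2": fit.rsquared_adj,
                            "S": fit.mse_resid ** 0.5})  # S = sqrt(MSE)
    return results
```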

What is used to judge “best”?
– R-squared
– Adjusted R-squared
– MSE (or S, the square root of MSE)
– Mallows’ C_p

R-squared Use the R-squared values to find the point where adding more predictors is not worthwhile because it leads to a very small increase in R-squared.

Adjusted R-squared or MSE Adjusted R-squared increases only if MSE decreases, so adjusted R-squared and MSE provide equivalent information. Find a few subsets for which MSE is smallest (or adjusted R-squared is largest) or so close to the smallest (largest) that adding more predictors is not worthwhile.
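The equivalence is a one-line derivation from the definitions (writing SSTO for the total sum of squares):

R^2_{adj} = 1 - \frac{SSE/(n-p)}{SSTO/(n-1)} = 1 - \frac{MSE}{SSTO/(n-1)}

Since SSTO/(n − 1) is fixed for a given data set, the adjusted R-squared increases exactly when MSE decreases.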

Mallows’ C_p criterion The Mallows’ C_p statistic,

C_p = \frac{SSE_p}{MSE_{full}} - (n - 2p),

is an estimator of the total standardized mean square error of prediction,

\Gamma_p = \frac{1}{\sigma^2} \sum_{i=1}^{n} \left[ \bigl(E(\hat{y}_i) - E(y_i)\bigr)^2 + \mathrm{Var}(\hat{y}_i) \right],

which equals:

\frac{\sum_{i=1}^{n} \bigl(E(\hat{y}_i) - E(y_i)\bigr)^2}{\sigma^2} + p,

since \sum_{i=1}^{n} \mathrm{Var}(\hat{y}_i) = p\sigma^2 for a model with p parameters.

Using the C_p criterion Subsets with small C_p values have a small total (standardized) mean square error of prediction. When the C_p value is also near p, the bias of the regression model is small.

Using the C_p criterion (cont’d) So, identify subsets of predictors for which:
– the C_p value is smallest, and
– the C_p value is near p (if possible).
Note, though, that for the full model C_p = p exactly, so the full model is always judged to be a good candidate model by the second condition.
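Continuing the brute-force sketch above, C_p for each subset can be computed from the subset’s SSE and the full model’s MSE; again an illustrative Python fragment, with p counting the intercept.

```python
from itertools import combinations
import statsmodels.api as sm

def mallows_cp(df, response, candidates):
    """C_p for every non-empty subset, using the full model's MSE for sigma^2."""
    n = len(df)
    full = sm.OLS(df[response], sm.add_constant(df[candidates])).fit()
    mse_full = full.mse_resid                     # estimate of sigma^2
    table = []
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            fit = sm.OLS(df[response], sm.add_constant(df[list(subset)])).fit()
            p = k + 1                             # k slopes plus the intercept
            table.append((subset, p, fit.ssr / mse_full - (n - 2 * p)))
    return table
```

For the full candidate set, SSE/MSE_full = n − p, so the formula reduces to C_p = p exactly, matching the note above.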

Best Subsets Regression: y versus x1, x2, x3, x4
Response is y
[Table: for each number of predictors (Vars = 1 to 4), the two best subsets, with R-Sq, R-Sq(adj), C-p, and S, and X marks showing which of x1, x2, x3, x4 each model contains; numeric values lost in transcription.]

Example: Modeling PIQ

Stepwise Regression: PIQ versus MRI, Height, Weight
Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15
Response is PIQ on 3 predictors, with N = 38
[Two-step summary table: MRI enters at step 1 and Height at step 2; Weight never enters. The T-values, P-values, S, R-Sq, R-Sq(adj), and C-p were lost in transcription.]

Best Subsets Regression: PIQ versus MRI, Height, Weight
Response is PIQ
[Table: best subsets of each size, with R-Sq, R-Sq(adj), C-p, S, and X marks showing which of MRI, Height, and Weight each model contains; numeric values lost in transcription.]

The regression equation is PIQ = b0 + b1 MRI + b2 Height, with R-Sq = 29.5% and R-Sq(adj) = 25.5%.
[The fitted coefficients, standard errors, S, the ANOVA table, and the sequential sums of squares for MRI and Height were lost in transcription.]

Example: Modeling BP

Stepwise Regression: BP versus Age, Weight, BSA, Duration, Pulse, Stress
Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15
Response is BP on 6 predictors, with N = 20
[Three-step summary table: Weight enters at step 1, Age at step 2, and BSA at step 3 (coefficient 4.6, T-value 3.04); Duration, Pulse, and Stress never enter. The remaining numeric values were lost in transcription.]

Best Subsets Regression: BP versus Age, Weight, ...
Response is BP
[Table: best subsets of each size, with R-Sq, R-Sq(adj), C-p, S, and X marks showing which of Age, Weight, BSA, Duration, Pulse, and Stress each model contains; numeric values lost in transcription.]

The regression equation is BP = b0 + b1 Age + b2 Weight + b3 BSA, with R-Sq = 99.5% and R-Sq(adj) = 99.4%.
[The fitted coefficients, standard errors, S, the ANOVA table, and the sequential sums of squares for Age, Weight, and BSA were lost in transcription.]

Stepwise regression in Minitab Stat >> Regression >> Stepwise … Specify the response and all possible predictors. If desired, specify predictors that must be included in every model. Select OK. Results appear in the session window.

Best subsets regression in Minitab Stat >> Regression >> Best subsets … Specify the response and all possible predictors. If desired, specify predictors that must be included in every model. Select OK. Results appear in the session window.

Model building strategy

The first step Decide on the type of model needed:
– Predictive: model used to predict the response variable from a chosen set of predictors.
– Theoretical: model based on a theoretical relationship between response and predictors.
– Control: model used to control a response variable by manipulating predictor variables.

The first step (cont’d) Decide on the type of model needed:
– Inferential: model used to explore the strength of relationships between response and predictors.
– Data summary: model used merely as a way to summarize a large set of data by a single equation.

The second step Decide on which predictor variables and which response variable to collect data. Collect the data.

The third step Explore the data:
– Check for outliers, gross data errors, and missing values on a univariate basis.
– Study bivariate relationships to reveal other outliers, to suggest possible transformations, and to identify possible multicollinearities.
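As an illustration of these checks (a sketch assuming Python with pandas and a hypothetical data file; in the Minitab workflow the same checks would be done with its descriptive-statistics and graphing menus):

```python
import pandas as pd

df = pd.read_csv("cement.csv")     # hypothetical file of response + predictors

print(df.describe())               # univariate summaries: ranges flag gross errors and outliers
print(df.isna().sum())             # missing values per variable
print(df.corr().round(2))          # large predictor-predictor correlations hint at multicollinearity
pd.plotting.scatter_matrix(df)     # bivariate plots: outliers, curvature suggesting transformations
```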

The fourth step Randomly divide the data into a training set and a test set:
– The training set, with at least … error d.f., is used to fit the model.
– The test set is used for cross-validation of the fitted model.

The fifth step Using the training set, fit several candidate models:
– Use best subsets regression.
– Use stepwise regression (which yields only a single model unless different alpha-to-remove and alpha-to-enter values are specified).

The sixth step Select and evaluate a few “good” models:
– Select based on adjusted R², Mallows’ C_p, and the number and nature of the predictors.
– Evaluate the selected models for violations of model assumptions.
– If none of the models provides a satisfactory fit, try something else, such as more data, different predictors, a different class of model …

The final step Select the final model:
– Compare the competing models by cross-validating them against the test data.
– The model with the larger cross-validation R² is the better predictive model.
– Also consider residual plots, outliers, parsimony, relevance, and ease of measurement of predictors.
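A minimal sketch of the split-and-validate workflow from the fourth and final steps, under the same assumptions as the earlier sketches; the predictor list and the definition of cross-validation R² as one minus the test-set SSE over the test-set SSTO are illustrative choices, not the only ones in use.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("cement.csv")                  # hypothetical file, as above
rng = np.random.default_rng(0)
test_rows = rng.choice(len(df), size=len(df) // 3, replace=False)
test = df.iloc[test_rows]                       # held out for validation
train = df.drop(df.index[test_rows])            # used to fit candidate models

predictors = ["x1", "x2"]                       # chosen on the training set
fit = sm.OLS(train["y"], sm.add_constant(train[predictors])).fit()

# Cross-validation R^2: proportion of test-set variability explained by
# predictions from the training-set model.
pred = fit.predict(sm.add_constant(test[predictors]))
cv_r2 = 1 - ((test["y"] - pred) ** 2).sum() / ((test["y"] - test["y"].mean()) ** 2).sum()
print(f"cross-validation R^2 = {cv_r2:.3f}")
```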