Model Selection In multiple regression we often have many explanatory variables. How do we find the “best” model?

Model Selection How can we select the set of explanatory variables that explains the most variation in the response, with each variable adding significantly to the model?

Cruising Timber Response: Mean Diameter at Breast Height (MDBH) of a tree.
Explanatory:
X1 = Mean Height of Pines
X2 = Age of Tract times the Number of Pines
X3 = Mean Height of Pines divided by the Number of Pines
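
The slides work in JMP; the sketches that follow use Python instead. As a minimal setup, assuming only the variable definitions above (the raw column names, sample size, and all values here are made up for illustration, not the original cruising-timber data), the derived explanatory variables could be built with pandas:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20  # small synthetic sample; the real tract data are not reproduced here

tract = pd.DataFrame({
    "mean_height": rng.normal(32, 4, n),       # mean height of pines (hypothetical units)
    "age":         rng.integers(15, 40, n),    # age of tract
    "n_pines":     rng.integers(300, 600, n),  # number of pines
})

# Derived explanatory variables as defined on the slide.
tract["X1"] = tract["mean_height"]
tract["X2"] = tract["age"] * tract["n_pines"]
tract["X3"] = tract["mean_height"] / tract["n_pines"]

# A synthetic response so the later sketches can be run end to end.
tract["MDBH"] = 3.9 + 33 * tract["X3"] + rng.normal(0, 0.4, n)
```

The later sketches assume this `tract` DataFrame is in scope.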

Forward Selection Begin with no variables in the model. At each step check to see if you can add a variable to the model. If you can, add the variable. If not, stop.
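
A minimal sketch of that loop in Python, assuming a pandas DataFrame like `tract` above and using statsmodels t-tests to decide whether a candidate "can be added" (the 0.05 cutoff is an assumed convention, not stated on the slides):

```python
import statsmodels.api as sm

def forward_select(df, response, alpha=0.05):
    """Greedy forward selection: at each step add the candidate whose t-test
    P-value (given the variables already selected) is smallest, and stop when
    no remaining candidate is significant at level alpha."""
    selected = []
    candidates = [c for c in df.columns if c != response]
    while candidates:
        pvals = {}
        for x in candidates:
            X = sm.add_constant(df[selected + [x]])
            fit = sm.OLS(df[response], X).fit()
            pvals[x] = fit.pvalues[x]      # P-value for the newly added variable
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:           # nothing adds significantly: stop
            break
        selected.append(best)
        candidates.remove(best)
    return selected

print(forward_select(tract[["MDBH", "X1", "X2", "X3"]], "MDBH"))
```

For a single candidate, the t-test on its slope is equivalent to the test of its correlation with the response, so the first pass of this loop matches Step 1 below.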

Forward Selection – Step 1 Select the variable that has the highest correlation with the response. If this correlation is statistically significant, add the variable to the model.

JMP: Multivariate Methods > Multivariate. Put MDBH, X1, X2, and X3 in the Y, Columns box.
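
A rough Python equivalent of that JMP step, using the hypothetical `tract` DataFrame from the earlier sketch:

```python
# Correlation matrix of the response and the three explanatory variables,
# analogous to the output of JMP's Multivariate platform.
print(tract[["MDBH", "X1", "X2", "X3"]].corr())
```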

Correlation with response

Comment The explanatory variable X3 has the highest correlation with the response MDBH: r = 0.8404. The correlation between X3 and MDBH is statistically significant (Signif Prob < 0.0001, a small P-value).

Step 1 – Action Fit the simple linear regression of MDBH on X3. Predicted MDBH = 3.896 + 32.937*X3, R² = 0.7063, RMSE = 0.4117.
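
A sketch of the same fit with statsmodels (the printed numbers match the slide only when run on the original cruising-timber data, which is not reproduced here):

```python
import statsmodels.formula.api as smf

slr = smf.ols("MDBH ~ X3", data=tract).fit()
print(slr.params)              # intercept and slope (slides: 3.896 and 32.937)
print(slr.rsquared)            # R-squared (slides: 0.7063)
print(slr.mse_resid ** 0.5)    # root mean squared error (slides: 0.4117)
```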

SLR of MDBH on X3 Test of Model Utility: F = 43.2886, P-value < 0.0001. Statistical Significance of X3: t = 6.58, P-value < 0.0001. Exactly the same as the test for significant correlation.

Can we do better? Can we explain more variation in MDBH by adding one of the other variables to the model with X3? Will that addition be statistically significant?

Forward Selection – Step 2 Which variable should we add, X1 or X2? How can we decide?

Correlation among explanatory variables

Multicollinearity Because some explanatory variables are correlated, they may carry overlapping information about the response. You can't rely on the simple correlations between the explanatory variables and the response to tell you which variable to add.
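
Beyond the pairwise correlations among the explanatory variables, a standard diagnostic for this kind of overlap (not used on the slides) is the variance inflation factor; a sketch with statsmodels:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(tract[["X1", "X2", "X3"]])
for i, name in enumerate(X.columns):
    if name != "const":
        # VIF measures how much a coefficient's variance is inflated by
        # correlation with the other explanatory variables.
        print(name, variance_inflation_factor(X.values, i))
```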

Forward Selection – Step 2 Look at partial residual plots. Determine statistical significance.

Partial Residual Plots Look at the residuals from the SLR of Y on X3 plotted against the other variables once the overlapping information with X3 has been removed.

How is this done?
Fit MDBH versus X3 and obtain residuals – Resid(Y on X3).
Fit X1 versus X3 and obtain residuals – Resid(X1 on X3).
Fit X2 versus X3 and obtain residuals – Resid(X2 on X3).
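
A sketch of those three auxiliary fits and the correlations among their residuals (same hypothetical `tract` DataFrame; the table on the next slide shows the values from the original data):

```python
import pandas as pd
import statsmodels.formula.api as smf

resid = pd.DataFrame({
    "Resid(YonX3)":  smf.ols("MDBH ~ X3", data=tract).fit().resid,
    "Resid(X1onX3)": smf.ols("X1 ~ X3",   data=tract).fit().resid,
    "Resid(X2onX3)": smf.ols("X2 ~ X3",   data=tract).fit().resid,
})
print(resid.corr())
```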

Correlations

                Resid(YonX3)   Resid(X1onX3)   Resid(X2onX3)
Resid(YonX3)        1.0000         0.5726          0.3636
Resid(X1onX3)       0.5726         1.0000          0.9320
Resid(X2onX3)       0.3636         0.9320          1.0000

Comment The residuals (unexplained variation in the response) from the SLR of MDBH on X3 have the highest correlation with X1 once we have adjusted for the overlapping information with X3.

Statistical Significance Does X1 add significantly to the model that already contains X3? t = 2.88, P-value = 0.0104 F = 8.29, P-value = 0.0104 Because the P-value is small, X1 adds significantly to the model with X3.
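
In Python, the two equivalent tests (the t-test for X1 in the larger model and the partial F-test comparing the nested fits) could be sketched as:

```python
import statsmodels.formula.api as smf

reduced = smf.ols("MDBH ~ X3", data=tract).fit()
full    = smf.ols("MDBH ~ X3 + X1", data=tract).fit()

print(full.tvalues["X1"], full.pvalues["X1"])  # t-test for X1 given X3 (slides: t = 2.88, P = 0.0104)
print(full.compare_f_test(reduced))            # (F, P-value, df diff)  (slides: F = 8.29, P = 0.0104)
```

With a single added variable the two tests agree exactly: F = t².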

Summary Step 1 – add X3 Step 2 – add X1 to X3 Can we do better?

Forward Selection – Step 3 Does X2 add significantly to the model that already contains X3 and X1? t = –2.79, P-value = 0.0131 F = 7.78, P-value = 0.0131 Because the P-value is small, X2 adds significantly to the model with X3 and X1.

Summary Step 1 – add X3 Step 2 – add X1 to X3 Step 3 – add X2 to X1 and X3. R² = 0.867.

Summary At each step the variable being added is statistically significant. Has the forward selection procedure found the “best” model?

“Best” Model? The model with all three variables is useful: F = 34.83, P-value < 0.0001. But the variable X3 does not add significantly to the model with just X1 and X2: t = 0.41, P-value = 0.6844. So even though every forward selection step added a significant variable, the first variable chosen (X3) is redundant once X1 and X2 are in the model; forward selection has not found the most parsimonious model.
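
A sketch of that final check, fitting the full three-variable model and testing whether X3 still earns its place (again, the slide's numbers come from the original data):

```python
import statsmodels.formula.api as smf

full = smf.ols("MDBH ~ X1 + X2 + X3", data=tract).fit()
print(full.fvalue, full.f_pvalue)              # overall model utility (slides: F = 34.83, P < 0.0001)
print(full.tvalues["X3"], full.pvalues["X3"])  # X3 given X1 and X2    (slides: t = 0.41, P = 0.6844)

# Comparing against the model without X3 leads to the same conclusion.
reduced = smf.ols("MDBH ~ X1 + X2", data=tract).fit()
print(full.compare_f_test(reduced))
```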