Multiple Regression Selecting the Best Equation

Techniques for Selecting the "Best" Regression Equation

The best regression equation is not necessarily the one that explains the most variance in $Y$ (the highest $R^2$); that equation is always the one with all the variables included. The best equation should also be simple and interpretable, i.e. contain a small number of variables. Simplicity (interpretability) and reliability are opposing criteria, and the best equation is a compromise between the two.
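To see why the highest $R^2$ alone cannot pick the "best" equation, the sketch below compares $R^2$ with the adjusted $R^2$ (the $R_a^2$ listed among the best-subset criteria), which penalizes extra variables. The function names and the numbers in the example are assumptions made up for illustration; they are not taken from the cement data.

```python
def r_squared(rss, tss):
    """Proportion of the total sum of squares explained by the model."""
    return 1.0 - rss / tss


def adjusted_r_squared(rss, tss, n, p):
    """R^2 adjusted for the number of predictors p (n observations)."""
    return 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))


# Hypothetical values: adding two variables lowers RSS only slightly.
tss, n = 100.0, 13
small = {"p": 2, "rss": 3.0}
large = {"p": 4, "rss": 2.9}

for model in (small, large):
    print(model["p"],
          round(r_squared(model["rss"], tss), 4),
          round(adjusted_r_squared(model["rss"], tss, n, model["p"]), 4))
```

In this made-up case the larger model has the higher $R^2$ but the lower adjusted $R^2$, which is the compromise between fit and simplicity described above.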

We will discuss several strategies for selecting the best equation:
1. All Possible Regressions. Uses $R^2$, $s^2$ and Mallows $C_p$, where $C_p = \dfrac{RSS_p}{s^2_{\text{complete}}} - [n - 2(p+1)]$.
2. "Best Subset" Regression. Uses $R^2$, $R_a^2$ and Mallows $C_p$.
3. Backward Elimination
4. Stepwise Regression

An Example

In this example the following four chemicals are measured:
$X_1$ = amount of tricalcium aluminate, $3\,\mathrm{CaO} \cdot \mathrm{Al_2O_3}$
$X_2$ = amount of tricalcium silicate, $3\,\mathrm{CaO} \cdot \mathrm{SiO_2}$
$X_3$ = amount of tetracalcium alumino ferrite, $4\,\mathrm{CaO} \cdot \mathrm{Al_2O_3} \cdot \mathrm{Fe_2O_3}$
$X_4$ = amount of dicalcium silicate, $2\,\mathrm{CaO} \cdot \mathrm{SiO_2}$
$Y$ = heat evolved in calories per gram of cement.

The data are given in a table with columns $X_1$, $X_2$, $X_3$, $X_4$ and $Y$ (table values not preserved in this transcript).

I. All Possible Regressions

Suppose we have the $p$ independent variables $X_1, X_2, \ldots, X_p$. Then there are $2^p$ subsets of variables.

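Variables in equation     Model (shown for $p = 3$)
no variables              $Y = \beta_0 + \varepsilon$
$X_1$                     $Y = \beta_0 + \beta_1 X_1 + \varepsilon$
$X_2$                     $Y = \beta_0 + \beta_2 X_2 + \varepsilon$
$X_3$                     $Y = \beta_0 + \beta_3 X_3 + \varepsilon$
$X_1, X_2$                $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
$X_1, X_3$                $Y = \beta_0 + \beta_1 X_1 + \beta_3 X_3 + \varepsilon$
$X_2, X_3$                $Y = \beta_0 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$
$X_1, X_2, X_3$           $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$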
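The enumeration of candidate models can be sketched in a few lines. This is only an illustration of the counting argument ($2^p$ subsets, including the empty one); the helper name `all_subsets` is an assumption, not part of any package.

```python
from itertools import combinations


def all_subsets(variables):
    """Return every subset of the candidate variables, smallest first."""
    subsets = []
    for size in range(len(variables) + 1):
        subsets.extend(combinations(variables, size))
    return subsets


subsets = all_subsets(["X1", "X2", "X3"])
for subset in subsets:
    print(subset if subset else "no variables")
print(len(subsets), "runs in total")
```

With four candidate variables, as in the cement example, the same call yields 16 runs; the count doubles with every additional variable.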

Use of $R^2$

1. Assume we carry out $2^p$ runs, one for each of the subsets. Divide the runs into the following sets:
   Set 0: no variables
   Set 1: one independent variable
   ...
   Set p: $p$ independent variables
2. Order the runs in each set according to $R^2$.
3. Examine the leaders in each set, looking for consistent patterns; take into account the correlation between independent variables.
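The grouping and ordering steps above might look like the following sketch. It assumes the $R^2$ of every run has already been computed and stored in a dictionary keyed by the subset of variables; the helper name and the $R^2$ values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def leaders_by_set(r2_by_subset, top=2):
    """Group runs by number of variables; keep the `top` runs per set,
    ordered by decreasing R^2."""
    sets = defaultdict(list)
    for subset, r2 in r2_by_subset.items():
        sets[len(subset)].append((r2, subset))
    return {
        size: [subset for _, subset in sorted(runs, reverse=True)[:top]]
        for size, runs in sorted(sets.items())
    }


# Hypothetical R^2 values for p = 2 (not from the cement data).
runs = {
    (): 0.0,
    ("X1",): 0.60,
    ("X2",): 0.40,
    ("X1", "X2"): 0.75,
}
print(leaders_by_set(runs))
```

Inspecting the leaders of each set side by side is what makes patterns visible, for example one variable that appears in every leading run.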

Example (k = 4): $X_1, X_2, X_3, X_4$

Variables in the leading runs      $100 R^2$ %
Set 1: X…                          … %
Set 2: X1, X…                      … %
       X1, X…                      … %
Set 3: X1, X2, X…                  … %
Set 4: X1, X2, X3, X4              … %

Examination of the correlation coefficients reveals a high correlation between $X_1$ and $X_3$ ($r_{13} = \ldots$) and between $X_2$ and $X_4$ ($r_{24} = \ldots$).

Best equation: $Y = \beta_0 + \beta_1 X_1 + \beta_4 X_4 + \varepsilon$

Use of $R^2$: the number of variables required, $p$, coincides with the point where $R^2$ begins to level out.

Use of the Residual Mean Square (RMS) ($s^2$)

When all of the variables having a non-zero effect have been included in the model, the residual mean square is an estimate of $\sigma^2$. If "significant" variables have been left out, the RMS will be biased upward.

No. of variables $p$     RMS $s^2(p)$                   Average $s^2(p)$
1                        …, 82.39, …, …                 …
2                        …*, 122.71, 7.48**, …          …
3                        …, 5.33, 5.65, …               …

*  run $X_1, X_2$
** run $X_1, X_4$

$s^2$ is approximately 6.

Use of $s^2$: the number of variables required, $p$, coincides with the point where $s^2$ levels out.

Use of Mallows $C_p$

If the equation with $p$ variables is adequate, then both $s^2_{\text{complete}}$ and $RSS_p/(n-p-1)$ will be estimating $\sigma^2$. If "significant" variables have been left out, then the RMS, $RSS_p/(n-p-1)$, will be biased upward.

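Then, for an adequate equation,

$C_p = \dfrac{RSS_p}{s^2_{\text{complete}}} - [n - 2(p+1)] \approx \dfrac{(n-p-1)\,\sigma^2}{\sigma^2} - [n - 2(p+1)] = p + 1.$

Thus if we plot, for each run, $C_p$ vs $p$ and look for $C_p$ close to $p + 1$, we will be able to identify models giving a reasonable fit.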
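The $C_p$ criterion is easy to check numerically. The sketch below implements the formula $C_p = RSS_p/s^2_{\text{complete}} - [n - 2(p+1)]$ given earlier; the function name and the input values are assumptions chosen for illustration, not results from the cement data.

```python
def mallows_cp(rss_p, s2_complete, n, p):
    """Mallows C_p for a run with p variables.

    rss_p       : residual sum of squares of the p-variable equation
    s2_complete : residual mean square of the complete equation
    n           : number of observations
    """
    return rss_p / s2_complete - (n - 2 * (p + 1))


n, p, sigma2 = 13, 2, 6.0

# Adequate equation: RSS_p is about (n - p - 1) * sigma^2, so C_p is p + 1.
adequate = mallows_cp((n - p - 1) * sigma2, sigma2, n, p)

# Equation missing important variables: RSS_p is inflated, so C_p >> p + 1.
inadequate = mallows_cp(2 * (n - p - 1) * sigma2, sigma2, n, p)

print(adequate, inadequate, p + 1)
```

Runs whose $C_p$ sits well above $p + 1$ are the ones where the residual mean square is biased upward by omitted variables.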

Run                     $C_p$                           $p + 1$
no variables            …                               1
1, 2, 3, 4              202.5, 142.5, 315.2, …          2
12, 13, 14              2.7, 198.1, …                   3
23, 24, 34              62.4, 138.2, …                  3
123, 124, 134, 234      3.0, 3.0, 3.5, …                4

Use of $C_p$: the number of variables required, $p$, coincides with the point where $C_p$ becomes close to $p + 1$. (Plot: $C_p$ on the vertical axis against $p$.)

II. "Best Subset" Regression

This is similar to all possible regressions. If $p$, the number of variables, is large, then the number of runs, $2^p$, could be extremely large. In this algorithm the user supplies the value $K$ and the algorithm identifies the best $K$ subsets of $X_1, X_2, \ldots, X_p$ for predicting $Y$.

III. Backward Elimination

In this procedure the complete regression equation, containing all the variables $X_1, X_2, \ldots, X_p$, is determined first. Variables are then checked one at a time, and the least significant is dropped from the model at each stage. The procedure terminates when all of the variables remaining in the equation provide a significant contribution to the prediction of the dependent variable $Y$.

The precise algorithm proceeds as follows:

1. Fit a regression equation containing all the variables.

2. Compute a partial F-test for each of the independent variables still in the equation. The partial F statistic is

   $F = \dfrac{RSS_2 - RSS_1}{MSE_1}$

   where $RSS_1$ = the residual sum of squares with all variables that are presently in the equation, $RSS_2$ = the residual sum of squares with one of the variables removed, and $MSE_1$ = the mean square for error with all variables that are presently in the equation.

3. Compare the lowest partial F value, $F_{\text{Lowest}}$, with $F_\alpha$ for some pre-specified $\alpha$. If $F_{\text{Lowest}} \le F_\alpha$, remove that variable and return to step 2. If $F_{\text{Lowest}} > F_\alpha$, accept the equation as it stands.
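The three steps above can be sketched in plain Python. This is a minimal illustration, not the routine of any statistical package: the helper names (`rss_of_fit`, `backward_elimination`), the toy data and the small table of critical values are all assumptions. The least-squares fit solves the normal equations directly, which is adequate for a small, well-conditioned example but not for production use.

```python
def rss_of_fit(columns, y):
    """Residual sum of squares from regressing y on the given columns
    plus an intercept (normal equations solved by Gauss-Jordan)."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(row[r] * row[c] for row in design) for c in range(k)]
         + [sum(row[r] * yi for row, yi in zip(design, y))]
         for r in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    beta = [a[r][k] / a[r][r] for r in range(k)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def backward_elimination(data, y, f_critical):
    """data: dict name -> values; f_critical: dict mapping the error
    degrees of freedom to F_alpha(1, df). Returns the retained names."""
    current = list(data)
    n = len(y)
    while current:
        rss1 = rss_of_fit([data[v] for v in current], y)      # step 1
        df_error = n - len(current) - 1
        mse1 = rss1 / df_error
        partial_f = {}
        for v in current:                                     # step 2
            rss2 = rss_of_fit([data[w] for w in current if w != v], y)
            partial_f[v] = (rss2 - rss1) / mse1
        weakest = min(partial_f, key=partial_f.get)
        if partial_f[weakest] > f_critical[df_error]:         # step 3
            break
        current.remove(weakest)
    return current


# Toy data (hypothetical): y depends on X1 only; X2 is unrelated.
data = {
    "X1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "X2": [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0],
}
y = [5.1, 6.9, 9.1, 10.9, 12.9, 15.1, 16.9, 19.1]
f_table = {5: 4.06, 6: 3.78}  # approximate F_0.10(1, df) table values
print(backward_elimination(data, y, f_table))
```

Passing the critical values as a lookup keyed by the error degrees of freedom mirrors the worked example, where the comparison value changes as variables are removed.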

Example (k = 4) (same example as before): $X_1, X_2, X_3, X_4$

1. $X_1, X_2, X_3, X_4$ in the equation. The lowest partial $F = \ldots$ (for $X_3$) is compared with $F_\alpha(1, 8) = 3.46$ for $\alpha = 0.10$. Remove $X_3$.

2. $X_1, X_2, X_4$ in the equation. The lowest partial $F = 1.86$ (for $X_4$) is compared with $F_\alpha(1, 9) = 3.36$ for $\alpha = 0.10$. Remove $X_4$.

3. $X_1, X_2$ in the equation. The partial F values for both $X_1$ and $X_2$ exceed $F_\alpha(1, 10) = 3.29$ for $\alpha = 0.10$, so the equation is accepted as it stands: $Y = \ldots + \ldots X_1 + \ldots X_2$.

Note: F to Remove = partial F.

IV. Stepwise Regression

In this procedure the regression equation starts with no variables in the model. Variables are then checked one at a time, using the partial correlation coefficient as a measure of importance in predicting the dependent variable $Y$. At each stage the variable with the highest significant partial correlation coefficient is added to the model. Once this has been done, the partial F statistic is computed for all variables now in the model, to check whether any of the variables previously added can now be deleted.

This procedure is continued until no further variables can be added to or deleted from the model. The partial correlation coefficient for a given variable is the correlation between the given variable and the response when the independent variables presently in the equation are held fixed. Equivalently, it is the correlation between two sets of residuals: those from regressing the response on the variables presently in the equation, and those from regressing the given variable on the same variables.
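For the case of a single variable already in the equation, the partial correlation can be computed in two ways that should agree: from residuals, or from the standard first-order formula based on the three pairwise correlations. The sketch below checks this on made-up data; the function names and the numbers are assumptions for illustration only.

```python
from math import sqrt


def pearson(a, b):
    """Ordinary product-moment correlation between two lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / sqrt(saa * sbb)


def residuals(target, given):
    """Residuals from the simple linear regression of target on given."""
    mt, mg = sum(target) / len(target), sum(given) / len(given)
    slope = (sum((g - mg) * (t - mt) for g, t in zip(given, target))
             / sum((g - mg) ** 2 for g in given))
    return [(t - mt) - slope * (g - mg) for g, t in zip(given, target)]


def partial_corr_residuals(y, x, z):
    """Correlation of y and x with z held fixed, via residuals."""
    return pearson(residuals(y, z), residuals(x, z))


def partial_corr_formula(y, x, z):
    """Same quantity from the pairwise correlations."""
    ryx, ryz, rxz = pearson(y, x), pearson(y, z), pearson(x, z)
    return (ryx - ryz * rxz) / sqrt((1 - ryz ** 2) * (1 - rxz ** 2))


# Hypothetical data: z plays the role of the variable already entered.
y = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
x = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
z = [1.0, 2.0, 3.0, 4.0, 5.0, 7.0]
print(partial_corr_residuals(y, x, z), partial_corr_formula(y, x, z))
```

With more than one variable already in the equation, the residual version generalizes directly (regress on all of them), whereas the closed-form formula has to be applied recursively.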

Example (k = 4) (same example as before): $X_1, X_2, X_3, X_4$

1. With no variables in the equation, the correlation of each independent variable with the dependent variable $Y$ is computed. The highest significant correlation ($r = \ldots$) is with variable $X_4$, so the decision is made to include $X_4$. The regression of $Y$ on $X_4$ is significant, thus we keep $X_4$.

2. Compute the partial correlation coefficients of $Y$ with all other independent variables given $X_4$ in the equation. The highest partial correlation is with the variable $X_1$ ($[r_{Y1.4}]^2 = 0.915$). Thus the decision is made to include $X_1$.

Regress $Y$ on $X_1, X_4$: $R^2 = 0.972$, $F = \ldots$

Check to see if variables in the equation can be eliminated:
For $X_1$ the partial F value $= \ldots$ ($F_{0.10}(1, 8) = 3.46$). Retain $X_1$.
For $X_4$ the partial F value $= \ldots$ ($F_{0.10}(1, 8) = 3.46$). Retain $X_4$.

3. Compute the partial correlation coefficients of $Y$ with all other independent variables given $X_4$ and $X_1$ in the equation. The highest partial correlation is with the variable $X_2$ ($[r_{Y2.14}]^2 = 0.358$). Thus the decision is made to include $X_2$.

Regress $Y$ on $X_1, X_2, X_4$: $R^2 = \ldots$

Check to see if variables in the equation can be eliminated:
The lowest partial F value $= 1.863$, for $X_4$ ($F_{0.10}(1, 9) = 3.36$). Remove $X_4$, leaving $X_1$ and $X_2$.

Examples Using Statistical Packages