Multiple Regression: Selecting the Best Equation

Techniques for Selecting the "Best" Regression Equation

The best regression equation is not necessarily the equation that explains the most variance in Y (the highest R²); that will always be the equation with all the variables included. The best equation should also be simple and interpretable, i.e. contain a small number of variables. Simplicity (interpretability) and reliability of fit are opposing criteria, and the best equation is a compromise between the two.

We will discuss several strategies for selecting the best equation:
1. All Possible Regressions - uses R², s², and Mallows Cp, where Cp = RSSp / s²complete − [n − 2(p + 1)]
2. "Best Subset" Regression - uses R², adjusted R² (Ra²), and Mallows Cp
3. Backward Elimination
4. Stepwise Regression

The Model

The general linear model with intercept β0:
Y = β0 + β1X1 + β2X2 + ... + βpXp + ε

The ANOVA table:

Source           SS              d.f.
Regression       SS Regression   p
Error            SS Error        n − p − 1
Adjusted Total   SS Total        n − 1
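As a minimal sketch of how this decomposition can be computed, assuming Python with statsmodels (the data here are synthetic stand-ins, not the cement measurements used in the example below):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in data; in practice the cement data from the example would be used.
rng = np.random.default_rng(0)
n, p = 13, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, 0.7, 0.1, -0.5]) + rng.normal(scale=2.0, size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()

ss_total = np.sum((y - y.mean()) ** 2)   # adjusted total SS, d.f. = n - 1
ss_error = fit.ssr                        # residual SS,       d.f. = n - p - 1
ss_reg = ss_total - ss_error              # regression SS,     d.f. = p

print(f"SS_Regression = {ss_reg:.2f} (df = {p})")
print(f"SS_Error      = {ss_error:.2f} (df = {n - p - 1})")
print(f"SS_Total      = {ss_total:.2f} (df = {n - 1})")
```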

An Example

In this example the following four chemicals are measured when curing cement:
X1 = amount of tricalcium aluminate, 3CaO·Al2O3
X2 = amount of tricalcium silicate, 3CaO·SiO2
X3 = amount of tetracalcium alumino ferrite, 4CaO·Al2O3·Fe2O3
X4 = amount of dicalcium silicate, 2CaO·SiO2
Y = heat evolved in calories per gram of cement.

The data consist of 13 observations on X1, X2, X3, X4 and Y (table of values shown on the slide).

I. All Possible Regressions

Suppose we have p independent variables X1, X2, ..., Xp. Then there are 2^p subsets of variables.

Variables in Equation    Model
no variables             Y = β0 + ε
X1                       Y = β0 + β1X1 + ε
X2                       Y = β0 + β2X2 + ε
X3                       Y = β0 + β3X3 + ε
X1, X2                   Y = β0 + β1X1 + β2X2 + ε
X1, X3                   Y = β0 + β1X1 + β3X3 + ε
X2, X3                   Y = β0 + β2X2 + β3X3 + ε
X1, X2, X3               Y = β0 + β1X1 + β2X2 + β3X3 + ε

Use of R²

1. Assume we carry out the 2^p runs, one for each of the subsets. Divide the runs into the following sets:
   Set 0: no variables
   Set 1: one independent variable
   ...
   Set p: p independent variables
2. Order the runs in each set according to R².
3. Examine the leaders in each set, looking for consistent patterns; take into account the correlation between independent variables.
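A minimal sketch of this all-subsets scan, assuming Python with statsmodels and pandas; the data below are synthetic stand-ins for the cement measurements, and the function works on any DataFrame of predictors:

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.api as sm

def all_subsets_r2(X, y):
    """Fit every subset of the columns of X and return R^2 per subset, grouped by size."""
    results = []
    cols = list(X.columns)
    for size in range(len(cols) + 1):
        for subset in combinations(cols, size):
            exog = sm.add_constant(X[list(subset)]) if subset else np.ones((len(y), 1))
            fit = sm.OLS(y, exog).fit()
            results.append({"size": size, "variables": subset, "R2": fit.rsquared})
    return (pd.DataFrame(results)
              .sort_values(["size", "R2"], ascending=[True, False]))

# Demo with synthetic data standing in for the cement measurements.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(13, 4)), columns=["X1", "X2", "X3", "X4"])
y = 60 + 1.5 * X["X1"] + 0.5 * X["X2"] + rng.normal(scale=2, size=13)

print(all_subsets_r2(X, y).groupby("size").head(2))  # leaders in each set
```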

Example (k = 4): X1, X2, X3, X4

Variables in the leading runs, ordered by 100·R² (%):
Set 1: X4
Set 2: X1, X2
       X1, X4
Set 3: X1, X2, X4
Set 4: X1, X2, X3, X4

Examination of the correlation coefficients reveals a high correlation between X1 and X3 (r13) and between X2 and X4 (r24).

Best equation: Y = β0 + β1X1 + β4X4 + ε

Use of R²

The number of variables required, p, coincides with where R² begins to level out.

Use of the Residual Mean Square (RMS, s²)

When all of the variables having a non-zero effect have been included in the model, the residual mean square is an estimate of σ². If "significant" variables have been left out, the RMS will be biased upward.

No. of variables p    RMS s²(p)                    Average s²(p)
1                     ..., 82.39, ...              ...
2                     ...*, 122.71, 7.48**, ...    ...
3                     ..., 5.33, 5.65, ...         ...

* run X1, X2    ** run X1, X4

s² is approximately 6.

Use of s²

The number of variables required, p, coincides with where s² levels out.
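A sketch of how these residual mean squares could be tabulated, assuming the same Python/statsmodels setup as above (synthetic data again stands in for the cement measurements):

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(13, 4)), columns=["X1", "X2", "X3", "X4"])
y = 60 + 1.5 * X["X1"] + 0.5 * X["X2"] + rng.normal(scale=2, size=13)

rows = []
for size in range(1, X.shape[1] + 1):
    for subset in combinations(X.columns, size):
        fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
        rows.append({"p": size, "variables": subset, "s2": fit.mse_resid})  # RSS_p / (n - p - 1)

table = pd.DataFrame(rows)
print(table.sort_values(["p", "s2"]))
print(table.groupby("p")["s2"].mean())  # average s^2(p) for each number of variables
```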

Use of Mallows Cp

If the equation with p variables is adequate, then both s²complete and RSSp/(n − p − 1) will be estimating σ². If "significant" variables have been left out, the RMS will be biased upward.

Then, if the model with p variables is adequate, E[Cp] ≈ p + 1 (since E[RSSp] = (n − p − 1)σ² in that case). Thus if we plot, for each run, Cp vs p and look for Cp close to p + 1, we will be able to identify models giving a reasonable fit.
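A minimal sketch of the Cp calculation under the formula above, assuming Python/statsmodels with synthetic stand-in data:

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(13, 4)), columns=["X1", "X2", "X3", "X4"])
y = 60 + 1.5 * X["X1"] + 0.5 * X["X2"] + rng.normal(scale=2, size=13)

n = len(y)
s2_complete = sm.OLS(y, sm.add_constant(X)).fit().mse_resid  # s^2 from the full model

rows = []
for size in range(X.shape[1] + 1):
    for subset in combinations(X.columns, size):
        exog = sm.add_constant(X[list(subset)]) if subset else np.ones((n, 1))
        rss_p = sm.OLS(y, exog).fit().ssr
        cp = rss_p / s2_complete - (n - 2 * (size + 1))
        rows.append({"p": size, "variables": subset, "Cp": cp, "p + 1": size + 1})

print(pd.DataFrame(rows).sort_values("p"))  # look for runs with Cp close to p + 1
```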

Run                     Cp                          p + 1
no variables            ...                         1
1, 2, 3, 4              202.5, 142.5, 315.2, ...    2
12, 13, 14              2.7, 198.1, ...             3
23, 24, 34              62.4, 138.2, ...            3
123, 124, 134, 234      3.0, 3.0, 3.5, ...          4

Use of Cp

The number of variables required, p, coincides with where Cp becomes close to p + 1.

Methods to Select the Best Equation: Summary

I. All Possible Regressions

Suppose we have p independent variables X1, X2, ..., Xp. Then there are 2^p subsets of variables. In all possible subsets regression, we regress Y on each subset of X1, X2, ..., Xp.

1. We carry out the 2^p runs, one for each of the subsets. Divide the runs into the following sets:
   Set 0: no variables
   Set 1: one independent variable
   ...
   Set p: p independent variables
2. Order the runs in each set according to R², s², or Mallows Cp.
3. Examine the leaders in each set, looking for consistent patterns; take into account the correlation between independent variables.
4. Decide on the best equation.

Use of R²

The number of variables required, p, coincides with where R² begins to level out.

Use of s²

The number of variables required, p, coincides with where s² levels out.

Use of Cp

The number of variables required, p, coincides with where Cp becomes close to p + 1.

II "Best Subset" Regression Similar to all possible regressions. If p, the number of variables, is large then the number of runs, 2 p, performed could be extremely large. In this algorithm the user supplies the value K and the algorithm identifies the best K subsets of X 1, X 2,..., X p containing m variables for predicting Y.

III. Backward Elimination

In this procedure the complete regression equation is determined first, containing all the variables X1, X2, ..., Xp. Then the variables are checked one at a time, and the least significant one is dropped from the model at each stage. The procedure terminates when all of the variables remaining in the equation provide a significant contribution to the prediction of the dependent variable Y.

The precise algorithm proceeds as follows:

1. Fit a regression equation containing all p variables.

2. A partial F-test is computed for each of the independent variables still in the equation. The partial F statistic is

   F = (RSS2 − RSS1) / MSE1

where RSS1 = the residual sum of squares with all variables that are presently in the equation, RSS2 = the residual sum of squares with one of the variables removed, and MSE1 = the mean square for error with all variables that are presently in the equation.
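A small helper expressing this statistic, assuming Python/statsmodels fits of the full and reduced models (the names and the synthetic demo data are illustrative, not the cement data):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def partial_f(full_fit, reduced_fit):
    """Partial F for the single variable dropped from full_fit to obtain reduced_fit.

    F = (RSS2 - RSS1) / MSE1, with RSS1 and MSE1 from the full model and RSS2 from
    the model with one variable removed (1 numerator degree of freedom).
    """
    return (reduced_fit.ssr - full_fit.ssr) / full_fit.mse_resid

# Demo with synthetic data.
rng = np.random.default_rng(5)
X = pd.DataFrame(rng.normal(size=(13, 4)), columns=["X1", "X2", "X3", "X4"])
y = 60 + 1.5 * X["X1"] + 0.5 * X["X2"] + rng.normal(scale=2, size=13)

full = sm.OLS(y, sm.add_constant(X)).fit()
reduced = sm.OLS(y, sm.add_constant(X.drop(columns="X3"))).fit()
print(partial_f(full, reduced))  # partial F for X3
```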

3. The lowest partial F value is compared with Fα for some pre-specified α. If F_Lowest ≤ Fα, remove that variable and return to step 2. If F_Lowest > Fα, accept the equation as it stands.

If α is small, Fα will be large, making it easier to remove variables. Increasing the value of α decreases the value of Fα, making it harder to remove variables and resulting in an equation with more variables.

Thus, using a value of α that is too small may result in important variables being missed. It may be useful to try several values of α to determine which variables are needed and which are close to being significant.
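A minimal sketch of this backward-elimination loop, assuming Python with statsmodels and scipy; synthetic data stands in for the cement measurements, and the critical value comes from scipy.stats rather than an F table:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

def backward_elimination(X, y, alpha=0.10):
    """Drop the variable with the lowest partial F until all remaining F's exceed F_alpha."""
    selected = list(X.columns)
    while selected:
        full = sm.OLS(y, sm.add_constant(X[selected])).fit()
        # Partial F (= F to remove) for each variable, 1 numerator d.f. each.
        partial_f = {}
        for v in selected:
            others = [c for c in selected if c != v]
            exog = sm.add_constant(X[others]) if others else np.ones((len(y), 1))
            rss_reduced = sm.OLS(y, exog).fit().ssr
            partial_f[v] = (rss_reduced - full.ssr) / full.mse_resid
        worst = min(partial_f, key=partial_f.get)
        f_crit = stats.f.ppf(1 - alpha, 1, full.df_resid)
        if partial_f[worst] > f_crit:
            break  # every remaining variable is significant
        selected.remove(worst)
    return selected

rng = np.random.default_rng(6)
X = pd.DataFrame(rng.normal(size=(13, 4)), columns=["X1", "X2", "X3", "X4"])
y = 60 + 1.5 * X["X1"] + 0.5 * X["X2"] + rng.normal(scale=2, size=13)
print(backward_elimination(X, y))
```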

Example (k = 4, same example as before): X1, X2, X3, X4

1. X1, X2, X3, X4 in the equation. The lowest partial F (for X3) is compared with F0.10(1,8) = 3.46 for α = 0.10. It falls below this value, so remove X3.

2. X1, X2, X4 in the equation. The lowest partial F = 1.86 (for X4) is compared with F0.10(1,9) = 3.36 for α = 0.10. Since 1.86 < 3.36, remove X4.

3. X1, X2 in the equation. The partial F values for both X1 and X2 exceed F0.10(1,10) = 3.29 for α = 0.10, so the fitted equation Ŷ = b0 + b1X1 + b2X2 is accepted as it stands. Note: F to remove = partial F.

IV. Stepwise Regression

This procedure starts with a regression equation containing no variables. Variables are then checked one at a time, using the partial correlation coefficient (or an equivalent statistic, F to enter) as a measure of importance in predicting the dependent variable Y. At each stage the variable with the highest significant partial correlation coefficient is added to the model. Once this has been done, the partial F statistic (F to remove) is computed for all variables now in the model, to check whether any of the variables previously added can now be deleted.

This procedure continues until no further variables can be added to or deleted from the model. The partial correlation coefficient for a given variable is the correlation between that variable and the response when the independent variables currently in the equation are held fixed. It is also the correlation between that variable and the residuals computed from fitting an equation with the independent variables currently in the equation.
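A small check of this residual-based view, sketched in Python with numpy/statsmodels on synthetic data (a hypothetical setup, not the cement data); here the partial correlation is computed as the correlation between the two sets of residuals:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
X = pd.DataFrame(rng.normal(size=(13, 4)), columns=["X1", "X2", "X3", "X4"])
y = 60 + 1.5 * X["X1"] + 0.5 * X["X2"] + rng.normal(scale=2, size=13)

in_model = ["X4"]   # variables currently in the equation
candidate = "X1"    # variable being considered for entry

# Residuals of y and of the candidate after regressing each on the variables in the model.
res_y = sm.OLS(y, sm.add_constant(X[in_model])).fit().resid
res_x = sm.OLS(X[candidate], sm.add_constant(X[in_model])).fit().resid

partial_r = np.corrcoef(res_y, res_x)[0, 1]
print(partial_r)  # partial correlation of X1 with Y, given X4
```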

Equivalent Statistics: F to enter and F to remove

F to enter is the partial F statistic for a variable not yet in the equation (testing its contribution if it were added); F to remove is the partial F statistic for a variable already in the equation (testing its contribution if it were dropped).
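A compact sketch of the whole forward-addition/backward-check loop, assuming Python with statsmodels and scipy; F to enter and F to remove are computed as partial F statistics, and synthetic data stands in for the cement measurements:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

def rss(X, y, cols):
    """Residual sum of squares for the regression of y on the given columns (plus intercept)."""
    exog = sm.add_constant(X[cols]) if cols else np.ones((len(y), 1))
    return sm.OLS(y, exog).fit().ssr

def stepwise(X, y, alpha_enter=0.10, alpha_remove=0.10, max_steps=50):
    selected = []
    for _ in range(max_steps):  # cap the loop so the sketch always terminates
        changed = False

        # Forward step: add the candidate with the largest significant F to enter.
        candidates = [c for c in X.columns if c not in selected]
        if candidates:
            base_rss = rss(X, y, selected)
            f_enter = {}
            for c in candidates:
                fit = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
                f_enter[c] = (base_rss - fit.ssr) / fit.mse_resid
            best = max(f_enter, key=f_enter.get)
            df_resid = len(y) - len(selected) - 2  # residual d.f. after adding one variable
            if f_enter[best] > stats.f.ppf(1 - alpha_enter, 1, df_resid):
                selected.append(best)
                changed = True

        # Backward step: drop the variable with the smallest F to remove, if not significant.
        if selected:
            full = sm.OLS(y, sm.add_constant(X[selected])).fit()
            f_remove = {v: (rss(X, y, [c for c in selected if c != v]) - full.ssr) / full.mse_resid
                        for v in selected}
            worst = min(f_remove, key=f_remove.get)
            if f_remove[worst] < stats.f.ppf(1 - alpha_remove, 1, full.df_resid):
                selected.remove(worst)
                changed = True

        if not changed:
            break
    return selected

rng = np.random.default_rng(8)
X = pd.DataFrame(rng.normal(size=(13, 4)), columns=["X1", "X2", "X3", "X4"])
y = 60 + 1.5 * X["X1"] + 0.5 * X["X2"] + rng.normal(scale=2, size=13)
print(stepwise(X, y))
```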

Example (k = 4, same example as before): X1, X2, X3, X4

1. With no variables in the equation, the correlation of each independent variable with the dependent variable Y is computed. The highest significant correlation is with variable X4, so the decision is made to include X4. Regressing Y on X4 gives a significant fit, so we keep X4.

2. Compute the partial correlation coefficients of Y with all other independent variables, given X4 in the equation. The highest partial correlation is with the variable X1 ([rY1.4]² = 0.915). Thus the decision is made to include X1.

Regress Y on X1 and X4: R² = 0.972 with a significant overall F. Check whether variables in the equation can be eliminated: for X1 the partial F value exceeds F0.10(1,8) = 3.46, so retain X1; for X4 the partial F value exceeds F0.10(1,8) = 3.46, so retain X4.

3. Compute the partial correlation coefficients of Y with all other independent variables, given X4 and X1 in the equation. The highest partial correlation is with the variable X2 ([rY2.14]² = 0.358). Thus the decision is made to include X2. Regress Y on X1, X2 and X4, and check whether variables in the equation can be eliminated: the lowest partial F value is 1.863, for X4 (F0.10(1,9) = 3.36), so remove X4, leaving X1 and X2.

Examples Using Statistical Packages