business analytics II ▌appendix – regression performance: the R²


business analytics II ▌appendix – regression performance: the R², multicollinearity and the F-test
Managerial Economics & Decision Sciences Department
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II

session seven - appendix: regression performance

learning objectives
► define and interpret the R²
► understand the connection between multicollinearity and the R²
► the vif and testparm commands

readings
► (MSN) Chapter 7
► (CS) Dubuque Hot Dog

variations around the mean

► The diagram below shows a very simple case (just for illustrative purposes) of a linear regression of y (dependent variable) on x (independent variable). There are only three observations in the sample: the pairs (x1, y1), (x2, y2) and (x3, y3).
 the mean of the dependent variable is ȳ = (y1 + y2 + y3)/3
 the regression line is ŷ = b0 + b1·x, which generates the pairs (x1, ŷ1), (x2, ŷ2) and (x3, ŷ3)

[Figure: the three observations and the fitted regression line; horizontal axis: independent variable x, vertical axis: dependent variable y]

► We can identify two types of variations around the mean:
 dependent variable variation around its own mean: yi − ȳ
 model-based estimated variable variation around the mean: ŷi − ȳ

► Remember that the linear regression is supposed to explain how the mean of the dependent variable depends on x.

the R²

► For the overall regression it holds that:

 TSS = MSS + RSS

 where
  TSS = Σi (yi − ȳ)²  variation of the dependent variable around the mean (TSS – total sum of squares)
  MSS = Σi (ŷi − ȳ)²  variation of the model-based estimated variable around the mean (MSS – model sum of squares)
  RSS = Σi (yi − ŷi)²  variation of the dependent variable around the model-based estimated variable (RSS – residual sum of squares)

► The equation above says that the total variation (TSS) in the dependent variable is the sum of two components: one that is explained by the regression (MSS) and one that is unexplained by the regression model, i.e. the residual variation (RSS).

key concept: the R²
► R² = MSS/TSS is the fraction of the variation in the y variable, i.e. variation of y around its own mean, that is explained by the x-variables used in the regression, i.e. “explained by the regression model”.

the R²

► Let’s look back at the Dubuque regression:

Figure 1. Results for regression of MKTDUB on pdub, poscar, pbpreg and pbpbeef

    Source |       SS           df       MS         Number of obs =     113
 ----------+----------------------------------      F(4, 108)     =   30.00
     Model |  .012013954         4   .003003488     Prob > F      =  0.0000
  Residual |  .010811783       108   .000100109     R-squared     =  0.5263
 ----------+----------------------------------      Adj R-squared =  0.5088
     Total |  .022825737       112   .000203801     Root MSE      =  .01001

 ---------------------------------------------------------------------------
    MKTDUB |     Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 ----------+----------------------------------------------------------------
      pdub |  -.0007598   .0000809    -9.39   0.000    -.0009202   -.0005994
    poscar |   .0002622   .0000843     3.11   0.002     .0000952    .0004293
    pbpreg |   .0003473   .0003316     1.05   0.297     -.00031     .0010046
   pbpbeef |   .0001025   .0002938     0.35   0.728    -.0004798    .0006848
     _cons |   .0403026   .0141226     2.85   0.005     .0123092     .068296

Remark. The definition R² = MSS/TSS seems to imply that 52.63% of the variation in MKTDUB is explained by the variation generated by the independent variables. Pretty impressive!!! As a check (and a way to calculate R² “manually”) notice the numbers at the top left of the table: MSS = 0.012013954, RSS = 0.010811783 and TSS = 0.022825737, so MSS/TSS = 0.012013954/0.022825737 = 0.5263.
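This check can be reproduced directly in Stata using the scalars that regress stores in e(). A minimal do-file sketch (it assumes the Dubuque data, e.g. hotdog.dta, are in memory):

 * recompute R-squared manually from the stored sums of squares
 regress MKTDUB pdub poscar pbpreg pbpbeef
 display e(mss)/(e(mss) + e(rss))   // R-squared computed as MSS/TSS
 display e(r2)                      // R-squared as stored by regress; should match 0.5263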

the R²

► High R² does not guarantee accurate predictions
 • R² is only a relative, not an absolute, measure of accuracy. If there is a lot of variation in y, i.e. TSS is large, then even a regression with a high R² might have wide prediction intervals
 • “In-sample” vs. “out-of-sample” performance: the computed value of R² is based on the observations included in your regression
  - R² tells you how well the model fits the data used to run the regression and obtain the R²
  - this may be useful if the model that generates future observations is the same as the model that generated your data
  - if this is not the case, then your model may have a high R² but may be worthless for prediction

► High R² is not a sign of accomplishment
 • Regressions with high R² may be uninformative if your regression has picked up a well-known and uninteresting trend (regress annual per capita income on a time trend; have you actually learned anything?)
 • If we loosely interpret “trend,” we see that there are many situations where you can get a high R² without learning anything (say you run the regression Rebounds = β0 + β1·Height; then, with β1 > 0, the recommendation is to grow taller!)

the R²

► R² will increase (for sure!) by adding randomly chosen variables as independent variables
 • Add the following variable to the hotdog.dta file: z taking values 1, 2, … , 113.
 • Surprise: R² increases! Yet, variable z has nothing to do with (has no explanatory power for) MKTDUB. For this reason we should use an adjusted R² that basically adjusts for the number of independent variables (k) included in the model:

  adjusted R² = 1 − (1 − R²)·(n − 1)/(n − k − 1),  where n is the number of observations

 • a short Stata sketch of this experiment follows below

► Low R², by itself, is not a sign of failure
 • If there is not much variation in y, i.e. TSS is small, then even a regression with a low R² might have small prediction intervals
 • Sometimes, even a little extra predictive power can be valuable - think of the stock market, or forecasting landfall of hurricanes
 • The regression may be very useful in learning about the deterministic portion of a model if you can obtain precise coefficient estimates. A coefficient may have a small standard error and be highly statistically significant even if R² is low

► When deciding how many independent variables to include you can use the adjusted R²; nevertheless, the specific questions motivating your analysis usually suggest better, more relevant tools than R² (or its adjusted version).
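A minimal do-file sketch of the junk-variable experiment (assumes hotdog.dta is loaded; _n is Stata's built-in observation number):

 * add a junk regressor and watch R-squared rise while adjusted R-squared is penalized
 generate z = _n                              // z takes the values 1, 2, ..., 113
 regress MKTDUB pdub poscar pbpreg pbpbeef z
 display e(r2)                                // mechanically at least as large as 0.5263
 display e(r2_a)                              // adjusted R-squared penalizes the extra regressor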

multicollinearity revisited

► Back to the Dubuque Hot Dogs regression (to be able to run the following commands make sure the initial regression is the last regression run).

► We saw that pbpreg and pbpbeef are likely to be correlated and that might induce inflated standard errors for the coefficients. The command vif delivers a list of “variance inflation factors” for each coefficient:

Figure 1. Results for vif command

 . vif

     Variable |      VIF       1/VIF
 -------------+----------------------
       pbpreg |     25.97    0.038508
      pbpbeef |     25.15    0.039765
       poscar |      1.66    0.603208
         pdub |      1.36    0.733979
 -------------+----------------------
     Mean VIF |     13.53

► How are the variance inflation factors calculated? The formula is given below, and the next two pages verify it variable by variable.
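Stated as a formula (standard, and consistent with what the next two pages verify variable by variable): for the j-th independent variable xj,

 VIFj = 1/(1 − Rj²)  or, equivalently,  1/VIFj = 1 − Rj²

where Rj² is the R-squared from the regression of xj on all the other independent variables. This is exactly the 1/VIF column reported above.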

multicollinearity revisited

► Let’s keep this table available for easy reference:

VIF results for full regression

     Variable |      VIF       1/VIF
 -------------+----------------------
       pbpreg |     25.97    0.038508
      pbpbeef |     25.15    0.039765
       poscar |      1.66    0.603208
         pdub |      1.36    0.733979
 -------------+----------------------
     Mean VIF |     13.53

► The independent variables are obviously pdub, poscar, pbpreg and pbpbeef.

► The 1/VIF for variable pdub is equal to 1 − R² where R² is the R-squared from the regression of pdub on all the other independent variables:

 pdub = c0 + c1·poscar + c2·pbpreg + c3·pbpbeef

 . regress pdub poscar pbpreg pbpbeef

     Source |      SS        df       MS           Number of obs =     113
  ----------+------------------------------        F(3, 109)     =   13.17
      Model | 5541.63674      3   1847.21225       Prob > F      =  0.0000
   Residual | 15289.9208    109   140.274503       R-squared     =  0.2660
  ----------+------------------------------        Adj R-squared =  0.2458
      Total | 20831.5575    112   185.996049       Root MSE      =  11.844

 1 − R² = 1 − 0.2660 = 0.7340, matching 1/VIF = 0.733979 for pdub in the table above.

pbpreg  d0  d1pdub  d2poscar  d3pbpbeef Managerial Economics & Decision Sciences Department session seven - appendix multicollinearity Developed for business analytics II the R2 ◄ multicollinearity, the R2 and the Ftest◄ multicollinearity revisited ► Let’s keep this table available for an easy reference: ► The 1/VIF for variable pbpreg is equal to 1  R2 where R2 is the R-squared from the regression of pbpreg on all the other dependent variables: pbpreg  d0  d1pdub  d2poscar  d3pbpbeef VIF results for full regression Variable | VIF 1/VIF -------------+---------------------- pbpreg | 25.97 0.038508 pbpbeef | 25.15 0.039765 poscar | 1.66 0.603208 pdub | 1.36 0.733979 Mean VIF | 13.53 . regress pbpreg pdub poscar pbpbeef Source | SS df MS Number of obs = 113 -------------+------------------------------ F( 3, 109) = 907.18 Model | 22730.6813 3 7576.89378 Prob > F = 0.0000 Residual | 910.380602 109 8.35211561 R-squared = 0.9615 ---------+------------------------------ Adj R-squared = 0.9604 Total | 23641.0619 112 211.08091 Root MSE = 2.89 1  R2 = 0.038 © 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II session seven - appendix | page 8

multicollinearity revisited

► Detecting inflated standard errors does not, by itself, automatically amount to detecting multicollinearity. To identify multicollinearity we use the F-test.

► The F-test tells us whether one or more variables add predictive power to a regression:

 hypothesis
  H0: all of the regression coefficients (β) on the variables you are testing equal 0
  Ha: at least one of the regression coefficients (β) is different from 0

► In plain language: you are basically testing whether these variables are no more related to y than junk variables.

 Remark. The F-test for a single variable returns the same significance level as the t-test.

► The F-test for a group of variables can be executed in STATA using the test or testparm command, listing the variables you wish to test after running a regression:

 testparm xvar1 xvar2 … xvark

 Remark. After the STATA command testparm you should list the variables whose coefficients you want to test jointly; a usage sketch follows below.

 Remark. What if we include all the independent variables in the list for the F-test? What are we testing? We are actually testing the null H0: all coefficients in the regression are zero. If we reject the null then we know that at least one of the variables adds some predictive value. If we cannot reject the null then it means that we are really using variables with no explanatory/predictive power for the variation in the dependent variable.
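A usage sketch in line with the Dubuque example, where pbpreg and pbpbeef were suspected of being collinear (the specific test of this pair is illustrative, not run on the slides):

 * joint test of the two beef-price variables
 regress MKTDUB pdub poscar pbpreg pbpbeef
 testparm pbpreg pbpbeef    // H0: coefficients on pbpreg and pbpbeef are both zero

If this joint F-test rejects while the individual t-tests do not (t = 1.05 and t = 0.35 in the full regression), that is the classic symptom of multicollinearity: the pair has joint explanatory power even though neither coefficient is individually significant.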

multicollinearity revisited

► For the Dubuque Hot Dogs full regression (just the upper part of the table):

     Source |       SS           df       MS         Number of obs =     113
  ----------+----------------------------------      F(4, 108)     =   30.00
      Model |  .012013954         4   .003003488     Prob > F      =  0.0000
   Residual |  .010811783       108   .000100109     R-squared     =  0.5263
  ----------+----------------------------------      Adj R-squared =  0.5088
      Total |  .022825737       112   .000203801     Root MSE      =  .01001

► You can also run:

 . testparm pdub poscar pbpreg pbpbeef

  ( 1)  pdub = 0
  ( 2)  poscar = 0
  ( 3)  pbpreg = 0
  ( 4)  pbpbeef = 0

        F(  4,   108) =   30.00
             Prob > F =    0.0000

► The regression table implicitly provides the joint test that all regression coefficients are zero: the F(4, 108) = 30.00 and Prob > F = 0.0000 reported at the top right are exactly the statistic and p-value returned by testparm. Since Prob > F = 0.0000, we reject H0 and conclude that at least one coefficient differs from zero.
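As a consistency check (standard ANOVA arithmetic, not shown on the slide, using only numbers from the table above):

 F = (MSS/k) / (RSS/(n − k − 1))
   = (.012013954/4) / (.010811783/108)
   = .003003488 / .000100109
   ≈ 30.00

with k = 4 regressors and n = 113 observations, matching F(4, 108) = 30.00 in both outputs.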