Copyright © 2014 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education. Chapter 8 Model Selection in Multiple Linear Regression Analysis
8-2 Learning Objectives: understand the problem presented by omitted variable bias; understand the problem presented by including an irrelevant variable; understand the problem presented by missing data; understand the problem presented by outliers; perform the RESET test for the inclusion of higher-order polynomials.
8-3 Learning Objectives: perform the Davidson-MacKinnon test for choosing among non-nested alternatives; consider how to implement the “eye test” to judge the estimated sample regression function; consider what it means for a p-value to be just above a given cutoff.
8-5 Understand the Problem Presented by Omitted Variable Bias Omitted variable bias is the bias in coefficient estimates that arises when a variable that affects the dependent variable is left out of the model and that variable is also correlated with one or more of the included independent variables. Omitted variable bias causes the OLS estimates to be wrong on average (biased), which in turn makes hypothesis tests and confidence intervals incorrect.
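To make the bias concrete, here is a minimal simulation sketch. The data-generating process, the variable names (x1, x2), and all numeric values are assumptions chosen for illustration; they do not come from the chapter. Because x2 affects y and is correlated with x1, dropping x2 pushes the x1 coefficient away from its true value of 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data-generating process: y depends on x1 and x2,
# and x2 is correlated with x1 (x2 = 0.5 * x1 + noise).
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

beta_full = ols(y, x1, x2)   # correctly specified model
beta_short = ols(y, x1)      # x2 omitted

print("x1 coefficient, full model:", round(beta_full[1], 2))
print("x1 coefficient, x2 omitted:", round(beta_short[1], 2))
```

In this setup the short regression should land near 2 + 3 × 0.5 = 3.5 rather than 2, because the x1 coefficient absorbs the part of x2's effect that moves together with x1.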
8-6 Understand the Problem Presented by Including an Irrelevant Variable Including an irrelevant variable means putting a variable in the regression model even though it is not related to the dependent variable. Including an irrelevant variable does not bias the coefficient estimates, but it may produce larger standard errors (which might cause more coefficients to be found statistically insignificant).
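A small Monte Carlo sketch can illustrate both halves of this claim. Everything here is an assumption for illustration: the true model, the irrelevant variable z, its correlation of 0.9 with x1, and the sample sizes. The average x1 estimate should stay near the true value of 2 in both models, while the spread of the estimates widens once z is included.

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 100, 500

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

slopes_correct, slopes_extra = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    # z is irrelevant (it does not enter y) but is correlated with x1.
    z = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
    y = 1.0 + 2.0 * x1 + rng.normal(size=n)
    slopes_correct.append(ols(y, x1)[1])
    slopes_extra.append(ols(y, x1, z)[1])

mean_correct, sd_correct = np.mean(slopes_correct), np.std(slopes_correct)
mean_extra, sd_extra = np.mean(slopes_extra), np.std(slopes_extra)

print("without z: mean", round(mean_correct, 3), "sd", round(sd_correct, 3))
print("with z:    mean", round(mean_extra, 3), "sd", round(sd_extra, 3))
```

The loss of precision depends on how strongly the irrelevant variable is correlated with the included ones; if z were uncorrelated with x1, the two spreads would be nearly the same.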
8-7 What is the Lesser of Two Evils: Omitted Variable Bias or Including an Irrelevant Variable? Because omitting a relevant variable results in biased estimates while including an irrelevant variable does not, it is more desirable to include an irrelevant variable. However, it would be best to have a correctly specified model without either an omitted variable or an irrelevant variable. A correctly specified model should be created by considering relevant economic theory and by looking at what others have done in similar studies.
8-8 Understand the Problem Presented by Missing Data When collecting data, sometimes values are missing for some of the observations. Solutions: (1) If there is no systematic reason that the data are missing, we can delete those observations and estimate the model using only the observations with non-missing data. (2) Create a new dummy variable that is equal to 1 if the data are missing for that observation and 0 if they are not (and set the value of the missing observations to 0).
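The two solutions can be sketched side by side on simulated data. The data, the 40% missingness rate, and the names x_obs, x_filled, and missing are assumptions for illustration, not the World Bank data used in the empirical example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical data: y falls by 2 for each one-unit rise in x.
x = rng.normal(loc=10.0, scale=2.0, size=n)
y = 50.0 - 2.0 * x + rng.normal(size=n)

# Remove about 40% of x at random (no systematic reason for missingness).
x_obs = x.copy()
x_obs[rng.random(n) < 0.4] = np.nan

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Solution (1): delete observations with missing x.
keep = ~np.isnan(x_obs)
beta_deleted = ols(y[keep], x_obs[keep])

# Solution (2): set missing x to 0 and add a missing-data dummy.
missing = np.isnan(x_obs).astype(float)
x_filled = np.where(np.isnan(x_obs), 0.0, x_obs)
beta_dummy = ols(y, x_filled, missing)

print("slope, missing rows deleted:", round(beta_deleted[1], 3))
print("slope, zero-fill plus dummy:", round(beta_dummy[1], 3))
```

With a single regressor the dummy absorbs the missing-data group entirely, so the two slope estimates coincide up to numerical precision; with several regressors they can differ. The dummy's own coefficient is still informative, since it shows whether observations with missing data differ systematically in the dependent variable.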
8-9 Empirical Example: We are interested in explaining the teen fertility rate (births per 1,000 women ages 15-19) using expenditure per student in primary school (% of GDP per capita). Of the 252 countries the World Bank collects data on, 225 have data on teen fertility and only 90 have data on education expenditure. C.ZS/countries
8-10 Empirical Example: Running the Regression with the Missing Observations Deleted The interpretation of the expenditure variable coefficient is that, “on average, as expenditure per student as a percent of GDP per capita goes up by 1 percentage point, the teen fertility rate drops by 2.29 per 1,000.” This result is statistically significant at the 1% level.
8-11 Empirical Example: Running the Regression with the Missing Observations Set to 0 and a Dummy Variable Added for the Missing Observations The coefficient with the missing observations included is almost identical to the coefficient estimate on the previous slide, suggesting that the missing observations do not affect the results. However, the dummy variable for the missing observations is statistically significant at the 1% level, and it suggests that, on average, the teen fertility rate is lower in countries with missing education expenditure data than in countries without missing data.
8-12 Understand the Problem Presented by Outliers Outliers can significantly affect the estimated slope coefficients. It is not acceptable to simply drop outliers unless you can determine that their presence is due to a data entry error. One possible way to control for outliers is to include dummy variables for the dependent variable outliers and the independent variable outliers.
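The dummy-variable approach can be sketched on simulated data. The data, the three planted outliers, and the flagging rule (more than three standard deviations from the mean) are all assumptions for illustration; the chapter does not specify how outliers should be identified.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100

# Hypothetical data with a true slope of 0.2.
x = rng.uniform(0.0, 50.0, size=n)
x[:3] = [2.0, 3.0, 4.0]
y = 2.0 + 0.2 * x + rng.normal(size=n)
y[:3] += 40.0  # plant three dependent-variable outliers at low x

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Assumed flagging rule: more than 3 standard deviations from the mean of y.
y_outlier = (np.abs(y - y.mean()) > 3 * y.std()).astype(float)

beta_plain = ols(y, x)             # outliers left uncontrolled
beta_dummy = ols(y, x, y_outlier)  # dummy added for the flagged observations

print("flagged outliers:", int(y_outlier.sum()))
print("slope without dummy:", round(beta_plain[1], 3))
print("slope with dummy:   ", round(beta_dummy[1], 3))
```

The dummy absorbs the average shift of the flagged observations, so the slope is estimated mostly from the remaining data while every observation stays in the sample.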
8-13 Empirical Example: Total Medals Won in the Olympics vs. GDP per Capita, with Potential Outliers Marked
8-14 Empirical Example: Regression Results without Controlling for the Outliers The coefficient on GDP per capita means that, on average, if GDP per capita increases by $1,000, the number of Olympic medals goes up by 0.15 of a medal. This coefficient is statistically significant at the 10% level and is almost significant at the 5% level.
8-15 Empirical Example: Regression Results Controlling for the Outliers The coefficient on GDP per capita has increased from 0.15 to 0.21. The medal outlier coefficient says that, “on average, the three medal outliers have more medals than countries that are not outliers.” The GDP outlier coefficient says that, “on average, the two GDP outliers have 20 fewer medals than countries that are not outliers.” Both of these coefficients are statistically significant at the 5% level.
8-16 Perform the RESET Test for the Inclusion of Higher-Order Polynomials
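As a rough sketch of how the test is commonly carried out: estimate the original model, save the fitted values, add powers of the fitted values (here the square and the cube, which is an assumed choice) to the model, and use an F-test to check whether the added terms are jointly significant. The simulated data and the helper names rss_and_fitted and reset_f are hypothetical, and this may differ in detail from the version presented on the slide.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0.0, 5.0, size=n)
y_linear = 1.0 + 2.0 * x + rng.normal(size=n)                    # truly linear
y_quadratic = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=n)    # needs x squared

def rss_and_fitted(y, X):
    """Return the residual sum of squares and fitted values from OLS."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    return float(resid @ resid), fitted

def reset_f(y, X):
    """F statistic for adding fitted^2 and fitted^3 to the model."""
    rss_restricted, fitted = rss_and_fitted(y, X)
    X_aug = np.column_stack([X, fitted**2, fitted**3])
    rss_unrestricted, _ = rss_and_fitted(y, X_aug)
    q = 2                               # number of added terms
    df = len(y) - X_aug.shape[1]        # residual degrees of freedom
    return ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / df)

X = np.column_stack([np.ones(n), x])    # linear specification in both cases
f_linear = reset_f(y_linear, X)
f_quadratic = reset_f(y_quadratic, X)

print("RESET F, linear truth:   ", round(f_linear, 2))
print("RESET F, quadratic truth:", round(f_quadratic, 2))
```

The statistic is compared with an F critical value with q and n − k − q degrees of freedom, where k counts the original coefficients. A large value suggests that the linear specification is missing higher-order terms.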
8-17 Perform the Davidson-MacKinnon Test for Choosing Among Non-Nested Alternatives
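One common form of this test (often called the J test) can be sketched as follows: estimate model B, add its fitted values as an extra regressor in model A, and check the t-statistic on those fitted values; then repeat with the roles reversed. The two competing specifications (y on x versus y on log x), the simulated data, and the helper names ols_fit and j_test_t are assumptions for illustration and may differ from the slide's presentation.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 300
x = rng.uniform(1.0, 10.0, size=n)
# Hypothetical truth: y is linear in log(x), so model B is correct.
y = 1.0 + 3.0 * np.log(x) + 0.3 * rng.normal(size=n)

def ols_fit(y, X):
    """Return coefficients, their standard errors, and fitted values."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se, X @ beta

def j_test_t(y, X_null, X_alt):
    """t-statistic on the alternative model's fitted values added to the null model."""
    _, _, fitted_alt = ols_fit(y, X_alt)
    beta, se, _ = ols_fit(y, np.column_stack([X_null, fitted_alt]))
    return beta[-1] / se[-1]

X_a = np.column_stack([np.ones(n), x])          # model A: y on x
X_b = np.column_stack([np.ones(n), np.log(x)])  # model B: y on log(x)

t_against_a = j_test_t(y, X_a, X_b)  # large |t| counts against model A
t_against_b = j_test_t(y, X_b, X_a)  # large |t| counts against model B

print("t on B's fitted values in model A:", round(t_against_a, 2))
print("t on A's fitted values in model B:", round(t_against_b, 2))
```

A significant t-statistic in the first regression and an insignificant one in the second favors model B. The test can also reject both models or neither, in which case it does not settle the choice.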
8-18 Consider How to Implement the “Eye Test” to Judge the Sample Regression Function To perform the “eye test” you should take a step back from your results and analyze whether the estimates and your conclusions seem reasonable.
8-19 Examples of Where the “Eye Test” Should Have Been Used (1) One student tried to analyze what determined Super Bowl wins using only one year of data (so that there was only one winner). (2) Another student …