QM222 Class 19 Omitted Variable Bias pt 2 Different slopes for a single variable QM222 Fall 2017 Section A1.

Slides:

Advertisements

Similar presentations

CHOW TEST AND DUMMY VARIABLE GROUP TEST

Advertisements

Sociology 601 Class 24: November 19, 2009 (partial) Review –regression results for spurious & intervening effects –care with sample sizes for comparing.

Heteroskedasticity The Problem:

Lecture 9 Today: Ch. 3: Multiple Regression Analysis Example with two independent variables Frisch-Waugh-Lovell theorem.

INTERPRETATION OF A REGRESSION EQUATION

Christopher Dougherty EC220 - Introduction to econometrics (chapter 3) Slideshow: exercise 3.5 Original citation: Dougherty, C. (2012) EC220 - Introduction.

Lecture 4 This week’s reading: Ch. 1 Today:

Sociology 601 Class 25: November 24, 2009 Homework 9 Review –dummy variable example from ASR (finish) –regression results for dummy variables Quadratic.

Sociology 601 Class 28: December 8, 2009 Homework 10 Review –polynomials –interaction effects Logistic regressions –log odds as outcome –compared to linear.

1 Review of Correlation A correlation coefficient measures the strength of a linear relation between two measurement variables. The measure is based on.

Sociology 601 Class 23: November 17, 2009 Homework #8 Review –spurious, intervening, & interactions effects –stata regression commands & output F-tests.

TESTING A HYPOTHESIS RELATING TO A REGRESSION COEFFICIENT This sequence describes the testing of a hypotheses relating to regression coefficients. It is.

Christopher Dougherty EC220 - Introduction to econometrics (chapter 3) Slideshow: precision of the multiple regression coefficients Original citation:

EDUC 200C Section 4 – Review Melissa Kemmerle October 19, 2012.

Christopher Dougherty EC220 - Introduction to econometrics (chapter 5) Slideshow: the effects of changing the reference category Original citation: Dougherty,

DUMMY CLASSIFICATION WITH MORE THAN TWO CATEGORIES This sequence explains how to extend the dummy variable technique to handle a qualitative explanatory.

How do Lawyers Set fees?. Learning Objectives 1.Model i.e. “Story” or question 2.Multiple regression review 3.Omitted variables (our first failure of.

Addressing Alternative Explanations: Multiple Regression

MultiCollinearity. The Nature of the Problem OLS requires that the explanatory variables are independent of error term But they may not always be independent.

EDUC 200C Section 3 October 12, Goals Review correlation prediction formula Calculate z y ’ = r xy z x for a new data set Use formula to predict.

F TEST OF GOODNESS OF FIT FOR THE WHOLE EQUATION 1 This sequence describes two F tests of goodness of fit in a multiple regression model. The first relates.

MULTIPLE REGRESSION WITH TWO EXPLANATORY VARIABLES: EXAMPLE 1 This sequence provides a geometrical interpretation of a multiple regression model with two.

Simple regression model: Y =  1 +  2 X + u 1 We have seen that the regression coefficients b 1 and b 2 are random variables. They provide point estimates.

Christopher Dougherty EC220 - Introduction to econometrics (chapter 5) Slideshow: exercise 5.2 Original citation: Dougherty, C. (2012) EC220 - Introduction.

Chapter 5: Dummy Variables. DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES 1 We’ll now examine how you can include qualitative explanatory variables.

STAT E100 Section Week 12- Regression. Course Review - Project due Dec 17 th, your TA. - Exam 2 make-up is Dec 5 th, practice tests have been updated.

RAMSEY’S RESET TEST OF FUNCTIONAL MISSPECIFICATION 1 Ramsey’s RESET test of functional misspecification is intended to provide a simple indicator of evidence.

1 CHANGES IN THE UNITS OF MEASUREMENT Suppose that the units of measurement of Y or X are changed. How will this affect the regression results? Intuitively,

GRAPHING A RELATIONSHIP IN A MULTIPLE REGRESSION MODEL The output above shows the result of regressing EARNINGS, hourly earnings in dollars, on S, years.

1 BINARY CHOICE MODELS: LINEAR PROBABILITY MODEL Economists are often interested in the factors behind the decision-making of individuals or enterprises,

1 COMPARING LINEAR AND LOGARITHMIC SPECIFICATIONS When alternative specifications of a regression model have the same dependent variable, R 2 can be used.

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE In this sequence and the next we will investigate the consequences of misspecifying the regression.

QM222 Class 19 Section D1 Tips on your Project

QM222 Class 12 Section D1 1. A few Stata things 2

For presentation schedule and makeup test signups, see:

Linear Regression Essentials Line Basics y = mx + b vs. Definitions

Chapter 14 Introduction to Multiple Regression

Chapter 20 Linear and Multiple Regression

QM222 Class 9 Section A1 Coefficient statistics

QM222 Class 11 Section D1 1. Review and Stata: Time series data, multi-category dummies, etc. (chapters 10,11) 2. Capturing nonlinear relationships (Chapter.

QM222 Class 10 Section D1 1. Goodness of fit -- review 2

QM222 Nov. 7 Section D1 Multicollinearity Regression Tables What to do next on your project QM222 Fall 2016 Section D1.

QM222 Nov. 28 Presentations Some additional tips on the project

QM222 Class 13 Section D1 Omitted variable bias (Chapter 13.)

Review Multiple Regression Multiple-Category Dummy Variables

QM222 Class 16 & 17 Today’s New topic: Estimating nonlinear relationships QM222 Fall 2017 Section A1.

QM222 Class 11 Section A1 Multiple Regression

Multiple Regression Analysis and Model Building

QM222 Class 14 Section D1 Different slopes for the same variable (Chapter 14) Review: Omitted variable bias (Chapter 13.) The bias on a regression coefficient.

QM222 Class 18 Omitted Variable Bias

QM222 Class 9 Section D1 1. Multiple regression – review and in-class exercise 2. Goodness of fit 3. What if your Dependent Variable is an 0/1 Indicator.

QM222 A1 On tests and projects

QM222 Class 8 Section A1 Using categorical data in regression

QM222 Class 8 Section D1 1. Review: coefficient statistics: standard errors, t-statistics, p-values (chapter 7) 2. Multiple regression 3. Goodness of fit.

QM222 A1 Nov. 27 More tips on writing your projects

The slope, explained variance, residuals

QM222 A1 How to proceed next in your project Multicollinearity

POSC 202A: Lecture Lecture: Substantive Significance, Relationship between Variables 1.

QM222 Class 14 Today’s New topic: What if the Dependent Variable is a Dummy Variable? QM222 Fall 2017 Section A1.

QM222 Your regressions and the test

QM222 Dec. 5 Presentations For presentation schedule, see:

QM222 Class 15 Section D1 Review for test Multicollinearity

Covariance x – x > 0 x (x,y) y – y > 0 y x and y axes.

CHAPTER 14 MULTIPLE REGRESSION

EPP 245 Statistical Analysis of Laboratory Data

Introduction to Econometrics, 5th edition

Introduction to Econometrics, 5th edition

Introduction to Econometrics, 5th edition

Presentation transcript:

QM222 Class 19 Omitted Variable Bias pt 2 Different slopes for a single variable QM222 Fall 2017 Section A1

To-dos Assignment 5 is due today. But you can only do it if you have your stata dataset. Test 6pm Oct 31 (location TBD) QM222 Fall 2017 Section A1

Today we will… Omitted variable bias: Review the idea of omitted variable bias Do the graph from last class’s in-class exercise Learn algebra of omitted variable bias Different slopes for a single variable (start) QM222 Fall 2017 Section A1

Omitted variable bias In a simple regression of Y on X1, the coefficient b1 measures the combined effects of: the direct (or often called “causal”) effect of the included variable X1 on Y PLUS an “omitted variable bias” due to factors that were left out (omitted) from the regression. Often we want to measure the direct, causal effect. In this case, the coefficient in the simple regression is biased. QM222 Fall 2017 Section A1

Regressions without and with Age WS48= .1203 - .0325 INJURED (has bias!) (66.37) (-6.34) adjRsq=.0359 Regression 2: WS48= .1991 - .0274 INJURED - .00279 Age (18.38) (-5.41) (-7.37) adjRsq=.0826 (t-stats in parentheses) Putting age in the regression (2) added .051 to the INJURED coefficient (i.e. made it a smaller negative.) The omitted variable bias in Regression (1) was the difference in the coefficients on INJURED -.0274-.0325 = - .051 More generally: Omitted variable bias occurs when: The omitted variable (Age) has an effect on the dependent variable (WS48) AND 2. The omitted variable (Age) is correlated with the explanatory variable of interest (INJURED). QM222 Fall 2017 Section A1

We learned the graphic way last week Really, both being injured and age affect WS48 as in the multiple regression Y = b0 + b1X1 + b2X2 This is drawn below. Let’s call this the Full model. Let’s call b1 and b2 the direct effects. QM222 Fall 2017 Section A1

The mis-specified or Limited model However, in the simple (1 X variable) regression, we measure only a (combined) effect of injured on price. Call its coefficient c1 Y = c0 + c1X1 Let’s call c1 is the combined effect because it combines the direct effect of X1 and the bias. QM222 Fall 2017 Section A1

The reason that there is an omitted variable bias in the simple regression of Y on X1 is that there is a Background Relationship between the X’s We intuited that there is a relationship between X1 (Injured) and X2 (Age). We call this the Background Relationship: correlate WS48 INJURED Age (obs=1,051) | WS48 INJURED Age -------------+--------------------------- WS48 | 1.0000 INJURED | -0.1920 1.0000 Age | -0.2425 0.1388 1.0000 This background relationship, shown in the graph as a1, is positive. QM222 Fall 2017 Section A1

But in the limited model without an X2 in the regression, The combined effect c1 includes both X1‘s direct effect b1. And the indirect effect (blue arrow) working through X2. i.e. when X1 changes, X2 also tends to change (a1) This change in X2 has another effect on Y (b2) The indirect effect (blue arrow) is the omitted variable bias and its sign is the sign of a1 times the sign of b2 QM222 Fall 2017 Section A1

In the basketball case WS48= .1203 - .0325 INJURED (limited model) WS48= .1991 - .0274 INJURED - .00279 Age (full model) The effect of Injured on WS48 has two channels. The first one is the direct effect b1 (-.0274) The second channel is the indirect effect working through X2.(Age) When X1 (INJURED) changes, X2 (Age) also tends to change (a1) (correlation +.1388) This change in X2 has its own effect on Y (b2) (-.0274) The indirect effect (blue arrow) is the omitted variable bias and its sign is the sign of a1 times the sign of b2 : pos*neg=neg QM222 Fall 2017 Section A1

In-Class exercise (t-stats in parentheses) Regression 1: Score = 61.809 – 5.68 Pay_Program adjR2=.0175 (93.5) (-3.19) Regression 2: Score = 10.80 + 3.73 Pay_Program + 0.826 OldScore adjR2=.6687 (6.52) (3.46) (31.68) QM222 Fall 2017 Section A1

Pay Program graph (1) Score = 61. 809 – 5 Pay Program graph (1) Score = 61.809 – 5.68 Pay_Program (2) Score = 10.80 + 3.73 Pay_Program + 0.826 OldScore Pay Program b1= 3.73 a1 SCORE bias=-5.68--.373=-.941 Old Score b2= + .826 a1 has the sign of the correlation between Pay Program and Old Score. Since the bias is negative and its sign = sign of a1* sign b2, a1 must be negative. In words: It must be that OldScore is correlated with who chooses the Pay Program, and particularly that schools with bad (old) scores chose the pay program QM222 Fall 2017 Section A1

Algebra Limited model Y = c0 + c1 X1 Full model Y = b0 + b1 X1 + b2 X2 Background model X2 = a0 + a1 X1 We want the Full model but we only have the limited one with only X1 So substitute the background model into the full model: Y = b0 + b1 X1 + b2 (a0 + a1 X1 ) X2 Collect terms: Y = (b0 + b2 a0 ) + (b1 + b2a1) X1 c0 c1 X1 So the bias of the coefficient on X1 in the limited model is b2a1 QM222 Fall 2017 Section A1

Let’s apply this to Brookline Condo’s c1 combined effect (negative.) Limited Model: Price = 520729 – 46969 BEACON Full Model: Price = 6981 + 409.4 SIZE + 32936 BEACON Background relationship: SIZE = 1254 – 195.17 BEACON c1 = (b1 + b2a1) check -46969=32935+(-195.17*409.4) Bias is b2a1 or -195.17*409.4 which is negative. We are UNDERESTIMATING the direct effect a1 (negative) b1 direct effect (positive.) QM222 Fall 2017 Section A1

Pay Program algebra (1) Score = 61. 809 – 5 Pay Program algebra (1) Score = 61.809 – 5.68 Pay_Program (limited) (2) Score = 10.80 + 3.73 Pay_Program + 0.826 OldScore (full) Here is the regression of the background model: . regress OLDSCORE PAY_PROGRAM Source | SS df MS Number of obs = 515 -------------+---------------------------------- F(1, 513) = 42.23 Model | 7952.87922 1 7952.87922 Prob > F = 0.0000 Residual | 96613.7883 513 188.330971 R-squared = 0.0761 -------------+---------------------------------- Adj R-squared = 0.0743 Total | 104566.667 514 203.437096 Root MSE = 13.723 ------------------------------------------------------------------------------ OLDSCORE | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- PAY_PROGRAM | -11.39843 1.754058 -6.50 0.000 -14.84445 -7.952413 _cons | 61.78153 .6512825 94.86 0.000 60.50202 63.06104 Write the equation of the background model. Combine it with the two above models to get the value and sign of the omitted variable bias in the coefficient -5.68 in the limited model. QM222 Fall 2017 Section A1

One variable with different slopes QM222 Fall 2017 Section A1

Review of simple derivatives A derivative is the same as a slope. In a line, the slope is always the same. In a curve, the slope changes. The rules of derivatives tell you how to calculate the slope at any point of a curve. We write the derivative as dy/dx instead of the slope ∆Y/∆X QM222 Fall 2017 Section A1

Three rules of calculus 1. The derivative (slope) of two terms added together = the derivative of each term added together: Y = A + B where A and B are terms with X in them dY/dX= dA/dX + dB/dX 2. The derivative (slope) of a constant is zero If Y = 5, dY/dX =0 3. If Y = a xb dY/dX = b a x b-1 QM222 Fall 2017 Section A1

Examples Y = 25 X2 then dY/dX = 2 · 25 X 2-1 = 50 x Another example combining the three rules is: y = 25 x 2 + 200 x + 3000 then, recalling that x0 = 1, dY/dX = 2 · 25 x 2-1 + 1· 200 x 1-1 + 0 = 50 x + 200 The exponent does not have to be either positive or an integer. Example: Y = 20 X-2.5 then: dY/dX = 2.5 · 20 X -2.5 - 1 = - 50 X-3.5 QM222 Fall 2017 Section A1

Now we’re ready for different slopes QM222 Fall 2017 Section A1

Movie dataset Here is a regression of Movie lifetime revenues on Budget and a dummy for if it is a SciFi movie Revenues = 16.6 + 1.12 Budget - 9.79 SciFi (5.28) (.102) (11.6) (standard errors in parentheses) What does an observation represent in this data set? What do we learn from the standard errors about each coefficient’s significance? What is the slope dRevenues/dSciFi? What is the slope dRevenues/dBudget? Are these results what you expect? QM222 Fall 2017 Section A1

Do you think that budget will matter similarly for all types of movies? Particularly, what do we expect about the coefficient on budget (slope) for SciFi movies (compared to others)? QM222 Fall 2017 Section A1

If we think that each budget dollar affects SciFi movies differently… The simplest way to model this in a regression is: Make an additional variable by multiplying Budget x SciFi Make an additional variable by multiplying Budget x non-SciFi Replace Budget with these two variables (keeping in SciFi) These are called interaction terms. QM222 Fall 2017 Section A1

Steps 1 and 2: Replace budget with two new variables Budget x SciFi and Budget x Non-SciFi gen budgetscifi= budget*scifi gen budgetnonscifi=budget*(1-scifi) QM222 Fall 2017 Section A1

What data looks like in a spreadsheet moviename revenue scifi budget budgetscifi budgetnonscifi The Bridges of Madison County 71.5166 22 Dead Man Walking 39.3636 11 Rob Roy 31.5969 28 Clueless 56.6316 13.7 Babe 63.6589 30 Jumanji 100 65 Showgirls 20.3508 40 Starship Troopers 54.8144 1 Bad Boys 65.807 23 Event Horizon 26.6732 60 Jefferson in Paris 2.47367 14 To Die For 21.2845 20 Star Trek: Insurrection 70.1877 70 Sphere 37.0203 73 Out of Sight 37.5626 48 Saving Private Ryan 220 Enemy of the State 110 85 The Big Lebowski 17.4519 15 Lost in Space 69.1176 80 Mortal Kombat 70.4541 Copycat 32.0519 QM222 Fall 2017 Section A1

3. Replace Budget with these two variables (keeping in SciFi) regress revenues scifi budgetscifi budgetnonscifi You get: revenues = 19.91 – 72.07 SciFi + 2.04 budgetscifi + 1.04 budgetnotscifi (5.36) (25.5) (0.352) (0.105) What is the slope drevenues/dbudget? drevenues/dbudget = 2.04 scifi + 1.04 notscifi If it is a scifi movie: Slope drevenues/dbudget = 2.04 (since the last term is 0) If it is not a scifi movie : Slope drevenues/dbudget = 1.04 Each budget dollar is more important if it is a scifi/fantasy movie. Note also: All coefficients are significant. QM222 Fall 2017 Section A1

Graph of this model SciFi movies Revenues Other movies Budget QM222 Fall 2017 Section A1

This also allows the effect of being a scifi movie to depend on the budget From the previous overhead: revenues = 19.91 – 72.07 scifi + 2.04 budgetscifi + 1.04 budgetnotscifi What is the slope drevenues/dscifi? drevenues/dscifi = - 72.70 + 2.04 budget So if budget = 100, drevenues/dscifi = - 72.70 + 2.04 *100 = 131.3 Compare to our equation without the “interaction terms”, with : drevenues/dscifi = - 9.79 QM222 Fall 2017 Section A1

Making Regression Tables Regressions of Points per game per player 1 2 3 4 height -0.0226 0.2133*** 0.2240*** 4.7306 (-0.28) (4.77) -4.93 -0.85 year -.01545*** -0.0166* 0.1667 (-2.99) (-1.79) -0.74 Height x year -0.0023 (-0.81) yr1970 0.5629 0.5522 -1.12 -1.1 yr1975 -1.0830*** -1.1161*** (-2.48) (-2.55) yr1980 -0.3757 -0.4313 (-0.96) (-1.08) yr1985 -0.1960 -0.2582 (-0.55) (-0.71) yr1990 -0.1043 -0.1580 (-0.33) (-0.49) yr1995 -0.54223* -0.5841** (-1.86) (-1.97) yr2000 -0.5964** -0.6269** (-2.20) (-2.29) yr2005 -0.5453** 0.5576** (-2.08) (-2.13) yr2010 -0.1962502 -0.2026464 (-0.73) (-0.75) center -0.8795*** -0.8917*** -0.8946*** (-5.14) (-5.20) (-5.22) minutes played 0.5260*** 0.5253*** 0.5251*** -76.41 -76.42 -76.35 Constant (integer) 9.872891 11.21819 13.02334 -352.6088 (-1.51) (-1.12) (-0.78) # observations 1509 1,509 MSE SSE 6.0404 2.6943 2.6812 2.6815 adjusted r squared -0.0006 0.8009 0.8028 t-statistics in parentheses; *p<.1 ** p<.05 ***p<.01; Omitted year: 1965 from "Have Taller "Big Men" Become Less Productive in the NBA over Recent Years" Austin Hirsch 2015 When you want to report several regressions and allow readers to compare them, you report the regressions in a table. All variables are listed in the first column. Each regression is another column. QM222 Fall 2017 Section A1

Main elements of the regression table: Each column represents a different equation. If all regressions have the same dependent (Y) variable, then tell people what the dependent variable is in the table title (like I did here). If different columns have different dependent variables, list them as each column’s first row (instead of “1”, “2”). Each explanatory variable in any equation should be listed in the first column, leaving 2 rows for each variable. For each regression, first put the coefficient itself in the column. Then, in the cell below it, put either the coefficient’s standard error or the coefficient’s t-statistic in parentheses. Somewhere in the title or the notes to the table, inform the reader which statistic is in parentheses. It is particularly helpful to put asterisks next to significant coefficients. For instance, I put the following: *** if the p-value is <.01 ** p<.05 * p<.10 This should be explained in the table’s footnotes. Every column does not need to have a coefficient for every variable, since every regression did not include every variable. Looking at the table, we can clearly see which explanatory variables were included in each regression. QM222 Fall 2017 Section A1