QM222 Class 13 Section D1
Omitted variable bias (Chapter 13)
The bias on a regression coefficient due to leaving out confounding factors from a regression
QM222 Fall 2016 Section D1
Assignment 4 – Due Friday at 6pm: Hard copy and online
Part A: Current Project Status
If you have changed or added any aspect of the Current Project Status (Q1-6), revise it.
Part B: Questions on your dependent variable (if you have more than one, choose the most important one):
If you have a numeric dependent variable, create a histogram of it in Stata (histogram varname). If you have a categorical dependent variable, tabulate it with the Stata command: tab varname, missing. What do you learn from this histogram or tabulation?
If you have a numeric dependent variable, get descriptive statistics for your (key) dependent variable in Stata using summarize varname, detail. If you have a categorical dependent variable, make it into a single indicator variable, making sure that any missing values are left as missing, and then run summarize varname, detail. What important things do you learn about the distribution of your dependent variable from these descriptive statistics? Answer in 1-4 sentences.
Based on this evidence, are there any observations with values that seem like mistakes? Should you drop these observations or correct the mistake? Explain, and drop or correct as appropriate.
(For numeric variables only) Based on this evidence, is your dependent variable very skewed, and in particular are there any extreme outliers? If so, do you think we should top-code these values (or use logs, etc.)? Explain why. Then top-code or change into logs if appropriate.
Assignment 4 – Due Friday at 6pm: Hard copy and online
Part C: Questions on your key explanatory variable (if you have more than one, choose the most important one):
If it is a numeric variable, create a histogram of it in Stata. If it is a categorical variable, tabulate it with the Stata command: tab varname, missing.
If it is a numeric variable, get descriptive statistics for it using summarize varname, detail. If it is categorical, make it into a single indicator (dummy) variable, keeping missing values as missing. What important things do you learn about the distribution of your key explanatory variable from these descriptive statistics?
Based on this evidence, are there any observations with values that seem like mistakes? Do you think we should drop these observations or correct the mistake? Explain, and drop or correct as appropriate.
Based on this evidence, is your explanatory variable very skewed, and in particular are there any extreme outliers? If so, do you think we should top-code these values (or use logs, etc.)? Explain why. Then top-code or change into logs if appropriate.
Assignment 4 – Due Friday at 6pm: Hard copy and online
Part D: Questions on Correlation:
Correlate all variables you plan to use. What important things do you learn about the relationship between your dependent variable(s) and your key explanatory variable(s) from this correlation table?
Part E: Simple Regression:
Run a simple regression of your key dependent variable on your key explanatory variable (or one of them, if you have several). What important things do you learn about the relationship between your key dependent and explanatory variables from this regression? In your answer, include a discussion of the explanatory variable’s coefficient, its t-statistic and its confidence interval.
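The Stata commands named in Parts B through E can be strung together as in the sketch below. It is only an illustration, not the required solution: it uses the auto dataset that ships with Stata as a stand-in for your project data, treats price as the dependent variable, weight as a numeric explanatory variable and rep78 as a categorical variable, and the cutoffs (4 for the indicator, 12000 for top-coding) are arbitrary values assumed for the example.

```stata
* Stand-in data: the auto dataset shipped with Stata (an assumption, not your project data)
sysuse auto, clear

* Numeric variable: histogram and detailed descriptive statistics
histogram price
summarize price, detail

* Categorical variable: tabulate it, showing missing values as their own row
tab rep78, missing

* Turn the categorical variable into a single indicator, leaving missing values missing
* (the cutoff of 4 is arbitrary and only for illustration)
generate byte goodrep = (rep78 >= 4) if !missing(rep78)
summarize goodrep, detail

* Top-code extreme values at a hypothetical cap of 12000 without touching missing values
generate price_tc = price
replace price_tc = 12000 if price > 12000 & !missing(price)

* Alternative for a skewed variable: take the natural log
generate ln_price = ln(price)

* Part D: correlation table. Part E: simple regression.
correlate price weight goodrep
regress price weight
```

The `if !missing(...)` qualifiers matter because Stata treats a missing value as larger than any number, so conditions such as `rep78 >= 4` or `price > 12000` would otherwise be true for observations with missing values.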
Omitted Variable Bias
Why know about this?
It is useful in your projects to understand why coefficients change when you add a variable, so that you know which coefficient answers your question.
It is useful in your projects to understand which possibly confounding variables you should search for.
Also, if there is a confounding variable that you cannot measure, this will help you predict the sign of the omitted variable bias.
Finally, it will be on the test.
Multiple regression measures the individual impacts of different factors on Y
Multiple regression helps us to measure the individual impacts of different factors on our dependent variable Y, holding the other factors constant, and so isolating each factor’s effect.
Condo’s Price = 520729 – 46969 BEACON
Price = 6981 + 409 SIZE + 32936 BEACON
Why are the coefficients on Beacon so different?
The coefficient on Beacon in the first (simple) regression says: Across all the properties in our dataset, those on Beacon cost $46,969 less on average.
In contrast, the coefficient on Beacon in the multiple regression says: If we compare two condos of the same size, one on Beacon and one not on Beacon, the one on Beacon costs $32,936 more on average.
If you really want to measure the effect of X1 alone (e.g. Beacon), you need to control for possibly confounding factors.
If you don’t, the coefficient on X1 is biased. We call this omitted (or missing) variable bias.
Omitted variable bias occurs when:
1. The omitted variable has an effect on the dependent variable, AND
2. The omitted variable is correlated with the explanatory variable of interest.
Omitted variable bias in the condo case
Price = 520729 – 46969 BEACON (simple regression)
In a simple regression of Y on X1, the coefficient on X1 measures the combined effects of:
the direct (often called “causal”) effect of the included variable X1 on Y
PLUS
an “omitted variable bias” due to factors that were left out (omitted) from the regression.
Often we want to measure the direct, causal effect. In this case, the coefficient in the simple regression is biased.
Another example: How does getting more education affect salaries?
Let’s say you run this regression: Income = 20,000 + 4,000 Education (in years).
But the coefficient 4,000 may pick up the fact that more intelligent people have both more education and higher income.
If you could add the variable IQ to the regression, the coefficient on education would measure the effect of education holding IQ constant.
We are going to learn methods so that you can understand omitted variable bias, first with graphs
Really, both being on Beacon (X1) and size (X2) affect price (Y), as in the multiple regression
Y = b0 + b1X1 + b2X2
Let’s call this the Full model.
Let’s call b1 and b2 the direct effects.
The mis-specified or Limited model
However, in the simple (one X variable) regression, we measure only a combined effect of Beacon on price. Call its coefficient c1:
Y = c0 + c1X1
Let’s call c1 the combined effect.
The reason that there is a bias on X1 is that there is a Background Relationship between the X’s
We also know that there is a relationship between X1 (Beacon) and X2 (Size). We call this the Background Relationship:
. correlate price size Beacon_Street
(obs=1085)

             |    price     size Beacon~t
-------------+---------------------------
       price |   1.0000
        size |   0.8655   1.0000
Beacon_Str~t |  -0.0552  -0.1081   1.0000

The correlation between size and Beacon_Street is -0.1081, so this background relationship, shown here as a1, is negative.
Let’s combine all 3 pictures: the full model, the limited model & the background relationship
The effect of X1 on Y has two channels.
The first channel is the direct effect b1.
The second channel is the indirect effect through X2: when X1 changes, X2 also tends to change (a1), and this change in X2 has its own effect on Y (b2).
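The two channels can be written as one equation. In the notation of these slides (b1 and b2 from the full model, c1 from the limited model), and taking a1 to be the slope from a simple regression of X2 on X1 over the same observations, the least-squares coefficients satisfy the identity below. Treating a1 as that regression slope is an assumption made here, since the slides show the background relationship only as a correlation. Plugging in the condo coefficients from the two price regressions gives the a1 they imply.

```latex
\begin{align*}
c_1 &= b_1 + a_1 b_2 \\
\text{omitted variable bias} &= c_1 - b_1 = a_1 b_2 \\[4pt]
\text{Condo example: } a_1 &= \frac{c_1 - b_1}{b_2}
  = \frac{-46969 - 32936}{409} \approx -195
\end{align*}
```

Read this way, condos on Beacon are roughly 195 units of size smaller on average (presumably square feet, though the slides do not state the unit), and at about $409 per unit that accounts for the gap of $79,905 between the two Beacon coefficients. The sign of the bias is the sign of a1 times the sign of b2: here a negative times a positive, so the simple regression understates the Beacon effect.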
If we want the direct effect only
When we include both X1 and X2 in a multiple regression, we get the coefficient b1, the direct effect of X1.
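A minimal Stata sketch of the limited model, the full model and the background relationship, run side by side. The condo data are not reproduced here, so the sketch assumes the auto dataset that ships with Stata as a stand-in: foreign plays the role of X1 (like Beacon) and weight plays the role of X2 (like size). The scalar names c1, b1, b2 and a1 follow the notation of these slides, and a1 is taken to be the slope from regressing X2 on X1, which is an assumption rather than something the slides state.

```stata
* Stand-in data: the auto dataset shipped with Stata (an assumption, not the condo data)
sysuse auto, clear

* Limited model: Y on X1 only; its coefficient is the combined effect c1
regress price foreign
scalar c1 = _b[foreign]

* Full model: Y on X1 and X2; b1 and b2 are the direct effects
regress price foreign weight
scalar b1 = _b[foreign]
scalar b2 = _b[weight]

* Background relationship: X2 on X1; its slope is a1
regress weight foreign
scalar a1 = _b[foreign]

* Compare the combined effect with direct effect plus indirect effect
display "combined effect c1    = " c1
display "b1 + a1*b2            = " b1 + a1*b2
display "omitted variable bias = " c1 - b1
```

Because all three regressions use the same observations, the first two displayed numbers should agree up to rounding. With your own data, make sure the three regressions run on the same sample, since missing values in X2 would otherwise change which observations each regression uses.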