Overview of Regression Analysis

1 Overview of Regression Analysis

2 Conditional Mean We all know what a mean or average is. E.g., the mean annual earnings for 25-44 year old working males is $54,648 (March 2010). We are also often interested in how this mean differs by other individual characteristics. E.g., how do mean earnings differ between black and non-black workers? Mean earnings for working non-black males ages 25-44 = $56,614. Mean earnings for working black males ages 25-44 = $39,380. These are known as conditional means (the mean conditioned on some other characteristic, in this case race). So, without controlling for anything else, 25-44 year old black working males earn on average $17,234 less annually, or 30% less, than non-black working males of similar age.
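As a minimal sketch of the idea, the snippet below computes a conditional mean by filtering records on a characteristic, then redoes the gap arithmetic from this slide. The `conditional_mean` helper and the three-person `sample` are hypothetical illustrations; only the two group means ($56,614 and $39,380) come from the slide.

```python
def conditional_mean(records, outcome, **conditions):
    """Mean of `outcome` over the records that match every condition."""
    matched = [r[outcome] for r in records
               if all(r[k] == v for k, v in conditions.items())]
    return sum(matched) / len(matched)

# HYPOTHETICAL toy records, only to show the mechanics.
sample = [
    {"black": 0, "earnings": 50_000},
    {"black": 0, "earnings": 60_000},
    {"black": 1, "earnings": 40_000},
]
toy_mean_nonblack = conditional_mean(sample, "earnings", black=0)

# Group means reported on the slide (March 2010, working males ages 25-44).
mean_nonblack = 56_614
mean_black = 39_380
gap = mean_nonblack - mean_black   # dollar gap between the two conditional means
gap_share = gap / mean_nonblack    # gap as a share of the non-black mean
print(gap, round(gap_share, 2))
```

Conditioning on more characteristics is just a matter of passing more keyword conditions, which is also why the number of cells to compute grows quickly.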

3 Conditional Means When testing a theory, though, we often want to know how much of a given mean difference can be attributed to a particular observable variable after controlling for other observable differences. For example, we know that earnings are closely tied to schooling and that there is a significant racial gap in schooling, so we might want to know how large the racial earnings gap is net of racial differences in years of schooling (i.e., controlling for schooling).

4 Conditional Means One way to do this is to calculate even more complicated conditional means. E.g., mean earnings for males ages 25-44, by race and education:
Without a high school degree: non-black = $25,278; black = $22,275
With a high school degree: non-black = $41,922; black = $33,670
With a college degree: non-black = $80,295; black = $61,136

5 Conditional Means Then we can find how much less blacks earn than whites, after controlling for education, via the following weighted mean formula:
gap = Σ_i (n_b,i / n_b) × (earnings_w,i - earnings_b,i)
where i indexes the three education categories, n_b,i / n_b is the fraction of black male workers in education category i, earnings_b,i is the mean earnings for black workers in education category i, and earnings_w,i is the mean earnings for white workers in education category i. Doing so, we find that, according to the above conditional mean calculations, black male workers earn about $11,064, or 11,064/54,648 = 20 percent, less than white male workers with similar education characteristics. So conditioning on education explains about 33% of the racial earnings gap ([0.30 - 0.20]/0.30 = 0.33).
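The weighted mean described on this slide can be sketched as follows. The education-specific means are the ones reported on the previous slide, but the shares of black workers in each category are not given in the slides, so the `shares` values below are hypothetical placeholders and the printed result is not expected to match the $11,064 figure.

```python
# Conditional means from the slides: (white/non-black mean, black mean) by education.
means = {
    "no_hs":   (25_278, 22_275),
    "hs":      (41_922, 33_670),
    "college": (80_295, 61_136),
}

def weighted_gap(black_shares, means):
    """Sum over categories of (share of black workers) * (white mean - black mean)."""
    assert abs(sum(black_shares.values()) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(black_shares[c] * (w - b) for c, (w, b) in means.items())

# HYPOTHETICAL shares n_b,i / n_b (not reported in the slides).
shares = {"no_hs": 0.15, "hs": 0.60, "college": 0.25}
print(round(weighted_gap(shares, means), 1))

# Share of the raw gap explained by education, using the slide's rounded percentages.
explained = (0.30 - 0.20) / 0.30
print(round(explained, 2))
```

Weighting by the black education distribution answers the question "how much more would black workers earn if each were paid the white mean for their own education category?"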

6 Conditional Means Computing all these conditional means can be quite cumbersome, though, especially if we start adding more categories for education (e.g., only up to 10th grade, only up to 11th grade, only up to 12th grade, 1 year of college, 2 years of college, 3 years of college, etc.). Moreover, what if we are also interested in the impact of another year of schooling on earnings, after controlling for race? That would require a whole new set of calculations.

7 Regression This is why a regression model is often a simpler way to describe conditional means. earnings_i = α + β_1*black_i + β_2*yrs of school_i + e_i. α is known as the intercept, the β's are (slope) coefficients, and e_i is the “residual”. Estimating a regression amounts to finding the intercept and slope coefficients that minimize the sum of the squared e_i terms across the sample (i.e., finding the best “fit”). So the intercept and coefficients essentially account for the variation in the dependent variable (earnings) that is common across all people with respect to the control variables, while the residual is the individual-specific variation, or how each individual differs from the average. Graphically?
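To make "minimize the sum of the squared e_i terms" concrete, here is a small pure-Python sketch that solves the least-squares normal equations for a model with an intercept, a race indicator, and years of schooling. The data are synthetic and the true coefficients (10, -2, 3) are arbitrary assumptions chosen for illustration; in practice one would normally use a statistics package rather than hand-rolled elimination.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y.

    X is a list of rows; each row starts with 1.0 for the intercept.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

def ssr(X, y, b):
    """Sum of squared residuals for coefficient vector b."""
    return sum((yi - sum(bj * xj for bj, xj in zip(b, row))) ** 2
               for row, yi in zip(X, y))

# SYNTHETIC data: y = 10 - 2*black + 3*school + noise (all values hypothetical).
black = [0, 0, 0, 0, 1, 1, 1, 1]
school = [10, 12, 14, 16, 10, 12, 14, 16]
noise = [1, -1, -1, 1, -1, 1, 1, -1]
X = [[1.0, b, s] for b, s in zip(black, school)]
y = [10 - 2 * b + 3 * s + e for b, s, e in zip(black, school, noise)]

b_hat = ols(X, y)
print([round(v, 6) for v in b_hat], ssr(X, y, b_hat))
```

Any other choice of coefficients, for example nudging the schooling slope by 0.1, gives a larger sum of squared residuals than `b_hat` on the same data.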

8 Regression [Figure: Earnings (vertical axis) plotted against Yrs of Schooling (horizontal axis), showing lines with intercepts α and α+β_1 and slope β_2.]

9 Regression When I estimate this model I get: earnings_i = -70,003 - 10,381*black_i + 8,888*yrs of schooling_i + e_i (standard errors, in the order of the terms: 1,968; 1,126; 138). Computing the equation for particular characteristics without the e_i term gives “expected,” or average, earnings for a person with those characteristics. So for a non-black worker with 12 years of schooling, expected earnings are: -70,003 - 10,381*0 + 8,888*12 = $36,653. How do we interpret specific coefficients?
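The "expected earnings" calculation on this slide is just the fitted equation evaluated without the residual. A minimal sketch, using the estimates reported above (the function and constant names are illustrative, not from the slides):

```python
# Estimates reported on the slide.
ALPHA = -70_003        # intercept
BETA_BLACK = -10_381   # coefficient on the black indicator
BETA_SCHOOL = 8_888    # coefficient on years of schooling

def expected_earnings(black, yrs_school):
    """Fitted (average) earnings for the given characteristics; residual omitted."""
    return ALPHA + BETA_BLACK * black + BETA_SCHOOL * yrs_school

print(expected_earnings(black=0, yrs_school=12))
print(expected_earnings(black=1, yrs_school=12))
```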

10 Regression The way to interpret coefficients (i.e., “betas”): “The marginal change in the conditional mean of the dependent variable due to a one-unit increase in that characteristic, holding all other characteristics constant.” So one way to determine the marginal impact of a given characteristic on the dependent variable (e.g., the impact of another year of schooling on earnings) is simply to see how the “expected” outcome of the dependent variable would differ if two individuals differed by one unit in that characteristic but were otherwise the same.

11 Regression For example, consider our estimated earnings regression: earnings_i = -70,003 - 10,381*black_i + 8,888*yrs of schooling_i + e_i. Finding this “difference” between two individuals who are the same on all other characteristics (i.e., same race), but where one has s years of education while the other has s+1, we get [-70,003 - 10,381*black + 8,888*(s+1)] - [-70,003 - 10,381*black + 8,888*s] = [(s+1) - s]*8,888 = 8,888. So, under this specification, the “marginal” impact of another year of schooling on earnings is β_2, or simply the coefficient on the years of schooling variable.

12 Regression Consider again our estimated earnings regression: earnings_i = -70,003 - 10,381*black_i + 8,888*yrs of schooling_i + e_i. Doing a similar exercise with the “black” indicator variable (i.e., holding yrs of schooling constant and comparing an individual with black = 1 to an individual with black = 0), we get -10,381. This means that, holding everything else equal (i.e., yrs of education), black workers earn on average $10,381 less than white workers. This is similar to the $11,064 conditional pay differential we computed before, but still a little different. Why?
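The "compare two otherwise identical individuals" exercise can be checked numerically. The sketch below uses the estimated coefficients from the slides; the `marginal_effect` helper is a hypothetical name introduced here for illustration.

```python
def expected_earnings(black, yrs_school, a=-70_003, b1=-10_381, b2=8_888):
    """Fitted earnings from the estimated regression, residual omitted."""
    return a + b1 * black + b2 * yrs_school

def marginal_effect(var, black, yrs_school):
    """Change in expected earnings when `var` rises by one unit, the other held fixed."""
    base = expected_earnings(black, yrs_school)
    if var == "yrs_school":
        return expected_earnings(black, yrs_school + 1) - base
    if var == "black":
        return expected_earnings(black + 1, yrs_school) - base
    raise ValueError(f"unknown variable: {var}")

# The difference does not depend on the starting point in this linear specification.
for s in (8, 12, 16):
    print(s, marginal_effect("yrs_school", 0, s), marginal_effect("black", 0, s))
```

Because the specification is linear with no interaction, the schooling effect is the same for both groups and the race gap is the same at every schooling level.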

13 Regression Often, when we run regressions, we aren't really interested in “point estimates” (i.e., specific coefficient estimates), but rather in using these estimates to test hypotheses. For example, what if what we are really interested in is whether black workers have a lower return to an additional year of schooling than white workers? How could we test this?

14 Regression What if I added in an “interaction” term between schooling and race? earnings_i = α + β_1*black_i + β_2*yrs of school_i + β_3*black_i*yrs of school_i + e_i. Doing this estimation I get: earnings_i = -47,011 + 1,381*black_i + 7,321*yrs of school_i - 982*black_i*yrs of school_i. How do we interpret these coefficients? What is the avg impact of another year of schooling on a black worker's earnings? What is the avg impact of another year of schooling on a white worker's earnings? The marginal impact of another year of schooling on earnings is β_2 for white workers and β_2 + β_3 for black workers, so the hypothesis test amounts to determining whether β_3 is “statistically” different from zero.
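With the interaction term, the return to schooling differs by group and the race gap varies with schooling. A short sketch using the estimates reported on this slide (function names are illustrative):

```python
# Estimates from the interaction specification on the slide.
A, B_BLACK, B_SCHOOL, B_INTERACT = -47_011, 1_381, 7_321, -982

def expected_earnings(black, yrs_school):
    """Fitted earnings with the black*schooling interaction, residual omitted."""
    return (A + B_BLACK * black + B_SCHOOL * yrs_school
            + B_INTERACT * black * yrs_school)

def return_to_schooling(black):
    """Marginal effect of one more year of schooling: beta_2 + beta_3 * black."""
    return B_SCHOOL + B_INTERACT * black

def race_gap(yrs_school):
    """Black minus white expected earnings at a given schooling level."""
    return B_BLACK + B_INTERACT * yrs_school

print(return_to_schooling(0), return_to_schooling(1))
print(race_gap(12))
```

Note that the positive coefficient on the black indicator alone is the gap at zero years of schooling, far outside the range of the data; the gap at realistic schooling levels comes from `race_gap`.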

15 Regression Precision/significance of estimates: Consider again the previous estimates. What we are testing is whether the coefficient of interest is “significantly” different from zero (i.e., how likely is it that we would have gotten an estimate this large by chance even if the true value were zero?). To test a hypothesis, we must compare the size of the coefficient to its standard error. A good rule of thumb is that an estimate is significant when the absolute magnitude of the coefficient is more than twice its standard error. What will generally impact whether an estimate is significant?
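The rule of thumb amounts to checking whether the ratio of a coefficient to its standard error exceeds about 2 in absolute value. The sketch below applies it to the first earnings regression, assuming the three standard errors reported with that regression (1,968; 1,126; 138) belong to the intercept, the black indicator, and years of schooling in that order; that pairing is an assumption based on their position in the transcript.

```python
def t_ratio(coef, se):
    """Coefficient divided by its standard error."""
    return coef / se

def passes_rule_of_thumb(coef, se, threshold=2.0):
    """True if |coef| is more than `threshold` times its standard error."""
    return abs(t_ratio(coef, se)) > threshold

# (coefficient, standard error); the pairing is ASSUMED from the order on the slide.
estimates = {
    "intercept": (-70_003, 1_968),
    "black": (-10_381, 1_126),
    "yrs_school": (8_888, 138),
}

for name, (coef, se) in estimates.items():
    print(name, round(t_ratio(coef, se), 2), passes_rule_of_thumb(coef, se))
```

Holding the coefficient fixed, a smaller standard error (for instance from a larger sample or less residual variation) makes the ratio larger in absolute value.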

16 Specification form Often when doing regressions researchers will use the natural log of earnings rather than simply earnings as the dependent variable: ln(earnings_i) = α + β_1*black_i + β_2*schooling_i + e_i. This is done for two reasons: 1. This specification often “fits” the data better, as the log transformation makes a variable with a highly skewed distribution closer to a normal distribution, which generally helps the regression fit. 2. The coefficients can be roughly interpreted as percentage changes in the dependent variable associated with a unit change in the corresponding control variable (i.e., a semi-elasticity), rather than as how the level of the dependent variable changes given a unit change in the corresponding control variable.
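The word "roughly" matters here: reading a log-specification coefficient β as a 100*β percent change is an approximation that is good for small β, while the exact proportional change is exp(β) - 1. The sketch below compares the two; the coefficient values are hypothetical, since the slides do not report estimates for the log specification.

```python
import math

def approx_pct_change(beta):
    """Rule-of-thumb reading of a log-specification coefficient, in percent."""
    return 100 * beta

def exact_pct_change(beta):
    """Exact percent change in the level of the outcome implied by beta."""
    return 100 * (math.exp(beta) - 1)

# HYPOTHETICAL coefficients, for illustration only.
for beta in (0.01, 0.10, -0.25):
    print(beta, round(approx_pct_change(beta), 2), round(exact_pct_change(beta), 2))
```

The gap between the two readings widens as the coefficient moves away from zero, which is worth keeping in mind for indicator variables with large coefficients.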

17 Specification form

18 Omitted variables If we are really interested in the wage gap between black workers and white workers after conditioning on years of education, what are we missing from the basic specification that might obscure the answer we are really looking for? ln(earnings_i) = α + β_1*black_i + β_2*schooling_i + e_i

19 Omitted variables ln(earnings_i) = α + β_1*black_i + β_2*Hispanic_i + β_3*schooling_i + e_i. What will this likely do to the coefficient on the black indicator?

21 Omitted variables What about other things like age and region? These things are surely associated with earnings, so don't they need to be included?

23 Omitted variables In the end, it is not necessary to control for every possible thing that can affect the dependent (y, or left-hand side) variable. What to control for depends on your question of interest. Robustness: a finding is said to be relatively robust if the basic qualitative finding is unchanged by the inclusion of further variables, the addition of more interaction terms (i.e., the combination of two existing variables, such as the term black*years of school), or changes in specification form (e.g., log transformation of the dependent variable).
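One standard way to see when leaving a variable out changes the coefficient of interest is the least-squares omitted-variable identity: the coefficient from the regression that omits a variable equals the coefficient from the regression that includes it, plus the omitted variable's coefficient times the slope from regressing the omitted variable on the included one. This is a textbook identity rather than something stated in the slides. The sketch uses the level-earnings estimates from earlier slides, and the schooling difference `delta` is a hypothetical value.

```python
def short_coefficient(beta_long, beta_omitted, delta):
    """Coefficient on the included variable when another variable is left out.

    beta_long:    coefficient on the included variable with the control present
    beta_omitted: coefficient on the control variable
    delta:        slope from regressing the control on the included variable
                  (for an indicator, the difference in the control's group means)
    """
    return beta_long + beta_omitted * delta

# Coefficients from the earlier earnings regression; delta is HYPOTHETICAL.
beta_black, beta_school = -10_381, 8_888
print(short_coefficient(beta_black, beta_school, delta=-0.8))
print(short_coefficient(beta_black, beta_school, delta=0.0))
```

When `delta` is zero the omitted control leaves the coefficient unchanged, which is one reason a variable can be associated with earnings without needing to be included for a particular question.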

24 Selection Be very wary of making causal inferences from significant correlations. In particular, there are often issues of sample selection/endogeneity/omitted variables. Many characteristics are the products of choice (often called endogenous characteristics). In such cases it is hard to identify how the outcome of interest depends on that endogenous characteristic versus other unobserved/omitted characteristics that determined that choice. Consider again the expected “compensating wage differential” for higher-risk jobs. Does this truly capture the average tolerance for risk? Consider the Brooklyn Bridge “effect” on wages.

25 Selection More realistically, consider trying to estimate the causal “effect” of being in a gang on individual crime. What might be the concern with regressing the number of crimes committed on an indicator for whether someone is in a gang, even after controlling for household income, race, age, and neighborhood characteristics? How about estimating the causal “effect” of being married on individual crime by regressing the number of crimes committed on an indicator for whether someone is married, even after controlling for income, race, age, and neighborhood characteristics?
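A small simulation can illustrate the concern. In the sketch below, an unobserved propensity drives both the choice to join a gang and a crime index, while the true causal effect of membership is set to zero; a naive comparison of group means still shows a large "effect". Every number here (the 0.7 cutoff, the factor of 5, the noise range) is a made-up assumption for illustration, not an estimate from any data.

```python
import random

def naive_gang_effect(n=10_000, true_effect=0.0, seed=0):
    """Difference in mean crime index between members and non-members."""
    rng = random.Random(seed)
    members, others = [], []
    for _ in range(n):
        u = rng.random()       # unobserved propensity, uniform on [0, 1)
        in_gang = u > 0.7      # membership is a choice driven by the propensity
        # Crime index depends on the propensity, plus any true membership effect.
        crime = 5 * u + true_effect * in_gang + rng.uniform(-1, 1)
        (members if in_gang else others).append(crime)
    return sum(members) / len(members) - sum(others) / len(others)

# The true effect is zero, yet the naive comparison is far from zero.
print(round(naive_gang_effect(), 2))
```

Controlling for observables helps only to the extent that they capture the propensity; whatever remains unobserved still loads onto the membership indicator.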

26 Selection

27 Summary In summary: The coefficient on a given variable tells you the expected change in the outcome of interest due to a one-unit change in that variable, after controlling for all of the other included characteristics. Little credence should be given to imprecisely estimated coefficients (i.e., those with standard errors large enough that they are not statistically different from zero), especially when hypothesis testing. A key detail of a paper is the “empirical strategy” it uses to deal with selection effects. Much of this class will be spent discussing the various empirical strategies authors use in the papers we read. In the end, use your empirical intuition: can this data really answer the question of interest?

