QM222 Class 12 Section D1 1. A few Stata things 2


1 QM222 Class 12 Section D1 1. A few Stata things 2
QM222 Class 12 Section D1
1. A few Stata things
2. Nonlinear Relationships in Regression
3. Omitted variable bias (Leaving Things out of a Regression, Chapter 14)
Future topics: One variable with different slopes (for different groups, Chapter 13)
QM222 Fall 2016 Section D1

2 Dealing with missing values, or numbers saved as strings
Open somevarsfrom2013nscg.dta. The variables are:
CH1218 is the number of children between 12 and 18.
CH1218IN is an indicator (dummy) variable for whether there are children between 12 and 18.
jobsatis is how satisfied you are with your job, with 1=very satisfied, 4=very dissatisfied.
lwyr is the last year the person worked.
marsta is marital status. From the codebook: 1 Married, 2 Living in a marriage-like relationship, 3 Widowed, 4 Separated, 5 Divorced, 6 Never Married.
Summarize all variables (sum).

3 Dealing with missing values, or numbers saved as strings
sum (all variables): CH1218IN, jobsatis and marsta show 0 observations because they are saved as strings.
tab CH1218IN
You would need to make this into a real indicator/dummy variable to use it as “Has children 12-18”.
What is “L”? Look at the codebook! It says “Logical skip”. So look up who is NOT asked this question. The answer is people with no children. So recode people with “L” to 0.
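A minimal sketch of turning the string CH1218IN into a numeric dummy. The toy rows stand in for the real file, and the "Y"/"N" codes and the new variable name haskids1218 are assumptions for illustration; check tab CH1218IN and the codebook for the actual codes.

```stata
* Toy data standing in for the real file; the "Y"/"N" codes are assumed.
clear
input str1 CH1218IN
"Y"
"N"
"L"
"Y"
end

* 1 if the respondent reports children aged 12-18, otherwise 0.
* "L" (logical skip: no children at all) becomes 0, as the slide suggests.
* Truly empty strings stay missing.
gen byte haskids1218 = (CH1218IN == "Y") if !missing(CH1218IN)

* Cross-check the recode against the original string values.
tabulate CH1218IN haskids1218
```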

4 Dealing with missing values, or numbers saved as strings
Notice that lwyr goes from 1959 to 9998. But we haven’t yet reached the year 9998, so this value is probably a code for missing, or not asked. To check, tab lwyr.
Who would not be asked this question? People who are currently working. So you would only use this variable when analyzing things about people not working.
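One way to handle a code like 9998 is to set it to Stata's numeric missing value so it cannot distort means or regressions. This is only a sketch on made-up rows; whether 9998 really means "not asked" should be confirmed in the codebook first.

```stata
* Toy data standing in for the real file.
clear
input int lwyr
1959
1998
2012
9998
9998
end

* Look at the distribution first: 9998 stands out as a code, not a year.
tabulate lwyr

* Recode the special value to missing, then summarize the real years.
replace lwyr = . if lwyr == 9998
summarize lwyr
```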

5 Dealing with missing values, or numbers saved as strings
tab jobsatis
Who is “L” (logical skip)? Who would not be asked this question? People who are not working. So you would only use this variable when analyzing people who are working.
Does it make sense to use this as a numerical variable? How could I do that?
First: make “L” into missing, which for a string variable is “”.
Second: destring jobsatis, replace
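The two steps for jobsatis can be sketched as follows, using a few made-up rows in place of the real file. Without the first step, destring would refuse to convert the variable because "L" is a nonnumeric character.

```stata
* Toy data standing in for the real file.
clear
input str1 jobsatis
"1"
"2"
"L"
"4"
end

* Step 1: turn the logical-skip code into string missing ("").
replace jobsatis = "" if jobsatis == "L"

* Step 2: convert the string variable to a numeric one in place.
destring jobsatis, replace

* jobsatis is now numeric; the former "L" row is numeric missing (.).
summarize jobsatis
```

An alternative is destring jobsatis, replace force, which converts every nonnumeric value to missing in one step, but it does so silently, so the explicit replace is easier to audit.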

6 Dealing with missing values, or numbers saved as strings
tab marsta
Does it make sense to use this as a numerical variable? NO! The codes are categories, not quantities. Leave as is.

7 Estimating nonlinear relationships
Could the relationship be non-linear, and if so, how can we estimate this using linear regression?

8 Non-linear relationships between Y and X
Sometimes, the relationship between the Y variable and the X variable is unlikely to be linear. This may lead you to measure a very small, statistically insignificant slope. E.g., if you ran a regression on the data in this graph, the slope coefficient would be zero.

9 Many of you believe that you might have nonlinear relationships
e.g. Maybe job satisfaction goes up with age and then down again.
e.g. You do not believe that a $1 increase in price will have the same effect going from $10 to $11 as going from $100 to $101.
Note that this section is only applicable to numerical variables. You cannot do these nonlinear things with indicator variables.

10 To solve the problem of Y possibly increasing with X and then decreasing:
You simply add to the regression a new X variable that is a non-linear version of the old variable. My suggestion: estimate a quadratic by making a new variable X^2 and running the regression with both the linear and the non-linear (quadratic) term in the equation. If you don’t know whether a relationship is nonlinear, you can estimate the regression assuming it is nonlinear (e.g. quadratic) and then examine the results to see if this assumption is correct.

11 Quadratic: Y = b0 + b1 X + b2 X^2
In high school you learned that quadratic equations look like this (a U shape or an upside-down U shape). So by adding a squared term, you can estimate these shapes.

12 However, a regression with a quadratic can estimate ANY part of these shapes
So, using a quadratic does not mean that the curve need actually ever change from a positive to a negative slope or vice versa …

13 How do you know whether the relationship really is nonlinear?
Put in a nonlinear term (e.g. a squared term) and let its |t-stat| tell you if it belongs in there. If |t-stat| > 2, you are more than 95% confident that the relationship is nonlinear. Even if |t-stat| < 2, it’s a good idea to keep the quadratic term in as long as you are relatively confident it belongs. I tend to leave it in if it has a |t-stat| > 1, which means that I am at least 68% confident the relationship is nonlinear.
Example: I know annual visitors to the park. I want to know if they are growing (or falling) at a constant rate over time, or not. First I make the variables:
gen time = _n
gen timesq = time^2
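The workflow described here can be sketched end to end. The park data are not included in this transcript, so the block below simulates 23 years of visitors; the simulated numbers (including the noise level) are invented for illustration, and only the variable names annualvisitors, time and timesq come from the slides.

```stata
* Simulated stand-in for the park data: 23 yearly observations.
clear
set obs 23
set seed 222
gen time = _n
gen timesq = time^2

* Invented data-generating process with an inverted-U shape plus noise.
gen annualvisitors = 1100000 + 118000*time - 5400*timesq + rnormal(0, 100000)

* Linear-only model first, then the quadratic model.
regress annualvisitors time
regress annualvisitors time timesq

* The t-stat on the squared term is the evidence for nonlinearity.
display "t-stat on timesq = " _b[timesq]/_se[timesq]
```

Applying the rule of thumb above, you would keep timesq if the displayed |t-stat| exceeds 2 (or, more leniently, 1).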

14 Here are regressions of visitors (to Cape Canaveral) on time, then on time AND timesq. Is the relationship nonlinear? Are visitors growing/shrinking, and at a constant rate?
. regress annualvisitors time
[Stata regression output; numeric values not preserved in this transcript. Recoverable: F(1, 21), Root MSE = 2.5e+05]
. regress annualvisitors time timesq
[Stata regression output; numeric values not preserved in this transcript. Recoverable: F(2, 20), Root MSE = 1.3e+05]

15 Sketching the Quadratic Visitors = 1102401 + 118498 time - 5374 time^2
The linear term is positive, so at a small X (e.g. X=0.1) the slope is positive. The squared term is negative, so the slope eventually becomes negative. So the general shape is an upside-down U. But which part of the curve is it?
For those who don’t think in derivatives, plug in high, medium and low values for X in the original equation. In this data, time goes from 1 to 23, so:
At time=1, Visitors = 1102401 + 118498(1) - 5374(1^2) = 1,215,525
At time=10, Visitors = 1102401 + 118498(10) - 5374(10^2) = 1,749,981
At time=23, Visitors = 1102401 + 118498(23) - 5374(23^2) = 985,009
So over these 23 years, predicted visitors go up, then back down again.
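The plug-in arithmetic can be reproduced with Stata's display command, using the fitted equation from this slide's title. The loop below is just a convenience sketch; typing three separate display lines works equally well.

```stata
* Predicted visitors at low, middle and high values of time.
foreach t in 1 10 23 {
    display "time = `t': predicted visitors = " %12.0fc 1102401 + 118498*`t' - 5374*`t'^2
}
```

The three results should match the slide: 1,215,525, then 1,749,981, then 985,009.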

16 Sketching the Quadratic using calculus Visitors = 1102401 + 118498 time - 5374 time^2
Calculus tells us the slope: dVisitors/dtime = 118498 - 2*5374 time
The slope gets smaller as time increases. At the top of this curve, the slope is exactly zero. So solve:
0 = 118498 - 2*5374 time
time = 11.03
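Written out step by step, with t standing for time and t* (a symbol introduced here) for the turning point, the calculation is:

```latex
\begin{align*}
\text{Visitors} &= 1102401 + 118498\,t - 5374\,t^{2} \\
\frac{d\,\text{Visitors}}{dt} &= 118498 - 2(5374)\,t = 118498 - 10748\,t \\
0 &= 118498 - 10748\,t^{*}
  \;\Rightarrow\; t^{*} = \frac{118498}{10748} \approx 11.03
\end{align*}
```

Since time runs from 1 to 23, the turning point falls inside the data, which is why predicted visitors first rise and then fall.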

17 What about this issue: You believe that a 1% increase in X will have the same % effect on Y no matter what X you start at. [NOT ON TEST]
e.g. You believe a 1 percent increase in price has a constant percentage effect on sales.
Mathematical rule: If lnY = b0 + b1 lnX, b1 represents %∆Y / %∆X, or the percentage change in Y when X changes by 1%.
(ln is the natural log, the log to the base e. Log usually means to the base 10. Either works. In Stata, both ln() and log() give the natural log; log10() gives base 10.)
So just make two new variables, lnY and lnX, and run a regression:
regress lnY lnX
The coefficient will be the percentage change in Y when X changes by 1%.
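A small sketch of the log-log regression on simulated data. The data-generating numbers are invented; the elasticity is built in as -1.5 so you can see what the coefficient on lnX is estimating. Both variables must be strictly positive, because ln() of zero or a negative number is missing in Stata.

```stata
* Simulated data with a constant elasticity of -1.5 (invented for illustration).
clear
set obs 200
set seed 222
gen X = 1 + 99*runiform()
gen Y = 5000 * X^(-1.5) * exp(rnormal(0, 0.2))

* Make the logged variables and run the regression.
gen lnY = ln(Y)
gen lnX = ln(X)
regress lnY lnX

* The coefficient on lnX is the estimated % change in Y per 1% change in X.
display "estimated elasticity = " _b[lnX]
```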

18 A case when logs might be useful?
If you have skewed data (like lifetime gross in movies), you could just regress ln(Lifetime gross) = b0 + b1 ln(metascore)

19 We should talk more if you want to use logs

20 Back to the hobbit data set
Make a variable for timesquared. Run a regression of gross on time, timesquared, and the better of the other two (weekend indicator, or day of week indicator variables). Is the relationship between gross and time nonlinear? What does it look like?

21 Dealing with skewed data

22 There are 3 ways you might deal with skewed data
1. Use logs for the skewed variable (if you believe the right relationship is with the percentage change).
2. If the skewed variable is the dependent variable, predict the median rather than the mean by running: qreg Yvariable Xvariable
3. You can topcode the variable (whether it is a dependent or explanatory variable), for instance: replace LifetimeGross = if LifetimeGross>
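Options 2 and 3 can be sketched as follows on simulated movie data. The simulated values and the cap of 100,000,000 are hypothetical, chosen only to make the block runnable; they are not the values from the original slide. Note the !missing() guard: Stata treats missing as larger than any number, so without it a missing gross would be overwritten with the cap.

```stata
* Simulated right-skewed data (invented for illustration).
clear
set obs 300
set seed 222
gen metascore = 20 + 70*runiform()
gen LifetimeGross = exp(14 + 0.03*metascore + rnormal(0, 1.2))

* Option 2: median regression instead of mean (OLS) regression.
qreg LifetimeGross metascore

* Option 3: topcode a copy of the variable at a hypothetical cap.
local cap = 100000000
gen LifetimeGross_tc = LifetimeGross
replace LifetimeGross_tc = `cap' if LifetimeGross_tc > `cap' & !missing(LifetimeGross_tc)
summarize LifetimeGross LifetimeGross_tc
```

Topcoding a copy rather than the original keeps the raw values available if you later decide on a different cap.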

23 Omitted Variable Bias

24 Multiple regression measures the individual impacts of different factors on Y….
Multiple regression helps us to measure the individual impacts of different factors on our dependent variable Y, holding the other factors constant, and so isolating each factor’s effect.

25 Assignments and omitted variable bias
Assignment 4 asks people to run a regression. Results will be mixed: Some will find the coefficient as expected, some will not. Some will find the coefficient statistically significant (|t| > 2), some will not.
However, that coefficient is likely to be picking up some other important factor. Example: The dummy on Beacon Street was also picking up the effect of size.
Assignment 5 asks people to add more X variables as controls, in case these new variables are correlated with both Y and X1.

26 Condo’s Price = 520729 – 46969 BEACON
Price = SIZE BEACON
Why are the coefficients on Beacon so different?
The coefficient on Beacon in the first (simple) regression says: Across all the properties in our dataset, those on Beacon cost $46,239 less on average.
In contrast, the coefficient on Beacon in the multiple regression says: If we compare two condos of the same size, one on Beacon and one not on Beacon, the one on Beacon costs $32,946 more.

27 Condo’s Price = 520729 – 46969 BEACON
Price = SIZE BEACON
Which of these two equations should the executive use to figure out how much of a premium she will have to pay for an equivalent condo on Beacon Street? The multiple regression.
Which of these two equations should the realtor care most about, since realtors get a percent of the sale price? The simple regression.

28 If you really want to measure the effect of X alone
If you really want to measure the effect of X alone (e.g. Beacon), you need to control for possibly confounding factors. If you don’t, the coefficient on X is biased. We call this omitted or missing variable bias.
Omitted variable bias occurs when:
The omitted variable has an effect on the dependent variable, AND
The omitted variable is correlated with the explanatory variable of interest.

29 It is necessary in your projects to understand why the coefficients change when you add a variable
And also, if there is a confounding variable that you cannot measure, you want to be able to predict what the sign of the omitted variable bias is.

30 Another example: How does getting more education affect salaries?
Let’s say you run this regression: Income = 20, Education (in years). But the coefficient 4000 may pick up the fact that more intelligent people have both more education and higher income. If you could add the variable IQ to the regression, the coefficient on education would hold IQ constant.

31 Omitted variable bias Price = 520729 – 46969 BEACON
Price = SIZE BEACON
In a simple regression of Y on X1, the coefficient b1 measures the combined effects of:
the direct (often called “causal”) effect of the included variable X1 on Y
PLUS
an “omitted variable bias” due to factors that were left out (omitted) from the regression.
Often we want to measure the direct, causal effect. In this case, the coefficient in the simple regression is biased.

32 In-Class exercise (t-stats in parentheses)
Regression 1: Score = – 5.68 Pay_Program   adjR2=.0175
t-stats: (93.5) (-3.19)
Regression 2: Score = Pay_Program OldScore   adjR2=.6687
t-stats: (6.52) (3.46) (31.68)
Regression 3: Score = Pay_Program OldScore – Poverty   adjR2=.6727
t-stats: (7.10) (4.59) (28.97) (-3.05)

33 You can:
Reason out what sign the bias will be if there is likely a confounding variable that we cannot measure.
Understand exactly why a coefficient changes the way it does when we add a confounding variable into the regression.

34 Graphical representation of Omitted Variable Bias
Really, both being on Beacon and size affect price.
Y = b0 + b1X1 + b2X2
Let’s call this the Full model. Let’s call b1 and b2 the direct effects.

35 The mis-specified or Limited model
However, in the simple regression, we measure only a (combined) effect of Beacon on price. Call its coefficient c1.
Y = c0 + c1X1
Let’s call c1 the combined effect.

36 Background relationship between X’s
We also know that there is a relationship between X1 (Beacon) and X2 (Size). We call this the Background Relationship:
. correlate price size Beacon_Street
(obs=1085)
[correlation matrix of price, size and Beacon_Street; numeric values not preserved in this transcript]
This background relationship, shown here as a1, is negative.

37 Let’s combine all 3 pictures: the full model, the limited model & the background relationship
The effect of X1 on Y has two channels.
The first one is the direct effect b1.
The second channel is the indirect effect through X2: when X1 changes, X2 also tends to change (a1), and this change in X2 has another effect on Y (b2).
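The two channels can be combined algebraically. Writing the background relationship as a regression of X2 on X1 with slope a1 (the intercept a0 is notation introduced here), substituting it into the full model gives the limited model's coefficient:

```latex
\begin{align*}
Y   &= b_0 + b_1 X_1 + b_2 X_2            && \text{(full model)} \\
X_2 &= a_0 + a_1 X_1                      && \text{(background relationship)} \\
Y   &= b_0 + b_1 X_1 + b_2\,(a_0 + a_1 X_1) \\
    &= (b_0 + b_2 a_0) + (b_1 + b_2 a_1)\,X_1 \\
c_1 &= b_1 + b_2 a_1                      && \text{(combined effect in the limited model)}
\end{align*}
```

The term b2·a1 is the omitted variable bias: it is zero only if the omitted variable has no effect on Y (b2 = 0) or is unrelated to X1 (a1 = 0), which are exactly the two conditions for bias listed earlier.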

38 If we want the direct effect only
When we include both X1 and X2 in a multiple regression, we get the coefficient b1, the direct effect of X1.

40 Let’s apply this to Brookline Condos
Limited Model: Price = – BEACON (c1, the combined effect, is negative)
Full Model: Price = SIZE BEACON (b1, the direct effect, is positive)
Background relationship: SIZE = 1254 – BEACON (a1 is negative)
c1 = b1 + b2a1
Check: =32935+( *409.4)
Bias is b2a1 or * which is negative. We are UNDERESTIMATING the direct effect.
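The check c1 = b1 + b2a1 can be reproduced in Stata by running the three regressions and saving the coefficients as scalars. The Brookline data are not part of this transcript, so the sketch below simulates condos with invented numbers; only the variable names price, size and Beacon_Street come from the slides. With OLS the identity holds exactly in any sample, so the two displayed values should agree up to rounding.

```stata
* Simulated condo data (all numbers invented for illustration).
clear
set obs 1000
set seed 222
gen byte Beacon_Street = runiform() < 0.3
gen size = 1250 - 200*Beacon_Street + rnormal(0, 300)
gen price = 6000 + 400*size + 30000*Beacon_Street + rnormal(0, 50000)

* Limited model: combined effect c1.
regress price Beacon_Street
scalar c1 = _b[Beacon_Street]

* Full model: direct effects b1 (Beacon) and b2 (size).
regress price size Beacon_Street
scalar b1 = _b[Beacon_Street]
scalar b2 = _b[size]

* Background relationship: a1.
regress size Beacon_Street
scalar a1 = _b[Beacon_Street]

* Compare the combined effect with direct effect plus bias.
display "c1         = " c1
display "b1 + b2*a1 = " b1 + b2*a1
display "bias b2*a1 = " b2*a1
```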

41 It is useful in your projects to understand why the coefficients change when you add a variable
And also, if there is a confounding variable that you cannot measure, you want to be able to predict what the sign of the omitted variable bias is.

