Published by Ferdinand Harmon. Modified over 7 years ago.
1
QM222 Class 12 Section D1
1. A few Stata things
2. Nonlinear Relationships in Regression
3. Omitted variable bias (Leaving Things out of a Regression, Chapter 14)
Future topics: One variable with different slopes (for different groups, Chapter 13)
QM222 Fall 2016 Section D1
2
Dealing with missing values, or numbers saved as strings: open somevarsfrom2013nscg.dta
CH1218 is the number of children between 12 and 18.
CH1218IN is an indicator (dummy) variable for whether there are children between 12 and 18.
jobsatis is how satisfied you are with your job, with 1 = very satisfied and 4 = very dissatisfied.
lwyr is the last year the person worked.
marsta is marital status. From the codebook:
1 Married
2 Living in a marriage-like relationship
3 Widowed
4 Separated
5 Divorced
6 Never married
sum (all variables)
3
Dealing with missing values, or numbers saved as strings
sum (all variables)
CH1218IN, jobsatis and marsta show 0 observations because they are saved as strings.
tab CH1218IN
You would need to turn this into a real indicator/dummy variable to use it as “Has children 12-18”.
What is “L”? Look at the codebook! It says “Logical skip”. So look up who is NOT asked this question. The answer is people with no children. So code people with “L” as 0.
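A minimal Stata sketch of that recoding, under stated assumptions: the slide only tells us that “L” means logical skip, so the codes "Y" and "N" below are hypothetical stand-ins for whatever tab CH1218IN actually shows, and haskids1218 is an illustrative variable name.

```stata
* Look at the codes actually stored in the string variable first
tabulate CH1218IN, missing

* Hypothetical coding: "Y" = yes, "N" = no, "L" = logical skip (no children).
* Replace "Y" and "N" with the codes that tabulate reports.
generate byte haskids1218 = .
replace haskids1218 = 1 if CH1218IN == "Y"
replace haskids1218 = 0 if inlist(CH1218IN, "N", "L")

* Cross-check the new dummy against the original codes
tabulate CH1218IN haskids1218, missing
```

Starting from missing and filling in each code explicitly means any unexpected code stays missing instead of silently becoming 0.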
4
Dealing with missing values, or numbers saved as strings
Notice that lwyr goes from 1959 to 9998. But we haven’t yet reached the year 9998, so this value probably means missing, or not asked. To check, tab lwyr.
Who would not be asked this question? People who are currently working. So you would only use this variable when analyzing things about people who are not working.
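One possible way to handle the placeholder year in Stata is sketched below. It assumes 9998 is the only placeholder code in lwyr, which should be confirmed against the codebook and the tab output before running it.

```stata
* See every value of lwyr, including the suspicious 9998
tabulate lwyr

* Assumption: 9998 is a "not asked" code, not a real year
replace lwyr = . if lwyr == 9998

* The maximum should now be a plausible year
summarize lwyr
```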
5
Dealing with missing values, or numbers saved as strings
tab jobsatis
Who is “L” (logical skip)? Who would not be asked this question? People who are not working. So you would only use this variable when analyzing people who are working.
Does it make sense to use this as a numerical variable? How could I do that?
First: make “L” into missing, which for a string variable is “”.
Second: destring jobsatis, replace
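The two steps can be sketched in Stata as follows, assuming "L" is the only non-numeric code stored in jobsatis (tab jobsatis will show whether that holds).

```stata
* Step 1: turn the logical-skip code into the string missing value
replace jobsatis = "" if jobsatis == "L"

* Step 2: convert the string variable to a numeric one in place
destring jobsatis, replace

* Step 3: confirm it is numeric now (observations should no longer be 0)
summarize jobsatis
```

If any other non-numeric code remains, destring leaves the variable as a string and says so, which is a useful safety check.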
6
Dealing with missing values, or numbers saved as strings
tab marsta
Does it make sense to use this as a numerical variable? NO! The codes are categories, not quantities. Leave as is.
7
Estimating nonlinear relationships
Could the relationship be non-linear, and if so, how can we estimate this using linear regression?
8
Non-linear relationships between Y and X
Sometimes, the relationship between the Y variable and the X variable is unlikely to be linear. This may lead you to measure a very small, statistically insignificant slope. For example, if you ran a regression on the data in this graph, the slope coefficient would be zero.
9
Many of you believe that you might have nonlinear relationships
e.g. Maybe job satisfaction goes up with age and then down again.
e.g. You do not believe that a $1 increase in price will have the same effect going from $10 to $11 as going from $100 to $101.
Note that this section is only applicable to numerical variables. You cannot do these nonlinear things with indicator variables.
10
To solve the problem of Y possibly increasing with X and then decreasing:
You simply add to the regression a new X variable that is a nonlinear version of the old variable.
My suggestion: estimate a quadratic by making a new variable X^2 and running the regression with both the linear term and the nonlinear (quadratic) term in the equation.
If you don’t know whether a relationship is nonlinear, you can estimate the regression assuming it is nonlinear (e.g. quadratic) and then examine the results to see if this assumption is correct.
11
Quadratic: Y = b0 + b1 X + b2 X^2
In high school you learned that quadratic equations look like this. So by adding a squared term, you can estimate these shapes.
12
However, a regression with a quadratic can estimate ANY part of these shapes
So, using a quadratic does not mean that the curve need actually ever change from a positive to a negative slope or vice versa …
13
How do you know whether the relationship really is nonlinear?
Put in a nonlinear term (e.g. a squared term) and let the |t-stat|s in the equation tell you if it belongs in there.
If |t-stat| > 2, you are more than 95% confident that the relationship is nonlinear.
Even if |t-stat| < 2, it’s a good idea to keep the quadratic term in as long as you are relatively confident it belongs. I tend to leave it in if it has a |t-stat| > 1, which means that I am at least 68% confident the relationship is nonlinear.
Example: I know annual visitors to the park. I want to know if they are growing (or falling) at a constant rate over time, or not. First I make the variables:
gen time = _n
gen timesq = time^2
14
Here are regressions of visitors (to Cape Canaveral) on time, then on time AND timesq. Is the relationship nonlinear? Are visitors growing/shrinking, and at a constant rate?
. regress annualvisitors time
[Stata regression output; numeric values missing. Recoverable: F(1, 21), Root MSE = 2.5e+05, rows for time and _cons]
. regress annualvisitors time timesq
[Stata regression output; numeric values missing. Recoverable: F(2, 20), Root MSE = 1.3e+05, rows for time, timesq and _cons]
15
Sketching the Quadratic
Visitors = 1102401 + 118498 time - 5374 time^2
The linear term is positive, so at a small X (e.g. X = 0.1) the slope is positive. The squared term is negative, so the slope eventually becomes negative. So the general shape is an inverted U. But which part of the curve is it?
For those who don’t think in derivatives, plug in high, medium and low values for X in the original equation. In this data, time goes from 1 to 23, so:
At time = 1, Visitors = 1102401 + 118498 (1) - 5374 (1^2) = 1,215,525
At time = 10, Visitors = 1102401 + 118498 (10) - 5374 (10^2) = 1,749,981
At time = 23, Visitors = 1102401 + 118498 (23) - 5374 (23^2) = 985,009
So over these 23 years, predicted visitors go up, then back down again.
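The plug-in arithmetic can be reproduced in Stata with display, which works as a calculator and needs no data in memory. This is only a sketch using the coefficients from the equation Visitors = 1102401 + 118498 time - 5374 time^2; the loop and the display format are illustrative choices.

```stata
* Predicted visitors at a low, middle and high value of time
foreach t of numlist 1 10 23 {
    display "time = `t': " %12.0fc 1102401 + 118498*`t' - 5374*`t'^2
}
* Prints 1,215,525 for time 1, 1,749,981 for time 10 and 985,009 for time 23
```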
16
Sketching the Quadratic using calculus
Visitors = 1102401 + 118498 time - 5374 time^2
Calculus tells us the slope: dVisitors/dtime = 118498 - 2*5374 time
The slope gets smaller as time increases. At the top of this curve, the slope is exactly zero. So solve 0 = 118498 - 2*5374 time, which gives time = 11.03.
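The turning point can also be computed in Stata. The first line below uses only the coefficients from the equation Visitors = 1102401 + 118498 time - 5374 time^2. The second part is a sketch that assumes the variables annualvisitors, time and timesq are in memory; the label turning_point is an illustrative name.

```stata
* Turning point from the reported coefficients: 118498 / (2*5374)
display 118498/(2*5374)          // about 11.03

* Same quantity from the stored regression results, with a standard error
regress annualvisitors time timesq
nlcom (turning_point: -_b[time]/(2*_b[timesq]))
```

nlcom is handy here because it also reports a confidence interval, which shows how precisely the peak year is estimated.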
17
What about this issue: You believe that a 1% increase in X will have the same % effect on Y no matter what value you start at. [NOT ON TEST]
e.g. You believe a 1 percent increase in price has a constant percentage effect on sales.
Mathematical rule: If lnY = b0 + b1 lnX, b1 represents %∆Y / %∆X, or the percentage change in Y when X changes by 1%.
(ln is the natural log, i.e. the log to base e. Log means the log to base 10. Either works.)
So just make two new variables, lnY and lnX, and run a regression: regress lnY lnX
The coefficient will be the percentage change in Y when X changes by 1%.
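A minimal Stata sketch of the log-log regression follows. Y and X are placeholder variable names, not variables from any dataset in these slides, and both are assumed to be strictly positive, since the log of zero or a negative number is missing.

```stata
* Make the logged variables (Y and X are placeholders)
generate lnY = ln(Y)
generate lnX = ln(X)

* Check how many observations were lost because Y or X was <= 0
count if missing(lnY, lnX)

* The coefficient on lnX is the percentage change in Y for a 1% change in X
regress lnY lnX
```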
18
A case when logs might be useful?
If you have skewed data (like lifetime gross in movies), you could just regress ln(Lifetime gross) = b0 + b1 ln(metascore)
19
We should talk more if you want to use logs
20
Back to the hobbit data set
Make a variable for timesquared.
Run a regression of gross on time, timesquared, and the better of the other two (weekend indicator, or day of week indicator variables).
Is the relationship between gross and time nonlinear? What does it look like?
21
Dealing with skewed data
22
There are 3 ways you might deal with skewed data
1. Use logs for the skewed variable (if you believe the right relationship is with the percentage change).
2. If the skewed variable is the dependent variable, predict the median rather than the mean by running: qreg Yvariable Xvariable
3. You can topcode the variable (whether it is a dependent or explanatory variable), for instance: replace LifetimeGross = [cutoff] if LifetimeGross > [cutoff]
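The three options might look like this in Stata. This is a sketch only: it assumes variables named LifetimeGross and metascore are in memory, and the 99th-percentile cutoff is a hypothetical choice, not the value used on the slide. The names lnLifetimeGross, cutoff and LifetimeGross_tc are illustrative.

```stata
* 1. Logs (LifetimeGross must be strictly positive)
generate lnLifetimeGross = ln(LifetimeGross)

* 2. Median (quantile) regression instead of predicting the mean
qreg LifetimeGross metascore

* 3. Topcode at a hypothetical cutoff: the 99th percentile
summarize LifetimeGross, detail
scalar cutoff = r(p99)
* min() ignores missing arguments, so exclude missing values explicitly
generate LifetimeGross_tc = min(LifetimeGross, cutoff) if !missing(LifetimeGross)
```

Generating a new topcoded variable, rather than using replace, keeps the original values available for comparison.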
23
Omitted Variable Bias
24
Multiple regression measures the individual impacts of different factors on Y….
Multiple regression helps us measure the individual impacts of different factors on our dependent variable Y, holding the other factors constant, and so isolates each factor’s effect.
25
Assignments and omitted variable bias
Assignment 4 asks people to run a regression. Results will be mixed:
Some will find the coefficient as expected, some will not.
Some will find the coefficient statistically significant (|t| > 2), some will not.
However, that coefficient is likely to be picking up some other important factor. Example: The dummy for Beacon Street was also picking up the effect of size.
Assignment 5 asks people to add more X variables as controls, in case these new variables are correlated with both Y and X1.
26
Condo’s Price = 520729 – 46969 BEACON
Price = […] + […] SIZE + […] BEACON
Why are the coefficients on Beacon so different?
The coefficient on Beacon in the first (simple) regression says: Across all the properties in our dataset, those on Beacon cost $46,969 less on average.
In contrast, the coefficient on Beacon in the multiple regression says: If we compare two condos of the same size, one on Beacon and one not on Beacon, the one on Beacon costs $32,946 more.
27
Condo’s Price = 520729 – 46969 BEACON
Price = […] + […] SIZE + […] BEACON
Which of these two equations should the executive use to figure out how much of a premium she will have to pay for an equivalent condo on Beacon Street? The multiple regression.
Which of these two equations should the realtor care most about, since realtors get a percent of the sale price? The simple regression.
28
If you really want to measure the effect of X alone (e.g. Beacon), you need to control for possibly confounding factors. If you don’t, the coefficient on X is biased. We call this omitted or missing variable bias.
Omitted variable bias occurs when:
The omitted variable has an effect on the dependent variable, AND
The omitted variable is correlated with the explanatory variable of interest.
29
It is necessary in your projects to understand why the coefficients change when you add a variable
And also, if there is a confounding variable that you cannot measure, you want to be able to predict what the sign of the omitted variable bias is.
30
Another example: How does getting more education affect salaries?
Let’s say you run this regression: Income = 20,… + 4000 Education (in years). But the coefficient 4000 may pick up the fact that more intelligent people have both more education and higher income. If you could add the variable IQ to the regression, the coefficient on education would measure the effect of education holding IQ constant.
31
Omitted variable bias Price = 520729 – 46969 BEACON
Price = […] + […] SIZE + […] BEACON
In a simple regression of Y on X1, the coefficient b1 measures the combined effects of:
the direct (often called “causal”) effect of the included variable X1 on Y
PLUS an “omitted variable bias” due to factors that were left out (omitted) from the regression.
Often we want to measure the direct, causal effect. In this case, the coefficient in the simple regression is biased.
32
In-Class exercise (t-stats in parentheses)
Regression 1: Score = […] - 5.68 Pay_Program; adjR2 = .0175; t-stats: (93.5) (-3.19)
Regression 2: Score = […] + […] Pay_Program + […] OldScore; adjR2 = .6687; t-stats: (6.52) (3.46) (31.68)
Regression 3: Score = […] + […] Pay_Program + […] OldScore - […] Poverty; adjR2 = .6727; t-stats: (7.10) (4.59) (28.97) (-3.05)
33
You can:
Reason out what sign the bias will be if there is likely a confounding variable that we cannot measure.
Understand exactly why a coefficient changes the way it does when we add a confounding variable to the regression.
34
Graphical representation of Omitted Variable Bias
Really, both being on Beacon and size affect price.
Y = b0 + b1X1 + b2X2
Let’s call this the Full model. Let’s call b1 and b2 the direct effects.
35
The mis-specified or Limited model
However, in the simple regression, we measure only a (combined) effect of Beacon on price. Call its coefficient c1:
Y = c0 + c1X1
Let’s call c1 the combined effect.
36
Background relationship between X’s
We also know that there is a relationship between X1 (Beacon) and X2 (Size). We call this the Background Relationship:
. correlate price size Beacon_Street
(obs=1085)
[Correlation matrix of price, size and Beacon_Street; numeric values missing]
This background relationship, shown here as a1, is negative.
37
Let’s combine all 3 pictures: the full model, the limited model & the background relationship
The effect of X1 on Y has two channels.
The first one is the direct effect b1.
The second channel is the indirect effect through X2: when X1 changes, X2 also tends to change (a1), and this change in X2 has another effect on Y (b2).
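The two channels can be combined with a short substitution. The sketch below assumes the background relationship can be written as a linear equation X2 = a0 + a1 X1, where the intercept a0 is a symbol introduced here for illustration, and it ignores the error terms.

```latex
\begin{align*}
Y &= b_0 + b_1 X_1 + b_2 X_2 \\
  &= b_0 + b_1 X_1 + b_2 (a_0 + a_1 X_1) \\
  &= (b_0 + b_2 a_0) + (b_1 + b_2 a_1) X_1
\end{align*}
```

Matching this with the limited model Y = c0 + c1X1 gives c1 = b1 + b2a1: the combined effect is the direct effect b1 plus the indirect effect b2a1.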
38
If we want the direct effect only
When we include both X1 and X2 in a multiple regression, we get the coefficient b1 – the direct effect of X1.
40
Let’s apply this to Brookline Condos
Limited Model: Price = 520729 - 46969 BEACON
Full Model: Price = […] + […] SIZE + […] BEACON
Background relationship: SIZE = 1254 - […] BEACON
c1: combined effect (negative)
b1: direct effect (positive)
a1: background relationship (negative)
c1 = b1 + b2a1. Check: 32935 + ([…] * 409.4)
Bias is b2a1, or […] * […], which is negative. We are UNDERESTIMATING the direct effect.
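The check c1 = b1 + b2a1 can be reproduced directly from three regressions. The Stata sketch below assumes the Brookline condo data is in memory with the variables price, size and Beacon_Street (the names used in the correlate command earlier) and that all three regressions use the same observations. The scalar names are illustrative.

```stata
* Limited model: combined effect c1
regress price Beacon_Street
scalar c1 = _b[Beacon_Street]

* Full model: direct effects b1 (Beacon) and b2 (size)
regress price size Beacon_Street
scalar b1 = _b[Beacon_Street]
scalar b2 = _b[size]

* Background relationship: a1
regress size Beacon_Street
scalar a1 = _b[Beacon_Street]

* The two numbers displayed should match
display "c1 = " c1 "    b1 + b2*a1 = " b1 + b2*a1
display "omitted variable bias b2*a1 = " b2*a1
```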
41
It is useful in your projects to understand why the coefficients change when you add a variable
And also, if there is a confounding variable that you cannot measure, you want to be able to predict what the sign of the omitted variable bias is.