QM222 Class 16 & 17 Today’s New topic: Estimating nonlinear relationships QM222 Fall 2017 Section A1
To-dos I am available today 11-3:30. If you don’t have your data completed, come then or make a special appointment.
Today: Estimating nonlinear relationships - part 1
Future special topics:
- Omitted variable bias (leaving things out of the regression)
- One variable with different slopes
We already covered:
- A dummy variable as the dependent (Y) variable
- Multiple-category dummies
- Time series analysis (see me if you have time-series cross-section data)
Estimating nonlinear relationships Could the relationship be non-linear, and if so, how can we estimate this using linear regression?
Non-linear relationships between Y and X Sometimes the relationship between a numerical Y variable and a numerical X variable is unlikely to be linear. This may lead you to measure a very small, insignificant slope. e.g. If you ran a linear regression on the data in this graph, its slope coefficient would be zero.
Many of you believe that you might have nonlinear relationships e.g. You want to see how age affects job satisfaction (measured on a scale of 1 to 10), so you think about running the regression: regress jobsatis age Is this likely to be linear? Maybe job satisfaction goes up with age and then down again. For instance, maybe you do not believe that an extra year increases job satisfaction by a constant amount whether it is the difference between ages 24 and 25 or between ages 60 and 61.
Many of you believe that you might have nonlinear relationships Note that this section applies only to numerical variables. You cannot do these nonlinear things with dummy variables. Why not? 1² = 1 and 0² = 0, so squaring a dummy variable just gives you back the same dummy variable.
To solve the problem of Y possibly increasing with X and then decreasing: You simply add to the regression a new X variable that is a nonlinear transformation of the old variable. My suggestion: estimate a quadratic by making a new variable X² and run the regression with both the linear and non-linear (quadratic) term in the equation. If you don’t know whether a relationship is nonlinear, you can estimate the regression assuming it is nonlinear (e.g. quadratic) and then examine the results to see if this assumption is correct.
Quadratic: Y = b0 + b1 X + b2 X² In high school you learned that quadratic equations trace out parabolas like these. So by adding a squared term, you can estimate these shapes.
However, a regression with a quadratic can estimate ANY part of these shapes. So using a quadratic does not mean that the curve must ever actually change from a positive to a negative slope or vice versa.
How do you know whether the relationship really is nonlinear? Put a nonlinear term (e.g. a squared term) in the regression and let the |t-stats| in the equation tell you if it belongs there. If |t-stat| > 2, you are more than 95% confident that the relationship is nonlinear. However, it’s a good idea to keep the quadratic term in as long as |t-stat| > 1, which means that you are at least 68% confident the relationship is nonlinear.
Recall:
- The adjusted R² goes up if you add a variable with |t| >= 1 (so that you are >= 68% certain that the coefficient is not 0)
- The adjusted R² goes down if you add a variable with |t| < 1 (so that you are < 68% certain that the coefficient is not 0)
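The recall rule can be checked directly from the adjusted R² formula, adj R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of regressors. A minimal Python sketch (Python rather than Stata, for illustration only), plugging in the R² values from the Cape Canaveral regressions in these slides:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a regression with n observations and k regressors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Figures from the Cape Canaveral regressions on the slides:
# linear model:    R2 = 0.0766, n = 23, k = 1
# quadratic model: R2 = 0.7824, n = 23, k = 2
print(round(adjusted_r2(0.0766, 23, 1), 4))   # 0.0326, as reported
print(round(adjusted_r2(0.7824, 23, 2), 4))   # ~0.7606 (slide reports 0.7607; tiny gap from rounding R2)
```
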
Example of testing nonlinear relationships by adding in a squared term Example: I know annual visitors to a national park (Cape Canaveral). I want to know if they are growing (or falling) at a constant rate over time, or not. Assume the data are in chronological order. First I make the variables:
gen time = _n
gen timesq = time^2
Then I run a regression with BOTH variables, time and timesq.
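The same workflow can be sketched in Python (the course uses Stata; this is a parallel illustration with made-up, hump-shaped "visitors" data, fitting the quadratic by ordinary least squares via the normal equations):

```python
def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x^2 via the normal equations."""
    # Design-matrix columns [1, x, x^2]; build X'X and X'y.
    cols = [[1.0] * len(x), list(x), [v * v for v in x]]
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(c * v for c, v in zip(ci, y)) for ci in cols]
    # Solve the 3x3 system by Gauss-Jordan elimination with partial pivoting.
    m = [row + [b] for row, b in zip(xtx, xty)]
    for i in range(3):
        piv = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[piv] = m[piv], m[i]
        for r in range(3):
            if r != i:
                f = m[r][i] / m[i][i]
                m[r] = [a - f * b for a, b in zip(m[r], m[i])]
    return [m[i][3] / m[i][i] for i in range(3)]

time = list(range(1, 24))                  # Stata: gen time = _n   (23 observations)
timesq = [t * t for t in time]             # Stata: gen timesq = time^2
# Made-up visitors series that rises then falls, for illustration only:
visitors = [1100000 + 120000 * t - 5400 * t2 for t, t2 in zip(time, timesq)]
b0, b1, b2 = fit_quadratic(time, visitors)  # recovers 1100000, 120000, -5400
```
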
Here are regressions of visitors (to Cape Canaveral), first on time, then on time AND timesq. Is the relationship nonlinear? Are visitors growing/shrinking, and at a constant rate?

. regress annualvisitors time

      Source |       SS           df       MS      Number of obs   =        23
-------------+----------------------------------   F(1, 21)        =      1.74
       Model |  1.1103e+11         1  1.1103e+11   Prob > F        =    0.2010
    Residual |  1.3382e+12        21  6.3722e+10   R-squared       =    0.0766
-------------+----------------------------------   Adj R-squared   =    0.0326
       Total |  1.4492e+12        22  6.5872e+10   Root MSE        =   2.5e+05

------------------------------------------------------------------------------
annualvisi~s |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        time |  -10474.59   7935.107    -1.32   0.201    -26976.55    6027.369
       _cons |    1639786   108800.7    15.07   0.000      1413523     1866050

. regress annualvisitors time timesq

-------------+----------------------------------   F(2, 20)        =     35.96
       Model |  1.1339e+12         2  5.6695e+11   Prob > F        =    0.0000
    Residual |  3.1528e+11        20  1.5764e+10   R-squared       =    0.7824
-------------+----------------------------------   Adj R-squared   =    0.7607
       Total |  1.4492e+12        22  6.5872e+10   Root MSE        =   1.3e+05

        time |   118497.8   16490.43     7.19   0.000     84099.37    152896.3
      timesq |   -5373.85   667.1316    -8.06   0.000    -6765.462   -3982.238
       _cons |    1102401   85902.09    12.83   0.000     923212.7     1281590
Sketching that Quadratic
Visitors = 1102401 + 118498 time - 5374 time²
The linear term is positive, so at a small X (e.g. time = 0.1) the slope is positive. The squared term is negative, so the slope eventually becomes negative. So the general shape is an inverted U. But which part of the curve is it? For those who don’t think in derivatives, plug high, medium, and low values of X into the original equation. In this data, time goes from 1 to 23, so:
At time=1,  Visitors = 1102401 + 118498 (1)  - 5374 (1²)  = 1,215,525
At time=10, Visitors = 1102401 + 118498 (10) - 5374 (10²) = 1,749,981
At time=23, Visitors = 1102401 + 118498 (23) - 5374 (23²) = 985,009
So over these 23 years, predicted visitors go up, then back down again.
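The plug-in arithmetic above is easy to check by evaluating the fitted equation in code (coefficients taken straight from the slide):

```python
# Fitted curve from the slide: Visitors = 1102401 + 118498*time - 5374*time^2
def predicted_visitors(t):
    return 1102401 + 118498 * t - 5374 * t ** 2

print(predicted_visitors(1))    # 1215525
print(predicted_visitors(10))   # 1749981
print(predicted_visitors(23))   # 985009
```
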
Sketching the Quadratic using calculus (easier!)
Visitors = 1102401 + 118498 time - 5374 time²
Calculus tells us the slope: dVisitors/dtime = 118498 - 2*5374 time
The slope gets smaller as time increases. At the top of the curve, the slope is exactly zero. So solve:
slope = 118498 - 2*5374 time = 0
time = 118498/(2*5374) = 11.03
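For any fitted quadratic Y = b0 + b1 X + b2 X², the peak (or trough) is at X = -b1/(2 b2). Checking the slide's numbers:

```python
# Peak of Visitors = 1102401 + 118498*time - 5374*time^2:
# the derivative 118498 - 2*5374*time is zero at time = -b1/(2*b2).
b1, b2 = 118498, -5374
peak_time = -b1 / (2 * b2)
print(round(peak_time, 2))   # 11.03
```
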
Another kind of non-linear term [NOT ON TEST]
You believe that a 1% increase in X will have the same percentage (%) effect on Y no matter what X you start at. e.g. You believe a 1 percent increase in price has a constant percentage effect on sales.
Mathematical rule: If lnY = b0 + b1 lnX, b1 represents %∆Y/%∆X (for small changes), i.e. the percentage change in Y when X changes by 1%. (ln is the natural log, to the base e; log means to the base 10. Either works here.)
So just make two new variables from yvariable and xvariable:
gen lnY = ln(yvariable)
gen lnX = ln(xvariable)
regress lnY lnX
Then the coefficient on lnX will tell you the percentage change in Y when X changes by 1%.
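A Python sketch of the same log-log regression (made-up price and sales data constructed to have a constant elasticity of -2, so the recovered slope is known in advance):

```python
import math

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

price = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [1000.0 * p ** -2 for p in price]   # elasticity of -2 by construction
ln_p = [math.log(p) for p in price]         # Stata: gen lnX = ln(xvariable)
ln_s = [math.log(s) for s in sales]         # Stata: gen lnY = ln(yvariable)
elasticity = ols_slope(ln_p, ln_s)          # Stata: regress lnY lnX
print(round(elasticity, 3))   # -2.0: a 1% price rise lowers sales by ~2%
```
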
A case when logs might be useful? If you have skewed data (like lifetime gross in movies), you could just regress ln(Lifetime gross) = b0 + b1 ln(metascore)
Of course, we said last time that with a very skewed Y-variable, you could….
- Predict the median, not the average, income: qreg incwage age
- Top-code the variable (i.e. treat all incomes above a certain amount, your choice, as that top amount), e.g. make all incomes above $400,000 into $400,000: replace incwage=400000 if incwage>400000
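Top-coding is just a clip at the chosen cap. A Python sketch of the Stata `replace incwage=400000 if incwage>400000` step (income values made up):

```python
# Topcode: treat all incomes above 400,000 as exactly 400,000.
# (Stata: replace incwage=400000 if incwage>400000)
incomes = [52000, 130000, 415000, 980000]   # made-up data
topcoded = [min(w, 400000) for w in incomes]
print(topcoded)   # [52000, 130000, 400000, 400000]
```
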
Which non-linear way of dealing with things is best with a skewed Y? There is no one best way. It is trial and error. It is an art rather than a science. BUT ONLY use methods that you understand.
Comparing regressions with skewed Y

Regular variable: regress income age

      Source |       SS           df       MS      Number of obs   =   150,000
-------------+----------------------------------   F(1, 149998)    =   6049.10
       Model |  3.7133e+13         1  3.7133e+13   Prob > F        =    0.0000
    Residual |  9.2077e+14   149,998  6.1385e+09   R-squared       =    0.0388
-------------+----------------------------------   Adj R-squared   =    0.0388
       Total |  9.5790e+14   149,999  6.3860e+09   Root MSE        =     78349

------------------------------------------------------------------------------
      income |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1418.423   18.23728    77.78   0.000     1382.678    1454.167
       _cons |   28233.09   820.0105    34.43   0.000     26625.89     29840.3

With a quadratic: regress income age agesq

-------------+----------------------------------   F(2, 149997)    =   4162.44
       Model |  5.0368e+13         2  2.5184e+13   Prob > F        =    0.0000
    Residual |  9.0753e+14   149,997  6.0503e+09   R-squared       =    0.0526
-------------+----------------------------------   Adj R-squared   =    0.0526
       Total |  9.5790e+14   149,999  6.3860e+09   Root MSE        =     77784

         age |   8636.687    155.388    55.58   0.000      8332.13    8941.245
       agesq |  -81.87879   1.750603   -46.77   0.000    -85.30994   -78.44764
       _cons |  -120758.3   3287.881   -36.73   0.000    -127202.5   -114314.1
Comparing regressions

Regular variable: regress income age

      Source |       SS           df       MS      Number of obs   =   150,000
-------------+----------------------------------   F(1, 149998)    =   6049.10
       Model |  3.7133e+13         1  3.7133e+13   Prob > F        =    0.0000
    Residual |  9.2077e+14   149,998  6.1385e+09   R-squared       =    0.0388
-------------+----------------------------------   Adj R-squared   =    0.0388
       Total |  9.5790e+14   149,999  6.3860e+09   Root MSE        =     78349

------------------------------------------------------------------------------
      income |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1418.423   18.23728    77.78   0.000     1382.678    1454.167
       _cons |   28233.09   820.0105    34.43   0.000     26625.89     29840.3

With topcoding if income>400000: regress incometop age

-------------+----------------------------------   F(1, 149998)    =   6919.29
       Model |  3.3996e+13         1  3.3996e+13   Prob > F        =    0.0000
    Residual |  7.3698e+14   149,998  4.9133e+09   R-squared       =    0.0441
-------------+----------------------------------   Adj R-squared   =    0.0441
       Total |  7.7098e+14   149,999  5.1399e+09   Root MSE        =     70095

   incometop |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
         age |     1357.2   16.31598    83.18   0.000     1325.221    1389.179
       _cons |   29247.77   733.6221    39.87   0.000     27809.88    30685.65
Comparing regressions

With a quadratic: regress income age agesq

      Source |       SS           df       MS      Number of obs   =   150,000
-------------+----------------------------------   F(2, 149997)    =   4162.44
       Model |  5.0368e+13         2  2.5184e+13   Prob > F        =    0.0000
    Residual |  9.0753e+14   149,997  6.0503e+09   R-squared       =    0.0526
-------------+----------------------------------   Adj R-squared   =    0.0526
       Total |  9.5790e+14   149,999  6.3860e+09   Root MSE        =     77784

------------------------------------------------------------------------------
      income |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   8636.687    155.388    55.58   0.000      8332.13    8941.245
       agesq |  -81.87879   1.750603   -46.77   0.000    -85.30994   -78.44764
       _cons |  -120758.3   3287.881   -36.73   0.000    -127202.5   -114314.1

With topcoding (if income>400000) AND a quadratic (best adj R² so far):

. regress incometop age agesq

-------------+----------------------------------   F(2, 149997)    =   4747.54
       Model |  4.5899e+13         2  2.2949e+13   Prob > F        =    0.0000
    Residual |  7.2508e+14   149,997  4.8339e+09   R-squared       =    0.0595
-------------+----------------------------------   Adj R-squared   =    0.0595
       Total |  7.7098e+14   149,999  5.1399e+09   Root MSE        =     69527

   incometop |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
         age |   8202.253   138.8926    59.05   0.000     7930.027      8474.48
       agesq |  -77.64535   1.564765   -49.62   0.000    -80.71226    -74.57844
       _cons |  -112040.2   2938.851   -38.12   0.000    -117800.3    -106280.1
Comparing regressions

With topcoding if income>400000 AND a quadratic:

. regress incometop age agesq

      Source |       SS           df       MS      Number of obs   =   150,000
-------------+----------------------------------   F(2, 149997)    =   4747.54
       Model |  4.5899e+13         2  2.2949e+13   Prob > F        =    0.0000
    Residual |  7.2508e+14   149,997  4.8339e+09   R-squared       =    0.0595
-------------+----------------------------------   Adj R-squared   =    0.0595
       Total |  7.7098e+14   149,999  5.1399e+09   Root MSE        =     69527

------------------------------------------------------------------------------
   incometop |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   8202.253   138.8926    59.05   0.000     7930.027      8474.48
       agesq |  -77.64535   1.564765   -49.62   0.000    -80.71226    -74.57844
       _cons |  -112040.2   2938.851   -38.12   0.000    -117800.3    -106280.1

With logs: regress lnincome lnage

-------------+----------------------------------   F(1, 149998)    =  11370.59
       Model |  4376.05532         1  4376.05532   Prob > F        =    0.0000
    Residual |  57727.8423   149,998  .384857413   R-squared       =    0.0705
-------------+----------------------------------   Adj R-squared   =    0.0705
       Total |  62103.8976   149,999  .414028744   Root MSE        =    .62037

    lnincome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
       lnage |    .646042   .0060586   106.63   0.000     .6341673     .6579166
       _cons |   8.755814   .0227184   385.41   0.000     8.711287     8.800342

This is the best adjusted R² so far, but you are fitting a whole different Y-variable. It does look very good from the t-stat on lnage.
Comparing regressions

With logs: regress lnincome lnage

      Source |       SS           df       MS      Number of obs   =   150,000
-------------+----------------------------------   F(1, 149998)    =  11370.59
       Model |  4376.05532         1  4376.05532   Prob > F        =    0.0000
    Residual |  57727.8423   149,998  .384857413   R-squared       =    0.0705
-------------+----------------------------------   Adj R-squared   =    0.0705
       Total |  62103.8976   149,999  .414028744   Root MSE        =    .62037

------------------------------------------------------------------------------
    lnincome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lnage |    .646042   .0060586   106.63   0.000     .6341673     .6579166
       _cons |   8.755814   .0227184   385.41   0.000     8.711287     8.800342

With logged income and an age quadratic (rather than using log age):

. regress lnincome age agesq

-------------+----------------------------------   F(2, 149997)    =   7249.43
       Model |  5473.90915         2  2736.95457   Prob > F        =    0.0000
    Residual |  56629.9885   149,997  .377540807   R-squared       =    0.0881
-------------+----------------------------------   Adj R-squared   =    0.0881
       Total |  62103.8976   149,999  .414028744   Root MSE        =    .61444

         age |   .0959792   .0012275    78.19   0.000     .0935734      .098385
       agesq |  -.0009264   .0000138   -66.99   0.000    -.0009535    -.0008993
       _cons |   8.863046   .0259722   341.25   0.000     8.812141     8.913951

This is the best fit so far, and it IS the same lnY variable as above.
Comparing regressions

With logs for income and an age quadratic (rather than using log age):

. regress lnincome age agesq

      Source |       SS           df       MS      Number of obs   =   150,000
-------------+----------------------------------   F(2, 149997)    =   7249.43
       Model |  5473.90915         2  2736.95457   Prob > F        =    0.0000
    Residual |  56629.9885   149,997  .377540807   R-squared       =    0.0881
-------------+----------------------------------   Adj R-squared   =    0.0881
       Total |  62103.8976   149,999  .414028744   Root MSE        =    .61444

------------------------------------------------------------------------------
    lnincome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0959792   .0012275    78.19   0.000     .0935734      .098385
       agesq |  -.0009264   .0000138   -66.99   0.000    -.0009535    -.0008993
       _cons |   8.863046   .0259722   341.25   0.000     8.812141     8.913951

Finally, with qreg to get a prediction for median wage, with age and agesq:

qreg income age agesq

Median regression                                   Number of obs =   150,000
  Raw sum of deviations  3.35e+09 (about 69000)
  Min sum of deviations  3.22e+09                   Pseudo R2     =    0.0382

      income |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
         age |   5607.143   95.94801    58.44   0.000     5419.087     5795.199
       agesq |  -53.57143   1.080952   -49.56   0.000    -55.69007    -51.45279
       _cons |     -66000    2030.18   -32.51   0.000    -69979.11    -62020.89

Think of the pseudo R² as being like an adjusted R². This does NOT fit as well as lnincome.
To sum today up
There are many ways to deal with a skewed Y variable:
- Top-code it
- Take its log
- Predict the median rather than the average
There are many ways to deal with an X variable when you think the Y-X relationship is nonlinear:
- First look at the scatter diagram (it helps)
- Then add a quadratic X term and see if |t| > 1 [EASIEST]
- Or add a different kind of non-linear X term (e.g. log of X)
Remember: If you use logs of both sides, the coefficient can be interpreted as the percentage change in Y when X changes by 1%. QM222 Fall 2017 Section A1