assignment 7 – solutions ► office networks ► super staffing
Managerial Economics & Decision Sciences Department
© Kellogg School of Management
non-linear models – learning objectives

► STATA
  testing for curvature: rvfplot
  testing and correcting for heteroskedasticity: hettest, robust
  correcting for clustering: cluster()
► non-linearity (log models)
  test for curvature and its effect on linear regression
  use of logarithmic (log) models: interpretation and prediction with log models
► heteroskedasticity
  define heteroskedasticity and its effect on linear regression
  correction for heteroskedasticity: log models and the “white wash” approach
► independence and clustering
  define independence of errors and the effect of clustering
  correction for clustering

readings
► (MSN) Chapter 8
► (KTN) Log vs. Linear Models, Noise, Heteroskedasticity, and Grouped Data
office networks – model specification and data

curvature. A simple scatter diagram (on the left below) indicates the presence of “curvature”. For the sake of presentation, and to understand the rvfplot command, let’s first run the linear regression Emails = b0 + b1·Computers. The results are shown in the table below, while the fitted line is shown on the right.

. regress Emails Computers
------------------------------------------------------------------------------
      Emails |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   Computers |   21.08865     2.59076     8.14   0.000     15.71575    26.46156
       _cons |  -191.9852    48.87665    -3.93   0.001    -293.3491   -90.62123
------------------------------------------------------------------------------

[figure: scatter diagram showing curvature (left); fitted linear regression (right)]

Despite the obvious curvature, the output of the linear regression alone does not reveal the misfit with respect to curvature.
curvature. It is fairly easy to see that fitting a straight line definitely “misses” the curvature. This is why, if possible, a first visual “check” is recommended, and even better, try rvfplot. Consider three points (A), (B) and (C), shown on the left, for which the predicted values (according to the linear regression) are marked. For each of these points we measure the distance between the true y and the predicted ŷ, i.e. the residual y – ŷ.

► rvfplot plots the residual y – ŷ against the fitted value ŷ for each observation; thus the vertical axis from the left diagram becomes the horizontal axis on the right, and the vertical axis on the right simply measures the distance from the true y to the predicted value. The more curvature in the rvfplot, the more the considered regression misses the curvature in the original data.

[figure: fitted line with points A, B, C (left); rvfplot of residuals vs. fitted values (right)]
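The rvfplot logic above can be sketched outside Stata. Below is a minimal Python illustration on synthetic convex data (the dataset and the ols helper are assumptions for illustration, not the assignment's data): when a straight line is fit to curved data, the residuals show a systematic pattern instead of random scatter.

```python
import math

def ols(x, y):
    # one-variable least squares: returns (intercept, slope)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
         sum((xi - mx) ** 2 for xi in x)
    return my - b1 * mx, b1

# synthetic convex ("U"-shaped) data standing in for Emails vs. Computers
x = list(range(1, 25))
y = [math.exp(2.7 + 0.12 * xi) for xi in x]

b0, b1 = ols(x, y)
fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]  # rvfplot's vertical axis

# curvature missed by the line shows up as a systematic residual pattern:
# positive residuals at both extremes, negative in the middle
```

Plotting `resid` against `fitted` reproduces the rvfplot idea: a "U" in that plot signals the regression misses curvature in the data.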
curvature. Since the curvature is “U” shaped we try a log-linear specification (for an inverted “U” shape, i.e. “∩”, we would have to try a linear-log specification). The regression is

ln(Emails) = b0 + b1·Computers

Remark. Every time we change the specification we have to transform the variables first. In this case only the Emails variable is transformed, so we first generate the logarithm of Emails and then regress it on Computers.

. generate lnEmails = ln(Emails)
. regress lnEmails Computers

      Source |       SS       df       MS              Number of obs =      24
-------------+------------------------------           F(  1,    22) =  353.69
       Model |  16.8365711     1  16.8365711           Prob > F      =  0.0000
    Residual |  1.04726918    22  .047603145           R-squared     =  0.9414
-------------+------------------------------           Adj R-squared =  0.9388
       Total |  17.8838402    23  .777558271           Root MSE      =  .21818

------------------------------------------------------------------------------
    lnEmails |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   Computers |   .1188471   .0063195    18.81   0.000     .1057414    .1319529
       _cons |   2.712677   .1192213    22.75   0.000     2.465427    2.959927
------------------------------------------------------------------------------
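The transform-then-regress step can be mimicked in plain Python. This is a sketch on noiseless synthetic data built from the slide's estimated coefficients (the ols helper and the fabricated data are illustrative assumptions); with exactly log-linear data, regressing the log recovers the coefficients:

```python
import math

def ols(x, y):
    # one-variable least squares: returns (intercept, slope)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
         sum((xi - mx) ** 2 for xi in x)
    return my - b1 * mx, b1

computers = list(range(1, 25))
# noiseless synthetic data generated from the slide's estimates (assumption)
emails = [math.exp(2.712677 + 0.1188471 * c) for c in computers]

# step 1: transform the dependent variable first (generate lnEmails = ln(Emails))
ln_emails = [math.log(e) for e in emails]
# step 2: regress the log on Computers
b0, b1 = ols(computers, ln_emails)
```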
curvature. The estimated regression is

ln(Emails) = 2.712677 + 0.1188471·Computers

Remark. To generate the predicted values for the new regression, simply generate lnEmails_hat as

generate lnEmails_hat = _b[_cons] + _b[Computers]*Computers

and then revert to Emails_hat as

generate Emails_hat = exp(lnEmails_hat)

[figure: the log specification fits the actual data better (left); rvfplot shows no curvature (right)]
curvature. The estimated regression is ln(Emails) = 2.712677 + 0.1188471·Computers.

Remark. In the previous graph we plotted the true number of Emails against the predicted number of Emails; in other words, we first predicted ln(Emails) and then transformed it back into Emails. This is repeated in the diagram on the left. However, we can also “live” in a “logarithmic world”: why not compare the logarithm of the true Emails, i.e. lnEmails, with the predicted logarithm, i.e. lnEmails_hat? This is shown in the right diagram. Of course the “fit” should be similarly “good” in both cases.

[figure: fitted values according to the log specification, in Emails units (left) and in ln(Emails) units (right)]
i. estimation. The estimated regression is ln(Emails) = 2.712677 + 0.1188471·Computers.

► For Computers = 20 we first get the estimate of the logarithm of Emails:

ln(Emails)|Computers = 20 = 2.712677 + 0.1188471·20 = 5.0896194

then we “exponentiate” back to get the estimated number of Emails:

Emails|Computers = 20 = exp(ln(Emails)|Computers = 20) = exp(5.0896194) = 162.32807

[figure: the estimate 5.09 in ln(Emails) units corresponds to 162.33 in Emails units]
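The two-step point estimate above can be checked in a few lines of Python (coefficients copied from the regression output):

```python
import math

b0, b1 = 2.712677, 0.1188471   # estimated coefficients from the slide
ln_pred = b0 + b1 * 20         # predicted ln(Emails) at Computers = 20
pred = math.exp(ln_pred)       # "exponentiate" back to the Emails scale
```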
ii. prediction interval. The estimated regression is ln(Emails) = 2.712677 + 0.1188471·Computers.

► Since we are asked about an interval for the estimate at a single office with Computers = 20, we use the kpredint command:

. kpredint _b[_cons]+_b[Computers]*20
 Estimate: 5.0896194
 Standard Error of Individual Prediction: .22324024
 Individual Prediction Interval (95%): [ 4.6266475,5.5525913 ]
 t-ratio: 22.798844
 If Ha: <     then Pr(T < t)     = 1
 If Ha: not = then Pr(|T| > |t|) = 0
 If Ha: >     then Pr(T > t)     = 0

Remark. The interval [4.6266475, 5.5525913] above is for the estimated ln(Emails) = 5.0896194.

► To obtain the interval for Emails we need to exponentiate the lower and upper bounds of the interval (no correction is needed here since we are dealing with one observation):

lower bound(Emails) = exp(4.6266475) = 102.17096 ; upper bound(Emails) = exp(5.5525913) = 257.905

► The prediction interval is thus [102.17096, 257.905] and the estimate is 162.32807.
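Exponentiating the kpredint bounds is mechanical; as a quick check (bounds copied from the output above):

```python
import math

lo_log, hi_log = 4.6266475, 5.5525913       # kpredint bounds for ln(Emails)
lo, hi = math.exp(lo_log), math.exp(hi_log)  # bounds on the Emails scale
```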
ii. prediction interval. Units of measurement translation:

► The interval [4.6266475, 5.5525913] provided by kpredint will always be “centered” around the estimate: 5.0896194 − 4.6266475 = 5.5525913 − 5.0896194 = 0.463.

► This interval is then “transformed” through the exponential function in order to translate the initial interval, obtained in “logarithmic units”, into the original units. Since the exponential function is non-linear, the transformation of the interval is not proportional, i.e. equal distances in the initial “logarithmic units” do not translate into equal distances in the original units.

Remark. The interval for the logarithm, i.e. [4.626, 5.552], is centered around the estimated logarithm, 5.089. But notice that the interval for Emails, i.e. [102.170, 257.905], is not centered around the estimate for Emails, 162.328: the distance below is 162.328 − 102.170 = 60.158 while the distance above is 257.905 − 162.328 = 95.577.

[figure: the exp(·) function mapping the symmetric log-scale interval onto the asymmetric Emails-scale interval]
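The non-centering effect can be verified numerically (all numbers are taken from the kpredint output above):

```python
import math

est = 5.0896194               # estimated ln(Emails)
lo, hi = 4.6266475, 5.5525913  # kpredint interval in log units

# equal distances in logarithmic units...
left_log, right_log = est - lo, hi - est
# ...become unequal distances after exponentiating
left = math.exp(est) - math.exp(lo)   # distance below the estimate
right = math.exp(hi) - math.exp(est)  # distance above the estimate
```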
iii. benchmarking the estimate. We are asked to “estimate the probability that the daily internal emails at a particular office with 20 computers will be under 200”. There are two important issues here:

– understand what the probability is about, i.e. what exactly are you asked to calculate?
– make sure the units of measurement are used consistently.

As for the first issue: the regression only gives us an estimate of the daily internal emails at an office with 20 computers (and this estimate is about 162), but we really don’t know the true value of this number. Thus the true number of emails, call it trueY, remains a random variable for us, and the probability refers exactly to this trueY, i.e. for a given benchmark yb (in our case yb = 200) we have to calculate

Pr[trueY < yb]

This looks like a daunting task… unless we remember that we know the following fact:

(ln(trueY) − ln(ŷ)) / se  ~  T(n − k − 1)

where ŷ is the sample estimate of trueY, se is the standard error of the individual prediction, and T(n − k − 1) is a t-distributed random variable with n − k − 1 degrees of freedom (k is the number of variables used to obtain the estimate). For our situation the estimate is obtained as a logarithm in a regression with one variable, thus k = 1.
iii. benchmarking the estimate. Changing to logarithms (there is no need for a correction here):

Pr[trueY < yb] = Pr[ln(trueY) < ln(yb)]

► But how do we really use this result? The fact above implies that for any number t (this is our choice):

Pr[(ln(trueY) − ln(ŷ)) / se < t] = Pr[T(n − k − 1) < t]

► With a bit of work (algebra):

Pr[ln(trueY) < ln(ŷ) + t·se] = Pr[T(n − k − 1) < t]

► We are asked to evaluate Pr[ln(trueY) < ln(yb)], so choose t such that ln(ŷ) + t·se = ln(yb). Call this particular t that solves the equation tb; then

Pr[ln(trueY) < ln(yb)] = Pr[T(n − k − 1) < tb]

► Can we find tb? It must satisfy ln(ŷ) + tb·se = ln(yb), thus

tb = (ln(yb) − ln(ŷ)) / se
iii. benchmarking the estimate. Our conclusion is that the desired probability is

Pr[trueY < yb] = Pr[T(n − k − 1) < tb]

► The final step is to calculate tb. Going back to the output from kpredint:

. kpredint _b[_cons]+_b[Computers]*20
 Estimate: 5.0896194
 Standard Error of Individual Prediction: .22324024
 Individual Prediction Interval (95%): [ 4.6266475,5.5525913 ]
 t-ratio: 22.798844
 If Ha: <     then Pr(T < t)     = 1
 If Ha: not = then Pr(|T| > |t|) = 0
 If Ha: >     then Pr(T > t)     = 0

we identify ln(ŷ) = 5.0896194 and se = 0.22324024, while ln(yb) = ln(200) = 5.298, thus

tb = (5.298 − 5.0896194) / 0.22324024 ≈ 0.935

► We find Pr[trueY < 200] = Pr[T(22) < 0.935] ≈ 0.82.

Remark. If you run kpredint _b[_cons] + _b[Computers]*20 – 5.298 you should get for Ha: > the result 0.82.
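The t-ratio step is simple enough to verify directly (estimate and standard error taken from the kpredint output; with n − k − 1 = 22 degrees of freedom, the probability itself comes from t-tables or from kpredint):

```python
import math

est = 5.0896194   # estimated ln(Emails) at Computers = 20
se = 0.22324024   # standard error of individual prediction
yb = 200          # benchmark number of emails

tb = (math.log(yb) - est) / se
# Pr[trueY < 200] = Pr[T(22) < tb]; per the slide this is about 0.82
```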
iv.–v. estimate of average and confidence interval. The logic for this part is very similar to the one we used in answering the previous question. There are a few differences in terms of the correction factor and the standard error of the estimate, given that we are now estimating the average number of emails for all offices that have 20 computers.

► First, since we are estimating the average number of emails, we run klincom:

. klincom _b[_cons]+_b[Computers]*20
------------------------------------------------------------------------------
    lnEmails |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   5.089619   .0472553   107.70   0.000     4.991618    5.187621
------------------------------------------------------------------------------
 If Ha: <     then Pr(T < t)     = 1
 If Ha: not = then Pr(|T| > |t|) = 0
 If Ha: >     then Pr(T > t)     = 0

► Thus, the (corrected) estimate is:

est. avg. Emails = exp(5.089619)·exp(e(rmse)^2/2) = 166.2383

and the confidence interval:

lower bound = exp(4.991618)·exp(e(rmse)^2/2) = 150.71936
upper bound = exp(5.187621)·exp(e(rmse)^2/2) = 183.35471

where e(rmse) = 0.21818 is the Root MSE of the lnEmails regression.
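The corrected estimate and bounds can be reproduced in Python using the Root MSE (0.21818) from the lnEmails regression output and the klincom bounds:

```python
import math

rmse = 0.21818                        # e(rmse) from the lnEmails regression
correction = math.exp(rmse ** 2 / 2)  # log-model correction factor for a mean

est = math.exp(5.089619) * correction  # corrected estimate of avg. Emails
lo  = math.exp(4.991618) * correction  # corrected lower confidence bound
hi  = math.exp(5.187621) * correction  # corrected upper confidence bound
```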
vi. benchmarking the estimate. We saw that

(ln(avgY) − ln(ŷ)) / se  ~  T(n − k − 1)

where now se is the standard error of the estimated average (from klincom), thus

Pr[avgY < yb] = Pr[ln(avgY) < ln(yb)]

► But how do we really use this result? The above implies that for any number t (this is our choice):

Pr[(ln(avgY) − ln(ŷ)) / se < t] = Pr[T(n − k − 1) < t]

► With a bit of work (algebra):

Pr[ln(avgY) < ln(ŷ) + t·se] = Pr[T(n − k − 1) < t]

► We need to evaluate Pr[avgY < 200].
vi. benchmarking the estimate. Use the last two equalities with the klincom output: ln(ŷ) = 5.089619 and se = 0.0472553, while ln(yb) = ln(200) = 5.298.

► We need tb such that ln(ŷ) + tb·se = ln(yb), i.e.

tb = (5.298 − 5.089619) / 0.0472553 ≈ 4.41

► Finally: Pr[avgY < 200] = Pr[T(22) < 4.41] ≈ 0.99.

Remark. Notice the difference: the company is 82% sure that the number of emails will be less than 200 for one office with 20 computers, but it is 99% sure that the average number of emails will be less than 200 across all offices with 20 computers.
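A parallel check for the average (estimate and standard error from the klincom output): the only change from part iii is the much smaller standard error, which pushes the t-ratio far above the single-office value of 0.935.

```python
import math

est = 5.089619       # estimated ln(Emails) from klincom
se_mean = 0.0472553  # standard error of the estimated average

tb = (math.log(200) - est) / se_mean
# tb is far larger than the 0.935 from the single-office case, so
# Pr[T(22) < tb] is well above the 0.82 found there (the slide's 99%)
```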
super staffing

Part I. i. linear regression estimate. We estimate the regression (results below):

supers = b0 + b1·workers

. regress supers workers
------------------------------------------------------------------------------
      supers |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     workers |   .1053611   .0113256     9.30   0.000     .0820355    .1286867
       _cons |   14.44806   9.562012     1.51   0.143    -5.245273    34.14139
------------------------------------------------------------------------------

► The coefficient on workers means that increasing employment by one worker requires about 0.1 more supers; restated: for every extra 10 workers there is need for one extra super.

Part I. ii. prediction. Running kpredint for workers = 1200 gives the output:

. kpredint _b[_cons]+_b[workers]*1200
 Estimate: 140.88137
 Standard Error of Individual Prediction: 22.684067
 Individual Prediction Interval (95%): [94.162661,187.60008]
 t-ratio: 6.2105871
 If Ha: <     then Pr(T < t)     = 1
 If Ha: not = then Pr(|T| > |t|) = 0
 If Ha: >     then Pr(T > t)     = 0

► The new factory requires about 141 supers, with a 95% prediction interval: lower bound = 94, upper bound = 188.
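The Part I point prediction is just the fitted line evaluated at 1200 workers (coefficients from the regression output above):

```python
b0, b1 = 14.44806, 0.1053611  # intercept and slope from the supers regression
pred = b0 + b1 * 1200         # predicted supers for a 1200-worker factory
```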
Part II. i. log-linear regression estimate. We estimate the regression (results below):

ln(supers) = b0 + b1·workers

. regress lnsupers workers
------------------------------------------------------------------------------
    lnsupers |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     workers |   .0012041   .0001316     9.15   0.000     .0009331     .001475
       _cons |   3.515023    .111067    31.65   0.000     3.286276     3.74377
------------------------------------------------------------------------------

► The coefficient on workers: increasing employment by one worker requires about 0.12 percent more supers (100·0.0012041 ≈ 0.12%).

Part II. ii. prediction. Running kpredint for workers = 1200 gives the output:

. kpredint _b[_cons]+_b[workers]*1200
 Estimate: 4.9599165
 Standard Error of Individual Prediction: .26348554
 Individual Prediction Interval (95%): [4.4172579,5.5025751]
 t-ratio: 18.824246
 If Ha: <     then Pr(T < t)     = 1
 If Ha: not = then Pr(|T| > |t|) = 0
 If Ha: >     then Pr(T > t)     = 0

► The new factory requires about exp(4.96) ≈ 142 supers, with a 95% prediction interval: lower bound = exp(4.42) ≈ 83, upper bound = exp(5.50) ≈ 245.
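Back-transforming the Part II log-scale results (numbers copied from the kpredint output above):

```python
import math

est = 4.9599165                 # kpredint estimate of ln(supers)
lo, hi = 4.4172579, 5.5025751   # 95% individual prediction interval (log units)

pred = math.exp(est)                       # predicted supers
bounds = (math.exp(lo), math.exp(hi))      # prediction interval for supers
```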
Part III. i. log-log regression estimate. We estimate the regression (results below):

ln(supers) = b0 + b1·ln(workers)

. regress lnsupers lnworkers
------------------------------------------------------------------------------
    lnsupers |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   lnworkers |   .9092009   .0667307    13.62   0.000     .7717665    1.046635
       _cons |  -1.484583   .4354448    -3.41   0.002    -2.381398   -.5877674
------------------------------------------------------------------------------

► The coefficient on lnworkers: increasing employment by one percent requires 0.91 percent more supers; restated: for every 10 percent increase in the number of workers, the number of supers should increase by about 9 percent.

Part III. ii. prediction. Running kpredint for ln(workers) = ln(1200) = 7.09 gives the output:

. kpredint _b[_cons]+_b[lnworkers]*7.09
 Estimate: 4.9616519
 Standard Error of Individual Prediction: .18879103
 Individual Prediction Interval (95%): [4.5728295,5.3504743]
 t-ratio: 26.281184
 If Ha: <     then Pr(T < t)     = 1
 If Ha: not = then Pr(|T| > |t|) = 0
 If Ha: >     then Pr(T > t)     = 0

► The new factory requires about exp(4.96) ≈ 142 supers, with a 95% prediction interval: lower bound = exp(4.57) ≈ 97, upper bound = exp(5.35) ≈ 211.
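The Part III prediction plugs ln(1200) into the fitted log-log model (coefficients from the regression output; the slide rounds ln(1200) to 7.09, which moves the estimate only in the fourth decimal):

```python
import math

b0, b1 = -1.484583, 0.9092009        # log-log coefficients from the slide
ln_pred = b0 + b1 * math.log(1200)   # predicted ln(supers) at ln(workers) = ln(1200)
pred = math.exp(ln_pred)             # back on the supers scale
```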