assignment 7 solutions ► office networks ► super staffing


assignment 7 solutions ► office networks ► super staffing
Managerial Economics & Decision Sciences Department
© Kellogg School of Management

non-linear models ▌learning objectives

► STATA
  · testing for curvature: rvfplot
  · testing and correcting for heteroskedasticity: hettest, robust
  · correcting for clustering: cluster()
► non-linearity (log models)
  · test for curvature and its effect on the linear regression
  · use of logarithmic (log) models: interpretation and prediction with log models
► heteroskedasticity
  · define heteroskedasticity and its effect on the linear regression
  · correction for heteroskedasticity: log models and the "white wash" approach
► independence and clustering
  · define independence of errors and the effect of clustering
  · correction for clustering

readings
► (MSN) Chapter 8
► (KTN) Log vs. Linear Models, Noise, Heteroskedasticity, and Grouped Data

office networks ▌model specification and data

curvature. A simple scatter diagram indicates the presence of "curvature". For the sake of presentation, and to understand the rvfplot command, let's first run the linear regression Emails = b0 + b1·Computers. The results are shown in the table below.

. regress Emails Computers
------------------------------------------------------------------------------
      Emails |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   Computers |   21.08865    2.59076     8.14   0.000     15.71575    26.46156
       _cons |  -191.9852   48.87665    -3.93   0.001    -293.3491   -90.62123
------------------------------------------------------------------------------

[figure: scatter diagram showing curvature (left); fitted linear regression (right)]

Despite the obvious curvature, the results of the linear regression do not by themselves reveal the misfit with respect to curvature.

office networks ▌model specification and data

curvature. It is fairly easy to see that fitting a straight line definitely "misses" the curvature. This is why, if possible, a first visual "check" is recommended and, even better, rvfplot. Consider three points (A), (B) and (C) for which the predicted values (according to the linear regression) are shown in the figure. For each of these points we measure the residual, i.e. the distance between the true y and the predicted ŷ, namely y − ŷ.

► rvfplot plots the residual y − ŷ against the fitted value ŷ for each observation: the vertical axis of the scatter diagram becomes the horizontal axis of the rvfplot, and the vertical axis of the rvfplot simply measures the distance from the true y to the predicted value. The more curvature there is in the rvfplot, the more the considered regression misses the curvature in the original data.

[figure: fitted line with points A, B and C (left); the corresponding rvfplot (right)]

office networks ▌model specification and data

curvature. Since the curvature is "U" shaped we try a log-linear specification (for an inverted "U" shape, i.e. "∩", we would try a linear-log specification). The regression is

ln(Emails) = b0 + b1·Computers

Remark. Every time we change the specification we have to transform the variables first. In this case only the Emails variable is transformed, so we first generate the logarithm of Emails and then regress it on Computers.

. generate lnEmails=ln(Emails)
. regress lnEmails Computers

      Source |       SS       df       MS              Number of obs =      24
-------------+------------------------------           F(  1,    22) =  353.69
       Model |  16.8365711     1  16.8365711           Prob > F      =  0.0000
    Residual |  1.04726918    22  .047603145           R-squared     =  0.9414
-------------+------------------------------           Adj R-squared =  0.9388
       Total |  17.8838402    23  .777558271           Root MSE      =  .21818

------------------------------------------------------------------------------
    lnEmails |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   Computers |   .1188471   .0063195    18.81   0.000     .1057414    .1319529
       _cons |   2.712677   .1192213    22.75   0.000     2.465427    2.959927
------------------------------------------------------------------------------

office networks ▌model specification and data

curvature. The estimated regression is

ln(Emails) = 2.712677 + 0.1188471·Computers

Remark. To generate the predicted values for the new regression, simply generate lnEmails_hat as

generate lnEmails_hat = _b[_cons] + _b[Computers]*Computers

and then revert to Emails_hat as

generate Emails_hat = exp(lnEmails_hat)

[figure: the log specification fits the actual data better (left); the rvfplot shows no curvature (right)]

office networks ▌model specification and data

curvature. The estimated regression is ln(Emails) = 2.712677 + 0.1188471·Computers.

Remark. In the previous graph we plotted the true number of Emails against the predicted number of Emails; in other words, we first predicted ln(Emails) and then transformed it back into Emails. Alternatively, we can "live" in a "logarithmic world": compare the logarithm of the true Emails, i.e. lnEmails, with the predicted logarithm, i.e. lnEmails_hat. Of course the "fit" should look similarly "good" in both cases.

[figure: fitted values in the original units of Emails (left) and in units of ln(Emails) (right)]

office networks ▌i. estimation

The estimated regression is ln(Emails) = 2.712677 + 0.1188471·Computers.

► For Computers = 20 we first estimate the logarithm of Emails:

  ln(Emails)|Computers = 20 = 2.712677 + 0.1188471·20 = 5.0896194

then we "exponentiate" back to get the estimated number of Emails:

  Emails|Computers = 20 = exp(5.0896194) = 162.32807

[figure: the estimate 5.09 in ln(Emails) units maps to 162.33 in Emails units]
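As a quick arithmetic check, the two-step prediction can be reproduced by hand (coefficients taken from the regression output above):

```python
# Predict on the log scale, then exponentiate back to emails.
import math

b_cons, b_computers = 2.712677, 0.1188471   # from the regression output

ln_emails_20 = b_cons + b_computers * 20    # about 5.0896
emails_20 = math.exp(ln_emails_20)          # about 162.3
print(ln_emails_20, emails_20)
```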

office networks ▌ii. prediction interval

The estimated regression is ln(Emails) = 2.712677 + 0.1188471·Computers.

► Since we are asked about an interval for the estimate at a single office with Computers = 20, we use the kpredint command:

. kpredint _b[_cons]+_b[Computers]*20
  Estimate: 5.0896194
  Standard Error of Individual Prediction: .22324024
  Individual Prediction Interval (95%): [ 4.6266475, 5.5525913 ]
  t-ratio: 22.798844
  If Ha: <     then Pr(T < t)     = 1
  If Ha: not = then Pr(|T| > |t|) = 0
  If Ha: >     then Pr(T > t)     = 0

Remark. The interval [4.6266475, 5.5525913] is for the estimated ln(Emails) = 5.0896194.

► To obtain the interval for Emails we exponentiate the lower and upper bounds (no correction is needed here since we are dealing with a single observation):

  lower bound(Emails) = exp(4.6266475) = 102.17096
  upper bound(Emails) = exp(5.5525913) = 257.905

► The prediction interval is thus [102.17096, 257.905] and the estimate is 162.32807.

office networks ▌ii. prediction interval

Units of measurement translation:

► The interval provided by kpredint, [4.626, 5.552], is in "logarithmic units" and will always be "centered" around the estimate: 5.089 lies 0.463 away from each bound.

► Transforming this interval through the exponential function translates it into the original units. Since the exponential function is non-linear, the transformation is not proportional: equal distances in the initial "logarithmic units" do not translate into equal distances in the original units.

Remark. The interval for the logarithm, [4.626, 5.552], is centered around the estimated logarithm, 5.089. But the interval for Emails, [102.170, 257.905], is not centered at the estimate for Emails, 162.328: the lower bound sits 60.158 below the estimate while the upper bound sits 95.577 above it.

[figure: the exp(·) function mapping the symmetric interval [4.626, 5.552] around 5.089 into the asymmetric interval [102.170, 257.905] around 162.328]
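The asymmetry can be verified numerically, with the bounds taken from the kpredint output above:

```python
# Exponentiating the log-scale prediction interval: symmetric in logs,
# asymmetric in the original units because exp() is non-linear.
import math

lo_ln, est_ln, hi_ln = 4.6266475, 5.0896194, 5.5525913   # from kpredint

lo, est, hi = (math.exp(v) for v in (lo_ln, est_ln, hi_ln))

print(round(est_ln - lo_ln, 4), round(hi_ln - est_ln, 4))  # equal: 0.463 0.463
print(round(est - lo, 3), round(hi - est, 3))              # unequal distances
```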

office networks ▌iii. benchmarking the estimate

We are asked to "estimate the probability that the daily internal emails at a particular office with 20 computers will be under 200". There are two important issues here:

  · understand what the probability is about, i.e. what exactly are you asked to calculate?
  · make sure the units of measurement are used consistently

As for the first issue: the regression only gives us an estimate of the daily internal emails at an office with 20 computers (about 162), but we do not know the true value of this number. Thus the true number of emails, call it trueY, remains a random variable for us, and the probability refers exactly to this trueY: for a given benchmark yb (in our case yb = 200) we have to calculate

  Pr[trueY < yb]

This looks like a daunting task… unless we remember that we know the following fact:

  (trueY − ŷ)/s.e.(ŷ) ~ T(n − k − 1)

where ŷ is the sample estimate of trueY, s.e.(ŷ) is the standard error of the individual prediction, and T(n − k − 1) is a t-distributed random variable with n − k − 1 degrees of freedom (k is the number of variables used to obtain the estimate). For our situation the estimate is obtained as a logarithm in a regression with one variable, thus k = 1.

office networks ▌iii. benchmarking the estimate

Changing to logarithms (there is no need for a correction here):

  (ln(trueY) − est)/s.e. ~ T(n − k − 1)

where est = 5.0896194 is the estimated ln(Emails) and s.e. = 0.22324024 is the standard error of the individual prediction.

► But how do we really use this result? The above implies that for any number t (this is our choice):

  Pr[(ln(trueY) − est)/s.e. < t] = Pr[T(n − k − 1) < t]

► With a bit of work (algebra):

  Pr[trueY < exp(est + t·s.e.)] = Pr[T(n − k − 1) < t]

► We are asked to evaluate Pr[trueY < yb], so choose the particular t, call it tb, that solves exp(est + tb·s.e.) = yb, i.e.

  tb = (ln(yb) − est)/s.e.

Then Pr[trueY < yb] = Pr[T(n − k − 1) < tb].

office networks ▌iii. benchmarking the estimate

Our conclusion is that the desired probability is

  Pr[trueY < yb] = Pr[T(n − k − 1) < tb]  with  tb = (ln(yb) − est)/s.e.

► The final step is to calculate tb. Going back to the output from kpredint:

. kpredint _b[_cons]+_b[Computers]*20
  Estimate: 5.0896194
  Standard Error of Individual Prediction: .22324024
  Individual Prediction Interval (95%): [ 4.6266475, 5.5525913 ]
  t-ratio: 22.798844
  If Ha: <     then Pr(T < t)     = 1
  If Ha: not = then Pr(|T| > |t|) = 0
  If Ha: >     then Pr(T > t)     = 0

we identify est = 5.0896194 and s.e. = 0.22324024, and ln(yb) = ln(200) = 5.298, thus

  tb = (5.298 − 5.0896194)/0.22324024 ≈ 0.93

► We find Pr[trueY < 200] = Pr[T(22) < 0.93] ≈ 0.82.

Remark. If you run kpredint _b[_cons] + _b[Computers]*20 − 5.298 you should get for Ha: > the result 0.82.
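The t-ratio computation can be checked by hand, using the estimate and standard error from the kpredint output above:

```python
# tb = (ln(benchmark) - estimate) / s.e. of the individual prediction.
import math

estimate, se = 5.0896194, 0.22324024   # from kpredint
benchmark = 200

tb = (math.log(benchmark) - estimate) / se
print(round(tb, 3))   # about 0.935; Pr(T(22) < tb) is then about 0.82
```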

office networks ▌iv.–v. estimate of the average and confidence interval

The logic for this part is very similar to the one used in answering the previous question. There are a few differences in terms of the correction factor and the standard error of the estimate, given that now we are estimating the average number of emails across all offices that have 20 computers.

► First, since we are estimating the average number of emails, we run klincom:

. klincom _b[_cons]+_b[Computers]*20
------------------------------------------------------------------------------
    lnEmails |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   5.089619   .0472553   107.70   0.000     4.991618    5.187621
------------------------------------------------------------------------------
  If Ha: <     then Pr(T < t)     = 1
  If Ha: not = then Pr(|T| > |t|) = 0
  If Ha: >     then Pr(T > t)     = 0

► Thus, the (corrected) estimate is:

  est. avg. Emails = exp(5.089619)·exp(e(rmse)^2/2) = 166.2383

and the confidence interval:

  lower bound = exp(4.991618)·exp(e(rmse)^2/2) = 150.71936
  upper bound = exp(5.187621)·exp(e(rmse)^2/2) = 183.35471

where e(rmse) = 0.21818 is the Root MSE of the lnEmails regression.
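The corrected estimate and bounds can be reproduced numerically (the log-scale values come from the klincom output above, and the Root MSE from the lnEmails regression output):

```python
# Corrected estimate for the average: exp(log-scale value) * exp(rmse^2 / 2).
import math

rmse = 0.21818                      # Root MSE of the lnEmails regression
correction = math.exp(rmse**2 / 2)  # about 1.024

est = math.exp(5.089619) * correction   # about 166.24
lo  = math.exp(4.991618) * correction   # about 150.72
hi  = math.exp(5.187621) * correction   # about 183.35
print(round(est, 2), round(lo, 2), round(hi, 2))
```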

office networks ▌vi. benchmarking the estimate

We saw that klincom gives an estimate est = 5.089619 of μ = E[ln(Emails) | Computers = 20], with standard error s.e. = 0.0472553, and that

  (μ − est)/s.e. ~ T(n − k − 1)

The average number of emails is avgY = exp(μ + s²/2), where s is the regression's root MSE; thus avgY < yb exactly when μ < ln(yb) − s²/2.

► But how do we really use this result? The above implies that for any number t (this is our choice):

  Pr[(μ − est)/s.e. < t] = Pr[T(n − k − 1) < t]

► With a bit of work (algebra):

  Pr[μ < est + t·s.e.] = Pr[T(n − k − 1) < t]

► We need to evaluate:

  Pr[avgY < yb] = Pr[μ < ln(yb) − s²/2]

office networks ▌vi. benchmarking the estimate

Use the last two equalities:

  Pr[avgY < yb] = Pr[μ < ln(yb) − s²/2] = Pr[T(n − k − 1) < tb]

► We need the tb that solves est + tb·s.e. = ln(yb) − s²/2, i.e.

  tb = (ln(200) − 0.21818²/2 − 5.089619)/0.0472553 ≈ 3.91

► Finally: Pr[avgY < 200] = Pr[T(22) < 3.91] ≈ 0.99.

Remark. Notice the difference: the company is 82% sure that the number of emails will be under 200 for one office with 20 computers, but it is 99% sure that the average number of emails will be under 200 across all offices with 20 computers.
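The t-ratio for the average benchmark can be checked the same way. This is a sketch of the reasoning with all inputs taken from the outputs above, not a replacement for klincom/kpredint:

```python
# tb for the average: (ln(benchmark) - rmse^2/2 - estimate) / s.e. of the mean.
import math

estimate, se, rmse = 5.089619, 0.0472553, 0.21818   # from klincom / regress

tb = (math.log(200) - rmse**2 / 2 - estimate) / se
print(round(tb, 2))   # about 3.91; Pr(T(22) < 3.91) is comfortably above 0.99
```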

super staffing ▌Part I

i. linear regression estimate. We estimate the regression supers = b0 + b1·workers (results below):

. regress supers workers
------------------------------------------------------------------------------
      supers |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     workers |   .1053611   .0113256     9.30   0.000     .0820355    .1286867
       _cons |   14.44806   9.562012     1.51   0.143    -5.245273    34.14139
------------------------------------------------------------------------------

► The coefficient on workers means that increasing employment by one worker requires about 0.1 more supers; restated: every extra 10 workers require roughly one extra super.

ii. prediction. Running kpredint for workers = 1200 gives the output:

. kpredint _b[_cons]+_b[workers]*1200
  Estimate: 140.88137
  Standard Error of Individual Prediction: 22.684067
  Individual Prediction Interval (95%): [94.162661, 187.60008]
  t-ratio: 6.2105871
  If Ha: <     then Pr(T < t)     = 1
  If Ha: not = then Pr(|T| > |t|) = 0
  If Ha: >     then Pr(T > t)     = 0

► The new factory requires about 140 supers, with a 95% prediction interval of roughly [94, 188].

super staffing ▌Part II

i. log-linear regression estimate. We estimate the regression ln(supers) = b0 + b1·workers (results below):

. regress lnsupers workers
------------------------------------------------------------------------------
    lnsupers |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     workers |   .0012041   .0001316     9.15   0.000     .0009331     .001475
       _cons |   3.515023    .111067    31.65   0.000     3.286276     3.74377
------------------------------------------------------------------------------

► The coefficient on workers: increasing employment by one worker requires about 0.12 percent more supers (100·0.0012041 ≈ 0.12).

ii. prediction. Running kpredint for workers = 1200 gives the output:

. kpredint _b[_cons]+_b[workers]*1200
  Estimate: 4.9599165
  Standard Error of Individual Prediction: .26348554
  Individual Prediction Interval (95%): [4.4172579, 5.5025751]
  t-ratio: 18.824246
  If Ha: <     then Pr(T < t)     = 1
  If Ha: not = then Pr(|T| > |t|) = 0
  If Ha: >     then Pr(T > t)     = 0

► The new factory requires about exp(4.96) = 142 supers, with a 95% prediction interval of roughly [exp(4.41), exp(5.50)] = [82, 245].

super staffing ▌Part III

i. log-log regression estimate. We estimate the regression ln(supers) = b0 + b1·ln(workers) (results below):

. regress lnsupers lnworkers
------------------------------------------------------------------------------
    lnsupers |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   lnworkers |   .9092009   .0667307    13.62   0.000     .7717665    1.046635
       _cons |  -1.484583   .4354448    -3.41   0.002    -2.381398   -.5877674
------------------------------------------------------------------------------

► The coefficient on lnworkers: increasing employment by one percent requires about 0.91 percent more supers; restated: for every 10 percent increase in the number of workers, the number of supers should increase by about 9 percent.

ii. prediction. Running kpredint for ln(workers) = ln(1200) = 7.09 gives the output:

. kpredint _b[_cons]+_b[lnworkers]*7.09
  Estimate: 4.9616519
  Standard Error of Individual Prediction: .18879103
  Individual Prediction Interval (95%): [4.5728295, 5.3504743]
  t-ratio: 26.281184
  If Ha: <     then Pr(T < t)     = 1
  If Ha: not = then Pr(|T| > |t|) = 0
  If Ha: >     then Pr(T > t)     = 0

► The new factory requires about exp(4.96) = 142 supers, with a 95% prediction interval of roughly [exp(4.57), exp(5.35)] = [96, 211].
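As a final check, the three specifications' point predictions for workers = 1200 can be compared side by side, using the coefficients from the three regression outputs above:

```python
# Point predictions for a factory with 1200 workers under each specification.
import math

workers = 1200

linear = 14.44806 + 0.1053611 * workers                       # supers directly
loglin = math.exp(3.515023 + 0.0012041 * workers)             # exp of ln(supers)
loglog = math.exp(-1.484583 + 0.9092009 * math.log(workers))  # log-log model

print(round(linear, 1), round(loglin, 1), round(loglog, 1))
# All three land in the same neighborhood (roughly 141-143 supers),
# even though their prediction intervals differ markedly.
```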