Managerial Economics & Decision Sciences Department
business analytics II | session five key concepts
► hypotheses, tests and confidence intervals
► linear regression: estimation and interpretation
► linear regression: the dummy case
session five key concepts

readings
► (MSN) Chapters 2-5

learning objectives
► linear regression: definition and assumptions of the linear model; estimating the model and interpreting the coefficients; understanding a regression table when it is provided without data; statistical significance, p-values and confidence intervals
► confidence and prediction intervals: the klincom and kpredint commands; interpreting and using the output of klincom and kpredint; using the commands both for changes in y and for levels of y
► dummy variables: definition and interpretation of dummy and slope dummy variables; use of dummy and slope dummy regressions in hypothesis testing
linear regression – general overview

► A high-level description of the typical steps in a regression analysis:
   Step I: Model specification – choice of dependent and independent variables
   Step II: Coefficient estimation – run the regression, obtain the estimated coefficients
   Step III: Coefficient interpretation – sensitivity analysis (change in E[y] vs. change in xᵢ)
   Step IV: Tests of results – test coefficient significance
   Step V: Confidence intervals – ranges for the parameters (βᵢ) and for the dependent variable
   Step VI: Additional analysis – tests of combinations of parameters
► Sometimes the steps above were fairly obvious; other times the analysis was not explicitly set out in terms of these steps. But everything we have done so far (and will continue to do) falls somewhere along this general roadmap.
► As a simple example: while the dummy regression seems to be a topic by itself, it is really nothing but a standard linear regression model. If you think the dummy vs. slope dummy distinction is not captured above, it is really a matter of Step I – a correct and purpose-driven specification of the regression model.
► The intention of this review is to give an integrated view of the topics covered so far, not to substitute for the notes. As you read the detailed notes and solve practice problems, keep this roadmap handy and try to figure out where your specific analysis/topic/problem is located.
linear regression: model specification

► The mechanics of running a linear regression are straightforward: no matter which variables you decide to include as independent variables (x₁, …, xₖ), the model gets estimated. The most important aspect at the model specification stage is to consider as independent variables only those for which there exist ex-ante (before estimation) reasons to believe that the variable under consideration really affects the dependent variable in a meaningful way. Can you argue why the x-variable might affect the y-variable?
► Ingredients of the linear regression model:
   i. the assumption that the true mean of the dependent variable is explained by k independent variables through a linear relation:
      E[y] = β₀ + β₁·x₁ + β₂·x₂ + … + βₖ·xₖ
   ii. what is observed in practice is a sample of n observations of (y, x₁, x₂, …, xₖ); in terms of observations we write
      y = β₀ + β₁·x₁ + β₂·x₂ + … + βₖ·xₖ + noise
      where the term "noise" denotes the part of the variation in y (around its mean) not explained by the deterministic part of the regression (β₀ + β₁·x₁ + β₂·x₂ + … + βₖ·xₖ)
   iii. the OLS (ordinary least squares) method delivers estimates b₀, b₁, b₂, …, bₖ of the true parameters β₀, β₁, β₂, …, βₖ respectively
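► A minimal Stata sketch of specifying and estimating such a model (the dataset name mydata.dta and the variables y, x1, x2, x3 are hypothetical placeholders):

   * load the data and inspect the candidate variables
   use mydata.dta, clear
   summarize y x1 x2 x3

   * estimate E[y] = b0 + b1*x1 + b2*x2 + b3*x3 by OLS
   regress y x1 x2 x3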
linear regression: coefficient estimation and interpretation

► Estimation is always the easiest step. Again, it is worth emphasizing that it makes no difference whether the independent variables are continuous, discrete (dummies) or slope dummies: the same OLS method is applied.
► For a regression without slope dummy variables, all coefficients have the same interpretation: bᵢ (the coefficient of independent variable xᵢ) represents the change in y when xᵢ changes by one unit, holding all other independent variables at a fixed level (no matter what level).
► Why does a regression with slope dummy variables deserve different treatment at this step? Notice in the statement above that the interpretation of bᵢ requires that only xᵢ changes, all other independent variables being held fixed. For a simple slope dummy regression
   Est.E[y] = b₀ + b₁·dummy + b₂·x + b₃·dummy·x
if we attempt to interpret b₂ as the change in y when x changes by one unit holding all other independent variables fixed, we run into a problem when dummy = 1: in this case dummy·x = x, so this variable also changes when x changes. This is why for dummy regressions we have to pay extra attention to how we interpret the coefficients; however, what we end up doing is simply a grouping of coefficients (b₀ with b₀ + b₁, and b₂ with b₂ + b₃) that allows us to apply the "holding everything else constant" condition properly.
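► A minimal Stata sketch of the slope dummy setup above (y, x and dummy are placeholder names):

   * build the slope dummy (interaction) term by hand
   generate dummyx = dummy * x

   * Est.E[y] = b0 + b1*dummy + b2*x + b3*dummyx
   regress y dummy x dummyx

   * read the output by groups of coefficients:
   *   dummy = 0: intercept b0,      slope b2
   *   dummy = 1: intercept b0 + b1, slope b2 + b3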
linear regression: tests of results

► While it seems that each time you solve a problem there is a new test to apply and remember, there is really a standard "testing procedure" that is applied again and again. The impression that there is no standardized strategy for performing a test is probably due to the fact that, for lack of time, we sometimes skip over some of the steps involved in testing.
► The standard (and very general) steps in any testing procedure are given below (a numeric Stata sketch follows the list):
   determine the function/combination of coefficients that is tested, call it f(b₀, b₁, …, bₖ)
   determine the benchmark B* with which f(b₀, b₁, …, bₖ) is compared
   set the null hypothesis H₀ – gather negative evidence against this statement
   set the alternative hypothesis (the contrary of the null) Hₐ
   calculate the t-test = [f(b₀, b₁, …, bₖ) – B*]/std.err[f(b₀, b₁, …, bₖ)]
   calculate the p-value:
   - if Hₐ: f(b₀, b₁, …, bₖ) > B* then p-value = Pr[t > t-test] (right tail)
   - if Hₐ: f(b₀, b₁, …, bₖ) < B* then p-value = Pr[t < t-test] (left tail)
   - if Hₐ: f(b₀, b₁, …, bₖ) ≠ B* then p-value = Pr[t > |t-test|] + Pr[t < –|t-test|] (two-tail)
   set the significance level α ∈ (0,1)
   compare the p-value with your chosen significance level α; if p-value < α then reject the null
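► A minimal Stata sketch of the last three steps for a single coefficient, using the pizzasales regression from the example slides below with a hypothetical benchmark B* = 2 for the Income coefficient (df = 48):

   regress Sales Income
   * t-test = (estimate - benchmark)/std.err
   display (_b[Income] - 2)/_se[Income]
   * right-tail p-value: Pr[t > t-test]
   display ttail(48, (_b[Income] - 2)/_se[Income])
   * two-tail p-value: Pr[t > |t-test|] + Pr[t < -|t-test|]
   display 2*ttail(48, abs((_b[Income] - 2)/_se[Income]))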
linear regression: tests of results

► For the simple case when the test involves a single coefficient of the regression, you can just use the estimate and standard error of that coefficient from the regression table.
► When the test involves a combination of coefficients (as in the case of testing whether the average of y is above a certain benchmark b*) for particular, given values of the independent variables x₁, x₂, …, xₖ:
   f(b₀, b₁, …, bₖ) = b₀ + b₁·x₁ + … + bₖ·xₖ and B* = b*
   ■ null hypothesis H₀: b₀ + b₁·x₁ + … + bₖ·xₖ ≤ b*
   ■ alternative hypothesis Hₐ: b₀ + b₁·x₁ + … + bₖ·xₖ > b*
► You will use the klincom or kpredint command (a sketch using Stata's built-in lincom follows below). Choose the p-value that corresponds to your specific alternative hypothesis, set the significance level α ∈ (0,1) and compare the p-value with α; if p-value < α then reject the null.
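► klincom and kpredint are course-provided commands, so their exact syntax is not reproduced here. As a hedged sketch, Stata's built-in lincom reports the estimate and standard error of the same combination (its printed test is always against zero); the Income level 300 and benchmark b* = 800 below are hypothetical:

   regress Sales Income
   * estimate and std. err. of f(b0, b1) = b0 + b1*300
   lincom _cons + 300*Income
   * manual one-sided test against the benchmark b* = 800
   display (r(estimate) - 800)/r(se)
   display ttail(48, (r(estimate) - 800)/r(se))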
linear regression: confidence intervals

► The central idea behind any confidence interval is the following: the true value of your variable of interest is not known, but you do have a sample based on which you obtain an estimate (the variable of interest might be a single coefficient, a combination of coefficients, or the value of the dependent variable for given levels of the independent variables – this is context driven). Based on the sample you determine two numbers, call them b_L and b_U, as lower and upper bounds, such that you can say that with a certain chosen probability 1 – α the true value of your variable of interest is above b_L and below b_U. You call [b_L, b_U] a 1 – α confidence interval for your variable of interest. The bounds depend on the chosen significance: the lower the α, the wider the confidence interval.
► To see that all these pieces are related: the variable of interest is the function/combination of coefficients f(b₀, b₁, …, bₖ) and the general form of any confidence interval is
   b_L = estimated value of f(b₀, b₁, …, bₖ) – t(n – k – 1, α/2) · std.error[f(b₀, b₁, …, bₖ)]
   b_U = estimated value of f(b₀, b₁, …, bₖ) + t(n – k – 1, α/2) · std.error[f(b₀, b₁, …, bₖ)]
► Depending on how complicated the combination of coefficients is (see the previous discussion on tests), you can either perform these calculations "manually", i.e. calculate or obtain the standard error fairly simply, or you need to run the klincom or kpredint commands.
linear regression: confidence intervals

► What can you calculate yourself? You can always determine the estimate of the variable of interest and the t-value:
   t(n – k – 1, α/2) = invttail(n – k – 1, α/2)
The third ingredient, the standard error, is given directly in the regression table if you calculate a confidence interval for a single coefficient; otherwise you have to use the klincom or kpredint commands.
linear regression: example

► A simple example: let's use the pizzasales.dta file and run the regression Sales = b₀ + b₁·Income

------------------------------------------------------------------------------
       Sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      Income |   2.697244   .2777973     9.71   0.000     2.138695    3.255793
       _cons |   48.72079   87.57192     0.56   0.581    -127.3544    224.7959
------------------------------------------------------------------------------

► The table already provides the 95% confidence intervals for b₀ and b₁. Let's confirm the calculations for b₁ (the sample consists of n = 50 observations):
   - the estimated value: b₁ = 2.697244
   - the t-value: t(48, 0.025) = invttail(48, 0.025) = 2.010635
   - the standard error: std.err.(b₁) = 0.2777973
► The 95% confidence interval bounds:
   b_L = b₁ – t(48, 0.025)·std.error[b₁] = 2.697244 – 2.010635·0.2777973 = 2.138695
   b_U = b₁ + t(48, 0.025)·std.error[b₁] = 2.697244 + 2.010635·0.2777973 = 3.255793
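► The same numbers can be reproduced directly in Stata (a sketch, assuming pizzasales.dta is loaded):

   regress Sales Income
   * t-value with n - k - 1 = 50 - 1 - 1 = 48 degrees of freedom
   display invttail(48, 0.025)
   * 95% confidence interval bounds for b1
   display _b[Income] - invttail(48, 0.025)*_se[Income]
   display _b[Income] + invttail(48, 0.025)*_se[Income]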
linear regression: example

► Let's continue with the pizzasales.dta file and the regression Sales = b₀ + b₁·Income

------------------------------------------------------------------------------
       Sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      Income |   2.697244   .2777973     9.71   0.000     2.138695    3.255793
       _cons |   48.72079   87.57192     0.56   0.581    -127.3544    224.7959
------------------------------------------------------------------------------

► Remark: one issue that sometimes raises questions is why we do not use the t = 9.71 from the table for the confidence interval. That t is the t-test calculated for the null hypothesis that b₁ = 0. Let's confirm that:
   ■ null hypothesis H₀: b₁ = 0
   ■ alternative hypothesis Hₐ: b₁ ≠ 0
   calculate t-test = [b₁ – 0]/std.err[b₁] = [2.697244 – 0]/0.2777973 = 9.71
   calculate p-value = Pr[t > |t-test|] + Pr[t < –|t-test|] = 2·ttail(48, 9.71) ≈ 0
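► A quick Stata check of the reported t and p-value:

   * reproduce the table's t statistic and its two-tail p-value
   display _b[Income]/_se[Income]
   display 2*ttail(48, abs(_b[Income]/_se[Income]))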
linear regression: example

------------------------------------------------------------------------------
       Sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      Income |   2.697244   .2777973     9.71   0.000     2.138695    3.255793
       _cons |   48.72079   87.57192     0.56   0.581    -127.3544    224.7959
------------------------------------------------------------------------------

► The reported t and P>|t| columns belong to the testing part: for _cons, each tail beyond |t-test| = 0.56 has probability 0.2905, and the sum of the two tails is exactly the p-value, 0.581.
► The reported [95% Conf. Interval] column is the confidence interval for α = 0.05 (95% confidence): the estimate 48.72079 lies between the bounds –127.3544 and 224.7959, and the two tails of probability 0.025 each, left outside the interval, sum to exactly α = 0.05.
linear regression: klincom and kpredint

► The confidence interval is the interval that contains the true mean of y for x = x₀ with probability 1 – α. The confidence interval draws inference about E[y | x = x₀] based on Est.E[y | x = x₀]. The confidence interval's form is:
   Est.E[y | x = x₀] – std.err.CI · t(df, α/2) ≤ E[y | x = x₀] ≤ Est.E[y | x = x₀] + std.err.CI · t(df, α/2)
► The prediction interval is the interval that contains an individual level of y for x = x₀ with probability 1 – α. The prediction interval draws inference about y | x = x₀ based on Est.E[y | x = x₀]. The prediction interval's form is:
   Est.E[y | x = x₀] – std.err.PI · t(df, α/2) ≤ y | x = x₀ ≤ Est.E[y | x = x₀] + std.err.PI · t(df, α/2)
► Understanding the difference between klincom and kpredint:
   klincom is used to provide inference about the level of the true mean/average of the dependent variable
   kpredint is used to provide inference about the level of an individual value of the dependent variable
► Both klincom and kpredint can be used to obtain an interval (set the benchmark equal to zero) or to perform tests comparing the average or an individual level of the dependent variable with a given benchmark (in this case you include the benchmark in the command).
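► The two standard errors behind these intervals can be obtained in plain Stata with predict after regress: stdp gives the standard error of the fitted mean (confidence interval), stdf the standard error of an individual forecast (prediction interval). A hedged sketch with the pizzasales regression:

   regress Sales Income
   * std. error of Est.E[y | x = x0] -> confidence interval
   predict se_ci, stdp
   * std. error of an individual y | x = x0 -> prediction interval
   predict se_pi, stdf
   * the forecast std. error is always larger: se_pi^2 = se_ci^2 + rmse^2
   list Income se_ci se_pi in 1/5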
linear regression: dummy variables

► Whenever a dummy variable and a continuous variable are involved, there are three possible regressions you can set up (a Stata sketch follows the list):
   y = β₀ + β₁·dummy
   For this setup you are only interested in the mean of y for dummy = 1 vs. dummy = 0, without controlling for anything else. As an example: test whether pizza Sales differ between neighborhoods with competitors and neighborhoods without competition (test b₁ = 0).
   y = β₀ + β₁·dummy + β₂·x
   For this setup you force the regression to assume that the dummy variable has only a level effect on y.
   y = β₀ + β₁·dummy + β₂·x + β₃·dummy·x
   For this setup you allow the regression to pick up the slope effect; you assume that the dummy variable and the continuous variable might interact. Useful for testing whether the change in y for a change in x is different when dummy = 1 vs. dummy = 0 (test b₃ = 0).
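► A hedged Stata sketch of the three setups in the pizzasales context (the dummy variable name competitor is hypothetical; Sales and Income are from the example above):

   * (1) mean Sales for dummy = 1 vs. dummy = 0 only
   regress Sales competitor
   * (2) level effect of the dummy, controlling for Income
   regress Sales competitor Income
   * (3) level and slope effects: build the slope dummy first
   generate compIncome = competitor * Income
   regress Sales competitor Income compIncome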