Managerial Economics & Decision Sciences Department | Business Analytics II
week 8: non-linearity, heteroskedasticity and clustering
© 2016 kellogg school of management | managerial economics and decision sciences department | business analytics II
session eight | non-linearity, heteroskedasticity and clustering

readings
► statistics & econometrics (MSN) Chapter 8
non-linearity (log models): test for curvature and its effect on linear regression; use of logarithmic (log) models: interpretation and prediction with log models
heteroskedasticity: definition and its effect on linear regression; correction for heteroskedasticity: log models and the "white wash" approach
independence and clustering: independence of errors and the effect of clustering; correction for clustering
► (CS) AdSales, Bank Deposit, Yogurt Sales, Convenience Store
► (KTN) Log vs. Linear Models, Noise, Heteroskedasticity, and Grouped Data

learning objectives
testing for curvature: rvfplot
heteroskedasticity: hettest, robust
correcting for clustering: cluster()
non-linear models
► So far we never questioned the appropriateness of a linear equation relating the dependent variable and the independent x-variables; in fact it was our main assumption for the linear regression.
► However, linearity is difficult to reconcile with the "diminishing returns" hypothesis: equal increases in investment/effort/advertisement generate higher results but at a decreasing rate.
► A simple example is provided by the data in the file adsales.dta. The scatter diagram below shows the observations (sales vs. expenditure on advertisement) and the fitted linear regression line.
► Using a linear regression in cases like this produces models that overstate the estimation at both ends of the data range and understate it in the middle of the range. It is reasonable to look for a different specification of the model, i.e. a non-linear one.
Figure 1. Results for regression of sales on expenditure on advertisement (estimation is overstated at the ends and understated in the middle)
non-linear models
► Another situation in which linearity fails to fit the data is the opposite of the diminishing returns case. Take, for example, the time-related balance of a bank deposit. As time passes the balance increases, but at a higher and higher rate (assume no withdrawals and no drastic decreases in the interest rate). Thus with each "extra unit of time" the increase in balance is higher.
► This is illustrated by the data in the file deposit.dta. The scatter diagram below shows the observations (balance vs. time) and the fitted linear regression line.
► Using a linear regression in cases like this produces models that understate the estimation at both ends of the data range and overstate it in the middle of the range. It is reasonable to look for a different specification of the model, i.e. a non-linear one.
Figure 2. Results for regression of balance on time (estimation is understated at the ends and overstated in the middle)
non-linear models: the logarithm
► The type of curvature seen in the previous plot is easily described by a logarithmic function. The natural logarithm function y = ln(x) has the graph shown in the diagram on the right. If y = ln(x) then x = e^y, where e ≈ 2.718 (the base of the exponential).
► One useful property of the logarithm is that, for two numbers x and x' close enough to each other:
ln(x') - ln(x) ≈ (x' - x)/x
► We will use this principle when we interpret the results of models using logs.
Remark: this approximation breaks down when x and x' are too far apart in percentage terms.
You can generate the logarithm of a variable in STATA: generate lnvariable = ln(variable)
Figure 3. The logarithmic function
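The percent-change approximation above can be checked numerically. A minimal sketch in Python (the course itself uses STATA; the values here are made up for illustration):

```python
import math

# The approximation ln(x') - ln(x) ~= (x' - x)/x for x' close to x
x, x_prime = 100.0, 101.0

exact = math.log(x_prime) - math.log(x)   # difference of logs
approx = (x_prime - x) / x                # percentage change

# exact ~= 0.00995, approx = 0.01: very close for a small percentage change
print(round(exact, 5), round(approx, 5))

# The approximation degrades when x and x' are far apart in percentage terms:
# going from 100 to 200 is a 100% change, but the log difference is only ~0.693
far = math.log(200.0) - math.log(100.0)
print(round(far, 3))
```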
non-linear models: the logarithm
► Types of logarithmic models: depending on which (if any) variable gets "logged" there are four types of models:

Model           | Dependent variable | Independent variable
standard linear | y                  | x
log-linear      | ln(y)              | x
linear-log      | y                  | ln(x)
log-log         | ln(y)              | ln(x)

► How can we choose which model to use?
let the data "speak": does the scatter plot show any specific "curvature"?
if the answer to the first question suggests further investigation, STATA offers the command rvfplot, which plots the residuals against the predicted values
to implement rvfplot, first run the regression for the model you want to test (one of the four specifications above), then simply type rvfplot and STATA generates the residuals plot
Figure 4. The logarithmic models
non-linear models: the logarithm
► Model: E[sales] = β0 + β1·expend. The graph below shows the scatter plot and the regression line. Consider observation #56 in the set, for which expend = 0.46 and sales = 13.9; the predicted sales is fitted sales = 14.9.
► rvfplot plots the residuals against the predicted (fitted) values, i.e. for each observation in the set we plot the pair (fitted y, residual), with fitted y on the horizontal axis and the corresponding residual on the vertical axis:
residual = actual y - fitted y = 13.9 - 14.9 = -1.0
Figure 5. Regression and scatter plots
Figure 6. The rvfplot diagram
non-linear models: the logarithm
► rvfplot provides a visual representation of how far from zero the differences ("errors") between the linearly fitted y and the actual y are.
Remark. When the scatter plot looks "curved", you will notice in the rvfplot that, for the observations not lying along the fitted line, the "errors" sit either above or below the zero level in a clear pattern. If there is no curvature in the scatter plot then the "errors" lie uniformly above and below the zero level across all levels of fitted values.
Figure 7. Regression and scatter plots
Figure 8. The rvfplot diagram
non-linear models: the logarithm
► Model: E[sales] = β0 + β1·expend. Simply regress sales on expend and then run the command rvfplot (use adsales.dta).
► Model: E[sales] = β0 + β1·lnexpend. Generate the variable lnexpend = ln(expend) and regress sales on lnexpend; finally run the command rvfplot.
► The curvature in the residuals plot (left diagram) disappears when the standard linear regression is changed to a linear-log specification (the residuals plot in the right diagram has no obvious curvature).
Figure 9. The rvfplot diagram: linear model
Figure 10. The rvfplot diagram: the logarithmic model
non-linear models: the logarithm
► Model: E[balance] = β0 + β1·time. Simply regress balance on time and then run the command rvfplot (use deposit.dta).
► Model: E[lnbalance] = β0 + β1·time. Generate the variable lnbalance = ln(balance) and regress lnbalance on time; finally run the command rvfplot.
► The curvature in the residuals plot (left diagram) disappears when the standard linear regression is changed to a log-linear specification (the residuals plot in the right diagram has no obvious curvature).
Figure 10. The rvfplot diagram: linear model
Figure 11. The rvfplot diagram: the logarithmic model
non-linear models: the logarithm
► Interpretation of coefficients
■ Standard linear model: y = b0 + b1·x. A one-unit change in x results in a change of b1 units in y:
Δy = b1·Δx
Example: y = 100 + 2x. Say x increases from 100 to 101; then y changes from 300 to 302, so the net change in y is 2 units (b1 units).
■ Linear-log model: y = b0 + b1·ln(x). A one-percent change in x results in a change of b1/100 units in y:
Δy = b1·(Δx/x)
Example: y = 100 + 2·ln(x). Say x increases from 100 to 101; then y changes from 100 + 2·ln(100) ≈ 109.21 to 100 + 2·ln(101) ≈ 109.23. Thus x changed by (101 - 100)/100 = 1% while y changed by 109.23 - 109.21 ≈ 0.02 (b1/100 units).
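A quick numeric check of the two interpretations above, in Python (the functional forms are the ones from the slide's examples):

```python
import math

# Standard linear: y = 100 + 2x ; one unit in x moves y by b1 = 2 units
def y_linear(x):
    return 100 + 2 * x

# Linear-log: y = 100 + 2*ln(x) ; a 1% change in x moves y by ~ b1/100 units
def y_linlog(x):
    return 100 + 2 * math.log(x)

unit_effect = y_linear(101) - y_linear(100)   # exactly b1 = 2
pct_effect = y_linlog(101) - y_linlog(100)    # ~ b1/100 = 0.02
print(unit_effect, round(pct_effect, 4))
```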
non-linear models: the logarithm
► Interpretation of coefficients
■ Log-linear model: ln(y) = b0 + b1·x. A one-unit change in x results in a (b1·100)-percent change in y:
Δy/y ≈ b1·Δx
Remark: here b1·x is the elasticity coefficient for changes in y versus changes in x.
Example: ln(y) = 1 + 0.01x. Say x increases from 1 to 2; then y changes from exp(1.01) ≈ 2.74 to exp(1.02) ≈ 2.77, about (2.77 - 2.74)/2.74 ≈ 0.01 or 1% (b1·100%).
■ Log-log model: ln(y) = b0 + b1·ln(x). A one-percent change in x results in a b1-percent change in y:
Δy/y ≈ b1·(Δx/x)
Remark: here b1 is the elasticity coefficient for changes in y versus changes in x.
Example: ln(y) = 1 + ln(x). Say x increases from 100 to 101 (a 1% increase in x); then y changes from exp(1 + ln(100)) ≈ 271.82 to exp(1 + ln(101)) ≈ 274.54, about (274.54 - 271.82)/271.82 ≈ 0.01 or 1% (b1%).
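The same kind of numeric check for the log-linear and log-log examples above:

```python
import math

# Log-linear: ln(y) = 1 + 0.01*x ; x goes from 1 to 2 (a one-unit change)
y1 = math.exp(1 + 0.01 * 1)
y2 = math.exp(1 + 0.01 * 2)
pct_loglinear = (y2 - y1) / y1   # ~ b1 = 0.01, i.e. about a 1% change in y

# Log-log: ln(y) = 1 + ln(x) ; x goes from 100 to 101 (a one-percent change)
y1 = math.exp(1 + math.log(100))
y2 = math.exp(1 + math.log(101))
pct_loglog = (y2 - y1) / y1      # = 0.01 exactly, since b1 = 1

print(round(pct_loglinear, 4), round(pct_loglog, 4))
```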
non-linear models: the logarithm
► Using a "log" specification
Taking the log of the left-hand side (the dependent variable) changes how we interpret the coefficients, thus:
use a linear model if you believe that changes in the x-variable have an "additive impact" on the y-variable
use ln(y) if changes in the x-variable have a "percentage impact" on the y-variable
In deciding to use a "log" specification, let the data speak: run the standard linear regression and rvfplot; if the residuals plot exhibits curvature, try one of the three "log" specifications and re-run the regression followed by rvfplot in order to check that the curvature in the residuals plot has disappeared.
► Predictions with a "log" specification
It is easy enough, given x values, to predict ln(y) for individual observations and the average value of ln(y); follow the same rules we have used thus far. It is likely, though, that we are more interested in predicting y or average y rather than ln(y) or average ln(y), even if the regression gives us the latter.
predicting y: we simply exponentiate the prediction for ln(y), and likewise for the prediction intervals
non-linear models: the logarithm
► Predictions with a "log" specification
predicting average y and confidence intervals for average y: when predicting average y, you need to correct for the bias that results from the fact that ln() is a concave function:
(1) run your regression with the logged dependent variable
(2) take the predicted value for the x's you are interested in
(3) exponentiate that predicted value, i.e. use the exp(·) function
(4) multiply the result of step (3) by a correction factor that equals
correction factor = exp((RMSE^2)/2)
where RMSE is the standard error of the regression.
► The STATA regression output reports the RMSE, but you can save yourself some typing because STATA "keeps" the RMSE as a "hidden" scalar named e(rmse).
► For example, if you want to predict the average value of y for an observation, where ln(y) is the dependent variable and predictedlny is the prediction you obtained from the estimated regression, then in STATA:
display exp(predictedlny)*exp((e(rmse)^2)/2)
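The four steps can be sketched in Python. Both numbers below are assumptions for illustration (a predicted ln(y) of 4.68 and an RMSE of 0.063), not output from any actual regression:

```python
import math

# Hypothetical inputs: predicted ln(y) and the regression RMSE
predicted_lny = 4.68
rmse = 0.063

naive = math.exp(predicted_lny)                # steps (1)-(3): exponentiate
corrected = naive * math.exp((rmse ** 2) / 2)  # step (4): Stata's exp((e(rmse)^2)/2)

# The correction factor exceeds 1, so the naive exponentiation is biased low
print(round(naive, 2), round(corrected, 2))
```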
non-linear models: the logarithm
quiz ► Use deposit.dta and find the predicted balance for one account after 12 months. What is a 95% interval for your prediction?
Answer: Since we deal with only one account we are in the context of a prediction for one observation and therefore need the prediction interval (use kpredint). First the regression:

. reg lnbalance time
------------------------------------------------------------------------------
   lnbalance |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        time |   .0041376   .0000548    75.55   0.000     .0040297    .0042455
       _cons |   4.630944   .0079285   584.09   0.000     4.615328     4.64656
------------------------------------------------------------------------------

Prediction for the balance of one account at t = 12:

. kpredint _b[_cons] + _b[time]*12
Estimate: 4.680595
Standard Error of Individual Prediction: .06292517
Individual Prediction Interval (95%): [4.5566591, 4.8045308]

The prediction for the balance is therefore balance(t = 12) = exp(4.680595) ≈ 107.834, and the prediction interval for the balance of one account has lower bound exp(4.5566591) ≈ 95.264 and upper bound exp(4.8045308) ≈ 122.062.
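Exponentiating the kpredint output can be replicated in Python (the numbers are copied from the output above):

```python
import math

# Values from the kpredint output for the individual prediction at t = 12
estimate = 4.680595
lo, hi = 4.5566591, 4.8045308

balance = math.exp(estimate)                # point prediction on the original scale
interval = (math.exp(lo), math.exp(hi))     # exponentiate both interval endpoints

print(round(balance, 2), round(interval[0], 2), round(interval[1], 2))
```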
non-linear models: the logarithm
quiz ► Use deposit.dta and find the predicted average balance for ten accounts after 12 months. What is a 95% interval for your prediction?
Answer: Since we deal with several accounts we are in the context of a prediction of an average and therefore need the confidence interval (use klincom). First the regression:

. reg lnbalance time
------------------------------------------------------------------------------
   lnbalance |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        time |   .0041376   .0000548    75.55   0.000     .0040297    .0042455
       _cons |   4.630944   .0079285   584.09   0.000     4.615328     4.64656
------------------------------------------------------------------------------

Prediction for the average balance at t = 12:

. klincom _b[_cons] + _b[time]*12
------------------------------------------------------------------------------
   lnbalance |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   4.680595   .0073661   635.42   0.000     4.666087    4.695103
------------------------------------------------------------------------------

The prediction for the average balance is therefore avg. balance(t = 12) = exp(4.680595)·exp(e(rmse)^2/2) ≈ 108.045, and the confidence interval for the average balance is obtained as: lower bound = exp(4.666087)·exp(e(rmse)^2/2) ≈ 106.479 and upper bound = exp(4.695103)·exp(e(rmse)^2/2) ≈ 109.612.
heteroskedasticity
key concept: heteroskedasticity
► We say that the regression model exhibits homoskedasticity when the distribution of each observation's error is identical: this implies that the magnitudes of the errors must be unrelated to anything and everything; we cannot make statements such as "the errors appear to be larger in magnitude when variable x is large".
► Conversely, heteroskedasticity means that the errors appear to be larger (or smaller) in magnitude when variable x is large.
► Heteroskedasticity affects the standard errors of the coefficients, not the coefficients themselves; thus the problems relate to significance and confidence intervals. Correcting for heteroskedasticity means correcting the standard errors of the coefficients.
► Heteroskedasticity is about variation in the magnitude of the errors, but we do not know the errors in the regression equation; we only have estimates of them: the residuals.
one way to detect heteroskedasticity is to plot the residuals against some factor that we suspect is "causing" the problem. But what factor would that be?
a common source of heteroskedasticity is when the magnitude of the residual is correlated with the predicted values of the dependent variable. We can look for this graphically using the rvfplot command in STATA.
heteroskedasticity
► Detecting heteroskedasticity (I). Let's use the yogurt.dta set to get an idea of how heteroskedasticity manifests itself and how we correct for it. First the regression (sales1 and price1 refer to sales and price for Dannon, while promo1, as a dummy, indicates whether there was an end-of-aisle display for Dannon in the store), immediately followed by the command rvfplot:

. regress sales1 price1 promo1
. rvfplot

► It is obvious in the diagram that the residuals vary in magnitude with the estimated (fitted) values.
► Thus we have an indication of heteroskedasticity... What factor, related to the independent variables (price or promotion), can cause this behavior?
► Possibly the variation in sales is different at different levels of prices. (Repeat the same steps as above, now including only price as the independent variable.)
Figure 12. The rvfplot diagram: the linear model
heteroskedasticity
► Correcting for heteroskedasticity. As stated previously, heteroskedasticity affects the magnitude of the standard errors. There are two ways to correct for it: (i) use a log specification; (ii) use the "White wash" method.
■ Log specification. We try the following: E[ln(sales1)] = β0 + β1·price1 + β2·promo1, with the regression results below (first generate lnsales1 = ln(sales1)):

. regress lnsales1 price1 promo1
------------------------------------------------------------------------------
    lnsales1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      price1 |  -18.85207   3.095316    -6.09   0.000    -24.93989   -12.76426
      promo1 |   .6699763    .119982     5.58   0.000     .4339975     .905955
       _cons |   9.535195   .2552827    37.35   0.000     9.033109    10.03728
------------------------------------------------------------------------------

Remark: in case you keep this regression, make sure all the inference about estimates/prediction is done according to the discussion in the previous slides.
Figure 13. The regression results
heteroskedasticity
■ Log specification. For the log specification E[ln(sales1)] = β0 + β1·price1 + β2·promo1 the rvfplot is shown on the right.
► The residuals plot is somewhat better, but not a clear-cut indication that the magnitude of the residuals does not vary with the predicted (fitted) values. It seems that the heteroskedasticity persists.
Remark: had this diagram shown a fairly similar range of values for the residuals across the fitted values, you would keep the log specification and use it for inference.
Figure 13. The rvfplot diagram: the logarithmic model
heteroskedasticity
► Detecting heteroskedasticity (II). Econometricians Breusch and Pagan proposed a more robust alternative: regress the squares of the residuals against all of the predictor variables:
(null) the model is homoskedastic
(alternative) the model is heteroskedastic
► If the χ²-test (chi-squared) for that regression is significant, i.e. the p-value is smaller than the chosen significance level so that we reject the null, then the model is heteroskedastic.
► Implementation: STATA makes all of this easy; simply type hettest after the regression and you will receive the results of the χ²-test (here for the standard linear regression of sales1 on price1 and promo1):

. hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of sales1
chi2(1)      =   132.13
Prob > chi2  =   0.0000

► Since the p-value (0.0000) is below any common significance level, we reject the null, i.e. we reject the homoskedasticity hypothesis.
Figure 14. The hettest results: the linear model
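The mechanics behind hettest can be sketched by hand: regress the squared residuals on the fitted values and use n·R² of that auxiliary regression as a chi-squared statistic. A minimal one-regressor sketch in Python on simulated heteroskedastic data (everything here, including the data, is an illustration, not the yogurt data):

```python
import math
import random

def ols(x, y):
    # Simple one-regressor OLS: returns (intercept, slope, fitted, residuals)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    b0 = my - b1 * mx
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return b0, b1, fitted, resid

def breusch_pagan(x, y):
    # Auxiliary regression of squared residuals on the fitted values;
    # LM statistic = n * R^2 of that regression, chi-squared with 1 df here
    n = len(x)
    _, _, fitted, resid = ols(x, y)
    e2 = [e * e for e in resid]
    mean_e2 = sum(e2) / n
    _, _, _, res2 = ols(fitted, e2)
    ss_tot = sum((v - mean_e2) ** 2 for v in e2)
    ss_res = sum(r * r for r in res2)
    lm = n * (1 - ss_res / ss_tot)
    p = 1 - math.erf(math.sqrt(lm / 2))   # chi2(1) upper-tail probability
    return lm, p

# Simulated data with error variance growing in x (clearly heteroskedastic)
random.seed(1)
x = [i / 10 for i in range(1, 201)]
y = [2 + 3 * xi + random.gauss(0, xi) for xi in x]
lm, p = breusch_pagan(x, y)
print(p < 0.05)   # expect rejection of constant variance
```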
heteroskedasticity
► Detecting heteroskedasticity (II). As a quick check, let's run the test for the log specification. First re-run the log regression of lnsales1 on price1 and promo1, then the hettest command:

. hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lnsales1
chi2(1)      =     1.07
Prob > chi2  =   0.3017

► Here the p-value (0.3017) is above any common significance level, so we fail to reject the null: the log specification shows no evidence of heteroskedasticity.
Remarks:
Logging the dependent variable will often eliminate the problem. But logging the dependent variable may cause an unwelcome interpretation or create curvature issues, and curvature issues are more important than heteroskedasticity.
Sometimes taking logs is inappropriate or is not enough; in such cases we need to "White wash" the standard errors, or, more formally, use robust standard errors.
Figure 15. The hettest results: the logarithmic model
heteroskedasticity
■ White wash method. The solution is to re-run the preferred (standard linear) regression and correct the standard errors. A technique developed by econometrician Hal White provides the needed correction.
► In STATA, add the option robust to the end of the regression: the estimated coefficients do not change, but the standard errors are estimated in a way that is robust to heteroskedasticity. Compare the effect of adding robust after the regression:

. regress sales1 price1 promo1
------------------------------------------------------------------------------
      sales1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      price1 |  -72224.06   10726.01    -6.73   0.000    -93319.81   -51128.31
      promo1 |   4186.332   415.7663    10.07   0.000     3368.609    5004.055
       _cons |     9206.1   884.6156    10.41   0.000     7466.252    10945.95
------------------------------------------------------------------------------

. regress sales1 price1 promo1, robust
------------------------------------------------------------------------------
             |               Robust
      sales1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      price1 |  -72224.06   16346.28    -4.42   0.000    -104373.7   -40074.44
      promo1 |   4186.332    783.199     5.35   0.000     2645.948    5726.715
       _cons |     9206.1    1372.16     6.71   0.000     6507.357    11904.84
------------------------------------------------------------------------------

Figure 16. The regression results: without correction of the standard errors
Figure 17. The regression results: with correction of the standard errors
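For intuition, the White (HC0) correction for a one-regressor model can be written out by hand: the conventional variance of the slope is s²/Σ(xᵢ-x̄)², while the robust variance replaces the single s² with each observation's own squared residual, Σ(xᵢ-x̄)²eᵢ²/(Σ(xᵢ-x̄)²)². A sketch on simulated data (an illustration of the formula, not Stata's exact small-sample variant):

```python
import math
import random

# Simulated one-regressor data with error spread growing in x
random.seed(2)
x = [i / 10 for i in range(1, 201)]
y = [5 + 2 * xi + random.gauss(0, xi) for xi in x]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
b0 = my - b1 * mx
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Conventional OLS standard error: one pooled error variance s^2
s2 = sum(e * e for e in resid) / (n - 2)
se_conventional = math.sqrt(s2 / sxx)

# White HC0 standard error: each observation keeps its own squared residual
se_robust = math.sqrt(sum(((xi - mx) ** 2) * e * e for xi, e in zip(x, resid)) / sxx ** 2)

# The coefficient is the same either way; only the standard error changes,
# and with variance growing in x the robust SE is typically larger
print(round(b1, 3), round(se_conventional, 4), round(se_robust, 4))
```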
clustering
► Regression assumes independent observations, i.e. each observation is a new, independent experiment. However, more often than not data is grouped:
multiple years of data from the same firm
several members of the same family
several cities in the same region
► Such clustering may mean that each observation within the group is not a new independent experiment. As a result, we may be overstating the degrees of freedom and claiming greater precision for the estimates than is appropriate.
► There is no simple test for clustering, but we will often have an intuitive grasp of whether we have clustered data; e.g. for panel data where the unit of observation is the firm/year, each firm is a "group".
► Respond to clustering by estimating the standard errors in a way that is robust to violations of independence within groups (and also to overall heteroskedasticity) by adding the following option to the end of the regression:
cluster(groupname)
where groupname is the variable that identifies the groups.
Remark: when using cluster-robust standard errors the effective sample size is reduced to the number of groups. Thus we typically need a large number of groups, say at least 20, if we are going to cluster.
clustering
► Let's use ConvenienceStore.dta to get a grasp of clustering. The data set consists of 100 observations on the following variables:
StoreId - identifies the store at which the observation was recorded
Sales - sales
FuelPrice - price of fuel
Radio - number of times the store ran advertisements on radio
► The scatter diagram shows the relation between Sales and FuelPrice for all stores; highlighted are the observations for StoreId = 2 (yellow) and StoreId = 8 (red). Notice how the observations for these stores are clustered...
► Thus we might suspect that clustering occurs according to StoreId.
Figure 18. The scatter diagram: observations for StoreId = 2 and StoreId = 8 highlighted
clustering
► We run the usual standard linear regression using the option cluster(StoreId). For comparison, the regression without taking into account the potential clustering problem is given first:

. regress Sales FuelPrice
------------------------------------------------------------------------------
       Sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   FuelPrice |   1955.035   466.0864     4.19   0.000     1030.102    2879.968
       _cons |  -180366.7   53973.12    -3.34   0.001    -287474.6    -73258.8
------------------------------------------------------------------------------

. regress Sales FuelPrice, cluster(StoreId)
(Std. Err. adjusted for 20 clusters in StoreId)
------------------------------------------------------------------------------
             |               Robust
       Sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   FuelPrice |   1955.035   660.5955     2.96   0.008     572.3926    3337.677
       _cons |  -180366.7   72893.62    -2.47   0.023    -332934.8   -27798.61
------------------------------------------------------------------------------

Remark: the coefficients do not change, only the standard errors are adjusted; of course the confidence intervals change too.
Figure 19. The regression results: without correction for clustering
Figure 20. The regression results: with correction for clustering
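The cluster-robust idea for a one-regressor model can also be sketched by hand: sum the scores (xᵢ-x̄)·eᵢ within each group first, then square and sum across groups, so that within-group correlation of the errors is allowed. A Python illustration on simulated store data (the data-generating process, 20 stores with a shared store-level shock, is entirely made up):

```python
import math
import random
from collections import defaultdict

# Simulated data: 20 stores, 5 observations each, with a common store-level
# shock so that errors are correlated within a store
random.seed(3)
stores, x, y = [], [], []
for g in range(20):
    shock = random.gauss(0, 3)               # shared by the whole store
    for _ in range(5):
        xi = random.uniform(2, 4)
        stores.append(g)
        x.append(xi)
        y.append(10 + 2 * xi + shock + random.gauss(0, 1))

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
b0 = my - b1 * mx
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Conventional SE treats all 100 observations as independent
se_conventional = math.sqrt((sum(e * e for e in resid) / (n - 2)) / sxx)

# Cluster-robust SE: sum the scores within each store, then square and sum
scores = defaultdict(float)
for g, xi, e in zip(stores, x, resid):
    scores[g] += (xi - mx) * e
se_cluster = math.sqrt(sum(s * s for s in scores.values()) / sxx ** 2)

print(round(b1, 3), round(se_conventional, 3), round(se_cluster, 3))
```

As on the slide, only the standard error changes; the slope estimate b1 is identical under both treatments.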