
1 session 6: omitted variables, spurious regression and multicollinearity
► omitted variables
► spurious regression
► multicollinearity
Developed for the Managerial Economics & Decision Sciences Department © Kellogg School of Management

2 learning objectives
► STATA
  • the vif command
  • running the F-test
► omitted variables
  • define the omitted variable bias and construct the “influence diagram”
  • data mining and coping with omitted variable(s) bias
► spurious regression
  • define and interpret the spurious regression
► multicollinearity
  • define multicollinearity
  • detect multicollinearity and correct for multicollinearity
readings
► (MSN) Chapter 6 and Chapter 7
► (KTN) Omitted Variables
► (CS) Refrigerator Pricing, Dubuque’s Hot Dogs

3 energy cost and refrigerator pricing
► We are provided with information on 41 popular models of refrigerators; the data are in the file newfridge.dta. The relevant variables for this case are: Price, the refrigerator price in $; energycost, the annual energy cost of running the refrigerator in $/year; and volume_cuinches, the volume in cubic inches. The results of the regression of Price on energycost are shown below:

      Source |       SS         df       MS             Number of obs =     41
-------------+-------------------------------           F(  1,    39) =   7.97
       Model |  1228208.25       1  1228208.25          Prob > F      = 0.0075
    Residual |   6011613.7      39  154143.941          R-squared     = 0.1696
-------------+-------------------------------           Adj R-squared = 0.1484
       Total |  7239821.95      40  180995.549          Root MSE      = 392.61

------------------------------------------------------------------------------
       Price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  energycost |   17.14957   6.075478     2.82   0.007     4.860756    29.43838
       _cons |   300.1567    290.463     1.03   0.308    -287.3601    887.6735
------------------------------------------------------------------------------

► The results indicate that a decrease of $20 in energycost implies a decrease in the price (at which the refrigerator could be sold) of 17.14957 · (–$20) = –$342.99. This is counter-intuitive: for a refrigerator that consumes less energy, one would expect the Price to increase (think of Prius, Tesla, etc.).
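Output like the above would come from commands along these lines (a sketch; the dataset and variable names are as given above, and the file is assumed to sit in the working directory):

. use newfridge.dta, clear
. regress Price energycost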

4 energy cost and refrigerator pricing
► A simple inspection of the variables available in the study reveals other features that are probably relevant in the pricing decision, such as volume (volume_cuinches). Let’s include this variable in the regression (as an independent variable, with the assumption that it affects the selling price):

      Source |       SS         df       MS             Number of obs =     41
-------------+-------------------------------           F(  2,    38) =   5.60
       Model |  1648156.13       2  824078.063          Prob > F      = 0.0074
    Residual |  5591665.82      38  147149.101          R-squared     = 0.2277
-------------+-------------------------------           Adj R-squared = 0.1870
       Total |  7239821.95      40  180995.549          Root MSE      =  383.6

---------------------------------------------------------------------------------
          Price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
     energycost |  -2.427965   13.02064    -0.19   0.853    -28.78688    23.93095
volume_cuinches |   .0217692   .0128861     1.69   0.099    -.0043175    .0478558
          _cons |  -342.8964   474.8011    -0.72   0.475    -1304.081    618.2881
---------------------------------------------------------------------------------

► The results indicate that a decrease of $20 in energycost implies an increase in the price (at which the refrigerator could be sold) of (–2.427965) · (–$20) = $48.56, holding the volume constant. In other words, for the same features (such as volume), a refrigerator that costs less to run should sell at a higher price. It seems that we are on the right track.
► What is the explanation?
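In Stata the extended model is simply the previous command with the extra regressor (a sketch, same assumptions as before):

. regress Price energycost volume_cuinches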

5 omitted variables
► We’ll consider for the moment a hypothetical situation characterized by:
  - independent variables x and z
  - dependent variable y
► The true relations between these variables are:
  causal:       y = b0 + b1·x + b2·z
  correlation:  z = a0 + a1·x
► Ideally, when studying the interaction between x, z and y we would use the above true relations. More often than not, however, we end up using a truncated relation (whether intentionally, because of data availability constraints, or for lack of an in-depth analysis of causal/correlation effects).
► The truncated relation between x and y is:
  truncated:    y = b0* + b1*·x
► The truncated relation basically omits the variable z from the study; the effect, if any, on the estimated coefficient b1* is called the omitted variable bias.
► Objectives:
  i. understand the true effect on y of a change in x given the true relations
  ii. understand the perceived effect on y of a change in x given the truncated relation

6 omitted variables
► Let’s start with the true and truncated relations and consider a change in x (Δx).
► There are three channels “at work” in propagating the change in x to a change in y:
  • direct channel: a change in x results in a change in y through the causal relation
    direct causal effect: a one unit change in x results in b1 units change in y
  • correlation channel: a change in x is associated with a change in z through the correlation relation
    correlation effect: a one unit change in x results in a1 units change in z
  • indirect channel: a change in z results in a change in y through the causal relation
    indirect causal effect: a one unit change in z results in b2 units change in y
► Putting the channels together (the “influence diagram”):
  direct channel:       Δx contributes b1·Δx to y
  correlation channel:  Δz = a1·Δx
  indirect channel:     b2·Δz = b2·a1·Δx
  perceived (if we run y on x only): Δy = (b1 + b2·a1)·Δx
► Notice how the total effect on y is the result of two separate channels: direct and correlation-indirect.

7 omitted variables
► Continue with the true and truncated relations for a change in x, and recall the truncated relation y = b0* + b1*·x.
► In this relation we omitted the variable z, again either intentionally or because of data availability constraints or lack of analysis, and we try to estimate the coefficient b1*. By ignoring the indirect channel, this coefficient will actually “pick up” the whole perceived effect determined above:
  b1* = b1 + b2·a1
  where b1 is the true effect of x on y (direct channel) and b2·a1 is the omitted variable bias (correlation-indirect channel).

8 omitted variables
► Over-estimation of b1. The true value of b1 is 4 (true causal relation) but the perceived effect (b1*) is 14:
  direct channel:       4·Δx
  correlation channel:  Δz = 5·Δx
  indirect channel:     2·Δz = 2·5·Δx = 10·Δx
  perceived (if we run y on x only): Δy = 14·Δx
► Under-estimation of b1. The true value of b1 is 4 (true causal relation) but the perceived effect (b1*) is –6:
  direct channel:       4·Δx
  correlation channel:  Δz = –5·Δx
  indirect channel:     2·Δz = 2·(–5)·Δx = –10·Δx
  perceived (if we run y on x only): Δy = –6·Δx

9 omitted variables
► The omitted variable bias (ovb) is defined as the difference between the coefficient based on the truncated equation and the coefficient obtained from the true/causal relation. The ovb is easily derived with a bit of algebra using the three equations:
  ovb1 = b1* – b1 = b2·a1
► Notice that ovb1 depends on the combination of both magnitudes and signs of the relation between z and y (through b2) and between z and x (through a1).
► The table below summarizes the effect of a one unit increase in x on the change in y as captured by the truncated relation:

            b2 > 0            b2 < 0
  a1 > 0    overestimation    underestimation
  a1 < 0    underestimation   overestimation
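For completeness, a short derivation in the notation above (a sketch consistent with the relations defined on slide 5): substitute the correlation relation into the causal relation,
\[
y = b_0 + b_1 x + b_2 z
  = b_0 + b_1 x + b_2 (a_0 + a_1 x)
  = \underbrace{(b_0 + b_2 a_0)}_{b_0^{*}} + \underbrace{(b_1 + b_2 a_1)}_{b_1^{*}}\, x ,
\]
\[
\mathrm{ovb}_1 = b_1^{*} - b_1 = b_2\, a_1 .
\]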

10 omitted variables
► To summarize:
  omitted variable bias for the coefficient:  ovb1 = b1* – b1 = b2·a1
  omitted variable bias for the constant:     ovb0 = b0* – b0 = b2·a0
Remarks
  • if there is no correlation effect between x and z, i.e. a1 = 0, then ovb1 = 0 and therefore using the truncated equation gives the correct effect of x on y
  • to continue with the previous situation: even with no correlation effect between x and z, using the truncated equation gives a biased estimate of the constant as long as b2 ≠ 0
  • the no-correlation-effect case thus implies that the truncated equation will provide the correct slope but a biased constant, so inference:
    - on changes in y is still correct when based on the truncated equation
    - on the level of y is biased when based on the truncated equation

11 energy cost and refrigerator pricing
► Back to the two regressions we ran; let’s formalize the two models so far:
  truncated:  Price = 300.16 + 17.14957·energycost
  true:       Price = –342.90 – 2.427965·energycost + 0.0217692·volume_cuinches
► Let’s look at the relation between volume (volume_cuinches) and energy cost (energycost); in particular, let’s run the regression of volume_cuinches on energycost:

------------------------------------------------------------------------------
volume_cui~s |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  energycost |   899.3238   73.76336    12.19   0.000     750.1233    1048.524
       _cons |   29539.62   3526.558     8.38   0.000     22406.49    36672.76
------------------------------------------------------------------------------

► We can write, not necessarily in the sense of a causal relation but correlation-wise:
  correlation:  volume_cuinches = 29539.62 + 899.3238·energycost
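The auxiliary regression above would be produced with a command along these lines (a sketch; it regresses the previously omitted variable on the included one):

. regress volume_cuinches energycost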

12 energy cost and refrigerator pricing
► Just rearrange the equations as the direct, indirect and correlation channels of the influence diagram, with x = energycost, z = volume_cuinches and y = Price.
► Notice that b1 + b2·a1 = –2.428 + (0.022 · 899.324) ≈ 17.36, close to the truncated coefficient 17.14957.
Remark. The gap comes from rounding: with the unrounded estimates, –2.427965 + 0.0217692 · 899.3238 ≈ 17.1496, which matches the truncated coefficient almost exactly; for OLS estimates computed on the same sample, the relation b1* = b1 + b2·a1 is in fact an algebraic identity. In any case, what matters for assessing the bias is its direction, which depends on the signs of b2 and a1, not on their magnitudes.
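A quick check with Stata’s calculator (a sketch; the numbers are the unrounded coefficients reported earlier, and the result should be approximately 17.15):

. display -2.427965 + 0.0217692*899.3238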

13 quiz: ads and profitability
► A recent study by an industry trade journal found that auto dealerships that spend heavily on marketing earn lower profits than dealerships that do not spend as heavily. This finding controls for market size and type of car being sold, but does not account for the extent of competition faced by each dealership.
  • What do you believe is the direction of the correlation between marketing expenditures and the extent of competition?
  • What do you believe is the direction of the relation between the extent of competition and profits?
  • What bias, if any, does failing to account for competition impart to the study by the trade journal?

14 quiz: ads and profitability
Answer: The direction (underestimation or overestimation) of the omitted variable bias is determined by the product of
  b2: the direction of the effect of the omitted variable on the dependent variable (holding any other x’s in the model fixed), and
  a1: the direction of the relation between the omitted and the included x-variable (holding any other x’s in the model fixed).
► Holding marketing spending, size of market and type of car sold fixed, we expect that dealers facing more competition are likely to experience lower profits, i.e. b2 < 0.
► We also might expect that (holding size of market and type of car sold fixed) dealers facing more competition are likely to spend more on marketing, i.e. a1 > 0.
► The direction of the bias is given by the sign of b2·a1; since b2 < 0 and a1 > 0, sign(ovb1) is negative, and so the study reports a coefficient on marketing that is more negative than the true effect: marketing looks more harmful to profits than it truly is.

15 quiz: fast food wages
► A major fast food chain conducted a study of worker productivity. Using regression, it found that productivity (measured along several dimensions that are relevant to the production and service of fast food) is higher for workers who earn higher wages, even after controlling for worker experience. However, wages appear to have no effect on productivity in a regression that also controls for the median income of the local community where the fast food chain is located. In this regression, the estimated coefficient on median income is positive. What must be the sign of the correlation between wages and median income (holding worker experience fixed) in the study?

16 quiz: fast food wages
Answer: In this example we know the direction of the bias from the information we are given about the two regressions:
  i. the coefficient on wages is positive in the regression from which local income is excluded (b1* > 0) and essentially zero when it is included (b1 = 0); thus ovb1 = b1* – b1 > 0
  ii. the estimated effect of local income on productivity, holding worker experience and wages fixed, is positive; thus b2 > 0
► Since ovb1 = b2·a1 > 0 and b2 > 0, we infer that a1 > 0.
► Thus local income and wages must be positively correlated.

17 omitted variables
► Key points: The omitted variable bias can “corrupt” your models in different ways; in particular, it can make a variable seem to have a stronger or weaker effect than it really does
  ■ Variables may appear to be significant when in fact they have no effect except through their relationship with omitted variables
  ■ Variables may appear to be insignificant because the bias offsets their actual effect
  ■ Even when significance is not affected, the estimated size of the effect may be larger or smaller
► Regression effects: When you run a regression and are interested in isolating the effect of a particular x variable, think hard about whether you are leaving out any variables that are related to both that x variable and to y:
  ■ If you are omitting such a variable, then the coefficient on x may be biased
► Bias effects: When you suspect that you have an omitted variable bias, perhaps because you don’t have the data available, you can try to assess the direction of the bias:
  ■ Use the formula for the omitted variable bias and the table from a previous slide, together with information you may have about the likely signs (positive or negative) of b2 and a1, to determine whether the coefficient reported by the regression is likely more positive or more negative than the actual effect you are interested in (e.g. what is the direction of the bias if we omit refrigerator features?)

18 omitted variables
► Living with ovb: It is impossible to get data on all the factors that might affect the dependent variable. This exposes all regressions to potential omitted variable bias, which is why we should always think about possible biases in our regressions. Fortunately, the omitted variable bias can be managed:
  ■ Omitted variables bias the coefficients only if they are (a) related to the dependent variable, and (b) related to included independent variables
  ■ If the magnitude of (a) and (b) above is small, the bias will be small
  ■ Even if omitted variable bias exists, it may be possible to determine the direction of the bias. This allows us to state that the reported coefficients are either upper or lower bounds on the actual effects
  ■ Thinking about omitted variable bias forces us to carefully identify the correct model before we run any regressions and to do a better job of variable selection in the first place
  ■ As we will discuss in a later class, fixed effects models can sometimes further mitigate or eliminate OVB

19 spurious regression
► We have a spurious regression when we find a statistically significant relationship between two truly unrelated variables, i.e. we reject H0: βk = 0 although there is no causal relation between xk and y. In reality, variable xk does not belong in the regression.
► A typical spurious regression is obtained when both the dependent and independent variables in a regression are in fact determined by a third, genuine independent variable:
  true relations:     y = a0 + a1·z   and   x = b0 + b1·z
  spurious relation:  y = β0 + β1·x
► It’s not difficult to find the connection between the coefficients in the spurious relation between y and x and the coefficients in the true relations: retrieve z as a function of x and plug it into the first true relation (see the derivation below). Notice again the “propagation” of the sign (positive/negative) from the real relations to the spurious relation.
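A sketch of that substitution (assuming b1 ≠ 0):
\[
z = \frac{x - b_0}{b_1}, \qquad
y = a_0 + a_1\,\frac{x - b_0}{b_1}
  = \underbrace{\left(a_0 - \frac{a_1 b_0}{b_1}\right)}_{\beta_0}
  + \underbrace{\frac{a_1}{b_1}}_{\beta_1}\, x ,
\]
so the sign of β1 is the sign of a1·b1: the spurious slope inherits its sign from the two true relations.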

20 spurious regression
► Key points: We may incorrectly include xk in the model because it appears statistically significant.
  ■ As a result, we draw false conclusions about variable xk
  ■ To make matters worse, by including the “junk” variable xk in the model, we may bias the coefficients on all other included variables
► Subjectivity and hindsight: Spurious correlation plays to our psychology:
  ■ We are hard-wired to try to explain empirical phenomena, and “smart” people can rationalize almost any finding
  ■ Thus, after the fact, we can claim that almost any correlation is consistent with theory
  ■ Such correlations may have been accidents – artifacts of sampling, noise, and the definition of statistical significance (which tells us the probability of observing our result due to random chance); as a result, we reach silly conclusions that cannot be replicated
► Data mining: In explaining y you try a very large number of potential independent variables, keep the best ones, i.e. the ones that create the best fit, and then try to rationalize in hindsight.

21 spurious regression
► Potential solutions: Include variables on the RHS only when there is a plausible reason to do so, and do your theorizing ex ante, not ex post; we are all expert “rationalizers”!
  ■ Use a smaller significance level when you try many variables or estimate many regressions
  ■ Question the process used to generate results
► Data mining to your advantage: While it is much better to approach your data with hypotheses based on a good understanding of what you are studying, you may sometimes approach a problem without such an understanding
  ■ Data mining can identify patterns that may lead to hypotheses
    • Do not confuse the data mining step with a test of your hypotheses – gather more data and conduct new tests; this is known as “out of sample” testing
    • Spurious correlation tells us that this process will be costly
  ■ Most hypotheses coming out of data mining will not be confirmed

22 multicollinearity
► Multicollinearity occurs when two or more explanatory (x) variables are highly correlated
  • This is not the same as an interaction/slope dummy, which is a new variable created to ask a specific question about whether the effect of one variable depends on the level of another
  • You can have multicollinearity without any interaction, and vice versa
  • When variables are multicollinear, their standard errors are inflated, which makes it more difficult to draw inferences about the impact of each one separately (significance issues)
► If two variables are highly correlated, then they tend to move in lockstep
  • As a result, they lack independent action; in effect, they represent the “same experiment”
  • The computer (regression) may be able to determine that “the experiment” actually had an effect on y, but it may not be able to determine which of the two variables is responsible
  • Thus, each variable used individually may be significant, but when entered jointly, multicollinearity may lead to neither being significant

23 dubuque hot dog
► Dubuque Hot Dogs offers branded hot dogs and has as its main competitors Oscar Mayer and Ball Park. There are two kinds of Ball Park hot dogs: regular and all-beef.
► Data are available in hotdog.dta:
  • MKTDUB gives Dubuque’s weekly market share as a decimal, i.e. 0.04 means a 4% share
  • the remaining variables give the average prices (in cents) during each week for Dubuque (own price, pdub), Oscar Mayer (poscar) and the two Ball Park hot dogs (pbpreg, pbpbeef)
► We try to explain Dubuque’s market share using the price variables in a regression (results on the next slide; a sketch of the command follows below).
quiz: Do you expect any variables to exhibit any multicollinearity?
Answer: If any variables are suspected of multicollinearity, then the prices of the two Ball Park hot dogs are the likely candidates, since they are probably correlated with each other (similar production, distribution costs, etc.). The fact that the two prices have a common underlying generating process is a good example of the “same experiment”.
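A minimal Stata sketch of that regression (variable names as listed in the output on the next slide; the file path is assumed):

. use hotdog.dta, clear
. regress MKTDUB pdub poscar pbpreg pbpbeef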

24 dubuque hot dog
► Regression results:

      Source |       SS         df       MS             Number of obs =    113
-------------+-------------------------------           F(  4,   108) =  30.00
       Model |  .012013954       4  .003003488          Prob > F      = 0.0000
    Residual |  .010811783     108  .000100109          R-squared     = 0.5263
-------------+-------------------------------           Adj R-squared = 0.5088
       Total |  .022825737     112  .000203801          Root MSE      = .01001

------------------------------------------------------------------------------
      MKTDUB |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        pdub |  -.0007598   .0000809    -9.39   0.000    -.0009202   -.0005994
      poscar |   .0002622   .0000843     3.11   0.002     .0000952    .0004293
      pbpreg |   .0003473   .0003316     1.05   0.297      -.00031    .0010046
     pbpbeef |   .0001025   .0002938     0.35   0.728    -.0004798    .0006848
       _cons |   .0403026   .0141226     2.85   0.005     .0123092     .068296
------------------------------------------------------------------------------

► The coefficients for the two Ball Park price variables are insignificant. We mentioned that the main effect of multicollinearity is inflated standard errors for the coefficients. The standard error is the denominator of the t-statistic (for significance), so in the presence of multicollinearity the t-statistic is small and the corresponding p-value is large. Thus we will tend to see variables as insignificant (due to multicollinearity) when in fact those variables might have explanatory power.

25 dubuque hot dog
► Let’s check first the relation between pbpreg and pbpbeef (just run a simple regression; no causality implied, just a “feel” for the possible correlation):

. reg pbpreg pbpbeef

      Source |       SS         df       MS             Number of obs =    113
-------------+-------------------------------           F(  1,   111) = 2609.09
       Model |  22676.3294       1  22676.3294          Prob > F      = 0.0000
    Residual |  964.732576     111  8.69128447          R-squared     = 0.9592
-------------+-------------------------------           Adj R-squared = 0.9588
       Total |  23641.0619     112   211.08091          Root MSE      = 2.9481

------------------------------------------------------------------------------
      pbpreg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     pbpbeef |   .8816371   .0172602    51.08   0.000     .8474349    .9158393
       _cons |   12.34627   3.064725     4.03   0.000     6.273311    18.41922
------------------------------------------------------------------------------

► The high R² (given that there is only one independent variable) is a sign of strong correlation between the two variables. A simple scatter diagram would reveal an almost perfect “alignment” of the two variables.
► To emphasize again the concept of action: since the two variables are highly correlated, we cannot see a lot of “independent movement” between them.
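An alternative check that does not imply any direction is the pairwise correlation (a sketch); for a simple regression R² is the squared correlation, so the correlation here should be about √0.9592 ≈ 0.98:

. correlate pbpreg pbpbeef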

26 dubuque hot dog
► Back to the initial regression (to be able to run the following commands, make sure the initial regression is the last regression run).
► We saw that pbpreg and pbpbeef are likely to be correlated, and that might induce inflated standard errors for the coefficients. But are the standard errors really inflated?
► Detection of inflated standard errors. The command vif delivers a list of “variance inflation factors” for the coefficients:

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
      pbpreg |     25.97    0.038508
     pbpbeef |     25.15    0.039765
      poscar |      1.66    0.603208
        pdub |      1.36    0.733979
-------------+----------------------
    Mean VIF |     13.53

► For a given coefficient, if the VIF value is greater than 10 then we have evidence that the standard error for that coefficient is inflated, and therefore it is likely that the p-value is larger than it should be, i.e. it will tend to indicate that the coefficient is insignificant when in fact it might be significant.
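In more recent Stata releases the same table is obtained with the postestimation command estat vif, run right after the regression of interest; a sketch:

. quietly regress MKTDUB pdub poscar pbpreg pbpbeef
. estat vif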

27 dubuque hot dog
► Detection of multicollinearity. Detecting inflated standard errors does not automatically amount to detecting multicollinearity. To identify multicollinearity we use the F-test.
► The F-test tells us whether one or more variables add predictive power to a regression:
  ■ null hypothesis: all of the coefficients (β’s) on the variables you are testing equal 0
  ■ alternative hypothesis: at least one of the coefficients (β’s) is different from 0
In plain language: you are basically testing whether these variables are no more related to y than junk variables.
Remark. The F-test for a single variable returns the same significance level as the t-test.
► The F-test for a group of variables can be executed in STATA using the test or testparm command, listing the variables you wish to test, after running a regression:
  Perform the F-test:   testparm x1 x2 … xk   (the group of variables you are testing)

28 dubuque hot dog
► Detection of multicollinearity. The F-test applied to pbpreg and pbpbeef provides the following result:

. testparm pbpreg pbpbeef

 ( 1)  pbpreg = 0             ← the (joint) null hypothesis
 ( 2)  pbpbeef = 0

       F(  2,   108) =   17.21
            Prob > F =   0.0000   ← the p-value

► The decision criterion is similar to the one we introduced in the context of the t-test. First choose a significance level α, then compare the p-value with α:
  ■ reject the null hypothesis if p-value < α
  ■ cannot reject the null hypothesis if p-value ≥ α
► What exactly does it mean to reject the null in the context of the F-test? It means that at least one of the coefficients is significantly different from zero.
► What if you fail to reject the null in the context of the F-test? Then the group of tested variables is no more predictive than junk… so treat it like junk and drop the group.

29 multicollinearity
► If the F-test does reject the null, then one or more variables in the group has a nonzero coefficient, but:
  • this does not imply that every variable in the group has a nonzero coefficient
  • you should still generally keep all of the group
► Extra thought is needed when working with categorical variables, such as seasonality. Don’t be surprised if, despite the significant F-test, none of the categorical variables are individually significant:
  • remember that significance is relative to the omitted category
  • thus, you may have significant pairwise comparisons between included categories
► If you do have multicollinearity, then:
  • if your goal is to predict y, or if the multicollinear variables are not the key predictors you are interested in, then multicollinearity is not a problem
  • if the variables are key predictors and your multicollinearity is problematic, then perform an F-test on the group of variables. If the test is significant, then conclude that these variables collectively matter, even though you may be unable to sort out individual effects
  • do not conclude that one variable, but not the other, is what matters; if you knew which one(s) mattered and how, you would not have a problem with multicollinearity in the first place

30 dubuque hot dog - extensions
► We saw that the F-test detected multicollinearity, and we now know that the coefficients for pbpreg and pbpbeef come out insignificant because of inflated standard errors. Can we simply remove one of the two? Which one?

Keep only pbpreg and drop pbpbeef; now pbpreg becomes significant:
------------------------------------------------------------------------------
      MKTDUB |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        pdub |  -.0007642   .0000796    -9.60   0.000    -.0009219   -.0006065
      poscar |   .0002633   .0000839     3.14   0.002     .0000971    .0004296
      pbpreg |   .0004597   .0000782     5.88   0.000     .0003047    .0006146
       _cons |   .0400699   .0140499     2.85   0.005     .0122235    .0679162
------------------------------------------------------------------------------

Keep only pbpbeef and drop pbpreg; now pbpbeef becomes significant:
------------------------------------------------------------------------------
      MKTDUB |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        pdub |  -.0007442   .0000796    -9.35   0.000    -.0009019   -.0005865
      poscar |   .0002686   .0000841     3.19   0.002     .0001019    .0004352
     pbpbeef |   .0004014   .0000696     5.77   0.000     .0002635    .0005392
       _cons |   .0421994   .0140121     3.01   0.003     .0144278     .069971
------------------------------------------------------------------------------

► In both cases the remaining variable (one of the two) becomes significant. This should not be a huge surprise: the F-test indicated that, although individually insignificant, the two variables are jointly significant, i.e. at least one of them matters.
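The two reduced regressions above would come from dropping one Ball Park price at a time (a sketch, same dataset as before):

. regress MKTDUB pdub poscar pbpreg
. regress MKTDUB pdub poscar pbpbeef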

31 dubuque hot dog - extensions
► Another interesting fact is that in both regressions in which we use only one of the two variables, the coefficient of the included variable is very close to the sum of the coefficients of the two variables when both were included in the regression:
  initial regression (both):  b_pbpreg = 0.0003473 and b_pbpbeef = 0.0001025 (sum 0.0004498)
  regression with pbpreg:     b_pbpreg = 0.0004597
  regression with pbpbeef:    b_pbpbeef = 0.0004014
► In fact, in either of the two regressions in which only one variable is included, the coefficient of the included variable “picks up” the cumulative effect.
► Nevertheless, since we are not able (yet) to figure out how to split the “cumulative” effect, i.e. which of the two variables to include and which to exclude, we choose to continue the analysis with both included.
quiz: Which of the two competitors (Oscar or Ball Park) do you think has a stronger impact on “stealing” market share from Dubuque?
Answer: We should compare the impact on Dubuque’s market share of a one-cent change in Oscar’s price with the impact of a one-cent change in the prices of both Ball Park products.

32 dubuque hot dog - extensions
► The initial regression is:

------------------------------------------------------------------------------
      MKTDUB |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        pdub |  -.0007598   .0000809    -9.39   0.000    -.0009202   -.0005994
      poscar |   .0002622   .0000843     3.11   0.002     .0000952    .0004293
      pbpreg |   .0003473   .0003316     1.05   0.297      -.00031    .0010046
     pbpbeef |   .0001025   .0002938     0.35   0.728    -.0004798    .0006848
       _cons |   .0403026   .0141226     2.85   0.005     .0123092     .068296
------------------------------------------------------------------------------

► By now it should be easy to see that an increase of one cent in Oscar’s price leads to an increase of 0.02622 percentage points in Dubuque’s market share (the coefficient of poscar), while an increase of one cent in the prices of both Ball Park products leads to an increase of 0.04498 percentage points (the sum of the pbpreg and pbpbeef coefficients). It seems that Ball Park is a more serious competitor than Oscar.
► Can we test whether a change in Ball Park’s prices has a higher impact on Dubuque’s market share than a change in Oscar’s price?

33 dubuque hot dog - extensions
► To show that a change in Ball Park’s prices has a higher impact on Dubuque’s market share than a change in Oscar’s price, we need to show that β3 + β4 > β2, where the coefficients correspond to the regression
  MKTDUB = β0 + β1·pdub + β2·poscar + β3·pbpreg + β4·pbpbeef + ε
► The hypotheses are:
  ■ null hypothesis:        H0: β3 + β4 – β2 ≤ 0
  ■ alternative hypothesis: Ha: β3 + β4 – β2 > 0
► But this is a standard test of an expression involving coefficients from the regression (klincom):

. klincom _b[pbpreg]+_b[pbpbeef]-_b[poscar]

------------------------------------------------------------------------------
      MKTDUB |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .0001875   .0001413     1.33   0.187    -.0000925    .0004676
------------------------------------------------------------------------------
 If Ha: <      then Pr(T < t)     = .906
 If Ha: not =  then Pr(|T| > |t|) = .187
 If Ha: >      then Pr(T > t)     = .094

► The relevant one-sided p-value is 0.094, so we cannot reject the null at the 5% level (or at any α below 9.4%), though we would reject at the 10% level; the point here is to realize that klincom is useful for this type of test.
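Note that klincom appears to be a course-supplied command; base Stata’s built-in lincom tests the same linear combination but reports only a two-sided p-value, which can be halved for the one-sided alternative when the estimate has the hypothesized sign. A sketch, run right after the full regression:

. lincom _b[pbpreg] + _b[pbpbeef] - _b[poscar]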

34 key points
► omitted variables
  • “too few” variables: key predictor(s) not in the regression
  • value of coefficient(s) is biased
  • economically poorly constructed model or data availability constraints
  • difficult to detect: use common sense
► spurious regression
  • “too many” variables: no meaningful causal relation
  • likely to have high R²
  • economically poorly constructed model
  • difficult to detect: use common sense
► multicollinearity
  • “too many” variables: correlated predictors in the regression
  • potentially insignificant group of coefficients
  • economically poorly constructed model
  • detection: vif and F-test

