Managerial Economics & Decision Sciences Department
business analytics II
session seven ▌ multicollinearity
session seven: multicollinearity

readings
► (MSN) Statistics & Econometrics, Chapter 7
► (CS) Dubuque Hot Dog

learning objectives
► define multicollinearity
► understand the effects of multicollinearity
► detecting multicollinearity
► the vif command
multicollinearity: fundamentals …one too many…

key concept: multicollinearity
► Multicollinearity occurs when two or more explanatory (x) variables are highly correlated.

► If two variables are highly correlated, they tend to move in lockstep. As a result, they lack independent variation; in effect, they represent the "same experiment". The regression may be able to determine that "the experiment" actually had an effect on y, but it may not be able to determine which of the two variables is responsible. Thus, each variable used individually may be significant, but when entered jointly, multicollinearity may lead to neither being significant (see the simulated sketch below).
► When variables are multicollinear, their standard errors are inflated, which makes it more difficult to draw inferences about the impact of each one separately (significance issues).

Remark. Multicollinearity is not the same as an interaction/slope dummy, which is a new variable created to ask a specific question: whether the effect of one variable depends on the level of another. You can have multicollinearity without any interaction, and vice versa.
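A tiny simulated illustration of the key concept (hypothetical data generated in Stata, not from the case): x2 is almost a copy of x1, so each is significant alone, but together their standard errors blow up.

. clear
. set obs 100
. set seed 42
. generate x1 = rnormal()
. generate x2 = x1 + 0.05*rnormal()   // x2 nearly duplicates x1
. generate y  = x1 + x2 + rnormal()
. regress y x1                        // x1 alone: strongly significant
. regress y x1 x2                     // jointly: inflated std. errors, weak t statistics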
an illustration: …the Dubuque hot dogs…

► Dubuque Hot Dogs offers branded hot dogs; its main competitors are Oscar Mayer and Ball Park. There are two kinds of Ball Park hot dogs: regular and all-beef.
► Data are available in hotdog.dta:
– MKTDUB gives Dubuque's weekly market share as a decimal, i.e. 0.04 means a 4% share
– average prices (in cents) during each week for Dubuque, Oscar Mayer and the two Ball Park hot dogs
► We try to explain Dubuque's market share using the price variables in a regression: own price (pdub), Oscar's price (poscar) and the two Ball Park prices (pbpreg and pbpbeef).

Remark. Should we expect any variables to exhibit multicollinearity? If any variables are suspected of multicollinearity, the prices of the two Ball Park hot dogs are the likely candidates: they are probably correlated with each other (similar production and distribution costs, etc.). The fact that the two prices share a common underlying generating process is a good example of the "same experiment".
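A minimal Stata sketch of the setup described above (assuming hotdog.dta is in the working directory):

. use hotdog, clear
. summarize MKTDUB pdub poscar pbpreg pbpbeef
. regress MKTDUB pdub poscar pbpreg pbpbeef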
an illustration: …the Dubuque hot dogs…

► The regression results are shown in Figure 1:

Figure 1. Results for regression of MKTDUB on pdub, poscar, pbpreg and pbpbeef

► The coefficients on the two Ball Park price variables are insignificant. We mentioned that the main effect of multicollinearity is inflated standard errors for the coefficients. The standard error is the denominator of the t statistic (for significance), so in the presence of multicollinearity the t statistic is small and the corresponding p-value is large. Thus we will tend to see variables as insignificant (due to multicollinearity) when in fact those variables might have explanatory power.

Figure 2. Results for regression of pbpreg on pbpbeef

Remark. To emphasize the idea of independent variation: since the two variables are highly correlated, we cannot see much independent movement between them. This is the origin of our inability to disentangle the separate effects of two highly correlated variables on the y variable.
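To confirm the suspicion directly in Stata, one can inspect the correlation between the two Ball Park prices and run the auxiliary regression reported in Figure 2 (a sketch):

. correlate pbpreg pbpbeef
. regress pbpreg pbpbeef      // the regression reported in Figure 2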
an illustration: …the Dubuque hot dogs…

► In the initial regression we saw that pbpreg and pbpbeef are likely to be correlated, which might inflate the standard errors of their coefficients. But are the standard errors really inflated?
► The command vif, run after the regression, delivers a list of "variance inflation factors", one per coefficient:

. regress MKTDUB pdub poscar pbpreg pbpbeef
. vif

Figure 3. Results for vif command

key concept: inflated standard deviations
► For a given coefficient, if the VIF value is greater than 10 then we have evidence that the standard error of that coefficient is inflated. The p-value is then likely larger than it should be, i.e., the t-test will tend to indicate that the coefficient is insignificant when in fact it might be significant.
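For reference (the formula is not stated on the slide, but it is what Stata's vif reports), the variance inflation factor for coefficient j is

VIF_j = 1 / (1 − R_j²)

where R_j² is the R-squared from regressing x_j on the remaining explanatory variables; VIF_j > 10 corresponds to R_j² > 0.9. It can be reproduced by hand, e.g. for pbpreg:

. quietly regress pbpreg pdub poscar pbpbeef
. display 1/(1 - e(r2))       // the VIF for pbpreg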
detecting multicollinearity

► The fact that we detect inflated standard errors does not automatically guarantee the detection of multicollinearity. To identify multicollinearity we use the F-test.
► The F-test tells us whether one or more variables add predictive power to a regression:

hypothesis
H₀: all of the regression coefficients (β) on the variables you are testing equal 0
Hₐ: at least one of the regression coefficients (β) is different from 0

► In plain language: you are basically testing whether these variables are any more related to y than junk variables.

Remark. The F-test for a single variable returns the same significance level as the t-test.

► The F-test for a group of variables can be executed in STATA using the test or testparm command, listing the variables you wish to test after running a regression:

testparm xvar1 xvar2 … xvark

Remark. After the STATA command testparm you should list the variables for which you want to test whether their coefficients are all equal to zero.
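The statistic behind this test (not shown on the slide) compares the restricted regression, with the q tested coefficients set to zero, against the unrestricted one:

F = [ (SSR_restricted − SSR_unrestricted) / q ] / [ SSR_unrestricted / (n − k − 1) ]

where n is the number of observations and k the number of explanatory variables in the unrestricted regression; under H₀ the statistic follows an F(q, n − k − 1) distribution. In STATA, running test pbpreg pbpbeef after the regression produces the same joint test as testparm pbpreg pbpbeef.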
detecting multicollinearity

► The F-test applied to pbpreg and pbpbeef provides the following result:

. testparm pbpreg pbpbeef

 ( 1)  pbpreg = 0           ← the (joint) null hypothesis
 ( 2)  pbpbeef = 0

       F(  2,   108) =
            Prob > F =      ← the p-value

Figure 4. Results for testparm command

► The decision criterion is similar to the one we introduced in the context of the t-test. First choose a significance level α, then compare the p-value with α:

decision
If p-value < α, reject H₀; otherwise we cannot reject H₀ at the α significance level.

► What exactly does it mean to reject the null in the context of the F-test? It means that at least one of the coefficients is significantly different from zero.
► What if you fail to reject the null in the context of the F-test? It means the group of tested variables is no more predictive than junk: drop the group.
detecting multicollinearity

► If the F-test does reject the null, then one or more variables in the group has a nonzero coefficient, but:
– this does not imply that every variable in the group has a nonzero coefficient
– you should still generally keep the whole group.
► Extra thought is needed when working with categorical variables, such as seasonality. Don't be surprised if, despite a significant F-test, none of the categorical variables is individually significant:
– remember that significance is relative to the omitted category
– thus, you may still have significant pairwise comparisons between included categories.
► If you do have multicollinearity, then:
– if your goal is to predict y, or if the multicollinear variables are not the key predictors you are interested in, then multicollinearity is not a problem
– if the variables are key predictors and the multicollinearity is problematic, then perform an F-test on the group of variables. If the test is significant, conclude that these variables collectively matter, even though you may be unable to sort out their individual effects
– do not conclude that one variable, but not the other, is what matters; if you knew which one(s) mattered and how, you would not have a multicollinearity problem in the first place.
detecting multicollinearity

► The F-test detected multicollinearity, and we now know that the coefficients on pbpreg and pbpbeef come out insignificant because of inflated standard errors. Can we simply remove one of the two? Which one?

Figure 5. Results for regression of MKTDUB on pdub, poscar and pbpreg
Remark. Keep only pbpreg and drop pbpbeef. Now pbpreg becomes significant.

Figure 6. Results for regression of MKTDUB on pdub, poscar and pbpbeef
Remark. Keep only pbpbeef and drop pbpreg. Now pbpbeef becomes significant.

► In both cases the remaining variable becomes significant. This should not be a surprise: the F-test indicated that, even though the two variables are individually insignificant, jointly they are significant, i.e., at least one of them matters.
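The two reduced regressions reported in Figures 5 and 6 are simply (a sketch, following the full regression above):

. regress MKTDUB pdub poscar pbpreg     // drop pbpbeef (Figure 5)
. regress MKTDUB pdub poscar pbpbeef    // drop pbpreg  (Figure 6)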
detecting multicollinearity

► Another interesting fact: in both regressions in which we use only one of the two variables, the coefficient of the included variable is very close to the sum of the two coefficients from the regression in which both were included:
– initial regression (both included): b_pbpreg and b_pbpbeef (and their sum)
– regression with pbpreg only: b_pbpreg ≈ that sum
– regression with pbpbeef only: b_pbpbeef ≈ that sum
► In fact, in either of the two regressions in which only one variable is included, the coefficient of the included variable "picks up" the cumulative effect.
► Nevertheless, since we are not (yet) able to figure out how to split the cumulative effect, i.e., which of the two variables to include and which to exclude, we choose to continue the analysis with both included.
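The cumulative effect can also be read off the full regression directly with STATA's built-in lincom, which estimates a linear combination of coefficients together with its standard error (a sketch):

. regress MKTDUB pdub poscar pbpreg pbpbeef
. lincom _b[pbpreg] + _b[pbpbeef]       // cumulative Ball Park effect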
detecting multicollinearity

quiz
► Which of the two competitors (Oscar or Ball Park) do you think has a stronger impact on "stealing" market share from Dubuque?

Figure 7. Results for regression of MKTDUB on pdub, poscar, pbpreg and pbpbeef

► By now it should be easy to see that a one-cent increase in Oscar's price raises Dubuque's market share by the coefficient of poscar, while a one-cent increase in the prices of Ball Park's products raises it by the sum of the pbpreg and pbpbeef coefficients, which is larger. It seems that Ball Park is a more serious competitor than Oscar.
detecting multicollinearity

quiz
► Can we test whether a change in Ball Park's prices has a higher impact on Dubuque's market share than a change in Oscar's price?
► To prove this we need to show that β₃ + β₄ > β₂, where the coefficients correspond to the regression

MKTDUB = β₀ + β₁·pdub + β₂·poscar + β₃·pbpreg + β₄·pbpbeef + ε

► The hypotheses are:

hypothesis
H₀: β₃ + β₄ − β₂ ≤ 0
Hₐ: β₃ + β₄ − β₂ > 0

. klincom _b[pbpreg]+_b[pbpbeef]-_b[poscar]

MKTDUB | Coef. Std. Err. t P>|t| [95% Conf. Interval]
   (1) |

If Ha: <      then Pr(T < t)     = .906
If Ha: not =  then Pr(|T| > |t|) = .187
If Ha: >      then Pr(T > t)     = .094

Figure 8. Results for klincom command

► With the one-sided alternative, the p-value is 0.094: we cannot reject the null at significance levels below roughly 9.4%, though we can (just) reject it at the 10% level. The point here is to realize that klincom is useful for testing hypotheses about linear combinations of coefficients.
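klincom appears to be a course-supplied wrapper around STATA's built-in lincom (an assumption; the built-in command reports only the two-sided p-value, from which the one-sided values above follow):

. regress MKTDUB pdub poscar pbpreg pbpbeef
. lincom _b[pbpreg] + _b[pbpbeef] - _b[poscar]

Note that the one-sided p-value 0.094 is half of the two-sided 0.187, since the estimated combination is positive.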