QM222 Nov. 7 Section D1
Multicollinearity
Regression Tables
What to do next on your project
QM222 Fall 2016 Section D1
Multicollinearity
Multicollinearity
Recall that the interpretation of a coefficient in a multiple regression is: the effect on Y of X changing by 1 if the other variables stay the same. And the t-test tests the null: could this coefficient be zero?
Sometimes you run a regression of Y on two very highly correlated variables, like the number of toothbrushes sold and the amount of toothpaste sold in a country in a year. The |t-stats| will both be very low, because either coefficient could be zero and the regression would predict approximately the same thing. But if you drop one of them, the other would become highly significant. E.g. GDP and Unemployment.
What to do if you find that variables that you believe should be significant are not
- If several variables are really measuring the same concept, drop one of them if its |t-stat| is less than ONE. If you drop a variable with a |t-stat| < 1, the adjusted R-squared increases.
- Which do you drop? The one with the lowest |t|. In other words, let the computer tell you which of the two variables you need. If you are right, the other variable will become more significant.
- NEVER DROP MORE THAN ONE VARIABLE AT A TIME. If you do, you might drop BOTH highly correlated variables.
- You can test whether two (or more) insignificant variables are jointly significant by writing this after you run the regression:
  test varname1 varname2
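A minimal Stata sketch of the procedure above. The variable names (y, toothbrushes, toothpaste) are hypothetical and only illustrate the sequence of commands:

```stata
* Check how strongly the two explanatory variables are correlated
correlate toothbrushes toothpaste

* Regression with both: each |t| may be small even if the pair matters
regress y toothbrushes toothpaste

* Joint F-test of the null that both coefficients are zero
test toothbrushes toothpaste

* If one |t| is below 1, drop only that one variable
* (here, hypothetically, toothpaste) and rerun
regress y toothbrushes
```

If the joint test rejects while both individual t-tests do not, that pattern is consistent with multicollinearity between the two variables.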
Making Regression Tables (see chapter 19)
Use Tables to report several regressions
Your different regressions will have different combinations of variables.
Why present more than 1 regression?
- To develop your ideas.
- Or for different dependent variables (list in column title).
Include either t-stats or coefficient standard errors in parentheses directly below the coefficient. In footnotes, say which you included.
Use asterisks to denote significance. Include the number of observations and at least the adjusted Rsq (and maybe RMSE, aka SEE).
For any set of multiple dummies, include in footnote what the excluded category is. (here, year1965) Note: If you are using i. for your dummies, Stata might use different reference categories for different regressions.
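One possible way to build such a table in Stata is sketched below. It assumes the user-written estout package (installed with ssc install estout), which is not necessarily the tool used in this course; the variable names y, x1, x2 and year are hypothetical.

```stata
* Assumption: the user-written estout package is installed
*   ssc install estout

* Store each regression under a name
regress y x1
estimates store m1

regress y x1 x2 i.year
estimates store m2

* One column per regression: standard errors in parentheses,
* adjusted R-squared, and asterisks for significance levels
esttab m1 m2, se ar2 star(* 0.10 ** 0.05 *** 0.01)
```

Whatever tool you use, still check that the footnote states what is in parentheses and which dummy category is excluded.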
What to do next on your project
Assignment 6
Ideally, have by Friday.
- Post your current data set under Stata data set (if you can).
- Run additional multiple regressions. Specifically:
- Think hard about whether there are additional omitted variables (i.e. confounding factors) that you can measure that are likely to be biasing your key coefficient(s). If you can find data on them, add them into the regressions. (If you really cannot think of anything beyond what you have, just write that.)
- Identify at least one omitted variable that you cannot measure, reason out the sign of the omitted variable bias, and explain here (Ass.6) in 1-3 sentences why and in what direction it will bias your key coefficient.
Assignment 6 cont.
- If you have any numeric explanatory (X) variable, add a quadratic term in addition to your other variables to test if this nonlinear specification fits better. (If you are good at math and prefer to add a different nonlinear variable or to make your dependent variable non-linear, be my guest.)
- Explain here (Ass.6) what you learn from this result (1-3 sentences), showing it with a graph if that helps.
- If you have a numeric explanatory (X) variable that is very skewed, think about whether top-coding or taking the log of that variable is appropriate instead.
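A minimal sketch of the quadratic and log specifications, assuming hypothetical variables y (dependent), x (the numeric explanatory variable) and z1, z2 (your other controls):

```stata
* Quadratic term: does a curve fit better than a straight line?
gen x_sq = x^2
regress y x x_sq z1 z2

* Log of a very skewed explanatory variable
* (ln() returns missing for x <= 0, so those observations drop out)
gen ln_x = ln(x)
regress y ln_x z1 z2
```

Comparing the adjusted R-squared and the |t-stat| on x_sq across these runs is one way to judge whether the nonlinear specification fits better.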
Another approach if you think the relationship between Y and X is really, really nonlinear
- You could try a set of dummy variables for different ranges of the variable, even though it is a numerical variable.
- Only use this approach if you believe that the relationship between Y and X changes so much at every value that it can't be estimated as a quadratic (or cubic etc.)
- Education is sometimes better as a set of dummies.
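As an illustration of range dummies, the sketch below assumes a hypothetical variable educ (years of education) and arbitrary cutoffs at 12 and 16 years; the names and cutoffs are not from the course data.

```stata
* Indicators for ranges of education.
* Less than 12 years is the excluded (reference) category.
* "if !missing(educ)" keeps missing values missing instead of coding them 0.
gen hs       = (educ == 12)            if !missing(educ)
gen somecoll = (educ > 12 & educ < 16) if !missing(educ)
gen collplus = (educ >= 16)            if !missing(educ)

regress y hs somecoll collplus z1
```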
Not in Assignment 6
If you have a very skewed Y (dependent) variable:
- Try top-coding it (if you think that once it reaches a quite high level, it doesn't matter how much higher it gets)
- Try changing it into an indicator variable
- Try estimating the median Y, replacing regress with qreg.
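The three options above might look like this in Stata. The variable names and the cutoffs (200000 and 100000) are hypothetical placeholders:

```stata
* Top-code Y at a hypothetical ceiling of 200000.
* Stata treats missing as larger than any number, so exclude it explicitly.
gen y_top = y
replace y_top = 200000 if y > 200000 & !missing(y)
regress y_top x1 x2

* Indicator version: 1 if Y is above a hypothetical threshold, else 0
gen y_high = (y > 100000) if !missing(y)
regress y_high x1 x2

* Median regression: same syntax as regress, with qreg instead
qreg y x1 x2
```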
Assignment 6 cont.
- Think about if you can and should use an interaction term. (This will be most useful if you think that different groups have different slopes.)
- Try at least one out in a multiple regression (with all your other variables as well). Copy and paste here (PS 6).
- Explain here what you learn from this interaction term result (1-3 sentences).
Review interaction terms:
If we think that the effect of X1 on Y depends on a different indicator variable X2 (e.g. scifi), the simplest way to model this in a regression is:
- Make an additional variable by multiplying X1 * X2
- Make an additional variable by multiplying X1 * (1-X2), recalling that (1-X2) is 1 if X2=0
Run a regression of Y on 3 variables:
- X1*X2: this is X1 for observations where X2=1
- X1*(1-X2): this is X1 for observations where X2=0
- X2: this is X2
Graph of this model
[Figure: Revenues (vertical axis) against Budget (horizontal axis), with separate lines for SciFi movies and Other movies]
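A sketch of the indicator-interaction model in Stata, assuming the data contain variables named revenues, budget and scifi (a 0-1 indicator). These names are guessed from the graph labels:

```stata
* Budget for scifi movies only (0 for all other movies)
gen budget_scifi = budget * scifi

* Budget for non-scifi movies only (0 for scifi movies)
gen budget_other = budget * (1 - scifi)

* Separate slopes for the two groups, plus a separate intercept shift
regress revenues budget_scifi budget_other scifi

* Test whether the two slopes differ
lincom budget_scifi - budget_other
```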
Interaction terms with numeric variables (for those who dare):
If we think that the effect of X1 on Y depends on a different numeric variable X2, the simplest way to model this in a regression is:
- Make an additional variable by multiplying X1 * X2
- Run a regression of Y on 3 variables: X1, X2, X1*X2
So Y = b0 + b1 X1 + b2 X2 + b3 X1*X2
Note that dY/dX1 = b1 + b3 X2
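Because dY/dX1 = b1 + b3 X2, the effect of X1 has to be evaluated at a chosen value of X2. A minimal sketch with hypothetical variable names y, x1, x2, evaluated at the arbitrary value X2 = 10:

```stata
* Build the interaction and run the regression
gen x1x2 = x1 * x2
regress y x1 x2 x1x2

* Effect of X1 on Y when X2 = 10: b1 + b3*10, with its standard error
lincom x1 + 10*x1x2
```

Repeating the lincom line for a low, typical and high value of X2 shows how the slope on X1 changes.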
More generally, ask yourself if your regressions are really answering the question.
I like sophisticated approaches if you are using them correctly and if they are the most appropriate way to answer your question.
Assignment 6 cont.
- Decide which is the best regression or set of regressions that you will use in your project.
- Update your Current Project Status, including replacing/adding these regressions to Question 7.
- Also answer Question 9, which asks for the conclusions of your project, as it now stands.
The more fully you answer Questions 7 and 9, the better feedback I can give you at your required meeting #2.
More things to be careful about
Make sure indicator variables are 0-1 and named correctly

. tab sex

respondents |
        sex |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |     26,286       44.10       44.10
     female |     33,313       55.90      100.00
      Total |     59,599      100.00

. tab sex, nolabel

          1 |     26,286       44.10       44.10
          2 |     33,313       55.90      100.00

. gen male=sex
. replace male=0 if sex==2
(33313 real changes made)
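An equivalent one-line construction, followed by a cross-check, is sketched below. It assumes sex is coded 1 = male and 2 = female as in the tab output; the name male2 is made up here so it does not clash with the variable created above.

```stata
* 1 if male, 0 if female, missing if sex is missing
gen male2 = (sex == 1) if !missing(sex)

* Cross-check: all males should fall in male2 == 1, all females in male2 == 0
tab sex male2, missing
```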
When not to control for a variable
I want to know how education affects men's and women's belief that people should legalize pot.
Grass: indicator variable equal to 1 if the respondent believes marijuana should be legalized.
If I run this regression:
  Grass = b0 + b1 education + b2 income + b3 age…
then the coefficient on education tells us "If someone gets a lot of education but has the same income as another person, how does the education affect grass?"
You might instead want to know "If someone gets a lot of education and as a result has higher income than another person as well as being better educated, how does the education affect grass?" For this, run:
  Grass = b0 + b1 education + b3 age…
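In Stata the two versions differ only in whether income is in the variable list. The names grass, educ, income and age are assumed here for illustration:

```stata
* Holds income fixed: effect of education among people with the same income
regress grass educ income age

* Leaves income out: the education coefficient also picks up
* the part of its effect that works through higher income
regress grass educ age
```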
Some misunderstandings on multiple dummies
- You cannot use categorical variables as numbers. This includes:
  - Marital status
  - Work status
- Each coefficient is that variable versus the reference (excluded) category.
- Often, it makes sense to choose the reference category to be the one you would most want as the comparison.
- You can NEVER put all categories into the regression. Stata will omit one.
Example next page.
What happens when you add all categories on the right…

. regress realrinc married widow divorced separated nevermarried
note: widow omitted because of collinearity

      Source |       SS       df       MS              Number of obs =   34888
-------------+------------------------------           F(  4, 34883) =  187.50
       Model |  5.9751e+11     4  1.4938e+11           Prob > F      =  0.0000
    Residual |  2.7791e+13 34883   796694311           R-squared     =  0.0210
-------------+------------------------------           Adj R-squared =  0.0209
       Total |  2.8389e+13 34887   813730067           Root MSE      =   28226

------------------------------------------------------------------------------
    realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     married |   9009.761   814.9317    11.06   0.000     7412.468    10607.05
       widow |          0  (omitted)
    divorced |    6124.18   882.4919     6.94   0.000     4394.467    7853.892
   separated |   1753.797   1126.321     1.56   0.119    -453.8281    3961.422
nevermarried |  -553.9949   846.5664    -0.65   0.513    -2213.292    1105.302
       _cons |   16458.18   788.6264    20.87   0.000     14912.45    18003.92
When you have categorical variables
- You cannot use categorical variables as numbers. This includes:
  - Marital status
  - Work status
  - Etc.
- Don't use LOTS of dummies for important variables (whose coefficients you want to understand). If you have a categorical variable with more than 10 categories, try to combine them into broader categories.
- It's okay to include more than 10 (or so) dummies for control variables that you don't plan to discuss or report (e.g. occupation).
Note: currently married is reference category
What is the difference between nevermarried and currently married? Is it significant?

. regress realrinc i.marital

      Source |       SS       df       MS              Number of obs =   34888
-------------+------------------------------           F(  4, 34883) =  187.50
       Model |  5.9751e+11     4  1.4938e+11           Prob > F      =  0.0000
    Residual |  2.7791e+13 34883   796694311           R-squared     =  0.0210
-------------+------------------------------           Adj R-squared =  0.0209
       Total |  2.8389e+13 34887   813730067           Root MSE      =   28226

--------------------------------------------------------------------------------
      realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
       marital |
       widowed |  -9009.761   814.9317   -11.06   0.000    -10607.05   -7412.468
      divorced |  -2885.581   446.1419    -6.47   0.000    -3760.033   -2011.128
     separated |  -7255.963   829.9696    -8.74   0.000     -8882.73   -5629.196
 never married |  -9563.755   370.0341   -25.85   0.000    -10289.03   -8838.477
               |
         _cons |   25467.94   205.3829   124.00   0.000     25065.39     25870.5
If coefficients on two categories are within 1 se of each other, you might consider combining them.
You can calculate the 68% confidence interval (coefficient plus or minus 1 standard error) for each and see if they overlap.

. regress realrinc i.marital

      Source |       SS       df       MS              Number of obs =   34888
-------------+------------------------------           F(  4, 34883) =  187.50
       Model |  5.9751e+11     4  1.4938e+11           Prob > F      =  0.0000
    Residual |  2.7791e+13 34883   796694311           R-squared     =  0.0210
-------------+------------------------------           Adj R-squared =  0.0209
       Total |  2.8389e+13 34887   813730067           Root MSE      =   28226

--------------------------------------------------------------------------------
      realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
       marital |
       widowed |  -9009.761   814.9317   -11.06   0.000    -10607.05   -7412.468
      divorced |  -2885.581   446.1419    -6.47   0.000    -3760.033   -2011.128
     separated |  -7255.963   829.9696    -8.74   0.000     -8882.73   -5629.196
  nevermarried |  -9563.755   370.0341   -25.85   0.000    -10289.03   -8838.477
               |
         _cons |   25467.94   205.3829   124.00   0.000     25065.39     25870.5
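As a worked example of the 68% interval check, the widowed and never married rows of the output can be plugged into Stata's display command, which works as a calculator. The numbers are copied from the table; nothing else is assumed.

```stata
* widowed: coefficient -9009.761, standard error 814.9317
display -9009.761 - 814.9317
display -9009.761 + 814.9317

* never married: coefficient -9563.755, standard error 370.0341
display -9563.755 - 370.0341
display -9563.755 + 370.0341
```

The intervals come out at roughly -9825 to -8195 for widowed and roughly -9934 to -9194 for never married. They overlap, so combining the two categories is worth considering.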
Or you can see if the coefficients on 2 categories (of the same thing) are similar by using Stata lincom tests
You can't refer to the categories by name after i.marital, so first make actual indicator variables:

  gen widow=marital==2
  gen divorced=marital==3
  gen separated=marital==4
  gen nevermarried=marital==5
  regress realrinc widow divorced separated nevermarried

Then, after the regression, test a linear combination:

. lincom widow - nevermarried

RESULTS:
 ( 1)  widow - nevermarried = 0

------------------------------------------------------------------------------
    realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   553.9949   846.5298     0.65   0.513    -1105.231     2213.22

This result tells me that I could combine widow and nevermarried into a single category if I want, since |t|<1.
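For reference, Stata's factor-variable notation also lets lincom address a category by its numeric code, which avoids creating the indicator variables by hand. This sketch assumes the coding implied by the gen statements above (2 = widowed, 5 = never married) and that currently married is the base category:

```stata
regress realrinc i.marital

* Difference between the widowed and never married coefficients
lincom 2.marital - 5.marital
```

Under those assumptions it should test the same null hypothesis as lincom widow - nevermarried.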
Some of you are getting close to writing your project up.
Do this ONLY after you meet with me. Everyone needs to meet with me after they think they have the results they want to present. If that is you, make an appointment.
What should the paper look like?
- Put yourself in the clients' mind as they are reading it.
- Introduction: Motivate the paper. Address the client. Why is it interesting to them?
- Be sure to describe your data and data sources.
- Be sure to develop your ideas and have a logical train of thought.
- It needs to look professional and the English needs to be correct.
- After you finish the paper, make an executive summary that an executive can read INSTEAD of the paper. It will repeat ideas and sentences from the introduction and conclusion, for sure. It should be understood by someone who knows no statistics. MORE on this later.