QM222 Class 19 Section D1
Tips on your Project
QM222 Fall 2015 Section D1
Your draft will not be complete without….
- An updated data set that I can read (saveold). Do this NOW if you haven’t already.
- A “sum” of all variables used anywhere in the draft, attached at the end.
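A minimal sketch of the two commands this slide asks for. The file name is a placeholder, and the variable list is borrowed from the examples later in these slides; substitute your own.

```stata
* Save a copy of the data in an older Stata format.
* "myproject_data" is a hypothetical file name.
saveold "myproject_data", replace

* Summary statistics ("sum") for every variable used in the draft.
* Replace this list with the variables in your own regressions.
summarize realrinc marital hrsusual
```

Paste the summarize output at the end of the draft so that the number of observations, means, and ranges of each variable are visible.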
Criteria: How your project will be judged
- Does your project use statistics, including (but not limited to) multiple regression, that are most appropriate to answer your question?
- Does it demonstrate a deep understanding of the statistics taught in the course?
- Is your data set appropriate to answering the question?
- Have you made mistakes in handling missing data, generating variables, or interpreting coefficients?
Some regression misunderstandings and tips
You need to understand what your results are telling you…
You need to think, reread the book, etc.
Make sure indicator variables are 0-1 and named correctly

. tab sex

respondents |
        sex |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |     26,286       44.10       44.10
     female |     33,313       55.90      100.00
      Total |     59,599      100.00

. tab sex, nolabel

          1 |     26,286       44.10       44.10
          2 |     33,313       55.90      100.00

. gen male=sex
. replace male=0 if sex==2
(33313 real changes made)
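An alternative sketch that builds the same indicator in one step. It is meant to be run in place of the gen/replace pair above (not in addition to it) and assumes the coding shown by tab sex, nolabel (1 = male, 2 = female).

```stata
* One-step indicator: 1 if male, 0 if female.
* The "if !missing(sex)" qualifier leaves male missing when sex is missing,
* rather than silently coding those observations as 0.
gen male = (sex == 1) if !missing(sex)

* Check the result: all observations should sit in the (1,1) and (2,0) cells.
tab sex male, missing
```

Naming the variable male (rather than sex) makes the coefficient easy to read: it is the difference for men relative to women.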
When to use dummies (i.occupation), and when not
If you have a numerical variable (like hours worked or height), use it as a numerical variable, UNLESS you believe that the relationship between Y and X changes so much at every value that it can’t be estimated as a nonlinear relationship.
When you have categorical variables
- You cannot use categorical variables as numbers. This includes marital status, work status, etc.
- Don’t use LOTS of dummies for important variables (whose coefficients you want to understand). If you have a categorical variable with more than 10 categories, try to combine them into broader categories.
- It’s okay to include more than 10 (or so) dummies for control variables that you don’t plan to discuss or report (e.g. occupation).
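Combining categories can be sketched with recode. The example uses marital only because it appears elsewhere in these slides (it has 5 categories, not more than 10), and it assumes the coding 1 = married, 2 = widowed, 3 = divorced, 4 = separated, 5 = never married. The new variable name marital3 and its grouping are illustrative.

```stata
* Collapse five marital codes into three broader groups.
* Assumed coding: 1 married, 2 widowed, 3 divorced, 4 separated, 5 never married.
recode marital (1 = 1 "married") (2 3 4 = 2 "previously married") ///
               (5 = 3 "never married"), gen(marital3)

* Confirm that each old code landed in the intended new group.
tab marital marital3, missing

* Use the broader categories as dummies in the regression.
regress realrinc i.marital3
```

The same pattern applies to a variable with many categories: list the old codes that belong together on the left of each equals sign.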
Some misunderstandings: Interpreting multiple dummies (Chapter 10)
- Each coefficient is that variable versus the reference (excluded) category.
- Often, it makes sense to choose the reference category to be the one you would most want as the comparison.
- You can NEVER put all categories into the regression. Stata will omit one. Example on the next page.
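One way to choose the reference category is Stata's ib#. factor-variable prefix. A sketch, assuming marital is coded so that 1 = married and 5 = never married:

```stata
* Default: the lowest code (assumed here to be 1 = married) is the reference,
* so each coefficient is "this category versus married".
regress realrinc i.marital

* Make code 5 (assumed to be never married) the reference instead,
* so each coefficient is "this category versus never married".
regress realrinc ib5.marital
```

The fit of the regression is identical either way; only the comparison that each coefficient expresses changes.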
What happens when you add all categories on the right….

. regress realrinc married widow divorced separated nevermarried
note: widow omitted because of collinearity

      Source |       SS       df       MS              Number of obs =   34888
-------------+------------------------------           F(  4, 34883) =  187.50
       Model |  5.9751e+11     4  1.4938e+11           Prob > F      =  0.0000
    Residual |  2.7791e+13 34883   796694311           R-squared     =  0.0210
-------------+------------------------------           Adj R-squared =  0.0209
       Total |  2.8389e+13 34887   813730067           Root MSE      =   28226

------------------------------------------------------------------------------
    realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     married |   9009.761   814.9317    11.06   0.000     7412.468    10607.05
       widow |          0  (omitted)
    divorced |    6124.18   882.4919     6.94   0.000     4394.467    7853.892
   separated |   1753.797   1126.321     1.56   0.119    -453.8281    3961.422
nevermarried |  -553.9949   846.5664    -0.65   0.513    -2213.292    1105.302
       _cons |   16458.18   788.6264    20.87   0.000     14912.45    18003.92
Where is currently married?
What is the difference between nevermarried and currently married? Is it significant?

. regress realrinc i.marital

      Source |       SS       df       MS              Number of obs =   34888
-------------+------------------------------           F(  4, 34883) =  187.50
       Model |  5.9751e+11     4  1.4938e+11           Prob > F      =  0.0000
    Residual |  2.7791e+13 34883   796694311           R-squared     =  0.0210
-------------+------------------------------           Adj R-squared =  0.0209
       Total |  2.8389e+13 34887   813730067           Root MSE      =   28226

--------------------------------------------------------------------------------
      realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
       marital |
       widowed |  -9009.761   814.9317   -11.06   0.000    -10607.05   -7412.468
      divorced |  -2885.581   446.1419    -6.47   0.000    -3760.033   -2011.128
     separated |  -7255.963   829.9696    -8.74   0.000     -8882.73   -5629.196
 never married |  -9563.755   370.0341   -25.85   0.000    -10289.03   -8838.477
               |
         _cons |   25467.94   205.3829   124.00   0.000     25065.39     25870.5
What is the difference between nevermarried and widowed? Is it significant?
You can see if the confidence intervals overlap. Or you can use Stata’s postestimation tests.

. regress realrinc i.marital

      Source |       SS       df       MS              Number of obs =   34888
-------------+------------------------------           F(  4, 34883) =  187.50
       Model |  5.9751e+11     4  1.4938e+11           Prob > F      =  0.0000
    Residual |  2.7791e+13 34883   796694311           R-squared     =  0.0210
-------------+------------------------------           Adj R-squared =  0.0209
       Total |  2.8389e+13 34887   813730067           Root MSE      =   28226

--------------------------------------------------------------------------------
      realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
       marital |
       widowed |  -9009.761   814.9317   -11.06   0.000    -10607.05   -7412.468
      divorced |  -2885.581   446.1419    -6.47   0.000    -3760.033   -2011.128
     separated |  -7255.963   829.9696    -8.74   0.000     -8882.73   -5629.196
 never married |  -9563.755   370.0341   -25.85   0.000    -10289.03   -8838.477
               |
         _cons |   25467.94   205.3829   124.00   0.000     25065.39     25870.5
Testing, combining categories
You can’t refer to the categories by name after i.marital, so first make actual indicator variables:

. gen widow=marital==2
. gen divorced=marital==3
. gen separated=marital==4
. gen nevermarried=marital==5
. regress realrinc widow divorced separated nevermarried

Then, after the regression, test a linear combination:

. lincom widow - nevermarried

RESULTS:

 ( 1)  widow - nevermarried = 0

------------------------------------------------------------------------------
    realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   553.9949   846.5298     0.65   0.513    -1105.231     2213.22

This result tells me that I could combine widow and nevermarried into a single category if I want.
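A sketch of an equivalent test and of the combining step itself. The variable name widow_or_never is hypothetical; the marital codes (2 = widowed, 5 = never married) follow the gen commands on this slide.

```stata
regress realrinc widow divorced separated nevermarried

* Same hypothesis as the lincom command: the two coefficients are equal.
* test reports an F statistic (the square of lincom's t) with the same p-value.
test widow = nevermarried

* If you decide to combine them, build one indicator covering both categories.
* The "if !missing(marital)" qualifier keeps missing values missing.
gen widow_or_never = inlist(marital, 2, 5) if !missing(marital)
regress realrinc widow_or_never divorced separated
```

After combining, the reference category is still currently married, and the single coefficient on widow_or_never replaces the two separate ones.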
When not to control for a variable
I want to know how education affects men’s and women’s belief that marijuana should be legalized.
Grass: indicator variable equal to 1 if the respondent believes marijuana should be legalized.
If I run this:
Grass = b0 + b1 education + b2 income + b3 age…
then the coefficient on education tells us: “If someone gets a lot of education but has the same income as another person, how does the education affect grass?”
You might instead want to know: “If someone gets a lot of education and as a result has higher income than another person, as well as being better educated, how does the education affect grass?”
For this, leave income out and run:
Grass = b0 + b1 education + b3 age…
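The two versions of the regression can be sketched as follows. The variable names grass, education, income, and age are placeholders taken from the equations on this slide; your data set may name them differently.

```stata
* Version 1: controls for income.
* The education coefficient compares people with the SAME income.
regress grass education income age

* Version 2: leaves income out.
* The education coefficient now also picks up the part of the
* education effect that works through higher income.
regress grass education age
```

Comparing the education coefficient across the two regressions shows how much of its effect runs through income.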
If you want to show how adding a group of variables improves explanatory power, you can show both regressions (with and without) and compare adjusted R-sq (or MSE)

      Source |       SS       df       MS              Number of obs =    1038
-------------+------------------------------           F(  6,  1031) =   18.91
       Model |  1.0843e+11     6  1.8072e+10           Prob > F      =  0.0000
    Residual |  9.8536e+11  1031   955729182           R-squared     =  0.0991
-------------+------------------------------           Adj R-squared =  0.0939
       Total |  1.0938e+12  1037  1.0548e+09           Root MSE      =   30915

--------------------------------------------------------------------------------
       realinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
       marital |
       widowed |  -20202.06   4203.911    -4.81   0.000    -28451.25   -11952.86
      divorced |  -20055.93   2846.119    -7.05   0.000    -25640.78   -14471.08
     separated |  -17877.02   4945.558    -3.61   0.000    -27581.53   -8172.515
 never married |  -19598.35   2518.839    -7.78   0.000    -24540.99   -14655.72
               |
      hrsusual |  -330.5012   231.5231    -1.43   0.154    -784.8114     123.809
         hrssq |   6.118547   2.829847     2.16   0.031      .565629    11.67146
         _cons |   48787.92   4886.288     9.98   0.000     39199.71    58376.12

      Source |       SS       df       MS              Number of obs =     962
-------------+------------------------------           F(  2,   959) =   29.31
       Model |  4.6811e+10     2  2.3406e+10           Prob > F      =  0.0000
    Residual |  7.6571e+11   959   798446597           R-squared     =  0.0576
-------------+------------------------------           Adj R-squared =  0.0556
       Total |  8.1252e+11   961   845495865           Root MSE      =   28257

------------------------------------------------------------------------------
    realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    hrsusual |   594.0719   223.0332     2.66   0.008     156.3825    1031.761
       hrssq |  -.8641106   2.720913    -0.32   0.751    -6.203741     4.47552
       _cons |   587.6587   4650.105     0.13   0.899    -8537.896    9713.214
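The comparison is cleanest when both regressions use the same dependent variable and the same observations. A sketch of one way to arrange that, using realrinc as the dependent variable in both (an assumption for illustration):

```stata
* Regression WITH the marital dummies.
regress realrinc i.marital hrsusual hrssq
display "Adjusted R-squared with marital dummies:    " e(r2_a)

* Regression WITHOUT them, restricted to the observations used above.
* e(sample) marks the observations in the most recent regression.
regress realrinc hrsusual hrssq if e(sample)
display "Adjusted R-squared without marital dummies: " e(r2_a)
```

If the adjusted R-squared rises when the group is added, the group improves explanatory power even after accounting for the extra variables.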
Nonlinear terms: Some of you did it without thinking about the results
- You can NOT put a nonlinear version of Y (the dependent variable) in as an explanatory variable. It is obvious that Y squared will explain Y, but what have you learned from that?
- You can’t make an indicator variable into a nonlinear one.
- When you try out a nonlinear term, be sure to add it into the whole multiple regression. You want to test it controlling for other factors.
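A sketch of the last point, using the hours variables that appear in the regressions in these slides. It assumes hrssq has not already been created in your data set.

```stata
* Create the squared term from the explanatory variable (never from Y).
gen hrssq = hrsusual^2

* Add it to the full multiple regression, keeping the other controls,
* so that its t-stat is judged holding the other factors constant.
regress realinc i.marital hrsusual hrssq
```

Squaring a 0-1 indicator would reproduce the indicator itself (0 squared is 0 and 1 squared is 1), which is why the second point on the slide rules it out.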
These results say that at 0 hours, the slope drealinc/dhrs is negative, but the slope becomes more positive as hours increase

. regress realinc i.marital hrsusual hrssq

      Source |       SS       df       MS              Number of obs =    1038
-------------+------------------------------           F(  6,  1031) =   18.91
       Model |  1.0843e+11     6  1.8072e+10           Prob > F      =  0.0000
    Residual |  9.8536e+11  1031   955729182           R-squared     =  0.0991
-------------+------------------------------           Adj R-squared =  0.0939
       Total |  1.0938e+12  1037  1.0548e+09           Root MSE      =   30915

--------------------------------------------------------------------------------
       realinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
       marital |
       widowed |  -20202.06   4203.911    -4.81   0.000    -28451.25   -11952.86
      divorced |  -20055.93   2846.119    -7.05   0.000    -25640.78   -14471.08
     separated |  -17877.02   4945.558    -3.61   0.000    -27581.53   -8172.515
 never married |  -19598.35   2518.839    -7.78   0.000    -24540.99   -14655.72
               |
      hrsusual |  -330.5012   231.5231    -1.43   0.154    -784.8114     123.809
         hrssq |   6.118547   2.829847     2.16   0.031      .565629    11.67146
         _cons |   48787.92   4886.288     9.98   0.000     39199.71    58376.12

Take the derivative: drealinc/dhrs = -330 + 6.119*2*hrs
At hrs=0, slope = -330
At hrs=20, slope = -85.24
At hrs=40, slope = +159.5
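The slope calculation can also be done in Stata right after the regression, using the stored coefficients instead of retyping rounded numbers. A sketch:

```stata
regress realinc i.marital hrsusual hrssq

* Slope of realinc with respect to hours: b_hrsusual + 2 * b_hrssq * hours
display "slope at  0 hours: " _b[hrsusual]
display "slope at 40 hours: " _b[hrsusual] + 2*_b[hrssq]*40

* Hours at which the slope changes sign (slope = 0).
display "slope is zero at hours = " -_b[hrsusual]/(2*_b[hrssq])
```

With the coefficients shown above, the slope is zero at roughly 27 hours: below that the estimated slope is negative, and above it the slope is positive.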
Interpreting coefficients when the dependent variable is a dummy

      Source |       SS       df       MS              Number of obs =   34892
-------------+------------------------------           F(  2, 34889) =  360.89
       Model |   175.59685     2  87.7984249           Prob > F      =  0.0000
    Residual |  8487.89149 34889  .243282739           R-squared     =  0.0203
-------------+------------------------------           Adj R-squared =  0.0202
       Total |  8663.48834 34891   .24830152           Root MSE      =  .49324

------------------------------------------------------------------------------
     married |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .0673883   .0054176    12.44   0.000     .0567696    .0780069
    realrinc |   1.94e-06   9.50e-08    20.45   0.000     1.76e-06    2.13e-06
       _cons |   .4641422    .004045   114.75   0.000     .4562139     .472070

The coefficient on male tells us that a man is 6.7 percentage points more likely than a woman to be married (holding income constant).
But the average probability of being married is .536, so this is .067/.536 = .125, or 12.5 percent different.
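The two ways of expressing the male coefficient can be computed after the regression. A sketch, assuming married and male are 0-1 indicators as on this slide:

```stata
regress married male realrinc

* Mean of the dependent variable over the observations used in the regression.
quietly summarize married if e(sample)

* Gap in percentage points: the coefficient times 100.
display "percentage-point gap:          " 100*_b[male]

* Gap as a percent of the average probability of being married.
display "percent of the mean of married: " 100*_b[male]/r(mean)
```

Reporting both numbers avoids confusing a change in percentage points with a percent change.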
Dropping variables
- If you want, you can drop a variable with |t| < 1.
- Never drop a variable with |t| > 1 unless that variable should not be in the regression no matter what its t-stat is.
- If you keep a variable with |t| < 1 in the regression, don’t discuss its coefficient as if it really were important. Say “X2 has no effect on Y” or, at most, “while the sign on X2 is positive, we are very uncertain that it has any effect at all.”
Sometimes, a variable can be significant but not particularly important
Sometimes, a variable can have a large t-stat but not really make a large difference to your predicted Y. To see if it does make a large difference, make this calculation:
X coefficient * (highest X value in dataset – lowest X value)
That tells you the maximum amount by which variation in X can change Y.
Or, to ignore extreme values:
X coefficient * (95th percentile X value – 5th percentile X value)
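Both calculations can be sketched in Stata. The regression below (hours entered linearly, with marital dummies as controls) is only an illustration; use your own model and your own X.

```stata
regress realrinc hrsusual i.marital

* summarize, detail stores the min, max, and percentiles of X in r().
quietly summarize hrsusual if e(sample), detail

* Largest change in predicted Y that variation in X can produce.
display "full-range effect:             " _b[hrsusual]*(r(max) - r(min))

* Same calculation, ignoring the most extreme 5 percent at each end.
display "5th-to-95th percentile effect: " _b[hrsusual]*(r(p95) - r(p5))
```

Compare the result with the typical size of Y (for example, its mean) to judge whether the variable matters in practice.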
On writing the project: Focus on the client
Components of a good report:
Body of the report:
- Introduction of the problem (written to the client, of course)
- Perhaps background on the problem (can bring in other sources, put in graphs, etc.)
- Description of the data set and variables
- Analysis: describe the regression results as intuitively as possible, without using a lot of statistics terms except when absolutely necessary (and then perhaps in parentheses or footnotes)
- Conclusion
At the beginning of the report:
- An executive summary (1 paragraph to 2 pages) with no statistics terms. The assignment read: “Can an executive who knows little statistics understand what you did and what you found from reading the executive summary?”
On writing the project: Focus on the client
Graphs and tables can be in the body of the report or at the end. They need to be labeled and referred to (as in “Table 1 shows…” or “Men make more money than women in this occupation, as can be seen in Table 1…”).
Style of writing
- Do NOT write about your process (first I did this, then I realized that…, etc.)
- Most things should be written in the third person, such as:
  This report investigates whether ……
  As can be seen in the first two columns of Table 1, when we do not control for gender or other possibly confounding factors, each hour of work….
- Report more than one regression only if the client learns something from it (not if you learned that it was the wrong way to estimate things).
On writing the project: Other criteria
- If I were the client, would I feel that this project answered a question I am interested in?
- Can an executive who knows little statistics understand what you did and what you found from reading the executive summary?
- Are regressions presented in easy-to-read tables and/or as equations?
- Are any graphics you use appropriate, and do they clearly convey information to the reader?
- Does the writing develop the ideas clearly and in logical order?
- Is the report well written? Are there English and spelling mistakes?
- Does the report look professional?
Use tables to report several regressions (in your final project report)
If you have lots of dummies that are not themselves important, you can just say “All regressions also include…..” If only some regressions have these dummies, at the bottom include a line labeled e.g. Industry dummies. Then put a √ if it is there. (Use adjusted R-square rather than R-square)
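One way to line up several regressions side by side before building the final table is Stata's estimates store and estimates table. A sketch; the stored names hours_only and with_marital are hypothetical, and the regressions are the ones used as examples in these slides.

```stata
* Run each regression quietly and store its results under a name.
quietly regress realrinc hrsusual hrssq
estimates store hours_only

quietly regress realrinc i.marital hrsusual hrssq
estimates store with_marital

* One column per regression: coefficients, standard errors,
* number of observations, and adjusted R-squared.
estimates table hours_only with_marital, b(%9.2f) se(%9.2f) stats(N r2_a)
```

The output can then be copied into Excel or Word and relabeled with plain-language variable names for the client.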
How to report single regressions

VOTESHARE% = 48.216 + 4.179 INCUMBENT + 0.840 GROWTH
             (34.38)  (2.38)            (5.31)

SEE = 4.199   R2 = .627   adj. R2 = .593   N = 25
(t-stats in parentheses)

Or, put standard errors in parentheses.
How to move Stata results into Excel
- You want to do this to make tables.
- You probably want to do this to make graphs.
- Copy from the results window or from the log into Excel.
- In Windows: Data > Text to Columns
Criteria: How your project will be judged
- Does your project use statistics, including (but not limited to) multiple regression, that are most appropriate to answer your question?
- Does it demonstrate a deep understanding of the statistics taught in the course?
- Is your data set appropriate to answering the question?
- Have you made mistakes in handling missing data, generating variables, or interpreting coefficients?
- Does the writing develop the ideas clearly and in logical order?
- If I were the client, would I feel that this project answered a question I am interested in?
- Can an executive who knows little statistics understand what you did and what you found from reading the executive summary?
- Does your report control for and/or discuss possible biases, e.g. due to confounding factors?
- Are regressions presented in easy-to-read tables and/or as equations?
- Are any graphics you use appropriate, and do they clearly convey information to the reader?
- Is the report well written? Are there English and spelling mistakes?
- Does the report look professional?
What’s next?
Don’t forget the signup: https://docs.google.com/spreadsheets/d/1pcfaSpsS6TISPccPsJf7ykuhTIzqsWN7OqURMO_BLM4/edit?usp=sharing