Published by Benjamin Garrett. Modified over 7 years ago.
1
QM222 Class 19 Section D1 Tips on your Project
QM222 Fall 2015 Section D1
2
Your draft will not be complete without…
An updated data set that I can read (saveold). Do this NOW if you haven’t already.
A “sum” of all variables used anywhere in the draft, attached at the end.
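A minimal sketch of both requirements in Stata. The file name and the variable list are placeholders (assumptions); substitute the variables your own draft uses.

```stata
* Assumption: "myproject_old.dta" is a placeholder file name.
* saveold writes the data in an older .dta format that earlier Stata versions can read.
saveold "myproject_old.dta", replace

* "sum" is the abbreviation of summarize; list every variable used in the draft.
* Assumption: these four variable names are only examples taken from later slides.
summarize realrinc marital hrsusual sex
```

Copy the summarize output from the Results window (or the log) to the end of the draft.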
3
Criteria: How your project will be judged
Does your project use statistics, including (but not limited to) multiple regression, that are most appropriate to answer your question?
Does it demonstrate a deep understanding of the statistics taught in the course?
Is your data set appropriate to answering the question?
Have you made mistakes in handling missing data, generating variables, or interpreting coefficients?
4
Some regression misunderstandings and tips
5
You need to understand what your results are telling you…
You need to think, reread the book, etc.
6
Make sure indicator variables are 0-1 and named correctly
. tab sex
(output: frequency table of respondents' sex, with rows labeled male, female, and Total; the counts were lost in extraction)
. tab sex, nolabel
(output: the same table showing the underlying codes 1 and 2)
. gen male=sex
. replace male=0 if sex==2
(33313 real changes made)
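As an alternative to the gen/replace pair above, the indicator can be built in one step. This is a sketch that assumes sex is coded 1 = male and 2 = female, as the tab output suggests. It replaces the two commands above rather than following them.

```stata
* Assumption: sex == 1 means male and sex == 2 means female.
* The "if !missing(sex)" part keeps male missing when sex is missing,
* instead of silently coding those respondents as 0.
gen male = (sex == 1) if !missing(sex)
label variable male "1 = male, 0 = female"

* Check the new variable against the original coding.
tab sex male, missing
```

The cross-tabulation should show every sex == 1 row under male == 1 and every sex == 2 row under male == 0.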
7
When to use dummies (i.occupation), and when not.
If you have a numerical variable (like hours worked or height), use it as a numerical variable, UNLESS you believe that the relationship between Y and X changes so much at every value that it can’t be estimated as a nonlinear relationship.
8
When you have categorical variables
You cannot use categorical variables as numbers. This includes:
Marital status
Work status
Etc.
Don’t use LOTS of dummies for important variables (whose coefficients you want to understand). If you have a categorical variable with more than 10 categories, try to combine them into broader categories.
It’s okay to include dummies for more than 10 (or so) categories when they are control variables that you don’t plan to discuss or report (e.g. occupation).
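One way to combine categories into broader ones is recode. The sketch below assumes marital is coded 1 = married, 2 = widowed, 3 = divorced, 4 = separated, 5 = never married (codes 2 to 5 appear on a later slide; code 1 is inferred). The grouping, the new variable name marital3, and the variable occupation are illustrative assumptions.

```stata
* Assumption: marital codes are 1 married, 2 widowed, 3 divorced,
* 4 separated, 5 never married.
recode marital (1 = 1 "married") ///
               (2 3 4 = 2 "previously married") ///
               (5 = 3 "never married"), generate(marital3)

* Check that every old category landed where intended.
tab marital marital3, missing

* Few dummies for the variable of interest; many dummies only as controls.
* Assumption: occupation is a hypothetical many-category control variable.
regress realrinc i.marital3 hrsusual i.occupation
```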
9
Some misunderstandings Chapter 10
Interpreting multiple dummies (Chapter 10)
Each coefficient is that variable versus the reference (excluded) category.
Often, it makes sense to choose the reference category to be the one you would most want as the comparison.
You can NEVER put all categories into the regression: Stata will omit one.
Example on the next slide.
10
What happens when you add all categories on the right….
. regress realrinc married widow divorced separated nevermarried
note: widow omitted because of collinearity
(regression output, F(4, 34883); the numeric values were lost in extraction. Coefficient rows: married, widow (0, omitted), divorced, separated, nevermarried, _cons)
11
Where is currently married?
What is the difference between nevermarried and currently married? Is it significant?
. regress realrinc i.marital
(regression output, F(4, 34883); the numeric values were lost in extraction. Coefficient rows: widowed, divorced, separated, never married, _cons)
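Currently married has no row because it is the reference (base) category, so each listed coefficient is already a comparison with currently married. A sketch of how to display the base category and how to change it, assuming never married is code 5:

```stata
* Show the base category as an explicit row in the output.
regress realrinc i.marital, baselevels

* Assumption: never married is coded 5.
* ib5. makes category 5 the reference, so every coefficient is now
* "that category versus never married".
regress realrinc ib5.marital
```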
12
What is the difference between nevermarried and widowed?
Is it significant? As a rough check, you can see if the confidence intervals overlap. For a proper test, use Stata’s postestimation tests.
. regress realrinc i.marital
(regression output, F(4, 34883); the numeric values were lost in extraction. Coefficient rows: widowed, divorced, separated, never married, _cons)
13
Testing, combining categories
The lincom command below refers to each category by variable name, which i.marital does not give you, so first make actual indicator variables:
gen widow=marital==2
gen divorced=marital==3
gen separated=marital==4
gen nevermarried=marital==5
regress realrinc widow divorced separated nevermarried
Then, after the regression, test a linear combination:
. lincom widow - nevermarried
RESULTS:
( 1) widow - nevermarried = 0
(lincom output row for (1): Coef., Std. Err., t, P>|t|, 95% Conf. Interval; the numeric values were lost in extraction)
This result tells me that I could combine widow and nevermarried into a single category if I want.
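The sequence above can be sketched from scratch as follows, with two additions that are not on the slide: a guard so that respondents with missing marital status are not coded 0, and the equivalent test command. The combined variable name widow_or_never is a hypothetical choice.

```stata
* Start from a data set that does not yet contain these indicators.
* "if !missing(marital)" leaves the indicator missing when marital is missing.
gen widow        = (marital == 2) if !missing(marital)
gen divorced     = (marital == 3) if !missing(marital)
gen separated    = (marital == 4) if !missing(marital)
gen nevermarried = (marital == 5) if !missing(marital)

regress realrinc widow divorced separated nevermarried

* Two equivalent tests of "widow and nevermarried have the same coefficient".
lincom widow - nevermarried
test widow = nevermarried

* If the difference is not significant, one option is to merge the two groups.
* Assumption: widow_or_never is a hypothetical variable name.
gen widow_or_never = (widow == 1 | nevermarried == 1) if !missing(marital)
regress realrinc widow_or_never divorced separated
```

lincom reports a t-statistic and test reports an F-statistic; for a single restriction the F is the square of the t, so the p-values agree.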
14
When not to control for a variable
I want to know how education affects men’s and women’s belief that marijuana should be legalized.
Grass: indicator variable equal to 1 if the respondent believes marijuana should be legalized.
If I run this:
Grass = b0 + b1 education + b2 income + b3 age…
then the coefficient on education tells us “If someone gets a lot of education but has the same income as another person, how does the education affect grass?”
You might instead want to know “If someone gets a lot of education and as a result has higher income than another person as well as being better educated, how does the education affect grass?” For this, run:
Grass = b0 + b1 education + b3 age…
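The two specifications can be run side by side to see how the education coefficient changes. This is a sketch; the variable names grass, educ, income, and age are assumptions, not names confirmed by the slides.

```stata
* Assumption: grass is a 0-1 indicator; educ, income, age are hypothetical names.

* Education effect holding income constant.
regress grass educ income age
estimates store with_income

* Education effect including whatever works through income.
regress grass educ age
estimates store without_income

* Compare the education coefficient (and its standard error) across the two.
estimates table with_income without_income, keep(educ) b(%9.4f) se
```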
15
If you want to show how adding a group of variables improves explanatory power, you can show both regressions (with and without) and compare adjusted R-sq (or MSE).
(first regression output: realinc on the marital dummies (widowed, divorced, separated, never married), hrsusual, hrssq, _cons; F(6, 1031); the numeric values were lost in extraction)
(second regression output: realrinc on hrsusual, hrssq, _cons; F(2, 959); the numeric values were lost in extraction)
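A sketch of the with-and-without comparison. It adds one step that is an assumption about good practice rather than something the slide states: both regressions are run on the same observations and with the same dependent variable, so that the adjusted R-squared values are comparable.

```stata
* Skip this line if hrssq already exists in your data.
gen hrssq = hrsusual^2

* Regression WITH the group of variables (the marital dummies).
regress realrinc i.marital hrsusual hrssq
display "Adjusted R-squared with marital dummies:    " e(r2_a)

* Remember which observations were used.
* Assumption: insample is a hypothetical variable name.
gen byte insample = e(sample)

* Regression WITHOUT the group, restricted to the same observations.
regress realrinc hrsusual hrssq if insample
display "Adjusted R-squared without marital dummies: " e(r2_a)
```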
16
Nonlinear terms: Some of you did it without thinking about the results
You can NOT put a nonlinear version of Y (the dependent variable) in as an explanatory variable. It is obvious that Y squared will explain Y, but what have you learned from that?
You can’t make an indicator variable into a nonlinear one (squaring a 0-1 variable just gives the same variable back).
When you try out a nonlinear term, be sure to add it into the whole multiple regression. You want to test it controlling for other factors.
17
These results say that at 0, the slope drealinc/dhrs is negative, but the slope becomes more positive.
. regress realinc i.marital hrsusual hrssq
(regression output, F(6, 1031); the numeric values were lost in extraction. Coefficient rows: widowed, divorced, separated, never married, hrsusual, hrssq, _cons)
Take the derivative: drealinc/dhrs = (coefficient on hrsusual) + (coefficient on hrssq)*2*hrs
Evaluate the slope at hrs=0, hrs=10, and hrs=40 (the computed values were lost in extraction).
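The slopes can be computed directly from the stored coefficients instead of by hand. A sketch, assuming hrssq was created as the square of hrsusual:

```stata
* Skip this line if hrssq already exists in your data.
gen hrssq = hrsusual^2

regress realinc i.marital hrsusual hrssq

* Slope = b[hrsusual] + 2 * b[hrssq] * hrs, evaluated at chosen hours.
display "Slope at hrs = 0:  " (_b[hrsusual])
display "Slope at hrs = 10: " (_b[hrsusual] + 2*_b[hrssq]*10)
display "Slope at hrs = 40: " (_b[hrsusual] + 2*_b[hrssq]*40)

* Same slope at hrs = 40, with a standard error and confidence interval
* (2 * 40 = 80 multiplies the coefficient on hrssq).
lincom hrsusual + 80*hrssq
```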
18
Interpreting coefficients when the dependent variable is a dummy
(regression output: dependent variable married; regressors male, realrinc, _cons; F(2, 34889); the numeric values were lost in extraction)
The coefficient on male tells us that a man is 6.7 percentage points more likely than a woman to be married (holding income constant).
But the average probability of being married is .536 (53.6 percent), so relative to that average this is 6.7/53.6 = 12.5 percent different.
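The percentage-point versus percent distinction can be computed after the regression. A sketch using the variable names shown in the output:

```stata
regress married male realrinc

* Average probability of being married among the observations used.
summarize married if e(sample)

* Coefficient on male in percentage points.
display "Percentage-point difference: " (100 * _b[male])

* The same difference relative to the average probability, in percent.
display "Relative difference (percent): " (100 * _b[male] / r(mean))
```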
19
Dropping variables
If you want, you can drop a variable with |t| < 1.
Never drop a variable with |t| > 1 unless that variable should not be in the regression no matter what its t-stat is.
If you keep a variable with |t| < 1 in the regression, don’t discuss its coefficient as if it really were important. Say “X2 has no effect on Y”, or at most “while the sign on X2 is positive, we are very uncertain that it has any effect at all.”
20
Sometimes, a variable can be significant but not particularly important
Sometimes, a variable can have a large t-stat but not really make a large difference to your predicted Y. To see if it does make a large difference, make this calculation:
X coefficient * (highest X value in dataset - lowest X value)
That tells you the maximum amount by which variation in X can change the predicted Y. Or, to be less sensitive to extreme values, use:
X coefficient * (95th percentile X value - 5th percentile X value)
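Both calculations can be sketched in Stata as follows. The example uses hrsusual as X and enters it linearly; that specification is an assumption chosen only to keep the arithmetic simple.

```stata
* Assumption: a specification where hrsusual enters linearly.
regress realrinc hrsusual i.marital

* detail stores the minimum, maximum, and percentiles in r().
summarize hrsusual if e(sample), detail

display "Coefficient * (max - min): " (_b[hrsusual] * (r(max) - r(min)))
display "Coefficient * (p95 - p5):  " (_b[hrsusual] * (r(p95) - r(p5)))
```

Compare either number with the typical size of Y (for example its mean or standard deviation) to judge whether the variable matters in practice.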
21
On writing the project: Focus to client
Components of a good report:
Body of the report:
Introduction of the problem (written to the client, of course)
Perhaps background on the problem (can bring in other sources, put in graphs, etc.)
Describing the data set and variables
Analysis: describing the regression results as intuitively as possible, without using a lot of statistics terms except when absolutely necessary (and perhaps in parentheses or footnotes)
Conclusion
At the beginning of the report:
An executive summary (1 paragraph to 2 pages) with no statistics terms. The assignment reads: “Can an executive who knows little statistics understand what you did and what you found from reading the executive summary?”
22
On writing the project: Focus to client
Graphs and tables can be in the body of the report or at the end. They need to be labeled and referred to (as in “Table 1 shows…” or “Men make more money than women in this occupation, as can be seen in Table 1…”).
23
Style of writing
Do NOT write about your process of what you did (first I did this, then I realized that…, etc.)
Most things should be written in third person, such as:
This report investigates whether…
As can be seen in the first two columns of Table 1, when we do not control for gender or other possibly confounding factors, each hour of work…
Report more than 1 regression if the client learns something from it (not if you learned that it was the wrong way to estimate things).
24
On writing the project: Other criteria
If I were the client, would I feel that this project answered a question I am interested in?
Can an executive who knows little statistics understand what you did and what you found from reading the executive summary?
Are regressions presented in easy to read tables and/or as equations?
Are any graphics you use appropriate and clearly convey information to the reader?
Does the writing develop the ideas in logical order and clearly?
Is the report well-written? Are there English and spelling mistakes? Does the report look professional?
25
Use tables to report several regressions (in your final project report)
26
If you have lots of dummies that are not themselves important, you can just say “All regressions also include…” If only some regressions have these dummies, include a line at the bottom of the table labeled e.g. “Industry dummies”, and put a √ in the column of each regression that includes them. (Use adjusted R-square rather than R-square.)
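One way to get several regressions into side-by-side columns is Stata's estimates store and estimates table. This is a sketch; the stored names m1 and m2 are arbitrary, and the two specifications are only examples built from variables used on earlier slides.

```stata
* Column 1: without the marital dummies.
regress realrinc hrsusual hrssq
estimates store m1

* Column 2: with the marital dummies as controls.
regress realrinc hrsusual hrssq i.marital
estimates store m2

* keep() hides the dummies that are not themselves important;
* stats() adds adjusted R-squared and the number of observations.
estimates table m1 m2, keep(hrsusual hrssq _cons) ///
    b(%9.3f) se(%9.3f) stats(r2_a N)
```

In the final table, add a row such as “Marital status dummies” by hand and mark the column for m2.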
27
How to report single regressions
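VOTESHARE% = b0 + b1 INCUMBENT + b2 GROWTH, written with the estimated values in place of b0, b1, and b2 (the values were lost in extraction) and the t-stats (34.38), (2.38), and (5.31) in parentheses beneath them.
SEE = 4.199   R2 = .627   adj. R2 = .593   N = 25   (t-stats in parentheses)
Or, put standard errors in parentheses.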
28
How to move Stata results into Excel
You want to do this to make tables. You probably want to do this to make graphs.
Copy from the Results window or from the log into Excel.
In Excel on Windows, use Data > Text to Columns to split the pasted text into columns.
29
Criteria: How your project will be judged
Does your project use statistics, including (but not limited to) multiple regression, that are most appropriate to answer your question?
Does it demonstrate a deep understanding of the statistics taught in the course?
Is your data set appropriate to answering the question?
Have you made mistakes in handling missing data, generating variables, or interpreting coefficients?
Does the writing develop the ideas in logical order and clearly?
If I were the client, would I feel that this project answered a question I am interested in?
Can an executive who knows little statistics understand what you did and what you found from reading the executive summary?
Does your report control for and/or discuss possible biases, e.g. due to confounding factors?
Are regressions presented in easy to read tables and/or as equations?
Are any graphics you use appropriate and clearly convey information to the reader?
Is the report well-written? Are there English and spelling mistakes? Does the report look professional?
30
What’s next? Don’t forget the signup
RMO_BLM4/edit?usp=sharing