Published by Virgil Knight. Modified over 6 years ago.
1
QM222 Nov. 7 Section D1
- Multicollinearity
- Regression Tables
- What to do next on your project
QM222 Fall 2016 Section D1
2
Multicollinearity
3
Multicollinearity
Recall that the interpretation of a coefficient in a multiple regression is: the effect on Y of X changing by 1 if the other variables stay the same.
And the t-test tests the null hypothesis: could this coefficient be zero?
Sometimes you run a regression of Y on two very highly correlated variables, like the number of toothbrushes sold and the amount of toothpaste sold in a country in a year.
The t-stats will both be very low, because either coefficient could be zero and the regression would predict approximately the same thing.
But if you drop one of them, the other would become highly significant.
Another example: GDP and unemployment.
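A minimal Stata sketch of the toothbrush and toothpaste case, using simulated data. The variable names, the seed, and every number in the data-generating lines are made up for illustration; none of them come from the course data.

```stata
* Simulated data: all names and numbers here are hypothetical.
clear
set seed 222
set obs 200
gen toothpaste   = rnormal(100, 10)
gen toothbrushes = toothpaste + rnormal(0, 0.5)   // nearly a copy of toothpaste
gen spending     = 2*toothpaste + rnormal(0, 20)  // made-up dependent variable

* How correlated are the two explanatory variables?
correlate toothbrushes toothpaste

* Both together: standard errors are inflated, so |t| tends to be small.
regress spending toothbrushes toothpaste
estat vif    // variance inflation factors; large values flag multicollinearity

* Drop one: the remaining variable's t-stat should rise sharply.
regress spending toothpaste
```

Comparing the two regress outputs side by side is the point: the prediction barely changes, but the standard error on toothpaste should shrink a lot once toothbrushes is gone.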
4
What to do if you find that variables that you believe should be significant are not
If several variables are really measuring the same concept, drop one of them if its |t-stat| is less than ONE.
If you drop a variable with a |t-stat| < 1, the adjusted R-squared increases.
Which do you drop? The one with the lowest |t|. In other words, let the computer tell you which of the two variables you need.
If you are right, the other variable will become more significant.
NEVER DROP MORE THAN ONE VARIABLE AT A TIME. If you do, you might drop BOTH highly correlated variables.
You can test whether two (or more) insignificant variables are jointly significant by writing this after you run the regression:
test varname1 varname2
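As an illustration of the joint test, here is a short sketch using the auto dataset that ships with Stata. The choice of price, mpg, weight and length is only an assumption for the example; substitute your own variables.

```stata
* Example data shipped with Stata; the variables are stand-ins for your own.
sysuse auto, clear

regress price mpg weight length

* Joint test: are the coefficients on weight and length both zero?
test weight length

* The p-value of the joint F-test is stored in r(p).
display "Joint p-value: " r(p)
```

A small joint p-value alongside small individual |t-stats| is the pattern that suggests the variables matter together but are too correlated to separate.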
5
Making Regression Tables (see chapter 19)
6
Use Tables to report several regressions
Your different regressions will have different combinations of variables.
Why present more than 1 regression?
- To develop your ideas.
- Or for different dependent variables (list them in the column title).
7
In footnotes, say which you included.
Include either t-stats or coefficient standard errors in parentheses directly below the coefficient. In footnotes, say which you included.
8
Use asterisks to denote significance
Include the number of observations and at least the adjusted R-squared (and maybe the RMSE, aka SEE). Use asterisks to denote significance.
9
For any set of multiple dummies,
include in footnote what the excluded category is. (here, year1965) Note: If you are using i. for your dummies, Stata might use different reference categories for different regressions.
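One possible way to assemble such a table with built-in Stata commands is sketched below. It uses the auto example dataset and the model names m1 to m3, all of which are assumptions for illustration; the course may expect a different tool or layout (see chapter 19).

```stata
* Example data shipped with Stata; model names m1-m3 are arbitrary.
sysuse auto, clear

regress price weight
estimates store m1

regress price weight mpg
estimates store m2

regress price weight mpg foreign
estimates store m3

* Coefficients with standard errors underneath, plus N, adjusted R-squared, RMSE.
estimates table m1 m2 m3, b(%9.2f) se(%9.2f) stats(N r2_a rmse)

* Same coefficients with significance stars (star cannot be combined with se).
estimates table m1 m2 m3, b(%9.2f) star stats(N r2_a rmse)
```

The legend Stata prints under each table states what the second row and the stars mean, which is the information the footnote should repeat.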
10
What to do next on your project
11
Assignment 6: ideally, have it done by Friday.
Post your current data set under Stata data set (if you can).
Run additional multiple regressions. Specifically:
- Think hard about whether there are additional omitted variables (i.e. confounding factors) that you can measure that are likely to be biasing your key coefficient(s). If you can find data on them, add them into the regressions. (If you really cannot think of anything beyond what you have, just write that.)
- Identify at least one omitted variable that you cannot measure, reason out the sign of the omitted variable bias, and explain here (Ass. 6) in 1-3 sentences why and in what direction it will bias your key coefficient.
12
Assignment 6 cont.
If you have any numeric explanatory (X) variable, add a quadratic term in addition to your other variables to test if this nonlinear specification fits better. (If you are good at math and prefer to add a different nonlinear variable or to make your dependent variable nonlinear, be my guest.)
Explain here (Ass. 6) what you learn from this result (1-3 sentences), and show it if you can (e.g. with a graph).
If you have a numeric explanatory (X) variable that is very skewed, think about whether top-coding or taking the log of that variable is appropriate instead.
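A sketch of the quadratic, log and top-coding steps in Stata, again on the shipped auto dataset. The variables, the new names weight_sq, ln_weight and weight_top, and the 4000 cutoff are all arbitrary choices for the example.

```stata
* Example data shipped with Stata; names and cutoffs are hypothetical.
sysuse auto, clear

* Quadratic term alongside the other variables.
gen weight_sq = weight^2
regress price weight weight_sq mpg

* If weight_sq is significant, the curve has a turning point at -b1/(2*b2).
display "Turning point: " -_b[weight] / (2*_b[weight_sq])

* Quick picture of the bivariate quadratic fit (no other controls).
twoway (scatter price weight) (qfit price weight)

* Alternatives for a very skewed X: take the log, or top-code it.
gen ln_weight  = ln(weight)
gen weight_top = min(weight, 4000)   // values above 4000 are set to 4000
```

Note that qfit draws the quadratic of price on weight alone, so it is a visual aid and not a plot of the multiple regression above.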
13
Another approach if you think the relationship between Y and X is really really nonlinear
You could try a set of dummy variables for different ranges of the variable, even though it is a numerical variable.
Only use this approach if you believe that the relationship between Y and X changes so much at every value that it can't be estimated as a quadratic (or cubic, etc.).
Education is sometimes better as a set of dummies.
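One way to build range dummies from a numeric variable is sketched here, assuming the auto example data and arbitrary cut points of 10, 20, 30 and 50 for mpg.

```stata
* Example data shipped with Stata; the cut points are arbitrary.
sysuse auto, clear

* Group mpg into ranges [10,20), [20,30), [30,50); each group is labeled
* by its lower bound.
egen mpg_cat = cut(mpg), at(10 20 30 50)
tab mpg_cat

* i. turns the groups into dummies; the lowest group is the reference category.
regress price i.mpg_cat weight
```

Each coefficient on mpg_cat is then the difference in price between that range and the lowest range, holding weight constant.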
14
Not in Assignment 6 If you have a very skewed Y (dependent) variable
- Try top-coding it (if you think that once it reaches a quite high level, it doesn't matter how much higher it gets).
- Try changing it into an indicator variable.
- Try estimating the median Y, replacing regress with qreg.
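The three options could look like this in Stata. This is a sketch on the auto example data; treating price as the skewed Y and using 10000 as the cutoff are assumptions made only for illustration.

```stata
* Example data shipped with Stata; the 10000 cutoff is hypothetical.
sysuse auto, clear

* Option 1: top-code Y at 10000.
gen price_top = min(price, 10000)
regress price_top weight mpg

* Option 2: turn Y into a 0-1 indicator.
gen expensive = (price > 10000) if !missing(price)
regress expensive weight mpg

* Option 3: model the median of Y instead of the mean.
qreg price weight mpg
```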
15
Assignment 6 cont.
Think about whether you can and should use an interaction term. (This will be most useful if you think that different groups have different slopes.)
Try at least one in a multiple regression (with all your other variables as well). Copy and paste the results here (PS 6).
Explain here what you learn from this interaction term result (1-3 sentences).
16
Review interaction terms:
If we think that the effect of X1 on Y depends on an indicator variable X2 (e.g. scifi), the simplest way to model this in a regression is:
- Make an additional variable by multiplying X1 * X2.
- Make an additional variable by multiplying X1 * (1-X2), recalling that (1-X2) is 1 if X2=0.
- Run a regression of Y on 3 variables:
  X1*X2: this is X1 for observations where X2=1
  X1*(1-X2): this is X1 for observations where X2=0
  X2: this is the indicator itself
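The recipe above can be sketched in Stata as follows. The auto example data stands in for the movie data, with weight playing the role of X1 and the 0-1 variable foreign playing the role of X2; both choices are assumptions for illustration.

```stata
* Example data shipped with Stata: weight is X1, foreign (0/1) is X2.
sysuse auto, clear

gen wt_foreign  = weight * foreign        // X1 where X2 = 1, else 0
gen wt_domestic = weight * (1 - foreign)  // X1 where X2 = 0, else 0

* Each group gets its own slope; foreign shifts the intercept.
regress price wt_foreign wt_domestic foreign

* Are the two slopes statistically different from each other?
test wt_foreign = wt_domestic
```

The coefficient on wt_foreign is the slope for foreign cars and the coefficient on wt_domestic is the slope for domestic cars, so the two can be read directly off the table.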
17
Graph of this model
[Figure: Revenues plotted against Budget, with separate lines for SciFi movies and Other movies]
18
Interaction terms with numeric variables (for those who dare):
If we think that the effect of X1 on Y depends on a different numeric variable X2, the simplest way to model this in a regression is:
- Make an additional variable by multiplying X1 * X2.
- Run a regression of Y on 3 variables: X1, X2, X1*X2.
So Y = b0 + b1 X1 + b2 X2 + b3 X1*X2
Note that dY/dX1 = b1 + b3 X2
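To see what dY/dX1 = b1 + b3 X2 means in practice, here is a sketch that evaluates the slope at two values of X2. It assumes the auto example data, with weight as X1 and mpg as X2; the evaluation points 20 and 30 are arbitrary.

```stata
* Example data shipped with Stata: weight is X1, mpg is X2.
sysuse auto, clear

gen wt_mpg = weight * mpg
regress price weight mpg wt_mpg

* Slope of price with respect to weight when mpg = 20: b1 + b3*20
lincom weight + 20*wt_mpg

* Same slope when mpg = 30: b1 + b3*30
lincom weight + 30*wt_mpg
```

lincom also reports a standard error and confidence interval for each slope, which the hand calculation b1 + b3 X2 does not give you.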
19
More generally, ask yourself if your regressions are really answering the question….
I like sophisticated approaches if you are using them correctly and if they are the most appropriate way to answer your question.
20
Assignment 6 cont.
Decide which is the best regression or set of regressions that you will use in your project.
Update your Current Project Status, including replacing/adding these regressions to Question 7.
Also answer Question 9, which asks for the conclusions of your project as it now stands.
The more fully you answer Questions 7 and 9, the better feedback I can give you at your required meeting #2.
21
More things to be careful about
22
Make sure indicator variables are 0-1 and named correctly
. tab sex
[Stata output: frequency table of respondents' sex with rows male, female, Total]
. tab sex, nolabel
[Stata output: the same table showing the underlying codes 1 and 2]
. gen male=sex
. replace male=0 if sex==2
(33313 real changes made)
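The same recoding can be written in one line, which also keeps missing values missing. The tiny dataset below is typed in by hand purely for illustration; the codes 1 = male and 2 = female follow the slide, and the label name sexlbl is made up.

```stata
* Hypothetical five-row dataset, only to make the example self-contained.
clear
input sex
1
2
2
1
.
end
label define sexlbl 1 "male" 2 "female"
label values sex sexlbl

tab sex              // shows the labels
tab sex, nolabel     // shows the underlying codes 1 and 2

* 1 if male, 0 if female, missing if sex is missing.
gen male = (sex == 1) if !missing(sex)

* Check the new indicator against the original variable.
tab sex male, missing
```

Naming the indicator after the category that equals 1 (male) makes the sign of its coefficient easy to interpret.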
23
When not to control for a variable
I want to know how education affects men's and women's belief that people should legalize pot.
Grass: indicator variable equal to 1 if the respondent believes marijuana should be legalized.
If I run this regression:
Grass = b0 + b1 education + b2 income + b3 age…
then the coefficient on education tells us: "If someone gets a lot of education but has the same income as another person, how does the education affect grass?"
You might instead want to know: "If someone gets a lot of education and as a result has higher income than another person, as well as being better educated, how does the education affect grass?"
For this, leave income out and run:
Grass = b0 + b1 education + b3 age…
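A simulated sketch of the difference between the two regressions. Everything here is invented: the variable outcome is a continuous stand-in for Grass, and the coefficients 0.5, 0.3 and 2 are chosen only so the arithmetic is easy to follow.

```stata
* Simulated data: all names and numbers are hypothetical.
clear
set seed 222
set obs 1000
gen education = rnormal(13, 3)
gen income    = 2*education + rnormal(0, 5)                   // education raises income
gen outcome   = 0.5*education + 0.3*income + rnormal(0, 5)

* Controlling for income: education's effect holding income fixed
* (about 0.5 by construction).
regress outcome education income

* Not controlling for income: education's total effect, including the part
* that works through income (about 0.5 + 0.3*2 = 1.1 by construction).
regress outcome education
```

Neither regression is wrong; they answer different questions, which is why the choice of controls has to follow from the question being asked.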
24
Some misunderstandings on multiple dummies
You cannot use categorical variables as numbers. This includes:
- Marital status
- Work status
Each coefficient is that category versus the reference (excluded) category.
Often, it makes sense to choose the reference category to be the one you would most want as the comparison.
You can NEVER put all categories into the regression. Stata will omit one.
Example on the next page.
QM222 Fall 2015 Section D1
25
What happens when you add all categories on the right….
. regress realrinc married widow divorced separated nevermarried
note: widow omitted because of collinearity
[Stata output: F(4, 34883); coefficient rows for married, widow (0, omitted), divorced, separated, nevermarried, and _cons]
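The same behavior can be reproduced on the auto example data, where rep78 (a 1 to 5 repair rating) stands in for marital status. The dataset and variable are assumptions for illustration only.

```stata
* Example data shipped with Stata; rep78 stands in for a categorical variable.
sysuse auto, clear

* Create one dummy per category: rep1, rep2, rep3, rep4, rep5.
tab rep78, gen(rep)

* All five dummies plus a constant are collinear, so Stata omits one
* and prints a note saying which.
regress price rep1 rep2 rep3 rep4 rep5

* Better: let i. handle it and choose the reference category yourself.
* ib3. makes category 3 the excluded (reference) category.
regress price ib3.rep78
```

Which dummy Stata drops in the first regression is its choice, not yours, which is why picking the reference category explicitly is usually clearer.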
26
When you have categorical variables
You cannot use categorical variables as numbers. This includes marital status, work status, etc.
Don't use LOTS of dummies for important variables (whose coefficients you want to understand). If you have a categorical variable with more than 10 categories, try to combine them into broader categories.
It's okay to use dummies with more than 10 (or so) categories when they are control variables that you don't plan to discuss or report (e.g. occupation).
27
Note: currently married is the reference category.
What is the difference between never married and currently married? Is it significant?
. regress realrinc i.marital
[Stata output: F(4, 34883); coefficient rows for marital categories widowed, divorced, separated, never married, and _cons]
28
If the coefficients on two categories are within 1 se of each other, you might consider combining them. You can calculate the 68% confidence intervals and see if they overlap.
. regress realrinc i.marital
[Stata output: F(4, 34883); coefficient rows for marital categories widowed, divorced, separated, nevermarried, and _cons]
29
Or you can see if the coefficients on 2 categories (of the same thing) are similar by using Stata lincom tests.
Rather than i.marital, first make actual indicator variables:
gen widow=marital==2
gen divorced=marital==3
gen separated=marital==4
gen nevermarried=marital==5
regress realrinc widow divorced separated nevermarried
Then, after the regression, test a linear combination:
. lincom widow - nevermarried
RESULTS:
( 1) widow - nevermarried = 0
[Stata output: one row (1) with Coef., Std. Err., t, P>|t| and the 95% confidence interval]
This result tells me that I could combine widow and nevermarried into a single category if I want, since |t|<1.
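A runnable version of the same idea on the auto example data, with rep78 standing in for marital. The dataset, the indicator names rep3 to rep5, and the choice to leave ratings 1 and 2 together as the reference group are all assumptions for illustration.

```stata
* Example data shipped with Stata; rep78 stands in for marital.
sysuse auto, clear

* Indicators that stay missing when rep78 is missing.
gen rep3 = (rep78 == 3) if !missing(rep78)
gen rep4 = (rep78 == 4) if !missing(rep78)
gen rep5 = (rep78 == 5) if !missing(rep78)

* Reference group: cars with rep78 equal to 1 or 2.
regress price rep3 rep4 rep5

* Is the rep3 coefficient different from the rep4 coefficient?
lincom rep3 - rep4

* |t| of the difference; below 1 suggests the two could be combined.
display "abs(t) = " abs(r(estimate) / r(se))
```

Adding the if !missing() condition matters here: without it, observations with a missing category would be coded 0 and silently join the reference group.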
30
Some of you are getting close to writing your project up.
Do this ONLY after you meet with me. Everyone needs to meet with me after they think they have the results they want to present. If that is you, make an appointment.
31
What should the paper look like?
Put yourself in the client's mind as they are reading it.
- Introduction: motivate the paper. Address the client. Why is it interesting to them?
- Be sure to describe your data and data sources.
- Be sure to develop your ideas and have a logical train of thought.
- It needs to look professional and the English needs to be correct.
- After you finish the paper, make an executive summary that an executive can read INSTEAD of the paper. It will repeat ideas and sentences from the introduction and conclusion, for sure. It should be understandable to someone who knows no statistics.
MORE on this later.