QM222 Nov. 7 Section D1 Multicollinearity Regression Tables What to do next on your project QM222 Fall 2016 Section D1.


Multicollinearity

Multicollinearity
- Recall the interpretation of a coefficient in a multiple regression: the effect on Y of X changing by 1 if the other variables stay the same.
- The t-test tests the null hypothesis: could this coefficient be zero?
- Sometimes you run a regression of Y on two very highly correlated variables, like the number of toothbrushes sold and the amount of toothpaste sold in a country in a year.
- Both t-statistics will be very low, because either coefficient could be zero and the regression would still predict approximately the same thing.
- But if you drop one of them, the other would become highly significant.
- E.g. GDP and unemployment.
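The toothbrush/toothpaste situation can be reproduced with made-up data. The sketch below is only an illustration: the variable names x1, x2 and y, the seed, and the sample size are assumptions, not course data, and estat vif is a standard Stata diagnostic that the slides do not mention.

```stata
* Hypothetical simulated data: x2 is almost a copy of x1
clear
set seed 222
set obs 200
gen x1 = rnormal()
gen x2 = x1 + 0.05*rnormal()
gen y  = 1 + x1 + x2 + rnormal()

* How correlated are the two explanatory variables?
correlate x1 x2

* Both variables together: expect large standard errors on each coefficient
regress y x1 x2

* Variance inflation factors for the regression just run
estat vif

* Drop one of the two: the remaining variable should look far more significant
regress y x1
```

Comparing the standard error on x1 in the two regressions shows how much precision is lost when a near-duplicate variable is included.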

What to do if you find that variables that you believe should be significant are not
- If several variables are really measuring the same concept, drop one of them if its |t-stat| is less than ONE. If you drop a variable with a |t-stat| < 1, the adjusted R-squared increases.
- Which do you drop? The one with the lowest |t|. In other words, let the computer tell you which of the two variables you need. If you are right, the other variable will become more significant.
- NEVER DROP MORE THAN ONE VARIABLE AT A TIME. If you do, you might drop BOTH highly correlated variables.
- You can test whether two (or more) insignificant variables are jointly significant by writing this after you run the regression:
    test varname1 varname2
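As a rough sketch of this workflow, the commands below use Stata's built-in auto data set, where weight and length both measure car size. The choice of data set and variables is an assumption made for illustration; whether any |t-stat| is actually below 1 here is something to read off the output, not something assumed.

```stata
sysuse auto, clear

* weight and length are two measures of nearly the same concept
correlate weight length

* Regression with both; note each |t| and the adjusted R-squared
regress price weight length mpg
display "Adjusted R-squared: " e(r2_a)

* Joint test: are weight and length together significant?
test weight length

* Purely to show the mechanics of dropping ONE variable and rerunning
regress price weight mpg
display "Adjusted R-squared: " e(r2_a)
```

The two display lines make it easy to check the claim that adjusted R-squared rises only when the dropped variable had |t| below 1.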

Making Regression Tables (see chapter 19)

Use tables to report several regressions
- Your different regressions will have different combinations of variables.
- Why present more than one regression? To develop your ideas, or to show different dependent variables (list the dependent variable in the column title).

Include either t-stats or coefficient standard errors in parentheses directly below the coefficient. In footnotes, say which you included.

Use asterisks to denote significance. Include the number of observations and at least the adjusted R-squared (and maybe the RMSE, also known as the SEE).

For any set of multiple dummies, include in a footnote what the excluded category is (here, year1965). Note: if you are using i. for your dummies, Stata might use different reference categories for different regressions.
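One possible way to assemble such a table inside Stata is with the built-in estimates store and estimates table commands. This is a sketch under assumptions: it uses the built-in auto data set, the model names m1 to m3 are made up, and the method in chapter 19 may well be different. The ib3. prefix fixes the reference category of rep78 at 3 so that it cannot change between regressions.

```stata
sysuse auto, clear

regress price weight
estimates store m1

regress price weight mpg
estimates store m2

* ib3. makes category 3 of rep78 the excluded (reference) category
regress price weight mpg ib3.rep78
estimates store m3

* Coefficients with standard errors, plus N, adjusted R-squared and RMSE
estimates table m1 m2 m3, b(%9.2f) se(%9.2f) stats(N r2_a rmse)

* Same models with significance stars (star cannot be combined with se)
estimates table m1 m2 m3, b(%9.2f) star stats(N r2_a rmse)
```

The output still has to be copied into the paper and given footnotes stating what is in parentheses and which category is excluded.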

What to do next on your project

Assignment 6
- Ideally, have it done by Friday.
- Post your current data set under Stata data set (if you can).
- Run additional multiple regressions. Specifically:
- Think hard about whether there are additional omitted variables (i.e. confounding factors) that you can measure that are likely to be biasing your key coefficient(s). If you can find data on them, add them to the regressions. (If you really cannot think of anything beyond what you have, just write that.)
- Identify at least one omitted variable that you cannot measure, reason out the sign of the omitted variable bias, and explain here (Ass. 6) in 1-3 sentences why and in what direction it will bias your key coefficient.

Assignment 6 cont.
- If you have any numeric explanatory (X) variable, add a quadratic term in addition to your other variables to test whether this nonlinear specification fits better. (If you are good at math and prefer to add a different nonlinear variable or to make your dependent variable nonlinear, be my guest.)
- Explain here (Ass. 6) what you learn from this result (1-3 sentences), and show it (e.g. with a graph).
- If you have a numeric explanatory (X) variable that is very skewed, think about whether top-coding it or taking its log is appropriate instead.
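The quadratic, log, and top-coding options might look like the following in Stata. This is a minimal sketch using the built-in auto data set; the variables, the new variable names (weight2, lnprice, price_tc), and the top-code cutoff of 10000 are all assumptions chosen for illustration.

```stata
sysuse auto, clear

* Quadratic term: does mpg fall with weight at a changing rate?
gen weight2 = weight^2
regress mpg weight weight2

* Where the fitted curve turns: -b1/(2*b2)
display "Turning point at weight = " -_b[weight]/(2*_b[weight2])

* Skewed X, option 1: take the log
gen lnprice = ln(price)
regress mpg lnprice

* Skewed X, option 2: top-code at an arbitrary cutoff
gen price_tc = price
replace price_tc = 10000 if price > 10000 & !missing(price)
regress mpg price_tc
```

The t-stat on weight2 indicates whether the curvature matters; if the turning point lies outside the range of the data, the relationship does not change direction within the sample.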

Another approach if you think the relationship between Y and X is really, really nonlinear
- You could try a set of dummy variables for different ranges of the variable, even though it is a numerical variable.
- Only use this approach if you believe that the relationship between Y and X changes so much at every value that it can't be estimated as a quadratic (or cubic, etc.).
- Education is sometimes better as a set of dummies.
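A sketch of the range-dummy approach, again on the built-in auto data set. The cutoffs 2500 and 3500 and the name wcat are arbitrary assumptions; irecode() is just one of several ways to build the ranges.

```stata
sysuse auto, clear

* wcat = 0 if weight <= 2500, 1 if 2500 < weight <= 3500, 2 if weight > 3500
gen wcat = irecode(weight, 2500, 3500)
tab wcat

* One dummy per range; the lowest range is the reference category
regress mpg i.wcat
```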

Not in Assignment 6: if you have a very skewed Y (dependent) variable
- Try top-coding it (if you think that once it reaches a quite high level, it doesn't matter how much higher it gets).
- Try changing it into an indicator variable.
- Try estimating the median of Y, replacing regress with qreg.
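The three options for a skewed Y could be tried as below. This is an illustrative sketch on the built-in auto data set; the cutoffs 12000 and 6000 and the variable names price_tc and expensive are assumptions.

```stata
sysuse auto, clear

* Option 1: top-code Y at an arbitrary high level
gen price_tc = min(price, 12000) if !missing(price)
regress price_tc weight foreign

* Option 2: turn Y into a 0/1 indicator
gen expensive = price > 6000 if !missing(price)
regress expensive weight foreign

* Option 3: median regression, replacing regress with qreg
qreg price weight foreign
```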

Assignment 6 cont.
- Think about whether you can and should use an interaction term. (This will be most useful if you think that different groups have different slopes.)
- Try at least one in a multiple regression (with all your other variables as well). Copy and paste the result here (PS 6).
- Explain here what you learn from this interaction term result (1-3 sentences).

Review interaction terms: if we think that the effect of X1 on Y depends on a different indicator variable X2 (e.g. scifi)
The simplest way to model this in a regression is:
- Make an additional variable by multiplying X1 * X2.
- Make an additional variable by multiplying X1 * (1-X2), recalling that (1-X2) is 1 if X2 = 0.
- Run a regression of Y on 3 variables:
    X1*X2       this is X1 for observations where X2 = 1
    X1*(1-X2)   this is X1 for observations where X2 = 0
    X2          this is X2
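The recipe above can be sketched in Stata with the built-in auto data set, treating weight as X1 and the 0/1 variable foreign as X2. The data set and the names w_foreign and w_domestic are assumptions for illustration; the final test command is an addition that checks whether the two slopes differ.

```stata
sysuse auto, clear

* X1*X2: weight for foreign cars, 0 otherwise
gen w_foreign = weight*foreign

* X1*(1-X2): weight for domestic cars, 0 otherwise
gen w_domestic = weight*(1-foreign)

* Each group gets its own slope; foreign shifts the intercept
regress price w_foreign w_domestic foreign

* Are the two slopes significantly different from each other?
test w_foreign = w_domestic
```

With this setup each slope coefficient can be read directly as the effect of weight within its own group.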

Graph of this model
[Figure: Revenues (vertical axis) against Budget (horizontal axis), with separate lines for SciFi movies and other movies.]

Interaction terms with numeric variables (for those who dare): if we think that the effect of X1 on Y depends on a different numeric variable X2
The simplest way to model this in a regression is:
- Make an additional variable by multiplying X1 * X2.
- Run a regression of Y on 3 variables: X1, X2, and X1*X2.
- So Y = b0 + b1 X1 + b2 X2 + b3 X1*X2
- Note that dY/dX1 = b1 + b3 X2, so the slope on X1 is different at each value of X2.
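A sketch of the numeric interaction, using the built-in auto data set with weight as X1 and mpg as X2. The variable name wXmpg and the evaluation points mpg = 20 and mpg = 30 are assumptions. The lincom lines compute dY/dX1 = b1 + b3 X2 at those two values of X2, with a standard error for each.

```stata
sysuse auto, clear

* Interaction variable X1*X2
gen wXmpg = weight*mpg

* Y = b0 + b1*X1 + b2*X2 + b3*X1*X2
regress price weight mpg wXmpg

* Slope on weight when mpg = 20: b1 + b3*20
lincom weight + 20*wXmpg

* Slope on weight when mpg = 30: b1 + b3*30
lincom weight + 30*wXmpg
```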

More generally, ask yourself if your regressions are really answering the question. I like sophisticated approaches if you are using them correctly and if they are the most appropriate way to answer your question.

Assignment 6 cont.
- Decide which is the best regression or set of regressions that you will use in your project.
- Update your Current Project Status, including replacing/adding these regressions in Question 7.
- Also answer Question 9, which asks for the conclusions of your project as it now stands.
- The more fully you answer Questions 7 and 9, the better feedback I can give you at your required meeting #2.

More things to be careful about

Make sure indicator variables are 0-1 and named correctly
    . tab sex
    respondents |
            sex |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           male |     26,286       44.10       44.10
         female |     33,313       55.90      100.00
          Total |     59,599      100.00
    . tab sex, nolabel
              1 |     26,286       44.10       44.10
              2 |     33,313       55.90      100.00
    . gen male=sex
    . replace male=0 if sex==2
    (33313 real changes made)
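The same 0-1 indicator can also be built in one line and then checked against the original variable. The sketch below types in a tiny made-up data set (five hypothetical values of sex, one of them missing) so that it runs on its own; the `if !missing(sex)` part keeps missing values missing instead of turning them into 0.

```stata
* Tiny hypothetical data set: 1 = male, 2 = female, . = missing
clear
input sex
1
2
2
1
.
end

* 1 if male, 0 if female, missing if sex is missing
gen male = (sex == 1) if !missing(sex)

* Check the coding: every sex value should map to exactly one male value
tab sex male, missing
```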

When not to control for a variable
- I want to know how education affects men's and women's belief that pot should be legalized. Grass: indicator variable equal to 1 if the respondent believes marijuana should be legalized.
- If I run this regression:
    Grass = b0 + b1 education + b2 income + b3 age…
  then the coefficient on education tells us: “If someone gets a lot of education but has the same income as another person, how does the education affect grass?”
- You might instead want to know: “If someone gets a lot of education and as a result has a higher income than another person, as well as being better educated, how does the education affect grass?” For this, leave income out and run:
    Grass = b0 + b1 education + b3 age…

Some misunderstandings on multiple dummies
- You cannot use categorical variables as numbers. This includes marital status and work status.
- Each coefficient is that category versus the reference (excluded) category.
- Often, it makes sense to choose as the reference category the one you most want to compare the others against.
- You can NEVER put all categories into the regression. If you try, Stata will omit one. Example on the next slide.

What happens when you add all categories on the right….
    . regress realrinc married widow divorced separated nevermarried
    note: widow omitted because of collinearity
          Source |       SS       df       MS              Number of obs =   34888
    -------------+------------------------------           F(  4, 34883) =  187.50
           Model |  5.9751e+11     4  1.4938e+11           Prob > F      =  0.0000
        Residual |  2.7791e+13 34883   796694311           R-squared     =  0.0210
    -------------+------------------------------           Adj R-squared =  0.0209
           Total |  2.8389e+13 34887   813730067           Root MSE      =   28226
    ------------------------------------------------------------------------------
        realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         married |   9009.761   814.9317    11.06   0.000     7412.468    10607.05
           widow |          0  (omitted)
        divorced |    6124.18   882.4919     6.94   0.000     4394.467    7853.892
       separated |   1753.797   1126.321     1.56   0.119    -453.8281    3961.422
    nevermarried |  -553.9949   846.5664    -0.65   0.513    -2213.292    1105.302
           _cons |   16458.18   788.6264    20.87   0.000     14912.45    18003.92

When you have categorical variables
- You cannot use categorical variables as numbers. This includes marital status, work status, etc.
- Don't use LOTS of dummies for important variables (whose coefficients you want to understand). If you have a categorical variable with more than 10 categories, try to combine them into broader categories.
- It's okay to include dummies for more than 10 (or so) categories when they are control variables that you don't plan to discuss or report (e.g. occupation).

Note: currently married is the reference category. What is the difference between nevermarried and currently married? Is it significant?
    . regress realrinc i.marital
          Source |       SS       df       MS              Number of obs =   34888
    -------------+------------------------------           F(  4, 34883) =  187.50
           Model |  5.9751e+11     4  1.4938e+11           Prob > F      =  0.0000
        Residual |  2.7791e+13 34883   796694311           R-squared     =  0.0210
    -------------+------------------------------           Adj R-squared =  0.0209
           Total |  2.8389e+13 34887   813730067           Root MSE      =   28226
    --------------------------------------------------------------------------------
          realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ---------------+----------------------------------------------------------------
           marital |
           widowed |  -9009.761   814.9317   -11.06   0.000    -10607.05   -7412.468
          divorced |  -2885.581   446.1419    -6.47   0.000    -3760.033   -2011.128
         separated |  -7255.963   829.9696    -8.74   0.000     -8882.73   -5629.196
     never married |  -9563.755   370.0341   -25.85   0.000    -10289.03   -8838.477
                   |
             _cons |   25467.94   205.3829   124.00   0.000     25065.39     25870.5

If coefficients on two categories are within 1 se of each other, you might consider combining them. You can calculate the 68% confidence intervals and see if they overlap.
    . regress realrinc i.marital
          Source |       SS       df       MS              Number of obs =   34888
    -------------+------------------------------           F(  4, 34883) =  187.50
           Model |  5.9751e+11     4  1.4938e+11           Prob > F      =  0.0000
        Residual |  2.7791e+13 34883   796694311           R-squared     =  0.0210
    -------------+------------------------------           Adj R-squared =  0.0209
           Total |  2.8389e+13 34887   813730067           Root MSE      =   28226
    --------------------------------------------------------------------------------
          realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ---------------+----------------------------------------------------------------
           marital |
           widowed |  -9009.761   814.9317   -11.06   0.000    -10607.05   -7412.468
          divorced |  -2885.581   446.1419    -6.47   0.000    -3760.033   -2011.128
         separated |  -7255.963   829.9696    -8.74   0.000     -8882.73   -5629.196
      nevermarried |  -9563.755   370.0341   -25.85   0.000    -10289.03   -8838.477
                   |
             _cons |   25467.94   205.3829   124.00   0.000     25065.39     25870.5
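Stata can print 68% confidence intervals directly through the level() option of regress, which saves computing coefficient plus or minus one standard error by hand. A minimal sketch on the built-in auto data set (the choice of price and rep78 is an assumption for illustration):

```stata
sysuse auto, clear

* Confidence intervals are now roughly coefficient +/- 1 standard error
regress price i.rep78, level(68)
```

Categories whose 68% intervals overlap are candidates for combining.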

Or you can see if the coefficients on 2 categories (of the same thing) are similar by using Stata lincom tests
Rather than using i.marital, first make actual indicator variables:
    gen widow=marital==2
    gen divorced=marital==3
    gen separated=marital==4
    gen nevermarried=marital==5
    regress realrinc widow divorced separated nevermarried
Then, after the regression, test a linear combination:
    . lincom widow - nevermarried
RESULTS:
     ( 1)  widow - nevermarried = 0
    ------------------------------------------------------------------------------
        realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             (1) |   553.9949   846.5298     0.65   0.513    -1105.231     2213.22
This result tells me that I could combine widow and nevermarried into a single category if I want, since |t| < 1.
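The same steps, followed by actually combining two categories, might look like this on the built-in auto data set. Everything here is an illustrative assumption: the repair-record variable rep78 stands in for marital status, the indicator names are made up, and whether |t| is really below 1 has to be read from the lincom output before combining.

```stata
sysuse auto, clear
drop if missing(rep78)

* Actual indicator variables (reference category: rep78 of 1 or 2)
gen rep3 = rep78 == 3
gen rep4 = rep78 == 4
gen rep5 = rep78 == 5
regress price rep3 rep4 rep5

* Is the coefficient on rep4 different from the coefficient on rep5?
lincom rep4 - rep5

* If |t| < 1, the two categories could be combined into one indicator
gen rep45 = rep4 + rep5
regress price rep3 rep45
```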

Some of you are getting close to writing your project up. Do this ONLY after you meet with me. Everyone needs to meet with me after they think they have the results they want to present. If that is you, make an appointment.

What should the paper look like?
- Put yourself in the client's mind as they are reading it.
- Introduction: motivate the paper. Address the client. Why is it interesting to them?
- Be sure to describe your data and data sources.
- Be sure to develop your ideas and have a logical train of thought.
- It needs to look professional and the English needs to be correct.
- After you finish the paper, write an executive summary that an executive can read INSTEAD of the paper. It will repeat ideas and sentences from the introduction and conclusion, for sure. It should be understandable to someone who knows no statistics. MORE on this later.