QM222 Class 19 Section D1 Tips on your Project


QM222 Class 19 Section D1 Tips on your Project QM222 Fall 2015 Section D1

Your draft will not be complete without:
An updated data set that I can read (use saveold). Do this NOW if you haven’t already.
A “sum” (summarize) of all variables used anywhere in the draft, attached at the end.
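A minimal sketch of both steps, assuming hypothetical file and variable names (myproject.dta, realrinc, hrsusual, male); substitute your own:

```stata
* Hypothetical file name; load your own project data.
use "myproject.dta", clear
* Save a copy in an older .dta format that earlier Stata versions can read.
saveold "myproject_old.dta", replace
* Summarize every variable used in the draft: Obs, Mean, Std. Dev., Min, Max.
summarize realrinc hrsusual male
```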

Criteria: How your project will be judged
Does your project use statistics, including (but not limited to) multiple regression, that are most appropriate to answer your question?
Does it demonstrate a deep understanding of the statistics taught in the course?
Is your data set appropriate to answering the question?
Have you made mistakes in handling missing data, generating variables, or interpreting coefficients?

Some regression misunderstandings and tips

You need to understand what your results are telling you. You need to think, reread the book, etc.

Make sure indicator variables are 0-1 and named correctly

. tab sex

respondents |
        sex |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |     26,286       44.10       44.10
     female |     33,313       55.90      100.00
      Total |     59,599      100.00

. tab sex, nolabel

          1 |     26,286       44.10       44.10
          2 |     33,313       55.90      100.00

. gen male=sex
. replace male=0 if sex==2
(33313 real changes made)
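As an alternative to the gen/replace pair above, the indicator can be built in one line. This is only a sketch, and it assumes sex is coded 1 = male and 2 = female, as the nolabel tabulation above shows:

```stata
* 1 if male, 0 if female, and missing wherever sex itself is missing.
gen male = (sex == 1) if !missing(sex)
* Cross-check the new variable against the original coding.
tab sex male, missing
```

The cross-tabulation should show every sex==1 observation under male==1 and every sex==2 observation under male==0.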

When to use dummies (i.occupation), and when not
If you have a numerical variable (like hours worked or height), use it as a numerical variable, UNLESS you believe that the relationship between Y and X changes so much at every value that it can’t be estimated as a nonlinear relationship.

When you have categorical variables
You cannot use categorical variables as numbers. This includes marital status, work status, etc.
Don’t use LOTS of dummies for important variables (whose coefficients you want to understand). If you have a categorical variable with more than 10 categories, try to combine them into broader categories.
It’s okay to include dummies for a variable with more than 10 (or so) categories when they are control variables that you don’t plan to discuss or report (e.g. occupation).

Some misunderstandings: interpreting multiple dummies (Chapter 10)
Each coefficient is that variable versus the reference (excluded) category.
Often, it makes sense to choose as the reference category the one you would most want as the comparison.
You can NEVER put all categories into the regression; Stata will omit one. Example on the next slide.

What happens when you add all categories on the right…

. regress realrinc married widow divorced separated nevermarried
note: widow omitted because of collinearity

      Source |       SS       df       MS              Number of obs =   34888
-------------+------------------------------           F(  4, 34883) =  187.50
       Model |  5.9751e+11     4  1.4938e+11           Prob > F      =  0.0000
    Residual |  2.7791e+13 34883   796694311           R-squared     =  0.0210
-------------+------------------------------           Adj R-squared =  0.0209
       Total |  2.8389e+13 34887   813730067           Root MSE      =   28226

------------------------------------------------------------------------------
    realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     married |   9009.761   814.9317    11.06   0.000     7412.468    10607.05
       widow |          0  (omitted)
    divorced |    6124.18   882.4919     6.94   0.000     4394.467    7853.892
   separated |   1753.797   1126.321     1.56   0.119    -453.8281    3961.422
nevermarried |  -553.9949   846.5664    -0.65   0.513    -2213.292    1105.302
       _cons |   16458.18   788.6264    20.87   0.000     14912.45    18003.92

Where is currently married? What is the difference between nevermarried and currently married? Is it significant?

. regress realrinc i.marital

      Source |       SS       df       MS              Number of obs =   34888
-------------+------------------------------           F(  4, 34883) =  187.50
       Model |  5.9751e+11     4  1.4938e+11           Prob > F      =  0.0000
    Residual |  2.7791e+13 34883   796694311           R-squared     =  0.0210
-------------+------------------------------           Adj R-squared =  0.0209
       Total |  2.8389e+13 34887   813730067           Root MSE      =   28226

--------------------------------------------------------------------------------
      realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
       marital |
      widowed  |  -9009.761   814.9317   -11.06   0.000    -10607.05   -7412.468
     divorced  |  -2885.581   446.1419    -6.47   0.000    -3760.033   -2011.128
    separated  |  -7255.963   829.9696    -8.74   0.000     -8882.73   -5629.196
never married  |  -9563.755   370.0341   -25.85   0.000    -10289.03   -8838.477
               |
         _cons |   25467.94   205.3829   124.00   0.000     25065.39     25870.5
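In the output above, currently married does not appear because it is the omitted reference category, so each coefficient is a difference from married. If another comparison is more useful, the base category can be changed with factor-variable notation. A sketch, assuming never married is coded 5 in marital (as the gen commands in the "Testing, combining categories" slide suggest):

```stata
* Use level 5 (never married) as the reference category instead of level 1.
regress realrinc ib5.marital
```

Each coefficient would then be that category's difference from never married, with its own t-statistic and confidence interval.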

What is the difference between nevermarried and widowed? Is it significant?
You can see if the confidence intervals overlap. Or you can use Stata’s postestimation tests.

. regress realrinc i.marital

      Source |       SS       df       MS              Number of obs =   34888
-------------+------------------------------           F(  4, 34883) =  187.50
       Model |  5.9751e+11     4  1.4938e+11           Prob > F      =  0.0000
    Residual |  2.7791e+13 34883   796694311           R-squared     =  0.0210
-------------+------------------------------           Adj R-squared =  0.0209
       Total |  2.8389e+13 34887   813730067           Root MSE      =   28226

--------------------------------------------------------------------------------
      realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
       marital |
      widowed  |  -9009.761   814.9317   -11.06   0.000    -10607.05   -7412.468
     divorced  |  -2885.581   446.1419    -6.47   0.000    -3760.033   -2011.128
    separated  |  -7255.963   829.9696    -8.74   0.000     -8882.73   -5629.196
never married  |  -9563.755   370.0341   -25.85   0.000    -10289.03   -8838.477
               |
         _cons |   25467.94   205.3829   124.00   0.000     25065.39     25870.5

Testing, combining categories
You can’t use i.marital, so first make actual indicator variables:

. gen widow=marital==2
. gen divorced=marital==3
. gen separated=marital==4
. gen nevermarried=marital==5
. regress realrinc widow divorced separated nevermarried

Then, after the regression, test a linear combination:

. lincom widow - nevermarried

RESULTS:

 ( 1)  widow - nevermarried = 0

------------------------------------------------------------------------------
    realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   553.9949   846.5298     0.65   0.513    -1105.231     2213.22

This result tells me that I could combine widow and nevermarried into a single category if I want.
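lincom will not accept i.marital itself, but in Stata 11 or later the individual levels can be named as 2.marital, 5.marital, and so on, which avoids creating the indicators by hand. The sketch below assumes the coding implied by the gen commands above (1 = married, 2 = widowed, 3 = divorced, 4 = separated, 5 = never married); the new variable name marital4 and its labels are illustrative only. It is written for a do-file, since it uses the /// line continuation.

```stata
regress realrinc i.marital
* Widowed minus never married, with its standard error and t-statistic.
lincom 2.marital - 5.marital
* The same hypothesis stated as an F test.
test 2.marital = 5.marital

* If the two categories are to be combined, build a four-category variable.
recode marital (1 = 1 "married") (3 = 2 "divorced") (4 = 3 "separated") ///
    (2 5 = 4 "widowed or never married"), generate(marital4)
regress realrinc i.marital4
```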

When not to control for a variable
I want to know how education affects men and women’s belief that marijuana should be legalized.
Grass: indicator variable equal to 1 if the respondent believes marijuana should be legalized.
If I run this:
Grass = b0 + b1 education + b2 income + b3 age…
then the coefficient on education tells us “If someone gets a lot of education but has the same income as another person, how does the education affect grass?”
You might instead want to know “If someone gets a lot of education and as a result has higher income than another person, as well as being better educated, how does the education affect grass?” For this, run:
Grass = b0 + b1 education + b3 age…
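The two specifications could be run side by side as follows. The variable names (grass, educ, realrinc, age) are assumptions for illustration; use whatever your data set calls them.

```stata
* Education effect holding income fixed.
regress grass educ realrinc age
* Education effect including the part that works through higher income.
regress grass educ age
```

Comparing the coefficient on educ across the two regressions shows how much of the education effect operates through income.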

If you want to show how adding a group of variables improves explanatory power, you can show both regressions (with and without) and compare adjusted R-sq (or MSE)

      Source |       SS       df       MS              Number of obs =    1038
-------------+------------------------------           F(  6,  1031) =   18.91
       Model |  1.0843e+11     6  1.8072e+10           Prob > F      =  0.0000
    Residual |  9.8536e+11  1031   955729182           R-squared     =  0.0991
-------------+------------------------------           Adj R-squared =  0.0939
       Total |  1.0938e+12  1037  1.0548e+09           Root MSE      =   30915

--------------------------------------------------------------------------------
       realinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
       marital |
      widowed  |  -20202.06   4203.911    -4.81   0.000    -28451.25   -11952.86
     divorced  |  -20055.93   2846.119    -7.05   0.000    -25640.78   -14471.08
    separated  |  -17877.02   4945.558    -3.61   0.000    -27581.53   -8172.515
never married  |  -19598.35   2518.839    -7.78   0.000    -24540.99   -14655.72
               |
      hrsusual |  -330.5012   231.5231    -1.43   0.154    -784.8114     123.809
         hrssq |   6.118547   2.829847     2.16   0.031      .565629    11.67146
         _cons |   48787.92   4886.288     9.98   0.000     39199.71    58376.12

      Source |       SS       df       MS              Number of obs =     962
-------------+------------------------------           F(  2,   959) =   29.31
       Model |  4.6811e+10     2  2.3406e+10           Prob > F      =  0.0000
    Residual |  7.6571e+11   959   798446597           R-squared     =  0.0576
-------------+------------------------------           Adj R-squared =  0.0556
       Total |  8.1252e+11   961   845495865           Root MSE      =   28257

------------------------------------------------------------------------------
    realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    hrsusual |   594.0719   223.0332     2.66   0.008     156.3825    1031.761
       hrssq |  -.8641106   2.720913    -0.32   0.751    -6.203741     4.47552
       _cons |   587.6587   4650.105     0.13   0.899    -8537.896    9713.214
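Note that the two outputs above differ in dependent variable (realinc versus realrinc) and in sample size (1038 versus 962), so their adjusted R-squared values are not directly comparable. A with/without comparison is cleanest when both regressions use the same Y and the same observations. One way to arrange that, sketched here with realrinc as the assumed dependent variable and insample as a hypothetical helper variable:

```stata
* Full model; e(sample) marks the observations actually used.
regress realrinc i.marital hrsusual hrssq
gen byte insample = e(sample)
display "Adj R-sq with marital dummies:    " e(r2_a)

* Restricted model on exactly the same observations.
regress realrinc hrsusual hrssq if insample
display "Adj R-sq without marital dummies: " e(r2_a)
```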

Nonlinear terms: Some of you did it without thinking about the results
You can NOT put a nonlinear version of Y (the dependent variable) in as an explanatory variable. It is obvious that Y^2 will explain Y, but what have you learned from that?
You can’t make an indicator variable into a nonlinear one: since 0^2 = 0 and 1^2 = 1, the squared indicator is identical to the indicator itself.
When you try out a nonlinear term, be sure to add it into the whole multiple regression. You want to test it controlling for other factors.

These results say that at 0 hours, the slope drealinc/dhrs is negative, but the slope becomes more positive as hours increase

. regress realinc i.marital hrsusual hrssq

      Source |       SS       df       MS              Number of obs =    1038
-------------+------------------------------           F(  6,  1031) =   18.91
       Model |  1.0843e+11     6  1.8072e+10           Prob > F      =  0.0000
    Residual |  9.8536e+11  1031   955729182           R-squared     =  0.0991
-------------+------------------------------           Adj R-squared =  0.0939
       Total |  1.0938e+12  1037  1.0548e+09           Root MSE      =   30915

--------------------------------------------------------------------------------
       realinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
       marital |
      widowed  |  -20202.06   4203.911    -4.81   0.000    -28451.25   -11952.86
     divorced  |  -20055.93   2846.119    -7.05   0.000    -25640.78   -14471.08
    separated  |  -17877.02   4945.558    -3.61   0.000    -27581.53   -8172.515
never married  |  -19598.35   2518.839    -7.78   0.000    -24540.99   -14655.72
               |
      hrsusual |  -330.5012   231.5231    -1.43   0.154    -784.8114     123.809
         hrssq |   6.118547   2.829847     2.16   0.031      .565629    11.67146
         _cons |   48787.92   4886.288     9.98   0.000     39199.71    58376.12

Take the derivative: drealinc/dhrs = -330 + 6.119*2*hrs
At hrs=0, slope = -330
At hrs=20, slope = -85.24
At hrs=40, slope = +159.5
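Rather than computing the slopes by hand, lincom can evaluate the derivative b[hrsusual] + 2*b[hrssq]*hrs at chosen hours and report a standard error for each. This is a sketch, assuming hrssq was created as the square of hrsusual; the multipliers 40 and 80 are simply 2*hrs for hrs = 20 and hrs = 40.

```stata
regress realinc i.marital hrsusual hrssq
* Slope at 0 hours.
lincom hrsusual
* Slope at 20 hours: b[hrsusual] + 2*20*b[hrssq].
lincom hrsusual + 40*hrssq
* Slope at 40 hours: b[hrsusual] + 2*40*b[hrssq].
lincom hrsusual + 80*hrssq
* Hours at which the slope changes sign: -b[hrsusual] / (2*b[hrssq]).
display -_b[hrsusual] / (2 * _b[hrssq])
```

With the coefficients shown above, the sign change falls at about 27 hours.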

Interpreting coefficients when the dependent variable is a dummy

      Source |       SS       df       MS              Number of obs =   34892
-------------+------------------------------           F(  2, 34889) =  360.89
       Model |   175.59685     2  87.7984249           Prob > F      =  0.0000
    Residual |  8487.89149 34889  .243282739           R-squared     =  0.0203
-------------+------------------------------           Adj R-squared =  0.0202
       Total |  8663.48834 34891   .24830152           Root MSE      =  .49324

------------------------------------------------------------------------------
     married |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .0673883   .0054176    12.44   0.000     .0567696    .0780069
    realrinc |   1.94e-06   9.50e-08    20.45   0.000     1.76e-06    2.13e-06
       _cons |   .4641422    .004045   114.75   0.000     .4562139     .472070

The coefficient on male tells us that a man is 6.7 percentage points more likely than a woman to be married (holding income constant).
But the average probability of being married is .536, so this is .067/.536 = .125, a 12.5 percent difference.
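The relative difference can be computed directly after the regression. The regress command below is inferred from the variable names in the output above rather than shown on the slide, so treat it as an assumption:

```stata
regress married male realrinc
* Mean of the dependent variable over the observations used in the regression.
summarize married if e(sample)
* Percentage-point effect divided by the average probability of being married.
display "Relative difference: " _b[male] / r(mean)
```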

Dropping variables
If you want, you can drop a variable with |t| < 1.
Never drop a variable with |t| > 1 unless that variable should not be in the regression no matter what its t-stat is.
If you keep a variable with |t| < 1 in the regression, don’t discuss its coefficient as if it really was important. Say “X2 has no effect on Y” or at most “while the sign on X2 is positive, we are very uncertain that it has any effect at all.”

Sometimes, a variable can be significant but not particularly important
Sometimes, a variable can have a large t-stat but not really make a large difference to your predicted Y. To see if it does make a large difference, make this calculation:
X coefficient * (highest X value in dataset – lowest X value)
That tells you the maximum amount by which variation in X can change Y. Or use:
X coefficient * (95th percentile X value – 5th percentile X value)
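Both calculations can be done right after the regression using the results that summarize stores. This is only a sketch; the regression and the choice of hrsusual as X are illustrative assumptions.

```stata
regress realrinc hrsusual male
* detail makes summarize store percentiles as well as min and max.
summarize hrsusual if e(sample), detail
display "Full-range effect on Y:      " _b[hrsusual] * (r(max) - r(min))
display "5th-to-95th percentile effect: " _b[hrsusual] * (r(p95) - r(p5))
```

The percentile version is less sensitive to a few extreme X values than the full-range version.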

On writing the project: Focus on the client
Components of a good report:
Body of the report:
Introduction of the problem (written to the client, of course).
Perhaps background on the problem (can bring in other sources, put in graphs, etc.).
Describing the data set and variables.
Analysis: describing the regression results as intuitively as possible, without using a lot of statistics terms except when absolutely necessary (and perhaps in parentheses or footnotes).
Conclusion.
At the beginning of the report:
An executive summary (1 paragraph to 2 pages) with no statistics terms. The assignment reads: “Can an executive who knows little statistics understand what you did and what you found from reading the executive summary?”

On writing the project: Focus on the client
Graphs and tables can be in the body of the report or at the end. They need to be labeled and referred to (as in “Table 1 shows…” or “Men make more money than women in this occupation, as can be seen in Table 1…”).

Style of writing
Do NOT write about your process of what you did (first I did this, then I realized that… etc.).
Most things should be written in third person, such as:
“This report investigates whether ……”
“As can be seen in the first two columns of Table 1, when we do not control for gender or other possibly confounding factors, each hour of work….”
Report more than 1 regression if the client learns something from it (not if you learned that it was the wrong way to estimate things).

On writing the project: Other criteria
If I were the client, would I feel that this project answered a question I am interested in?
Can an executive who knows little statistics understand what you did and what you found from reading the executive summary?
Are regressions presented in easy to read tables and/or as equations?
Are any graphics you use appropriate and clearly convey information to the reader?
Does the writing develop the ideas in logical order and clearly?
Is the report well-written? Are there English and spelling mistakes? Does the report look professional?

Use tables to report several regressions (in your final project report)

If you have lots of dummies that are not themselves important, you can just say “All regressions also include…..” If only some regressions have these dummies, at the bottom include a line labeled e.g. Industry dummies. Then put a √ if it is there. (Use adjusted R-square rather than R-square)
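One way to get several regressions into a side-by-side layout before moving them to Excel is Stata's built-in estimates table. In this sketch the model names m1 and m2 and the variable lists are illustrative assumptions:

```stata
regress realrinc hrsusual hrssq
estimates store m1
regress realrinc hrsusual hrssq i.marital
estimates store m2
* One column per model, standard errors under coefficients, N and adjusted R-sq at the bottom.
estimates table m1 m2, b(%9.3f) se(%9.3f) stats(N r2_a)
```

The dummy coefficients that are not themselves important can then be replaced in the final table by a single labeled row, as described above.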

How to report single regressions

VOTESHARE% = 48.216 + 4.179 INCUMBENT + 0.840 GROWTH
            (34.38)  (2.38)            (5.31)
SEE = 4.199   R2 = .627   adj. R2 = .593   N = 25   (t-stats in parentheses)

Or, put standard errors in parentheses.

How to move Stata results into Excel
You want to do this to make tables. You probably want to do this to make graphs.
Copy from results or from log into Excel.
In Windows: Data – Text to Columns

Criteria: How your project will be judged
Does your project use statistics, including (but not limited to) multiple regression, that are most appropriate to answer your question?
Does it demonstrate a deep understanding of the statistics taught in the course?
Is your data set appropriate to answering the question?
Have you made mistakes in handling missing data, generating variables, or interpreting coefficients?
Does the writing develop the ideas in logical order and clearly?
If I were the client, would I feel that this project answered a question I am interested in?
Can an executive who knows little statistics understand what you did and what you found from reading the executive summary?
Does your report control for and/or discuss possible biases, e.g. due to confounding factors?
Are regressions presented in easy to read tables and/or as equations?
Are any graphics you use appropriate and clearly convey information to the reader?
Is the report well-written? Are there English and spelling mistakes? Does the report look professional?

What’s next? Don’t forget the signup https://docs.google.com/spreadsheets/d/1pcfaSpsS6TISPccPsJf7ykuhTIzqsWN7OqU RMO_BLM4/edit?usp=sharing