QM222 A1 How to proceed next in your project Multicollinearity


QM222 A1 How to proceed next in your project Multicollinearity QM222 Fall 2017 Section A1

To Do:
Assignment 6: I can extend the due date to next Wednesday.
I think that those who came to the last class and met with me for 10 minutes really made progress on the direction they should take next in their project. I hope that each of you meets with me sooner rather than later.
When can we meet? Monday 11:15-3:30; Tuesday 11:15-1:45 and 3:30-5.
You will also need to meet with me for a second required meeting, after you have the regressions that you plan to use (and before your presentation). This does not need to be by November 13. I’ll put up a signup next week.

Schedule of Classes
Monday (11/6): Graphs in Excel, visualizing data
Wednesday (11/8): Other Excel statistical tools (regression, pivot tables, if statements for dummies)
Friday (11/10): Test recap, project help
Monday (11/13): Tips on writing your first draft
Wednesday (11/15) and maybe Friday (11/17): Experiments
After that: Topics as requested, presentations.
Draft (Assignment 7) due before Thanksgiving.
Presentations will start as early as November 15, but certainly by Monday, November 27.

Assignment 6: What to try next
Run additional multiple regressions. Specifically:
Think hard about whether there are additional omitted variables (i.e. confounding factors) that you can measure and that are likely to be biasing your key coefficient(s). If you can find data on them, add them to the regressions. (If you really cannot think of anything beyond what you have, just write that.)
Also, identify at least one omitted variable that you cannot measure, reason out the sign of the omitted variable bias, and explain in Assignment 6, in 1-3 sentences, why and in what direction it will bias your key coefficient.
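To make the sign reasoning concrete, here is a minimal simulated sketch in a Stata do-file. Everything in it is an assumption made up for illustration: the variable names (ability, education, salary), the seed, and the true coefficients. Because the omitted variable (ability) raises salary and is positively correlated with education, leaving it out should push the education coefficient upward.

```stata
* Simulated data: all names and numbers here are hypothetical
clear
set seed 222
set obs 1000
generate ability   = rnormal()
generate education = 12 + 2*ability + rnormal()
generate salary    = 20 + 3*education + 5*ability + rnormal(0, 5)

* Full model: the coefficient on education should be close to its true value of 3
regress salary education ability

* Ability omitted: the coefficient on education is biased upward
regress salary education
```

In this setup the slope of ability on education is 2/(4+1) = 0.4, so the bias is about 5 × 0.4 = 2 and the second regression should report a coefficient of roughly 5 rather than 3. The same logic gives the sign of the bias when the omitted variable cannot be measured: multiply the sign of its effect on Y by the sign of its correlation with your X.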

What you do NOT want to put as an explanatory (X, right-hand-side) variable
Usually, you do NOT want to include a variable that is just another version of Y.
Example: Suppose you want to see who is a good baseball hitter, so you use RBI (runs batted in) as your Y variable.
You do NOT want to use slugging percentage (which is another measure of how good a baseball hitter is) as an X variable.
Slugging percentage does not predict or affect Y. It is just another way to measure Y.
It will be highly correlated with Y. But so what? You do not learn anything.

Important terms to know related to this: Endogenous v. Exogenous
An exogenous factor X is something that happens and then changes your Y.
  Its change is started “from the outside.” Exo means outside, as in exoskeleton.
An endogenous factor X is something that happens as Y happens, like RBIs and slugging percentage.
  Endo means “from inside”: an endoscopy examines your innards (GI tract).
You want your X variables to be as exogenous as possible, so that you see how they affect Y rather than how they are affected by Y or happen along with Y.

Assignment 6 cont.: Nonlinear terms
If you have any numeric explanatory (X) variable, add a quadratic term in addition to your other variables to test whether this nonlinear specification fits better.
Try other nonlinear terms for numeric variables if you think they suit your problem better:
  If you use log X, it means that you think that a 1% increase in X will give a constant change in Y.
  If you use log X and log Y, it means that you think that a 1% increase in X will give a constant percentage (%) change in Y.
  Use top coding if, after a certain point, it doesn’t matter how much more X you have.
Explain what you learn from these results. It is easiest to show quadratics with a graph.
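The specifications above could be built roughly as follows. This is a sketch, not part of the assignment: it uses the nlsw88 example dataset that ships with Stata (wage, ttl_exp) in place of your own data, and the top-coding cutoff of 15 years is an arbitrary assumption.

```stata
* Example dataset shipped with Stata; substitute your own Y and X variables
sysuse nlsw88, clear

* Quadratic: add the squared term alongside the linear term
generate ttl_exp_sq = ttl_exp^2
regress wage ttl_exp ttl_exp_sq

* Log X: ln() returns missing for zero or negative values,
* so those observations drop out of the regression
generate ln_ttl_exp = ln(ttl_exp)
regress wage ln_ttl_exp

* Log X and log Y
generate ln_wage = ln(wage)
regress ln_wage ln_ttl_exp

* Top coding: experience beyond 15 years is treated as 15 (cutoff is illustrative)
generate ttl_exp_top = min(ttl_exp, 15) if !missing(ttl_exp)
regress wage ttl_exp_top
```

The `if !missing()` qualifier matters because Stata's `min()` ignores missing arguments and would otherwise assign 15 to observations with no experience data.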

Nonlinear example:
Here we plotted average salaries at each experience level. Would a straight line fit well?

Evidence: Can Money Buy Happiness?
What is the relationship here between money and happiness? Is it linear?
Happiness increases with income but at a decreasing rate; the slope becomes less steep as income increases.
Extra income “buys” us less happiness when we have more of it.

A helpful Stata tip related to quadratics
Let’s say you run a regression of salary on experience and experience squared:
  Salary = b0 + b1 experience + b2 experience_squared
  (b1 is probably positive, b2 is probably negative)
You want to know the impact of experience:
  d Salary / d Experience = b1 + 2 * b2 * experience
  (do the calculus: if $Y = a x^{b}$, then $dY/dx = b\,a\,x^{b-1}$)
But you do not have a t-test for this slope! And it depends on the value of experience. What to do?
In Stata, after you run the regression, type:
  effect at exp=5: lincom experience + 2*experience_squared*5
  effect at exp=10: lincom experience + 2*experience_squared*10
This tests whether the linear combination above = 0, where Stata plugs in the coefficients b1 for “experience” and b2 for “experience_squared”.
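A runnable version of this tip might look like the sketch below. It assumes the nlsw88 example dataset that ships with Stata, with wage standing in for salary and ttl_exp for experience; the multiplier 2*experience is written out as a single number in each `lincom` line.

```stata
sysuse nlsw88, clear
generate ttl_exp_sq = ttl_exp^2
regress wage ttl_exp ttl_exp_sq

* Slope at 5 years of experience: b1 + 2*b2*5 = b1 + 10*b2
lincom ttl_exp + 10*ttl_exp_sq

* Slope at 10 years of experience: b1 + 2*b2*10 = b1 + 20*b2
lincom ttl_exp + 20*ttl_exp_sq
```

Each `lincom` line reports the estimated slope at that experience level with its standard error, t-statistic, and confidence interval. An alternative, if you prefer factor-variable notation, is to let Stata build the squared term and compute the slopes itself:

```stata
sysuse nlsw88, clear
regress wage c.ttl_exp##c.ttl_exp
margins, dydx(ttl_exp) at(ttl_exp=(5 10))
```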

Another approach if you think the relationship between Y and X is really, really nonlinear
You could try a set of dummy variables for different ranges of the variable, even though it is a numerical variable.
Only use this approach if you believe that the relationship between Y and X changes so much at every value that it can’t be estimated as a quadratic (or cubic, etc.).
Education is sometimes better as a set of dummies.
Remember how to easily make a set of dummies in Stata, e.g. for the number of children, numchildren:
  xi: regress yvariable xvariable i.numchildren
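As a sketch of the education case, the following groups years of schooling into ranges and enters them as dummies. It assumes the nlsw88 example dataset that ships with Stata (grade is years of schooling completed); the group cutoffs and the name educ_group are illustrative choices, not something the assignment prescribes.

```stata
sysuse nlsw88, clear

* Build a category variable for ranges of schooling (cutoffs are illustrative)
generate educ_group = .
replace educ_group = 1 if grade < 12
replace educ_group = 2 if grade == 12
replace educ_group = 3 if grade > 12 & grade < 16
replace educ_group = 4 if grade >= 16 & !missing(grade)

* i.educ_group creates one dummy per group and omits the first as the reference
regress wage ttl_exp i.educ_group
```

Each reported coefficient is the difference in wage relative to the omitted group (here, less than 12 years), holding experience constant. The `!missing(grade)` condition is needed because Stata treats missing values as larger than any number.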

Assignment 6 cont.: Use “interaction terms” (like Scifi*Budget & nonscifi*Budget)
When to use them? If you think that the effect (slope) of one variable depends on the value of another. Example: Matt.
The simplest way to add this in a regression is (if X2 is a dummy):
1. Make an additional variable by multiplying X1 * X2.
2. Make an additional variable by multiplying X1 * (1-X2), recalling that (1-X2) is 1 if X2=0.
3. Run a regression of Y on 3 variables:
   X1*X2       This is X1 for observations where X2=1
   X1*(1-X2)   This is X1 for observations where X2=0
   X2          This is X2
I ask in Assignment 6 that you explain what you learn from your interaction terms.

A helpful Stata tip related to these interaction terms
Let’s say you think that experience affects men and women differently. You make the variables:
  male_experience = maledummy * experience
  female_experience = (1-maledummy) * experience
You run the regression:
  Salary = b0 + b1 female + b2 male_experience + b3 female_experience
But to test whether the slopes are different for male_experience and female_experience, after you run the regression, type:
  lincom male_experience - female_experience
Stata will test whether this equals zero, i.e. whether there is no difference.
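The same recipe can be tried on any dummy variable. The sketch below assumes the nlsw88 example dataset that ships with Stata and uses its union dummy in place of the male/female dummy (that dataset contains only women); the names union_exp and nonunion_exp are made up for illustration.

```stata
sysuse nlsw88, clear

* Separate experience variables for each group (union is a 0/1 dummy;
* observations with missing union get missing products and are dropped)
generate union_exp    = union * ttl_exp
generate nonunion_exp = (1 - union) * ttl_exp

* Regress Y on the dummy plus the two group-specific slopes
regress wage union union_exp nonunion_exp

* Test whether the experience slope differs between the two groups
lincom union_exp - nonunion_exp
```

If the `lincom` result is significantly different from zero, the effect of experience on wage differs between union and non-union workers.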

Multicollinearity

Multicollinearity
Recall that the interpretation of a coefficient in a multiple regression is: the effect on Y of X changing by 1 if the other variables stay the same.
And the t-test tests the null hypothesis H0: could this coefficient be zero?
Sometimes you run a regression with two very highly correlated variables, like the number of toothbrushes sold and the amount of toothpaste sold in a country in a year.
The t-statistics will both be very low, because either coefficient could be zero and the regression would predict approximately the same thing.
But if you drop one of them, the other would become highly significant.
E.g. GDP and unemployment.
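A small simulation can show this pattern. The sketch below is entirely made up for illustration: the data are random draws, y is a hypothetical outcome, and toothpaste is constructed to be almost identical to toothbrushes.

```stata
* Simulated data: all names and numbers here are hypothetical
clear
set seed 12345
set obs 200
generate toothbrushes = rnormal(100, 10)
generate toothpaste   = toothbrushes + rnormal(0, 0.5)
generate y            = 2*toothbrushes + rnormal(0, 20)

* The two X variables are almost perfectly correlated
correlate toothbrushes toothpaste

* With both in the regression, the individual t-statistics are typically small
regress y toothbrushes toothpaste

* Variance inflation factors: large values signal multicollinearity
estat vif

* With only one of them, the remaining variable should be highly significant
regress y toothbrushes
```

Comparing the standard error on toothbrushes in the two regressions shows the cost of including a near-duplicate variable: the coefficient is estimated much less precisely when its twin is in the model.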

What to do if you find that variables that you believe should be significant are not
If you believe that several variables are really measuring the same concept, drop one of them if its |t-stat| is less than ONE.
If you drop a variable with a |t-stat| < 1, the adjusted R-squared increases.
Which do you drop? The one with the lowest |t|. In other words, let the computer tell you which of the two variables you need. If you are right, the other variable will become more significant.
NEVER DROP MORE THAN ONE VARIABLE AT A TIME. If you do, you might drop BOTH highly correlated variables.
You can test whether two (or more) insignificant variables are significant together by typing this after you run the regression:
  test varname1 varname2
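The workflow on this slide might look like the following sketch. The data are simulated and the names x1, x2, and y are hypothetical; which variable actually has the lower |t| depends on the regression output, so dropping x2 here is only an assumption for illustration.

```stata
* Simulated data: all names and numbers here are hypothetical
clear
set seed 2017
set obs 200
generate x1 = rnormal()
generate x2 = x1 + rnormal(0, 0.05)
generate y  = 1 + 3*x1 + rnormal()

* Start with both highly correlated variables
regress y x1 x2
display "Adjusted R-squared with both variables: " e(r2_a)

* Joint test: are x1 and x2 significant together?
test x1 x2

* Drop ONE variable (the one with the lower |t| above; x2 is assumed here)
regress y x1
display "Adjusted R-squared after dropping x2: " e(r2_a)
```

If the dropped variable had |t| below one, the second adjusted R-squared will be the higher of the two. A significant joint test alongside insignificant individual t-tests is the usual sign that the variables matter as a group but are too correlated to separate.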