QM222 Class 11 Section D1 1. Review and Stata: Time series data, multi-category dummies, etc. (chapters 10,11) 2. Capturing nonlinear relationships (Chapter.

QM222 Class 11 Section D1
1. Review and Stata: Time series data, multi-category dummies, etc. (chapters 10, 11)
2. Capturing nonlinear relationships (Chapter 12)
Future topics before test:
- One variable with different slopes (for different groups, Chapter 13)
- Understanding more about the bias due to missing confounding factors (Chapter 14)
QM222 Fall 2015 Section D1

Schedule
- Assignment 3 is due today.
- Assignment 4: due date moved to Friday 6pm.
- I very much hope to look at your Assignment 3 quickly to see if you are on the right track.

Some of you are still unclear on wording
- An “observation” is what a row in your dataset represents.
- Your dependent variable is what is on the left-hand side of the regression equation.
- Your explanatory (also called independent) variables are on the right-hand side.
- If you can measure a possibly confounding variable, you want to include it among your explanatory variables.

Time series and time (review)

(review) In time-series data, you need to have a variable for time
- The variable for time has to increase by 1 each time period.
- If you have annual data, a variable Year does exactly this.
- If you have quarterly or monthly (or decade) data, you need to create a variable time.
Example with quarterly data: Sales = 1003 + 27 time
The coefficient on time tells us that Sales increase by 27 each quarter.

(review) Making a variable Time in Stata: background
- Note: in Stata, _n means the observation number.
- To refer to the value of a variable in the previous observation, use the notation: varname[_n-1]
- The square brackets tell Stata which observation number you are referring to.

Making a variable for Time in time-series data in Stata (one observation per time period)
First make sure the data is in chronological order. For instance, if there is a variable “date”, go:
    sort date
Making a time variable (when the data is in chronological order):
    gen time=1 in 1
    replace time= time[_n-1]+1 if _n>1
(“in #” tells Stata to do this only for observation #. “if _n>1” skips observation 1, which has no previous observation and would otherwise be set to missing.)
OR just:
    gen time= _n
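
For readers who want to see the sort-then-number logic outside Stata, here is a minimal Python sketch. The quarterly records and their values are made up for illustration; only the two steps (sort by date, then number the rows from 1) mirror `sort date` and `gen time = _n`.

```python
# Hypothetical quarterly records, deliberately out of chronological order.
records = [
    {"date": "2014-07-01", "sales": 1090},
    {"date": "2014-01-01", "sales": 1030},
    {"date": "2014-10-01", "sales": 1110},
    {"date": "2014-04-01", "sales": 1060},
]

# Equivalent of `sort date`: ISO-formatted date strings sort chronologically.
records.sort(key=lambda row: row["date"])

# Equivalent of `gen time = _n`: number the sorted rows 1, 2, 3, ...
for n, row in enumerate(records, start=1):
    row["time"] = n

times = [row["time"] for row in records]
dates = [row["date"] for row in records]
print(list(zip(dates, times)))
```

If the rows were numbered before sorting, time would follow the file order rather than the calendar, which is why the sort comes first.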

Quarterly or monthly data
With quarterly or monthly data, you should also include indicator variables for seasonality.
For quarterly data, make 3 indicator variables. The fourth is the reference (base) category.
Example: Sales = 998 + 27 time - 4 Q1 + 10 Q2 + 12 Q3
- Here, the coefficient on time tells us that Sales increase by 27 each quarter, holding season constant.
- Q4 is the reference category.
- Sales in Q2 on average are 10 more than Sales in Q4.
- Sales in Q1 on average are 4 less than Sales in Q4.
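
The interpretation of each coefficient can be checked by plugging numbers into the example equation. This is a small Python sketch, not Stata; the function name `predicted_sales` and the choice of time = 8 are illustrative assumptions, while the coefficients come from the example above.

```python
def predicted_sales(time, quarter):
    """Predicted Sales from the example equation; quarter is 1, 2, 3 or 4."""
    # Coefficients on the Q1, Q2, Q3 indicators; Q4 is the reference, so it adds 0.
    seasonal_effect = {1: -4, 2: 10, 3: 12, 4: 0}
    return 998 + 27 * time + seasonal_effect[quarter]

# Same time value, different quarters: the gap is the indicator coefficient.
q2_vs_q4 = predicted_sales(8, 2) - predicted_sales(8, 4)
q1_vs_q4 = predicted_sales(8, 1) - predicted_sales(8, 4)

# Same quarter, time one unit higher: the gap is the coefficient on time.
one_period_later = predicted_sales(9, 4) - predicted_sales(8, 4)

print(q2_vs_q4, q1_vs_q4, one_period_later)  # 10 -4 27
```

The last comparison is a "holding season constant" thought experiment: in real quarterly data, moving time by 1 also moves you to the next quarter.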

(review) Running a Stata regression using a categorical explanatory variable with many categories
You can make a single indicator variable in Stata easily, e.g.
    gen female = 0
    replace female = 1 if gender==2
OR in a single line:
    gen female= gender==2

(review) Running a Stata regression using a categorical explanatory variable with many categories
In Stata, you don’t need to make indicator variables separately for a variable with more than 2 categories.
Assuming that you have a numeric categorical variable season that takes a different value for each of Winter, Fall, Spring and Summer, type:
    regress sales price i.season
(If season is stored as a string, first convert it to a labeled numeric variable, e.g. encode season, gen(season_n), and then use i.season_n.)
This will run a multiple regression of sales on price and on 3 seasonal indicator variables.
Stata chooses the reference category (by default the category with the lowest value, although you can set a different one, e.g. ib3.season makes the category coded 3 the reference).
Stata will name the indicator variables by the value, or value label, of each category.
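
To make concrete what "3 indicator variables plus a reference category" means, here is a minimal Python sketch of the same idea. It is not how Stata implements `i.`; the helper `make_indicators` and the sample values are hypothetical, and the default reference here is simply the first category in sorted order.

```python
def make_indicators(values, reference=None):
    """Return {category: [0/1, ...]} for every category except the reference."""
    categories = sorted(set(values))
    # Assumption for this sketch: default to the first category in sorted order.
    if reference is None:
        reference = categories[0]
    return {
        cat: [1 if v == cat else 0 for v in values]
        for cat in categories
        if cat != reference
    }

season = ["Winter", "Spring", "Summer", "Fall", "Winter", "Summer"]
indicators = make_indicators(season)

print(sorted(indicators))        # ['Spring', 'Summer', 'Winter']
print(indicators["Summer"])      # [0, 0, 1, 0, 0, 1]
```

Four categories produce three 0/1 columns. The omitted one (Fall here) is the group every coefficient is compared against.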

Let’s do this!
Use the hobbit data set (on our website, Other Materials, Data and other Materials).
1. Make a time variable.
2. Make a weekend indicator variable.
3. Regress Gross on time and the weekend indicator. Interpret each coefficient.
4. Regress Gross on time and day of week (Day) using i.

Estimating nonlinear relationships
Could the relationship be non-linear, and if so, how can we estimate this using linear regression?

Non-linear relationships between Y and X
Sometimes, the relationship between the Y variable and the X variable is unlikely to be linear. If you fit a straight line anyway, you may measure a very small, statistically insignificant slope.
e.g. If you ran a regression on the data in this graph, the slope coefficient would be zero.
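
A tiny numerical illustration of this point, using made-up data rather than anything from the class: Y depends perfectly on X, but in a symmetric up-then-down pattern, and the straight-line slope comes out as zero. The helper `ols_slope` is a hypothetical name for the usual simple-regression slope formula.

```python
def ols_slope(x, y):
    """Slope of the simple regression of y on x: cov(x, y) / var(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Made-up data: y rises and then falls symmetrically around x = 5.
x = list(range(11))                      # 0, 1, ..., 10
y = [25 - (xi - 5) ** 2 for xi in x]     # perfect inverted U

slope = ols_slope(x, y)
print(slope)
```

The zero slope does not mean X is unrelated to Y. It only means a straight line is the wrong shape for this relationship.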

Many of you believe that you might have nonlinear relationships
e.g. Maybe job satisfaction goes up with age and then down again.
e.g. You do not believe that a $1 increase in price will have the same effect going from $10 to $11 as going from $100 to $101.
Note that this section applies only to numerical variables. You cannot do these nonlinear things with indicator variables.

To solve the problem of Y possibly increasing with X and then decreasing:
You simply add to the regression a new X variable that is a non-linear version of the old variable.
My suggestion: estimate a quadratic by making a new variable X^2, and run the regression with both the linear and the non-linear (quadratic) term in the equation.
If you don’t know whether a relationship is nonlinear, you can estimate the regression assuming it is nonlinear (e.g. quadratic) and then examine the results to see if this assumption is correct.
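
As a rough sketch of what "add X^2 and run the regression with both terms" does, the Python below builds the squared column and solves the least-squares problem directly. In class you would use Stata's regress for this; the function names and the data are made up, and the data follow an exact quadratic so that the fit can be checked by eye.

```python
def solve_linear_system(a, b):
    """Solve a @ beta = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x^2 via the normal equations."""
    rows = [[1.0, xi, xi ** 2] for xi in x]          # constant, X, and the new X^2
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)

x = list(range(11))
y = [25 - (xi - 5) ** 2 for xi in x]    # algebraically equal to 0 + 10*x - 1*x^2
b0, b1, b2 = fit_quadratic(x, y)
print(round(b1, 6), round(b2, 6))
```

The regression is still linear in the coefficients. The only change is that the squared variable enters as one more explanatory column.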

Quadratic: Y = b0 + b1 X + b2 X^2
In high school you learned that quadratic equations look like this. So by adding a squared term, you can estimate these shapes.

However, a regression with a quadratic can estimate ANY part of these shapes
So, using a quadratic does not mean that the curve need actually ever change from a positive to a negative slope or vice versa …
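
One way to see which part of the curve your data cover is to ask where the fitted slope would be zero. The following is standard algebra for the quadratic Y = b0 + b1 X + b2 X^2, not something stated on the slide; X* is just a label for the turning point.

```latex
\frac{dY}{dX} = b_1 + 2 b_2 X,
\qquad
\frac{dY}{dX} = 0 \;\Longleftrightarrow\; X^{*} = -\frac{b_1}{2 b_2} \quad (b_2 \neq 0)
```

If X* lies outside the range of X in your data, the fitted curve bends but never switches between rising and falling over the values you actually observe.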

How do you know whether the relationship really is nonlinear?
Put in a nonlinear term (e.g. a squared term) and let the |t-stat| on that term tell you if it belongs in there.
- If |t-stat| > 2, you are more than 95% confident that the relationship is nonlinear.
- Even if |t-stat| < 2, it’s a good idea to keep the quadratic term in as long as you are relatively confident it belongs. I tend to leave it in if it has a |t-stat| > 1, which means that I am at least 68% confident the relationship is nonlinear.
Example: I know annual visitors to the park. I want to know if they are growing (or falling) at a constant rate over time, or not. First I make the variables:
    gen time= _n
    gen timesq = time^2

Here are regressions on time, then on time AND timesq. Is the relationship nonlinear? Are visitors growing/shrinking, and at a constant rate?

    . regress annualvisitors time

          Source |       SS       df       MS              Number of obs =      23
    -------------+------------------------------           F(  1,    21) =    1.74
           Model |  1.1103e+11     1  1.1103e+11           Prob > F      =  0.2010
        Residual |  1.3382e+12    21  6.3722e+10           R-squared     =  0.0766
    -------------+------------------------------           Adj R-squared =  0.0326
           Total |  1.4492e+12    22  6.5872e+10           Root MSE      =  2.5e+05

    ------------------------------------------------------------------------------
    annualvisi~s |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            time |  -10474.59   7935.107    -1.32   0.201    -26976.55    6027.369
           _cons |    1639786   108800.7    15.07   0.000      1413523     1866050

    . regress annualvisitors time timesq

    -------------+------------------------------           F(  2,    20) =   35.96
           Model |  1.1339e+12     2  5.6695e+11           Prob > F      =  0.0000
        Residual |  3.1528e+11    20  1.5764e+10           R-squared     =  0.7824
    -------------+------------------------------           Adj R-squared =  0.7607
           Total |  1.4492e+12    22  6.5872e+10           Root MSE      =  1.3e+05

            time |   118497.8   16490.43     7.19   0.000     84099.37    152896.3
          timesq |   -5373.85   667.1316    -8.06   0.000    -6765.462   -3982.238
           _cons |    1102401   85902.09    12.83   0.000     923212.7     1281590
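
The t column and the R-squared in these tables can be recomputed from the other numbers Stata prints, which is a useful way to remember what they mean. A short Python sketch with every number copied from the two outputs; the variable names are my own.

```python
# Numbers copied from the two Stata outputs.
coef_time_linear, se_time_linear = -10474.59, 7935.107   # first regression
coef_time, se_time = 118497.8, 16490.43                   # second regression
coef_timesq, se_timesq = -5373.85, 667.1316               # second regression

# t-statistic = coefficient / standard error
t_time_linear = coef_time_linear / se_time_linear
t_time = coef_time / se_time
t_timesq = coef_timesq / se_timesq

# R-squared = Model SS / Total SS
r2_linear = 1.1103e11 / 1.4492e12
r2_quadratic = 1.1339e12 / 1.4492e12

print(round(t_time_linear, 2), round(t_time, 2), round(t_timesq, 2))
print(round(r2_linear, 4), round(r2_quadratic, 4))
```

With time alone, |t| is below 2 and R-squared is under 8%. Once timesq is added, both terms have |t| well above 2 and R-squared rises to about 78%, which is the evidence that the relationship is nonlinear.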

Sketching the Quadratic
Visitors = 1102401 + 118498 time - 5374 time^2
The linear term is positive, so at a small X, e.g. X=0.1, the slope is positive. The squared term is negative, so the slope eventually becomes negative. So the general shape is an inverted U: a curve that rises and then falls. But which part of the curve is it?
For those who don’t think in derivatives, plug in high, medium and low values for X in the original equation. In this data, time goes from 1 to 23 so:
    At time=1,  Visitors = 1102401 + 118498 (1)  - 5374 (1^2)  = 1,215,525
    At time=10, Visitors = 1102401 + 118498 (10) - 5374 (10^2) = 1,749,981
    At time=23, Visitors = 1102401 + 118498 (23) - 5374 (23^2) = 985,009
So over these 23 years, predicted visitors go up, then back down again.
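
The plug-in arithmetic is easy to reproduce for any value of time. A minimal Python sketch using the rounded coefficients from the equation above; the function name `predicted_visitors` is an assumption of the example.

```python
def predicted_visitors(time):
    """Predicted annual visitors from the rounded quadratic equation."""
    return 1102401 + 118498 * time - 5374 * time ** 2

predictions = {t: predicted_visitors(t) for t in (1, 10, 23)}
print(predictions)  # {1: 1215525, 10: 1749981, 23: 985009}
```

Trying a few more values between 1 and 23 shows roughly where the predictions stop rising and start falling.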

Sketching the Quadratic using calculus
Visitors = 1102401 + 118498 time - 5374 time^2
Calculus tells us the slope: dVisitors/dtime = 118498 - 2*5374 time
The slope gets smaller as time increases. At the top of this curve, the slope is exactly zero. So solve:
    0 = 118498 - 2*5374 time
    time = 11.03
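
The same calculation in a few lines of Python, as a check on the arithmetic. The names `b1`, `b2`, `slope` and `turning_point` are labels chosen for this sketch; the coefficients are the rounded ones from the equation above.

```python
b1, b2 = 118498, -5374          # coefficients on time and time^2

# Slope is b1 + 2*b2*time; setting it to zero gives the turning point.
turning_point = -b1 / (2 * b2)

def slope(time):
    return b1 + 2 * b2 * time

print(round(turning_point, 2))           # 11.03
print(slope(1) > 0, slope(23) < 0)       # True True
```

Because the turning point falls inside the observed range of 1 to 23, the data cover both the rising and the falling part of the curve.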

What about this issue: You believe that a 1% increase in X will have the same % effect on Y no matter what value of X you start at. [NOT ON TEST]
e.g. You believe a 1 percent increase in price has a constant percentage effect on sales.
Mathematical rule: If lnY = b0 + b1 lnX, b1 represents %∆Y / %∆X, or the percentage change in Y when X changes by 1%.
(ln is the natural log, i.e. the log to the base e. Log usually means to the base 10, although in Stata log() is also the natural log and log10() is base 10. Either base works.)
So just make two new variables, lnY and lnX, and run a regression:
    regress lnY lnX
The coefficient on lnX will be the percentage change in Y when X changes by 1%.
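
A short numerical illustration of the rule, in Python rather than Stata. The relationship, the scale `A` and the value `b1 = -1.5` are made up for the example. The sketch shows that a 1% rise in X changes Y by about b1 percent from any starting point, and that the slope in logs equals b1 for either base.

```python
import math

A, b1 = 500.0, -1.5              # made-up scale and elasticity

def y_of(x):
    """Constant-elasticity relationship: ln Y = ln A + b1 * ln X."""
    return A * x ** b1

def pct_change_in_y(x):
    """Percentage change in Y when X rises by 1% from the starting value x."""
    return 100 * (y_of(1.01 * x) / y_of(x) - 1)

# The percentage effect is the same whatever X you start from, and close to b1.
effects = [pct_change_in_y(x) for x in (10, 100, 1000)]

# The slope in logs is b1, using natural logs or base-10 logs.
slope_ln = (math.log(y_of(100)) - math.log(y_of(10))) / (math.log(100) - math.log(10))
slope_log10 = (math.log10(y_of(100)) - math.log10(y_of(10))) / (math.log10(100) - math.log10(10))

print([round(e, 3) for e in effects], round(slope_ln, 6), round(slope_log10, 6))
```

The percentage effect is close to b1 but not exactly equal to it, because the "percentage change" reading of a log-log coefficient is an approximation that works best for small changes.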

A case when logs might be useful?
If you have skewed data (like lifetime gross in movies), you could just regress:
ln(Lifetime gross) = b0 + b1 ln(metascore)

We should talk more if you want to use logs

Back to the hobbit data set
1. Make a variable for timesquared.
2. Run a regression of gross on time, timesquared, and the better of the other two (weekend indicator, or day of week indicator variables).
3. Is the relationship between gross and time nonlinear? What does it look like?

Dealing with skewed data

There are 3 ways you might deal with skewed data
1. Use logs for the skewed variable (if you believe the right relationship is with the percentage change).
2. If the skewed variable is the dependent variable, predict the median rather than the mean by going:
    qreg Yvariable Xvariable
3. You can topcode the variable (whether it is a dependent or explanatory variable), for instance:
    replace LifetimeGross = 100000000 if LifetimeGross>100000000 & !missing(LifetimeGross)
(Stata treats missing values as larger than any number, so the !missing() condition keeps missing values from being topcoded too.)
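
To see why the median and top-coding help with skewed data, here is a small Python sketch with made-up lifetime grosses. It does not run a median regression like qreg; it only compares the mean, the median and the top-coded mean of one skewed variable. The cap of 100,000,000 matches the replace example above.

```python
import statistics

# Made-up lifetime grosses in dollars: most are modest, one is enormous.
lifetime_gross = [5_000_000, 12_000_000, 20_000_000, 35_000_000, 760_000_000]

mean_gross = statistics.mean(lifetime_gross)
median_gross = statistics.median(lifetime_gross)

# Top-coding: cap every value at 100,000,000, as in the Stata replace line.
CAP = 100_000_000
topcoded = [min(value, CAP) for value in lifetime_gross]
mean_topcoded = statistics.mean(topcoded)

print(mean_gross, median_gross, mean_topcoded)
```

One blockbuster pulls the mean far above what a typical movie earns, while the median and the top-coded mean stay much closer to the bulk of the data.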

More practice using Stata
What would you like me to demonstrate?
Otherwise: Help each other. Where are you stuck? What don’t you know how to do? What can you teach the others?