QM222 Class 10 Section D1


QM222 Class 10 Section D1
1. Goodness of fit -- review
2. What if your Dependent Variable is a 0/1 Indicator variable? -- review
3. Indicator variables when there are multiple categories
4. Time series data
QM222 Fall 2016 Section D1

Review

Multiple regression: Why use it? There are 2 reasons why we do multiple regression:
1. To get closer to the "correct/causal" (unbiased) coefficient by controlling for confounding factors
2. To increase the predictive power of a regression

Goodness of fit
How well does the model explain our dependent variable?
R2 or R-squared (comparing models with the same # of X's)
Adjusted R2 (comparing models with different #'s of X's)
How accurate are our predictions?
Root mean squared error, Root MSE (also called SEE, the standard error of the equation)

The R2 measures the goodness of fit. Higher is better. Compare .9414 to .6408

Measuring Goodness of Fit with R2
R2: the fraction of the variation in Y explained by the regression. It is always between 0 and 1. In a simple (one-X) regression, R2 = [correlation(X,Y)]2.
What does R2 = 1 tell us? That the regression predicts Y perfectly.
What does R2 = 0 tell us? That the model doesn't predict any of the variation in Y.
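As a quick numerical check of the identity R2 = [correlation(X,Y)]2 in a simple regression, here is a sketch in Python (the x and y values are made-up toy data, not from the course):

```python
# Toy data (illustrative only): y is roughly linear in x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Simple OLS slope and intercept.
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
slope = sxy / sxx
intercept = my - slope * mx

# R-squared: the fraction of the variation in Y explained by the regression.
ss_total = sum((yi - my) ** 2 for yi in y)
ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
r2 = 1 - ss_resid / ss_total

# In a simple regression this equals the squared correlation of X and Y.
syy = ss_total
r2_from_corr = sxy ** 2 / (sxx * syy)
print(r2, r2_from_corr)  # the two numbers agree
```

Both computations give the same number, which is the point of the slide: in the one-regressor case, R2 carries exactly the same information as the correlation coefficient.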

Goodness of Fit in a Multiple Regression
A higher R2 means a better fit than a lower R2 (when you have the same number of explanatory variables). However, in multiple regressions, use the adjusted R2 instead (printed right below R2):
Number of obs = 1085
F(2, 1082) = 1627.49
Prob > F = 0.0000
R-squared = 0.7505
Adj R-squared = 0.7501
Root MSE = 1.3e+05
If you compare two regressions (no matter how many X/explanatory variables there are in each), the best fit will be the one with the highest Adjusted R2.
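As a sketch of where the Adj R-squared number comes from, the standard adjustment formula can be applied to the figures in the printout above. (Stata computes this from the unrounded R2, so the match here is only up to rounding.)

```python
# Reproduce Adj R-squared from the printout:
#   adj R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
r2 = 0.7505   # R-squared as reported
n = 1085      # Number of obs
k = 2         # number of explanatory variables (Beacon_Street, size)

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(adj_r2)  # about 0.7500, close to the reported 0.7501
```

The adjustment always pulls R2 down a little, and more so when you add explanatory variables without gaining explanatory power, which is why it is the right statistic for comparing models with different numbers of X's.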

We can also measure fit by looking at the dispersion of actual Y around predicted Y
We predict Ŷ, and we can make confidence intervals around predicted Ŷ using the Root MSE (or SEE). The Root MSE (root mean squared error), a.k.a. the SEE (standard error of the equation), measures how spread out the distribution of the errors (residuals) from a regression is:
We are approximately 68% (around 2/3) certain that the actual Y will be within one Root MSE (or SEE) of predicted Ŷ. This is called the 68% Confidence Interval (CI).
We are approximately 95% certain that the actual Y will be within two Root MSEs (or SEEs) of predicted Ŷ. This is called the 95% Confidence Interval (CI).
[Figure: distribution of actual Y around Ŷ, marked at -3RMSE, -2RMSE, -1RMSE, Ŷ, +1RMSE, +2RMSE, +3RMSE]

Where is the Root MSE?
. regress price Beacon_Street size

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------          F(  2,  1082) = 1627.49
       Model |  5.6215e+13     2  2.8108e+13          Prob > F      =  0.0000
    Residual |  1.8687e+13  1082  1.7271e+10          R-squared     =  0.7505
-------------+------------------------------          Adj R-squared =  0.7501
       Total |  7.4902e+13  1084  6.9098e+10          Root MSE      =  1.3e+05

-------------------------------------------------------------------------------
        price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
Beacon_Street |   32935.89   12987.55     2.54   0.011     7452.263    58419.52
         size |   409.4219   7.190862    56.94   0.000     395.3122    423.5315
        _cons |   6981.353   9961.969     0.70   0.484    -12565.61    26528.32

The Root MSE (root mean squared error), or SEE, is just the square root of the sum of squared errors (SSE) divided by the residual degrees of freedom (approximately the # of observations).
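A minimal sketch checking the Root MSE against the ANOVA table above: take the Residual sum of squares, divide by its degrees of freedom, and take the square root.

```python
import math

# Numbers taken from the Stata ANOVA table.
ss_residual = 1.8687e13   # Residual SS
df_residual = 1082        # residual df = n - k - 1 = 1085 - 2 - 1

# Root MSE = sqrt(Residual SS / residual degrees of freedom).
root_mse = math.sqrt(ss_residual / df_residual)
print(root_mse)  # about 131,000, which Stata displays as 1.3e+05
```

Note the divisor is the residual degrees of freedom (1082), not the raw number of observations (1085); with n this large the distinction barely matters, which is what the slide's "kind of" is gesturing at.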

Goodness of Fit in a Multiple Regression, revisited
Number of obs = 1085
F(2, 1082) = 1627.49
Prob > F = 0.0000
R-squared = 0.7505
Adj R-squared = 0.7501
Root MSE = 1.3e+05
If you compare two models with the same dependent variable (and data set), the best fit will be the one with both of these:
The highest Adjusted R2
The lowest Root MSE/SEE
Note: The Root MSE/SEE depends on the scale of the dependent variable, so it cannot be used to compare the fit of two regressions with different dependent variables.

Example using the Root MSE: If an apartment is not on Beacon Street and has 2000 square feet, what is the predicted price and what is its 95% confidence interval?
. regress price Beacon_Street size

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------          F(  2,  1082) = 1627.49
       Model |  5.6215e+13     2  2.8108e+13          Prob > F      =  0.0000
    Residual |  1.8687e+13  1082  1.7271e+10          R-squared     =  0.7505
-------------+------------------------------          Adj R-squared =  0.7501
       Total |  7.4902e+13  1084  6.9098e+10          Root MSE      =  1.3e+05

-------------------------------------------------------------------------------
        price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
Beacon_Street |   32935.89   12987.55     2.54   0.011     7452.263    58419.52
         size |   409.4219   7.190862    56.94   0.000     395.3122    423.5315
        _cons |   6981.353   9961.969     0.70   0.484    -12565.61    26528.32

Predicted Price = 6981.4 + 32936*0 + 409.4*2000 = 825,781
95% confidence interval: 825,781 +/- 2*130,000 = 565,781 to 1,085,781
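The same plug-in calculation, sketched in Python using the coefficients from the regression output (the slide's 825,781 comes from rounding the coefficients first, so this version differs by a few dollars):

```python
# Prediction for an apartment not on Beacon Street with size = 2000 sq ft.
intercept = 6981.353
b_beacon = 32935.89
b_size = 409.4219
root_mse = 130_000        # Stata's Root MSE, reported as 1.3e+05

predicted = intercept + b_beacon * 0 + b_size * 2000
low = predicted - 2 * root_mse    # approximate 95% CI: +/- 2 Root MSEs
high = predicted + 2 * root_mse
print(predicted, low, high)  # about 825,800, with CI roughly 565,800 to 1,085,800
```

The width of this interval (about half a million dollars either way of a narrower band) is a useful reality check: a high R2 does not automatically mean individual predictions are precise.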

What if your Dependent Variable is a 0/1 Indicator variable? For instance, your dependent (left-hand-side) variable might be an indicator variable: =1 if you bought something, 0 if you didn't.

Example: Your company wants to know how many people will buy a specific ebook at different prices. Let's say you ran an experiment randomly offering the book at different prices. You make an indicator variable for "Did they purchase the book?" (0 = no, 1 = purchased one or more):
Visitor #1: 1 (i.e. bought the ebook), P=$10
Visitor #2: 0 (i.e. didn't buy), P=$10
Visitor #3: 0 (i.e. didn't buy), P=$5
Visitor #4: 1 (i.e. bought the book), P=$5
Visitor #5: 1 (i.e. bought the book), P=$7
What is the average probability that a person bought the book? It is just the average of all of the 1's and 0's.

Example: Your company wants to know how many people will buy a specific ebook at different prices.
Visitor #1: 1 (i.e. bought the ebook), P=$10
Visitor #2: 0 (i.e. didn't buy), P=$10
Visitor #3: 0 (i.e. didn't buy), P=$5
Visitor #4: 1 (i.e. bought the book), P=$5
Visitor #5: 1 (i.e. bought the book), P=$7
We said that the average probability that a person bought the book is just the average of all of the 1's and 0's. How can we find out how the price affects this probability? We run a regression where the indicator variable is the dependent (Y, LHS) variable and the explanatory X variable is Price. This will predict the "Probability of Purchase" based on price.

Example
Buy = .9 - .05 Price
If Price = 5, Buy = .9 - .25 = .65, or a 65% probability of buying
If Price = 10, Buy = .9 - .5 = .4, or a 40% probability of buying
However, if Price = 20, Buy = .9 - 1.0 = -.1, or a -10% probability of buying... which makes no sense.
So if you have this problem in your project, there are other ways to model it that do not use linear regression.
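A minimal sketch evaluating the slide's fitted equation at several prices, showing where the linear probability model breaks down:

```python
# The slide's linear probability model, taken as given:
#   Buy = 0.9 - 0.05 * Price
def prob_buy(price):
    return 0.9 - 0.05 * price

for p in (5, 10, 20):
    print(p, prob_buy(p))
# At Price = 20 the predicted "probability" is about -0.1: a linear model
# is not constrained to [0, 1], which is why logit/probit models are the
# usual alternatives for 0/1 dependent variables.
```

The model behaves sensibly inside the range of prices observed in the data; the nonsense only appears when you extrapolate far outside it.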

Creating and interpreting indicator variables when there are more than 2 categories: Suppose we have seasonal data and want to include indicator variables for whether it is summer, fall, winter, or spring.

With more than 2 categories
As a rule, if a categorical variable has n categories, we need to construct n-1 indicator variables. One category always must be the reference category, the category that the other categories are compared to.
Example: With 4 seasons, create 3 indicator variables. Here I arbitrarily choose Fall to be the reference category and create an indicator variable for each of the other seasons. Let's say that I get this regression:
Sales = 200 + 50 Spring + 90 Summer - 25 Winter - .5 Price

Sales = 200 + 50 Spring + 90 Summer - 25 Winter - .5 Price
Predict Sales in Spring (if Price = 100)
Predict Sales in Summer (if Price = 100)
Predict Sales in Winter (if Price = 100)
Predict Sales in Fall (the reference category) (if Price = 100)
Predict the difference between Sales in Summer and Spring
Predict the difference between Sales in Summer and Fall

Sales = 200 + 50 Spring + 90 Summer - 25 Winter - .5 Price
Predict Sales in Spring: Sales = 200 + 50*1 + 90*0 - 25*0 - .5 Price
If Price = 100, Sales = 200 + 50 - .5*100 = 250 - 50 = 200
Predict Sales in Summer: Sales = 200 + 50*0 + 90*1 - 25*0 - .5 Price
If Price = 100, Sales = 200 + 90 - 50 = 240
Predict Sales in Winter: Sales = 200 + 50*0 + 90*0 - 25*1 - .5 Price
If Price = 100, Sales = 200 - 25 - 50 = 125
Predict Sales in Fall (the reference category): Sales = 200 + 50*0 + 90*0 - 25*0 - .5 Price
If Price = 100, Sales = 200 - 50 = 150
Difference between Sales in Summer and Spring?
Difference: [200 + 90 - .5 Price] - [200 + 50 - .5 Price] = 90 - 50 = 40
The difference between 2 seasons is the difference in the seasons' coefficients.
Difference between Sales in Summer and Fall?
Difference: [200 + 90 - .5 Price] - [200 - .5 Price] = 90
The difference between a season and the reference category is that season's coefficient.
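The worked predictions above can be sketched in a few lines of Python, evaluating the slide's equation at Price = 100:

```python
# The slide's equation: Sales = 200 + 50*Spring + 90*Summer - 25*Winter - 0.5*Price
def sales(spring, summer, winter, price):
    return 200 + 50 * spring + 90 * summer - 25 * winter - 0.5 * price

# One dummy per non-reference season; Fall is all zeros (the reference).
for name, (sp, su, wi) in {"Spring": (1, 0, 0), "Summer": (0, 1, 0),
                           "Winter": (0, 0, 1), "Fall": (0, 0, 0)}.items():
    print(name, sales(sp, su, wi, 100))   # 200, 240, 125, 150

# Season-to-season differences are differences in coefficients:
print(sales(0, 1, 0, 100) - sales(1, 0, 0, 100))  # Summer - Spring = 40
print(sales(0, 1, 0, 100) - sales(0, 0, 0, 100))  # Summer - Fall = 90
```

Because Price enters every prediction the same way, it cancels out of the differences, which is why the season comparisons reduce to comparisons of coefficients.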

Running a Stata regression using a categorical explanatory variable with many categories
You can make a single indicator variable in Stata easily, e.g.:
gen female = 0
replace female = 1 if gender==2
OR in a single line:
gen female = gender==2
In Stata, you don't need to make indicator variables separately for a variable with more than 2 categories. Assuming that you have a string (or numeric) categorical variable season that could take on the values Winter, Fall, Spring and Summer, type:
regress sales price i.season
This will run a multiple regression of sales on price and on 3 seasonal indicator variables. Stata chooses the reference category (it chooses the category it encounters first, although there is a way for you to set a different reference category if you want). Stata will name the indicator variables by the string or number of each value they take.
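To see what i.season builds behind the scenes, here is a plain-Python sketch of constructing n-1 indicator variables by hand (the season values are hypothetical toy data, and Fall is chosen as the reference category):

```python
# Expand a 4-category variable into 3 indicator variables,
# mirroring what Stata's i.season does. Fall = reference category.
seasons = ["Winter", "Fall", "Spring", "Summer", "Fall"]   # toy data
levels = ["Spring", "Summer", "Winter"]                    # every category except Fall

rows = [{f"season_{lvl}": int(s == lvl) for lvl in levels} for s in seasons]
print(rows[0])  # {'season_Spring': 0, 'season_Summer': 0, 'season_Winter': 1}
```

An observation in the reference category (Fall) gets all zeros, which is exactly why its predictions come straight from the intercept and the other coefficients are read as differences from it.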

Time series and time

In time-series (or cross-section/time-series) data, you need to have a variable for time
The variable for time has to increase by 1 each time period. If you have annual data, a variable Year does exactly this. If you have quarterly or monthly (or decade) data, you need to create a variable time. First, make sure the data is ordered chronologically! Then we'll show how to make a variable that is:
1 for first quarter, first year
2 for second quarter, first year
3 for third quarter, first year
4 for fourth quarter, first year
5 for first quarter, second year
etc.

Interpreting the coefficient on the variable "time"
Sales = 1003 + 27 time (quarterly data)
The coefficient on time tells us that Sales increase by 27 each quarter.

Making a variable Time in Stata: background
Note: in Stata, _n means the observation number. To refer to the previous value of a variable, i.e. its value in the previous observation, just use the notation: varname[_n-1]. The square brackets tell Stata the observation number you are referring to.

Making a variable for Time in time-series data in Stata (one observation per time period)
First make sure the data is in chronological order. For instance, if there is a variable "date", type: sort date
Making a time variable (when the data is in chronological order):
gen time=1 in 1 ("in #" tells Stata to do this only for observation #)
replace time= time[_n-1]+1 if _n>1
OR just: gen time= _n

What about panel/longitudinal data (same person/company, different time periods)? How do you make a time variable then?
Example: You have three companies' daily share prices. Your variables are name (for company name), price, and date. In Stata (think like a computer!):
sort name date (This sorts first by company, then by date)
gen time=1 if name != name[_n-1]
replace time= time[_n-1]+1 if name == name[_n-1]
Then check this by typing browse!
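The same logic, sketched in plain Python: restart the time counter whenever the company changes. The (company, date) pairs below are hypothetical and assumed already sorted by company, then date.

```python
# Hypothetical panel data, pre-sorted by (company, date).
rows = [("A", "d1"), ("A", "d2"), ("A", "d3"),
        ("B", "d1"), ("B", "d2"),
        ("C", "d1")]

times = []
for i, (company, _) in enumerate(rows):
    if i == 0 or company != rows[i - 1][0]:
        times.append(1)                 # first observation for this company
    else:
        times.append(times[-1] + 1)     # previous value + 1, like time[_n-1]+1
print(times)  # [1, 2, 3, 1, 2, 1]
```

Like the Stata version, this only works if the data are sorted first; run it on unsorted rows and the counter restarts in the wrong places.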

Quarterly or monthly data
With quarterly or monthly data, you should also include indicator variables for seasonality. For quarterly data, make 3 indicator variables; the fourth quarter is the reference (base) category.
Example: Sales = 998 + 27 time - 4 Q1 + 10 Q2 + 12 Q3
Here, the coefficient on time tells us that Sales increase by 27 each quarter, holding season constant. Q4 is the reference category. Sales in Q2 on average are 10 more than Sales in Q4. Sales in Q1 on average are 4 less than Sales in Q4.
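A short sketch evaluating the slide's quarterly equation, to separate the trend effect (time) from the seasonal effects (the Q dummies):

```python
# The slide's equation: Sales = 998 + 27*time - 4*Q1 + 10*Q2 + 12*Q3
# Q4 is the reference category (all three dummies zero).
def sales(time, quarter):
    q1, q2, q3 = (quarter == 1), (quarter == 2), (quarter == 3)
    return 998 + 27 * time - 4 * q1 + 10 * q2 + 12 * q3

# Holding season fixed, sales grow by 27 per period of "time":
print(sales(time=5, quarter=2) - sales(time=1, quarter=2))  # 108 = 27 * 4
# Same period, Q2 vs the Q4 reference: the gap is the Q2 coefficient:
print(sales(time=5, quarter=2) - sales(time=5, quarter=4))  # 10
```

Comparing the same quarter across years isolates the trend, while comparing quarters within a period isolates the seasonal coefficients, which is exactly how the slide reads the equation.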

Example
Use the hobbit data set
Make a time variable
Make a weekend indicator variable
Regress Gross on time and the weekend indicator
Regress Gross on time and day of week (Day) using i.
Drop the time variable, make a variable hobbit=1, sort by Date, save on desktop
Use Beasts. Sort by Date. Make a variable beasts=1
Merge the two datasets: merge Date using (hobbit file name & location)
Sum; fix the hobbit variable for missing values (make them 0)
Make the time variable after sorting