QM222 Class 10 Section D1
1. Goodness of fit -- review
2. What if your Dependent Variable is a 0/1 Indicator variable? -- review
3. Indicator variables when there are multiple categories
4. Time series data
QM222 Fall 2016 Section D1
Review
Multiple regression: Why use it?
Two reasons why we do multiple regression:
1. To get closer to the "correct/causal" (unbiased) coefficient by controlling for confounding factors
2. To increase the predictive power of a regression
Goodness of fit
How well does the model explain our dependent variable?
- R2 or R-squared (with the same # of X's)
- Adjusted R2 (with a different # of X's)
How accurate are our predictions?
- Root mean squared error, or Root MSE (also called SEE – standard error of the equation)
The R2 measures the goodness of fit. Higher is better. Compare .9414 to .6408
Measuring Goodness of Fit with R2
R2: the fraction of the variation in Y explained by the regression.
It is always between 0 and 1.
In a simple (one-X) regression, R2 = [correlation(X,Y)]2
What does R2 = 1 tell us? That the regression predicts Y perfectly.
What does R2 = 0 tell us? That the model doesn't predict any of the variation in Y.
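The identity above can be checked numerically. This is an illustrative sketch (the data points are made up, not from the slides): fit a simple regression by least squares and confirm that its R2 equals the squared correlation between X and Y.

```python
# Check that in a simple regression, R2 = [correlation(X,Y)]^2.
# The data below are invented purely for demonstration.

def mean(v):
    return sum(v) / len(v)

def corr(x, y):
    # Pearson correlation, computed from sums of squares
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simple_ols_r2(x, y):
    # Fit y = b0 + b1*x by least squares, then R2 = 1 - SSE/SST
    mx, my = mean(x), mean(y)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return 1 - sse / sst

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
assert abs(simple_ols_r2(x, y) - corr(x, y) ** 2) < 1e-12
```

Note this equality holds only with one explanatory variable; in a multiple regression, R2 is instead the squared correlation between Y and predicted Y.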
Goodness of Fit in a Multiple Regression
A higher R2 means a better fit than a lower R2 (when you have the same number of explanatory variables).
However, in multiple regressions, use the adjusted R2 instead (printed right below R2):

Number of obs = 1085
F(2, 1082) = 1627.49
Prob > F = 0.0000
R-squared = 0.7505
Adj R-squared = 0.7501
Root MSE = 1.3e+05

If you compare two regressions (no matter how many X/explanatory variables there are in each), the best fit will be the one with the highest Adjusted R2.
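The slide does not show the adjusted-R2 formula, but the standard one is Adj R2 = 1 − (1 − R2)(n − 1)/(n − k − 1), where n is the number of observations and k the number of explanatory variables. A quick sketch with the numbers from the Stata output above:

```python
# Standard adjusted-R2 formula (not printed on the slide itself):
# penalizes R2 for each extra explanatory variable.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Values from the Stata output: R2 = 0.7505 (rounded), n = 1085, k = 2
adj = adjusted_r2(0.7505, 1085, 2)
assert abs(adj - 0.7501) < 1e-3  # matches the printed Adj R-squared up to rounding
```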
We can also measure fit by looking at the dispersion of actual Y around predicted Ŷ
We predict Ŷ, but we can also make confidence intervals around predicted Ŷ using the Root MSE (or SEE).
The Root MSE (root mean squared error), a.k.a. the SEE (standard error of the equation), measures how spread out the distribution of the errors (residuals) from a regression is:
- We are approximately 68% (around 2/3rds) certain that the actual Y will be within one Root MSE (or SEE) of predicted Ŷ. This is called the 68% Confidence Interval (CI).
- We are approximately 95% certain that the actual Y will be within two Root MSEs (or SEEs) of predicted Ŷ. This is called the 95% Confidence Interval (CI).
[Figure: normal curve centered at Ŷ, with tick marks at ±1, ±2, and ±3 Root MSE]
Where is the Root MSE?

. regress price Beacon_Street size

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  2,  1082) = 1627.49
       Model |  5.6215e+13     2  2.8108e+13           Prob > F      =  0.0000
    Residual |  1.8687e+13  1082  1.7271e+10           R-squared     =  0.7505
-------------+------------------------------           Adj R-squared =  0.7501
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

-------------------------------------------------------------------------------
        price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
Beacon_Street |   32935.89   12987.55     2.54   0.011     7452.263    58419.52
         size |   409.4219   7.190862    56.94   0.000     395.3122    423.5315
        _cons |   6981.353   9961.969     0.70   0.484    -12565.61    26528.32

The Root MSE (root mean squared error) or SEE is just the square root of the sum of squared errors (SSE) divided by the number of observations (more precisely, divided by the residual degrees of freedom, n - k - 1).
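That definition can be verified directly from the ANOVA table above: the Residual SS divided by its degrees of freedom is the Residual MS column, and its square root is the printed Root MSE.

```python
# Reproduce Root MSE from the Stata ANOVA table:
# Root MSE = sqrt(Residual SS / residual degrees of freedom)
residual_ss = 1.8687e13
residual_df = 1082            # n - k - 1 = 1085 - 2 - 1

root_mse = (residual_ss / residual_df) ** 0.5
assert abs(root_mse - 1.3e5) < 5e3   # Stata prints it rounded as 1.3e+05
```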
Goodness of Fit in a Multiple Regression, revisited

Number of obs = 1085
F(2, 1082) = 1627.49
Prob > F = 0.0000
R-squared = 0.7505
Adj R-squared = 0.7501
Root MSE = 1.3e+05

If you compare two models with the same dependent variable (and data set), the best fit will be both of these:
- The one with the highest Adjusted R2
- The one with the lowest Root MSE/SEE
Note: The Root MSE/SEE depends on the scale of the dependent variable, so it cannot be used to compare the fit of two regressions with different dependent variables.
Example using the Root MSE: If an apartment is not on Beacon Street and has 2000 square feet, what is the predicted price and what is its 95% confidence interval?

. regress price Beacon_Street size

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  2,  1082) = 1627.49
       Model |  5.6215e+13     2  2.8108e+13           Prob > F      =  0.0000
    Residual |  1.8687e+13  1082  1.7271e+10           R-squared     =  0.7505
-------------+------------------------------           Adj R-squared =  0.7501
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

-------------------------------------------------------------------------------
        price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
Beacon_Street |   32935.89   12987.55     2.54   0.011     7452.263    58419.52
         size |   409.4219   7.190862    56.94   0.000     395.3122    423.5315
        _cons |   6981.353   9961.969     0.70   0.484    -12565.61    26528.32

Predicted Price = 6981.4 + 32936*0 + 409.4*2000 = 825,781
95% confidence interval: 825,781 +/- 2*130,000 = 565,781 to 1,085,781
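The slide's arithmetic, redone as a short sketch (coefficients rounded exactly as on the slide):

```python
# Prediction and approximate 95% CI (+/- 2 Root MSE) for a unit
# not on Beacon Street with 2000 square feet.
b_cons, b_beacon, b_size = 6981.4, 32936, 409.4  # rounded coefficients from the slide
root_mse = 130_000                               # Root MSE, rounded as printed

predicted = b_cons + b_beacon * 0 + b_size * 2000
low, high = predicted - 2 * root_mse, predicted + 2 * root_mse

assert round(predicted) == 825_781
assert round(low) == 565_781 and round(high) == 1_085_781
```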
What if your Dependent Variable is a 0/1 Indicator variable?
For instance, your dependent (left-hand-side) variable might be an indicator variable: =1 if you bought something, 0 if you didn't.
Example: Your company wants to know how many people will buy a specific ebook at different prices.
Let's say you did an experiment, randomly offering the book at different prices. You make an indicator variable for "Did they purchase the book?" (0 = no, 1 = purchased one or more):
Visitor #1: 1 (i.e. bought the ebook) P=$10
Visitor #2: 0 (i.e. didn't buy) P=$10
Visitor #3: 0 (i.e. didn't buy) P=$5
Visitor #4: 1 (i.e. bought the book) P=$5
Visitor #5: 1 (i.e. bought the book) P=$7
What is the average probability that a person bought the book? It is just the average of all of the 1's and 0's.
Example: Your company wants to know how many people will buy a specific ebook at different prices.
Visitor #1: 1 (i.e. bought the ebook) P=$10
Visitor #2: 0 (i.e. didn't buy) P=$10
Visitor #3: 0 (i.e. didn't buy) P=$5
Visitor #4: 1 (i.e. bought the book) P=$5
Visitor #5: 1 (i.e. bought the book) P=$7
We said that the average probability that a person bought the book is just the average of all of the 1's and 0's.
How can we find out how the price affects this probability? We run a regression where the indicator variable is the dependent (Y, LHS) variable and the explanatory X variable is Price. This will predict the "Probability of Purchase" based on price.
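This kind of regression (a linear probability model) can be sketched by hand on the five visitors above. The slope formula below is ordinary least squares with a single X:

```python
# Linear probability model on the five visitors: regress the 0/1
# purchase indicator on price, by hand-computed OLS.
price = [10, 10, 5, 5, 7]
buy   = [1, 0, 0, 1, 1]

mean_p = sum(price) / len(price)
mean_b = sum(buy) / len(buy)
slope = sum((p - mean_p) * (b - mean_b) for p, b in zip(price, buy)) \
        / sum((p - mean_p) ** 2 for p in price)
intercept = mean_b - slope * mean_p

assert mean_b == 0.6   # the average purchase probability is just the mean of the 0/1s
assert slope < 0       # in this tiny sample, higher price -> slightly lower probability
```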
Example
Buy = .9 - .05 Price
If Price = 5, Buy = .9 - .25 = .65, or a 65% probability of buying
If Price = 10, Buy = .9 - .5 = .4, or a 40% probability of buying
However, if Price = 20, Buy = .9 - 1.0 = -.1, or a -10% probability of buying... which makes no sense.
So if you have this problem in your project, there are other ways to model it (such as logit or probit), not using linear regression.
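The hypothetical fitted line above, evaluated at several prices to show exactly where the linear probability model breaks down:

```python
# The slide's hypothetical fitted line: Buy = .9 - .05 * Price
def prob_buy(price):
    return 0.9 - 0.05 * price

assert abs(prob_buy(5) - 0.65) < 1e-9    # 65% probability of buying
assert abs(prob_buy(10) - 0.40) < 1e-9   # 40% probability of buying
assert prob_buy(20) < 0                  # -10%: an impossible "probability"
```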
Creating and interpreting indicator variables when there are >2 categories
Suppose we have seasonal data and want to include indicator variables for whether it is summer, fall, winter, or spring.
With more than 2 categories
As a rule, if a categorical variable has n categories, we construct n-1 indicator variables. One category must always be the reference category: the category that the other categories are compared to.
Example: With 4 seasons, create 3 indicator variables. Here I arbitrarily choose Fall to be the reference category and create an indicator variable for each of the other seasons. Let's say that I get this regression:
Sales = 200 + 50 Spring + 90 Summer - 25 Winter - .5 Price
Sales = 200 + 50 Spring + 90 Summer - 25 Winter - .5 Price
- Predict Sales in Spring (if Price = 100)
- Predict Sales in Summer (if Price = 100)
- Predict Sales in Winter (if Price = 100)
- Predict Sales in Fall, the reference category (if Price = 100)
- Predict the difference between Sales in Summer and Spring
- Predict the difference between Sales in Summer and Fall
Sales = 200 + 50 Spring + 90 Summer - 25 Winter - .5 Price
Predict Sales in Spring: Sales = 200 + 50*1 + 90*0 - 25*0 - .5 Price
If Price = 100, Sales = 200 + 50 - .5*100 = 250 - 50 = 200
Predict Sales in Summer: Sales = 200 + 50*0 + 90*1 - 25*0 - .5 Price
If Price = 100, Sales = 200 + 90 - 50 = 240
Predict Sales in Winter: Sales = 200 + 50*0 + 90*0 - 25*1 - .5 Price
If Price = 100, Sales = 200 - 25 - 50 = 125
Predict Sales in Fall (the reference category): Sales = 200 + 50*0 + 90*0 - 25*0 - .5 Price
If Price = 100, Sales = 200 - 50 = 150
Difference between Sales in Summer and Spring?
Difference: [200 + 90 - .5 Price] - [200 + 50 - .5 Price] = 90 - 50 = 40
The difference between 2 seasons is the difference in the seasons' coefficients.
Difference between Sales in Summer and Fall?
Difference: [200 + 90 - .5 Price] - [200 - .5 Price] = 90
The difference between a season and the reference category is that season's coefficient.
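The worked example above, written as one small function so all the predictions and differences can be checked at once:

```python
# Sales = 200 + 50*Spring + 90*Summer - 25*Winter - .5*Price,
# with Fall as the reference category (all three indicators = 0).
def sales(season, price):
    season_effect = {"spring": 50, "summer": 90, "winter": -25, "fall": 0}
    return 200 + season_effect[season] - 0.5 * price

assert sales("spring", 100) == 200
assert sales("summer", 100) == 240
assert sales("winter", 100) == 125
assert sales("fall", 100) == 150
# Differences between seasons are differences in coefficients (price drops out):
assert sales("summer", 100) - sales("spring", 100) == 40
assert sales("summer", 100) - sales("fall", 100) == 90
```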
Running a Stata regression using a categorical explanatory variable with many categories
You can make a single indicator variable in Stata easily, e.g.:
gen female = 0
replace female = 1 if gender==2
OR in a single line:
gen female = gender==2
In Stata, you don't need to make indicator variables separately for a variable with more than 2 categories. Assuming you have a numeric categorical variable season whose values stand for Winter, Fall, Spring and Summer (encode a string variable first), type:
regress sales price i.season
This will run a multiple regression of sales on price and on 3 seasonal indicator variables. Stata chooses the reference category (by default the lowest value, although there is a way for you to set a different reference category if you want). Stata will label the indicator variables by the value (or value label) each one represents.
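A Python analogue of what those Stata commands construct (the gender codes and season data below are made up for illustration): `gen female = gender==2` builds a 0/1 indicator, and `i.season` expands a categorical variable into n-1 indicator columns relative to a reference category.

```python
# Like: gen female = gender==2  (gender coded 1 = male, 2 = female, invented here)
gender = [1, 2, 2, 1]
female = [int(g == 2) for g in gender]
assert female == [0, 1, 1, 0]

# Like i.season: one indicator column per non-reference season.
seasons = ["Fall", "Spring", "Summer", "Winter"]
reference = "Fall"                     # the omitted (base) category
data = ["Summer", "Fall", "Winter"]    # made-up observations
indicators = {s: [int(obs == s) for obs in data] for s in seasons if s != reference}
assert indicators == {"Spring": [0, 0, 0], "Summer": [1, 0, 0], "Winter": [0, 0, 1]}
```

Note how the Fall observation gets 0 in every indicator column: the reference category is identified by all indicators being zero, exactly as in the Sales example.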
Time series and time
In time-series (or cross-section/time-series) data, you need to have a variable for time
The variable for time has to increase by 1 each time period.
If you have annual data, a variable Year does exactly this.
If you have quarterly or monthly (or decade) data, you need to create a variable time. First, make sure the data is ordered chronologically! Then we'll show how to make a variable that is:
1 for first quarter, first year
2 for second quarter, first year
3 for third quarter, first year
4 for fourth quarter, first year
5 for first quarter, second year
etc.
Interpreting the coefficient on the variable "time"
With quarterly data: Sales = 1003 + 27 time
The coefficient on time tells us that Sales increase by 27 each quarter.
Making a variable Time in Stata: background
Note: in Stata, _n means the observation number.
In Stata, to refer to the previous value of a variable, i.e. its value in the previous observation, just use the notation: varname[_n-1]
The square brackets tell Stata the observation number you are referring to.
Making a variable for Time in time-series data in Stata (one observation per time period)
First make sure the data is in chronological order. For instance, if there is a variable "date", go:
sort date
Making a time variable (when the data is in chronological order):
gen time = 1 in 1      ("in #" tells Stata to do this only for observation #)
replace time = time[_n-1] + 1 if _n > 1
OR just:
gen time = _n
What about panel/longitudinal data (same person/company, different time periods)? How do you make a time variable then?
Example: You have three companies' daily share prices. Your variables are company (the company name), price, and date.
In Stata (think like a computer!):
sort company date      (This sorts first by company, then by date)
gen time = 1 if company != company[_n-1]
replace time = time[_n-1] + 1 if company == company[_n-1]
Then check this out by going browse!
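The same "think like a computer" logic, sketched in Python rather than Stata: after sorting by company then date, the counter restarts at 1 whenever the company changes and otherwise adds 1 to the previous row's value.

```python
# Per-company time counter for panel data (company/date pairs are made up,
# and already sorted by company then date, as the Stata sort would leave them).
rows = [("A", "d1"), ("A", "d2"), ("A", "d3"), ("B", "d1"), ("B", "d2")]

time = []
for i, (company, date) in enumerate(rows):
    if i == 0 or company != rows[i - 1][0]:
        time.append(1)                 # like: gen time = 1 if company != company[_n-1]
    else:
        time.append(time[-1] + 1)      # like: replace time = time[_n-1] + 1

assert time == [1, 2, 3, 1, 2]         # counter restarts for each company
```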
Quarterly or monthly data
With quarterly or monthly data, you should also include indicator variables for seasonality. For quarterly data, make 3 indicator variables; the fourth quarter is the reference (base) category.
Example: Sales = 998 + 27 time - 4 Q1 + 10 Q2 + 12 Q3
Here, the coefficient on time tells us that Sales increase by 27 each quarter, holding season constant.
Q4 is the reference category.
Sales in Q2 on average are 10 more than Sales in Q4.
Sales in Q1 on average are 4 less than Sales in Q4.
Example
- Use hobbit data set
- Make time variable
- Make a weekend indicator variable
- Regress Gross on time and weekend indicator
- Regress Gross on time and day of week (Day) using i.
- Drop time variable, make variable hobbit=1, sort by Date, save on desktop
- Use Beasts. Sort by Date. Make variable beasts=1
- Merge the two datasets: merge Date using (hobbit file name & location)
- Sum; fix the hobbit variable for missing values (make them 0)
- Make time variable after sorting