QM222 Class 8 Section D1 1. Review: coefficient statistics: standard errors, t-statistics, p-values (chapter 7) 2. Multiple regression 3. Goodness of fit.


1 QM222 Class 8 Section D1 1. Review: coefficient statistics: standard errors, t-statistics, p-values (chapter 7) 2. Multiple regression 3. Goodness of fit QM222 Fall 2016 Section D1

2 Scheduling reminders:
I have (replacement) office hours tomorrow (Thursday), 10-12.
Reminder: no class (or office hours) next Monday, Oct. 3.

3 Regression statistics tell us how certain we are about the coefficient’s true value (in light of the fact that we have a limited number of observations).

4 Coefficient statistics
[Stata output for the regression of price on size; numeric values lost in extraction. Recoverable: F(1, 1083), Root MSE = 1.3e+05; coefficient rows: size, _cons.]
We are approximately 95% (68%) certain that the “true” coefficient is within two (one) standard errors of the estimated coefficient.
The 95% confidence interval for each coefficient is given on the right of that coefficient’s line.
If the 95% confidence interval of a coefficient does not include zero, we are at least 95% confident that the coefficient is NOT zero, so that size affects price.
The t-stat next to the coefficient in the regression output tests the null hypothesis that the true coefficient is actually zero. When | t | > 2.0, we reject this hypothesis and conclude that size affects price (with >= 95% certainty).
p-values: Loosely, the p-value tells us how probable it is that the coefficient is 0 or of the opposite sign. Strictly, it is the probability of seeing a t-statistic at least this large in absolute value if the true coefficient were zero. When the p-value <= .05, we are at least 95% certain that size affects price.
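The rules of thumb on this slide come down to a little arithmetic, sketched here in Python. The coefficient and standard error are hypothetical values chosen for illustration; they are not taken from the Stata output, whose numbers did not survive.

```python
# Hypothetical estimate and standard error (NOT taken from the Stata output).
coef = 400.0     # estimated coefficient on size
std_err = 8.0    # its standard error

# t-statistic for the null hypothesis that the true coefficient is zero
t_stat = coef / std_err

# Approximate 95% confidence interval: estimate +/- 2 standard errors
ci_low = coef - 2 * std_err
ci_high = coef + 2 * std_err

# Rule of thumb: |t| > 2 goes together with zero lying outside the interval
significant = abs(t_stat) > 2
zero_outside_ci = not (ci_low <= 0 <= ci_high)

print(t_stat, (ci_low, ci_high), significant, zero_outside_ci)
```

With these made-up numbers the t-statistic is far above 2 and the interval excludes zero, so both checks agree.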

5 p-values: The p-value tells us exactly how probable it is that the coefficient is 0 or of the opposite sign.
[Stata output for the regression of price on size; numeric values lost in extraction. Recoverable: F(1, 1083), Root MSE = 1.3e+05; coefficient rows: size, _cons.]
This p-value says that it is less than .0005 (or .05%) likely that the coefficient on size is 0 or negative. (Any higher than this and it would be displayed as .001 after rounding.)
I am more than 100% - .05% = 99.95% certain that the coefficient is not zero.

6 Multiple Regression

7 Multiple Regression
The multiple linear regression model is an extension of the simple linear regression model, where the dependent variable Y depends (linearly) on more than one explanatory variable:
Ŷ = b0 + b1X1 + b2X2 + b3X3 + …
We now interpret b1 as the change in Y when X1 changes by 1 and all other variables in the equation REMAIN CONSTANT.
We say: “controlling for” the other variables (X2, X3).
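A minimal sketch of the “all other variables remain constant” interpretation, in Python. The coefficients b0, b1, b2 are hypothetical and only illustrate the arithmetic.

```python
# Hypothetical coefficients for Y-hat = b0 + b1*X1 + b2*X2 (illustration only).
b0, b1, b2 = 10.0, 3.0, -2.0

def predict(x1, x2):
    """Predicted Y from the fitted equation."""
    return b0 + b1 * x1 + b2 * x2

# Raise X1 by one unit while holding X2 fixed at the same value.
change = predict(5.0, 7.0) - predict(4.0, 7.0)
print(change)  # equals b1
```

Whatever value X2 is held at, the difference between the two predictions is b1, which is what “controlling for X2” means here.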

8 Example of multiple regression
We predicted the sale price of a condo in Brookline based on “Beacon_Street”:
Price = 520,729 – […] × Beacon_Street   R2 = .0031
We expected condos on Beacon to cost more and are surprised by the result, but there are confounding factors that might be correlated with Beacon Street, such as size (in square feet).
So we run a regression of Price (Y) on TWO explanatory variables, Beacon_Street AND size.

9 Multiple regression in Stata
. regress price Beacon_Street size
[Stata output; numeric values lost in extraction. Recoverable: F(2, 1082), Root MSE = 1.3e+05; coefficient rows: Beacon_Street, size, _cons.]
Write the regression equation.
Is the effect of Beacon statistically significant? Of size? Why?

10 More on interpreting multiple regression
Price = […] + […] × Beacon_Street + […] × size   R2 = .7505
If we compare 2 condos of the same size, the one on Beacon Street will cost more.
Or: Holding size constant, condos on Beacon Street cost more.
Or: Controlling for size, condos on Beacon Street cost more.
IN OTHER WORDS: Adding additional, possibly confounding variables to the regression takes the bias (due to the missing variable) out of the coefficient on the variable we are interested in (Beacon Street). We isolate the true effect of Beacon, which was otherwise confounded by the fact that Beacon and size are related and size affects price.
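The way a missing variable can bias a coefficient can be sketched with a small synthetic data set in Python. Everything here is made up for illustration: the six condos, the “true” equation used to generate prices, and the helper `ols` (a bare-bones least-squares solver using exact fractions) are assumptions, not the Brookline data or Stata's method.

```python
from fractions import Fraction

def ols(rows, y):
    """Least-squares coefficients [b0, b1, ...] from the normal equations
    (X'X) b = X'y, solved exactly by Gauss-Jordan elimination.
    rows holds one tuple of regressors per observation (constant added here).
    Assumes the regressors are not perfectly collinear."""
    X = [[Fraction(1)] + [Fraction(v) for v in r] for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * Fraction(yi) for x, yi in zip(X, y))]
         for i in range(k)]
    for c in range(k):
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                factor = A[r][c]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

# Synthetic condos: the Beacon ones are smaller by construction.
size = [1000, 1500, 2000, 600, 800, 1000]
beacon = [0, 0, 0, 1, 1, 1]
# Made-up "true" relationship: Beacon adds 40,000 at any given size.
price = [50_000 + 300 * s + 40_000 * b for s, b in zip(size, beacon)]

simple = ols([(b,) for b in beacon], price)    # price on Beacon only
multiple = ols(list(zip(beacon, size)), price)  # price on Beacon and size

print([float(b) for b in simple])    # [intercept, Beacon]
print([float(b) for b in multiple])  # [intercept, Beacon, size]
```

In this synthetic example the regression on Beacon alone gives a Beacon coefficient of -170,000, while adding size recovers the +40,000 built into the data. The sign flips only because size was left out and is correlated with Beacon.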

11 More on interpreting multiple regression
Price = 520,729 – […] × Beacon_Street   R2 = .0031
Price = […] + […] × Beacon_Street + […] × size   R2 = .7505
We learn something from the difference in the coefficients on Beacon Street.
Challenge question: Does this suggest that Beacon Street condos are bigger or smaller than others? (We’ll come back to this topic.)

12 Multiple regression: Why use it?
2 reasons why we do multiple regression:
To get closer to the “correct/causal” (unbiased) coefficient by controlling for confounding factors.
To increase the predictive power of a regression. (We’ll soon learn how to measure this power.)

13 Write out these regressions on Brookline condos
. regress price Fullbathrooms
[Stata output; numeric values lost in extraction. Recoverable: F(1, 1083); coefficient rows: Fullbathrooms, _cons.]
. regress price Beacon_Street size Fullbathrooms
[Stata output; numeric values lost in extraction. Recoverable: F(3, 1081), Root MSE = 1.3e+05; coefficient rows: Beacon_Street, size, Fullbathrooms, _cons.]

14 Interpreting multiple regression
Price = 75,[…] + […],608 × Full baths
Price = -30,541 + 68,996 × Full baths + 34,373 × Beacon + […] × size
Q1: Interpret the coefficient of FullBaths in each regression.
Q2: How can you explain the very large discrepancy between these coefficients?
Q3: Which of the two models would you use to assess the financial profitability of replacing a spare room with a second full bathroom in a condo on Beacon Street? Why?

15 Goodness of fit
How well does the model explain our dependent variable? New statistics: R2 (or R-squared) and adjusted R2.
How accurate are our predictions? New statistic: SEE / Root MSE.

16 Background to Goodness of Fit: Predicted line Ŷ=b0+b1X and errors
Y = Ŷ + error
Y = b0 + b1X + error
For any specific Xi (e.g. 2700), we predict Ŷ, the value along the line. (The subscript i is for an “individual” point.)
But each actual Yi observation is not exactly the same as the predicted value Ŷ. The difference is called the RESIDUAL or ERROR.

17 Key term: “Error” AKA “Residual”
Error = the difference between the actual and the predicted (or fitted) LHS variable (actual minus predicted). Also known as the RESIDUAL.
Yi = (b0 + b1Xi) + errori
errori = Yi - (b0 + b1Xi)
In the scatter plot, the error is the vertical distance between the dot and the line. This distance can be positive or negative.
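The definition errori = Yi - (b0 + b1Xi) can be sketched in a few lines of Python. The fitted line and the three data points are hypothetical.

```python
# Hypothetical fitted line and data points (illustration only).
b0, b1 = 100.0, 2.0
xs = [10.0, 20.0, 30.0]
ys = [125.0, 135.0, 160.0]

# Predicted value along the line for each X
predicted = [b0 + b1 * x for x in xs]

# Residual = actual minus predicted
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]
print(residuals)
```

The first point lies above the line (positive residual, the model under-predicts), the second below it (negative residual), and the third exactly on it.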

18 Residual or Error
What is the interpretation of $520,516?
What can explain it?

19 Residual or Error
What is the interpretation of $520,516? It is how much the model under-predicts the price of this particular apartment.
What can explain it? A nice view, a great location, a brand-new expensive kitchen and bathroom.

20 Condos (different data from before)
[Two scatter plots of price against size with fitted regression lines; the equations and R² values were lost in extraction.]
The intercept and slopes are the same in both regressions, so the predictions will be the same.
However, the smaller the errors (the closer the data points are to the line), the better the prediction is.
In other words, the more highly correlated the two variables, the better the goodness of fit.

21 The R2 measures the goodness of fit. Higher is better. Compare .9414 to .6408
Which prediction will you trust more and why? What kind of city do you think each graph would be a good representation of?

22 Measuring Goodness of Fit with R2
R2: the fraction of the variation in Y explained by the regression.
It is always between 0 and 1.
In a simple regression with one X, R2 = [correlation(X,Y)]^2.
What does R2 = 1 tell us? It is the same as a correlation of 1 OR -1. It means that the regression predicts Y perfectly.
What does R2 = 0 mean? It means that the model doesn’t predict any of the variation in Y. It is the same as a correlation of 0. Also, the slope b1 would be 0 if there really is 0 correlation.
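As a rough check of the two descriptions of R2 (fraction of variation explained, and squared correlation in a one-variable regression), here is a short Python sketch. The five data points are hypothetical, and the slope and intercept use the standard simple-regression least-squares formulas.

```python
# Hypothetical data (illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sums of squares and cross-products around the means
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)   # total variation in Y
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Least-squares slope and intercept for a one-variable regression
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# R-squared as the fraction of variation explained: 1 - SSE / total variation
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - sse / syy

# Correlation between X and Y
corr = sxy / (sxx * syy) ** 0.5

print(r_squared, corr ** 2)
```

Both printed values agree (up to floating-point rounding), at about 0.6 for this made-up data.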

23 What is a “high” R2?
As with correlation, there are no strict rules. It depends on context.
We’ll get a high R2 for outcomes that are easily predictable.
We’ll get a low R2 for outcomes that depend heavily on unobserved factors (like people’s behavior). But that doesn’t mean that the X variable is a useless predictor. It means a person is hard to predict.
Do not worry too much about R-squared unless your question is “how well can I predict?”
Most of you will emphasize statistics about the coefficients, i.e. “how well can I predict the IMPACT of X on Y?”

24 Where do we see information on R-squared on the Stata output?
. regress price Beacon_Street
[Stata output; most numeric values lost in extraction. Recoverable: F(1, 1083) = 3.31; coefficient rows: Beacon_Street, _cons.]
The R2 is reported on the “R-squared” line of the output. Here it is tiny.

25 Regression of price on Size in sq. ft.
regress yvariablename xvariablename
For instance, to run a regression of price on size, type:
regress price size
This R2 is much greater. Size is a better predictor of the condo price.

26 We can also make confidence intervals around predicted Y
We predict Ŷ but can make confidence intervals around predicted Ŷ using the Root MSE (or SEE).
The Root MSE (root mean squared error), a.k.a. the SEE (standard error of the equation), measures how spread out the distribution of the errors (residuals) from a regression is.
We are approximately 68% (or around 2/3rds) certain that the actual Y will be within one Root MSE (or SEE) of predicted Ŷ. This is called the 68% Confidence Interval (CI).
We are approximately 95% certain that the actual Y will be within two Root MSEs (or SEEs) of predicted Ŷ. This is called the 95% Confidence Interval (CI).
[Figure: distribution of errors centered on Ŷ, with the axis marked from -3 RMSE to +3 RMSE.]

27 Where is the Root MSE?
. regress price Beacon_Street size
[Stata output; numeric values lost in extraction. Recoverable: F(2, 1082), Root MSE = 1.3e+05; coefficient rows: Beacon_Street, size, _cons.]
The Root MSE (root mean squared error) or SEE is, roughly, the square root of the sum of squared errors (SSE) divided by the number of observations. (“Kind of”: strictly, the SSE is divided by the number of observations minus the number of estimated coefficients.)
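The “kind of” can be made concrete with a short Python sketch. The residuals below are hypothetical and stand in for the errors from a regression with one explanatory variable (k = 1); the divisor n - k - 1 is the residual degrees of freedom, the second number in the F( , ) line of the Stata output.

```python
import math

# Hypothetical residuals from a regression with k = 1 explanatory variable.
residuals = [2.0, -2.0, 2.0, -2.0, 0.0, 0.0]
n = len(residuals)
k = 1

# Sum of squared errors
sse = sum(e ** 2 for e in residuals)

# Root MSE divides by the residual degrees of freedom, n - k - 1
root_mse = math.sqrt(sse / (n - k - 1))

# Dividing by n instead gives a slightly smaller number
naive = math.sqrt(sse / n)

print(root_mse, naive)
```

With many observations and few variables, as in the condo data, the two divisors are close and so are the two results.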

28 If an apartment is not on Beacon and has 2000 square feet, what is the predicted price and what is its 95% confidence interval?
. regress price Beacon_Street size
[Stata output; numeric values lost in extraction. Recoverable: F(2, 1082), Root MSE = 1.3e+05; coefficient rows: Beacon_Street, size, _cons.]
Predicted Price = [coefficients lost in extraction] = 825,781 (with Beacon_Street = 0 and size = 2000)
95% confidence interval: 825,781 +/- 2*130,000 = 565,781 to 1,085,781.
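The interval arithmetic on this slide can be checked in Python. The two inputs are the numbers given above (the predicted price of 825,781 and the Root MSE of 1.3e+05); the variable names are only illustrative.

```python
predicted_price = 825_781   # predicted price from the slide
root_mse = 130_000          # Root MSE = 1.3e+05 from the Stata output

# Approximate 95% interval: prediction +/- 2 Root MSEs
low = predicted_price - 2 * root_mse
high = predicted_price + 2 * root_mse

print(low, high)
```

This gives the 565,781 to 1,085,781 range stated on the slide.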

29 Goodness of Fit in a Multiple Regression
A higher R2 means a better fit than a lower R2 (when you have the same number of explanatory variables).
In multiple regressions, use the adjusted R2 instead (right below R2 in the Stata output).
If you compare two models with the same dependent variable (and data set), the best fit will be both of these:
The one with the highest adjusted R2.
The one with the lowest Root MSE/SEE.
Note: Root MSE/SEE depends on the scale of the dependent variable, so it cannot be used to compare the fit of two regressions with different dependent variables.
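The slide does not give the formula for adjusted R2. The standard textbook definition, 1 - (1 - R2)(n - 1)/(n - k - 1), is sketched below in Python. R2 = .7505 comes from the Beacon_Street and size regression; n = 1085 is an assumption inferred from the F(2, 1082) line (n - k - 1 = 1082 with k = 2), and the function name is hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Standard adjusted R-squared for n observations and k explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2 = 0.7505   # R2 of the regression of price on Beacon_Street and size
n = 1085      # ASSUMED: inferred from F(2, 1082), not stated in the output
k = 2

adj = adjusted_r2(r2, n, k)

# Same R2 with one more explanatory variable: the adjustment penalizes it
adj_extra_var = adjusted_r2(r2, n, k + 1)

print(adj, adj_extra_var)
```

The adjusted value is slightly below R2, and it falls further if a variable is added without raising R2, which is why it is the fairer yardstick across models with different numbers of explanatory variables.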

30 Inputting data in Stata: review
Who will convert (or has converted) Excel data to Stata? File→Import→Excel spreadsheet opens an Excel data file. If importing an Excel file doesn’t work, save the Excel file first as csv.
Who will convert (or has converted) .csv data to Stata? File→Import→Text file (delimited, *.csv) opens a text file with columns divided by commas.
Who downloaded Stata data? Did it have a “Stata” or “do-file” program also? Download that as well and talk to me/TA.
EVERYONE: Download the codebook or data dictionary or definition of variables as well.
Other commands at this point:
count (with “if” statements)
cd drive:\foldername (e.g. C:\QM222)
File→Save or File→Save as
exit, clear

