QM222 Class 8 Section D1 1. Review: coefficient statistics: standard errors, t-statistics, p-values (chapter 7) 2. Multiple regression 3. Goodness of fit.


Scheduling reminders: I have (replacement) office hours tomorrow (Thursday) 10-12 Reminder: No class (or office hours) next Monday Oct. 3 QM222 Fall 2016 Section D1

Regression statistics tell us how certain we are about the coefficient's true value (in light of the fact that we have a limited number of observations).

Coefficient statistics

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  1,  1083) = 3232.35
       Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
    Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
-------------+------------------------------           Adj R-squared =  0.7488
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------

We are approximately 95% (68%) certain that the "true" coefficient is within two (one) standard errors of the estimated coefficient.
The 95% confidence interval for each coefficient is given at the right of that coefficient's line. If the 95% confidence interval of a coefficient does not include zero, we are at least 95% confident that the coefficient is NOT zero, so that size affects price.
The t-statistic next to the coefficient in the regression output tests the null hypothesis that the true coefficient is actually zero. When |t| > 2.0, we reject this hypothesis, so that size affects price (with at least 95% certainty).
p-values: The p-value tells us exactly how probable it is that the coefficient is 0 or of the opposite sign. When the p-value <= .05, we are at least 95% certain that size affects price.
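The two-standard-error rule of thumb can be checked directly against the Stata output. A minimal sketch (not part of the course materials), using the size coefficient and standard error printed above:

```python
# Rule-of-thumb 95% confidence interval for the size coefficient:
# coefficient +/- 2 standard errors (Stata's exact interval uses the
# t-distribution with 1083 degrees of freedom, so it differs slightly).
coef = 407.4513   # size coefficient from the output above
se = 7.166659     # its standard error
lo, hi = coef - 2 * se, coef + 2 * se
print(round(lo, 2), round(hi, 2))  # 393.12 421.78, close to Stata's [393.39, 421.51]
```

Since the interval excludes zero by a wide margin, we are (well over) 95% confident that size affects price.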

p-values: The p-value tells us exactly how probable it is that the coefficient is 0 or of the opposite sign.

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  1,  1083) = 3232.35
       Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
    Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
-------------+------------------------------           Adj R-squared =  0.7488
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------

This p-value says that it is less than .0005 (or .05%) likely that the coefficient on size is 0 or negative. (Any higher and it would round to .001.) I am more than 100% - .05% = 99.95% certain that the coefficient is not zero.

Multiple Regression

Multiple Regression
The multiple linear regression model is an extension of the simple linear regression model, where the dependent variable Y depends (linearly) on more than one explanatory variable:
Ŷ = b0 + b1X1 + b2X2 + b3X3 …
We now interpret b1 as the change in Y when X1 changes by 1 and all other variables in the equation REMAIN CONSTANT. We say we are "controlling for" the other variables (X2, X3, …).
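The model above can be fit by ordinary least squares in any language. A minimal sketch in Python with NumPy, on made-up data (the variable names and true coefficients below are invented for illustration, not taken from the course dataset):

```python
# Fit Y = b0 + b1*X1 + b2*X2 by least squares on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X1 = rng.uniform(500, 3000, n)            # e.g. a size-like variable
X2 = rng.integers(0, 2, n).astype(float)  # e.g. a 0/1 dummy variable
Y = 7000 + 400 * X1 + 33000 * X2 + rng.normal(0, 20000, n)

X = np.column_stack([np.ones(n), X1, X2])  # prepend a column of 1s for b0
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(b)  # estimates of (b0, b1, b2); roughly (7000, 400, 33000)
```

Each fitted slope is the change in Y for a one-unit change in that X, holding the other X constant.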

Example of multiple regression
We predicted the sale price of a condo in Brookline based on "Beacon_Street":
Price = 520,729 - 46969 Beacon_Street    R2 = .0031
We expected condos on Beacon to cost more and are surprised by the result, but there are confounding factors that might be correlated with Beacon Street, such as size (in square feet).
So we run a regression of Price (Y) on TWO explanatory variables, Beacon_Street AND size.

Multiple regression in Stata

. regress price Beacon_Street size

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  2,  1082) = 1627.49
       Model |  5.6215e+13     2  2.8108e+13           Prob > F      =  0.0000
    Residual |  1.8687e+13  1082  1.7271e+10           R-squared     =  0.7505
-------------+------------------------------           Adj R-squared =  0.7501
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

-------------------------------------------------------------------------------
        price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
Beacon_Street |   32935.89   12987.55     2.54   0.011     7452.263    58419.52
         size |   409.4219   7.190862    56.94   0.000     395.3122    423.5315
        _cons |   6981.353   9961.969     0.70   0.484    -12565.61    26528.32
-------------------------------------------------------------------------------

Write the regression equation. Is the effect of Beacon statistically significant? Of size? Why?

More on interpreting multiple regression
Price = 6981 + 32936 Beacon_Street + 409.4 size    R2 = .7505
If we compare 2 condos of the same size, the one on Beacon Street will cost 32936 more.
Or: Holding size constant, condos on Beacon Street cost 32936 more.
Or: Controlling for size, condos on Beacon Street cost 32936 more.
IN OTHER WORDS: Adding the possibly confounding variable to the regression removes the bias (due to the omitted variable) from the coefficient on the variable we are interested in (Beacon_Street). We isolate the true effect of Beacon from being confounded with the fact that Beacon and size are related and size affects price.
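A simulation makes the bias-removal concrete. This sketch uses entirely made-up numbers (the assumption that the dummy group tends to be smaller, and every coefficient, is invented for illustration) to show how omitting a correlated variable can even flip a coefficient's sign:

```python
# Simulated illustration of omitted variable bias (not the Brookline data).
import numpy as np

rng = np.random.default_rng(1)
n = 1000
beacon = rng.integers(0, 2, n).astype(float)
# Assumption for illustration: condos with beacon=1 tend to be smaller.
size = 1400 - 300 * beacon + rng.normal(0, 200, n)
# True model: beacon raises price by 33000, size by 400 per unit.
price = 7000 + 400 * size + 33000 * beacon + rng.normal(0, 30000, n)

def ols(cols, y):
    X = np.column_stack([np.ones(len(y))] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols([beacon], price)        # size omitted: coefficient is biased
b_long = ols([beacon, size], price)   # size controlled for
print(b_short[1] < 0, b_long[1] > 0)  # short regression flips the sign
```

The short regression's beacon coefficient absorbs the effect of the smaller sizes (roughly 33000 + 400 × (−300) ≈ −87000 here); controlling for size recovers a value near the true +33000.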

More on interpreting multiple regression Price = 520,729 – 46969 Beacon_Street R2 = .0031 Price = 6981 + 32936 Beacon_Street + 409.4 size R2 = .7505 We learn something from the difference in the coefficients on Beacon Street. Challenge question: Does this suggest that Beacon Street condos are bigger or smaller than others? (we’ll come back to this topic)

Multiple regression: Why use it?
2 reasons why we do multiple regression:
To get closer to the "correct/causal" (unbiased) coefficient by controlling for confounding factors
To increase the predictive power of a regression (we'll soon learn how to measure this power)

Write out these regressions on Brookline condos

. regress price Fullbathrooms

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  1,  1083) =  982.27
       Model |  3.5624e+13     1  3.5624e+13           Prob > F      =  0.0000
    Residual |  3.9278e+13  1083  3.6267e+10           R-squared     =  0.4756
-------------+------------------------------           Adj R-squared =  0.4751
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.9e+05

-------------------------------------------------------------------------------
        price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
Fullbathrooms |     293608   9368.142    31.34   0.000     275226.3    311989.8
        _cons |   75701.13   15183.12     4.99   0.000     45909.47    105492.8
-------------------------------------------------------------------------------

. regress price Beacon_Street size Fullbathrooms

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  3,  1081) = 1165.33
       Model |  5.7212e+13     3  1.9071e+13           Prob > F      =  0.0000
    Residual |  1.7690e+13  1081  1.6365e+10           R-squared     =  0.7638
-------------+------------------------------           Adj R-squared =  0.7632
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

-------------------------------------------------------------------------------
        price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
Beacon_Street |   34373.49   12643.79     2.72   0.007     9564.337    59182.63
         size |   355.9106   9.799695    36.32   0.000      336.682    375.1392
Fullbathrooms |   68996.13   8842.991     7.80   0.000     51644.76     86347.5
        _cons |  -30541.11   10824.26    -2.82   0.005    -51780.05   -9302.174
-------------------------------------------------------------------------------

Interpreting multiple regression
Price = 75,701 + 293,608 Full baths
Price = -30,541 + 68,996 Full baths + 34,373 Beacon + 355.9 size
Q1: Interpret the coefficient of Full baths in each regression.
Q2: How can you explain the very large discrepancy between these coefficients?
Q3: Which of the two models would you use to assess the financial profitability of replacing a spare room with a second full bathroom in a condo on Beacon Street? Why?

Goodness of fit
How well does the model explain our dependent variable?
New statistics: R2 (R-squared) and adjusted R2
How accurate are our predictions?
New statistic: Root MSE / SEE

Background to Goodness of Fit: the predicted line Ŷ = b0 + b1X and errors
Y = Ŷ + error
Y = b0 + b1X + error
For any specific Xi (e.g. 2700), we predict Ŷi, the value along the line. (The subscript i is for an "individual" point.)
But each actual observation Yi is not exactly the same as the predicted value Ŷi. The difference is called the RESIDUAL or ERROR.

Key term: "Error" AKA "Residual"
Error = the difference between the actual and the predicted (fitted) LHS variable: actual minus predicted. Also known as the RESIDUAL.
Yi = (b0 + b1Xi) + errori
errori = Yi - (b0 + b1Xi)
In the scatter plot, the error is the distance between the dot and the line. This distance can be positive or negative.
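A tiny numerical sketch of the definition, using the fitted line price = 16651 + 401.31 size that appears on a later slide; the observed price below is a made-up observation for illustration:

```python
# residual (error) = actual Y minus predicted Y-hat
b0, b1 = 16651, 401.31          # fitted intercept and slope (from the slides)
size_i, price_i = 2000, 825000  # hypothetical condo: its X and its actual Y
predicted = b0 + b1 * size_i    # the value on the fitted line
residual = price_i - predicted  # positive: the line under-predicts this condo
print(round(predicted, 2), round(residual, 2))  # 819271.0 5729.0
```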

Residual or Error
What is the interpretation of $520,516? What can explain it?

Residual or Error
What is the interpretation of $520,516? How much the model under-predicts the price of this particular apartment.
What can explain it? A nice view, a great location, a brand new expensive kitchen and bathroom.

price = 16651 + 401.31 size    R² = 0.9414
Condos (different data from before): The intercept and slope are the same in both regressions, so predictions will be the same. However, the smaller the errors (the closer the data points are to the line), the better the prediction is. In other words, the more highly correlated the two variables, the better the goodness of fit.

The R2 measures the goodness of fit. Higher is better. Compare .9414 to .6408 Which prediction will you trust more and why? What kind of city do you think each graph would be a good representation of?

Measuring Goodness of Fit with R2
R2: the fraction of the variation in Y explained by the regression. It is always between 0 and 1.
In a simple regression, R2 = [correlation(X,Y)]2
What does R2 = 1 tell us?
It is the same as a correlation of 1 OR -1.
It means that the regression predicts Y perfectly.
What does R2 = 0 mean?
It means that the model doesn't predict any of the variation in Y.
It is the same as a correlation of 0.
Also, the slope b1 would be 0 if there really is 0 correlation.
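The identity R2 = [correlation(X,Y)]2 for a one-variable regression can be verified numerically. A sketch on simulated data (the data here are invented purely to check the identity):

```python
# Check: in simple regression, R-squared equals the squared correlation.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 500)
y = 2 * x + rng.normal(0, 1, 500)

r = np.corrcoef(x, y)[0, 1]                 # correlation(X, Y)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # OLS slope
b0 = y.mean() - b1 * x.mean()                          # OLS intercept
resid = y - (b0 + b1 * x)
r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()  # 1 - SSR/SST
print(abs(r2 - r ** 2) < 1e-10)  # True: the two measures agree
```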

What is a “high” R2 ? As with correlation, there are no strict rules - it depends on context We’ll get high R2 for outcomes that are easily predictable We’ll get low R2 for outcomes that depend heavily on unobserved factors (like people’s behavior) But that doesn’t mean that the X variable is a useless predictor … It means a person is hard to predict. Do not worry too much about R-squared unless your question is “how well can I predict?” Most of you will emphasize statistics about the coefficients, i.e. “how well can I predict the IMPACT of X on Y?”

Where do we see information on R-squared on the Stata output?

. regress price Beacon_Street

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  1,  1083) =    3.31
       Model |  2.2855e+11     1  2.2855e+11           Prob > F      =  0.0689
    Residual |  7.4673e+13  1083  6.8951e+10           R-squared     =  0.0031
-------------+------------------------------           Adj R-squared =  0.0021
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 2.6e+05

-------------------------------------------------------------------------------
        price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
Beacon_Street |  -46969.18   25798.41    -1.82   0.069    -97589.71    3651.345
        _cons |   520728.9   8435.427    61.73   0.000     504177.2    537280.5
-------------------------------------------------------------------------------

This is the R2 (R-squared = 0.0031). It is tiny.

Regression of price on Size in sq. ft.
regress yvariablename xvariablename
For instance, to run a regression of price on size, type:
regress price size
This R2 is much greater: Size is a better predictor of the condo price.

We can also make confidence intervals around predicted Y
We predict Ŷ but can make confidence intervals around predicted Ŷ using the Root MSE (or SEE).
The Root MSE (root mean squared error), a.k.a. the SEE (standard error of the equation), measures how spread out the distribution of the errors (residuals) from a regression is.
We are approximately 68% (or around 2/3) certain that the actual Y will be within one Root MSE (or SEE) of predicted Ŷ. This is called the 68% Confidence Interval (CI).
We are approximately 95% certain that the actual Y will be within two Root MSEs (or SEEs) of predicted Ŷ. This is called the 95% Confidence Interval (CI).

Where is the Root MSE?

. regress price Beacon_Street size

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  2,  1082) = 1627.49
       Model |  5.6215e+13     2  2.8108e+13           Prob > F      =  0.0000
    Residual |  1.8687e+13  1082  1.7271e+10           R-squared     =  0.7505
-------------+------------------------------           Adj R-squared =  0.7501
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

-------------------------------------------------------------------------------
        price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
Beacon_Street |   32935.89   12987.55     2.54   0.011     7452.263    58419.52
         size |   409.4219   7.190862    56.94   0.000     395.3122    423.5315
        _cons |   6981.353   9961.969     0.70   0.484    -12565.61    26528.32
-------------------------------------------------------------------------------

The Root MSE (root mean squared error) or SEE is just the square root of the sum of squared errors (SSE) divided by the # of observations (kind of).

If an apartment is not on Beacon and has 2000 square feet, what is the predicted price and what is its 95% confidence interval?

. regress price Beacon_Street size

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  2,  1082) = 1627.49
       Model |  5.6215e+13     2  2.8108e+13           Prob > F      =  0.0000
    Residual |  1.8687e+13  1082  1.7271e+10           R-squared     =  0.7505
-------------+------------------------------           Adj R-squared =  0.7501
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

-------------------------------------------------------------------------------
        price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
Beacon_Street |   32935.89   12987.55     2.54   0.011     7452.263    58419.52
         size |   409.4219   7.190862    56.94   0.000     395.3122    423.5315
        _cons |   6981.353   9961.969     0.70   0.484    -12565.61    26528.32
-------------------------------------------------------------------------------

Predicted Price = 6981.4 + 32936*0 + 409.4*2000 = 825,781
95% confidence interval: 825,781 +/- 2*130,000 = 565,781 to 1,085,781
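The same calculation can be scripted. A sketch using the unrounded coefficients from the Stata output (the slide's 825,781 comes from rounding the coefficients first):

```python
# Predicted price for a 2000 sq. ft. condo not on Beacon, with the
# rule-of-thumb 95% CI = prediction +/- 2 * Root MSE.
b0, b_beacon, b_size = 6981.353, 32935.89, 409.4219  # from the output
root_mse = 1.3e5                                      # Root MSE (rounded by Stata)

beacon, size = 0, 2000
pred = b0 + b_beacon * beacon + b_size * size
lo, hi = pred - 2 * root_mse, pred + 2 * root_mse
print(round(pred), round(lo), round(hi))  # 825825 565825 1085825
```

The interval is wide: the regression predicts an individual condo's price only to within a couple hundred thousand dollars.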

Goodness of Fit in a Multiple Regression
A higher R2 means a better fit than a lower R2 (when you have the same number of explanatory variables).
In multiple regressions, use the adjusted R2 instead (right below R2 in the Stata output):

 Number of obs =    1085
 F(  2,  1082) = 1627.49
 Prob > F      =  0.0000
 R-squared     =  0.7505
 Adj R-squared =  0.7501
 Root MSE      = 1.3e+05

If you compare two models with the same dependent variable (and data set), the best fit will be both of these:
The one with the highest adjusted R2
The one with the lowest Root MSE/SEE
Note: the Root MSE/SEE depends on the scale of the dependent variable, so it cannot be used to compare the fit of two regressions with different dependent variables.
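The adjusted R2 applies a small penalty for each extra regressor. A sketch of the standard textbook formula (the formula itself is not shown on the slides; it is checked here against the output above):

```python
# adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1),
# where n = number of observations and k = number of explanatory variables.
def adj_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Using the rounded R2 = 0.7505, n = 1085, k = 2 from the output above;
# Stata's printed 0.7501 uses the unrounded R2, so they differ slightly.
print(round(adj_r2(0.7505, 1085, 2), 4))
```

With n large relative to k, the adjustment is tiny; it matters more in small samples with many regressors.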

Inputting data in Stata: review
Who will (or has already) converted Excel data to Stata?
File→Import→Excel spreadsheet opens an Excel data file.
If importing an Excel file doesn't work, save the Excel file first as .csv.
Who will (or has already) converted .csv data to Stata?
File→Import→Text file (delimited, *.csv) opens a text file with columns divided by commas.
Who downloaded Stata data? Did it have a "Stata" or "do-file" program also? Download that as well and talk to me/the TA.
EVERYONE: Download the codebook (data dictionary, definition of variables) as well.
Other commands at this point:
count (with "if" statements)
cd drive:\foldername (e.g. cd C:\QM222)
File→Save or File→Save as
exit, clear