QM222 Class 9 Section D1 1. Multiple regression – review and in-class exercise 2. Goodness of fit 3. What if your Dependent Variable is a 0/1 Indicator variable?

Presentation transcript:

QM222 Class 9 Section D1 1. Multiple regression – review and in-class exercise 2. Goodness of fit 3. What if your Dependent Variable is a 0/1 Indicator variable? 4. Reviewing Assignment 3 QM222 Fall 2016 Section D1

Review QM222 Fall 2016 Section D1

Coefficient statistics

. regress price size

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  1,  1083) = 3232.35
       Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
    Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
-------------+------------------------------           Adj R-squared =  0.7488
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------

We are approximately 95% (68%) certain that the "true" coefficient is within two (one) standard errors of the estimated coefficient. The 95% confidence interval for each coefficient is given on the right of that coefficient's line. If the 95% confidence interval of a coefficient does not include zero, we are at least 95% confident that the coefficient is NOT zero, i.e. that size affects price.
The t-stat next to the coefficient in the regression output tests the null hypothesis that the true coefficient is actually zero. When |t| > 2.0, we reject this hypothesis, so size affects price (with >=95% certainty).
p-values: the p-value tells us the probability that a coefficient this far from zero would appear by chance if the true coefficient were 0. When the p-value <= .05, we are at least 95% certain that size affects price. QM222 Fall 2016 Section D1
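A quick way to see where these statistics come from is to redo the arithmetic in Stata. A minimal sketch, using the size row of the output above; note the exact confidence interval uses the t critical value (about 1.96), so the plus-or-minus-two rule is approximate:

display 407.4513 / 7.166659     // t-statistic = coefficient / standard error = 56.85
display 407.4513 - 2*7.166659   // approximate lower bound of the 95% CI (~393)
display 407.4513 + 2*7.166659   // approximate upper bound of the 95% CI (~422)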

Multiple Regression The multiple linear regression model is an extension of the simple linear regression model, where the dependent variable Y depends (linearly) on more than one explanatory variable:
Ŷ = b0 + b1X1 + b2X2 + b3X3 + …
We now interpret b1 as the change in Y when X1 changes by 1 and all other variables in the equation REMAIN CONSTANT. We say we are "controlling for" the other variables (X2, X3).
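As a concrete sketch in Stata, using the condo variables that appear later in these slides, you simply list additional explanatory variables after the first one:

regress price size Beacon_Street
* b1 (on size): change in price per extra square foot, holding Beacon_Street constant
* b2 (on Beacon_Street): price difference for Beacon Street condos, holding size constant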

Finding the best regression You care about the effect of one (or more) main explanatory variables. However, always ALSO include in the regression any additional explanatory variables that (a) you believe might bias your main explanatory variable's coefficient by being correlated with both it and Y (i.e. possible confounding factors), and (b) you can measure. If you can't measure the confounding factor, think about the bias it might create in your key coefficient. We'll talk about this in a later class. QM222 Fall 2015 Section D1

Multiple regression: Why use it? Two reasons why we do multiple regression: To get closer to the "correct/causal" (unbiased) coefficient by controlling for confounding factors. To increase the predictive power of a regression … our next topic.

Goodness of fit How well does the model explain our dependent variable? New statistics: R2 (R-squared) and adjusted R2. How accurate are our predictions? New statistic: SEE / Root MSE.

Background to Goodness of Fit: Predicted line and errors
Ŷ = b0 + b1X
Y = Ŷ + error
Y = b0 + b1X + error
Errors can be negative. For any specific Xi (e.g. 2700), we predict Ŷ, the value along the line. (The subscript i is for an "individual" point.) But each actual observation Yi is not exactly the same as the predicted value Ŷ. The difference is called the RESIDUAL or ERROR.
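In Stata you can compute the predicted values and residuals directly after a regression. A minimal sketch using the condo regression from these slides (the new variable names price_hat and resid_hat are mine):

regress price size
predict price_hat               // fitted values: Ŷ = b0 + b1*size for each observation
predict resid_hat, residuals    // residual = actual price minus predicted price
summarize resid_hat             // residuals average ~0; some are negative, some positive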

price = 16651 + 401.31 size Condos (different data from before): the intercept and slope are the same in both regressions, so predictions will be the same. But which fits better? The smaller the errors as a whole (the closer the data points are to the line), the better the prediction is. In other words, the more highly correlated the two variables, the better the goodness of fit.

The R2 measures the goodness of fit. Higher is better. Compare R2 = .9414 to R2 = .6408 (in the two scatterplots).

Measuring Goodness of Fit with R2 R2: the fraction of the variation in Y explained by the regression. It is always between 0 and 1. R2 = [correlation(X,Y)]2

(Review) Correlation coefficients go from -1 to 0 to +1.
-1: perfect negative correlation. If you did a scatter of X and Y, the dots would all lie exactly on a downward-sloping line.
+1: perfect positive correlation. If you did a scatter of X and Y, the dots would all lie exactly on an upward-sloping line.
0: no correlation; if you did a scatter of X and Y, the dots would seem to have no relationship with each other. If you were to fit a line to the dots, it would be flat (since Y doesn't change as X changes). QM222 Fall 2016 Sections E1 & G1

Measuring Goodness of Fit with R2 R2: the fraction of the variation in Y explained by the regression. It is always between 0 and 1. R2 = [correlation(X,Y)]2
What does R2 = 1 tell us? It is the same as a correlation of 1 OR -1. It means that the regression predicts Y perfectly.
What does R2 = 0 mean? It means that the model doesn't predict any variation in Y. It is the same as a correlation of 0. Also, the slope b1 would be 0 if there really is 0 correlation.
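Note that R2 = [correlation(X,Y)]2 holds for a simple (one-X) regression, and you can verify it in Stata. A minimal sketch using the price-size regression above (e(r2) and r(rho) are Stata's stored results):

quietly regress price size
display e(r2)          // 0.7490, the R-squared from the earlier output
correlate price size
display r(rho)^2       // squared correlation; matches R-squared in a simple regression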

What is a "high" R2? As with correlation, there are no strict rules - it depends on context. We'll get a high R2 for outcomes that are easily predictable. We'll get a low R2 for outcomes that depend heavily on unobserved factors (like people's behavior). But that doesn't mean that the X variable is a useless predictor … it means a person is hard to predict. Do not worry too much about R-squared unless your question is "how well can I predict?" Most of you will emphasize statistics about the coefficients, i.e. "how well can I predict the IMPACT of X on Y?"

Where do we see information on R-squared in the Stata output? Here it is in the upper right of the output below; note it is tiny (0.0031).

. regress price Beacon_Street

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  1,  1083) =    3.31
       Model |  2.2855e+11     1  2.2855e+11           Prob > F      =  0.0689
    Residual |  7.4673e+13  1083  6.8951e+10           R-squared     =  0.0031
-------------+------------------------------           Adj R-squared =  0.0021
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 2.6e+05

-------------------------------------------------------------------------------
        price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
Beacon_Street |  -46969.18   25798.41    -1.82   0.069    -97589.71    3651.345
        _cons |   520728.9   8435.427    61.73   0.000     504177.2    537280.5
-------------------------------------------------------------------------------

Regression of price on Size in sq. ft. The syntax is: regress yvariablename xvariablename
For instance, to run a regression of price on size, type: regress price size
This R2 (0.7490) is much greater: size is a better predictor of the condo price. QM222 Fall 2015 Section D1

Goodness of Fit in a Multiple Regression A higher R2 means a better fit than a lower R2 (when you have the same number of explanatory variables). In multiple regressions, use the adjusted R2 instead (right below R2 in the output):

Number of obs =    1085
F(  2,  1082) = 1627.49
Prob > F      =  0.0000
R-squared     =  0.7505
Adj R-squared =  0.7501
Root MSE      = 1.3e+05

If you compare two models with the same dependent variable (and data set), the best fit will be the one with the highest adjusted R2. QM222 Fall 2015 Section D1
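To compare two candidate models in Stata, you can pull the adjusted R2 out of the stored results after each regression. A minimal sketch (e(r2_a) is the adjusted R2 Stata stores after regress):

quietly regress price size
display e(r2_a)                      // adjusted R-squared of the one-variable model
quietly regress price size Beacon_Street
display e(r2_a)                      // adjusted R-squared with Beacon_Street added; higher = better fit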

We can also measure fit by looking at the dispersion of actual Y around predicted Y. We predict Ŷ, but we can make confidence intervals around predicted Ŷ using the Root MSE (or SEE). The Root MSE (root mean squared error), a.k.a. the SEE (standard error of the equation), measures how spread out the distribution of the errors (residuals) from a regression is:
We are approximately 68% (around 2/3rds) certain that the actual Y will be within one Root MSE (or SEE) of predicted Ŷ. This is called the 68% Confidence Interval (CI).
We are approximately 95% certain that the actual Y will be within two Root MSEs (or SEEs) of predicted Ŷ. This is called the 95% Confidence Interval (CI).
(Diagram: a bell curve centered at Ŷ, with tick marks at -3, -2, -1, +1, +2, and +3 Root MSEs.)

Where is the Root MSE?

. regress price Beacon_Street size

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  2,  1082) = 1627.49
       Model |  5.6215e+13     2  2.8108e+13           Prob > F      =  0.0000
    Residual |  1.8687e+13  1082  1.7271e+10           R-squared     =  0.7505
-------------+------------------------------           Adj R-squared =  0.7501
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      = 1.3e+05

-------------------------------------------------------------------------------
        price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
Beacon_Street |   32935.89   12987.55     2.54   0.011     7452.263    58419.52
         size |   409.4219   7.190862    56.94   0.000     395.3122    423.5315
        _cons |   6981.353   9961.969     0.70   0.484    -12565.61    26528.32
-------------------------------------------------------------------------------

The Root MSE (root mean squared error), or SEE, is just the square root of the sum of squared errors (SSE) divided by the number of observations (roughly: the exact divisor is the degrees of freedom, n minus the number of estimated coefficients). QM222 Fall 2015 Section D1

If an apartment is not on Beacon and has 2000 square feet, what is the predicted price and what is its 95% confidence interval? Using the regression output above (regress price Beacon_Street size; Root MSE = 1.3e+05):
Predicted price = 6981.4 + 32936*0 + 409.4*2000 = 825,781
95% confidence interval: 825,781 +/- 2*130,000 = 565,781 to 1,085,781. QM222 Fall 2015 Section D1
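The same arithmetic can be done with Stata's stored results right after running the regression. A minimal sketch (_b[] holds the coefficients and e(rmse) holds the Root MSE):

regress price Beacon_Street size
display _b[_cons] + _b[Beacon_Street]*0 + _b[size]*2000   // predicted price, ~825,781
display _b[_cons] + _b[size]*2000 - 2*e(rmse)             // ~565,781, lower bound of the 95% CI
display _b[_cons] + _b[size]*2000 + 2*e(rmse)             // ~1,085,781, upper bound of the 95% CI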

Goodness of Fit in a Multiple Regression revisited

Number of obs =    1085
F(  2,  1082) = 1627.49
Prob > F      =  0.0000
R-squared     =  0.7505
Adj R-squared =  0.7501
Root MSE      = 1.3e+05

If you compare two models with the same dependent variable (and data set), the best fit will be both of these: the one with the highest adjusted R2, and the one with the lowest Root MSE/SEE. Note: the Root MSE/SEE depends on the scale of the dependent variable, so it cannot be used to compare the fit of two regressions with different dependent variables. QM222 Fall 2015 Section D1

What if your Dependent Variable is a 0/1 Indicator variable? For instance, your dependent (left-hand side) variable might be an indicator variable: =1 if you bought something, 0 if you didn't. QM222 Fall 2015 Section D1

Example: Your company wants to know how many people will buy a specific ebook at different prices. You do an experiment, randomly offering the book at different prices. You make an indicator variable for "Did they purchase the book?" (0 = no, 1 = purchased one or more):
Visitor #1: 1 (i.e. bought the ebook) P=$10
Visitor #2: 0 (i.e. didn't buy) P=$10
Visitor #3: 0 (i.e. didn't buy) P=$5
Visitor #4: 1 (i.e. bought the book) P=$5
Visitor #5: 1 (i.e. bought the book) P=$7
What is the average probability that a person bought the book? It is just the average of all of the 1's and 0's. How can you find out how the price affects this probability? You run a regression where the indicator variable is the dependent (Y, LHS) variable. The explanatory X variable is Price. This will predict the "Probability of Purchase" based on price.
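A minimal Stata sketch of this regression, assuming the data above are in variables I've named bought (the 0/1 indicator) and price:

regress bought price    // bought = b0 + b1*price: b1 is the change in purchase probability per $1
predict prob_buy        // predicted probability of purchase at each price
* Note: predictions from this linear model can fall below 0 or above 1.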

Assignment 3 (summarized - see website) After you have your data in a Stata data set … (see the command sketch after this list):
a. Open a TEXT log file to capture your commands and the Stata output you generate in preparing your data set in steps b-e below. Put YOUR name and "assignment3" in its title (for instance: log using smithassignment3, text).
b. Clean your data set, ensuring all missing values of the numerical variables that you plan to use are a "." in Stata (if they weren't already). Make the missing values of string variables "" (the empty string).
c. Generate any variables that you know you will need that are combinations of other variables. Then save this Stata dataset (for instance, save dataset1).
d. Then, if your dataset is so large that you have trouble using it, get rid of any observations and any variables that you know for certain you will never need. However, it is much more difficult to retrieve variables that you erased than to carry around un-needed variables. Save this as a new Stata dataset under a new name (e.g. save dataset2).
e. Name or rename your variables so they describe what they are in a way that you and readers can easily understand.
f. Close your log file (log close) and post it on QuestromTools. However, in the future, whenever you change your data (e.g. adding new variables that you generate), append to this file (log using smithassignment3.log, append) so you have a single log file with all of the changes you made to the data.
g. Post your data file on QuestromTools → Assignments → Assignment 3 Stata data file. If your data set is too large to post, make a file with only the first 1000 observations and post that (Stata command: keep in 1/1000).
Get the TA to help you if you need it!
Also, in the inline section of Assignment 3, answer these questions:
1. How many observations in total are there in your data set?
2. How many observations in total are there in your data set that have non-missing data for your main dependent variable?
3. How many observations in total are there in your data set that have non-missing data for both your main dependent variable and the main explanatory variables you will focus on?
4. Consider the observations that are missing values for the dependent variable. How many missing observations are there? Is there any reason to believe that these observations are different from the other observations in a way that might bias your results? Explain. QM222 Fall 2016 Section D1
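A minimal command sketch of steps a-g. The file and variable names here (smithassignment3, myvar, mystring, price_per_sqft, year, unused_var, v17) are placeholders for illustration, not required names:

log using smithassignment3, text             // a. open a text log with your name in the title
replace myvar = . if myvar == -999           // b. recode a numeric missing-value code to Stata's "."
replace mystring = "" if mystring == "N/A"   // b. make string missing values the empty string
generate price_per_sqft = price / size       // c. a combination variable you know you will need
save dataset1                                // c. save the cleaned dataset
keep if year >= 2010                         // d. drop observations you will never need
drop unused_var                              // d. drop variables you will never need
save dataset2                                // d. save under a new name
rename v17 condo_price                       // e. rename variables so readers understand them
log close                                    // f. close the log; later, reopen with the append option
keep in 1/1000                               // g. if too large to post, keep the first 1000 observations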

Current Project Status: Q1-Q6
1. What specific question or questions will your project address?
2. Who is your client?
3. What is the source of the data you are using?
4. Put here your "codebook" for all variables that you use in your analysis. Add more lines as needed.
5. Put here the summary of all variables that you use in your analysis (using the Stata sum command).
6. Put here the tab of all categorical variables that you use in your analysis (using the Stata tab command; both commands are sketched below). (Cut and paste Stata output; format as Courier New 9 point.)
CODEBOOK columns: Dependent variable or explanatory? | Variable name in source data | Variable name in your dataset | Variable definition (everything readers need to understand what this variable is) | Units the variable is measured in (if a dummy variable, say 1=__ 0=__)
QM222 Fall 2016 Section D1
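For questions 5 and 6, the commands look like this; the condo variable names are stand-ins for your own:

summarize price size       // Q5: summary statistics for the variables in your analysis
tabulate Beacon_Street     // Q6: frequency table for each categorical variable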