Published by Bertha Carroll. Modified over 6 years ago.
1
QM222 Class 9 Section D1
1. Multiple regression – review and in-class exercise
2. Goodness of fit
3. What if your dependent variable is a 0/1 indicator variable?
4. Reviewing Assignment 3
QM222 Fall 2016 Section D1
2
Review
3
Coefficient statistics
[Stata output from a regression of price on size. The numeric values did not survive extraction; what remains shows F(1, 1083), Root MSE = 1.3e+05, and coefficient rows for size and _cons with the columns Coef., Std. Err., t, P>|t| and [95% Conf. Interval].]
We are approximately 95% (68%) certain that the “true” coefficient is within two (one) standard errors of the estimated coefficient.
The 95% confidence interval for each coefficient is given on the right of that coefficient’s line. If the 95% confidence interval of a coefficient does not include zero, we are at least 95% confident that the coefficient is NOT zero, so that size affects price.
The t-stat next to the coefficient in the regression output tests the null hypothesis that the true coefficient is actually zero. When | t | > 2.0, we reject this hypothesis and conclude that size affects price (with >=95% certainty).
p-values: The p-value is the probability of getting an estimate at least this far from zero if the true coefficient were actually zero. When the p-value <= .05, we are at least 95% certain that size affects price.
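The rules of thumb above can be checked with a few lines of arithmetic. This is only a sketch: the coefficient and standard error below are made-up numbers (the slide's actual values were lost), and the "two standard errors" interval is the approximation used in class, not Stata's exact interval.

```python
# Hypothetical values for illustration only; they are NOT from the slide's output.
coef = 400.0     # assumed estimated coefficient on size
std_err = 25.0   # assumed standard error of that coefficient

# t-stat: tests the null hypothesis that the true coefficient is zero.
t_stat = coef / std_err

# Rule-of-thumb intervals: one standard error (~68%), two standard errors (~95%).
ci_68 = (coef - std_err, coef + std_err)
ci_95 = (coef - 2 * std_err, coef + 2 * std_err)

# The two checks from the slide agree with each other:
rejects_zero = abs(t_stat) > 2.0                      # |t| > 2
zero_outside_ci = not (ci_95[0] <= 0.0 <= ci_95[1])   # 0 not in the ~95% interval

print("t =", t_stat)
print("approx. 95% CI =", ci_95)
print("reject 'coefficient is zero'?", rejects_zero, zero_outside_ci)
```

With the two-standard-error approximation, "| t | > 2" and "zero lies outside the 95% interval" are the same statement written two ways.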
4
Multiple Regression
The multiple linear regression model is an extension of the simple linear regression model, where the dependent variable Y depends (linearly) on more than one explanatory variable:
Ŷ = b0 + b1X1 + b2X2 + b3X3 …
We now interpret b1 as the change in Y when X1 changes by 1 and all other variables in the equation REMAIN CONSTANT.
We say: “controlling for” other variables (X2, X3).
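To see what "all other variables remain constant" means, here is a small sketch with invented coefficients (b0, b1, b2 are assumptions chosen for the example, not estimates from any data in these slides). Raising X1 by 1 while holding X2 fixed changes the prediction by exactly b1.

```python
def predict(x1, x2, b0=50.0, b1=3.0, b2=-1.5):
    """Predicted Y from a two-variable regression equation.

    The default coefficients are hypothetical, picked only for illustration.
    """
    return b0 + b1 * x1 + b2 * x2


base = predict(x1=10, x2=4)     # X2 held at 4
bumped = predict(x1=11, x2=4)   # X1 up by 1, X2 still 4

# The change in the prediction equals b1, whatever value X2 is held at.
print("change in predicted Y:", bumped - base)
```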
5
Finding the best regression
You care about the effect of one (or more) main explanatory variable. However, always ALSO include in the regression any additional explanatory variables that:
you believe might bias your main explanatory variable’s coefficient by being correlated with both it and Y (i.e. possibly confounding factors).
you can measure.
If you can’t measure the confounding factor, think about the bias it might create in your key coefficient. We’ll talk about this in a later class.
QM222 Fall 2015 Section D1
6
Multiple regression: Why use it?
2 reasons why we do multiple regression:
1. To get closer to the “correct/causal” (unbiased) coefficient by controlling for confounding factors
2. To increase the predictive power of a regression … our next topic
7
Goodness of fit
How well does the model explain our dependent variable? New statistics: R² (R-squared) and adjusted R²
How accurate are our predictions? New statistic: SEE / Root MSE
8
Background to Goodness of Fit: Predicted line Ŷ=b0+b1X and errors
Y = Ŷ + error
Y = b0 + b1X + error
Errors can be negative.
For any specific Xi (e.g. 2700), we predict Ŷ, the value along the line. (The subscript i is for an “individual” point.)
But each actual Yi observation is not exactly the same as the predicted value Ŷ. The difference is called the RESIDUAL or ERROR.
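A short sketch of the idea, using a tiny invented data set (the xs and ys below are assumptions, not the condo data): fit the line by ordinary least squares, then compute each residual as actual Y minus predicted Ŷ.

```python
# Invented data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Ordinary least squares slope and intercept for Y = b0 + b1*X + error.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b0 = y_bar - b1 * x_bar

# Predicted values along the line, and the residual (error) for each point.
y_hat = [b0 + b1 * x for x in xs]
residuals = [y - yh for y, yh in zip(ys, y_hat)]

for x, y, yh, e in zip(xs, ys, y_hat, residuals):
    print(f"X={x:.0f}  actual Y={y:.2f}  predicted={yh:.2f}  residual={e:+.2f}")
```

Some residuals come out negative and some positive; for a least-squares line with an intercept they always sum to zero.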
9
Condos (different data from before): price = … + … × size (the fitted values did not survive extraction)
The intercept and slope are the same in both regressions, so predictions will be the same. But which fits better?
The smaller the errors as a whole (the closer the data points are to the line), the better the prediction is.
In other words, the more highly correlated the two variables, the better the goodness of fit.
10
The R² measures the goodness of fit. Higher is better. Compare .9414 to .6408.
11
Measuring Goodness of Fit with R²
R²: the fraction of the variation in Y explained by the regression
It is always between 0 and 1
R² = [correlation(X,Y)]² (when there is a single X variable)
12
(review) Correlation coefficients go from -1 to 0 to +1
-1: perfect negative correlation. If you did a scatter of X and Y, the dots would all lie exactly on a downward sloping line.
+1: perfect positive correlation. If you did a scatter of X and Y, the dots would all lie exactly on an upward sloping line.
0: no correlation; if you did a scatter of X and Y, the dots would seem to have no relationship with each other. If you were to fit a line to the dots, it would be flat (since Y doesn’t change as X changes).
QM222 Fall 2016 Sections E1 & G1
13
Measuring Goodness of Fit with R²
R²: the fraction of the variation in Y explained by the regression
It is always between 0 and 1
R² = [correlation(X,Y)]² (when there is a single X variable)
What does R² = 1 tell us?
It is the same as a correlation of 1 OR -1
It means that the regression predicts Y perfectly
What does R² = 0 mean?
It means that the model doesn’t predict any variation in Y
It is the same as a correlation of 0
Also, the slope b1 would be 0 if there really is 0 correlation
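Both definitions of R² can be computed side by side. The sketch below uses a small invented data set (an assumption, not the condo data) and shows that, with one X variable, "fraction of variation explained" and "correlation squared" give the same number.

```python
import math

# Invented data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)            # total variation in Y
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least-squares line and its sum of squared errors.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Definition 1: fraction of the variation in Y explained by the regression.
r_squared = 1 - sse / syy

# Definition 2: the squared correlation between X and Y.
corr = sxy / math.sqrt(sxx * syy)

print("R-squared            :", round(r_squared, 4))
print("correlation squared  :", round(corr ** 2, 4))
```

If every point lay exactly on the line, the sum of squared errors would be 0 and R² would be 1; if the fitted line were flat, it would explain none of the variation and R² would be 0.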
14
What is a “high” R²?
As with correlation, there are no strict rules: it depends on context.
We’ll get high R² for outcomes that are easily predictable.
We’ll get low R² for outcomes that depend heavily on unobserved factors (like people’s behavior).
But that doesn’t mean that the X variable is a useless predictor … It means a person is hard to predict.
Do not worry too much about R-squared unless your question is “how well can I predict?”
Most of you will emphasize statistics about the coefficients, i.e. “how well can I predict the IMPACT of X on Y?”
15
Where do we see information on R-squared on the Stata output?
. regress price Beacon_Street
[Stata output; most numeric values did not survive extraction. What remains: F(1, 1083) = 3.31, a truncated “Root MSE = 2.6e”, and coefficient rows for Beacon_Street and _cons.]
Callout pointing at the R-squared line: This is the R². It is tiny.
16
Regression of price on Size in sq. ft.
regress yvariablename xvariablename
For instance, to run a regression of price on size, type: regress price size
This R² is much greater: size is a better predictor of the condo price.
17
Goodness of Fit in a Multiple Regression
A higher R² means a better fit than a lower R² (when you have the same number of explanatory variables).
In multiple regressions, use the adjusted R² instead (right below R² in the Stata output).
[Stata output header; values lost in extraction except F(2, 1082) and Root MSE = 1.3e+05.]
If you compare two models with the same dependent variable (and data set), the best fit will be the one with the highest adjusted R².
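The slide does not give the formula, but the standard adjusted R² is 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of explanatory variables. The sketch below uses n = 1085 and k = 2, which is what F(2, 1082) implies; the R² value of 0.64 is an assumed number for illustration, since the actual value was lost.

```python
def adjusted_r2(r2, n_obs, n_explanatory):
    """Standard adjusted R-squared for a regression with an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_explanatory - 1)


n_obs = 1085   # implied by F(2, 1082): 1082 = n - k - 1 with k = 2
r2 = 0.64      # hypothetical R-squared, NOT the value from the slide

with_two = adjusted_r2(r2, n_obs, 2)
# Suppose a third variable is added that does not raise R-squared at all:
with_three = adjusted_r2(r2, n_obs, 3)

print("adjusted R2 with 2 variables:", round(with_two, 5))
print("adjusted R2 with 3 variables:", round(with_three, 5))
```

Plain R² can never fall when a variable is added, which is why it cannot be used to compare models of different sizes; adjusted R² falls unless the new variable explains enough to pay for itself.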
18
We can also measure fit by looking at the dispersion of population Y around predicted Y
We predict Ŷ, but we can make confidence intervals around predicted Ŷ using the Root MSE (or SEE).
The Root MSE (root mean squared error), a.k.a. the SEE (standard error of the equation), measures how spread out the distribution of the errors (residuals) from a regression is:
We are approximately 68% (or around 2/3rds) certain that the actual Y will be within one Root MSE (or SEE) of predicted Ŷ. This is called the 68% Confidence Interval (CI).
We are approximately 95% certain that the actual Y will be within two Root MSEs (or SEEs) of predicted Ŷ. This is called the 95% Confidence Interval (CI).
[Figure: distribution of actual Y around predicted Ŷ, with the axis marked from -3 RMSE to +3 RMSE.]
19
Where is the Root MSE?
. regress price Beacon_Street size
[Stata output; numeric values lost in extraction except F(2, 1082) and Root MSE = 1.3e+05. Coefficient rows: Beacon_Street, size, _cons.]
The Root MSE (root mean squared error) or SEE is just the square root of the sum of squared errors (SSE) divided by the # of observations (kind of: Stata actually divides by the residual degrees of freedom, here 1082, rather than by the number of observations).
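Here is the "kind of" spelled out as a sketch. The residuals are invented numbers (an assumption for illustration); the point is only the divisor: the residual degrees of freedom, n − k − 1, rather than n.

```python
import math


def root_mse(residuals, n_explanatory):
    """Root MSE: sqrt(SSE / residual degrees of freedom)."""
    n_obs = len(residuals)
    dof = n_obs - n_explanatory - 1          # minus 1 for the intercept
    sse = sum(e * e for e in residuals)      # sum of squared errors
    return math.sqrt(sse / dof)


# Hypothetical residuals from a regression with one explanatory variable.
residuals = [3.0, -4.0, 0.0, 4.0, -3.0]

rmse = root_mse(residuals, n_explanatory=1)
naive = math.sqrt(sum(e * e for e in residuals) / len(residuals))

print("Root MSE (divide by n - k - 1):", round(rmse, 4))
print("Naive version (divide by n)   :", round(naive, 4))
```

With over a thousand observations, as in the condo regressions, the two versions are almost identical, which is why "divided by the number of observations" is a fair approximation.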
20
If an apartment is not on Beacon and has 2000 square feet, what is the predicted price and what is its 95% confidence interval?
. regress price Beacon_Street size
[Stata output; numeric values lost in extraction except F(2, 1082) and Root MSE = 1.3e+05. Coefficient rows: Beacon_Street, size, _cons.]
Predicted price = _cons + (coefficient on Beacon_Street) * 0 + (coefficient on size) * 2000 = 825,781 (the coefficient values did not survive extraction)
95% confidence interval: 825,781 +/- 2*130,000 = 565,781 to 1,085,781.
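The interval arithmetic can be reproduced directly from the two numbers that survive on the slide: the predicted price (825,781) and the Root MSE (1.3e+05). The helper name approx_interval is made up for this sketch.

```python
def approx_interval(predicted, root_mse, n_rmse=2):
    """Rule-of-thumb interval: predicted value +/- n_rmse Root MSEs."""
    return predicted - n_rmse * root_mse, predicted + n_rmse * root_mse


predicted_price = 825_781   # predicted price from the slide
root_mse = 130_000          # Root MSE = 1.3e+05 from the Stata output

low_95, high_95 = approx_interval(predicted_price, root_mse, n_rmse=2)
low_68, high_68 = approx_interval(predicted_price, root_mse, n_rmse=1)

print(f"approx. 95% interval: {low_95:,} to {high_95:,}")
print(f"approx. 68% interval: {low_68:,} to {high_68:,}")
```

The 95% interval is more than half a million dollars wide, which shows how the Root MSE turns a goodness-of-fit number into a statement about how accurate an individual prediction is.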
21
Goodness of Fit in a Multiple Regression revisited
[Stata output header; values lost in extraction except F(2, 1082) and Root MSE = 1.3e+05.]
If you compare two models with the same dependent variable (and data set), the best fit will be both of these:
The one with the highest adjusted R²
The one with the lowest Root MSE/SEE
Note: The Root MSE/SEE depends on the scale of the dependent variable, so it cannot be used to compare the fit of two regressions with different dependent variables.
22
What if your Dependent Variable is a 0/1 Indicator variable?
For instance, your dependent (left-hand side) variable might be an indicator variable that equals 1 if you bought something and 0 if you didn’t.
23
Example: Your company wants to know how many people will buy a specific ebook at different prices.
You do an experiment randomly offering the book at different prices.
You make an indicator variable for “Did they purchase the book?” (0 = no, 1 = purchased one or more)
Visitor #1: 1 (i.e. bought the ebook) P=$10
Visitor #2: 0 (i.e. didn’t buy) P=$10
Visitor #3: 0 (i.e. didn’t buy) P=$5
Visitor #4: 1 (i.e. bought the ebook) P=$5
Visitor #5: 1 (i.e. bought the ebook) P=$7
What is the average probability that the person bought the book? It is just the average of all of the 1’s and 0’s.
How can you find out how the price affects this probability?
You run a regression where the indicator variable is the dependent (Y, LHS) variable. The explanatory X variable is Price.
This will predict the “Probability of Purchase” based on price.
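The five visitors above are enough to run the whole calculation by hand. This sketch averages the 0/1 indicator and then fits the one-variable least-squares line of "bought" on price (a regression like this is often called a linear probability model). Five observations are far too few to conclude anything; the numbers only show the mechanics.

```python
# The five visitors from the slide: (price offered, bought? 1 = yes, 0 = no).
prices = [10.0, 10.0, 5.0, 5.0, 7.0]
bought = [1, 0, 0, 1, 1]

n = len(prices)

# Average of the 0/1 indicator = share of visitors who bought.
avg_purchase_prob = sum(bought) / n

# Least-squares regression of the indicator on price.
p_bar = sum(prices) / n
slope = sum((p - p_bar) * (b - avg_purchase_prob) for p, b in zip(prices, bought)) / sum(
    (p - p_bar) ** 2 for p in prices
)
intercept = avg_purchase_prob - slope * p_bar


def predicted_probability(price):
    """Predicted probability of purchase at a given price."""
    return intercept + slope * price


print("average purchase probability:", avg_purchase_prob)
print("slope (change in probability per $1):", round(slope, 4))
print("predicted probability at $5 :", round(predicted_probability(5), 3))
print("predicted probability at $10:", round(predicted_probability(10), 3))
```

The slope is read like any other regression coefficient: the change in the probability of purchase when the price rises by one dollar.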
24
Assignment 3 (summarized – see website)
After you have your data in a Stata data set …
Open a TEXT log file to capture your commands and the Stata output you generate in preparing your data set in steps b-e below. Put YOUR name and “assignment3” in its title (for instance: log using smithassignment3, text).
Clean your data set, ensuring all missing values of the numerical variables that you plan to use are a “.” in Stata (if they weren’t already). Make the missing values of string variables “”.
Generate any variables that you know you will need that are combinations of other variables. Then save this Stata dataset (for instance, save dataset1).
Then, if your dataset is so large that you have trouble using it, get rid of any observations that you know for certain you will never need, and any variables that you know for certain you will never need. However, it is much more difficult to retrieve variables that you erased than to carry around un-needed variables. Save this as a new Stata dataset under a new name (e.g. save dataset2).
Name or rename your variables so they describe what they are in a way that you and readers can easily understand.
Close your log file (log close) and post it on Tools. However, in the future, whenever you change your data (e.g. adding new variables that you generate), append it to this file (log using smithassignment3.log, append) so you have a single log file with all of the changes you made to the data.
Post your data file on QuestromTools→Assignments→Assignment 3 Stata data file. If your data set is too large to post, make a file with only the first 1000 observations and post that. (Stata command: keep in 1/1000)
Get the TA to help you if you need it!
Also, in the inline section of the Assignment 3 log file, answer these questions:
How many observations in total are there in your data set?
How many observations in total are there in your data set that have non-missing data for your main dependent variable?
How many observations in total are there in your data set that have non-missing data for both your main dependent variable and the main explanatory variables you will focus on?
Consider the observations that are missing values for the dependent variable. How many missing observations are there? Is there any reason to believe that these observations are different from the other observations in a way that might bias your results? Explain.
25
Current Project Status: Q1-Q6
1. What specific question or questions will your project address?
2. Who is your client?
3. What is the source of the data you are using?
4. Put here your “codebook” for all variables that you use in your analysis. Add more lines as needed.
5. Put here the summary of all variables that you use in your analysis (using the Stata sum command).
6. Put here the tab of all categorical variables that you use in your analysis (using the Stata tab command). (Cut and paste Stata output; format as Courier New 9 point.)
CODEBOOK columns:
Dependent variable or explanatory?
Variable name in source data
Variable name in your dataset
Variable definition (everything readers need to understand what this variable is)
Units the variable is measured in (if a dummy variable, say 1=__ 0=___)