1 4. Multiple Regression I ECON 251 Research Methods.

2 Basic Multiple Regression Model
- In this section, we extend the simple linear regression model to allow for any number ($k$) of independent variables. This should yield a better model in most cases.
  $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$
  Here $y$ is the dependent variable, $x_1, \dots, x_k$ are the independent variables, $\beta_0, \dots, \beta_k$ are the coefficients, and $\varepsilon$ is the random error variable.
- We add adjusted $R^2$ to our model assessment tools.
- Because of the complexity of the calculations, we will rely exclusively on the computer to do our model estimation.
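Adjusted $R^2$ discounts $R^2$ for the number of independent variables in the model. The sketch below assumes the usual textbook definition, $1 - (1 - R^2)(n - 1)/(n - k - 1)$, with $n$ observations and $k$ independent variables; the function names are hypothetical and are not taken from the slides.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 from R^2, sample size n, and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def adjusted_r_squared_from_sums(sse, sst, n, k):
    """Same quantity, written with the error and total sums of squares."""
    return 1.0 - (sse / (n - k - 1)) / (sst / (n - 1))


# Illustrative numbers only: the same R^2 is worth less with more variables.
with_three = adjusted_r_squared(0.74, n=20, k=3)
with_four = adjusted_r_squared(0.74, n=20, k=4)
```

Because the penalty grows with $k$, adding a variable that barely raises $R^2$ can lower adjusted $R^2$, which is why it is useful for comparing models of different sizes.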

3 Basic Multiple Regression Model
- The simple linear regression model allows for one independent variable, "x": $y = \beta_0 + \beta_1 x + \varepsilon$
- The multiple linear regression model allows for more than one independent variable: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$
- Note how the straight line $y = \beta_0 + \beta_1 x$ becomes a plane $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.
[Figure: a regression line, and a regression plane over the axes X1, X2 and Y]

4 Regression Diagnostics
- One of the most important aspects of regression analysis is verifying that our results are not being distorted by assumption violations or "other dangers." That is why we return to this important topic. In this section, we look for remedies for the problems we encounter.
- Recall our list of "Assumption Violations & Other Dangers". The error term ($\varepsilon$) is properly distributed, which means:
  1. The probability distribution of $\varepsilon$ is normal, with a mean of 0.
  2. The standard deviation of $\varepsilon$ is $\sigma_\varepsilon$ for all values of x.
  3. The errors associated with different values of y are all independent.
  Other assumptions that, when violated, can threaten the usefulness of your results:
  4. No unnecessary outliers
  5. No serious multicollinearity

5 Assumptions #1 and #2 – Remedying Violations
- We discussed both assumptions in the last section, as well as how to detect violations of them by visual inspection of graphs.
- Nonnormality or heteroscedasticity can be remedied by transforming the y variable.
- The transformations can improve the linear relationship between the dependent variable and the independent variables.
- Many computer software systems allow us to make the transformations easily.

6 A brief list of transformations
- $y' = \ln y$ (for $y > 0$)
  Use when $s_\varepsilon$ increases with $y$, or
  use when the error distribution is positively skewed.
- $y' = y^2$
  Use when $s_\varepsilon^2$ is proportional to $E(y)$, or
  use when the error distribution is negatively skewed.
- $y' = y^{1/2}$ (for $y > 0$)
  Use when $s_\varepsilon^2$ is proportional to $E(y)$.
- $y' = 1/y$
  Use when $s_\varepsilon^2$ increases significantly when $y$ increases beyond some value.
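The four transformations of the dependent variable can be written as one small helper. This is a minimal sketch: the function name `transform_y` and the `kind` labels are hypothetical, and the domain checks simply mirror the "for y > 0" conditions stated in the list.

```python
import math


def transform_y(y, kind):
    """Apply one of the listed transformations to a single y value."""
    if kind == "log":            # y' = ln y, defined for y > 0
        if y <= 0:
            raise ValueError("ln y requires y > 0")
        return math.log(y)
    if kind == "square":         # y' = y^2
        return y ** 2
    if kind == "sqrt":           # y' = y^(1/2), restricted to y > 0 as in the list
        if y <= 0:
            raise ValueError("y^(1/2) requires y > 0")
        return math.sqrt(y)
    if kind == "reciprocal":     # y' = 1/y, undefined at y = 0
        if y == 0:
            raise ValueError("1/y requires y != 0")
        return 1.0 / y
    raise ValueError(f"unknown transformation: {kind}")


# Transform a whole column of (illustrative) scores before re-running the regression.
scores = [18, 23]
ln_scores = [transform_y(s, "log") for s in scores]
```

The regression is then re-estimated with the transformed column as the dependent variable, and the diagnostics are checked again.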

7 Example – Quiz Score
- A statistics professor wanted to know whether the time limit affects the scores on a quiz.
- A random sample of 100 students was split into 5 groups.
- Each student wrote a quiz, but each group was given a different time limit. See data below.
[Table: quiz scores by time-limit group]
Analyze these results, and include diagnostics.

8 Example – Quiz Score
The model tested: $\text{SCORE} = \beta_0 + \beta_1\,\text{TIME} + \varepsilon$
- The errors seem to be _______ distributed.
- There is ________ linear relationship between time and score.
- This model is ______ and provides a ______ fit.

9 Example – Quiz Score
The standard error of estimate seems to __________ with the predicted value of y. Two transformations are used to remedy this problem:
1. $y' = \ln y$
2. $y' = 1/y$

10 Example – Quiz Score
Let us see what happens when a transformation is applied.
- The original data, where "Score" is a function of "Time": for example, the points (40, 18) and (40, 23).
- The modified data, where LnScore is a function of "Time": the same points become (40, 2.89) and (40, 3.135), since ln 18 = 2.89 and ln 23 = 3.135.

11 Example – Quiz Score
The new regression analysis and the diagnostics are:
- The model tested: $\text{LnScore} = \beta'_0 + \beta'_1\,\text{TIME} + \varepsilon'$
- Predicted LnScore = 2.1295 + 0.0217 Time
- This model is _______ and provides a ________ fit.

12 Example – Quiz Score
- The errors seem to be _________ distributed.
- The standard error still changes with the predicted y, but the change is _______ than before.

13 Example – Quiz Score
- How do we use the modified model to predict? Let TIME = 55 minutes:
  $\text{LnScore} = 2.1295 + 0.0217 \times \text{TIME} = 2.1295 + 0.0217 \times 55 = 3.323$
- To find the predicted score, take the antilog: $\text{antilog}_e\,3.323 = e^{3.323} = 27.770$
- If 55 minutes is given for the quiz, we expect the score to be 27.770.
- Find the predicted score if 50 minutes are given for the quiz.
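The prediction step can be reproduced in a few lines: evaluate the fitted log model, then take the antilog. This is a sketch that assumes the rounded coefficients printed on the slide (2.1295 and 0.0217); the function names are hypothetical. With those rounded values the antilog at 55 minutes comes out near 27.74, slightly below the 27.770 reported, which presumably reflects unrounded coefficients in the original software output.

```python
import math

# Rounded coefficients as printed on the slide.
B0 = 2.1295
B1 = 0.0217


def predict_ln_score(time_minutes):
    """Predicted LnScore from the fitted log model."""
    return B0 + B1 * time_minutes


def predict_score(time_minutes):
    """Back-transform: the antilog (exp) of the predicted LnScore."""
    return math.exp(predict_ln_score(time_minutes))


ln_55 = predict_ln_score(55)   # about 3.323
score_55 = predict_score(55)   # about 27.74 with the rounded coefficients
score_50 = predict_score(50)   # the follow-up exercise uses the same two steps
```

The same two steps, predict on the log scale and then exponentiate, apply to any other time limit.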

14 Example – Quiz Score

15 Assumption #5 Violation – Serious Multicollinearity
- Multicollinearity exists when independent variables included in the same regression are linearly related to one another.
- Multicollinearity nearly always exists. We will (somewhat arbitrarily) consider it serious if the absolute value of the correlation coefficient between two independent variables exceeds 0.8.
- Example – House Price: A real estate agent believes that a house's selling price can be predicted using the house size, number of bedrooms, and lot size. A random sample of 100 houses was drawn and the data recorded. Analyze the relationship among the four variables.
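The 0.8 rule of thumb can be checked by computing the correlation coefficient for every pair of independent variables. The sketch below uses a handful of made-up values for illustration only (the 100-house sample is not reproduced here); the helper names and the column name `H_SIZE` are hypothetical.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def serious_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Made-up illustrative values, NOT the data from the example.
predictors = {
    "BEDROOMS": [2, 3, 3, 4, 4, 5],
    "H_SIZE": [1100, 1500, 1600, 2100, 2200, 2800],
    "LOTSIZE": [5200, 4800, 6100, 5000, 5900, 5300],
}
flagged = serious_pairs(predictors)
```

In this made-up sample only the bedrooms and house-size pair is flagged, which is the kind of overlap the rule is meant to catch.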

16 Example – House Price
- The proposed model is $\text{PRICE} = \beta_0 + \beta_1\,\text{BEDROOMS} + \beta_2\,\text{H-SIZE} + \beta_3\,\text{LOTSIZE} + \varepsilon$
- Excel solution: the model is ____, but no variable is significantly related to the selling price!

17 Example – House Price
- However, when regressing the price on each independent variable alone, it is found that each variable is strongly related to the selling price. Multicollinearity is the source of this problem.
- Multicollinearity inflates the standard errors $s_{b_i}$, bringing the t-statistics closer to zero and insignificance. The $\beta$ coefficients cannot be interpreted as "slopes".

18 Example – House Price
Correcting for multicollinearity: get rid of one of the variables that is a duplicate, and re-estimate the model. With this done, given the high $R^2$ relative to your first model and the high p-value for "Bedrooms", estimate the model with only "House Size" as a variable.

19 Example – House Price
- Note that $R^2$ is nearly as high as in the original model, but adjusted $R^2$ is actually higher than before. The F-test for overall validity of the model is fine, and the t-test for your independent variable is also fine. This is your final model.
- Now: estimate the sale price for a house with 3 bedrooms and 2,000 sq ft of house on a lot of 5,000 sq ft. Compare the results of the final model with the original model.

20 Assumption #3 Violation – Non-Independence of Errors
- This condition is common with time series data.
- When it exists in time series data, it is referred to as autocorrelation.
- Detection:
  1. Run a regression.
  2. Save the residuals.
  3. Plot the residuals against time.
  4. If you see a pattern, your regression may have an autocorrelation problem.

21 Autocorrelation
- Positive first order autocorrelation occurs when consecutive residuals tend to be similar.
- Negative first order autocorrelation occurs when consecutive residuals tend to markedly differ.
[Figures: for each case, a plot of y against Time and a plot of Residuals against Time]
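A numeric companion to inspecting the residual plot is to correlate each residual with the one before it. The sketch below is an assumption of this write-up rather than a procedure from the slides: the function name is hypothetical and both residual series are made up. A clearly positive value matches "consecutive residuals tend to be similar", and a clearly negative value matches "consecutive residuals tend to markedly differ".

```python
import math


def lag1_correlation(residuals):
    """Correlation between each residual and the residual one period earlier."""
    prev, curr = residuals[:-1], residuals[1:]
    mean_p = sum(prev) / len(prev)
    mean_c = sum(curr) / len(curr)
    cov = sum((p - mean_p) * (c - mean_c) for p, c in zip(prev, curr))
    var_p = sum((p - mean_p) ** 2 for p in prev)
    var_c = sum((c - mean_c) ** 2 for c in curr)
    return cov / math.sqrt(var_p * var_c)


# Made-up residual series, listed in time order.
drifting = [-3, -2, -2, -1, 0, 1, 1, 2, 3, 3]        # neighbours look alike
alternating = [2, -2, 3, -3, 2, -2, 3, -3, 2, -2]    # neighbours flip sign

r_drifting = lag1_correlation(drifting)        # strongly positive
r_alternating = lag1_correlation(alternating)  # strongly negative
```

This does not replace the plot; it only summarizes the pattern the plot is meant to reveal.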

22 Example – Lift Ticket
- How does the weather affect the sales of lift tickets at a ski resort?
- Data on ticket sales for the past 20 years, along with the total snowfall and the average temperature during Christmas week in each year, were collected.
- The model hypothesized was $\text{TICKETS} = \beta_0 + \beta_1\,\text{SNOWFALL} + \beta_2\,\text{TEMPERATURE} + \varepsilon$
- Regression analysis yielded the following results:

23 Example – Lift Ticket
The model seems to be very poor:
- The fit is _______ ($R^2$ = 0.12).
- It is _________ (Significance F = 0.33).
- No variable is ___________ to Sales.

24 Example – Lift Ticket
[Figures: residuals over time; residuals vs. predicted y; the error distribution]
- The errors are ___ independent.
- The error variance is constant.

25 Example – Lift Ticket
- The autocorrelation has occurred over time. Therefore, a time-dependent variable added to the model may correct the problem.
- The modified regression model: $\text{TICKETS} = \beta_0 + \beta_1\,\text{SNOWFALL} + \beta_2\,\text{TEMPERATURE} + \beta_3\,\text{YEARS} + \varepsilon$
- Are all the required conditions for this model met?
- How good is the fit of this model?
- Is the model useful?
- Which variables are linearly related to ticket sales and which ones are not?
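Adding a time-dependent variable amounts to appending a YEARS column to the independent variables and re-estimating. The sketch below fits both versions by ordinary least squares using the normal equations; it is a small pure-Python stand-in for the regression software the course relies on. All numbers are made up (eight hypothetical years with an exact upward drift of 5 tickets per year), and the helper names are hypothetical.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def sse(rows, y, coefs):
    """Sum of squared residuals for the fitted coefficients."""
    total = 0.0
    for r, yi in zip(rows, y):
        pred = coefs[0] + sum(b * x for b, x in zip(coefs[1:], r))
        total += (yi - pred) ** 2
    return total


# Made-up data: weather effects plus a steady upward drift of 5 per year.
snowfall = [10, 14, 9, 12, 15, 8, 11, 13]
temperature = [-5, -2, -6, -1, -3, -7, -4, -2]
years = list(range(1, 9))
tickets = [100 + 2 * s - t + 5 * yr for s, t, yr in zip(snowfall, temperature, years)]

weather_only = list(zip(snowfall, temperature))
with_years = list(zip(snowfall, temperature, years))

coefs_weather = fit_ols(weather_only, tickets)
coefs_trend = fit_ols(with_years, tickets)
sse_weather = sse(weather_only, tickets, coefs_weather)
sse_trend = sse(with_years, tickets, coefs_trend)
```

In this constructed case the weather-only model leaves the yearly drift in its residuals, while the model with YEARS absorbs it and recovers the drift as the YEARS coefficient. Real data will not be this clean, so the diagnostics still need to be re-checked after the variable is added.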

26
- The fit of this model is _____ ($R^2$ = 0.74).
- The model is _____ (Significance F = 5.93E-5).
- All the required conditions ______ for this model.
- TEMPERATURE is ________ related to ticket sales.
- SNOWFALL and YEARS ________ related to ticket sales.

