Chapter 28: Multiple Regression, Indicator (Dummy) Variables, and Interaction Terms
Introduction
In this section we extend simple linear regression, which had one explanatory variable, to allow for any number of explanatory variables. We expect to build a model that fits the data better than the simple linear regression model.
Introduction, continued
We shall use computer printout to:
Assess the model: how well does it fit the data? Is it useful? Are any required conditions violated?
Employ the model: interpreting the coefficients, making predictions with the prediction equation, and estimating the expected value of the dependent variable.
The Multiple Regression Model
Idea: examine the linear relationship between one response variable (y) and two or more explanatory variables (x1, x2, …, xk).
Population model: y = β0 + β1x1 + β2x2 + … + βkxk + ε, where β0 is the y-intercept, β1, …, βk are the population slopes, and ε is the random error.
Estimated multiple regression model: ŷ = b0 + b1x1 + b2x2 + … + bkxk, where ŷ is the estimated (predicted) value of y, b0 is the estimated intercept, and b1, …, bk are the estimated slope coefficients.
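The estimated coefficients b0, b1, …, bk are the least-squares solution, which can be sketched in a few lines of Python. This is an illustrative sketch, not the slides' Excel workflow; the data below are made up so that the fitted coefficients are known in advance.

```python
import numpy as np

# Hypothetical data: y is generated exactly as 2 + 3*x1 - 1*x2 (no noise),
# so least squares should recover those coefficients.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 2 + 3 * x1 - 1 * x2

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve for (b0, b1, b2) by minimizing the sum of squared errors.
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef
print(b0, b1, b2)
```

Statistical software (or Excel's LINEST) performs exactly this computation, along with the standard errors and test statistics discussed later.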
Simple Linear Regression
[Figure: a fitted regression line with intercept b0 and slope b1, showing for one value xi the observed value of y, the predicted value of y on the line, and the random error εi between them.]
Multiple Regression, 2 Explanatory Variables
[Figure: with two explanatory variables, least squares fits a plane instead of a line; the scatter of points around the plane is the random error.]
Multiple Regression Model
[Figure: two-variable model showing a sample observation yi, its fitted value ŷi, and the residual e = yi − ŷi. The best-fit equation ŷ is found by minimizing the sum of squared errors, Σe².]
Estimating the Coefficients and Assessing the Model
The procedure used to perform regression analysis:
1. Obtain the model coefficients and statistics using statistical software.
2. Diagnose violations of required conditions, and try to remedy problems when identified.
3. Assess the model fit using statistics obtained from the sample.
4. If the model assessment indicates a good fit to the data, use the model to interpret the coefficients and generate predictions.
Estimating the Coefficients and Assessing the Model, Example
We would like to predict final exam scores in ST 350 using information generated during the semester. Predictors of the final exam score: Exam 1, Exam 2, Exam 3, and the homework total.
Estimating the Coefficients and Assessing the Model, Example
Data were collected from 203 randomly selected students from previous semesters. The following model is proposed:
final exam = b0 + b1·exam1 + b2·exam2 + b3·exam3 + b4·hwtot
[Table: sample rows of the data, with columns exam1, exam2, exam3, hwtot, and finalexm.]
Regression Analysis, Excel Output
The Excel output gives the sample regression equation (sometimes called the prediction equation):
Final exam score = b0 + b1·exam1 + b2·exam2 + b3·exam3 + b4·hwtot, with the numeric coefficient estimates read from the output.
Interpreting the Coefficients
b0 is the intercept: the value of y when all the explanatory variables take the value zero. Since the data ranges of the independent variables do not cover zero, do not interpret the intercept.
b1: in this model, for each additional point on exam 1, the final exam score increases on average by b1 points (assuming the exam 2, exam 3, and homework variables are held constant).
Interpreting the Coefficients
b2: for each additional point on exam 2, the final exam score increases on average by b2 points (assuming the exam 1, exam 3, and homework variables are held constant).
b3: for each additional point on exam 3, the final exam score increases on average by b3 points (assuming the exam 1, exam 2, and homework variables are held constant).
b4: for each additional point on the homework, the final exam score increases on average by b4 points (assuming the exam 1, exam 2, and exam 3 variables are held constant).
Final Exam Scores, Predictions
Predict the average final exam score of a student with the following scores: Exam 1 = 75, Exam 2 = 79, Exam 3 = 85, Homework = 310. Using the TREND function in Excel (or the prediction equation directly):
Final exam score = b0 + b1(75) + b2(79) + b3(85) + b4(310).
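The same plug-in computation can be sketched outside Excel. The coefficients below are placeholders (the slides' fitted values are not reproduced here); only the mechanics of the dot product are the point.

```python
# Hypothetical coefficients b0..b4 -- NOT the slides' fitted values.
b = [10.0, 0.2, 0.25, 0.3, 0.05]
# Predictor vector with a leading 1 for the intercept:
# [1, exam1, exam2, exam3, hwtot]
x = [1, 75, 79, 85, 310]

# The prediction is the dot product of coefficients and predictors.
pred = sum(bi * xi for bi, xi in zip(b, x))
print(pred)
```

Excel's TREND function performs this same substitution using the coefficients it fits internally.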
Model Assessment
The model is assessed using three tools:
The standard error of the residuals
The coefficient of determination (R²)
The F-test of the analysis of variance
The standard error of the residuals is also used in building the other two tools.
Standard Error of Residuals
The standard deviation of the residuals is estimated by the standard error of the residuals, se = √(SSE / (n − k − 1)). The magnitude of se is judged by comparing it to the mean value of y.
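As a quick sketch of the formula (with made-up residuals, not the course data):

```python
import math

# Hypothetical residuals from a fitted model with k = 2 predictors.
residuals = [1.0, -2.0, 0.5, -0.5, 1.0, -1.0]
n, k = len(residuals), 2

sse = sum(e ** 2 for e in residuals)   # sum of squared residuals
se = math.sqrt(sse / (n - k - 1))      # standard error of residuals
print(se)
```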
Regression Analysis, Excel Output
[Excel output: the standard error of the residuals is √MSE; equivalently, (standard error of the residuals)² = MSE = SSE/198, where SSE is the sum of squared residuals and 198 = n − k − 1 = 203 − 4 − 1.]
Standard Error of Residuals
From the printout, se = …. Calculating the mean value of y, we have ȳ = …. Compared with the mean, se is not particularly small. Question: can we conclude the model does not fit the data well?
Coefficient of Determination R² (like r² in simple linear regression)
R² is the proportion of the variation in y that is explained by differences in the explanatory variables x1, x2, …, xk:
R² = 1 − (SSE/SSTotal)
From the printout, R² = 38.25%: that share of the variation in final exam score is explained by differences in the exam1, exam2, exam3, and hwtot explanatory variables; the remaining 61.75% is unexplained. When adjusted for degrees of freedom, Adjusted R² = 36.99%.
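Both R² and its degrees-of-freedom adjustment can be sketched directly from the definitions (toy numbers, not the course data):

```python
# Hypothetical observed and fitted values from a model with k = 2 predictors.
y    = [3.0, 5.0, 7.0, 6.0, 9.0]
yhat = [3.5, 4.5, 6.5, 6.5, 9.0]
n, k = len(y), 2

ybar = sum(y) / n
sst = sum((yi - ybar) ** 2 for yi in y)               # total variation in y
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))  # unexplained variation

r2 = 1 - sse / sst
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))    # adjusted for df
print(r2, adj_r2)
```

Adjusted R² is always at most R², since it charges the model for each predictor it uses.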
Testing the Validity of the Model
We pose the question: is at least one explanatory variable linearly related to the response variable? To answer it we test the hypotheses
H0: β1 = β2 = … = βk = 0
HA: at least one βi is not equal to zero.
If at least one βi is nonzero, the model has some validity.
Testing the Validity of the Final Exam Scores Regression Model
The hypotheses are tested by what is called an F-test, shown in the Excel ANOVA output: F = MSR/MSE together with its P-value, where the degrees of freedom are k (regression), n − k − 1 (residual), and n − 1 (total), MSR = SSR/k, and MSE = SSE/(n − k − 1).
Testing the Validity of the Final Exam Scores Regression Model
[Total variation in y] = SSR + SSE. A large F results from a large SSR; in that case much of the variation in y is explained by the regression model, the model is useful, and the null hypothesis H0 should be rejected. Reject H0 when the P-value < 0.05.
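The F statistic follows directly from that ANOVA decomposition; a sketch with made-up sums of squares:

```python
# Hypothetical ANOVA totals: SSTotal = SSR + SSE.
sst, sse = 20.0, 8.0
n, k = 25, 4

ssr = sst - sse            # variation explained by the regression
msr = ssr / k              # mean square for regression
mse = sse / (n - k - 1)    # mean square error
F = msr / mse              # a large F means the model explains much of y
print(F)
```

The P-value then comes from the F distribution with (k, n − k − 1) degrees of freedom, which Excel reports as Significance F.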
Testing the Validity of the Final Exam Scores Regression Model
The P-value (Significance F) is less than 0.05, so we reject the null hypothesis. Conclusion: there is sufficient evidence to reject the null hypothesis in favor of the alternative. At least one βi is not equal to zero; thus at least one explanatory variable is linearly related to y, and this linear regression model is valid.
Testing the Coefficients
For each coefficient the hypotheses, tested in the Excel printout, are H0: βi = 0 versus H1: βi ≠ 0, with test statistic t = bi / SE(bi) and d.f. = n − k − 1.
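A sketch of the per-coefficient t-ratios. The estimates and standard errors below are hypothetical; 203 and 4 match the example's sample size and predictor count.

```python
# Hypothetical (estimate, standard error) pairs -- not the slides' values.
coefs = {"exam1": (0.30, 0.08), "hwtot": (0.05, 0.07)}
n, k = 203, 4
df = n - k - 1   # 198 degrees of freedom

# t-ratio for each coefficient: estimate divided by its standard error.
tstats = {name: b / se for name, (b, se) in coefs.items()}
print(df, tstats)
```

A |t| well above 2 (as for exam1 here) suggests the coefficient differs from zero; a small |t| (as for hwtot here) does not.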
Multiple Regression (cont.): Indicator (or Dummy) Variables
QTM1310/ Sharpe
What makes a good roller coaster ride? Speed, height, duration, inversions, and other features. Do you expect the duration of the ride to be related to the length of the ride?
Indicator (or Dummy) Variables
From a sample of coasters worldwide, the plot of Duration on Length looks strong and positive.
Indicator (or Dummy) Variables
From a sample of coasters worldwide, the regression of Duration (y) on Length (x) looks strong. The duration of a ride increases by about 23 seconds for each additional 1000 ft of track.
Indicator (or Dummy) Variables
Many rides have inversions (loops, corkscrews, etc.). These features impose limitations on the speed of the coaster. How do we introduce the categorical variable Inversions into the regression?
Indicator (or Dummy) Variables
Let's analyze each group (rides with inversions and rides without) separately: the slopes are very similar, but the intercepts are different.
Indicator (or Dummy) Variables
When the data can be divided into two groups with similar regression slopes, we can incorporate the group information into a single regression model. We accomplish this by introducing an indicator (or "dummy") variable that indicates whether a coaster has an inversion: Inversions = 1 if "Yes", Inversions = 0 if "No".
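Coding such an indicator is a one-liner; a sketch with made-up labels:

```python
# Hypothetical categorical data: does each coaster have an inversion?
has_inversion = ["Yes", "No", "Yes", "No"]

# 1/0 indicator column, ready to enter the regression alongside Length.
inversions = [1 if v == "Yes" else 0 for v in has_inversion]
print(inversions)
```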
Indicator (or Dummy) Variables
Re-analyze with a multiple regression that includes both Length and Inversions: R² is larger for the multiple regression (70.4%) than for the simple regression (62.0%), and the t-ratios of both coefficients are large.
Indicator (or Dummy) Variables
Notice how the indicator variable works in the model. Inversions "turns on" (1) and "turns off" (0) an additional amount of duration (about 30 seconds), depending on whether the ride has an inversion. Turning this factor on shifts the intercept upward by about 30 seconds while leaving the slope unaffected, which is consistent with the simple regressions of the separate groups.
Indicator (or Dummy) Variables: Prediction
Predicted duration for 4700 ft of track with an inversion, and for 4700 ft without: substitute Length = 4700 and Inversions = 1 or 0 into the fitted equation. The two predictions differ by the Inversions coefficient, about 30 seconds.
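A sketch of that substitution. The intercept below is a placeholder, while the slope of 0.023 s/ft and the 30 s inversion effect follow the approximate values quoted on the slides (about 23 s per 1000 ft, roughly 30 s for an inversion).

```python
def predicted_duration(length_ft, inversions, b0=90.0):
    """Predicted ride duration in seconds.

    b0 is a hypothetical intercept; 0.023 s/ft and the 30 s shift for
    an inversion follow the slides' approximate estimates.
    """
    return b0 + 0.023 * length_ft + 30.0 * inversions

with_inv = predicted_duration(4700, 1)
without = predicted_duration(4700, 0)
print(with_inv, without)
```

Whatever the track length, the two predictions always differ by exactly the Inversions coefficient.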
Multiple Regression (cont.): Adjusting for Different Slopes, Interaction Terms
Indicator variables can account for differences in the intercepts of different groups. But what if the slopes of the groups differ? Example: Calories vs. Carbohydrates for selected Burger King® products, split into meat-based dishes and non-meat dishes; the slopes of the two groups are different.
Adjusting for Different Slopes, Interaction Terms
Start as before by introducing an indicator variable Meat: Meat = 1 if meat (including chicken and fish) is present in the dish, Meat = 0 if it is not. Adding Meat to the model adjusts the intercept. To adjust the slope, add the interaction term Carbs*Meat. Note that Carbs*Meat equals just Carbs when Meat = 1 and equals 0 ("disappears") when Meat = 0.
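Building the interaction column is simply an elementwise product of the two predictors; a sketch with made-up rows:

```python
# Hypothetical predictor columns.
carbs = [30.0, 50.0, 45.0]
meat = [0, 1, 1]   # indicator: is meat present in the dish?

# Interaction column: equals Carbs when Meat = 1, zero when Meat = 0.
carbs_meat = [c * m for c, m in zip(carbs, meat)]
print(carbs_meat)
```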
Adjusting for Different Slopes, Interaction Terms
Re-analyze with the interaction term:

Dependent variable: Calories
R-squared = 77.4%, R-squared (adjusted) = 75.0%
s = … with 28 degrees of freedom

Source      Sum of Squares  DF  Mean Square  F-ratio
Regression  …               3   …            31.99
Residual    …               28  …

Variable    Coefficient  SE(Coeff)  t Stat   P-value
Intercept   …            59.67      2.35     0.026
Carbs       3.95         1.1254     3.51     0.002
Meat        …            99.66      -0.292   0.772
Carbs*Meat  7.858        2.202      3.57     0.0013

The predictor Meat is not significant, whereas the interaction Carbs*Meat is significant.
Adjusting for Different Slopes, Interaction Terms
Prediction equation: for Meat = 0, the intercept is the overall intercept and the slope is 3.95. For Meat = 1, the intercept drops by the Meat coefficient and the slope increases to 3.95 + 7.858 ≈ 11.81. (The drop in intercept is not significant.)
Adjusting for Different Slopes, Interaction Terms
Predict the calories if each item has 50 g of carbs. For fries, Meat = 0, so only the intercept and the Carbs term (slope 3.95) enter the prediction. For a hamburger, Meat = 1, so the Meat and Carbs*Meat terms enter as well, giving the steeper slope of 11.81.
Adjusting for Different Slopes, Interaction Terms
Introducing an interaction term produces a result consistent with separate simple regressions of the meat group and the non-meat group.
Indicator Variables: 3 or More Levels
Some categorical variables have more than two levels. For example, the variable Month has 12 levels. Temptation: create a single variable with Month = 1 for January, Month = 2 for February, etc. This is not recommended!
Indicator Variables: 3 or More Levels
Rather, introduce 11 indicator variables: one that "turns on and off" the month of February, another for March, and so on. If no month is turned on, the model defaults to the baseline month (January), so there is no need for a separate January indicator. The model will still produce output if more than one month is turned on, but such results should not be interpreted.
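This one-indicator-per-non-baseline-level scheme can be sketched directly (month names abbreviated; January is the baseline):

```python
# Eleven non-baseline levels; January is the baseline (all zeros).
LEVELS = ("Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def month_indicators(month):
    """Return the 11 indicator values for one observation's month."""
    return [1 if month == lev else 0 for lev in LEVELS]

print(month_indicators("Jan"))   # baseline month: every indicator is off
print(month_indicators("Mar"))
```

Exactly one indicator is on for any non-January month, so the coefficient on each indicator measures that month's shift relative to January.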
Indicator Variables: 3 or More Levels
Automobile traffic delays waste time and fuel. Data for 68 metro areas. y: delay per person (hours per year); x1: average highway speed; x2: average arterial speed. The 68 metro areas are classified as Very Large, Large, Medium, or Small (4 categories). Three indicator variables: 1) Very Large, 2) Large, 3) Medium; "Small" is the base level (all three indicators are 0).
Interpreting the Coefficients: the base is a "Small" metro area
[Excel ANOVA output: Regression df = 5, Residual df = 62, Total df = 67; coefficients, standard errors, t-stats, and P-values for the Intercept, HiWay MPH, Arterial MPH, Medium, Large, and Very Large.]
The delay per person per year in a Very Large metro area is 7 hours longer than in a Small metro area.
The delay per person per year in a Large metro area is 8.60 hours longer than in a Small metro area.
The delay per person per year in a Medium metro area is 3.59 hours longer than in a Small metro area.
Excel Output: the base is a "Small" metro area
Predict the delay per person in Raleigh (a Medium metro area) if HiWay MPH is 55 and Arterial MPH is 35: substitute x1 = 55, x2 = 35, Medium = 1, Large = 0, and Very Large = 0 into the fitted equation from the Excel output.