Multiple Linear Regression Analysis BSAD 30 Dave Novak Fall 2018 Source: Ragsdale, 2018 Spreadsheet Modeling and Decision Analysis 8th edition © 2017 Cengage Learning
Overview
Last class we considered the relationship between one independent variable and one dependent variable, referred to as "simple" linear regression.
Today, we consider the relationship between more than one independent variable (the X's) and a single dependent variable (Y), referred to as "multiple" linear regression.
Example
Multiple regression
When more than one independent variable is used to explain variance in Y:
Ŷi = b0 + b1X1i + b2X2i + … + bnXni
where it is assumed the X's are independent of one another
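The slides fit these models with a spreadsheet's regression tool, but the same estimation is ordinary least squares. A minimal sketch, using made-up data (not the slide's dataset) and NumPy:

```python
import numpy as np

# Hypothetical data for 6 houses -- illustrative only, not the slide's dataset
# X1 = square footage (in 1000s), X2 = garage size (cars), X3 = bedrooms
X = np.array([
    [2.1, 2, 4],
    [1.6, 1, 3],
    [2.4, 2, 4],
    [1.4, 0, 2],
    [3.0, 2, 5],
    [1.9, 1, 3],
])
y = np.array([239.9, 169.9, 249.9, 129.9, 299.9, 189.9])  # price in $1000s

# Prepend a column of 1s so b0 (the intercept) is estimated along with b1..b3
A = np.column_stack([np.ones(len(X)), X])
b, *_ = np.linalg.lstsq(A, y, rcond=None)

# Fitted values: Y-hat_i = b0 + b1*X1i + b2*X2i + b3*X3i
y_hat = A @ b
```

With an intercept in the model, the residuals of the least-squares fit sum to zero, which is a quick sanity check on the estimation.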
Multiple regression
Assume that you want to develop a model to predict the market value (or price) of houses in your town/city. We have access to the following data.
Source: Ragsdale, 2018, Spreadsheet Modeling and Decision Analysis, 8th edition
Multiple regression
We want to predict housing price (in thousands of $) using some combination of the data we have on square footage, garage size, and number of bedrooms (3 possible X variables). Even with three independent variables, we can create many different regression models. The model with the most X's is often not the "best" model.
Multiple regression
Having access to many different independent variables does not necessarily mean that they all should be part of a regression model. Rule of thumb for linear regression: KEEP IT AS SIMPLE AS POSSIBLE. Start by looking at scatter plots and correlation.
Multiple regression
[Scatter plots of Y against each X variable]
Source: Ragsdale, 2018, Spreadsheet Modeling and Decision Analysis, 8th edition
Multiple regression Look at correlation coefficient (r) for each X / Y combination
Multiple regression
Given the results from the scatter plots and correlations, start with three separate simple linear regression models and compare results:
Ŷi = b0 + b1X1i (X1 = square footage)
Ŷi = b0 + b2X2i (X2 = garage size)
Ŷi = b0 + b3X3i (X3 = # bedrooms)
X1 – square footage
X2 – garage size
X3 – # bedrooms
Summary comparison
X1 has the highest R2 and Adj. R2, and the lowest Std. error. Reasonable to start with X1 and build off that variable.
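Adjusted R2 is the statistic doing the work in these comparisons: unlike R2, it penalizes each extra X, so a more complex model only scores higher if the added variable genuinely helps. A small sketch of the standard formula (the sample sizes and R2 values below are illustrative, not the slide's output):

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n = number of observations and k = number of X variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# The same raw R^2 looks worse as more X's are added (illustrative numbers)
one_var   = adjusted_r_squared(0.870, n=11, k=1)
three_var = adjusted_r_squared(0.870, n=11, k=3)
```

This is why a model with more X's but the same R2 can have a lower Adj. R2, as happens in the comparisons that follow.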
Combinations of two variables
Ŷi = b0 + b1X1i + b2X2i (square footage + garage)
Ŷi = b0 + b1X1i + b3X3i (square footage + # bedrooms)
X1 + X2 (Sq ft + Garage)
X1 + X3 (Sq ft + Bedrooms)
Change in b0 and b1
Notice the values of b0 and b1 have changed from the model where we had X1 alone:
Ŷi = b0 + b1X1i (b0 = 109.5, b1 = 56.394)
Ŷi = b0 + b1X1i + b2X2i (b0 = 127.68, b1 = 38.576)
Ŷi = b0 + b1X1i + b3X3i (b0 = 108.31, b1 = 44.313)
Summary comparison Where X1 = sq. ft., X2 = garage, X3 = # bedrooms
Multicollinearity
It is not surprising that adding X3 (# of bedrooms) to a regression model with X1 (total square footage) did not improve the model. Both variables represent similar things: a measure of house size. These variables appear to be highly correlated.
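A quick way to screen for multicollinearity is to compute the correlation between the X's themselves, not just between each X and Y. A sketch with made-up values (not the slide's data):

```python
import numpy as np

# Hypothetical square footage and bedroom counts -- illustrative only
x1 = np.array([2.1, 1.6, 2.4, 1.4, 3.0, 1.9])   # total sq ft (1000s)
x3 = np.array([4, 3, 4, 2, 5, 3])               # number of bedrooms

# Correlation between the two independent variables
r13 = np.corrcoef(x1, x3)[0, 1]
# A high |r| between two X's signals multicollinearity: the second
# variable adds little information the first doesn't already carry.
```

When two X's are this strongly correlated, the regression cannot cleanly separate their individual effects, which is one reason the coefficients shifted between the one- and two-variable models above.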
Combination of all three
Ŷi = b0 + b1X1i + b2X2i + b3X3i (square footage + garage + # bedrooms)
As it is not a time-consuming undertaking to test all three variables, we also want to examine the FULL (all independent variables) model.
X1 + X2 + X3
Summary comparison
Best fit
How do we choose? The two-variable model with X1 and X2 has the highest Adj. R2 and lowest Std. error of all the models. Making the model more complex by adding all three variables doesn't add anything to predictive power. We also know that X3 is highly correlated with X1 (X1 and X3 are not independent).
Making predictions
Use our selected model to estimate the average selling price of a house with 2,100 sq ft and a 2-car garage:
Ŷi = 127.68 + 38.576X1i + 12.875X2i
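Plugging the house's values into the fitted equation gives the point estimate. Using the coefficients reported on the earlier slide (b0 = 127.68, b1 = 38.576, b2 = 12.875), and assuming square footage is entered in thousands (2.1 for 2,100 sq ft, as the coefficient magnitudes suggest):

```python
# Coefficients from the fitted two-variable model on the slide
b0, b1, b2 = 127.68, 38.576, 12.875

# House to price: 2,100 sq ft (entered as 2.1 thousand) and a 2-car garage
x1, x2 = 2.1, 2
y_hat = b0 + b1 * x1 + b2 * x2   # estimated average selling price, $1000s
```

The point estimate works out to roughly $234,400 for this house.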
Making predictions 95% prediction interval for the actual selling price:
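The slide's prediction interval comes from the regression output; a common rough approximation is the point estimate plus or minus a t-value times the model's standard error (the exact interval is slightly wider, growing with distance from the means of the X's). A sketch with illustrative numbers, not the slide's actual output:

```python
# Approximate 95% prediction interval: Y-hat +/- t * Se
# All values below are illustrative assumptions, not the slide's figures
y_hat = 234.44   # predicted price from the fitted model, $1000s
se = 21.0        # standard error of the regression (illustrative)
t = 2.0          # rough t-value for 95% confidence with moderate df

lower, upper = y_hat - t * se, y_hat + t * se
```

The interval is for an individual house's actual selling price, so it is much wider than a confidence interval for the average price of all such houses.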
Problem In-class example problem