Multiple Regression
Predicting a response with multiple explanatory variables
Assumptions
– Sample is representative
– Error is random with a mean of zero
– Independent variables are measured without error
– Independent variables are linearly independent (no multicollinearity)
– Errors are uncorrelated
– Error variance is constant (homoscedasticity)
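One way to check the multicollinearity assumption is the variance inflation factor (VIF). The sketch below computes it by hand in base R. The data are simulated, and the names dat, x1, x2, x3 and vif_by_hand are hypothetical, chosen only for illustration.

```r
# Simulated data (hypothetical): x2 is deliberately built to correlate with x1
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

# VIF for a predictor is 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on all of the other predictors
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- vif_by_hand(dat, c("x1", "x2", "x3"))
print(round(vifs, 2))
```

A VIF near 1 suggests a predictor is nearly independent of the others, while large values flag overlap. For the constant-variance assumption, calling plot() on a fitted lm object with which = 1 draws the residuals against the fitted values.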
Data/Distribution Issues
– Consider outlying values: accurate estimates may require eliminating them or using robust approaches
– Non-normal distributions may require transformation
– Plot the response against each explanatory variable
Modeling
– We want a model that fits (predicts) the response variable with as few explanatory variables as possible
– R² measures the proportion of variability in the response accounted for by the explanatory variables
– Adjusted R² takes the number of explanatory variables into account
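The two measures can be reproduced by hand, which makes the penalty for extra variables explicit: with n observations and p explanatory variables, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). The sketch below uses simulated data; the variable names (area, people, days) are hypothetical stand-ins and not the Kalahari values.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(2)
n      <- 30
people <- round(runif(n, 5, 30))
days   <- round(runif(n, 1, 60))
area   <- 10 + 4 * people + 0.5 * days + rnorm(n, sd = 15)
dat    <- data.frame(area, people, days)

fit <- lm(area ~ people + days, data = dat)

rss <- sum(residuals(fit)^2)               # residual sum of squares
tss <- sum((dat$area - mean(dat$area))^2)  # total sum of squares
p   <- length(coef(fit)) - 1               # explanatory variables (no intercept)

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Compare with the values reported by summary()
s <- summary(fit)
print(c(by_hand = r2, summary = s$r.squared))
print(c(by_hand = adj_r2, summary = s$adj.r.squared))
```

Because (n − 1)/(n − p − 1) is at least 1, adjusted R² can never exceed R², and the gap widens as variables are added without a matching drop in residual variation.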
Modeling Methods
– The general approach is to include the variables that are theoretically relevant to predicting the response
  – Gradually remove variables that are not significant, and test whether each reduced model differs significantly from the previous one
– Automatic stepwise methods
  – Forward and backward
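Automatic stepwise selection can be sketched with step() from base R's stats package. Note that step() adds or drops terms by AIC rather than by significance tests, so it is one possible implementation of the idea, not necessarily the procedure used for these slides. The data are simulated and all variable names are hypothetical.

```r
# Simulated data (hypothetical): only x1 and x2 actually affect y
set.seed(3)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
y  <- 2 + 3 * x1 - 2 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2, x3, x4)

full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)  # every candidate variable
null <- lm(y ~ 1, data = dat)                  # intercept only

# Backward: start from the full model and drop terms while AIC improves
backward <- step(full, direction = "backward", trace = 0)

# Forward: start from the null model and add terms from the full formula
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

print(formula(backward))
print(formula(forward))
```

Setting trace = 0 suppresses the step-by-step log; leaving it at the default shows the AIC of each candidate move.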
A Simple Example
– The Kalahari data include site area (LMS), the number of days each site was occupied, and the number of people who occupied it
– Rcmdr: Statistics | Fit models | Linear Model
Two Models
– Model 1: LMS ~ People + Days
– Model 2: LMS ~ People * Days
  – Equivalent to LMS ~ People + Days + People:Days
– Check the significance of the slopes
– Compare the models for a significant difference
> LinearModel.1 <- lm(LMS ~ People + Days, data=Kalahari)
> summary(LinearModel.1)

Call:
lm(formula = LMS ~ People + Days, data = Kalahari)

Residuals:
Min 1Q Median 3Q Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                       *
People                                      e-05 ***
Days                                              *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 12 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 12 DF, p-value: 6.377e-05
> LinearModel.2 <- lm(LMS ~ People*Days, data=Kalahari)
> summary(LinearModel.2)

Call:
lm(formula = LMS ~ People * Days, data = Kalahari)

Residuals:
Min 1Q Median 3Q Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)
People
Days
People:Days

Residual standard error: on 11 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 11 DF, p-value:
> anova(LinearModel.1, LinearModel.2)
Analysis of Variance Table

Model 1: LMS ~ People + Days
Model 2: LMS ~ People * Days
  Res.Df RSS Df Sum of Sq F Pr(>F)
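The comparison made by anova() for two nested models is a partial F-test, and it can be reproduced from the residual sums of squares. The sketch below uses simulated data with 15 cases so that the residual degrees of freedom (12 and 11) match the models above. The values are hypothetical and are not the Kalahari measurements.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(4)
n      <- 15
people <- round(runif(n, 5, 30))
days   <- round(runif(n, 1, 60))
area   <- 10 + 4 * people + 0.5 * days + rnorm(n, sd = 15)
dat    <- data.frame(area, people, days)

m1 <- lm(area ~ people + days, data = dat)  # additive model
m2 <- lm(area ~ people * days, data = dat)  # adds the people:days interaction

rss1 <- deviance(m1); df1 <- df.residual(m1)  # RSS and residual df, model 1
rss2 <- deviance(m2); df2 <- df.residual(m2)  # RSS and residual df, model 2

# F = (drop in RSS per added term) / (residual mean square of larger model)
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_val  <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

tab <- anova(m1, m2)
print(tab)
print(c(F = f_stat, p = p_val))
```

Because the two models differ by a single term, this F statistic equals the square of the t value for the interaction coefficient in summary(m2), so the two tests give the same p-value.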
Darl Points
– Create a subset of DartPoints containing only the Darl points
– Model 1: Length ~ Width + Thick
– Model 2: Length ~ Width * Thick
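Creating the subset might look like the sketch below. It assumes the point type is stored in a factor column called Name, which is an assumption about the data layout. The small data frame points_df is a hypothetical stand-in with made-up measurements, not the real DartPoints data.

```r
# Hypothetical stand-in for DartPoints; the column names and all values are
# made up for illustration
points_df <- data.frame(
  Name   = factor(c("Darl", "Darl", "Ensor", "Darl", "Travis", "Ensor")),
  Length = c(42.8, 40.5, 43.5, 37.5, 56.2, 35.1),
  Width  = c(15.8, 17.4, 21.1, 16.3, 22.6, 20.4),
  Thick  = c(5.8, 5.9, 6.6, 5.3, 9.1, 6.2)
)

# Keep only the rows whose type is "Darl"
darl <- subset(points_df, Name == "Darl")

# Drop the now-empty factor levels so later tables and plots omit them
darl <- droplevels(darl)

print(darl)
print(levels(darl$Name))
```

The resulting data frame can then be passed to lm() through the data argument, as in the models that follow.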
> LinearModel.4 <- lm(Length ~ Width + Thick, data=Darl)
> summary(LinearModel.4)

Call:
lm(formula = Length ~ Width + Thick, data = Darl)

Residuals:
Min 1Q Median 3Q Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)
Width                                             *
Thick                                             *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 24 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 24 DF, p-value: 8.554e-05
> LinearModel.5 <- lm(Length ~ Width * Thick, data=Darl)
> summary(LinearModel.5)

Call:
lm(formula = Length ~ Width * Thick, data = Darl)

Residuals:
Min 1Q Median 3Q Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)
Width
Thick
Width:Thick

Residual standard error: on 23 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 23 DF, p-value:
> anova(LinearModel.4, LinearModel.5)
Analysis of Variance Table

Model 1: Length ~ Width + Thick
Model 2: Length ~ Width * Thick
  Res.Df RSS Df Sum of Sq F Pr(>F)