STAT 4030 – Programming in R STATISTICS MODULE: Multiple Regression

Jennifer Lewis Priestley, Ph.D.
Kennesaw State University

STATISTICS MODULE
  Basic Descriptive Statistics and Confidence Intervals
  Basic Visualizations
    Histograms
    Pie Charts
    Bar Charts
    Scatterplots
  T-tests
    One Sample
    Paired
    Independent Two Sample
  Proportion Testing
  ANOVA
  Chi Square and Odds
  Regression Basics

STATISTICS MODULE: Multiple Regression
Previously, we learned that the equation of a line takes the general form y = mx + b, where:
  y is the dependent variable
  m is the slope of the line
  x is the independent variable, or predictor
  b is the y-intercept

When we discuss regression models, we rewrite this equation as:

  Y = b0 + b1x1

where b0 is the y-intercept and b1 is the slope of the line. The “slope” is also the effect of a one-unit change in x1 on Y.

This was fine, but typically we don’t have just one predictor; we have many. When we discuss multiple regression models, the general form of the equation is:

  Y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn

where b0 is still the y-intercept and each “bi” is the effect of a one-unit change in predictor xi on the dependent variable Y, holding the other predictors constant. Let’s discuss the general form of different hypothetical multiple regression models…
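As a rough illustration of the general form above, the sketch below fits a two-predictor model and rebuilds one fitted value by hand from b0, b1, and b2. The mtcars data set (shipped with base R) and the choice of mpg, wt, and hp are assumptions made only for this example; they are not part of the original slides.

```r
# Illustrative only: mtcars and the variables mpg, wt, hp are assumed here
fit <- lm(mpg ~ wt + hp, data = mtcars)

# b0 (the intercept) followed by one slope per predictor
b <- coef(fit)
print(b)

# Rebuild the first fitted value by hand: Y = b0 + b1*x1 + b2*x2
by_hand <- b["(Intercept)"] + b["wt"] * mtcars$wt[1] + b["hp"] * mtcars$hp[1]
print(unname(by_hand))

# This should agree with what lm() reports for the same observation
print(unname(fitted(fit)[1]))
```

Each slope is read as the change in mpg for a one-unit change in that predictor with the other predictor held fixed.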

The requirements for Multiple Regression are generally the same as they were for Linear Regression:
  The relationship between the dependent variable and the independent variable(s) is assumed to be linear.
  The dependent variable and the independent variable(s) should have some (hopefully significant) correlation.
  There should be no extreme values that influence (usually negatively) the results.
  The residuals are homoscedastic (their variance is constant).
  All observations are independent.
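Some of these requirements can be checked informally with base R alone. The following is a minimal sketch, again assuming the mtcars data and an illustrative model; the 4/n cutoff for Cook's distance is one common convention, not a rule taken from the slides.

```r
# Illustrative model; data set and variables are assumptions for this sketch
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)
fv  <- fitted(fit)

# Linearity and homoscedasticity: residuals vs. fitted values should show
# no curve and no funnel shape
plot(fv, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Extreme values: flag observations with a large Cook's distance
# (the 4/n threshold is an assumed convention)
cd <- cooks.distance(fit)
influential <- names(cd)[cd > 4 / nrow(mtcars)]
print(influential)

# Correlation between the dependent variable and each predictor
print(cor(mtcars$mpg, mtcars[, c("wt", "hp")]))
```

Independence of observations is a property of how the data were collected and generally cannot be confirmed from a plot like this.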

But there are some issues in Multiple Regression which are not present in Linear Regression:
  Multicollinearity amongst predictors
  “Ingredient” variables
  Selection Methods/Model Parsimony

Let’s explore each of these in turn…

Consider the VIF (Variance Inflation Factor).

  VIF = 1/(1 - R2)

where the R2 value here is the one obtained when the predictor in question is set as the dependent variable and regressed on the other predictors. For example, if the VIF = 10, then the respective R2 would be 90%. This would mean that 90% of the variance in the predictor in question can be explained by the other independent variables. Because so much of the variance is captured elsewhere, removing the predictor in question should not cause a substantive decrease in overall R2. A common rule of thumb is to remove variables with VIF scores greater than 10.
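The formula can be checked by hand. The sketch below computes the VIF for one predictor by regressing it on the other predictors, then runs the slide's VIF = 10 example in reverse. The mtcars data and the predictors wt, hp, and disp are assumptions chosen only for illustration.

```r
# VIF for wt in a model with predictors wt, hp, disp (illustrative choice):
# set wt as the dependent variable and regress it on the other predictors
aux <- lm(wt ~ hp + disp, data = mtcars)
r2  <- summary(aux)$r.squared

vif_wt <- 1 / (1 - r2)
print(vif_wt)

# The example in reverse: a VIF of 10 corresponds to R2 = 1 - 1/10
r2_from_vif <- 1 - 1 / 10
print(r2_from_vif)
```

For a model with only numeric predictors, this hand calculation should match the value that vif() from the car package reports for the same predictor.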

What is an “ingredient” variable? If one of the predictor variables is a component of the dependent variable (or vice versa), the results are not reliable. One or both of the following will happen:
  You will generate an incredibly high R2 value
  The predictor in question will have a DOMINATING t-statistic
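A small simulation can make the symptom visible. Everything in the sketch below is hypothetical: the variable names (ad_spend, online, in_store, total) and the numbers are invented for illustration, with total deliberately built from online so that online is an “ingredient” of the dependent variable.

```r
set.seed(1)
n <- 200

# Hypothetical data: total is constructed as online + in_store
ad_spend <- rnorm(n, mean = 50, sd = 10)
online   <- 2 * ad_spend + rnorm(n, sd = 20)
in_store <- rnorm(n, mean = 100, sd = 5)
total    <- online + in_store

# Regress the total on one of its own ingredients plus another predictor
fit <- lm(total ~ ad_spend + online)
s   <- summary(fit)

# Expect a very high R2 and a t-statistic for online that dwarfs the other
print(s$r.squared)
t_vals <- s$coefficients[, "t value"]
print(t_vals)
```

The model looks excellent on paper, but it mostly restates how total was constructed rather than explaining it.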

What are the different selection methods and what are the differences?
  “All In”
  “Forward”
  “Backward”
  “Stepwise”

Model Parsimony = less is more. You are better off with an R2 of .75 and 3 predictors than with an R2 of .80 and 10 predictors.
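One way to try the four approaches side by side is base R's step() function, sketched below. Two assumptions to note: the mapping of “Forward”, “Backward”, and “Stepwise” onto step()'s direction argument is an interpretation rather than something the slides state, and step() adds or drops terms by AIC rather than by p-values. The mtcars data and the predictor list are illustrative.

```r
# "All In": every candidate predictor is entered at once
full <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)

# Intercept-only starting point for forward and stepwise selection
null  <- lm(mpg ~ 1, data = mtcars)
scope <- formula(full)  # largest model the search is allowed to reach

forward  <- step(null, scope = scope, direction = "forward", trace = 0)
backward <- step(full, direction = "backward", trace = 0)
stepwise <- step(null, scope = scope, direction = "both", trace = 0)

# Compare the number of predictors and R2 across the four models
models <- list(all_in = full, forward = forward,
               backward = backward, stepwise = stepwise)
comparison <- sapply(models, function(m) {
  c(predictors = length(coef(m)) - 1, r2 = summary(m)$r.squared)
})
print(comparison)
```

Reading the comparison with parsimony in mind: a model with fewer predictors and only a slightly lower R2 is usually the one to prefer.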

    library(car)  # vif() and ncvTest() come from the car package

    mod1 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = data)
    summary(mod1)

    # this will generate the conf intervals around the beta coefficients
    confint(mod1, level = 0.99)

    # this will generate the variance inflation factor values
    vif(mod1)

    # this will test for non-constant error variance (heteroscedasticity);
    # we want this to “fail” with a high p-value
    ncvTest(mod1)

    # this will execute a backward selection method; step() drops predictors
    # that do not improve AIC, which is not the same as a significance test
    test <- step(mod1, direction = "backward", trace = TRUE)
