
Multiple Regression

Objectives

Explanation
The most direct interpretation of the regression variate is a determination of the relative importance of each independent variable in the prediction of the dependent measure.
Assess the nature of the relationships between the independent variables and the dependent variable.
Provide insight into the relationships among the independent variables.
(Diagram: predicted value Y' as a function of X1, X2, and X3.)

Sample Problem (Leslie Salt Property): Finding a Fair Price for the Land

Variable    Description
PRICE       Sale price in $000 per acre
COUNTY      San Mateo = 0, Santa Clara = 1
SIZE        Size of the property in acres
ELEVATION   Average elevation in feet above sea level
SEWER       Distance (in feet) to nearest sewer connection
DATE        Date of sale counting backward from current time (in months)
FLOOD       Subject to flooding by tidal action = 1; otherwise = 0
DISTANCE    Distance in miles from Leslie property
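The slides do not show how the data are loaded. A minimal setup sketch, assuming the data sit in a CSV file named leslie_salt.csv (hypothetical file name) with the columns listed above, one row per property:

leslie_salt <- read.csv("leslie_salt.csv")   # assumed file name and location
str(leslie_salt)                             # check variable types and coding
round(100 * cor(leslie_salt), 2)             # correlation matrix in percent, as on the next slides
pairs(leslie_salt)                           # scatterplot matrix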

(Data listing of PRICE, COUNTY, SIZE, ELEVATION, SEWER, DATE, FLOOD, and DISTANCE for each property.)

(Scatterplot matrix of SEWER, FLOOD, SIZE, COUNTY, DISTANCE, ELEVATION, DATE, and PRICE; the accompanying correlation matrix is shown on the next slide.)

Correlation matrix (values in %):

             PRICE   COUNTY     SIZE  ELEVATION    SEWER     DATE    FLOOD  DISTANCE
PRICE       100.00   -18.22   -23.97      35.18   -39.12    59.47   -32.31      9.33
COUNTY      -18.22   100.00   -33.94      47.52    -5.00   -36.98   -55.18    -74.22
SIZE        -23.97   -33.94   100.00     -20.95     5.34   -34.95    10.89     55.69
ELEVATION    35.18    47.52   -20.95     100.00   -35.94    -5.65   -37.31    -36.25
SEWER       -39.12    -5.00     5.34     -35.94   100.00   -15.15   -11.31    -15.87
DATE         59.47   -36.98   -34.95      -5.65   -15.15   100.00     1.54      4.44
FLOOD       -32.31   -55.18    10.89     -37.31   -11.31     1.54   100.00     42.33
DISTANCE      9.33   -74.22    55.69     -36.25   -15.87     4.44    42.33    100.00

summary(model)

Call:
lm(formula = leslie_salt[, 1] ~ leslie_salt[, 4] + leslie_salt[, 5] + leslie_salt[, 6])

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                                    e-08 ***
leslie_salt[, 4]                                      *
leslie_salt[, 5]
leslie_salt[, 6]                                    ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error:  on 27 degrees of freedom
Multiple R-squared:  , Adjusted R-squared:
F-statistic:  on 3 and 27 DF, p-value:

Assumptions

Linearity (cont.)
If the relationship is nonlinear, a higher-order term of the independent variable should be included. In that case, define a new variable by taking the square (in this case) of that independent variable and use the squared values in the regression (a sketch follows).
Use: visual inspection.
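A minimal sketch in R; the choice of ELEVATION below is purely illustrative, the deck does not fit this particular model:

# Add a squared term for a suspected nonlinear relationship
model_quad <- lm(PRICE ~ ELEVATION + I(ELEVATION^2), data = leslie_salt)
summary(model_quad)
# Visual inspection: remaining curvature in this plot would suggest further terms
plot(fitted(model_quad), residuals(model_quad))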

More troublesome is the MODERATOR effect. If an independent-dependent variable relationship is affected by another independent variable, this situation is termed a moderator effect. The most common moderator effect in multiple regression is the bilinear moderator, in which the slope of the relationship of one independent variable (X1) changes across values of the moderator variable (X2).

Example

Adding a Moderator Effect
The idea comes from observing a self-moderator effect: if a variable has a moderator effect on itself, we would assume a nonlinear (second-degree) relationship with the dependent variable. Thus, if there is a moderator effect, add X1*X2 as an independent variable to the regression equation (sketched below). We will return to this later.
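In R, a formula term x1*x2 expands to x1 + x2 + x1:x2, so the interaction is added alongside the main effects. A sketch with an illustrative pair of predictors (not a model fitted in the deck):

model_mod <- lm(PRICE ~ ELEVATION * DATE, data = leslie_salt)
summary(model_mod)   # the ELEVATION:DATE coefficient tests the bilinear moderator effect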

Assumption: Homoscedasticity Constant variance of the error terms.

Heteroscedasticity (cont.): non-constant spread of the residuals, examined against the fitted values and within individual independent variables (sketched below).
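The plots on this slide were lost in the transcript. A minimal sketch of such residual plots, with an illustrative choice of predictor:

# Residuals vs. fitted values: a fan or funnel shape suggests heteroscedasticity
plot(fitted(model), residuals(model), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# Residuals vs. a single predictor
plot(leslie_salt$ELEVATION, residuals(model), xlab = "ELEVATION", ylab = "Residuals")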

Heteroscedasticity (cont.)
Use: the Levene test. The Levene test tests the equality of variances: its null hypothesis is that the variances of the groups are the same. If the resulting p-value is below a selected significance level (usually 5%), the difference in variances is considered too great to usefully apply tests that assume equal variances.
In SPSS it is reported in the output. In R: in the «lawstat» library, use the levene.test() function.
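A sketch of the R route, assuming the model fitted earlier; the grouping choices below are illustrative, not taken from the slides:

# install.packages("lawstat")   # if not already installed
library(lawstat)

res <- residuals(model)
# Group the residuals by the FLOOD dummy and test equality of variances
levene.test(res, group = factor(leslie_salt$FLOOD))

# Or split the residuals by low/high fitted values
grp <- cut(fitted(model), breaks = 2, labels = c("low", "high"))
levene.test(res, group = grp)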

Use F test for more than 2 groups…

Assumptions
Independence of the error terms. Check how the observations are ordered (their time or spatial coordinates), since such ordering can induce correlated errors.

Independence of Error Terms
Use: Durbin-Watson. The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are uncorrelated if the Durbin-Watson statistic is approximately 2. A value close to 0 indicates strong positive autocorrelation, while a value close to 4 indicates strong negative autocorrelation.

In SPSS, Durbin-Watson is reported. In R, under the «lmtest» library, use dwtest():

dwtest(formula, order.by = NULL,
       alternative = c("greater", "two.sided", "less"),
       iterations = 15, exact = NULL, tol = 1e-10, data = list())

For our regression model:

> dwtest(model)

        Durbin-Watson test

data:  model
DW = , p-value =
alternative hypothesis: true autocorrelation is greater than 0

Assumptions Normality of the error term distribution.

qqPlot(model)
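qqPlot() comes from the «car» package, so library(car) must be loaded. Two complementary checks on the residuals (these are additions, not shown on the slides):

library(car)                       # provides qqPlot()
qqPlot(model)                      # studentized residuals against normal quantiles
shapiro.test(residuals(model))     # formal test; a small p-value suggests non-normal errors
hist(residuals(model))             # quick visual check of the error distribution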

Diagnostics
(The summary(model) output shown earlier is repeated on this slide.)

Identifying Influential Observations
Observations that lie outside the general pattern of the data set, or that strongly influence the regression results.
Types of influential observations:
1. Outliers: observations that have large residuals (based on the dependent variable).
2. Leverage points: observations that are distinct from the remaining observations based on their independent variable values.
3. Influential observations: all observations that have a disproportionate effect on the regression results.

Outliers
Typical boxplot test. In the «car» library:

> outlierTest(model)
   rstudent   unadjusted p-value   Bonferonni p

Leverage

(Plot: cooks.distance(model) against observation index.)

R-Code

library(car)   # avPlots() and influencePlot() come from the car package

# Influential observations: added-variable plots
avPlots(model)

# Cook's D plot: identify D values > 4/(n-k-1)
cutoff <- 4/((nrow(leslie_salt) - length(model$coefficients) - 2))
plot(model, which = 4, cook.levels = cutoff)

# Influence plot
influencePlot(model, id.method = "identify", main = "Influence Plot",
              sub = "Circle size is proportional to Cook's Distance")

Leverage
(Plot: Residuals vs Leverage for lm(leslie_salt[, 1] ~ leslie_salt[, 4] + leslie_salt[, 6] + leslie_salt[, 7..., showing standardized residuals and Cook's distance contours; observations 2, 9, and 4 are flagged.)

Assessing Multicollinearity
A key issue in interpreting the regression variate is the correlation among the independent variables. Our task in a regression analysis includes the following:
1. Assess the degree of multicollinearity.
2. Determine its impact on the results.
3. Apply the necessary remedies if needed.

Assess the degree of multicollinearity
The simplest and most obvious way: identify collinearity in the correlation matrix. Check for correlations > 90%.
A direct measure of multicollinearity is tolerance (1/VIF): the amount of variability of the selected independent variable not explained by the other independent variables.
Computation: take each independent variable, treat it as the dependent variable, regress it on the remaining independent variables, and compute the adjusted R-squared. Tolerance is then 1 - R-squared. For example, if the other variables explain 25% of an independent variable, then the tolerance of this variable is 75%. Tolerance should be more than 10%.

> 1/vif(model)
leslie_salt[, 4]  leslie_salt[, 6]  leslie_salt[, 7]  leslie_salt[, 8]
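A sketch of this computation by hand for one predictor, using the three-predictor model fitted earlier (ELEVATION, SEWER, DATE); column names as in the variable table:

library(car)
1/vif(model)                    # tolerances for all predictors at once

# "By hand" for ELEVATION: regress it on the other predictors in the model
aux <- lm(ELEVATION ~ SEWER + DATE, data = leslie_salt)
1 - summary(aux)$r.squared      # tolerance of ELEVATION (the slide uses adjusted R-squared: $adj.r.squared)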

Further: see the linked page for diagnostic tests with R.

Partial Correlation
A partial correlation coefficient is a way of expressing the unique relationship between the criterion and a predictor. Partial correlation represents the correlation between the criterion and a predictor after common variance with other predictors has been removed from both the criterion and the predictor of interest.

t.values <- model$coeff / sqrt(diag(vcov(model)))
partcorr <- sqrt((t.values^2) / ((t.values^2) + model$df.residual))
partcorr

leslie_salt[, 4]  leslie_salt[, 6]  leslie_salt[, 7]  leslie_salt[, 8]

Part (Semi-partial) Correlation A semipartial correlation coefficient represents the correlation between the criterion and a predictor that has been residualized with respect to all other predictors in the equation. Note that the criterion remains unaltered in the semipartial. Only the predictor is residualized. After removing variance that the predictor has in common with other predictors, the semipartial expresses the correlation between the residualized predictor and the unaltered criterion. An important advantage of the semipartial is that the denominator of the coefficient (the total variance of the criterion, Y) remains the same no matter which predictor is being examined. This makes the semipartial very interpretable. The square of the semipartial can be interpreted as the proportion of the criterion variance associated uniquely with the predictor. It is also possible to use the semipartial to fully deconstruct the variance components in a regression analysis.
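A sketch of the residualization described above, again using ELEVATION within the three-predictor model (illustrative choice):

# Residualize the predictor with respect to the other predictors
elev_res <- residuals(lm(ELEVATION ~ SEWER + DATE, data = leslie_salt))

# Semipartial correlation: unaltered criterion vs. residualized predictor
sp <- cor(leslie_salt$PRICE, elev_res)
sp     # semipartial correlation of ELEVATION
sp^2   # proportion of PRICE variance associated uniquely with ELEVATION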

Project (Step 1): Go to the web page referenced on the slide and replicate the results there using a dataset of your own. Be creative in problem formulation; the data may be imaginary. Use at least 5 independent variables.

Comparing Regression Models

Stepwise Regression
Start with the most basic model: pick your favourite independent variable, construct the model, and test it.
Recall the correlation matrix (price in logs) shown earlier: DATE has the strongest correlation with PRICE (59.47%), so it enters first.

Call:
lm(formula = leslie_salt[, 1] ~ leslie_salt[, 6])

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                                    e-13 ***
leslie_salt[, 6]                                    ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error:  on 29 degrees of freedom
Multiple R-squared:  , Adjusted R-squared:
F-statistic:  on 1 and 29 DF, p-value:

Our focus is the improvement in RSS, so we need the residual sum of squares. It is not given directly in the report above (SPSS does report it), but anova() provides it:

> anova(m1)
Analysis of Variance Table

Response: leslie_salt[, 1]
                 Df Sum Sq Mean Sq F value Pr(>F)
leslie_salt[, 6]                                 ***
Residuals
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now let's add another variable, say SEWER, and assume we have done all the testing.

Call:
lm(formula = leslie_salt[, 1] ~ leslie_salt[, 6] + leslie_salt[, 5])

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)       3.442e    e                  e-14 ***
leslie_salt[, 6]  1.643e    e                       ***
leslie_salt[, 5]        e    e                       **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.51 on 28 degrees of freedom
Multiple R-squared:  , Adjusted R-squared:
F-statistic:  on 2 and 28 DF, p-value: 2.766e-05

Analysis of Variance Table

Response: leslie_salt[, 1]
                 Df Sum Sq Mean Sq F value Pr(>F)
leslie_salt[, 6]                               e-05 ***
leslie_salt[, 5]                                     **
Residuals

How much improvement do we have?
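The formula this slide presumably shows is the standard partial F statistic for comparing a reduced model (subscript R) against a fuller model (subscript F), with RSS the residual sum of squares, k the number of predictors, and n the number of observations:

\[
F \;=\; \frac{(RSS_R - RSS_F)\,/\,(k_F - k_R)}{RSS_F\,/\,(n - k_F - 1)}
\]

Under the null hypothesis that the added variables contribute nothing, this follows an F distribution with (k_F - k_R, n - k_F - 1) degrees of freedom.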

In our case
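The numbers for this slide are missing from the transcript. A sketch of how the comparison of the two models above can be run in R (column 6 is DATE, column 5 is SEWER; use log(PRICE) if the price-in-logs version is intended):

m1 <- lm(PRICE ~ DATE, data = leslie_salt)           # the basic model
m2 <- lm(PRICE ~ DATE + SEWER, data = leslie_salt)   # add SEWER
anova(m1, m2)   # reports the partial F test for the improvement in RSS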

Back to moderator effect.

Mediation

Project (Steps 2, 3 and 4): Find the best regression equation for your project. Test moderator effects. Test mediation effects.