Multiple Regression Assumptions & Diagnostics

Name: Multiple Regression Assumptions & Diagnostics
Uploaded: 2017-11-03T18:50:27+00:00
Duration: PTM28S33
Channel: Christiana Marshall
Description: Multiple Regression Assumptions & Diagnostics

Multiple Regression Assumptions & Diagnostics
Sociology 8811. Copyright © 2007 by Evan Schofer. Do not copy or distribute without permission.

Announcements None

Multiple Regression Hypothesis Tests
Hypothesis tests can be conducted independently for all slopes (b) of X variables. For X1, X2…Xk, we can test hypotheses for b1, b2…bk. Null/alternative hypotheses are the same: H0: bk = 0; H1: bk ≠ 0. Or, one-tailed tests: H1: bk > 0, H1: bk < 0. Hypothesis tests are about the slope controlling for other variables in the model. Sometimes people explicitly mention this in hypotheses. NOTE: Results with “controls” may differ from bivariate hypothesis tests!

Multiple Regression Hypothesis Tests
Formula for MV hypothesis tests: t = bk / sbk, with N – K – 1 degrees of freedom. Here bk is a slope and sbk is its standard error; k represents the kth independent variable; K = total number of independent variables. T-test degrees of freedom depend on N and the number of independent variables. Compare observed t-value to critical t; or p to α.

Multiple Regression Estimation
Calculating b’s involves solving a set of equations to minimize squared error. Analogous to bivariate, but the math is more complex. The optimal estimator has minimum variance and is referred to as “BLUE”: Best Linear Unbiased Estimator. For the estimates to be BLUE, multiple regression requires more assumptions than bivariate regression.
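To make the estimation step concrete, here is a minimal sketch in Python with NumPy (the course itself uses SPSS). It solves the least-squares problem and then forms t = b / sb for each slope. The function name `ols_fit` and the simulated data are illustrative assumptions, not part of the original lecture.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares fit of y on the columns of X (intercept added here).

    Returns coefficients, standard errors, t-values, and the residual
    degrees of freedom N - K - 1.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])        # constant + K predictors
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    df = n - k - 1
    mse = residuals @ residuals / df                 # estimated error variance
    cov = mse * np.linalg.inv(design.T @ design)     # covariance of the b's
    se = np.sqrt(np.diag(cov))
    return coefs, se, coefs / se, df

# Simulated (hypothetical) data: y = 1 + 2*X1 - 0.5*X2 + error
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200)

coefs, se, t_values, df = ols_fit(X, y)
print("b:", coefs)
print("s_b:", se)
print("t:", t_values, "df:", df)
```

Each t-value would then be compared to the critical t for `df` degrees of freedom, as described above.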

Multiple Regression Assumptions
As discussed in Knoke, p. 256. Note: Allison refers to error (e) as disturbance (U), and uses slightly different language… but the ideas are the same!
1. a. Linearity: The relationship between dependent and independent variables is linear. Just like bivariate regression. Points don’t all have to fall exactly on the line, but error (disturbance) must be random. Check scatterplots of X’s and error (residual). Watch out for non-linear trends: error is systematically negative (or positive) for certain ranges of X. There are strategies to cope with non-linearity, such as including X and X-squared to model a curved relationship.
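A minimal sketch of the linearity check and the X-squared strategy, in Python with NumPy. The data are simulated with a deliberately curved relationship; all variable names and numbers are hypothetical. The average residual in the low, middle, and high thirds of X stands in for eyeballing a residual scatterplot.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=300)
# Hypothetical curved relationship: Y rises with X, then flattens.
y = 2.0 + 3.0 * x - 0.25 * x**2 + rng.normal(scale=1.0, size=300)

def fit_residuals(design, y):
    """Return least-squares coefficients and residuals for a design matrix."""
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs, y - design @ coefs

ones = np.ones_like(x)
_, resid_linear = fit_residuals(np.column_stack([ones, x]), y)
coefs_quad, resid_quad = fit_residuals(np.column_stack([ones, x, x**2]), y)

# Mean residual within the low, middle, and high thirds of X.
# A systematic sign pattern across ranges of X signals non-linearity.
thirds = np.array_split(np.argsort(x), 3)
pattern_linear = [resid_linear[idx].mean() for idx in thirds]
pattern_quad = [resid_quad[idx].mean() for idx in thirds]

print("X only:          ", np.round(pattern_linear, 2))
print("X and X-squared: ", np.round(pattern_quad, 2))
```

With X alone, the residuals should come out systematically negative at both ends and positive in the middle; once X-squared is added, the pattern should largely disappear.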

1. b. And, the model is properly specified: No extra variables are included in the model, and no important variables are omitted. This is HARD! Correct model specification is critical. If an important variable is left out of the model, results are biased (“omitted variable bias”). Example: If we model job prestige as a function of family wealth, but do not include education, the coefficient estimate for wealth would be biased. Use theory and previous research to decide what critical variables must be included in your model.

Correct model specification (continued): Use theory and previous research to help you identify critical variables. For the final paper, it is OK if the model isn’t perfect.

2. All variables are measured without error. Unfortunately, error is common in measures: survey questions can be biased; people give erroneous responses (or lie); aggregate statistics (e.g., GDP) can be inaccurate. This assumption is often violated to some extent. We do the best we can: design surveys well, use the best available data. And, there are advanced methods for dealing with measurement error.

3. The error term (ei) has certain properties. Recall: error is a case’s deviation from the regression line. Not the same as measurement error! After you run a regression, SPSS can tell you the error value for any or all cases (called the “residual”).
3. a. Error is conditionally normal. For bivariate, we looked to see if Y was conditionally normal… Here, we look to see if error is normal. Examine “residuals” (ei) for normality at different values of X variables.

Regression Assumptions
Normality: Examine residuals at different values of X. Make histograms and check for normality. [Figure: two residual histograms, one labeled “Good” and one labeled “Not very good”.]

3. b. The error term (ei) has a mean of 0. This affects the estimate of the constant. (Not a huge problem.)
3. c. The error term (ei) is homoskedastic (has constant variance). Note: This affects standard error estimates and hypothesis tests. Look at residuals to see if they spread out with changing values of X. Or plot standardized residuals vs. standardized predicted values.

Homoskedasticity: Equal Error Variance. Examine error at different values of X. Is it roughly equal? Here, things look pretty good.

Heteroskedasticity: Unequal Error Variance. At higher values of X, error variance increases a lot. This looks pretty bad.
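As a rough numerical companion to the plots, the sketch below (Python with NumPy; simulated, hypothetical data) compares the spread of residuals in the lower and upper halves of X. It is an informal check under stated assumptions, not a formal test of heteroskedasticity.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
x = rng.uniform(1, 10, size=n)
# Constant error spread (homoskedastic) vs. spread that grows with X.
y_homo = 5 + 2 * x + rng.normal(scale=2.0, size=n)
y_hetero = 5 + 2 * x + rng.normal(scale=0.5 * x, size=n)

def residual_sd_by_half(x, y):
    """Fit y on x, then return the residual SD for low-X and high-X cases."""
    design = np.column_stack([np.ones_like(x), x])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coefs
    low = x <= np.median(x)
    return resid[low].std(ddof=1), resid[~low].std(ddof=1)

sd_low_homo, sd_high_homo = residual_sd_by_half(x, y_homo)
sd_low_het, sd_high_het = residual_sd_by_half(x, y_hetero)

print("homoskedastic:   low-X SD %.2f, high-X SD %.2f" % (sd_low_homo, sd_high_homo))
print("heteroskedastic: low-X SD %.2f, high-X SD %.2f" % (sd_low_het, sd_high_het))
```

Roughly equal spreads are what the “pretty good” picture looks like in numbers; a much larger spread at high X corresponds to the “pretty bad” picture.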

3. d. Predictors (Xi’s) are uncorrelated with error. A violation most often happens when we leave out an important variable that is correlated with another Xi. Example: Predicting job prestige with family wealth, but not including education. Omission of education will affect the error term: those with lots of education will have large positive errors. Since wealth is correlated with education, it will be correlated with that error! Result: the coefficient for family wealth will be biased.
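The prestige example can be illustrated with a small simulation (Python with NumPy). All numbers are invented for illustration: education and wealth are correlated, the true effect of wealth on prestige is set to 1.0, and the true effect of education is set to 2.0.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
education = rng.normal(size=n)
wealth = 0.7 * education + rng.normal(size=n)       # wealth correlated with education
prestige = 1.0 * wealth + 2.0 * education + rng.normal(size=n)

def slopes(y, *predictors):
    """Least-squares slopes of y on the predictors (constant dropped from output)."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[1:]

b_wealth_omitted = slopes(prestige, wealth)[0]             # education left out
b_wealth_full = slopes(prestige, wealth, education)[0]     # education included

print("wealth slope, education omitted:  %.2f" % b_wealth_omitted)
print("wealth slope, education included: %.2f" % b_wealth_full)
```

In this setup the slope for wealth should come out well above its true value of 1.0 when education is omitted, because wealth picks up part of education's effect through the error term.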

4. In systems of equations, error terms of equations are uncorrelated (Knoke, p. 256). This is not a concern for us in this class. Worry about that later!

5. Sample is independent, errors are random. Technically, part of 3.c. Not only should error variance not increase with X (heteroskedasticity), there should be no pattern at all! Things that cause patterns in error (autocorrelation): Measuring data over long periods of time (e.g., every year). Error from nearby years may be correlated. Called “serial correlation”.

More things that cause patterns in error (autocorrelation): Measuring data in families. All members are similar, and will have correlated error. Measuring data in geographic space. Example: data on 50 US states. States in a similar region have correlated error. Called “spatial autocorrelation”. There are variations of regression models to address each kind of correlated error.

Regression: Outliers
Note: Even if regression assumptions are met, slope estimates can have problems. Example: Outliers: cases with extreme values that differ greatly from the rest of your sample. More formally: “influential cases”. Outliers can result from errors in coding or data entry, or from highly unusual cases. Or, sometimes they reflect important “real” variation. Even a few outliers can dramatically change estimates of the slope, especially if N is small.

Regression: Outliers
[Figure: outlier example. Scatterplot with an extreme case that pulls the regression line up, compared with the regression line with the extreme case removed from the sample.]

Regression: Outliers
Strategy for identifying outliers:
1. Look at scatterplots or regression partial plots for extreme values. Easiest. A minimum for final projects.
2. Ask SPSS to compute outlier diagnostic statistics. Examples: “Leverage”, Cook’s D, DFBETA, residuals, standardized residuals.

Regression: Outliers
SPSS outlier strategy: Go to Regression – Save. Choose “influence” and “distance” statistics such as Cook’s Distance, DFFIT, standardized residual. Result: SPSS will create new variables with values of Cook’s D, DFFIT for each case. High values signal potential outliers. Note: This is less useful if you have a VERY large dataset, because you have to look at each case value.

Scatterplots
Example: Study time and student achievement.
X variable: Average # hours spent studying per day. Y variable: Score on reading test.
[Figure: scatterplot of reading score (Y axis) against hours of study (X axis).]
Case  X     Y
1     2.6   28
2     1.4   13
3     .65   17
4     4.1   31
5     .25   8
6     1.9   16
7     3.5   6

Outliers Results with outlier:

Outlier Diagnostics
Residuals: The numerical value of the error. Error = distance that a point falls from the line. Cases with unusually large error may be outliers. Note: residuals have many other uses!
Standardized residuals: Z-score of residuals… converts to a neutral unit. Often, standardized residuals larger than 3 in absolute value are considered worthy of scrutiny. But, it isn’t the best outlier diagnostic.

Outlier Diagnostics
Cook’s D: Identifies cases that are strongly influencing the regression line. SPSS calculates a value for each case: go to the “Save” menu, click on Cook’s D. How large of a Cook’s D is a problem? Rule of thumb: values greater than 4 / (n – k – 1). Example: N = 7, K = 1: cut-off = 4/5 = .80. Cases with higher values should be examined.

Outlier Diagnostics
Example: Outlier/Influential Case Statistics
Hours  Score  Resid   Std Resid  Cook’s D
2.60   28     9.32    1.01       .124
1.40   13     -1.97   -.215      .006
.65    17     4.33    .473       .070
4.10   31     7.70    .841       .640
.25    8      -3.43   -.374      .082
1.90   16     -.515   -.056      .0003
3.50   6      -15.4   -1.68      .941
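The Cook’s D column can be recomputed outside SPSS. The sketch below (Python with NumPy) uses the seven hours/score pairs from the table and the standard formula D = e² h / (p · MSE · (1 – h)²), where h is a case’s leverage and p is the number of estimated coefficients (here 2: constant and slope). The function name `cooks_distance` is an illustrative choice, and whether SPSS uses exactly this computation internally is an assumption.

```python
import numpy as np

# Hours studied and reading score for the seven cases in the table.
hours = np.array([2.60, 1.40, 0.65, 4.10, 0.25, 1.90, 3.50])
score = np.array([28, 13, 17, 31, 8, 16, 6], dtype=float)

def cooks_distance(x, y):
    """Cook's D for each case in a bivariate regression of y on x."""
    n = len(y)
    design = np.column_stack([np.ones(n), x])
    p = design.shape[1]                              # coefficients: constant + slope
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coefs
    mse = resid @ resid / (n - p)
    hat = design @ np.linalg.inv(design.T @ design) @ design.T
    leverage = np.diag(hat)
    return resid**2 * leverage / (p * mse * (1 - leverage) ** 2)

d = cooks_distance(hours, score)
cutoff = 4 / (len(score) - 1 - 1)                    # rule of thumb: 4 / (n - k - 1)
flagged = np.flatnonzero(d > cutoff) + 1             # 1-based case numbers

print(np.round(d, 3))
print("cut-off:", cutoff, "flagged cases:", flagged)
```

Only the last case (3.5 hours, score of 6) should exceed the .80 cut-off, which agrees with the table.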

Outliers Results with outlier removed:

Regression: Outliers
Question: What should you do if you find outliers? Drop outlier cases from the analysis? Or leave them in? Obviously, you should drop cases that are incorrectly coded or erroneous. But, generally speaking, you should be cautious about throwing out cases. If you throw out enough cases, you can produce any result that you want! So, be judicious when destroying data.

Regression: Outliers
Circumstances where it can be good to drop outlier cases:
1. Coding errors.
2. Single extreme outliers that radically change results. Your results should reflect the dataset, not one case!
3. If there is a theoretical reason to drop cases. Example: In analysis of economic activity, communist countries may be outliers. If the study is about “capitalism”, they should be dropped.

Regression: Outliers
Circumstances when it is good to keep outliers:
1. If they form a meaningful cluster. Often suggests an important subgroup in your data. Example: Asian-Americans in a dataset on education. In such a case, consider adding a dummy variable for them. Unless, of course, the research design is not interested in that sub-group… then drop them!
2. If there are many. Maybe they reflect a “real” pattern in your data.

Regression: Outliers
When in doubt: Present results both with and without outliers. Or present one set of results, but mention how results differ depending on how outliers were handled. For final projects: Check for outliers! At least with scatterplots. In the text: Mention if there were outliers, how you handled them, and the effect it had on results.

Multicollinearity
Another common regression problem: Multicollinearity. Definition: collinear = highly correlated. Multicollinearity = inclusion of highly correlated independent variables in a single regression model. Recall: High correlation of X variables causes problems for estimation of slopes (b’s). Recall: the denominators in the slope formulas approach zero, so coefficients may be wrong or too large.

Multicollinearity
Multicollinearity symptoms: Unusually large standard errors and betas, compared to models that do not include both collinear variables. Betas often exceed 1.0. Two variables have the same large effect when included separately… but when put together the effects of both variables shrink. Or, one remains positive and the other flips sign. Note: Not all “sign flips” are due to multicollinearity!

Multicollinearity
What does multicollinearity do to models? Note: It does not violate regression assumptions. But, it can mess things up anyway.
1. Multicollinearity can inflate standard error estimates. Large standard errors = small t-values = no rejected null hypotheses. Note: Only collinear variables are affected. The rest of the model results are OK.

Multicollinearity
What does multicollinearity do?
2. It leads to instability of coefficient estimates. Variable coefficients may fluctuate wildly when a collinear variable is added. These fluctuations may not be “real”, but may just reflect amplification of “noise” and “error”. One variable may only be slightly better at predicting Y… but SPSS will give it a MUCH higher coefficient. Note: These only affect variables that are highly correlated. The rest of the model is OK.

Multicollinearity
Diagnosing multicollinearity:
1. Look at correlations of all independent vars. A correlation of .7 is a concern; above .8 is often a problem. But, sometimes problems aren’t bivariate… and don’t show up in bivariate correlations. Ex: If you forget to omit one category from a set of dummy variables.
2. Watch out for the “symptoms”.
3. Compute diagnostic statistics: Tolerances, VIF (Variance Inflation Factor).

Multicollinearity
Multicollinearity diagnostic statistics: “Tolerance”. Easily computed in SPSS. Low values indicate possible multicollinearity. Start to pay attention at .4; below .2 is very likely to be a problem. Tolerance is computed for each independent variable by regressing it on the other independent variables.

Multicollinearity
If you have 3 independent variables: X1, X2, X3… Tolerance is based on doing a regression: X1 is dependent; X2 and X3 are independent. Tolerance for X1 is simply 1 minus the regression R-square. If a variable (X1) is highly correlated with all the others (X2, X3), then they will do a good job of predicting it in a regression. Result: Regression R-square will be high… 1 minus R-square will be low… indicating a problem.

Multicollinearity
Variance Inflation Factor (VIF) is the reciprocal of tolerance: 1/tolerance. High VIF indicates multicollinearity. It indicates how much the variance of a coefficient estimate grows due to the presence of the other variables (the standard error grows by the square root of VIF).
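Tolerance and VIF follow directly from the description above: regress each independent variable on the others, take 1 minus R-square, then take the reciprocal. A minimal sketch in Python with NumPy, using simulated data in which X2 is nearly a copy of X1 (the function name and the data are hypothetical):

```python
import numpy as np

def tolerance_and_vif(X):
    """Tolerance (1 - R-square) and VIF (1 / tolerance) for each column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    tolerances = np.empty(k)
    for j in range(k):
        target = X[:, j]                             # treat Xj as the dependent variable
        others = np.delete(X, j, axis=1)             # all remaining X's are predictors
        design = np.column_stack([np.ones(n), others])
        coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coefs
        r_squared = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        tolerances[j] = 1 - r_squared
    return tolerances, 1 / tolerances

rng = np.random.default_rng(4)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.3, size=500)            # highly correlated with x1
x3 = rng.normal(size=500)                            # unrelated to the others

tol, vif = tolerance_and_vif(np.column_stack([x1, x2, x3]))
print("tolerance:", np.round(tol, 3))
print("VIF:      ", np.round(vif, 2))
```

X1 and X2 should show tolerances below the .2 guideline, while X3 should sit close to 1.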

Multicollinearity
Solutions to multicollinearity: It can be difficult if a fully specified model requires several collinear variables.
1. Drop unnecessary variables.
2. If two collinear variables are really measuring the same thing, drop one or make an index. Example: Attitudes toward recycling; attitudes toward pollution. Perhaps they reflect “environmental views”.
3. Advanced techniques: e.g., Ridge regression. Uses a more efficient estimator (but not BLUE – may introduce bias).

Models and “Causality”
Issue: People often use statistics to support theories or claims regarding causality. They hope to “explain” some phenomena: what factors make kids drop out of school; whether or not discrimination leads to wage differences; what factors make corporations earn higher profits. Statistics provide information about association. Always remember: Association (e.g., correlation) is not causation! The old aphorism is absolutely right. Association can always be spurious.

How do we determine causality? The randomized experiment is held up as the ideal way to determine causality. Example: Does drug X cure cancer? We could look for association between receiving drug X and cancer survival in a sample of people. But: Association does not demonstrate causation; the effect could be spurious. Example: Perhaps rich people have better access to drug X, and rich people have more skilled doctors! Can you think of other possible spurious processes?

In a randomized experiment, people are assigned randomly to take drug X (or not). Thus, taking drug X is totally random and totally uncorrelated with any other factor (such as wealth, gender, access to high quality doctors, etc.). As a result, the association between drug X and cancer survival cannot be affected by any spurious factor. Nor can “reverse causality” be a problem. SO: We can make strong inferences about causality!
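A small simulation can illustrate why random assignment removes the spurious association. In this sketch (Python with NumPy) everything is hypothetical: the drug is given NO true effect, a continuous “survival score” depends only on wealth, and wealthier people are more likely to take the drug in the observational scenario.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
wealth = rng.normal(size=n)
# Hypothetical world: the drug does nothing; survival depends on wealth only.
survival = 1.0 * wealth + rng.normal(size=n)

# Observational data: wealthier people are more likely to get the drug.
took_observational = (wealth + rng.normal(size=n)) > 0
# Randomized experiment: a coin flip decides who gets the drug.
took_randomized = rng.random(n) < 0.5

def mean_difference(outcome, treated):
    """Average outcome of the treated minus average outcome of the untreated."""
    return outcome[treated].mean() - outcome[~treated].mean()

gap_observational = mean_difference(survival, took_observational)
gap_randomized = mean_difference(survival, took_randomized)

print("observational gap: %.2f" % gap_observational)
print("randomized gap:    %.2f" % gap_randomized)
```

The observational comparison should show a sizeable gap even though the drug does nothing, while the randomized comparison should be close to zero.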

Unfortunately, randomized experiments are impractical (or unethical) in many cases. Example: Consequences of high-school dropout, national democracy, or impact of homelessness. Plan B: Try to “control” for spurious effects.
Option 1: Create homogeneous sub-groups. Effects of Drug X: If there is a spurious relationship with wealth, compare people with comparable wealth. Ex: Look at the effect of drug X on cancer survival among people of constant wealth… eliminating the spurious effect.

Option 2: Use a multivariate model to “control” for spurious effects. Examine the effect of the key variable “net” of other relationships. Ex: Look at the effect of Drug X, while also including a variable for wealth. Result: Coefficients for Drug X represent the effect net of wealth, avoiding spuriousness.

Limitations of “controls” to address spuriousness:
1. The “homogeneous sub-groups” approach reduces N. To control for many possible spurious effects, you’ll throw away lots of data.
2. You have to control for all possible spurious effects. If you overlook any important variable, your results could be biased… leading to incorrect conclusions about causality. First: It is hard to measure and control for everything. Second: Someone can always think up another thing you should have controlled for, undermining your causal claims.

Under what conditions can a multivariate model support statements about causality? In theory: A multivariate model can support claims about causality… IF: The sample is unbiased. The measurement is accurate. The model includes controls for every major possible spurious effect. The possibility of reverse causality can be ruled out. And, the model is executed well: assumptions, outliers, multicollinearity, etc. are all OK.

In Practice: Scholars commonly make tentative assertions about causality… IF: The data set is of high quality; the sample is either random or arguably not seriously biased. Measures are high quality by the standards of the literature. The model includes controls for major possible spurious effects discussed in the prior literature. The possibility of reverse causality is arguably unlikely. And, the model is executed well: assumptions, outliers, multicollinearity, etc. are all acceptable… (OR, the author uses variants of regression necessary to address problems).

In sum: Multivariate analysis is not the ideal tool to determine causality. If you can run an experiment, do it. But: Multivariate models are usually the best tool that we have! Advice: Multivariate models are a terrific way to explore your data. Don’t forget: “correlation is not causation”. The models aren’t magic; they simply sort out correlation. But, if used thoughtfully, they can provide hints into likely causal processes!
