1
PADM 692 | Data Analysis II
Session II: Linear Regression Diagnostics
March 17, 2012
University of La Verne
Soomi Lee, PhD
Copyright © by Soomi Lee. Do not copy or distribute without permission.
2
Overview
1. Recap: multiple regression
2. Assumptions of the Classical Linear Regression Model (CLRM)
3. Most common problems in regression analysis
   1) Multicollinearity
   2) Omitted variable bias
   3) Heteroskedasticity
   4) Outliers
3
Recap: Multiple Regression
Summary statistics
Eyeball the relationship between your main IV(s) and DV
– You cannot plot two IVs at once; show one IV against the DV at a time.
Interpretation
1. Individual coefficients (significance, magnitude)
   – Statistically significant?
   – How big is it?
   – Do not forget: "holding other variables constant"
2. Overall model performance
   – Adjusted R²: how much of the variation in Y does your model explain?
   – F statistic (statistical significance of the model as a whole): are your independent variables jointly important?
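The course uses SPSS, but for reference, here is a minimal sketch of the same workflow in Python with statsmodels. The file name elections.csv and the column names (vote_share, growth, inflation) are hypothetical stand-ins for the example on the next slide.

```python
# Hedged sketch: fit a multiple regression and read off the quantities above.
# The data file and column names are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("elections.csv")                # hypothetical data file
X = sm.add_constant(df[["growth", "inflation"]])  # the IVs plus an intercept
model = sm.OLS(df["vote_share"], X).fit()

print(model.params)                  # coefficient magnitudes ("holding others constant")
print(model.pvalues)                 # statistical significance of each coefficient
print(model.rsquared_adj)            # adjusted R²: share of variation in Y explained
print(model.fvalue, model.f_pvalue)  # F statistic: are the IVs jointly important?
```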
4
Presidential Election Example
DV: Incumbent president's vote share (%)
1. How many IVs?
2. Interpretation of each coefficient?
3. Overall model performance?

GROWTH       .705*** (.217)
INFLATION    -.478 (.339)
Constant     53.365*** (1.919)
N            23
Adjusted R²  .527
F            13.25***
Note: *** p<0.01; standard errors are in parentheses.
5
Abortion Rates Example
DV: abortion rate (per 1,000 women aged 15-44)
1. How many IVs?
2. Interpretation of each coefficient?
3. Overall model performance?

Religion     .0004 (.083)
Price        -.045** (.022)
Income       .002*** (.000)
Picket       -.109*** (.040)
Constant     -5.869 (9.182)
N            50
Adjusted R²  .498
F            13.162***
Note: ** p<0.05, *** p<0.01; standard errors are in parentheses.
6
British Crime Rates Example
DV: crime rate (per 1,000 people)
1. How many IVs?
2. Interpretation of each coefficient?
3. Overall model performance?

Unemployment      5.352* (2.703)
Cars              -.052 (.036)
Police            -4.204 (7.546)
Young population  7.941*** (2.176)
Constant          .309 (36.312)
N                 42
Adjusted R²       .589
F                 15.705***
Note: * p<0.1; ** p<0.05; *** p<0.01; standard errors are in parentheses.
7
Concerns
No mindless data crunching: theory should guide you.
Limitations of our dataset
– The number of observations must exceed the number of IVs.
– External validity: can we generalize?
– Internal validity: are you measuring what you want to measure?
8
Break
9
Assumptions of the Classical Linear Regression Model (CLRM)
1. Zero average of the population errors
2. Equal variance (homoskedasticity)
3. No autocorrelation
4. No correlation between X and the errors
5. No measurement errors
6. No model specification error
7. Normal distribution of the population error term
10
Break
11
Collinearity
Collinearity: two independent variables are linearly related.
Multicollinearity: more than two independent variables are involved; one IV is (approximately) a linear combination of the others.
12
High Multicollinearity
High multicollinearity: one of the IVs is highly correlated with one or more of the other IVs.
Why is it a problem?
– Redundant information: we cannot separate the effects of X1 and X2 on the DV, so we are unable to estimate their marginal effects. Why? Multicollinearity inflates the standard errors.
– It is not a signal of a problem in our model: OLS is still BLUE in the presence of multicollinearity.
13
High Multicollinearity
How to detect
1. Look at the significance of the individual independent variables and the overall model performance: adjusted R² is high, but the individual variables are not statistically significant.
2. Examine the correlations among the X variables (create a correlation matrix).
Solutions
1. Increase the sample size (not always feasible).
2. Drop one of the variables.
14
High Multicollinearity: Example
British Crime Rates Model
CARS is not statistically significant. Possible multicollinearity? CARS may be highly correlated with other variables.
What to do: create a correlation matrix.

Unemployment      5.352* (2.703)
CARS              -.052 (.036)
Police            -4.204 (7.546)
Young population  7.941*** (2.176)
Constant          .309 (36.312)
N                 42
Adjusted R²       .589
F                 15.705***
Note: * p<0.1; ** p<0.05; *** p<0.01; standard errors are in parentheses.
15
Detecting High Multicollinearity
Cut-off (rule of thumb): 0.8
CARS does not exceed a correlation coefficient of 0.8 with any other IV, but it may still be correlated with a combination of variables. We also have a high correlation coefficient between Police and Unemployment (0.810).

                  CARS    Police  Young population  Unemployment
CARS              1       -       -                 -
Police            -0.639  1       -                 -
Young population  -0.519  0.620   1                 -
Unemployment      -0.575  0.810   0.598             1
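Outside SPSS, the same correlation matrix can be produced with pandas. This is a sketch: the file name crime.csv and the column names are assumptions mirroring the British crime example.

```python
# Hedged sketch: correlation matrix of the IVs and a check against the 0.8 cut-off.
import pandas as pd

df = pd.read_csv("crime.csv")   # hypothetical data file
ivs = ["cars", "police", "young_population", "unemployment"]

corr = df[ivs].corr()
print(corr.round(3))

# Flag IV pairs whose absolute correlation exceeds the 0.8 rule of thumb
high = corr.abs().gt(0.8) & corr.abs().lt(1.0)
print(high)
```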
16
Multicollinearity Diagnostics in SPSS
Request the display of collinearity statistics:
Analyze → Regression → Linear → Statistics → Collinearity diagnostics
SPSS will show "tolerance" and "VIF" along with the regression results.
17
Multicollinearity Diagnostics in SPSS
Tolerance (between 0 and 1)
– The proportion of variance in a predictor that cannot be accounted for by the other predictors: 1 − R²ᵢ
– Low tolerance → high multicollinearity
– At 0.3 → pay attention
– Less than 0.1 → indication of multicollinearity; likely to be a problem
VIF: Variance Inflation Factor
– VIF = 1/tolerance
– High VIF → high multicollinearity
– Greater than 10 → pay attention
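Tolerance and VIF can also be computed outside SPSS. A sketch with statsmodels follows, again assuming the hypothetical crime.csv and variable names from the correlation-matrix sketch above.

```python
# Hedged sketch: tolerance and VIF for each IV.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("crime.csv")   # hypothetical data file
ivs = ["cars", "police", "young_population", "unemployment"]
X = sm.add_constant(df[ivs])    # include the intercept so the VIFs are correct

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
tolerance = 1 / vif             # tolerance = 1/VIF = 1 - R²_i
report = pd.DataFrame({"VIF": vif, "Tolerance": tolerance}).drop("const")
print(report)                   # VIF > 10 or tolerance < 0.1: likely a problem
```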
18
Example: Medicaid
We want to estimate the effect of poverty on the share of state Medicaid spending.
H0: b1 = 0; H1: b1 > 0
1. Estimate the following model:
   Medicaid = a + b1(poverty) + b2(age65) + b3(income) + e
2. Examine whether there is a problem of multicollinearity.
   1) Look at the overall model performance and the individual coefficients.
   2) Create a correlation matrix. Any high correlations?
   3) Compute tolerance and VIF. What do these values tell us about multicollinearity in this model? What do we do about it?
19
Group Work
We want to estimate the marginal effect of education on wage.
Hypothesis 1: H0: b1 = 0; H1: b1 > 0
Hypothesis 2: H0: b2 = 0; H1: b2 > 0
1. Estimate the following model:
   Wage = a + b1(edu) + b2(experience) + e
2. Examine whether there is a problem of multicollinearity.
   1) Look at the overall model performance and the individual coefficients.
   2) Create a correlation matrix. Any high correlations?
   3) Compute tolerance and VIF. What do these values tell us about multicollinearity in this model? What do we do about it?
20
Break
21
Omitted Variable Bias
Model specification error: excluding a relevant variable or including an irrelevant one.
Omitted variable bias arises from excluding a relevant variable. If you have omitted variable bias, you violate the assumption that the regressors are uncorrelated with the errors.
22
Omitted Variable Bias
Suppose our true model is:
   Y_i = α + β1·X1_i + β2·X2_i + u_i
Our misspecified model excludes X2 (because of ignorance or because X2 is unavailable):
   Y_i = c + d1·X1_i + v_i
Now the error term of the misspecified model contains the effect of X2:
   v_i = β2·X2_i + u_i
If X1 and X2 are correlated, our estimate of d1 is biased: it absorbs β2 times the slope from regressing X2 on X1.
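A small simulation (my own illustration, not from the slides) makes the bias concrete: with β1 = 2, β2 = 3, and X2 = 0.8·X1 + noise, omitting X2 pushes the estimated X1 coefficient toward 2 + 3×0.8 = 4.4.

```python
# Hedged sketch: simulate omitted variable bias.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)                  # X2 is correlated with X1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)  # true β1 = 2, β2 = 3

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
omitted = sm.OLS(y, sm.add_constant(x1)).fit()

print(full.params[1])     # ≈ 2.0: unbiased when X2 is included
print(omitted.params[1])  # ≈ 4.4: biased by β2 times the slope of X2 on X1
```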
23
Solution?
Think about possible measurement errors.
Re-specify your model: add one or more variables, or exclude possibly irrelevant variables.
See if an additional variable changes anything (significance and size of the predictors, R²).
Go back to theory.
24
Example: Medicaid
We want to estimate the effect of poverty on the share of state Medicaid spending.
H0: b1 = 0; H1: b1 > 0
1. Estimate the following models:
   Medicaid = a + b1(poverty) + b2(age65) + b3(income) + e
   Medicaid = a + c2(age65) + c3(income) + v
2. Compare the two models.
   1) Compare the adjusted R² and the F statistic.
   2) See whether there are any changes in the statistical significance or magnitude of each regression coefficient.
   3) Do you consider the second model misspecified?
25
Group Work
We want to estimate the marginal effect of education on wage.
Hypothesis 1: H0: b1 = 0; H1: b1 > 0
Hypothesis 2: H0: b2 = 0; H1: b2 > 0
1. Estimate the following models:
   Wage = a + b1(edu) + b2(experience) + b3(female) + e
   Wage = a + c2(experience) + c3(female) + v
2. Compare the two models.
   1) Compare the adjusted R² and the F statistic.
   2) See whether there are any changes in the statistical significance or magnitude of each regression coefficient.
   3) Do you consider the second model misspecified?
26
Break
27
Heteroskedasticity
If the error variance is constant throughout the regression line, the errors are homoskedastic (equal variance).
If the error variance is not constant throughout the regression line, the equal-variance assumption is violated and the errors are heteroskedastic.
28
[Figure: side-by-side scatter plots contrasting homoskedastic and heteroskedastic errors]
29
Heteroskedasticity: Causes
Measurement errors
Omitted variables
– Suppose the true model is: y = a + b1·x1 + b2·x2 + u
– The model we estimate fails to include x2: y = a + c1·x1 + v
– Then the error term will capture the effect of x2 and will be correlated with x2.
Non-linearity
– True model: y = a + b1·x1² + u
– Our model: y = a + c·x1 + v
– Then the residual will capture the non-linearity and affect the variance accordingly.
30
Heteroskedasticity: Consequences
Heteroskedasticity by itself does not cause OLS to be biased or inconsistent. OLS is still unbiased, but it is no longer the best: it no longer has the minimum variance, so you lose accuracy.
Heteroskedasticity is often a symptom of omitted variables or measurement errors, and OLS estimators will be biased and inconsistent if you have omitted variables or measurement errors.
31
Heteroskedasticity: Detection
Graphical method
– Plot the standardized residuals (on the Y axis) against the standardized predicted values (on the X axis):
  Analyze → Regression → Linear → Plots → select ZRESID (standardized residuals) as the Y variable and ZPRED (standardized predicted values) as the X variable, then click "Continue".
– If you see a pattern (a funnel shape or a curve), this indicates heteroskedasticity.
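For those working outside SPSS, here is a sketch of the same ZRESID-versus-ZPRED plot with statsmodels and matplotlib. The file name medicaid.csv and the variable names are assumptions anticipating the Medicaid example below.

```python
# Hedged sketch: standardized residuals vs. standardized predicted values.
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("medicaid.csv")                 # hypothetical data file
X = sm.add_constant(df[["poverty", "age65", "income"]])
model = sm.OLS(df["medicaid"], X).fit()

z_pred = stats.zscore(model.fittedvalues)        # ZPRED
z_resid = stats.zscore(model.resid)              # ZRESID

plt.scatter(z_pred, z_resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Standardized predicted values (ZPRED)")
plt.ylabel("Standardized residuals (ZRESID)")
plt.show()   # a funnel shape or a curve suggests heteroskedasticity
```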
32
[Figure: a satisfactory residual plot (no pattern in the residuals)]
33
[Figure: a residual plot showing non-constant variance]
34
Heteroskedasticity: Solution
Re-specify your model: you may have omitted one or more important variables.
Consider using different measurements.
35
Example: Medicaid
We want to estimate the effect of poverty on the share of state Medicaid spending.
H0: b1 = 0; H1: b1 > 0
1. Estimate the following models:
   Medicaid = a + b1(poverty) + b2(age65) + b3(income) + e
   Medicaid = a + c2(age65) + c3(income) + v
2. Create and compare residual plots for both models.
36
Group Work
We want to estimate the marginal effect of education on wage.
Hypothesis 1: H0: b1 = 0; H1: b1 > 0
Hypothesis 2: H0: b2 = 0; H1: b2 > 0
1. Estimate the following models:
   Wage = a + b1(edu) + b2(experience) + b3(female) + e
   Wage = a + c2(experience) + c3(female) + v
2. Create and compare residual plots for both models.
37
Break
38
Outliers
Outliers: cases with extreme values.
– They are influential: removing them substantially changes the coefficient estimates.
Consequence
– Estimates are biased, especially when the sample size is small.
Causes of outliers
– Errors in coding or data entry
– Highly unusual cases
– Important real variation
39
[Figure: a scatter plot in which an extreme case pulls the regression line up, shown alongside the regression line with the extreme case removed from the sample]
40
Detecting Outliers
1. Scatter plots
– Check visually (eyeball) whether there are any outliers.
41
Detecting Outliers
2. Compute Cook's D
– Identifies cases that strongly influence the regression line.
– Higher value → potential outlier
– Rule of thumb: cut-off = 4/N
– If Cook's D > 4/N → pay attention
– Example: N = 50 → cut-off: 4/50 = 0.08
42
Detecting Outliers
3. Compute DfBeta
– DfBeta: the change in a regression coefficient that results from the deletion of the ith case.
– A DfBeta value is calculated for each case for each regression coefficient.
– Higher value → potential outlier
– Rule of thumb: pay attention if |DfBeta| > cut-off = 2/sqrt(N)
– Example: N = 50 → cut-off = 2/sqrt(50) ≈ 0.28
43
Detecting Outliers
4. Compute DfFit
– DfFit: the change in the predicted value when the ith case is deleted.
– Higher value → potential outlier
– Rule of thumb: pay attention if |DfFit| > cut-off = 2×sqrt(K/N); K = number of independent variables
– Example: N = 50, K = 5 → cut-off = 2×sqrt(5/50) ≈ 0.63
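All three measures are also available outside SPSS through statsmodels' influence diagnostics. A sketch follows, applying the rule-of-thumb cut-offs above; the medicaid.csv file and variable names are assumptions.

```python
# Hedged sketch: Cook's D, DfBeta, and DfFit with rule-of-thumb cut-offs.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("medicaid.csv")            # hypothetical data file
X = sm.add_constant(df[["poverty", "age65", "income"]])
model = sm.OLS(df["medicaid"], X).fit()

infl = model.get_influence()
n, k = int(model.nobs), int(model.df_model)  # k = number of IVs

cooks_d = infl.cooks_distance[0]             # first element holds the D values
dfbetas = infl.dfbetas                       # one column per coefficient
dffits = infl.dffits[0]                      # first element holds the DfFit values

print("Cook's D >", 4 / n, ":", np.where(cooks_d > 4 / n)[0])
print("|DfBeta| >", 2 / np.sqrt(n), ":", np.argwhere(np.abs(dfbetas) > 2 / np.sqrt(n)))
print("|DfFit| >", 2 * np.sqrt(k / n), ":", np.where(np.abs(dffits) > 2 * np.sqrt(k / n))[0])
```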
44
Example
1. Scatter plots:
   Analyze → Regression → Linear → Plots → Produce all partial plots
2. Compute Cook's D, DfBeta, DfFit:
   Analyze → Regression → Linear → Save → Cook's, DfBeta(s), DfFit
   (Cook's D, DfBeta, and DfFit will be saved in your data file; go back to Data View to see them.)
45
Solution?
In the presence of outliers:
– Fit the model with and without the outliers.
– You may remove influential observations from the regression analysis. Recall: "They are influential: removing them substantially changes the coefficient estimates." You must justify why; do not destroy your data without justification.
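As a sketch of "fit the model with and without outliers", continuing the Medicaid example from the influence sketch above: drop the cases flagged by Cook's D, refit, and compare the coefficients.

```python
# Hedged sketch: refit without the high-influence cases and compare.
# Reuses df, model, cooks_d, and n from the influence sketch above.
import pandas as pd
import statsmodels.api as sm

keep = cooks_d <= 4 / n            # cases below the Cook's D cut-off
X_kept = sm.add_constant(df.loc[keep, ["poverty", "age65", "income"]])
refit = sm.OLS(df.loc[keep, "medicaid"], X_kept).fit()

comparison = pd.DataFrame({"all cases": model.params,
                           "outliers removed": refit.params})
print(comparison)   # large coefficient changes confirm the flagged cases are influential
```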
46
Example: Medicaid
We want to estimate the effect of poverty on the share of state Medicaid spending.
H0: b1 = 0; H1: b1 > 0
1. Estimate the following model:
   Medicaid = a + b1(poverty) + b2(age65) + b3(income) + e
2. Examine outliers.
   1) Create partial scatter plots and eyeball each one for outliers.
   2) Compute Cook's D, DfBeta, and DfFit. Calculate the cut-offs for each. Do we have any outliers?
47
Group Work
We want to estimate the marginal effect of education on wage.
Hypothesis 1: H0: b1 = 0; H1: b1 > 0
Hypothesis 2: H0: b2 = 0; H1: b2 > 0
1. Estimate the following model:
   Wage = a + b1(edu) + b2(experience) + b3(female) + e
2. Examine outliers.
   1) Create partial scatter plots and eyeball each one for outliers.
   2) Compute Cook's D, DfBeta, and DfFit. Calculate the cut-offs for each. Do we have any outliers?
48
Go Home.