April 4, 2006 · Lecture 11 · Slide #1
Diagnostics for Multivariate Regression Analysis
- Homeworks
- Assumptions of OLS Revisited
- Visual Diagnostic Techniques
- Non-linearity
- Non-normality and Heteroscedasticity
- Outliers and Case Statistics
April 4, 2006 · Lecture 11 · Slide #2
Homework -- Set 1
- Test for the additional explanatory power of attention to the temperature-change issue (c_4_31) when modeling the certainty of temperature change (c_4_32).
- Use the more complex model from the prior lecture exercise (shown on Slide #5) as your base of comparison.
- Discuss the theoretical meaning of your results.
April 4, 2006 · Lecture 11 · Slide #3
F-Testing a Nested Model
Simpler model:
  Temp Change Scale = b0 + b1(c4_1_ide) + b2(c4_3_env) + b3(c5_3_age) + b4(c4_7_un_) + b5(c4_15_co) + b6(c4_25_io)
More complex model:
  Temp Change Scale = b0 + b1(c4_1_ide) + b2(c4_3_env) + b3(c5_3_age) + b4(c4_7_un_) + b5(c4_15_co) + b6(c4_25_io) + b7(c4_32_ct)
Given the models, K = 8, H = 1, and n = 2054. Calculating the RSS's involves running the two models and obtaining the RSS from each. For these models, RSS_K = 4397.96 and RSS_(K-H) = 4547.82. So:

  F = [(RSS_(K-H) - RSS_K) / H] / [RSS_K / (n - K)]
    = [(4547.82 - 4397.96) / 1] / [4397.96 / 2046]
    ≈ 69.7

Given that df1 = H (1) and df2 = n - K (2046), the p-value of the model improvement shown in Table A4.2 (pp. 351-353) is < 0.001. So what?
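The same nested test can be run directly in Stata. A minimal sketch, assuming the Temp Change Scale is stored as c4_31_tc (the dependent variable in the regress command on Slide #17):

  * fit the more complex model, then test the one added coefficient
  regress c4_31_tc c4_1_ide c4_3_env c5_3_age c4_7_un_ c4_15_co c4_25_io c4_32_ct
  test c4_32_ct
  * with a single restriction, the F(1, 2046) reported by -test- matches
  * the hand-computed nested-model F above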
April 4, 2006 · Lecture 11 · Slide #4
Homework -- Set 2
Write a 1-page paper in which you answer the following question:
- Do male respondents have the same relationship between education (c5_1a) and income (c5_5a) as do female respondents? Control for other relevant X's.
Be sure to:
- Specify your hypotheses
- Properly recode your variables
- Interpret your statistical tests
April 4, 2006 · Lecture 11 · Slide #5
Predicting Income with Slope and Intercept Dummies for Male Respondents
[regression results shown on the slide]
So education appears to have the same weight for men and women scientists.
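The specification behind this slide can be sketched as follows. The 0/1 male indicator, its name, and its coding are assumptions; the survey's gender variable appears as c5_4_gen in the dfbeta plots on Slide #20:

  * a minimal sketch of an intercept- and slope-dummy model (coding assumed)
  gen male = (c5_4_gen == 1) if !missing(c5_4_gen)   // hypothetical: 1 = male
  gen male_educ = male * c5_1a                       // slope dummy for education
  regress c5_5a c5_1a male male_educ
  * an insignificant coefficient on male_educ implies the education slope
  * is the same for men and women; male alone shifts only the intercept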
April 4, 2006 · Lecture 11 · Slide #6
But Wait... Diagnostics:
- VIF
- Tests for linearity and homoscedasticity also raise red flags
- Multicollinearity swamps the model, especially given the small proportion of women scientists
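A minimal sketch of the VIF check, assuming the male and male_educ variables from the sketch under Slide #5:

  * variance inflation factors after the regression; interaction terms are
  * often highly collinear with their components, which "swamps" the model
  regress c5_5a c5_1a male male_educ
  vif     // values above roughly 10 are a common red flag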
April 4, 2006 · Lecture 11 · Slide #7
Critical OLS Assumptions
1: Fixed X's
- The effect of X is constant. Differences in the Y_i's, given X, are due to variations in "error".
2: Errors cancel out: E(e_i) = 0
- Assumptions 1 and 2 assure the independence of the errors and the X's, and result in unbiased estimation of the ß's.
- "Unbiased" means that, in the long run, sample estimates will center on the true population parameters: E(b_k) = ß_k
- Efficiency: among unbiased estimators, the OLS b's have the smallest sampling variance
April 4, 2006 · Lecture 11 · Slide #8
More Critical OLS Assumptions
3: Errors have constant variance (homoscedasticity): Var(e_i) = σ² for all i
4: Errors are uncorrelated with each other (no autocorrelation): Cov(e_i, e_j) = 0 for i ≠ j
Implications of assumptions 1-4:
- Standard errors of the estimate are unbiased
- OLS is more efficient than any other linear unbiased estimator (Gauss-Markov Theorem)
It is a matter of the degree to which 3 and 4 hold:
- As the n-size increases, the stringency of assumptions 3 and 4 decreases (law of large numbers)
April 4, 2006 · Lecture 11 · Slide #9
Last Critical OLS Assumption
5: Errors are normally distributed
- Justifies use of the t and F distributions
- Necessary for hypothesis tests and confidence intervals
- Makes OLS more efficient than any other unbiased estimator
April 4, 2006 · Lecture 11 · Slide #10
Assumptions of "Correct" Model Specification
- Y is a linear function of the modeled X variables
- No X's are omitted that affect E[Y] and that are correlated with included X's
- All X's in the model affect E[Y]
April 4, 2006 · Lecture 11 · Slide #11
Summary of Assumption Failures and their Implications

  Problem               Biased b   Biased SE   Invalid t/F   High Variance
  Non-linear            Yes        Yes         Yes           ---
  Omit relevant X       Yes        Yes         Yes           ---
  Irrelevant X          No         No          No            Yes
  X measurement error   Yes        Yes         Yes           ---
  Heteroscedasticity    No         Yes         Yes           Yes
  Autocorrelation       No         Yes         Yes           Yes
  X correlated w/ error Yes        Yes         Yes           ---
  Non-normal errors     No         No          Yes           Yes
  Multicollinearity     No         No          No            Yes
April 4, 2006 · Lecture 11 · Slide #12
How do we know if the assumptions have been met?
Our data permit empirical tests for some assumptions, but not all.
- We can check for:
  - Linearity
  - Whether an X should be included
  - Homoscedasticity
  - Autocorrelation
  - Normality
- We can't check for:
  - Correlation between error and X's
  - Mean error equals zero
  - All relevant X's included
April 4, 2006 · Lecture 11 · Slide #13
So what do we do? Univariate analysis
- Examine Y and the X's: look for skew and other possible problems with the distributions
- Can identify possible outliers and the adequacy of variance
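A minimal sketch of the univariate pass in Stata, using variables from the models above:

  * summary statistics and plots for a first look at each variable
  summarize c5_5a c5_1a c5_3_age, detail   // skewness, percentiles, extremes
  histogram c5_5a, normal                  // overlay a normal curve to judge skew
  graph box c5_5a                          // box plot flags possible outliers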
April 4, 2006 · Lecture 11 · Slide #14
Bi-Variate Scatterplots
- Detect non-linearities (curvilinearity)
- Detect heteroscedasticity (non-constant variance)
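A minimal sketch, pairing the scatter with a lowess curve so curvature stands out:

  * bivariate scatterplot with a smoothed fit overlaid
  twoway (scatter c5_5a c5_1a) (lowess c5_5a c5_1a)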
April 4, 2006 · Lecture 11 · Slide #15
Residual vs. Predicted-Y Plots
Checks for:
- Curvilinearity (are curves apparent?)
- Heteroscedasticity (fan shapes? Also |e| plots)
- Non-normality (is the density appropriate?)
- Outliers (are singles or clusters evident?)
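A minimal sketch, assuming the full temperature-change model; rvfplot is Stata's built-in residual-versus-fitted plot, and the |e| plot is built by hand (the variable names yhat, ehat, and abs_e are assumed):

  regress c4_31_tc c4_1_ide c4_3_env c5_3_age c4_7_un_ c4_15_co c4_25_io c4_32_ct
  rvfplot, yline(0)       // fan shapes suggest heteroscedasticity
  predict yhat            // fitted values
  predict ehat, resid     // residuals
  gen abs_e = abs(ehat)
  scatter abs_e yhat      // the |e| plot mentioned above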
April 4, 2006 · Lecture 11 · Slide #16
Non-Linearity
- One of the signal failings of OLS
- Run an "ovtest"
  - Use the "rhs" option to use powers of the right-hand-side variables
- If non-linear relationships are suspected:
  - Look at the bivariate plots
  - Use "acprplot" (an augmented component-plus-residual plot) for each of the independent variables:
    acprplot c5_3_age, mspline msopts(bands(7))
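A minimal sketch of the ovtest step, using the older command syntax these slides use (newer Stata versions call it estat ovtest):

  * RESET-type specification test after the regression
  ovtest          // tests powers of the fitted values
  ovtest, rhs     // tests powers of the right-hand-side variables instead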
April 4, 2006 · Lecture 11 · Slide #17
Normality of Errors
This is a critical assumption for OLS because it is required for:
- Hypothesis tests and confidence interval estimation
  - Particularly sensitive with small samples
- Efficiency
  - Non-normality will increase sample-to-sample variation
Diagnostics:
- Plot the residuals
- Run the "hettest" (checks for heteroscedasticity)
  - Age, environmental status, and (especially) certainty all appear to produce non-constant variance
Then what?
- Use robust standard errors, which correct the SE's for heteroscedasticity:
  regress c4_31_tc c4_1_ide c4_3_env c5_3_age c4_7_un_ c4_14_co c4_25_io c4_32_ct, beta robust
- Transform the offending variables
  - Logs are common with badly skewed dependent variables
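A minimal sketch of the residual checks after the regression above ("res" is an assumed name for the residual variable):

  predict res, resid
  qnorm res      // quantile-normal plot: departures from the line signal non-normality
  hettest        // Breusch-Pagan test (newer Stata: estat hettest)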
April 4, 2006 · Lecture 11 · Slide #18
Influence Analysis
- Does any particular case substantially change the regression results?
- Can sometimes be spotted visually, but not always
- We use:
  DFBETA_ik = (b_k - b_k(i)) / SE(b_k(i))
  where b_k(i) is the estimate of b_k with case i removed
- Asks: by how many standard errors does b_k change when case i is removed?
- Measures the influence of case i on the k-th estimated coefficient
  - If DFBETA > 0, then case i pulls b_k up
  - If DFBETA < 0, then case i pulls b_k down
April 4, 2006 · Lecture 11 · Slide #19
Criteria for Influence Analyses
- External criterion (size-adjusted cut-off):
  - If |DFBETA_ik| > 2/√n, then consider deleting case i
  - Gets roughly the top 5% of influential cases, given the sample size
- Internal criterion:
  - If a box-plot defines a case as a "severe outlier", consider deleting that case
- Caution: Evaluate theory
  - Consider possible modeling approaches (dummies?)
  - Throwing away data is a last resort!
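As a worked example with the survey used above (n = 2054, from Slide #3):

  2/√2054 ≈ 2/45.3 ≈ 0.044

so cases with |DFBETA_ik| above roughly 0.044 merit a closer look.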
April 4, 2006 · Lecture 11 · Slide #20
Obtaining DFBETAs in Stata
- After running the regression, type "dfbeta"
- Saves the DFBETA for each parameter (including intercept), for each case
- Permits scatterplots, box plots, etc.:
  graph box DFc5_3_age DFc5_1a DFc5_4_gen, legend(cols(3))
- Sorting by size identifies large-influence cases
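A minimal sketch of the workflow; the income model is assumed, and the DF<varname> naming follows this slide (newer Stata versions generate _dfbeta_# instead):

  regress c5_5a c5_1a c5_3_age c5_4_gen
  scalar cutoff = 2/sqrt(e(N))     // size-adjusted cut-off from Slide #19
  dfbeta
  * flag cases beyond the cut-off on the age coefficient
  gen big_age = abs(DFc5_3_age) > cutoff if !missing(DFc5_3_age)
  tabulate big_age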
April 4, 2006 · Lecture 11 · Slide #21
DFBETAs for the first three X's
[box plots shown on the slide]
graph box c4_1_ide c4_3_env, legend(cols(2))
April 4, 2006 · Lecture 11 · Slide #22
Checking Outliers
- Sort the DFBETAs identified as having outliers
  - Use the education case (c5_1a)
  - Then list the highest and lowest values, with case id's
- Note that there appears to be no obvious pattern
  - This suggests keeping the outliers in the model
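A minimal sketch; "caseid" is a hypothetical name for the respondent identifier, and DFc5_1a follows the naming on Slide #20:

  sort DFc5_1a
  list caseid DFc5_1a in 1/10     // most negative DFBETAs
  gsort -DFc5_1a
  list caseid DFc5_1a in 1/10     // most positive DFBETAs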
April 4, 2006 · Lecture 11 · Slide #23
Homework
- Run diagnostics on a full model of predicted temperature changes
- Evaluate all diagnosable assumptions, including:
  - Multicollinearity
  - Non-linearity
  - Normality of errors
  - Influence
  - Heteroscedasticity
- Prepare a brief (1-page) summary
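A minimal sketch pulling the lecture's diagnostics into one pass, using the model and older command syntax from these slides ("r" is an assumed residual name):

  regress c4_31_tc c4_1_ide c4_3_env c5_3_age c4_7_un_ c4_15_co c4_25_io c4_32_ct
  vif                   // multicollinearity
  ovtest, rhs           // non-linearity / specification
  hettest               // heteroscedasticity
  predict r, resid
  qnorm r               // normality of errors
  rvfplot, yline(0)     // visual check of residuals vs. fitted values
  dfbeta                // influence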
April 4, 2006 · Lecture 11 · Slide #24
Take a Break...