Reg12W
G Multiple Regression
Week 12 (Wednesday)
Review of Regression Diagnostics
Influence statistics
Multicollinearity
Examples

An issue for survey data: Influence
In balanced experimental designs, main effects of variables are a function of large proportions of observations.
In survey designs, one or a few observations may completely determine the direction and statistical significance of a main effect.
Example: lack of support in miscarriage study
»We found that more than 90% of the women were supported.
»Results depended on 4 women.

Measures of Influence
Influence takes into account both leverage and discrepancy.
It is calculated by seeing the impact of dropping each observation in turn.
Diff B (DFBETA in most software): How much does each regression weight change when an observation is deleted?
The numerator of the standardized version, that is, the raw change in the weight, may itself be of practical interest, since it can affect whether an effect appears to be significant.
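
The deletion idea behind Diff B can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `dfbeta`, the use of NumPy, and the six data points are hypothetical and do not come from the lecture data.

```python
import numpy as np

def dfbeta(X, y):
    """Raw change in each regression weight when case i is deleted.

    X is an (n, k) predictor matrix without the intercept column.
    Returns an (n, k+1) array; row i is b - b_(-i), intercept first.
    """
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])            # add intercept column
    b_full = np.linalg.lstsq(Xd, y, rcond=None)[0]   # fit on all n cases
    out = np.empty((n, Xd.shape[1]))
    for i in range(n):
        keep = np.arange(n) != i                     # drop case i
        b_del = np.linalg.lstsq(Xd[keep], y[keep], rcond=None)[0]
        out[i] = b_full - b_del
    return out

# Hypothetical data: five cases exactly on y = 1 + 2x,
# plus one case with high leverage that is far off that line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 10.0])
d = dfbeta(x.reshape(-1, 1), y)

# Index of the case whose deletion moves the slope the most
most_influential = int(np.argmax(np.abs(d[:, 1])))
```

Refitting n times is the most literal reading of "dropping each observation"; statistical packages usually obtain the same quantities from a single fit.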

Global Influence
Cook’s distance can be interpreted as a global test of these differences.
It is related to an F distribution on (k+1, n-k-1) df.
It is also related to the square of another global measure, DFFITS.
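
The link between Cook's distance and squared DFFITS can be checked numerically. The sketch below uses the standard closed-form expressions based on the hat matrix; the function name `influence_measures` and the eight data points are hypothetical, chosen only for illustration.

```python
import numpy as np

def influence_measures(X, y):
    """Cook's distance and DFFITS for an OLS fit with intercept.

    X is an (n, k) predictor matrix without the intercept column.
    """
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])
    p = Xd.shape[1]                          # k + 1 regression weights
    H = Xd @ np.linalg.pinv(Xd)              # hat matrix
    h = np.diag(H)                           # leverage of each case
    e = y - H @ y                            # residuals
    s2 = e @ e / (n - p)                     # residual mean square, n-k-1 df
    # Residual mean square with case i deleted (one value per case)
    s2_del = (e @ e - e**2 / (1 - h)) / (n - p - 1)
    cooks = e**2 * h / (p * s2 * (1 - h)**2)
    dffits = e * np.sqrt(h) / (np.sqrt(s2_del) * (1 - h))
    return cooks, dffits, s2, s2_del

# Hypothetical data: roughly y = 2x, with one discrepant high-leverage case
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 12.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 15.0])
cooks, dffits, s2, s2_del = influence_measures(x.reshape(-1, 1), y)

# Cook's D rebuilt from squared DFFITS: D_i = DFFITS_i^2 * s2_del_i / ((k+1) * s2)
p = 2
rebuilt = dffits**2 * s2_del / (p * s2)
```

The two measures differ only in which error variance they use (all cases versus case i deleted) and in the division by k+1, which is why they tend to flag the same observations.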

Summary
Check regression diagnostics
»To exercise quality control of your data
»To understand the generalizability of your findings
»To explore new aspects of your data

Example of Influence
Two predictors that are highly correlated
»Neither one has particular outliers
»Jointly there is a lone point

List of Observations with Highest CooksD
The first is a point at the extreme of the correlated predictor space.
The second is the point that is isolated in the bivariate plot.
None of these values "ring the bell".
F(3,27)=.8 for CooksD

Multicollinearity
Problems associated with highly correlated predictors
»In extreme case, numerical instability
»Problem of interpretation
Indices depend on $R^2_{i|X}$, the multiple R-square of the i-th variable with the other Xs
»Variance Inflation Factor $= 1/(1 - R^2_{i|X})$
»Tolerance $= 1 - R^2_{i|X} = 1/\mathrm{VIF}$
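
As a small worked case of these two indices, consider only two predictors, where the multiple R-square of one predictor with the other reduces to their squared correlation. The function names and the scores below are hypothetical and are not data from the lecture.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / sqrt(saa * sbb)

def vif_two_predictors(x1, x2):
    """VIF and tolerance when there is exactly one other predictor.

    Assumption: with a single other X, R^2_{i|X} equals r squared.
    """
    r2 = pearson_r(x1, x2) ** 2
    tolerance = 1.0 - r2
    return 1.0 / tolerance, tolerance

# Hypothetical scores on two correlated predictors
anxiety = [1.0, 2.0, 3.0, 4.0, 5.0]
depression = [2.0, 1.0, 4.0, 3.0, 5.0]
vif, tolerance = vif_two_predictors(anxiety, depression)
# Here r = 0.8, so tolerance = 0.36 and VIF is about 2.78
```

With more than two predictors, $R^2_{i|X}$ has to come from regressing the i-th predictor on all of the others, so this shortcut no longer applies.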

Approaches to Multicollinearity
Conceptual: Rethink the set of variables.
»E.g., if measures of anxiety and depression are very highly correlated as predictors, think about whether one is simply interested in distress
Statistical:
»Principal Components analysis
»Factor analysis
»Structural equation analysis

Regression Diagnostics in Logistic Models
Discrepancy
»There will not be “outliers” due to extreme Y values in logistic regression, since Y is binary.
»Residuals are the difference between Y and the fitted P(Y=1).
»If an event occurs when it is thought to be extremely unlikely, the discrepancy will be large.
Leverage and Influence
»One can study the H matrix as before to identify influential points
»Cook’s distance has been generalized to logistic regression.
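
The discrepancy point can be illustrated with a short sketch. The coefficients below are hypothetical (not estimated from any data in these notes), and the Pearson standardization, which divides the raw residual by sqrt(p(1-p)), is an addition that the slide itself does not mention.

```python
from math import exp, sqrt

def fitted_prob(b0, b1, x):
    """Fitted P(Y=1) from a one-predictor logistic model."""
    return 1.0 / (1.0 + exp(-(b0 + b1 * x)))

def residuals(y, p):
    """Raw residual Y - P(Y=1) and its Pearson-standardized version."""
    raw = y - p
    pearson = raw / sqrt(p * (1.0 - p))
    return raw, pearson

# Hypothetical coefficients, chosen only for illustration
b0, b1 = -4.0, 1.0

p_low = fitted_prob(b0, b1, 0.0)   # event thought to be very unlikely
p_mid = fitted_prob(b0, b1, 4.0)   # event as likely as not

# The event occurs (Y = 1) in both cases
raw_low, pearson_low = residuals(1, p_low)
raw_mid, pearson_mid = residuals(1, p_mid)
```

The raw residual can never exceed 1 in absolute value, so the standardized form is what makes an unexpected event stand out: it grows without bound as the fitted probability of the observed outcome approaches zero.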