General Structural Equations (LISREL) Week 3 #5 Missing Data
Missing Data
Old approaches: LISTWISE DELETION
- Default in PRELIS when forming a covariance matrix
- Not the default in AMOS: must “manually” delete cases listwise if this is desired
Missing Data
LISTWISE DELETION
Not the default in AMOS: must “manually” delete cases listwise if this is desired.
SPSS:
    COMPUTE NMISS=0.
    RECODE V1 TO V99 (missing,sysmis=-999).
    DO REPEAT XX = V1 TO V99.
    IF (XX EQ -999) NMISS=NMISS+1.
    END REPEAT.
    SELECT IF (NMISS EQ 0).
    SAVE OUTFILE = …. [specifications]
Missing Data
Why would we want to use LISTWISE DELETION with AMOS when it offers “high tech” solutions (FIML estimation) as its default?
- Better model diagnostics are available when FIML is not invoked
What are the major issues with listwise deletion?
- Inefficient (loses cases)
- Biased if the pattern of “missingness” is not MAR (MAR: the probability that Y is missing is unrelated to Y after controlling for X) [under certain assumptions, could be consistent]
Missing Data
Pairwise deletion
- Some debate on whether it is appropriate at all
- Better when correlations are low; worse when correlations are high
- Problem of determining the appropriate N (“minimum pairwise”?)
- Can produce non-positive-definite matrices
Missing Data
Pairwise deletion
- An option in PRELIS (check box)
- Most stats packages will produce pairwise-deleted covariances:
    SPSS: /MISSING=PAIRWISE
    SAS: default in PROC CORR [PROC CORR COV; VAR ….{var list}]
- For AMOS, pass a “matrix file” from SPSS instead of a regular SPSS data file
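The point that pairwise deletion can yield a non-positive-definite matrix can be illustrated with a minimal Python sketch. The data and the helper `pairwise_cov` are hypothetical and built only for illustration: each pair of variables is observed on a different block of cases, so the three pairwise covariances are mutually inconsistent.

```python
import numpy as np

def pairwise_cov(data):
    """Covariance matrix in which each entry uses every case observed on both variables."""
    n_vars = data.shape[1]
    cov = np.empty((n_vars, n_vars))
    for i in range(n_vars):
        for j in range(n_vars):
            both = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
            xi, xj = data[both, i], data[both, j]
            cov[i, j] = np.sum((xi - xi.mean()) * (xj - xj.mean())) / (both.sum() - 1)
    return cov

nan = np.nan
v = [1.0, 2.0, 3.0, 4.0]
# Hypothetical data: each pair of variables is observed on a different block of cases
block_a = [[x, x, nan] for x in v]    # X1 = X2, X3 missing
block_b = [[x, nan, x] for x in v]    # X1 = X3, X2 missing
block_c = [[nan, x, -x] for x in v]   # X2 = -X3, X1 missing
data = np.array(block_a + block_b + block_c)

cov = pairwise_cov(data)
eigenvalues = np.linalg.eigvalsh(cov)
print(cov.round(3))
print("smallest eigenvalue:", eigenvalues.min().round(3))  # negative: not positive definite
```

Real survey data are rarely this extreme, but the mechanism is the same: no single complete dataset could have produced all of the pairwise entries at once.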
Missing Data
Two terrible approaches:
Mean substitution (an option in some SPSS procedures; easy to implement manually in all stats packages)
- Deflates variances
- Deflates covariances (each imputed value is a constant a, and cov(a, X) = 0)
- Converts a normal distribution into a distribution with a “spike” at the mean
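A small numerical sketch of the deflation, using made-up values (the arrays `x` and `y` are hypothetical). The imputed cases add nothing to the sums of squares and cross-products, while the divisor still grows with the number of cases.

```python
import numpy as np

# Hypothetical data: x has three missing values, y is fully observed
x = np.array([2.0, 4.0, np.nan, 8.0, np.nan, 12.0, 14.0, np.nan, 18.0, 20.0])
y = np.arange(1.0, 11.0)

observed = ~np.isnan(x)
x_imputed = np.where(observed, x, x[observed].mean())   # mean substitution

var_observed = x[observed].var(ddof=1)                  # variance from observed cases only
var_imputed = x_imputed.var(ddof=1)                     # variance after mean substitution
cov_observed = np.cov(x[observed], y[observed])[0, 1]
cov_imputed = np.cov(x_imputed, y)[0, 1]

print("variance:   observed", var_observed, "after imputation", var_imputed)
print("covariance: observed", cov_observed, "after imputation", cov_imputed)
```

With 7 observed cases out of 10, the variance after imputation is (7 - 1)/(10 - 1) of the observed-case variance.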
Missing Data
Two terrible approaches:
Regression prediction

    X1    X2    X3
     .     3     5      X1-hat = b0 + b1 X2 + b2 X3
     2     8     9
     6    12    89
    etc.

Problem: for the cases with imputed values, the R-square of X1 with X2, X3 is perfect (no error term), which inflates covariances.
    VAR(X1) = b1^2 VAR(X2) + VAR(e)    [VAR(e) is omitted from the imputed values]
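A minimal sketch of the “perfect R-square” problem, simplified to a single predictor (X2 only) and using hypothetical values. Among the imputed cases the correlation of X1 with X2 is exactly 1, because the imputed values lie on the regression line.

```python
import numpy as np

# Hypothetical data; X1 is treated as missing for four of the ten cases
x2 = np.arange(1.0, 11.0)
x1 = np.array([2.3, 3.1, 6.8, 7.4, 10.9, 11.2, 14.6, 15.3, 18.8, 19.1])
missing = np.zeros(10, dtype=bool)
missing[[0, 3, 6, 9]] = True

# Fit X1 = b0 + b1 X2 on the cases with X1 observed
b1, b0 = np.polyfit(x2[~missing], x1[~missing], 1)

x1_det = x1.copy()
x1_det[missing] = b0 + b1 * x2[missing]     # prediction only, no error term

r_imputed = np.corrcoef(x1_det[missing], x2[missing])[0, 1]
r_observed = np.corrcoef(x1[~missing], x2[~missing])[0, 1]
print("correlation among imputed cases: ", r_imputed)
print("correlation among observed cases:", r_observed)
```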
Missing Data
Regression prediction:
- Not quite so bad if the prediction comes from “outside the model” (but then one must argue that the “predictors” are irrelevant to the model)
ANOTHER APPROACH: REWEIGHTING
- Providing a series of weights so that the distribution more closely represents the full sample
EM Algorithm
Expectation/maximization

    Case    X1    X2    X3    X4
    1        .    18    22    15
    2        .    23    19     8
    3        .    12    12     .
    4       12     .    16     4
    5       23     1     2     4
    6       38    16    12     5
    7        .    22     .     5
We want: a variance-covariance matrix that is based on the “complete” dataset
    Σ (cov)    z (means)
E-STEP:
- Get start values for Σ, z (listwise or pairwise OK)
- Compute regression coefficients for all needed subsets:
    Cases 1 & 2:  X1 = b0 + b1 X2 + b2 X3 + b3 X4
        Base the calculation of these coefficients on all cases for which data on X1, X2, X3 and X4 are available
    Case 3:       X1 = b0 + b1 X2 + b2 X3
        Base the calculation of these coefficients on all cases for which data on X1, X2 and X3 are available
        Also:     X4 = b0 + b1 X2 + b2 X3
    Case 7:       X1 = b0 + b1 X2 + b2 X4
Imputed cases (* = hat, predicted):

    X1     X2    X3    X4
    X1*    18    22    15
    X1*    23    19     8

M-STEP: re-calculate means, covariances
- Means: usual formula
- Variances: add in the residual variance
      VAR(X1) = b^2 VAR(X2) + VAR(e1)
                              ^^^^^^^ add in
- Use the new z, Σ to re-calculate the imputations
Continue E/M steps until convergence
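The E and M steps above can be sketched in Python as follows. This is a textbook-style sketch for multivariate normal data, not the PRELIS implementation. The function name `em_mvn`, the start values (available-case means and variances), the convergence rule, and the simulated data are all assumptions made for illustration. Every case is assumed to have at least one observed value.

```python
import numpy as np

def em_mvn(data, max_iter=200, tol=1e-5):
    """EM estimates of the mean vector and covariance matrix (np.nan = missing).
    Assumes every case has at least one observed value."""
    x = np.asarray(data, dtype=float)
    n, p = x.shape
    miss = np.isnan(x)
    mu = np.nanmean(x, axis=0)                 # start values: available-case means
    sigma = np.diag(np.nanvar(x, axis=0))      # and variances (zero covariances)
    for _ in range(max_iter):
        filled = x.copy()
        extra = np.zeros((p, p))               # summed residual (co)variances
        # E-step: regress each case's missing variables on its observed ones
        for i in range(n):
            m, o = miss[i], ~miss[i]
            if not m.any():
                continue
            b = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
            filled[i, m] = mu[m] + b @ (x[i, o] - mu[o])
            extra[np.ix_(m, m)] += sigma[np.ix_(m, m)] - b @ sigma[np.ix_(o, m)]
        # M-step: usual formula for means; residual variance added to covariances
        new_mu = filled.mean(axis=0)
        dev = filled - new_mu
        new_sigma = (dev.T @ dev + extra) / n
        change = max(np.abs(new_mu - mu).max(), np.abs(new_sigma - sigma).max())
        mu, sigma = new_mu, new_sigma
        if change < tol:
            break
    return mu, sigma

# Simulated example (hypothetical): 20% of X1 missing completely at random
rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.6, 0.3], [0.6, 1.0, 0.5], [0.3, 0.5, 1.0]])
sample = rng.multivariate_normal(np.zeros(3), true_cov, size=2000)
sample[rng.random(2000) < 0.2, 0] = np.nan
mu_hat, sigma_hat = em_mvn(sample)
print(mu_hat.round(3))
print(sigma_hat.round(3))
```

The `extra` term is the “add in” residual variance from the M-step; dropping it would reproduce the deflated variances of plain regression imputation.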
EM Algorithm
Advantages:
- Full information estimation
- Also imputes cases
- Can estimate asymptotic covariance matrix for ADF estimation
Disadvantages:
- Assumes normal distribution (could transform data first)
- Assumes continuous distribution
- When the results are input into other programs, standard errors are biased, usually downward
EM Algorithm: Implementation
PRELIS will use the EM algorithm to construct a “corrected” covariance matrix (and/or mean vector).
Syntax: EM IT=200 (200 iterations)
PRELIS SYNTAX:
    (title)
    SY='E:\Classes\ICPSR2004\Week3Examples\RelSexData\File1.PSF'
    EM CC = 0.00001 IT = 200 TC = 2
    OU MA=CM XT XM
EM Algorithm: Implementation
Interactive PRELIS: Statistics > Multiple Imputation
(check box at bottom, “All values missing”: probably better to select “delete cases”)
EM Algorithm: Implementation
Small issue:
- If you want to select variables or cases (as opposed to constructing a covariance matrix from the entire file), you cannot exit the dialogue box with the “select” commands intact (must either run or cancel).
Worse: if you put case selection commands in PRELIS syntax, these are ignored if there is imputation!
EM Algorithm: Implementation
Issue:
- Variable or case selections cannot be kept when exiting the dialogue box (must either run or cancel). [Information on whether this is an issue with version 8.7 is not presently available.]
- Moreover, case selection specifications will not work if imputation is performed.
Solution: select the variables and cases you want with the stat package first.
SPSS:
    select if (v2 eq 11).
    save outfile='e:\classes\ICPSR2004\Week3Examples\MissingData\RelSexUSA.sav'
      /keep=v9 v147 v175 v176 v304 to v310 v355 v356 sex.
EM Algorithm: Implementation
Steps in LISREL/PRELIS:
1. File > Import external data in other formats
2. Define variables. Remember to define variables as continuous unless another option is required (Variable type).
3. Statistics > Multiple imputation
4. Select output options, then specify the location for the matrices
5. Return to the multiple imputation menu, then RUN
EM Algorithm: Implementation

 -------------------------------
 EM Algoritm for missing Data:
 Number of different missing-value patterns=   67
 Convergence of EM-algorithm in    4 iterations
 -2 Ln(L) =   92782.24999
 Percentage missing values=   2.25

 Estimated Means

BEFORE CORRECTION:

 Total Sample Size = 1839

 Number of Missing Values      0     1     2     3     4     5     6     7     8
 Number of Cases            1484   264    55    12     9     4     7     3     1

 Listwise Deletion
 Total Effective Sample Size = 1484
EM Algorithm: Implementation

Estimated Covariances (EM estimation)

               V9       V147       V175       V176       V304
    V9       0.8074
    V147     1.3026     6.5725
    V175     0.3063     0.7525     0.4854
    V176    -1.6002    -3.4273    -1.0734     6.8036
    V304     0.3912     1.0714     0.2624    -1.3650     2.9791

Covariance Matrix (no correction)

               V9       V147       V175       V176       V304       V305
            --------   --------   --------   --------   --------   --------
    V9        0.813
    V147      1.338      6.494
    V175      0.310      0.755      0.476
    V176     -1.626     -3.474     -1.088      6.721
    V304      0.400      1.056      0.286     -1.496      2.895
    V305      0.461      0.980      0.270     -1.436      1.336      3.509
Comparison

                ETA 1      ETA 2
              --------   --------
    v9          1.000       - -
    v147        2.212       - -
              (0.080)
               27.649
    v175        0.657       - -
              (0.024)
               27.581
    v176       -3.262       - -
              (0.098)
              -33.348
    v304         - -       1.000
    v305         - -       1.062
                         (0.060)
                          17.763
    v307         - -       2.335
                         (0.122)
                          19.136
    v308         - -       1.736
                         (0.094)
                          18.564

                ETA 1      ETA 2
              --------   --------
    v9          1.000       - -
    v147        2.195       - -
              (0.086)
               25.504
    v175        0.657       - -
              (0.026)
               25.353
    v176       -3.305       - -
              (0.107)
              -31.011
    v304         - -       1.000
    v305         - -       1.046
                         (0.063)
                          16.601
    v307         - -       2.185
                         (0.119)
                          18.309
    v308         - -       1.646
                         (0.092)
                          17.865
GAMMA Regular

                 v355       v356        sex
              --------   --------   --------
    ETA 1      -0.006      0.035      0.177
              (0.001)    (0.009)    (0.038)
               -4.962      3.959      4.594
    ETA 2      -0.008      0.096      0.011
              (0.002)    (0.013)    (0.051)
               -5.376      7.602      0.209

GAMMA EM

                 v355       v356        sex
              --------   --------   --------
    ETA 1      -0.006      0.034      0.175
              (0.001)    (0.009)    (0.035)
               -5.582      3.966      5.042
    ETA 2      -0.007      0.100     -0.021
              (0.001)    (0.012)    (0.043)
               -5.555      8.610     -0.477
EM algorithm: SAS implementation (PROC MI)
Form:
    PROC MI DATA=file OUT=file2;
      EM OUTEM=file3;
    PROC CALIS DATA=file3 COV MOD;
      LINEQS [regular SAS CALIS specification]
Hot deck / nearest neighbor (PRELIS ONLY)

    Case    X1    X2    X3    X4
    1        .     2     8    16
    2        .     3     1     9
    3        2     8    29    32
    4        1     5     6    13
    5        2     9     2     3
    6        6     .     1     4

1st case: look for the closest case that has no missing values on X2, X3, X4. That is the 4th case (1 5 6 13).
Impute from this case to the missing value; hence X1 for case 1 will be 1.
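The matching step for this table can be sketched in Python. This is an illustrative sketch rather than the PRELIS routine: the function name `nearest_neighbor_impute` is hypothetical, and it uses raw squared distances, whereas an actual implementation may standardize the matching variables first. Ties are ignored here (the first closest donor wins).

```python
import numpy as np

nan = np.nan
# The six cases from the slide (columns X1, X2, X3, X4)
data = np.array([
    [nan, 2, 8, 16],
    [nan, 3, 1, 9],
    [2, 8, 29, 32],
    [1, 5, 6, 13],
    [2, 9, 2, 3],
    [6, nan, 1, 4],
])

def nearest_neighbor_impute(data, target, match):
    """Fill missing values of column `target` from the closest fully observed donor."""
    out = data.copy()
    cols = [target] + list(match)
    donors = np.flatnonzero(~np.isnan(data[:, cols]).any(axis=1))
    for i in np.flatnonzero(np.isnan(data[:, target])):
        if np.isnan(data[i, match]).any():
            continue                      # cannot match on a missing matching variable
        # Raw squared distance on the matching variables (an assumption)
        dist = ((data[np.ix_(donors, match)] - data[i, match]) ** 2).sum(axis=1)
        out[i, target] = data[donors[dist.argmin()], target]
    return out

imputed = nearest_neighbor_impute(data, target=0, match=[1, 2, 3])
print(imputed[:2])   # cases 1 and 2 with X1 filled in from their donors
```

For case 1 the squared distances to cases 3, 4 and 5 are 733, 22 and 254, so case 4 is the donor, as on the slide.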
Nearest neighbor
- Matching variables: more accurate if all cases are non-missing
- Worst case: no non-missing match
- Special problem arises if variables have a small # of discrete values (next slide)
Variables have small # of discrete values

    X1    X2    X3    X4
     .     1     2     4
     2     2     1     5
     3     1     2     5
     1     2     2     4
     5     2     2     4
     0     1     3     4

Ties! Impute with the average value across the tied cases, BUT….
WHAT if the std. deviation of the values being averaged is not less than the overall standard deviation for X1?
Then the imputation almost reduces to imputed = mean of X1.
The “variance ratio”: in PRELIS, imputation “fails” if the variance ratio is too large (the cutoff is usually .5 or .7 and can be adjusted).
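A hedged sketch of the tie-handling rule for the table above. The exact PRELIS rule is not given in these notes, so this sketch assumes the variance ratio is the variance of X1 among the tied donors divided by the variance of X1 among all observed cases, with a cutoff of .5. The function name `impute_with_variance_ratio` and the raw squared distance are also assumptions.

```python
import numpy as np

nan = np.nan
# The six cases from the slide (columns X1, X2, X3, X4)
data = np.array([
    [nan, 1, 2, 4],
    [2, 2, 1, 5],
    [3, 1, 2, 5],
    [1, 2, 2, 4],
    [5, 2, 2, 4],
    [0, 1, 3, 4],
])

def impute_with_variance_ratio(data, row, target, match, max_ratio=0.5):
    """Return (imputed value or None, variance ratio) for one missing entry."""
    cols = [target] + list(match)
    donors = np.flatnonzero(~np.isnan(data[:, cols]).any(axis=1))
    dist = ((data[np.ix_(donors, match)] - data[row, match]) ** 2).sum(axis=1)
    tied = donors[dist == dist.min()]            # all equally close donors
    values = data[tied, target]
    if len(tied) == 1:
        return values[0], 0.0                    # single donor: copy its value
    # Assumed definition: donor variance relative to overall variance of the target
    ratio = values.var(ddof=1) / np.nanvar(data[:, target], ddof=1)
    return (values.mean() if ratio < max_ratio else None), ratio

value, ratio = impute_with_variance_ratio(data, row=0, target=0, match=[1, 2, 3])
print("imputed value:", value, "variance ratio:", ratio)
```

Under these assumptions four donors tie, their X1 values (3, 1, 5, 0) vary more than X1 does overall, and the imputation fails: averaging them would give 2.25, barely different from the overall X1 mean of 2.2.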
Nearest neighbor
Advantages:
- Get a “full” data set to work with
- May be superior for non-normal data
Disadvantages:
- Deflated standard errors (imputed values are treated as “real” by the estimating software)
- “Failure to impute” a large proportion of missing values is a common outcome with survey data
Multiple Group Approach
Allison, Soc. Methods & Res., 1987
Bollen, p. 374 (uses old LISREL matrix notation)
Multiple Group Approach Note: 13 elements of matrix have “pseudo” values - 13 df
Multiple group approach Disadvantage: - Works only with a relatively small number of missing patterns
FIML (also referred to as “direct ML”)
- Available in AMOS and in LISREL
- AMOS implementation is fairly easy to use (check off means and intercepts, input data with missing values and … voila!)
- LISREL implementation is a bit more difficult: must input raw data from PRELIS into LISREL
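What “direct ML” maximizes can be sketched as a casewise normal log-likelihood in which each case contributes only the variables it actually has. This is a generic illustration, not the AMOS or LISREL code. The function name `fiml_loglik` is hypothetical, and in an actual SEM the `mu` and `sigma` arguments would be the model-implied means and covariances, which are functions of the model parameters.

```python
import numpy as np

def fiml_loglik(data, mu, sigma):
    """Observed-data normal log-likelihood (np.nan marks a missing value)."""
    data = np.asarray(data, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    total = 0.0
    for row in data:
        o = ~np.isnan(row)                # which variables this case has
        if not o.any():
            continue                      # a case with nothing observed adds nothing
        dev = row[o] - mu[o]
        s_oo = sigma[np.ix_(o, o)]        # rows/columns for the observed variables
        _, logdet = np.linalg.slogdet(s_oo)
        total += -0.5 * (o.sum() * np.log(2 * np.pi) + logdet
                         + dev @ np.linalg.solve(s_oo, dev))
    return total

# Hypothetical two-variable example with one incomplete case
example = [[1.0, 2.0], [0.0, np.nan]]
print(fiml_loglik(example, mu=[0.0, 0.0], sigma=np.eye(2)))
```

No values are imputed: the incomplete case simply uses the 1 x 1 submatrix for the variable it has, which is why raw data (not a covariance matrix) must be supplied.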
(end)