A new architecture for handling multiply imputed data in Stata JC Galati 1, JB Carlin 1,2, P Royston 3 1 Murdoch Childrens Research Institute (MCRI), Melbourne.

1 A new architecture for handling multiply imputed data in Stata
JC Galati (1), JB Carlin (1,2), P Royston (3)
(1) Murdoch Childrens Research Institute (MCRI), Melbourne
(2) The University of Melbourne
(3) MRC Clinical Trials Unit, London

2 Missing data
Why do we need additional tools for analysing datasets with missing values?
- Traditional methods work with complete datasets
- Statistical packages discard incomplete observations when analysing an incomplete dataset, i.e. a complete-case analysis is performed
- This can lead to loss of power, and possibly to biased estimates, depending on why the data went missing
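The complete-case behaviour described above is easy to see with any estimation command. The following is a minimal sketch, assuming Stata's shipped example dataset auto.dta (it is not part of the talk), in which the variable rep78 has some missing values.

```stata
* Illustration only: auto.dta ships with Stata and is not from the talk
sysuse auto, clear

* Count observations with a missing value of rep78
count if missing(rep78)
local n_missing = r(N)

* regress silently drops those observations (complete-case analysis)
regress price mpg rep78

* Compare the estimation sample with the number of rows in the dataset
display "rows in data: " _N "   rows used: " e(N) "   dropped: " `n_missing'
```

The number of rows used by regress is smaller than the number of rows in memory by exactly the number of incomplete observations.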

3 Multiple imputation (MI)
- Introduced by Donald Rubin (1987 book, Wiley)
- Based on Bayesian principles
- Both the data-generating mechanism and the missingness mechanism are modelled
  - Fairly broad assumptions about data-generating model
  - Fairly restrictive assumptions about missingness mechanism
- Modelling assumptions apply at the imputation level
- Statistical modelling is general (once data is imputed)
- Post estimation: some; more work needs to be done
- Diagnostics: theory and practice not yet worked out
- Model-building: in its infancy; work has started

4 MI data analysis
- Start with a dataset with some values missing
- Missing values are imputed multiple times, using a Bayesianly “proper” imputation method
- This creates m sets of completed data
- Each completed dataset is analysed separately:
  - Standard complete-data estimation methods are used
  - E.g. linear regression, logistic regression

5 Inference (estimation) using MI
Coefficient estimates and variances (SEs) from complete-data analyses are combined using Rubin’s Rules:
- Parameter estimates: average of the complete-data parameter estimates
- Variance is the sum of two components:
  - Within-imputation variance (average of the complete-data variances)
  - Between-imputation variance (determined from complete-data parameter estimates)
- Point estimators divided by SE have approximate t distributions
- Estimate d.f. and use t-multipliers to get confidence intervals
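Written out, the combination rules take the following standard form (Rubin, 1987). The notation is an assumption made here for illustration and does not appear on the slide: \hat{Q}_j and U_j denote the estimate and its variance from the j-th completed dataset, j = 1, ..., m. Note that the between-imputation component enters the total variance inflated by a factor 1 + 1/m.

```latex
\begin{align*}
\bar{Q} &= \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j
  && \text{(combined estimate)}\\
W &= \frac{1}{m}\sum_{j=1}^{m} U_j
  && \text{(within-imputation variance)}\\
B &= \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{Q}_j-\bar{Q}\bigr)^2
  && \text{(between-imputation variance)}\\
T &= W + \Bigl(1+\frac{1}{m}\Bigr)B
  && \text{(total variance)}\\
\nu &= (m-1)\Bigl[1+\frac{W}{(1+1/m)\,B}\Bigr]^2
  && \text{(degrees of freedom)}
\end{align*}
\[
\bar{Q} \pm t_{\nu,\,1-\alpha/2}\,\sqrt{T}
\]
```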

6 Background (MI in Stata)
What is available in Stata currently?
“MI Tools”, Carlin et al., Stata J. 2003:
- Imputed datasets stored in separate dta files: myfile1.dta, ..., myfile`m'.dta
- Estimation: mifit with regress, logit, probit, clogit, glm, logistic, poisson, svyreg, svylogit, svyprobit, svypoisson, xtgee, xtreg
- Post estimation: milincom, mitestparm
- Data manipulation: miset, miappend, mimerge, mido, misave

7 Background (MI in Stata)
Main drawbacks of “MI Tools”:
- Loose association between original and imputed data
- Loose association between individual imputed datasets
- Limit to range of estimation commands supported (13)
- Choice of coding of some aspects resulted in slow execution time in some cases
- No capacity to perform imputation

8 Background (MI in Stata)
What is available in Stata currently? (cont.)
ice, micombine, Royston, Stata J. 2004/05:
- ice stores imputed datasets in a single dta file
  - uses impid and obsid vars
- Estimation: micombine with clogit, cnreg, glm, logistic, logit, poisson, probit, qreg, regress, rreg, xtgee, streg, stcox, ologit, oprobit, mlogit
- Post estimation:
  - results returned in e(b), e(V) etc.
  - onus on user to know when post-estimation command applied directly to combined estimates is valid

9 Background (MI in Stata)
ice, micombine, Royston, Stata J. 2004/05 (cont.):
- Data manipulation left to user, but stacked format facilitates simple transformation of variables etc.
- mijoin, misplit (for conversion between formats)
Main drawbacks:
- Limit to range of estimation commands supported (16)
- Manipulation that changes number of observations in each dataset not easily supported (e.g. reshape)
- Not clear when/if post-estimation is valid

10 mim: A new architecture
Main aims:
- To unify two sets of tools into a single architecture
- To combine functionality of both sets of tools
- To simplify the command syntax
- To extend the range of estimation commands supported
- Better post-estimation facilities
  - testparm, lincom, predict
  - Make it harder to do crazy things
  - Add other post-estimation commands later

11 mim: A new architecture
Scope:
- Creation of imputations is NOT included
  - But easy for users to put imputed datasets into mim format
- Architecture covers analysis and manipulation of existing imputed datasets
- Designed to handle:
  - Estimation
  - Data manipulation (reshape, append & merge)
  - Post-estimation (lincom, testparm & predict)
  - Replay (management of estimation results)
  - Utility functions

12 mim: A new architecture
Storage of imputed datasets:
- Based on Royston’s stacked format
- Fixed names for impid and obsid vars: _mj (impid) and _mi (obsid)
  - no need for:
    - dataset characteristics to record the names
    - additional command options to specify the names
    - dedicated set command to manage the characteristics
  - stacking requires only generate, append and replace
- Original data stored in the stack: _mj == 0

13 mim: A new architecture
Storage of datasets: illustration

_mj  _mi    y        x
----------------------------------
  0    1  1.1      105
  0    2  9.2      106
  0    3  1.1        .
  0    4  2.3        .
  0    5  7.5      108
  0    6  7.9        .
  1    1  1.1      105
  1    2  9.2      106
  1    3  1.1  109.796
  1    4  2.3  110.456
  1    5  7.5      108
  1    6  7.9  102.243
  2    1  1.1      105
  2    2  9.2      106
  2    3  1.1  107.952
  2    4  2.3  115.968
  2    5  7.5      108
  2    6  7.9  114.479
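As a rough sketch of how such a stack might be assembled with only generate, append and replace, the do-file below rebuilds the illustration for m = 2. Several things here are assumptions rather than part of the talk: the tempfile names orig, imp1 and imp2 are hypothetical, the "imputed" values are simply typed in from the illustration rather than produced by an imputation method, and each imputed file is assumed to hold its rows in the same order as the original data.

```stata
* Sketch only: run as a do-file. Values are typed in from the illustration.
clear
input y x
1.1 105
9.2 106
1.1 .
2.3 .
7.5 108
7.9 .
end
tempfile orig imp1 imp2
save `orig'

* Two completed copies with the missing x values filled in (not imputed here)
replace x = 109.796 in 3
replace x = 110.456 in 4
replace x = 102.243 in 6
save `imp1'

replace x = 107.952 in 3
replace x = 115.968 in 4
replace x = 114.479 in 6
save `imp2'

* The original data go in first, flagged by _mj == 0
use `orig', clear
generate byte _mj = 0
generate long _mi = _n

* Append each imputed dataset and label the newly added rows
forvalues j = 1/2 {
    local n0 = _N                          // rows present before the append
    append using `imp`j''
    replace _mj = `j' if _n > `n0'         // imputation number
    replace _mi = _n - `n0' if _n > `n0'   // row number within the dataset
}

order _mj _mi
list, sepby(_mj) noobs
```

Setting _mi from the row position is what ties each imputed row back to its original observation, which is why the same-row-order assumption matters.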

14 mim: A new architecture
Command structure:
- A single command prefix called mim
- mim processes the multiply-imputed dataset currently in memory
Typical syntax:
    . mim: command
E.g.
    . use myImputedData, clear
    . mim: regress y x1 x2 x3
    . mim: predict yhat
    . mim: lincom x1+x2+x3, or

15 mim: A new architecture
Commands (cont.):
General syntax
- Default behaviour of mim may be modified through mim options:
    mim [, mim_options]: command
- mim_options depend on whether one wishes to do
  - estimation
  - data manipulation
  - post-estimation
  - replay

16 Using mim
Estimation:
- mim recognises 28 estimation commands:
    regress mean proportion ratio logistic logit ologit mlogit probit oprobit poisson glm binreg blogit clogit cnreg mvreg rreg qreg iqreg sqreg bsqreg stcox streg xtgee xtreg xtlogit xtmixed
- and 11 svy commands:
    svy:regress svy:mean ... svy:poisson
- Plus, in principle any Stata estimation command may be used:
    mim, category(fit): estimation_command

17 Using mim
Data manipulation:
- Stacked format allows simple manipulation using existing Stata commands:
    . generate, replace, label etc.
    . by _mj: tabulate ...
- mim recognises 3 data manipulation commands:
    . mim: reshape cmdline
    . mim: append using “another mim dataset”
    . mim, sort(varlist): merge using ...
- In principle, any Stata data manipulation command may be used with mim:
    mim, category(manip) sort(varlist): manip_command

18 Using mim
- mim recognises some post-estimation commands:
    . mim: lincom cmdline
    . mim: testparm cmdline
    . mim: predict xbvar [, eq(name)]
    . mim: predict sevar, stdp [ eq(name)]
- Replay combined estimates:
    . mim
- Replay individual estimates (#th imputed dataset):
    . mim, j(#)
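Put together, a short session using these forms might look like the sketch below. It assumes that the user-written mim command is installed and that a stacked dataset with variables _mj, _mi, y and x (as in the storage illustration) is already in memory; the variable name yhat is hypothetical. Only syntax forms shown on the slides are used.

```stata
* Sketch only: needs the user-written mim command and a stacked dataset
* (_mj, _mi, y, x) in memory; yhat is a hypothetical new variable name
mim: regress y x        // fit the model in each imputed dataset and combine
mim: testparm x         // post-estimation test via mim
mim: predict yhat       // prediction, using the form shown on the slides
mim                     // replay the combined estimates
mim, j(2)               // replay the estimates from the 2nd imputed dataset
```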

19 Using mim
Interactive example in Stata

20 Final comments
Difficulties faced:
- Simplicity of programming versus ease of use and flexibility
- Inconsistencies between commands resulted in more tailoring than we’d hoped
Progress:
- Coding of version 1 complete
- Current version is 1.0.3
- Help file written
- Has been in beta-testing for several months
- Submitted for publication in Stata Journal

