Mannheim Research Institute for the Economics of Aging (MEA)
Data Cleaning Process
Patrick Bartels, MEA
Frankfurt, December 6th
A short reminder

"Respondents don't lie!"
- only change values if you're really sure
- gather information about your country-specific database
  - from references provided by the survey agencies
  - from the remarks
  - through your own investigation
- write a syntax or do-file; don't change the data directly
- save the original variable when recoding values, e.g. varname_original
- indicate changes with a flag variable, e.g. varname_flag
- save corrected data files under a new name, e.g. filename_corrected
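The reminder above (keep the original value, flag the change, never touch the raw data) can be sketched in a language-neutral way. The following is a minimal Python illustration, not the project's actual tooling; the function name `recode_with_audit`, the household label "HH-B" and the example rows are hypothetical. Only the `_original` and `_flag` suffixes come from the slide.

```python
def recode_with_audit(rows, var, should_fix, new_value):
    """Return corrected copies of `rows`; the input rows are never modified.

    For the variable `var`, every output row gets
      <var>_original : the value before correction
      <var>_flag     : 1 if the correction rule fired, else 0
    """
    corrected = []
    for row in rows:
        new_row = dict(row)                    # work on a copy, not on the raw data
        new_row[f"{var}_original"] = row[var]  # save the original variable
        new_row[f"{var}_flag"] = 0
        if should_fix(row):
            new_row[var] = new_value(row)
            new_row[f"{var}_flag"] = 1         # indicate the change by a flag
        corrected.append(new_row)
    return corrected


# Hypothetical example data (not taken from any real dataset).
raw = [
    {"sampid": "HH-A", "respid": 1},
    {"sampid": "HH-B", "respid": 2},
]

fixed = recode_with_audit(
    raw,
    "respid",
    should_fix=lambda r: r["sampid"] == "HH-B" and r["respid"] == 2,
    new_value=lambda r: 1,
)
```

In a real workflow, `fixed` would then be written to a new file (e.g. filename_corrected) so the original file stays untouched.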
Division of work

What we do
- consistency checks
  - between cv_r & modules
  - between wave_1 & wave_2
    - for demography
    - for children
- fixing of interchanged IDs by automatic exchanges
Automatic corrections (respid)

sampid  respid  gender_w1  gender_w2  month/year of birth_w1  month/year of birth_w2
[...]   [...]   female     male       Oct. 1945               Apr. [...]
[...]   [...]   male       female     Apr. 1942               Oct. 1945
Automatic corrections (respid)

[Table as on the previous slide, with the wave1 and wave2 records of the two respondents marked for exchange; cell values are only partly recoverable.]

    compute respid_original = respid.
    compute respid_flag = 1.
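The automatic exchange shown on this slide can be sketched as follows. This is a hedged Python illustration of one possible rule, not the procedure actually used: two respondents in the same household are treated as interchanged when each one's wave 1 demographics (gender, month and year of birth) equal the other's wave 2 demographics. The function names, the household labels and the example values are assumptions made for illustration.

```python
def find_swapped_pairs(wave1, wave2):
    """Find pairs of respondents in one household whose IDs look interchanged.

    wave1, wave2: dict mapping (sampid, respid) -> (gender, birth_month, birth_year)
    """
    swaps = []
    keys = sorted(wave1)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if a[0] != b[0]:
                continue                  # only compare within the same household
            if a not in wave2 or b not in wave2:
                continue                  # both must have been interviewed in wave 2
            if wave1[a] == wave1[b]:
                continue                  # identical demographics: cannot decide
            if wave1[a] == wave2[b] and wave1[b] == wave2[a]:
                swaps.append((a, b))
    return swaps


def exchange_respids(wave2, swaps):
    """Return wave 2 records with swapped respids exchanged, original kept, flag set."""
    new_id = {}
    for a, b in swaps:
        new_id[a], new_id[b] = b, a
    records = []
    for key, demo in sorted(wave2.items()):
        sampid, respid = new_id.get(key, key)
        records.append({
            "sampid": sampid,
            "respid": respid,
            "respid_original": key[1],
            "respid_flag": 1 if key in new_id else 0,
            "demo": demo,
        })
    return records


# Hypothetical households; the values only mimic the pattern on the slide.
wave1 = {("HH-A", 1): ("female", 10, 1945), ("HH-A", 2): ("male", 4, 1942),
         ("HH-B", 1): ("male", 1, 1950)}
wave2 = {("HH-A", 1): ("male", 4, 1942), ("HH-A", 2): ("female", 10, 1945),
         ("HH-B", 1): ("male", 1, 1950)}

swaps = find_swapped_pairs(wave1, wave2)
corrected = exchange_respids(wave2, swaps)
```

Requiring a match on all three characteristics in both directions keeps the rule conservative, in line with "only change values if you're really sure"; cases that match only partly would be left for manual review.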
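Overview of merge between wave_1 and wave_2 (after auto-corrections)

[Cross-tabulation of wave_1 gender by wave_2 gender. Categories on one axis: male, female, missing, total; on the other axis: male, female, refusal, missing, total. Cell counts are not recoverable.]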
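A cross-tabulation like the one on this slide can be produced with a few lines of code. The sketch below is a Python illustration under assumptions: respondents are matched on a single key, a respondent absent from one wave is counted as "missing", and the function names and example values are hypothetical.

```python
from collections import Counter


def gender_crosstab(gender_w1, gender_w2):
    """Count (wave 1 gender, wave 2 gender) combinations over all respondent keys."""
    counts = Counter()
    for key in gender_w1.keys() | gender_w2.keys():
        cell = (gender_w1.get(key, "missing"), gender_w2.get(key, "missing"))
        counts[cell] += 1
    return counts


def off_diagonal(counts):
    """Cells where the two waves disagree; these are the cases to inspect."""
    return {cell: n for cell, n in counts.items() if cell[0] != cell[1]}


# Hypothetical respondent keys and values.
gender_w1 = {1: "male", 2: "female", 3: "female", 4: "male"}
gender_w2 = {1: "male", 2: "female", 3: "male", 5: "female"}

counts = gender_crosstab(gender_w1, gender_w2)
to_check = off_diagonal(counts)
```

Running the same tabulation before and after the automatic corrections shows how many inconsistent cases the exchange resolved and how many remain.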
Division of work

What we do
- consistency checks
  - between cv_r & modules
  - between wave_1 & wave_2
    - for demography
    - for children
- fixing of interchanged IDs by automatic exchanges
- correction of wave_1 using further information from wave_2
- feedback to the country teams on cases that cannot be fixed

What we want you to do
- ID corrections initiated by survey agencies
- check booklets, tests, HH-composition (> Omar)
- check financial modules (> Mario)
- check remarks (> Laura)
- check country-specific deviations (> Stephanie)
- coding of open questions; priority: education, ep005
- check the data again; ask the survey agencies if necessary

You're much better at doing this.
We can fix a lot of cases.
Do-File or Syntax

Every do-file or syntax file should state:
- name of author, date of program
- short description of what the program does
- which database and which modules
- version of data, date of publication
- conditions / order of do-files
- for STATA users: define a global path
Example of STATA-do_file (1)

    /******************************************************************************
    This program provides changes in cvid and respid variables in wave2 datasets
    of the longitudinal sample, in order to get exact matching between wave1 and
    wave2 respondents.
    A variable called "mix_hh_flag" is added to the final dataset: it is equal
    to 1 in each household when the value of the respid variable was changed in
    one or two interviews of that household.

    data-version: 2007/Oct/26
    Omar Paccagnella, 30 October 2007

    VERY IMPORTANT! IN ORDER TO GET EXACT MATCHING OF RESPONDENTS WITHIN AND
    BETWEEN WAVES, THIS PROGRAM MUST BE RUN ONLY AFTER THE PROGRAMS:
    "IT_DN_changes_w2.do", "IT_CV_changes_w2.do" and "IT_XT_changes_w2.do" !
    ******************************************************************************/

Annotations on the slide: author's name & date of program; short description; which dataset; order of do-files; data-version
Example of STATA-do_file (2)

    global drive "S:/Share/wave2"

    /*************************************************************
    THIS PROGRAM HAS TO BE RUN FOR ALL SECTIONS FROM DN TO IV
    **************************************************************/

    foreach module in ac as br cf ch co cs dn ep ex hc hh ho iv mh pf ph sp ws {
        use $drive/sharew2_`module'
        gen mix_hh_flag=0
        gen sampid_original = sampid
        gen respid_original = respid
        replace respid=1 if sampid==" " & cvid==2 & respid==2
        replace mix_hh_flag=1 if sampid==" "
        [...]
        save $drive/sharew2_`module'_corrected
    }

Annotations on the slide: global drive; save original variables; flag-variable; for which modules?; new version of data
Example of SPSS-syntax (1)

    COMMENT This program provides changes in cvid and respid variables in wave2 datasets
      of the longitudinal sample, in order to get exact matching between wave1 and
      wave2 respondents.
      A variable called "mix_hh_w2" is added to the final dataset (called
      sharew2_`var'_checked): it is equal to 1 in each household when the value of
      the respid variable was changed in one or two interviews of that household.
    * date of data: 2007/Oct/26
    * Omar Paccagnella, October 2007
    * VERY IMPORTANT! IN ORDER TO GET EXACT MATCHING OF RESPONDENTS WITHIN AND BETWEEN WAVES,
    * THIS PROGRAM MUST BE RUN ONLY AFTER THE PROGRAMS: "IT_DN_changes_w2.do",
    * "IT_CV_changes_w2.do" and "IT_XT_changes_w2.do" !
    ****************************************************************************
    *THIS PROGRAM HAS TO BE RUN FOR ALL SECTIONS FROM DN TO IV

Annotations on the slide: short description; author's name; which dataset; order of syntax; data-version; for which modules?
Example of SPSS-syntax (2)

    GET FILE='S:\SHARE\wave2\dn_module.sav'.
    EXE.
    compute mix_hh_flag=0.
    compute cvid_original = cvid.
    compute respid_original = respid.
    compute sampid_original = sampid.
    if (sampid = & cvid = 2) cvid = 1.
    if (sampid = & cvid = 2) respid = 2.
    if sampid = ( ) mix_hh_flag=1.
    EXE.
    [...]
    SAVE OUTFILE='S:\SHARE\wave2\dn_module_corrected.sav'.
    EXE.

Annotations on the slide: flag-variable; save original variables
Any problems with programming do-files or syntax? Please give us a call.