EDITING OF MULTIPLE SOURCE DATA IN THE CASE OF SLOVENIAN AGRICULTURAL CENSUS 2010
Rudi Seljak, Aleš Krajnc
Statistical Office of the Republic of Slovenia
Overview of the presentation
- General information about the Agricultural Census (AC 2010)
- Database organization
- Statistical data processing
- Main problems and challenges
- Conclusions
General information about the AC 2010
- Collection of exhaustive information on all agricultural holdings (AH) which fulfil certain criteria stated in the EU regulation.
- In accordance with the EU regulation, the census is conducted every 10 years.
- In 2010 it was conducted in most EU Member States (a few in 2009).
- The aim of the obligatory regulation is to obtain, for the first time, comparable data on agricultural indicators based on the same methodology.
Slovenian AC 2010
- Carried out by the Statistical Office of the Republic of Slovenia (SURS) in June-July 2010.
- Part of the data was collected with a field survey (CAPI) and a large part was obtained from different administrative sources.
- 94,686 AHs were visited in the field → 74,646 satisfied the ECA criteria.
- The fieldwork and the data-entry program were carried out by an outsourced company, but all instructions and rules were provided by SURS staff.
- About 600 interviewers finished the fieldwork in approx. 75 days.
Micro-data database
- Field data were separated into different tables according to the sets of related questions.
- Each of the administrative sources was put in a separate table.
- Each table was "accompanied" by the statuses of the variables. A status "flagged" the collection mode and also each change in the process.
- Each table has one associated table into which all the changed records are inserted.
- Views of the different versions of the data were created.
- Altogether 199 tables and views, with 9,583 variables to be processed.
Database – schematic presentation
- Data tables (TabX) and status tables (TabX_S), each paired with an edit-history table (TabX_edi, TabX_S_edi).
- Views over each pair: "all versions of the record" and "last version of the record", for both data and statuses.
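The table-pair scheme above can be sketched in a few lines of SQL. This is an illustrative reconstruction, not the actual SURS schema: table and column names (`ah_id`, `area`, `version`) are assumptions, and only one data/history pair is shown.

```python
import sqlite3

# TabX holds the current data; TabX_edi receives every superseded record,
# mirroring the slide's "associated table where all changed records are inserted".
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE TabX (ah_id INTEGER, area REAL, version INTEGER)")
cur.execute("CREATE TABLE TabX_edi (ah_id INTEGER, area REAL, version INTEGER)")

def edit_area(ah_id, new_area):
    """Archive the current record in TabX_edi, then update TabX in place."""
    row = cur.execute("SELECT ah_id, area, version FROM TabX WHERE ah_id=?",
                      (ah_id,)).fetchone()
    cur.execute("INSERT INTO TabX_edi VALUES (?,?,?)", row)
    cur.execute("UPDATE TabX SET area=?, version=version+1 WHERE ah_id=?",
                (new_area, ah_id))

cur.execute("INSERT INTO TabX VALUES (1, 10.5, 1)")
edit_area(1, 12.0)

# "All versions" view = history plus current state; in this simplified
# sketch the "last version" view is just the current table itself.
cur.execute("""CREATE VIEW TabX_all AS
               SELECT * FROM TabX_edi UNION ALL SELECT * FROM TabX""")

print(cur.execute("SELECT area, version FROM TabX").fetchone())    # (12.0, 2)
print(cur.execute("SELECT COUNT(*) FROM TabX_all").fetchone()[0])  # 2
```

The design keeps the live table small while the union view reconstructs the full edit trail, which is what makes the traceability claimed later in the deck possible.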
Statistical data processing
A combination of a general application and custom-made computer programs was used for data processing.
Custom-made programs:
- Insertion of new units: units that according to the field data were not AHs, but the administrative data indicated the opposite.
- Replacement of a whole set of data in cases where the field data were of bad quality.
- Calculation of derived variables.
General application:
- Logical controls
- Individual and systematic corrections
- Imputation
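The "logical controls" step can be pictured as rules stored as data and applied uniformly to each record. This is a minimal sketch of that idea only: the rule ids, variable names, and predicates below are invented for illustration and do not come from the actual application.

```python
# Each rule is a record: an id, a predicate over one holding, and a message.
rules = [
    {"id": "R1",
     "check": lambda r: r["utilised_area"] <= r["total_area"],
     "msg": "utilised area exceeds total area"},
    {"id": "R2",
     "check": lambda r: r["livestock_units"] >= 0,
     "msg": "negative livestock units"},
]

def run_controls(record):
    """Return the ids of all rules the record violates."""
    return [rule["id"] for rule in rules if not rule["check"](record)]

holding = {"total_area": 8.0, "utilised_area": 9.5, "livestock_units": 3}
print(run_controls(holding))  # ['R1']
```

Keeping the rules as data rather than code is what lets subject-matter staff maintain them without programmer involvement, a point the next slide returns to.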
General application
- A metadata-driven application for data editing which is used in several other surveys (also in the population census).
- Due to the requirements of AC 2010 data processing, some additional functionalities were added:
  - a general metadata-driven process for linkage of an arbitrary number of tables;
  - a general metadata-driven process for the calculation of "aggregated derived variables" – data at the level of persons who work at the AH are aggregated to the level of the AH;
  - several new imputation methods.
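An "aggregated derived variable" as described above rolls person-level records up to the holding level. A minimal sketch, assuming hypothetical field names (`ah_id`, `annual_work_hours`); the real variable definitions are not given in the deck.

```python
from collections import defaultdict

# One record per person working at a holding; ah_id links person to AH.
persons = [
    {"ah_id": 1, "annual_work_hours": 1800},
    {"ah_id": 1, "annual_work_hours": 900},
    {"ah_id": 2, "annual_work_hours": 2000},
]

def labour_per_holding(person_records):
    """Derive a holding-level total from person-level records."""
    totals = defaultdict(int)
    for p in person_records:
        totals[p["ah_id"]] += p["annual_work_hours"]
    return dict(totals)

print(labour_per_holding(persons))  # {1: 2700, 2: 2000}
```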
General application – pros and cons
Pros:
- Greater independence from IT staff – IT (programming) work decreased significantly.
- Traceability and repeatability are ensured – the process is documented through the metadata database.
Cons:
- More skilled subject-matter personnel needed.
- A lot of metadata produced → sometimes difficult to manage and control.
AC 2010 – main challenges
- AC 2010 is already by its nature a very demanding survey:
  - large number of units and variables;
  - data at different levels (agricultural holding + persons working at the holding).
- The combination of different data sources makes the job even more complicated.
- Creation of the rules (process metadata) was spread among several subject-matter specialists, each covering one of the areas → overall coordination was quite a demanding task.
- In the first phase a lot of syntax errors were produced.
AC 2010 – main challenges cont'd
- The large number of variables required a large number of process steps (e.g. 16 steps in the imputation part of the process) → sometimes difficult to follow the process and ensure consistency in the corrected data.
- Integration of the data from two different sources was a special challenge:
  - priority setting in the case of "overlapping" sources;
  - large differences between data from different sources had to be resolved → very time-consuming.
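The priority-setting problem for overlapping sources can be sketched as a per-variable precedence list: when both the field survey and an administrative source supply a value, the higher-priority source wins, falling back when it is missing. The variable names and the priorities themselves are invented here; the deck does not state SURS's actual rules.

```python
# Per-variable source precedence (illustrative): administrative data are
# preferred for area, field data for livestock.
priority = {"area": ["admin", "field"], "livestock": ["field", "admin"]}

def resolve(field_rec, admin_rec):
    """Merge two overlapping records according to the priority table."""
    sources = {"field": field_rec, "admin": admin_rec}
    merged = {}
    for var, order in priority.items():
        for src in order:
            value = sources[src].get(var)
            if value is not None:  # fall through to the next source if missing
                merged[var] = value
                break
    return merged

print(resolve({"area": 10.2, "livestock": 5},
              {"area": 10.0, "livestock": None}))
# {'area': 10.0, 'livestock': 5}
```

Large discrepancies between sources still need manual review; a rule table like this only settles the routine overlaps.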
Conclusions – points for discussion
- What is the influence of outsourcing the data collection on the quality of the incoming data?
  - Importance of active cooperation of SURS staff in testing the questionnaire and training the interviewers.
- Usage of combined data sources:
  - large advantage in decreasing the reporting burden;
  - little influence on cost reduction;
  - increased workload at the data editing stage;
  - usage of different sources increased the quality of the final micro-data.
- The challenge is to find the balance between these factors.
Conclusions – points for discussion cont'd
- Complexity of data processing: a balance between the usage (and, if needed, upgrade) of general IT solutions and the creation of custom-made programs.
- Micro-data are provided to Eurostat and made available to researchers.
- Can we still afford selective data editing?
Thank you for your attention