Technical Coordination Group, Zagreb, Croatia, 26 January 2018
IT and technological solutions in different types of censuses in the region and their requirements
Danilo Dolenc, MSc, Census Project Manager
Statistical Office of the Republic of Slovenia
Introduction
Focus on:
- Data collection methods
- Data integration and principles of data editing
  - Traceability
  - Repeatability
  - Stage processing
- Data warehouse
Paper vs. electronic
- Paper is out!
- Electronic data collection
  - Built-in data control: how far can we go?
  - Writing, marking, use of drop-down (pull-down) menus
  - Responsive design
  - Written answers
  - Taking into account all languages
Web vs. field
- Management information system
- Coverage
- Avoiding double enumeration
- Coordination
- More modes, more complexity
- Security of data transmission
- Online or offline
Identifiers
- Common identifiers to link data from different input sources
  - Personal (PIN, tax number, …)
  - Geo-data (coordinates, addresses)
- Linkage of population and housing data
  - Dwelling numbers
  - Household numbers
- Barcodes
- Other specially designed unique codes
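The linkage idea above can be sketched as follows. This is a minimal illustration, not the Statistical Office's implementation: the register contents, the field names (`pin`, `address_id`, `dwelling_no`) and the helper `link_person` are all hypothetical.

```python
# Hypothetical input sources; every field name and value is illustrative.
population_register = [
    {"pin": "P001", "address_id": "A10", "dwelling_no": 1},
    {"pin": "P002", "address_id": "A10", "dwelling_no": 2},
    {"pin": "P003", "address_id": "A99", "dwelling_no": 1},
]
tax_register = {"P001": {"employed": True}, "P002": {"employed": False}}
housing_register = {("A10", 1): {"rooms": 3}, ("A10", 2): {"rooms": 2}}


def link_person(person):
    """Attach tax and dwelling attributes to one person record."""
    linked = dict(person)
    # The personal identifier links person-level sources.
    linked.update(tax_register.get(person["pin"], {}))
    # Address plus dwelling number links population data to housing data.
    dwelling_key = (person["address_id"], person["dwelling_no"])
    dwelling = housing_register.get(dwelling_key)
    linked["dwelling_found"] = dwelling is not None
    if dwelling is not None:
        linked.update(dwelling)
    return linked


linked_persons = [link_person(p) for p in population_register]
# Persons whose dwelling could not be found are candidates for data editing.
unlinked = [p["pin"] for p in linked_persons if not p["dwelling_found"]]
```

Keeping an explicit `dwelling_found` flag makes failed links visible instead of silently dropping the person.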
Data integration
- Two IT environments should be set up
  - Production environment for data integration and data editing
  - Final data warehouse
    - Data partly from the production environment
    - Derived variables
- Never ask what can be derived in the process
  - The most typical example: family data
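As a rough illustration of deriving instead of asking, the sketch below computes household-level variables from person records. The field names, the age threshold and the function `derive_households` are assumptions made for this example. Real family derivation is considerably more involved.

```python
# Hypothetical person records after data integration.
persons = [
    {"pin": "P001", "household_id": "H1", "age": 41},
    {"pin": "P002", "household_id": "H1", "age": 9},
    {"pin": "P003", "household_id": "H2", "age": 70},
]


def derive_households(person_records):
    """Derive household size and number of children from person data."""
    households = {}
    for person in person_records:
        household = households.setdefault(
            person["household_id"], {"size": 0, "children_under_15": 0}
        )
        household["size"] += 1
        if person["age"] < 15:  # illustrative threshold
            household["children_under_15"] += 1
    return households


households = derive_households(persons)
```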
Principles of IT data production (1)
Repeatability
- Change of any value of any variable in the record
  - A new version of the record is created and inserted into the database
  - Versions of the record are properly denoted (sequence number)
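A minimal sketch of this versioning rule, assuming an append-only store: a change never overwrites a record, it appends a copy with the next sequence number. The names `records`, `load_record` and `change_value` are hypothetical.

```python
records = []  # append-only store; each element is one version of a record


def load_record(person_id, values):
    """Insert the first version of a record."""
    records.append({"person_id": person_id, "version": 1, **values})


def change_value(person_id, variable, new_value):
    """Create a new version instead of updating the record in place."""
    latest = max(
        (r for r in records if r["person_id"] == person_id),
        key=lambda r: r["version"],
    )
    new_version = {**latest, variable: new_value, "version": latest["version"] + 1}
    records.append(new_version)
    return new_version


load_record("P001", {"age": 210, "sex": "F"})
change_value("P001", "age", 21)
# Both the original value (210) and the corrected value (21) remain stored.
```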
Principles of IT data production (2)
Repeatability
- You can start or repeat the statistical process from the beginning or from some stage
- Status 1 means input data after data integration but before data editing
- The number of statuses depends on the complexity of statistical editing
- In 2015 the highest status was 18
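Restarting from a given status could look like the sketch below. It assumes, as a simplification, that every editing step writes a new version with the next status even when nothing changes. The two editing rules and the function `run_from` are hypothetical.

```python
def fix_negative_age(rec):
    return {**rec, "age": abs(rec["age"])}


def cap_age(rec):
    return {**rec, "age": min(rec["age"], 120)}


# Step i (0-based) takes a record at status i + 1 and produces status i + 2.
EDIT_STEPS = [fix_negative_age, cap_age]


def run_from(versions, start_status):
    """Discard versions above start_status, then re-apply the later steps."""
    kept = [v for v in versions if v["status"] <= start_status]
    current = max(kept, key=lambda v: v["status"])
    for step in EDIT_STEPS[start_status - 1:]:
        current = {**step(current), "status": current["status"] + 1}
        kept.append(current)
    return kept


# Status 1: input data after integration, before editing.
versions = [{"person_id": "P001", "age": -130, "status": 1}]
versions = run_from(versions, 1)    # full run from the beginning
repeated = run_from(versions, 2)    # repeat only the steps after status 2
```

Because earlier statuses are kept untouched, repeating from status 2 reproduces the same final version as the full run.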
Principles of IT data production (3)
Traceability
- Metadata on the method of change of the value of every variable
  - Which production phase
    - In combination with the status of the record
  - How the record has been changed
    - Automated correction
    - Manual correction
    - Imputation
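One way to hold such metadata is a change log with one entry per changed value. The sketch below is an assumption about how this might be modelled. The class and field names are hypothetical, and only the three change methods come from the slide.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeMethod(Enum):
    AUTOMATED = "automated correction"
    MANUAL = "manual correction"
    IMPUTATION = "imputation"


@dataclass(frozen=True)
class ChangeLogEntry:
    record_id: str
    variable: str
    old_value: object
    new_value: object
    phase: str            # production phase in which the change was made
    status: int           # status of the record created by the change
    method: ChangeMethod  # how the value was changed


change_log = []


def log_change(record_id, variable, old_value, new_value, phase, status, method):
    entry = ChangeLogEntry(record_id, variable, old_value, new_value,
                           phase, status, method)
    change_log.append(entry)
    return entry


def history(record_id, variable):
    """Return every logged change of one variable in one record, in order."""
    return [e for e in change_log
            if e.record_id == record_id and e.variable == variable]


log_change("P001", "age", 210, None, "editing", 2, ChangeMethod.AUTOMATED)
log_change("P001", "age", None, 34, "imputation", 3, ChangeMethod.IMPUTATION)
age_history = history("P001", "age")
```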
Principles of IT data production (4)
IT process in stages
- Important for the register-based approach
- Could be applied in any other census type
- Not all sources are available at the same time
- No further changes to data once a stage is completed
- Final data can be disseminated much earlier
- In Slovenia, a four-stage process
  - Basic population data (T + 4 months)
  - Households, families (T + 9)
  - Socioeconomic characteristics + migration (T + 12)
  - Housing data (T + 18)
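The stage offsets translate into a release schedule once T is fixed. The sketch below assumes T is the census reference date and uses 1 January 2018 purely as a placeholder. The helper `add_months` is hypothetical and only works when the day of month exists in the target month.

```python
from datetime import date

# Stage name and offset in months after T, as listed on the slide.
STAGES = [
    ("Basic population data", 4),
    ("Households, families", 9),
    ("Socioeconomic characteristics + migration", 12),
    ("Housing data", 18),
]


def add_months(d, months):
    """Shift a date by whole months (assumes d.day exists in the target month)."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, d.day)


reference_date = date(2018, 1, 1)  # placeholder value for T, not from the source
schedule = {name: add_months(reference_date, offset) for name, offset in STAGES}
```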
Production environment (1)
- Basic census tables (NAME)
  - Scheme of table: description of variables
    - Short name, format, length, classification
- Auxiliary ("shadow") tables for every basic table
  - Editable tables (NAME_EDI)
  - Status tables (NAME_EDI_S)
Production environment (2)
- Classifications (D_...) and codebooks (SIF_...) for every applicable variable
- Primary data sources for data integration (VIR_...)
  - Variables with the same names as in the basic table
  - Data in a defined format
- Secondary data sources (VIR_S_...)
  - Re-loaded new versions of data
  - Some variables only or a limited number of records
Production environment (3)
- Views on tables: important for regular monitoring of data
  - All records (V_NAME_VSI)
    - 6.4 million records in the population table
  - Records with highest status only (V_NAME_ZST)
    - 2.1 million records in the population table
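The two kinds of view can be sketched with SQLite from Python. This is an assumption about the mechanics, not the actual schema: the table `persons`, its columns and the sample rows are hypothetical, and the view names simply follow the V_NAME_VSI / V_NAME_ZST pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE persons (
    person_id TEXT,
    status    INTEGER,
    age       INTEGER,
    PRIMARY KEY (person_id, status)
);
INSERT INTO persons VALUES ('P001', 1, 210), ('P001', 2, 21), ('P002', 1, 45);

-- All records: every version of every person.
CREATE VIEW V_PERSONS_VSI AS
    SELECT * FROM persons;

-- Records with the highest status only: one row per person.
CREATE VIEW V_PERSONS_ZST AS
    SELECT p.*
    FROM persons AS p
    WHERE p.status = (SELECT MAX(q.status)
                      FROM persons AS q
                      WHERE q.person_id = p.person_id);
""")

all_count = conn.execute("SELECT COUNT(*) FROM V_PERSONS_VSI").fetchone()[0]
latest = conn.execute(
    "SELECT person_id, status, age FROM V_PERSONS_ZST ORDER BY person_id"
).fetchall()
```

The gap between the two row counts mirrors the slide's figures: many stored versions, but one current record per person.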
General advantages
- Input data and final data available in the same table
  - Easy to calculate quality indicators
- Controlled automated data editing in the whole database
  - No partial processing
- Possibility to apply manual editing if needed
- Data from the field should be coded in advance
  - No text data in basic census tables
Data warehouse (1)
- At least one table per basic enumeration unit should be set up
- Unique identifiers (not linkable to other identifiers) needed
  - Household ID
  - Family ID
  - Dwelling (building) ID
- New: household table and family table
  - Not needed in the production environment
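One possible way to obtain identifiers that cannot be linked back to production identifiers is to assign random codes and keep the mapping outside the warehouse. This is an illustrative assumption, not the method used in Slovenia. The function `assign_warehouse_ids` and the sample identifiers are hypothetical.

```python
import secrets


def assign_warehouse_ids(production_ids):
    """Map each production identifier to a random, unique warehouse ID."""
    mapping = {}
    used = set()
    for production_id in production_ids:
        if production_id in mapping:
            continue  # one warehouse ID per production identifier
        new_id = secrets.token_hex(8)
        while new_id in used:  # retry in the unlikely event of a collision
            new_id = secrets.token_hex(8)
        used.add(new_id)
        mapping[production_id] = new_id
    return mapping


# Hypothetical household identifiers from the production environment.
household_ids = ["HH-2018-000017", "HH-2018-000018", "HH-2018-000017"]
household_map = assign_warehouse_ids(household_ids)
```

Because the codes are random rather than derived from the original value, the warehouse ID alone reveals nothing about the production identifier.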
Data warehouse (2)
Main advantages
- Data from more than one census in the warehouse
  - Longitudinal analyses
  - Most topics are the same from census to census
- Derived tables / variables allow easier tabulation and effective user support
- Access to the final warehouse using standardized or self-developed tools
  - OLAP, Discoverer, SAS, Excel, …
Some examples - Slovenia
Table PERSONS
- Only production variables: 35
- Only warehouse variables: 45
- Production variables directly transferred to warehouse table: 86
Derived table HOUSEHOLDS
- 86 derived variables
- 16 associated dwelling variables
- 16 territorial variables
Conclusion
- Every country must take into account its own situation
- The model presented could be applied to every census type
- Outsourcing of IT is feasible for the data collection stage
- Data processing should be developed in close cooperation between the office's own IT and methodological staff