New technologies of data collection supporting the transformation from traditional to combined and register-based censuses (PL). Janusz Dygaszewicz, Central Statistical Office of Poland. THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
Long-term history (tradition) of Polish statistics. 1789 - first countrywide population census in Poland (based on registers!): the register of nobility and burghers; the parish register of births and deaths. 1918 - establishment of the Central Statistical Office (CSO). Almost 100 years of continued existence and development of CSO Poland.
In 2011 - Mixed Model for Population and Housing Census. Combined census: a combination of data from administrative sources (full survey covering basic demographic variables) with data acquired from an ad-hoc 20% sample survey.
Data collection channels in the 2011 Census Round: administrative sources (including spatial data reference registers); CAII (CAWI) – Computer Assisted Internet Interview: self-enumeration by Internet; CATI - Computer Assisted Telephone Interview (Call Center): telephone interview; CAPI - Computer Assisted Personal Interview: face-to-face interview with respondents conducted by the census enumerators, registered on hand-held terminals using GPS and GIS services. CAPI method was obligatory!
CAxI: CAII/CAWI - Computer Assisted Internet Interview, CAPI - Computer Assisted Personal Interview, CATI - Computer Assisted Telephone Interview.
NSP 2011 – full-scale survey (short form - 15 questions). Administrative and non-administrative systems: data from administrative registers – Master Record. The CAII method: data acquired using the CAII method. Data supplementation (CAPI and CATI): data supplemented using the CATI and CAPI methods.
NSP 2011 – sample survey (long form - about 100 questions). Administrative and non-administrative systems: data from administrative registers – Master Record. Sample survey: data acquired using the CAII method and the CAPI method. The CATI method – supplementation of data.
Census organisation
National Population and Housing Census 2011 - the project timetable: 2007 - start of preparation; IX-X.2009 - trial census PSR 2010; IV-V.2010 - trial census NSP 2011; IX-X.2010 - PSR 2010; IV-VI.2011 - NSP 2011; 2013 - end of project. Legend: PSR 2010 = Agriculture Census 2010; NSP 2011 = National Population and Housing Census 2011.
Organization of NSP 2011: 16 central controllers; 32 managers of the WCZS and the WCC; 571 voivodship controllers; 642 statistical interviewers; around 2800 gmina leaders; around 18000 census enumerators; over 10 external companies providing ongoing support. The number of statistical interviewers was determined by the manager of the voivodship Call Centre (the CATI method management unit in voivodship statistical offices), based on the number of dwellings assigned to the full-scope survey with an available phone number. The final decision on the number of interviewers in a voivodship was issued by the Director of the Statistical Office (in most cases depending on the availability of persons delegated to work as statistical interviewers). The number of census enumerators employed in particular voivodships depended on the number of dwellings selected for the sample survey. Additionally, the estimate of the number of census enumerators took into account the number of buildings to be verified in the pre-census field check (a stage in which enumerators verified the dwelling locations in the field), so that no enumerator would have to verify more than 400 address points. The number of dwellings per enumerator did not exceed 150 in the sample survey and 30 in the full-scope survey.
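The workload limits above can be turned into a rough staffing estimate. The sketch below is illustrative only: it assumes that all three limits apply to the same pool of enumerators and that work can be split evenly, neither of which the slide states. The function name and the input figures are hypothetical.

```python
import math

# Per-enumerator limits quoted in the notes above.
MAX_ADDRESS_POINTS = 400    # pre-census field check
MAX_SAMPLE_DWELLINGS = 150  # sample survey
MAX_FULL_DWELLINGS = 30     # full-scope survey


def min_enumerators(address_points, sample_dwellings, full_dwellings):
    """Smallest head count that respects all three limits (even split assumed)."""
    return max(
        math.ceil(address_points / MAX_ADDRESS_POINTS),
        math.ceil(sample_dwellings / MAX_SAMPLE_DWELLINGS),
        math.ceil(full_dwellings / MAX_FULL_DWELLINGS),
    )


# Hypothetical voivodship: the address-point limit is the binding one here.
needed = min_enumerators(address_points=100_000,
                         sample_dwellings=30_000,
                         full_dwellings=3_000)
```

In this made-up case the three limits require 250, 200 and 100 enumerators respectively, so the estimate is driven by the field check.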
Data acquisition
On-line channels for data collection - system architecture (diagram labels): CAII; CATI; CAPI; Map Server; Operational Microdata Base; Census Completeness Management.
The architecture of the CAII/CAWI method (diagram labels): OBM; ZKS; online electronic questionnaire system; offline questionnaire; downloading the application file; Internet; email; electronic media; browser; online method; offline method.
Self-enumeration by Internet - filling in the questionnaire by the respondent. Identification: used to confirm the identity of the respondent. Entering identification data in a questionnaire (e.g. PIN, NIP, first name, last name) or additional authentication attributes (e.g. place of birth, mother’s maiden name). Establishing a password which, jointly with the PIN, was the basis of authentication within 14 days.
Self-enumeration by Internet - course of action in completing the survey by a person. In the full-scope survey, a person filled in his/her own questionnaire and could also specify adults who lived in the same apartment (proxy). In the sample survey, the dwelling questionnaires were completed first, followed by the personal questionnaires. After authorization, the census form was available for 14 days for multiple access.
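The 14-day availability rule can be expressed as a simple date check. This is a minimal sketch, not the census system's logic: the function name is hypothetical, and treating the 14th day as still inside the window is an assumption.

```python
from datetime import date, timedelta

ACCESS_WINDOW = timedelta(days=14)  # availability period stated on the slide


def form_accessible(authorised_on: date, today: date) -> bool:
    """True while the form can be reopened (inclusive last day is an assumption)."""
    return authorised_on <= today <= authorised_on + ACCESS_WINDOW
```

A respondent authorised on 1 April could, under this reading, return to the form up to and including 15 April.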
Typical person using self-enumeration: a man; age: about 24 years old; a city inhabitant; secondary education.
CATI – Computer Assisted Telephone Interview: scheduled as the first or the second (following CAII) channel of collecting data; working posts of telephone interviewers located in separate Call Center studios; telephone interviewers provided with professional equipment.
CATI in National Census 2011: 16 distributed Call Centers; 320 SIP stations (tools for conducting telephone interviews); 642 telephone interviewers/consultants; Interactive Intelligence technology; utility applications prepared by CSO of Poland. The system worked from 8.04 to 30.06 (84 days), 8.00 – 20.00 in 2 shifts. Hotline: 0 800 800 800 or (22) 44 44 777.
The most significant functions of the Call Center: confirming the identity of the interviewer/census enumerator; interviewing; arranging visits by census enumerators; hotline.
CAPI – Computer Assisted Personal Interview: the third channel of data collection, used when a complete set of data could not be obtained via the CAII and CATI channels; direct interviews in households (as the first or second channel) where the adopted methodology required it or where household members had not consented to a telephone survey.
The architecture of the CAPI method (diagram labels): dispatching application - server; OBM; communication server; ZKS; management of a terminal; map server; dedicated APN; mobile network; WAN CSO; Statistical Office; cryptographic SIM card; GPS module; dispatching application - client; mobile application.
Field enumeration management. Supervisor – regional (NUTS2) level. Responsibilities: address point and census area management; enumerator monitoring (census progress, localization and trail); emergency situation management; providing help for enumerators; providing necessary information to enumerators.
HH - mobile terminal with GPS: HTC Touch Pro2; touch-screen, size 3.6", resolution 480 x 800 pixels; sliding, tilting - convenient usage; sliding 5-row QWERTY keyboard; GSM/GPRS/EDGE/UMTS/HSPA; GPS module; camera - 3.2 MP; Windows Mobile® 6.5.
Enumerator: visiting all assigned holdings; filling in electronic questionnaires; daily synchronisation; contact with the supervisor regarding task scheduling; adding newly identified holdings.
Enumerator - alarm procedure. In emergency situations, enumerators can send an alarm signal to their supervisors. The alarm notice is sent to the supervisor application and via SMS to the supervisor.
Administrative registers
The use of administrative sources in censuses. Administrative sources are used in the census as: a direct source of research data; a source of information for creating the list of entities covered by the census frame (address-housing survey); in addition, a source of information for imputation, data estimation, and comparison of data quality.
The conditions of using administrative data by the Polish official statistics: Legal issues, related to the access and use of data, including especially the identifiable and personal microdata; Public relations – public trust in, and acceptance of, the linkage and re-use in official statistics of administrative data primarily collected for other purposes; Organisational issues, related to the statistical survey management process and to changes in work organisation; Methodological issues – methods ensuring optimal quality of output data, including consistency of definitions between the official statistics system and the information systems of public administration; Technical issues, concerning the use of new technologies, including especially the information and communication technologies.
Registers - data acquisition. Data owners: Ministry of Finance, Ministry of Interior and Administration, Ministry of Justice, Agricultural Social Insurance Fund, National Health Fund, Agency for Restructuring and Modernisation of Agriculture, Agricultural and Food Quality Inspection, Agency for Geodesy and Cartography, State Fund for Rehabilitation of Disabled Persons, County Offices, Commune Offices, Regional Offices, Telcoms, Energy Suppliers, Office For Foreigners, Social Insurance Institution, Housing Managers.
Key administrative sources used in the Polish official statistics: population registration system; tax system; economic activity information system; farming activity information system; social security system; social insurance system; health insurance system; real estate information system; education information system; vehicle and vehicle owners information system.
Data quality - measures. Measuring the quality of administrative registers: timeliness of data; methodological compatibility; completeness; identification standards used in the registry; usefulness; compatibility of data in administrative sources with data obtained in the study/survey. Measuring the quality of register data processing: excessive coverage error rate; incomplete coverage error rate – subjective indicator of completeness, objective indicator of completeness; imputation rate; data correction index; index of integration of data from various sources.
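The slide names the processing indicators without defining them. The sketch below shows one common way such rates can be computed. The formulas are assumptions chosen for illustration (share of register units outside the reference population, share of reference units missing from the register, share of imputed values), not the CSO's definitions, and all names are hypothetical.

```python
def coverage_rates(register_ids, reference_ids):
    """Return (over-coverage rate, under-coverage rate) under the assumed definitions."""
    register_ids, reference_ids = set(register_ids), set(reference_ids)
    # Units present in the register but absent from the reference population.
    over = len(register_ids - reference_ids) / len(register_ids)
    # Units of the reference population that the register misses.
    under = len(reference_ids - register_ids) / len(reference_ids)
    return over, under


def imputation_rate(imputed_flags):
    """Share of values that were imputed rather than observed."""
    flags = list(imputed_flags)
    return sum(flags) / len(flags)


# Hypothetical identifiers: one surplus register unit, two missing ones.
over, under = coverage_rates(register_ids=[1, 2, 3, 4],
                             reference_ids=[2, 3, 4, 5, 6])
```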
Census Data Processing Infrastructure
Data processing stages: data acquisition; source data collection and preparation; source data loading and correction; Census Frame preparation; Master Record generation; CAxI data collection; Golden Record generation; export to the Analytical Microdata Base; data processing; information analysis; product generation; dissemination.
Architecture solution. Census work was divided into three stages: Stage I – preparatory works; Stage II – data acquisition; Stage III – results compilation. Diagram labels: Registry 1 .. Registry n; XML; TXT; CAxI questionnaires; ETL tools; Operational Microdata Base; Analytical Microdata Base; statistical files; Golden Record files; SDMX; metadata; XML metadata server; portal.
Administrative sources. Data processing started with collecting administrative data from data administrators’ registers, within the scope defined by the relevant legal acts. Polish statistics has the right to access all unit data stored in information sources of the public and commercial sectors. The obtained data include the identifiers and personal data needed to support the process of merging (linking) unit data from different sources.
Operational Microdata Base. The basic structure of data in the OMB is a layer: a set of records, each of which relates to one census unit (a person, a dwelling, a household). The records include the values of census attributes derived from source data collected from respondents or defined in a different way (e.g. in the process of imputation). Layers can differ from one another in the set of attributes whose values they present. In the first step of processing, before the census began, a layer referred to as the Master Record was created on the basis of the source sets and the census frame; it consisted of the initial values of a selected subset of census attributes. The values from this layer were transferred to the CAPI, CATI and CAII processes to personalize the electronic questionnaires.
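The layer concept can be pictured as a small data structure. The following is a sketch under stated assumptions, not the OMB schema: the class, method and attribute names are hypothetical, and a layer is modelled simply as a mapping from a census unit identifier to its attribute values.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    """One OMB-style layer: records keyed by census unit, with a fixed attribute set."""
    name: str
    attributes: tuple                             # attribute names this layer carries
    records: dict = field(default_factory=dict)   # unit_id -> {attribute: value}

    def put(self, unit_id, **values):
        """Store attribute values for a unit, rejecting attributes outside the layer."""
        unknown = set(values) - set(self.attributes)
        if unknown:
            raise KeyError(f"attributes not in layer {self.name}: {sorted(unknown)}")
        self.records.setdefault(unit_id, {}).update(values)


def prefill(master: Layer, unit_id):
    """Initial values handed to a CAII/CATI/CAPI questionnaire for one unit."""
    return dict(master.records.get(unit_id, {}))


# Hypothetical Master Record layer with two attributes.
master = Layer("master_record", ("sex", "birth_year"))
master.put("P1", sex="F", birth_year=1980)
form_defaults = prefill(master, "P1")
```

Keeping the attribute set explicit per layer mirrors the statement that layers differ in which attributes they present.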
Operational Microdata Base. After the census was completed using the CAPI, CATI and CAII processes, the corresponding layers were created in the database on the basis of the information collected in those processes. A layer that has already been saved in the system can serve as the basis for creating a new, internal layer to which new attributes (derived from the existing ones) can be added. Collection and processing of data was done in the OMB, which contained personalised data on the census unit, together with the values of characteristics obtained from administrative sources and from other channels (CAII, CATI, CAPI).
Operational Microdata Base. The Operational Microdata Base (OMB) system included the hardware-system-tool infrastructure (computer hardware, system software, tool software) and application software (computer programs that are the result of programming work). This base enabled the intake of data transmitted in electronic form by entities through four information channels, as well as further data processing. The processes of data control, correction and linking, up to complete data cleansing, took place in the OMB. Next, depersonalised data (as the Golden Records) were transferred to the Analytical Microdata Base (AMB).
Data acquisition (architecture diagram labels: Registry 1 .. Registry n; XML; TXT; CAxI questionnaires; ETL tools; Operational Microdata Base; Analytical Microdata Base; statistical files; Golden Record files; SDMX; metadata; XML metadata server; portal).
Data acquisition - file formats: flat files; XML files; local databases; XML files integration.
Transformation to statistical register
Source data collection and preparation: loading registers into the data laboratory environment; denormalization; standardization; deduplication; validation; data completion; vocabulary validation and automatic correction; generation of statistical files (register).
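Two of the steps listed above, standardization and vocabulary validation with automatic correction, can be sketched briefly. This is an illustration only: the slides do not say how correction was done, so fuzzy matching with Python's standard `difflib` is a swapped-in technique, and the vocabulary, cutoff and function names are hypothetical.

```python
import difflib


def standardise(value: str) -> str:
    """Trim surrounding whitespace and upper-case the value."""
    return value.strip().upper()


def correct(value, vocabulary, cutoff=0.8):
    """Return the matching vocabulary entry, or None when nothing is close enough."""
    cleaned = standardise(value)
    if cleaned in vocabulary:
        return cleaned
    # Fuzzy lookup: the best candidate whose similarity ratio reaches the cutoff.
    matches = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical dictionary of locality names (ASCII spellings for simplicity).
LOCALITIES = ["WARSZAWA", "KRAKOW", "GDANSK"]
fixed = correct("warszwa ", LOCALITIES)
```

Values that cannot be corrected automatically come back as `None`, which leaves room for the manual correction steps mentioned elsewhere in the deck.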
Source data collection and preparation (architecture diagram labels: Registry 1 .. Registry n; XML; TXT; CAxI questionnaires; ETL tools; Operational Microdata Base; Analytical Microdata Base; statistical files; Golden Record files; SDMX; metadata; XML metadata server; portal).
Transform data. Data processing in the production environment consists of: profiling – creating a report on the data quality; unification/standardization of data; parsing (separation) or combining variables; standardization with schemes; conversion; validation; deduplication; data integration.
ETL process scheme (diagram). Environments: analytical environment; production environment. Inputs: data from registers (A Register, B Register, C Register); register of reference. Operations: extract data; profiling; standardization; parsing; conversion; validation; deduplication; transform data; integration; load data. Outputs: analytical base; statistical register.
The data transformation model from administrative sources into statistical data sets
The data transformation model from administrative sources into statistical data sets. Preliminary preparation of an administrative register: importing data sets; mapping – excluding excessive variables from further processing, and labelling the remaining ones in accordance with an agreed standard; simple de-duplication – removing identical records from the data set; de-normalisation – creating a flat table with a specific substantive content.
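The three preparation steps named above (mapping, simple de-duplication, de-normalisation) can be sketched on plain lists of dictionaries. This is an illustrative outline, not the production ETL: the column names and the agreed labels are hypothetical.

```python
def map_variables(rows, mapping):
    """Keep only the variables listed in `mapping` and rename them to agreed labels."""
    return [{new: row[old] for old, new in mapping.items()} for row in rows]


def simple_dedup(rows):
    """Drop records identical in every variable, keeping the first occurrence."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


def denormalise(persons, addresses, key):
    """Flatten a person table and an address table into one record per person."""
    by_key = {a[key]: a for a in addresses}
    return [{**p, **by_key.get(p[key], {})} for p in persons]


# Hypothetical raw extract: the `tmp` variable is excessive and gets dropped.
raw = [
    {"PESEL_NR": "1", "NAZW": "Kowalska", "ADR": "A7", "tmp": "x"},
    {"PESEL_NR": "1", "NAZW": "Kowalska", "ADR": "A7", "tmp": "y"},
]
mapping = {"PESEL_NR": "person_id", "NAZW": "surname", "ADR": "address_id"}
persons = simple_dedup(map_variables(raw, mapping))
flat = denormalise(persons, [{"address_id": "A7", "town": "Lublin"}], "address_id")
```

Note that the two raw rows only become identical after mapping removes the excessive variable, which is why mapping precedes de-duplication here.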
The data transformation model from administrative sources into statistical data sets. Data transition – cleaning: obtaining a set of coherent data, containing standardised variables which are accurate in substantive terms. Validation and adjustment: verifying the processing outcome based on suitable rules, i.e. checking that the requirements for data coherence, accuracy and standardisation are met; generating a data quality improvement report; further processing or re-cleaning of the data set.
The data transformation model from administrative sources into statistical data sets. Data set integration with the reference set (e.g. with a previously given list of units – individuals, real estate, farms, enterprises – the statistical business register or an administrative register). Complex de-duplication – eliminating recurrent units that are parallel but not identical, obtaining one record for a given entity without losing other information. Selection of the statistical variable value from multiple registers.
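Complex de-duplication and value selection can be illustrated with a priority rule. The slides do not state how the value was chosen among registers, so the rule below (take the first non-missing value from the most trusted source) is an assumption, and the source names and variables are hypothetical.

```python
def merge_entity(records, priority):
    """Merge parallel, non-identical records of one entity into a single record.

    records:  list of (source_name, {variable: value}) pairs for the same entity
    priority: source names ordered from most to least trusted (assumed rule)
    """
    rank = {source: i for i, source in enumerate(priority)}
    merged = {}
    for source, values in sorted(records, key=lambda r: rank[r[0]]):
        for variable, value in values.items():
            # Keep the first non-missing value; lower-priority sources only fill gaps.
            if value is not None and variable not in merged:
                merged[variable] = value
    return merged


# Hypothetical person found in two registers with differing content.
records = [
    ("tax", {"surname": "Nowak", "address": None, "income_band": 3}),
    ("population", {"surname": "Nowak-Kowalska", "address": "Warszawa"}),
]
person = merge_entity(records, priority=["population", "tax"])
```

The merged record keeps `income_band` from the lower-priority source, which is the "without losing other information" part of the step.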
The data transformation model from administrative sources into statistical data sets. A statistical data set comprises a certain population determined in the reference table, together with a number of entity-specific descriptive variables, the values of which represent the synthesis of information included in multiple registers. The outcome of data quality improvement, upon transformation of the source data set into a statistical one, may be illustrated statistically.
Census Frame preparation: preparing the lists of citizens, buildings and dwellings; integrating the lists of citizens, buildings and dwellings with statistical data; preparing the Census Frame; Goal Frame preparation; random sample preparation.
CAxI (architecture diagram labels: Registry 1 .. Registry n; XML; TXT; CAxI questionnaires; ETL tools; Operational Microdata Base; Analytical Microdata Base; statistical files; Golden Record files; SDMX; metadata; XML metadata server; portal).
Golden Record
Golden Record generation (architecture diagram labels: Registry 1 .. Registry n; XML; TXT; CAxI questionnaires; ETL tools; Operational Microdata Base; Analytical Microdata Base; statistical files; Golden Record files; SDMX; metadata; XML metadata server; portal).
Golden Record generation: integration with Census Frame and CAxI data; validation; correction; operational imputation; transfer of proper values to the Golden Record. The picture presents the process of Golden Record generation (diagram labels: Registers 1..n; CAxI; OMB layers; Golden Record; AMB).
Export to Analytical Microdata Base: preparing transition tables; Golden Records anonymisation; transfer to the Analytical Microdata Base.
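The anonymisation step before export can be pictured as removing direct identifiers and replacing them with a surrogate key. This is a deliberately simple sketch under assumptions: the field names are hypothetical, the slides do not describe the actual procedure, and random surrogate keys are only one possible choice.

```python
import secrets

# Hypothetical set of directly identifying fields to strip before export.
IDENTIFYING = {"person_id", "first_name", "last_name"}


def anonymise(records, id_field="person_id"):
    """Drop identifying fields and attach a random surrogate key per unit."""
    key_map = {}  # original id -> surrogate; would stay on the operational side
    out = []
    for rec in records:
        if rec[id_field] not in key_map:
            key_map[rec[id_field]] = secrets.token_hex(8)
        clean = {k: v for k, v in rec.items() if k not in IDENTIFYING}
        clean["unit_key"] = key_map[rec[id_field]]
        out.append(clean)
    return out


golden = [
    {"person_id": "1", "first_name": "Anna", "last_name": "Nowak", "age": 30},
    {"person_id": "1", "first_name": "Anna", "last_name": "Nowak", "age": 30},
    {"person_id": "2", "first_name": "Jan", "last_name": "Kowal", "age": 41},
]
exported = anonymise(golden)
```

Rows belonging to the same unit share a surrogate key, so records can still be linked inside the analytical base without exposing the original identifier.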
Export to Analytical Microdata Base (architecture diagram labels: Registry 1 .. Registry n; XML; TXT; CAxI questionnaires; ETL tools; Operational Microdata Base; Analytical Microdata Base; statistical files; Golden Record files; SDMX; metadata; XML metadata server; portal).
The main objectives of the AMB: preparing and disseminating census data for statistical analysis; providing an analytical and reporting platform for developing the collected census data and preparing statistical products for national and international users; supporting the process of creating and disseminating statistical products. The AMB also allows the calculation of aggregates available in the Geostatistical Portal as maps (cartograms and cartodiagrams).
Analytical Microdata Base cont. The following processes took place in the Analytical Microdata Base: data integration, validation, automatic correction, imputation, calibration, and creation of new secondary variables and new statistical units (i.e. families and households). ETL processes in the OMB and AMB were executed repeatedly until approved by the methodologists. In the next step, multidimensional objects were created: OLAP cubes (for domestic needs), and hypercubes (60 HC) and quality hypercubes (in accordance with international requirements - Eurostat).
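A hypercube is, at its core, a cross-classification of weighted counts. The sketch below illustrates that idea only: the dimension names, the region code and the records are hypothetical, and the weight of 5 is simply what a 20% sample would imply before any calibration.

```python
from collections import Counter


def hypercube(records, dimensions, weight="weight"):
    """Weighted counts for every combination of the given dimensions."""
    cube = Counter()
    for rec in records:
        cell = tuple(rec[d] for d in dimensions)
        cube[cell] += rec.get(weight, 1)  # unweighted records count once
    return cube


# Hypothetical sample-survey persons, each standing for 5 people.
persons = [
    {"sex": "F", "region": "PL12", "weight": 5},
    {"sex": "F", "region": "PL12", "weight": 5},
    {"sex": "M", "region": "PL12", "weight": 5},
]
cube = hypercube(persons, dimensions=("sex", "region"))
```

Each key of `cube` is one cell of the cross-classification; summing over one position of the key gives a marginal total.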
Concept solution (diagram; from the left we have the Golden Record). Diagram labels: GOLDEN RECORD NSP’2011; ETL (transfer to base, validation, correction, imputation, weighting, calibration, transformation, quality indicators, calculation of the secondary variables); ANALYTICAL MODEL; MULTIDIMENSIONAL STRUCTURES; REPOSITORY OF DATA AND METADATA; REPORTS, ANALYSES; DISSEMINATION DATA; STATISTICAL PRODUCTS; FILES TO EUROSTAT; SDMX; APPLICATION OF INTERNAL USERS; APPLICATION OF EXTERNAL USERS.
Data processing in AMB. To carry out the "Data Processing" process, a series of ETL jobs (built with the SAS Data Integration Studio tools) was divided into the following steps: S00: load source data from Golden Record NSP’2011 (8 tables: Buildings, Dwellings, Collective living quarters, Persons, Emigrants after 2002, Homeless, Households, Families); S01: copy the relevant objects, Households, Families; S02: execution of the automatic data correction; S03: validation of the Golden Record tables; S04: execution of manual data correction; S05: weighting data; S06: data calibration and imputation. Steps S00 to S09a are the ETL processes of correction, imputation and calibration. These processes prepare the data for creating multidimensional objects, hypercubes and aggregates for the geostatistical portals.
Data processing in AMB. S07: control reports; S08: calculation of derived variables; S09: calculating rules on the basis of dictionaries; S09a: adding the secondary variables; S10: transformation of data to the analytical model (fact tables and dimension tables, cubes for national purposes); S11: creation and loading of hypercubes for Eurostat; S12: validation of hypercubes and creation of quality hypercubes for Eurostat; S13: transformation of operational metadata; S14: calculation of quality indicators; S15: preparation of aggregates for the Geostatistics Portal. S10 is responsible for transformation of data….
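The ordering of the steps S00 to S15 can be sketched as a small job runner. The actual jobs were built in SAS Data Integration Studio, so the Python below only illustrates the control flow. It assumes that a repeated round restarts from S00 and that approval is checked after S09a, both of which are assumptions; the function and parameter names are hypothetical.

```python
PREPARATION = ["S00", "S01", "S02", "S03", "S04", "S05",
               "S06", "S07", "S08", "S09", "S09a"]
PUBLICATION = ["S10", "S11", "S12", "S13", "S14", "S15"]


def run(jobs, data, approved, max_rounds=5):
    """Repeat the preparation steps until `approved(data)`, then run S10..S15.

    jobs:     mapping step name -> callable taking and returning the data
    approved: callable standing in for the methodologists' sign-off
    """
    log = []
    for _ in range(max_rounds):
        for step in PREPARATION:
            data = jobs[step](data)
            log.append(step)
        if approved(data):
            break
    else:
        raise RuntimeError("data not approved within max_rounds")
    for step in PUBLICATION:
        data = jobs[step](data)
        log.append(step)
    return data, log


# Toy jobs: each step just counts how many times any step has run.
toy_jobs = {step: (lambda d: d + 1) for step in PREPARATION + PUBLICATION}
result, log = run(toy_jobs, 0, approved=lambda d: d >= 22)
```

With these toy jobs the preparation block runs twice before approval, followed by one pass through the publication steps.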
Instead of a conclusion. Census in 2002: 180 thousand census enumerators; 120 mln questionnaires; 1 000 tons of paper; shredding of census questionnaires at the end. Census 2011: 18 thousand census enumerators; 0 questionnaires; 0 tons of paper; ca. 50 mln € less. Better data; more reliable results; statistical surveys in the future.
Methodological work of the CSO on the use of administrative registers for the 2021 National Census (NSP 2021), conducted within: Improvement of the use of administrative sources (ESS.VIP ADMIN WP6 Pilot studies and applications), 1.X.2015 - 30.IX.2016; Improvement of the use of administrative sources (ESS.VIP ADMIN WP6 Pilot studies and applications), 03.10.2016 - 02.04.2018. Initial identification of administrative registers as data sources for the 2021 National Census (NSP 2021). Improvement of: the method for assessing the quality of administrative registers; assumptions for creating identifiers for linking data from administrative registers; assumptions for obtaining variable values using many administrative registers; assumptions for deriving new variables from the administrative registers; methods of building a list of addresses and dwellings linking information about people, dwellings and buildings; methods of data integration from different sources; methods of data imputation and calibration using data from administrative registers and surveys; methods of generalizing results based on combinations of data from different sources at different levels and territorial cross-sections; methods of assessing the quality of estimates (generalized results); methods of improving the quality of the spatial location of buildings in a statistical frame.
Thank you for your attention e-mail: j.dygaszewicz@stat.gov.pl