Stephanie Hirner, ESTP "Administrative data and censuses"

Matching registers without direct identifiers and confidentiality issues. Stephanie Hirner. ESTP "Administrative data and censuses", Wiesbaden, 22–24 May 2018. The contractor is acting under a framework contract concluded with the Commission.

Contents: Types of matching procedures; Matching of address data; Matching of personal data sets; Confidentiality issues. © Federal Statistical Office of Germany | Census 10.12.2019

Matching via identifiers, identical items or similar items. Example items for addresses: address ID, postal code, street name (original and standardised), street number. Example items for personal data: personal ID, name (birth name versus family name), sex, date of birth, place of birth.

Matching process. Preprocessing: parsing and standardisation. Deterministic process: start by including all items, then omit items step by step. Probabilistic process: similarity of items (fuzzy merge) and probability of matching.
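
The deterministic process described on this slide (match on all items first, then omit items step by step) can be sketched as follows. This is a minimal illustration, not the procedure used in the German Census: the field names, the order in which items are dropped, and the rule of accepting only unambiguous 1:1 agreements are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]

# Hypothetical key sets: all items first, then items are omitted step by step.
KEY_STEPS: List[Tuple[str, ...]] = [
    ("name", "sex", "birth_date", "birth_place"),
    ("name", "sex", "birth_date"),
    ("name", "birth_date"),
]


def key_index(records, keys, skip):
    """Group the indices of not-yet-matched records by their key values."""
    index = {}
    for i, rec in enumerate(records):
        if i not in skip:
            index.setdefault(tuple(rec[k] for k in keys), []).append(i)
    return index


def match_stepwise(left, right):
    """Return {left_index: (right_index, step)} for unambiguous exact matches."""
    matches = {}
    used_right = set()
    for step, keys in enumerate(KEY_STEPS):
        left_idx = key_index(left, keys, matches)
        right_idx = key_index(right, keys, used_right)
        for key, lefts in left_idx.items():
            rights = right_idx.get(key, [])
            # Accept only 1:1 agreements; ambiguous keys are left unmatched.
            if len(lefts) == 1 and len(rights) == 1:
                matches[lefts[0]] = (rights[0], step)
                used_right.add(rights[0])
    return matches


# Invented example records.
register_a = [
    {"name": "MEIER ANNA", "sex": "F", "birth_date": "1980-01-02", "birth_place": "BONN"},
    {"name": "KOCH JAN", "sex": "M", "birth_date": "1975-05-05", "birth_place": "KOELN"},
]
register_b = [
    {"name": "KOCH JAN", "sex": "M", "birth_date": "1975-05-05", "birth_place": "COLOGNE"},
    {"name": "MEIER ANNA", "sex": "F", "birth_date": "1980-01-02", "birth_place": "BONN"},
]
result = match_stepwise(register_a, register_b)
```

Recording the step at which a pair was matched keeps the weaker matches (fewer agreeing items) identifiable for later quality checks.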

Probabilistic methods: examples. SPEDIS: "Determines the likelihood of two words matching, expressed as the asymmetric spelling distance between the two words" (see SAS documentation, "SPEDIS Function"). Jaro-Winkler similarity: a measure of similarity between two strings that uses the number of matching characters and the number of transpositions. Sources of error: false matches and missed matches.

SPEDIS. Method: comparison of items (e.g. names); identification of the "costs" of transforming one value into the target word; weighting by the length of the string; transformation in both directions. Results: probability of correct matching.
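
The idea behind the slide (transformation costs, weighted by string length, evaluated in both directions) can be illustrated with a plain edit distance. This is a stand-in, not the SAS SPEDIS function: SPEDIS uses its own table of operation costs, whereas the sketch below assumes a unit cost for every insertion, deletion and substitution. All function names are hypothetical. As with SPEDIS, a lower value means the words are closer and 0 means they are identical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (an assumption, not the SAS cost table)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def spelling_cost(query: str, keyword: str) -> float:
    """Edit cost as a percentage of the query length, hence asymmetric."""
    if not query:
        raise ValueError("query must not be empty")
    return 100 * edit_distance(query, keyword) / len(query)


def two_way_cost(a: str, b: str) -> float:
    """Evaluate both directions and keep the less favourable value."""
    return max(spelling_cost(a, b), spelling_cost(b, a))
```

Because the cost is divided by the length of the first argument, `spelling_cost("MAIER", "MAIR")` and `spelling_cost("MAIR", "MAIER")` differ, which is why both directions are evaluated.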

Jaro-Winkler. Method: comparison of items (e.g. names); weighting of identical characters in the compared words; higher weight for agreement at the beginning of the word. Results: probability of correct matching.
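
A compact sketch of the Jaro-Winkler similarity as it is commonly defined: the Jaro score counts matching characters within a sliding window and the transpositions among them, and the Winkler adjustment raises the score for a common prefix of up to four characters. The scaling factor 0.1 is the conventional default, and the slides do not state which parameters were used, so treat both as assumptions.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unused identical character within the window.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half transpositions.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro score boosted for a common prefix of at most four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * p * (1 - score)
```

For the textbook pair `jaro_winkler("MARTHA", "MARHTA")` the result is about 0.961. A threshold above which two names are treated as a match would still have to be chosen and validated against the two error types named earlier (false matches and missed matches).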

Matching of... addition of items.
Data source 1 (ID, Item 1, further items): 111 A xx 14 mLx; 222 B yy 12 pQn; 333 C 00 sFc.
Data source 2 (ID, Item A): 111 C34; 222 F76; 333 A94.
Result: each record of data source 1 is extended by Item A from the record with the same ID in data source 2 (111 C34; 222 F76; 333 A94).
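
The "addition of items" case corresponds to a lookup by ID: every record of the first source is kept and receives the extra item from the second source where the ID agrees. A minimal sketch using the IDs and Item A values from the slide; the dictionary layout and the names `add_items` and `item_a` are assumptions made for the illustration.

```python
def add_items(base, extra, new_field):
    """Keep every record of `base`; attach the value stored under the same ID in `extra`."""
    return {
        rid: {**fields, new_field: extra.get(rid)}  # None if the ID is missing in `extra`
        for rid, fields in base.items()
    }


source_1 = {"111": {"item_1": "A"}, "222": {"item_1": "B"}, "333": {"item_1": "C"}}
source_2 = {"111": "C34", "222": "F76", "333": "A94"}

enriched = add_items(source_1, source_2, "item_a")
```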

Matching of... outer join.
Data source 1 (ID, Item 1, Item 2): 111 A xx; 222 B yy; 333 C.
Data source 2 (ID, Item 1, Item 2): 999 X yy; 888 K dd.
Result of the outer join (ID, Item 1, Item 2): 111 A xx; 222 B yy; 333 C; 999 X; 888 K dd.
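
An outer join keeps every ID that occurs in either source. The sketch below also records where each record came from, since records found in only one source are the ones that need checking. The `origin` flag and the rule that values from the first source win on conflict are assumptions of this example, not something stated on the slide.

```python
def outer_join(left, right):
    """Union of all IDs from both sources, with an origin flag per record."""
    joined = {}
    for rid in sorted(set(left) | set(right)):
        if rid in left and rid in right:
            origin = "both"
        elif rid in left:
            origin = "left_only"
        else:
            origin = "right_only"
        # Values from `left` take precedence where both sources hold the same field.
        joined[rid] = {**right.get(rid, {}), **left.get(rid, {}), "origin": origin}
    return joined


source_1 = {
    "111": {"item_1": "A", "item_2": "xx"},
    "222": {"item_1": "B", "item_2": "yy"},
    "333": {"item_1": "C", "item_2": None},
}
source_2 = {
    "999": {"item_1": "X", "item_2": "yy"},
    "888": {"item_1": "K", "item_2": "dd"},
}
joined = outer_join(source_1, source_2)
```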

Matching of... identical registers over time.
Reference date 1 (ID, Item 1, Item 2): 111 A xx; 222 B yy; 333 C.
Reference date 2 (ID, Item 1, Item 2): 111 A xx; 222 B yy; 333 C.
Result (ID, Item 1, Item 2): 111 A xx; 222 B yy; 333 C.

Contents. Next section: Matching of address data.

Register of addresses: role of matching registers. Setup of the register. Support of the register (quality aspects): validation; up-to-dateness; completeness.

Register of addresses in the German Census. Covered all addresses with housing space and occupied living quarters. Two administrative data sources, combined by outer join: Federal Mapping Agency and population registers. Addresses included in only one data source were checked. Classification of addresses as "addresses with housing space".

Data acquisition: using registers in place. Geo-referenced address data: 21 million records, including geo-coordinates. Data of the residents' registration offices: 86 million records, containing demographic and geographical information.

Problems. No identification characteristic, so address characteristics serve as the ID. Local register data. Low standardisation of register entries. Low harmonisation between registers. Redundant, false or obsolete data entries. Consequence: complex data processing.

Setup of the register of addresses. Data checks. Preprocessing: decomposing the address data into address components; standardisation of the address information; aggregation of individual data sets. Harmonisation: referencing the street names at street level; adjustment of changing address identifiers. Merging/record linkage.

Challenges in using the address as a key variable. Decentralised administrative data in different registers -> no harmonised address format -> address unstable, changes not notified simultaneously in all registers. Example of one street name in two spellings: J.-F.-K.-Straße and John-F.-Ken.-Straße.

Standardisation of key variables. Standardisation is a necessary condition for completion and updating. Standardisation of street names: automated standardisation (capital letters; uniform abbreviations, e.g. street -> str, place -> pl; eliminating blanks); manual checks by the statistical offices of the Länder; thesaurus of street names; aggregation at street level.
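
The three automated rules on this slide (capital letters, uniform abbreviations, eliminating blanks) can be sketched in a few lines. The German suffix rules (STRASSE -> STR, PLATZ -> PL) and the removal of punctuation are assumptions chosen to mirror the slide's "street -> str, place -> pl" example; the rule set actually used in the census is not given in the transcript.

```python
import re

# Hypothetical abbreviation rules, applied to the end of the street name only.
ABBREVIATIONS = [("STRASSE", "STR"), ("PLATZ", "PL")]


def standardise_street(name: str) -> str:
    """Upper-case, strip blanks and punctuation, abbreviate common suffixes."""
    text = name.upper()  # Python's upper() also turns "ß" into "SS"
    text = re.sub(r"[^0-9A-ZÄÖÜ]", "", text)  # drop blanks, dots, hyphens
    for long_form, short_form in ABBREVIATIONS:
        if text.endswith(long_form):
            return text[: -len(long_form)] + short_form
    return text
```

With these rules `standardise_street("J.-F.-K.-Straße")` gives `"JFKSTR"` and `standardise_street("John-F.-Ken.-Straße")` gives `"JOHNFKENSTR"`. The two spellings still differ after automated standardisation, which is the gap that manual checks and a thesaurus of street names are meant to close.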

Thesaurus of street names: harmonisation of spellings. External source: postal code 38471, street name J.-F.-K.-Straße, standardised street name JOHNFKENNEDYSTR. Thesaurus entry for postal code 38471: the spellings J.-F.-K.-Straße and John-F.-Ken.-Straße both map to the standardised street name JOHNFKENNEDYSTR.
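
A thesaurus of this kind behaves like a lookup table from (postal code, observed spelling) to one standardised street name. The sketch below uses the single example from the slide; the table layout and the function name `harmonise` are hypothetical.

```python
# Lookup table built from the slide's example; a real thesaurus would be far larger.
THESAURUS = {
    ("38471", "J.-F.-K.-Straße"): "JOHNFKENNEDYSTR",
    ("38471", "John-F.-Ken.-Straße"): "JOHNFKENNEDYSTR",
}


def harmonise(postal_code, street_name):
    """Return the standardised street name, or None if the spelling is unknown."""
    return THESAURUS.get((postal_code, street_name))


same_street = harmonise("38471", "J.-F.-K.-Straße") == harmonise("38471", "John-F.-Ken.-Straße")
unknown = harmonise("38471", "Kochstraße")  # None: candidate for a manual check
```

Keying on the postal code as well as the spelling keeps identically abbreviated streets in different areas apart.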

Preparation and integration of register data (flow chart). GA and MR data -> pre-processing -> deterministic 1:1 matching method. Matching data go into the register. Non-matching data go to correction by the regional authorities; the corrected data are then integrated.

Two-stage correction model. Stage I, street level (within a municipal code, e.g. Street A, Street B): check criteria are existence and correctness. Stage II, address level (e.g. No. 1, No. 2 of each street): check criteria are existence, correctness and housing space.

Validation of addresses: quality aspect. Validated mass: addresses found in both data sources (GA and MR). Check for housing space: addresses found in only one data source.

Results: addresses to be checked for housing space (2011 Census). [Chart not reproduced in the transcript.]

Quality aspect: up-to-dateness. Coordination function -> keeping the register up to date. Address up-to-dateness: how unstable are the addresses, and how often are they updated? Changes to address variables are made at municipal level -> the address is unstable; when and how often it changes is not predictable.

Instability of the address (2010-2011): change of at least one variable, in percent, Germany. [Chart not reproduced in the transcript.]
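
The measure named in this chart title, the share of addresses with a change in at least one variable between two reference dates, could be computed along the following lines. The transcript does not define the denominator; this sketch assumes it is the set of addresses present at both dates, and the variable names and sample values are invented.

```python
def instability_percent(before, after):
    """Percentage of addresses present at both dates with at least one changed variable."""
    common = before.keys() & after.keys()
    if not common:
        return 0.0
    changed = sum(1 for address_id in common if before[address_id] != after[address_id])
    return 100.0 * changed / len(common)


# Invented example: address a1 is renamed, a2 is unchanged, a3 exists at one date only.
stock_2010 = {
    "a1": {"street": "KOCHSTR", "postal_code": "38471"},
    "a2": {"street": "MARKTPL", "postal_code": "38471"},
}
stock_2011 = {
    "a1": {"street": "JOHNFKENNEDYSTR", "postal_code": "38471"},
    "a2": {"street": "MARKTPL", "postal_code": "38471"},
    "a3": {"street": "NEUESTR", "postal_code": "38471"},
}
share = instability_percent(stock_2010, stock_2011)
```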

Keeping the register up to date. Integration of 5 different registers (e.g. population register) -> identical registers over time. Mismatches were checked by the statistical offices of the Länder for existence, correctness and renamings. Example of a renaming: old street name Kochstraße, new street name John-F.-Ken.-Straße.

Quality aspect: completeness. The register of addresses is the reference for the population. Sources of incompleteness: new buildings, demolition of residential buildings, incorrect data in registers. Completion by: registers (outer join); other survey components and information from other sources.

New addresses added to the register by data origin over time (2011 Census). Series shown: total, administrative registers, other findings. [Chart not reproduced in the transcript.] -> Most of the new addresses are based on register integration.

Conclusion. Core problem: decentralised administrative data, differing quality of register data and a missing ID. Updating and completing an unstable key variable is the major focus in the context of the register of addresses -> precondition: harmonisation/standardisation. Updating and completion of the register can mainly be achieved through register integration.

Contents. Next section: Matching of personal data sets.

Data acquisition and integration in Germany. Data acquisition: decentralised via the statistical offices of the Länder; two supplies around the census reference date. Integration: linking of the information on addresses; adding personal data records via the address ID; build-up of a temporary centralised population register for Germany.

Matching of different deliveries over time. Merging information: address; family name at birth and first name(s); sex; date of birth; place of birth. Results: confirm data sets; update data sets; add data sets.
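
The three outcomes on this slide (confirm, update, add) can be illustrated by comparing a second delivery against the first. The sketch assumes that the personal items form the match key and that the address is the attribute that may change between deliveries; the slide lists the address among the merging information without saying how it is used, so this split is an assumption, as are all field names.

```python
MATCH_KEY = ("birth_name", "first_names", "sex", "birth_date", "birth_place")


def compare_deliveries(first, second):
    """Sort the records of the second delivery into confirm / update / add."""
    index = {tuple(rec[k] for k in MATCH_KEY): rec for rec in first}
    outcome = {"confirm": [], "update": [], "add": []}
    for rec in second:
        previous = index.get(tuple(rec[k] for k in MATCH_KEY))
        if previous is None:
            outcome["add"].append(rec)      # person not seen in the first delivery
        elif previous["address"] == rec["address"]:
            outcome["confirm"].append(rec)  # same person, same address
        else:
            outcome["update"].append(rec)   # same person, address changed
    return outcome


def person(birth_name, first_names, address):
    """Helper for the invented example data; the other items are held constant."""
    return {"birth_name": birth_name, "first_names": first_names, "sex": "F",
            "birth_date": "1980-01-02", "birth_place": "BONN", "address": address}


delivery_1 = [person("MEIER", "ANNA", "A-1"), person("KOCH", "EVA", "A-2")]
delivery_2 = [person("MEIER", "ANNA", "A-1"), person("KOCH", "EVA", "A-9"),
              person("BAUER", "INA", "A-3")]
outcome = compare_deliveries(delivery_1, delivery_2)
```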

Reference data stock. Merging data sets from different sources (registers, surveys) without existing personal identification numbers. Merging information: family name at birth and first name(s), sex, date of birth, municipal code, post code, street name, house number.

Matching procedures. Deterministic process: including all items; omitting items step by step. Probabilistic process: similarity of items; probability of matching. For discussion: limitations? risks? challenges? chances?

Challenges. Matching process "step by step". Create subsets. Avoid false matches. Quality checks.

Contents. Next section: Confidentiality issues.

Data protection and confidentiality. Collection of personal data (names, date of birth, ...). Additional data only for the matching process. Create internal IDs. Limitations for quality checks. Prohibition on transmitting the data back to the administration.
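
One way to read "additional data only for the matching process" together with "create internal IDs" is to split every record into a linkage part and an analysis part that share nothing but a newly generated internal ID. This is an illustration under that reading, not the procedure of the Federal Statistical Office: the field names, the ID format and the choice of which items count as matching-only are assumptions.

```python
from itertools import count

# Hypothetical set of items that may be used for matching only.
MATCHING_ONLY = {"name", "birth_date"}


def split_records(records):
    """Return (linkage, analysis): two lists joined only by an internal ID."""
    ids = count(1)
    linkage, analysis = [], []
    for rec in records:
        internal_id = f"P{next(ids):08d}"  # sequential ID, carries no personal information
        linkage.append({"internal_id": internal_id,
                        **{k: v for k, v in rec.items() if k in MATCHING_ONLY}})
        analysis.append({"internal_id": internal_id,
                         **{k: v for k, v in rec.items() if k not in MATCHING_ONLY}})
    return linkage, analysis


raw = [{"name": "MEIER ANNA", "birth_date": "1980-01-02", "sex": "F", "municipality": "06414000"}]
linkage, analysis = split_records(raw)
```

The linkage part can then be kept separately and deleted once matching is complete, while the analysis part no longer contains names or dates of birth.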

Thank you for your attention! Stephanie Hirner, stephanie.hirner@destatis.de