Matching registers without direct identifiers and confidentiality issues
Stephanie Hirner
ESTP "Administrative data and censuses", Wiesbaden, 22-24 May 2018
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
Contents
- Types of matching procedures
- Matching of address data
- Matching of personal data sets
- Confidentiality issues
© Federal Statistical Office of Germany | Census 10.12.2019
Types of matching procedures
Matching via …
- Identifiers
- Identical items
- Similar items
e.g.
- Addresses: address ID, postal code, street name, street number
  - Street name: original and standardised
- Personal data: personal ID, name, sex, date of birth, place of birth
  - Birth name versus family name
Matching process
- Preprocessing
  - Parsing
  - Standardisation
- Deterministic process
  - Including all items
  - Omit items step by step
- Probabilistic process
  - Similarity of items
  - Fuzzy merge
  - Probability of matching
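The deterministic process ("including all items", then "omit items step by step") can be sketched as follows. This is a minimal illustration, not the procedure used in the census: the function name `stepwise_match`, the field names and the rule of accepting only unambiguous candidates are assumptions. Items are dropped from the end of `keys`, so the list is assumed to be ordered from most to least reliable.

```python
def stepwise_match(left, right, keys, min_keys=2):
    """Link records on all items first, then retry the rest with fewer items."""
    matches = []          # (left record, right record, number of items used)
    unmatched = list(left)
    used_right = set()    # positions in `right` that are already linked

    for n_items in range(len(keys), min_keys - 1, -1):
        active = keys[:n_items]
        # Index the still-unlinked right-hand records by the active items.
        index = {}
        for pos, rec in enumerate(right):
            if pos not in used_right:
                index.setdefault(tuple(rec[k] for k in active), []).append(pos)

        still_unmatched = []
        for rec in unmatched:
            candidates = [p for p in index.get(tuple(rec[k] for k in active), [])
                          if p not in used_right]
            if len(candidates) == 1:   # accept unambiguous links only
                used_right.add(candidates[0])
                matches.append((rec, right[candidates[0]], n_items))
            else:
                still_unmatched.append(rec)
        unmatched = still_unmatched
    return matches, unmatched


# Hypothetical sample records.
KEYS = ["name", "sex", "dob", "pob"]
LEFT = [
    {"name": "MEIER", "sex": "F", "dob": "1980-01-01", "pob": "BONN"},
    {"name": "SCHMIDT", "sex": "M", "dob": "1975-05-05", "pob": "KOELN"},
]
RIGHT = [
    {"name": "MEIER", "sex": "F", "dob": "1980-01-01", "pob": "BONN"},
    {"name": "SCHMIDT", "sex": "M", "dob": "1975-05-05", "pob": "COLOGNE"},
]
MATCHES, UNMATCHED = stepwise_match(LEFT, RIGHT, KEYS)
```

In this sample the first record links on all four items, the second only after the place of birth has been omitted. Recording the number of items used per link makes it possible to review the weaker links separately.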
Probabilistic methods: examples
- SPEDIS
  - "Determines the likelihood of two words matching, expressed as the asymmetric spelling distance between the two words" (see SAS documentation, "SPEDIS Function")
- Jaro-Winkler similarity
  - Measure of similarity between two strings; uses the number of matching characters and the number of transpositions
- Sources of error
  - False match
  - Missing match
SPEDIS
Method
- Comparison of items (e.g. names)
- Identification of "costs" to transform one value into the target word
- Weighting by using the length of the string
- Transformation in both directions
Results
- Probability of correct matching
Jaro-Winkler
Method
- Comparison of items (e.g. names)
- Weighting of identical characters in the compared words
- Higher weight for agreement at the beginning of the word
Results
- Probability of correct matching
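For illustration, a plain-Python sketch of the Jaro-Winkler similarity in its commonly published form. The prefix weight `p=0.1` and the maximum prefix length of 4 are the usual defaults and are assumptions here, not values taken from the slides. The result is a similarity score between 0 and 1; turning it into a match decision requires a threshold chosen by the user.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity with a bonus for a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1 - base)
```

Usage: `jaro_winkler("MEYER", "MEIER")` scores higher than `jaro("MEYER", "MEIER")` because both spellings share the prefix "ME".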
Matching of …: addition of items
Data source 1:
  ID   Item 1
  111  A xx 14 mLx
  222  B yy 12 pQn
  333  C 00 sFc
Data source 2:
  ID   Item A
  111  C34
  222  F76
  333  A94
Result: Item A is added to data source 1 via the ID
  ID   Item 1        Item A
  111  A xx 14 mLx   C34
  222  B yy 12 pQn   F76
  333  C 00 sFc      A94
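A minimal sketch of the "addition of items" case: every record of data source 1 is kept and one item from data source 2 is attached via the ID. The function name `add_item` and the field names are hypothetical.

```python
def add_item(records, lookup, new_field):
    """Attach lookup[id] to every record; IDs missing in lookup get None."""
    return {
        rec_id: {**fields, new_field: lookup.get(rec_id)}
        for rec_id, fields in records.items()
    }


# Values taken from the slide; field names are assumed.
SOURCE_1 = {"111": {"item_1": "A"}, "222": {"item_1": "B"}, "333": {"item_1": "C"}}
SOURCE_2 = {"111": "C34", "222": "F76", "333": "A94"}

ENRICHED = add_item(SOURCE_1, SOURCE_2, "item_a")
```

No record is added or dropped; records of data source 1 without a partner in data source 2 simply carry an empty item.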
Matching of …: outer join
Data source 1:
  ID   Item 1  Item 2
  111  A       xx
  222  B       yy
  333  C
Data source 2:
  ID   Item 1  Item 2
  999  X       yy
  888  K       dd
Result of the outer join:
  ID   Item 1  Item 2
  111  A       xx
  222  B       yy
  333  C
  999  X       yy
  888  K       dd
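A minimal sketch of an outer join on the ID, keeping every record that appears in at least one source. The origin flags `in_source_1` and `in_source_2` are an assumption added for illustration; they make it easy to single out records found in only one source.

```python
def outer_join(source_1, source_2):
    """Union of both sources by ID, with flags recording where each ID occurs."""
    ids = list(source_1) + [rec_id for rec_id in source_2 if rec_id not in source_1]
    joined = {}
    for rec_id in ids:
        joined[rec_id] = {
            **source_1.get(rec_id, {}),
            **source_2.get(rec_id, {}),   # on conflict, source 2 wins in this sketch
            "in_source_1": rec_id in source_1,
            "in_source_2": rec_id in source_2,
        }
    return joined


# Values taken from the slide; field names are assumed.
SOURCE_1 = {
    "111": {"item_1": "A", "item_2": "xx"},
    "222": {"item_1": "B", "item_2": "yy"},
    "333": {"item_1": "C"},
}
SOURCE_2 = {
    "999": {"item_1": "X", "item_2": "yy"},
    "888": {"item_1": "K", "item_2": "dd"},
}
JOINED = outer_join(SOURCE_1, SOURCE_2)
ONLY_ONE_SOURCE = [i for i, r in JOINED.items() if r["in_source_1"] != r["in_source_2"]]
```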
Matching of …: identical registers over time
Reference date 1, reference date 2 and the matched result show the same records:
  ID   Item 1  Item 2
  111  A       xx
  222  B       yy
  333  C
Matching of address data
Register of addresses
- Matching registers
  - Setup of the register
  - Support of the register
- Quality aspects
  - Validation
  - Quality aspect: up-to-dateness
  - Quality aspect: completeness
Register of addresses in the German Census
- Covered all addresses with housing space and occupied living quarters
- 2 administrative data sources -> outer join
  - Federal Mapping Agency
  - Population registers
- Checking of addresses if only included in one data source
- Classification of addresses as "addresses with housing space"
Data acquisition: using registers in place
- Geo-referenced address data
  - Records: 21 million
  - Including geo-coordinates
- Data of residents' registration offices
  - Records: 86 million
  - Contain demographic and geographical information
Problems
- No identification characteristics
- Address characteristic as ID
- Local register data
- Low standardisation of register entries
- Low harmonisation between registers
- Redundant/false/obsolete data entries
- Complex data processing
Setup of the register of addresses
- Data checks
- Preprocessing
  - Decomposing the address data into address components
  - Standardisation of the address information
  - Aggregation of individual data sets
- Harmonisation
  - Referencing the street names at street level
  - Adjustment of changing address identifiers
- Merging/record linkage
Challenges in using the address as a key variable
- Decentralised administrative data, different registers
  -> No harmonised address format
  -> Address unstable, changes not notified simultaneously in all registers
- Example: the same street name recorded as "J.-F.-K.-Straße" and as "John-F.-Ken.-Straße"
Standardisation of key variables
- Necessary condition for completion and updating: standardisation
- Standardisation of street names
  - Automated standardisation
    - Capital letters
    - Uniform abbreviations (street -> str, place -> pl)
    - Eliminating blanks
  - Manual checks by the statistical offices of the Länder
- Thesaurus of street names
- Aggregation on street level
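The automated standardisation steps (capital letters, uniform abbreviations, eliminating blanks) might look like the following sketch. The German suffixes in `SUFFIX_ABBREVIATIONS` and the removal of hyphens and full stops are assumptions, inferred from the standardised form JOHNFKENNEDYSTR; the actual rule set is not given in the slides.

```python
import re

# Assumed abbreviation table: only "street -> str" and "place -> pl"
# are named on the slide; the German long forms are hypothetical.
SUFFIX_ABBREVIATIONS = {"STRASSE": "STR", "PLATZ": "PL"}


def standardise_street_name(name: str) -> str:
    """Upper-case, strip blanks and punctuation, abbreviate the suffix."""
    text = name.upper()                        # "ß" becomes "SS" here
    text = re.sub(r"[^A-Z0-9ÄÖÜ]", "", text)   # drop blanks, hyphens, full stops
    for long_form, short_form in SUFFIX_ABBREVIATIONS.items():
        if text.endswith(long_form):
            return text[: -len(long_form)] + short_form
    return text
```

Usage: `standardise_street_name("John-F.-Kennedy-Straße")` and `standardise_street_name("john f. kennedy str.")` both yield the same key. Abbreviated given names such as "J.-F.-K.-Straße" are not resolved by rules like these, which is where manual checks and a thesaurus come in.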
Thesaurus of street names: harmonisation of spellings
External source:
  postal code  street name
  38471        J.-F.-K.-Straße
Thesaurus of street names:
  postal code  street name          standardised street name
  38471        J.-F.-K.-Straße      JOHNFKENNEDYSTR
  38471        John-F.-Ken.-Straße  JOHNFKENNEDYSTR
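The thesaurus can be thought of as a lookup table from (postal code, spelling variant) to the standardised street name. A minimal sketch using the two entries shown above; the dictionary layout and the function name `harmonise` are assumptions.

```python
# Entries taken from the slide; the data structure itself is hypothetical.
THESAURUS = {
    ("38471", "J.-F.-K.-Straße"): "JOHNFKENNEDYSTR",
    ("38471", "John-F.-Ken.-Straße"): "JOHNFKENNEDYSTR",
}


def harmonise(postal_code: str, street_name: str):
    """Return the standardised street name, or None if the spelling is unknown."""
    return THESAURUS.get((postal_code, street_name))
```

A `None` result marks a spelling that still needs a manual check before it can be added to the thesaurus. Including the postal code in the key keeps identically abbreviated street names in different areas apart.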
Preparation and integration of register data (flow diagram)
- Input data: GA, MR
- Pre-processing
- Deterministic 1:1 matching method
  - Matching data -> register
  - Non-matching data -> correction (regional authorities) -> corrected data
Two-stage correction model
- Municipal code
  - I. Street level (Street A, Street B); check criterion: existence, correctness
  - II. Address level (No. 1, No. 2 for each street); check criterion: existence, correctness, housing space
Validation of addresses: quality aspect
- Validated mass: addresses contained in both data sources (GA and MR)
- Check for housing space: address in only one data source
Results: addresses to be checked for housing space (2011 Census)
Quality aspect: up-to-dateness
- Coordination function -> keeping the register up to date
- Address up-to-dateness: How unstable are the addresses? How often do they have to be updated?
- Changes to address variables are made at municipal level -> the address is unstable; when and how often it changes is not predictable
Instability of the address (2010-2011): change of at least one variable, in percent (Germany)
Keeping the register up to date
- Integration of 5 different registers (e.g. population register) -> identical registers over time
- Mismatches: checked by the statistical offices of the Länder
  -> existence
  -> correctness
  -> renamings
- Example of a renaming: old street name Kochstraße, new street name John-F.-Ken.-Straße
Quality aspect: completeness
- Register of addresses = reference for population
- New buildings, demolition of residential buildings, incorrect data in registers
- Completion by:
  - Registers -> outer join
  - Other survey components, information from other sources
New addresses added to the register by data origin over time (2011 Census)
- Series shown: total, administrative registers, other findings
-> most of the new addresses based on register integration
Conclusion
- Decentralised administrative data, differing quality of register data and missing ID = core problem
- Updating and completing an unstable key variable is the major focus in the context of the register of addresses -> precondition: harmonisation/standardisation
- Updating and completion of the register can mainly be achieved through register integration
Matching of personal data sets
Data acquisition and integration in Germany
- Data acquisition
  - Decentralised via the statistical offices of the Länder
  - Two data deliveries around the census reference date
- Integration
  - Linking of the information on addresses
  - Adding personal data records via the address ID
  - Build-up of a temporary centralised population register for Germany
Matching of different deliveries over time
- Merging information
  - Address
  - Family name at birth and first name(s), sex, date of birth, place of birth
- Results
  - Confirm data sets
  - Update data sets
  - Add data sets
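The three results (confirm, update, add) can be sketched as a comparison of two deliveries on the personal merging items. This is an illustration under assumptions: the field names are hypothetical, and "update" is interpreted here as "same person, different address", which the slides do not spell out.

```python
# Hypothetical field names for the merging information listed above.
PERSON_ITEMS = ("birth_name", "first_names", "sex", "date_of_birth", "place_of_birth")


def person_key(record):
    """Combine the personal merging items into one comparison key."""
    return tuple(record[item] for item in PERSON_ITEMS)


def compare_deliveries(first_delivery, second_delivery):
    """Classify each record of the second delivery against the first."""
    known = {person_key(rec): rec for rec in first_delivery}
    result = {"confirm": [], "update": [], "add": []}
    for rec in second_delivery:
        previous = known.get(person_key(rec))
        if previous is None:
            result["add"].append(rec)        # person not seen before
        elif previous["address_id"] == rec["address_id"]:
            result["confirm"].append(rec)    # same person, same address
        else:
            result["update"].append(rec)     # same person, address changed
    return result
```

The sketch compares the items exactly; spelling variants in names would first have to go through standardisation or a similarity measure.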
Reference data stock
- Merging data sets from different sources (registers, surveys) without existing personal identification numbers
- Merging information: family name at birth and first name(s), sex, date of birth, municipal code, post code, street name, house number
Matching procedures
- Deterministic process
  - Including all items
  - Omit items step by step
- Probabilistic process
  - Similarity of items
  - Probability of matching
Limitations? Risks? Challenges? Opportunities?
Challenges
- Matching process "step by step"
- Create subsets
- Avoid false matches
- Quality checks
Confidentiality issues
Data protection and confidentiality
- Collection of personal data
  - Names, date of birth, …
  - Additional data only for matching process
- Create internal IDs
- Limitations for quality checks
- Prohibition on transmitting the data back to the administration
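One way to picture "create internal IDs": the identifying items are moved into a separate key table used only for matching, and all other processing works on the internal ID. This is a sketch under assumptions, not the procedure of the statistical offices; the field names in `IDENTIFYING_ITEMS` and the sequential numbering are hypothetical.

```python
from itertools import count

# Hypothetical names of the items collected only for the matching process.
IDENTIFYING_ITEMS = ("name", "date_of_birth")


def assign_internal_ids(records):
    """Split records into a key table (identifiers) and an analysis table."""
    next_id = count(1)
    key_table, analysis_table = [], []
    for rec in records:
        internal_id = next(next_id)
        key_table.append(
            {"internal_id": internal_id,
             **{item: rec[item] for item in IDENTIFYING_ITEMS}}
        )
        analysis_table.append(
            {"internal_id": internal_id,
             **{k: v for k, v in rec.items() if k not in IDENTIFYING_ITEMS}}
        )
    return key_table, analysis_table
```

Once matching is finished, the key table can be stored under stricter access rules or deleted, while the analysis table no longer contains names or dates of birth.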
Thank you for your attention!
Stephanie Hirner
stephanie.hirner@destatis.de