Published by Şebnem Eriş. Modified over 5 years ago.
1
Matching registers without direct identifiers and confidentiality issues
Stephanie Hirner
ESTP "Administrative data and censuses"
Wiesbaden, 22–24 May 2018
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
2
Contents
- Types of matching procedures
- Matching of address data
- Matching of personal data sets
- Confidentiality issues
© Federal Statistical Office of Germany | Census
3
Contents
- Types of matching procedures
- Matching of address data
- Matching of personal data sets
- Confidentiality issues
4
Matching via …
e.g.           Identifiers   Identical items   Similar items
Addresses
Personal data
5
Matching via …
e.g.           Identifiers   Identical items                             Similar items
Addresses      Address ID    Postal code, street name, street number     Street name: original and standardised
Personal data  Personal ID   Name, sex, date of birth, place of birth    Birth name versus family name
6
Matching process
Preprocessing
- Parsing
- Standardisation
Deterministic process
- Including all items
- Omit items step by step
Probabilistic process
- Similarity of items
- Fuzzy merge
- Probability of matching
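The deterministic process (all items first, then omitting items step by step) can be sketched as follows. This is a minimal illustration, not the procedure used in the census: the item names, their order, the `min_items` floor and the rule that only a unique hit counts as a match are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical matching items; the slide does not fix a concrete list or order.
ITEMS = ["name", "sex", "date_of_birth", "place_of_birth"]


def deterministic_match(record: dict, candidates: list[dict],
                        items: list[str] = ITEMS,
                        min_items: int = 2) -> Optional[tuple[dict, int]]:
    """Return (candidate, number_of_items_used) for the first unique match.

    Pass 1 compares all items; each later pass omits the last remaining item.
    A pass only counts as a match if exactly one candidate agrees on every
    compared item, so ambiguous hits are never linked.
    """
    for n in range(len(items), min_items - 1, -1):
        key_items = items[:n]
        hits = [c for c in candidates
                if all(c.get(k) == record.get(k) for k in key_items)]
        if len(hits) == 1:
            return hits[0], n
    return None
```

Returning the number of items used makes it possible to treat matches from later, weaker passes with more caution in subsequent quality checks.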
7
Probabilistic methods - examples
SPEDIS
- "Determines the likelihood of two words matching, expressed as the asymmetric spelling distance between the two words" (see SAS documentation, "SPEDIS Function")
Jaro-Winkler similarity
- Measure of similarity between two strings; uses the number of matching characters and the number of transpositions
Sources of error
- False match
- Missing match
8
SPEDIS
Method
- Comparison of items (e.g. names)
- Identification of "costs" to transform one value into the target word
- Weighting by the length of the string
- Transformation in both directions
Results
- Probability of correct matching
9
Jaro-Winkler
Method
- Comparison of items (e.g. names)
- Weighting of identical characters in the compared words
- Higher weight for agreement at the beginning of the word
Results
- Probability of correct matching
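For illustration, the standard textbook definition of the Jaro-Winkler similarity can be written in a few lines. The prefix weight `p = 0.1` and the maximum prefix length of 4 are the commonly used defaults and are assumptions here, since the slides do not state parameters. The result is a score between 0 and 1; using it as a matching probability requires a threshold or further calibration.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity with a bonus for a common prefix (start of the word)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

For the classic example pair `"MARTHA"` and `"MARHTA"`, all six characters match with one transposition, and the shared prefix `"MAR"` raises the score slightly above the plain Jaro value.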
10
Matching of…
Data source 1
ID   Item 1  (further items)
111  A       xx  14  mLx
222  B       yy  12  pQn
333  C           00  sFc

Data source 2
ID   Item A
111  C34
222  F76
333  A94

Addition of items: Item A (C34, F76, A94) from data source 2 is added to the records of data source 1 via the ID.
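The "addition of items" case corresponds to a left join on the ID. A minimal sketch with plain dictionaries, using the IDs and the Item A values from the slide; the field names `item_1` and `item_a` and the function name are chosen for this example only.

```python
# Data source 1 and 2, keyed by ID (values taken from the slide example).
source1 = {111: {"item_1": "A"}, 222: {"item_1": "B"}, 333: {"item_1": "C"}}
source2 = {111: {"item_a": "C34"}, 222: {"item_a": "F76"}, 333: {"item_a": "A94"}}


def add_items(base: dict, extra: dict) -> dict:
    """Left join: keep every record of base, add extra's items where the ID exists."""
    return {rec_id: {**items, **extra.get(rec_id, {})}
            for rec_id, items in base.items()}


enriched = add_items(source1, source2)
```

Records of data source 1 without a partner in data source 2 are kept unchanged, so the number of records never changes in this type of matching.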
11
Matching of…
Data source 1
ID   Item 1  Item 2
111  A       xx
222  B       yy
333  C

Data source 2
ID   Item 1  Item 2
999  X       yy
888  K       dd

Outer join
ID   Item 1  Item 2
111  A       xx
222  B       yy
333  C
999  X       yy
888  K       dd
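An outer join keeps every ID that occurs in at least one of the two sources. A minimal sketch with dictionaries keyed by ID; the function name and the rule that the second source wins when both sources carry the same ID are assumptions for this example.

```python
def outer_join(left: dict, right: dict) -> dict:
    """Full outer join on ID: every ID from either source appears exactly once.

    If an ID exists in both sources, items from `right` overwrite items of the
    same name from `left` (an arbitrary choice made for this sketch).
    """
    result = {}
    for rec_id in list(left) + [i for i in right if i not in left]:
        result[rec_id] = {**left.get(rec_id, {}), **right.get(rec_id, {})}
    return result
```

With the two example sources from the slide (IDs 111, 222, 333 and IDs 999, 888), the result contains all five records.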
12
Matching of…
Reference date 1
ID   Item 1  Item 2
111  A       xx
222  B       yy
333  C

Reference date 2
ID   Item 1  Item 2
111  A       xx
222  B       yy
333  C

Identical registers over time
ID   Item 1  Item 2
111  A       xx
222  B       yy
333  C
13
Contents
- Types of matching procedures
- Matching of address data
- Matching of personal data sets
- Confidentiality issues
14
Register of addresses
Matching registers
- Setup of the register
- Support of the register
Quality aspects
- Validation
- Quality aspect: up-to-dateness
- Quality aspect: completeness
15
Register of addresses in the German Census
- Covered all addresses with housing space and occupied living quarters
- 2 administrative data sources -> outer join
  - Federal Mapping Agency
  - Population registers
- Checking of addresses that are included in only one data source
- Classification of addresses as "addresses with housing space"
16
Data acquisition: using registers in place
Geo-referenced address data
- records: 21 million
- including geo-coordinates
Data of residents' registration offices
- records: 86 million
- contains demographic and geographical information
17
Problems
- No identification characteristics
- Address characteristic as ID
- Local register data
- Low standardisation of register entries
- Low harmonisation between registers
- Redundant/false/obsolete data entries
- Complex data processing
18
Setup of the register of addresses
- Data checks
- Preprocessing
  - Decomposing the address data into address components
  - Standardisation of the address information
  - Aggregation of individual data sets
- Harmonisation
  - Referencing the street names at street level
  - Adjustment of changing address identifiers
- Merging/record linkage
19
Challenges in using the address as a key variable
- Decentralised administrative data, different registers
  -> No harmonised address format
  -> Address unstable, changes not notified simultaneously in all registers
- Example (street name): J.-F.-K.-Straße / John-F.-Ken.-Straße
20
Standardisation of key variables
- Necessary condition for completion and updating: standardisation
- Standardisation of street names
  - Automated standardisation
    - capital letters
    - uniform abbreviations (street -> str, place -> pl)
    - eliminating blanks
  - Manual checks by the statistical offices of the Länder
- Thesaurus of street names
- Aggregation on street level
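The automated standardisation rules (capital letters, uniform abbreviations, eliminating blanks) can be sketched as a small function. This is an illustration under assumptions: the abbreviation table uses the German suffixes "Straße" and "Platz", and hyphens and dots are removed as well, which is inferred from the standardised form JOHNFKENNEDYSTR shown in the presentation rather than stated as a rule.

```python
import re

# Assumed abbreviation table; the slide only names street -> str and place -> pl.
ABBREVIATIONS = {"STRASSE": "STR", "PLATZ": "PL"}


def standardise_street(name: str) -> str:
    """Standardise a street name: upper case, no separators, uniform suffix."""
    s = name.upper()                     # capital letters ("ß" becomes "SS")
    s = re.sub(r"[^A-ZÄÖÜ0-9]", "", s)   # eliminate blanks, hyphens and dots
    for long_form, short_form in ABBREVIATIONS.items():
        if s.endswith(long_form):
            s = s[: -len(long_form)] + short_form
            break
    return s
```

Rules of this kind bring different writings of the same spelling together, but an abbreviated entry such as "J.-F.-K.-Straße" still ends up as "JFKSTR"; cases like that need manual checks or a lookup table.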
21
Thesaurus of street names: harmonisation of spellings
External source
postal code  street name       standardised street name
38471        J.-F.-K.-Straße   JOHNFKENNEDYSTR

Thesaurus of street names
postal code  street name           standardised street name
38471        J.-F.-K.-Straße       JOHNFKENNEDYSTR
38471        John-F.-Ken.-Straße   JOHNFKENNEDYSTR
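A thesaurus of this kind is essentially a lookup table from the delivered spelling to the standardised street name. The sketch below uses the two spellings from the slide; keying the table by postal code plus spelling and returning `None` for unknown spellings (so they can go to a manual check) are assumptions for this example.

```python
# Thesaurus keyed by (postal code, spelling as delivered) -> standardised name.
THESAURUS = {
    ("38471", "J.-F.-K.-Straße"): "JOHNFKENNEDYSTR",
    ("38471", "John-F.-Ken.-Straße"): "JOHNFKENNEDYSTR",
}


def harmonise(postal_code: str, street_name: str, thesaurus: dict = THESAURUS):
    """Return the standardised name, or None if the spelling is not yet known."""
    return thesaurus.get((postal_code, street_name))
```

Both spellings resolve to the same standardised street name and can therefore be matched deterministically afterwards.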
22
Preparation and integration of register data
[Flow diagram]
- Inputs: GA (geo-referenced address data), MR (population register data)
- Steps: pre-processing -> deterministic 1:1 matching method
- Matching data -> register
- Non-matching data -> correction (regional authorities) -> corrected data
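A deterministic 1:1 match that separates matching from non-matching data might look like the sketch below. The key fields are hypothetical names for already standardised address components, and the rule that a key must be unique in both sources to count as a match is an assumption about what "1:1" means here.

```python
from collections import defaultdict


def one_to_one_match(source_a, source_b,
                     key_fields=("postal_code", "street", "house_no")):
    """Split records into 1:1 matches and records that need correction.

    A key matches only if it occurs exactly once in each source; everything
    else (missing partner or duplicate key) goes to the non-matching list.
    """
    def index(records):
        idx = defaultdict(list)
        for rec in records:
            idx[tuple(rec[f] for f in key_fields)].append(rec)
        return idx

    idx_a, idx_b = index(source_a), index(source_b)
    matched, unmatched = [], []
    for key in idx_a.keys() | idx_b.keys():
        a, b = idx_a.get(key, []), idx_b.get(key, [])
        if len(a) == 1 and len(b) == 1:
            matched.append((a[0], b[0]))
        else:
            unmatched.extend(a + b)
    return matched, unmatched
```

The non-matching list corresponds to the data that would be passed on for correction and then matched again.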
23
Two-stage correction model
Level               Units (within one municipal code)   Check criterion
I. Street level     Street A, Street B                   Existence, correctness
II. Address level   No. 1, No. 2 (for each street)       Existence, correctness, housing space
24
Validation of addresses – quality aspect
- Validated mass: addresses contained in both data sources
- Check for housing space: addresses contained in only one data source
- Data sources shown: GA, MR
25
Results: addresses to be checked for housing space (2011 Census)
26
Quality aspect: up-to-dateness
- Coordination function -> keeping the register up to date
- Address up-to-dateness = How unstable are the addresses? How often do they have to be updated?
- Changes to address variables at municipal level -> the address is unstable; when and how often it changes is not predictable
27
Instability of the address (2010-2011): change of at least one variable in percent
[Figure: Germany]
28
Keeping the register up to date
- Integration of 5 different registers (e.g. population register) -> identical registers over time
- Mismatches: checked by the statistical offices of the Länder for
  -> existence
  -> correctness
  -> renamings
- Example of a renaming: old street name Kochstraße, new street name John-F.-Ken.-Straße
29
Quality aspect: completeness
- Register of addresses = reference for population
- New buildings, demolition of residential buildings, incorrect data in registers
- Completion by:
  - Registers -> outer join
  - Other survey components, information from other sources
30
New addresses added to the register by data origin over time (2011 Census)
[Chart series: total, administrative registers, other findings]
-> most of the new addresses are based on register integration
31
Conclusion
- Decentralised administrative data, differing quality of register data and missing ID = core problem
- Updating and completing an unstable key variable is the major focus in the context of the register of addresses -> precondition: harmonisation/standardisation
- Updating and completion of the register can mainly be achieved through register integration
32
Contents
- Types of matching procedures
- Matching of address data
- Matching of personal data sets
- Confidentiality issues
33
Data acquisition and integration in Germany
- Decentralised via the statistical offices of the Länder
- Two data deliveries around the census reference date
- Integration
  - Linking of the information on addresses
  - Adding personal data records via the address ID
  - Build-up of a temporary centralised population register for Germany
34
Matching of different deliveries over time
Merging information
- Address
- Family name at birth and first name(s), sex, date of birth, place of birth
Results
- Confirm data sets
- Update data sets
- Add data sets
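The three results (confirm, update, add) can be illustrated with a simple classification of a later delivery against the existing data stock. The split used here is an assumption made for the example: the personal items form the matching key and the address is the item that may change between deliveries. The field and function names are hypothetical.

```python
# Hypothetical field names for the personal merging information.
PERSON_KEY = ("birth_name", "first_names", "sex", "date_of_birth", "place_of_birth")


def integrate_delivery(stock: dict, delivery: list) -> dict:
    """Apply a delivery to the stock and count confirm/update/add actions.

    stock maps a person key (tuple of PERSON_KEY values) to the stored record.
    """
    actions = {"confirm": 0, "update": 0, "add": 0}
    for rec in delivery:
        key = tuple(rec[f] for f in PERSON_KEY)
        if key not in stock:
            action = "add"        # person not seen in an earlier delivery
        elif stock[key]["address"] == rec["address"]:
            action = "confirm"    # same person, same address
        else:
            action = "update"     # same person, address has changed
        if action != "confirm":
            stock[key] = rec
        actions[action] += 1
    return actions
```

An exact key comparison like this is only the deterministic first step; records left over as "add" may still be the same person with a differing spelling and would be candidates for the probabilistic methods.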
35
Reference data stock
- Merging data sets from different sources (registers, surveys) without existing personal identification numbers
- Merging information: family name at birth and first name(s), sex, date of birth, municipal code, post code, street name, house number
36
Matching procedures
Deterministic process
- Including all items
- Omit items step by step
Probabilistic process
- Similarity of items
- Probability of matching
Limitations? Risks? Opportunities? Challenges?
37
Challenges
- Matching process "step by step"
- Create subsets
- Avoid false matches
- Quality checks
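One common way to read "create subsets" is blocking: records are only compared within subsets that share a stable item, which reduces both the number of comparisons and the risk of false matches. The slides do not say how subsets were formed, so the blocking item `municipal_code` and the function below are purely illustrative.

```python
from collections import defaultdict
from itertools import product


def candidate_pairs(records_a, records_b, block_key="municipal_code"):
    """Yield only those record pairs that fall into the same subset (block)."""
    blocks_a, blocks_b = defaultdict(list), defaultdict(list)
    for rec in records_a:
        blocks_a[rec[block_key]].append(rec)
    for rec in records_b:
        blocks_b[rec[block_key]].append(rec)
    # Compare only within blocks that exist in both sources.
    for key in blocks_a.keys() & blocks_b.keys():
        yield from product(blocks_a[key], blocks_b[key])
```

The pairs produced this way would then be passed to the deterministic or probabilistic comparison; records whose blocking item is itself wrong are missed, which is one reason for additional quality checks.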
38
Contents
- Types of matching procedures
- Matching of address data
- Matching of personal data sets
- Confidentiality issues
39
Data protection and confidentiality
- Collection of personal data
  - Names, date of birth, …
  - Additional data only for the matching process
- Create internal IDs
- Limitations for quality checks
- Prohibition on transmitting the data back to the administration
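The idea of creating internal IDs can be illustrated as follows: once records have been matched, the identifying items are replaced by a meaningless running number. This is only a sketch of the general technique, not the procedure of the statistical offices; the field names and the choice of a simple counter are assumptions.

```python
import itertools

# Hypothetical identifying items that are needed only for the matching process.
IDENTIFYING_FIELDS = ("name", "date_of_birth")


def pseudonymise(records, identifying_fields=IDENTIFYING_FIELDS):
    """Replace identifying items by an internal ID and drop the items themselves.

    Records with identical identifying items receive the same internal ID.
    The mapping from identifiers to IDs is not returned, so it is gone once
    the function has finished.
    """
    counter = itertools.count(1)
    key_to_id = {}
    output = []
    for rec in records:
        key = tuple(rec[f] for f in identifying_fields)
        if key not in key_to_id:
            key_to_id[key] = next(counter)
        cleaned = {k: v for k, v in rec.items() if k not in identifying_fields}
        cleaned["internal_id"] = key_to_id[key]
        output.append(cleaned)
    return output
```

Dropping the identifying items is also what limits later quality checks: a doubtful link can no longer be verified against names or dates of birth.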
40
Thank you for your attention!
Stephanie Hirner