Stephanie Hirner: ESTP "Administrative data and censuses"


1 Matching registers without direct identifiers and confidentiality issues
Stephanie Hirner
ESTP "Administrative data and censuses", Wiesbaden, 22–24 May 2018
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

2 Contents
Types of matching procedures
Matching of address data
Matching of personal data sets
Confidentiality issues
© Federal Statistical Office of Germany | Census

3 Contents
Types of matching procedures
Matching of address data
Matching of personal data sets
Confidentiality issues

4 Matching via …
Identifiers, identical items, similar items
e.g. for addresses and personal data

5 Matching via …
e.g.            Identifiers   Identical items                            Similar items
Addresses       Address ID    Postal code, street name, street number    Street name: original and standardised
Personal data   Personal ID   Name, sex, date of birth, place of birth   Birth name versus family name

6 Matching process
Preprocessing
  Parsing
  Standardisation
Deterministic process
  Including all items
  Omit items step by step
Probabilistic process
  Similarity of items
  Fuzzy merge
  Probability of matching
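The deterministic process on this slide ("including all items", then "omit items step by step") could be sketched as follows. This is a minimal illustration, not the procedure used in the census: the record layout, the field names and the order in which items are dropped are assumptions made for the example.

```python
from collections import defaultdict

def stepwise_match(source_a, source_b, key_sets):
    """Link records pass by pass; each pass uses one tuple of items as key.

    key_sets is ordered from strictest (all items) to loosest. A pair is
    linked only if the key is unique in both sources among the records
    that are still unlinked, which limits false matches.
    """
    links = {}          # id in source_a -> (id in source_b, pass number)
    linked_b = set()
    for pass_no, keys in enumerate(key_sets, start=1):
        index_a, index_b = defaultdict(list), defaultdict(list)
        for rec in source_a:
            if rec["id"] not in links:
                index_a[tuple(rec[k] for k in keys)].append(rec["id"])
        for rec in source_b:
            if rec["id"] not in linked_b:
                index_b[tuple(rec[k] for k in keys)].append(rec["id"])
        for key, ids_a in index_a.items():
            ids_b = index_b.get(key, [])
            if len(ids_a) == 1 and len(ids_b) == 1:
                links[ids_a[0]] = (ids_b[0], pass_no)
                linked_b.add(ids_b[0])
    return links

# Hypothetical records; the second person's place of birth is spelled differently.
register = [
    {"id": "A1", "name": "MEIER", "sex": "F", "birth_date": "1970-01-31", "birth_place": "KASSEL"},
    {"id": "A2", "name": "SCHMIDT", "sex": "M", "birth_date": "1985-06-02", "birth_place": "BONN"},
]
delivery = [
    {"id": "B1", "name": "MEIER", "sex": "F", "birth_date": "1970-01-31", "birth_place": "KASSEL"},
    {"id": "B2", "name": "SCHMIDT", "sex": "M", "birth_date": "1985-06-02", "birth_place": "BAD GODESBERG"},
]
KEY_SETS = [
    ("name", "sex", "birth_date", "birth_place"),  # pass 1: all items
    ("name", "sex", "birth_date"),                 # pass 2: place of birth omitted
]
links = stepwise_match(register, delivery, KEY_SETS)
```

Recording the pass number with each link makes it possible to review the looser passes separately in quality checks.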

7 Probabilistic methods - examples
SPEDIS
  "Determines the likelihood of two words matching, expressed as the asymmetric spelling distance between the two words" (see SAS Documentation, "SPEDIS Function")
Jaro-Winkler similarity
  Measure of similarity between two strings; uses the number of matching characters and the number of transpositions
-> Sources of error
  False match
  Missing match

8 SPEDIS
Method
  Comparison of items (e.g. names)
  Identification of "costs" to transform one value into the target word
  Weighting by using the length of the string
  Transformation in both directions
Results
  Probability of correct matching
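SPEDIS itself is a SAS function with its own cost table, which is not reproduced here. As a rough analogue of the idea on this slide (edit "costs", weighted by string length, computed in both directions), the sketch below uses plain Levenshtein distance with unit costs. The function names and the scaling to a percentage are assumptions for illustration; lower values mean closer spellings.

```python
def edit_cost(source, target):
    """Unit-cost Levenshtein distance: insert, delete and replace cost 1 each."""
    previous = list(range(len(target) + 1))
    for i, ch_s in enumerate(source, start=1):
        current = [i]
        for j, ch_t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                    # delete from source
                current[j - 1] + 1,                 # insert into source
                previous[j - 1] + (ch_s != ch_t),   # replace (free if equal)
            ))
        previous = current
    return previous[-1]

def relative_cost(query, keyword):
    """Edit cost between keyword and query as a percentage of len(query)."""
    if not query:
        return 0.0 if not keyword else 100.0
    return 100.0 * edit_cost(keyword, query) / len(query)

def two_way_cost(a, b):
    """Compare in both directions and keep the less favourable value."""
    return max(relative_cost(a, b), relative_cost(b, a))
```

For example, `two_way_cost("MEIER", "MEYER")` charges one replacement against a length of five. Normalising by the length of the first argument is what makes `relative_cost` asymmetric when the two strings differ in length.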

9 Jaro-Winkler
Method
  Comparison of items (e.g. names)
  Weighting of identical characters in the compared words
  Higher weight for agreement at the beginning of the word
Results
  Probability of correct matching
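The Jaro-Winkler similarity described here can be written out in a few lines. The sketch below follows the commonly published definition (matching window, half the number of out-of-order matches as transpositions, prefix bonus for up to four characters with scale 0.1); these defaults are the usual textbook values, not parameters taken from the presentation.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # look for the same character within the window in s2
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # matched characters that appear in a different order
    out_of_order, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity with extra weight for a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)
```

With these definitions, two names that share their first letters, such as "MEIER" and "MEYER", score higher under `jaro_winkler` than under `jaro` alone.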

10 Matching of…: addition of items
Data source 1, with Item A added from data source 2:
  ID   Item 1  (further items)  Item A
  111  A       xx  14  mLx      C34
  222  B       yy  12  pQn      F76
  333  C           00  sFc      A94
Data source 2:
  ID   Item A
  111  C34
  222  F76
  333  A94

11 Matching of…: outer join
Data source 1:
  ID   Item 1  Item 2
  111  A       xx
  222  B       yy
  333  C
Data source 2:
  ID   Item 1  Item 2
  999  X       yy
  888  K       dd
Outer join:
  ID   Item 1  Item 2
  111  A       xx
  222  B       yy
  333  C
  999  X
  888  K       dd
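An outer join keeps every record that occurs in at least one of the two sources. A minimal sketch in plain Python, using hypothetical data with one overlapping ID so that all three cases appear; the `origin` flag and the rule that source 1 wins on conflicting items are assumptions for illustration:

```python
def outer_join(source_1, source_2):
    """Full outer join of two {id: record} mappings."""
    result = {}
    for rec_id in sorted(source_1.keys() | source_2.keys()):
        in_1, in_2 = rec_id in source_1, rec_id in source_2
        # items from source_1 overwrite items of the same name from source_2
        merged = {**source_2.get(rec_id, {}), **source_1.get(rec_id, {})}
        if in_1 and in_2:
            merged["origin"] = "both"
        else:
            merged["origin"] = "source_1" if in_1 else "source_2"
        result[rec_id] = merged
    return result

source_1 = {111: {"item_1": "A", "item_2": "xx"}, 222: {"item_1": "B", "item_2": "yy"}}
source_2 = {222: {"item_1": "B", "item_2": "yy"}, 999: {"item_1": "X", "item_2": "dd"}}
joined = outer_join(source_1, source_2)
only_one_source = [i for i, rec in joined.items() if rec["origin"] != "both"]
```

Keeping the origin of each record makes it easy to single out the records that occur in only one source for further checking.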

12 Matching of…: identical registers over time
Reference date 1, reference date 2 and the matched result each contain:
  ID   Item 1  Item 2
  111  A       xx
  222  B       yy
  333  C

13 Contents
Types of matching procedures
Matching of address data
Matching of personal data sets
Confidentiality issues

14 Register of addresses
Matching registers
Setup of the register
Quality aspects
Support of the register
Validation
Quality aspect: up-to-dateness
Quality aspect: completeness

15 Register of addresses in the German Census
Covered all addresses with housing space and occupied living quarters
2 administrative data sources -> outer join
  Federal Mapping Agency
  Population registers
Checking of addresses if only included in one data source
Classification of addresses as "addresses with housing space"

16 Data acquisition: using registers in place
Geo-referenced address data
  records: 21 million
  including geo-coordinates
Data of residents' registration offices
  records: 86 million
  contains demographic and geographical information

17 Problems
No identification characteristics -> address characteristics as ID
Local register data
  Low standardisation of register entries
  Low harmonisation between registers
  Redundant/false/obsolete data entries
-> Complex data processing

18 Setup of the register of addresses
Data checks
Preprocessing
  Decomposing the address data into address components
  Standardisation of the address information
Aggregation of individual data sets
Harmonisation
  Referencing the street names at street level
  Adjustment of changing address identifiers
Merging/record linkage

19 Challenges in using the address as a key variable
Decentralised administrative data, different registers
-> No harmonised address format
-> Address unstable, changes not notified simultaneously in all registers
Example (street name): J.-F.-K.-Straße vs. John-F.-Ken.-Straße

20 Standardisation of key variables
Necessary condition for completion and updating: standardisation
Standardisation of street names
  Automated standardisation
    - capital letters
    - uniform abbreviations (street -> str, place -> pl)
    - eliminating blanks
  Manual checks by the statistical offices of the Länder
Thesaurus of street names
Aggregation at street level
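The three automated steps (capital letters, uniform abbreviations, eliminating blanks) might look like this in code. This is only a sketch: the German abbreviation pairs and the removal of punctuation along with blanks are assumptions, since the slide gives the rules in English and does not list them in full.

```python
import re

# Assumed abbreviation rules; the slide only gives "street -> str, place -> pl".
ABBREVIATIONS = [("STRASSE", "STR"), ("PLATZ", "PL")]

def standardise_street(name):
    """Upper-case, abbreviate and strip blanks/punctuation from a street name."""
    text = name.upper()  # Python turns "ß" into "SS" here
    for long_form, short_form in ABBREVIATIONS:
        text = text.replace(long_form, short_form)
    # keep letters (including umlauts) and digits only
    return re.sub(r"[^A-Z0-9ÄÖÜ]", "", text)

variants = ["Kochstraße", "Koch Str.", "KOCH-STRASSE"]
standardised = {standardise_street(v) for v in variants}
```

Rules of this kind bring spelling variants of the same street together, but they cannot expand initials: "J.-F.-K.-Straße" becomes "JFKSTR", not "JOHNFKENNEDYSTR". Cases like that are what the manual checks and the thesaurus are for.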

21 Thesaurus of street names: harmonisation of spellings
External source:
  postal code  street name      standardised street name
  38471        J.-F.-K.-Straße  JOHNFKENNEDYSTR
Thesaurus of street names:
  postal code  street name          standardised street name
  38471        J.-F.-K.-Straße      JOHNFKENNEDYSTR
               John-F.-Ken.-Straße
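In code, such a thesaurus amounts to a lookup table from (postal code, spelling variant) to one standardised street name. The sketch below uses the example from this slide; the function names are hypothetical, and it assumes that both spellings are filed under the same postal code.

```python
# (postal code, spelling as found in a register) -> standardised street name
THESAURUS = {
    ("38471", "J.-F.-K.-Straße"): "JOHNFKENNEDYSTR",
    ("38471", "John-F.-Ken.-Straße"): "JOHNFKENNEDYSTR",
}

def harmonise(postal_code, street_name, thesaurus=THESAURUS):
    """Return the standardised street name, or None if the spelling is unknown."""
    return thesaurus.get((postal_code, street_name))

def same_street(address_a, address_b):
    """True if both (postal code, street name) pairs resolve to the same street."""
    std_a, std_b = harmonise(*address_a), harmonise(*address_b)
    return std_a is not None and std_a == std_b
```

Spellings that return `None` are the ones that would need a manual check before a new thesaurus entry is added.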

22 Preparation and integration of register data
[Flow diagram] GA, MR -> pre-processing -> deterministic 1:1 matching method -> matching data -> register; non-matching data -> correction (regional authorities) -> corrected data

23 Two-stage correction model
Municipal code
  I. Street level (Street A, Street B)
     Check criterion: existence, correctness
  II. Address level (No. 1, No. 2 for each street)
     Check criterion: existence, correctness, housing space

24 Validation of addresses – quality aspect
Validated mass: addresses included in both data sources (GA and MR)
Check for housing space: address in only one data source (GA or MR)

25 Results: addresses to be checked for housing space (2011 Census)

26 Quality aspect: up-to-dateness
Coordination function -> keeping the register up to date
Address up-to-dateness = How unstable are the addresses? How often will they be updated?
Changes to address variables at municipal level -> address is unstable; when and how often it changes is not predictable

27 Instability of the address (2010-2011): change of at least one variable in percent
Germany

28 Keeping the register up to date
Integration of 5 different registers (e.g. population register) -> identical registers over time
Mismatches: checked by the statistical offices of the Länder
  -> existence
  -> correctness
  -> renamings
Example of a renaming: old street name Kochstraße, new street name John-F.-Ken.-Straße

29 Quality aspect: completeness
Register of addresses = reference for population
New buildings, demolition of residential buildings, incorrect data in registers
Completion by:
  Registers -> outer join
  Other survey components, information from other sources

30 New addresses added to the register by data origin over time (2011 Census)
[Chart series: total, administrative registers, other findings]
-> most of the new addresses are based on register integration

31 Conclusion
Decentralised administrative data, differing quality of register data and missing ID = core problem
Updating and completing an unstable key variable is the major focus in the context of the register of addresses -> precondition: harmonisation/standardisation
Updating and completion of the register can mainly be achieved through register integration

32 Contents
Types of matching procedures
Matching of address data
Matching of personal data sets
Confidentiality issues

33 Data acquisition and integration in Germany
Decentralised via the statistical offices of the Länder
Two data deliveries around the census reference date
Integration
  Linking of the information on addresses
  Adding personal data records via the address ID
  Build-up of a temporary centralised population register for Germany

34 Matching of different deliveries over time
Merging information
  Address
  Family name at birth and first name(s), sex, date of birth, place of birth
Results
  Confirm data sets
  Update data sets
  Add data sets
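The three results named on this slide (confirm, update, add) could be derived as in the sketch below. It assumes, purely for illustration, that the personal items identify a person, and that a person found again under a different address counts as an update. The field names and the exact-match rule are hypothetical; the slides do not state how the real procedure decides.

```python
PERSON_ITEMS = ("birth_name", "first_names", "sex", "birth_date", "birth_place")

def person_key(record):
    """Personal items used for matching in place of a personal ID."""
    return tuple(record[item] for item in PERSON_ITEMS)

def integrate_delivery(stock, delivery):
    """Apply a later delivery to the data stock and count the three outcomes."""
    by_person = {person_key(rec): rec for rec in stock}
    outcome = {"confirm": 0, "update": 0, "add": 0}
    for rec in delivery:
        known = by_person.get(person_key(rec))
        if known is None:                       # person not yet in the stock
            new_rec = dict(rec)
            stock.append(new_rec)
            by_person[person_key(rec)] = new_rec
            outcome["add"] += 1
        elif known["address_id"] == rec["address_id"]:
            outcome["confirm"] += 1             # same person, same address
        else:
            known["address_id"] = rec["address_id"]
            outcome["update"] += 1              # same person, new address
    return outcome

def person(birth_name, first_names, address_id):
    """Helper for the example; sex, date and place of birth are held fixed."""
    return {"birth_name": birth_name, "first_names": first_names, "sex": "F",
            "birth_date": "1970-01-31", "birth_place": "KASSEL",
            "address_id": address_id}

stock = [person("MEIER", "ANNA", "ADR1"), person("KOCH", "EVA", "ADR2")]
second_delivery = [person("MEIER", "ANNA", "ADR1"),   # confirm
                   person("KOCH", "EVA", "ADR9"),     # update (moved)
                   person("BRAUN", "LENA", "ADR3")]   # add
outcome = integrate_delivery(stock, second_delivery)
```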

35 Reference data stock
Merging datasets from different sources (registers, surveys) without existing personal identification numbers
Merging information: family name at birth and first name(s), sex, date of birth, municipal code, post code, street name, house number

36 Matching procedures
Deterministic process
  Including all items
  Omit items step by step
Probabilistic process
  Similarity of items
  Probability of matching
Limitations? Risks? Opportunities? Challenges?

37 Challenges
Matching process "step by step"
Create subsets
Avoid false matches
Quality checks

38 Contents
Types of matching procedures
Matching of address data
Matching of personal data sets
Confidentiality issues

39 Data protection and confidentiality
Collection of personal data
  Names, date of birth, …
  Additional data only for matching process
Create internal IDs
Limitations for quality checks
Prohibition on transmitting the data back to the administration
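One possible reading of "additional data only for matching process" and "create internal IDs" is that names and dates of birth are used to link records and are then replaced by an internal identifier. The sketch below works under that assumption; the choice of items, the ID format and the function name are hypothetical and are not taken from the presentation.

```python
import itertools

# Items assumed to be needed only for matching and dropped afterwards.
MATCHING_ONLY_ITEMS = ("birth_name", "first_names", "birth_date")

def assign_internal_ids(records, prefix="P"):
    """Replace the matching-only items by a sequential internal ID."""
    counter = itertools.count(1)
    key_to_id = {}
    analysis_records = []
    for rec in records:
        key = tuple(rec[item] for item in MATCHING_ONLY_ITEMS)
        if key not in key_to_id:
            key_to_id[key] = f"{prefix}{next(counter):07d}"
        slim = {k: v for k, v in rec.items() if k not in MATCHING_ONLY_ITEMS}
        slim["internal_id"] = key_to_id[key]
        analysis_records.append(slim)
    return analysis_records

linked = [
    {"birth_name": "MEIER", "first_names": "ANNA", "birth_date": "1970-01-31", "sex": "F", "source": "register"},
    {"birth_name": "MEIER", "first_names": "ANNA", "birth_date": "1970-01-31", "sex": "F", "source": "survey"},
    {"birth_name": "KOCH", "first_names": "EVA", "birth_date": "1982-11-05", "sex": "F", "source": "register"},
]
analysis = assign_internal_ids(linked)
```

The internal ID still connects the two records of the same person, while the output no longer contains names or dates of birth.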

40 Thank you for your attention!
Stephanie Hirner © Federal Statistical Office of Germany | Census

