Presentation is loading. Please wait.

Presentation is loading. Please wait.

Sandra Lagarto, Statistics Portugal,

Similar presentations


Presentation on theme: "Sandra Lagarto, Statistics Portugal,"— Presentation transcript:

1 Record linkage methods for administrative data: The Portuguese Census Transformation Program
Sandra Lagarto, Statistics Portugal, Anabela Delgado, Statistics Portugal, 2021 Census Unit, R. Silva, Instituto Superior Técnico/INESC-ID, L. Sampaio, Instituto Superior Técnico/INESC-ID, M.J. Silva, Instituto Superior Técnico/INESC-ID, P. Calado, Instituto Superior Técnico/INESC-ID,

2 Today’s presentation Background Matching overview
Census with Admin Data Portuguese Statistical Population Dataset Record linkage and data matching Background Initial conditions The anonymization process The use of match keys – Deterministic matching Matching overview Probabilistic matching Score based matching with logistic regression Methodological options Matching developments Measuring quality of the models New matches from logistic model Matching validation Results and Discussion Conclusions

3 Background –I Portuguese Census Transformation Program
Census with Admin Data Portuguese Statistical Population Dataset (SPD) Record linkage and data matching 2016 Resident population in Portugal from SPD, Population Estimates (PE) and Civil Register (BDIC), by quinary age group

4 Admin Registers Acronym Designation BDIC
Civil Register (Portuguese citizens) SEF Immigration Register IRS Taxes Register (Personal Income) ISS Social Security Register CGA State Pension/ Work Fund Register QP Private Employment Register IEFP Unemployment Register EDUC Education Register ACSS Patient Register (Hospital assistance)

5 Background – II Registers No Registers integrated in the 2015 SPD
Registers integrated and not integrated in 2015 SPD, by administrative source Registers No Registers integrated in the SPD Registers not integrated the in 2015 SPD % 2014 IRS 95,7 4,3 2015 ISS 96,4 3,6 2015 EDUC 93,8 6,2 2014 QP 99,1 24 779 0,9 2015 CGA 97,1 30 268 2,9

6 Goal Identify and register individual residents in Portugal in a specific year, using as much administrative data as possible Keep errors at the minimum Portuguese resident population database

7 Initial conditions Portugal does not have:
A central population register A unique personal identification number A unified numerical identification system Numerical Id presence and coverage, by register Register NIC NIF NISS 2015 BDIC 100,0 - 2015 SEF* 62,2 50,8 2015 ISS 81,5 98,8 2014 QP 99,1 2015 IEFP 99,6 2015 CGA 77,0 81,9 2015 EDUC 90,9 66,6 2014 IRS

8 The anonymization process
The National Data Protection Authority (CNPD) allowed SP to access the registers (Deliberations and ), after some transformations: Personal numeric identifiers (NIC; NIF; NISS) encrypted with a hash function to convert them into a condensed representation of fixed value in a one way irreversible process Person names had to be truncated to the first three letters of the first name and the three final letters of the family name

9 Matching developments
Deterministic, identifying identical keys Probabilistic, matching similar (approximate) keys Score based matching with logistic regression Example of individual personal characteristics used in match keys Matching attributes First three characters from first name Last three characters from family name Sex Date of birth Parish or Municipality of residence Legal marital status Place of birth

10 Matching process Record linkage System Architecture

11 Data cleaning and standardization

12 Blocking

13 Negative examples generation
Blocking criteria: 3_LETTERS_FIRST_NAME + DTBIRTH

14 Feature generation Comparison vectors generated with Edit Distance (Levenshtein) metric

15 Logistic regression Logistic curve example

16 Quality of models Results of 2-Fold Cross Validation Model
Precision (%) Sensitivity (%) Matches Non-matches BDIC_IRS 98 97 BDIC_ISS 99 100 BDIC_EDUC 93 94 BDIC_IEFP BDIC_CGA 88 95 SEF_IRS SEF_ISS SEF_EDUC 96

17 Matching results Results of classification applying the trained model
Data Sources matched No. records Records without common key Exact matches Exact matches validated No. records not integrated in SPD New matches 2015 BDIC (19,3%) 2014 IRS (12%) 2015 ISS 40 769 8 224 (46,3%) 2015 EDUC 84 968 2015 SEF 34 381 2 249 (12,1%)

18 Validation of matched pairs
Register No records Resident candidates table (No records) Validated records No % 2015 BDIC 93,4 2014 IRS 95,8 2015 ISS 98,0 2015 EDUC 95,2 2015 IEFP 98,8 2015 CGA 2015 SEF 83,6 92,2 16 848 10 576 62,8 87 017

19 Conclusions - Strategic
Changing the census paradigm will take time and require huge investment. Work is still in progress for the transition to a register-based census. This line of research should continue until 2021 and beyond. The results are encouraging and research to date has shown the potential to produce good quality population estimates at a national level but improvements must be made: The setting-up of a more favourable legal framework that: Promote the creation of an integrated system to produce Increase the commitment, by the data owners, on: Sending data on a regular basis Provide up to dated data Increase data coverage (numerical Ids) Data cleaning ... Refinements in the methodological framework: Improvements in probabilistic record linkage

20 Conclusions - Operational
Data cleaning is essential to this process This method adds from 11% to 46% new records to SPD First step towards a census based on administrative data under the current legal environment in Portugal Future work

21 The SP – IST Team mario.gaspar.silva@tecnico.ulisboa.pt

22 References Levenshtein, V. I Binary codes capable of correcting deletions, insertions and reversals, Soviet Physics-Doklady 10, Office for National Statistics (ONS) Beyond 2011 ‘Matching Anonymous Data’ Methods & Policies Report (M9). Available at: publications/methods-and-policies/index.html (Accessed: 14 May 2018). Sampaio Velho, L. Emparelhamento de dados Censitários Thesis to obtain the Master of Science Degree in Information Systems and Computer Engineering. IST, Lisboa, 68p. Available at: t/dissertacao/ (Accessed: 14 May 2018). Silva, R. ‘Matching Census Data Records’ Thesis to obtain the Master of Science Degree in Information Systems and Computer Engineering. IST, Lisboa, 58p. Available at: a/dissertacao/ (Accessed: 14 May 2018). Silva, R., Sampaio, L., Calado, P., Silva, M. and Delgado, A. (2017) Matching administrative data for census purposes. In Proceedings of the Data Science, Statistics and Visualization Conference (DSSV 2017), Lisbon, Portugal. Available at: (Accessed: 14 May 2018). Statistics Portugal (SP) Interligação das diferentes bases de dados provenientes de fontes administrativas, no âmbito do novo modelo censitário para Census Unit. Available at: (Accessed: 14 May 2018). Statistics Portugal (SP) Metodologia de atualização da Base de População Residente – Construção da BPR Census Unit. Available at: (Accessed: 14 May 2018).

23 Record linkage methods for administrative data: the Portuguese Census Transformation Program
Sandra Lagarto, Statistics Portugal, 2021 Census Unit,


Download ppt "Sandra Lagarto, Statistics Portugal,"

Similar presentations


Ads by Google