ESSnet DI WP2: Record Linkage Luca Valentino Istat
2 Task 2.1: Record linkage – a practical problem
The problem, already illustrated at the Den Haag meeting, is to link two registers:
- P4 (inclusion of newborns in the administrative register of residents)
- CEDAP (survey on the certificates of assistance at childbirth)
One data set contains the characteristics of the newborn, such as weight, type of birth, number of siblings and week of birth. The other contains the characteristics of the household, such as marital status, nationality and education of the parents.
3 Harmonization of populations
Diagram of the CEDAP and P4 populations.
Common units: live newborns; newborns in Italy; Italian residents.
Units outside the common population: dead newborns; non-residents; Italian newborns in other countries; Molise and Calabria.
4 Common variables
The common variables available for linkage in P4 and CEDAP are:

Entity     Variable
MOTHER     Mother's birthdate (day, month and year)
MOTHER     Mother's marital status
MOTHER     Mother's citizenship
MOTHER     County of residence of the mother (hence )
NEWBORN    County where childbirth happened
NEWBORN    Newborn's birthdate (day, month and year)
NEWBORN    Newborn's gender
FATHER     Father's birthdate (day, month and year)
FATHER     Father's citizenship

The linkage was performed on the month of March 2005. Exclusion of the non-eligible newborns led to the following file sizes:
P4      March  '336 units
CEDAP   March  '381 units
5 Different approaches
The objective was to apply different approaches and compare their results:
- deterministic record linkage approach: rule defined by survey experts
- probabilistic record linkage approach: based on the Fellegi-Sunter method (Relais)
- Liseo and Tancredi approach: to be done
6 Deterministic approach
This approach is based on a deterministic rule defined by survey experts: agreement on all the common variables, or on all but one of them. The rule is implemented with ad hoc SAS procedures, and its results are considered very reliable (declared matches are treated as actual matches).
The number of declared matches in this case is 32’595.
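The expert rule above (agreement on all common variables, or on all but one) can be sketched in a few lines. This is only an illustration in Python, not the SAS procedures actually used. The field names are hypothetical, and treating two missing values as an agreement is an assumption, since the slides do not say how missing values are handled.

```python
from typing import Mapping, Sequence

# Hypothetical field names standing in for the common variables of P4 and CEDAP.
COMMON_VARIABLES: Sequence[str] = (
    "mother_birthdate", "mother_marital_status", "mother_citizenship",
    "mother_county_of_residence", "newborn_county_of_birth",
    "newborn_birthdate", "newborn_gender",
    "father_birthdate", "father_citizenship",
)

def count_disagreements(rec_a: Mapping, rec_b: Mapping,
                        variables: Sequence[str] = COMMON_VARIABLES) -> int:
    """Number of variables on which the two records differ.

    Assumption: a variable missing in both records (None == None)
    counts as an agreement.
    """
    return sum(1 for v in variables if rec_a.get(v) != rec_b.get(v))

def expert_rule_match(rec_a: Mapping, rec_b: Mapping,
                      variables: Sequence[str] = COMMON_VARIABLES,
                      max_disagreements: int = 1) -> bool:
    """Declare a match when the records agree on all variables or on all but one."""
    return count_disagreements(rec_a, rec_b, variables) <= max_disagreements
```

With `max_disagreements=0` the same function gives the stricter "equal on all common variables" rule.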
7 Probabilistic approach
Search space reduction: blocking on the variables Newborn's birthdate and Newborn's gender.
Matching variables:

Variable                          Metric
day of mother's birth date        equality
month of mother's birth date      equality
year of mother's birth date       equality
birthplace                        equality
father's birth date               Jaro, thr. 0.9

Reduction to a 1:1 solution.
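The blocking and comparison steps on this slide can be sketched as follows. This is a minimal Python illustration, not the Relais implementation. The field names are hypothetical, dates are assumed to be stored as strings, and "Jaro, thr. 0.9" is read as "agreement when the Jaro similarity is at least 0.9".

```python
from collections import defaultdict

def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity between two strings (1.0 = identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def block_key(rec):
    # Blocking variables from the slide: newborn's birthdate and gender.
    return (rec["newborn_birthdate"], rec["newborn_gender"])

def candidate_pairs(file_a, file_b):
    """Yield only the pairs that share the same blocking key."""
    index = defaultdict(list)
    for b in file_b:
        index[block_key(b)].append(b)
    for a in file_a:
        for b in index.get(block_key(a), []):
            yield a, b

def comparison_vector(a, b, jaro_threshold=0.9):
    """Binary agreement pattern on the five matching variables."""
    return (
        int(a["mother_birth_day"] == b["mother_birth_day"]),
        int(a["mother_birth_month"] == b["mother_birth_month"]),
        int(a["mother_birth_year"] == b["mother_birth_year"]),
        int(a["birthplace"] == b["birthplace"]),
        int(jaro(a["father_birthdate"], b["father_birthdate"]) >= jaro_threshold),
    )
```

Blocking keeps the number of compared pairs far below the full cross product of the two files, at the cost of missing true matches whose blocking variables disagree.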
8 Probabilistic approach
Before applying the EM algorithm and the 1:1 reduction, the probabilistic approach finds a set of pairs, each with a probability of being a match (P_POST).
The final result depends on the choice of the match threshold, which in turn depends on the quality required for the linkage.
In this case high precision is required (to prevent false matches as far as possible), so the match threshold is fixed at 0.9.
The number of declared matches is 36’562.
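To make the role of P_POST and of the 0.9 threshold concrete, here is a minimal Python sketch of a Fellegi-Sunter posterior match probability under the usual conditional independence assumption. The parameters `m` (agreement probability among matches), `u` (agreement probability among non-matches) and `p` (share of matches among candidate pairs) would normally be estimated, for example with the EM algorithm. They are simply passed in here, and the function names are illustrative, not part of Relais.

```python
from typing import Sequence

def posterior_match_probability(gamma: Sequence[int],
                                m: Sequence[float],
                                u: Sequence[float],
                                p: float) -> float:
    """P(match | comparison vector gamma), assuming conditional independence.

    gamma[k] is 1 if the pair agrees on variable k, else 0.
    """
    like_match = like_unmatch = 1.0
    for g, mk, uk in zip(gamma, m, u):
        like_match *= mk if g else 1.0 - mk
        like_unmatch *= uk if g else 1.0 - uk
    numerator = p * like_match
    return numerator / (numerator + (1.0 - p) * like_unmatch)

def is_declared_match(gamma, m, u, p, threshold: float = 0.9) -> bool:
    """Declare a match when the posterior probability reaches the threshold."""
    return posterior_match_probability(gamma, m, u, p) >= threshold
```

Raising the threshold trades false matches for false non-matches, which is why the slide ties its choice to the quality required for the linkage.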
9 Comparison between approaches
The comparison between the deterministic approach (expert's rule) and the probabilistic approach (Relais) shows strong agreement.
The pairs declared as matches by both approaches are 31’931:
- 87% of the matches according to Relais are also matches for the expert's rule
- 98% of the matches according to the expert's rule are also matches for Relais
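The two percentages follow directly from the match counts reported on the slides, as this short Python check shows. The variable names are chosen here for illustration.

```python
# Counts reported on the slides.
expert_matches = 32595   # declared matches, expert's rule
relais_matches = 36562   # declared matches, Relais
common_matches = 31931   # declared by both approaches

share_of_relais = common_matches / relais_matches   # about 0.873
share_of_expert = common_matches / expert_matches   # about 0.980

# Pairs declared as matches by only one of the two approaches.
only_expert = expert_matches - common_matches
only_relais = relais_matches - common_matches
```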
10 Comparison by clerical review
The quality of the linkage procedures can be assessed by drawing samples of pairs and evaluating them carefully by clerical review.
The clerical review consists in analysing all the common variables of the two records:
- if the variables that do not coincide show only minimal differences, including the case where they are missing, the pair is classified as a true link
- otherwise the pair is classified as a false link
Example pair shown on the slide (columns: Mother's birthdate, Newborn's gender, Newborn's birth day; surviving values: Female, 06).
11 Comparison by clerical review
A - common matches
B - common matches consisting of twins
C - pairs declared as matches only by the expert's rule
D - pairs declared as matches only by Relais, similar on at least half of the variables not used in the probabilistic linkage procedure
E - other pairs declared as matches by Relais

Expert's rule   Relais     Class of pairs   Total pairs   Sample size   True link   False link
Match           Match      A
Match           Match      B
Match           Unmatch    C
Unmatch         Match      D
Unmatch         Match      E
12 Comparison by clerical review
F - pairs in the Relais solution whose p_post value is below the match threshold
G - pairs that coincide in at least one of the most significant variables

Expert's rule   Relais     Class of pairs   Total pairs   Sample size   True link   False link
Unmatch         Unmatch    F
Unmatch         Unmatch    G
13 Comparison by clerical review
The results obtained on the checked samples give the following false match and false non-match rates:

Approach                                  False match rate   False non-match rate
Deterministic approach (expert's rule)    0%                 14.35%
Probabilistic approach (Relais)           0.25%              4.16%
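One plausible way to obtain such rates from the reviewed samples is sketched below in Python. The slides do not state the estimator actually used, so this is an assumption: within each class of pairs, the share of true links found in the sample is scaled up to the class total. All names are hypothetical, and any counts supplied must come from the review tables, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class ReviewClass:
    name: str
    total_pairs: int            # pairs in the class
    sample_size: int            # pairs checked by clerical review
    true_links_in_sample: int   # pairs judged true links in the sample
    declared_by_expert: bool    # class is a match for the expert's rule
    declared_by_relais: bool    # class is a match for Relais

    @property
    def estimated_true_links(self) -> float:
        # Assumption: the sample share of true links holds for the whole class.
        return self.total_pairs * self.true_links_in_sample / self.sample_size

def error_rates(classes: Iterable[ReviewClass], approach: str):
    """Return (false match rate, false non-match rate) for 'expert' or 'relais'."""
    classes = list(classes)
    declared = [c for c in classes
                if (c.declared_by_expert if approach == "expert" else c.declared_by_relais)]
    missed = [c for c in classes if c not in declared]
    declared_total = sum(c.total_pairs for c in declared)
    false_matches = sum(c.total_pairs - c.estimated_true_links for c in declared)
    all_true = sum(c.estimated_true_links for c in classes)
    missed_true = sum(c.estimated_true_links for c in missed)
    return false_matches / declared_total, missed_true / all_true
```

Here the false match rate is the estimated share of false links among the declared matches, and the false non-match rate is the estimated share of true links that the approach did not declare.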