The relationship between error rates and parameter estimation in the probabilistic record linkage context Tiziana Tuoto, Nicoletta Cibella, Marco Fortini.

The relationship between error rates and parameter estimation in the probabilistic record linkage context
Tiziana Tuoto, Nicoletta Cibella, Marco Fortini
Italian National Statistical Institute (ISTAT), Rome, Italy

2 Q2008 Outline
– Record linkage and quality
– Record linkage as a statistical problem
– Aim of the work: evaluating the linkage errors
– Results on a case study
– Further work

3 Q2008 Record linkage at Quality2008
The purpose of record linkage is to identify, quickly and accurately, the same real-world entity, which may be represented differently in different data sources.
Widespread applications in official statistics:
– creation, updating and de-duplication of frames
– estimation of population size by capture-recapture models
– checking the confidentiality of public-use microdata
Record linkage procedures substantially improve the quality and quantity of the available information.
Warning: when using data that come from a linkage, the quality of the linkage procedure must be taken into account.

4 Q2008 Linkage Framework
[Diagram with the elements: Linkage, Preparatory activities, Method adjustments, Evaluation, Usage]

5 Q2008 Record linkage: formalization (1)
Two data sets A and B, of size $N_A$ and $N_B$ respectively.
Consider $\Omega = \{(a,b) : a \in A,\ b \in B\}$, of size $N = N_A \times N_B$.
The problem: classify the pairs in $\Omega$ into two mutually exclusive subsets M and U:
– M is the set of matches ($a = b$)
– U is the set of non-matches ($a \neq b$)
To classify the pairs, common identifiers (matching variables) are selected.
For each pair, a comparison vector $\gamma$ is defined.
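As an illustration of the comparison vector, the sketch below builds a binary agreement vector on four matching variables. The record fields, the helper name `comparison_vector` and the two records are hypothetical, invented only for this example; exact equality is assumed as the comparison rule.

```python
def comparison_vector(rec_a, rec_b, keys):
    """Return a tuple of 0/1 agreement indicators, one per matching variable."""
    return tuple(int(rec_a[k] == rec_b[k]) for k in keys)


# Matching variables in the order used for the agreement patterns.
KEYS = ("surname", "name", "day_of_birth", "year_of_birth")

# Hypothetical records, invented for illustration only.
a = {"surname": "Rossi", "name": "Maria", "day_of_birth": 12, "year_of_birth": 1970}
b = {"surname": "Rossi", "name": "Mara", "day_of_birth": 12, "year_of_birth": 1970}

gamma = comparison_vector(a, b, KEYS)
print(gamma)  # agreement on surname, day and year; disagreement on name
```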

6 Q2008 Record linkage: formalization (2)
The ratio of the distributions of $\gamma$ in the M and U subsets, $r(\gamma) = m(\gamma)/u(\gamma)$, is used to classify the pairs.
The classification criterion is based on two thresholds $T_m$ and $T_u$ ($T_m > T_u$).
The thresholds are chosen so that the false match rate (FMR) and the false non-match rate (FNMR) are minimized.
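A minimal sketch of the two-threshold rule follows. It assumes the usual Fellegi and Sunter convention that pairs at or above $T_m$ are declared matches, pairs at or below $T_u$ are declared non-matches, and pairs in between are left undecided; the function names and the numeric values are hypothetical.

```python
def ratio(m_gamma, u_gamma):
    """Ratio of the probability of a pattern among matches to that among non-matches."""
    return m_gamma / u_gamma


def classify(r, t_m, t_u):
    """Assign a pair to 'match', 'non-match' or 'undecided' from its ratio r."""
    if t_m <= t_u:
        raise ValueError("t_m must be greater than t_u")
    if r >= t_m:
        return "match"
    if r <= t_u:
        return "non-match"
    return "undecided"


# Hypothetical probabilities and thresholds, for illustration only.
r = ratio(m_gamma=0.90, u_gamma=0.001)
print(classify(r, t_m=100.0, t_u=1.0))
```

Leaving an undecided region between the two thresholds is a design choice: it trades a smaller error rate for pairs that need further review.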

7 Q2008 Choosing the threshold values
The Fellegi and Sunter approach depends heavily on the accuracy of the estimates of $m(\gamma)$ and $u(\gamma)$.
Misspecified model assumptions, lack of information and other problems can reduce the accuracy of the estimates and, as a consequence, increase both false match and false non-match errors.
For this reason, appropriate threshold values are often identified mainly through empirical methods.

8 Q2008 Theoretical situation
[Figure: density functions with the thresholds $T_u$ and $T_m$, the two error regions, and the sets $M^*$ and $U^*$]

9 Q2008 Estimation of $m(\gamma)$ and $u(\gamma)$: the mixture model approach
Armstrong and Mayda (1995) assume that the frequency distribution of the observed patterns $\gamma$ is a mixture of the distributions of the matches, $m(\gamma)$, and of the non-matches, $u(\gamma)$:
$P(\gamma) = p\, m(\gamma) + (1 - p)\, u(\gamma)$
where $p = P(M)$ is the prior probability of a match.
The parameters are estimated with the EM algorithm.

10 Q2008 Latent Class Analysis
The model specifies the joint distribution of the observations $\gamma$ and the latent variable $C = c$, with $c \in \{0, 1\}$, and from it the likelihood function for $m_k(\gamma)$, $u_k(\gamma)$ ($k = 1, \dots, K$) and $p$.
The parameters are estimated with the EM algorithm.
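One standard way to write these two quantities for a two-class mixture is sketched below. The form is an assumption based on the mixture model, not a copy of the slide's formulas; the pair index $j$ and the symbols $\gamma_j$, $c_j$ are introduced here for illustration, with $c_j = 1$ for a match and $c_j = 0$ for a non-match.

```latex
% Joint distribution of the comparison vector and the latent class for pair j
P(\gamma_j, c_j) = \bigl[\, p\, m(\gamma_j) \bigr]^{c_j}
                   \bigl[\, (1 - p)\, u(\gamma_j) \bigr]^{1 - c_j}

% Likelihood of the observed patterns, after summing over the latent class
L(m, u, p) = \prod_{j=1}^{N} \bigl[\, p\, m(\gamma_j) + (1 - p)\, u(\gamma_j) \bigr]
```

The EM algorithm alternates between computing the posterior probability of $c_j = 1$ for each pattern and re-estimating $m$, $u$ and $p$ from those posterior weights.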

11 Q2008 Latent Class Analysis and model fitting
Under the local independence assumption, the matching variables are independent given the latent class: $m(\gamma) = \prod_{k=1}^{K} m_k(\gamma)$ and $u(\gamma) = \prod_{k=1}^{K} u_k(\gamma)$.
Warning: the local independence assumption is often not satisfied.
Some authors (Winkler, 1989; Thibaudeau, 1989) introduce suitable constraints on the parameters of the latent class models in order to partially relax the local independence assumption.
The aim of this work is to study the relationship between model fitting and the evaluation of linkage errors.

12 Q2008 Case study: the data
From the 2001 Italian Post Enumeration Census Survey.
File A from the Census and file B from the PES, about 650 records each.
4 matching variables: Name, Surname, Day and Year of Birth. Blocking on Month of Birth.
The true linkage status of all candidate pairs is known, thanks to the accuracy of the matching procedures adopted when estimating the Census coverage rate through the capture-recapture model.

Surname  Name  Day of B  Year of B     freq
   0       0      0         0        414138
   0       0      0         1          5321
   0       0      1         0         14004
   0       0      1         1           168
   0       1      0         0          3090
   0       1      0         1            43
   0       1      1         0           102
   0       1      1         1             9
   1       0      0         0           969
   1       0      0         1            17
   1       0      1         0            22
   1       0      1         1            19
   1       1      0         0            14
   1       1      0         1             9
   1       1      1         0             6
   1       1      1         1           513

13 Q2008 Results under local independence assumption
The probabilities $P(X_k \mid M)$, $P(X_k \mid U)$ and $P(M)$ are computed for each of the 4 matching variables by means of the EM algorithm under the local independence assumption.

P(M) = 0.0013

X                P(X=1|M)   P(X=1|U)
Surname           0.9853     0.0023
Name              0.9650     0.0074
Day of birth      0.9825     0.0327
Year of birth     0.9889     0.0127
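The estimation step can be sketched as a two-class EM run on the pattern frequencies of the case-study table. This is a minimal sketch, not the authors' implementation: the starting values, the iteration limit, the stopping tolerance and all function names are assumptions, and the estimates it produces are not guaranteed to reproduce the figures in the table above.

```python
import math

# Frequencies of the agreement patterns (Surname, Name, Day of birth, Year of birth)
# from the case-study table; "1" = agreement, "0" = disagreement.
FREQ = {
    "0000": 414138, "0001": 5321, "0010": 14004, "0011": 168,
    "0100": 3090, "0101": 43, "0110": 102, "0111": 9,
    "1000": 969, "1001": 17, "1010": 22, "1011": 19,
    "1100": 14, "1101": 9, "1110": 6, "1111": 513,
}


def pattern_prob(pattern, probs):
    """P(pattern | class) under local independence."""
    out = 1.0
    for bit, q in zip(pattern, probs):
        out *= q if bit == "1" else 1.0 - q
    return out


def em_latent_class(freq, p=0.01, m_init=0.9, u_init=0.1, n_iter=500, tol=1e-8):
    """Two-class EM; starting values and stopping rule are assumed, not from the source."""
    n_vars = len(next(iter(freq)))
    m, u = [m_init] * n_vars, [u_init] * n_vars
    total = sum(freq.values())
    history = []  # log-likelihood evaluated at the start of each iteration
    for _ in range(n_iter):
        # E-step: posterior probability that a pair with pattern g is a match.
        post, loglik = {}, 0.0
        for g, n in freq.items():
            pm = p * pattern_prob(g, m)
            pu = (1.0 - p) * pattern_prob(g, u)
            post[g] = pm / (pm + pu)
            loglik += n * math.log(pm + pu)
        history.append(loglik)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
        # M-step: weighted relative frequencies of agreement in each class.
        w_m = sum(n * post[g] for g, n in freq.items())
        w_u = total - w_m
        p = w_m / total
        m = [sum(n * post[g] for g, n in freq.items() if g[k] == "1") / w_m
             for k in range(n_vars)]
        u = [sum(n * (1.0 - post[g]) for g, n in freq.items() if g[k] == "1") / w_u
             for k in range(n_vars)]
    return p, m, u, history


p_hat, m_hat, u_hat, loglik_path = em_latent_class(FREQ)
print("P(M) =", round(p_hat, 4))
for name, mk, uk in zip(("Surname", "Name", "Day of birth", "Year of birth"), m_hat, u_hat):
    print(name, round(mk, 4), round(uk, 4))
```

Working on the 16 pattern frequencies rather than on the individual pairs keeps each iteration cheap, since pairs with the same pattern share the same posterior probability.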

14 Q2008 Results under local independence assumption
Fix only one threshold, $T_m = 1$, corresponding to an expected false match rate FMR = 0.001.
The resulting expected false non-match rate is FNMR = 0.0001.
[Figure: threshold $T_m$ with the sets $M^*$ and $U^*$]

15 Q2008 Results under local independence assumption
The linkage results are “appreciable”, but the linkage errors are not well estimated:
Observed FMR = 0.017 vs the expected 0.001
Observed FNMR = 0.010 vs the expected 0.0001

                              True linkage status
Linkage procedure result   Matched   Not matched   Total
Matched                        567            10     577
Not matched                      6
Total                          573
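The observed rates can be checked from the table with the short sketch below. The definitions are inferred from the reported figures and are an assumption: FMR as false matches over declared matches, FNMR as missed matches over true matches. The function name is hypothetical.

```python
def observed_error_rates(true_links, false_links, missed_matches):
    """Return (FMR, FNMR) from the cells of the linkage-result table."""
    declared_matches = true_links + false_links
    true_matches = true_links + missed_matches
    fmr = false_links / declared_matches   # share of declared matches that are wrong
    fnmr = missed_matches / true_matches   # share of true matches that were missed
    return fmr, fnmr


# Cells from the table: 567 correct links, 10 false links, 6 missed matches.
fmr, fnmr = observed_error_rates(567, 10, 6)
print(round(fmr, 3), round(fnmr, 3))
```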

16 Q2008 Results using deterministic approach
1st merge: (1,1,1,1)
2nd merge, on the residuals of the 1st merge: (1,0,1,1)
3rd merge, on the residuals: (1,1,0,1)

X                P(X=1|M)   P(X=1|U)
Surname           0.9853     0.0023
Name              0.9650     0.0074
Day of birth      0.9825     0.0327
Year of birth     0.9889     0.0127

                              True linkage status
Linkage procedure result   Matched   Not matched   Total
Matched                        538             3     541
Not matched                     35
Total                          573

Observed FMR = 0.005
Observed FNMR = 0.06
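The sequence of merges can be sketched as follows. This is an illustrative reading, not the procedure actually run on the case-study files: each rule is assumed to join on the variables flagged 1 and ignore those flagged 0, each record is linked at most once, and the records, field names and function name are hypothetical.

```python
KEYS = ("surname", "name", "day", "year")
RULES = [(1, 1, 1, 1), (1, 0, 1, 1), (1, 1, 0, 1)]  # merge order from the slide


def sequential_merge(file_a, file_b, rules=RULES, keys=KEYS):
    """Link records step by step; later rules only see the unlinked residuals."""
    links = []
    left = list(range(len(file_a)))
    right = list(range(len(file_b)))
    for step, rule in enumerate(rules, start=1):
        used = [k for k, flag in zip(keys, rule) if flag]
        # Index the residual records of file B on the variables used in this step.
        index = {}
        for j in right:
            index.setdefault(tuple(file_b[j][k] for k in used), []).append(j)
        for i in list(left):
            candidates = index.get(tuple(file_a[i][k] for k in used), [])
            if candidates:
                j = candidates.pop(0)  # take the first available candidate
                links.append((i, j, step))
                left.remove(i)
                right.remove(j)
    return links


# Hypothetical records, invented for illustration only.
A = [
    {"surname": "Rossi", "name": "Maria", "day": 12, "year": 1970},
    {"surname": "Bianchi", "name": "Luca", "day": 3, "year": 1985},
    {"surname": "Verdi", "name": "Anna", "day": 25, "year": 1990},
    {"surname": "Neri", "name": "Paolo", "day": 7, "year": 1960},
]
B = [
    {"surname": "Rossi", "name": "Maria", "day": 12, "year": 1970},
    {"surname": "Bianchi", "name": "Luka", "day": 3, "year": 1985},
    {"surname": "Verdi", "name": "Anna", "day": 26, "year": 1990},
    {"surname": "Neri", "name": "Gino", "day": 8, "year": 1961},
]

links = sequential_merge(A, B)
print(links)
```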

17 Q2008 Relaxing the conditional independence assumption
Try to introduce interactions between matching variables, given the latent variable.

Model                                      logLikelihood        BIC   npar
Local independence                              -123 586    247 289      9
Interaction Surname and Name                    -123 583    247 297     10
Interaction Surname and Day of Birth            -123 584    247 298     10
Interaction Surname and Year of Birth           -123 585    247 301     10
Interaction Name and Day of Birth               -123 586    247 302     10
Interaction Name and Year of Birth          Not identifiable             10
Interaction Day and Year of Birth           Not identifiable             10
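The BIC column can be approximately recovered from the log-likelihood and the number of parameters with the usual definition, BIC = -2 logL + npar · ln(N). The sketch below assumes that N is the total number of candidate pairs, 438 444, obtained by summing the frequency column of the case-study table; this choice of N is an inference, not something stated on the slide.

```python
import math


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: -2 logL + npar * ln(N)."""
    return -2.0 * loglik + n_params * math.log(n_obs)


N_PAIRS = 438444  # assumed sample size: sum of the pattern frequencies

# (model, log-likelihood, npar, BIC reported in the table)
MODELS = [
    ("Local independence", -123586, 9, 247289),
    ("Interaction Surname and Name", -123583, 10, 247297),
    ("Interaction Surname and Day of Birth", -123584, 10, 247298),
    ("Interaction Surname and Year of Birth", -123585, 10, 247301),
    ("Interaction Name and Day of Birth", -123586, 10, 247302),
]

for name, loglik, npar, reported in MODELS:
    print(name, round(bic(loglik, npar, N_PAIRS)), reported)
```

Small differences from the reported values are expected, because the log-likelihoods in the table are rounded to integers. On these figures the extra parameter of each interaction model costs more than the gain in log-likelihood, so the local independence model has the lowest BIC.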

18 Q2008 The true match distributions

19 Q2008 Further analyses
Improving model fitting:
– Distinguish between missing values and disagreement → develop further the models based on categorical and/or continuous comparisons (Winkler, 2001)
– Study the validity of the local independence assumption → perturb real data to introduce associated errors, in order to establish the relationship among model fitting, thresholds, linkage results and linkage errors
