Sampling procedures for assessing accuracy of record linkage
Paul A. Smith, S3RI, University of Southampton
Shelley Gammon, Sarah Cummins, Christos Chatzoglou, Office for National Statistics
Dick Heasman

Outline
• Record linkage and strategies
• Quality measures for record linkage
• Stratified sampling
• Inverse sampling
• Results
• Conclusions

Record linkage
• increasingly important in official statistics
• many strategies for linkage
• most high-quality linkage routines contain multiple passes, e.g.
  • exact matching
  • rule-based matching
  • probabilistic matching methods

Quality measures
• precision
• recall
• F-measure
• rely on truth assessment, which is usually clerical and expensive
TP = true positive, FN = false negative, etc.
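For reference, the standard definitions of these three measures in terms of the counts TP, FP (false positive) and FN are given below. The F-measure is shown in its balanced (F1) form, which is an assumption about the variant the slide used.

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]
```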

Overall and component quality
• Want overall assessment of matching quality from all stages
• Also interested in quality of the different stages
  • treat stages as strata
  • stratification is advantageous if strata differ from one another and are internally homogeneous
  • expect differences in precision in different match stages
  • can stratification improve precision/reduce costs of assessment?
  • how many records do I need to assess clerically?

Example with known outcome
• Use linkage from 2011 England & Wales Population Census and Census Coverage Survey
  • linkage was done automatically with clerical resolution of 'uncertain' cases
  • take clerical linkage result as ‘truth’
• Evaluate automated linkage procedures
  • compare precision and recall
  • experiment with accuracy measures for inference on quality measures

Choice of stratifying variables: true/false positives
• 600k links from a new automated process
• 0.26% FPs compared to original clerical linkage ‘truth’
• sample 500 TPs and 500 FPs by simple random sampling (srs) as basis for modelling
  • Census hard to count index (1-5)
  • Sex (1 or 2)
  • Age group (0-17, 18-24, 25-39, 40-64, 65+, unknown)
  • Whether in London (1) or not London (0)
  • Ethnicity (White, Asian/Asian British, Black/Black British, mixed, other, unknown)
  • Linkage pass (1, 2, 3, 4, 5, 6, 7, 11)

Stratifying variables
• BIC to penalise overfitting
• best model has
  • hard to count index (HtC)
  • ‘match pass’ (linkage method)
• merge similar levels
  • HtC {1&2}, {3} and {4&5}
  • match pass with levels {1}, {2-7} and {11} (= exact, deterministic and probabilistic matching)
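For readers unfamiliar with the criterion, the usual definition of BIC is shown below. The symbols are the conventional ones rather than notation taken from the slides: k is the number of fitted parameters, n the number of observations used to fit the model, and \hat{L} the maximised likelihood. Models with lower BIC are preferred, so an extra stratifying variable is kept only if it improves the fit by more than the penalty of ln n per added parameter.

```latex
\[
\mathrm{BIC} = k \ln n - 2 \ln \hat{L}
\]
```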

Choice of stratifying variables: true/false negatives
• 30k ‘most likely’ non-links (deduplication)
• approx. 1/3 are FNs
• sample 500 TNs and 500 FNs by simple random sampling (srs) as basis for modelling
• BIC
  • best model includes match probability, London, age group and ethnicity
• merge similar levels
  • age group to {0-24}, {25-64} and {65+}
  • ethnicity to {White, mixed}, {Asian, Black, other} and {unknown}

Estimation
• precision
  • estimate as proportion of all links
  • straightforward: the links are known
• recall
  • estimate as proportion of real matches (TP + FN)
  • not straightforward: need to estimate FN and calculate
  • FN estimated as proportion of non-links
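The two estimates above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `estimate_precision_recall`, its arguments and the numbers in the usage line are all hypothetical, and the error rates are assumed to come from simple random samples of links and of non-links.

```python
def estimate_precision_recall(n_links, fp_rate, n_nonlinks, fn_rate):
    """Estimate precision and recall from clerically assessed samples.

    n_links    -- number of links made (known exactly)
    fp_rate    -- sampled proportion of links that are false positives
    n_nonlinks -- number of candidate non-links considered
    fn_rate    -- sampled proportion of non-links that are false negatives
    (All names are hypothetical; they do not come from the slides.)
    """
    tp = n_links * (1.0 - fp_rate)   # estimated true positives
    fn = n_nonlinks * fn_rate        # estimated false negatives
    precision = tp / n_links         # proportion of all links that are correct
    recall = tp / (tp + fn)          # proportion of real matches that were linked
    return precision, recall


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
precision, recall = estimate_precision_recall(1000, 0.01, 200, 0.05)
```

Precision depends only on the sample of links, whereas recall also needs the sample of non-links, which is why it requires a second clerical exercise.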

Stratified sampling
• Stratified sampling for proportions (Cochran 1977)
  • requires ‘knowledge’ of the stratum proportions p_h
  • then allows control of var(p)
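As a rough illustration of how planning values for the stratum proportions turn into sample sizes, the sketch below applies the textbook Neyman allocation for proportions. It is a sketch under assumptions rather than the calculation used in the talk: the function name `neyman_allocation` and its arguments are hypothetical, the finite population correction is ignored, and the target is expressed as a cv of the overall estimate.

```python
import math


def neyman_allocation(weights, p_guess, target_cv):
    """Total sample size and Neyman allocation for a stratified proportion.

    weights   -- stratum shares W_h = N_h / N (should sum to 1)
    p_guess   -- assumed stratum proportions p_h (planning values)
    target_cv -- required coefficient of variation of the overall estimate
    Finite population corrections are ignored (an assumption).
    """
    s = [math.sqrt(p * (1.0 - p)) for p in p_guess]        # stratum std devs S_h
    overall_p = sum(w * p for w, p in zip(weights, p_guess))
    sum_ws = sum(w * sh for w, sh in zip(weights, s))
    # Under Neyman allocation var(p) = (sum_h W_h S_h)^2 / n, so solve for n.
    n = math.ceil(sum_ws ** 2 / (target_cv * overall_p) ** 2)
    # Allocate n in proportion to W_h S_h, rounding up within each stratum.
    n_h = [math.ceil(n * w * sh / sum_ws) for w, sh in zip(weights, s)]
    return n, n_h


# Hypothetical example: two equal strata, both with p_h = 0.5, target cv of 10%.
n, n_h = neyman_allocation([0.5, 0.5], [0.5, 0.5], target_cv=0.1)
```

Because each stratum size is rounded up, the allocated sizes can sum to slightly more than n. The result is only as good as the planning values: if the assumed p_h are far from the truth, the achieved cv can miss the target.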

Variance or cv?
• Control of var(p) only helpful if p guessed/known accurately
• Often unknown initially
• Inverse sampling (Haldane 1945) controls cv(p) by continuing sampling until m ‘successes’ observed, m fixed
  • sequential procedure with stopping rule
  • fixed m does not mean fixed sample size
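The stopping rule can be sketched as follows. This is a minimal illustration, assuming cases arrive in random order and each is classified as a ‘success’ or not; the function names are hypothetical. The estimator (m - 1)/(n - 1) is the unbiased form usually attributed to Haldane, and the cv expression is only a leading-order approximation, included to show why fixing m bounds the cv at roughly 1/sqrt(m) whatever the value of p.

```python
import math


def inverse_sample(outcomes, m):
    """Examine cases in order until m successes are seen.

    outcomes -- iterable of truthy/falsy values in random order
    m        -- fixed number of successes at which to stop (m >= 2)
    Returns (n, p_hat): cases examined and the estimate (m - 1) / (n - 1).
    """
    if m < 2:
        raise ValueError("m must be at least 2")
    successes = 0
    for n, is_success in enumerate(outcomes, start=1):
        if is_success:
            successes += 1
            if successes == m:
                return n, (m - 1) / (n - 1)
    raise ValueError("ran out of cases before observing m successes")


def approx_cv(p, m):
    """Leading-order approximation cv(p_hat) ~ sqrt((1 - p) / m)."""
    return math.sqrt((1.0 - p) / m)


# Hypothetical sequence of clerical outcomes: successes at positions 3, 5 and 8.
n, p_hat = inverse_sample([0, 0, 1, 0, 1, 0, 0, 1, 1], m=3)
```

The sample size n is random: the rarer the successes, the more cases must be examined before the m-th one appears.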

Inverse sampling

Stratified inverse sampling
• no clear solution for allocation of n_h from var(p_h)
• numerical search approach
  • minimum m_h = m* allocated in each stratum
  • increase m_h in stratum where largest decrease in cv(p), based on expected no. of cases to next ‘success’
  • repeat until target cv achieved
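One possible reading of this numerical search is sketched below. Several parts are assumptions rather than details from the slides: the variance of each stratum estimate is approximated by p_h^2 (1 - p_h) / m_h, the step rule picks the stratum with the largest drop in cv per expected extra case (taking 1/p_h cases as the expected wait for the next success), and planning values of p_h are supplied by the user. All function and argument names are hypothetical.

```python
import math


def stratified_cv(weights, p, m):
    """Approximate cv of the stratified estimate sum_h W_h * p_h."""
    overall = sum(w * ph for w, ph in zip(weights, p))
    var = sum(w ** 2 * ph ** 2 * (1.0 - ph) / mh
              for w, ph, mh in zip(weights, p, m))
    return math.sqrt(var) / overall


def allocate_successes(weights, p, target_cv, m_min=2, max_steps=100000):
    """Greedy search for stratum success counts m_h meeting a target cv."""
    m = [m_min] * len(p)                      # start every stratum at m*
    for _ in range(max_steps):
        current = stratified_cv(weights, p, m)
        if current <= target_cv:
            return m
        best_h, best_gain = None, 0.0
        for h in range(len(p)):
            trial = m.copy()
            trial[h] += 1
            # cv reduction per expected additional case (1 / p_h cases).
            gain = (current - stratified_cv(weights, p, trial)) * p[h]
            if gain > best_gain:
                best_h, best_gain = h, gain
        if best_h is None:                    # no stratum improves the cv
            break
        m[best_h] += 1
    raise RuntimeError("target cv not reached")


# Hypothetical strata: shares W_h and planning proportions p_h.
w = [0.5, 0.3, 0.2]
p = [0.001, 0.01, 0.1]
m = allocate_successes(w, p, target_cv=0.2)
expected_cases = sum(mh / ph for mh, ph in zip(m, p))  # expected clerical workload
```

The expected workload is dominated by strata with small p_h, since each required success there costs about 1/p_h assessed cases even at the minimum m*.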

Stratum ‘success’ sizes

Overall ‘success’ size

Sample sizes

Overall sample size

Estimated p’s, correct

Achieved cv’s, correct

Estimated p’s

Achieved cv’s

Estimated p’s, varying

Achieved cv’s

Strategy
• This suggests a strategy for sampling to assess the quality of linkage, consisting of:
  1. If reasonable estimates of p_h are available, use them in a Neyman allocation in a stratified design. This will give the smallest sample size with a reasonable chance of achieving the required cv.
  2. If these are not available, and only an overall estimate is required, use inverse sampling on randomly sorted data.
  3. If separate estimates in the strata are desirable, follow the algorithm for stratified inverse sampling.
• Possible combination approach using inverse sampling to get initial estimates of p_h to feed into Neyman allocation

Conclusions
• Stratified inverse sampling not effective?
• Sample sizes smaller for stratified sampling (much smaller for p~0.1, not much different for p~0.001)
• Combination strategies may offer improved results when initial estimates are not available.

Questions?
• Paul Smith
• p.a.smith@soton.ac.uk
