Published by Oswald Merritt. Modified over 8 years ago.
1
Sampling procedures for assessing accuracy of record linkage
Paul A. Smith, S3RI, University of Southampton
Shelley Gammon, Sarah Cummins, Christos Chatzoglou, Office for National Statistics
Dick Heasman
2
Outline
- Record linkage and strategies
- Quality measures for record linkage
- Stratified sampling
- Inverse sampling
- Results
- Conclusions
3
Record linkage
- increasingly important in official statistics
- many strategies for linkage
- most high-quality linkage routines contain multiple passes, e.g.
  - exact matching
  - rule-based matching
  - probabilistic matching methods
4
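Quality measures
- precision = TP / (TP + FP)
- recall = TP / (TP + FN)
- F-measure = 2 × precision × recall / (precision + recall)
- all rely on truth assessment, which is usually clerical and expensive
TP = true positive, FN = false negative, etc.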
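As a minimal sketch of the three measures on this slide, the function below computes them from counts of true positives, false positives and false negatives. The function name and the counts are hypothetical and chosen only for illustration; the F-measure is assumed to be the usual balanced (F1) form.

```python
def linkage_quality(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall and F-measure from clerically assessed counts."""
    precision = tp / (tp + fp)  # share of declared links that are true matches
    recall = tp / (tp + fn)     # share of true matches that were actually linked
    # Balanced F-measure: harmonic mean of precision and recall (assumed F1).
    f_measure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Hypothetical counts, for illustration only.
q = linkage_quality(tp=90, fp=10, fn=30)
```

With these made-up counts, precision is 0.9 and recall is 0.75; in practice every count comes from clerical review, which is why the sample size matters.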
5
Overall and component quality
- Want overall assessment of matching quality from all stages
- Also interested in quality of the different stages
  - treat stages as strata
  - stratification is advantageous if strata differ from each other and are internally homogeneous
  - expect differences in precision in different match stages
- Can stratification improve precision or reduce costs of assessment?
- How many records do I need to assess clerically?
6
Example with known outcome
- Use linkage from 2011 England & Wales Population Census and Census Coverage Survey
  - linkage was done automatically, with clerical resolution of 'uncertain' cases
  - take clerical linkage result as ‘truth’
- Evaluate automated linkage procedures
  - compare precision and recall
  - experiment with accuracy measures for inference on quality measures
7
Choice of stratifying variables: true/false positives
- 600k links from a new automated process
- 0.26% FPs compared to original clerical linkage ‘truth’
- sample 500 TPs and 500 FPs by simple random sampling (srs) as basis for modelling
- candidate stratifying variables:
  - Census hard to count index (1-5)
  - Sex (1 or 2)
  - Age group (0-17, 18-24, 25-39, 40-64, 65+, unknown)
  - Whether in London (1) or not London (0)
  - Ethnicity (White, Asian/Asian British, Black/Black British, mixed, other, unknown)
  - Linkage pass (1, 2, 3, 4, 5, 6, 7, 11)
8
Stratifying variables
- BIC used to penalise overfitting
- best model has
  - hard to count index (HtC)
  - ‘match pass’ (linkage method)
- merge similar levels
  - HtC: {1&2}, {3} and {4&5}
  - match pass: {1}, {2-7} and {11} (= exact, deterministic and probabilistic matching)
9
Choice of stratifying variables: true/false negatives
- 30k ‘most likely’ non-links (deduplication)
- approx. 1/3 are FNs
- sample 500 TNs and 500 FNs by srs as basis for modelling
- BIC best model includes match probability, London, age group and ethnicity
- merge similar levels
  - age group to {0-24}, {25-64} and {65+}
  - ethnicity to {White, mixed}, {Asian, Black, other} and {unknown}
10
Estimation
- precision
  - estimate as proportion of all links
  - straightforward: the links are known
- recall
  - estimate as proportion of real matches (TP + FN)
  - not straightforward: need to estimate FN and then calculate
  - FN estimated as proportion of non-links
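The estimation steps on this slide might be sketched as follows: precision is the sampled proportion of true links, the number of false negatives is scaled up from a sampled proportion of non-links, and recall is then calculated from the two. All names and numbers here are hypothetical, and the sketch assumes simple random samples of links and of non-links.

```python
def estimate_precision_recall(n_links, n_nonlinks,
                              links_sampled, links_true,
                              nonlinks_sampled, nonlinks_missed):
    """Estimate precision and recall from two clerically checked samples."""
    # Precision: proportion of sampled links confirmed as true matches.
    precision = links_true / links_sampled
    tp_hat = n_links * precision  # estimated number of true positives
    # FN: proportion of sampled non-links that are really matches,
    # scaled up to the whole set of non-links considered.
    fn_hat = n_nonlinks * nonlinks_missed / nonlinks_sampled
    recall = tp_hat / (tp_hat + fn_hat)
    return precision, recall, fn_hat


# Hypothetical totals and sample outcomes, for illustration only.
precision, recall, fn_hat = estimate_precision_recall(
    n_links=100_000, n_nonlinks=5_000,
    links_sampled=1_000, links_true=990,
    nonlinks_sampled=500, nonlinks_missed=100,
)
```

Recall depends on two estimated quantities, so its uncertainty comes from both samples; that is the sense in which it is less straightforward than precision.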
11
Stratified sampling
- Stratified sampling for proportions (Cochran 1977)
- requires ‘knowledge’ of the stratum proportions p_h
- then allows control of var(p)
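A rough sketch of how guessed stratum proportions could be turned into a stratified sample size is given below. It uses the textbook Neyman allocation for proportions, with stratum sample sizes proportional to N_h·sqrt(p_h(1 − p_h)), and ignores the finite population correction. The stratum sizes, guessed proportions and target cv are hypothetical, and the function names are illustrative only.

```python
import math


def neyman_allocation(N_h, p_h, target_cv):
    """Stratum sample sizes meeting a target cv (no finite population correction)."""
    N = sum(N_h)
    W = [n / N for n in N_h]                          # stratum weights
    S = [math.sqrt(p * (1 - p)) for p in p_h]         # stratum standard deviations
    p_bar = sum(w * p for w, p in zip(W, p_h))        # overall proportion
    target_var = (target_cv * p_bar) ** 2
    ws = [w * s for w, s in zip(W, S)]
    n_total = sum(ws) ** 2 / target_var               # total size under Neyman allocation
    return [math.ceil(n_total * x / sum(ws)) for x in ws]


def stratified_cv(N_h, p_h, n_h):
    """cv of the stratified estimator of the overall proportion."""
    N = sum(N_h)
    W = [n / N for n in N_h]
    p_bar = sum(w * p for w, p in zip(W, p_h))
    var = sum(w ** 2 * p * (1 - p) / n for w, p, n in zip(W, p_h, n_h))
    return math.sqrt(var) / p_bar


# Hypothetical strata and guessed error rates, for illustration only.
N_h = [400_000, 150_000, 50_000]
p_h = [0.001, 0.005, 0.02]
n_h = neyman_allocation(N_h, p_h, target_cv=0.1)
achieved = stratified_cv(N_h, p_h, n_h)
```

The achieved cv only matches the target if the guessed p_h are close to the truth.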
12
Variance or cv?
- Control of var(p) is only helpful if p is guessed or known accurately
  - often unknown initially
- Inverse sampling (Haldane 1945) controls cv(p) by continuing sampling until m ‘successes’ are observed, with m fixed
  - sequential procedure with stopping rule
  - fixed m does not mean fixed sample size
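To illustrate why fixing m controls the cv, here is a small simulation sketch. It draws records until m successes are seen and uses the unbiased inverse-sampling estimator (m − 1)/(n − 1). The comparison value sqrt((1 − p)/m) is a first-order (delta method) approximation added here as an assumption, not a formula from the slides; the true proportion, m and the number of replicates are hypothetical.

```python
import math
import random


def inverse_sample(p, m, rng):
    """Draw Bernoulli(p) records until m successes; return the number drawn."""
    n = successes = 0
    while successes < m:
        n += 1
        if rng.random() < p:
            successes += 1
    return n


def haldane_estimate(m, n):
    """Unbiased estimator of p under inverse sampling (needs m >= 2)."""
    return (m - 1) / (n - 1)


rng = random.Random(1)          # fixed seed so the sketch is repeatable
p_true, m = 0.05, 50            # hypothetical values
draws = [inverse_sample(p_true, m, rng) for _ in range(1000)]
estimates = [haldane_estimate(m, n) for n in draws]

mean_p = sum(estimates) / len(estimates)
sd_p = math.sqrt(sum((e - mean_p) ** 2 for e in estimates) / (len(estimates) - 1))
cv = sd_p / mean_p
approx_cv = math.sqrt((1 - p_true) / m)   # first-order approximation (assumption)
```

The sample size n varies from run to run while m stays fixed, and the approximate cv depends on p only through the factor (1 − p), which is close to 1 for rare outcomes.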
13
Inverse sampling
14
Stratified inverse sampling
- no clear solution for allocation of n_h from var(p_h)
- numerical search approach
  - minimum m_h = m* allocated in each stratum
  - increase m_h in the stratum giving the largest decrease in cv(p), based on expected number of cases to next ‘success’
  - repeat until target cv achieved
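One possible reading of the numerical search on this slide is the greedy sketch below. It is not the authors' implementation. It assumes working values of p_h (hypothetical here; in practice these might be running estimates), approximates the variance in stratum h by p_h²(1 − p_h)/m_h, and scores each candidate increase by the drop in cv per expected extra record, taking 1/p_h as the expected number of records to the next success. It also assumes 0 < p_h < 1 in every stratum.

```python
import math


def overall_cv(W, p, m):
    """Approximate cv of the weighted overall proportion under inverse sampling."""
    p_bar = sum(w * q for w, q in zip(W, p))
    # First-order variance approximation per stratum (assumption).
    var = sum(w ** 2 * q ** 2 * (1 - q) / mh for w, q, mh in zip(W, p, m))
    return math.sqrt(var) / p_bar


def allocate_successes(W, p, target_cv, m_min=5, max_steps=100_000):
    """Greedy search for the stratum success counts m_h."""
    m = [m_min] * len(W)                 # minimum m* in every stratum
    for _ in range(max_steps):
        current = overall_cv(W, p, m)
        if current <= target_cv:
            break
        best_h, best_gain = 0, -1.0
        for h in range(len(W)):
            trial = m.copy()
            trial[h] += 1
            # Decrease in cv per expected extra record (1 / p[h] records expected).
            gain = (current - overall_cv(W, p, trial)) * p[h]
            if gain > best_gain:
                best_h, best_gain = h, gain
        m[best_h] += 1
    return m


# Hypothetical stratum weights and working proportions, for illustration only.
W = [0.6, 0.3, 0.1]
p = [0.001, 0.01, 0.1]
m_h = allocate_successes(W, p, target_cv=0.1)
expected_n = sum(mh / q for mh, q in zip(m_h, p))   # expected total sample size
```

Scoring by cv decrease per expected record is a design choice made for this sketch; scoring by the raw decrease in cv would favour different strata.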
15
Stratum ‘success’ sizes
16
Overall ‘success’ size
17
Sample sizes
18
Overall sample size
19
Estimated p’s, correct
20
Achieved cv’s, correct
21
Estimated p’s,
22
Achieved cv’s
23
Estimated p’s, varying
24
Achieved cv’s
25
Strategy
This suggests a strategy for sampling to assess the quality of linkage, consisting of:
1. If reasonable estimates of p_h are available, use them in a Neyman allocation in a stratified design. This will give the smallest sample size with a reasonable chance of achieving the required cv.
2. If these are not available, and only an overall estimate is required, use inverse sampling on randomly sorted data.
3. If separate estimates in the strata are desirable, follow the algorithm for stratified inverse sampling.
Possible combination approach: use inverse sampling to get initial estimates of p_h to feed into a Neyman allocation.
26
Conclusions
- Stratified inverse sampling not effective?
- Sample sizes smaller for stratified sampling (much smaller for p~0.1, not much different for p~0.001)
- Combination strategies may offer improved results when initial estimates are not available
27
Questions?
Paul Smith
p.a.smith@soton.ac.uk