Sampling procedures for assessing accuracy of record linkage Paul A. Smith, S3RI, University of Southampton Shelley Gammon, Sarah Cummins, Christos Chatzoglou,

Presentation transcript:

Sampling procedures for assessing accuracy of record linkage
Paul A. Smith, S3RI, University of Southampton
Shelley Gammon, Sarah Cummins, Christos Chatzoglou, Office for National Statistics
Dick Heasman

Outline
- Record linkage and strategies
- Quality measures for record linkage
- Stratified sampling
- Inverse sampling
- Results
- Conclusions

Record linkage
- increasingly important in official statistics
- many strategies for linkage
- most high-quality linkage routines contain multiple passes, e.g.
  - exact matching
  - rule-based matching
  - probabilistic matching methods

Quality measures
- precision
- recall
- f-measure
- all rely on truth assessment, which is usually clerical and expensive
TP = true positive, FN = false negative, etc.
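The formulas on this slide did not survive in the transcript. As a minimal sketch, assuming the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), f-measure = harmonic mean of the two), the three measures could be computed as follows. The function name and the counts are illustrative only.

```python
def linkage_quality(tp, fp, fn):
    """Return (precision, recall, f_measure) from link classification counts.

    tp: links that are true matches
    fp: links that are not true matches
    fn: true matches that were not linked
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative counts, not taken from the presentation.
precision, recall, f_measure = linkage_quality(tp=90, fp=10, fn=30)
```

All three need TP, FP and FN, which is why a truth assessment (here clerical review) is unavoidable.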

Overall and component quality
- Want overall assessment of matching quality from all stages
- Also interested in quality of the different stages
  - treat stages as strata
  - stratification advantageous if strata different, and internally homogeneous
  - expect differences in precision in different match stages
  - can stratification improve precision/reduce costs of assessment?
  - how many records do I need to assess clerically?

Example with known outcome
- Use linkage from 2011 England & Wales Population Census and Census Coverage Survey
  - linkage was done automatically with clerical resolution of 'uncertain' cases
  - take clerical linkage result as ‘truth’
- Evaluate automated linkage procedures
  - compare precision and recall
  - experiment with accuracy measures for inference on quality measures

Choice of stratifying variables – true/false positives
- 600k links from a new automated process
- 0.26% FPs compared to original clerical linkage ‘truth’
- sample 500 TPs and 500 FPs by srs as basis for modelling
- Census hard to count index (1-5)
- Sex (1 or 2)
- Age group (0-17, 18-24, 25-39, 40-64, 65+, unknown)
- Whether in London (1) or not London (0)
- Ethnicity (White, Asian/Asian British, Black/Black British, mixed, other, unknown)
- Linkage pass (1, 2, 3, 4, 5, 6, 7, 11)

Stratifying variables
- BIC to penalise overfitting
- best model has
  - hard to count index (HtC)
  - ‘match pass’ (linkage method)
- merge similar levels
  - HtC {1&2}, {3} and {4&5}
  - match pass with levels {1}, {2-7} and {11} (= exact, deterministic and probabilistic matching)
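The slides do not state the model form used with BIC, so the following is only a sketch of the idea of merging similar levels. It assumes a simple model with one false-positive proportion per group of levels and compares BIC = k ln(n) - 2 ln(L) before and after merging. The counts are hypothetical, not the census figures.

```python
import math


def max_loglik(successes, trials):
    """Maximised Bernoulli log-likelihood for one group with its own proportion."""
    if successes == 0 or successes == trials:
        return 0.0
    p = successes / trials
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)


def bic(groups):
    """BIC = k*ln(n) - 2*ln(L) for a model with one proportion per group."""
    n = sum(trials for _, trials in groups)
    k = len(groups)
    loglik = sum(max_loglik(s, t) for s, t in groups)
    return k * math.log(n) - 2 * loglik


# Hypothetical (false positives, sampled links) for HtC levels 1 to 5.
separate = [(40, 200), (42, 200), (100, 200), (158, 200), (160, 200)]
# Same data with levels {1&2}, {3} and {4&5} merged.
merged = [(82, 400), (100, 200), (318, 400)]

bic_separate = bic(separate)
bic_merged = bic(merged)
```

With these made-up counts the merged grouping has the lower BIC: levels 1 and 2 (and 4 and 5) have almost identical proportions, so the extra parameters cost more in penalty than they gain in fit.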

Choice of stratifying variables – true/false negatives
- 30k ‘most likely’ non-links (deduplication)
- approx. 1/3 are FNs
- sample 500 TNs and 500 FNs by srs as basis for modelling
- BIC
  - best model includes match probability, London, age group and ethnicity
- merge similar levels
  - age group to {0-24}, {25-64} and {65+}
  - ethnicity to {White, mixed}, {Asian, Black, other} and {unknown}

Estimation
- precision
  - estimate as proportion of all links
  - straightforward: the links are known
- recall
  - estimate as proportion of real matches (TP + FN)
  - not straightforward: need to estimate FN and then calculate recall
  - FN estimated as proportion of non-links
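A minimal sketch of the recall calculation described above, assuming TP is estimated as precision times the number of links and FN as an estimated rate times the number of non-links examined. The function name is hypothetical. The inputs reuse the rounded figures quoted earlier (600k links, 0.26% FPs, 30k non-links with about one third FNs) purely as an illustration.

```python
def estimate_recall(n_links, precision_hat, n_nonlinks, fn_rate_hat):
    """Estimate recall = TP / (TP + FN) from estimated rates.

    n_links:       number of links made (known)
    precision_hat: estimated proportion of links that are true matches
    n_nonlinks:    number of candidate non-links considered
    fn_rate_hat:   estimated proportion of those non-links that are true matches
    """
    tp_hat = n_links * precision_hat
    fn_hat = n_nonlinks * fn_rate_hat
    return tp_hat / (tp_hat + fn_hat)


# Illustrative inputs based on the rounded figures quoted in the slides.
recall_hat = estimate_recall(600_000, 1 - 0.0026, 30_000, 1 / 3)
```

Because the estimate depends on two sampled proportions, its uncertainty combines the sampling error in both the link and the non-link samples.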

Stratified sampling
- Stratified sampling for proportions (Cochran 1977)
- requires ‘knowledge’ of the stratum proportions p_h
- then allows control of var(p)
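The formulas on this slide are missing from the transcript. The sketch below assumes the textbook stratified-proportion variance without the finite population correction, var(p) = sum of W_h^2 p_h (1 - p_h) / n_h, and a Neyman allocation with n_h proportional to N_h sqrt(p_h (1 - p_h)). The stratum sizes, proportions and target cv are hypothetical.

```python
import math


def neyman_allocation(N, p, target_cv):
    """Return per-stratum sample sizes meeting target_cv under Neyman allocation.

    N: stratum sizes; p: assumed stratum proportions (each strictly in (0, 1)).
    Ignores the finite population correction, and does not cap n_h at N_h.
    """
    total = sum(N)
    W = [Nh / total for Nh in N]
    S = [math.sqrt(ph * (1 - ph)) for ph in p]
    p_bar = sum(w * ph for w, ph in zip(W, p))
    target_var = (target_cv * p_bar) ** 2
    sum_ws = sum(w * s for w, s in zip(W, S))
    n_total = sum_ws ** 2 / target_var           # total n needed
    # Round each stratum up so the target is still met.
    return [math.ceil(n_total * w * s / sum_ws) for w, s in zip(W, S)]


def stratified_cv(N, p, n):
    """cv of the stratified proportion estimate for allocation n (no fpc)."""
    total = sum(N)
    W = [Nh / total for Nh in N]
    p_bar = sum(w * ph for w, ph in zip(W, p))
    var = sum(w ** 2 * ph * (1 - ph) / nh for w, ph, nh in zip(W, p, n))
    return math.sqrt(var) / p_bar


# Hypothetical strata (e.g. three match-pass groups) and assumed FP rates.
N = [400_000, 150_000, 50_000]
p = [0.001, 0.005, 0.02]
n_h = neyman_allocation(N, p, target_cv=0.2)
achieved_cv = stratified_cv(N, p, n_h)
```

The allocation is only as good as the assumed p_h: if the guesses are poor, the achieved cv can miss the target, which motivates the inverse sampling approach.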

Variance or cv?
- Control of var(p) only helpful if p guessed/known accurately
- Often unknown initially
- Inverse sampling (Haldane 1945) controls cv(p) by continuing sampling until m ‘successes’ observed, m fixed
  - sequential procedure with stopping rule
  - fixed m does not mean fixed sample size
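A small simulation can illustrate the stopping rule. This sketch assumes independent draws with constant success probability p and uses the estimator (m - 1) / (n - 1), which is the unbiased estimator usually attributed to Haldane. To first order its cv is roughly sqrt((1 - p) / m), so it depends mainly on m and hardly on p. The values of p, m and the seed are arbitrary.

```python
import random


def inverse_sample(p, m, rng):
    """Draw cases until m successes are seen; return (n_drawn, p_hat).

    p_hat = (m - 1) / (n - 1) requires m >= 2.
    """
    if m < 2:
        raise ValueError("m must be at least 2")
    successes = 0
    n = 0
    while successes < m:
        n += 1
        if rng.random() < p:       # one clerically checked case
            successes += 1
    return n, (m - 1) / (n - 1)


rng = random.Random(2011)          # fixed seed for repeatability
runs = [inverse_sample(p=0.1, m=20, rng=rng) for _ in range(2000)]
sample_sizes = [n for n, _ in runs]
mean_p_hat = sum(est for _, est in runs) / len(runs)
```

The sample sizes differ from run to run even though m is fixed, which is the practical cost of controlling the cv rather than the variance.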

Inverse sampling
[slide content not captured in the transcript]

Stratified inverse sampling
- no clear solution for allocation of n_h from var(p_h)
- numerical search approach
  - minimum m_h = m* allocated in each stratum
  - increase m_h in stratum where largest decrease in cv(p), based on expected no. of cases to next ‘success’
  - repeat until target cv achieved
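The search described above can be sketched as a greedy loop. This is one possible reading, not necessarily the authors' algorithm: it assumes var(p_h) is approximately p_h^2 (1 - p_h) / m_h, and it scores each stratum by the decrease in cv per expected extra case sampled, taking 1 / p_h as the expected number of cases to the next success. All inputs are hypothetical.

```python
import math


def inverse_cv(W, p, m):
    """Approximate cv of the stratified estimate under inverse sampling."""
    p_bar = sum(w * ph for w, ph in zip(W, p))
    var = sum(w ** 2 * ph ** 2 * (1 - ph) / mh for w, ph, mh in zip(W, p, m))
    return math.sqrt(var) / p_bar


def allocate_successes(W, p, target_cv, m_start=2, max_steps=100_000):
    """Greedily raise m_h until the approximate cv meets target_cv."""
    m = [m_start] * len(W)
    for _ in range(max_steps):
        current = inverse_cv(W, p, m)
        if current <= target_cv:
            return m
        best_h, best_gain = 0, -1.0
        for h in range(len(W)):
            trial = m.copy()
            trial[h] += 1
            # cv decrease per expected extra case (about 1 / p_h cases needed).
            gain = (current - inverse_cv(W, p, trial)) * p[h]
            if gain > best_gain:
                best_h, best_gain = h, gain
        m[best_h] += 1
    raise RuntimeError("target cv not reached within max_steps")


# Hypothetical stratum weights and assumed success rates.
W = [400 / 600, 150 / 600, 50 / 600]
p = [0.001, 0.005, 0.02]
m_h = allocate_successes(W, p, target_cv=0.2)
expected_cases = sum(mh / ph for mh, ph in zip(m_h, p))
```

The expected total sample size is the sum of m_h / p_h, so strata with rare successes are expensive to sample even when their m_h is small.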

Stratum ‘success’ sizes [figure]

Overall ‘success’ size [figure]

Sample sizes [figure]

Overall sample size [figure]

Estimated p’s, correct [figure]

Achieved cv’s, correct [figure]

Estimated p’s, [figure]

Achieved cv’s [figure]

Estimated p’s, varying [figure]

Achieved cv’s [figure]

Strategy
- This suggests a strategy for sampling to assess the quality of linkage, consisting of:
  1. If reasonable estimates of the p_h are available, use them in a Neyman allocation in a stratified design. This will give the smallest sample size with a reasonable chance of achieving the required cv.
  2. If these are not available, and only an overall estimate is required, use inverse sampling on randomly sorted data.
  3. If separate estimates in the strata are desirable, follow the algorithm for stratified inverse sampling.
- Possible combination approach: use inverse sampling to get initial estimates of the p_h to feed into a Neyman allocation

Conclusions
- Stratified inverse sampling not effective?
- Sample sizes smaller for stratified sampling (much smaller for p~0.1, not much different for p~0.001)
- Combination strategies may offer improved results when initial estimates not available.

Questions?
Paul Smith