Challenges in data linkage: error and bias Katie Harron October 2014 UCL Institute of Child Health k.harron@ucl.ac.uk
The linkage problem Match status Match Non-match Link status Link Match status Match (pair from same individual) Non-match (pair from different individuals) Link status Link Identified match False match Non-link Missed match Identified non-match 2
Deterministic linkage in Hospital Episode Statistics (HES) Few false-matches 1 Sex Date of Birth NHS Number 2 Postcode Local Patient Identifier within Provider 3 More missed-matches What kind of linkage is this? Which patients might be missed?
Quality of unique identifiers 166,406 records of admissions to paediatric intensive care (PICANet) 85,137 non-matches 81,269 matches 46 (0.1%) same NHS number 3,207 (4%) different NHS numbers Hagger-Johnson et al. Causes and consequences of data linkage errors: False and missed matches following linkage of hospital data (under review)
Deterministic linkage with pseudonymisation at source Courtesy of Peter Jones, ONS Beyond 2011 programme
Probabilistic linkage Highest weight is retained pair 1 pair 2 pair 3 Low match High match weight weight Primary File Ronald Fisher Linking File Ronald Fisher Linking File Karl Pearson Linking File Carl Gauss
Probabilistic linkage Highest weight is retained pair 3 Low match High match weight weight P(γ=1 | M) = m-probability = sensitivity the probability of agreement given the records from same subject P(γ=1 | U) = u-probability= 1-specificity the probability of agreement given the records from different subjects Log ratio = w = log2 (m/u) if identifiers agree log2 [(1-m)/(1-u)] if identifiers disagree Match weight = W = ∑wi
Probabilistic linkage Matches agreement on NHS number Non-matches agreement on sex Low match High match weight weight disagreement on date of birth agree on some ids disagree on some ids Chance (same date of birth) Recording errors Missing data
Matches Non-matches Two thresholds Missed matches False matches Links Low match High match weight weight Links Links Two thresholds
Evaluating linkage quality Small amounts of linkage error can result in substantially biased results The impact of linkage error on results is rarely reported Linkage error affects different types of analysis in different ways
Why it’s important to evaluate linkage error Schmidlin et al (2013) Impact of unlinked deaths and coding changes on mortality trends in the Swiss National Cohort. BMC Med Inform Decis Mak 13 (1):1
Why it’s important to evaluate linkage error Hobbs, G. & Vignoles, A., 2007. Is free school meal status a valid proxy for socio-economic status (in schools research)? Centre for the Economics of Education; London School of Economics and Political Science.
Why it’s important to evaluate linkage error Highly sensitive Highly specific Lariscy (2011). Differential Record Linkage by Hispanic Ethnicity and Age in Linked Mortality Studies: Implications for the Epidemiologic Paradox. J Aging Health 23(8):1263-84
Why it’s important to evaluate linkage error Ford et al 2006. Characteristics of unmatched maternal and baby records in linked birth records and hospital discharge data. Paediatric and Perinatal Epidemiology 20(4): 329-337.
Evaluating linkage quality i) Sensitivity analysis using different linkage criteria ii) Subset of gold-standard data to quantify linkage bias Highly sensitive Highly specific iii) Comparisons of linked and unlinked data iv) Imputation for uncertain links
Imputation for linkage Primary file Linking file Variable of interest Record 1 Exact match high Record 2 low Record 3 Prior-informed imputation high low Record 4 Match weight=10 med Match weight=1 low Record n Match weight=5 Match weight=4 high Match weight=3 med high Goldstein et al Stat Med 2012;31(28):3481-3493 Harron et al BMC Med Res Method 2014;14(1):36
Implications for data providers i) Sensitivity analysis using different thresholds Availability of all candidate records (linked and unlinked) ii) Subset of gold-standard data to quantify linkage bias iii) Comparisons of linked and unlinked data Subset of data where true match status is known (gold-standard) iv) Imputation for uncertain links Harron et al 2012. Opening the black box of record linkage. J Epidemiol Commun H 66(12):1198
Summary Data linkage is a powerful tool for enhancing administrative data Linkage error has important effects on analyses Results vary according to choice of thresholds and methods Taking error into account is possible without releasing identifiable data Communication between linkers and data users is vital
Acknowledgements and funding Harvey Goldstein, Ruth Gilbert, Gareth Hagger-Johnson and Angie Wade, UCL Institute of Child Health Berit Muller-Pebody, Public Health England Roger Parslow, Tom Fleming, Lee Norman and the PICANet team, University of Leeds This work was supported by funding from the National Institute for Health Research Health Technology Assessment (NIHR HTA) programme (project number 08/13/47). The views and opinions expressed therein are those of the authors and do not necessarily reflect those of the HTA programme, NIHR, NHS or the Department of Health. The authors state no conflicts of interest.