Linking Mortality and Inpatient Discharge Records: Comparing Deterministic and Probabilistic Methodologies Richard Miller Office of Health Informatics.

Linking Mortality and Inpatient Discharge Records: Comparing Deterministic and Probabilistic Methodologies Richard Miller Office of Health Informatics Mike Yuan Bureau of Community Health Promotion Wisconsin Division of Public Health June 2011

Linking (matching) Mortality Records and Inpatient Discharge Records: Why combine mortality records and inpatient discharge records? How to link or match records. Method 1: Deterministic record linkage. Method 2: Probabilistic record linkage. How do the results compare? Lessons learned.

Why Combine Mortality Records and Inpatient Discharge Records? Improve surveillance of CVD and other chronic diseases. Enhanced surveillance analysis opportunities: mortality records capture CVD only if it is an underlying or contributing cause; inpatient records capture CVD treated in that setting, but the case history ends at discharge. Capture hospital record information on demographics, co-morbidities, complications, and surgical procedures. Measure treatment outcomes on a population basis.

The Time Frame for Linked Records: Analyses are more complete the more time there is to find a death record following a hospitalization. The scale of mortality and inpatient records in Wisconsin: 2 million inpatient discharge records (a smaller number of individual patients) and 140,000 mortality records. How to find matching records? How to define links between records?

False Positives and Negatives: Matching records involves finding a balance between false positive and false negative matches. False positive matches combine records for different people. False negatives fail to include all persons in the dataset of matched records, possibly introducing bias.

Method 1. Deterministic Record Linkage: Pairs of records are compared for exactly matching identifying information. Exact matches determine true record matches. This works perfectly only if information that uniquely identifies the same individual in two datasets is available, is captured perfectly, and is recorded perfectly. In real-world data systems: uniquely identifying elements are often not available; recorded data have small differences between records; and some records have fields with missing values.
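
The exact-match rule can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the study; the field names (initials, dob, sex, zip) and the sample records are invented. It shows how one differing ZIP digit or one missing value is enough to defeat an exact match.

```python
def exact_key(record, fields):
    """Build a linking key; return None if any field is missing or empty."""
    values = [record.get(f) for f in fields]
    if any(v in (None, "") for v in values):
        return None
    return tuple(values)


def deterministic_match(rec_a, rec_b, fields):
    """Two records link only if every field is present and identical."""
    key_a = exact_key(rec_a, fields)
    key_b = exact_key(rec_b, fields)
    return key_a is not None and key_a == key_b


# Hypothetical records for the same person; the ZIP differs by one digit.
patient = {"initials": "RM", "dob": "1950-03-14", "sex": "M", "zip": "53703"}
death = {"initials": "RM", "dob": "1950-03-14", "sex": "M", "zip": "53704"}

print(deterministic_match(patient, death, ["initials", "dob", "sex"]))
print(deterministic_match(patient, death, ["initials", "dob", "sex", "zip"]))
```

Adding fields to the key makes false positives less likely but makes false negatives more likely, which is the balance described earlier.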

Method 2. Probabilistic Record Linkage: Every pair of records has some probability of being a “true match.” Specialized software estimates that probability by applying statistical principles and tools. Set a threshold for “high probability matches”: a common criterion is a 0.9 probability of being a true match, which defines the risk of accepting false positives. Some methods also impute matches for pairs that look unlikely only because of possible reporting and recording errors.

Part I. Deterministic Linkage among Inpatient Records: Identifying patients = de-duplicating inpatient records. Method: iterative application of combinations of elements with person-matching face validity. Available fields: initials; 3-digit encryption of last name (Miller = M460); date of birth; gender; ZIP code of residence; insurance ID >> “SSN-like string”; hospital and medical record number.
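
The slide does not name the last-name encoding, but the example "Miller = M460" is what the standard American Soundex code produces. Assuming Soundex is the scheme (an assumption, not something the presentation states), a minimal sketch is:

```python
# Soundex digit for each coded consonant; vowels, Y, H and W have no digit.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code (first letter plus three digits)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's code can absorb the next one
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (first + "".join(digits) + "000")[:4]


print(soundex("Miller"))
```

Because many surnames share a code, the code alone is a weak identifier. This fits the later observation in the deck that last-name encoding is among the weaker deterministic criteria.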

Part I: Deterministic Linkage among Inpatient Records: Uniqueness of Patient Identifiers. Wisconsin Inpatients Discharged , N=2,017,339

“Patient” identifier                % records with identifier   % with unique values
Initials + DOB + sex                100%                        56.2%
Initials + DOB + sex + ZIP          99.9%                       63.4%
Policy number + DOB + sex           92.1%                       64.7%
SSN-like string + DOB + sex         78.2%                       61.2%
Hospital + medical record number    99.7%                       70.9%
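
A profile like the one in this table could be computed along the following lines. The definitions are assumptions: "% records with identifier" is taken to mean records where every component field is present, and "% with unique values" to mean records whose identifier value occurs exactly once, both as a share of all records. The presentation does not spell out either definition, and the field names are hypothetical.

```python
from collections import Counter


def identifier_profile(records, fields):
    """Return (% of records with a complete identifier, % with a unique value).

    Both percentages use all records as the denominator (an assumption).
    `records` must be a non-empty list of dicts.
    """
    keys = []
    for rec in records:
        values = tuple(rec.get(f) for f in fields)
        if all(v not in (None, "") for v in values):
            keys.append(values)
    counts = Counter(keys)
    n_unique = sum(1 for k in keys if counts[k] == 1)
    total = len(records)
    return 100.0 * len(keys) / total, 100.0 * n_unique / total


# Invented discharge records: two stays share an identifier, one lacks a DOB.
discharges = [
    {"initials": "RM", "dob": "1950-03-14", "sex": "M"},
    {"initials": "RM", "dob": "1950-03-14", "sex": "M"},
    {"initials": "JS", "dob": "1940-01-01", "sex": "F"},
    {"initials": "AB", "dob": None, "sex": "F"},
]
print(identifier_profile(discharges, ["initials", "dob", "sex"]))
```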

Part I: Deterministic Linkage among Inpatient Records: Record links were evaluated by looking for three indicators of false positive matches: 1. Any later admission date preceding the earliest admission’s discharge date. 2. Any admission date preceding the previous admission’s discharge date. 3. Records indicating the patient died but the patient has later hospitalizations.
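
These three logic tests could be applied to the stays linked to one "patient" roughly as follows. This is a sketch with hypothetical field names (admit, discharge, died). It treats a same-day readmission as valid, which is a design choice rather than something stated in the slides.

```python
from datetime import date


def false_positive_flags(stays):
    """Return the indicators triggered by one linked patient's stays."""
    flags = set()
    if not stays:
        return flags
    ordered = sorted(stays, key=lambda s: s["admit"])
    first = ordered[0]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur["admit"] < first["discharge"]:
            flags.add("overlaps_first_stay")      # indicator 1
        if cur["admit"] < prev["discharge"]:
            flags.add("overlaps_previous_stay")   # indicator 2
        if prev["died"]:
            flags.add("admitted_after_death")     # indicator 3
    return flags


# Invented example: the second stay starts before the first one ends.
overlapping = [
    {"admit": date(2006, 1, 1), "discharge": date(2006, 1, 10), "died": False},
    {"admit": date(2006, 1, 5), "discharge": date(2006, 1, 7), "died": False},
]
print(sorted(false_positive_flags(overlapping)))
```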

Part II: Deterministic Matching of Patients to Mortality Records: Matches between the 1,280,000 resident patients and the 135,000 Wisconsin occurrence deaths to residents. Which inpatient record? The most recent one. Iterative procedures use a succession of identifiers (combinations of the available data elements): (1) construct a linking identifier; (2) select records with unique values of the “linker”; (3) sort each set by that linking identifier; (4) match and merge those records with identical linker values; (5) collect the remaining records; (6) construct an alternative linking combination; (7) repeat until plausible linking combinations have been exhausted.
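
The iterative procedure can be sketched as a loop over successive linker definitions. Everything here is illustrative: the record ids, field names and sample data are invented. The explicit sort step is replaced by a dictionary lookup on the linker value, which reaches the same pairs without sorting.

```python
from collections import defaultdict


def unique_index(records, fields):
    """Map linker value -> record id, keeping only complete and unique linkers."""
    buckets = defaultdict(list)
    for rec_id, rec in records.items():
        key = tuple(rec.get(f) for f in fields)
        if all(v not in (None, "") for v in key):
            buckets[key].append(rec_id)
    return {key: ids[0] for key, ids in buckets.items() if len(ids) == 1}


def iterative_link(patients, deaths, linker_passes):
    """Apply each linker in turn to the records left over from earlier passes."""
    patients, deaths = dict(patients), dict(deaths)  # work on copies
    links = []
    for fields in linker_passes:
        p_index = unique_index(patients, fields)
        d_index = unique_index(deaths, fields)
        for key in p_index.keys() & d_index.keys():
            p_id, d_id = p_index[key], d_index[key]
            links.append((p_id, d_id, fields))
            del patients[p_id]
            del deaths[d_id]
    return links, patients, deaths


patients = {
    "P1": {"initials": "RM", "dob": "1950-03-14", "sex": "M", "zip": "53703", "ssn4": "1234"},
    "P2": {"initials": "JS", "dob": "1940-01-01", "sex": "F", "zip": "53201", "ssn4": "9999"},
}
deaths = {
    "D1": {"initials": "RM", "dob": "1950-03-14", "sex": "M", "zip": "53704", "ssn4": "1234"},
    "D2": {"initials": "JS", "dob": "1940-01-01", "sex": "F", "zip": "53201", "ssn4": None},
}
passes = [("initials", "dob", "sex", "zip"), ("initials", "dob", "sex", "ssn4")]
links, unmatched_patients, unmatched_deaths = iterative_link(patients, deaths, passes)
print([(p, d) for p, d, _ in links])
```

Recording which linker produced each pair, as the third tuple element does here, makes it possible to review the weaker passes separately later.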

Part II: Deterministic Matching of Patients to Mortality Records: Iterative matching in two phases. I. Match the records for in-hospital deaths: less time between events and more data elements in common; date of death = discharge date; hospital is a match element; 25% of deaths, 2% of inpatients. II. Examine the remaining records for matches.

Part II: Deterministic Matching of Inpatient Records to Mortality Records: Phase I. Linked In-Hospital Deaths

                                               Matched records             Remaining unmatched records
Linker                        # pairs matched  % inpatient   % mortality   # inpatient   # mortality
All In-Hospital Deaths                                                     32,816        35,745
Initials + DOB + Sex + ZIP    26,022           79%           73%           6,794         9,723
Initials + DOB + Sex + SSN    2,666            8%            7%            4,128         7,057
Initials + Sex + ZIP3 + DOD   1,496            4%            4%            2,632         5,561
Hospital + DOD + DOB          833              2%            2%            1,799         4,728
Initials + Sex + DOB                                                       ,762          4,691
All Linked Pairs              31,              %             86.9%

Part II: Deterministic Matching of Inpatient Records to Mortality Records: Phase 2. Linked Residual Deaths and Patients

                                               Matched records             Remaining unmatched records
Linker                        # pairs matched  % inpatient   % mortality   # inpatient   # mortality
Residual Deaths                                                            1,195,638     104,023
Initials + DOB + Sex + ZIP    53,059           4%            51%           1,142,579     50,964
Initials + DOB + Sex + SSN    5,514            <1%           11%           1,137,065     45,450
All Residual Linked Pairs     58,573           4.9%          56.3%

Part II: Deterministic Matching of Inpatient Records to Mortality Records: Combined results: linked 66% of the mortality records to a hospital patient (89,627 of the 135,077 total resident and occurrence deaths). Evaluated results with logic tests: admission date after previous discharge date; not hospitalized again after being discharged ‘expired’; agreement rates among other data elements.

Part III: Probabilistic Matching of Inpatient Records to Mortality Records: A “probabilistic record linkage methodology” recognizes that a pair of records has some probability of being a “true match.” Specialized software products estimate that probability: LinkSolv (our choice), LinkPlus, and LinkPro. LinkSolv is based on Bayesian statistics as applied by Fellegi and Sunter and considerably developed by Dr. Michael McGlincy, the software developer.

Part III: Probabilistic Matching of Inpatient Records to Mortality Records: LinkSolv compares pairs of fields, incorporating a number of adjustments to account for real-world violations of statistical assumptions: the probability that apparently different values may both be correct; rates of missing data; estimated rates of reporting errors; and discounting some weights for matching/mismatching values if agreements/disagreements on one field are related to agreements/disagreements on another. Comparisons may be for exact matches or acceptable differences.
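
The basic Fellegi-Sunter calculation behind this kind of software can be sketched as follows. It is a textbook version under the usual independence assumption and does not reproduce LinkSolv's adjustments. The m and u probabilities and the prior are invented for illustration. Here m is the chance that a field agrees when two records are a true match, and u is the chance that it agrees when they are not.

```python
import math


def match_weight(agreements, m_u):
    """Sum log2 likelihood ratios over fields; a missing comparison adds nothing."""
    total = 0.0
    for field, agrees in agreements.items():
        if agrees is None:  # value missing on either record: no evidence
            continue
        m, u = m_u[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def match_probability(weight, prior):
    """Convert a total weight to a posterior match probability via Bayes' rule."""
    odds = (prior / (1 - prior)) * 2 ** weight
    return odds / (1 + odds)


# Hypothetical (m, u) values per field; a real project estimates them from data.
M_U = {"dob": (0.95, 0.01), "sex": (0.98, 0.50), "zip3": (0.90, 0.05)}
PRIOR = 0.001  # assumed share of candidate pairs that are true matches

all_agree = match_weight({"dob": True, "sex": True, "zip3": True}, M_U)
zip_differs = match_weight({"dob": True, "sex": True, "zip3": False}, M_U)
for weight in (all_agree, zip_differs):
    p = match_probability(weight, PRIOR)
    print(round(weight, 2), round(p, 3), "accept" if p >= 0.9 else "review")
```

With only three weak fields, even full agreement may fall short of a 0.9 threshold. This is one reason a model benefits from adding discriminating fields such as SSN-4 and the encoded last name.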

Part III: Probabilistic Matching of Inpatient Records to Mortality Records: Some simplifying decisions: use the most recent inpatient discharge identified by the deterministic linkage process; drop the 30% of patients who are mothers and their newborns; work only with the patients whose last hospitalization was in 2006.

Part III: Probabilistic Matching of Inpatient Records to Mortality Records: Experimented with comparison fields: Disaggregate birth date or not? Break up ZIP into ZIP-3 and ZIP-2 components or not? Break up name into separate initials and encrypted field? Use full SSN or just the last 4 digits (SSN-4)? Use elements only available for the in-hospital deaths?

Part III: Probabilistic Matching of Inpatient Records to Mortality Records: Final model was relatively simple: last initial + encryption (Miller = M460); first initial; SSN-4; date of birth as one field; gender M/F; ZIP-3.
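
Deriving these six comparison fields from a raw record might look like the sketch below. The input field names (last_code, first_initial, ssn, dob, sex, zip) are hypothetical, and so are the rules of treating an SSN that is not nine digits, or a ZIP shorter than five characters, as missing.

```python
def comparison_fields(rec):
    """Reduce a raw record to the six fields of the final model."""
    ssn_digits = "".join(ch for ch in (rec.get("ssn") or "") if ch.isdigit())
    zip_code = (rec.get("zip") or "").strip()
    return {
        "last_code": rec.get("last_code"),          # last initial + encoded name
        "first_initial": rec.get("first_initial"),
        "ssn4": ssn_digits[-4:] if len(ssn_digits) == 9 else None,
        "dob": rec.get("dob"),                      # kept as a single field
        "sex": rec.get("sex"),
        "zip3": zip_code[:3] if len(zip_code) >= 5 else None,
    }


# Invented record with an obviously fake SSN.
raw = {"last_code": "M460", "first_initial": "R", "ssn": "123-45-6789",
       "dob": "1950-03-14", "sex": "M", "zip": "53703-1234"}
print(comparison_fields(raw))
```

Truncating to ZIP-3 and SSN-4 trades discriminating power for tolerance of local moves and partially recorded numbers.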

Part III: Probabilistic Matching of Inpatient Records to Mortality Records: This model was applied to three overlapping subsets of records, along with estimated corrections to statistical assumptions. We merged the three linkage passes in a multiple imputation process that applies Markov chain Monte Carlo techniques to create five alternative sets of paired records. This identifies additional record pairs that have a low, but real, probability of being true matches, due to possible measurement errors. For evaluation purposes, we de-duplicated these 5 sets to identify a final set of 36,562 inpatient-mortality records linked with probabilistic methods.
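
The final de-duplication step amounts to taking the union of the imputed sets. The sketch below also records how often each pair appears across imputations, an extra that the presentation does not mention. The imputed sets themselves would come from the linkage software; the ones here are invented, and no Markov chain Monte Carlo sampling is shown.

```python
from collections import Counter


def pool_imputations(imputed_sets):
    """Map each distinct pair to the share of imputed sets that contain it."""
    counts = Counter(pair for pairs in imputed_sets for pair in set(pairs))
    n_sets = len(imputed_sets)
    return {pair: count / n_sets for pair, count in counts.items()}


# Three invented imputed sets of (patient id, death id) pairs.
imputed = [
    {("P1", "D1"), ("P2", "D2")},
    {("P1", "D1")},
    {("P1", "D1"), ("P3", "D9")},
]
pooled = pool_imputations(imputed)
print(len(pooled), pooled[("P1", "D1")])
```

The keys of the result are the de-duplicated set of linked pairs. Pairs with a low share are the low-probability links that only some imputations include.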

Comparison of Results: Combined Linked Pairs. 93% of deterministic matches were confirmed by the probabilistic matches. 14% of probabilistic matches were not captured by deterministic linking.
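
The two percentages are set comparisons with different denominators, which is easy to lose track of. A minimal sketch, with invented pairs rather than the study's data:

```python
def compare_link_sets(deterministic, probabilistic):
    """Return (% of deterministic pairs confirmed, % of probabilistic pairs new)."""
    det, prob = set(deterministic), set(probabilistic)
    pct_confirmed = 100.0 * len(det & prob) / len(det)
    pct_new = 100.0 * len(prob - det) / len(prob)
    return pct_confirmed, pct_new


det_pairs = {("P1", "D1"), ("P2", "D2"), ("P3", "D3"), ("P4", "D4")}
prob_pairs = {("P1", "D1"), ("P2", "D2"), ("P3", "D3"), ("P5", "D5")}
print(compare_link_sets(det_pairs, prob_pairs))
```

The first figure is a share of the deterministic pairs and the second a share of the probabilistic pairs, so the two need not sum to 100.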

Comparison of Results: Evaluating the discrepant results. (1) High-probability matches not found in the deterministic matches: the most common issue was discrepancies in the last two ZIP digits. (2) Low-probability matches: 2% of the record pairs identified by both methods were evaluated by LinkSolv as having a low probability of being a true match. This suggests that some deterministic criteria are weaker than would be desirable, notably last name encryption and SSN. (3) Deterministic matches not confirmed by probabilistic matching: should we be wary of this 5% of matches? They are disproportionately in-hospital deaths.

Conclusions: De-duplicating patients. The strongest linking combination was patient’s initials + date of birth + sex + ZIP, which yielded reasonable and apparently robust results. Given the observed instability of ZIP code in the population of deceased recent patients, we should experiment with substituting ZIP-3. This will result in fewer ‘patients’ being identified. The trade-off is the creation of more false-positive matches.

Conclusions: Linking patients to mortality records. The probabilistic process yields more matched pairs than the deterministic process, but not dramatically so. Overall, the more rigorous probabilistic method validated the results of the deterministic linkage. Initials, date of birth, and sex: patient and mortality records are generally reliable and consistent. ZIP: less reliable; small moves often result in different ZIPs, and older patients are particularly likely to make such moves. Probabilistic models used only ZIP-3. SSN: using full SSN limited the success of exact matching. SSNs were teased out of policy numbers but are often missing or are a spouse’s SSN. Probabilistic models used only SSN-4.

Conclusions: Both methods created reasonable sets of matched pairs of records. Those sets had a high degree of common pairs. The deterministic process is probably more accessible and efficient for the general user. However, the quality is heavily dependent on the completeness and accuracy of the recorded data.

Conclusions: The probabilistic process, particularly as developed in LinkSolv, is more statistically rigorous and will more thoroughly identify matched pairs. Using multiply-imputed output datasets requires sophisticated statistical treatment by well-trained researchers. Useful lessons can be learned from the application of both methods to the same datasets. The probabilistic process provides a rigorous evaluation and, perhaps, validation of the results of deterministic exact-matching. The probabilistic process provides insights into the utility of particular data elements; this may be used to refine and improve a deterministic matching process.

Acknowledgments: We gratefully acknowledge the support of CSTE’s Cardiovascular Disease Surveillance Data Pilot Project. We are indebted to Dr. Michael McGlincy, Strategic Matching Inc., for his thoughtful advice.

Linking Mortality and Inpatient Records: Comparing Deterministic and Probabilistic Methods Richard Miller HerngLeh (Mike) Yuan Wisconsin Division of Public Health