Record Linkage Concepts. Acknowledgements Slides adapted from training materials developed by CDC–NPCR Faculty: Melissa Jim, CDC/IHS

Slides:



Advertisements
Similar presentations
Presenting on behalf of the Stage 2 MU PH Reporting Requirements Task Force.
Advertisements

The 10 Essential Public Health Services An Overview
Overview of Link Plus Probabilistic Record Linkage Software.
Instructions and Reporting Requirements Module 1 Electronic Reporting For Facilities March 2014 North Carolina Central Cancer Registry State Center for.
N orthwest P ortland A rea I ndian H ealth B oard Indian Leadership for Indian Health Unintentional injury and motor vehicle crash mortality in the Northwest.
CDC and Tribal Epidemiology Centers: Working Relationship Update
EVital Records Initiative What we learned and what comes next.
Racial Misclassification of American Indians in the Oklahoma Central Cancer Registry Anne Bliss, MPH Interim Program Manager/Epidemiologist Oklahoma Central.
Quality Cancer Data Saves Lives The Vital Role of Cancer Registrars in the Fight against Cancer.
Web Plus Overview Division of Cancer Prevention and Control National Center for Chronic Disease Prevention and Health Promotion CDC Registry Plus Training.
An Overview of Tribal Epidemiology Centers and Collaborations with State Vital Records to Improve Data Quality and Address Emerging Issues Judith Thierry,
Graph Analysis Matching Program Burdette Pixton. Record Linkage Object Identification Problem Identifies possible links in pedigrees Advantages Compress.
Michigan Newborn Screening & Live Births Records Linkage and Follow-Up of Potentially Un-Screened Infants Steven J. Korzeniewski, MA, MSc, Maternal & Child.
Quality Cancer Data The Vital Role of Cancer Registrars in the Fight against Cancer Saves Lives.
Cancer Screening and Care for AIANs in Washington State Scott Ramsey Fred Hutchinson Cancer Research Center.
How are cancer statistics kept up to date?.  Example:  Dx stage II colon cancer  Cancer has metastasized to the liver – 2009  How does the.
Gayle Greer Clutter, R.T., CTR Program Consultant
MOLLY SCHWENN, MD CANCER REGISTRY MAINE CDC, DHHS OCTOBER 25, 2013 Population-based Cancer Surveillance: State Perspective.
United Nations Workshop on Principles and Recommendations for a Vital Statistics System, Revision 3, for African English-speaking countries Addis Ababa,
Project Update: Improving mortality data for American Indians and Alaska Natives NAPHSIS 2006 Annual Conference San Diego, CA David Espey, MD IHS/CDC.
The National Program of Cancer Registries: Enhancing Cancer Incidence Data … Hannah K. Weir, PhD Division of Cancer Prevention and Control Centers for.
Multi-State Results of Linking Death Certificates to Indian Health Service Patient Records 2007 NAPHSIS Annual Meeting June 6, 2007 Salt Lake City.
Using Historical Records to Reconstruct Early Life SES Exposures in Decedents: Preliminary Findings from a Pilot Study Kathryn Rose and J. Stephen Perhac.
National Program of Cancer Registries
Improving the Accuracy of American Indian Data through Linkage to Tribal Rolls NAPHSIS 2008.
CANCER INCIDENCE IN NEW JERSEY BY COUNTY, for the Comprehensive Cancer Control Plan County Needs Assessments August 2003 Prepared by: Cancer.
ABSTRACT Background: In late 2003, a group of Centers for Disease Control and Prevention/National Program of Cancer Registries (CDC/NPCR) staff and faculty/staff.
CONCEPTS AND TECHNIQUES FOR RECORD LINKAGE, ENTITY RESOLUTION, AND DUPLICATE DETECTION BY PETER CHRISTEN PRESENTED BY JOSEPH PARK Data Matching.
AUDIT IN COMPUTERIZED ENVIRONMENT
U.S. Centers for Disease Control and Prevention National Center for Health Statistics International Statistics Program Birth Records These materials have.
 2013 Pearson Education, Inc. Publishing as Prentice Hall, AIS, 11/e, by Bodnar/Hopwood Chapter 7 7 – 1 Electronic Data Processing Systems.
National Center for Health Statistics International Statistics Program U.S. Centers for Disease Control and Prevention National Center for Health Statistics.
Overview of Civil Registration and Vital Statistics Systems
The Tohono O'odham Nation (TON) Comprehensive Cancer Planning Grant “Data” Experience Joseph B. Hawes MD, MPH.
H1N1 Disease Surveillance Team Project Week 5 Presentation Melody Dungee Beena Joy David Medina Calvin Palmer.
United Nations Workshop on Principles and Recommendations for a Vital Statistics System, Revision 3, for African English-speaking countries Addis Ababa,
United Nations Workshop on Principles and Recommendations for a Vital Statistics System, Revision 3, for African English-speaking countries Addis Ababa,
Northwest Portland Area Indian Health Board IDEA-NW PROJECT & DATA SHARING AGREEMENTS NPAIHB QUARTERLY BOARD MEETING JANUARY 2016 Victoria Warren-Mears,
By:miguel iturrade.  A computer network is a group of computers that are connected to each other for the purpose of communication.
Strand 3 Maya Mangindin 3.7 Databases. A structured set of data held in a computer, esp. one that is accessible in various ways Google Definition.
1 Linking Social Security Death Index (SSDI) Data with Registry Data to Update Demographics and Vital Status David O’Brien, PhD, GISP Alaska Cancer Registry.
1 Record Linkage & Fuzzy Matching (More on "Blocking" for Performance Improvement) Joseph Vertido Melissa Data Fuzzy.
Enhancing Registry-Based Research: A Virtual Pooled Data Resource Dennis Deapen, DrPH NAACCR, Austin, TX June 13,
The Geographic and Demographic Distribution of Melanoma Throughout the United States: Implications for Primary and Secondary Prevention Bertina Backus.
NPCR – Advancing E-cancer Reporting and Registry Operations (NPCR-AERRO): An Update on Innovative Activities NAACCR Annual Conference June 16, 2009 Sandy.
Introducing… The Death Clearance Manual Robin Otto, RHIA, CTR Manager, Pennsylvania Cancer Registry Co-Chair, Death Clearance Issues Workgroup NAACCR 2008.
PUBLIC HEALTH APPROACH. PUBLIC HEALTH APPROACH-Step 1 Define the problem -How many deaths, injuries, violence related behaviors - Frequency -Trends -
NAACCR Annual Conference Quebec City, Quebec, Canada Shannon Vann, CTR Jim Hofferkamp, CTR.
National Center for Health Statistics (NCHS) Centers for Disease Control and Prevention.
"The findings and conclusions in this report are those of the author(s) and do not necessarily represent the official position of the Centers for Disease.
Population-Based Cancer Registries in the United States:
A State’s Experience.
Quality Control Abstract Visual Review Process
Web Plus Version 2: Secure Web-based Functions For Death Certificate and Pathology Lab Follow-back Efforts Kathleen Thoburn, Sanjeev Baral (CDC/NPCR.
North Carolina Central Cancer Registry Electronic Reporting For
Tribal Epidemiology Centers California and Northwest
Yumei Cao Statistical analyst Dept.of Epidemiology
Pregnancy risk factors and birth outcomes within Oregon and Idaho’s American Indian/Alaska Native population ( ) Suzanne Zane Maternal and Child.
Northwest Tribal Epidemiology Center
SCHS and Health Statistics
Overview of Link Plus Probabilistic Record Linkage Software
department of health and human resources
EFFICIENCY OF MrSID Chapter 6. Slide 41
The failure of breast cancer screening
Louisiana’s Hospital Follow-up Exchange: A Decade of Partnership
Functioning of the vital statistics system
Quality assurance and assessment in the vital statistics system
Presented by: Don Green
Costs of Operating Population-Based Cancer Registries: Results from Four Sub-Saharan African Countries Florence Tangka, PhD Senior Health Economist, Division.
Database Design Chapter 7.
Presentation transcript:

Record Linkage Concepts

Acknowledgements Slides adapted from training materials developed by CDC–NPCR Faculty: Melissa Jim, CDC/IHS David Espey, CDC/IHS CDC/Link Plus development and training: Kathleen Thoburn David Gu Adapted by: Megan Hoopes, NW Tribal Epidemiology Center

Overview of Record Linkage “Record Linkage” aka “Matching” aka “Merge” Combining information from a variety of data sources for the same individual Merge information from a record in one data source (file 1) with information from another data source (file 2) – Example: merging cancer information from cancer registry file with death information from vital statistics file

Overview of Record Linkage Can be accomplished manually, by visually comparing records from two separate sources Approach becomes time consuming, tedious, inefficient, and unpractical as the number of records in file 1 and file 2 increases Technological advances in computer systems and programming techniques – Economically feasible to perform computerized record linkage between large files – Efficient and relatively accurate

Duplicate Detection Fundamental requirement for accuracy and validity of counts in any disease registry Example: National Program of Cancer Registries/ North American Association of Central Cancer Registries standard – Maintain <= 0.1% (<=1 per 1,000) duplicates

Deterministic Matching Computerized comparison where EVERYTHING needs to match EXACTLY: Last NameFirst NameSiteSSNDOBSexDateDx SMITHJOHNC SMITHJOHNC

Deterministic Matching Often slight variations exist in the data between the two files for the same variables: Last NameFirst NameSiteSSNDOBSexDateDx SMITHJOHNC SMITHJOHNC Last NameFirst NameSiteSSNDOBSexDateDx SMITHJOHNC SMYTHJOHNC These variations would prevent a match from being identified Or variables are missing from one of the files:

Deterministic Matching Manual Review When we manually review, we use intuition to help us identify positive matches for records containing slight variations in, or missing information for, data between the two files for the same variables Last nameFirst NameSiteSSNDOBSexDateDx SMITHJOHNC SMITHJOHNC Typo in SSN, transposition of digits in the day component of DOB, but would still deem a match

Probabilistic Matching Translating intuition into formal decision rules Use the concept of PROBABILITY and perform PROBABILISTIC matching Recommended over traditional deterministic (exact matching) methods when: – coding errors, reporting variations, missing data or duplicate records Estimate probability/likelihood that two records are for the same person versus not

Probabilistic Matching Find the records in File 2 that seem to match records in File 1 Calculate a score that indicates, for any pair of records, how likely it is that they both refer to the same person Sort the likely and possible matched pairs in order of their scores Define a threshold (Cut Off values) for automatically accepting and rejecting a potential link – Discard unlikely matched pairs (scores below 2 nd Cut Off) – Gray area: range of scores between the two cut off values considered uncertain matches Manually review uncertain matches

Probabilistic Matching The total score for a linkage between any two records is the sum of the scores generated from matching individual fields The score assigned to a matching of individual fields is: – Based on the probability that a matching variable agrees given that a comparison pair is a match M Probability - similar to "sensitivity“ – Reduced by the probability that a matching variable agrees given that a comparison pair is not a match U Probability - similar to "specificity"

Probabilistic Matching Agreement argues for linkage Disagreement argues against linkage Full agreement argues more strongly for linkage than partial agreement Some types of partial agreements are stronger than others ̶ Rare surname versus residence county code

Probabilistic Matching  Agreement on an uncommon value argues more strongly for linkage than a common value ̶ Espey versus Smith  Agreement on a more specific variable argues more strongly for linkage than agreement on a less specific one ̶ SSN versus Sex  Agreement on more variables/disagreement on few argues for linkage

Probabilistic Matching Once comparisons are made, a weight is calculated for each field comparison A total weight (or “score”) is derived by summing these separate field comparisons across all fields being compared Probabilistic weights are – Field-specific – Birth date versus Sex – Value-specific - “Jane” versus “Janiqua”

Linkage basics Blocking variables Matching variables Advantages of Link Plus Using Link Plus

Concept of Blocking With so many comparisons, large files can make impossible resource demands Blocking is an initial probabilistic linkage step that reduces the number of record comparisons between files Sort and match the two files by one or more identifying (“blocking”) variables Comparisons subsequently made only within blocks – Discard very unlikely record-pairings from the start

Sorting socks analogy Blocking variable: Pattern 6 of 13 socks within pattern block  compare matching variables 7 of 13 socks fall outside pattern block  Non-matches

Compare matching variables color & size within blocked pairs High Score Gray Area Low Score Possible matches

Probabilistic linkage concepts (1) DescriptionCommon usage* Blocking An initial step to reduce the number of record comparisons and increase efficiency of linkage. At least one blocking variable must match exactly (or phonetically) between the two records being compared; subsequent comparisons are made after blocking. Blocking variables: Last name First name Social security number Date of birth Matching After blocking, matching variables are compared to generate a match score for each record pair. Match scores for each variable are: Field-specific (matching DOB is scored higher than matching sex) Value-specific (last name of ‘Hoopes’ is scored higher than ‘Smith’ due to frequency of occurrence) Matching variables: Last name First name Social Security Number Date of Birth Sex Address The user may designate matching algorithms & M-probabilities for each variable. * May vary based on data items and quality of data in available in matching data sets

Probabilistic linkage concepts (2) DescriptionCommon usage* Match score The total probability weight assigned to each record pair; equal to the sum of scores generated by comparing each match field. Based on software-calculated M probability (sensitivity) and U probability (specificity). The range of match scores is examined to determine upper and lower cut-off values. High match scores are likely true matches and scores below cut-off value are automatically designated false matches. Record pairs between cut-off values are clerically reviewed. Clerical review Case-by-case review of uncertain matches that fall between the upper and lower cut- off values. Additional variables can be added to record layout to assist in the designation of match status. This process can be completed independently by two or more reviewers to increase reliability. Additional variables may include: Street address City, state, zip code Suffix Race/ethnicity Maiden name * May vary based on data items and quality of data in available in matching data sets