1 Probabilistic Linkage: Issues and Strategies Craig A. Mason, Ph.D. University of Maine

2 Faculty Disclosure Information
In the past 12 months, I have not had a significant financial interest or other relationship with the manufacturer(s) of the product(s) or provider(s) of the service(s) that will be discussed in my presentation.
This presentation will (not) include discussion of pharmaceuticals or devices that have not been approved by the FDA, or of unapproved or "off-label" uses of pharmaceuticals or devices.

3 Acknowledgements: Shihfen Tu, Quansheng Song, Keith Scott, Marygrace Yale, Tony Gonzalez, Derek Chapman

4 Overview of Linkage Process: Two databases contain information on some of the same individuals. [Diagram: Birth Certificates; EHDI Diagnostic Data]

5 Overview of Linkage Process: Many births are not in the Diagnostic Data. [Diagram: Birth Certificates; EHDI Diagnostic Data]

6 Overview of Linkage Process: Some entries in the EHDI Diagnostic Data do not appear in the Electronic Birth Certificates. [Diagram: Birth Certificates; EHDI Diagnostic Data]

7 Overview of Linkage Process: The final linkage is a subset of each. [Diagram: Birth Certificates; EHDI Diagnostic Data]

8 Linkage Algorithms: Deterministic
– Records must match exactly on specified common fields
– Easiest, quickest linkage strategy
– It is a misconception that this is the gold standard
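
As a minimal sketch of exact matching on common fields (the field names, record layout, and the `deterministic_link` helper are illustrative assumptions, not taken from the presentation), a deterministic link can be written as an exact-key lookup. The example data includes one spelling variant to show how a single disagreement produces a non-link.

```python
def deterministic_link(records_a, records_b, key_fields):
    """Pair records that agree exactly on every key field."""
    # Index the second dataset by the tuple of key-field values.
    index = {}
    for rec in records_b:
        key = tuple(rec[f] for f in key_fields)
        index.setdefault(key, []).append(rec)

    # Look up each record from the first dataset under the same key.
    links = []
    for rec in records_a:
        key = tuple(rec[f] for f in key_fields)
        for match in index.get(key, []):
            links.append((rec, match))
    return links


# Hypothetical records; "Brezinski" is a spelling variant of "Brezinsky".
births = [
    {"id": "B1", "first": "Zbignew", "last": "Brezinsky", "dob": "2005-03-01"},
    {"id": "B2", "first": "John", "last": "Smith", "dob": "2005-04-12"},
]
ehdi = [
    {"id": "E1", "first": "Zbignew", "last": "Brezinski", "dob": "2005-03-01"},
    {"id": "E2", "first": "John", "last": "Smith", "dob": "2005-04-12"},
]

links = deterministic_link(births, ehdi, ["first", "last", "dob"])
linked_ids = [(a["id"], b["id"]) for a, b in links]
print(linked_ids)
```

Only the pair that agrees on all three fields is linked; the record with the variant spelling is left unlinked even though the first name and date of birth agree.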

9 Linkage Algorithms: Deterministic
– May result in significant bias
  – Example: non-traditional spellings in African American names
– Results in errors due to non-links
  – Many non-links can result in greater bias than a few erroneous pairings

10 Linkage Algorithms: Probabilistic
– Statistically estimates the likelihood or odds that two records are for the same individual, even if they disagree on some fields

11 Linkage Algorithms: Factors Impacting Probabilistic Linkage
– Likelihood that a field would agree if the pair is a correct link
  – Good-quality data counts more than poor-quality data
– Likelihood that a field would agree if the pair is not a correct link
  – Rare values count more than common values
– Number of expected matches
– A much more complicated and expensive strategy than deterministic linkage

12 Implementing an Effective Data Linkage [Cartoon with the labels "Start", "Then a miracle occurs", and "out"; caption: "Good work, but I think we might need just a little more detail right here."] Modified from Kim Church, Maine Genetics Program

13 Probabilistic Matching: Two records are not required to match in all fields
– Two records are compared on each of the specified fields.
– A weight, w_i, is calculated for each field in a potential match, reflecting the strength of the agreement or disagreement (w_1, w_2, ...).

14 Factors Influencing Likelihood of Match: Reliability of data fields
– Greater reliability results in increased odds of a correct match
  – A match on a high-quality, reliably entered field is good evidence
  – Not matching on a poor-quality field with many known data-entry errors may not be a fatal error
– If a field is pure noise, agreement on that field will be random across the databases, even for correct matches

15 Factors Influencing Likelihood of Match: Frequency of field values
– The more common the value in a field, the greater the odds that records will be erroneously matched
  – A match on the name Zbignew is a relatively good indicator of a match, even if there is disagreement in other fields
  – A match on the name John may be of much less value, requiring matches on more fields in order to conclude that two records are the same individual
– Number of expected matches one would obtain randomly

16 Calculating Match Weights: Weight Calculation
– M-probability
  – Probability that a field agrees if the pairing reflects a correct match
– U-probability
  – Probability that a field agrees if the pairing reflects an incorrect match
  – Chance that a given field will agree randomly
  – Approximately = (# records with a specific value) / (total # of records)
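
The frequency approximation for the U-probability can be sketched as follows. The name counts are invented for illustration, and `estimate_u` is a hypothetical helper rather than anything from the presentation.

```python
from collections import Counter


def estimate_u(values):
    """Approximate u for each value as (# records with that value) / (total # of records)."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


# Made-up first-name column: one rare name among many common ones.
first_names = ["John"] * 40 + ["Mary"] * 59 + ["Zbignew"]
u = estimate_u(first_names)
print(u["John"], u["Zbignew"])
```

In this toy column the common name receives a much larger u than the rare one, which is why agreement on a rare value carries more evidence.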

17 Probabilistic Matching: If the field agrees, w_i is equal to …

18 Probabilistic Matching
– m_i for first name = .98; that is, 98% of the time, if it's a correct match, the first names will agree
– u_i for Zbignew is the probability of randomly getting two first names that are Zbignew

19 Probabilistic Matching: In cases where two records disagree on a specified field, w_i is equal to …

20 Probabilistic Matching
– m_i for last name = .96; that is, 96% of the time, if it's a correct match, the last names will agree
– u_i for Brezinsky is the probability of randomly getting two last names that are Brezinsky
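
The equations for the agreement and disagreement weights are not preserved in this transcript. A common formulation in the probabilistic-linkage literature (the Fellegi-Sunter approach), consistent with the M- and U-probability definitions and with the base-2 logs mentioned in the Notes slide, is w_i = log2(m_i / u_i) when a field agrees and w_i = log2((1 - m_i) / (1 - u_i)) when it disagrees. The sketch below assumes that formulation. The m values come from the slides, while the u values are hypothetical because the originals are missing.

```python
import math


def agreement_weight(m, u):
    """Assumed weight when a field agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Assumed weight when a field disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))


m_first = 0.98   # from the slides: first names agree 98% of the time in correct matches
u_rare = 0.001   # hypothetical u for a rare first name
u_common = 0.05  # hypothetical u for a common first name

w_agree_rare = agreement_weight(m_first, u_rare)
w_agree_common = agreement_weight(m_first, u_common)
w_disagree = disagreement_weight(m_first, u_rare)
print(w_agree_rare, w_agree_common, w_disagree)
```

Under these assumptions, agreement on the rare value earns a larger positive weight than agreement on the common value, and disagreement on a reliable field earns a negative weight.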

21 Calculating Match Weights: A composite weight, w_t, is calculated for each pair of records
– w_t is the sum of the weights across all fields used in the linkage
  – A larger w_t suggests a correct match
  – A smaller or negative w_t suggests an incorrect match
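
A minimal sketch of the composite weight, assuming the standard log2 agreement and disagreement weights (the per-field formulas are not preserved in the transcript). The field names, the u values, and the `composite_weight` helper are hypothetical; the m values are the ones quoted on the slides.

```python
import math


def composite_weight(rec_a, rec_b, field_probs):
    """Sum the per-field weights for one candidate pair of records.

    field_probs maps field name -> (m, u). Agreement contributes
    log2(m / u); disagreement contributes log2((1 - m) / (1 - u)).
    """
    total = 0.0
    for field, (m, u) in field_probs.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


# m values from the slides; u values are made up for illustration.
field_probs = {"first": (0.98, 0.01), "last": (0.96, 0.005)}

a = {"first": "Zbignew", "last": "Brezinsky"}
same = {"first": "Zbignew", "last": "Brezinsky"}
different = {"first": "John", "last": "Smith"}

w_same = composite_weight(a, same, field_probs)
w_different = composite_weight(a, different, field_probs)
print(w_same, w_different)
```

With these inputs the fully agreeing pair gets a large positive w_t and the fully disagreeing pair a negative one, matching the interpretation given on the slide.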

22 Blocking: Match Determination
– One could compare every record in one dataset with every record in the second dataset
  – This results in N_1 × N_2 comparisons
– Blocking
  – Records are first blocked on a subset of fields for which a deterministic match is required.
  – Within each block, all records from one dataset are compared to all records from the other dataset.
  – w_t is calculated for each of these possible pairings.
  – The distribution of w_t values across all blocks is examined in order to determine the critical cut-off score necessary to classify two records as a match.
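
The blocking step can be sketched as grouping both datasets by the blocking fields and pairing records only within matching groups. The choice of date of birth as the blocking field, the record layout, and the `candidate_pairs` helper are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def candidate_pairs(records_a, records_b, block_fields):
    """Yield only the pairs whose blocking fields agree exactly."""
    blocks_a = defaultdict(list)
    blocks_b = defaultdict(list)
    for rec in records_a:
        blocks_a[tuple(rec[f] for f in block_fields)].append(rec)
    for rec in records_b:
        blocks_b[tuple(rec[f] for f in block_fields)].append(rec)

    # Within each block, compare every record from A with every record from B.
    for key, group_a in blocks_a.items():
        for pair in product(group_a, blocks_b.get(key, [])):
            yield pair


# Hypothetical records, blocked on date of birth.
births = [
    {"id": "B1", "dob": "2005-03-01"},
    {"id": "B2", "dob": "2005-03-01"},
    {"id": "B3", "dob": "2005-04-12"},
]
ehdi = [
    {"id": "E1", "dob": "2005-03-01"},
    {"id": "E2", "dob": "2005-04-12"},
    {"id": "E3", "dob": "2005-06-30"},
]

pairs = list(candidate_pairs(births, ehdi, ["dob"]))
print(len(pairs), "blocked comparisons instead of", len(births) * len(ehdi))
```

Each surviving pair would then be scored with its composite weight. Records that disagree on a blocking field are never compared, so the blocking fields should be ones that are reliably recorded.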

25 Estimating Probabilities: The total weight required for two records to have a probability, p, of being a match is equal to …
– where p is the desired probability of a match,
– E is the expected number of potential matches,
– N_1 and N_2 are the numbers of records in each database, and
– … is the base-2 log of the odds of a random match

26 Estimating Probabilities: From this formula, it is possible to derive an equation for estimating the probability that any two records are a match. [Equation not preserved; its annotations read "if two fields agree, and…", "if two fields do not agree", and "odds of a random match".]
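
The equations on these two slides did not survive in the transcript. One reconstruction consistent with the definitions given (p, E, N_1, N_2, and a base-2 log of the odds of a random match) is to take the odds of a random match as E / (N_1 × N_2 − E), the required total weight as log2(p / (1 − p)) − log2(E / (N_1 × N_2 − E)), and the probability as the base-2 logistic function of the total weight plus the log-odds of a random match. The sketch below implements that assumed form, with hypothetical function names and made-up numbers; it should not be read as the presenter's exact formula.

```python
import math


def random_match_log_odds(e, n1, n2):
    """Assumed base-2 log of the odds that a randomly chosen pair is a match."""
    return math.log2(e / (n1 * n2 - e))


def required_weight(p, e, n1, n2):
    """Total weight needed for a pair to reach match probability p (assumed form)."""
    return math.log2(p / (1 - p)) - random_match_log_odds(e, n1, n2)


def match_probability(w_t, e, n1, n2):
    """Base-2 logistic transform of the total weight plus the random-match log-odds."""
    exponent = w_t + random_match_log_odds(e, n1, n2)
    return 1 / (1 + 2 ** (-exponent))


# Made-up sizes: 10,000 birth records, 1,200 diagnostic records, 1,000 expected matches.
n1, n2, e = 10_000, 1_200, 1_000
cutoff = required_weight(0.95, e, n1, n2)
p_at_cutoff = match_probability(cutoff, e, n1, n2)
print(cutoff, p_at_cutoff)
```

The two functions are inverses of each other under this assumption, so a pair whose composite weight equals the cut-off has an estimated match probability equal to the chosen p.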

27 Notes
– The probability equation is equivalent to a base-2 version of the logistic probability formula
– The computational formula avoids the need to repeatedly calculate powers of 2 and log_2
  – This is because the weights in the exponent are themselves log values
– The same probability is obtained using e and the natural log in place of 2 and log_2 throughout
  – Base 2 results in improved computational speed
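
The claim that the base is interchangeable can be checked numerically. The sketch below assumes log-ratio agreement weights and a logistic transform (the slide equations are not preserved), and uses hypothetical u values and database sizes. It shows only that the two bases give the same probability and says nothing about speed.

```python
import math


def probability(agreeing_fields, prior_odds, log, base):
    """Logistic probability from summed log-weights, in whichever base is supplied."""
    total = log(prior_odds) + sum(log(m / u) for m, u in agreeing_fields)
    return 1 / (1 + base ** (-total))


# (m, u) for two agreeing fields; m from the slides, u made up.
fields = [(0.98, 0.01), (0.96, 0.005)]
# Hypothetical odds of a random match: E / (N1 * N2 - E).
prior_odds = 1_000 / (10_000 * 1_200 - 1_000)

p_base2 = probability(fields, prior_odds, math.log2, 2.0)
p_base_e = probability(fields, prior_odds, math.log, math.e)
print(p_base2, p_base_e)
```

Both calls reduce to odds / (1 + odds), where odds is the prior odds multiplied by each m / u ratio, which is why the log-weights never need to be exponentiated one at a time.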

28 That's nice, but … "All right. But apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, the fresh water system, and public health… What have the Romans ever done for us?" --- Reg, spokesman for the People's Front of Judea, Monty Python's Life of Brian (and Martin White, UC Berkeley)