An Introduction to Automated Record Linkage

Presentation transcript:

An Introduction to Automated Record Linkage
Jonas Helgertz (helgertz@umn.edu)

Background and aim
Many research topics within the social sciences benefit from (or require) a life-course perspective, i.e. data that measure individual-level characteristics at various points in time. For example, studying the relationship between education and mortality requires, at a minimum, information on completed educational attainment (around age 25-30) and on mortality, which typically occurs much later. Survey data, sometimes linked to modern administrative register data, are frequently used to address research questions where a significant amount of time separates exposure and outcome, but they come with drawbacks:
- Recall bias
- Costly
- Limited sample size

Digitized Public Use Historical Data
- Full-count U.S. censuses of 1850/1860/1870/1880/1900/1910/1920/1930/1940
- Social Security Death Master File/NUMIDENT, covering a gradually increasing share of deaths occurring from around 1940 to 2013/2007
In theory, the entire population can be followed every ten years and until death, overcoming the problems associated with survey data. Historical data sources, however, lack a consistent identifier (e.g. SSN) that allows for straightforward linking across sources. Instead, as you have experienced first hand, we have to rely on other indicators to make an educated guess about which record in source B corresponds to a given record in source A. Hand linking has its own caveats: it is time consuming and expensive, and it is difficult to apply consistent rules and to document and replicate the process.

Automated Methods of Record Linkage
Drawing on methods developed within computer science, researchers in the social sciences have begun using and developing computer algorithms to link millions of individuals across different sources.
- Computationally quite demanding
- Deterministic and probabilistic methods of linking
- Quality of the linking job is assessed based on i) the share of confirmed matches and ii) estimated measures of linking precision
- The goal is typically to minimize the share of so-called false positives
Every declared (or rejected) link falls into one of four cells when compared against the truth:

                      Algorithm: Match     Algorithm: Non-match
Truth: Match          True positive        False negative
Truth: Non-match      False positive       True negative

Defining the population of potential matches
Linking individuals in the 1920 census to the 1940 census, how do we restrict the population of potential matches? E.g. Samuel Smith, born 1903 in Texas. In theory, when finding the individual in the census conducted twenty years later, he should be enumerated as a male with an identical (or at least very similar) name, born in that same year and in that same state (the blocking criteria). Historical census data are, however, messy. The population of potential matches is therefore typically defined a bit more broadly, for example by allowing the birth year to differ by +/- 3 years between the sources. Less conservative restrictions increase the likelihood that the true match will be among the potential matches, but at the expense of an increased risk of false positives. A minimal blocking sketch is shown below.
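The blocking step can be expressed compactly in code. The sketch below uses pandas; the frame and column names (census_a, census_b, sex, birth_state, birth_year) are illustrative assumptions, not taken from the actual extracts.

```python
import pandas as pd

def candidate_pairs(census_a: pd.DataFrame, census_b: pd.DataFrame,
                    birth_year_window: int = 3) -> pd.DataFrame:
    """Return candidate pairs that agree exactly on sex and state of birth
    and whose reported birth years differ by at most birth_year_window."""
    # Exact blocking on sex and state of birth; overlapping columns get
    # "_a"/"_b" suffixes (e.g. birth_year_a, first_name_b).
    pairs = census_a.merge(census_b, on=["sex", "birth_state"],
                           suffixes=("_a", "_b"))
    # Allow reported birth years to disagree by up to +/- birth_year_window years.
    year_gap = (pairs["birth_year_a"] - pairs["birth_year_b"]).abs()
    return pairs[year_gap <= birth_year_window].copy()
```

In practice one would typically also block on, say, the first letter or a phonetic code of the surname, to keep the number of candidate pairs manageable.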

Deterministic linking
A simple, rules-based way of conducting record linkage.
Pros: easy to replicate, minimally computationally demanding.
Cons: fails to capture the nuances and complexity of the underlying data; selective linking.
Example: after obtaining the population of potential matches based on the chosen blocking criteria (name similarity score, birth year, sex, state of birth), a match is declared only if there exists exactly one potential match for a given individual.
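A minimal sketch of that unique-candidate rule, assuming the candidate pairs carry record identifiers id_a and id_b (illustrative names) from the blocking step above:

```python
import pandas as pd

def deterministic_matches(pairs: pd.DataFrame) -> pd.DataFrame:
    """Keep a pair only if its source-A record has exactly one candidate in source B."""
    n_candidates = pairs.groupby("id_a")["id_b"].transform("nunique")
    return pairs[n_candidates == 1].copy()
```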

Deterministic linking, cont’d
Individuals with common names are almost impossible to link, as there is no way to distinguish between seemingly equally good potential matches. This is particularly problematic when working with messy data, such as historical U.S. census data.

Probabilistic linking
Based on the idea that the computer can be taught to recognize patterns in the data that are consistent with a given observation being the same individual across the two sources. Essentially, conditional on the chosen blocking criteria, every potential match is assigned a probability of being the true match. In a following step, the researcher chooses thresholds that correspond to their preferences regarding the share of false positives that is acceptable.
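In practice, those patterns are captured by numeric comparison features computed for every candidate pair. A small sketch, reusing the illustrative pairs frame from the blocking step and assuming the jellyfish package for string similarity (column names are illustrative):

```python
import jellyfish

# Jaro-Winkler similarity (1.0 = identical strings) for first and last names.
for col in ("first_name", "last_name"):
    pairs[f"{col}_similarity"] = [
        jellyfish.jaro_winkler_similarity(str(a), str(b))
        for a, b in zip(pairs[f"{col}_a"], pairs[f"{col}_b"])
    ]
# Absolute difference in reported birth year.
pairs["birth_year_gap"] = (pairs["birth_year_a"] - pairs["birth_year_b"]).abs()
```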

Establishing the “ground truth” through training data
The first task is to generate a data set of the highest possible quality, which will provide the machine learning algorithm with the information necessary to learn how to identify matches. This is called “training data”: a subsample of the larger population one wishes to link. For example, if we want to link boys born 1910-1920 from the 1920 census to the 1940 census, we would extract a subsample of, say, 500-5,000 individuals from that population and link it by hand to the 1940 census. The more and better the sources we use, the better the quality of the resulting training data (i.e. an absence of false positives and a maximized share of true positives).

Using the Training Data
What patterns did you observe when doing the hand-linking that characterized confirmed matches? A range of different machine learning algorithms is available, all relying on the information provided by the training data. The algorithm uses measurable (observable) characteristics of each potential match to quantify their importance for the probability that the pair should be declared a true match. Here, we will use the probit approach suggested by Feigenbaum (2016), as it is comparatively straightforward to understand: we estimate a probit regression model on the training data, with the declared matches as the dependent variable and a set of constructed comparison variables as explanatory variables.
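A minimal sketch of that probit step with statsmodels, assuming a hand-linked training subsample train_pairs with a 0/1 indicator is_match and the illustrative comparison features constructed earlier (these names are assumptions, not from the slides):

```python
import statsmodels.api as sm

features = ["first_name_similarity", "last_name_similarity", "birth_year_gap"]

# Estimate the probit model on the hand-linked training pairs.
X_train = sm.add_constant(train_pairs[features])
probit_results = sm.Probit(train_pairs["is_match"], X_train).fit()
print(probit_results.summary())

# Score every candidate pair with a predicted match probability.
pairs["p"] = probit_results.predict(sm.add_constant(pairs[features]))
```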

Interpreting the output of the algorithm
Negative coefficients mean that the higher the value of the variable in question, the lower the probability of a match, and vice versa.

Using the algorithm to generate matches
Highlighted are all the constructed characteristics that the algorithm uses to predict the probability of a match. Using these in combination with the estimated coefficients yields a predicted probability of a match, indicated by the column labeled “p”.

Using the algorithm to generate matches, cont’d
Before assessing the quality of the linking algorithm, the researcher needs to make two choices:
- What is the lowest acceptable predicted probability for declaring a match (p)?
- How much better than the second-best potential match does the best match need to be (d = p_best / p_second-best)?
The higher the values of both p and d, the more restrictive the algorithm will be in declaring matches. If we choose a p threshold of 0.5 and a d threshold of 2, it is clear that the observation below will be declared a match, since 0.78/0.013 = 60, i.e. well above the threshold of 2 (and 0.78 is well above 0.5). A sketch of applying the two thresholds follows below.
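A hedged sketch of the threshold rule, reusing the illustrative pairs frame with its predicted probability column p and the id_a identifier:

```python
import pandas as pd

def declare_matches(pairs: pd.DataFrame, p_min: float = 0.5,
                    d_min: float = 2.0) -> pd.DataFrame:
    """Keep the best candidate per source-A record if it clears both thresholds."""
    declared = []
    for _, group in pairs.groupby("id_a"):
        ranked = group.sort_values("p", ascending=False)
        best = ranked.iloc[0]
        second_p = ranked["p"].iloc[1] if len(ranked) > 1 else 0.0
        # Multiplying avoids dividing by zero when there is no second candidate.
        if best["p"] >= p_min and best["p"] >= d_min * second_p:
            declared.append(best)
    return pd.DataFrame(declared)
```

With p_min = 0.5 and d_min = 2, the example above (0.78 against 0.013) clears both conditions.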

Assessing the performance of the algorithm
Assumption: our training data is a representative subsample of the population that we ultimately want to link. As a result, we can compare the performance of our algorithm against the ground-truth data to get a measurement of how well it will perform on the larger set of data. In the case of James Broadoway, it would appear that declaring a match would result in a “true positive”, i.e. that it is a match also in the underlying ground-truth data.

Assessing the performance of the algorithm, cont’d
The table below shows the four different outcomes (TP/TN/FP/FN) across a range of values of p and d. Holding p constant, as the value of d increases, the number of false positives diminishes (but so does the number of true positives).
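Such a table can be produced by re-running the threshold rule over a grid of p and d values and tallying the outcomes against the hand-linked indicator. A minimal sketch, reusing the illustrative train_pairs frame and the declare_matches helper above:

```python
def confusion_counts(train_pairs, declared):
    """Count TP/FP/FN/TN at the pair level against the hand-linked is_match column."""
    is_declared = train_pairs.index.isin(declared.index)
    is_true = train_pairs["is_match"].astype(bool)
    tp = int((is_declared & is_true).sum())
    fp = int((is_declared & ~is_true).sum())
    fn = int((~is_declared & is_true).sum())
    tn = int((~is_declared & ~is_true).sum())
    return tp, fp, fn, tn
```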

Choosing the appropriate thresholds
Precision (the share of declared matches that are correct, TP/(TP+FP)) and recall (the share of true matches that are found, TP/(TP+FN)) are two essential measurements that, however, move in opposite directions: making the algorithm more restrictive raises precision but lowers recall.
[Figure: precision and recall plotted against the chosen thresholds]
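Both measures follow directly from the counts tallied above; a short sketch reusing those illustrative helpers:

```python
def precision_recall(tp, fp, fn, tn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of declared matches that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of true matches that are found
    return precision, recall

counts = confusion_counts(train_pairs, declare_matches(train_pairs, p_min=0.5, d_min=2.0))
precision, recall = precision_recall(*counts)
```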

We have tended to prefer the Matthews correlation coefficient (MCC) when selecting the optimal thresholds, as it takes both precision and recall into account (a sketch of this selection step follows below). As a result, we have managed to link large populations of individuals belonging to the 1910-1920 birth cohorts across the 1920-1940 censuses and then to the death records. In addition, we have linked 22 consecutive birth cohorts of boys from the 1930 and 1940 censuses to the death records, yielding approximately 5 million confirmed links.
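A minimal sketch of selecting the thresholds by MCC, reusing the illustrative helpers above; the candidate threshold grid is an assumption chosen for demonstration:

```python
from itertools import product
from math import sqrt

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 1.0 is perfect, 0.0 is no better than chance."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

# Pick the (p, d) combination that maximizes MCC on the training pairs.
threshold_grid = list(product([0.3, 0.5, 0.7, 0.9], [1.5, 2.0, 3.0]))
best_p, best_d = max(
    threshold_grid,
    key=lambda t: mcc(*confusion_counts(train_pairs,
                                        declare_matches(train_pairs, *t))),
)
```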