An Introduction to Automated Record Linkage

Presentation transcript:

An Introduction to Automated Record Linkage
Jonas Helgertz (helgertz@umn.edu)

Background and aim
Many research topics within the social sciences benefit from (or require) a life-course perspective, i.e. data that measure individual-level characteristics at various points in time. For example, studying the relationship between education and mortality requires, at a minimum, information on completed educational attainment (around age 25-30) and on mortality, which typically occurs much later. Survey data, sometimes linked to modern administrative register data, are frequently used to address research questions where a significant amount of time separates exposure and outcome, but they come with drawbacks:
- Recall bias
- Costly
- Limited sample size

Digitized Public Use Historical Data
- Full-count U.S. censuses of 1850/1860/1870/1880/1900/1910/1920/1930/1940
- Social Security Death Master File/NUMIDENT, covering a gradually increasing share of deaths occurring from around 1940 to 2013/2007
In theory, the entire population can be followed every ten years and until death, overcoming the problems associated with survey data. Historical data sources, however, lack a consistent identifier (e.g. SSN) that allows for straightforward linking across sources. Instead, as you have experienced first hand, we have to rely on other indicators to make an educated guess about which record in source B corresponds to a given record in source A. Hand linking has its own caveats: it is time consuming and expensive, and it is difficult to apply consistent rules and to document and replicate the process.

Automated Methods of Record Linkage
Drawing on methods developed within computer science, researchers in the social sciences have begun using and developing computer algorithms to link millions of individuals across different sources.
- Computationally quite demanding
- Deterministic and probabilistic methods of linking
- Quality of the linking job is assessed based on i) the share of confirmed matches and ii) estimated measures of linking precision
- The goal is typically to minimize the share of so-called false positives
Every declared (or rejected) link falls into one of four cells when compared against the truth:

                      Algorithm: Match     Algorithm: Non-match
Truth: Match          True positive        False negative
Truth: Non-match      False positive       True negative

Defining the population of potential matches
Linking individuals in the 1920 census to the 1940 census, how do we restrict the population of potential matches? E.g. Samuel Smith, born 1903 in Texas. In theory, when finding the individual in the census conducted twenty years later, he should be enumerated as a male with an identical (or at least very similar) name, born in that same year and in that same state (the blocking criteria). Historical census data are, however, messy. The population of potential matches is therefore typically defined a bit more broadly, for example by allowing the birth year to differ by +/- 3 years between the sources. Less conservative restrictions increase the likelihood that the true match will be among the potential matches, but at the expense of an increased risk of false positives. A minimal blocking sketch is shown below.
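The blocking step can be expressed compactly in code. The sketch below uses pandas; the frame and column names (census_a, census_b, sex, birth_state, birth_year) are illustrative assumptions, not taken from the actual extracts.

```python
import pandas as pd

def candidate_pairs(census_a: pd.DataFrame, census_b: pd.DataFrame,
                    birth_year_window: int = 3) -> pd.DataFrame:
    """Return candidate pairs that agree exactly on sex and state of birth
    and whose reported birth years differ by at most birth_year_window."""
    # Exact blocking on sex and state of birth; overlapping columns get
    # "_a"/"_b" suffixes (e.g. birth_year_a, first_name_b).
    pairs = census_a.merge(census_b, on=["sex", "birth_state"],
                           suffixes=("_a", "_b"))
    # Allow reported birth years to disagree by up to +/- birth_year_window years.
    year_gap = (pairs["birth_year_a"] - pairs["birth_year_b"]).abs()
    return pairs[year_gap <= birth_year_window].copy()
```

In practice one would typically also block on, say, the first letter or a phonetic code of the surname, to keep the number of candidate pairs manageable.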

Deterministic linking
A simple, rules-based way of conducting record linkage.
Pros: easy to replicate, minimally computationally demanding.
Cons: fails to capture the nuances and complexity of the underlying data; selective linking.
Example: after obtaining the population of potential matches based on the chosen blocking criteria (name similarity score, birth year, sex, state of birth), a match is declared only if there exists exactly one potential match for a given individual.
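A minimal sketch of that unique-candidate rule, assuming the candidate pairs carry record identifiers id_a and id_b (illustrative names) from the blocking step above:

```python
import pandas as pd

def deterministic_matches(pairs: pd.DataFrame) -> pd.DataFrame:
    """Keep a pair only if its source-A record has exactly one candidate in source B."""
    n_candidates = pairs.groupby("id_a")["id_b"].transform("nunique")
    return pairs[n_candidates == 1].copy()
```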

Deterministic linking, cont’d
Individuals with common names are almost impossible to link, as there is no way to distinguish between seemingly equally good potential matches. This is particularly problematic when working with messy data, such as historical U.S. census data.

Probabilistic linking
Based on the idea that the computer can be taught to recognize patterns in the data that are consistent with a given observation being the same individual across the two sources. Essentially, conditional on the chosen blocking criteria, every potential match is assigned a probability of being the true match. In a following step, the researcher chooses thresholds that correspond to their preferences regarding the share of false positives that is acceptable.
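In practice, those patterns are captured by numeric comparison features computed for every candidate pair. A small sketch, reusing the illustrative pairs frame from the blocking step and assuming the jellyfish package for string similarity (column names are illustrative):

```python
import jellyfish

# Jaro-Winkler similarity (1.0 = identical strings) for first and last names.
for col in ("first_name", "last_name"):
    pairs[f"{col}_similarity"] = [
        jellyfish.jaro_winkler_similarity(str(a), str(b))
        for a, b in zip(pairs[f"{col}_a"], pairs[f"{col}_b"])
    ]
# Absolute difference in reported birth year.
pairs["birth_year_gap"] = (pairs["birth_year_a"] - pairs["birth_year_b"]).abs()
```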

Establishing the “ground truth” through training data
The first task is to generate a data set of the highest possible quality, which will provide the machine learning algorithm with the information necessary to learn how to identify matches. This is called “training data”: a subsample of the larger population one wishes to link. For example, if we want to link boys born 1910-1920 from the 1920 census to the 1940 census, we would extract a subsample of, say, 500-5,000 individuals from that population and link it by hand to the 1940 census. The more and better the sources we use, the better the quality of the resulting training data (i.e. an absence of false positives and a maximized share of true positives).

Using the Training Data
What patterns did you observe when doing the hand-linking that characterized confirmed matches? A range of different machine learning algorithms is available, all relying on the information provided by the training data. The algorithm uses measurable (observable) characteristics of each potential match to quantify their importance for the probability that the pair should be declared a true match. Here, we will use the probit approach suggested by Feigenbaum (2016), as it is comparatively straightforward to understand: we estimate a probit regression model on the training data, with the declared matches as the dependent variable and a set of constructed comparison variables as explanatory variables.
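A minimal sketch of that probit step with statsmodels, assuming a hand-linked training subsample train_pairs with a 0/1 indicator is_match and the illustrative comparison features constructed earlier (these names are assumptions, not from the slides):

```python
import statsmodels.api as sm

features = ["first_name_similarity", "last_name_similarity", "birth_year_gap"]

# Estimate the probit model on the hand-linked training pairs.
X_train = sm.add_constant(train_pairs[features])
probit_results = sm.Probit(train_pairs["is_match"], X_train).fit()
print(probit_results.summary())

# Score every candidate pair with a predicted match probability.
pairs["p"] = probit_results.predict(sm.add_constant(pairs[features]))
```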

Interpreting the output of the algorithm
Negative coefficients mean that the higher the value of the variable in question, the lower the probability of a match, and vice versa.

Using the algorithm to generate matches
Highlighted are all the constructed characteristics that the algorithm uses to predict the probability of a match. Using these in combination with the estimated coefficients yields a predicted probability of a match, indicated by the column labeled “p”.

Using the algorithm to generate matches, cont’d
Before assessing the quality of the linking algorithm, the researcher needs to make two choices:
- What is the lowest acceptable predicted probability for declaring a match (p)?
- How much better than the second-best potential match does the best match need to be (d = p_best / p_second-best)?
The higher the values of both p and d, the more restrictive the algorithm will be in declaring matches. If we choose a p threshold of 0.5 and a d threshold of 2, it is clear that the observation below will be declared a match, since 0.78/0.013 = 60, i.e. well above the threshold of 2 (and 0.78 is well above 0.5). A sketch of applying the two thresholds follows below.
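A hedged sketch of the threshold rule, reusing the illustrative pairs frame with its predicted probability column p and the id_a identifier:

```python
import pandas as pd

def declare_matches(pairs: pd.DataFrame, p_min: float = 0.5,
                    d_min: float = 2.0) -> pd.DataFrame:
    """Keep the best candidate per source-A record if it clears both thresholds."""
    declared = []
    for _, group in pairs.groupby("id_a"):
        ranked = group.sort_values("p", ascending=False)
        best = ranked.iloc[0]
        second_p = ranked["p"].iloc[1] if len(ranked) > 1 else 0.0
        # Multiplying avoids dividing by zero when there is no second candidate.
        if best["p"] >= p_min and best["p"] >= d_min * second_p:
            declared.append(best)
    return pd.DataFrame(declared)
```

With p_min = 0.5 and d_min = 2, the example above (0.78 against 0.013) clears both conditions.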

Assessing the performance of the algorithm
Assumption: our training data is a representative subsample of the population that we ultimately want to link. As a result, we can compare the performance of our algorithm against the ground-truth data to get a measurement of how well it will perform on the larger set of data. In the case of James Broadoway, it would appear that declaring a match would result in a “true positive”, i.e. that it is a match also in the underlying ground-truth data.

Assessing the performance of the algorithm, cont’d
The table below shows the four different outcomes (TP/TN/FP/FN) across a range of values of p and d. Holding p constant, as the value of d increases, the number of false positives diminishes (but so does the number of true positives).
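Such a table can be produced by re-running the threshold rule over a grid of p and d values and tallying the outcomes against the hand-linked indicator. A minimal sketch, reusing the illustrative train_pairs frame and the declare_matches helper above:

```python
def confusion_counts(train_pairs, declared):
    """Count TP/FP/FN/TN at the pair level against the hand-linked is_match column."""
    is_declared = train_pairs.index.isin(declared.index)
    is_true = train_pairs["is_match"].astype(bool)
    tp = int((is_declared & is_true).sum())
    fp = int((is_declared & ~is_true).sum())
    fn = int((~is_declared & is_true).sum())
    tn = int((~is_declared & ~is_true).sum())
    return tp, fp, fn, tn
```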

Choosing the appropriate thresholds
Precision (the share of declared matches that are correct, TP/(TP+FP)) and recall (the share of true matches that are found, TP/(TP+FN)) are two essential measurements that, however, move in opposite directions: making the algorithm more restrictive raises precision but lowers recall.
[Figure: precision and recall plotted against the chosen thresholds]
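Both measures follow directly from the counts tallied above; a short sketch reusing those illustrative helpers:

```python
def precision_recall(tp, fp, fn, tn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of declared matches that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of true matches that are found
    return precision, recall

counts = confusion_counts(train_pairs, declare_matches(train_pairs, p_min=0.5, d_min=2.0))
precision, recall = precision_recall(*counts)
```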

We have tended to prefer the Matthews correlation coefficient (MCC) when selecting the optimal thresholds, as it takes both precision and recall into account (a sketch of this selection step follows below). As a result, we have managed to link large populations of individuals belonging to the 1910-1920 birth cohorts across the 1920-1940 censuses and then to the death records. In addition, we have linked 22 consecutive birth cohorts of boys from the 1930 and 1940 censuses to the death records, yielding approximately 5 million confirmed links.
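A minimal sketch of selecting the thresholds by MCC, reusing the illustrative helpers above; the candidate threshold grid is an assumption chosen for demonstration:

```python
from itertools import product
from math import sqrt

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 1.0 is perfect, 0.0 is no better than chance."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

# Pick the (p, d) combination that maximizes MCC on the training pairs.
threshold_grid = list(product([0.3, 0.5, 0.7, 0.9], [1.5, 2.0, 3.0]))
best_p, best_d = max(
    threshold_grid,
    key=lambda t: mcc(*confusion_counts(train_pairs,
                                        declare_matches(train_pairs, *t))),
)
```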