An Introduction to Automated Record Linkage Jonas Helgertz helgertz@umn.edu
Background and aim
Many research topics within the social sciences benefit from (or require) a life-course perspective, i.e. data that measure individual-level characteristics at various points in time. For example, studying the relationship between education and mortality requires, at a minimum, information on completed educational attainment (ages 25-30) and on mortality, which typically occurs much later. Survey data, sometimes linked to modern administrative register data, are frequently used to address research questions where a significant amount of time separates exposure and outcome. Drawbacks of such data include recall bias, high cost, and limited sample size.
Digitized Public Use Historical Data
Full-count U.S. censuses of 1850/1860/1870/1880/1900/1910/1920/1930/1940. Social Security Death Master File/NUMIDENT, covering a gradually increasing share of deaths occurring from around 1940 to 2013/2007. In theory, the entire population can be followed every ten years and until death, overcoming problems associated with survey data. Historical data sources, however, lack a consistent identifier (e.g. SSN) that allows for straightforward linking across sources. Instead, as you have experienced firsthand, we have to rely on other indicators to make an educated guess about which record in source B corresponds to a given record in source A. Caveats associated with hand linking include it being time consuming and expensive, as well as the difficulty of applying consistent rules and of documentation/replication.
Automated Methods of Record Linkage
Sometimes using methods developed within computer science, researchers within the social sciences have embarked upon using and developing computer algorithms to link millions of individuals across different sources. These methods are computationally quite demanding and come in deterministic and probabilistic variants. The quality of a linking job is assessed based on i) the share of confirmed matches and ii) estimated measures of linking precision. The goal is typically to minimize the share of so-called false positives.

                      Algorithm: Match    Algorithm: Non-match
Truth: Match          True positive       False negative
Truth: Non-match      False positive      True negative
Defining the population of potential matches
Linking individuals in the 1920 census to the 1940 census, how do we restrict the population of potential matches? E.g. Samuel Smith, born 1903 in Texas. In theory, when finding the individual in the census conducted twenty years later, he should be enumerated as a male with an identical (or at least very similar) name, born in that same year and in that same state (blocking criteria). Historical census data are, however, messy. The population of potential matches is therefore typically defined a bit more broadly, for example by allowing birth year to differ by +/- 3 years between the sources. Less conservative restrictions increase the likelihood that the true match will be among the potential matches, but at the expense of an increased risk of false positives. A minimal blocking sketch is shown below.
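The sketch below illustrates blocking in Python/pandas under assumed column names (sex, birth_state, birth_year, name); the simple standard-library string comparison stands in for whatever name-similarity measure an actual pipeline would use.

```python
# Minimal blocking sketch (column names and thresholds are illustrative assumptions).
import pandas as pd
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude string-similarity score in [0, 1]; real pipelines often use e.g. Jaro-Winkler."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_matches(record_a: pd.Series, census_b: pd.DataFrame,
                      year_window: int = 3, min_name_sim: float = 0.8) -> pd.DataFrame:
    """Return the population of potential matches in census B for one record from census A."""
    block = census_b[
        (census_b["sex"] == record_a["sex"])
        & (census_b["birth_state"] == record_a["birth_state"])
        & ((census_b["birth_year"] - record_a["birth_year"]).abs() <= year_window)
    ].copy()
    block["name_sim"] = block["name"].apply(lambda n: name_similarity(n, record_a["name"]))
    return block[block["name_sim"] >= min_name_sim]
```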
Deterministic linking
A simple, rules-based way of conducting record linkage. Pros: easy to replicate, minimally computationally demanding. Cons: fails to capture the nuances and complexity of the underlying data; selective linking. Example: after obtaining the population of potential matches based on the chosen blocking criteria (name similarity score, birth year, sex, state of birth), a match is declared only if exactly one potential match exists for a given individual (see the sketch below).
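Reusing the hypothetical candidate_matches() helper sketched above, the deterministic rule can be expressed in a few lines; this is a sketch, not the exact rule used in any particular study.

```python
# Deterministic rule: declare a match only when the blocking step returns exactly one candidate.
def deterministic_link(record_a, census_b):
    candidates = candidate_matches(record_a, census_b)
    if len(candidates) == 1:
        return candidates.iloc[0]   # unique candidate -> declared match
    return None                     # zero or several candidates -> no match declared
```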
Deterministic linking, cont'd
Individuals with common names are almost impossible to link, as there is no way to distinguish between seemingly equally good potential matches. This is particularly problematic when working with messy data, such as historical U.S. census data.
Probabilistic linking
Based on the idea that the computer can be taught to recognize patterns in the data that are consistent with a given observation being the same individual across the two sources. Essentially, conditional on the chosen blocking criteria, every potential match is assigned a probability of being the true match, with the researcher in a following step choosing thresholds that correspond to their preferences regarding the share of false positives that is acceptable.
Establishing the "ground truth" through training data
The first task is to generate a data set of the highest quality possible, which will provide the machine learning algorithm with the information necessary to learn how to identify matches. This is called "training data": a subsample of the larger population one wishes to link. For example, if we want to link boys born 1910-1920 from the 1920 census to the 1940 census, we would extract a subsample of, say, 500-5,000 individuals from that population and link it by hand to the 1940 census. The more and better sources we use, the better the quality of the resulting training data (i.e. an absence of false positives and a maximized share of true positives).
Using the Training Data
What patterns did you observe when doing the hand-linking that characterized confirmed matches? A range of different machine learning algorithms is available, all relying on the information provided by the training data. The algorithm uses measurable (observable) characteristics to quantify their importance for the probability that any 1:1 potential match should be declared a true match. Here, we will use the probit approach suggested by Feigenbaum (2016), as it is comparatively straightforward to understand. We estimate a probit regression model on the training data, with the declared matches as the dependent variable and a set of constructed comparison variables as explanatory variables.
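A hedged sketch of this estimation step using statsmodels follows; the FEATURES list and the 'match' column are illustrative stand-ins for the constructed comparison variables and the hand-linked outcome in the actual training data.

```python
# Hedged sketch: fit a probit on hand-linked training data (statsmodels).
import statsmodels.api as sm

FEATURES = ["first_name_sim", "last_name_sim", "abs_birth_year_diff", "unique_in_block"]

def fit_probit(training_pairs):
    """training_pairs: one row per candidate pair, with a 0/1 'match' ground-truth column."""
    X = sm.add_constant(training_pairs[FEATURES])
    y = training_pairs["match"]
    results = sm.Probit(y, X).fit()
    print(results.summary())   # coefficient signs show the direction of each effect
    return results
```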
Interpreting the output of the algorithm
A negative coefficient means that the higher the value of the variable in question, the lower the probability of a match, and vice versa.
Using the algorithm to generate matches
Highlighted are all constructed characteristics that the algorithm uses to predict the probability of a match. Using these in combination with the estimated coefficients yields a predicted probability of a match, indicated by the column labeled "p" (see the scoring sketch below).
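A hedged sketch of this scoring step, reusing the hypothetical FEATURES list and fitted probit from the sketch above:

```python
# Hedged sketch: compute the predicted match probability "p" for every candidate pair.
import statsmodels.api as sm

def score_candidates(results, candidate_pairs):
    """Add a predicted-probability column 'p' to every candidate pair."""
    X = sm.add_constant(candidate_pairs[FEATURES], has_constant="add")
    return candidate_pairs.assign(p=results.predict(X))
```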
Using the algorithm to generate matches, cont'd
Before assessing the quality of the linking algorithm, the researcher needs to make two choices: i) the lowest acceptable predicted probability for declaring a match (p), and ii) how much better the best possible match needs to be relative to the second-best possible match (d = p_best / p_second_best). The higher the values of both p and d, the more restrictive the algorithm will be in declaring matches. If we choose a p threshold of 0.5 and a d threshold of 2, it is clear that the observation below will be declared a match, since 0.78/0.013 = 60, well above the threshold of 2. A sketch of this rule is shown below.
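The sketch below applies the (p, d) rule to the scored candidate pairs; 'id_a' is an assumed column identifying the source-A record to which each pair belongs.

```python
# Hedged sketch: keep, per source-A record, the best candidate only if it clears both thresholds.
import pandas as pd

def declare_matches(scored_pairs, p_min=0.5, d_min=2.0):
    links = []
    for id_a, group in scored_pairs.groupby("id_a"):
        ranked = group.sort_values("p", ascending=False)
        best = ranked.iloc[0]
        runner_up = ranked["p"].iloc[1] if len(ranked) > 1 else 0.0
        # declare a match only if p is high enough AND clearly better than the runner-up
        if best["p"] >= p_min and (runner_up == 0.0 or best["p"] / runner_up >= d_min):
            links.append(best)
    return pd.DataFrame(links)
```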
Assessing the performance of the algorithm
Assumption: our training data is a representative subsample of the population that we ultimately want to link. As a result, we can compare the performance of our algorithm against the ground truth data to get a measurement of how well it will perform on the larger set of data. In the case of James Broadoway, it would appear that declaring a match would result in a "true positive", i.e. it is a match also in the underlying ground truth data.
Assessing the performance of the algorithm, cont'd
The table below shows the four different outcomes (TP/TN/FP/FN) across a range of values of p and d. Holding p constant, as the value of d increases, the number of false positives diminishes (but so does the number of true positives).
Choosing the appropriate thresholds
Precision (the share of declared positives that are accurate, TP / (TP + FP)) and recall (the share of true positives that are found, TP / (TP + FN)) are two essential measurements that, however, move in opposite directions.
We have tended to prefer the Matthews correlation coefficient (MCC) when selecting the optimal thresholds, as it takes both precision and recall into account. As a result, we have managed to link large populations of individuals belonging to the 1910-1920 birth cohorts across the 1920-1940 censuses and then to the death records. In addition, we have linked 22 consecutive birth cohorts of boys from the 1930 and 1940 censuses to the death records, yielding approximately 5 million confirmed links.
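As a rough illustration of how these quantities can be computed when tuning p and d against the hand-linked training data (the dictionary-based bookkeeping, mapping each source-A id to its chosen source-B id, is an assumption made for illustration, not the exact implementation used):

```python
# Hedged sketch: confusion counts, precision, recall, and MCC against the ground truth.
import math

def confusion_counts(declared, truth, all_ids):
    tp = fp = fn = tn = 0
    for id_a in all_ids:
        d, t = declared.get(id_a), truth.get(id_a)
        if d is not None and d == t:
            tp += 1     # algorithm links to the correct record
        elif d is not None:
            fp += 1     # algorithm declares a link that the ground truth does not support
        elif t is not None:
            fn += 1     # a true match exists but no link was declared
        else:
            tn += 1     # correctly left unlinked
    return tp, fp, fn, tn

def precision_recall_mcc(tp, fp, fn, tn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc
```

Sweeping declare_matches() over a grid of p and d values and keeping the combination with the highest MCC mirrors the threshold-selection logic described above.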