
© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2007.




1 Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2007

2 Overview
Introduction to record linking
–What is record linking, what is it not, what is the theory?
Record linking: Applications and examples
–How do you do it, what do you need, what are the possible complications?
Examples of record linking
Do it yourself record linking

3 From Imputing to Linking
[Figure: four approaches placed along two axes, "Precision of link" and "Availability of linked data"]
“Classical”: Merge match by link variable
“Statistical record linkage”: Merge match with imperfect link variables
“Massively imputed”: Common variables/values, but datasets can’t be linked
“Simulated data”: No common observations

4 Definitions of Record Linkage
“a procedure to find pairs of records in two files that represent the same entity”
“identify duplicate records within a file”

5 Uses of Record Linkage
Merging two files for micro-data analysis
–CPS base survey to a supplement
–SIPP interviews to each other
–Merging years of Business Register
–Merging two years of CPS
–Merging financial info to firm survey
Updating a survey frame or an electoral list
–Based on business lists
–Based on tax records
Disclosure review of potential public use micro-data

6 Types of Record Linkage
Merging two files for micro-data analysis
–CPS base survey to a supplement
–SIPP interviews to each other
–Merging years of Business Register
–Merging two years of CPS*
–Merging financial info to firm survey
Updating a survey frame or an electoral list
–Based on business lists
–Based on tax records
Disclosure review of potential public use micro-data
Deterministic linkage: survey-provided IDs
Probabilistic linkage: imperfect or no IDs
Probabilistic linkage: no IDs

7 Methods of Record Linkage
Probabilistic record linkage (PBRL)
–non-parametric methods
–regression-based methods
Distance-based record linkage (DBRL)
–Euclidean distance
–Mahalanobis distance
–Kernel-based distance

8 Need for Automated Record Linkage
RA time required for the following matching tasks:
–Finding financial records for
  Fortune 100: 200 hours (Abowd, 1989)
  50,000 small businesses: ??? hours
–Identifying miscoded SSNs
  on 60,000 wage records: several weeks
  on 500 million wage records: ????
–Unduplication of the U.S. Census survey frame (115,904,641 households): ????
–Longitudinally linking the 12 million establishments in the Business Register: ????

9 Basic Definitions and Notation
Entities
Associated files
Records on files
Matches
Nonmatches
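
The transcript lists these terms without their formulas. One common way to write them, following the Fellegi-Sunter setup, is sketched below. The symbols A and B for the two files are assumptions introduced here for illustration; M and U match the way matches and nonmatches are denoted on the later slides.

```latex
\begin{align*}
A \times B &= \{(a, b) : a \in A,\ b \in B\} \\
M &= \{(a, b) \in A \times B : a \text{ and } b \text{ represent the same entity}\} \\
U &= \{(a, b) \in A \times B : a \text{ and } b \text{ represent different entities}\} \\
A \times B &= M \cup U, \qquad M \cap U = \emptyset
\end{align*}
```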

10 Comparisons
Comparison function maps comparison space into some domain:
Comparison vector
PBRL: Agreement pattern, finitely many values, typically {0,1}, but can be real-valued
DBRL: distance (scalar)

11 Linkage Rule
A linkage rule defines a record pair’s status based on its comparison value
–Link (L)
–Undecided (Clerical, C)
–Non-link (N)

12 Linkage Rules Depend on Context
PBRL:
–For matching: rank by agreement ratios, use cutoff values to classify into {L,C,N}
–For disclosure analysis: rank by agreement ratios, classify as {L} if true link (M) is among top j pairs
DBRL:
–Rank pairs by distance, link closest pairs

13 Probabilistic Record Linkage

14 Example Agreement Pattern
3 binary comparisons test whether
–γ_1: pair agrees on last name
–γ_2: pair agrees on first name
–γ_3: pair agrees on street name
Simple agreement pattern: γ=(1,0,1)
Complex agreement pattern: γ=(0.66,0,0.8)
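
A minimal sketch of how such patterns might be computed for one record pair. The field names, the sample records, and the use of `difflib.SequenceMatcher` as the similarity score are all assumptions made for illustration; the slides do not say which string comparator produces the real-valued pattern.

```python
from difflib import SequenceMatcher

# Hypothetical field names, chosen for illustration only.
FIELDS = ("last_name", "first_name", "street_name")


def simple_pattern(rec_a, rec_b):
    """Binary agreement pattern: 1 if a field agrees exactly (ignoring case), else 0."""
    return tuple(
        int(rec_a[f].strip().lower() == rec_b[f].strip().lower()) for f in FIELDS
    )


def complex_pattern(rec_a, rec_b):
    """Real-valued pattern: one similarity score in [0, 1] per field."""
    return tuple(
        round(SequenceMatcher(None, rec_a[f].lower(), rec_b[f].lower()).ratio(), 2)
        for f in FIELDS
    )


# Hypothetical record pair.
a = {"last_name": "Smith", "first_name": "John", "street_name": "Main"}
b = {"last_name": "smith", "first_name": "Jon", "street_name": "Main"}

gamma_simple = simple_pattern(a, b)    # (1, 0, 1), as in the simple pattern above
gamma_complex = complex_pattern(a, b)  # partial credit for "John" vs "Jon"
```

The binary version throws away the near-agreement on first name, which is what the real-valued pattern is meant to retain.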

15 Conditional Probabilities
Probability that a record pair has agreement pattern γ given that it is a match [nonmatch]: P(γ|M) [P(γ|U)]
Agreement ratio: R(γ) = P(γ|M) / P(γ|U)
This ratio determines the distinguishing power of the comparison γ.
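
As a rough illustration, the agreement ratio can be computed for binary patterns once per-field agreement probabilities are given. The probabilities below are invented, and the sketch assumes the fields agree or disagree independently within the match and nonmatch classes.

```python
# Hypothetical per-field probabilities, for illustration only:
# M_PROBS[i] = P(field i agrees | M), U_PROBS[i] = P(field i agrees | U).
M_PROBS = (0.95, 0.90, 0.80)
U_PROBS = (0.01, 0.05, 0.10)


def pattern_prob(gamma, probs):
    """P(gamma | class) for a binary pattern, assuming conditional independence."""
    p = 1.0
    for g, q in zip(gamma, probs):
        p *= q if g == 1 else (1.0 - q)
    return p


def agreement_ratio(gamma, m=M_PROBS, u=U_PROBS):
    """R(gamma) = P(gamma | M) / P(gamma | U)."""
    return pattern_prob(gamma, m) / pattern_prob(gamma, u)


r_agree = agreement_ratio((1, 1, 1))     # all fields agree: ratio far above 1
r_mixed = agreement_ratio((1, 0, 1))     # first name disagrees
r_disagree = agreement_ratio((0, 0, 0))  # nothing agrees: ratio far below 1
```

Patterns that are much more likely among matches than among nonmatches get a large ratio, so sorting pairs by R(γ) puts the most plausible matches first.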

16 Error Rates
False match: a linked pair that is not a match (type II error)
False match rate: probability that a nonmatch is designated a link (L): μ=P(L|U)
False nonmatch: a nonlinked pair that is a match (type I error)
False nonmatch rate: probability that a match is designated a nonlink (N): λ=P(N|M)

17 Fundamental Theorem
1. Order the comparison vectors {γ_j} by R(γ)
2. Choose upper T_u and lower T_l cutoff values for R(γ)
3. Linkage rule:
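
The rule itself is not reproduced in the transcript. The usual Fellegi-Sunter form, written with the cutoffs T_u and T_l from step 2 and the statuses L, C, N, is given here as a reconstruction under that assumption:

```latex
F(\gamma) =
\begin{cases}
L & \text{if } R(\gamma) \ge T_u \\
C & \text{if } T_l < R(\gamma) < T_u \\
N & \text{if } R(\gamma) \le T_l
\end{cases}
\qquad \text{with } T_l \le T_u .
```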

18 Fundamental Theorem (cont.)
Error rates are
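
The expressions are missing from the transcript. Assuming pairs are linked when R(γ) ≥ T_u and declared non-links when R(γ) ≤ T_l, the definitions μ=P(L|U) and λ=P(N|M) expand to sums over the agreement patterns that fall in each region:

```latex
\begin{align*}
\mu &= P(L \mid U) = \sum_{\gamma \,:\, R(\gamma) \ge T_u} P(\gamma \mid U) \\
\lambda &= P(N \mid M) = \sum_{\gamma \,:\, R(\gamma) \le T_l} P(\gamma \mid M)
\end{align*}
```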

19 Fundamental Theorem (3)
Fellegi & Sunter (JASA, 1969): If the error rates for the elements of the comparison vector are conditionally independent, then given the overall error rates (μ, λ), the linkage rule F minimizes the probability associated with an agreement pattern γ being placed in the clerical review set. (optimal linkage rule)

20 Applying the Theory
The theory holds on any subset of match pairs (blocks)
Ratio R: matching weight or total agreement weight
Optimality of decision rule heavily dependent on the probabilities P(γ|M) and P(γ|U)
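
To see how the pieces fit together, the sketch below enumerates every binary agreement pattern on three fields, classifies each one by its ratio R(γ), and adds up the implied error rates. The per-field probabilities and both cutoffs are invented for illustration, and the fields are assumed to be conditionally independent. Pairs with a ratio at or above the upper cutoff are treated as links and pairs at or below the lower cutoff as non-links.

```python
from itertools import product

# Hypothetical inputs, for illustration only.
M_PROBS = (0.95, 0.90, 0.80)  # P(field agrees | M)
U_PROBS = (0.01, 0.05, 0.10)  # P(field agrees | U)
T_UPPER, T_LOWER = 50.0, 1.0  # cutoffs on R(gamma)


def pattern_prob(gamma, probs):
    """P(gamma | class) for a binary pattern, assuming conditional independence."""
    p = 1.0
    for g, q in zip(gamma, probs):
        p *= q if g else 1.0 - q
    return p


def classify(ratio, t_upper=T_UPPER, t_lower=T_LOWER):
    """Return 'L' (link), 'N' (non-link) or 'C' (clerical review)."""
    if ratio >= t_upper:
        return "L"
    if ratio <= t_lower:
        return "N"
    return "C"


def error_rates():
    """Classify every pattern and accumulate mu = P(L|U) and lam = P(N|M)."""
    mu = lam = 0.0
    decisions = {}
    for gamma in product((0, 1), repeat=len(M_PROBS)):
        p_m = pattern_prob(gamma, M_PROBS)
        p_u = pattern_prob(gamma, U_PROBS)
        status = classify(p_m / p_u)
        decisions[gamma] = status
        if status == "L":
            mu += p_u   # nonmatches that end up linked
        elif status == "N":
            lam += p_m  # matches that end up not linked
    return mu, lam, decisions


mu, lam, decisions = error_rates()
```

Changing `M_PROBS` or `U_PROBS` moves patterns across the cutoffs, which is one way to see why the quality of the decision rule depends so heavily on those probabilities.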

21 Distance-Based Record Linking

22 Distance-Based Record Linking
Distance between any pair of records can be generally defined as
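
The definition is missing from the transcript. One general form that covers both the Euclidean and the Mahalanobis cases is a quadratic form in the difference of the two records. The weight matrix W, the covariance matrix Σ, and the per-variable variances σ_k² are notation assumed here, not taken from the slides:

```latex
d(a, b)^2 = (a - b)^{\top} W \,(a - b),
\qquad
W =
\begin{cases}
I & \text{Euclidean distance, unstandardized inputs} \\
\operatorname{diag}(1/\sigma_1^2, \dots, 1/\sigma_K^2) & \text{Euclidean distance, standardized inputs} \\
\Sigma^{-1} & \text{Mahalanobis distance}
\end{cases}
```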

23 DBRL: 4 cases
Mahalanobis distance, known covariance
Mahalanobis distance, unknown covariance

24 DBRL: 4 cases
Euclidean distance, unstandardized inputs
Euclidean distance, standardized inputs

25 Linkage rules
Matching: Sort by distance, choose top j pairs as matches
Disclosure analysis: Sort by distance, identify true matches among top j pairs
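
A minimal sketch of the sort-by-distance rule, using Euclidean distance on standardized inputs. The two small files, the choice to standardize on the pooled data, and the function names are assumptions made for illustration.

```python
import math


def column_stats(rows):
    """Mean and population standard deviation of each column."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n) or 1.0  # avoid dividing by 0
        for col, m in zip(zip(*rows), means)
    ]
    return means, sds


def standardize(rows, means, sds):
    """Rescale every column to mean 0 and standard deviation 1."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_pairs(file_a, file_b, j=1):
    """For each record in file_a, return the indices of its j closest records in file_b."""
    means, sds = column_stats(file_a + file_b)
    za = standardize(file_a, means, sds)
    zb = standardize(file_b, means, sds)
    links = {}
    for i, a in enumerate(za):
        order = sorted(range(len(zb)), key=lambda k: euclidean(a, zb[k]))
        links[i] = order[:j]
    return links


# Hypothetical data: (age, annual earnings). file_b is a noisy, reordered copy of file_a.
file_a = [[25, 30000.0], [40, 52000.0], [61, 75000.0]]
file_b = [[60, 74000.0], [26, 31000.0], [41, 50500.0]]

links = rank_pairs(file_a, file_b, j=1)  # {0: [1], 1: [2], 2: [0]}
```

For matching, the top-ranked candidates would be taken as links. For disclosure analysis, the same ranking would be checked against the known true matches to see how often they appear among the top j.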

26 Acknowledgements
This lecture is based in part on lectures given in 2000 and 2004 by William Winkler, William Yancey and Edward Porter at the U.S. Census Bureau
Some portions draw on Winkler (1995), “Matching and Record Linkage,” in B.G. Cox et al. (eds.), Business Survey Methods, New York, J. Wiley, 355-384.
Some (non-confidential) portions drawn from Abowd, Stinson, Benedetto (2006), “Final Report to Social Security Administration on the SIPP/SSA/IRS Public Use File Project”

