Record Linking, II John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved.

Slides:



Advertisements
Similar presentations
Katherine Jenny Thompson
Advertisements

JavaScript I. JavaScript is an object oriented programming language used to add interactivity to web pages. Different from Java, even though bears some.
Lecture 2 ANALYSIS OF VARIANCE: AN INTRODUCTION
Programming with App Inventor Computing Institute for K-12 Teachers Summer 2012 Workshop.
Conceptualization, Operationalization, and Measurement
Lecture 8: Testing, Verification and Validation
Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:
Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Assessing “Success” in Anti-Poverty Policy Lars Osberg Dalhousie University October 1, 2004.
Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.
1 A Common Measure of Identity and Value Disclosure Risk Krish Muralidhar University of Kentucky Rathin Sarathy Oklahoma State University.
A Comparison of String Matching Distance Metrics for Name-Matching Tasks William Cohen, Pradeep RaviKumar, Stephen Fienberg.
Record Linkage in Stata
© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2007.
March 2013 ESSnet DWH - Workshop IV DATA LINKING ASPECTS OF COMBINING DATA INCLUDING OPTIONS FOR VARIOUS HIERARCHIES (S-DWH CONTEXT)
© 2007 John M. Abowd, Lars Vilhuber, all rights reserved An application of probabilistic matching Abowd and Vilhuber (2004), JBES.
Simple Linear Regression
Chapter 2: Algorithm Discovery and Design
Chapter 3: System design. System design Creating system components Three primary components – designing data structure and content – create software –
Clustered or Multilevel Data
© John M. Abowd and Lars Vilhuber 2005, all rights reserved Record Linking Examples John M. Abowd and Lars Vilhuber March 2005.
© John M. Abowd 2005, all rights reserved Sampling Frame Maintenance John M. Abowd February 2005.
© John M. Abowd and Lars Vilhuber 2005, all rights reserved Introduction to Probabilistic Record Linking John M. Abowd and Lars Vilhuber March 2005.
1 Chapter 2 Reviewing Tables and Queries. 2 Chapter Objectives Identify the steps required to develop an Access application Specify the characteristics.
© John M. Abowd and Lars Vilhuber 2005, all rights reserved Record Linking, II John M. Abowd and Lars Vilhuber March 2005.
© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Estimating m and u Probabilities Using EM Based on Winkler 1988 "Using the EM Algorithm for Weight.
Chapter 2: Algorithm Discovery and Design
Aspects of Bayesian Inference and Statistical Disclosure Control in Python Duncan Smith Confidentiality and Privacy Group CCSR University of Manchester.
Mathcad Variable Names A string of characters (including numbers and some “special” characters (e.g. #, %, _, and a few more) Cannot start with a number.
Correlation and Linear Regression
REFACTORING Lecture 4. Definition Refactoring is a process of changing the internal structure of the program, not affecting its external behavior and.
Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved.
1 Validation & Verification Chapter VALIDATION & VERIFICATION Very Difficult Very Important Conceptually distinct, but performed simultaneously.
Monthly APCD User Workgroup Webinar April 22, 2014.
Sullivan – Fundamentals of Statistics – 2 nd Edition – Chapter 9 Section 1 – Slide 1 of 39 Chapter 9 Section 1 The Logic in Constructing Confidence Intervals.
Coupling and Cohesion Pfleeger, S., Software Engineering Theory and Practice. Prentice Hall, 2001.
The Use of Administrative Sources for Statistical Purposes Matching and Integrating Data from Different Sources.
Educational Research: Competencies for Analysis and Application, 9 th edition. Gay, Mills, & Airasian © 2009 Pearson Education, Inc. All rights reserved.
Chapter 8: Arrays.
The relationship between error rates and parameter estimation in the probabilistic record linkage context Tiziana Tuoto, Nicoletta Cibella, Marco Fortini.
Cohesion and Coupling CS 4311
The Conditional Independence Assumption in Probabilistic Record Linkage Methods Stephen Sharp National Records of Scotland Ladywell Road Edinburgh EH12.
Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2013.
CONCEPTS AND TECHNIQUES FOR RECORD LINKAGE, ENTITY RESOLUTION, AND DUPLICATE DETECTION BY PETER CHRISTEN PRESENTED BY JOSEPH PARK Data Matching.
© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Record Linking, II John M. Abowd and Lars Vilhuber April 2007.
Chapter 10 Verification and Validation of Simulation Models
Relative Values. Statistical Terms n Mean:  the average of the data  sensitive to outlying data n Median:  the middle of the data  not sensitive to.
1 CSCD 326 Data Structures I Software Design. 2 The Software Life Cycle 1. Specification 2. Design 3. Risk Analysis 4. Verification 5. Coding 6. Testing.
Programming Logic and Design Fourth Edition, Comprehensive Chapter 8 Arrays.
Cross Language Clone Analysis Team 2 February 3, 2011.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
Data Structures and Algorithms Searching Algorithms M. B. Fayek CUFE 2006.
INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011.
CMPSC 16 Problem Solving with Computers I Spring 2014 Instructor: Tevfik Bultan Lecture 4: Introduction to C: Control Flow.
 Simulation enables the study of complex system.  Simulation is a good approach when analytic study of a system is not possible or very complex.  Informational,
Copyright © 2014 Curt Hill Algorithms From the Mathematical Perspective.
Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2013, 2016.
1 Record Linkage & Fuzzy Matching (More on "Blocking" for Performance Improvement) Joseph Vertido Melissa Data Fuzzy.
Coupling and Cohesion Schach, S, R. Object-Oriented and Classical Software Engineering. McGraw-Hill, 2002.
Coupling and Cohesion Pfleeger, S., Software Engineering Theory and Practice. Prentice Hall, 2001.
Sampling procedures for assessing accuracy of record linkage Paul A. Smith, S3RI, University of Southampton Shelley Gammon, Sarah Cummins, Christos Chatzoglou,
Introduction to Probabilistic Record Linking
Indexing Structures for Files and Physical Database Design
Two-Sample Hypothesis Testing
Chapter 10 Verification and Validation of Simulation Models
The relational operators
Chapter 3: Lexical Analysis
Discrete Event Simulation - 4
Minwise Hashing and Efficient Search
Stephanie Hirner ESTP ”Administrative data and censuses
Presentation transcript:

Record Linking, II John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Implementing Probabilistic Record Linkage Standardizing Blocking and matching variables Calculating the agreement index Choosing M and U probabilities Estimating M and U probabilities using EM Clerical editing Estimating the false match rate Estimating the false nonmatch rate © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Matching Software Commercial ($$$$-$$$$$) – Automatch/Vality/Ascential/IBM WebSphere Information Integration (grew out of Jaro’s work at the Census Bureau) – DataFlux/ SAS Data Quality Server – Oracle – Others Custom software (0-$$) – C/Fortran Census SRD-maintained software – Java implementation used in Domingo-Ferrer, Abowd, and Torra (2006) – Java Data Mining API © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Software differences Each software is an empirical/practical implementation driven by specific needs Terminology tends to differ: – “standardize”, “schema”, “simplify” – “block”, “exact match” – “comparator function”, “match definition” © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Custom/ SRD Vality/Ascent ial SAS DQ Standardizing standard Yes PROC DQSCHEME Blocking and matching variables Yes Yes (blocking = “exact match”) Calculating the agreement index Yes No Choosing M and U probabilities Yes No (varying “sensitivity” achieves the same goal) Estimating M and U probabilities using EM Yes ( eci ) (external)No Matching matcher Yes PROC DQMATCH © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

STANDARDIZING © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Standardizing Standardization is a necessary preprocessing step for all data to be linked via probabilistic record linking A standardizer: – Parses text fields into logical components (first name, last name; street number, street name, etc.) – Standardizes the representation of each parsed field (spelling, numerical range, capitalization, etc.) Commercial standardizers have very high value- added compared to home-grown standardizers but are very expensive © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

How to Standardize Inspect the file to refine strategy Use commercial software Write custom software (SAS, Fortran, C) Apply standardizer Inspect the file to refine strategy © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Standardizing Names Alternate spellings 1.Dr. William J. Smith, MD 2.Bill Smith 3.W. John Smith, MD 4.W.J. Smith, Jr. 5.Walter Jacob Smith, Sr. © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Standardized Names PreFirstMidLastPos t1 Post 2 Alt1Std1 1DrWilliamJSmithMDBWILL 2BillSmithWilliamBWILL 3WJohnSmithMD 4WJSmithJr 4WalterJacobSmithSrWALT © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Standardizing Addresses Many different pieces of information 1.16 W Main Street #16 2.RR 2 Box Fuller Building, Suite 405, 2 nd door to the right Highway 16W © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Standardized Addresses Pre 2 HsnmStnmRRBoxPost1Post2Unit 1 Unit 2 Bldg 1W16MainSt Fuller Hwy16W © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Standardizing and language Standardizers are language- and “country”- specific – Address tokens may differ: “street”, “rue”, “calle”, “Straße” – Address components may differ: 123 Main Street Normal, IL L 7,1 D Mannheim 1234 URB LOS OLMOS PONCE PR © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Standardizing and language (2) Names differ – Juan, John, Johann, Yohan Variations of names differ: – Sepp, Zé, Joe -> Joseph Frequencies of names differ (will be important later) – Juan is frequent in Mexico, infrequent in Germany © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Custom standardization Standardization may depend on the particular application Example OPM project – “Department of Defense” – “Department of Commerce” – The token “Department of” does not have distinguishing power, but comprises the majority of the “business name” – Similar: “Service”, “Bureau” © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

MATCHING © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Implementing the Basic Matching Methodology Identifying comparison strategies: – Which variables to compare – String comparator metrics – Number comparison algorithms – Search and blocking strategies Ensuring computational feasibility of the task – Choice of software/hardware combination – Choice of blocking variables (runtimes quadratic in size of block) Estimating necessary parameters © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Determination of Match Variables Must contain relevant information Must be informative (distinguishing power!) May not be on original file, but can be constructed (frequency, history information) © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Blocking and Matching The essence of a probabilistic record link is iterating passes of the data files in which blocking variables (must match exactly) and matching variables (used to compute the agreement index) change roles. Blocking variables reduce the computational burden but increase the false non-match rate => solved by multiple passes As records are linked, the linked records are removed from the input files and the analyst can use fewer blocking variables to reduce the false non-matches. Matching variables increase the computational burden and manage the tradeoff between false match and false non- match errors © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved 1’s tenure with A: 1’s employment history CodedCoded Name SSN EIN Lesli Kay 1A Leslie Kay 2A Lesly Kai 3B Earnings $10 $10 $11 Separations too high Accessions too high SSN Name Editing Example /

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved CodedCoded Name SSN EIN Lesli Kay 1A Leslie Kay 2A Lesly Kai 3B Earnings $10 $10 $11 Separations too high Accessions too high SSN Name Editing Example T(1)T(2)T(3)

Computed and observed variables Reclassification of information Blocking on a-priori information FileNameSSNEarnPeriodGenderType ALesli Kay1$10T(2)MHole BLeslie Kay2$10T(2)MPlug BLesly Kai3$10T(4)FPlug © 2011 John M. Abowd, Lars Vilhuber, all rights reserved Blocking: Earn, Period, Gender Match on: Name, SSN

Iterating First pass may block on a possibly miscoded variable FileNameSSNEarnPeriodGenderType ALesli Kay1$10T(2)MHole BLeslie Kay2$10T(2)M FPlug BLesly Kai3$10T(4)FPlug © 2011 John M. Abowd, Lars Vilhuber, all rights reserved Block pass 1: Earn, Period, Gender Block pass 2: Earn, Period

Understanding Comparators Comparators need to account for – Typographical error – Significance of slight variations in numbers (both absolute and relative) – Possible variable inversions (first and last name flipped) © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Soundex: History Used for historical analysis, archiving Origins in early 20 th century Available in many computer programs (SQL, SAS, etc.) Official “American” Soundex at National Archives: © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

String Comparators: Soundex The first letter is copied unchanged Subsequent letters: bfpv -> "1" cgjkqsxzç -> "2" dt -> "3" l -> "4" mnñ -> "5" r -> "6 " Other characters are ignored Repeated characters treated as single character. 4 chars, zero padded. For example, "SMITH" or "SMYTHE" would both be encoded as "S530". © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

String Comparators: Jaro First returns a value based on counting insertions, deletions, transpositions, and string length Total agreement weight is adjusted downward towards the total disagreement weight by some factor based on the value Custom adjustments (Winkler and others) © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Comparing Numbers A difference of “34” may mean different things: – Age: a lot (mother-daughter? Different person) – Income: little – SSN or EIN: no meaning Some numbers may be better compared using string comparators © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Number of Matching Variables In general, the distinguishing power of a comparison increases with the number of matching variables Exception: variables are strongly correlated, but poor indicators of a match Example: General business name and legal name associated with a license. © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Determination of Match Parameters Need to determine the conditional probabilities P(agree|M), P(agree|U) for each variable comparison Methods: – Clerical review – Straight computation (Fellegi and Sunter) – EM algorithm (Dempster, Laird, Rubin, 1977) – Educated guess/experience – For P(agree|U) and large samples (population): computed from random matching © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Determination of Match Parameters (2) Fellegi & Sunter provide a solution when γ represents three variables. The solution can be expressed as marginal probabilities m k and u k In practice, this method is used in many software applications For k>3, method-of-moments or EM methods can be used. © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Calculating the Agreement Index We need to compute P(γ|M), P(γ|U) and the agreement ratio R(γ) = P(γ|M) / P(γ|U) The agreement index is ln R(γ). The critical assumption is conditional independence: P(γ|M) = P(γ 1 |M) P(γ 2 |M)… P(γ K |M) P(γ|U) = P(γ 1 |U) P(γ 2 |U)… P(γ K |U) where the subscript indicates an element of the vector γ. Implies that the agreement index can be written as: © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Choosing m and u Probabilities Define m k = P(γ k |M) u k = P(γ k |U) These probabilities are often assessed using a priori information or estimated from an expensive clerically edited link. – m often set a priori to 0.9 – u often set a priori to 0.1 Neither of these assumptions has much empirical support © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Some Rules of Thumb Gender m k = P(γ k |M) is a function of the data (random miscodes of gender variable) u k = P(γ k |U) = 0.5 (unconditional on other variables). This may not be true for certain blocking variables: age, veteran status, etc. will affect this value Exact identifiers (SSN, SIN) m k = P(γ k |M) will depend on verification by the data provider. For example, embedded checksums will move this probability closer to 1. u k = P(γ k |U) << 0.1 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Marginal Probabilities: Educated Guesses for Starting Values P(agree on characteristic X| M)= 0.9 if X = first, last name, age 0.8 if X = house no., street name, other characteristic P(agree on characteristic X| U)= 0.1 if X = first, last name, age 0.2 if X = house no., street name, other characteristic © 2011 John M. Abowd, Lars Vilhuber, all rights reserved Note that distinguishing power of first name (R(first)=0.9/0.1=9) is larger than the street name (R(street)=0.8/0.2=4)

© 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Marginal Probabilities: Better Estimates of P(agree|M) P(agree|M) can be improved after a first match pass by a clerical review of match pairs: – Draw a sample of pairs – Manual review to determine “true” match status – Recompute P(agree|M) based on known truth sample © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Estimating m and u Using Matched Data If you have two files  and  that have already been linked (perhaps clerically, perhaps with an exact link) then these estimates are available: © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Estimating m and u Probabilities Using EM Based on Winkler 1988 "Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage," Proceedings of the Section on Survey Research Methods, American Statistical Association, Uses the identity P(  )=P(  |M)P(M)+P(  |U)P(U) Imposes conditional independence © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Clerical Editing Once the m and u probabilities have been estimated, cutoffs for the U, C, and L sets must be determined. This is usually done by setting preliminary cutoffs then clerically refining them. Often the m and u probabilities are tweaked as a part of this clerical review. © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Estimating the False Match Rate This is usually done by clerical review of a run of the automated matcher. Some help is available from Belin, T. R., and Rubin, D. B. (1995), "A Method for Calibrating False-Match Rates in Record Linkage," Journal of the American Statistical Association, 90, © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Estimating the False Nonmatch Rate This is much harder. Often done by a clerical review of a sample of the non-match records. Since false nonmatching is relatively rare among the nonmatch pairs, this sample is often stratified by variables known to affect the match rate. Stratifying by the agreement index is a very effective way to estimate false nonmatch rates. © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Post-processing Once matching software has identified matches, further processing may be needed: – Clean up – Carrying forward matching information – Reports on match rates © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Example: Abowd and Vilhuber (2005) 'The Sensitivity of Economic Statistics to Coding Errors in Personal Identifiers‘, Journal of Business and Economic Statistics, 23(2), pages Goal of the study: Assess the impact of measurement error in tenure on aggregate measures of turnover Appendix A has a detailed description of matching passes and construction of additional blocking variables © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Example: CPS-Decennial 2000 match “Accuracy of Data for Employment Status as Measured by the CPS-Census 2000 Match”, Palumbo and Siegel (2004), Census 2000 Evaluation B.7, accessed at 20Final%20Report.pdf on April 5, Final%20Report.pdf Goal of the study: assess employment status on two independent questionnaires © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Software notes Idiosyncratic implementations – Merge two files in Stata/SAS/SQL/etc. (outer join or wide file), within blocks – For given m/u, compute agreement index on a per-variable basis for all – Order records within blocks, select match record – Generically, this works with numeric variables, but has issues when working with character variables Typically, only Soundex available Implementing string comparators (Jaro) may be difficult © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Software notes: character matching SAS Data Quality – Is focused on “data quality”, “data cleaning” – Has functions for standardization and string matching – String matching is variant of Soundex: simplified strings based on generic algorithms – Issues: no probabilistic foundation, useful primarily for sophisticated string matching Combination of SAS DQ and idiosyncratic processing can yield powerful results © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Software notes: SQL SQL languages are used in many contexts where matching is frequent – Oracle 10g R1 has Jaro-Winkler edit-distance implemented. – MySQL allows for some integration (possibly see code/ ) code/ – PostGreSQL add-ons © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Software notes: R R has a recent package (untested by us) that seems a fairly complete suite Does not include standardization © 2011 John M. Abowd, Lars Vilhuber, all rights reserved

Acknowledgements This lecture is based in part on a 2000 lecture given by William Winkler, William Yancey and Edward Porter at the U.S. Census Bureau Some portions draw on Winkler (1995), “Matching and Record Linkage,” in B.G. Cox et. al. (ed.), Business Survey Methods, New York, J. Wiley, Examples are all purely fictitious, but inspired by true cases presented in the above lecture, in Abowd & Vilhuber (2005). © 2011 John M. Abowd, Lars Vilhuber, all rights reserved