© 2007 John M. Abowd, Lars Vilhuber, all rights reserved An application of probabilistic matching Abowd and Vilhuber (2004), JBES.

Slides:



Advertisements
Similar presentations
Record Linking, II John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved.
Advertisements

Chap 8-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 8 Estimation: Single Population Statistics for Business and Economics.
Statistical Decision Making
© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2007.
Chapter 8 Estimation: Additional Topics
Cluster Analysis.  What is Cluster Analysis?  Types of Data in Cluster Analysis  A Categorization of Major Clustering Methods  Partitioning Methods.
Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 7-1 Chapter 7 Confidence Interval Estimation Statistics for Managers.
Basic Business Statistics, 10e © 2006 Prentice-Hall, Inc. Chap 8-1 Chapter 8 Confidence Interval Estimation Basic Business Statistics 10 th Edition.
Business Statistics: A Decision-Making Approach, 6e © 2005 Prentice-Hall, Inc. Chap 7-1 Introduction to Statistics: Chapter 8 Estimation.
© John M. Abowd 2005, all rights reserved Statistical Tools for Data Integration John M. Abowd April 2005.
Chapter 8 Estimation: Single Population
Chap 9-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 9 Estimation: Additional Topics Statistics for Business and Economics.
© John M. Abowd and Lars Vilhuber 2005, all rights reserved Record Linking Examples John M. Abowd and Lars Vilhuber March 2005.
© John M. Abowd 2005, all rights reserved Sampling Frame Maintenance John M. Abowd February 2005.
© John M. Abowd and Lars Vilhuber 2005, all rights reserved Introduction to Probabilistic Record Linking John M. Abowd and Lars Vilhuber March 2005.
8-1 Copyright ©2011 Pearson Education, Inc. publishing as Prentice Hall Chapter 8 Confidence Interval Estimation Statistics for Managers using Microsoft.
Copyright ©2011 Pearson Education 8-1 Chapter 8 Confidence Interval Estimation Statistics for Managers using Microsoft Excel 6 th Global Edition.
© John M. Abowd and Lars Vilhuber 2005, all rights reserved Record Linking, II John M. Abowd and Lars Vilhuber March 2005.
© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Estimating m and u Probabilities Using EM Based on Winkler 1988 "Using the EM Algorithm for Weight.
Statistics for Managers Using Microsoft Excel, 5e © 2008 Pearson Prentice-Hall, Inc.Chap 8-1 Statistics for Managers Using Microsoft® Excel 5th Edition.
Business Statistics, A First Course (4e) © 2006 Prentice-Hall, Inc. Chap 8-1 Chapter 8 Confidence Interval Estimation Business Statistics, A First Course.
1/49 EF 507 QUANTITATIVE METHODS FOR ECONOMICS AND FINANCE FALL 2008 Chapter 9 Estimation: Additional Topics.
Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 7-1 Chapter 7 Confidence Interval Estimation Statistics for Managers.
Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 7-1 Chapter 7 Confidence Interval Estimation Statistics for Managers.
Analysis of Variance. ANOVA Probably the most popular analysis in psychology Why? Ease of implementation Allows for analysis of several groups at once.
Chapter 10 Hypothesis Testing
Business Statistics: A Decision-Making Approach, 6e © 2005 Prentice-Hall, Inc. Chap 7-1 Business Statistics: A Decision-Making Approach 6 th Edition Chapter.
Confidence Interval Estimation
Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved.
Basic Business Statistics, 11e © 2009 Prentice-Hall, Inc. Chap 8-1 Chapter 8 Confidence Interval Estimation Basic Business Statistics 11 th Edition.
Confidence Interval Estimation
Chap 8-1 Copyright ©2013 Pearson Education, Inc. publishing as Prentice Hall Chapter 8 Confidence Interval Estimation Business Statistics: A First Course.
Basic Business Statistics, 10e © 2006 Prentice-Hall, Inc. Chap 8-1 Chapter 8 Confidence Interval Estimation Basic Business Statistics 11 th Edition.
PROBABILITY (6MTCOAE205) Chapter 6 Estimation. Confidence Intervals Contents of this chapter: Confidence Intervals for the Population Mean, μ when Population.
AP Statistics Chap 10-1 Confidence Intervals. AP Statistics Chap 10-2 Confidence Intervals Population Mean σ Unknown (Lock 6.5) Confidence Intervals Population.
Statistics for Managers Using Microsoft Excel, 5e © 2008 Pearson Prentice-Hall, Inc.Chap 8-1 Statistics for Managers Using Microsoft® Excel 5th Edition.
The relationship between error rates and parameter estimation in the probabilistic record linkage context Tiziana Tuoto, Nicoletta Cibella, Marco Fortini.
A grudge is a heavy thing to carry.. Probability The probability of an event is the proportion of times we would expect the event to occur in an infinitely.
Chapter 7 Sample Variability. Those who jump off a bridge in Paris are in Seine. A backward poet writes inverse. A man's home is his castle, in a manor.
The Conditional Independence Assumption in Probabilistic Record Linkage Methods Stephen Sharp National Records of Scotland Ladywell Road Edinburgh EH12.
Estimating  0 Estimating the proportion of true null hypotheses with the method of moments By Jose M Muino.
Chap 7-1 A Course In Business Statistics, 4th © 2006 Prentice-Hall, Inc. A Course In Business Statistics 4 th Edition Chapter 7 Estimating Population Values.
Basic Business Statistics, 10e © 2006 Prentice-Hall, Inc. Chap 8-1 Confidence Interval Estimation.
© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Record Linking, II John M. Abowd and Lars Vilhuber April 2007.
6.1 Inference for a Single Proportion  Statistical confidence  Confidence intervals  How confidence intervals behave.
Chap 7-1 A Course In Business Statistics, 4th © 2006 Prentice-Hall, Inc. A Course In Business Statistics 4 th Edition Chapter 7 Estimating Population Values.
Chap 8-1 Chapter 8 Confidence Interval Estimation Statistics for Managers Using Microsoft Excel 7 th Edition, Global Edition Copyright ©2014 Pearson Education.
Statistics for Managers Using Microsoft Excel, 5e © 2008 Pearson Prentice-Hall, Inc.Chap 8-1 Statistics for Managers Using Microsoft® Excel 5th Edition.
Business Statistics: A Decision-Making Approach, 6e © 2005 Prentice-Hall, Inc. Chap 7-1 Business Statistics: A Decision-Making Approach 6 th Edition Chapter.
Functions Math library functions Function definition Function invocation Argument passing Scope of an variable Programming 1 DCT 1033.
CHAPTER 7 SAMPLING DISTRIBUTIONS Prem Mann, Introductory Statistics, 7/E Copyright © 2010 John Wiley & Sons. All right reserved.
Department of Quantitative Methods & Information Systems
INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011.
Hypothesis Tests. An Hypothesis is a guess about a situation that can be tested, and the test outcome can be either true or false. –The Null Hypothesis.
Basic Business Statistics, 11e © 2009 Prentice-Hall, Inc. Chap 8-1 Chapter 8 Confidence Interval Estimation Business Statistics: A First Course 5 th Edition.
Statistics for Business and Economics 8 th Edition Chapter 7 Estimation: Single Population Copyright © 2013 Pearson Education, Inc. Publishing as Prentice.
Chapter 8 Confidence Interval Estimation Statistics For Managers 5 th Edition.
Yandell – Econ 216 Chap 8-1 Chapter 8 Confidence Interval Estimation.
CHAPTER 7 SAMPLING DISTRIBUTIONS Prem Mann, Introductory Statistics, 7/E Copyright © 2010 John Wiley & Sons. All right reserved.
Chapter 7 Confidence Interval Estimation
Introduction to Probabilistic Record Linking
Confidence Interval Estimation
Two-Sample Hypothesis Testing
Statistical Process Control
Confidence Interval Estimation
Daniela Stan Raicu School of CTI, DePaul University
Sampling Distributions (§ )
Chapter 9 Hypothesis Testing: Single Population
Chapter 9 Estimation: Additional Topics
Objectives 6.1 Estimating with confidence Statistical confidence
Presentation transcript:

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved An application of probabilistic matching Abowd and Vilhuber (2004), JBES

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved An example Abowd and Vilhuber (2004), JBES: “The Sensitivity of Economic Statistics to Coding Errors in Personal Identifiers” Approx. 500 million records (quarterly wage records for , California) 28 million SSNs

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved 1’s tenure with A: 1’s employment history CodedCoded Name SSN EIN Leslie Kay 1A Leslie Kay 2A Lesly Kai 3B Earnings $10 $10 $11 Separations too high Accessions too high SSN Name editing Example / 1

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved A&V: standardizing Knowledge of structure of the file: -> No standardizing Matching will be within records close in time -> assumed to be similar, no need for standardization BUT: possible false positives -> chose to do an weighted unduplication step (UNDUP) to eliminate wrongly associated SSNs

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved A&V: UNDUP SSNUIDFirstMiddleLastEarnYQ JohnCDoe Q JohnCDoe Q JonCDoe Q RobertELee743993Q1 A UID is a unique combination of SSN-First-Middle-Last A89

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved A&V: UNDUP (2) SSNUIDFirstMiddleLastEarnYQ JohnCDoe Q JohnCDoe Q JonCDoe Q RobertELee743993Q RobertELee743994Q1 Conservative strategy: Err on the side of caution

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Matching Define match blocks Define matching parameters: marginal probabilites Define upper T u and lower T l cutoff values

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Record Blocking Computationally inefficient to compare all possible record pairs Solution: Bring together only record pairs that are LIKELY to match, based on chosen blocking criterion Analogy: SAS merge by-variables

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Blocking example Without blocking: AxB is 1000x1000=1,000,000 pairs With blocking, f.i. on 3-digit ZIP code or first character of last name. Suppose 100 blocks of 10 characters each. Then only 100x(10x10)=10,000 pairs need to be compared.

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved A&V: Variables and Matching File only contains Name, SSN, Earnings, Employer Construct frequency of use of name, work history, earnings deciles Stage 1: use name and frequency Stage 2: use name, earnings decile, work history with employer

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved A&V: Blocking and stages Two stages were chosen: –UNDUP stage (preparation) –MATCH stage (actual matching) Each stage has own –Blocking –Match variables –Parameters

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved A&V: UNDUP blocking No comparisons are ever going to be made outside of the SSN Information about frequency of names may be useful Large amount of records: 57 million UIDs associated with 28 million SSNs, but many SSNs have a unique UID  Blocking on SSN  Separation of files by last two digits of SSN (efficiency)

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved A&V: MATCH blocking Idea is to fit 1-quarter records into work histories with a 1-quarter interruption at same employer  Block on Employer – Quarter  Possibly block on Earnings deciles

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved A&V: MATCH block setup # Pass 1: BLOCK1 CHAR SEIN SEIN BLOCK1 CHAR QUARTER QUARTER BLOCK1 CHAR WAGEQANT WAGEQANT # follow 3 other BLOCK passes with identical setup # # Pass 2: relax the restriction on WAGEQANT BLOCK5 CHAR SEIN SEIN BLOCK5 CHAR QUARTER QUARTER # follow 3 other BLOCK passes with identical setup

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Determination of match variables Must contain relevant information Must be informative (distinguishing power!) May not be on original file, but can be constructed (frequency, history information)

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved A&V: UNDUP match variables # Pass1 MATCH1 NAME_UNCERT namef MATCH1 NAME_UNCERT namel MATCH1 NAME_UNCERT namem MATCH1 NAME_UNCERT concat # Pass 2 MATCH2 ARRAY NAME_UNCERT fm_name MATCH2 NAME_UNCERT namel MATCH2 NAME_UNCERT concat # and so on…

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved A&V: MATCH match variables # Pass1 MATCH1 CNT_DIFF SSN SSN MATCH1 NAME_UNCERT namef namef MATCH1 NAME_UNCERT namel namem MATCH1 NAME_UNCERT namel namel # Pass 2 MATCH2 CNT_DIFF SSN SSN MATCH2 NAME_UNCERT concat concat # Pass 3 MATCH3 UNCERT SSN SSN MATCH3 NAME_UNCERT namef namef MATCH3 NAME_UNCERT namem namem MATCH3 NAME_UNCERT namel namel and so on…

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Adjusting P(agree|M) for relative frequency Further adjustment can be made by adjusting for relative frequency (idea goes back to Newcombe (1959) and F&S (1969)) –Agreement of last name by Smith counts for less than agreement by Vilhuber Default option for some software packages Requires strong assumption about independence between agreement on specific value states on one field and agreement on other fields.

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved A&V: Frequency adjustment UNDUP: –none specified MATCH: –allow for name info, –disallow for wage quantiles, SSN

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Marginal probabilities: better estimates of P(agree|U) P(agree | U) can be improved by computing random agreement weights between files α(A) and β(B) (i.e. AxB) –# pairs agreeing randomly by variable X divided by total number of pairs

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Error rate estimation methods Sampling and clerical review –Within L: random sample with follow-up –Within C: since manually processed, “truth” is always known –Within N: Draw random sample with follow-up. Problem: sparse occurrence of true matches Belin-Rubin (1995) method for false match rates –Model the shape of the matching weight distributions (empirical density of R) if sufficiently separated Capture-recapture with different blocking for false non-match rates

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Analyst Review Matcher outputs file of matched pairs in decreasing weight order Examine list to determine cutoff weights and non-matches.

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved A&V: Finding cutoff values UNDUP: –CUTOFF –CUTOFF2 8 8 –Etc. MATCH: –CUTOFF –CUTOFF –CUTOFF –Etc.

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved A&V: Simulated matcher output

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Post-processing Once matching software has identified matches, further processing may be needed: –Clean up –Carrying forward matching information –Reports on match rates

© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Generic workflow (2) Start with initial set of parameter values Run matching programs Review moderate sample of match results Modify parameter values (typically only m k ) via ad hoc means