Record Linkage in Stata


Record Linkage in Stata
Presented at: North American Stata Users Group Meeting, August 13, 2007, Boston, MA
Presented by: Michael Blasnik, M. Blasnik & Associates

What is Record Linkage?
Record linkage involves matching records from two different data files that do not share a unique and reliable key field.
  Jon Smithfield, 314 Longwood Drive, Boston
  John Smithfeld, 314 Longword Dr, Boston
People are good at assessing matches, but the underlying logic is tricky to codify.

Record Linkage Methods
Rules-based
  Enumerate a set of rules for how to decide on matches
  Unwieldy; a custom solution is needed for each problem
Probabilistic record linkage
  Fellegi and Sunter, "A Theory for Record Linkage," JASA Vol. 64 (1969), 1183-1210
  Developed a framework based on concepts like P(xvar matches | true record match) and P(xvar matches | records not a match) to assess the overall probability of a match between records, based on matching/mismatching of multiple variables
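The two conditional probabilities above are commonly called m = P(xvar matches | true record match) and u = P(xvar matches | records not a match). A minimal Python sketch of the usual log-ratio weights built from them follows. The function names and the m/u values are hypothetical, chosen only for illustration, and this is not a description of how reclink computes its scores.

```python
import math

def agreement_weight(m: float, u: float) -> float:
    """Weight added when a variable agrees: log2(m / u)."""
    return math.log2(m / u)

def disagreement_weight(m: float, u: float) -> float:
    """Weight added when a variable disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))

def record_weight(agreements: dict, probs: dict) -> float:
    """Sum per-variable weights for one candidate record pair.

    agreements maps variable name -> True/False (did the values agree?)
    probs maps variable name -> (m, u)
    """
    total = 0.0
    for var, agrees in agreements.items():
        m, u = probs[var]
        total += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return total

# Hypothetical m/u values: phone numbers rarely agree by chance, cities often do.
PROBS = {"phone": (0.90, 0.001), "city": (0.95, 0.30)}

if __name__ == "__main__":
    print(agreement_weight(*PROBS["phone"]))  # large positive weight
    print(agreement_weight(*PROBS["city"]))   # small positive weight
    print(record_weight({"phone": True, "city": False}, PROBS))
```

Agreement on a variable that rarely agrees by chance (small u) earns a large positive weight, while disagreement on a usually reliable variable (large m) earns a negative one.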

Record Linkage Issues
Assigning match/non-match probabilities
  By variable and/or observation, based on uniqueness or judgment
  Greater weight to matching on phone # than city
  Greater weight to matching Blasnik than Smith
Imprecise matches for strings
  String comparators assess degree of matching
  Edit distance: # of edits needed to transform one string into the other
  Bigram: proportion of 2-character substrings in common
Speed problem
  N1*N2 comparisons required for exhaustive search (like nnmatch)
  Most systems employ "or-blocking" and/or multiple passes
  Only select records that match on at least one variable
  Can also employ a large indexed bigram tree for large repeated searches
Some matches unclear
  3 groups: matched, unmatched, maybe matched but needs review
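The two string comparators named above can be sketched in a few lines of Python. The edit distance here is the standard Levenshtein distance. The bigram score uses one common variant (the Dice coefficient over sets of bigrams), which is an assumption: the exact formula used by any particular package, including reclink, may differ. Function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def bigrams(s: str) -> set:
    """Set of 2-character substrings, case-insensitive."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over bigram sets: 2 * |shared| / (|A| + |B|)."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        # Strings too short to have bigrams: fall back to exact comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

if __name__ == "__main__":
    for left, right in [("Jon", "John"),
                        ("Smithfield", "Smithfeld"),
                        ("Longwood", "Longword")]:
        print(left, right, edit_distance(left, right),
              round(bigram_similarity(left, right), 4))
```

Each of the name and street pairs from the earlier example differs by a single edit, which is why both comparators rate them as close.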

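A few of the preparation steps above can be sketched as small Python helpers. Everything here is illustrative: the abbreviation table is a tiny hypothetical sample, the function names are made up, and real address standardization (for example with mass-mailing software) handles far more cases.

```python
import re

# Hypothetical, deliberately small abbreviation table.
ABBREVIATIONS = {"street": "st", "drive": "dr", "avenue": "ave",
                 "road": "rd", "junior": "jr"}

def standardize(text: str) -> str:
    """Lowercase, strip punctuation, and apply standard abbreviations."""
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def split_house_number(address: str) -> tuple:
    """Separate a leading house number from the rest of the address."""
    m = re.match(r"\s*(\d+)\s+(.*)", address)
    if m:
        return m.group(1), m.group(2)
    return "", address.strip()

def zip5(zipcode: str) -> str:
    """Reduce a 5- or 9-digit zip code to its first 5 digits."""
    digits = re.sub(r"\D", "", zipcode)
    return digits[:5] if len(digits) >= 5 else ""

def split_phone(phone: str) -> tuple:
    """Return (area_code, local_number); area code is empty if missing."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 10:
        return digits[:3], digits[3:]
    return "", digits

if __name__ == "__main__":
    print(standardize("314 Longwood Drive"))
    print(split_house_number("314 Longwood Drive"))
    print(zip5("02115-1234"))
    print(split_phone("(617) 555-0100"))
```

Keeping the house number and area code in their own fields lets each be compared separately, so an error or a missing value in one part does not drag down the score for the other.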

Stand-alone Linkage Systems
Some free record linkage software
  Link Plus: free software from the US CDC, designed for working with cancer registries but can be used more widely
  Febrl: open-source application in Python / C by Australian National University

reclink.ado
Stata ado file to implement basic record linkage
  User-assigned match and non-match weights per variable
  Or-blocking: allowed, automatic if >= 4 variables
  And-blocking: required exact matches may be specified
  Bigram string comparator (option to override)
  User-assignable matching threshold, default = 0.6
  Proportional match/non-match weighting
  Multiple passes facilitated with the -exclude- option
  Implemented in Stata 8.2 ado code; speed benefit from moving to Mata?

reclink: Typical Usage
  reclink ssn fname lname address phone using fapdata, wmatch(10 3 6 20 10) wnomatch(7 5 5 15 2) idmaster(trackid) idusing(fapid) gen(fapscore)
Compares datasets based on 5 variables
Since # vars > 3, uses or-blocking to only assess using-data obs that match perfectly on at least one variable; orblock(none) would override this
By default, bigrams are used with a minimum match threshold of 0.6
Match weights are highest for address and ssn
Non-match weight = 2 for phone since #s change, but match weight = 10
Resulting dataset contains all original obs with a -joinby- to using-dataset observations with best scores, if score > 0.8 default minimum
_merge and fapscore (holding the matching score) are created
New variables also created: Ussn, Ufname, etc.
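The or-blocking idea described above (only score using-data observations that match perfectly on at least one variable) can be sketched in Python as follows. This is an illustration of the general technique under assumed data structures, not reclink's actual implementation; the function names and the dictionary layout are hypothetical.

```python
from collections import defaultdict

def build_indexes(using_records: dict, block_vars: list) -> dict:
    """For each blocking variable, map each value to the set of
    using-data record ids that carry that value."""
    indexes = {var: defaultdict(set) for var in block_vars}
    for rec_id, rec in using_records.items():
        for var in block_vars:
            value = rec.get(var)
            if value:  # skip missing values
                indexes[var][value].add(rec_id)
    return indexes

def candidates(master_rec: dict, indexes: dict) -> set:
    """Union of using-data ids that match exactly on at least one variable."""
    found = set()
    for var, index in indexes.items():
        value = master_rec.get(var)
        if value:
            found |= index.get(value, set())
    return found

if __name__ == "__main__":
    using = {
        1002: {"fname": "John", "lname": "Smithfeld", "city": "Boston"},
        1003: {"fname": "Mary", "lname": "Jones", "city": "Chicago"},
    }
    master = {"fname": "Jon", "lname": "Smithfield", "city": "Boston"}
    idx = build_indexes(using, ["fname", "lname", "city"])
    print(candidates(master, idx))
```

Only the candidate ids returned here would go on to the more expensive string comparison, which avoids most of the N1*N2 comparisons of an exhaustive search.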

Example
mydata.dta
  fname, lname, address, city, idm
  Jon, Smithfield, 314 Longwood Drive, Boston, 101
ludata.dta
  fname, lname, address, city, idlu
  John, Smithfeld, 314 Longword Dr, Boston, 1002
Command
  reclink address fname lname city using ludata, gen(myscore) idmaster(idm) idusing(idlu)
They match! myscore = 0.9964, _merge = 3, all fields in both shown
Can then drop U* after browsing to confirm the match
The help file explains more details about setting options and specifying weights