Blindfolded Record Linkage Presented by Gautam Sanka Susan C. Weber, Henry Lowe, Amar Das, Todd Ferris.

Presentation transcript:

Introduction and Objectives
- Challenges
  - Patient privacy vs. building cross-site records
- Solutions
  - Mandate that identifiers be disclosed
    - Privacy officers find this unacceptable
  - Keep only de-identified information in the registry, but share with third parties an algorithm for generating an anonymous identifier

De-identification Explained
- The anonymous identifier is created in such a way that the probability of the same identifier being generated at two different sites is:
  - High for the same person
  - Low for different people

What can be used?
- Using SSN: a bad idea
- Using names and DOB may seem best, but:
  - Nicknames at one site and full names at another
  - Misspellings
  - Different titles (Mr., Ms., Mrs.)

Goal of Project
- Breast cancer patients at PAMF (Palo Alto Medical Foundation) and Stanford University Medical Center
- Merge the data in de-identified form, under HIPAA and with IRB approval

Interesting Approaches
- Bigrams
  - For the names Ann and Anne:
    - Ann: [AN, NN]
    - Anne: [AN, NN, NE]
  - The Dice coefficient is 2 * (2/5) = 4/5, i.e. twice the 2 shared bigrams divided by the 5 bigrams in total
- Bloom filter
- Neither was implemented because of the complexity involved
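The bigram comparison can be sketched in a few lines of Python. This is only an illustration of the Dice coefficient on the Ann/Anne example; the function names `bigrams` and `dice_coefficient` are hypothetical, and treating the bigrams as a set (so repeated bigrams count once) is an assumption, not something the slides specify.

```python
def bigrams(name: str) -> set[str]:
    """Return the set of adjacent letter pairs in an upper-cased name."""
    s = name.strip().upper()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_coefficient(a: str, b: str) -> float:
    """Dice coefficient: 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Assumption: names too short to form any bigram are treated as non-matching.
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


print(sorted(bigrams("Ann")))           # ['AN', 'NN']
print(sorted(bigrams("Anne")))          # ['AN', 'NE', 'NN']
print(dice_coefficient("Ann", "Anne"))  # 0.8
```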

- A single SHA-1 string was constructed based on:
  - Gender
  - DOB
  - Zip
  - Three-letter prefix of last name
- In their case, only the first two letters of patients' first and last names were used

Composite Identifier
- The authors felt that a combination of DOB and the first two letters of the first and last names would uniquely identify a patient
- Most applicable when:
  - Compliance restrictions preclude the exchange of actual identifiers
  - Total number of comparisons is less than 10^8
  - Names and DOB are easily available
  - DOB has a low error rate
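A minimal sketch of how such a composite identifier might be built, assuming the two-letter name prefixes and the DOB are concatenated and hashed with SHA-1. The function name `composite_identifier`, the upper-casing, the `|` separator, and the ISO date format are all assumptions made for illustration; the slides do not state the exact normalization or field order.

```python
import hashlib
from datetime import date


def composite_identifier(first_name: str, last_name: str, dob: date) -> str:
    """Hash the first two letters of each name plus the DOB (hypothetical layout)."""
    # Keep letters only and upper-case them, so "smith" and " Smith" agree.
    first = "".join(ch for ch in first_name.upper() if ch.isalpha())[:2]
    last = "".join(ch for ch in last_name.upper() if ch.isalpha())[:2]
    raw = f"{first}|{last}|{dob.isoformat()}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


# A nickname-style variation still produces the same identifier,
# because only the two-letter prefixes are used.
a = composite_identifier("Ann", "Smith", date(1950, 1, 2))
b = composite_identifier("Anne", "smith", date(1950, 1, 2))
print(a == b)  # True
```

Because both sites run the same function locally, only the resulting hex strings need to be exchanged, never the names or dates themselves.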

Methods
- Measured the rate of false positives in the data
- Dropped name prefixes
- Dropped DOB values of 1/1/1900 and 1/1/1901
- Performed a self-join on three data sets of 1.5M rows, 0.5M rows, and 10,000 rows
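The self-join idea can be illustrated as follows: within a single data set where every row is assumed to be a distinct person, any two rows that share a composite identifier are a false positive. The names `colliding_pairs` and `specificity`, and the `(patient_id, composite_id)` row layout, are hypothetical; this is a sketch of the general technique, not the authors' actual query.

```python
from collections import defaultdict
from itertools import combinations


def colliding_pairs(records):
    """Self-join: pairs of distinct patients that share a composite identifier."""
    by_id = defaultdict(set)
    for patient_id, composite_id in records:
        by_id[composite_id].add(patient_id)
    pairs = []
    for patients in by_id.values():
        pairs.extend(combinations(sorted(patients), 2))
    return pairs


def specificity(records):
    """1 - FP / (all pairs of distinct patients), assuming every patient is unique."""
    n = len({patient_id for patient_id, _ in records})
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0:
        return 1.0
    return 1 - len(colliding_pairs(records)) / total_pairs


rows = [(1, "a"), (2, "a"), (3, "b"), (4, "c")]
print(colliding_pairs(rows))  # [(1, 2)]
print(specificity(rows))      # 1 - 1/6
```

Grouping by identifier avoids materializing every pairwise comparison, which matters once the data set reaches the million-row sizes mentioned above.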

Specificity based on Data Set Size

- Measuring false negatives
  - Both sites exchanged cryptographic hashes based on SSNs
  - The number of matches found by matching SSNs but not by composite identifiers became the lower bound for false negatives
  - All false positives were removed based on real identifiers
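The lower-bound calculation amounts to a set comparison. The sketch below assumes each site can be represented as a mapping from SSN hash to composite identifier; that layout and the function name `missed_by_composite` are hypothetical. SSN matches may themselves be wrong, which is why the slides describe a separate clean-up against real identifiers.

```python
def missed_by_composite(site_a: dict, site_b: dict) -> list:
    """SSN hashes present at both sites whose composite identifiers disagree.

    Each argument maps an SSN hash to that patient's composite identifier.
    The length of the result is a lower bound on composite false negatives.
    """
    shared = site_a.keys() & site_b.keys()
    return sorted(h for h in shared if site_a[h] != site_b[h])


site_a = {"s1": "c1", "s2": "c2", "s3": "c3"}
site_b = {"s1": "c1", "s2": "cX", "s4": "c4"}
print(missed_by_composite(site_a, site_b))  # ['s2']
```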

PAMF 8,166 Stanford 10, Common Patients

- Total found by composite identifier: 2028
- Exact matches in names + DOB: 1824
- Confirmed later by full identifiers: 204

"This was a very interesting result in that it provided us with a measure of how much better our approach is compared to using full names rather than two-letter prefixes."

Reasons for False Negatives in Composite Identification Found by SSN and later confirmed manually

Simply Using SSN
- SSNs found only 1806 out of 2028
- The rate of false negatives is 10% higher than with the composite identifier
- Reasons
  - 172 of the 222 false negatives had a missing SSN
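The figures on this slide can be checked with a few lines of arithmetic. The variable names are illustrative, and reading the 10% figure as the share of the 2028 matches that SSN matching missed is an assumption about what the slide means.

```python
composite_total = 2028   # matches found by the composite identifier
ssn_found = 1806         # of those, matches also found by SSN
missing_ssn = 172        # SSN misses explained by a missing SSN

ssn_missed = composite_total - ssn_found
miss_rate = ssn_missed / composite_total
unexplained = ssn_missed - missing_ssn

print(ssn_missed)           # 222
print(round(miss_rate, 3))  # 0.109
print(unexplained)          # 50
```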

What about the other 50?

In conclusion:
- 57 false positives for SSN matches
- 3 false positives for the composite identifier
- SSN matching is roughly 20 times worse (57 / 3 = 19)

Which identifiers are best?

When should we use this tool?
- Most useful where privacy policies preclude the full exchange of the identifiers required by more sophisticated and sensitive linkage algorithms
- For data sets of high quality, this approach (in comparison to complex algorithms) is:
  - Easy to explain
  - Compliant with the minimum rules set by HIPAA
  - Faster and less cumbersome

Suggestions