POTENTIALS OF FOR DATA LINKAGE

Slides:



Advertisements
Similar presentations
Efficient modelling of record linked data A missing data perspective Harvey Goldstein Record Linkage Methodology Research Group Institute of Child Health.
Advertisements

The Linked PDD-Death Product More than you want to know David Zingmond, MD, PhD Division of General Internal and Health Services Research UCLA School of.
Wisconsin Department of Health Services Richard Miller Research Scientist Wisconsin Office of Health Informatics October 28, 2014 Matching Traffic Crash.
Linked Data Products Vital Statistics Death/PDD Presenter: Jan Morgan.
Linking Mortality and Inpatient Discharge Records: Comparing Deterministic and Probabilistic Methodologies Richard Miller Office of Health Informatics.
Quality assurance -Population and Housing Census Alma Kondi, INSTAT, Albania.
ESSnet DI WP2: Record Linkage Luca Valentino Istat.
Efficient modelling of record linked data A missing data perspective Harvey Goldstein Record Linkage Methodology Research Group Institute of Child Health.
Record Linkage in Stata
Capturing Sensitive Data & Data Linkage. Capturing Sensitive Data Data Protection Act 1998 (Section 33) – Allows data to be used for research purposes.
© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2007.
March 2013 ESSnet DWH - Workshop IV DATA LINKING ASPECTS OF COMBINING DATA INCLUDING OPTIONS FOR VARIOUS HIERARCHIES (S-DWH CONTEXT)
Michigan Newborn Screening & Live Births Records Linkage and Follow-Up of Potentially Un-Screened Infants Steven J. Korzeniewski, MA, MSc, Maternal & Child.
© John M. Abowd and Lars Vilhuber 2005, all rights reserved Introduction to Probabilistic Record Linking John M. Abowd and Lars Vilhuber March 2005.
Improving Data Quality and Quality Assurance in Newborn Screening by Including the Bloodspot Screening Collection Device Serial Number on Birth Certificates.
Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved.
- Darshana Pathak - Dr. Hye-Chung Kum.  Overview  Entity resolution process  About Framework  Configuration file  Class Details  How to …  Future.
All the answers? Statistics New Zealand’s Integrated Data Infrastructure Paper by Felibel Zabala, Rodney Jer, Jamas Enright and Allyson Seyb Presented.
A Survey Based Seminar: Data Cleaning & Uncertain Data Management Speaker: Shawn Yang Supervisor: Dr. Reynold Cheng Prof. David Cheung
Metadata Models in Survey Computing Some Results of MetaNet – WG 2 METIS 2004, Geneva W. Grossmann University of Vienna.
Longitudinal Data Recent Experience and Future Direction August 2012.
The relationship between error rates and parameter estimation in the probabilistic record linkage context Tiziana Tuoto, Nicoletta Cibella, Marco Fortini.
Linking, selecting cut-offs, and examining quality in the Integrated Data Infrastructure (IDI) Laura O’Sullivan Statistics New Zealand
Biolink NL A national infrastructure for linkage of biobanks to medical and socioeconomic registries Adelaide Ariel SHIP Conference 28th-30th August 2013.
The 2011 Census: Estimating the Population Alexa Courtney.
Using administrative data to produce official social statistics New Zealand’s experience.
Linking data resources Paul Lambert, University of Stirling Presentation to the Scottish Civil Society Data Partnership Project (S-CSDP), Webinar 3 on.
Session topic (i) – Editing Administrative and Census data Discussants Orietta Luzi and Heather Wagstaff UNECE Worksession on Statistical Data Editing.
Record linkage approaches in pSCANNER Toan Ong, PhD Assistant Professor Department of Pediatrics University of Colorado, Anschutz Medical Campus.
Looking for statistical twins
Methods for Data-Integration
COMP 135: Human-Computer Interface Design
UNECE Data Integration Project
WEB SCRAPING FOR JOB STATISTICS
CCCApply CCCApply BOG Fee Waiver
Introduction to Probabilistic Record Linking
Dominik Rozkrut Central Statistical Office of Poland
Integrating administrative data – the 2021 Census and beyond
Statistics Netherlands Division Social and Spatial Statistics
COMP Mobile Data Networking
A Network Science Approach to Fake News Detection on Social Media
State Report Processing
ECONOMETRICS ii – spring 2018
Section 3: Sweep implementation
Multi-Sectoral Nutrition Action Planning Training Module
By (Group 17) Mahesha Yelluru Rao Surabhee Sinha Deep Vakharia
Numerical Descriptives in R
Increased Efficiency and Effectiveness
Identifying Worker Characteristics Using LEHD and GIS
Properties 7: Properties Programming C# © 2003 DevelopMentor, Inc.
Data Quality By Suparna Kansakar.
Institutional Framework, Resources and Management
The use of Linked Employer-Employee Data in Maintaining the Statistics New Zealand Business Frame and in producing Business Demographic Statistics Geoff.
Information Society Statistics
United Nations Statistics Division
Explaining the Methodology : steps to take and content to include
LAMAS Working Group June 2017
Overview of Approaches to Register-Based Populating Censuses
Spreadsheets, Modelling & Databases
LAMAS Working Group 29 June-1 July 2016
Modernisation of Statistics Production Stockholm November 2009
Development of a Mobility Demand Model for Private Usage under Non-Urban Conditions Maria Kugler 20th European Conference on Mobility Management – ECOMM.
Andrew Jenkins and Rosalind Levačić
The role of metadata in census data dissemination
Pete Benton , Beyond 2011 Programme Director
Replacing a Survey with Administrative Data
Work Session on Statistical Metadata (Geneva, Switzerland May 2013)
Pnina ZADKA Central Bureau of Statistics Israel
Pnina ZADKA Central Bureau of Statistics Israel
Stephanie Hirner ESTP ”Administrative data and censuses
Presentation transcript:

POTENTIALS OF FOR DATA LINKAGE Radmila Velichkovich MA Survey Methodology and Public Opinion MS Statistics University of Neuchatel, Switzerland SatRday conference, Belgrade, Serbia 27/10/2018

CONTENTS Data linkage: What it is and why is it important? Approaches in data linkage How can R help? Examples

Data linkage: What it is and why is it important? What is data linkage? -merging that brings together information from two or more sources of data Why is it important? -avoidance of new data collection -cost-effective way of additional information -linked data allow analysis not possible on individual datasets -bunch of data ready to be exploit exists: in private/public sector/official statistics/health records/financial records/loyalty data/social media data/ -novel approaches preserve privacy 1) Pre-merge data cleaning 2) Applying data linkage framework 3) evaluating data linkage Example: LEHD; Longitudinal Employer Household Dynamics

Approaches in data linkage Deterministic approach exact agreement on a unique identifier(s) Only id alternative identifiers--such as SSNs, birth dates, and first and last names of individuals (provided it is reliable) Probabilistic approach Connecting datasets without unique identifier(s) -based on partial information existing in the dataset -based on some similarities -connecting two datasets with different units of observation https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5005943/

DETERMINISTIC PROBABILISTIC How can R help? Functions in R merge() match() ___join() in dplyr Blocking and Soundex algorithm (English/German) in RecordLinkage package User-defined function Packages in R RecordLinkage fastLink PPRL (Privacy Preserving Record Linkage) DETERMINISTIC PROBABILISTIC To be updated very soon https://www.gov.scot/resource/0039/00390031.pdf

Example 1:media usage and general consumption data sets linkage Straightforward merging merge(x,y, by=”uid”) Typical settings when different surveys are used in a panel But quality checks needed: e.g.duplicates DEMOGRAPHIC/SOCIO-ECONOMIC/ATTITUDE/BEHAVIOURAL DATA Media usage

Example 2: FastLink package Implements a Fellegi-Sunter probabilistic record linkage model that allows for missing data and the inclusion of auxiliary information. Expectation Maximization algorithm The package implements methods described in Enamorado, Fifield, and Imai (2017): ''Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records''

Example 2: FastLink package

Example 2: FastLink package Many other options exist: 1) Calculate agreement variable by variable: gammapar*() 2) Count unique agreement patterns: tableCounts() 3) Calculate Fellegi-Sunter weights: emlinkMARmov() 4) Find the matches: matchesLink() 5) Dedupe the matches: dedupeMatches()

Example 3: PPRL (in what follows all the credits and many thanks to Prof. Dr. Rainer Schnell (Duisburg-Essen University) for providing the material from BigSurv18 conference) PPRL enables new opportunities to link data that would otherwise not be available for research purposes. Encrypted Statistical Linkage Keys (ESLs) and the use of Bloom filters Efficient for large scale data

PPRL using R By Prof. Dr. Rainer Schnell (Duisburg-Essen University); BigSurv18 conference, Barcelona

References and further readings Package fastLink; Fast Probabilistic Record Linkage with Missing Data Package RecordLinkage The RecordLinkage Package: Detecting Errors in Data; Package PPRL Schnell, Rainer (2015): „Privacy Preserving Record Linkage“. In: Methodological Developments in Data Linkage. Hrsg. von Katie Harron/Harvey Goldstein/Chris Dibben. Chichester: Wiley: 201–225.