Synthetic Approaches to Data Linkage Mark Elliot, University of Manchester Jerry Reiter Duke University Cathie Marsh Centre.

Presentation transcript:

Synthetic Approaches to Data Linkage Mark Elliot, University of Manchester Jerry Reiter Duke University Cathie Marsh Centre for Census and Survey Research & Social Statistics Discipline Area University of Manchester

Outline Initial thoughts – no empirical data yet. –What does record linkage do? –What does synthetic data generation do? –What might synthetic linkage be able to do?

Data Linkage Joining together two sets of data to produce a single set. Record Linkage –Linkage of rows (cases) in K datasets to increase the number of columns (variables). K usually equals 2.
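As a minimal sketch of what record linkage does when a clean identifier exists, the following joins the rows of two small datasets on a shared key so that each linked case gains extra columns. The function name `link_records`, the key `id`, and the toy values are all hypothetical, chosen only for illustration.

```python
def link_records(left, right, key):
    """Join rows of `left` and `right` that share the same value of `key`."""
    # Index the right-hand dataset by key so each lookup is cheap.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    linked = []
    for row in left:
        for match in index.get(row[key], []):
            merged = dict(row)    # columns from the left dataset
            merged.update(match)  # plus the extra columns from the right dataset
            linked.append(merged)
    return linked


# Hypothetical toy datasets (K = 2) sharing the identifier "id".
d1 = [{"id": 1, "age": 34}, {"id": 2, "age": 51}, {"id": 3, "age": 27}]
d2 = [{"id": 1, "income": 28000}, {"id": 3, "income": 41000}]

linked = link_records(d1, d2, "id")
```

Only cases present in both datasets are linked here; the case with `id` 2 has no partner in `d2` and is dropped from the result.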

Data linkage: why? Gives additional information –To address complex research questions –Allows longitudinal analysis To check accuracy and reliability of a data source To fill in missing information in a data source To reduce respondent burden and costs of surveys To enhance survey quality –understand survey non-response

Record Linkage Issues Problems –Data divergence between sources will tend to produce false positives and false negatives. –It is difficult to estimate the frequency of false positives and negatives. –It is difficult to estimate the effect of false confirmed matches on any estimands.

Record Linkage Issues Problems –Linkage requires a good identifier key. –This restricts the instances where multiple datasets can be used.

Record Linkage Issues Problems –There is no well-established solution for dealing with multiple datasets (K > 2). Weighting? –The most commonly tried solution is chaining, starting from the most reliable dataset.

Data Synthesis Rubin (1993, JOS): create multiple, fully synthetic datasets for public release so that: No unit in released data has sensitive data from actual unit in population Released data look like actual data Statistical procedures valid for original data are valid for released data

Generating fully synthetic data Randomly sample new units from frame (can use simple random samples) Impute survey variables for new units using models fit from observed data Repeat multiple times and release m datasets

Generating fully synthetic data: A simple example Suppose data contain sex and height for 600 women and 400 men Heights normally distributed within sex Sample mean and variance are:

Simple example: Create released data of size 2000 Suppose population is 51% women and 49% men Randomly sample 2000 sexes with Pr(woman) = 0.51 For all 2000, simulate height from the normal distribution fitted to the sampled sex
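The simple example can be sketched in a few lines of Python. The sample means and variances did not survive in the transcript, so the values in `params` below are hypothetical placeholders, and the function name `synthesize_heights` is an assumption made for illustration. This plug-in version also treats the estimated parameters as fixed; a fuller treatment would typically draw the parameters from their posterior distribution before simulating each released dataset.

```python
import random
import statistics


def synthesize_heights(n, p_woman, params, rng):
    """Sample n synthetic (sex, height) records.

    params maps sex -> (mean, variance) estimated from the observed data.
    """
    records = []
    for _ in range(n):
        sex = "F" if rng.random() < p_woman else "M"
        mean, var = params[sex]
        # random.gauss takes a standard deviation, not a variance.
        records.append((sex, rng.gauss(mean, var ** 0.5)))
    return records


# Hypothetical sample means (cm) and variances; NOT the values from the slides.
params = {"F": (162.0, 36.0), "M": (176.0, 49.0)}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
synthetic = synthesize_heights(2000, 0.51, params, rng)

share_women = sum(1 for sex, _ in synthetic if sex == "F") / len(synthetic)
mean_f = statistics.mean(h for sex, h in synthetic if sex == "F")
```

Repeating the call m times with fresh random draws would give the m released datasets described in the generation steps.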

Multivariate Synthesis: Sequential Regression Models Suppose data include survey variables (Y1, Y2, Y3) and some known design variables (X). 1) Randomly sample values of X, say Xnew. 2) Regress Y1 on X using original data. 3) Simulate new values of Y1 from this model for Xnew. 4) Repeat for Y2 given X and Y1, using synthetic Y1 when simulating Y2. 5) Repeat for Y3 given X, Y1 and Y2.
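The sequential regression steps can be sketched as follows, using only the Python standard library. The symbols in the transcript were lost, so the labels here are assumptions: `x` is a single design variable and `y1`, `y2` are two survey variables. The toy data-generating coefficients and the helper names `solve`, `fit_ols` and `simulate` are likewise hypothetical. Ordinary least squares with normal errors stands in for whatever models would be used in practice.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares fit with intercept; returns (coefficients, residual sd)."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [t - sum(b * v for b, v in zip(beta, d)) for d, t in zip(design, y)]
    sd = (sum(e * e for e in resid) / (len(y) - k)) ** 0.5
    return beta, sd


def simulate(model, rows, rng):
    """Draw one synthetic value per row from the fitted regression."""
    beta, sd = model
    return [
        sum(b * v for b, v in zip(beta, [1.0] + list(r))) + rng.gauss(0, sd)
        for r in rows
    ]


rng = random.Random(1)

# Hypothetical "original" data: design variable x, survey variables y1, y2.
x = [rng.uniform(0, 10) for _ in range(500)]
y1 = [2.0 + 1.5 * xi + rng.gauss(0, 1) for xi in x]
y2 = [1.0 + 0.5 * xi + 0.8 * a + rng.gauss(0, 1) for xi, a in zip(x, y1)]

# Step 1: sample new design values (resampling x stands in for the frame).
x_new = [rng.choice(x) for _ in range(500)]

# Steps 2-3: regress y1 on x in the original data, simulate y1 for x_new.
m1 = fit_ols([[xi] for xi in x], y1)
y1_syn = simulate(m1, [[xi] for xi in x_new], rng)

# Step 4: regress y2 on (x, y1) in the original data, then simulate y2
# using the synthetic y1 values as predictors.
m2 = fit_ols([[xi, a] for xi, a in zip(x, y1)], y2)
y2_syn = simulate(m2, [[xi, a] for xi, a in zip(x_new, y1_syn)], rng)
```

A third survey variable would follow the same pattern, conditioning on `x` and on the synthetic values already generated.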

Synthetic linkage Synthesis and linkage both aim to produce a new dataset from old information. Perhaps the approaches can be combined? Can synthesis give us any purchase on the situations that linkage finds difficult?

Synthetic linkage This reformulates the record linkage problem. Rather than: –How can I accurately link the records in d1, d2, ..., dn to produce a new linked dataset dl? We have instead: –How can I populate the empty dataset given the information I have in d1, d2, ..., dn?

Synthetic linkage To populate the empty dataset we will use: –The available data, however many datasets there are –Models of the data generating process

Synthetic linkage project The project aims to assess the utility of the synthetic linkage approach.

Synthetic linkage project Our first task is to specify the use cases –Linkage scenarios –What type of dataset overlap, how many datasets etc. –It is clear that if the approach has any merit it will be most useful in the multiple dataset case.

Synthetic linkage project Exemplar case 1 –d1 consists of variables X, Y –d2 consists of Y, Z –d3 consists of X, Y, Z –d1 and d2 refer to the same population; Y only weakly links them –d3 refers to a related population, for example old census data, which cannot be meaningfully linked to d1 or d2 but can be used to model Z given X and Y

Synthetic linkage project Exemplar case 1 continued –We first use models based on d3 to generate multiple synthetic values of Z for each record in d1, creating d1' –Each record in d2 is then linked to the multiple records in d1' –The resolution of the linkages will be complicated and will involve linear programming.
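A toy sketch of exemplar case 1 is given below. It is only one possible reading of the proposal, not the project's method, and everything specific in it is an assumption: Y is a two-level categorical variable, Z given X and Y is modelled by a separate straight-line fit within each level of Y, the linkage cost is the mean absolute difference between an observed Z and the synthetic Z draws, and a brute-force search over one-to-one assignments stands in for the linear programming step on these very small blocks. The function names and the data-generating process `true_z` are hypothetical.

```python
import itertools
import random


def fit_line(xs, zs):
    """Simple linear regression of z on x; returns (intercept, slope, sd)."""
    n = len(xs)
    mx, mz = sum(xs) / n, sum(zs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (z - mz) for x, z in zip(xs, zs)) / sxx
    intercept = mz - slope * mx
    rss = sum((z - intercept - slope * x) ** 2 for x, z in zip(xs, zs))
    return intercept, slope, (rss / (n - 2)) ** 0.5


def true_z(x, y):
    """Hypothetical data-generating process, used only to build toy data."""
    return (5.0 if y == "a" else 20.0) + 2.0 * x


rng = random.Random(2)

# d3: a related population that records X, Y and Z.
d3 = [(rng.uniform(0, 10), y) for y in "ab" for _ in range(200)]
d3 = [(x, y, true_z(x, y) + rng.gauss(0, 0.5)) for x, y in d3]

# Model Z given X and Y: one fitted line per level of Y.
models = {
    y: fit_line([x for x, yy, _ in d3 if yy == y],
                [z for _, yy, z in d3 if yy == y])
    for y in "ab"
}

# d1 holds (X, Y) and d2 holds (Y, Z) for the same five people.
people = [(1.0, "a"), (5.0, "a"), (9.0, "a"), (2.0, "b"), (8.0, "b")]
d1 = list(people)
d2 = [(y, true_z(x, y) + rng.gauss(0, 0.5)) for x, y in people]

# d1': m synthetic Z values for each record in d1.
m = 5
d1_prime = []
for x, y in d1:
    a, b, sd = models[y]
    d1_prime.append([a + b * x + rng.gauss(0, sd) for _ in range(m)])


def cost(i, j):
    """Mean absolute gap between d2 record j's Z and d1 record i's draws."""
    return sum(abs(d2[j][1] - z) for z in d1_prime[i]) / m


# Within each Y block, choose the one-to-one assignment with minimum cost.
links = {}
for y in "ab":
    rows1 = [i for i, (_, yy) in enumerate(d1) if yy == y]
    rows2 = [j for j, (yy, _) in enumerate(d2) if yy == y]
    best = min(itertools.permutations(rows2),
               key=lambda p: sum(cost(i, j) for i, j in zip(rows1, p)))
    links.update(zip(rows1, best))
```

Here `links` maps each d1 row index to the d2 row index it is matched with. Brute force is only feasible for tiny blocks; for realistic sizes the same minimum-cost assignment would be posed as a linear program or solved with a dedicated assignment algorithm.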

Synthetic linkage project Exemplar case 2 –d1 consists of variables X, Y –d2 consists of Y, Z –d3 consists of X, Z –d1, d2 and d3 refer to the same population –Use d3 to create synthetic values for Z on d1, then link d2 using the same mechanism as in the previous example.

Summary Synthetic record linkage is an alternative approach to record linkage which is worth exploring where: –There are multiple datasets to be linked –The linkage variables are weak