Published by Silvia Walker. Modified over 8 years ago.
1
Synthetic Approaches to Data Linkage
Mark Elliot, University of Manchester
Jerry Reiter, Duke University
Mark.Elliot@manchester.ac.uk
Cathie Marsh Centre for Census and Survey Research & Social Statistics Discipline Area, University of Manchester
2
Outline
Initial thoughts – no empirical data yet.
– What does record linkage do?
– What does synthetic data generation do?
– What might synthetic linkage be able to do?
3
Data Linkage
Joining together two sets of data to produce a single set.
Record Linkage
– Linkage of rows (cases) in K datasets to increase the number of columns (variables). K usually equals 2.
4
Data linkage: why?
Gives additional information
– To address complex research questions
– Allows longitudinal analysis
To check accuracy and reliability of a data source
To fill in missing information in a data source
To reduce respondent burden and costs of surveys
To enhance survey quality
– Understand survey non-response
5
Record Linkage Issues
Problems
– Data divergence will tend to produce false positives and false negatives, to a greater or lesser degree.
It is difficult to estimate the frequency of false positives and negatives.
It is difficult to estimate the effect of falsely confirmed matches on any estimands.
6
Record Linkage Issues
Problems
– Linkage requires a good identifier key.
– This restricts the instances where multiple datasets can be used.
7
Record Linkage Issues
Problems
– There is no well-established solution for dealing with multiple datasets (K > 2).
Weighting?
The most commonly tried solution is chaining
– Starting from the most reliable dataset
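The chaining idea mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions, not the presenters' procedure: it assumes exact matching on a single shared key (the field name `id` is hypothetical), assumes the datasets are already ordered from most to least reliable, and simply drops any record whose key matches more than one candidate.

```python
def link_pair(left, right, key):
    """Exact-match link: extend each left record with the columns of the
    single right record that shares its key value."""
    index = {}
    for rec in right:
        index.setdefault(rec[key], []).append(rec)
    linked = []
    for rec in left:
        matches = index.get(rec[key], [])
        if len(matches) == 1:  # assumption: keep only unambiguous links
            # Left (more reliable) values win when both sides carry a column.
            linked.append({**matches[0], **rec})
    return linked


def chain_link(datasets, key):
    """Chain K datasets, assumed ordered from most to least reliable."""
    linked = datasets[0]
    for nxt in datasets[1:]:
        linked = link_pair(linked, nxt, key)
    return linked


# Toy data (hypothetical): three datasets sharing an "id" key.
d1 = [{"id": 1, "x": 10}, {"id": 2, "x": 20}]
d2 = [{"id": 1, "y": "a"}, {"id": 2, "y": "b"}, {"id": 2, "y": "c"}]
d3 = [{"id": 1, "z": True}]
print(chain_link([d1, d2, d3], key="id"))
```

Each step can only lose records, so link errors and ambiguities accumulate along the chain, which is one reason the K > 2 case is awkward.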
8
Data Synthesis
Rubin (1993, JOS): create multiple, fully synthetic datasets for public release so that:
No unit in the released data has sensitive data from an actual unit in the population
Released data look like actual data
Statistical procedures valid for original data are valid for released data
9
Generating fully synthetic data
Randomly sample new units from the frame (can use simple random samples)
Impute survey variables for new units using models fitted to the observed data
Repeat multiple times and release m datasets
10
Generating fully synthetic data: A simple example
Suppose data contain sex and height for 600 women and 400 men
Heights normally distributed within sex
Sample mean and variance are:
11
Simple example: Create released data of size 2000
Suppose population is 51% women and 49% men
Randomly sample 2000 sexes with Pr(woman) = 0.51
For all 2000, simulate height from
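A minimal sketch of this simple example follows. The slide's sample means and variances, and the distribution heights are simulated from, are not shown in this transcript, so the numbers in `HEIGHT_PARAMS` are hypothetical placeholders and the sketch assumes plain normal draws with plug-in parameters. A fuller treatment would also draw the parameters themselves from their posterior before each release.

```python
import random

# Hypothetical placeholders: (mean, standard deviation) of height in cm by sex.
# The slide's actual sample statistics were not recovered.
HEIGHT_PARAMS = {"F": (162.0, 6.5), "M": (176.0, 7.0)}


def synthesize_heights(n, p_woman, params, rng):
    """Draw n synthetic (sex, height) records."""
    records = []
    for _ in range(n):
        sex = "F" if rng.random() < p_woman else "M"
        mean, sd = params[sex]
        # Assumption: height simulated from a normal with plug-in parameters.
        records.append((sex, rng.gauss(mean, sd)))
    return records


def release(m, n, p_woman, params, seed=0):
    """Repeat the synthesis m times to get m datasets for release."""
    rng = random.Random(seed)
    return [synthesize_heights(n, p_woman, params, rng) for _ in range(m)]


datasets = release(m=5, n=2000, p_woman=0.51, params=HEIGHT_PARAMS)
share_women = sum(sex == "F" for sex, _ in datasets[0]) / len(datasets[0])
print(len(datasets), len(datasets[0]), round(share_women, 3))
```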
12
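Multivariate Synthesis: Sequential Regression Models
Suppose data include survey variables and some known design variables
1) Randomly sample new values of the design variables
2) Regress the first survey variable on the design variables using the original data
3) Simulate new values of that survey variable from this model for the sampled design values
4) Repeat for the next survey variable, using the synthetic values of the earlier one when simulating
5) Repeat for the remaining survey variables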
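The sequential steps above can be sketched as follows. This is an illustration under assumptions, not the presenters' code: the names `x` (one design variable), `y1` and `y2` (two survey variables) are hypothetical, new design values are resampled with replacement from the observed ones, each model is an ordinary least-squares regression with normal errors, and parameters are plugged in rather than drawn from a posterior.

```python
import random


def ols(rows, y):
    """Least-squares fit of y on rows (each row includes a leading 1.0).
    Returns (coefficients, residual standard deviation)."""
    p = len(rows[0])
    # Normal equations A b = c with A = X'X and c = X'y.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    c = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda k: abs(a[k][col]))
        a[col], a[piv] = a[piv], a[col]
        c[col], c[piv] = c[piv], c[col]
        for k in range(p):
            if k != col:
                f = a[k][col] / a[col][col]
                a[k] = [u - f * v for u, v in zip(a[k], a[col])]
                c[k] -= f * c[col]
    beta = [c[i] / a[i][i] for i in range(p)]
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    sigma = (sum(e * e for e in resid) / (len(y) - p)) ** 0.5
    return beta, sigma


def synthesize(x, y1, y2, n_new, rng):
    # Step 1: sample new design values (assumption: resample observed x).
    x_new = [rng.choice(x) for _ in range(n_new)]
    # Steps 2-3: regress y1 on x in the original data, then simulate y1.
    b1, s1 = ols([[1.0, xi] for xi in x], y1)
    y1_syn = [b1[0] + b1[1] * xi + rng.gauss(0, s1) for xi in x_new]
    # Step 4: regress y2 on (x, y1) in the original data, then simulate y2
    # using the synthetic y1 values as predictors.
    b2, s2 = ols([[1.0, xi, yi] for xi, yi in zip(x, y1)], y2)
    y2_syn = [b2[0] + b2[1] * xi + b2[2] * yi + rng.gauss(0, s2)
              for xi, yi in zip(x_new, y1_syn)]
    return list(zip(x_new, y1_syn, y2_syn))


# Toy original data (hypothetical).
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(200)]
y1 = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]
y2 = [0.5 * xi - 1.0 * yi + rng.gauss(0, 1) for xi, yi in zip(x, y1)]
synthetic = synthesize(x, y1, y2, n_new=500, rng=rng)
print(len(synthetic))
```

Further survey variables would be handled the same way, each regressed on the design variable and all previously synthesized variables.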
13
Synthetic linkage
Synthesis and linkage both aim to produce a new dataset from old information.
Perhaps the approaches can be combined?
Can synthesis give us any purchase on the situations that linkage finds difficult?
14
Synthetic linkage
This reformulates the record linkage problem.
Rather than:
– How can I accurately link the records in d1, d2, ..., dn to produce a new linked dataset dl?
We have instead:
– How can I populate the empty dataset dl with data, given the information I have in d1, d2, ..., dn?
15
Synthetic linkage
To populate the empty dataset we will use:
– The available data (however many datasets there are)
– Models of the data-generating process
16
Synthetic linkage project The project aims to assess the utility of the synthetic linkage approach.
17
Synthetic linkage project
Our first task is to specify the use cases
– Linkage scenarios
– What type of dataset overlap, how many datasets, etc.
– It is clear that if the approach has any merit it will be most useful in the multiple dataset case.
18
Synthetic linkage project
Exemplar case 1
– d1 consists of variables X, Y
– d2 consists of Y, Z
– d3 consists of X, Y, Z
– d1 and d2 refer to the same population. Y only weakly links them
– d3 refers to a related population, for example old census data, which cannot be meaningfully linked to d1 or d2 but can be used to model Z given X and Y
19
Synthetic linkage project
Exemplar case 1 continued
We first use models based on d3 to generate multiple synthetic values of Z for each record in d1, creating d1'
– Each record in d2 is then linked to the multiple records in d1'
– The resolution of the linkages will be complicated and will involve linear programming.
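A toy sketch of exemplar case 1, under heavy assumptions, may help fix ideas. It assumes X, Y and Z are categorical, models Z given (X, Y) by the empirical distribution within each cell of d3, scores a candidate pair by how often a d1 record's synthetic Z draws agree with the d2 record's Z (zero when Y disagrees), and resolves the links by brute-force search over one-to-one assignments. That search is a stand-in for the linear programming the slide mentions and is feasible only for tiny inputs. All function names and data values are hypothetical.

```python
import itertools
import random
from collections import defaultdict


def fit_cells(d3):
    """Empirical distribution of Z within each (X, Y) cell of d3."""
    cells = defaultdict(list)
    for x, y, z in d3:
        cells[(x, y)].append(z)
    return cells


def synthesize_z(d1, cells, m, rng):
    """Build d1': m synthetic Z draws for every (X, Y) record in d1."""
    return [[rng.choice(cells[(x, y)]) for _ in range(m)] for x, y in d1]


def link_scores(d1, d1_syn, d2):
    """scores[i][j]: share of record i's synthetic Z draws equal to d2
    record j's Z; zero when the two records disagree on Y."""
    return [[sum(z == z2 for z in draws) / len(draws) if y1 == y2 else 0.0
             for (y2, z2) in d2]
            for (_, y1), draws in zip(d1, d1_syn)]


def best_assignment(scores):
    """One-to-one assignment with the highest total score (brute force)."""
    n = len(scores)
    return max(itertools.permutations(range(n)),
               key=lambda perm: sum(scores[i][perm[i]] for i in range(n)))


# Toy data (hypothetical): d3 is the related population with X, Y and Z.
d3 = [("a", "u", "p"), ("b", "u", "q"), ("a", "v", "r"), ("b", "v", "s")] * 5
d1 = [("a", "u"), ("b", "u"), ("a", "v")]   # (X, Y)
d2 = [("u", "q"), ("v", "r"), ("u", "p")]   # (Y, Z)

rng = random.Random(0)
d1_syn = synthesize_z(d1, fit_cells(d3), m=10, rng=rng)
assignment = best_assignment(link_scores(d1, d1_syn, d2))
print(assignment)  # assignment[i] is the d2 index linked to d1 record i
```

In this toy Y alone cannot separate the two records with Y = "u"; the synthetic Z values are what break the tie.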
20
Synthetic linkage project
Exemplar case 2
– d1 consists of variables X, Y
– d2 consists of Y, Z
– d3 consists of X, Z
– d1, d2 and d3 refer to the same population.
– Use d3 to create synthetic values for Z on d1, then link d2 using the same mechanism as in the previous example.
21
Summary
Synthetic record linkage is an alternative approach to record linkage which is worth exploring where:
– There are multiple datasets to be linked
– The linkage variables are weak