Efficient modelling of record linked data: a missing data perspective
Harvey Goldstein
Record Linkage Methodology Research Group, Institute of Child Health, University College London
and Centre for Multilevel Modelling, University of Bristol

Record linkage
- Consider two data files: the file of interest (FOI) and the linking data file (LDF), and assume for simplicity that all cases in the FOI also exist in the LDF. This readily extends to multiple LDFs and to cases missing from the LDF, as with a deaths file.
- We want variables of interest (VOI) from the LDF to add to the FOI records, and we have a set of matching variables (MV) that enable us to link records across the two files.
- Deterministic matching relies on a unique (and error-free) combination of MV values having a one-to-one relationship from the FOI to the LDF.
- Probabilistic record matching arises in the common case where this cannot be assumed, e.g. because misspelt names or transcription errors produce many possible matches. These candidate matches are traditionally assigned 'matching weights'.
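The deterministic matching stage can be sketched as follows. This is a minimal illustration, not the authors' implementation; the field names (`surname`, `dob`, `voi`) and the dict-of-records representation are hypothetical choices made to keep the example self-contained.

```python
def deterministic_match(foi, ldf, mvs):
    """Link each FOI record to the unique LDF record sharing its MV values."""
    # Index LDF records by their matching-variable key.
    index = {}
    for rec in ldf:
        key = tuple(rec[v] for v in mvs)
        index.setdefault(key, []).append(rec)
    linked = []
    for rec in foi:
        key = tuple(rec[v] for v in mvs)
        candidates = index.get(key, [])
        if len(candidates) == 1:   # unique key: accept as a certain match
            merged = {**rec, **candidates[0]}
        else:                      # 0 or >1 candidates: no certain match
            merged = dict(rec)
        linked.append(merged)
    return linked

foi = [{"surname": "Smith", "dob": "1990-01-01"},
       {"surname": "Smyth", "dob": "1985-03-12"}]
ldf = [{"surname": "Smith", "dob": "1990-01-01", "voi": 3.2},
       {"surname": "Smith", "dob": "1985-03-12", "voi": 1.7}]
out = deterministic_match(foi, ldf, ["surname", "dob"])
# The first FOI record gains the VOI; the misspelt "Smyth" stays unlinked,
# which is exactly the case that motivates probabilistic matching.
```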

Think of it as a missing data problem

The extended FOI contains two sets of variables: set A, where all values are missing (they reside in the LDF), and set B, which is available, with possible holes (0 = missing, X = present):

    Set A variables    Set B variables
    0 0                X X X
    0 0                X 0 X
    0 0                X X 0
    0 0                0 0 X

The research problem is to change the 0s to Xs. This is a particular case of missing data, and our approach is to use an extension of multiple imputation (MI) techniques. The focus is on data analysis.

Applying MI to the extended FOI

We cannot directly use MI for set A, since all of its values are missing. So we first fill in some of them with certainty from the LDF: the deterministic matching stage. We now have something like this, where the first record has no definite match:

    Set A    Set B
    0 0      X X X
    0 X      X 0 X
    X X      X X 0
    X X      0 0 X

Note that some of the imported values may themselves be missing. At this point we might choose simply to use multiple imputation for the remaining missing data, since we now have information with which to do so. This can often produce acceptable estimates, e.g. if the data are MAR. Can we do better by using probabilistic importation of data values?
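The "standard MI" fallback mentioned above can be sketched for a single normal VOI. This is a toy sketch assuming MAR given one observed covariate `x`; a proper MI would also draw the regression parameters from their posterior, which is omitted here to keep the mechanics visible.

```python
import random
random.seed(1)

def impute_once(x_obs, y_obs, x_mis):
    """One imputed dataset: draw each missing y from the fitted conditional model."""
    # Fit y = a + b*x by least squares on the complete cases.
    n = len(x_obs)
    xbar = sum(x_obs) / n
    ybar = sum(y_obs) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(x_obs, y_obs))
         / sum((x - xbar) ** 2 for x in x_obs))
    a = ybar - b * xbar
    resid_sd = (sum((y - a - b * x) ** 2
                    for x, y in zip(x_obs, y_obs)) / (n - 2)) ** 0.5
    # Impute each missing y as a random draw, not just the fitted mean,
    # so that between-imputation variability is preserved.
    return [a + b * x + random.gauss(0, resid_sd) for x in x_mis]

x_obs, y_obs = [1, 2, 3, 4, 5], [1.1, 2.0, 2.9, 4.2, 5.1]
x_mis = [2.5, 6.0]
imputations = [impute_once(x_obs, y_obs, x_mis) for _ in range(5)]  # m = 5 datasets
```

Each of the m completed datasets is then analysed separately and the fits are pooled, as in the final slide.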

Probabilistic record matching as it exists

[Slide body (the standard probabilistic matching formulation) not recoverable from the transcript.]

Probabilistic matching: problems
- A threshold has to be chosen, so some possible matches are rejected.
- Even if the threshold is high, some chosen matches will be wrong, and these 'measurement errors' should be carried through to the analysis, but typically are not (Jaro, M. (1995). Probabilistic linkage of large public health data files. Statistics in Medicine 14).
- What we really want, for data analysis purposes, is not to carry the record but the LDF data values: the VOI. So consider the following.

Extending the probabilistic matching model

If we can assign, to each 'candidate' record in the LDF, a probability that it is the correct match, then we can adapt our imputation by treating these probabilities as constituting a 'prior' distribution. Formally, we combine the imputation likelihood for the missing set A variables with the prior for each candidate record to form a posterior distribution over these records, from which we choose the largest. We can also choose a lower threshold so that, if no posterior probability exceeds it, standard MI is used instead. To obtain MAR we can condition on the matching variables, as well as all other variables in the model of interest (MOI), when forming the imputation likelihood. This is especially useful when the probability of a correct match depends on the values of the LDF variables.
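The selection step described above can be sketched for a single normal VOI. The prior-times-likelihood combination and the high acceptance threshold follow the slides; the candidate list, the normal imputation model, and the 0.95 default are illustrative assumptions, not the paper's exact specification.

```python
import math

def normal_pdf(y, mu, sd):
    """Density of N(mu, sd^2) at y: the imputation likelihood for one VOI value."""
    return math.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def select_record(candidates, mu, sd, threshold=0.95):
    """candidates: list of (prior_match_prob, voi_value) pairs for one FOI record.

    Returns the accepted candidate's VOI value, or None to signal that
    standard MI should be used for this record instead.
    """
    # Posterior weight of each candidate = prior * imputation likelihood.
    posts = [(p * normal_pdf(y, mu, sd), y) for p, y in candidates]
    total = sum(w for w, _ in posts)
    best_w, best_y = max(posts)
    if total > 0 and best_w / total >= threshold:
        return best_y   # accept: import this candidate's data value
    return None         # no candidate is convincing enough; fall back to MI

# Suppose the imputation model predicts mu = 10, sd = 1 for a missing VOI.
# A likely candidate near the prediction is accepted over an outlying one:
choice = select_record([(0.8, 10.1), (0.2, 14.0)], mu=10.0, sd=1.0)  # -> 10.1
```

Note how the likelihood term lets the observed set B data discriminate between candidates that the matching weights alone leave ambiguous.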

Advantages
- Combining prior and likelihood will tend more often to select the correct record.
- Some bias will still remain, but it can be minimised because the threshold for acceptance can be made very high (e.g. a probability of 0.95).
- At the imputation stage we can condition on auxiliary variables to satisfy the ignorability (MAR) assumption.
- If elimination of bias is the priority, and a large enough proportion of records can be unequivocally matched, then standard MI can be used.

Implementation and software
- Multiply imputed datasets are produced and model fits combined in the usual way (Rubin's rules).
- Matlab routines are available, and the new Stat-JR software at Bristol will develop these and improve efficiency. It currently handles mixtures of normal and binary variables, and also multilevel data.
- Routine implementation requires ancillary data from the matching process (allowing matching probabilities to be estimated) to be supplied to the data analyst.
- Privacy-preserving record linkage (PPRL) procedures need to recognise that 'matching probabilities' must be transferred along with the encrypted (hashed) MV values.
- Goldstein, H., Harron, K. and Wade, A. (2012). The analysis of record linked data using multiple imputation with data value priors. Statistics in Medicine.
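Rubin's rules, used above to pool the model fits across the m imputed datasets, can be stated compactly; the estimates and variances below are made-up numbers for illustration.

```python
def rubin_pool(estimates, variances):
    """Pool m per-dataset (estimate, variance) pairs by Rubin's rules."""
    m = len(estimates)
    qbar = sum(estimates) / m                                # pooled point estimate
    ubar = sum(variances) / m                                # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)    # between-imputation variance
    total_var = ubar + (1 + 1 / m) * b                       # total variance
    return qbar, total_var

qbar, tvar = rubin_pool([1.02, 0.98, 1.05, 0.95, 1.00],
                        [0.04, 0.05, 0.04, 0.05, 0.04])
```

The between-imputation term b is what carries the linkage and imputation uncertainty through to the final standard errors, which is precisely the 'measurement error' propagation that conventional probabilistic matching omits.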