Efficient modelling of record linked data: a missing data perspective
Harvey Goldstein
Record Linkage Methodology Research Group, Institute of Child Health, University College London

Record Linkage

Consider two data files: a file of interest (FOI) and a linking data file (LDF), and assume for simplicity that all cases in the FOI also exist in the LDF. This is readily extended to multiple LDFs and to cases missing from the LDF, as in a deaths file.

We want variables of interest (VOI) from the LDF to add to FOI records, and we have a set of matching variables (MV) that enable us to link records in each file.

Deterministic matching relies on a unique (and error-free) combination of MV values having a one-to-one relationship from the FOI to the LDF. Probabilistic record matching arises in the common case when this cannot be assumed, e.g. through misspelling of names or transcription errors, resulting in many possible matches. These are traditionally assigned 'matching weights'.
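The deterministic case above can be sketched in a few lines. This is a minimal illustration with made-up names and values (the data, field choices, and function name are assumptions, not from the talk): the MV combination acts as a unique key, and the VOI is copied across only on an exact match.

```python
# Hypothetical example: deterministic matching attaches a VOI from the
# LDF to each FOI record only when the matching-variable (MV)
# combination is identical in both files.

foi = [  # file of interest: MV tuples (surname, date of birth)
    ("smith", "1990-01-01"),
    ("jones", "1985-06-30"),
]
ldf = {  # linking data file: MV tuple -> variable of interest (VOI)
    ("smith", "1990-01-01"): 72.5,
    ("jones", "1985-06-30"): 68.0,
    ("brown", "1972-03-15"): 80.1,
}

def deterministic_link(foi, ldf):
    """Pair each FOI record with the LDF value for its exact MV key.

    Returns None as the VOI where no exact match exists."""
    return [(mv, ldf.get(mv)) for mv in foi]

linked = deterministic_link(foi, ldf)
print(linked)  # both FOI records here have an exact, unique match
```

A single misspelt surname would break the exact-key lookup, which is precisely the situation that motivates the probabilistic weights discussed next.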

Think of it as a missing data problem

The extended FOI contains two sets of variables: set A, where all values are missing (they reside in the LDF), and set B, which is available, with possible holes (0 = missing, X = observed):

    Set A variables    Set B variables
    0 0                X X X
    0 0                X 0 X
    0 0                X X 0
    0 0                0 0 X

The research problem is to change the 0s to Xs. This is a particular case of missing data, and our approach is to use an extension of Multiple Imputation (MI) techniques. The focus is on data analysis.
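The 0/X diagram above can be rendered directly as a data structure. This is a sketch only, using NaN for the 0s (missing) and arbitrary illustrative floats for the Xs (observed); the column split into two set A and three set B variables follows the diagram:

```python
import math

# Extended FOI from the slide's diagram: the first two columns are the
# set A variables (all missing, i.e. held in the LDF); the last three
# are the set B variables, observed with occasional holes.
NA = math.nan
extended_foi = [
    #  set A      set B
    [NA, NA,  1.0, 2.0, 3.0],
    [NA, NA,  4.0, NA,  5.0],
    [NA, NA,  6.0, 7.0, NA ],
    [NA, NA,  NA,  NA,  8.0],
]

# Set A is 100% missing, which is why MI cannot be applied to it
# directly: there is no observed value of any set A variable to
# estimate an imputation model from.
set_a_all_missing = all(
    math.isnan(row[j]) for row in extended_foi for j in (0, 1)
)
print(set_a_all_missing)  # True
```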

Applying MI to the extended FOI

We cannot directly use MI for set A, since all of its values are missing. So: consider filling in some of them with certainty from an LDF, or indeed from anywhere we can get them. We now have something like this, where the first record has no definite match:

    Set A    Set B
    0 0      X X X
    0 X      X 0 X
    X X      X X 0
    X X      0 0 X

Note, e.g., that some of the imported values may themselves be missing. At this point we might choose simply to impute the remaining missing data, since we now have information to do this. This can often produce acceptable estimates. Can we do better by using probabilistic importation of data values?

Probabilistic record matching as it exists

First we ascertain and link the 'certain' records and consider the residue. The probabilistic matching method produces a weight for each LDF record, for each remaining FOI record. If the maximum of these over the LDF records is greater than a chosen threshold, a match is accepted for the LDF individual corresponding to this maximum value:

- Consider each FOI record with a given set of MV values (g).
- E.g. for 3 binary matching variables we may observe a pattern g = {1, 0, 1}, indicating {match, no match, match}.
- We compute the probability of observing that pattern of MV values:
  (a) given that it is a match, P(g|M);
  (b) given that it is not a match, P(g|NM).
- Then compute R = P(g|M)/P(g|NM) and W = log(R) if there is agreement, or R = (1 - P(g|M))/(1 - P(g|NM)) if there is no agreement.
- R can be obtained via, e.g., training data, or otherwise estimated using existing data where matching status is known.
- The cut-off threshold for W to accept a match has to be chosen, for example, to minimise the percentage of 'false positives'.
- In practice W is computed for each MV and then summed, with the cut-off based on the sum. This is essentially an 'independence' assumption.
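The weight calculation in the bullets above can be sketched as follows. The m- and u-probabilities here are illustrative numbers chosen for the example, not values from the talk; per-variable weights are summed under the independence assumption just described:

```python
import math

# Agreement-weight calculation for a candidate record pair.
# For each matching variable j:
#   m[j] = P(agreement on j | records are a true match)   = P(g_j|M)
#   u[j] = P(agreement on j | records are not a match)    = P(g_j|NM)
# The contribution is log(m/u) on agreement and
# log((1 - m)/(1 - u)) on disagreement; contributions are summed.

m = [0.95, 0.90, 0.85]  # e.g. surname, date of birth, postcode
u = [0.01, 0.05, 0.10]  # chance agreement among non-matches

def match_weight(g, m, u):
    """Summed log-likelihood-ratio weight W for agreement pattern g."""
    total = 0.0
    for agree, mj, uj in zip(g, m, u):
        if agree:
            total += math.log(mj / uj)
        else:
            total += math.log((1 - mj) / (1 - uj))
    return total

g = [1, 0, 1]  # the slide's pattern: {match, no match, match}
print(match_weight(g, m, u))  # positive total favours a match
```

A record pair is then accepted as a match when its summed W exceeds the chosen cut-off, with the cut-off tuned against the tolerable false-positive rate.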

Probabilistic matching - problems

A threshold has to be chosen, so some possible matches are therefore rejected. Even if the threshold is high, some chosen matches will be wrong, and these 'measurement errors' should be carried through to the analysis, but typically are not (Jaro, M. (1995). "Probabilistic linkage of large public health data files." Statistics in Medicine 14). What we really want, for data analysis purposes, is not to carry the record but the LDF data values, the VOI. So consider the following:

Extending the probabilistic matching model

If we can assign, for each 'candidate' record in the LDF, a probability that it is the correct match, then we can adapt our imputation by treating these probabilities as constituting a 'prior' distribution. Formally, we combine the imputation likelihood for the missing set A variables with the prior for each candidate record to form a posterior distribution over these records, from which we choose the largest. We can also choose a lower threshold so that if no posterior probability exceeds it, standard MI is used. We can condition on the matching variables, as well as on all other variables in the model of interest (MOI), in obtaining the imputation likelihood. This is especially useful when the probability of a correct match depends on the values of the LDF variables.
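The prior-times-likelihood step above can be sketched with made-up numbers (the priors, likelihoods, function name, and 0.95 threshold below are illustrative assumptions): normalise the products to posterior probabilities, accept the best candidate's data values only if it clears a high threshold, and otherwise fall back to standard MI.

```python
# Posterior selection of a candidate LDF record for one FOI record.
# priors[i]      : prior match probability for candidate i (from the
#                  linkage weights)
# likelihoods[i] : imputation likelihood of candidate i's set A values
#                  given the FOI record's observed variables

def posterior_accept(priors, likelihoods, threshold=0.95):
    post = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(post)
    post = [x / total for x in post]          # normalise to a posterior
    best = max(range(len(post)), key=post.__getitem__)
    if post[best] >= threshold:
        return best, post                     # accept this candidate
    return None, post                         # none accepted: use MI

priors = [0.70, 0.25, 0.05]       # from the matching model
likelihoods = [0.90, 0.05, 0.05]  # from the imputation model
best, post = posterior_accept(priors, likelihoods)
print(best, [round(x, 3) for x in post])  # prints: 0 [0.977, 0.019, 0.004]
```

Note how the likelihood sharpens the decision: candidate 0's prior of 0.70 would be well short of a 0.95 cut-off on its own, but combined with its high imputation likelihood its posterior exceeds the threshold.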

Advantages

Combining the prior and the likelihood will tend more often to select the correct record. Some bias will still remain, but it can be minimised, since the threshold for acceptance can be made very high (e.g. a probability of 0.95). At the imputation stage we can condition on auxiliary variables to satisfy the ignorability assumption (missing at random). If elimination of bias is the priority, and a large enough proportion of records can successfully be unequivocally matched, then standard MI can be used.

Implementation and Software

Multiply imputed datasets are produced and model fits combined in the usual way (Rubin's rules). Matlab routines are available, and the new STATJR software at Bristol will develop these and improve efficiency. Currently the routines will handle mixtures of normal and binary variables and also multilevel data. If implemented routinely, this requires ancillary data (allowing matching probabilities to be estimated) from the matching process to be supplied to the data analyst.

- This has important implications for 'third parties' who do the matching, e.g. HSCIC (care.data).

Goldstein, H., Harron, K., and Wade, A. (2012). The analysis of record linked data using multiple imputation with data value priors. Statistics in Medicine. DOI: /sim.5508
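For completeness, the pooling step mentioned above (Rubin's rules) can be sketched as follows; the estimates and variances are illustrative numbers for five imputed-data model fits, not results from the paper:

```python
import math

# Rubin's rules: combine a parameter estimate across m imputed datasets.
# The total variance adds the average within-imputation variance to the
# between-imputation variance, inflated by (1 + 1/m).

def rubins_rules(estimates, variances):
    m = len(estimates)
    qbar = sum(estimates) / m                                # pooled estimate
    ubar = sum(variances) / m                                # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)    # between-imputation variance
    total_var = ubar + (1 + 1 / m) * b
    return qbar, total_var

est = [1.02, 0.98, 1.05, 0.95, 1.00]   # parameter from 5 imputed fits
var = [0.04, 0.05, 0.04, 0.05, 0.04]   # their squared standard errors
qbar, tvar = rubins_rules(est, var)
print(round(qbar, 3), round(math.sqrt(tvar), 3))  # prints: 1.0 0.214
```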