When adjusting for bias due to linkage errors: a sensitivity analysis Q2014 Tiziana Tuoto 05/06/2014 Joint work with Loredana Di Consiglio.


Outline of the talk
1. Motivations
2. Linkage errors and total survey error
3. Methodologies for analyses on linked data
4. A sensitivity analysis
5. Concluding remarks and future work
Adjusting for bias due to linkage errors, Tiziana Tuoto – Vienna, June 5° 2014

Why linking and why linkage errors?
Integration of different sources (surveys, administrative lists, registers) has acquired a preeminent role.
The considerable effort spent on linking data is not the final aim of the statistical process.
Whatever statistical analysis is performed on integrated data, it should take into account that a record linkage process is subject to two types of errors:
1. erroneous acceptance of false links
2. rejection of true matches (missed links)

Linkage Errors and Total Survey Error: Biemer (2010)

Linkage Errors and Total Survey Error: Zhang (2012)

Methodologies for analyses on linked data
1965: Neter, Maynes and Ramanathan
1993, 1997: Scheuren and Winkler
2000: Lahiri and Larsen
2009: Chambers, Regression analysis of probability-linked data, Official Statistics Research Series, Vol. 4
2011: Chipperfield, Bishop and Campbell
Chambers (2009) gives a systematic overview of regression analysis of linked data, describes the approaches developed by Neter et al., Scheuren and Winkler, and Lahiri and Larsen, and proposes his own bias-corrected estimators of regression parameters.

Methodologies for analyses on linked data
These approaches rely on strong assumptions:
- Exchangeable linkage errors model
- Linking sets of equal size (or the smaller set contained in the larger one)
- One-to-one (1:1) linkage constraint
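As a hedged illustration of the first assumption, the exchangeable linkage errors model can be written as below. The notation is assumed here for illustration and is not taken from the slides: λ is the probability that a record is correctly linked, M is the number of records in the linking block, A is the expected linkage matrix, and y* is the vector of responses as observed after linkage.

```latex
\Pr(\text{record } i \text{ is linked to its true match}) = \lambda,
\qquad
\Pr(\text{record } i \text{ is linked to record } j \neq i) = \gamma = \frac{1-\lambda}{M-1}

\mathbf{A} = (\lambda - \gamma)\,\mathbf{I}_M + \gamma\,\mathbf{1}_M \mathbf{1}_M^{\top},
\qquad
E[\mathbf{y}^{*} \mid \mathbf{X}] = \mathbf{A}\mathbf{X}\boldsymbol{\beta}

E[\hat{\boldsymbol{\beta}}_{\text{naive}} \mid \mathbf{X}]
  = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{A}\mathbf{X}\,\boldsymbol{\beta}
```

Each row of A sums to one, and the naive OLS expectation reduces to β when λ = 1, that is, when there are no false links. Missed matches are not represented in this formulation.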

A sensitivity analysis
Winkler (2014) notes: «Scheuren and Winkler (1997) observed that, if linkage error is below 1%, then can perform statistical analysis without adjustment. Most ‘good’ matching situations have overall linkage error above 10%. Even ‘high match scores’ sets of pairs may have linkage error in range 1-5%. The current models may adjust the ‘observed’ matched pairs to having linkage error down from 10% to 7.5%»

Experimental data
Random sample of 1000 units from the fictitious population census data in the ESSnet DI (2011).
Linear model (as in Chambers, 2009): Y = Xβ + ε, with X ~ [1, Uniform(0,1)], β = [1, 5], ε ~ Norm(0,1).
Logistic model: X ~ Bernoulli(0.75), Y ~ Multinom(0.7, 0.05, 0.2, 0.05), dependent on X.
Two lists, L1 and L2, were generated: L1 = [Xs, 942 units], L2 = [Ys, 921 units].
Units in common (the true matches): 868; true non-matches: 127.
Table columns: Scenario, Declared matches, False matches in declared. Rows: Gold, Silver, Bronze.
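The linear set-up above can be sketched in a few lines of NumPy to show why false links bias the naive estimator. This is a minimal sketch and not the authors' code. It assumes a single linking block of 1000 records, a hypothetical correct-link probability `lam = 0.8`, exchangeable linkage errors, and no missed matches. The adjusted fit regresses the linked responses on the expected design `A @ X`, in the spirit of Lahiri and Larsen (2005). All function names are illustrative.

```python
import numpy as np

def simulate_linked_y(y, lam, rng):
    """Return y with roughly a share (1 - lam) of records linked to a wrong unit."""
    wrong = np.flatnonzero(rng.random(len(y)) > lam)
    y_star = y.copy()
    if len(wrong) > 1:
        shuffled = rng.permutation(wrong)
        # Cyclic shift inside the mislinked set: each such record
        # receives the response of a different record (1:1 linkage).
        y_star[shuffled] = y[np.roll(shuffled, 1)]
    return y_star

def ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def adjusted_design(X, lam):
    """A @ X under exchangeable errors, without forming the n x n matrix A."""
    n = X.shape[0]
    gamma = (1.0 - lam) / (n - 1)
    return (lam - gamma) * X + gamma * X.sum(axis=0)

rng = np.random.default_rng(2014)
n, lam = 1000, 0.8                      # lam is a hypothetical value
beta_true = np.array([1.0, 5.0])        # as on the slide
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
y = X @ beta_true + rng.normal(0, 1, n)
y_star = simulate_linked_y(y, lam, rng)

beta_naive = ols(X, y_star)                       # ignores linkage errors
beta_adj = ols(adjusted_design(X, lam), y_star)   # uses the assumed lam
print("naive:   ", beta_naive)
print("adjusted:", beta_adj)
```

The naive slope is pulled towards zero because mislinked responses are unrelated to the covariate of the record they are attached to. The adjustment relies on `lam` being known, which is the same assumption the talk flags as a limitation.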

The three linkage scenarios
Probabilistic record linkage procedures (Fellegi and Sunter, 1969) with the software RELAIS (2011).
Gold scenario: Name, Surname, Complete date of birth
Silver scenario: Name, Surname, Year of birth
Bronze scenario: Day of birth, Month of birth, Address
Table 1 – Results of linkage procedures for the three scenarios. Columns: Scenario, Declared matches, False matches in declared, prob. of missing true matches, prob. of false matches, false match rate. Rows: Gold, Silver, Bronze.
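When the true match status is known, as with simulated data, error rates of the kind reported in Table 1 can be computed from three counts. The sketch below uses one common empirical convention, which is an assumption here: false match rate as false matches over declared matches, and missed match rate as missed true matches over all true matches. The slides may define the probabilities differently. The figure of 868 true matches comes from the slides, while the declared and false counts in the example call are hypothetical.

```python
def linkage_error_rates(declared, false_declared, true_matches):
    """Empirical linkage error rates given known true match status."""
    if declared <= 0 or true_matches <= 0:
        raise ValueError("declared and true_matches must be positive")
    if not 0 <= false_declared <= declared:
        raise ValueError("false_declared must lie between 0 and declared")
    correct = declared - false_declared      # true matches that were declared
    if correct > true_matches:
        raise ValueError("more correct links than true matches")
    missed = true_matches - correct          # true matches that were not declared
    return {
        "false_match_rate": false_declared / declared,
        "missed_match_rate": missed / true_matches,
    }

# 868 true matches as on the slides; 850 and 20 are hypothetical counts.
rates = linkage_error_rates(declared=850, false_declared=20, true_matches=868)
print(rates)
```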

Linear model – Naïve estimator and linkage-error bias-adjusted estimators
Table columns: Linkage scenario, Estimator, Beta, Standard error.
Population: True value
Perfect linkage: Naïve
Gold linkage: Naïve
Silver linkage: Naïve, Ratio – ModOLS – Predictive Eb_CUE
Bronze linkage: Naïve, Ratio – ModOLS – Predictive Eb_CUE

Logistic model – Naïve and adjusted estimators
Table columns: Linkage scenario, Estimator, Beta, Standard error.
Population: True value
Perfect linkage: Naïve
Gold linkage: Naïve
Silver linkage: Naïve, Est. Equ. ML LL Est. Equ. Ch
Bronze linkage: Naïve, Est. Equ. ML LL Est. Equ. Ch

Remarks
- Missed matches must be taken into account to completely remove the effect of linkage errors on the bias of the estimates.
- The naïve estimators under perfect linkage and under the Gold scenario are still biased, due to missing true matches.
- In the logistic regression, the naïve estimate is less biased under the Bronze scenario, because the missed-matches component is lower there than in the other scenarios.
- The bias correction is effective in the linear case (a bias reduction of about 10% for the Silver scenario, and higher for the Bronze one), but more work is needed for the logistic case, where the naïve estimator performs slightly better.

Future work
- Investigate the effects of linkage errors on the variability component.
- Assess the trade-off between bias adjustment and the expected increase in variance.
- Adopt a more flexible framework, as in Chipperfield et al. (2011), where exchangeability of linkage errors is not required and missed matches are explicitly considered.
- Here the probability of being correctly linked and the probability of erroneously missing a match are assumed to be known, whereas evaluating linkage errors is not a straightforward task.

Bibliography
Biemer (2010). Total Survey Error: Design, Implementation, and Evaluation. Public Opinion Quarterly, Vol. 74, No. 5.
Chambers, R. (2009). Regression analysis of probability-linked data. Official Statistics Research Series, Vol. 4.
Chipperfield, J. O., Bishop, G. R. and Campbell, P. (2011). Maximum likelihood estimation for contingency tables and logistic regression with incorrectly linked data. Survey Methodology, Vol. 37, No. 1.
Fellegi, I. P. and Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64.
Lahiri, P. and Larsen, M. D. (2000). Model based analysis of records linked using mixture models. Proc. of the Section on Survey Research Methods, ASA.
Lahiri, P. and Larsen, M. D. (2005). Regression analysis with linked data. Journal of the American Statistical Association, 100.
McLeod, Heasman and Forbes (2011). Simulated data for the on the job training, Essnet DI.

Bibliography
Neter, J., Maynes, S. and Ramanathan, R. (1965). The effect of mismatching on the measurement of response errors. JASA.
RELAIS (2011). User’s guide version 2.2, available at
Scheuren, F. and Winkler, W. E. (1993). Regression analysis of data files that are computer matched. Survey Methodology.
Scheuren, F. and Winkler, W. E. (1997). Regression analysis of data files that are computer matched, part II. Survey Methodology.
Winkler, W. E. (2014). Quality and Analysis of National Files - Computational Methods for Censuses and Surveys. Presentation, January 9, 2014.
Zhang, L.-C. (2012). Topics of statistical theory for register-based statistics and data integration. Statistica Neerlandica, 66.