ESSnet DI WP2: Record Linkage. Luca Valentino, Istat.


2 Task 2.1: Record linkage – a practical problem
The problem, already illustrated at the Den Haag meeting, is to link two registers:
- P4 (inclusion of newborns in the administrative register of residents)
- CEDAP (survey on the certificates of assistance at childbirth)
One data set contains the characteristics of the newborn, such as weight, type of birth, number of siblings and week of birth; the other contains the characteristics of the household, such as the parents' marital status, nationality and education.

3 Harmonization of populations
[Diagram comparing the CEDAP and P4 populations.]
Common units: live newborns, newborns in Italy, Italian residents.
Units outside the common population: dead newborns, non-residents, Italian newborns in other countries, Molise and Calabria.

4 Common variables
The common variables available for linkage are:

Entity     Variables available in P4 and CEDAP
MOTHER     Mother's birthdate (day, month and year)
MOTHER     Mother's marital status
MOTHER     Mother's citizenship
MOTHER     County of residence of the mother (hence )
NEWBORN    County where childbirth happened
NEWBORN    Newborn's birthdate (day, month and year)
NEWBORN    Newborn's gender
FATHER     Father's birthdate (day, month and year)
FATHER     Father's citizenship

The linkage was performed on the month of March 2005. Exclusion of the non-eligible newborns led to the following file sizes:

P4       March    '336 units
CEDAP    March    '381 units

5 Different approaches
The objective was to apply different approaches and compare the results:
- deterministic record linkage approach: rule defined by survey experts
- probabilistic record linkage approach: based on the Fellegi-Sunter method (Relais)
- Liseo and Tancredi approach: to be done

6 Deterministic approach
This approach is based on a deterministic rule defined by survey experts: a pair is declared a match when the two records agree on all the common variables, or on all but one.
The rule is implemented through ad hoc SAS procedures, and its results are considered very reliable (declared matches are treated as actual matches).
The number of declared matches in this case is 32'595.
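
The expert's rule can be illustrated with a minimal Python sketch. This is not the SAS procedure used in the project: the field names are hypothetical stand-ins for the common variables, and treating two missing values as an agreement is an assumption made only for the example.

```python
# Hypothetical field names standing in for the common variables of P4 and CEDAP.
COMMON_VARS = [
    "mother_birthdate", "mother_marital_status", "mother_citizenship",
    "mother_residence", "birth_county", "newborn_birthdate",
    "newborn_gender", "father_birthdate", "father_citizenship",
]

def count_disagreements(rec_a, rec_b, variables=COMMON_VARS):
    """Number of common variables on which the two records differ.

    Assumption: a variable missing in both records counts as an agreement.
    """
    return sum(1 for v in variables if rec_a.get(v) != rec_b.get(v))

def expert_rule_match(rec_a, rec_b, max_disagreements=1):
    """Declare a match on full agreement or on agreement on all but one variable."""
    return count_disagreements(rec_a, rec_b) <= max_disagreements

# Toy records: identical, then differing on one and on two variables.
p4 = {v: "x" for v in COMMON_VARS}
cedap_one_diff = dict(p4, father_citizenship="y")
cedap_two_diff = dict(cedap_one_diff, mother_citizenship="y")

print(expert_rule_match(p4, cedap_one_diff))
print(expert_rule_match(p4, cedap_two_diff))
```

With one disagreement the pair is still declared a match; with two it is not.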

7 Probabilistic approach
Search space reduction: blocking on the variables newborn's birthdate and newborn's gender.
Matching variables:

Variable                        Metric
Day of mother's birth date      equality
Month of mother's birth date    equality
Year of mother's birth date     equality
Birthplace                      equality
Father's birth date             Jaro, threshold 0.9

Reduction to a 1:1 solution.
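
The blocking and comparison steps can be sketched as follows. This is an illustrative Python sketch, not the Relais implementation: the record layout and field names are hypothetical, dates are assumed to be stored as strings, and the Jaro function is a plain textbook version.

```python
from collections import defaultdict

def jaro(s1, s2):
    """Textbook Jaro similarity between two strings (1.0 = identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

# Hypothetical field names for the blocking variables.
BLOCK_KEYS = ("newborn_birthdate", "newborn_gender")

def candidate_pairs(file_a, file_b, block_keys=BLOCK_KEYS):
    """Yield only the pairs that agree on every blocking variable."""
    blocks = defaultdict(list)
    for rec in file_b:
        blocks[tuple(rec[k] for k in block_keys)].append(rec)
    for rec_a in file_a:
        for rec_b in blocks.get(tuple(rec_a[k] for k in block_keys), []):
            yield rec_a, rec_b

def comparison_vector(rec_a, rec_b):
    """Agreement pattern (1 = agree, 0 = disagree) on the matching variables."""
    equality_vars = ("mother_birth_day", "mother_birth_month",
                     "mother_birth_year", "birthplace")
    pattern = [int(rec_a[v] == rec_b[v]) for v in equality_vars]
    # Father's birth date: agreement when Jaro similarity reaches 0.9.
    pattern.append(int(jaro(rec_a["father_birthdate"],
                            rec_b["father_birthdate"]) >= 0.9))
    return tuple(pattern)
```

Blocking restricts the comparisons to pairs sharing the newborn's birthdate and gender, so the number of candidate pairs stays far below the full cross product of the two files.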

8 Probabilistic approach
Before applying the EM algorithm and the 1:1 reduction, the probabilistic approach finds a set of pairs, each with a probability of being a match (P_POST).
The final result depends on the choice of the match threshold, which in turn depends on the quality required for the linkage.
In this case high precision is required, in order to prevent false matches as much as possible. Hence the match threshold is fixed at 0.9.
The number of declared matches is 36'562.
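
A minimal sketch of how a posterior match probability can be computed and thresholded in a Fellegi-Sunter setting is given below. It assumes conditional independence of the matching variables; the m- and u-probabilities and the prior match share are made-up values for illustration, not the estimates obtained in this application. Only the 0.9 threshold comes from the slide.

```python
# Illustrative parameters (assumptions, not the estimates from this study):
# M_PROBS[k] = P(agreement on variable k | pair is a match)
# U_PROBS[k] = P(agreement on variable k | pair is not a match)
M_PROBS = [0.9, 0.9, 0.9, 0.9, 0.9]
U_PROBS = [0.05, 0.05, 0.05, 0.05, 0.05]
PRIOR_MATCH = 0.05       # assumed share of true matches among candidate pairs
MATCH_THRESHOLD = 0.9    # match threshold stated on the slide

def pattern_likelihood(pattern, probs):
    """Probability of an agreement pattern under conditional independence."""
    result = 1.0
    for agree, p in zip(pattern, probs):
        result *= p if agree else (1.0 - p)
    return result

def posterior_match_probability(pattern):
    """Bayes' rule: P(match | agreement pattern)."""
    num = PRIOR_MATCH * pattern_likelihood(pattern, M_PROBS)
    den = num + (1.0 - PRIOR_MATCH) * pattern_likelihood(pattern, U_PROBS)
    return num / den

def is_declared_match(pattern):
    return posterior_match_probability(pattern) >= MATCH_THRESHOLD

for pattern in [(1, 1, 1, 1, 1), (1, 1, 0, 1, 0), (0, 0, 0, 0, 0)]:
    print(pattern, round(posterior_match_probability(pattern), 4),
          is_declared_match(pattern))
```

Raising the threshold trades false matches for false non-matches, which is why a high value is chosen when precision is the priority.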

9 Comparison between approaches
The comparison between the deterministic approach (expert's rule) and the probabilistic approach (Relais) shows strong agreement:
- 31'931 pairs are declared as matches by both approaches
- 87% of the matches according to Relais are also matches for the expert's rule
- 98% of the matches according to the expert's rule are also matches for Relais
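
The two percentages follow directly from the counts of declared matches reported for each approach; the short Python check below reproduces them (variable names are chosen for the example).

```python
expert_matches = 32595   # matches declared by the expert's rule
relais_matches = 36562   # matches declared by Relais
common_matches = 31931   # pairs declared as matches by both approaches

share_of_relais = common_matches / relais_matches
share_of_expert = common_matches / expert_matches

# Pairs declared as matches by only one of the two approaches.
only_expert = expert_matches - common_matches
only_relais = relais_matches - common_matches

print(f"Relais matches confirmed by the expert's rule: {share_of_relais:.1%}")
print(f"Expert's rule matches confirmed by Relais: {share_of_expert:.1%}")
print(f"Only expert's rule: {only_expert}, only Relais: {only_relais}")
```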

10 Comparison by clerical review
The quality of the linkage procedures can be assessed by drawing samples of pairs and evaluating them carefully by clerical review.
The clerical review consists of analysing all the common variables of the two records:
- if the variables that do not coincide show only minimal differences (including cases where they are missing), the pair is classified as a true link
- otherwise the pair is classified as a false link
[Example on the slide: a pair of records compared on mother's birthdate, newborn's gender (Female) and newborn's birth day (06).]

11 Comparison by clerical review
Classes of pairs:
A - common matches
B - common matches consisting of twins
C - pairs declared as matches only by the expert's rule
D - pairs declared as matches only by Relais, similar on at least half of the variables not used in the probabilistic linkage procedure
E - other pairs declared as matches by Relais

Expert's rule    Relais     Class of pairs
Match            Match      A
Match            Match      B
Match            Unmatch    C
Unmatch          Match      D
Unmatch          Match      E

[Table columns for total pairs, sample size, true links and false links: values not available.]

12 Comparison by clerical review
Classes of pairs:
F - pairs in the Relais solution whose p_post value is below the match threshold
G - pairs that coincide in at least one of the most significant variables

Expert's rule    Relais     Class of pairs
Unmatch          Unmatch    F
Unmatch          Unmatch    G

[Table columns for total pairs, sample size, true links and false links: values not available.]

13 Comparison by clerical review
The results obtained on the checked samples give the following false match and false non-match rates:

Approach                                  False match rate    False non-match rate
Deterministic approach (expert's rule)    0%                  14.35%
Probabilistic approach (Relais)           0.25%               4.16%
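
One way such rates could be derived from class-level clerical samples is sketched below. The slides do not state the exact estimator, so this is an assumption: each class's sampled share of true links is scaled up to the class total, the false match rate is taken over the declared matches, and the false non-match rate over all estimated true links. The class names and counts are invented for the example and are not the figures from the review.

```python
def estimate_links(total_pairs, sample_size, true_in_sample):
    """Scale the sampled share of true links up to the whole class."""
    est_true = total_pairs * true_in_sample / sample_size
    return est_true, total_pairs - est_true

def error_rates(classes, declared):
    """Return (false match rate, false non-match rate) for one approach.

    classes:  {name: (total_pairs, sample_size, true_links_in_sample)}
    declared: names of the classes the approach declares as matches
    """
    est = {name: estimate_links(*counts) for name, counts in classes.items()}
    declared_total = sum(classes[name][0] for name in declared)
    false_matches = sum(est[name][1] for name in declared)
    all_true = sum(true for true, _ in est.values())
    missed_true = sum(est[name][0] for name in est if name not in declared)
    return false_matches / declared_total, missed_true / all_true

# Invented counts, for illustration only.
toy_classes = {
    "both":      (1000, 50, 50),
    "only_a":    (100, 20, 18),
    "only_b":    (200, 40, 30),
    "neither":   (50, 10, 2),
}

fmr_a, fnmr_a = error_rates(toy_classes, declared={"both", "only_a"})
print(f"Approach A: false match rate {fmr_a:.2%}, "
      f"false non-match rate {fnmr_a:.2%}")
```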