The Use of Administrative Sources for Statistical Purposes: Matching and Integrating Data from Different Sources



What is Matching? Linking data from different sources. Exact matching: linking records from two or more sources, often using common identifiers. Probabilistic matching: determining the probability that records from different sources should match, using a combination of variables.
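A minimal sketch of exact matching, assuming both sources carry a shared identifier field. The field name `ref`, the function name and the sample records are hypothetical and only illustrate the idea of linking on a common identifier.

```python
def exact_match(source_a, source_b, key="ref"):
    """Link records from two sources that share the same identifier value."""
    # Index the second source by identifier (assumes identifiers are unique there).
    index_b = {rec[key]: rec for rec in source_b}
    linked, unlinked = [], []
    for rec in source_a:
        partner = index_b.get(rec[key])
        if partner is None:
            unlinked.append(rec)          # no record with this identifier in source_b
        else:
            linked.append((rec, partner))
    return linked, unlinked


# Hypothetical sample data: a tax source and a survey source.
tax = [{"ref": "001", "name": "Vale Ltd"}, {"ref": "002", "name": "Villars Cafe"}]
survey = [{"ref": "002", "turnover": 120}, {"ref": "003", "turnover": 75}]

linked, unlinked = exact_match(tax, survey)
print(linked)
print(unlinked)
```

Records left in `unlinked` are the ones for which probabilistic matching on other variables would be the next step.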

Why Match? Combining data sets can give more information than is available from individual data sets. Matching can also reduce response burden, build efficient sampling frames, impute missing data, and allow data integration.

Models for Data Integration: statistical registers; statistics from mixed source models (split population model, split data approach, pre-filled questionnaires, using administrative data for non-responders, using administrative data for estimation); register-based statistical systems.

Statistical Registers

Mixed Source Models: Traditionally, one statistical output was based on one statistical survey, with very little integration or coherence. Now there is a move towards more integrated statistical systems, in which outputs are based on several sources.

Split Population Model: One source of data for each unit, with different sources for different parts of the population.

Split Population Model

Split Data Approach: Several sources of data for each unit.

Pre-filled Questionnaires: Survey questionnaires are pre-filled with data from other sources where possible. Respondents check that the information is correct, rather than completing a blank questionnaire. This reduces response burden but may introduce a bias.

Example

Using Administrative Data for Non-responders: Administrative data are used directly to supply variables for units that do not respond to a statistical survey. This is often done for less important units, so that response-chasing resources can be focused on key units.

Using Administrative Data for Estimation: Administrative data are used as auxiliary variables to improve the accuracy of statistical estimation. This is often done to estimate for small sub-populations or small geographic areas.
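The slide does not name a specific estimator. As one common illustration of using an administrative variable as auxiliary information, the sketch below applies simple ratio estimation; the function name and all figures are made up for the example.

```python
def ratio_estimate(sample, admin_total):
    """Estimate a population total for a survey variable.

    sample      -- list of (survey_value, admin_value) pairs for sampled units
    admin_total -- total of the administrative variable over the whole population
    """
    survey_sum = sum(y for y, _ in sample)
    admin_sum = sum(x for _, x in sample)
    ratio = survey_sum / admin_sum      # survey value per unit of the admin variable
    return ratio * admin_total


# Hypothetical sample: (surveyed turnover, turnover reported to the tax authority).
sample = [(12, 10), (26, 20), (37, 30)]
estimate = ratio_estimate(sample, admin_total=1000)
print(estimate)
```

The approach relies on the administrative variable being known for every unit and being roughly proportional to the survey variable.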

Register-based Statistical Systems

Matching Terminology

Matching Keys: Data fields used for matching, e.g. reference number; name; address; postcode / zip code / area code; birth / death date; classification (e.g. ISIC, ISCO); other variables (age, occupation, etc.).

Distinguishing Power 1: This relates to the uniqueness of the matching key. Some keys or values have higher distinguishing power than others. High: reference number, full name, full address. Low: sex, age, city, nationality.

Distinguishing Power 2: Distinguishing power can depend on the level of detail, e.g. “Born 1960, Paris” versus “Born 23 June 1960, rue de l’Eglise, Montmartre, Paris”. Choose variables, or combinations of variables, with the highest distinguishing power.
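One rough way to compare candidate matching keys is to measure how often each key value is unique within a file. The sketch below is an assumption about how this could be quantified, not a definition from the presentation; the records are invented.

```python
from collections import Counter


def distinguishing_power(records, fields):
    """Share of records whose combination of `fields` is unique in the file (0 to 1)."""
    keys = [tuple(rec[f] for f in fields) for rec in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# Hypothetical person records.
people = [
    {"year": 1960, "city": "Paris", "dob": "1960-06-23"},
    {"year": 1960, "city": "Paris", "dob": "1960-11-02"},
    {"year": 1960, "city": "Lyon", "dob": "1960-06-23"},
    {"year": 1971, "city": "Paris", "dob": "1971-01-15"},
]

print(distinguishing_power(people, ["year"]))          # 0.25
print(distinguishing_power(people, ["year", "city"]))  # 0.5
print(distinguishing_power(people, ["dob", "city"]))   # 1.0
```

More detailed fields, and combinations of fields, push the share towards 1, which is the point made on the slide.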

Match: A pair that represents the same entity in reality (A = A).

Non-match: A pair that represents two different entities in reality (A ≠ B).

Possible Match: A pair for which there is not enough information to determine whether it is a match or a non-match (A ? a).

False Match: A pair wrongly designated as a match in the matching process (false positive) (A = B).

False Non-match: A pair which is a match in reality, but is designated as a non-match in the matching process (false negative) (A ≠ A).

Matching Techniques

Clerical Matching: expensive, inconsistent, slow, intelligent.

Automatic Matching: cheap, consistent, quick, limited intelligence.

The Solution: Use an automatic matching tool to find obvious matches and non-matches. Refer possible matches to specialist staff. Maximise automatic matching rates and minimise clerical intervention.

How Automatic Matching Works

Standardisation: Generally used for text variables. Abbreviations and common terms are replaced with standard text. Common variations of names are standardised. Postal codes, dates of birth, etc. are given a common format.
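A minimal sketch of standardisation for text and postcode fields. The abbreviation table and the UK-style postcode layout are assumptions for illustration; a production table would be much larger and specific to each source.

```python
import re

# Hypothetical abbreviation table (illustrative only).
ABBREVIATIONS = {"ltd": "limited", "co": "company", "st": "street", "rd": "road", "&": "and"}


def standardise_text(value):
    """Lower-case, strip punctuation, and expand common abbreviations."""
    cleaned = re.sub(r"[^\w\s&]", " ", value.lower())
    words = [ABBREVIATIONS.get(word, word) for word in cleaned.split()]
    return " ".join(words)


def standardise_postcode(value):
    """Upper-case a UK-style postcode and put one space before its last three characters."""
    compact = re.sub(r"\s+", "", value).upper()
    if len(compact) <= 3:
        return compact
    return compact[:-3] + " " + compact[-3:]


print(standardise_text("Vale & Co. Ltd."))   # vale and company limited
print(standardise_postcode("np108xg"))       # NP10 8XG
```

Note that a flat lookup cannot resolve ambiguous abbreviations (for example "st" as "street" or "saint"), which is one reason such tables need tuning.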

Blocking: If the file to be matched against is very large, it may be necessary to break it down into smaller blocks to save processing time, e.g. if the record to be matched is in a certain town, only match against other records from that town, rather than all records for the whole country.

Blocking: Blocking must be used carefully, or good matches will be missed. Experiment with different blocking criteria on a small test data set. It is possible to have two or more passes with different blocking criteria to maximise matches.
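The blocking idea, including several passes with different criteria, can be sketched as follows. The record layout (`id`, `town`, `postcode`) and the function names are hypothetical.

```python
from collections import defaultdict


def candidate_pairs(new_records, register, block_key):
    """Yield (new, existing) pairs that share the same blocking key value."""
    blocks = defaultdict(list)
    for rec in register:
        blocks[block_key(rec)].append(rec)
    for rec in new_records:
        # Only compare against register records in the same block.
        for existing in blocks.get(block_key(rec), []):
            yield rec, existing


def multi_pass(new_records, register, block_keys):
    """Run one pass per blocking criterion and return the distinct pairs of record ids."""
    seen = set()
    for block_key in block_keys:
        for new, existing in candidate_pairs(new_records, register, block_key):
            seen.add((new["id"], existing["id"]))
    return seen


# Hypothetical data: R3 has a misspelt town, so blocking on town alone misses it.
register = [
    {"id": "R1", "town": "Newport", "postcode": "NP10 8XG"},
    {"id": "R2", "town": "Cardiff", "postcode": "CF10 1AA"},
    {"id": "R3", "town": "Newprt", "postcode": "NP10 8XG"},
]
new_records = [{"id": "N1", "town": "Newport", "postcode": "NP10 8XG"}]


def by_town(rec):
    return rec["town"].lower()


def by_postcode(rec):
    return rec["postcode"]


print(multi_pass(new_records, register, [by_town]))
print(multi_pass(new_records, register, [by_town, by_postcode]))
```

The second pass on postcode recovers the pair that the town block misses, at the cost of extra comparisons.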

Parsing: Names and words are broken down into matching keys, e.g. Steven Vale → stafan val and Stephen Vael → stafan val. This improves success rates by allowing matching where variables are not identical.
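The presentation does not say which algorithm produces keys such as "stafan val". As a stand-in, the sketch below uses the well-known Soundex code, which likewise maps spelling variants to a common key; it illustrates the principle only and is not the method used on the slide.

```python
# Soundex digit for each coded consonant; vowels, y, h and w carry no digit.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character Soundex code of a single word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "hw":       # h and w do not separate repeated codes
            prev = code
    return (key + "000")[:4]     # pad or truncate to four characters


def name_key(name):
    """Build a matching key from the Soundex codes of each word in a name."""
    return " ".join(soundex(word) for word in name.split())


print(name_key("Steven Vale"))   # S315 V400
print(name_key("Stephen Vael"))  # S315 V400
```

Because both spellings share one key, the pair can be compared even though the raw names differ.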

Scoring: Matched pairs are given a score based on how closely the matching variables agree. Scores determine matches, possible matches and non-matches.

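How to Determine X and Y: X and Y are the score thresholds that separate non-matches, possible matches and matches. They can be set by mathematical methods, e.g. the Fellegi / Sunter method, or by trial and error. Data contents and quality may change over time, so periodic reviews are necessary.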
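In the Fellegi / Sunter approach, each matching variable contributes a weight derived from two probabilities: m, the probability that the variable agrees for a true match, and u, the probability that it agrees for a non-match. The sketch below computes those weights and sums them into a pair score; the m and u values are invented for illustration and would in practice be estimated from the data.

```python
import math


def field_weights(m, u):
    """Return (agreement weight, disagreement weight) for one field, in log base 2."""
    return math.log2(m / u), math.log2((1 - m) / (1 - u))


def pair_score(agreements, probabilities):
    """Sum the weights over all fields for one candidate pair.

    agreements    -- dict of field name -> True if the field agrees
    probabilities -- dict of field name -> (m, u)
    """
    total = 0.0
    for field, agrees in agreements.items():
        agree_w, disagree_w = field_weights(*probabilities[field])
        total += agree_w if agrees else disagree_w
    return total


# Hypothetical m and u probabilities for three matching variables.
probabilities = {"postcode": (0.90, 0.01), "name": (0.95, 0.05), "sex": (0.98, 0.50)}

all_agree = pair_score({"postcode": True, "name": True, "sex": True}, probabilities)
postcode_differs = pair_score({"postcode": False, "name": True, "sex": True}, probabilities)
print(all_agree, postcode_differs)
```

Pairs scoring above the upper threshold are treated as matches, those below the lower threshold as non-matches, and those in between as possible matches. A variable with high distinguishing power such as postcode (low u) moves the score far more than sex does.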

Enhancements: Re-matching files at a later date reduces false non-matches (if at least one file is updated). Link to data cleaning software, e.g. address standardisation.

Matching Software: Commercial products, e.g. Informatica, Trillium, Automatch. In-house products: Jasper (Statistics Canada), Relais (ISTAT). Open-source products, e.g. FEBRL. There are no “off the shelf” products: all require tuning to specific needs.

Internet Applications: Google (and other search engines). Cascot, an automatic coding tool based on text matching (ons/software/cascot/choose_classificatio/). Address finders, e.g. Postes Canada (nced-f.asp).

Software Applications: Trigram method, applied in SAS code (freeware) for matching in the Eurostat business demography project. It works by comparing groups of 3 letters and counting matching groups.

Trigram Method: Match “Steven Vale” (Ste/tev/eve/ven/en /n V/ Va/Val/ale) to “Stephen Vale” (Ste/tep/eph/phe/hen/en /n V/ Va/Val/ale): 6 matching trigrams. And to “Stephen Vael” (Ste/tep/eph/phe/hen/en /n V/ Va/Vae/ael): 4 matching trigrams. Parsing would improve these scores.
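The trigram counts in this example can be reproduced with a few lines of Python. This is an independent sketch, not the SAS code referred to in the presentation; the function names are chosen for illustration.

```python
def trigrams(text):
    """Return all overlapping groups of three characters, spaces included."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def matching_trigrams(a, b):
    """Count the distinct trigrams that occur in both strings."""
    return len(set(trigrams(a)) & set(trigrams(b)))


print(trigrams("Steven Vale"))
print(matching_trigrams("Steven Vale", "Stephen Vale"))  # 6
print(matching_trigrams("Steven Vale", "Stephen Vael"))  # 4
```

The comparison here is case-sensitive and counts raw matches; dividing by the number of trigrams in either string would turn the count into a similarity score.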

Matching in Practice

Matching Records Without a Common Identifier: The UK Experience, by Steven Vale (Eurostat / ONS) and Mike Villars (ONS)

The Challenge: The UK statistical business register relies on several administrative sources. It needs to match records from these different sources to avoid duplication. There is no system of common business identification numbers in the UK.

The Solution: Records are matched using business name, address and post code. The matching software used is Identity Systems / SSA-NAME3. Matching is mainly automatic via batch processing, but a user interface also allows clerical matching.

Batch Processing 1: The name is compressed to form a namekey; the last word of the name is the major key. Major keys are checked against those of existing records at decreasing levels of accuracy until possible matches are found. The names, addresses and post codes of possible matches are compared, and a score out of 100 is calculated.

Batch Processing 2: If the score is >79, it is considered a definite match. If the score is between 60 and 79, it is considered a possible match and is reported for clerical checking. If the score is <60, it is considered a non-match.
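The three score bands can be written as a small classification function. The thresholds come from the slide; the function name and the assumption that scores are whole numbers out of 100 are illustrative.

```python
def classify(score):
    """Assign a match status to a score out of 100, using the bands on the slide."""
    if score > 79:
        return "definite match"
    if score >= 60:
        return "possible match"   # reported for clerical checking
    return "non-match"


for score in (95, 79, 60, 59):
    print(score, classify(score))
```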

Clerical Processing: Possible matches are checked and linked where appropriate using an on-line system. Non-matches with employment greater than 9 are checked; if no link is found, they are sent a Business Register Survey questionnaire. Samples of definite matches and smaller non-matches are checked periodically.

Problems Encountered 1: “Trading as” or “T/A” in the name. For example, in “Mike Villars T/A Mike’s Coffee Bar”, “Bar” would be the major key, but this would give too many matches, as there are thousands of bars in the UK. Solution: split the name so that the last word prior to “T/A”, e.g. “Villars”, is the major key, improving the quality of matches.
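The choice of major key, with and without the "T/A" split, can be sketched as below. This only illustrates the rule described on the slide; it is not how SSA-NAME3 builds its namekeys, and the function name and regular expression are assumptions.

```python
import re

# Matches "T/A" or "trading as" surrounded by whitespace, in any letter case.
TRADING_AS = re.compile(r"\s+(?:t/a|trading as)\s+", re.IGNORECASE)


def major_key(name):
    """Return the last word of the name, ignoring anything after 'T/A' or 'trading as'."""
    legal_part = TRADING_AS.split(name.strip(), maxsplit=1)[0]
    words = legal_part.split()
    return words[-1].lower() if words else ""


print(major_key("Mike's Coffee Bar"))                    # bar
print(major_key("Mike Villars T/A Mike's Coffee Bar"))   # villars
```

Keying on "villars" rather than "bar" keeps the set of candidate records small enough for the later name, address and post code comparison to be meaningful.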

Problems Encountered 2: The number of small non-matched units grows over time, leading to increasing duplication. Checking these units is labour intensive. Solutions: fine-tune matching parameters; re-run batch processes; use extra information, e.g. legal form / company number, where available.

Future Developments: Clean and correct addresses prior to matching using “QuickAddress” and the Post Office Address File. Links to geographical referencing. Business Index: plans to link registers of businesses across UK government departments. Unique identifiers?