Jeroen Pannekoek, Sander Scholtus and Mark van der Loo

Presentation transcript:

Automatic data editing functions
Jeroen Pannekoek, Sander Scholtus and Mark van der Loo

Data editing in the GSBPM
The GSBPM phases: 1. Specify needs 2. Design 3. Build 4. Collect 5. Process 6. Analyse 7. Disseminate 8. Archive 9. Evaluate
Sub-processes of phase 5 (Process) that cover data editing: 5.3 Review, validate and edit; 5.4 Impute
The Generic Statistical Business Process Model (GSBPM) has 47 sub-processes. Automatic data editing is only a small part of this model, but it is a part with a considerable methodological and software component.

Data editing and efficiency
Data editing involves all activities to transform raw microdata with errors and missing values into edited statistical microdata that are suitable for the production of publication figures.
Data editing is an expensive process: it is often estimated that 40% of the total budget is spent on data editing.
NSIs (national statistical institutes) keep searching for more efficient ways of editing. They do so by improving methods for:
Selective manual editing: only the small subset of units that contain influential errors is edited manually.
Automated editing: the editing process is automated as much as possible.

Automatic editing functions
Automatic editing is not a single method but a collection of actions that each perform a specific task in the editing process.
To support automatic editing with general methods and tools, we need to identify the common statistical functions that can be used as building blocks in many editing processes.
This gives a decomposition of the overall editing process in more detail than the GSBPM can provide, but it serves similar goals. It facilitates process design, re-use of methodological components and documentation, and development of generic software tools.
Some classifications of data editing functions that are relevant for the process design.
To implement an automatic editing system in a general manner we must ...

Editing functions: verification
Confronting our data with prior knowledge and expectations.
Edit rules
Systems of connected balance edits: profit = turnover - total costs; total costs = costs of employees + costs of purchases + ... Also non-negativity edits and inequalities.
Input: data and rules. Output: N×k failed-edit matrix (N units, k edit rules).
Scores
Measure the potential effect that editing a unit may have on estimates of totals or other aggregate parameters of interest. Based on measures of the deviation between observed values and predicted or "anticipated" values.
Input: data and score function. Output: N-vector of unit scores.
The first group of editing functions we have identified is verification. The data are verified against our current expectations and possible problems are identified. Some familiar examples are edit rules and scores.
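
The two verification functions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the edit names, the anticipated values and the particular score formula (weight times absolute deviation, divided by the weighted anticipated total) are hypothetical choices and are not taken from the slides.

```python
# Hypothetical records: each is a dict of variable name -> reported value.
records = [
    {"turnover": 100, "total_costs": 60, "profit": 40},
    {"turnover": 100, "total_costs": 60, "profit": 10},  # fails the balance edit
    {"turnover": 50, "total_costs": -5, "profit": 55},   # fails non-negativity
]

# Edit rules as (name, predicate) pairs; a predicate returns True when satisfied.
edits = [
    ("balance", lambda r: r["profit"] == r["turnover"] - r["total_costs"]),
    ("turnover_nonneg", lambda r: r["turnover"] >= 0),
    ("costs_nonneg", lambda r: r["total_costs"] >= 0),
]


def failed_edit_matrix(records, edits):
    """Return an N x k matrix of booleans: True where record i fails edit j."""
    return [[not rule(rec) for _, rule in edits] for rec in records]


def unit_scores(records, anticipated, variable, weights):
    """Return one score per unit (an assumed, commonly used form):
    weight * |observed - anticipated| / weighted anticipated total."""
    total = sum(w * a for w, a in zip(weights, anticipated))
    return [
        w * abs(rec[variable] - a) / total
        for rec, a, w in zip(records, anticipated, weights)
    ]


failed = failed_edit_matrix(records, edits)
scores = unit_scores(records, anticipated=[95, 100, 50],
                     variable="turnover", weights=[1.0, 1.0, 1.0])
```

Here `failed` plays the role of the N×k failed-edit matrix and `scores` the N-vector of unit scores; in practice the anticipated values might come from a previous period or a model.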

Editing functions: selection
Selection of units and fields for further treatment: selection picks out units or data fields for specific further treatment.
Selection of units for manual editing
By comparing scores to a predetermined threshold value. Input: scores, threshold. Output: selected-units indicator.
Selection of fields for amendment: error localization
Detect errors with a detectable cause. Generic: thousand errors, recognizable typos, rounding errors. Subject-related: specific "if-then" type of correction rules.
To resolve edit failures, some values need to be changed. A generic automatic approach (Fellegi-Holt): select the smallest (weighted) number of variables to change such that all edits can be satisfied. Input: edit rules, data. Output: selected-fields indicator.
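
Both selection functions can be illustrated with a small Python sketch. It is not the algorithm used in production tools: the error localization below is a brute-force search that assumes every variable takes values in a small finite domain, and the record, weights and domain are hypothetical. It only shows the Fellegi-Holt principle of finding the minimum-weight set of fields that can be changed so that all edits are satisfied.

```python
from itertools import combinations, product


def select_units(scores, threshold):
    """Selected-units indicator: True where the score exceeds the threshold."""
    return [s > threshold for s in scores]


def passes_all(record, edits):
    """True if the record satisfies every edit rule."""
    return all(rule(record) for rule in edits)


def localize_errors(record, edits, weights, domain):
    """Return the minimum-weight tuple of fields that can be changed so that
    all edits hold (brute force; assumes a small finite value domain)."""
    names = list(record)
    subsets = [s for size in range(len(names) + 1)
               for s in combinations(names, size)]
    # Try cheaper sets of fields first.
    subsets.sort(key=lambda s: sum(weights[v] for v in s))
    for subset in subsets:
        for values in product(domain, repeat=len(subset)):
            candidate = dict(record)
            candidate.update(zip(subset, values))
            if passes_all(candidate, edits):
                return subset
    return None  # no solution within the assumed domain


# Hypothetical example: the balance edit fails for this record.
record = {"turnover": 100, "total_costs": 60, "profit": 10}
edits = [
    lambda r: r["profit"] == r["turnover"] - r["total_costs"],
    lambda r: all(v >= 0 for v in r.values()),
]
# Higher weight = more trusted variable, so more costly to change.
weights = {"turnover": 3, "total_costs": 2, "profit": 1}

fields_to_change = localize_errors(record, edits, weights, domain=range(0, 201))
selected = select_units([0.02, 0.0, 0.3], threshold=0.1)
```

With these weights the search points at `profit`, the least trusted field that alone can restore the balance edit. Realistic numeric data require a proper solver, since the brute-force search grows exponentially with the number of fields.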

Editing functions: amendment
Changing data values.
Amendment of systematic errors (with known cause)
Since the cause is known, an appropriate correction can be made. Input: field indicator and data. Output: amended value.
Imputation of missing or erroneous values
Missing values can be imputed. Values identified as erroneous by Fellegi-Holt (FH) error localization are generally also treated as missing and thus imputed. Input: indicator for missing values. Output: imputed value.
Adjustment for inconsistency
Adjustment of imputations to ensure consistency with edit rules. Input: data and edit rules. Output: adjusted value.
Up to this point nothing has been done to actually improve the quality of the data. Amendment does just that: it changes data values with the purpose of improving quality. These three functions (verification, selection and amendment) and some more are given in the following scheme, which we call a taxonomy of editing functions.
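
The three amendment functions can be sketched as follows. All choices here are assumptions made for illustration: the ratio bounds used to recognize a thousand error, ratio imputation as the imputation method, and proportional rescaling of the imputed values as the adjustment method. The slides do not say which methods were actually used.

```python
def amend_thousand_error(value, reference, low=300, high=3000):
    """Divide by 1000 when the ratio to a reference value (for example last
    year's figure) suggests reporting in units instead of thousands.
    The bounds low and high are hypothetical."""
    if reference > 0 and low <= value / reference <= high:
        return value / 1000
    return value


def ratio_impute(y, x):
    """Impute missing y values (None) as R * x, with R estimated from the
    records where y is observed."""
    observed = [(yi, xi) for yi, xi in zip(y, x) if yi is not None]
    ratio = sum(yi for yi, _ in observed) / sum(xi for _, xi in observed)
    return [ratio * xi if yi is None else yi for yi, xi in zip(y, x)]


def adjust_imputed(components, imputed, total):
    """Rescale only the imputed components so that all components add up to
    the total; observed components are left unchanged."""
    fixed = sum(c for c, flag in zip(components, imputed) if not flag)
    free = sum(c for c, flag in zip(components, imputed) if flag)
    factor = (total - fixed) / free
    return [c * factor if flag else c for c, flag in zip(components, imputed)]


amended = amend_thousand_error(52000, reference=50)
imputed_y = ratio_impute([10, None, 30], [5, 10, 15])
adjusted = adjust_imputed([30, 25, 15], [False, True, True], total=60)
```

After adjustment the components sum to the reported total, so a balance edit of the form total = sum of components is satisfied without touching the observed value.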

A taxonomy of data editing functions
Data editing is first ... (see our JOS paper)

Illustration: Indicators & edit checking
Data from child day care institutions: 800 records with 43 SBS-type variables and 73 hard edit rules.

Indicators

Action            Changes
Raw data
Direct rules      142
Thousand errors   24
Typing errors     30
Rounding errors   37
FH-localisation   1290
Imputation        2481
Adjustment        1640

Edit checking after amendment (Failed edit rules, Not verified edit rules, Missing): 1332 2875 1191 1330 1336 1300 1275 6301 2481 1193

Amendment: effect and plausibility
[Figure: % difference in means. Panels: Raw data – Manually edited; Auto edited – Manually edited.]

Amendment: plausibility by process step
% deviation of the mean from the manually edited data. Means are taken over non-missing values.
Columns: Employees, FTE, Total costs, Labour costs, Revenues, Result.
Rows with fewer than six values had blank cells; their values are listed in the original order.

Raw data:         2.5, 3.5, 5.9, 1.4, 5.7, 20.5
Direct rules:
Thousand errors:  -0.2, 0.2, -0.9
Typos:            -0.1
Rounding errors:
FH-localisation:  0.7, 5.2, 4.3, 1.9, 4.5, 1.7
Imputation:       -0.3, -0.5, -0.8, -3.4
Adjustment:       -0.7
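
The indicator in this table can be computed as in the short sketch below. The sign convention (mean after a process step minus mean of the manually edited data, relative to the latter) is an assumption, and the example values are made up.

```python
def mean_non_missing(values):
    """Mean over the non-missing (not None) values."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


def pct_deviation(step_values, manual_values):
    """Percentage deviation of the mean after a process step from the mean
    of the manually edited data."""
    m_step = mean_non_missing(step_values)
    m_manual = mean_non_missing(manual_values)
    return 100 * (m_step - m_manual) / m_manual


# Hypothetical values for one variable: after a step vs. manually edited.
deviation = pct_deviation([10, None, 14], [10, 11, 12])
```

Because the means are taken over non-missing values only, the indicator can change at the imputation step simply because more units enter the mean.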

Further work
Just as in the GSBPM, we can distinguish different phases in automatic editing applications.
Phases: 1. Specification of rules and methods 2. Execution of editing 3. Analysis of effects of editing
Further work will include methods and tools for the first and last phase.
Specification and improvement of systems of edits: edit specification (general part + specific part); editing the edits (consistency, redundancy, restrictiveness).
Analysis of effects of editing: indicators for effects on estimates and microdata.