On the general flow of editing
Jeroen Pannekoek and Li-Chun Zhang
CBS-SSB STATISTICS NETHERLANDS – STATISTICS NORWAY
Work Session on Statistical Data Editing, Oslo, Norway, September 2012
Introduction

An overall data editing process comprises all activities that transform raw micro-data, with errors and missing values, into edited statistical micro-data suitable for producing publication figures. In GSBPM terms these are: review, validate and edit; impute; output control.

Implementing an editing and imputation (E&I) system requires more detailed descriptions, called statistical functions, each of which performs some action on the data. This paper tries to identify common statistical functions that serve as building blocks in different overall E&I processes or strategies. Decomposing the overall process in this way can facilitate process design, re-use of methodological components, documentation and generic software tools.
Contents

Some classifications of data editing functions that are relevant for the process design.
A summary of statistical data editing functions in some detail.
Some process flow examples from the Netherlands and Norway, using the statistical functions as building blocks.
Concluding remarks.
Classification of functions by purpose

Verification: checking of hard and soft edit rules, calculation of scores, detection of systematic errors. Input: rules and data → Output: quality indicators and measures. Less formal variants: graphical macro-editing, output control.

Selection (for further processing): selection of units for manual editing; selection of variables to change (error localisation). Input: quality indicators and data → Output: selection of records or fields.

Amending: modifying selected data values to resolve the problems detected by verification, including imputation of missing values.
Unit-mode versus batch-mode operation

Since manual editing is time-consuming, it should start during the data collection period, which can be lengthy. Any automatic editing function that is applied before manual editing must therefore also be able to run during that period.

Unit-mode functions: proceed on a record-by-record basis and can be applied during the data collection phase.

Batch-mode functions: use all of the data (or a large subset) and can only be applied near the end of the data collection phase.
Editing functions: verification (1/2)

Edit rules (unit-mode):
  Systems of connected balance edits, for example
    profit = turnover - total costs
    total costs = costs of employees + costs of purchases + ...
  Non-negativity edits and inequalities.
  Ratio edits (soft).

Score functions: measure the potential effect that editing a unit may have on estimates of totals or other aggregate parameters of interest. They are based on measures of the deviation between observed values and predicted or "anticipated" values, s_i = f(x_j, x_j^a), where x_j is the observed value and x_j^a the anticipated value.
  Unit-mode: x_j^a is based on historical data or another external source.
  Batch-mode: x_j^a is based on current data.
Score functions are also applied to measure and check the actual effect of (automatic) editing rather than the potential effect. In that case x_j^a is the edited value.
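The unit-mode verification described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the rounding tolerance `tol`, and the particular form chosen for f (absolute deviation divided by a reference total `scale`) are all assumptions made for the example. Only the first balance edit is checked, because the second one is not fully specified in the text.

```python
def failed_hard_edits(rec, tol=0.5):
    """Return the names of the hard edits that one record fails (unit-mode)."""
    failed = []
    # Balance edit: profit = turnover - total costs, within an assumed tolerance.
    if abs(rec["profit"] - (rec["turnover"] - rec["total_costs"])) > tol:
        failed.append("balance")
    # Non-negativity edits on turnover and total costs.
    if rec["turnover"] < 0 or rec["total_costs"] < 0:
        failed.append("non-negativity")
    return failed


def score(observed, anticipated, scale):
    """One assumed choice of f: absolute deviation relative to a reference total."""
    return abs(observed - anticipated) / scale


record = {"turnover": 100, "total_costs": 60, "profit": 30}
previous_turnover = 80  # unit-mode: anticipated value taken from historical data

print(failed_hard_edits(record))
print(score(record["turnover"], previous_turnover, scale=1000))
```

In batch-mode the same `score` function could be used with an anticipated value estimated from the current data, for example a stratum median, which is only available once most records have arrived.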
Editing functions: verification (2/2)

Extended score functions: score functions can be extended with indicators for further processing that are based on simple criteria other than the regular score function. For instance:
  >0: regular score value
  -9: "crucial" (dominates the totals in its branch) → manual editing
  -8: influential and main variables are missing → re-contact
  -7: non-influential and main variables are missing → unit nonresponse

Macro-verification: macro-verification functions are batch-mode by definition. They include all macro-editing activities: verifying aggregates, graphical inspection of distributions, graphical or model-based outlier detection, etc.
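The extended score codes listed above could be assigned as follows. This is a sketch under stated assumptions: the function name, the boolean flags, and the order of precedence (the "crucial" check is applied before the missing-variable checks) are illustrative choices, not taken from the source.

```python
# Follow-up action for each special code, as listed on the slide.
ACTIONS = {-9: "manual editing", -8: "re-contact", -7: "unit nonresponse"}


def extended_score(regular_score, crucial, influential, main_missing):
    """Return a special negative code when a simple criterion applies,
    otherwise the regular score value."""
    if crucial:
        return -9                     # dominates the totals in its branch
    if main_missing:
        return -8 if influential else -7
    return regular_score


code = extended_score(0.4, crucial=False, influential=True, main_missing=True)
print(code, ACTIONS.get(code, "use regular score"))
```

Encoding the special cases as negative values keeps them in the same field as the regular score while making them impossible to confuse with it.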
Editing functions: selection

Selection of units for manual editing using regular scores:
  By comparing the score to a predetermined threshold value (unit-mode).
  By ordering units by score and selecting the highest-ranking ones (batch-mode).

Selection of variables for amendment, i.e. error localization (unit-mode): to resolve edit failures, some values need to be changed. The error localization problem is to select which variables to change. A generic automatic approach (Fellegi-Holt): change the smallest (weighted) number of variables such that all edits can be satisfied.

Macro-selection of units for manual editing (batch-mode):
  Implausible aggregates eventually lead to suspect units (drilling down).
  Graphical verification leads to selection of the most extraordinary units.
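The Fellegi-Holt principle can be illustrated with a brute-force search. The sketch below is a deliberately simplified assumption-laden version: it handles only linear equality (balance) edits written as "sum of coefficient × variable = 0", ignores non-negativity and other inequality edits, and enumerates all subsets, which is only feasible for a handful of variables. The function names, the second balance edit (restricted to two cost components), the example record and the reliability weights are all hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def is_consistent(rows):
    """Gauss-Jordan elimination on augmented rows [a_1 .. a_k | b].
    Returns True if the linear system has at least one solution."""
    if not rows:
        return True
    rows = [[Fraction(v) for v in r] for r in rows]
    n_cols = len(rows[0]) - 1
    pivot_row = 0
    for col in range(n_cols):
        p = next((r for r in range(pivot_row, len(rows)) if rows[r][col] != 0), None)
        if p is None:
            continue
        rows[pivot_row], rows[p] = rows[p], rows[pivot_row]
        for r in range(len(rows)):
            if r != pivot_row and rows[r][col] != 0:
                factor = rows[r][col] / rows[pivot_row][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
        if pivot_row == len(rows):
            break
    # Inconsistent if a row has only zero coefficients but a non-zero right-hand side.
    return all(any(v != 0 for v in r[:-1]) or r[-1] == 0 for r in rows)


def localize(record, edits, weights):
    """Return a minimum-weight set of variables that, if freed, lets all
    equality edits be satisfied (the other values are kept fixed)."""
    names = list(record)
    subsets = [s for k in range(len(names) + 1) for s in combinations(names, k)]
    subsets.sort(key=lambda s: sum(weights[v] for v in s))
    for s in subsets:
        rows = []
        for edit in edits:
            fixed = sum(c * record[v] for v, c in edit.items() if v not in s)
            rows.append([edit.get(v, 0) for v in s] + [-fixed])
        if is_consistent(rows):
            return set(s)
    return None


# Hypothetical edits: profit - turnover + total_costs = 0 and
# total_costs - costs_employees - costs_purchases = 0.
edits = [
    {"profit": 1, "turnover": -1, "total_costs": 1},
    {"total_costs": 1, "costs_employees": -1, "costs_purchases": -1},
]
record = {"turnover": 100, "total_costs": 60, "profit": 30,
          "costs_employees": 40, "costs_purchases": 20}
weights = {"turnover": 2, "total_costs": 1, "profit": 1,
           "costs_employees": 1, "costs_purchases": 1}

print(localize(record, edits, weights))
```

Here changing total costs alone cannot work, because the two balance edits would require different values for it, so the search settles on profit. Production systems solve the same problem with dedicated optimisation methods rather than full enumeration.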
Editing functions: amendment

Amendment of systematic errors (unit-mode): errors with a detectable cause and a reliable correction mechanism.
  Generic: thousand errors, recognizable typos, rounding errors.
  Subject-related: specific "if-then" type correction rules.

Deductive imputation of missing values (unit-mode): some missing values are uniquely determined by the hard edit rules, which leave only one feasible value to impute.

Model-based imputation (batch- or unit-mode): for most missing values, model-based predicted values are needed for imputation. This is batch-mode if current data are used to estimate the parameters.

Adjustment for inconsistency (unit-mode): adjustment of imputed values to ensure consistency with the edit rules.
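Two of the unit-mode amendment functions above lend themselves to a short sketch. Everything specific here is an assumption for illustration: the function names, the use of a reference value (such as last period's figure) to detect thousand errors, the ratio bounds 300 and 3000, and the restriction of deductive imputation to the single balance edit profit = turnover - total costs.

```python
def fix_thousand_error(value, reference, low=300.0, high=3000.0):
    """Divide by 1000 when the value is roughly 1000 times a reference value,
    which suggests it was reported in units instead of thousands."""
    if reference > 0 and low < value / reference < high:
        return value / 1000
    return value


def deductive_impute(rec):
    """Fill in the one missing variable of the balance edit
    profit = turnover - total_costs, when exactly one is missing (None)."""
    out = dict(rec)
    missing = [k for k in ("profit", "turnover", "total_costs") if out[k] is None]
    if len(missing) != 1:
        return out  # nothing to deduce, or not uniquely determined
    if missing[0] == "profit":
        out["profit"] = out["turnover"] - out["total_costs"]
    elif missing[0] == "turnover":
        out["turnover"] = out["profit"] + out["total_costs"]
    else:
        out["total_costs"] = out["turnover"] - out["profit"]
    return out


print(fix_thousand_error(125000, reference=130))
print(deductive_impute({"profit": None, "turnover": 100, "total_costs": 60}))
```

Both functions look at a single record only, which is what allows them to run during data collection.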
Illustration of automatic editing

Data from child day care institutions: 500 records with 68 SBS-type variables and 40 hard edit rules.

Action                              # failed hard edits   # missing values
None                                        514                   0
Treatment of systematic errors
  Thousand errors                           514                   0
  Typing errors                             476                   0
  Rounding errors                           440                   0
Selection of fields to change
  F-H error localization                      -                 397
Automatic imputation/adjustment
  Deductive imputation                        -                 266
  Regression imputation                     254                   0
  Adjustment of imputed values                0                   0
Process flow. Scenario A: Selective editing

[Flow chart: Input micro data → Edited micro data, with Yes/No decision branches. Steps shown:]
1. Primary automated processing: 1a. Systematic errors; 1b. Evaluation of scores
2. Micro-selection: 2a. Selection using scores; 2b. (FH-)selection of fields
3. Clerical interactive editing
4. Automatic amendment of uncritical units: 4a. Imputation of missing values; 4b. Adjustments
5. Macro-verification and selection
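The micro-level part of Scenario A might be orchestrated roughly as follows. This is one reading of the flow chart, assuming that units selected by their score go to clerical editing and all other ("uncritical") units go to automatic amendment. The function name and its parameters are hypothetical, the individual steps are passed in as callables, and the macro-verification step 5 is left out.

```python
def scenario_a(units, fix_systematic, score, threshold, amend):
    """Route each unit either to the manual-editing queue or to automatic amendment."""
    manual_queue, edited = [], []
    for unit in units:
        unit = fix_systematic(unit)        # 1a. systematic errors
        s = score(unit)                    # 1b. evaluation of scores
        if s > threshold:                  # 2a. selection using scores
            manual_queue.append(unit)      # 3. clerical interactive editing
        else:
            edited.append(amend(unit))     # 2b, 4a, 4b. automatic amendment
    return manual_queue, edited


# Toy stand-ins for the real statistical functions.
units = [{"id": 1, "turnover": 100.0}, {"id": 2, "turnover": 5000.0}]
manual, auto = scenario_a(
    units,
    fix_systematic=lambda u: u,
    score=lambda u: abs(u["turnover"] - 100.0) / 1000.0,
    threshold=1.0,
    amend=lambda u: {**u, "amended": True},
)
print([u["id"] for u in manual], [u["id"] for u in auto])
```

Because every step inside the loop is unit-mode, this part of the process can run while data collection is still in progress.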
Process flow. Scenario B: More automatic editing

[Flow chart: Input micro data → Edited micro data, with Yes/No decision branches. Steps shown:]
1. Primary automated processing, comprising all unit-mode automatic editing: 1a. Systematic errors; 1b. (FH-)selection of fields; 1c. Imputation; 1d. Adjustments; 1e. Evaluation of scores
2. Micro-selection
3. Clerical interactive editing
4. Automatic amendment of uncritical units: 4a. Batch-mode imputation; 4b. Adjustments
5. Macro-verification and selection
Process flow. Scenario C: No timeliness problems

[Flow chart: Input micro data → Edited micro data, with Yes/No decision branches. Steps shown:]
1. Primary automated processing: systematic errors
2. Macro-verification and selection, including batch-mode scores
3. (Partial) clerical interactive editing
4. Automatic amendment: 4a. Imputation of missing values; 4b. Adjustments
Concluding remarks

The description of the overall process shown here can be helpful in the communication between editing staff, project managers, process designers and methodologists. It clarifies the organization of the process and the choices that must be made.

It also helps to define the functionalities and interfaces of generic software components by placing them in the context of the overall process scheme.

Increasing automatic editing can greatly reduce the amount of manual editing. This may involve automatic editing of influential units and subject-specific "if-then" rules.