Statistical Data Editing Anders Norberg, Statistics Sweden (SCB) 2011-11-11.

Slides:



Advertisements
Similar presentations
Work Session on Statistical Data Editing Paris, France, April 2014 Topic (i): Selective editing / macro editing Experiences from Selective Editing.
Advertisements

Preparing Data for Quantitative Analysis
Brian A. Harris-Kojetin, Ph.D. Statistical and Science Policy
1 QUANTITATIVE DESIGN AND ANALYSIS MARK 2048 Instructor: Armand Gervais
Workshop on Energy Statistics, China September 2012 Data Quality Assurance and Data Dissemination 1.
1 Editing Administrative Data and Combined Data Sources Introduction.
15 de Abril de A Meta-Analysis is a review in which bias has been reduced by the systematic identification, appraisal, synthesis and statistical.
Software Quality Control Methods. Introduction Quality control methods have received a world wide surge of interest within the past couple of decades.
1 BA 555 Practical Business Analysis Review of Statistics Confidence Interval Estimation Hypothesis Testing Linear Regression Analysis Introduction Case.
FINAL REPORT: OUTLINE & OVERVIEW OF SURVEY ERRORS
Maintenance of Selective Editing in ONS Business Surveys Daniel Lewis.
International Workshop on Industrial Statistics Dalian, China June 2010 Shyam Upadhyaya UNIDO Statistical units and data items.
Vienna, 23 April 2008 UNECE Work Session on SDE Topic (v) Editing on results (post-editing) 1 Topic (v): Editing based on results Discussants: Maria M.
1 WORLD TOURISM ORGANIZATION (UNWTO) MEASURING TOURISM EXPENDITURE: A UNWTO PROPOSAL SESRIC-UNWTO WORKSHOP ON TOURISM STATISTICS AND THE ELABORATION OF.
Using survey data collection as a tool for improving the survey process Silvia Biffignandi, Antonio Laureti Giulio Perani University of Bergamo Istat Istat.
The Edit Anders Norberg, Statistics Sweden (SCB) Work Session on Statistical Data Editing Ljubljana, Slovenia, 9-11 May 2011.
Electronic reporting in Poland 27th Voorburg Group Meeting Warsaw, Poland October 1st to October 5th, 2012 Central Statistical Office of Poland.
Volunteer Angler Data Collection and Methods of Inference Kristen Olson University of Nebraska-Lincoln February 2,
M ETADATA OF NATIONAL STATISTICAL OFFICES B ELARUS, R USSIA AND K AZAKHSTAN Miroslava Brchanova, Moscow, October, 2014.
12th Meeting of the Group of Experts on Business Registers
Multiple Indicator Cluster Surveys Survey Design Workshop Sampling: Overview MICS Survey Design Workshop.
Statistical Process Control
A generic tool to assess impact of changing edit rules in a business survey – SNOWDON-X Pedro Luis do Nascimento Silva Robert Bucknall Ping Zong Alaa Al-Hamad.
Eng.Mosab I. Tabash Applied Statistics. Eng.Mosab I. Tabash Session 1 : Lesson 1 IntroductiontoStatisticsIntroductiontoStatistics.
The Adoption of METIS GSBPM in Statistics Denmark.
Chapter Thirteen Validation & Editing Coding Machine Cleaning of Data Tabulation & Statistical Analysis Data Entry Overview of the Data Analysis.
Eurostat Overall design. Presented by Eva Elvers Statistics Sweden.
Support for design of statistical surveys at Statistics Sweden
Deliverable 2.6: Selective Editing Hannah Finselbach 1 and Orietta Luzi 2 1 ONS, UK 2 ISTAT, Italy.
Update on Selective Editing and Implications for Staff Skills International Trade Conference September 2008 Ken Smart.
Applying Process Indicators to Monitor the Editing Process.
A Strategy for Prioritising Non-response Follow-up to Reduce Costs Without Reducing Output Quality Gareth James Methodology Directorate UK Office for National.
The application of selective editing to the ONS Monthly Business Survey Emma Hooper Office for National Statistics
Jeroen Pannekoek - Statistics Netherlands Work Session on Statistical Data Editing Oslo, Norway, 24 September 2012 Topic (I) Selective and macro editing.
United Nations Economic Commission for Europe Statistical Division Mapping Data Production Processes to the GSBPM Steven Vale UNECE
DATA PREPARATION: PROCESSING & MANAGEMENT Lu Ann Aday, Ph.D. The University of Texas School of Public Health.
ICES 2007 Labour Cost Index and Sample Allocation Outi Ahti-Miettinen and Seppo Laaksonen Statistics Finland (+ University of Helsinki) Labour cost index.
CBS-SSB STATISTICS NETHERLANDS – STATISTICS NORWAY Work Session on Statistical Data Editing Oslo, Norway, September 2012 Jeroen Pannekoek and Li-Chun.
1 Selective data editing Development & implementation Q 2010 Helsinki Jörgen Svensson Process Owner Statistics Sweden (SCB)
Paul P. Biemer RTI International Lars E. Lyberg Statistics Sweden I ntroduction to S urvey Q uality.
StatisticalData Editing Anders Norberg, Statistics Sweden (SCB)
A Quality Driven Approach to Managing Collection and Analysis
Topic (iii): Macro Editing Methods Paula Mason and Maria Garcia (USA) UNECE Work Session on Statistical Data Editing Ljubljana, Slovenia, 9-11 May 2011.
Outlining a Process Model for Editing With Quality Indicators Pauli Ollila (part 1) Outi Ahti-Miettinen (part 2) Statistics Finland.
Copyright 2010, The World Bank Group. All Rights Reserved. Testing and Documentation Part II.
Developing and applying business process models in practice Statistics Norway Jenny Linnerud and Anne Gro Hustoft.
Topic (i): Selective editing / macro editing Discussants Orietta Luzi - Italian National Statistical Institute Rudi Seljak - Statistical Office of Slovenia.
1 Towards a common statistical enterprise architecture Ongoing process reengineering at Statistics Sweden Service Oriented Architecture – SOA Sharing of.
United Nations Oslo City Group on Energy Statistics OG7, Helsinki, Finland October 2012 ESCM Chapter 8: Data Quality and Meta Data 1.
A selective editing method considering both suspicion and potential impact, developed and applied to the Swedish foreign trade statistics Topic (ii), WP.
Recent development in the metadata area at Statistics Sweden Klas Blomqvist
1 Processorientated statistical production IAOS Conference, October 16, 2008 Åke Bruhn, Director, Process Dept, Statistics Sweden.
Evaluating the benefits of using VAT data to improve the efficiency of editing in a multivariate annual business survey Daniel Lewis.
5.8 Finalise data files 5.6 Calculate weights Price index for legal services Quality Management / Metadata Management Specify Needs Design Build CollectProcessAnalyse.
1 Testing Statistical Hypothesis The One Sample t-Test Heibatollah Baghi, and Mastee Badii.
SECTION 7.2 Estimating a Population Proportion. Where Have We Been?  In Chapters 2 and 3 we used “descriptive statistics”.  We summarized data using.
1 Process Orientation at statistics Sweden – Implementation and Initial Experiences IAOS Conference, October 15, 2008 Mats Bergdahl, Deputy Director Process.
TRITON - An event driven SOA architecture MSIS Jakob Engdahl, Statistic Sweden
FDI - Imputation. Overview Introduction Overview of Imputation Methods Overview of Outliering methods Overview of Estimation methods Aggregation Disclosure.
Copyright 2010, The World Bank Group. All Rights Reserved. Producer prices, part 2 Measurement issues Business Statistics and Registers 1.
Systems Analysis Lecture 5 Requirements Investigation and Analysis 1 BTEC HNC Systems Support Castle College 2007/8.
Survey phases, survey errors and quality control system
Improving the efficiency of editing in ONS business surveys
Survey phases, survey errors and quality control system
Tomaž Špeh, Rudi Seljak Statistical Office of the Republic of Slovenia
Validation at Statistics Sweden
Jeroen Pannekoek, Sander Scholtus and Mark van der Loo
Data processing German foreign trade statistics
Mapping Data Production Processes to the GSBPM
Metadata used throughout statistics production
Presentation transcript:

Statistical Data Editing Anders Norberg, Statistics Sweden (SCB)

Editing Editing is an activity of detecting, resolving and understanding errors in data and produced statistics 2

Selective data editing The purpose of selective data editing is to reduce cost for the statistical agency as well as for the respondents, without significant decrease of the quality of the output statistics. 3

Papers by my colleague Leopold Granquist Granquist (1984). On the role of editing. Statistical Review 2 Granquist (1997). The New View on Editing. International Statistical Review Granquist and Kovar (1997). Editing of Survey Data: How Much is Enough? In Survey Measurement and Process Quality. Wiley.

If … we only want information from businesses that we know they have, and we ask for that information so they understand, and we motivate them to deliver as good quality in data as possible, and we help them to avoid accidental errors in answering questionnaires, then editing would be a minor process! 5

Where errors are introduced Errors in raw data delivered by respondents to the statistical agency are typically non-response and measurement errors Errors in data transmissions The statistics production process is a mixture of many activities with risks of introducing errors 6

The role of editing Quality Control of the measurement process –Find errors (efficient controls) –Consider every identified error as a problem for the respondent to deliver correct data by our collection instrument –Identify sources of error (process data) –Analyse process data – communicate with cognitive specialists Contribute to quality declaration Adjust (change/correct) significant errors 3

Types of errors  Obvious errors / Fatal errors  Non-valid values  Item non-response  Data structure- or model errors, total≠sum of components  Contradictions  Suspected data values  Deviation errors (Outliers) Suspiciously high/low values, data outside of predetermined limits  Definition errors (Inliers) Many respondent miss-understand a question in the same way Many respondents fetch data from info-systems with other definitions

Suspected data values Deviation errors Manual follow-up takes time and is expensive Few deviation errors have impact on output statistics (low hit-rate, many changes in data have very little impact) Editing must have impact on the output! Remember response burdon !

Suspected data values Definition errors (Inliers) Difficult to find Ways to find them:  Combined editing for several surveys  Deep interviews in focus groups  Use statistics from FEQ and from re-contacts with respondents  High proportions of item non-response  Graphical editing  Good examples

The Process Perspective Audit and improve data collection (measurement instrument and collection process) and the editing process itself Un-edited data must be saved in order to produced important process indicators, as hit- rate and impact on output!

Process data Sources of errors = problem for the respondents Suspicions Error codes Manuel actions (accept / amended values) Automatic actions

Process indicators Sources of errors (problem for the respondents) Prop. of flagged units and variables Prop. of manually and automatically reviewed units and variables Prop. of amended values and impact of the changes, per variable Hit-rate for edits

14 Establish needs 1 Plan and design 2 Create and test 3 Collect 4 Prepare and process 5 Analyse 6 Report and communicate 7 Statistical Production Process survey statistical needs 1.1 Affirm customer needs 1.2 Develop table plan 1.3 Identify data sources 1.4 Examine disclosure 1.5 Carry out market process 1.6 Prepare data and statistical values for dissemination 7.1 produce final output 7.2 Report and communicate final output 7.3 handle inquiries 7.4 Market final output 7.5 Classify and code micro data 5.1 Check micro data 5.2 Impute for Non-response 5.3 Complement Data set 5.4 Calculate weights 5.5 Establish final observation register 5.6 Create frame and draw sample 4.1 Handle respondent issues 4.2 Prepare data collection 4.3 Carry out data collection 4.4 Transfer and store data electronically 4.5 Plan and design table plan 2.1 Plan and design data collection 2.3 Plan and design data processing 2.4 Plan and design analysis and reporting 2.5 Plan production flow 2.6 Design production system 2.7 Create and test Measurem. instrument 3.1 Develop existing and building new production tools 3.2 Assure communication between production tools 3.3 Test production system 3.4 Carry out pilot test 3.5 Implement production tools 3.6 Implement production system 3.7 Produce statistical values 6.1 Quality assurance of produced statistics 6.2 Interpret and explain 6.3 Prepare contents for reporting and communication 6.4 Establish contents for reporting and communication 6.5 Plan and design frame/population and sample 2.2

15 Establish needs 1 Plan and design 2 Create and test 3 Collect 4 Prepare and process 5 Analyse 6 Report and communicate 7 Statistical Production Process survey statistical needs 1.1 Affirm customer needs 1.2 Develop table plan 1.3 Identify data sources 1.4 Examine disclosure 1.5 Carry out market process 1.6 Prepare data and statistical values for dissemination 7.1 produce final output 7.2 Report and communicate final output 7.3 handle inquiries 7.4 Market final output 7.5 Classify and code micro data 5.1 Check micro data 5.2 Impute for Non-response 5.3 Complement Data set 5.4 Calculate weights 5.5 Establish final observation register 5.6 Create frame and draw sample 4.1 Handle respondent issues 4.2 Prepare data collection 4.3 Carry out data collection 4.4 Transfer and store data electronically 4.5 Plan and design table plan 2.1 Plan and design data collection 2.3 Plan and design data processing 2.4 Plan and design analysis and reporting 2.5 Plan production flow 2.6 Design production system 2.7 Create and test Measurem. instrument 3.1 Develop existing and building new production tools 3.2 Assure communication between production tools 3.3 Test production system 3.4 Carry out pilot test 3.5 Implement production tools 3.6 Implement production system 3.7 Produce statistical values 6.1 Quality assurance of produced statistics 6.2 Interpret and explain 6.3 Prepare contents for reporting and communication 6.4 Establish contents for reporting and communication 6.5 Plan and design frame/population and sample 2.2

16 Establish needs 1 Plan and design 2 Create and test 3 Collect 4 Prepare and process 5 Analyse 6 Report and communicate 7 Statistical Production Process survey statistical needs 1.1 Affirm customer needs 1.2 Develop table plan 1.3 Identify data sources 1.4 Examine disclosure 1.5 Carry out market process 1.6 Prepare data and statistical values for dissemination 7.1 produce final output 7.2 Report and communicate final output 7.3 handle inquiries 7.4 Market final output 7.5 Classify and code micro data 5.1 Check micro data 5.2 Impute for Non-response 5.3 Complement Data set 5.4 Calculate weights 5.5 Establish final observation register 5.6 Create frame and draw sample 4.1 Handle respondent issues 4.2 Prepare data collection 4.3 Carry out data collection 4.4 Transfer and store data electronically 4.5 Plan and design table plan 2.1 Plan and design data collection 2.3 Plan and design data processing 2.4 Plan and design analysis and reporting 2.5 Plan production flow 2.6 Design production system 2.7 Create and test Measurem. instrument 3.1 Develop existing and building new production tools 3.2 Assure communication between production tools 3.3 Test production system 3.4 Carry out pilot test 3.5 Implement production tools 3.6 Implement production system 3.7 Produce statistical values 6.1 Quality assurance of produced statistics 6.2 Interpret and explain 6.3 Prepare contents for reporting and communication 6.4 Establish contents for reporting and communication 6.5 Plan and design frame/population and sample 2.2

Average proportions of costs of sub-processes Process Proportion of total cost (%) All products Short-period Annual surveys and periodic Respondent service Manual pre-editing Data-registration editing Production editing Output editing Total editing cost

Web data collection Demands: High hit-rate in electronic questionnaires System that can measure hit-rate? Question: Can it be a goal for us to move all editing to electronic data collection?

Expectation on the production editing process at Stat. Sweden Generic IT-tools –Less IT-maintenance –Easier planning of work and personnel at Data collection units –Better working environments –Methodology studies Efficient editing methods –Selective/significance editing –Better working environments –Less response burden Collection and analysis of process data –Continuous improvement of data collection and editing processes –Information for quality declaration of statistics

Input, throughput, output

Impact Actual impact = w ( y_une – y_edi) for observation k is the impact on domain-total T if y_une is kept instead of making a review to find y_edi. Potential impact = w (y_edi – y_pred) is a proxy for actual impact to be used in practice, as y_edi will not be known until review. y_pred is a prediction (expected value) for y_edi. Anticipated (expected) impact (per domain, variable, observation) is the product of suspicion and potential impact.

Suspicion: Traditional edits

Selective data editing Construct a score function for prioritizing variables and records. Alternatives: 1) Potential impact on statistics for records flagged by at least one traditional edit 2) Sum of expected impact on statistics for variable values flagged as suspected by edits Norberg, A. et al. (2010): A General Methodology for Selective Data Editing. Statistics Sweden 23

Selective data editing Potential impact Suspicion 01 Flagged A procedure which targets only some of the micro data variables or records for review by prioritizing the manual work. 24

1

5

Predicted (expected) values Data / predictor Time series Previous value Forecast Cross section Mean/standard error Median/quartile Edit groups All data Blue collar workers White collars Monthly pay Weekly pay Profession=3111 Profession=3112 Payment by the hour Monthly pay Weekly pay Payment by the hour Profession =1Profession= 2Profession= 3Profession=9 Profession=3480 MenWomen Profession=

Suspicion R= Suspicion=R/(TAU+R) Susp

Score function Local score, by domain d, variable j & observed unit k,l is the anticipated impact related to an appropriate measure of size for the domain/variable, say standard error of estimate. VIOLIN j = weights for variables (j) CLARINET c = weights for domains by classifications d(c) OBOE j = adjustment for size of estimated total or its standard error (j) LScore d,j,k,l = Suspicion j,k,l x(Potential impact) x CELLO d(c),j

Score function Global scores are aggregated local scores by (5) domains, (4) variables, second stage units (3) to one score for each primary unit (2) and finally to (1) respondent. Methods: sum, sum of squares, sum of local scores truncated by local thresholds, maximum etc. 28

Cut-off or probability sampling? Say that 821 of the total sample (n=4 000) have a score >0. There are two options for manual review: –Cut-off sampling: Score2 >Threshold2, assuming the remaining bias is small –Two-phase sampling: πps-sampling and design-based estimation of measurement errors to subtract from initial estimates Ilves, K. (2010): Probability Approach to Editing. Workshop on Survey Sampling Theory and Methodology, Vilnius, Lithuania, August 23-27,

Evaluation Relative pseudo-bias is a measure of error in output due to incomplete data review

Evaluation Psedobias for PPI relative to the overall price index. Observation units ordered in descending order of impact.

Editing – remaining methodology issues  Fatal errors –Classifying variables –Survey variables  Confidence (respondents and clients)  New and old respondents  Edited in earlier processes –Web-questionnaires –Scanned paper questionnaires  Data and methods for computing predicted values etc.  Homogenous edit groups  Priorities; variables, domains (from the clients perspective)  Score functions  How to decide threshold values  Sampling below threshold –Inference –Data for evaluation

SELEKT 1.1 Survey specific cold adapter (SAS code) Data preparation SAS data set PRE-SELEKT Parameter specifications, Analysis of cold data AUTOSELEKT Score calculation & record flagging Records to FOLLOW-UP Process data and reports Input (hot) survey data Records to IMPUTATION Raw+edited past (cold) survey data Survey specific hot adapter (SAS code) Data preparation SAS data set Table of Parameters Table of Estimates Accepted records CLAN estimation software SNOWDON -X analysis of edits Edits 42

SELEKT Parameters (in) Give AUTO - SELEKT the parameters: %let PATH_sys=C:\SELEKT\1.0; %let PATH_app=C:\SELEKT\Prod\Demo1_Enterprise; %let EDIT_parms=Ent_Parms1; %let EDIT_data=Demo1_Adap_Ent_Inflowdata; %let EDIT_T1_Value=2009; %let EDIT_T2_Value=1; By the parameter table &EDIT_parms AUTO - SELEKT knows what to do.

SELEKT Error list (out) Identification: Column name = Variable Id1 = Identity for respondent[optional] Id2 = Identity for primary sampling unit (PeOrgnr, CfarNr etc.) Id3 = Identity for observational unit (Social security number, CN8 for products) EditNumber = Edit identification, if the edit flags for suspicions or obvious error. Timestamp = Time when the questionnaire passes SELEKT Process data: EditFlag = 0 = accepted, 1-5 = error flagged EditSuspicion = Suspicion generated by continuous edits. Score1 = (Local) Score for respondent[optional] Score2 = (Local/Global) Score for primary sampling unit [optional] Score3 = (Global) Score for observational unit [optional] N_Obs = Number of observations, which have gone through the edit round. N_Obs_Flagg = Number of error flagged observations in the PSU, on this list N_PSU = Number of PSU for the respondent, which have passed the edit round N_PSU_Flagg = Number of error flagged primary sampling units, on this list

Edits EDIT GROUP AND ACCEPTANCE REGION Edit identification Edit group Acceptance range EDIT Edit identification Type of edit Active Section Internal error message External error message Instruction for data review Un-edited test variable Error flag KEY Edit identification Survey variable IMPACT ON STATISTICS Survey variable Potential impact on statistics FLAGGING OBJECTS EDIT PRACTICAL SUPPORT Edit identification Standard edit rule Edited test variable Suspicion probability value produced by the SELEKT system 2 1