Data Editing United Nations Statistics Division (UNSD) 16 March 2011 Santiago, Chile.

Presentation transcript:

Data Editing United Nations Statistics Division (UNSD) 16 March 2011 Santiago, Chile

2 Editing and Imputation Defined
Data editing: identification and flagging of missing, invalid, inconsistent or anomalous entries
Imputation: resolves problems identified in editing

3 Editing and Imputation Process Flow

4 A General Editing and Imputation Process
1. Identify and treat initial errors
   At the data capture stage
   At the data entry stage
   Ex: Data entered into a table is shifted by a row
2. Identify and treat errors
   a: Interactively/manually treat influential errors
   b: Automatically treat non-influential errors
3. Check the aggregated output

5 Editing and Imputation Process Flow

6 Editing Errors
Two categories of errors
– Systematic: reported consistently by some of the respondents
  Ex: Gross values are reported instead of net values
  Ex: Units are reported in thousands
– Random: non-systematic or caused by accident
  Ex: An extra digit is accidentally typed in the response
Manifestations of errors can be systematic or random
– Missing
  Ex: A variable is left blank because the respondent does not know the answer to the question, does not want to answer the question or does not understand the question
– Outliers: values that deviate from a model
  Ex: Unanticipated large values as compared to historic trend
– Violation of logical or consistency rules
  Ex: A total value is larger than the sum of its components
Edit rules are used to detect errors and often define how they should be treated

7 Systematic Errors
Errors that are reported consistently over time.
– Unit error
  Ex: $x_{t-1} / x_t \le 300$
– Sign error
– Bugs in the collection vehicle
– Misunderstanding a question or skip rules
  Ex: systematic missing values
Detection
– High failure rates of edits
– Outlier detection (e.g. for unit errors)
– Knowledge of the survey and the raw data processing
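The unit-error edit on this slide can be sketched in a few lines. This is a minimal illustration, not the presenters' implementation: the function name `flag_unit_error` is hypothetical, the threshold of 300 is taken from the slide, and checking the ratio in both directions is an assumption (a respondent may switch into or out of reporting in thousands).

```python
def flag_unit_error(previous, current, max_ratio=300.0):
    """Return True when two consecutive reports differ by a factor above max_ratio.

    Assumption: a jump of this size between periods suggests a change of
    reporting unit (e.g. thousands vs. units) rather than a real change.
    """
    if previous <= 0 or current <= 0:
        # The ratio is undefined or meaningless; leave the record to other edits.
        return False
    # Check both directions (assumption, the slide shows only x_{t-1} / x_t).
    ratio = max(previous / current, current / previous)
    return ratio > max_ratio


# Example: last period 1,200,000 was reported, this period 1,200 (now in thousands).
print(flag_unit_error(1_200_000, 1_200))  # flagged
print(flag_unit_error(1_200, 1_150))      # not flagged
```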

8 Systematic Errors (2)
Suggestions
Improvements in the survey or processing procedures should be made
When systematic errors are identified, they should be turned into edit rules
Detecting and correcting systematic errors is cost-effective
Systematic errors should be treated before random errors

9 Missing Values
Stem from questions a respondent did not answer
Detection is usually simple
Suggestions
Do not ignore missing values (→ bias and loss of estimate precision)
– Missing values may not be missing at random
Do not replace with zeros (→ inaccurate results)
Nonresponse indicators should be compiled and analyzed because missing values may be systematic
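One simple nonresponse indicator is the item nonresponse rate per variable. The sketch below assumes records are dictionaries and that a missing answer is stored as `None`; the function and field names are hypothetical. A reported zero is deliberately not counted as missing, in line with the advice not to replace missing values with zeros.

```python
def item_nonresponse_rates(records, variables):
    """Return the share of records with a missing value, per variable.

    Assumption: missing answers are stored as None (or the key is absent).
    A reported 0 counts as a valid response.
    """
    if not records:
        raise ValueError("at least one record is required")
    rates = {}
    for var in variables:
        missing = sum(1 for rec in records if rec.get(var) is None)
        rates[var] = missing / len(records)
    return rates


records = [
    {"turnover": 10, "employees": None},
    {"turnover": None, "employees": None},
    {"turnover": 5, "employees": 3},
    {"turnover": 0, "employees": 1},
]
print(item_nonresponse_rates(records, ["turnover", "employees"]))
```

A variable with an unusually high rate is a candidate for the systematic causes mentioned on the slide, such as a misunderstood question or a faulty skip rule.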

10 Outliers
Observations that do not fit well to a model
– Ex: Median - k*IQR < value < Median + k*IQR
– Ex: Month-on-month change <= 50%
May be defined by one variable (univariate) or a set of variables (multivariate)
Two types
– Representative: a correct value, with similar units in the population
– Non-representative: either incorrect, or correct but unique
  Ex (correct but unique): isolated labor strike at a plant
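The first example, the interval Median ± k*IQR, can be sketched with the Python standard library. The function name and the default `k=3.0` are assumptions for illustration; the slide does not state a value for k.

```python
import statistics


def iqr_outliers(values, k=3.0):
    """Return the values that fall outside (median - k*IQR, median + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    med = statistics.median(values)
    iqr = q3 - q1
    lower, upper = med - k * iqr, med + k * iqr
    return [v for v in values if not (lower < v < upper)]


values = [10, 11, 12, 13, 14, 15, 16, 17, 18, 500]
print(iqr_outliers(values))
```

Because the median and IQR are themselves robust to extreme values, a single very large observation does not widen the acceptance interval much, which is the reason for preferring them to the mean and standard deviation here.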

11 Outliers (2)
Detection
– Univariate
– Multivariate
– Periodic data (e.g. Hidiroglou-Berthelot)
– Regression models or tree-models
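For periodic data, the Hidiroglou-Berthelot method is usually described as follows: take the period-to-period ratio for each unit, transform it symmetrically around the median ratio, scale it by the size of the unit, and flag units whose scaled effect lies far from the median effect in quartile-distance terms. The sketch below follows that common description; the slides give no formulas, so the function name and the parameter defaults (`U`, `A`, `C`) are illustrative assumptions.

```python
import statistics


def hb_outliers(pairs, U=0.5, A=0.05, C=4.0):
    """Return indices of (previous, current) pairs flagged as outliers.

    All values must be strictly positive. U (size exponent), A (guard
    against a tiny quartile distance) and C (interval width) are tuning
    parameters; the defaults here are assumptions, not recommendations.
    """
    ratios = [curr / prev for prev, curr in pairs]
    r_med = statistics.median(ratios)

    effects = []
    for (prev, curr), r in zip(pairs, ratios):
        # Symmetric transformation of the ratio around the median ratio.
        s = 1 - r_med / r if r < r_med else r / r_med - 1
        # Size adjustment: changes in larger units count for more.
        effects.append(s * max(prev, curr) ** U)

    e_q1, e_med, e_q3 = statistics.quantiles(effects, n=4)
    d_q1 = max(e_med - e_q1, abs(A * e_med))
    d_q3 = max(e_q3 - e_med, abs(A * e_med))
    lower, upper = e_med - C * d_q1, e_med + C * d_q3
    return [i for i, e in enumerate(effects) if e < lower or e > upper]


pairs = [(100, 102), (200, 198), (150, 153), (80, 81),
         (120, 118), (300, 306), (90, 91), (110, 330)]
print(hb_outliers(pairs))  # indices of flagged units
```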

12 Edit Rules
Edit rules are used to determine whether a value is consistent or may be erroneous
– Surveys are often designed to allow these rules
Edit rules flag data in two ways
– Fatal edit: indicates a value that is (almost) certainly in error
– Query edit: indicates values that may be in error

13 Types of Edit Rules
Validation edits: often in the form of if-then statements
– Ex: if total hours worked > 0 then employees > 0
– Ex: if Σ production quantity > 0 then Σ production value > 0
– Ex: if revenue from manufacturing plant > 0 then
  1. hours worked by machinery technicians > 0
  2. plant capacity utilization > 0
  3. Σ production volume > 0
  4. Σ production value > 0
Balance edits: detail items must add to total
– Ex: total employee remuneration = wages + salaries + employer contributions to social security + welfare benefits + profits distributed to workers
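A validation edit and a balance edit from this slide can each be written as a function that returns True when the record passes. The field names and the rounding tolerance below are hypothetical; a real survey would use its own variable names and tolerance.

```python
def validation_edit(record):
    """If total hours worked > 0, then employees must be > 0."""
    return not (record["hours_worked"] > 0) or record["employees"] > 0


REMUNERATION_COMPONENTS = (
    "wages",
    "salaries",
    "employer_social_security",
    "welfare_benefits",
    "profits_distributed",
)


def balance_edit(record, tolerance=0.5):
    """Detail items must add to the reported total, up to a rounding tolerance."""
    detail_sum = sum(record[name] for name in REMUNERATION_COMPONENTS)
    return abs(record["total_remuneration"] - detail_sum) <= tolerance


record = {
    "hours_worked": 1600, "employees": 0,
    "wages": 400, "salaries": 300, "employer_social_security": 70,
    "welfare_benefits": 20, "profits_distributed": 10,
    "total_remuneration": 800,
}
print(validation_edit(record))  # fails: hours reported but no employees
print(balance_edit(record))     # passes: 400 + 300 + 70 + 20 + 10 = 800
```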

14 Types of Edit Rules (2)
Ratio edits: the ratio of two data items is bounded by lower and upper bounds. The pairs should be correlated.
– Ex: total hours/employee/day is between 6 and 10 (highly correlated)
– Ex: plant capacity utilization <= 20% change from previous month
– Ex: wages (W) should change within 10% of the change in total employment (E)
  $(E_t/E_{t-1} - 1) - 0.1 \le W_t/W_{t-1} - 1 \le (E_t/E_{t-1} - 1) + 0.1$
– Ex: Σ product value / Σ product quantity <= 10% change from previous month
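The wages-versus-employment example translates directly into code. The function name is hypothetical, and the sketch assumes the previous-period values are strictly positive.

```python
def wage_employment_edit(w_prev, w_curr, e_prev, e_curr, band=0.10):
    """Pass when the relative change in wages stays within `band`
    of the relative change in total employment."""
    wage_change = w_curr / w_prev - 1
    emp_change = e_curr / e_prev - 1
    return emp_change - band <= wage_change <= emp_change + band


# Employment up 2%, wages up 5%: within the 10-point band.
print(wage_employment_edit(1000, 1050, 100, 102))
# Employment up 2%, wages up 30%: outside the band, so the record is flagged.
print(wage_employment_edit(1000, 1300, 100, 102))
```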

15 Types of Edit Rules (3)
Hidiroglou-Berthelot is a particular type of ratio edit
– Ex: employee month-on-month change
  <= 100 employees: <= 50% change from previous month
  100 < employees <= 200: <= 20% change from previous month
  > 200 employees: <= 10% change from previous month
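The size-dependent tolerances in this example can be sketched as a lookup followed by a ratio check. Two assumptions are made here that the slide does not state: the size class is determined by the previous month's employee count, and the change is measured as an absolute relative change.

```python
def max_allowed_change(prev_employees):
    """Tolerance on the month-on-month change, by size class (from the slide)."""
    if prev_employees <= 100:
        return 0.50
    if prev_employees <= 200:
        return 0.20
    return 0.10


def employee_change_edit(prev_employees, curr_employees):
    """Pass when the relative change is within the tolerance for the size class.

    Assumption: prev_employees > 0 and the size class uses the previous month.
    """
    change = abs(curr_employees - prev_employees) / prev_employees
    return change <= max_allowed_change(prev_employees)


print(employee_change_edit(50, 70))    # 40% change, small unit: passes
print(employee_change_edit(150, 190))  # about 27% change, medium unit: flagged
print(employee_change_edit(500, 540))  # 8% change, large unit: passes
```

Tightening the tolerance as size grows reflects the idea that the same relative change is more plausible in a small unit than in a large one.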

16 Editing & Imputation Process
Interactive/Manual: a record with flagged data is manually reviewed, preferably by a subject matter expert
Automatic: a record with flagged data is automatically reviewed and corrected by a computer
Selective: designed to route edits/imputations into interactive or automatic streams
– based on influential vs. non-influential errors
Macro-editing

17 Editing and Imputation Process Flow

18 Selective Editing
Distinguishes between errors in values that have a significant influence on the survey estimate and those that are insignificant to the estimate
Selective editing splits raw data into two streams:
– critical stream: records that most likely contain influential errors, and records of large companies
– non-critical stream: records that are unlikely to contain influential errors
A score function determines which responses go into which stream

19 Selective Editing (2)
Local score function = influence * risk
For example: [formula figure defining Influence and Risk; labelled terms: raw value, anticipated value, sampling weight]
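The exact definitions of influence and risk did not survive in this transcript, so the sketch below uses one commonly seen form built from the same three ingredients: influence as the weighted anticipated value, and risk as the relative deviation of the raw value from the anticipated value. Treat both definitions, and the function name, as assumptions rather than the presenters' formulas.

```python
def local_score(raw, anticipated, weight):
    """Local score = influence * risk for one variable of one record.

    ASSUMED definitions (not recoverable from the slide):
      influence = sampling weight * anticipated value
      risk      = |raw - anticipated| / anticipated
    Requires anticipated > 0.
    """
    influence = weight * anticipated
    risk = abs(raw - anticipated) / anticipated
    return influence * risk


# A unit with weight 20 reports 1,500 where 1,000 was anticipated.
print(local_score(raw=1500, anticipated=1000, weight=20))
```

Under these definitions the product simplifies to the weighted absolute deviation, weight * |raw - anticipated|, which is the potential effect of the suspected error on the estimated total.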

20 Selective Editing (3)
Local score functions are aggregated into global score functions for each record
– First local scores are scaled, e.g. dividing observed values by mean values
– Scaled local scores are combined into a global score. For example: Minkowski metric (a common approach)
– The influence of large local scores increases with α
  α = 1: simple sum of local scores
  α = 2: Euclidean metric
  α → ∞: max local score
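The Minkowski combination of scaled local scores is the α-th root of the sum of their α-th powers. A minimal sketch follows; the function name is hypothetical, and the local scores are assumed to be non-negative and already scaled.

```python
import math


def global_score(local_scores, alpha=2.0):
    """Combine non-negative scaled local scores with the Minkowski metric."""
    if math.isinf(alpha):
        # Limit case: the global score is the largest local score.
        return max(local_scores)
    return sum(s ** alpha for s in local_scores) ** (1.0 / alpha)


scores = [3.0, 4.0]
print(global_score(scores, alpha=1))         # simple sum
print(global_score(scores, alpha=2))         # Euclidean metric
print(global_score(scores, alpha=math.inf))  # max local score
```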

21 Selective Editing (4)
A global score (GS) cut-off threshold must be determined
– All records above the cut-off are selected for interactive editing
– A simulation can be performed on previous data to determine a threshold
  Raw unedited values and corresponding edited values are used
  The first p% of records are edited and the resultant estimate is compared with the fully edited estimate
  Trial and error over p leads to the point where the two estimates are (nearly) the same, which gives the corresponding cut-off value
Alternatively, a threshold doesn't need to be used
– Records can be edited in priority order until time or budget constraints tell one to stop
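The simulation on previous data can be sketched as follows: rank records by global score, replace raw values with edited values one record at a time, and stop when the partially edited estimate is close enough to the fully edited one. The record layout, the weighted-total estimator, and the 1% tolerance are assumptions for illustration.

```python
def find_cutoff(records, tolerance=0.01):
    """Return (number of records to edit, cut-off score).

    Each record is a dict with keys "score", "weight", "raw" and "edited"
    (hypothetical layout). The estimate is a weighted total.
    """
    ranked = sorted(records, key=lambda r: r["score"], reverse=True)
    full = sum(r["weight"] * r["edited"] for r in ranked)
    estimate = sum(r["weight"] * r["raw"] for r in ranked)

    if abs(estimate - full) <= tolerance * abs(full):
        return 0, None  # raw data are already close enough
    for n_edited, rec in enumerate(ranked, start=1):
        # Swap this record's raw value for its edited value.
        estimate += rec["weight"] * (rec["edited"] - rec["raw"])
        if abs(estimate - full) <= tolerance * abs(full):
            return n_edited, rec["score"]
    return len(ranked), ranked[-1]["score"]


history = [
    {"score": 9.0, "weight": 1, "raw": 5000, "edited": 500},
    {"score": 0.5, "weight": 1, "raw": 102, "edited": 100},
    {"score": 3.0, "weight": 1, "raw": 400, "edited": 300},
    {"score": 0.1, "weight": 1, "raw": 200, "edited": 200},
]
print(find_cutoff(history))
```

In this toy data set, editing the two highest-scoring records brings the total within 1% of the fully edited total, so the score of the second record serves as the cut-off.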

22 Selective Editing (5)
A score function can be augmented in many ways
– E.g. size criteria where large enterprises are always selected for the critical stream (influence irrespective of risk)
Selective editing improves efficiency

23 Macro-Editing
Macro-editing techniques account for the distribution of variables and for the plausibility of estimates
Two forms of macro-editing
– Aggregation method
– Distribution method

24 Macro-Editing - Aggregation
Verifies whether the figures to be published seem plausible
Compare estimates with
– Previous estimate values
– Values from other related sources
– Related estimates (such as electricity production and consumption)
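The comparison with previous estimate values can be sketched as a relative-change check on each published aggregate. The function name, the dictionary layout and the 15% threshold are assumptions; in practice the threshold would depend on the volatility of each series.

```python
def implausible_aggregates(current, previous, max_rel_change=0.15):
    """Return aggregates whose relative change from the previous estimate
    exceeds max_rel_change, mapped to that change."""
    flagged = {}
    for name, value in current.items():
        prev = previous.get(name)
        if prev is None or prev == 0:
            continue  # no usable reference value; review by other means
        change = (value - prev) / prev
        if abs(change) > max_rel_change:
            flagged[name] = change
    return flagged


current = {"manufacturing": 130.0, "mining": 51.0}
previous = {"manufacturing": 100.0, "mining": 50.0}
print(implausible_aggregates(current, previous))
```

A flagged aggregate is not necessarily wrong; it points the analyst to the contributing records that deserve a closer look.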

25 Macro-Editing - Distribution
Available data are used to characterize the distribution of variables
Individual values are compared with this distribution
Records that contain uncommon values may require further inspection and possibly editing

26 Macro-Editing Example: Graphical Editing
[Figures: univariate plot; bivariate scatter plot]

27 Editing and Imputation Process Flow