Data Validation practice in Statistics Lithuania

Presentation transcript:

Data Validation practice in Statistics Lithuania Nadežda Fursova, Jūratė Petrauskienė 9–11 November 2015, Wiesbaden Workshop ValiDat Foundation

Structure of Statistics Lithuania
- MANAGEMENT DIVISION: Director General & 4 Deputy Directors General
- DATA PREPARATION DIVISIONS: Territorial (in 5 cities)
- GENERAL ACTIVITY DIVISIONS: IT Development, Document Management, Internal Audit, etc.
- STATISTICS DIVISIONS: Methodology & Quality, National Accounts, Price Statistics, Labour Statistics, etc.

Data validation, editing and imputation process
Raw data -> Stage 1: Initial (primary) data validation and editing -> Stage 2: Further (secondary) data validation and editing -> Final data
Imputation is usually done at Stage 2.

Initial data editing: who validates (1)
- Initial data validation is usually performed by the specialists of the 5 territorial Data Preparation divisions of Statistics Lithuania. These divisions are responsible for collecting data from respondents (economic entities) and entering them into the database.
- In some cases data are collected and entered into the database not by the Data Preparation divisions but by the respective statistics divisions, which then also perform the initial data validation.
- In household (population) surveys, initial data validation is performed by the interviewers who collect data from respondents and enter them into the database.
- In some price statistics surveys (e.g. the survey of consumer prices of goods and services), initial data validation is performed by the price collectors, who register prices, enter them using special programs on their mobile devices or computers, and transfer them to the Price Statistics Division.

Initial data editing: who validates (2)
- In almost all surveys (with only a few exceptions) automatic data control is set up for the data entry process (using IT tools).
- In most surveys where respondents are economic entities, data are collected using either a paper questionnaire (the filled-in form is sent to Statistics Lithuania by mail or fax) or an e-form of the questionnaire (filled in and transmitted to Statistics Lithuania via the special IT system e-Statistics (e.Statistika), http://estatistika.stat.gov.lt/, or by e-mail).
- When respondents fill in an e-form they partly perform the initial editing themselves: e-questionnaires contain automatic primary data checks (validation rules). Respondents have to correct the mistakes; otherwise the questionnaire will not be accepted (the respondent will not be allowed to finish and save it).

Initial data validation: types of validation rules
Primary data checks applied during data entry (into the database or an e-form of a questionnaire):
- Fatal edits (or hard edits) identify errors with certainty. Data that do not satisfy this type of validation rule must be corrected; otherwise the questionnaire will not be accepted.
- Query edits (or soft edits) point to suspicious data items that may be in error. Data that do not satisfy this type of validation rule may be left uncorrected; an explanation may be required.
All the validation rules used during the data entry process are documented in programming work technical tasks (a standard form). The set of validation rules to be applied is not standardized: it is decided separately for each statistical survey.

Initial data validation: validation rules (1)
Validation rules commonly applied during initial data validation:
Validity check (valid data type, field length, correspondence to a certain code list, etc.). Examples:
- only integer numbers should be entered
- date format should be YYYY-MM-DD
- ID code should consist of 8 digits
- a country of birth should contain only entries from a list of valid ISO country codes
Missing values check. Examples:
- if the answer to question No. X is "YES", then question No. Y should be answered
- if the answer to question No. X is "NO", then the answer to question No. Y should be missing
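The validity and missing-value checks listed above could be sketched roughly as follows. This is only an illustration, not the implementation used by Statistics Lithuania: the function names and the short country code list are hypothetical, and a real survey would load the full ISO code list.

```python
import re
from datetime import datetime

# Hypothetical, deliberately short code list (assumption for illustration).
VALID_COUNTRY_CODES = {"LT", "LV", "EE", "PL", "DE"}


def is_integer(value: str) -> bool:
    """Validity check: only integer numbers should be entered."""
    return re.fullmatch(r"-?[0-9]+", value) is not None


def is_iso_date(value: str) -> bool:
    """Validity check: the date must be written as YYYY-MM-DD and exist."""
    if re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value) is None:
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:  # e.g. 2015-02-30 has the right shape but is no date
        return False
    return True


def is_valid_id(value: str) -> bool:
    """Validity check: the ID code should consist of exactly 8 digits."""
    return re.fullmatch(r"[0-9]{8}", value) is not None


def is_valid_country(value: str) -> bool:
    """Validity check: the entry must come from the country code list."""
    return value in VALID_COUNTRY_CODES


def skip_pattern_ok(answer_x, answer_y) -> bool:
    """Missing values check: Y is required if X is YES, forbidden if X is NO.

    A missing answer is represented by None (assumption).
    """
    if answer_x == "YES":
        return answer_y is not None
    if answer_x == "NO":
        return answer_y is None
    return True  # other answers to X are left to other rules
```

Each function returns True when the value passes the rule, so a data entry program could combine them freely and decide per survey which failures are fatal and which are only queried.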

Initial data validation: validation rules (2)
Mathematical and logical checks (identity check, range check, compatibility of variables, etc.). Examples:
- field A + field B + field C = field D
- field A + field B + field C <= field D
- 0.01 < production (units) made / production (units) sold < 100
- 0.5 < turnover (current month) / turnover (previous month) < 2
- if an enterprise is operating then turnover > 0
- if employment status == "old-age pensioner" then age > 54:
  fatal (hard) edit: if employment status == "old-age pensioner" and age < 35, then the error message "Too young to be an old-age pensioner!" shows up.
  query (soft) edit: if employment status == "old-age pensioner" and 35 <= age <= 54, then the error message "Too young to be an old-age pensioner?" shows up.
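A minimal sketch of how some of these rules might be coded, with the pensioner rule split into a fatal and a query edit as described above. The EditResult type and the function names are hypothetical. Treating the sum identity as fatal and the turnover ratio as a query is an assumption made for the example; the slides do not say which severity those two rules have.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EditResult:
    severity: str  # "fatal" (hard edit) or "query" (soft edit)
    message: str


def check_pensioner_age(status: str, age: int) -> Optional[EditResult]:
    """Return a failed edit for implausibly young pensioners, else None."""
    if status != "old-age pensioner":
        return None
    if age < 35:
        return EditResult("fatal", "Too young to be an old-age pensioner!")
    if age <= 54:  # here 35 <= age <= 54
        return EditResult("query", "Too young to be an old-age pensioner?")
    return None


def check_sum_identity(a: float, b: float, c: float, d: float,
                       tol: float = 1e-9) -> Optional[EditResult]:
    """Identity check: field A + field B + field C = field D."""
    if abs(a + b + c - d) > tol:
        return EditResult("fatal", "A + B + C must equal D")  # assumed fatal
    return None


def check_turnover_ratio(current: float, previous: float,
                         low: float = 0.5,
                         high: float = 2.0) -> Optional[EditResult]:
    """Range check: low < turnover (current) / turnover (previous) < high."""
    if previous <= 0:
        return None  # ratio undefined; left to other rules
    ratio = current / previous
    if not (low < ratio < high):
        return EditResult("query", "Unusual month-on-month turnover change")
    return None
```

Returning None for a passed rule and a small result object for a failed one lets the data entry program block saving on any "fatal" result while only asking for an explanation on "query" results.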

Further data validation and editing: the process (1)
When initial data validation and editing is done, specialists of statistics divisions continue the process:
- analyze collected data
- analyze error reports
- check missing values
- impute if necessary
- check the distribution of variables
- detect outliers
- analyze outliers' influence on aggregates
- compare primary and aggregated data to available additional information
- analyze time series
- validate final data

Further data validation: validation rules
Validation rules commonly applied during further data validation and editing in Statistics Lithuania:
- Boundary rule
- Outlier detection based on a normal distribution or empirical quartiles
- Outlier detection using linear regression methods
- Graphic methods (box plot, scatter plot, histogram)
- Comparison to previous period data and available additional information (administrative data, data from other surveys)

Boundary rule
The boundary rule can be applied when 2 or more variables are not linked by exact mathematical formulas but can be expressed in an approximate relation (e.g. if the variable X = a then the values of the variable Y should be between b and c).
Note: if possible (e.g. if additional information is not necessary, if all related variables are in the same questionnaire, etc.), this rule is already applied during the initial validation process.
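As a rough illustration, a boundary rule can be stored as a lookup from each value a of X to the plausible range (b, c) of Y. The variable meanings and the bounds below are invented for the example and are not taken from any Statistics Lithuania survey; treating the bounds as inclusive is also an assumption.

```python
# Hypothetical bounds: for each value a of X (say, an activity group),
# the plausible range (b, c) of Y (say, an average monthly wage).
BOUNDS = {
    "retail": (400.0, 3000.0),
    "construction": (500.0, 4000.0),
}


def violates_boundary_rule(x: str, y: float, bounds=BOUNDS) -> bool:
    """Return True if Y lies outside the range expected for this X."""
    if x not in bounds:
        return False  # no boundary rule defined for this value of X
    b, c = bounds[x]
    return not (b <= y <= c)
```

Because the relation is only approximate, a violation would normally be treated as a signal for review rather than as a certain error.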

Outlier detection based on a normal distribution (1)
Suppose X is a normally distributed variable with a mean of $\mu$ and a standard deviation of $\sigma$.
[Figure: theoretical density function of X]
Let's denote:
$\bar{x}$ – the sample mean of the variable X
$s$ – an estimate of the standard deviation of X

Outlier detection based on a normal distribution (2)
Outliers are X values that fall outside the interval $(\bar{x} - c_1 s,\ \bar{x} + c_2 s)$, where $\bar{x}$ is the sample mean and $s$ the estimated standard deviation of X. Here $c_1$ and $c_2$ are arbitrary constants, e.g. equal to the 0.975 quantile of the standard normal distribution: $z_{0.975} \approx 1.96$.
The most common intervals are: [formulas missing from the transcript]
Note: this outlier detection method is also used when the variable X has an approximately normal or symmetric distribution (the data histogram is close to a normal curve).
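A minimal sketch of this rule using only the Python standard library. The function name is hypothetical, and defaulting both constants to the 0.975 quantile of the standard normal distribution simply follows the example given on the slide.

```python
from statistics import NormalDist, mean, stdev


def normal_outliers(values, c1=None, c2=None):
    """Return the values outside (mean - c1 * s, mean + c2 * s).

    If a constant is not supplied it defaults to the 0.975 quantile of the
    standard normal distribution (about 1.96). Needs at least two values.
    """
    z = NormalDist().inv_cdf(0.975)
    c1 = z if c1 is None else c1
    c2 = z if c2 is None else c2
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    low, high = m - c1 * s, m + c2 * s
    return [v for v in values if v < low or v > high]
```

Note that the mean and the standard deviation are themselves pulled towards extreme values, so a single very large observation widens the interval it is tested against.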

Outlier detection based on empirical quartiles
Suppose the distribution of the variable X is far from normal (e.g. asymmetric). Let's denote:
$Q_1$ – the first empirical quartile of X
$Q_3$ – the third empirical quartile of X
$IQR = Q_3 - Q_1$ – the interquartile range
$c_1$ and $c_2$ – arbitrary constants
Then outliers are those X values that fall outside the interval $(Q_1 - c_1 \cdot IQR,\ Q_3 + c_2 \cdot IQR)$.
The most common intervals are: [formulas missing from the transcript]
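The quartile rule could be sketched as below. The function name is hypothetical. The default constants of 1.5 are the conventional box plot fences, used here as an assumption because the constants from the slide did not survive. Quartiles are computed with the default method of statistics.quantiles; other quartile definitions give slightly different cut-offs.

```python
from statistics import quantiles


def quartile_outliers(values, c1=1.5, c2=1.5):
    """Return the values outside (Q1 - c1 * IQR, Q3 + c2 * IQR)."""
    q1, _median, q3 = quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    low, high = q1 - c1 * iqr, q3 + c2 * iqr
    return [v for v in values if v < low or v > high]
```

Unlike the rule based on the mean and standard deviation, the quartiles are barely affected by a few extreme values, which is why this variant suits asymmetric data.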

Outlier detection using linear regression methods (1)
Suppose we have two linearly dependent variables X and Y, so that a linear regression model can be applied. An outlier is a two-dimensional observation that deviates strongly from the regression line.
Example: X – turnover in euros (survey data), Y – turnover estimated from VAT (administrative data).
[Figure: example plot with one observation labelled "outlier"]

Outlier detection using linear regression methods (2)
Statistical packages compute various statistics (measures) for outlier detection. Measures used in Statistics Lithuania:
- Leverage
- Standardized residuals
- Cook's distance (Cook's D)
- DFBETAs
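For a simple regression of Y on X, three of these measures can be computed by hand, as in the sketch below. It uses the standard textbook formulas for leverage, internally standardized residuals and Cook's distance. It is not the statistical package used by Statistics Lithuania, the function name is hypothetical, and DFBETAs are left out to keep the example short.

```python
from math import sqrt


def regression_diagnostics(x, y):
    """Leverage, standardized residual and Cook's D for each observation.

    Fits y = intercept + slope * x by ordinary least squares. Assumes that
    x is not constant and that the fit is not perfect (otherwise a division
    by zero occurs).
    """
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)  # residual variance

    rows = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - mean_x) ** 2 / sxx
        std_residual = e / sqrt(mse * (1 - leverage))
        cooks_d = std_residual ** 2 * leverage / (p * (1 - leverage))
        rows.append({"leverage": leverage,
                     "std_residual": std_residual,
                     "cooks_d": cooks_d})
    return rows
```

Observations with a large standardized residual deviate from the regression line, while a large Cook's D marks observations whose removal would noticeably change the fitted line. Which cut-off values to use is a choice for the survey specialist.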

Validation methods usage
Statistics Lithuania carries out approx. 110 surveys based on statistical questionnaires (other surveys are based on administrative data). The Methodology and Quality division has recently polled survey managers on the data validation methods used in those surveys during the further validation stage. The results are:
- Outlier detection based on linear regression models is applied in only a few surveys.
- Outlier detection based on a normal distribution or empirical quartiles is not very common either (it is applied in approx. 10 surveys).
- Graphic methods are used in slightly more than 10 surveys. The boundary rule is used about as often during the further validation process.
- Comparison to previous period data and available additional information (especially comparison of aggregated data (estimates)) is made in all surveys, but usually this is not a computer-assisted process.

Quality issues (1)
Quality indicators on data validation that are computed (for each statistical survey separately):
- The number and share (%) of statistical questionnaires validated due to respondent or data entry mistakes, compared to the total number of questionnaires (a questionnaire is considered erroneous if at least one fatal validation rule has not been satisfied).
- The number and share (%) of statistical questionnaires validated due to respondent mistakes, compared to the total number of questionnaires.
- The number and share (%) of statistical questionnaires validated by the specialists of Data Preparation divisions, compared to the total number of questionnaires.
- The number and share (%) of statistical questionnaires validated by the specialists of statistics divisions, compared to the total number of questionnaires.
- The number and share (%) of values validated by the specialists of Data Preparation divisions, compared to the total number of entered values.
- The number and share (%) of values validated by the specialists of statistics divisions, compared to the total number of entered values.

Quality issues (2)
- All the aforementioned quality indicators are computed automatically in the database management system.
- Computation of any other quality indicators on validation, editing and imputation as separate processes, or on the data validation, editing and imputation process as a whole, is optional and not regulated. If performed, the efficiency of the data validation process is measured at the discretion of the statistical survey managers.
- To improve the data validation process in Statistics Lithuania, a special working group has been established. The working group, together with statistical survey managers, is planning to revise the validation rules used during the data entry process and the methods used during the further validation process. Suggestions on improving the data validation process are going to be made for every statistical survey.