1
Data Validation practice in Statistics Lithuania
Nadežda Fursova, Jūratė Petrauskienė 9–11 November 2015, Wiesbaden Workshop ValiDat Foundation
2
Structure of Statistics Lithuania
MANAGEMENT DIVISION: Director General and 4 Deputy Directors General
DATA PREPARATION DIVISIONS: Territorial (in 5 cities)
GENERAL ACTIVITY DIVISIONS: IT Development, Document Management, Internal Audit, etc.
STATISTICS DIVISIONS: Methodology & Quality, National Accounts, Price Statistics, Labour Statistics, etc.
3
Data validation, editing and imputation process
Raw data
-> Stage 1: Initial (primary) data validation and editing
-> Stage 2: Further (secondary) data validation and editing
-> Final data
Note: imputation is usually done at Stage 2.
4
Initial data editing: who validates (1)
Initial data validation is usually performed by the specialists of the 5 territorial Data Preparation divisions of Statistics Lithuania. The Data Preparation divisions are responsible for collecting data from respondents (economic entities) and entering them into the database.
In some cases data are collected and entered into the database not by the Data Preparation divisions but by the respective statistics divisions; initial data validation is then performed by those divisions.
In household (population) surveys, initial data validation is performed by the interviewers who collect data from respondents and enter them into the database.
In some price statistics surveys (e.g. the survey of consumer prices of goods and services), initial data validation is performed by price collectors, who register prices, enter them using special programs on their mobile devices or computers, and transfer them to the Price Statistics Division.
5
Initial data editing: who validates (2)
In almost all surveys (with only a few exceptions), automatic data control is set up for the data entry process (using IT tools).
In most surveys where respondents are economic entities, data are collected using:
- A paper form of a questionnaire; the filled-in form is sent to Statistics Lithuania by mail or fax.
- An e-form of a questionnaire; the e-form is filled in and transmitted to Statistics Lithuania via the special IT system e-Statistics (e. Statistika) or by e-mail.
When respondents fill in an e-form of a questionnaire, they partly perform initial editing themselves: e-questionnaires contain automatic primary data checks (validation rules). Respondents have to correct the mistakes; otherwise the questionnaire will not be accepted (the respondent will not be allowed to finish and save it).
6
Initial data validation: types of validation rules
Primary data checks applied during the data entry process (into the database or an e-form of a questionnaire):
- Fatal edits (or hard edits) identify errors with certainty. Data that do not satisfy this type of validation rule must be corrected; otherwise the questionnaire will not be accepted.
- Query edits (or soft edits) point to suspicious data items that may be in error. Data that do not satisfy this type of validation rule may be left uncorrected; an explanation may be required.
All the validation rules used during the data entry process are documented in programming work technical tasks (a standard form).
The validation rules to be applied are not standardized: they are decided separately for each statistical survey.
7
Initial data validation: validation rules (1)
Validation rules commonly applied during initial data validation:
Validity check (valid data type, field length, correspondence to a certain code list, etc.). Examples:
- only integer numbers should be entered
- date format should be YYYY-MM-DD
- ID code should consist of 8 digits
- a country of birth should contain only entries from a list of valid ISO country codes
Missing values check. Examples:
- if the answer to question No. X is “YES”, then question No. Y should be answered
- if the answer to question No. X is “NO”, then the answer to question No. Y should be missing
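The validity and missing-value checks listed above could be sketched in Python roughly as follows. This is only an illustration, not the implementation used at Statistics Lithuania: the field names (`employees`, `report_date`, `id_code`, `country_of_birth`, `q_x`, `q_y`), the shortened country list and the function `check_record` are all assumptions made for the example, and all values are assumed to arrive as text typed into a form.

```python
import re
from datetime import datetime

# Hypothetical excerpt of a code list; a real check would load the full ISO list.
VALID_COUNTRIES = {"LT", "LV", "EE", "DE", "PL"}


def is_iso_date(text):
    """Return True if text is a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def check_record(rec):
    """Return a list of messages, one for every rule the record fails."""

    def text(key):
        # Treat a missing key or None as an empty answer.
        return rec.get(key) or ""

    errors = []
    # Validity checks: data type, format, field length, code list.
    if not re.fullmatch(r"\d+", text("employees")):
        errors.append("employees: only integer numbers should be entered")
    if not is_iso_date(text("report_date")):
        errors.append("report_date: date format should be YYYY-MM-DD")
    if not re.fullmatch(r"\d{8}", text("id_code")):
        errors.append("id_code: should consist of 8 digits")
    if text("country_of_birth") not in VALID_COUNTRIES:
        errors.append("country_of_birth: not in the list of valid codes")
    # Missing values checks: question Y depends on the answer to question X.
    if text("q_x") == "YES" and not text("q_y"):
        errors.append("q_y: should be answered when q_x is YES")
    if text("q_x") == "NO" and text("q_y"):
        errors.append("q_y: should be missing when q_x is NO")
    return errors
```

An empty list means the record passed every rule; in an e-questionnaire, a non-empty list would be what blocks the respondent from saving the form.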
8
Initial data validation: validation rules (2)
Mathematical and logical checks (identity check, range check, compatibility of variables, etc.). Examples:
- field A + field B + field C = field D
- field A + field B + field C <= field D
- 0.01 < production (units) made / production (units) sold < 100
- 0.5 < turnover (current month) / turnover (previous month) < 2
- if an enterprise is operating then turnover > 0
- if employment status == “old-age pensioner” then age > 54:
  - fatal (hard) edit: if employment status == “old-age pensioner” and age < 35, the error message “Too young to be an old-age pensioner!” shows up.
  - query (soft) edit: if employment status == “old-age pensioner” and 35 <= age <= 54, the error message “Too young to be an old-age pensioner?” shows up.
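As a rough sketch of how such checks separate fatal edits from query edits, the pensioner rule and a few of the numeric rules might look like this in Python. The function names, the dictionary keys and the choice of which numeric rule is fatal and which is a query are assumptions for the example; only the pensioner thresholds and messages come from the slide. Field values are assumed to be integers, so the identity check can use exact equality.

```python
def check_pensioner(status, age):
    """Return (severity, message) if the pensioner rule fails, else None."""
    if status != "old-age pensioner" or age > 54:
        return None  # rule satisfied or not applicable
    if age < 35:
        return ("fatal", "Too young to be an old-age pensioner!")
    # Remaining case: 35 <= age <= 54, suspicious but possibly correct.
    return ("query", "Too young to be an old-age pensioner?")


def check_enterprise(rec):
    """Return (fatal, query) lists of messages for an enterprise record."""
    fatal, query = [], []
    # Identity check: field A + field B + field C = field D.
    if rec["a"] + rec["b"] + rec["c"] != rec["d"]:
        fatal.append("A + B + C must equal D")
    # Range check: 0.5 < turnover (current) / turnover (previous) < 2.
    prev = rec["turnover_prev"]
    if prev > 0 and not 0.5 < rec["turnover"] / prev < 2:
        query.append("Turnover changed by a factor of 2 or more")
    # Compatibility of variables: operating enterprise, so turnover > 0.
    if rec["operating"] and not rec["turnover"] > 0:
        query.append("Operating enterprise reports no turnover")
    return fatal, query
```

A data entry form would refuse to save a record while the fatal list is non-empty, and would only ask for confirmation or an explanation for the query list.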
9
Further data validation and editing: the process (1)
When initial data validation and editing is done, specialists of the statistics divisions continue the process:
- analyze collected data
- analyze error reports
- check missing values
- impute if necessary
- check the distribution of variables
- detect outliers
- analyze outliers’ influence on aggregates
- compare primary and aggregated data to available additional information
- analyze time series
- validate final data
10
Further data validation: validation rules
Validation rules commonly applied during further data validation and editing in Statistics Lithuania:
- Boundary rule
- Outlier detection based on a normal distribution or empirical quartiles
- Outlier detection using linear regression methods
- Graphic methods (box plot, scatter plot, histogram)
- Comparison to previous period data and available additional information (administrative data, data from other surveys)
11
Boundary rule
The boundary rule can be applied when 2 or more variables are not linked by exact mathematical formulas but can be expressed in an approximate relation (e.g. if the variable X = a, then the values of the variable Y should be between b and c).
Note: if possible (e.g. if additional information is not necessary, if all related variables are in the same questionnaire, etc.), this rule is already applied during the initial validation process.
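A boundary rule of the form "if X = a then Y should be between b and c" can be sketched as a lookup table. The variables chosen here (X as an enterprise size class, Y as the number of employees), the bounds and the function name are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical bounds: for each value a of X, the interval [b, c] allowed for Y.
BOUNDS = {
    "small": (1, 9),
    "medium": (10, 249),
    "large": (250, 100000),
}


def violates_boundary_rule(x, y, bounds=BOUNDS):
    """Return True if y lies outside the interval [b, c] defined for x."""
    if x not in bounds:
        return False  # no rule is defined for this value of X
    b, c = bounds[x]
    return not b <= y <= c
```

For example, `violates_boundary_rule("small", 50)` flags the record, while the same value of Y passes for `"medium"`.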
12
Outlier detection based on a normal distribution (1)
Suppose X is a normally distributed variable with a mean of μ and a standard deviation of σ.
[Figure: theoretical density function of X]
Let’s denote:
$\bar{x}$ – sample mean of the variable X
$s$ – an estimate of the standard deviation of X
13
Outlier detection based on a normal distribution (2)
Outliers are X values that fall outside the interval
$[\bar{x} - k_1 s,\ \bar{x} + k_2 s]$,
where $\bar{x}$ is the sample mean of X and $s$ is the estimate of its standard deviation.
Here $k_1$ and $k_2$ are arbitrary constants, e.g. equal to a quantile of the standard normal distribution.
The most common intervals are: [interval formulas not preserved in this transcript]
Note: this outlier detection method is also used when the variable X has an approximately normal or symmetric distribution (the data histogram is close to a normal curve).
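A minimal Python sketch of this method, assuming both constants are set to the same standard normal quantile, is given below. The function name and the default `alpha=0.05` (which gives a constant of about 1.96) are choices made for the example, not values taken from the presentation.

```python
from statistics import NormalDist, mean, stdev


def normal_outliers(values, alpha=0.05):
    """Return the values outside mean -/+ z * s.

    z is the (1 - alpha/2) quantile of the standard normal distribution,
    used here for both the lower and the upper constant (an assumption).
    """
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = x_bar - z * s, x_bar + z * s
    return [v for v in values if v < lower or v > upper]


data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(normal_outliers(data))  # prints [50]
```

Note that the outlier itself inflates both the mean and the standard deviation, so with several extreme values this interval can become too wide to flag any of them.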
14
Outlier detection based on empirical quartiles
Suppose the distribution of the variable X is far from normal (e.g. asymmetric).
Let’s denote:
$Q_1$ – the first empirical quartile of X
$Q_3$ – the third empirical quartile of X
$IQR = Q_3 - Q_1$ – interquartile range
$k_1$ and $k_2$ – arbitrary constants
Then outliers are those X values that fall outside the interval
$[Q_1 - k_1 \cdot IQR,\ Q_3 + k_2 \cdot IQR]$.
The most common intervals are: [interval formulas not preserved in this transcript]
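The quartile-based interval can be sketched with the Python standard library as follows. The default constants of 1.5 are the conventional box-plot choice and are an assumption here, not necessarily the values used at Statistics Lithuania; the function name is also illustrative. Quartile estimates differ slightly between methods, and this sketch uses the default method of `statistics.quantiles`.

```python
from statistics import quantiles


def quartile_outliers(values, k1=1.5, k2=1.5):
    """Return the values outside [Q1 - k1 * IQR, Q3 + k2 * IQR]."""
    q1, _median, q3 = quantiles(values, n=4)  # empirical quartiles
    iqr = q3 - q1  # interquartile range
    lower, upper = q1 - k1 * iqr, q3 + k2 * iqr
    return [v for v in values if v < lower or v > upper]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(quartile_outliers(data))  # prints [100]
```

Because quartiles are barely affected by a few extreme values, this interval stays stable where the mean-based interval would be stretched by the outliers themselves.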
15
Outlier detection using linear regression methods (1)
Suppose we have two linearly dependent variables X and Y, so that a linear regression model can be applied. An outlier is a two-dimensional observation that strongly deviates from the regression line.
Example: X – turnover in euros (survey data), Y – turnover estimated from VAT (administrative data).
[Figure: example regression plot with an outlier marked]
16
Outlier detection using linear regression methods (2)
Statistical packages compute various statistics (measures) for outlier detection. Measures used in Statistics Lithuania:
- Leverage
- Standardized residuals
- Cook’s distance (Cook’s D)
- DFBETAs
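To show what three of these measures compute, here is a small pure-Python sketch for simple linear regression with one explanatory variable. It is an illustration under textbook formulas, not the statistical package used at Statistics Lithuania; the function name and the sample figures are invented, and DFBETAs are left out to keep the example short.

```python
from math import sqrt


def regression_diagnostics(x, y):
    """Fit y = b0 + b1 * x by least squares; return measures per observation.

    Assumes at least 3 points, distinct x values and a non-perfect fit
    (otherwise the residual standard error is zero).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    p = 2  # number of estimated coefficients (b0 and b1)
    s = sqrt(sum(e ** 2 for e in residuals) / (n - p))  # residual std. error
    rows = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - x_bar) ** 2 / sxx  # leverage
        r = e / (s * sqrt(1 - h))            # standardized residual
        d = r ** 2 * h / (p * (1 - h))       # Cook's distance
        rows.append({"leverage": h, "std_residual": r, "cooks_d": d})
    return rows


# Invented figures: survey turnover (x) against turnover estimated from VAT (y).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1, 6.9, 20.0]
rows = regression_diagnostics(x, y)
worst = max(range(len(rows)), key=lambda i: rows[i]["cooks_d"])
print(worst)  # prints 7, the index of the last observation
```

Leverage depends only on how far an x value lies from the mean of X, the standardized residual measures the deviation from the fitted line, and Cook's distance combines the two into a single influence measure.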
17
Validation methods usage
Statistics Lithuania carries out approx. 110 surveys based on statistical questionnaires (other surveys are based on administrative data). The Methodology and Quality Division recently carried out a poll on the data validation methods used in those surveys during the further validation stage. The results are:
- The outlier detection method based on linear regression models is applied in only a few surveys.
- The outlier detection method based on a normal distribution or empirical quartiles is not very common either (it is applied in approx. 10 surveys).
- Graphic methods are used in slightly more than 10 surveys.
- The boundary rule is used during the further validation process about as often as graphic methods.
- Comparison to previous period data and available additional information (especially comparison of aggregated data (estimates)) is made in all surveys, but usually this is not a computer-assisted process.
18
Quality issues (1)
Quality indicators on data validation that are computed (for each statistical survey separately):
- The number and share (%) of statistical questionnaires validated due to respondent or data entry mistakes compared to the total number of questionnaires (a questionnaire is considered erroneous if at least one fatal validation rule has not been satisfied).
- The number and share (%) of statistical questionnaires validated due to respondent mistakes compared to the total number of questionnaires.
- The number and share (%) of statistical questionnaires validated by the specialists of Data Preparation divisions compared to the total number of questionnaires.
- The number and share (%) of statistical questionnaires validated by the specialists of statistics divisions compared to the total number of questionnaires.
- The number and share (%) of values validated by the specialists of Data Preparation divisions compared to the total number of entered values.
- The number and share (%) of values validated by the specialists of statistics divisions compared to the total number of entered values.
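Each of these indicators is a count together with its percentage of a total. A minimal sketch of that arithmetic is shown below; the flag names and the record layout are hypothetical and do not describe the actual database management system.

```python
def number_and_share(questionnaires, flag):
    """Return (count, share in %) of questionnaires where the flag is set."""
    total = len(questionnaires)
    count = sum(1 for q in questionnaires if q.get(flag))
    share = round(100 * count / total, 1) if total else 0.0
    return count, share


# Hypothetical flags set while questionnaires are being processed.
sample = [
    {"validated_by_data_preparation": True},
    {"validated_by_data_preparation": True, "validated_by_statistics_division": True},
    {},
    {},
]
print(number_and_share(sample, "validated_by_data_preparation"))  # prints (2, 50.0)
```

The same function applies to value-level indicators if each record represents one entered value instead of one questionnaire.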
19
Quality issues (2)
All the aforementioned quality indicators are computed automatically in the database management system.
Computation of any other quality indicators on validation, editing and imputation as separate processes, or on the data validation, editing and imputation process as a whole, is optional and not regulated. If measured at all, the efficiency of the data validation process is assessed at the discretion of the statistical survey managers.
To improve the data validation process in Statistics Lithuania, a special working group has been established. The working group, together with statistical survey managers, is planning to revise the validation rules used during the data entry process and the methods used during the further validation process. Suggestions on improving the data validation process are going to be made for every statistical survey.