Published by Wilfried Ritter. Modified over 6 years ago.
1
Structural Business Statistics: Data validation
ESTP course, SBS module, 13 March 2013
2
Contents
- Data validation process and rules
- IT validation tool available and possible further developments
3
SBS statistical data process
The SBS data disseminated by Eurostat are the result of joint production by the ESS.
Member States:
- data collection
- data processing (computing different variables, checking/validating, aggregating etc.)
- data transmission to Eurostat
Eurostat:
- data processing (checking/validating, computing derived indicators etc.)
- disseminating national and EU aggregated data
4
Validation rules
The list of checks was agreed in the Steering Group of April 2007 and 2012 and is available on CIRCA.
The data should be checked before they are sent to Eurostat.
5
Validation process
6
1. Format and file structure checks
- Comma-separated values (CSV) file, transmitted via the eDAMIS server
- Data set identifier (e.g. RSBSSERV_1A1_A)
- Record structure as laid down in EU Reg. 250/2009, for example: the file has 24 columns (fields); no thousands separators are used in the value field; "na" is used for missing data values
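A minimal sketch of how such record-structure checks could be automated, assuming a comma-delimited file read with Python's standard csv module. The position of the value field (VALUE_INDEX) and the function name check_format are hypothetical; the actual record layout is defined in the regulation, not here.

```python
import csv
import io
import re

EXPECTED_FIELDS = 24   # from the slide: the file has 24 columns
VALUE_INDEX = 5        # hypothetical position of the value field
MISSING = "na"         # from the slide: "na" marks a missing value
# Plain number: optional sign, digits, optional decimal part, no separators.
NUMBER = re.compile(r"-?\d+(\.\d+)?")


def check_format(text, delimiter=","):
    """Return a list of (line_number, message) for each format violation."""
    errors = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for line_no, row in enumerate(reader, start=1):
        if len(row) != EXPECTED_FIELDS:
            errors.append(
                (line_no, f"expected {EXPECTED_FIELDS} fields, got {len(row)}")
            )
            continue
        value = row[VALUE_INDEX].strip()
        if value == MISSING:
            continue  # missing values are allowed
        if NUMBER.fullmatch(value) is None:
            errors.append((line_no, f"value {value!r} is not a plain number"))
    return errors
```

A value such as "1 234" is rejected because of the embedded space, while an unquoted "1,234" changes the field count and is caught by the column check.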
7
2. Intra-dataset checks (individual datasets): checks on the correct aggregation of data values
The consistency of the aggregation of the data for all required breakdowns (NACE, employment size class etc.) is checked.
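An aggregation check of this kind can be sketched as follows, assuming the values of one variable are held in a dictionary keyed by breakdown code and the hierarchy is given as a parent-to-children mapping. The function name, the tolerance and the example codes are illustrative assumptions, not the rules agreed for SBS.

```python
def check_aggregation(values, children, rel_tol=1e-6):
    """Compare each parent value with the sum of its children.

    values:   dict mapping a breakdown code to its reported value
    children: dict mapping a parent code to the list of its child codes
    Returns a list of (parent, reported_value, sum_of_children) mismatches.
    """
    problems = []
    for parent, kids in children.items():
        if parent not in values or any(k not in values for k in kids):
            continue  # incomplete breakdown: nothing to compare
        total = sum(values[k] for k in kids)
        reported = values[parent]
        # Allow for rounding; rel_tol is an assumed tolerance.
        if abs(total - reported) > rel_tol * max(abs(reported), 1.0):
            problems.append((parent, reported, total))
    return problems


# Usage with a NACE section and its divisions (figures are made up).
hierarchy = {"G": ["G45", "G46", "G47"]}
turnover = {"G": 100.0, "G45": 30.0, "G46": 50.0, "G47": 25.0}
print(check_aggregation(turnover, hierarchy))  # [('G', 100.0, 105.0)]
```

The same function can be reused for other breakdowns (for example employment size classes) by passing a different hierarchy.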
8
2. Intra-dataset checks (individual datasets): checks on the logical relation between values of variables of the same series
9
2. Intra-dataset checks (individual datasets): checks on the correctness of the confidentiality pattern
Using the aggregation levels for different breakdowns (NACE, employment size class etc.), the confidentiality pattern is checked.
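The slide does not list the individual confidentiality rules. As an illustration only, the sketch below assumes one common rule from statistical disclosure control: if an aggregate is published and exactly one of its components is suppressed, the suppressed value can be recovered by subtraction, so the pattern is unsafe. The function name and the data layout are hypothetical.

```python
def single_suppressions(flags, children):
    """Find published aggregates with exactly one confidential component.

    flags:    dict mapping a breakdown code to True if the cell is confidential
    children: dict mapping a parent code to the list of its child codes
    Returns a list of (parent, exposed_child) pairs.
    """
    exposed = []
    for parent, kids in children.items():
        if flags.get(parent, False):
            continue  # the total itself is hidden, so nothing can be derived
        hidden = [k for k in kids if flags.get(k, False)]
        if len(hidden) == 1:
            # total minus the published children reveals the hidden cell
            exposed.append((parent, hidden[0]))
    return exposed


hierarchy = {"G": ["G45", "G46", "G47"]}
print(single_suppressions({"G46": True}, hierarchy))  # [('G', 'G46')]
```

With two or more suppressed components under the same published total, this simple subtraction no longer works and the function reports nothing.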
10
3. Intra-source checks (linked datasets): year-to-year variations (1)
Carried out for characteristics and for ratios calculated from one or more characteristics, to verify the plausibility of the time series (only for datasets 1A, 2A, 3A, 4A and 1P, 2P, 3P, 4P).
The evolution of the data compared with the previous year is checked by taking into account:
- the number of enterprises
- the inflation rate
- the economic growth
11
3. Intra-source checks (linked datasets): year-to-year variations (2)
To check the year-to-year variations, a standard interval is applied to each variable, based on past observations.
The boundary width is adapted to the number of enterprises in year t-1:
- The relative variance of a sum of observations decreases in inverse proportion to the number of observations, so the relative standard deviation decreases in inverse proportion to the square root of the number of observations.
- The confidence boundaries should hence increase in width in proportion to 1/sqrt(n), where n is the number of enterprises.
- These percentage values are divided by (lower boundary) or multiplied by (upper boundary) [...].
The central value of the interval is adapted to growth and inflation: both the lower and the upper boundary are multiplied by (1 + real growth) * (1 + inflation rate).
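The exact scaling factor is not reproduced in this transcript, so the following is only a sketch under stated assumptions: a base width base_width that applies at a reference population of n_ref enterprises and is widened in proportion to 1/sqrt(n). Both parameters and both function names are hypothetical. Only the centring on (1 + real growth) * (1 + inflation rate) and the divide/multiply structure of the boundaries come from the slide.

```python
import math


def yty_bounds(n_enterprises, real_growth, inflation,
               base_width=0.10, n_ref=100):
    """Return (lower, upper) bounds for the ratio value(t) / value(t-1).

    base_width and n_ref are assumed parameters: the width equals
    base_width when n_enterprises == n_ref and scales with 1/sqrt(n).
    """
    if n_enterprises <= 0:
        raise ValueError("number of enterprises must be positive")
    width = base_width * math.sqrt(n_ref / n_enterprises)
    centre = (1 + real_growth) * (1 + inflation)
    lower = centre / (1 + width)   # lower boundary: divide
    upper = centre * (1 + width)   # upper boundary: multiply
    return lower, upper


def is_plausible(current, previous, n_enterprises, real_growth, inflation):
    """True if the year-to-year ratio lies inside the interval."""
    lower, upper = yty_bounds(n_enterprises, real_growth, inflation)
    return lower <= current / previous <= upper
```

Under these assumptions a cell based on 400 enterprises gets half the width of a cell based on 100 enterprises, which reflects the square-root argument.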
12
3. Intra-source checks (linked datasets): year-to-year variations (3)
Formula for confidence interval
13
3. Intra-source checks (linked datasets): year-to-year variations (4)
Checks to be carried out for characteristics or for ratios calculated on the basis of two or more characteristics
14
3. Intra-source checks (linked datasets): year-to-year variations (5)
15
3. Intra-source checks (linked datasets): checks on the consistency of linked series
The consistency of the data values common to the linked series is checked (e.g. series 1A/1B, 2A/2B, 3A/3B, 4A/4B).
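A check on linked series can be sketched as a comparison of the cells that two series have in common. The key layout (activity, variable, year), the function name and the tolerance are assumptions made for illustration.

```python
import math


def linked_mismatches(series_a, series_b, rel_tol=1e-9):
    """Return the sorted keys present in both series whose values differ.

    Each series maps a key such as (activity, variable, year) to a value.
    """
    common = series_a.keys() & series_b.keys()
    return sorted(
        key for key in common
        if not math.isclose(series_a[key], series_b[key], rel_tol=rel_tol)
    )


# Usage with made-up figures: one common cell agrees, one does not.
a = {("G47", "turnover", 2010): 250.0, ("G47", "employment", 2010): 12.0}
b = {("G47", "turnover", 2010): 250.0, ("G47", "employment", 2010): 13.0}
print(linked_mismatches(a, b))  # [('G47', 'employment', 2010)]
```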
16
3. Intra-source checks (linked datasets): checks on the consistency of the confidentiality pattern of linked series
17
4. Inter-source checks: cross-country comparisons
A cross-country analysis is carried out for a list of ratios. The analysis consists of examining the distribution of the ratios, grouping (classifying) countries with a similar "performance" for the indicator, and detecting outliers.
18
4. Inter-source checks: cross-country comparisons (1)
a) Examining the distribution of ratios by NACE 4-digit code and by country. Each country is ranked for a particular activity in one of 5 classes, defined in relation to the median calculated over all countries for that activity:
- Class 1: very low in comparison to the median (< median minus twice the standard deviation)
- Class 2: low in comparison to the median (< median minus the standard deviation but >= median minus twice the standard deviation)
- Class 3: close to the median (>= median minus the standard deviation and < median plus the standard deviation)
- Class 4: high in comparison to the median (>= median plus the standard deviation but < median plus twice the standard deviation)
- Class 5: very high in comparison to the median (>= median plus twice the standard deviation)
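The five classes can be expressed directly in code. This is a minimal sketch for one activity, assuming the ratios are held in a dictionary keyed by country and that the sample standard deviation is meant (the slide does not say which estimator is used). The function name is hypothetical.

```python
import statistics


def classify(ratios):
    """Assign each country a class from 1 (very low) to 5 (very high).

    ratios: dict mapping a country code to its ratio for one activity.
    """
    values = list(ratios.values())
    med = statistics.median(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    classes = {}
    for country, ratio in ratios.items():
        if ratio < med - 2 * sd:
            classes[country] = 1
        elif ratio < med - sd:
            classes[country] = 2
        elif ratio < med + sd:
            classes[country] = 3
        elif ratio < med + 2 * sd:
            classes[country] = 4
        else:
            classes[country] = 5
    return classes
```

Because each branch is only reached when the previous comparison failed, the lower limit of every class (the ">=" conditions above) is enforced implicitly.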
19
4. Inter-source checks: cross-country comparisons (2)
20
4. Inter-source checks: cross-country comparisons (3)
b) Clustering
Statistical methods are used to group the reporting countries for each indicator, so that data providers can see which countries perform similarly to their own and assess whether this corresponds to economic reality. The groups are presented for two or more consecutive years to show whether they change considerably from one year to the next.
21
4. Inter-source checks: cross-country comparisons (4)
c) Outlier detection
Outlier detection uses the box-and-whisker plot method. The distribution of the ratios is transformed into more symmetric distributions and the most symmetric one is chosen.
- First the inter-quartile range (iqr) is calculated: iqr = q3 - q1.
- The boundaries are then determined from iqr, q1 and q3: lower boundary = q1 - m*iqr and upper boundary = q3 + n*iqr.
- In general, the coefficients used are 1.5 for the inner fence (to identify suspected outliers) and 3 for the outer fence (to identify outliers).
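The fence calculation can be sketched with Python's statistics module. This sketch skips the symmetry transformation, uses the same coefficient for the lower and the upper boundary, and assumes the "inclusive" quartile method; none of these choices is confirmed by the slide. The function names are hypothetical.

```python
import statistics


def fences(values, k):
    """Return (lower, upper) = (q1 - k*iqr, q3 + k*iqr)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, inner=1.5, outer=3.0):
    """Split values into (suspected, outliers) using inner and outer fences."""
    in_lo, in_hi = fences(values, inner)
    out_lo, out_hi = fences(values, outer)
    suspected, outliers = [], []
    for v in values:
        if v < out_lo or v > out_hi:
            outliers.append(v)        # beyond the outer fence
        elif v < in_lo or v > in_hi:
            suspected.append(v)       # between inner and outer fence
    return suspected, outliers


ratios = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 100]
print(flag_outliers(ratios))  # ([20], [100])
```

In this example q1 = 3.5 and q3 = 8.5, so the inner fences are -4 and 16 and the outer fences are -11.5 and 23.5.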
22
4. Inter-source checks: cross-country comparisons (5)
The figure below shows the calculation of the upper boundaries, which helps to identify outliers among ratios that are higher than normally expected.
23
5. Intra-institution (Eurostat) checks
Plausibility or consistency checks between two domains available in the same institution are to be investigated:
- STS/SBS: done at national level
- Labour Force Survey/SBS: done at national level
- IFATS/SBS: done at national and EU level
24
6. Further work: inter-institution checks
Plausibility or consistency checks between the data available in Eurostat and the data / information available outside Eurostat. This implies no "control" over the methodology on the basis of which the external data are collected, and sometimes a limited knowledge of it. Not yet systematically carried out for SBS. It could be investigated whether such checks could be useful.
25
IT validation tool available and possible future developments
The Editing Building Block (EBB 2011 – SBS) is an IT validation tool developed by Eurostat for cleaning the data.
- Available on CIRCA for Annexes I-IV and IX
- Checks: technical format, single series, inter-series and year-to-year data
- Further work: a confidentiality audit function could be added
26
Dissemination
National data are disseminated once the results have been checked by Eurostat, and only with the Member State's approval. National data for Annexes I to IV are published by Eurostat at the same time as the EU aggregated results.