Structural Business Statistics Data validation

Structural Business Statistics Data validation ESTP course, SBS module 13 March 2013

Contents
Data validation process and rules.
IT validation tool available and possible further developments.

SBS statistical data process
The SBS data disseminated by Eurostat are the result of joint production within the ESS.
Member States: data collection; data processing (computing the different variables, checking/validating, aggregating, etc.); data transmission to Eurostat.
Eurostat: data processing (checking/validating, computing derived indicators, etc.); dissemination of national and EU aggregated data.

Validation rules
The list of checks was agreed by the Steering Group in April 2007 and 2012 and is available on CIRCA: https://circabc.europa.eu/sd/d/b21a8a60-60ee-411b-bac7-b059fc86d1da/doc%2004%20-%20New%20quality%20checks.pdf
The data should be checked before they are sent to Eurostat.

Validation process

1. Format and file structure checks
Comma-separated values (CSV) file, transmitted via the eDAMIS server.
Data set identifier (e.g. RSBSSERV_1A1_A).
Record structure as laid down in EU Reg. 250/2009, e.g. the file has 24 columns or fields; no thousands separators are used in the value field; "na" marks missing data values; etc.
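
The three example rules on this slide (24 fields, no thousands separators in the value field, "na" for missing values) can be sketched as a small pre-transmission check. This is only an illustration: the comma delimiter follows the "CSV" wording of the slide, and the position of the value field (`VALUE_INDEX`) is a hypothetical choice, not taken from the regulation.

```python
import csv
import io
import re

EXPECTED_FIELDS = 24   # from the slide: the file has 24 columns or fields
VALUE_INDEX = 23       # ASSUMPTION: hypothetical position of the value field
PLAIN_NUMBER = re.compile(r"-?\d+(\.\d+)?")  # digits only, no thousands separators


def check_record_format(text, value_index=VALUE_INDEX):
    """Return a list of (line_number, message) for records that break the rules."""
    problems = []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != EXPECTED_FIELDS:
            problems.append(
                (line_no, f"expected {EXPECTED_FIELDS} fields, found {len(row)}")
            )
            continue
        value = row[value_index].strip()
        if value == "na":          # missing value marker is allowed
            continue
        if not PLAIN_NUMBER.fullmatch(value):
            problems.append(
                (line_no, f"value {value!r} is neither a plain number nor 'na'")
            )
    return problems
```

A value such as `1 234` is reported as malformed, while a value written as `1,234` in a comma-delimited file shows up as a wrong field count.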

2. Intra-dataset checks (individual datasets)
Checks on the correct aggregation of data values: the consistency of the aggregation is checked for all required breakdowns (NACE, employment size class, etc.).
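
An aggregation check of this kind can be sketched as follows, assuming the values of one variable are held in a dictionary keyed by breakdown code and that the parent/child hierarchy is supplied separately. The function name, the data layout and the rounding tolerance are assumptions made for the illustration.

```python
def check_aggregation(values, children, tolerance=0.5):
    """Compare each reported aggregate with the sum of its components.

    values   -- dict mapping a breakdown code to its reported value
    children -- dict mapping a parent code to the list of its component codes
    Returns a list of (parent, reported, computed) for inconsistent aggregates.
    """
    mismatches = []
    for parent, kids in children.items():
        if parent not in values or any(k not in values for k in kids):
            continue  # incomplete breakdown: not checkable in this sketch
        computed = sum(values[k] for k in kids)
        if abs(values[parent] - computed) > tolerance:
            mismatches.append((parent, values[parent], computed))
    return mismatches
```

For example, with NACE section G and its divisions 45, 46 and 47, `check_aggregation({"G": 100, "G45": 30, "G46": 50, "G47": 20}, {"G": ["G45", "G46", "G47"]})` returns an empty list.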

2. Intra-dataset checks (individual datasets)
Checks on the logical relations between values of variables of the same series.
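
The slide does not list the logical relations themselves. As a sketch only, such checks can be written as named rules applied to each record; the two rules and the field names below are hypothetical examples, not the agreed list of checks.

```python
# ASSUMPTION: illustrative rules and field names, not the agreed SBS checks.
RULES = [
    ("employees <= persons_employed",
     lambda r: r["employees"] <= r["persons_employed"]),
    ("turnover >= 0",
     lambda r: r["turnover"] >= 0),
]


def failed_rules(record, rules=RULES):
    """Return the names of the rules that the record violates."""
    return [name for name, holds in rules if not holds(record)]
```

Keeping the rules in a list makes it easy to add or remove relations without touching the checking function.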

2. Intra-dataset checks (individual datasets)
Checks on the correctness of the confidentiality pattern: the confidentiality pattern is checked using the aggregation levels of the different breakdowns (NACE, employment size class, etc.).

3. Intra-source checks (linked datasets)
Year-to-year variations (1)
Carried out for characteristics and for ratios calculated from one or more characteristics.
Purpose: verify the plausibility of the time series (only for datasets 1A, 2A, 3A, 4A and 1P, 2P, 3P, 4P).
The evolution of the data compared to the previous year is checked by taking into account the number of enterprises, the inflation rate and the economic growth.

3. Intra-source checks (linked datasets)
Year-to-year variations (2)
To check the year-to-year variations, a standard interval based on past observations is applied to each variable.
The boundary width is adapted to the number of enterprises n in year t-1: the variance of an average of n observations decreases in inverse proportion to n, so its standard deviation decreases in inverse proportion to the square root of n. The confidence boundaries should hence widen in proportion to 1/√n as the number of enterprises falls. The percentage values are divided (lower boundary) or multiplied (upper boundary) by the resulting width factor.
The central value of the interval is adapted according to growth and inflation: both the lower and the upper boundary are multiplied by (1 + real growth) * (1 + inflation rate).
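
The two adjustments described on this slide can be sketched as below. The sketch assumes that the standard boundaries are expressed as ratios of the current to the previous value (for example 0.8 and 1.25) and takes the width factor as an input, because its exact formula is not reproduced in this transcript. All names are hypothetical.

```python
def yty_bounds(lower, upper, width_factor, real_growth, inflation):
    """Adapt a standard year-to-year interval.

    lower, upper -- standard boundaries for the ratio x(t) / x(t-1) (ASSUMPTION)
    width_factor -- widening factor derived from the number of enterprises;
                    supplied by the caller, its formula is not given here
    """
    shift = (1 + real_growth) * (1 + inflation)   # moves the central value
    return lower / width_factor * shift, upper * width_factor * shift


def is_plausible(previous, current, bounds):
    """True if the year-to-year ratio lies inside the adapted interval."""
    low, high = bounds
    return low <= current / previous <= high
```

With `width_factor=1`, zero growth and zero inflation the standard interval is returned unchanged, which is a convenient sanity check.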

3. Intra-source checks (linked datasets)
Year-to-year variations (3)
Formula for confidence interval

3. Intra-source checks (linked datasets)
Year-to-year variations (4)
Checks to be carried out for characteristics or for ratios calculated on the basis of two or more characteristics.

3. Intra-source checks (linked datasets)
Year-to-year variations (5)

3. Intra-source checks (linked datasets)
Checks on consistency of linked series: the consistency of the data values common to the linked series is checked (e.g. series 1A/1B, 2A/2B, 3A/3B, 4A/4B).
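
A consistency check between two linked series can be sketched as a comparison of the cells they have in common. The data layout (a dictionary keyed by a tuple of breakdown code and variable name) and the function name are assumptions made for this illustration.

```python
def linked_series_mismatches(series_a, series_b, tolerance=0.0):
    """Return the keys present in both series whose values disagree.

    Each series is a dict mapping a cell key, e.g. (nace_code, variable),
    to a numeric value (ASSUMPTION: missing values were filtered out earlier).
    """
    common = series_a.keys() & series_b.keys()
    return sorted(
        key for key in common
        if abs(series_a[key] - series_b[key]) > tolerance
    )
```

Cells that appear in only one of the two series are ignored, since only common data values are subject to this check.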

3. Intra-source checks (linked datasets)
Checks on consistency of the confidentiality pattern of linked series.

4. Inter-source checks
Cross-country comparisons
A cross-country analysis is carried out for a list of ratios. It consists of examining the distribution of the ratios, grouping (classifying) countries with a similar "performance" for the indicator, and detecting outliers.

4. Inter-source checks
Cross-country comparisons (1)
a) Examining the distribution of ratios by NACE 4-digit code and country, ranking each country for a particular activity in 5 classes defined in relation to the median calculated over all countries for that activity:
Class 1: very low in comparison to the median (< median minus twice the standard deviation)
Class 2: low in comparison to the median (< median minus the standard deviation but >= median minus twice the standard deviation)
Class 3: close to the median (>= median minus the standard deviation and < median plus the standard deviation)
Class 4: high in comparison to the median (>= median plus the standard deviation but < median plus twice the standard deviation)
Class 5: very high in comparison to the median (>= median plus twice the standard deviation)
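
The five classes translate directly into a chain of comparisons. The sketch below assumes that the standard deviation is the population standard deviation of the ratios over all countries (the slide does not say which estimator is used); the country codes in the usage note are placeholders.

```python
import statistics


def classify(ratio, median, sd):
    """Return the class (1 to 5) of a ratio relative to the median."""
    if ratio < median - 2 * sd:
        return 1   # very low
    if ratio < median - sd:
        return 2   # low
    if ratio < median + sd:
        return 3   # close to the median
    if ratio < median + 2 * sd:
        return 4   # high
    return 5       # very high


def classify_countries(ratios):
    """ratios: dict mapping a country code to its ratio for one activity."""
    values = list(ratios.values())
    median = statistics.median(values)
    sd = statistics.pstdev(values)  # ASSUMPTION: population standard deviation
    return {country: classify(v, median, sd) for country, v in ratios.items()}
```

For instance, `classify_countries({"AA": 1, "BB": 2, "CC": 3, "DD": 4, "EE": 100})` puts "EE" in class 5 and the other four placeholder countries in class 3.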

4. Inter-source checks
Cross-country comparisons (2)

4. Inter-source checks
Cross-country comparisons (3)
b) Clustering
Statistical methods are used to group the reporting countries for each indicator, so that data providers can see which countries perform similarly to their own and assess whether this corresponds to economic reality. The groups are presented for two or more consecutive years to show whether they change considerably from one year to the next.

4. Inter-source checks
Cross-country comparisons (4)
c) Outlier detection
Outlier detection uses the box-and-whisker plot method. The distribution of the ratios is first transformed into more symmetric distributions, and the most symmetric one is chosen. The inter-quartile range is then calculated (iqr = q3 - q1), and the boundaries are determined from iqr, q1 and q3: lower boundary = q1 - m*iqr and upper boundary = q3 + n*iqr. In general, the coefficients used are 1.5 for the inner fence (to identify suspected outliers) and 3 for the outer fence (to identify outliers).
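
The fence calculation can be sketched with the Python standard library. The sketch leaves out the symmetrising transformation, uses the same coefficient on both sides of the box, and relies on the "inclusive" quartile definition of `statistics.quantiles`; the slide does not specify which quartile definition is applied, so that choice is an assumption.

```python
import statistics


def fences(ratios, inner=1.5, outer=3.0):
    """Return the inner and outer (lower, upper) fences for a list of ratios."""
    # ASSUMPTION: "inclusive" quartiles; other definitions give other fences.
    q1, _, q3 = statistics.quantiles(ratios, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "inner": (q1 - inner * iqr, q3 + inner * iqr),
        "outer": (q1 - outer * iqr, q3 + outer * iqr),
    }


def flag(value, f):
    """Classify a value against the fences returned by fences()."""
    low_outer, high_outer = f["outer"]
    low_inner, high_inner = f["inner"]
    if value < low_outer or value > high_outer:
        return "outlier"
    if value < low_inner or value > high_inner:
        return "suspected outlier"
    return "ok"
```

For the ratios 1 to 9 plus 100, the quartiles are 3.25 and 7.75, so the inner fences are -3.5 and 14.5 and the outer fences -10.25 and 21.25; the value 100 is flagged as an outlier.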

4. Inter-source checks
Cross-country comparisons (5)
The figure on this slide shows the calculation of the upper boundaries, which help to identify ratios that are higher than normally expected.

5. Intra-institution (Eurostat) checks
Plausibility or consistency checks between two domains available in the same institution are to be investigated:
STS/SBS: done at national level
Labour Force Survey/SBS: done at national level
IFATS/SBS: done at national and EU level

6. Further work: inter-institution checks
Plausibility or consistency checks between the data available in Eurostat and the data or information available outside Eurostat. Eurostat has no "control" over the methodology used to collect the external data, and sometimes only limited knowledge of it.
These checks are not yet carried out systematically for SBS. It could be investigated whether they would be useful.

IT validation tool available and possible future developments
The Editing Building Block (EBB 2011 – SBS) is an IT validation tool developed by Eurostat for cleaning the data.
It is available on CIRCA for Annexes I-IV and IX: https://circabc.europa.eu/w/browse/7f169728-f20b-4da1-9046-0a1d0dfd6f58
Checks covered: technical format; single series; inter-series; year-to-year data.
Further work: a confidentiality audit function could be added.

Dissemination
National data are disseminated once the results have been checked by Eurostat, and only with the Member State's approval.
National data for Annexes I to IV are published by Eurostat at the same time as the EU aggregated results.