Data Validation in the ESS Context


1 Data Validation in the ESS Context
ESTP course Item 02 Luxembourg, Nov 2017 Eurostat, Unit B1

2 Presentation Outline
ESS Vision 2020
Quality and compliance
GSBPM
Data validation – Life cycle
Main problems in current situation and medium term goals
Validation levels
20 main types of validation rules in the ESS

3 ESS Validation YouTube video at https://www.youtube.com/watch

4 The path to ESS Vision 2020
COM (2009)404 Vision
COM (2009)433 GDP and beyond
COM (2011)211 Robust quality
ESSC Nov. 2012 ESS.VIP Programme
Reg. 99/2013 ESP
Reg. 223/2009 ESS
"The ESS provides the EU, the world and the public with independent high quality information on the economy and society on European, national and regional levels and makes the information available to everyone for decision-making purposes, research and debate."

5 In May 2014, the ESSC came to an agreement on the ESS Vision 2020 as the guiding frame for the ESS development during the years up to 2020.
5 key areas of work:
Focus on users
Strive for quality
Harness new data sources
Promote efficiency in production processes
Improve dissemination and communication

6 ESS Vision 2020 portfolio

7 ESS Vision 2020 – Building Blocks
Standards Network Services Data warehouses Enterprise Architecture Quality Validation Users Communication Administrative data Big Data Governance Methods

8 Quality and compliance
Countries have to provide statistics to Eurostat on time (punctuality) and of acceptable quality.
Link with quality principles of the European Statistics Code of Practice – Principles 11 to 15 related to statistical output:
11: Relevance
12: Accuracy and reliability
13: Timeliness and punctuality
14: Coherence and comparability
15: Accessibility and clarity

9 GSBPM Global Statistical Business Process Model

10 Data validation – life cycle

11 Main problems in current situation
Time-consuming "validation ping-pong"
Possibility of validation gaps
No clear picture of who does what
Risk of inconsistent assessment over time, between countries and across statistical domains
Possible misunderstanding on what is acceptable or not
Possible subjective assessment of data quality
Duplication of IT development costs within the ESS
Manual work due to low integration between the different tools
Lack of standards and of common architecture

12 Medium-term goals - ESS VIP validation
Goal 1: Ensure the transparency of the validation procedures applied to the data sent to Eurostat by the ESS Member States.
  Business outcomes:
  Increase in the quality and credibility of European statistics
  Reduction of costs related to the time-consuming validation cycle in the ESS ("validation ping-pong")
Goal 2: Enable sharing and re-use of validation services across the ESS on a voluntary basis.
  Business outcomes:
  Reduction of costs related to IT development and maintenance

13 Data Validation levels 0 to 5 (Graph B)
Axis: Validation complexity
Level 0: Format & file structure (same file)
Level 1: Cells, records, file (same file, same dataset)
Level 2: Revisions and time series (between files, same dataset)
Level 2: Between correlated datasets (between datasets, from the same source)
Level 3: Mirror checks (from different sources, within a domain)
Level 4: Consistency checks (between domains, within an organisation)
Level 5: Consistency checks (between different organisations)

14 Validation levels 0 to 5 (Graph A)
Axis: Validation complexity
Data
  Within a statistical authority
    Within a domain
      From the same source
        Same dataset
          Same file
            Level 0: Format and file structure checks
            Level 1: Cells, records, file checks
          Between files
            Level 2: Revisions and time series
        Between datasets
          Level 2: Checks between correlated datasets
      From different sources
        Level 3: Mirror checks
    Between domains
      Level 4: Consistency checks
  Between statistical authorities
    Level 5: Consistency checks
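
The graph above can be read as a decision tree: start from the widest scope a rule touches and work inward. The following is a minimal Python sketch of that reading, not an official ESS tool. The class name `CheckScope`, its field names and the function `validation_level` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckScope:
    """What a validation rule compares (all field names are hypothetical)."""
    structure_only: bool = False      # rule looks only at format and file structure
    same_file: bool = True
    same_dataset: bool = True
    same_source: bool = True
    same_domain: bool = True
    same_organisation: bool = True


def validation_level(scope: CheckScope) -> int:
    """Return the validation level (0 to 5), testing the widest scope first."""
    if not scope.same_organisation:
        return 5  # consistency checks between organisations
    if not scope.same_domain:
        return 4  # consistency checks between domains
    if not scope.same_source:
        return 3  # mirror checks
    if not scope.same_dataset or not scope.same_file:
        return 2  # correlated datasets, revisions and time series
    return 0 if scope.structure_only else 1
```

For example, a rule that compares two transmissions of the same dataset would be described as `CheckScope(same_file=False)` and lands on level 2.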

15 Priority to Validation at file level - Why?
Validation rules at file level are:
Easiest to define, check, implement, configure and share in a validation service (for "pre-validation")
Fastest to check
Most numerous
The rules that most often lead to severity level “error” (with rejection of data)
A source of tedious and burdensome manual work and "ping-pong" when not checked
At least these rules should be checked prior to transmission to Eurostat.
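
A file-level "pre-validation" step of this kind could look like the sketch below, written in Python with only the standard library. It is an illustration under assumptions, not the ESS validation service: the function name `prevalidate`, the error messages and the sample field names are hypothetical, and the file is assumed to be a CSV file with a header line and ";" as field delimiter.

```python
import csv
import io


def prevalidate(text, expected_fields, delimiter=";"):
    """Return a list of error messages; an empty list means the file passed."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return ["file is empty"]

    errors = []
    header, records = rows[0], rows[1:]
    if header != expected_fields:
        errors.append(f"unexpected header: {header}")
    if not records:
        errors.append("no data records")
    # Data lines start at line 2 because line 1 is the header.
    for line_no, record in enumerate(records, start=2):
        if len(record) != len(expected_fields):
            errors.append(
                f"line {line_no}: expected {len(expected_fields)} fields, "
                f"got {len(record)}"
            )
    return errors
```

A sender would run `prevalidate(file_text, ["COUNTRY", "YEAR", "VALUE"])` before transmission and only send the file when the returned list is empty.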

16 Validation rules… Guess the level(s)
Time for a Quiz: no worry, you'll remain anonymous…
For each validation rule, guess the level(s) from 1 to 5:
Valid code in field "aircraft type"
Annual data consistent with quarterly data
Country code in first field consistent with data sender
Eurostat data for country X is consistent with OECD data for country X
The revisions of data between 2 versions are limited
Outlier detection
Export figures reported by country A towards country B correspond to import figures reported by country B from country A
Changes in agricultural production of olives for country "X" are consistent with neighbouring countries

17 20 Main types of validation rules
Cover at least 80-90% of the validation rules used for ESS data
Used:
  To structure the standard documentation that describes validation rules in domains (and as a check-list)
  To identify the main VTL operators needed for validation of ESS data
  In Validation Rule Manager, to allow an easy definition of rules based on a simple set of parameters per rule and an easy and automatic generation of VTL

18 Classified in 4 groups
Basic file structure check (level 0):
  => Preliminary checks needed
  => SDMX full coverage (structural validation)
  => 4 types of rules that lead to errors in case of failure
Basic intra-file checks (levels 0 and 1):
  => SDMX large coverage
  => 5 types of rules that lead in most cases to errors in case of failure
Checks intra or inter files (levels 1 to 5):
  => SDMX very partial coverage
  => 8 types of rules that usually lead to errors in case of failure
Checks inter-files in same statistical domain (levels 2 and 3):
  => No SDMX coverage
  => 3 types of rules that usually lead to warnings in case of failure

19 Main types of validation rules - Overview
Columns: Rule type, Mandatory, Default, Validation level (1 to 5), SDMX, Micro data, Severity level (E, W, I), Comments
(EVA) Envelope is Acceptable X
(FLF) File Format
(FDD) Fields Delimiter (X)
  Default: “;”
  Comments: Mandatory for CSV files
(DES) Decimals Separator
  Default: “.”
  Comments: Mandatory for CSV files (always “.” for SDMX-ML)
(FDT) Field Type
(FDL) Field Length
(FDM) Field is Mandatory or empty
(COV) Codes are Valid
  Comments: Mandatory for key fields
(RWD) Records are Without Duplicates Key
  Comments: Mandatory for aggregated (tabular) data. Default: no duplicate key
(REP) Records Expected are Provided
(RNR) Records’ Number is in a Range
  Default: >=1
  Comments: Default: at least one record (file not empty)
(COC) Codes are Consistent
(VIR) Values are in Range
  Default: >=0
  Comments: Default: values are zero or positive (Min=zero)
(VCO) Values are Consistent
(VAD) Values for Aggregates are consistent with Details
  Default: =
  Comments: Mandatory if aggregates and details are provided. Default: aggregates = sum of details
(VNO) Values are Not Outliers
(VSA) Values for Seasonally Adjusted data are plausible
(RRL) Records Revised are Limited
(VRT) Values are Revised within a Tolerance level
(VMP) Values for Mirror data are Plausible
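
Three of the rule types above (RWD, RNR and COV) are simple enough to sketch in a few lines of Python. This is only an illustration of the stated defaults, not the Validation Rule Manager or generated VTL. The function names, the representation of records as dictionaries and the field names in the usage note are all assumptions made for the example.

```python
def check_rwd(records, key_fields):
    """RWD: return each key that occurs more than once (default: no duplicate key)."""
    seen, duplicates = set(), []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates


def check_rnr(records, minimum=1, maximum=None):
    """RNR: True when the record count is in range (default: at least one record)."""
    count = len(records)
    return count >= minimum and (maximum is None or count <= maximum)


def check_cov(records, field, code_list):
    """COV: return the sorted values of `field` that are not in the code list."""
    return sorted({record[field] for record in records} - set(code_list))
```

With records such as `{"country": "LU", "year": "2017", "value": 12.5}`, a call like `check_rwd(records, ["country", "year"])` returns an empty list when every key is unique, and `check_cov(records, "country", ["LU", "BE"])` lists any country code outside the code list.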

20 Main types of validation rules Focus on 8 rules from Group 3
Columns: Rule type, Mandatory, Default, Validation level (1 to 5), SDMX, Micro data, Severity level (E, W, I), Comments
(REP) Records Expected are Provided X (X)
(RNR) Records’ Number is in a Range
  Default: >=1
  Comments: Default: at least one record (file not empty)
(COC) Codes are Consistent
(VIR) Values are in Range
  Default: >=0
  Comments: Default: values are zero or positive (Min=zero)
(VCO) Values are Consistent
(VAD) Values for Aggregates are consistent with Details
  Default: =
  Comments: Mandatory if aggregates and details are provided. Default: aggregates = sum of details
(VNO) Values are Not Outliers
(VSA) Values for Seasonally Adjusted data are plausible
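
The defaults given for VIR (values are zero or positive) and VAD (aggregates = sum of details) translate directly into code. The Python sketch below is illustrative only. The function names are hypothetical, and the relative tolerance used for the aggregate comparison is an assumption added to absorb floating-point rounding; the slide itself states plain equality.

```python
import math


def check_vir(values, minimum=0.0, maximum=math.inf):
    """VIR: return the values outside [minimum, maximum] (default: zero or positive)."""
    return [value for value in values if not (minimum <= value <= maximum)]


def check_vad(aggregate, details, rel_tol=1e-9):
    """VAD: True when the aggregate equals the sum of its details.

    rel_tol is an assumed tolerance for floating-point rounding only.
    """
    return math.isclose(aggregate, math.fsum(details), rel_tol=rel_tol)
```

An empty list from `check_vir` means every value is in range, and `check_vad(total, parts)` returning `False` flags an aggregate that does not match its details.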

21 QUESTIONS ?

22 Time for a Coffee break

