Background Data validation, a critical issue for the E.S.S.

Presentation transcript:

VTL (Validation and Transformation Language)
A new standard for data validation and processing
Marco Pellegrino, Eurostat
Acknowledgements: Bank of Italy, SDMX Technical Working Group, DDI Alliance, Bryan Fitzpatrick, Arofan Gregory, and others

Background
Data validation, a critical issue for the E.S.S.
- Eurostat and Member States: double work or "no work"?
- Inefficiencies:
  - Lack of coordination
  - Lack of documentation
  - Lack of formalisation of validation procedures and rules
  - Low harmonisation of software solutions
- Need for a comprehensive solution: a portfolio of actions in the framework of the ESS Vision 2020
The Communication from the Commission to the European Parliament and the Council on the production method of EU statistics (Document ESSC 2010/05/6) calls for:
- more harmonisation and standardisation of statistical methodologies for data validation within the ESS
- harmonising the IT infrastructure and sharing IT tools as a way to facilitate the use of agreed statistical methods, leading to better quality and higher productivity in the processing of statistical data

Approach
- SDMX originally focused on data collection and dissemination
- Current trend: support more stages of the statistical production process
  - GSBPM (Generic Statistical Business Process Model)
- VTL was developed by the SDMX Technical Working Group
  - Based on earlier developments at the Banca d'Italia, and expansions of this work by Eurostat
  - Included participants from the DDI Alliance
  - Work started in 2012
  - A series of face-to-face and virtual meetings resulted in publication earlier this year
- The goal was to produce a platform-neutral language for formally describing validation and transformation
  - Organizations can exchange these processes with each other
  - Even if they use different software to perform these functions

Data Validation Process
Before/During Transmission ("First Level") - covered by SDMX today
- Format Check (SDMX-ML)
- Code Check (SDMX DSD)
After Transmission ("Second Level") - not yet covered by SDMX → SDMX-VTL
- Detailed value check
- Mirror check
- …
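As a rough illustration of the two levels, the sketch below pairs a first-level code check with a second-level mirror check. It is plain Python, not SDMX or VTL; the code list, the record layout, the function names and the reading of "mirror check" as a comparison of two partners' reports of the same flow are all assumptions made for this example.

```python
# Hypothetical code list standing in for the codes allowed by a DSD.
COUNTRY_CODES = {"IT", "FR", "DE"}


def code_check(records, valid_codes):
    """First level: return the records whose 'country' code is not in the code list."""
    return [r for r in records if r["country"] not in valid_codes]


def mirror_check(exports, imports, tolerance=0.0):
    """Second level: compare A's reported exports to B with B's reported imports from A.

    Both arguments map (reporter, partner) -> value. Returns the flows whose
    mirror value is missing or differs by more than `tolerance`.
    """
    mismatches = {}
    for (reporter, partner), exported in exports.items():
        imported = imports.get((partner, reporter))
        if imported is None or abs(exported - imported) > tolerance:
            mismatches[(reporter, partner)] = (exported, imported)
    return mismatches


# Made-up records and flows, for illustration only.
records = [{"country": "IT", "value": 10}, {"country": "XX", "value": 5}]
bad_codes = code_check(records, COUNTRY_CODES)

exports = {("IT", "FR"): 100.0, ("FR", "DE"): 50.0}
imports = {("FR", "IT"): 100.0, ("DE", "FR"): 45.0}
mirror_issues = mirror_check(exports, imports, tolerance=1.0)
```

The first check needs only the transmitted file and its code list, while the second needs data from more than one reporter, which is why it can only run after transmission.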

The VTL initiative
Main goals:
- Define and preserve validation rules (document and preserve the validation know-how)
- Exchange and share validation rules (with reporting institutions & other correspondents)
- Apply validation rules in the collection and production processes (aiming at an industrialized processing of statistical data)
At a later stage:
- Improve the VTL to support more complex algorithms for data compilation and estimation
This topic is already present within package n. 13 of the SDMX Information Model, which is named "Transformations and Expressions". This package describes the generic model for tracking the derivation of data; the model is derived from the CWM (Common Warehouse Metamodel), a standard from the OMG (Object Management Group) widely used in the IT field. It allows the identification and documentation of the calculations by means of mathematical expressions, as well as the definition of an expression language that can be used to write the expressions. It also allows a formal representation of the operations to be performed, so that a program can "read" the metadata and translate the expressions into whatever computer language is appropriate for the calculation. This part of the model also allows specifying and documenting the validation rules among different data, expressing them as calculations (for example, the coherence rule "a + b = c" can be written as "a + b - c = 0" and checked through the calculation "if ((a + b - c) = 0, then …, else …)").
However, the "Transformations and Expressions" package of the SDMX IM is only a basic framework, which requires more work on its integration and actual use. The WP aims to develop this elaboration, which the SDMX specifications announce as available in the future.
In concrete terms, it is necessary to design the "expression language" (the list of the operators available for defining the expressions, and their formal grammar) and the IT formats for exchanging the definitions produced in this way. Completing this feature would make it possible to exchange the formal definitions of the coherence rules and of the algorithms used or specified for calculating data. As the results of one calculation may be the input to others, it would also be possible to define and exchange chains of controls/calculations, which express the relationships between statistical data (as happens between the cells of a spreadsheet).
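The coherence rule mentioned above can be sketched in a few lines of Python. This only illustrates the idea of writing "a + b = c" as the calculation "a + b - c = 0"; the function name, the `tolerance` parameter and the figures are assumptions made for the example, not part of SDMX or VTL.

```python
def check_coherence(a, b, c, tolerance=0):
    """Evaluate the rule a + b = c as the calculation a + b - c = 0.

    Returns (ok, imbalance), where imbalance is a + b - c.
    """
    imbalance = a + b - c
    return abs(imbalance) <= tolerance, imbalance


# A small chain of calculations: the result of one is the input to the next,
# as between the cells of a spreadsheet (made-up figures).
a, b = 40, 60
c = a + b                      # derived value
ok, imbalance = check_coherence(a, b, c)

# A reported total that does not match its components fails the rule.
failed = check_coherence(40, 60, 101)
```

Returning the imbalance together with the outcome lets a consumer of the rule see by how much the data miss the constraint, not only that they miss it.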

What is VTL 1.0?
- A reference framework for the creation of rules for data validation and transformation
- It maps to a clear and generic information model
- It aligns with relevant statistical information standards such as SDMX and GSIM
- SDMX VTL: part 1 - part 2
- EBNF (Extended Backus-Naur Form) technical notation

Main VTL features
- User orientation
- Integrated approach
- IT implementation independence
- Active role for processing
- Extensibility and customizability
- Language effectiveness
Proper governance is needed

The VTL Information Model
- VTL is a "stand-alone" specification
  - It can be used with SDMX, DDI, or potentially anything else
  - It can be used on its own
- Because different standards have different information models, VTL must establish its own information model
  - Other information models can be mapped against it
- VTL uses GSIM as a basis

VTL Data Model
- Organizes Data Points into Data Sets
- Describes Data Structures using Structure Components:
  - Measures
  - Attributes
  - Identifiers
- Very similar to GSIM

[Figure: a Logical Data Set made up of Data Points, with columns labelled Identifier Component, Measure Component and Identifier Component]
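The data model described above can be sketched as two small Python classes. This is an illustration only: the class names, the `add` and `key_of` methods and the sample component names are assumptions made for the example, and the rule that identifier values must be unique within a Data Set is stated here as a working assumption about what "Identifier" means.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataStructure:
    """Structure Components of a Data Set, grouped by role."""
    identifiers: tuple
    measures: tuple
    attributes: tuple = ()


@dataclass
class DataSet:
    """A Data Set: Data Points that all follow one Data Structure."""
    structure: DataStructure
    data_points: list = field(default_factory=list)

    def key_of(self, point):
        # The identifier values together identify a Data Point.
        return tuple(point[name] for name in self.structure.identifiers)

    def add(self, **values):
        s = self.structure
        components = s.identifiers + s.measures + s.attributes
        if set(values) != set(components):
            raise ValueError(f"expected components {components}, got {tuple(values)}")
        if any(self.key_of(p) == self.key_of(values) for p in self.data_points):
            raise ValueError("a Data Point with these identifier values already exists")
        self.data_points.append(values)


# Hypothetical structure and Data Point.
structure = DataStructure(
    identifiers=("Country", "Year"),
    measures=("Percentage",),
    attributes=("Status",),
)
ds = DataSet(structure)
ds.add(Country="IT", Year=2014, Percentage=51.0, Status="final")
```

Keeping the structure separate from the Data Points mirrors the split between structural metadata and data that SDMX and GSIM also make.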

Transformation Model
- Takes a set of Transformation Expressions and organizes them into a Transformation Scheme
- Each Expression has an Operand, an Operator, and a Result
  - Operands can have Parameters
  - Operators and Results are identified by the Expression when it is executed
- VTL specifies the Operators and the types of Parameters
- VTL uses the SDMX Transformation model

Transformations and Process models
Transformation model
- It exists in SDMX, but not in GSIM and DDI
- It allows defining calculations through mathematical expressions
- It does not allow cycles (same structure as a spreadsheet)
Process model
- It exists in SDMX, GSIM, DDI and other standards (e.g. BPM)
- It allows defining calculations through a process
- It allows cycles (like a procedural programming language)
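The "no cycles" property of the transformation model can be illustrated with a small evaluator that orders expressions by their dependencies, much as a spreadsheet orders its cells. This is a sketch under stated assumptions: the dictionary layout of the scheme, the `run_scheme` function and the restriction to binary arithmetic operators are all invented for the example, and the ordering relies on Python's standard `graphlib` module (Python 3.9 or later).

```python
import operator
from graphlib import CycleError, TopologicalSorter

# Hypothetical operator table; VTL defines its own, much richer set.
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def run_scheme(scheme, inputs):
    """Evaluate a scheme given as {result: (operator, (operand1, operand2))}.

    Expressions are evaluated in dependency order; a cyclic scheme
    raises graphlib.CycleError.
    """
    graph = {result: set(operands) for result, (_, operands) in scheme.items()}
    values = dict(inputs)
    for name in TopologicalSorter(graph).static_order():
        if name in scheme:  # names not in the scheme are plain inputs
            op, (left, right) = scheme[name]
            values[name] = OPERATORS[op](values[left], values[right])
    return values


# Each expression has operands, an operator and a result.
scheme = {
    "c": ("+", ("a", "b")),   # c := a + b
    "d": ("*", ("c", "a")),   # d := c * a, uses the result of the first expression
}
values = run_scheme(scheme, {"a": 2, "b": 3})

# A scheme whose results depend on each other cannot be ordered.
cyclic = {"x": ("+", ("y", "a")), "y": ("+", ("x", "a"))}
try:
    run_scheme(cyclic, {"a": 1})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

A process model, by contrast, would need explicit control flow (loops, conditions) rather than a single ordering of expressions.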

GSIM Process Model

Process Method and Rules

Governance and Standards Alignment
- VTL will be maintained by the SDMX TWG
  - Extensions will be considered for inclusion in future versions
  - Has already produced some feedback to GSIM for the next version
- VTL can be mapped against SDMX
- VTL can be directly utilized by DDI in those places where computations are included
- VTL could be used in CSPA services where processing is performed
  - As GSIM processing Rules

What's next?
- More operators and features + bug-fixing + fine-tuning = VTL 1.1
- Reuse of rules, structural validation?
- SDMX specifications (e.g. for exchanging VTL rules in SDMX messages, for storing rules and for requesting validation rules from web services) in progress
- Implementation tests with some pilot domains
- Integration within the ESS Validation Architecture (Validation project with national statistical institutes)

Conclusions
- A formal, unambiguous and standard language was needed for encoding validation rules, so that these can be translated into specific data editing systems
- Use of generic software services provided within the ESS community is foreseen
- A great achievement, led by a task force with experts from statistical institutes, central banks, international organisations and (a few) private experts
Thanks for your attention!
Marco.Pellegrino@ec.europa.eu

Examples

VTL Grammar: A Simple Example
I need to check that the aggregation over "Country" and "Year" gives a value of 100:
1) I apply the aggregate operator to the dataset stripped of the "Gender" column (keep(Country, Year, Percentage)) and sum over "Percentage";
2) I pass the result to the check operator to verify that the value is = 100. In case of error, the imbalance parameter returns the percentage value that was obtained. The all parameter at the end returns all the data.
The complete formula is:
check (ds1[keep (Country,Year, Percentage)][aggregate sum(Percentage)]=100, imbalance(Percentage), all)
The imbalance is not required; you can ask for it if you want something returned in the cases where the result of the check is "false" (it is 0 if the result is true). The all parameter indicates that you want all records, both the correct ones and the incorrect ones.
Nikos:
datasetACountryYear := datasetA[keep(Country,Year,Population_percentage)][aggregate sum(Population_percentage)]
result := check(datasetACountryYear = 100, all)
Laura:
dsr:=check (ds1[keep (Country,Year, Percentage)][aggregate sum(Percentage)]=100, imbalance(Percentage), all)
Is the total = 100?
check (ds1[keep (Country,Year, Percentage)][aggregate sum(Percentage)]=100, imbalance(Percentage), all)

Steps
ds1[keep (Country,Year, Percentage)][aggregate sum(Percentage)]
check (ds1[keep (Country,Year, Percentage)][aggregate sum(Percentage)]=100, imbalance(Percentage), all)
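To make the two steps concrete, the sketch below reproduces them in plain Python. It is not a VTL engine: the rows are made up, the function name `check_total` is hypothetical, and the treatment of the imbalance (0 when the check passes, the obtained total when it fails) follows the description in the notes above rather than the VTL specification.

```python
from collections import defaultdict

# Made-up rows; one country deliberately does not add up to 100.
ds1 = [
    {"Country": "IT", "Year": 2014, "Gender": "F", "Percentage": 51},
    {"Country": "IT", "Year": 2014, "Gender": "M", "Percentage": 49},
    {"Country": "FR", "Year": 2014, "Gender": "F", "Percentage": 52},
    {"Country": "FR", "Year": 2014, "Gender": "M", "Percentage": 47},
]


def check_total(rows, keys, measure, expected, return_all=True):
    """Sum `measure` over `keys` (step 1), then compare each total to `expected` (step 2)."""
    totals = defaultdict(int)
    for row in rows:
        # Grouping by Country and Year only is what dropping "Gender" achieves.
        totals[tuple(row[k] for k in keys)] += row[measure]

    results = []
    for key, total in totals.items():
        ok = total == expected
        if ok and not return_all:
            continue  # without "all", keep only the failing records
        record = dict(zip(keys, key))
        record["result"] = ok
        record["imbalance"] = 0 if ok else total
        results.append(record)
    return results


dsr = check_total(ds1, ("Country", "Year"), "Percentage", 100)
errors_only = check_total(ds1, ("Country", "Year"), "Percentage", 100, return_all=False)
```

With `return_all=True` the result holds one record per Country and Year, mirroring the all parameter; with `return_all=False` only the groups that fail the rule remain.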

VTL Grammar: Another Example
We have two Data Sets (D1 and D2) with the same structure:

VTL Grammar: A Simple Example (cont.)
We want to create a table (Dresult) which provides totals, combining the values for the US and the European Union:
Dresult := D1 + D2

Results
Dresult is a Data Set containing the United States plus the European Union:
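The addition of two Data Sets with the same structure can be sketched in Python as follows. The figures, the component names ("Year", "Value") and the function `add_data_sets` are hypothetical, and the choice to keep only the Data Points whose identifier values appear in both Data Sets is an assumption made for this illustration, not a statement of VTL's exact semantics.

```python
def add_data_sets(d1, d2, identifiers, measure):
    """Add `measure` for the Data Points of d1 and d2 that share identifier values."""
    index = {tuple(p[i] for i in identifiers): p for p in d2}
    result = []
    for point in d1:
        key = tuple(point[i] for i in identifiers)
        if key in index:  # assumption: unmatched Data Points are left out
            combined = {i: point[i] for i in identifiers}
            combined[measure] = point[measure] + index[key][measure]
            result.append(combined)
    return result


# Made-up figures: D1 stands for the United States, D2 for the European Union.
D1 = [{"Year": 2013, "Value": 10}, {"Year": 2014, "Value": 12}]
D2 = [{"Year": 2013, "Value": 20}, {"Year": 2014, "Value": 21}]

Dresult = add_data_sets(D1, D2, identifiers=("Year",), measure="Value")
```

The point of the example is that the operator works on whole Data Sets: one expression replaces a loop over every matching pair of Data Points.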