Background Data validation, a critical issue for the E.S.S.


0 VTL (Validation and Transformation Language): a new standard for data validation and processing
Marco Pellegrino, Eurostat. Acknowledgements: Bank of Italy, SDMX Technical Working Group, DDI Alliance, Bryan Fitzpatrick, Arofan Gregory, and others…

1 Background Data validation, a critical issue for the E.S.S.
Eurostat and Member States: double work or "no work"? Inefficiencies:
- Lack of coordination
- Lack of documentation
- Lack of formalisation of validation procedures and rules
- Low harmonisation of software solutions
Need for a comprehensive solution: a portfolio of actions in the framework of the ESS Vision 2020. The Communication from the Commission to the European Parliament and the Council on the production method of EU statistics (Document ESSC 2010/05/6) calls for:
- more harmonisation and standardisation of statistical methodologies for data validation within the ESS
- harmonising the IT infrastructure and sharing IT tools as a way to facilitate the use of agreed statistical methods, leading to better quality and higher productivity in the processing of statistical data

2 GSBPM (Generic Statistical Business Process Model)
Approach
SDMX originally focused on data collection and dissemination; the current trend is to support more stages of the statistical production process, as described by the GSBPM (Generic Statistical Business Process Model).
- VTL was developed by the SDMX Technical Working Group
- Based on earlier developments at the Banca d'Italia, and expansions of this work by Eurostat
- Included participants from the DDI Alliance
- Work started in 2012; a series of face-to-face and virtual meetings resulted in publication earlier this year
The goal was to produce a platform-neutral language for formally describing validation and transformation, so that these processes can be exchanged between different organizations even if they use different software to perform these functions.

3 Data Validation Process
Before/during transmission ("first level"), covered by SDMX today:
- Format check (SDMX-ML)
- Code check (SDMX DSD)
After transmission ("second level"), not yet covered by SDMX; to be addressed by SDMX-VTL:
- Detailed value check
- Mirror check
- …

4 The VTL initiative
Main goals:
- Define and preserve validation rules (document and preserve the validation know-how)
- Exchange and share validation rules (with reporting institutions & other correspondents)
- Apply validation rules in the collection and production processes (aiming at an industrialised processing of statistical data)
At a later stage: improve VTL to support more complex algorithms for data compilation and estimation.
This topic is already present within package n. 13 of the SDMX Information Model, named "Transformations and Expressions". This package describes the generic model aimed at tracking the derivation of data; it is derived from the CWM (Common Warehouse Metamodel), a standard from the OMG (Object Management Group) widely used in the IT field. It allows the identification and documentation of calculations by means of mathematical expressions, as well as the definition of an expression language that can be used to write those expressions. It also allows a formal representation of the operations to be performed, so that a program can "read" the metadata and translate the expressions into whatever computer language is appropriate for the calculation. This part of the model also allows validation rules among different data to be specified and documented by expressing them as calculations (for example, the coherence rule "a + b = c" can be written as "a + b - c = 0" and checked through the calculation "if ((a + b - c) = 0, then …, else …)").
However, the "Transformations and Expressions" package of the SDMX IM is only a basic framework, which requires more work on its integration and actual use. The work package aims to develop such an elaboration, announced as available in the future by the SDMX specifications. In concrete terms, it is required to design the "expression language" (the list of operators available for defining expressions and their formal grammar) and the IT formats for exchanging the definitions produced in this way.
The completion of this feature would make it possible to exchange the formal definitions of coherence rules and of the algorithms used or specified for calculating data. As the results of one calculation may be input to others, it would also be possible to define and exchange chains of controls/calculations, which express the relationships between statistical data (as happens between the cells of a spreadsheet).
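The idea of expressing a coherence rule as a calculation can be sketched in a few lines of Python. This is an illustration of the principle only, not part of the SDMX specification; the function name and tolerance parameter are my own.

```python
# Illustrative sketch: the coherence rule "a + b = c" rewritten as the
# calculation "a + b - c = 0" and checked as "if ((a + b - c) = 0, ...)".
def check_coherence(a: float, b: float, c: float, tol: float = 1e-9) -> bool:
    """Return True when the coherence rule a + b = c holds."""
    imbalance = a + b - c          # the rule, rewritten as a calculation
    return abs(imbalance) <= tol   # zero imbalance means the rule holds

print(check_coherence(40.0, 60.0, 100.0))  # True
print(check_coherence(40.0, 55.0, 100.0))  # False
```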

5 What is VTL 1.0?
- A reference framework for the creation of rules for data validation and transformation
- It maps to a clear and generic information model
- It aligns with relevant statistical information standards such as SDMX and GSIM
The SDMX VTL documentation comes in two parts (part 1 and part 2); the grammar is given in EBNF (Extended Backus-Naur Form), a technical notation.

6

7 Main VTL features
- User orientation
- Integrated approach
- IT implementation independence
- Active role for processing
- Extensibility and customizability
- Language effectiveness
Proper governance is needed.

8 The VTL Information Model
- VTL is a "stand-alone" specification: it can be used with SDMX, DDI, or potentially anything else, or on its own
- Because different standards have different information models, VTL must establish its own information model; other information models can be mapped against it
- VTL uses GSIM as a basis

9 VTL Data Model
- Organizes Data Points into Data Sets
- Describes Data Structures using Structure Components: Measures, Attributes, Identifiers
- Very similar to GSIM
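The data model just described can be sketched as a pair of small Python classes. The class and field names below are my own illustration, not taken from the VTL specification, and the sample Data Point is invented.

```python
# Illustrative sketch of the VTL data model: a Data Structure lists the
# Structure Components (Identifiers, Measures, Attributes), and a Data Set
# holds Data Points conforming to that structure.
from dataclasses import dataclass, field

@dataclass
class DataStructure:
    identifiers: list[str]  # components that identify each Data Point
    measures: list[str]     # the observed values
    attributes: list[str] = field(default_factory=list)  # qualifying metadata

@dataclass
class DataSet:
    structure: DataStructure
    data_points: list[dict]  # each Data Point maps component name -> value

structure = DataStructure(identifiers=["Country", "Year"], measures=["Percentage"])
ds = DataSet(structure, [{"Country": "IT", "Year": 2014, "Percentage": 48.0}])
print(len(ds.data_points))  # 1
```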

10

11 [Diagram: a Logical Data Set whose Data Points are organized by Identifier Components and a Measure Component]

12 Transformation Model
- Takes a set of Transformation Expressions and organizes them into a Transformation Scheme
- Each Expression has an Operand, an Operator, and a Result
- Operands can have Parameters
- Operators and Results are identified by the Expression when it is executed
- VTL specifies the Operators and the types of Parameters
- VTL uses the SDMX Transformation model
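A minimal sketch of this model in Python: a Transformation Scheme as an ordered list of expressions, each recording its Operands, Operator, and named Result. All names here (class, operator strings, data set names) are illustrative, not from the VTL specification.

```python
# Hypothetical sketch of the Transformation model: each expression has
# operands, an operator, and a result; a scheme groups them in order.
from dataclasses import dataclass

@dataclass
class Transformation:
    result: str      # name of the Data Set the expression produces
    operator: str    # the operator applied
    operands: list   # inputs: Data Set names or literal parameters

# A scheme whose steps mirror a keep / aggregate / check validation:
scheme = [
    Transformation("tmp", "keep", ["ds1", "Country", "Year", "Percentage"]),
    Transformation("agg", "aggregate_sum", ["tmp", "Percentage"]),
    Transformation("dsr", "check_eq", ["agg", 100]),
]
print([t.result for t in scheme])  # ['tmp', 'agg', 'dsr']
```

Note how the result of one expression ("tmp", then "agg") serves as an operand of the next, which is exactly the chaining of calculations the model is meant to document.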

13 Transformations and Process models
Transformation model:
- Exists in SDMX, but not in GSIM or DDI
- Allows defining calculations through mathematical expressions
- Does not allow cycles (the same structure as a spreadsheet)
Process model:
- Exists in SDMX, GSIM, DDI and other standards (e.g. BPM)
- Allows defining calculations through a process
- Allows cycles (like a procedural programming language)
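The spreadsheet-like, no-cycles property of the Transformation model means its expressions always admit a valid evaluation order, while a circular reference does not. A small Python sketch, using the standard library's topological sorter (the dependency data is invented for illustration):

```python
# Sketch of acyclic, spreadsheet-like evaluation: results may feed later
# expressions, so a valid evaluation order exists unless there is a cycle.
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# result name -> names of the results it depends on
deps = {"b": {"a"}, "c": {"a", "b"}}
order = list(TopologicalSorter(deps).static_order())
print(order)  # a valid evaluation order, e.g. ['a', 'b', 'c']

# A cycle (like a circular spreadsheet reference) is rejected:
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
except Exception as err:
    print(type(err).__name__)  # CycleError
```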

14

15 GSIM Process Model

16 Process Method and Rules

17 Governance and Standards Alignment
- VTL will be maintained by the SDMX TWG; extensions will be considered for inclusion in future versions
- The work has already produced some feedback to GSIM for its next version
- VTL can be mapped against SDMX
- VTL can be directly utilized by DDI in those places where computations are included
- VTL could be used in CSPA services where processing is performed, as GSIM processing Rules

18 What's next?
- More operators and features + bug-fixing + fine-tuning = VTL 1.1
- Reuse of rules, structural validation?
- SDMX specifications in progress (e.g. for exchanging VTL rules in SDMX messages, for storing rules, and for requesting validation rules from web services)
- Implementation tests with some pilot domains
- Integration within the ESS Validation Architecture (Validation project with national statistical institutes)

19 Conclusions
A formal, unambiguous and standard language was needed for encoding validation rules so that these can be translated into specific data editing systems. Use of generic software services provided within the ESS community is foreseen. A great achievement, led by a task-force with experts from statistical institutes, central banks, international organisations and (a few) private experts. Thanks for your attention!

20 Examples

21 VTL Grammar: A Simple Example
We need to check that aggregating over "Country" and "Year" gives the value 100:
1) use the aggregate operator on the dataset with the "Gender" column dropped (keep[Country, Year, Percentage]) and sum over "Percentage";
2) check the result with the check operator, verifying that the value is = 100. In case of error, the imbalance parameter returns precisely the percentage value obtained. The all parameter at the end returns all the data.
The complete formula is: check (ds1[keep (Country,Year, Percentage)][aggregate sum(Percentage)]=100, imbalance(Percentage), all)
The imbalance is not required; you can request it if you want something returned in the cases where the result of the check is "false" (it is 0 when the result is true). The all parameter indicates that you want all the records, both the correct and the erroneous ones.
Nikos: datasetACountryYear := datasetA[keep(Country,Year,Population_percentage)][aggregate sum(Population_percentage)] result := check(datasetACountryYear = 100, all)
Laura: dsr:=check (ds1[keep (Country,Year, Percentage)][aggregate sum(Percentage)]=100, imbalance(Percentage), all)
Is the total = 100? check (ds1[keep (Country,Year, Percentage)][aggregate sum(Percentage)]=100, imbalance(Percentage), all)

22 Steps ds1[keep (Country,Year, Percentage)][aggregate sum(Percentage)]
check (ds1[keep (Country,Year, Percentage)][aggregate sum(Percentage)]=100, imbalance(Percentage), all)
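The two steps above, keep/aggregate and then check, can be mimicked in plain Python to show what the VTL expression computes. This is an analogue of the check, not VTL itself, and the data values are invented for illustration:

```python
# Plain-Python analogue of the VTL check: drop Gender, sum Percentage by
# (Country, Year), and verify each total equals 100, reporting the
# imbalance (total - 100) when it does not. Sample data is made up.
from collections import defaultdict

data_points = [
    {"Country": "IT", "Year": 2014, "Gender": "F", "Percentage": 52.0},
    {"Country": "IT", "Year": 2014, "Gender": "M", "Percentage": 48.0},
    {"Country": "FR", "Year": 2014, "Gender": "F", "Percentage": 51.0},
    {"Country": "FR", "Year": 2014, "Gender": "M", "Percentage": 47.0},
]

totals = defaultdict(float)  # [keep ...][aggregate sum(Percentage)]
for dp in data_points:
    totals[(dp["Country"], dp["Year"])] += dp["Percentage"]

for key, total in totals.items():  # check(... = 100, imbalance(...), all)
    print(key, total == 100.0, "imbalance:", total - 100.0)
```

Because the loop reports every (Country, Year) pair, correct or not, it corresponds to the all parameter; the printed difference plays the role of imbalance.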

23 VTL Grammar: Another Example
We have two Data Sets (D1 and D2) with the same structure:

24 VTL Grammar: Another Example (cont.)
We want to create a table (Dresult) which provides totals, combining the values for the US and the European Union: Dresult := D1 + D2
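A rough Python sketch of what "Dresult := D1 + D2" does: the measure values of Data Points whose identifier values match are summed. The identifier keys and numbers below are invented for illustration; they are not the slide's actual table.

```python
# Sketch of VTL Data Set addition: Data Points are matched on their
# identifier values (here a (Year, Item) tuple) and measures are summed.
D1 = {("2013", "Exports"): 500.0, ("2013", "Imports"): 450.0}  # e.g. US
D2 = {("2013", "Exports"): 420.0, ("2013", "Imports"): 480.0}  # e.g. EU

# Sum the measure for every identifier combination present in both sets.
Dresult = {key: D1[key] + D2[key] for key in D1.keys() & D2.keys()}
print(Dresult[("2013", "Exports")])  # 920.0
```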

25 Results Dresult is a Data Set containing the United States plus the European Union:

