VTL – Validation and Transformation Language: a new emerging standard


VTL – Validation and Transformation Language: a new emerging standard Wiesbaden 10-11 November 2015 Luigi Bellomarini

Let’s start by validating a questionnaire

SG18B. In what year did you come to live in Italy the first time? Year |_|_|_|_| Don’t know |0|9|9|7|
SG18F. Since when have you been living in Italy without leaving the Country for one year or more? Year |_|_|_|_|
SG20. Would you like to tell me how old you are? Age |_|_|_|
SURVEY (Id, First time, Decided to stay, Age): 1 1998 2014 25 2 2015 35 3 2013 2012 50 4 997 5 1999 2000 7
# an answer has been given
> answer_given := survey#SG18B <> 997 and survey#SG18F <> 997
ANSWER_GIVEN (Id, CONDITION): 1 TRUE 2 3 4 FALSE 5
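As a rough illustration of what this statement computes, the Python sketch below evaluates the same condition for every row and returns one boolean per Id. The row values and the names `DONT_KNOW` and `answer_given` are assumptions made for the example; they are not the slide's data and not part of VTL.

```python
# Illustrative rows only (not the slide's data): Id -> (SG18B, SG18F).
DONT_KNOW = 997  # questionnaire code for "Don't know"

survey = {
    1: (1998, 2014),
    2: (2013, 2012),
    3: (997, 997),
}


def answer_given(rows):
    """Return Id -> bool: True when neither year is the "Don't know" code."""
    return {
        row_id: sg18b != DONT_KNOW and sg18f != DONT_KNOW
        for row_id, (sg18b, sg18f) in rows.items()
    }


result = answer_given(survey)
```

The result keeps the Id of each input row, in the same way the ANSWER_GIVEN dataset keeps the identifier next to the CONDITION column.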

SURVEY (Id, First time, Decided to stay, Age): 1 1998 2014 25 2 2015 35 3 2013 2012 50 4 997 5 1999 2000 7
# age has been specified
> age_specified := not isnull(survey#SG20)
AGE_SPECIFIED (Id, CONDITION): 1 TRUE 2 3 4 5

SURVEY (Id, First time, Decided to stay, Age): 1 1998 2014 25 2 2015 35 3 2013 2012 50 4 997 5 1999 2000 7
# she decided to stay not before the first visit
> years_ok := survey#SG18F >= survey#SG18B
YEARS_OK (Id, CONDITION): 1 TRUE 2 3 FALSE 4 5

SURVEY (Id, First time, Decided to stay, Age): 1 1998 2014 25 2 2015 35 3 2013 2012 50 4 997 5 1999 2000 7
# current age is greater than or equal to the number of years she has been staying
> correct_age := survey#SG20 >= (current_year - survey#SG18F)
CORRECT_AGE (Id, CONDITION): 1 TRUE 2 3 4 5 FALSE

SURVEY (Id, First time, Decided to stay, Age): 1 1998 2014 25 2 2015 35 3 2K13 2012 50 4 997 2013 5 1999 2000 7
# years are valid numbers
> year_numbers := match_characters(survey, SG18B, "[1234567890]")
YEAR_NUMBERS (Id, CONDITION): 1 TRUE 2 3 FALSE 4 5
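The character check can be pictured with a regular expression. The sketch below is written in Python and assumes that `match_characters` succeeds only when every character of the value belongs to the given class; that reading, and the helper name `year_numbers`, are assumptions for illustration. The sample values 1998, 2K13 and 997 are taken from the SURVEY table.

```python
import re

# Assumption: a year is valid when it consists only of the digits 0-9.
DIGITS_ONLY = re.compile(r"[0-9]+")


def year_numbers(rows):
    """Return Id -> bool for a mapping Id -> year as text."""
    return {
        row_id: DIGITS_ONLY.fullmatch(text) is not None
        for row_id, text in rows.items()
    }


checked = year_numbers({1: "1998", 3: "2K13", 4: "997"})
```

Note that 997 passes this check: it is a well-formed number, and rejecting it as the "Don't know" code is the job of a separate rule.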

SURVEY (Id, First time, Decided to stay, Age): 1 1998 2014 25 2 2015 35 3 2013 2012 50 4 1900 5 1999 2000 7
# years are within a specific time interval
> year_interval := survey#SG18F between 1970 and 2015
YEAR_INTERVAL (Id, CONDITION): 1 TRUE 2 3 4 FALSE 5

SURVEY (Id, First time, Decided to stay, Age): 1 1998 2014 25 2 2015 -1 3 2013 2012 250 4 1900 5 1999 2000 7
# age is a valid and reasonable number
> good_age := match_characters(survey, SG20, "[1234567890]") and survey#SG20 between 0 and 120
GOOD_AGE (Id, CONDITION): 1 TRUE 2 FALSE 3 4 5

Let’s see another kind of report…

Report (PERS_ID, YEAR, GEO, SEX, JOB, INCOME): DE001 2011 DE M 1 2340 DE002 4 1850 FR002 FR 3450 FR003 2360 FR004 1830 FR005 NULL F 8 1940
# with job 4, income is less than 10K
> income_4 := Report[filter Job = 4]#Income <= 10000
# otherwise, the income is greater than 1K
> income_any := Report[filter Job <> 4]#Income > 1000

Report (PERS_ID, YEAR, GEO, SEX, JOB, INCOME): DE001 2011 DE M 1 2340 DE002 4 1850 FR002 FR 3450 FR003 2360 FR004 1830 FR005 NULL F 8 1940
# all in one statement
> income_chk := if Report#Job = 4 then Report#Income <= 10000 else Report#Income > 1000
# job is among the valid ones
> job_ok := Report[keep Job] exists_in JobCodeList#Id
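The two checks on this slide can be sketched in Python as follows. The three rows use the JOB and INCOME values that are legible in the Report table; the content of `valid_jobs` is hypothetical, because the slide does not show JobCodeList.

```python
def income_chk(job, income):
    """Job 4 must earn at most 10000; any other job must earn more than 1000."""
    return income <= 10000 if job == 4 else income > 1000


def job_ok(job, valid_jobs):
    """True when the job code appears in the code list."""
    return job in valid_jobs


# (PERS_ID, JOB, INCOME) for three rows of the Report table.
report = [("DE001", 1, 2340), ("DE002", 4, 1850), ("FR005", 8, 1940)]
valid_jobs = {1, 4}  # hypothetical code list, not given on the slide

income_result = {pid: income_chk(job, inc) for pid, job, inc in report}
job_result = {pid: job_ok(job, valid_jobs) for pid, job, _ in report}
```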

Report (PERS_ID, YEAR, GEO, SEX, JOB, INCOME): DE001 2014 DE M 4 2340 2015 1850 FR002 FR 1 3450 2360
# income for 2015 differs by no more than 5K from the one in 2014
> income_14_15 := abs(Report[keep PERS_ID, Year] [filter Year = 2014]#Income - [filter Year = 2015]#Income) <= 5000

Report (PERS_ID, YEAR, GEO, SEX, JOB, INCOME): DE001 2015 DE M 1 2340 DE002 4 1850 FR002 FR 3450 FR003 2360 FR004 2014 1830
# the average income for France in 2015 is less than 10K
> avg_fr := Report[keep GEO, Income] [filter GEO = "FR"] [aggregate avg(Income)] < 10000
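A filter followed by an aggregate can be sketched in Python as below. The rows are illustrative assumptions, not a reconstruction of the Report table. The sketch also filters on the year, as the comment on the slide describes, which is an assumption about the intended rule.

```python
from statistics import mean

# Illustrative (GEO, YEAR, INCOME) rows; the values are assumptions.
report = [
    ("DE", 2015, 2340),
    ("FR", 2015, 3450),
    ("FR", 2015, 2360),
    ("FR", 2014, 1830),
]


def avg_income(rows, geo, year):
    """Average income of the rows matching the given country and year."""
    incomes = [income for g, y, income in rows if g == geo and y == year]
    return mean(incomes)


avg_fr = avg_income(report, "FR", 2015)
avg_fr_ok = avg_fr < 10000
```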

Overview of VTL

General characteristics
- Standard definition of algorithms
  - validation of statistical data
  - calculation of derived data
- Designed for users and oriented to statistics
  - logical datasets as first-class objects

Integrated approach
- independent of the phases of the statistical process
- independent of the statistical domain
- suitable for various typologies of statistical data (dimensional, surveys, registers, micro and macro, quantitative and qualitative)

Validation rationale
- The validation of information content is a kind of calculation: a Transformation in SDMX wording (see SDMX 2.1 package 13, Transformations and Expressions)
- Get the available information (e.g. from the survey or the report, from the database)
- Calculate the needed derived (sometimes aggregated) info
- Validate
Diagram: calculate A, get B, …, calculate N, validate f(A, B, …, N)

Validation rationale (example)
# average income by country for 2015 is less than the average income in 2014
> avg_income_14 := Report[filter Year = 2014][keep Country, Income][aggregate avg(Income)]
> avg_income_15 := Report[filter Year = 2015][keep Country, Income][aggregate avg(Income)]
> avg_income_chk := avg_income_14 > avg_income_15
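The get, calculate, validate sequence of this example can be sketched in Python. The rows and the helper name `avg_income_by_country` are assumptions made for illustration. The final comparison is limited to countries present in both years, which is also an assumption about how the two derived datasets are matched.

```python
from collections import defaultdict
from statistics import mean


def avg_income_by_country(rows, year):
    """Calculate step: average income per country for one year."""
    groups = defaultdict(list)
    for country, row_year, income in rows:
        if row_year == year:
            groups[country].append(income)
    return {country: mean(values) for country, values in groups.items()}


# Get step: illustrative (Country, Year, Income) rows.
report = [
    ("DE", 2014, 2340),
    ("DE", 2015, 1850),
    ("FR", 2014, 3450),
    ("FR", 2015, 3600),
]

avg_14 = avg_income_by_country(report, 2014)
avg_15 = avg_income_by_country(report, 2015)

# Validate step: one boolean per country found in both derived datasets.
avg_income_chk = {c: avg_14[c] > avg_15[c] for c in avg_14.keys() & avg_15.keys()}
```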

VTL and the statistical process

Features
- Declarative
- Strongly typed
- Dataset oriented
- Statements, expressions and operators
# income for 2015 differs by no more than 5K from the one in 2014
> income_14_15 := abs(Report[filter Year = 2014]#Income - Report[filter Year = 2015]#Income) <= 5000
Diagram labels: statement, expression, operator

The VTL model
VTL is agnostic of the underlying information model (e.g. SDMX, DDI, GSIM, …).
- SDMX (http://www.sdmx.org): Statistical Data and Metadata eXchange. International initiative aiming at standardizing and modernizing the mechanisms and the processes for the exchange of statistical data and metadata among international organizations and their member countries.
- DDI (http://www.ddialliance.org/what): Data Documentation Initiative. Initiative for an international standard for documenting data from the behavioral, social and economic sciences.
- GSIM (http://www1.unece.org/stat/platform/display/metis/Generic+Statistical+Information+Model): Generic Statistical Information Model. Reference framework of internationally agreed definitions for the definition of information in the production of official statistics.

The VTL model (cont’d)
- The different standards may use the language, mapping their artefacts into VTL ones
- The language is closed, in the sense that it manipulates artefacts of the model and produces other artefacts of the model
- The model is inspired by those of SDMX and GSIM

The VTL model

The VTL model and SDMX
Diagram labels (SDMX): Data flow, Measure, Dimension, Data Structure, Attribute, Data Structure Definition

Diagram labels (VTL): Identifier Components, Measure Components, Attribute Components, Data Structure, Data Points

How does VTL work?
- Operators are applied to every Data Point of the input Datasets
- Operators produce new Datasets
- N-ary operators preliminarily combine the datasets

Scalar math
A (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 -12.3; 2012 FR F 2 23.2
> B := abs(round(A#INCOME)) + 1
B (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 13; 2012 FR F 2 24
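To make the "one operator applied to every Data Point" idea concrete, here is a small Python sketch that reproduces the numbers on this slide. Representing a dataset as a dictionary from identifier tuple to measure, and the helper name `apply_to_measure`, are assumptions of the sketch, not VTL concepts.

```python
# Dataset as a mapping: (YEAR, GEO, SEX, JOB) identifiers -> INCOME measure.
A = {
    (2011, "DE", "M", 1): -12.3,
    (2012, "FR", "F", 2): 23.2,
}


def apply_to_measure(dataset, fn):
    """Apply fn to the measure of every data point; identifiers are unchanged."""
    return {ids: fn(value) for ids, value in dataset.items()}


B = apply_to_measure(A, lambda income: abs(round(income)) + 1)
```

The result is a new dataset with the same identifiers, in line with the statement that operators produce new Datasets rather than modifying their input.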

String manipulation
A (YEAR, GEO, SEX, JOB, INCOME): 2011 de M 1 2012 fr F 2
> B := lower(A#Sex)
> C := "EU_" + upper(A#Geo)
B (YEAR, GEO, SEX, JOB, INCOME): 2011 de m 1 2012 fr f 2
C (YEAR, GEO, SEX, JOB, INCOME): 2011 EU_DE M 1 2012 EU_FR F 2

Dataset math
A (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 1000 4 2000 FR 3000 7 2500
B (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 2 4 FR 8
> C := (A#Income + B#Income) / 2
> D := (A#Income * B#Income)
C (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 501 4 1002 FR 1504 7 2500
D (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 2000 4 8000 FR 24000
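A binary operator first has to pair up the data points of its two inputs. The Python sketch below does this by matching identifiers and keeping only the pairs found in both datasets, an assumption that agrees with the three rows of D above. The measures are the slide's values; the (GEO, JOB) identifiers are simplified and partly assumed.

```python
def combine(left, right, fn):
    """Apply fn to the measures of data points whose identifiers match."""
    return {ids: fn(left[ids], right[ids]) for ids in left.keys() & right.keys()}


# Simplified identifiers (GEO, JOB); the key of the third row is assumed.
A = {("DE", 1): 1000, ("DE", 4): 2000, ("FR", 1): 3000, ("FR", 7): 2500}
B = {("DE", 1): 2, ("DE", 4): 4, ("FR", 1): 8}

C = combine(A, B, lambda a, b: (a + b) / 2)
D = combine(A, B, lambda a, b: a * b)
```

With this matching rule the ("FR", 7) data point of A has no partner in B and is left out of both results.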

Relational operators
A (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 1000 4 2000 FR 3000 7 4000
B (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 1000 IT 4 2000
> C := intersect(A, B)
> D := union(A, B)
C (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 1000
D (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 1000 4 2000 FR 3000 7 4000 IT

Statistical operators
A (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 1000 4 2000 FR 3000 7 9000
> C := A[keep Year, Geo, Income][aggregate avg(Income)]
> D := A[keep Year, Income][aggregate max(Income)]
C (YEAR, GEO, INCOME): 2011 DE 1500 FR 6000
D (YEAR, INCOME): 2011 9000

Validation
A (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 1000 4 2000 FR 3000 7 9000
> C := A[GEO = "FR"]#Income >= 4000
C (YEAR, GEO, SEX, JOB, CONDITION): 2011 FR M 1 FALSE 7 TRUE

Validation (cont’d)
A (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 1000 4 2000 FR 2 3000 3 9000
JOBS (ID, NACE): 1 1006; 3 1007; 4 1008
> C := A#JOB exists_in JOBS#ID
C (YEAR, GEO, SEX, JOB, CONDITION): 2011 DE M 1 TRUE 4 FR 2 FALSE 3
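The membership test behind `exists_in` can be sketched in Python as a lookup of each JOB value in the set of JOBS identifiers. The JOBS content and the JOB values 1, 4, 2, 3 are taken from the slide; the dictionary representation and the variable names are assumptions of the sketch.

```python
# JOBS dataset from the slide: ID -> NACE.
jobs = {1: 1006, 3: 1007, 4: 1008}

# JOB values of the four data points of A, in slide order.
a_jobs = [1, 4, 2, 3]

# One condition per data point: does the job code exist among the JOBS ids?
condition = {job: job in jobs for job in a_jobs}
```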

Validation (cont’d)
A (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 1000 4 2000 FR 3000
B (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 1000 4 2000 FR 6000
> C := check(A[GEO = "FR"]#Income = B[GEO = "FR"]#Income, imbalance = abs(A#Income - B#Income))
C (YEAR, GEO, SEX, JOB, CONDITION, IMBALANCE): 2011 FR M 1 FALSE 3000
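A check that reports both a condition and an imbalance can be sketched in Python as follows. The function name `check_equal` and the simplified (GEO, JOB) identifiers are assumptions; the incomes 3000 and 6000 and the expected outcome (FALSE, imbalance 3000) come from the slide.

```python
def check_equal(left, right):
    """Return identifiers -> (condition, imbalance) for matching data points."""
    result = {}
    for ids in left.keys() & right.keys():
        a, b = left[ids], right[ids]
        result[ids] = (a == b, abs(a - b))
    return result


# The FR data points of A and B, keyed by simplified (GEO, JOB) identifiers.
A_fr = {("FR", 1): 3000}
B_fr = {("FR", 1): 6000}

C = check_equal(A_fr, B_fr)
```

Returning the imbalance next to the boolean lets a reader see how far off a failing data point is, not just that it failed.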

Conditionals and Booleans
A (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 1000 FR 4 2000 3000
B (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 34 FR 4 36 6000
> C := if (A#GEO = "FR" and A#JOB = 1) then A else B
C (YEAR, GEO, SEX, JOB, INCOME): 2011 DE M 1 34 FR 4 36 3000

Where we are
- VTL 1.0 published in March 2015: sdmx.org/wp-content/uploads/2015/03/VTL-1- package-2015.zip
- BNF: Extended Backus-Naur Form
- Maintained by SDMX TWG within VTL task force

Work in progress
- VTL 1.0: theoretical and practical evaluation
- VTL 1.1 (by Summer 2016)
  - feedback from ESSnet
  - language extension and new operators (e.g. functional)
  - reusability of the language
- SDMX implementation
  - Messages for exchanging VTL rules
  - Registry for storing VTL rules
  - Web services for retrieving VTL rules

twg@sdmx.org