Published by Matthew Stevenson. Modified over 5 years ago.
1
VTL – Validation and Transformation Language: a new emerging standard
Wiesbaden November 2015 Luigi Bellomarini
2
Let’s start by validating a questionnaire
3
SG18B. In what year did you come to live in Italy the first time? Year |_|_|_|_| Don’t know |0|9|9|7|
SG18F. Since when have you been living in Italy without leaving the Country for one year or more? Year |_|_|_|_|
SG20. Would you like to tell me how old you are? Age |_|_|_|

SURVEY (Id, First time, Decided to stay, Age)
  1 1998 2014 25
  2 2015 35
  3 2013 2012 50
  4 997
  5 1999 2000 7

# an answer has been given
> answer_given := survey#SG18B <> 997 and survey#SG18F <> 997

ANSWER_GIVEN (Id, CONDITION)
  1 TRUE
  2
  3
  4 FALSE
  5
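To make the semantics of the rule concrete, here is a minimal Python sketch (not VTL). The function name `answer_given`, the dictionary layout and the two sample rows are assumptions made for this illustration; only the 997 "don't know" code and the two compared variables come from the slide.

```python
DONT_KNOW = 997  # "Don't know" code printed on the questionnaire

def answer_given(survey):
    """Return {Id: bool}: True when neither year carries the "don't know" code."""
    return {
        row["Id"]: row["SG18B"] != DONT_KNOW and row["SG18F"] != DONT_KNOW
        for row in survey
    }

# Illustrative rows, made up for this sketch (not the slide's data).
survey = [
    {"Id": 1, "SG18B": 1998, "SG18F": 2014},
    {"Id": 2, "SG18B": 997, "SG18F": 2013},
]
print(answer_given(survey))
```

As in the slide, the result is itself a small dataset: one Boolean condition per respondent Id.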
4
SURVEY (Id, First time, Decided to stay, Age)
  1 1998 2014 25
  2 2015 35
  3 2013 2012 50
  4 997
  5 1999 2000 7

# age has been specified
> age_specified := not isnull(survey#SG20)

AGE_SPECIFIED (Id, CONDITION)
  1 TRUE
  2
  3
  4
  5
5
SURVEY (Id, First time, Decided to stay, Age)
  1 1998 2014 25
  2 2015 35
  3 2013 2012 50
  4 997
  5 1999 2000 7

# she decided to stay not before the first visit
> years_ok := survey#SG18F >= survey#SG18B

YEARS_OK (Id, CONDITION)
  1 TRUE
  2
  3 FALSE
  4
  5
6
SURVEY (Id, First time, Decided to stay, Age)
  1 1998 2014 25
  2 2015 35
  3 2013 2012 50
  4 997
  5 1999 2000 7

# current age is greater than or equal to the number of years she has been staying
> correct_age := survey#SG20 >= (current_year - survey#SG18F)

CORRECT_AGE (Id, CONDITION)
  1 TRUE
  2
  3
  4
  5 FALSE
7
SURVEY (Id, First time, Decided to stay, Age)
  1 1998 2014 25
  2 2015 35
  3 2K13 2012 50
  4 997 2013
  5 1999 2000 7

# years are valid numbers
> year_numbers := match_characters(survey, SG18B, "[ ]")

YEAR_NUMBERS (Id, CONDITION)
  1 TRUE
  2
  3 FALSE
  4
  5
8
SURVEY (Id, First time, Decided to stay, Age)
  1 1998 2014 25
  2 2015 35
  3 2013 2012 50
  4 1900
  5 1999 2000 7

# years are within a specific time interval
> year_interval := survey#SG18F between 1970 and 2015

YEAR_INTERVAL (Id, CONDITION)
  1 TRUE
  2
  3
  4 FALSE
  5
9
SURVEY (Id, First time, Decided to stay, Age)
  1 1998 2014 25
  2 2015 -1
  3 2013 2012 250
  4 1900
  5 1999 2000 7

# age is a valid and reasonable number
> good_age := match_characters(survey, SG20, "[ ]") and survey#SG20 between 0 and 120

GOOD_AGE (Id, CONDITION)
  1 TRUE
  2 FALSE
  3
  4
  5
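The two-part check (a character pattern plus a range) can be sketched in Python as follows. This is an illustration only: the regular expression `\d+` is an assumption, since the pattern on the slide is not legible, and treating `between` as inclusive at both ends is also an assumption. The function name and its parameters are hypothetical.

```python
import re

def good_age(raw_age, pattern=r"\d+", low=0, high=120):
    """True when raw_age consists only of digits and lies in [low, high].

    `pattern` is an assumed stand-in for the slide's character pattern.
    """
    if raw_age is None or re.fullmatch(pattern, raw_age) is None:
        return False
    return low <= int(raw_age) <= high

for value in ["25", "-1", "250", None]:
    print(value, good_age(value))
```

The pattern test runs first so that non-numeric input never reaches the integer conversion.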
10
Let’s see another kind of report…
11
Report (PERS_ID, YEAR, GEO, SEX, JOB, INCOME)
  DE001 2011 DE M 1 2340
  DE002 4 1850
  FR002 FR 3450
  FR003 2360
  FR004 1830
  FR005 NULL F 8 1940

# with job 4, income is less than 10K
> income_4 := Report[filter Job = 4]#Income <= 10000
# otherwise, the income is greater than 1K
> income_any := Report[filter Job <> 4]#Income > 1000
12
Report (PERS_ID, YEAR, GEO, SEX, JOB, INCOME)
  DE001 2011 DE M 1 2340
  DE002 4 1850
  FR002 FR 3450
  FR003 2360
  FR004 1830
  FR005 NULL F 8 1940

# all in one statement
> income_chk := if Report#Job = 4 then Report#Income <= 10000 else Report#Income > 1000
# job is among the valid ones
> job_ok := Report[keep Job] exists_in JobCodeList#Id
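For readers more familiar with general-purpose languages, the two rules can be approximated in Python as below. This is a sketch, not VTL: the function names mirror the rule names for readability, while the list-of-dicts layout, the sample rows and the list of valid job codes are assumptions made for the example.

```python
def income_chk(report):
    """One flag per row: job 4 must earn <= 10000, any other job > 1000."""
    return [
        row["INCOME"] <= 10000 if row["JOB"] == 4 else row["INCOME"] > 1000
        for row in report
    ]

def job_ok(report, job_codes):
    """One flag per row: True when the row's job appears in the code list."""
    valid = set(job_codes)
    return [row["JOB"] in valid for row in report]

# Illustrative rows and code list, made up for this sketch.
report = [
    {"JOB": 4, "INCOME": 1850},
    {"JOB": 1, "INCOME": 2340},
    {"JOB": 8, "INCOME": 900},
]
print(income_chk(report))
print(job_ok(report, [1, 4]))
```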
13
Report (PERS_ID, YEAR, GEO, SEX, JOB, INCOME)
  DE001 2014 DE M 4 2340
  2015 1850
  FR002 FR 1 3450
  2360

# income for 2015 differs by no more than 5K from the one in 2014
> income_14_15 := abs(Report[keep PERS_ID, Year][filter Year = 2014]#Income - Report[keep PERS_ID, Year][filter Year = 2015]#Income) <= 5000
14
Report (PERS_ID, YEAR, GEO, SEX, JOB, INCOME)
  DE001 2015 DE M 1 2340
  DE002 4 1850
  FR002 FR 3450
  FR003 2360
  FR004 2014 1830

# the average income for France in 2015 is less than 10K
> avg_fr := Report[keep GEO, Income][filter GEO = "FR"][aggregate avg(Income)] < 10000
15
Overview of VTL
16
General characteristics
- Standard definition of algorithms
  - validation of statistical data
  - calculation of derived data
- Designed for users and oriented to statistics
  - logical datasets as first-class objects
17
- Integrated approach
  - independent of the phases of the statistical process
  - independent of the statistical domain
  - suitable for various types of statistical data (dimensional, surveys, registers, micro and macro, quantitative and qualitative)
18
Validation rationale
- The validation of information content is a kind of calculation
  - a Transformation in SDMX wording (see SDMX 2.1 package 13, Transformations and Expressions)
- Get the available information (e.g. from the survey or the report, from the database)
- Calculate the needed derived (sometimes aggregated) info
- Validate
calculate A, get B, …, calculate N, validate f(A,B,…N)
19
Validation rationale (example)
# average income by country for 2015 is less than the average income in 2014
> avg_income_14 := Report[filter Year = 2014][keep Country, Income][aggregate avg(Income)]
> avg_income_15 := Report[filter Year = 2015][keep Country, Income][aggregate avg(Income)]
> avg_income_chk := avg_income_14 > avg_income_15
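The get, calculate, validate sequence of this example can be sketched in Python as follows. The helper names, the list-of-dicts layout and the sample figures are assumptions made for illustration; comparing only countries present in both years is also an assumption about how the two derived datasets are matched.

```python
from collections import defaultdict

def avg_income_by_country(report, year):
    """Calculate step: average income per country for one year."""
    totals = defaultdict(lambda: [0.0, 0])  # country -> [sum, count]
    for row in report:
        if row["Year"] == year:
            acc = totals[row["Country"]]
            acc[0] += row["Income"]
            acc[1] += 1
    return {country: s / n for country, (s, n) in totals.items()}

def avg_income_chk(report):
    """Validate step: per country, is the 2014 average above the 2015 one?"""
    avg_14 = avg_income_by_country(report, 2014)
    avg_15 = avg_income_by_country(report, 2015)
    return {c: avg_14[c] > avg_15[c] for c in avg_14.keys() & avg_15.keys()}

# Get step: illustrative rows, made up for this sketch.
report = [
    {"Country": "FR", "Year": 2014, "Income": 3000},
    {"Country": "FR", "Year": 2014, "Income": 2000},
    {"Country": "FR", "Year": 2015, "Income": 2000},
    {"Country": "DE", "Year": 2014, "Income": 1000},
    {"Country": "DE", "Year": 2015, "Income": 1500},
]
print(avg_income_chk(report))
```

Each intermediate result is a dataset in its own right, which is what lets the final comparison be written as a single expression.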
20
VTL and the statistical process
21
Features
- Declarative
- Strongly typed
- Dataset oriented
Statements, expressions and operators
# income for 2015 differs by no more than 5K from the one in 2014
> income_14_15 := abs(Report[filter Year = 2014]#Income - Report[filter Year = 2015]#Income) <= 5000
Callout labels: expression, statement, operator
22
The VTL model
VTL is agnostic of the underlying information model (e.g. SDMX, DDI, GSIM, …).
- SDMX: Statistical Data and Metadata eXchange. International initiative aiming at standardizing and modernizing the mechanisms and the processes for the exchange of statistical data and metadata among international organizations and their member countries.
- DDI: Data Documentation Initiative. Initiative for an international standard for documenting data from the behavioral, social and economic sciences.
- GSIM: Generic Statistical Information Model. Reference framework of internationally agreed definitions for the definition of information in the production of official statistics.
23
The VTL model (cont’d)
- The different standards may use the language, mapping their artefacts into VTL ones
- The language is closed, in the sense that it manipulates artefacts of the model and produces other artefacts of the model
- The model is inspired by those of SDMX and GSIM
24
The VTL model
25
The VTL model and SDMX
Diagram labels: Data flow, Measure, Dimension, Data Structure, Attribute, Data Structure Definition
26
Diagram labels: Identifier Components, Measure Components, Attribute Components, Data Structure, Data Points
28
How does VTL work?
- Operators are applied to every Data Point of the input Datasets
- Operators produce new Datasets
- N-ary operators preliminarily combine the datasets
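The three points above can be pictured with a small Python sketch. It models a dataset as a dictionary from identifier tuples to a single measure value, which is a simplification chosen for this example; the helper names are hypothetical, and matching the two inputs only on identifiers they share is an assumption about how n-ary operators combine datasets.

```python
def apply_unary(dataset, op):
    """Apply op to the measure of every data point; returns a new dataset."""
    return {ids: op(value) for ids, value in dataset.items()}

def apply_binary(left, right, op):
    """Combine two datasets on their shared identifiers, then apply op."""
    shared = left.keys() & right.keys()
    return {ids: op(left[ids], right[ids]) for ids in shared}

# Identifiers are (YEAR, GEO, SEX, JOB); the measure is INCOME.
a = {(2011, "DE", "M", 1): -12.3, (2012, "FR", "F", 2): 23.2}
b = apply_unary(a, lambda income: abs(round(income)) + 1)
print(b)

x = {(2011, "DE", "M", 1): 1000, (2011, "DE", "M", 4): 2000}
y = {(2011, "DE", "M", 1): 2, (2011, "FR", "M", 1): 8}
print(apply_binary(x, y, lambda p, q: p * q))
```

Neither helper modifies its inputs, mirroring the idea that operators produce new datasets.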
29
Scalar math
A (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 -12.3
  2012 FR F 2 23.2

> B := abs(round(A#INCOME)) + 1

B (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 13
  2012 FR F 2 24
30
String manipulation
A (YEAR, GEO, SEX, JOB, INCOME)
  2011 de M 1
  2012 fr F 2

> B := lower(A#Sex)
> C := "EU_" + upper(A#Geo)

B (YEAR, GEO, SEX, JOB, INCOME)
  2011 de m 1
  2012 fr f 2

C (YEAR, GEO, SEX, JOB, INCOME)
  2011 EU_DE M 1
  2012 EU_FR F 2
31
Dataset math
A (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 1000
  4 2000
  FR 3000
  7 2500

B (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 2
  4
  FR 8

> C := (A#Income + B#Income) / 2
> D := (A#Income * B#Income)

C (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 501
  4 1002
  FR 1504
  7 2500

D (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 2000
  4 8000
  FR 24000
32
Relational operators
A (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 1000
  4 2000
  FR 3000
  7 4000

B (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 1000
  IT 4 2000

> C := intersect(A, B)
> D := union(A, B)

C (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 1000

D (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 1000
  4 2000
  FR 3000
  7 4000
  IT
33
Statistical operators
A (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 1000
  4 2000
  FR 3000
  7 9000

> C := A[keep Year, Geo, Income][aggregate avg(Income)]
> D := A[keep Year, Income][aggregate max(Income)]

C (YEAR, GEO, INCOME)
  2011 DE 1500
  FR 6000

D (YEAR, INCOME)
  2011 9000
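The effect of `keep` followed by `aggregate` resembles a group-by on the identifiers that are kept. The Python sketch below illustrates that reading; the `aggregate` and `mean` helpers are hypothetical names, and the rows are modelled on the slide's table with the Sex and Job columns left out.

```python
from collections import defaultdict

def aggregate(rows, keep, measure, func):
    """Group rows by the kept identifiers and reduce the measure with func."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[name] for name in keep)].append(row[measure])
    return {key: func(values) for key, values in groups.items()}

def mean(values):
    return sum(values) / len(values)

a = [
    {"Year": 2011, "Geo": "DE", "Income": 1000},
    {"Year": 2011, "Geo": "DE", "Income": 2000},
    {"Year": 2011, "Geo": "FR", "Income": 3000},
    {"Year": 2011, "Geo": "FR", "Income": 9000},
]
print(aggregate(a, ["Year", "Geo"], "Income", mean))
print(aggregate(a, ["Year"], "Income", max))
```

Dropping an identifier from the kept list widens each group, which is why keeping only the year yields a single row.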
34
Validation
A (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 1000
  4 2000
  FR 3000
  7 9000

> C := A[GEO = "FR"]#Income >= 4000

C (YEAR, GEO, SEX, JOB, CONDITION)
  2011 FR M 1 FALSE
  7 TRUE
35
Validation (cont’d)
A (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 1000
  4 2000
  FR 2 3000
  3 9000

JOBS (ID, NACE)
  1 1006
  3 1007
  4 1008

> C := A#JOB exists_in JOBS#ID

C (YEAR, GEO, SEX, JOB, CONDITION)
  2011 DE M 1 TRUE
  4
  FR 2 FALSE
  3
36
Validation (cont’d)
A (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 1000
  4 2000
  FR 3000

B (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 1000
  4 2000
  FR 6000

> C := check(A[GEO = "FR"]#Income = B[GEO = "FR"]#Income, imbalance = abs(A#Income - B#Income))

C (YEAR, GEO, SEX, JOB, CONDITION, IMBALANCE)
  2011 FR M 1 FALSE 3000
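A check that also reports how far the two values are apart can be sketched in Python as follows. This is an illustration under assumptions: the function name `check_equal` is hypothetical, datasets are modelled as dictionaries keyed by (YEAR, GEO, SEX, JOB), and the rows are modelled on the slide's tables.

```python
def check_equal(a, b, geo):
    """Compare incomes of a and b for one country; report flag and imbalance."""
    result = {}
    for ids in a.keys() & b.keys():
        if ids[1] != geo:  # ids[1] is the GEO identifier
            continue
        result[ids] = {
            "CONDITION": a[ids] == b[ids],
            "IMBALANCE": abs(a[ids] - b[ids]),
        }
    return result

a = {(2011, "DE", "M", 1): 1000, (2011, "DE", "M", 4): 2000, (2011, "FR", "M", 1): 3000}
b = {(2011, "DE", "M", 1): 1000, (2011, "DE", "M", 4): 2000, (2011, "FR", "M", 1): 6000}
print(check_equal(a, b, "FR"))
```

Carrying the imbalance next to the condition lets a reviewer see the size of a failure, not only that it occurred.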
37
Conditionals and Booleans
A (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 1000
  FR 4 2000
  3000

B (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 34
  FR 4 36
  6000

> C := if (A#GEO = "FR" and A#JOB = 1) then A else B

C (YEAR, GEO, SEX, JOB, INCOME)
  2011 DE M 1 34
  FR 4 36
  3000
38
Where we are
- VTL 1.0 published in March 2015
  - sdmx.org/wp-content/uploads/2015/03/VTL-1- package-2015.zip
- BNF: Extended Backus-Naur Form
- Maintained by the SDMX TWG within the VTL task force
39
Work in progress
- VTL 1.0
  - theoretical and practical evaluation
- VTL 1.1 (by Summer 2016)
  - feedback from ESSnet
  - language extension and new operators (e.g. functional)
  - reusability of the language
- SDMX implementation
  - Messages for exchanging VTL rules
  - Registry for storing VTL rules
  - Web services for retrieving VTL rules