4. SDMX: Main objects for data exchange Raynald Palmieri Eurostat Unit B5: “Central data and metadata services” SDMX Basics course, March 2016
The SDMX Components
Describe statistics in a standard way: objects and their relationships (Data Structure Definition (DSD), Concepts, Code List)
Central management and standard access: SDMX Registry, SDMX Web Services
Cross Domain Concepts, Cross Domain Code Lists, Statistical Domains, Metadata Common Vocabulary
Push: provider generates and sends file to receiver
Pull: provider opens web service to data; receiver downloads regularly
Hub: special case of pull; receiver downloads on end user request
Describing the data exchange Who? Who? When? How? Where? What? What?
To exchange statistical data in a standard way, we need to collect information and to define a model that will convey this information. We will analyse the different types of information and see how the SDMX information model is built to handle them. Imagine that Italy wishes to send a statistical table and its corresponding quality report. Take a simplified table: three tourism topics over four periods.
Dataflows - classification Category Tourism Statistical Tables = data flows Sub categories
SDMX Implementation steps Dataflows Concepts & Code lists DSD sharing SDMX Data Structure Definition
Dataflows - classification Categories Dataflows Tourism Capacity Occupancy Night_Spent Arrival_of_residents Occupancy_rate
Concepts & Codelists : Tourism Example What do we want to exchange? Statistical tables
Preparation phase SDMX Implementation steps Dataflows Concepts & Code lists DSD sharing SDMX Data Structure Definition
Model of the statistical table Number Tourism establishments Italy Annual data 2529
Any statistical table is a set of observations or measures. But the observation by itself means nothing. We need metadata to identify the figure, such as the tourism establishments, the unit, the country, the frequency, the time… This information is called structural metadata. We can now introduce our first SDMX object: the Concept. The concept is a label describing the data. It has an ID, a name and a description in several languages.
Model of the statistical table: What do we need to do first? Identify the Concepts A concept is a unit of knowledge created by a unique combination of characteristics (SDMX Information Model) Sources Existing data set tables From website From applications Data Collection Instruments Questionnaires/Excel spreadsheets Handbooks, User Guides Database Tables Existing Data Structure Definitions From other organisations Legislation/Regulation
Identifying the concepts FREQUENCY COUNTRY TOURISM_ACTIVITY TOURISM_INDICATOR UNIT OBS_VALUE TIME E OBS_STATUS P
In our example, we can identify the following statistical concepts to describe the information in the statistical table: frequency (in this case annual); country; tourism topic; time; the observation value, which is the actual single figure; and the observation status, defined by footnotes like “estimated” or “provisional”. At the same time as we identify the statistical concepts, we also define the format of the information (alphanumeric, numeric, etc.). Then we group all the statistical concepts into an SDMX container called a Concept scheme.
Concept Scheme
Like all SDMX objects, the Concept Scheme has an ID, a name and a description in several languages. We will see later that the Concept Scheme is a maintainable object which can have many versions. The ID and the name are mandatory in the SDMX information model; the description is optional. The name should be provided in at least one language, generally English. In the Concept Scheme, we can also define a format for the statistical concepts, to be used later for validation purposes, but this is also optional. Finally, when statistical concepts correspond to coded values, we define Code list objects. For example, the concept FREQUENCY takes fixed values such as A, M, Q, etc. For these we define an SDMX Code list object.
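As a rough illustration of the rules above (mandatory ID and name, optional description and format, optional link to a code list), here is a minimal Python sketch. The class and field names and the scheme ID CS_TOURISM_EXAMPLE are assumptions made for this example; they are not taken from the SDMX specification or from any SDMX library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """A statistical concept: mandatory ID and name, the rest optional."""
    id: str
    name: Dict[str, str]                                        # language -> name
    description: Dict[str, str] = field(default_factory=dict)   # optional
    text_format: Optional[str] = None                           # optional, e.g. "Numeric"
    code_list_id: Optional[str] = None                          # set for coded concepts

    def __post_init__(self) -> None:
        if not self.id:
            raise ValueError("a concept needs an ID")
        if not self.name:
            raise ValueError("a concept needs a name in at least one language")


@dataclass
class ConceptScheme:
    """Container grouping the concepts; it has its own ID, name and version."""
    id: str
    name: Dict[str, str]
    version: str = "1.0"
    concepts: List[Concept] = field(default_factory=list)

    def get(self, concept_id: str) -> Concept:
        for concept in self.concepts:
            if concept.id == concept_id:
                return concept
        raise KeyError(concept_id)


# Hypothetical scheme for the tourism table (IDs are illustrative only).
tourism_concepts = ConceptScheme(
    id="CS_TOURISM_EXAMPLE",
    name={"en": "Tourism concepts"},
    concepts=[
        Concept("FREQUENCY", {"en": "Frequency"}, code_list_id="CL_FREQ"),
        Concept("COUNTRY", {"en": "Country"}),
        Concept("OBS_VALUE", {"en": "Observation value"}, text_format="Numeric"),
    ],
)
```

The validation in `__post_init__` mirrors the "ID and name are mandatory" rule; everything else defaults to empty, which is one possible way to model "optional".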
Identify/Define Code Lists
Purpose of a Code List:
Constrains the value domain of concepts when used in a structure like a data structure definition
Defines a shortened, language-independent representation of the values
Gives semantic meaning to the values, possibly in multiple languages
Agreeing on harmonised code lists is an important aspect of defining a data structure definition
Concepts & Codelists : Tourism Example SDMX Code List
A Code list is a maintainable SDMX container. Each code list is identified uniquely by an ID, a maintenance agency and a version; each code has an ID that is unique within its list. The name can be provided in several languages. Partial code lists can also be exchanged (v2.1). The content of the partial code list is specified in a Constraint.
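To make the role of a code list concrete, the following Python sketch models a list with an ID, a maintenance agency and a version, plus codes with names in several languages. The class itself, the agency "EXAMPLE" and the version "1.0" are assumptions for illustration; only the codes A, M and Q for frequency come from the course material.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CodeList:
    """A maintainable list of codes with multilingual names."""
    id: str
    agency_id: str
    version: str
    codes: Dict[str, Dict[str, str]] = field(default_factory=dict)  # code -> {lang: name}

    def is_valid(self, value: str) -> bool:
        # The code list constrains the value domain of a concept.
        return value in self.codes

    def label(self, value: str, lang: str = "en") -> str:
        # The short code is language independent; the name is not.
        return self.codes[value][lang]


cl_freq = CodeList(
    id="CL_FREQ",
    agency_id="EXAMPLE",   # hypothetical maintenance agency
    version="1.0",         # hypothetical version
    codes={
        "A": {"en": "Annual", "fr": "Annuel"},
        "Q": {"en": "Quarterly", "fr": "Trimestriel"},
        "M": {"en": "Monthly", "fr": "Mensuel"},
    },
)
```

With this sketch, `cl_freq.is_valid("A")` accepts a reported value, while an unknown code such as "X" is rejected before it reaches the data set.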
Exercise: Deriving a concept scheme from a table
This slide shows a statistical table from WASTE.
Identify the statistical concepts that describe the structure of the table
Imagine the possible format and code list for each concept
Fill in the Concept Scheme table
Launch the DSW
Create the code lists
Create the concept scheme
Proposed solution Deriving a concept scheme from a table This is a proposed Concept Scheme.
Data Set Structure Computers need to know the structure of data in terms of: Dimensionality Additional metadata Measures (Observation) Concepts Valid content Code Lists Non coded format (integer, date, text)
Concepts play roles in a Data Structure
A Data Structure comprises:
Dimensions: concepts that identify the observation value
Attributes: concepts that add additional metadata about the observation value (as a value or the context of the value)
Measure: the concept that is the observation value
Representation: any of these may be coded, text, date/time, number, etc.
DERIVING A DATA STRUCTURE FROM A TABLE FREQUENCY COUNTRY TOURISM_ACTIVITY TOURISM_INDICATOR UNIT OBS_VALUE TIME E OBS_STATUS P DIMENSIONS ATTRIBUTES MEASURES
Until now we have defined the statistical concepts that identify the information contained in our statistical table. But to have a full picture of the table we need to assign roles to our concepts. Indeed, if we think of this table as a slice of a statistical cube, each observation has coordinates that identify it uniquely. In our example, Frequency, Time Period, Country, Tourism Indicator and Tourism Activity are required to identify the statistics: they act as dimensions. The observation value, the actual figure, is what we call the measure. The "Observation status" and the "Unit" further explain the figures: they act as attributes. We can see that they are displayed at different levels of the table. This depends on whether the attribute explains only a single figure, a group of figures or the whole table. In our example the "Unit" explains the whole table: it is attached at dataset level. The "Observation status" explains only single figures: it is attached at observation level. In SDMX it is also possible to attach attributes to a group of observations. This is not the case in our example. The result of our small exercise is a formal definition of the corresponding Data Structure Definition, also called DSD, with all the structural metadata of the table.
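The role assignment described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class names, the DSD ID and the sample codes (IND_1, ACT_1, the year and the value) are made up for the example and do not come from the tourism table or from an SDMX library.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Attribute:
    concept_id: str
    attachment: str  # "dataset", "group", "series" or "observation"


@dataclass(frozen=True)
class DataStructure:
    id: str
    dimensions: Tuple[str, ...]   # concepts that identify an observation
    time_dimension: str
    measure: str                  # the concept holding the figure itself
    attributes: Tuple[Attribute, ...]

    def observation_key(self, record: Dict[str, str]) -> Tuple[str, ...]:
        """Return the coordinates that identify one observation uniquely."""
        coordinates = (*self.dimensions, self.time_dimension)
        missing = [c for c in coordinates if c not in record]
        if missing:
            raise ValueError(f"missing dimension values: {missing}")
        return tuple(record[c] for c in coordinates)

    def attributes_at(self, level: str) -> List[str]:
        """List the attribute concepts attached at a given level."""
        return [a.concept_id for a in self.attributes if a.attachment == level]


tourism_dsd = DataStructure(
    id="DSD_TOURISM_EXAMPLE",  # hypothetical ID
    dimensions=("FREQUENCY", "COUNTRY", "TOURISM_INDICATOR", "TOURISM_ACTIVITY"),
    time_dimension="TIME",
    measure="OBS_VALUE",
    attributes=(
        Attribute("UNIT", "dataset"),            # explains the whole table
        Attribute("OBS_STATUS", "observation"),  # explains single figures
    ),
)

# Made-up record: the codes and the figure are placeholders, not real data.
record = {
    "FREQUENCY": "A", "COUNTRY": "IT",
    "TOURISM_INDICATOR": "IND_1", "TOURISM_ACTIVITY": "ACT_1",
    "TIME": "2015", "OBS_VALUE": "100", "OBS_STATUS": "P",
}
key = tourism_dsd.observation_key(record)
```

The key contains only dimension values; the measure and the attributes are deliberately excluded, because they describe the observation rather than identify it.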
DATA STRUCTURE DEFINITION
Like all SDMX objects, the DSD has an ID, a name and a description in several languages. We will see later that the DSD is a maintainable object which can have many versions. The ID and the name are mandatory in the SDMX information model; the description is optional. The name should be provided in at least one language, generally English. In the DSD, we can also define locally a format and a code list for the statistical concepts, to be used later for validation purposes, but this is also optional. The SDMX information model is flexible enough to define local representations to be used only in a specific DSD that shares a common Concept scheme with other DSDs.
DATA STRUCTURE DEFINITION - Summary Reference DSD Reference Concept Scheme Reference Code lists
DATA STRUCTURE DEFINITION - Design: Data Structure Wizard
Java desktop application
Graphical interface
For DSD designers
Maintenance of SDMX v2.0/2.1 data and metadata structures
Web service to query/submit SDMX registries
SDMX Registry: Designing & Publishing DSDs Graphical User Interface Web service
Now we'll turn to another important module of the standard: the IT architecture. The IT architecture involves the standard formats for data exchange, the different architectures for data exchange, and the SDMX registry.
Exercise: Consult a DSD
URL Registry (test purposes): https://webgate.test.ec.europa.eu/sdmxregistry/
DSD: WASTE_GENER
Now that we have defined our DSD, we will create the corresponding SDMX file with the Data Structure Wizard tool.
1) Launch the DSW.
2) Select DSD in the left tree and click on +.
3) Fill in the DSD information as mentioned in the slide (ID, description…).
4) Select the content tab and choose « Observation value ».
5) Select the subtab « dimensions » and add the dimensions. Note: dimensions can be assigned roles: isFrequency, IsTime, IsMeasure, simple dimension.
6) Select the subtab “Attributes” and add the Unit attribute.
Exercise: Browse the different objects of the DSD
DSD: WASTE_GENER
Codelists: CL_FREQ CL_GEO_EUCCEFTA CL_WASTE CL_HAZARD CL_NACE_R2_WASTE
Concept Scheme: CS_WASTE
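Besides the graphical interface, a registry that implements the SDMX 2.1 RESTful web service can be queried for a DSD with a URL of the form {base}/datastructure/{agencyID}/{resourceID}/{version}. The Python sketch below only builds such a URL. The base URL, the agency ID "ESTAT" and the helper function name are assumptions for illustration; whether the test registry used in the exercise exposes this endpoint is not confirmed by the course material.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def datastructure_url(base_url: str, agency_id: str, resource_id: str,
                      version: str = "latest",
                      references: Optional[str] = None) -> str:
    """Build an SDMX 2.1 REST structure query for a data structure definition."""
    parts = ("datastructure", agency_id, resource_id, version)
    path = "/".join(quote(part, safe="") for part in parts)
    url = f"{base_url.rstrip('/')}/{path}"
    if references:
        # e.g. "children" also returns the referenced code lists and concept scheme
        url += "?" + urlencode({"references": references})
    return url


# Hypothetical base URL and agency ID; only the DSD ID comes from the exercise.
url = datastructure_url(
    "https://registry.example.org/rest", "ESTAT", "WASTE_GENER",
    references="children",
)
```

Asking for `references=children` corresponds to what the exercise does by hand: opening the DSD and then browsing the code lists and the concept scheme it refers to.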
DSD Sharing: Tourism Example
How to achieve DSD sharing? Use of Constraints
The Constraint can define one or both of:
• the Codes in a Code List that are applicable. Ex: (A, M, W, Q) restricted to (A)
• the list of series keys that are applicable. Ex: FREQ = A, COUNTRY = IT, TOURISM_INDICATOR = A003, TOURISM_ACTIVITY = B100
Can be used to constrain a DSD when only a subset of the DSD content is meaningful.
Constraints are usually linked to the dataflows or the provision agreements.
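The effect of a constraint on code lists can be sketched in a few lines of Python. This is a simplified illustration, not the SDMX constraint model: the function names are invented for the example, and the COUNTRY codes other than IT are placeholders. Only the frequency example (A, M, W, Q restricted to A) comes from the slide.

```python
from typing import Dict, Set


def apply_constraint(code_lists: Dict[str, Set[str]],
                     allowed: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Restrict each dimension's codes to those the constraint allows."""
    constrained = {}
    for dimension, codes in code_lists.items():
        if dimension in allowed:
            unknown = allowed[dimension] - codes
            if unknown:
                raise ValueError(
                    f"{dimension}: codes not in the code list: {sorted(unknown)}")
            constrained[dimension] = codes & allowed[dimension]
        else:
            # Unconstrained dimensions keep their full code list.
            constrained[dimension] = set(codes)
    return constrained


def key_is_allowed(key: Dict[str, str],
                   constrained: Dict[str, Set[str]]) -> bool:
    """Check that every value of a series key is still applicable."""
    return all(value in constrained.get(dimension, set())
               for dimension, value in key.items())


# The shared DSD allows four frequencies; this dataflow only uses annual data.
shared_code_lists = {"FREQ": {"A", "M", "W", "Q"}, "COUNTRY": {"IT", "FR"}}
constrained = apply_constraint(shared_code_lists, {"FREQ": {"A"}})
```

The same DSD and code lists can then serve several dataflows, each with its own constraint, which is the point of DSD sharing.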
Constraints – Example DSD_TOUR_CAP_XS DSD_TOUR_DEM_XS
SDMX Dataset Define the structure DSD Dataset = XML file describing the table content according to the DSD. E P
Now that the data structure definition is defined, let’s see how the statistical data are organised in the file to be exchanged, which is called a Dataset in SDMX. In the related SDMX-ML Structure Specific time series data file, we can highlight the structural levels to which attributes can be attached: the Data set, to which the attribute UNIT is attached in this example; the Series, to which no attribute is attached in this example; and finally the Observation, to which the attribute Observation status is attached. Each piece of information in the table is placed into the dataset with its corresponding concept name: UNIT, FREQ, COUNTRY, TOURISM INDICATOR, TOURISM ACTIVITY, TIME, OBSERVATION VALUE and OBSERVATION STATUS.
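The three attachment levels can be made visible with a small Python sketch that builds a simplified, structure-specific-like XML fragment. It is not a valid SDMX-ML message: namespaces, the message envelope and the header are omitted, and the unit code "NBR", the periods and the figures are made up for the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Optional, Tuple

Observation = Tuple[str, int, Optional[str]]  # (period, value, status or None)


def build_dataset(dataset_attrs: Dict[str, str],
                  series_key: Dict[str, str],
                  observations: Iterable[Observation],
                  time_concept: str = "TIME") -> ET.Element:
    """Place each piece of information at its level, named by its concept."""
    dataset = ET.Element("DataSet", dict(dataset_attrs))       # dataset level
    series = ET.SubElement(dataset, "Series", dict(series_key))  # series level
    for period, value, status in observations:
        obs_attrs = {time_concept: period, "OBS_VALUE": str(value)}
        if status is not None:
            obs_attrs["OBS_STATUS"] = status                   # observation level
        ET.SubElement(series, "Obs", obs_attrs)
    return dataset


dataset = build_dataset(
    dataset_attrs={"UNIT": "NBR"},  # hypothetical unit code
    series_key={"FREQ": "A", "COUNTRY": "IT",
                "TOURISM_INDICATOR": "A003", "TOURISM_ACTIVITY": "B100"},
    observations=[("2012", 100, None), ("2013", 110, "P")],  # made-up figures
)
xml_text = ET.tostring(dataset, encoding="unicode")
```

In the resulting fragment UNIT appears once on the DataSet element, the dimension values appear once on the Series element, and OBS_STATUS appears only on the observations that carry a footnote.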
Syntaxes for SDMX datasets
Based on a common Information Model
SDMX-EDI (GESMES/TS): EDIFACT syntax; time-series oriented; one format for Data Sets
SDMX-ML: XML syntax; different formats for Data Sets; easier validation (XML based)
From the same information model, two syntaxes can be used: EDIFACT (SDMX-EDI, also known as GESMES/TS), under some conditions, and XML (SDMX-ML). SDMX-EDI can be used only with time series data sets, and only if the data structure definition contains the dimensions required by the GESMES/TS standard. There is only one format of SDMX-EDI. SDMX-ML uses the XML syntax, which makes it much easier to generate and to validate. SDMX-ML has four different formats for data sets. Finally, there are tools that can convert between formats and, under some conditions, between syntaxes in order to obtain the desired format.
SDMX-ML formats Conversions
Equivalent formats, based on the same IM: Generic SDMX-ML, Cross-sectional SDMX-ML, Compact SDMX-ML
Can be expanded to other formats (e.g. CSV, GESMES)
Equivalent SDMX-ML formats can be converted into one another. Equivalent SDMX-ML formats are those that are based on the same Information Model. The exception is a Cross-Sectional DSD that does not contain a time dimension, which the other formats require. Conversion is also possible between SDMX-EDI and SDMX-ML formats, provided the data set is a time series data set and all dimensions and attributes required by SDMX-EDI are present in the Information Model. It is also possible to convert to and from other formats such as CSV.
SDMX data common header
A common part is the <Header>, which has the same basic structure for all message types. An example is shown below providing typical information on the dataset. The information to be included in the header is inserted in the SDMX-ML message from a text file provided as input to the SDMX Converter tool. An example of the content of this file, “header.prop”, is shown in the next figure. The elements are listed in the left column and their corresponding values in the right column. Elements may be left blank, except ID, Test, Prepared, and Sender.
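As a rough sketch of the rule that ID, Test, Prepared and Sender must always be filled in, the Python function below builds a simplified header element from a dictionary of values. Namespaces are omitted, the dictionary keys simply reuse the element names (the actual key names used in “header.prop” are not shown here and are not assumed), and all sample values are invented.

```python
import xml.etree.ElementTree as ET
from typing import Dict

REQUIRED_FIELDS = ("ID", "Test", "Prepared", "Sender")


def build_header(fields: Dict[str, str]) -> ET.Element:
    """Build a simplified message header, rejecting blank mandatory fields."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"mandatory header fields left blank: {missing}")
    header = ET.Element("Header")
    ET.SubElement(header, "ID").text = fields["ID"]
    ET.SubElement(header, "Test").text = fields["Test"]
    ET.SubElement(header, "Prepared").text = fields["Prepared"]
    # The sender is identified by an id attribute rather than by element text.
    ET.SubElement(header, "Sender", {"id": fields["Sender"]})
    return header


# Invented sample values for illustration only.
header = build_header({
    "ID": "TOURISM_EXAMPLE_001",
    "Test": "false",
    "Prepared": "2016-03-01T09:00:00",
    "Sender": "IT1",
})
header_xml = ET.tostring(header, encoding="unicode")
```

Checking the mandatory fields before building the element keeps an incomplete header from reaching the message at all.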
SDMX 2.0 vs 2.1
SDMX-ML formats: Equivalent representations for reporting Datasets
Version 2.0: 4 data messages, each with a distinct format:
• GenericData
• CrossSectional Data
• Compact Data
• UtilityData
Version 2.1: there are now 4 data messages, based on two general formats:
• GenericData: GenericData, GenericTimeSeriesData
• StructureSpecificData: StructureSpecificData, StructureSpecificTimeSeriesData
Phased out
SDMX offers four different representations for reporting datasets according to the final use. These are equivalent in a way, but have different objectives. In most cases, conversion is possible from one representation to the others.
Generic message: conveys data in a form independent of a data structure definition. It is designed for data provision on websites and in any scenario where applications receiving the data may not have a detailed understanding of the data set's structure before they obtain the data set itself. Generic messages offer no validation.
Compact message: exchange of large data sets in a data structure definition-dependent form. Requires an XML schema specific to the DSD.
Utility message: for schema-based functions, such as advanced validation, in a data structure definition-dependent form.
Cross-Sectional message: exchange of many observation types in a data structure definition-dependent form. Used for non time-series data.
SDMX messages are described in detail in a dedicated student book, available by the end of this year.
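The difference between a DSD-independent (generic) and a DSD-dependent (structure-specific) representation can be illustrated with one series written both ways. The Python sketch below is loosely modelled on the version 2.1 element names, with namespaces and the message envelope omitted; the function names, the period and the figure are assumptions for the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def generic_series(series_key: Dict[str, str], period: str, value: int) -> ET.Element:
    """Generic form: concept IDs travel as data, so no DSD-specific schema is needed."""
    series = ET.Element("Series")
    key = ET.SubElement(series, "SeriesKey")
    for concept_id, code in series_key.items():
        ET.SubElement(key, "Value", {"id": concept_id, "value": code})
    obs = ET.SubElement(series, "Obs")
    ET.SubElement(obs, "ObsDimension", {"value": period})
    ET.SubElement(obs, "ObsValue", {"value": str(value)})
    return series


def structure_specific_series(series_key: Dict[str, str], time_concept: str,
                              period: str, value: int) -> ET.Element:
    """Structure-specific form: concept IDs become XML attribute names from the DSD."""
    series = ET.Element("Series", dict(series_key))
    ET.SubElement(series, "Obs", {time_concept: period, "OBS_VALUE": str(value)})
    return series


series_key = {"FREQ": "A", "COUNTRY": "IT"}
generic = generic_series(series_key, "2013", 110)             # made-up figure
specific = structure_specific_series(series_key, "TIME", "2013", 110)

generic_xml = ET.tostring(generic, encoding="unicode")
specific_xml = ET.tostring(specific, encoding="unicode")
```

The structure-specific fragment is much shorter, which is why that family suits large data sets, but a receiver needs the DSD (or a schema derived from it) to know which attribute names are legal.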
Data Structure Definition (DSD): Version 2.0 vs Version 2.1
Highlighted changes: support for non-time-series data structures; Measure Dimension; concept role as an explicit element.
[Diagram comparing the DSD components in versions 2.0 and 2.1. Labels: DSD, Concepts, Code lists, Dimensions, Measure Dimension, Measures, Primary Measure, Attributes, Concept Scheme.]
Constraint
Version 2.0: Constraint is only available for use in a Registry context. Constraint is embedded in the object it constrains (Dataflow, Provision agreement).
Version 2.1: Constraint is independently maintained. The same Constraint can be “used” to constrain multiple objects (DSD, Dataflow, Provision agreement).
Furthermore, the Constraint is structural metadata in version 2.1 and is contained in the Structure schemas. Note that a Constraint always constrains the allowable or actual content of structures or data sets that relate to a DSD. Rules have been specified to define how Constraints are inherited or “cascaded” when constraints are specified for one or more objects that are related to the same DSD (e.g. DSD and Dataflow).
Thank you for your attention! Questions