A strategy on structural metadata management based on SDMX and the GSIM models Stefania Bergamasco, Alessio Cardacino, Francesco Rizzo, Mauro Scanu, Laura.


A strategy on structural metadata management based on SDMX and the GSIM models Stefania Bergamasco, Alessio Cardacino, Francesco Rizzo, Mauro Scanu, Laura Vignola Istat METIS 2013 Geneva, 6 May 2013

Index
- Introduction
- The Istat standard and methodology to create and manage integrated and reusable metadata
- How to model structural metadata for the data dissemination phase
- How to define integrated and reusable classifications
- The impact of this standard on technological standards
- General technological standards
- Enhancements of the SDMX standard

Introduction
Since 2010 Istat has changed its central systems:
- Dissemination system I.Stat (dati.istat.it) (1)
- Single Exit Point
- Unified Metadata System
From the metadata point of view, Istat's aim is:
i) to have a unified vision of metadata management inside Istat;
ii) to facilitate horizontal integration among statistical domains, in order to reduce the stovepipe approach;
iii) to foster vertical integration with the European Statistical System, through the adoption of common standards and the harmonization of content (GSIM, SDMX).
(1) I.Stat is based on OECD technology.

This presentation illustrates the Istat standard and methodology to create and manage integrated and reusable classifications, highlighting its impact on technological standards. It is important to underline that Istat has defined its standards in accordance with three different points of view at the same time:
- Technological issues
- Statistical concepts
- Management questions, in order to optimize the statistical process and to reduce the workload of the production sector

The Istat standard and methodology to create and manage integrated and reusable metadata

How to model structural metadata for the data dissemination phase
How can a table like this be described in terms of structural metadata and the traceability problem? This table describes two distinct macrodata.

Process industrialization led to the need to manage metadata transformations along the statistical process, from validated microdata to disseminated macrodata.

Household average monthly expenditure
- Population: households
- Numerical variable: monthly expenditure
- Categorical variables: Territory, Tenure status, …
- (Statistical) operator: mean value

Household average monthly income
- Categorical variables: Territory, Tenure status, …

Household average monthly expenditure / household average monthly income
- Operator: ratio
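The description above can be sketched as a small data structure. This is only an illustrative sketch, not Istat's actual model: the class names `SimpleMacrodata` and `CompositeMacrodata` are hypothetical, and the population, variable and operator of the income aggregate are assumed by analogy with the expenditure aggregate, since the slide does not list them.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SimpleMacrodata:
    """One aggregate: an operator applied to a numerical variable over a population."""
    name: str
    population: str
    variable: str
    operator: str
    dimensions: Tuple[str, ...]  # categorical variables used for the breakdown


@dataclass(frozen=True)
class CompositeMacrodata:
    """An aggregate derived from other aggregates through an operator."""
    name: str
    operator: str
    operands: Tuple[SimpleMacrodata, ...]

    def lineage(self) -> Tuple[str, ...]:
        # Traceability: names of the aggregates this one is computed from.
        return tuple(op.name for op in self.operands)


expenditure = SimpleMacrodata(
    name="Household average monthly expenditure",
    population="households",
    variable="monthly expenditure",
    operator="mean",
    dimensions=("Territory", "Tenure status"),
)

# Assumption: same population and operator as the expenditure aggregate.
income = SimpleMacrodata(
    name="Household average monthly income",
    population="households",
    variable="monthly income",
    operator="mean",
    dimensions=("Territory", "Tenure status"),
)

ratio = CompositeMacrodata(
    name="Average monthly expenditure / average monthly income",
    operator="ratio",
    operands=(expenditure, income),
)

print(ratio.lineage())
```

Keeping the operands inside the composite aggregate is one way to make each published figure traceable back to the aggregates it was derived from.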

In order to organize structural metadata, it turned out to be appropriate to reuse concepts aligned with international standards. For industrialization purposes, GSIM is useful because it uses an input/output approach in managing different concepts. In our context we mainly used the following concepts:
- Unit/population: target population, frame population, survey population, analysis population, survey unit, analysis unit
- Variable
- Classification
- Operator
In this context it is important to manage the lists of codes characterizing each concept. Let's focus on classifications.
E.g.: simple macrodata (Aggregate 1, Aggregate 2) are combined through an operator into composite macrodata.

How to define integrated and reusable classifications
In order to create a single data warehouse (DW), a National Statistical Institute has to face two major problems:
- the integration of the data stored in the database;
- the dynamism of the dimensions over time.

From an integration point of view
Suppose we publish data in the following tables, and suppose a user wants to analyse the data as well as to «work» with them:

Males by marital status     Marriages by bridegroom marital status     Disabled people by marital status
Male marital status         Bridegroom marital status                  Disabled people marital status
01 never married            01 never married                           01 never married
02 married                  02 divorced                                02 married
03 divorced                 03 widowed                                 03 divorced/widowed
04 widowed

There are some problems in the three code lists above:
- different items with the same code (e.g. code 02 means "married" in one list and "divorced" in another);
- the same item with different codes (e.g. "divorced" is coded 03 in one list and 02 in another).
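Both kinds of conflict can be detected mechanically. The following is a minimal sketch over the three code lists of the example; the function name `find_conflicts` and the dictionary layout are illustrative assumptions, not part of the Istat system.

```python
from collections import defaultdict

# The three code lists from the example, keyed by variable name.
code_lists = {
    "Male marital status": {
        "01": "never married", "02": "married", "03": "divorced", "04": "widowed",
    },
    "Bridegroom marital status": {
        "01": "never married", "02": "divorced", "03": "widowed",
    },
    "Disabled people marital status": {
        "01": "never married", "02": "married", "03": "divorced/widowed",
    },
}


def find_conflicts(lists):
    """Return (codes used for several items, items carrying several codes)."""
    labels_by_code = defaultdict(set)
    codes_by_label = defaultdict(set)
    for items in lists.values():
        for code, label in items.items():
            labels_by_code[code].add(label)
            codes_by_label[label].add(code)
    same_code = {c: sorted(l) for c, l in labels_by_code.items() if len(l) > 1}
    same_item = {l: sorted(c) for l, c in codes_by_label.items() if len(c) > 1}
    return same_code, same_item


same_code_different_items, same_item_different_codes = find_conflicts(code_lists)
print(same_code_different_items)
print(same_item_different_codes)
```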

From a dynamism point of view
First: we know that a survey changes from time to time, so it is possible that in year “x” a survey publishes the number of people by gender, marital status and age class (5 age classes), while in year “x+2” it publishes the number of people by gender, marital status and age class (7 age classes). Publication in Excel files causes no problems, but in a database, where data must be loaded into the same table so that users can browse data across different time references, the consequence is that a classification can change from one time reference to another.

Second: there are two different strategies to create a single DW:
- linear model: everything should be defined ex ante. In other words, all classifications and cubes should be collected and available before populating the DW.
- cyclical model: the DW is populated with just some surveys, with their classifications and data cubes, and the whole DW is completed with other surveys step by step.
(Diagram: planning phase and population phase for Survey 1, Survey 2, Survey 3.)

Linear versus cyclical model, compared on:
- good management of change
- certainty that the classifications can be maintained over time
- possibility to deploy a first release of the DW immediately
- “certainty” of the solution found

In a few words, classification management has to cope with two issues: integration and lack of stability. In order to address these issues, how should classifications be defined and used? The main ideas are in the following slides …

First of all, consider the integrated code list with all the necessary items (1):

Marital status
01 never married
02 married
03 divorced
04 widowed
05 divorced/widowed

(1) An example in the same direction is the Age code list of Eurostat.

The second thing to do is to distinguish among three different roles:
- a variable name, which describes the statistical context of a dimension;
- a code list (classification), which describes the possible categories that a variable can assume;
- a group of items, which describes the link between the first and the second for a particular publication (other examples are in appendix 1 of the paper).

Code list: marital status   Male marital status   Bridegroom marital status   Disabled people marital status
01 never married            X                     X                           X
02 married                  X                                                 X
03 divorced                 X                     X
04 widowed                  X                     X
05 divorced/widowed                                                           X

The columns are the variable names; the X marks in each column form its group of items.
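The three roles can be illustrated with a short sketch: one integrated code list, and for each variable name the group of items it uses. The names `marital_status`, `groups` and `items_for` are hypothetical and chosen only for this example.

```python
# Integrated code list: every item has exactly one code.
marital_status = {
    "01": "never married",
    "02": "married",
    "03": "divorced",
    "04": "widowed",
    "05": "divorced/widowed",
}

# Group of items: the subset of the code list used by each variable name.
groups = {
    "Male marital status": ["01", "02", "03", "04"],
    "Bridegroom marital status": ["01", "03", "04"],
    "Disabled people marital status": ["01", "02", "05"],
}


def items_for(variable_name):
    """Resolve the group of items of a variable against the shared code list."""
    return {code: marital_status[code] for code in groups[variable_name]}


print(items_for("Bridegroom marital status"))
```

With this layout "divorced" is always 03, whichever publication it appears in, so the two integration problems of the original tables no longer arise.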

The integration and lack of stability issues imply that an NSI also has to manage the codification of classification items. The common statistical standards to code an item are:
- letter: A – agriculture, B – xxx
- number: 01 – never married, 02 – married; 1 – celibe/nubile (unmarried), 2 – coniugato/a (married)
- acronym: BS – autobus (bus), CH – pullman (coach), TR – treno (train)
- with “.” or “,”
- with fixed positions

Suppose that in the first cycle of the DWH implementation the size class code list in Figure (a) becomes available, while in the second cycle there is the need to introduce also the “0-1” and “2-9” items (Figure (b)).
[Figures (a) and (b): the size class code list, ending with an “and over” item, before and after the addition.]

Which codes do we have to use? It is actually impossible to change the codification of the old items because of their link to the data already available. So we can only add new items. If we use 05 and 06 the classification becomes unreadable: for example, the visualization order is jeopardized. The same also holds true if letters are used instead of numbers:

a  0-9
e  0-1
f  2-9
b  10-50
c
d  101 and over
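The problem can be reproduced in a few lines. This sketch assumes the display order is derived from the codes themselves; item "c" is left out because its label is not given in the slide.

```python
# First cycle: letter codes assigned in reading order.
size_class = {"a": "0-9", "b": "10-50", "d": "101 and over"}

# Second cycle: old codes cannot change, so the new items get the next free letters.
size_class.update({"e": "0-1", "f": "2-9"})

# Ordering by code pushes the new, small size classes to the end of the list.
display = [size_class[code] for code in sorted(size_class)]
print(display)
```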

The hierarchical standard could be of help (Table 6.c). But can this hierarchical standard solve the problem if we also need to add the following items: “10-30”, “31-60”, “61-80”, “81-100”?

a    0-9
a.1  0-1
a.2  2-9
b    10-50
c
d    101 and over
?    10-30
?    31-60
?    61-80
?    81-100

Hence, in order to manage the evolution of a classification, our suggestion is:
- to distinguish the concepts of code, visualization order and father code;
- to use an acronym as the code.
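A minimal sketch of this suggestion follows. The acronyms (`S0_9`, `S10_50`, ...), the order values and the `Item` class are hypothetical, and a single optional father is used here for brevity. The point is only that the code stays fixed while the visualization order and the father code are stored separately.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Item:
    code: str                     # acronym; fixed once data are linked to it
    label: str
    order: int                    # visualization order; independent of the code
    father: Optional[str] = None  # father code, if the item refines another one


def displayed(items: List[Item]) -> List[str]:
    """Labels in visualization order, regardless of when the items were added."""
    return [item.label for item in sorted(items, key=lambda i: i.order)]


# First cycle.
items = [
    Item("S0_9", "0-9", 10),
    Item("S10_50", "10-50", 40),
    Item("S101_OVER", "101 and over", 90),
]

# Second cycle: two new items that refine "0-9".
items += [
    Item("S0_1", "0-1", 20, father="S0_9"),
    Item("S2_9", "2-9", 30, father="S0_9"),
]

# Later: classes that do not nest cleanly in the old hierarchy still fit.
items += [
    Item("S10_30", "10-30", 50),
    Item("S31_60", "31-60", 60),
    Item("S61_80", "61-80", 70),
    Item("S81_100", "81-100", 80),
]

print(displayed(items))
```

Leaving gaps between order values is just one convenient choice; because the order is not part of the code, it could equally be renumbered without touching the data.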

The impact of this standard on technological standards

General technological standards
Each characteristic described above defines a specific standard from the technological point of view. In a few words, statistical software has to manage the classification objects:
- a classification has a name;
- each item of a classification has a visualization order;
- each item of a classification can have one or more fathers;
- the codification of an item has to be a string (acronym);
as well as the relationship between a classification and a variable name:
- the relationship has a specific name (the variable name);
- once a relationship is defined, it is mandatory to select the subset of items of the corresponding code list.
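These requirements can be read as validation rules. The sketch below is one possible reading, not the Istat software: the class names, the acronyms (`NM`, `MAR`, ...) and the choice of `DIVWID` as father of `DIV` and `WID` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Item:
    code: str                                         # string code (acronym)
    label: str
    order: int                                        # visualization order
    fathers: List[str] = field(default_factory=list)  # zero, one or more fathers


@dataclass
class Classification:
    name: str
    items: Dict[str, Item]


@dataclass
class Relationship:
    variable_name: str             # the relationship is named by the variable
    classification: Classification
    selected_codes: List[str]      # mandatory subset of the code list


def validate(rel: Relationship) -> List[str]:
    """Return the list of rule violations (empty if the relationship is valid)."""
    errors = []
    cl = rel.classification
    if not cl.name:
        errors.append("classification has no name")
    for key, item in cl.items.items():
        if not isinstance(item.code, str) or item.code != key:
            errors.append(f"item {key!r}: code must be a string matching its key")
        for father in item.fathers:
            if father not in cl.items:
                errors.append(f"item {key!r}: unknown father {father!r}")
    if not rel.variable_name:
        errors.append("relationship has no variable name")
    if not rel.selected_codes:
        errors.append("relationship selects no items")
    for code in rel.selected_codes:
        if code not in cl.items:
            errors.append(f"selected code {code!r} is not in the code list")
    return errors


marital_status = Classification(
    name="Marital status",
    items={
        "NM": Item("NM", "never married", 1),
        "MAR": Item("MAR", "married", 2),
        "DIV": Item("DIV", "divorced", 3, fathers=["DIVWID"]),
        "WID": Item("WID", "widowed", 4, fathers=["DIVWID"]),
        "DIVWID": Item("DIVWID", "divorced/widowed", 5),
    },
)

ok = Relationship("Bridegroom marital status", marital_status, ["NM", "DIV", "WID"])
bad = Relationship("Disabled people marital status", marital_status, ["NM", "XX"])

print(validate(ok))
print(validate(bad))
```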

We can also underline that if a classification is used for both the dissemination and the survey phases, we will probably need to manage the visualization order differently in each. Hence, it would be better for the visualization order to be an attribute of the code list/variable relationship rather than of the classification.
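As a small illustration of this point, the visualization order can be stored per relationship while the code list stays shared. The relationship names, acronyms and orderings below are invented for the example.

```python
# Shared code list (hypothetical acronyms).
code_list = {"NM": "never married", "MAR": "married", "DIV": "divorced", "WID": "widowed"}

# Visualization order kept on each code list/variable relationship.
order_by_relationship = {
    "Male marital status (dissemination)": {"NM": 1, "MAR": 2, "DIV": 3, "WID": 4},
    "Male marital status (survey)": {"MAR": 1, "NM": 2, "WID": 3, "DIV": 4},
}


def ordered_labels(relationship):
    """Labels of the shared code list, in the order defined by one relationship."""
    order = order_by_relationship[relationship]
    return [code_list[code] for code in sorted(order, key=order.get)]


print(ordered_labels("Male marital status (dissemination)"))
print(ordered_labels("Male marital status (survey)"))
```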

Enhancements of the SDMX standard
In accordance with international agreements, data dissemination should be defined according to SDMX. Hence, we use SDMX through a metadata organization that gives a statistical role to the SDMX artefacts, following GSIM concepts, e.g.:
- Population/statistical units: code lists
- Variables: organized in appropriate concept schemes (for both categorical and numerical variables)
- Statistical operators: code list of the operators used for transforming data between two subsequent phases (e.g. average, median, variable total when passing from validated to analysed data)

- Aggregate data: code list, where each item has its own attributes (population, variable, operator, …)
SDMX should be improved in order to describe metadata relationships along the statistical process:
- link between a list of aggregates and their attributes
- historicity of single codes in a code list
- need for a “Transformation and expressions” language

Thanks for your attention Stefania Bergamasco – Unit Chief of Integrated systems for the dissemination of statistical information (Integration, Quality, Research and Production Networks Development Dept ) ……