A strategy on structural metadata management based on SDMX and the GSIM models Stefania Bergamasco, Alessio Cardacino, Francesco Rizzo, Mauro Scanu, Laura.

1 A strategy on structural metadata management based on SDMX and the GSIM models Stefania Bergamasco, Alessio Cardacino, Francesco Rizzo, Mauro Scanu, Laura Vignola Istat METIS 2013 Geneva, 6 May 2013

2 Index
- Introduction
- The Istat standard and methodology to create and manage integrated and reusable metadata
  - How to model structural metadata for the data dissemination phase
  - How to define integrated and reusable classifications
- The impact of this standard on technological standards
  - General technological standards
  - Enhancements of the SDMX standard

3 Introduction
Since 2010 Istat has changed its central systems:
- Dissemination system I.Stat (dati.istat.it) (1)
- Single Exit Point
- Unified Metadata System
From the metadata point of view, Istat's aims are:
i) to have a unified vision of metadata management inside Istat;
ii) to facilitate horizontal integration among statistical domains in order to reduce the stovepipe approach;
iii) to foster vertical integration with the European Statistical System, through the adoption of common standards and the harmonization of content (GSIM, SDMX).
(1) I.Stat is based on OECD technology

4 This presentation aims at illustrating the Istat standard and methodology to create and manage integrated and reusable classifications, highlighting its impact on technological standards. It is important to underline that Istat has defined its standards for classifications in accordance with three different points of view at the same time:
- Technological issues
- Statistical concepts
- Management questions, in order to optimize the statistical process and to reduce the production sector workload

5 The Istat standard and methodology to create and manage integrated and reusable metadata

6 How to model structural metadata for the data dissemination phase
How can a table like this be described in terms of structural metadata and the traceability problem? This table describes two distinct macrodata.

7 Process industrialization led to the necessity to manage metadata transformation along the statistical process, between validated microdata and disseminated macrodata.
Household average monthly expenditure
- Population: households
- Num. variable: monthly expenditure
- Categorical var.: Territory, Tenure status, …
- (Statistical) operator: mean value
Household average monthly income
- Categorical var.: Territory, Tenure status, …
Household average monthly expenditure / household average monthly income
- Operator: ratio

8 In order to organize structural metadata, it turned out to be appropriate to reuse concepts aligned with international standards. For industrialization purposes, GSIM is useful because it uses an input/output approach in managing different concepts. In our context we mainly used the following concepts:
- Unit/population (e.g. target population, frame population, survey population, analysis population, survey unit, analysis unit)
- Variable
- Classification
- Operator
Diagram labels: simple macrodata; composite macrodata (aggregate 1, aggregate 2, operator).
In this context it is important to manage the lists of codes characterizing each concept. Let's focus on classifications.
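The description of simple and composite macrodata can be sketched as a small data structure. This is only an illustration, not Istat's implementation: the class names, the field names and the `shared_dimensions` helper are hypothetical, and the population, numerical variable and operator of the income aggregate are assumed here because the slides list only its categorical variables.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SimpleAggregate:
    """A simple macrodata: population + numerical variable + operator."""
    name: str
    population: str
    variable: str
    categorical: Tuple[str, ...]
    operator: str


@dataclass(frozen=True)
class CompositeAggregate:
    """A composite macrodata: an operator applied to other aggregates."""
    name: str
    operator: str
    operands: Tuple[SimpleAggregate, ...]

    def shared_dimensions(self) -> Tuple[str, ...]:
        # Categorical variables present in every operand,
        # kept in the order used by the first operand.
        first, *rest = self.operands
        return tuple(
            v for v in first.categorical
            if all(v in o.categorical for o in rest)
        )


expenditure = SimpleAggregate(
    name="Household average monthly expenditure",
    population="households",
    variable="monthly expenditure",
    categorical=("Territory", "Tenure status"),
    operator="mean",
)
# Assumption: population, variable and operator mirror the expenditure aggregate.
income = SimpleAggregate(
    name="Household average monthly income",
    population="households",
    variable="monthly income",
    categorical=("Territory", "Tenure status"),
    operator="mean",
)
ratio = CompositeAggregate(
    name="Expenditure / income",
    operator="ratio",
    operands=(expenditure, income),
)
```

Keeping the operands inside the composite aggregate is one way to preserve traceability: the ratio can always be traced back to the two simple aggregates it was derived from.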

9 How to define integrated and reusable classifications
In order to create a single data warehouse (DW), a National Statistical Institute has to face two major problems:
- the integration among the data stored in the database;
- the dynamism of the dimensions over time

10 From an integration point of view:
Suppose data are published with the following code lists, and suppose a user wants to analyse the data as well as to «work» with them.

Males by marital status    Marriages by bridegroom marital status    Disabled people by marital status
Male marital status        Bridegroom marital status                 Disabled people marital status
01 never married           01 never married                          01 never married
02 married                 02 divorced                               02 married
03 divorced                03 widowed                                03 divorced/widowed
04 widowed

11 The three code lists of the previous slide (male, bridegroom and disabled people marital status) raise some problems:
- different items with the same code (e.g. 02 is "married" for males but "divorced" for bridegrooms)
- the same item with different codes (e.g. "widowed" is 04 for males but 03 for bridegrooms)
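Both problems can be detected mechanically by comparing the code lists. The sketch below is only an illustration: it uses the three marital status lists from the slides, while the function name `find_conflicts` and the dictionary layout are assumptions.

```python
from collections import defaultdict

# The three code lists shown on the slides, keyed by variable name.
code_lists = {
    "Male marital status": {
        "01": "never married", "02": "married",
        "03": "divorced", "04": "widowed",
    },
    "Bridegroom marital status": {
        "01": "never married", "02": "divorced", "03": "widowed",
    },
    "Disabled people marital status": {
        "01": "never married", "02": "married", "03": "divorced/widowed",
    },
}


def find_conflicts(lists):
    """Return (codes used for several items, items carrying several codes)."""
    labels_by_code = defaultdict(set)
    codes_by_label = defaultdict(set)
    for items in lists.values():
        for code, label in items.items():
            labels_by_code[code].add(label)
            codes_by_label[label].add(code)
    same_code = {c: sorted(v) for c, v in labels_by_code.items() if len(v) > 1}
    same_item = {l: sorted(v) for l, v in codes_by_label.items() if len(v) > 1}
    return same_code, same_item


same_code_different_items, same_item_different_codes = find_conflicts(code_lists)
```

With these lists, codes 02 and 03 each denote more than one item, and "divorced" and "widowed" each appear under two different codes.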

12 From a dynamism point of view
First: we know that a survey changes over time, so it is possible that in year "x" a survey publishes the number of people by gender, marital status and age class (5 age classes), while in year "x+2" it publishes the number of people by gender, marital status and age class (7 age classes). Publication in Excel files does not cause problems, but in a database, where data need to be loaded into the same table to allow users to browse data across different time references, the consequence is that a classification can change from one time reference to another.

13 Second: there are two different strategies to create a single DW:
- linear model: everything should be defined ex ante. In other words, all classifications and cubes should be collected and available before populating the DW.
- cyclical model: the DW is populated with just some surveys, with their classifications and data cubes, and the whole DW is completed with other surveys step by step.
Diagram labels: Planning phase; Population phase; Survey 1, Survey 2, Survey 3.

14 Linear vs. cyclical: good management of change; certainty of maintainability of the classifications over time; possibility to immediately deploy a first release of the DW; "certainty" of the solution found

15 In short, classification management has to deal with two requirements: integration and lack of stability. In order to meet them, how should classifications be defined and used? The main ideas are in the following slides …

16 First of all, consider the integrated code list with all the necessary items (1)
Marital status
01 never married
02 married
03 divorced
04 widowed
05 divorced/widowed
(1) An example in the same direction is the Age code list of Eurostat

17 The second thing to do is to distinguish among three different roles:
- a variable name, which describes the statistical context of a dimension;
- a code list (classification), which describes the possible categories that a variable can assume;
- a group of items, which describes the link between the first and the second for a particular publication (other examples are in appendix 1 of the paper).

Group of items (X = item used by the variable name)
Code list: marital status    Male marital status    Bridegroom marital status    Disabled people marital status
01 never married             X                      X                            X
02 married                   X                                                   X
03 divorced                  X                      X
04 widowed                   X                      X
05 divorced/widowed                                                              X
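The three roles can be sketched as follows, using the integrated marital status code list and the groups of items from the slides. This is a minimal illustration; the names `MARITAL_STATUS`, `GROUPS_OF_ITEMS` and `items_for` are hypothetical.

```python
# Integrated code list (classification): one code per item, shared by all uses.
MARITAL_STATUS = {
    "01": "never married",
    "02": "married",
    "03": "divorced",
    "04": "widowed",
    "05": "divorced/widowed",
}

# Group of items: for each variable name, the subset of the code list it uses.
GROUPS_OF_ITEMS = {
    "Male marital status": ["01", "02", "03", "04"],
    "Bridegroom marital status": ["01", "03", "04"],
    "Disabled people marital status": ["01", "02", "05"],
}


def items_for(variable_name):
    """Resolve a variable name to its items through the shared code list."""
    codes = GROUPS_OF_ITEMS[variable_name]
    unknown = [c for c in codes if c not in MARITAL_STATUS]
    if unknown:
        raise KeyError(f"codes not in the code list: {unknown}")
    return {code: MARITAL_STATUS[code] for code in codes}


bridegroom_items = items_for("Bridegroom marital status")
```

Because every variable name draws from the same code list, "widowed" is always 04 regardless of the publication, which removes the conflicts seen with separate lists.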

18 The integration and lack of stability issues imply that an NSI also has to manage the codification of classification items. The common statistical standards to code an item are:
- letter: A - agriculture, B - xxx
- number: 01 - never married, 02 - married; or 1 - celibe/nubile (never married), 2 - coniugato/a (married)
- acronym: BS - autobus (bus), CH - pullman (coach), TR - treno (train)
- with "." or ",": 1.15.33
- with fixed positions: 011533, 021355

19 Suppose that in the first cycle of the DWH implementation the size class code list in Figure (a) becomes available, while in the second cycle there is also the need to introduce the "0-1" and "2-9" items (Figure (b)).

(a)                   (b)
01 0-9                01 0-9
02 10-50                 0-1
03 51-100                2-9
04 101 and over       02 10-50
                      03 51-100
                      04 101 and over

20 Which codes do we have to use? In practice it is impossible to change the codes of the old items, because they are linked to data that are already available. So we can only add new items. If we use 05 and 06 the classification becomes unreadable: for example, the visualization order is jeopardized. The same holds true if letters are used instead of numbers.

Numbers               Letters
01 0-9                a 0-9
05 0-1                e 0-1
06 2-9                f 2-9
02 10-50              b 10-50
03 51-100             c 51-100
04 101 and over       d 101 and over

21 The hierarchical standard could be of help (Table 6.c). But can this hierarchical standard solve the problem if we also need to add the following items: "10-30", "31-60", "61-80", "81-100"?

a    0-9
a.1  0-1
a.2  2-9
b    10-50
c    51-100
d    101 and over
?    10-30
?    31-60
?    61-80
?    81-100

22 Hence, in order to manage the evolution of a classification, our suggestion is:
- to distinguish the concepts of code, visualization order and father code;
- to use an acronym as the code.
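A minimal sketch of this suggestion, applied to the size class example of the previous slides. The acronym codes (`S0_9`, `S0_1`, ...), the order values and the `display` helper are hypothetical, chosen only for illustration; each item here has at most one father.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    code: str                     # stable acronym, never changed once data exist
    label: str
    order: int                    # visualization order, independent of the code
    father: Optional[str] = None  # code of the parent item, if any


# First cycle: the original size classes.
items = [
    Item("S0_9", "0-9", 10),
    Item("S10_50", "10-50", 20),
    Item("S51_100", "51-100", 30),
    Item("S101_OV", "101 and over", 40),
]

# Second cycle: new items are added; no existing code is touched.
items += [
    Item("S0_1", "0-1", 11, father="S0_9"),
    Item("S2_9", "2-9", 12, father="S0_9"),
]


def display(items):
    """Labels sorted by visualization order, indented by hierarchy depth."""
    by_code = {i.code: i for i in items}

    def depth(item):
        d = 0
        while item.father is not None:
            item = by_code[item.father]
            d += 1
        return d

    return ["  " * depth(i) + i.label for i in sorted(items, key=lambda i: i.order)]


lines = display(items)
```

Leaving gaps between the order values (10, 20, 30, ...) is one possible design choice: later items can be slotted in without renumbering, while the codes already linked to data stay untouched.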

23 The impact of this standard on technological standards

24 General technological standards
Each characteristic described above defines a specific standard from the technological point of view. In short, statistical software has to manage the classification objects:
- a classification has a name;
- each item of a classification has a visualization order;
- each item of a classification can have one or more fathers;
- the code of an item has to be a string (acronym);
as well as the relationship between a classification and a variable name:
- the relationship has a specific name (the variable name);
- once a relationship is defined, it is mandatory to select the subset of items of the corresponding code list.

25 We can also underline that if a classification is used for both the dissemination and the survey phases, the visualization order probably needs to be managed differently in each. Hence, it would be better for the visualization order to be an attribute of the code list/variable relationship rather than of the classification.
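The idea of attaching the visualization order to the code list/variable relationship can be sketched as below. The acronym codes, the phase names used as keys and the two orderings are all hypothetical; the slides do not specify them.

```python
# Shared code list with (hypothetical) acronym codes; it carries no order.
CODE_LIST = {
    "NM": "never married",
    "MA": "married",
    "DI": "divorced",
    "WI": "widowed",
}

# Each relationship (variable name, phase) selects its items and their order.
RELATIONSHIPS = {
    ("Marital status", "dissemination"): {"NM": 1, "MA": 2, "DI": 3, "WI": 4},
    ("Marital status", "survey"): {"MA": 1, "NM": 2, "WI": 3, "DI": 4},
}


def ordered_labels(variable_name, phase):
    """Labels of the selected items, in the order defined by the relationship."""
    order = RELATIONSHIPS[(variable_name, phase)]
    return [CODE_LIST[code] for code in sorted(order, key=order.get)]


dissemination_view = ordered_labels("Marital status", "dissemination")
survey_view = ordered_labels("Marital status", "survey")
```

The same four items are displayed in two different orders without duplicating or recoding the classification.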

26 Enhancements of the SDMX standard
In accordance with international agreements, data dissemination should be defined according to SDMX. Hence, we are using SDMX through a metadata organization that gives a statistical role to the SDMX artefacts, following GSIM concepts, e.g.:
- Population/statistical units: code lists
- Variables: organized in appropriate concept schemes (for both categorical and numerical variables)
- Statistical operators: code list of the operators used for transforming data between two subsequent phases (e.g. average, median, variable total when passing from validated to analysed data)

27 - Aggregate data: code list, where each item has its own attributes (population, variable, operator, …)
SDMX should be improved in order to describe metadata relationships along the statistical process:
- Link between a list of aggregates and their attributes
- Historicity of single codes in a code list
- Necessity to have a "Transformation and expressions" language

28 Thanks for your attention Stefania Bergamasco – bergamas@istat.it Unit Chief of Integrated systems for the dissemination of statistical information (Integration, Quality, Research and Production Networks Development Dept ) ……

