Survey Data Management and Combined Use of DDI and SDMX
DDI and SDMX Use Case: Labor Force Statistics
PROCESS SCENARIO
Eurostat Web Site
The two output tables are the focus of the processes described below.
GSBPM Stages
Process Scenario
Survey/Register → Raw Data Set → (anonymization, cleaning, recoding, etc.) → Micro-Data Set / Public Use Files → (tabulation, processing, case selection, etc.) → Aggregate Data Set (Lower Level) → (aggregation, harmonization) → Aggregate Data Set (Higher Level) → (aggregation, harmonization) → Indicators
DDI describes the micro-data side of the flow, SDMX the aggregate side; the dimensional structure is described by a DDI NCube and an SDMX DSD.
PROCESS STAGES
Stage 1: Input Data Received
Survey and Unit Record Conceptual Model
– The survey is targeted at a specific population and comprises questions
– A question may be linked to a Variable
– A Variable has a conceptual meaning (Concept)
– Valid responses are Categories
– The survey output is a Unit Record Data Set
Stage 2: Data Processing and Cleaning
Editing Process
– Can involve a variety of functions: validation, outlier trimming, recoding, editing for non-response
– Comprises a description of the process and the program code used
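As a concrete illustration, the editing step might look like the sketch below. The file name, variable names, and recoding rules are assumptions made for this example, not part of the actual LFS use case.

```python
import pandas as pd

# Hypothetical raw unit-record file and variable names, for illustration only.
raw = pd.read_csv("lfs_raw.csv")

# Validation: keep only records with a plausible age.
clean = raw[(raw["AGE"] >= 15) & (raw["AGE"] <= 89)].copy()

# Edit for non-response: flag missing occupation codes instead of dropping the record.
clean["OCC"] = clean["OCC"].fillna("NO_RESPONSE")

# Recoding: collapse a detailed employment status into a simpler classification.
status_recode = {1: "EMP", 2: "EMP", 3: "UNEMP", 4: "INACTIVE"}
clean["EMPSTAT_RECODED"] = clean["STATUS"].map(status_recode)

clean.to_csv("lfs_clean.csv", index=False)
```

In DDI terms, both the textual description of this step and the code itself could be attached to the corresponding Processing Event.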
Stage 3: Data Derivation
Survey and Unit Record Conceptual Model
– New Variables are created from existing Variables, or from Concepts
– May require new Classifications (codes, categories)
– Both a description and the program code that derives the new Variables are needed
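A derivation of this kind might be expressed as in the sketch below; the AGE_GROUP classification and its codes are invented purely for illustration.

```python
import pandas as pd

micro = pd.read_csv("lfs_clean.csv")

# Derive a new variable AGE_GROUP from the existing AGE variable,
# using a new (hypothetical) classification of age bands.
bins = [15, 25, 35, 45, 55, 65, 90]
labels = ["Y15-24", "Y25-34", "Y35-44", "Y45-54", "Y55-64", "Y65-89"]
micro["AGE_GROUP"] = pd.cut(micro["AGE"], bins=bins, labels=labels, right=False)

micro.to_csv("lfs_derived.csv", index=False)
```

This is the kind of code a Derivation Instruction would carry or reference, together with links to the input and output Variables.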
Stage 4: Tabulation
Dimensional Structure
– Maps to a DDI NCube and an SDMX DSD
DDI
– The NCube describes the structure and the provenance back to the Variables, etc.
SDMX
– Data is published as an SDMX Data Set
– The DSD describes the Dissemination Structure and can also describe the NCube structure
– A Structure Map can describe the mapping between the two
– Applications can link back from SDMX structures to DDI structures
– SDMX data can link back to Variables, data collection, etc.
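To make the tabulation step concrete, the sketch below aggregates the (hypothetical) derived unit records into a two-dimensional table whose dimensions and measure would be described by the NCube and the DSD.

```python
import pandas as pd

micro = pd.read_csv("lfs_derived.csv")

# Tabulate unit records into a dimensional data set: dimensions SEX and
# AGE_GROUP, measure = number of employed persons (unweighted, for simplicity).
employed = micro[micro["EMPSTAT_RECODED"] == "EMP"]
cube = (
    employed.groupby(["SEX", "AGE_GROUP"], observed=True)
    .size()
    .reset_index(name="OBS_VALUE")
)
cube.to_csv("lfs_aggregate.csv", index=False)
```

A real LFS tabulation would apply survey weights rather than a simple count; that detail is omitted here.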
Stage 5: Dissemination (SDMX)
– The Data Set references a Dataflow, DSD, or Provision Agreement; this identifies the structure (DSD)
– A Provision Agreement also identifies the Data Provider
– A Category Scheme supports "drill down" data discovery
– A Constraint contains the actual keys and Dimension values present in the data source
– The application now has all of the metadata required to query for and process (e.g. visualise) the data
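For example, an application might retrieve the Dataflow and everything it references through the standard SDMX 2.1 REST structure API, roughly as sketched below. The registry base URL is an assumption; the agency and Dataflow Ids are taken from the structural metadata shown later in this deck.

```python
import requests

REGISTRY = "https://registry.example.org/sdmx"  # assumed endpoint, for illustration

# Retrieve the Dataflow plus everything it references (DSD, Provision
# Agreement, Constraint) in one call, so the application has all the
# metadata it needs to build and interpret a data query.
url = f"{REGISTRY}/dataflow/ESTAT/EMPLY_SEX_AGE_NATION/latest"
response = requests.get(
    url,
    params={"references": "all"},
    headers={"Accept": "application/vnd.sdmx.structure+xml;version=2.1"},
)
response.raise_for_status()
structure_xml = response.text  # SDMX-ML structure message, to be parsed by the application
```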
DDI PROCESSING AND STRUCTURES
Describing Unit-Record Data Sets in DDI [DEMO]
Describing Processes in DDI
In our example we have several types of processing:
– Recoding
– Validation and editing
– Derivation of new variables
In DDI, these are described as "Processing Events"
Describing Processes in DDI (Continued)
The Collection Event element is part of the "Data Collection" module, but is also used for describing processing later in the data lifecycle
A Processing Event can be:
– Control operation
– Cleaning operation
– Weighting
– Coding
Describing Processes in DDI (Continued)
These elements allow for a description of the event and a link to, or the direct expression of, the processing "code" (SAS, SPSS, Java, etc.) used to perform the process
The Coding element is divided into:
– General Instruction – a generic process description
– Derivation Instruction – for deriving new variables
– These link to the variables used in the process
Tabulation in DDI
DDI describes dimensionalized data sets as "NCubes"
This is very similar to an SDMX DSD except:
– The values are addressed using references to variables in a unit-record data set
– Calculations of measures can be described in detail (dependent and independent variables, computation, etc.)
This means that the actual process of tabulation can be described
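As an illustration of a measure computation that an NCube could document, the sketch below derives the share of employed persons within the labour force per cell, reusing the hypothetical variable names from the earlier stage sketches.

```python
import pandas as pd

micro = pd.read_csv("lfs_derived.csv")

# A measure computed from unit-record variables: the share of employed
# persons within the labour force, by SEX and AGE_GROUP (illustrative only).
labour_force = micro[micro["EMPSTAT_RECODED"].isin(["EMP", "UNEMP"])]
share = (
    labour_force.groupby(["SEX", "AGE_GROUP"], observed=True)["EMPSTAT_RECODED"]
    .apply(lambda s: (s == "EMP").mean())
    .reset_index(name="EMPLOYED_SHARE")
)
```

The NCube could record both the independent variables (SEX, AGE_GROUP, EMPSTAT_RECODED) and the computation itself, so the tabulation remains reproducible.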
DDI NCUBE MAP TO SDMX DSD
DDI NCube Model
SDMX DSD Model
DDI NCube to SDMX DSD Model Map
DDI Representation to SDMX Representation Model Map
DDI Data (CSV Describable by DDI NCube Format)
Note that the column names are not used (they are just for viewing). They are mapped to the Variable Id in the NCube and to the Component (Dimension, Data Attribute, Primary Measure) Id in SDMX.
DDI NCube Data Set Model
Fundamentally, the Physical Location describes the CSV format. The CSV file can either be converted to SDMX-ML using data readers and data writers, or loaded directly into a database using an appropriate data reader. In both cases, the map of the Dimension and Attribute Ids to the CSV columns, together with the Id of the Dataflow, needs to be passed to the Data Reader so that it can verify the data content against the relevant DSD.
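A minimal reader of that kind might look like the sketch below; the column-to-component map and the component Ids are assumptions for this example, not the actual LFS_STRUCTURE1 definition.

```python
import csv

COLUMN_MAP = {            # CSV column -> DSD component Id (assumed)
    "sex": "SEX",
    "age_group": "AGE",
    "value": "OBS_VALUE",
}
DSD_COMPONENTS = {"SEX", "AGE", "OBS_VALUE"}   # component Ids read from the DSD
DATAFLOW_ID = "EMPLY_SEX_AGE_NATION"           # passed to the reader with the map

def read_observations(path):
    """Yield observations keyed by DSD component Id, after verifying the map."""
    unmapped = set(COLUMN_MAP.values()) - DSD_COMPONENTS
    if unmapped:
        raise ValueError(f"Mapped components not present in DSD: {unmapped}")
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {COLUMN_MAP[col]: row[col] for col in COLUMN_MAP}

# Each observation can then be handed to a data writer that produces SDMX-ML,
# or to a loader that inserts it into a database table for the Dataflow.
```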
Data Writers and Readers
SDMX STRUCTURES AND DATA DISCOVERY AND VISUALISATION
SDMX Structural Metadata
– DSD: LFS_STRUCTURE1
– Dataflows: EMPLY_SEX_OCC_EDUC, EMPLY_SEX_AGE_NATION
– Constraints: one per Dataflow (e.g. EMPLY_SEX_OCC_EDUC)
– Provision Agreements: ES_EMPLY_SEX_AGE_NATION, ES_EMPLY_SEX_OCC_EDUC
– Data Provider: ESTAT
– Category Scheme: ESTAT_TOPICS with Categories LABOR, POPULATION, NAC
– Categorisation: LAB_SEX_OCCC
Data Discovery: Registry Structures and the Data Discovery GUI
User Data Selection: the user selection is turned into a generated SDMX REST query
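The generated query could look roughly like the sketch below. The dimension names, the selected values, and the web service base URL are assumptions for illustration; the Dataflow Id comes from the structural metadata above.

```python
import requests

# Hypothetical user selection, keyed by dimension Id in DSD order; an empty
# string means "all values" (wildcard), a list means multiple selected values.
selection = {"SEX": ["M", "F"], "AGE": "", "NATION": "ES"}

# SDMX REST key syntax: dimension positions separated by '.', multiple values
# joined with '+', empty position for a wildcard.
key = ".".join("+".join(v) if isinstance(v, list) else v for v in selection.values())

BASE = "https://ws.example.org/sdmx"  # assumed endpoint
url = f"{BASE}/data/EMPLY_SEX_AGE_NATION/{key}"
response = requests.get(
    url, headers={"Accept": "application/vnd.sdmx.data+csv;version=1.0.0"}
)
response.raise_for_status()
```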
Pivot Table Built from Query Result
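Continuing the sketch above, and assuming the query returned SDMX-CSV, the result could be pivoted for display roughly as follows; the column names are assumptions.

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO(response.text))

# Pivot the flat observation list into a table with AGE on the rows and
# SEX across the columns, summing the observation values.
pivot = df.pivot_table(index="AGE", columns="SEX", values="OBS_VALUE", aggfunc="sum")
print(pivot)
```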