Data Ingestion in ENES and collaboration with RDA
Sandro Fiore, Ph.D.
Euro-Mediterranean Center on Climate Change Foundation
sandro.fiore@cmcc.it
INDIGO Summit
About data ingestion
May 12, 2017 - Catania
Climate Model Intercomparison Data Analysis case study and “data ingestion”
This case study, proposed in INDIGO by CMCC, relates to the climate change community (ENES) and to multi-model analytics experiments.
From D2.7: the objective of the data ingestion process is to make the data and metadata FAIR, i.e. "Findable, Accessible, Interoperable and Reusable". Accordingly, our definition of data ingestion is “the process that ends with the data being ready for sharing / (re-)use, following the usual community requirements”.
Earth System Modelling Workflow
Source: “ISENES2 Workshop on Workflow Solutions in Earth System Modelling”, by Reinhard Budich (Strategic IT Partnerships Scientific Computing Lab MPI-M) and Kerstin Fieg (Applications Deutsches Klimarechenzentrum DKRZ). June 3-5 2014, DKRZ, Hamburg.
"6S" Data Life Cycle in INDIGO (I) General workflow perspective Stage 1: Plan: prepare a Data Management Plan, including how data will be gathered, metadata definition, preservation plan, etc. Stage 2: Collect: including both creation and acquisition, it is the process of getting data, in different ways. A storage service is needed as well (output of simulations) Stage 3: Curate: also known as “Transform”: using the raw data collected in the previous stage, manual or automatic actions are performed over the data, which is converted and also filtered (postprocessing) Stage 4: Analyse: an optional step also called “Process”, that implies performing different actions to give the data an added value and get new derived data. (analysis, i.e. with Ophidia, metadata checking, etc.) Stage 5: Ingest (& Publish): including other steps like “Access”, “Use” or “Re-use”, in this stage, data is normally associated to metadata, gets a persistent identifier (a DOI) and is published in an accessible repository or catalogue, under a format that makes it useful for further re-use (ingestion & publication on HTTP server) Stage 6: Preserve: "store" both data and analysis for long-term. Licenses and methods need to be taken into account (“transfer data to a long-term storage”)
Earth System Modelling Workflow …with server-side processing
[Figure: Earth System Modelling Workflow, annotated with the stages Curate, Publish and Preserve]
Source: “ISENES2 Workshop on Workflow Solutions in Earth System Modelling”, by Reinhard Budich (Strategic IT Partnerships Scientific Computing Lab MPI-M) and Kerstin Fieg (Applications Deutsches Klimarechenzentrum DKRZ). June 3-5 2014, DKRZ, Hamburg.
"6S" Data Life Cycle in INDIGO (II) Data analysis sub-workflow perspective Stage 1: Plan: prepare a Data Management Plan, including how data will be gathered, metadata definition, preservation plan, etc. Stage 2: Collect: including both creation and acquisition, it is the process of getting data, in different ways. A storage service is needed as well. (analysis, i.e. with Ophidia) Stage 3: Curate: also known as “Transform”: using the raw data collected in the previous stage, manual or automatic actions are performed over the data, which is converted and also filtered. (prepare for publication, adding metadata and checking format) Stage 4: Analyse: an optional step also called “Process”, that implies performing different actions to give the data an added value and get new derived data (it could be related to preparing a map, preview, etc.) Stage 5: Ingest (& Publish): including other steps like “Access”, “Use” or “Re-use”, in this stage, data is normally associated to metadata, gets a persistent identifier (a DOI) and is published in an accessible repository or catalogue, under a format that makes it useful for further re-use (ingestion & publication) Stage 6: Preserve: "store" both data and analysis for long-term. Licenses and methods need to be taken into account (“transfer data to a long-term storage (link with WP4)”)
BARRACUDA: pid-BAsed woRkflows foR climAte Change Using ophiDiA
RDA Europe Collaboration project: towards a provenance-aware analytics eco-system
This RDA Europe collaboration project aims to take the multi-model climate analytics experiment case study, implemented in the context of the H2020 EU INDIGO-DataCloud project, one step forward by adopting the RDA recommendation on the PID Information Types (PIT) framework.
The provided extensions will:
(i) make new data products interoperable
(ii) enable data provenance at large scale
(iii) enable experiment reproducibility
(iv) implement a more complete and interoperable workflow life cycle in close synergy with the ESGF eco-system/services
(v) build a provenance-aware analytics eco-system
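To make the provenance goal more concrete, the sketch below shows one way typed information attached to a PID could be used to trace a derived data product back to its inputs. It is an in-memory illustration only and does not reproduce the RDA PIT API, the Handle service or the Ophidia extensions; `PidRecord`, `PidRegistry`, the `derived_from` field and all identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PidRecord:
    """Hypothetical PID record: an identifier plus typed key-value information."""
    pid: str
    info: Dict[str, str] = field(default_factory=dict)
    derived_from: List[str] = field(default_factory=list)  # PIDs of the inputs


class PidRegistry:
    """In-memory stand-in for a PID service (assumption, not a real client)."""

    def __init__(self) -> None:
        self._records: Dict[str, PidRecord] = {}

    def register(self, record: PidRecord) -> None:
        if record.pid in self._records:
            raise ValueError(f"PID already registered: {record.pid}")
        for parent in record.derived_from:
            if parent not in self._records:
                raise KeyError(f"unknown input PID: {parent}")
        self._records[record.pid] = record

    def resolve(self, pid: str) -> PidRecord:
        return self._records[pid]

    def lineage(self, pid: str) -> List[str]:
        """Return every ancestor PID, breadth-first, without duplicates."""
        seen: List[str] = []
        queue = list(self.resolve(pid).derived_from)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            queue.extend(self.resolve(current).derived_from)
        return seen
```

A short usage example with made-up identifiers:

```python
registry = PidRegistry()
registry.register(PidRecord("demo/input-a", {"kind": "model-output"}))
registry.register(PidRecord("demo/input-b", {"kind": "model-output"}))
registry.register(PidRecord("demo/product", {"kind": "analysis-result"},
                            derived_from=["demo/input-a", "demo/input-b"]))
print(registry.lineage("demo/product"))
```

Refusing to register a record whose inputs are unknown keeps every provenance chain resolvable, which is the property that reproducibility depends on.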
Workplan, results and outputs
Project workplan (April 1, 2017 – November 30, 2017)
Planned tasks include:
- Design of the Ophidia support for RDA-PIT
- Basic tests on the PID Handle service managed at DKRZ
- Implementation, testing and validation of the Ophidia support for RDA-PIT
- Integration of the PID-resolving interface in the testbed set up in the EU H2020 INDIGO-DataCloud project
Final results:
- RDA-PIT support integrated into Ophidia
- PID Handle Service client API integrated into the large-scale INDIGO-DataCloud multi-model climate analytics experiment
Output:
- RDA-PIT support for Ophidia available as open source
- Ophidia service deployed and running at CMCC with RDA-PIT support
Deliverables:
- Report on the design and implementation of the RDA-PIT extensions for Ophidia
- Short user manual for using the two extensions
The results of the activity will be demonstrated at the ESGF F2F 2017 Conference.
From the RDA PID Information Type WG into the data lifecycle of the climate experiment proposed in this application.
Implementation stage
- New computer engineer in the team to work on this project
- Design of the “pid-based analytics use case”, extending the INDIGO use case for ENES
- In-depth analysis of the RDA PIT recommendation
- Account setup on a PID Handle Service instance running at DKRZ
Next steps:
- test the CLI of the PID Handle Service
- develop simple test client applications based on the available API
Link with data ingestion in INDIGO
- INDIGO will cover the data ingestion workflow (analysis, curation, publication to an HTTP server and copy to a preservation storage)
- The validation step in INDIGO could exploit the PID support provided by BARRACUDA
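As an illustration of what a simple test client might do once a record has been fetched, the sketch below turns a JSON handle record into a dictionary keyed by value type. The JSON layout used here (a `responseCode`, a `handle` and a list of typed `values`) is an assumption modelled on how Handle REST interfaces commonly present records; the service at DKRZ and its client API may differ. The function name, the sample handle, the value types and the URL are hypothetical.

```python
import json
from typing import Dict


def handle_values_by_type(record_json: str) -> Dict[str, str]:
    """Map each value type in a handle record to its string value.

    Assumes the record layout described above; raises LookupError when the
    response code does not signal success (assumed here to be 1).
    """
    record = json.loads(record_json)
    if record.get("responseCode") != 1:
        raise LookupError(f"handle not resolved: {record.get('handle')}")
    values: Dict[str, str] = {}
    for entry in record.get("values", []):
        data = entry.get("data", {})
        values[entry["type"]] = str(data.get("value"))
    return values


# Made-up sample record, used only to exercise the parser offline.
SAMPLE_RECORD = json.dumps({
    "responseCode": 1,
    "handle": "prefix/demo-suffix",
    "values": [
        {"index": 1, "type": "URL",
         "data": {"format": "string", "value": "https://example.org/data/demo.nc"}},
        {"index": 2, "type": "CHECKSUM",
         "data": {"format": "string", "value": "abc123"}},
    ],
})

if __name__ == "__main__":
    print(handle_values_by_type(SAMPLE_RECORD))
```

Keeping the parser separate from the network call lets it be tested against recorded responses, which suits the validation step in the ingestion workflow.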
Thank you
https://www.indigo-datacloud.eu
Better Software for Better Science