Data Ingestion in ENES and collaboration with RDA

Presentation transcript:

Data Ingestion in ENES and collaboration with RDA
Sandro Fiore, Ph.D., Euro-Mediterranean Center on Climate Change Foundation, sandro.fiore@cmcc.it
INDIGO Summit: About data ingestion. May 12, 2017, Catania

Climate Model Intercomparison Data Analysis case study and “data ingestion”
This case study, proposed in INDIGO by CMCC, concerns the climate change community (ENES) and its multi-model analytics experiments.
From D2.7: the objective of the data ingestion process is to make the data and metadata FAIR, i.e. "Findable, Accessible, Interoperable and Reusable". Accordingly, our definition of data ingestion is “the process that ends with the data being ready for sharing / (re-)use, following the usual community requirements”.

Earth System Modelling Workflow Source: “ISENES2 Workshop on Workflow Solutions in Earth System Modelling”, by Reinhard Budich (Strategic IT Partnerships Scientific Computing Lab MPI-M) and Kerstin Fieg (Applications Deutsches Klimarechenzentrum DKRZ). June 3-5 2014, DKRZ, Hamburg.

"6S" Data Life Cycle in INDIGO (I): general workflow perspective
Stage 1: Plan: prepare a Data Management Plan, including how data will be gathered, metadata definition, preservation plan, etc.
Stage 2: Collect: covers both creation and acquisition; it is the process of getting data, in different ways. A storage service is needed as well. (output of simulations)
Stage 3: Curate: also known as “Transform”. Starting from the raw data collected in the previous stage, manual or automatic actions convert and filter the data. (postprocessing)
Stage 4: Analyse: an optional step, also called “Process”, in which different actions are performed to add value to the data and obtain new derived data. (analysis, e.g. with Ophidia, metadata checking, etc.)
Stage 5: Ingest (& Publish): includes other steps such as “Access”, “Use” or “Re-use”. In this stage, data is normally associated with metadata, gets a persistent identifier (a DOI) and is published in an accessible repository or catalogue, in a format that makes it useful for further re-use. (ingestion & publication on HTTP server)
Stage 6: Preserve: "store" both data and analysis for the long term. Licenses and methods need to be taken into account. (“transfer data to a long-term storage”)

Earth System Modelling Workflow …with server-side processing (stages marked on the figure: Curate, Publish, Preserve)
Source: “ISENES2 Workshop on Workflow Solutions in Earth System Modelling”, by Reinhard Budich (Strategic IT Partnerships Scientific Computing Lab MPI-M) and Kerstin Fieg (Applications Deutsches Klimarechenzentrum DKRZ). June 3-5 2014, DKRZ, Hamburg.

"6S" Data Life Cycle in INDIGO (II): data analysis sub-workflow perspective
Stage 1: Plan: prepare a Data Management Plan, including how data will be gathered, metadata definition, preservation plan, etc.
Stage 2: Collect: covers both creation and acquisition; it is the process of getting data, in different ways. A storage service is needed as well. (analysis, e.g. with Ophidia)
Stage 3: Curate: also known as “Transform”. Starting from the raw data collected in the previous stage, manual or automatic actions convert and filter the data. (prepare for publication, adding metadata and checking format)
Stage 4: Analyse: an optional step, also called “Process”, in which different actions are performed to add value to the data and obtain new derived data. (it could be related to preparing a map, preview, etc.)
Stage 5: Ingest (& Publish): includes other steps such as “Access”, “Use” or “Re-use”. In this stage, data is normally associated with metadata, gets a persistent identifier (a DOI) and is published in an accessible repository or catalogue, in a format that makes it useful for further re-use. (ingestion & publication)
Stage 6: Preserve: "store" both data and analysis for the long term. Licenses and methods need to be taken into account. (“transfer data to a long-term storage (link with WP4)”)
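The two perspectives use the same six stages but attach different concrete activities to them. A minimal sketch of that mapping follows; the enum and dictionary names are illustrative assumptions, and the activity strings are paraphrased from the parenthetical notes on the two slides (Stage 1 carries no such note on either slide).

```python
from enum import IntEnum


class Stage(IntEnum):
    """The six life-cycle stages, numbered as on the slides."""
    PLAN = 1
    COLLECT = 2
    CURATE = 3
    ANALYSE = 4
    INGEST_PUBLISH = 5
    PRESERVE = 6


# Parenthetical notes from slide (I), general workflow perspective.
GENERAL_WORKFLOW = {
    Stage.COLLECT: "output of simulations",
    Stage.CURATE: "postprocessing",
    Stage.ANALYSE: "analysis (e.g. with Ophidia), metadata checking",
    Stage.INGEST_PUBLISH: "ingestion and publication on HTTP server",
    Stage.PRESERVE: "transfer data to a long-term storage",
}

# Parenthetical notes from slide (II), data analysis sub-workflow perspective.
ANALYSIS_SUBWORKFLOW = {
    Stage.COLLECT: "analysis (e.g. with Ophidia)",
    Stage.CURATE: "prepare for publication: add metadata, check format",
    Stage.ANALYSE: "prepare a map, preview, etc.",
    Stage.INGEST_PUBLISH: "ingestion and publication",
    Stage.PRESERVE: "transfer data to a long-term storage (link with WP4)",
}

NOT_ANNOTATED = "(no annotation on the slide)"


def side_by_side():
    """Return one (stage, general activity, sub-workflow activity) row per stage."""
    return [
        (
            stage.name,
            GENERAL_WORKFLOW.get(stage, NOT_ANNOTATED),
            ANALYSIS_SUBWORKFLOW.get(stage, NOT_ANNOTATED),
        )
        for stage in Stage  # IntEnum iterates in definition order
    ]


if __name__ == "__main__":
    for name, general, sub in side_by_side():
        print(f"{name:15} | {general} | {sub}")
```

Reading the rows side by side shows the shift between the perspectives: analysis with Ophidia sits in Stage 4 of the general workflow but becomes the "Collect" step of the analysis sub-workflow.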

BARRACUDA: pid-BAsed woRkflows foR climAte Change Using ophiDiA
RDA Europe Collaboration project: towards a provenance-aware analytics eco-system
This RDA Europe collaboration project aims to take the multi-model climate analytics experiment case study, implemented in the context of the H2020 EU INDIGO-DataCloud project, one step further by adopting the RDA recommendation on the PID Information Types (PIT) framework.
The provided extensions will: (i) make new data products interoperable, (ii) enable data provenance at large scale, (iii) enable experiment reproducibility, (iv) implement a more complete and interoperable workflow lifecycle in close synergy with the ESGF eco-system/services, and (v) build a provenance-aware analytics eco-system.
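To make the provenance goal more concrete, here is a rough sketch of the general idea behind typed PID records: each identifier resolves to a small record of typed properties, and a property that points at other identifiers lets a client walk back from a derived product to its inputs. Everything in it is an assumption for illustration only: the in-memory service, the `21.T99999` prefix, and the property names (`derived_from`, `model`, `operation`) are hypothetical and are not the Ophidia extension, the DKRZ Handle service API, or the registered PIT types.

```python
import uuid
from dataclasses import dataclass, field

PREFIX = "21.T99999"  # hypothetical handle prefix, not a real allocation


@dataclass
class PidRecord:
    """A PID together with its typed key/value properties."""
    pid: str
    properties: dict = field(default_factory=dict)


class InMemoryPidService:
    """Stand-in for a real PID service; keeps records in a dictionary."""

    def __init__(self):
        self._records = {}

    def mint(self, properties):
        """Create a new identifier and store a copy of its properties."""
        pid = f"{PREFIX}/{uuid.uuid4()}"
        self._records[pid] = PidRecord(pid, dict(properties))
        return pid

    def resolve(self, pid):
        """Return the record for a PID (raises KeyError if unknown)."""
        return self._records[pid]


def provenance_chain(service, pid):
    """Collect a PID and everything reachable through 'derived_from' links."""
    chain, todo, seen = [], [pid], set()
    while todo:
        current = todo.pop()
        if current in seen:
            continue
        seen.add(current)
        chain.append(current)
        todo.extend(service.resolve(current).properties.get("derived_from", []))
    return chain


if __name__ == "__main__":
    service = InMemoryPidService()
    # Two model outputs and one analytics product derived from both.
    input_a = service.mint({"model": "model-A"})
    input_b = service.mint({"model": "model-B"})
    product = service.mint(
        {"operation": "multi-model mean", "derived_from": [input_a, input_b]}
    )
    for pid in provenance_chain(service, product):
        print(pid, service.resolve(pid).properties)
```

With records of this kind, reproducibility amounts to resolving the product's PID, reading which inputs and which operation produced it, and re-running that operation on the resolved inputs.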

Workplan, results and outputs
Project workplan (April 1, 2017 – November 30, 2017)
Planned tasks include:
- Design of the Ophidia support for RDA-PIT
- Basic tests on the PID Handle service managed at DKRZ
- Implementation, testing and validation of the Ophidia support for RDA-PIT
- Integration of the PID-resolving interface in the testbed set up in the EU H2020 INDIGO-DataCloud project
Final results:
- RDA-PIT support integrated into Ophidia
- PID Handle Service client API integrated into the large-scale INDIGO-DataCloud multi-model climate analytics experiment
Output:
- RDA-PIT support for Ophidia available as open source
- Ophidia service deployed and running at CMCC with RDA-PIT support
Deliverables:
- Report on the design and implementation of the RDA-PIT extensions for Ophidia
- Short user manual for using the two extensions
The results of the activity will be demonstrated at the ESGF F2F 2017 Conference.
From the RDA PID Information Type WG into the data lifecycle of the climate experiment proposed in this application.

Implementation stage
- New computer engineer in the team to work on this project
- Design of the “pid-based analytics use case” by extending the INDIGO use case for ENES
- In-depth analysis of the RDA PIT recommendation
- Account setup on a PID Handle Service instance running at DKRZ
Next steps:
- Test the CLI of the PID Handle Service
- Develop simple test client applications based on the available API
Link with data ingestion in INDIGO:
- INDIGO will cover the data ingestion workflow (analysis, curation, publication to HTTP server and copy to a preservation storage)
- The validation step in INDIGO could exploit the PID support provided by BARRACUDA
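The ingestion workflow named above (curation, publication to an HTTP server, copy to a preservation storage, then validation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the INDIGO implementation: local directories stand in for the HTTP server's document root and for the preservation storage, the analysis step is assumed to have already produced the file, the required metadata keys are hypothetical, and validation is reduced to a checksum comparison (where a PID-based check could later plug in).

```python
import hashlib
import shutil
from pathlib import Path

# Hypothetical minimum metadata; real community requirements will differ.
REQUIRED_METADATA = ("title", "creator")


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def curate(product: Path, metadata: dict) -> dict:
    """Check the metadata and attach the file name and checksum to it."""
    missing = [key for key in REQUIRED_METADATA if key not in metadata]
    if missing:
        raise ValueError(f"missing metadata: {missing}")
    return {**metadata, "file": product.name, "sha256": sha256_of(product)}


def copy_to(product: Path, target_dir: Path) -> Path:
    """Copy the product into a target directory, creating it if needed."""
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(product, target_dir / product.name))


def ingest(product: Path, metadata: dict, web_root: Path, archive_root: Path) -> dict:
    """Curate, publish, preserve, then validate both copies by checksum."""
    record = curate(product, metadata)
    published = copy_to(product, web_root)      # stand-in for the HTTP server
    preserved = copy_to(product, archive_root)  # stand-in for preservation storage
    for copy in (published, preserved):
        if sha256_of(copy) != record["sha256"]:
            raise IOError(f"checksum mismatch for {copy}")
    record["published"] = str(published)
    record["preserved"] = str(preserved)
    return record
```

Calling `ingest(Path("result.nc"), {"title": "...", "creator": "..."}, Path("www"), Path("archive"))` returns the metadata record with the checksum and the two copy locations; the file name and directories in this call are placeholders.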

Thank you
https://www.indigo-datacloud.eu Better Software for Better Science