Fernando Aguilar, IFCA-CSIC

Presentation transcript:

INDIGO Data Ingestion
Fernando Aguilar, IFCA-CSIC
INDIGO-DataCloud WP2
aguilarf@ifca.unican.es
RIA-653549

Integrating distributed data infrastructures with INDIGO-DataCloud
INDIGO Data Ingestion
Deliverables: D2.11, D2.7
Data Life Cycle Analysis
Roles, Data Levels, Metadata
ENVRI, RDA, NASA
Metadata Standards

Data Management in the Cloud
Session at the last RDA Plenary: EUDAT, EGI, INDIGO.
Different Cloud-based data management approaches.
Conclusions:
Interest in assuring FAIR(+R).
Different paths to achieve it.
Not close to implementation in production.

INDIGO Data Life Cycle (“6S”)
Stage 1: Plan: DMP.
Stage 2: Collect: the process of getting data.
Stage 3: Curate: actions performed over the data.
Stage 4: Analyse: also called “Process”, giving the data an added value.
Stage 5: Ingest (& Publish): includes other steps such as “Access”, “Use” or “Re-use”. In this stage, data is normally associated with metadata, has a persistent identifier, and is published in an accessible repository or catalogue, in a format that makes it useful for further re-use.
Stage 6: Preserve: “store” both data and analysis for the long term. Licenses and methods need to be taken into account.
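The ordering of the six stages can be sketched as an enumeration. This is only an illustrative model in Python: the names `Stage`, `next_stage` and `stages_before` are assumptions made for this example and are not part of any INDIGO-DataCloud software.

```python
from enum import IntEnum
from typing import List, Optional


class Stage(IntEnum):
    """The six stages of the data life cycle, in order."""
    PLAN = 1
    COLLECT = 2
    CURATE = 3
    ANALYSE = 4
    INGEST = 5
    PRESERVE = 6


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after PRESERVE."""
    if stage is Stage.PRESERVE:
        return None
    return Stage(stage + 1)


def stages_before(stage: Stage) -> List[Stage]:
    """Return every stage that precedes `stage` in the life cycle."""
    return [s for s in Stage if s < stage]
```

For instance, `stages_before(Stage.INGEST)` lists Plan, Collect, Curate and Analyse, which are the stages that ingestion builds on.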

The concept of “Data Levels”

Data Levels for Algae Bloom
Columns, in order: Dataset Level; Format and Definition; Associated Metadata; Other metadata/links.
Raw data: SQL tables, real-time update (IoT-like); SQL schema, names of parameters (following EML); instruments description (OGC), platform location (GPS).
Processed data: SQL tables, consolidated backup; SQL schema, matching EML definitions; definition of specific derived variables such as PAR (Photosynthetic Active Radiation), depth (from Press), etc.
Curated data: SQL tables, revised for spikes, outliers, out-of-range data, etc. Errors deleted.
Ingested data: CSV, R / Excel ready to be used, associated basic scripts; included in DOI of published dataset, associated EML file; published in catalogue.
Derived data: NetCDF, HDF, model proprietary format; NetCDF or HDF metadata, associated EML file, included in DOI; data derived from models (Delft3D) or other analysis tools.

Data Levels for CTA
Columns, in order: Data Level; Short Name; Description.
Level 0: RAW: Acquired raw data.
Level 1: CALIBRATED: Calibrated camera data.
Level 2: RECONSTRUCTED: Reconstructed shower parameters (such as energy, direction, particle ID).
Level 3: REDUCED: Sets of selected events with associated instrumental response characterizations needed for science analysis.
Level 4: SCIENCE: High-level binned data products (such as spectra, sky maps, or light curves).
Level 5: OBSERVATORY: Legacy observatory data (such as survey sky maps or source catalog).
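The step from processed to curated data (revision for spikes, outliers and out-of-range values) could look roughly like the sketch below. It is a minimal illustration only: the function `curate`, the thresholds and the sample readings are all hypothetical, and the slides do not say which curation method is actually used.

```python
from typing import Iterable, List, Tuple


def curate(values: Iterable[float], low: float, high: float,
           max_jump: float) -> Tuple[List[float], List[float]]:
    """Split readings into (kept, rejected).

    A reading is rejected when it falls outside [low, high], or when it
    differs from the last kept reading by more than `max_jump` (a spike).
    """
    kept: List[float] = []
    rejected: List[float] = []
    for value in values:
        out_of_range = not (low <= value <= high)
        spike = bool(kept) and abs(value - kept[-1]) > max_jump
        if out_of_range or spike:
            rejected.append(value)
        else:
            kept.append(value)
    return kept, rejected


# Hypothetical water-temperature readings in degrees Celsius.
readings = [14.2, 14.3, 99.0, 14.4, 21.9, 14.5, -5.0]
kept, rejected = curate(readings, low=0.0, high=40.0, max_jump=3.0)
```

Keeping the rejected values in a separate list, rather than silently dropping them, leaves room to document the QC/QA decisions alongside the curated dataset.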

INDIGO Data Management Solutions

OneData
Distributed storage solution to access, store and publish data.
IAM.
OneClient, OneProvider, OneZone.
Metadata Management.
Web-API Access.
File System, Extended and Custom Attributes.

Storage QoS (Quality of Service)
Get or add information about a storage element characteristic such as type of media, location or latency.
Can be combined with SLA.
Works with CDMI, Amazon S3, etc.
The endpoint is public and reachable by REST API.

Integration
Integrating the information provided by a site's QoS endpoint into OneData will allow users to identify (and modify, if available) the storage qualities via the OneData client.

INDIGO Data Ingestion: The Arbor metaphor
“Data Ingestion as the process that ends with the data being ready for sharing/(re-)use, following the usual community requirements”
FAIR + Reproducibility + Security/Legal

INDIGO Data Integrity Test
Columns, in order: Stage; Definition of the Integrity Test components; INDIGO-DataCloud Solution.
1. PLAN: Check DMP existence. Manual. Next gen: Machine Actionable DMPs. Automatic linking (not implemented).
2. COLLECT: DataSet existence. EML – Onedata. DataSet integrity (checksum).
3. CURATE: QC/QA description OK. Curating, quality software (optional).
4. ANALYSE: Parameters description OK. Processable. Check: validation.
5. INGEST: Check all previous stages OK. Assign PID/DOI. Assure open protocol (OAI-PMH). Supported by Onedata (Data Provider role).
6. PRESERVE: License definition. Preservation details. QoS - Onedata.
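The integrity test can be read as one pass/fail check per stage, with the Ingest check depending on all earlier ones. The sketch below illustrates that reading in Python. The `DatasetRecord` fields, the use of SHA-256 for the checksum, and the placeholder DOI are assumptions made for this example; the slides do not specify how the INDIGO-DataCloud test is implemented.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DatasetRecord:
    """Hypothetical summary of what is known about a dataset."""
    has_dmp: bool
    data: Optional[bytes]
    expected_sha256: Optional[str]
    qc_description: Optional[str]
    parameters_described: bool
    pid: Optional[str]
    license: Optional[str]


def integrity_report(rec: DatasetRecord) -> Dict[str, bool]:
    """Return one pass/fail flag per life-cycle stage."""
    report = {
        "PLAN": rec.has_dmp,
        # COLLECT: the dataset exists and its checksum matches.
        "COLLECT": (
            rec.data is not None
            and rec.expected_sha256 is not None
            and hashlib.sha256(rec.data).hexdigest() == rec.expected_sha256
        ),
        "CURATE": bool(rec.qc_description),
        "ANALYSE": rec.parameters_described,
        "PRESERVE": bool(rec.license),
    }
    # INGEST: every earlier stage passes and a PID/DOI has been assigned.
    earlier = ("PLAN", "COLLECT", "CURATE", "ANALYSE")
    report["INGEST"] = all(report[s] for s in earlier) and bool(rec.pid)
    return report


payload = b"date,temp\n2016-05-01,14.2\n"
record = DatasetRecord(
    has_dmp=True,
    data=payload,
    expected_sha256=hashlib.sha256(payload).hexdigest(),
    qc_description="Spikes and out-of-range values removed",
    parameters_described=True,
    pid="doi:10.0000/placeholder",  # placeholder, not a real DOI
    license="CC-BY-4.0",
)
report = integrity_report(record)
```

If the stored checksum does not match the data, both the Collect and the Ingest flags turn false, which mirrors the "check all previous stages OK" rule of the Ingest row.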

Example: Collect
EML Physical Module
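The slide's own example is not preserved in this transcript. As a rough illustration of how an EML physical description can support the Collect checks, the sketch below parses a simplified `physical` fragment and compares the declared size and checksum with the actual data. The fragment is cut down to three elements, the file name and contents are hypothetical, and the choice of SHA-256 as the `authentication` method is an assumption.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Dict, Union

# Hypothetical data file contents.
data = b"date,temp\n2016-05-01,14.2\n"

# Simplified EML physical fragment describing the file above.
eml_fragment = f"""
<physical>
  <objectName>reservoir.csv</objectName>
  <size unit="byte">{len(data)}</size>
  <authentication method="SHA-256">{hashlib.sha256(data).hexdigest()}</authentication>
</physical>
"""


def check_physical(xml_text: str, payload: bytes) -> Dict[str, Union[str, bool]]:
    """Compare the size and checksum declared in the fragment with `payload`."""
    physical = ET.fromstring(xml_text.strip())
    declared_size = int(physical.findtext("size", default="-1"))
    auth = physical.find("authentication")
    checksum_ok = (
        auth is not None
        and auth.get("method") == "SHA-256"
        and (auth.text or "").strip() == hashlib.sha256(payload).hexdigest()
    )
    return {
        "objectName": physical.findtext("objectName", default=""),
        "size_ok": declared_size == len(payload),
        "checksum_ok": checksum_ok,
    }


result = check_physical(eml_fragment, data)
```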

Example: Analyse

Thank you!
Fernando Aguilar, IFCA-CSIC
aguilarf@ifca.unican.es