Published by Raymond Stone. Modified over 9 years ago.
1
Project Overview. APA Conference 2012, ESA/ESRIN (Frascati), 6-7 November 2012. M. Albani (European Space Agency), U. Di Giammatteo (ACS), D. Giaretta (APA)
2
Presentation Outline Project Overview Combined User Requirements and Use Cases Services and Toolkits
3
Project: SCIence Data Infrastructure for Preservation – Earth Science (SCIDIP-ES) INFRA-2011-1.2.2 Data infrastructures for e-Science Introduction & Participants Project ID: 283401 Project Type: CP-CSA Start Date: 01.09.2011 Duration: 36 Months Website: www.scidip-es.eu Total Budget: 7,721,082 € EC Funding: 6,599,992 € Total funded effort in person/months: 605 Coordinator: European Space Agency Contact Person: Mirko Albani (ESA) Project Consortium (17 partners): Industrial Earth Science Researchers
4
International Context. “A fundamental characteristic of our age is the rising tide of data – global, diverse, valuable and complex. In the realm of science, this is both an opportunity and a challenge.” Report of the High-Level Group on Scientific Data, October 2010, “Riding the Wave: how Europe can gain from the rising tide of scientific data”. Data Intensive Science (4th Paradigm): data at the centre of the scientific process. EC Recommendation 2012/417/EU, July 2012: “Access to and preservation of scientific information”. Data is the new gold. “We have a huge goldmine ….. Let’s start mining it.” Neelie Kroes, Vice-President of the European Commission responsible for the Digital Agenda
5
An example of data lifecycle: sensed data need to be acquired… 0014100 0024102 0034102 0046144 0056150 0067168 0076146 Level 0
6
...to be combined and processed to get this. Diagram labels: Level 0; Level 1; Level 2; Processing; Processing/combining
7
...through complex processing schemes Algorithm Manual
9
Data & Knowledge Preservation. The preservation of data (the “bytes”) is useless without the preservation of the knowledge associated with the data (e.g. the “quality”, the process to generate them). We must: Ensure and secure the preservation of archived data and associated knowledge, ideally for an unlimited time span. Ensure, enhance and facilitate archived data accessibility. Allow data from different sources to be combined and more complex analyses to be performed (unfamiliar data become familiar). Ensure coherency of approaches among different Earth Science providers. Diagram labels: File specs.; Level 0; Level 1; Processor; Algorithm; User Manual; Desc. Info; Publications
10
Genesis of the SCIDIP-ES project Two aspects are considered: Technical – based on the advancements in research for digital data preservation. Domain Specific – based on Earth Science needs.
11
SCIDIP-ES Objectives: 1) To develop and deploy generic and sustainable digital data preservation services and toolkits. Validate and use them in the Earth Science domain as a start. 2) To harmonise data preservation policies and approaches, metadata and ontologies in the Earth Science domain, paving the way for the set-up of a harmonised and common approach for the Long Term Preservation of Earth Science Data.
12
SCIDIP-ES Journey JUN 2012 SEP 2011 Interactive Platform & Initial Prototype S&T Release Earth Science Survey, Requirements & Use Cases MAY 2013 FEB 2013 Halftime S&T Release Earth Science Harmonization AUG 2014 FEB 2014 Final S&T Release Services Final Testing & Assessment Earth Science LTDP Architecture
13
Earth Science Community Earth Observation Space Community SCIDIP-ES Cooperation MoUs with FP7 Projects ESA, APA, GIM, STFC ESA ESA, DLR, CNES Non-Earth Science Community MoUs with FP7 Projects e-Infrastructure
14
The Complete Picture: from generic to specific; from specific to generic. Diagram labels: European E-Infrastructure; Other Communities; Earth Science
15
Earth Science Data Preservation status and needs Earth-Science data managers have individually dealt for decades with preservation and access to heterogeneous and complex scientific data. BUT need to: Improve effectiveness of data preservation through the application of standard approaches and tools and exploitation of latest Digital Preservation developments and achievements (i.e. SCIDIP-ES services and toolkits). Reduce preservation threats. Ensure preservation of data but also of all context/provenance/quality information: Data Records, Processing Software, Documentation. Harmonize preservation approaches and policies and coordinate efforts to better support pressing needs from very sensitive applications. Enhance interoperability and facilitate data usability and access by the scientific user community.
16
What will SCIDIP-ES do for Earth Science? Provide state-of-the-art data preservation services and toolkits. Raise awareness of data preservation issues and enhance dialogue on them between ES data repositories, paving the way for the set-up of an Earth Science LTDP Framework. Define common preservation policies and approaches. Propose a path for harmonizing metadata, semantics, ontologies, and data access policies, facilitating data discovery and use. Define an Earth Science LTDP framework governance model and architecture.
17
Starting from….. Achievements in the domain of Earth Observation from space: Ground Segment Coordination Body LTDP Working Group. Reviewing the Earth Observation results in light of OAIS and latest digital preservation techniques and standards and extending them to the Earth Science domain.
18
Earth Science Activities Workflow MAIN BLOCKS of ACTIVITIES
19
Survey Rationale Survey of earth science users to assess understanding of issues and level of expertise with respect to long-term data preservation Identify current existing and utilised: Data preservation policies and guidelines Metadata, semantic and ontology models Technologies for data discovery, access, management and visualisation
20
Survey Methodology. Inputs: Independent search activity; On-line user survey; In-depth user consultation. T15.2 Survey of policies and technologies. T15.3 Survey of metadata, semantics and ontologies. D15.1 Report on survey of technologies, policies, metadata, semantics and ontologies.
21
Survey Results: 551 responses
22
Survey Results: analysis System infrastructure and architecture Data discovery Data access Processing, knowledge, extraction and management Data preservation, technologies, policies and guidelines Metadata, semantics and ontologies
23
Survey conclusions. System infrastructure and architecture: Archive systems: EO community uses proprietary systems; others use open source, e.g. Postgres. Often based on tape archives, with disk storage for rapid access to frequently used datasets. Users generally have a better understanding of the discovery and retrieval system used to access the data than of the underlying archive services. Data discovery: Metadata standards ISO 19115/19119 are widely used in the ES domain. Metadata harvesting methods include OpenSearch 1.1 and the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Earth Observation data also uses the Earth Observation Metadata profile of Observations and Measurements.
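As an illustration of the OAI-PMH harvesting mentioned on this slide, the following minimal sketch builds a `ListRecords` request URL. The verb and the `metadataPrefix`, `from` and `set` arguments are part of the OAI-PMH protocol; the endpoint `https://archive.example.org/oai` and the helper name `list_records_url` are assumptions made for the example and do not refer to any SCIDIP-ES service.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    if set_spec:
        params["set"] = set_spec    # selective harvesting by set
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, used only to show the shape of the request.
url = list_records_url("https://archive.example.org/oai", from_date="2012-01-01")
print(url)
```

A harvester would fetch this URL and follow any `resumptionToken` in the response to page through the remaining records.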
24
Survey conclusions. Data access: Web services are most commonly used for accessing an archive. Web-based forms plus FTP download or off-line ordering are also used for very large data sets. Wide variety of portals currently in use for accessing different data types. Large cyber-infrastructure projects also provide data access services, e.g. GEOSS, EarthCube. Processing, knowledge extraction and management: Data analysis software is separate from data discovery and access services. Data processing uses a range of tools that are not domain specific. Limited number of file formats. Trend towards researchers using data beyond their own disciplines and from other geographic regions.
25
Survey conclusions Data preservation, technologies, policies and guidelines Archive service providers expect to retain data for up to 10 years Data producers tend towards 5 to 10 years for retention of data Stakeholders would like to retain data indefinitely but in reality this is between 10 and 20 years Data preservation policies most common in EO domain and follow OAIS model and LTDP guidelines Metadata, semantics and ontologies Main use is for querying and exchanging data Many of the models are XML based Majority are dealing with geoinformatics Findings and recommendations from WP15 and WP33 on data discovery and access will be made available to ES Community for further implementation. Collaboration with other projects (e.g. GENESI-DEC, ENVRI).
26
Survey Methodology. Inputs: Independent search activity; On-line user survey; In-depth user consultation. T15.2 Survey of policies and technologies. T15.3 Survey of metadata, semantics and ontologies. D15.1 Report on survey of technologies, policies, metadata, semantics and ontologies. WP33 Definition of earth science common policies, semantics, ontologies, metadata, architecture and governance model.
27
Earth Science Harmonization Activities Objectives TASK 33.2 TASK 33.3 TASK 33.1 TASK 33.4 Analyse data preservation policies in the different Earth Science domains and define common one. Analyse data access policies and consider possibility to propose one for subsets of ES data. Define strategy for harmonisation of metadata, ontologies and semantics. Analyse gaps in terms of data preservation and interoperability. Define European LTDP Framework functional architecture and governing principles.
29
Data is the new gold. “We have a huge goldmine … Let’s start mining it.” Neelie Kroes, Vice-President of the European Commission responsible for the Digital Agenda
30
But… Gold is precious because: it is rare; it does not combine with other elements; it does not perish. Data is precious because: there is so much of it; it is more valuable when it is combined together; it is highly perishable. Need to ensure long term preservation, accessibility, understandability and usability of data.
31
Threats to preservation of data. Data needs to be preserved against changes in: Technology (hardware and software); Environment; Semantics and Ontologies; Standards; Community of data users; Tacit knowledge of users.
32
Basic preservation activities. Libraries say: “Emulate or migrate”. Works well with data only in special cases. Lets users repeat what was done before, rather than do new things. Does not help with building a cross-disciplinary Earth Science community.
33
Data contains numbers etc – need meaning
34
...to be combined and processed to get this. Diagram labels: Level 0; Level 1; Level 2; Processing; Processing/combining
35
Our approach. For information preservation and re-use: get Representation Information, or Transform. Alternatively, move to another repository.
36
Representation Network (diagram labels): GOCE Level 1 (N1 File Format); GOCE Level 0; Processor; Algorithm; GOCE N1 file description; GOCE N1 file Dictionary; Dictionary specification; XML; GOCE N1 file standard; PDF standard; PDF software
37
Transformation. Change the format, e.g.: Word → PDF/A (NB PDF/A does not support macros); GIF → JPEG2000 (resolution/colour depth…); Excel table → FITS file (NB FITS does not support formulae); old EO or proprietary format → HDF. Certainly need to change STRUCTURE RepInfo. May need to change SEMANTIC RepInfo. We can help with making the decision whether or not to transform.
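The decision whether to transform can be supported by checking what a target format cannot carry. The sketch below is only an illustration: the two limitations (PDF/A without macros, FITS without formulae) come from the slide, while the table layout, the function names and the sample object are assumptions and not part of any SCIDIP-ES toolkit.

```python
# Features a target format cannot carry. Only these two entries are taken
# from the slide; a real table would need many more formats and features.
UNSUPPORTED = {
    "PDF/A": {"macros"},
    "FITS": {"formulae"},
}


def features_lost(features_in_use, target_format):
    """Return the features in use that the target format cannot carry."""
    return set(features_in_use) & UNSUPPORTED.get(target_format, set())


def transformation_report(name, features_in_use, target_format):
    """Summarise a proposed transformation (hypothetical report layout)."""
    lost = sorted(features_lost(features_in_use, target_format))
    return {
        "object": name,
        "target": target_format,
        "lost": lost,
        # True when something known to be unsupported would be dropped.
        "needs_review": bool(lost),
    }


# Hypothetical spreadsheet that relies on formulae, proposed for FITS.
report = transformation_report("budget.xls", {"formulae", "tables"}, "FITS")
print(report)
```

A report with `needs_review` set would prompt the curator to record the loss in the Representation Information or to choose another strategy.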
38
Hand-over. Preservation requires funding. Funding for a dataset (or a repository) may stop. Need to be ready to hand over everything needed for preservation. OAIS (ISO 14721) defines the “Archival Information Package” (AIP). Issues: storage naming conventions; Representation Information; Provenance; ….
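To make the hand-over idea concrete, the sketch below models an AIP with the parts named in the OAIS information model (a data object with its Representation Information, plus Reference, Provenance, Context, Fixity and Access Rights information) and lists what is still missing before hand-over. The class layout, field names and sample values are assumptions for illustration, not the SCIDIP-ES packaging format.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class AIP:
    """Simplified Archival Information Package (illustrative layout only)."""
    data_object: str = ""
    representation_info: list = field(default_factory=list)
    provenance: list = field(default_factory=list)
    context: str = ""
    reference: str = ""
    fixity: str = ""
    access_rights: str = ""


def missing_for_handover(aip):
    """Return the names of AIP parts that are still empty."""
    return [name for name, value in vars(aip).items() if not value]


# Hypothetical package: file name, identifier and bytes are made up.
aip = AIP(
    data_object="goce_level1.n1",
    representation_info=["N1 file format description"],
    reference="example-id-0001",
    fixity=hashlib.sha256(b"example bytes").hexdigest(),
)
print(missing_for_handover(aip))
```

An empty list would mean the package carries every part this simplified model asks for; it says nothing about whether each part is adequate.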
39
When things change. We need to: Know something has changed. Identify the implications of that change. Decide on the best course of action for preservation: what RepInfo we need to fill the gaps (created by someone else, or creating a new one); if transformed, how to maintain data authenticity; alternatively, hand it over to another repository. Make sure data continues to be usable. Diagram labels: Orchestration Service; Gap Identification Service; Preservation Strategy Toolkit; RepInfo Registry Service; Authenticity Toolkit; Storage Service; Data Virtualisation Toolkit; Process Virtualisation Toolkit; RepInfo Toolkit
40
How do we know that the services: Satisfy a general demand? Help with preservation? Evidence
41
Parse.Insight survey Researchers: 1/3 Europe 1/3 USA 1/3 rest of world Responses from researchers, data managers and publishers: 44% Europe 33% USA 23% rest of world
42
Threats to preservation (R): The ones we trust to look after the digital holdings may let us down. The current custodian of the data may cease to exist. Loss of ability to identify the location of data. Access and use restrictions may not be respected in the future. Evidence may be lost. Lack of sustainable hardware/software. Users may be unable to understand or use the data.
43
Threats to preservation (R) Users may be unable to understand or use the data e.g. the semantics, format or algorithms involved.
44
Threat / Requirement for solution / Services and toolkits
1. Threat: Users may be unable to understand or use the data, e.g. the semantics, format, processes or algorithms involved. Requirement for solution: Ability to create and maintain adequate Representation Information. Services and toolkits: RepInfo toolkit, Packager and Registry – to create and store Representation Information. In addition the Orchestration Manager and Knowledge Gap Manager help to ensure that the RepInfo is adequate.
2. Threat: Non-maintainability of essential hardware, software or support environment may make the information inaccessible. Requirement for solution: Ability to share information about the availability of hardware and software and their replacements/substitutes. Services and toolkits: Registry and Orchestration Manager to exchange information about the obsolescence of hardware and software, amongst other changes. The Representation Information will include such things as software source code and emulators.
3. Threat: The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity. Requirement for solution: Ability to bring together evidence from diverse sources about the Authenticity of a digital object. Services and toolkits: Authenticity toolkit will allow one to capture evidence from many sources which may be used to judge Authenticity.
4. Threat: Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future. Requirement for solution: Ability to deal with Digital Rights correctly in a changing and evolving environment. Services and toolkits: Packaging toolkit to package access rights policy into AIP.
5. Threat: Loss of ability to identify the location of data. Requirement for solution: An ID resolver which is really persistent. Services and toolkits: Persistent Identifier system: such a system will allow objects to be located over time.
6. Threat: The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future. Requirement for solution: Brokering of organisations to hold data and the ability to package together the information needed to transfer information between organisations ready for long term preservation. Services and toolkits: Orchestration Manager will, amongst other things, allow the exchange of information about datasets which need to be passed from one curator to another.
7. Threat: The ones we trust to look after the digital holdings may let us down. Requirement for solution: Certification process so that one can have confidence about whom to trust to preserve data holdings over the long term. Services and toolkits: Certification toolkit to help repository manager capture evidence for ISO 16363 Audit and Certification.
45
CASPAR inheritance CASPAR – an FP6 project Completed fundamental research into digital preservation Produced prototypes for services and toolkits which SCIDIP-ES is building on Produced evidence that these services and toolkits did help in digital preservation
46
The CASPAR flows
47
CASPAR Testing
48
The complete view. Services (run on remote servers): Storage Service; Gap Identification Service; Orchestration Service; RepInfo Registry Service; Persistent ID i/f Service. Toolkits (run on local machines): Preservation Strategy Toolkit; Data Virtualisation Toolkit; Process Virtualisation Toolkit; Authenticity Toolkit; Packaging Toolkit; RepInfo Toolkit; Finding Aid Toolkit; Certification Toolkit. External elements: Cloud Storage; External Access/Use Services; External PI services; ISO Certification Organisation. These SUPPLEMENT what repositories do (customised for repositories). Make it easier for repositories to do preservation – share the effort.
49
When things change. We need to: Know something has changed. Understand the implications of that change. Decide on the best course of action for preservation: what RepInfo we need to fill the gaps (created by someone else, or creating a new one); if transformed, how to maintain data authenticity; alternatively, hand it over to another repository. Make sure data is now usable and close the process. Diagram labels: Orchestration Service; Gap Identification Service; Preservation Strategy Toolkit; RepInfo Registry Service; Authenticity Toolkit; Storage Service; Data Virtualisation Toolkit; Process Virtualisation Toolkit; RepInfo Toolkit
50
Representation Information The Information Model is key Recursion ends at KNOWLEDGEBASE of the DESIGNATED COMMUNITY (this knowledge will change over time and region)
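The recursion described on this slide can be sketched as a walk over a Representation Network that stops wherever an item is already in the Designated Community's knowledge base. The code below is a minimal sketch under stated assumptions: the dictionary-based network, the function name `find_gaps` and the dependency edges between the GOCE items are invented for illustration and are not taken from the project's registry or from the actual GOCE network.

```python
def find_gaps(item, network, knowledge_base, _seen=None):
    """Return items that are neither understood by the community nor described.

    network maps an item to the Representation Information it depends on.
    Recursion ends when an item is in the knowledge base.
    """
    seen = _seen if _seen is not None else set()
    if item in knowledge_base or item in seen:
        return set()
    seen.add(item)
    if item not in network:
        return {item}  # nothing describes this item: a gap
    gaps = set()
    for dependency in network[item]:
        gaps |= find_gaps(dependency, network, knowledge_base, seen)
    return gaps


# Assumed edges, for illustration only.
network = {
    "GOCE N1 file": ["GOCE N1 file description", "GOCE N1 file dictionary"],
    "GOCE N1 file description": ["PDF standard"],
    "PDF standard": ["PDF software"],
    "GOCE N1 file dictionary": ["Dictionary specification"],
    "Dictionary specification": ["XML"],
}

print(find_gaps("GOCE N1 file", network, {"XML", "PDF software"}))
print(find_gaps("GOCE N1 file", network, {"XML"}))
```

Because the knowledge base is an input, the same network yields different gaps as the community's knowledge changes over time or between regions.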
51
Preservation Network Model (diagram labels). Representation Network: GOCE Level 1 (N1 File Format); GOCE Level 0; Processor; Algorithm; GOCE N1 file Dictionary; Dictionary specification; XML; GOCE N1 file standard; PDF standard; PDF software. Alternatives: GOCE N1 file Description as text file OR GOCE N1 file Description using DRB (DRB specification). Risk and cost annotations: RISK: X, COST: Y; RISK: X’, COST: Y’; RISK: X’’, COST: Y’’.
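The risk and cost annotations on the alternatives suggest a simple comparison between preservation options. The sketch below picks the cheapest option whose risk stays under a chosen threshold. This selection rule, the 0 to 1 risk scale and all numbers are assumptions made for the example; the slide gives only the placeholders X and Y and does not state how the project weighs them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    risk: float  # assumed scale: 0 (no risk) to 1 (certain loss)
    cost: float  # arbitrary cost units


def choose(options, max_risk):
    """Return the cheapest option with risk <= max_risk, or None."""
    acceptable = [o for o in options if o.risk <= max_risk]
    if not acceptable:
        return None
    # Ties on cost are broken by the lower risk.
    return min(acceptable, key=lambda o: (o.cost, o.risk))


# Hypothetical values standing in for the slide's X/Y placeholders.
options = [
    Option("Description as text file", risk=0.4, cost=1.0),
    Option("Description using DRB", risk=0.2, cost=3.0),
]

print(choose(options, max_risk=0.3))
print(choose(options, max_risk=0.5))
```

Returning `None` when no option is acceptable leaves room for the other route the slides mention, such as handing the data over to another repository.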
52
AUTHENTICITY FINDING AIDS REGISTRY DATA STORE ORCHESTRATION PACKAGING REPINFO TOOLBOX GAP MGR DATA STORE AIP (Archival Information Package) Storage Service Gap Identification Service Orchestration Service RepInfo Registry Service Guarantor/Exchange server node
53
Avoiding a tower of Babel. Representation Information captures information needed to understand/use data. Allows continued use despite changes over time. In principle allows use despite massive diversity, but at the cost of massive practical difficulties and costs. Therefore need to manage diversity.
55
The general picture: from Reqs to Assessment. Diagram labels: WP12; WP14; Use Scenarios & Reqs; Community; WP22/WP24; Test & Assessment; WP21; Services/Toolkits/Manuals
56
Main steps: Initial definition of the main operational scenarios, general enough to accommodate all ES - Discussion with partners for an initial agreement - Discussion of preliminary results at the meeting in Capgemini and at the meeting in Edinburgh - Distribution of high level Use Cases to ES partners - Collection of feedback and assistance in specific UC writing - Face-to-face interviews to elicit the requirements and to clarify critical points
57
WP 12 Results The Deliverable D12.1 has been successfully completed, revised within the consortium and delivered (with a slight delay) to the EC - A comprehensive set of high level user requirements has been identified, which will serve as a basis for the test definition - Use cases, reflecting the typical preservation scenarios have been identified
58
Results: process details. Description of Services and Toolkits reviewed in close collaboration with technical partners to facilitate their comprehension by the ES community - 26 direct face-to-face interviews with people from 7 different scientific domains - More than 100 people completed the online questionnaire (from three categories: data managers, data producers, data consumers)
59
Some other results. Achieved results: details of the process - Typical sizes assessed for 10 Data Collections: - up to 1,200,000 different products - up to several TB of raw data per year - up to several GB of RepInfo per year - 3 High Level Use Cases identified - 21 domain-specific Use Cases identified - 25 user Requirements identified - 23 PNMs and Representation Networks defined
60
From LTDP: Preservation workflow. The Preservation Analysis Workflow (from EO LTDP Guidelines) defines a procedure for design, creation and maintenance of an Archival Information Package, which is intended to optimise the re-use of ES data in the long term. The Preservation Analysis Workflow is thus an excellent guide for setting up a preservation strategy in close collaboration with ES data managers and for eliciting user requirements.
61
EARTH SCIENCE DATA PRESERVATION POLICIES; EARTH SCIENCE KNOWLEDGE. From WP33
62
Thank you! Contacts: Mirko Albani, Project Coordinator: Mirko.Albani@esa.int. Website: www.scidip-es.eu