1
Components for a Science Data Infrastructure – preservation and re-use of data
David Giaretta
2
CASPAR Project http://www.casparpreserves.eu
EU FP6 Integrated Project
Total spend approx. 16 MEuro (8.8 MEuro from EU)
Brief introduction to the CASPAR consortium, not all of whom are represented here.
5
Key Components DRM
6
Modules and Dependencies: defining the Designated Community
[Diagram labels: README.txt TEXT EDITOR ENGLISH LANGUAGE WINDOWS XP FITS FILE FITS STANDARD PDF JAVA s/w JAVA VM s/w DICTIONARY SPECIFICATION UNICODE XML MULTIMEDIA PERFORMANCE DATA C3D DirectX MAX/MSP 3D motion data files 3D scene motion to music mapping strategy]
We can formalize intelligibility based on this notion. For example, to understand a file README.txt I need a TEXT EDITOR but also knowledge of the ENGLISH LANGUAGE. Alternatively, if I could not understand ENGLISH, I could still understand this file if I had an ENGLISH2GREEK DICTIONARY. The same “pattern” occurs almost everywhere: in multimedia performance data, in scientific data, and so on. For instance, astronomers use FITS files. To understand a FITS file one has to understand the FITS standards, and so on.
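The dependency idea in the notes above can be sketched as a small graph traversal. This is only an illustrative sketch, not CASPAR code: the function name `required_modules` and the dictionary `DEPENDS_ON` are hypothetical, and the edges from TEXT EDITOR to WINDOWS XP and from FITS STANDARD to PDF are assumptions made for the example (only the README.txt and FITS FILE edges are stated in the notes).

```python
from typing import Dict, Set

# Hypothetical dependency edges. The first and third entries follow the
# speaker notes; the other two are assumptions made for illustration.
DEPENDS_ON: Dict[str, Set[str]] = {
    "README.txt": {"TEXT EDITOR", "ENGLISH LANGUAGE"},
    "TEXT EDITOR": {"WINDOWS XP"},       # assumed edge
    "FITS FILE": {"FITS STANDARD"},
    "FITS STANDARD": {"PDF"},            # assumed edge
}


def required_modules(module: str, depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return every module reachable from `module` by following dependencies."""
    seen: Set[str] = set()
    stack = [module]
    while stack:
        current = stack.pop()
        for dep in depends_on.get(current, set()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


if __name__ == "__main__":
    print(sorted(required_modules("README.txt", DEPENDS_ON)))
```

Under these assumed edges, a Designated Community can be described by the set of modules it already understands; anything in `required_modules(...)` outside that set is what still has to be supplied.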
7
USE DATA
Use application to find data in Repository
Create DIP with enough RepInfo for the user (via DC profile)
Obtain more RepInfo from Registry if necessary
8
Registry/Repository Scenario
The Digital Object could have some RepInfo packed with it.
User gets data from archive. User may be unfamiliar with the data so needs Representation Information. Some may be packaged with the data.
[Diagram: User, Archive, and Digital Objects tagged with CPIDs]
This gives an example of the way in which a Registry might be used in more detail. WE CAN DEMONSTRATE VARIOUS PIECES OF THIS. Note that the RepInfo may be kept WITH the data in the archive. However, we should look on this as a form of “caching” of the RepInfo.
9
Registry/Repository of Representation Information
May need MORE Representation Information. Obtain this from one or more Registry/Repository of Rep.Info.
Do not want to guess which RepInfo is needed: use an identifier (CPID) associated with the data.
[Diagram: Archive, Digital Objects tagged with CPIDs, and Representation Information tagged with CPIDs]
This gives an example of the way in which a Registry might be used. Note that the RepInfo may be kept WITH the data in the archive. However, we should look on this as a form of “caching” of the RepInfo.
10
Registry/Repository of Representation Information
At some point in the future there is a need for further Representation Information.
RepInfo Toolkit: someone creates the necessary RepInfo using whatever tools are appropriate.
[Diagram: Archive, Digital Objects tagged with CPIDs, and Representation Information tagged with CPIDs]
This gives an example of the way in which a Registry might be used. Note that the RepInfo may be kept WITH the data in the archive. However, we should look on this as a form of “caching” of the RepInfo.
11
Use of RepInfo
RepInfoLabel: points to other RepInfo
  Structure = CPID
  Semantics = CPID
  Rendering s/w = CPID
Each “bag of bits” has an associated pointer (CPID) to a Label.
[Diagram: data with its CPID, a copy of the Label, and the same Label held in an external Registry]
More detail about the link between the data and the registry.
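The link between a CPID, its Label and the Registry might be modelled roughly as follows. This is a minimal sketch under stated assumptions: the class names `RepInfoLabel` and `Registry`, the function `resolve_label`, and the example CPID strings are all hypothetical and are not the CASPAR API. The only ideas taken from the slides are that a Label points to Structure, Semantics and Rendering software by CPID, and that RepInfo kept with the data acts as a cache of what the Registry holds.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RepInfoLabel:
    """A Label: pointers (CPIDs) to further Representation Information."""
    structure: Optional[str]
    semantics: Optional[str]
    rendering_software: Optional[str]


class Registry:
    """Hypothetical in-memory stand-in for an external RepInfo Registry."""

    def __init__(self) -> None:
        self._labels: Dict[str, RepInfoLabel] = {}

    def register(self, cpid: str, label: RepInfoLabel) -> None:
        self._labels[cpid] = label

    def lookup(self, cpid: str) -> RepInfoLabel:
        # Raises KeyError if the Registry does not know this CPID.
        return self._labels[cpid]


def resolve_label(cpid: str,
                  local_cache: Dict[str, RepInfoLabel],
                  registry: Registry) -> RepInfoLabel:
    """Prefer the copy kept with the data; otherwise ask the Registry and cache it."""
    if cpid in local_cache:
        return local_cache[cpid]
    label = registry.lookup(cpid)
    local_cache[cpid] = label
    return label


if __name__ == "__main__":
    registry = Registry()
    registry.register("cpid:label-1", RepInfoLabel("cpid:struct-1", "cpid:sem-1", None))
    cache: Dict[str, RepInfoLabel] = {}
    print(resolve_label("cpid:label-1", cache, registry))
```

Treating the archive's copy as a cache keeps the Registry authoritative, so newer RepInfo can be obtained later without repackaging the data.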
13
CREATE REPINFO
Data holders etc. inform Orchestration of dangers to preservation
Orchestration uses Gap Manager to derive implications
Orchestration asks experts for help
e.g. Experts search for or create additional RepInfo needed to fill “gap” & store in RegRep
14
Key Components DRM
15
Authenticity Protocol
Authenticity Step
16
Authenticity Protocol Execution
Authenticity Protocol Execution Evaluation
17
Overall Authenticity Model
Being implemented by IBM – built into Preservation Data Store
18
Transformation Protocol
Transformation of received data files into IIWG format
Recommendation (Policy): “For the received data file to be deemed as sufficient quality to support data analysis it must have been successfully transformed into standard IIWG data format, the use of processing software must be recorded”
Steps will capture
  Details of transformation used
    Software details including name, version, source
    Time/date of process
    System details
    Details of person responsible
    Reason for believing this software is reliable
  Details of Transformational Information Properties checked
    Information Property descriptions
    Values checked
    Details of how checked
    Details of who was responsible for the check
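The evidence listed for each step could be captured in a record like the one below. This is a hedged sketch only: the class and field names (`TransformationStep`, `PropertyCheck`, `meets_policy`) and all example values are hypothetical, and the policy check is a simplified reading of the recommendation quoted above (the transformation succeeded and the processing software was recorded).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class PropertyCheck:
    """One checked Transformational Information Property."""
    property_description: str
    value_checked: str
    how_checked: str
    checked_by: str


@dataclass
class TransformationStep:
    """Evidence captured for one transformation into the target format."""
    software_name: str
    software_version: str
    software_source: str
    performed_at: datetime
    system_details: str
    person_responsible: str
    reliability_rationale: str
    succeeded: bool
    property_checks: List[PropertyCheck] = field(default_factory=list)


def meets_policy(step: TransformationStep) -> bool:
    """Simplified policy: transformation succeeded and the software is recorded."""
    software_recorded = bool(step.software_name and step.software_version)
    return step.succeeded and software_recorded


if __name__ == "__main__":
    step = TransformationStep(
        software_name="example-converter",      # hypothetical values throughout
        software_version="1.0",
        software_source="in-house",
        performed_at=datetime(2009, 10, 1, tzinfo=timezone.utc),
        system_details="Linux workstation",
        person_responsible="A. Curator",
        reliability_rationale="Validated against reference files",
        succeeded=True,
    )
    print(meets_policy(step))
```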
19
User Experience / Flow diagram
Slide shows the user experience of using the tool with screen shots
20
Key Components DRM
21
Digital Rights Ontology
Core DRO entities
A Domain ontology
  About Intellectual Property Rights
  To describe all aspects that are relevant to evaluate rights
    Attributes of rights
    Validity in time and space
    Laws and Agreements
    People
    Actions
    Licenses
    Constraints and Conditions
    …
Harmonized with CIDOC CRM and FRBRoo
22
Interoperable Provenance Information (1)
Intellectual Property Rights Life Cycle
23
Creation of “Spaces of Mind”
Interoperable Provenance Information (2)
[Diagram: the Creation History of “Spaces of Mind” (CIDOC-CRM core ontology) linked through the DRM to its Intellectual Property Rights (Digital Rights Ontology). Labels in extraction order: P1F_is_identified_by; E39_Actor Daniel Teruggi; E82_Actor_Appellation Daniel Teruggi; P14B_performed; E21_Person Daniel Teruggi; F31_Expression_Creation Creation of “Spaces of Mind”; R22F_created; P14B_performed; F20_Self-Contained_Expression; A20F_isOn; C5_Copyright R1 - Distribution Right; E65_Creation Creation of “Spaces of Mind”; C5_Copyright R2 - Reproduction Right; C5_Copyright R3 - Attribution Right. Unlabelled nodes appear as XXX.]
24
Insight: stakeholders
Funding/policy
  National Funding organisations
  European funding
  Corporate funding
Research
  Research institutes (non-profit)
  Universities
  Academic libraries
Data management (preservation)
  Data centres (profit / non-profit)
  Libraries
  Archives
Publishing
  General (cross-community) publishers
  Specific (community) publishers
25
Surveys to stakeholders
Funding/policy: < responses
Research: 1397 responses
Data management (preservation): 273 responses
Publishing: 186 responses
26
About researchers Per category EU 44%, USA 33%, Other 23%
27
Data spectrum (R)
28
Sharing of data (R) How open is your data?
29
Sharing of data (R) Which constraints do you see in making data open?
30
Threats to preservation (R)
The ones we trust to look after the digital holdings may let us down
The current custodian of the data may cease to exist
Loss of ability to identify the location of data
Access and use restrictions may not be respected in the future
Evidence may be lost
Lack of sustainable hardware/software
Users may be unable to understand or use the data
31
Threats to preservation (R)
Users may be unable to understand or use the data e.g. the semantics, format or algorithms involved.
32
Requirement for solution
Threat: Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved
Requirement for solution: Ability to create and maintain adequate Representation Information
Solution: RepInfo toolkit, Packager and Registry, to create and store Representation Information. In addition the Orchestration Manager and Knowledge Gap Manager help to ensure that the RepInfo is adequate.

Threat: Non-maintainability of essential hardware, software or support environment may make the information inaccessible
Requirement for solution: Ability to share information about the availability of hardware and software and their replacements/substitutes
Solution: Registry and Orchestration Manager to exchange information about the obsolescence of hardware and software, amongst other changes. The Representation Information will include such things as software source code and emulators.

Threat: The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
Requirement for solution: Ability to bring together evidence from diverse sources about the Authenticity of a digital object
Solution: Authenticity toolkit will allow one to capture evidence from many sources which may be used to judge Authenticity.

Threat: Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future
Requirement for solution: Ability to deal with Digital Rights correctly in a changing and evolving environment
Solution: Digital Rights and Access Rights tools allow one to virtualise and preserve the DRM and Access Rights information which exist at the time the Content Information is submitted for preservation.

Threat: Loss of ability to identify the location of data
Requirement for solution: An ID resolver which is really persistent
Solution: Persistent Identifier system: such a system will allow objects to be located over time.

Threat: The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
Requirement for solution: Brokering of organisations to hold data and the ability to package together the information needed to transfer information between organisations ready for long term preservation
Solution: Orchestration Manager will, amongst other things, allow the exchange of information about datasets which need to be passed from one curator to another.

Threat: The ones we trust to look after the digital holdings may let us down
Requirement for solution: Certification process so that one can have confidence about whom to trust to preserve data holdings over the long term
Solution: The Audit and Certification standard to which CASPAR has contributed will allow a certification process to be set up.
33
Discipline repositories
[Diagram: layered e-infrastructure, drawn for the present and for the FUTURE (NOTE THAT THE INFRASTRUCTURE COMPONENTS NAMED HERE ARE ONLY INDICATIVE). Labels: Cross-references, Translators, Thesauri; Knowledge Gap Manager, Orchestration/Brokering, Processing Context, Persistent ID resolver, Authenticity tools, RepInfo Registry, Certification; Automated systems, Repositories, Users; Compute Resource, Storage Resource, Local Authentication, Local Authorisation, Registries, Process ID, Scheduler, Shibboleth; WAN, LAN, Router, Switch, Cable, Interconnects, Gateways, Management.]
Threats shown on the diagram:
  Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved
  Non-maintainability of essential hardware, software or support environment may make the information inaccessible
  The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
  Access and use restrictions may not be respected in the future
  Loss of ability to identify the location of data
  The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
  The ones we trust to look after the digital holdings may let us down
Network infrastructure: individual islands of connectivity, e.g. individual countries. The EU puts in infrastructure, GEANT (JANET in the UK etc.), to connect these.
The network supplies services to isolated organisations, each with its own resources such as CPU and storage. The EU funds EGEE, and the UK funds various GRID projects, to connect these.
These resources are used by repositories: 1) the individual repositories need to be used in an integrated way; 2) the EU and JISC fund broad e-Research infrastructure.
In the future there will be some similar interconnected repositories, and we need to transfer information from now to the future. However, there are many difficulties/threats to this transfer. To counter these threats we need to put in place components to support preservation.
Much, if not all, of this preservation infrastructure is also useful to support the current e-Research infrastructure, particularly where, in multidisciplinary research, one needs to be able to use unfamiliar information resources.
34
Links
CASPAR:
PARSE.Insight:
Alliance for Permanent Access:
Digital Curation Centre:
YouTube cartoon:
OAIS:
35
END
36
About researchers
Communities aggregated to:
  Agriculture & Nutrition
  Behavioural Sciences
  Humanities
  Life Sciences
  Medicine
  Social Sciences
  Physical Sciences
  Socio-Cultural Sciences
  Technology
Based on KNAW classification (Royal Netherlands Academy of Arts and Sciences)
37
Reasons for preservation
It is unique. It potentially has economic value. It may stimulate inter-disciplinary collaborations. It allows for re-analysis of existing data. It may serve validation purposes in the future. It will stimulate the advancement of science (new research can build on existing knowledge). If research is publicly funded, the results should become public property and therefore properly preserved.
38
Reasons for preservation (R)
It is unique It potentially has economic value It may stimulate inter-disciplinary collaborations. It allows for re-analysis of existing data. It may serve validation purposes in the future. It will stimulate the advancement of science. If research is publicly funded, the results should become public property and therefore properly preserved.
39
Reasons for preservation (DM)
It is unique It potentially has economic value It may stimulate inter-disciplinary collaborations. It allows for re-analysis of existing data. It may serve validation purposes in the future. It will stimulate the advancement of science. If research is publicly funded, the results should become public property and therefore properly preserved.
40
Reasons for preservation (P)
It is unique It potentially has economic value It may stimulate inter-disciplinary collaborations. It allows for re-analysis of existing data. It may serve validation purposes in the future. It will stimulate the advancement of science. If research is publicly funded, the results should become public property and therefore properly preserved.
41
Threats to preservation
The ones we trust to look after the digital holdings may let us down. The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future. Loss of ability to identify the location of data. Access and use restrictions (e.g. Digital Rights Management) may not be respected in the future. Evidence may be lost because the origin and authenticity of the data may be uncertain. Lack of sustainable hardware, software or support of computer environment may make the information inaccessible. Users may be unable to understand or use the data e.g. the semantics, format or algorithms involved.
42
Threats to preservation (DM)
The ones we trust to look after the digital holdings may let us down The current custodian of the data may cease to exist Loss of ability to identify the location of data Access and use restrictions may not be respected in the future Evidence may be lost Lack of sustainable hardware/software Users may be unable to understand or use the data
43
Threats to preservation (P)
The ones we trust to look after the digital holdings may let us down The current custodian of the data may cease to exist Loss of ability to identify the location of data Access and use restrictions may not be respected in the future Evidence may be lost Lack of sustainable hardware/software Users may be unable to understand or use the data
46
OAIS Archival Information Package (AIP)
Neeri 2009 1-2 Oct 2009, Helsinki
47
Rep Info Virtualisation /DISCIPLINE
Introducing the layered view of CASPAR which points out that we need to deal with more than RepInfo e.g. Digital Rights etc. Follow the “life-cycle” description in the CM. The items in the red ellipse are the RepInfo we have been talking about previously. The virtualisation is introduced in order to help with automation i.e. we need programmes to process the bytes – how can we make this easier? Virtualisation /DISCIPLINE
48
CASPAR Consortium http://www.casparpreserves.eu
EU FP6 Integrated Project
Total spend approx. 16 MEuro (8.8 MEuro from EU)
Brief introduction to the CASPAR consortium, not all of whom are represented here.
49
CASPAR Solution The CASPAR Foundation Facade Layer KeyComponents
Information Package Mngt Communication Mngt Information Access Designated Community & Knowledge Mngt Security Mngt The CASPAR Foundation KeyComponents Framework Platform
50
CASPAR Foundation
KeyComponents: GapManager, DataAccess&Security, RepInfoToolbox, SemanticWeb, Orchestration, Registry, Packaging, DigitalRights, FindingAids, DataStores, Authenticity, Virtualisation
Framework: CASPAR Service Factory
  Application Server: Tomcat, Glassfish, WASCE
  Development Framework: JAX-WS, GWT, Ant
  Development Management: Hudson and JTrac
Platform
  DBMS: H2, Postgres
  Java Platform
  Operating System: Linux, Unix, Windows, Mac
51
Preservation Issues 1…
How to guarantee that digital information may be accessed and understood in the future
How to guarantee proper information package management within an OAIS Archive
How to guarantee long-term preservation maintenance of any information package
How to guarantee retrieval of Archival Information
52
Preservation Issues 2…
How to guarantee intelligibility within heterogeneous Designated Communities and their digital information
How to guarantee preservation actors are informed about change events
How to guarantee adequate security access with the proper rights to any resource and functionality within an OAIS Archive
How to guarantee adequate integrity and identity for any Archival Information
53
Answer - 1 To guarantee that digital information may be accessed and understood in the future, you need adequate OAIS Representation Information
REPINFO: RepInfo ToolBox
REG: Registry
VIRT: Virtualisation
54
Answer - 2 To guarantee proper information package management within an OAIS Archive, you need to create an adequate OAIS Information Package
PACK: Packaging
55
Preservation DataStores
Answer - 3 To guarantee long-term preservation maintenance of any information package, you need an implementation of OAIS Archival Storage
PDS: Preservation DataStores
56
Answer - 4 To guarantee retrieval of Archival Information, you need OAIS Finding Aids
FIND: Finding
57
Answer - 5 To guarantee intelligibility within heterogeneous Designated Communities and their digital information, you need to manage Designated Community Profiles and their Knowledge Base
KM: Knowledge
58
Answer - 6 To guarantee preservation actors are informed about change events, you need adequate management of message exchange
POM: Orchestration
59
Digital Rights Manager
Answer - 7 To guarantee adequate security access with the proper rights to any resource and functionality within an OAIS Archive, you need Security and DRM Management
DRM: Digital Rights Manager
DAMS: Data Access Manager & Security
60
Answer - 8 To guarantee adequate integrity and identity for any Archival Information, you need an Authenticity Tool
AUTH: Authenticity
62
Data… Level 2 GOME Satellite instrument data
This is a mosaic composed of the data retrieved over 14 orbits per day, with interpolation of data values to fill the gap between two quasi-adjacent geographical areas covered by the satellite ground track. On that date the white zone over Asia indicates absence of useful data.
The format of a file can be identified through a combination of file extension, physical location of the file (as indicated by the Postgres database) and file name, which indicates format. A textual description of the formats is available on the MST support pages.
Raw data
  IQ (in-phase and quadrature) data: non-standard binary file format (textual description of format available)
  Spectral: raw data which has undergone first-stage processing to spectral, non-standard binary file format (textual description of format available)
Processed product
  Version 2 processed data product
  Radial data: Nasa Ames format (product-specific textual description available)
  Cartesian data: Nasa Ames format (product-specific textual description available)
  Version 1: processed data product
  Version 0: processed data product
  Time averaged radial data: non-standard ASCII format (textual description available)
  Unaveraged radial data: non-standard ASCII format (textual description available)
  Time averaged wind data: non-standard ASCII format (textual description available)
  Unaveraged wind data: non-standard ASCII format (textual description available)
  Time averaged power data: non-standard ASCII format (textual description available)
  Quick Look plots: graphic files in png format generated from Cartesian products using GNU plot
Under development
  With version 3, binary files of the Cartesian and radial products will be produced in the NetCDF format in tandem with Nasa Ames ASCII versions. Once Version 3 processing (Python producing NetCDF radial and Cartesian products) is released, a reprocessing of the archived Spectral files will be undertaken in order to produce a consistent series of Cartesian and radial products, as they will be of higher quality and more easily supported. After a period of time the BADC would wish to dispose of the old higher-level products but would like to maintain the ability to recreate them.
64
Representation Information Network
66
Provenance: Performing Arts
The Representation Information model of this performance is shown here, constructed using the CNRS UI Prototype. Thanks to ULeeds and CNRS.
67
Authenticity Model
69
Services of the DRM component
A user or an application registers the creation history of a creative work
  Title of the work, when and where it was made public for the first time
  Who contributed in some part of the work
  What is the precise contribution of each involved person
  …
The DRM derives the existing Copyrights
  Who owns the rights
  Details of the rights
    Author Rights vs Related Rights
    Economic Rights vs Moral Rights
  Country of origin of the rights
  When do the rights expire
74
Scenario: Change in Copyright Law
Goal: Demonstrate that the system withstands changes in Law
Scenario: An amendment to a Copyright Law clarifies who can be considered a performer, hence eligible for related rights
Current IPR Law No. 92–597 of July 1, 1992, Art says: “… persons who act, sing, deliver, declaim, play in or otherwise perform literary or artistic works, variety, circus or puppet acts.”
A hypothetical amendment could extend the definition of performers to include people who project and interpret acousmatic sound files
We have more rightholders
75
Scenario 2 Update of PDI (Provenance)
Update Intellectual Property Rights after a Change in Law POM FIND DRM preservation expert <use> DRM CASPAR Web Desktop PACK PDS
76
Scenario: Change in Copyright Law (2/2)
0. A notification is sent to the DRM Preservation Expert
   Through the POM
Actions to do:
The rules for correctly deriving existing rights must be updated
   Through the DRM Administration Interface
Rights information (PDI-Provenance) must be updated for all affected archive holdings
   The affected AIPs are retrieved (FIND Key Component)
   The rights are re-calculated (DRM Key Component)
   The new Provenance is packaged in the AIPs and stored
      Through the PACK and PDS Key Components
77
Repository Audit and Certification
Standard for certification in OAIS Roadmap
Initial work produced TRAC
Now an official CCSDS Working Group
Open virtual meetings, notes and documents:
Draft standard submitted to CCSDS/ISO to form the basis of an international audit and certification process
78
Conclusions
Preservation
  needs more than format information
  needs more than Significant Properties
Many layers of information
This extra information
  must be created (tools are needed)
  must itself be preserved
Made easier by being integrated with (e-)infrastructure addressing users’ concerns
Must deal with all types of digital objects (metadata)
79
OAIS conformance From OAIS: A conforming OAIS archive implementation shall support the model of information described in The OAIS Reference Model does not define or require any particular method of implementation of these concepts. A conforming OAIS archive shall fulfill the responsibilities listed in Subsection 4 provides examples of the mechanisms that may be used to discharge the responsibilities identified in These mechanisms are not required for conformance. It is expected that a separate standard, as noted in section 1.5, will be produced on which accreditation and certification processes can be built.
80
Modules and Dependencies: Examples (Semantic Web data)
[Diagram: dependency graph over RDF/S namespaces ns1, ns2, ns3, ns4]
The same is true for formally expressed knowledge. For instance, in the Semantic Web one schema may reuse or specialize elements coming from other schemas. In this case too we have a dependency graph (over namespaces this time).
81
Scenario: Intelligibility-aware Packaging
[Diagram: objects and profiles over the modules FITS, ZIP, C3D, DirectX, MAX/MSP, FITS DICTIONARY, FITS STANDARD, DICTIONARY SPECIFICATION, PDF STANDARD, XML SPECIFICATION, UNICODE SPECIFICATION; profile P2 is labelled]
Gap(o2,P1) = Gap(o2,P2) = {FITS, FITS_STANDARD, FITS_DICTIONARY, DICTIONARY_SPECIFICATION}
Gap(o2,P3) = {FITS, FITS_STANDARD, FITS_DICTIONARY, DICTIONARY_SPECIFICATION, PDF_STANDARD, XML_SPECIFICATION, UNICODE_SPECIFICATION}
Gap(o3,P3) = {ZIP}
Gap(o3, ) = {ZIP, C3D, DirectX, MAX/MSP}
Here you see the gaps between objects and various profiles (assuming the dependencies of objects have been recorded in a complete manner).
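One way to read the gaps listed above is as a set difference: everything an object transitively depends on, minus everything a profile already covers. The sketch below illustrates that reading. It is not the Gap Manager's actual algorithm, and the dependency edges and profile contents are assumptions chosen so that the computed sets match the slide; `closure`, `gap`, `DEPENDS_ON`, `P1` and `P3` are hypothetical names.

```python
from typing import Dict, Iterable, Set

# Assumed dependency edges (the slide's diagram did not survive extraction).
DEPENDS_ON: Dict[str, Set[str]] = {
    "o2": {"FITS"},
    "FITS": {"FITS_STANDARD", "FITS_DICTIONARY"},
    "FITS_STANDARD": {"PDF_STANDARD"},
    "FITS_DICTIONARY": {"DICTIONARY_SPECIFICATION"},
    "DICTIONARY_SPECIFICATION": {"XML_SPECIFICATION"},
    "XML_SPECIFICATION": {"UNICODE_SPECIFICATION"},
    "o3": {"ZIP"},
    "ZIP": {"C3D", "DirectX", "MAX/MSP"},
}

# Assumed profile contents: the modules each Designated Community understands.
P1: Set[str] = {"PDF_STANDARD", "XML_SPECIFICATION", "UNICODE_SPECIFICATION"}
P3: Set[str] = {"C3D", "DirectX", "MAX/MSP"}


def closure(start: Iterable[str], depends_on: Dict[str, Set[str]]) -> Set[str]:
    """All modules reachable from any module in `start` via dependency edges."""
    seen: Set[str] = set()
    stack = list(start)
    while stack:
        current = stack.pop()
        for dep in depends_on.get(current, set()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def gap(obj: str, profile: Set[str], depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Modules the object needs that the profile does not already cover."""
    needed = closure([obj], depends_on)
    known = set(profile) | closure(profile, depends_on)
    return needed - known


if __name__ == "__main__":
    print(sorted(gap("o2", P1, DEPENDS_ON)))
    print(sorted(gap("o3", P3, DEPENDS_ON)))
```

An intelligibility-aware packager could then add exactly the modules in `gap(obj, profile, ...)` to the package for a given community, rather than shipping every dependency.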
83
Creating an OAIS Archival Information Package
Need to decide and form strategies based upon
  Preservation objective
  Designated community knowledge base
  Information required based on this
  Sources and supply of information (is this achievable given realistic restrictions on resources?)
Identify risks, dependencies, and what needs to be monitored