Research Outputs Management in a Nutshell Anna Maria Tammaro, University of Parma Joy Davidson, University of Glasgow Tomasz Miksa, Technical University of Vienna ROMOR Basic Training Workshop/ 06/09/2017
Learning objectives This introductory session will provide an overview of reference models that will help to refine the range of services and infrastructure required to effectively manage and preserve access to research outputs. Key terms will be explained and contextualised to provide a solid foundation for the remainder of the workshop. After this session participants will: understand what digital curation is and how it relates to research outputs management be familiar with the curation lifecycle model be familiar with the OAIS reference model and how it can be applied to the design of open access information repositories and related services be able to communicate more consistently by ensuring a shared understanding of key terms
Structure of session Welcome, practical information, objectives for the workshop (Parma) Overview of digital curation lifecycle model (Glasgow) Overview of Open Archival Information System reference model (Parma) Jargon busting session (Vienna) Open to questions from participants (all)
Introduction to the Digital Curation Lifecycle Model Joy Davidson ROMOR Basic Training Workshop/ 06/09/2017
Digital curation is “maintaining and adding value to a trusted body of digital information for current and future use; specifically… the active management and appraisal of data over the lifecycle of scholarly and scientific materials” http://www.dcc.ac.uk
Why do we need to care about digital curation and preservation? Better access to higher quality outputs for all (public good) Increased confidence in published findings (validation and reproducibility) Improved visibility of the research and reputation of the researcher – higher research standing Improves efficiency (not collecting same data multiple times) Enables novel insights to be derived from existing outputs (data driven research and innovation) Protects data against technical obsolescence
The DCC Curation Lifecycle Model provides a graphical high level overview of the stages required for successful curation and preservation of data from initial conceptualisation or receipt. The model enables: mapping of granular functionality definition of roles and responsibilities building frameworks of standards and technologies to implement identification of additional steps required identification of actions which are not required ensuring adequate documentation of processes and policies
All stages Object oriented model – all lifecycle stages centre on the research output. Actions over the entire lifecycle include: Capturing context Preservation planning Community watch Description Information (Metadata) persistently identifies data and maintains reliable links to them clearly describes what they are clearly identifies technical information needed to use data identifies who is responsible for their management and preservation describes what can be done to them describes what is needed to represent them at the required level of fidelity records their history and documents their authenticity allows users to understand their context and relationship to other objects Representation Information Structure Information: describes the format and data structure concepts to be applied to the bitstream, which result in more meaningful values like characters or number of pixels. Semantic Information: this is needed on top of the structure information. If the digital object is interpreted by the structure information as a sequence of text characters, the semantic information should include details of which language is being expressed. Other Representation Information: includes information about relevant software, hardware and storage media, encryption or compression algorithms, and printed documentation. 
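As an illustration of the Description Information elements listed above, here is a minimal Python sketch of a record for a single research output. The class name, field names, and sample values (including the identifier) are hypothetical; they are not taken from the slides or from any metadata standard.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class DescriptionRecord:
    """Hypothetical record covering the description elements listed above."""
    identifier: str          # persistent identifier for the output
    description: str         # what the output is
    file_format: str         # technical information needed to use it
    responsible_party: str   # who manages and preserves it
    permitted_uses: str      # what can be done with it
    history: list = field(default_factory=list)      # provenance events
    related_to: list = field(default_factory=list)   # links to other objects


record = DescriptionRecord(
    identifier="example-id-0001",   # placeholder, not a real identifier
    description="Survey responses collected for an example project",
    file_format="text/csv",
    responsible_party="Example University Library",
    permitted_uses="Reuse with attribution",
)
record.history.append("2017-09-06: created")
record.related_to.append("example-id-0002")   # e.g. the related publication

# Serialise the record so it can be stored alongside the output itself.
print(json.dumps(asdict(record), indent=2))
```

In practice a repository would map fields like these onto whatever metadata schema it has adopted.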
Preservation planning: is a set of managed activities aims at ensuring the bit-stream is maintained aims at ensuring that data are accessible is concerned with maintaining bit streams and ensuring accessibility for a definable period of time Community watch and participation: access to a wider range of expertise access to tools and systems that might otherwise be unavailable shared influence on R&D of standards and practices attraction of resources and other support for well-coordinated programmes at a regional, national or sectoral level better planning to reduce wasted effort shared development costs shared learning opportunities encouragement for other stakeholders to take preservation seriously
Conceptualise stage Activities during this stage include conceiving and planning for the creation of outputs, including capture methods and storage options. It is, put simply, planning research with digital curation processes and outcomes in mind. Most funders now require a data management plan at the grant application stage illustrating that researchers have considered these issues before any research is undertaken. Be aware that decisions made now can affect the entire lifecycle (ethics, legal, contracts). Conceptualise - plan with digital curation in mind produce first iteration of data management plan develop robust workflow, processes and documentation choose appropriate, existing open standards - interoperability capture and store data in curation-friendly file formats (open source) record sufficient information during data capture to assist with ongoing use scrupulously identify files store data on appropriate media identify a safe place for storage (e.g. a trusted archive) and make sure that archive will take your data identify access methods identify legal framework
Create or receive stage During this stage try to capture context as you create outputs (administrative, descriptive, structural and technical, preservation metadata). Use the FAIR principles as a guideline (findable, accessible, interoperable, reusable). If reusing third party outputs, beware of any licensing restrictions. Remember to document any outputs that may be necessary to validate published findings later! This could include underlying data, models, software and code, algorithms. Create or Receive – ensure data are curation ready of high quality well structured adequately documented interoperable authentic (it is what it claims to be) accurate (it hasn’t been tampered with) renderable (it can be used in the ways for which it was intended, or viewed as originally intended) in a form that best ensures its longevity
Appraise and select stage At this point in the lifecycle, you’ll evaluate the outputs you will need to keep and for how long they need to be preserved. Decisions will be based on institutional and funder policies as well as legal and ethical requirements. Consider how selected outputs will be checked for quality and completeness. Who will be responsible for this (researcher, repository, both)? Appraise and Select – develop robust policies How long do we want to keep the data? in terms of changes of technology in terms of an organisation’s business requirements in terms of user requirements (e.g. as evidence to verify conclusions derived from research). How long do we need to keep the data? assess benefits and risks of keeping/not keeping data what are the consequences of not keeping the data? how much would it cost to recreate it in the future? is it even possible to recreate it in the future? Re-appraisal and disposal Typically data may be transferred to another archive, repository, data centre or other custodian. In some instances data is destroyed. The data’s nature may, for legal reasons, necessitate secure destruction.
Ingest stage At this point in the lifecycle, selected outputs will be deposited to a digital repository or transferred to a suitable OAIR, data centre or other custodian. Consider any normalisation actions that are performed (e.g., Excel to PDF) and potential usability issues. Consider how you will maintain links between outputs generated in a given project deposited in multiple systems (publications, data, software).
Preservation and Storage stages Undertake actions to ensure long-term preservation and retention of the authoritative nature of data. Preservation actions should ensure that data remains authentic, reliable and usable while maintaining its integrity. Value added actions include data cleaning, validation, assigning preservation metadata, assigning representation information and ensuring acceptable data structures or file formats for reuse. Store Store the data in a secure manner adhering to relevant standards. Preservation Action Information managers use ongoing actions such as Preservation Planning and Community Watch to identify when actions need to be taken on data stored in the stable repository environment. They may also choose to make derivatives of data. Migrate – for preservation storage Migrate data to a different format. This may be done to accord with the storage environment or to ensure the data’s immunity from hardware or software obsolescence. Reappraise Return data which fails validation procedures for further appraisal and reselection. Dispose
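One common way to check that stored data keeps its integrity is fixity checking: record a checksum when the file is stored, then recompute and compare it later. The slides do not prescribe a technique, so the Python sketch below is only an illustration. It uses SHA-256 from the standard library, and the function names, file name, and sample content are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: Path, recorded: str) -> bool:
    """Compare the current checksum with the one recorded at storage time."""
    return sha256_of(path) == recorded


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "data.csv"            # hypothetical stored file
    sample.write_bytes(b"id,value\n1,42\n")
    recorded = sha256_of(sample)               # checksum taken at storage time

    unchanged = fixity_ok(sample, recorded)    # file untouched since storage

    sample.write_bytes(b"id,value\n1,43\n")    # simulate silent corruption
    changed_detected = not fixity_ok(sample, recorded)

print(unchanged, changed_detected)
```

A file that fails such a check would be a candidate for restoration from another copy or for the Reappraise step described above.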
Access and reuse stage At this point in the lifecycle, the outputs are made accessible for others to reuse. Consider what context someone would need to use the output. In some cases, restrictions on access are needed (e.g., to protect commercially sensitive or personal information) and appropriate access controls will need to be considered (dealing with access requests, processes). Provide citation guidelines for outputs and consider how usage metrics will be collected. Access, Use and Reuse Ensure that data is accessible to both designated users and re-users, on a day-to-day basis. This may be in the form of publicly available published information. Robust access controls and authentication procedures may be applicable. The original project team has been using their data throughout the curation process and can continue to do so after the project has finished. They perform analyses and publish an article based on their findings. They also open up access to other researchers who can make different uses of the data. This data is of most value when integrated with other research done in the field, so steps are taken to use this data to augment existing datasets.
Transform stage Accessed outputs can be integrated with other data to facilitate new analyses. These produce derivative research outputs: conclusions that could only be drawn from an amalgamation of existing outputs which feed back into the Conceptualise stage of the lifecycle model. Consider how you might track such transformations over time (impact). Create new data from the original, for example by migration into a different format or by creating a subset, by selection or query, to create newly derived results, perhaps for publication. After being integrated with other data, new analyses and techniques are applied to the data by researchers in the same field and some interdisciplinary studies. These produce derivative data: conclusions that could only be drawn from an amalgamation of data from multiple projects, as well as transformations of data items (for example, new visualisations or ‘enhancements’) which feed back into the Create stage of the Lifecycle.
Open Archival Information System ROMOR Workshop 5 September 2017
OAIS International Standard OAIS is an international standard first published in 2003 as ISO 14721:2003. Space data and information transfer systems -- Open archival information system (OAIS) -- Reference model In 2012, ISO issued a revised version of the Reference Model as ISO 14721:2012
The OAIS is a ‘reference model’ What is a reference model? A framework for understanding relationships among entities of its domain for development of consistent standards or specifications supporting the domain A conceptual model which is based on a small number of underlying concepts describes key concepts and relationships, and how they interface to each other and the external environment may be used as a basis for explaining domain specific concepts to non-specialists OAIS is NOT an implementation model!
Purpose, Scope, and Applicability Primary goal of an OAIS: to preserve information for a designated community over an indefinite period of time The standard is a framework for understanding and applying concepts necessary for long-term preservation of digital objects (of any sort) Note, here long-term means long enough to be concerned about technological change Addresses the full range of archival functions Actually applicable to all sorts of organisations and individuals dealing with information that needs long-term preservation (not just ‘archives’)
What is OAIS about? Defining an OAIS: “An archive, consisting of an organization of people and systems that has accepted the responsibility to preserve information and make it available for a Designated Community” ‘Designated Community’ is a singularly important concept within the OAIS model and is defined as the community of stakeholders and users that the OAIS serves The Reference Model is a high-level description of the Environment Functional model Information model of an OAIS
OAIS environment model [Figure: the OAIS (Archive) and the external entities it interacts with: Management, Producer, Designated Community]
OAIS environment: the key external entities with which the archive interacts in the course of carrying out its operations. Producer is the role played by those persons, or client systems, who provide the information to be preserved. Management is the role played by those who set overall OAIS policy as one component in a broader policy domain (not day-to-day administration). Consumer is the role played by those persons, or client systems, who interact with OAIS services to find and acquire preserved information of interest. OAIS: this is just the archive itself; we’ll talk about this in detail in a moment. MANAGEMENT: entity that sets the broad policies for the archive; e.g.: scope of archive’s content; likely provides funding; might serve an oversight function, periodically reviewing archive’s policies and performance. Does not manage day-to-day operations of archive. PRODUCER: entity (or entities) that submits content to the archive to be preserved. Could be an individual, an organization, or even a computer system or machine (or combination). CONSUMER: entity that utilizes the content preserved in the archive. Again, could be a person, organization, or machine (even another archive). Special class of Consumer: DESIGNATED COMMUNITY: class of Consumers who are expected to independently understand information in the form it is preserved in the archive. Independently understand: without requiring additional expertise/assistance in interpretation. Examples: where the content is scientific data, the Designated Community might be individuals with a certain level of scientific expertise; where the content is Java source code, it might be persons skilled in Java programming. Scope of Designated Community has important implications for the amount of metadata that must be packaged with the archived content to ensure that it remains meaningful over the long-term. More on this later when we talk about the OAIS information model.
Note that the scope of the Designated Community is not necessarily static: could change over time. Return to the OAIS itself …
OAIS Functional Model
OAIS Functions (1) Ingest: services and functions that accept SIPs from Producers, prepare AIPs for storage, and ensure that AIPs and their supporting Descriptive Information become established within the OAIS. Archival Storage: services and functions used for the storage and retrieval of AIPs. Data Management: services and functions for populating, maintaining, and accessing a wide variety of information, such as the Descriptive Information that identifies and documents the archive's holdings and the administrative data used to manage the archive.
OAIS Functions (2) Administration: services and functions needed to control the operation of the other OAIS functional entities on a day-to-day basis. Preservation Planning: services and functions for monitoring the OAIS environment and ensuring that content remains accessible to the Designated Community. Access: services and functions which make the archival information holdings and related services visible to Consumers.
OAIS Information Model OAIS provides an information model for managing the digital materials as they pass through the system. The basis of this model is the Information Package (IP). An IP consists of: (1) the digital object(s) to be preserved; (2) the metadata required at that point in the system; (3) the Packaging Information which relates 1 and 2. OAIS specifies three types of Information Package: Submission Information Packages (SIPs), Archival Information Packages (AIPs), and Dissemination Information Packages (DIPs).
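The three parts of an Information Package can be pictured with a small Python sketch. OAIS is a reference model, not an implementation model, so this is only one hypothetical way to represent a package; the class, its fields, and the sample file are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class InformationPackage:
    """Hypothetical in-memory stand-in for an OAIS Information Package."""
    kind: str                                      # "SIP", "AIP" or "DIP"
    objects: dict = field(default_factory=dict)    # file name -> content bytes
    metadata: dict = field(default_factory=dict)   # file name -> metadata dict

    def packaging_information(self) -> list:
        """Relate each object to its metadata; None flags a missing record."""
        return [
            {"object": name, "metadata": self.metadata.get(name)}
            for name in sorted(self.objects)
        ]


sip = InformationPackage(
    kind="SIP",
    objects={"survey.csv": b"id,answer\n1,yes\n"},
    metadata={"survey.csv": {"title": "Example survey data"}},
)
manifest = sip.packaging_information()
print(manifest)
```

Here the manifest plays the role of the Packaging Information: it is the part that binds the digital objects to the metadata describing them.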
Information Package model
OAIS information packages SIP: the package that is sent to an OAIS by a Producer. Its form and detailed content are typically negotiated between the Producer and the OAIS. AIP: the package that is actually preserved in the OAIS. May be made up of more than one SIP, or one SIP may produce several AIPs. DIP: what is delivered to the consumers. It is not necessarily the same as the AIP but will be derived from it in some way.
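The relationship between the three package types can be sketched as a simple flow: several SIPs are combined into one AIP, and a DIP is derived from that AIP in response to a request. The Python below is a minimal illustration using plain dictionaries. The function names, dictionary keys, and sample producers are hypothetical, and a real archive would do far more at each step (validation, normalisation, metadata generation).

```python
def build_aip(sips: list) -> dict:
    """Merge one or more SIPs into a single AIP (illustrative only)."""
    aip = {"type": "AIP", "files": {}, "preservation_notes": []}
    for sip in sips:
        aip["files"].update(sip["files"])
        aip["preservation_notes"].append(f"ingested SIP from {sip['producer']}")
    return aip


def build_dip(aip: dict, wanted: list) -> dict:
    """Derive a DIP containing only the requested files from an AIP."""
    files = {name: aip["files"][name] for name in wanted if name in aip["files"]}
    return {"type": "DIP", "files": files}


# Two hypothetical submissions from different Producers.
sip_a = {"producer": "lab-a", "files": {"data.csv": "1,2,3"}}
sip_b = {"producer": "lab-b", "files": {"readme.txt": "column notes"}}

aip = build_aip([sip_a, sip_b])        # more than one SIP feeds one AIP
dip = build_dip(aip, ["data.csv"])     # the DIP is derived from the AIP

print(sorted(aip["files"]), dip)
```

The DIP here omits the internal preservation notes, which reflects the point that what is delivered need not be identical to what is preserved.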
QUESTIONS