Dependency Management APARSEN Training “Access & Usability” Florence, 17 – 18 September, 2014 rene.van.horik@dans.knaw.nl
Outline Digital Preservation Strategies; Interoperability; Automatic Reasoning (for digital preservation). Deliverable 25.1 “Interoperability Objectives and Approaches” (145 pages). Deliverable 25.2 “Interoperability Strategies” (79 pages). (Available on the project website www.aparsen.eu.) Outline of presentation: What is quality? Attention for “Context”. Models that provide context. Data management is important for digital preservation and helps to measure / assess quality. DM is the main topic of the online course “Essentials 4 Data Support”.
Digital Preservation: What is it? Threats to future access to digital information: format obsolescence; not possible to render object; operating system obsolescence; hardware failure.
Digital Preservation Strategies Technology preservation strategy Technology emulation strategy Digital information migration strategy
Interoperability What is interoperability? (exercise 1) Why is it important for digital preservation? DP = Interoperability with the future It enables use and exchange of information/knowledge Avoiding “vendor lock-in” (open standards) Implies standardization and “trust”
Automatic Reasoning Digital preservation has a lot of dependencies. Are our assumptions still valid in the future? E.g.: documentation is understandable; file format is still usable; digital archive still has funding; …
Can the OAIS Reference Model help us? YES! The OAIS reference model will help us to be precise and unambiguous More specific please…
Designated Community An identified group of potential Consumers who should be able to understand a particular set of information. The Designated Community may be composed of multiple user communities. A Designated Community is defined by the Archive and this definition may change over time.
Representation Information The information that maps a Data Object into more meaningful concepts. An example of Representation Information for a bit sequence which is a FITS file might consist of the FITS standard which defines the format plus a dictionary which defines the meaning in the file of keywords which are not part of the standard. Another example is JPEG software which is used to render a JPEG file; rendering the JPEG file as bits is not very meaningful to humans but the software, which embodies an understanding of the JPEG standard, maps the bits into pixels which can then be rendered as an image for human viewing.
Example Digital object: digitized mediaeval charter. Designated community: historians (mediaevalists). Representation information: Digital image in JPEG -> JPEG standard; JPEG software. Transcription in PDF -> PDF standard; PDF viewer. Content in Latin -> English-Latin dictionary. Annotations in XML -> ASCII standard. ….
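The charter example can be written down as a small data structure. This is a minimal sketch, not part of the APARSEN material: the names `CHARTER_REPINFO`, `missing_repinfo` and the inventory `available_items` are hypothetical and only illustrate how required Representation Information could be checked against what an archive actually holds.

```python
# Hypothetical sketch: Representation Information for the digitized charter.
# Each component of the digital object maps to the items the Designated
# Community needs in order to understand it (taken from the slide).
CHARTER_REPINFO = {
    "Digital image in JPEG": ["JPEG standard", "JPEG software"],
    "Transcription in PDF": ["PDF standard", "PDF viewer"],
    "Content in Latin": ["English-Latin dictionary"],
    "Annotations in XML": ["ASCII standard"],
}


def missing_repinfo(repinfo, available):
    """Return, per component, the required items that are not available."""
    gaps = {}
    for component, required in repinfo.items():
        missing = [item for item in required if item not in available]
        if missing:
            gaps[component] = missing
    return gaps


# Assumed inventory of an archive (illustrative only).
available_items = {"JPEG standard", "JPEG software", "PDF standard", "ASCII standard"}
gaps = missing_repinfo(CHARTER_REPINFO, available_items)
print(gaps)
```

With this assumed inventory, the PDF viewer and the dictionary are reported as gaps.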
Exercise Take a digital object as an example This object has to be preserved First give a description of the digital object Describe the Designated Community (which assumptions do you make?) Which Representation Information is required in order for the Designated Community to be able to understand the digital object?
Automatic Reasoning EPIMENIDES: system developed by FORTH. Use cases provided by DANS. Demonstrator: http://www.ics.forth.gr/isl/epimenides
Example 2014: We consider PDF a durable format (facts and rules). 2014: We create a knowledge base expressing our knowledge concerning the PDF format (e.g. software to convert to PDF, software to check that PDF files are not corrupt and can still perform their task). 2015 – 2023: We maintain and apply the knowledge base. 2024: The knowledge base is drastically changed, e.g. because the PDF format is obsolete. 2024: As we have stored the dependencies in a system, we know which risks threaten the usability of PDF files and we can take measures (e.g. migration / emulation). (The knowledge base is created and maintained in the Epimenides system.)
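The PDF timeline can be illustrated with a toy knowledge base of facts and one rule. This is only a sketch under stated assumptions: the triple layout, the predicate names `has_format` and `status`, and the file name `charter.pdf` are hypothetical and do not reflect how Epimenides stores its knowledge.

```python
# Hypothetical facts as (subject, predicate, object) tuples.
facts_2014 = {
    ("charter.pdf", "has_format", "PDF"),
    ("PDF", "status", "durable"),
}


def files_at_risk(facts):
    """Rule (assumed): a file is at risk if its format is marked obsolete."""
    obsolete = {s for (s, p, o) in facts if p == "status" and o == "obsolete"}
    return sorted(s for (s, p, o) in facts if p == "has_format" and o in obsolete)


# 2024: the knowledge base changes because PDF is now considered obsolete.
facts_2024 = (facts_2014 - {("PDF", "status", "durable")}) | {
    ("PDF", "status", "obsolete")
}

print(files_at_risk(facts_2014))
print(files_at_risk(facts_2024))
```

Only one fact changes between 2014 and 2024; the rule then derives which stored files need a measure such as migration or emulation.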
Use Cases (Slides by FORTH) For plain users For Archivists
For plain users: The user uploads a file or a zipped bundle of files Upload your own digital objects
The system finds the tasks that usually make sense to apply to the uploaded digital objects Rendering for this .txt file Runnability for this .exe file Requesting performability checking
Getting the results of the Dependency Analysis (the results of the automatic reasoning) Reds: Inability to perform this task on this file Greens: Ability to perform these tasks over these objects
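The red/green outcome of such a dependency analysis can be sketched as follows. The rule table `TASK_DEPENDENCIES`, the module names and the function `analyse` are assumptions made for illustration; the actual system derives these results by reasoning over its knowledge base.

```python
import os

# Hypothetical rules: (task, file extension) -> modules the task depends on.
TASK_DEPENDENCIES = {
    ("Rendering", ".txt"): {"text editor"},
    ("Runnability", ".exe"): {"Windows operating system"},
}


def analyse(files, available_modules):
    """Split (task, file) pairs into performable (green) and not (red)."""
    greens, reds = [], []
    for name in files:
        ext = os.path.splitext(name)[1].lower()
        for (task, task_ext), needed in TASK_DEPENDENCIES.items():
            if task_ext != ext:
                continue
            missing = needed - available_modules
            (reds if missing else greens).append((task, name))
    return greens, reds


# Assumed user profile: only a text editor is available.
greens, reds = analyse(["notes.txt", "setup.exe"], {"text editor"})
print("Greens:", greens)
print("Reds:", reds)
```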
Ability to explore the dependencies related to one task Direct dependencies of Rendering Task
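The difference between the direct dependencies of a task and everything it ultimately relies on can be sketched with a small graph. The graph `DEPENDS_ON` and its node names are hypothetical examples, not data from the demonstrator.

```python
# Hypothetical dependency graph: node -> nodes it directly depends on.
DEPENDS_ON = {
    "Rendering of charter.pdf": ["PDF viewer"],
    "PDF viewer": ["Operating system"],
    "Operating system": ["Hardware"],
    "Hardware": [],
}


def direct_dependencies(node):
    """Dependencies listed for the node itself."""
    return list(DEPENDS_ON.get(node, []))


def all_dependencies(node):
    """Direct and indirect dependencies, in order of discovery."""
    seen = []
    stack = list(DEPENDS_ON.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        stack.extend(DEPENDS_ON.get(current, []))
    return seen


print(direct_dependencies("Rendering of charter.pdf"))
print(all_dependencies("Rendering of charter.pdf"))
```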
Use Case for Archivists: Aiding the Definition of new Tasks Name of the new task Define the dependencies of this task
Use Case for Archivists: Consequences of a Hypothetical Loss
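A hypothetical-loss question ("what can we no longer do if module X disappears?") reduces to a lookup over recorded dependencies. The following is a minimal sketch; the task table `TASKS` and the function `affected_by_loss` are assumed names for illustration only.

```python
# Hypothetical record of which modules each task depends on.
TASKS = {
    "Rendering of notes.txt": {"text editor", "operating system"},
    "Rendering of charter.pdf": {"PDF viewer", "operating system"},
    "Running setup.exe": {"operating system"},
}


def affected_by_loss(tasks, lost_module):
    """Return the tasks that would stop being performable if the module is lost."""
    return sorted(task for task, needed in tasks.items() if lost_module in needed)


print(affected_by_loss(TASKS, "PDF viewer"))
print(affected_by_loss(TASKS, "operating system"))
```

Losing a module that many tasks share, such as the operating system in this toy table, affects far more tasks than losing a single viewer.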
Exploring the contents of the Knowledge Base Explore the contents of the underlying RDF/S triple store
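Exploring a triple store amounts to matching patterns against (subject, predicate, object) statements. The sketch below uses plain Python tuples rather than a real RDF/S store or SPARQL; the `ex:` vocabulary and the function `match` are hypothetical and are not the schema used by Epimenides.

```python
# Hypothetical triples; the "ex:" names are invented for illustration.
TRIPLES = [
    ("ex:charter.pdf", "rdf:type", "ex:PDFFile"),
    ("ex:PDFFile", "rdfs:subClassOf", "ex:DigitalObject"),
    ("ex:Rendering", "ex:dependsOn", "ex:PDFViewer"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Which dependencies are recorded in the store?
print(match(TRIPLES, p="ex:dependsOn"))
```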
Concluding Remarks (1/2) Each interoperability objective or challenge can be considered a demand for the performability of a particular task (or tasks). However, each task has various prerequisites (e.g. operating system, tools, software libraries, parameters). We call all of these dependencies. Standardization reduces dependencies without eliminating them; moreover, standards cannot always be adopted. The ultimate objective is the ability to perform a task, not compliance with a standard. We need an alternative approach that reduces the human effort: a dependency management approach. The definition and adoption of standards (for data and services) aids interoperability because systems and tools that support these standards are more likely to exist, now and in the future, than systems and tools that support proprietary formats. From a dependency point of view, standardization reduces the dependencies and makes them easier to resolve; it does not make them disappear. For these reasons we have proposed a dependency management approach that can reduce the human effort for DP services.
Concluding Remarks (2/2) The proposed approach can be used for capturing converters and emulators, which are basic preservation strategies, and can be applied to concrete use cases. We have designed and implemented a proof-of-concept prototype (Epimenides) for testing whether the proposed reasoning approach behaves as expected. Since the implementation is based on W3C standards, it can be straightforwardly enriched with information coming from other external sources (i.e. SPARQL endpoints).
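How converters extend what is performable can be sketched as a search over conversion steps: a file is renderable if a viewer exists for its format or for a format reachable through known converters. This is an illustrative sketch, not the prototype's reasoning; the function `renderable` and the example formats are assumptions. Emulators could be modelled analogously, as steps from one platform to another.

```python
from collections import deque


def renderable(fmt, viewers, converters):
    """Return the chain of formats leading to a viewable format, or None.

    viewers: set of formats for which a viewer is available.
    converters: list of (source_format, target_format) pairs.
    """
    queue = deque([[fmt]])
    visited = {fmt}
    while queue:
        path = queue.popleft()
        if path[-1] in viewers:
            return path
        for source, target in converters:
            if source == path[-1] and target not in visited:
                visited.add(target)
                queue.append(path + [target])
    return None


# Assumed situation: only a PDF viewer is available.
viewers = {"PDF"}
converters = [("WordPerfect", "DOC"), ("DOC", "PDF")]
print(renderable("WordPerfect", viewers, converters))
print(renderable("DWG", viewers, converters))
```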
Value of the work done The proposed approach offers a flexible strategy for achieving interoperability by combining existing software and closing a gap that prevents the performability of a task. The automated reasoning could also greatly reduce the human effort required for checking (or periodically monitoring) whether a task on a digital object is performable.
http://datasupport.researchdata.nl/en Proper data management influences the quality of data, so proper training is important. The online course “Essentials 4 Data Support” aims to improve the skills of data supporters.
Network of Excellence