Meeting of the Member States Expert Group on Digitisation and Digital Preservation 13.10.2015, Luxembourg European Archival Records and Knowledge Preservation.

Slides:



Advertisements
Similar presentations
Data Publishing Service Indiana University Stacy Kowalczyk April 9, 2010.
Advertisements

1 Metadata Tools for JISC Digitisation Projects of still images and text Ed Fay BOPCRIS, Hartley Library University of Southampton.
Configuration management
DOCUMENT TYPES. Digital Documents Converting documents to an electronic format will preserve those documents, but how would such a process be organized?
October 28, 2003Copyright MIT, 2003 METS repositories: DSpace MacKenzie Smith Associate Director for Technology MIT Libraries.
Digital Preservation - Its all about the metadata right? “Metadata and Digital Preservation: How Much Do We Really Need?” SAA 2014 Panel Saturday, August.
Transformations at GPO: An Update on the Government Printing Office's Future Digital System George Barnum Coalition for Networked Information December.
Funded by: © AHDS Sherpa DP – a Technical Architecture for a Disaggregated Preservation Service Mark Hedges Arts and Humanities Data Service King’s College.
Depositing e-material to The National Library of Sweden.
Mark Evans, Tessella Digital Preservation Boot Camp – PASIG meeting, Washington DC, 22 nd May 2013 PREMIS Practical Strategies For Preservation Metadata.
R.Jantz, August 31, Two-day forum on PREMIS Preservation Metadata and the Trusted Digital Repositories August 31, September 1 National Library of.
3. Technical and administrative metadata standards Metadata Standards and Applications.
PREMIS What is PREMIS? – Preservation Metadata Implementation Strategies When is PREMIS use? – PREMIS is used for “repository design, evaluation, and archived.
Producer-Archive Workflow Network (PAWN) Goals Consistent with the Open Archival Information System (OAIS) model Use of web/grid technologies and platform.
US GPO AIP Independence Test CS 496A – Senior Design Fall 2010 Team members: Antonio Castillo, Johnny Ng, Aram Weintraub, Tin-Shuk Wong.
Automatic Evaluation of Migration Quality in Distributed Networks of Converters Miguel Ferreira Supervisors Ana Alice Baptista.
PAWN: A Novel Ingestion Workflow Technology for Digital Preservation
AIP Archival Information Package – Defines how digital objects and its associated metadata are packaged using XML based files. METS (binding file) MODS.
WMS: Democratizing Data
Tools and Services for the Long Term Preservation and Access of Digital Archives Joseph JaJa, Mike Smorul, and Sangchul Song Institute for Advanced Computer.
PAWN: A Novel Ingestion Workflow Technology for Digital Preservation Mike Smorul, Joseph JaJa, Yang Wang, and Fritz McCall.
A Framework for Distributed Preservation Workflows Rainer Schmidt AIT Austrian Institute of Technology iPres 2009, Oct. 5, San.
US GPO AIP Independence Test CS 496A – Senior Design Team members: Antonio Castillo, Johnny Ng, Aram Weintraub, Tin-Shuk Wong Faculty advisor: Dr. Russ.
Incompatible or Interoperable? A METS bridge for a small gap between two digital preservation software packages Lucas Mak Metadata & CatalogLibrarian
An Overview of Selected ISO Standards Applicable to Digital Archives Science Archives in the 21st Century 25 April 2007 Donald Sawyer - NASA/GSFC/NSSDC.
MDC Open Information Model West Virginia University CS486 Presentation Feb 18, 2000 Lijian Liu (OIM:
Ingest and Dissemination with DAITSS Presented by Randy Fischer, Programmer, Florida Center for Library Automation, University of Florida DigCCurr2007.
METS-Based Cataloging Toolkit for Digital Library Management System Dong, Li Tsinghua University Library
Statewide Digitization and the FCLA Digital Archive Priscilla Caplan, Florida Center for Library Automation Statewide Digitization Planners Meeting OCLC,
A Logical Model for Digital Archives Rathachai Chawuthai Information Management CSIM / AIT Draft document 0.1.
Addressing Metadata in the MPEG-21 and PDF-A ISO Standards NISO Workshop: Metadata on the Cutting Edge May 2004 William G. LeFurgy U.S. Library of Congress.
 To explain the importance of software configuration management (CM)  To describe key CM activities namely CM planning, change management, version management.
How to build your own Dark Archive (in your spare time) Priscilla Caplan FCLA.
1 XML as a preservation strategy Experiences with the DiVA document format Eva Müller, Uwe Klosa Electronic Publishing Centre Uppsala University Library,
The DigiTool to FDA Program Lydia Motyka Florida Center for Library Automation.
Configuration Management (CM)
DAITSS: Dark Archive in the Sunshine State Priscilla Caplan, Florida Center for Library Automation DCC Workshop on Long-term Curation within Digital Repositories.
OAIS Rathachai Chawuthai Information Management CSIM / AIT Issued document 1.0.
1 Schema Registries Steven Hughes, Lou Reich, Dan Crichton NASA 21 October 2015.
PREMIS Rathachai Chawuthai Information Management CSIM / AIT.
Digital Long-term Preservation Service in Finland DLM Forum Members Meeting
Implementor’s Panel: BL’s eJournal Archiving solution using METS, MODS and PREMIS Markus Enders, British Library DC2008, Berlin.
The FCLA Digital Archive Joint Meeting of CSUL Committees, 2005.
Digital Preservation: Current Thinking Anne Gilliland-Swetland Department of Information Studies.
Archival Workshop on Ingest, Identification, and Certification Standards Certification (Best Practices) Checklist Does the archive have a written plan.
ETD2006 Preserving ETDs With D.A.I.T.S.S. FLORIDA CENTER FOR LIBRARY AUTOMATION FC LA PAPER AUTHORS: Chuck Thomas Priscilla.
Selene Dalecky March 20, 2007 FDsys: GPO’s Digital Content System.
OAIS Rathachai Chawuthai Information Management CSIM / AIT Issued document 1.0.
Implementing PREMIS in DigiTool Michael Kaplan ALA 2007 Update.
DAITSS and the Florida Digital Archive Priscilla Caplan Florida Center for Library Automation iPRES 2006.
Florida Digital Archive PREMIS and DAITSS. Florida Digital Archive.
Lifecycle Metadata for Digital Objects November 15, 2004 Preservation Metadata.
Chang, Wen-Hsi Division Director National Archives Administration, 2011/3/18/16:15-17: TELDAP International Conference.
Preservation Functionality in a Digital Archive Erik Oltmans Koninklijke Bibliotheek Raymond J. van Diessen IBM Business Consulting Services Hilde van.
European Archival Records and Knowledge Preservation E-ARK: half-way there! Kuldar Aas DLM Forum Members Meeting Riga, 17 June 2015.
Fitting into an Appraisal, Accessioning, Processing, Discovery, and Delivery Workflow Chris Prom, University of Illinois at Urbana Champaign.
E-ARK Co-ordinators Report Current status and way forward #earkproject Janet Delve / Kuldar Aas / Clive.
Session 2b, 25 November 2015 eChallenges e-2015 Copyright 2015 The National Archives of Estonia Current lack of interoperability among submission information.
A Semi-Automated Digital Preservation System based on Semantic Web Services Jane Hunter Sharmin Choudhury DSTC PTY LTD, Brisbane, Australia Slides by Ananta.
Joint Meeting of CSUL Committees,
General Model of E-ARK Services
FLORIDA CENTER FOR LIBRARY AUTOMATION
Building A Repository for Digital Objects
DAITSS and the Florida Digital Archive
An Introduction to Tessella and The Safety Deposit Box Platform
Joseph JaJa, Mike Smorul, and Sangchul Song
Statewide Digitization and the FCLA Digital Archive
Module 01 ETICS Overview ETICS Online Tutorials
Medusa at the University of Illinois
Robin Dale RLG OAIS Functionality Robin Dale RLG
Presentation transcript:

Meeting of the Member States Expert Group on Digitisation and Digital Preservation , Luxembourg European Archival Records and Knowledge Preservation E-ARK format for storage and long- term preservation and the integrated prototype

Part I: E-ARK format for storage and long-term preservation

WP4 overview SIP Packaged prepared by Pre-Ingest WP3 AIP Package created for long-term archive WP4 DIP Package created for access WP5

WP4 deliverables D4.1 Report on available formats and restructions Existing formats for containers which fit (partially) the base line requirements for EARK AIPs will be examined. D4.2 AIP draft specification Documents specifying the pan-European AIP format, 1st version D4.3 E-ARK AIP pilot specification Documents specifying the pan-European AIP format, 2nd version D4.4 Final version of SIP- AIP conversion component Pan- European AIP format final version Background color meaning: gray: submitted blue: upcoming

WP4 main work areas SIP-AIP Conversion Component Implementation EARK-AIP format specification The SIP-AIP Conversion component creates an AIP that complies with the AIP format specification. The AIP format specification is the basis for the SIP-AIP implementation, nevertheless the development is a two-way dynamic process.

E-ARK Information Package Directory structure –Data and Metadata separated –Structure that allows to distinguish descriptive and digital provenance metadata METS as the structural metadata standard PREMIS as the preservation metadata standard

AIP format specification Hierarchical structure of the information package –aligned with a high level specification that is valid across package types (S/A/D)IP The AIP format specification consists of –a proposed structure that allows storing a SIP which was submitted. –a proposed structure that allows adding new representations during ingest or after the AIP was archived. Directory tree as well as the use of metadata standards describing structure and preservation data of the information package –METS as the structural metadata standard –PREMIS as the preservation metadata standard

Representations PREMIS definition „The set of files, including structural metadata, needed for a complete and reasonable rendition of an Intellectual Entity.” * Representations can be provided as part of the SIP already Creating representations happens either during ingest or after the AIP was archived. A new Representation is not necessarily the result of a file format migration, it can also be a set of instructions how to create an emulation environment to render a set of files. * Introduction and Supporting Materials from PREMIS Data Dictionary, p. 7.Introduction and Supporting Materials from PREMIS Data Dictionary

Let‘s start with the SIP …

representations of the SIP

TIFF representation PDF representation

Descriptive Metadata (relates to all representations)

Structural Metadata

points to

Preservation Metadata (PREMIS)

Structural and preservation metadata per representation

XML Schemas to validate XML instances

… and now to the AIP...

Submission folder contains the content of the SIP (two representations) It is stored in a separate branch to clearly separate it from representations that are created during ingest or post-ingest.

The AIP contains another representation (Rep-001 was migrated during ingest into Rep-001.2)

Structural metadata of the AIP Structural metadata of the SIP

points to SIP METS

points to AIP METS

Descriptive metadata of the AIP Descriptive metadata of the SIP

WEBP representation (added during ingest) TIFF representation PDF representation (both part of the original submission)

IP Structure METS files reference Representation METS files and Descriptive Metadata

Representation METS files reference the Actual Content files and Preservation Metadata

XML Schemas (AIP level) XML Schemas (SIP level)

„Documentation“ folder as a sibling to „Metadata“ at each level (depends on whether the documentation relates to a specific representation or to all representations) Other metadata folders possible at the various levels. For example, PREMIS rights can be placed at the AIP level whereas PREMIS events would be recorded at the representation level.

… but hold on, what‘s the point of all this splitting of METS files?

In this example we were dealing with three image files …

… but we know reality looks different ….

… especially in future! … especially in the future!

In this case we must be able to separate representations ….

Parent AIP Representation part Original submission part And aggregate them.

New „representation“  New „generation“ ID001_001.tar ID001_002.tar „Generation“ AIP ID001 Submission/Representations/Rep-001 AIP ID001 Submission/Representations/Rep-001 Representations/Rep „Identifier“ „Representation“

Part II: E-ARK integrated prototype

Goal of the Integrated Prototype Downloadable package –Software components from WP4, WP5, and WP6 –Packaging; Search in/across packages; Access Usable in combination with existing systems –Extends existing commercial and proprietary archiving products (loosely coupled) –Provides concrete integration hooks for ESSArch Preservation Platform supporting (loose and tight coupling) Based on Scalable Technologies –Deployed for small/medium workloads on single host –Ability to scale out (by design) for large volumes

Processing Information Packages Information Packages are processed using automated workflows during ingest and access –Implemented as scripts –for Information Package creation, validation, storage, … Workflow execution is delegated to a backend processing infrastructure –Makes task processing configurable, portable, distributed, asynchronous, … End-users can manage workflows using the graphical E- ARK Web Portal –Browse, configure, execute, and track workflows –create AIP, store AIP, …

Content Repository and Indexing An infrastructure that can be used to extend existing ingest tools, archiving and preservation platforms. –Provides a staging area for storing Information Packages, e.g. in addition to tape-based archival storage. –Provides support for extracting data and metadata from contained items, which are indexed and stored in a content repository. –Provides interfaces that supports searching in and across Information Packages on a per-item basis. –Provides interfaces that support random access on a per-item basis. Technically based on broadly used technologies for large-scale content processing –Hadoop, Lucene, SolR, Tika, Lily, HBase

ESSArch Preservation Platform Production environment that supports traditional archiving preservation processes (pre-ingest, ingest, different storage methods, …). –Provides a backend for digital libraries and archive information systems –Provides the E-ARK reference implementation supporting E-ARK specifications and software components Integration I: Using Prototype Workflows inside EPP –Direct compatibility based on aligned technology stack. –Python (workflow language), Celery (task processing), same SQL- based data model. Integration II: Using custom Storage Method in EPP –Enables EPP to stage Information Packages to E-ARK content repository and index infrastructure.

Why “Integrated” Prototype? E-ARK Web

Why “Integrated” Prototype? E-ARK Web

Why “Integrated” Prototype? E-ARK Web

SIP-AIP Conversion Component Close to ESS’ product ESSArch EPP ( with regards of the technology stack usedhttp:// –Python programming language –Django-WebUI-Development framework –Celery distributed task queue Implemented as tasks which can be parallelized on a cluster –Modular: Tasks can be combined in different workflows –Extensible: Tasks can be added by using a template that is easy to understand (task implementation interface) Results will be published as deliverable D4.4 “Final version of SIP- AIP conversion component”.

SIP to AIP Conversion Customizable conversion workflows LilyHDFSUpload AIPPackaging AIPValidation AIPCreation SIPValidation SIPExtraction IdentifierAssignment SIPDeliveryValidation Modular: Individual tasks can be used to build workflow variants, e.g. to build a package type specific workflow. Extensible: Specific tasks can be added at any point in the workflow Scalable: Scalability by parallel execution; tasks can be executed as isolated processes (e.g. independent from a database) SIPValidation SIPExtraction SIPDeliveryValidation AIPPackaging AIPValidation AIPCreation Characteristics

Task interface (DefaultTask) Task configuration

Information package status Information package status is defined by: –Last executed task –Success/failure of last task execution Task execution control –Accepted inputs defined for each task –Example: SIPRestructuring task accept_input_from = [SIPExtraction]  To restructure the SIP it needs to be extracted

AIP to DIP earkcore AIP to DIP earkcore SIP to AIP earkcore SIP to AIP earkcore Hadoop Distributed File System Conventional File System Working area Conventional File System Working area Long-term storage AIP Long-term storage AIP Search and Access Lily Repository Search and Access Lily Repository DIP Delivery earkweb DIP Delivery earkweb Workers Celery Workers Celery

THE E-ARK PROJECT ISCO-FUNDED BY THE EUROPEAN COMMISSION UNDER THE ICT-PSPPROGRAMME