OAI-PMH for Resource Harvesting Tutorial
OAI4, October 20th 2005, CERN, Geneva, Switzerland

The American Physical Society Project: Standards-based Mirroring of Digital Library Content
Jeroen Bekaert and Herbert Van de Sompel
Digital Library Research & Prototyping Team
Research Library, Los Alamos National Laboratory
This work supported in part by the Library of Congress
context

Add APS collection to locally hosted LANL collection
o Remain permanently synced
o Ensure correctness of locally stored APS data

Bigger picture:
o Archive APS content
o Create efficient content transfer/mirroring approach between information providers & LANL
o NDIIPP: Create efficient content transfer/mirroring approach between heterogeneous content repositories
  - Efficient mechanisms are largely non-existent
  - Devise a standards-based approach:
    – MPEG-21 DIDL
    – OAI-PMH
    – W3C XML Signatures
Bigger picture: OAIS perspective
APS / LANL mirroring process

[Diagram labels: APS repository, OAI-PMH repository, LANL pre-ingest & ingest, OAI-PMH harvester, OAI-PMH request, OAI-PMH response, aDORe repository]

o APS Digital Object represented as application-neutral MPEG-21 DIDL document & exposed through OAI-PMH front-end
o Each datastream provided via a DIDL document is accorded a digest. Digests delivered in DIDL document via W3C XML Signatures
o A complete DIDL document is accorded a digest; delivered in the OAI-PMH « about » container via W3C XML Signature
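The per-datastream digest described above can be illustrated with a minimal sketch. It assumes SHA-1 as the digest algorithm (the slides do not name one) and builds only the ds:Reference part of a W3C XML Signature; a complete signature also needs SignedInfo, canonicalization and a SignatureValue. The function names and the sample URI are hypothetical.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET

DS_NS = "http://www.w3.org/2000/09/xmldsig#"  # W3C XML Signature namespace
SHA1_ALGORITHM = DS_NS + "sha1"               # assumed digest algorithm


def digest_value(data: bytes) -> str:
    """Return the base64-encoded SHA-1 digest, as carried in ds:DigestValue."""
    return base64.b64encode(hashlib.sha1(data).digest()).decode("ascii")


def build_reference(uri: str, data: bytes) -> ET.Element:
    """Build a ds:Reference element recording the digest of one datastream."""
    ET.register_namespace("ds", DS_NS)
    reference = ET.Element(f"{{{DS_NS}}}Reference", {"URI": uri})
    ET.SubElement(reference, f"{{{DS_NS}}}DigestMethod",
                  {"Algorithm": SHA1_ALGORITHM})
    value = ET.SubElement(reference, f"{{{DS_NS}}}DigestValue")
    value.text = digest_value(data)
    return reference


if __name__ == "__main__":
    # Hypothetical datastream bytes and identifier, for illustration only.
    ref = build_reference("#datastream-1", b"example datastream bytes")
    print(ET.tostring(ref, encoding="unicode"))
```

The harvester can later recompute the same value over the bytes it received and compare it with the transmitted ds:DigestValue.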
APS / LANL mirroring process

Remain synced via OAI-PMH datestamp-based harvesting of DIDL documents:
o New APS Digital Objects
o Updated APS Digital Objects
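Datestamp-based harvesting amounts to issuing ListRecords requests with a `from` argument set to the time of the previous harvest. The sketch below only constructs the request URLs; the base URL and the metadata prefix `didl` are assumptions made for illustration, and a repository may support only day granularity rather than seconds.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode


def utc_datestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as an OAI-PMH UTC datestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def list_records_url(base_url: str,
                     metadata_prefix: str,
                     from_datestamp: Optional[str] = None,
                     resumption_token: Optional[str] = None) -> str:
    """Build a ListRecords request for an incremental harvest."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: it replaces all others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_datestamp is not None:
            # Only records created or changed since this datestamp are returned.
            params["from"] = from_datestamp
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical base URL and last-harvest time.
    last_harvest = utc_datestamp(datetime(2005, 10, 20, 9, 30, tzinfo=timezone.utc))
    print(list_records_url("http://example.org/oai", "didl", last_harvest))
```

Because both new and updated records carry a datestamp later than the previous harvest, one request pattern covers both cases listed on the slide.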
APS / LANL mirroring process

Datastreams delivered By-Value and/or By-Reference
o By-Reference requires dereferencing of datastream post harvest

Storage in pre-ingest area:
o Harvested DIDL documents in XMLtape
o Dereferenced content in ARC files
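The By-Value / By-Reference distinction can be sketched as a pass over the Resource elements of a harvested DIDL document. This is a simplified illustration, not the project's code: it assumes the DIDL namespace shown below, assumes that a `ref` attribute marks By-Reference delivery, and handles only textual or base64-encoded By-Value content (not embedded XML). The sample document and its URL are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET

DIDL_NS = "urn:mpeg:mpeg21:2002:02-DIDL-NS"  # assumed DIDL namespace
RESOURCE_TAG = f"{{{DIDL_NS}}}Resource"

SAMPLE = """<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS">
  <didl:Item>
    <didl:Component>
      <didl:Resource mimeType="text/plain" encoding="base64">aGVsbG8=</didl:Resource>
    </didl:Component>
    <didl:Component>
      <didl:Resource mimeType="application/pdf" ref="http://example.org/article.pdf"/>
    </didl:Component>
  </didl:Item>
</didl:DIDL>"""


def split_resources(didl_xml: str):
    """Return (by_value_bytes, by_reference_urls) for one DIDL document."""
    by_value, by_reference = [], []
    for resource in ET.fromstring(didl_xml).iter(RESOURCE_TAG):
        ref = resource.get("ref")
        if ref is not None:
            # By-Reference: must be dereferenced after the harvest.
            by_reference.append(ref)
        elif resource.get("encoding") == "base64":
            # By-Value, base64-encoded inline content.
            by_value.append(base64.b64decode(resource.text or ""))
        else:
            # By-Value, plain text content.
            by_value.append((resource.text or "").encode("utf-8"))
    return by_value, by_reference


if __name__ == "__main__":
    values, refs = split_resources(SAMPLE)
    print(len(values), "by-value datastream(s);", "to dereference:", refs)
```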
APS / LANL mirroring process

Verification of digests:
o DIDL document
o Datastreams

Digest correct: continue
Digest incorrect: reharvest
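The verify-then-reharvest rule can be sketched as a small retry loop. The sketch assumes base64-encoded SHA-1 digests and an upper bound on reharvest attempts; neither detail is stated on the slide, and `fetch` stands in for whatever routine retrieves the DIDL document or datastream.

```python
import base64
import hashlib
from typing import Callable


def sha1_base64(data: bytes) -> str:
    """Compute the digest in the form assumed to be transmitted."""
    return base64.b64encode(hashlib.sha1(data).digest()).decode("ascii")


def harvest_verified(fetch: Callable[[], bytes],
                     expected_digest: str,
                     max_attempts: int = 3) -> bytes:
    """Fetch content until its digest matches, or give up."""
    for _ in range(max_attempts):
        data = fetch()
        if sha1_base64(data) == expected_digest:
            return data  # digest correct: continue with ingest
        # digest incorrect: loop again, i.e. reharvest
    raise ValueError(f"digest mismatch after {max_attempts} attempts")


if __name__ == "__main__":
    # Simulated transfer that is corrupted once, then succeeds.
    responses = iter([b"corrupted", b"payload"])
    content = harvest_verified(lambda: next(responses), sha1_base64(b"payload"))
    print(content)
```

Bounding the number of attempts is a design choice of this sketch: it keeps a persistently corrupted source from stalling the harvester and surfaces the problem as an error instead.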
APS / LANL mirroring process

Ingest Digital Objects:
o Map application-neutral DIDL documents to aDORe-profile DIDL documents
o Insert digests per constituent datastream (W3C XML Signatures)
o Store in aDORe XMLtape/ARCfile environment
APS / LANL mirroring process

o Recurrent introspection in both repositories
o Ability to harvest in both directions in case of problems with stored Digital Objects
software

OAIResource: generic Java-based OAI-PMH resource harvesting software package
o Goal: gather resources by OAI-PMH harvesting first
o Can deal with OAI-PMH repositories irrespective of their supported metadata formats
o Plug-in structure makes the process of dereferencing datastreams configurable per OAI-PMH repository
o Results of harvesting/gathering stored as follows:
  - OAI-PMH records concatenated into XMLtapes
  - Datastreams concatenated into Internet Archive ARC files
o Log files:
  - List successful and unsuccessful harvesting/gathering
  - List relationship between OAI-PMH records in XMLtapes and datastreams in ARC files
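The idea of a per-repository plug-in for dereferencing can be sketched as a registry keyed by repository base URL. This is not the OAIResource API (which is Java and not shown in these slides); every class, function and URL below is hypothetical and serves only to illustrate the configuration pattern.

```python
from typing import Callable, Dict

# A plug-in turns a reference found in a record into a fetchable URL.
Dereferencer = Callable[[str], str]


def identity(ref: str) -> str:
    """Default plug-in: use the reference exactly as given."""
    return ref


class DereferencerRegistry:
    """Maps an OAI-PMH repository base URL to its dereferencing plug-in."""

    def __init__(self, default: Dereferencer = identity) -> None:
        self._plugins: Dict[str, Dereferencer] = {}
        self._default = default

    def register(self, base_url: str, plugin: Dereferencer) -> None:
        self._plugins[base_url] = plugin

    def resolve(self, base_url: str, ref: str) -> str:
        """Apply the repository's plug-in, or the default if none is set."""
        return self._plugins.get(base_url, self._default)(ref)


if __name__ == "__main__":
    registry = DereferencerRegistry()
    # Hypothetical repository that hands out relative paths.
    registry.register("http://example.org/oai",
                      lambda ref: "http://example.org/content/" + ref)
    print(registry.resolve("http://example.org/oai", "article.pdf"))
    print(registry.resolve("http://other.example/oai", "http://other.example/a.pdf"))
```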
Papers

Jeroen Bekaert and Herbert Van de Sompel. A Standards-based Solution for the Accurate Transfer of Digital Assets. D-Lib Magazine, June 2005. http://dx.doi.org/ /june2005-bekaert

Jeroen Bekaert and Herbert Van de Sompel. Access Interfaces for Open Archival Information Systems based on the OAI-PMH and the OpenURL Framework for Context-Sensitive Services. Preprint; draft of an accepted submission for PV 2005 "Ensuring Long-term Preservation and Adding Value to Scientific and Technical data".

Herbert Van de Sompel, Jeroen Bekaert, Xiaoming Liu, Lyudmila Balakireva, Thorsten Schwander. aDORe: a modular, standards-based Digital Object Repository. The Computer Journal. Preprint at arXiv:cs.DL/ ; Computer Journal paper at doi: /comjnl/bxh114