ECDL 2005, September 18-23, 2005, Vienna, Austria. File-based storage of Digital Objects: XMLtapes & Internet Archive ARC files. Xiaoming Liu, Luda Balakireva, Herbert Van de Sompel.

Presentation transcript:

ECDL 2005, September 18-23, 2005, Vienna, Austria
RESEARCH LIBRARY
File-based storage of Digital Objects and constituent datastreams: XMLtapes and Internet Archive ARC files
Xiaoming Liu (1), Luda Balakireva (1), Patrick Hochstenbach (2) and Herbert Van de Sompel (1)
(1) Digital Library Research & Prototyping Team, Research Library, Los Alamos National Laboratory
(2) University Library, Ghent University

Disclaimer
The term Digital Object (DO) will be used as in Kahn/Wilensky:
o Compound object
o Multiple datastreams of different MIME types
o Secondary information pertaining to object and datastreams
o Identifiers for object (and datastreams)
This is ~ OAIS Content Information

                          Type             MIME             Identifier
Digital Object            scholarly paper  N/A              DOI
Constituent Datastream 1  metadata record  application/xml  PMID
Constituent Datastream 2  fulltext file    application/pdf  –

XML-based representation of DOs
Growing interest in XML-based representation of DOs in Digital Library architectures:
o Platform independence
o Industry support
o Longevity, potential migration paths
o Processing tools, validation capabilities
XML-based Compound Object formats:
o ISO/IEC MPEG-21 DID & DIDL
o METS
o IMS/CP
o CCSDS XFDU
Typical functionality:
o By-Value (base64) and/or By-Reference provision of constituent datastreams
o By-Value and/or By-Reference provision of secondary information
o Provision of identifiers

Storing XML-based representations of DOs
Existing approaches:
o Storage of the XML representations as individual files in a file system:
  - Poor access performance
  - Poor backup performance
o Storage of the XML representations in (SQL, XML, object) databases:
  - Long term? Data are dependent on the underlying system
o Storage of the XML representations by concatenating many such documents into a single file such as tar or zip:
  - Not XML-aware, hence no use of off-the-shelf XML tools
  - Increasing storage space (base64 encoding of the constituent datastreams)
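The storage cost of base64 encoding mentioned in the last bullet can be estimated with a short Python sketch. This is not from the slides; the 300-byte payload is an arbitrary stand-in for a constituent datastream, and the function name is made up for illustration.

```python
import base64

def base64_overhead(payload: bytes) -> float:
    """Return encoded size divided by raw size for a By-Value datastream."""
    encoded = base64.b64encode(payload)
    return len(encoded) / len(payload)

# Hypothetical 300-byte datastream; a real PDF would be far larger.
sample = bytes(range(256)) + bytes(44)
ratio = base64_overhead(sample)
print(f"raw={len(sample)} bytes, encoded/raw ratio={ratio:.3f}")
```

Base64 maps every 3 input bytes to 4 output characters, so a datastream embedded By-Value grows by roughly one third before any XML markup is counted.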

aDORe XMLtape/ARCfile solution
Part of LANL aDORe repository effort:
o Standards-based, modular repository architecture
  - Distributed architecture
  - Protocol-based interactions between modules
  - Usable to create interoperable federations of heterogeneous repositories
o Actual implementation of the architecture at LANL
o Components of aDORe software will be released
Inspired by Internet Archive ARC file approach:
o File-based mechanism to store datastreams resulting from Web crawling
o Concatenation of multiple datastreams into a single file
o Metadata as separators between datastreams
o But not suitable for storing XML-based representations of DOs:
  - Metadata capabilities very limited and crawling-related
  - Lose power of XML processing tools

aDORe XMLtape/ARCfile solution
Two interconnected file-based storage mechanisms:
o XMLtapes: File storage of XML-based representations of Digital Objects
o ARCfiles: File storage of constituent datastreams of Digital Objects
The ARC files are interconnected with one or more XMLtapes during the ingestion process
A protocol-based access mechanism is introduced:
o XMLtape is exposed as an autonomous OAI-PMH repository
o ARCfile is exposed as an OpenURL Resolver
Write once - Read many:
o Files remain stable
o Protocol-based access mechanism remains stable
o Indexing mechanisms can change as technologies evolve
Storage approach is independent from the compound object format used to represent DOs as XML
o aDORe uses MPEG-21 DIDL

ISO/IEC MPEG-21 DID & DIDL
[Diagram. Nodes: Digital Item, Digital Item Declaration, DIDL document, MPEG-21 Abstract Model, MPEG-21 DIDL. Relations: has declaration, has XML serialization, based on.]

Representing DOs using MPEG-21 DID
[Figure: a Digital Object and its package, a sample DIDL document.]

aDORe XMLtape
An XML file that concatenates the XML-based representations of multiple DOs
Structure is defined by an XML Schema:
o Tape-level administrative section:
  - Open-ended content
  - Plug-in for processing-related information, indication of related ARCfiles
o Concatenation of records, each of which consists of:
  - Record-level administrative section
  - Identifier and datestamp of the contained record
  - Other record-level administrative information
  - A record (can be from any XML Namespace); DIDL in the case of aDORe
An XMLtape is a valid and well-formed XML file
Independent from the chosen XML-based Compound Object Format

aDORe XMLtape
Sample XMLtape:
<ta:tape xmlns:ta=" ... oai:aps.org:PhysRevA T04:31:22Z ... ...

aDORe XMLtape index
[Diagram: each XMLtape record carries an identifier and a datestamp of ingestion; the index is keyed on identifier/datestamp.]
Indexing:
o Can be achieved with a variety of technologies
o Current implementation: Berkeley DB Java Edition
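The idea of keeping a stable tape file alongside a replaceable index can be sketched in Python as follows. This is only an illustration: it uses an in-memory dict where the slides name Berkeley DB Java Edition, it concatenates bare records without the XMLtape wrapper elements, and the identifiers, datestamps, and function names are all hypothetical.

```python
import io

def write_tape(records):
    """Concatenate records into one byte string and index them.

    records: iterable of (identifier, datestamp, xml_text) tuples.
    Returns (tape_bytes, index), where index maps
    identifier -> (datestamp, byte offset, byte length).
    """
    tape = io.BytesIO()
    index = {}
    for identifier, datestamp, xml_text in records:
        data = xml_text.encode("utf-8")
        index[identifier] = (datestamp, tape.tell(), len(data))
        tape.write(data)
    return tape.getvalue(), index

def read_record(tape_bytes, index, identifier):
    """Fetch one record by identifier without scanning the tape."""
    _datestamp, offset, length = index[identifier]
    return tape_bytes[offset:offset + length].decode("utf-8")

# Hypothetical identifiers and payloads, for illustration only.
records = [
    ("info:example/do/1", "2005-09-18T10:00:00Z", "<record>first</record>"),
    ("info:example/do/2", "2005-09-19T10:00:00Z", "<record>second</record>"),
]
tape_bytes, index = write_tape(records)
print(read_record(tape_bytes, index, "info:example/do/2"))
```

Because the index only stores offsets into the tape, it can be discarded and rebuilt with a different technology while the tape itself stays untouched.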

aDORe XMLtape as OAI-PMH repository
[Diagram: an OAI-PMH request is resolved through the identifier/datestamp index to an XMLtape record, and the DIDL document is returned.]
o OAI-PMH identifier = identifier from the record-level administrative section
o OAI-PMH datestamp = datetime from the record-level administrative section
o OAI-PMH response = content of the record
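A harvester would address a single record on such a tape with an OAI-PMH GetRecord request. The Python sketch below only builds the request URL; the base URL, the identifier, and the `didl` metadata prefix are assumptions chosen for illustration and do not come from the slides.

```python
from urllib.parse import urlencode

def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"

# Hypothetical endpoint, identifier, and metadata prefix.
url = get_record_url(
    "http://localhost:8080/xmltape/oai",
    "info:example/do/1",
    "didl",
)
print(url)
```

The `verb`, `identifier`, and `metadataPrefix` arguments are the ones OAI-PMH defines for GetRecord; everything else in the example is a placeholder.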

Internet Archive ARCfile
Concatenation of binary files
Designed and used by the Internet Archive (Wayback Machine)
o > 400 TB web data
Under revision by the International Internet Preservation Consortium (IIPC): WARC file format
o Input from LANL to facilitate non-Web-crawling use case
The ARC file format is structured as follows:
o A file header that provides administrative information about the ARC file itself
o A sequence of document records, consisting of:
  - A header line containing some, mainly crawl-related, metadata:
    - URI of the crawled document
    - Timestamp of acquisition of the data
    - Size of the data block
  - A response to a protocol request such as an HTTP GET
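The header line of a document record can be read with a few lines of Python. This is a minimal sketch assuming the five space-separated fields named in the sample ARC file (URL, IP-address, Archive-date, Content-type, Archive-length); the class and function names and the sample values are hypothetical.

```python
from typing import NamedTuple

class ArcHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: str
    content_type: str
    archive_length: int

def parse_arc_header(line: str) -> ArcHeader:
    """Split one ARC record header line into its five fields."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    url, ip_address, archive_date, content_type, length = parts
    return ArcHeader(url, ip_address, archive_date, content_type, int(length))

# Hypothetical header line; the values are made up for illustration.
header = parse_arc_header(
    "http://www.example.org/ 192.0.2.1 20050918120000 text/html 202"
)
print(header.url, header.archive_length)
```

The Archive-length field is what lets a reader skip over the data block to the next record without interpreting its content.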

Internet Archive ARC file
Sample ARC file:
filedesc://IA arc text/plain
Alexa Internet
URL IP-address Archive-date Content-type Archive-length
text/html 202
HTTP/ Document follows
Date: Mon, 04 Nov :21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug :33:11 GMT
Content-length: 30
Hello World!!!

Internet Archive ARC file in aDORe
Sample aDORe ARC file:
filedesc://singletape.arc text/plain
Internet Archive
URL IP-address Archive-date Content-type Archive-length
info:lanl-repo/ds/39c2fa93-fa22-4c19-90af-b5f58b9b989a application/pdf
%PDF-1.3
%âãÏÓ
290 0 obj
<<
/Linearized 1
/O 295
/H [ ]
/L
…

Internet Archive ARC file index
[Diagram: the index maps a URL to the corresponding datastream in the ARC file; each record header reads: URL IP-address Archive-date Content-type Archive-length]
Indexing:
o Can be achieved with a variety of technologies
o Current implementation in aDORe: Heritrix toolkit
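A URL index over an ARC-like file can be built in a single pass by using the Archive-length field to jump from record to record. The Python sketch below is a simplification and not the Heritrix-based indexing the slide names: it assumes uncompressed data, five space-separated header fields, and a newline after each data block, and all identifiers in it are hypothetical.

```python
def index_arc(data: bytes) -> dict:
    """Map each record URL to (offset of its data block, length in bytes)."""
    index = {}
    pos = 0
    while pos < len(data):
        end = data.index(b"\n", pos)
        header = data[pos:end].decode("utf-8")
        if not header:  # blank separator line between records
            pos = end + 1
            continue
        url, _ip, _date, _ctype, length = header.split(" ")
        start = end + 1
        index[url] = (start, int(length))
        pos = start + int(length)  # skip the data block without reading it
    return index

def fetch(data: bytes, index: dict, url: str) -> bytes:
    """Return the data block stored under the given URL."""
    start, length = index[url]
    return data[start:start + length]

# Two hypothetical records in an ARC-like layout.
blob = (
    b"info:example/ds/1 0.0.0.0 20050918120000 text/plain 5\nhello\n"
    b"info:example/ds/2 0.0.0.0 20050918120001 text/plain 3\nabc\n"
)
arc_index = index_arc(blob)
print(fetch(blob, arc_index, "info:example/ds/2"))
```

As in aDORe, the key does not have to be a Web URL; any datastream identifier placed in the URL field of the header works the same way.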

ARC file as OpenURL Resolver
[Diagram: an OpenURL request is resolved through the URL index to a datastream in the ARC file, which is returned.]
o Referent Identifier = datastream identifier = URL from ARC record header
o Resolver Identifier = identifier of ARC file

Associating an XMLtape with ARC Files (1)
A Digital Object is represented using an XML-based Complex Object format (e.g. MPEG-21 DID)
The resulting package (e.g. DIDL document) is stored in an XMLtape
Constituent datastreams of the Digital Object are provided By-Reference:
o Using the ref attribute of the Resource element in MPEG-21 DID
o The value of the network location of the constituent datastream is compliant with the NISO OpenURL Framework:
  baseURL(ARCfile OpenURL Resolver)? url_ver = Z & rft_id = Datastream Identifier & res_id = ARCfile identifier
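The By-Reference location described above could be assembled as in the following Python sketch. The base URL and both identifiers are hypothetical. The `url_ver` default of `Z39.88-2004` is an assumption: it is the version string of the NISO OpenURL standard, whereas the slide text shows only "Z".

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def datastream_openurl(base_url, datastream_id, arcfile_id,
                       url_ver="Z39.88-2004"):
    """Build an OpenURL that asks an ARCfile resolver for one datastream.

    rft_id carries the datastream identifier (the referent),
    res_id carries the ARCfile identifier (the resolver).
    """
    query = urlencode({
        "url_ver": url_ver,        # assumed value, see lead-in
        "rft_id": datastream_id,
        "res_id": arcfile_id,
    })
    return f"{base_url}?{query}"

# Hypothetical resolver address and identifiers.
openurl = datastream_openurl(
    "http://localhost:8080/arcfile/resolver",
    "info:example/ds/1",
    "info:example/arc/1",
)
print(openurl)
print(parse_qs(urlsplit(openurl).query))
```

Percent-encoding the identifiers keeps the `info:` URIs intact when the value is placed in the ref attribute of a Resource element.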

Associating an XMLtape with ARC Files (1)
Extract from DIDL:
……
info:lanl-repo/ds/ba0797d d0-90e8-f5397e74892b
<didl:Resource mimeType="application/pdf" ref=" url_ver=Z res_id=info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2 rft_id=info:lanl-repo/ds/ba0797d d0-90e8-f5397e74892b"/>
……

Associating an XMLtape with ARC Files (2)
An XMLtape is associated with its corresponding ARCfiles through a plug-in for the XMLtape-level administrative section.

Associating an XMLtape with ARC Files (2)
XMLtape header:
info:lanl-repo/xmltape/singlescitape
info:lanl-repo/arc/singlescitape
gov.lanl.xmltape.SingleTapeWriter
T22:13:39Z
…

[Diagram. Components: Agent; Identifier Locator; XMLtape with DIDLDocument-id index and creation datetime index; ARC file with datastream-id index. Labels: DIDLDocument-id or content-id; list of (baseURL, DIDLDocument-id); DIDL document; ref; OpenURL; datastream-id; datastream.]

aDORe XMLtape/ARCfile environment

Implementation
XMLtapes:
o Berkeley DB Java Edition
o OCLC OAICat
ARCfiles:
o Heritrix
o OCLC OpenURL software
XMLtape Registry:
o MySQL db
o OCLC OAICat
ARCfile Registry:
o MySQL db
o OCLC OAICat

Performance indicators
System:
o Model: Dell U rack-mount server
o CPU: dual 2.8 GHz Intel Xeon processors
o RAM: 5 GB
o Disks: 10k RPM SCSI disks
XMLtape:
o 1786 MB, DIDL records
o download 100 consecutive DIDL records (787 KB) => 0.18 second
o download static file of same size => 0.09 second
ARCfile:
o 272 MB, 4910 files
o download a sample PDF file (312 KB) => 0.24 second
o download static file of same size => second

Software
ARC files:
o Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
o NetArchive.dk: a project that plans for the preservation of Denmark's cultural heritage on the internet for future generations.
o Many other tools
XMLtapes:
o Perl tool, XML::Tape (LANL & Ghent University)
Combined aDORe XMLtape/ARCfile environment:
o Java tool (LANL), soon to be released on SourceForge

Conclusion
The file-based approach is inherently simple and reduces dependency on a database system.
The autonomy of the indexes allows the files to be retained over time, while the indexes can be created using other techniques as technologies evolve.
The protocol-based nature of the access increases flexibility in light of evolving technologies, as it introduces another layer of abstraction.
The XMLtape approach is inspired by the ARC file format, but provides several additional attractive features:
o Off-the-shelf XML tools can be used to parse/validate an XMLtape
o All DO metadata can be stored in the XML-based compound object format