ECDL 2005, September 18th - 23rd 2005, Vienna, Austria
File-based storage of Digital Objects: XMLtapes & Internet Archive ARC files
Xiaoming Liu, Luda Balakireva, Herbert Van de Sompel
RESEARCH LIBRARY

File-based storage of Digital Objects and constituent datastreams: XMLtapes and Internet Archive ARC files
Xiaoming Liu (1), Luda Balakireva (1), Patrick Hochstenbach (2) and Herbert Van de Sompel (1)
(1) Digital Library Research & Prototyping Team, Research Library, Los Alamos National Laboratory
(2) University Library, Ghent University

Disclaimer
The term Digital Object (DO) will be used as in Kahn/Wilensky:
o Compound object
o Multiple datastreams of different MIME types
o Secondary information pertaining to object and datastreams
o Identifiers for object (and datastreams)
This is approximately OAIS Content Information

                          Type             MIME             Identifier
Digital Object            scholarly paper  N/A              DOI
Constituent Datastream 1  metadata record  application/xml  PMID
Constituent Datastream 2  fulltext file    application/pdf  -

XML-based representation of DOs
Growing interest in XML-based representation of DOs in Digital Library architectures:
o Platform independence
o Industry support
o Longevity, potential migration paths
o Processing tools, validation capabilities
XML-based Compound Object formats:
o ISO/IEC MPEG-21 DID & DIDL
o METS
o IMS/CP
o CCSDS XFDU
Typical functionality:
o By-Value (base64) and/or By-Reference provision of constituent datastreams
o By-Value and/or By-Reference provision of secondary information
o Provision of identifiers

Storing XML-based representations of DOs
Existing approaches:
o Storage of the XML representations as individual files in a file system:
  - Poor access performance
  - Poor backup performance
o Storage of the XML representations in (SQL, XML, object) databases:
  - Long-term viability is uncertain: data are dependent on the underlying system
o Storage of the XML representations by concatenating many such documents into a single file such as tar or zip:
  - Not XML-aware, hence no use of off-the-shelf XML tools
  - Increased storage space (base64 encoding of the constituent datastreams)

aDORe XMLtape/ARCfile solution
Part of the LANL aDORe repository effort:
o Standards-based, modular repository architecture:
  - Distributed architecture
  - Protocol-based interactions between modules
  - Usable to create interoperable federations of heterogeneous repositories
o Actual implementation of the architecture at LANL
o Components of aDORe software will be released
Inspired by the Internet Archive ARC file approach:
o File-based mechanism to store datastreams resulting from Web crawling
o Concatenation of multiple datastreams into a single file
o Metadata as separators between datastreams
o But not suitable for storing XML-based representations of DOs:
  - Metadata capabilities are very limited and crawling-related
  - The power of XML processing tools is lost

aDORe XMLtape/ARCfile solution
Two interconnected file-based storage mechanisms:
o XMLtapes: file storage of XML-based representations of Digital Objects
o ARCfiles: file storage of constituent datastreams of Digital Objects
The ARC files are interconnected with one or more XMLtapes during the ingestion process
A protocol-based access mechanism is introduced:
o XMLtape is exposed as an autonomous OAI-PMH repository
o ARCfile is exposed as an OpenURL Resolver
Write once, read many:
o Files remain stable
o Protocol-based access mechanism remains stable
o Indexing mechanisms can change as technologies evolve
Storage approach is independent of the compound object format used to represent DOs as XML:
o aDORe uses MPEG-21 DIDL

ISO/IEC MPEG-21 DID & DIDL
[Figure: a Digital Item has a declaration (Digital Item Declaration), which has an XML serialization (DIDL document); the MPEG-21 Abstract Model has an XML serialization (MPEG-21 DIDL). Edge labels: "has declaration", "has XML serialization", "based on".]

Representing DOs using MPEG-21 DID
[Figure: a Digital Object and its Package, a sample DIDL document]

aDORe XMLtape
An XML file that concatenates the XML-based representations of multiple DOs
Structure is defined by an XML Schema:
o Tape-level administrative section:
  - Open-ended content
  - Plug-in for processing-related information, indication of related ARCfiles
o Concatenation of records, each of which consists of:
  - Record-level administrative section: identifier and datestamp of the contained record, other record-level administrative information
  - A record (can be from any XML Namespace); DIDL in the case of aDORe
An XMLtape is a valid and well-formed XML file
Independent of the chosen XML-based Compound Object format

aDORe XMLtape
Sample XMLtape:
<ta:tape xmlns:ta=" ... oai:aps.org:PhysRevA T04:31:22Z ... ... aDORe ta:tape

aDORe XMLtape index
[Figure: each XMLtape record carries an identifier and a datestamp of ingestion; an identifier/datestamp index points into the XMLtape]
Indexing:
o Can be achieved with a variety of technologies
o Current implementation: Berkeley DB Java Edition
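
The identifier/datestamp index described on this slide could be sketched roughly as follows. This is a minimal illustration, not the aDORe implementation: the element names (`ta:record`, `ta:admin`, `ta:identifier`, `ta:date`), the namespace URI and the sample records are hypothetical, a plain dictionary stands in for Berkeley DB Java Edition, and a regular expression replaces a real XML parser for brevity.

```python
import re

# Hypothetical miniature tape: element names and namespace URI are
# illustrative stand-ins, not the actual aDORe XMLtape schema.
TAPE = (
    b'<ta:tape xmlns:ta="urn:example:tape">'
    b'<ta:record><ta:admin><ta:identifier>id:1</ta:identifier>'
    b'<ta:date>2005-01-01T00:00:00Z</ta:date></ta:admin>'
    b'<doc>first</doc></ta:record>'
    b'<ta:record><ta:admin><ta:identifier>id:2</ta:identifier>'
    b'<ta:date>2005-01-02T00:00:00Z</ta:date></ta:admin>'
    b'<doc>second</doc></ta:record>'
    b'</ta:tape>'
)

# One match per record; groups capture the identifier and the datestamp.
RECORD = re.compile(
    rb'<ta:record>.*?<ta:identifier>(.*?)</ta:identifier>'
    rb'.*?<ta:date>(.*?)</ta:date>.*?</ta:record>',
    re.DOTALL,
)

def build_index(tape: bytes) -> dict:
    """Map identifier -> (datestamp, byte offset, byte length) of its record."""
    index = {}
    for match in RECORD.finditer(tape):
        identifier = match.group(1).decode()
        datestamp = match.group(2).decode()
        index[identifier] = (datestamp, match.start(), match.end() - match.start())
    return index

def fetch(tape: bytes, index: dict, identifier: str) -> bytes:
    """Return the raw bytes of one record using only the index entry."""
    _, offset, length = index[identifier]
    return tape[offset:offset + length]

index = build_index(TAPE)
record = fetch(TAPE, index, "id:2")
print(index["id:2"][0], record)
```

Because the index stores only offsets and lengths, it can be discarded and rebuilt with another technology without touching the tape itself.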

aDORe XMLtape as OAI-PMH repository
[Figure: an OAI-PMH request is answered from the XMLtape via the identifier/datestamp index; the response carries a DIDL document]
o OAI-PMH identifier = identifier from the record-level administrative section
o OAI-PMH datestamp = datetime from the record-level administrative section
o OAI-PMH response = content of the record
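
As a rough illustration of what a client of an XMLtape exposed through OAI-PMH might send, the sketch below builds two request URLs. `GetRecord` and `ListIdentifiers` and their arguments are standard OAI-PMH; the base URL, the record identifier and the `didl` metadata prefix are assumptions made up for the example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def get_record_url(base_url, identifier, metadata_prefix):
    """URL for an OAI-PMH GetRecord request for a single record."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    return base_url + "?" + urlencode(params)

def list_identifiers_url(base_url, metadata_prefix, from_date=None, until_date=None):
    """URL for a datestamp-based ListIdentifiers harvest."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return base_url + "?" + urlencode(params)

# Hypothetical values, chosen only for illustration.
BASE_URL = "http://example.org/xmltape"
single = get_record_url(BASE_URL, "oai:example.org:record-1", "didl")
harvest = list_identifiers_url(BASE_URL, "didl", from_date="2005-01-01")

# Decode the query strings again to show which arguments each request carries.
single_args = parse_qs(urlsplit(single).query)
harvest_args = parse_qs(urlsplit(harvest).query)
print(single)
print(harvest)
```

The identifier argument is looked up in the identifier index, and the `from`/`until` arguments are answered from the datestamp index.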

Internet Archive ARCfile
Concatenation of binary files
Designed and used by the Internet Archive (Wayback Machine):
o > 400 TB web data
Under revision by the International Internet Preservation Consortium (IIPC): WARC file format
o Input from LANL to facilitate non-Web-crawling use case
The ARC file format is structured as follows:
o A file header that provides administrative information about the ARC file itself
o A sequence of document records, each consisting of:
  - A header line containing some, mainly crawl-related, metadata: URI of the crawled document, timestamp of acquisition of the data, size of the data block
  - A response to a protocol request such as an HTTP GET
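
The record structure listed above can be illustrated with a small reader. This is a sketch under stated assumptions rather than a full ARC parser: it assumes the five space-separated header fields shown in the sample files (URL, IP-address, Archive-date, Content-type, Archive-length), a single newline between records, and a file header that has the same shape as an ordinary record. The sample identifiers and payloads are invented.

```python
import io

FIELDS = ("url", "ip", "archive_date", "content_type", "archive_length")

def make_record(url, content_type, payload):
    """Serialize one record: header line, data block, separator newline."""
    header = f"{url} 0.0.0.0 20050918000000 {content_type} {len(payload)}\n"
    return header.encode("ascii") + payload + b"\n"

def read_arc_records(stream):
    """Yield (offset, header, payload) for every record in the stream."""
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            return          # end of file
        if not line.strip():
            continue        # separator newline between records
        header = dict(zip(FIELDS, line.decode("ascii").split()))
        # The data block is read by length, so it may contain newlines.
        payload = stream.read(int(header["archive_length"]))
        yield offset, header, payload

# Hypothetical file: a file-description record followed by one datastream.
filedesc_body = b"1 0 Example\nURL IP-address Archive-date Content-type Archive-length\n"
pdf_payload = b"%PDF-1.3\nbinary\ndata"
arc_bytes = (
    make_record("filedesc://example.arc", "text/plain", filedesc_body)
    + make_record("info:example/ds/1", "application/pdf", pdf_payload)
)

records = list(read_arc_records(io.BytesIO(arc_bytes)))
for offset, header, payload in records:
    print(offset, header["url"], header["content_type"], len(payload))
```

The offsets yielded here are exactly what a URL index needs in order to jump to a datastream without scanning the file.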

Internet Archive ARC file
Sample ARC file:
filedesc://IA arc text/plain
Alexa Internet
URL IP-address Archive-date Content-type Archive-length
text/html 202
HTTP/ Document follows
Date: Mon, 04 Nov :21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug :33:11 GMT
Content-length: 30
Hello World!!!

Internet Archive ARC file in aDORe
Sample aDORe ARC file:
filedesc://singletape.arc text/plain
Internet Archive
URL IP-address Archive-date Content-type Archive-length
info:lanl-repo/ds/39c2fa93-fa22-4c19-90af-b5f58b9b989a application/pdf
%PDF-1.3 %âãÏÓ 290 0 obj << /Linearized 1 /O 295 /H [ ] /L …

Internet Archive ARC file index
[Figure: a URL index points to the datastreams in the ARC file; each record header lists URL, IP-address, Archive-date, Content-type, Archive-length]
Indexing:
o Can be achieved with a variety of technologies
o Current implementation in aDORe: Heritrix toolkit

ARC file as OpenURL Resolver
[Figure: an OpenURL request is answered from the ARC file via the URL index; the response is the datastream]
o Referent Identifier = datastream identifier = URL from ARC record header
o Resolver Identifier = identifier of ARC file
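
A resolver of this kind could be sketched as below. It is a simplified illustration, not the OCLC OpenURL software used in aDORe: the ARC file is an in-memory byte string, the index is a dictionary from datastream identifier to the (offset, length) of its data block, and all identifiers are hypothetical. Only the `rft_id` and `res_id` keys are interpreted.

```python
import io
from urllib.parse import parse_qs

# Hypothetical one-record ARC file held in memory.
ARC_ID = "info:example/arc/demo"
PAYLOAD = b"%PDF-1.3 demo"
HEADER = (
    f"info:example/ds/1 0.0.0.0 20050918000000 application/pdf {len(PAYLOAD)}\n"
).encode("ascii")
ARC_BYTES = HEADER + PAYLOAD + b"\n"

# Index: datastream identifier -> (offset of data block, length of data block).
INDEX = {"info:example/ds/1": (len(HEADER), len(PAYLOAD))}

def resolve(query_string, arc_id, arc_file, index):
    """Return the datastream named by rft_id if res_id names this ARC file."""
    params = parse_qs(query_string)
    if params.get("res_id", [None])[0] != arc_id:
        raise LookupError("request is addressed to a different ARC file")
    rft_id = params["rft_id"][0]
    offset, length = index[rft_id]      # KeyError if the datastream is unknown
    arc_file.seek(offset)
    return arc_file.read(length)

query = "rft_id=info:example/ds/1&res_id=info:example/arc/demo"
datastream = resolve(query, ARC_ID, io.BytesIO(ARC_BYTES), INDEX)
print(datastream)
```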

Associating an XMLtape with ARC Files (1)
A Digital Object is represented using an XML-based Complex Object format (e.g. MPEG-21 DID)
The resulting package (e.g. DIDL document) is stored in an XMLtape
Constituent datastreams of the Digital Object are provided By-Reference:
o Using the ref attribute of the Resource element in MPEG-21 DID
o The value of the network location of the constituent datastream is compliant with the NISO OpenURL Framework:
  baseURL(ARCfile OpenURL Resolver)?
    url_ver = Z
    & rft_id = Datastream Identifier
    & res_id = ARCfile identifier
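
Constructing such a By-Reference location might look like the sketch below. The function name, base URL and identifiers are hypothetical. The version string is passed in by the caller; the example passes `Z39.88-2004`, the designation of the NISO OpenURL standard, which is an assumption here because the slide shows only `Z`.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def datastream_ref(resolver_base_url, url_ver, datastream_id, arcfile_id):
    """Build the OpenURL that locates one datastream inside one ARC file."""
    query = urlencode({
        "url_ver": url_ver,          # OpenURL version string
        "rft_id": datastream_id,     # Referent Identifier: the datastream
        "res_id": arcfile_id,        # Resolver Identifier: the ARC file
    })
    return resolver_base_url + "?" + query

# Hypothetical values, for illustration only.
ref = datastream_ref(
    "http://example.org/arcresolver",
    "Z39.88-2004",                   # assumed value; the slide shows only "Z"
    "info:example/ds/1",
    "info:example/arc/demo",
)

# Decoding the query again recovers the three keys unchanged.
decoded = parse_qs(urlsplit(ref).query)
print(ref)
```

`urlencode` percent-encodes the colons and slashes in the identifiers, so the resulting value can be placed in an XML attribute such as `ref` after the usual escaping of `&`.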

Associating an XMLtape with ARC Files (1)
Extract from DIDL:
……
info:lanl-repo/ds/ba0797d d0-90e8-f5397e74892b
<didl:Resource mimeType="application/pdf" ref=" url_ver=Z res_id=info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2 rft_id=info:lanl-repo/ds/ba0797d d0-90e8-f5397e74892b"/>
……

Associating an XMLtape with ARC Files (2)
An XMLtape is associated with its corresponding ARCfiles through a plug-in for the XMLtape-level administrative section.

Associating an XMLtape with ARC Files (2)
XMLtape header:
info:lanl-repo/xmltape/singlescitape
info:lanl-repo/arc/singlescitape
gov.lanl.xmltape.SingleTapeWriter
T22:13:39Z
…

[Figure: an agent sends a DIDLDocument-id or content-id to the Identifier Locator and receives a list of (baseURL, DIDLDocument-id) pairs. The XMLtape, with its DIDLDocument-id index and creation datetime index, returns the DIDL document. The ref in the DIDL document is an OpenURL carrying a datastream-id, which the ARC file resolves to the datastream through its datastream-id index.]

aDORe XMLtape/ARCfile environment
[Figure: the aDORe XMLtape/ARCfile environment]

Implementation
XMLtapes:
o Berkeley DB Java Edition
o OCLC OAICat
ARCfiles:
o Heritrix
o OCLC OpenURL software
XMLtape Registry:
o MySQL db
o OCLC OAICat
ARCfile Registry:
o MySQL db
o OCLC OAICat

Performance indicators
System:
o Model: Dell U rack-mount server
o CPU: dual 2.8 GHz Intel Xeon processors
o RAM: 5 GB
o Disks: 10k RPM SCSI disks
XMLtape:
o 1786 MB, DIDL records
o download 100 consecutive DIDL records (787 KB) => 0.18 second
o download static file of same size => 0.09 second
ARCfile:
o 272 MB, 4910 files
o download a sample PDF file (312 KB) => 0.24 second
o download static file of same size => second

Software
ARC files:
o Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project
o NetArchive.dk: a project that plans for the preservation of Denmark's cultural heritage on the Internet for future generations
o Many other tools
XMLtapes:
o Perl tool, XML::Tape (LANL & Ghent University)
Combined aDORe XMLtape/ARCfile environment:
o Java tool (LANL), soon to be released on SourceForge

Conclusion
The file-based approach is inherently simple and reduces dependency on database systems.
The autonomy of the indexes allows the files to be retained over time, while the indexes can be recreated using other techniques as technologies evolve.
The protocol-based nature of the access adds a layer of abstraction, which increases flexibility in light of evolving technologies.
The XMLtape approach is inspired by the ARC file format, but provides several additional attractive features:
o Off-the-shelf XML tools can be used to parse/validate an XMLtape
o All DO metadata can be stored in an XML-based compound object format