OAI-PMH for Resource Harvesting Tutorial, OAI4, October 20th 2005, CERN, Geneva, Switzerland: OAIResource Software

Presentation transcript:

OAI-PMH for Resource Harvesting Tutorial
OAI4, October 20th 2005, CERN, Geneva, Switzerland

OAIResource Software

Michael Nelson
Computer Science Department, Old Dominion University

Herbert Van de Sompel
Digital Library Research & Prototyping Team, Research Library, Los Alamos National Laboratory

This work supported in part by the Library of Congress

OAIResource

"Resource-Aware" OAI-PMH harvester
  - combines metadata and resource harvesting in a single application
  - developed at LANL Research Library
      - written in Java
      - plug-in architecture for extensibility
  - based on (among others):
      - OAICat (OCLC)
      - Heritrix (Internet Archive)
      - aDORe tools (LANL RL)

Two Primary Use Cases

Target repository:
  - supports resource harvesting via a complex object format (e.g., MPEG-21 DIDL, METS)
  - does not support resource harvesting
      - uses only Dublin Core or other metadata formats

Repository Does Not Support COs (complex objects)

[Diagram: OAIResource sends a ListRecords request to the repository and receives a ListRecords response]

1. Write OAI-PMH harvest to XMLtape
2. Per record in XMLtape: examine URL in DC.Identifier or DC.Format
     - GET if pdf, jpg, tif, mp3
     - if html, GET and scrape for pdf, jpg, tif, mp3
3. Write datastream to ARC file
4. Write connection between XMLtape and ARC file to ok.csv
5. Write failure to bad.csv
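
Step 2 above can be sketched roughly as follows. This is an illustrative Python sketch, not the OAIResource code (which is written in Java): the function names, the example.org URLs, and the choice to follow both href and src attributes are assumptions made for the example. It only decides what to fetch; the actual HTTP GET is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# File extensions named on the slide as harvestable resources.
RESOURCE_EXTENSIONS = {"pdf", "jpg", "tif", "mp3"}
# Assumption: pages to scrape are recognised by an html/htm extension.
HTML_EXTENSIONS = {"html", "htm"}


def extension_of(url):
    """Return the lowercase file extension of a URL path, or '' if none."""
    name = urlparse(url).path.rsplit("/", 1)[-1]
    return name.rsplit(".", 1)[-1].lower() if "." in name else ""


def classify(url):
    """Decide what step 2 would do with a URL: 'get', 'scrape', or 'skip'."""
    ext = extension_of(url)
    if ext in RESOURCE_EXTENSIONS:
        return "get"
    if ext in HTML_EXTENSIONS:
        return "scrape"
    return "skip"


class LinkCollector(HTMLParser):
    """Collect href and src attribute values from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def scrape_resources(page_url, html_text):
    """Return absolute URLs of pdf/jpg/tif/mp3 links found on one page."""
    collector = LinkCollector()
    collector.feed(html_text)
    absolute = (urljoin(page_url, link) for link in collector.links)
    return [url for url in absolute if classify(url) == "get"]
```

A record whose URL classifies as "skip", or whose scraped page yields no supported links, would correspond to a row in bad.csv.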

Repository Does Support COs (complex objects)

[Diagram: OAIResource sends a ListRecords request to the repository and receives a ListRecords response]

1. Write OAI-PMH harvest to XMLtape
2. Per record in XMLtape: unpack DIDL. Per Resource in DIDL:
     - extract base64 if datastream by-val
     - HTTP GET if datastream by-ref (ref attribute)
3. Write datastream to ARC file
4. Write connection between XMLtape and ARC file to ok.csv
5. Write failure to bad.csv
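
The by-value / by-reference split in step 2 can be illustrated with a small Python sketch. It assumes the MPEG-21 DIDL namespace urn:mpeg:mpeg21:2002:02-DIDL-NS and the Resource attributes mimeType, ref and encoding; the sample document and the function name unpack_didl are invented for the example, and the real DidlSigProcessor / DidlNoSigProcessor plug-ins may work differently.

```python
import base64
import xml.etree.ElementTree as ET

# Assumed DIDL namespace; the slides do not state which edition was used.
DIDL_NS = "urn:mpeg:mpeg21:2002:02-DIDL-NS"

# Invented sample: one datastream by value (base64), one by reference.
SAMPLE_DIDL = """<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS">
  <didl:Item>
    <didl:Component>
      <didl:Resource mimeType="text/plain" encoding="base64">SGVsbG8gV29ybGQ=</didl:Resource>
    </didl:Component>
    <didl:Component>
      <didl:Resource mimeType="application/pdf" ref="http://example.org/paper.pdf"/>
    </didl:Component>
  </didl:Item>
</didl:DIDL>"""


def unpack_didl(didl_xml):
    """Split DIDL Resource elements into by-value bytes and by-reference URLs."""
    root = ET.fromstring(didl_xml)
    by_value, by_reference = [], []
    for resource in root.iter("{%s}Resource" % DIDL_NS):
        mime = resource.get("mimeType", "")
        ref = resource.get("ref")
        if ref:
            # By-reference: the harvester would issue an HTTP GET on this URL.
            by_reference.append((mime, ref))
        elif resource.get("encoding") == "base64":
            # By-value: the datastream is embedded and only needs decoding.
            by_value.append((mime, base64.b64decode(resource.text or "")))
    return by_value, by_reference
```

Either kind of datastream would then be written to the ARC file in step 3.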

Plug-Ins provided in OAIResource Package

DCFormatProcessor
  - look for pdf, jpg, tif, mp3 URLs in DC.Format (as per eprints.org repositories)
DCIdentifierProcessor
  - look for html files in DC.Identifier; grab those and extract pdf, jpg, tif, mp3 files one level down
DidlSigProcessor
  - process DIDLs with XML signatures
  - see:
DidlNoSigProcessor
  - process DIDLs without XML signatures

OAIResource Demo

[AIHT:~] mln% ssh
Password:
Welcome to Darwin!
[dhcp-248:~] demo% cd oai-resource/
dhcp-248.cs.odu.edu:/Users/demo/oai-resource % cd bin
dhcp-248.cs.odu.edu:/Users/demo/oai-resource/bin % sh ./OAIResource.sh ../etc/examples/env.properties ../etc/examples/libeprints.open.ac.uk/libeprints.open.ac.uk.properties
java version "1.5.0_02"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_02-56)
Java HotSpot(TM) Client VM (build 1.5.0_02-36, mixed mode, sharing)
…
INFO gov.lanl.harvester.ListRecords2Tape -
INFO gov.lanl.harvester.ListRecords2Tape - size=2
INFO gov.lanl.harvester.ListRecords2Tape - Done -- harvested 49 records
INFO gov.lanl.ingest.oaitape.DirIngester - project:EPRINT
INFO gov.lanl.ingest.oaitape.DirIngester - plugin:gov.lanl.ingest.oaitape.simple.DCFormatProcessor
INFO gov.lanl.ingest.oaitape.DirIngester - lastingest:
INFO gov.lanl.ingest.oaitape.DirIngester - xmltapesdir:/Users/demo/libeprints.open.ac.uk/tape/
INFO gov.lanl.ingest.oaitape.DirIngester - start ingest tape:tape_8cfa2100-8a27-4e4c-a8b5-e6ff0ca169bc.xml
INFO gov.lanl.ingest.oaitape.simple.DCFormatProcessor -
Oct 15, :22:14 PM org.archive.io.arc.ARCWriter close
INFO: Closed /Users/demo/libeprints.open.ac.uk/data/ /tape_8cfa2100-8a27-4e4c-a8b5-e6ff0ca169bc/tmp68653e30-37d2-4be a3ad05959f5c.arc, size
INFO gov.lanl.ingest.oaitape.simple.DCFormatProcessor -
Oct 15, :22:16 PM org.archive.io.arc.ARCWriter close
INFO: Closed /Users/demo/libeprints.open.ac.uk/data/ /tape_8cfa2100-8a27-4e4c-a8b5-e6ff0ca169bc/tmp02ca96eb-5a34-4a23-bb02-9f10c59fbccb.arc, size
INFO gov.lanl.ingest.oaitape.simple.DCFormatProcessor -
Oct 15, :22:17 PM org.archive.io.arc.ARCWriter close
INFO: Closed /Users/demo/libeprints.open.ac.uk/data/ /tape_8cfa2100-8a27-4e4c-a8b5-e6ff0ca169bc/tmpd8f66bac-d543-4c16-afd0-950fb37b5e43.arc, size
INFO gov.lanl.ingest.oaitape.simple.DCFormatProcessor -
Oct 15, :22:17 PM org.archive.io.arc.ARCWriter close
INFO: Closed /Users/demo/libeprints.open.ac.uk/data/ /tape_8cfa2100-8a27-4e4c-a8b5-e6ff0ca169bc/tmpb963273a-1c64-4f5a-82ea-a39c48efd93b.arc, size
…
INFO: Closed /Users/demo/libeprints.open.ac.uk/data/ /tape_8cfa2100-8a27-4e4c-a8b5-e6ff0ca169bc/tmpb1a53f32-b6a3-4f a7c2155c7d21.arc, size
WARN gov.lanl.ingest.oaitape.simple.DCFormatProcessor - DCFormatProcessor:null
Oct 15, :24:04 PM org.archive.io.arc.ARCWriter close
INFO: Closed /Users/demo/libeprints.open.ac.uk/data/ /tape_8cfa2100-8a27-4e4c-a8b5-e6ff0ca169bc/tmpf4b3796e-842c-4ff3-a3b5-bcbc13d46999.arc, size 169
WARN gov.lanl.ingest.oaitape.simple.DCFormatProcessor - DCFormatProcessor:null
Oct 15, :24:04 PM org.archive.io.arc.ARCWriter close
INFO: Closed /Users/demo/libeprints.open.ac.uk/data/ /tape_8cfa2100-8a27-4e4c-a8b5-e6ff0ca169bc/tmp1a24a284-d384-4da3-83b5-a33b019ac61a.arc, size 169
Oct 15, :24:04 PM org.archive.io.arc.ARCWriter close
INFO: Closed /Users/demo/libeprints.open.ac.uk/data/ /tape_8cfa2100-8a27-4e4c-a8b5-e6ff0ca169bc/EPRINT_abe7aae7-701d-44af-8be5-dd20edb20ff6.arc, size
INFO gov.lanl.ingest.oaitape.DirIngester - finish tape:tape_8cfa2100-8a27-4e4c-a8b5-e6ff0ca169bc.xml
INFO gov.lanl.ingest.oaitape.DirIngester - finish directory:

OAIResource Output

dhcp-248.cs.odu.edu:/Users/demo/libeprints.open.ac.uk % ls -R
data/ lastingest.txt tape/ lastharvest.txt log/

./data:
/

./data/ :
tape_8cfa2100-8a27-4e4c-a8b5-e6ff0ca169bc/

./data/ /tape_8cfa2100-8a27-4e4c-a8b5-e6ff0ca169bc:
EPRINT_abe7aae7-701d-44af-8be5-dd20edb20ff6.arc bad.csv ok.csv

./log:
/

./log/ :
tape_8cfa2100-8a27-4e4c-a8b5-e6ff0ca169bc/

./log/ /tape_8cfa2100-8a27-4e4c-a8b5-e6ff0ca169bc:
log4joutput.log

./tape:
/

./tape/ :
tape_8cfa2100-8a27-4e4c-a8b5-e6ff0ca169bc.xml
dhcp-248.cs.odu.edu:/Users/demo/libeprints.open.ac.uk %

Slide callouts: metadata (the XMLtape under tape/), resources (the .arc file under data/), mapping (ok.csv and bad.csv), run time messages (log4joutput.log under log/)

aDORe XMLtape

An XML file that concatenates the XML-based representations of multiple digital objects (DOs)

Structure is defined by an XML Schema
  - tape-level administrative section:
      - open-ended content
      - plug-in for processing-related information, indication of related ARCfiles
  - concatenation of records, each of which consists of:
      - record-level administrative section
          - identifier and datestamp of the contained record
          - other record-level administrative information
      - a record (can be from any XML Namespace); DIDL in case of aDORe

An XMLtape is a valid and well-formed XML file
Independent from chosen XML-based Compound Object Format
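
The record structure described above (an administrative section with identifier and datestamp, followed by a record from any namespace) could be read along these lines. The slides do not show the XMLtape schema, so the namespace URI and every element name below (tapeAdmin, tapeRecord, recordAdmin, identifier, datestamp, record) are hypothetical stand-ins, as are the sample identifiers.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace and element names; the real XMLtape schema may differ.
NS = {"ta": "http://example.org/hypothetical-tape"}

SAMPLE_TAPE = """<ta:tape xmlns:ta="http://example.org/hypothetical-tape">
  <ta:tapeAdmin>related ARC file: EXAMPLE_0001.arc</ta:tapeAdmin>
  <ta:tapeRecord>
    <ta:recordAdmin>
      <ta:identifier>oai:example.org:1</ta:identifier>
      <ta:datestamp>2005-10-15T00:00:00Z</ta:datestamp>
    </ta:recordAdmin>
    <ta:record><payload xmlns="http://example.org/any">first</payload></ta:record>
  </ta:tapeRecord>
  <ta:tapeRecord>
    <ta:recordAdmin>
      <ta:identifier>oai:example.org:2</ta:identifier>
      <ta:datestamp>2005-10-16T00:00:00Z</ta:datestamp>
    </ta:recordAdmin>
    <ta:record><payload xmlns="http://example.org/any">second</payload></ta:record>
  </ta:tapeRecord>
</ta:tape>"""


def read_tape(tape_xml):
    """Yield (identifier, datestamp, payload element) for each tape record."""
    root = ET.fromstring(tape_xml)
    for tape_record in root.findall("ta:tapeRecord", NS):
        identifier = tape_record.findtext("ta:recordAdmin/ta:identifier", namespaces=NS)
        datestamp = tape_record.findtext("ta:recordAdmin/ta:datestamp", namespaces=NS)
        record = tape_record.find("ta:record", NS)
        # The payload may come from any XML namespace (DIDL in the aDORe case).
        payload = record[0] if record is not None and len(record) else None
        yield identifier, datestamp, payload
```

Because the payload is handed back as an opaque element, the reader stays independent of the compound object format, which mirrors the last point on the slide.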

aDORe XMLtape

[Diagram: an XMLtape as a sequence of XMLtape records, each carrying an OAI-PMH identifier and an OAI-PMH datestamp]

OAI-PMH for Resource Harvesting Tutorial OAI4, October 20 th 2005, CERN, Geneva, Switzerland aDORe XMLtape   <ta:tape xmlns:ta="  ...   oai:open.ac.uk.OAI2:2  Z  ...  ...  aDORe ta:tape

Internet Archive ARCfile

Concatenation of binary files
Designed and used by the Internet Archive (Wayback Machine)
  - more than 400 TB of web data
Under revision by the International Internet Preservation Consortium (IIPC): WARC file format
  - input from LANL to facilitate non-Web-crawling use case
The ARC file format is structured as follows:
  - file header that provides administrative information about the ARC file itself
  - a sequence of document records, each consisting of:
      - a header line containing some, mainly crawl-related, metadata:
          - URI of the crawled document
          - timestamp of acquisition of the data
          - size of the data block
      - a response to a protocol request such as an HTTP GET
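
The header-line-plus-data-block layout can be made concrete with a small reader. This is a simplified Python sketch of an uncompressed ARC-style file, assuming the five space-separated header fields shown in these slides (URL, IP-address, Archive-date, Content-type, Archive-length) and a blank line between records. It ignores per-record compression and other details of real ARC files, and all sample values are invented.

```python
import io
from collections import namedtuple

ArcRecord = namedtuple(
    "ArcRecord", "url ip_address archive_date content_type length data"
)


def read_arc_records(stream):
    """Yield ArcRecord tuples from a binary stream of ARC-style records."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator line between records
        url, ip_address, archive_date, content_type, length = (
            line.decode("utf-8").split()
        )
        # Archive-length gives the exact size of the data block that follows.
        data = stream.read(int(length))
        yield ArcRecord(url, ip_address, archive_date, content_type, int(length), data)


def make_record(url, ip_address, archive_date, content_type, data):
    """Serialize one record: header line, data block, separator newline."""
    header = "%s %s %s %s %d\n" % (url, ip_address, archive_date, content_type, len(data))
    return header.encode("utf-8") + data + b"\n"


# Invented sample: a file description record followed by one document record.
SAMPLE_ARC = make_record(
    "filedesc://example.arc", "0.0.0.0", "20051015000000", "text/plain",
    b"1 0 Example\nURL IP-address Archive-date Content-type Archive-length\n",
) + make_record(
    "http://example.org/hello.html", "192.0.2.1", "20051015000001", "text/html",
    b"Hello World!!!",
)
```

Reading by declared length rather than by delimiter is what lets the data block hold arbitrary binary content, including newlines.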

Internet Archive ARC file

[Diagram: an ARC file as a sequence of ARC datastreams, each preceded by a header line of the form: URL IP-address Archive-date Content-type Archive-length]

Internet Archive ARC file

Sample ARC file:

filedesc://IA arc text/plain
Alexa Internet
URL IP-address Archive-date Content-type Archive-length

text/html 202
HTTP/ Document follows
Date: Mon, 04 Nov :21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug :33:11 GMT
Content-length: 30

Hello World!!!

Internet Archive ARC file in OAIResource

Sample OAIResource ARC file:

filedesc://singletape.arc text/plain
Internet Archive
URL IP-address Archive-date Content-type Archive-length

info:lanl-repo/ds/39c2fa93-fa22-4c19-90af-b5f58b9b989a application/pdf
%PDF-1.3
%âãÏÓ
290 0 obj
<<
/Linearized 1
/O 295
/H [ ]
/L
…

OAIResource XMLtape/ARCfile storage

[Diagram: XMLtape records linked to ARC datastreams through ok.csv and bad.csv]

Needs post-processing for ingest into repository, for creation of an application, etc.

Mapping Between XMLtape & ARCfiles

% more bad.csv
# "tape_record_id, arc_date, message"
oai:open.ac.uk.OAI2:53, ,Unable to locate supported file.
oai:open.ac.uk.OAI2:5, ,
oai:open.ac.uk.OAI2:70, ,

% more ok.csv
# "tape_record_id, arc_id, arc_date, ref, derefXPath, sourceURI, digest, localIdentifier"
oai:open.ac.uk.OAI2:2, EPRINT_abe7aae7-701d-44af-8be5-dd20edb20ff6, , ,, //dc:format/*, urn:sha1:KwJ0KYN27AO92aY3JZxsxqaH/CY=, info:lanl-repo/ds/ffe52b26-58c0-4e6d-b4d5-ebb7a9fe82d1
…
oai:open.ac.uk.OAI2:23,EPRINT_abe7aae7-701d-44af-8be5-dd20edb20ff6, ,,//dc:format/*, ha1:YG+ja4cmX7wmTfgrGXPDI7eE8bY=,info:lanl-repo/ds/6e dcef-4c65-a669-ef023b4522fe
oai:open.ac.uk.OAI2:36,EPRINT_abe7aae7-701d-44af-8be5-dd20edb20ff6, ,,//dc:format/*, l5S2qHum4KB+febcUdnNJgTbfXc=,info:lanl-repo/ds/13f3d72b-40d f07-ac99f
oai:open.ac.uk.OAI2:39,EPRINT_abe7aae7-701d-44af-8be5-dd20edb20ff6, ,,//dc:format/*, B fxF8kkwZAqKu8mFvg8I=,info:lanl-repo/ds/8668c560-79d b2e4-e
…
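
As an example of the post-processing such a mapping enables, the sketch below indexes ok.csv rows by tape record identifier, so that each XMLtape record can be joined to its datastreams in the ARC file. The column names come from the comment line of ok.csv shown above; the sample rows, the function name load_ok_mapping, and the assumption that rows follow the header order exactly are illustrative only.

```python
import csv
import io

# Column names taken from the comment line at the top of ok.csv.
OK_COLUMNS = [
    "tape_record_id", "arc_id", "arc_date", "ref",
    "derefXPath", "sourceURI", "digest", "localIdentifier",
]

# Invented rows that follow the header order; real rows may differ in detail.
SAMPLE_OK = """# "tape_record_id, arc_id, arc_date, ref, derefXPath, sourceURI, digest, localIdentifier"
oai:example.org:1, EXAMPLE_0001, 20051015000000, , //dc:format/*, http://example.org/1.pdf, urn:sha1:AAAA, info:example/ds/1
oai:example.org:1, EXAMPLE_0001, 20051015000001, , //dc:format/*, http://example.org/1.jpg, urn:sha1:BBBB, info:example/ds/2
oai:example.org:2, EXAMPLE_0001, 20051015000002, , //dc:format/*, http://example.org/2.pdf, urn:sha1:CCCC, info:example/ds/3
"""


def load_ok_mapping(text):
    """Index ok.csv rows by tape_record_id; one record may have several datastreams."""
    mapping = {}
    # Skip blank lines and the '#' comment line that carries the column names.
    lines = (
        line for line in io.StringIO(text)
        if line.strip() and not line.startswith("#")
    )
    for row in csv.reader(lines, skipinitialspace=True):
        entry = dict(zip(OK_COLUMNS, (field.strip() for field in row)))
        mapping.setdefault(entry["tape_record_id"], []).append(entry)
    return mapping
```

For a given OAI-PMH identifier, the arc_id and localIdentifier fields of each entry then say which ARC file to open and which datastream to look for.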

Papers

Jeroen Bekaert and Herbert Van de Sompel. A Standards-based Solution for the Accurate Transfer of Digital Assets. D-Lib Magazine, June.

Xiaoming Liu, Luda Balakireva, Patrick Hochstenbach and Herbert Van de Sompel. File-based storage of Digital Objects and constituent datastreams: XMLtapes and Internet Archive ARC files. ECDL 2005 paper in Lecture Notes in Computer Science; preprint also available.