Metadata Harvesting - OAI-PMH

Slides:

Advertisements

Similar presentations

A centre of expertise in digital information management The OAI Protocol for Metadata Harvesting Andy Powell UKOLN,

Advertisements

A busy persons introduction to OAI-PMH Christopher Gutteridge ALT, April 2003.

A brief overview of the Open Archives Initiative Steve Hitchcock Open Citation Project (OpCit) Southampton University Prepared for Z39.50/OAI/OpenURL plenary.

A centre of expertise in digital information management IMS Digital Repositories Interoperability Andy Powell UKOLN,

Y.T. a brief history of the OAI 0 Kaynak: Herbert van de Sompel.

OAI in DigiTool DigiTool Version 3.0.

OAI-PMH Dawn Petherick, University Web Services Team Manager, Information Services, University of Birmingham MIDESS Dissemination.

Infrastructures for Using Metadata RSS and OAI-PMH CS 431 – March 14, 2005 Carl Lagoze – Cornell University.

National Science Digital Library (NSDL) Core Infrastructure Metadata Repository (“union catalog”) Naomi Dushay Cornell University.

The Open Archives Initiative Simeon Warner (Cornell University) Symposium on “Scholarly Publishing and Archiving on the Web”, University.

OAI Standards for Sheet Music Meeting March 28-29, 2002 Basic OAI Principals How They Apply to Sheet Music Presenter: Curtis Fornadley, Senior Programmer/Analyst.

The Open Archives Initiative Simeon Warner (Cornell University) Open Archives seminar “Facilitating Free and Efficient Scientific.

OAI-PMH at Yale Report on the DLF OAI Training Session November 10, 2005 Charlottesville, VA.

The Open Archives Initiative Simeon Warner Cornell University, Ithaca, NY, USA CREPUQ 2002, Montréal, Canada 14:00, 24 October 2002.

Thomas G. Habing – University of Illinois at Urbana-Champaign Recap: SIGIR 2001 OAI Workshop 19 September OAI Provider Workshop, University of.

Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, Digital Library Research Laboratory Virginia Tech.

Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop.

Open Archives Iniative – Protocol for Metadata Harvesting Iztok Kavkler, University of Ljubljana Some slides by Stefaan Ternier, KUL Bram Vandenputte,

Metadata Harvesting Interoperable digital collections.

Herbert van de sompel Workshop on OAI and peer review journals in Europe Geneva, Switserland – March 22nd to 24th 2001 Herbert Van de Sompel Cornell University.

OAI-PMH The Open Archives Initiative Protocol for Metadata Harvesting Presenter: Knud Möller Friday,

1 OAI-PMH harvester for agricultural knowledge gathering (Development, testing and implementation) Francesco Castellani and Stefka Kaloyanova 4 February.

The OAI: overview and historical context OAI Open Meeting – Washington DC – January 23 rd 2001 Herbert Van de Sompel & Carl Lagoze Cornell University --

OAI-PMH: Open Archives Initiative Protocol for Metadata Harvesting T.B. Rajashekar National Centre for Science Information (NCSI) Indian Institute of Science,

DNER Architecture Andy Powell 6 March 2001 UKOLN, University of Bath UKOLN is funded by Resource: The Council for.

The OAI Protocol for Metadata Harvesting Van de Sompel, Herbert Los Alamos National Laboratory – Research Library.

Metadata harvesting in regional digital libraries in PIONIER Network Cezary Mazurek, Maciej Stroiński, Marcin Werla, Jan Węglarz.

Digital Library Interoperability Architecture CS 502 – Carl Lagoze – Cornell University.

Kurt Maly Department of Computer Science Old Dominion University Norfolk, Virginia 23529, USA Digital Libraries, OAI and Free Software.

The OAI: overview and historical context OAI Open Meeting – Washington DC – January 23 rd 2001 Herbert Van de Sompel & Carl Lagoze Cornell University --

Repositories COMP3016 Public, managed, web collections of knowledge.

Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) Phil Barker, March © Heriot-Watt University. You may reproduce all or any part.

Open Archive Initiative – Protocol for metadata Harvesting (OAI-PMH) Surinder Kumar Technical Director NIC, New Delhi

Caltech CODA CODA: Collection of Digital Archives Caltech Scholarly Communication.

Slavic Digital Text Workshop 2006 The Open Archives Initiative Protocol for Metadata Harvesting: an Opportunity for Sharing Content in a Distributed Environment.

1 GRID Based Federated Digital Library K. Maly, M. Zubair, V. Chilukamarri, and P. Kothari Department of Computer Science Old Dominion University February,

OAI Overview DLESE OAI Workshop April 29-30, 2002 John Weatherley

Bitter Harvest Metadata Harvesting Issues, Problems, and Possible Solutions Roy Tennant California Digital Library.

Search Interoperability, OAI, and Metadata Sarah Shreeves University of Illinois at Urbana-Champaign Basics and Beyond Grainger Engineering Library April.

The OAI: technical overview OAI Open Meeting – Washington DC – January 23 rd 2001 Herbert Van de Sompel & Carl Lagoze Cornell University -- Computer Science.

The Open Archives Initiative Marshall Breeding Director for Innovative Technologies and Research Vanderbilt University

Open Archives Initiative Protocol for Metadata Harvesting.

Metadata Harvesting Interoperable digital collections.

Sharing Digital Scores: Will the Open Archives Initiative Protocol for Metadata Harvesting Provide the Key? Constance Mayer, Harvard University Peter Munstedt,

2/22/2016J Ammerman1 Open Archives Initiative What is it? What’s it good for?

NSDL & the Open Archives Initiative A Brief Introduction to OAI Timothy W. Cole Mathematics Librarian & Professor of Library Administration.

1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems 1.

Describing resources II: Dublin Core CERN-UNESCO School on Digital Libraries Rabat, Nov 22-26, 2010 Annette Holtkamp CERN.

The NSDL, OAI and Your Metadata Core Infrastructure Metadata Repository (“union catalog”) Naomi Dushay Cornell University.

Mod_oai: Metadata Harvesting for Everyone Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Aravind Elango

OAI and ODL Building Digital Libraries from Components Hussein Suleman Virginia Tech DLRL 12 September 2002.

The Multi-Faceted Use of the OAI-PMH in the LANL Repository Written By: Henry, Xiaoming,Patrick Henry, Xiaoming,Patrick and Herbert. Presented By: Shashi.

Harvesting and Exporting Metadata 714: Metadata Margaret E.I. Kipp -

Getting a Leg Up on OAI for the NSDL

University of Illinois at Urbana-Champaign OAI Alpha Experiences

An Overview of Data-PASS Shared Catalog

Georges Arnaout Chaitanya Krishna

Outline Pursue Interoperability: Digital Libraries

Introduction to Digital Libraries Week 10: Metadata Harvesting

CS431 guest lecture Simeon Warner

OAI and Metadata Harvesting

Attributes and Values Describing Entities.

Digitometric Services for Open Archives Environments

Old Dominion University Department of Computer Science

Open Archive Initiative

JISC Information Environment Service Registry (IESR)

Old Dominion University Department of Computer Science

Attributes and Values Describing Entities.

IVOA Interoperability Meeting - Boston

Presentation transcript:

Metadata Harvesting - OAI-PMH Thanks to Carl Lagoze, Christoper Gutteridge

The Open Archives Initiative (OAI) and the Protocol for Metadata Harvesting (OAI-PMH) Work of lots of people. I have worked especially closely with Carl Lagoze, Herbert Von de Sompel and Michael Nelson.

Open Archives? An open archive is an archive which exports its metadata in a useful way (OAI-PMH) Most also make some or all content available for free

Origins of the OAI “The Open Archives Initiative has been set up to create a forum to discuss and solve matters of interoperability between electronic preprint solutions, as a way to promote their global acceptance. “ (Paul Ginsparg, Rick Luce & Herbert Van de Sompel - 1999)

“The OAI develops and promotes interoperability What is the OAI now? “The OAI develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.” (from OAI mission statement) Technological framework around OAI-PMH protocol Application independent Independent of economic model for content Also … a community and a “brand” Notes: Quote is the first sentence of the OAI mission statement. Should not underestimate the success of the name “Open Archives Initiative”. Contains very good buzz-word “Open”. This name is often mis-understood: “archives” means just repository in this context, it has no implication of preservation beyond the idea that items in the repository are in some sense persistent; “Open” refers to open sharing of metadata, it is not necessary that the full-content be openly accessible for an archive to participate in the OAI.

OAIster Search Service Where does the OAI fit? NSDL Metadata Repo. OAIster Search Service DSpace OAI EPrints DLESE Earth Science Digital Library OAI is based around the notion of a collection of repositories (archives; collections) that ‘want’ to make metadata about their collections available. OAI provides the glue so the we an aggregated virtual collection. Services can then be built on the aggregated collection. CiteSeer Library of Congress arXiv

What is OAI-PMH The Open Archives Initiative Protocol for Metadata Harvesting A way of asking an archive about the stuff it’s got in it. This allows services to provide searches and other functionality across the metadata from many archives. XML over HTTP

There is “A” difference OAI and Open Access There is “A” difference Open Archives Initiative Open Access The OAI is not tied to a particular political agenda - technical focus BUT… the OAI provides functionality that is essential for many Open Access proposals

Data Provider (Repository) Service Provider (Harvester) OAI-PMH PMH -> Protocol for Metadata Harvesting http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm Simple protocol, just 6 verbs Designed to allow harvesting of any XML (meta)data (schema described) For batch-mode not interactive use Data Provider (Repository) Service Provider (Harvester) Protocol requests (GET, POST) XML metadata

What Questions can you ask via OAI-PMH? Identify GetRecord ListIdentifiers ListMetadataFormats ListRecords ListSets

? User OAI for discovery R1 R2 R3 R4 Information islands A number of problems with the notion of many separate repositories: User may not know where they are or how to select appropriate ones 2) User has to deal with peculiarities on interface to each repository 3) Information may be spread over more repositories than it is practical to user directly R4 Information islands

Metadata harvested by service OAI for discovery Service layer R1 R2 Search service User R3 Instead we insert a “service layer”. In this example a search service harvests metadata from all repositories and presents unified “portal” to user. R4 Metadata harvested by service

Global network of resources exposing XML data OAI for XYZ Service layer R1 R2 XYZ service User R3 Perhaps an alerting service. Perhaps even remove the user from the picture and use the same OAI framework to exchange preservation metadata for distributed archiving of open access content. R4 Global network of resources exposing XML data

OAI-PMH Data Model resource item has all available metadata identifier about this sculpture item Notes: The OAI-PMH is based on the “Resource-Item-Record” model. Resource: Details of the resource are outside the scope of the OAI-PMH, resources may be digital objects, physics objects, actors (people, institutions), collections… In this example the resource is an Inuit sculpture which lives in the Montreal museum of fine arts. Item: This is what resides in the repository. Logically contains all available metadata about the resource. In practice the repository (say an eprint archive) may actually contain the resource (eprint) and metadata could be extracted dynamically. In this example, the item might include metadata such as the name of the sculpture, the name of the sculptor, creation date, its size, geographical origin… Records: Particular metadata records disseminated via the OAI-PMH, may be in many overlapping and/or non-overlapping formats. Dublin Core metadata MARC21 branding records record has identifier + metadata format + datestamp

OAI and Metadata Formats Protocol based on the notion that a record can be described in multiple metadata formats Dublin Core is required for “interoperability” Extended to include XML compound object formats: e.g., METS, DIDL http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html

OAI-PMH and HTTP OAI-PMH uses HTTP as transport Error handling Encoding OAI-PMH in GET http://baseURL?verb=<verb>&arg1=<arg1Val>... Example: http://an.oa.org/OAIscript? verb=GetRecord& identifier=oai:arXiv.org:hep-th/9901001& metadataPrefix=oai_dc Error handling all OK at HTTP level? => 200 OK something wrong at OAI-PMH level? => OAI-PMH error (e.g. badVerb) HTTP codes 302 (redirect), 503 (retry-after), etc. still available to implementers, but do not represent OAI-PMH events

OAI-PMH verbs Function Verb listing of a single record GetRecord listing of N records ListRecords OAI unique ids contained in archive ListIdentifiers sets defined by archive ListSets metadata formats supported by archive ListMetadataFormats description of archive Identify metadata about the repository harvesting verbs most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)

What Questions can you ask via OAI-PMH? Identify GetRecord ListIdentifiers ListMetadataFormats ListRecords ListSets

Identify Who are you? What kind of stuff do you contain? What is the copyright of your data and your metadata? “A collection-level description”

GetRecord Give me the metadata of a single record!

ListRecords Give me the metadata of all your records! May be limited by the date a record was last modified May be limited to a subset of the archive (e.g. only physics related records, but only if supported by archive)

ListIdentifiers Give me a list of all your records! May be limited by date record was last modified May be limited to a subset of the archive (e.g. only physics related records, but only if supported by archive)

ListMetadataFormats What metadata formats can you supply? All archives must supply Dublin Core but may supply other metadata formats too.

ListSets What subsets of your records may I ask for? Some archives define subsets, by subject, by rights etc. e.g. Physics related records, or public domain items or peer-reviewed items.

So how does a service query an archive? The first time it asks for ALL records. Then, every so often (day, week…) it asks for everything that’s changed since it last asked.

from physics aggregator Harvester #1 (Psychology Service) 500 Cogprints 169 D-Space CogPrints (GNU EPrints) 1600 Records Harvester #3 (General Service) 230,000 arXiv 769 D-Space 264 OrgPrints 1600 CogPrints 150,162 “Improved” records from physics aggregator www.orgprints.org (GNU EPrints) 264 Records arXiv (custom software) 230,000 Records Harvester #2 (Physics Aggregator) 150,000 arXiv 162 D-Space D-Space @ MIT (D-Space Software) 769 Records

Day 1 Harvester Give me everything! Archive Service A 1403 records OK! (1403 records)

Day 2 Give me all records which were added or changed since yesterday Harvester Archive Service A 1501 records 1403 records 1501 records OK! (102 new records, 4 deleted records, 23 changed records) 15 records Archive Service B 123 records Give me everything in set “physics” OK! (15 records)

Day 3 OK! (0 new records, 1 record changed) Give me all records which were added or changed since yesterday Harvester Archive Service A 1490 records 1490 records 1501 records Give me everything in set “physics” which were added or changed since yesterday. 15 records Archive Service B 123 records OK! (25 new records, 36 deleted records, 3 changed records)

What are these records? Dublin Core Title Creator Date Description Identifier (URL) … Very simple, but more useful than plain text.

Dublin Core in OAI Dublin Core required! You must provide Dublin Core data via OAI, so that all harvesters can use your data. You may also provide any other metadata formats you want to (MARC, AMF, one you-made-up etc.)

http://www.openarchives.org/tools/tools.html

Information about the repository, start any harvest with Identify Identify verb Information about the repository, start any harvest with Identify <Identify> <repositoryName>Library of Congress 1</repositoryName> <baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL> <protocolVersion>2.0</protocolVersion> <adminEmail>r.e.gillian@larc.nasa.gov</adminEmail> <adminEmail>rgillian@visi.net</adminEmail> <deletedRecord>transient</deletedRecord> <earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp> <granularity>YYYY-MM-DDThh:mm:ssZ</granularity> <compression>deflate</compression>

GetRecord - Normal response <?xml version="1.0" encoding="UTF-8"?> <OAI-PMH> ….namespace info not shown here <responseDate>2002-0208T08:55:46Z</responseDate> <request verb=“GetRecord”… …>http://arXiv.org/oai2</request> <GetRecord> <record> <header> <identifier>oai:arXiv:cs/0112017</identifier> <datestamp>2001-12-14</datestamp> <setSpec>cs</setSpec> <setSpec>math</setSpec> </header> <metadata> ….. </metadata> </record> </GetRecord> </OAI-PMH> note no HTTP encoding of the OAI-PMH request

Error/exception response <?xml version="1.0" encoding="UTF-8"?> <OAI-PMH> <responseDate>2002-0208T08:55:46Z</responseDate> <request>http://arXiv.org/oai2</request> <error code=“badVerb”>ShowMe is not a valid OAI-PMH verb</error> </OAI-PMH> with errors, only the correct attributes are echoed in <request> Same schema for all responses, including error responses.

Identifiers Items have identifiers (all records of same item share identifier) Identifiers must have URI syntax Unless you can recognize a global URI scheme, identifiers must be assumed to be local to the repository Complete identification of a record is baseURL+identifier+metadataPrefix+datestamp <provenance> container may be used to express harvesting/transformation history

RSS is mainly a “tail” format OAI-PMH is more “grep” like Selective Harvesting RSS is mainly a “tail” format OAI-PMH is more “grep” like Two “selectors” for harvesting Date Set Why not general search? Out of scope Not low-barrier Difficulty in achieving consensus

Datestamps All dates/times are UTC, encoded in ISO8601, Z notation: 1957-03-20T20:30:00Z Datestamps may be either fill date/time as above or date only (YYYY-MM-DD). Must be consistent over whole repository, ‘granularity’ specified in Identify response. Earlier version of the protocol specified “local time” which caused lots of misunderstandings. Not good for global interoperability!

Harvesting granularity mandatory support of YYYY-MM-DD optional support of YYYY-MM-DDThh:mm:ssZ (must look at Identify response) granularity of from and until agrument in ListIdentifier/ListRecords must match

Sets Simple notion of grouping at the item level to support selective harvesting Hierarchical set structure Multiple set membership permitted E.g: repo has sets A, A:B, A:B:C, D, D:E, D:F If item1 is in A:B then it is in A If item2 is in D:E then it is in D, may also be in D:F Item3 may be in no sets at all

header contains set membership of item Record headers header contains set membership of item <record> <header> <identifier>oai:arXiv:cs/0112017</identifier> <datestamp>2001-12-14</datestamp> <setSpec>cs</setSpec> <setSpec>math</setSpec> </header> <metadata> ….. </metadata> </record> eliminates the need for the “double harvest” 1.x required to get all records and all set information

resumptionToken Protocol supports the notion of partial responses in a very simple way: Response includes a ‘token’ at the which is used to get the next chunk. Idempotency of resumptionToken: return same incomplete list when resumptionToken is reissued while no changes occur in the repo: strict while changes occur in the repo: all items with unchanged datestamp optional attributes for the resumptionToken: expirationDate, completeListSize, cursor

Harvesting strategy Issue Identify request Check all as expected (validate, version, baseURL, granularity, comporession…) Check sets/metadata formats as necessary (ListSets, ListMetadataFormats) Do harvest, initial complete harvest done with no from and to parameters Subsequent incremental harvests start from datastamp that is responseDate of last response