CS431 guest lecture Simeon Warner

Slides:



Advertisements
Similar presentations
A centre of expertise in digital information management The OAI Protocol for Metadata Harvesting Andy Powell UKOLN,
Advertisements

A brief overview of the Open Archives Initiative Steve Hitchcock Open Citation Project (OpCit) Southampton University Prepared for Z39.50/OAI/OpenURL plenary.
Y.T. a brief history of the OAI 0 Kaynak: Herbert van de Sompel.
OAI in DigiTool DigiTool Version 3.0.
Simeon Warner - 31 May 2003 Semantics and use of responseDate.
OAI-PMH Dawn Petherick, University Web Services Team Manager, Information Services, University of Birmingham MIDESS Dissemination.
Infrastructures for Using Metadata RSS and OAI-PMH CS 431 – March 14, 2005 Carl Lagoze – Cornell University.
National Science Digital Library (NSDL) Core Infrastructure Metadata Repository (“union catalog”) Naomi Dushay Cornell University.
The Open Archives Initiative Simeon Warner (Cornell University) Symposium on “Scholarly Publishing and Archiving on the Web”, University.
OAI Standards for Sheet Music Meeting March 28-29, 2002 Basic OAI Principals How They Apply to Sheet Music Presenter: Curtis Fornadley, Senior Programmer/Analyst.
The Open Archives Initiative Simeon Warner (Cornell University) Open Archives seminar “Facilitating Free and Efficient Scientific.
OAI-PMH at Yale Report on the DLF OAI Training Session November 10, 2005 Charlottesville, VA.
The Open Archives Initiative Simeon Warner Cornell University, Ithaca, NY, USA CREPUQ 2002, Montréal, Canada 14:00, 24 October 2002.
Basic Concepts Architecture Topology Protocols Basic Concepts Open e-Print Archive Open Archive -- generalization of e-print Data Provider and Service.
Thomas G. Habing – University of Illinois at Urbana-Champaign Recap: SIGIR 2001 OAI Workshop 19 September OAI Provider Workshop, University of.
Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, Digital Library Research Laboratory Virginia Tech.
Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop.
Metadata Harvesting Interoperable digital collections.
Open Archives Iniative – Protocol for Metadata Harvesting Iztok Kavkler, University of Ljubljana Some slides by Stefaan Ternier, KUL Bram Vandenputte,
Metadata Harvesting Interoperable digital collections.
Herbert van de sompel Workshop on OAI and peer review journals in Europe Geneva, Switserland – March 22nd to 24th 2001 Herbert Van de Sompel Cornell University.
OAI-PMH The Open Archives Initiative Protocol for Metadata Harvesting Presenter: Knud Möller Friday,
The OAI: overview and historical context OAI Open Meeting – Washington DC – January 23 rd 2001 Herbert Van de Sompel & Carl Lagoze Cornell University --
OAI-PMH: Open Archives Initiative Protocol for Metadata Harvesting T.B. Rajashekar National Centre for Science Information (NCSI) Indian Institute of Science,
The Open Archives Initiative Protocol for Metadata Harvesting: Overview Jewel Ward Visiting Scholar, Keio University Lib-Sys Seminar, Keio University,
The OAI Protocol for Metadata Harvesting Van de Sompel, Herbert Los Alamos National Laboratory – Research Library.
Metadata harvesting in regional digital libraries in PIONIER Network Cezary Mazurek, Maciej Stroiński, Marcin Werla, Jan Węglarz.
Digital Library Interoperability Architecture CS 502 – Carl Lagoze – Cornell University.
Kurt Maly Department of Computer Science Old Dominion University Norfolk, Virginia 23529, USA Digital Libraries, OAI and Free Software.
The OAI: overview and historical context OAI Open Meeting – Washington DC – January 23 rd 2001 Herbert Van de Sompel & Carl Lagoze Cornell University --
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) Phil Barker, March © Heriot-Watt University. You may reproduce all or any part.
Open Archive Initiative – Protocol for metadata Harvesting (OAI-PMH) Surinder Kumar Technical Director NIC, New Delhi
Slavic Digital Text Workshop 2006 The Open Archives Initiative Protocol for Metadata Harvesting: an Opportunity for Sharing Content in a Distributed Environment.
1 GRID Based Federated Digital Library K. Maly, M. Zubair, V. Chilukamarri, and P. Kothari Department of Computer Science Old Dominion University February,
OAI Overview DLESE OAI Workshop April 29-30, 2002 John Weatherley
Search Interoperability, OAI, and Metadata Sarah Shreeves University of Illinois at Urbana-Champaign Basics and Beyond Grainger Engineering Library April.
The OAI: technical overview OAI Open Meeting – Washington DC – January 23 rd 2001 Herbert Van de Sompel & Carl Lagoze Cornell University -- Computer Science.
The Open Archives Initiative Marshall Breeding Director for Innovative Technologies and Research Vanderbilt University
Open Archives Initiative Protocol for Metadata Harvesting.
Metadata Harvesting Interoperable digital collections.
Sharing Digital Scores: Will the Open Archives Initiative Protocol for Metadata Harvesting Provide the Key? Constance Mayer, Harvard University Peter Munstedt,
2/22/2016J Ammerman1 Open Archives Initiative What is it? What’s it good for?
NSDL & the Open Archives Initiative A Brief Introduction to OAI Timothy W. Cole Mathematics Librarian & Professor of Library Administration.
Introduction to the OAI Protocol for Metadata Harvesting Version 2.0 Hussein Suleman Virginia Tech DLRL 25 March 2002.
1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems 1.
The NSDL, OAI and Your Metadata Core Infrastructure Metadata Repository (“union catalog”) Naomi Dushay Cornell University.
1 herbert van de sompel CS 502 Computing Methods for Digital Libraries Cornell University – Computer Science Herbert Van de Sompel
OAI and ODL Building Digital Libraries from Components Hussein Suleman Virginia Tech DLRL 12 September 2002.
Harvesting and Exporting Metadata 714: Metadata Margaret E.I. Kipp -
Metadata Harvesting - OAI-PMH
Getting a Leg Up on OAI for the NSDL
University of Illinois at Urbana-Champaign OAI Alpha Experiences
WEB SERVICES From Chapter 19 of Distributed Systems Concepts and Design,4th Edition, By G. Coulouris, J. Dollimore and T. Kindberg Published by Addison.
An Overview of Data-PASS Shared Catalog
Systems for scholarly communication
Georges Arnaout Chaitanya Krishna
OAI protocol beyond discovery metadata
Outline Pursue Interoperability: Digital Libraries
Introduction to Digital Libraries Week 10: Metadata Harvesting
OAI and Metadata Harvesting
The Open Archives Initiative Story
Digitometric Services for Open Archives Environments
OAI 11/20/07.
Old Dominion University Department of Computer Science
Open Archive Initiative
JISC Information Environment Service Registry (IESR)
Institutional Repositories
Old Dominion University Department of Computer Science
WEB SERVICES From Chapter 19, Distributed Systems
IVOA Interoperability Meeting - Boston
Presentation transcript:

CS431 guest lecture Simeon Warner The Open Archives Initiative (OAI) and the Protocol for Metadata Harvesting (OAI-PMH) CS431 guest lecture Simeon Warner Work of lots of people. I have worked especially closely with Carl Lagoze, Herbert Von de Sompel and Michael Nelson. 3 March 2004

Origins of the OAI “The Open Archives Initiative has been set up to create a forum to discuss and solve matters of interoperability between electronic preprint solutions, as a way to promote their global acceptance. “ (Paul Ginsparg, Rick Luce & Herbert Van de Sompel - 1999)

What is the OAI now? “The OAI develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.” (from OAI mission statement) Technological framework around OAI-PMH protocol Application independent Independent of economic model for content Also … a community and a “brand” (and you need it for an assignment due in April) Notes: Quote is the first sentence of the OAI mission statement. Should not underestimate the success of the name “Open Archives Initiative”. Contains very good buzz-word “Open”. This name is often mis-understood: “archives” means just repository in this context, it has no implication of preservation beyond the idea that items in the repository are in some sense persistent; “Open” refers to open sharing of metadata, it is not necessary that the full-content be openly accessible for an archive to participate in the OAI.

DEF Eprints Search Service OAIster Search Service Where does the OAI fit? DEF Eprints Search Service OAIster Search Service DSpace OAI EPrints DTU Institutional Repository OAI is based around the notion of a collection of repositories (archives; collections) that ‘want’ to make metadata about their collections available. OAI provides the glue so the we an aggregated virtual collection. Services can then be built on the aggregated collection. Scholarly Publishing and Archiving on the Web arXiv

OAI and Open Access There is “A” difference Open Archives Initiative Open Access The OAI is not tied to a particular political agenda - technical focus BUT… the OAI provides functionality that is essential for many Open Access proposals

OAI-PMH Data Provider (Repository) Service Provider (Harvester) PMH -> Protocol for Metadata Harvesting http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm Simple protocol, just 6 verbs Designed to allow harvesting of any XML metadata (schema described) For batch-mode not interactive use Data Provider (Repository) Service Provider (Harvester) Protocol requests (GET, POST) XML metadata

? OAI for discovery User R1 R2 R3 R4 Information islands A number of problems with the notion of many separate repositories: User may not know where they are or how to select appropriate ones 2) User has to deal with peculiarities on interface to each repository 3) Information may be spread over more repositories than it is practical to user directly R4 Information islands

Metadata harvested by service OAI for discovery Service layer R1 R2 Search service User R3 Instead we insert a “service layer”. In this example a search service harvests metadata from all repositories and presents unified “portal” to user. R4 Metadata harvested by service

Global network of resources exposing metadata OAI for XYZ Service layer R1 R2 XYZ service User R3 Perhaps an alerting service. Perhaps even remove the user from the picture and use the same OAI framework to exchange preservation metadata for distributed archiving of open access content. R4 Global network of resources exposing metadata

OAI-PMH Data Model resource item has all available metadata identifier about this sculpture item Notes: The OAI-PMH is based on the “Resource-Item-Record” model. Resource: Details of the resource are outside the scope of the OAI-PMH, resources may be digital objects, physics objects, actors (people, institutions), collections… In this example the resource is an Inuit sculpture which lives in the Montreal museum of fine arts. Item: This is what resides in the repository. Logically contains all available metadata about the resource. In practice the repository (say an eprint archive) may actually contain the resource (eprint) and metadata could be extracted dynamically. In this example, the item might include metadata such as the name of the sculpture, the name of the sculptor, creation date, its size, geographical origin… Records: Particular metadata records disseminated via the OAI-PMH, may be in many overlapping and/or non-overlapping formats. Dublin Core metadata MARC21 branding records record has identifier + metadata format + datestamp

OAI-PMH and HTTP Clear separation of OAI-PMH and HTTP: OAI-PMH uses HTTP as transport all OK at HTTP level? => 200 OK something wrong at OAI-PMH level? => OAI-PMH error (e.g. badVerb) HTTP codes 302 (redirect), 503 (retry-after), etc. still available to implementers, but do not represent OAI-PMH events Not REST like

Normal response note no HTTP encoding of the OAI-PMH request <?xml version="1.0" encoding="UTF-8"?> <OAI-PMH> ….namespace info not shown here <responseDate>2002-0208T08:55:46Z</responseDate> <request verb=“GetRecord”… …>http://arXiv.org/oai2</request> <GetRecord> <record> <header> <identifier>oai:arXiv:cs/0112017</identifier> <datestamp>2001-12-14</datestamp> <setSpec>cs</setSpec> <setSpec>math</setSpec> </header> <metadata> ….. </metadata> </record> </GetRecord> </OAI-PMH> note no HTTP encoding of the OAI-PMH request

Error/exception response <?xml version="1.0" encoding="UTF-8"?> <OAI-PMH> <responseDate>2002-0208T08:55:46Z</responseDate> <request>http://arXiv.org/oai2</request> <error code=“badVerb”>ShowMe is not a valid OAI-PMH verb</error> </OAI-PMH> with errors, only the correct attributes are echoed in <request> Same schema for all responses, including error responses.

OAI-PMH verbs Function Verb listing of a single record GetRecord listing of N records ListRecords OAI unique ids contained in archive ListIdentifiers sets defined by archive ListSets metadata formats supported by archive ListMetadataFormats description of archive Identify metadata about the repository harvesting verbs most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)

Information about the repository, start any harvest with Identify Identify verb Information about the repository, start any harvest with Identify <Identify> <repositoryName>Library of Congress 1</repositoryName> <baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL> <protocolVersion>2.0</protocolVersion> <adminEmail>r.e.gillian@larc.nasa.gov</adminEmail> <adminEmail>rgillian@visi.net</adminEmail> <deletedRecord>transient</deletedRecord> <earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp> <granularity>YYYY-MM-DDThh:mm:ssZ</granularity> <compression>deflate</compression>

Identifiers Items have identifiers (all records of same item share identifier) Identifiers must have URI syntax (defined by RFC, a type in XML schema) Unless you can recognize a global URI scheme, identifiers must be assumed to be local to the repository Complete identification of a record is baseURL+identifier+metadataPrefix+datestamp <provenance> container may be used to express harvesting/transformation history

Datestamps All dates/times are UTC, encoded in ISO8601, Z notation: 1957-03-20T20:30:00Z Datestamps may be either fill date/time as above or date only (YYYY-MM-DD). Must be consistent over whole repository, ‘granularity’ specified in Identify response. Earlier version of the protocol specified “local time” which caused lots of misunderstandings. Not good for global interoperability!

Harvesting granularity mandatory support of YYYY-MM-DD optional support of YYYY-MM-DDThh:mm:ssZ (must look at Identify response) granularity of from and until agrument in ListIdentifier/ListRecords must match

Sets Simple notion of grouping at the item level to support selective harvesting Hierarchical set structure Multiple set membership permitted E.g: repo has sets A, A:B, A:B:C, D, D:E, D:F If item1 is in A:B then it is in A If item2 is in D:E then it is in D, may also be in D:F Item3 may be in no sets at all Don’t use sets unless you have a good reason (selective harvesting)

resumptionToken Protocol supports the notion of partial responses in a very simple way: Response includes a ‘token’ at the which is used to get the next chunk. Idempotency of resumptionToken: return same incomplete list when resumptionToken is reissued while no changes occur in the repo: strict while changes occur in the repo: all items with unchanged datestamp optional attributes for the resumptionToken: expirationDate, completeListSize, cursor

Record headers header contains set membership of item <identifier>oai:arXiv:cs/0112017</identifier> <datestamp>2001-12-14</datestamp> <setSpec>cs</setSpec> <setSpec>math</setSpec> </header> <metadata> ….. </metadata> </record> eliminates the need for the “double harvest” 1.x required to get all records and all set information

Deleted records What happens when a record (or item) is deleted from a repository? Would be nice if harvesters could find out. Not necessarily guaranteed in OAI that harvesters will find out. Support made optional because of problems with legacy repositories (practical constraint). Level of support expressed in Identify (no, persistent, transient) Status expressed in header element, <header status=“deleted”>…</header>

Harvesting strategy Issue Identify request Check all as expected (validate, version, baseURL, granularity, comporession…) Check sets/metadata formats as necessary (ListSets, ListMetadataFormats) Do harvest, initial complete harvest done with no from and to parameters Subsequent incremental harvests start from datastamp that is responseDate of last response

Changing Scholarly Communication Traditional journal publishing combines functions: registration, certification, awareness, archiving. How about eprints being the starting point of a new value chain in which the raw material - the non-certified eprint - is open access? Other functions might be fullfilled by different networked parties. This requires a communication infrastructure: OAI-PMH may be part of this. Presentations on OAI and Scholarly Communication at http://www.cs.cornell.edu/people/simeon/talks