Metadata Harvesting - OAI-PMH Thanks to Carl Lagoze, Christoper Gutteridge
The Open Archives Initiative (OAI) and the Protocol for Metadata Harvesting (OAI-PMH) Work of lots of people. I have worked especially closely with Carl Lagoze, Herbert Von de Sompel and Michael Nelson.
Open Archives? An open archive is an archive which exports its metadata in a useful way (OAI-PMH) Most also make some or all content available for free
Origins of the OAI “The Open Archives Initiative has been set up to create a forum to discuss and solve matters of interoperability between electronic preprint solutions, as a way to promote their global acceptance. “ (Paul Ginsparg, Rick Luce & Herbert Van de Sompel - 1999)
“The OAI develops and promotes interoperability What is the OAI now? “The OAI develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.” (from OAI mission statement) Technological framework around OAI-PMH protocol Application independent Independent of economic model for content Also … a community and a “brand” Notes: Quote is the first sentence of the OAI mission statement. Should not underestimate the success of the name “Open Archives Initiative”. Contains very good buzz-word “Open”. This name is often mis-understood: “archives” means just repository in this context, it has no implication of preservation beyond the idea that items in the repository are in some sense persistent; “Open” refers to open sharing of metadata, it is not necessary that the full-content be openly accessible for an archive to participate in the OAI.
OAIster Search Service Where does the OAI fit? NSDL Metadata Repo. OAIster Search Service DSpace OAI EPrints DLESE Earth Science Digital Library OAI is based around the notion of a collection of repositories (archives; collections) that ‘want’ to make metadata about their collections available. OAI provides the glue so the we an aggregated virtual collection. Services can then be built on the aggregated collection. CiteSeer Library of Congress arXiv
What is OAI-PMH The Open Archives Initiative Protocol for Metadata Harvesting A way of asking an archive about the stuff it’s got in it. This allows services to provide searches and other functionality across the metadata from many archives. XML over HTTP
There is “A” difference OAI and Open Access There is “A” difference Open Archives Initiative Open Access The OAI is not tied to a particular political agenda - technical focus BUT… the OAI provides functionality that is essential for many Open Access proposals
Data Provider (Repository) Service Provider (Harvester) OAI-PMH PMH -> Protocol for Metadata Harvesting http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm Simple protocol, just 6 verbs Designed to allow harvesting of any XML (meta)data (schema described) For batch-mode not interactive use Data Provider (Repository) Service Provider (Harvester) Protocol requests (GET, POST) XML metadata
What Questions can you ask via OAI-PMH? Identify GetRecord ListIdentifiers ListMetadataFormats ListRecords ListSets
? User OAI for discovery R1 R2 R3 R4 Information islands A number of problems with the notion of many separate repositories: User may not know where they are or how to select appropriate ones 2) User has to deal with peculiarities on interface to each repository 3) Information may be spread over more repositories than it is practical to user directly R4 Information islands
Metadata harvested by service OAI for discovery Service layer R1 R2 Search service User R3 Instead we insert a “service layer”. In this example a search service harvests metadata from all repositories and presents unified “portal” to user. R4 Metadata harvested by service
Global network of resources exposing XML data OAI for XYZ Service layer R1 R2 XYZ service User R3 Perhaps an alerting service. Perhaps even remove the user from the picture and use the same OAI framework to exchange preservation metadata for distributed archiving of open access content. R4 Global network of resources exposing XML data
OAI-PMH Data Model resource item has all available metadata identifier about this sculpture item Notes: The OAI-PMH is based on the “Resource-Item-Record” model. Resource: Details of the resource are outside the scope of the OAI-PMH, resources may be digital objects, physics objects, actors (people, institutions), collections… In this example the resource is an Inuit sculpture which lives in the Montreal museum of fine arts. Item: This is what resides in the repository. Logically contains all available metadata about the resource. In practice the repository (say an eprint archive) may actually contain the resource (eprint) and metadata could be extracted dynamically. In this example, the item might include metadata such as the name of the sculpture, the name of the sculptor, creation date, its size, geographical origin… Records: Particular metadata records disseminated via the OAI-PMH, may be in many overlapping and/or non-overlapping formats. Dublin Core metadata MARC21 branding records record has identifier + metadata format + datestamp
OAI and Metadata Formats Protocol based on the notion that a record can be described in multiple metadata formats Dublin Core is required for “interoperability” Extended to include XML compound object formats: e.g., METS, DIDL http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html
OAI-PMH and HTTP OAI-PMH uses HTTP as transport Error handling Encoding OAI-PMH in GET http://baseURL?verb=<verb>&arg1=<arg1Val>... Example: http://an.oa.org/OAIscript? verb=GetRecord& identifier=oai:arXiv.org:hep-th/9901001& metadataPrefix=oai_dc Error handling all OK at HTTP level? => 200 OK something wrong at OAI-PMH level? => OAI-PMH error (e.g. badVerb) HTTP codes 302 (redirect), 503 (retry-after), etc. still available to implementers, but do not represent OAI-PMH events
OAI-PMH verbs Function Verb listing of a single record GetRecord listing of N records ListRecords OAI unique ids contained in archive ListIdentifiers sets defined by archive ListSets metadata formats supported by archive ListMetadataFormats description of archive Identify metadata about the repository harvesting verbs most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
What Questions can you ask via OAI-PMH? Identify GetRecord ListIdentifiers ListMetadataFormats ListRecords ListSets
Identify Who are you? What kind of stuff do you contain? What is the copyright of your data and your metadata? “A collection-level description”
GetRecord Give me the metadata of a single record!
ListRecords Give me the metadata of all your records! May be limited by the date a record was last modified May be limited to a subset of the archive (e.g. only physics related records, but only if supported by archive)
ListIdentifiers Give me a list of all your records! May be limited by date record was last modified May be limited to a subset of the archive (e.g. only physics related records, but only if supported by archive)
ListMetadataFormats What metadata formats can you supply? All archives must supply Dublin Core but may supply other metadata formats too.
ListSets What subsets of your records may I ask for? Some archives define subsets, by subject, by rights etc. e.g. Physics related records, or public domain items or peer-reviewed items.
So how does a service query an archive? The first time it asks for ALL records. Then, every so often (day, week…) it asks for everything that’s changed since it last asked.
from physics aggregator Harvester #1 (Psychology Service) 500 Cogprints 169 D-Space CogPrints (GNU EPrints) 1600 Records Harvester #3 (General Service) 230,000 arXiv 769 D-Space 264 OrgPrints 1600 CogPrints 150,162 “Improved” records from physics aggregator www.orgprints.org (GNU EPrints) 264 Records arXiv (custom software) 230,000 Records Harvester #2 (Physics Aggregator) 150,000 arXiv 162 D-Space D-Space @ MIT (D-Space Software) 769 Records
Day 1 Harvester Give me everything! Archive Service A 1403 records OK! (1403 records)
Day 2 Give me all records which were added or changed since yesterday Harvester Archive Service A 1501 records 1403 records 1501 records OK! (102 new records, 4 deleted records, 23 changed records) 15 records Archive Service B 123 records Give me everything in set “physics” OK! (15 records)
Day 3 OK! (0 new records, 1 record changed) Give me all records which were added or changed since yesterday Harvester Archive Service A 1490 records 1490 records 1501 records Give me everything in set “physics” which were added or changed since yesterday. 15 records Archive Service B 123 records OK! (25 new records, 36 deleted records, 3 changed records)
What are these records? Dublin Core Title Creator Date Description Identifier (URL) … Very simple, but more useful than plain text.
Dublin Core in OAI Dublin Core required! You must provide Dublin Core data via OAI, so that all harvesters can use your data. You may also provide any other metadata formats you want to (MARC, AMF, one you-made-up etc.)
http://www.openarchives.org/tools/tools.html
Information about the repository, start any harvest with Identify Identify verb Information about the repository, start any harvest with Identify <Identify> <repositoryName>Library of Congress 1</repositoryName> <baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL> <protocolVersion>2.0</protocolVersion> <adminEmail>r.e.gillian@larc.nasa.gov</adminEmail> <adminEmail>rgillian@visi.net</adminEmail> <deletedRecord>transient</deletedRecord> <earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp> <granularity>YYYY-MM-DDThh:mm:ssZ</granularity> <compression>deflate</compression>
GetRecord - Normal response <?xml version="1.0" encoding="UTF-8"?> <OAI-PMH> ….namespace info not shown here <responseDate>2002-0208T08:55:46Z</responseDate> <request verb=“GetRecord”… …>http://arXiv.org/oai2</request> <GetRecord> <record> <header> <identifier>oai:arXiv:cs/0112017</identifier> <datestamp>2001-12-14</datestamp> <setSpec>cs</setSpec> <setSpec>math</setSpec> </header> <metadata> ….. </metadata> </record> </GetRecord> </OAI-PMH> note no HTTP encoding of the OAI-PMH request
Error/exception response <?xml version="1.0" encoding="UTF-8"?> <OAI-PMH> <responseDate>2002-0208T08:55:46Z</responseDate> <request>http://arXiv.org/oai2</request> <error code=“badVerb”>ShowMe is not a valid OAI-PMH verb</error> </OAI-PMH> with errors, only the correct attributes are echoed in <request> Same schema for all responses, including error responses.
Identifiers Items have identifiers (all records of same item share identifier) Identifiers must have URI syntax Unless you can recognize a global URI scheme, identifiers must be assumed to be local to the repository Complete identification of a record is baseURL+identifier+metadataPrefix+datestamp <provenance> container may be used to express harvesting/transformation history
RSS is mainly a “tail” format OAI-PMH is more “grep” like Selective Harvesting RSS is mainly a “tail” format OAI-PMH is more “grep” like Two “selectors” for harvesting Date Set Why not general search? Out of scope Not low-barrier Difficulty in achieving consensus
Datestamps All dates/times are UTC, encoded in ISO8601, Z notation: 1957-03-20T20:30:00Z Datestamps may be either fill date/time as above or date only (YYYY-MM-DD). Must be consistent over whole repository, ‘granularity’ specified in Identify response. Earlier version of the protocol specified “local time” which caused lots of misunderstandings. Not good for global interoperability!
Harvesting granularity mandatory support of YYYY-MM-DD optional support of YYYY-MM-DDThh:mm:ssZ (must look at Identify response) granularity of from and until agrument in ListIdentifier/ListRecords must match
Sets Simple notion of grouping at the item level to support selective harvesting Hierarchical set structure Multiple set membership permitted E.g: repo has sets A, A:B, A:B:C, D, D:E, D:F If item1 is in A:B then it is in A If item2 is in D:E then it is in D, may also be in D:F Item3 may be in no sets at all
header contains set membership of item Record headers header contains set membership of item <record> <header> <identifier>oai:arXiv:cs/0112017</identifier> <datestamp>2001-12-14</datestamp> <setSpec>cs</setSpec> <setSpec>math</setSpec> </header> <metadata> ….. </metadata> </record> eliminates the need for the “double harvest” 1.x required to get all records and all set information
resumptionToken Protocol supports the notion of partial responses in a very simple way: Response includes a ‘token’ at the which is used to get the next chunk. Idempotency of resumptionToken: return same incomplete list when resumptionToken is reissued while no changes occur in the repo: strict while changes occur in the repo: all items with unchanged datestamp optional attributes for the resumptionToken: expirationDate, completeListSize, cursor
Harvesting strategy Issue Identify request Check all as expected (validate, version, baseURL, granularity, comporession…) Check sets/metadata formats as necessary (ListSets, ListMetadataFormats) Do harvest, initial complete harvest done with no from and to parameters Subsequent incremental harvests start from datastamp that is responseDate of last response