Introduction to Digital Libraries Week 10: The Open Archives Initiative Old Dominion University Department of Computer Science CS 695 Fall 2003 Michael L. Nelson <mln@cs.odu.edu> 10/28/03 several slides borrowed from Van de Sompel, Liu, & Warner
The Rise and Fall of Distributed Searching wholesale distributed searching, popular at the time, is attractive in theory but troublesome in practice Davis & Lagoze, JASIS 51(3), pp. 273-80 Powell & French, Proc 5th ACM DL, pp. 264-265 distributed searching of N nodes still viable, but only for small values of N NCSTRL: N > 100; bad NTRS/NIX: N<=20; ok (but could be better)
The Rise and Fall of Distributed Searching Other problems of distributed searching (from STARTS) source-metadata problem how do you know which nodes to search? query-language problem syntax varies and drifts over time between the various nodes rank-merging problem how do you meaningfully merge multiple result sets? Temptations: centralize all functions “everything will be done at X” standardize on a single product “everyone will use system Y”
Universal Preprint Service Demonstrated at Santa Fe NM, October 21-22, 1999 http://ups.cs.odu.edu/ D-Lib Magazine, 6(2) 2000 (2 articles) http://www.dlib.org/dlib/february00/02contents.html UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/ Based on NCSTRL+ software, it is a cross-archive DL that provides services on a collection of content harvested from multiple archives. NCSTRL+ is a modified version of Dienst: support for “clustering”; support for “buckets”
UPS Participants totals ca. July 1999
project metadata formats archives: the arXiv, CogPrints, NACA, NCSTRL, NDLTD, RePEc; formats in use: internal, Refer, RFC1807, MARC, ReDIF
project metadata extraction Getting metadata out of archives: not all archives support metadata extraction; some archives have undocumented metadata extraction procedures; not all archives support rich criteria for extraction (single dump concept only); intellectual property and use rights not always clear
project metadata quality Metadata has problems with: record duplication; crucial missing fields; internal errors; ambiguous references to people, places, and publications
project re-creation of archives creation of archives for ReDIF-ed metadata using intelligent digital objects: “buckets” (archives shown: RePEc, arXiv, NCSTRL)
project creation of end-user service NCSTRL+ digital library service: indexing buckets in archives by requesting their metadata. enhanced user-interface: NCSTRL+ search results point at buckets; buckets auto-display; buckets provide link to full-text in native archive
Data and Service Providers Data Providers: publishing into an archive; providing methods for metadata “harvesting”; also provide non-technical context for sharing information. Service Providers: harvest metadata from providers; implement user interface to data. Even if provided by the same DL, these are distinct functions
Data and Service Providers [Figure: two provider configurations. First: a provider with an input interface and a native end-user interface only; no machine-based way to extract metadata… Second: a provider with an input interface, a native end-user interface, and a native harvesting interface; machine and user interfaces for extracting metadata…]
Data and Service Providers [Figure: an implementor offers a native end-user interface on top of the native harvesting interfaces of two providers; the implementor’s own input and harvesting interfaces are optional; a provider’s native end-user interface is optional (e.g., RePEc)]
Self-Describing Archives Much of the learning about the constituent UPS archives occurred out of band… Given an unknown archive, we should be able to algorithmically determine the archive’s metadata... Where possible, the harvesting interface should provide the same criteria as the end-user interface [Figure: a provider with an input interface, a native harvesting interface, and a native end-user interface]
Data and Service Providers Recommended criteria for metadata extraction: subject classification accession date publication date Criteria for archive description metadata formats employed contact information for archive publication type scheme identifier scheme subject classification scheme
Result… OAI The OAI was the result of the demonstration and discussion during the Santa Fe meeting Lots of churn regarding what the OAI was OAI harvesting protocol originally a subset of the Dienst (NCSTRL) protocol and originally called the “Santa Fe Convention” originally defined an OAI-specific metadata format
OAI Protocol for Metadata Harvesting OAI metadata format dropped in favor of unqualified Dublin Core other formats possible, but DC is required as lowest common denominator No longer dependent on Dienst defined independently (though still easily map-able)
OAI as a “Dumb Archive” SODA DL model originally used a separate protocol & implementation for the “dumb archive” development ceased in favor of the OAI metadata harvesting protocol OAI divides the world into “service providers” (DLs) and “data providers” (archives) OAI does not require smart objects, but does create a “dumb archive” layer note that OAI does not define an archive implementation, but rather just a standard way of exposing an archive’s contents
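version | Santa Fe convention | OAI-PMH v.1.0/1.1 | OAI-PMH v.2.0
nature | experimental | experimental | stable
verbs | Dienst | OAI-PMH | OAI-PMH
requests | HTTP GET/POST | HTTP GET/POST | HTTP GET/POST
responses | XML | XML | XML
transport | HTTP | HTTP | HTTP
metadata | OAMS | unqualified Dublin Core | unqualified Dublin Core
about | eprints | document like objects | resources
model | metadata harvesting | metadata harvesting | metadata harvesting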
Overview of OAI Verbs
Verb | Function | Group
Identify | description of archive | archival metadata
ListMetadataFormats | metadata formats supported by archive | archival metadata
ListSets | sets defined by archive | archival metadata
ListIdentifiers | OAI unique ids contained in archive | harvesting verbs
ListRecords | listing of N records | harvesting verbs
GetRecord | listing of a single record | harvesting verbs
most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
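Every verb in the overview above is issued as an ordinary HTTP request with the verb and its arguments as query parameters. A minimal sketch of building such requests follows; the base URL and the helper name build_request are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs named on the slide.
OAI_VERBS = ("Identify", "ListMetadataFormats", "ListSets",
             "ListIdentifiers", "ListRecords", "GetRecord")


def build_request(base_url, verb, **arguments):
    """Return the HTTP GET URL for one OAI-PMH request."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    # Drop arguments the caller left unset, then URL-encode the rest.
    params = {"verb": verb}
    params.update({k: v for k, v in arguments.items() if v is not None})
    return f"{base_url}?{urlencode(params)}"


# Hypothetical base URL, used only for illustration.
BASE = "http://archive.example.org/oai"
print(build_request(BASE, "Identify"))
print(build_request(BASE, "GetRecord",
                    identifier="oai:arXiv:cs/0112017",
                    metadataPrefix="oai_dc"))
```

Because from is a Python keyword, a date-bounded request has to pass it through a dictionary, e.g. build_request(BASE, "ListRecords", metadataPrefix="oai_dc", **{"from": "2001-01-01"}).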
Identify 1.1: Arguments: none; Errors: none. 2.0: Arguments: none; Errors: badArgument
ListMetadataFormats 1.1: Arguments: identifier (OPTIONAL); Errors: id does not exist. 2.0: Arguments: identifier (OPTIONAL); Errors: badArgument, noMetadataFormats, idDoesNotExist
ListSets 1.1: Arguments: resumptionToken (EXCLUSIVE); Errors: no set hierarchy. 2.0: Arguments: resumptionToken (EXCLUSIVE); Errors: badArgument, badResumptionToken, noSetHierarchy
ListIdentifiers 1.1: Arguments: from (OPTIONAL), until (OPTIONAL), set (OPTIONAL), resumptionToken (EXCLUSIVE); Errors: no records match. 2.0: Arguments: from (OPTIONAL), until (OPTIONAL), set (OPTIONAL), resumptionToken (EXCLUSIVE), metadataPrefix (REQUIRED); Errors: badArgument, cannotDisseminateFormat, badResumptionToken, noSetHierarchy, noRecordsMatch
ListRecords 1.1: Arguments: from (OPTIONAL), until (OPTIONAL), set (OPTIONAL), resumptionToken (EXCLUSIVE), metadataPrefix (REQUIRED); Errors: no records match, metadata format cannot be disseminated. 2.0: Arguments: from (OPTIONAL), until (OPTIONAL), set (OPTIONAL), resumptionToken (EXCLUSIVE), metadataPrefix (REQUIRED); Errors: noRecordsMatch, cannotDisseminateFormat, badResumptionToken, noSetHierarchy, badArgument
GetRecord 1.1: Arguments: identifier (REQUIRED), metadataPrefix (REQUIRED); Errors: id does not exist, metadata format cannot be disseminated. 2.0: Arguments: identifier (REQUIRED), metadataPrefix (REQUIRED); Errors: badArgument, cannotDisseminateFormat, idDoesNotExist
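Argument Summary (OAI-PMH 2.0)
verb | metadataPrefix | from | until | set | resumptionToken | identifier
Identify | - | - | - | - | - | -
ListMetadataFormats | - | - | - | - | - | optional
ListSets | - | - | - | - | exclusive | -
ListIdentifiers | required | optional | optional | optional | exclusive | -
ListRecords | required | optional | optional | optional | exclusive | -
GetRecord | required | - | - | - | - | required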
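Error Summary (OAI-PMH 2.0)
verb | errors
Identify | badArgument (BA)
ListMetadataFormats | BA, noMetadataFormats (NMF), idDoesNotExist (IDDNE)
ListSets | BA, badResumptionToken (BRT), noSetHierarchy (NSH)
ListIdentifiers | BA, BRT, cannotDisseminateFormat (CDF), noRecordsMatch (NRM), NSH
ListRecords | BA, BRT, CDF, NRM, NSH
GetRecord | BA, CDF, IDDNE
Generate badVerb on any input not matching the 6 defined verbs
this is an inversion of the table in section 3.6 of the OAI-PMH specification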
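The argument and error summaries can be turned into a small request checker. The sketch below is not taken from any OAI implementation; the RULES table is transcribed from the per-verb 2.0 slides and the function name check_request is hypothetical. It covers only badVerb and badArgument, since the other errors depend on what the repository actually holds.

```python
# Argument rules for OAI-PMH 2.0, transcribed from the per-verb slides.
_LIST_RULE = {"required": {"metadataPrefix"},
              "optional": {"from", "until", "set"},
              "exclusive": "resumptionToken"}

RULES = {
    "Identify": {"required": set(), "optional": set(), "exclusive": None},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"},
                            "exclusive": None},
    "ListSets": {"required": set(), "optional": set(),
                 "exclusive": "resumptionToken"},
    "ListIdentifiers": _LIST_RULE,
    "ListRecords": _LIST_RULE,
    "GetRecord": {"required": {"identifier", "metadataPrefix"},
                  "optional": set(), "exclusive": None},
}


def check_request(params):
    """Return None for a well-formed request, else an OAI-PMH error code."""
    verb = params.get("verb")
    if verb not in RULES:
        return "badVerb"
    rule = RULES[verb]
    args = set(params) - {"verb"}
    # An exclusive argument must be the only argument besides the verb.
    if rule["exclusive"] in args:
        return None if args == {rule["exclusive"]} else "badArgument"
    if not rule["required"] <= args:
        return "badArgument"  # a required argument is missing
    if args - rule["required"] - rule["optional"]:
        return "badArgument"  # an argument the verb does not accept
    return None


print(check_request({"verb": "ShowMe"}))
print(check_request({"verb": "ListRecords", "resumptionToken": "AXad31"}))
```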
Flow Control ListSets, ListIdentifiers, ListRecords are all allowed to return partial responses, via a combination of: resumptionToken – an opaque, archive-defined data string that when passed back to the archive allows the response to begin where it left off each archive defines their own resumptionToken syntax; it may have visible semantics or not 503 http status code – “retry after” up to the harvester to understand this code and respect it, and up to the archive to enforce it
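resumptionToken scenario: harvesting 2770 records in 3 separate “chunks” of at most 1000 records each (archive backed by an RDBMS). harvester -> archive: ListRecords; archive -> harvester: Records 1-1000, resumptionToken=AXad31; harvester -> archive: ListRecords, resumptionToken=AXad31; archive -> harvester: Records 1001-2000, resumptionToken=pQ22-x; harvester -> archive: ListRecords, resumptionToken=pQ22-x; archive -> harvester: Records 2001-2770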
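The scenario above can be simulated without a network. In this sketch the archive is a toy in-memory list and its token is simply the next offset; both are assumptions made for illustration, since real archives define their own opaque token syntax.

```python
CHUNK = 1000
RECORDS = [f"record-{n}" for n in range(1, 2771)]  # 2770 stand-in records


def list_records(resumption_token=None):
    """Toy archive: return (records, next_token); next_token is None at the end."""
    # This toy token is just the next offset; real tokens are opaque strings.
    start = int(resumption_token) if resumption_token else 0
    end = min(start + CHUNK, len(RECORDS))
    next_token = str(end) if end < len(RECORDS) else None
    return RECORDS[start:end], next_token


def harvest():
    """Harvester: re-issue the request until no resumptionToken comes back."""
    harvested, requests, token = [], 0, None
    while True:
        chunk, token = list_records(token)
        requests += 1
        harvested.extend(chunk)
        if token is None:
            return harvested, requests


records, requests = harvest()
print(len(records), "records in", requests, "requests")
```

The harvester never inspects the token; it only hands it back, which is what lets each archive pick its own syntax.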
302 Load Balancing Interactive users on main DL machine should not be impacted by metadata harvesting: don’t take deliveries through the front door; not part of the protocol, defined outside the protocol. Example: harvester sends http://blah/oai/?verb=ListIdentifiers to the OAI server at naca.larc.nasa.gov/oai/; if load > 0.05, that server redirects the request with HTTP Status Code 302; harvester re-sends http://blah/oai/?verb=ListIdentifiers to the OAI server at buckets.dsi.internet2.edu/naca/oai/, which answers <?xml version="1.0" encoding="UTF-8"?> … <ListIdentifiers> </ListIdentifiers>
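The redirect rule on this slide amounts to a single comparison on the server side. A minimal sketch, assuming the load figure is supplied by the caller (on Unix it could come from os.getloadavg()); the function name route_request and the use of 200 for "serve locally" are illustrative choices, not part of the slide.

```python
LOAD_THRESHOLD = 0.05  # threshold quoted on the slide
# Second OAI server named on the slide.
MIRROR_BASE = "http://buckets.dsi.internet2.edu/naca/oai/"


def route_request(query_string, load):
    """Return (status, headers) for a harvest request at the given load."""
    if load > LOAD_THRESHOLD:
        # Busy: send the harvester to the other server with the same query.
        return 302, {"Location": f"{MIRROR_BASE}?{query_string}"}
    # Idle enough: answer the request locally.
    return 200, {}


print(route_request("verb=ListIdentifiers", load=0.50))
print(route_request("verb=ListIdentifiers", load=0.01))
```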
OAI Demos Data providers not really meant for end-user interaction, but Suleman’s “Repository Explorer” is an excellent tool http://purl.org/net/oai_explorer
what’s new in OAI-PMH v.2.0
general changes
protocol vs periphery clear distinction between protocol and periphery: fixed protocol document; extensible implementation guidelines (e.g. sample metadata formats, description containers, about containers); allows for OAI guidelines and community guidelines
OAI-PMH vs HTTP clear separation of OAI-PMH and HTTP; OAI-PMH error handling: all OK at HTTP level? => 200 OK; something wrong at OAI-PMH level? => OAI-PMH error (e.g. badVerb); HTTP codes 302, 503, etc. still available to implementers, but no longer represent OAI-PMH events
OAI-PMH Data Model resource (e.g. David); item = identifier: all available metadata about David; records (Dublin Core metadata, MARC, SPECTRUM): record = identifier + metadata format + datestamp; set-membership is item-level property
other general changes better definitions of harvester, repository, item, unique identifier, record, set, selective harvesting oai_dc schema builds on DCMI XML Schema for unqualified Dublin Core usage of must, must not etc. as in RFC2119 wording on response compression
other general changes all protocol responses can be validated with a single XML Schema: easier for data providers; no redundancy in type definitions; SOAP-ready; clean for error handling
response no errors <?xml version="1.0" encoding="UTF-8"?> <OAI-PMH> <responseDate>2002-02-08T08:55:46Z</responseDate> <request verb="GetRecord"… …>http://arXiv.org/oai2</request> <GetRecord> <record> <header> <identifier>oai:arXiv:cs/0112017</identifier> <datestamp>2001-12-14</datestamp> <setSpec>cs</setSpec> <setSpec>math</setSpec> </header> <metadata> ….. </metadata> </record> </GetRecord> </OAI-PMH> note no http encoding of the OAI-PMH request
response with error <?xml version="1.0" encoding="UTF-8"?> <OAI-PMH> <responseDate>2002-02-08T08:55:46Z</responseDate> <request>http://arXiv.org/oai2</request> <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error> </OAI-PMH> with errors, only the correct attributes are echoed in <request>
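Since success and failure now arrive in the same XML envelope, a harvester can branch on the presence of an error element instead of the HTTP status. The sketch below uses simplified sample responses modelled on the two slides above; they omit the XML namespace that real OAI-PMH 2.0 responses declare, so the element lookups would need that namespace added for live data. The function name summarize is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified samples modelled on the slides (no XML namespace declared).
OK_RESPONSE = """<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord">http://arXiv.org/oai2</request>
  <GetRecord><record><header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2001-12-14</datestamp>
    <setSpec>cs</setSpec>
    <setSpec>math</setSpec>
  </header></record></GetRecord>
</OAI-PMH>"""

ERROR_RESPONSE = """<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""


def summarize(xml_text):
    """Return the errors of a response, or the headers it carries."""
    root = ET.fromstring(xml_text)
    errors = [(e.get("code"), e.text) for e in root.findall("error")]
    if errors:
        return {"ok": False, "errors": errors}
    headers = [{"identifier": h.findtext("identifier"),
                "datestamp": h.findtext("datestamp"),
                "sets": [s.text for s in h.findall("setSpec")]}
               for h in root.iter("header")]
    return {"ok": True, "headers": headers}


print(summarize(OK_RESPONSE))
print(summarize(ERROR_RESPONSE))
```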
corrections
dates/times all dates/times are UTC, encoded in ISO8601, Z-notation 1957-03-20T20:30:00Z
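A short sketch of producing such a timestamp from a timezone-aware local time; the helper name to_oai_datestamp and the UTC-5 offset in the example are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def to_oai_datestamp(moment):
    """Format a timezone-aware datetime as UTC, ISO 8601, Z-notation."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# 15:30 at UTC-5 is the same instant as 20:30 UTC.
local = datetime(1957, 3, 20, 15, 30, 0, tzinfo=timezone(timedelta(hours=-5)))
print(to_oai_datestamp(local))  # 1957-03-20T20:30:00Z
```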
resumptionToken idempotency of resumptionToken: return same incomplete list when rT is reissued while no changes occur in the repo: strict while changes occur in the repo: all items with unchanged datestamp new, optional attributes for the resumptionToken: expirationDate completeListSize cursor
noRecordsMatch 1.x - if no records match, an empty list was returned
noRecordsMatch 2.0 - if no records match, the error condition noRecordsMatch is returned -- not an empty list
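For a harvester, the practical consequence is that an empty result is now signalled by an error element. One way to keep the harvesting loop simple is to map that one error back to an empty list, as in the sketch below; the sample response is simplified (no XML namespace, hypothetical base URL) and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified 2.0-style response for a date range that matches nothing.
NO_MATCH = """<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="ListIdentifiers">http://archive.example.org/oai</request>
  <error code="noRecordsMatch"/>
</OAI-PMH>"""


def identifiers_or_empty(xml_text):
    """Return the listed identifiers, treating noRecordsMatch as an empty list."""
    root = ET.fromstring(xml_text)
    codes = {e.get("code") for e in root.findall("error")}
    if codes == {"noRecordsMatch"}:
        return []  # nothing to harvest is not a failure for the harvester
    if codes:
        raise RuntimeError(f"OAI-PMH error(s): {sorted(codes)}")
    return [h.findtext("identifier") for h in root.iter("header")]


print(identifiers_or_empty(NO_MATCH))
```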
new functionality
harvesting granularity mandatory support of YYYY-MM-DD optional support of YYYY-MM-DDThh:mm:ssZ other granularities considered, but ultimately rejected granularity of from and until must be the same
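The granularity rules above can be checked before a request is answered. A minimal sketch follows; the function names are hypothetical, and reporting violations as badArgument is an assumption of this example. It checks only the shape of the values, not whether they are real calendar dates.

```python
import re

DAY = re.compile(r"\d{4}-\d{2}-\d{2}")
SECONDS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def granularity(value):
    """Return the granularity a datestamp is written in, or None if neither."""
    if DAY.fullmatch(value):
        return "YYYY-MM-DD"
    if SECONDS.fullmatch(value):
        return "YYYY-MM-DDThh:mm:ssZ"
    return None


def check_range(from_, until, repository_granularity="YYYY-MM-DD"):
    """Return None if from/until are acceptable, else "badArgument"."""
    grains = [granularity(v) for v in (from_, until) if v is not None]
    if None in grains:
        return "badArgument"  # not one of the two allowed forms
    if len(set(grains)) > 1:
        return "badArgument"  # from and until differ in granularity
    if (grains and grains[0] == "YYYY-MM-DDThh:mm:ssZ"
            and repository_granularity == "YYYY-MM-DD"):
        return "badArgument"  # finer than the repository supports
    return None


print(check_range("2001-01-01", "2001-01-01"))
print(check_range("2001-01-01", "2001-01-01T00:00:00Z",
                  "YYYY-MM-DDThh:mm:ssZ"))
```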
Identify more expressive <repositoryName>Library of Congress 1</repositoryName> <baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL> <protocolVersion>2.0</protocolVersion> <adminEmail>r.e.gillian@larc.nasa.gov</adminEmail> <adminEmail>rgillian@visi.net</adminEmail> <deletedRecord>transient</deletedRecord> <earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp> <granularity>YYYY-MM-DDThh:mm:ssZ</granularity> <compression>deflate</compression>
header contains set membership of item <record> <header> <identifier>oai:arXiv:cs/0112017</identifier> <datestamp>2001-12-14</datestamp> <setSpec>cs</setSpec> <setSpec>math</setSpec> </header> <metadata> ….. </metadata> </record> eliminates the need for the “double harvest” 1.x required to get all records and all set information
ListIdentifiers returns headers <?xml version="1.0" encoding="UTF-8"?> <OAI-PMH> <responseDate>2002-02-08T08:55:46Z</responseDate> <request verb="…" …>http://arXiv.org/oai2</request> <ListIdentifiers> <header> <identifier>oai:arXiv:hep-th/9801001</identifier> <datestamp>1999-02-23</datestamp> <setSpec>physic:hep</setSpec> </header> <header> <identifier>oai:arXiv:hep-th/9801002</identifier> <datestamp>1999-03-20</datestamp> <setSpec>physic:exp</setSpec> </header> ……
ListIdentifiers mandates metadataPrefix as argument http://www.perseus.tufts.edu/cgi-bin/pdataprov?verb=ListIdentifiers&metadataPrefix=olac&from=2001-01-01&until=2001-01-01&set=Perseus:collection:PersInfo
ListIdentifiers the changes to ListIdentifiers are subtle, and reflect a change in the OAI-PMH data model Could have been named “ListHeaders” or reduced to an option for ListRecords “ListIdentifiers” kept for lexigraphical consistency
metadataPrefix character set for metadataPrefix and setSpec extended to URL-safe characters: A-Z a-z 0-9 _ ! ' $ ( ) + - . *
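The character set on this slide translates directly into a regular expression. The sketch below uses exactly the characters listed here; treating ":" as the separator between setSpec hierarchy levels is inferred from examples such as Perseus:collection:PersInfo, and the function names are hypothetical.

```python
import re

# Character set as listed on the slide: A-Z a-z 0-9 _ ! ' $ ( ) + - . *
TOKEN = re.compile(r"[A-Za-z0-9_!'$()+\-.*]+")


def is_valid_metadata_prefix(prefix):
    """True if every character of the prefix is in the listed set."""
    return TOKEN.fullmatch(prefix) is not None


def is_valid_set_spec(set_spec):
    """True if each ':'-separated level is a non-empty run of listed characters."""
    return all(TOKEN.fullmatch(part) is not None
               for part in set_spec.split(":"))


print(is_valid_metadata_prefix("oai_dc"))
print(is_valid_metadata_prefix("oai dc"))
print(is_valid_set_spec("Perseus:collection:PersInfo"))
```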
in the periphery
provenance introduction of provenance container to facilitate tracing of harvesting history <about> <provenance> <originDescription> <baseURL>http://an.oa.org</baseURL> <identifier>oai:r1:plog/9801001</identifier> <datestamp>2001-08-13T13:00:02Z</datestamp> <metadataPrefix>oai_dc</metadataPrefix> <harvestDate>2001-08-15T12:01:30Z</harvestDate> … … … </originDescription> </provenance> </about>
friends introduction of friends container to facilitate discovery of repositories <description> <friends> <baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL> <baseURL>http://formations2.ulst.ac.uk/perl/oai</baseURL> <baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL> <baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL> </friends> </description>
branding introduction of branding container for DPs to suggest rendering & association hints <branding xmlns="http://www.openarchives.org/OAI/2.0/branding/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/branding/ http://www.openarchives.org/OAI/2.0/branding.xsd"> <collectionIcon> <url>http://my.site/icon.png</url> <link>http://my.site/homepage.html</link> <title>MySite(tm)</title> <width>88</width> <height>31</height> </collectionIcon> <metadataRendering metadataNamespace="http://www.openarchives.org/OAI/2.0/oai_dc/" mimeType="text/xsl">http://some.where/DCrender.xsl</metadataRendering> <metadataRendering metadataNamespace="http://another.place/MARC" mimeType="text/css">http://another.place/MARCrender.css</metadataRendering> </branding>
revision of oai-identifier <description> <oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier http://www.openarchives.org/OAI/2.0/oai-identifier.xsd"> <scheme>oai</scheme> <repositoryIdentifier>oai-stuff.foo.org</repositoryIdentifier> <delimiter>:</delimiter> <sampleIdentifier>oai:oai-stuff.foo.org:5324</sampleIdentifier> </oai-identifier> </description> domain based repository names
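Given the scheme and delimiter announced in this description, an identifier can be split into its parts. A minimal sketch, assuming the local part may itself contain the delimiter (as in oai:NACA:1917:naca-report-10); the function name is hypothetical.

```python
def split_oai_identifier(identifier, scheme="oai", delimiter=":"):
    """Split an identifier into (scheme, repository identifier, local identifier)."""
    # Split at most twice so delimiters inside the local part are preserved.
    parts = identifier.split(delimiter, 2)
    if len(parts) != 3 or parts[0] != scheme or not all(parts):
        raise ValueError(f"not a {scheme} identifier: {identifier}")
    return tuple(parts)


print(split_oai_identifier("oai:oai-stuff.foo.org:5324"))
print(split_oai_identifier("oai:NACA:1917:naca-report-10"))
```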
OAI-PMH musings
OAI Observation: Front-End Only No input/registry mechanism: OAI harvesting protocol is always a front-end for something else (filesystem, Dienst, RDBMS, LDAP, etc.); convenient for pre-existing DLs, but does not address “new” DLs (e.g., “we want to do OAI”). Bounds: the scope of OAI responsibilities and the domain of OAI are still being discussed; tension between functionality and simplicity
OAI Observation: No T&C No terms & conditions provisions in protocol assumes all metadata has uniform access rights how to restrict metadata to certain hosts? introducing T&C would increase the scope of application, but at the expense of simplicity how expensive do we want to make a “just-a-front-end protocol” ? maybe T&C is a good application for sets?
OAI Observation: No T&C Possible to use multiple OAI servers in a DMZ-like configuration… [Figure: a Public OAI Server answers OAI requests from arbitrary hosts and a Private OAI Server answers OAI requests from trusted hosts; both sit in front of the source database] could even use a separate copy of the database…
OAI Observation: No T&C Possible to use OAI harvesting protocol in closed, restricted systems [Figure: four DLs, OAI 1 through OAI 4; all OAI requests originate from these 4 DLs]
OAI Observation: Monolithic An OAI server has no protocol-defined concept of “other” OAI servers: backups, mirrors, etc. have to be resolved outside of the scope of OAI (scope vs. complexity again); fully connected graph of DLs harvesting from each other is unnecessary, cf. web crawlers vs. “gatherers” in U of Colorado’s Harvest System; 3rd party harvesting interfaces raise more T&C and data coherency issues
OAI Observation: Data Coherency In the interest of OAI implementer simplicity, several issues are left for the service provider to interpret. what is an update vs. addition? in the NACA OAI interface, they are reported as the same and it’s up to the harvesting system to figure it out. deletions? it is currently optional for OAI systems to mark records as deleted or not… still left to the harvester to interpret
OAI Observation: Harvest Model Frequency of harvests: all-at-once harvests? (initial harvest, resolving data coherency); frequent incremental harvests? (far more efficient for both service and data providers). Webcrawling vs. digital library models: webcrawlers have little to no a priori information about targets; DLs do frequent harvesting of a small number of known targets. Realization: we know very little about harvesting behavior… are we optimizing for all-at-once, when incremental will be more common?
Interesting Services DP9: gateway to expose repository contents in HTML suitable for web crawlers. Celestial: OAI “cache”, also 1.1 -> 2.0 converter. Static (mini-) repositories: XML files, based on OLAC work. OpenURL metadata format registries: record = metadata format
DP9 Architecture see Liu et al., JCDL 2002; http://dlib.cs.odu.edu/dp9 Slide from Liu
DP9 Formatting Format of URLs: http://arc.cs.odu.edu:8080/dp9/getrecord.jsp?identifier=oai:NACA:1917:naca-report-10&prefix=oai_dc and http://arc.cs.odu.edu:8080/dp9/getrecord/oai_dc/oai:NACA:1917:naca-report-10 HTML Meta tags: Some crawlers (such as Inktomi) use the HTML meta tags to index a Web page; DP9 also maps Dublin Core metadata to corresponding HTML meta tags. For pages that are designed exclusively for robot navigation, a noindex robots meta tag is used. X-FORWARDED-FOR header is used to distinguish between different users coming in via a proxy. Slide from Liu
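The Dublin Core to meta tag mapping can be pictured as follows. This is not DP9's code: the "DC.element" naming follows the common convention for embedding Dublin Core in HTML, and the function name and sample record are hypothetical.

```python
from html import escape


def dc_to_meta_tags(dc_record, for_robots_only=False):
    """Render a Dublin Core record (element -> list of values) as HTML meta tags."""
    tags = []
    if for_robots_only:
        # Page exists only to lead crawlers onward: ask them not to index it.
        tags.append('<meta name="robots" content="noindex">')
    for element, values in dc_record.items():
        for value in values:
            content = escape(value, quote=True)
            tags.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(tags)


# Hypothetical record, for illustration only.
record = {"title": ["Wind Tunnels & Propellers"], "date": ["1917"]}
print(dc_to_meta_tags(record))
```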
Celestial Developed by Brody @ Southampton http://celestial.eprints.org/ designed to complement DP9 see Liu, Brody, et al., D-Lib Magazine 8(11) Where DP9 is a non-caching proxy, Celestial caches the metadata records can off-load work from individual archives, higher availability can harvest 1.1, 2.0; exports in 2.0
“Static” Repositories Premise: a repository does not wish to have an executing program on its site, so it has a “static” XML file with some of the OAI-PMH responses in place http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm accessed through a proxy could be a low functionality node, or the XML file could be produced by a process and moved outside a firewall Based on OLAC work by Bird & Simons http://www.language-archives.org/
OpenURL Metadata Registry Registry of metadata formats for OpenURL http://www.sfxit.com/openurl/ http://lib-www.lanl.gov/~herbertv/papers/icpp02-draft.pdf
Additional Readings presentations: http://www.cs.odu.edu/~mln/jcdl03 http://www.cs.odu.edu/~mln/oai-geneva.ppt publications: http://www.cs.odu.edu/~liu_x/dp9/dp9.pdf http://www.cs.odu.edu/~liu_x/paper/archon/archon.pdf http://www.cs.odu.edu/~liu_x/paper/tri/tri.pdf http://www.cs.odu.edu/~liu_x/paper/thesis/thesis.pdf