The Open Archives Initiative Movement Kurt Maly Old Dominion University Norfolk Virginia, USA Brazilian DL international conference Política de Informação em Bibliotecas Digitais Campinas, Brazil March 19-22, 2003
Outline* Open Archives Initiative - history and summary description OAI services Why the OAI-PMH is not important Defining the OAI-PMH data model More interesting services (DP9, Celestial, Kepler) * Slides from Herbert Van de Sompel & Carl Lagoze & Michael Nelson included
OAI origin: The Open Archives Initiative has been set up to create a forum to discuss and solve matters of interoperability between preprint solutions, as a way to promote their global acceptance. Paul Ginsparg, Rick Luce & Herbert Van de Sompel. Slide from Herbert Van de Sompel
e-print accessibility [figure: e-print archives]
preprint solutions: Santa Fe meeting: improve accessibility of preprints by improving searchability via the provision of an interoperability spec
Core concepts of Santa Fe convention: low-barrier interoperability; data-provider & service-provider model; metadata harvesting model; shared metadata format and parallel, community-specific metadata formats; acceptable use. Dienst subset; OAMS; XML reply; HTTP based; Gentlemen's agreement
metadata harvesting [figure: e-print; metadata: Author, Title, Abstract, Identifier]
interest from other communities: Digital Library Federation meetings: research library community has many materials for which they would like to ‘expose’ metadata. OAI San Antonio meeting: interest from librarians, publishers, others, ...
resulting actions: organizational: establish organizational stability for the OAI: institutional backing from CNI & DLF; steering committee: policy guidance; technical committee: technical specifications; executive group: day to day coordination; workshops: public dissemination, feedback
resulting actions: technical: [09/2000] revise specifications to allow adoption beyond preprints: technical committee; [09/ /2001] compile new specifications: editing by Carl and Herbert; [11/ /2001] alpha-test specifications: oai-alpha group; [01/2001] discontinue the Santa Fe Convention; [01/2001] release version 1.0 of the OAI protocol; [07/2001] version 1.1; [06/2002] version 2.0
core concepts in OAI 1.0: low-barrier interoperability; data-provider & service-provider model; metadata harvesting model; shared metadata format and parallel, community-specific metadata formats; acceptable use; flexibility. OAI 1.0 protocol; Dublin Core; HTTP based; Community specific; Reply XML Schema; Self contained
low-barrier interop umbrella [figure: OPAC, image, FTXT, A&I, e-print; metadata: Author, Title, Abstract, Identifier]
communication re OAI lists: subscribe via oai-general list [replaces UPS list; UPS subscribers will be moved] oai-implementers list web: FAQ: mail:
revision of specifications: freeze specifications for months: stable for experimentation; not definitive; minimize risk for early adopters; maximize chances for future interoperability across communities
new OAI mission statement: The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.
new OAI mission statement: The Open Archives Initiative has its roots in an effort to enhance access to e-print archives as a means of increasing the availability of scholarly communication. Continued support of this work remains a cornerstone of the Open Archives program.
new OAI mission statement: The fundamental technological framework and standards that are developing to support this work are, however, independent of both the type of content offered and the economic mechanisms surrounding that content, and promise to have much broader relevance in opening up access to a range of digital materials. [...]
OAI-PMH Meeting History OAI Open Day, Washington DC 1/2001 CERN meeting 10/2002 Protocol definition, development tools DPs, retrofitting existing DLs SPs, new services Socio-Economic- Political Issues
Shift of Topics From the protocol itself, supporting & debugging tools and how to retrofit (existing) DLs… …to building (new) services that use the OAI-PMH as a core technology and reporting on their impact to the institution/community
NTRS metadata harvesting replacement for –previous NTRS was based on distributed searching –hierarchical harvesting (nigh) publicly available
Arc harvests all known archives first end-user service provider source available through SourceForge hierarchical harvesting
NCSTRL metadata harvesting replacement for Dienst-based NCSTRL based on Arc computer science metadata
Archon physics metadata based on Arc features: –citation indexing –equation-based searching
Torii physics metadata features –personalization –recommendations –WAP access
iCite physics metadata features –citation based access to arXiv metadata
my.OAI covers all registered metadata features –result sets –personalization –many other advanced features
Cyclades scientific metadata features –personalization –recommendations –collaboration status?
citebase arXiv metadata citation based indexing, reporting
OAIster harvests all known archives
Public Knowledge Project domain-specific filtering of harvested metadata (?)
Perseus they claim to harvest all DPs, but only humanities related DPs appear in the pull down menu
Service Providers It is clear that SPs are proliferating, despite (because of?) the inherent bias toward DPs in the protocol –easy to be a DP -> many DPs -> SPs eventually emerge –hard to be a DP -> SPs starve –currently 5x as many DPs as SPs SPs are beginning to offer increasingly sophisticated services –competitive market originally envisioned for SPs is emerging
Why The OAI-PMH is NOT Important Users don’t care OAI-PMH is middleware –if done right, the uninterested user should never have to know there is OAI inside Using the OAI-PMH does not ensure a good SP OAI-PMH is (or is becoming) HTTP for DLs –few people get excited about HTTP now HTTP & OAI-PMH are core technologies whose presence is now assumed
Other Uses For the OAI-PMH Assumptions: –Traditional DLs / SPs will continue on their present path of increasing sophistication citation indexing, search results viz, personalization, recommendations, subject-based filtering, etc. –growth rates remain the same (5x DPs as SPs) Premise: OAI-PMH is applicable to any scenario that needs to update / synchronize distributed state –Future opportunities are possible by creatively interpreting the OAI-PMH data model
OAI-PMH Data Model: resource (e.g., David); item = all available metadata about David; records = Dublin Core metadata, MARC metadata, SPECTRUM metadata. item = identifier; record = identifier + metadata format + datestamp; set-membership is item-level property
Typical Values repository –collection of publications resource –scholarly publication item –all metadata (DC + MARC) record –a single metadata format datestamp –last update / addition of a record metadata format –bibliographic metadata format set –originating institution or subject categories
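The data model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the identifier `oai:example.org:david`, the set name, and the placeholder metadata strings are all hypothetical and are not part of the OAI-PMH specification.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class Record:
    """record = identifier + metadata format + datestamp."""
    identifier: str       # identifier of the item the record belongs to
    metadata_prefix: str  # metadata format, e.g. "oai_dc"
    datestamp: str        # last update / addition of the record
    metadata: str         # serialized metadata (placeholder text here)

    def key(self) -> Tuple[str, str, str]:
        # The triple that distinguishes one record from another.
        return (self.identifier, self.metadata_prefix, self.datestamp)


@dataclass
class Item:
    """item = identifier; holds all metadata about one resource."""
    identifier: str
    sets: Set[str] = field(default_factory=set)  # set membership is item-level
    records: Dict[str, Record] = field(default_factory=dict)  # keyed by format

    def add_record(self, prefix: str, datestamp: str, metadata: str) -> Record:
        record = Record(self.identifier, prefix, datestamp, metadata)
        self.records[prefix] = record
        return record


# Hypothetical item with two metadata formats.
item = Item("oai:example.org:david", sets={"sculpture"})
item.add_record("oai_dc", "2003-03-19", "<dc>...</dc>")
item.add_record("marc", "2003-03-20", "<marc>...</marc>")
```

Keeping set membership on `Item` rather than on `Record` mirrors the slide's point that sets are an item-level property.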
Interesting Services DP9 –gateway to expose repository contents in HTML suitable for web crawlers Celestial –OAI “cache”, also 1.1 -> 2.0 converter Static (mini-) repositories –XML files, based on OLAC work OpenURL metadata format registries –record = metadata format
DP9 Architecture see Liu et al., JCDL 2002; Slide from Liu
DP9 Formatting Format of URLs – report-10 &prefix=oai_dc – HTML Meta tags –Some crawlers (such as Inktomi) use HTML meta tags to index Web pages; DP9 also maps Dublin Core metadata to the corresponding HTML meta tags. –For pages that are designed exclusively for robot navigation, a noindex robots meta tag is used – X-FORWARDED-FOR header to distinguish between different users coming in via a proxy Slide from Liu
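The Dublin Core to HTML meta tag mapping might look roughly like the sketch below. It is not DP9's actual code: the function name, the `navigation_only` flag, and the `DC.<element>` naming of the tags are assumptions made for illustration.

```python
from html import escape


def dc_to_meta_tags(dc, navigation_only=False):
    """Render Dublin Core values as HTML meta tags (illustrative sketch).

    dc maps an element name (e.g. "title") to a list of values.
    navigation_only marks a page meant only for robot navigation,
    which gets a noindex robots meta tag.
    """
    tags = []
    for element, values in dc.items():
        for value in values:
            tags.append('<meta name="DC.%s" content="%s">'
                        % (escape(element, quote=True), escape(value, quote=True)))
    if navigation_only:
        tags.append('<meta name="robots" content="noindex">')
    return tags


# Hypothetical record with one title and two creators.
sample_tags = dc_to_meta_tags(
    {"title": ["Open Archives"], "creator": ["A. Author", "B. Author"]}
)
```

Escaping the values matters because titles and abstracts can contain quotes or angle brackets that would otherwise break the generated page.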
Celestial Developed by Southampton –designed to complement DP9 –see Liu, Brody, et al., D-Lib Magazine 8(11) Where DP9 is a non-caching proxy, Celestial caches the metadata records –can off-load work from individual archives, higher availability –can harvest 1.1, 2.0; exports in 2.0
“Static” Repositories Premise: a repository does not wish to have an executing program on its site, so it has a “static” XML file with some of the OAI-PMH responses in place –Design still being discussed –accessed through a proxy –could be a low functionality node, or the XML file could be produced by a process and moved outside a firewall Based on OLAC work by Bird & Simons
Original Kepler Framework Support "personal data providers" or "archivelets". An archivelet is a self-contained, self-installing software package that easily allows a researcher to create and maintain a small, OAI-PMH-compliant archive. The general public has seamless access to the totality of all such published materials.
Enhanced Kepler Framework (EKF) Improve scalability and service reliability through the concept of buddy nodes and SuperNodes. Extend OAI-PMH with a push model and a hybrid push/pull model. Rapid discovery of content as soon as it is published. Works behind firewalls and network address translation (NAT) proxies. Support community-based installation and integration.
Motivation behind Kepler The success of Peer-to-Peer (P2P) networks. The vision of author self-archiving. Efficient repository synchronization technology defined by OAI-PMH.
Peer to Peer Network File sharing P2P networks such as Napster, Gnutella, Freenet. LOCKSS (Lots Of Copies Keep Stuff Safe) provides long-term preservation of scientific journals. Recent arrival of FastTrack and openFT: – A 2-tier system of SuperNodes and Nodes to solve the scalability problem. – Kazaa (based on FastTrack technology) claims 20M downloads and scales well.
Author Self-Archiving Subject based: arXiv is a very successful subject-based self-archiving service. Since its inception in 1991, nearly 200K documents have been submitted. Institution based: Eprints software from Southampton. Personal based: personal homepage (indexed by ResearchIndex) and Kepler (indexed by an OAI-PMH-compliant service such as Arc).
Original Kepler Framework
Problems of the Original Kepler Framework Centralized Registration Server (LDAP): Increases the complexity of installing the Kepler server-side software. Open Protocol: The archivelet uses a non-standard protocol for registration, thus inhibiting the development of third-party applications. Security and NAT (Network Address Translation): In many instances, an archivelet is behind a firewall or NAT proxy, which makes it difficult for the service provider to harvest the archivelet.
Problems of the Original Kepler Framework Availability: Archivelets are extremely unstable; some use dynamic IP addresses. Freshness: There is a large number of archivelets with sparse changes, which does not fit well with OAI-PMH’s “poll”-based mode. Full text vs. Metadata: As an archivelet is not up all the time, it is desirable to harvest full-text documents as well, to improve the availability of full text to end-users.
Improvement through EKF “Push” and hybrid “push/pull” models to address the scalability, security and freshness problems. SuperNodes and buddy nodes to improve scalability and server availability. The protocol in EKF is open, and we hope it will inspire third-party development of Kepler tools.
EKF- Push/Pull model Pull – Retrieval without prior coordination (e.g., as used by current robots and OAI- PMH) Hybrid Push/Pull – Retrieval after notification Push –Notification followed by a provider push.
EKF-Push/Pull Model
EKF- Why extend OAI-PMH? "Update Overhead" problem. –Frequent crawling has to be done to synchronize the data providers and service providers. – It is inefficient if the data providers seldom change during a harvest interval. – On the other hand, without frequent crawling, service providers may become inconsistent with data providers.
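The update overhead can be made concrete with a small calculation. The sketch below is illustrative only: the daily polling schedule and the two change times are made-up numbers, and `count_wasted_polls` is a hypothetical helper, not part of Kepler or OAI-PMH.

```python
def count_wasted_polls(poll_times, change_times):
    """Count polls that find nothing new since the previous poll."""
    wasted = 0
    previous = float("-inf")
    for now in sorted(poll_times):
        # A poll is useful only if some change happened in (previous, now].
        changed = any(previous < t <= now for t in change_times)
        if not changed:
            wasted += 1
        previous = now
    return wasted


# Assumed scenario: a service provider polls once a day for 30 days,
# while the data provider changes only twice in that period.
polls = list(range(30))
changes = [4.5, 20.2]
wasted = count_wasted_polls(polls, changes)

# With notification-driven harvesting, one harvest per change would suffice.
harvests_with_notification = len(changes)
```

In this made-up scenario most pull requests return nothing new, which is the inefficiency the slide describes; polling less often would reduce the waste but delay discovery of the two changes.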
Extension of OAI-PMH on the Service Provider Side The AddFriend and Notify verbs support the hybrid push/pull model. –The AddFriend verb informs the service provider of the existence of a data provider. –The Notify verb informs the service provider that a data provider is up or down, or that new data is available. A PushMetadata verb is added to support the push model.
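How a service provider might react to the three verbs can be sketched with an in-memory model. This is an assumption-laden illustration, not Kepler's implementation: the class, its attributes, and the choice to queue a pull harvest on an update notification are hypothetical; the event names start, stop and update follow the protocol syntax given in this deck.

```python
class ServiceProvider:
    """Toy service provider reacting to AddFriend, Notify and PushMetadata."""

    def __init__(self):
        self.friends = {}        # archivelet id -> base URL
        self.online = set()      # archivelets currently reported as up
        self.harvest_queue = []  # archivelets waiting for a pull harvest
        self.pushed = {}         # archivelet id -> list of pushed metadata

    def add_friend(self, archivelet_id, base_url):
        # AddFriend: learn that a data provider exists.
        self.friends[archivelet_id] = base_url

    def notify(self, archivelet_id, event):
        # Notify: the data provider is up/down or has new data.
        if archivelet_id not in self.friends:
            raise KeyError("unknown archivelet: %s" % archivelet_id)
        if event == "start":
            self.online.add(archivelet_id)
        elif event == "stop":
            self.online.discard(archivelet_id)
        elif event == "update":
            # Hybrid push/pull: harvest later, only after being notified.
            if archivelet_id not in self.harvest_queue:
                self.harvest_queue.append(archivelet_id)
        else:
            raise ValueError("unknown event: %s" % event)

    def push_metadata(self, archivelet_id, contents):
        # PushMetadata: the data provider sends the metadata itself.
        self.pushed.setdefault(archivelet_id, []).append(contents)


sp = ServiceProvider()
sp.add_friend("archivelet-1", "http://host.example.org/oai")
sp.notify("archivelet-1", "start")
sp.notify("archivelet-1", "update")
sp.notify("archivelet-1", "update")  # duplicate notification, queued once
sp.push_metadata("archivelet-1", "<record>...</record>")
```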
Design and Implementation Loose Name Space Management –Use address to uniquely identify archivelet. –Avoid the effort of maintaining a global namespace. Sample –oai:
Archivelet Components –File System based. OAI-PMH-compliant repository –Publication tool. –A simple extended HTTP server which supports the OAI-PMH protocol and push/pull model. It might act as a SuperNode and BuddyNode at the same time.
SuperNode A SuperNode has all the functionalities of an archivelet. The SuperNode collects all documents and metadata from archivelets in its friends list, and builds value-added services over these harvested data. SuperNode is typically deployed at an institution with a high quality network.
Protocol Syntax Add a Friend: Request to be added as a friend –?verb=AddFriend&id=&baseURL= Notify: used for major events of an archivelet, including startup, shutdown and document update. –?verb=Notify&event=[start/stop/update]&id=&baseURL= PushMetadata: –?verb=PushMetadata&contents=
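The request syntax above can be assembled with the standard library. The sketch assumes the verbs are sent as URL query strings; the endpoint `http://supernode.example.org/kepler`, the archivelet id, and the helper names are hypothetical, and whether the real implementation uses GET or POST is not stated here.

```python
from urllib.parse import urlencode

ALLOWED_EVENTS = {"start", "stop", "update"}


def kepler_request(endpoint, verb, **params):
    """Build a request URL of the form endpoint?verb=...&key=value."""
    query = urlencode([("verb", verb)] + list(params.items()))
    return endpoint + "?" + query


def add_friend_url(endpoint, archivelet_id, base_url):
    return kepler_request(endpoint, "AddFriend", id=archivelet_id, baseURL=base_url)


def notify_url(endpoint, event, archivelet_id, base_url):
    if event not in ALLOWED_EVENTS:
        raise ValueError("event must be one of start, stop, update")
    return kepler_request(endpoint, "Notify", event=event,
                          id=archivelet_id, baseURL=base_url)


def push_metadata_url(endpoint, contents):
    return kepler_request(endpoint, "PushMetadata", contents=contents)


# Hypothetical SuperNode endpoint and archivelet address.
endpoint = "http://supernode.example.org/kepler"
friend_url = add_friend_url(endpoint, "archivelet-1", "http://host.example.org:8080/oai")
update_url = notify_url(endpoint, "update", "archivelet-1", "http://host.example.org:8080/oai")
```

`urlencode` percent-encodes the base URL so that its own `:` and `/` characters cannot be confused with the structure of the request.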
Optional Implementation Features outside of the kernel protocol, but which may operate as a “hook” to attract more use of Kepler. –Cache full-text documents. –Query service in the archivelet. –Security and spoofing issues
Conclusions Protocol / transport gateways –Dienst OAI DOG, –Z39.50 ZMARCO (UIUC) –SOAP VT (Suleman) & ODU (Zubair)
OAI-PMH Will Have Arrived When: general web robots issue OAI-PMH verbs –…DP9 will no longer be needed –requires shift in “control”: harvester or repository? mod_oai is developed and is included in the default Apache configuration OAI-PMH fades into the background –similar to TCP/IP, http, XML, etc. –next year’s workshop is on OpenURL
Conclusions DPs continue to proliferate –and spawn SPs! SPs are / are becoming a competitive market –e.g., at least 10 different interfaces to arXiv metadata –growing sophistication of services –differentiation of SPs will be on features that have little to nothing to do with OAI-PMH