OAI: Past, Present and Future Michael L. Nelson several slides stolen from Herbert Van de Sompel Open Archives Meeting Institute of Mechanical Engineers London 07/11/01

Outline Past –original goals, participants Present –evolution of goals, terms, definitions, current status Future –observations, use in the U.S., next steps

Background I met Herbert Van de Sompel in April –we spoke of a demonstration project he had in mind, for which he had received sponsorship from Paul Ginsparg and Rick Luce –we wanted to demonstrate a multi-disciplinary DL that leveraged the large number of high quality, yet often isolated, tech report servers, e-print servers, etc. Most DLs had grown up along single disciplines –little to no interoperability, “gardens” of DLs

The Rise and Fall of Distributed Searching Wholesale distributed searching, popular at the time, is attractive in theory but troublesome in practice –Davis & Lagoze, JASIS 51(3), pp –Powell & French, Proc 5th ACM DL, pp Distributed searching of N nodes still viable, but only for small values of N –NCSTRL: N > 100; bad –NTRS/NIX: N <= 20; ok (but could be better)

The Rise and Fall of Distributed Searching Other problems of distributed searching (from STARTS) –source-metadata problem how do you know which nodes to search? –query-language problem syntax varies and drifts over time between the various nodes –rank-merging problem how do you meaningfully merge multiple result sets? Temptations: –centralize all functions “everything will be done at X” –standardize on a single product “everyone will use system Y”

Universal Preprint Service A cross-archive DL that provides services on a collection of metadata harvested from multiple archives –based on NCSTRL+, a modified version of Dienst, with support for “clustering” and support for “buckets” Demonstrated at Santa Fe NM, October 21-22, 1999 –D-Lib Magazine, 6(2) 2000 (2 articles) –UPS was soon renamed the Open Archives Initiative (OAI)

UPS Participants totals ca. July 1999

Metadata Harvesting Getting metadata out of archives –not all archives support metadata extraction; some archives have undocumented metadata extraction procedures –not all archives support rich criteria for extraction; single dump concept only Intellectual property and use rights not always clear –many policies akin to “don’t ask, don’t tell”

Metadata Formatting and Quality Quality problems with: –record duplication –crucial missing fields –internal errors –ambiguous references to people, places and publications Different formats! Unproven intuition: n digital libraries result in O(n) metadata formats

Buckets: Information Surrogates in UPS Limitations on intellectual property, file size, transmission time, system load, etc. caused us to focus on metadata only Metadata was collected into “buckets”, with pointers back to the data files (still at the original sites)

Value Added Services Attached to the Buckets SFX Reference Linking Service, developed at Univ of Ghent, Belgium. - provides a layer of indirection between reference services available at a local site and the object itself SFX “buttons” are attached to the buckets themselves - communication occurs between SFX server and the bucket Adding other services to the buckets is easy...

Data and Service Providers Data Providers –publishing into an archive –providing methods for metadata “harvesting” –also provide non-technical context for sharing information Service Providers –harvest metadata from providers –implement user interface to data Even if provided by the same DL, these are distinct functions

Data and Service Providers
Diagram: a provider with an input interface and a native end-user interface only (no machine based way to extract metadata…) next to a provider with an input interface, a native end-user interface and a native harvesting interface (machine and user interfaces for extracting metadata…)
Self-describing archives –Much of the learning about the constituent UPS archives occurred out of band… –Given an unknown archive, we should be able to algorithmically determine the nature of the archive

Data and Service Providers
Diagram: a data provider with an input interface and a native harvesting interface; a data provider with an input interface, a native end-user interface and a native harvesting interface; a service provider with a native end-user interface
Notes on the diagram: input and harvesting interfaces optional; native end-user interface optional (e.g., RePEc)

Result… OAI The OAI was the result of the demonstration and discussion during the Santa Fe meeting Initial focus was on federating collections of scholarly e-print materials… …however, interest grew and the scope and application of OAI expanded to become a generic bulk metadata transport protocol Note: –OAI is only about metadata -- not full text! –OAI is neutral with respect to the nature of the metadata or the resources the metadata describes read: commercial publishers have an interest in OAI too...

OAI Timeline Highlights
October 21-22, initial UPS meeting
February 15, Santa Fe Convention published in D-Lib Magazine –precursor to the OAI metadata harvesting protocol
June 3, workshop at ACM DL 2000 (Texas)
August 25, OAI steering committee formed, DLF/CNI support
September 7-8, technical meeting at Cornell University –defined the core of the current OAI metadata harvesting protocol
September 21, workshop at ECDL 2000 (Portugal)
November 1, Alpha test group announced (~15 organizations)
January 23, OAI protocol 1.0 announced, OAI Open Day in the U.S. (Washington DC) –purpose: freeze protocol for months, generate critical mass
February 26, OAI Open Day in Europe (Berlin)
July 3, OAI protocol 1.1 announced –to reflect changes in the W3C’s latest XML Schema recommendation
September 8, workshop at ECDL 2001 (Darmstadt)

Open Archives Initiative The protocol is openly documented, and metadata is “exposed” to at least some peer group (note: rights management can still apply!) Archive defined as a “collection of stuff” -- not the archivist’s definition of “archive”. “Repository” used in most OAI documents. OAI is happening at break-neck speed...

Open Archives Initiative (OAI): exposure of metadata for harvesting
Open Archival Information System (OAIS): ensuring long-term preservation of archival materials
Diagram labels: OAIS; OAIS w/ an OAI interface

OAI Metadata Harvesting Protocol Then: –OAI harvesting protocol originally a subset of the Dienst (NCSTRL) protocol and originally called the “Santa Fe Convention” –originally defined an OAI-specific metadata format Now: –OAI metadata format dropped in favor of unqualified Dublin Core other formats possible, but DC is required as lowest common denominator –No longer dependent on Dienst defined independently (though still easily mappable)

Overview of OAI Verbs
Verb                  Function
Identify              description of archive
ListMetadataFormats   metadata formats supported by archive
ListSets              sets defined by archive
ListIdentifiers       OAI unique ids contained in archive
ListRecords           listing of N records
GetRecord             listing of a single record
Verb groups labeled on the slide: archival; metadata harvesting verbs
most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
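As a rough illustration of the verbs and arguments listed above, a request can be sketched as a base URL plus a verb and its arguments. The helper name `build_request` and the base URL `http://example.org/oai` are placeholders invented for this sketch; the verb and argument names are the ones shown on the slides.

```python
from urllib.parse import urlencode

# Verbs listed on the slide above.
VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord"}

def build_request(base_url, verb, args=None):
    """Return a GET URL for one OAI request (hypothetical helper)."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI verb: {verb}")
    params = {"verb": verb}
    # Arguments are passed as a dict because "from" is a Python keyword.
    params.update(args or {})
    return f"{base_url}?{urlencode(params)}"

# Placeholder base URL; identifier and prefix taken from the slides.
url = build_request("http://example.org/oai", "GetRecord",
                    {"identifier": "oai:mlib:123a", "metadataPrefix": "dc"})
print(url)
```

Note that `urlencode` percent-encodes the colons in the identifier, so the printed URL contains `oai%3Amlib%3A123a`.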

supporting protocol requests: Identify (herbert van de sompel)
service provider (harvester) → data provider (repository): Identify
response: Identify / Time / Request; Repository identifier; Base-URL; Admin; OAI protocol version; Description

supporting protocol requests: ListMetadataFormats
service provider (harvester) → data provider (repository): ListMetadataFormats * identifier=oai:mlib:123a
response: ListMetadataFormats / Time / Request; REPEAT Format prefix, Format XML schema /REPEAT

supporting protocol requests: ListSets
service provider (harvester) → data provider (repository): ListSets * resumptionToken
response: ListSets / Time / Request; REPEAT SetSpec, SetName /REPEAT

harvesting requests: ListRecords
service provider (harvester) → data provider (repository): ListRecords * from=a * until=b * set=klm * metadataPrefix=dc * resumptionToken
response: ListRecords / Time / Request; REPEAT Identifier, Datestamp, Metadata /REPEAT

harvesting requests: ListIdentifiers
service provider (harvester) → data provider (repository): ListIdentifiers * from=a * until=b * set=klm * resumptionToken
response: ListIdentifiers / Time / Request; REPEAT Identifier, Datestamp /REPEAT

harvesting requests: GetRecord
service provider (harvester) → data provider (repository): GetRecord * identifier=oai:mlib:123a * metadataPrefix=dc
response: GetRecord / Time / Request; Identifier; Datestamp; Metadata

Flow Control ListSets, ListIdentifiers, ListRecords are all allowed to return partial responses, via a combination of: –resumptionToken – an opaque, archive-defined data string that when passed back to the archive allows the response to begin where it left off each archive defines their own resumptionToken syntax; it may have visible semantics or not –503 http status code – “retry after” up to the harvester to understand this code and respect it, and up to the archive to enforce it
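The "retry after" half of flow control might look like this on the harvester side. This is only a sketch: the function name and the 60-second fallback are assumptions, and only the delay-in-seconds form of the HTTP `Retry-After` header is handled.

```python
def seconds_to_wait(status, headers, default_wait=60):
    """Return how long a polite harvester should pause before retrying.

    0 means the response can be processed right away.
    """
    if status != 503:
        return 0
    value = headers.get("Retry-After", "")
    # Only a plain number of seconds is understood here; anything else
    # (for example an HTTP date) falls back to the assumed default.
    return int(value) if value.isdigit() else default_wait

print(seconds_to_wait(503, {"Retry-After": "120"}))
```

Respecting the returned delay is up to the harvester; the archive still has to enforce it on its side.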

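resumptionToken
scenario: harvesting 277 records in 3 separate 100 record “chunks” (harvester talking to an RDBMS-backed archive)
harvester → archive: ListRecords
archive → harvester: Records 1-100, resumptionToken=AXad31
harvester → archive: ListRecords, resumptionToken=AXad31
archive → harvester: next chunk of records, resumptionToken=pQ22-x
harvester → archive: ListRecords, resumptionToken=pQ22-x
archive → harvester: final chunk of records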
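The scenario above can be simulated with a toy in-memory archive. Everything here is illustrative: the archive is a Python list rather than an RDBMS, and its token happens to be a numeric offset, which a real harvester must not rely on because the token is opaque.

```python
RECORDS = [f"record-{i}" for i in range(1, 278)]  # 277 records, as in the scenario
CHUNK = 100

def list_records(resumption_token=None):
    """Toy archive: return (records, next_token) for one ListRecords call."""
    start = int(resumption_token) if resumption_token else 0
    end = start + CHUNK
    # No token is returned once the last chunk has been delivered.
    next_token = str(end) if end < len(RECORDS) else None
    return RECORDS[start:end], next_token

def harvest():
    """Harvester: pass each token back unchanged until none is returned."""
    harvested, requests, token = [], 0, None
    while True:
        chunk, token = list_records(token)
        requests += 1
        harvested.extend(chunk)
        if token is None:
            return harvested, requests

records, requests = harvest()
print(len(records), requests)  # 277 3
```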

OAI Demos Data providers –not really meant for end-user interaction, but Suleman’s “Repository Explorer” is an excellent tool registered data providers – –many being used for internal purposes; not registered Service providers –Arc, the first known SP harvesting from OAI data providers 3 registered service providers – –several more known to be in testing or creation

Field of Dreams It should be easy to be a data provider, even if it makes more work for the service provider. –if enough data providers exist, the service providers will come (DPs >> SPs) Open-source / freely available tools –“drop-in” data providers: industrial strength: personal size: –tools to make your existing DL a data provider: also: OAI-implementers mailing list / mail archive! –service providers: only bits and pieces currently publicly available...

OAI Observation: Front-End Only No input/registry mechanism –OAI harvesting protocol is always a front-end for something else: filesystem, Dienst, RDBMS, LDAP, etc. –convenient for pre-existing DLs, but does not address “new” DLs, e.g., “we want to do OAI” Bounds the scope of OAI –responsibilities and domain of OAI are still being discussed –tension between functionality and simplicity

OAI Observation: No T&C No terms & conditions provisions in protocol –assumes all metadata has uniform access rights; how to restrict metadata to certain hosts? –introducing T&C would increase the scope of application, but at the expense of simplicity; how expensive do we want to make a “just-a-front-end protocol”? maybe T&C is a good application for sets?

OAI Observation: No T&C Possible to use multiple OAI servers in a DMZ-like configuration…
Diagram: a public OAI server takes OAI requests from arbitrary hosts and a private OAI server takes OAI requests from trusted hosts; both draw on the source database (could even use a separate copy of the database…)
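One way to picture the DMZ-like configuration is as two views over the same source database, selected by who is asking. The host addresses, record identifiers and the `public` flag below are invented for this sketch; the protocol itself defines none of them.

```python
TRUSTED_HOSTS = {"10.0.0.5", "10.0.0.6"}  # hypothetical trusted harvesters

SOURCE_DB = [
    {"id": "oai:example:1", "public": True},
    {"id": "oai:example:2", "public": False},
]

def visible_records(client_host):
    """Private-server view for trusted hosts, public-server view otherwise."""
    if client_host in TRUSTED_HOSTS:
        return [r["id"] for r in SOURCE_DB]
    return [r["id"] for r in SOURCE_DB if r["public"]]

print(visible_records("10.0.0.5"))
print(visible_records("192.0.2.44"))
```

In a real deployment the two views would more likely be two separate OAI servers with network-level access control in front of the private one.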

OAI Observation: No T&C Possible to use OAI harvesting protocol in closed, restricted systems
Diagram: four DLs (OAI 1, OAI 2, OAI 3, OAI 4); all OAI requests originate from these 4 DLs

OAI Observation: Monolithic An OAI server has no protocol-defined concept of “other” OAI servers –backups, mirrors, etc. have to be resolved outside of the scope of OAI; scope vs. complexity again –fully connected graph of DLs harvesting from each other is unnecessary; cf. web crawlers vs. “gatherers” in U of Colorado’s Harvest System –3rd party harvesting interfaces raise more T&C and data coherency issues

302 Load Balancing Interactive users on main DL machine should not be impacted by metadata harvesting –don’t take deliveries through the front door –not part of the protocol; defined outside the protocol
Diagram: the harvester sends its request to the OAI server at naca.larc.nasa.gov/oai/; if load > 0.05, redirect request to the OAI server at buckets.dsi.internet2.edu/naca/oai/; HTTP Status Code …
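The redirect decision can be sketched as a small function. The two host paths and the 0.05 threshold come from the slide; the `http://` scheme, the function name and the way load is passed in are assumptions made for illustration.

```python
PRIMARY = "http://naca.larc.nasa.gov/oai/"             # scheme assumed
MIRROR = "http://buckets.dsi.internet2.edu/naca/oai/"  # scheme assumed
LOAD_THRESHOLD = 0.05  # threshold shown on the slide

def route_harvest_request(current_load, query_string):
    """Return (status, location) for a request arriving at PRIMARY.

    Under load, the harvester is sent to the mirror with a 302;
    otherwise the primary server answers the request itself.
    """
    if current_load > LOAD_THRESHOLD:
        return 302, f"{MIRROR}?{query_string}"
    return 200, None

print(route_harvest_request(0.5, "verb=Identify"))
```

Because this happens at the HTTP level, the harvester only needs to follow redirects; nothing in the OAI protocol changes.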

OAI Observation: Data Coherency In the interest of OAI implementer simplicity, several issues are left for the service provider to interpret –what is an update vs. addition? in the NACA OAI interface, they are reported as the same and it’s up to the harvesting system to figure it out –deletions? it is currently optional for OAI systems to mark records as deleted or not… –still left to the harvester to interpret
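A harvester might resolve update vs. addition itself by comparing incoming identifiers against its local store, roughly as below. The record layout (`identifier`, `deleted`, `title`) is a simplified stand-in chosen for this sketch, not the wire format.

```python
def apply_harvest(local, harvested):
    """Merge harvested records into a local store keyed by identifier.

    Returns counts of additions, updates and deletions as the harvester sees them.
    """
    counts = {"added": 0, "updated": 0, "deleted": 0}
    for rec in harvested:
        ident = rec["identifier"]
        if rec.get("deleted"):
            # Deletion marking is optional, so this branch may never fire.
            if local.pop(ident, None) is not None:
                counts["deleted"] += 1
        elif ident in local:
            local[ident] = rec
            counts["updated"] += 1
        else:
            local[ident] = rec
            counts["added"] += 1
    return counts

local = {"oai:x:1": {"identifier": "oai:x:1", "title": "old"}}
incoming = [
    {"identifier": "oai:x:1", "title": "new"},
    {"identifier": "oai:x:2", "title": "second"},
]
print(apply_harvest(local, incoming))
```

If the data provider never marks deletions, stale records can only be found by a fresh all-at-once harvest and a comparison against the local store.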

OAI Observation: Harvest Model Frequency of harvests –all-at-once harvests? initial harvest; resolving data coherency –frequent incremental harvests? far more efficient for both service and data providers Webcrawling vs. digital library models –webcrawlers: little to no a priori information about target –DLs: frequent harvesting of a small number of known targets Realization: we know very little about harvesting behavior… –are we optimizing for all-at-once, when incremental will be more common?
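The difference between the two harvest styles comes down to whether a `from` date is sent. A minimal sketch, assuming day-granularity dates written as YYYY-MM-DD and a harvester that remembers the date of its last successful harvest (the function name is hypothetical):

```python
from datetime import date

def incremental_args(last_harvest, today, metadata_prefix="dc"):
    """Arguments for a ListRecords request.

    With no previous harvest, 'from' is omitted, which gives an
    all-at-once harvest; otherwise only the window since the last
    harvest is requested.
    """
    args = {"metadataPrefix": metadata_prefix, "until": today.isoformat()}
    if last_harvest is not None:
        args["from"] = last_harvest.isoformat()
    return args

print(incremental_args(date(2001, 7, 1), date(2001, 7, 11)))
```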

Potentially Good Ideas (but we’re not sure yet) Sets –intuition: we’ll be glad we included them –arXiv the first to implement sets their DL is roughly built on “sets”, so it was an easy mapping for them a few other repositories have since adopted sets Flow control –harvesting == denial of service attack ? –is “resumptionToken” solution not enough? too much? need data providers with large collections and enough service providers to generate a load

Potentially Good Ideas (but we’re not sure yet) Metadata –Q: “Which format should I use?” A: any/all of them… –lowest common denominator: unqualified Dublin Core –Again, little known about actual behavior: will DC actually be useful? or too lossy? will communities create/adopt specific formats? will native (presumably richer) formats be harvested? we very much want this to happen... “The Return of MARC” ?!
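The "too lossy?" question can be made concrete with a small crosswalk from a native record to unqualified Dublin Core. The native field names here are invented for illustration; the targets (`title`, `creator`, `description`, `date`) are standard Dublin Core elements.

```python
# Hypothetical native field names mapped to unqualified Dublin Core elements.
CROSSWALK = {
    "report_title": "title",
    "authors": "creator",
    "abstract": "description",
    "pub_date": "date",
}

def to_dublin_core(native):
    """Return (dc_record, dropped_fields) for one native record."""
    dc, dropped = {}, []
    for field, value in native.items():
        if field in CROSSWALK:
            dc[CROSSWALK[field]] = value
        else:
            dropped.append(field)  # no DC target in this crosswalk: lost
    return dc, dropped

native = {
    "report_title": "CFD applications",
    "authors": "A. Author",
    "wind_tunnel_facility": "Example Tunnel",  # richer native detail
}
print(to_dublin_core(native))
```

Anything that ends up in `dropped_fields` is exactly the information a DC-only harvester never sees, which is why harvesting native formats alongside DC is attractive.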

XML Observations Not too much of a problem for data providers –XML is easier to write than read Service providers… –XML can be pretty picky… a large “ListRecords” result can be invalidated with a single error harvest in chunks? individual records? –author contributed metadata particularly a problem (e.g. control characters from copy-n-paste) –one advantage of resumptionToken is that it compartmentalizes bad data
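Both points above (a single error invalidating a whole response, and chunking as a way to contain bad data) can be seen with Python's standard XML parser. The XML below is a made-up, simplified stand-in for a harvested chunk, not the actual OAI response schema.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that XML 1.0 does not allow (tab, LF and CR are allowed).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def clean(text):
    """Strip control characters, e.g. from copy-n-paste author metadata."""
    return CONTROL_CHARS.sub("", text)

def parse_chunks(chunks):
    """Parse each chunk on its own so one bad chunk does not spoil the rest."""
    good, bad = [], []
    for i, xml_text in enumerate(chunks):
        try:
            good.append(ET.fromstring(xml_text))
        except ET.ParseError:
            bad.append(i)
    return good, bad

chunks = [
    "<records><title>ok</title></records>",
    "<records><title>pasted\x08text</title></records>",  # stray backspace
]
good, bad = parse_chunks(chunks)
print(len(good), bad)
```

Cleaning on the harvester side is a workaround; the cleaner fix is for the data provider to filter such characters before emitting XML.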

Current NTRS / NIX Architecture NASA-wide page that federates N center/project specific servers through distributed searching
Diagram: a user searches NTRS/NIX for “cfd applications”; NTRS/NIX passes the same search for “cfd applications” on to each node; each node independently maintained

Current NTRS / NIX Architecture Or users can interact directly with the nodes of NTRS/NIX…
Diagram: a user sends a search for “cfd applications” directly to individual nodes of NTRS/NIX

Proposed Strategy: Data Providers Reduce the high interoperability expectations of distributed searching… Each current node of NTRS, NIX and other NASA DLs become an OAI “data provider” –LTRS & NACA already have test OAI interfaces LTRS NACA –each node is free to run their own software / architecture / system / etc., but the method of metadata exposure is standardized very low interoperability requirements each node can continue to have a “user interface”

Proposed Strategy: Service Providers NTRS, NIX and other well known, “destination DLs” become OAI service providers –no longer relying on distributed searching –harvest metadata from their constituent data providers –provide their value added services on local copies of the metadata data remains resident at the local data providers

NTRS OAI Architecture
Diagram: a user searches NTRS for “cfd applications”; NTRS holds a local copy of metadata, harvested offline through the OAI interface from the nodes (LTRS, ATRS, GTRS, CASITRS, ...)
–all searching, browsing, etc. performed on the metadata here (at NTRS)
–content (reports) remain archived at the local sites
–each node independently maintained
–individual nodes can still support direct user interaction
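The split between offline harvesting and local searching can be sketched in a few lines. The node contents, URLs and the naive substring search are all invented for this illustration; a real service provider would harvest through each node's OAI interface and use a proper index.

```python
def harvest_all(nodes):
    """Offline step: copy metadata from every node into one local list."""
    local = []
    for node_name, records in nodes.items():
        for rec in records:
            local.append({"node": node_name, **rec})
    return local

def search(local, query):
    """Online step: search only the local copy; no node is contacted."""
    terms = query.lower().split()
    return [r for r in local if all(t in r["title"].lower() for t in terms)]

# Hypothetical node contents; the URLs point back to the local sites.
nodes = {
    "LTRS": [{"title": "CFD applications in wing design",
              "url": "http://example.org/ltrs/1"}],
    "ATRS": [{"title": "Rotorcraft acoustics",
              "url": "http://example.org/atrs/7"}],
}
hits = search(harvest_all(nodes), "cfd applications")
print([(h["node"], h["url"]) for h in hits])
```

Only metadata is copied; each hit still points to a report held at the node that owns it.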

Additional Models First step –OAI interfaces for data providers –DLs use OAI interfaces to move from distributed searching to metadata harvesting Other possibilities –hierarchical harvesting: exposing metadata to other, possibly non-NASA DLs; harvesting from other, possibly non-NASA DLs –multi-genre DLs –re-apply the OAI protocol for harvesting / replicating content (not just metadata) –3rd party service providers

NASA DLs in the Larger STI Realm
Diagram: NTRS with its nodes (LTRS, ATRS, CASITRS, …) alongside DOE, DOD, Universities, Publishers, ..., International; this could be a fully connected graph
NTRS could also be a data provider from the point of view of other DLs, allowing the harvesting of NASA report metadata. NTRS could also harvest metadata from other DLs, and provide access to non-NASA content. We hope to influence the direction of the science.gov effort to use OAI.

New Kinds of DLs Drawing from the same pool of DPs –different interfaces, capabilities and collection policies for: public affairs K-12 education science & research authors / librarians / managers –NTRS and NIX could harvest from the same sources… be the same DL, but with different interfaces? be replaced with a new, all-encompassing DL? –DL creators can now focus on collection management “ala carting” their collections and sub collections instead of fussing over syntax synchronization of remote search services

A Generic Harvesting Protocol The actual uses of OAI depend on your relative position and concerns: –What is metadata vs. data? –Who is a SP vs. a DP? Multiple OAI interfaces make many things possible: –restricted / public interfaces –Arc-like description of harvested archives –updates of log files, authority lists, etc. Additional services can be built on top of OAI –content replication –awareness services

OAI Impact Lightweight interoperability protocol –an OAI layer is added to your existing DL Separation of responsibilities –service providers –data providers