Published by Vivian Baldwin. Modified over 8 years ago.
1
OAI: XML-Based Digital Library Interoperability Michael L. Nelson NASA Langley Research Center m.l.nelson@larc.nasa.gov http://mln.larc.nasa.gov/~mln/ CENDI: Federal STI Managers Meeting April 3, 2002
2
Background I met Herbert Van de Sompel in April 1999... –we spoke of a demonstration project he had in mind, for which he had received sponsorship from Paul Ginsparg and Rick Luce –we wanted to demonstrate a multi-disciplinary DL that leveraged the large number of high-quality, yet often isolated, tech report servers, e-print servers, etc. Most DLs had grown up along single disciplines –little to no interoperability, “gardens” of DLs
3
Universal Preprint Service A cross-archive DL that provides services on a collection of metadata harvested from multiple archives –based on NCSTRL+; a modified version of Dienst support for “clustering” support for “buckets” Demonstrated at Santa Fe NM, October 21-22, 1999 –http://ups.cs.odu.edu/ –D-Lib Magazine, 6(2) 2000 (2 articles) http://www.dlib.org/dlib/february00/02contents.html –UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/
4
Data and Service Providers Data Providers –publishing into an archive –providing methods for metadata “harvesting” provide non-technical context for sharing information also Service Providers –harvest metadata from providers –implement user interface to data Even if these are done by the same DL, these are distinct roles Self-describing archives –Much of the learning about the constituent UPS archives occurred out of band…
5
Metadata Harvesting Move away from distributed searching Extract metadata from various sources Build services on local copies of metadata –data remains at remote repositories [Figure: a user searches for “cfd applications” against a local copy of metadata; metadata is harvested offline from each node; each node independently maintained; all searching, browsing, etc. performed on the local metadata; individual nodes can still support direct user interaction]
6
Result… OAI The OAI was the result of the demonstration and discussion during the Santa Fe meeting Initial focus was on federating collections of scholarly e-print materials… …however, interest grew and the scope and application of OAI expanded to become a generic bulk metadata transport protocol Note: –OAI is only about metadata -- not full text! –OAI is neutral with respect to the nature of the metadata or the resources the metadata describes read: commercial publishers have an interest in OAI too...
7
Open Archives Initiative The protocol is openly documented, and metadata is “exposed” to at least some peer group (note: rights management can still apply!) Archive defined as a “collection of stuff” -- not the archivist’s definition of “archive”. “Repository” used in most OAI documents. OAI is happening at break-neck speed...
8
OAI Mechanics Request is encoded in http Response is encoded in XML XML Schemas for the responses are defined in the OAI-PMH document
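The request/response split above can be sketched in a few lines of Python. This is only an illustration: the base URL `http://example.org/oai`, the helper names `build_request` and `repository_name`, and the hand-written sample response (simplified, without namespaces or the envelope a real archive would return) are all assumptions, not part of the presentation.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "http://example.org/oai"  # hypothetical archive address


def build_request(verb, args=None):
    """Encode an OAI verb and its arguments as an HTTP GET URL."""
    params = {"verb": verb}
    params.update(args or {})
    return BASE_URL + "?" + urlencode(params)


# Hand-written, simplified stand-in for an Identify response (assumption).
SAMPLE_RESPONSE = """<Identify>
  <repositoryName>Example Report Server</repositoryName>
  <baseURL>http://example.org/oai</baseURL>
</Identify>"""


def repository_name(xml_text):
    """Pull the archive's name out of the XML response."""
    return ET.fromstring(xml_text).findtext("repositoryName")


print(build_request("Identify"))
print(build_request("GetRecord",
                    {"identifier": "oai:example:1", "metadataPrefix": "oai_dc"}))
print(repository_name(SAMPLE_RESPONSE))
```

Arguments are passed as a dictionary rather than keyword arguments because one of the protocol's date arguments is named `from`, which is a reserved word in Python.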
9
Overview of OAI Verbs
Verb                  Function
Identify              description of archive
ListMetadataFormats   metadata formats supported by archive
ListSets              sets defined by archive
ListIdentifiers       OAI unique ids contained in archive
ListRecords           listing of N records
GetRecord             listing of a single record
Groupings labeled on the slide: archival metadata; harvesting verbs
most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
10
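resumptionToken
scenario: harvesting 277 records in 3 separate 100 record “chunks” (a harvester talking to an archive backed by an RDBMS)
harvester → archive: ListRecords
archive → harvester: Records 1-100, resumptionToken=AXad31
harvester → archive: ListRecords, resumptionToken=AXad31
archive → harvester: Records 101-200, resumptionToken=pQ22-x
harvester → archive: ListRecords, resumptionToken=pQ22-x
archive → harvester: Records 201-277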
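The 277-record scenario can be simulated without a network. The sketch below is an assumption-laden toy: `make_archive` is a hypothetical in-memory archive whose token syntax (`offset:N`) is invented for the example, and `harvest` is a hypothetical harvester that treats the token as opaque and simply hands it back.

```python
def make_archive(records, chunk_size=100):
    """Return a list_records(token) function that serves records in chunks."""
    def list_records(resumption_token=None):
        # Token syntax is archive-defined; "offset:N" is made up for this sketch.
        start = 0 if resumption_token is None else int(resumption_token.split(":")[1])
        chunk = records[start:start + chunk_size]
        next_start = start + chunk_size
        token = "offset:%d" % next_start if next_start < len(records) else None
        return chunk, token
    return list_records


def harvest(list_records):
    """Keep asking until the archive stops returning a resumptionToken."""
    harvested, requests, token = [], 0, None
    while True:
        chunk, token = list_records(token)  # token passed back verbatim
        requests += 1
        harvested.extend(chunk)
        if token is None:
            return harvested, requests


records = ["record-%d" % i for i in range(1, 278)]
harvested, requests = harvest(make_archive(records))
print(len(harvested), "records in", requests, "requests")
```

Only the archive ever looks inside the token; the harvester's loop would work unchanged if the archive switched to tokens such as `AXad31`.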
11
OAI Links & Demos Data providers –not really meant for end-user interaction, but Suleman’s “Repository Explorer” is an excellent tool http://purl.org/net/oai_explorer 50+ registered data providers –http://oaisrv.nsdl.cornell.edu/Register/BrowseSites.pl –many being used for internal purposes; not registered Service providers –http://www.openarchives.org/service/listproviders.html University (Arc, Torii, NCSTRL, Citebase) Commercial (Scirus (Elsevier), my.OAI) Others known to be in the works –e.g., Technical Report Interchange Project (NASA, LANL, Sandia, AFRL)
12
Field of Dreams It should be easy to be a data provider, even if it makes more work for the service provider. –if enough data providers exist, the service providers will come (DPs >> SPs) Open-source / freely available tools –“drop-in” data providers: industrial strength: http://www.eprints.org/ personal size: http://kepler.cs.odu.edu/ –tools to make your existing DL a data provider: http://www.openarchives.org/tools/tools.htm also: OAI-implementers mailing list / mail archive! –service providers: only bits and pieces currently publicly available...
13
OAI Observation: Front-End Only No input/registry mechanism –OAI harvesting protocol is always a front-end for something else filesystem, Dienst, RDBMS, LDAP, etc. –convenient for pre-existing DLs, but does not address “new” DLs e.g., “we want to do OAI” Bounds the scope of OAI –responsibilities and domain of OAI are still being discussed –tension between functionality and simplicity
14
OAI Observation: No T&C Possible to use multiple OAI servers in a DMZ-like configuration… Public OAI Server Private OAI Server Source database OAI requests from trusted hosts OAI requests from arbitrary hosts could even use a separate copy of the database…
15
OAI Observation: No T&C Possible to use OAI harvesting protocol in closed, restricted systems OAI 1 OAI 2 OAI 3 OAI 4 all OAI requests originate from these 4 DLs
16
Metadata –Q: “Which format should I use?” A: any/all of them… –lowest common denominator: unqualified Dublin Core –Again, little known about actual behavior will DC actually be useful? or too lossy? will communities create/adopt specific formats? will native (presumably richer) formats be harvested? we very much want this to happen... “The Return of MARC” ?!
17
XML Observations Service providers… –XML can be pretty picky… a large “ListRecords” result can be invalidated with a single error harvest in chunks? individual records? –author contributed metadata particularly a problem (e.g. control characters from copy-n-paste) –one advantage of resumptionToken is that it compartmentalizes bad data
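A small experiment shows why chunked harvesting limits the damage. The three chunks below are hand-written, simplified stand-ins for `ListRecords` responses (an assumption); the second contains a vertical-tab control character of the kind copy-and-paste can introduce, which XML 1.0 does not allow. `parse_chunks` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins for three ListRecords responses (assumption).
CHUNKS = [
    "<ListRecords><record><title>A</title></record></ListRecords>",
    "<ListRecords><record><title>B\x0b</title></record></ListRecords>",  # control char
    "<ListRecords><record><title>C</title></record></ListRecords>",
]


def parse_chunks(chunks):
    """Parse each chunk on its own so one bad chunk does not sink the rest."""
    good_titles, bad_chunks = [], []
    for index, xml_text in enumerate(chunks):
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError:
            bad_chunks.append(index)  # whole chunk rejected by the parser
            continue
        good_titles.extend(rec.findtext("title") for rec in root.iter("record"))
    return good_titles, bad_chunks


titles, bad = parse_chunks(CHUNKS)
print(titles, bad)
```

Had the three chunks arrived as one large response, the single bad character would have made all three records unparseable.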
18
NTRS OAI Architecture [Figure: a user searches for “cfd applications” against NTRS, which holds a local copy of metadata harvested offline, through the OAI interface, from LTRS, ATRS, GTRS and CASITRS] each node independently maintained individual nodes can still support direct user interaction all searching, browsing, etc. performed on the metadata at NTRS content (reports) remain archived at the local sites
19
Additional Models First step –OAI interfaces for data providers –DLs use OAI interfaces to move from distributed searching to metadata harvesting Other possibilities –hierarchical harvesting exposing metadata to other, possibly non-NASA DLs harvesting from other, possibly non-NASA DLs –multi-genre DLs –re-apply the OAI protocol for harvesting / replicating content (not just metadata) –3rd party service providers
20
NASA DLs in the Larger STI Realm [Figure nodes: NTRS, LTRS, ATRS, CASITRS, …, DOE, DOD, Universities, Publishers, ..., International; this could be a fully connected graph] NTRS could also be a data provider from the point of view of other DLs, allowing the harvesting of NASA report metadata. NTRS could also harvest metadata from other DLs, and provide access to non-NASA content. We hope to influence the direction of the science.gov effort to use OAI.
21
A Generic Harvesting Protocol The actual uses of OAI depend on your relative position and concerns: –What is metadata vs. data? –Who is a SP vs. a DP? Multiple OAI interfaces make many things possible: –restricted / public interfaces –Arc-like description of harvested archives –updates of log files, authority lists, etc. Additional services can be built on top of OAI –content replication –awareness services
22
OAI Impact Lightweight interoperability protocol –an OAI layer is added to your existing DL Separation of responsibilities –service providers –data providers http://www.openarchives.org/
23
Emergency Backup Slides
24
Open Archives Initiative (OAI) vs. Open Archival Information System (OAIS) http://www.dlib.org/dlib/april01/04editorial.html http://www.dlib.org/dlib/may01/05letters.html http://ssdoo.gsfc.nasa.gov/nost/isoas/us/overview.html OAI: exposure of metadata for harvesting OAIS: ensuring long-term preservation of archival materials [Figure: an OAIS; an OAIS w/ an OAI interface]
25
The Rise and Fall of Distributed Searching wholesale distributed searching, popular at the time, is attractive in theory but troublesome in practice –Davis & Lagoze, JASIS 51(3), pp. 273-80 –Powell & French, Proc 5th ACM DL, pp. 264-265 distributed searching of N nodes still viable, but only for small values of N NCSTRL: N > 100; bad NTRS/NIX: N<=20; ok (but could be better)
26
The Rise and Fall of Distributed Searching Other problems of distributed searching (from STARTS) –source-metadata problem how do you know which nodes to search? –query-language problem syntax varies and drifts over time between the various nodes –rank-merging problem how do you meaningfully merge multiple result sets? Temptations: –centralize all functions “everything will be done at X” –standardize on a single product “everyone will use system Y”
27
Metadata Harvesting Getting metadata out of archives –not all archives support metadata extraction some archives have undocumented metadata extraction procedures –not all archives support rich criteria for extraction single dump concept only Intellectual property and use rights not always clear –many policies akin to “don’t ask, don’t tell”
28
OAI Metadata Harvesting Protocol Then: –OAI harvesting protocol originally a subset of the Dienst (NCSTRL) protocol and originally called the “Santa Fe Convention” –originally defined an OAI-specific metadata format Now: –OAI metadata format dropped in favor of unqualified Dublin Core other formats possible, but DC is required as lowest common denominator –No longer dependent on Dienst (Cornell CS TR 95-1514) defined independently (though still easily mappable)
29
Dublin Core Dublin Core Metadata Initiative –http://www.dublincore.org/ –from 1994-1995, recognizing the need for simple, interoperable metadata for resource discovery –good overview of metadata & DC: http://www.dlib.org/dlib/january01/lagoze/01lagoze.html –15 elements (qualifiers possible)
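To make "15 elements" concrete, the sketch below lists the unqualified Dublin Core element names and shows what is lost when a richer record is squeezed into them. The native record, its field names (`report_number`, `contract`) and the helper `to_unqualified_dc` are invented for illustration and do not describe any real archive's format.

```python
# The 15 unqualified Dublin Core elements.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_unqualified_dc(native):
    """Keep only fields that are DC elements; report what was dropped."""
    dc = {key: value for key, value in native.items() if key in DC_ELEMENTS}
    dropped = sorted(key for key in native if key not in DC_ELEMENTS)
    return dc, dropped


# Hypothetical native record with two fields that DC has no slot for.
native = {
    "title": "Example report",
    "creator": "A. Author",
    "date": "2002-04-03",
    "report_number": "EX-0001",
    "contract": "EX-C-42",
}
dc, dropped = to_unqualified_dc(native)
print(dc)
print("dropped:", dropped)
```

A real mapping would usually fold such fields into an existing element (for example `identifier`) rather than discard them, which is exactly where the lossiness question arises.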
30
Flow Control ListSets, ListIdentifiers, ListRecords are all allowed to return partial responses, via a combination of: –resumptionToken – an opaque, archive-defined data string that when passed back to the archive allows the response to begin where it left off each archive defines their own resumptionToken syntax; it may have visible semantics or not –503 http status code – “retry after” up to the harvester to understand this code and respect it, and up to the archive to enforce it
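The harvester's side of the 503 "retry after" contract might look like the sketch below. Everything here is hypothetical scaffolding: `fetch_with_retry` is an invented helper, `fetch` and `sleep` are injected so the example runs without a network or real delays, and the `Retry-After` value is assumed to be a number of seconds (HTTP also allows a date there, which this sketch does not handle).

```python
def fetch_with_retry(fetch, url, sleep, max_attempts=5):
    """Call fetch(url); on a 503, wait as long as the archive asks, then retry."""
    for _ in range(max_attempts):
        status, headers, body = fetch(url)
        if status != 503:
            return status, body
        # Assumption: Retry-After is given in seconds; default to 60 if absent.
        sleep(int(headers.get("Retry-After", 60)))
    raise RuntimeError("archive still busy; giving up")


# Simulated archive: busy twice, then answers.
responses = iter([
    (503, {"Retry-After": "30"}, ""),
    (503, {"Retry-After": "10"}, ""),
    (200, {}, "<ListRecords/>"),
])
waits = []  # records the requested delays instead of actually sleeping
status, body = fetch_with_retry(
    lambda url: next(responses),
    "http://example.org/oai?verb=ListRecords",  # hypothetical URL
    waits.append,
)
print(status, waits)
```

In a real harvester `sleep` would be `time.sleep`; injecting it keeps the sketch testable and makes the "respect it" obligation explicit.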
31
OAI Observation: No T&C No terms & conditions provisions in protocol –assumes all metadata has uniform access rights how to restrict metadata to certain hosts? –introducing T&C would increase the scope of application, but at the expense of simplicity how expensive do we want to make a “just-a-front-end protocol” ? maybe T&C is a good application for sets?
32
OAI Observation: Monolithic An OAI server has no protocol-defined concept of “other” OAI servers –backups, mirrors, etc. have to be resolved outside of the scope of OAI scope vs. complexity again –fully connected graph of DLs harvesting from each other is unnecessary cf. web crawlers vs. “gathers” in U of Colorado’s Harvest System –3rd party harvesting interfaces raise more T&C and data coherency issues
33
OAI Observation: Data Coherency In the interest of OAI implementer simplicity, several issues are left for the service provider to interpret –what is an update vs. addition? in the NACA OAI interface, they are reported as the same and it’s up to the harvesting system to figure it out –deletions? it is currently optional for OAI systems to mark records as deleted or not… –still left to the harvester to interpret
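One way a service provider might "figure it out" is to key its local store on the record identifier, as in the sketch below. The record layout (a dict with `identifier`, `title` and an optional `deleted` flag) and the helper `apply_harvest` are assumptions made for the example; an archive that does not mark deletions would never send the third kind of record at all.

```python
def apply_harvest(store, records):
    """Merge harvested records into a local store keyed by identifier."""
    counts = {"added": 0, "updated": 0, "deleted": 0}
    for record in records:
        ident = record["identifier"]
        if record.get("deleted"):
            # Only possible when the archive chooses to report deletions.
            if store.pop(ident, None) is not None:
                counts["deleted"] += 1
        elif ident in store:
            store[ident] = record  # seen before: treat as an update
            counts["updated"] += 1
        else:
            store[ident] = record  # new identifier: treat as an addition
            counts["added"] += 1
    return counts


store = {"oai:example:1": {"identifier": "oai:example:1", "title": "Old title"}}
counts = apply_harvest(store, [
    {"identifier": "oai:example:1", "title": "New title"},
    {"identifier": "oai:example:2", "title": "Second report"},
    {"identifier": "oai:example:3", "deleted": True},
])
print(counts)
```

The archive sends additions and updates in the same form; the distinction exists only on the harvester's side, in whether the identifier is already known.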
34
Current NTRS / NIX Architecture NASA-wide page that federates N center/project specific servers through distributed searching [Figure: a user searches NTRS/NIX for “cfd applications”; the same search is forwarded to each node; each node independently maintained] http://techreports.larc.nasa.gov/cgi-bin/NTRS http://nix.nasa.gov/
35
Proposed Strategy: Data Providers Reduce the high interoperability expectations of distributed searching… Each current node of NTRS, NIX and other NASA DLs become an OAI “data provider” –LTRS & NACA already have test OAI interfaces LTRS http://techreports.larc.nasa.gov/ltrs/oai/ NACA http://naca.larc.nasa.gov/oai/ –each node is free to run their own software / architecture / system / etc., but the method of metadata exposure is standardized very low interoperability requirements each node can continue to have a “user interface”
36
OAI Observation: Harvest Model Frequency of harvests –all-at-once harvests? initial harvest resolving data coherency –frequent incremental harvests? far more efficient for both service and data providers Webcrawling vs. digital library models –webcrawlers: little to no a priori information about target –DLs: frequent harvesting of a small number of known targets Realization: we know very little about harvesting behavior… –are we optimizing for all-at-once, when incremental will be more common?
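The difference between the two harvest styles can be seen on a toy archive. The three records, their datestamps and the function `list_records` are made up for this sketch; the `from_date` parameter stands in for the protocol's date argument.

```python
from datetime import date

# Hypothetical archive contents with per-record datestamps.
ARCHIVE = [
    {"identifier": "oai:example:1", "datestamp": date(2002, 1, 15)},
    {"identifier": "oai:example:2", "datestamp": date(2002, 2, 20)},
    {"identifier": "oai:example:3", "datestamp": date(2002, 3, 28)},
]


def list_records(archive, from_date=None):
    """Return every record, or only those stamped on or after from_date."""
    return [
        record for record in archive
        if from_date is None or record["datestamp"] >= from_date
    ]


all_at_once = list_records(ARCHIVE)                              # initial harvest
incremental = list_records(ARCHIVE, from_date=date(2002, 3, 1))  # follow-up harvest
print(len(all_at_once), len(incremental))
```

An all-at-once harvest transfers the whole archive every time, while the incremental one transfers only what changed since the last visit, which is the efficiency argument made above.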