Published by Karin Baker. Modified over 8 years ago.
1
ODU CS 751/851 Fall 2006 Michael L. Nelson mln@cs.odu.edu Introduction to Digital Libraries Week 9: The Open Archives Initiative Old Dominion University Department of Computer Science CS 751/851 Fall 2006 Michael L. Nelson 10/25/06 several slides borrowed from Van de Sompel, Liu, Warner & Harrison
2
The Rise and Fall of Distributed Searching
Wholesale distributed searching, popular at the time, is attractive in theory but troublesome in practice
– Davis & Lagoze, JASIS 51(3), pp. 273-80
– Powell & French, Proc. 5th ACM DL, pp. 264-265
Distributed searching of N nodes is still viable, but only for small values of N (<= 10)
– NCSTRL: N > 100; bad
3
The Rise and Fall of Distributed Searching
Other problems of distributed searching (from STARTS)
– source-metadata problem: how do you know which nodes to search?
– query-language problem: syntax varies and drifts over time between the various nodes
– rank-merging problem: how do you meaningfully merge multiple result sets?
Temptations:
– centralize all functions: “everything will be done at X”
– standardize on a single product: “everyone will use system Y”
4
Universal Preprint Service
Demonstrated at Santa Fe NM, October 21-22, 1999
– http://ups.cs.odu.edu/
– D-Lib Magazine, 6(2) 2000 (2 articles): http://www.dlib.org/dlib/february00/02contents.html
– UPS was soon renamed the Open Archives Initiative (OAI): http://www.openarchives.org/
Based on NCSTRL+ software, it is a cross-archive DL that provides services on a collection of content harvested from multiple archives
– NCSTRL+ is a modified version of Dienst
  support for “clustering”
  support for “buckets”
5
UPS Participants (totals ca. July 1999)
6
Metadata formats
project: arXiv, CogPrints, NACA, NCSTRL, NDLTD, RePEc
format: internal, Refer, RFC1807, MARC, ReDIF
7
Metadata extraction
Getting metadata out of archives
– not all archives support metadata extraction
  some archives have undocumented metadata extraction procedures
– not all archives support rich criteria for extraction
  single dump concept only
Intellectual property and use rights are not always clear
8
Data and Service Providers
Data Providers
– publishing into an archive
– providing methods for metadata “harvesting”
– also providing non-technical context for sharing information
Service Providers
– harvest metadata from providers
– implement user interface to data
Even if provided by the same DL, these are distinct functions
9
Result… OAI
The OAI was the result of the demonstration and discussion during the Santa Fe meeting
Lots of churn regarding what the OAI was
– OAI-PMH was originally a subset of the Dienst (NCSTRL) protocol and was originally called the “Santa Fe Convention”
– it originally defined an OAI-specific metadata format
10
Santa Fe convention | OAI-PMH v.1.0/1.1 | OAI-PMH v.2.0
nature: experimental | experimental | stable
verbs: Dienst | OAI-PMH | OAI-PMH
requests: HTTP GET/POST (all three)
responses: XML (all three)
transport: HTTP (all three)
metadata: OAMS | unqualified Dublin Core | unqualified Dublin Core
about: eprints | document like objects | resources
model: metadata harvesting | metadata harvesting | metadata harvesting
11
Open Archives Initiative
The protocol is openly documented, and metadata is “exposed” to at least some peer group (note: rights management can still apply!)
Archive is defined as a “collection of stuff”, not the archivist’s definition of “archive”. “Repository” is used in most OAI documents.
Needed a TLA…
12
OAI-PMH Actors
data providers / repositories:
– “A repository is a network accessible server that can process the 6 OAI-PMH requests in the manner described in [the OAI-PMH document]. A repository is managed by a data provider to expose metadata to harvesters.”
service providers / harvesters:
– “A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.”
13
Data Providers / Service Providers
[Diagram: data providers (repositories) and service providers (harvesters)]
14
OAI-PMH Data Model
resource
item: all available metadata about David
records: Dublin Core metadata, MARC metadata, SPECTRUM metadata
item = identifier
record = identifier + metadata format + datestamp
set-membership is an item-level property
15
OAI-PMH characteristics: Typical Repository
OAI-PMH Entity | value | description
Resource | URL | PDF, PS, XML, HTML or other file
Item: identifier | OAI Identifier | DNS-based name of metadata about resource
Item: set membership | LCSH | Library of Congress Subject Heading
Record: metadataPrefix | oai_dc | bibliographic metadata in Dublin Core
Record: datestamp | 2004-07-31 | modification date of DC record
Record: metadataPrefix | oai_marc | bibliographic metadata in MARC
Record: datestamp | 2004-07-31 | modification date of MARC record
16
Overview of OAI-PMH Verbs
Verb | Function
Identify | description of repository
ListMetadataFormats | metadata formats supported by repository
ListSets | sets defined by repository
ListIdentifiers | OAI unique ids contained in repository
ListRecords | listing of N records
GetRecord | listing of a single record
Identify, ListMetadataFormats and ListSets return metadata about the repository; ListIdentifiers, ListRecords and GetRecord are the harvesting verbs
most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
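Since every verb is just an HTTP request with a `verb` parameter plus arguments, request construction can be sketched in a few lines of Python. This is a minimal illustration, not a harvester: `build_request` and the `from_` spelling are names made up here, and the base URL is the arXiv one that appears elsewhere in these slides.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH.
VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord"}


def build_request(base_url, verb, **args):
    """Return the HTTP GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for key, value in args.items():
        # 'from' is a Python keyword, so callers write from_ instead.
        params[key.rstrip("_")] = value
    return f"{base_url}?{urlencode(params)}"


url = build_request("http://arXiv.org/oai2", "GetRecord",
                    identifier="oai:arXiv:cs/0112017",
                    metadataPrefix="oai_dc")
```

Note that `urlencode` percent-encodes the colons and slash in the identifier, which is what the request on the wire should carry.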
17
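Argument Summary
arguments: metadataPrefix, from, until, set, resumptionToken, identifier
Identify: no arguments
ListMetadataFormats: identifier optional
ListSets: resumptionToken exclusive
ListIdentifiers: metadataPrefix required; from, until, set optional; resumptionToken exclusive
ListRecords: metadataPrefix required; from, until, set optional; resumptionToken exclusive
GetRecord: identifier required; metadataPrefix required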
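The argument rules lend themselves to a table-driven check. The sketch below is one possible encoding, with the required arguments filled in from the OAI-PMH 2.0 specification; `ARGUMENTS` and `check_arguments` are illustrative names, and a real repository would also validate the argument values.

```python
# verb -> (required, optional, exclusive); "exclusive" means the argument
# must appear alone (apart from the verb itself).
ARGUMENTS = {
    "Identify": (set(), set(), None),
    "ListMetadataFormats": (set(), {"identifier"}, None),
    "ListSets": (set(), set(), "resumptionToken"),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"},
                        "resumptionToken"),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"},
                    "resumptionToken"),
    "GetRecord": ({"identifier", "metadataPrefix"}, set(), None),
}


def check_arguments(verb, args):
    """Return None if the argument names are legal, else an error code."""
    if verb not in ARGUMENTS:
        return "badVerb"
    required, optional, exclusive = ARGUMENTS[verb]
    names = set(args)
    if exclusive in names:
        # An exclusive argument may not be combined with any other.
        return None if names == {exclusive} else "badArgument"
    if not required <= names or names - required - optional:
        return "badArgument"
    return None
```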
18
Error Summary
Identify: BA
ListMetadataFormats: BA, NMF, IDDNE
ListSets: BA, BRT, NSH
ListIdentifiers: BA, BRT, CDF, NRM, NSH
ListRecords: BA, BRT, CDF, NRM, NSH
GetRecord: BA, CDF, IDDNE
BA = badArgument, BRT = badResumptionToken, CDF = cannotDisseminateFormat, IDDNE = idDoesNotExist, NMF = noMetadataFormats, NRM = noRecordsMatch, NSH = noSetHierarchy
Generate badVerb on any input not matching the 6 defined verbs
this is an inversion of the table in section 3.6 of the OAI-PMH specification
19
response no errors
<responseDate>2002-02-08T08:55:46Z</responseDate>
<request …>http://arXiv.org/oai2</request>
<header>
  <identifier>oai:arXiv:cs/0112017</identifier>
  <datestamp>2001-12-14</datestamp>
  <setSpec>cs</setSpec>
  <setSpec>math</setSpec>
</header>
…..
note: no HTTP encoding of the OAI-PMH request
20
response with error
<responseDate>2002-02-08T08:55:46Z</responseDate>
<request>http://arXiv.org/oai2</request>
<error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
with errors, only the correct attributes are echoed in the request element
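A harvester usually checks each response for error elements before looking at anything else. The following sketch does that with the standard library; the sample document is a hand-written reduction of an OAI-PMH 2.0 error response, and `find_errors` is a name chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response elements, in ElementTree's {uri}tag form.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def find_errors(xml_text):
    """Return (code, message) pairs for every error element in a response."""
    root = ET.fromstring(xml_text)
    return [(e.get("code"), (e.text or "").strip())
            for e in root.iter(OAI_NS + "error")]


# Hand-written sample, trimmed to the elements relevant to this sketch.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""

errors = find_errors(SAMPLE)
```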
21
harvesting granularity
mandatory support of YYYY-MM-DD
optional support of YYYY-MM-DDThh:mm:ssZ
– other granularities considered, but ultimately rejected
granularity of from and until must be the same
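The two legal granularities and the "same granularity" rule can be checked with two regular expressions. This is a sketch only: it checks the shape of the datestamps, not whether they are real calendar dates, and the function names are illustrative. Returning badArgument for a mismatch follows the OAI-PMH 2.0 specification.

```python
import re

DAY = re.compile(r"\d{4}-\d{2}-\d{2}")
SECONDS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def granularity(stamp):
    """Return the granularity pattern a datestamp uses, or None if malformed."""
    if DAY.fullmatch(stamp):
        return "YYYY-MM-DD"
    if SECONDS.fullmatch(stamp):
        return "YYYY-MM-DDThh:mm:ssZ"
    return None


def check_range(from_, until):
    """Return None if from/until are well formed and share a granularity."""
    g_from, g_until = granularity(from_), granularity(until)
    if g_from is None or g_until is None or g_from != g_until:
        return "badArgument"
    return None
```

A harvester would additionally compare the chosen granularity against what the repository advertises in its Identify response.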
22
header contains set membership of item
<header>
  <identifier>oai:arXiv:cs/0112017</identifier>
  <datestamp>2001-12-14</datestamp>
  <setSpec>cs</setSpec>
  <setSpec>math</setSpec>
</header>
…..
eliminates the need for the “double harvest” that 1.x required to get all records and all set information
23
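ListIdentifiers returns headers
<responseDate>2002-02-08T08:55:46Z</responseDate>
<request …>http://arXiv.org/oai2</request>
<header>
  <identifier>oai:arXiv:hep-th/9801001</identifier>
  <datestamp>1999-02-23</datestamp>
  <setSpec>physic:hep</setSpec>
</header>
<header>
  <identifier>oai:arXiv:hep-th/9801002</identifier>
  <datestamp>1999-03-20</datestamp>
  <setSpec>physic:hep</setSpec>
  <setSpec>physic:exp</setSpec>
</header>
……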
24
Flow Control
ListSets, ListIdentifiers, ListRecords are all allowed to return partial responses, via a combination of:
– resumptionToken: an opaque, archive-defined data string that, when passed back to the archive, allows the response to begin where it left off
  each archive defines its own resumptionToken syntax; it may have visible semantics or not
– 503 HTTP status code: “retry after”
  up to the harvester to understand this code and respect it, and up to the archive to enforce it
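Respecting a 503 comes down to reading the Retry-After header before trying again. The helper below is a sketch under simple assumptions: `headers` is a plain dict, only the delay-in-seconds form of Retry-After is parsed, and the 60 second fallback is an arbitrary choice for this example (the header may also carry an HTTP date, which this sketch does not parse).

```python
def retry_delay(status, headers, default=60):
    """Seconds to wait before re-issuing a request, or None if no wait is asked.

    status  -- HTTP status code of the response
    headers -- dict of response headers (assumed exact-case keys here)
    default -- fallback wait when Retry-After is missing or not an integer
    """
    if status != 503:
        return None
    value = headers.get("Retry-After", "").strip()
    return int(value) if value.isdigit() else default
```

A harvest loop would sleep for the returned number of seconds and then repeat the identical request.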
25
resumptionToken
scenario: harvesting 2770 records in 3 separate 1000 record “chunks” (harvester talking to a repository backed by an RDBMS)
harvester: ListRecords
repository: Records 1-1000, resumptionToken=AXad31
harvester: ListRecords, resumptionToken=AXad31
repository: Records 1001-2000, resumptionToken=pQ22-x
harvester: ListRecords, resumptionToken=pQ22-x
repository: Records 2001-2770
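The exchange in this scenario is a simple loop on the harvester side. In the sketch below the HTTP request and XML parsing are replaced by a callable, so `fake_repository` is a stand-in invented for this example (its tokens are plain offsets, which a real repository need not use); only the loop structure is the point.

```python
def harvest(list_records):
    """Build a complete list from a sequence of incomplete lists.

    list_records(token) must return (records, next_token); a real harvester
    would issue the HTTP request and parse the response here.
    """
    records, token = [], None
    while True:
        chunk, token = list_records(token)
        records.extend(chunk)
        if not token:  # empty or absent resumptionToken: list is complete
            return records


def fake_repository(token, total=2770, chunk=1000):
    """Hypothetical repository: serves record numbers 1..total in chunks."""
    start = int(token) if token else 0
    end = min(start + chunk, total)
    next_token = str(end) if end < total else ""
    return list(range(start + 1, end + 1)), next_token


result = harvest(fake_repository)
```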
26
State in resumptionTokens
HTTP is stateless
resumptionTokens allow state information to be passed back to the repository to create a complete list from a sequence of incomplete lists
EITHER: all state in resumptionToken
OR: cache result set in repository
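For the "all state in resumptionToken" option, one possible encoding (an assumption for illustration, not something the protocol prescribes) is to serialize the original request arguments plus an offset and base64-encode the result, so the repository needs no per-harvester cache.

```python
import base64
import json


def encode_token(state):
    """Pack request state into an opaque, URL-safe resumptionToken string."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token):
    """Recover the state dict from a token produced by encode_token."""
    return json.loads(base64.urlsafe_b64decode(token.encode("ascii")))


# Hypothetical state: the original arguments and where the next chunk starts.
state = {"metadataPrefix": "oai_dc", "set": "cs",
         "from": "2004-07-01", "offset": 1000}
token = encode_token(state)
```

A token built this way is still opaque to harvesters; a production repository would likely also sign or otherwise validate it, since harvesters can send back arbitrary strings.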
27
resumptionToken attributes (1)
expirationDate: likely to be useful when the cache clean-up schedule is known
– do not specify expirationDate if all state is in the resumptionToken
badResumptionToken error to be used if the resumptionToken has expired
– may also be used if the request cannot be completed for some other reason, e.g. if repository changes cause the incomplete list to have no records
– issue badResumptionToken errors judiciously; they can invalidate a lot of effort by a lot of harvesters
28
resumptionToken attributes (2)
completeListSize and cursor optionally provide information about the size of the complete list and the number of records disseminated so far
– not (currently) widely used
– use consistently if used
– designed for status monitoring
– caveat harvester: completeListSize may be approximate and may be revised
29
resumptionToken
The only defined use of resumptionToken is as follows:
– a repository must include a resumptionToken element as part of each response that includes an incomplete list;
– in order to retrieve the next portion of the complete list, the next request must use the value of that resumptionToken element as the value of the resumptionToken argument of the request;
– the response containing the incomplete list that completes the list must include an empty resumptionToken element.
30
Idempotency of “List” Requests (1)
Purpose is to allow harvesters to recover from lost responses or crashes without starting a large harvest from scratch
Recover by re-issuing the request using the resumptionToken from the previous request
IMPLICATION: the repository must accept both the most recent resumptionToken issued and the previous one
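On the repository side, supporting a re-issued request means keeping the previously issued token valid alongside the newest one. The class below is a minimal sketch of that bookkeeping for a single harvest sequence; `TokenWindow` is a made-up name, and a real repository would track one window per harvest and tie tokens to cached or recomputable result chunks.

```python
class TokenWindow:
    """Remember the two most recently issued resumptionTokens."""

    def __init__(self):
        self.previous = None
        self.current = None

    def issue(self, token):
        """Record a newly issued token; the old current becomes previous."""
        self.previous, self.current = self.current, token

    def accepts(self, token):
        """True if the token is the most recent one or the one before it."""
        return token is not None and token in (self.current, self.previous)


window = TokenWindow()
window.issue("AXad31")
window.issue("pQ22-x")
```

With this window, a harvester that lost the response to `resumptionToken=AXad31` can repeat that request even though `pQ22-x` has already been issued.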
31
Idempotency of “List” Requests (2)
response to a re-issued request must contain all unchanged records
any changed records will get new datestamps after the time of the initial request
– changes will be picked up by a subsequent harvest if not included
[no experience yet with incomplete responses to ListSets or ListMetadataFormats requests]
32
OAI-PMH 2.0 Registration
700+ repositories registered
??? unregistered repositories; unregistered because:
– testing / development
– not for public harvesting
– public, but “low-profile”
– never got around to it…
??? DP:SP ~= 5:1
33
Registration is Nice… …But Not Required
OAI-PMH is (becoming) the “http” for digital libraries
– there is no central registry of http servers
  remember the NCSA “What’s New” page? (ca. 1994)
There will never be “registration support” in OAI-PMH
– registries are a type of service provider, built on top of OAI-PMH
– registration will be an integral part of community building
Some examples
– UIUC: http://gita.grainger.uiuc.edu/registry/
– Celestial: http://celestial.eprints.org/cgi-bin/status/
– Cornell: http://www.openarchives.org/Register/BrowseSites.pl and http://www.openarchives.org/service/listproviders.html
34
<friends>
A light-weight, data-provider driven way to communicate the existence of “others”, e.g. http://ntrs.nasa.gov/?verb=Identify
…
<baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>
<baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>
<baseURL>http://eprints.riacs.edu/perl/oai/</baseURL>
<baseURL>http://ston.jsc.nasa.gov/collections/TRS/oai/</baseURL>
…
35
Aggregators
[Diagram: an aggregator between data providers (repositories) and service providers (harvesters)]
aggregators allow for:
– scalability for OAI-PMH
– load balancing
– community building
– discovery
36
Aggregators
Frequently interchangeable terms:
– aggregators: likely to be community / institutionally focused
– caches: store a copy, less likely to be community-oriented
– proxies: less likely to store a copy, may gateway between OAI-PMH and other protocols
  Dienst / OAI Gateway; Harrison, Nelson, Zubair, JCDL 03
To learn more about aggregators, caches & proxies:
– http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm
– http://www.cs.odu.edu/~mln/jcdl03/
37
<provenance> & datestamps
Reminder: datestamps are local to the repository, so a re-exporting service must use new local datestamps
Such services should use the <provenance> container to preserve the original datestamps and other information
38
Identifiers are Local
Identifiers are local to the repository
Unless you absolutely did not change the metadata and the identifier corresponds to a recognized URI scheme, use a new identifier upon re-exporting
– use the <provenance> container to preserve the harvesting history
39
Derived from the same item?
3 different ways to determine if records share provenance from the same item:
1. both records have the same identifier, and the baseURL in the request elements of the OAI-PMH responses which include the records is the same;
2. both records have the same identifier, and that identifier belongs to some recognized URI scheme;
3. the provenance containers of both records have the same entries for both the identifier and baseURL.
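The three tests can be sketched as one function. Several things here are assumptions made for illustration: the `Record` shape, the contents of `RECOGNIZED_SCHEMES` (which schemes count as "recognized" is a policy decision), and reading test 3 as "the two provenance histories share at least one (baseURL, identifier) entry".

```python
from dataclasses import dataclass, field

# Hypothetical policy: identifier schemes treated as globally meaningful.
RECOGNIZED_SCHEMES = {"info", "urn", "http"}


@dataclass
class Record:
    identifier: str
    base_url: str  # baseURL from the request element of the response
    provenance: list = field(default_factory=list)  # (baseURL, identifier)


def same_item(a, b):
    """Apply the three provenance tests in order."""
    if a.identifier == b.identifier:
        if a.base_url == b.base_url:                      # test 1
            return True
        scheme = a.identifier.split(":", 1)[0].lower()
        if scheme in RECOGNIZED_SCHEMES:                  # test 2
            return True
    # test 3, read here as a shared provenance entry (an assumption)
    return bool(set(a.provenance) & set(b.provenance))


origin = ("http://odd.oa.org", "oai:odd.oa.org:z1x2y3")
from_crosswalker = Record("oai:cw.oa.org:z1x2y3", "http://crosswalker.oa.org",
                          [origin])
from_getmarc = Record("oai:cw.oa.org:z1x2y3", "http://getmarc.oa.org",
                      [("http://crosswalker.oa.org/", "oai:cw.oa.org:z1x2y3"),
                       origin])
unrelated = Record("oai:arXiv:cs/0112017", "http://arXiv.org/oai2")
```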
40
example (1)
Consider a request from crosswalker.oa.org:
http://odd.oa.org?verb=GetRecord&identifier=oai:odd.oa.org:z1x2y3&metadataPrefix=odd_fmt
and the following response from odd.oa.org:
<responseDate>2002-02-08T08:55:46Z</responseDate>
<request verb="GetRecord" metadataPrefix="odd_fmt" identifier="oai:odd.oa.org:z1x2y3">http://odd.oa.org</request>
<GetRecord …namespace stuff…>
  <record>
    <header>
      <identifier>oai:odd.oa.org:z1x2y3</identifier>
      <datestamp>1999-08-07T06:05:04Z</datestamp>
    </header>
    <metadata>…metadata record in odd_fmt…</metadata>
  </record>
</GetRecord>
41
example (2)
Imagine that crosswalker.oa.org cross-walks harvested metadata from odd_fmt into oai_marc and then re-exposes the metadata with new identifiers. A request from getmarc.oa.org:
http://crosswalker.oa.org?verb=GetRecord&identifier=oai:cw.oa.org:z1x2y3&metadataPrefix=oai_marc
might then yield the following response from crosswalker.oa.org:
42
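example (3)
<header>
  <identifier>oai:cw.oa.org:z1x2y3</identifier>
  <datestamp>2002-02-09T01:15:43Z</datestamp>
</header>
<metadata>...metadata record in oai_marc...</metadata>
<about>
  <provenance>
    <originDescription harvestDate="2002-02-08T08:55:46Z" altered="true">
      <baseURL>http://odd.oa.org</baseURL>
      <identifier>oai:odd.oa.org:z1x2y3</identifier>
      <datestamp>1999-08-07T06:05:04Z</datestamp>
      <metadataNamespace>http://odd.oa.org/odd_fmt</metadataNamespace>
    </originDescription>
  </provenance>
</about>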
43
example (4)
This oai_marc record is then re-exposed by getmarc.oa.org with the same identifier oai:cw.oa.org:z1x2y3 (because the record has not been altered). The associated <provenance> container might be:
44
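example (5)
<header>
  <identifier>oai:cw.oa.org:z1x2y3</identifier>
  <datestamp>2002-03-01T01:46:11Z</datestamp>
</header>
<metadata>...metadata record in oai_marc...</metadata>
<about>
  <provenance>
    <originDescription …>
      <baseURL>http://crosswalker.oa.org/</baseURL>
      <identifier>oai:cw.oa.org:z1x2y3</identifier>
      <datestamp>2002-02-09T01:15:43Z</datestamp>
      <metadataNamespace>http://../oai_marc</metadataNamespace>
      <originDescription …>
        <baseURL>http://odd.oa.org</baseURL>
        <identifier>oai:odd.oa.org:z1x2y3</identifier>
        <datestamp>1999-08-07T06:05:04Z</datestamp>
        <metadataNamespace>http://odd.oa.org/odd_fmt</metadataNamespace>
      </originDescription>
    </originDescription>
  </provenance>
</about>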
45
Listen to the Repository
Check Identify’s <granularity> element if you wish to use datestamps finer than YYYY-MM-DD
If you harvest with sets, remember that “:” indicates hierarchy
– harvesting “a” will recursively harvest “a:b”, “a:b:c”, and “a:d”
Check for and handle non-200 HTTP status codes; 503, 302 and 4xx in particular
Empty resumptionToken => end of complete list
Ask for compressed responses if the repository supports them
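The set-hierarchy rule is easy to get subtly wrong with a plain prefix test, since “ab” is not a child of “a”. A minimal sketch of the membership check (the function name is illustrative):

```python
def in_set(set_spec, requested):
    """True if an item's setSpec falls under the requested set.

    ":" separates hierarchy levels, so requesting "a" also covers
    "a:b" and "a:b:c", but not "ab".
    """
    return set_spec == requested or set_spec.startswith(requested + ":")


# Which of these setSpecs would a harvest of set "a" cover?
covered = [s for s in ["a", "a:b", "a:b:c", "a:d", "ab", "b:a"]
           if in_set(s, "a")]
```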
46
Harvesting Everything
Issue an Identify request to find the protocol version, the finest datestamp granularity supported, whether compression is supported…
Issue a ListMetadataFormats request to obtain a list of all metadataPrefixes supported.
Harvest using a ListRecords request for each metadataPrefix supported. Knowledge of the datestamp granularity allows for less overlap in incremental harvesting if granularities finer than a day are supported.
Set structure can be inferred from the setSpec elements in the header blocks of each record returned (consistency checks are possible).
Items may be reconstructed from the constituent records.
Provenance and other information in <about> blocks may be re-assembled at the item level if it is the same for all metadata formats harvested. However, this information may be supplied differently for different metadata formats and may thus need to be stored separately for each metadata format.
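Two of these post-harvest steps, reconstructing items from records and inferring the set structure from setSpecs, are sketched below. The tuple layout for records and both function names are assumptions made for this example; the sample values reuse identifiers and setSpecs that appear earlier in these slides.

```python
from collections import defaultdict


def group_into_items(records):
    """Group (identifier, metadataPrefix, datestamp) records by identifier.

    Returns {identifier: {metadataPrefix: datestamp}}, one entry per item.
    """
    items = defaultdict(dict)
    for identifier, prefix, datestamp in records:
        items[identifier][prefix] = datestamp
    return dict(items)


def implied_sets(set_specs):
    """Expand setSpecs into every set they imply, e.g. a:b implies a."""
    found = set()
    for spec in set_specs:
        parts = spec.split(":")
        for depth in range(1, len(parts) + 1):
            found.add(":".join(parts[:depth]))
    return sorted(found)


records = [
    ("oai:arXiv:hep-th/9801001", "oai_dc", "1999-02-23"),
    ("oai:arXiv:hep-th/9801001", "oai_marc", "1999-02-23"),
    ("oai:arXiv:hep-th/9801002", "oai_dc", "1999-03-20"),
]
items = group_into_items(records)
sets = implied_sets(["physic:hep", "physic:exp"])
```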
47
Observation: Front-End Only
No input/registry mechanism
– OAI-PMH is always a front-end for something else
  filesystem, Dienst, RDBMS, LDAP, etc.
– convenient for pre-existing DLs, but does not address “new” DLs
  e.g., “we want to do OAI-PMH”
Bounds the scope of OAI-PMH
– responsibilities and domain of OAI-PMH are still being discussed
– tension between functionality and simplicity
48
1 Repository, 2 baseURLs
Possible to use multiple OAI-PMH interfaces in a DMZ-like configuration…
– private repo interface: OAI-PMH requests from trusted hosts
– public repo interface: OAI-PMH requests from arbitrary hosts
– both interfaces sit in front of the source database
could even use a separate copy of the database…
49
Closed Harvesting
Possible to use OAI-PMH in closed, restricted systems
[Diagram: four DLs, OAI 1 through OAI 4; all OAI requests originate from these 4 DLs]
50
Observation: Data Coherency
In the interest of implementer simplicity, several issues are left for the service provider to interpret
– what is an update vs. an addition?
  in the NACA interface, they are reported as the same and it is up to the harvesting system to figure it out
– deletions?
  it is currently optional for systems to mark records as deleted or not…
– still left to the harvester to interpret
51
Interesting Services
DP9
– gateway to expose repository contents in HTML suitable for web crawlers
Celestial
– OAI “cache”, also 1.1 -> 2.0 converter
Static (mini-) repositories
– XML files, based on OLAC work
52
DP9 Architecture
see Liu et al., JCDL 2002; http://dlib.cs.odu.edu/dp9
Slide from Liu
53
DP9 Formatting
Format of URLs
– http://arc.cs.odu.edu:8080/dp9/getrecord.jsp?identifier=oai:NACA:1917:naca-report-10&prefix=oai_dc
– http://arc.cs.odu.edu:8080/dp9/getrecord/oai_dc/oai:NACA:1917:naca-report-10
HTML Meta tags
– Some crawlers (such as Inktomi) use the HTML meta tags to index a Web page; DP9 also maps Dublin Core metadata to the corresponding HTML meta tags.
– For pages that are designed exclusively for robot navigation, a noindex robots meta tag is used
– X-FORWARDED-FOR header to distinguish between different users coming in via a proxy
Slide from Liu
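The two URL forms carry the same two values (identifier and metadata prefix), with the second form moving them out of the query string and into the path. The mapping can be sketched as below; this only illustrates the relationship between the two example URLs and is not DP9's own code, and `to_path_form` is a made-up name.

```python
from urllib.parse import urlsplit, parse_qs


def to_path_form(url):
    """Rewrite .../getrecord.jsp?identifier=X&prefix=P as .../getrecord/P/X."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    base = parts.path.rsplit("/", 1)[0]  # drop the trailing "getrecord.jsp"
    prefix = query["prefix"][0]
    identifier = query["identifier"][0]
    return f"{parts.scheme}://{parts.netloc}{base}/getrecord/{prefix}/{identifier}"


path_url = to_path_form(
    "http://arc.cs.odu.edu:8080/dp9/getrecord.jsp"
    "?identifier=oai:NACA:1917:naca-report-10&prefix=oai_dc")
```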
54
Celestial
Developed by Brody @ Southampton
– http://celestial.eprints.org/
– designed to complement DP9
– see Liu, Brody, et al., D-Lib Magazine 8(11)
Where DP9 is a non-caching proxy, Celestial caches the metadata records
– can off-load work from individual archives, higher availability
– can harvest 1.1, 2.0; exports in 2.0
55
“Static” Repositories
Premise: a repository does not wish to have an executing program on its site, so it has a “static” XML file with some of the OAI-PMH responses in place
– http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm
– accessed through a proxy
– could be a low functionality node, or the XML file could be produced by a process and moved outside a firewall
Based on OLAC work by Bird & Simons
– http://www.language-archives.org/
56
OAI Demos
Data providers
– not really meant for end-user interaction, but Suleman’s “Repository Explorer” is an excellent tool: http://purl.org/net/oai_explorer