Published by Carmella Jefferson. Modified over 9 years ago.
1
OAI-PMH for Resource Harvesting Tutorial, OAI4, October 20th 2005, CERN, Geneva, Switzerland
A New Model for Web Resource Harvesting
Michael Nelson, Computer Science Department, Old Dominion University
Herbert Van de Sompel, Digital Library Research & Prototyping Team, Research Library, Los Alamos National Laboratory
This work supported in part by the Andrew Mellon Foundation & Library of Congress
2
Outline
(0) The Problem
(1) mod_oai
(2) Future Research
3
WWW and DL: Separated at Birth
[Diagram: WWW and DL, from 1994 to today]
WWW:
o The Good: XML, BitTorrent, Web Services
o The Bad: RSS
o The Ugly: Semantic Web
DL:
o The Good: OAIS, DOI, OAI-PMH
o The Bad: Dublin Core
o The Ugly: SRU/W
The problem is not that the WWW doesn’t work; it clearly does. The problem is that our expectations have been lowered.
4
Web Robots
The robot asks www.getty.edu: what documents have been modified since 2003-11-15?
www.getty.edu:
o doc1; last mod 2003-03-12
o doc2; last mod 2002-07-19
o doc100; last mod 2003-09-11
o …
For each file, the robot must also work out:
o what is this file?
o what are its relationships to other files?
o how often does it change?
(robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG)
5
A More Efficient Way
The robot asks www.getty.edu with mod_oai: what documents have been modified since 2003-11-15?
www.getty.edu with mod_oai:
o doc1; last mod 2003-03-12
o doc2; last mod 2002-07-19
o doc100; last mod 2003-09-11
o …
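The question on this slide amounts to a date filter that the server applies on the harvester's behalf. A minimal sketch, using only the three sample documents shown on the slide; the `modified_since` helper and the dictionary layout are illustrative assumptions and not part of mod_oai:

```python
from datetime import date

# Sample listing from the slide (the elided documents are left out).
last_modified = {
    "doc1": date(2003, 3, 12),
    "doc2": date(2002, 7, 19),
    "doc100": date(2003, 9, 11),
}

def modified_since(listing, cutoff):
    """Return the names of documents modified on or after the cutoff date."""
    return sorted(name for name, stamp in listing.items() if stamp >= cutoff)

# None of the sample documents changed on or after 2003-11-15.
print(modified_since(last_modified, date(2003, 11, 15)))
# An earlier cutoff selects doc1 and doc100.
print(modified_since(last_modified, date(2003, 1, 1)))
```

Without such a server-side filter, a robot would have to issue at least one request per document to learn the same thing.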
6
Outline
(0) The Problem
(1) mod_oai
(2) Future Research
7
mod_oai approach
Goal: integrate OAI-PMH functionality into the web server itself…
mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server
o written in C
o respects values in .htaccess, httpd.conf
compile mod_oai on http://www.foo.edu/
baseURL is now http://www.foo.edu/modoai
o Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
  - http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg
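The example request on this slide can be assembled from its parts. A minimal sketch, assuming the baseURL and argument values shown on the slide; the helper name `list_identifiers_url` is made up for illustration:

```python
from urllib.parse import urlencode

def list_identifiers_url(base_url, metadata_prefix, from_date=None, set_spec=None):
    """Build a ListIdentifiers request URL for an OAI-PMH baseURL."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date
    if set_spec is not None:
        params["set"] = set_spec
    # Keep ':' unescaped so the set name stays readable, as on the slide.
    return base_url + "?" + urlencode(params, safe=":")

url = list_identifiers_url("http://www.foo.edu/modoai", "oai_dc",
                           from_date="2004-09-15", set_spec="mime:video:mpeg")
print(url)
```

Leaving out `from_date` and `set_spec` yields an unrestricted harvest of every identifier the server exposes.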
8
OAI-PMH data model in mod_oai
[Diagram: resource, item, and records]
o resource: http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf
o item: OAI-PMH identifier = entry point to all records pertaining to the resource
o records: metadata pertaining to the resource (Dublin Core, MPEG-21 DIDL, HTTP header metadata)
o OAI-PMH sets: MIME type
9
OAI-PMH concepts: typical repository
OAI-PMH Entity        | value          | description
Resource              | URL            | PDF, PS, XML, HTML or other file
Item identifier       | OAI Identifier | DNS-based name of metadata about resource
Item set membership   | LCSH           | Library of Congress Subject Heading
Record metadataPrefix | oai_dc         | bibliographic metadata in Dublin Core
Record datestamp      | 2004-10-18     | modification date of DC record
Record metadataPrefix | oai_marc       | bibliographic metadata in MARC
Record datestamp      | 2004-07-31     | modification date of MARC record
10
OAI-PMH concepts: mod_oai
OAI-PMH Entity        | value       | description
Resource              | URL         | HTML, GIF, PDF or other web file
Item identifier       | URL         | same URL as the resource
Item set membership   | MIME type   | MIME type of the resource
Record metadataPrefix | http_header | the http headers that would have been returned via HTTP GET/HEAD
Record datestamp      | 2004-07-31  | modification date of resource
Record metadataPrefix | oai_dc      | a subset of http_header in DC
Record datestamp      | 2004-07-31  | modification date of resource
Record metadataPrefix | oai_didl    | MPEG-21 DIDL: base64 encoded resource + http_header metadata
Record datestamp      | 2004-07-31  | modification date of resource
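The mapping in this table can be sketched in a few lines. The example below is an illustration only, not mod_oai's C implementation: the `Item` class, the `describe` helper, and the URL are hypothetical, and deriving the set name by generalizing the `mime:video:mpeg` pattern from the earlier slide is an assumption.

```python
import mimetypes
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Item:
    identifier: str  # the resource URL doubles as the OAI-PMH identifier
    set_spec: str    # derived from the MIME type of the resource
    datestamp: str   # modification date of the resource

def describe(url, mtime):
    """Map a resource URL and its modification time (Unix seconds) to an Item."""
    mime, _ = mimetypes.guess_type(url)
    mime = mime or "application/octet-stream"  # fallback when the type is unknown
    set_spec = "mime:" + mime.replace("/", ":")
    stamp = datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%d")
    return Item(identifier=url, set_spec=set_spec, datestamp=stamp)

# 1091232000 is midnight UTC on 2004-07-31, the datestamp used in the table.
item = describe("http://www.foo.edu/talks/intro.pdf", 1091232000)
print(item)
```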
11
Resource Discovery: ListIdentifiers
o harvester issues a ListIdentifiers, finds URLs of updated resources
o does HTTP GETs on updates only
o can get URLs of resources with specified MIME types
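On the harvester side, the first step is to pull the identifiers out of a ListIdentifiers response. A minimal sketch, assuming a standard OAI-PMH 2.0 response; the sample XML and its URLs are made up for illustration, and real responses carry more elements (responseDate, request, possibly a resumptionToken):

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up response fragment for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/a.mpg</identifier>
      <datestamp>2004-09-20</datestamp>
      <setSpec>mime:video:mpeg</setSpec>
    </header>
    <header>
      <identifier>http://www.foo.edu/b.mpg</identifier>
      <datestamp>2004-09-21</datestamp>
      <setSpec>mime:video:mpeg</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""

def updated_urls(xml_text):
    """Return the identifiers (URLs, in mod_oai) listed in a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    return [header.findtext("oai:identifier", namespaces=OAI_NS)
            for header in root.iterfind(".//oai:header", OAI_NS)]

print(updated_urls(SAMPLE))
```

Because mod_oai uses the resource URL as the identifier, each returned value can be fetched directly with an ordinary HTTP GET.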
12
Preservation: ListRecords
o harvester issues a ListRecords, gets updates as MPEG-21 DIDL documents (HTTP headers, resource By Value or By Reference)
o can get resources with specified MIME types
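The By Value / By Reference distinction can be illustrated without the DIDL XML itself. The sketch below uses a plain dictionary as a stand-in; the `package` and `unpack` helpers and the dictionary keys are hypothetical and do not reflect the actual MPEG-21 DIDL structure.

```python
import base64

def package(url, headers, body=None):
    """Bundle HTTP headers with a resource, by value (base64) or by reference (URL only)."""
    record = {"headers": dict(headers)}
    if body is not None:
        # By Value: the resource bytes travel inside the record.
        record["resource"] = {"encoding": "base64",
                              "data": base64.b64encode(body).decode("ascii")}
    else:
        # By Reference: only a pointer to the resource is included.
        record["resource"] = {"ref": url}
    return record

def unpack(record):
    """Return the resource bytes for a by-value record, or None for by-reference."""
    resource = record["resource"]
    if resource.get("encoding") == "base64":
        return base64.b64decode(resource["data"])
    return None

by_value = package("http://www.foo.edu/a.txt", {"Content-Type": "text/plain"}, b"hello")
by_ref = package("http://www.foo.edu/a.txt", {"Content-Type": "text/plain"})
print(unpack(by_value), by_ref["resource"])
```

By Value suits preservation, since the harvester receives the bytes together with the headers in one response; By Reference keeps responses small at the cost of a second request.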
13
[Figure: performance of mod_oai and wget on www.cs.odu.edu]
14
Readings
Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Terry L. Harrison, Nathan McFarland. mod_oai: An Apache Module for Metadata Harvesting. http://arxiv.org/abs/cs.DL/0503069
15
Outline
(0) The Problem
(1) mod_oai
(2) Future Research
16
Issues and Future Work
For a given server, there is a set of URLs, U, and a set of files, F
o Apache maps U → F
o mod_oai maps F → U
Neither function is 1-1 nor onto
o We can easily check if a single u maps to F, but given F we cannot (easily) generate U
Short-term issues:
o dynamic files
  - exporting unprocessed server-side files would be a security hole
o IndexIgnore
  - httpd will “hide” valid URLs
o File permissions
  - httpd will advertise files it cannot read
Long-term issues:
o Alias, Location
  - files can be covered up by the httpd
o UserDir
  - interactions between the httpd and the filesystem
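The claim that the URL-to-file mapping is neither 1-1 nor onto can be checked on a toy example. The mapping and file names below are hypothetical; on a real server the full mapping is exactly what is not available in advance.

```python
# Hypothetical URL -> file mapping, as a web server might resolve it.
url_to_file = {
    "/": "/htdocs/index.html",
    "/index.html": "/htdocs/index.html",  # two URLs, one file: not 1-1
    "/paper.pdf": "/htdocs/paper.pdf",
}
# .htaccess exists on disk but no URL serves it: not onto.
files_on_disk = {"/htdocs/index.html", "/htdocs/paper.pdf", "/htdocs/.htaccess"}

def is_one_to_one(mapping):
    """True if no two URLs resolve to the same file."""
    values = list(mapping.values())
    return len(values) == len(set(values))

def is_onto(mapping, codomain):
    """True if every file in the codomain is reachable from some URL."""
    return set(mapping.values()) == set(codomain)

def preimage(mapping, target):
    """All URLs that resolve to the given file."""
    return sorted(u for u, f in mapping.items() if f == target)

print(is_one_to_one(url_to_file), is_onto(url_to_file, files_on_disk))
print(preimage(url_to_file, "/htdocs/index.html"))
```

Checking a single URL is one dictionary lookup here, which mirrors the easy direction on the slide; the hard direction is that a server starting from the files has no such table to invert.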
17
IndexIgnore & File Permissions
18
Alias: Covering Up Files
httpd.conf:
Alias /A /usr/local/web/htdocs/B
Alias /B /usr/local/web/htdocs/A
the files “A” and “B” will be different from the URLs http://server/A and http://server/B
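The effect of the two Alias lines can be reproduced with a toy resolver. This is a simplified sketch: the prefix matching below only approximates how Apache applies Alias directives, the document root is assumed to be /usr/local/web/htdocs, and both helper names are made up.

```python
DOCROOT = "/usr/local/web/htdocs"  # assumed document root
ALIASES = {
    "/A": "/usr/local/web/htdocs/B",
    "/B": "/usr/local/web/htdocs/A",
}

def url_path_to_file(path):
    """Resolve a URL path roughly the way the Alias directives would (simplified)."""
    for prefix, target in ALIASES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return target + path[len(prefix):]
    return DOCROOT + path

def naive_file_to_url_path(filename):
    """Guess a URL path from a file under DOCROOT, ignoring Alias (the naive inverse)."""
    return filename[len(DOCROOT):]

file_a = DOCROOT + "/A"
guessed = naive_file_to_url_path(file_a)   # "/A"
print(guessed, "->", url_path_to_file(guessed))
```

The naive inverse advertises /A for the file A, but a request for /A is actually answered with the file B, so a harvester would be given the wrong content for that URL.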
19
UserDir: “Just in Time” mounting of directories
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso
/home/tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/ tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf%
20
Looking Further Down the Road for mod_oai
“Reverse” the method of URL discovery
o cannot look to the files;
o listen to incoming requests and build a list of valid URLs
  - could be seeded with files at start
  - also the method for handling server processed files / URLs
Plug-ins for descriptive metadata
o DC tags in HTML
o MS Office formats, PDF
o Tags from JPEG, TIFF, MP3, etc.
Additional metadata in the DIDL
o technical metadata from JHOVE
o estimated change rate
  - cf. Cho & Garcia-Molina, ACM TOIT 28(4)
http log access as separate metadata formats
  - cf. Van de Sompel, Young & Hickey, D-Lib 9(7/8)
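The "listen to incoming requests" idea can be sketched as a pass over access-log lines. This is an illustration under stated assumptions, not a mod_oai feature: the log lines are invented, the format is assumed to be Common Log Format, and treating only status 200 as "valid" is a choice made for this sketch.

```python
import re

# Matches the request and status fields of a Common Log Format line.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" (\d{3}) ')

def valid_urls(log_lines, seed=()):
    """Collect URL paths that were served successfully (status 200)."""
    urls = set(seed)  # optionally seeded with URLs derived from files at start
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and match.group(2) == "200":
            urls.add(match.group(1))
    return urls

# Invented log lines for illustration.
sample = [
    '192.0.2.1 - - [20/Oct/2005:10:00:00 +0200] "GET /index.html HTTP/1.1" 200 1043',
    '192.0.2.1 - - [20/Oct/2005:10:00:05 +0200] "GET /missing.html HTTP/1.1" 404 209',
    '192.0.2.7 - - [20/Oct/2005:10:01:00 +0200] "GET /cgi-bin/list?page=2 HTTP/1.1" 200 512',
]
print(sorted(valid_urls(sample, seed=["/paper.pdf"])))
```

Note that the dynamic URL /cgi-bin/list?page=2 is picked up as well, which is the sense in which this approach could also cover server-processed resources.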
21
Expanding OAI-PMH / Complex Object Access
OAI-PMH / CO access for:
o blogs
o message boards
o native file systems
  - e.g. Mac OS X “Spotlight”
More aggressive use of OAI-PMH / CO for preservation
o recently funded NSF DIGARCH program
o use for preservation:
  - Usenet
  - Email
  - Multicasting
22
OAI-PMH + Complex Objects: A New Model for Web Resource Harvesting
Better web harvesting can be achieved through:
o OAI-PMH: structured access to updates
o Complex object formats: modeled representation of digital objects
Use cases:
o Preservation (ListRecords)
o Web crawling (ListIdentifiers)
mod_oai: reference implementation
o Better performance than wget
o static files only; dynamic files in the future
o not a replacement for DSpace, Fedora, eprints.org, etc.
More info:
o http://www.modoai.org/
o http://whiskey.cs.odu.edu/
23
Datestamps and Etags
Procedure
o 16 harvests over 1 month of 465,374 .dk domains
o 5,543,470 possible downloads
o 5,182,034 successful downloads
o 599,143 changes
Datestamp and Etag Example
L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004 http://www.netarchive.dk/website/publications/Etags-2004.pdf
24
Errors in Datestamps and Etags Indicating Change
                | Etags  | Datestamps
missed change   | 0.087% | 0.30%
redundant crawl | 32%    | 10.7%
40.1% of pages without Etags
0.07% of pages without Datestamps
L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004 http://www.netarchive.dk/website/publications/Etags-2004.pdf
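A harvester that relies on these validators needs a rule for comparing two harvests and for handling pages where a validator is missing. A minimal sketch; the `changed` helper is hypothetical, and preferring the Etag over the datestamp when both are present is a choice made for this example rather than a recommendation from the cited study.

```python
def changed(previous, current):
    """Decide whether a resource changed between two harvests.

    Each argument is a dict of HTTP response headers; either validator may be absent.
    """
    for name in ("ETag", "Last-Modified"):
        if name in previous and name in current:
            return previous[name] != current[name]
    return True  # no shared validator: recrawl to be safe

old = {"ETag": '"abc"', "Last-Modified": "Sat, 31 Jul 2004 00:00:00 GMT"}
new = {"ETag": '"abc"', "Last-Modified": "Sun, 01 Aug 2004 00:00:00 GMT"}
print(changed(old, new))
print(changed({"Last-Modified": old["Last-Modified"]},
              {"Last-Modified": new["Last-Modified"]}))
```

With the Etag preferred, the first call reports no change even though the datestamp moved, which is the kind of disagreement between the two indicators that the table quantifies.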
25
http_header