Introduction to Digital Libraries
Week 14: OAI & Complex Objects for Preservation
Old Dominion University, Department of Computer Science
CS 751/851, Fall 2006
Michael L. Nelson, Joan A. Smith
11/29/06
Several slides borrowed from Van de Sompel, Liu, Lagoze, Warner & Harrison
CS 751/851: OAI-PMH & Complex Objects
Outline
- Digital Preservation: Concepts & Issues
- OAI-PMH Mechanics
- Complex Objects
- Preservation Using OAI-PMH & Complex Objects
- Implementation Example: mod_oai
11/29/2006 CS 751/851: OAI-PMH & Complex Objects
Digital Preservation
“Digital information lasts forever -- or 5 years, whichever comes first” -- Jeff Rothenberg
- Do you still have a copy of your first ?
- Can you still compile and run the first program you ever wrote?
  - BASIC compilers are hard to find these days…
- If lightning fried your computer, how much information would you have lost?
- How many versions of your website have you made? How many do you still have?
- Digital information is very fragile
  - Intuitively we know this…
  - Raise your hand if this happened to you lately
[Image labels: Durable, Fragile]
DP Strategy Example: LOCKSS Caches
- LOCKSS seeks to ensure long-term availability of digital publications even if the publisher goes out of business
- A peer-to-peer network is used to maintain and repair content
- Ensures content is only available to authorized subscribers
- In this example, each LOCKSS cache (oval) collects journal content from the publisher's web site as it is published. Readers (circles) can get content from the publisher site. When the publisher's web site is not available (gray) to a local community, readers from that community get content from their local institution's cache. The caches "talk" to each other to maintain the content's integrity over time.
- 3 goals of LOCKSS:
  - Preserve content (bits)
  - Preserve access (to bits)
  - Preserve understanding of bits (as content)
- The point of LOCKSS is to ensure the rights of publishers and accessibility to subscribers
DP Strategy Example: VERS
[Figure panels: VERS Objects, VERS Process]
- Note the emphasis on digital signatures: a key element of official records
- The final object contains a wealth of metadata
- The point of VERS is to ensure evidentiary-quality official records
Web Site Preservation
- Internet Archive’s Wayback Machine
  - Philanthropic effort by B. Kahle
  - By-request and general web crawls
- WARP
  - Japan’s national web archiving program
  - Japanese-origin sites
  - Many countries have similar efforts
- Sitemaps
  - Search engine standard (Google, Yahoo, MSN) to map site resources
  - Not preservation-oriented per se: an entry point to preservation
  - Today/near-future focus
  - Search engines are saying: I give up!
- Google Groups/Usenet
  - Restored ~80% of original Usenet archives
  - Primarily text-based content
- Mirroring strategies
  - Can ease migration of resources
  - Short-term backup rather than long-term preservation
Crawling is Complicated
Web Site Preservation: 2 Problems
- The counting problem
  - How many pages are on that site?
  - To save it you have to find it
- The representation problem
  - What’s that page all about?
  - Future use requires understanding
Digital Preservation Issues
- Refreshing: If you don’t have it, you can’t preserve it
  - Resources disappear over time (Cong. Foley’s web site)
  - Resources change over time ( )
  - Resources can decay/degrade over time (damaged files, lost links)
- Migration: If you don’t upgrade it, you can’t use it
  - Format obsolescence (WordPerfect vs. PDF)
  - Format modification (XBM vs. JPEG)
  - System obsolescence (TRS-80 vs. PowerPC)
- Emulation: If you can’t access it, you can’t use it
  - Original bits and bytes only work in the original environment (PDP-11)
  - Obsolete systems can be emulated in a newer environment (Frogger)
  - Physical characteristics have to be interpreted in new environments
- These issues apply to every digital preservation effort, web or DL
Open Archival Information System: OAIS
- A general reference model for preservation (physical or digital)
- SIP = Submission Information Package
- AIP = Archival Information Package
- DIP = Dissemination Information Package
- Note the complicated, active, ongoing role of the archivist
[Figure labels: today, future, “from today through all tomorrows”]
Outline: 2
- Digital Preservation: Concepts & Issues
- OAI-PMH Mechanics
- Complex Objects
- Preservation Using OAI-PMH & Complex Objects
- Implementation Example: mod_oai
Libraries: Inspiration for a Digital Age
- Anatomy of a city library:
  - Organized
  - Grouped: topics, subtopics
  - Numbered
  - Searchable: by author, title; by topic; by edition
  - Lots of metadata
- A digital library is similar
  - Expands on physical library concepts
  - Special protocols let librarians organize and find resources & information
  - OAI-PMH is one of these “library” protocols
Example catalog record (MARC):
  GV943.25 .B74 1990
  Brenner, Richard J., 1941-
  Make the team. Soccer : a heads up guide to super soccer! / Richard J. Brenner. -- 1st ed. -- Boston : Little, Brown, c1990.
  127 p. : ill. ; 19 cm.
  "A Sports illustrated for kids book."
  Summary: Instructions for improving soccer skills. Discusses dribbling, heading, playmaking, defense, conditioning, mental attitude, how to handle problems with coaches, parents, and other players, and the history of soccer.
  ISBN : $12.95
  1. Soccer--Juvenile literature. 2. Soccer. I. Title: Heads up guide to super soccer. II. Title.
  Dewey Class no.: /2 -- dc 20
OAI-PMH data model
[Figure: resource -> item -> records; example labels: smith.pdf, MimeType=pdf, /foo/refs/smith.pdf, OAI-PMH sets]
- item: the OAI-PMH identifier is the entry point to all records pertaining to the resource
- records: metadata pertaining to the resource (Dublin Core metadata, MARCXML metadata); each record is identified by OAI-PMH identifier + metadataPrefix + datestamp
- Note that the data model is all about the metadata, rather than about the resource itself. Datestamps refer to metadata records, not to resources, so a change in the resource's own datestamp would NOT produce a new record. (This situation changes with the DIDL metadata format.)
Overview of OAI-PMH Verbs
Verb                  Function
Identify              description of repository
ListMetadataFormats   metadata formats supported by repository
ListSets              sets defined by repository
ListIdentifiers       OAI unique ids contained in repository
ListRecords           listing of N records
GetRecord             listing of a single record
- Identify, ListMetadataFormats, ListSets: metadata about the repository
- ListIdentifiers, ListRecords, GetRecord: harvesting verbs
- Most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
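Since every OAI-PMH request is just the repository's baseURL plus a `verb` and a few arguments, the six verbs above can be sketched as a small URL builder. This is only an illustration: the base URL `http://www.example.org/oai` and the helper name `oai_url` are hypothetical, while the argument names (`metadataPrefix`, `from`, `until`, `set`, `identifier`, `resumptionToken`) are the ones defined by OAI-PMH 2.0.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def oai_url(base_url, verb, **args):
    """Build an OAI-PMH request URL (hypothetical helper, not part of any library).

    'from' is a Python keyword, so callers pass it as from_; a trailing
    underscore is stripped before the argument is put in the query string.
    """
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for name, value in args.items():
        if value is not None:
            params[name.rstrip("_")] = value
    return f"{base_url}?{urlencode(params)}"


# Example: headers of all Dublin Core records changed since 2004-09-15.
example = oai_url(
    "http://www.example.org/oai",  # assumed baseURL
    "ListIdentifiers",
    metadataPrefix="oai_dc",
    from_="2004-09-15",
)
```

The same helper covers the repository-description verbs (`oai_url(base, "Identify")`) and single-record access (`oai_url(base, "GetRecord", identifier=..., metadataPrefix="oai_dc")`).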
Repositories and Harvesters
- Data Providers / Repositories
  - REPOSITORY:
    - Network accessible
    - Processes OAI-PMH style requests
    - Exposes metadata to harvesters
- Service Providers / Harvesters
  - HARVESTER:
    - Client application
    - Issues OAI-PMH style requests
    - Collects metadata from repositories
  - SERVICE PROVIDER:
    - Aggregates metadata from multiple repositories
    - Facilitates discovery of resources
Aggregators
- Aggregators allow for:
  - scalability for OAI-PMH
  - load balancing
  - community building
  - discovery
[Figure: data providers (repositories) -> aggregator -> service providers (harvesters)]
OAI-PMH Verbs & Special Features
- Identify
  - Provides descriptive metadata about the DL
- ListIdentifiers
  - Returns record headers only
  - Resumption token manages lengthy data sets
  - Unique identifier for each site resource
- ListMetadataFormats
  - Specifies types of metadata tracked by the site
  - Options include Dublin Core, MARC, DIDL, RFC1807, others…
  - Dublin Core is required by the OAI specification
- ListRecords
  - Sequential transfer of each record
  - Can limit to N records (flow control for crawler)
- ListSets
  - Defined locally via scripts to aggregate common record groups
  - Facilitates selective harvesting of site
  - MIME-type sets are automatically supported by mod_oai
- GetRecord
  - Selects a specific, single record from the site
  - Identified by the OAI unique identifier
- Special features:
  - Datestamp harvesting. Example: “Give me all records updated between and today”
  - Metadata only, or the full record encapsulated as DIDL, or a complete package with all of this information (akin to an OAIS AIP)
- Best resource is Herbert’s at
- Identify response:
  - MANDATORY: repositoryName, baseURL, protocolVersion, earliestDatestamp, deletedRecord (no, transient, persistent), granularity (datestamp form), adminEmail
  - OPTIONAL: compression, description
Example: Identify Verb Response Content
HTTP request:
Example: ListIdentifiers Verb Response Content
Resource Harvesting: Use cases
- Discovery: use content itself in the creation of services
  - search engines that make full-text searchable
  - citation indexing systems that extract references from the full-text content
  - browsing interfaces that include thumbnail versions of high-quality images from cultural heritage collections
- Preservation:
  - periodically transfer digital content from a data repository to one or more trusted digital repositories
  - trusted digital repositories need a mechanism to automatically synchronize with the originating data repository
Existing OAI-PMH based approaches
Typical scenario:
- An OAI-PMH harvester harvests Dublin Core records from the OAI-PMH repository.
- The harvester analyzes each Dublin Core record, extracting dc.identifier information in order to determine the network location of the described resource.
- A separate process, out-of-band from the OAI-PMH, collects the described resource from its network location.
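The second step of this scenario (pulling candidate locations out of dc.identifier) might look roughly like the sketch below. The sample record, its identifiers, and the `example.org` URL are invented for illustration; the only rule applied is "keep values that look like HTTP URLs", which is exactly the kind of guesswork this approach forces on a harvester.

```python
import xml.etree.ElementTree as ET

# Namespace of unqualified Dublin Core elements.
DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical oai_dc record: one DOI, one URL, one citation, all in dc:identifier.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example title</dc:title>
  <dc:identifier>doi:10.9999/example</dc:identifier>
  <dc:identifier>http://www.example.org/splash/42</dc:identifier>
  <dc:identifier>Example Journal 1 (2002) 1-10</dc:identifier>
</oai_dc:dc>"""


def candidate_locations(record_xml):
    """Return dc:identifier values that look like dereferenceable HTTP URLs."""
    root = ET.fromstring(record_xml)
    values = [(e.text or "").strip() for e in root.iter(DC + "identifier")]
    return [v for v in values if v.startswith(("http://", "https://"))]


locations = candidate_locations(SAMPLE_RECORD)
```

Note that the one URL found here may still be a splash page rather than the resource itself; nothing in the record says which.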
Existing OAI-PMH based approaches: Issue 1
Locating the resource based on information provided in dc.identifier
- dc.identifier is used to convey a variety of identifiers, sometimes simultaneously: URL, DOI, bibliographic citation, …
  - Not expressive enough to distinguish between identifier and locator
  - Several dereferencing attempts required
- The URI provided in dc.identifier is commonly that of a bibliographic “splash page”
  - How to know it is a bibliographic “splash page”, not the resource?
  - If it is a bibliographic “splash page”, where is the resource?
Existing OAI-PMH based approaches: Issue 2
Using the OAI-PMH datestamp of the Dublin Core record to trigger incremental harvesting:
- The datestamp of the DC record does not necessarily change when the resource changes

                      no DC datestamp change    DC datestamp change
no resource update    OK                        unnecessary resource download
resource update       missed resource update    OK
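The four combinations of "did the resource change?" and "did the DC datestamp change?" can be written down as a tiny decision function. Treating the case where both change as "OK" is an assumption (the slide text only names the other three outcomes), and `harvest_outcome` is a hypothetical name.

```python
def harvest_outcome(resource_changed: bool, datestamp_changed: bool) -> str:
    """What a datestamp-triggered harvester ends up doing in each case."""
    if datestamp_changed and not resource_changed:
        # Metadata was touched, so the harvester re-fetches an unchanged resource.
        return "unnecessary resource download"
    if resource_changed and not datestamp_changed:
        # Resource changed silently; the harvester never notices.
        return "missed resource update"
    # Either nothing changed, or both changed together (assumed to be fine).
    return "OK"


outcomes = {
    (r, d): harvest_outcome(r, d)
    for r in (False, True)
    for d in (False, True)
}
```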
Existing OAI-PMH based approaches: Conventions
- Cannot really address issue 2 (datestamps) with metadata conventions
- Issue 1 (identifier & locator of the resource) is currently addressed with a range of conventions
  - First dc.identifier is the locator of the resource
    - What if the resource is not digital?
  - Use of dc.format and/or dc.relation to convey the locator
Existing OAI-PMH based approaches: Conventions
<oai_dc:dc>
  <dc:title>A Simple Parallel-Plate Resonator Technique for Microwave Characterization of Thin Resistive Films</dc:title>
  <dc:creator>Vorobiev, A.</dc:creator>
  <dc:subject>ING-INF/01 Elettronica</dc:subject>
  <dc:description>A parallel-plate resonator method is proposed for non-destructive characterisation of resistive films used in microwave integrated circuits. A slot made in one ... </dc:description>
  <dc:publisher>Microwave engineering Europe</dc:publisher>
  <dc:date>2002</dc:date>
  <dc:type>Documento relativo ad una Conferenza o altro Evento</dc:type>
  <dc:type>PeerReviewed</dc:type>
  <dc:identifier>[URL missing]</dc:identifier>      (splash page)
  <dc:format>pdf [URL missing]</dc:format>          (locator of resource)
</oai_dc:dc>
Existing OAI-PMH based approaches: Conventions
…
<dc:identifier>[URL missing]</dc:identifier>      (splash page)
<dc:relation>[URL missing]</dc:relation>          (locator of resource)
Existing OAI-PMH based approaches: Conventions
…
<dc:identifier>[URL missing]</dc:identifier>      (locator of resource)
<dc:relation>[URL missing]</dc:relation>          (splash page)
Existing OAI-PMH based approaches: Other attempts
- dc.identifier leads to a splash page & the splash page contains a special-purpose XHTML link to the resource(s)
  - What if there is no splash page?
  - How does a harvester recognize this situation?
- OA-X: protocol extension
  - OK in local context
  - Strategic problem to generalize
  - How to consolidate with the OAI-PMH data model?
- Qualified Dublin Core
  - Could bring expressiveness to distinguish between locator & identifier
  - But what about the datestamp issue?
Outline: 3
- Digital Preservation: Concepts & Issues
- OAI-PMH Mechanics
- Complex Objects
- Preservation Using OAI-PMH & Complex Objects
- Implementation Example: mod_oai
CS 751/851: OAI-PMH & Complex Objects
- Representation of a digital object by means of a wrapper XML document
- Represented resource can be:
  - simple digital object (consisting of a single datastream): foo.txt
  - compound digital object (consisting of multiple datastreams): foo.asp
- Unambiguous approach to convey identifiers of the digital object and its constituent datastreams
- Include datastream:
  - By-Value: embedding of base64-encoded datastream
  - By-Reference: embedding network location of the datastream
  - not mutually exclusive; equivalent
- Include a variety of secondary information, By-Value or By-Reference:
  - Descriptive metadata, rights information, technical metadata, …
Complex Object Formats: Characteristics
- Representation of a digital object by means of a wrapper XML document
- Represented resource can be:
  - simple digital object (consisting of a single datastream)
  - compound digital object (consisting of multiple datastreams)
- Include datastream:
  - By-Value: embedding of base64-encoded datastream
  - By-Reference: embedding network location of the datastream
- Descriptive metadata, rights information, technical metadata, …
- MPEG-21 DIDL is one type of complex object format
  - Can be used in OAI-PMH
  - Metadata prefix for mod_oai is “oai_didl”
- In other words: instead of just looking at the index card about the book, we can actually get the book, too
- Let’s look at an example GetRecord verb for a very simple resource ( )
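To make By-Value and By-Reference concrete, here is a rough sketch that wraps one datastream both ways in a DIDL-like XML document. It is a simplification under stated assumptions: the namespace URI and the `encoding="base64"` attribute are assumed from the MPEG-21 DIDL schema, the function name `wrap_resource` and the `example.org` URL are hypothetical, and a real DIDL record would also carry identifiers and descriptive metadata.

```python
import base64
import xml.etree.ElementTree as ET

# Assumed MPEG-21 DIDL namespace; check the profile your repository uses.
DIDL_NS = "urn:mpeg:mpeg21:2002:02-DIDL-NS"


def wrap_resource(data=None, mime_type="application/octet-stream", ref=None):
    """Wrap a datastream By-Value (base64) and/or By-Reference (ref attribute)."""
    ET.register_namespace("didl", DIDL_NS)
    didl = ET.Element(f"{{{DIDL_NS}}}DIDL")
    item = ET.SubElement(didl, f"{{{DIDL_NS}}}Item")
    component = ET.SubElement(item, f"{{{DIDL_NS}}}Component")
    if data is not None:
        # By-Value: the bytes travel inside the wrapper, base64-encoded.
        by_value = ET.SubElement(
            component, f"{{{DIDL_NS}}}Resource",
            mimeType=mime_type, encoding="base64",
        )
        by_value.text = base64.b64encode(data).decode("ascii")
    if ref is not None:
        # By-Reference: only the network location travels in the wrapper.
        ET.SubElement(
            component, f"{{{DIDL_NS}}}Resource",
            mimeType=mime_type, ref=ref,
        )
    return ET.tostring(didl, encoding="unicode")


wrapper = wrap_resource(
    b"hello", mime_type="text/plain", ref="http://www.example.org/foo.txt"
)
```

Both forms may appear together, which matches the "not mutually exclusive" remark: the same datastream can be embedded and also pointed at.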
MPEG-21 DIDL Data Model
How to encode an archive?
- 1 file = 1 DID
- 1 archive = 1 container
- 1 archive = 1 component
- 1 file = 1 component
Example DIDL
<didl:DIDL>
  <didl:Item>
    <didl:Descriptor>
      <didl:Statement mimeType="text/xml; charset=UTF-8">
        <dii:Identifier> </dii:Identifier>
      </didl:Statement>
    </didl:Descriptor>
    <oai_dc:dc>
      <dc:title>A Simple Parallel-Plate Resonator Technique for Microwave Characterization of Thin Resistive Films</dc:title>
      <dc:creator>Vorobiev, A.</dc:creator>
      <dc:identifier>[URL missing]</dc:identifier>
      <dc:format>application/pdf</dc:format>
      …
    </oai_dc:dc>
    <didl:Component>
      <didl:Resource mimeType="application/pdf" ref="[URL missing]"/>
    </didl:Component>
  </didl:Item>
</didl:DIDL>
Complex Object Formats & OAI-PMH
- Resource represented via XML wrapper => OAI-PMH <metadata>
- Uniform solution for simple & compound objects
- Unambiguous expression of locator of datastream
- Disambiguation between locators & identifiers
- OAI-PMH datestamp changes whenever the resource (data streams & secondary information) changes, i.e. the resource or its metadata
- OAI-PMH semantics apply: “about” containers, set membership
GetRecord: Get the Id and the Data
Request arguments (URL not preserved): &identifier= &metadataPrefix=oai_didl
- oai_didl metadata format (prefix)
  - Complex object response
  - Encapsulates the resource within the response
  - Encodes it as base64
- Everything known about the URL is in the response
  - All of the metadata types and the contents:
    - Dublin Core
    - HTTP headers
    - Any others that might be used by that server…
Example: GetRecord/oai_didl Response
“joan.html” encoded in base64
OAI-PMH based approach using Complex Object Formats
Typical scenario:
- An OAI-PMH harvester checks for support of a locally understood complex object format using the ListMetadataFormats verb.
- The harvester harvests the complex object metadata. Semantics of the OAI-PMH datestamp guarantee that new and modified resources are detected.
- A parser at the end of the harvesting application analyzes each harvested complex object record:
  - The parser extracts the bitstreams that were delivered By-Value.
  - The parser extracts the unambiguous references to the network location of bitstreams delivered By-Reference.
- A separate process, out-of-band from the OAI-PMH, collects the bitstreams delivered By-Reference from the extracted network locations.
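The parser step of this scenario could be sketched as follows: walk the harvested complex object, decode anything delivered By-Value, and collect the locations of anything delivered By-Reference for the out-of-band fetch. The sample document, its `example.org` URL, the namespace URI, and the `encoding="base64"` attribute are assumptions made for the sketch; real oai_didl records are richer.

```python
import base64
import xml.etree.ElementTree as ET

# Assumed MPEG-21 DIDL namespace.
DIDL = "{urn:mpeg:mpeg21:2002:02-DIDL-NS}"

# Hypothetical harvested record: one By-Value and one By-Reference datastream.
SAMPLE_DIDL = """\
<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS">
  <didl:Item>
    <didl:Component>
      <didl:Resource mimeType="text/plain" encoding="base64">aGVsbG8=</didl:Resource>
    </didl:Component>
    <didl:Component>
      <didl:Resource mimeType="application/pdf" ref="http://www.example.org/smith.pdf"/>
    </didl:Component>
  </didl:Item>
</didl:DIDL>"""


def split_resources(didl_xml):
    """Return (by_value, by_reference) lists of (mime_type, payload) tuples."""
    by_value, by_reference = [], []
    for resource in ET.fromstring(didl_xml).iter(DIDL + "Resource"):
        mime_type = resource.get("mimeType")
        if resource.get("ref"):
            # To be fetched later by a separate, out-of-band process.
            by_reference.append((mime_type, resource.get("ref")))
        elif resource.get("encoding") == "base64":
            by_value.append((mime_type, base64.b64decode(resource.text or "")))
    return by_value, by_reference


embedded, referenced = split_resources(SAMPLE_DIDL)
```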
Complex Object Formats & OAI-PMH: issues
- Which Complex Object Format(s)?
- How to profile Complex Object Format(s) for OAI-PMH harvesting
  - Large records
  - Making resources re-harvestable
  - Because the resource is represented as <metadata>, can rights pertaining to the resource be expressed according to the “rights for metadata” OAI-rights guideline?
- Tools:
  - Software library to write compliant complex objects
  - Integration of this library with repository systems (Fedora, DSpace, eprints.org, …)
Complex Object Formats & OAI-PMH: Existing implementations
- LANL Repository
  - Local storage of terabytes of scholarly assets
  - Assets stored as MPEG-21 DIDL documents
  - DIDL documents made accessible to downstream applications via the OAI-PMH
- Mirroring of American Physical Society collection at LANL
  - Maps APS document model to MPEG-21 DIDL Transfer Profile
  - Exposes MPEG-21 DIDL documents through OAI-PMH infrastructure
  - Includes digests/signatures
- DSpace & Fedora plug-ins
  - Maps DSpace/Fedora document model to MPEG-21 DIDL Transfer Profile
- mod_oai
Outline: 4
- Digital Preservation: Concepts & Issues
- OAI-PMH Mechanics
- Complex Objects
- Preservation Using OAI-PMH & Complex Objects
- Implementation Example: mod_oai
Digital Preservation: A New Strategy
[Figure: OAIS, OAI-PMH, Complex Objects + Digital Preservation]
- We can leverage these existing technologies to create a unique approach to web preservation
OAI-PMH Data Model with Complex Objects
- resource
- item: OAI-PMH identifier = entry point to all records pertaining to the resource
- records: metadata pertaining to the resource; each is a modeled representation of the resource
  - Dublin Core metadata: simple model
  - MPEG-21 DIDL: complex model
  - METS: complex model
  - MARCXML metadata: more expressive model
Complex Object Formats & OAI-PMH: archive export/ingest
2 Problems: Counting & Representation
- Counting Problem (Itemizing Resources)
  - Finding all URLs on a site is hard
  - Can’t preserve a resource if you can’t find it…
  - Access restrictions may exist
  - Pages may be orphaned intentionally or accidentally
  - URL normalization is complicated and time-consuming
- Representation Problem (Characterizing Resources)
  - Resource types in use migrate over time
  - Mechanisms for accessing resources evolve
  - Old formats may not be recognizable
  - Other metadata might be desirable
  - Keeping the bits & bytes alone is insufficient
- Can the web server help to solve these problems?
Lessons from the Search Engines: Make Preservation Easy
- Evolution of search on the web
  - Hard to use/Poor results ↔ Few users (think: AltaVista)
  - Easier to use/OK results ↔ More users (think: Ask Jeeves)
  - Simple to use/Great results ↔ Everybody Googles
- Search engines turbo-charge the internet
  - At-your-fingertips browsing = immediate user benefit
  - Search engines are successful (finally)
  - Search engines are easy
- Digital preservation is not like search engines
  - Digital preservation requires heroic effort & constant vigilance
  - Benefits usually accrue only after a disaster
- How can we make preservation easy?
  - We need to find resources
  - We need to package resources
  - Why not use the web server itself?
- DLs & DP:
  - Formal collections devote $$$ & resources to it
  - Constant oversight to ensure persistence through digital generations
  - Complex tools
  - Metadata, indexing, and other key elements take work
CRATE: A Model for Web Resource Preservation
- Fits with the OAIS preservation model
- Text-based protocol for long-term survivability
- Complex object format supported by HTTP via OAI-PMH
- Utilizes the web server to support preservation via mod_oai
Outline: 5
- Digital Preservation: Concepts & Issues
- OAI-PMH Mechanics
- Complex Objects
- Preservation Using OAI-PMH & Complex Objects
- Implementation Example: mod_oai
What if we could --
- Get a list of all URLs for the site
  - Including those not linked from root
  - Maybe even CGI-related links
- Get a list of everything new since last visit
  - Any pages that have changed
  - Any new pages added
  - Any pages that have been deleted
- Get a list of all <put your mime type here>
  - Images (specific subtype or all of them)
  - HTML pages only
  - PDFs only
  - Whatever MIME spec you want…
- Package resource and metadata together in one object
- I.e., solve the Counting and Representation problems
mod_oai solution
- Integrate OAI-PMH functionality into the web server itself…
- Use mod_oai, an Apache 2.0 module that automatically answers OAI-PMH requests for an HTTP server
  - written in C
  - respects values in .htaccess, httpd.conf
- Install mod_oai on
- Define baseURL:
- Result: web harvesting with OAI-PMH syntax (e.g., from, until, sets)
  - “From site foo, using OAI-PMH, give me a list of all resources that are MIME type video-MPEG, and their Dublin Core metadata, dating from 9/15/2004 through today”
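The spoken request at the end of this slide translates fairly directly into a single ListRecords URL. The sketch below shows one way to do that translation, including converting the US-style date into the YYYY-MM-DD form OAI-PMH datestamps use. The base URL and especially the set name `mime:video:mpeg` are assumptions for illustration; the set names a given mod_oai installation actually exposes should be read from its ListSets response.

```python
from datetime import date, datetime
from urllib.parse import urlencode


def mime_harvest_url(base_url, mime_set, since_us, until):
    """Build a ListRecords URL for one MIME-type set and a date range.

    since_us is a US-style date string such as "9/15/2004";
    until is a datetime.date (for "today", pass date.today()).
    """
    since = datetime.strptime(since_us, "%m/%d/%Y").date()
    params = {
        "verb": "ListRecords",
        "metadataPrefix": "oai_dc",      # "their Dublin Core metadata"
        "set": mime_set,                 # assumed set naming
        "from": since.isoformat(),
        "until": until.isoformat(),
    }
    return base_url + "?" + urlencode(params)


url = mime_harvest_url(
    "http://www.example.org/modoai",     # hypothetical baseURL
    "mime:video:mpeg",                   # hypothetical set name
    "9/15/2004",
    date(2006, 11, 29),                  # date of the lecture, standing in for "today"
)
```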
How does mod_oai work?
- Source code
  - Written in C
  - Designed to be platform-independent
  - Requires Apache 2
  - Uses APXS2 calls
  - Linux, Mac compatible
- Runs as a web server process
  - Installed on the web server like mod_perl or mod_deflate, for example
  - Config file handles module specifics (baseURL location, etc.)
- Enables OAI-PMH verbs to appear in the HTTP request
  - baseURL + verb gets an OAI-PMH response
- The rest of the site works as normal
  - Users see no change
  - Standard crawlers can operate as usual
OAI-PMH concepts: typical repository
OAI-PMH Entity                    value            description
Resource                          URL              PDF, PS, XML, HTML or other file
Item             identifier       OAI Identifier   DNS-based name of metadata about resource
                 set membership   LCSH             Library of Congress Subject Heading
Record           metadataPrefix   oai_dc           bibliographic metadata in Dublin Core
                 datestamp                         modification date of DC record
                 metadataPrefix   oai_marc         bibliographic metadata in MARC
                 datestamp                         modification date of MARC record
OAI-PMH concepts: mod_oai
OAI-PMH Entity                    value            description
Resource                          URL              HTML, GIF, PDF or other web file
Item             identifier       URL              same URL as the resource
                 set membership   MIME type        MIME type of the resource
Record           metadataPrefix   http_header      the http headers that would have been returned via HTTP GET/HEAD
                 datestamp                         modification date of resource
                 metadataPrefix   oai_dc           a subset of http_header in DC
                 metadataPrefix   oai_didl         MPEG-21 DIDL: base64 encoded resource + http_header metadata
Efficient, Automatic Harvesting
A better way: using OAI-PMH to crawl a site
- Identify
  - Gives essential repository information
- ListRecords/ListIdentifiers
  - Lists all of the resources on the site
  - Can be “tweaked”:
    - Only those that are new since YYYY-MM-DD
    - Only those of MIME type <???>
  - Streamlines the crawling process
- ListSets
  - Tells the crawler what kind of groupings the site supports
- 6 verbs in all
- Streamlined initial crawl, fast update crawls
Performance of mod_oai vs wget
- All crawlers
  - Must ask for every resource
  - Discovery is faster and automatic for mod_oai
- ListIdentifiers
  - Only an OAI-PMH verb
  - Could be used to create an index of resource names
  - Gets unlinked and linked resources
- ListRecords
  - Returns metadata plus resource
- wget
  - Behaves like a common crawler
  - Can only find linked resources
- Update performance improved using mod_oai (OAI-PMH)
  - Conditional request is streamlined
  - If only new/changed pages are requested:
    - OAI-PMH crawler: “GET from yyyy-mm-dd” (last visit date); one request gets all the new data
    - Standard crawler: “GET if-modified-since”; must ask for every page
- Data from performance on
- For more detail: “mod_oai: An Apache Module for Metadata Harvesting”
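The difference in update cost can be illustrated with a toy simulation rather than real measurements. Everything below is made up for the sketch: the four-page site, its dates, and the link structure. A conventional crawler issues one conditional GET per URL it already knows about, while an OAI-PMH crawler asks once with `from=<last visit>` (more requests would only be needed if the response is split by resumption tokens).

```python
from datetime import date

# Hypothetical site: URL path -> last-modified date.
SITE = {
    "/index.html": date(2006, 11, 1),
    "/a.html": date(2006, 9, 1),
    "/b.pdf": date(2006, 11, 20),
    "/orphan.html": date(2006, 11, 25),  # not linked from any page
}
# URLs a link-following crawler (e.g. wget) has been able to discover.
LINKED = ["/index.html", "/a.html", "/b.pdf"]


def conditional_crawl(since):
    """One GET with If-Modified-Since per known URL."""
    requests, changed = 0, []
    for url in LINKED:
        requests += 1
        if SITE[url] > since:  # server would answer 200 instead of 304
            changed.append(url)
    return changed, requests


def oai_crawl(since):
    """One ListIdentifiers/ListRecords request with from=<since>."""
    changed = [url for url, modified in SITE.items() if modified >= since]
    return changed, 1


last_visit = date(2006, 11, 10)
crawl_changed, crawl_requests = conditional_crawl(last_visit)
oai_changed, oai_requests = oai_crawl(last_visit)
```

In this toy setup the conventional crawl spends three requests and still misses the unlinked page, while the OAI-PMH crawl needs one request and reports both changed resources.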
Improving Crawls Using mod_oai
- Google Sitemaps for OAI-PMH sites
  - Currently harvests Dublin Core only
  - Uses your baseURL to crawl your site
  - Uses the date feature to get the newest information
- Complex-object format/MPEG-21 DIDL
  - New OAI-PMH approach combines resource + metadata
  - Big files, but:
    - Could use gzip, deflate if the server supports it (many do)
    - Still more efficient than traditional crawling
    - Can provide lots of useful metadata
- Simplifies crawls
  - ListRecords gets everything
  - ListRecords + date range = fast updates
- Any crawler could request MPEG-21 DIDL format (oai_didl)
  - Google could easily adopt it since they already use ListRecords
  - Any search engine looking for a competitive edge could implement the DIDL metadata prefix to streamline crawls
  - Intranets could adopt this approach for archiving their internal web
  - The base64-encoded resource is also easy to decode for analysis or restoration
Addressing the Counting Problem: ListIdentifiers
- CRAWLER:
  - issues a ListIdentifiers request, finds URLs of updated resources
  - does HTTP GET for updates only
  - can get URLs of resources with specified MIME types
- EXTEND mod_oai “counting”:
  - Web log lists
  - File system lists
  - Configuration information
Addressing the Representation Problem: ListRecords in DIDL Format
- CRAWLER:
  - makes a ListRecords query
  - gets updates as MPEG-21 DIDL records (HTTP headers, resource By-Value or By-Reference)
  - can get resources with specified MIME types
- EXTEND mod_oai “representation”:
  - Add ability to incorporate other metadata output
  - Build metadata-rich complex object response
  - Encapsulate within existing OAI-PMH DIDL metadata format response
Actual GetRecord Response (oai_didl)
“joan.html” encoded in base64
Advantages of mod_oai
- Search engines are taking a real interest in OAI-PMH as a means to improve crawling
- mod_oai is an Apache 2.0 module that provides an OAI-PMH interface for your site (currently Linux & Mac)
- You can send the baseURL to Google
- The module is relatively simple to install
- It won’t affect regular site users and regular web crawlers
- Any changes to your site will be reflected by the mod_oai server
- It makes crawling much faster, more efficient, more useful
Search Engine Use of OAI-PMH
- Google Sitemaps: OAI-PMH or Do-It-Yourself
  - Via OAI-PMH
    - Just send them the baseURL!
    - Google does a ListRecords query on your site
  - Via Google’s tool or a manually constructed XML-formatted file
    - URI/IRI compliant
    - Follow schema: ASCII and UTF-8 encoded (escaped quotes, ampersands, etc.)
    - Limited size: 50,000 URLs, 10 MB max (per sitemap file)
- MSN Academic Live
  - Digital-library-centric (not general web)
  - Specifically states it can access OAI-PMH repositories
  - Unclear if its role will grow to include MSN Search
- Yahoo
  - No sign-up guidelines for OAI-PMH-enabled sites
  - Yet… research showed good coverage of OAI-PMH repositories
  - Outsourced OAI-PMH crawls [1]: OAIster (U Michigan Library) provides Yahoo with OAI repository information
- Professional Digital Libraries
  - Many support OAI-PMH
  - Many are not open to commercial search engines
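For the do-it-yourself route, the two constraints named above (escaped special characters, and at most 50,000 URLs and 10 MB per file) can be handled with a short generator. This is a sketch under assumptions: it uses the sitemaps.org 0.9 namespace, which may differ from the schema version Google accepted at the time of these slides, and `build_sitemaps` and the `example.org` URLs are hypothetical.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"  # assumed schema version
MAX_URLS = 50_000                 # per-file URL limit from the slide
MAX_BYTES = 10 * 1024 * 1024      # per-file size limit from the slide


def build_sitemaps(urls, max_urls=MAX_URLS, max_bytes=MAX_BYTES):
    """Split a URL list into UTF-8 encoded sitemap files within the limits."""
    files = []
    for start in range(0, len(urls), max_urls):
        chunk = urls[start:start + max_urls]
        # escape() turns &, < and > into entities, as the schema requires.
        entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in chunk)
        document = (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<urlset xmlns="{SITEMAP_NS}">\n{entries}</urlset>\n'
        )
        data = document.encode("utf-8")
        if len(data) > max_bytes:
            raise ValueError("sitemap file exceeds size limit; lower max_urls")
        files.append(data)
    return files


# Small demonstration with a lowered URL limit so the split is visible.
demo_urls = [f"http://www.example.org/page?id={i}&x=1" for i in range(5)]
demo_files = build_sitemaps(demo_urls, max_urls=2)
```

Compare this with the OAI-PMH route on the same slide, where the site owner only hands over the baseURL and the search engine does the listing itself.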
Google Sitemaps Using OAI-PMH
XML Format info here:
Issues, Current Research, and Future Work
- For a given server, there is a set of URLs, U, and a set of files, F
  - Apache maps U → F
  - mod_oai maps F → U
- Neither function is 1-1 nor onto
  - We can easily check if a single u maps to F, but given F we cannot (easily) generate U
- Short-term issues:
  - dynamic files: exporting unprocessed server-side files would be a security hole
  - IndexIgnore: httpd will “hide” valid URLs
  - File permissions: httpd will advertise files it cannot read
- Long-term issues:
  - Alias, Location: files can be covered up by the httpd
  - UserDir: interactions between the httpd and the filesystem
- Preservation research
  - Plug-in metadata harvesters
  - Efficient packaging of resource with metadata
  - Impact of processes on web server performance
  - Suitability of the CRATE model for preservation
IndexIgnore & File Permissions
Alias: Covering Up Files
httpd.conf:
Alias /A /usr/local/web/htdocs/B
Alias /B /usr/local/web/htdocs/A
The files “A” and “B” will be different from the URLs
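Why this configuration is a problem for a module that walks the filesystem can be shown with a simplified model of the two Alias lines. The prefix matching below is a deliberate simplification of Apache's behaviour, and the function names and `index.html` file are hypothetical. The point is only that the naive file-to-URL guess ("strip the document root") no longer round-trips.

```python
DOCROOT = "/usr/local/web/htdocs"

# The two Alias directives from the slide: URL prefix -> filesystem path.
ALIASES = [
    ("/A", DOCROOT + "/B"),
    ("/B", DOCROOT + "/A"),
]


def url_to_file(url_path):
    """Simplified U -> F mapping: apply the first matching Alias, else DocumentRoot."""
    for prefix, target in ALIASES:
        if url_path == prefix or url_path.startswith(prefix + "/"):
            return target + url_path[len(prefix):]
    return DOCROOT + url_path


def naive_file_to_url(file_path):
    """What a filesystem walk would guess for F -> U: strip the document root."""
    return file_path[len(DOCROOT):]


file_on_disk = DOCROOT + "/A/index.html"
guessed_url = naive_file_to_url(file_on_disk)   # the URL the walk would advertise
served_file = url_to_file(guessed_url)          # what Apache would actually serve
```

Here the file under `A` is advertised as `/A/index.html`, but a request for that URL is answered from directory `B`, so the harvested list and the served content disagree.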
UserDir: “Just in Time” mounting of directories
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso
/home/tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/ tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf%
Example CRATE Plug-Ins for mod_oai
Name       Description
Jhove      Image analysis
Kea        Key-phrase extraction
OTS        Open Text Summarizer
ExifTool   Image/video metadata extractor
Pdflib     Extract PDF metadata
MP3-Tag    Extract audio file tags
Essence    Customized information extraction
GDFR       MIME++
- Plug-in design allows for any type of extraction tool to be included
- Flexible architecture elements: Tags | Argument-Name | Version | CDATA output
- Simple Apache configuration file modification to enable a plug-in
- Plug-ins written by 3rd-party programmers
- Validity of metadata is not verified by CRATE
OAI-PMH + Complex Objects: A New Model for Web Resource Harvesting & Preservation
- Better web harvesting can be achieved through:
  - OAI-PMH
  - Complex object formats
- Use cases:
  - Preservation (ListRecords)
  - Web crawling (ListIdentifiers)
- mod_oai: reference implementation
  - Better performance than wget
  - static files only; dynamic files in the future
  - not a replacement for DSpace, Fedora, eprints.org, etc.
- New version of mod_oai
  - Plug-in compatible
  - Flexible architecture
For more information
- A website with mod_oai releases, demos and documentation is maintained by Old Dominion University and LANL:
  - New release next month
  - Improved installation process
- The Open Archives Initiative also maintains a web site:
  - Forum, tutorials, news, research
  - OAI-PMH information
- There are active research projects at ODU using mod_oai
  - Web preservation
  - Repository ingestion/handling
  - See