Thinking Differently About Web Page Preservation Michael L. Nelson, Frank McCown, Joan A. Smith Old Dominion University Norfolk VA


Thinking Differently About Web Page Preservation Michael L. Nelson, Frank McCown, Joan A. Smith Old Dominion University Norfolk VA Library of Congress Brown Bag Seminar June 29, 2006 Research supported in part by NSF, Library of Congress and Andrew Mellon Foundation

Background “We can’t save everything!” –if not “everything”, then how much? –what does “save” mean?

“Women and Children First” (image: HMS Birkenhead, Cape Danger) — 193 survivors, including all 7 women & 13 children

We should probably save a copy of this…

Or maybe we don’t have to… the Wikipedia link is in the top 10, so we’re ok, right?

Surely we’re saving copies of this…

2 copies in the UK 2 Dublin Core records That’s probably good enough…

What about the things that we know we don’t need to keep? You DO support recycling, right?

A higher moral calling for pack rats?

Just Keep the Important Stuff!

Lessons Learned from the AIHT images from: (Boring stuff: D-Lib Magazine, December 2005) Preservation metadata is like a David Hockney Polaroid collage: each image is both true and incomplete, and while the result is not faithful, it does capture the “essence”

Preservation: Fortress Model
Five Easy Steps for Preservation:
1. Get a lot of $
2. Buy a lot of disks, machines, tapes, etc.
3. Hire an army of staff
4. Load a small amount of data
5. “Look upon my archive ye Mighty, and despair!”
image from:

Alternate Models of Preservation
– Lazy Preservation: let Google, IA et al. preserve your website
– Just-In-Time Preservation: wait for it to disappear first, then recover a “good enough” version
– Shared Infrastructure Preservation: push your content to sites that might preserve it
– Web Server Enhanced Preservation: use Apache modules to create archival-ready resources
image from:

Lazy Preservation “How much preservation do I get if I do nothing?” Frank McCown

Outline: Lazy Preservation
– Web Infrastructure as a Resource
– Reconstructing Web Sites
– Research Focus

Web Infrastructure

Cost of Preservation
(figure: systems plotted by publisher’s cost — time, equipment, knowledge — from low to high against coverage of the Web, low to high; client-view systems include the browser cache, filesystem backups, TTApache, iPROXY, Furl/Spurl, InfoMonitor, and LOCKSS; server-view systems include SE caches, web archives, and Hanzo:web)

Outline: Lazy Preservation
– Web Infrastructure as a Resource
– Reconstructing Web Sites
– Research Focus

Research Questions
How much digital preservation of websites is afforded by lazy preservation?
– Can we reconstruct entire websites from the WI?
– What factors contribute to the success of website reconstruction?
– Can we predict how much of a lost website can be recovered?
– How can the WI be utilized to provide preservation of server-side components?

Prior Work
Is website reconstruction from the WI feasible?
– Web repositories: G, M, Y, IA
– Web-repository crawler: Warrick
– Reconstructed 24 websites
How long do search engines keep cached content after it is removed?

Timeline of SE Resource Acquisition and Release
– Vulnerable resource: not yet cached (t_ca is not defined)
– Replicated resource: available on web server and SE cache (t_ca < current time < t_r)
– Endangered resource: removed from web server but still cached (t_ca < current time < t_cr)
– Unrecoverable resource: missing from web server and cache (t_ca < t_cr < current time)
Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites, D-Lib Magazine, 12(2), February 2006.
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical report, arXiv cs.IR/ , 2005.
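The four states above can be sketched as a small classifier. The t_ca / t_r / t_cr labels follow the slide; the field and function names are invented for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResourceTimeline:
    """Event times for one resource; None means the event hasn't happened yet."""
    t_ca: Optional[float] = None   # first cached by a search engine
    t_r: Optional[float] = None    # removed from the web server
    t_cr: Optional[float] = None   # removed from the SE cache

def classify(tl: ResourceTimeline, now: float) -> str:
    if tl.t_ca is None or now < tl.t_ca:
        return "vulnerable"        # not yet cached
    if tl.t_r is None or now < tl.t_r:
        return "replicated"        # on the server and in a cache
    if tl.t_cr is None or now < tl.t_cr:
        return "endangered"        # gone from the server, still cached
    return "unrecoverable"         # gone from both
```

The “endangered” window (between t_r and t_cr) is exactly when lazy preservation can still recover a lost resource.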

Cached Image

Cached PDF — canonical version vs. the Google, Yahoo, and MSN versions

Web Repository Characteristics

Type                              MIME type                      Ext       Google  Yahoo  MSN  IA
HTML text                         text/html                      html      C       C      C    C
Plain text                        text/plain                     txt, ans  M       M      M    C
Graphic Interchange Format        image/gif                      gif       M       M      ~R   C
Joint Photographic Experts Group  image/jpeg                     jpg       M       M      ~R   C
Portable Network Graphic          image/png                      png       M       M      ~R   C
Adobe Portable Document Format    application/pdf                pdf       M       M      M    C
JavaScript                        application/javascript         js        M  M  C
Microsoft Excel                   application/vnd.ms-excel       xls       M       ~S     M    C
Microsoft PowerPoint              application/vnd.ms-powerpoint  ppt       M       M      M    C
Microsoft Word                    application/msword             doc       M       M      M    C
PostScript                        application/postscript         ps        M  ~S  C

C = canonical version is stored; M = modified version is stored (modified images are thumbnails, all others are html conversions); ~R = indexed but not retrievable; ~S = indexed but not stored

SE Caching Experiment
– Create html, pdf, and image files
– Place the files on 4 web servers
– Remove the files on a regular schedule
– Examine web server logs to determine when each page is crawled and by whom
– Query each search engine daily using a unique identifier to see if it has cached the page or image
Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites. D-Lib Magazine, 12(2), February 2006.
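The daily-query step depends on each test page carrying a string unlikely to occur anywhere else on the web. A minimal sketch of generating such probe pages — the identifier scheme here is invented, not the one used in the experiment:

```python
import datetime

def make_probe_pages(n: int, date: datetime.date) -> dict:
    """Generate n tiny HTML pages, each embedding a unique identifier that
    can later be submitted verbatim as a search-engine query: a hit on the
    identifier means the page made it into the index/cache."""
    pages = {}
    for i in range(n):
        uid = f"probe-{date:%Y%m%d}-{i:04d}"   # hypothetical ID scheme
        pages[f"page{i}.html"] = f"<html><body>{uid}</body></html>"
    return pages
```

Matching the probe IDs against the server logs then tells you when a page was crawled; querying each SE for the same ID tells you when it became cached.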

Caching of HTML Resources - mln

Reconstructing a Website
(diagram: starting from a URL, Warrick queries each web repo with the original URL, follows the results page to the cached URL, retrieves the cached resource, and writes the retrieved resource to the file system)
1. Pull resources from all web repositories
2. Strip off extra header and footer html
3. Store the most recently cached version or canonical version
4. Parse the html for links to other resources
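The four-step loop could be sketched as follows. The repository interface (a function returning a (timestamp, html) pair, or None on a miss) is a simplification invented for the sketch, not Warrick's actual repository handling:

```python
import re

def reconstruct(start_url, repositories):
    """Minimal sketch of a web-repository crawl: pull each URL from every
    repository, keep the most recently cached copy, and follow its links."""
    recovered, frontier = {}, [start_url]
    while frontier:
        url = frontier.pop()
        if url in recovered:
            continue
        hits = [h for h in (repo(url) for repo in repositories) if h]
        if not hits:
            continue                      # resource lost from every repository
        ts, html = max(hits)              # step 3: most recently cached version
        recovered[url] = html
        frontier += re.findall(r'href="([^"]+)"', html)  # step 4: parse for links
    return recovered
```

Step 2 (stripping the cache's added header/footer html) is omitted here; in practice each repository wraps cached pages differently.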

How Much Did We Reconstruct?
(diagram: a “lost” web site with resources A–F vs. its reconstruction containing A, B’, C’, G, E — the missing link to D now points to an old resource G, and F can’t be found)

Reconstruction Diagram
added 20% · identical 50% · changed 33% · missing 17%
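Under one plausible reading of these percentages (identical/changed/missing measured against the original site, added against the reconstruction), a reconstruction vector can be computed as below; the precise definitions are in the Warrick technical report:

```python
def reconstruction_vector(original, reconstructed):
    """Fractions (identical, changed, missing, added): the first three are
    over the original site's resources, `added` over the reconstruction's.
    Inputs are dicts mapping URL -> content."""
    orig, recon = set(original), set(reconstructed)
    identical = sum(1 for u in orig & recon if original[u] == reconstructed[u])
    changed = len(orig & recon) - identical
    n, m = len(orig) or 1, len(recon) or 1
    return (identical / n, changed / n, len(orig - recon) / n, len(recon - orig) / m)
```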

Websites to Reconstruct
Reconstruct 24 sites in 3 categories:
1. small (1-150 resources)
2. medium ( resources)
3. large (500+ resources)
– Use Wget to download the current website
– Use Warrick to reconstruct it
– Calculate the reconstruction vector

Results
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/ , 2005.

Aggregation of Websites

Web Repository Contributions

Warrick Milestones
– www2006.org – first lost website reconstructed (Nov 2005)
– DCkickball.org – first website someone else reconstructed without our help (late Jan 2006)
– – first website we reconstructed for someone else (mid Mar 2006)
– Internet Archive officially “blesses” Warrick (mid Mar 2006)

Outline: Lazy Preservation
– Web Infrastructure as a Resource
– Reconstructing Web Sites
– Research Focus

Proposed Work
How lazy can we afford to be?
– Find factors influencing the success of website reconstruction from the WI
– Perform search engine cache characterization
Inject server-side components into the WI for complete website reconstruction
Improving the Warrick crawler
– Evaluate different crawling policies (Frank McCown and Michael L. Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, ACM Hypertext 2006)
– Development of a web-repository API for inclusion in Warrick

Factors Influencing Website Recoverability from the WI
A previous study did not find a statistically significant relationship between recoverability and website size or PageRank
Methodology:
– Sample a large number of websites from dmoz.org
– Perform several reconstructions over time using the same policy
– Download the sites several times over time to capture change rates

Evaluation
Use statistical analysis to test for the following factors:
– Size
– Makeup
– Path depth
– PageRank
– Change rate
Create a predictive model – how much of my lost website do I expect to get back?

Marshall TR Server – running EPrints

We can recover the missing page and PDF, but what about the services?

Recovery of Web Server Components
Recovering the client-side representation is not enough to reconstruct a dynamically-produced website
How can we inject the server-side functionality into the WI?
Web repositories like HTML:
– Canonical versions stored by all web repos
– Text-based
– Comments can be inserted without changing the appearance of the page
Injection: use erasure codes to break a server file into chunks and insert the chunks into HTML comments of different pages
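A toy version of the injection idea, using plain splitting rather than a real erasure code — so recovery here needs every chunk back, whereas an erasure code would tolerate losing n − r of them. All names and the comment format are illustrative:

```python
import base64, math, re

def make_chunks(server_file_bytes: bytes, n: int):
    """Base64-encode a server file and split it into n chunks."""
    b64 = base64.b64encode(server_file_bytes).decode()
    size = math.ceil(len(b64) / n)
    return [b64[i * size:(i + 1) * size] for i in range(n)]

def embed(html: str, chunk: str, i: int, n: int, fname: str) -> str:
    """Hide one chunk in an HTML comment; the page renders unchanged."""
    comment = f"<!-- chunk {fname} {i}/{n} {chunk} -->"
    return html.replace("</body>", comment + "\n</body>")

def recover(pages, fname: str) -> bytes:
    """Reassemble the server file from chunks found in recovered pages."""
    found = {}
    for html in pages:
        for i, n, chunk in re.findall(
                r"<!-- chunk %s (\d+)/(\d+) (\S+) -->" % re.escape(fname), html):
            found[int(i)] = chunk
    return base64.b64decode("".join(found[i] for i in sorted(found)))
```

When the pages are later pulled back out of SE caches, the server-side file rides along with the client-side representation.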

Recover Server File from WI

Evaluation
– Find the most efficient values for n and r (chunks created / chunks needed for recovery)
– Security: develop a simple mechanism for selecting files that can be injected into the WI; address encryption issues
– Reconstruct an EPrints website with a few hundred resources

SE Cache Characterization
Web characterization is an active field, but search engine caches have never been characterized
Methodology:
– Randomly sample URLs from four popular search engines: Google, MSN, Yahoo, Ask
– Download the cached version and the live version from the Web
– Examine HTTP headers and page content
– Test for overlap with the Internet Archive
– Attempt to access various resource types (PDF, Word, PS, etc.) in each SE cache

Summary: Lazy Preservation
When this work is completed, we will have…
– demonstrated and evaluated the lazy preservation technique
– provided a reference implementation
– characterized SE caching behavior
– provided a layer of abstraction on top of SE behavior (API)
– explored how much we can store in the WI (server-side vs. client-side representations)

Web Server Enhanced Preservation “How much preservation do I get if I do just a little bit?” Joan A. Smith

Outline: Web Server Enhanced Preservation
– OAI-PMH
– mod_oai: complex objects + resource harvesting
– Research Focus

WWW and DL: Separate Worlds
(diagram: in 1994, DL and WWW were separate worlds; today the WWW — “Crawlapalooza” — and the DL — “Harvester Home Companion” — still operate apart)
The problem is not that the WWW doesn’t work; it clearly does. The problem is that our (preservation) expectations have been lowered.

Data Providers / Repositories: “A repository is a network accessible server that can process the 6 OAI-PMH requests… A repository is managed by a data provider to expose metadata to harvesters.”
Service Providers / Harvesters: “A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.”

Aggregators
(diagram: an aggregator sits between data providers (repositories) and service providers (harvesters))
Aggregators allow for:
– scalability for OAI-PMH
– load balancing
– community building
– discovery

OAI-PMH data model
(diagram: a resource is represented by an item — the entry point to all records pertaining to the resource, named by an OAI-PMH identifier; the item yields records of metadata pertaining to the resource, e.g. Dublin Core and MARCXML, each keyed by OAI-PMH identifier + metadataPrefix + datestamp; items may belong to OAI-PMH sets)

OAI-PMH Used by Google & AcademicLive (MSN)
Why support OAI-PMH?
$: These guys are in business (i.e., for profit)
Q: How does OAI-PMH help their bottom line?
A: By improving the search and analysis process

Resource Harvesting with OAI-PMH
(diagram: the OAI-PMH data model with records ranging from simple to highly expressive — Dublin Core metadata (simple), MARCXML metadata (more expressive), METS and MPEG-21 DIDL (highly expressive); the OAI-PMH identifier is the entry point to all records pertaining to the resource, and the records carry metadata pertaining to the resource)

Outline: Web Server Enhanced Preservation
– OAI-PMH
– mod_oai: complex objects + resource harvesting
– Research Focus

Two Problems
– The counting problem: there is no way to determine the list of valid URLs at a web site
– The representation problem: machine-readable formats and human-readable formats have different requirements

mod_oai solution
Integrate OAI-PMH functionality into the web server itself…
mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server
– written in C
– respects values in .htaccess, httpd.conf
Compile mod_oai on ; the baseURL is now
Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
The human-readable web site, prepped for machine-friendly harvesting: “Give me a list of all resources, include Dublin Core metadata, dating from 9/15/2004 through today, and that are MIME type video-MPEG.”
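A request like the one just quoted is ordinary OAI-PMH query-string construction. A sketch — the baseURL below is a placeholder, and mod_oai's MIME-type set syntax is not shown:

```python
from urllib.parse import urlencode

def oai_request(base_url: str, verb: str, **args) -> str:
    """Build an OAI-PMH request URL against a mod_oai-style baseURL."""
    return base_url + "?" + urlencode({"verb": verb, **args})

# 'from' is a reserved word in Python, hence the dict splat.
url = oai_request("http://www.example.edu/modoai", "ListRecords",
                  metadataPrefix="oai_didl", **{"from": "2004-09-15"})
```

The from/until arguments give the “dating from 9/15/2004 through today” semantics; a set argument would restrict by MIME type.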

A Crawler’s View of the Web Site
(diagram: from the web root, only some pages are crawled; others are not crawled because they are unadvertised & unlinked, too deep, protected, reachable only by a remote link from another web site, generated on-the-fly (by CGI, e.g.), or excluded by robots.txt or a robots META tag)

Apache’s View of the Web Site
(diagram: from the web root, Apache sees pages that require authentication, pages generated on-the-fly (CGI, e.g.), pages tagged “no robots”, and pages that are unknown/not visible)

The Problem: Defining The “Whole Site”
For a given server, there are a set of URLs, U, and a set of files, F
– Apache maps U → F
– mod_oai maps F → U
Neither function is 1-1 nor onto
– We can easily check if a single u maps to F, but given F we cannot (easily) generate U
Short-term issues:
– dynamic files: exporting unprocessed server-side files would be a security hole
– IndexIgnore: httpd will “hide” valid URLs
– File permissions: httpd will advertise files it cannot read
Long-term issues:
– Alias, Location: files can be covered up by the httpd
– UserDir: interactions between the httpd and the filesystem

A Webmaster’s Omniscient View
(diagram: the web root plus everything a crawler misses — deep, dynamic, authenticated, orphaned, “no robots”-tagged, and unknown/not-visible resources; content may also live outside the httpd’s file list, e.g. MySQL rows (Data1, User.abc, Fred.foo) vs. files served by httpd (file1, /dir/wwx, Foo.html))

HTTP GET versus OAI-PMH GetRecord
(diagram: the Apache web server answers a human-readable request — “GET /headlines.html HTTP/1.1” — with the resource itself; through mod_oai, a machine-readable request — “GET /modoai/?verb=GetRecord&identifier=headlines.html&metadataPrefix=oai_didl” — returns a complex object packaging the resource with metadata such as JHOVE output, MD-5 checksum, and LS info)

OAI-PMH data model in mod_oai
(diagram: a resource maps to an item whose OAI-PMH identifier = the resource’s URL — the entry point to all records pertaining to the resource; records carry metadata pertaining to the resource as Dublin Core, HTTP header metadata, or MPEG-21 DIDL; OAI-PMH sets are based on MIME type)

Complex Objects That Tell A Story
(image: Russian nesting doll — “First came Lenin, then came Stalin…” — encoded as an MPEG-21 DIDL wrapping the resource with Jhove metadata, DC metadata, checksum, provenance, …)
– Resource and metadata packaged together as a complex digital object represented via an XML wrapper
– Uniform solution for simple & compound objects
– Unambiguous expression of the locator of a datastream
– Disambiguation between locators & identifiers
– OAI-PMH datestamp changes whenever the resource (datastreams & secondary information) changes
– OAI-PMH semantics apply: “about” containers, set membership

Resource Discovery: ListIdentifiers
HARVESTER issues a ListIdentifiers and:
– finds URLs of updated resources
– does HTTP GETs on updates only
– can get URLs of resources with specified MIME types

Preservation: ListRecords
HARVESTER issues a ListRecords and:
– gets updates as MPEG-21 DIDL documents (HTTP headers, resource By-Value or By-Reference)
– can get resources with specified MIME types
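An incremental harvest of this kind is a loop over ListRecords responses, following resumptionTokens for flow control. A sketch with the HTTP fetch injected so the loop can be exercised without a live server (record handling simplified to the OAI-PMH namespace only):

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(fetch, base_url, metadata_prefix="oai_didl"):
    """Yield <record> elements from successive ListRecords responses,
    following resumptionTokens. `fetch(url) -> XML string` is supplied
    by the caller (e.g. a wrapper around urllib)."""
    url = f"{base_url}?verb=ListRecords&metadataPrefix={metadata_prefix}"
    while True:
        root = ET.fromstring(fetch(url))
        yield from root.iter(OAI + "record")
        token = root.find(f".//{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            return                     # empty/absent token ends the list
        url = f"{base_url}?verb=ListRecords&resumptionToken={token.text.strip()}"
```

Each yielded record would carry the DIDL document; unpacking its base64 datastream recovers the resource itself.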

What does this mean?
For an entire web site, we can:
– serialize everything as an XML stream
– extract it using off-the-shelf OAI-PMH harvesters
– efficiently discover updates & additions
For each URL, we can:
– create a “preservation ready” version with configurable {descriptive|technical|structural} metadata, e.g. Jhove output, datestamps, signatures, provenance, automatically generated summary, etc.
(diagram: harvest the resource → extract metadata — index, translations, lexical signatures, summaries, Jhove & other pertinent info → wrap it all together in an XML stream, ready for the future)

Outline: Web Server Enhanced Preservation
– OAI-PMH
– mod_oai: complex objects + resource harvesting
– Research Focus

Research Contributions
Thesis question: How well can Apache support web page preservation?
Goal: to make web resources “preservation ready”
– Support refreshing (“how many URLs at this site?”): the counting problem
– Support migration (“what is this object?”): the representation problem
How: using OAI-PMH resource harvesting
– Aggregate forensic metadata; automate extraction
– Encapsulate into an object: an XML stream of information
– Maximize preservation opportunity: bring DL technology into the realm of the WWW

Experimentation & Evaluation
Research solutions to the counting problem
– Different tools yield different results: Google sitemap ≠ Apache file list ≠ robot-crawled pages
– Combine approaches for one automated, full URL listing
– Apache logs are a detailed history of site activity: compare user page requests with crawlers’ requests, and compare crawled pages with the actual site tree
Continue research on the representation problem
– Integrate utilities into mod_oai (Jhove, etc.)
– Automate metadata extraction & encapsulation
Serialize and reconstitute
– complete back-up of a site & reconstitution through an XML stream

Summary: Web Server Enhanced Preservation
Better web harvesting can be achieved through:
– OAI-PMH: structured access to updates
– Complex object formats: modeled representation of digital objects
Addresses 2 key problems:
– Preservation (ListRecords) – the representation problem
– Web crawling (ListIdentifiers) – the counting problem
mod_oai: reference implementation
– Better performance than wget & crawlers
– Not a replacement for DSpace, Fedora, eprints.org, etc.
More info:
Automatic harvesting of web resources, rich in metadata, packaged for the future. Today: manual. Tomorrow: automatic!

Summary Michael L. Nelson

Summary
Digital preservation is not hard, it’s just big.
– Save the women and children first, of course, but there is room for many more…
Using the by-products of SE and WI, we can get a good amount of preservation for free
– prediction: Google et al. will eventually see preservation as a business opportunity
Increasing the role of the web server will solve most of the digital preservation problems
– complex objects + OAI-PMH = digital preservation solution

“As you know, you preserve the files you have. They’re not the files you might want or wish to have at a later time” “if you think about it, you can have all the metadata in the world on a file and a file can be blown up” image from:

Overview of OAI-PMH Verbs

Verb                 Function
Identify             description of repository
ListMetadataFormats  metadata formats supported by repository
ListSets             sets defined by repository
ListIdentifiers      OAI unique ids contained in repository
ListRecords          listing of N records
GetRecord            listing of a single record

The first three verbs return metadata about the repository; the last three are harvesting verbs. Most verbs take arguments: dates, sets, ids, metadata formats, and a resumption token (for flow control).

Enhancing Apache’s utility as a preservation tool
Create a partnership between the server and the SE
– Apache can serve up details about the site: accessible portions of the site tree, and changes including additions and deletions
– The SE can reduce crawl time and subsequent index/update times
Google: “Hi Apache! What’s new?”
Apache: “Hi Google! I’ve got 3 new pages: xyz/blah1.html, yyy/bug2.html, and ru2.html. Oh, and I also deleted xyz/boo.html.”
Use OAI-PMH to facilitate the conversation between the SE and the server
– The data model offers many advantages: both content-rich and metadata-rich, and supports complex objects
– The protocol’s 6 verbs mesh well with SE and server roles: Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, GetRecord
Enable a policy-driven relationship between site & SE
– push content-rich harvesting to the web community

OAI-PMH concepts: typical repository

Entity    Value                       Description
Resource  URL                         PDF, PS, XML, HTML or other file
Item      identifier                  OAI Identifier (DNS-based name)
          set membership: LCSH        Library of Congress Subject Heading
Record    metadataPrefix: oai_dc      bibliographic metadata in Dublin Core; datestamp = modification date of DC record
Record    metadataPrefix: oai_marc    bibliographic metadata in MARC; datestamp = modification date of MARC record

OAI-PMH concepts: mod_oai

Entity    Value                         Description
Resource  URL                           HTML, GIF, PDF or other web file
Item      identifier: URL               same URL as the resource
          set membership: MIME type     MIME type of the resource
Record    metadataPrefix: http_header   the http headers that would have been returned via HTTP GET/HEAD; datestamp = modification date of resource
Record    metadataPrefix: oai_dc        a subset of http_header in DC; datestamp = modification date of resource
Record    metadataPrefix: oai_didl      MPEG-21 DIDL: base64-encoded resource + http_header metadata; datestamp = modification date of resource

OAI-PMH data model
(diagram: records range from simple to highly expressive — Dublin Core metadata (simple), MARCXML metadata (more expressive), METS and MPEG-21 DIDL (highly expressive); the OAI-PMH identifier is the entry point to all records pertaining to the resource, and the records carry metadata pertaining to the resource)

Warrick API
The API should provide a clear and flexible interface for web repositories
Goals:
– Shield Warrick from changes to the WI
– Facilitate inclusion of new web repositories
– Minimize implementation and maintenance costs

Evaluation Internet Archive has endorsed use of Warrick Make Warrick available on SourceForge Measure the community adoption & modification

Performance of mod_oai and wget on
For more detail: “mod_oai: An Apache Module for Metadata Harvesting”

IndexIgnore & File Permissions

Alias: Covering Up Files
httpd.conf:
  Alias /A /usr/local/web/htdocs/B
  Alias /B /usr/local/web/htdocs/A
the files “A” and “B” will be different from the URLs

UserDir: “Just in Time” mounting of directories
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/  mln/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso
/home/tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/  mln/  tharriso/

Complex Object Formats: Characteristics
– Representation of a digital object by means of a wrapper XML document
– The represented resource can be a simple digital object (consisting of a single datastream) or a compound digital object (consisting of multiple datastreams)
– Unambiguous approach to convey identifiers of the digital object and its constituent datastreams
– Datastreams can be included By-Value (embedding of the base64-encoded datastream) or By-Reference (embedding the network location of the datastream); the two are not mutually exclusive, and are equivalent
– A variety of secondary information can be included, By-Value or By-Reference: descriptive metadata, rights information, technical metadata, …
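The By-Value case can be illustrated with a simplified wrapper document. The element names below are invented stand-ins, not the real MPEG-21 DIDL schema; the point is only the shape — one XML document carrying the base64 datastream plus its secondary information:

```python
import base64
from xml.sax.saxutils import escape

def wrap_by_value(identifier: str, data: bytes, mime: str, metadata: dict) -> str:
    """Package a datastream By-Value alongside its secondary information.
    <object>/<meta>/<datastream> are illustrative element names only."""
    meta = "\n".join(f'  <meta name="{escape(k)}">{escape(v)}</meta>'
                     for k, v in metadata.items())
    b64 = base64.b64encode(data).decode()
    return (f'<object identifier="{escape(identifier)}">\n'
            f'{meta}\n'
            f'  <datastream mimetype="{mime}" encoding="base64">{b64}</datastream>\n'
            f'</object>')
```

A By-Reference variant would replace the base64 body with the datastream's network location; both can appear in the same wrapper.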

Resource Harvesting: Use cases
Discovery: use the content itself in the creation of services
– search engines that make full-text searchable
– citation indexing systems that extract references from the full-text content
– browsing interfaces that include thumbnail versions of high-quality images from cultural heritage collections
Preservation:
– periodically transfer digital content from a data repository to one or more trusted digital repositories
– trusted digital repositories need a mechanism to automatically synchronize with the originating data repository
Ideas first presented in Van de Sompel, Nelson, Lagoze & Warner,