Repository Synchronization Using NNTP and SMTP
Michael L. Nelson, Joan A. Smith, Martin Klein
Old Dominion University, Norfolk VA
DLF Spring 2006, Austin TX, April 10-12, 2006
Preservation: Fortress Model
Five Easy Steps for Preservation:
1. Get a lot of $
2. Buy a lot of disks, machines, tapes, etc.
3. Hire an army of staff
4. Load a small amount of data
5. “Look upon my archive ye Mighty, and despair!”
Alternate Models of Preservation
Lazy Preservation
– Let Google, IA et al. preserve your website
Just-In-Time Preservation
– Find a “good enough” replacement web page
Web Server Based Preservation
– Use Apache modules to create archival-ready resources
Shared Infrastructure Preservation
– Push your content to sites that might preserve it
Shared, Existing Infrastructure
Can we (re)use existing installed network infrastructure for preservation purposes?
Who has the Bigger Fortress?
Experiment & Simulation
Inject the contents of an OAI-PMH repository directly into:
– Email (SMTP)
– Usenet News (NNTP)
Instrument existing email, news servers
Use mod_oai to do resource harvesting
– complex object formats (e.g. MPEG-21 DIDL) used to encode the resources as “lumps of XML”
– results are generalizable to any repository system
Analyze testbed, simulate very large collections
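The harvesting step can be sketched as parsing an OAI-PMH ListRecords response. This is only an illustration: the sample response, identifiers, and resumption token below are hand-written and hypothetical, the DIDL payload is elided, and the function name is not from the slides.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical, hand-written response fragment (not real mod_oai output).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>http://www.example.org/index.html</identifier>
        <datestamp>2006-04-10T12:00:00Z</datestamp>
      </header>
      <metadata><!-- DIDL payload would appear here --></metadata>
    </record>
    <record>
      <header>
        <identifier>http://www.example.org/paper.pdf</identifier>
        <datestamp>2006-04-11T08:30:00Z</datestamp>
      </header>
      <metadata><!-- DIDL payload would appear here --></metadata>
    </record>
    <resumptionToken>token-0002</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text):
    """Return ([(identifier, datestamp), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for header in root.findall("oai:ListRecords/oai:record/oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        datestamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        records.append((identifier, datestamp))
    # A resumptionToken signals that more records remain to be harvested.
    token = root.findtext("oai:ListRecords/oai:resumptionToken", namespaces=OAI_NS)
    return records, token


records, token = parse_list_records(SAMPLE_RESPONSE)
```

A harvester would repeat the request with the returned token until no token comes back, then hand each record to the email or news injection step.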
Test Repository
Website with 72 files
– HTML, PDF, PNG, JPEG, GIF
– 1KB MB
Used a script to harvest the MPEG-21 DIDLs, and then:
– attach to outbound email mesgs
– post to a moderated newsgroup (repository.odu.test1)
General Architecture
Adding Attachments / Headers
[Figure: outgoing mail and incoming mail paths]
Headers
– OAI-PMH & HTTP headers
– base64 encoded DIDL
– original mesg
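The message layout above can be sketched with Python's standard email package. The X-OAI-PMH-* header names, the addresses, and the DIDL placeholder are hypothetical; the slides do not give the exact header names used in the experiment.

```python
from email.message import EmailMessage


def attach_didl(msg, didl_xml, identifier, datestamp):
    """Add repository headers and a base64-encoded DIDL to an outbound message."""
    # Hypothetical header names; the real experiment's names are not shown here.
    msg["X-OAI-PMH-Identifier"] = identifier
    msg["X-OAI-PMH-Datestamp"] = datestamp
    # Bytes attachments are base64 encoded by default.
    msg.add_attachment(
        didl_xml.encode("utf-8"),
        maintype="application",
        subtype="xml",
        filename="record.xml",
    )
    return msg


# The original message that a user is sending anyway.
original = EmailMessage()
original["From"] = "alice@example.org"
original["To"] = "bob@example.org"
original["Subject"] = "Lunch?"
original.set_content("See you at DLF.")

# Placeholder payload standing in for a harvested MPEG-21 DIDL document.
didl = "<didl:DIDL xmlns:didl='urn:mpeg:mpeg21:2002:02-DIDL-NS'/>"
carrier = attach_didl(original, didl, "http://www.example.org/index.html", "2006-04-10")
```

The original body is untouched; the repository content simply rides along as an extra MIME part.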
SMTP Overhead
– ~1 sec penalty per mesg
– diminishing returns for skipping mesgs
mail.cs.odu.edu
30 days of traffic
– 505,987 mesgs
– 4081 unique hosts
– daily mean: 16,866; std dev: 5147
P(x) = a x^{-b}; we measured b ≈ 1.6
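The slides do not say how the exponent was measured. One common approach, shown here only as an assumption, is a least-squares fit on log-log axes, since log P = log a - b log x is a straight line with slope -b. The data below is synthetic, generated with b = 1.6, not the ODU mail log.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by linear regression in log-log space; return (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    slope = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y)) / sum(
        (u - mean_x) ** 2 for u in log_x
    )
    intercept = mean_y - slope * mean_x
    # The slope of the log-log line is -b; the intercept is log a.
    return math.exp(intercept), -slope


# Synthetic, noise-free data with a = 1000 and b = 1.6 (illustrative values).
xs = list(range(1, 51))
ys = [1000.0 * x ** -1.6 for x in xs]
a, b = fit_power_law(xs, ys)
```

On real, noisy traffic data a plain log-log regression can be biased, so this should be read as a first approximation only.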
News
News Posting
– OAI-PMH & HTTP headers
– base64 encoded DIDL
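A news posting with this layout can be sketched as follows. Only the newsgroup name comes from the test setup described earlier; the addresses, the X-OAI-PMH-Identifier header name, and the use of the resource URL as the subject are hypothetical. The sketch builds the article only and leaves the actual NNTP POST to whatever client is available.

```python
from email.message import EmailMessage


def build_article(didl_xml, identifier, newsgroup="repository.odu.test1"):
    """Build a Usenet article carrying one DIDL document as its base64 body."""
    article = EmailMessage()
    article["From"] = "repository@example.org"    # hypothetical poster address
    article["Newsgroups"] = newsgroup
    article["Subject"] = identifier               # assumption: one resource per article
    article["Approved"] = "moderator@example.org" # moderated groups expect this header
    article["X-OAI-PMH-Identifier"] = identifier  # hypothetical header name
    article.set_content(
        didl_xml.encode("utf-8"),
        maintype="application",
        subtype="xml",
        cte="base64",
    )
    return article


article = build_article("<didl/>", "http://www.example.org/index.html")
wire_text = article.as_string()  # headers, a blank line, then the base64 body
```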
News Overhead
News Policies
Simulation Parameters
Repository
– 100,000 items
– 1MB/item
– 100 daily additions
– 400 daily updates
Time
– 2000 days (5.5 years)
– granularity=1
Email
– follows ODU power law example
News
– servers hold contents for 30 days
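A back-of-the-envelope check on these parameters, not the authors' simulation: assuming only additions and updates are posted, and counting every posting as a distinct item (an upper bound), the 30-day retention window limits how much of the repository a news server can hold at once. The function and argument names are illustrative.

```python
def retention_window_coverage(
    initial_items=100_000,
    daily_additions=100,
    daily_updates=400,
    retention_days=30,
    total_days=2000,
):
    """Return (final_items, postings_in_window, share_of_initial, share_of_final)."""
    # Only additions grow the repository; updates touch existing items.
    final_items = initial_items + daily_additions * total_days
    # Upper bound on distinct items a server retains if only changes are posted.
    postings_in_window = (daily_additions + daily_updates) * retention_days
    return (
        final_items,
        postings_in_window,
        postings_in_window / initial_items,
        postings_in_window / final_items,
    )


final_items, in_window, share_initial, share_final = retention_window_coverage()
```

With the slide's values this gives 300,000 items after 2000 days and at most 15,000 postings (about 15 GB at 1MB/item) inside any 30-day window, so change traffic alone never covers the whole repository; unchanged items would have to be re-posted under some policy.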
NNTP Results
Results (Without Memory)
Results (With Memory)
Discussion
We’ve examined the worst case scenario
– large, active repository
– sending contents by-value
Optimizations / Alternatives
– smaller, less dynamic repositories
– sending contents by-reference
– use for repository discovery, not for content interchange
  instead of sending “GetRecord” results, send “Identify” results and let interested parties return to your site with proper harvesters
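The difference between the two OAI-PMH requests can be sketched as follows. The base URL is hypothetical, and the oai_didl metadata prefix is an assumption about how the repository labels its DIDL format. Identify and GetRecord are standard OAI-PMH verbs.

```python
from urllib.parse import urlencode


def oai_request(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    return base_url + "?" + urlencode({"verb": verb, **arguments})


BASE_URL = "http://www.example.org/modoai"  # hypothetical repository base URL

# By-reference: a small, fixed-size description of the repository itself.
identify_url = oai_request(BASE_URL, "Identify")

# By-value: one request (and one full payload) per resource.
get_record_url = oai_request(
    BASE_URL,
    "GetRecord",
    identifier="http://www.example.org/index.html",
    metadataPrefix="oai_didl",  # assumed prefix name
)
```

Sending the Identify response advertises where the repository lives at a cost that does not grow with the collection, while GetRecord results scale with the number and size of items.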
Summary
Shared, existing infrastructure can be used to push content to unknown preservation partners
– exploiting not just hardware infrastructure, but human communication patterns for resource discovery as well
While not possessing ideal DL/Archival capabilities, these methods are congruent with standard web practices
– Gmail, Google Groups, etc. will always have more disks than you…