Repository Synchronization Using NNTP and SMTP Michael L. Nelson, Joan A. Smith, Martin Klein Old Dominion University Norfolk VA www.cs.odu.edu/~{mln,jsmit,mklein}


1 Repository Synchronization Using NNTP and SMTP Michael L. Nelson, Joan A. Smith, Martin Klein Old Dominion University Norfolk VA www.cs.odu.edu/~{mln,jsmit,mklein} DLF Spring 2006 Austin TX April 10-12, 2006

2 Preservation: Fortress Model
Five Easy Steps for Preservation:
1. Get a lot of $
2. Buy a lot of disks, machines, tapes, etc.
3. Hire an army of staff
4. Load a small amount of data
5. “Look upon my archive ye Mighty, and despair!”
image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg

3 Alternate Models of Preservation
Lazy Preservation
– Let Google, IA et al. preserve your website
Just-In-Time Preservation
– Find a “good enough” replacement web page
Web Server Based Preservation
– Use Apache modules to create archival-ready resources
Shared Infrastructure Preservation
– Push your content to sites that might preserve it
image from: http://www.proex.ufes.br/arsm/knots_interlaced.htm

4 Shared, Existing Infrastructure Can we (re)use existing installed network infrastructure for preservation purposes? Who has the Bigger Fortress?

5 Experiment & Simulation
Inject the contents of an OAI-PMH repository directly into:
– Email (SMTP)
– Usenet News (NNTP)
Instrument existing email and news servers
Use mod_oai (www.modoai.org) to do resource harvesting
– complex object formats (e.g. MPEG-21 DIDL) used to encode the resources as “lumps of XML”
– results are generalizable to any repository system
Analyze testbed, simulate very large collections

6 Test Repository
Website with 72 files
– HTML, PDF, PNG, JPEG, GIF
– 1 KB to 1.5 MB
Used a script to harvest the MPEG-21 DIDLs, and then:
– attach them to outbound email messages
– post them to a moderated newsgroup (repository.odu.test1)
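The harvesting script itself is not shown on the slide. As a minimal sketch of that step, the function below pulls the per-resource XML out of an OAI-PMH ListRecords response using only the standard library. The function name `extract_didls`, the sample response, and the example.org identifier are illustrative assumptions, not taken from the authors' script.

```python
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH 2.0 response envelope.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def extract_didls(list_records_xml: str) -> dict[str, bytes]:
    """Map each record identifier to the serialized XML inside its <metadata>."""
    root = ET.fromstring(list_records_xml)
    didls = {}
    for record in root.iter(f"{OAI_NS}record"):
        identifier = record.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
        metadata = record.find(f"{OAI_NS}metadata")
        if identifier is None or metadata is None or len(metadata) == 0:
            continue  # e.g. deleted records carry a header but no metadata
        # The first child of <metadata> is the complex object (here, a DIDL).
        didls[identifier] = ET.tostring(metadata[0], encoding="utf-8")
    return didls


# Hypothetical, heavily trimmed response used only to exercise the function.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>http://www.example.org/index.html</identifier>
        <datestamp>2006-04-10T12:00:00Z</datestamp>
      </header>
      <metadata>
        <didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS">
          <didl:Item/>
        </didl:DIDL>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    for identifier, didl in extract_didls(SAMPLE).items():
        print(identifier, len(didl), "bytes")
```

Each value returned is a self-contained “lump of XML” that can then be attached to a message or posted to a newsgroup.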

7 General Architecture

8 Email

9 Adding Email Attachments / Headers
(figure labels: outgoing mail, incoming mail)

10 Email Headers
(figure labels)
– OAI-PMH & HTTP headers
– base64 encoded DIDL
– original email message
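The three parts labeled on this slide (extra headers, a base64 encoded DIDL, and the original message) can be sketched with Python's standard `email` package. This is an illustration under stated assumptions: the `X-OAI-PMH-*` header names and the attachment filename are hypothetical, since the transcript does not preserve the header names the authors actually used.

```python
from email.message import EmailMessage


def add_didl(message: EmailMessage, didl_xml: bytes,
             identifier: str, datestamp: str) -> EmailMessage:
    """Piggyback one harvested resource on an existing outbound message."""
    # Hypothetical header names; the real ones are not given in the slides.
    message["X-OAI-PMH-Identifier"] = identifier
    message["X-OAI-PMH-Datestamp"] = datestamp
    # Attaching bytes makes the message multipart/mixed; the original body
    # stays as the first part and the attachment is base64 encoded.
    message.add_attachment(didl_xml, maintype="application", subtype="xml",
                           filename="resource-didl.xml")
    return message


if __name__ == "__main__":
    original = EmailMessage()
    original["From"] = "alice@example.org"
    original["To"] = "bob@example.org"
    original["Subject"] = "Lunch?"
    original.set_content("Are you free at noon?")
    print(add_didl(original, b"<DIDL/>", "http://www.example.org/index.html",
                   "2006-04-10T12:00:00Z"))
```

The recipient's mail client still shows the original text; an instrumented incoming mail server could strip the attachment and headers before delivery and hand them to an archive.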

11 SMTP Overhead
– approximately 1 second penalty per message
– diminishing returns for skipping messages

12 Email Traffic @ mail.cs.odu.edu
30 days of traffic
– 505,987 messages
– 4,081 unique hosts
– daily mean: 16,866; standard deviation: 5,147
Power law: P(x) = a x^{-b}; we measured b ≈ 1.6
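The slide reports only the fitted exponent, not how it was obtained. One common way to estimate b for a power law P(x) = a x^{-b} is a least-squares line fit in log-log space, because log P(x) = log a - b log x. The sketch below assumes that approach and uses synthetic data; the function name `fit_power_law` is illustrative, and the authors' fitting method may differ.

```python
import math


def fit_power_law(xs, ys):
    """Estimate (a, b) in y = a * x**(-b) by linear regression on logs."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx                  # slope of the log-log line equals -b
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


if __name__ == "__main__":
    # Synthetic, noise-free data generated with the exponent from the slide.
    xs = list(range(1, 51))
    ys = [1000.0 * x ** -1.6 for x in xs]
    a, b = fit_power_law(xs, ys)
    print(f"a = {a:.1f}, b = {b:.2f}")
    # Sanity check on the slide's daily mean: total messages over 30 days.
    print(round(505_987 / 30))
```

On real traffic logs the points scatter around the line, so the recovered exponent is an estimate rather than an exact value.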

13 News

14 News Posting
(figure labels)
– OAI-PMH & HTTP headers
– base64 encoded DIDL
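A news article uses the same header-plus-body layout as email, so a posting of this kind can be sketched with the same standard-library classes. The sketch only builds the article; it does not contact a news server. The `X-OAI-PMH-Identifier` header name, the function name, and the addresses are hypothetical. The `Approved` header is included because moderated groups, such as the test group named earlier, generally accept only articles that carry it.

```python
from email.message import EmailMessage


def build_news_article(didl_xml: bytes, identifier: str, newsgroup: str,
                       sender: str, approved_by: str) -> EmailMessage:
    """Build a single-part article whose body is the base64 encoded DIDL."""
    article = EmailMessage()
    article["From"] = sender
    article["Newsgroups"] = newsgroup
    article["Subject"] = identifier
    article["Approved"] = approved_by               # needed for moderated groups
    article["X-OAI-PMH-Identifier"] = identifier    # hypothetical header name
    article.set_content(didl_xml, maintype="application", subtype="xml",
                        cte="base64")
    return article


if __name__ == "__main__":
    print(build_news_article(b"<DIDL/>", "http://www.example.org/index.html",
                             "repository.odu.test1", "repo@example.org",
                             "moderator@example.org"))
```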

15 News Overhead

16 News Policies

17 Simulation Parameters
Repository
– 100,000 items
– 1 MB/item
– 100 daily additions
– 400 daily updates
Time
– 2000 days (5.5 years)
Email
– granularity = 1
– follows ODU power law example
News
– servers hold contents for 30 days
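Some back-of-the-envelope quantities follow directly from these parameters. The sketch below is plain arithmetic on the listed values, not the authors' simulator. It assumes that only added or updated items are pushed each day and that every pushed item is a separate posting; both assumptions are illustrative, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulationParameters:
    initial_items: int = 100_000
    item_size_mb: int = 1
    daily_additions: int = 100
    daily_updates: int = 400
    days: int = 2000
    news_retention_days: int = 30


def items_after(p: SimulationParameters, day: int) -> int:
    """Repository size in items after the given number of days."""
    return p.initial_items + p.daily_additions * day


def daily_change_volume_mb(p: SimulationParameters) -> int:
    """MB pushed per day if only added or updated items are sent (assumption)."""
    return (p.daily_additions + p.daily_updates) * p.item_size_mb


def postings_within_retention(p: SimulationParameters) -> int:
    """Upper bound on change postings a news server holds at any one time."""
    return (p.daily_additions + p.daily_updates) * p.news_retention_days


if __name__ == "__main__":
    p = SimulationParameters()
    print(items_after(p, p.days))          # items at the end of the run
    print(daily_change_volume_mb(p))       # MB of changes per day
    print(postings_within_retention(p))    # postings inside the 30-day window
    print(round(p.days / 365, 1))          # simulated years
```

Under these assumptions the repository grows to 300,000 items (about 300 GB) while a news server retains at most 15,000 change postings, which illustrates why a server that expires content after 30 days cannot hold a full copy from change traffic alone.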

18 NNTP Results

19 Email Results (Without Memory)

20 Email Results (With Memory)

21 Discussion
We’ve examined the worst-case scenario
– large, active repository
– sending contents by-value
Optimizations / Alternatives
– smaller, less dynamic repositories
– sending contents by-reference
– use for repository discovery, not for content interchange: instead of sending “GetRecord” results, send “Identify” results and let interested parties return to your site with proper harvesters
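The difference between the two verbs mentioned above shows up in the request itself. In OAI-PMH, Identify takes no arguments and describes the repository, while GetRecord names one item and a metadata format. The sketch below builds both request URLs; the base URL, the identifier, and the `oai_didl` metadata prefix are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def oai_request_url(base_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


if __name__ == "__main__":
    base = "http://www.example.org/modoai"  # hypothetical repository base URL

    # By-reference / discovery: one small response describing the repository.
    print(oai_request_url(base, "Identify"))

    # By-value: one request (and one potentially large response) per item.
    print(oai_request_url(base, "GetRecord",
                          identifier="oai:example.org:item1",
                          metadataPrefix="oai_didl"))
```

Sending the Identify response, or just the base URL, keeps each message small and leaves the full transfer to a harvester that a receiving party runs later.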

22 Summary
Shared, existing infrastructure can be used to push content to unknown preservation partners
– exploiting not just hardware infrastructure, but human communication patterns for resource discovery as well
While not possessing ideal DL/Archival capabilities, these methods are congruent with standard web practices
– Gmail, Google Groups, etc. will always have more disks than you…

