Published by Brianna Horn. Modified over 8 years ago.
1
Repository Synchronization Using NNTP and SMTP
Michael L. Nelson, Joan A. Smith, Martin Klein
Old Dominion University, Norfolk VA
www.cs.odu.edu/~{mln,jsmit,mklein}
DLF Spring 2006, Austin TX, April 10-12, 2006
2
Preservation: Fortress Model
Five Easy Steps for Preservation:
1. Get a lot of $
2. Buy a lot of disks, machines, tapes, etc.
3. Hire an army of staff
4. Load a small amount of data
5. “Look upon my archive ye Mighty, and despair!”
image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
3
Alternate Models of Preservation
Lazy Preservation
– Let Google, IA et al. preserve your website
Just-In-Time Preservation
– Find a “good enough” replacement web page
Web Server Based Preservation
– Use Apache modules to create archival-ready resources
Shared Infrastructure Preservation
– Push your content to sites that might preserve it
image from: http://www.proex.ufes.br/arsm/knots_interlaced.htm
4
Shared, Existing Infrastructure
Can we (re)use existing installed network infrastructure for preservation purposes?
Who has the Bigger Fortress?
5
Experiment & Simulation
Inject the contents of an OAI-PMH repository directly into:
– Email (SMTP)
– Usenet News (NNTP)
Instrument existing email, news servers
Use mod_oai (www.modoai.org) to do resource harvesting
– complex object formats (e.g. MPEG-21 DIDL) used to encode the resources as “lumps of XML”
– results are generalizable to any repository system
Analyze testbed, simulate very large collections
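A harvest against mod_oai is an ordinary OAI-PMH request over HTTP, so the first step of the experiment can be sketched as URL construction. This is a minimal sketch only: the base URL is hypothetical, and the `oai_didl` metadata prefix is an assumption about how the repository labels its MPEG-21 DIDL output.

```python
from urllib.parse import urlencode

def oai_request_url(base_url, verb, **args):
    """Build an OAI-PMH request URL from a base URL, a verb, and its arguments."""
    query = urlencode({"verb": verb, **args})
    return f"{base_url}?{query}"

# Hypothetical base URL; a real mod_oai installation exposes its own.
BASE_URL = "http://www.example.org/modoai"

# Request every record as a DIDL "lump of XML" (metadata prefix assumed).
list_url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_didl")

# Request only the repository description.
identify_url = oai_request_url(BASE_URL, "Identify")

print(list_url)
print(identify_url)
```

The same helper covers both the by-value case (`ListRecords`, `GetRecord`) and the lighter `Identify` request.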
6
Test Repository
Website with 72 files
– HTML, PDF, PNG, JPEG, GIF
– 1 KB - 1.5 MB
Used a script to harvest the MPEG-21 DIDLs, and then:
– attach to outbound email messages
– post to a moderated newsgroup (repository.odu.test1)
7
General Architecture
8
Email
9
Adding Email Attachments / Headers
(figure labels: outgoing mail, incoming mail)
10
Email Headers
(figure labels: OAI-PMH & HTTP headers; base64 encoded DIDL; original email message)
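The three parts named on this slide (extra headers, a base64 encoded DIDL, and the untouched original message) can be illustrated with Python's standard `email` package. This is a sketch under assumptions, not the authors' mail-server instrumentation: the `X-OAI-Identifier` header name, the attachment filename, and the sample addresses are all hypothetical.

```python
from email.message import EmailMessage

def attach_didl(msg, didl_xml, identifier):
    """Add an OAI identifier header and a base64 encoded DIDL attachment.

    The original body is kept as the first part of the resulting
    multipart/mixed message.
    """
    msg["X-OAI-Identifier"] = identifier  # hypothetical header name
    msg.add_attachment(
        didl_xml.encode("utf-8"),
        maintype="application",
        subtype="xml",
        filename="record.xml",  # hypothetical filename
    )
    return msg

# An ordinary outbound message (addresses are placeholders).
original = EmailMessage()
original["From"] = "alice@example.org"
original["To"] = "bob@example.org"
original["Subject"] = "Lunch?"
original.set_content("See you at noon.")

tagged = attach_didl(original, "<didl>...</didl>", "oai:example.org:1")
attachment = next(tagged.iter_attachments())
print(attachment["Content-Transfer-Encoding"])
```

Binary attachments added this way default to base64 transfer encoding, which matches the encoding named on the slide.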
11
SMTP Overhead
– ~1 sec penalty per message
– diminishing returns for skipping messages
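One way to read the "diminishing returns" point is as simple arithmetic: if only every g-th message carries a record, total overhead falls as 1/g, so each further doubling of g saves half as much as the previous one. The sketch below assumes that reading of granularity and uses the daily mean of 16,866 messages reported for mail.cs.odu.edu; the function name is illustrative.

```python
def daily_overhead_seconds(messages_per_day, granularity, penalty=1.0):
    """Seconds of added delay per day when every `granularity`-th
    message carries an attachment costing `penalty` seconds."""
    return (messages_per_day // granularity) * penalty

DAILY_MEAN = 16_866  # messages per day, from the email traffic measurements

for g in (1, 2, 4, 8):
    print(g, daily_overhead_seconds(DAILY_MEAN, g))
```

Going from g=1 to g=2 saves about 8,400 seconds a day, while going from g=4 to g=8 saves only about 2,100.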
12
Email Traffic @ mail.cs.odu.edu
30 days of traffic
– 505,987 messages
– 4081 unique hosts
– daily mean: 16,866; std dev: 5147
Power-law fit: P(x) = a x^{-b}; we measured b ≈ 1.6
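To make the power law P(x) = a x^{-b} with b ≈ 1.6 concrete, the sketch below turns it into a normalized distribution over the 4081 observed hosts. It assumes that x is a host's rank by traffic and that a is chosen so the probabilities sum to 1; the slide does not state either, so treat both as illustrative.

```python
def power_law_weights(n, b=1.6):
    """Return P(x) = a * x**-b for x = 1..n, with a chosen so the sum is 1."""
    raw = [x ** -b for x in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]

weights = power_law_weights(4081)

# Share of traffic going to the ten highest-ranked hosts under this model.
top_ten_share = sum(weights[:10])
print(round(top_ten_share, 3))
```

Under this model most messages go to a small number of hosts, which is why the long tail of destinations receives repository content only slowly.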
13
News
14
News Posting
(figure labels: OAI-PMH & HTTP headers; base64 encoded DIDL)
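A news article uses the same header-plus-body layout as email, with a `Newsgroups` header naming the target group. The sketch below only builds the article text; it does not contact a news server. The sender address and the `X-OAI-Identifier` header are hypothetical, and a moderated group such as repository.odu.test1 would also need whatever approval step its server requires.

```python
import base64
from email.message import EmailMessage

def build_news_article(didl_xml, identifier, newsgroup="repository.odu.test1"):
    """Build a news article whose body is the base64 encoded DIDL."""
    article = EmailMessage()
    article["From"] = "repository@example.org"  # hypothetical sender
    article["Newsgroups"] = newsgroup
    article["Subject"] = identifier
    article["X-OAI-Identifier"] = identifier  # hypothetical header name
    encoded = base64.encodebytes(didl_xml.encode("utf-8")).decode("ascii")
    article.set_content(encoded)
    return article

article = build_news_article("<didl/>", "oai:example.org:1")

# A reader recovers the record by decoding the body.
recovered = base64.b64decode(article.get_content())
print(recovered)
```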
15
News Overhead
16
News Policies
17
Simulation Parameters
Repository
– 100,000 items
– 1 MB/item
– 100 daily additions
– 400 daily updates
Time
– 2000 days (5.5 years)
Email
– granularity = 1
– follows ODU power law example
News
– servers hold contents for 30 days
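These parameters already imply some back-of-the-envelope bounds. The sketch below is not the authors' simulator. It assumes a hypothetical baseline policy in which only each day's additions and updates are posted, and computes how large the repository grows and the most a 30-day news server could then hold.

```python
INITIAL_ITEMS = 100_000
DAILY_ADDITIONS = 100
DAILY_UPDATES = 400
DAYS = 2000
RETENTION_DAYS = 30

def repository_size(day):
    """Number of items in the repository after `day` days of additions."""
    return INITIAL_ITEMS + DAILY_ADDITIONS * day

def max_news_coverage(day):
    """Upper bound on the fraction of the repository held by a news server
    when only new and updated items are posted (assumed policy).

    It is an upper bound because repeated updates to the same item
    are counted as distinct postings.
    """
    posted = (DAILY_ADDITIONS + DAILY_UPDATES) * min(day, RETENTION_DAYS)
    return min(1.0, posted / repository_size(day))

print(repository_size(DAYS))    # items at the end of the run
print(max_news_coverage(DAYS))  # fraction retained under the assumed policy
```

With these numbers the repository triples to 300,000 items, and a server fed only with daily changes can hold at most 5% of it, so a policy that also re-posts older items would be needed for fuller coverage.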
18
NNTP Results
19
Email Results (Without Memory)
20
Email Results (With Memory)
21
Discussion
We’ve examined the worst case scenario
– large, active repository
– sending contents by-value
Optimizations / Alternatives
– smaller, less dynamic repositories
– sending contents by-reference
– use for repository discovery, not for content interchange: instead of sending “GetRecord” results, send “Identify” results and let interested parties return to your site with proper harvesters
22
Summary
Shared, existing infrastructure can be used to push content to unknown preservation partners
– exploiting not just hardware infrastructure, but human communication patterns for resource discovery as well
While not possessing ideal DL/Archival capabilities, these methods are congruent with standard web practices
– Gmail, Google Groups, etc. will always have more disks than you…