WADL 2013 July th Indianapolis, IN Martin SiteStory Archiving Done Differently Justin F. Brunelle
LANL SiteStory Team lead developer
Archiving - the traditional way
  - Actively crawl the web
  - For example, using Heritrix
Archiving - the traditional way
Issues with crawler-based archiving:
  - Request can be rejected (robots.txt, user-agent, IP)
  - Can be deceived (geo-location, user-agent)
  - Can be trapped (crawl my calendar!)
  - Requires constant and massive bandwidth
  - Implied timing problem: when to crawl?
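To make the first issue concrete, here is a minimal sketch of a robots.txt policy that rejects a polite crawler while leaving other user agents mostly unrestricted. It uses Python's standard urllib.robotparser; the robots.txt content, the "archive-crawler" user agent, and the example.org URLs are hypothetical and only serve as illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, not taken from any real site.
ROBOTS_TXT = """\
User-agent: archive-crawler
Disallow: /

User-agent: *
Disallow: /calendar/
"""


def can_crawl(user_agent: str, url: str) -> bool:
    """Return True if the hypothetical policy above allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    urls = (
        "http://example.org/index.html",
        "http://example.org/calendar/2013/07",
    )
    for agent in ("archive-crawler", "Mozilla/5.0"):
        for url in urls:
            print(agent, url, can_crawl(agent, url))
```

A browser never consults robots.txt, so a visitor still sees pages that a rule-abiding crawler is told to skip.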
Archiving - the traditional way
Timing problem: Update 1 viewed but not archived
  t1  R created
  t2  browser visit1
  t3  crawler visit1
  t4  R update1
  t5  browser visit2
  t6  R update2
Archiving - the SiteStory way
  - Transactional Web archiving
  - Archive accepts HTTP transaction between browser and server
Archiving - the traditional way
Timing problem: Update 1 viewed and archived
  t1  R created
  t2  browser visit1
  t3  crawler visit1
  t4  R update1
  t5  browser visit2
  t6  R update2
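The timing problem can be replayed as a small simulation. This is only a sketch of the t1..t6 timeline: the function name and the version labels ("original", "update1", "update2") are assumptions made for illustration. Archiving is triggered either by crawler visits (traditional) or by browser visits (transactional).

```python
# Events from the timeline, in order t1..t6.
EVENTS = [
    ("t1", "R created"),
    ("t2", "browser visit1"),
    ("t3", "crawler visit1"),
    ("t4", "R update1"),
    ("t5", "browser visit2"),
    ("t6", "R update2"),
]


def archived_versions(events, trigger):
    """Return the versions of R captured when `trigger` events cause archiving."""
    current = None
    captured = []
    for _, event in events:
        if event == "R created":
            current = "original"
        elif event.startswith("R update"):
            current = event.split()[1]  # "update1" or "update2"
        elif event.startswith(trigger) and current not in captured:
            captured.append(current)
    return captured


crawler_archive = archived_versions(EVENTS, "crawler")
transactional_archive = archived_versions(EVENTS, "browser")

if __name__ == "__main__":
    print("crawler-based:", crawler_archive)
    print("transactional:", transactional_archive)
```

In this replay the crawler-triggered archive holds only the original version, while the browser-triggered archive also holds update 1. Update 2 is in neither archive because no one has requested it yet.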
Archiving - the SiteStory way
Challenges with transactional archiving:
  - To be archived, server has to cooperate
  - Transfer data to archive, batch mode or real-time
  - Archive must trust transmission to be authentic
  - Resources from external servers have to be archived out-of-band
  - Deduplication challenges
    - Alias: different URI, same response
    - Conneg: same URI, different response
  - Determine "significant" content change
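The two deduplication cases can be told apart by indexing captures in both directions: body digest to URIs, and URI to body digests. The sketch below assumes SHA-256 as the digest and uses made-up example.org captures. It is not SiteStory's implementation.

```python
import hashlib
from collections import defaultdict


def body_digest(body: bytes) -> str:
    """Digest of a response body (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(body).hexdigest()


# Hypothetical captured responses: (URI, response body).
captures = [
    ("http://example.org/", b"<html>home</html>"),
    ("http://example.org/index.html", b"<html>home</html>"),  # alias of "/"
    ("http://example.org/report", b"<html>report</html>"),    # HTML variant
    ("http://example.org/report", b"%PDF-1.4 report"),        # PDF variant
]

uris_by_digest = defaultdict(set)
digests_by_uri = defaultdict(set)
for uri, body in captures:
    digest = body_digest(body)
    uris_by_digest[digest].add(uri)
    digests_by_uri[uri].add(digest)

# Alias: different URIs, same response body.
aliases = [sorted(uris) for uris in uris_by_digest.values() if len(uris) > 1]
# Candidate conneg: same URI, different response bodies.
negotiated = sorted(uri for uri, ds in digests_by_uri.items() if len(ds) > 1)

if __name__ == "__main__":
    print("aliases:", aliases)
    print("same URI, different bodies:", negotiated)
```

Body digests alone cannot distinguish content negotiation from a genuine update over time. A real archive would also need the request and response headers (for example Accept and Vary) to decide.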
SiteStory Status Quo
  - mod_sitestory sends HTTP PUT to SiteStory Web Archive upon client's GET request
    - not for POST, DELETE, etc.
    - for HTTP response codes 200, 302, 303
  - Client IP can be included in stored headers, configurable
  - Header info stored in BerkeleyDB, response body in FS
  - Dedup via hash(body)
  - Offloading content as WARC files possible (read: recommended)
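The rules above can be sketched as a filter plus a toy store. Everything here is illustrative: should_archive and ToyArchive are hypothetical names, in-memory containers stand in for BerkeleyDB and the file system, and SHA-256 is an assumed choice for hash(body), which the slide does not specify.

```python
import hashlib

ARCHIVED_METHODS = {"GET"}               # not POST, DELETE, etc.
ARCHIVED_STATUS_CODES = {200, 302, 303}


def should_archive(method: str, status: int) -> bool:
    """Decide whether a transaction would be forwarded to the archive."""
    return method.upper() in ARCHIVED_METHODS and status in ARCHIVED_STATUS_CODES


class ToyArchive:
    """In-memory stand-in: a list replaces BerkeleyDB, a dict replaces the FS."""

    def __init__(self, store_client_ip: bool = False):
        self.store_client_ip = store_client_ip
        self.header_records = []  # one record per archived transaction
        self.bodies = {}          # body digest -> body bytes

    def put(self, uri, headers, body, client_ip=None) -> bool:
        """Store one transaction; return True if the body was not seen before."""
        digest = hashlib.sha256(body).hexdigest()  # assumed hash function
        is_new_body = digest not in self.bodies
        if is_new_body:
            self.bodies[digest] = body
        record = {"uri": uri, "headers": dict(headers), "body_digest": digest}
        if self.store_client_ip and client_ip is not None:
            record["client_ip"] = client_ip
        self.header_records.append(record)
        return is_new_body


if __name__ == "__main__":
    archive = ToyArchive()
    if should_archive("GET", 200):
        print(archive.put("http://example.org/", {"Content-Type": "text/html"}, b"hi"))
        print(archive.put("http://example.org/", {"Content-Type": "text/html"}, b"hi"))
```

Keeping headers per transaction while storing each distinct body once means repeated requests for an unchanged resource add only a small header record.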
SiteStory Use Case
  - LANL has been archiving the DANS website (forever)
  - ~32 GB since mid-April 2013
  - >200k resources
To Appear: TPDL 2013
  - SiteStory benchmark with ab & wget
    - ApacheBench (ab): server stress test tool
    - wget: Web page download
      - All content: -p
  - Local network
  - Negligible difference between SiteStory and No SiteStory
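For readers who want to try a similar measurement, a minimal sketch of how the two tools could be driven from Python follows. The request count, concurrency, target URL, and download directory are assumptions, not the parameters of the benchmark described here. The flags used (ab -n and -c; wget -q, -p, and -P) are standard options of those tools.

```python
import subprocess
import time


def ab_command(url, requests=1000, concurrency=10):
    """ApacheBench: send `requests` requests, `concurrency` at a time."""
    return ["ab", "-n", str(requests), "-c", str(concurrency), url]


def wget_command(url, target_dir="/tmp/wget-bench"):
    """wget: -p also downloads page requisites (images, CSS, scripts)."""
    return ["wget", "-q", "-p", "-P", target_dir, url]


def timed_run(command):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(
        command,
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical target; only prints the commands, nothing is executed here.
    url = "http://localhost/index.html"
    print(" ".join(ab_command(url)))
    print(" ".join(wget_command(url)))
```

Running each command through timed_run once with mod_sitestory enabled and once with it disabled gives the kind of on vs. off comparison the benchmark reports.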
Re-executed on testbed
  ws-dl-03.cs.odu.edu x99,…,,
Testing with ab
Testing with wget
Round Trip Time -- Distributed
Results
  - Distributed:
    - Higher variance
    - Increased delay due to network
  - SiteStory on vs. off: performance still comparable
  - Viable solution without crippling service
SiteStory Installation
  - Apache module mod_sitestory
    - Option to exclude a list of directories
  - SiteStory Web Archive
    - Trivial for existing Tomcat environments
    - Tanuki Java wrapper (stand-alone) available
  - Configure, open ports, go!
  - Or…
SiteStory Testbed
We have a SiteStory Web Archive installed for you!
  1. Install and configure mod_sitestory
  2. Send an email containing:
     1. Your contact info
     2. Web server IP address
     3. Server domain name used
  3. Happy SiteStory'ing!
mailto: