WADL 2013, July 25-26, Indianapolis, IN: SiteStory Archiving Done Differently

1 SiteStory Archiving Done Differently
Martin Klein (@mart1nkle1n), Justin F. Brunelle
WADL 2013, July 25-26, Indianapolis, IN

2 LANL SiteStory Team lead developer

3 Archiving - the traditional way
- Actively crawl the web
- For example, using Heritrix

4 Archiving - the traditional way
Issues with crawler-based archiving:
- Request can be rejected (robots.txt, user-agent, IP)
- Can be deceived (geo-location, user-agent)
- Can be trapped (crawl my calendar!)
- Requires constant and massive bandwidth
- Implied timing problem: when to crawl?

5 Archiving - the traditional way
Timing problem: Update 1 viewed but not archived
- t1: R created
- t2: browser visit 1
- t3: crawler visit 1
- t4: R update 1
- t5: browser visit 2
- t6: R update 2

6 Archiving - the SiteStory way
- Transactional Web archiving
- Archive accepts HTTP transaction between browser and server

7 Archiving - the traditional way
Timing problem: Update 1 viewed and archived
- t1: R created
- t2: browser visit 1
- t3: crawler visit 1
- t4: R update 1
- t5: browser visit 2
- t6: R update 2
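
The timeline on slides 5 and 7 can be replayed in a few lines of code. The following is only a minimal sketch: the event names and the `archived_versions` helper are illustrative assumptions, not part of SiteStory. It contrasts an archive that captures only when a crawler visits with one that captures on every browser request.

```python
# Timeline from the slides as (time, event) pairs. Event names are illustrative.
EVENTS = [
    (1, "create"),   # t1: R created (version 0)
    (2, "browser"),  # t2: browser visit 1
    (3, "crawler"),  # t3: crawler visit 1
    (4, "update"),   # t4: R update 1 (version 1)
    (5, "browser"),  # t5: browser visit 2
    (6, "update"),   # t6: R update 2 (version 2)
]


def archived_versions(events, capture_on):
    """Return the sorted versions captured whenever a `capture_on` event occurs."""
    version = -1
    captured = set()
    for _, kind in sorted(events):
        if kind in ("create", "update"):
            version += 1           # the resource changes state
        elif kind == capture_on:
            captured.add(version)  # the archive records the current state
    return sorted(captured)


crawler_archive = archived_versions(EVENTS, "crawler")
transactional_archive = archived_versions(EVENTS, "browser")
print(crawler_archive)        # [0]
print(transactional_archive)  # [0, 1]
```

In this replay the crawler-based archive misses update 1 even though a browser saw it, while the transactional archive holds it. Neither holds update 2, because nothing has requested it yet.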

9 Archiving - the SiteStory way
Challenges with transactional archiving:
- To be archived, the server has to cooperate
- Transfer data to archive, batch mode or real-time
- Archive must trust transmission to be authentic
- Resources from external servers have to be archived out-of-band
- Deduplication challenges
  - Alias: different URI, same response
  - Conneg: same URI, different response
- Determine “significant” content change
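
The two deduplication cases above (alias and conneg) can be illustrated with a toy store that keys bodies by a content digest. This is a sketch under assumptions: the `DedupStore` class, the `variant` label, and the choice of SHA-256 are hypothetical and are not taken from the SiteStory code base.

```python
import hashlib


class DedupStore:
    """Toy archive: each distinct body is stored once, keyed by its digest."""

    def __init__(self):
        self.bodies = {}        # digest -> body bytes
        self.observations = []  # (uri, variant, digest), one per archived response

    def record(self, uri, body, variant=""):
        """Record a response; return True only if the body was not stored before."""
        digest = hashlib.sha256(body).hexdigest()  # hash choice is an assumption
        is_new_body = digest not in self.bodies
        if is_new_body:
            self.bodies[digest] = body
        self.observations.append((uri, variant, digest))
        return is_new_body


store = DedupStore()

# Alias: different URI, same response. The second body is not stored again.
alias_first = store.record("http://example.org/", b"<html>home</html>")
alias_second = store.record("http://example.org/index.html", b"<html>home</html>")

# Conneg: same URI, different response. Both bodies have to be kept.
conneg_en = store.record("http://example.org/doc", b"Hello", variant="en")
conneg_de = store.record("http://example.org/doc", b"Hallo", variant="de")

print(alias_first, alias_second, conneg_en, conneg_de)  # True False True True
print(len(store.bodies))                                # 3
```

A digest comparison treats any byte-level difference as a new version, so it does not by itself answer the question of what counts as a “significant” change.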

10 SiteStory Status Quo
- mod_sitestory sends HTTP PUT to SiteStory Web Archive
  - upon client’s GET request
  - not for POST, DELETE, etc.
  - for HTTP response codes 200, 302, 303
- Client IP can be included in stored headers, configurable
- Header info stored in BerkeleyDB, response body in FS
- Dedup via hash(body)
- Offloading content as WARC files possible (read: recommended)
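
The selection rules listed on this slide amount to a small predicate. The sketch below restates them in Python for illustration only; mod_sitestory is an Apache module and this is not its code. The function names and the `X-Client-IP` header name are hypothetical.

```python
# Rules as stated on the slide: archive only GET requests whose response
# status is 200, 302, or 303.
ARCHIVED_METHODS = {"GET"}
ARCHIVED_STATUS = {200, 302, 303}


def should_archive(method, status):
    """Return True if this HTTP transaction would be pushed to the archive."""
    return method.upper() in ARCHIVED_METHODS and status in ARCHIVED_STATUS


def stored_headers(headers, client_ip, include_client_ip=False):
    """Return the headers to store; the client IP is added only when enabled."""
    kept = dict(headers)
    if include_client_ip:
        kept["X-Client-IP"] = client_ip  # hypothetical header name
    return kept


print(should_archive("GET", 200))   # True
print(should_archive("POST", 200))  # False
print(should_archive("GET", 404))   # False
print(stored_headers({"Content-Type": "text/html"}, "192.0.2.7"))
# {'Content-Type': 'text/html'}
```

Making the client IP opt-in in this sketch is a design assumption; the slide only says that storing it is configurable.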

11 SiteStory Use Case
- LANL has been archiving the DANS website (forever)
- ~32 GB since mid-April 2013
- >200k resources

12 To Appear: TPDL 2013
SiteStory benchmark with ab & wget
- ApacheBench (ab): server stress test tool
- wget: Web page download
  - All content: -p
- Local network
- Negligible difference between SiteStory and No SiteStory
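
For readers who want to try a comparable measurement, the two tools can be driven from a short script. This is a sketch, not the benchmark setup from the talk: the request count, concurrency level, and URL are placeholder assumptions. Only the standard `ab` options `-n` and `-c` and the `wget` options `-q`, `-p`, and `-P` are used.

```python
import subprocess
import time


def ab_command(url, requests=1000, concurrency=10):
    """Build an ApacheBench command line (-n total requests, -c concurrency)."""
    return ["ab", "-n", str(requests), "-c", str(concurrency), url]


def wget_command(url, output_dir="."):
    """Build a wget command line that also fetches page requisites (-p)."""
    return ["wget", "-q", "-p", "-P", output_dir, url]


def timed_run(command):
    """Run a command and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


# Placeholder URL; nothing is executed until timed_run() is called.
print(ab_command("http://localhost/"))
print(wget_command("http://localhost/", output_dir="/tmp/sitestory-bench"))
```

Running `timed_run` on the same command once with mod_sitestory enabled and once with it disabled gives a rough on vs. off comparison for a single server.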

13 Re-executed on testbed x99,…,, @AWS

14 Testing with ab

15 Testing with wget

16 Round Trip Time -- Distributed

17 Results
Distributed:
- Higher variance
- Increased delay due to network
- On vs. off comparison: results still comparable
- Viable solution without crippling service

18 SiteStory Installation
- Apache module mod_sitestory
  - Option to exclude a list of directories
- SiteStory Web Archive
  - Trivial for existing Tomcat environments
  - Tanuki Java wrapper (stand-alone) available
- Configure, open ports, go!
- Or…

19 SiteStory Testbed
We have a SiteStory Web Archive installed for you!
1. Install and configure mod_sitestory
2. Send an email containing:
   1. Your contact info
   2. Web server IP address
   3. Server domain name used
3. Happy SiteStory’ing!
mailto:

20 SiteStory Archiving Done Differently
Martin Klein (@mart1nkle1n), Justin F. Brunelle
