Lazy Preservation, Warrick, and the Web Infrastructure
Frank McCown
Old Dominion University, Computer Science Department, Norfolk, Virginia, USA
JCDL 2007, Vancouver, BC, June 19, 2007
2 Outline
What is the Web Infrastructure (WI)?
How can the WI be used for preservation?
Web-repository crawling with Warrick
Understanding the WI
–Caching experiment
–Reconstruction experiments
–Search engine sampling and IA overlap experiment
Recovering web server components from the WI
Brass: Queueing manager for Warrick
4 Web Infrastructure
5 Alternative Models of Preservation
Lazy Preservation – Let Google, IA et al. preserve your website
Just-In-Time Preservation – Wait for it to disappear first, then recover a "good enough" version
Shared Infrastructure Preservation – Push your content to sites that might preserve it
Web Server Enhanced Preservation – Use Apache modules to create archival-ready resources
8 Crawling the Crawlers
11 Cached Image
12 Cached PDF
[Figure: the canonical PDF alongside the Google, Yahoo, and MSN cached versions]
13 Web-repository Crawler
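Warrick's web-repository crawl can be sketched roughly as follows. This is a minimal, hypothetical sketch: the `fetch(repo, url)` interface, the `href` regex, and the policy of keeping the last non-canonical copy found are assumptions for illustration, not Warrick's actual implementation.

```python
import re
from collections import deque
from urllib.parse import urljoin

def reconstruct(seed_url, repositories, fetch):
    """Breadth-first recovery of a lost site from web repositories.

    repositories is an ordered list of repository names; fetch(repo, url)
    returns (html_text, is_canonical) or None if that repo has no copy.
    """
    recovered = {}                      # url -> (repo, html)
    frontier = deque([seed_url])
    while frontier:
        url = frontier.popleft()
        if url in recovered:
            continue
        best = None
        for repo in repositories:       # try repos in preference order,
            result = fetch(repo, url)   # stopping once a canonical copy is found
            if result is None:
                continue
            html, canonical = result
            best = (repo, html)
            if canonical:
                break
        if best is None:
            continue                    # resource is lost from the entire WI
        recovered[url] = best
        # pull links out of the recovered page and stay within the site
        for link in re.findall(r'href="([^"]+)"', best[1]):
            absolute = urljoin(url, link)
            if absolute.startswith(seed_url):
                frontier.append(absolute)
    return recovered
```

A recovered page's links feed the frontier, so the crawler discovers the rest of the site from the repositories themselves rather than from the (lost) origin server.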
14 McCown et al., Brass: A Queueing Manager for Warrick, IWAW 2007.
McCown et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM/IEEE JCDL 2007.
McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT.
McCown et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.
Available at
15 What Types of Websites Are Lost? Marshall, McCown, and Nelson, Evaluating Personal Archiving Strategies for Internet-based Information, IS&T Archiving 2007.
16 Outline
What is the Web Infrastructure (WI)?
How can the WI be used for preservation?
Web-repository crawling with Warrick
Understanding the WI
–Caching experiment
–Reconstruction experiments
–Search engine sampling and IA overlap experiment
Recovering web server components from the WI
Brass: Queueing manager for Warrick
17 Understanding the WI
How quickly do search engines acquire and purge their caches?
Do search engines prefer caching one type of resource over another?
How much overlap is there between the search engines' caches and IA's holdings?
How successfully can we reconstruct a lost website?
Are some resources more recoverable than others?
18 Timeline of Web Resource
19 Web Caching Experiment
Create four websites composed of HTML, PDFs, and images
Remove pages each day
Query Google, MSN, and Yahoo every day using identifiers
McCown et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.
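The daily probes from this experiment yield, for each (engine, page) pair, a time series of cached/not-cached observations; the cache lifetime falls out of that series. A minimal sketch (the data layout is an assumption, not the experiment's actual format):

```python
def cache_lifetimes(observations):
    """Given {(engine, url): [(date, cached_bool), ...]} in date order,
    return {(engine, url): (first_cached_date, last_cached_date)},
    or None for resources that never appeared in the cache."""
    lifetimes = {}
    for key, daily in observations.items():
        cached_days = [day for day, cached in daily if cached]
        lifetimes[key] = (min(cached_days), max(cached_days)) if cached_days else None
    return lifetimes
```

The gap between a page's removal date and `first_cached_date`/`last_cached_date` is what tells you how quickly an engine acquires, and how long it retains, a vanished resource.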
24 Where Is the Internet Archive?
No crawls from Alexa, IA's crawl provider
Even if they had crawled us, the content would not have been accessible from IA for 6-12 months
Short-lived web content is likely to be lost for good
25 Reconstruction Experiment
Crawl and reconstruct 24 sites of various sizes:
1. small (1-150 resources)
2. medium (151-499 resources)
3. large (500+ resources)
Perform 5 reconstructions for each website
–One using all four repositories together
–Four using each repository separately
Calculate a reconstruction vector for each reconstruction: (changed%, missing%, added%)
26 How Much Did We Reconstruct?
[Figure: a "lost" website (resources A, B, C, D, E, F) vs. its reconstruction (A, B′, C′, E, G); the link to D is missing and points to old resource G; F can't be found]
Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G
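Given those four categories, the reconstruction vector from the previous slide can be computed from plain set operations. A sketch, assuming the convention (suggested by the slide's example) that changed and missing are fractions of the original site while added is a fraction of the reconstructed site:

```python
def reconstruction_vector(original, reconstructed, identical):
    """Compute the (changed%, missing%, added%) reconstruction vector.

    original and reconstructed are sets of resource URLs; identical is the
    subset of recovered resources that came back byte-for-byte unchanged.
    """
    recovered = original & reconstructed
    changed = recovered - identical        # recovered, but not the same bytes
    missing = original - reconstructed     # in the lost site, never recovered
    added = reconstructed - original       # recovered, but no longer linked
    return (
        round(100 * len(changed) / len(original)),
        round(100 * len(missing) / len(original)),
        round(100 * len(added) / len(reconstructed)),
    )
```

Applied to the lettered example above (original A-F, reconstruction A, B′, C′, E, G), this gives (33, 33, 20).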
27 Reconstruction Diagram
[Example pie charts: identical 50%, changed 33%, missing 17%; added 20%]
28 Recovery Success by MIME Type
29 Repository Contributions
30 Reconstruction Experiment
300 websites chosen randomly from the Open Directory Project (dmoz.org)
Crawled and reconstructed each website every week for 14 weeks
Examined change rates, age, decay, growth, recoverability
McCown et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM/IEEE JCDL 2007.
31 Success of website recovery each week
*On average, we recovered 61% of a website in any given week.
33 Statistics for Repositories
34 Experiment: Sampling Search Engine Caches (Feb 2006)
Submitted 5200 one-term queries to Ask, Google, MSN, and Yahoo
Randomly selected 1 result from the first 100
Downloaded each resource and its cached page
Checked for overlap with the Internet Archive
McCown et al., Brass: A Queueing Manager for Warrick, IWAW 2007.
35 Distribution of Top Level Domains
36 Cached Resource Size Distributions
[Figure: observed cached-resource size limits of 976 KB, 977 KB, 1 MB, and 215 KB]
37 Cache Freshness
[Timeline: resource crawled and cached; later changed on the web server, making the cached copy stale; crawled and cached again, making it fresh]
Staleness = max(0, Last-Modified HTTP header − cached date)
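The staleness formula is direct to compute once both timestamps are in hand. A small sketch (parsing the Last-Modified header and the cached date out of real responses is left out):

```python
import datetime

def staleness(last_modified, cached_date):
    """Staleness = max(0, Last-Modified - cached date).

    If the server copy changed after the cache captured it, the cached copy
    is stale by that many days; otherwise staleness is zero (fresh)."""
    return max(datetime.timedelta(0), last_modified - cached_date)
```

Note the max(): a cached copy taken after the last modification is simply fresh, never "negatively stale."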
38 Cache Staleness
46% of resources had a Last-Modified header
71% of those also had a cached date
16% were at least 1 day stale
39 Similarity vs. Staleness
40 How Much of the Web Is Indexed?
Estimates from "The Indexable Web is More than 11.5 Billion Pages" by Gulli and Signorini (WWW'05)
Internet Archive?
41 Overlap with Internet Archive
42 Overlap with Internet Archive
43 Distribution of Sampled URLs
44 Problem: the WI currently stores only the client-side representation of a website. Server components (scripts, databases, configuration files, etc.) are not accessible from the WI.
45 Outline
What is the Web Infrastructure (WI)?
How can the WI be used for preservation?
Web-repository crawling with Warrick
Understanding the WI
–Caching experiment
–Reconstruction experiments
–Search engine sampling and IA overlap experiment
Recovering web server components from the WI
Brass: Queueing manager for Warrick
46 [Diagram: on the web server, static files (HTML files, PDFs, images, style sheets, JavaScript, etc.) and the dynamic pages generated from the database, Perl scripts, and config are what the Web Infrastructure sees, and are recoverable; the server-side components themselves are not recoverable]
47 Injecting Server Components into Crawlable Pages
[Diagram: server components are erasure-coded into blocks embedded in HTML pages; recovering at least m blocks reconstructs the components]
48 Brass: A Queueing Manager for Warrick
Warrick requires some technical expertise to download, install, and run
Warrick uses search engine APIs, which allow a limited number of requests per IP address (or key)
Google no longer provides new keys for accessing its API
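The core of a Brass-style manager is a job queue that dispatches reconstruction requests only while some API key still has quota. A hypothetical sketch (the class, its interface, and the daily limit are assumptions, not Brass's actual design):

```python
from collections import deque

class RecoveryQueue:
    """FIFO queue of website-reconstruction jobs sharing a pool of API keys."""

    def __init__(self, api_keys, daily_limit=1000):
        self.jobs = deque()
        self.quota = {key: daily_limit for key in api_keys}  # requests left today

    def submit(self, site_url):
        """A user submits a site to reconstruct; no key or install needed."""
        self.jobs.append(site_url)

    def next_job(self, cost=1):
        """Pop the next site paired with a key that has quota, else None."""
        if not self.jobs:
            return None
        for key, remaining in self.quota.items():
            if remaining >= cost:
                self.quota[key] -= cost
                return (self.jobs.popleft(), key)
        return None  # every key exhausted; jobs wait for the quota reset
```

Centralizing the keys this way addresses both pain points above: users need no installation, and the per-key request limits are enforced in one place.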
51 Thank You Frank McCown Can’t wait until I’m old enough to recover a website!
56 Web Repository Characteristics

Type                             | MIME type                     | File ext | Google | Yahoo | Live | IA
HTML text                        | text/html                     | html     | C      | C     | C    | C
Plain text                       | text/plain                    | txt, ans | M      | M     | M    | C
Graphic Interchange Format       | image/gif                     | gif      | M      | M     | M    | C
Joint Photographic Experts Group | image/jpeg                    | jpg      | M      | M     | M    | C
Portable Network Graphic         | image/png                     | png      | M      | M     | M    | C
Adobe Portable Document Format   | application/pdf               | pdf      | M      | M     | M    | C
JavaScript                       | application/javascript        | js       | M      | M     |      | C
Microsoft Excel                  | application/vnd.ms-excel      | xls      | M      | ~S    | M    | C
Microsoft PowerPoint             | application/vnd.ms-powerpoint | ppt      | M      | M     | M    | C
Microsoft Word                   | application/msword            | doc      | M      | M     | M    | C
PostScript                       | application/postscript        | ps       | M      | ~S    |      | C

C = canonical version is stored
M = modified version is stored (modified images are thumbnails, all others are HTML conversions)
~S = indexed but not stored
57 Results
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/ , 2005.