Slide 1: Lazy Preservation: Reconstructing Websites by Crawling the Crawlers
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen
Old Dominion University, Norfolk, Virginia, USA
WIDM 2006, Arlington, Virginia, November 10, 2006
Slide 2: Outline
- Web page threats
- Web Infrastructure
- Web caching experiment
- Web repository crawling
- Website reconstruction experiment
Slide 3:
- Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
- Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
- Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
Slide 4: How much of the Web is indexed?
Estimates from "The Indexable Web is More than 11.5 billion pages" by Gulli and Signorini (WWW'05).
Slide 7: Cached Image
Slide 8: Cached PDF
Canonical version: http://www.fda.gov/cder/about/whatwedo/testtube.pdf
Compared against the Google, Yahoo, and MSN cached versions.
Slide 9: Web Repository Characteristics

Type                             | MIME type                      | File ext | Google | Yahoo | MSN | IA
---------------------------------|--------------------------------|----------|--------|-------|-----|---
HTML text                        | text/html                      | html     | C      | C     | C   | C
Plain text                       | text/plain                     | txt, ans | M      | M     | M   | C
Graphic Interchange Format       | image/gif                      | gif      | M      | M     | ~R  | C
Joint Photographic Experts Group | image/jpeg                     | jpg      | M      | M     | ~R  | C
Portable Network Graphic         | image/png                      | png      | M      | M     | ~R  | C
Adobe Portable Document Format   | application/pdf                | pdf      | M      | M     | M   | C
JavaScript                       | application/javascript         | js       | M      | M     |     | C
Microsoft Excel                  | application/vnd.ms-excel       | xls      | M      | ~S    | M   | C
Microsoft PowerPoint             | application/vnd.ms-powerpoint  | ppt      | M      | M     | M   | C
Microsoft Word                   | application/msword             | doc      | M      | M     | M   | C
PostScript                       | application/postscript         | ps       | M      | ~S    |     | C

Legend:
C  = canonical version is stored
M  = modified version is stored (modified images are thumbnails; all others are HTML conversions)
~R = indexed but not retrievable
~S = indexed but not stored
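The storage codes above effectively determine which repository a web-repository crawler should prefer when several hold a copy of the same resource: a canonical copy (C) beats a modified one (M), and ~R/~S entries cannot be recovered at all. The short Python sketch below encodes a few rows of the table and ranks repositories for a given file extension; the STORAGE map and preferred_repos helper are illustrative names only, not part of Warrick.

    # Sketch: choosing which repository's copy of a resource to prefer,
    # based on the storage codes in the table above. Hypothetical helper,
    # not Warrick's actual implementation; only a few rows are encoded.

    # C = canonical copy, M = modified copy, ~R / ~S = not recoverable
    STORAGE = {
        "html": {"google": "C", "yahoo": "C", "msn": "C", "ia": "C"},
        "pdf":  {"google": "M", "yahoo": "M", "msn": "M", "ia": "C"},
        "gif":  {"google": "M", "yahoo": "M", "msn": "~R", "ia": "C"},
    }

    RANK = {"C": 2, "M": 1}   # canonical beats modified; ~R and ~S are unusable

    def preferred_repos(ext):
        """Return repositories ordered from most to least desirable copy."""
        codes = STORAGE.get(ext, {})
        usable = [(repo, code) for repo, code in codes.items() if code in RANK]
        usable.sort(key=lambda rc: RANK[rc[1]], reverse=True)
        return [repo for repo, _ in usable]

    print(preferred_repos("gif"))   # ['ia', 'google', 'yahoo'] -- only IA has the canonical GIF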
Slide 10: Timeline of Web Resource
Slide 11: Web Caching Experiment
- Create 4 websites composed of HTML, PDF, and images:
  - http://www.owenbrau.com/
  - http://www.cs.odu.edu/~fmccown/lazy/
  - http://www.cs.odu.edu/~jsmit/
  - http://www.cs.odu.edu/~mln/lazp/
- Remove pages each day
- Query Google, MSN, and Yahoo (GMY) each day using identifiers
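The daily probing in this experiment reduces to a simple loop: for each engine and each page identifier, ask whether the identifier is still findable and record the answer with a date stamp. A minimal sketch follows; is_cached is a hypothetical stand-in for however each engine is queried (API or scraping), not the harness actually used in the experiment.

    # Sketch of the daily cache-probing loop. is_cached() is a hypothetical
    # placeholder for a real query against each engine's index/cache.
    import datetime

    ENGINES = ["google", "msn", "yahoo"]

    def is_cached(engine, identifier):
        """Return True if the engine still returns the page for its identifier."""
        raise NotImplementedError("replace with an API call or scraper for this engine")

    def probe_once(identifiers):
        """One day's observation: which identifiers each engine still has."""
        today = datetime.date.today().isoformat()
        results = {engine: {ident: is_cached(engine, ident) for ident in identifiers}
                   for engine in ENGINES}
        return today, results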
Slide 16: Crawling the Web and web repositories
Slide 17: Warrick
- First developed in fall of 2005
- Available for download at http://www.cs.odu.edu/~fmccown/warrick/
- www2006.org: first lost website reconstructed (Nov 2005)
- DCkickball.org: first website someone else reconstructed without our help (late Jan 2006)
- www.iclnet.org: first website we reconstructed for someone else (mid Mar 2006)
- Internet Archive officially endorses Warrick (mid Mar 2006)
Slide 18: How Much Did We Reconstruct?
A "lost" website (resources A, B, C, D, E, F) is compared with its reconstruction (A, B', C', E, G): the link that pointed to D now points to an old resource G, and F can't be found.
Four categories of recovered resources:
1. Identical: A, E
2. Changed: B, C
3. Missing: D, F
4. Added: G
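Given the original and reconstructed sites as simple URL-to-content maps, the four categories fall out of a straightforward set comparison. The sketch below uses a hypothetical representation (URL mapped to a content hash), not Warrick's code, and reproduces the example from the diagram.

    # Sketch: classifying recovered resources into the four categories above,
    # given the original and reconstructed sites as {url: content-hash} maps.

    def categorize(original, reconstructed):
        identical = [u for u in original if u in reconstructed and reconstructed[u] == original[u]]
        changed   = [u for u in original if u in reconstructed and reconstructed[u] != original[u]]
        missing   = [u for u in original if u not in reconstructed]
        added     = [u for u in reconstructed if u not in original]
        return identical, changed, missing, added

    # Example matching the diagram: B and C come back modified, D and F are lost, G is new.
    lost  = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
    recon = {"A": 1, "B": 22, "C": 33, "E": 5, "G": 7}
    print(categorize(lost, recon))
    # (['A', 'E'], ['B', 'C'], ['D', 'F'], ['G'])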
Slide 19: Reconstruction Diagram
Identical 50%, changed 33%, missing 17%, added 20%.
Slide 20: Reconstruction Experiment
Crawl and reconstruct 24 websites of various sizes:
1. small (1-150 resources)
2. medium (151-499 resources)
3. large (500+ resources)
Perform 5 reconstructions for each website:
- One using all four repositories together
- Four using each repository separately
Calculate a reconstruction vector (changed%, missing%, added%) for each reconstruction.
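A small sketch of turning the category counts into the reconstruction vector. Assumption in this sketch: changed and missing are expressed as percentages of the original site, and added as a percentage of the reconstructed site, which is consistent with the percentages shown on the previous slides.

    # Sketch: computing a (changed%, missing%, added%) reconstruction vector
    # from category counts. Denominators are an assumption, as noted above.

    def reconstruction_vector(identical, changed, missing, added):
        original_size = identical + changed + missing
        reconstructed_size = identical + changed + added
        return (
            round(100.0 * changed / original_size, 1),
            round(100.0 * missing / original_size, 1),
            round(100.0 * added / reconstructed_size, 1),
        )

    print(reconstruction_vector(identical=2, changed=2, missing=2, added=1))
    # (33.3, 33.3, 20.0)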
Slide 21: Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster. Technical Report, arXiv cs.IR/0512069, 2005.
Slide 22: Recovery Success by MIME Type
Slide 23: Repository Contributions
Slide 24: Current & Future Work
- Building a web interface for Warrick
- Currently crawling & reconstructing 300 randomly sampled websites each week
  - Move from a descriptive model to a prescriptive & predictive model
- Injecting server-side functionality into the Web Infrastructure (WI)
  - Recover the PHP code, not just the HTML
Slide 25: Time & Queries
Slide 26: Traditional Web Crawler
Slide 27: Web-Repository Crawler
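The difference between the two crawlers is where each URL is fetched from: a traditional crawler downloads pages from the live site, while a web-repository crawler asks the repositories (search-engine caches, the Internet Archive) for a stored copy and extracts further links from whatever it recovers. The sketch below shows that core loop with hypothetical repo.lookup and extract_links placeholders; it illustrates the idea and is not Warrick's implementation.

    # Sketch of a web-repository crawler's core loop: URLs are recovered from
    # repositories rather than fetched from the (lost) live site.
    from collections import deque

    def recover_from_repos(url, repos):
        """Try each repository in turn; return the first stored copy found, else None."""
        for repo in repos:
            copy = repo.lookup(url)           # hypothetical: ask the cache/archive for url
            if copy is not None:
                return copy
        return None

    def reconstruct(start_url, repos, extract_links):
        """Breadth-first recovery: the frontier is fed by links found in recovered copies."""
        frontier = deque([start_url])
        seen = {start_url}
        recovered = {}
        while frontier:
            url = frontier.popleft()
            copy = recover_from_repos(url, repos)
            if copy is None:
                continue                      # resource is missing from every repository
            recovered[url] = copy
            for link in extract_links(copy):  # links come from the recovered copy itself
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return recovered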
Slide 28: Limitations

Web crawling:
- Limit hit rate per host
- Websites periodically unavailable
- Portions of website off-limits (robots.txt, passwords)
- Deep web
- Spam
- Duplicate content
- Flash and JavaScript interfaces
- Crawler traps

Web-repository crawling:
- Limit hit rate per repository
- Limited hits per day (API query quotas)
- Repositories periodically unavailable
- Flash and JavaScript interfaces
- Can only recover what the repositories have stored
- Lossy format conversions (thumbnail images, HTML-ized PDFs, etc.)
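Several of the web-repository limitations above (hit-rate limits, daily API query quotas) are typically handled with a small per-repository throttle. The sketch below is one way to do it; the delay and quota values are invented placeholders, not any repository's actual terms.

    # Sketch: per-repository throttle enforcing a minimum inter-request delay
    # and a daily query quota. Limits shown are placeholders.
    import time

    class RepoThrottle:
        """Minimum delay between requests plus a daily query quota for one repository."""
        def __init__(self, min_delay_s=5.0, daily_quota=1000):
            self.min_delay_s = min_delay_s
            self.daily_quota = daily_quota
            self.used_today = 0
            self.last_request = 0.0

        def acquire(self):
            """Block until the next request is allowed; raise once the quota is spent."""
            if self.used_today >= self.daily_quota:
                raise RuntimeError("daily query quota exhausted; resume tomorrow")
            wait = self.min_delay_s - (time.time() - self.last_request)
            if wait > 0:
                time.sleep(wait)
            self.last_request = time.time()
            self.used_today += 1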