Slide 1: Lazy Preservation: Reconstructing Websites by Crawling the Crawlers
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen
Old Dominion University, Norfolk, Virginia, USA
WIDM 2006, Arlington, Virginia, November 10, 2006
Slide 2: Outline
- Web page threats
- Web Infrastructure
- Web caching experiment
- Web repository crawling
- Website reconstruction experiment
Slide 3:
- Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
- Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
- Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
Slide 4: How much of the Web is indexed?
Estimates from "The Indexable Web is More than 11.5 billion pages" by Gulli and Signorini (WWW'05).
Slide 7: Cached Image
Slide 8: Cached PDF
Canonical version: http://www.fda.gov/cder/about/whatwedo/testtube.pdf
Compared against the Google, Yahoo, and MSN cached versions.
Slide 9: Web Repository Characteristics

Type                             | MIME type                      | File ext | Google | Yahoo | MSN | IA
---------------------------------|--------------------------------|----------|--------|-------|-----|---
HTML text                        | text/html                      | html     | C      | C     | C   | C
Plain text                       | text/plain                     | txt, ans | M      | M     | M   | C
Graphic Interchange Format       | image/gif                      | gif      | M      | M     | ~R  | C
Joint Photographic Experts Group | image/jpeg                     | jpg      | M      | M     | ~R  | C
Portable Network Graphic         | image/png                      | png      | M      | M     | ~R  | C
Adobe Portable Document Format   | application/pdf                | pdf      | M      | M     | M   | C
JavaScript                       | application/javascript         | js       | M      | M     |     | C
Microsoft Excel                  | application/vnd.ms-excel       | xls      | M      | ~S    | M   | C
Microsoft PowerPoint             | application/vnd.ms-powerpoint  | ppt      | M      | M     | M   | C
Microsoft Word                   | application/msword             | doc      | M      | M     | M   | C
PostScript                       | application/postscript         | ps       | M      | ~S    |     | C

Legend:
C  = canonical version is stored
M  = modified version is stored (modified images are thumbnails; all others are HTML conversions)
~R = indexed but not retrievable
~S = indexed but not stored
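The storage codes above effectively determine which repository a web-repository crawler should prefer when several hold a copy of the same resource: a canonical copy (C) beats a modified one (M), and ~R/~S entries cannot be recovered at all. The short Python sketch below encodes a few rows of the table and ranks repositories for a given file extension; the STORAGE map and preferred_repos helper are illustrative names only, not part of Warrick.

    # Sketch: choosing which repository's copy of a resource to prefer,
    # based on the storage codes in the table above. Hypothetical helper,
    # not Warrick's actual implementation; only a few rows are encoded.

    # C = canonical copy, M = modified copy, ~R / ~S = not recoverable
    STORAGE = {
        "html": {"google": "C", "yahoo": "C", "msn": "C", "ia": "C"},
        "pdf":  {"google": "M", "yahoo": "M", "msn": "M", "ia": "C"},
        "gif":  {"google": "M", "yahoo": "M", "msn": "~R", "ia": "C"},
    }

    RANK = {"C": 2, "M": 1}   # canonical beats modified; ~R and ~S are unusable

    def preferred_repos(ext):
        """Return repositories ordered from most to least desirable copy."""
        codes = STORAGE.get(ext, {})
        usable = [(repo, code) for repo, code in codes.items() if code in RANK]
        usable.sort(key=lambda rc: RANK[rc[1]], reverse=True)
        return [repo for repo, _ in usable]

    print(preferred_repos("gif"))   # ['ia', 'google', 'yahoo'] -- only IA has the canonical GIF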
Slide 10: Timeline of Web Resource
Slide 11: Web Caching Experiment
- Create 4 websites composed of HTML, PDF, and images:
  - http://www.owenbrau.com/
  - http://www.cs.odu.edu/~fmccown/lazy/
  - http://www.cs.odu.edu/~jsmit/
  - http://www.cs.odu.edu/~mln/lazp/
- Remove pages each day
- Query Google, MSN, and Yahoo (GMY) each day using identifiers
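The daily probing in this experiment reduces to a simple loop: for each engine and each page identifier, ask whether the identifier is still findable and record the answer with a date stamp. A minimal sketch follows; is_cached is a hypothetical stand-in for however each engine is queried (API or scraping), not the harness actually used in the experiment.

    # Sketch of the daily cache-probing loop. is_cached() is a hypothetical
    # placeholder for a real query against each engine's index/cache.
    import datetime

    ENGINES = ["google", "msn", "yahoo"]

    def is_cached(engine, identifier):
        """Return True if the engine still returns the page for its identifier."""
        raise NotImplementedError("replace with an API call or scraper for this engine")

    def probe_once(identifiers):
        """One day's observation: which identifiers each engine still has."""
        today = datetime.date.today().isoformat()
        results = {engine: {ident: is_cached(engine, ident) for ident in identifiers}
                   for engine in ENGINES}
        return today, results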
Slide 16: Crawling the Web and web repositories
Slide 17: Warrick
- First developed in fall of 2005
- Available for download at http://www.cs.odu.edu/~fmccown/warrick/
- www2006.org: first lost website reconstructed (Nov 2005)
- DCkickball.org: first website someone else reconstructed without our help (late Jan 2006)
- www.iclnet.org: first website we reconstructed for someone else (mid Mar 2006)
- Internet Archive officially endorses Warrick (mid Mar 2006)
Slide 18: How Much Did We Reconstruct?
A "lost" website (resources A, B, C, D, E, F) is compared with its reconstruction (A, B', C', E, G): the link that pointed to D now points to an old resource G, and F can't be found.
Four categories of recovered resources:
1. Identical: A, E
2. Changed: B, C
3. Missing: D, F
4. Added: G
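Given the original and reconstructed sites as simple URL-to-content maps, the four categories fall out of a straightforward set comparison. The sketch below uses a hypothetical representation (URL mapped to a content hash), not Warrick's code, and reproduces the example from the diagram.

    # Sketch: classifying recovered resources into the four categories above,
    # given the original and reconstructed sites as {url: content-hash} maps.

    def categorize(original, reconstructed):
        identical = [u for u in original if u in reconstructed and reconstructed[u] == original[u]]
        changed   = [u for u in original if u in reconstructed and reconstructed[u] != original[u]]
        missing   = [u for u in original if u not in reconstructed]
        added     = [u for u in reconstructed if u not in original]
        return identical, changed, missing, added

    # Example matching the diagram: B and C come back modified, D and F are lost, G is new.
    lost  = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
    recon = {"A": 1, "B": 22, "C": 33, "E": 5, "G": 7}
    print(categorize(lost, recon))
    # (['A', 'E'], ['B', 'C'], ['D', 'F'], ['G'])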
Slide 19: Reconstruction Diagram
Identical 50%, changed 33%, missing 17%, added 20%.
Slide 20: Reconstruction Experiment
Crawl and reconstruct 24 websites of various sizes:
1. small (1-150 resources)
2. medium (151-499 resources)
3. large (500+ resources)
Perform 5 reconstructions for each website:
- One using all four repositories together
- Four using each repository separately
Calculate a reconstruction vector (changed%, missing%, added%) for each reconstruction.
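A small sketch of turning the category counts into the reconstruction vector. Assumption in this sketch: changed and missing are expressed as percentages of the original site, and added as a percentage of the reconstructed site, which is consistent with the percentages shown on the previous slides.

    # Sketch: computing a (changed%, missing%, added%) reconstruction vector
    # from category counts. Denominators are an assumption, as noted above.

    def reconstruction_vector(identical, changed, missing, added):
        original_size = identical + changed + missing
        reconstructed_size = identical + changed + added
        return (
            round(100.0 * changed / original_size, 1),
            round(100.0 * missing / original_size, 1),
            round(100.0 * added / reconstructed_size, 1),
        )

    print(reconstruction_vector(identical=2, changed=2, missing=2, added=1))
    # (33.3, 33.3, 20.0)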
Slide 21: Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster. Technical Report, arXiv cs.IR/0512069, 2005.
Slide 22: Recovery Success by MIME Type
Slide 23: Repository Contributions
Slide 24: Current & Future Work
- Building a web interface for Warrick
- Currently crawling & reconstructing 300 randomly sampled websites each week
  - Move from a descriptive model to a prescriptive & predictive model
- Injecting server-side functionality into the Web Infrastructure (WI)
  - Recover the PHP code, not just the HTML
Slide 25: Time & Queries
Slide 26: Traditional Web Crawler
Slide 27: Web-Repository Crawler
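The difference between the two crawlers is where each URL is fetched from: a traditional crawler downloads pages from the live site, while a web-repository crawler asks the repositories (search-engine caches, the Internet Archive) for a stored copy and extracts further links from whatever it recovers. The sketch below shows that core loop with hypothetical repo.lookup and extract_links placeholders; it illustrates the idea and is not Warrick's implementation.

    # Sketch of a web-repository crawler's core loop: URLs are recovered from
    # repositories rather than fetched from the (lost) live site.
    from collections import deque

    def recover_from_repos(url, repos):
        """Try each repository in turn; return the first stored copy found, else None."""
        for repo in repos:
            copy = repo.lookup(url)           # hypothetical: ask the cache/archive for url
            if copy is not None:
                return copy
        return None

    def reconstruct(start_url, repos, extract_links):
        """Breadth-first recovery: the frontier is fed by links found in recovered copies."""
        frontier = deque([start_url])
        seen = {start_url}
        recovered = {}
        while frontier:
            url = frontier.popleft()
            copy = recover_from_repos(url, repos)
            if copy is None:
                continue                      # resource is missing from every repository
            recovered[url] = copy
            for link in extract_links(copy):  # links come from the recovered copy itself
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return recovered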
Slide 28: Limitations

Web crawling:
- Limit hit rate per host
- Websites periodically unavailable
- Portions of website off-limits (robots.txt, passwords)
- Deep web
- Spam
- Duplicate content
- Flash and JavaScript interfaces
- Crawler traps

Web-repository crawling:
- Limit hit rate per repository
- Limited hits per day (API query quotas)
- Repositories periodically unavailable
- Flash and JavaScript interfaces
- Can only recover what the repositories have stored
- Lossy format conversions (thumbnail images, HTML-ized PDFs, etc.)
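Several of the web-repository limitations above (hit-rate limits, daily API query quotas) are typically handled with a small per-repository throttle. The sketch below is one way to do it; the delay and quota values are invented placeholders, not any repository's actual terms.

    # Sketch: per-repository throttle enforcing a minimum inter-request delay
    # and a daily query quota. Limits shown are placeholders.
    import time

    class RepoThrottle:
        """Minimum delay between requests plus a daily query quota for one repository."""
        def __init__(self, min_delay_s=5.0, daily_quota=1000):
            self.min_delay_s = min_delay_s
            self.daily_quota = daily_quota
            self.used_today = 0
            self.last_request = 0.0

        def acquire(self):
            """Block until the next request is allowed; raise once the quota is spent."""
            if self.used_today >= self.daily_quota:
                raise RuntimeError("daily query quota exhausted; resume tomorrow")
            wait = self.min_delay_s - (time.time() - self.last_request)
            if wait > 0:
                time.sleep(wait)
            self.last_request = time.time()
            self.used_today += 1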