Factors Affecting Website Reconstruction from the Web Infrastructure
Frank McCown, Norou Diawara, and Michael L. Nelson
Old Dominion University, Computer Science Department, Norfolk, Virginia, USA
JCDL 2007, Vancouver, BC, June 20, 2007
2 Outline
- Web-repository crawling with Warrick
- How successful is a reconstruction?
- Reconstruction experiment
- Significant findings
3 [Image credits: black hat, virus image, hard drive]
4 Crawling the Crawlers
7 Cached Image
8 Cached PDF: canonical version alongside the Google, Yahoo, and MSN versions
9 Web Repository Characteristics

Type                              MIME type                      File ext   Google  Yahoo  Live  IA
HTML                              text/html                      html       C       C      C     C
Plain text                        text/plain                     txt, ans   M       M      M     C
Graphic Interchange Format        image/gif                      gif        M       M      M     C
Joint Photographic Experts Group  image/jpeg                     jpg        M       M      M     C
Portable Network Graphic          image/png                      png        M       M      M     C
Adobe Portable Document Format    application/pdf                pdf        M       M      M     C
JavaScript                        application/javascript         js         M       M            C
Microsoft Excel                   application/vnd.ms-excel       xls        M       ~S     M     C
Microsoft PowerPoint              application/vnd.ms-powerpoint  ppt        M       M      M     C
Microsoft Word                    application/msword             doc        M       M      M     C
PostScript                        application/postscript         ps         M       ~S           C

C  Canonical version is stored
M  Modified version is stored (modified images are thumbnails, all others are HTML conversions)
~S Indexed but not stored
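To make the table concrete, here is a minimal sketch of how such a storage matrix could drive repository selection when recovering a resource. The dictionary contents mirror a few rows of the table, but the best_repository helper and the priority order are illustrative assumptions, not Warrick's actual code.

```python
# Sketch: pick the repository most likely to hold the best copy of a resource,
# preferring canonical (C) over modified (M) copies and skipping index-only (~S)
# entries. The helper and priority order are assumptions for illustration.
from typing import Optional

CANONICAL, MODIFIED, INDEXED_ONLY = "C", "M", "~S"

# A few rows of the table above: STORAGE[mime_type] = {repository: storage code}
STORAGE = {
    "text/html":              {"google": "C", "yahoo": "C", "live": "C", "ia": "C"},
    "image/png":              {"google": "M", "yahoo": "M", "live": "M", "ia": "C"},
    "application/pdf":        {"google": "M", "yahoo": "M", "live": "M", "ia": "C"},
    "application/postscript": {"google": "M", "yahoo": "~S",             "ia": "C"},
}

def best_repository(mime_type: str) -> Optional[str]:
    """Return a repository holding the canonical version if one exists,
    otherwise one holding a modified (thumbnail/HTML-converted) copy."""
    repos = STORAGE.get(mime_type, {})
    for wanted in (CANONICAL, MODIFIED):
        for repo, code in repos.items():
            if code == wanted:
                return repo
    return None

print(best_repository("application/pdf"))  # 'ia' holds the canonical copy
print(best_repository("text/css"))         # None: type not in the table
```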
10
McCown et al., "Brass: A Queueing Manager for Warrick," IWAW.
McCown et al., "Factors Affecting Website Reconstruction from the Web Infrastructure," ACM/IEEE JCDL.
McCown and Nelson, "Evaluation of Crawling Policies for a Web-Repository Crawler," HYPERTEXT.
McCown et al., "Lazy Preservation: Reconstructing Websites by Crawling the Crawlers," ACM WIDM.
Available at
12 How Much Did We Recover?
A "lost" website (resources A, B, C, D, E, F) and its reconstruction (A, B', C', G, E):
the link to D is missing and now points to an old resource G, and F can't be found.
Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G
13 Measuring the Difference
Apply a recovery vector (r_c, r_m, r_a) to each resource, recording whether it was changed, missing, or added,
then compute a difference vector for the whole website.
14 Some Difference Vectors
D = (changed, missing, added)
(0,0,0) - Perfect recovery
(1,0,0) - All resources are recovered but changed
(0,1,0) - All resources are lost
(0,0,1) - All recovered resources are at new URIs
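As a concrete illustration of the difference vector, here is a minimal sketch that computes the (changed, missing, added) fractions from URI-to-hash maps of the original and reconstructed sites. The set-based comparison and helper names are assumptions for illustration, not the measurement code used in the study.

```python
# Sketch: compute a difference vector D = (changed, missing, added) for a
# reconstructed website. Resource identifiers and the identical/changed test
# (content hash equality) are simplified assumptions.

def difference_vector(original: dict, recovered: dict) -> tuple:
    """original/recovered map URI -> content hash (or any comparable value).

    changed  = recovered at the same URI but with different content,
               as a fraction of the original site
    missing  = not recovered at all, as a fraction of the original site
    added    = recovered URIs that were not in the original site,
               as a fraction of the reconstructed site
    """
    changed = sum(1 for uri, h in original.items()
                  if uri in recovered and recovered[uri] != h)
    missing = sum(1 for uri in original if uri not in recovered)
    added = sum(1 for uri in recovered if uri not in original)

    d_changed = changed / len(original)
    d_missing = missing / len(original)
    d_added = added / len(recovered) if recovered else 0.0
    return (d_changed, d_missing, d_added)

# Example from slide 12: lost site {A..F}, reconstruction {A, B', C', E, G}
original  = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
recovered = {"A": 1, "B": 20, "C": 30, "E": 5, "G": 70}
print(difference_vector(original, recovered))  # approx (0.333, 0.333, 0.2)
```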
15 How Much Change is a Bad Thing? (lost version vs. recovered version)
17 Assigning Penalties
Apply penalties (P_c, P_m, P_a) to each resource's recovery vector,
or apply the penalty adjustment directly to the website's difference vector.
18 Defining Success
success = 1 - d_m
Equivalent to the percentage of recovered resources.
Success ranges from 0 (less successful) to 1 (more successful).
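A minimal sketch of the success measure, together with one way the penalties from the previous slide could be applied to a difference vector; the weighted variant and the example penalty values are illustrative assumptions.

```python
# Sketch: success = 1 - d_m from the slide, plus a hypothetical
# penalty-weighted adjustment of a difference vector.

def success(d_missing: float) -> float:
    """success = 1 - d_m: the fraction of the original site that was recovered."""
    return 1.0 - d_missing

def penalized_difference(d, penalties=(1.0, 1.0, 1.0)):
    """Scale each component of a difference vector (d_c, d_m, d_a)
    by its penalty (P_c, P_m, P_a)."""
    return tuple(di * pi for di, pi in zip(d, penalties))

d = (0.333, 0.333, 0.2)                           # difference vector from earlier
print(success(d[1]))                              # 0.667: two thirds recovered
print(penalized_difference(d, (0.5, 1.0, 0.25)))  # hypothetical penalty weights
```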
19 Reconstruction Experiment
- 300 websites chosen randomly from the Open Directory Project (dmoz.org)
- Crawled and reconstructed each website every week for 14 weeks
- Examined change rates, age, decay, growth, and recoverability
20 Success of website recovery each week
*On average, we recovered 61% of a website on any given week.
21 Recovery of Textual Resources
22 Recovery by TLD
23 Birth and Decay
24 Recovery of HTML Resources
25 Recovery by Age
26 Statistics for Repositories
27 Which Factors Are Significant?
- External backlinks
- Internal backlinks
- Google's PageRank
- Hops from root page
- Path depth
- MIME type
- Query string params
- Age
- Resource birth rate
- TLD
- Website size
- Size of resources
28 Mild Correlations
Hops and
- website size (0.428)
- path depth (0.388)
Age and
- # of query params (-0.318)
External links and
- PageRank (0.339)
- website size (0.301)
- hops (0.320)
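A minimal sketch of how such pairwise correlations could be computed from a per-resource table; the file name and column names are hypothetical, not the study's actual dataset.

```python
# Sketch: pairwise Pearson correlations over per-resource features.
# "resources.csv" and the column names are assumptions for illustration.
import pandas as pd

df = pd.read_csv("resources.csv")
cols = ["hops", "website_size", "path_depth", "age",
        "query_params", "external_links", "pagerank"]
print(df[cols].corr(method="pearson").round(3))
```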
29 Regression Analysis
No surprises: all variables are significant, but the overall model explains only about half of the observations.
Three most significant variables: PageRank, hops, and age (R-squared = ).
30 Regression Parameter Estimates
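A minimal sketch of the kind of regression that could relate recovery to the most significant factors named above; the formula, column names, and the choice of an OLS model via statsmodels are assumptions for illustration, not the paper's exact model.

```python
# Sketch: regress per-resource recovery on PageRank, hops, and age.
# The dataset file, column names, and OLS model choice are illustrative.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("resources.csv")   # hypothetical per-resource dataset
model = smf.ols("recovered ~ pagerank + hops + age", data=df).fit()
print(model.summary())               # parameter estimates and R-squared
```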
31 Conclusions
Most of the sampled websites were relatively stable:
- One third of the websites never lost a single resource
- Half of the websites never added any new resources
The typical website can expect to get back 61% of its resources if it were lost today (77% textual, 42% images, and 32% other).
How to improve recovery from the WI? Improve PageRank, decrease the number of hops to resources, and create stable URLs.
32 Thank You Frank McCown Sorry, Dad… You lost me in the first two minutes.
33 Injecting Server Components into Crawlable Pages
Erasure codes split a server component into blocks that are injected into the site's HTML pages;
recovering at least m of the blocks is enough to rebuild the component.
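To illustrate the "recover at least m blocks" property, here is a toy single-parity scheme standing in for a real (m, n) erasure code such as Reed-Solomon: m data blocks plus one XOR parity block, so any m of the m+1 blocks suffice. The helper functions are illustrative only, not the encoding Warrick or the authors used.

```python
# Toy stand-in for erasure coding: split data into m blocks plus one XOR
# parity block, so that ANY m of the m+1 blocks can rebuild the original.
# A real deployment would use a proper (m, n) erasure code.

def encode(data: bytes, m: int):
    """Split data into m equal-length blocks and append one XOR parity block."""
    block_len = -(-len(data) // m)                 # ceiling division
    padded = data.ljust(m * block_len, b"\0")
    blocks = [padded[i * block_len:(i + 1) * block_len] for i in range(m)]
    parity = bytearray(block_len)
    for blk in blocks:
        for i, byte in enumerate(blk):
            parity[i] ^= byte
    return blocks + [bytes(parity)], len(data)

def decode(blocks: list, original_len: int) -> bytes:
    """Rebuild the data from m+1 blocks of which at most one is None (lost)."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if missing:
        block_len = len(next(b for b in blocks if b is not None))
        rebuilt = bytearray(block_len)
        for b in blocks:
            if b is not None:
                for i, byte in enumerate(b):
                    rebuilt[i] ^= byte
        blocks[missing[0]] = bytes(rebuilt)        # XOR of the rest = lost block
    return b"".join(blocks[:-1])[:original_len]    # drop parity, strip padding

blocks, n = encode(b"CREATE TABLE links (url TEXT, title TEXT);", m=4)
blocks[2] = None                                   # lose any single block
print(decode(blocks, n))                           # original data comes back
```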
34 Recoverable vs. Not Recoverable
The Web Infrastructure only sees what the web server serves: static files (HTML files, PDFs, images, style sheets, JavaScript, etc.) and the rendered dynamic pages are recoverable, but the server-side components behind them (database, Perl script, config) are not.
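A minimal sketch, assuming a hypothetical comment-embedding format, of how one erasure-coded block of an otherwise unrecoverable server component could be injected into a crawlable HTML page so the Web Infrastructure preserves it along with the page.

```python
# Sketch: base64-encode one block of a server component and hide it in an
# HTML comment of a crawlable page. The comment marker and block header
# are illustrative assumptions, not an actual Warrick format.
import base64

def inject_block(html: str, block: bytes, block_id: int, total: int) -> str:
    """Insert one base64-encoded block as an HTML comment just before </body>."""
    payload = base64.b64encode(block).decode("ascii")
    comment = f"<!-- component-block {block_id}/{total}\n{payload}\n-->"
    return html.replace("</body>", comment + "\n</body>")

block = b"CREATE TABL"                              # one erasure-coded block
page = "<html><body><p>Product catalog</p></body></html>"
print(inject_block(page, block, block_id=0, total=5))
```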