Slide 1: Factors Affecting Website Reconstruction from the Web Infrastructure
Frank McCown, Norou Diawara, and Michael L. Nelson
Old Dominion University, Computer Science Department, Norfolk, Virginia, USA
JCDL 2007, Vancouver, BC, June 20, 2007
Slide 2: Outline
- Web-repository crawling with Warrick
- How successful is a reconstruction?
- Reconstruction experiment
- Significant findings
Slide 3: Image credits
- Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
- Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
- Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
Slide 4: Crawling the Crawlers
Slide 7: Cached Image
Slide 8: Cached PDF
Canonical version: http://www.fda.gov/cder/about/whatwedo/testtube.pdf
[Screenshots: Google version, Yahoo version, MSN version]
Slide 9: Web Repository Characteristics

Type                             | MIME type                      | File ext | Google | Yahoo | Live | IA
HTML text                        | text/html                      | html     | C      | C     | C    | C
Plain text                       | text/plain                     | txt, ans | M      | M     | M    | C
Graphic Interchange Format       | image/gif                      | gif      | M      | M     | M    | C
Joint Photographic Experts Group | image/jpeg                     | jpg      | M      | M     | M    | C
Portable Network Graphic         | image/png                      | png      | M      | M     | M    | C
Adobe Portable Document Format   | application/pdf                | pdf      | M      | M     | M    | C
JavaScript                       | application/javascript         | js       | M      | M     | -    | C
Microsoft Excel                  | application/vnd.ms-excel       | xls      | M      | ~S    | M    | C
Microsoft PowerPoint             | application/vnd.ms-powerpoint  | ppt      | M      | M     | M    | C
Microsoft Word                   | application/msword             | doc      | M      | M     | M    | C
PostScript                       | application/postscript         | ps       | M      | ~S    | -    | C

C: canonical version is stored. M: modified version is stored (modified images are thumbnails; all others are HTML conversions). ~S: indexed but not stored.
Slide 10: Related Publications
- McCown et al., "Brass: A Queueing Manager for Warrick," IWAW 2007.
- McCown et al., "Factors Affecting Website Reconstruction from the Web Infrastructure," ACM/IEEE JCDL 2007.
- McCown and Nelson, "Evaluation of Crawling Policies for a Web-Repository Crawler," HYPERTEXT 2006.
- McCown et al., "Lazy Preservation: Reconstructing Websites by Crawling the Crawlers," ACM WIDM 2006.
Available at http://warrick.cs.odu.edu/
Slide 12: How Much Did We Recover?
A "lost" web site holds resources A, B, C, D, E, F. The reconstructed web site holds A, B', C', G, E: the link to D is missing and points to old resource G instead, and F can't be found.
Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G
Slide 13: Measuring the Difference
Apply a recovery vector (r_c, r_m, r_a), for changed, missing, and added, to each resource, then compute a difference vector for the whole website.
Slide 14: Some Difference Vectors
D = (changed, missing, added)
- (0, 0, 0): perfect recovery
- (1, 0, 0): all resources are recovered but changed
- (0, 1, 0): all resources are lost
- (0, 0, 1): all recovered resources are at new URIs
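To make the metric concrete, here is a minimal sketch (assumed for illustration, not code from the talk) that assigns each resource a recovery vector and averages them into a site-level difference vector; the helper names and the simple averaging rule, including counting added resources in the denominator, are assumptions.

    # Toy sketch: each resource gets a recovery vector (r_c, r_m, r_a) with a 1
    # in the category it falls into; the site-level difference vector is the
    # per-component average (the aggregation rule here is an assumption).

    def recovery_vector(status):
        """Map a resource's recovery status to (changed, missing, added)."""
        return {
            "identical": (0, 0, 0),
            "changed":   (1, 0, 0),
            "missing":   (0, 1, 0),
            "added":     (0, 0, 1),
        }[status]

    def difference_vector(statuses):
        """Average the per-resource recovery vectors."""
        vectors = [recovery_vector(s) for s in statuses]
        n = len(vectors)
        return tuple(sum(col) / n for col in zip(*vectors))

    # Slide 12's example: A, E identical; B, C changed; D, F missing; G added.
    statuses = ["identical", "changed", "changed", "missing",
                "identical", "missing", "added"]
    print(difference_vector(statuses))  # ~(0.29, 0.29, 0.14)

This reproduces slide 14's corner cases: a site where every resource comes back unchanged yields (0, 0, 0), and one where every resource is lost yields (0, 1, 0).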
Slides 15-16: How Much Change is a Bad Thing?
[Side-by-side screenshots comparing lost pages with their recovered versions]
Slide 17: Assigning Penalties
Apply penalties (P_c, P_m, P_a) to each resource as a penalty adjustment, or apply them to the difference vector.
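Under the same assumptions as the sketch above, penalties could be applied as element-wise weights on the difference vector; the weighting scheme below is a hypothetical illustration, not the paper's exact adjustment.

    # Hypothetical: weight (changed, missing, added) by penalties (P_c, P_m, P_a)
    # so that, e.g., a missing resource counts more than a changed one.
    def penalized(d, penalties=(1.0, 1.0, 1.0)):
        return tuple(p * x for p, x in zip(penalties, d))

    print(penalized((0.29, 0.29, 0.14), penalties=(0.5, 1.0, 0.5)))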
Slide 18: Defining Success
success = 1 - d_m
This is equivalent to the percentage of resources recovered; it ranges from 0 (less successful) to 1 (more successful).
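Continuing the sketch above (variable names assumed), success falls out of the missing component directly:

    d = difference_vector(statuses)  # (changed, missing, added)
    success = 1 - d[1]               # 1 - d_m: fraction of resources recovered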
Slide 19: Reconstruction Experiment
- 300 websites chosen randomly from the Open Directory Project (dmoz.org)
- Crawled and reconstructed each website every week for 14 weeks
- Examined change rates, age, decay, growth, and recoverability
Slide 20: Success of website recovery each week
On average, we recovered 61% of a website on any given week.
Slide 21: Recovery of Textual Resources
Slide 22: Recovery by TLD
Slide 23: Birth and Decay
Slide 24: Recovery of HTML Resources
Slide 25: Recovery by Age
Slide 26: Statistics for Repositories
Slide 27: Which Factors Are Significant?
- External backlinks
- Internal backlinks
- Google's PageRank
- Hops from root page
- Path depth
- MIME type
- Query string params
- Age
- Resource birth rate
- TLD
- Website size
- Size of resources
Slide 28: Mild Correlations
Hops correlate with:
- website size (0.428)
- path depth (0.388)
Age correlates with the number of query params (-0.318).
External links correlate with:
- PageRank (0.339)
- website size (0.301)
- hops (0.320)
Slide 29: Regression Analysis
No surprises: all variables are significant, but the overall model explains only about half of the observations. The three most significant variables are PageRank, hops, and age (R-squared = 0.1496).
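A hedged sketch of how a regression like this could be reproduced with statsmodels; the toy data frame, column names, and units are assumptions, not the authors' dataset or pipeline.

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical per-site observations of the three most significant
    # predictors from the talk and the measured recovery success.
    df = pd.DataFrame({
        "pagerank": [3, 5, 0, 4, 6, 2],
        "hops":     [1, 0, 4, 2, 1, 3],
        "age":      [120, 300, 30, 200, 450, 60],  # days (assumed unit)
        "success":  [0.7, 0.9, 0.2, 0.6, 0.95, 0.4],
    })

    X = sm.add_constant(df[["pagerank", "hops", "age"]])
    model = sm.OLS(df["success"], X).fit()
    print(model.rsquared)  # the slides report R-squared = 0.1496 for these three
    print(model.params)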
Slide 30: Regression Parameter Estimates
Slide 31: Conclusions
- Most of the sampled websites were relatively stable: one third never lost a single resource, and half never added any new resources.
- The typical website could expect to get back 61% of its resources if it were lost today (77% textual, 42% images, 32% other).
- How to improve recovery from the Web Infrastructure? Improve PageRank, decrease the number of hops to resources, and create stable URLs.
Slide 32: Thank You
Frank McCown
fmccown@cs.odu.edu
http://www.cs.odu.edu/~fmccown/
"Sorry, Dad… You lost me in the first two minutes."
Slide 33: Injecting Server Components into Crawlable Pages
Use erasure codes to spread server components across HTML pages; recovering at least m blocks is enough to rebuild them.
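As a toy illustration of the erasure-code idea (assumed, not the authors' actual encoding): a single XOR parity block is the simplest (m+1, m) erasure code, letting any one of the m data blocks be rebuilt from the survivors.

    from functools import reduce

    def xor_blocks(blocks):
        """XOR equal-length byte blocks together."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    def encode(data, m):
        """Split data into m equal blocks plus one XOR parity block."""
        size = -(-len(data) // m)             # ceiling division
        data = data.ljust(m * size, b"\0")    # pad to a multiple of m
        blocks = [data[i*size:(i+1)*size] for i in range(m)]
        return blocks + [xor_blocks(blocks)]  # m data blocks + 1 parity block

    def recover(blocks):
        """Rebuild the single missing block (None) from the survivors."""
        missing = blocks.index(None)
        blocks[missing] = xor_blocks([b for b in blocks if b is not None])
        return blocks

    # Encode a server-side component into 3 data blocks + 1 parity block that
    # could be injected into crawlable HTML pages; losing any one is survivable.
    blocks = encode(b"CREATE TABLE posts (id INT, body TEXT);", 3)
    blocks[1] = None                          # suppose one page was never cached
    restored = recover(blocks)
    print(b"".join(restored[:3]).rstrip(b"\0"))

A real deployment would need a code tolerating more than one lost block (e.g., Reed-Solomon), which is presumably why the slide speaks of recovering at least m blocks out of a larger set.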
Slide 34: [Architecture diagram]
The web server's static files (HTML files, PDFs, images, style sheets, JavaScript, etc.) and the dynamic pages it serves are recoverable from the Web Infrastructure; the database, Perl script, and config behind them are not recoverable.