Factors Affecting Website Reconstruction from the Web Infrastructure Frank McCown, Norou Diawara, and Michael L. Nelson Old Dominion University Computer.


1 Factors Affecting Website Reconstruction from the Web Infrastructure Frank McCown, Norou Diawara, and Michael L. Nelson Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007 Vancouver, BC June 20, 2007

2 Outline
– Web-repository crawling with Warrick
– How successful is a reconstruction?
– Reconstruction experiment
– Significant findings

3 Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg

4 Crawling the Crawlers

7 Cached Image

8 Cached PDF: http://www.fda.gov/cder/about/whatwedo/testtube.pdf
Versions shown: canonical, MSN version, Yahoo version, Google version

9 Web Repository Characteristics

Type | MIME type | File ext | Google | Yahoo | Live | IA
HTML text | text/html | html | C | C | C | C
Plain text | text/plain | txt, ans | M | M | M | C
Graphics Interchange Format | image/gif | gif | M | M | M | C
Joint Photographic Experts Group | image/jpeg | jpg | M | M | M | C
Portable Network Graphics | image/png | png | M | M | M | C
Adobe Portable Document Format | application/pdf | pdf | M | M | M | C
JavaScript | application/javascript | js | M | M | | C
Microsoft Excel | application/vnd.ms-excel | xls | M | ~S | M | C
Microsoft PowerPoint | application/vnd.ms-powerpoint | ppt | M | M | M | C
Microsoft Word | application/msword | doc | M | M | M | C
PostScript | application/postscript | ps | M | ~S | | C

C = Canonical version is stored
M = Modified version is stored (modified images are thumbnails, all others are HTML conversions)
~S = Indexed but not stored

10 McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.
McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM IEEE JCDL 2007.
McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006.
McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.
Available at http://warrick.cs.odu.edu/

12 How Much Did We Recover?
A “lost” web site: resources A, B, C, D, E, F
Reconstructed web site: resources A, B’, C’, G, E
– Missing link to D; points to old resource G
– F can’t be found
Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G
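The four categories on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `classify_resources`, the dict-of-URI-to-content representation, and the placeholder contents are assumptions, not part of Warrick.

```python
def classify_resources(lost, recovered):
    """Sort resources into the four categories from the slide.

    lost, recovered: dicts mapping a URI to its content (hypothetical
    representation chosen for this sketch).
    """
    identical = sorted(u for u in lost if u in recovered and recovered[u] == lost[u])
    changed = sorted(u for u in lost if u in recovered and recovered[u] != lost[u])
    missing = sorted(u for u in lost if u not in recovered)
    added = sorted(u for u in recovered if u not in lost)
    return {
        "identical": identical,
        "changed": changed,
        "missing": missing,
        "added": added,
    }


# The example from the slide: B and C come back changed, D and F are gone,
# and G is an old resource that was not part of the lost site.
lost = {"A": "a", "B": "b", "C": "c", "D": "d", "E": "e", "F": "f"}
recovered = {"A": "a", "B": "b'", "C": "c'", "E": "e", "G": "g"}
categories = classify_resources(lost, recovered)
```

With the slide's example, `categories` lists A and E as identical, B and C as changed, D and F as missing, and G as added.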

13 Measuring the Difference
Apply a recovery vector (r_c, r_m, r_a) = (changed, missing, added) to each resource
Compute a difference vector for the website

14 Some Difference Vectors
D = (changed, missing, added)
(0,0,0) – Perfect recovery
(1,0,0) – All resources are recovered but changed
(0,1,0) – All resources are lost
(0,0,1) – All recovered resources are at new URIs
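A minimal sketch of how such a difference vector could be computed from per-category counts. The normalization is an assumption inferred from the example vectors above (changed and missing divided by the size of the lost site, added divided by the size of the reconstructed site); the slide's own formula did not survive the transcript, and the function name is hypothetical.

```python
def difference_vector(n_lost, n_recovered, changed, missing, added):
    """Return (d_c, d_m, d_a) for one website.

    Assumed normalization (not confirmed by the slides):
      d_c = changed / resources in the lost site
      d_m = missing / resources in the lost site
      d_a = added   / resources in the reconstructed site
    """
    d_c = changed / n_lost if n_lost else 0.0
    d_m = missing / n_lost if n_lost else 0.0
    d_a = added / n_recovered if n_recovered else 0.0
    return (d_c, d_m, d_a)


# Slide 12 example: 6 lost resources, 5 in the reconstruction,
# 2 changed (B, C), 2 missing (D, F), 1 added (G).
d = difference_vector(n_lost=6, n_recovered=5, changed=2, missing=2, added=1)
```

Under this assumption the slide 12 example yields roughly (0.33, 0.33, 0.20).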

15 How Much Change is a Bad Thing? (side-by-side images: Lost, Recovered)

16 How Much Change is a Bad Thing? (side-by-side images: Lost, Recovered)

17 Assigning Penalties
Penalty adjustment (P_c, P_m, P_a)
Apply to each resource, or to the difference vector
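The penalty formula itself is not in the transcript. One plausible reading, sketched here purely as an assumption, is a weighted sum of the difference vector's components with the penalties (P_c, P_m, P_a) as weights. The function name and the example weights are hypothetical.

```python
def penalty_adjustment(diff, penalties):
    """Weighted sum of a difference vector (assumed form, not from the slides).

    diff:      (d_c, d_m, d_a)
    penalties: (P_c, P_m, P_a), each expected to lie in [0, 1]
    """
    if len(diff) != 3 or len(penalties) != 3:
        raise ValueError("expected three components in each vector")
    return sum(d * p for d, p in zip(diff, penalties))


# Hypothetical weights: a changed resource counts half as much as a
# missing one, and added resources are not penalized at all.
score = penalty_adjustment((0.5, 0.25, 0.0), (0.5, 1.0, 0.0))
```

Choosing the weights is a policy decision: a site owner who only cares about getting something back for every URI might set P_c to 0, while an archivist might weight changes heavily.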

18 Defining Success
success = 1 - d_m
Equivalent to percent of recovered resources
Scale: 0 (less successful) to 1 (more successful)
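The success measure follows directly from the missing component of the difference vector. The sketch below also averages it over several websites; the helper names and the sample vectors are made up for illustration.

```python
def success(diff):
    """success = 1 - d_m, where diff = (d_c, d_m, d_a)."""
    _d_c, d_m, _d_a = diff
    return 1.0 - d_m


def mean_success(diffs):
    """Average success over a collection of difference vectors."""
    if not diffs:
        raise ValueError("need at least one difference vector")
    return sum(success(d) for d in diffs) / len(diffs)


# Three hypothetical websites: fully recovered, half missing, fully lost.
sample = [(0.0, 0.0, 0.0), (0.2, 0.5, 0.1), (0.0, 1.0, 0.0)]
average = mean_success(sample)
```

Note that the changed and added components do not affect this measure; a site recovered entirely in modified form still scores 1.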

19 Reconstruction Experiment
– 300 websites chosen randomly from Open Directory Project (dmoz.org)
– Crawled and reconstructed each website every week for 14 weeks
– Examined change rates, age, decay, growth, recoverability

20 Success of website recovery each week
On average, we recovered 61% of a website in any given week.

21 Recovery of Textual Resources

22 Recovery by TLD

23 Birth and Decay

24 Recovery of HTML Resources

25 Recovery by Age

26 Statistics for Repositories

27 Which Factors Are Significant?
– External backlinks
– Internal backlinks
– Google’s PageRank
– Hops from root page
– Path depth
– MIME type
– Query string params
– Age
– Resource birth rate
– TLD
– Website size
– Size of resources

28 Mild Correlations
Hops and
– website size (0.428)
– path depth (0.388)
Age and # of query params (-0.318)
External links and
– PageRank (0.339)
– website size (0.301)
– hops (0.320)
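The slide does not say which correlation coefficient was used. Assuming Pearson's r, a pairwise value like those above could be computed as in this standard-library sketch; the function name and the toy data are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy data, not from the experiment: hops from the root page vs. path depth.
hops = [1, 2, 2, 3, 4, 5]
depth = [1, 1, 2, 2, 3, 3]
r = pearson(hops, depth)
```

Values near 0.3 to 0.4 in absolute terms, as on the slide, indicate only a weak to moderate linear relationship between the factors.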

29 Regression Analysis
– No surprises: all variables are significant, but overall model only explains about half of the observations
– Three most significant variables: PageRank, hops and age (R-squared = 0.1496)
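R-squared is the fraction of variance in the response that a fitted model accounts for, so a value of 0.1496 corresponds to roughly 15% of the variance. The sketch below shows the calculation for a one-predictor least-squares line; the study used a multi-variable model, and the function names and data here are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor is constant")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """1 - SS_res / SS_tot for the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("response is constant")
    return 1.0 - ss_res / ss_tot


# Made-up data: hops from the root page vs. fraction of resources recovered.
hops = [1, 2, 3, 4, 5, 6]
recovered = [0.9, 0.6, 0.8, 0.5, 0.6, 0.3]
slope, intercept = fit_line(hops, recovered)
r2 = r_squared(hops, recovered, slope, intercept)
```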

30 Regression Parameter Estimates

31 Conclusions
Most of the sampled websites were relatively stable
– One third of the websites never lost a single resource
– Half of the websites never added any new resources
The typical website can expect to get back 61% of its resources if it were lost today (77% textual, 42% images and 32% other)
How to improve recovery from the Web Infrastructure (WI)? Improve PageRank, decrease number of hops to resources, create stable URLs

32 Thank You
Frank McCown
fmccown@cs.odu.edu
http://www.cs.odu.edu/~fmccown/
“Sorry, Dad… You lost me in the first two minutes.”

33 Injecting Server Components into Crawlable Pages
Erasure codes
HTML pages
Recover at least m blocks
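The slide does not say which erasure code is used. As a toy illustration of the "recover at least m blocks" idea, the sketch below uses the simplest possible scheme: m data blocks plus one XOR parity block, so any m of the m + 1 blocks are enough to rebuild the data. A practical system would more likely use a general code such as Reed-Solomon; the function names here are hypothetical.

```python
def _xor(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data, m):
    """Split data into m zero-padded blocks and append one XOR parity block."""
    if m < 1:
        raise ValueError("m must be at least 1")
    size = -(-len(data) // m)  # ceiling division
    padded = data.ljust(size * m, b"\0")
    blocks = [padded[i * size:(i + 1) * size] for i in range(m)]
    parity = bytes(size)
    for block in blocks:
        parity = _xor(parity, block)
    return blocks + [parity]


def recover(blocks, length):
    """Rebuild the original data from m + 1 blocks, at most one of them None."""
    lost = [i for i, block in enumerate(blocks) if block is None]
    if len(lost) > 1:
        raise ValueError("single parity can only repair one lost block")
    blocks = list(blocks)
    if lost:
        size = len(next(b for b in blocks if b is not None))
        rebuilt = bytes(size)
        for block in blocks:
            if block is not None:
                rebuilt = _xor(rebuilt, block)
        blocks[lost[0]] = rebuilt
    # Drop the parity block and the zero padding.
    return b"".join(blocks[:-1])[:length]


# Each block could be embedded in a different crawlable HTML page.
payload = b"config and Perl script"
pieces = encode(payload, m=4)
damaged = list(pieces)
damaged[2] = None  # one page was never cached
restored = recover(damaged, len(payload))
```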

34 Diagram labels: Web Server; Database; Perl script; config; Static files (HTML files, PDFs, images, style sheets, JavaScript, etc.); Dynamic page; Web Infrastructure
Legend: Recoverable, Not Recoverable

