Slide 1: Factors Affecting Website Reconstruction from the Web Infrastructure
Frank McCown, Norou Diawara, and Michael L. Nelson
Old Dominion University, Computer Science Department, Norfolk, Virginia, USA
JCDL 2007, Vancouver, BC, June 20, 2007
Slide 2: Outline
- Web-repository crawling with Warrick
- How successful is a reconstruction?
- Reconstruction experiment
- Significant findings
Slide 3: Image credits
- Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
- Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
- Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
Slide 4: Crawling the Crawlers
Slide 7: Cached Image
Slide 8: Cached PDF
Canonical version: http://www.fda.gov/cder/about/whatwedo/testtube.pdf
[Screenshots: Google version, Yahoo version, MSN version]
Slide 9: Web Repository Characteristics

Type                             | MIME type                      | File ext | Google | Yahoo | Live | IA
HTML text                        | text/html                      | html     | C      | C     | C    | C
Plain text                       | text/plain                     | txt, ans | M      | M     | M    | C
Graphic Interchange Format       | image/gif                      | gif      | M      | M     | M    | C
Joint Photographic Experts Group | image/jpeg                     | jpg      | M      | M     | M    | C
Portable Network Graphic         | image/png                      | png      | M      | M     | M    | C
Adobe Portable Document Format   | application/pdf                | pdf      | M      | M     | M    | C
JavaScript                       | application/javascript         | js       | M      | M     | -    | C
Microsoft Excel                  | application/vnd.ms-excel       | xls      | M      | ~S    | M    | C
Microsoft PowerPoint             | application/vnd.ms-powerpoint  | ppt      | M      | M     | M    | C
Microsoft Word                   | application/msword             | doc      | M      | M     | M    | C
PostScript                       | application/postscript         | ps       | M      | ~S    | -    | C

C: canonical version is stored. M: modified version is stored (modified images are thumbnails; all others are HTML conversions). ~S: indexed but not stored.
Slide 10: Related Publications
- McCown et al., "Brass: A Queueing Manager for Warrick," IWAW 2007.
- McCown et al., "Factors Affecting Website Reconstruction from the Web Infrastructure," ACM/IEEE JCDL 2007.
- McCown and Nelson, "Evaluation of Crawling Policies for a Web-Repository Crawler," HYPERTEXT 2006.
- McCown et al., "Lazy Preservation: Reconstructing Websites by Crawling the Crawlers," ACM WIDM 2006.
Available at http://warrick.cs.odu.edu/
Slide 12: How Much Did We Recover?
A "lost" web site holds resources A, B, C, D, E, F. The reconstructed web site holds A, B', C', G, E: the link to D is missing and points to old resource G instead, and F can't be found.
Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G
Slide 13: Measuring the Difference
Apply a recovery vector (r_c, r_m, r_a), for changed, missing, and added, to each resource, then compute a difference vector for the whole website.
Slide 14: Some Difference Vectors
D = (changed, missing, added)
- (0, 0, 0): perfect recovery
- (1, 0, 0): all resources are recovered but changed
- (0, 1, 0): all resources are lost
- (0, 0, 1): all recovered resources are at new URIs
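To make the metric concrete, here is a minimal sketch (assumed for illustration, not code from the talk) that assigns each resource a recovery vector and averages them into a site-level difference vector; the helper names and the simple averaging rule, including counting added resources in the denominator, are assumptions.

    # Toy sketch: each resource gets a recovery vector (r_c, r_m, r_a) with a 1
    # in the category it falls into; the site-level difference vector is the
    # per-component average (the aggregation rule here is an assumption).

    def recovery_vector(status):
        """Map a resource's recovery status to (changed, missing, added)."""
        return {
            "identical": (0, 0, 0),
            "changed":   (1, 0, 0),
            "missing":   (0, 1, 0),
            "added":     (0, 0, 1),
        }[status]

    def difference_vector(statuses):
        """Average the per-resource recovery vectors."""
        vectors = [recovery_vector(s) for s in statuses]
        n = len(vectors)
        return tuple(sum(col) / n for col in zip(*vectors))

    # Slide 12's example: A, E identical; B, C changed; D, F missing; G added.
    statuses = ["identical", "changed", "changed", "missing",
                "identical", "missing", "added"]
    print(difference_vector(statuses))  # ~(0.29, 0.29, 0.14)

This reproduces slide 14's corner cases: a site where every resource comes back unchanged yields (0, 0, 0), and one where every resource is lost yields (0, 1, 0).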
Slides 15-16: How Much Change is a Bad Thing?
[Side-by-side screenshots comparing lost pages with their recovered versions]
Slide 17: Assigning Penalties
Apply penalties (P_c, P_m, P_a) to each resource as a penalty adjustment, or apply them to the difference vector.
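Under the same assumptions as the sketch above, penalties could be applied as element-wise weights on the difference vector; the weighting scheme below is a hypothetical illustration, not the paper's exact adjustment.

    # Hypothetical: weight (changed, missing, added) by penalties (P_c, P_m, P_a)
    # so that, e.g., a missing resource counts more than a changed one.
    def penalized(d, penalties=(1.0, 1.0, 1.0)):
        return tuple(p * x for p, x in zip(penalties, d))

    print(penalized((0.29, 0.29, 0.14), penalties=(0.5, 1.0, 0.5)))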
Slide 18: Defining Success
success = 1 - d_m
This is equivalent to the percentage of resources recovered; it ranges from 0 (less successful) to 1 (more successful).
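Continuing the sketch above (variable names assumed), success falls out of the missing component directly:

    d = difference_vector(statuses)  # (changed, missing, added)
    success = 1 - d[1]               # 1 - d_m: fraction of resources recovered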
Slide 19: Reconstruction Experiment
- 300 websites chosen randomly from the Open Directory Project (dmoz.org)
- Crawled and reconstructed each website every week for 14 weeks
- Examined change rates, age, decay, growth, and recoverability
Slide 20: Success of website recovery each week
On average, we recovered 61% of a website on any given week.
Slide 21: Recovery of Textual Resources
Slide 22: Recovery by TLD
Slide 23: Birth and Decay
Slide 24: Recovery of HTML Resources
Slide 25: Recovery by Age
Slide 26: Statistics for Repositories
Slide 27: Which Factors Are Significant?
- External backlinks
- Internal backlinks
- Google's PageRank
- Hops from root page
- Path depth
- MIME type
- Query string params
- Age
- Resource birth rate
- TLD
- Website size
- Size of resources
Slide 28: Mild Correlations
Hops correlate with:
- website size (0.428)
- path depth (0.388)
Age correlates with the number of query params (-0.318).
External links correlate with:
- PageRank (0.339)
- website size (0.301)
- hops (0.320)
Slide 29: Regression Analysis
No surprises: all variables are significant, but the overall model explains only about half of the observations. The three most significant variables are PageRank, hops, and age (R-squared = 0.1496).
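A hedged sketch of how a regression like this could be reproduced with statsmodels; the toy data frame, column names, and units are assumptions, not the authors' dataset or pipeline.

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical per-site observations of the three most significant
    # predictors from the talk and the measured recovery success.
    df = pd.DataFrame({
        "pagerank": [3, 5, 0, 4, 6, 2],
        "hops":     [1, 0, 4, 2, 1, 3],
        "age":      [120, 300, 30, 200, 450, 60],  # days (assumed unit)
        "success":  [0.7, 0.9, 0.2, 0.6, 0.95, 0.4],
    })

    X = sm.add_constant(df[["pagerank", "hops", "age"]])
    model = sm.OLS(df["success"], X).fit()
    print(model.rsquared)  # the slides report R-squared = 0.1496 for these three
    print(model.params)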
Slide 30: Regression Parameter Estimates
Slide 31: Conclusions
- Most of the sampled websites were relatively stable: one third never lost a single resource, and half never added any new resources.
- The typical website could expect to get back 61% of its resources if it were lost today (77% textual, 42% images, 32% other).
- How to improve recovery from the Web Infrastructure? Improve PageRank, decrease the number of hops to resources, and create stable URLs.
Slide 32: Thank You
Frank McCown
fmccown@cs.odu.edu
http://www.cs.odu.edu/~fmccown/
"Sorry, Dad… You lost me in the first two minutes."
Slide 33: Injecting Server Components into Crawlable Pages
Use erasure codes to spread server components across HTML pages; recovering at least m blocks is enough to rebuild them.
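As a toy illustration of the erasure-code idea (assumed, not the authors' actual encoding): a single XOR parity block is the simplest (m+1, m) erasure code, letting any one of the m data blocks be rebuilt from the survivors.

    from functools import reduce

    def xor_blocks(blocks):
        """XOR equal-length byte blocks together."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    def encode(data, m):
        """Split data into m equal blocks plus one XOR parity block."""
        size = -(-len(data) // m)             # ceiling division
        data = data.ljust(m * size, b"\0")    # pad to a multiple of m
        blocks = [data[i*size:(i+1)*size] for i in range(m)]
        return blocks + [xor_blocks(blocks)]  # m data blocks + 1 parity block

    def recover(blocks):
        """Rebuild the single missing block (None) from the survivors."""
        missing = blocks.index(None)
        blocks[missing] = xor_blocks([b for b in blocks if b is not None])
        return blocks

    # Encode a server-side component into 3 data blocks + 1 parity block that
    # could be injected into crawlable HTML pages; losing any one is survivable.
    blocks = encode(b"CREATE TABLE posts (id INT, body TEXT);", 3)
    blocks[1] = None                          # suppose one page was never cached
    restored = recover(blocks)
    print(b"".join(restored[:3]).rstrip(b"\0"))

A real deployment would need a code tolerating more than one lost block (e.g., Reed-Solomon), which is presumably why the slide speaks of recovering at least m blocks out of a larger set.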
Slide 34: [Architecture diagram]
The web server's static files (HTML files, PDFs, images, style sheets, JavaScript, etc.) and the dynamic pages it serves are recoverable from the Web Infrastructure; the database, Perl script, and config behind them are not recoverable.