
1 Evaluation of Crawling Policies for a Web-Repository Crawler
Frank McCown & Michael L. Nelson
Old Dominion University, Norfolk, Virginia, USA
Odense, Denmark, August 23, 2006

2 Alternate Models of Preservation
Lazy Preservation – Let Google, IA, et al. preserve your website
Just-In-Time Preservation – Find a "good enough" replacement web page
Shared Infrastructure Preservation – Push your content to sites that might preserve it
Web Server Enhanced Preservation – Use Apache modules to create archival-ready resources
Image from: http://www.proex.ufes.br/arsm/knots_interlaced.htm

3 Outline
Web page threats
Web Infrastructure
Warrick
– architectural description
– crawling policies
– future work

4 Image credits:
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg

5 Crawling the Web and web repositories

6 How much of the Web is indexed?
The GYM (Google, Yahoo, MSN) intersection is less than 43%.
Figure from "The Indexable Web is More than 11.5 Billion Pages" by Gulli and Signorini (WWW'05)

7 Traditional Web Crawler

8 Web-Repository Crawler


11 Cached Image

12 Cached PDF
Canonical version: http://www.fda.gov/cder/about/whatwedo/testtube.pdf
Compared with the MSN, Yahoo, and Google versions.

13 Web Repository Characteristics
Type                               MIME type                       File ext   Google  Yahoo  MSN  IA
HTML text                          text/html                       html       C       C      C    C
Plain text                         text/plain                      txt, ans   M       M      M    C
Graphic Interchange Format         image/gif                       gif        M       M      ~R   C
Joint Photographic Experts Group   image/jpeg                      jpg        M       M      ~R   C
Portable Network Graphic           image/png                       png        M       M      ~R   C
Adobe Portable Document Format     application/pdf                 pdf        M       M      M    C
JavaScript                         application/javascript          js         M       M           C
Microsoft Excel                    application/vnd.ms-excel        xls        M       ~S     M    C
Microsoft PowerPoint               application/vnd.ms-powerpoint   ppt        M       M      M    C
Microsoft Word                     application/msword              doc        M       M      M    C
PostScript                         application/postscript          ps         M       ~S          C
Key: C = canonical version is stored; M = modified version is stored (modified images are thumbnails, all others are HTML conversions); ~R = indexed but not retrievable; ~S = indexed but not stored.

14 Limitations
Web crawling: limit hit rate per host; websites periodically unavailable; portions of the website off-limits (robots.txt, passwords); deep web; spam; duplicate content; Flash and JavaScript interfaces; crawler traps.
Web-repository crawling: limit hit rate per repository; limited hits per day (API query quotas); repositories periodically unavailable; Flash and JavaScript interfaces; can only recover what the repositories have stored; lossy format conversions (thumbnail images, HTML-ized PDFs, etc.).

15 Warrick
First developed in fall of 2005
Available for download at http://www.cs.odu.edu/~fmccown/warrick/
www2006.org – first lost website reconstructed (Nov 2005)
DCkickball.org – first website someone else reconstructed without our help (late Jan 2006)
www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006)
Internet Archive officially endorses Warrick (mid Mar 2006)

16 How Much Did We Reconstruct?
A "lost" website (resources A–F) versus its reconstruction (A, B', C', E, G): the reconstruction is missing the link to D, points to old resource G, and F cannot be found.
Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G

17 Reconstruction Diagram
Example breakdown: identical 50%, changed 33%, missing 17%; added 20%.
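The four categories lend themselves to a simple summary computation. The following is a minimal Python sketch (not Warrick's actual code) that classifies resources, assuming each site is given as a mapping from URL to a content hash; note that "added" is measured against the reconstructed site here, and the published reconstruction-vector definition may normalize differently.

# Classify recovered resources into the four categories from slide 16
# and report each as a fraction. Input: dicts mapping URL -> content hash.
def reconstruction_summary(original, reconstructed):
    identical = [u for u in original if u in reconstructed and original[u] == reconstructed[u]]
    changed = [u for u in original if u in reconstructed and original[u] != reconstructed[u]]
    missing = [u for u in original if u not in reconstructed]
    added = [u for u in reconstructed if u not in original]
    return {
        "identical": len(identical) / len(original),
        "changed": len(changed) / len(original),
        "missing": len(missing) / len(original),
        "added": len(added) / len(reconstructed),  # relative to the reconstructed site
    }

# The example from slide 16: original A-F, reconstruction A, B', C', E, G.
original = {"A": "a", "B": "b", "C": "c", "D": "d", "E": "e", "F": "f"}
reconstructed = {"A": "a", "B": "b2", "C": "c2", "E": "e", "G": "g"}
print(reconstruction_summary(original, reconstructed))
# identical 0.33, changed 0.33, missing 0.33, added 0.20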

18 Initial Experiment – April 2005
Crawl and reconstruct 24 sites in 3 categories:
1. small (1-150 resources)
2. medium (150-499 resources)
3. large (500+ resources)
Calculate a reconstruction vector for each site.
Results: mostly successful at recovering HTML.
Observation: many wasted queries; disconnected portions of websites are unrecoverable.
See:
– McCown et al. Reconstructing websites for the lazy webmaster. Tech report, 2005. http://arxiv.org/abs/cs.IR/0512069
– Smith et al. Observed web robot behavior on decaying web subsites. D-Lib Magazine, 12(2), Feb 2006.

19 Missing Disconnected Resources

20 Lister Queries
Problem with the initial version of Warrick: wasted queries
– Internet Archive: Do you have X? No
– Google: Do you have X? No
– Yahoo: Do you have X? Yes
– MSN: Do you have X? No
What if we first ask each web repository, "What resources do you have?" We call these "lister queries."
How many repository requests will this save? How many more resources will this discover? What other problems will this help solve?

21 Lister Queries (cont.)
Search engines:
– site:www.foo.org
– Limited to the first 1000 results or fewer
Internet Archive:
– http://web.archive.org/web/*/http://www.foo.org/*
– Not all URLs reported are actually accessible
– Results are returned in groups of 100 or fewer
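As an illustration, the paging logic behind a lister query might look like the Python sketch below; fetch_urls is a hypothetical helper standing in for the repository-specific request-and-parse code (the 2006-era search-engine APIs and result-page formats are not reproduced here).

# Collect every URL a repository reports for a host by paging through its
# listing until it stops returning results or hits its result cap.
# fetch_urls(query, offset) is a hypothetical repository-specific helper.
def lister_query(host, fetch_urls, page_size=100, result_cap=1000):
    known_urls = set()
    for offset in range(0, result_cap, page_size):
        page = fetch_urls(f"site:{host}", offset)  # IA uses web.archive.org/web/*/http://host/* instead
        if not page:
            break
        known_urls.update(page)
    return known_urls

# Warrick can then skip the per-URL "Do you have X?" probes for
# repositories whose listings show they do not hold X.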

22 URL Canonicalization
How do we know if URL X points to the same resource as URL Y?
Web crawlers use several strategies that we may borrow:
– Convert to lowercase
– Remove the "www" prefix
– Remove session IDs
– etc.
Each web repository has its own canonicalization policy, which lister queries can uncover.
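A minimal Python sketch of these strategies (lowercasing, dropping the "www" prefix, stripping session-ID query parameters); the session-ID key list is illustrative only, and the individual repositories apply their own, more involved policies.

# Normalize a URL so that trivially different forms compare equal.
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

SESSION_KEYS = {"sessionid", "phpsessid", "jsessionid", "sid"}  # illustrative only

def canonicalize(url):
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]                 # drop the "www" prefix
    path = parts.path.lower() or "/"              # one possible case policy (see slide 24)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in SESSION_KEYS]
    return urlunparse((parts.scheme.lower(), host, path, "", urlencode(query), ""))

assert canonicalize("HTTP://www.Foo.org/BAR.html?PHPSESSID=abc123") == "http://foo.org/bar.html"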

23 Missing 'www' Prefix

24 Case Sensitivity
Some web servers run on case-insensitive file systems (e.g., IIS on Windows), so http://foo.org/bar.html is equivalent to http://foo.org/BAR.html.
MSN always ignores case; Google and Yahoo do not.


26 Crawling Policies
1. Naïve policy – Do not issue lister queries; only recover links that are found in recovered pages.
2. Knowledgeable policy – Issue lister queries, but only recover links that are found in recovered pages.
3. Exhaustive policy – Issue lister queries and recover all resources found in all repositories (a repository dump).

27 Web-Repository Crawler using Lister Queries
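To make the difference between the three policies on slide 26 concrete, here is a rough Python sketch of the crawl loop. lister_query is the sketch from slide 21; recover and extract_links are hypothetical callables standing in for Warrick's repository-specific recovery and link-extraction code, and repository handles are assumed to carry a fetch_urls helper.

# Rough sketch of how the three policies change the crawl frontier.
# recover(url, candidates) returns the best copy found across repos (or None);
# extract_links(resource) returns the URLs linked from a recovered page.
def reconstruct(seed_url, host, repositories, recover, extract_links,
                policy="knowledgeable"):
    listed_by = {}
    if policy in ("knowledgeable", "exhaustive"):
        # Lister queries: ask each repository up front which URLs it holds.
        listed_by = {repo: lister_query(host, repo.fetch_urls) for repo in repositories}
    all_listed = set().union(*listed_by.values()) if listed_by else set()

    # Exhaustive: seed the frontier with everything the repositories report.
    frontier = {seed_url} | (all_listed if policy == "exhaustive" else set())
    recovered, seen = {}, set()
    while frontier:
        url = frontier.pop()
        if url in seen:
            continue
        seen.add(url)
        # Naive: probe every repository; otherwise only those that listed the URL.
        candidates = [r for r in repositories
                      if policy == "naive" or url in listed_by.get(r, set())]
        resource = recover(url, candidates)
        if resource is not None:
            recovered[url] = resource
            frontier |= set(extract_links(resource)) - seen  # follow links in recovered pages
    return recovered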

28 Experiment
Download all 24 websites (from the first experiment).
Perform 3 reconstructions of each site, one per crawling policy.
Compute the reconstruction vector for each reconstruction.


30 Reconstruction Statistics
Efficiency ratio = total recovered resources / total repository requests

31 Efficiency Ratio (all resources)
Efficiency ratio = total recovered resources / total repository requests

32 Efficiency Ratio (not including 'added' resources)
Efficiency ratio = total recovered resources / total repository requests
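For reference, the ratio is simply resources recovered per repository request; the numbers below are illustrative, not figures from the experiment.

# Efficiency ratio as defined on the slides above (illustrative numbers only).
def efficiency_ratio(recovered_resources, repository_requests):
    return recovered_resources / repository_requests

print(efficiency_ratio(400, 1200))  # e.g. 400 resources via 1200 requests -> ~0.33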

33 Summary of Findings
Naïve policy
– Recovers nearly as many non-added resources as the knowledgeable and exhaustive policies
– Issues the highest number of repository requests
Knowledgeable policy
– Issues the fewest requests per reconstruction
– Has the highest efficiency ratio when counting only non-added resources
Exhaustive policy
– Recovers significantly more added resources than the other two policies
– Has the highest overall efficiency ratio

34 Website “Hijacking”

35 Soft 404s ("cache poisoning")
Warrick should detect soft 404s: error pages returned with an HTTP 200 status, which can otherwise poison the reconstruction with junk content.
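One common detection heuristic (not necessarily what Warrick implements) is to request a URL on the same host that should not exist and treat any recovered page that closely resembles the response as a soft 404. A minimal Python sketch:

# Heuristic sketch: fetch a page that should not exist; if the server still
# answers 200, any recovered page whose body matches that response is
# likely a soft 404 and should not be kept.
import hashlib
import random
import string
import urllib.error
import urllib.request

def probe_soft_404(host):
    """Return the body served for a (presumably) nonexistent URL, or None
    if the server correctly returns an HTTP error status."""
    junk = "".join(random.choices(string.ascii_lowercase, k=16))
    try:
        with urllib.request.urlopen(f"http://{host}/{junk}.html", timeout=10) as resp:
            return resp.read()
    except urllib.error.HTTPError:
        return None  # a real 404 (or other error status) came back

def looks_like_soft_404(page_body, probe_body):
    # Crude check: identical bodies; real detectors use fuzzier similarity.
    return probe_body is not None and \
           hashlib.md5(page_body).digest() == hashlib.md5(probe_body).digest()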

36 Other Web Obstacles
Effective "do not preserve" tags:
– Flash
– AJAX
– HTTP POST
– session IDs, cookies, etc.
– cloaking
– URLs that change based on traversal patterns (Lutkenhouse, Nelson, Bollen. Distributed, Real-Time Computation of Community Preferences. Proceedings of ACM Hypertext '05. http://doi.acm.org/10.1145/1083356.1083374)

37 Conclusion
Websites can be reconstructed by accessing the caches of the Web Infrastructure.
– Some URL canonicalization issues can be tackled using lister queries.
– Multiple policies are available depending on reconstruction requirements.
Much work remains to be done:
– capturing server-side information
– moving from a descriptive model to a prescriptive and predictive model

38 Reconstructing "Real" Websites That Have Been Lost
Three case studies revealing future work:
– TechLocker.com
– www2006.org
– ICLnet.org

39 The exhaustive policy recovered 1952 resources from IA and 292 from Google. Google’s cached pages were from several months to over one year old.

40 www2006.org Lost Temporarily

41 We’ve got the HTML, but where’s the PHP code?

42 ICLnet
March 2006 – the charitable service quits hosting.
We were contacted to see if we could reconstruct the website on behalf of an individual who wanted to keep the site running.

