
1 Introduction to Digital Libraries
Week 15: Web Infrastructure for Preservation
Old Dominion University, Department of Computer Science
CS 751/851, Fall 2006
Michael L. Nelson and Frank McCown
12/6/06

2 Outline
– Web page threats
– Web Infrastructure (WI)
– Utilizing the WI for finding "good enough" replacements of web pages
– Search engine caching experiment
– Utilizing the WI for reconstructing lost websites

3 Linkrot: The 404 Problem
– Kahle (97): average page lifetime of 44 days
– Koehler (99, 04): 67% of URLs lost in 4 years
– Lawrence et al. (01): 23%–53% of URLs in CiteSeer papers invalid over a 5-year span (3% of invalid URLs "unfindable")
– Spinellis (03): 27% of URLs in CACM/Computer papers gone in 5 years
– Fetterly et al. (03): about 0.5% of web pages disappear per week
– McCown et al. (05): 10-year half-life for URLs in D-Lib Magazine articles
– Nelson & Allen (02): 3% of objects in a digital library gone in 1 year

4 No Longer Here
– ECDL 1999: "good enough" page available
– PSP 2003: exact copy at a new URL
– Greynet 99: unavailable at any URL?

5 Image credits:
– Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
– Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
– Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg

6 How much of the Web is indexed?
Estimates from "The Indexable Web is More than 11.5 billion pages" by Gulli and Signorini (WWW'05)

7 Web Infrastructure: Refreshing & Migrating

8–9 (image-only slides)

10 Cached Image

11 Cached PDF (canonical: http://www.fda.gov/cder/about/whatwedo/testtube.pdf), with the Google, Yahoo, and MSN versions shown alongside

12 Web Repository Characteristics

Type                             | MIME type                      | File ext | Google | Yahoo | MSN | IA
HTML text                        | text/html                      | html     | C      | C     | C   | C
Plain text                       | text/plain                     | txt, ans | M      | M     | M   | C
Graphic Interchange Format       | image/gif                      | gif      | M      | M     | ~R  | C
Joint Photographic Experts Group | image/jpeg                     | jpg      | M      | M     | ~R  | C
Portable Network Graphic         | image/png                      | png      | M      | M     | ~R  | C
Adobe Portable Document Format   | application/pdf                | pdf      | M      | M     | M   | C
JavaScript                       | application/javascript         | js       | M      | M     | –   | C
Microsoft Excel                  | application/vnd.ms-excel       | xls      | M      | ~S    | M   | C
Microsoft PowerPoint             | application/vnd.ms-powerpoint  | ppt      | M      | M     | M   | C
Microsoft Word                   | application/msword             | doc      | M      | M     | M   | C
PostScript                       | application/postscript         | ps       | M      | ~S    | –   | C

C  = canonical version is stored
M  = modified version is stored (modified images are thumbnails, all others are HTML conversions)
~R = indexed but not retrievable
~S = indexed but not stored

13 Just-In-Time Preservation
– How can the WI be utilized to locate replacements for missing web pages?
– Master's thesis by Terry Harrison (2005)
– Terry L. Harrison and Michael L. Nelson, Just-In-Time Recovery of Missing Web Pages, Proceedings of Hypertext 2006, pp. 157-168.

14 Lexical Signatures
– "Robust Hyperlinks Cost Just Five Words Each" – Phelps & Wilensky (2000)
  http://www.cs.odu.edu/~tharriso/?lex-sig=terry+harrison+thesis+jcdl+awarded
– "Analysis of Lexical Signatures for Improving Information Presence on the World Wide Web" – Park et al. (2004)

Lexical Signature                                    | Calculation Technique | Results from Google
2004+terry+digital+harrison+2003                     | TF-based              | 456,000
modoai+netpreserve.org+mod_oai+heretrix+xmlsolutions | IDF-based             | 2
terry+harrison+thesis+jcdl+awarded                   | TF-IDF-based          | 6

TF (Term Frequency) = how often does this word appear in this document?
IDF (Inverse Document Frequency) = in how many documents does this word appear?
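As a minimal sketch of how a five-term TF-IDF signature like the ones above could be computed (the simple tokenizer, the `df_lookup` callback, and the tiny hard-coded DF table are all stand-ins; Opal approximates DF from the WI, as described later):

```python
import math
import re
from collections import Counter

def lexical_signature(text, df_lookup, corpus_size, k=5):
    """Return the k terms with the highest TF-IDF weight.

    df_lookup(term) -> estimated number of documents containing the term
    corpus_size     -> estimated total number of documents
    (Both are stand-ins here; Opal approximates them from the WI.)
    """
    terms = re.findall(r"[a-z0-9]+", text.lower())
    tf = Counter(terms)                       # term frequency in this document
    def weight(term):
        df = max(df_lookup(term), 1)          # guard against division by zero
        return tf[term] * math.log(corpus_size / df)
    return sorted(tf, key=weight, reverse=True)[:k]

# Hypothetical usage with a hard-coded DF table:
df_table = {"terry": 456000, "harrison": 520000, "thesis": 9000000,
            "jcdl": 12000, "awarded": 3000000}
doc = "terry harrison thesis jcdl awarded thesis terry harrison odu"
sig = lexical_signature(doc, lambda t: df_table.get(t, 10**8), corpus_size=11.5e9)
print("+".join(sig))
```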

15 Observations
– One reason the original Phelps & Wilensky vision was never realized is that it required a priori LS calculation
  – idea: use the Web Infrastructure to calculate LSs as they are needed
– Mass adoption of a system will occur only if it is really, really easy to do
  – idea: digital preservation systems should require only a small number of "heroes"

16 Description & Use Cases
Allow many web servers to use a few Opal servers, which use the caches of the Web Infrastructure to generate Lexical Signatures of recently-404 URLs, in order to find either:
– the same page at a new URL
  example: bookmarked colleague is now 404
  – cached info is not useful
  – similar pages probably not useful
– a "good enough" replacement page
  example: bookmarked recipe is now 404
  – cached info is useful
  – similar pages probably useful

17 Opal Configuration: "Configure Two Things"
– edit httpd.conf
– add / edit custom 404 page
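A minimal sketch of those two steps, assuming a stock Apache setup (the Opal host follows the diagram on the next slide; the script name and page-tag wording are placeholders, and the actual tag shipped with Opal may differ):

```
# httpd.conf: serve a custom page for 404s
ErrorDocument 404 /opal-404.html
```

```html
<!-- opal-404.html: the "pagetag" sends the user to an Opal server.
     Because Apache serves this error page at the originally requested URL,
     document.location.href is the URL that just 404'd.
     Host name and script path are placeholders. -->
<script type="text/javascript">
  window.location = "http://opal.foo.edu/opal.pl?url=" +
                    encodeURIComponent(document.location.href);
</script>
```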

18 Opal High-Level Architecture
Actors: interactive user, opal.foo.edu (Opal server), www.bar.org (web server)
1. Get URL X
2. Custom 404 page
3. Pagetag redirects user to Opal server
4. Opal searches WI caches; creates LS
5. Opal gives user navigation options

19 Locating Caches
http://www.google.com/search?hl=en&ie=ISO-8859-1&q=http://www.cs.odu.edu/~tharriso
http://search.yahoo.com/search?fr=FP-pull-web-t&ei=UTF8&p=http://www.cs.odu.edu/~tharriso
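A small sketch that assembles those two lookup URLs for an arbitrary target (the query-string parameters are copied verbatim from the slide and reflect the 2006 interfaces):

```python
from urllib.parse import quote

def cache_queries(url):
    """Search URLs that should surface cached copies of `url` (circa 2006)."""
    return [
        "http://www.google.com/search?hl=en&ie=ISO-8859-1&q=" + quote(url, safe=""),
        "http://search.yahoo.com/search?fr=FP-pull-web-t&ei=UTF8&p=" + quote(url, safe=""),
    ]

for q in cache_queries("http://www.cs.odu.edu/~tharriso"):
    print(q)
```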

20 Internet Archive

21 Term Frequency × Inverse Document Frequency
– Calculating Term Frequency is easy
  – frequency of the term in this document
– Calculating Document Frequency is hard
  – frequency of the term in all documents
  – assumes knowledge of the entire corpus!
– "Good terms" appear:
  – frequently in a single document
  – infrequently across all documents
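In the standard formulation (presumably what Opal computes; the slide does not give the exact normalization), a term $t$ in document $d$ is weighted

$$ w(t,d) = \mathrm{tf}(t,d) \cdot \log \frac{N}{\mathrm{df}(t)} $$

where $\mathrm{tf}(t,d)$ is the term's frequency in $d$, $\mathrm{df}(t)$ is the number of documents containing $t$, and $N$ is the corpus size; the lexical signature is the handful of terms with the largest $w$.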

22 Scraping Google to Approximate DF
– Frequency of a term across all documents: how many documents contain it?
– How many documents are there in total?
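A hedged sketch of that approximation: query the engine for the term, read the reported hit count as DF, and take the indexable-web estimate from slide 6 as the corpus size. The hit-count pattern below is illustrative only (Google's result markup has changed many times, and slide 33 itself asks whether scraping is valid or polite):

```python
import math
import re
import urllib.parse
import urllib.request

INDEXABLE_WEB = 11.5e9   # Gulli & Signorini's estimate (slide 6)

def approx_idf(term):
    """Approximate IDF(term) from a search engine's reported hit count."""
    req = urllib.request.Request(
        "http://www.google.com/search?q=" + urllib.parse.quote(term),
        headers={"User-Agent": "Mozilla/5.0"})      # bare scrapers get blocked
    page = urllib.request.urlopen(req).read().decode("utf-8", "replace")
    m = re.search(r"of about ([\d,]+)", page)       # illustrative pattern only
    df = int(m.group(1).replace(",", "")) if m else 1
    return math.log(INDEXABLE_WEB / max(df, 1))
```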

23 GUI - Bootstrapping

24 GUI - Learned

25 GUI (cont.)

<url:similarURL datestamp="2005-05-13" votes="1"
    simURL="http://www.cs.odu.edu/~tharriso/" baseURL="http://invivo_test.com">
<![CDATA[
<a href="javascript:popUp('demo_dev.pl?method=vote&url=http://www.cs.odu.edu/~tharriso&match=http://www.cs.odu.edu/~tharriso/')">
Terry Harrison Profile Page ... Burning Man Images ... Other Images (not really well sorted, sorry!) ... Email Terry ... (May 2003) AR Zipf Fellowship Awarded to Terry Harrison - Press Release ... www.cs.odu.edu/~tharriso/ - 12k -
]]>
</url:similarURL>

26 Opal Server Databases
– URL database
  – 404 URL → (LS, similarURL1, similarURL2, …, similarURLN)
  – similarURL → (URL, datestamp, votes, Opal server)
– Term database
  – term → (Opal server, source, datestamp, DF, corpus size, IDF)
– Define each URL and term as OAI-PMH records and we can harvest what an Opal server has "learned"
  – can accommodate late arrivers (no "cold start" for them)
  – pool the learning of multiple servers
  – incentives to cooperate
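In Python terms, the two record shapes might look like the following sketch (field names are guesses from the slide, not Opal's actual schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SimilarURL:
    url: str
    datestamp: str
    votes: int
    opal_server: str            # which Opal node learned this mapping

@dataclass
class URLRecord:                # keyed by the 404 URL
    lexical_signature: List[str]
    similar_urls: List[SimilarURL] = field(default_factory=list)

@dataclass
class TermRecord:               # keyed by the term
    opal_server: str
    source: str
    datestamp: str
    df: int                     # document frequency
    corpus_size: int
    idf: float
```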

27 Opal Synchronization
– Group 1: Opal D aggregates terms and URLs from Opal D.1, D.2, and D.3
– Group 2: Opal D aggregates terms and URLs from Opal A, B, and C
– Other architectures possible
– Harvesting frequency determined by individual nodes

28 Discovery via OAI-PMH
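For example, another node could harvest everything this server has learned since a given date with a standard OAI-PMH request (the `opal_terms` metadata prefix is hypothetical; the verb and arguments are standard OAI-PMH):

```
http://opal.foo.edu/oai?verb=ListRecords&metadataPrefix=opal_terms&from=2006-11-01
```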

29 Connection Costs
Cost_cache = (WI × N) + R
– WI = number of web infrastructure caches
– N = connections for each WI cache
– R = connection to get a datestamp
Cost_paths = R_c + T + R_l
– R_c = connections to get a cached copy
– T = connections required for each term
– R_l = connections to use the LS
Example: Cost_cache = 3 × 1 + 1 = 4; Cost_paths = 1 + T + 1

30 Analysis - Cumulative Terms Learned
– 1 million terms, 30,000 documents
– results averaged over 100 iterations

31 Analysis - Terms Learned Per Document
– 1 million terms, 30,000 documents
– results averaged over 100 iterations

32 Load Estimation

33 Future Work
– Testing on a departmental server
  – hard to test in-the-small
– Code optimizations
  – many shortcuts taken for the demo system (Google & Yahoo APIs not used; screen scraping only)
– Lexical Signatures
  – describe changes over time
– IDF calculation metrics
  – is scraping Google valid? is it nice?
– Learning new code
  – use OAI-PMH to update the system
– OpenURL resolver
  – 404 URL = referent

34 Lazy Preservation and Website Reconstruction
Investigating website reconstruction from the WI. Publications:
– Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Lazy Preservation: Reconstructing Websites by Crawling the Crawlers. 8th ACM International Workshop on Web Information and Data Management (WIDM 2006), 10 November 2006.
– Frank McCown and Michael L. Nelson. Evaluation of Crawling Policies for a Web-Repository Crawler. 17th ACM Conference on Hypertext and Hypermedia (HYPERTEXT 2006), 23-25 August 2006.
– Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites. D-Lib Magazine, February 2006, Vol. 12, No. 2.
– Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster. Technical Report, 2005.

35 Timeline of Web Resource

36 Web Caching Experiment
– Create 4 websites composed of HTML, PDF, and images
  – http://www.owenbrau.com/
  – http://www.cs.odu.edu/~fmccown/lazy/
  – http://www.cs.odu.edu/~jsmit/
  – http://www.cs.odu.edu/~mln/lazp/
– Remove pages each day
– Query Google, MSN, and Yahoo (GMY) each day using identifiers

37–40 (image-only slides)

41 Crawling the Web and web repositories

42 Traditional Web Crawler

43 Web-Repository Crawler

44 Warrick
– First developed in fall of 2005
– Available for download at http://www.cs.odu.edu/~fmccown/warrick/
– www2006.org – first lost website reconstructed (Nov 2005)
– DCkickball.org – first website someone else reconstructed without our help (late Jan 2006)
– www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006)
– Internet Archive officially endorses Warrick (mid Mar 2006)

45 How Much Did We Reconstruct?
A "lost" website (A, B, C, D, E, F) vs. its reconstruction (A, B', C', G, E):
– missing link to D; points to old resource G instead
– F can't be found
Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G

46 Reconstruction Diagram
– identical: 50%
– changed: 33%
– missing: 17%
– added: 20%

47 Reconstruction Experiment
– Crawl and reconstruct 24 sites of various sizes:
  1. small (1-150 resources)
  2. medium (151-499 resources)
  3. large (500+ resources)
– Perform 5 reconstructions for each website
  – one using all four repositories together
  – four using each repository separately
– Calculate a reconstruction vector for each reconstruction: (changed%, missing%, added%) — see the sketch below
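A minimal sketch of that calculation over resource maps (URI → content), with "changed" simplified to a hash comparison; following the percentages on slide 46, added% is measured against the reconstructed site while the other two are relative to the original:

```python
import hashlib

def reconstruction_vector(original, recovered):
    """original, recovered: dict of resource URI -> content bytes.

    Returns (changed%, missing%, added%).
    """
    digest = lambda b: hashlib.md5(b).hexdigest()
    shared = original.keys() & recovered.keys()
    changed = sum(1 for u in shared if digest(original[u]) != digest(recovered[u]))
    missing = len(original.keys() - recovered.keys())
    added = len(recovered.keys() - original.keys())
    return (100.0 * changed / len(original),
            100.0 * missing / len(original),
            100.0 * added / len(recovered))
```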

48 Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.

49 Recovery Success by MIME Type

50 Repository Contributions

51 Current & Future Work
– Building a web interface for Warrick
– Currently crawling & reconstructing 300 randomly sampled websites each week
  – move from a descriptive model to a prescriptive & predictive model
– Injecting server-side functionality into the WI
  – recover the PHP code, not just the HTML

52 Conclusions
– Preserving the Web is a very difficult problem
– Linkrot is not likely to decrease anytime soon
– The WI is the combined effort of many entities preserving portions of the Web, and it can be utilized for preserving the Web at large
– Utilizing the WI for finding missing web pages (Opal) and websites (Warrick) is promising but not foolproof

53 Time & Queries

54 Limitations

Web crawling:
– limit hit rate per host
– websites periodically unavailable
– portions of website off-limits (robots.txt, passwords)
– deep web
– spam
– duplicate content
– Flash and JavaScript interfaces
– crawler traps

Web-repository crawling:
– limit hit rate per repository
– limited hits per day (API query quotas)
– repositories periodically unavailable
– Flash and JavaScript interfaces
– can only recover what the repositories have stored
– lossy format conversions (thumbnail images, HTML-ized PDFs, etc.)

