1 Introduction to Digital Libraries Week 15: Web Infrastructure for Preservation Old Dominion University Department of Computer Science CS 751/851 Fall 2006.




Similar presentations
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:

“ The Anatomy of a Large-Scale Hypertextual Web Search Engine ” Presented by Ahmed Khaled Al-Shantout ICS
1 CS 502: Computing Methods for Digital Libraries Lecture 2 The Nomadic Computing Experiment Object Models.
Anatomy of a Large-Scale Hypertextual Web Search Engine (e.g. Google)
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Introduction Web Development II 5 th February. Introduction to Web Development Search engines Discussion boards, bulletin boards, other online collaboration.
Hypertext Computer Science 01i Introduction to the Internet Neal Sample 6 February 2001.
WEB DESIGNING Prof. Jesse A. Role Ph. D TM UEAB 2010.
Lazy Preservation: Reconstructing Websites from the Web Infrastructure Frank McCown Advisor: Michael L. Nelson Old Dominion University Computer Science.
1 Archive-It Training University of Maryland July 12, 2007.
The Technical SEO Audit Rick Ramos | seOveflow. Introduction  SEO is search engine usability.  Why do you need an audit?  How nimble are your development.
Search Engines and their Public Interfaces: Which APIs are the Most Synchronized? Frank McCown and Michael L. Nelson Department of Computer Science, Old.
Synchronicity: Just-In-Time Discovery of Lost Web Pages NDIIPP Partners Meeting June 25, 2009 Martin Klein & Michael L. Nelson Department of Computer Science.
Introductions Search Engine Development COMP 475 Spring 2009 Dr. Frank McCown.
Website Reconstruction using the Web Infrastructure Frank McCown Doctoral Consortium June.
HINARI/Basic Internet Concepts (module 1.1). Instructions - This part of the:  course is a PowerPoint demonstration intended to introduce you to Basic.
Wasim Rangoonwala ID# CS-460 Computer Security “Privacy is the claim of individuals, groups or institutions to determine for themselves when,
Web Characterization: What Does the Web Look Like?
CS621 : Seminar-2008 DEEP WEB Shubhangi Agrawal ( )‏ Jayalekshmy S. Nair ( )‏
Chapter 4 Adding Images. Chapter 4 Lessons Introduction 1.Insert and align images 2.Enhance an image and use alternate text 3.Insert a background image.
Thinking Differently About Web Page Preservation Michael L. Nelson, Frank McCown, Joan A. Smith Old Dominion University Norfolk VA
Web Archiving Dr. Frank McCown Intro to Web Science Harding University This work is licensed under Creative Commons Attribution-NonCommercial 3.0Attribution-NonCommercial.
HT'061 Evaluation of Crawling Policies for a Web-Repository Crawler Frank McCown & Michael L. Nelson Old Dominion University Norfolk, Virginia, USA Odense,
Tutorial 1: Browser Basics.
Master Thesis Defense Jan Fiedler 04/17/98
The Anatomy of a Large-Scale Hypertextual Web Search Engine Presented By: Sibin G. Peter Instructor: Dr. R.M.Verma.
Indo-US Workshop, June23-25, 2003 Building Digital Libraries for Communities using Kepler Framework M. Zubair Old Dominion University.
Web Searching Basics Dr. Dania Bilal IS 530 Fall 2009.
Preserving Digital Culture: Tools & Strategies for Building Web Archives : Tools and Strategies for Building Web Archives Internet Librarian 2009 Tracy.
Autumn Web Information retrieval (Web IR) Handout #0: Introduction Ali Mohammad Zareh Bidoki ECE Department, Yazd University
Instructor: Eng. Hosseinpour. Presenter: Ehsan Javanmard. Google Architecture.
My Website Was Lost, But Now It’s Found Frank McCown CS 110 – Intro to Computer Science April 23, 2007.
World Wide Web "WWW", "Web" or "W3".
Client-Side Preservation Techniques for ORE Aggregations Michael L. Nelson & Sudhir Koneru Old Dominion University, Norfolk VA OAI-ORE Specification Roll-Out.
Project Two Adding Web Pages, Links, and Images Define and set a home page Add pages to a Web site Describe Dreamweaver's image accessibility features.
Ph.D. Progress Report Frank McCown 4/14/05. Timeline Year 1 : Course work and Diagnostic Exam Year 2 : Course work and Candidacy Exam Year 3 : Write and.
Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007.
1 BCS, Oxfordshire, 19 February, 2004 WEB ARCHIVING issues and challenges Deborah Woodyard Digital Preservation Coordinator.
Web Design and Development. World Wide Web  World Wide Web (WWW or W3), collection of globally distributed text and multimedia documents and files 
Factors Affecting Website Reconstruction from the Web Infrastructure Frank McCown, Norou Diawara, and Michael L. Nelson Old Dominion University Computer.
Archive Ingest and Handling Test: ODU’s Perspective Michael L. Nelson Department of Computer Science Old Dominion University
What is Web Information retrieval from web Search Engine Web Crawler Web crawler policies Conclusion How does a web crawler work Synchronization Algorithms.
The Availability and Persistence of Web References in D-Lib Magazine Frank McCown, Sheffan Chan, Michael L. Nelson and Johan Bollen Old Dominion University.
Lazy Preservation: Reconstructing Websites by Crawling the Crawlers Frank McCown, Joan A. Smith, Michael L. Nelson, & Johan Bollen Old Dominion University.
Website Design, Development and Maintenance ONLY TAKE DOWN NOTES ON INDICATED SLIDES.
ADAPTIVE HYPERMEDIA Presented By:- Debraj Manna Raunak Pilani Gada Kekin Dhiraj.
1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems 1.
Brass: A Queueing Manager for Warrick Frank McCown, Amine Benjelloun, and Michael L. Nelson Old Dominion University Computer Science Department Norfolk,
Web Design Terminology Unit 2 STEM. 1. Accessibility – a web page or site that address the users limitations or disabilities 2. Active server page (ASP)
The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages , April.
The Internet Salihu Ibrahim Dasuki (PhD) CSC102 INTRODUCTION TO COMPUTER SCIENCE.
General Architecture of Retrieval Systems 1Adrienn Skrop.
The Anatomy of a Large-Scale Hypertextual Web Search Engine (The creation of Google)
Introduction to Digital Libraries Week 15: Lazy Preservation Old Dominion University Department of Computer Science CS 751/851 Spring 2010 Michael L. Nelson.
Introduction to Digital Libraries Week 13: Lazy Preservation Old Dominion University Department of Computer Science CS 751/851 Spring 2011 Michael L. Nelson.
Can’t Find Your 404s? Santa Fe Complex March 13, 2009 Martin Klein, Frank McCown, Joan Smith, Michael L. Nelson Department of Computer Science Old Dominion.
Computing Fundamentals
Lazy Preservation, Warrick, and the Web Infrastructure
A Brief Introduction to the Internet
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Agreeing to Disagree: Search Engines and Their Public Interfaces
Search Search Engines Search Engine Optimization Search Interfaces
Information Retrieval
Just-In-Time Recovery of Missing Web Pages
Correlation of Term Count and Document Frequency for Google N-Grams
Characterization of Search Engine Caches
Web Search Engines.
Presentation transcript:

1 Introduction to Digital Libraries Week 15: Web Infrastructure for Preservation Old Dominion University Department of Computer Science CS 751/851 Fall 2006 Michael L. Nelson Frank McCown 12/6/06

2 Outline
– Web page threats
– Web Infrastructure (WI)
– Utilizing the WI for finding "good enough" replacements of web pages
– Search engine caching experiment
– Utilizing the WI for reconstructing lost websites

3 Linkrot: The 404 Problem
– Kahle (97) - average page lifetime 44 days
– Koehler (99, 04) - 67% of URLs lost in 4 years
– Lawrence et al. (01) - 23%-53% of URLs in CiteSeer papers invalid over a 5 year span (3% of invalid URLs "unfindable")
– Spinellis (03) - 27% of URLs in CACM/Computer papers gone in 5 years
– Fetterly et al. (03) - about 0.5% of web pages disappear per week
– McCown et al. (05) - 10 year half-life for URLs in D-Lib Magazine articles
– Nelson & Allen (02) - 3% of objects in a digital library gone in 1 year

4 No Longer Here
– ECDL 1999: "good enough" page available
– PSP 2003: exact copy at new URL
– Greynet 99: unavailable at any URL?

5 [Image credits: black hat, virus image, hard drive]

6 How much of the Web is indexed? Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)

7 Web Infrastructure: Refreshing & Migrating

8

9

10 Cached Image

11 Cached PDF – MSN version, Yahoo version, Google version, canonical

12 Web Repository Characteristics

    Type                              MIME type                      File ext  Google  Yahoo  MSN  IA
    HTML                              text/html                      html      C       C      C    C
    Plain text                        text/plain                     txt, ans  M       M      M    C
    Graphic Interchange Format        image/gif                      gif       M       M      ~R   C
    Joint Photographic Experts Group  image/jpeg                     jpg       M       M      ~R   C
    Portable Network Graphic          image/png                      png       M       M      ~R   C
    Adobe Portable Document Format    application/pdf                pdf       M       M      M    C
    JavaScript                        application/javascript         js        M       M      -    C
    Microsoft Excel                   application/vnd.ms-excel       xls       M       ~S     M    C
    Microsoft PowerPoint              application/vnd.ms-powerpoint  ppt       M       M      M    C
    Microsoft Word                    application/msword             doc       M       M      M    C
    PostScript                        application/postscript        ps        M       ~S     -    C

    C  = canonical version is stored
    M  = modified version is stored (modified images are thumbnails, all others are HTML conversions)
    ~R = indexed but not retrievable
    ~S = indexed but not stored
    -  = not stored

13 Just-In-Time Preservation
How can the WI be utilized to locate replacements for missing web pages?
Master's thesis written by Terry Harrison in 2005
Terry L. Harrison and Michael L. Nelson, Just-In-Time Recovery of Missing Web Pages, Proceedings of Hypertext 2006, pp.

14 Lexical Signatures
"Robust Hyperlinks Cost Just Five Words Each" – Phelps & Wilensky (2000)
"Analysis of Lexical Signatures for Improving Information Presence on the World Wide Web" – Park et al. (2004)

    Lexical Signature                                     Calculation Technique  Results from Google
    2004+terry+digital+harrison+2003                      TF Based               456,000
    modoai+netpreserve.org+mod_oai+heretrix+xmlsolutions  IDF Based              2
    terry+harrison+thesis+jcdl+awarded                    TF-IDF Based           6

TF (Term Frequency) = how often does this word appear in this document?
IDF (Inverse Document Frequency) = in how many documents does this word appear?

15 Observations
One reason why the original Phelps & Wilensky vision was never realized is that it required a priori LS calculation
– idea: use the Web Infrastructure to calculate LSs as they are needed
Mass adoption of a system will occur only if it is really, really easy to do so
– idea: digital preservation systems should require only a small number of "heroes"

16 Description & Use Cases
Allow many web servers to use a few Opal servers that use the caches of the Web Infrastructure to generate Lexical Signatures of recently 404 URLs to find either:
– the same page at a new URL (example: bookmarked colleague is now 404)
  - cached info is not useful
  - similar pages probably not useful
– a "good enough" replacement page (example: bookmarked recipe is now 404)
  - cached info is useful
  - similar pages probably useful

17 Opal Configuration: “Configure Two Things” edit httpd.conf add / edit custom 404 page
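A minimal sketch of those two steps, assuming an Apache server; the file paths and the Opal recovery URL are illustrative placeholders, not Opal's actual layout:

    # httpd.conf: point Apache at a custom 404 page
    ErrorDocument 404 /errors/404.html

    # /errors/404.html carries the Opal "pagetag": a small script or meta
    # refresh that forwards the missing URL to the Opal server, e.g.
    #   http://opal.foo.edu/recover?url=<the 404 URL>
    # so Opal can search the WI caches and build a lexical signature for it.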

18 Opal High-Level Architecture (interactive user, Opal server at opal.foo.edu)
1. Get URL X
2. Custom 404 page
3. Pagetag redirects user to Opal server
4. Opal searches WI caches; creates LS
5. Opal gives user navigation options

19 Locating Caches

20 Internet Archive

21 Term Frequency × Inverse Document Frequency
Calculating Term Frequency is easy
– frequency of term in this document
Calculating Document Frequency is hard
– frequency of term in all documents: assumes knowledge of the entire corpus!
"Good terms" appear:
– frequently in a single document
– infrequently across all documents
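A rough sketch of this scoring in Python, assuming the per-term document frequencies and corpus size have already been obtained from somewhere (function and variable names are illustrative, not Opal's code):

    import math
    import re
    from collections import Counter

    def lexical_signature(text, df, corpus_size, k=5):
        """Rank a document's terms by TF-IDF and return the top k as its
        lexical signature. df maps term -> approximate document frequency;
        corpus_size is the estimated total number of documents."""
        terms = re.findall(r"[a-z0-9]+", text.lower())
        tf = Counter(terms)
        def score(term):
            # +1 keeps terms never seen in the corpus from dividing by zero
            idf = math.log(corpus_size / (1 + df.get(term, 0)))
            return tf[term] * idf
        return sorted(tf, key=score, reverse=True)[:k]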

22 Scraping Google to Approximate DF
Frequency of a term across all documents: approximated by the hit count Google reports for a query on that term
How many documents? approximated by the engine's reported index size
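The slide's approximation, sketched with a caller-supplied hit_count() helper standing in for the scraped Google result count (no real Google API or scraping code is shown):

    import math

    def approximate_idf(term, hit_count, corpus_size):
        """Approximate a term's IDF from a search engine:
        hit_count(term) -> estimated number of pages containing the term
        (e.g. taken from a results page), corpus_size -> the engine's
        reported index size. Both are rough stand-ins for true DF / corpus."""
        df = max(hit_count(term), 1)   # guard against a zero hit count
        return math.log(corpus_size / df)

Whether those reported hit counts are reliable enough, and whether scraping is acceptable, is exactly the question raised on the Future Work slide.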

23 GUI - Bootstrapping

24 GUI - Learned

25 GUI (cont)
Example url:similarURL record returned by Opal: datestamp, votes="1", simURL, and baseURL attributes, a vote link (demo_dev.pl?method=vote&url=...&match=...), and the cached snippet for the candidate page ("Terry Harrison Profile Page ... Burning Man Images ... Other Images (not really well sorted, sorry!) ... (May 2003), AR Zipf Fellowship Awarded to Terry Harrison - Press Release").

26 Opal Server Databases
URL database
– 404 URL → (LS, similarURL1, similarURL2, …, similarURLN)
– similarURL → (URL, datestamp, votes, Opal server)
Term database
– term → (Opal server, source, datestamp, DF, corpus size, IDF)
Define each URL and Term as OAI-PMH Records and we can harvest what an Opal server has "learned":
– can accommodate late arrivers (no "cold start" for them)
– pool the learning of multiple servers
– incentives to cooperate
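Read literally, the two databases might be sketched like this (a toy in-memory version in Python; all values, including the example URLs, counts, and corpus size, are purely illustrative):

    # URL database: 404 URL -> lexical signature plus candidate replacements
    url_db = {
        "http://www.example.edu/old-page": {
            "lexical_signature": ["terry", "harrison", "thesis", "jcdl", "awarded"],
            "similar_urls": [
                # (URL, datestamp, votes, Opal server that found it)
                ("http://www.example.edu/new-page", "2005-06-01", 1, "opal.foo.edu"),
            ],
        },
    }

    # Term database: provenance plus the numbers needed to recompute IDF
    term_db = {
        "thesis": {
            "opal_server": "opal.foo.edu",
            "source": "google",
            "datestamp": "2005-06-01",
            "df": 456000,
            "corpus_size": 8_000_000_000,
            "idf": 9.77,   # log(corpus_size / df)
        },
    }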

27 Opal Synchronization
– Group 1: Opal D aggregates the Terms and URLs of Opal D.1–D.3
– Group 2: Opal D aggregates Opal A–C
Other architectures possible; harvesting frequency determined by individual nodes

28 Discovery via OAI-PMH

29 Connection Costs
Cost_cache = (WI × N) + R
– WI = # of web infrastructure caches
– N = connections for each WI cache
– R = connection to get a datestamp
Cost_paths = R_c + T + R_l
– R_c = connections to get a cached copy
– T = connections required for each term
– R_l = connections to use the LS
Example: Cost_cache = 3 × 1 + 1 = 4; Cost_paths = 1 + T + 1

30 Analysis - Cumulative Terms Learned (chart: documents processed vs. cumulative terms learned, up to 1 million terms; result averages after 100 iterations)

31 Analysis - Terms Learned Per Document (chart: documents processed vs. terms learned per document, 1 million term corpus; result averages after 100 iterations)

32 Load Estimation

33 Future Work
– Testing on departmental server (hard to test in-the-small)
– Code optimizations: many shortcuts taken for the demo system (Google & Yahoo APIs not used; screen scraping only)
– Lexical Signatures: describe changes over time
– IDF calculation metrics: is scraping Google valid? is it nice?
– Learning new code: use OAI-PMH to update the system
– OpenURL resolver: 404 URL = referent

34 Lazy Preservation and Website Reconstruction
Investigating website reconstruction from the WI
Publications:
– Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Lazy Preservation: Reconstructing Websites by Crawling the Crawlers. 8th ACM International Workshop on Web Information and Data Management (WIDM 2006), 10 November 2006.
– Frank McCown and Michael L. Nelson. Evaluation of Crawling Policies for a Web-Repository Crawler. 17th ACM Conference on Hypertext and Hypermedia (HYPERTEXT 2006), August 2006.
– Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites. D-Lib Magazine, February 2006, Vol. 12, No. 2.
– Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster. Technical Report.

35 Timeline of Web Resource

36 Web Caching Experiment
– Create 4 websites composed of HTML, PDF, and image resources
– Remove pages each day
– Query Google, MSN, and Yahoo (GMY) each day using the pages' unique identifiers
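A sketch of the daily measurement loop, assuming each test page embeds a unique identifier string; cached_copy_exists is a caller-supplied stand-in for the actual engine queries (no engine-specific API is shown here):

    import datetime

    ENGINES = ["google", "msn", "yahoo"]

    def daily_check(identifiers, cached_copy_exists, log):
        """Record, for each unique page identifier, which engines still return
        a cached copy today. cached_copy_exists(engine, identifier) is supplied
        by the caller, e.g. by querying the engine for the identifier and
        following its cache link."""
        today = datetime.date.today().isoformat()
        for ident in identifiers:
            for engine in ENGINES:
                log.append((today, ident, engine, cached_copy_exists(engine, ident)))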

37

38

39

40

41 Crawling the Web and web repositories

42 Traditional Web Crawler

43 Web-Repository Crawler
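At a very high level, a web-repository crawler differs from a traditional crawler in that the frontier is fed to web repositories instead of to the live site. A rough sketch, where the repository objects and their recover() method are placeholders rather than Warrick's actual interfaces:

    from urllib.parse import urljoin
    import re

    def reconstruct(start_url, repositories):
        """Breadth-first web-repository crawl: try each repository for every URL,
        keep the first recovered copy, and queue the links found inside it.
        Each repository is an object with a recover(url) -> html-or-None method."""
        frontier, seen, recovered = [start_url], {start_url}, {}
        while frontier:
            url = frontier.pop(0)
            for repo in repositories:
                html = repo.recover(url)          # cached/archived copy, if any
                if html is not None:
                    recovered[url] = html
                    for href in re.findall(r'href="([^"]+)"', html):
                        link = urljoin(url, href)
                        # crude same-site test; a real crawler would be smarter
                        if link.startswith(start_url) and link not in seen:
                            seen.add(link)
                            frontier.append(link)
                    break                          # stop at the first repo that has it
        return recovered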

44 Warrick: first developed in fall of 2005; available for download
– www2006.org – first lost website reconstructed (Nov 2005)
– DCkickball.org – first website someone else reconstructed without our help (late Jan 2006)
– first website we reconstructed for someone else (mid Mar 2006)
– Internet Archive officially endorses Warrick (mid Mar 2006)

45 How Much Did We Reconstruct?
A "lost" web site (A, B, C, D, E, F) vs. the reconstructed web site (A, B', C', E, G): the missing link to D points to old resource G, and F can't be found.
Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G

46 Reconstruction Diagram: identical 50%, changed 33%, missing 17%, added 20%

47 Reconstruction Experiment
Crawl and reconstruct 24 sites of various sizes:
1. small (1-150 resources)
2. medium ( resources)
3. large (500+ resources)
Perform 5 reconstructions for each website:
– one using all four repositories together
– four using each repository separately
Calculate a reconstruction vector for each reconstruction: (changed%, missing%, added%); a sketch of the calculation follows.
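For concreteness, the reconstruction vector can be computed from the recovered-resource sets like this (a sketch; the exact denominators are an assumption, following the convention that changed and missing are measured against the original site and added against the reconstructed site):

    def reconstruction_vector(original, identical, changed, added):
        """Return the (changed%, missing%, added%) triple.
        original: URIs of the lost site; identical / changed: subsets of original
        recovered unchanged / with different content; added: recovered URIs that
        were never on the lost site."""
        missing = original - identical - changed
        reconstructed = len(identical) + len(changed) + len(added)
        return (100.0 * len(changed) / len(original),
                100.0 * len(missing) / len(original),
                100.0 * len(added) / reconstructed if reconstructed else 0.0)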

48 Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster. Technical Report, arXiv cs.IR/ , 2005.

49 Recovery Success by MIME Type

50 Repository Contributions

51 Current & Future Work
– Building a web interface for Warrick
– Currently crawling & reconstructing 300 randomly sampled websites each week: move from a descriptive model to a prescriptive & predictive model
– Injecting server-side functionality into the WI: recover the PHP code, not just the HTML

52 Conclusions
– Preserving the Web is a very difficult problem
– Linkrot is not likely to decrease anytime soon
– The WI is the combined effort of many entities preserving portions of the Web and can be utilized for preserving the Web at large
– Utilizing the WI for finding missing web pages (Opal) and websites (Warrick) is promising but not foolproof

53 Time & Queries

54 Limitations
Web crawling:
– limit hit rate per host
– websites periodically unavailable
– portions of a website off-limits (robots.txt, passwords)
– deep web
– spam
– duplicate content
– Flash and JavaScript interfaces
– crawler traps
Web-repository crawling:
– limit hit rate per repository
– limited hits per day (API query quotas)
– repositories periodically unavailable
– Flash and JavaScript interfaces
– can only recover what the repositories have stored
– lossy format conversions (thumbnail images, HTML-ized PDFs, etc.)