Introduction to Digital Libraries
Week 15: Lazy Preservation
Old Dominion University, Department of Computer Science
CS 751/851, Spring 2010
Michael L. Nelson
04/19/10

Slides From Frank McCown's PhD Defense
The slides for this week's lecture come from Frank McCown's defense. Rather than edit them, I've maintained them as they were presented.
More about Frank:

Lazy Preservation: Reconstructing Websites from the Web Infrastructure
Frank McCown
Advisor: Michael L. Nelson
Old Dominion University, Computer Science Department, Norfolk, Virginia, USA
Dissertation Defense, October 19, 2007

4 Outline
Motivation
Lazy preservation and the Web Infrastructure
Web repositories
Responses to 10 research questions
Contributions and Future Work

5 (images: black hat, virus, hard drive)

6 Preservation: Fortress Model (5 easy steps for preservation)
1. Get a lot of $
2. Buy a lot of disks, machines, tapes, etc.
3. Hire an army of staff
4. Load a small amount of data
5. "Look upon my archive ye Mighty, and despair!"

8 …I was doing a little “maintenance” on one of my sites and accidentally deleted my entire database of about 30 articles. After I finished berating myself for being so stupid, I realized that my hosting company would have a backup, so I sent an email asking them to restore the database. Their reply stated that backups were “coming soon”…OUCH!

Web Infrastructure

10 Lazy Preservation
How much preservation can be had for free? (Little to no effort for web producer/publisher before the website is lost)
High-coverage preservation of works of unknown importance
Built atop unreliable, distributed members which cannot be controlled
Usually limited to the crawlable web

11 Dissertation Objective
To demonstrate the feasibility of using the WI as a preservation service – lazy preservation – and to evaluate how effectively this previously unexplored service can be utilized for reconstructing lost websites.

12 Research Questions (Dissertation p. 3)
1. What types of resources are typically stored in the WI search engine caches, and how up-to-date are the caches?
2. How successful is the WI at preserving short-lived web content?
3. How much overlap is there between what is found in search engine caches and the Internet Archive?
4. What interfaces are necessary for a member of the WI (a web repository) to be used in website reconstruction?
5. How does a web-repository crawler work, and how can it reconstruct a lost website from the WI?

13 Research Questions cont.
6. What types of websites do people lose, and how successful have they been recovering them from the WI?
7. How completely can websites be reconstructed from the WI?
8. What website attributes contribute to the success of website reconstruction?
9. Which members of the WI are the most helpful for website reconstruction?
10. What methods can be used to recover the server-side components of websites from the WI?

WI Preliminaries: Web Repositories 14

15 How much of the Web is indexed? Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05) Internet Archive?

Cached Image 18

Cached PDF (figure: the canonical version alongside the MSN, Yahoo, and Google cached versions)

20 Types of Web Repositories
Depth of holdings
– Flat – only maintain last version of resource crawled
– Deep – maintain multiple versions, each with a timestamp
Access to holdings
– Dark – no outside access to resources
– Light – minimal access restrictions
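This taxonomy can be captured compactly in code. The following is a minimal sketch in Python; the class and attribute names are illustrative, not taken from Warrick's actual (Perl) implementation:

    from dataclasses import dataclass
    from enum import Enum

    class Depth(Enum):
        FLAT = "flat"    # only the last crawled version is kept
        DEEP = "deep"    # multiple timestamped versions are kept

    class Access(Enum):
        DARK = "dark"    # no outside access to holdings
        LIGHT = "light"  # minimal access restrictions

    @dataclass
    class RepositoryProfile:
        name: str
        depth: Depth
        access: Access

    # Example characterizations consistent with the slides: search engine
    # caches are flat, the Internet Archive is deep; both are light.
    google_cache = RepositoryProfile("Google cache", Depth.FLAT, Access.LIGHT)
    internet_archive = RepositoryProfile("Internet Archive", Depth.DEEP, Access.LIGHT)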

Accessing the WI
Screen-scraping the web user interface (WUI)
Application programming interface (API)
WUIs and APIs do not always produce the same responses; the APIs may be pulling from smaller indexes
McCown & Nelson, Agreeing to Disagree: Search Engines and their Public Interfaces, JCDL 2007
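The two access paths can be pictured with a sketch like the following (Python). The repository endpoint, parameters, and HTML marker are hypothetical placeholders, not any real engine's interface; the point is only that the scraped WUI and the API are separate services that may answer differently:

    import requests
    from urllib.parse import quote

    REPO = "https://repo.example.org"   # hypothetical web repository

    def get_cached_via_wui(uri):
        """Screen-scrape the human-oriented cache page for a URI."""
        page = requests.get(f"{REPO}/cache?q={quote(uri)}").text
        # Strip the repository's banner that precedes the cached copy
        # (the marker string is made up for illustration).
        _, _, cached_html = page.partition("<!-- end-of-cache-banner -->")
        return cached_html

    def get_cached_via_api(uri, key="YOUR-API-KEY"):
        """Ask the machine-oriented API for the same URI."""
        resp = requests.get(f"{REPO}/api/cache",
                            params={"uri": uri, "key": key})
        return resp.json().get("cached_content")

    # The two answers may differ: the API may be served from a smaller index.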

22 Research Questions 1-3: Characterizing the WI
Experiment 1: Observe the WI finding and caching new web content that is decaying
Experiment 2: Examine the contents of the WI by randomly sampling URLs

23 Timeline of Web Resource

24 Web Caching Experiment (May – Sept 2005)
Created 4 websites composed of HTML, PDFs, and images
Removed pages each day
Queried GMY (Google, MSN, Yahoo) every day using unique identifiers planted in the pages
McCown et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.
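In outline, the daily procedure looked something like the following sketch (Python). The query_cache helper stands in for the real test harness and is an assumption, not the study's actual code:

    import datetime

    ENGINES = ["google", "msn", "yahoo"]          # "GMY"

    def query_cache(engine, identifier):
        """Hypothetical helper: query the engine for the unique identifier
        planted in a test page and return True if a cached copy comes back."""
        raise NotImplementedError

    def daily_check(identifiers, log):
        """Run once a day for every test page, before and after its removal."""
        today = datetime.date.today().isoformat()
        for engine in ENGINES:
            for ident in identifiers:
                log.append((today, engine, ident, query_cache(engine, ident)))

Because each page carried its own identifier, a cached copy could still be located by querying for that identifier even after the page had been removed from the site.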

29 Observations
Google was the most useful web repository from a preservation perspective
– Quick to find new content
– Consistent access to cached content
– Lost content reappeared in the cache long after it was removed
Images are slow to be cached, and duplicate images are not cached

30 Experiment: Sample Search Engine Caches (Feb 2007)
Submitted 5200 one-term queries to Ask, Google, MSN, and Yahoo
Randomly selected 1 result from the first 100
Downloaded the resource and its cached page
Checked for overlap with the Internet Archive
McCown and Nelson, Characterization of Search Engine Caches, Archiving 2007.
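A rough sketch of the sampling loop follows (Python). The search helper and the term list are hypothetical stand-ins; in the actual study the queries were issued against each engine's public interface:

    import random

    def search(engine, term, max_results=100):
        """Hypothetical: return a list of (result_url, cached_copy_url) pairs."""
        raise NotImplementedError

    def sample_engine(engine, terms, n_queries):
        """Issue n_queries one-term queries and keep one random result each."""
        samples = []
        for term in random.sample(terms, n_queries):
            results = search(engine, term)
            if results:
                samples.append(random.choice(results[:100]))  # 1 of the first 100
        return samples

Each sampled result would then be downloaded twice (live resource and cached copy) and looked up in the Internet Archive for the overlap comparison.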

31 Distribution of Top Level Domains

32 Cached Resource Size Distributions (figure; labels: 976 KB, 977 KB, 1 MB, 215 KB)

33 Cache Freshness and Staleness
(timeline: crawled and cached, changed on web server, crawled and cached again; fresh period vs. stale period)
Staleness = max(0, Last-Modified HTTP header – cached date)
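The staleness measure on the slide can be computed directly from the HTTP Last-Modified header and the date shown on the cached copy. A small Python sketch:

    from datetime import datetime, timedelta, timezone
    from email.utils import parsedate_to_datetime

    def staleness(last_modified_header, cached_date):
        """Staleness = max(0, Last-Modified - cached date): the cached copy is
        stale if the live resource changed after it was cached."""
        last_modified = parsedate_to_datetime(last_modified_header)
        return max(timedelta(0), last_modified - cached_date)

    # Example: the resource changed 3 days after the engine cached it -> 3 days stale.
    print(staleness("Tue, 20 Feb 2007 10:00:00 GMT",
                    datetime(2007, 2, 17, 10, 0, tzinfo=timezone.utc)))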

34 Cache Staleness
46% of resources had a Last-Modified header
71% of those also had a cached date
16% were at least 1 day stale

35 Overlap with Internet Archive

36 Overlap with Internet Archive

37 Research Question 4 of 10: Repository Interfaces
Minimum interface requirement: "What resource r do you have stored for the URI u?"
r ← getResource(u)

38 Deep Repositories
"What resource r do you have stored for the URI u at datestamp d?"
r ← getResource(u, d)

39 Lister Queries
"What resources R do you have stored from the site s?"
R ← getAllUris(s)

41 Other Interface Commands
Get the list of dates D stored for URI u: D ← getResourceList(u)
Get the crawl date d for URI u: d ← getCrawlDate(u)
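Taken together, the interface a web repository must expose to a web-repository crawler can be sketched as an abstract class. This is a Python illustration; the method names follow the slides, while the class itself and the type hints are my own framing:

    from abc import ABC, abstractmethod
    from datetime import datetime
    from typing import List, Optional

    class RepositoryInterface(ABC):
        """Minimal interface a Warrick-style crawler needs from a repository."""

        @abstractmethod
        def getResource(self, u: str, d: Optional[datetime] = None) -> Optional[bytes]:
            """Return the resource stored for URI u (at datestamp d for deep
            repositories); None if nothing is held."""

        @abstractmethod
        def getAllUris(self, s: str) -> List[str]:
            """Lister query: all URIs stored from site s."""

        @abstractmethod
        def getResourceList(self, u: str) -> List[datetime]:
            """All datestamps stored for URI u (deep repositories)."""

        @abstractmethod
        def getCrawlDate(self, u: str) -> Optional[datetime]:
            """When URI u was crawled."""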

Research Question 5 of 10: Web-Repository Crawling 42

43 Web-repository Crawler

44 Warrick
Written in Perl
First version completed in Sept 2005
Made available to the public in Jan 2006
Run as a command-line program:
    warrick.pl --recursive --debug --output-file log.txt
Or on-line using the Brass queuing system
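The core reconstruction loop of a web-repository crawler can be sketched as follows. This is an illustrative Python sketch, not Warrick's actual Perl code; the extract_links helper and the repository objects (implementing getResource as above) are assumed:

    from urllib.parse import urljoin, urldefrag

    def reconstruct(start_url, repositories, extract_links, site_prefix):
        """Recover a lost website by repeatedly asking each repository
        for URLs discovered in already-recovered pages."""
        frontier = [start_url]
        recovered = {}                      # url -> content
        while frontier:
            url = frontier.pop(0)
            if url in recovered:
                continue
            for repo in repositories:
                content = repo.getResource(url)
                if content is not None:
                    recovered[url] = content
                    # Canonicalize and queue links found in the recovered page
                    for link in extract_links(content):
                        link = urldefrag(urljoin(url, link))[0]
                        if link.startswith(site_prefix) and link not in recovered:
                            frontier.append(link)
                    break                   # stop after the first repo that has it
        return recovered

This sketch simply stops at the first repository holding a copy; Warrick itself compares the copies available from the different repositories (preferring canonical versions), and three crawling policies were evaluated in the HYPERTEXT 2006 paper cited below.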

Research Question 6 of 10: Warrick usage 46

47 (figure; Ave 38.2%)

49 Research Questions 7 and 8: Reconstruction Effectiveness
Problem with usage data: difficult to determine how successful reconstructions actually were
– Brass tells Warrick to recover all resources, even if they are not part of the "current" website
– When were websites actually lost?
– Were URLs spelled correctly? Spam?
– Need the actual website to compare against the reconstruction, especially when trying to determine which factors affect a website's recoverability

51 Measuring the Difference
Apply a recovery vector (r_c, r_m, r_a) – (changed, missing, added) – to each resource
Compute a difference vector for the website
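A minimal sketch (Python, illustrative names) of how per-resource recovery outcomes roll up into the site-level difference vector and the success measure defined later in the backup slides. The choice of denominator for the "added" fraction (the reconstructed site) is my reading of the model, not something stated on this slide:

    def difference_vector(per_resource, n_added):
        """per_resource: one of 'identical', 'changed', or 'missing' for each
        resource of the original website.
        n_added: recovered resources that were not part of the original site.
        Returns ((d_changed, d_missing, d_added), success)."""
        n = len(per_resource)
        changed = per_resource.count("changed")
        missing = per_resource.count("missing")
        recovered = n - missing                  # identical + changed
        d_c = changed / n
        d_m = missing / n
        reconstructed = recovered + n_added      # size of the reconstructed site
        d_a = n_added / reconstructed if reconstructed else 0.0
        success = 1 - d_m                        # percent of resources recovered
        return (d_c, d_m, d_a), success

    # Example: 6 original resources (3 identical, 2 changed, 1 missing), 1 added.
    print(difference_vector(["identical"] * 3 + ["changed"] * 2 + ["missing"], 1))
    # ((0.33..., 0.166..., 0.166...), 0.833...)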

52 Reconstruction Diagram (example: identical 50%, changed 33%, missing 17%, added 20%)

53 McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006

54 Reconstruction Experiment
300 websites chosen randomly from the Open Directory Project (dmoz.org)
Crawled and reconstructed each website every week for 14 weeks
Examined change rates, age, decay, growth, recoverability
McCown and Nelson, Factors Affecting Website Reconstruction from the Web Infrastructure, JCDL 2007

55 Success of website recovery each week *On average, 61% of a website was recovered on any given week.

56 Recovery by TLD

57 Which Factors Are Significant?
External backlinks
Internal backlinks
Google's PageRank
Hops from root page
Path depth
MIME type
Query string params
Age
Resource birth rate
TLD
Website size
Size of resources

58 Regression Analysis
No surprises: all variables are significant, but the overall model explains only about half of the observations
Three most significant variables: PageRank, hops, and age (R-squared = )
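A sketch of the kind of regression that could be used to assess which factors predict recovery (Python with statsmodels). The data file, column names, and the ordinary-least-squares choice are illustrative assumptions; the dissertation's actual modeling details may differ:

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical per-resource measurements collected during the 14-week study
    df = pd.read_csv("resource_factors.csv")
    y = df["recovered"]                       # 1 if the resource was recovered (or a recovery score)
    X = df[["pagerank", "hops_from_root", "age_days",
            "path_depth", "external_backlinks", "site_size"]]
    X = sm.add_constant(X)

    model = sm.OLS(y, X).fit()
    print(model.rsquared)                     # proportion of variation explained
    print(model.summary())                    # per-factor significance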

59 Observations
Most of the sampled websites were relatively stable
– One third of the websites never lost a single resource
– Half of the websites never added any new resources
The typical website can expect to get back 61% of its resources if it were lost today (77% textual, 42% images, and 32% other)
How to improve recovery from the WI? Improve PageRank, decrease the number of hops to resources, create stable URLs

60 Research Question 9 of 10: Web Repository Contributions
Real usage data
Experimental results

61 Research Question 10 of 10: Recovering the web server's components
(diagram: static files (HTML files, PDFs, images, style sheets, JavaScript, etc.) and dynamic pages reach the Web Infrastructure and are recoverable; the web server's database, Perl scripts, and config files are not recoverable)

62 Injecting Server Components into Crawlable Pages
Erasure codes: server components are encoded into blocks that are embedded in the site's HTML pages; recovering at least m of the blocks is enough to reconstruct the components
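A toy illustration of the erasure-coding idea (Python). This is a trivial (m, m+1) scheme using a single XOR parity block, shown only to make "any m of the stored blocks suffice" concrete; the dissertation's actual encoding and the way blocks were embedded in HTML pages are more involved:

    def encode(data: bytes, m: int):
        """Split data into m equal blocks plus one XOR parity block (n = m + 1)."""
        if len(data) % m:
            data += b"\0" * (m - len(data) % m)   # pad to a multiple of m
        size = len(data) // m
        blocks = [bytearray(data[i * size:(i + 1) * size]) for i in range(m)]
        parity = bytearray(size)
        for block in blocks:
            for i, b in enumerate(block):
                parity[i] ^= b
        return [bytes(b) for b in blocks] + [bytes(parity)]

    def decode(blocks, missing_index=None):
        """Recover the m data blocks given any m of the m+1 stored blocks.
        blocks: the full list, with at most one entry set to None."""
        if missing_index is None or missing_index == len(blocks) - 1:
            return b"".join(blocks[:-1])          # all data blocks present
        size = len(blocks[-1])
        rebuilt = bytearray(size)
        for j, block in enumerate(blocks):
            if j != missing_index and block is not None:
                for i, b in enumerate(block):
                    rebuilt[i] ^= b               # XOR of the survivors = lost block
        blocks[missing_index] = bytes(rebuilt)
        return b"".join(blocks[:-1])

    # Example: 3 data blocks + 1 parity block; lose block 1, still recover.
    original = b"mysql dump, eprints config, perl scripts..."
    stored = encode(original, m=3)
    stored[1] = None
    assert decode(stored, missing_index=1).rstrip(b"\0") == original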

63 Server Encoding Experiment
Create a digital library (Monarch DL) using EPrints software and populate it with 100 research papers
Encode the EPrints server components (Perl scripts, MySQL database, config files) and inject them into all HTML pages
Reconstruct each week

65 Web resources recovered each week

67 Contributions
1. Novel solution to the pervasive problem of website loss: lazy preservation, after-the-fact recovery with little to no work required of the content creator
2. The WI is characterized: its behavior in consuming and retaining new web content, the types of resources it contains, and the overlap between flat and deep repositories

68 Contributions cont.
3. A model for resource availability is developed, from initial creation to potential unavailability
4. Developed a new type of crawler: the web-repository crawler. Its architecture, interfaces for crawling web repositories, and rules for canonicalizing URLs are defined, and three crawling policies are evaluated

69 Contributions cont.
5. Developed a statistical model to measure a reconstructed website and a reconstruction diagram to summarize reconstruction success
6. Discovered the three most significant variables that determine how successfully a web resource will be recovered from the WI: Google's PageRank, hops from the root page, and resource age

70 Contributions cont.
7. Proposed and experimentally validated a novel solution for recovering a website's server components from the WI
8. Created a website reconstruction service which is currently being used by the public to reconstruct more than 100 lost websites a month

71 Future Work
Improvements to Warrick: increase the number of repositories used, better discovery of URLs, handling of soft 404s
Determining or predicting loss: save websites when detecting that they are about to disappear or already have
Investigate other sources of lazy preservation: browser caches
More extensive overlap studies of the WI

72 Related Publications
Deep web
– IEEE Internet Computing 2006
Link rot
– IWAW 2005
Lazy Preservation / WI
– D-Lib Magazine 2006
– WIDM 2006
– Archiving 2007
– Dynamics of Search Engines: An Introduction (chapter)
– Content Engineering (chapter)
– International Journal on Digital Libraries 2007
Search engine contents and interfaces
– ECDL 2005
– WWW 2007
– JCDL 2007
Obsolete web file formats
– IWAW 2005
Warrick
– HYPERTEXT 2006
– Archiving 2007
– JCDL 2007
– IWAW 2007
– Communications of the ACM 2007 (to appear)

73 Thank You Can’t wait until I’m old enough to run Warrick!

74 Injecting Server Components into Crawlable Pages (diagram: server components are erasure-coded into blocks embedded in the site's HTML pages; recovering at least m blocks reconstructs the components)

75 (diagram: static files and dynamic pages in the Web Infrastructure are recoverable; the web server's database, Perl scripts, and config files are not)

77 Web Repository Characteristics (Google / Yahoo / Live / IA)
HTML text (text/html, html): C / C / C / C
Plain text (text/plain, txt, ans): M / M / M / C
Graphic Interchange Format (image/gif, gif): M / M / M / C
Joint Photographic Experts Group (image/jpeg, jpg): M / M / M / C
Portable Network Graphic (image/png, png): M / M / M / C
Adobe Portable Document Format (application/pdf, pdf): M / M / M / C
JavaScript (application/javascript, js): M / M / – / C
Microsoft Excel (application/vnd.ms-excel, xls): M / ~S / M / C
Microsoft PowerPoint (application/vnd.ms-powerpoint, ppt): M / M / M / C
Microsoft Word (application/msword, doc): M / M / M / C
PostScript (application/postscript, ps): M / ~S / – / C
C = canonical version is stored
M = modified version is stored (modified images are thumbnails, all others are HTML conversions)
~S = indexed but not stored
– = no entry

79 Some Difference Vectors
D = (changed, missing, added)
(0,0,0) – Perfect recovery
(1,0,0) – All resources are recovered but changed
(0,1,0) – All resources are lost
(0,0,1) – All recovered resources are at new URIs

80 How Much Change is a Bad Thing? (figure: a lost page and its recovered version)

81 How Much Change is a Bad Thing? (figure: a lost page and its recovered version)

82 Assigning Penalties
Apply penalties (P_c, P_m, P_a) to each resource (penalty adjustment), or apply them to the difference vector

83 Defining Success
success = 1 – d_m (equivalent to the percent of recovered resources)
(scale: 0 = less successful, 1 = more successful)

84 Recovery of Textual Resources

85 Birth and Decay

86 Recovery of HTML Resources

87 Recovery by Age

88 Mild Correlations
Hops and
– website size (0.428)
– path depth (0.388)
Age and
– # of query params (-0.318)
External links and
– PageRank (0.339)
– website size (0.301)
– hops (0.320)

89 Regression Parameter Estimates

90 Similarity vs. Staleness