Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007.

Presentation transcript:

Lazy Preservation, Warrick, and the Web Infrastructure Frank McCown Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007 Vancouver, BC June 19, 2007

2 Outline
What is the Web Infrastructure (WI)?
How can the WI be used for preservation?
Web-repository crawling with Warrick
Understanding the WI
–Caching experiment
–Reconstruction experiments
–Search engine sampling and IA overlap experiment
Recovering web server components from the WI
Brass: Queueing manager for Warrick

4 Web Infrastructure

5 Alternative Models of Preservation
Lazy Preservation
–Let Google, IA et al. preserve your website
Just-In-Time Preservation
–Wait for it to disappear first, then find a “good enough” version
Shared Infrastructure Preservation
–Push your content to sites that might preserve it
Web Server Enhanced Preservation
–Use Apache modules to create archival-ready resources

7 Black hat: Virus image: Hard drive:

8 Crawling the Crawlers

11 Cached Image

12 Cached PDF: MSN version, Yahoo version, Google version, canonical

13 Web-repository Crawler

14 McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.
McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM IEEE JCDL 2007.
McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006.
McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.
Available at

15 What Types of Websites Are Lost? Marshall, McCown, and Nelson, Evaluating Personal Archiving Strategies for Internet-based Information, IS&T Archiving 2007.

16 Outline
What is the Web Infrastructure (WI)?
How can the WI be used for preservation?
Web-repository crawling with Warrick
Understanding the WI
–Caching experiment
–Reconstruction experiments
–Search engine sampling and IA overlap experiment
Recovering web server components from the WI
Brass: Queueing manager for Warrick

17 Understanding the WI
How quickly do search engines acquire and purge their caches?
Do search engines prefer caching one type of resource over another?
How much overlap is there between the search engine caches and IA holdings?
How successfully can we reconstruct a lost website?
Are some resources more recoverable than others?

18 Timeline of Web Resource

19 Web Caching Experiment
Create 4 websites composed of HTML, PDFs, and images
Remove pages each day
Query Google, MSN, and Yahoo (GMY) every day using identifiers
McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.

24 Where is the Internet Archive?
No crawls from Alexa, IA’s provider
Even if they had crawled us, the content would not be accessible from IA for 6-12 months
Short-lived web content is likely to be lost for good

25 Reconstruction Experiment
Crawl and reconstruct 24 sites of various sizes:
1. small (1-150 resources)
2. medium ( resources)
3. large (500+ resources)
Perform 5 reconstructions for each website
–One using all four repositories together
–Four using each repository separately
Calculate reconstruction vector for each reconstruction (changed%, missing%, added%)

26 How Much Did We Reconstruct?
Diagram: a “lost” web site (resources A, B, C, D, E, F) beside the reconstructed web site (A, B’, C’, G, E, F)
Missing link to D; points to old resource G
F can’t be found
Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G
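The four categories above can be turned into a (changed, missing, added) vector with a few set operations. The sketch below is only illustrative: the function name, the dict-based site model, and the choice of denominators (changed and missing relative to the original site, added relative to the reconstructed site) are assumptions, not something the slides specify.

```python
def reconstruction_vector(original, reconstructed):
    """Return (changed, missing, added) as fractions.

    original / reconstructed: dicts mapping a resource name to its content.
    Assumption: changed and missing are divided by the size of the original
    site, added by the size of the reconstructed site.
    """
    if not original or not reconstructed:
        raise ValueError("both sites must contain at least one resource")
    orig_names = set(original)
    recon_names = set(reconstructed)
    shared = orig_names & recon_names
    changed = sum(1 for name in shared if original[name] != reconstructed[name])
    missing = len(orig_names - recon_names)
    added = len(recon_names - orig_names)
    return (changed / len(orig_names),
            missing / len(orig_names),
            added / len(recon_names))


# Hypothetical contents that follow the categories listed on the slide:
# A and E identical, B and C changed, D and F missing, G added.
lost_site = {"A": "a", "B": "b", "C": "c", "D": "d", "E": "e", "F": "f"}
recovered_site = {"A": "a", "B": "b'", "C": "c'", "E": "e", "G": "g"}
vector = reconstruction_vector(lost_site, recovered_site)
```

Under these assumptions the example yields 2 of 6 changed, 2 of 6 missing, and 1 of 5 added.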

27 Reconstruction Diagram: added 20%, identical 50%, changed 33%, missing 17%

28 Recovery Success by MIME Type

29 Repository Contributions

30 Reconstruction Experiment
300 websites chosen randomly from Open Directory Project (dmoz.org)
Crawled and reconstructed each website every week for 14 weeks
Examined change rates, age, decay, growth, recoverability
McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM IEEE JCDL 2007.

31 Success of website recovery each week *On average, we recovered 61% of a website on any given week.

33 Statistics for Repositories

34 Experiment: Sample Search Engine Caches
Feb 2006
Submitted 5200 one-term queries to Ask, Google, MSN, and Yahoo
Randomly selected 1 result from first 100
Downloaded resource and cached page
Checked for overlap with Internet Archive
McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.
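The sampling step (one random result out of the first 100 per query) might look like the sketch below. The function name and the example URLs are hypothetical, and the result list is assumed to come from whatever search API is being queried; the slides do not show the actual sampling code.

```python
import random


def sample_result(results, rng=random, top_n=100):
    """Pick one result uniformly at random from the first top_n results.

    Returns None when the query produced no results.
    """
    candidates = results[:top_n]
    if not candidates:
        return None
    return rng.choice(candidates)


# Hypothetical result list for a single one-term query.
rng = random.Random(2006)
results = [f"http://example.com/page{i}" for i in range(250)]
picked = sample_result(results, rng)
```

Passing a seeded `random.Random` instance keeps a sampling run repeatable.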

35 Distribution of Top Level Domains

36 Cached Resource Size Distributions: 976 KB, 977 KB, 1 MB, 215 KB

37 Cache Freshness
Timeline: crawled and cached (fresh); changed on web server (stale); crawled and cached again (fresh)
Staleness = max(0, Last-Modified HTTP header – cached date)
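The staleness formula can be evaluated directly from a Last-Modified header and a cached date. This is a minimal sketch: the function name and the example dates are made up for illustration, and the cached date is assumed to be available as a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def staleness(last_modified_header, cached_at):
    """Staleness = max(0, Last-Modified - cached date), as a timedelta."""
    last_modified = parsedate_to_datetime(last_modified_header)
    return max(timedelta(0), last_modified - cached_at)


# Hypothetical dates: the page changed two days after it was cached.
cached = datetime(2006, 2, 13, 10, 0, 0, tzinfo=timezone.utc)
stale = staleness("Wed, 15 Feb 2006 10:00:00 GMT", cached)

# A page last modified before it was cached has zero staleness.
fresh = staleness("Sun, 12 Feb 2006 10:00:00 GMT", cached)
```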

38 Cache Staleness
46% of resources had Last-Modified header
71% also had cached date
16% were at least 1 day stale

39 Similarity vs. Staleness

40 How much of the Web is indexed? Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05) Internet Archive?

41 Overlap with Internet Archive

42 Overlap with Internet Archive

43 Distribution of Sampled URLs

44 Problem: WI currently only stores the client-side representation of a website. Server components (scripts, databases, configuration files, etc.) are not accessible from the WI

45 Outline
What is the Web Infrastructure (WI)?
How can the WI be used for preservation?
Web-repository crawling with Warrick
Understanding the WI
–Caching experiment
–Reconstruction experiments
–Search engine sampling and IA overlap experiment
Recovering web server components from the WI
Brass: Queueing manager for Warrick

46 Diagram: the Web Server holds a database, Perl script, config, static files (html files, PDFs, images, style sheets, Javascript, etc.), and a dynamic page; each component is marked Recoverable or Not Recoverable from the Web Infrastructure

47 Injecting Server Components into Crawlable Pages
Erasure codes
HTML pages
Recover at least m blocks
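To illustrate the "recover at least m blocks" idea, here is a deliberately simple erasure code: m data blocks plus a single XOR parity block, so any m of the m + 1 blocks rebuild the file. The slide does not say which erasure code is used; this single-parity scheme is a stand-in chosen for brevity, and all names in it are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data, m):
    """Split data into m equal blocks and append one XOR parity block."""
    block_len = -(-len(data) // m)  # ceiling division
    padded = data.ljust(block_len * m, b"\0")
    blocks = [padded[i * block_len:(i + 1) * block_len] for i in range(m)]
    return blocks + [reduce(xor_bytes, blocks)]


def decode(blocks, m, length):
    """Rebuild the data from any m of the m + 1 blocks.

    blocks: dict mapping block index (0..m, where m is the parity) to bytes.
    length: original data length, used to strip the padding.
    """
    if len(blocks) < m:
        raise ValueError("need at least m blocks")
    blocks = dict(blocks)
    missing = [i for i in range(m) if i not in blocks]
    if missing:
        # One data block is lost: XOR of all surviving blocks restores it.
        blocks[missing[0]] = reduce(xor_bytes, blocks.values())
    return b"".join(blocks[i] for i in range(m))[:length]


# Hypothetical server-side script split into 4 data blocks + 1 parity block,
# as if each block had been embedded in a different crawlable HTML page.
script = b"#!/usr/bin/perl\nprint 'hello';\n"
pieces = encode(script, 4)
surviving = {i: block for i, block in enumerate(pieces) if i != 2}
recovered = decode(surviving, 4, len(script))
```

A production scheme would more likely use a code that tolerates several lost blocks; the single-parity version only survives the loss of one.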

48 Brass: A Queueing Manager for Warrick
Warrick requires some technical expertise to download, install, and run
Warrick uses search engine APIs which allow limited requests per IP address (or key)
Google no longer provides new keys for accessing their API
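The request limit is what makes a queue necessary: jobs have to share a fixed daily budget of API calls. The sketch below shows one way that could work, as a FIFO queue that pauses when the quota runs out. It is not Brass's actual design; the class, method names, and quota value are all assumptions.

```python
from collections import deque


class QuotaQueue:
    """FIFO queue of reconstruction jobs sharing one daily request quota."""

    def __init__(self, daily_quota):
        self.daily_quota = daily_quota
        self.jobs = deque()  # each job: (site, deque of URLs still to request)

    def submit(self, site, urls):
        self.jobs.append((site, deque(urls)))

    def run_day(self, fetch):
        """Issue up to daily_quota requests; return the sites finished today."""
        remaining = self.daily_quota
        finished = []
        while self.jobs and remaining > 0:
            site, urls = self.jobs[0]
            while urls and remaining > 0:
                fetch(urls.popleft())
                remaining -= 1
            if not urls:
                self.jobs.popleft()
                finished.append(site)
        return finished


# Hypothetical jobs with a quota of 3 requests per day.
queue = QuotaQueue(daily_quota=3)
queue.submit("a.example", ["a1", "a2"])
queue.submit("b.example", ["b1", "b2", "b3"])
requested = []
day_one = queue.run_day(requested.append)
day_two = queue.run_day(requested.append)
```

A job that exhausts the quota stays at the head of the queue and resumes on the next call to `run_day`.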

51 Thank You Frank McCown Can’t wait until I’m old enough to recover a website!

56 Web Repository Characteristics
Type | MIME type | File ext | Google, Yahoo, Live, IA
HTML text | text/html | html | C, C, C, C
Plain text | text/plain | txt, ans | M, M, M, C
Graphic Interchange Format | image/gif | gif | M, M, M, C
Joint Photographic Experts Group | image/jpeg | jpg | M, M, M, C
Portable Network Graphic | image/png | png | M, M, M, C
Adobe Portable Document Format | application/pdf | pdf | M, M, M, C
JavaScript | application/javascript | js | M, M, C
Microsoft Excel | application/vnd.ms-excel | xls | M, ~S, M, C
Microsoft PowerPoint | application/vnd.ms-powerpoint | ppt | M, M, M, C
Microsoft Word | application/msword | doc | M, M, M, C
PostScript | application/postscript | ps | M, ~S, C
C = Canonical version is stored
M = Modified version is stored (modified images are thumbnails, all others are HTML conversions)
~S = Indexed but not stored

57 Results
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/ , 2005.