The Availability and Persistence of Web References in D-Lib Magazine Frank McCown, Sheffan Chan, Michael L. Nelson and Johan Bollen Old Dominion University.

Presentation transcript:

The Availability and Persistence of Web References in D-Lib Magazine Frank McCown, Sheffan Chan, Michael L. Nelson and Johan Bollen Old Dominion University Computer Science Department Norfolk, Virginia, USA IWAW05 September 22, 2005

Definition of Inaccessible URL
Accessible URL: an HTTP GET on the URL either returns an HTTP 200 (OK) response with non-zero length content, or eventually returns an HTTP 200 response with non-zero length content after following one or more redirects (HTTP 3xx).
Inaccessible URL: not an accessible URL (everything else).
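The definition above can be expressed as a small predicate. This is a minimal sketch, not the authors' code: it assumes the redirect chain has already been recorded as a list of (status code, content length) pairs, and the function name `is_accessible` is hypothetical.

```python
from typing import List, Tuple

def is_accessible(chain: List[Tuple[int, int]]) -> bool:
    """Apply the slide's definition to a recorded response chain.

    `chain` holds one (status_code, content_length) pair per response,
    in the order received. This representation is an assumption made
    for illustration.
    """
    if not chain:
        return False  # no response at all counts as inaccessible
    *redirects, final = chain
    # Every response before the last one must be a redirect (3xx).
    if any(not (300 <= status < 400) for status, _ in redirects):
        return False
    status, length = final
    # The final response must be 200 with non-zero length content.
    return status == 200 and length > 0
```

For example, `is_accessible([(302, 0), (301, 0), (200, 765)])` is `True`, while a 200 response with zero-length content, or a chain that ends on a redirect, is classified as inaccessible.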

Redirection Example
Request: HTTP GET
Response: HTTP
Request: HTTP GET
Response: HTTP
Request: HTTP GET
Response: HTTP 200, Content-Length: 765
Frequently encountered when using DOI resolvers, handles, and PURLs:
Request: HTTP GET
Response: HTTP 302

Related Work
Many papers discuss link rot of academic citations.
Two studies deal with computer science and related articles:
Steve Lawrence et al., “Persistence of Web References in Scientific Research”, IEEE Computer, 34(2), 2001
Diomidis Spinellis, “The Decay and Failures of Web References”, Communications of the ACM, 46(1), 2003

Lawrence Study
Figure from
67,577 URLs accessed in May 2000
Half-life of URL = 6 years from publication date (our calculation)

Spinellis Study
Figure from
1,391 URLs from Communications of the ACM
2,833 URLs from IEEE Computer
Accessed in June 2000
Half-life of URL = 4 years from publication date

Methodology
1. Downloaded all articles from July 1995 to August 2004 (453 articles) and extracted all hyperlinks (7094 total).
2. Removed all URLs that referenced ( and and all redundant URLs, producing a total of 4387 URLs.
3. Downloaded 4387 URLs 72 times (three times a week for 25 weeks), beginning on September 9, 2004 and ending on February 27, 2005.
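Steps 1 and 2 of the methodology can be sketched as follows. This is an illustration under stated assumptions, not the authors' tooling: the helper names `extract_links` and `unique_external` are hypothetical, and because the excluded URLs are missing from the transcript, the exclusion list is left as a caller-supplied parameter.

```python
from html.parser import HTMLParser
from typing import Iterable, List


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> List[str]:
    """Step 1: extract all hyperlinks from one article."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def unique_external(urls: Iterable[str], excluded_prefixes: Iterable[str]) -> List[str]:
    """Step 2: drop excluded and redundant URLs, keeping first-seen order.

    `excluded_prefixes` is a placeholder for the sites the study removed.
    Restricting to http(s) URLs is an assumption of this sketch.
    """
    excluded = tuple(excluded_prefixes)
    seen = set()
    result = []
    for url in urls:
        if not url.startswith(("http://", "https://")):
            continue  # skip mailto:, relative links, and so on
        if url.startswith(excluded) or url in seen:
            continue
        seen.add(url)
        result.append(url)
    return result
```

Step 3 would then request each remaining URL on a fixed schedule and record the response chain for every check.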

Availability at Checkpoints

Availability at Checkpoints

Distribution by Year
10-year half-life from publication date
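To make the half-life figures concrete, the sketch below assumes simple exponential decay, where the fraction of URLs still accessible halves every half-life. The slides do not state how the half-lives were calculated, so this model and both function names are assumptions for illustration only.

```python
import math


def fraction_accessible(age_years: float, half_life_years: float) -> float:
    """Expected fraction of URLs still accessible after `age_years`,
    assuming exponential decay (an assumption of this sketch)."""
    return 0.5 ** (age_years / half_life_years)


def estimate_half_life(age_years: float, observed_fraction: float) -> float:
    """Invert the decay model: the half-life implied by one observation."""
    if not 0.0 < observed_fraction < 1.0:
        raise ValueError("observed_fraction must be strictly between 0 and 1")
    return age_years * math.log(0.5) / math.log(observed_fraction)
```

Under this model, a reference published eight years ago would be expected to survive with probability 0.25 at a 4-year half-life and about 0.57 at a 10-year half-life.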

Error Codes
HTTP Code   Meaning                    First check   Last check
404         Not found                  %             %
500         Internal server error      32.51 %       %
403         Forbidden                  3.94 %        3.86 %
401         Unauthorized               0.74 %        0.62 %
200         OK but 0 length content    0.25 %        0.23 %
410         Gone                       0.08 %        0.00 %
502         Bad gateway                0.08 %        0.00 %
“Soft 404s” were not tested.
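A breakdown like the one on this slide can be computed by tallying the final status code of each inaccessible URL at a given check. The sketch below is a hypothetical illustration; the function name `code_shares` and the sample input are not from the study.

```python
from collections import Counter
from typing import Dict, Iterable


def code_shares(status_codes: Iterable[int]) -> Dict[int, float]:
    """Return each status code's share of the total, in percent."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: 100.0 * n / total for code, n in counts.items()}
```

Running this once on the first check and once on the last check would give the two percentage columns.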

Path Depth

Top-Level Domain

Path Characteristics
                    Personal home page   Non-standard port   Dynamic page
Inaccessible URLs
Accessible URLs
Total URLs
% Inaccessible      51.9 %               82.8 %              41.1 %

File Extension

Persistent URLs
Few uses of mechanisms designed to make URLs persist:
59 PURLs (unique): 59% were inaccessible
2 handles (unique): none inaccessible
15 DOIs (unique, not pointing back to dlib.org): none inaccessible

Content Changes
683 “in flux” URLs: more than 1 KB change in size
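One way to flag an “in flux” URL is to compare the content sizes recorded across the checks. The slide does not say whether the 1 KB threshold applies between consecutive checks or across the whole period, so the sketch below assumes the latter (largest minus smallest observed size) and treats 1 KB as 1024 bytes. The function name is hypothetical.

```python
from typing import Sequence

ONE_KB = 1024  # assumption: 1 KB taken as 1024 bytes


def is_in_flux(sizes: Sequence[int], threshold: int = ONE_KB) -> bool:
    """True if the observed content size varied by more than `threshold`
    bytes across all checks (an assumed reading of the slide)."""
    if len(sizes) < 2:
        return False  # a single observation cannot show change
    return max(sizes) - min(sizes) > threshold
```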

Bad URL Characteristics
The URL characteristics below were associated with increased levels of link rot:
a non-standard port
a personal homepage
dynamic query strings
uncommon or deprecated file extensions (e.g., .txt, .shtml, .ps)
.net, .edu, or country-specific top-level domain names
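The characteristics listed above can be read off a URL with the standard library. This is a rough sketch under stated assumptions: the slides do not say how a personal home page was detected, so a tilde in the path is used here as a stand-in, path depth is counted as the number of non-empty path segments, and the function name `url_features` is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def url_features(url: str) -> dict:
    """Extract the URL traits the slides associate with link rot."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    port = parts.port  # None when the URL names no explicit port
    return {
        # An explicit port that differs from the scheme's default.
        "non_standard_port": port is not None and port != DEFAULT_PORTS.get(parts.scheme),
        # Assumption: a tilde in the path marks a personal home page.
        "personal_home_page": "~" in parts.path,
        # A query string indicates a dynamically generated page.
        "dynamic": bool(parts.query),
        "extension": posixpath.splitext(parts.path)[1].lower(),
        "tld": host.rsplit(".", 1)[-1] if "." in host else "",
        # Assumption: depth is the count of non-empty path segments.
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
    }
```

For instance, `url_features("http://www.example.edu:8080/~smith/cgi/search.shtml?q=dlib")` flags all three risk traits and reports the `.shtml` extension and the `edu` top-level domain.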

Thank You
Questions?
Slides and data files: