Preservation of Digital Objects and Collections

Preservation of Digital Objects and Collections Old Dominion University Department of Computer Science CS 791/891 Spring 2005 Michael L. Nelson <mln@cs.odu.edu> 1/20/05

“Digital information lasts forever -- or 5 years, whichever comes first” -- Jeff Rothenberg Do you still have a copy of your first email? Can you still compile and run the first program you ever wrote? If Hurricane Isabel had destroyed your computer, how much information would you have lost? http://www.ancientegypt.co.uk/writing/rosetta.html http://www.rosettaproject.org/

Archive vs. Library [Figure: persistence plotted against access, with labels for the optimum, archives, libraries, and general web pages]

Three Serials Arms, "Preservation of Scientific Serials: Three Current Examples", JEP, 5(2), 1999 ACM Digital Library http://www.acm.org/dl/ Internet RFCs http://www.rfc-editor.org/ http://www.rfc-editor.org/rfc-index.html note some of the missing RFCs! also, note the footers of some of the earliest RFCs D-Lib Magazine http://www.dlib.org/

Arms's Three Levels of Preservation
conservation: maintaining the “look and feel” (cf. D-Lib Magazine’s approach)
preservation of access: maintenance of services on the content, e.g.: search engines, author indexes, annotations, etc.
preservation of content: maintain “content” only, e.g.: maintain only the XML and not the stylesheets, transformations, etc.

Publishers as Archivists think really long-term: “Tomorrow we could see the National Library of Medicine abolished by Congress, Elsevier dismantled by a corporate raider, the Royal Society declared bankrupt, or the University of Michigan Press destroyed by a meteor. All are highly unlikely, but over a long period of time unlikely events will happen.” emphasis mine - MLN

How Long is Forever? Average human life span (from: http://www.che.uc.edu/acs/archives/cintacs/vol39no5/vol39no5.html) female: 78 male: 77 Average Fortune 500 company lifespan: (from: http://www.businessweek.com/chapter/degeus.htm) 40 - 50 years Universities? U.S. Government agency or institution? what about individual labs? NASA Zero Base Review U.S. Military BRAC

Partnerships With Publishers LOCKSS: Lots of Copies Keeps Stuff Safe http://lockss.stanford.edu/ Thomas Robertson will be a guest in March requirements: cooperative publishers cooperating libraries with significant individual and aggregate resources IPR resolution…

Acting Independently of Publishers “The Library of Congress could play a special role. A prime function of the Library of Congress is to collect the cultural and intellectual output of today for the benefit of future generations. No legal changes are needed for the library to extend its mission to collecting and preserving information that is created in digital formats.” this was in 1999; we now have http://www.digitalpreservation.gov/ “outsourcing examples” de jure: Theses -> UMI de facto: web pages -> Internet Archive (http://www.archive.org)

Measuring Availability Nelson & Allen, “Object Persistence and Availability in Digital Libraries", D-Lib Magazine, 8(1), 2002 http://www.dlib.org/dlib/january02/nelson/01nelson.html

Where to Measure Availability? HTML page? HTTP server? DL Service? Information Objects?

Previous Studies - HTML Pages / URLs “…estimates put the average lifetime for a URL at 44 days.” Brewster Kahle, Scientific American, 1997 http://www.hackvan.com/pub/stig/articles/trusted-systems/0397kahle.html “…appears that the half-life of a Web page is somewhat less than two years and the half-life of a Web site is somewhat more than two years.” Wallace Koehler, Information Research, 1999 http://informationr.net/ir/4-4/paper60.html see also JASIST 53(2), JASIS 50(2), and others

Previous Studies - DL Services Powell & French, DL 2000 http://www.cs.virginia.edu/~cyberia/papers/DL00.pdf (note: this was for the Dienst-based NCSTRL, not the OAI-PMH-based NCSTRL) see: Anan et al., JCDL 2002 http://www.cs.odu.edu/~mln/pubs/ncstrl-oai.pdf

Previous Studies - HTTP Servers measured latency (~ 500 ms) and measured uptime probability to be ~ 0.95 Viles & French, Computing Systems, 1995 not here: http://www.usenix.org/publications/computing/ not here: ftp://ftp.cs.virginia.edu/pub/techreports/CS-94-36.ps.Z cf. the problem as presented by Arms!

But What About the Information Objects? Access to the http server / DL service / web page is a necessary but not sufficient condition to actually getting “the stuff” Premise: items are put in a DL because they are more valuable than the “average” URL; they should be more available

Experiment
Select 20 different DLs by hand: try to get a good mix of subject-based, author-contributed, institutional repository, different architectures, etc. (see figure 1)
by fiat, declare that it is a DL if it “looks like a DL”
“randomly” (but still by hand) select 50 objects from each DL; only DLs with >= 50 objects were chosen
establish a baseline
harvest 3 times per week for > 1 year
record bytes received at each harvest
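The harvesting procedure on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the study: the `fetch` callable, the status labels, and the 50% size `tolerance` used to flag a changed object are all assumptions introduced here.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical fetcher type: returns bytes received for a URL, or None on failure.
Fetcher = Callable[[str], Optional[int]]


def establish_baseline(urls: Iterable[str], fetch: Fetcher) -> Dict[str, Optional[int]]:
    """Record the byte count of each object at the first harvest."""
    return {url: fetch(url) for url in urls}


def harvest(urls: Iterable[str], fetch: Fetcher,
            baseline: Dict[str, Optional[int]],
            tolerance: float = 0.5) -> Dict[str, str]:
    """Classify each object as 'ok', 'changed', or 'unavailable'.

    The tolerance is an assumed heuristic: a byte count that differs from
    the baseline by more than this fraction is flagged as 'changed'.
    """
    status = {}
    for url in urls:
        size = fetch(url)
        expected = baseline.get(url)
        if size is None or size == 0:
            status[url] = "unavailable"
        elif expected and abs(size - expected) > tolerance * expected:
            status[url] = "changed"
        else:
            status[url] = "ok"
    return status


def unavailable_fraction(status: Dict[str, str]) -> float:
    """Fraction of objects that could not be retrieved in one harvest."""
    if not status:
        return 0.0
    missing = sum(1 for s in status.values() if s == "unavailable")
    return missing / len(status)
```

In a real run, `fetch` would wrap an HTTP GET and the harvest would be scheduled three times per week; recording bytes received, rather than only the HTTP status, helps catch pages that return successfully but no longer contain the object.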

Results
Table 2, Figures 1-20: http://www.dlib.org/dlib/january02/nelson/01nelson.html
Results: 31 / 1000 objects unavailable (about 3%)
lots of additional analysis could be done here… see me if you’d like to pick this up as a project
3% corresponds with the study by Lawrence et al., IEEE Computer, 1999 http://www.neci.nec.com/~lawrence/papers/persistence-computer01/persistence-computer01.pdf
more recent study by Spinellis, CACM, 2003: “…after four years 40%-50% of the referenced URLs [in CACM and IEEE Computer articles] become inaccessible.” http://citeseer.nj.nec.com/spinellis03decay.html

Most Recent Study: URLs in D-Lib Magazine Study conducted by Sheffan Chan for her MS project, December 2004 not yet published Quick summary: check 5488 URLs extracted from 458 articles published in D-Lib Magazine

URL Loss URL half-life ~ 11 years from S. Chan’s MS project
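To make the half-life figures in these studies comparable, the sketch below converts between a half-life and a loss percentage. It assumes simple exponential decay of URLs, which is a modeling assumption made here for illustration and not a claim from the cited studies; the function names are hypothetical.

```python
import math


def surviving_fraction(years: float, half_life_years: float) -> float:
    """Fraction of URLs still reachable after `years`, assuming exponential decay."""
    return 0.5 ** (years / half_life_years)


def half_life_from_loss(years: float, fraction_lost: float) -> float:
    """Half-life implied by losing `fraction_lost` of URLs after `years`."""
    return years * math.log(2) / -math.log(1.0 - fraction_lost)
```

Under this assumption, a loss of 50% after four years implies a four-year half-life, and a loss of 40% after four years implies a half-life of roughly five and a half years, both noticeably shorter than the ~11 years reported for D-Lib Magazine URLs.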

Loss & URL Path Depth cf. “Cool URIs don’t Change” http://www.w3.org/Provider/Style/URI.html from S. Chan’s MS project
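One way to compute a path depth for a URL is sketched below. The definition used here (the number of non-empty path segments) is an assumption for illustration; the slide does not say how depth was defined in the MS project.

```python
from urllib.parse import urlsplit


def url_path_depth(url: str) -> int:
    """Count non-empty path segments, e.g. /a/b/c.html has depth 3."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])
```

For example, `url_path_depth("http://www.w3.org/Provider/Style/URI.html")` counts the segments `Provider`, `Style`, and `URI.html`, while a bare site root has depth 0.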

“Non-standard” URLs from S. Chan’s MS project

Open Archival Information System Reference model, not a system per se Goal: terminology and general framework to describe archive interactions Current specification: http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html More readable summaries http://www.oclc.org/research/publications/archive/2000/lavoie/ http://www.rlg.org/longterm/oais.html

OAIS Environment Sample Environment I: Archive: Planetary Data System (planetary science data sets) Management: National Aeronautics and Space Administration (NASA) Producers: NASA flight projects Designated Community: planetary science community Sample Environment II: Archive: Electronic and Special Media Records Services Division (U.S. federal records in formats designed for computer processing) Management: National Archives and Records Administration Producers: U.S. government agencies Designated Community: general public from: http://www.oclc.org/research/publications/archive/2000/lavoie/

OAIS Information Model if you remember nothing else about OAIS, remember SIP, AIP & DIP from: http://www.oclc.org/research/publications/archive/2000/lavoie/
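As a memory aid for the three package types (Submission, Archival, and Dissemination Information Packages), here is a toy sketch of the flow from producer to archive to consumer. The class fields, the `ingest` and `disseminate` functions, and the use of a SHA-256 checksum are all illustrative assumptions; OAIS is a reference model and does not prescribe any of these.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass
class SIP:
    """Submission Information Package: what a producer hands to the archive."""
    content: bytes
    producer: str


@dataclass
class AIP:
    """Archival Information Package: what the archive stores long term."""
    identifier: str
    content: bytes
    preservation_metadata: Dict[str, str]


@dataclass
class DIP:
    """Dissemination Information Package: what a consumer receives."""
    identifier: str
    content: bytes


def ingest(sip: SIP, identifier: str) -> AIP:
    """Turn a SIP into an AIP, adding assumed provenance and fixity metadata."""
    metadata = {
        "producer": sip.producer,
        "sha256": hashlib.sha256(sip.content).hexdigest(),
    }
    return AIP(identifier=identifier, content=sip.content,
               preservation_metadata=metadata)


def disseminate(aip: AIP) -> DIP:
    """Derive a DIP from an AIP after verifying the stored checksum."""
    digest = hashlib.sha256(aip.content).hexdigest()
    if digest != aip.preservation_metadata["sha256"]:
        raise ValueError("fixity check failed for " + aip.identifier)
    return DIP(identifier=aip.identifier, content=aip.content)
```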

OAIS Functional Model from: http://www.oclc.org/research/publications/archive/2000/lavoie/

OAI vs. OAIS
Hirtle editorial: OAI and OAIS: What's in a Name? http://www.dlib.org/dlib/april01/04editorial.html
Nelson, letter to the editor: http://www.dlib.org/dlib/may01/05letters.html
Open Archives Initiative: The focus is on "openness", through exposing and harvesting metadata through a simple, explicitly defined protocol. Note that metadata harvesting is the only model explicitly addressed.
Open Archival Information System: The focus is on "archival-ness" (apologies to William Safire, again) by thoroughly defining the framework, models, and terms needed to discuss long-term preservation of information. Note that protocols are not defined.
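To illustrate what "a simple, explicitly defined protocol" looks like in practice, the sketch below builds an OAI-PMH `ListRecords` request URL. The verb and the `metadataPrefix`, `set`, `from`, and `resumptionToken` arguments come from the OAI-PMH specification; the helper function and the base URL in the usage note are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None,
                     from_date: Optional[str] = None,
                     resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository base URL."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: it replaces all others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
        if from_date:
            params["from"] = from_date
    return base_url + "?" + urlencode(params)
```

A harvester would call `list_records_url("http://example.org/oai")` (an assumed base URL), fetch the XML response, and repeat the request with any returned resumption token until the list is complete.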

OAI + COs + OAIS = Preservation OAI and OAIS? http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html