Preservation of Digital Objects and Collections Old Dominion University Department of Computer Science CS 791/891 Spring 2005 Michael L. Nelson <mln@cs.odu.edu> 1/20/05
“Digital information lasts forever -- or 5 years, whichever comes first” -- Jeff Rothenberg Do you still have a copy of your first email? Can you still compile and run the first program you ever wrote? If Hurricane Isabel had destroyed your computer, how much information would you have lost? http://www.ancientegypt.co.uk/writing/rosetta.html http://www.rosettaproject.org/
Archive vs. Library [Figure: plot with axes “persistence” and “access”; labeled points: optimum, archives, libraries, general web pages]
Three Serials Arms, "Preservation of Scientific Serials: Three Current Examples", JEP, 5(2), 1999 ACM Digital Library http://www.acm.org/dl/ Internet RFCs http://www.rfc-editor.org/ http://www.rfc-editor.org/rfc-index.html note some of the missing RFCs! also, note the footers of some of the earliest RFCs D-Lib Magazine http://www.dlib.org/
Arms’ Three Levels of Preservation conservation: maintaining the “look and feel” (cf. D-Lib Magazine’s approach); preservation of access: maintenance of services on the content, e.g., search engines, author indexes, annotations, etc.; preservation of content: maintain “content” only, e.g., maintain only the XML and not the stylesheets, transformations, etc.
Publishers as Archivists think really long-term: “Tomorrow we could see the National Library of Medicine abolished by Congress, Elsevier dismantled by a corporate raider, the Royal Society declared bankrupt, or the University of Michigan Press destroyed by a meteor. All are highly unlikely, but over a long period of time unlikely events will happen.” emphasis mine - MLN
How Long is Forever? Average human life span (from: http://www.che.uc.edu/acs/archives/cintacs/vol39no5/vol39no5.html) female: 78 male: 77 Average Fortune 500 company lifespan: (from: http://www.businessweek.com/chapter/degeus.htm) 40 - 50 years Universities? U.S. Government agency or institution? what about individual labs? NASA Zero Base Review U.S. Military BRAC
Partnerships With Publishers LOCKSS: Lots of Copies Keeps Stuff Safe http://lockss.stanford.edu/ Thomas Robertson will be a guest in March requirements: cooperative publishers cooperating libraries with significant individual and aggregate resources IPR resolution…
Acting Independently of Publishers “The Library of Congress could play a special role. A prime function of the Library of Congress is to collect the cultural and intellectual output of today for the benefit of future generations. No legal changes are needed for the library to extend its mission to collecting and preserving information that is created in digital formats.” this was in 1999; we now have http://www.digitalpreservation.gov/ “outsourcing examples” de jure: Theses -> UMI de facto: web pages -> Internet Archive (http://www.archive.org)
Measuring Availability Nelson & Allen, “Object Persistence and Availability in Digital Libraries”, D-Lib Magazine, 8(1), 2002 http://www.dlib.org/dlib/january02/nelson/01nelson.html
Where to Measure Availability? HTML page? HTTP server? DL Service? Information Objects?
Previous Studies - HTML Pages / URLs “…estimates put the average lifetime for a URL at 44 days.” Brewster Kahle, Scientific American, 1997 http://www.hackvan.com/pub/stig/articles/trusted-systems/0397kahle.html “…appears that the half-life of a Web page is somewhat less than two years and the half-life of a Web site is somewhat more than two years.” Wallace Koehler, Information Research, 1999 http://informationr.net/ir/4-4/paper60.html see also JASIST 53(2), JASIS 50(2), and others
Previous Studies - DL Services Powell & French, DL 2000 http://www.cs.virginia.edu/~cyberia/papers/DL00.pdf (note: this was for the Dienst-based NCSTRL, not the OAI-PMH-based NCSTRL) see: Anan et al., JCDL 2002 http://www.cs.odu.edu/~mln/pubs/ncstrl-oai.pdf
Previous Studies - HTTP Servers measured latency (~ 500 ms) and measured uptime probability to be ~ 0.95 Viles & French, Computing Systems, 1995 not here: http://www.usenix.org/publications/computing/ not here: ftp://ftp.cs.virginia.edu/pub/techreports/CS-94-36.ps.Z cf. the problem as presented by Arms!
But What About the Information Objects? Access to the http server / DL service / web page is a necessary but not sufficient condition to actually getting “the stuff” Premise: items are put in a DL because they are more valuable than the “average” URL; they should be more available
Experiment: select 20 different DLs by hand, trying to get a good mix between subject-based, author-contributed, institutional repository, different architectures, etc. (see figure 1); by fiat, declare that it is a DL if it “looks like a DL”; “randomly” (but still by hand) select 50 objects from each DL (only DLs with >= 50 objects were chosen); establish a baseline; harvest 3 times per week for > 1 year; record bytes received at each harvest
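The harvesting procedure on this slide can be sketched in a few lines of Python. This is only an illustration, not the script used in the study: the function names (`fetch_size`, `harvest_once`, `unavailable`), the record layout, and the rule that a fetch error or zero bytes counts as "unavailable" are all assumptions. The fetcher is passed in as a parameter so the sketch can be exercised without network access.

```python
from typing import Callable, Dict, Iterable, List
from urllib.request import urlopen


def fetch_size(url: str, timeout: float = 30.0) -> int:
    """Return the number of bytes received for url (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return len(response.read())


def harvest_once(urls: Iterable[str],
                 fetch: Callable[[str], int] = fetch_size) -> Dict[str, int]:
    """One harvest pass: record bytes received per URL, 0 on any failure."""
    sizes: Dict[str, int] = {}
    for url in urls:
        try:
            sizes[url] = fetch(url)
        except Exception:
            # Assumption: any error is recorded as zero bytes received.
            sizes[url] = 0
    return sizes


def unavailable(history: List[Dict[str, int]]) -> List[str]:
    """URLs that returned bytes in the baseline (first) harvest
    but none in the most recent harvest."""
    baseline, latest = history[0], history[-1]
    return sorted(url for url, size in baseline.items()
                  if size > 0 and latest.get(url, 0) == 0)
```

In practice `harvest_once` would be run on a schedule (three times per week in the experiment) and each result appended to `history`; comparing byte counts against the baseline is one simple way to notice that an object has changed or disappeared.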
Results Table 2, Figures 1-20: http://www.dlib.org/dlib/january02/nelson/01nelson.html Results: 31 / 1000 objects unavailable; lots of additional analysis could be done here… see me if you’d like to pick this up as a project. 3% corresponds with the study by Lawrence et al., IEEE Computer, 2001 http://www.neci.nec.com/~lawrence/papers/persistence-computer01/persistence-computer01.pdf more recent study by Spinellis, CACM, 2003: “…after four years 40%-50% of the referenced URLs [in CACM and IEEE Computer articles] become inaccessible.” http://citeseer.nj.nec.com/spinellis03decay.html
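To compare loss figures reported over different time spans, one option is to convert them to a half-life. The sketch below assumes simple exponential decay (a constant loss rate), which none of the cited studies claims, and `half_life` is a hypothetical helper name. Under that assumption, the Spinellis figure of 40%-50% of URLs lost after four years corresponds to a half-life of roughly 4 to 5.4 years.

```python
import math


def half_life(fraction_lost: float, elapsed_years: float) -> float:
    """Half-life in years, assuming exponential decay.

    fraction_lost is the share of objects no longer available after
    elapsed_years, e.g. 0.4 for 40%.
    """
    if not 0.0 < fraction_lost < 1.0:
        raise ValueError("fraction_lost must be strictly between 0 and 1")
    surviving = 1.0 - fraction_lost
    # surviving = 0.5 ** (elapsed_years / half_life), solved for half_life
    return elapsed_years * math.log(2) / -math.log(surviving)


if __name__ == "__main__":
    for lost in (0.4, 0.5):
        print(f"{lost:.0%} lost after 4 years: "
              f"half-life of about {half_life(lost, 4):.1f} years")
```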
Most Recent Study: URLs in D-Lib Magazine Study conducted by Sheffan Chan for her MS project, December 2004 not yet published Quick summary: check 5488 URLs extracted from 458 articles published in D-Lib Magazine
URL Loss URL half-life ~ 11 years from S. Chan’s MS project
Loss & URL Path Depth cf. “Cool URIs don’t Change” http://www.w3.org/Provider/Style/URI.html from S. Chan’s MS project
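One way to compute a path depth for a URL is sketched below. The slide does not say how depth was defined in the project, so the definition used here (the number of non-empty path segments, ignoring query string and fragment) is an assumption, as is the helper name `path_depth`.

```python
from urllib.parse import urlsplit


def path_depth(url: str) -> int:
    """Count the non-empty segments in the path component of url."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


if __name__ == "__main__":
    for url in ("http://www.dlib.org/",
                "http://www.w3.org/Provider/Style/URI.html",
                "http://www.dlib.org/dlib/january02/nelson/01nelson.html"):
        print(path_depth(url), url)
```

With this definition a site root has depth 0, and each additional directory or file name adds one.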
“Non-standard” URLs from S. Chan’s MS project
Open Archival Information System Reference model, not a system per se Goal: terminology and general framework to describe archive interactions Current specification: http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html More readable summaries http://www.oclc.org/research/publications/archive/2000/lavoie/ http://www.rlg.org/longterm/oais.html
OAIS Environment Sample Environment I: Archive: Planetary Data System (planetary science data sets) Management: National Aeronautics and Space Administration (NASA) Producers: NASA flight projects Designated Community: planetary science community Sample Environment II: Archive: Electronic and Special Media Records Services Division (U.S. federal records in formats designed for computer processing) Management: National Archives and Records Administration Producers: U.S. government agencies Designated Community: general public from: http://www.oclc.org/research/publications/archive/2000/lavoie/
OAIS Information Model if you remember nothing else about OAIS, remember SIP, AIP & DIP from: http://www.oclc.org/research/publications/archive/2000/lavoie/
OAIS Functional Model from: http://www.oclc.org/research/publications/archive/2000/lavoie/
OAI vs. OAIS Hirtle editorial: OAI and OAIS: What's in a Name? http://www.dlib.org/dlib/april01/04editorial.html Nelson, letter to the editor: http://www.dlib.org/dlib/may01/05letters.html Open Archives Initiative: The focus is on "openness", through exposing and harvesting metadata through a simple, explicitly defined protocol. Note that metadata harvesting is the only model explicitly addressed. Open Archival Information System: The focus is on "archival-ness" (apologies to William Safire, again) by thoroughly defining the framework, models, and terms needed to discuss long-term preservation of information. Note that protocols are not defined.
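To make the "simple, explicitly defined protocol" side of the contrast concrete, the sketch below builds an OAI-PMH `ListRecords` harvesting request. The verb and the `metadataPrefix`, `set`, and `from` arguments come from the OAI-PMH specification; the helper name `list_records_url` and the base URL `http://example.org/oai` are placeholders for illustration, not a real repository. OAIS has no counterpart to show, since it defines no protocol.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None,
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # optional selective harvesting by set
    if from_date:
        params["from"] = from_date    # optional lower bound on datestamp
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical repository; substitute a real OAI-PMH base URL.
    print(list_records_url("http://example.org/oai", from_date="2005-01-20"))
```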
OAI and OAIS? OAI + COs + OAIS = Preservation http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html