Preservation of Digital Objects and Collections

1 Preservation of Digital Objects and Collections
Old Dominion University Department of Computer Science CS 791/891 Spring 2005 Michael L. Nelson 1/20/05

2 “Digital information lasts forever --
or 5 years, whichever comes first” -- Jeff Rothenberg Do you still have a copy of your first ? Can you still compile and run the first program you ever wrote? If Hurricane Isabel had destroyed your computer, how much information would you have lost?

3 Archive vs. Library
Figure: persistence plotted against access, with regions labeled optimum, archives, libraries, and general web pages

4 Three Serials
Arms, "Preservation of Scientific Serials: Three Current Examples", JEP, 5(2), 1999 ACM Digital Library; Internet RFCs (note some of the missing RFCs! also, note the footers of some of the earliest RFCs); D-Lib Magazine

5 Arm’s Three Levels of Preservation
conservation maintaining the “look and feel” cf. D-Lib Magazine’s approach preservation of access maintenance of services on the content e.g.: search engines, author indexes, annotations, etc. preservation of content maintain “content” only e.g.: maintain only the XML and not the stylesheets, transformations, etc.

6 Publishers as Archivists
think really long-term: “Tomorrow we could see the National Library of Medicine abolished by Congress, Elsevier dismantled by a corporate raider, the Royal Society declared bankrupt, or the University of Michigan Press destroyed by a meteor. All are highly unlikely, but over a long period of time unlikely events will happen.” emphasis mine - MLN

7 How Long is Forever? Average human life span (from: female: 78 male: 77 Average Fortune 500 company lifespan: (from: years Universities? U.S. Government agency or institution? what about individual labs? NASA Zero Base Review U.S. Military BRAC

8 Partnerships With Publishers
LOCKSS: Lots of Copies Keeps Stuff Safe Thomas Robertson will be a guest in March requirements: cooperative publishers cooperating libraries with significant individual and aggregate resources IPR resolution…

9 Acting Independently of Publishers
“The Library of Congress could play a special role. A prime function of the Library of Congress is to collect the cultural and intellectual output of today for the benefit of future generations. No legal changes are needed for the library to extend its mission to collecting and preserving information that is created in digital formats.” this was in 1999; we now have “outsourcing examples” de jure: Theses -> UMI de facto: web pages -> Internet Archive

10 Measuring Availability
Nelson & Allen, “Object Persistence and Availability in Digital Libraries”, D-Lib Magazine, 8(1), 2002

11 Where to Measure Availability?
HTML page? HTTP server? DL Service? Information Objects?

12 Previous Studies - HTML Pages / URLs
“…estimates put the average lifetime for a URL at 44 days.” Brewster Kahle, Scientific American, 1997 “…appears that the half-life of a Web page is somewhat less than two years and the half-life of a Web site is somewhat more than two years.” Wallace Koehler, Information Research, 1999 see also JASIST 53(2), JASIS 50(2), and others

13 Previous Studies - DL Services
Powell & French, DL 2000 (note: this was for the Dienst-based NCSTRL, not the OAI-PMH-based NCSTRL) see: Anan et al., JCDL 2002

14 Previous Studies - HTTP Servers
measured latency (~ 500 ms) and measured uptime probability to be ~ 0.95 Viles & French, Computing Systems, 1995 not here: not here: ftp://ftp.cs.virginia.edu/pub/techreports/CS ps.Z cf. the problem as presented by Arms!

15 But What About the Information Objects?
Access to the http server / DL service / web page is a necessary but not sufficient condition to actually getting “the stuff” Premise: items are put in a DL because they are more valuable than the “average” URL; they should be more available

16 Experiment Select 20 different DLs by hand
try to get a good mix between subject-based, author contributed, institution repository, different architectures, etc. (see figure 1) by fiat declare that it is a DL if it “looks like a DL” “randomly” (but still by hand) select 50 objects from the DL only DLs with >= 50 objects were chosen establish a baseline harvest 3 times per week for > 1 year record bytes recvd at each harvest
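The harvest-and-compare procedure on this slide can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the function names `fetch_size`, `harvest`, and `unavailable` are hypothetical, and treating "zero bytes received" as "unavailable" is an assumption, since the slides do not state the exact criterion.

```python
import urllib.request


def fetch_size(url, timeout=30.0):
    """Return the number of bytes received for url, or 0 on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return len(resp.read())
    except Exception:
        # Assumption: any error (DNS, HTTP status, timeout) counts as 0 bytes.
        return 0


def harvest(urls, fetch=fetch_size):
    """Run one harvest pass and map each URL to the bytes received."""
    return {url: fetch(url) for url in urls}


def unavailable(baseline, latest):
    """List objects that returned bytes at baseline but none in the latest pass."""
    return sorted(
        url for url, size in baseline.items()
        if size > 0 and latest.get(url, 0) == 0
    )
```

Scheduling (3 times per week for more than a year) is left to an external tool such as cron; the `fetch` parameter exists only so the comparison logic can be exercised without network access.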

17 Results
Table 2, Figures 1-20: 31 / 1000 objects unavailable lots of additional analysis could be done here… see me if you’d like to pick this up as a project 3% corresponds with the study by Lawrence et al., IEEE Computer, 1999 persistence-computer01.pdf more recent study by Spinellis, CACM, 2003: “…after four years 40%-50% of the referenced URLs [in CACM and IEEE Computer articles] become inaccessible.”

18 Most Recent Study: URLs in D-Lib Magazine
Study conducted by Sheffan Chan for her MS project, December 2004 not yet published Quick summary: check 5488 URLs extracted from 458 articles published in D-Lib Magazine

19 URL Loss URL half-life ~ 11 years from S. Chan’s MS project
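To compare the half-life figures quoted in these slides with loss rates such as Spinellis's "40%-50% after four years", one can convert between the two. The sketch below assumes URL loss follows simple exponential decay, which is an assumption made here for illustration and not a claim from any of the cited studies; the function names are hypothetical.

```python
import math


def half_life(years, fraction_lost):
    """Half-life implied by losing fraction_lost of URLs after the given years.

    Assumes exponential decay: surviving fraction = 0.5 ** (t / half_life).
    """
    if not 0.0 < fraction_lost < 1.0:
        raise ValueError("fraction_lost must be strictly between 0 and 1")
    return years * math.log(2) / -math.log(1.0 - fraction_lost)


def fraction_lost(years, half_life_years):
    """Fraction of URLs expected to be lost after the given years."""
    return 1.0 - 0.5 ** (years / half_life_years)
```

Under that assumption, 40%-50% loss after four years corresponds to a half-life of roughly 5.4 down to 4 years, while a half-life of about 11 years corresponds to roughly 22% loss after four years.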

20 Loss & URL Path Depth cf. “Cool URIs don’t Change” from S. Chan’s MS project

21 “Non-standard” URLs from S. Chan’s MS project

22 Open Archival Information System
Reference model, not a system per se Goal: terminology and general framework to describe archive interactions Current specification: More readable summaries

23 OAIS Environment Sample Environment I:
Archive: Planetary Data System (planetary science data sets) Management: National Aeronautics and Space Administration (NASA) Producers: NASA flight projects Designated Community: planetary science community Sample Environment II: Archive: Electronic and Special Media Records Services Division (U.S. federal records in formats designed for computer processing) Management: National Archives and Records Administration Producers: U.S. government agencies Designated Community: general public from:

24 OAIS Information Model
if you remember nothing else about OAIS, remember SIP, AIP & DIP (Submission, Archival, and Dissemination Information Packages) from:

25 OAIS Functional Model from:

26 OAI vs. OAIS Hirtle editorial: OAI and OAIS: What's in a Name?
Nelson, letter to the editor: Open Archives Initiative: The focus is on "openness", through exposing and harvesting metadata through a simple, explicitly defined protocol. Note that metadata harvesting is the only model explicitly addressed Open Archival Information System: The focus is on "archival-ness" (apologies to William Safire, again) by thoroughly defining the framework, models, and terms needed to discuss long-term preservation of information. Note that protocols are not defined.
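As a concrete illustration of the "simple, explicitly defined protocol" on the OAI side, the sketch below builds a metadata-harvesting request URL for OAI-PMH (the protocol named on slide 13). `ListRecords`, `metadataPrefix`, `set`, `from`, and the `oai_dc` prefix are part of OAI-PMH; the helper name `list_records_url` and the base URL in the usage note are hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for a repository base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # optional: restrict to one set
    if from_date:
        params["from"] = from_date    # optional: selective harvesting by datestamp
    return base_url + "?" + urlencode(params)
```

For example, `list_records_url("http://example.org/oai", from_date="2005-01-20")` asks a (made-up) repository for Dublin Core records changed since that date. OAIS, by contrast, defines no request of this kind at all.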

27 OAI + COs + OAIS = Preservation
OAI and OAIS?

