Digital Archiving: A FEDORA-Based Infrastructure for Preserving Electronic Journals
LACUNY Institute 2005
Scholarly Publishing and Open Access: Payers and Players
May 20, 2005
Ronald C. Jantz
Rutgers University Libraries
LACUNY 2005, R. Jantz - 5/20/2005
Some Questions to Think About
- What is the oldest digital object you know of?
- How can you tell if object A and object B are the same?
- How do you know if a digital object has been changed? What is the nature of the change?
Should We Be Concerned? (About Our Ability to Do Digital Preservation)
- The Clinton Administration produced approximately 90 million email messages.
- During the Iran-Contra scandal, John Poindexter and Oliver North erased 5,000 email messages.
  Chronicle of Higher Education, Jan. 30, 2004.
- “The patent office, home to nearly 6.5 million patents dating to 1790, is converting to an electronic database and discarding a significant portion of its paper files after they have been scanned and digitized.”
  Mitchell, A. (2001). Ingenuity’s Blueprints, Into History’s Dustbin. NY Times, December 30, 2001, p. A1.
- The Nazis destroyed 100 million books in the years from 1933 to 1945.
Digital Library Repository Initiative (Rutgers University Libraries)
Objectives:
- To provide seamless, perpetual access to digital collections -- our resources and the resources of others.
- To develop a flexible framework of “core” capabilities providing the enabling infrastructure, interoperability, and sustainability.
Digital Preservation and Archiving
Institutional Requirements
- Institutional clarity about what to preserve
- Very large mass storage systems, scaling to millions of objects
- Flexibility to handle many digital formats (digital object architecture)
- Integration of key technologies
- Well-defined preservation metadata and processes
- Sustainability – content, technology, financial
Digital Preservation (A definition from Research Libraries Group)
Digital preservation is defined as the managed activities necessary for ensuring:
1. The long-term maintenance of a byte stream (including metadata) sufficient to reproduce a suitable facsimile of the document, and
2. Continued accessibility of the contents through time and changing technology.
Why Would You Digitally Preserve?
- Preserve material that exists in electronic form only
- Protect original artifact by using a surrogate
- Provide surrogate if original artifact is destroyed
Digital Preservation Involves Both Process and Technology
[Figure: flowchart. Recoverable box labels: Decision To Digitally Preserve (Yes / No); Creation of The Digital Object; Ingest, Store, Access; Life Cycle Management. Step labels: D1.0, D2.0, D3.0.]
Migration (transferring digital materials from one medium or format to another) is the only workable life cycle approach.
Digital Library Concepts
- Digital Library Repository (DLR): The repository is designed and managed to contain and provide access to digital resources created by an institution. Repositories can provide both access and preservation.
- Digital Object: The digital object is the basic unit of management and digital preservation, consisting of a persistent identifier, metadata, and associated byte streams. An object can represent a book, map, e-journal article, photograph, numeric data, etc.
The Fedora* Infrastructure
The Infrastructure (from Fedora)
- An extensible digital object model
- APIs for developing new applications
- Scalable, persistent storage for content and metadata
- Content versioning and audit trails
- Metadata harvesting
Development and Integration (by RUL)
- Design of the digital object architecture
- Integration of key technologies and standards
- Development of applications
*Flexible Extensible Digital Object Repository Architecture
RUL Digital Repository Architecture
[Figure: architecture diagram. Labeled components: External Applications (Browse, Search, Export); “Native” Applications (Browse, Search, Admin functions); Internet; Internet Server; Digital Object Repository (Fedora); Server functions; DB access; METS-XML; Export; Ingest; Export (OAI, MARC, etc.); Local Database; Objects.]
Digital Projects at Rutgers University Libraries
External (to Fedora) Applications
- Electronic Journals (journals published by RUL)
- The Eagleton Poll Archive: http://www.scc.rutgers.edu/eagleton
- The NJ Environmental Digital Library: http://njedl.rutgers.edu
- CETH projects (Roman coins, 18th century journals, classic texts)
Native (Fedora) Projects
- The NJ Digital Highway: http://www.njdigitalhighway.org
- Jazz Oral Histories (digital sound)
E-Journals at RUL
Why are we undertaking this new role?
- To support new, open models for the dissemination of scholarship.
- Journal publishing complements the Libraries' key role in supporting scholarship within the academy.
- Libraries have a traditional role in the preservation of scholarly materials.
The E-Journal Platform at RUL
- Based on the Open Journal Systems (OJS) from the Public Knowledge Project.
- Digital preservation based on the integration of OJS, Fedora, and special processes and technologies.
- All journals are freely accessible.
Available at: http://pcsp.libraries.rutgers.edu
Available at: http://ejbe.libraries.rutgers.edu
Available at: http://rulj.libraries.rutgers.edu
Digital Object Example (An E-journal Article)
Article Object
- Repository ID
- Metadata: Descriptive; Administrative (Technical, Source, Rights, Digital Provenance)
- Disseminators
- Datastreams:
  - SMAP1: Structure Map
  - DS1: article (djvu)
  - DS2: article (pdf)
  - ARCH1: Manuscript as Submitted
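The article object on this slide can be pictured as a small data structure. The following is a minimal sketch only, not Fedora's actual object model: the class names, the `get` helper, the placeholder PID, and the MIME types are all assumptions introduced for illustration. Only the datastream IDs and the metadata section names come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Datastream:
    """One byte stream held by a digital object (hypothetical structure)."""
    ds_id: str        # ID as shown on the slide, e.g. "DS2"
    label: str
    mime_type: str    # assumed; the slide names formats only
    content: bytes = b""


@dataclass
class DigitalObject:
    """Persistent identifier + metadata + associated byte streams."""
    pid: str
    metadata: Dict[str, Dict[str, str]] = field(default_factory=dict)
    datastreams: List[Datastream] = field(default_factory=list)

    def get(self, ds_id: str) -> Datastream:
        """Return the datastream with the given ID, or raise KeyError."""
        for ds in self.datastreams:
            if ds.ds_id == ds_id:
                return ds
        raise KeyError(ds_id)


# Metadata sections named on the slide; each is left empty here.
SECTIONS = ("descriptive", "technical", "source", "rights", "digital_provenance")

article = DigitalObject(
    pid="example:article-1",  # placeholder, not a real repository ID
    metadata={name: {} for name in SECTIONS},
    datastreams=[
        Datastream("SMAP1", "Structure map", "text/xml"),
        Datastream("DS1", "Article (DjVu)", "image/vnd.djvu"),
        Datastream("DS2", "Article (PDF)", "application/pdf"),
        Datastream("ARCH1", "Manuscript as submitted", "application/octet-stream"),
    ],
)

print([ds.ds_id for ds in article.datastreams])
```

Keeping the presentation copies (DS1, DS2) and the archival master (ARCH1) as separate datastreams of one object means they share a single identifier and a single set of metadata.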
Important Technologies, Processes, and Standards
- Persistent identifiers
- Digital Signatures (based on SHA-1)
- Audit Trails
- Versioning
- Digital Certificates
- Pipelines (to automate sequential processes)
- Preservation Metadata (based on National Library of Australia approach)
- METS (Metadata Encoding and Transmission Standard)
- OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)
- Open source – Linux, Apache, Fedora, Amberfish (search engine)
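Of the standards listed, OAI-PMH is the one an outside harvester touches directly: a request is an HTTP URL carrying a `verb` and its arguments. The sketch below builds a `ListRecords` request. The base URL is a made-up placeholder, not the address of the Rutgers repository, and the function name is hypothetical; `verb`, `metadataPrefix`, `set`, and `from` are arguments defined by the protocol, and `oai_dc` is the metadata format every OAI-PMH repository must support.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None,
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (helper name is illustrative)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # restrict harvest to one set
    if from_date:
        params["from"] = from_date    # selective harvesting by datestamp
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, assumed for illustration only.
BASE = "http://repository.example.org/oai"
print(list_records_url(BASE))
print(list_records_url(BASE, from_date="2005-05-20"))
```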
Persistent Identifier (PID)
Why is the PID important?
- An essential technology to preserve “referential integrity”.
- Approximately 41% of the URLs referenced in Computer and CACM journals in the period 1995-1999 were inaccessible in 2002 (Spinellis, 2003).
What is it?
- An identifier that is technology and protocol independent and is mapped to a URL.
- The handle for a PCSP issue is 1782.1/pcsp1.1.47
- URL access: http://hdl.rutgers.edu/1782.1/pcsp1.1.47
CNRI Handle System (http://www.handle.net)
- For assigning, managing and resolving persistent identifiers
- Managed by the Corporation for National Research Initiatives
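The mapping from handle to URL shown on this slide is mechanical: the handle is appended to the address of a resolver. A minimal sketch follows, using the handle and resolver address from the slide. The function name and the validation rule (a non-empty prefix and suffix separated by a slash) are assumptions made for illustration.

```python
from urllib.parse import quote


def handle_to_url(handle: str, resolver: str = "http://hdl.rutgers.edu") -> str:
    """Turn a handle such as '1782.1/pcsp1.1.47' into a resolvable URL."""
    prefix, sep, suffix = handle.partition("/")
    if not (sep and prefix and suffix):
        raise ValueError(f"not a prefix/suffix handle: {handle!r}")
    # Percent-encode each part so unusual characters survive in a URL.
    return f"{resolver.rstrip('/')}/{quote(prefix)}/{quote(suffix)}"


print(handle_to_url("1782.1/pcsp1.1.47"))
```

Because citations carry the handle rather than a server path, the content can move to a new host and only the resolver's mapping has to change.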
Digital Signatures
Objective: to detect and report unauthorized changes in an object
Signature Process
- SHA-1 signatures for both object and archival master
- Created automatically and inserted into metadata
- Verified periodically
- Failures reported through Alerting Services
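The create-then-verify cycle can be sketched with Python's standard `hashlib` module. This is an illustration under stated assumptions, not the repository's implementation: the function names, the sample bytes, and the `metadata` dictionary are hypothetical. Note also that a bare SHA-1 digest only detects change; on its own it does not identify who made the change.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of a byte stream as a hex string."""
    return hashlib.sha1(data).hexdigest()


def is_unchanged(data: bytes, recorded_digest: str) -> bool:
    """Recompute the digest and compare it with the one stored in metadata."""
    return sha1_hex(data) == recorded_digest


# At ingest: compute the digest and record it in the object's metadata.
archival_master = b"Manuscript as submitted"   # stand-in for real content
metadata = {"sha1": sha1_hex(archival_master)}

# Later, during a periodic check: any altered byte changes the digest.
print(is_unchanged(archival_master, metadata["sha1"]))
print(is_unchanged(archival_master + b" (edited)", metadata["sha1"]))
```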
The E-Journal Preservation Process
- All articles in digital object form are exported to the Digital Repository (Fedora)
- Signatures and PIDs computed automatically
- Signatures verified automatically; failures reported via Repository alerting services
- External application (website) periodically captured and exported automatically to the Repository
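The steps above chain naturally into a pipeline: each stage takes a record and returns an enriched copy, and a separate audit pass re-verifies what was stored. The sketch below is illustrative only. The stage names, the record layout, the in-memory `repository` dictionary, and the `example/` PID scheme are all assumptions; a real deployment would call the repository's own ingest and alerting services.

```python
import hashlib
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Stage = Callable[[Record], Record]


def assign_pid(record: Record) -> Record:
    """Attach a persistent identifier (hypothetical naming scheme)."""
    return {**record, "pid": f"example/{record['local_id']}"}


def sign(record: Record) -> Record:
    """Record the SHA-1 digest of the content alongside it."""
    return {**record, "sha1": hashlib.sha1(record["content"]).hexdigest()}


def run_pipeline(record: Record, stages: List[Stage]) -> Record:
    """Apply each stage in order, passing the result to the next."""
    for stage in stages:
        record = stage(record)
    return record


def audit(repository: Dict[str, Record]) -> List[str]:
    """Return the PIDs whose content no longer matches the stored digest."""
    return [pid for pid, rec in repository.items()
            if hashlib.sha1(rec["content"]).hexdigest() != rec["sha1"]]


repository: Dict[str, Record] = {}
rec = run_pipeline({"local_id": "article-1", "content": b"article text"},
                   [assign_pid, sign])
repository[rec["pid"]] = rec

print(audit(repository))                       # nothing to report yet
repository[rec["pid"]]["content"] = b"tampered"
print(audit(repository))                       # the altered object is flagged
```

The list returned by `audit` is what an alerting service would act on.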
Issues and Questions
- We need “persistent” organizations
- The service model for e-journals within the Library
- The cost/benefit model
- Research on earlier questions
- Sustainability – content, technology, financial
- There are many skeptics, e.g. Cullen (2000) asks rhetorically, “How confident can we be when an object whose authentication is crucial depends on electricity for its existence?”