Digital Preservation Dale Flecker Stephen Abrams February 15, 2007 HUL University Library Council
Agenda I The problem II What has Harvard been doing? III What more do we need to do?
I The problem …
… is twofold Keeping the bits Keeping the bits useful
Keeping the bits Digital things are amazingly easy to destroy! –Bad guys want to do damage –Hardware/software fails –People make mistakes The slip of a finger, or an unnoticed consequence of change, happens easily and is potentially catastrophic
Destruction is not always apparent Data not used regularly is always at risk of unintended and unnoticed damage. (Note that archival copies can be pretty invisible…)
Keeping bits useful Digital materials are fragile!!! They depend on technologies for their vitality… and those technologies age and disappear rapidly.
Fragility Using digital content requires mediation by hardware and software Hardware and software must understand the format of the content Hardware and software technology change continually …
Fragility Old technology will break New technology frequently does not understand old formats
II What has Harvard been doing? Internally …
Digital Repository Service (DRS) Secure, professionally managed environment –Manage data rigorously, with discipline, and in accordance with community best practices Redundant, heterogeneous, distributed storage with periodic media migration …
Digital Repository Service (DRS) Know what data you have –What are the logical objects (“works”, not files)? –What are the technical characteristics of those objects? Check the data continuously Manage access to stored objects
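One common way to "check the data continuously" is a fixity audit: store a digest for every file at ingest, then periodically recompute and compare. The sketch below is only an illustration of that idea, not a description of how DRS does it; the function names, the SHA-256 choice, and the manifest layout (relative path to hex digest) are all assumptions.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest: Dict[str, str], root: Path) -> Dict[str, str]:
    """Compare stored digests against the files currently on disk.

    `manifest` maps a path relative to `root` to the digest recorded at
    ingest (a hypothetical layout). The result maps each path to
    "ok", "missing", or "changed".
    """
    report = {}
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file():
            report[relative_path] = "missing"
        elif sha256_of(target) != expected:
            report[relative_path] = "changed"
        else:
            report[relative_path] = "ok"
    return report
```

Running such an audit on a schedule is what turns silent damage to rarely used archival copies into a visible report entry.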
Format Understanding formats is fundamental to preservation ffd8ffe000104a ffed0fb050686f74 6f73686f e d 03e90a e e666f f40240ffeeffee fc d f d03ed0a f6c f 6e a
Format Understanding formats is fundamental to preservation ffd8ffe000104a ffed0fb050686f74 6f73686f e d 03e90a e e666f f40240ffeeffee fc d f d03ed0a f6c f 6e a SOI APP0 JFIF 1.2 APP13 IPTC APP2 ICC DQT SOF0 183x512 DRI DHT SOS ECS0 RST0 ECS1 RST1 ECS2...
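The step from raw bytes to named segments (SOI, APP0, DQT, ...) can be sketched in a few lines. This is a simplified illustration, assuming a well-formed JPEG stream with no fill bytes between segments; it stops at the first SOS marker rather than decoding the entropy-coded data, and the function name and marker table are choices made for this example.

```python
import struct
from typing import List

# A few marker codes from the JPEG standard; unknown codes are shown in hex.
MARKER_NAMES = {
    0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13",
    0xDB: "DQT", 0xC0: "SOF0", 0xDD: "DRI",
    0xC4: "DHT", 0xDA: "SOS",
}


def list_segments(data: bytes) -> List[str]:
    """Return the marker names from SOI up to and including the first SOS."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream: missing SOI marker")
    names = ["SOI"]
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        code = data[pos + 1]
        # The two-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2:
            raise ValueError(f"invalid segment length at offset {pos}")
        names.append(MARKER_NAMES.get(code, f"0x{code:02X}"))
        if code == 0xDA:  # entropy-coded data follows; stop here
            break
        pos += 2 + length
    return names
```

Without knowledge of the format, the same bytes are just an opaque hex string; with it, each segment's role (application metadata, quantization tables, frame header) becomes addressable.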
Format Formats vary significantly in their “preservability” Keeping multiple versions of a given piece of content for different purposes is frequently wise –E.g. archival master, production master, use copy
Format Some criteria for “preservability” (from LC) –Disclosure (how well documented?) –Adoption (how widely used?) –Transparency (is compression used?) –Self documenting (good!) –External dependencies (self sufficiency is good) –Patents (could limit preservation actions) –DRM/encryption (what if decryption key is not available?)
Metadata The basis of decision-making for preservation –Technical metadata What format is this in? What format options are used? –Structural metadata If I change this, what else is affected? …
Metadata –Administrative metadata Who has the right to make decisions about this? –Relationship metadata Are there other versions of this object? –How do these affect my preservation strategy? –Provenance metadata Where did this come from? What changes has it already undergone?
Guidelines for “preservable” objects The least expensive, and most effective preservation measure is to think about the future when an object is created! (Guidelines on format, metadata, archival masters, etc.)
JHOVE (JSTOR/Harvard Object Validation Environment) A widely used tool for format identification, validation, and characterization.
JHOVE (JSTOR/Harvard Object Validation Environment) When an object is ingested: Determine its format (“identify”) Ensure that it is properly formed (“validate”) Extract meaningful technical metadata (“characterize”)
DRS: what’s managed today As of January 2007, 5.6M files and 22 TB, excluding Google and web archiving
II What has Harvard been doing? Externally…
E-journal archiving “How can we ensure that licensed e-journal content will remain usable over time?” Mellon-funded study Explored technical formats, content types, transactions and dataflows, validation, systems requirements, contractual requirements, business models Harvard’s proposed model largely implemented by Portico
Technical Metadata for Digital Still Images “What are the appropriate technical metadata necessary for the preservation of images?” Standardized as NISO Z39.87 Expressed in the MIX schema –Maintained by LC The basis for DRS image technical metadata
METS (Metadata Encoding and Transmission Standard) “Is there a generic packaging form for digital content?” For example, –Digital books –Audio works –Images (archival master, production master, deliverables) Useful for exchange of objects between repositories Maintained by LC
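To make the packaging idea concrete, the sketch below assembles a very small METS document for an image with several versions. It is an illustration only: it has not been validated against the METS schema, it omits the header and metadata sections, and the `build_package` helper, file IDs, and `file://` locations are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def _tag(name: str) -> str:
    """Qualify an element name with the METS namespace."""
    return f"{{{METS_NS}}}{name}"


def build_package(label: str, files: List[Tuple[str, str, str, str]]) -> str:
    """Build a minimal METS document.

    `files` holds (file_id, use, mimetype, href) tuples, one per version
    of the content. Each version gets its own file group, and a single
    structural division points at all of them.
    """
    root = ET.Element(_tag("mets"), {"LABEL": label})
    file_sec = ET.SubElement(root, _tag("fileSec"))
    struct_map = ET.SubElement(root, _tag("structMap"), {"TYPE": "physical"})
    division = ET.SubElement(struct_map, _tag("div"), {"LABEL": label})
    for file_id, use, mimetype, href in files:
        group = ET.SubElement(file_sec, _tag("fileGrp"), {"USE": use})
        entry = ET.SubElement(
            group, _tag("file"), {"ID": file_id, "MIMETYPE": mimetype}
        )
        ET.SubElement(
            entry,
            _tag("FLocat"),
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href},
        )
        ET.SubElement(division, _tag("fptr"), {"FILEID": file_id})
    return ET.tostring(root, encoding="unicode")
```

For example, `build_package("Sample image", [("F1", "archival master", "image/tiff", "file:///master.tif"), ("F2", "deliverable", "image/jpeg", "file:///use.jpg")])` returns one XML string that lists both versions and ties them to the same logical object.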
Core audio metadata “What are the appropriate technical metadata necessary for the preservation of audio?” Standardized as AES X-098 Used as the basis for DRS audio technical metadata
PDF/A “PDF defines too many options; is there a ‘flavor’ that will be more ‘preservable’ over time?” Requires, recommends, and restricts PDF functionality to enhance preservability Standardized as ISO 19005
PREMIS PREservation Metadata: Implementation Strategies “What are the general metadata elements necessary to preserve digital content over time?” OCLC/RLG-sponsored work group Recommendations and best practices for preservation metadata –Core elements, data dictionary, implementation strategies, cooperative projects …
PREMIS PREservation Metadata: Implementation Strategies Report on current practices and recommended metadata elements available Maintained by LC
AIHT (Archive Ingest and Handling Test) “What difficulties can we expect to arise during the exchange of content between heterogeneous repositories?” LC-funded project to investigate exchange of complex data between preservation repositories Harvard, Stanford, Johns Hopkins, Old Dominion ingest and exchange web archive data
GDFR (Global Digital Format Registry) “What will we need to know in the future about formats in use today, and how will we know it?” Shared registry of preservation-related information about technical format Reduce work for repositories to create and maintain information about objects they ingest …
GDFR (Global Digital Format Registry) Enables sharing of format expertise Directed by Harvard, implemented by OCLC Funded by Mellon Foundation
Registry of Digital Masters “How can I find out who has accepted archival responsibility for a given piece of content?” Initially reformatted materials; intention to expand to born-digital DLF project Implemented by and housed at OCLC
Repository certification “Why should a collection manager trust a digital repository?” RLG/OCLC report on Trusted Repository Attributes RLG/NARA Digital Repository Certification Task Force …
Repository certification Recommend structure and metrics of an international process for certifying preservation repositories –Organizational role and structure, staff size and skill, formal operations and documentation, appropriate technical infrastructure and facilities, on-going funding, and “hand-off” plan, etc. CRL Auditing and Certification project
Key activities elsewhere ISO OAIS (Open Archival Information System) LC NDIIPP (National Digital Information Infrastructure Preservation Program) Web archiving (IA, IIPC) NARA ERA (Electronic Records Archiving) Digital Curation Centre PLANETS
III What more do we need to do?
Evolution: from projects to program Digital preservation requires a continual, proactive program –You can’t just stop and start –Time frames are MUCH shorter than for preservation of physical collections Need to define scope and role of our preservation efforts Investment required in both technology and staffing
Preservation lifecycle Creation –Format and technical specification choices –Accompanying metadata –Packaging for ingest Ingest –Validation –Normalization …
Preservation lifecycle Assumption of preservation responsibility Monitoring –When is intervention necessary? Changes to the technical environment Changes to user expectations Planning –Significant properties All preservation decisions involve choice; how to choose what to preserve? …
Preservation lifecycle Intervention (preserving usability) –Re-acquisition –Re-generation from an archival master –Migration before necessary (“just in case”) –Migration at point of request (“just in time”) –Emulation of obsolete technology in contemporary environment –Universal Virtual Computer (UVC) Rewrite necessary software to run on technology-agnostic “virtual” computer …
Preservation lifecycle Intervention (continued) –Save for digital archeologists After intervention –Post-intervention quality assurance –Documenting the process of change Succession planning –What do we do when we want to get out of the repository business?
Staffing and responsibilities Technical –Infrastructure maintenance –Monitor technological change –Integration into larger preservation environment –Preservation planning Curatorial –Preservation intervention will involve trade-offs What attributes need to be preserved? Cost/benefit analysis
Immediate challenges Google –Substantial increase in scale (both number and size) –“Dark” content; no expectation of current access Web archiving –Explosion of data types –No forethought on format selection and technical specifications –No metadata –Some failure may be inevitable
Coming soon? Institutional repository (IR) to enhance scholarly communication and preserve scholarly creations –Similar to web archiving: objects not typically created with preservation in mind, nor accompanied by metadata “Just in case” local copies of licensed content –May necessitate increased sophistication of IPR management
Longer term issues Economics – What can we afford to preserve? Scale – How much can we preserve? Selection – What do we leave for others? Federation – Can we share responsibilities for preservation? –Copies in independent environments are safest Certification – Do we need formal certification? –Note Section 108 revision Education – Who at Harvard needs to understand?