Challenges of Digital Preservation MA / CS 109 April 22, 2011 Andrea Goethals Manager of Digital Preservation & Repository Services Harvard Library.

“Digital Content”? Digitized (born-analog) Born-digital ◦ Tweets ◦ Web sites ◦ Email ◦ Documents ▪ PDF ▪ Word, OpenOffice … ▪ Spreadsheets ◦ Data sets

Digital content is not new 1957: 1st digital image 1969: ARPAnet 1971: 1st email sent 1972: 1st consumer-level video game 1975: 1st digital camera Image: Russell Kirsch’s son (source: NIST)

But has only recently exploded 1998: 1st Google index ◦ 26 million pages 2000: Google index ◦ 1 billion pages 2008: Google link processors ◦ 1 trillion unique URIs ◦ “… and the number of individual Web pages out there is growing by several billion pages per day” – from the official Google blog

The coming tsunami 2010: estimated at 1.2 ZB (1 ZB is 1 billion TB) ◦ DVDs stacked from Earth to the Moon and back 2020: expected to reach 35 ZB, a factor of 44 over 2009 ◦ DVDs stacked halfway to Mars Source: 2010 IDC Digital Universe Study sponsored by EMC
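The unit arithmetic behind these figures can be checked with a short sketch. It assumes decimal (SI) prefixes, which the slide does not state, and a 4.7 GB single-layer DVD, which is an illustrative assumption rather than a figure from the IDC study.

```python
# Decimal (SI) byte units: KB = 10^3 ... ZB = 10^21.
# Assumption: the slide does not say whether SI or binary prefixes are meant.
KB, MB, GB, TB, PB, EB, ZB = (10 ** (3 * n) for n in range(1, 8))

# Assumption for illustration only: a single-layer DVD holds 4.7 GB.
DVD_BYTES = 47 * 10 ** 8


def dvd_count(total_bytes: int) -> int:
    """Number of DVDs needed to hold total_bytes (rounded up)."""
    return -(-total_bytes // DVD_BYTES)


# 1.2 ZB, the 2010 estimate quoted on the slide, as an exact integer.
universe_2010 = 12 * ZB // 10

print("TB per ZB:", ZB // TB)
print("PB per ZB:", ZB // PB)
print("DVDs for 1.2 ZB:", dvd_count(universe_2010))
```

Under these assumptions one zettabyte is a billion terabytes, or equivalently a million petabytes.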

Outpacing storage Source: 2009 IDC Digital Universe Study sponsored by EMC

Why do we care?

May be historically significant Captured March 19, 2011 for a Japan Earthquake collection created by Virginia Tech, Internet Archive (

May be a work of art YouTube Play. A Biennial of Creative Video (Oct )

May be an important reference Only available in digital form

May only be possible digitally

Who cares? Cultural heritage institutions ◦ Libraries, archives ◦ Museums, historical societies ◦ Academic institutions Governments Entertainment, news and media industry Scientific community Funding bodies (NSF, NIH) You?

Preservation historically Archives and libraries have been preserving all kinds of analog material for centuries using: ◦ Environmental control ◦ Conservation treatments Can store away until resources allow processing ◦ Benign neglect approach works well

Analog content is fairly durable Even damaged, may still be identifiable, readable, usable Anatolian Cuneiform Tablet, circa 1850 BCE

In contrast digital content is Easily destroyed Transient Hidden Requires more active attention – benign neglect approach doesn’t work

Digital content is easily destroyed Bad people Hardware or software failures Human mistakes ◦ The slip of a finger can lead to catastrophic results ◦ “Help! Accidental deletion. I accidentally deleted 62 images… can you please recover them from backups?”

Digital content is transient Average lifespan of a Web site is between 44 and 100 days Captured April 8, 2009; visited October 13, 2010

Digital content is hidden Which is corrupt?

Digital content is hidden Both. Use helps but it’s not enough to detect corruption.

But is it usable??? It’s not enough to preserve the digital bits ◦ AppleWorks? ◦ WordStar? ◦ Excel 1.0? To use digital content we need software that can read the format

Reading formats ffd8ffe000104a ffed0fb050686f74 6f73686f e d 03e90a e e666f f40240ffeeffee fc d f d03ed0a f6c f 6e a

Reading formats ffd8ffe000104a ffed0fb050686f74 6f73686f e d 03e90a e e666f f40240ffeeffee fc d f d03ed0a f6c f 6e a SOI APP0 JFIF 1.2 APP13 IPTC APP2 ICC DQT SOF0 183x512 DRI DHT SOS ECS0 RST0 ECS1 RST1 ECS2...
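The labels on this slide (SOI, APP0, DQT, SOS, ...) are JPEG marker segments. As a minimal sketch of how software gets from raw bytes to those labels, the function below walks the segment headers of a JPEG stream up to the start of the compressed image data. The function name, the marker table, and the tiny synthetic sample are illustrative assumptions, not part of the presentation, and the table covers only a handful of markers.

```python
import struct

# A few common JPEG markers, keyed by the byte that follows 0xFF.
MARKER_NAMES = {
    0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13",
    0xDB: "DQT", 0xC0: "SOF0", 0xC4: "DHT", 0xDD: "DRI",
    0xDA: "SOS", 0xD9: "EOI",
}


def list_segments(data: bytes) -> list:
    """Return marker names from SOI up to and including SOS."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream: missing SOI marker")
    names = ["SOI"]
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        names.append(MARKER_NAMES.get(marker, f"0x{marker:02X}"))
        if marker == 0xDA:
            break  # entropy-coded image data follows; stop here
        # The 2-byte big-endian length counts itself plus the payload.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + length
    return names


# Synthetic sample: SOI, a JFIF 1.2 APP0 segment, an all-zero DQT, and SOS.
app0 = b"\xff\xe0" + struct.pack(">H", 16) + b"JFIF\x00\x01\x02\x00\x00\x01\x00\x01\x00\x00"
dqt = b"\xff\xdb" + struct.pack(">H", 67) + b"\x00" + bytes(64)
sos = b"\xff\xda" + struct.pack(">H", 8) + b"\x01\x01\x00\x00\x3f\x00"
sample = b"\xff\xd8" + app0 + dqt + sos

print(list_segments(sample))
```

Without code like this, or the format specification needed to write it, the bytes stay an opaque hex string.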

Access to information [Diagram: analog book, unmediated access: information content, symbols, language, HW (paper); digital book, technology-mediated access: information content, bits, formats, SW, HW]

Formats are key to digital preservation [Diagram: information content, bits, formats, SW, HW; labels: digital content, supporting technologies] If the format of our content is unsupported by technology, we can’t access the content’s information!

Dependent on fleeting technology We are dependent on technology to interpret (render, play, etc.) digital content No technology sticks around – it all ages and disappears Eventually all digital content in its original format becomes unusable!

Format obsolescence Kodak PhotoCD ◦ Used by libraries in the 1990s and into the 2000s as a preservation format ◦ Best decoders were from Kodak and are no longer supported ◦ Very few software decoders remaining – soon images in this format will be unusable ◦ Harvard’s Digital Repository Service has 7,243 of these

Two sub-problems Keep the bits safe Keep the information usable as technology changes

Safe bits Infrastructure, policies, practices and professional staff to counter risks ◦ High quality storage ◦ Redundancy (multiple copies, multiple locations) ◦ Media refreshing (replacing) ◦ Security and access restrictions ◦ Content recovery ◦ Integrity monitoring (check for corruption)…

Integrity monitoring Message digests – effectively unique signatures for digital content ◦ Fixed-size bit strings ▪ 6326ec82b3200df4a87fc54356d2cb73 ◦ Calculated by cryptographic hash functions, e.g. MD5, SHA-1, … Any change to a file results, with overwhelming probability, in a changed message digest Useful for detecting corruption
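A minimal sketch of digest-based integrity checking, using Python's standard hashlib module. The helper names and the sample content are illustrative assumptions; a repository would store the digest at ingest and recompute it on a schedule.

```python
import hashlib


def digest(data: bytes, algorithm: str = "sha256") -> str:
    """Hex message digest of data under the named hash algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def is_intact(data: bytes, stored_digest: str, algorithm: str = "sha256") -> bool:
    """True if data still matches the digest recorded earlier."""
    return digest(data, algorithm) == stored_digest


# Record a digest when the content is first stored (hypothetical content).
original = b"digital content to preserve"
stored = digest(original)

# Simulate silent corruption: flip one bit in the first byte.
corrupted = bytearray(original)
corrupted[0] ^= 0x01

print("original intact: ", is_intact(original, stored))
print("corrupted intact:", is_intact(bytes(corrupted), stored))
```

A digest only detects a change; repairing it still depends on having another good copy, which is why integrity monitoring is paired with redundancy.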

Usable information People have to be able to find it People must be able to manage it Document what’s important (description, context, ownership, processing history) Know what you are preserving (formats)…

A TIFF is a TIFF? TIFF 4.0 TIFF 5.0 TIFF 6.0 TIFF 6.0 extension YCbCr (Class Y) TIFF/IT (ISO 12639:2003) TIFF/EP (ISO :2001) RichTIFF EXIF 2.0 EXIF 2.1 (JEIDA ) EXIF 2.2 (JEITA CP-3451) GeoTIFF 1.0 TIFF-FX (RFC 2301) Class F (RFC 2306) RFC 1314 Canon RAW (.crw,.cr2,.tif) Nikon RAW (.nef) DNG (Adobe Digital Negative)

Identifying formats Techniques: “magic numbers”, full parse Few tools ◦ Support limited number of formats ◦ Accuracy varies Some improvements ◦ File Information Tool Set (FITS) ▪ fits.google.code ◦ NARA-sponsored research
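A minimal sketch of the “magic numbers” technique: compare the first bytes of a file against a table of known signatures. The table and function name are illustrative assumptions covering a few well-known formats; this is not how FITS or any particular tool is implemented.

```python
from typing import Optional

# Leading byte signatures ("magic numbers") for a few common formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
]


def identify(header: bytes) -> Optional[str]:
    """Guess a format from the first bytes of a file, or None if unknown."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


print(identify(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
print(identify(b"%PDF-1.4\n"))
print(identify(b"plain text with no signature"))
```

The check is cheap but coarse: formats built on TIFF, such as DNG, carry the same signature, so telling them apart takes a fuller parse. That is one reason accuracy varies between tools.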

Usable information Make sure there’s technology to support the formats! (technology watch) Preservation strategies ◦ Technology preservation ◦ Creation of viewing software ◦ Emulation & variations: ▪ Universal Virtual Machine ▪ Universal Virtual Computer ◦ Format normalization ◦ Format migrations…

Key format migration considerations What can’t be lost in the transformation? “Significant properties” ◦ E.g. color, embedded metadata, resolution, ICC profiles, interaction, attachments, fonts, links ◦ How important is each of these properties? – weighted criteria To what format? “Preservable” formats What else must be changed? Ex: Links How many versions to keep?
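One way to read “weighted criteria” is as a weighted average: give each significant property a weight, score how well each candidate target format preserves it, and compare the totals. The sketch below assumes that reading; every property weight, candidate name, and score in it is hypothetical and chosen only to show the arithmetic.

```python
# Hypothetical weights: how much each significant property matters.
WEIGHTS = {"color": 3.0, "resolution": 2.0, "embedded_metadata": 1.0}

# Hypothetical scores in [0, 1]: how well each candidate target format
# preserves each property after migration.
CANDIDATES = {
    "format_a": {"color": 1.0, "resolution": 1.0, "embedded_metadata": 0.5},
    "format_b": {"color": 0.5, "resolution": 1.0, "embedded_metadata": 1.0},
}


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-property scores."""
    total_weight = sum(weights.values())
    return sum(weights[p] * scores[p] for p in weights) / total_weight


# Rank candidates from best to worst under these weights.
ranking = sorted(
    CANDIDATES,
    key=lambda name: weighted_score(CANDIDATES[name], WEIGHTS),
    reverse=True,
)

for name in ranking:
    print(name, round(weighted_score(CANDIDATES[name], WEIGHTS), 3))
```

Changing the weights can reverse the ranking, which is why deciding how important each property is comes before choosing the target format.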

Preservation lifecycle – a series of hand-offs Create or acquire digital content Ingest into a preservation repository ◦ Continuous cycle of: ▪ Monitoring ▪ Planning ▪ Intervention ◦ Subject to collection management decisions Transfer to next generation of the repository or to a different repository

Ongoing commitment Requires continual proactive program ◦ You can’t just start and stop ◦ Time frames are MUCH shorter than for preservation of analog material Requires ongoing investment in infrastructure and staff

Can’t do it alone Digital preservation activities must be shared across institutions Even collectively we don’t have adequate resources or understanding

Preservation community Collaborative organizations (NDSA, IIPC, OPF) Collaborative projects Standards and best practices Shared infrastructure and tools ◦ Formats registry ◦ Repository software ◦ Preservation planning tools ◦ Format tools