Preservation Born-again and Born Digital CS 502 – 20030407 Carl Lagoze – Cornell University Acknowledgements: Anne Kenney Nancy McGovern Vicky Reich.

Slides:



Advertisements
Similar presentations
E-Content Service Group Virtual Meeting Digital Preservation: How to Get Started.
Advertisements

The Future of Scholarship in the Digital Age: The Role of Institutional Repositories Ann J. Wolpert Director of Libraries Massachusetts Institute of Technology.
Long-Term Preservation. Technical Approaches to Long-Term Preservation the challenge is to interpret formats a similar development: sound carriers From.
Digital Preservation and Trusted Digital Repositories Priscilla Caplan Florida Center for Library Automation ALA 2005 Chicago IL.
Electronic Theses and Dissertations: Benefits, Issues, and the University of Waterloo Approach
DRS 2 one in a series of periodic updates Harvard University Library Andrea Goethals October 21, 2009 DRS = Digital Repository Service.
From Analog to Digital. What are Digital Images? Electronic snapshots taken of a scene or scanned from documents samples and mapped as a grid of dots.
1 CS 502: Computing Methods for Digital Libraries Lecture 9 Conversion to Digital Formats Anne Kenney, Cornell University Library.
These ain’t “Old News”! Creating access to historic newspapers Christine Guenther OCLC Product Manager, Digital Services Preservation Service Centers Bethlehem,
Selecting Preservation Strategies for Web Archives Stephan Strodl, Andreas Rauber Department of Software.
Project Prism Virtual Remote Control: Preservation Risk Management for Web Resources Nancy Y. McGovern, ECURE 2002.
CC 2007, 2011 attribution - R.B. Allen Information System Architectures and Services.
Preservation Born-again and Born Digital CS 431 – Carl Lagoze – Cornell University Acknowledgements: James Cheney Anne Kenney Nancy McGovern Vicky.
Image Metadata Summary of 4/18/99 NISO/DLF Image Metadata Meeting ( Howard Besser UCLA School of Education & Information.
Digital Preservation. The Past is Prologue Developing Preservation Approaches.
Technology Bootcamp January 18, 2014 Large-Scale Digital Libraries Digitization Process Krystyna K. Matusiak, Ph.D. Assistant Professor Library & Information.
Digital Images. Scanned or digitally captured image Image created on computer using graphics software.
 Scanned or digitally captured image  Image created on computer using graphics software.
Different approaches to digital preservation Hilde van Wijngaarden Digital Preservation Officer Koninklijke Bibliotheek/ National Library of the Netherlands.
Digital preservation Hydra Europe, LSE 24 April 2015 Anders Conrad.
HBCU-CUL Digital Imaging Workshop, November 2005
Chinese-European Workshop on Digital Preservation, Beijing July 14 – Network of Expertise in Digital Preservation 1 Trusted Digital Repositories,
1 Preservation And Access: Achieving the Best of Both Worlds Eimee Rhea C. Lagrama 1.
Statewide Digitization and the FCLA Digital Archive Priscilla Caplan, Florida Center for Library Automation Statewide Digitization Planners Meeting OCLC,
Adventures in Digital Asset Management: Fedora at the National Library of Wales Glen Robson National Library of Wales
City of Seattle Office of the City Clerk Open Government = Access Challenges and Opportunities with Digital Records.
Digitization of the Federal Depository Library Program Judith C. Russell Superintendent of Documents & Managing Director, Information Dissemination “Electronic.
Jenn Riley Metadata Librarian Indiana University Digital Library Program.
1 XML as a preservation strategy Experiences with the DiVA document format Eva Müller, Uwe Klosa Electronic Publishing Centre Uppsala University Library,
Digital Preservation 101, or, How to Keep Bits for Centuries Julie C. Swierczek Digital Asset Manager and Digital Archivist Harvard Art Museums.
2002 September -- ejk/UF RESEARCH TOPICS Web-Interface Performance DTD Extensibility Imaging Distillation Other topics?
Digital Archiving Kathryn Lybarger November 6, 2008.
Preserving Digital Collections for Future Scholarship Oya Y. Rieger Cornell University
Preservation – Why the Urgency? “A National Library is a place where a nation nourishes its memory and exerts its imagination – where it connects with.
OAIS in the Library Environment Managing and Preserving Electronic Resources FLICC/CENDI Washington DC, December 11,2001 Anne Van Camp RLG, Member Initiatives.
DigCCurr Professional Institute: Curation Practices for the Digital Object Lifecycle Digital Curation Program Development Nancy Y McGovern Research Assistant.
Johnson Museum Online 15,800 works on paper 6,700 objects in Asian collection high resolution, medium resolution, and thumbnail Luna.
Preserving eScholarship and Digitized Special Collections Distributed Digital Preservation Bill Donovan
Digital Preservation: Current Thinking Anne Gilliland-Swetland Department of Information Studies.
Embracing Evanescence Digital Preservation 101 By Becky Ryder University of Kentucky Libraries ETD 2004.
Digital Image Capture of Musical Scores Jenn Riley, Indiana University Digital Library Program Ichiro Fujinaga, McGill University.
Small steps and lasting impact: making a start with preservation or It’s not all NASA Patricia Sleeman Digital Archives and Repositories University of.
Archival Workshop on Ingest, Identification, and Certification Standards Certification (Best Practices) Checklist Does the archive have a written plan.
 Scanned or digitally captured image  Image created on computer using graphics software.
The Story of at the Alaska State Library Presented by Sheri Somerville Alaska State Library March 14, 2009.
OAIS Rathachai Chawuthai Information Management CSIM / AIT Issued document 1.0.
How to Implement an Institutional Repository: Part II A NASIG 2006 Pre-Conference May 4, 2006 Technical Issues.
Collecting History: Profiles in Science Alexa T. McCray National Library of Medicine Bethesda, MD Stanford University August 21, 1999.
NDSR Boston webinar: Digital Preservation Introduction Presenter: Nancy Y McGovern October 2015.
Institutional Repositories: the DSpace Experience Ann J. Wolpert Director of Libraries Massachusetts Institute of Technology.
Digitization & Digital Preservation
Open Access Conference, Pretoria, July 2004 Wouter Klapwijk, Univ. of Stellenbosch The LOCKSS Project: an overview Open Access Conference, Pretoria, July.
Aligning Digital Preservation Policies with Community Standards Nancy McGovern Digital Preservation Officer.
Lifecycle Metadata for Digital Objects November 15, 2004 Preservation Metadata.
Introduction to Scanning. Why Digitize? Provide better access Protect fragile or valuable materials in your collections Digital surrogates will help preserve.
HOW SCANNERS WORK A scanner is a device that uses a light source to electronically convert an image into binary data (0s and 1s). This binary data can.
Libraries in the digital age Collection & preservation for generational access part two The LOCKSS Program.
Data Management and Digital Preservation Carly Dearborn, MSIS Digital Preservation & Electronic Records Archivist
Developing a Dark Archive for OJS Journals Yu-Hung Lin, Metadata Librarian for Continuing Resources, Scholarship and Data Rutgers University 1 10/7/2015.
A Semi-Automated Digital Preservation System based on Semantic Web Services Jane Hunter Sharmin Choudhury DSTC PTY LTD, Brisbane, Australia Slides by Ananta.

Digital Images.
ImageEditing Understanding Image Resolution.
Implementing an Institutional Repository: Part II
Digital Preservation and Trusted Digital Repositories
RESEARCH TOPICS Web-Interface Performance DTD Extensibility Imaging
From Analog to Digital.
Implementing an Institutional Repository: Part II
How to Implement an Institutional Repository: Part II
Image Metadata Summary of 4/18/99 NISO/DLF Image Metadata Meeting
Presentation transcript:

Preservation Born-again and Born Digital CS 502 – Carl Lagoze – Cornell University Acknowledgements: Anne Kenney Nancy McGovern Vicky Reich

Preservation of physical artifacts Environmental Control Brittle Books –Acidification is byproduct of paper production in 1850’s to 1980’s Bleach for whitening Alum for sizing (fixity of ink) Tanning for leather tanning –35-75% of paper based artifacts from this period are in danger –Newspapers and paperbacks especially vulnerable –ANSI standard Z for “permanent” paper.

Deacidification of Brittle Books Raise the pH level of treated paper to the acceptable range of 6.8 to 10.4pH extending the useful life of paper (measured by fold endurance after accelerated aging) by over 300%. Environmental treatment using magnesium oxide (MgO) Expense requires careful selection process

Failures of Microfilm Popular preservation approach before digitization Severe problems –Quality of filming –Color to bi-tonal –Usability issues –Self-destruction of film “Double Fold” Nicholson Baker

Digitization through Scanning Alternative to deacidification Advantages –Universal access –Reduction in shelf costs –OCR (full-text access) Disadvantages –Quality reduction –Cost –Not original syndrome –Destruction of source (debinding) –A new preservation problem

What are Digital Images? Electronic snapshots taken of a scene or scanned from documents samples and mapped as a grid of dots or picture elements (pixels) pixel assigned a tonal value (black, white, grays, colors), represented in binary code code stored or reduced (compressed) read and interpreted to create analog version

Why Rich Digital Masters? Preservation –Original may only withstand one scan –Maintenance of digital files Cost –One scan may be all that is affordable –Conversion costs dwarfed by other costs Access –Many from one –The richer the file, the better the derivative in terms of quality and processibility

How to determine what’s good enough? Connoisseurship of document attributes –Identify key information content –Objectively characterize or measure attributes: size, detail, tone, and color Appreciate imaging factors affecting quality and cost Translate between analog and digital –Equate measurements to digital equivalencies and corresponding metrics, e.g., detail size  resolution

Digital Image Quality is Governed By: resolution and threshold bit depth color management image enhancement compression and file format

Resolution Determined by number of pixels used to represent the image Increasing resolution increases level of detail captured and geometrically increases file size zoom in

11 Effects of Resolution 600 dpi 300 dpi 200 dpi

Threshold Setting in Bitonal Scanning defines the point on a scale from 0 to 255 at which gray values will be interpreted either as black or white

13 Effects of Threshold threshold = 100 threshold = 60

Bit Depth Determined by the number of binary digits (bits) used to represent each pixel 1-bit 8-bit24-bit

Bit Depth increasing bit depth increases the level of gray or color information that can be represented and arithmetically increases file size Bit depth, dynamic range, and color appearance

Utilizing Sufficient Bit-Depth 3-bit gray8-bit gray

Utilizing Sufficient Bit Depth 8-bit color24-bit color

Bit Depth vs. Dynamic Range The range of tonal difference between lightest light and the darkest dark

Mapping Tones Correctly: Use of Histograms

What are Digital Images? Electronic snapshots taken of a scene or scanned from documents samples and mapped as a grid of dots or picture elements (pixels) pixel assigned a tonal value (black, white, grays, colors), represented in binary code code stored or reduced (compressed) read and interpreted to create analog version

Why Rich Digital Masters? Preservation –Original may only withstand one scan –Maintenance of digital files Cost –One scan may be all that is affordable –Conversion costs dwarfed by other costs Access –Many from one –The richer the file, the better the derivative in terms of quality and processibility

How to determine what’s good enough? Connoisseurship of document attributes –Identify key information content –Objectively characterize or measure attributes: size, detail, tone, and color Appreciate imaging factors affecting quality and cost Translate between analog and digital –Equate measurements to digital equivalencies and corresponding metrics, e.g., detail size  resolution  MTF

Digital Image Quality is Governed By: resolution and threshold bit depth color management image enhancement compression and file format system performance

Resolution Determined by number of pixels used to represent the image Increasing resolution increases level of detail captured and geometrically increases file size zoom in

25 Effects of Resolution 600 dpi 300 dpi 200 dpi

Threshold Setting in Bitonal Scanning defines the point on a scale from 0 to 255 at which gray values will be interpreted either as black or white

27 Effects of Threshold threshold = 100 threshold = 60

Bit Depth Determined by the number of binary digits (bits) used to represent each pixel 1-bit 8-bit24-bit

Bit Depth increasing bit depth increases the level of gray or color information that can be represented and arithmetically increases file size Bit depth, dynamic range, and color appearance

Utilizing Sufficient Bit-Depth 3-bit gray8-bit gray

Utilizing Sufficient Bit Depth 8-bit color24-bit color

Bit Depth vs. Dynamic Range The range of tonal difference between lightest light and the darkest dark

Mapping Tones Correctly: Use of Histograms

Aligning Document Attributes with Digital Requirements Identify key document attributes –Tone, color, and detail Characterize them, if possible through objective measurements Determine quality requirements and tolerance levels Translate between analog and digital and between scanning requirements and scanning performance

Aligning Document Attributes with Digital Requirements Calibrate scanner with targets and software Calibrate the rest of the system Control lighting and environment Scan appropriate targets with documents Evaluate images against originals

Aligning Document Attributes with Digital Requirements Minimize post-processing in the master image Save in TIFF; avoid lossy compression Maintain scanning metadata Monitor emerging image quality metrics

One Size Does Not Fit All! Different document types will require different scanning equipment and processes The more complex the document, the higher the conversion/access requirements Scan the original whenever possible No standards for image conversion: guidance rather than guidelines Notion of long-term utility and cross-institutional resources gaining ground

Trusted Repository Attributes of a Trusted Digital Repository (RLG- OCLC) –Administrative responsibility OAIS Reference Model (CCSDS) 2.pdf –Organization viability –Financial sustainability –Technical suitability –System security –Procedural accountability National Archives and Records Administration (NARA)

Digital Preservation Strategies Disclaimer: monolithic, homogeneous solutions are likely to fail, many digital preservation approaches are required

Emulation Preserve original “look and feel” and functionality of digital artifact Enable obsolete systems to be run on future unknown systems Notion of universal virtual machine Jeff Rothenberg, Raymond Lurie CAMiLEON Project –

Migration File formats change over time and become extinct Issues of proprietary vs. open source formats Lossiness of formats Risk Management of Digital Information: A File Format Investigation – CAMiLEON Project

Canonicalization Canonicalization: A Fundamental Tool to Facilitate Preservation and Management of Digital Information – Tie to XML standards – / /

LOCKSS Lots of Copies Keep Stuff Safe

LOCKSS Mission Build tools and provide support to Libraries, so they can easily and affordably build, preserve, and archive local e-collections –Own rather than lease electronic information –Retain traditional custodial role of scholarly information Publishers, so they can easily and affordably provide content for preservation and archiving –With minimal risk to their business model or to their publishing platforms –Relinquish responsibility to provide perpetual access –Fulfill librarians’ requirements that publishers guarantee long-term access to content sold

Paper Library System Libraries act for their institution to –Acquire copies of important “stuff” –Keep copies on shelves –Give access to local readers Libraries cooperate to –Supply copies to other libraries a reader can easily to find a copy a “bad guy” has trouble finding and destroying all copies Libraries ensure content persists simply by supporting their local communities A cooperative, affordable, decentralized, ‘ archive system ’ with LOTS OF COPIES

LOCKSS “ Library System ” Libraries act for their institution to –Acquire copies of important “stuff” –Keep copies in transparent web caches –Give access to local readers Libraries cooperate to –Detect and repair damage a reader can easily find a copy a “bad guy” has trouble finding and destroying all copies Libraries ensure content persists simply by supporting their local communities A cooperative, affordable, decentralized, archive system with LOTS OF COPIES

LOCKSS Technology LOCKSS web caches Collect HTTP delivered content –All file formats (PDF, HTML, JPEG, TIF, Audio, Video) –Collect “presentation files” of content as published –Must have authorized access to publisher’s site Preserve and audit content integrity –Independent content collection –Cooperate to resolve content differences Continuously validate against other caches Repair gaps from publisher and other caches Provide access –Readers access content via desktop web browser –Content is never “dark”

Publisher Publishers LOCKSS caches Readers Data flows – an approximation

Hardware Costs HDD prices decline by 50% a year

Preservation Risk Management/Automated Monitoring Typical StagesPrism Stages 1. Risk identification1.Data gathering Characterization 2. Risk classification 3. Risk assessment2. Simple risk declaration 3. Contextualized declaration/detection 4. Risk analysis 5. Program implementation4. Automated enforcement

Levels of Context Web page –as a stand-alone object, ignoring its hyperlinks –in local context, considering the links into it and out from it Web site –as a semantically coherent set of linked Web pages –as an entity in a broader technical and organizational context

Page-level Monitoring Formatting: TIDY Standards compliance Document structure Metadata: –HTTP headers –HTML headers Changes –Content –Location Links –Out-link structure –In-link structure –Intra-site –Hub –Volatility Page provenance –URL parsing Log analysis

Site-level Monitoring Graph analysis Static site analysis and Longitudinal study Aggregate page analyses Site maintenance indicators –Backup and archiving policies and procedures –Hardware and software environment –Network configuration and maintenance

Facilitating/Monitoring Longevity of Distributed Content Preservation Service

Preservation Spaces

Object vs. Information

Preserving information rather than bits