JH VE 2 The Fifth International Conference on Preservation of Digital Objects British Library, 29-30 September 2008 What? So What? The Next-Generation.

Slides:



Advertisements
Similar presentations
More Better Metadata SAA 2014 Panel: Metadata and Digital Preservation: How Much Do We Really Need? Andrea Goethals, Harvard Library Even v.
Advertisements

Challenges of Digital Preservation MA / CS 109 April 22, 2011 Andrea Goethals Manager of Digital Preservation & Repository Services Harvard Library.
Unified Digital Format Registry (UDFR) Stakeholder Meeting Library of Congress Washington, DC April 13, 14, 2011.
Beyond Borders SAA Annual Meeting San Diego, August 5-9, 2012 University of California Curation Center California Digital Library Stephen Abrams Unified.
Funded by: © AHDS Sherpa DP – a Technical Architecture for a Disaggregated Preservation Service Mark Hedges Arts and Humanities Data Service King’s College.
Depositing e-material to The National Library of Sweden.
ITHAKA Preservation Metadata 2.0: Revising the Event Model A last-minute presentation on work currently in progress Evan Owens VP, Content Management ITHAKA.
JHOVE2 A Next-Generation Architecture for Format-Aware Preservation Processing Stephen Abrams Harvard University Evan Owens Portico Tom Cramer Stanford.
Mark Evans, Tessella Digital Preservation Boot Camp – PASIG meeting, Washington DC, 22 nd May 2013 PREMIS Practical Strategies For Preservation Metadata.
SOAPI: a flexible toolkit for implementing ingest and preservation workflows Mark Hedges Centre for e-Research, King’s College London Arts and Humanities.
Merrilee Proffitt e(X)literature / Digital Cultures Project April 2003 News from the Digital Library The Metadata Encoding and Transmission Standard; the.
On the Effective Manipulation of Digital Objects Libraries Computer Center Department of Informatics & Telecommunications University of Athens A Prototype-based.
1 Using Scalable and Secure Web Technologies to Design Global Format Registry Muluwork Geremew, Sangchul Song and Joseph JaJa Institute for Advanced Computer.
PAWN: A Novel Ingestion Workflow Technology for Digital Preservation
© 2010 Microsoft Corporation. All rights reserved. Quality Assurance: Towards Tools for Characterizing and Comparing Digital Documents Natasa Milic-Frayling.
Tools and Services for the Long Term Preservation and Access of Digital Archives Joseph JaJa, Mike Smorul, and Sangchul Song Institute for Advanced Computer.
PAWN: A Novel Ingestion Workflow Technology for Digital Preservation Mike Smorul, Joseph JaJa, Yang Wang, and Fritz McCall.
Demonstration of repositories Fedora (Flexible Extensible Digital Object Repository Architecture) Marie Lagerwall MIDESS Partners Meeting February 9, 2007.
A Framework for Distributed Preservation Workflows Rainer Schmidt AIT Austrian Institute of Technology iPres 2009, Oct. 5, San.
ETD Repositories Using DSpace Software Andrew Penman The Robert Gordon University 27 th September 2004.
SCAPE Andy Jackson The British Library SCAPEdev1 AIT, Vienna - 6 th – 7 th June 2011 PC Integration Plan First SCAPE Developers’ Workshop.
2005 Adobe Systems Incorporated. All Rights Reserved. 1 Ontolog Forum Gunar Penikis Sr. Product Manager Adobe Systems.
Digital Preservation Dale Flecker Stephen Abrams February 15, 2007 HUL University Library Council.
Statewide Digitization and the FCLA Digital Archive Priscilla Caplan, Florida Center for Library Automation Statewide Digitization Planners Meeting OCLC,
Catherine Masi, National Geospatial Digital Archive May 16, 2005 NGDA Format Registry  Why do we need a FR? We are designing with long-term storage in.
Repositories collect lots of technical metadata, but lack tools to use it to better understand the objects in their care, and to apply it precisely in.
Tackling concrete digital preservation challenges with SPRUCE Paul Wheatley SPRUCE Project Manager University of Leeds Twitter:
San Diego Supercomputer CenterUniversity of California, San Diego Preservation Research Roadmap Reagan W. Moore San Diego Supercomputer Center
Collections Management Museums What’s new in EMu? Alex Fell Operations Manager KE Software UK.
How to build your own Dark Archive (in your spare time) Priscilla Caplan FCLA.
1 XML as a preservation strategy Experiences with the DiVA document format Eva Müller, Uwe Klosa Electronic Publishing Centre Uppsala University Library,
H ARVARD U NIVERSITY L IBRARY The Global Digital Format Registry (GDFR) Project Stephen Abrams Harvard University Andreas Stanescu OCLC CNI Fall Task Force.
Per Møldrup-Dalum State and University Library SCAPE Information Day State and University Library, Denmark, SCAPE Scalable Preservation Environments.
Preservation and Archiving Special Interest Group Spring Meeting San Francisco, May 2008 Preservation Characterization Stephen Abrams California.
November 1, 2006IU DLP Brown Bag : Fall Data Integrity and Document- centric XML Using Schematron for Managing Text Collections Dazhi Jiao, Tamara.
Introduction to Apache OODT Yang Li Mar 9, What is OODT Object Oriented Data Technology Science data management Archiving Systems that span scientific.
ESRI User Conference, August 8, 2006 Long-term archiving of geospatial data: the NGDA project Julie Sweetkind-Singer John Banning Stanford University.
A historical perspective of Digital Preservation at The Royal Library, Denmark.
University of California Libraries Digital library building blocks: Empowering libraries in an increasingly competitive online information space Daniel.
File format registries - a global infrastructure for local persistence Andreas Aschenbrenner, ERPANET.
Ocean Observatories Initiative Data Management (DM) Subsystem Overview Michael Meisinger September 29, 2009.
The FCLA Digital Archive Joint Meeting of CSUL Committees, 2005.
AIIM Standards Betsy Fanning Director, Standards & Content, AIIM.
PREMIS Implementation Fair – SF 2009 PREMIS use in Rosetta Yair Brama – Ex Libris.
Conceptual Data Modelling for Digital Preservation Planets and PREMIS Angela Dappert.
Grid programming with components: an advanced COMPonent platform for an effective invisible grid © 2006 GridCOMP Grids Programming with components. An.
The New DRS Introduction. What is DRS? Digital repository for preservation and access – Maintains integrity of deposited content – Preserves content for.
JH VE 2 JHOVE2 A Next-Generation Architecture for Format-Aware Characterization British Library, 1 October 2008 Stephen Abrams California Digital Library.
JH VE 2 Digital Library Federation Fall Forum Providence, November 12-14, 2008 JH VE 2 Needs Assessment and Functional Requirements Stephen Abrams California.
Implementing PREMIS in DigiTool Michael Kaplan ALA 2007 Update.
DAITSS and the Florida Digital Archive Priscilla Caplan Florida Center for Library Automation iPRES 2006.
1 Data Architecture Strawman - Grimshaw Important points Everything is a service (object) >All have a name (EPR) and an interface (type) One or more base.
Utilizing the Benefits of Native XML Database Technologies Alan Cornish Systems Librarian Washington State University Libraries.
Repository-specific Spoke Scripts Content Repository JSR-170/283 Content Repository for Java Technology API Normalized H&S METS Files METS Import/ExportMETS.
Fedora Service Framework Sandy Payette, Executive Director UK Fedora Training London January 22-23, 2009.
Preservation Functionality in a Digital Archive Erik Oltmans Koninklijke Bibliotheek Raymond J. van Diessen IBM Business Consulting Services Hilde van.
Transparent Format Migration of Preserved Web Content D. S. H. Rosenthal, T. Lipkis, T. S. Robertson, S. Morabito Lib Magazine, 11(1), 2005
A Semi-Automated Digital Preservation System based on Semantic Web Services Jane Hunter Sharmin Choudhury DSTC PTY LTD, Brisbane, Australia Slides by Ananta.
Joint Meeting of CSUL Committees,
PASIG Bootcamp: Image Formats Robert Buckley NewMarket Imaging/
The National Archives Washington DC July 10, 2008
The Global Digital Format Registry (GDFR) Project
DAITSS and the Florida Digital Archive
An Introduction to Tessella and The Safety Deposit Box Platform
Joseph JaJa, Mike Smorul, and Sangchul Song
Global Digital Format Registry (GDFR)
Statewide Digitization and the FCLA Digital Archive
VI-SEEM Data Repository
Andrea Goethals, Harvard Library
Malte Dreyer – Matthias Razum
Presentation transcript:

JH VE 2 The Fifth International Conference on Preservation of Digital Objects British Library, September 2008 What? So What? The Next-Generation JHOVE2 Architecture for Format-Aware Characterization Stephen Abrams California Digital Library Sheila Morrissey Portico Tom Cramer Stanford University

JH VE 2 Digital preservation Managing the gap between what you where given and what you need That gap is only manageable if it is quantifiable Characterization tells you what you have, as the starting point for iterative preservation planning and action Adopted from A. Brown, “Developing Practical Approaches to Active Preservation,” IJDC 2:1 (June 2007): 3-11.

JH VE 2 What? So what? What is it? –IdentificationDetermining presumptive format through signature matching What is it, really? –ValidationDetermining conformance to commonly- accepted normative requirements What about it? –Feature extractionReporting intrinsic properties significant to preservation planning and action What should you do with it? –AssessmentDetermining acceptability for a given purpose on the basis of locally-defined policies

JH VE 2 Characterization in ingest workflows

JH VE 2 Characterization in migration workflows

JH VE 2 JH VE 2 project A next-generation architecture for format-aware object characterization –Three-fold goals: Re-factor the existing architecture to achieve higher performance, simplify system integration, and encourage third-party enhancement Provide significant new function Implement modules Collaborative project of CDL, Portico, and Stanford University –Funded by Library of Congress/NDIIPP –Open source BSD license

JH VE 2 What is a format, anyway? A set of syntactic and semantic rules for mapping between abstract information content and bit streams If we are interested in preserving content, and not merely bit streams, managing format is fundamentally important ffd8ffe000104a ffed0fb f746f73686f e d03e90a e e666f f40240ffeeffee

JH VE 2 What is a format, anyway? A set of syntactic and semantic rules for mapping between abstract information content and bit streams If we are interested in preserving content, and not merely bit streams, managing format is fundamentally important ffd8ffe000104a ffed0fb f746f73686f e d03e90a e e666f f40240ffeeffee SOI APP0 JFIF 1.2 APP13 IPTC APP2 ICC DQT SOF0 183x512 DRI DHT SOS ECS0...

JH VE 2 What is a format, anyway? A set of syntactic and semantic rules for mapping between abstract information content and bit streams If we are interested in preserving content, and not merely bit streams, managing format is fundamentally important ffd8ffe000104a ffed0fb f746f73686f e d03e90a e e666f f40240ffeeffee SOI APP0 JFIF 1.2 APP13 IPTC APP2 ICC DQT SOF0 183x512 DRI DHT SOS ECS0...

JH VE 2 Objects, not files JHOVE assumed 1 object = 1 file = 1 format But what about… –TIFF with embedded ICC profile and XMP metadata 1 object = 1 file = 3 formats –JPEG 2000 JPX fragmentation 1 object = n files = 1 format –ESRI Shapefile 1 object = 3 files = 3 formats JHOVE2 will support 1 object = n files = m formats

JH VE 2 Objects, not files

JH VE 2 Other enhancements Generic plug-in interface Configurable set of modules iteratively invoked against each object Common data structure passed between modules to enable stateful processing Identification de-coupled from validation Standardized handling of format profiles and error reporting Symbolic display of binary formats API-level support for editing

JH VE 2 Data abstraction Based on the “natural” conceptual structures of a format and their component attributes –Each such structure maps to a class with methods for parsing, validating, reporting, and serializing –Each such attribute maps to a field with accessor and mutator methods UTF-8  Character TIFF  Image File Header and Image File Directory JPEG 2000  Box PDF  boolean, number, string, name, array, dictionary, and stream

JH VE 2 Format support Based on project partner requirements and budgetary constraints –Image:JPEG 2000, TIFF –Audio:WAVE –Text:SGML, UTF-8, XML –Document:PDF –GIS:Shapefile –Color:ICC –And their well-known variants, e.g. TIFF/IT, TIFF/EP, GeoTIFF, EXIF, DNG, … Unfortunately precluding some JHOVE-supported formats –AIFF, GIF, HTML, JPEG

JH VE 2 Technical components Java 1.5 –java.nio package OSGi/Spring frameworks –Component versioning and dependency management –Fine-grained control of component invocation –Inversion of control SourceForge –Distribution platform –Issue tracking

JH VE 2 Schedule Months 1-6 Outreach, design, and prototyping Months 7-9 Core APIs and framework Months Module implementation

JH VE 2 Advisory board Deutsche Nationalbibliothek Ex Libris Fedora Florida Center for Library Automation Harvard University Koninklijke Bibliotheek MIT/DSpace National Archives (UK) National Library of Australia National Library of New Zealand Planets project

JH VE 2 Questions? Wikiconfluence.ucop.edu/display/JHOVE2Info/Homeconfluence.ucop.edu/display/JHOVE2Info/Home Mailing listsJHOVE2-Announce-L JHOVE2-Techtalk-L (Subscribe via the wiki)