Harvard University Library: The Global Digital Format Registry (GDFR) Project. Stephen Abrams, Harvard University; Andreas Stanescu, OCLC. CNI Fall Task Force Meeting.

Presentation transcript:

HARVARD UNIVERSITY LIBRARY
The Global Digital Format Registry (GDFR) Project
Stephen Abrams, Harvard University
Andreas Stanescu, OCLC
CNI Fall Task Force Meeting, Washington, DC, December 10-11, 2007

Digital preservation and format
Preservation is concerned with ensuring access to managed digital assets over time
Thus, preservation activities are focused on
– Viability
– Fixity
– Authenticity
– Interpretability
– Renderability
The last two are primarily a function of format

Without format typing, all content is opaque
ffd8ffe000104a ffed0fb050686f74 6f73686f e d 03e90a e e666f f40240ffeeffee fc d f d03ed0a f6c f 6e a

Without format typing, all content is opaque
ffd8ffe000104a ffed0fb050686f74 6f73686f e d 03e90a e e666f f40240ffeeffee fc d f d03ed0a f6c f 6e a
SOI APP0 JFIF 1.2 APP13 IPTC APP2 ICC DQT SOF0 183x512 DRI DHT SOS ECS0 RST0 ECS1 RST1 ECS2...
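The step from raw bytes to the marker sequence on this slide can be sketched roughly as follows. This is an illustrative walk over JPEG marker segments, not the GDFR's own code; the function name `list_segments` and the hand-built `sample` bytes are assumptions made for the example (the hex dump on the slide is incomplete, so it is not used here).

```python
import struct

# Marker names as defined by the JPEG and JFIF specifications.
MARKER_NAMES = {
    0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13",
    0xDB: "DQT", 0xC0: "SOF0", 0xDD: "DRI", 0xC4: "DHT",
    0xDA: "SOS", 0xD9: "EOI",
}


def list_segments(data: bytes) -> list[str]:
    """Return the marker names found from SOI up to and including SOS."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream: missing SOI marker")
    names = ["SOI"]
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        code = data[pos + 1]
        names.append(MARKER_NAMES.get(code, f"0x{code:02X}"))
        if code == 0xDA:
            # Entropy-coded data follows SOS; stop the simple walk here.
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + length
    return names


# Hypothetical, hand-built stream: SOI, APP0 (JFIF 1.2), a stub DQT, SOS.
sample = (
    b"\xff\xd8"
    + b"\xff\xe0\x00\x10" + b"JFIF\x00\x01\x02" + bytes(7)
    + b"\xff\xdb\x00\x04" + bytes(2)
    + b"\xff\xda\x00\x02"
)
print(list_segments(sample))
```

Only a reader that already knows the stream is JPEG can apply these rules, which is the point of the slide: the format type has to be known before the bytes can be interpreted.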

Global Digital Format Registry
“The Global Digital Format Registry (GDFR) will provide sustainable services to collect, review, store, discover, and deliver significant representation information about digital formats.”
– Centrally-organized collection and review
– Distributed storage, discovery, and delivery on a network of independent, but cooperating registries

What is a format?
“A serialized encoding of an abstract information model”
Encompasses the nominal sense of “file format” as well as a range of conceptual entities from the micro to the macro level
– IEEE 754 floating point number
– File system
– In both cases, there are well-defined syntactic and semantic rules for mapping from information to bits, and back again

What’s wrong with MIME types?
Non-standardized documentation
Intended for human, not machine consumption
Coarse granularity
– image/tiff vs.
TIFF 4.0 – 6.0
Baseline Class B, G, P, R
Extension Class Y
TIFF/EP
TIFF/IT with file types CT, LW, HC, MP, BP, BL, FP
Exif 2.0 – 2.2
GeoTIFF
TIFF/FX
DNG

GDFR project
Two DLF-sponsored invitational workshops
– University of Pennsylvania, January 2003
– Washington, March 2003
Two independent demonstration projects
– FRED [John Ockerbloom, University of Pennsylvania] tom.library.upenn.edu/fred/
– FOCUS [Joseph JaJa, University of Maryland]

GDFR project
Harvard University Library (HUL) funded for 2 years by the Andrew W. Mellon Foundation
Staffing and technical work subcontracted by HUL to OCLC (July 2006)

GDFR project oversight
Technical Working Group (TWG)
– Bibliothèque nationale de France
– British Library
– California Digital Library
– Digital Curation Centre, UK
– Library of Congress
– National Archives, UK
– National Archives and Records Administration
– National Library of Australia
– National Library of New Zealand
– Stanford University
– University of Pennsylvania

General development goals
A generalized registry framework, specialized for the distributed GDFR application
Based on well-known products and protocols
Human and machine interfaces
Full information content expressible in XML form, from which it can be re-instantiated
Platform independence
Globally fault tolerant
Open source
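The goal of expressing a record fully in XML and re-instantiating it from that expression amounts to a lossless round trip. A minimal sketch, assuming a flat record of string fields; the element names (`format`, `name`, `mimeType`) are hypothetical and are not taken from the GDFR schema.

```python
import xml.etree.ElementTree as ET


def to_xml(record: dict[str, str]) -> str:
    """Serialize a flat record as one child element per field."""
    root = ET.Element("format")  # hypothetical root element name
    for name, value in record.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> dict[str, str]:
    """Rebuild the record from its XML expression."""
    return {child.tag: child.text or "" for child in ET.fromstring(text)}


record = {"name": "Tagged Image File Format", "mimeType": "image/tiff"}
xml_text = to_xml(record)
print(xml_text)
print(from_xml(xml_text) == record)
```

A real registry record is nested and typed, so the round-trip check would need a richer mapping than this flat one.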

GDFR data model
Consistent with PRONOM registry

Identifiers
Canonical, GDFR-assigned identifier
– “info” URI: info:rfa/gdfr1/Formats/1
Other well-known identifiers
– Common name: “TIFF”, “Tagged Image File Format”
– MIME type: image/tiff
– PRONOM identifier: info:pronom/fmt/7
– Library of Congress Format Description Document (FDD) identifier: fdd000022
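One way to picture the relationship between the canonical identifier and the other well-known identifiers is an alias table that resolves any of them to the canonical "info" URI. The sketch below uses only the identifier values shown on the slide; the `ALIASES` table and the `resolve` function are hypothetical names for illustration and do not describe the registry's actual interface.

```python
from typing import Optional

CANONICAL = "info:rfa/gdfr1/Formats/1"

# Identifier values as listed on the slide, all pointing at one format.
ALIASES = {
    "TIFF": CANONICAL,
    "Tagged Image File Format": CANONICAL,
    "image/tiff": CANONICAL,
    "info:pronom/fmt/7": CANONICAL,
    "fdd000022": CANONICAL,
}


def resolve(identifier: str) -> Optional[str]:
    """Map any known identifier to the canonical GDFR identifier."""
    if identifier == CANONICAL:
        return CANONICAL
    return ALIASES.get(identifier)


print(resolve("image/tiff"))
print(resolve("application/x-unknown"))
```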

Classification scheme
Eight facets
– Genre (required): text, still-image, sound, aggregate, …
– Role (required): family, file-format, encoding, serialization
– Composition: unitary, container-bundle, container-wrapper
– Form: binary, text
– Constraint: structured, unstructured
– Basis: sampled, symbolic
– Domain: astronomy, cad-cam, gis, web-archive, …
– Transform: compression, encryption, message-digest, …

Classification scheme
Examples
– TIFF (Tagged Image File Format): genre:still-image role:family composition:container-wrapper form:binary basis:sampled
– LZW (Lempel-Ziv-Welch): genre:still-image role:encoding transform:compression
– SVG (Scalable Vector Graphics): genre:still-image role:file-format form:text basis:symbolic
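The facet notation in these examples is regular enough to parse and check mechanically. The following sketch assumes the `facet:value` token syntax used on the slide and enforces only the two facets marked as required; the function name `parse_classification` is hypothetical, and value vocabularies are not validated because the slide elides parts of them.

```python
REQUIRED_FACETS = ("genre", "role")
KNOWN_FACETS = (
    "genre", "role", "composition", "form",
    "constraint", "basis", "domain", "transform",
)


def parse_classification(text: str) -> dict[str, str]:
    """Parse 'facet:value' tokens and check that required facets are present."""
    facets = {}
    for token in text.split():
        name, _, value = token.partition(":")
        if name not in KNOWN_FACETS or not value:
            raise ValueError(f"unrecognized facet token: {token!r}")
        facets[name] = value
    missing = [name for name in REQUIRED_FACETS if name not in facets]
    if missing:
        raise ValueError(f"missing required facets: {missing}")
    return facets


tiff = parse_classification(
    "genre:still-image role:family composition:container-wrapper "
    "form:binary basis:sampled"
)
print(tiff["role"])
```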

Signatures
External signatures
– File extension
– Mac OS type
– Mac OS X Uniform Type Identifiers (UTI)
Internal signatures
– “Magic numbers”
– Required vs. optional
– Fixed vs. restricted vs. unrestricted
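The distinction between internal and external signatures can be illustrated with a small identifier that prefers fixed-offset magic numbers and falls back to the file extension. The magic values below are the widely documented ones for JPEG, TIFF, and PDF; the table layout and the `identify` function are assumptions for this sketch and do not reflect how GDFR records store signatures.

```python
import os

# (offset, magic bytes, format name): fixed-position internal signatures.
INTERNAL_SIGNATURES = [
    (0, b"\xff\xd8\xff", "JPEG"),
    (0, b"II*\x00", "TIFF"),   # little-endian TIFF header
    (0, b"MM\x00*", "TIFF"),   # big-endian TIFF header
    (0, b"%PDF-", "PDF"),
]

# External signatures: file extensions only, for brevity.
EXTERNAL_SIGNATURES = {
    ".jpg": "JPEG", ".jpeg": "JPEG",
    ".tif": "TIFF", ".tiff": "TIFF",
    ".pdf": "PDF",
}


def identify(filename: str, head: bytes):
    """Return (format, evidence), trusting file content over the file name."""
    for offset, magic, fmt in INTERNAL_SIGNATURES:
        if head[offset:offset + len(magic)] == magic:
            return fmt, "internal"
    extension = os.path.splitext(filename)[1].lower()
    if extension in EXTERNAL_SIGNATURES:
        return EXTERNAL_SIGNATURES[extension], "external"
    return None, "unknown"


# A TIFF file that has been misnamed with a .pdf extension.
print(identify("scan.pdf", b"II*\x00\x08\x00\x00\x00"))
```

Checking content first is a design choice for this example: an extension is easy to change, while a required internal signature is part of the format itself.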

Grammar
Formal description of the syntactic grammar underlying a format, expressed in some formal typed notation
– BNF: Backus-Naur Form
– BSDL: MPEG-21 Bitstream Syntax Description Language
– DFDL: Data Format Description Language
– EAST: CCSDS B-2
– XCEL: Extensible Characterisation Extraction Language

Assessment
Assessment of a format, expressed in some formal typed notation
– Cornell Virtual Remote Control (VRC)
– DSTC PANIC
– Library of Congress Sustainability, Quality, Function (SQF)
– National Library of Australia AONS
– OCLC INFORM

Documentation
Specification documents (and software files) can be managed and distributed in the network
– Applicable only in cases of public domain resources or if explicit permission is granted by rights holders
– Other documents (and software) will be referenced by full citation, including actionable links where possible
– Mechanism for individuals or institutions to register locally-held copies, with terms of use

Software
Format role: input, output
Process type: characterize, create, edit, identify, …
Enables discovery of transformative processing chains
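Recording each tool's input and output formats makes a processing chain discoverable as a path through a graph of formats. A minimal breadth-first search sketch follows; the tool names and the `TOOLS` table are entirely hypothetical and stand in for whatever software records a registry holds.

```python
from collections import deque

# (tool name, input format, output format): hypothetical registry entries.
TOOLS = [
    ("tool-a", "Word 97", "PDF 1.4"),
    ("tool-b", "PDF 1.4", "PDF/A"),
    ("tool-c", "TIFF 6.0", "JPEG"),
]


def find_chain(source: str, target: str):
    """Return the shortest list of tools converting source to target, or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, chain = queue.popleft()
        if fmt == target:
            return chain
        for tool, tool_input, tool_output in TOOLS:
            if tool_input == fmt and tool_output not in seen:
                seen.add(tool_output)
                queue.append((tool_output, chain + [tool]))
    return None


print(find_chain("Word 97", "PDF/A"))
print(find_chain("JPEG", "PDF/A"))
```

Breadth-first search returns a chain with the fewest steps; a production system would likely also weigh the quality or loss of each transformation, which this sketch ignores.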

Relationships
Modification: BWF → WAVE
– Extension: DNG → TIFF 6.0
– Restriction: PDF/A → PDF 1.4
Definition: NITF → XML DTD
Requisite: XML → Relax NG
Containment: ZIP → *
Equivalence: DXF (ASCII) → DXF (binary)
Version: Word 97 → Word 6.0
Affinity: SPIFF → JPEG

GDFR node
Based on the OCLC IWSA / RFA framework

GDFR node
Java, Apache/Tomcat, Berkeley DB XML
GNU LGPL license
– Including pre-existing OCLC technology and technology newly developed for the project
Release schedule
– v0.1 (alpha): March 23, 2007
– v0.1 (beta): June 14, 2007
– v1.0: June 30, 2007
– v1.1: August 12, 2007
– v1.3: September 17, 2007
– v1.3.1: October 26, 2007

GDFR network
Peer-to-peer network of independent, but cooperating registries communicating over a common protocol

GDFR network
Public notification of the availability of new data
– RSS feed available at well-known public address to which remote nodes can subscribe
Remote harvesting of local data
– OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)
Initially, a single source (root node) for all new data
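On the harvesting side, an OAI-PMH request is an HTTP query with a `verb` and a few standard arguments. The sketch below only builds a `ListRecords` request URL for records changed since a given date; the base URL is a placeholder, and the slides do not say which metadata prefix a GDFR node exposes, so `oai_dc` (the format every OAI-PMH repository must support) is assumed.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (selective by date if given)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # UTC date in YYYY-MM-DD form
    return f"{base_url}?{urlencode(params)}"


# Placeholder address for a remote node, not a real GDFR endpoint.
print(list_records_url("http://registry.example.org/oai",
                       from_date="2007-10-26"))
```

A harvesting node could issue such a request after the RSS feed announces new data, then follow any `resumptionToken` in the response to page through the results.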

Project status
Extensive internal testing of GDFR software in a stand-alone mode
Current project activities are focused on
– Implementing the distribution and synchronization functions
– Building the network
– Data acquisition
– Succession planning

Initial population
Manual addition is possible, but time consuming
Automated update using Atom
What sources are available for bulk population?
– PRONOM registry
– Library of Congress Format Description Documents (FDD)
– Unix / Linux magic(4) database

Subsequent population
RFC 2026, Internet Standards Process
– “Iterations of review by the... community and revision based upon experience”
Draft distribution and public discussion
Approval by “area” editors
Release to the network for distribution

Sustainability
The technological solution is the (relatively) easy part, but…
– The technology is expendable
– The important point is for the data to survive, evolve, and expand

Governance and succession
Mellon funding was for technical work only
At the end of the two-year project…
– Harvard will continue maintenance for up to two years
– Library of Congress has agreed to be a caretaker agency until a permanent body is identified

Governance and succession
NARA GDFR governance investigation
– Part of the Electronic Records Archives (ERA) initiative
– GDFR Governance Workshop, November 2007
Bibliothèque et Archives, Canada
Corp. for National Research Initiatives
Digital Curation Centre, UK
Digital Library Federation
General Services Administration
Georgia Institute of Technology
Government Printing Office
Harvard University
IBM Watson Research Center
Koninklijke Bibliotheek, Netherlands
Library of Congress
MIT
NARA
NASA
NIST
National Library of Australia
National Library of New Zealand
San Diego Supercomputer Center
Stanford University
Statens Archiv, Sweden
Tessella Support Services
University of Pennsylvania

Administrative considerations
Policy
– Who (and how many) can join the network?
– What are the eligibility requirements?
– What are the rights and obligations of membership?
Technical
– Who will maintain and enhance the data model?
– Who will maintain, enhance, distribute, and support the software?

Administrative considerations
Data
– Who will contribute data?
– Who will vouch for data authenticity?
– Who will ensure data integrity?
Financial
– What are the real human and system costs associated with GDFR operation?
– Who pays, and how?

Summary
The GDFR is an enabling technology that will support digital repository and preservation activities
– Supports the strong typing of digital assets at an appropriate level of granularity
– Enables the future recovery of the syntax and semantics associated with typed digital assets
– Provides a means to pool and redistribute the expertise of the international digital preservation community

For more information…