The Global Digital Format Registry (GDFR) Project


CNI Fall Task Force Meeting, Washington, DC, December 10-11, 2007
Stephen Abrams, Harvard University
Andreas Stanescu, OCLC

Digital preservation and format
- Preservation is concerned with ensuring access to managed digital assets over time
- Thus, preservation activities are focused on
  - Viability
  - Fixity
  - Authenticity
  - Interpretability
  - Renderability
- The last two are primarily a function of format

Without format typing, all content is opaque
  ffd8ffe000104a46494600010201
  008300830000ffed0fb050686f74
  6f73686f7020332e30003842494d
  03e90a5072696e7420496e666f00
  0000007800000000004800480000
  000002f40240ffeeffee03060252
  0347052803fc0002000000480048
  0000000002d80228000100000064
  000000010003030300000001270f
  0001000100000000000000000000
  0000600800190190000000000000
  0000000000000000000000000000
  0000000000000000000000003842
  494d03ed0a5265736f6c7574696f
  6e0000000010008313a3000200
  ...
Typed as JPEG, the same bytes parse as: SOI, APP0 (JFIF 1.2), APP13 (IPTC), APP2 (ICC), DQT, SOF0 (183x512), DRI, DHT, SOS, ECS0, RST0, ECS1, RST1, ECS2, ...
Rendered: Edward Burne-Jones (British, 1833-1898), The Days of Creation: the First Day, 1870-1876. Watercolor and gouache, 102.2×35.5 cm. Fogg Art Museum, Harvard University, 1943.454. Bequest of Grenville L. Winthrop
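
The step from opaque bytes to named segments can be sketched in a few lines of Python. This is only an illustration, not GDFR code: it assumes the standard JPEG layout of a two-byte marker followed by a two-byte big-endian length, walks only the first 42 bytes of the dump, and does not handle entropy-coded data or the RSTn markers. The names `walk_segments` and `jfif_version` are made up for the example.

```python
import struct

# Marker codes from the JPEG standard that appear on the slide.
MARKER_NAMES = {
    0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13",
    0xDB: "DQT", 0xC0: "SOF0", 0xDD: "DRI", 0xC4: "DHT", 0xDA: "SOS",
}

# First 42 bytes of the dump shown on the slide.
HEADER = bytes.fromhex(
    "ffd8ffe000104a46494600010201"
    "008300830000ffed0fb050686f74"
    "6f73686f7020332e30003842494d"
)


def walk_segments(data: bytes):
    """Return (name, payload) pairs for the marker segments found in data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("no SOI marker: not a JPEG stream")
    segments = [("SOI", b"")]
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        # The payload is cut short when the excerpt ends before the segment does.
        payload = data[pos + 4:pos + 2 + length]
        segments.append((MARKER_NAMES.get(marker, f"0x{marker:02X}"), payload))
        pos += 2 + length
    return segments


def jfif_version(app0_payload: bytes):
    """Return (major, minor) from a JFIF APP0 payload, or None if it is not JFIF."""
    if app0_payload[:5] != b"JFIF\x00":
        return None
    return app0_payload[5], app0_payload[6]


segments = walk_segments(HEADER)
names = [name for name, _ in segments]
version = jfif_version(segments[1][1])
```

Without the knowledge that the stream is JPEG, none of these offsets or lengths has any meaning, which is the point of the slide.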

Global Digital Format Registry
- “The Global Digital Format Registry (GDFR) will provide sustainable services to collect, review, store, discover, and deliver significant representation information about digital formats.”
- Centrally-organized collection and review
- Distributed storage, discovery, and delivery on a network of independent, but cooperating registries

What is a format?
- “A serialized encoding of an abstract information model”
- Encompasses the nominal sense of “file format” as well as a range of conceptual entities from the micro to the macro level
  - IEEE 754 floating point number
  - File system
- In both cases, there are well-defined syntactic and semantic rules for mapping from information to bits, and back again

What’s wrong with MIME types?
- Non-standardized documentation
  - Intended for human, not machine consumption
- Coarse granularity
  - image/tiff vs.
    - TIFF 4.0 – 6.0
      - Baseline Class B, G, P, R
      - Extension Class Y
    - TIFF/EP
    - TIFF/IT with file types CT, LW, HC, MP, BP, BL, FP
    - Exif 2.0 – 2.2
    - GeoTIFF
    - TIFF/FX
    - DNG

GDFR project
- Two DLF-sponsored invitational workshops
  - University of Pennsylvania, January 2003
  - Washington, March 2003
- Two independent demonstration projects
  - FRED [John Ockerbloom, University of Pennsylvania] tom.library.upenn.edu/fred/
  - FOCUS [Joseph JaJa, University of Maryland] www.umiacs.umd.edu/~joseph/focus-archiving06.pdf
- FRED, Format Registry Demonstrator, TOM (Typed Object Model)
- FOCUS, Format Curation Service (LDAP)

GDFR project
- Harvard University Library (HUL) funded for 2 years by the Andrew W. Mellon Foundation
- Staffing and technical work subcontracted by HUL to OCLC (July 2006)

GDFR project oversight
- Technical Working Group (TWG)
  - Bibliothèque nationale de France
  - British Library
  - California Digital Library
  - Digital Curation Centre, UK
  - Library of Congress
  - National Archives, UK
  - National Archives and Records Administration
  - National Library of Australia
  - National Library of New Zealand
  - Stanford University
  - University of Pennsylvania

General development goals
- A generalized registry framework, specialized for the distributed GDFR application
- Based on well-known products and protocols
- Human and machine interfaces
- Full information content expressible in XML form, and can be re-instantiated from that expression
- Platform independence
- Globally fault tolerant
- Open source

GDFR data model
- Consistent with PRONOM registry

Identifiers
- Canonical, GDFR-assigned identifier
  - “info” URI: info:rfa/gdfr1/Formats/1
- Other well-known identifiers
  - Common name: “TIFF”, “Tagged Image File Format”
  - MIME type: image/tiff
  - PRONOM identifier: info:pronom/fmt/7
  - Library of Congress Format Description Document (FDD) identifier: fdd000022
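
One way to picture the identifier slide is a record that carries the canonical identifier plus its aliases, with an index from any alias back to the canonical "info" URI. The sketch below is an assumption about how such a lookup might be organized, not the GDFR data model; `FormatRecord` and `build_alias_index` are hypothetical names, and only the identifier values come from the slide.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FormatRecord:
    canonical_id: str  # GDFR-assigned "info" URI
    common_names: list = field(default_factory=list)
    mime_types: list = field(default_factory=list)
    pronom_ids: list = field(default_factory=list)
    loc_fdd_ids: list = field(default_factory=list)

    def aliases(self):
        """All non-canonical identifiers known for this format."""
        return (self.common_names + self.mime_types
                + self.pronom_ids + self.loc_fdd_ids)


def build_alias_index(records):
    """Map every identifier to the set of canonical ids it may refer to."""
    index = defaultdict(set)
    for record in records:
        index[record.canonical_id].add(record.canonical_id)
        for alias in record.aliases():
            index[alias].add(record.canonical_id)
    return index


TIFF = FormatRecord(
    canonical_id="info:rfa/gdfr1/Formats/1",
    common_names=["TIFF", "Tagged Image File Format"],
    mime_types=["image/tiff"],
    pronom_ids=["info:pronom/fmt/7"],
    loc_fdd_ids=["fdd000022"],
)
INDEX = build_alias_index([TIFF])
```

The index maps to a set rather than a single value because a coarse identifier such as a MIME type may legitimately point at several finer-grained registry entries.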

Classification scheme
- Eight facets
  - Genre (required): text, still-image, sound, aggregate, …
  - Role (required): family, file-format, encoding, serialization
  - Composition: unitary, container-bundle, container-wrapper
  - Form: binary, text
  - Constraint: structured, unstructured
  - Basis: sampled, symbolic
  - Domain: astronomy, cad-cam, gis, web-archive, …
  - Transform: compression, encryption, message-digest, …

Classification scheme
- Examples
  - TIFF (Tagged Image File Format): genre:still-image role:family composition:container-wrapper form:binary basis:sampled
  - LZW (Lempel-Ziv-Welch): genre:still-image role:encoding transform:compression
  - SVG (Scalable Vector Graphics): genre:still-image role:file-format form:text basis:symbolic
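
The facet:value notation used in the examples lends itself to a simple check. The sketch below assumes that facets listed without an ellipsis (role, composition, form, constraint, basis) have closed vocabularies and that genre, domain, and transform are open-ended; that split is an inference from the slide, and `parse_facets` and `validate` are hypothetical helper names.

```python
REQUIRED_FACETS = ("genre", "role")

# Facets whose values are listed exhaustively on the slide (assumed closed).
CLOSED_FACETS = {
    "role": {"family", "file-format", "encoding", "serialization"},
    "composition": {"unitary", "container-bundle", "container-wrapper"},
    "form": {"binary", "text"},
    "constraint": {"structured", "unstructured"},
    "basis": {"sampled", "symbolic"},
}

# Facets whose value lists end in an ellipsis (assumed open-ended).
OPEN_FACETS = {"genre", "domain", "transform"}


def parse_facets(text):
    """Turn 'genre:still-image role:family' into a dict."""
    return dict(pair.split(":", 1) for pair in text.split())


def validate(facets):
    """Return a list of problems; an empty list means the classification is acceptable."""
    problems = []
    for name in REQUIRED_FACETS:
        if name not in facets:
            problems.append(f"missing required facet: {name}")
    for name, value in facets.items():
        if name in CLOSED_FACETS:
            if value not in CLOSED_FACETS[name]:
                problems.append(f"unknown value for {name}: {value}")
        elif name not in OPEN_FACETS:
            problems.append(f"unknown facet: {name}")
    return problems


tiff_problems = validate(parse_facets(
    "genre:still-image role:family composition:container-wrapper "
    "form:binary basis:sampled"
))
```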

Signatures
- External signatures
  - File extension
  - Mac OS type
  - Mac OS X Uniform Type Identifiers (UTI)
- Internal signatures
  - “Magic numbers”
- Required vs. optional
- Fixed vs. restricted vs. unrestricted
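
Internal signature matching can be sketched as a byte-pattern search. The reading of "fixed vs. restricted vs. unrestricted" as constraints on where the pattern may start is an assumption, as are the class name `InternalSignature`, the `window` parameter, and the 1024-byte window chosen for PDF; the magic numbers themselves are the well-known ones for JPEG, TIFF, and PDF. The required vs. optional distinction is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InternalSignature:
    format_name: str
    pattern: bytes
    offset: int = 0
    # window == 0: fixed (pattern must start exactly at offset)
    # window == n: restricted (pattern may start anywhere in offset..offset+n)
    # window is None: unrestricted (pattern may start anywhere at or after offset)
    window: Optional[int] = 0

    def matches(self, data: bytes) -> bool:
        if self.window is None:
            return data.find(self.pattern, self.offset) != -1
        end = self.offset + self.window + len(self.pattern)
        return data.find(self.pattern, self.offset, end) != -1


SIGNATURES = [
    InternalSignature("JPEG", b"\xff\xd8\xff"),
    InternalSignature("TIFF", b"II*\x00"),   # little-endian byte order
    InternalSignature("TIFF", b"MM\x00*"),   # big-endian byte order
    InternalSignature("PDF", b"%PDF-", window=1024),  # illustrative window
]


def identify(data: bytes):
    """Return the names of all formats whose signature matches."""
    return [sig.format_name for sig in SIGNATURES if sig.matches(data)]
```

For example, `identify(bytes.fromhex("ffd8ffe000104a464946"))` matches only the JPEG entry, because the pattern is pinned to offset 0.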

Grammar
- Formal description of the syntactic grammar underlying a format, expressed in some formal typed notation
  - BNF: Backus-Naur Form
  - BSDL: MPEG-21 Bitstream Syntax Description Language
  - DFDL: Data Format Description Language
  - EAST: CCSDS 644.0-B-2
  - XCEL: Extensible Characterisation Extraction Language

Assessment
- Assessment of a format, expressed in some formal typed notation
  - Cornell Virtual Remote Control (VRC)
  - DSTC PANIC
  - Library of Congress Sustainability, Quality, Function (SQF)
  - National Library of Australia AONS
  - OCLC INFORM

Documentation
- Specification documents (and software files) can be managed and distributed in the network
  - Applicable only in cases of public domain resources or if explicit permission is granted by rights holders
- Other documents (and software) will be referenced by full citation, including actionable links where possible
- Mechanism for individuals or institutions to register locally-held copies, with terms of use

Software
- Format role: input, output
- Process type: characterize, create, edit, identify, …
- Enables discovery of transformative processing chains
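
If each software entry records its input and output formats, a processing chain is a path through a graph whose edges are tools. The following breadth-first search is a minimal sketch of that idea; the tool names (`tool-a` and so on) and their format pairings are hypothetical placeholders, not registry data.

```python
from collections import deque, namedtuple

Tool = namedtuple("Tool", "name input output")

# Hypothetical tools; only the format names are taken from the slides.
TOOLS = [
    Tool("tool-a", "DNG", "TIFF 6.0"),
    Tool("tool-b", "TIFF 6.0", "PDF 1.4"),
    Tool("tool-c", "PDF 1.4", "PDF/A"),
    Tool("tool-d", "Word 97", "PDF 1.4"),
]


def find_chain(tools, source, target):
    """Return the shortest list of tool names that turns source into target, or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, chain = queue.popleft()
        if fmt == target:
            return chain
        for tool in tools:
            if tool.input == fmt and tool.output not in seen:
                seen.add(tool.output)
                queue.append((tool.output, chain + [tool.name]))
    return None


dng_to_pdfa = find_chain(TOOLS, "DNG", "PDF/A")
```

Breadth-first search returns a chain with the fewest steps, which is a reasonable default when each transformation risks some loss.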

Relationships
- Modification: BWF → WAVE
- Definition: NITF → XML DTD
- Extension: DNG → TIFF 6.0
- Restriction: PDF/A → PDF 1.4
- Requisite: XML → Relax NG
- Containment: ZIP → *
- Equivalence: DXF (ASCII) → DXF (binary)
- Version: Word 97 → Word 6.0
- Affinity: SPIFF → JPEG
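
The relationships on this slide are typed, directed pairs, so they can be held as simple triples and queried in either direction. This is a sketch under that assumption; the function names `outgoing` and `incoming` are invented for the example, and the triples are copied from the slide.

```python
# (relationship type, source format, target format), as listed on the slide.
RELATIONSHIPS = [
    ("Modification", "BWF", "WAVE"),
    ("Definition", "NITF", "XML DTD"),
    ("Extension", "DNG", "TIFF 6.0"),
    ("Restriction", "PDF/A", "PDF 1.4"),
    ("Requisite", "XML", "Relax NG"),
    ("Containment", "ZIP", "*"),
    ("Equivalence", "DXF (ASCII)", "DXF (binary)"),
    ("Version", "Word 97", "Word 6.0"),
    ("Affinity", "SPIFF", "JPEG"),
]


def outgoing(fmt, kind=None):
    """Relationships in which fmt is the source, optionally filtered by type."""
    return [(k, target) for k, source, target in RELATIONSHIPS
            if source == fmt and kind in (None, k)]


def incoming(fmt, kind=None):
    """Relationships in which fmt is the target, optionally filtered by type."""
    return [(k, source) for k, source, target in RELATIONSHIPS
            if target == fmt and kind in (None, k)]
```

A repository holding PDF/A files could call `outgoing("PDF/A")` to learn that the PDF 1.4 specification is also needed to interpret them.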

GDFR node
- Based on the OCLC IWSA / RFA framework

GDFR node
- Java, Apache/Tomcat, Berkeley DB XML
- GNU LGPL license
  - Including pre-existing OCLC technology and technology newly-developed for the project
- Release schedule
  - v0.1 (alpha): March 23, 2007
  - v0.1 (beta): June 14, 2007
  - v1.0: June 30, 2007
  - v1.1: August 12, 2007
  - v1.3: September 17, 2007
  - v1.3.1: October 26, 2007

GDFR network
- Peer-to-peer network of independent, but cooperating registries communicating over a common protocol

GDFR network
- Public notification of the availability of new data
  - RSS feed available at well-known public address to which remote nodes can subscribe
- Remote harvesting of local data
  - OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)
- Initially, a single source (root node) for all new data
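
On the harvesting side, an OAI-PMH request is an ordinary HTTP GET with a `verb` parameter. The sketch below builds a `ListRecords` request URL using standard OAI-PMH arguments (`metadataPrefix`, `from`, `resumptionToken`). The base URL is a placeholder, since the slides do not give a node endpoint, and `oai_dc` is used only because every OAI-PMH repository must support it; the metadata format a GDFR node would actually expose is not stated here.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it must not be combined with metadataPrefix or from.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # harvest only records changed since then
    return base_url + "?" + urlencode(params)


# Hypothetical root-node endpoint, for illustration only.
ROOT_NODE = "https://registry.example.org/oai"
first_request = list_records_url(ROOT_NODE, from_date="2007-10-26")
```

A remote node that learns of new data from the RSS feed could issue such a request with `from` set to the date of its last successful harvest.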

Project status
- Extensive internal testing of GDFR software in a stand-alone mode
- Current project activities are focused on
  - Implementing the distribution and synchronization functions
  - Building the network
  - Data acquisition
  - Succession planning

Initial population
- Manual addition is possible, but time consuming
- Automated update using Atom
- What sources are available for bulk population?
  - PRONOM registry www.nationalarchives.gov.uk/pronom
  - Library of Congress Format Description Documents (FDD) www.digitalpreservation.gov/formats/fdd/descriptions.shtml
  - Unix / Linux magic(4) database

Subsequent population
- RFC 2026, Internet Standards Process www.ietf.org/rfc/rfc2026.txt
  - “Iterations of review by the ... community and revision based upon experience”
- Draft distribution and public discussion
- Approval by “area” editors
- Release to the network for distribution

Sustainability
- The technological solution is the (relatively) easy part, but…
- The technology is expendable
- The important point is for the data to survive, evolve, and expand

Governance and succession
- Mellon funding was for technical work only
- At the end of the two year project…
  - Harvard will continue maintenance for up to two years
  - Library of Congress has agreed to be a care-taker agency until a permanent body is identified
- NARA ERA, Ken Thibodeau, Robert Chadduck, Richard Steinbacher

Governance and succession
- NARA GDFR governance investigation
  - Part of the Electronic Records Archives (ERA) initiative
- GDFR Governance Workshop, November 2007
  - Bibliothèque et Archives, Canada
  - Corp. for National Research Initiatives
  - Digital Curation Centre, UK
  - Digital Library Federation
  - General Services Administration
  - Georgia Institute of Technology
  - Government Printing Office
  - Harvard University
  - IBM Watson Research Center
  - Koninklijke Bibliotheek, Netherlands
  - Library of Congress
  - MIT
  - NARA
  - NASA
  - NIST
  - National Library of Australia
  - National Library of New Zealand
  - San Diego Supercomputer Center
  - Stanford University
  - Statens Archiv, Sweden
  - Tessella Support Services
  - University of Pennsylvania

Administrative considerations
- Policy
  - Who (and how many) can join the network?
  - What are the eligibility requirements?
  - What are the rights and obligations of membership?
- Technical
  - Who will maintain and enhance the data model?
  - Who will maintain, enhance, distribute, and support the software?

Administrative considerations
- Data
  - Who will contribute data?
  - Who will vouch for data authenticity?
  - Who will ensure data integrity?
- Financial
  - What are the real human and system costs associated with GDFR operation?
  - Who pays, and how?

Summary
- The GDFR is an enabling technology that will support digital repository and preservation activities
  - Supports the strong typing of digital assets at an appropriate level of granularity
  - Enables the future recovery of the syntax and semantics associated with typed digital assets
- A means to pool and redistribute the expertise of the international digital preservation community

For more information…
- www.formatregistry.org
- stephen_abrams@harvard.edu
- andreas_stanescu@oclc.org