The Global Digital Format Registry (GDFR) Project


1 The Global Digital Format Registry (GDFR) Project
CNI Fall Task Force Meeting, Washington, DC, December 10-11, 2007. Stephen Abrams, Harvard University; Andreas Stanescu, OCLC

2 Digital preservation and format
Preservation is concerned with ensuring access to managed digital assets over time. Thus, preservation activities are focused on: viability, fixity, authenticity, interpretability, renderability. The last two are primarily a function of format.

3 Without format typing, all content is opaque
ffd8ffe000104a ffed0fb050686f74 6f73686f e d 03e90a e e666f00 000002f40240ffeeffee fc d f 494d03ed0a f6c f 6e a

4 Without format typing, all content is opaque
ffd8ffe000104a ffed0fb050686f74 6f73686f e d 03e90a e e666f00 000002f40240ffeeffee fc d f 494d03ed0a f6c f 6e a SOI APP0 JFIF 1.2 APP13 IPTC APP2 ICC DQT SOF0 183x512 DRI DHT SOS ECS0 RST0 ECS1 RST1 ECS2 ...

5 Without format typing, all content is opaque
ffd8ffe000104a ffed0fb050686f74 6f73686f e d 03e90a e e666f00 000002f40240ffeeffee fc d f 494d03ed0a f6c f 6e a SOI APP0 JFIF 1.2 APP13 IPTC APP2 ICC DQT SOF0 183x512 DRI DHT SOS ECS0 RST0 ECS1 RST1 ECS2 ... Edward Burne-Jones (British, ) The Days of Creation: the First Day, Watercolor and gouache, 102.2×35.5 cm Fogg Art Museum, Harvard University, Bequest of Grenville L. Winthrop

6 Global Digital Format Registry
“The Global Digital Format Registry (GDFR) will provide sustainable services to collect, review, store, discover, and deliver significant representation information about digital formats.” Centrally organized collection and review. Distributed storage, discovery, and delivery on a network of independent but cooperating registries.

7 What is a format?
“A serialized encoding of an abstract information model.” Encompasses the nominal sense of “file format” as well as a range of conceptual entities from the micro to the macro level, for example an IEEE 754 floating point number or a file system. In both cases, there are well-defined syntactic and semantic rules for mapping from information to bits, and back again.
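The micro-level case can be made concrete. The following is a minimal sketch, assuming big-endian byte order and Python's standard `struct` module; the helper names `encode_double` and `decode_double` are illustrative, not part of the GDFR.

```python
import struct

def encode_double(value: float) -> bytes:
    """Map information to bits: serialize a float as IEEE 754 binary64, big-endian."""
    return struct.pack(">d", value)

def decode_double(data: bytes) -> float:
    """Map bits back to information using the same format rules."""
    return struct.unpack(">d", data)[0]

encoded = encode_double(1.0)
print(encoded.hex())           # 3ff0000000000000
print(decode_double(encoded))  # 1.0
```

Without knowing that the eight bytes are a binary64 value, a reader has no way to tell that `3ff0000000000000` means 1.0.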

8 What’s wrong with MIME types?

9 What’s wrong with MIME types?
Non-standardized documentation. Intended for human, not machine, consumption. Coarse granularity: image/tiff vs. TIFF 4.0 – 6.0; Baseline Class B, G, P, R; Extension Class Y; TIFF/EP; TIFF/IT with file types CT, LW, HC, MP, BP, BL, FP; Exif 2.0 – 2.2; GeoTIFF; TIFF/FX; DNG.

10 GDFR project
Two DLF-sponsored invitational workshops: University of Pennsylvania, January 2003; Washington, March 2003. Two independent demonstration projects: FRED, Format Registry Demonstrator, TOM (Typed Object Model) [John Ockerbloom, University of Pennsylvania], tom.library.upenn.edu/fred/; FOCUS, Format Curation Service (LDAP) [Joseph JaJa, University of Maryland].

11 GDFR project
Harvard University Library (HUL) funded for 2 years by the Andrew W. Mellon Foundation. Staffing and technical work subcontracted by HUL to OCLC (July 2006).

12 GDFR project oversight
Technical Working Group (TWG): Bibliothèque nationale de France; British Library; California Digital Library; Digital Curation Centre, UK; Library of Congress; National Archives, UK; National Archives and Records Administration; National Library of Australia; National Library of New Zealand; Stanford University; University of Pennsylvania.

13 General development goals
A generalized registry framework, specialized for the distributed GDFR application. Based on well-known products and protocols. Human and machine interfaces. Full information content expressible in XML form and re-instantiable from that expression. Platform independence. Globally fault tolerant. Open source.
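The XML round-trip goal can be illustrated with a small sketch. The record layout, element names, and function names here are assumptions made for illustration; the actual GDFR schema is not shown in these slides.

```python
import xml.etree.ElementTree as ET

def record_to_xml(record: dict) -> str:
    """Express a flat record as XML (hypothetical element names, not the GDFR schema)."""
    root = ET.Element("format")
    for key, value in sorted(record.items()):
        child = ET.SubElement(root, key)
        child.text = value
    return ET.tostring(root, encoding="unicode")

def xml_to_record(xml_text: str) -> dict:
    """Re-instantiate the record from its XML expression."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

record = {"name": "TIFF", "mime_type": "image/tiff"}
xml_text = record_to_xml(record)
assert xml_to_record(xml_text) == record  # nothing is lost in the round trip
```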

14 GDFR data model Consistent with PRONOM registry

15 Identifiers
Canonical, GDFR-assigned identifier: “info” URI info:rfa/gdfr1/Formats/1. Other well-known identifiers: common name (“TIFF”, “Tagged Image File Format”); MIME type (image/tiff); PRONOM identifier (info:pronom/fmt/7); Library of Congress Format Description Document (FDD) identifier (fdd000022).

16 Classification scheme
Eight facets. Genre (required): text, still-image, sound, aggregate, … Role (required): family, file-format, encoding, serialization. Composition: unitary, container-bundle, container-wrapper. Form: binary, text. Constraint: structured, unstructured. Basis: sampled, symbolic. Domain: astronomy, cad-cam, gis, web-archive, … Transform: compression, encryption, message-digest, …

17 Classification scheme
Examples. TIFF (Tagged Image File Format): genre:still-image role:family composition:container-wrapper form:binary basis:sampled. LZW (Lempel-Ziv-Welch): genre:still-image role:encoding transform:compression. SVG (Scalable Vector Graphics): genre:still-image role:file-format form:text basis:symbolic.
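The facet scheme lends itself to a simple validity check. This is a sketch under stated assumptions: the `facet:value` strings follow the examples above, the facets whose value lists end in "…" are treated as open vocabularies, and the function names are illustrative.

```python
REQUIRED = {"genre", "role"}

# Facets whose value lists appear complete on the slide.
CLOSED_FACETS = {
    "role": {"family", "file-format", "encoding", "serialization"},
    "composition": {"unitary", "container-bundle", "container-wrapper"},
    "form": {"binary", "text"},
    "constraint": {"structured", "unstructured"},
    "basis": {"sampled", "symbolic"},
}
# Facets listed with a trailing ellipsis: any value is accepted here.
OPEN_FACETS = {"genre", "domain", "transform"}

def parse_classification(text: str) -> dict:
    """Parse 'genre:still-image role:family ...' into a facet dictionary."""
    return dict(term.split(":", 1) for term in text.split())

def validate(classification: dict) -> list:
    """Return a list of problems; an empty list means the classification is valid."""
    problems = []
    for facet in sorted(REQUIRED - set(classification)):
        problems.append(f"missing required facet: {facet}")
    for facet, value in classification.items():
        if facet in CLOSED_FACETS:
            if value not in CLOSED_FACETS[facet]:
                problems.append(f"invalid value for {facet}: {value}")
        elif facet not in OPEN_FACETS:
            problems.append(f"unknown facet: {facet}")
    return problems

tiff = parse_classification(
    "genre:still-image role:family composition:container-wrapper form:binary basis:sampled"
)
print(validate(tiff))  # []
```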

18 Signatures
External signatures: file extension; Mac OS type; Mac OS X Uniform Type Identifiers (UTI). Internal signatures: “magic numbers”; required vs. optional; fixed vs. restricted vs. unrestricted.
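A minimal sketch of identification by signature, trying internal signatures before external ones. The magic numbers shown are the widely documented ones for JPEG, TIFF, and PDF; the tables and the `identify` function are illustrative and do not reflect the GDFR signature model.

```python
import os

# Internal signatures: fixed byte sequences at offset 0.
INTERNAL = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),   # little-endian TIFF
    (b"MM\x00*", "TIFF"),   # big-endian TIFF
    (b"%PDF-", "PDF"),
]

# External signatures: file extensions.
EXTERNAL = {".jpg": "JPEG", ".jpeg": "JPEG", ".tif": "TIFF", ".tiff": "TIFF", ".pdf": "PDF"}

def identify(filename: str, header: bytes):
    """Return (format, evidence) or None, preferring internal over external evidence."""
    for magic, name in INTERNAL:
        if header.startswith(magic):
            return name, "internal"
    extension = os.path.splitext(filename)[1].lower()
    if extension in EXTERNAL:
        return EXTERNAL[extension], "external"
    return None

# The opaque bytes from the earlier slides begin with the JPEG SOI marker.
print(identify("unknown.bin", bytes.fromhex("ffd8ffe000104a")))  # ('JPEG', 'internal')
```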

19 Grammar
Formal description of the syntactic grammar underlying a format, expressed in some formal typed notation: BNF (Backus-Naur Form); BSDL (MPEG-21 Bitstream Syntax Description Language); DFDL (Data Format Description Language); EAST (CCSDS B-2); XCEL (Extensible Characterisation Extraction Language).

20 Assessment
Assessment of a format, expressed in some formal typed notation: Cornell Virtual Remote Control (VRC); DSTC PANIC; Library of Congress Sustainability, Quality, Function (SQF); National Library of Australia AONS; OCLC INFORM.

21 Documentation
Specification documents (and software files) can be managed and distributed in the network. Applicable only in cases of public domain resources or if explicit permission is granted by rights holders. Other documents (and software) will be referenced by full citation, including actionable links where possible. Mechanism for individuals or institutions to register locally held copies, with terms of use.

22 Software
Format role: input, output. Process type: characterize, create, edit, identify, … Enables discovery of transformative processing chains.
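Discovery of a processing chain can be sketched as a breadth-first search over tools described by their input and output formats. The tool names and the tool list below are hypothetical; only the idea of chaining input and output roles comes from the slide.

```python
from collections import deque

def find_chain(tools, source, target):
    """Return the shortest list of tool names converting source to target, or None.

    tools is a list of (tool_name, input_format, output_format) tuples.
    """
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, chain = queue.popleft()
        if fmt == target:
            return chain
        for name, input_format, output_format in tools:
            if input_format == fmt and output_format not in seen:
                seen.add(output_format)
                queue.append((output_format, chain + [name]))
    return None

# Hypothetical tools registered with their format roles.
TOOLS = [
    ("word2odt", "Word 97", "ODT"),
    ("odt2pdf", "ODT", "PDF 1.4"),
    ("pdf2pdfa", "PDF 1.4", "PDF/A"),
]

print(find_chain(TOOLS, "Word 97", "PDF/A"))  # ['word2odt', 'odt2pdf', 'pdf2pdfa']
```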

23 Relationships
Modification: BWF → WAVE. Definition: NITF → XML DTD. Extension: DNG → TIFF 6.0. Restriction: PDF/A → PDF 1.4. Requisite: XML → Relax NG. Containment: ZIP → *. Equivalence: DXF (ASCII) → DXF (binary). Version: Word 97 → Word 6.0. Affinity: SPIFF → JPEG.
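One way to picture these typed relationships is as a list of (kind, source, target) triples. This is a sketch only: the triples restate a few of the slide's examples, and `Relationship` and `related_to` are illustrative names rather than the GDFR data model.

```python
from collections import namedtuple

Relationship = namedtuple("Relationship", ["kind", "source", "target"])

# A few of the example relationships from the slide.
RELATIONSHIPS = [
    Relationship("modification", "BWF", "WAVE"),
    Relationship("extension", "DNG", "TIFF 6.0"),
    Relationship("restriction", "PDF/A", "PDF 1.4"),
    Relationship("affinity", "SPIFF", "JPEG"),
]

def related_to(format_name, kind=None):
    """Return the targets related to format_name, optionally filtered by kind."""
    return [
        r.target
        for r in RELATIONSHIPS
        if r.source == format_name and (kind is None or r.kind == kind)
    ]

print(related_to("PDF/A", "restriction"))  # ['PDF 1.4']
```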

24 GDFR node Based on the OCLC IWSA / RFA framework

25 GDFR node
Java, Apache/Tomcat, Berkeley DB XML. GNU LGPL license, including pre-existing OCLC technology and technology newly developed for the project. Release schedule: v0.1 (alpha), March 23, 2007; v0.1 (beta), June 14, 2007; v1.0, June 30, 2007; v1.1, August 12, 2007; v1.3, September 17, 2007; v1.3.1, October 26, 2007.

26 GDFR node

27 GDFR node

28 GDFR node

29 GDFR network Peer-to-peer network of independent, but cooperating registries communicating over a common protocol

30 GDFR network
Public notification of the availability of new data: RSS feed available at a well-known public address to which remote nodes can subscribe. Remote harvesting of local data: OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting). Initially, a single source (root node) for all new data.
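The harvesting step could look like the following sketch, which only builds an OAI-PMH `ListRecords` request URL. The verb and the `metadataPrefix` and `from` arguments are standard OAI-PMH; the base URL is a placeholder, and the slides do not say which metadata prefix a GDFR node would expose, so `oai_dc` is an assumption.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request, optionally limited to records changed since from_date."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)

# Placeholder address for a root node; not a real GDFR endpoint.
url = list_records_url("http://registry.example.org/oai", from_date="2007-12-01")
print(url)
# http://registry.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2007-12-01
```

A remote node would issue such a request after the RSS feed signals that new data is available.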

31 Project status
Extensive internal testing of GDFR software in stand-alone mode. Current project activities are focused on: implementing the distribution and synchronization functions; building the network; data acquisition; succession planning.

32 Initial population
Manual addition is possible, but time-consuming. Automated update using Atom. What sources are available for bulk population? PRONOM registry; Library of Congress Format Description Documents (FDD); Unix / Linux magic(4) database.
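As a rough illustration of bulk population from a magic(4)-style source, the sketch below reads only simple top-level lines of the form offset, type, test, message. Real magic files also contain continuation lines, escapes, and indirect offsets, which are skipped here; the sample lines are illustrative rather than copied from a particular magic file.

```python
SAMPLE = """\
# offset  type     test    message
0         string   %PDF-   PDF document
0         beshort  0xffd8  JPEG image data
"""

def parse_magic_lines(text):
    """Parse simple top-level magic entries into dictionaries (simplified subset)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and continuation lines (which start with '>').
        if not line or line.startswith("#") or line.startswith(">"):
            continue
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        offset, type_name, test, message = parts
        try:
            offset_value = int(offset, 0)
        except ValueError:
            continue  # indirect or otherwise unsupported offset syntax
        entries.append(
            {"offset": offset_value, "type": type_name, "test": test, "message": message}
        )
    return entries

for entry in parse_magic_lines(SAMPLE):
    print(entry["message"], "at offset", entry["offset"])
```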

33 Subsequent population
RFC 2026, Internet Standards Process: “Iterations of review by the ... community and revision based upon experience”. Draft distribution and public discussion; approval by “area” editors; release to the network for distribution.

34 Sustainability
The technological solution is the (relatively) easy part, but… The technology is expendable. The important point is for the data to survive, evolve, and expand.

35 Governance and succession
Mellon funding was for technical work only. At the end of the two-year project, Harvard will continue maintenance for up to two years. Library of Congress has agreed to be a caretaker agency until a permanent body is identified. NARA ERA: Ken Thibodeau, Robert Chadduck, Richard Steinbacher.

36 Governance and succession
NARA GDFR governance investigation, part of the Electronic Records Archives (ERA) initiative. GDFR Governance Workshop, November 2007: Bibliothèque et Archives, Canada; NARA; Corp. for National Research Initiatives; NASA; Digital Curation Centre, UK; NIST; Digital Library Federation; National Library of Australia; General Services Administration; National Library of New Zealand; Georgia Institute of Technology; San Diego Supercomputer Center; Government Printing Office; Stanford University; Harvard University; Statens Archiv, Sweden; IBM Watson Research Center; Tessella Support Services; Koninklijke Bibliotheek, Netherlands; University of Pennsylvania; Library of Congress; MIT.

37 Administrative considerations
Policy: Who (and how many) can join the network? What are the eligibility requirements? What are the rights and obligations of membership? Technical: Who will maintain and enhance the data model? Who will maintain, enhance, distribute, and support the software?

38 Administrative considerations
Data: Who will contribute data? Who will vouch for data authenticity? Who will ensure data integrity? Financial: What are the real human and system costs associated with GDFR operation? Who pays, and how?

39 Summary
The GDFR is an enabling technology that will support digital repository and preservation activities. It supports the strong typing of digital assets at an appropriate level of granularity. It enables the future recovery of the syntax and semantics associated with typed digital assets. It provides a means to pool and redistribute the expertise of the international digital preservation community.

40 For more information… www.formatregistry.org

