1
Global Digital Format Registry
2
Why Do We Need a Registry?
Repository functions are performed on a format-specific basis
Interpretation of otherwise opaque content streams is dependent upon knowledge of how typed content is represented
Interchange requires mutual agreement of format syntax and semantics
ERPANET Workshop on Trusted Repositories for Preserving Cultural Heritage, November 2003
3
Potential Use Cases
Identification: “I have a digital object; what format is it?”
Validation: “I have an object purportedly of format F; is it?”
Transformation: “I have an object of format F, but need G; how can I produce it?”
Characterization: “I have an object of format F; what are its significant properties?”
Risk assessment: “I have an object of format F; is it at risk of obsolescence?”
Delivery: “I have an object of format F; how can I render it?”
4
Repository Format Dependencies Using the OAIS Reference Model
5
Repository Format Dependencies
Ingest
  Format-specific validation
  SIP-to-AIP transformation
Access
  AIP-to-DIP transformation
  Format-specific rendering
Preservation planning
  Necessary regardless of strategy
    Migration
    Emulation
    UVC

Ingest time is the best time to perform format-specific validation of digital objects: errors will never be as easy to find or as inexpensive to correct. The SIP (submission information package) may be a transfer syntax only, which needs to be transformed into the repository’s internal format (archival information package, AIP). Similarly, an additional access-time transformation may be required to produce the appropriate delivery copy (dissemination information package, DIP) of the digital object. Irrespective of preservation strategy, we need authoritative and detailed information about the formats in which a repository’s objects are stored.
6
What’s Wrong with MIME Types?
Insufficient depth of detail
  No requirements regarding syntax and semantic description
  No requirement for complete disclosure, especially of proprietary formats
Insufficient granularity
  Both tiled RGB TIFF with LZW and striped bi-tonal TIFF with Group 4 are typed as “image/tiff”
  All of PDF 1.0 – 1.4, PDF/X-1 – 3, and PDF/A are typed as “application/pdf”
  These variants might require radically different workflows

MIME type registration may not require any public disclosure of internal details. There is no standardization of descriptive information about the format for which the MIME type is registered. MIME registration is done through the IETF RFC (Request for Comments) process, using text documents intended for human consumption; there is no provision for machine actionability over the information. Digital objects that will require different processing paths during repository operations may share the same MIME type.
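The granularity problem can be illustrated with a minimal Python sketch. The `ObjectDescription` class and its field names are hypothetical, invented here for illustration; they are not part of any registry design. The sketch only shows that a workflow keyed on MIME type alone cannot tell the two TIFF variants from the slide apart, while a finer-grained key can.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectDescription:
    """Hypothetical description of a stored object (illustrative fields only)."""
    mime_type: str
    color_model: str
    layout: str
    compression: str


# The two TIFF variants named on the slide.
tiled_rgb = ObjectDescription("image/tiff", "RGB", "tiled", "LZW")
striped_bitonal = ObjectDescription("image/tiff", "bi-tonal", "striped", "Group 4")


def coarse_key(obj: ObjectDescription) -> str:
    """Workflow key based on the MIME type alone."""
    return obj.mime_type


def fine_key(obj: ObjectDescription) -> tuple:
    """Workflow key that also considers the format-internal properties."""
    return (obj.mime_type, obj.color_model, obj.layout, obj.compression)


# Same MIME type, so a MIME-only dispatcher treats both objects alike ...
same_coarse = coarse_key(tiled_rgb) == coarse_key(striped_bitonal)
# ... although their finer-grained descriptions differ.
same_fine = fine_key(tiled_rgb) == fine_key(striped_bitonal)
print(same_coarse, same_fine)
```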
7
A Bit of History
During summer 2002 the Harvard LDI and MIT DSpace teams met to discuss shared concerns, leading to:
  DLF-sponsored invitational meetings
  Ad-hoc committee
  Collected use cases
  Working groups on data and governance models

All institutions building or operating digital repositories are facing the same preservation challenges and they all need the type of format-specific information that the registry can provide. Implementing and operating the registry will be expensive; we all need to pool our resources to make it happen.
8
Ad-Hoc Committee
  Bibliothèque nationale de France
  California Digital Library
  Digital Library Federation
  Harvard University
  IETF
  JISC, UK
  JSTOR
  Library of Congress
  MIT
  NARA
  National Archives of Canada
  New York University
  NIST
  OCLC
  Public Records Office, UK
  RLG
  Stanford University
  University of Pennsylvania

Wide international participation by national libraries, archives, and major research libraries. So far the group has been artificially kept at this number; we have had to turn people away in order to keep a reasonable size. We anticipate that as the design and implementation process moves forward we will open up the process to all interested stakeholders.
9
Global Digital Format Registry
“The registry will maintain persistent, unambiguous bindings between public identifiers for digital formats and representation information for those formats.”
- Scope statement endorsed by ad-hoc committee
10
What is a Format?
A format is a fixed, byte-serialized encoding of an information model.
  No assumption regarding byte size
An information model is a formal expression of exchangeable knowledge (as defined by OAIS / ISO 14721)

The concept of information model is defined by OAIS (ISO 14721:2003).
11
What is Representation Information?
Representation information maps typed formats into more meaningful concepts by capturing the significant syntactic and semantic properties of those formats.
  Significant properties are those aspects of a format that are the primary carriers of the format’s intellectual value.

The concept of representation information also comes from OAIS (ISO 14721).
12
Data Model
Registry: properties of the registry itself
Format
  General descriptive properties
  Characterization properties
    Technical syntactic/semantic properties
  Processing properties
    Services and systems using format as input or output
  Administrative properties
    Provenance

The registry data model will maintain information about the registry itself and about the individual formats.
13
Data Model (provisional)
This is just to give a flavor of the working data model. It will undoubtedly change and grow as the work towards a registry continues.
14
Data Model Sources
ISO 14721, Open archival information system -- Reference model
  CCSDS OAIS reference model
  Representation information
    Interpret, or provide “additional meaning” to Data Object
    Structure and semantic information
PRONOM
  Public Records Office, UK
  “information about file formats and the application software needed to open them”
  Format, vendor, product
Diffuse
  EC’s Information Society Technologies programme
  “reference and guidance information on available and emerging standards and specifications”
  Business Guides: “application of standards and specifications in specific areas”

We have reviewed a wide range of systems, services, and projects to define an appropriate data model for the registry.
15
Data Model Sources
OCLC/RLG Preservation Metadata Framework
  “information necessary to render/display, understand, and interpret the Content Data Object”
  Based on CEDARS, NEDLIB, NLA, OAIS, and OCLC metadata
Typed Object Model (TOM)
  “model for identifying and describing data formats … distributed system of ‘type brokers’ that maintain and interpret these descriptions”
  Format is aggregate of type (attributes, operations, semantics) and encoding
JISC File Format Representation and Rendering Project
  Assessment of formats and rendering software
  Representation system to track formats and their rendering software

TOM originated at CMU and is being continued at the University of Pennsylvania <tom.library.upenn.edu>. TOM may provide a formal language useful for modeling format syntax. If the syntax could be modeled this way, that model could be used to build automated syntax validators. The JISC report provides interesting insights into the problems we can expect to encounter when trying to collect authoritative format information.
16
Informative, not Evaluative
The format properties stored in the registry should be factual, not judgmental.
  Legal liability
  May discourage deposit of proprietary information
Investigate ways to include (by reference?) third party evaluations/recommendations
  Insofar as this doesn’t hamper primary goal
17
High-Level Format Properties
Identifier UTF-8 Canonical format identifier
Alias * Variant identifiers
Author * Agent Author
Owner + Authority Owner
Maintainer * Maintenance agency
Classification + Class Ontological classification
Relationship * FormatRelation Typed relationship with another format, either registered internally or externally
Specification * Document Specification document
Signature * Signature Internal or external signature
Tool * System Process or service having format as input or output
Status Status: ‘Active’, ‘Withdrawn’, ‘Unknown’, ‘Other’
Provenance * Event Provenance event
Note * Informative note

Again, this is just to give a flavor of the types of high-level format properties that are currently being modeled in the registry data model.
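To make the property list concrete, here is a minimal Python sketch of a format record. It is not the registry's actual schema: the class name `FormatRecord`, the field names, and the use of plain strings for every value are assumptions made for illustration. The sketch also assumes that `*` on the slide means "zero or more" and `+` means "one or more", which is the usual reading but is not stated in the source.

```python
from dataclasses import dataclass, field
from typing import List

# Status values as listed on the slide.
STATUSES = {"Active", "Withdrawn", "Unknown", "Other"}


@dataclass
class FormatRecord:
    """Hypothetical record holding the high-level format properties."""
    identifier: str                      # canonical format identifier
    owners: List[str]                    # '+' assumed to mean one or more
    classifications: List[str]           # '+' assumed to mean one or more
    aliases: List[str] = field(default_factory=list)        # '*' assumed zero or more
    authors: List[str] = field(default_factory=list)
    maintainers: List[str] = field(default_factory=list)
    relationships: List[str] = field(default_factory=list)
    specifications: List[str] = field(default_factory=list)
    signatures: List[str] = field(default_factory=list)
    tools: List[str] = field(default_factory=list)
    status: str = "Unknown"
    provenance: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the assumed 'one or more' cardinality and the status vocabulary.
        if not self.owners:
            raise ValueError("at least one owner is required")
        if not self.classifications:
            raise ValueError("at least one classification is required")
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")


# Illustrative values only; the owner name is a placeholder.
tiff = FormatRecord(
    identifier="TIFF",
    owners=["Example Owner"],
    classifications=["Image/Still/Raster"],
    aliases=["Tagged Image File Format"],
    status="Active",
)
print(tiff.identifier, tiff.aliases)
```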
18
Descriptive Properties
Identifiers
  Canonical and alias
  E.g. “TIFF”, “Tagged Image File Format”
Arbitrary relationships
  Equivalence
  Encapsulation (format can be a container for another)
  Sub-typing, with strict substitutability
    XML ← SVG (“all SVG is XML; some XML may be SVG”)
    PDF 1.4 ← PDF/X ← PDF/A
  Versioning
Ontological classification

As mentioned previously, the data model format properties fall into four categories: descriptive, characterization, processing, and provenance. Descriptive properties are general, and define identifiers, relationships between formats, and an ontological classification. The sub-type relation “PDF 1.4 ← PDF/X ← PDF/A” may not be strictly true, but is used for illustrative purposes.
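The sub-typing relation with strict substitutability can be sketched as a walk up a parent chain. This is only an illustration: the `PARENT` table and the `is_subtype` function are hypothetical names, and the entries simply reuse the slide's examples (including the PDF chain, which the notes say may not be strictly true).

```python
from typing import Dict

# child format -> the format it is a sub-type of (the slide's illustrative examples)
PARENT: Dict[str, str] = {
    "SVG": "XML",
    "PDF/X": "PDF 1.4",
    "PDF/A": "PDF/X",
}


def is_subtype(fmt: str, ancestor: str) -> bool:
    """Return True if `fmt` is `ancestor` or a (transitive) sub-type of it."""
    seen = set()
    current = fmt
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)          # guards against accidental cycles in the table
        current = PARENT.get(current)
    return False


# "All SVG is XML; some XML may be SVG."
print(is_subtype("SVG", "XML"))        # True
print(is_subtype("XML", "SVG"))        # False
print(is_subtype("PDF/A", "PDF 1.4"))  # True, via PDF/X
```

Under strict substitutability, a tool registered as accepting the ancestor format could be offered any object whose format passes this check.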
19
Format Ontology Content stream Physical media Communication Logical
Numeric Scalar Integer Unsigned Real Floating point Complex Text Structured text Mark-up language Programming language Message Mail News Image Still Font Outline Raster Graphic Vector Page description Motion Audio Music Application CAD Communication Database Executable GIS Presentation Spreadsheet Word processing Transformation Compression Lossless Lossy Container File system Transfer 7-bit safe Physical media Magnetic Disk Tape Reel Cartridge Optical CD-ROM DVD Film Paper Card

An ontology of formats is used to indicate the primary useful domain for a given format.
20
Characterization Properties
Specification documents
  Actionable links to on-line documents
  Public identifiers
  Hard copy
  Public, on-site, license, and/or escrow access
Signatures
  External: file extension, Mac OS data fork type
  Internal: magic number

Characterization properties are those useful to identify and validate an object of a given format.
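How external and internal signatures might be used for identification can be sketched in a few lines of Python. The `SignatureProfile` class, the `PROFILES` list, and the `identify` function are hypothetical names introduced for this example, not a registry API. The byte values are the well-known leading bytes of TIFF (`II*\0` and `MM\0*`) and PDF (`%PDF-`) files.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SignatureProfile:
    """Hypothetical signature entry for one format."""
    format_id: str
    extensions: Tuple[str, ...]      # external signature: file extensions
    magic_numbers: Tuple[bytes, ...]  # internal signature: leading bytes


PROFILES = [
    SignatureProfile("TIFF", (".tif", ".tiff"), (b"II*\x00", b"MM\x00*")),
    SignatureProfile("PDF", (".pdf",), (b"%PDF-",)),
]


def identify(filename: str, head: bytes) -> List[Tuple[str, str]]:
    """Return (format_id, evidence) pairs for every profile that matches.

    An internal (magic number) match is reported in preference to an
    external (file extension) match for the same profile.
    """
    matches = []
    lower = filename.lower()
    for profile in PROFILES:
        if any(head.startswith(magic) for magic in profile.magic_numbers):
            matches.append((profile.format_id, "internal"))
        elif lower.endswith(profile.extensions):
            matches.append((profile.format_id, "external"))
    return matches


# A TIFF file that has been misnamed with a .pdf extension: the two kinds
# of signature disagree, and both candidates are reported.
result = identify("scan.pdf", b"II*\x00\x08\x00\x00\x00")
print(result)
```

Returning every candidate with its evidence, rather than a single answer, leaves the decision about conflicting signatures to the caller.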
21
Centralized vs. Distributed
Allowing arbitrary granularity may lead to an explosion of registered formats
  Versions
  Local profiles
Typed relationships support internal and external references
  Enable distributed architecture without mandating it

This is a key question that will affect technical design and implementation, and the operational policy of the registry. The current data model enables a distributed architecture without requiring it.
22
Core Registry Services
Core services provided by registry itself
Approval of format entries
  Level of technical review, level of public disclosure
Maintenance
  Add, update, delete format entries
Notification
  Notify registry clients of new/updated format or trigger events (e.g. obsolescence, new transformation service, etc.)
Introspection
  Determine local policies (scope, coverage, implemented services, etc.) of a given registry to identify appropriate registry to use
Description
  Representation information returned on request for single format
Export
  Entire registry or selected subset sent to external repository

These are services that will be provided internally by the registry.
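A few of the core services (maintenance, notification, description, and export) can be sketched as a small in-memory class. This is an illustrative toy under stated assumptions, not the registry's interface: the class name `Registry`, its method names, and the event strings are all invented here, and approval and introspection are left out.

```python
from typing import Callable, Dict, Iterable, List, Optional

Listener = Callable[[str, str], None]  # (event, format_id)


class Registry:
    """Hypothetical in-memory registry covering a subset of the core services."""

    def __init__(self) -> None:
        self._entries: Dict[str, dict] = {}
        self._listeners: List[Listener] = []

    # Notification: clients register a callback for change events.
    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def _notify(self, event: str, format_id: str) -> None:
        for listener in self._listeners:
            listener(event, format_id)

    # Maintenance: add, update, delete format entries.
    def add(self, format_id: str, info: dict) -> None:
        if format_id in self._entries:
            raise KeyError(f"{format_id} is already registered")
        self._entries[format_id] = dict(info)
        self._notify("added", format_id)

    def update(self, format_id: str, info: dict) -> None:
        self._entries[format_id].update(info)  # KeyError if unknown
        self._notify("updated", format_id)

    def delete(self, format_id: str) -> None:
        del self._entries[format_id]           # KeyError if unknown
        self._notify("deleted", format_id)

    # Description: representation information for a single format.
    def describe(self, format_id: str) -> dict:
        return dict(self._entries[format_id])

    # Export: the entire registry or a selected subset.
    def export(self, format_ids: Optional[Iterable[str]] = None) -> Dict[str, dict]:
        ids = self._entries.keys() if format_ids is None else format_ids
        return {fid: dict(self._entries[fid]) for fid in ids}


events = []
registry = Registry()
registry.subscribe(lambda event, fid: events.append((event, fid)))
registry.add("TIFF", {"status": "Active"})
registry.update("TIFF", {"alias": "Tagged Image File Format"})
print(registry.describe("TIFF"), events)
```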
23
Supported Services
Ancillary services supported by, but not necessarily provided by, the registry
Identification service
  Determine format of a specific digital object (DO) by comparing its attributes to the attribute profiles retrieved from the registry
Validation service
  Verify format of a specific DO by comparing its attributes to the attribute profile retrieved from the registry for that format

These are services that would require the use of the information stored within the registry. We anticipate that these services will evolve external to the registry.
24
Supported Services
Rendering service
  Identify current rendering conditions for supplied digital object (DO)
Transformation service
  Convert DO from current (source) format to target format
Metadata extraction service
  Registry returns information supporting automated extraction of attribute metadata from a DO of a specific format
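One way a transformation service could use registry information is to treat each registered tool as an edge from its input format to its output format and search for a chain of tools from source to target. The sketch below uses a breadth-first search, which is a choice made for this example rather than anything the slides prescribe. The tool names (`tiff2jpeg` and so on) and the function `find_transformation_path` are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# (input format, output format, tool name); all tool names are made up.
TOOLS: List[Tuple[str, str, str]] = [
    ("TIFF", "JPEG", "tiff2jpeg"),
    ("JPEG", "PNG", "jpeg2png"),
    ("DOC", "PDF", "doc2pdf"),
]


def find_transformation_path(
    source: str, target: str, tools: List[Tuple[str, str, str]]
) -> Optional[List[str]]:
    """Return the shortest list of tool names converting source to target.

    Returns an empty list if no conversion is needed, or None if no chain
    of registered tools connects the two formats.
    """
    if source == target:
        return []
    edges: Dict[str, List[Tuple[str, str]]] = {}
    for src, dst, name in tools:
        edges.setdefault(src, []).append((dst, name))

    queue = deque([(source, [])])
    visited = {source}
    while queue:
        fmt, path = queue.popleft()
        for dst, name in edges.get(fmt, []):
            if dst in visited:
                continue
            new_path = path + [name]
            if dst == target:
                return new_path
            visited.add(dst)
            queue.append((dst, new_path))
    return None


print(find_transformation_path("TIFF", "PNG", TOOLS))
print(find_transformation_path("PNG", "TIFF", TOOLS))
```

This answers the earlier use case "I have an object of format F, but need G; how can I produce it?" only in terms of which tools to chain; whether each step preserves the object's significant properties is a separate question.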
25
Registry Operation
The registry is valuable insofar as it is trustworthy and sustainable.
  Trust is necessary to encourage deposit of proprietary information
    Registry must be seen as “honest broker”
  Sustainability is necessary to justify expense
    As for all preservation activities, how do we generate income today, for services not needed until tomorrow?

Unless the registry can be implemented and operated in a sustainable manner that will instill a sense of trust, it is probably not worth the effort. It has to be a long-term resource: we know (hopefully) how to deal with TIFF, XML, PDF, etc. today, but our colleagues 50 years hence won’t. They will need the registry that we build today. If the registry is not populated with most (if not all) formats in common use, including proprietary formats, it will not be helpful in the future. Also, the information stored about formats must be true.
26
Registry Operation
Is the registry self-populating, or a public bulletin board?
  Will registry staff collect and manage representation information, or
  Will knowledgeable community members submit information?
What is the level of technical review, and by whom?

This is another key question, one that will affect the operational policies and staffing size of the registry.
27
Governance Model
Can this initiative reasonably be placed under the umbrella of an existing organization?
Is global scope in conflict with national prerogatives?
How to build sufficient trust models?

We all have scarce resources; there is no need to create another organization if an appropriate one already exists. That organization must be seen as an honest broker. We are investigating the escrowing of information about proprietary formats. Commercial interests will only go along with this if the registry is trusted.
28
Business Model
Costs depend on level of quality and authority required
Assuming the registry needs to be cost-recovered, options for supporting “common good” services include:
  Subsidy
  Subscription
  Pay to submit
    Format registration accompanied by an “endowment”
  Pay to view
    Queries on a for-fee basis
  Added-value services

There is no good way to generate a predictable income stream today for a registry whose real benefit will not be apparent until some time in the future. How do we, as a community, pay for this type of common good service?
29
Next Steps
Tell people what we’re doing
  National, academic, private libraries/archives
  Standards bodies
  Commercial
    Regulated industries
    Software vendors (developers and consumers of formats)
    Publishers
  Anyone with long-term digital preservation needs
Refine project description for a general audience
  Vision statement and high-level project plan
30
Next Steps
Look for project funding
Potentially two phases:
  Design and implementation
    Planning grant to sustain initial activity, developing:
      Data and service models
      Governance and business model
      Development and operations plan
    Library of Congress NDIIPP and/or JISC (UK) Digital Curation Centre
  Operational
    Need reliable, sustainable income stream
31
Why Is This Important to You?
If you care about the long-term usability of your digital assets:
  The registry will allow typing of digital objects at an appropriate level of granularity
  The registry will allow the recovery in the future of the syntax and semantics associated with typed digital objects
  The registry is an enabling technology underlying digital repository operations and preservation activities