Global Digital Format Registry


Why Do We Need a Registry?
  Repository functions are performed on a format-specific basis
  Interpretation of otherwise opaque content streams is dependent upon knowledge of how typed content is represented
  Interchange requires mutual agreement of format syntax and semantics
ERPANET Workshop on Trusted Repositories for Preserving Cultural Heritage, 17-19 November 2003

Potential Use Cases
  Identification: “I have a digital object; what format is it?”
  Validation: “I have an object purportedly of format F; is it?”
  Transformation: “I have an object of format F, but need G; how can I produce it?”
  Characterization: “I have an object of format F; what are its significant properties?”
  Risk assessment: “I have an object of format F; is it at risk of obsolescence?”
  Delivery: “I have an object of format F; how can I render it?”

Repository Format Dependencies Using the OAIS Reference Model

Repository Format Dependencies
  Ingest
    Format-specific validation
    SIP-to-AIP transformation
  Access
    AIP-to-DIP transformation
    Format-specific rendering
  Preservation planning
    Necessary regardless of strategy
      Migration
      Emulation
      UVC
Ingest time is the best time to perform format-specific validation of digital objects: errors will never be as easy to find or as inexpensive to correct. The SIP (submission information package) may be a transfer syntax only, which needs to be transformed into the repository’s internal format (archival information package). Similarly, an additional access-time transformation may be required to produce the appropriate delivery copy (dissemination information package) of the digital object. Irrespective of preservation strategy, we need authoritative and detailed information about the formats in which a repository’s objects are stored.

What’s Wrong with MIME Types?
  Insufficient depth of detail
    No requirements regarding syntax and semantic description
    No requirement for complete disclosure, especially of proprietary formats
  Insufficient granularity
    Both tiled RGB TIFF with LZW and striped bi-tonal TIFF with Group 4 are typed as “image/tiff”
    All of PDF 1.0 – 1.4, PDF/X-1 – 3, and PDF/A are typed as “application/pdf”
    These variants might require radically different workflows
MIME type registration may not require any public disclosure of internal details. There is no standardization of descriptive information about the format for which the MIME type is registered. MIME registration is done through the IETF RFC (Request for Comments) process, using text documents intended for human consumption; there is no provision for machine actionability over the information. Digital objects that will require different processing paths during repository operations may share the same MIME type.
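The granularity problem can be made concrete with a minimal sketch. The variant labels below paraphrase the slide's examples and are illustrative only; they are not registry identifiers, and the helper name group_by_mime is hypothetical.

```python
from collections import defaultdict

# Hypothetical variant labels paired with the MIME type each one carries.
# The labels paraphrase the slide's examples; they are not registry identifiers.
VARIANTS = [
    ("TIFF, tiled RGB, LZW compression", "image/tiff"),
    ("TIFF, striped bi-tonal, Group 4 compression", "image/tiff"),
    ("PDF 1.4", "application/pdf"),
    ("PDF/X-1", "application/pdf"),
    ("PDF/A", "application/pdf"),
]


def group_by_mime(variants):
    """Group variant labels by MIME type to show what the type alone cannot distinguish."""
    groups = defaultdict(list)
    for label, mime in variants:
        groups[mime].append(label)
    return dict(groups)


collisions = group_by_mime(VARIANTS)
for mime, labels in collisions.items():
    print(mime, "covers", len(labels), "variants")
```

A repository that dispatches on the MIME type alone would send every variant in a group down the same processing path, which is the situation the slide warns about.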

A Bit of History
  During summer 2002 the Harvard LDI and MIT DSpace teams met to discuss shared concerns, leading to:
    DLF-sponsored invitational meetings
    Ad-hoc committee
    Collected use cases
    Working groups on data and governance models
All institutions building or operating digital repositories are facing the same preservation challenges and they all need the type of format-specific information that the registry can provide. Implementing and operating the registry will be expensive; we all need to pool our resources to make it happen.

Ad-Hoc Committee
  Bibliothèque nationale de France
  California Digital Library
  Digital Library Federation
  Harvard University
  IETF
  JISC, UK
  JSTOR
  Library of Congress
  MIT
  NARA
  National Archives of Canada
  New York University
  NIST
  OCLC
  Public Record Office, UK
  RLG
  Stanford University
  University of Pennsylvania
Wide international participation by national libraries, archives, and major research libraries. So far the group has been artificially kept at this number; we have had to turn people away in order to keep a reasonable size. We anticipate that as the design and implementation process moves forward we will open up the process to all interested stakeholders.

Global Digital Format Registry
  “The registry will maintain persistent, unambiguous bindings between public identifiers for digital formats and representation information for those formats.”
  - Scope statement endorsed by ad-hoc committee

What is a Format?
  A format is a fixed, byte-serialized encoding of an information model.
    No assumption regarding byte size
  An information model is a formal expression of exchangeable knowledge (as defined by OAIS / ISO 14721)
The concept of information model is defined by OAIS (ISO 14721).

What is Representation Information?
  Representation information maps typed formats into more meaningful concepts by capturing the significant syntactic and semantic properties of those formats.
    Significant properties are those aspects of a format that are the primary carriers of the format’s intellectual value
The concept of representation information also comes from OAIS (ISO 14721).

Data Model
  Registry – properties of the registry itself
  Format
    General descriptive properties
    Characterization properties
      Technical syntactic/semantic properties
    Processing properties
      Services and systems using format as input or output
    Administrative properties
      Provenance
The registry data model will maintain information about the registry itself and about the individual formats.

Data Model (provisional)
This is just to give a flavor of the working data model. It will undoubtedly change and grow as the work towards a registry continues.

Data Model Sources
  ISO 14721, Open archival information system -- Reference model
    CCSDS OAIS reference model
    Representation information
      Interpret, or provide “additional meaning” to Data Object
      Structure and semantic information
  PRONOM
    Public Record Office, UK
    “information about file formats and the application software needed to open them”
    Format, vendor, product
  Diffuse
    EC’s Information Society Technologies programme
    “reference and guidance information on available and emerging standards and specifications”
    Business Guides: “application of standards and specifications in specific areas”
We have reviewed a wide range of systems, services, and projects to define an appropriate data model for the registry.

Data Model Sources
  OCLC/RLG Preservation Metadata Framework
    “information necessary to render/display, understand, and interpret the Content Data Object”
    Based on CEDARS, NEDLIB, NLA, OAIS, and OCLC metadata
  Typed Object Model (TOM)
    “model for identifying and describing data formats … distributed system of ‘type brokers’ that maintain and interpret these descriptions”
    Format is aggregate of type (attributes, operations, semantics) and encoding
  JISC File Format Representation and Rendering Project
    Assessment of formats and rendering software
    Representation system to track formats and their rendering software
TOM originated at CMU and is being continued at the University of Pennsylvania <tom.library.upenn.edu>. TOM may provide a formal language useful for modeling format syntax. If the syntax could be modeled this way, that model could be used to build automated syntax validators. The JISC report <www.jisc.ac.uk/uploaded_documents/FileFormatsreport.pdf> provides interesting insights into the problems we can expect to encounter when trying to collect authoritative format information.

Informative, not Evaluative
  The format properties stored in the registry should be factual, not judgmental.
    Legal liability
    May discourage deposit of proprietary information
  Investigate ways to include (by reference?) third party evaluations/recommendations
    Insofar as this doesn’t hamper primary goal

High-Level Format Properties
  Identifier UTF-8 Canonical format identifier
  Alias * Variant identifiers
  Author * Agent Author
  Owner + Authority Owner
  Maintainer * Maintenance agency
  Classification + Class Ontological classification
  Relationship * FormatRelation Typed relationship with another format, either registered internally or externally
  Specification * Document Specification document
  Signature * Signature Internal or external signature
  Tool * System Process or service having format as input or output
  Status Status: ‘Active’, ‘Withdrawn’, ‘Unknown’, ‘Other’
  Provenance * Event Provenance event
  Note * Informative note
Again, this is just to give a flavor of the type of high-level format properties that are currently being modeled in the registry data model.
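To show how such a property list might translate into a record type, here is a minimal sketch. It assumes the conventional reading of the markers ("*" as zero or more, "+" as one or more), which the slide does not state; the class and field names are hypothetical, all values are simplified to strings, and the sample owner is a placeholder.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    ACTIVE = "Active"
    WITHDRAWN = "Withdrawn"
    UNKNOWN = "Unknown"
    OTHER = "Other"


@dataclass
class FormatRecord:
    """Hypothetical record mirroring the slide's high-level properties."""
    identifier: str                                          # canonical identifier
    owners: List[str]                                        # "+": assumed one or more
    classifications: List[str]                               # "+": assumed one or more
    status: Status = Status.UNKNOWN
    aliases: List[str] = field(default_factory=list)         # "*": assumed zero or more
    authors: List[str] = field(default_factory=list)
    maintainers: List[str] = field(default_factory=list)
    relationships: List[str] = field(default_factory=list)
    specifications: List[str] = field(default_factory=list)
    signatures: List[str] = field(default_factory=list)
    tools: List[str] = field(default_factory=list)
    provenance: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Enforce the assumed "one or more" cardinalities.
        if not self.identifier:
            raise ValueError("identifier is required")
        if not self.owners:
            raise ValueError("at least one owner is required")
        if not self.classifications:
            raise ValueError("at least one classification is required")


example = FormatRecord(
    identifier="TIFF",
    owners=["example-owner"],          # placeholder, not a real registry value
    classifications=["Image"],
    aliases=["Tagged Image File Format"],
    status=Status.ACTIVE,
)
```

Flattening every value to a string keeps the sketch short; the slide's typed values (Agent, Document, Event, and so on) would be separate record types in a fuller model.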

Descriptive Properties
  Identifiers
    Canonical and alias
    E.g. “TIFF”, “Tagged Image File Format”
  Arbitrary relationships
    Equivalence
    Encapsulation (format can be a container for another)
    Sub-typing, with strict substitutability
      XML ← SVG (“all SVG is XML; some XML may be SVG”)
      PDF 1.4 ← PDF/X ← PDF/A
    Versioning
  Ontological classification
As mentioned previously, the data model format properties fall into four categories: descriptive, characterization, processing, and provenance. Descriptive properties are general, and define identifiers, relationships between formats, and an ontological classification. The sub-type relation “PDF 1.4 ← PDF/X ← PDF/A” may not be strictly true, but is used for illustrative purposes.
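Sub-typing with strict substitutability can be sketched as a walk up a chain of sub-type links. The sketch below assumes each format has at most one direct super-type, which the slide does not state, and reuses the slide's illustrative links (including the PDF chain that the notes say may not be strictly true). The names SUBTYPE_OF and is_substitutable are hypothetical.

```python
# Child -> parent sub-type links, copied from the slide's illustrative examples.
SUBTYPE_OF = {
    "SVG": "XML",
    "PDF/A": "PDF/X",
    "PDF/X": "PDF 1.4",
}


def is_substitutable(fmt, target, links=SUBTYPE_OF):
    """Return True if an object of `fmt` can stand in wherever `target` is expected."""
    seen = set()
    current = fmt
    # Follow parent links upward; `seen` guards against accidental cycles.
    while current is not None and current not in seen:
        if current == target:
            return True
        seen.add(current)
        current = links.get(current)
    return False


print(is_substitutable("SVG", "XML"))   # all SVG is XML
print(is_substitutable("XML", "SVG"))   # only some XML is SVG
```

The relation is deliberately one-directional: a tool that accepts XML can be handed SVG, but not the reverse.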

Format Ontology
  Content stream Physical media Communication Logical Numeric Scalar Integer Unsigned Real Floating point Complex Text Structured text Mark-up language Programming language Message Mail News Image Still Font Outline Raster Graphic Vector Page description Motion Audio Music Application CAD Communication Database Executable GIS Presentation Spreadsheet Word processing Transformation Compression Lossless Lossy Container File system Transfer 7-bit safe Physical media Magnetic Disk Tape Reel Cartridge Optical CD-ROM DVD Film Paper Card
An ontology of formats is used to indicate the primary useful domain for a given format.

Characterization Properties
  Specification documents
    Actionable links to on-line documents
    Public identifiers
    Hard copy
    Public, on-site, license, and/or escrow access
  Signatures
    External
      File extension, Mac OS data fork type
    Internal
      Magic number
Characterization properties are those useful to identify and validate an object of a given format.
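As an illustration of how internal and external signatures could drive identification, here is a minimal sketch. The magic numbers shown (the little- and big-endian TIFF headers and the "%PDF-" prefix) are the well-known leading bytes of those formats; the SignatureProfile type, the PROFILES table, and the choice to try internal signatures before external ones are assumptions made for this example, not part of the registry design.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SignatureProfile:
    """Hypothetical signature entry for one format."""
    format_id: str
    magic_numbers: Tuple[bytes, ...]   # internal signatures: leading bytes
    extensions: Tuple[str, ...]        # external signatures: file extensions


PROFILES = (
    SignatureProfile("TIFF", (b"II*\x00", b"MM\x00*"), (".tif", ".tiff")),
    SignatureProfile("PDF", (b"%PDF-",), (".pdf",)),
)


def identify(data: bytes, filename: str = "") -> Optional[str]:
    """Return a format id, trying internal signatures first, then the extension."""
    for profile in PROFILES:
        if any(data.startswith(magic) for magic in profile.magic_numbers):
            return profile.format_id
    lowered = filename.lower()
    for profile in PROFILES:
        # str.endswith accepts a tuple of candidate suffixes.
        if lowered.endswith(profile.extensions):
            return profile.format_id
    return None


print(identify(b"%PDF-1.4\n", "report.bin"))
```

Internal signatures are tried first here because a file extension is easy to change or lose; a real service would likely report both results and flag disagreements.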

Centralized vs. Distributed
  Allowing arbitrary granularity may lead to an explosion of registered formats
    Versions
    Local profiles
  Typed relationships support internal and external references
    Enable distributed architecture without mandating it
This is a key question that will affect technical design and implementation, and the operational policy of the registry. The current data model enables a distributed architecture without requiring it.

Core Registry Services
  Core services provided by registry itself
    Approval of format entries
      Level of technical review, level of public disclosure
    Maintenance
      Add, update, delete format entries
    Notification
      Notify registry clients of new/updated format or trigger events (e.g. obsolescence, new transformation service, etc.)
    Introspection
      Determine local policies (scope, coverage, implemented services, etc.) of a given registry to identify appropriate registry to use
    Description
      Representation information returned on request for single format
    Export
      Entire registry or selected subset sent to external repository
These are services that will be provided internally by the registry.
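Four of these services (maintenance, notification, description, export) can be sketched as a small in-memory class. This is only an illustration under stated assumptions: the Registry class, its method names, and the event strings are hypothetical, representation information is reduced to a plain dictionary, and approval and introspection are left out.

```python
from typing import Callable, Dict, Iterable, List, Optional


class Registry:
    """Hypothetical in-memory registry covering four of the core services."""

    def __init__(self) -> None:
        self._entries: Dict[str, dict] = {}
        self._listeners: List[Callable[[str, str], None]] = []

    # Notification: clients register a callback receiving (event, format_id).
    def subscribe(self, listener: Callable[[str, str], None]) -> None:
        self._listeners.append(listener)

    def _notify(self, event: str, format_id: str) -> None:
        for listener in self._listeners:
            listener(event, format_id)

    # Maintenance: add or update an entry.
    def put(self, format_id: str, representation_info: dict) -> None:
        event = "updated" if format_id in self._entries else "added"
        self._entries[format_id] = dict(representation_info)
        self._notify(event, format_id)

    # Maintenance: delete an entry.
    def delete(self, format_id: str) -> None:
        del self._entries[format_id]
        self._notify("deleted", format_id)

    # Description: representation information for a single format.
    def describe(self, format_id: str) -> dict:
        return dict(self._entries[format_id])

    # Export: the whole registry, or a selected subset.
    def export(self, format_ids: Optional[Iterable[str]] = None) -> Dict[str, dict]:
        ids = list(self._entries) if format_ids is None else list(format_ids)
        return {fid: dict(self._entries[fid]) for fid in ids}


events = []
registry = Registry()
registry.subscribe(lambda event, fid: events.append((event, fid)))
registry.put("TIFF", {"status": "Active"})
registry.put("TIFF", {"status": "Withdrawn"})
snapshot = registry.export()
```

Returning copies from describe and export keeps callers from mutating registry state directly, so every change passes through put or delete and triggers a notification.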

Supported Services
  Ancillary services supported by, but not necessarily provided by, the registry
    Identification service
      Determine format of a specific digital object (DO) by comparing its attributes to the attribute profiles retrieved from the registry
    Validation service
      Verify format of a specific DO by comparing its attributes to the attribute profile retrieved from the registry for that format
These are services that would require the use of the information stored within the registry. We anticipate that these services will evolve external to the registry.

Supported Services
  Rendering service
    Identify current rendering conditions for supplied digital object (DO)
  Transformation service
    Convert DO from current (source) format to target format
  Metadata extraction service
    Registry returns information supporting automated extraction of attribute metadata from a DO of a specific format
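A transformation service could use the registry's tool entries (processes having a format as input or output) to answer the use case "I have an object of format F, but need G; how can I produce it?". The sketch below treats tools as edges in a graph and searches breadth-first for the shortest chain. The tool names and the intermediate formats X and Y are hypothetical, and breadth-first search is simply one reasonable choice, not a method the presentation prescribes.

```python
from collections import deque

# Hypothetical tool entries: (tool name, input format, output format).
TOOLS = [
    ("tool-a", "F", "X"),
    ("tool-b", "X", "G"),
    ("tool-c", "F", "Y"),
]


def transformation_path(source, target, tools=TOOLS):
    """Return the shortest list of tool names converting source to target, or None."""
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for name, tool_in, tool_out in tools:
            if tool_in == fmt and tool_out not in visited:
                visited.add(tool_out)
                queue.append((tool_out, path + [name]))
    return None


print(transformation_path("F", "G"))
```

Shortest is measured here in number of steps; a real service would probably also weigh the fidelity of each step, since every conversion risks losing significant properties.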

Registry Operation
  The registry is valuable insofar as it is trustworthy and sustainable.
    Trust is necessary to encourage deposit of proprietary information
      Registry must be seen as “honest broker”
    Sustainability is necessary to justify expense
      As for all preservation activities, how do we generate income today, for services not needed until tomorrow?
Unless the registry can be implemented and operated in a sustainable manner that will instill a sense of trust, it is probably not worth going through the effort. It has to be a long-term resource; we know (hopefully) how to deal with TIFF, XML, PDF, etc. today, but our colleagues 50 years hence won’t. They will need the registry that we build today. If the registry is not populated with most (if not all) formats in common use, including proprietary formats, it will not be helpful in the future. Also, the information stored about each format must be true.

Registry Operation
  Is the registry self-populating, or a public bulletin board?
    Will registry staff collect and manage representation information, or
    Will knowledgeable community members submit information?
  What is the level of technical review, and by whom?
Another key question that will affect the operational policies and staffing size for the registry.

Governance Model
  Can this initiative reasonably be placed under the umbrella of an existing organization?
  Is global scope in conflict with national prerogatives?
  How to build sufficient trust models?
We all have scarce resources; there is no need to create another organization if there is an existing organization that will be appropriate. That organization must be seen as an honest broker. We are investigating the escrowing of information about proprietary formats. Commercial interests will only go along with this if the registry is trusted.

Business Model
  Costs depend on level of quality and authority required
  Assuming the registry needs to be cost-recovered, options for supporting “common good” services include:
    Subsidy
    Subscription
    Pay to submit
      Format registration accompanied by an “endowment”
    Pay to view
      Queries on a for-fee basis
    Added-value services
There is no good way to generate a predictable income stream today for a registry whose real benefit will not be apparent until some time in the future. How do we, as a community, pay for this type of common good service?

Next Steps
  Tell people what we’re doing
    National, academic, private libraries/archives
    Standards bodies
    Commercial
      Regulated industries
      Software vendors (developers and consumers of formats)
      Publishers
    Anyone with long-term digital preservation needs
  Refine project description for a general audience
    Vision statement and high-level project plan

Next Steps
  Look for project funding
    Potentially two phases:
      Design and implementation
        Planning grant to sustain initial activity, developing:
          Data and service models
          Governance and business model
          Development and operations plan
        Library of Congress NDIIPP and/or JISC (UK) Digital Curation Centre
      Operational
        Need reliable, sustainable income stream

Why Is This Important to You?
  If you care about the long-term usability of your digital assets:
    The registry will allow typing of digital objects at an appropriate level of granularity
    The registry will allow the recovery in the future of the syntax and semantics associated with typed digital objects
    The registry is an enabling technology underlying digital repository operations and preservation activities