GDFR Pilot Discussion, The National Archives, Washington DC, July 10, 2008
Agenda: Introductions – (All); Purpose of meeting – (Dale); Roles – (Dale, Richard); Background/history – (Stephen); GDFR Governance Workshop – (Richard, Robert); Architecture – (Stephen); Current state – (Andrea); Relationship to PRONOM – (Andrea); Issues and observations – (Dale); Use cases – (Andrea); Discussion of pilot – (All); Review next steps from GDFR Governance Workshop Report – (Richard, Robert); Outreach to other interested parties – (All); Next steps – (All)
Introductions All
Purpose of the meeting Dale Flecker
Roles: Harvard – Dale Flecker; NARA – Richard Steinbacher
Background/History Stephen Abrams
Background/History: Format is the key piece of representation information that permits preservation activities to be focused on interpretable/renderable content, not just opaque bit strings. Opaque bit string: ffd8ffe000104a46494600010201 008300830000ffed0fb050686f74 6f73686f7020332e30003842494d 03e90a5072696e7420496e666f00 0000007800000000004800480000 000002f40240ffeeffee03060252 0347052803fc0002000000480048 0000000002d80228000100000064 000000010003030300000001270f 0001000100000000000000000000 000060080019019000000000 ... Interpreted structure: SOI, APP0 JFIF 1.2, APP13 IPTC, APP2 ICC, DQT, SOF0 183x512, DRI, DHT, SOS, ECS0, ...
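To make the contrast on this slide concrete, here is a minimal sketch of how knowledge of the format turns the first bytes of the bit string into named segments. It assumes only the general JPEG/JFIF layout (two-byte markers followed by a big-endian segment length); the function name `describe` and the small marker table are illustrative and not part of GDFR.

```python
import struct

# First 24 bytes of the hex dump on the slide: SOI, the APP0/JFIF segment,
# and the marker and length of the APP13 segment that follows.
HEX = "ffd8ffe000104a46494600010201008300830000ffed0fb0"

# Illustrative subset of JPEG marker codes (second byte after 0xFF).
MARKER_NAMES = {0xD8: "SOI", 0xE0: "APP0", 0xED: "APP13"}


def describe(data):
    """Return the names of the leading JPEG segments found in data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream: missing SOI marker")
    found = ["SOI"]
    pos = 2
    # Each following segment is: 0xFF, marker byte, 2-byte big-endian length
    # (the length counts itself but not the marker).
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        name = MARKER_NAMES.get(marker, "0x%02X" % marker)
        if marker == 0xE0 and payload[:5] == b"JFIF\x00":
            # Bytes 5 and 6 of the JFIF payload hold the major/minor version.
            name += " JFIF %d.%d" % (payload[5], payload[6])
        found.append(name)
        pos += 2 + length
    return found


if __name__ == "__main__":
    print(describe(bytes.fromhex(HEX)))
```

Without the format definition the same bytes are just numbers; with it, a tool can report segment names and the JFIF version that the slide lists.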
Background/History: Traditional methods of managing format information, e.g. the IANA MIME registry, are insufficiently descriptive and granular for effective preservation planning and intervention. The application/msword format is essentially defined as anything produced by the Word application. TIFF 6.0, TIFF/IT, TIFF/EP, GeoTIFF, … all map to the single type image/tiff.
Background/History: Two DLF-sponsored invitational workshops: Univ. Pennsylvania, January 2003; Washington, March 2003. Two independent demonstration projects: FRED (John Ockerbloom, Univ. Pennsylvania); FOCUS (Joseph JaJa, Univ. Maryland).
Background/History: Evolving consensus on scope: a forum for documenting normative definitions of format syntax and semantics; a common facility to pool and share scarce technical expertise on a global basis; a channel for the distribution of that expertise to the international community of preservation practitioners; a foundation for additional value-added services requiring detailed knowledge of digital formats.
Background/History Peer-to-peer network of independent, but cooperating registries
Background/History: Harvard University Library (HUL) funded for 2 years by the Andrew W. Mellon Foundation. Technical deliverables only; no funded governance/policy activity. Staffing and technical work subcontracted to OCLC (July 2006).
NARA Governance Workshop Richard Steinbacher Robert Chadduck
Architecture Stephen Abrams
Architecture: A generic distributed registry framework, specialized for the GDFR application. Based on well-known products and protocols; human and machine interfaces; full information content expressible in XML form, and can be re-instantiated from that expression; platform independence; globally fault tolerant; open source.
Architecture Data model is an extension of PRONOM 4
Architecture Based on the OCLC IWSA/RFA framework
Architecture: Java, Apache/Tomcat, Berkeley DB XML. GNU LGPL license, covering both technology newly developed for the project and pre-existing OCLC technology.
Current state Andrea Goethals
Current state: schedule. July 31, 2008: contract with OCLC ends; GDFR source node at Harvard goes public in beta mode. August 2008 up to August 2010: Harvard maintains GDFR software, website and source node.
Current state: GDFR Home website. It moved from http://www.formatregistry.org (old GDFR Home) to http://www.gdfr.info (new GDFR Home). All existing GDFR docs migrated from the old GDFR Home website. Over the next month: updated documentation! Demo source node?
Current state: architecture. Currently: one GDFR source node, where all data additions and edits are performed; many GDFR mirror nodes, holding replicated data. Future? Multiple GDFR source nodes? Multiple interoperable format registry source nodes? “Discoverable” from GDFR Home website. Each node has 2 interfaces: a user interface for humans and a web service interface for machines.
Current state: GDFR source node. Housed by Harvard for now at http://www.formatregistry.org/registry and populated with test data (~2000 formats from Magic database). Need authorized account to add/edit data.
Current state: GDFR mirror nodes. Test mirror nodes at OCLC and Harvard. Anyone can run a mirror node; it synchronizes data with the source node; you can brand your mirror node.
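The source/mirror relationship can be pictured with a small sketch. The slides do not describe the actual synchronization mechanism, so everything here is an assumption made for illustration: the class names, the per-record revision counter, and the pull-based `synchronize` method are hypothetical and are not the GDFR implementation.

```python
class SourceNode:
    """Hypothetical source node: the only place where records are added or edited."""

    def __init__(self):
        self.records = {}   # format id -> (revision, description)
        self.revision = 0   # increases by one on every add or edit

    def put(self, format_id, description):
        self.revision += 1
        self.records[format_id] = (self.revision, description)

    def changes_since(self, revision):
        # Records whose latest revision is newer than the caller's last sync.
        return {fid: rec for fid, rec in self.records.items() if rec[0] > revision}


class MirrorNode:
    """Hypothetical mirror node: holds a read-only replica pulled from the source."""

    def __init__(self, source):
        self.source = source
        self.records = {}
        self.synced_revision = 0

    def synchronize(self):
        # Pull only what changed since the last synchronization.
        changes = self.source.changes_since(self.synced_revision)
        self.records.update(changes)
        self.synced_revision = self.source.revision
        return len(changes)


if __name__ == "__main__":
    source = SourceNode()
    source.put("example-format-1", "test description")
    mirror = MirrorNode(source)
    print(mirror.synchronize(), mirror.records == source.records)
```

The one-way flow (edits at the source, pulls at the mirror) matches the current single-source arrangement; multiple source nodes, raised elsewhere as a future question, would need conflict handling that this sketch does not attempt.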
Current state: Mirror node set-up. Dependencies: Apache 2 (mod_rewrite, mod_jk, mod_perl2); Tomcat 5.5.x; Berkeley DB XML 2.3.10; Perl 5.8.x; Java 1.5. Installation & configuration: half day.
User interface. Mirror node: search, browse, lookup/retrieve, export, manage node. Source node: same as mirror node, plus add, edit. Sneak preview.
Current state: machine interface. Web services using SRU. Can do everything supported by the human user interface except browsing, plus mirror-to-source node synchronization.
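As a rough illustration of what a machine client of an SRU-based interface looks like, the sketch below builds a standard SRU 1.1 searchRetrieve request URL. The parameter names (operation, version, query, maximumRecords) come from the SRU protocol itself; the endpoint address and the query term are assumptions, since the slides do not give the GDFR web service address or its supported indexes.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
EXAMPLE_BASE_URL = "http://registry.example.org/sru"


def sru_search_url(base_url, cql_query, max_records=10):
    """Build an SRU 1.1 searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,               # CQL query string
        "maximumRecords": str(max_records),
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(sru_search_url(EXAMPLE_BASE_URL, "TIFF"))
```

A client would send an HTTP GET to the resulting URL and parse the XML response; which record schemas a GDFR node returns is not stated in the slides.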
Relationship to PRONOM Andrea Goethals
Relationship to PRONOM – what’s the problem? Two different “format” registries: overlapping but diverging data models; no common format model; no mechanism to exchange data. PRONOM is in production, GDFR is not yet. PRONOM has been publicly available for over 4 years and is used by some preservation repositories; it interoperates with DROID and is the basis for the PLANETS project. How many format registries does the digital preservation community need? Depends on how different they are…
Relationship to PRONOM – core differences. Who governs the registry and makes policy, scope and enhancement decisions? PRONOM: TNA; GDFR: community-based. Who adds and edits format information? PRONOM: TNA (does take addition requests). Where is the format information physically located? PRONOM: at TNA; GDFR: replicated in different geographic locations.
Relationship to PRONOM – what’s the solution? Recognize there is a problem – DONE. Mutual willingness to resolve: TNA desire to participate in a GDFR pilot. Common web service API across the registries? PRONOM could become a GDFR node; PRONOM and GDFR could each support a new web service API. Cross-walk PRONOM PUIDs and GDFR GFIDs? Use common format identification tools (DROID, JHOVE, etc.) with either registry.
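One way to picture the cross-walk idea is a two-way lookup table between the two identifier schemes. This is only a sketch: the function name is hypothetical, and the identifier values in the usage example are made-up placeholders, not real PUID or GFID assignments (the slides do not show the syntax of either).

```python
def build_crosswalk(pairs):
    """Build two-way lookup tables from (puid, gfid) pairs.

    Raises ValueError if an identifier appears in more than one pair,
    so the cross-walk stays one-to-one.
    """
    puid_to_gfid = {}
    gfid_to_puid = {}
    for puid, gfid in pairs:
        if puid in puid_to_gfid or gfid in gfid_to_puid:
            raise ValueError("duplicate mapping for %s / %s" % (puid, gfid))
        puid_to_gfid[puid] = gfid
        gfid_to_puid[gfid] = puid
    return puid_to_gfid, gfid_to_puid


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    to_gfid, to_puid = build_crosswalk([
        ("puid-placeholder-1", "gfid-placeholder-A"),
        ("puid-placeholder-2", "gfid-placeholder-B"),
    ])
    print(to_gfid["puid-placeholder-1"], to_puid["gfid-placeholder-B"])
```

A table like this would let an identification tool report a result once and have it resolved against either registry; a strict one-to-one mapping is itself an assumption, since the two data models may not describe formats at the same granularity.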
Issues and Observations Dale Flecker
Use cases Andrea Goethals
Use cases – 3 sets (see handout): higher-level use cases submitted by many institutions (early 2003); lower-level use case model created for the software design (2006-7); use cases arising from informal talks and meetings.
Key use cases – discussed but not supported: determine duplicates; notifications/warnings; determine migration/emulation pathways; determine at-risk formats (machine-actionable risk assessments); support the registry & discovery of GDFR nodes; authentication of nodes and users (outside the UI); storage of local profiles separate from central formats; synchronizations based on vetted or non-vetted data; determine “quality” of format information; multiple source nodes.
Use cases – common issues. How evaluative should GDFR be? Neutral vs judgmental. Are services in the scope of GDFR? Should GDFR provide services directly (notifications, validation, etc.) or should GDFR be a reference that can be used by external services?
Discussion of pilot All
Discussion of pilot Purposes
Discussion of pilot Pilot use cases
Discussion of pilot Process
Discussion of pilot Participants
Review next steps from the GDFR Governance Workshop Report Richard Steinbacher Robert Chadduck
Outreach to other interested parties All
Next steps? All