JHOVE2: A Next-Generation Architecture for Format-Aware Characterization. British Library, 1 October 2008. Stephen Abrams, California Digital Library.


1 JHOVE2: A Next-Generation Architecture for Format-Aware Characterization
British Library, 1 October 2008
Stephen Abrams, California Digital Library
Sheila Morrissey and Evan Owens, Portico
Tom Cramer and Keith Johnson, Stanford University

2 Thanks to…
Koninklijke Bibliotheek, our meeting sponsor
British Library, our meeting host
Library of Congress, our funding agency

3 Agenda
Introductions
Review of objectives and agenda
Project goals, deliverables, and schedule
New terminology
Functional requirements
Assessment
Technology options
Community engagement
Next steps

4 Introductions
Who are you?
Where are you from?
What is your level of involvement with JHOVE?

5 Objectives
Get feedback on the utility and achievability of project goals, deliverables, and schedule
Introduce new terminology and concepts
Refine functional requirements
Understand what assessment means in preservation workflows
Propose community engagement plan
Gauge the potential for development partnerships
Solicit interesting test data

6 Why JHOVE2?
Preservation requires management of the gap between what you were given and what you need
That gap is only manageable if it is quantifiable
Characterization tells you what you have, as the starting point for iterative preservation planning and action
Adopted from A. Brown, “Developing Practical Approaches to Active Preservation,” IJDC 2:1 (June 2007): 3-11.

7 JHOVE2 project
A next-generation architecture for format-aware object characterization
– Three-fold goals: (1) re-factor the existing architecture to achieve higher performance, simplify system integration, and encourage third-party enhancement; (2) provide significant new function; (3) implement modules
Collaborative project of CDL, Portico, and Stanford University
– Funded by Library of Congress/NDIIPP
– Open source BSD license

8 New function
Complex object data model
Generic plug-in interface
Common data structure passed between modules to enable stateful processing
Identification de-coupled from validation
Standardized handling of format profiles and error reporting
Symbolic display of binary formats
API-level support for editing

9 Format support
Based on project partner requirements and budgetary constraints
– Image: JPEG 2000, TIFF
– Audio: WAVE
– Text: SGML, UTF-8, XML
– Document: PDF
– GIS: Shapefile
– Color: ICC
– And their well-known variants, e.g. TIFF/IT, TIFF/EP, GeoTIFF, EXIF, DNG, …
Unfortunately precluding some JHOVE-supported formats
– AIFF, GIF, HTML, JPEG

10 Schedule
Months 1-6: Outreach, design, and prototyping
Months 7-9: Core APIs and framework
Months 10-24: Module implementation

11 Terminology
What is it?
– Identification: Determining presumptive format through signature matching
What is it, really?
– Validation: Determining conformance to commonly accepted normative requirements
What about it?
– Feature extraction: Reporting intrinsic properties significant to preservation planning and action
What should you do with it?
– Assessment: Determining acceptability for a given purpose on the basis of locally defined policies

12 Objects, not files
JHOVE assumed 1 object = 1 file = 1 format
But what about…
– TIFF with embedded ICC profile and XMP metadata: 1 object = 1 file = 3 formats
– JPEG 2000 JPX fragmentation: 1 object = n files = 1 format
– ESRI Shapefile: 1 object = 3 files = 3 formats
JHOVE2 will support 1 object = n files = m formats
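
The "1 object = n files = m formats" model can be illustrated with a minimal Java sketch. All class and method names here are hypothetical and are not taken from JHOVE2; the format labels for the Shapefile members (SHP, SHX, DBF) are illustrative only.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical types for illustration; not the JHOVE2 API.
final class SourceFile {
    final String name;
    final List<String> formats = new ArrayList<>();

    SourceFile(String name, String... formats) {
        this.name = name;
        for (String f : formats) {
            this.formats.add(f);
        }
    }
}

final class SourceObject {
    final String label;
    final List<SourceFile> files = new ArrayList<>();

    SourceObject(String label, SourceFile... files) {
        this.label = label;
        for (SourceFile f : files) {
            this.files.add(f);
        }
    }

    int fileCount() {
        return files.size();
    }

    // Distinct formats across all member files, in first-seen order.
    Set<String> formats() {
        Set<String> all = new LinkedHashSet<>();
        for (SourceFile f : files) {
            all.addAll(f.formats);
        }
        return all;
    }
}

final class ObjectModelSketch {
    // 1 object = 1 file = 3 formats
    static SourceObject tiffWithEmbeddedMetadata() {
        return new SourceObject("scan",
                new SourceFile("scan.tif", "TIFF", "ICC", "XMP"));
    }

    // 1 object = 3 files = 3 formats
    static SourceObject shapefile() {
        return new SourceObject("roads",
                new SourceFile("roads.shp", "SHP"),
                new SourceFile("roads.shx", "SHX"),
                new SourceFile("roads.dbf", "DBF"));
    }
}
```

Keeping the format list on the file and aggregating at the object level lets the same two types cover all three cases on the slide.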

13 Objects, not files

14 Functional requirements
JHOVE2 will provide a highly configurable, extensible, and scalable framework for preservation characterization
JHOVE2 can process an arbitrary number of source units during a single invocation
JHOVE2 function will be encapsulated into granular plug-in modules that can be installed in and invoked by the framework
The output of characterization will be expressed in an intermediate XML form, with further presentation via XSL
JHOVE2 can be invoked through an API or as a stand-alone application with a command line interface
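
One way to picture granular plug-in modules invoked by a framework over several source units is the following Java sketch. It is an assumption-laden illustration, not the JHOVE2 design: `CharacterizationModule`, `Framework`, and the two example modules are hypothetical names, and the shared state is reduced to a simple string map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical plug-in contract; not the JHOVE2 API.
interface CharacterizationModule {
    String name();

    // Reads and updates the state shared by all modules for one source unit.
    void process(String sourceUnit, Map<String, String> state);
}

final class Framework {
    private final List<CharacterizationModule> modules = new ArrayList<>();

    Framework install(CharacterizationModule module) {
        modules.add(module);
        return this;
    }

    // Runs every installed module, in order, over each source unit.
    Map<String, Map<String, String>> characterize(String... sourceUnits) {
        Map<String, Map<String, String>> results = new LinkedHashMap<>();
        for (String unit : sourceUnits) {
            Map<String, String> state = new LinkedHashMap<>();
            for (CharacterizationModule module : modules) {
                module.process(unit, state);
            }
            results.put(unit, state);
        }
        return results;
    }
}

// Toy module: guesses a format from the file extension only.
final class ExtensionIdentifier implements CharacterizationModule {
    public String name() {
        return "extension-identifier";
    }

    public void process(String sourceUnit, Map<String, String> state) {
        String lower = sourceUnit.toLowerCase(Locale.ROOT);
        state.put("format", lower.endsWith(".pdf") ? "PDF" : "unknown");
    }
}

// Toy module: depends on what an earlier module left in the shared state.
final class FormatReporter implements CharacterizationModule {
    public String name() {
        return "format-reporter";
    }

    public void process(String sourceUnit, Map<String, String> state) {
        state.put("report", sourceUnit + " is " + state.get("format"));
    }
}
```

A call such as `new Framework().install(new ExtensionIdentifier()).install(new FormatReporter()).characterize("a.pdf", "b.bin")` processes both source units in a single invocation.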

15 Identification requirements
Based on extrinsic hints and internal/external signatures
– Aggregate-level identification signatures are based on file-level properties
Identification may be possible for formats for which JHOVE2 feature extraction or validation is not supported
Identification will be reported in terms of level of confidence
Identification will be reported in terms of all known common names and public identifiers
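
Identification from internal signatures and extrinsic hints, reported with a level of confidence, might look like the Java sketch below. The class names and the three confidence labels (HIGH, LOW, NONE) are assumptions made for illustration; only the PDF and TIFF leading byte sequences come from the respective format specifications.

```java
import java.util.Locale;

// Hypothetical result type; the confidence labels are illustrative.
final class Identification {
    final String format;
    final String confidence; // "HIGH", "LOW", or "NONE"

    Identification(String format, String confidence) {
        this.format = format;
        this.confidence = confidence;
    }
}

final class SignatureIdentifier {
    // Internal signatures: leading bytes defined by the PDF and TIFF specifications.
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] TIFF_LE = {'I', 'I', 0x2A, 0x00};
    private static final byte[] TIFF_BE = {'M', 'M', 0x00, 0x2A};

    private static boolean startsWith(byte[] data, byte[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if (data[i] != signature[i]) {
                return false;
            }
        }
        return true;
    }

    // An internal signature match outranks an extrinsic hint (the file extension).
    static Identification identify(String fileName, byte[] head) {
        if (startsWith(head, PDF)) {
            return new Identification("PDF", "HIGH");
        }
        if (startsWith(head, TIFF_LE) || startsWith(head, TIFF_BE)) {
            return new Identification("TIFF", "HIGH");
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) {
            return new Identification("PDF", "LOW");
        }
        if (lower.endsWith(".tif") || lower.endsWith(".tiff")) {
            return new Identification("TIFF", "LOW");
        }
        return new Identification("unknown", "NONE");
    }
}
```

Because the sketch only identifies and never parses further, it also shows how a format can be identified without any validation or feature extraction support.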

16 Validation requirements
Conformance reported in terms defined by the format itself
The applicability of ambiguous conformance requirements will be under local configuration control
Validation will be exhaustive, documenting all identified violations (not merely the first)
Ability to distinguish syntactic and semantic conformance
Error and informative messages will reference specific passages in the relevant specifications
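
Exhaustive validation can be sketched as a checker that keeps going after the first problem and attaches a specification reference to each message. The Java example below applies this to the 8-byte TIFF Image File Header. The class names, the "syntactic"/"semantic" labels, and the choice to fall back to little-endian reading when the byte order is invalid are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical message type; not the JHOVE2 API.
final class Violation {
    final String kind;      // "syntactic" or "semantic" (illustrative labels)
    final String message;
    final String reference; // passage in the relevant specification

    Violation(String kind, String message, String reference) {
        this.kind = kind;
        this.message = message;
        this.reference = reference;
    }
}

final class TiffHeaderValidator {
    private static final String REF = "TIFF 6.0, Section 2, Image File Header";

    // Reads an unsigned integer of len bytes in the given byte order.
    private static long read(byte[] b, int off, int len, boolean bigEndian) {
        long v = 0;
        for (int i = 0; i < len; i++) {
            int idx = bigEndian ? off + i : off + len - 1 - i;
            v = (v << 8) | (b[idx] & 0xFF);
        }
        return v;
    }

    // Collects every violation found rather than stopping at the first.
    static List<Violation> validate(byte[] h) {
        List<Violation> found = new ArrayList<>();
        if (h.length < 8) {
            found.add(new Violation("syntactic", "header shorter than 8 bytes", REF));
            return found; // nothing further can be read
        }
        boolean bigEndian = h[0] == 'M' && h[1] == 'M';
        boolean littleEndian = h[0] == 'I' && h[1] == 'I';
        if (!bigEndian && !littleEndian) {
            // Assumption: continue checking, reading the rest as little-endian.
            found.add(new Violation("syntactic", "byte order is neither II nor MM", REF));
        }
        if (read(h, 2, 2, bigEndian) != 42) {
            found.add(new Violation("syntactic", "magic number is not 42", REF));
        }
        long offset = read(h, 4, 4, bigEndian);
        if (offset < 8) {
            found.add(new Violation("semantic", "first IFD offset points inside the header", REF));
        }
        if (offset % 2 != 0) {
            found.add(new Violation("semantic", "first IFD offset is not on a word boundary", REF));
        }
        return found;
    }
}
```

Returning a list instead of throwing on the first failure is what allows a single pass to document all identified violations.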

17 Feature extraction requirements
Control over the granularity of parsing and reporting will be a local configuration option
General reportable properties will include:
– Source unit name and modification date
– Reportable unit size and offset
– Format identifiers and version
– Format profile
Genre- and format-specific properties will be reported in terms of well-known data dictionaries and schemas
– ANSI/NISO Z39.87 for still image
– AES X-098B for audio

18 Assessment
Evaluation of characterization data on the basis of local policy to inform decisions and instigate actions
Focus is on “instance risk” in an object
– JHOVE2 cannot answer, “Is JPEG 2000 a suitable archival format?”
– But it can answer, “I have a JPEG 2000 format policy; how does this object measure up to it?”
Assessment…
– Makes technical metadata actionable
– Provides a common framework in which digital memory institutions can publish and share their policies

19 Assessment policy
Expressed in a set of user-configured rules
Any piece of characterization data is eligible for analysis
– Presence/absence of a particular value or informative/error message
– Particular range or combination of values
Default policies associated with each format module, based on community best practice
– Easily modified, extended, or replaced to conform to local policy and practice
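
A policy expressed as user-configured rules over characterization data could be sketched in Java as follows. Everything here is hypothetical: the `Rule`, `Rules`, and `Policy` types, and property keys such as `format.profile`, `message.error`, and `image.width`, are invented for the example and do not reflect JHOVE2's rule syntax or property names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical rule contract; not the JHOVE2 API.
interface Rule {
    String description();

    boolean passes(Map<String, String> data);
}

final class Rules {
    // Passes when the named property is present.
    static Rule present(final String key) {
        return new Rule() {
            public String description() { return key + " must be present"; }
            public boolean passes(Map<String, String> data) { return data.containsKey(key); }
        };
    }

    // Passes when the named property is absent.
    static Rule absent(final String key) {
        return new Rule() {
            public String description() { return key + " must be absent"; }
            public boolean passes(Map<String, String> data) { return !data.containsKey(key); }
        };
    }

    // Passes when the named property is an integer within [min, max].
    static Rule inRange(final String key, final long min, final long max) {
        return new Rule() {
            public String description() { return key + " must be in " + min + ".." + max; }
            public boolean passes(Map<String, String> data) {
                String raw = data.get(key);
                if (raw == null) {
                    return false;
                }
                try {
                    long v = Long.parseLong(raw);
                    return v >= min && v <= max;
                } catch (NumberFormatException e) {
                    return false;
                }
            }
        };
    }
}

final class Policy {
    private final List<Rule> rules = new ArrayList<>();

    Policy add(Rule rule) {
        rules.add(rule);
        return this;
    }

    // Returns the descriptions of all failed rules; empty means acceptable.
    List<String> assess(Map<String, String> data) {
        List<String> failures = new ArrayList<>();
        for (Rule rule : rules) {
            if (!rule.passes(data)) {
                failures.add(rule.description());
            }
        }
        return failures;
    }
}
```

Because a policy is just a list of rules, a default policy shipped with a format module could be extended or replaced locally by adding or swapping entries.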

20 Assessment output
Summary report of information not elsewhere summarized
Ability to create custom preservation metadata
Can be extended to initiate downstream processes in local workflows automatically

21 Assessment in workflows
In a flow of homogeneous objects, verifying specifications
– Am I receiving what I expected to receive?
In a flow of heterogeneous objects, conducting triage
– Which risks are tolerable? Which aren’t?
– What level of service can I offer the depositor?

22 Ingest workflow

23 Migration workflow

24 Technical components
Java
– java.nio package
– 1.5 vs 1.6
OSGi/Spring frameworks
– Component versioning and dependency management
– Fine-grained control of component invocation
– Inversion of control
SourceForge
– Distribution platform
– Issue tracking

25 Data abstraction
Based on the “natural” conceptual structures of a format and their component attributes
– Each such structure maps to a class with methods for parsing, validating, reporting, and serializing
– Each such attribute maps to a field with accessor and mutator methods
UTF-8 → Character
TIFF → Image File Header and Image File Directory
JPEG 2000 → Box
PDF → boolean, number, string, name, array, dictionary, and stream
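
As a rough illustration of mapping a format structure to a class, the Java sketch below models the header of a JPEG 2000 box using `java.nio.ByteBuffer`. The class shape and method names are assumptions, not JHOVE2 code, and the validity check is deliberately simplified. The header layout (a 4-byte big-endian length, a 4-byte type, and an 8-byte extended length when the first length field equals 1) follows the JP2 file format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical class for illustration; covers the box header only.
final class Box {
    private long length; // LBox, or XLBox when LBox == 1
    private String type; // TBox, four characters

    long getLength() { return length; }

    void setLength(long length) { this.length = length; }

    String getType() { return type; }

    void setType(String type) {
        if (type.length() != 4) {
            throw new IllegalArgumentException("box type must be four characters");
        }
        this.type = type;
    }

    // Parsing: ByteBuffer is big-endian by default, matching the box header.
    static Box parse(ByteBuffer in) {
        long lbox = in.getInt() & 0xFFFFFFFFL;
        byte[] t = new byte[4];
        in.get(t);
        Box box = new Box();
        box.setType(new String(t, StandardCharsets.ISO_8859_1));
        box.setLength(lbox == 1 ? in.getLong() : lbox);
        return box;
    }

    // Validating (simplified): 0 means "to end of file"; otherwise the
    // length must at least cover the 8-byte header.
    boolean isValid() {
        return length == 0 || length >= 8;
    }

    // Reporting
    String report() {
        return "Box[type=" + type + ", length=" + length + "]";
    }

    // Serializing: uses the extended form only when the length needs it.
    ByteBuffer serializeHeader() {
        boolean extended = length > 0xFFFFFFFFL;
        ByteBuffer out = ByteBuffer.allocate(extended ? 16 : 8);
        out.putInt(extended ? 1 : (int) length);
        out.put(type.getBytes(StandardCharsets.ISO_8859_1));
        if (extended) {
            out.putLong(length);
        }
        out.flip();
        return out;
    }
}
```

Parsing the first eight bytes of a JP2 signature box (`00 00 00 0C 6A 50 20 20`) with this class gives a length of 12 and the type `jP` followed by two spaces, and `serializeHeader()` writes the same eight bytes back.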

26 Community engagement
Wiki: confluence.ucop.edu/display/JHOVE2Info/Home
Mailing lists: JHOVE2-Announce-L, JHOVE2-Techtalk-L (subscribe via the wiki)

27 Advisory board
Deutsche Nationalbibliothek
Ex Libris
Fedora Commons
Florida Center for Library Automation
Harvard University
Koninklijke Bibliotheek
MIT/DSpace
National Archives (UK)
National Library of Australia
National Library of New Zealand
Planets project

28 Next steps?

