Presentation is loading. Please wait.

Presentation is loading. Please wait.

METS, MODS and PREMIS, Oh My! (and a little MIX and other schema too)

Similar presentations


Presentation on theme: "METS, MODS and PREMIS, Oh My! (and a little MIX and other schema too)"— Presentation transcript:

1 METS, MODS and PREMIS, Oh My! (and a little MIX and other schema too)
Integrating Digital Library Standards for Interoperability and Preservation Thomas Habing, Grainger Engineering Library Information Center University of Illinois at Urbana-Champaign ALA Summer 2007 Habing

2 Presentation Outline Brief Background on our Project
Hub and Spoke METS Profile MODS for descriptive metadata PREMIS for technical and provenance metadata MIX (plus some others) for media-specific technical metadata Technical Implementation in Java Future Plans ALA Summer 2007 Habing

3 NDIIPP ECHODEP1 http://ndiipp.uiuc.edu/
Quick Project Background NDIIPP ECHODEP1 Repository Evaluation Tools development Web harvesting and archiving (OCLC’s WAW) ** Hub and Spoke interoperability and preservation architecture ** Preservation Research preserving the authenticity and semantic meaning of digital resources through time. 1Exploring Collaborations to Harness Objects in a Digital Environment for Preservation ALA Summer 2007 Habing

4 Hub and Spoke Repository Interoperability Architecture with a forward-looking emphasis on preservation metadata and activities ALA Summer 2007 Habing

5 The Problem Plethora of repositories Overabundance of data sources
Not just across institutions, but even with a single institution Overabundance of data sources Web crawlers like Heritrix or OCLC's WAW, digitization and scanning services, individual authors, batch ingest from legacy systems Current integration solutions are local and ad hoc Enforcing centralized preservation policy difficult H&S for Interoperability There are currently many different digital repositories in widespread use, such as DSpace, Greenstone, Fedora, Eprints, contentDM, also including OCLC's and CDL's digital archive services, to name just a few.  There are also many different sources of input into these systems, such as web crawlers like Heretrix or OCLC's Web Archivist Workbench, or numerous digitization and scanning services.  It also is not uncommon for several of these systems to be in use within a single institution, and if multiple institutions within a consortia want to share data it is very likely that multiple of these system will come into play.  In short, the problem being addressed by the H&S is that almost none of these systems can inter-operate beyond a rudimentary level, usually nothing more than an OAI data provider supplying simple Dublin Core metadata.  Even if they have embraced some of the OAIS concepts (which few have) such as submission or dissemination information packages (SIPs and DIPs), their implementation of these concepts vary greatly.  For example, a DIP produced by DSpace has nothing in common with a SIP which could be used by Eprints.  Because of these problems, achieving any level of interoperability between these systems usually entails some level of custom software development, and any time a new repository is thrown into the mix some of that software development will need to be redone. The H&S interoperability system seeks to address these issues with the development of a common METS-based profile, a standard programming API, and a series of scripts that use that API and METS profile for creating SIPs and DIPs which can be used across different repositories. H&S for Preservation This problem is based on the same premise as the interoperability statement.  There are many different repositories.  Few of these repositories have any explicit support for preservation, either preservation metadata, such as PREMIS, or activities to support preservation, such as format migrations or checksum validations.  For an institution with several of these systems deployed, which is common, something as simple as performing consistent backups to off-line storage like tape can be complicated by the fact that these system all store the underlying data differently.  There may be data stored in relation databases, XML databases, RDF triple stores, and different file systems all of which must be backed up, usually using various different backup techniques. It can be this complex even for a single system, no less for a combination of such systems. The H&S preservation system will leverage the work done on interoperability.  The METS profile will be treated as a common archival information package (AIP) with support for the PREMIS preservation metadata schema, and the programming API will be enhanced to support common preservation activities, such a generation and validation of technical metadata as well as provenance metadata, format migrations, among others.  The system will also included mechanism for persisting AIPs most likely using the JSR-170 standard for content repositories. This will provide a common preservation data store for all digital packages regardless of which other repository they might have been created in or which other repository is used for search and browse by end users.  In other words, the H&S repository will be used for preservation activities which can benefit greatly from a common system, but other activities related to the digital objects, such as creation or access can still occur in whatever repository is best suited for that activity. ALA Summer 2007 Habing

6 A Solution A common METS-based profile A standard programming API
A series of scripts that use the API and METS profile for creating Information Packages which can be ‘used’ across different repositories ALA Summer 2007 Habing

7 ALA Summer 2007 Habing

8 To-Hub Spoke Hub Data Store / DIPs ALA Summer 2007 Habing image.jpg
Generate/collect provenance metadata Data Store / DIPs Hub Extract format-specific technical metadata Generate/collect digital provenance metadata Embed links to digital items image.jpg Model structure of the item Embed native metadata Transform/enrich native metadata metadata.xml ALA Summer 2007 Habing

9 From-Hub Spoke Hub SIPs ALA Summer 2007 Habing metadata.xml
Generate provenance metadata SIPs Hub Transform hub metadata to repository-compatible metadata Assemble into packages for repository ingest Add the METS file as an item in the submission package metadata.xml hubMets.xml ALA Summer 2007 Habing

10 METS Profile The METS Profile is the ‘Hub’ Two Registered Profiles
Also May be overlaid on top of, or inherited from, other profiles Primary Focus of Profiles Digital preservation Repository interoperability minimally at the technical and descriptive metadata level, not at the structural level or file format level Web captures Focus on preservation, not access agnostic regarding file formats or structures ALA Summer 2007 Habing

11 METS Profile in More Detail
Descriptive Metadata Primary DMD is MODS Alternate DMD are encouraged Provenance for DMD is required Technical Metadata PREMIS object entities MIX for images Other metadata for other media types Digital Provenance PREMIS events and agents ALA Summer 2007 Habing

12 Simple Object Example ALA Summer 2007 Habing

13 Descriptive Metadata for the Entire Package
MODS as the primary descriptive metadata The Aquifer MODS profile is used as the minimal requirement (see presentation by Sarah Shreeves) Other descriptive metadata schema should be preserved as alternative dmdSec’s Transformations of descriptive metadata must be recorded in digiprovMD sections using PREMIS event and agent elements Individual files may have their own dmdSec’s; these are considered outside the scope of our profile. However we encourage the use of relatedItem’s in the primary MODS for this purpose. ALA Summer 2007 Habing

14 Technical Metadata for Files
A techMD section wrapping a PREMIS object element is required for each file or bit-stream Minimal required elements: fixity, size, formatDesignation creatingApplication and software are encouraged especially for MIME types starting with ‘application/…’ ALA Summer 2007 Habing

15 Technical Metadata for Files
Alternative technical metadata schemas for different media types are encouraged: MIX for images textMD for text AUDIOMD for audio VIDEOMD for video Where possible we are using JHOVE to derive all of these; the profile also allows raw JHOVE output to be used in techMD ( ALA Summer 2007 Habing

16 Technical Metadata for Representations
Technical metadata can also be associated with representations There is a special required techMD called the ‘primary representation’ that corresponds to the entire METS file. Used mostly for alternate identifiers for the file, but may also be used to record other technical metadata about the whole METS document Each structural map may also have representation technical metadata. ALA Summer 2007 Habing

17 Digital Provenance Recorded for all non-trivial changes to:
Descriptive Metadata (must) Creation, Transformation, Modification, Deletion Files and Bitstreams (should) Events from PREMIS data dictionary Structural Maps (may) PREMIS event and optional associated agents are wrapped in a digiprovMD ALA Summer 2007 Habing

18 Using PREMIS in METS All linking via ID & IDREF-type attributes not identifier elements Embedding Object in techMD Event in digiprovMD Rights in rightsMD Agent in digiprovMD or rightsMD All Files at a Composition level of 0 No packaging, compression, or encryption ALA Summer 2007 Habing

19 Profile for Web Captures
Inherits almost everything from base profile Adds rules for the primary structural map Adds rules for referencing ARC files and their constituents from the fileSec ARC is used by Internet Archive, Heritrix web crawler, and OCLC’s WAW ALA Summer 2007 Habing

20 Challenges in Developing the Profile
How to deal with overlaps between the various schema Properties that occur in multiple places METS attributes, PREMIS elements, MODS elements, MIX elements Differences in how to tie sections together ID and IDREFS or embedded identifiers or nested XML elements What METS sections in which to embed the various PREMIS entities ALA Summer 2007 Habing

21 Java Implementation Partially complete and in-work Open source
Base-level API, plus support for DSpace and to lesser degree Fedora Open source Javadocs: Source Code ALA Summer 2007 Habing

22 Technical Architecture (Java)
ALA Summer 2007 Habing

23 Future Plans Add support for other repositories such as
CONTENTdm EPrints Develop additional sub-profiles Transformations/Adaptations to/form other METS profiles Continue to improve the documentation and program code ALA Summer 2007 Habing

24 Questions? ALA Summer 2007 Habing


Download ppt "METS, MODS and PREMIS, Oh My! (and a little MIX and other schema too)"

Similar presentations


Ads by Google