Evaluating Ingest Success: Using the AIHT Michael L. Nelson, Joan A. Smith Department of Computer Science Old Dominion University Norfolk VA 23508 DCC.

Presentation transcript:

Evaluating Ingest Success: Using the AIHT Michael L. Nelson, Joan A. Smith Department of Computer Science Old Dominion University Norfolk VA DCC / LUCAS Joint Workshop Liverpool UK Nov 30 - Dec 1, 2006

Outline AIHT Recap ODU Approach to AIHT AIHT Lessons Learned –not evaluation –not ingest Create a tool that would have made AIHT easy

Archive Ingest and Handling Test Summary: a tar file of the filesystem + database for (~57k files) provided to: –Johns Hopkins University –Harvard University –Old Dominion University –Stanford University Goal: –“ingest” the tar file –migrate one format to another (simulate the passage of time) –exchange contents with one other partner

Outline AIHT Recap ODU Approach to AIHT AIHT Lessons Learned –not evaluation –not ingest Create a tool that would have made AIHT easy

Fortress Model: Five Easy Steps for Preservation 1. Get a lot of $ 2. Buy a lot of disks, machines, tapes, etc. 3. Hire an army of staff 4. Load a small amount of data 5. “Look on my archive ye Mighty, and despair!”

ODU’s Research Goals We’re in the CS department, not the library –Less infrastructure (bad) –More freedom (good) Interested in repository/object interaction –Long-range vision: repositories fade away; objects are responsible for their own preservation –Could we accomplish this with our “bucket” technology? Significant questions about archive granularity Transition to MPEG-21 Digital Item Declaration Language (DIDL) based buckets New models for digital preservation?

Buckets Buckets: self-contained, web-accessible objects –Grew out of research for serving NASA documents, esp. NACA Reports – implicit assumptions: 1 bucket = 1 logical item (N physical items) Display is for human use Bucket contents are DOM-parsable

Bucket / MPEG-21 Model MPEG-21 DIDL Payload Bucket Infrastructure methods logs support libraries

MPEG-21 DIDL A generic, powerful complex object metadata format –Based on an abstract data model –Semantics separated from syntax i.e. the tags don’t mean anything -- a little disconcerting at first glance –Digital library use championed by LANL

MPEG-21 DIDL Data Model How to encode Archive? 1 file = 1 DID 1 archive = 1 container 1 archive = 1 component 1 file = 1 component

1 File = 1 Component 8 file archive for demo purposes…

Looking Inside the Archive

Looking at a Single File…

Design Decisions: File Storage Store each file as a –Big: each file is base64’d into the DIDL –Small: each file is ref’d from the DIDL to a directory Filename = MD5 hash of the original file name (not contents!) + a version number Example:
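The naming rule on this slide (MD5 of the original file name, not its contents, plus a version number) can be sketched in a few lines of Python. This is only an illustration: the function name `storage_name`, the "." separator, and the integer version are assumptions, since the slide's own example did not survive in the transcript.

```python
import hashlib


def storage_name(original_name: str, version: int) -> str:
    """Name a stored file by the MD5 of its original *name*, plus a version."""
    digest = hashlib.md5(original_name.encode("utf-8")).hexdigest()
    # The "." separator and integer version are assumptions for illustration.
    return f"{digest}.{version}"


# The same original name always maps to the same hash prefix, so a migrated
# copy differs from its predecessor only in the version suffix.
first = storage_name("docs/report.pdf", 1)
migrated = storage_name("docs/report.pdf", 2)
```

Because the hash is taken over the name, two versions of a file share a prefix even when their contents differ, which is what makes it possible to tell versions of the same file apart from unrelated files.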

Design Decisions: Ingestion For every program/process to apply to a file, create a corresponding –Jhove –Unix “file” –Fred URI –MD5 of file contents Expandable, scriptable list of metadata extraction / analysis programs Ingestion is parallelized over a workstation cluster
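A minimal sketch of an expandable extractor list, assuming each extractor is a function that takes a file path and returns a string. The two extractors here are Python stand-ins, not the Jhove or Unix "file" programs named on the slide, and the thread pool merely stands in for the workstation cluster; `EXTRACTORS`, `ingest`, and `ingest_all` are hypothetical names.

```python
import hashlib
import mimetypes
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def md5_of_contents(path: Path) -> str:
    """MD5 of the file contents (distinct from the MD5 of the file name)."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def guessed_mime(path: Path) -> str:
    """Stand-in for a format identifier; guesses from the extension only."""
    return mimetypes.guess_type(path.name)[0] or "application/octet-stream"


# Expandable list: adding an extractor means adding one entry here.
EXTRACTORS = {"md5": md5_of_contents, "mime-guess": guessed_mime}


def ingest(path: Path) -> dict:
    """Apply every registered extractor to one file; one result per program."""
    return {name: extract(path) for name, extract in EXTRACTORS.items()}


def ingest_all(paths, workers: int = 4) -> dict:
    """Ingest many files concurrently (threads stand in for a cluster)."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(ingest, paths)))
```

Calling `ingest_all(Path("archive").rglob("*"))` on a directory of regular files would return one dictionary of extractor outputs per file; the directory name is only an example.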

Example Output: MD5 perl/Digest::M D a1bcd2b e7cf05f36066d4cdc9cf

Conversion: Linking Old to New If the previous version of the Resource was specified as: then the new version of the resource is specified as:

Harvard Ingest Harvard’s model was the most similar to our MPEG-21 model Ingesting from another archive is (roughly) the same as initial ingest –Save any metadata that was delivered in the original METS file as a We don’t trust it, but it might be useful for future forensics –Re-ingest in the normal way Our export is part of the bucket API: – External Metadata image/jpeg Canon Canon EOS D

Outline AIHT Recap ODU Approach to AIHT AIHT Lessons Learned –not evaluation –not ingest Create a tool that would have made AIHT easy

The DIP is the TMD* Using METS or MPEG-21, there is no need for a separate transfer metadata format METS & MPEG-21 can be the lumps of XML exchanged between OAI-PMH harvesters & repositories – mpel/12vandesompel.html * Apologies to Marshall McLuhan Figure 1, Bekaert & Van de Sompel

Validation is Subjective Preservation metadata is like a David Hockney photo collage: each image is both true and incomplete, and while the result is not faithful, it does capture the “essence”

Alternate Models of Preservation Lazy Preservation –Let Google, IA et al. preserve your website Just-In-Time Preservation –Wait for it to disappear first, then recover a “good enough” version Shared Infrastructure Preservation –Push your content to sites that might preserve it Web Server Enhanced Preservation –Use Apache modules to create archival-ready resources

Outline AIHT Recap ODU Approach to AIHT AIHT Lessons Learned –not evaluation –not ingest Create a tool that would have made AIHT easy

Web Site Preservation The counting problem How many pages are on that site? To save it you have to find it The representation problem What’s that page all about? Future use requires understanding

Limitations of HTTP There is no “Select *” in HTTP –Crawlers cannot request a list of all URLs for the site –Crawlers can only GET one resource at a time –HTTP cannot give a crawler a list of resources it has Cannot ask for only new resources –Conditional GET by datestamp or etag is limited –Cannot get a list of pages that have been deleted –Each resource must be requested, one at a time HTTP alone is insufficient to confidently enumerate a site’s resources

[Figure: timeline of web file formats (XBM, GIF, PNG, JP2, PS, PDF) plotted against time, marking the current browser-accessible formats. Slide from “Dynamic Web File Format Transformations with Grace”, IWAW 05]

Using OAI-PMH: give me all resources, and their preservation metadata, from site foo, dating from 9/15/2004 through today, that are MIME type video/mpeg. mod_oai solution: integrate OAI-PMH functionality into the web server itself… 1. Use mod_oai, an Apache 2.0 module that automatically answers OAI-PMH requests for an HTTP server (written in C; respects values in .htaccess, httpd.conf) 2. Install mod_oai on 3. Define baseURL: Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
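The "give me all video/mpeg resources since 9/15/2004" query can be sketched as an ordinary OAI-PMH request URL. The arguments `verb`, `metadataPrefix`, `from`, `until`, and `set` are standard OAI-PMH; the baseURL `http://www.example.org/modoai`, the prefix `oai_didl`, and the set name `mime:video:mpeg` are assumptions made for this sketch, not values confirmed by the slides.

```python
from urllib.parse import urlencode


def oai_request(base_url, verb, metadata_prefix=None,
                from_date=None, until=None, set_spec=None):
    """Build an OAI-PMH request URL, omitting any argument left as None."""
    params = {"verb": verb}
    optional = {
        "metadataPrefix": metadata_prefix,
        "from": from_date,   # "from" is a Python keyword, hence from_date
        "until": until,
        "set": set_spec,
    }
    params.update({k: v for k, v in optional.items() if v is not None})
    return f"{base_url}?{urlencode(params)}"


# Hypothetical baseURL, metadata prefix, and set name (see lead-in).
url = oai_request(
    "http://www.example.org/modoai",
    "ListRecords",
    metadata_prefix="oai_didl",
    from_date="2004-09-15",
    set_spec="mime:video:mpeg",
)
```

Leaving out `until` asks for everything from the start date through the present, which matches the "through today" wording of the example query.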

Addressing the Counting Problem: ListIdentifiers CRAWLER: issues a ListIdentifiers, finds URLs of updated resources does HTTP GET updates only can get URLs of resources with specified MIME types EXPAND mod_oai approach: Web log lists File system lists Configuration information
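A crawler-side sketch of the first step, assuming a ListIdentifiers response whose identifiers are the resource URLs. The sample response is abbreviated (a real one also carries `responseDate` and `request` elements) and its URLs are invented; only the OAI-PMH 2.0 namespace and the `header`, `identifier`, and `datestamp` element names come from the protocol.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Abbreviated, invented response for illustration only.
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.example.org/index.html</identifier>
      <datestamp>2004-09-16T10:00:00Z</datestamp>
    </header>
    <header>
      <identifier>http://www.example.org/talk.mpg</identifier>
      <datestamp>2004-10-02T08:30:00Z</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>
"""


def updated_resources(response_xml: str):
    """Return (identifier, datestamp) pairs from a ListIdentifiers response."""
    root = ET.fromstring(response_xml)
    found = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        datestamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        found.append((identifier, datestamp))
    return found


to_fetch = updated_resources(SAMPLE_RESPONSE)
```

Each identifier in `to_fetch` would then be retrieved with a plain HTTP GET, so the crawler touches only resources the server reported as updated.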

Addressing the Representation Problem: ListRecords in DIDL Format CRAWLER: Makes a ListRecords query, Gets updates as MPEG-21 DIDL records (HTTP headers, resource By Value or By Reference) can get resources with specified MIME types EXPAND OAI-PMH approach: Add ability to incorporate other metadata output Build metadata-rich complex object response Encapsulate within existing OAI-PMH DIDL metadata format response

“Born Archival” It would be nice if HTML, PDF, MS Office, etc. applications encoded preservation oriented metadata –Stewart Brand - “born archival” That would be really nice… while we’re wishing…

CRATE: A Model for Web Resource Preservation Fits with OAIS Preservation Model Text-based protocol for long-term survivability Complex object format supported by HTTP via OAI-PMH Utilizes web-server to support preservation at dissemination

CRATE Example: PostScript File Metadata –Descriptive Summary Index words / term frequency –Administrative Copyright Item version information Last modified date –Technical Target application version MIME type Fred URI System compatibility Checksum Signature Resource –base64-encoded resource Information-rich preservation package
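The shape of such a package (metadata alongside a base64-encoded resource) can be sketched as follows. The element names `crate`, `metadata`, `technical`, `administrative`, and `resource` are invented for this illustration and are not the actual CRATE schema; only three of the metadata fields listed on the slide are shown.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET


def build_crate(resource: bytes, mime_type: str, last_modified: str) -> ET.Element:
    """Bundle a resource and a little metadata in one XML element.

    Element names are hypothetical, chosen only for this sketch.
    """
    crate = ET.Element("crate")
    metadata = ET.SubElement(crate, "metadata")
    technical = ET.SubElement(metadata, "technical")
    ET.SubElement(technical, "mimeType").text = mime_type
    checksum = ET.SubElement(technical, "checksum", algorithm="MD5")
    checksum.text = hashlib.md5(resource).hexdigest()
    administrative = ET.SubElement(metadata, "administrative")
    ET.SubElement(administrative, "lastModified").text = last_modified
    payload = ET.SubElement(crate, "resource", encoding="base64")
    payload.text = base64.b64encode(resource).decode("ascii")
    return crate


def extract_resource(crate: ET.Element) -> bytes:
    """Decode the payload and check it against the recorded checksum."""
    data = base64.b64decode(crate.findtext("resource"))
    expected = crate.findtext("metadata/technical/checksum")
    if hashlib.md5(data).hexdigest() != expected:
        raise ValueError("checksum mismatch")
    return data


original = b"%!PS-Adobe-3.0\n"
crate = build_crate(original, "application/postscript", "2006-11-30")
serialized = ET.tostring(crate, encoding="unicode")
recovered = extract_resource(ET.fromstring(serialized))
```

Because the payload is base64 text, the whole package stays a plain-text document, in line with the text-based approach described for CRATE.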

CRATE and the OAIS Information Model base64-encoded resource CRATE Metadata from plug-ins: Summary, index, MIME / GDFR, format analysis… URL (for web access), OAI Identifier, baseURL (for harvester access) OAI-PMH GetRecord Request SIP: submitted by web server AIP: contains archival information DIP: transmitted to other repositories or to user for extraction

Example CRATE Plug-Ins for mod_oai
Name | Description
Jhove | File format analysis
Kea | Keyphrase extraction
OTS | Open Text Summarizer
ExifTool | Image/video metadata extractor
Pdflib | Extract PDF metadata
MP3-Tag | Extract audio file tags
Essence | Customized information extraction
Plug-in design allows for any type of extraction tool to be included Flexible architecture elements: Tags | Argument-Name | Version | CDATA output Simple Apache configuration file modification to enable plug-in Plug-ins written by 3rd-party programmers Output from plug-ins is not validated!
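One way to carry unvalidated plug-in output is to wrap it in a CDATA section, recording the plug-in name and version as attributes. The sketch below assumes an invented `plugin` element and an invented example output; it is not mod_oai's actual output format. The only subtlety is that the sequence `]]>` cannot appear inside CDATA, so it is split across two sections.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def wrap_plugin_output(name: str, version: str, output: str) -> str:
    """Wrap raw plug-in output in CDATA without inspecting or validating it.

    The <plugin> element and its attributes are hypothetical.
    """
    # "]]>" would end the CDATA section early, so split it across two sections.
    safe = output.replace("]]>", "]]]]><![CDATA[>")
    return (
        f"<plugin name={quoteattr(name)} version={quoteattr(version)}>"
        f"<![CDATA[{safe}]]></plugin>"
    )


# Invented plug-in name, version, and output, for illustration only.
record = wrap_plugin_output("file", "4.17", "PostScript document text <conforming>")
parsed = ET.fromstring(record)
```

The wrapper guarantees only that the surrounding XML stays well-formed; whether the output itself is correct remains a separate, local validation question, as the slide notes.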

Summary AIHT caused us to think differently about web resource preservation Goal of CRATE / mod_oai is to dynamically produce as much preservation metadata as possible at dissemination time –“Ingestion goes faster if you can disseminate ‘crated’ metadata” Validation is a separate, local process –“Dissemination goes faster if you don’t sweat validation”