Tools and Services for the Long Term Preservation and Access of Digital Archives Joseph JaJa, Mike Smorul, and Sangchul Song Institute for Advanced Computer.

Presentation transcript:

Tools and Services for the Long Term Preservation and Access of Digital Archives
Joseph JaJa, Mike Smorul, and Sangchul Song
Institute for Advanced Computer Studies
Department of Electrical and Computer Engineering
University of Maryland, College Park

Background
- Started as an ERA project focused on setting up and testing a distributed archiving infrastructure.
- Evolved into the development of archiving tools and services that are scalable and platform independent.
- In addition to continued NARA support, the work has been supported by NSF, the Library of Congress, and the Mellon Foundation.

Main Tools Developed
- PAWN: a flexible software environment for ingestion and for handling producer-archive interactions.
- ACE: tools to ensure the long-term integrity of digital holdings, based on rigorous cryptographic methodologies.
- PISA: methods to ensure compact storage and fast retrieval of archived web contents.
- A tool for tracking and monitoring the digital holdings of an archive.

Software Developed and Tested on TPAP
[Architecture diagram. Labels: Ingestion Workflow (PAWN); Data Management; Metadata Management; Administrative Metadata; Preservation Metadata; Descriptive Metadata; Deep Archive Storage; Data Grid Storage; Digital Library Storage; Metadata; Data; Search; Access; Monitoring and Preservation Services]

PAWN – Producer Archive Workflow Network
- Software that provides a flexible and customizable ingestion framework.
- Handles the process in a reliable and secure fashion, from package assembly to archival storage.
- Simple interface for end users.
- Flexible interface for archive managers.
- Designed for use in multiple contexts.

Overall Organization
- Producers are organized into domains; each domain contains a transfer agreement negotiated with the archive.
- Each domain contains a hierarchical organization of data grouped into record sets/templates (convenient groupings from the transfer agreement).
- An end user operates within a domain, with record sets associated with the account.
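
The organization described above can be pictured as a small data model. The following is only a sketch: the class and field names (`Domain`, `RecordSet`, `Account`, `transfer_agreement`, and so on) are hypothetical and are not taken from the PAWN code base.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RecordSet:
    """A named grouping of data; record sets may be nested (hypothetical model)."""
    name: str
    children: list[RecordSet] = field(default_factory=list)

    def walk(self):
        """Yield this record set and all nested record sets, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class Domain:
    """A producer domain with its negotiated transfer agreement."""
    name: str
    transfer_agreement: str  # placeholder for the real agreement document
    record_sets: list[RecordSet] = field(default_factory=list)


@dataclass
class Account:
    """An end-user account that operates inside exactly one domain."""
    user: str
    domain: Domain
    record_set_names: set[str] = field(default_factory=set)

    def visible_record_sets(self):
        """Return the names of the domain's record sets tied to this account."""
        return [rs.name
                for top in self.domain.record_sets
                for rs in top.walk()
                if rs.name in self.record_set_names]
```

An account only sees the record sets associated with it, even though the domain may hold a deeper hierarchy.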

Producer-Archive Agreement

Package Workflow Overview
1. Create the Producer-Archive Agreement and client package template.
2. Create a package based on the template.
3. Optionally, review submitted items.
4. Invoke publishing processes.

Customizable Components
- Definable roles: actions in PAWN can be grouped to create arbitrary types of users.
- Flexible approval requirements: signature requirements can be placed on parts of a package.
- Automated processing:
  - API for creating processes to validate, transform, approve, or publish items in a package.
  - Processes can be invoked manually or automatically.
  - Processes may have dependencies on item approval.

Sample Submission
1. Client ingests image data.
2. First process chain: validators check the image format and mark 'good' files as approved.
3. Files that are rejected (miscellaneous MP3s, etc.) are held for manual processing.
4. Second chain: push approved files into DSpace, Fedora, or another repository.
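
The two process chains in this sample can be sketched as follows. This is an illustration only, not the PAWN process API: the function names are hypothetical, the "validator" merely checks file extensions rather than the actual image format, and the repository is a plain list standing in for DSpace or Fedora.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    approved: bool = False
    published: bool = False


def validate_images(items, allowed=(".tif", ".jpg", ".png")):
    """First chain: approve items whose extension looks like an image.

    A real validator would inspect the file contents; the extension
    check is a stand-in for this sketch.
    """
    for item in items:
        item.approved = item.name.lower().endswith(allowed)


def publish_approved(items, repository):
    """Second chain: push only approved items into the repository."""
    for item in items:
        if item.approved:
            repository.append(item.name)
            item.published = True


def run_submission(items, repository):
    """Run both chains and return the names held for manual processing."""
    validate_images(items)
    publish_approved(items, repository)
    return [item.name for item in items if not item.approved]
```

The second chain depends on item approval, so rejected files never reach the repository and remain available for manual review.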

PAWN Summary
- Flexible environment to handle ingestion between many producers and an archive.
- Very little effort is required for producers to push their data into the archive.
- Granular workflow definition, from fully automated to completely manual.
- Easy to include new standards (metadata, packaging, …).
- Tested in a number of environments (including the NARA TPAP testbed and the Library of Congress).

ACE – Auditing Control Environment
- Software to protect the integrity of digital assets in the long term:
  - Hardware/media degradation
  - Security breaches, malicious alterations
  - Infrequent access to most data
  - Evolution of cryptographic schemes
- Underpinnings are based on rigorous cryptographic techniques.
- Scalable, cost-effective, and can interoperate with any archiving architecture.

ACE – Basic Methodology
- Builds on cryptographic hashing by introducing additional layers of trust (layers of cryptographic summary information).
- Is not confined to the local processes of the archive, and assumes a third party that is not fully trusted.
- An independent party can assert the correctness of any object in the future based on the archive's information and publicly available information.
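
One way to picture "layers of cryptographic summary information" is a hash tree: many object digests are aggregated into a single summary value, and each object keeps a short proof that links its digest to that summary. The slides do not specify the data structure, so the Merkle-tree construction below is an assumption made for illustration, and the names `build_round` and `summary_from_proof` are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    """SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).digest()


def build_round(digests):
    """Aggregate digests into one summary value.

    Returns (summary, proofs), where proofs[i] is a list of
    (sibling_hash, sibling_is_left) pairs for digests[i].
    """
    level = list(digests)
    proofs = [[] for _ in digests]
    positions = list(range(len(digests)))  # node index of each leaf's ancestor
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        for leaf, pos in enumerate(positions):
            sibling = pos ^ 1
            proofs[leaf].append((level[sibling], sibling < pos))
            positions[leaf] = pos // 2
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0], proofs


def summary_from_proof(digest, proof):
    """Recompute the summary from one digest and its proof."""
    node = digest
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node
```

Under this assumption, an independent party holding an object's digest and proof (the archive's information) can recompute the summary and compare it with a published value, without trusting the archive's local processes.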

ACE – System Architecture

Components
- IMS: issues tokens for hashes that are to be monitored.
  - WSDL available.
  - Java API for bulk operations (uses WSDL).
- Audit Manager(s): local, per-archive installations.
  - Monitor bitstreams locally.
  - May be independent or part of larger software.

Audit Managers

ACE Audit
- Audit Local Files: the Audit Manager periodically scans all files and compares stored digests with computed digests.
- Audit Local Manager: the Manager computes a round summary for each digest using that digest and its token. This is compared to the value stored on the IMS.
- IMS Audit: round summaries are used to compute witness values. These are compared with offsite witness values.
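
The first audit level, comparing stored digests with freshly computed ones, can be sketched in a few lines. This is not the ACE Audit Manager code; the function names, the choice of SHA-256, and the status labels are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    sha = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def audit_files(root, stored_digests):
    """Compare stored digests with computed digests.

    stored_digests maps a path relative to root to its expected hex digest.
    Returns a dict mapping each path to 'ok', 'corrupt', or 'missing'.
    """
    root = Path(root)
    report = {}
    for rel_path, expected in stored_digests.items():
        target = root / rel_path
        if not target.is_file():
            report[rel_path] = "missing"
        elif file_digest(target) == expected:
            report[rel_path] = "ok"
        else:
            report[rel_path] = "corrupt"
    return report
```

A periodic scheduler would call `audit_files` over each collection and raise an alert for any entry that is not `'ok'`. The second and third audit levels additionally check the stored digests themselves against the IMS and offsite witness values.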

ACE Summary
- TPAP Audit
  - 1.1 TB of images
  - 1.5+ million small files (1.2 TB)
  - Single portal for collections on disk, SRB, iRODS
- Chronopolis
  - 3 collections
  - 5+ million files, 12.2 TB total
- High performance, scalable
- Version 1.0 publicly available

Tracking and Replication Monitoring
- Portal that provides an overview of a collection's status across different zones.
- Ensures that new objects are replicated to relevant sites.
- Tracks files at master locations and periodically copies new files to replica sites.
- Logs actions on a collection and errors during replication.
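
The tracking step can be sketched as a set difference between the master listing and each replica listing. The function names and the `copy` callback are hypothetical; a real deployment would list and transfer files through the storage system's own interface.

```python
def plan_replication(master_files, replica_files):
    """Return {site: sorted list of files missing at that site}.

    master_files is a set of file names at the master location;
    replica_files maps each replica site to the set of names it holds.
    """
    plan = {}
    for site, present in replica_files.items():
        missing = sorted(set(master_files) - set(present))
        if missing:
            plan[site] = missing
    return plan


def replicate(master_files, replica_files, copy, log):
    """Copy missing files to each replica, logging actions and errors.

    copy(name, site) is a caller-supplied transfer function (an assumption
    of this sketch); log is a list that receives one tuple per event.
    """
    for site, missing in plan_replication(master_files, replica_files).items():
        for name in missing:
            try:
                copy(name, site)
                replica_files[site].add(name)
                log.append(("copied", site, name))
            except OSError as exc:
                log.append(("error", site, name, str(exc)))
```

Running `replicate` periodically keeps the replicas converging on the master, and the log gives the portal a record of both successful copies and failures.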

Web Archiving: Compact Storage and Fast Retrieval
- New technology for storing and indexing web archives.
- Uses standard web containers (WARC) and stores only unique contents by detecting duplicates before storage.
- Indexing structure based on advanced multiversion B-trees.
- Significantly improved storage and performance over existing technologies.
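
Duplicate detection before storage can be illustrated with a content-addressed store: each payload is hashed, and the bytes are kept only the first time that digest is seen. This in-memory sketch is an assumption for illustration; it does not write WARC containers and is not the PISA implementation. The class name `DedupStore` is hypothetical.

```python
import hashlib


class DedupStore:
    """Keep one copy of each distinct payload, plus a record of every capture."""

    def __init__(self):
        self.blobs = {}      # content digest -> payload bytes
        self.captures = []   # (url, timestamp, digest) for every crawl result

    def add(self, url, timestamp, payload):
        """Record a capture; return True only if the payload was new."""
        digest = hashlib.sha256(payload).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = payload
        self.captures.append((url, timestamp, digest))
        return is_new

    def get(self, url, timestamp):
        """Return the payload captured for a URL at a given timestamp."""
        for u, t, digest in self.captures:
            if (u, t) == (url, timestamp):
                return self.blobs[digest]
        raise KeyError((url, timestamp))
```

A page that does not change between crawls then costs only one small capture record per crawl instead of a full copy of its content.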

Scalable Technology for Information Discovery of Web Archives
- Allows discovery through a combination of words and time spans.
- Handles temporal queries efficiently, rather than using "search and then filter". Example: "Retrieve documents containing September 11 which were written before 2001".
- Returned web links are ranked according to an appropriate scoring function.
- Allows the possibility of coalescing similar versions of a web page.
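
The idea of querying by words and time spans together can be sketched with postings that carry a validity interval, so the time constraint is applied while postings are read rather than as a filter over a finished result list. This toy index scans each posting list linearly and performs no ranking; it is not the indexing structure used in the work described here, and the class name `TemporalIndex` is hypothetical.

```python
from collections import defaultdict


class TemporalIndex:
    """Toy inverted index whose postings carry a validity interval."""

    def __init__(self):
        # word -> list of (valid_from, valid_to, doc_id)
        self.postings = defaultdict(list)

    def add(self, doc_id, text, valid_from, valid_to):
        """Index one page version, valid over [valid_from, valid_to)."""
        for word in set(text.lower().split()):
            self.postings[word].append((valid_from, valid_to, doc_id))

    def search(self, words, start, end):
        """Return sorted doc ids that contain all words and whose
        validity interval overlaps [start, end)."""
        result = None
        for word in words:
            hits = {doc for lo, hi, doc in self.postings.get(word.lower(), [])
                    if lo < end and start < hi}
            result = hits if result is None else result & hits
        return sorted(result or [])
```

For the example query on the slide, `search(["september", "11"], 0, 2001)` returns only versions that were valid before 2001.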

Organization of Archived Web Contents
- Efficient browsing of archived web contents based on web graph analysis and graph partitioning techniques.
- Archived web contents are organized into web containers using standard WARC formats.

Other Technologies
- PAWN-related: APIs for different packaging technologies (METS and XFDU).
- ICDL Book Builder: interface to enable bulk ingestion of digital objects already managed by a database.
- FOCUS (FOrmat CUration Service): a scalable and secure registry for persistent information and services applied to formats.

Conclusion
- The initial effort started as an ERA project and has grown substantially over the last few years.
- Focus has been on platform- and architecture-independent tools and services that are scalable and cost effective.
- Empirical testing and evaluation using a wide variety of NARA and NDIIPP collections and different infrastructures.
- Partnerships have played a crucial role.