Society of American Archivists 2008 Annual Meeting. Capturing the E-Tiger: New Tools for Email Preservation.

Presentation transcript:

Society of American Archivists 2008 Annual Meeting
Capturing the E-Tiger: New Tools for Email Preservation
Collaborative Electronic Records Project

The Collaborative Electronic Records Project (CERP)
Design a preservation system and tools capable of preserving and maintaining digital records
  A strong emphasis on email records
Implement the system and tools at the partner organizations
Produce a practicable preservation system model for use by other small to medium archives

CERP Partners
Rockefeller Archive Center
  Depositors include the Rockefeller family, their philanthropic and educational organizations, and non-family philanthropies.
  Little to no access to the depositors' systems creating email or other digital records.
Smithsonian Institution Archives
  Depositors include the Institution, related persons and organizations, and other donors related to the history of American science.
  Email transferred from a variety of systems, typically 5 years or more after becoming inactive.
  Active digital preservation and curation program.

Architecture
Depositor email system(s): These systems and their email clients do not interact with the CERP system.
Transferred email: Email is transferred after it has become inactive. The depositor determines the file formats and the physical transfer media. Transfer events are not expected to follow a pre-defined schedule.
CERP workstations
Repository Server

Architecture (continued)
Transferred email
CERP workstations: If necessary, transferred email goes through a preliminary transformation into 'mbox' format. An XML file of the email account is generated by the CERP Parser. The XML is incorporated into the Archival Information Package (AIP) along with updated metadata information and Preservation Description Information (PDI).
Repository Server: The AIP is loaded into the Repository Server.
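The mbox-to-XML step described above can be illustrated with a minimal Python sketch. This is not the CERP Parser (which the slides describe as a Squeak Smalltalk prototype), and the element names `Account`, `Message`, `Header`, and `Body` are hypothetical placeholders rather than the project's schema. It only shows the general shape of turning a whole account into one XML document.

```python
import email
import mailbox
import xml.etree.ElementTree as ET


def account_to_xml(account_name, messages):
    """Build one XML tree for a whole account from an iterable of messages."""
    root = ET.Element("Account", {"name": account_name})  # hypothetical tag names
    for number, msg in enumerate(messages, start=1):
        node = ET.SubElement(root, "Message", {"seq": str(number)})
        for header in ("From", "To", "Date", "Subject", "Message-ID"):
            value = msg.get(header)
            if value is not None:
                ET.SubElement(node, "Header", {"name": header}).text = str(value)
        # Multipart bodies (attachments) are skipped in this sketch.
        body = msg.get_payload() if not msg.is_multipart() else ""
        ET.SubElement(node, "Body").text = body
    return root


def mbox_to_xml(mbox_path, account_name):
    """Read an mbox file from disk and serialize the account as XML bytes."""
    box = mailbox.mbox(mbox_path)
    try:
        return ET.tostring(account_to_xml(account_name, box), encoding="utf-8")
    finally:
        box.close()


# Small in-memory demonstration that does not need an mbox file.
sample = email.message_from_string("From: a@example.org\nSubject: Hello\n\nBody text\n")
tree = account_to_xml("demo-account", [sample])
```

Keeping the whole account in one tree mirrors the account-centric approach in the slides; a real converter would also need to handle attachments and characters that are not legal in XML.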

Choosing an Email Account Model
Given a starting point of email messages selected by an account owner for archival deposit, relationships between those emails, as well as any supplemental meaning that the owner has assigned through his/her organization of that account, are valuable information that must be captured.
With a 'message' model, thorough documentation of each message, its interrelationships, and its context within the account is overwhelming in the face of volume.
With an 'account' model, many of the relationships between the emails are already documented within the emails themselves. Further relationships, especially those assigned by the account owner, are present in the account structure and organization at the point of transfer.
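One concrete case of relationships being documented within the emails themselves is the standard `Message-ID` and `In-Reply-To` header pair. The following Python sketch, which is an illustration and not part of the CERP tools, rebuilds a reply map from those headers alone; the function name `reply_map` and the sample addresses are assumptions.

```python
import email
from collections import defaultdict


def reply_map(messages):
    """Map each Message-ID to the Message-IDs of its direct replies."""
    children = defaultdict(list)
    for msg in messages:
        parent = msg.get("In-Reply-To")
        own_id = msg.get("Message-ID")
        if parent and own_id:
            children[parent.strip()].append(own_id.strip())
    return dict(children)


first = email.message_from_string(
    "Message-ID: <1@example.org>\nSubject: Budget\n\nDraft attached.\n"
)
reply = email.message_from_string(
    "Message-ID: <2@example.org>\nIn-Reply-To: <1@example.org>\n"
    "Subject: Re: Budget\n\nLooks fine.\n"
)
threads = reply_map([first, reply])
```

Because the links travel with the messages, preserving the account as a unit keeps them recoverable without separate per-message documentation.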

Email Account Preservation File
Viability and risks of the native email format
  Which one?
  How well is it documented?
  How long will software exist to read it?
  Which companies (if any) have a real commitment to stability and longevity?
Choosing eXtensible Markup Language (XML)?
  XML is open, human readable and "self describing"
  A good descriptive schema supports validity checking
  There are many open source tools to create, manipulate and read XML

The Value of the Email Account Preservation (EMAP) Schema
PRESERVATION: A schema defines how the XML tags for the various parts of an email relate to each other. It is the Rosetta stone that guides how raw email is converted to XML.

The Value of the EMAP Schema
STORAGE: Authorization filter to verify that an object purporting to be an authentic preserved account is what it claims to be.
SEARCHING:
  Structure for subsequent search, display
  Level of tagging enables deep data-mining
  Cross-account searching, and possibly broader federated searches
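The idea of a schema acting as a filter can be sketched in Python. Full XML Schema validation needs a library with XSD support, which the standard library lacks, so this example substitutes a much weaker structural check: well-formedness plus a table of required child elements. The tag names and the `REQUIRED_CHILDREN` rules are hypothetical and are not taken from the EMAP schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules standing in for a real schema: parent tag -> required children.
REQUIRED_CHILDREN = {"Account": ["Message"], "Message": ["Header", "Body"]}


def check_account_xml(xml_text):
    """Return a list of problems; an empty list means the document passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    problems = []
    if root.tag != "Account":
        problems.append(f"unexpected root element <{root.tag}>")
    for element in root.iter():
        for child_tag in REQUIRED_CHILDREN.get(element.tag, []):
            if element.find(child_tag) is None:
                problems.append(f"<{element.tag}> is missing <{child_tag}>")
    return problems


good = "<Account><Message><Header name='Subject'>Hi</Header><Body>x</Body></Message></Account>"
bad = "<Account><Message><Body/></Message></Account>"
```

Here `check_account_xml(good)` returns an empty list, while `check_account_xml(bad)` reports the missing header element. A repository could refuse to ingest any object for which the list is not empty.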

From Transfer to AIP
Various transfer methods
Metadata gathering
Attachment diagnosis
Preliminary format transformation
Final preservation transformation
METS generation and final metadata
AIP assembly

Using METS in the AIP
Multiple types of metadata = excellent wrapper
DMDSec
  Accession metadata stored in Dublin Core
  Not limited to one descriptive metadata syntax
FileGroup
FileSec
StructMap
Other information options available
  AdminSec
METS format for DSpace ingest
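A minimal Python sketch of such a wrapper follows, using the element names from the METS schema (`dmdSec`, `fileSec`, `fileGrp`, `structMap`) that correspond to the sections listed above. It is an illustration under assumptions, not the CERP METS profile: the `build_mets` function, the `USE` value, the IDs, and the file path are all made up for the example.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
DC = "http://purl.org/dc/elements/1.1/"
for prefix, uri in (("mets", METS), ("xlink", XLINK), ("dc", DC)):
    ET.register_namespace(prefix, uri)


def build_mets(title, files):
    """Wrap Dublin Core metadata and file pointers; files is [(file_id, path)]."""
    root = ET.Element(f"{{{METS}}}mets")

    # Descriptive metadata section carrying a Dublin Core title.
    dmd = ET.SubElement(root, f"{{{METS}}}dmdSec", {"ID": "DMD1"})
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", {"MDTYPE": "DC"})
    xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    ET.SubElement(xml_data, f"{{{DC}}}title").text = title

    # File section and structural map, filled in together below.
    file_sec = ET.SubElement(root, f"{{{METS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS}}}fileGrp", {"USE": "preservation"})
    struct = ET.SubElement(root, f"{{{METS}}}structMap")
    div = ET.SubElement(struct, f"{{{METS}}}div", {"DMDID": "DMD1"})
    for file_id, path in files:
        entry = ET.SubElement(group, f"{{{METS}}}file", {"ID": file_id})
        ET.SubElement(
            entry, f"{{{METS}}}FLocat", {"LOCTYPE": "URL", f"{{{XLINK}}}href": path}
        )
        ET.SubElement(div, f"{{{METS}}}fptr", {"FILEID": file_id})
    return root


doc = build_mets("Sample email account", [("F1", "preservation/account.xml")])
```

The descriptive section here holds Dublin Core, but the `mdWrap` mechanism accepts other syntaxes, which is the flexibility the slide points to.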

Conversion Results
We have converted and validated 70 thousand messages in three test sets to the XML Mail-Account schema
  Smithsonian: 5,537 messages in 232 MB of recent Outlook mail; 99.97% successfully parsed (4 could not be parsed)
  Smithsonian: 28,000+ messages in a 1.5 GB Outlook account; all but 5 successfully parsed
  Rockefeller Archives: 43,778 messages in 378 MB of older eclectic mail; 99.85% successfully parsed (74 unparsed, but improvement is clearly possible)
Parse speed for an account with attachments: about a quarter gigabyte per hour on a ThinkPad T40 (March 2008)

Variety is the Spice of Email
Dozens of common email systems and hundreds of others
We have encountered mail from Eudora (multiple versions), Simeon for MacPPC, Outlook/Exchange (multiple versions), Apple Mail, Lotus Notes, GroupWise, Mozilla/Firefox, Pegasus Mail, and various Internet mail services such as Gmail, Hotmail, Yahoo Mail, Juno, and AOL
Each has its peculiarities
  Some use non-standard date formats
  European and Asian mail may contain non-ASCII (actually, non-UTF-8) characters
  Older email may have HTML in inappropriate places
  Forwarded and other "child" messages may be included in nonstandard forms
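Two of the peculiarities above, non-standard dates and non-ASCII header text, can be sketched with Python's standard `email` package. This is an illustrative approach, not how the CERP Parser handles them; the `FALLBACK_FORMATS` list is an assumption and would in practice be built from formats actually seen in the accounts.

```python
from datetime import datetime
from email.header import decode_header, make_header
from email.utils import parsedate_to_datetime

# Illustrative fallback formats only; not taken from the CERP test sets.
FALLBACK_FORMATS = ["%Y-%m-%d %H:%M", "%d.%m.%Y %H:%M", "%m/%d/%Y %I:%M %p"]


def parse_mail_date(raw):
    """Try the standard date parser first, then a list of fallback formats."""
    try:
        parsed = parsedate_to_datetime(raw)
        if parsed is not None:
            return parsed
    except (TypeError, ValueError):
        pass  # not a standard date; fall through to the custom formats
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None  # unparseable: leave for human inspection


def decode_mail_header(raw):
    """Decode RFC 2047 encoded words such as =?iso-8859-1?q?...?= to text."""
    return str(make_header(decode_header(raw)))
```

For example, `parse_mail_date("2008-03-03 10:15")` is rejected by the standard parser and picked up by the first fallback format, and `decode_mail_header("=?iso-8859-1?q?Caf=E9?=")` yields `Café`.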

The Parser

The CERP Parser
First and foremost, it is a prototype
It was built in an Open Source development system: Squeak Smalltalk v3.9
  A portable development environment that runs on Windows, Linux, and Macintosh
Squeak was chosen because it is a very powerful prototyping system. We can debate the relative merits of other prototyping languages (Java, Ruby, whatever, …) off-line.

The Web Application Interface
The parser can be run from within Squeak, but most users will prefer to run it from a Web browser
The Web interface is built with a popular Squeak Web Application development framework called Seaside
Seaside uses a web server (Comanche) that is embedded in Squeak. Comanche is confined to supporting the parser and the Seaside application interface.

Running the CERP Parser
Start the parser
Start the Web UI
  If necessary, start Seaside by executing “WAKom startOn: 9092”
  The Web UI runs at

Navigate to the directory containing the prepped account
Select the account folder
“Proceed with parsing”

Parsing Results Status

Preservation AIP
Source File(s)
Accession Metadata
Preservation Description Information (PDI)
Preservation File(s)
METS File
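The slides do not say which PDI elements CERP records. Fixity information is one category of PDI in the OAIS model, so as an assumption-laden illustration the Python sketch below computes a SHA-256 manifest over the components of a package and later checks it. The component paths and contents are placeholders.

```python
import hashlib


def fixity_manifest(components):
    """Map each relative path to the SHA-256 hex digest of its bytes."""
    return {
        path: hashlib.sha256(content).hexdigest()
        for path, content in sorted(components.items())
    }


def verify(components, manifest):
    """Return the paths whose current digest no longer matches the manifest."""
    current = fixity_manifest(components)
    return [path for path in manifest if current.get(path) != manifest[path]]


# Placeholder package contents; a real AIP would read these from disk.
package = {
    "source/account.mbox": b"raw mbox bytes",
    "preservation/account.xml": b"<Account/>",
    "mets.xml": b"<mets/>",
}
manifest = fixity_manifest(package)
```

Storing the manifest alongside the package lets a later audit detect a changed or missing component by calling `verify` again.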

Parsed Body Excerpt

Parsed Attachment Reference

Validation Message

Parser Subject-Sender Log

Parser Subject-Sender Log (cont.)

Long-term storage: Using DSpace
Selected for expediency
Significant limitations
Surmounting the scale and access obstacles will require further research
Other DSpace projects may generate some solutions

Email Preservation Issues
Complex account structures
Hierarchical structures
More than just formats
Email standards and adherence
Email system idiosyncrasies

Loose “Standards”
RFC 2822 and other standards are a good start that handle most cases.
Yet email continues to evolve and standards continue to lag.
To be widely adopted, lagging standards must support virtually all preexisting practices, an impossible goal without compromises that are open to interpretation.
Different email client vendors interpret the standards differently.
And there are the inevitable mismatches between interpretations (and inevitable bugs).

Preservation Lessons Learned
100% success is an unrealistic goal
  Some emails are just too broken to parse without manual intervention
  We can achieve at least 99.9% success (and save the few unparsed emails for human inspection)
  This error rate is not unlike physical archives
The EMAP Schema provides a very robust structure that can support sophisticated and complex access and retrieval
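The practice of setting unparsed messages aside rather than halting can be sketched in Python. This is a generic pattern, not the CERP Parser's logic: `parse_account` accepts any parsing function, and `toy_parser` is a hypothetical stand-in that fails on input without a header/body separator.

```python
def parse_account(raw_messages, parse_one):
    """Parse every message; set aside failures instead of stopping."""
    parsed, needs_review = [], []
    for index, raw in enumerate(raw_messages):
        try:
            parsed.append(parse_one(raw))
        except Exception as exc:  # broad on purpose: any failure goes to review
            needs_review.append((index, repr(exc)))
    total = len(parsed) + len(needs_review)
    rate = 100.0 * len(parsed) / total if total else 0.0
    return parsed, needs_review, rate


def toy_parser(raw):
    """Hypothetical stand-in parser: requires a header/body separator."""
    if "\n\n" not in raw:
        raise ValueError("no header/body separator")
    headers, body = raw.split("\n\n", 1)
    return {"headers": headers, "body": body}


ok, review, rate = parse_account(
    ["Subject: a\n\nhi", "broken", "Subject: b\n\nyo"], toy_parser
)
```

The `review` list keeps the position and error for each failure, so an archivist can locate the original message for manual inspection.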

Next Steps
Continued testing
Review by others
  Parser and documentation on CERP website
  Considering 'webinar' events
Testing with email-related records
  e.g., mailing lists
Identifying/developing search tools
Integrating privacy/sensitive data solutions

Rockefeller Archive Center
  Nancy Adgent, Project Archivist
Smithsonian Institution Archives
  Riccardo Ferrante, Project Manager
  Lynda Schmitz Fuhrig, Project Archivist