Exploring the Applicability of Scientific Data Management Tools and Techniques on the Records Management Requirements for the National Archives and Records Administration (NARA)
Oct 01, 2002 – Sept. 30, 2003
Introduction
The National Archives and Records Administration (NARA) has the duty to preserve the nation’s history through archival storage and management of federal records.
By law, federal records are
–all books, papers, maps, photographs, machine-readable materials, or other documentary materials
–regardless of physical form or characteristics
–made or received by an agency of the U.S. Government under Federal law or in connection with the transaction of public business and preserved or appropriate for preservation by that agency or its legitimate successor.
Electronic records critically challenge NARA, as well as many other archives, libraries, agencies, and businesses, because of
–the sheer volume of electronic records
–their diversity and complexity
–the rapidity of change in the information technologies used to create, store, and manage these records
Preserving the nation’s history requires more than the simple archiving of electronic records. It requires the capability to mine, generate knowledge from, and reorganize these records in response to public and government queries.
It also requires integrating the capability to preserve the knowledge embedded in these records, given the inevitable and frequent changes in technology.
Archival and retrieval systems
Must be efficient and scalable to cope with multi-petabyte archives.
Preservation of metadata is vital.
Making large-scale collections accessible to diverse users
–requires components that provide high-performance digital library services (such as indexing, clustering, browsing, querying, translation, and change management)
–as well as the means to rapidly deploy these components in configurations that meet the needs of a particular task or user group.
Six projects
A study of file formats for long-term record archiving
Automatic classification
Publishing, exploring, and mining heterogeneous distributed data
Digital library component technology for large-scale archives
Time series characterization of archival I/O behavior
Performance analysis of archive data management and retrieval
A Study of File Formats for Long-Term Record Archiving
PI: Mike Folk
Investigate the suitability of scientific data formats and access methods for record archives.
Look at HDF5 as an archival format for a variety of different kinds of records. Possibilities include GIS and CAD.
Interface with SRB and OAIS implementation of sample collection.
Prototype implementation in HDF5 of NARA records collections, to be identified by NARA.
Automatic Classification
PI: Michael Welge
There is much interest in automatic text classification (ATC), where automated learning techniques are used to categorize text documents into pre-defined discrete sets of topics.
Automatic e-mail classification (AEC) can be seen as a subtask of ATC.
AEC differs from common ATC in many ways.
–e.g., sentences are ill structured, knowledge is embedded in nondiscriminatory fields, etc.
We propose to focus on two main questions:
–What is the best machine learning technique to classify messages?
–Which are the important attributes within a message that help classification?
We will support the study with a series of experiments on several benchmarks and real-world data.
In particular, we would like to experiment with a large real-world data set such as the Clinton White House archive.
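As a point of reference for the first question, the following is a minimal sketch of one common baseline for categorizing text into pre-defined topics: multinomial naive Bayes with add-one smoothing. It is an illustration only, not the technique the project selected; the class name, the whitespace tokenizer, the topic labels, and the toy training messages are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real messages would need more care."""
    return text.lower().split()


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.doc_counts = Counter()              # topic -> number of documents
        self.vocabulary = set()

    def train(self, labeled_docs):
        """Accumulate word and document counts from (text, topic) pairs."""
        for text, topic in labeled_docs:
            words = tokenize(text)
            self.word_counts[topic].update(words)
            self.doc_counts[topic] += 1
            self.vocabulary.update(words)

    def classify(self, text):
        """Return the topic with the highest log posterior score."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        best_topic, best_score = None, float("-inf")
        for topic in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[topic] / total_docs)
            topic_total = sum(self.word_counts[topic].values())
            for word in words:
                count = self.word_counts[topic][word]
                score += math.log((count + 1) / (topic_total + vocab_size))
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


# Toy, made-up training data; not drawn from any NARA collection.
training = [
    ("budget meeting schedule for the appropriations committee", "budget"),
    ("revised budget figures and spending estimates", "budget"),
    ("travel itinerary and flight schedule for the delegation", "travel"),
    ("hotel and flight reservations confirmed", "travel"),
]

classifier = NaiveBayesClassifier()
classifier.train(training)
predicted = classifier.classify("updated spending estimates for the committee")
```

The second question (which attributes of a message help classification) could be explored in a setup like this by changing what `tokenize` sees, for example subject line only versus body only, and comparing accuracy on held-out messages.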
Publishing, Exploring, and Mining Heterogeneous Distributed Data
PI: Michael Welge
Look at performance of NCSA’s existing data mining tools, Data Spaces on distributed data.
Extend Data Spaces so that it can understand HDF data.
Then apply Data Spaces to a real collection.
Probably use HDF as well as collections managed by Bob Grossman of the University of Chicago.
Digital Library Component Technology for Large-scale Archives
PI: Joseph Futrelle
Making large-scale collections accessible to a variety of kinds of users requires components that provide high-performance digital library services
–E.g. indexing, clustering, browsing, querying, translation, and change management
As well as the means to rapidly deploy these components in configurations that meet the needs of a particular task or user group.
The NCSA Digital Library Technologies group has been developing distributed digital library components for several years.
Recent work has been within the Open Digital Library (ODL) framework
–based on extensions to the Open Archives Initiative Protocol for Metadata Harvesting
Tasks
–Investigate the applicability of the ODL framework to problems of the scale and heterogeneity represented by NARA records
–Attempt to integrate the ODL framework with NCSA’s D2K framework.
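For readers unfamiliar with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), the sketch below shows the shape of a basic harvesting exchange: a `ListRecords` request built as an HTTP query string, and record identifiers pulled from the XML response. The verb, the `metadataPrefix`, `from`, and `set` arguments, and the XML namespace come from the protocol itself. The base URL, the record identifiers, and the hand-written sample response are hypothetical, and nothing here represents the ODL extensions.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date   # selective harvesting by datestamp
    if set_spec:
        params["set"] = set_spec     # selective harvesting by set
    return base_url + "?" + urlencode(params)


def extract_identifiers(response_xml):
    """Return the record identifiers found in a ListRecords response."""
    root = ET.fromstring(response_xml)
    path = ".//{%s}header/{%s}identifier" % (OAI_NS, OAI_NS)
    return [element.text for element in root.findall(path)]


# Hypothetical repository address, used for illustration only.
url = build_list_records_url("http://archive.example.org/oai",
                             from_date="2002-10-01")

# Hand-written, heavily abbreviated response; a real one carries more fields.
sample_response = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:archive.example.org:rec-1</identifier>
        <datestamp>2002-10-01</datestamp>
      </header>
    </record>
    <record>
      <header>
        <identifier>oai:archive.example.org:rec-2</identifier>
        <datestamp>2002-11-15</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>
"""

identifiers = extract_identifiers(sample_response)
```

A real harvester would also send the request over HTTP and follow `resumptionToken` values to page through large result sets; both are omitted to keep the sketch short.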
Digital Library Component Technology for Large-scale Archives
Key questions
–Can we build large-scale, high-performance Open Archives services using caching and proxying strategies?
–Can hierarchical configurations of filtering components be used to scale services by performing records reduction on multiple streams of documents?
–Can translation components be used in conjunction with indexing or clustering components to build unified representations that span large-scale heterogeneous collections?
–Can NARA records be made to interoperate with external data sources using the Open Archives protocol?
–Can Open Archives components be used to help NARA acquire records from other government agencies?
–Can ODL components be rapidly assembled into applications using the D2K rapid application development environment, or a derived environment, which would not only facilitate application building but also allow the ODL components to interoperate with D2K’s machine learning components?
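To make the second question concrete, here is one possible reading of a "hierarchical configuration of filtering components", sketched with Python generators: leaf filters reduce each incoming stream, and a root filter reduces the merged result again. The record fields (`id`, `year`, `format`), the predicates, and the two sample streams are hypothetical, and this is not the ODL component interface.

```python
from itertools import chain


def make_filter(predicate):
    """Return a component that lazily passes through matching records."""
    def component(records):
        return (record for record in records if predicate(record))
    return component


def merge(*streams):
    """Concatenate several record streams into one, without buffering."""
    return chain(*streams)


# Made-up record streams, standing in for feeds from two sources.
stream_a = [
    {"id": "a1", "year": 1993, "format": "text"},
    {"id": "a2", "year": 1998, "format": "text"},
    {"id": "a3", "year": 1999, "format": "image"},
]
stream_b = [
    {"id": "b1", "year": 2001, "format": "text"},
    {"id": "b2", "year": 1990, "format": "text"},
]

# Leaf level: each stream is reduced independently (could run near the source).
recent_only = make_filter(lambda record: record["year"] >= 1995)

# Root level: the merged, already reduced stream is filtered once more.
text_only = make_filter(lambda record: record["format"] == "text")

reduced = list(text_only(merge(recent_only(stream_a), recent_only(stream_b))))
reduced_ids = [record["id"] for record in reduced]
```

Because each component consumes and produces an iterator, the root filter only ever sees records that survived the leaf level, which is the property that would let such a hierarchy scale across many streams.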
Performance Analysis of Archive Data Management and Retrieval
PI: Dan Reed
Extend the functionality of the Pablo I/O analysis toolkit to analyze I/O performance when accessing data via archival systems supported by large Linux clusters.
We will characterize performance at three levels, driven to the maximum extent possible by expected NARA access patterns and integration with the HDF5, D2K, and Emerge toolkits:
–the time required to execute the high-level archival commands;
–the cost of performing the Linux-level I/O operations;
–the cost of storage and retrieval from physical storage devices.
Add procedures to produce SDDF trace data from high-level archival operations.
Develop new analysis tools to process this data.
Develop the requisite interfaces to extract data from the SDDF trace files in a form that can be used by the ARIMA time series modeling software described above.
Then apply time series techniques to characterize the behavior of archival operations.
We also propose to study the cost and power demands of different archival operations, comparing alternative implementations and analyzing patterns of basic operations that occur frequently throughout the use of the archive.
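The three measurement levels listed above can be illustrated with nested timers around a single operation. This is a rough sketch under several assumptions: `archive_store` is a hypothetical stand-in for a high-level archival command, the trace is a plain Python list rather than SDDF, and timing `fsync` is only a crude proxy for physical device cost. It does not use or imitate the Pablo toolkit.

```python
import os
import tempfile
import time
from contextlib import contextmanager

trace = []  # collected (level, operation, seconds) events


@contextmanager
def timed(level, operation):
    """Record how long the enclosed block takes, tagged with its level."""
    start = time.perf_counter()
    try:
        yield
    finally:
        trace.append((level, operation, time.perf_counter() - start))


def archive_store(path, payload):
    """Hypothetical high-level archival command: store one record."""
    with timed("archival", "store"):
        with timed("os_io", "open"):
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
        try:
            with timed("os_io", "write"):
                os.write(fd, payload)
            # Forcing data to the device approximates physical storage cost.
            with timed("device", "fsync"):
                os.fsync(fd)
        finally:
            os.close(fd)


def totals_by_level(events):
    """Sum the recorded time for each measurement level."""
    totals = {}
    for level, _operation, seconds in events:
        totals[level] = totals.get(level, 0.0) + seconds
    return totals


with tempfile.TemporaryDirectory() as workdir:
    archive_store(os.path.join(workdir, "record.bin"), b"x" * 65536)

summary = totals_by_level(trace)
```

Comparing the `archival` total with the `os_io` and `device` totals shows how much of a high-level command's time is spent below it, which is the kind of breakdown the three levels are meant to expose.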
Time Series Characterization of Archival I/O Behavior
PI: Nancy Tran
This project plans to work closely with the Pablo team to model and characterize I/O behaviors using the Pablo group’s SDDF instrumented data.
The focus is the cost (a fraction of the total execution time) of major HDF5 I/O function calls in applications run on Linux clusters.
Leveraging their online time series modeling framework (TsModeler), they plan to analyze HDF5 cost time series, automatically built by SDDF.
They will correlate costs with I/O behaviors, compare the costs of different functions, and identify the most severe performance bottlenecks.
They will also develop graphical tools for viewing I/O function cost patterns and their evolution.
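The notion of cost as a fraction of execution time can be turned into a time series in a few lines. The sketch below buckets timed calls into fixed windows and reports, per function, the fraction of each window spent in that function, then picks the function with the largest overall cost. The event tuples are made-up numbers, the windowing rule (a call is charged to the window in which it starts) is a simplifying assumption, and none of this reflects how TsModeler or SDDF work internally. `H5Fopen`, `H5Dread`, and `H5Dwrite` are real HDF5 function names used only as labels.

```python
import math
from collections import defaultdict


def cost_series(events, window, total_time):
    """Per-function series: fraction of each time window spent in the call.

    events: iterable of (start_time, function_name, duration) in seconds.
    A call is charged entirely to the window in which it starts.
    """
    n_windows = math.ceil(total_time / window)
    series = defaultdict(lambda: [0.0] * n_windows)
    for start, function, duration in events:
        index = min(int(start // window), n_windows - 1)
        series[function][index] += duration / window
    return dict(series)


def overall_cost(events, total_time):
    """Fraction of the total execution time spent in each function."""
    totals = defaultdict(float)
    for _start, function, duration in events:
        totals[function] += duration
    return {function: spent / total_time for function, spent in totals.items()}


# Made-up trace of a 30-second run: (start, function, duration).
events = [
    (0.0, "H5Fopen", 1.0),
    (1.0, "H5Dwrite", 4.0),
    (12.0, "H5Dwrite", 5.0),
    (21.0, "H5Dread", 2.0),
    (25.0, "H5Dwrite", 3.0),
]

series = cost_series(events, window=10.0, total_time=30.0)
costs = overall_cost(events, total_time=30.0)
bottleneck = max(costs, key=costs.get)
```

A series such as `series["H5Dwrite"]` is the kind of input a time series model would then be fitted to, in order to describe how a function's cost evolves over a run.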