Exploring the Applicability of Scientific Data Management Tools and Techniques on the Records Management Requirements for the National Archives and Records Administration (NARA)

Exploring the Applicability of Scientific Data Management Tools and Techniques on the Records Management Requirements for the National Archives and Records Administration (NARA) Oct 01, 2002 – Sept. 30, 2003

Introduction
The National Archives and Records Administration (NARA) has the duty to preserve the nation’s history through archival storage and management of federal records. By law, Federal records are
– all books, papers, maps, photographs, machine-readable materials, or other documentary materials
– regardless of physical form or characteristics
– made or received by an agency of the U.S. Government under Federal law or in connection with the transaction of public business
– and preserved or appropriate for preservation by that agency or its legitimate successor

Electronic records critically challenge NARA – and many other archives, libraries, agencies, and businesses:
– the sheer volume of electronic records
– their diversity and complexity
– the rapidity of change in the information technologies used to create, store, and manage these records

Preserving the nation’s history requires more than the simple archiving of electronic records. It requires the capability to
– mine
– generate knowledge from
– reorganize
these records in response to public and government queries. It also requires integrating the capability to preserve the knowledge embedded in these records, given the inevitable and frequent changes in technology.

Archival and retrieval systems
Must be efficient and scalable to cope with multi-petabyte archives.
Preservation of metadata is vital.
Making large-scale collections accessible to diverse users
– requires components that provide high-performance digital library services (such as indexing, clustering, browsing, querying, translation, and change management)
– as well as the means to rapidly deploy these components in configurations that meet the needs of a particular task or user group

Six projects
– A study of file formats for long-term record archiving
– Automatic classification
– Publishing, exploring, and mining heterogeneous distributed data
– Digital library component technology for large-scale archives
– Time series characterization of archival I/O behavior
– Performance analysis of archive data management and retrieval

A Study of File Formats for Long-Term Record Archiving
PI: Mike Folk
Investigate the suitability of scientific data formats and access methods for record archives.
Look at HDF5 as an archival format for a variety of different kinds of records; possibilities include GIS and CAD.
Interface with SRB and an OAIS implementation of a sample collection.
Prototype implementation in HDF5 of NARA records collections, to be identified by NARA.
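
The slides contain no code; as a concrete illustration, here is a minimal sketch (not the project’s implementation) of the basic idea: storing a record payload and its descriptive metadata together in one self-describing HDF5 file using h5py. The file name, group layout, and metadata fields are illustrative assumptions.

```python
# Minimal sketch: one record plus its preservation metadata in a single
# self-describing HDF5 file. All names and values are placeholders.
import numpy as np
import h5py

with h5py.File("nara_record.h5", "w") as f:
    grp = f.create_group("record_0001")
    # Store the record payload (e.g., a scanned page) as a compressed dataset.
    page = np.zeros((2200, 1700), dtype="uint8")  # placeholder image data
    grp.create_dataset("page_001", data=page, compression="gzip")
    # Attach descriptive metadata as HDF5 attributes, so the description
    # travels with the bits inside one file.
    grp.attrs["title"] = "Sample federal record"
    grp.attrs["agency"] = "Example agency"
    grp.attrs["original_format"] = "image/tiff"
    grp.attrs["date_received"] = "2002-10-01"
```

Because HDF5 files are self-describing, the metadata stays bound to the record content, which is part of what makes scientific formats attractive as archival containers.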

Automatic Classification
PI: Michael Welge
There is much interest in automatic text classification (ATC), in which automated learning techniques are used to categorize text documents into pre-defined, discrete sets of topics.
Automatic email classification (AEC) can be seen as a subtask of ATC, but it differs from common ATC in many ways
– e.g., sentences are ill-structured, knowledge is embedded in nondiscriminatory fields, etc.
We propose to focus on two main questions:
– What is the best machine learning technique to classify email messages?
– Which are the important attributes within an email message that help classification?
We will support this with a series of experiments on several benchmarks and real-world data sets. In particular, we would like to experiment with a large real-world data set such as the Clinton White House archive.
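
For illustration only, here is a minimal sketch of the kind of ATC experiment described above, assuming scikit-learn: TF-IDF features feeding a linear classifier over a handful of invented, labeled messages.

```python
# Minimal text-classification sketch: TF-IDF + logistic regression.
# The sample messages and topic labels are invented placeholders.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

emails = [
    "Please schedule the budget meeting for Friday.",
    "Press release draft attached for review.",
    "Travel reimbursement form and receipts enclosed.",
    "Final press statement approved by the office.",
]
topics = ["scheduling", "press", "finance", "press"]  # pre-defined topic set

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(emails, topics)
print(clf.predict(["Draft statement for tomorrow's press briefing"]))
```

A real AEC experiment would compare several learning techniques and feature sets (headers, body, structure) over large benchmark corpora, as the questions above indicate.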

Publishing, Exploring, and Mining Heterogeneous Distributed Data
PI: Michael Welge
Look at the performance of NCSA’s existing data mining tool, Data Spaces, on distributed data.
Extend Data Spaces so that it can understand HDF data, then apply Data Spaces to a real collection.
Probably use HDF as well as collections managed by Bob Grossman of the University of Chicago.
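
Data Spaces’ own API is not shown in the slides; as a hedged stand-in, this sketch shows the enabling step the proposed extension implies: reading an HDF5 dataset into a tabular form that a generic mining tool could consume. h5py and pandas are assumed substitutes, and the dataset path and column names are invented.

```python
# Stand-in sketch: expose an HDF5 dataset as a table for generic mining.
# "/records/features" is an assumed 2-D numeric dataset, not a real path.
import h5py
import pandas as pd

with h5py.File("collection.h5", "r") as f:
    data = f["/records/features"][...]          # load into memory as ndarray
    cols = [f"attr_{i}" for i in range(data.shape[1])]

df = pd.DataFrame(data, columns=cols)
print(df.describe())  # summary statistics as a trivial "mining" step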

Digital Library Component Technology for Large-scale Archives
PI: Joseph Futrelle
Making large-scale collections accessible to a variety of users requires components that provide high-performance digital library services
– e.g., indexing, clustering, browsing, querying, translation, and change management
as well as the means to rapidly deploy these components in configurations that meet the needs of a particular task or user group.
The NCSA Digital Library Technologies group has been developing distributed digital library components for several years, recently within the Open Digital Library (ODL) framework
– based on extensions to the Open Archives Initiative Protocol for Metadata Harvesting
Tasks
– Investigate the applicability of the ODL framework to problems of the scale and heterogeneity represented by NARA records
– Attempt to integrate the ODL framework with NCSA’s D2K framework

Digital Library Component Technology for Large-scale Archives
Key questions
– Can we build large-scale, high-performance Open Archives services using caching and proxying strategies?
– Can hierarchical configurations of filtering components be used to scale services by performing records reduction on multiple streams of documents?
– Can translation components be used in conjunction with indexing or clustering components to build unified representations that span large-scale heterogeneous collections?
– Can NARA records be made to interoperate with external data sources using the Open Archives protocol?
– Can Open Archives components be used to help NARA acquire records from other government agencies?
– Can ODL components be rapidly assembled into applications using the D2K rapid application development environment, or a derived environment, which would not only facilitate application building but also allow the ODL components to interoperate with D2K’s machine learning components?
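
Several of these questions hinge on the Open Archives Initiative Protocol for Metadata Harvesting that ODL builds on. As a minimal sketch of that protocol (not of ODL’s own components), the following issues a ListRecords request against a placeholder endpoint and prints record identifiers and titles; a real harvester would also handle resumptionTokens and error responses.

```python
# Minimal OAI-PMH ListRecords sketch using only the standard library.
# The repository URL is a placeholder, not a real NARA endpoint.
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://example.org/oai"  # placeholder OAI-PMH endpoint
url = BASE + "?verb=ListRecords&metadataPrefix=oai_dc"

with urllib.request.urlopen(url) as resp:
    tree = ET.parse(resp)

ns = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
for rec in tree.iter("{http://www.openarchives.org/OAI/2.0/}record"):
    ident = rec.find(".//oai:header/oai:identifier", ns)
    title = rec.find(".//dc:title", ns)
    print(ident.text if ident is not None else "?",
          title.text if title is not None else "(no title)")
```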

Performance Analysis of Archive Data Management and Retrieval
PI: Dan Reed
Extend the functionality of the Pablo I/O analysis toolkit to analyze I/O performance when accessing data via archival systems supported by large Linux clusters.
We will characterize performance at three levels, driven to the maximum extent possible by expected NARA access patterns and by integration with the HDF5, D2K, and Emerge toolkits:
– the time required to execute the high-level archival commands;
– the cost of performing the Linux-level I/O operations;
– the cost of storage and retrieval from physical storage devices.
Add procedures to produce SDDF trace data from high-level archival operations, develop new analysis tools to process this data, and develop the requisite interfaces to extract data from the SDDF trace files in a form that can be used by the ARIMA time series modeling software described in the Time Series Characterization project. Then apply time series techniques to characterize the behavior of archival operations.
We also propose to study the cost and power demands of different archival operations, comparing alternative implementations and analyzing patterns of basic operations that occur frequently throughout the use of the archive.
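
Pablo’s SDDF tracing is not reproduced here; as a minimal sketch of the instrumentation idea, this times a high-level archival operation and emits one structured trace record per call, with CSV as a stand-in for SDDF. The operation and field names are illustrative assumptions.

```python
# Sketch of tracing high-level archival operations: one timed record per
# call, written to a CSV trace file (a stand-in for SDDF output).
import csv
import time
from contextlib import contextmanager

@contextmanager
def traced(op_name, trace_writer):
    """Time the wrapped block and append (operation, start, duration)."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        trace_writer.writerow([op_name, t0, time.perf_counter() - t0])

with open("archive_trace.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["operation", "start", "duration_s"])
    with traced("archive_put", w):
        time.sleep(0.01)  # stand-in for the real archival command
```

Post-processing tools would then read such traces and feed the per-operation cost series to the time series modeling software described in the next slide.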

Time Series Characterization of Archival I/O Behavior
PI: Nancy Tran
This project plans to work closely with the Pablo team to model and characterize I/O behavior using the Pablo group’s SDDF-instrumented data.
We are interested in the cost (as a fraction of the total execution time) of major HDF5 I/O function calls in applications run on Linux clusters.
Leveraging their online time series modeling framework (TsModeler), the team plans to analyze HDF5 cost time series built automatically from the SDDF traces.
They will correlate costs with I/O behaviors, compare the costs of different functions, and identify the most significant performance bottlenecks.
They will also develop graphical tools to enable viewing of I/O function cost patterns and their evolution.
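
As a minimal sketch of this style of analysis (TsModeler itself is not shown), assuming statsmodels: fit an ARIMA model to a synthetic series of per-interval HDF5 I/O costs and produce a short-horizon forecast. The series and model order are invented placeholders.

```python
# Sketch: ARIMA characterization of an I/O cost time series.
# The cost series is synthetic; a real analysis would use SDDF-derived data.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# Placeholder series: fraction of execution time spent in HDF5 I/O per interval.
costs = 0.05 + np.cumsum(0.001 * rng.standard_normal(200))

model = ARIMA(costs, order=(1, 1, 1)).fit()  # assumed, not tuned, order
print(model.summary())
print(model.forecast(steps=10))  # short-horizon cost forecast
```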