Archiving Digital Government Data Joseph JaJa Institute for Advanced Computer Studies Department of Electrical and Computer Engineering University of Maryland.

Presentation transcript:


Digital Preservation
Large amounts of current digital data need to be preserved for periods ranging from several years to decades, and sometimes centuries.
Government records require special attention: long-term retention, authenticity, audit trails, etc.
Government data can include public, restricted-access, and classified information.
Life-cycle management of records is needed to ensure long-term access to data.

Traditional Preservation
Proven methodologies exist to preserve physical artifacts, revolving around trusted stewards such as museums, libraries, and archives, with relatively little sharing.
Long-term preservation is an elaborate process involving several steps: Appraisal, Accessioning, Arrangement, Description, Preservation, Access, Re-purposing.

What Does Digital Preservation Mean? Is it preserving the “information content”? How about the “look and feel” of a book or a document or a piece of art? How about multimedia data? Will the use of a different coding scheme be OK? What does it mean to preserve a video game? How about preserving engineering designs developed by using a number of CAD tools?

Main Technology Issues
Management of technology evolution: storage, information management, representation, and access.
Risk management and disaster recovery: technology degradation and failure; natural disasters such as fires and floods; human-induced operational or malicious errors.
Ensuring long-term authenticity of and access to electronic records.

Major Government Efforts
Electronic Records Archives (ERA) Program by the National Archives and Records Administration (NARA). Goal: address critical issues in the creation, management, and use of electronic records of the U.S. Government. A production system is currently under development.
National Digital Information Infrastructure and Preservation Program (NDIIPP), led by the Library of Congress. Goal: develop a national strategy to collect, archive, and preserve the burgeoning amounts of digital content, especially materials that are created only in digital formats, for current and future generations.

The ADAPT Project at Maryland
Main themes: platform-independent characterization of data objects; layered architecture; distributed infrastructure.
Digital object model that encapsulates content, structural, descriptive, and preservation metadata.
Layered software architecture based on three levels of abstraction: data, information, and preservation.
Organized to enable collaborative, community-based efforts such as replication, “dark archiving”, and a Global Digital Format Registry.
Components expressed within the Open Archival Information System (OAIS) reference framework.

Global Land Cover Facility at Maryland
Established in 1997 as part of the NASA-supported Federation of Earth Science Information Partnerships (ESIPs).
Joint effort between faculty in Geography, Computer Science, and Electrical and Computer Engineering.
Main mission is to develop novel land cover products and information services in support of Earth Systems Science research.
Evolved over the years to support a wide range of projects involving partners from academia, state governments, and private organizations.

Breadth of the Effort
Faculty: John Townshend (Geography), Joseph JaJa (UMIACS and ECE), Sam Goward (Geography), Ben Shneiderman (Computer Science), Nick Roussopoulos (Computer Science), Rama Chellappa (Electrical and Computer Engineering), ...
Partners: The Nature Conservancy, The Smithsonian Institution, The United Nations, World Conservation Union, World Resources Institute, Guyra Paraguay, Conservation International, ...
8TB-10TB of data downloaded per month from the GLCF.

GLCF Data Holdings
Satellite Data: Landsat MSS; Landsat TM; Landsat ETM+; AVHRR Global Area Coverage Data (1989 – 1991); GOES data for United States.
Derived Products: 1km, 8km and 1 Degree Land Cover Maps; Urban Growth of Selected U.S. Metropolitan Centers; U.S. Coastal Marsh Health; CARPE – Central African GIS Data Sets; Continuous Fields Tree Cover Project; EOS Core Validation Sites; NASA Landsat Pathfinder Humid Tropical Deforestation Project; MODIS 250m U.S. Vegetation Index; MODIS 250m Landsat 7 ETM+.

Specific Digital Preservation Projects
Pilot Persistent Archive: Research and development of a testbed with a particular focus on NARA-type records and collections.
NDIIPP Project: Research on management of preservation processes, including the organization of a “deep archive,” using collections from the Shoah Visual History Foundation, ICDL, and GLCF.
Chronopolis: A component of the cyberinfrastructure to preserve collections of national importance.

Pilot Persistent Archive Prototype: Heterogeneous “Grid Bricks” with over 12 TB Disk Storage and Substantially More Back-up Storage
[Diagram labels: NARA, UMD, SDSC; Disk; Dell; Intel; HPSS; TSM; Local Network; Router; Abilene Network]

Pilot Software Configuration The SRB data grid provides the middleware for integrating storage into a global address space and for incorporating replication and migration mechanisms. The Grid Security Infrastructure (GSI) supports uniform cross-site authentication through a Certificate Authority run by NARA. Separate heterogeneous databases supporting Metadata Catalogs (MCAT) at each site. SDSC and NARA run Oracle, and UMD runs Informix.

Data Grids as Core Infrastructure for Persistent Archives
Technology Evolution Management:
Storage system abstraction: supports data migration across storage systems.
Information repository abstraction: supports catalog migration to new databases.
Logical name space: supports global persistent identifiers.
Risk Management:
Distributed architecture, logical name space, and data/information abstractions enable graceful handling of media degradation, natural disasters, and operational/malicious errors.
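As a minimal sketch of the logical name space idea, assuming a simple in-memory catalog, a persistent identifier can stay fixed while the physical replicas behind it are migrated to new storage. The class, method names, identifiers, and URLs below are hypothetical illustrations and are not part of SRB or MCAT.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalNamespace:
    """Toy catalog mapping persistent logical names to physical replica URLs."""
    replicas: dict = field(default_factory=dict)

    def register(self, logical_name: str, physical_url: str) -> None:
        # Record one more physical copy under the same logical name.
        self.replicas.setdefault(logical_name, set()).add(physical_url)

    def migrate(self, logical_name: str, old_url: str, new_url: str) -> None:
        # Add the new location before dropping the old one, so the
        # logical name never resolves to zero copies during migration.
        locations = self.replicas[logical_name]
        locations.add(new_url)
        locations.discard(old_url)

    def resolve(self, logical_name: str) -> list:
        # Clients only ever see the logical name; locations may change.
        return sorted(self.replicas.get(logical_name, ()))


# Hypothetical identifiers and storage URLs, for illustration only.
catalog = LogicalNamespace()
record = "archive:/nara/record-0001"
catalog.register(record, "tape://site-a/vol7/0001")
catalog.register(record, "disk://site-b/bricks/0001")
catalog.migrate(record, "tape://site-a/vol7/0001", "disk://site-c/new/0001")
print(catalog.resolve(record))
```

The logical name is unchanged by the migration, which is what lets storage technology be replaced underneath it.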

Selected Collections Available on the Prototype
Collection                  Size     # Files   Original Media
NARA Accessioned Holdings   38 GB              Tape
Presidential Web Sites      11 GB    15,000    Web
Image EAP                   1.3 TB   124,000   WORM

Clinton Government Web Snapshot: NCES Statistics
83 file types, 53.5 GB total, 2,767 folders, 175,968 files
File Type   Percent of files
html        39%
csv         26%
jpg         13%
gif         12%
pdf         3%
asp         2%
xls         1%
shtml       1%

Main Software Components of ADAPT:

Producer – Archive Workflow Network (PAWN)
Distributed and secure ingestion of digital objects into the archive.
Use of web/grid technologies: platform independent.
Ease of integration with data grids or digital libraries.
XML representation of metadata and bitstream.
Self-describing bitstream submissions.
Accountability of transfer and guarantee of data integrity.
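One common way to obtain accountability of transfer and data integrity is a checksum manifest built by the producer and re-checked by the archive. The sketch below illustrates that general technique only; the slides do not say how PAWN does it, and the function names, file names, and the choice of SHA-256 are assumptions.

```python
import hashlib


def build_manifest(files: dict) -> dict:
    """Producer side: map each file name to the SHA-256 digest of its bytes."""
    return {name: hashlib.sha256(content).hexdigest()
            for name, content in files.items()}


def verify_transfer(manifest: dict, received: dict) -> list:
    """Archive side: report every file that is missing or altered."""
    problems = []
    for name, expected in manifest.items():
        content = received.get(name)
        if content is None:
            problems.append((name, "missing"))
        elif hashlib.sha256(content).hexdigest() != expected:
            problems.append((name, "checksum mismatch"))
    return problems


# Hypothetical submission: one file arrives corrupted, one never arrives.
submitted = {"report.pdf": b"%PDF-1.4 ...", "data.csv": b"a,b\n1,2\n", "notes.txt": b"ok"}
manifest = build_manifest(submitted)
received = {"report.pdf": b"%PDF-1.4 ...", "data.csv": b"a,b\n1,3\n"}
print(verify_transfer(manifest, received))
```

An empty result means every file named in the manifest arrived with matching content.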

Distributed Ingestion

Submission Information Package (SIP)
METS handles all areas of a SIP except Physical Object and Descriptive Information.
Descriptive Information can be embedded into METS as a third-party XML schema.
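A rough sketch of what such a package description might look like, using standard METS elements (dmdSec, mdWrap, xmlData, fileSec, structMap) with Dublin Core embedded as the third-party descriptive schema. The helper name, IDs, file locations, and checksum values are hypothetical, and this is not PAWN's actual SIP profile.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
DC = "http://purl.org/dc/elements/1.1/"

for prefix, uri in (("mets", METS), ("xlink", XLINK), ("dc", DC)):
    ET.register_namespace(prefix, uri)


def build_sip(title: str, files: list) -> ET.Element:
    """Build a minimal METS document; files is a list of (url, sha1_hex)."""
    root = ET.Element(f"{{{METS}}}mets")

    # Descriptive metadata: Dublin Core wrapped inside METS.
    dmd = ET.SubElement(root, f"{{{METS}}}dmdSec", ID="dmd1")
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="DC")
    xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    ET.SubElement(xml_data, f"{{{DC}}}title").text = title

    # File inventory and a flat structural map pointing at each file.
    file_sec = ET.SubElement(root, f"{{{METS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS}}}fileGrp")
    struct = ET.SubElement(root, f"{{{METS}}}structMap")
    div = ET.SubElement(struct, f"{{{METS}}}div", DMDID="dmd1")

    for index, (url, sha1_hex) in enumerate(files, start=1):
        file_id = f"file{index}"
        entry = ET.SubElement(group, f"{{{METS}}}file", ID=file_id,
                              CHECKSUM=sha1_hex, CHECKSUMTYPE="SHA-1")
        ET.SubElement(entry, f"{{{METS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK}}}href": url})
        ET.SubElement(div, f"{{{METS}}}fptr", FILEID=file_id)
    return root


# Hypothetical content; the checksum is a placeholder string.
sip = build_sip("Sample accession", [("file:///accession/page1.tif", "0" * 40)])
print(ET.tostring(sip, encoding="unicode"))
```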

Distributed Ingestion Each Producer registers and arranges files locally prior to transport. Multiple distributed archival receiving stations. X.509 based authentication between sites. Independent Certificate Authorities at each Producer. Persistent archive is geographically distributed and managed by a data grid.

Management of Preservation Processes: Policy driven management of preservation processes. Main Components: System Registry: available data/metadata repositories; supported file formats; certified transformations. Registry of Policies: replication, refreshing, and migration. Monitoring System to evaluate the archive’s health on a regular basis.
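To make the policy-driven idea concrete, here is a small sketch, under assumed data structures, of a monitoring pass that compares each archived object against a replication policy and the registry's list of supported formats. All class names, fields, sites, and thresholds are hypothetical and do not come from the ADAPT software.

```python
from dataclasses import dataclass


@dataclass
class ReplicationPolicy:
    min_replicas: int  # minimum number of distinct sites holding a copy


@dataclass
class ArchivedObject:
    object_id: str
    fmt: str
    replica_sites: list


def audit(objects: list, policy: ReplicationPolicy, supported_formats: set) -> list:
    """Return (object_id, action) pairs for every policy violation found."""
    findings = []
    for obj in objects:
        if len(set(obj.replica_sites)) < policy.min_replicas:
            findings.append((obj.object_id, "replicate"))
        if obj.fmt not in supported_formats:
            findings.append((obj.object_id, "migrate"))
    return findings


# Hypothetical holdings checked against a two-site replication policy.
holdings = [
    ArchivedObject("obj-1", "pdf", ["UMD", "SDSC"]),
    ArchivedObject("obj-2", "wpd", ["UMD", "SDSC"]),
    ArchivedObject("obj-3", "gif", ["NARA"]),
]
actions = audit(holdings, ReplicationPolicy(min_replicas=2), {"pdf", "gif", "jpg"})
print(actions)
```

Running such an audit on a schedule is one simple reading of "evaluating the archive's health on a regular basis".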

Deep Archive
Erasure codes are forward error correction codes that transform an input object into a larger set of fragments such that any sufficiently large subset of the fragments can be used to reconstruct the object.
Using a peer-to-peer DHT scheme, the fragments are distributed among the nodes.
Integrity and survivability of each object are guaranteed with high probability (objects can also be made unforgeable and self-verifying).
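The simplest possible erasure code illustrates the principle: split an object into k data fragments and add one XOR parity fragment, so that any k of the k+1 fragments suffice to rebuild it. This toy sketch tolerates only a single lost fragment; a real deep archive would presumably use a stronger code (for example Reed-Solomon), and the function names here are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal fragments plus one XOR parity fragment."""
    size = -(-len(data) // k)                  # ceiling division
    padded = data.ljust(size * k, b"\0")       # zero-pad the last fragment
    fragments = [padded[i * size:(i + 1) * size] for i in range(k)]
    return fragments + [reduce(xor_bytes, fragments)]


def decode(fragments: List[Optional[bytes]], length: int) -> bytes:
    """Rebuild the object when at most one fragment (marked None) is lost."""
    missing = [i for i, frag in enumerate(fragments) if frag is None]
    if len(missing) > 1:
        raise ValueError("single-parity code cannot recover two losses")
    fragments = list(fragments)
    if missing and missing[0] != len(fragments) - 1:
        # A data fragment is lost: XOR of all survivors restores it.
        survivors = [frag for frag in fragments if frag is not None]
        fragments[missing[0]] = reduce(xor_bytes, survivors)
    return b"".join(fragments[:-1])[:length]


original = b"archival record"
stored = encode(original, k=4)
damaged = list(stored)
damaged[2] = None                              # simulate one failed node
print(decode(damaged, len(original)))
```

In a DHT-based deployment each fragment would be stored on a different node, so the loss of one node corresponds to the loss of one fragment.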

Consumer – Archive Network (CAN): Enables long-term access and information discovery across collections. Manages retrieval and display of content. Leverages advanced digital library services. Grid Retrieval and Search Platform (GRASP) prototype.

Digital Format Registry
Handling of digital formats is an essential part of long-term preservation.
Preservation of any object must include ways to render and transform the object.
Different essential aspects of objects need to be preserved.
Tools for capturing the essential format characteristics of information stored as digital objects.

FOrmat CUration Service
Maintains persistent, unambiguous representation information on digital formats and ways to access and manipulate them.
Accessible either directly through LDAP or indirectly through SOAP (Web Services).
[Diagram labels: Web Service Agent; Format Registry; LDAP; SOAP]

FOCUS on LDAP/SOAP
Interoperability: LDAP and SOAP provide standard, platform-independent models and protocols.
Scalability: LDAP is a proven scalable technology. The LDAP schema can be extended and the server can be replicated with ease. The SOAP server side can be extended without affecting clients.
Security: SOAP can run on top of SSL (https). LDAP also provides its own secure authentication and authorization methods.

FOCUS Data Model
dc=umiacs, dc=umd, dc=edu
  ou=Format-Registry
    ou=Applications: Adobe Acrobat v6.0; Adobe Photoshop v7.0; Jhove 1.0
    ou=Formats: Adobe PDF v1.4; CompuServe GIF 1989a; JPEG Image Format 2000
Format entries: general descriptive properties; processing: rendering, editing, conversion and validation services/systems.
Application entries: general descriptive properties; processing: formats taken as input and/or output.
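The directory layout above can be sketched as distinguished names (DNs) plus cross-references between format and application entries. This is an in-memory illustration only: the use of "cn" as the naming attribute, the attribute names, the cross-links, and the simplified escaping are all assumptions, not the actual FOCUS schema.

```python
BASE_DN = "dc=umiacs,dc=umd,dc=edu"
REGISTRY_DN = "ou=Format-Registry," + BASE_DN


def escape_rdn_value(value: str) -> str:
    # Simplified escaping of characters that are special inside a DN;
    # leading/trailing spaces and '#' are not handled in this sketch.
    special = ',+"\\<>;='
    return "".join("\\" + ch if ch in special else ch for ch in value)


def entry_dn(branch: str, name: str) -> str:
    """Build the DN of an entry under ou=Applications or ou=Formats."""
    return f"cn={escape_rdn_value(name)},ou={branch},{REGISTRY_DN}"


# Hypothetical entries: a format points at applications by DN, per role.
PDF = entry_dn("Formats", "Adobe PDF v1.4")
ACROBAT = entry_dn("Applications", "Adobe Acrobat v6.0")
JHOVE = entry_dn("Applications", "Jhove 1.0")

registry = {
    PDF: {"rendering": [ACROBAT], "validation": [JHOVE]},
    ACROBAT: {"inputFormats": [PDF]},
    JHOVE: {"inputFormats": [PDF]},
}


def applications_for(format_dn: str, role: str) -> list:
    """Return the DNs of applications registered for a role on a format."""
    return registry.get(format_dn, {}).get(role, [])


print(PDF)
print(applications_for(PDF, "validation"))
```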

FOCUS Service Model
[Diagram labels: Web Service Agent; Format Registry; Identification Service; Validation Service; Rendering Service; Conversion Service]
Identification: identifies the format of a specific digital object (DO) using its internal signature.
Validation: determines a verification service to verify the format of a specific DO.
Rendering: identifies current rendering conditions for a specific digital format.
Conversion: locates transformation services to convert a DO from its source format to the format of interest.

Use Case: Digital Object Format Verification
[Diagram labels: Web Service Agent; Format Registry; ID Service; Validation Service; Rendering Service; Conversion Service. Messages: Format? / Format ID, Format Info; Verifier? / App ID, App Info; Verify this? / Valid, Well-formed]
Step 1: The user asks the Web Service to identify the format of a file.
Step 2: The registry returns the format ID and format information.
Step 3: The user requests information on available verifiers for this format.
Step 4: The registry returns the validation service ID and information, such as its service location.
Step 5: The user connects to the validation service and verifies the format.
Step 6: The validation service returns the verification result.
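The six steps can be sketched as a small local workflow in which dictionaries stand in for the registry and the validation services. The format IDs, table contents, and validators below are hypothetical; the signature bytes are the well-known leading bytes of PDF, GIF89a, and JPEG files, and the validators are deliberately shallow checks, not real well-formedness tests.

```python
# Steps 1-2: identification by internal signature (leading "magic" bytes).
SIGNATURES = {
    b"%PDF-": "fmt/pdf",
    b"GIF89a": "fmt/gif89a",
    b"\xff\xd8\xff": "fmt/jpeg",
}


def identify(data: bytes):
    for magic, format_id in SIGNATURES.items():
        if data.startswith(magic):
            return format_id
    return None


def validate_gif(data: bytes) -> bool:
    # Shallow check: GIF89a header and the 0x3B trailer byte.
    return data.startswith(b"GIF89a") and data.endswith(b"\x3b")


def validate_pdf(data: bytes) -> bool:
    # Shallow check: PDF header and an end-of-file marker.
    return data.startswith(b"%PDF-") and data.rstrip().endswith(b"%%EOF")


# Steps 3-4: the registry maps a format ID to an available verifier.
VALIDATORS = {"fmt/gif89a": validate_gif, "fmt/pdf": validate_pdf}


def verify(data: bytes):
    """Return (format_id, result); result is None if no verifier is known."""
    format_id = identify(data)
    validator = VALIDATORS.get(format_id)
    if validator is None:
        return format_id, None
    # Steps 5-6: call the verifier and return its verdict.
    return format_id, validator(data)


print(verify(b"GIF89a" + b"\x00" * 10 + b";"))
print(verify(b"\xff\xd8\xff\xe0"))
```

In the service model described on the slides, identification and validation would be remote services reached through the Web Service Agent rather than local functions.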

Demo

Conclusion Broad research program addressing major technology issues in digital preservation. Set up a pilot system for a distributed archiving infrastructure, which currently holds around 10TB of widely different types of data. Development of tools that are currently being tested at NARA. Several other organizations have expressed interest in using our tools. Program conducted in close collaboration with NARA and Library of Congress.