San Diego Supercomputer Center NARA Research Prototype Persistent Archive: Building Preservation Environments with Data Grid Technology

Presentation transcript:

San Diego Supercomputer Center
NARA Research Prototype Persistent Archive
Building Preservation Environments with Data Grid Technology (NARA Research Prototype Persistent Archive)
Reagan Moore, Richard Marciano, Arcot Rajasekar, Michael Wan, Wayne Schroeder, Antoine de Torcy, Sheau-Yen Chen

Topics
- Moore, R., “Building Preservation Environments with Data Grid Technology”, American Archivist, vol. 69, no. 1, pp , July
- Identify relevant preservation concepts
- Prove concepts in NARA Research Prototype Persistent Archive
- Identify future technology development goals

Preservation Concepts
- Paper records
  - Authenticity
  - Integrity
- Digital records
  - Authenticity
  - Integrity
  - Infrastructure independence
  - Scalability

Traditional Preservation Concepts
Authenticity - the inextricable linking of identity metadata to a record
- Date record is made
- Date record is transmitted
- Date record is received
- Date record is set aside (i.e., filed)
- Name of author (person or organization issuing the record)
- Name of addressee (person or organization for whom the record is intended)
- Name of writer (person or organization responsible for the articulation of the record’s content)
- Name of originator (electronic address from which the record is sent)
- Name of recipient(s) (person or organization to whom the record is sent)
- Name of creator (person or organization in whose archival fonds the record exists)
- Name of action or matter (the activity in the course of which the record is created)
- Name of documentary form (e.g., report, memo)
- Identification of digital components
- Identification of attachments (e.g., digital signature)
- Archival bond (e.g., classification code)
- Assertions about the creation of the record
- Assertions made by the archivist about the creator of the record and the creation process

Traditional Preservation Concept
Integrity - the management of record correctness and the chain of custody
- Name(s) of the handling office/officer
- Name of the office of primary responsibility for keeping the record
- Indication of annotations or comments
- Indication of actions carried out on the record (e.g., audit trail)
- Indication of technical modifications due to transformative migration
- Integrity signature
- Validation date for last integrity check
- Assertions made by the archivist about the management of the records

Preservation of Digital Records
- Extract a digital record from the environment in which it was created
- Import the digital record into the preservation environment
- Given that the preservation environment will evolve, the process must be repeated with each new generation of storage technology
  - Extract from the old technology and import onto the new storage technology
- Two more preservation concepts:
  - Infrastructure independence
  - Scalability

Extraction of Electronic Records
- Storage Repository
  - Storage location
  - User name
  - File name
  - File context (creation date,…)
  - Access constraints
- Data Access Method (Web Browser)
- Extract
  - Digital record (file)
  - Identifiers (names used to manage the file)
  - Provenance metadata (creator, time stamps)
  - Integrity metadata (digital signature)
  - Encoding format

To Import into a Preservation Environment
- Archivist needs persistent identifiers:
  - Names used to describe archivists, files, and storage systems
  - Names for authenticity and integrity metadata attributes
  - Storage locations for electronic records
- Archivist needs to control the properties of the electronic record
  - Access controls for processing the electronic records
  - Locations of replicas
  - Audit trails on actions performed on electronic record
  - Checksums for validating integrity
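The last two properties, audit trails and checksums, can be illustrated with a minimal Python sketch. It is not the SRB implementation: the `PreservedRecord` class, the `ingest` and `validate` helpers, and the choice of SHA-256 are assumptions made for illustration only.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class PreservedRecord:
    """Hypothetical record held in a preservation environment."""
    logical_name: str            # persistent identifier chosen by the archivist
    content: bytes
    checksum: str = ""           # integrity metadata, set at ingest
    audit_trail: list = field(default_factory=list)

    def log(self, action: str) -> None:
        # Each audit entry pairs a UTC timestamp with the action performed.
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_trail.append((stamp, action))


def ingest(logical_name: str, content: bytes) -> PreservedRecord:
    """Import a record, computing its checksum and logging the action."""
    record = PreservedRecord(logical_name, content)
    record.checksum = hashlib.sha256(content).hexdigest()
    record.log("ingest")
    return record


def validate(record: PreservedRecord) -> bool:
    """Recompute the checksum and compare it with the stored value."""
    ok = hashlib.sha256(record.content).hexdigest() == record.checksum
    record.log("validate:ok" if ok else "validate:FAILED")
    return ok
```

Because every call is logged against the logical name, the audit trail doubles as the chain-of-custody evidence described under Integrity.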

Digital Preservation Concept
- Infrastructure Independence
  - The ability to manage all of the properties of the electronic records independently of the choice of storage systems
  - Provides persistent names to identify persons, files, storage systems as well as manage authentication and authorization
- Data virtualization
- Trust virtualization

Data Grids
- Data grids implement data virtualization
  - Ability to manage properties of shared collection independently of the choice of storage system
  - Ability to access records stored in all types of storage systems
  - Ability to support multiple types of access mechanisms independently of choice of type of storage system
- Data grids implement trust virtualization
  - Ability to authenticate users independently of the local administrative domain
  - Ability to manage access controls independently of the local file system or archive
- Data grids are used to manage shared collections that are distributed across multiple sites and multiple storage systems

Astronomy Data Grid
- National Optical Astronomy Observatory
  - Chile
  - Tucson, Arizona
  - Kitt Peak
  - NCSA, Illinois
- Replicate images taken by a telescope in Chile to an archive at NCSA
- A functioning international Data Grid for Astronomy
- Manages over 400,000 images

Generic Distributed Data Management
- Data grids support
  - Shared collections distributed across international organizations
  - Digital libraries for the publication of records
  - Real-time sensor data systems
  - Persistent archives that manage technology evolution
- Data grids manage
  - Small collections with a few thousand files and a few Gigabytes of data
  - Large collections with 800,000 Gigabytes (800 Terabytes) of data and 100 million files
  - Collections stored within a single computer
  - Collections distributed across multiple international sites

Using a Data Grid - in Abstract
- Ask for data: user asks for data from the data grid
- Data delivered: the data is found and returned
- Where & how details are hidden

Using a Data Grid - Details (Storage Resource Broker)
- User asks for data
- Data request goes to SRB server
- Server looks up data in the Metadata Catalog (MCAT) DB
- Catalog tells which SRB server has data
- 1st server asks 2nd server for data
- The data is found and returned
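The request path above can be sketched as a toy model in Python. The classes `MetadataCatalog` and `GridServer` and their methods are hypothetical stand-ins, not the SRB API; real servers communicate over the network rather than through a shared dictionary.

```python
class MetadataCatalog:
    """Toy stand-in for the metadata catalog: logical name -> owning server."""

    def __init__(self):
        self._locations = {}

    def register(self, logical_name, server_name):
        self._locations[logical_name] = server_name

    def locate(self, logical_name):
        return self._locations[logical_name]


class GridServer:
    """Toy stand-in for a grid server that can answer any request."""

    def __init__(self, name, catalog, peers):
        self.name = name
        self.catalog = catalog
        self.peers = peers          # shared dict of server name -> GridServer
        self.local_store = {}       # data physically held by this server
        peers[name] = self

    def put(self, logical_name, data):
        self.local_store[logical_name] = data
        self.catalog.register(logical_name, self.name)

    def get(self, logical_name):
        # Step 1: look the logical name up in the catalog.
        owner = self.catalog.locate(logical_name)
        # Step 2: serve locally, or ask the server that holds the data.
        if owner == self.name:
            return self.local_store[logical_name]
        return self.peers[owner].local_store[logical_name]
```

The user only ever supplies a logical name to whichever server is contacted; which server holds the bytes is resolved behind `get`.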

Using a Data Grid - Details (SRB servers and MCAT DB)
- Data Grid has arbitrary number of servers
- Heterogeneity is hidden from users

Infrastructure Independence
- Manage migration to new choices of storage systems or access protocols or security technology
  - At the point in time when the electronic records are migrated from old technology to new storage systems, both systems are present
  - The ability of data grids to interoperate with multiple types of storage systems means they can be used to manage data migration to new types of storage systems
- Data grids provide the fundamental mechanisms needed to implement infrastructure independence
  - Storage system protocol converters (SRB drivers)
  - Application interface protocol converters
  - Management of record authenticity and integrity
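A minimal Python sketch of how a driver layer makes migration possible while both systems are present. The `StorageDriver` interface, the `InMemoryDriver` class, and the `migrate` function are assumptions for illustration and do not reflect the actual SRB driver interface.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Hypothetical protocol converter for one type of storage system."""

    @abstractmethod
    def read(self, physical_path: str) -> bytes: ...

    @abstractmethod
    def write(self, physical_path: str, data: bytes) -> None: ...


class InMemoryDriver(StorageDriver):
    """Stand-in for a real system (tape archive, disk file system, ...)."""

    def __init__(self):
        self._blobs = {}

    def read(self, physical_path):
        return self._blobs[physical_path]

    def write(self, physical_path, data):
        self._blobs[physical_path] = data


def migrate(logical_name, catalog, new_driver, new_path):
    """Copy a record to new storage, then repoint its logical name.

    `catalog` maps logical name -> (driver, physical path). The logical
    name never changes, so metadata linked to it stays attached.
    """
    old_driver, old_path = catalog[logical_name]
    data = old_driver.read(old_path)
    new_driver.write(new_path, data)
    if new_driver.read(new_path) != data:
        raise IOError("copy verification failed for " + logical_name)
    catalog[logical_name] = (new_driver, new_path)
```

Only the catalog entry changes during migration, which is why authenticity and integrity metadata keyed on the logical name survive the move.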

Import into Preservation Environment
- Storage Repository
  - Storage location
  - User name
  - File name
  - File context (creation date,…)
  - Access constraints
- Data Grid Software
  - Logical resource name space
  - Logical user name space
  - Logical file name space
  - Logical context (metadata)
  - Control/consistency constraints
- Data Collection
- Data Access Methods (C library, Unix, Web Browser)
- Data is organized as a shared collection

Logical Arrangement of Digital Records
- Persistent identifier for the record is the logical file name
- Arrangement hierarchy imposed on the logical file names as collection hierarchy
  - Record group / Record series / File-unit / Item / Object
  - Associate Life Cycle Data Requirements Guide attributes with each level of the collection hierarchy
  - Separate extensible schema associated with each logical file name
- Information about all operations performed upon digital record are mapped to the logical file name
  - Integrity attributes
  - State information describing the result of each operation
  - Logical file name is the link between authenticity information and the record
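A minimal Python sketch of a collection hierarchy expressed through logical file names. The slash-separated path syntax, the `Collection` class, and the idea of merging attributes from every level into one view are assumptions for illustration; the slides only state that attributes are associated with each level.

```python
LEVELS = ("record_group", "record_series", "file_unit", "item", "object")


def parse_logical_name(logical_name):
    """Split a logical file name into the five arrangement levels."""
    parts = [p for p in logical_name.split("/") if p]
    if len(parts) != len(LEVELS):
        raise ValueError("expected %d levels, got %d" % (len(LEVELS), len(parts)))
    return dict(zip(LEVELS, parts))


class Collection:
    """Hypothetical store of attributes keyed by collection path."""

    def __init__(self):
        self.attributes = {}     # path prefix -> dict of attributes

    def annotate(self, prefix, **attrs):
        key = prefix.rstrip("/")
        self.attributes.setdefault(key, {}).update(attrs)

    def metadata_for(self, logical_name):
        """Merge attributes from each level, outermost first."""
        merged = {}
        parts = [p for p in logical_name.split("/") if p]
        for depth in range(1, len(parts) + 1):
            prefix = "/" + "/".join(parts[:depth])
            merged.update(self.attributes.get(prefix, {}))
        return merged
```

For example, an attribute set on a record group and a checksum set on an individual object would both appear in `metadata_for` for that object's logical name.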

NARA Research Prototype Persistent Archive
- Implemented using the SDSC Storage Resource Broker (SRB) Data Grid
  - Demonstrated migration to new types of storage systems
    - Added commodity-based disk file storage systems
  - Demonstrated evolution of access methods
    - Added interfaces to web browsers, workflow systems
  - Demonstrated migration to new transport mechanisms
    - Added mechanisms to support interaction with network firewalls and support bulk load of records
  - Demonstrated replication of records across multiple systems
  - Demonstrated automated loading of authenticity metadata
  - Demonstrated mechanisms to implement a deep archive

National Archives and Records Administration Research Prototype Persistent Archive
- Federation of Five Independent Data Grids
- Diagram labels: U Md, SDSC (MCAT), Georgia Tech (MCAT), NARA II (MCAT), NARA I (MCAT)
- Extensible Environment, can federate with additional research and education sites

Federation Between Data Grids
- Data Access Methods (Web Browser, DSpace, OAI-PMH)
- Two data grids, one holding Data Collection A and one holding Data Collection B, each with:
  - Logical resource name space
  - Logical user name space
  - Logical file name space
  - Logical context (metadata)
  - Control/consistency constraints
- Access controls and consistency constraints on cross registration of digital entities

NARA Research Prototype Persistent Archive
- Powerful platform for demonstrating preservation
- Interoperability across diverse platforms
  - Multiple access mechanisms
  - Multiple types of storage systems
- Extensible schema to support authenticity and integrity attributes
  - Archivist defined metadata attributes
  - Audit trails
  - Checksums
- Mitigation of risk of data loss
  - Replication of data
  - Federation of catalogs
  - Synchronization across zones
  - Deep archive
- Collaboration between SDSC, University of Maryland, Stanford Linear Accelerator, Georgia Institute of Technology

Scalability: Automation of Preservation Processes
- Generic operations for managing interactions with files
  - Open, close, read, write, seek, stat, synch, …
  - Authentication and authorization
- Latency management operations
  - Aggregation of files in containers
  - Bulk load, unload, registration
  - Remote procedures for metadata extraction, file filtering
- Database interaction operations
  - Registration of SQL command strings
  - Extensible metadata schema
  - Table import and export
- Integrity mechanisms
  - Replication, checksum validation, synchronization
  - Audit trails
  - Federation
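Aggregating files in containers reduces latency because many small records move as one object. The Python sketch below uses the standard-library `tarfile` module as a stand-in container format; that choice, and the function names `pack_container` and `read_member`, are assumptions and say nothing about how SRB containers are laid out.

```python
import io
import tarfile


def pack_container(files):
    """Aggregate many small files (name -> bytes) into one container."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_member(container, name):
    """Read a single file back out of a container."""
    with tarfile.open(fileobj=io.BytesIO(container), mode="r") as tar:
        member = tar.extractfile(name)
        if member is None:          # returned for non-file members
            raise KeyError(name)
        return member.read()
```

A bulk load then becomes one transfer of the container plus one batch of catalog registrations, instead of one round trip per file.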

NARA Research Prototype Persistent Archive
- Prototype persistent archive design based on:
  - Data virtualization - ability to manage collection properties independently of storage system
  - Trust virtualization - ability to manage authentication and authorization independently of administrative domains
  - Latency management - scalable operations
  - Collection management - impose logical arrangement such as LCDRG hierarchy
  - Federation - ability to create preservation environments that span multiple data grids
- Automation of operations across a million records

Risk Mitigation Against Data Loss
How many replicas are enough?
- Media corruption
  - Maintain a copy on a second set of media
- Systemic vendor error
  - Maintain a copy on a separate vendor’s product
- Operational error
  - Maintain a copy in a separate administrative domain
- Natural disaster
  - Maintain a copy at a geographically remote site
- Malicious users
  - Maintain a copy in a deep archive under archivist control
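The risk list above translates directly into a check over a record's replicas. This Python sketch is illustrative only: the `Replica` fields and the `unmitigated_risks` function are hypothetical, and treating "two distinct values" as sufficient for each risk is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Hypothetical description of where one copy of a record lives."""
    media: str
    vendor: str
    admin_domain: str
    site: str
    deep_archive: bool = False


def unmitigated_risks(replicas):
    """Return the risks from the list that the replica set does not cover."""
    risks = []
    if len({r.media for r in replicas}) < 2:
        risks.append("media corruption")
    if len({r.vendor for r in replicas}) < 2:
        risks.append("systemic vendor error")
    if len({r.admin_domain for r in replicas}) < 2:
        risks.append("operational error")
    if len({r.site for r in replicas}) < 2:
        risks.append("natural disaster")
    if not any(r.deep_archive for r in replicas):
        risks.append("malicious users")
    return risks
```

Under these assumptions a single replica covers nothing, while two well-chosen replicas (different media, vendor, domain, and site, one of them in a deep archive) can cover all five risks.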

Deep Archive
- Diagram zones: Public Zones (Z1), Staging Zone (Z2), Deep Archive (Z3)
- Diagram labels: Z2:D2:U2 Register, Z3:D3:U3 Register, Pull, Firewall, Server initiated I/O, PVN
- No access by public zones

NARA Leadership
- Technology demonstrated in the NARA Research Prototype Persistent Archive is now being applied in multiple national and international collaborations
  - NSF National Science Digital Library persistent archive
  - NOAO preservation data grid for astronomy images
  - California Digital Library - Digital Preservation Repository
  - Taiwan Preservation Data Grid
  - European Union CASPAR preservation environment
- NARA support has been essential in the continued development of data grid technology for building preservation environments
  - Software distributed to 174 institutions in
  - Half of the sites are international

SDSC Research Objectives
- Understand principles underlying digital preservation
  - Authenticity - assertions made about creator
  - Integrity - assertions made by archivist about management
  - Infrastructure independence - ability to migrate to new or alternate technology
  - Scalability - automation of preservation policies
- Map from preservation principles to Information Technology concepts
  - Data virtualization
  - Trust virtualization
  - Latency management
  - Shared collection management
  - Federation
  - Policy virtualization

SRB Developers
- Reagan Moore - PI
- Richard Marciano - Sustainable Archives and Library Technology
- Michael Wan - SRB Architect
- Arcot Rajasekar - SRB Manager, Information Architect
- Wayne Schroeder - SRB Productization, Security
- Charlie Cowart - inQ, NSDL application
- Lucas Gilbert - Jargon, DSpace/Fedora integration
- Bing Zhu - Perl, Python, Windows load libraries
- Antoine de Torcy - mySRB web browser, NARA collections
- Sheau-Yen Chen - SRB Administration
- George Kremenek - SRB Collections
- Arun Jagatheesan - Matrix workflow
- Leesa Brieger - NVO Application
- Sifang Lu - ROADnet Application
- Students
75 FTE-years of development and application
About 300,000 lines of C

Storage Resource Broker 3.4.2 (architecture diagram)
- Application
- Access interfaces: C Library, Java; Unix Shell; NT Browser, Kepler Actors; OAI, GridFTP, WSDL, (WSRF); HTTP, OpenDAP, DSpace, Fedora; Linux I/O; C++ DLL / Python, Perl, Windows
- Federation Management
- Consistency & Metadata Management / Authorization, Authentication, Audit
- Logical Name Space, Latency Management, Data Transport, Metadata Transport
- Storage Repository Abstraction
  - Archives - Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS
  - File Systems - Unix, NT, Mac OSX
  - ORB
  - Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix
- Database Abstraction
  - Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix

integrated Rule-Oriented Data System
- Traditional shared collection
  - Metadata catalog manages persistent state information
- Add rule engine
  - Choose rule set to apply to a given record series
  - Allow dynamic rule changes
    - Track version of rule, date version was applied and the level of granularity (item, sub-collection)
  - Manage temporary state information needed for rule execution
  - Manage persistent state information resulting from rule application
- Apply rules that control assertions about the collection
  - Data distribution, replication, access constraints
  - Integrity validation
  - Metadata consistency
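A minimal Python sketch of the rule-engine idea: rule sets are chosen per record series, can be swapped at run time, and each application leaves persistent state recording the rule version and date. The `Rule` and `RuleEngine` classes are hypothetical and are not the iRODS rule language or API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """Hypothetical versioned rule: a named assertion about a record."""
    name: str
    version: int
    check: Callable[[dict], bool]


class RuleEngine:
    def __init__(self):
        self.rule_sets = {}      # record series -> list of Rule
        self.state = []          # persistent state from rule application

    def set_rules(self, series, rules):
        """Choose (or dynamically replace) the rule set for a series."""
        self.rule_sets[series] = list(rules)

    def apply(self, series, record, applied_on):
        """Evaluate every rule for the series and record the outcome."""
        results = {}
        for rule in self.rule_sets.get(series, []):
            outcome = rule.check(record)
            self.state.append({
                "record": record["logical_name"],
                "rule": rule.name,
                "version": rule.version,
                "applied_on": applied_on,
                "granularity": "item",
                "passed": outcome,
            })
            results[rule.name] = outcome
        return results
```

Changing policy means calling `set_rules` with a new version and re-applying it; no code that handles the records themselves has to be rewritten.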

NSF and NARA Funded Research
- Automation of the application of management policies
  - Identify, characterize, and manage rules for:
    - Clients (user access, allowed operations, views)
    - Item state information (update, consistency, validation)
    - Collection properties (global state, integrity, replication)
  - Dynamically apply the constraints to collection subsets
    - Modify rules without rewriting code
    - Re-apply rules to enforce new policies
  - Create modular system which will allow components of the data management environment to be updated independently
- Demonstrate use of dynamic rules to implement and manage preservation policies
  - Automation of the evaluation of the RLG/NARA assessment criteria for trusted digital repositories

iRODS - integrated Rule-Oriented Data System (architecture diagram)
- Client Interface, Admin Interface
- Resources
- Rule Invoker, Rule Engine, Current State
- Rule Base, Confs, Metadata Base
- Service Manager
- Micro Service Modules (Resource-based Services)
- Micro Service Modules (Metadata-based Services)
- Metadata Modifier Module, Config Modifier Module, Rule Modifier Module
- Consistency Check Modules

For More Information
Reagan W. Moore