Building Preservation Environments from Federated Data Grids
Reagan W. Moore
San Diego Supercomputer Center
Storage Resource Broker
Topics
- Preservation environments
  - Authenticity
  - Integrity
- Digital library technology
  - Metadata management
- Data grid technology
  - Technology evolution management
Preservation
- Archival processes through which a digital entity is extracted from its creation environment and migrated into a preservation environment, while maintaining authenticity and integrity information
- Extraction process requires insertion of support infrastructure underneath the digital material
- Goal is infrastructure independence, the ability to use any commercial storage system, database, or access mechanism
Preservation Communities
- InterPARES - diplomatics
  - Preservation of records
- NARA
  - Preservation of records from federal agencies
- State archives
  - Preservation of submitted “collections”
- Continuum model
  - Preservation of active data and records
Digital Libraries
- Support the community vocabulary
  - Discovery and browse using community relevant terms
- Support the community data format
  - Maintain information on the data format of each item
- Support the community access services
  - Provide services that manipulate and display the community data format
Preservation Mandates
- Diplomatics
  - Authenticity
  - Integrity
- NARA
  - Infrastructure independence
  - Scalability
- State archives
  - Automation of archival processes
InterPARES - Diplomatics
Authenticity - maintain links to metadata for:
- Date record is made
- Date record is transmitted
- Date record is received
- Date record is set aside [i.e. filed]
- Name of author (person or organization issuing the record)
- Name of addressee (person or organization for whom the record is intended)
- Name of writer (entity responsible for the articulation of the record’s content)
- Name of originator (electronic address from which record is sent)
- Name of recipient(s) (person or organization to whom the record is sent)
- Name of creator (entity in whose archival fonds the record exists)
- Name of action or matter (the activity for which the record is created)
- Name of documentary form (e.g. report, memo)
- Identification of digital components
- Identification of attachments (e.g. digital signature)
- Archival bond (e.g. classification code)
InterPARES - Diplomatics
Integrity - maintain links to metadata for:
- Name(s) of the handling office / officer
- Name of office of primary responsibility for keeping the record
- Annotations or comments
- Actions carried out on the record
- Technical modifications due to transformative migration
- Validation
Preservation Approach
Provide mechanisms to:
- Create archival context for the content
  - Context is preservation metadata (provenance, administrative, descriptive, structural, behavioral)
  - Content is the submitted digital entity
- Assert integrity - the consistency between the context and the content
  - Track operations done on material and update context
- Assert authenticity - that the material represents the original site
  - Track the chain of custody
- Manage technology evolution (encoding standard, storage repository, information repository, access methods)
Data Grids
- Manage shared collections that are distributed in space
  - Location of item, access controls, checksums
- Implement infrastructure independence
  - Standard operations for interacting with storage repositories
- Implement presentation independence
  - Standard APIs to support porting of user interfaces
Preservation Environment
- Digital library infrastructure that supports
  - Preservation metadata
  - Arrangement and description of items
  - Access mechanisms
- Data grid infrastructure that supports
  - Shared collections that are migrated forward in time
  - Management of technology evolution
  - Administrative metadata providing status of records
Infrastructure Independence
- Data Access Methods (Web Browser, DSpace, OAI-PMH)
- Storage Repository, with naming conventions provided by storage systems
  - Storage location
  - User name
  - File name
  - File context (creation date,…)
  - Access constraints
Data Grids Provide a Level of Indirection for Each Naming Convention
- Data Access Methods (C library, Unix, Web Browser)
- Data Collection: data is organized as a shared collection
- Data Grid
  - Logical resource name space
  - Logical user name space
  - Logical file name space
  - Logical context (metadata)
  - Control/consistency constraints
- Storage Repository
  - Storage location
  - User name
  - File name
  - File context (creation date,…)
  - Access constraints
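As an illustration of the level of indirection described on this slide, the following minimal sketch maps a logical file name to physical replicas. The class names, resource names, and paths are hypothetical and are not taken from the SRB code base; the sketch only shows that a file can move between storage systems while its logical name stays fixed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Replica:
    resource: str        # logical resource name (hypothetical, e.g. "hpss")
    physical_path: str   # the name the storage system itself uses


@dataclass
class LogicalNameSpace:
    # logical file name -> list of physical replicas
    entries: dict = field(default_factory=dict)

    def register(self, logical_name: str, replica: Replica) -> None:
        self.entries.setdefault(logical_name, []).append(replica)

    def resolve(self, logical_name: str) -> list:
        # Return a copy so callers cannot alter the catalog by accident.
        return list(self.entries[logical_name])

    def migrate(self, logical_name: str, old: Replica, new: Replica) -> None:
        # Swap the physical location; the logical name is untouched.
        replicas = self.entries[logical_name]
        replicas[replicas.index(old)] = new


ns = LogicalNameSpace()
old = Replica("unix-fs", "/data/vol1/a93f.tif")
ns.register("/archive/eap/photo-001.tif", old)
ns.migrate("/archive/eap/photo-001.tif", old, Replica("hpss", "/hpss/eap/0001"))
print(ns.resolve("/archive/eap/photo-001.tif"))
```

Applications that hold only the logical name keep working after the migration, which is the property the slide calls infrastructure independence.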
Data Grids
Provide two levels of indirection:
- Low level API used to interact with storage repositories
  - Standard operations for manipulating files in a storage system
  - Standard operations for manipulating a catalog stored in a database
- High level API used to support user interfaces
  - Three basic APIs - “C” library call, Unix shell commands, Java class library
  - Other interfaces are ported on top of the basic APIs
Storage Resource Broker 3.3 (architecture diagram)
- Application
- Access interfaces: C Library, Java; Unix Shell; Linux I/O; C++; DLL / Python, Perl, Windows; NT Browser, Kepler Actors; HTTP, DSpace, OpenDAP, GridFTP; OAI, WSDL, (WSRF)
- Federation Management
- Consistency & Metadata Management / Authorization, Authentication, Audit
- Logical Name Space; Latency Management; Data Transport; Metadata Transport
- Database Abstraction
  - Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix
- Storage Repository Abstraction
  - Archives - Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS
  - File Systems - Unix, NT, Mac OSX
  - ORB
  - Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix
Standard Data Access Operations
- Common set of operations for interacting with every type of storage repository
- User Application
- Remote operations
  - Unix file system
  - Latency management
  - Procedures
  - Transformations
  - Third party transfer
  - Filtering
  - Queries
- Collective operations
  - Replication
  - Fault tolerance
  - Load leveling
- Archive at SDSC, Archive at NARA, Archive at U Md
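The idea of one common set of operations for every type of storage repository can be sketched as an abstract driver interface with interchangeable back ends. The interface below (`StorageDriver`, `put`, `get`, `exists`) and both drivers are assumptions made for illustration; they do not reproduce the actual SRB driver API.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StorageDriver(ABC):
    """Hypothetical common operation set every repository must support."""

    @abstractmethod
    def put(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, name: str) -> bytes: ...

    @abstractmethod
    def exists(self, name: str) -> bool: ...


class MemoryDriver(StorageDriver):
    """Stands in for any repository; keeps objects in a dictionary."""

    def __init__(self):
        self._blobs = {}

    def put(self, name, data):
        self._blobs[name] = bytes(data)

    def get(self, name):
        return self._blobs[name]

    def exists(self, name):
        return name in self._blobs


class DirectoryDriver(StorageDriver):
    """Stores each object as a file in one directory (flat names only)."""

    def __init__(self, root):
        self._root = Path(root)

    def put(self, name, data):
        (self._root / name).write_bytes(data)

    def get(self, name):
        return (self._root / name).read_bytes()

    def exists(self, name):
        return (self._root / name).is_file()


def replicate(name: str, source: StorageDriver, target: StorageDriver) -> None:
    # Written once against the interface, so it works for any driver pair.
    target.put(name, source.get(name))


with tempfile.TemporaryDirectory() as root:
    src, dst = MemoryDriver(), DirectoryDriver(root)
    src.put("record.txt", b"federal record")
    replicate("record.txt", src, dst)
    print(dst.get("record.txt"))
```

Because `replicate` depends only on the abstract operations, adding a new repository type means writing one new driver rather than changing every collective operation.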
Building a Distributed Collection
- Data Grid: common naming convention and set of attributes for describing digital entities
- User Application
- Logical name space
  - Location independent identifier
  - Persistent identifier
- Collection owned data
  - Authenticity metadata
  - Access controls
  - Audit trails
  - Checksums
- Descriptive metadata
- Inter-realm authentication
  - Single sign-on system
- Archive at SDSC, Archive at NARA, Archive at U Md
Federated Server Architecture (diagram)
- Components: Read Application, SRB servers, SRB agents, MCAT, replicas R1 and R2
- Request: Logical Name Or Attribute Condition
- MCAT steps
  1. Logical-to-Physical mapping
  2. Identification of Replicas
  3. Access & Audit Control
- Peer-to-peer Brokering
- Server(s) Spawning
- Data Access
- Parallel Data Access
Managing Access
- Authenticate users independently of storage systems
  - Preservation environment owns the data
- Authorize data access independently of storage system
  - ACLs on both data and metadata
- Maintain audit trails of all accesses
  - Both read and write
Collection-owned Data
- Store data at remote storage system under data-grid ID
- Access data through data grid servers
- Track all operations on data and update state information
- User authenticates to a data grid server
- Access controls are checked for permissions
- Data grid servers authenticate messages from other servers
- Remote server authenticates to remote storage system
- Multiple authentication mechanisms
  - GSI / challenge-response / tickets
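A minimal sketch of access control that is checked by the data grid server rather than by the storage system, with every read and write attempt recorded in an audit trail. The `DataGridServer` class, its methods, and the user names are hypothetical; a dictionary stands in for the remote storage system.

```python
from dataclasses import dataclass, field


@dataclass
class DataGridServer:
    acl: dict = field(default_factory=dict)    # logical name -> {user: {"read", "write"}}
    audit: list = field(default_factory=list)  # (user, operation, logical name, allowed)
    store: dict = field(default_factory=dict)  # stand-in for the remote storage system

    def grant(self, logical_name, user, permission):
        self.acl.setdefault(logical_name, {}).setdefault(user, set()).add(permission)

    def _check(self, user, operation, logical_name):
        allowed = operation in self.acl.get(logical_name, {}).get(user, set())
        # Record the attempt whether or not it is permitted.
        self.audit.append((user, operation, logical_name, allowed))
        if not allowed:
            raise PermissionError(f"{user} may not {operation} {logical_name}")

    def write(self, user, logical_name, data):
        self._check(user, "write", logical_name)
        self.store[logical_name] = data

    def read(self, user, logical_name):
        self._check(user, "read", logical_name)
        return self.store[logical_name]


server = DataGridServer()
server.grant("/archive/report.txt", "archivist", "write")
server.grant("/archive/report.txt", "public", "read")
server.write("archivist", "/archive/report.txt", b"record")
print(server.read("public", "/archive/report.txt"))
```

In this sketch the storage back end never sees end-user identities; only the server decides who may read or write, which mirrors the statement that the preservation environment owns the data.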
Provide Context for Data
- Properties of files
  - Provenance - source
  - Descriptive attributes
  - Structure
- Organize properties as metadata in a collection hierarchy
- Define operations on file properties
- Manage state information - location, replicas, containers
- Separate context management from content management
- Maintain consistency of context as operations are done on content
Database Operations
- Standard interface to support
  - Schema extension - user defined attributes
  - Snowflake table creation
  - SQL generation
  - Import and export of XML files
  - Bulk metadata load and unload
- Operations required to manage a catalog that resides in a database
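Schema extension with user-defined attributes and SQL generation can be illustrated with an attribute-value table in SQLite. The table layout and function names below are assumptions chosen for brevity; they are not the MCAT schema.

```python
import sqlite3


def open_catalog():
    conn = sqlite3.connect(":memory:")
    # One row per (file, attribute, value): new attributes need no ALTER TABLE.
    conn.execute(
        "CREATE TABLE metadata ("
        "logical_name TEXT NOT NULL, attribute TEXT NOT NULL, value TEXT NOT NULL)"
    )
    return conn


def add_attribute(conn, logical_name, attribute, value):
    conn.execute(
        "INSERT INTO metadata VALUES (?, ?, ?)", (logical_name, attribute, value)
    )


def build_query(conditions):
    """Generate parameterized SQL that matches files meeting every condition."""
    sql = "SELECT DISTINCT m0.logical_name FROM metadata m0"
    where, params = [], []
    for i, (attribute, value) in enumerate(sorted(conditions.items())):
        if i > 0:
            # One self-join per additional attribute condition.
            sql += f" JOIN metadata m{i} ON m{i}.logical_name = m0.logical_name"
        where.append(f"m{i}.attribute = ? AND m{i}.value = ?")
        params += [attribute, value]
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql + " ORDER BY m0.logical_name", params


def find(conn, conditions):
    sql, params = build_query(conditions)
    return [row[0] for row in conn.execute(sql, params)]


catalog = open_catalog()
add_attribute(catalog, "/archive/a.tif", "format", "tiff")
add_attribute(catalog, "/archive/a.tif", "agency", "NARA")
add_attribute(catalog, "/archive/b.tif", "format", "tiff")
print(find(catalog, {"format": "tiff", "agency": "NARA"}))
```

Values are passed as bound parameters rather than spliced into the SQL text, so user-supplied attribute values cannot change the structure of the generated query.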
National Archives and Records Administration - Research Prototype Persistent Archive
Federation of Three Independent Data Grids
- Sites: NARA, U Md, SDSC; MCAT
- Principal copy stored at NARA with complete metadata catalog
- Replicated copy at U Md for improved access, load balancing and disaster recovery
- Deep Archive at SDSC, no user access, but complete copy
- Demonstrate preservation environment
  - Authenticity
  - Integrity
  - Management of technology evolution
  - Mitigation of risk of data loss
    - Replication of data
    - Federation of catalogs
  - Management of preservation metadata
  - Scalability
- EAP collection: 350,000 files, 1.2 TB in size
Preservation Requirements
- Maintain authenticity and integrity of electronic records
  - Authenticity - assertion of provenance of data
  - Integrity - assertion of invariance of bits
- Manage risk of data loss
  - Media corruption / System failures / Operational errors / Natural disaster / Malicious users
- Manage technology obsolescence
  - Support migration of collection to new systems
  - Bulk data operations
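Integrity as "invariance of bits" and replication as protection against data loss can be combined in a short sketch: each replica is compared against a stored checksum, and damaged copies are restored from an intact one. The slides do not name a checksum algorithm, so SHA-256 is an assumption here, as are the function names; the site labels follow the three archives named in the talk.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is an assumed choice; any strong digest would serve the sketch.
    return hashlib.sha256(data).hexdigest()


def find_corrupt(replicas: dict, expected: str) -> list:
    """Return the sites whose copy no longer matches the recorded checksum."""
    return sorted(site for site, data in replicas.items() if checksum(data) != expected)


def repair(replicas: dict, expected: str) -> list:
    """Overwrite corrupt copies from an intact replica; return repaired sites."""
    corrupt = find_corrupt(replicas, expected)
    good = [site for site in replicas if site not in corrupt]
    if not good:
        raise RuntimeError("no intact replica left to repair from")
    for site in corrupt:
        replicas[site] = replicas[good[0]]
    return corrupt


original = b"electronic record"
expected = checksum(original)  # recorded as preservation metadata at ingest
replicas = {"NARA": original, "U Md": original, "SDSC": b"electronic recorc"}
repaired = repair(replicas, expected)
print(repaired)
```

The checksum is stored with the context rather than with the content, so a corrupted replica cannot silently carry a matching but wrong digest along with it.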
Federation
- Data Access Methods (Web Browser, DSpace, OAI-PMH)
- Data Collection A, managed by a Data Grid
  - Logical resource name space
  - Logical user name space
  - Logical file name space
  - Logical context (metadata)
  - Control/consistency constraints
- Data Collection B, managed by a Data Grid
  - Logical resource name space
  - Logical user name space
  - Logical file name space
  - Logical context (metadata)
  - Control/consistency constraints
- Access controls and consistency constraints on cross registration of digital entities
Data Grid Zones
- Choose how name spaces will be shared
- Cross register storage resources
  - May the other data grid write to my storage?
- Cross register user names
  - Users are authenticated by their home zone
- Cross register files
  - Can replicate files into another data grid
- Cross register metadata
  - Can build a copy of the metadata catalog
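The rule that cross-registered users are authenticated by their home zone can be sketched as follows. The `Zone` class, the `user@zone` naming form, and the plain shared-secret comparison are all assumptions for illustration; a real deployment would use one of the mechanisms listed earlier (GSI, challenge-response, or tickets).

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    name: str
    secrets: dict = field(default_factory=dict)         # local user -> secret
    peers: dict = field(default_factory=dict)           # zone name -> Zone
    cross_registered: set = field(default_factory=set)  # "user@zone" names

    def add_user(self, user, secret):
        self.secrets[user] = secret

    def verify_local(self, user, secret):
        # Plain comparison stands in for a real authentication protocol.
        return self.secrets.get(user) == secret

    def cross_register_user(self, user, home):
        # Only the name is registered here; the secret stays in the home zone.
        self.peers[home.name] = home
        self.cross_registered.add(f"{user}@{home.name}")

    def authenticate(self, qualified_name, secret):
        user, _, zone = qualified_name.partition("@")
        if zone == self.name:
            return self.verify_local(user, secret)
        if qualified_name not in self.cross_registered:
            return False
        # Delegate the check to the user's home zone.
        return self.peers[zone].verify_local(user, secret)


nara = Zone("nara")
sdsc = Zone("sdsc")
nara.add_user("alice", "s3cret")
sdsc.cross_register_user("alice", nara)
print(sdsc.authenticate("alice@nara", "s3cret"))
```

Each zone keeps control of its own user credentials, so federating two grids does not require copying secrets between them.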
Federation Environments (diagram)
- Axes: Replication Constraints, Consistency Constraints, Access Constraints
- Families: Peer-to-Peer Data Grids, Replication Data Grids, Hierarchical Data Grids
- Federation types: Occasional Interchange, Free Floating, Resource Interaction, User and Data Replica, Nomadic, Snow Flake, Master Slave, Replicated Data, Replicated Catalog, Deep Archive
- Transition labels: Partial User-ID Sharing; Partial Resource Sharing; No Metadata Synch; Hierarchical Zone Organization; One Shared User-ID; System Managed Replication; Connection From Any Zone; Complete Resource Sharing; System Set Access Controls; System Controlled Complete Synch; Complete User-ID Sharing; System Controlled Partial Synch; No Resource Sharing; Super Administrator Zone Control; No User-ID Sharing
Examples of Extensibility
- Storage Repository Driver evolution
  - Initially supported Unix file system
  - Added archival access - UniTree, HPSS
  - Added FTP/HTTP
  - Added database blob access
  - Added database table interface
  - Added Windows file system
  - Added project archives - Dcache, Castor, ADS
  - Added Object Ring Buffer, Datascope
  - Adding GridFTP - version 3.3
- Database management evolution
  - Postgres
  - DB2
  - Oracle
  - Informix
  - Sybase
  - mySQL (most difficult port - no locks, no views, limited SQL)
Examples of Extensibility
- The 3 fundamental APIs are C library, shell commands, Java
  - Other access mechanisms are ported on top of these interfaces
- API evolution
  - Initial access through C library, Unix shell command
  - Added inQ Windows browser (C++ library)
  - Added mySRB Web browser (C library and shell commands)
  - Added Java (Jargon)
  - Added Perl/Python load libraries (shell command)
  - Added WSDL (Java)
  - Added OAI-PMH, OpenDAP, DSpace digital library (Java)
  - Added Kepler actors for dataflow access (Java)
  - Adding GridFTP - version 3.3 (C library)
Sites Using the SRB
Preservation Strategies
- Emulation
  - Migrate the display application onto new operating systems
  - Equivalent to forcing use of candlelight to look at 16th century documents
- Transformative migration
  - Migrate the encoding format to the new standard
  - Migration period is expected to be 5-10 years
- Persistent object
  - Characterize the encoding format
  - Migrate the characterization forward in time
Persistent Objects
- Display Applications: characterize standard manipulation operations
- Digital Entities: characterize encoding format - data structure
Preservation
- Archival processes through which a digital entity is extracted from its creation environment and migrated to a preservation environment, while maintaining authenticity and integrity information
- Extraction process requires
  - Insertion of support infrastructure underneath the digital material
  - Characterization of the authenticity and integrity
  - Characterization of the digital encoding format
  - Characterization of the display operations
- Goal is infrastructure independence, the ability to use any commercial storage system, database, or access mechanism
For More Information
Reagan W. Moore
San Diego Supercomputer Center