National Partnership for Advanced Computational Infrastructure
Collection-based Persistent Archives
Reagan W. Moore
Associate Director, Data Intensive Computing
San Diego Supercomputer Center
Topics
  Lessons learned building a prototype persistent archive
    Information model
    Hierarchical levels of information
    Interoperability mechanisms
  Application to workshop topics
    Ingestion methodology
    Data set identification
    Certification of archives
Persistent Archive Goals
  Provide collection-based archive
    Data set relevance is organized by the collection
  Provide information model to describe the context for the data collection
    Enough information is needed to be able to dynamically create the collection from archived information
  Decouple collection creation from digital object archiving
  Provide accessioning system to turn data sets into digital objects
    Accessioning is independent of the final collection
NARA Persistent Archive Prototype
  Demonstrate ability to ingest, archive, recreate, query, and present a digital object from a 1 million record collection (RFC1036)
    2.5 GB of data
    6 required fields
    13 optional fields
    User defined fields (over 1000)
  Determine information model needed for persistent archive
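As a rough illustration of turning one record of such a collection into a tagged digital object, the sketch below parses a message in RFC 1036 header format and wraps each field in an XML element. The six required header names are those listed in RFC 1036; the element names `message`, `field`, and `body`, the `kind` attribute, and the function name are assumptions made for this example and are not taken from the prototype's DTD.

```python
from email import message_from_string
import xml.etree.ElementTree as ET

# The six required headers of RFC 1036; all other headers are treated as
# optional or user defined.
REQUIRED = {"from", "date", "newsgroups", "subject", "message-id", "path"}


def message_to_xml(raw: str) -> ET.Element:
    """Wrap one RFC 1036 message in XML tags (hypothetical element names)."""
    msg = message_from_string(raw)
    missing = sorted(name for name in REQUIRED if msg[name] is None)
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    obj = ET.Element("message")
    for name, value in msg.items():
        kind = "required" if name.lower() in REQUIRED else "other"
        field = ET.SubElement(obj, "field", {"name": name, "kind": kind})
        field.text = value
    ET.SubElement(obj, "body").text = msg.get_payload()
    return obj


if __name__ == "__main__":
    sample = (
        "From: jerry@eagle.example (Jerry)\n"
        "Path: news.example!eagle!jerry\n"
        "Newsgroups: news.announce\n"
        "Subject: Usenet Etiquette\n"
        "Message-ID: <642@eagle.example>\n"
        "Date: Fri, 19 Nov 82 16:14:55 GMT\n"
        "X-Site: eagle\n"
        "\n"
        "Hello.\n"
    )
    print(ET.tostring(message_to_xml(sample), encoding="unicode"))
```

Because the tags travel with the object, the record stays interpretable without the software that originally produced it.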
Key Concepts Learned
  Information model
    Semi-structured representation of information - XML
    Infrastructure independent representation of information context - XML DTD
    Differentiation between information context for digital objects, collection, and presentation
      DTD for objects
      DTD for collection
      XSL style sheets for presentation
    Instantiation software for creating the collection from the information model
      XML databases now appearing
Hierarchy of Information Contexts
  Digital object context
    Meta-data to define the structure of the object
    When publishing a digital object, must also publish the context of the object
  Use collections to organize objects
    Meta-data to define the structure of the collection
    When publishing a collection, must also publish the information needed to organize the collection
  Use presentation context to control access
    Meta-data to define structure of presentation
XML DTD for
Formatted Message Using XML DTD
Key Concepts Learned
  Digital object encapsulation
    Minimize the number of times a digital object must be touched
    Once archived, a digital object should only be retrieved when requested by a user
      Implies meta-data stored with digital objects should only describe the objects
      Collection and presentation meta-data should be stored separately
Persistent Archive Requirements
  Distributed environment to ensure separable components
    Accession workbench
    Archive
    Presentation platform
  Data handling mechanisms for interoperability as basis for system evolution
    No tightly coupled systems
    Unique names are only used by the data handling system
  Use of containers to aggregate digital objects for storage
    Implies a hierarchical naming scheme
    Collection / container / digital object
Electronic Records Archive (ERA)
  [Architecture diagram; labels recovered from the figure]
  Stages: Transfer, Accession, Archives, Reference
  Media handlers: tape, disk, CD, FTP
  Wrapper and unwrapper; metadata wrapper record
  Repositories: metadata repository, records repository
  Accessioning Work Bench (snap-in): text, image, photo, video, audio, geographical information system, compound records, web, database
  Reference Workbench (snap-in): arrangement, ARC catalog, order fulfillment, retrieve records
  Query & reference tools: Internet, intranet, presentation
Federation of Data Collections into Digital Libraries
  [Diagram; collection and library labels recovered from the figure]
  DPOSS Sky Survey, 2MASS Sky Survey, NASA Catalog, NS Dig Lib
  Wash. Brain Image, UCLA Brain Image, MSU Brain Image, UCSD Neuroscience
  CEED / ESA, REINAS, U Md Archive, ADL, Elib - Flora, ESS Dig Lib
  Protein Data Bank, Wash U Genome, U H Mol Trajectory, MS Dig Lib
  UC Calif Finding Aids, UMDL Social Science, AMICO Image Library, NARA Persistent Archive, U Wisc. Video Lib., Pacific Rim DL
Conclusions
  Ingestion
    Infrastructure independent representation for digital objects
    Infrastructure independent representation for information model
    Turn data sets into digital objects by adding attribute tags
    Aggregate digital objects in containers for storage
Conclusions
  Data set identification
    Unique names only required by data handling system
    Attribute based access through collection
    Hierarchical naming
      Collection / Container / Digital object
      Finding Aid for collection / Data handling system ID for container / Unique ID for object
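The naming scheme above can be sketched as a small catalog in which users search by attribute and only the data handling layer sees the container and object identifiers. This is a minimal sketch under stated assumptions: the class name, method names, and identifier formats are hypothetical and do not describe the prototype's actual catalog.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """A collection that resolves attribute queries to hierarchical names."""

    name: str
    # Each entry pairs the internal (container_id, object_id) with the
    # user-visible attributes that describe the object.
    entries: list = field(default_factory=list)

    def add(self, container_id: str, object_id: str, **attributes) -> None:
        self.entries.append(((container_id, object_id), attributes))

    def find(self, **wanted) -> list:
        """Return collection/container/object names matching all attributes."""
        return [
            f"{self.name}/{container_id}/{object_id}"
            for (container_id, object_id), attrs in self.entries
            if all(attrs.get(key) == value for key, value in wanted.items())
        ]


if __name__ == "__main__":
    usenet = Collection("usenet")
    usenet.add("container-0001", "obj-000042", newsgroup="comp.lang.c", year=1999)
    usenet.add("container-0001", "obj-000043", newsgroup="sci.math", year=1999)
    print(usenet.find(newsgroup="sci.math"))
```

The caller never supplies a unique name; it is produced by the lookup and handed to the storage layer.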
Conclusions
  Certification of persistent archive
    Demonstrate that the archive can provide an infrastructure independent representation for
      Finding aids for locating collections
      Information model for building collection
      Data handling system container IDs for storage access
      Digital object attribute tags
    Demonstrate that the information models can be used to create finding aids, collections, and access interfaces on new technology
    Demonstrate that any component of the architecture can be migrated independently
Further Information
NARA Persistent Archive
Context Based Objects
  For data to be useful, the context must be defined
    Data format - binary/integer representation
    Physical meaning - units
    Structure - geometry
    Relevance - feature annotation
    Semantics - data dictionary for attributes
  Context is preserved as meta-data attributes
Information Models for Organization of Data
  Digital Object Attributes
  Collection Attributes
  Presentation Attributes
Information Models for Access to Data
  Presentation of data from multiple digital libraries
  Collections from federated databases
  Digital Object Attributes
Common Information Model
  eXtensible Markup Language (XML)
    Use tags to define semantic context for components of the data set
  Document Type Definition (DTD)
    Provides semi-structured representation for organizing tags that can be applied to groups of digital objects
  Development of standards for tags
    Digital sky, Protein Data Bank, Neuroscience brain images
    California Digital Library - Art Museum Image Consortium
Information Management Hierarchy
  Presentation / Information Discovery / Analysis
    Visualization - Shastra, 3D visualization tools
    Presentation information model - XSL style sheet
  Collection organization
    Meta-data catalog - MCAT
    Collection information model - XML DTD
  Data handling
    Storage Resource Broker - SRB
  Storage
    Archival storage system - HPSS
    Digital object model - XML DTD
Open Grid Architecture to Encourage Interoperability
  [Architecture diagram; labels recovered from the figure]
  Data Handling Systems, Storage Resources, Remote Procedure Execution, Data Model Management, Application, Storage System Description, Information Discovery, Dynamic Info Discovery
Technology Sources
  Archive Community
    IEEE Mass Storage Systems Technical Committee
    Scalable storage systems
  Digital Library Community
    NSF Digital Library Initiative, Phase II
    Information management mediation - XML
  Supercomputer Community
    Scalable analysis platforms
  Grid Forum
    Data handling systems for interoperability
  Archivist Community / Library Community
    Management policies and standards
Technology Sources
  [Architecture diagram; labels recovered from the figure]
  Data Handling Systems, Storage Resources, Remote Procedure Execution, Data Model Management, Application, Storage System Description, Information Discovery, Dynamic Info Discovery
  Digital Library, Computational Grid
Information Management Architecture
  Digital library community technologies
    Distributed information resources
    Digital library interoperability protocols - SDLIP
    Mediation of information using XML - MIX
  Grid Forum technologies
    Support for distributed services / procedures
    Inter-realm authentication
      Grid Security Infrastructure (GSI)
    Data handling system
      Storage Resource Broker, Meta-data Catalog
Grid Forum Data Access Architecture
  [Architecture diagram; labels recovered from the figure]
  Components: Data Handling Systems, Storage Resources, Remote Procedure Execution, Data Model Management, Application, Storage System Description, Information Discovery, Dynamic Info Discovery
  API that provides “glue” to underlying storage, QoS, etc. [GASS, IBP, SRB]
  API that provides “glue” to underlying data handling systems (security, scheduling, QoS, access protocol, data format/model, adaptivity, info discovery, location control)
  Systems and technologies named in the figure:
    DPSS, DFS, NFS
    HPSS, ADSM, DMF, Unitree, NASstore, DB2, Oracle, Informix, Sybase, O2, ObjectStore, Objectivity
    Armada, D’agents, FEL, ADR
    GRAM, SRB, Java, CORBA + authentication + authorization
    GloPerf, Netlogger, NWS
    Condor, GASS, NILE, SRB, I-2 caching, ADR
    DTD, ADR, object class
    LDAP, Database, Flat file, Object database
Data Handling System Capabilities
  SDSC Storage Resource Broker
    Protocol transparency
      Common API for access to remote data resources
      Explicit drivers for each type of storage system
    Name transparency
      Attribute based access to data
    Location transparency
      Distribution of collection across multiple physical resources
    Time transparency
      Minimization of latency for data access
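Protocol and location transparency can be illustrated with a common read call that dispatches to one driver per storage type. The sketch below is not the SRB API: `StorageDriver`, `InMemoryDriver`, `LocalFileDriver`, `Broker`, and all method names are hypothetical, chosen only to show the driver pattern described above.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class StorageDriver(ABC):
    """One explicit driver per type of storage system."""

    @abstractmethod
    def read(self, physical_path: str) -> bytes:
        ...


class InMemoryDriver(StorageDriver):
    """Stand-in for a remote resource; keeps data in a dictionary."""

    def __init__(self, blobs: dict):
        self.blobs = dict(blobs)

    def read(self, physical_path: str) -> bytes:
        return self.blobs[physical_path]


class LocalFileDriver(StorageDriver):
    """Reads from the local file system."""

    def read(self, physical_path: str) -> bytes:
        return Path(physical_path).read_bytes()


class Broker:
    """Common API: callers use logical names, never resource-specific calls."""

    def __init__(self):
        self.drivers = {}   # resource name -> driver
        self.catalog = {}   # logical name -> (resource name, physical path)

    def register_resource(self, resource: str, driver: StorageDriver) -> None:
        self.drivers[resource] = driver

    def register_object(self, logical: str, resource: str, path: str) -> None:
        self.catalog[logical] = (resource, path)

    def read(self, logical: str) -> bytes:
        resource, path = self.catalog[logical]
        return self.drivers[resource].read(path)


if __name__ == "__main__":
    broker = Broker()
    broker.register_resource("archive", InMemoryDriver({"/a/0001": b"record"}))
    broker.register_object("usenet/msg-1", "archive", "/a/0001")
    print(broker.read("usenet/msg-1"))
```

Moving an object to a different resource then only requires updating the catalog entry; callers keep using the same logical name.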
SDSC Storage Resource Broker & Meta-data Catalog
  [Architecture diagram; labels recovered from the figure]
  Application, SRB, MCAT
  Storage systems: ADSM, HPSS, DB2, Oracle, Unix
  Other labels: File SID, DBLobj SID, Obj SID
  MCAT: Dublin Core, Resource, User, Application Meta-data
Time Transparency
  How to minimize latency
    Prefetch data to local high performance disk, so that all accesses can be done at high speed from local resources
  How to maximize access rate
    Composite or aggregate data into a single data set to avoid multiple accesses
    Stream data at high rates using parallel I/O, amortizing the access latency by the volume of data that is delivered
  How to avoid congestion
    Replicate data across multiple servers
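The amortization argument can be made concrete with a simple cost model: one access costs a fixed latency plus volume divided by bandwidth. The model and the numbers used below (20 s access latency, 10 MB/s bandwidth, 100 objects of 1 MB) are illustrative assumptions, not measurements from the systems described here.

```python
def transfer_time(volume_mb: float, latency_s: float, bandwidth_mb_s: float) -> float:
    """Time for one access: fixed latency plus streaming time."""
    return latency_s + volume_mb / bandwidth_mb_s


def effective_rate(volume_mb: float, latency_s: float, bandwidth_mb_s: float) -> float:
    """Delivered MB/s once the access latency is included."""
    return volume_mb / transfer_time(volume_mb, latency_s, bandwidth_mb_s)


if __name__ == "__main__":
    latency_s, bandwidth = 20.0, 10.0  # assumed values

    # A small object is dominated by latency; a large one approaches the
    # raw bandwidth.
    print(effective_rate(1, latency_s, bandwidth))     # about 0.05 MB/s
    print(effective_rate(1000, latency_s, bandwidth))  # about 8.3 MB/s

    # Fetching 100 objects of 1 MB separately pays the latency 100 times;
    # fetching them as one aggregated data set pays it once.
    separate = 100 * transfer_time(1, latency_s, bandwidth)
    aggregated = transfer_time(100, latency_s, bandwidth)
    print(separate, aggregated)  # roughly 2010 s versus 30 s
```

Under this model the aggregated fetch is faster whenever more than one object is needed, since every object fetched separately pays the access latency again.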
SRB Containers - Managing Archive Latency
  Create container in a logical storage resource containing at least one “cacheable” resource
  Create objects in containers
  “Cache” daemon will move filled containers to archive
  Synch and purge APIs
  [Diagram labels: SRB client, UNIX, Distributed Storage Resources, SRB Server, HPSS, container, cached containers]
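The container mechanism can be sketched as follows: small objects are appended to a container held on a cache resource, a filled container is copied to the archive (sync) and dropped from the cache (purge), and a later read stages the whole container back. This is a simplified illustration, not SRB code: the class and method names are hypothetical, dictionaries stand in for the cache and archive resources, and the move to archive happens inline rather than in a separate daemon.

```python
class ContainerCache:
    """Aggregate small objects into fixed-capacity containers."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.cached = {}    # container id -> bytearray on the cache resource
        self.archive = {}   # container id -> bytes in archival storage
        self.index = {}     # object name -> (container id, offset, length)
        self._current = 0   # id of the container currently being filled

    def put(self, name: str, data: bytes) -> None:
        if len(data) > self.capacity:
            raise ValueError("object larger than a container")
        buf = self.cached.setdefault(self._current, bytearray())
        if len(buf) + len(data) > self.capacity:
            # Container is full: move it to the archive and start a new one.
            self.sync(self._current)
            self.purge(self._current)
            self._current += 1
            buf = self.cached.setdefault(self._current, bytearray())
        self.index[name] = (self._current, len(buf), len(data))
        buf.extend(data)

    def sync(self, container_id: int) -> None:
        """Copy a cached container to the archive."""
        self.archive[container_id] = bytes(self.cached[container_id])

    def purge(self, container_id: int) -> None:
        """Drop a container from the cache, but only if it is archived."""
        if container_id not in self.archive:
            raise RuntimeError("refusing to purge an unsynced container")
        del self.cached[container_id]

    def get(self, name: str) -> bytes:
        container_id, offset, length = self.index[name]
        if container_id not in self.cached:
            # One archive access stages every object in the container.
            self.cached[container_id] = bytearray(self.archive[container_id])
        return bytes(self.cached[container_id][offset:offset + length])


if __name__ == "__main__":
    store = ContainerCache(capacity_bytes=10)
    store.put("a", b"12345")
    store.put("b", b"6789")
    store.put("c", b"xyz")   # does not fit, so container 0 is archived
    print(store.get("b"), sorted(store.archive))
```

After the first read from an archived container, its other objects are served from the cache without another archive access.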
Generality of Information Infrastructure
  Same information model needed to manage
    Federation in space
      Metacomputing environment
      Interoperable services for digital libraries
    Migration over time
      Collection creation and update
      Persistent archive
  Same storage systems needed to support
    Supercomputer center data
    Discipline specific data collections
    Digital library collections
Art Museum Image Consortium
  Demonstrated
    Support for heterogeneous digital objects
    Automated conversion of meta-data to XML DTD
    Validation of meta-data
    XSL style sheet for presenting information
AMICO Meta-data Conversion to XML
AMICO Presentation Interface
Facilitate the conduct of science through development of knowledge resources
  Publish - Data collection infrastructure
  Info discovery - Digital Library infrastructure
  Data access - Data handling infrastructure
Apply to federal, state, and university projects
  NSF / DOE / NASA / USPTO / NARA / Census Bureau
  California Digital Library
  UCSD - Pacific Rim Digital Library Alliance
Publishing Scientific Data
  [Diagram; labels recovered from the figure]
  Archival Storage, Applications, Digital Library
  Data Storage, Information Management, Collection Building
  Libraries: CDL, UCB - Elib, UCSB - ADL, Stanford - SDLIP, U Michigan - UMDL
  Applications: Digital Sky, Neuroscience, Protein Data Bank, Molecular Structures, Earth Systems Science
NPACI is a National Partnership of Partnerships
  46 institutions
  20 states
  4 countries
  5 national labs
  Many projects (new and old)
  Vendors and industry
  Government agencies
Provide Teraflops / Petabyte capable systems for use by national academic community
  Current systems at the San Diego Supercomputer Center
    250 Gflops peak computation rate
      IBM SP, CRAY T3E
    250 Terabyte archive capacity, 100 TB in archive
      High Performance Storage System
  By end of year
    1 TFlop peak computation rate
      IBM SP
    500 Terabyte archive capacity
Challenges
  Facilitate access to high-end resources
  Support data intensive computing
  Facilitate access to distributed data resources
  Support information discovery
  Minimize complexity of user interfaces
  Provide unifying data access system
  Requires information management infrastructure
Bio-Informatics
Art Museum Image Consortium - AMICO
National Virtual Observatory
California Digital Library