Collection-Based Persistent Archives
  Reagan Moore, Chaitan Baru, Amarnath Gupta, George Kremenek, Bertram Ludaescher, Richard Marciano, Arcot Rajasekar, Wayne Schroeder, Michael Wan, Ilya Zaslavsky, Bing Zhu
  (http://www.npaci.edu/DICE/)
Topics
  Components of a persistent archive
  Information management example
  Data management example
  Knowledge management example
Fundamental Concept for a Persistent Archive
  Persistence requires migration over time onto new technology.
  While the migration occurs, a persistent archive must be able to interoperate with both the old technology and the new technology.
  A persistent archive is an interoperability system.
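One way to picture "interoperating with both the old and the new technology" is a thin facade that answers reads from either back end while data is moved in batches. The sketch below is only an illustration: the class name `MigratingStore` and the use of plain dicts as stand-ins for storage systems are assumptions, not something described in the slides.

```python
class MigratingStore:
    """Facade over an old and a new storage back end during a migration.

    Both back ends are plain dicts here; they stand in for real systems.
    """

    def __init__(self, old, new):
        self.old = old
        self.new = new

    def read(self, key):
        # Prefer the new technology, fall back to the old one.
        if key in self.new:
            return self.new[key]
        return self.old[key]

    def write(self, key, value):
        # New data only ever lands on the new technology.
        self.new[key] = value

    def migrate(self, batch_size):
        """Move up to batch_size items from old to new; return how many moved."""
        moved = 0
        for key in list(self.old)[:batch_size]:
            # Keep a newer value if one was already written to the new back end.
            self.new.setdefault(key, self.old[key])
            del self.old[key]
            moved += 1
        return moved


store = MigratingStore(old={"a": 1, "b": 2}, new={})
store.write("c", 3)
store.migrate(batch_size=1)
```

Callers only ever use `read` and `write`, so they do not need to know how far the migration has progressed.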
What Types of Interoperability are Needed?
  Data management (data sets)
    Ability to work with multiple types of storage systems, across separate administration domains
  Information management (schema)
    Ability to define a collection independent of database choice
    Ability to migrate collection onto new databases
  Knowledge management (ontology)
    Ability to map old concepts to current view of the world
    Ability to present and manipulate information associated with data sets
Implicit Concepts
  Infrastructure independence
    Information models
    Data set access
    Authentication
    Collection management
    Presentation
  Non-proprietary formatting
    Information models
      XML - Information markup language
      GML - Graphics markup language
  Functional separation of archival systems
    Accessioning workbench, archive, access workbench
Implicit Goals
  Maintain digital objects and the information retrieval catalog description in the archive
  Provide ability to instantiate collection as needed on new technology
  Instantiate archived collection only when needed
    Implies collection can sit in the archive forever, and can still be accessed at an arbitrary point in the future
Electronic Records Archive (ERA)
  [Figure: ERA architecture diagram. Stages: Transfer, Accession, Archives, Reference. Labels: media handlers (tape, CD, disk, FTP); accessioning workbench (snap-in); metadata wrapper (wrap, unwrap); metadata repository; records; catalog; arrangement; reference workbench (snap-in); query and reference tools; retrieve records; presentation; order fulfillment; Internet, intranet, web. Record types: text, image, photo, video, audio, geographical information system, database, compound records.]
Common Information Model
  eXtensible Markup Language (XML)
    Use tags to define semantic context for components of the data set
  Document Type Definition (DTD)
    Provides semi-structured representation for organizing tags that can be applied to groups of digital objects
  Development of standards for tags
    California Digital Library - Encoded Archival Description
    Digital sky, Protein Data Bank, Neuroscience brain images
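As a rough illustration of tags carrying semantic context, the sketch below wraps plain values in named elements and checks them against a small table of allowed children. The table is a stand-in for a DTD, not a real one, since the Python standard library does not validate DTDs. All element names (`record`, `title`, `creator`, `date`) and the sample values are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag set standing in for a DTD: element name -> allowed child tags.
ALLOWED_CHILDREN = {
    "record": {"title", "creator", "date"},
    "title": set(),
    "creator": set(),
    "date": set(),
}


def build_record(title, creator, date):
    """Wrap plain values in tags that state what each value means."""
    record = ET.Element("record")
    for tag, value in (("title", title), ("creator", creator), ("date", date)):
        ET.SubElement(record, tag).text = value
    return record


def conforms(element):
    """Return True if every element contains only children its entry allows."""
    allowed = ALLOWED_CHILDREN.get(element.tag)
    if allowed is None:
        return False
    return all(child.tag in allowed and conforms(child) for child in element)


record = build_record("Survey plate 12", "Example Observatory", "1999-04-01")
xml_text = ET.tostring(record, encoding="unicode")
```

Because the same tag table can be applied to every record in a group, it plays the organizing role the slide assigns to the DTD.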
Digital Object Representation
  Require non-proprietary markup language for formats that can be controlled by the archive
    HTML - text
    SVG - Scalable Vector Graphics markup language
  As standards evolve, choose next format markup language to be a superset of the previous language
  Convert to new standard on the fly as digital objects are accessed, or during a media migration
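Conversion on access can be sketched as a chain of one-step converters that is walked until the object reaches the current format. The format labels `v1` to `v3` and the two converter functions below are hypothetical placeholders. If each new format is a superset of the previous one, each step can be lossless.

```python
# Hypothetical format versions and converters; none of these names come from the slides.
CONVERTERS = {
    "v1": ("v2", lambda text: text.replace("<b>", "<strong>").replace("</b>", "</strong>")),
    "v2": ("v3", lambda text: "<doc>" + text + "</doc>"),
}
CURRENT_FORMAT = "v3"


def migrate(fmt, payload):
    """Apply one-step converters until the payload is in the current format."""
    while fmt != CURRENT_FORMAT:
        next_fmt, convert = CONVERTERS[fmt]
        payload = convert(payload)
        fmt = next_fmt
    return fmt, payload


def access(store, name):
    """Return an object in the current format, upgrading the stored copy as a side effect."""
    fmt, payload = store[name]
    new_fmt, new_payload = migrate(fmt, payload)
    store[name] = (new_fmt, new_payload)
    return new_payload


archive = {"memo": ("v1", "<b>hi</b>")}
result = access(archive, "memo")
```

The same `migrate` function could be called in bulk during a media migration instead of lazily on access.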
Hierarchy of Information Contexts
  Digital object context
    Meta-data to define the structure of the object
    When publishing a digital object, must also publish the context of the object
  Use collections to organize objects
    Meta-data to define the structure of the collection
    When publishing a collection, must also publish the information needed to organize the collection
  Use knowledge context to control presentation
    Rules to map information to presentation style
    Rules that govern the generation of the digital objects
Information Management
  XML representation of metadata attributes
    Standardization of DTDs - MOA II DTD for text
    Standardization of markup language
  XML based representation of collection structure
    Attributes defining the physical layout of a schema into relational tables (foreign keys, attribute data types, …)
  XML databases & XML organized data collections
    Commercial systems: Excelon, TAMINO, Oracle8i,…
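The idea of keeping the relational layout itself in XML, so that a collection can be re-instantiated on whatever database is available, might look roughly like this. The XML vocabulary (`collection`, `table`, `column`, and the `key` and `references` attributes) is invented for the sketch, and SQLite stands in for the target database.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML layout description; element and attribute names are invented.
SCHEMA_XML = """
<collection name="email">
  <table name="message">
    <column name="id" type="INTEGER" key="primary"/>
    <column name="subject" type="TEXT"/>
  </table>
  <table name="header">
    <column name="message_id" type="INTEGER" references="message(id)"/>
    <column name="field" type="TEXT"/>
    <column name="value" type="TEXT"/>
  </table>
</collection>
"""


def ddl_from_xml(schema_xml):
    """Turn the XML layout description into CREATE TABLE statements."""
    statements = []
    for table in ET.fromstring(schema_xml).iter("table"):
        columns = []
        for col in table.iter("column"):
            parts = [col.get("name"), col.get("type")]
            if col.get("key") == "primary":
                parts.append("PRIMARY KEY")
            if col.get("references"):
                parts.append("REFERENCES " + col.get("references"))
            columns.append(" ".join(parts))
        statements.append(
            "CREATE TABLE %s (%s)" % (table.get("name"), ", ".join(columns))
        )
    return statements


def instantiate(schema_xml):
    """Create the collection's tables in a fresh in-memory database."""
    conn = sqlite3.connect(":memory:")
    for statement in ddl_from_xml(schema_xml):
        conn.execute(statement)
    return conn


conn = instantiate(SCHEMA_XML)
```

Only `ddl_from_xml` would need to change to target a different database dialect; the archived XML description stays the same.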
Art Museum Image Consortium (AMICO)
  Information management
    Support for heterogeneous digital objects
    Automated conversion of meta-data to XML DTD
    Validation of meta-data
AMICO Meta-data Conversion to XML
E-mail Collection
  Demonstrate ability to ingest, archive, recreate, query, and present a digital object from a 1 million record E-mail collection (RFC1036)
    2.5 GB of data
    6 required fields
    13 optional fields
    User defined fields (over 1000)
  Determine information model needed for persistent archive
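A minimal sketch of the ingest step, assuming each message is tagged field by field: the six required headers are the ones RFC 1036 lists (From, Date, Newsgroups, Subject, Message-ID, Path), and every other header is kept but marked separately. The XML element names (`message`, `field`, `body`) and the sample message are invented for illustration and are not the DTD used in the demonstration.

```python
import email
import xml.etree.ElementTree as ET

# The six headers RFC 1036 marks as required.
REQUIRED = ("From", "Date", "Newsgroups", "Subject", "Message-ID", "Path")
REQUIRED_LOWER = {name.lower() for name in REQUIRED}


def message_to_xml(raw):
    """Parse one raw message and return it as a tagged XML element."""
    msg = email.message_from_string(raw)
    missing = [name for name in REQUIRED if msg[name] is None]
    if missing:
        raise ValueError("missing required headers: " + ", ".join(missing))
    root = ET.Element("message")
    for name, value in msg.items():
        # Header names are case-insensitive, so compare in lower case.
        kind = "required" if name.lower() in REQUIRED_LOWER else "other"
        field = ET.SubElement(root, "field", name=name, kind=kind)
        field.text = value
    ET.SubElement(root, "body").text = msg.get_payload()
    return root


SAMPLE = (
    "From: jerry@example.com\n"
    "Date: Fri, 19 Nov 1982 16:14:55 GMT\n"
    "Newsgroups: net.general\n"
    "Subject: Test\n"
    "Message-ID: <1@example.com>\n"
    "Path: example!jerry\n"
    "X-Project: demo\n"
    "\n"
    "Hello.\n"
)
root = message_to_xml(SAMPLE)
```

Keeping unknown headers rather than dropping them matters here because the collection has over 1000 user defined fields.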
XML DTD for E-mail
Data Management Hierarchy
  Persistent Archives
    Storage of information model, data model, along with data
  Data Grid
    Access to data in a different administration domain
  Digital Library - services
    Interlib - ADEPT, UC Berkeley Digital Library
  Data Collection
    Extensible Meta-data catalog - EMCAT
  Data handling
    SDSC Storage Resource Broker - SRB
  Archival Storage
    High performance storage system - HPSS
Storage Transparencies
  Location transparency
    Distribution of data collection across multiple physical resources
  Name transparency
    Attribute-based access to data
  Protocol transparency
    Common API for access to remote data resources
  Time transparency
    Minimization of data access latency
Digital Library Data Management
  Persistent identifiers
    Ability to move a data set without the name changing
  Data set replicas
    Management of multiple copies of a data set
  Archival backup of data sets
    Integration of disk data caches with archival storage
  Persistent archives
    Management of a collection through multiple cycles of technology evolution
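Persistent identifiers and replicas can both be pictured as one catalog that maps a fixed logical name to a changeable list of physical copies. The class below is a toy model under that assumption. `ReplicaCatalog`, its method names, and the resource names are hypothetical and are not the SRB or MCAT interface.

```python
class ReplicaCatalog:
    """Maps a persistent logical name to the physical copies of a data set."""

    def __init__(self):
        self._replicas = {}  # logical name -> list of (resource, physical path)

    def register(self, logical_name, resource, path):
        """Record one more physical copy under the logical name."""
        self._replicas.setdefault(logical_name, []).append((resource, path))

    def move(self, logical_name, old_resource, new_resource, new_path):
        """Relocate one copy; the logical name is untouched."""
        copies = self._replicas[logical_name]
        for i, (resource, _) in enumerate(copies):
            if resource == old_resource:
                copies[i] = (new_resource, new_path)
                return
        raise KeyError(old_resource)

    def locate(self, logical_name, preferred=()):
        """Return the first copy on a preferred resource, else the first copy."""
        copies = self._replicas[logical_name]
        for resource in preferred:
            for copy in copies:
                if copy[0] == resource:
                    return copy
        return copies[0]


catalog = ReplicaCatalog()
catalog.register("sky/plate12", "disk-a", "/d/p12")
catalog.register("sky/plate12", "tape-b", "/t/0042")
```

Clients hold only the logical name `sky/plate12`, so a copy can be moved to new media without any client noticing.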
SDSC Storage Resource Broker & Meta-data Catalog
  [Figure: SRB and MCAT architecture diagram. Labels: Application; Resource, User; File SID, DBLobj SID, Obj SID; third-party copy; remote proxies; SRB; MCAT; Dublin Core; DataCutter; application meta-data. Storage systems: ADSM, HPSS, DB2, Oracle, Unix.]
Collection Based Access
  Abstract data set naming and administration away from physical storage resource
    Data sets defined by attributes
    Logical collection used to group data sets across storage systems
    Enables support for replication of data
  Collection owned data
    Authentication controlled by data handling system
    Persistence controlled by data handling system
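Attribute-based access to a logical collection can be sketched as follows: members are found by what they are, not by where they are stored. The `DataSet` and `Collection` classes, the attribute names, and the resource names are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    logical_name: str
    resource: str  # storage system that holds the bytes
    attributes: dict = field(default_factory=dict)


class Collection:
    """Logical grouping of data sets, independent of where each one is stored."""

    def __init__(self, name):
        self.name = name
        self.members = []

    def add(self, data_set):
        self.members.append(data_set)

    def query(self, **wanted):
        """Return members whose attributes match every requested value."""
        return [
            d for d in self.members
            if all(d.attributes.get(k) == v for k, v in wanted.items())
        ]


survey = Collection("sky-survey")
survey.add(DataSet("plate12", "hpss", {"band": "J", "year": 1999}))
survey.add(DataSet("plate13", "unix-disk", {"band": "K", "year": 1999}))
hits = survey.query(year=1999)
```

The query for `year=1999` returns data sets from two different storage systems, which is the grouping across systems that the slide describes.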
SRB Containers - Managing Archive Latency
  Create container in a logical storage resource containing at least one “cacheable” resource
  Create objects in containers
  “Cache” daemon will move filled containers to archive
  Sync and purge APIs
  [Figure: SRB client, SRB server, distributed storage resources (UNIX, HPSS), container in HPSS, cached containers]
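The container mechanism can be modeled in a few lines: small objects accumulate in a cached container, a daemon pass archives filled containers as single units, and a later read stages the whole container back with one archive access. This is a toy model, not the SRB API. The class, method names, and byte-count capacity rule are assumptions.

```python
class ContainerStore:
    """Toy model of a disk cache in front of a slow archive."""

    def __init__(self, capacity):
        self.capacity = capacity  # bytes a container holds before it counts as filled
        self.cache = {}           # container name -> {object name: bytes}
        self.archive = {}         # container name -> {object name: bytes}
        self.archive_reads = 0    # number of slow archive accesses

    def _stage(self, container):
        """Bring an archived container back into the cache in one access."""
        self.cache[container] = dict(self.archive[container])
        self.archive_reads += 1

    def put(self, container, name, data):
        if container not in self.cache and container in self.archive:
            self._stage(container)
        self.cache.setdefault(container, {})[name] = data

    def is_filled(self, container):
        return sum(len(d) for d in self.cache[container].values()) >= self.capacity

    def sync(self, container):
        """Copy the cached container to the archive as a single unit."""
        self.archive[container] = dict(self.cache[container])

    def purge(self, container):
        """Drop the cached copy; refuse if the archive copy is missing or stale."""
        if self.archive.get(container) != self.cache[container]:
            raise RuntimeError("container not synced: " + container)
        del self.cache[container]

    def sweep(self):
        """One pass of the cache daemon: archive and purge every filled container."""
        moved = [c for c in self.cache if self.is_filled(c)]
        for container in moved:
            self.sync(container)
            self.purge(container)
        return moved

    def get(self, container, name):
        if container not in self.cache:
            self._stage(container)
        return self.cache[container][name]


store = ContainerStore(capacity=8)
store.put("c1", "a", b"1234")
store.put("c1", "b", b"5678")
moved = store.sweep()
```

Reading two objects from the purged container costs one archive access rather than two, which is where the latency saving comes from.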
Knowledge Management
  Knowledge-based mediation
    Conceptual-level integration
    Predictive learning models
  Rule-based ontology maps
    Map source XML to Concept Map (ontologies, views)
  Rule-based presentation and analysis
    Rules governing accessioning of data sets
    Rules governing integrity constraints
    Style sheets for presentation
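A rule-based map from source XML to concepts can be sketched, in its simplest form, as a lookup table from source tags to concept names. The rule table, the concept names, and the sample XML below are invented. Real mediation rules would be far richer than a one-to-one lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: source tag -> concept in the current view of the world.
CONCEPT_RULES = {
    "author": "agent/creator",
    "sender": "agent/creator",
    "written": "event/creation-date",
}


def to_concept_map(source_xml):
    """Group the values of a source record under the concepts its tags map to."""
    concepts = {}
    for element in ET.fromstring(source_xml):
        concept = CONCEPT_RULES.get(element.tag)
        if concept is None:
            # Keep values with no rule so nothing is silently lost.
            concepts.setdefault("unmapped", []).append((element.tag, element.text))
        else:
            concepts.setdefault(concept, []).append(element.text)
    return concepts


SOURCE = "<letter><author>A. Smith</author><written>1923</written><seal>wax</seal></letter>"
concept_map = to_concept_map(SOURCE)
```

Two source vocabularies that use `author` and `sender` for the same idea end up under one concept, which is what allows integration at the conceptual level.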
AMICO Presentation Interface
Formatted Message Using XML DTD
Knowledge Representation
  [Figure: model-based mediation diagram. Labels: PROTLOC result (XML/XSLT); ANATOM result (VML); surface atlas, Van Essen Lab; CCB, Montana SU; stereotaxic atlas, LONI; NCMIR, UCSD; MCell, CNL, Salk.]
ANATOM
ANATOM
PROTLOC
Applications
  Support for distributed data collections
  Federation of data collections to form digital library
  Integration of digital libraries with archives
  Finding aids for federation of digital libraries through mediation of information
  Data grids for data access
  Persistent archives
Communities Providing Technology
  Archival storage - HPSS, ADSM, SANs
  Data handling - Storage Resource Broker
  Databases - XML, Object relational
  Digital libraries - services, information discovery
  Data grids - collection federation, finding aids
  Computational grids - remote execution
  Library - catalogs, DTDs, finding aids
  Archivist - archival procedures
Further Information http://www.npaci.edu/DICE