Collection Based Persistent Archives


Collection Based Persistent Archives
Reagan Moore, Chaitan Baru, Amarnath Gupta, George Kremenek, Bertram Ludaescher, Richard Marciano, Arcot Rajasekar, Wayne Schroeder, Michael Wan, Ilya Zaslavsky, Bing Zhu
(http://www.npaci.edu/DICE/)

Topics
  Components of a persistent archive
  Information management example
  Data management example
  Knowledge management example

Fundamental Concept for a Persistent Archive
  Persistence requires migration over time onto new technology.
  While the migration occurs, a persistent archive must be able to interoperate with both the old technology and the new technology.
  A persistent archive is an interoperability system.

What Types of Interoperability are Needed?
  Data management (data sets)
    Ability to work with multiple types of storage systems, across separate administration domains
  Information management (schema)
    Ability to define a collection independent of database choice
    Ability to migrate collection onto new databases
  Knowledge management (ontology)
    Ability to map old concepts to current view of the world
    Ability to present and manipulate information associated with data sets

Implicit Concepts
  Infrastructure independence
    Information models
    Data set access
    Authentication
    Collection management
    Presentation
  Non-proprietary formatting
    Information models
    XML - Information markup language
    GML - Graphics markup language
  Functional separation of archival systems
    Accessioning workbench, archive, access workbench

Implicit Goals
  Maintain digital objects and the information retrieval catalog description in the archive
  Provide ability to instantiate collection as needed on new technology
  Instantiate archived collection only when needed
    Implies collection can sit in the archive forever, and can still be accessed at an arbitrary point in the future

Electronic Records Archive (ERA)
  [Figure: ERA architecture diagram. Stages: Transfer, Accession, Archives, Reference. Components: media handlers, accessioning workbench (snap-in), reference workbench (snap-in), records catalog, metadata repository, records. Record types: text, image, photo, video, audio, geographical information system, compound records, web, database. Media and transport: tape, CD, disk, FTP, Internet, intranet. Functions: wrap and unwrap records with a metadata wrapper, arrangement, retrieve records, query and reference tools, presentation, order fulfillment.]

Common Information Model
  eXtensible Markup Language (XML)
    Use tags to define semantic context for components of the data set
  Document Type Definition (DTD)
    Provides semi-structured representation for organizing tags that can be applied to groups of digital objects
  Development of standards for tags
    California Digital Library - Encoded Archival Description
    Digital sky, Protein Data Bank, Neuroscience brain images
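A minimal sketch of the tagging idea, using Python's standard `xml.etree.ElementTree`. The DTD, the tag names, and the sample values are hypothetical and are not taken from the slides. The standard library does not validate against a DTD, so the check below only compares the child tags with the sequence the DTD declares.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD for a small collection record (illustration only).
# It is shown for reference; ElementTree does not validate against it.
RECORD_DTD = """<!ELEMENT record (title, creator, date)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT creator (#PCDATA)>
<!ELEMENT date (#PCDATA)>"""

def tag_record(fields):
    """Wrap each metadata value in a tag that names its semantic role."""
    record = ET.Element("record")
    for name, value in fields.items():
        ET.SubElement(record, name).text = value
    return record

def follows_sequence(record, expected):
    """Compare the child tags with the sequence declared for <record>."""
    return [child.tag for child in record] == list(expected)

record = tag_record({"title": "Example title", "creator": "Example creator", "date": "2000"})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
print(follows_sequence(record, ["title", "creator", "date"]))
```

A validating parser would be needed to enforce the DTD itself; the hand-written check only stands in for that step.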

Digital Object Representation
  Require non-proprietary markup language for formats that can be controlled by the archive
    HTML - text
    SVG - Scalable Vector Graphics markup language
  As standards evolve, choose next format markup language to be a superset of the previous language
  Convert to new standard on the fly as digital objects are accessed, or during a media migration
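Conversion on access can be sketched as a chain of converters, each one lifting an object from one format version to the next. The format names and the converter functions here are made up for illustration; a real archive would register converters for its own formats.

```python
# Hypothetical format versions: each entry maps a format to
# (next format, function converting content to that next format).
CONVERTERS = {
    "markup-v1": ("markup-v2",
                  lambda text: text.replace("<b>", "<strong>").replace("</b>", "</strong>")),
    "markup-v2": ("markup-v3",
                  lambda text: "<doc>" + text + "</doc>"),
}
CURRENT_FORMAT = "markup-v3"

def access(obj):
    """Return the object in the current format, converting step by step if needed."""
    fmt, content = obj["format"], obj["content"]
    while fmt != CURRENT_FORMAT:
        fmt, convert = CONVERTERS[fmt]
        content = convert(content)
    return {"format": fmt, "content": content}

stored = {"format": "markup-v1", "content": "<b>hi</b>"}
print(access(stored))
```

The stored object is left untouched, so the same chain could instead be run once during a media migration and the result written back.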

Hierarchy of Information Contexts
  Digital object context
    Meta-data to define the structure of the object
    When publishing a digital object, must also publish the context of the object
  Use collections to organize objects
    Meta-data to define the structure of the collection
    When publishing a collection, must also publish the information needed to organize the collection
  Use knowledge context to control presentation
    Rules to map information to presentation style
    Rules that govern the generation of the digital objects

Information Management
  XML representation of metadata attributes
    Standardization of DTDs - MOA II DTD for text
    Standardization of markup language
  XML based representation of collection structure
    Attributes defining the physical layout of a schema into relational tables (foreign keys, attribute data types, …)
  XML databases & XML organized data collections
    Commercial systems: Excelon, TAMINO, Oracle8i,…
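One way to picture an XML description of collection structure is a small document that lists tables, attribute types, and foreign keys, from which a collection can be re-created on whatever database is current. The XML vocabulary below (`collection`, `table`, `attribute`, `key`, `references`) is an assumption made for this sketch, and SQLite stands in for the target database.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML description of a two-table collection.
SCHEMA_XML = """
<collection name="artworks">
  <table name="artist">
    <attribute name="artist_id" type="INTEGER" key="primary"/>
    <attribute name="name" type="TEXT"/>
  </table>
  <table name="artwork">
    <attribute name="artwork_id" type="INTEGER" key="primary"/>
    <attribute name="title" type="TEXT"/>
    <attribute name="artist_id" type="INTEGER" references="artist(artist_id)"/>
  </table>
</collection>
"""

def create_statements(schema_xml):
    """Turn the XML description into CREATE TABLE statements (trusted input only)."""
    root = ET.fromstring(schema_xml)
    statements = []
    for table in root.findall("table"):
        columns = []
        for attr in table.findall("attribute"):
            column = attr.get("name") + " " + attr.get("type")
            if attr.get("key") == "primary":
                column += " PRIMARY KEY"
            if attr.get("references"):
                column += " REFERENCES " + attr.get("references")
            columns.append(column)
        statements.append("CREATE TABLE " + table.get("name") + " (" + ", ".join(columns) + ")")
    return statements

# Instantiate the collection on a fresh in-memory database.
conn = sqlite3.connect(":memory:")
for statement in create_statements(SCHEMA_XML):
    conn.execute(statement)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

Because the layout lives in the XML rather than in the database, the same description could be replayed against a different database later.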

Art Museum Image Consortium
  Information management
    Support for heterogeneous digital objects
    Automated conversion of meta-data to XML DTD
    Validation of meta-data

AMICO Meta-data Conversion to XML

E-mail Collection
  Demonstrate ability to ingest, archive, recreate, query, and present a digital object from a 1 million record E-mail collection (RFC1036)
    2.5 GB of data
    6 required fields
    13 optional fields
    User defined fields (over 1000)
  Determine information model needed for persistent archive

XML DTD for E-mail
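A rough sketch of how one RFC 1036 message might be ingested into an XML record. The header lists follow the required and optional headers of RFC 1036 as I read it; the XML element names (`message`, `headers`, `field`, `body`) and the sample message are assumptions, not the DTD shown on the slide.

```python
from email import message_from_string
import xml.etree.ElementTree as ET

# Header names compared in lower case.
REQUIRED = {"from", "date", "newsgroups", "subject", "message-id", "path"}
OPTIONAL = {"followup-to", "expires", "reply-to", "sender", "references",
            "control", "distribution", "keywords", "summary", "approved",
            "lines", "xref", "organization"}

# Made-up sample message for illustration.
RAW = """\
From: jdoe@example.org (Jane Doe)
Path: example!news
Newsgroups: misc.test
Subject: Archive test
Message-ID: <1234@example.org>
Date: Fri, 19 Nov 1982 16:14:55 GMT
Organization: Example Org
X-Project: demo

Body text of the message.
"""

def classify(name):
    """Label a header as required, optional, or user-defined."""
    key = name.lower()
    if key in REQUIRED:
        return "required"
    if key in OPTIONAL:
        return "optional"
    return "user-defined"

def message_to_xml(raw):
    """Parse one message and wrap its headers and body in XML elements."""
    msg = message_from_string(raw)
    missing = sorted(REQUIRED - {name.lower() for name in msg.keys()})
    if missing:
        raise ValueError("missing required headers: " + ", ".join(missing))
    root = ET.Element("message")
    headers = ET.SubElement(root, "headers")
    for name, value in msg.items():
        field = ET.SubElement(headers, "field", name=name, kind=classify(name))
        field.text = value
    ET.SubElement(root, "body").text = msg.get_payload()
    return root

root = message_to_xml(RAW)
print(ET.tostring(root, encoding="unicode"))
```

Keeping user-defined headers as generic `field` elements avoids needing a separate tag for each of the thousand-plus field names the collection contains.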

Data Management Hierarchy
  Persistent Archives
    Storage of information model, data model, along with data
  Data Grid
    Access to data in a different administration domain
  Digital Library - services
    Interlib - ADEPT, UC Berkeley Digital Library
  Data Collection
    Extensible Meta-data catalog - EMCAT
  Data handling
    SDSC Storage Resource Broker - SRB
  Archival Storage
    High performance storage system - HPSS

Storage Transparencies
  Location transparency
    Distribution of data collection across multiple physical resources
  Name transparency
    Attribute-based access to data
  Protocol transparency
    Common API for access to remote data resources
  Time transparency
    Minimization of data access latency
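The first three transparencies can be sketched with a small catalog that sits in front of several storage systems. Everything here is hypothetical: `MemoryStore` stands in for a real storage driver, and `Catalog` is not the SRB or MCAT API. Callers ask for data by attribute and never see which resource or path holds it.

```python
class MemoryStore:
    """Stand-in for one storage system; every store offers the same put/get calls."""
    def __init__(self):
        self._blobs = {}

    def put(self, path, data):
        self._blobs[path] = data

    def get(self, path):
        return self._blobs[path]

class Catalog:
    """Maps descriptive attributes to a (resource, physical path) pair."""
    def __init__(self, resources):
        self.resources = resources   # resource name -> store object
        self.entries = []            # list of (attributes, resource name, path)

    def register(self, attributes, resource, path, data):
        self.resources[resource].put(path, data)
        self.entries.append((attributes, resource, path))

    def query(self, **wanted):
        """Return the data of every data set whose attributes match all of `wanted`."""
        results = []
        for attributes, resource, path in self.entries:
            if all(attributes.get(key) == value for key, value in wanted.items()):
                results.append(self.resources[resource].get(path))
        return results

catalog = Catalog({"disk": MemoryStore(), "tape": MemoryStore()})
catalog.register({"project": "sky", "band": "J"}, "disk", "/d1/a.fits", b"image-a")
catalog.register({"project": "sky", "band": "K"}, "tape", "/t7/b.fits", b"image-b")
print(catalog.query(project="sky"))
print(catalog.query(band="K"))
```

Time transparency is not modeled here; it would need caching or staging between the stores.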

Digital Library Data Management
  Persistent identifiers
    Ability to move a data set without the name changing
  Data set replicas
    Management of multiple copies of a data set
  Archival backup of data sets
    Integration of disk data caches with archival storage
  Persistent archives
    Management of a collection through multiple cycles of technology evolution
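Persistent identifiers and replicas can be pictured as a table from a logical name to the physical copies that currently exist. The class, the resource names, and the paths below are illustrative assumptions. Moving a copy updates the table, while the logical name that users hold stays the same.

```python
class ReplicaCatalog:
    """Maps a persistent logical name to its current physical replicas."""
    def __init__(self):
        self._replicas = {}   # logical name -> list of (resource, physical path)

    def add_replica(self, logical_name, resource, path):
        self._replicas.setdefault(logical_name, []).append((resource, path))

    def move(self, logical_name, old, new):
        """Record that one replica moved; the logical name is unchanged."""
        replicas = self._replicas[logical_name]
        replicas[replicas.index(old)] = new

    def locate(self, logical_name, preference=("disk", "archive")):
        """Pick the replica on the most preferred resource type."""
        def rank(replica):
            resource = replica[0]
            return preference.index(resource) if resource in preference else len(preference)
        return min(self._replicas[logical_name], key=rank)

catalog = ReplicaCatalog()
catalog.add_replica("sky/img001", "archive", "/hpss/a/img001")
catalog.add_replica("sky/img001", "disk", "/cache/img001")
before = catalog.locate("sky/img001")
catalog.move("sky/img001", ("disk", "/cache/img001"), ("disk", "/cache2/img001"))
after = catalog.locate("sky/img001")
print(before, after)
```

Preferring the disk copy over the archive copy is one simple policy for the disk cache and archival storage integration listed above.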

SDSC Storage Resource Broker & Meta-data Catalog
  [Figure: SRB architecture diagram. Labels: Application, SRB, MCAT, User, Remote Proxies, Resource, Third-party copy, File SID, DBLobj SID, Obj SID. Storage systems: ADSM, HPSS, DB2, Oracle, Unix. Also shown: Dublin Core, DataCutter, Application Meta-data.]

Collection Based Access
  Abstract data set naming and administration away from physical storage resource
    Data sets defined by attributes
    Logical collection used to group data sets across storage systems
    Enables support for replication of data
  Collection owned data
    Authentication controlled by data handling system
    Persistence controlled by data handling system

SRB Containers - Managing Archive Latency
  Create container in a logical storage resource containing at least one “cacheable” resource
  Create objects in containers
  “Cache” daemon will move filled containers to archive
  Synch and purge APIs
  [Figure labels: SRB client, SRB server, UNIX, HPSS, container, cached containers, distributed storage resources]
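The container idea can be sketched as follows: small objects are appended to a container on a cache resource, filled containers are copied to the archive and purged from the cache, and a later read stages the whole container back so that one archive access serves many objects. The class, its method names, and the capacity rule are assumptions for illustration and do not reproduce the SRB API.

```python
class ContainerStore:
    def __init__(self, capacity):
        self.capacity = capacity        # bytes per container (hypothetical limit)
        self.cache = {0: bytearray()}   # container id -> bytes on the cache resource
        self.archive = {}               # container id -> bytes on the archive resource
        self.index = {}                 # object name -> (container id, offset, length)
        self.current = 0                # container that is still being filled

    def create_object(self, name, data):
        """Append an object to the current container, starting a new one when full."""
        container = self.cache[self.current]
        if container and len(container) + len(data) > self.capacity:
            self.current += 1
            container = self.cache[self.current] = bytearray()
        self.index[name] = (self.current, len(container), len(data))
        container.extend(data)

    def sync_and_purge(self):
        """Copy every filled container to the archive, then drop it from the cache."""
        for cid in [c for c in self.cache if c != self.current]:
            self.archive[cid] = bytes(self.cache.pop(cid))

    def read(self, name):
        """Read an object, staging its whole container into the cache if needed."""
        cid, offset, length = self.index[name]
        if cid not in self.cache:
            self.cache[cid] = bytearray(self.archive[cid])
        return bytes(self.cache[cid][offset:offset + length])

store = ContainerStore(capacity=10)
for name, data in [("a", b"aaaa"), ("b", b"bbbb"), ("c", b"cccc")]:
    store.create_object(name, data)
store.sync_and_purge()
purged = 0 not in store.cache
print(purged, store.read("b"), 0 in store.cache)
```

In this sketch an object larger than the capacity simply gets a container of its own.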

Knowledge Management
  Knowledge-based mediation
    Conceptual-level integration
    Predictive learning models
  Rule-based ontology maps
    Map source XML to Concept Map (ontologies, views)
  Rule-based presentation and analysis
    Rules governing accessioning of data sets
    Rules governing integrity constraints
    Style sheets for presentation
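A rule-based ontology map can be sketched, in its simplest form, as a table from source tags to concepts in the current view. The rules, tag names, and sample record below are invented for illustration; a real mediator would use much richer rules than a one-to-one tag lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: source tag -> concept name in the current view.
RULES = {"author": "creator", "painter": "creator", "yr": "date", "date": "date"}

def map_to_concepts(source_xml, rules):
    """Rewrite source elements as concept elements and report tags with no rule."""
    source = ET.fromstring(source_xml)
    mapped = ET.Element("concepts")
    unmapped = []
    for child in source:
        concept = rules.get(child.tag)
        if concept is None:
            unmapped.append(child.tag)
            continue
        # Keep the original tag as an attribute so the mapping can be audited.
        ET.SubElement(mapped, concept, source=child.tag).text = child.text
    return mapped, unmapped

SOURCE = "<record><painter>A. Artist</painter><yr>1999</yr><shelf>12</shelf></record>"
mapped, unmapped = map_to_concepts(SOURCE, RULES)
print(ET.tostring(mapped, encoding="unicode"))
print(unmapped)
```

Reporting unmapped tags instead of dropping them silently makes it visible when the rule set no longer covers an older collection.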

AMICO Presentation Interface

Formatted Message Using XML DTD

Knowledge Representation
  [Figure: model-based mediation. Outputs: PROTLOC result (XML/XSLT), ANATOM result (VML). Source labels: surface atlas, Van Essen Lab; CCB, Montana SU; stereotaxic atlas, LONI; NCMIR, UCSD; MCell, CNL, Salk.]

ANATOM

ANATOM

PROTLOC

Applications
  Support for distributed data collections
  Federation of data collections to form digital library
  Integration of digital libraries with archives
  Finding aids for federation of digital libraries through mediation of information
  Data grids for data access
  Persistent archives

Communities Providing Technology
  Archival storage - HPSS, ADSM, SANs
  Data handling - Storage Resource Broker
  Databases - XML, Object relational
  Digital libraries - services, information discovery
  Data grids - collection federation, finding aids
  Computational grids - remote execution
  Library - catalogs, DTDs, finding aids
  Archivist - archival procedures

Further Information
  http://www.npaci.edu/DICE