National Partnership for Advanced Computational Infrastructure Collection-based Persistent Archives Reagan W. Moore Associate Director, Data Intensive.

Presentation transcript:

National Partnership for Advanced Computational Infrastructure
Collection-based Persistent Archives
Reagan W. Moore
Associate Director, Data Intensive Computing
San Diego Supercomputer Center

Topics
- Lessons learned from building a prototype persistent archive
  - Information model
  - Hierarchical levels of information
  - Interoperability mechanisms
- Application to workshop topics
  - Ingestion methodology
  - Data set identification
  - Certification of archives

Persistent Archive Goals
- Provide a collection-based archive
  - Data set relevance is organized by the collection
- Provide an information model to describe the context for the data collection
  - Enough information is needed to dynamically create the collection from archived information
- Decouple collection creation from digital object archiving
  - Provide an accessioning system to turn data sets into digital objects
  - Accessioning is independent of the final collection

NARA Persistent Archive Prototype
- Demonstrate the ability to ingest, archive, recreate, query, and present a digital object from a 1 million record collection (RFC 1036)
  - 2.5 GB of data
  - 6 required fields
  - 13 optional fields
  - User-defined fields (over 1000)
- Determine the information model needed for a persistent archive

Key Concepts Learned
- Information model
  - Semi-structured representation of information - XML
  - Infrastructure-independent representation of information context - XML DTD
- Differentiation between the information contexts for digital objects, collection, and presentation
  - DTD for objects
  - DTD for collection
  - XSL style sheets for presentation
- Instantiation software for creating the collection from the information model
  - XML databases now appearing

Hierarchy of Information Contexts
- Digital object context
  - Meta-data to define the structure of the object
  - When publishing a digital object, must also publish the context of the object
- Use collections to organize objects
  - Meta-data to define the structure of the collection
  - When publishing a collection, must also publish the information needed to organize the collection
- Use presentation context to control access
  - Meta-data to define the structure of the presentation

XML DTD for

Formatted Message Using XML DTD
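
The slide images for the DTD and the formatted message did not survive in this transcript. As a rough illustration only, the sketch below shows one way an RFC 1036 (Usenet) message could be turned into a tagged digital object. The element names (`message`, `field`, `body`) and the function `message_to_xml` are hypothetical and are not the prototype's actual DTD; the six required header names are those defined by RFC 1036.

```python
import xml.etree.ElementTree as ET
from email import message_from_string

# The six headers RFC 1036 marks as required.
REQUIRED = ["From", "Date", "Newsgroups", "Subject", "Message-ID", "Path"]


def message_to_xml(raw):
    """Wrap one raw message in attribute tags (hypothetical element names)."""
    msg = message_from_string(raw)
    missing = [name for name in REQUIRED if msg[name] is None]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))

    root = ET.Element("message")
    # Every header, required or optional, becomes a tagged field.
    for name, value in msg.items():
        field = ET.SubElement(root, "field", name=name)
        field.text = value
    body = ET.SubElement(root, "body")
    body.text = msg.get_payload()
    return ET.tostring(root, encoding="unicode")
```

Keeping the header name in an attribute rather than in the element name sidesteps XML naming rules for user-defined fields, which is a design choice of this sketch rather than something the slides state.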

Key Concepts Learned
- Digital object encapsulation
  - Minimize the number of times a digital object must be touched
  - Once archived, a digital object should only be retrieved when requested by a user
  - Implies meta-data stored with digital objects should only describe the objects
  - Collection and presentation meta-data should be stored separately

Persistent Archive Requirements
- Distributed environment to ensure separable components
  - Accession workbench
  - Archive
  - Presentation platform
- Data handling mechanisms for interoperability as the basis for system evolution
  - No tightly coupled systems
  - Unique names are only used by the data handling system
- Use of containers to aggregate digital objects for storage
  - Implies a hierarchical naming scheme
  - Collection / container / digital object

[Diagram: Electronic Records Archive (ERA). Labels: TAPE, DISK, CD, FTP; Media Handlers; METADATA REPOSITORY; RECORDS REPOSITORY; Accessioning Work Bench (snapin); Text, Image, Photo, Video, Audio, Geographical Information System, Compound Records, WEB, Database; Metadata wrapper record; Reference Workbench (snapin); Arrangement; ARC Catalog; Order Fulfillment; Retrieve Records; WRAPPER; ACCESSION, ARCHIVES, REFERENCE, TRANSFER; UNWRAPPER; Query & Reference Tools; Internet; Intranet; Presentation]

Federation of Data Collections into Digital Libraries DPOSS Sky Survey 2MASS Sky Survey NASA Catalog NS Dig Lib Wash. Brain Image UCLA Brain Image MSU Brain Image UCSD Neuroscience CEED / ESA REINAS U Md Archive ADL Elib - Flora ESS Dig Lib Protein Data Bank Wash U Genome U H Mol Trajectory MS Dig Lib UC Calif Finding Aids UMDL Social Science AMICO Image Library NARA Persistent Archive U Wisc. Video Lib. Pacific Rim DL

Conclusions
- Ingestion
  - Infrastructure-independent representation for digital objects
  - Infrastructure-independent representation for the information model
  - Turn data sets into digital objects by adding attribute tags
  - Aggregate digital objects in containers for storage

Conclusions
- Data set identification
  - Unique names only required by the data handling system
  - Attribute-based access through the collection
- Hierarchical naming
  - Collection / Container / Digital object
  - Finding aid for collection / Data handling system ID for container / Unique ID for object
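
The three-level naming scheme (collection / container / digital object) can be sketched as a small value type. This is an illustration under assumptions: the class `ObjectName`, its field names, and the `/` separator are hypothetical and do not reproduce any identifier format used by the prototype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectName:
    """Hypothetical hierarchical name: collection / container / object."""
    collection: str  # located through a finding aid
    container: str   # ID assigned by the data handling system
    object_id: str   # unique ID of the digital object

    def path(self):
        return "/".join((self.collection, self.container, self.object_id))

    @classmethod
    def parse(cls, path):
        parts = path.split("/")
        if len(parts) != 3 or not all(parts):
            raise ValueError("expected collection/container/object: " + path)
        return cls(*parts)
```

Because only the data handling system resolves the full name, users in this model would normally reach an object through collection attributes rather than by typing such a path.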

Conclusion
- Certification of persistent archive
  - Demonstrate that the archive can provide an infrastructure-independent representation for
    - Finding aids for locating collections
    - Information model for building the collection
    - Data handling system container IDs for storage access
    - Digital object attribute tags
  - Demonstrate that the archive can use information models to create finding aids, collections, and access interfaces on new technology
  - Demonstrate that the archive can independently migrate any component of the architecture

Further Information

NARA Persistent Archive

Context Based Objects
- For data to be useful, the context must be defined
  - Data format - binary/integer representation
  - Physical meaning - units
  - Structure - geometry
  - Relevance - feature annotation
  - Semantics - data dictionary for attributes
- Context is preserved as meta-data attributes

Information Models for Organization of Data
- Digital object attributes
- Collection attributes
- Presentation attributes

Information Models for Access to Data
- Presentation of data from multiple digital libraries
- Collections from federated databases
- Digital object attributes

Common Information Model
- eXtensible Markup Language (XML)
  - Use tags to define semantic context for components of the data set
- Document Type Definition (DTD)
  - Provides semi-structured representation for organizing tags that can be applied to groups of digital objects
- Development of standards for tags
  - Digital sky, Protein Data Bank, Neuroscience brain images
  - California Digital Library - Art Museum Image Consortium

Information Management Hierarchy
- Presentation / Information Discovery / Analysis
  - Visualization - Shastra, 3D visualization tools
  - Presentation information model - XSL style sheet
- Collection organization
  - Meta-data catalog - MCAT
  - Collection information model - XML DTD
- Data handling
  - Storage Resource Broker - SRB
- Storage
  - Archival storage system - HPSS
  - Digital object model - XML DTD

Open Grid Architecture to Encourage Interoperability Data Handling Systems Storage Resources Remote Procedure Execution Data Model Management Application Storage System Description Information Discovery Dynamic Info Discovery

Technology Sources
- Archive Community
  - IEEE Mass Storage Systems Technical Committee
  - Scalable storage systems
- Digital Library Community
  - NSF Digital Library Initiative, Phase II
  - Information management mediation - XML
- Supercomputer Community
  - Scalable analysis platforms
- Grid Forum
  - Data handling systems for interoperability
- Archivist Community / Library Community
  - Management policies and standards

Technology Sources Data Handling Systems Storage Resources Remote Procedure Execution Data Model Management Application Storage System Description Information Discovery Dynamic Info Discovery Digital Library Computational Grid

Information Management Architecture
- Digital library community technologies
  - Distributed information resources
  - Digital library interoperability protocols - SDLIP
  - Mediation of information using XML - MIX
- Grid Forum technologies
  - Support for distributed services / procedures
  - Inter-realm authentication - GSI (Grid Security Infrastructure)
- Data handling system
  - Storage Resource Broker, Meta-data Catalog

Grid Forum Data Access Architecture Data Handling Systems Storage Resources API that provides “glue” to underlying storage, QoS, etc. [GASS, IBP, SRB] Remote Procedure Execution DPSS, DFS, NFS HPSS, ADSM, DMF, Unitree, NASstore, DB2, Oracle, Informix, Sybase, O2, ObjectStore, Objectivity API that provides “glue” to underlying data handling systems (security, scheduling, QoS, access protocol, data format/model, adaptivity, info discovery, location control) Data Model Management Application Storage System Description Information Discovery Armada D’agents, FEL, ADR GRAM, SRB, Java, CORBA + authentication + authorization Dynamic Info Discovery GloPerf, Netlogger, NWS Condor, GASS, NILE, SRB, I-2 caching, ADR DTD, ADR, object class LDAP, Database, Flat file, Object database

Data Handling System Capabilities
SDSC Storage Resource Broker
- Protocol transparency
  - Common API for access to remote data resources
  - Explicit drivers for each type of storage system
- Name transparency
  - Attribute-based access to data
- Location transparency
  - Distribution of collection across multiple physical resources
- Time transparency
  - Minimization of latency for data access
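
Name and location transparency can be illustrated with a toy catalog that maps attributes to logical names and logical names to physical replicas. This is a minimal sketch, not the SRB or MCAT API: the class `Catalog`, its methods, and the sample resource names are all hypothetical.

```python
class Catalog:
    """Toy meta-data catalog (hypothetical; not the SRB/MCAT interface)."""

    def __init__(self):
        self._attrs = {}     # logical name -> attribute dict
        self._replicas = {}  # logical name -> [(resource, physical path)]

    def register(self, name, attrs, resource, physical_path):
        # The same logical name may be registered on several resources.
        self._attrs.setdefault(name, {}).update(attrs)
        self._replicas.setdefault(name, []).append((resource, physical_path))

    def query(self, **conditions):
        """Name transparency: find objects by attribute values."""
        return sorted(
            name
            for name, attrs in self._attrs.items()
            if all(attrs.get(k) == v for k, v in conditions.items())
        )

    def locate(self, name):
        """Location transparency: the caller never supplies a physical path."""
        return list(self._replicas[name])
```

A client in this model would call `query(newsgroup="sci.test")` and then `locate(...)` on a result, leaving the choice of storage system to the data handling layer.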

SDSC Storage Resource Broker & Meta-data Catalog SRB ADSM HPSS DB2 Oracle Unix Application File SIDDBLobj SIDObj SID MCAT Dublin Core Resource User Application Meta-data

Time Transparency
- How to minimize latency
  - Prefetch data to local high performance disk, so that all accesses can be done at high speed from local resources
- How to maximize access rate
  - Composite or aggregate data into a single data set to avoid multiple accesses
  - Stream data at high rates using parallel I/O, amortizing the access latency over the volume of data that is delivered
- How to avoid congestion
  - Replicate data across multiple servers
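
The amortization argument can be made concrete with a simple model in which one access costs a fixed latency plus the transfer time. The function name and all numbers below are hypothetical, chosen only to show the effect; they are not measurements from the slides.

```python
def effective_rate(size_mb, latency_s, bandwidth_mb_s):
    """Delivered MB/s for one access: size / (latency + size / bandwidth)."""
    return size_mb / (latency_s + size_mb / bandwidth_mb_s)


# Hypothetical figures: 10 s archive latency, 10 MB/s streaming rate.
small = effective_rate(1, 10, 10)     # one 1 MB object per access
large = effective_rate(1000, 10, 10)  # 1000 MB aggregated into one access
```

Under these assumed figures the aggregated access delivers data at close to the streaming rate, while the single small access is dominated by latency.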

SRB Containers - Managing Archive Latency
- Create container in a logical storage resource containing at least one “cacheable” resource
- Create objects in containers
- “Cache” daemon will move filled containers to archive
- synch and purge APIs
[Diagram labels: SRB client, UNIX, Distributed Storage Resources, SRB Server, HPSS, container, cached containers]
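
The idea of aggregating many small objects into containers before they reach the archive can be sketched as follows. This is an assumption-laden illustration, not SRB code: `Container`, `pack`, the byte-offset index, and the fixed capacity are all hypothetical, and the cache daemon that moves filled containers to the archive is not modeled.

```python
import io


class Container:
    """Hypothetical container: objects stored back to back with an index."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._buf = io.BytesIO()
        self.index = {}  # object id -> (offset, length)

    def add(self, object_id, data):
        offset = self._buf.tell()
        if offset + len(data) > self.capacity:
            return False  # container is full; caller starts a new one
        self._buf.write(data)
        self.index[object_id] = (offset, len(data))
        return True

    def read(self, object_id):
        offset, length = self.index[object_id]
        return self._buf.getvalue()[offset:offset + length]


def pack(objects, capacity):
    """Fill containers in order; each filled one could then be archived."""
    containers = [Container(capacity)]
    for object_id, data in objects:
        if not containers[-1].add(object_id, data):
            fresh = Container(capacity)
            if not fresh.add(object_id, data):
                raise ValueError("object larger than container capacity")
            containers.append(fresh)
    return containers
```

With this layout a single archive access retrieves a whole container, after which individual objects are read from the cached copy by offset.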

Generality of Information Infrastructure
- Same information model needed to manage
  - Federation in space
    - Metacomputing environment
    - Interoperable services for digital libraries
  - Migration over time
    - Collection creation and update
    - Persistent archive
- Same storage systems needed to support
  - Supercomputer center data
  - Discipline-specific data collections
  - Digital library collections

Art Museum Image Consortium
- Demonstrated
  - Support for heterogeneous digital objects
  - Automated conversion of meta-data to XML DTD
  - Validation of meta-data
  - XSL style sheet for presenting information

AMICO Meta-data Conversion to XML

AMICO Presentation Interface

Facilitate the conduct of science through development of knowledge resources
- Publish - Data collection infrastructure
- Info discovery - Digital Library infrastructure
- Data access - Data handling infrastructure
- Apply to federal, state, and university projects
  - NSF / DOE / NASA / USPTO / NARA / Census Bureau
  - California Digital Library
  - UCSD - Pacific Rim Digital Library Alliance

Publishing Scientific Data
[Diagram labels: Archival Storage, Applications, Digital Library, Data Storage, Information Management, Collection Building]
- Libraries: CDL, UCB - Elib, UCSB - ADL, Stanford - SDLIP, U Michigan - UMDL
- Applications: Digital Sky, Neuroscience, Protein Data Bank, Molecular Structures, Earth Systems Science

NPACI is a National Partnership of Partnerships
- 46 institutions
- 20 states
- 4 countries
- 5 national labs
- Many projects (new and old)
- Vendors and industry
- Government agencies

Provide Teraflops / Petabyte capable systems for use by national academic community
- Current systems at the San Diego Supercomputer Center
  - 250 Gflops peak computation rate
    - IBM SP, CRAY T3E
  - 250 Terabyte archive capacity, 100 TB in archive
    - High Performance Storage System
- By end of year
  - 1 TFlop peak computation rate
    - IBM SP
  - 500 Terabyte archive capacity

Challenges
- Facilitate access to high-end resources
- Support data intensive computing
- Facilitate access to distributed data resources
- Support information discovery
- Minimize complexity of user interfaces
- Provide unifying data access system
- Requires information management infrastructure

Bio-Informatics

Art Museum Image Consortium - AMICO

National Virtual Observatory

California Digital Library