San Diego Supercomputer Center, National Partnership for Advanced Computational Infrastructure: Grid Based Solutions for Distributed Data Management (Reagan W. Moore)

Presentation transcript:

Grid Based Solutions for Distributed Data Management
Reagan W. Moore
San Diego Supercomputer Center

Topics
Managing data residing in multiple storage systems
Building collections of distributed data
Supporting digital library services
Federating collections
Preserving collections

Storage Resource Broker
Generic data management infrastructure that is used to support:
  – Data grids for data sharing
  – Digital libraries for data publication
  – Persistent archives for data preservation
Manages distributed data on national and international scales:
  – California Digital Library
  – NSF National Science Digital Library
  – Worldwide Universities Network data grid


Managing Distributed Data
Data access methods: Web browser, DSpace, OAI-PMH
Storage repository (naming conventions provided by storage systems):
  – Storage location
  – User name
  – File name
  – File context (creation date, …)
  – Access constraints

Storage Resource Broker Data Grid
Data access methods: Web browser, DSpace, OAI-PMH
Data grid (data collection):
  – Logical resource name space
  – Logical user name space
  – Logical file name space
  – Logical context (metadata)
  – Control/consistency constraints
Storage repository:
  – Storage location
  – User name
  – File name
  – File context (creation date, …)
  – Access constraints
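The logical file name space on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not SRB code: the class names, host names, and paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One physical copy, named the way its own storage system names it."""
    storage_location: str  # hypothetical host name of the storage system
    physical_path: str     # file name as known to that storage system


@dataclass
class LogicalNameSpace:
    """Maps location-independent logical names to physical replicas."""
    entries: dict[str, list[Replica]] = field(default_factory=dict)

    def register(self, logical_name: str, replica: Replica) -> None:
        # Registering an existing file leaves it where it is; only the
        # mapping from the logical name to the physical copy is recorded.
        self.entries.setdefault(logical_name, []).append(replica)

    def resolve(self, logical_name: str) -> list[Replica]:
        # Users ask for the logical name; the catalog answers with locations.
        return list(self.entries.get(logical_name, []))


catalog = LogicalNameSpace()
catalog.register("/prdla/maps/pacific.tif",
                 Replica("archive.site-a.example", "/hpss/u/0042/f9913"))
catalog.register("/prdla/maps/pacific.tif",
                 Replica("fs.site-b.example", "/export/data/pacific.tif"))

for replica in catalog.resolve("/prdla/maps/pacific.tif"):
    print(replica.storage_location, replica.physical_path)
```

The user-visible name stays the same even if a copy moves to a new storage system; only the catalog entry changes.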

Discovery
Data grids associate metadata with each digital entity (file, SQL command, URL) that is registered:
  – Administrative metadata (location of file, owner, access controls, size, audit trail)
  – Descriptive metadata (Dublin Core, annotations)
  – Curator-defined metadata (can define collection-level metadata, and metadata unique to a digital entity)
Metadata query mechanisms include:
  – Web browsers, DSpace, OAI-PMH, WSDL, Perl, Python, Windows browser, Java class library, Unix shell commands, C library calls


Search Capabilities
Browse within collection hierarchy
Search by attribute name and operations on attribute values across all types of metadata:
  – Dublin Core attributes
  – Administrative attributes
  – Curator-defined attributes
SRB manages access controls on metadata attributes and on digital entities:
  – Metadata not displayed for digital entities that have restricted access
  – Metadata not displayed for attributes that have restricted access
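The two display rules on this slide can be sketched as a filter applied before every attribute search. A minimal illustration, assuming a simple per-entity reader list and a per-attribute reader list; the `Entity` class, field names, and sample records are hypothetical and do not reflect the SRB schema.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    logical_name: str
    metadata: dict[str, str]
    readers: set[str]  # users allowed to see the entity at all
    # attribute name -> users allowed to see that attribute
    restricted_attrs: dict[str, set[str]] = field(default_factory=dict)


def visible_metadata(entity: Entity, user: str):
    """Return the metadata this user may see, or None if the entity is hidden."""
    if user not in entity.readers:
        return None  # rule 1: restricted entity, no metadata displayed
    return {
        name: value
        for name, value in entity.metadata.items()
        # rule 2: restricted attribute, displayed only to permitted users
        if name not in entity.restricted_attrs
        or user in entity.restricted_attrs[name]
    }


def search(entities, user, attr, predicate):
    """Logical names whose visible value for `attr` satisfies `predicate`."""
    hits = []
    for entity in entities:
        visible = visible_metadata(entity, user)
        if visible is not None and attr in visible and predicate(visible[attr]):
            hits.append(entity.logical_name)
    return hits


entities = [
    Entity("/coll/report.pdf",
           {"title": "Annual report", "donor": "A. Smith"},
           readers={"curator", "guest"},
           restricted_attrs={"donor": {"curator"}}),
    Entity("/coll/draft.pdf",
           {"title": "Annual report draft"},
           readers={"curator"}),
]

print(search(entities, "guest", "title", lambda v: "report" in v.lower()))
print(search(entities, "curator", "title", lambda v: "report" in v.lower()))
```

Filtering before matching means a restricted attribute cannot be probed through search results either.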

Access Mechanisms
Files: clicking on the record downloads the file
URLs: clicking redirects to the web page
SQL commands: clicking causes the SQL command (with input parameters) to be issued to the database, and the result is returned as HTML or XML
Additional operations that support:
  – Replication / Caching / Staging / Pre-fetch (partial read) / Bulk unload / Parallel I/O streams / Remote procedures for filtering and subsetting
Asynchronous interfaces:
  – DSpace mechanisms, Storage Resource Manager

Timeliness
Data grids self-consistently manage all registered digital entities:
  – All operations on digital entities automatically update the administrative metadata
  – Synchronization flags kept for replicas
  – Write locks kept for files aggregated into containers
Federated digital libraries are synchronized under curator control
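One way to picture the synchronization flags kept for replicas is a "dirty" bit per copy. The sketch below is an assumption about how such flags could work, not a description of SRB internals; the class names and site labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    location: str
    content: bytes
    dirty: bool = False  # True when this copy is older than the latest write


class ReplicatedFile:
    def __init__(self, locations, content=b""):
        self.replicas = {loc: ReplicaState(loc, content) for loc in locations}

    def write(self, location, content):
        """Update one copy and flag every other copy as out of date."""
        if location not in self.replicas:
            raise KeyError(f"no replica at {location}")
        for loc, replica in self.replicas.items():
            if loc == location:
                replica.content, replica.dirty = content, False
            else:
                replica.dirty = True

    def stale_locations(self):
        return sorted(loc for loc, r in self.replicas.items() if r.dirty)

    def synchronize(self):
        """Copy the current content onto every flagged replica."""
        current = next(r for r in self.replicas.values() if not r.dirty)
        for replica in self.replicas.values():
            if replica.dirty:
                replica.content, replica.dirty = current.content, False


f = ReplicatedFile(["sdsc", "taiwan", "australia"], b"v1")
f.write("taiwan", b"v2")
print(f.stale_locations())  # the copies that still hold the old content
f.synchronize()
print(f.stale_locations())
```

Because the flag is updated as part of the write itself, the catalog always knows which copies are current without comparing file contents.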

Federation
Data access methods: Web browser, DSpace, OAI-PMH
Data grid, Data Collection A:
  – Logical resource name space
  – Logical user name space
  – Logical file name space
  – Logical context (metadata)
  – Control/consistency constraints
Data grid, Data Collection B:
  – Logical resource name space
  – Logical user name space
  – Logical file name space
  – Logical context (metadata)
  – Control/consistency constraints
Between the two data grids: access controls and consistency constraints on cross registration of digital entities

Consistency Constraints
Master-slave data grids
  – The entries in the slave data grid are registered under control of the master data grid
Peer-to-peer data grids
  – Curators register selected material into another data grid. Access controls are kept by the original data grid.
Central repository
  – Remote data grids push material, user names, and metadata into a central repository
Deep archive
  – Digital entities and metadata are replicated into a data grid under curator control, but no other users are allowed access

Software Costs
Storage Resource Broker clients are open source and distributed for free
Storage Resource Broker server source code is distributed to academic institutions for free
  – Commercial companies should talk to the University of California, San Diego Technology Transfer Office for server source code
SRB data grid uses commercially available systems for storing:
  – Metadata - Oracle, DB2, Sybase, Informix, PostgreSQL, mySQL
  – Files - Unix file systems, Linux, Mac OS X, Windows, binary large objects in databases, object ring buffers, HPSS, UniTree, ADSM, DMF, archival storage systems
If you use Postgres or mySQL for your database, the cost is zero. However, large collections (millions of files) should use a commercial database.

Hardware Costs
SRB software can be installed on laptops (Windows, Linux, Mac OS X), servers (Sun, Linux, Irix, AIX, HP), and supercomputers (clusters)
  – Installation on a Mac laptop takes 15 minutes, including a Postgres database, metadata catalog, server, and clients
Grid Bricks - commodity-based disk systems
  – Provide 2.5 GHz CPU, 1 Gbyte of memory, Gig-E network connection, 5 terabytes of disk, RAID controller, Linux operating system
  – Effective cost is $2000 per terabyte
  – Modular system that can be expanded by adding grid bricks. The SRB data grid manages global name spaces.
If you use your own storage system, the cost is zero
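The grid brick figures lend themselves to a quick sizing estimate. The sketch below uses only the two numbers on the slide (5 terabytes per brick, $2000 per terabyte) and assumes, for illustration, that bricks are bought whole and that the effective cost applies to a brick's full capacity.

```python
import math

COST_PER_TB_USD = 2000  # effective cost quoted on the slide
TB_PER_BRICK = 5        # disk capacity of one grid brick


def bricks_needed(collection_tb: float) -> int:
    """Whole bricks required to hold a collection of the given size."""
    return math.ceil(collection_tb / TB_PER_BRICK)


def hardware_cost_usd(collection_tb: float) -> int:
    """Cost of those bricks, assuming each is paid for at full capacity."""
    return bricks_needed(collection_tb) * TB_PER_BRICK * COST_PER_TB_USD


for size_tb in (0.8, 5, 12):
    print(size_tb, bricks_needed(size_tb), hardware_cost_usd(size_tb))
```

Under these assumptions a single brick works out to $10,000, and capacity grows in 5-terabyte steps.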

Processing and Administrative Costs
SRB data grid supports digital entities
  – Any type of file can be stored
  – Files can be registered from an existing storage system, preserving both the organization and names
Administration costs
  – Data grid administrator - manage the data grid servers, track problems with access to storage systems, installation of additional servers, registration of users
  – Database administrator - manage the database in which the metadata is stored, perform backups, track software upgrades
  – Security, network, and storage system administrators - standard administrative support for storage systems and networks

Summary
SRB provides collection management of data distributed across multiple storage systems
  – Support technology evolution - migration to new storage systems and new databases
  – Support federation - controlled sharing and publication of data between data grids
  – Support preservation - tracking of audit trails, checksums for validating integrity
  – Support all sizes of collections - thousands to hundreds of millions of records

Data Grid Federation - zoneSRB
Application
Access interfaces:
  – C, C++, Java libraries
  – Unix shell
  – Linux I/O
  – DLL / Python, Perl
  – Java, NT browsers
  – OAI, WSDL, WSRF
  – HTTP, DSpace, OpenDAP
Federation management
Consistency and metadata management / authorization, authentication, audit
Logical name space, latency management, data transport, metadata transport
Catalog abstraction:
  – Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix
Storage repository virtualization:
  – Archives: tape, HPSS, ADSM, UniTree, DMF, CASTOR, ADS
  – File systems: Unix, NT, Mac OS X
  – Databases: DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix
  – ORB

For More Information
Reagan W. Moore
San Diego Supercomputer Center

PRDLA Collection at SDSC (2003)
Collection size
  – 800 Gbytes
  – 14 million files
Server capacity
  – Windows NT with 2 Tbytes disk
  – AIT2 tape library for backup, 1 Tbyte of tape
  – 3 web servers
Access rate
  – Average 1 million web page accesses per month
  – Does not count Siku server

Data Grid Opportunities
Provide uniform interface to data collections that reside at member sites
  – Provides a way to extend PRDLA published holdings by incorporating new material
Replicate collections between sites
  – Provides a way to protect against natural disasters
Integrate file access with archive access
  – Provides a way to preserve collections

Data Grids
Software systems that manage distributed data
  – Organize distributed data into a logical collection
Provide global naming conventions
  – Location-independent identifiers
Support curation processes
  – Access controls for adding files
  – Browsing and discovery services

Accessing Data at Multiple Sites
User application
Storage sites:
  – Archive at SDSC
  – File system in Australia
  – File system in Taiwan
Each site has its own naming convention for files
A data grid provides a uniform way to name and access the files across the sites

Building Distributed Collection
User application
Data grid: common naming convention and set of attributes for describing digital entities
  – Logical name space
  – Location-independent identifier
  – Persistent identifier
  – Collection-owned data
  – Access controls
  – Audit trails
  – Checksums
  – Descriptive metadata
  – Inter-realm authentication
  – Single sign-on system
Storage sites:
  – Archive at SDSC
  – File system in Australia
  – File system in Taiwan

Collection Metadata Catalog
Logical file name space (associate metadata attributes with the logical file name)
  – Physical location of the file
  – Name of the file on the storage system
  – Size of the file
  – Owner of the file
  – Access controls on the file
Digital library attributes (associate digital library attributes with the logical file name)
  – Descriptive metadata about the file
  – Dublin Core provenance information about the file
  – Annotations on the file

Storage Systems Provide
File name - naming convention for files
Storage location - IP address of the storage system
User name - persons who have access to the storage system
File context (creation date, …) - state information about each file
Access constraints - controls on access
Each storage repository uses a different set of naming conventions

Managing Distributed Data
Replace the naming conventions used by a storage repository with naming conventions managed by the data grid:
  – File name -> Logical file name space
  – Storage location -> Logical resource name space
  – User name -> Logical user name space
  – File context (creation date, …) -> Logical metadata context
  – Access constraints -> Control/consistency constraints

Accessing Multiple Types of Storage Systems
User application
Storage systems:
  – Archive at SDSC
  – Database in Australia
  – File system in Taiwan

Standard Data Access Operations
User application
Common set of operations for interacting with every type of storage repository:
  – Remote operations: Unix file system, latency management, procedures, transformations, third-party transfer, filtering, queries
Storage systems:
  – Archive at SDSC
  – Database in Australia
  – File system in Taiwan
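A common set of operations across different repository types is often expressed as one abstract interface with one driver per storage system. The sketch below illustrates that pattern with only two operations (`put` and `get`) and two stand-in repositories, a directory and an in-memory SQLite table; the class and function names are hypothetical and are not the SRB driver API.

```python
import sqlite3
from abc import ABC, abstractmethod
from pathlib import Path


class StorageDriver(ABC):
    """Operations every storage repository offers to the data grid."""

    @abstractmethod
    def put(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, name: str) -> bytes: ...


class FileSystemDriver(StorageDriver):
    """Stores each object as a file directly under `root` (flat names only)."""

    def __init__(self, root: str):
        self.root = Path(root)

    def put(self, name, data):
        (self.root / name).write_bytes(data)

    def get(self, name):
        return (self.root / name).read_bytes()


class DatabaseBlobDriver(StorageDriver):
    """Stores each object as a binary large object in a database table."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE blobs (name TEXT PRIMARY KEY, data BLOB)")

    def put(self, name, data):
        self.db.execute("INSERT OR REPLACE INTO blobs VALUES (?, ?)", (name, data))

    def get(self, name):
        row = self.db.execute(
            "SELECT data FROM blobs WHERE name = ?", (name,)).fetchone()
        if row is None:
            raise FileNotFoundError(name)
        return row[0]


def transfer(source: StorageDriver, target: StorageDriver, name: str) -> None:
    """Copy an object between repositories without caring what type they are."""
    target.put(name, source.get(name))
```

A caller would create one driver per repository and call `transfer(fs_driver, db_driver, "a.dat")`; the application code stays the same whichever storage system sits behind each driver.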

Data Grid Applications
Data grid for managing distributed data
  – Latency management for bulk analyses of collections
  – Infrastructure-independent name spaces for describing data, resources, users, and state information
Digital library for managing data context
  – Curation services for managing collections
  – Descriptive metadata
Persistent archive to manage technology evolution
  – Interoperability mechanisms between heterogeneous storage systems and user access mechanisms

Provide uniform interface to data collections that reside at member sites
Install a Storage Resource Broker application-level server on each storage system that holds data
Register the data into the PRDLA data grid
  – Establishes a logical file name for each file
Create a collection hierarchy to support browsing and discovery
  – Register PRDLA metadata for each file
  – The SRB data grid manages the metadata for the data grid and automatically updates information on the location of the file
Provide web-based access to the collections
  – Other access mechanisms support bulk load operations

Replicate collections between sites
Use data grid commands to replicate a collection onto a remote storage system
  – Information about the replicated files is kept in the metadata catalog
Provides a way to support load balancing
  – Sites access data that is closer to them
Provides a way to protect against a local natural disaster
  – Files can be retrieved from the remote site
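Replication, load balancing, and disaster recovery all follow from the catalog knowing where every copy lives. A minimal sketch of that idea, using plain dictionaries in place of the metadata catalog and the storage systems; the function names, site labels, and file name are hypothetical.

```python
def replicate(catalog, stores, logical_name, source_site, target_site):
    """Copy a file to another site and record the new replica in the catalog."""
    stores[target_site][logical_name] = stores[source_site][logical_name]
    sites = catalog.setdefault(logical_name, [])
    if target_site not in sites:
        sites.append(target_site)


def fetch(catalog, stores, logical_name, site_preference, offline=()):
    """Return (site, data) from the most preferred site that is reachable."""
    holders = catalog.get(logical_name, [])
    for site in site_preference:  # closest site first
        if site in holders and site not in offline:
            return site, stores[site][logical_name]
    raise LookupError(f"no reachable replica of {logical_name}")


# One file held at SDSC, then replicated to a second site.
stores = {"sdsc": {"/prdla/a.txt": b"hello"}, "taiwan": {}}
catalog = {"/prdla/a.txt": ["sdsc"]}
replicate(catalog, stores, "/prdla/a.txt", "sdsc", "taiwan")

# Load balancing: a user near Taiwan reads the nearby copy.
print(fetch(catalog, stores, "/prdla/a.txt", ["taiwan", "sdsc"]))

# Disaster recovery: if the nearby site is down, the remote copy is used.
print(fetch(catalog, stores, "/prdla/a.txt", ["taiwan", "sdsc"], offline={"taiwan"}))
```

The same lookup serves both purposes: the preference order handles proximity, and skipping unreachable sites handles failure.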

Integrate file access with archive access
Can also replicate the metadata catalog between sites
  – Provides a way to manage long-term preservation, a deep archive
Data grid provides synchronization mechanisms to update the metadata catalog
  – Can control execution of the synchronization mechanisms
Data grid provides file validation mechanisms to verify file integrity (checksums)
  – Can verify a local copy against the checksums stored in the metadata catalog
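Verifying a local copy against a stored checksum can be sketched as follows. The slide does not say which checksum algorithm is used, so SHA-256 is an assumption here, and the `checksums` dictionary stands in for the metadata catalog; the function names are hypothetical.

```python
import hashlib


def file_checksum(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(path, checksums, logical_name):
    """True if the local copy matches the checksum recorded at registration."""
    return file_checksum(path) == checksums[logical_name]
```

In use, the checksum is computed once when the file is registered and stored under its logical name; any later copy, local or remote, can then be checked with `verify_copy` without access to the original.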

PRDLA Data Grid
Propose the formation of a data grid linking PRDLA sites
  – Support data sharing
  – Support integration of digital libraries
  – Support preservation environments
Storage Resource Broker data grid is in production use in international projects

Data Grid Installations
Australia - University of Queensland, APAC
Japan - KEK (Tsukuba)
Korea - KISTI, Korea Institute of Science and Technology Information
Singapore - National University of Singapore
Taiwan - National Taiwan University
University of California - California Digital Library, UCSD