San Diego Supercomputer Center
National Partnership for Advanced Computational Infrastructure

Grid Based Solutions for Distributed Data Management
Reagan W. Moore
San Diego Supercomputer Center

Topics
Managing data residing in multiple storage systems
Building collections of distributed data
Supporting digital library services
Federating collections
Preserving collections

Storage Resource Broker
Generic data management infrastructure that is used to support:
  – Data grids for data sharing
  – Digital libraries for data publication
  – Persistent archives for data preservation
Manages distributed data on national and international scales
  – California Digital Library
  – NSF National Science Digital Library
  – Worldwide Universities Network data grid

Managing Distributed Data
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Storage Repository: naming conventions provided by storage systems
  – Storage location
  – User name
  – File name
  – File context (creation date,…)
  – Access constraints

Storage Resource Broker Data Grid
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Collection
Data Grid
  – Logical resource name space
  – Logical user name space
  – Logical file name space
  – Logical context (metadata)
  – Control/consistency constraints
Storage Repository
  – Storage location
  – User name
  – File name
  – File context (creation date,…)
  – Access constraints

Discovery
Data grids associate metadata with each digital entity (file, SQL command, URL) that is registered
  – Administrative metadata (location of file, owner, access controls, size, audit trail)
  – Descriptive metadata (Dublin Core, annotations)
  – Curator-defined metadata (can define collection level metadata, and metadata unique to a digital entity)
Metadata query mechanisms include:
  – Web browsers, DSpace, OAI-PMH, WSDL, Perl, Python, Windows browser, Java class library, Unix shell commands, C library calls

Search Capabilities
Browse within collection hierarchy
Search by attribute name and operations on attribute values across all types of metadata
  – Dublin Core attributes
  – Administrative attributes
  – Curator-defined attributes
SRB manages access controls on metadata attributes and on digital entities
  – Metadata not displayed for digital entities that have restricted access
  – Metadata not displayed for attributes that have restricted access

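The search behavior on this slide can be sketched in a few lines. This is a toy illustration, not the SRB query interface: the catalog layout, the `readers` and `restricted` fields, the operator names, and the example logical names are all assumptions made for the sketch.

```python
import operator

# Comparison operators a query may apply to an attribute value (assumed set).
OPS = {
    "=": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
    "contains": lambda value, wanted: wanted in value,
}


def search(catalog, user, attribute, op, wanted):
    """Return the logical names that match the query and that `user` may see."""
    hits = []
    for name, record in catalog.items():
        # Restricted digital entity: its metadata is not displayed at all.
        if user not in record["readers"]:
            continue
        # Restricted attribute: hidden unless the user is explicitly listed.
        restricted = record.get("restricted", {})
        if attribute in restricted and user not in restricted[attribute]:
            continue
        value = record["metadata"].get(attribute)
        if value is not None and OPS[op](value, wanted):
            hits.append(name)
    return sorted(hits)


# Hypothetical catalog with one descriptive and one administrative attribute.
catalog = {
    "/prdla/maps/a.tif": {
        "readers": {"alice", "bob"},
        "metadata": {"dc:title": "Harbor map", "size": 1200},
        "restricted": {"size": {"alice"}},
    },
    "/prdla/maps/b.tif": {
        "readers": {"alice"},
        "metadata": {"dc:title": "Harbor survey", "size": 800},
    },
}
```

Both access checks run before the attribute is compared, so a restricted entity or attribute never contributes to a result list.
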
Access Mechanisms
Files: clicking on the record downloads the file
URLs: clicking redirects to the web page
SQL commands: clicking causes the SQL command (with input parameters) to be issued to the database, and the result is returned as HTML or XML
Additional operations support:
  – Replication / Caching / Staging / Pre-fetch (partial read) / Bulk unload / Parallel I/O streams / Remote procedures for filtering and subsetting
Asynchronous interfaces:
  – DSpace mechanisms, Storage Resource Manager

Timeliness
Data grids self-consistently manage all registered digital entities
  – All operations on digital entities automatically update the administrative metadata
  – Synchronization flags kept for replicas
  – Write locks kept for files aggregated into containers
Federated digital libraries are synchronized under curator control

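One way to picture the synchronization flags is a per-replica "current" bit that a write clears on every other copy. The sketch below is a simplified model under that assumption; the class and method names are hypothetical and do not come from SRB.

```python
class ReplicaSet:
    """Tracks which replicas of one digital entity are up to date (toy model)."""

    def __init__(self, resources):
        # True means the replica on that resource holds the latest content.
        self.current = {resource: True for resource in resources}

    def write(self, resource):
        """Record a write to one replica; every other replica becomes stale."""
        if resource not in self.current:
            raise KeyError(f"no replica on {resource}")
        for name in self.current:
            self.current[name] = (name == resource)

    def stale(self):
        """List the resources whose replicas still need synchronization."""
        return sorted(name for name, ok in self.current.items() if not ok)

    def synchronize(self):
        """Mark all replicas current, as after copying the latest version out."""
        for name in self.current:
            self.current[name] = True
```

In this model a curator-controlled synchronization is simply a deliberate call to `synchronize()`, rather than something triggered on every write.
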
Federation
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Collection A and Data Collection B, each managed by its own Data Grid with:
  – Logical resource name space
  – Logical user name space
  – Logical file name space
  – Logical context (metadata)
  – Control/consistency constraints
Access controls and consistency constraints on cross registration of digital entities

Consistency Constraints
Master-slave data grids
  – The entries in the slave data grid are registered under control of the master data grid
Peer-to-peer data grids
  – Curators register selected material into another data grid. Access controls are kept by the original data grid.
Central repository
  – Remote data grids push material, user names, metadata into a central repository
Deep archive
  – Digital entities and metadata are replicated into a data grid under curator control, but no other users are allowed access

Software Costs
Storage Resource Broker clients are open source - distributed for free
Storage Resource Broker server source code is distributed to academic institutions for free
  – Commercial companies should talk to the University of California, San Diego Technology Transfer Office for server source code
SRB data grid uses commercially available systems for storing:
  – Metadata - Oracle, DB2, Sybase, Informix, PostgreSQL, MySQL
  – Files - Unix file systems, Linux, Mac OS X, Windows, binary large objects in databases, object ring buffers, HPSS, UniTree, ADSM, DMF, archival storage systems
If you use PostgreSQL or MySQL for your database, the cost is zero. However, large collections (millions of files) should use a commercial database

Hardware Costs
SRB software can be installed on laptops (Windows, Linux, Mac OS X), servers (Sun, Linux, Irix, AIX, HP), and supercomputers (clusters)
  – Installation on a Mac laptop takes 15 minutes, including a Postgres database, metadata catalog, server, and clients
Grid Bricks - commodity-based disk systems
  – Provide 2.5 GHz CPU, 1 Gbyte of memory, Gig-E network connection, 5 terabytes of disk, RAID controller, Linux operating system
  – Effective cost is $2000 per terabyte
  – Modular system that can be expanded by adding grid bricks. The SRB data grid manages global name spaces.
If you use your own storage system, the cost is zero

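The grid brick figures above allow a rough sizing estimate. The sketch below only multiplies out the two numbers on the slide (5 terabytes per brick, $2000 per terabyte). It assumes whole bricks are purchased, that price scales linearly, and that replication and file system overhead are ignored; the function names are illustrative.

```python
import math

BRICK_TB = 5        # disk per grid brick, from the slide
COST_PER_TB = 2000  # effective cost in dollars per terabyte, from the slide


def bricks_needed(collection_tb):
    """Number of whole grid bricks required to hold a collection."""
    return math.ceil(collection_tb / BRICK_TB)


def disk_cost(collection_tb):
    """Dollar cost of the bricks, assuming whole bricks at the quoted rate."""
    return bricks_needed(collection_tb) * BRICK_TB * COST_PER_TB
```

Under these assumptions a single brick costs $10,000, and a 12 terabyte collection needs three bricks.
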
Processing and Administrative Costs
SRB data grid supports digital entities
  – Any type of file can be stored
  – Files can be registered from an existing storage system, preserving both the organization and names
Administration costs
  – Data grid administrator - manage the data grid servers, track problems with access to storage systems, installation of additional servers, registration of users
  – Database administrator - manage the database in which the metadata is stored, perform backups, track software upgrades
  – Security, network, and storage system administrators - standard administrative support for storage systems and networks

Summary
SRB provides collection management of data distributed across multiple storage systems
  – Support technology evolution - migration to new storage systems and new databases
  – Support federation - controlled sharing and publication of data between data grids
  – Support preservation - tracking of audit trails, checksums for validating integrity
  – Support all sizes of collections - thousands to hundreds of millions of records

Data Grid Federation - zoneSRB
[Architecture diagram, layers from top to bottom]
Application
Access interfaces: C, C++, Java Libraries; Unix Shell; Linux I/O; DLL / Python, Perl; Java, NT Browsers; OAI, WSDL, WSRF; HTTP; DSpace; OpenDAP
Federation Management
Consistency & Metadata Management / Authorization, Authentication, Audit
Logical Name Space; Latency Management; Data Transport; Metadata Transport
Catalog Abstraction
  – Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix
Storage Repository Virtualization
  – Archives: Tape, HPSS, ADSM, UniTree, DMF, CASTOR, ADS
  – ORB
  – File Systems: Unix, NT, Mac OSX
  – Databases: DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix

For More Information
Reagan W. Moore
San Diego Supercomputer Center

PRDLA Collection at SDSC (2003)
Collection size
  – 800 Gbytes
  – 14 million files
Server capacity
  – Windows NT with 2 Tbytes disk
  – AIT2 tape library for backup, 1 Tbyte of tape
  – 3 web servers
Access rate
  – Average 1 million web page accesses per month
  – Does not count Siku server

Data Grid Opportunities
Provide uniform interface to data collections that reside at member sites
  – Provides way to extend PRDLA published holdings by incorporating new material
Replicate collections between sites
  – Provides way to protect against natural disasters
Integrate file access with archive access
  – Provides way to preserve collections

Data Grids
Software systems that manage distributed data
  – Organize distributed data into a logical collection
Provide global naming conventions
  – Location independent identifiers
Support curation processes
  – Access controls for adding files
  – Browsing and discovery services

Accessing Data at Multiple Sites
User Application
Storage sites: Archive at SDSC, File System in Australia, File System in Taiwan
Each site has its own naming convention for files
A data grid provides a uniform way to name and access the files across the sites

Building Distributed Collection
User Application
Data Grid: common naming convention and set of attributes for describing digital entities
  – Logical name space
  – Location independent identifier
  – Persistent identifier
  – Collection owned data
  – Access controls
  – Audit trails
  – Checksums
  – Descriptive metadata
  – Inter-realm authentication
  – Single sign-on system
Storage sites: Archive at SDSC, File System in Australia, File System in Taiwan

Collection Metadata Catalog
Logical file name space (associate metadata attributes with the logical file name)
  – Physical location of the file
  – Name of the file on the storage system
  – Size of the file
  – Owner of the file
  – Access controls on the file
(associate digital library attributes with the logical file name)
  – Descriptive metadata about the file
  – Dublin Core provenance information about the file
  – Annotations on the file

Storage Systems Provide
File name - naming convention for files
Storage location - IP address of the storage system
User name - persons who have access to the storage system
File context (creation date,…) - state information about each file
Access constraints - controls on access
Each storage repository uses a different set of naming conventions

Managing Distributed Data
(Replace naming conventions used by a storage repository with naming conventions managed by the data grid)
Storage Repository
  – File name
  – Storage location
  – User name
  – File context (creation date,…)
  – Access constraints
Data Grid
  – Logical file name space
  – Logical resource name space
  – Logical user name space
  – Logical metadata context
  – Control/consistency constraints

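The substitution described on this slide amounts to a lookup table from grid-managed names to repository-specific names. The following minimal sketch models only the logical file and resource name spaces; the class names, methods, and fields are assumptions for illustration and are not the SRB catalog schema.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # logical resource name chosen by the data grid
    physical_path: str  # file name as the storage repository knows it


@dataclass
class LogicalNameSpace:
    # Maps each logical file name to the replicas that hold its content.
    entries: dict = field(default_factory=dict)

    def register(self, logical_name, resource, physical_path):
        """Attach a physical copy to a logical file name."""
        self.entries.setdefault(logical_name, []).append(
            Replica(resource, physical_path)
        )

    def resolve(self, logical_name):
        """Return the physical copies behind a logical file name."""
        return list(self.entries.get(logical_name, []))

    def migrate(self, logical_name, old_resource, new_resource, new_path):
        """Move a copy to new storage; the logical name does not change."""
        for replica in self.entries.get(logical_name, []):
            if replica.resource == old_resource:
                replica.resource = new_resource
                replica.physical_path = new_path
                return
        raise KeyError(f"{logical_name} has no replica on {old_resource}")
```

Because users and applications hold only the logical name, a migration to a new storage system changes the catalog entry and nothing else.
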
Accessing Multiple Types of Storage Systems
User Application
Storage systems: Archive at SDSC, Database in Australia, File System in Taiwan

Standard Data Access Operations
Common set of operations for interacting with every type of storage repository
User Application
Remote operations
  – Unix file system
  – Latency management
  – Procedures
  – Transformations
  – Third party transfer
  – Filtering
  – Queries
Storage systems: Archive at SDSC, Database in Australia, File System in Taiwan

Data Grid Applications
Data grid for managing distributed data
  – Latency management for bulk analyses of collections
  – Infrastructure independent name spaces for describing data, resources, users, and state information
Digital library for managing data context
  – Curation services for managing collections
  – Descriptive metadata
Persistent archive to manage technology evolution
  – Interoperability mechanisms between heterogeneous storage systems and user access mechanisms

Provide uniform interface to data collections that reside at member sites
Install a Storage Resource Broker application level server on each storage system that holds data
Register the data into the PRDLA data grid
  – Establishes a logical file name for each file
Create a collection hierarchy to support browsing and discovery
  – Register PRDLA metadata for each file
  – The SRB data grid manages the metadata for the data grid; automatically updates information on the location of the file
Provide web-based access to the collections
  – Other access mechanisms support bulk load operations

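The registration step can be sketched as a mapping from existing physical paths to logical names under a collection, keeping the directory organization and file names intact. This is an illustration only: `register_tree` and the example paths are hypothetical, and real registration would go through the SRB server rather than a Python dictionary.

```python
import posixpath


def register_tree(physical_paths, physical_root, collection_root):
    """Map files under physical_root to logical names under collection_root.

    The relative directory layout and the file names are preserved, so the
    collection hierarchy mirrors the organization on the storage system.
    """
    catalog = {}
    for path in physical_paths:
        relative = posixpath.relpath(path, physical_root)
        logical_name = posixpath.join(collection_root, relative)
        catalog[logical_name] = path  # logical file name -> physical location
    return catalog
```

For example, registering `/data/maps/a.tif` from root `/data` into the collection `/prdla/ucsd` yields the logical name `/prdla/ucsd/maps/a.tif`.
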
Replicate collections between sites
Use data grid commands to replicate a collection onto a remote storage system
  – Information about the replicated files is kept in the metadata catalog
Provides way to support load balancing
  – Sites access data that is closer to them
Provides a way to protect against a local natural disaster
  – Files can be retrieved from the remote site

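Both benefits of replication, load balancing and disaster protection, come down to how a replica is chosen at access time. The sketch below assumes the catalog lists the sites holding a copy and that some distance measure between sites is available; the function name, the distance table, and the availability set are hypothetical.

```python
def choose_replica(replica_sites, client_site, distance, available):
    """Pick the closest reachable site that holds a replica.

    replica_sites: sites recorded in the metadata catalog for the file
    distance:      dict mapping (client_site, replica_site) to a cost
    available:     set of sites that are currently reachable
    """
    candidates = [site for site in replica_sites if site in available]
    if not candidates:
        raise LookupError("no replica is reachable")
    return min(candidates, key=lambda site: distance[(client_site, site)])
```

When the nearest site is down, the same call falls back to the next closest copy, which is the disaster recovery case on the slide.
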
Integrate file access with archive access
Can also replicate metadata catalog between sites
  – Provides way to manage long-term preservation, a deep archive
Data grid provides synchronization mechanisms to update the metadata catalog
  – Can control execution of the synchronization mechanisms
Data grid provides file validation mechanisms to verify file integrity (checksums)
  – Can verify a local copy against the checksums stored in the metadata catalog

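Checksum validation of a local copy can be sketched with Python's standard `hashlib`. The slides do not say which checksum algorithm the data grid uses, so SHA-256 here is an assumption, as are the function names and the dictionary standing in for the metadata catalog.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Hex digest of the file content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def register_checksum(catalog, logical_name, data):
    """Store the checksum in the catalog when the file is registered."""
    catalog[logical_name] = checksum(data)


def verify(catalog, logical_name, local_data):
    """Compare a local copy against the checksum stored in the catalog."""
    return checksum(local_data) == catalog[logical_name]
```

A copy that fails `verify` would be replaced from another replica whose content still matches the stored checksum.
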
PRDLA Data Grid
Propose the formation of a data grid linking PRDLA sites
  – Support data sharing
  – Support integration of digital libraries
  – Support preservation environments
Storage Resource Broker data grid is in production use in international projects

Data Grid Installations
Australia - University of Queensland, APAC
Japan - KEK (Tsukuba)
Korea - KISTI, Korea Institute of Science and Technology Information
Singapore - National University of Singapore
Taiwan - National Taiwan University
University of California - California Digital Library, UCSD