Data Grids
  Jon Ludwig, Leor Dilmanian, Braden Allchin, Andrew Brown
Outline
  What is a Data Grid
  Components of a Data Grid
  Data Grids of Today
  Amazon S3 Web Service
What is a Data Grid?
  Distributed storage mechanism providing resources to computational grids
  Cheap, effective, and scalable means of recording information across multiple grid sites
  The resources, tools, and information products that can be used for data discovery and delivery from a variety of sources, typically used for the production of valuable information
Components of a Data Grid
  Case study: NERC
    o CSML: the Climate Science Modelling Language information model
    o The CSML Toolbox: create and manipulate documents that conform to the CSML schema
    o The CSML Data Services: expose documents and the data they point to
    o The NDG Data Graphical User Interface: uses web services to manipulate data
    o MOLES schema, XQuery definitions, related software, frontend browser
    o Discovery Gateways & Infrastructure
    o Vocabulary server
Components Diagram
Storage Resource Broker
  Virtual data storage using namespaces
  Maintains metadata on files, users, groups
  Stored in relational DBMS
  Queries supported
  Has an API for other applications (e.g. Globus)
  Sharing, transfer, backup
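The idea of a virtual namespace backed by a relational metadata catalog can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the SRB API: the table layout, the function names, and the example URL are all assumptions made for this sketch.

```python
import sqlite3


def make_catalog():
    """Create an in-memory metadata catalog (hypothetical schema, not SRB's)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE files ("
        " logical_name TEXT PRIMARY KEY,"  # name users see in the virtual namespace
        " physical_url TEXT NOT NULL,"     # where the bytes actually live
        " owner TEXT NOT NULL,"
        " size_bytes INTEGER NOT NULL)"
    )
    return db


def register(db, logical_name, physical_url, owner, size_bytes):
    """Record a file under a logical name, hiding its physical location."""
    db.execute(
        "INSERT INTO files VALUES (?, ?, ?, ?)",
        (logical_name, physical_url, owner, size_bytes),
    )


def resolve(db, logical_name):
    """Map a logical name to its physical URL, or None if it is unknown."""
    row = db.execute(
        "SELECT physical_url FROM files WHERE logical_name = ?",
        (logical_name,),
    ).fetchone()
    return row[0] if row else None


def find_by_owner(db, owner):
    """Metadata query: list the logical names owned by one user."""
    rows = db.execute(
        "SELECT logical_name FROM files WHERE owner = ? ORDER BY logical_name",
        (owner,),
    ).fetchall()
    return [r[0] for r in rows]
```

A client would call register() once per stored file and then use resolve() whenever it needs the physical location, so applications only ever deal with logical names.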
Data Grids of Today
  Biomedical Informatics Research Network (BIRN)
  HP's Global File systems (SFS) collaboration
  NSF's iVDGL (International Virtual Data Grid Laboratory)
    o Now part of the Open Science Grid (OSG)
  European Union's DataGrid Project
    o Now part of the Enabling Grids for E-sciencE (EGEE) project
  Natural Environment Research Council (NERC)
  Amazon Simple Storage Service (S3)
Amazon S3
  Amazon Simple Storage Service
  Web Service - REST / SOAP / BitTorrent
  Offload storage requirements to Amazon
    o Cost
    o Security
  Scalable - Storage, availability, speed
  Reliable - Fault tolerance, redundancy
  Fast
  Inexpensive - Commodity hardware
  Simple - Data grid is abstracted
  Flexible - Constraints
Amazon S3 - Design Principles
  Decentralization - Avoid single points of failure (SPoF)
  Asynchrony - Avoid waiting on communications
  Autonomy - Local Responsibility - Nodes take care of themselves
  Controlled Concurrency - Exposed operations require little or no concurrency
  Failure Tolerance - Automatic recovery, minimal interruption
  Controlled Parallelism - Recover quickly
  Small Building Blocks
  Symmetry - Nodes are identical in functionality, minimal configuration
  Simplicity
Amazon S3 - Functionality
  Objects - Fundamental storage unit
    o 1 B to 5 GB
    o Metadata
    o Keys uniquely identify Objects
  Buckets - Namespace for managing objects
    o Users own Buckets
    o Buckets contain Objects
    o Unlimited Objects per Bucket
  Operations
    o Create, Read, Write, List, Delete
  Replication
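The bucket and object model above can be illustrated with a small in-memory sketch. This is not the S3 API: the Bucket class and its method names are hypothetical, and only the 1 B to 5 GB size limit and the key/metadata structure are taken from the slide.

```python
MAX_OBJECT_BYTES = 5 * 1024**3  # 5 GB upper bound quoted on the slide


class Bucket:
    """Toy namespace mapping keys to objects (hypothetical, in-memory only)."""

    def __init__(self, name, owner):
        self.name = name
        self.owner = owner      # users own buckets
        self._objects = {}      # key -> (data, metadata)

    def put(self, key, data, metadata=None):
        """Create or overwrite the object stored under a key."""
        if not 1 <= len(data) <= MAX_OBJECT_BYTES:
            raise ValueError("object size must be between 1 B and 5 GB")
        self._objects[key] = (bytes(data), dict(metadata or {}))

    def get(self, key):
        """Read an object; returns (data, metadata) or raises KeyError."""
        return self._objects[key]

    def list_keys(self, prefix=""):
        """List keys in sorted order, optionally restricted to a prefix."""
        return sorted(k for k in self._objects if k.startswith(prefix))

    def delete(self, key):
        """Delete an object; deleting a missing key is a no-op here."""
        self._objects.pop(key, None)
```

Because a bucket is a flat namespace, a "directory" such as runs/ exists only as a shared key prefix, which is why list_keys() takes a prefix rather than a path.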
Amazon S3 - Security
  Public key authentication + HMAC
  Access Control Lists for Buckets
  Logging for Buckets
  May use SSL
  Integrity - MD5
  No data encryption
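How HMAC request signing and MD5 integrity checks fit together can be sketched with Python's standard library. This sketch assumes a shared secret key and HMAC-SHA1; the function names and the message layout are made up for illustration and are not Amazon's actual string-to-sign format.

```python
import base64
import hashlib
import hmac


def sign_request(secret_key: bytes, message: bytes) -> str:
    """Return a base64-encoded HMAC-SHA1 signature over the message."""
    digest = hmac.new(secret_key, message, hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def verify_request(secret_key: bytes, message: bytes, signature: str) -> bool:
    """Recompute the signature server-side and compare in constant time."""
    expected = sign_request(secret_key, message)
    return hmac.compare_digest(expected, signature)


def content_md5(data: bytes) -> str:
    """Hex MD5 digest used to detect corruption in transit (not for secrecy)."""
    return hashlib.md5(data).hexdigest()
```

The signature proves that the sender knows the secret key and that the request was not altered, while the MD5 digest only detects accidental corruption of the payload. Neither one encrypts the data, which matches the "No data encryption" point above.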
Amazon S3 - Disadvantages
  No renaming or moving of Buckets
  No content-based search
  No capping capabilities
  Cost
Amazon S3 - Costs
  Storage
    o $0.15 per GB-month of storage used
  Data Transfer
    o $0.10 per GB - all data transfer in
    o $0.18 per GB - first 10 TB / month data transfer out
    o $0.16 per GB - next 40 TB / month data transfer out
    o $0.13 per GB - data transfer out / month over 50 TB
  Requests
    o $0.01 per 1,000 PUT or LIST requests
    o $0.01 per 10,000 GET and all other requests
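The price list above can be turned into a rough monthly estimate. The sketch below uses the rates from the slide; the function names are hypothetical, and treating 1 TB as 1024 GB is an assumption, since the slide does not say which convention applies.

```python
GB_PER_TB = 1024  # assumption: the slide does not state 1000 vs 1024

# (tier size in GB, price per GB) for data transfer out, in tier order
OUT_TIERS = [
    (10 * GB_PER_TB, 0.18),   # first 10 TB / month
    (40 * GB_PER_TB, 0.16),   # next 40 TB / month
    (float("inf"), 0.13),     # everything over 50 TB / month
]


def transfer_out_cost(gb_out):
    """Charge each GB at the rate of the tier it falls into."""
    cost = 0.0
    remaining = gb_out
    for tier_size, rate in OUT_TIERS:
        used = min(remaining, tier_size)
        cost += used * rate
        remaining -= used
        if remaining <= 0:
            break
    return cost


def monthly_cost(storage_gb, gb_in, gb_out, put_list_requests, other_requests):
    """Estimate one month's bill in dollars from the slide's price list."""
    return (
        storage_gb * 0.15                   # storage, per GB-month
        + gb_in * 0.10                      # all data transfer in
        + transfer_out_cost(gb_out)         # tiered data transfer out
        + put_list_requests / 1000 * 0.01   # PUT and LIST requests
        + other_requests / 10000 * 0.01     # GET and all other requests
    )
```

For example, storing 100 GB, uploading 10 GB, downloading 50 GB, and issuing 1,000 PUT and 10,000 GET requests comes to 15.00 + 1.00 + 9.00 + 0.01 + 0.01, or roughly $25.02 for the month.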