The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets
A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, S. Tuecke
Presented by: Kasturi Chatterjee
Agnostic: Selim Kalayci
6/5/2019

Agenda
- Introduction
- Data Grid Design
- Data Grid Services
- Higher-Level Data Grid Components
- Conclusion

Introduction
- Grid: geographically distributed computing resources configured for coordinated use
- Data Grid: a database architecture, supported by a Grid, for storing and handling huge amounts of data

Introduction
- Scientific disciplines are data intensive as well as computationally demanding
- Datasets reach terabytes and petabytes
- Domains are diverse, and users and resources are geographically distributed

Data Grid
- Integrate heterogeneous data archives into a distributed data management grid*
- Identify services for high-performance, distributed, data-intensive computing*
- Identify the APIs and components required to implement it efficiently
*From Globus project slides available at loci.cs.utk.edu/dsi/netstore99/docs/presentations/foster-d-slides.pdf

Data Grid Design
Design Principles
- Mechanism neutrality: independent of low-level mechanisms
- Policy neutrality: design decisions are exposed to users
- Compatibility with the computational Grid: integration of storage and computation
- Uniformity of information infrastructure: uniform access to information about resource structure and state

Layered Architecture (from the paper)

Core Services
- Storage systems
  - DPSS: Distributed-Parallel Storage System
  - HPSS: High Performance Storage System
- Metadata repository
  - LDAP: Lightweight Directory Access Protocol
  - MCAT: Metadata Catalogue

Notes:

DPSS: The Distributed-Parallel Storage System (DPSS) is a scalable, high-performance, distributed-parallel data storage system originally developed as part of the DARPA-funded MAGIC Testbed, with additional support from the U.S. Dept. of Energy, Energy Research Division, Mathematical, Information, and Computational Sciences Office. The DPSS is a data block server that provides high-performance data handling and an architecture for building high-performance storage systems from low-cost commodity hardware components. This technology has been quite successful in providing an economical, high-performance, widely distributed, and highly scalable architecture for caching large amounts of data that can potentially be used by many different users.

HPSS: HPSS is software that manages hundreds of terabytes to petabytes of data on disk and robotic tape libraries. It provides highly flexible and scalable hierarchical storage management that keeps recently used data on disk and less recently used data on tape. HPSS uses cluster and SAN technology to aggregate the capacity and performance of many computers, disks, and tape drives into a single virtual file system of exceptional size and versatility. This approach enables HPSS to meet otherwise unachievable demands on total storage capacity, file sizes, data rates, and number of objects stored. HPSS provides a variety of user and filesystem interfaces, ranging from the ubiquitous vfs, ftp, samba and nfs to the higher-performance pftp, client API, local file mover and third-party SAN.
Developed by:
- IBM Global Services
- Lawrence Livermore National Laboratory
- Los Alamos National Laboratory
- National Energy Research Supercomputer Center (NERSC) at Lawrence Berkeley National Laboratory
- Oak Ridge National Laboratory
- Sandia National Laboratories

LDAP: Short for Lightweight Directory Access Protocol, a set of protocols for accessing information directories. LDAP is based on the X.500 standard but is significantly simpler. Unlike X.500, LDAP supports TCP/IP, which is necessary for any type of Internet access. Because it is a simpler version of X.500, LDAP is sometimes called X.500-lite. Although not yet widely implemented, LDAP should eventually make it possible for almost any application running on virtually any computer platform to obtain directory information, such as addresses and public keys. Because LDAP is an open protocol, applications need not worry about the type of server hosting the directory.

MCAT: The MCAT database is a metadata repository that provides a mechanism for storing information used by the SRB (Storage Resource Broker) system. This includes both internal system data required for running the system and application data about the data sets being brokered by SRB, e.g. your own metadata. SRB makes a clear distinction between these two types of data.

Data Grid Services
- Data access: mechanisms for accessing and managing data and for initiating third-party transfers of data
- Metadata access: mechanisms for accessing and managing information about data

Data Grid Services (figure from loci.cs.utk.edu/dsi/netstore99/docs/presentations/foster-d-slides.pdf)

Data Grid Services: Storage Systems and Data Access
Storage systems:
- Provide functions for creating, destroying, writing and manipulating file instances
- Associate a set of properties, such as name, size and access restrictions, with each file instance
- E.g., a data grid implementation may use SRB to access data

Data Grid Services: Data Access
- APIs define the possible operations on storage systems and file instances
- The API provides a standard interface to storage systems: create, delete, open, close, read, write, and storage-to-storage transfer
- Self-optimizing capability
- Uniform access to heterogeneous systems
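
The idea of one standard interface over heterogeneous storage systems can be sketched as follows. This is a minimal illustration only, not the API defined in the paper: the class names `StorageSystem` and `InMemoryStorage` and the function `transfer` are hypothetical, and open/close are omitted for brevity.

```python
from abc import ABC, abstractmethod


class StorageSystem(ABC):
    """Hypothetical uniform interface; the method names are illustrative."""

    @abstractmethod
    def create(self, name): ...

    @abstractmethod
    def delete(self, name): ...

    @abstractmethod
    def read(self, name): ...

    @abstractmethod
    def write(self, name, data): ...


class InMemoryStorage(StorageSystem):
    """Toy backend standing in for a real storage system."""

    def __init__(self):
        self._files = {}

    def create(self, name):
        if name in self._files:
            raise FileExistsError(name)
        self._files[name] = b""

    def delete(self, name):
        del self._files[name]

    def read(self, name):
        return self._files[name]

    def write(self, name, data):
        if name not in self._files:
            raise FileNotFoundError(name)
        self._files[name] = data


def transfer(src, dst, name):
    """Storage-to-storage copy written only against the uniform interface."""
    dst.create(name)
    dst.write(name, src.read(name))


# Two independent backends; the caller never sees how either stores data.
archive, cache = InMemoryStorage(), InMemoryStorage()
archive.create("run42.dat")
archive.write("run42.dat", b"events")
transfer(archive, cache, "run42.dat")
```

Because `transfer` depends only on the abstract interface, a different backend could be swapped in without changing the calling code, which is the point of uniform access.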

Data Grid Services: Metadata Service
- Covers application metadata, replica metadata and system configuration metadata
- Single interface to access them
  - Pros: uniformity
  - Cons: complex implementation
- Structured as hierarchical and distributed
  - Pros: scalable, no single point of failure, local control

Data Grid Services
- Application metadata: describes the information content represented by the file, the circumstances under which the data was obtained, and the information applications need to process it
- Replica metadata: data used to manage replication of data objects
- System configuration metadata: describes the system, i.e. network connectivity, storage systems, usage policy, etc.
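
A single access interface over the three kinds of metadata might look roughly like the sketch below. Everything here is an assumption made for illustration: `MetadataKind`, `MetadataService`, and the attribute names are hypothetical, and a real deployment would sit on a directory service rather than an in-process dictionary.

```python
from enum import Enum


class MetadataKind(Enum):
    APPLICATION = "application"
    REPLICA = "replica"
    SYSTEM_CONFIGURATION = "system configuration"


class MetadataService:
    """Hypothetical single interface over all three kinds of metadata."""

    def __init__(self):
        # One attribute table per metadata kind.
        self._entries = {kind: {} for kind in MetadataKind}

    def put(self, kind, key, attributes):
        self._entries[kind][key] = dict(attributes)

    def get(self, kind, key):
        return dict(self._entries[kind][key])

    def find(self, kind, **criteria):
        """Return the keys whose attributes match every given criterion."""
        return sorted(
            key
            for key, attrs in self._entries[kind].items()
            if all(attrs.get(name) == value for name, value in criteria.items())
        )


service = MetadataService()
service.put(MetadataKind.APPLICATION, "run42.dat", {"instrument": "detector-A"})
service.put(MetadataKind.APPLICATION, "run43.dat", {"instrument": "detector-B"})
service.put(MetadataKind.SYSTEM_CONFIGURATION, "site-1", {"bandwidth_mbps": 100})

# The same call shape works for every metadata kind.
matches = service.find(MetadataKind.APPLICATION, instrument="detector-A")
```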

Higher-Level Data Grid Components
Replica Management (from I. Foster's slides)
- Collections contain related files
- Logical files describe replicated physical files
- Services for managing replicated file instances:
  - Create / delete
  - Schedule / manage data transfer
  - Register in the replica catalog
  - Metadata display

How Does a Replica Manager Work?
- Maintains a repository (catalogue)
- Entries correspond to logical files or file collections
- Each logical file or collection is associated with one or more physical instances
- The catalogue contains the mapping from logical files to physical instances
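
The catalogue described above is essentially a mapping from logical names to physical locations. The following is a minimal sketch under that reading; the class `ReplicaCatalog`, its methods, and the location strings are hypothetical and do not reproduce any real replica catalogue API.

```python
from collections import defaultdict


class ReplicaCatalog:
    """Hypothetical catalogue: logical files and collections mapped to replicas."""

    def __init__(self):
        self._collections = defaultdict(set)  # collection -> logical file names
        self._locations = defaultdict(set)    # logical file -> physical locations

    def add_to_collection(self, collection, logical_file):
        self._collections[collection].add(logical_file)

    def register(self, logical_file, physical_location):
        self._locations[logical_file].add(physical_location)

    def unregister(self, logical_file, physical_location):
        self._locations[logical_file].discard(physical_location)

    def locate(self, logical_file):
        """All known physical instances of one logical file."""
        return sorted(self._locations.get(logical_file, ()))

    def locate_collection(self, collection):
        """Physical instances for every logical file in a collection."""
        members = sorted(self._collections.get(collection, ()))
        return {name: self.locate(name) for name in members}


catalog = ReplicaCatalog()
catalog.add_to_collection("experiment-1", "run42.dat")
# Made-up location strings, for illustration only.
catalog.register("run42.dat", "site-a:/data/run42.dat")
catalog.register("run42.dat", "site-b:/cache/run42.dat")
```

The sketch only records and reports mappings; it deliberately contains no logic about when to create replicas or which one to use.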

The Replica Manager does not:
- Determine when or where replicas are created
- Decide which replicas are to be used by an application
This keeps policy separate from the replica manager design, making it generic

Replica Selection
- The process of choosing the replica that will optimize a desired performance criterion
- The selection process may initiate creation of a new replica
- Intelligent scheduling determines the appropriate replica, the site for (re)computation, etc.
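
As a rough illustration of selecting a replica by a performance criterion, the sketch below picks the replica with the lowest estimated transfer time. The cost model (latency plus size divided by bandwidth), the `Replica` fields, and all numbers are assumptions chosen for the example, not something the slides or the paper specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    location: str              # hypothetical site label
    bandwidth_mb_per_s: float  # assumed sustained bandwidth to the client
    latency_s: float           # assumed fixed start-up cost


def estimated_transfer_time(replica, size_mb):
    """Assumed cost model: start-up latency plus size over bandwidth."""
    return replica.latency_s + size_mb / replica.bandwidth_mb_per_s


def select_replica(replicas, size_mb, cost=estimated_transfer_time):
    """Pick the replica minimizing the given cost; the criterion is pluggable."""
    if not replicas:
        raise ValueError("no replicas available")
    return min(replicas, key=lambda r: cost(r, size_mb))


replicas = [
    Replica("site-a", bandwidth_mb_per_s=10.0, latency_s=0.05),
    Replica("site-b", bandwidth_mb_per_s=40.0, latency_s=0.5),
]

# Large files favour bandwidth, small files favour low latency.
best_for_large = select_replica(replicas, size_mb=1000.0)
best_for_small = select_replica(replicas, size_mb=1.0)
```

Passing the cost function as a parameter keeps the selection criterion outside the selection mechanism, so a different criterion can be supplied without changing `select_replica`.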

Conclusion
- Implementation experience led to the adoption of collections of logical files
- Implements a computation- and data-intensive Grid architecture
- APIs provide a standard interface for various utilities
- Replica management and metadata services are provided using LDAP

Further Works
Chervenak et al.:
1. Secure, Efficient Data Transport and Replica Management for High-Performance Data-Intensive Computing (2001)
2. High-Performance Remote Access to Climate Simulation Data: A Challenge Problem for Data Grid Technologies (2001)
3. A Replica Location Grid Service Implementation (2004)
4. Applying Peer-to-Peer Techniques to Grid Replica Location Services (2006)
Leanne Guy et al.:
- Replica Management in Data Grids (2002): addressed read/write replica techniques
