Presenter: Dipesh Gautam
Introduction
Why Data Grid?
High Level View
Design Considerations
Data Grid Services
Topology
Grids and Cloud
Convergence of Grid and Cloud
Vertical RDBMS
Benefits of column-oriented layout

Data Grid: an architecture or set of services that gives individuals or groups of users the ability to access, modify, and transfer large amounts of geographically distributed data.
The data may be replicated throughout the grid, outside its original administrative domain.
Interaction between users and the data is handled and controlled by the data grid middleware.

Large dataset sizes
Geographic distribution of users and resources
Computationally intensive analysis
No other architecture lets us apply these technologies to large-scale application domains

Mechanism Neutrality
◦ Designed to be as independent as possible of low-level mechanisms
◦ Defines interfaces that encapsulate the peculiarities of specific storage systems
Compatibility with Grid Infrastructure
◦ Takes advantage of fundamental Grid infrastructure
◦ Compatible with lower-level Grid mechanisms
Uniformity of Information Infrastructure
◦ The same data model and interface are used to access the grid's metadata

Middleware provides the following services:
◦ Universal namespace
◦ Data transport service
◦ Data access service
◦ Data replication service
◦ Resource management system (RMS)

Many systems and networks are connected within a grid
Separate systems within the grid use different file naming conventions
Physical file names alone do not solve the problem of locating data
A universal namespace provides logical file names
The Storage Resource Broker provides a service that maps between logical and physical file names
When a logical file name is requested, all matching physical file names are returned and the end user chooses the appropriate replica

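The logical-to-physical mapping can be sketched as a small in-memory catalog. This is only an illustration: the `ReplicaCatalog` class, the `lfn://` names, and the host names are hypothetical and are not the Storage Resource Broker API.

```python
from collections import defaultdict


class ReplicaCatalog:
    """Hypothetical catalog mapping one logical name to many physical names."""

    def __init__(self):
        self._replicas = defaultdict(list)

    def register(self, logical_name, physical_name):
        # Record a physical copy of the logical file, ignoring duplicates.
        if physical_name not in self._replicas[logical_name]:
            self._replicas[logical_name].append(physical_name)

    def lookup(self, logical_name):
        # Return every known physical name; the caller picks a replica.
        return list(self._replicas.get(logical_name, []))


catalog = ReplicaCatalog()
# Assumed example names; each site keeps its own naming convention.
catalog.register("lfn://physics/run42.dat", "gsiftp://site-a.example.org/data/0042.raw")
catalog.register("lfn://physics/run42.dat", "gsiftp://site-b.example.org/store/run_42.dat")

matches = catalog.lookup("lfn://physics/run42.dat")
chosen = matches[0]  # the end user (or a policy) chooses the appropriate replica
```

The catalog returns all matches rather than one, which mirrors the slide: choosing the replica is left to the requester.
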
Middleware service for data transfer
Fault tolerance is ensured by treating each requested data transfer as atomic
◦ Data transfer is resumed after each interruption until all requested data is received
◦ Many possible strategies:
    ◦ Restarting the entire transmission from the beginning
    ◦ Resuming from the point of interruption, e.g. GridFTP resends data from the last acknowledged byte instead of restarting the entire transfer
Provides low-level access and connections between hosts for file transfer
Provides I/O functions that let users see remote files as if they were local to their system
Provides a high-level abstraction of data access and transfer between different systems, hiding the complexity and presenting the data to the user as a unified source

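The "resume from the last acknowledged byte" strategy can be sketched with a simulated unreliable link. This is not GridFTP code: `FlakyChannel`, `transfer`, the chunk size, and the retry limit are all assumptions made for illustration.

```python
class FlakyChannel:
    """Hypothetical link that drops the connection on selected send calls."""

    def __init__(self, fail_on_calls):
        self.fail_on_calls = set(fail_on_calls)
        self.calls = 0
        self.received = bytearray()

    def send(self, offset, chunk):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise ConnectionError("link dropped")
        # The receiver stores the chunk at the given offset and
        # acknowledges the position of the next byte it expects.
        self.received[offset:offset + len(chunk)] = chunk
        return offset + len(chunk)


def transfer(data, channel, chunk_size=4, max_retries=10):
    """Send data, resuming from the last acknowledged byte after each failure."""
    acked = 0
    retries = 0
    while acked < len(data):
        chunk = data[acked:acked + chunk_size]
        try:
            acked = channel.send(acked, chunk)
        except ConnectionError:
            retries += 1
            if retries > max_retries:
                raise
            # acked is unchanged, so the next attempt resends only this chunk.
    return retries


data = b"geographically distributed data"
channel = FlakyChannel(fail_on_calls={2, 5})
retries = transfer(data, channel)
```

Restarting from the beginning would instead reset `acked` to 0 in the `except` branch, resending everything already delivered.
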
Works with the data transport service to provide security, access control, and management of data transfers within the grid
Provides a security service to authenticate users
Provides an authorization service that controls access through mechanisms ranging from simple file permissions to Access Control Lists (ACLs) and Role-Based Access Control (RBAC)
Provides an encryption service to protect the confidentiality of data in transit (e.g. SSL)

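A minimal sketch of how an authorization check might combine ACL entries with role-based permissions. The `AccessController` class, the role names, and the resource names are hypothetical, and roles here apply to every resource for brevity, which a real deployment would likely refine.

```python
# Hypothetical role definitions; a real grid would load these from policy.
ROLE_PERMISSIONS = {
    "reader": {"read"},
    "curator": {"read", "write"},
}


class AccessController:
    def __init__(self):
        self.acl = {}    # resource -> {user: set of permitted actions}
        self.roles = {}  # user -> set of role names

    def grant(self, resource, user, action):
        # ACL entry: this user may perform this action on this resource.
        self.acl.setdefault(resource, {}).setdefault(user, set()).add(action)

    def assign_role(self, user, role):
        self.roles.setdefault(user, set()).add(role)

    def is_allowed(self, user, resource, action):
        # Check the per-resource ACL first, then fall back to the user's roles.
        if action in self.acl.get(resource, {}).get(user, set()):
            return True
        return any(
            action in ROLE_PERMISSIONS.get(role, set())
            for role in self.roles.get(user, set())
        )


controller = AccessController()
controller.grant("lfn://climate/run42.dat", "alice", "write")
controller.assign_role("bob", "reader")
```

Authentication and encryption are out of scope for this sketch; it assumes the user identity has already been verified.
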
Why replication?
◦ Scalability
◦ Fast access
◦ User collaboration
Replicas are often placed close to the sites where users need them
Replication is controlled by a replica management system
The replica management system determines where replicas are needed based on requests
Replicas are kept up to date by propagating changes made at one node to the other nodes in the grid

Centralized model: a single master replica updates all others
Decentralized model: all peers update each other
The topology of node placement influences the update strategy

Static replication
◦ Uses a fixed set of replica nodes, with no dynamic changes to the files being replicated
Dynamic replication
◦ Based on the popularity of data
◦ If the number of requests exceeds the replication threshold, a replica is placed on the server that directly serves the client, provided storage is available
◦ Replicas with a null access value are deleted dynamically
Adaptive replication
◦ A dynamic threshold is computed from the request arrival rates of clients over a period of time
◦ Replicas that fall below the threshold and were not created in the current replication interval can be removed
Fair-share replication
◦ Based on the access load and storage load of candidate servers
◦ The server with the lower access load is selected, because placing the replica on a server with a higher access load degrades performance for all clients
◦ Among candidate servers with the same access load, the one with the lower storage load is selected
Many more replica placement strategies exist

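The threshold test from dynamic replication and the server choice from fair-share replication can be sketched as follows. The `Server` fields, the function names, and the free-space check inside the fair-share selection are assumptions for illustration, not a published algorithm.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    access_load: int     # requests currently served (assumed unit)
    storage_load: float  # fraction of storage in use, 0.0 to 1.0
    free_space: int      # bytes available


def needs_replica(request_count, threshold):
    # Dynamic replication: replicate once requests exceed the threshold.
    return request_count > threshold


def pick_fair_share_server(candidates, file_size):
    # Assumption: servers without room for the file are not candidates.
    eligible = [s for s in candidates if s.free_space >= file_size]
    if not eligible:
        return None
    # Lowest access load wins; ties are broken by lowest storage load.
    return min(eligible, key=lambda s: (s.access_load, s.storage_load))


servers = [
    Server("site-a", access_load=10, storage_load=0.5, free_space=500),
    Server("site-b", access_load=3, storage_load=0.8, free_space=500),
    Server("site-c", access_load=3, storage_load=0.2, free_space=100),
]

target = None
if needs_replica(request_count=12, threshold=10):
    target = pick_fair_share_server(servers, file_size=80)
```

Sorting on the tuple `(access_load, storage_load)` expresses both fair-share rules in one comparison.
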
Core functionality of the data grid
Manages all actions related to storage resources
Fulfils user and application requests for data resources based on the type of request and on policies
Schedules the creation of replicas
Enforces policy and security across data grid resources, including the authentication, authorization, and access controls that allow systems with different administrative policies to interoperate
Enforces system fault tolerance and stability requirements

Various topologies have been used to address the needs of the scientific community
Four major types of topology:
◦ Federation topology
◦ Monadic topology
◦ Hierarchical topology
◦ Hybrid topology

Allows each institution to keep control over its data
An institution that receives a request from an authorized institution decides whether to send the data to the requester
The federation can be loosely or tightly integrated
Preferred by institutions that wish to share data from already existing systems

All collected data is fed into a central repository
The central repository responds to all queries for data
There are no replicas in this topology
Well suited when all access to the data is local or within a single region with high-speed connectivity

Suited to collaborations in which data from a single source is distributed to multiple locations around the world

Any combination of the other topologies
Suited to researchers working on projects who want to share their results and further research by making them readily available for collaboration

Grid
◦ Grid refers to distributed computing in science and engineering
◦ In grid computing, virtual organizations share computing resources over a network
◦ Scientific research, collaboration
◦ Shares local resources
◦ Heterogeneous, real resources
◦ Geographically distributed, locally owned and managed
Cloud
◦ Cloud refers to a computer network in the context of network management
◦ In cloud computing, anybody can access data and compute services over the internet
◦ Web services, business apps
◦ Makes huge data centers available
◦ Homogeneous, virtualized resources
◦ Geographically distributed, centrally owned and managed

Users should consider the interoperability standards adopted by both grid and cloud service providers
Interoperating clouds look like a grid

Column-Oriented DBMS
◦ Stores data column-wise instead of row-wise
◦ In a row-oriented DBMS, the values in each row are serialized together and stored as: 1, Smith, Joe, 40000; 2, Jones, Mary, 50000; 3, Johnson, Cathy, 44000;
◦ In a column-oriented DBMS, the values in each column are serialized together: 1, 2, 3; Smith, Jones, Johnson; Joe, Mary, Cathy; 40000, 50000, 44000;

EmpId  Lastname  Firstname  Salary
1      Smith     Joe        40000
2      Jones     Mary       50000
3      Johnson   Cathy      44000

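The two serialization orders can be reproduced with a few lines of Python. This is a sketch of the layout idea only; the function names are made up for illustration, and a real DBMS stores binary pages rather than text.

```python
# The employee table from the slide, one tuple per row.
ROWS = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]


def serialize_row_oriented(rows):
    # All values of one row are written together, row after row.
    return " ".join(", ".join(str(v) for v in row) + ";" for row in rows)


def serialize_column_oriented(rows):
    # zip(*rows) transposes the table, so each group holds one column.
    columns = zip(*rows)
    return " ".join(", ".join(str(v) for v in col) + ";" for col in columns)


row_layout = serialize_row_oriented(ROWS)
column_layout = serialize_column_oriented(ROWS)
```
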
Efficient when an aggregate must be computed over many rows but only over a notably smaller subset of columns
Efficient when writing a column for which new values for all rows are supplied at once
Suited to Online Analytical Processing (OLAP)-like workloads, which involve a smaller number of highly complex queries over all the data, possibly terabytes in size

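The aggregation benefit can be illustrated by counting how many values each layout touches when averaging a single column. The counters are a stand-in for I/O cost and the function names are hypothetical; a real engine reads disk pages, not Python lists.

```python
row_store = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]

column_store = {
    "EmpId": [1, 2, 3],
    "Lastname": ["Smith", "Jones", "Johnson"],
    "Firstname": ["Joe", "Mary", "Cathy"],
    "Salary": [40000, 50000, 44000],
}


def average_salary_rows(rows):
    # A row store fetches whole rows, even though only Salary is needed.
    values_read = 0
    total = 0
    for row in rows:
        values_read += len(row)
        total += row[3]
    return total / len(rows), values_read


def average_salary_columns(store):
    # A column store reads just the Salary column.
    salaries = store["Salary"]
    return sum(salaries) / len(salaries), len(salaries)


def replace_column(store, name, new_values):
    # Writing a whole column at once touches only that column's storage.
    if len(new_values) != len(store[name]):
        raise ValueError("new column must supply a value for every row")
    store[name] = list(new_values)


avg_rows, read_rows = average_salary_rows(row_store)
avg_cols, read_cols = average_salary_columns(column_store)
```

Both functions return the same average, but the row-oriented version touches every value in the table while the column-oriented version touches one value per row.
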
Martin Antony Walker, Grids and Clouds, Grids+and+Clouds+OGF25+MAW.pdf
004/documents/Course-DataGrid.ppt
oriented_DBMS