Resource Management of Grid Computing


Resource Management of Grid Computing -- Juan Chen 陈娟 Ying Tao 陶莺

Overview
– Grid Computing
– Resource Management
– Local Resource Management
– Global Resource Management
– Scheduling
– Instance -- Nimrod/G

Grid Computing (1)

A Grid is a very large-scale, generalized distributed network computing system that can scale to Internet-size environments, with machines distributed across multiple organizations and administrative domains. Four main aspects characterize a grid:
– Multiple Administrative Domains and Autonomy
– Heterogeneity
– Scalability
– Dynamicity or Adaptability

For a Grid to efficiently support a variety of applications, the resource management system (RMS) that is central to its operation must address the above issues, in addition to issues such as fault tolerance and stability.

Grid Computing (2)

Grid computing is concerned with coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The key concept is the ability to negotiate resource-sharing arrangements among a set of participating parties and then to use the resulting resource pool for some purpose. So we can see that the resource management system is the central component of grid computing systems.

Designing a Grid architecture is challenging due to:
– supporting adaptability, extensibility, and scalability;
– allowing systems with different administrative policies to inter-operate while preserving site autonomy;
– co-allocating resources;
– supporting quality of service;
– economy of computations.

A layered Grid architecture and components

Globus Architecture (layered, top to bottom):
– Applications
– Basic libraries and supporting software: mpich, PSEs
– Collective layer: GIISs
– Resource layer: GRAM, GRIS, GSS
– Connectivity layer: GIS, GAA
– Low-level (Fabric) interface: local schedulers such as PBS, Condor, SQMS

Traditional Resource Management

Traditional resource managers were designed and operated under the assumption that:
– they have complete control over a resource;
– they can implement the mechanisms and policies needed for effective use of that resource in isolation.

This is not the case for Grid resource management:
– separate administrative domains;
– resource heterogeneity;
– lack of control over different policies.

What is Grid Resource Management?
– Identifying application requirements and resource specifications
– Matching resources to applications
– Allocating/scheduling and monitoring those resources and applications over time in order to run as effectively as possible

Resource Management System

RMS system abstract structure

A Grid Resource Management System consists of:
– Local resource management system (Resource Layer)
  – The basic resource management unit
  – Provides a standard interface for using remote resources
  – e.g. GRAM, etc.
– Global resource management system (Collective Layer)
  – Coordinates all local resource management systems within multiple or distributed Virtual Organizations (VOs)
  – Provides high-level functionalities to use all resources efficiently: job submission, resource discovery and selection, scheduling, co-allocation, job monitoring, etc.
  – e.g. meta-scheduler, resource broker, etc.
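
To make the two layers concrete, here is a minimal sketch of the split, assuming each site exposes a uniform local-manager interface (as GRAM does over PBS, Condor, and similar schedulers) and a collective-layer broker selects among sites. All class and method names are illustrative, not a real Globus API.

```python
# Illustrative two-layer split: a collective-layer broker coordinates
# across sites; each site hides its local scheduler behind one interface.
from abc import ABC, abstractmethod

class LocalResourceManager(ABC):
    """Resource layer: a standard interface over one site's local scheduler."""

    @abstractmethod
    def available_slots(self) -> int:
        """Report how many processes this site could start right now."""

    @abstractmethod
    def submit(self, job: dict) -> str:
        """Submit a job description; return a site-local job handle."""

class GlobalBroker:
    """Collective layer: discovery, selection, and dispatch across sites/VOs."""

    def __init__(self, sites: list[LocalResourceManager]):
        self.sites = sites

    def submit(self, job: dict) -> str:
        # Resource discovery and selection: keep only the sites that can
        # hold the job, then delegate to the least-loaded one.
        candidates = [s for s in self.sites if s.available_slots() >= job["count"]]
        if not candidates:
            raise RuntimeError("no site can satisfy the request")
        best = max(candidates, key=lambda s: s.available_slots())
        return best.submit(job)
```

The point of the layering is that the broker never talks to PBS or Condor directly; it only ever sees the standard local-manager interface.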

Local Resource Management

The Globus Resource Allocation Manager (GRAM) is responsible for:
1. processing RSL specifications representing resource requests, by either denying the request or by creating one or more processes (a "job") that satisfy that request;
2. enabling remote monitoring and management of jobs created in response to a resource request;
3. periodically updating the MDS information service with information about the current availability and capabilities of the resources that it manages.
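
For illustration, the fragment below shows the shape of an RSL resource request of the kind GRAM processes. The attribute names (executable, arguments, count, maxTime, stdout) follow classic Globus RSL usage, but the specific request and the submission helper are assumptions for this sketch, not a real Globus client.

```python
# An RSL fragment of the kind GRAM parses, wrapped in Python for display.
RSL_REQUEST = """
& (executable = "/usr/local/bin/simulate")
  (arguments  = "-steps" "10000")
  (count      = 8)
  (maxTime    = 60)
  (stdout     = "sim.out")
"""

def submit_to_gram(contact: str, rsl: str) -> str:
    """Hypothetical stand-in for a GRAM client call. GRAM would either
    deny this request or create the processes (a "job") that satisfy it,
    returning a job handle for remote monitoring and management."""
    raise NotImplementedError("stand-in, not a real Globus client")

# job = submit_to_gram("cluster.example.org:2119", RSL_REQUEST)
```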

Major components of the GRAM implementation

Resource co-allocator

It is often the case that a metacomputing application requires several resources to be allocated simultaneously. In these cases, a resource broker produces a multirequest and co-allocation is required. The role of a co-allocator is to split a request into its constituent components, submit each component to the appropriate resource manager, and then provide a means for manipulating the resulting set of resources as a whole.

Types of co-allocation

A range of different co-allocation services can be constructed:
– require all resources to be available before the job is allowed to proceed, and fail globally if failure occurs at any resource;
– allocate at least N out of M requested resources, then return;
– return immediately, but gradually return more resources as they become available.
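
Here is a minimal sketch of the first two styles, assuming each local resource manager exposes try_allocate()/release(); these names are illustrative and do not correspond to the actual Globus co-allocation API.

```python
def allocate_atomic(managers, parts):
    """All-or-nothing: fail globally if any component fails."""
    held = []
    for mgr, part in zip(managers, parts):
        handle = mgr.try_allocate(part)
        if handle is None:
            for h in held:   # roll back every component acquired so far
                h.release()
            raise RuntimeError("co-allocation failed; all components released")
        held.append(handle)
    return held

def allocate_n_of_m(managers, parts, n):
    """Partial: succeed once at least n of the m components are held."""
    held = [h for mgr, part in zip(managers, parts)
            if (h := mgr.try_allocate(part)) is not None]
    if len(held) < n:
        for h in held:
            h.release()
        raise RuntimeError(f"fewer than {n} of {len(parts)} components available")
    return held
```

The third style would hand back whatever partial set it holds immediately and then add further components via callbacks as they become available.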

Scheduling

Scheduling is the matching of application requirements and available resources.
– System-level schedulers focus on throughput and generally do not consider application requirements in scheduling decisions.
– Application-specific schedulers have been very successful for individual applications, but are not easily applied to new applications.
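
The contrast can be made concrete with a small sketch; both functions and the perf_model callback are illustrative assumptions, not any particular scheduler's API.

```python
def system_level_schedule(queue, free_nodes):
    """Throughput-oriented: dispatch jobs first-come-first-served onto any
    free node, ignoring per-application requirements."""
    placements = []
    while queue and free_nodes:
        placements.append((queue.pop(0), free_nodes.pop()))
    return placements

def application_specific_schedule(job, resources, perf_model):
    """Application-aware: rank resources with a per-application performance
    model and pick the one with the best predicted runtime."""
    return min(resources, key=lambda r: perf_model(job, r))
```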

Grid Application Development Software Project (GrADS) Scheduling

The Grid Application Development Software Project provides a system to simplify Grid application development. I will focus on application scheduling in it and present the three scheduling approaches developed in GrADS:
– Launch-time scheduling is the pre-execution determination of an initial matching of application requirements and available resources.
– Rescheduling involves making modifications to that initial matching in response to dynamic system or application changes.
– Meta-scheduling involves the coordination of schedules for multiple applications running on the same Grid at once.

Grid Application Development Software Architecture

The Program Preparation System (PPS) handles application development, composition, and compilation. To begin the development process, the user interacts with a high-level interface called a problem solving environment (PSE) to assemble their Grid application source code and integrate it with software libraries. The resulting application is passed to a specialized GrADS compiler. The compiler performs application analysis and partial compilation into intermediate representation code and generates a configurable object program (COP). This work may be performed off-line, as the COP is a long-lived object that may be re-used for multiple runs of the application. The COP must encapsulate all results of the PPS phase for later usage; for example, application performance models and partially compiled code called intermediate representation code.

The Program Execution System (PES) provides on-line resource discovery, scheduling, binding, application performance monitoring, and rescheduling. To execute an application, the user submits parameters of the problem, such as problem size, to the GrADS system. The application COP is retrieved and the PES is invoked. At this stage the scheduler interacts with the Grid run-time system to determine which resources are available and what performance can be expected of those resources. The scheduler then uses application requirements specified in the COP to select an application-appropriate schedule (a resource subset and a mapping of the problem data or tasks onto those resources). The binder is then invoked to perform a final, resource-specific compilation of the intermediate representation code. Next, the executable is launched on the selected Grid resources, and a real-time performance monitor is used to track program performance and detect violation of performance guarantees. Performance guarantees are formalized in a performance contract [35]. In the case of a performance contract violation, the rescheduler is invoked to evaluate alternative schedules.

Launch-time scheduling

The launch-time scheduler is called just before application launch to determine how the current application execution should be mapped to available Grid resources. The resulting schedule specifies the list of target machines, the mapping of virtual application processes to those machines, and the mapping of application data to processes.

GrADS launch-time scheduling architecture

Launch-time scheduling is a central component in GrADSoft and is key to achieving application performance; this architecture is designed to work in concert with many other GrADS components.

Launch-time scheduling proceeds as follows. A user submits a list of machines to be considered for this application execution (the machine list). The list is passed to the information services client; this component contacts the MDS and the NWS to retrieve information about Grid resources. The Monitoring and Discovery Service (MDS) and the Network Weather Service (NWS) [36] are two on-line services that collect and store Grid information that may be of interest to tools such as schedulers. The resulting Grid information is passed to the filter resources component. This component retrieves the application requirements and filters out resources that cannot be used for the application. The remaining Grid information is then passed to the search procedure. This procedure searches through available resources to find application-appropriate subsets; this process is aided by the application performance model and mapper. The search procedure eventually selects a final schedule, which is then used to launch the application on the selected Grid resources.
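
A condensed sketch of that pipeline follows, with hypothetical query_mds/query_nws stubs standing in for the real MDS and NWS services; none of the names here are the actual GrADSoft API.

```python
from itertools import combinations

def query_mds(machine):   # stand-in for an MDS resource lookup
    raise NotImplementedError

def query_nws(machine):   # stand-in for an NWS load/bandwidth forecast
    raise NotImplementedError

def launch_time_schedule(machine_list, requirements, perf_model, mapper):
    # Information services client: gather static (MDS) and dynamic (NWS) data.
    info = {m: {**query_mds(m), **query_nws(m)} for m in machine_list}
    # Filter resources: drop machines the application cannot use at all.
    usable = [m for m in machine_list
              if info[m]["memory_mb"] >= requirements["min_memory_mb"]]
    # Search procedure: enumerate candidate subsets, ask the mapper for a
    # data/process mapping, and keep the schedule the model predicts fastest.
    best_schedule, best_time = None, float("inf")
    for size in range(1, len(usable) + 1):
        for subset in combinations(usable, size):
            mapping = mapper(requirements, subset)
            predicted = perf_model(mapping, {m: info[m] for m in subset})
            if predicted < best_time:
                best_schedule, best_time = mapping, predicted
    return best_schedule
```

A real search procedure would use the performance model and mapper to prune this space rather than enumerating every subset, which grows exponentially with the number of machines.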

The drawbacks of launch-time scheduling
– When two applications are submitted to GrADS at the same time, scheduling decisions will be made for each application ignoring the presence of the other.
– If the launch-time scheduler determines there are not enough resources for the application, it cannot make further progress.
– A long-running job in the system can severely impact the performance of numerous new jobs entering the system.

The root cause of these and other problems is the absence of a metascheduler.

Rescheduling

Rescheduling can include changing the machines on which the application is executing, or changing the mapping of data and/or processes to those machines, in response to changes in load or application requirements. Rescheduling can be implemented in two ways:
– Application Migration
– Process Swapping

GrADS rescheduling architecture

Figure 1.6 shows the rescheduling architecture and depicts an application already executing on N processors. The performance contract specifies the performance expected for the current application execution. While the application is executing, application sensors are co-located with application processes to monitor application progress. Meanwhile, resource sensors are located on the machines on which the application is executing, as well as on the other machines available to the user for rescheduling. Application sensor data and the performance contract are passed to the contract monitor, which compares achieved application performance against expectations. When performance falls below expectations, the contract monitor signals a violation and contacts the rescheduler.

Upon a contract violation, the rescheduler must determine whether rescheduling is profitable, and if so, what new schedule should be used. Data from the resource sensors can be used to evaluate various schedules, but the rescheduler must also consider (i) the cost of moving the application to a new execution schedule and (ii) the amount of work remaining in the application that can benefit from a new schedule. To initiate schedule modification, the rescheduler contacts the rescheduling actuators located on each processor. These actuators use some mechanism to initiate the actual migration or load balancing.

Application support for migration or load balancing is the most important part of the rescheduling system. The most transparent migration solution would involve an external migrator that, without application knowledge, freezes execution, records important state such as register values and message queues, and restarts the execution on a new processor. Unfortunately, this is not yet feasible as a general solution in heterogeneous environments. GrADS therefore relies on application-level support: the application developer simply adds a few lines to their source code and then links their application against the GrADS MPI rescheduling libraries in addition to their regular MPI libraries.
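
The two decisions at the heart of this loop can be sketched in a few lines; the 20% tolerance and the rate-based cost model below are assumptions for illustration, not values taken from GrADS.

```python
def contract_violated(achieved_rate, contracted_rate, tolerance=0.2):
    """Contract monitor: signal a violation when measured progress falls
    more than `tolerance` below the contracted rate."""
    return achieved_rate < (1.0 - tolerance) * contracted_rate

def worth_rescheduling(remaining_work, current_rate, candidate_rate,
                       migration_cost):
    """Rescheduler: move only if the time saved on the remaining work
    outweighs the cost of moving to the new schedule."""
    time_if_staying = remaining_work / current_rate
    time_if_moving = migration_cost + remaining_work / candidate_rate
    return time_if_moving < time_if_staying
```

This captures the text's point directly: a faster candidate schedule is only worth taking when the remaining work is large enough to amortize the migration cost.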

Metascheduling

The goal of metascheduling is to investigate scheduling policies that take into account both the needs of the application and the overall performance of the system. The metascheduler possesses global knowledge of all applications in the system and tries to balance their needs. The metascheduler is implemented by the addition of four components: the database manager, the permission service, the contract negotiator, and the rescheduler.

Metascheduler and interactions

The database manager acts as a repository for information about all applications in the system. As shown in the figure, the application stores information in the database manager at different stages of execution, and the other metascheduling components query the database manager for the information they need to make scheduling decisions.

In metascheduling, after the filter resources component generates a list of resources suitable for problem solving, this list is passed, along with the problem parameters, to the permission service. The permission service determines whether the filtered resources have adequate capacities for problem solving and grants or rejects permission for the application to proceed with the other stages of execution. If the capacities of the resources are not adequate, the permission service tries to proactively stop a large resource-consuming application to accommodate the new application.

Before launching the application, the final schedule, along with performance estimates, is passed as a performance contract to the contract developer, which in turn passes the information to the metascheduling component, the contract negotiator. The contract negotiator is the primary component that balances the needs of different applications. It can reject a contract if it determines that the application has made a scheduling decision based on outdated resource information. The contract negotiator also rejects the contract if it determines that approving it would severely impact the performance of an executing application. Finally, the contract negotiator can preempt an executing application to improve the performance contract of a new application. Thus the major objectives of the contract negotiator are to provide high performance to the individual applications and to increase the throughput of the system.
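
The contract negotiator's three decisions can be sketched under an assumed database-manager interface; every attribute and method name here is illustrative, not the GrADS implementation.

```python
def negotiate_contract(contract, db):
    """Sketch of the contract negotiator's decision sequence."""
    # 1. Reject if the schedule was built from outdated resource information.
    if contract.resource_snapshot < db.last_resource_update:
        return "rejected: stale resource information, reschedule"
    # 2. Reject if approval would severely impact an executing application.
    for app in db.running_applications():
        if db.predicted_slowdown(app, contract) > app.tolerable_slowdown:
            return "rejected: would degrade a running application"
    # 3. Optionally preempt a large consumer to admit the new application.
    if db.needs_preemption(contract):
        victim = db.largest_preemptible_application()
        if victim is not None:
            victim.preempt()
            return f"accepted after preempting {victim.name}"
    return "accepted"
```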

Introduction to Nimrod

A large-scale parameter study of a simulation is well suited to high-throughput computing: it involves the execution of a large number of tasks (task farms) over a range of parameters. The Nimrod system is designed to address the complexities associated with parametric computing on clusters of distributed systems. However, as implemented, Nimrod is unsuitable for the large-scale, dynamic context of computational grids, where resources are scattered across several administrative domains, each with its own user policies, its own queuing system, and varying access costs and computational power.

Nimrod/G: an architecture for a Resource Management and Scheduling System in a Global Computational Grid.

Nimrod/G

The shortcomings of Nimrod are addressed by a new system called Nimrod/G. It uses the Globus middleware services for dynamic resource discovery and for dispatching jobs over computational grids. The architecture of Nimrod/G and its key components are shown as follows:

The architecture of Nimrod/G

Scheduling and Computational Economy of Nimrod/G

The Nimrod/G system has integrated a computational economy as part of its scheduling system. This can be handled in two ways:
– The system can work on the user's behalf and try to complete the assigned work within a given deadline and cost (the approach of the early Nimrod/G prototype).
– The user can enter into a contract with the system and pose requests. The advantage of this approach is that the user knows before the experiment is started whether the system can deliver the results and what the cost will be. (This is rather complex, and needs grid middleware services for resource reservation and broker services for negotiating cost.)

The important parameters of a computational economy that can influence the way resource scheduling is done are:
– Resource cost (set by its owner)
– Price (what the user is willing to pay)
– Deadline (the period by which the application execution needs to be completed)

The scheduler can use all sorts of information gathered by a resource discoverer and can also negotiate with resource owners to get the best "value for money".
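
As a rough illustration of how deadline and budget interact in such an economy, here is a greedy selection sketch; the resource attributes (cost_per_hour, speed_factor) and the task model are assumptions for this example, not the actual Nimrod/G scheduling algorithm.

```python
def deadline_budget_schedule(tasks, resources, deadline_hours, budget):
    """Greedy sketch: run each task on the cheapest resource that can still
    finish it before the deadline without exhausting the budget."""
    plan, spent = [], 0.0
    for task in tasks:
        feasible = []
        for r in resources:
            runtime = task.base_hours / r.speed_factor   # faster machine, shorter run
            cost = runtime * r.cost_per_hour             # owner-set price per hour
            if runtime <= deadline_hours and spent + cost <= budget:
                feasible.append((cost, r))
        if not feasible:
            raise RuntimeError("deadline/budget cannot both be met; renegotiate")
        cost, choice = min(feasible, key=lambda pair: pair[0])
        plan.append((task, choice))
        spent += cost
    return plan, spent
```

Under the contract-based mode described above, a check like this would run before the experiment starts, so the user learns up front whether the system can deliver the results and at what cost.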

Thank you