Schedulers and Resource Brokers. Grid Computing, B. Wilkinson, 2004.


6d.1 Schedulers and Resource Brokers

6d.2 Scheduler
Job manager submits jobs to scheduler. Scheduler assigns work to resources to achieve specified time requirements.

6d.3 Scheduling
From "Introduction to Grid Computing with Globus," IBM Redbooks.

6d.4 Why scheduling?
Efficient use of Grid resources requires powerful and flexible Grid scheduling. For Grid technology to be successful, there must be automatic features to determine available Grid resources and to coordinate the allocation of these resources in accordance with the requirements, dependencies, and objectives of the user. (From the GGF7 workshop on Grid scheduling.)

6d.5 Scheduling architecture
A current area of study for the Grid. Eventually, there must be a definition of a scheduling architecture:
– The cooperation of different scheduling instances for arbitrary resources available in the grid
– The interaction of local resource management (e.g., PBS, LoadLeveler) and data management

6d.6 Service Level Agreements
Resource SLA (RSLA), i.e., reservation
– A promise that a resource will be available when it is needed
– The client will utilize the promise in subsequent SLAs
Task SLA (TSLA), i.e., execution
– A promise to perform a task
– The task may have complex requirements and may implicitly reference an RSLA

6d.7 SLAs, continued
Binding SLA (BSLA), i.e., a claim
– Binds a resource capability to a TSLA
– May reference an RSLA, or be implicit
– May be created lazily to provision the task

6d.8 Advance Reservation
Requesting actions at times in the future. ("A service level agreement in which the conditions of the agreement start at some agreed-upon time in the future." [2])
[2] "The Grid 2: Blueprint for a New Computing Infrastructure," I. Foster and C. Kesselman, editors, Morgan Kaufmann, 2004.

6d.9 Resource Broker
"A scheduler that optimizes the performance of a particular resource. Performance may be measured by such criteria as fairness (to ensure that all requests for the resources are satisfied) or utilization (to measure the amount of the resource used)." [2]

6d.10 Community Scheduling
Individual users
– Require service
– Have application goals
Community schedulers
– Broker service
– Aggregate scheduling
Individual resources
– Provide service
– Have policy autonomy
– Serve higher-level layers

6d.11 Scheduling in Globus (not)
A fully-fledged scheduler/resource broker is not included in Globus. For example, Globus does not currently have advance reservation. A scheduler/resource broker needs to be provided separately on top of Globus, using the basic services Globus provides.

6d.12 Resource Broker Examples
Condor-G, Nimrod/G, Grid Canada

6d.13 Condor
System first developed at the University of Wisconsin-Madison in the mid-1980s to convert a collection of distributed workstations and clusters into a high-throughput computing facility. Key concept: using the wasted computer power of idle workstations.

6d.14 Condor
Converts collections of distributed workstations and dedicated clusters into a distributed high-throughput computing facility.

6d.15 Features
Include:
– Resource finder
– Batch queue manager
– Scheduler
– Checkpoint/restart
– Process migration

6d.16 Intended to run jobs even if:
– Machines crash
– Disk space is exhausted
– Software is not installed
– Machines are needed by others
– Machines are managed by others
– Machines are far away

6d.17 Uses
Consider the following scenario:
– I have a simulation that takes two hours to run on my high-end computer.
– I need to run it 1000 times, with slightly different parameters each time.
– If I do this on one computer, it will take at least 2000 hours (or about 3 months).
From: "Condor: What it is and why you should worry about it," B. Beckles, University of Cambridge, seminar, June 23, 2004.

6d.18
– Suppose my department has 100 PCs like mine that are mostly sitting idle overnight (say 8 hours a day).
– If I could use them when their legitimate users are not using them, so that I do not inconvenience them, I could get about 800 CPU hours/day.
– This is an ideal situation for Condor: I could do my simulations in 2.5 days.
From: "Condor: What it is and why you should worry about it," B. Beckles, University of Cambridge, seminar, June 23, 2004.

6d.19 How does Condor work?
A collection of machines running Condor is called a pool. Individual pools can be joined together in a process called flocking.
From: "Condor: What it is and why you should worry about it," B. Beckles, University of Cambridge, seminar, June 23, 2004.

6d.20 Machine Roles
Machines have one or more of four roles:
– Central manager
– Submit machine (submit host)
– Execution machine (execute host)
– Checkpoint server

6d.21 Central Manager
Resource broker for a pool. Keeps track of which machines are available, what jobs are running, negotiates which machine will run which job, etc. Only one central manager per pool.

6d.22 Submit Machine
Machine which submits jobs to the pool. There must be at least one submit machine in a pool, and usually more than one.

6d.23 Execute Machine
Machine on which jobs can be run. There must be at least one execute machine in a pool, and usually more than one.

6d.24 Checkpoint Server
Machine which stores all checkpoint files produced by jobs that checkpoint. There can be only one checkpoint server in a pool, and having one is optional.

6d.25 Possible Configuration
– A central manager.
– Some machines that can only be submit hosts.
– Some machines that can only be execute hosts.
– Some machines that can be both submit and execute hosts.

6d.26 (figure)

6d.27 Submitting a job
Job submitted to a submit host. The submit host tells the central manager about the job using Condor's "ClassAd" mechanism, which may include:
– What it requires
– What it desires
– What it prefers, and
– What it will accept

6d.28
1. The central manager monitors the execute hosts, so it knows what is available, what type of machine each execute host is, and what software it has.
2. Execute hosts periodically send a ClassAd describing themselves to the central manager.
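For illustration, a machine ClassAd advertised by an execute host might contain attributes such as the following (a hedged sketch: the host name and values are invented, though OpSys, Arch, Memory, Disk, State and Activity are standard Condor machine attributes; the # annotations are ours, not part of the ad):

MyType = "Machine"
Name = "exechost01.example.edu"
OpSys = "LINUX"
Arch = "INTEL"
Memory = 1024            # Mbytes of RAM
Disk = 20000000          # Kbytes of free disk space
State = "Unclaimed"      # not currently running a Condor job
Activity = "Idle"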

6d.29
3. At times, the central manager enters a negotiation cycle in which it matches waiting jobs with available execute hosts.
4. Eventually the job is matched with a suitable execute host (hopefully).

6d.30
5. The central manager informs the chosen execute host that it has been claimed and gives it a ticket.
6. The central manager informs the submit host which execute host to use and gives it a matching ticket.

6d.31
7. The submit host contacts the execute host, presenting its matching ticket, and transfers the job's executable and data files to the execute host if necessary. (A shared file system is also possible.)
8. When the job finishes, results are returned to the submit host (unless a shared file system is in use between the submit and execute hosts).

6d.32 Connections
The connection between submit and execute host is usually a TCP connection. If the connection dies, the job is resubmitted to the Condor pool. Some jobs might access files and resources on the submit host via remote procedure calls.

6d.33 Checkpointing
Certain jobs can checkpoint, both periodically for safety and when interrupted. If a checkpointed job is interrupted, it will resume from the last checkpointed state when it starts again. Generally no change to source code is needed; the job only needs to be relinked with Condor's standard universe support library (see later).
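Relinking is done with the condor_compile wrapper placed in front of the usual build command, as sketched below (the program name is hypothetical):

# Relink an existing program against Condor's standard universe library
condor_compile gcc -o myProg myProg.c

The resulting executable can then be submitted under the standard universe.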

6d.34 Types of Jobs
Classified according to the environment ("universe") provided. Currently seven environments:
– Standard
– Vanilla
– PVM
– MPI
– Globus
– Java
– Scheduler

6d.35 Standard
For jobs compiled with the Condor libraries. Allows for checkpointing and remote system calls. Must be single-threaded. Not available under Windows.

6d.36 Vanilla
For jobs that cannot be compiled with Condor libraries, and for shell scripts and Windows batch files. No checkpointing or remote system calls.

6d.37 Job Universes, continued
PVM: for PVM programs.
MPI: for MPI programs (MPICH).
Globus: for submitting jobs to resources managed by Globus (version 2.2 and higher).

6d.38
Java: for Java programs (written for the Java Virtual Machine).
Scheduler: a universe not normally used by the end-user. Ignores any requirements and runs the job on the submit host. Never preempted.

6d.39 Directed Acyclic Graph Manager (DAGMan)
Allows one to specify dependencies between Condor jobs.
Example: "Do not run Job B until Job A has completed successfully."
Especially important for jobs working together (as in Grid computing).

6d.40 Directed Acyclic Graph (DAG)
A data structure used to represent dependencies. Each job is a node in the DAG. Each node can have any number of parents and children, as long as there are no loops (acyclic graph).

6d.41 Defining a DAG
A DAG is defined by a .dag file, listing each of the nodes and their dependencies.
Example:
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
(Diamond-shaped DAG: A is the parent of B and C; B and C are the parents of D.)

6d.42 Running a DAG
DAGMan acts as a scheduler managing the submission of jobs to Condor based upon the DAG's dependencies. DAGMan holds jobs and submits them to the Condor queue at the appropriate times.
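Assuming the diamond DAG above has been saved as diamond.dag, it is run with condor_submit_dag, which places DAGMan itself into the Condor queue; a minimal sketch:

# Submit the DAG; DAGMan then submits A, B, C and D as their dependencies allow
condor_submit_dag diamond.dag

# Monitor DAGMan and its node jobs in the queue
condor_q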

6d.43 Job Failures
DAGMan continues until it cannot make progress, and then creates a rescue file holding the current state of the DAG. When the failed job is ready to re-run, the rescue file is used to restore the prior state of the DAG.

6d.44 ClassAd Matchmaking
Used to ensure a job is done according to the constraints of users and owners.
Example of user constraints: "I need a Pentium IV with at least 512 Mbytes of RAM and a speed of at least 3.5 GHz."
Example of machine owner constraints: "Never run jobs owned by Fred."
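As a hedged illustration (the attribute values and user name are ours), the user's side of such constraints would typically go into the job's Requirements expression, while the owner's side would go into the execute machine's START policy in its Condor configuration:

# Job side (submit description file): what the user requires of a machine
Requirements = (Arch == "INTEL") && (Memory >= 512)

# Machine side (Condor configuration on the execute host): the owner's policy
START = (Owner != "fred")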

6d.45 Condor Submit Description File
Describes a job to Condor. Used with the condor_submit command.
Description file example:
# This is a comment, condor submit file
Universe = vanilla
Executable = /home/abw/condor/myProg
Input = myProg.stdin
Output = myProg.stdout
Error = myProg.stderr
Arguments = -arg1 -arg2
InitialDir = /home/abw/condor/assignment4
Queue
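Assuming the description above is saved in a file, say myProg.sub (a hypothetical name), the job is submitted and managed with the standard Condor commands:

# Place the job in the queue; Condor prints the assigned cluster number
condor_submit myProg.sub

# List queued and running jobs; remove a job by its ID (e.g., 26.2, the example ID on the next slide)
condor_q
condor_rm 26.2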

6d.46 Submitting Multiple Jobs
A submit file can specify multiple jobs:
– Queue 500 will submit 500 jobs at once (see the sketch below).
– Condor calls a group of jobs a cluster.
– Each job within a cluster is called a process.
– A Condor job ID is the cluster number, a period, and the process number, for example 26.2.
– A single job is also a cluster, but with a single process (process 0).
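A hedged sketch of a multi-job submit file (the file-naming scheme is ours): the built-in $(Process) macro expands to the process number, so each of the 500 jobs gets its own input and output files:

# Submit 500 jobs in one cluster; $(Process) runs from 0 to 499
Universe   = vanilla
Executable = /home/abw/condor/myProg
Input      = myProg.stdin.$(Process)
Output     = myProg.stdout.$(Process)
Error      = myProg.stderr.$(Process)
Queue 500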

6d.47 Specifying Requirements
A C/Java-like Boolean expression that evaluates to TRUE for a match.
# This is a comment, condor submit file
Universe = vanilla
Executable = /home/abw/condor/myProg
InitialDir = /home/abw/condor/assignment4
Requirements = Memory >= 512 && Disk > …
queue 500
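A related sketch (the attribute values are our own): Requirements states what is acceptable, while the optional Rank expression tells Condor which of the matching machines to prefer, higher values being better:

# This is a comment, condor submit file
Universe     = vanilla
Executable   = /home/abw/condor/myProg
Requirements = (OpSys == "LINUX") && (Memory >= 512)
# Rank: among matching machines, prefer the one with the most RAM
Rank         = Memory
Queue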

6d.48 Summary of Key Condor Features
– High-throughput computing using an opportunistic environment
– Matchmaking
– Checkpointing
– DAG scheduling

6d.49 Condor-G
Grid-enabled version of Condor. Uses the Globus Toolkit for:
– Security (GSI)
– Managing remote jobs on the grid (GRAM)
– File handling and remote I/O (GSI-FTP)
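A hedged sketch of a Condor-G submit description file using the Globus universe (the gatekeeper host name is hypothetical, and the globusscheduler syntax shown is the Globus Toolkit 2 style of the period):

# Condor-G job routed to a remote GT2 gatekeeper and its local PBS job manager
Universe        = globus
globusscheduler = gatekeeper.example.edu/jobmanager-pbs
Executable      = /home/abw/condor/myProg
Output          = myProg.stdout
Error           = myProg.stderr
Queue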

6d.50 Remote execution by Condor-G on Globus-managed resources
From: "Condor-G: A Computation Management Agent for Multi-Institutional Grids," J. Frey, T. Tannenbaum, M. Livny, I. Foster, and S. Tuecke. The figure probably refers to Globus version 2.

6d.51 More Information