Condor Introduction and Architecture for Vanilla Jobs CERN Feb 14 2011

The Condor Project (Established '85)
Research and development in the distributed high throughput computing field. A team of ~35 faculty, full-time staff, and students who:
- Face software engineering challenges in a distributed UNIX/Linux/NT environment
- Are involved in national and international grid collaborations
- Actively interact with academic and commercial entities and users
- Maintain and support large distributed production environments
- Educate and train students

The Condor Team

What does Condor offer to the USCMS community?
- Batch system: robust feature set, fault tolerant, open source, flexible, with a development team dedicated to working closely with the scientific community as priority #1
- Multi-purpose job queue / workflow manager: submit jobs to the local cluster, to the local "campus", or to the grid (a.k.a. Condor-G)
- Grid overlay capable: dynamic Condor glidein pools with the help of a factory (glideinWMS)

Outline
- Condor ClassAds: the "lingua franca" of Condor
- Matchmaking
- Submitting a hello world job
- Condor Architecture: the Condor daemons – what they are, what they do, and who talks to whom
- Condor Connection Brokering

Job
A job is Condor's quantum of work, like a UNIX process. It can be an element of a workflow.

Jobs Have Wants & Needs
Jobs state their requirements and preferences:
Requirements: I require a Linux/x86 platform
Preferences ("Rank"): I prefer the machine with the most memory; I prefer a machine in the chemistry department
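In submit-file terms these wants and needs become Requirements and Rank expressions. A minimal sketch, assuming the standard OpSys, Arch, and Memory machine-ad attributes (the values here are illustrative):
# I require a Linux/x86 platform
Requirements = (OpSys == "LINUX") && (Arch == "INTEL")
# I prefer the machine with the most memory (higher Rank is preferred)
Rank = Memory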

Machines Do Too!
Machines specify:
Requirements: run jobs only when there is no keyboard activity; never run jobs belonging to Dr. Heisenberg
Preferences ("Rank"): I prefer to run Albert's jobs
Custom Attributes: I am a machine in the physics department
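On the machine side, the owner's policy lives in the Condor configuration. A hedged sketch of the examples above, using the real START, RANK, and STARTD_ATTRS configuration knobs (the owner names and the DEPARTMENT attribute are made up for illustration):
# Require: no keyboard activity for 15 minutes, and never Dr. Heisenberg's jobs
START = (KeyboardIdle > 15 * 60) && (Owner != "heisenberg")
# Prefer: Albert's jobs
RANK = (Owner == "albert")
# Custom attribute: advertise which department this machine belongs to
DEPARTMENT = "physics"
STARTD_ATTRS = $(STARTD_ATTRS) DEPARTMENT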

Condor brings them together
[Diagram: a Submit Node running the schedd (where condor_submit is invoked), a Central Manager running the collector and negotiator, and several Execute Nodes each running a startd.]

Condor ClassAds

What are Condor ClassAds?
ClassAds is a language for objects (jobs and machines) to express attributes about themselves and to express what they require/desire in a match (similar to personal classified ads).
Structure: a set of attribute name/value pairs.
Value: a literal (string, bool, int, float) or an expression.
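For instance, a toy machine ad is nothing more than such a set of pairs, where some values are literals and one is an expression (the attribute names follow the usual machine-ad conventions; the values are made up):
Name = "slot1@perdita.cs.wisc.edu"
OpSys = "LINUX"
Memory = 511
IsBusy = (LoadAvg > 0.5)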

ClassAd Expressions
Similar in look to C/C++ or Java: operators, references, functions.
Operators: +, -, *, /, <, <=, >, >=, ==, !=, &&, and || all work as expected.
Functions: if/then/else, string manipulation, regular expression pattern matching, list operations, dates, randomization, …
References: to other attributes in the same ad, or to attributes in an ad that is a candidate for a match.
TRUE==1 and FALSE==0 (guaranteed):
(3 == (2+1)) is identical to 1
(TRUE*30) is identical to 30
(3 == 1) is identical to 0
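Two hedged examples of the function and reference syntax, assuming the built-in ifThenElse and regexp ClassAd functions and standard machine-ad attributes:
# Strongly prefer machines with more than 2 GB of memory
Rank = ifThenElse(TARGET.Memory > 2000, 100, 1)
# Match only slot names of the form slotN@...
Requirements = regexp("^slot[0-9]+@", TARGET.Name)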

ClassAd Types
Condor has many types of ClassAds:
A "Job Ad" represents a job to Condor.
A "Machine Ad" represents a computing resource.
Other types of ads represent other services, users, licenses, etc., to your Condor pool.

The Magic of Matchmaking
Condor evaluates job ads in the context of candidate machine ads, looking for matches on their Requirements and Rank expressions. Two ads match if both their Requirements expressions evaluate to True.
MY.name – the value of attribute "name" in the local ClassAd
TARGET.name – the value of attribute "name" in the match-candidate ClassAd
name – looks for "name" first in the local ClassAd, then in the candidate ClassAd
Requirements is a bool; Rank is a float, where higher is preferred.

Example
Pet Ad:
Type = "Dog"
Requirements = DogLover =?= True
Color = "Brown"
Price = 75
Sex = "Male"
AgeWeeks = 8
Breed = "Saint Bernard"
Size = "Very Large"
Weight = 27
Buyer Ad:
AcctBalance = 100
DogLover = True
Requirements = (Type == "Dog") && (TARGET.Price <= MY.AcctBalance) && (Size == "Large" || Size == "Very Large")
Rank = (Breed == "Saint Bernard")
. . .

Getting Started: Submitting Jobs to Condor
1. Get access to a submit host
2. Choose a "Universe" for your job
3. Make your job "batch-ready" (this includes making your data available to your job)
4. Create a submit description file
5. Run condor_submit to put your job(s) in the queue
6. Relax while Condor manages and watches over your job(s)

Choose the job "Universe"
A universe controls how Condor handles jobs. Condor's many universes include: Vanilla (a.k.a. a regular serial job), Parallel, Grid, Java, VM, Standard.

Hello World Submit File
# Simple condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe = vanilla
# Job's executable
Executable = cosmos
# Job's STDOUT
Output = cosmos.out
# Job's STDIN
Input = cosmos.in
# Put the job in the queue!
Queue 1

condor_submit & condor_q
% condor_submit sim.submit
Submitting job(s).
1 job(s) submitted to cluster 1.
% condor_q
-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     frieda  6/16 06:52   0+00:00:00 I  0   0.0  sim.exe
1 jobs; 1 idle, 0 running, 0 held
%

View the full ClassAd
% condor_q -long
-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
MyType = "Job"
TargetType = "Machine"
ClusterId = 1
QDate = 1150921369
CompletionDate = 0
Owner = "frieda"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
…

Logging your Job's Activities
Create a log of job events by adding to the submit description file:
log = cosmos.log
The log is the life story of a job: it shows all events in the life of the job. It is good practice to always have a log file, and libraries to parse log files are provided.
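One hedged example of putting the log to work: the standard condor_wait tool watches a user log and blocks until the job(s) recorded in it finish, which is handy in scripts:
% condor_wait cosmos.log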

Sample Condor User Log
000 (0101.000.000) 05/25 19:10:03 Job submitted from host: <128.105.146.14:1816>
...
001 (0101.000.000) 05/25 19:12:17 Job executing on host: <128.105.146.14:1026>
...
005 (0101.000.000) 05/25 19:13:06 Job terminated.
    (1) Normal termination (return value 0)

condor_status gives information about the pool:
Name          OpSys  Arch   State     Activ LoadAv Mem  ActvtyTime
perdita.cs.wi LINUX  INTEL  Owner     Idle  0.020  511  0+02:28:42
coral.cs.wisc LINUX  INTEL  Claimed   Busy  0.990  511  0+01:27:21
doc.cs.wisc.e LINUX  INTEL  Unclaimed Idle  0.260  511  0+00:20:04
dsonokwa.cs.w LINUX  INTEL  Claimed   Busy  0.810  511  0+00:01:45
ferdinand.cs. LINUX  INTEL  Claimed   Suspe 1.130  511  0+00:00:55
To inspect full ClassAds: condor_status -long

Condor File Transfer
Condor will transfer files between the submit and execute nodes (eliminating the need for a shared filesystem) if desired:
ShouldTransferFiles
  YES: always transfer files to the execution site
  NO: always rely on a shared filesystem
  IF_NEEDED: Condor will automatically transfer the files if the submit and execute machines are not in the same FileSystemDomain (use the shared file system if available)
When_To_Transfer_Output
  ON_EXIT: transfer the job's output files back to the submitting machine only when the job completes
  ON_EXIT_OR_EVICT: like the above, but also when the job is evicted

Condor File Transfer, cont.
Transfer_Input_Files: the list of files that you want Condor to transfer to the execute machine.
Transfer_Output_Files: the list of files that you want Condor to transfer back from the execute machine. If not specified, Condor will transfer back all new or modified files in the execute directory.

Simple File Transfer Example
# Example submit file using file transfer
Universe = vanilla
Executable = cosmos
Log = cosmos.log
ShouldTransferFiles = YES
Transfer_input_files = cosmos.dat
Transfer_output_files = results.dat
When_To_Transfer_Output = ON_EXIT
Queue

General User Commands
condor_status      View pool status
condor_q           View job queue
condor_submit      Submit new jobs
condor_run         Submit and block
condor_rm          Remove jobs
condor_prio        Intra-user priorities
condor_history     Completed job info
condor_submit_dag  Submit a new DAG
condor_checkpoint  Force a checkpoint
condor_compile     Link the Condor library
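A short hedged session tying a few of these together (the job ID 1.0 is made up):
% condor_q                # what is in my queue?
% condor_prio -p -5 1.0   # lower my own priority for job 1.0
% condor_rm 1.0           # remove job 1.0 from the queue
% condor_history 1.0      # info about the job after it leaves the queue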

Condor Daemons

Condor Daemons – Mix'n Match Components: negotiator, collector, master, shadow, schedd, procd, startd, starter, kbdd, exec

condor_master
You start it; it starts up the other Condor daemons. If a daemon exits unexpectedly, it restarts the daemon and emails the administrator. If a daemon binary is updated (timestamp changed), it restarts the daemon.

condor_master
Provides access to many remote administration commands: condor_reconfig, condor_restart, condor_off, condor_on, etc. It is also the default server for many other commands: condor_config_val, etc.

condor_master
Periodically runs condor_preen to clean up any files Condor might have left on the machine, and emails you notification of deleted files. This is backup behavior; the other daemons clean up after themselves.

condor_procd
Tracks processes. Automatically started as needed; no DAEMON_LIST entry necessary. Works behind the scenes as part of the privilege separation security enhancements.

condor_startd
Represents, to the Condor pool, a machine willing to run jobs. Run it on any machine you want to run jobs on. It enforces the wishes of the machine owner (the owner's "policy").

condor_startd
Starts, stops, and suspends jobs. Spawns the appropriate condor_starter, depending on the type of job. Provides other administrative commands (for example, condor_vacate). Aided by condor_kbdd.

condor_starter
Spawned by the condor_startd; don't add it to DAEMON_LIST. Handles all the details of starting and managing the job: transfers the job's binary to the execute machine, sends back the exit status, etc.

condor_starter
One runs per running job. The default configuration is willing to run one job per CPU.

condor_kbdd Monitors physical keyboard and mouse so the condor_startd can make decisions based on local usage.

condor_schedd
Represents jobs to the Condor pool. Maintains a persistent queue of jobs; the queue is not strictly first-in-first-out (it is priority based). Each machine running a condor_schedd maintains its own independent queue. Run it on any machine you want to submit jobs from.

condor_schedd
Responsible for contacting available machines and spawning waiting jobs, when told to by the condor_negotiator. Services most user commands: condor_submit, condor_rm, condor_q, and also condor_hold and condor_release.

condor_shadow
Represents the job on the submit machine. Spawned by the condor_schedd; don't add it to DAEMON_LIST. Services requests from standard universe jobs for remote system calls, including all file I/O. Makes decisions on behalf of the job, for example: where to store the checkpoint file.

condor_exec.exe A running job. When user executable binaries are transferred to the execution side, they are renamed condor_exec.exe.

condor_collector
Collects information from all the other Condor daemons in the pool. Each daemon sends a periodic update, called a ClassAd, to the collector; old ClassAds are removed after a timeout. Services queries for information, both from other Condor daemons and from users (condor_status).

condor_negotiator
Performs matchmaking in Condor. Pulls the list of available machines and of job queues from the condor_collector, then matches jobs with available machines. Both the job and the machine must satisfy each other's requirements (2-way matching). Also handles user priorities and accounting.

Machine role defined by services launched there
You only have to run the daemons for the services you need to provide. DAEMON_LIST is a comma-separated list of the daemons to start, e.g.:
DAEMON_LIST = MASTER, SCHEDD, STARTD
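So each machine role reduces to one line of configuration; a sketch matching the sample pool two slides below:
# Submit-only node
DAEMON_LIST = MASTER, SCHEDD
# Execute-only node
DAEMON_LIST = MASTER, STARTD
# Regular node (can both submit and execute)
DAEMON_LIST = MASTER, SCHEDD, STARTD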

Central Manager
The Central Manager is the machine running the collector and negotiator:
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
The Central Manager defines a Condor pool.

Sample Condor Pool
[Diagram: six machines and the processes each spawns: a Central Manager (master, collector, negotiator, schedd, startd), a Submit-Only node (master, schedd), two Execute-Only nodes (master, startd), and two Regular Nodes (master, schedd, startd), with ClassAd communication pathways between them.]
Here we have a small pool with six machines. Like most Condor pools, there is a Central Manager, identifiable by the presence of the condor_negotiator and condor_collector. In this case the Central Manager is also allowed to run jobs (presence of a condor_startd), and jobs can be submitted from it (presence of a condor_schedd). This pool has one node useful only for submitting jobs (it runs only the condor_schedd) and two nodes used only for executing jobs (running only the condor_startd). There are two nodes able both to submit and execute jobs (both startd and schedd).

Job Startup

Claiming Protocol
[Diagram: the Central Manager (negotiator, collector), a Submit Machine (schedd, with queued jobs), and an Execute Machine (startd); the schedd establishes a CLAIM on the startd.]
Steps:
1. The startd sends the collector a ClassAd describing itself. (The schedd does as well, but it has nothing interesting to say yet.)
2. The user calls condor_submit to submit a job. The job is handed off to the schedd and condor_submit returns.
3. The schedd alerts the collector that it now has a job waiting.
4. The negotiator asks the collector for a list of machines able to run jobs and of schedd queues with waiting jobs.
5. The negotiator contacts the schedd to learn about the waiting job.
6. The negotiator matches the waiting job with the waiting machine.
7. The negotiator alerts the schedd and the startd that there is a match.
8. The schedd contacts the startd to claim the match.
9. The schedd starts a shadow to monitor the job.
10. The startd starts a starter to run the job.
11. The starter and the shadow contact each other, and the starter starts the job.
12. If the job is using the Condor syscall library (typically through being condor_compiled), it will contact the shadow to access necessary files.

Claim Activation
[Diagram: with the claim in place, the schedd activates it; the schedd spawns a shadow on the submit machine, the startd spawns a starter on the execute machine, and the starter runs the job.]

Repeat until Claim released
[Diagram: as long as the claim is held, the schedd can activate it again for its next job; a new shadow and starter are spawned without going back through the negotiator.]

When is a claim released?
When relinquished by one of the following:
- The lease on the claim is not renewed. Why? The machine powered off, disappeared, etc.
- The schedd. Why? It is out of jobs, it is shutting down, it didn't "like" the machine, etc.
- The startd. Why? Policy regarding claim lifetime, it prefers a different match (via Rank), it is a non-dedicated desktop, etc.
- The negotiator. Why? User priority inversion policy.
- Explicitly via a command-line tool, e.g. condor_vacate.
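For the explicit command-line case, a hedged example: condor_vacate with no arguments vacates the local machine, and with -name it targets a specific startd (the hostname is made up):
% condor_vacate -name node42.example.edu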

Some items to notice
Machines (startds) and submitters (schedds) can dynamically appear and disappear; this is a key for glidein. Scheduling policy can be very flexible (custom attributes) and very distributed. There were lots of network arrows on the previous slides, reflecting the P2P nature of Condor. But what about NATs and firewalls?

CCB: Condor Connection Broker
Condor wants two-way P2P connectivity; with CCB, one-way is good enough. The collector requests reversed connections on behalf of clients.
[Diagram: an execute node configured with CCB_ADDRESS=ccb.host.name; when the submit point wants to connect ("run this job", "transfer files"), the execute node opens a reversed connection instead.]
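Enabling CCB is a single setting in the configuration of the firewalled node; a minimal sketch, using the common choice of pointing it at the pool's own collector:
# Use the pool collector as the CCB broker
CCB_ADDRESS = $(COLLECTOR_HOST)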

Limitations of CCB
The collector (CCB broker) needs to be accessible by everyone, and CCB requires outgoing connectivity from the brokered node. Also, you can't have BOTH the submit and execute points behind different firewalls.
[Diagram: two "no go!" cases, including a submit point and an execute node behind different firewalls with CCB_ADDRESS=ccb1.host and CCB_ADDRESS=ccb2.host respectively.]

Another Submit File Example
# Example submit file using file transfer
Universe = vanilla
Log = cosmos.log
Executable = cosmos
# Do each run in its own subdirectory
Initialdir = Run_$(Process)
# Move files if no shared volume is available
ShouldTransferFiles = IF_NEEDED
Transfer_input_files = cosmos.dat
Transfer_output_files = results.dat
When_To_Transfer_Output = ON_EXIT
# The data dir is advertised by the machines
Requirements = Memory > 2000
Arguments = -datadir $$(CosmosData)
Rank = 100000*KFlops + Memory
# Run 1000 different data sets
Queue 1000
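Note that Initialdir = Run_$(Process) assumes the per-run directories already exist with their input files in place; a hedged shell one-liner to prepare them:
% for i in $(seq 0 999); do mkdir -p Run_$i && cp cosmos.dat Run_$i/; done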

Questions? Thank You!