Douglas Thain Computer Sciences Department University of Wisconsin-Madison (In Bologna for June 2000) Condor by Example

Outline › Overview › Submitting Jobs, Getting Feedback › Setting Requirements with ClassAds › Using LOTS of Machines › Which Universe? › Conclusion

What is Condor? › Condor converts a collection of unrelated workstations into a high-throughput computing facility. › Condor uses matchmaking to make sure that everyone is happy.

What is High-Throughput Computing? › High-performance: CPU cycles/second under ideal circumstances. • “How fast can I run simulation X on this machine?” › High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances. • “How many times can I run simulation X in the next week using all available machines?”

What is High-Throughput Computing? › Condor does whatever it takes to run your jobs, even if some machines… • Crash! • Are disconnected • Run out of disk space • Are added to or removed from the pool • Are put to other uses

What is Matchmaking? › Condor uses matchmaking to make sure that work gets done within the constraints of both users and owners. › Users (jobs) have constraints: • “I need an Alpha with 256 MB RAM” › Owners (machines) have constraints: • “Only run jobs when I am away from my desk and never run jobs owned by Bob.”

“What can Condor do for me?” Condor can… › …increase your throughput. › …do your housekeeping. › …improve reliability. › …give performance feedback.

The INFN Condor Pool

How many machines now? › The map is out of date! › The system is always changing. › First example: What machines (and of what kind) are in the pool now?

First Things First › Set your path:

setenv PATH /library/condor_nfs/XXX/bin

› XXX should be your system: OSF1, LINUX, SOLARIS26, HPUX10 …
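
For example, on an Intel/Linux workstation you might run (a sketch assuming the csh-style setenv shown above, and that you want to keep the rest of your existing search path):

setenv PATH /library/condor_nfs/LINUX/bin:${PATH}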

How Many Machines?

% condor_status
Name           OpSys        Arch   State      Activity  LoadAv  Mem
lxpc1.na.infn  LINUX-GLIBC  INTEL  Unclaimed  Idle
axpd21.pd.inf  OSF1         ALPHA  Owner      Idle
vlsi11.pd.inf  SOLARIS26    SUN4u  Claimed    Busy

                    Machines  Owner  Claimed  Unclaimed  Matched  Preempting
ALPHA/OSF
INTEL/LINUX
INTEL/LINUX-GLIBC
SUN4u/SOLARIS
SUN4u/SOLARIS
SUN4u/SOLARIS
SUN4x/SOLARIS
Total

Machine States › Most machines will be: • Owner: The machine’s owner is busy at the console, so no Condor jobs may run. • Claimed: Condor has selected the machine to run jobs for other users.

Machine States › Only a few should be: • Unclaimed: The owner is gone, but Condor has not yet selected the machine. • Matched: Between claimed and unclaimed. • Preempting: Condor is busy removing a job.

More Examples

% condor_status -help
% condor_status -avail
% condor_status -run
% condor_status -total
% condor_status -pool condor.cs.wisc.edu

Submitting Jobs

Steps to Running a Job › Re-link for Condor. › Submit the job. › Watch the progress. › Receive e-mail when done.

Example Job Compute the nth Fibonacci number. Fib(40) takes about one minute to compute on an Alpha.

% ./fib 40
fib(40) = 102334155

#include <stdio.h>
#include <stdlib.h>

int fib( int x ) {
    if( x<=0 ) return 0;
    if( x==1 ) return 1;
    return fib(x-1) + fib(x-2);
}

int main(int argc, char *argv[]) {
    int n;
    n = atoi(argv[1]);
    printf("fib(%d) = %d\n",n,fib(n));
    return 0;
}

Re-link for Condor
› Normal compile: gcc -c fib.c -o fib.o
› Normal link: gcc fib.o -o fib
› Use the same command, but add condor_compile: condor_compile gcc fib.o -o fib

Submit the Job › Create a submit file: vi fib.submit › Submit the job: condor_submit fib.submit

Executable = fib
Arguments = 40
Output = fib.out
Log = fib.log
Queue
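
Put together, a minimal session might look like this (a sketch assuming fib was already relinked with condor_compile as above; the here-document simply writes the submit file shown):

% cat > fib.submit << EOF
Executable = fib
Arguments = 40
Output = fib.out
Log = fib.log
Queue
EOF
% condor_submit fib.submit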

Watch the Progress

% condor_q
-- Submitter: axpbo8.bo.infn.it : :
 ID   OWNER  SUBMITTED   RUN_TIME  ST  PRI  SIZE  CMD
 5.0  thain  6/21 12:     :00:15   R              fib 40

Each job gets a unique number (ID). Status (ST) is Unexpanded, Running, or Idle. SIZE is the size of the program image in MB.

Receive E-mail When Done

This is an automated e-mail from the Condor system on machine "axpbo8.bo.infn.it". Do not reply.

Your condor job /tmp_mnt/usr/users/ccl/thain/test/fib 40 exited with status 0.
Submitted at: Wed Jun 21 14:24:
Completed at: Wed Jun 21 14:36:
Real Time: 0 00:11:54
Run Time: 0 00:06:52
Committed Time: 0 00:01:37...

Running Many Processes › 100 processes are almost as easy as one! › Each condor_submit makes one cluster of one or more processes. › Add the number of processes to run to the Queue statement. › Use the $(PROCESS) variable to give each process slightly different instructions.

Running Many Processes › Compute Fib(0) through Fib(49). › Output goes in fib.out.0, fib.out.1, and so on…

Executable = fib
Arguments = $(PROCESS)
Output = fib.out.$(PROCESS)
Log = fib.log
Queue 50

Running Many Processes › Another approach: Each process gets its own directory (dir0, dir1, …) and sends its output to dirX/fib.out.

Executable = fib
Arguments = $(PROCESS)
Initial_Dir = dir$(PROCESS)
Output = fib.out
Log = fib.log
Queue 50
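
Condor does not create the per-process directories for you. A short shell loop (a sketch, assuming a Bourne-style shell) makes dir0 through dir49 and then submits the cluster:

#!/bin/sh
# Make one working directory per process, dir0 .. dir49.
i=0
while [ $i -lt 50 ]; do
    mkdir dir$i
    i=`expr $i + 1`
done
# Submit the 50-process cluster described in fib.submit.
condor_submit fib.submit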

Running Many Processes

% condor_q
-- Submitter: axpbo8.bo.infn.it :
 ID   OWNER  SUBMITTED   RUN_TIME  ST  PRI  SIZE  CMD
 9.3  thain  6/23 10:     :05:40   R              fib
      thain  6/23 10:     :05:11   R              fib
      thain  6/23 10:     :05:09   R              fib
21 jobs; 2 idle, 19 running, 0 held

(In the job ID, 9 is the cluster number and 3 is the process number.)

Where Are They Running?

% condor_q -run
-- Submitter: axpbo8.bo.infn.it : :
 ID    OWNER  SUBMITTED   RUN_TIME  HOST(S)
 9.47  thain  6/23 10:     :07:03   ax4bbt.bo.infn.it
 9.48  thain  6/23 10:     :06:51   pewobo1.bo.infn.it
 9.49  thain  6/23 10:     :06:30   osde01.pd.infn.it

(HOST(S) shows each job’s current location.)

Help! I’m buried in e-mail! › By default, Condor sends one e-mail for each completed process. › Add one of these to your submit file: • notification = error • notification = never › To send the e-mail to someone else: • notify_user =
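
In context, the notification settings are just two more lines in the submit description. A sketch (the address is only a placeholder):

Executable = fib
Arguments = $(PROCESS)
Output = fib.out.$(PROCESS)
Log = fib.log
Notification = Error
# Placeholder address -- substitute your own.
Notify_User = someone@example.net
Queue 50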

Removing Processes › Remove one process: • condor_rm 9.47 › Remove a whole cluster: • condor_rm 9 › Remove everything! • condor_rm -a

Getting Feedback

What have I done? › The user log file (fib.log) shows a chronological list of everything important that happened to a job.

001 ( ) 06/21 17:03:44 Job executing on host:
004 ( ) 06/21 17:04:58 Job was evicted.
009 ( ) 06/21 17:05:10 Job was aborted by the user.

What have I done?

% condor_history
 ID   OWNER  SUBMITTED   CPU_USAGE  ST  COMPLETED   CMD
 9.3  thain  6/23 10:     :00:00    C   6/23 10:58  fib 35
      thain  6/23 10:     :00:24    C   6/23 10:59  fib
      thain  6/23 10:     :00:00    C   6/23 11:01  fib
      thain  6/23 10:     :05:45    C   6/23 11:01  fib
      thain  6/23 10:     :00:00    C   6/23 11:01  fib 7

Brief I/O Summary

% condor_q -io
-- Schedd: c01.cs.wisc.edu :
 ID  OWNER  READ     WRITE    SEEK  XPUT   BUFSIZE  BLKSIZE
     joe    KB       KB             KB/s   KB       32.0 KB
     joe    KB       KB             B /s   KB       32.0 KB
     joe    44.7 KB  22.1 KB        B /s   KB       32.0 KB
3 jobs; 0 idle, 3 running, 0 held

Complete I/O Summary in E-mail

Your condor job "/usr/joe/records.remote input output" exited with status 0.

Total I/O:
  KB/s effective throughput
  5 files opened
  104 reads totaling KB
  316 writes totaling 1.2 MB
  102 seeks

I/O by File:
  buffered file /usr/joe/input
    opened 2 times
    100 reads totaling KB
    311 writes totaling 1.2 MB
    101 seeks

(Only since Condor Version )

Complete I/O Summary in E-mail › The summary helps identify performance problems. Even advanced users don't know exactly how their programs and libraries operate.

Complete I/O Summary in E-mail › Example: • CMSSIM – a physics analysis program. • “Why is this job so slow?” • Data summary: 250 MB read from a 20 MB file. • Very high SEEK total -> random access. • Solution: increase the buffer to 20 MB.
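
The buffer can be resized from the submit file. A minimal sketch, assuming a Condor version that supports the buffer_size submit command (the value is in bytes, and the executable name is hypothetical):

Executable  = cmssim.remote
Output      = cmssim.out
Log         = cmssim.log
# Buffer up to 20 MB so the randomly-accessed file stays in memory.
buffer_size = 20000000
Queue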

Who Uses Condor?

% condor_q -global

-- Schedd: to02xd.to.infn.it :
 ID  OWNER     SUBMITTED   RUN_TIME  ST  PRI  SIZE  CMD
     garzelli  6/21 18:     :18:16   R              tosti2trisdn

-- Schedd: quark.ts.infn.it :
 ID  OWNER     SUBMITTED   RUN_TIME  ST  PRI  SIZE  CMD
     dellaric  4/10 14:     :20:31   R              john p2.dat
     dellaric  6/2  11:     :27:30   R              john p1.dat
     pamela    6/20 09:     :41:43   R              montepamela

Who Uses Condor?

% condor_status -submitters
Name        Machine     Running  IdleJobs  MaxJobsRunning
decux1.pv
quark.ts.i
to05xd.to

            RunningJobs  IdleJobs
Total       59           86

Who Uses Condor?

% condor_userprio
Last Priority Update: 6/23 16:27
User Name                 Effective Priority
Number of users shown: 8

Who Uses Condor? › The user priority is computed by Condor to estimate how much of the pool’s CPU resources each submitter has used. › Lighter users receive a lower priority value, and are allocated CPUs before heavy users. › Users consuming the same amount of CPU are allocated equal shares.

Measuring Goodput › Goodput is the amount of time a workstation spends making forward progress on work assigned by Condor. › This is a big topic all by itself.

Measuring Goodput

% condor_q -goodput
-- Submitter: coral.cs.wisc.edu : : coral.cs.wisc.edu
 ID  OWNER  SUBMITTED   RUN_TIME  GOODPUT  CPU_UTIL  Mb/s
     thain  6/23 07:     :47:     %        87.6%
     thain  6/23 07:     :38:     %        99.8%
     thain  6/23 07:     :38:     %        98.7%
     thain  6/23 07:     :10:     %        99.8%     0.00

Setting Requirements with ClassAds

Setting Requirements › We believe that Condor must allow both users (jobs) and owners (machines) to set requirements. › This is an absolute necessity in order to convince people to participate in the community.

ClassAds › ClassAds are a simple language for describing both the properties and the requirements of jobs and machines. › Condor stores nearly everything in ClassAds -- use the -l option to condor_q and condor_status to get the full details.

ClassAd for a Machine › condor_status -l axpbo8

MyType = "Machine"
TargetType = "Job"
Name = "axpbo8.bo.infn.it"
START = TRUE
VirtualMemory =
Disk =
Memory = 160
Cpus = 1
Arch = "ALPHA"
OpSys = "OSF1"

ClassAd for a Job › condor_q -l 9.49

MyType = "Job"
TargetType = "Machine"
Owner = "thain"
Cmd = "/tmp_mnt/usr/users/ccl/thain/test/fib"
Out = "fib.out.49"
Args = "49"
ImageSize = 2544
DiskUsage = 2544
Requirements = (Arch == "ALPHA") && (OpSys == "OSF1") && (Disk >= DiskUsage) && (VirtualMemory >= ImageSize)

Default Requirements › By default, Condor assumes the requirements for your job are: “I need a machine with…” • The same operating system and architecture as my workstation. • Enough disk to store the program. • Enough virtual memory to run the program.

Default Requirements › Expressed in ClassAds as:

Requirements = (Arch == "ALPHA") && (OpSys == "OSF1") && (Disk >= DiskUsage) && (VirtualMemory >= ImageSize)

ClassAd Requirements › Similar to C/C++/Java expressions: • Symbols: Arch, OpSys, Memory, Mips • Values: 15, 6.5, “LINUX” • Operators: ==, !=, <, <=, >, >=, &&, ||, ( )

Adding Requirements › In the submit file, add a line beginning with “requirements =”

Executable = fib
Arguments = 40
Output = fib.out
Log = fib.log
Requirements = (Memory > 64)
Queue

Example Requirements › (Memory > 64) › (Machine == "axpbo3.bo.infn.it") › (Mips > 100) || (Kflops > 10000) › (Subnet != " ") && (Disk > )

Are the Requirements Reasonable? › Two ways to find out: • Before running, use condor_status to list all machines matching certain requirements. • While running, use condor_q -analyze to see if a match is possible.

Are the Requirements Reasonable?

% condor_status -constraint '(Memory > 640)'
• Only axpd30.

% condor_status -constraint '(Memory > 512)'
• Five machines: ax4mcs, axpd30, axppv3, axzds0, and stonehenge.

Are the Requirements Reasonable? › Suppose that I submit a job like this: › “My job isn’t running – Why?”

Executable = fib
Arguments = 40
Output = fib.out
Requirements = ( Mips > 5000 )
Queue

Are the Requirements Reasonable?

% condor_q -analyze
WARNING: Be advised: No resources matched request's constraints. Check the Requirements expression below:
Requirements = ((Mips > 5000)) && (Arch == "ALPHA") && (OpSys == "OSF1") && (Disk >= DiskUsage) && (VirtualMemory >= ImageSize)

Preferences › Condor assumes that any machines that match your requirements are suitable. › However, you may prefer some machines over others. (100 Mips is better than 10) › To indicate a preference, you may provide a ClassAd expression which ranks all matches.

Rank › The rank expression is evaluated to a number for every potentially matching machine. › A machine with a higher number will be preferred over a machine with a lower number.

Rank Examples
› Prefer machines with more Mips: Rank = Mips
› Prefer more memory, but add 100 to the rank if the machine runs Solaris 2.7: Rank = Memory + 100*(OpSys=="SOLARIS27")
› Prefer machines with a high ratio of memory to CPU performance: Rank = Memory/Mips
› Prefer machines that will checkpoint in Bologna: Rank = (CkptServer=="ckpt.bo.infn.it")
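
Requirements and Rank combine naturally in one submit file: Requirements says which machines are acceptable, and Rank orders the acceptable ones. A sketch requiring at least 64 MB of memory and preferring the fastest qualifying machine:

Executable   = fib
Arguments    = 40
Output       = fib.out
Log          = fib.log
Requirements = (Memory > 64)
Rank         = Mips
Queue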

Use MORE Machines! › The Condor pool has several architectures: • 115 Alpha/OSF1 • 62 Intel/Linux • 11 Sun4u/Solaris › To get maximum throughput, you must use all that are available. Be greedy!

Compile for Each System › Make an executable for each kind of system you wish to use. Give each a unique name. › On an Alpha/OSF1: • condor_compile gcc fib.c -o fib.ALPHA.OSF1 › On an Intel/Linux: • condor_compile gcc fib.c -o fib.INTEL.LINUX

Change the Submit File › Make the executable name a function of the machine selected. › Allow either ALPHA/OSF1 or INTEL/LINUX machines to be selected.

Executable = fib.$$Arch.$$Opsys
Requirements = ( ((Arch=="ALPHA") && (OpSys=="OSF1")) || ((Arch=="INTEL") && (OpSys=="LINUX")) )
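
A complete heterogeneous submit file might look like this (a sketch assembling the fragments above into one runnable description):

Executable   = fib.$$Arch.$$Opsys
Arguments    = $(PROCESS)
Output       = fib.out.$(PROCESS)
Log          = fib.log
Requirements = ( ((Arch=="ALPHA") && (OpSys=="OSF1")) || ((Arch=="INTEL") && (OpSys=="LINUX")) )
Queue 50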

Condor Will Decide at the Last Minute

[Diagram: you submit fib.$$Arch.$$Opsys; Condor resolves it to fib.ALPHA.OSF1 on an Alpha/OSF1 machine, or to fib.INTEL.LINUX on an Intel/Linux machine.]

Standard or Vanilla?

Which Universe? › Each Condor universe provides different services to different kinds of programs: • Standard – relinked UNIX programs • Vanilla – unmodified UNIX programs • PVM • Scheduler (not described here) • Globus

Which Universe?

[Diagram: two clusters, each with its own file server; one labeled VANILLA, the other STANDARD.]

Standard Universe › Submit a specially-linked UNIX application to the Condor system. › Advantages: • Checkpointing for fault tolerance. • Remote I/O services: a friendly environment anywhere in the world; data buffering and staging; I/O performance feedback; user remapping of data sources.

Standard Universe › Disadvantages: • Must statically link with the Condor library. • Limited class of applications: single-process UNIX binaries; certain system calls prohibited.

System Call Limitations › The standard universe does not allow: • Multiple processes: fork(), exec(), system() • Inter-process communication: semaphores, messages, shared memory • Complex I/O: mmap(), select(), poll(), non-blocking I/O, … • Kernel-level threads (user-level threads are OK)

System Call Limitations › Too restrictive? • Use the vanilla universe.

Vanilla Universe › Submit any sort of UNIX program to the Condor system. › Advantages: • No relinking required. • Any program at all, including binaries, shell scripts, interpreted programs (java, perl), and multiple processes.

Vanilla Universe › Disadvantages: • No checkpointing. • Very limited remote I/O services: specify input files explicitly; specify output files explicitly. • Condor will refuse to start a vanilla job on a machine that is unfriendly (ClassAds: FilesystemDomain and UIDDomain).
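
A vanilla submit file looks much the same as a standard one, with the universe named explicitly and the input and output files spelled out. A sketch (myscript.sh and the data file names are hypothetical):

Universe   = vanilla
Executable = myscript.sh
Input      = data.in
Output     = data.out
Error      = data.err
Log        = myscript.log
Queue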

Which Universe? › Standard: • Good for mixed Condor pools, flocked pools, and the Grid at large. › Vanilla: • Good for a Condor pool of identical machines.

Conclusion

Conclusion › Condor expands your reach to many CPUs – even those you cannot log in to. › Condor makes it easy to run and manage large numbers of jobs. › Good candidates for the standard universe are single-process CPU-bound jobs with simple I/O. › Too restrictive? Use the vanilla universe, but expect fewer available machines.

Conclusion › Need more info? › Douglas Thain › INFN CCL › Condor Web Page › This talk