Intermediate Condor Rob Quick Open Science Grid HTC - Indiana University.



2012 Africa Grid School

Before we begin…
Any questions on the lectures or exercises up to this point?

HTC: Reliability
Two quotes from the first lecture:
  - “We don’t care about operations per second, we care about operations per year”
  - “HTC focuses on reliability”
How should we provide reliability?

First: why do jobs fail?
The computer running the job fails
  - Or the network, or the disk, or the OS, or…
Your job might be preempted:
  - Condor decides your job is less important than another, so your job is stopped and another started.
  - Not a “failure” per se, but it may feel like it to you.

Reliability example #1
When a job fails or is preempted:
  - It stays in the queue (on the schedd)
  - A note is written to the job log file
  - It reverts to “idle” state
  - It is eligible to be matched again
Relax! Condor will run your job again

Reliability example #2: Checkpointing
When Condor re-runs your job, it starts over from the beginning.
  - Unless you use the standard universe
  - Huh? What’s a “universe”?

Condor’s Universes
Condor can support various combinations of features/environments in different “universes”
Different Universes provide different functionality for your job:
  - Vanilla: Run any serial job (your first job)
  - Standard: Support for checkpointing & remote I/O
  - Java: Special support for Java
  - Parallel: Support for parallel jobs (such as MPI)
  - … and others

Process Checkpointing
Condor’s process checkpointing mechanism saves the entire state of a process into a checkpoint file
  - Memory, CPU, I/O, etc.
The process can then be restarted from right where it left off
Typically no changes to your job’s source code are needed. However, your job must be relinked with Condor’s Standard Universe support library

Relinking Your Job for Standard Universe
To do this, just place “condor_compile” in front of the command you normally use to link your job:
    % condor_compile gcc -o myjob myjob.c
- OR -
    % condor_compile f77 -o myjob filea.f fileb.f

Limitations of the Standard Universe
Condor’s checkpointing is not at the kernel level. Thus in the Standard Universe the job may not:
  - fork()
  - Use kernel threads
  - Use some forms of IPC, such as pipes and shared memory
Many typical scientific jobs are OK
You must use the same gcc that Condor was built with

When will Condor checkpoint your job?
Periodically, if desired (for fault tolerance)
When your job is preempted by a higher priority job
When your job is vacated because the execution machine becomes busy
When you explicitly run:
  - condor_checkpoint
  - condor_vacate
  - condor_off
  - condor_restart

Access to data in Condor
Option #1: Shared filesystem
  - Simple to use, but make sure your filesystem can handle the load
Option #2: Condor’s file transfer
  - Can automatically send back changed files
  - Atomic transfer of multiple files
  - Can be encrypted over the wire
  - Most common for small applications and data
Option #3: Remote I/O

Condor File Transfer
ShouldTransferFiles = YES
  - Always transfer files to execution site
ShouldTransferFiles = NO
  - Rely on a shared filesystem
ShouldTransferFiles = IF_NEEDED
  - Will automatically transfer the files if needed
Example submit file:
    Universe = vanilla
    Executable = my_job
    Log = my_job.log
    ShouldTransferFiles = YES
    Transfer_input_files = dataset$(Process), common.data
    Queue 600
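
With 600 jobs each naming its own dataset file, one missing input is easy to overlook. The following is a minimal pre-submission check, not part of Condor itself. It assumes the input files sit in the submit directory and follow the dataset$(Process) naming from the submit file above; the function names `expected_inputs` and `missing_inputs` are hypothetical.

```python
from pathlib import Path


def expected_inputs(num_jobs):
    """Return, per process number, the input files that job would transfer."""
    # Mirrors "Transfer_input_files = dataset$(Process), common.data".
    return {proc: [f"dataset{proc}", "common.data"] for proc in range(num_jobs)}


def missing_inputs(submit_dir, num_jobs):
    """List expected input files that are absent from submit_dir."""
    base = Path(submit_dir)
    missing = set()
    for files in expected_inputs(num_jobs).values():
        for name in files:
            if not (base / name).is_file():
                missing.add(name)
    return sorted(missing)
```

Calling `missing_inputs(".", 600)` before running condor_submit returns an empty list when every dataset file and common.data are in place.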

Condor File Transfer with URLs
Transfer_input_files can be a URL
For example:
    transfer_input_files =

Standard Universe: Remote System Calls
When your job accesses a file, we trap the access and do it on the submit computer
  - No file transfer needed
  - Can be faster if you don’t access the whole file
An essential part of checkpointing:
  - We know the state of the job, including the state of its files
  - We can re-run a job on a different machine without losing access to the files
No source code changes required

Remote I/O
[Diagram with the labels: Job, I/O Lib, condor_schedd, condor_startd, condor_shadow, condor_starter, File]

So Scot…
So far we’ve run one job at a time. Just one job. ONE.
This doesn’t seem very “high-throughput”

Clusters & Processes
One submit file can describe lots of jobs
  - All the jobs in a submit file are a cluster of jobs
  - Yeah, same term as a cluster of computers
Each cluster has a unique “cluster number”
Each job in a cluster is called a “process”
A Condor “job ID” is the cluster number, a period, and the process number (“20.1”)
A cluster is allowed to have one or more processes.
  - There is always a cluster for every job
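
The cluster.process form of a job ID is easy to take apart in a script, for example when grouping job IDs by the submission that created them. This is a small sketch of that idea; `JobId`, `parse_job_id`, and `group_by_cluster` are hypothetical names for illustration, not Condor tools.

```python
from typing import NamedTuple


class JobId(NamedTuple):
    cluster: int
    process: int

    def __str__(self):
        return f"{self.cluster}.{self.process}"


def parse_job_id(text):
    """Split a job ID such as "20.1" into cluster and process numbers."""
    cluster, sep, process = text.strip().partition(".")
    if not sep or not cluster.isdecimal() or not process.isdecimal():
        raise ValueError(f"not a cluster.process job ID: {text!r}")
    return JobId(int(cluster), int(process))


def group_by_cluster(job_ids):
    """Map each cluster number to the sorted process numbers seen in it."""
    clusters = {}
    for job in map(parse_job_id, job_ids):
        clusters.setdefault(job.cluster, []).append(job.process)
    return {c: sorted(p) for c, p in clusters.items()}
```

For instance, `parse_job_id("20.1")` yields cluster 20 and process 1, and every job ID maps to exactly one cluster, matching the rule that there is always a cluster for every job.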

Example Submit Description File for a Cluster
    # Example submit description file that defines a
    # cluster of 2 jobs with separate working directories
    Universe = vanilla
    Executable = my_job
    log = my_job.log
    Arguments = -arg1 -arg2
    Input = my_job.stdin
    Output = my_job.stdout
    Error = my_job.stderr
    # The first Queue becomes job 2.0
    InitialDir = run_0
    Queue
    # The second Queue becomes the next job in the cluster
    InitialDir = run_1
    Queue

Submitting The Job
    % condor_submit my_job.submit-file
    Submitting job(s).
    2 job(s) submitted to cluster 2.
    % condor_q
    -- Submitter: perdita.cs.wisc.edu : :
     ID    OWNER    SUBMITTED    RUN_TIME   ST PRI SIZE CMD
     2.0   frieda   4/15 06:     :00:00     I           my_job
     2.1   frieda   4/15 06:     :00:00     I           my_job
    2 jobs; 2 idle, 0 running, 0 held

The $(Process) macro
The initial directory for each job can be specified as run_$(Process), and instead of submitting a single job, we use “Queue 600” to submit 600 jobs at once
The $(Process) macro will be expanded to the process number for each job in the cluster, so we’ll have “run_0”, “run_1”, … “run_599” directories
All the input/output files will be in different directories!

Example of $(Process)
    # Example condor_submit input file that defines
    # a cluster of 600 jobs with different directories
    Universe = vanilla
    Executable = my_job
    Log = my_job.log
    Arguments = -arg1 -arg2
    Input = my_job.stdin
    Output = my_job.stdout
    Error = my_job.stderr
    # InitialDir becomes run_0 … run_599
    InitialDir = run_$(Process)
    # Creates job 3.0 …
    Queue 600
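
A submit file like this one points each job at its own run_N directory, so those directories and their input files have to be prepared somehow. The sketch below does that preparation under two assumptions: that the directories should exist before submission, and that each one needs its own my_job.stdin. The helper name `make_run_dirs` is hypothetical.

```python
from pathlib import Path


def make_run_dirs(base, num_jobs, stdin_text=""):
    """Create run_0 .. run_{num_jobs - 1} under base, each with a my_job.stdin."""
    created = []
    for proc in range(num_jobs):
        run_dir = Path(base) / f"run_{proc}"
        run_dir.mkdir(parents=True, exist_ok=True)
        stdin_file = run_dir / "my_job.stdin"
        # Do not overwrite an input file that is already in place.
        if not stdin_file.exists():
            stdin_file.write_text(stdin_text)
        created.append(run_dir)
    return created
```

Running `make_run_dirs(".", 600)` in the submit directory produces run_0 through run_599, matching the names that run_$(Process) expands to for a 600-job cluster.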

More $(Process)
You can use $(Process) anywhere:
    Universe = vanilla
    Executable = my_job
    Log = my_job.$(Process).log
    Arguments = -randomseed $(Process)
    Input = my_job.stdin
    Output = my_job.stdout
    Error = my_job.stderr
    InitialDir = run_$(Process)
    Queue

Sharing a directory
You don’t have to use separate directories.
$(Cluster) will help distinguish runs
    Universe = vanilla
    Executable = my_job
    Arguments = -randomseed $(Process)
    Input = my_job.input.$(Process)
    Output = my_job.stdout.$(Cluster).$(Process)
    Error = my_job.stderr.$(Cluster).$(Process)
    Log = my_job.$(Cluster).$(Process).log
    Queue
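
To see why $(Cluster) and $(Process) together keep files in a shared directory from colliding, it can help to expand the templates by hand. The following is an illustration only: it imitates the substitution with a regular expression and is not how Condor implements macros. The function names `expand_macros` and `output_names` are hypothetical.

```python
import re


def expand_macros(template, cluster, process):
    """Replace $(Cluster) and $(Process) in a submit-file value."""
    values = {"Cluster": str(cluster), "Process": str(process)}

    def substitute(match):
        # Leave any macro this sketch does not know about untouched.
        return values.get(match.group(1), match.group(0))

    return re.sub(r"\$\((\w+)\)", substitute, template)


def output_names(template, cluster, num_jobs):
    """Expand one template for every process number in a cluster."""
    return [expand_macros(template, cluster, proc) for proc in range(num_jobs)]


print(expand_macros("my_job.stdout.$(Cluster).$(Process)", 4, 7))
```

The printed name is my_job.stdout.4.7. Because the cluster number changes with each submission and the process number changes with each job, two runs in the same directory never share an output file name.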

Complex sets of jobs
What if you have a large set of jobs?
  - Different jobs have different purposes, but all are needed for your science
  - Some jobs must run before others
Hang tight, we’ll talk about this after lunch

Job Priorities
Are some of the jobs more interesting than others?
condor_prio lets you set the job priority
  - Priority relative to your jobs, not other people’s
  - Priority can be any integer
Can be set in submit file:
  - Priority = 14

Advanced Trickery
Pretend:
You submit several large sets of jobs
These jobs are of two different types
  - e.g. analysis vs. simulation
How can you look at just the analysis jobs?

Advanced Trickery cont.
In your job file:
    +JobType = "analysis"
You can see this in your job ClassAd:
    condor_q -l
You can show jobs of a certain type:
    condor_q -constraint 'JobType == "analysis"'
Try this during the exercises!
Be careful with the quoting…
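
The quoting is tricky because the shell and the ClassAd expression each want their own quotes. One way to sidestep mistakes is to build the command as an argument list and let Python's standard shlex module show the shell-safe form. This is a sketch under the assumption that the attribute value is a plain string; `constraint_for` and `condor_q_command` are hypothetical helper names.

```python
import shlex


def constraint_for(attribute, value):
    """Build a ClassAd expression comparing an attribute to a string value."""
    if '"' in value or "\\" in value:
        raise ValueError("this sketch does not handle quotes or backslashes")
    return f'{attribute} == "{value}"'


def condor_q_command(attribute, value):
    """Return the argument list for condor_q with a -constraint expression."""
    return ["condor_q", "-constraint", constraint_for(attribute, value)]


args = condor_q_command("JobType", "analysis")
# Show the command as it would need to be typed at a shell prompt.
print(shlex.join(args))
```

This prints condor_q -constraint 'JobType == "analysis"'. Passing the `args` list straight to `subprocess.run` skips the shell entirely, so only the inner double quotes of the ClassAd string remain to get right.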

More Hands on Tomorrow Morning!

Questions?
Questions? Comments?
Feel free to ask me questions later: Rob Quick