Published by Amice Cunningham. Modified over 9 years ago.
1
Introduction to High Performance Computing at ZIH: The LSF Batch System
Guido Juckeland (guido.juckeland@tu-dresden.de)
Center for Information Services and High Performance Computing (ZIH)
Zellescher Weg 12, Trefftz-Building, HRSK/151, Phone +49 - 351 - 463 39871
2
Slide 2 - Guido Juckeland Agenda
– What is a batch system?
– Batch queues on the Altix/Deimos
– Host groups on Deimos
– Starting, stopping, and monitoring batch jobs
– Batch scripts
3
Slide 3 - Guido Juckeland What is a batch system?
4
Slide 4 - Guido Juckeland Concept of a batch system
[Diagram: a login host (compile, file transfers), a master host, and compute hosts. The batch job is submitted from the login host to the master host and executed on a compute host.]
5
Slide 5 - Guido Juckeland Deimos batch system
[Diagram: two login hosts (compile, file transfer), a master host, and the compute hosts p1d001, p1d002, p1d003, p2s256, ... The batch job is submitted from a login host to the master host and executed on the compute hosts.]
6
Slide 6 - Guido Juckeland Mars Altix batch system
[Diagram: a login host (compile, file transfer), a master host, and the partitions Jupiter, Saturn, Uranus, and Mars. The batch job is submitted from the login host to the master host and executed on the partitions.]
7
Slide 7 - Guido Juckeland What is a job?
A piece of work (e.g. a script, command, or application call) that is handed over to the control of the batch system (using "bsub").
The batch system determines:
– Execution time of the job (when?)
– Placement of the job on the hosts (where?)
– If needed, the interruption of the job
The batch system also does the accounting for jobs with respect to the project's available CPU time.
Special case: interactive jobs
8
Slide 8 - Guido Juckeland A job's life cycle
9
Slide 9 - Guido Juckeland What is a host?
Deimos/Phobos:
– Login node
– Compute node
Altix:
– All 4 partitions
A host contains a number of job slots (the number of sequential jobs that can be placed on the host):
– Deimos: 1 node -> 2-8 job slots (since 2-8 CPUs)
– Phobos: 1 node -> 2 job slots (since 2 CPUs)
– Altix: Mars -> 346 job slots (since 346 available CPUs); Jupiter, Saturn, Uranus -> 506 job slots (since 506 available CPUs)
10
Slide 10 - Guido Juckeland What is a queue?
Queue = a line of things or people waiting for some event
Batch queue = a queue for batch jobs
Different queues usually have different limits on the maximum run time or the maximum number of available job slots.
One batch queue for the whole system:
– Advantage: Easy to administer
– Disadvantage: User has to specify all job parameters (CPU time, memory usage, …)
Multiple batch queues for the whole system (Altix, PC farm):
– Advantage: Easy to use for the user
– Disadvantage: More difficult to manage
11
Slide 11 - Guido Juckeland Batch queues on the Altix/Deimos
12
Slide 12 - Guido Juckeland Batch queues on Mars
Name          Users  CPUs        Time limit (default / max)  Hosts
interactive   All    1 - 32      12 h / 12 h                 All
ilr           ILR    1 - 768     12 h / 24 h                 All
small         All    1 - 63      12 h / 5 d                  All
intermediate  All    64 - 255    12 h / 5 d                  All
large         All    256 - 1866  12 h / 24 h                 All
13
Slide 13 - Guido Juckeland Batch queues on Deimos
Name          Users           CPUs       Time limit (default / max)  Hosts
interactive   All             1 - 32     12 h / 12 h                 All
small         All             1 - 8      12 h / 5 d                  All
rtc           Selected users  1 - 2      200 h                       All but fat_quads
intermediate  All             9 - 127    12 h / 5 d                  All
large         All             128 - 256  12 h / 24 h                 All
nightexpress  MPI_CBG                    14 h                        All
gauss         Gauss users     1 - 8      120 h                       fat_quads
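The CPU ranges of the generally available Deimos queues can be turned into a small helper that suggests a queue for a given number of job slots. This is only a sketch: `pick_deimos_queue` is a hypothetical name, the interactive queue and the user-restricted queues (rtc, nightexpress, gauss) are left out on purpose, and the ranges are copied from the slide, so they may not match a live system.

```shell
#!/bin/sh
# Hypothetical helper: suggest a Deimos queue from the number of job slots.
# Ranges come from the slide (small 1-8, intermediate 9-127, large 128-256);
# interactive and user-restricted queues are deliberately ignored.
pick_deimos_queue() {
    slots=$1
    if [ "$slots" -ge 1 ] && [ "$slots" -le 8 ]; then
        echo small
    elif [ "$slots" -ge 9 ] && [ "$slots" -le 127 ]; then
        echo intermediate
    elif [ "$slots" -ge 128 ] && [ "$slots" -le 256 ]; then
        echo large
    else
        echo "no general queue for $slots slots" >&2
        return 1
    fi
}

# Dry run: print the bsub call rather than submitting it.
echo "bsub -n 16 -q $(pick_deimos_queue 16) ./a.out"
```

Printing the command instead of running it keeps the sketch usable on a machine without LSF.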
14
Slide 14 - Guido Juckeland Host groups on Deimos
15
Slide 15 - Guido Juckeland Host groups on Deimos
p1_hosts - Phase 1 nodes
p2_hosts - Phase 2 nodes
single_hosts - Single CPU nodes
dual_hosts - Dual CPU nodes
quad_hosts - 16 GByte Quad CPU nodes
fat_quads - 32 GByte Quad CPU nodes
single{1|2}_hosts - Phase 1/2 Single CPU nodes
dual{1|2}_hosts - Phase 1/2 Dual CPU nodes
quad{1|2}_hosts - Phase 1/2 16 GByte Quad nodes
express_hosts - Nodes for the nightexpress queue
16
Slide 16 - Guido Juckeland Starting, stopping, and monitoring batch jobs
17
Slide 17 - Guido Juckeland Starting a batch job (1)
Command: bsub
Call with: bsub -n <N> [parameter] <command> [command parameters]
Parameters:
– -n <N>: Number of job slots to use (CPUs)
– -q <queue>: Selects a specific queue for the job
– -W <time>: Max. runtime of the job (format: H:MM)
– -e <file>: Redirects all error output (stderr) to <file>
– -o <file>: Redirects all standard output (stdout) to <file>
– -M <memory>: Max. amount of main memory needed by the job
– -m <host>: Specifies a certain host (group) for the job execution
– -x: Reserves the execution host exclusively for the job
– -Is: Interactive job (bsub -Is bash -l)
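Several of these options are typically combined in one call. The sketch below assembles such a command line and prints it instead of submitting it. All values (4 slots, queue small, a 2 hour limit, the output file names) are illustrative assumptions; `%J` in the file names is LSF's placeholder for the job ID.

```shell
#!/bin/sh
# Illustrative values only; adjust them to the queue limits of the target system.
slots=4
queue=small
runtime=2:00          # -W takes the limit in H:MM
job=./a.out

# Assemble the command line; LSF replaces %J with the job ID.
cmd="bsub -n $slots -q $queue -W $runtime -o job.%J.out -e job.%J.err $job"

# Dry run: only show the command that would be submitted.
echo "$cmd"
```

Running the printed line on a login host would submit the job with separate stdout and stderr files per job ID.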
18
Slide 18 - Guido Juckeland Starting a batch job (2)
– -w done(<job-id>): Starts the job once the job with ID <job-id> has finished successfully
Example:
juckel@deimos101:~> bsub pwd
Job is submitted to default queue.
Sequential program:
– Altix/Phobos/Deimos/SX-6: bsub ./a.out
MPI-parallel program:
– Altix: bsub -n <N> pamrun ./a.out
– Deimos: bsub -n <N> -a openmpi mpirun.lsf ./a.out
OpenMP-parallel / multithreaded program:
– Altix: bsub -R "span[hosts=1]" -n <N> ./a.out
– Deimos: bsub -R "span[hosts=1]" -n {1-8} ./a.out
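To chain two jobs with `-w done(...)`, the ID of the first job has to be captured from bsub's confirmation message. The following sketch assumes the usual LSF wording `Job <ID> is submitted to ...`, with the ID in angle brackets (the brackets appear to have been lost in the transcript's example). `extract_job_id` and `./postprocess.sh` are hypothetical names, and the sample line is illustrative.

```shell
#!/bin/sh
# Hypothetical helper: pull the numeric job ID out of bsub's confirmation line.
# Assumes the message starts with "Job <ID> is submitted".
extract_job_id() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Sample confirmation line; on a real system use: bsub ./a.out | extract_job_id
first_id=$(echo 'Job <647942> is submitted to default queue <small>.' | extract_job_id)

# Dry run: print the dependent submission instead of executing it.
echo "bsub -w \"done($first_id)\" ./postprocess.sh"
```

The second job would then stay pending until the first one has finished successfully.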
19
Slide 19 - Guido Juckeland Modifying a batch job
Command: bmod
Attention! Usually works only for pending jobs (PEND)
Call with: bmod [Parameter] <job-id>
Parameters:
– -n <N>: Number of job slots to use (CPUs)
– -q <queue>: Selects a specific queue for the job
– -W <time>: Max. runtime of the job (format: H:MM)
– -e <file>: Redirects all error output (stderr) to <file>
– -o <file>: Redirects all standard output (stdout) to <file>
– -M <memory>: Max. amount of main memory needed by the job
– -m <host>: Specifies a certain host (group) for the job execution
– -x: Reserves the execution host exclusively for the job
20
Slide 20 - Guido Juckeland Cancelling a batch job
Command: bkill
Call with: bkill <job-id>
Special cases:
– bkill 0 cancels all jobs of a user
– bkill only sends a SIGKILL to the application -> if the process executed by the job does not respond to that signal, LSF cannot abort the job
21
Slide 21 - Guido Juckeland Modification of the order of pending jobs
Default: Execution order = order of arrival
Commands: btop/bbot
Call with: btop/bbot <job-id>
Example:
juckel@deimos101:~> bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
647942 juckel PEND express host pwd Jun 15 10:33
647943 juckel PEND express host pwd Jun 15 10:33
647944 juckel PEND express host pwd Jun 15 10:33
juckel@deimos101:~> bbot 647942
Job has been moved to position 1 from bottom.
juckel@deimos101:~> bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
647943 juckel PEND express host pwd Jun 15 10:33
647944 juckel PEND express host pwd Jun 15 10:33
647942 juckel PEND express host pwd Jun 15 10:33
22
Slide 22 - Guido Juckeland Monitoring a job status (1)
Command: bjobs
Call with: bjobs [Parameter] [Job-ID]
Parameters:
– without: Shows a list of all jobs of the user, or only the job with [Job-ID]
– -p [Job-ID]: Shows the reason why the job is pending
– -l [Job-ID]: Shows detailed information about the job with [Job-ID]
– -q <queue>: Shows all of the user's jobs in the given queue
– -r: Shows only running jobs (status RUN)
Example:
juckel@deimos101:~> bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
647943 juckel PEND express host pwd Jun 15 10:33
647944 juckel PEND express host pwd Jun 15 10:33
647942 juckel PEND express host pwd Jun 15 10:33
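The bjobs listing is easy to post-process with standard tools. As a minimal sketch, the hypothetical function `count_states` below counts jobs per state, relying only on STAT being the third column of the default layout shown on the slide. The sample input is the slide's own output.

```shell
#!/bin/sh
# Count jobs per state from bjobs-style output (the header line is skipped).
# Column 3 is STAT in the layout shown on the slide.
count_states() {
    awk 'NR > 1 { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample output copied from the slide; on a real system use: bjobs | count_states
sample='JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
647943 juckel PEND express host pwd Jun 15 10:33
647944 juckel PEND express host pwd Jun 15 10:33
647942 juckel PEND express host pwd Jun 15 10:33'

echo "$sample" | count_states
```

For the sample this prints `PEND 3`. The empty EXEC_HOST column of pending jobs does not matter here because it comes after STAT.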
23
Slide 23 - Guido Juckeland Monitoring a job status (2)
Possible states of a job:
– PEND (pending): Job is waiting to be executed
– RUN (running): Job is currently being executed
– UNKWN (unknown): Status of the job is unknown (usually happens when the job ran on a node that crashed)
– SUSP (suspended): Job was stopped
24
Slide 24 - Guido Juckeland Information about a job after it has finished
Command: bhist
Call with: bhist -l <job-id>
Example:
juckel@deimos101:~> bhist -l 647942
Job, User, Project, Command
Thu Jun 15 10:33:26: Submitted from host, to Queue, CWD ;
Thu Jun 15 10:34:06: Job moved to position 1 relative to by user or administrator ;
Thu Jun 15 10:39:19: Dispatched to ;
Thu Jun 15 10:39:19: Starting (Pid 23143);
Thu Jun 15 10:39:19: Running with execution home, Execution CWD, Execution Pid ;
Thu Jun 15 10:39:19: Done successfully. The CPU time used is 0.0 seconds;
Thu Jun 15 10:39:19: Post job process done successfully;
Summary of time in seconds spent in various states by Thu Jun 15 10:39:19
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
353 0 0 0 0 0 353
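The PEND value in the summary can be checked against the event timestamps: the job was submitted at 10:33:26 and dispatched at 10:39:19. A small sketch of that arithmetic, assuming both events fall on the same day (`to_seconds` is a hypothetical helper):

```shell
#!/bin/sh
# Convert an HH:MM:SS timestamp to seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

submitted=$(to_seconds 10:33:26)    # "Submitted from host" event
dispatched=$(to_seconds 10:39:19)   # "Dispatched to" event

# Pending time = dispatch time - submission time (same day assumed).
echo "PEND seconds: $((dispatched - submitted))"
```

The difference is 353 seconds, which matches the PEND and TOTAL columns of the summary.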
25
Slide 25 - Guido Juckeland Displaying all jobs in the system (1)
Command: qstat -a
Example:
juckel@deimos101:~> qstat -a
...
small; type=BATCH; [ENABLED]; pri=20 10 run; 0 wait;
REQUEST NAME REQUEST ID USER STATE
1: CRANp_S2_PBE 639563 rbarthel RUNNING
2: ts_rekom_eta 643986 drees RUNNING
3: tricomplex_3 646038 drees RUNNING
4: BCPPp 647727 rbarthel RUNNING
5: iC_C20H9p 647930 rbarthel RUNNING
6: iA_C21OH10p 647931 rbarthel RUNNING
7: iC_C20H9 647934 rbarthel RUNNING
8: iA_C20H9 647935 rbarthel RUNNING
26
Slide 26 - Guido Juckeland Displaying all jobs in the system (2)
Command: lsf_info (developed at ZIH)
Example:
juckel@deimos101:~> lsf_info
Running Jobs
-------------
JOBID USER PROJECT QUEUE PROC START TIME WALL TIME USED UTILIZATION
647945 mlieber ozon large 64 15.Jun 10:43 0:01 of 0:15 44%
639563 rbarthel nano1 small 1 13.Jun 05:15 2T/ 5h of 2T/23h 99%
643986 drees akat small 2 13.Jun 11:43 1T/23h of 3T/ 0h 43%
646038 drees akat small 2 14.Jun 16:35 18:09 of 3T/ 0h 11%
647727 rbarthel nano1 small 1 14.Jun 21:07 13:37 of 2T/23h 97%
...
Pending Jobs
------------
JOBID USER PROJECT QUEUE SUBMITTED PROC
601109 muel zhr stresstest 28.Mai 11:27 1
601110 muel zhr stresstest 28.Mai 11:27 1
...
27
Slide 27 - Guido Juckeland System status
Command: nodestat (developed at ZIH)
Call with: nodestat [Hostgroup]
Example:
deimos101:~ # nodestat
--------------------------------------------------------------------------------
nodes available: 724/724 nodes damaged:0
------------------------------------+-------------------------------------------
jobs running: 1576 | cores closed (exclusive jobs):118
jobs wait: 909 | cores closed by ADMIN:20
jobs suspend: 0 | cores working:2300
jobs damaged: 0 |
------------------------------------+-------------------------------------------
normal working cores: 2576 cores free for jobs:138
--------------------------------------------------------------------------------
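The core counters in the sample output add up: working, closed (exclusive jobs), closed by ADMIN, and free cores together give the number of normal working cores. This reading of the fields is inferred from the numbers on the slide, not from nodestat documentation, so the following check is only a sketch.

```shell
#!/bin/sh
# Values copied from the nodestat sample on the slide.
working=2300
closed_exclusive=118
closed_admin=20
free=138

# Sum of the four categories; the sample reports 2576 normal working cores.
total=$((working + closed_exclusive + closed_admin + free))
echo "sum of core categories: $total"
```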
28
Slide 28 - Guido Juckeland Displaying the status of the batch queues (1)
Command: bqueues
Call with: bqueues [Parameter] [Queue name]
Parameters:
– without: Shows a summary of all queues or of the specified queue
– -l [Queue name]: Shows detailed information about all queues or about the specified queue
Example:
juckel@merkur:~> bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
interactive 30 Open:Active 60 60 1 - 0 0 0 0
large 25 Open:Active 124 124 1 - 1072 1072 0 0
gauss 20 Open:Active 64 64 1 32 24 0 24 0
small_long 20 Open:Active 120 120 1 - 4 0 4 0
small 20 Open:Active 120 120 1 120 84 0 84 0
...
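The bqueues summary can be filtered to find the queues in which jobs are waiting. The sketch below uses the sample output from the slide and assumes the column layout shown there (queue name in column 1, PEND in column 9). `queues_with_pending` is a hypothetical name.

```shell
#!/bin/sh
# List queues that have pending jobs, together with the pending count.
# Column 1 is QUEUE_NAME and column 9 is PEND in the layout on the slide.
queues_with_pending() {
    awk 'NR > 1 && $9 > 0 { print $1, $9 }'
}

# Sample output copied from the slide; on a real system use: bqueues | queues_with_pending
sample='QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
interactive 30 Open:Active 60 60 1 - 0 0 0 0
large 25 Open:Active 124 124 1 - 1072 1072 0 0
gauss 20 Open:Active 64 64 1 32 24 0 24 0
small_long 20 Open:Active 120 120 1 - 4 0 4 0
small 20 Open:Active 120 120 1 120 84 0 84 0'

echo "$sample" | queues_with_pending
```

For the sample only the large queue is listed, with 1072 pending job slots.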
29
Slide 29 - Guido Juckeland Displaying the status of the batch queues (2)
Possible states of the queues:
– Active: Queue accepts and executes jobs
– Inactive: Queue accepts jobs but execution is delayed
– Closed: Queue does not accept jobs and jobs are not executed
30
Slide 30 - Guido Juckeland Information about the compute hosts
Command: bhosts
Call with: bhosts (Altix), bhosts batch_hosts (Phobos, Deimos)
Example:
juckel@merkur:~> bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
merkur ok - 124 42 42 0 0 0
venus ok - 124 106 106 0 0 0
Possible states of hosts:
– OK: Host accepts jobs
– Closed: Host does not accept jobs (host is either full or closed by the admin)
– Unknown: Host has crashed
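A rough estimate of the free job slots per host follows from the bhosts columns as MAX minus NJOBS. The sketch below applies this to the slide's sample output, considering only hosts in state ok. `free_slots` is a hypothetical name, and the estimate ignores per-user limits (JL/U) and queue limits.

```shell
#!/bin/sh
# Estimate free job slots per host as MAX - NJOBS for hosts in state "ok".
# Columns in the layout on the slide: 1 HOST_NAME, 2 STATUS, 4 MAX, 5 NJOBS.
free_slots() {
    awk 'NR > 1 && $2 == "ok" { print $1, $4 - $5 }'
}

# Sample output copied from the slide; on a real system use: bhosts | free_slots
sample='HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
merkur ok - 124 42 42 0 0 0
venus ok - 124 106 106 0 0 0'

echo "$sample" | free_slots
```

For the sample this gives 82 free slots on merkur and 18 on venus.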
31
Slide 31 - Guido Juckeland Batch scripts
32
Slide 32 - Guido Juckeland Layout
#!/bin/sh
#BSUB -n 4
#BSUB -q small
#BSUB -a openmpi
mpirun.lsf ./a.out
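A batch script can carry the same options as the bsub command line, one `#BSUB` line per option. As a sketch, the hypothetical function `make_batch_script` below prints a script that extends the layout above with a run time limit and output files. The added option values and file names are illustrative assumptions, and `%J` is LSF's placeholder for the job ID.

```shell
#!/bin/sh
# Hypothetical generator: print an LSF batch script for an Open MPI job.
# Arguments: number of slots, queue, run time limit (H:MM), program to run.
make_batch_script() {
    cat <<EOF
#!/bin/sh
#BSUB -n $1
#BSUB -q $2
#BSUB -W $3
#BSUB -o job.%J.out
#BSUB -e job.%J.err
#BSUB -a openmpi
mpirun.lsf $4
EOF
}

# Print the script; redirect it to a file such as test.sh to keep it.
make_batch_script 4 small 1:00 ./a.out
```

Saved to a file, the generated script is submitted in the same way as a hand-written one, by feeding it to bsub on standard input.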
33
Slide 33 - Guido Juckeland Submission of a job with a batch script
Command: bsub
Call with: bsub [Parameters] < <batch script>
Example:
juckel@deimos101:~> bsub < test.sh
Job is submitted to queue.
34
Slide 34 - Guido Juckeland You need help later on?
There are no stupid questions or requests!
– Central drop-off point: hpcsupport@zih.tu-dresden.de
– Central information point: http://www.tu-dresden.de/zih/hpc
– Read our mail announcements: zih-hpcnews@groups.tu-dresden.de
– Discuss your problem with other ZIH HPC users: zih-hpcforum@groups.tu-dresden.de