GPU and Large Memory Jobs on Ibex
Passant Hafez
HPC Applications Specialist, Supercomputing Core Lab
A) GPU Jobs
Micro-architectures:
1- Volta → V100
2- Turing → RTX 2080 Ti
3- Pascal → P100, P6000, GTX 1080 Ti
GPU Resources:
33 nodes with 156 GPUs
All nodes are using NVIDIA driver version 418.40.04 and CUDA version 10.1.
CUDA Toolkit versions available: 10.1.105, 9.2.148.1, 9.0.176, 8.0.44
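For example, a specific CUDA Toolkit version can be selected through the modules system before compiling or running GPU code (a minimal sketch; the exact module names on Ibex may differ, so check `module avail` first):

module avail cuda          # list the CUDA Toolkit modules installed on the system
module load cuda/10.1.105  # assumed module name for Toolkit 10.1.105
nvcc --version             # confirm the toolkit version now in the environment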
SLURM Allocation
There are 2 ways to allocate GPUs. For example, to allocate 4 x P100 GPUs, either use:
#SBATCH --gres=gpu:p100:4
or:
#SBATCH --gres=gpu:4
#SBATCH --constraint=[p100]
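As a minimal sketch (job name, time limit, and the script body are placeholders), a batch script using the constraint form could look like:

#!/bin/bash
#SBATCH --job-name=gpu-test     # placeholder job name
#SBATCH --time=01:00:00         # placeholder time limit
#SBATCH --gres=gpu:4            # request 4 GPUs...
#SBATCH --constraint=[p100]     # ...and constrain them to P100

nvidia-smi                      # show the GPUs allocated to this job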
The allocation of CPU memory is done with the `--mem=###G` option in SLURM job scripts. The amount depends on the job's characteristics; a good starting point is at least as much CPU memory as the total memory of the GPUs the job will use. For example, a job using 2 x V100 GPUs would allocate at least `--mem=64G` for the CPUs.
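Putting the two together, a sketch of a 2 x V100 job with CPU memory sized to match the GPU memory (the gres type name follows the same pattern as `p100` above; job name, time limit, and application are placeholders):

#!/bin/bash
#SBATCH --job-name=v100-job     # placeholder job name
#SBATCH --time=02:00:00         # placeholder time limit
#SBATCH --gres=gpu:v100:2       # 2 x V100 GPUs
#SBATCH --mem=64G               # at least as much CPU memory as the GPUs provide

srun python train.py            # placeholder application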
Debug Partition:
2 GPU nodes:
- dgpu601-14 with 2 x P6000
- gpu104-12 with 1 x V100
Max time limit: 2 hrs
Use: #SBATCH -p debug
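For a quick test within the 2-hour limit, an interactive session on the debug partition could be requested like this (a sketch; adjust the GPU count and time to your test):

srun -p debug --gres=gpu:1 --time=00:30:00 --pty bash   # interactive shell with one GPU on the debug partition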
Login Nodes:
V100 and RTX 2080 Ti GPU nodes: vlogin.ibex.kaust.edu.sa
All other GPU nodes: glogin.ibex.kaust.edu.sa
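For example, to prepare and submit jobs from the matching login node (replace `username` with your own user name):

ssh username@vlogin.ibex.kaust.edu.sa   # jobs targeting V100 / RTX 2080 Ti nodes
ssh username@glogin.ibex.kaust.edu.sa   # jobs targeting all other GPU nodes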
B) Large Memory Jobs
Normal compute nodes have up to ~360 GB of memory per node.
"Large memory job" is a label assigned to your job by SLURM when you request 370,000 MB of memory or more. You don't need to worry about it: just request the memory your job REALLY needs, and the scheduler will handle the rest.
Nodes with memory varying from 498 GB to 3 TB are available for job execution.*
* Part of the total memory is used by the operating system.
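A sketch of a large memory job script (the memory figure, job name, time limit, and application are placeholders; request only what the job really needs):

#!/bin/bash
#SBATCH --job-name=bigmem       # placeholder job name
#SBATCH --time=04:00:00         # placeholder time limit
#SBATCH --mem=500G              # >= 370,000 MB, so SLURM labels it a large memory job

srun ./my_large_memory_app      # placeholder application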
Notes: Always check the wiki for hardware updates https://www.hpc.kaust.edu.sa/ibex
Notes: Use Ibex Job Script Generator: https://www.hpc.kaust.edu.sa/ibex/job
Notes: --exclusive will be deprecated soon. Don't ssh directly to compute nodes.
To monitor jobs:
- While the job is pending or running: scontrol show job <jobid>
- After the job ends: seff <jobid>
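For example, after submitting a script (the job ID shown is hypothetical):

sbatch gpu_job.sh               # prints e.g. "Submitted batch job 1234567"
scontrol show job 1234567       # details while the job is pending or running
seff 1234567                    # CPU/memory efficiency report after the job ends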
Thank You!

Demo Time: asyncAPI, cdpSimplePrint, cudaOpenMP, matrixMul, simpleAssert, simpleIPC, simpleMultiGPU, vectorAdd, bandwidthTest, deviceQuery, topologyQuery, UnifiedMemoryPerf
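Assuming these CUDA samples have been built in your own directory, each demo can be run on a GPU node through SLURM; a sketch for deviceQuery (path and resource figures are placeholders):

srun --gres=gpu:1 --time=00:10:00 ./deviceQuery   # prints the properties of the allocated GPU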