GPU and Large Memory Jobs on Ibex
Passant Hafez, HPC Applications Specialist, Supercomputing Core Lab
A) GPU Jobs
A) GPU Jobs
Micro-architectures:
1- Volta → V100
2- Turing → RTX 2080 Ti
3- Pascal → P100, P6000, GTX 1080 Ti
A) GPU Jobs
GPU Resources: 33 nodes with 156 GPUs.
All nodes run the same NVIDIA driver, with CUDA version 10.1. CUDA Toolkit versions available start at 8.0.44.
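To check which toolkit versions are installed, the standard environment-modules query applies (a sketch; the exact module names depend on the Ibex software stack):

    module avail cuda          # list installed CUDA Toolkit modules
    module load cuda/10.1      # load a specific version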
A) GPU Jobs
SLURM Allocation
There are two ways, for example, to allocate 4 x P100 GPUs. Either use:
#SBATCH --gres=gpu:p100:4
or:
#SBATCH --gres=gpu:4
#SBATCH --constraint=[p100]
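Putting the directives together, a minimal sketch of a batch script requesting 4 x P100 GPUs (the job name, time limit, module version, and application command are placeholders, not Ibex-mandated values):

    #!/bin/bash
    #SBATCH --job-name=gpu-demo       # hypothetical job name
    #SBATCH --gres=gpu:p100:4         # 4 x P100, as above
    #SBATCH --time=01:00:00           # adjust to your workload

    module load cuda/10.1             # pick an available toolkit version
    srun ./my_gpu_app                 # placeholder application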
A) GPU Jobs
CPU memory is allocated with the `--mem=###G` option in SLURM job scripts. How much you need depends on the characteristics of your job; a good starting point is at least as much host memory as the total GPU memory the job will use. For example, a job using 2 x V100 GPUs (32 GB each) should allocate at least `--mem=64G` for the CPUs.
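A hedged sketch of the corresponding #SBATCH lines for that example (the 64G figure follows the rule of thumb above):

    #SBATCH --gres=gpu:v100:2    # 2 x V100, 32 GB GPU memory each
    #SBATCH --mem=64G            # host memory >= total GPU memory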
A) GPU Jobs
Debug Partition: 2 GPU nodes:
dgpu601-14 with 2 x P6000
a gpu node with 1 x V100
Max time limit: 2 hrs
Use #SBATCH -p debug
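For example, a short interactive session on the debug partition could be requested like this (a sketch; the single GPU and 30-minute limit are arbitrary choices within the 2-hour cap):

    srun -p debug --gres=gpu:1 --time=00:30:00 --pty bash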
A) GPU Jobs
Login Nodes:
V100 and RTX 2080 Ti GPU nodes: vlogin.ibex.kaust.edu.sa
All other GPU nodes: glogin.ibex.kaust.edu.sa
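For example, to reach the V100 and RTX 2080 Ti side of the cluster (the username is a placeholder):

    ssh your_username@vlogin.ibex.kaust.edu.sa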
B) Large Memory Jobs
B) Large Memory Jobs
Normal compute nodes have up to ~360 GB of memory per node. “Large memory job” is simply a label that SLURM assigns to your job when you request 370,000 MB of memory or more. You don’t need to worry about it: just request the memory your job REALLY needs, and the scheduler will handle the rest.
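As an illustration, either request below (the figures are arbitrary examples) meets the 370,000 MB threshold and would be labeled a large memory job by the scheduler:

    #SBATCH --mem=370000M
    #SBATCH --mem=400G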
B) Large Memory Jobs
Nodes with memory varying from 498 GB to 3 TB are valid for job execution.*
* Part of the total memory is used by the operating system.
Notes: Always check the wiki for hardware updates
Notes: Use the Ibex Job Script Generator.
Notes: --exclusive will be deprecated soon.
Don’t ssh directly to compute nodes.
To monitor jobs:
Before and while running: scontrol show job <jobid>
After the job ends: seff <jobid>
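In practice that workflow looks like this (123456 stands in for a real job ID):

    sbatch job.sh               # submit; SLURM prints the job ID
    scontrol show job 123456    # state and resources while pending/running
    seff 123456                 # CPU/memory efficiency after the job ends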
Thank You!
Demo Time
CUDA samples used in the demo:
asyncAPI, cdpSimplePrint, cudaOpenMP, matrixMul, simpleAssert, simpleIPC, simpleMultiGPU, vectorAdd, bandwidthTest, deviceQuery, topologyQuery, UnifiedMemoryPerf
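For reference, a hedged sketch of building and running one of these samples on a GPU node (this assumes a local copy of the CUDA samples; the path varies with the toolkit installation):

    module load cuda/10.1
    cd NVIDIA_CUDA-10.1_Samples/1_Utilities/deviceQuery   # assumed samples location
    make
    srun --gres=gpu:1 --time=00:10:00 ./deviceQuery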