Hybrid MPI/CUDA: Scaling accelerator code
Blue Waters Undergraduate Petascale Education Program (BWUPEP2011), UIUC, May 29 – June 10, 2011
Why Hybrid CUDA?
CUDA is fast! (for some problems)
CUDA on a single card is like OpenMP: it doesn't scale beyond one machine.
MPI alone can only scale so far: excessive power consumption, communication overhead, and a large amount of work remaining on each node.
What if you could harness the power of multiple accelerators across multiple MPI processes?
Hybrid Architectures
Tesla S1050, connected to nodes: 1 GPU, connected directly to a node (Al-Salam @ Earlham: as11 & as12).
Tesla S1070: a server unit with 4 GPUs, typically connected via PCI-E to 2 nodes (Sooner @ OU has some of these; Lincoln @ NCSA, 192 nodes; Accelerator Cluster (AC) @ NCSA, 32 nodes).
[Diagram: GPUs attached to a node and its RAM]
MPI/CUDA Approach
CUDA will be doing the computational heavy lifting and dictating your algorithm and parallel layout (data parallel).
Therefore: design the CUDA portions first, then use MPI to move work to each node.
Implementation
Do as much work as possible on the GPU before bringing data back to the CPU and communicating it. Sometimes you won't have a choice…
Debugging tips: develop/test/debug the one-node version first, then test it with multiple nodes to verify communication.
The basic structure of each process:
move data to each node
while not done:
    copy data to GPU
    do work <<< >>>
    get new state out of GPU
    communicate with others
aggregate results from all nodes
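As a concrete illustration, here is a minimal sketch of that loop in C with MPI and CUDA. The kernel step(), the problem size, and the iteration count are hypothetical placeholders rather than the course's code:

/* Hypothetical sketch of the hybrid loop: each MPI process owns a slice of
 * the data, iterates on its GPU, and synchronizes with the others via MPI. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void step(float *d_data, int n) {          /* placeholder kernel */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 0.5f;                      /* stand-in for real work */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1 << 20;                             /* elements per process (assumed) */
    float *h_data = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) h_data[i] = 1.0f;      /* "move data to each node" */

    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    for (int iter = 0; iter < 10; iter++) {            /* "while not done" */
        cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);
        step<<<(N + 255) / 256, 256>>>(d_data, N);     /* do work on the GPU */
        cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
        /* "communicate with others" -- e.g. exchange boundary values here */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    float local = h_data[0], total;                    /* aggregate results from all nodes */
    MPI_Reduce(&local, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    cudaFree(d_data);
    free(h_data);
    MPI_Finalize();
    return 0;
}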
Multi-GPU Programming
A CPU thread can only have a single active context to communicate with a GPU.
cudaGetDeviceCount(int * count)
cudaSetDevice(int device)
Be careful using MPI rank alone: the device count only counts the cards visible from each node.
Use MPI_Get_processor_name() to determine which processes are running where, as in the sketch below.
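The following is one common way to pair MPI processes with GPUs, shown only as a sketch: gather every rank's processor name, count how many lower-numbered ranks share the same node, and use that "local rank" with cudaSetDevice(). The local-rank logic and the printout are illustrative assumptions, not code from the course materials:

/* Sketch: assign each MPI process its own GPU on a multi-GPU node. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char name[MPI_MAX_PROCESSOR_NAME] = {0};
    int name_len;
    MPI_Get_processor_name(name, &name_len);

    /* Gather every process's node name so each rank can see who shares its node. */
    char *all_names = (char *)malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
    MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                  all_names, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

    /* Local rank = how many lower-numbered ranks run on the same node. */
    int local_rank = 0;
    for (int r = 0; r < rank; r++)
        if (strcmp(&all_names[r * MPI_MAX_PROCESSOR_NAME], name) == 0)
            local_rank++;

    int device_count;
    cudaGetDeviceCount(&device_count);
    cudaSetDevice(local_rank % device_count);   /* one GPU per local process */

    printf("rank %d on %s -> GPU %d of %d\n",
           rank, name, local_rank % device_count, device_count);

    free(all_names);
    MPI_Finalize();
    return 0;
}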
Compiling
CUDA needs nvcc; MPI needs mpicc.
Dirty trick: wrap mpicc with nvcc. nvcc processes the .cu files and sends the rest to its wrapped compiler.
The kernel, kernel invocations, and cudaMalloc are all best off in a .cu file somewhere; MPI calls should be in .c files. There are workarounds, but this is the simplest approach:
nvcc --compiler-bindir mpicc main.c kernel.cu
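To make the split concrete, here is a hypothetical two-file sketch that compiles with the nvcc command above. The kernel, the extern "C" wrapper launch_scale(), and the file names are assumptions for illustration only:

/* kernel.cu -- CUDA code stays in the .cu file, compiled by nvcc. */
#include <cuda_runtime.h>

__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

extern "C" void launch_scale(float *h, int n, float f) {  /* C-callable wrapper */
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n, f);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
}

/* main.c -- MPI calls stay in a plain .c file, compiled by mpicc. */
#include <mpi.h>
void launch_scale(float *h, int n, float f);   /* defined in kernel.cu */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    float data[4] = {1, 2, 3, 4};
    launch_scale(data, 4, 2.0f);               /* GPU work done via the .cu side */
    MPI_Finalize();
    return 0;
}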
Executing
Typically one MPI process per available GPU.
On Sooner (OU), each node has 2 GPUs available, so ppn should be 2:
#BSUB -R "select[cuda > 0]"
#BSUB -R "rusage[cuda=2]"
#BSUB -l nodes=1:ppn=2
On AC, each node has 4 GPUs, and the GPUs correspond to the number of processors requested, so this requests a total of 8 GPUs on 2 nodes:
#BSUB -l nodes=2:tesla:cuda3.2:ppn=4
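The job script also needs a launch line. A generic sketch is shown below, assuming a standard mpirun launcher and an executable named ./hybrid; the actual launcher and flags depend on the site's MPI installation:

#BSUB -R "select[cuda > 0]"
#BSUB -R "rusage[cuda=2]"
#BSUB -l nodes=1:ppn=2
mpirun -np 2 ./hybrid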
Hybrid CUDA Lab
We already have Area Under a Curve code for MPI and CUDA independently. You can write a hybrid code that has each GPU calculate a portion of the area, then use MPI to combine the subtotals into the complete area (one possible shape is sketched below). Otherwise, feel free to take any code we've used so far and experiment!
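For reference, here is one possible shape for such a hybrid solution, a sketch only: it assumes a left-endpoint Riemann sum of f(x) = x^2 on [0, 1], and the kernel name, problem sizes, and rank-to-GPU mapping are illustrative choices rather than the course's code:

/* Sketch: each MPI process integrates its sub-interval on its GPU,
 * then MPI_Reduce combines the partial areas on rank 0.               */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void area_kernel(double *d_parts, double x0, double dx, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double x = x0 + i * dx;            /* left edge of rectangle i */
        d_parts[i] = (x * x) * dx;         /* f(x) = x^2, rectangle area */
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    cudaSetDevice(rank % 2);               /* assumes 2 GPUs per node, ranks packed per node */

    const int n = 1 << 20;                 /* rectangles per process (assumed) */
    double a = 0.0, b = 1.0;
    double width = (b - a) / size;         /* this process's sub-interval */
    double x0 = a + rank * width;
    double dx = width / n;

    double *d_parts;
    cudaMalloc(&d_parts, n * sizeof(double));
    area_kernel<<<(n + 255) / 256, 256>>>(d_parts, x0, dx, n);

    double *h_parts = (double *)malloc(n * sizeof(double));
    cudaMemcpy(h_parts, d_parts, n * sizeof(double), cudaMemcpyDeviceToHost);

    double local = 0.0, total = 0.0;
    for (int i = 0; i < n; i++) local += h_parts[i];   /* sum this GPU's rectangles */

    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("area = %f\n", total);       /* expect ~1/3 for x^2 on [0,1] */

    cudaFree(d_parts);
    free(h_parts);
    MPI_Finalize();
    return 0;
}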