Kyle Spafford, Jeremy S. Meredith, Jeffrey S. Vetter
S3D / DCA++ Early Work
“An experimental high performance computing system of innovative design.” “Outside the mainstream of what is routinely available from computer vendors.” - National Science Foundation, Track2D Call, Fall 2008
Keeneland (GT/ORNL)
Inside a Node
– 4 hot-plug SFF (2.5”) HDDs
– 1 GPU module in the rear, lower 1U
– 2 GPU modules in the upper 1U
– Dual 1GbE
– Dedicated management iLO3 LAN & 2 USB ports
– VGA
– UID LED & button
– Health LED
– Serial (RJ45)
– Power button
– QSFP (QDR IB)
– 2 non-hot-plug SFF (2.5”) HDDs
Node Block Diagram – CPUs with DDR3 RAM, linked by QPI to an I/O hub; PCIe x16 links to the GPUs (6 GB each) and to the integrated InfiniBand.
Why a dual I/O hub?
– Diagram: in the Tesla 1U design, GPU #0 and GPU #1 share a PCIe switch behind a single I/O hub – a bandwidth bottleneck!
– With dual I/O hubs, each CPU's I/O hub gets its own PCIe x16 links (8.0 GB/s) to its GPUs.
Introduction of NUMA
– Short path: a CPU reaches the GPU behind its local I/O hub over PCIe x16 (8.0 GB/s).
– Long path: CPU #0 reaching GPU #1 must also cross QPI (12.8 GB/s) to the other socket's I/O hub.
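Before choosing a pinning, the node's NUMA layout can be inspected directly; a minimal sketch, assuming numactl, hwloc, and the sysfs locality files are available (the PCI address below is a placeholder):

# Sockets, cores, memory nodes, and inter-node distances
numactl --hardware

# hwloc's lstopo shows the full tree, including which I/O hub each
# PCIe device (GPU, InfiniBand HCA) hangs off
lstopo

# Which CPUs are local to a given PCIe device (replace the
# placeholder address with the GPU's real one from lspci)
cat /sys/bus/pci/devices/0000:02:00.0/local_cpulist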
Bandwidth Penalty – chart: H->D copy bandwidth from CPU #0.
Bandwidth Penalty – chart: D->H copy bandwidth from CPU #0 (~2 GB/s annotation).
Other Benchmark Results
– MPI latency: 26% penalty for large messages, 12% for small messages
– SHOC benchmarks: mismap penalty shown below – gives this effect context
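The mismap penalty can be reproduced by sweeping every socket/GPU pairing with a PCIe bandwidth benchmark; a rough sketch, assuming a SHOC-style BusSpeedDownload binary is in the current directory (its device flag may be spelled differently in your SHOC version):

# Measure H->D bandwidth for every CPU-socket / GPU combination
for node in 0 1; do
  for gpu in 0 1 2; do
    echo "socket $node -> GPU $gpu"
    numactl --cpunodebind=$node --membind=$node ./BusSpeedDownload -d $gpu
  done
done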
Given a Multi-GPU app, how should processes be pinned?
Maximize GPU Bandwidth – diagram: pin each MPI rank on the socket attached to its GPU's I/O hub.
Maximize MPI Bandwidth – diagram: pin ranks on the socket whose I/O hub hosts the InfiniBand adapter. Pretty easy, right?
Pinning with numactl
numactl --cpunodebind=0 --membind=0 ./program
Pinning with numactl
# Per-rank wrapper: ranks 1 and 2 bind to socket 1, rank 0 to socket 0
if [[ $OMPI_COMM_WORLD_LOCAL_RANK == "2" ]]
then
    numactl --cpunodebind=1 --membind=1 ./prog
elif [[ $OMPI_COMM_WORLD_LOCAL_RANK == "1" ]]
then
    numactl --cpunodebind=1 --membind=1 ./prog
else # rank 0
    numactl --cpunodebind=0 --membind=0 ./prog
fi
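For comparison, a more compact per-rank wrapper is sketched below; the script name, the rank-to-GPU mapping, and the use of CUDA_VISIBLE_DEVICES are illustrative assumptions rather than part of the original setup. It would be launched as, e.g., mpirun -np 3 --npernode 3 ./pin.sh ./prog.

#!/bin/bash
# pin.sh (hypothetical): bind each local MPI rank to the socket nearest
# its GPU and restrict CUDA to that single device.
rank=$OMPI_COMM_WORLD_LOCAL_RANK
case $rank in
  0)   node=0; gpu=0 ;;        # GPU #0 sits behind socket 0's I/O hub
  1|2) node=1; gpu=$rank ;;    # GPUs #1 and #2 sit behind socket 1's
  *)   node=0; gpu=0 ;;
esac
export CUDA_VISIBLE_DEVICES=$gpu
exec numactl --cpunodebind=$node --membind=$node "$@"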
HPL Scaling
– Sustained MPI and GPU ops
– Uses other CPU cores via Intel MKL
What Happened with 0-1-1?
– Diagram: one MPI task lands on CPU #0, two on CPU #1.
– Each task then spawns MKL threads – threads inherit pinning!
– Result: two idle cores, one oversubscribed socket!
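One way around the inherited-pinning trap is to hand each rank an explicit core list sized to its MKL thread count; the sketch below assumes a 12-core node with cores 0-5 on socket 0 and 6-11 on socket 1 (check numactl --hardware for the real numbering) and is not the configuration used in the slides.

rank=$OMPI_COMM_WORLD_LOCAL_RANK
case $rank in
  0) cores=0-5;  node=0; export MKL_NUM_THREADS=6 ;;  # all of socket 0
  1) cores=6-8;  node=1; export MKL_NUM_THREADS=3 ;;  # half of socket 1
  2) cores=9-11; node=1; export MKL_NUM_THREADS=3 ;;  # other half
esac
# Explicit core lists: no core is left idle and no socket is oversubscribed
numactl --physcpubind=$cores --membind=$node ./prog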
NUMA Impact on Apps (chart)
Well… (chart vs. time)
Can we improve utilization by sharing a Fermi among multiple tasks?
Bandwidth of Most Bottlenecked Task (chart)
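A rough sketch of how such a sharing experiment could be driven: put the GPU in the default (shared) compute mode and aim several properly pinned ranks at it. The nvidia-smi flag spelling and the benchmark name are assumptions that vary with driver and tool versions.

# Allow multiple processes to share GPU 0 (requires root; the
# compute-mode flag may be numeric on older drivers)
nvidia-smi -i 0 -c DEFAULT

# Four ranks, all on socket 0 and all restricted to GPU 0, so the
# benchmark reports per-task bandwidth under contention
export CUDA_VISIBLE_DEVICES=0
mpirun -np 4 -x CUDA_VISIBLE_DEVICES numactl --cpunodebind=0 --membind=0 ./BusSpeedDownload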
Is the second IO hub worth it?
– Aggregate bandwidth to GPUs is 16.9 GB/s
– What about real app behavior?
  – Scenario A: “HPL” – 1 MPI & 1 GPU task per GPU
  – Scenario B: A + 1 MPI task for each other core
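The aggregate number can be approximated by running one correctly pinned bandwidth task per GPU at the same time; a sketch under the assumption that GPU #0 sits behind socket 0 and GPUs #1 and #2 behind socket 1, again using a SHOC-style benchmark name:

# Three concurrent, correctly mapped bandwidth tests - their sum
# approximates the node's aggregate host<->GPU bandwidth
numactl --cpunodebind=0 --membind=0 ./BusSpeedDownload -d 0 &
numactl --cpunodebind=1 --membind=1 ./BusSpeedDownload -d 1 &
numactl --cpunodebind=1 --membind=1 ./BusSpeedDownload -d 2 &
wait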
Contention Penalty (chart)
Puzzler – Pinning Redux: do ranks 1 and 2 always have a long path?
– Diagram: GPU #1 and the InfiniBand adapter sit behind different I/O hubs, so ranks 1 and 2 face a long path to either their GPU or the network, depending on where they are pinned.
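Both sides of that trade-off can be measured separately; the sketch below times the PCIe path to GPU #1 and the MPI path to the InfiniBand HCA under each candidate binding. The benchmark names (a SHOC-style BusSpeedDownload and OSU's osu_latency) and their flags are assumptions, and osu_latency needs a second rank on a remote node.

for node in 0 1; do
  echo "== rank bound to socket $node =="
  # PCIe side: bandwidth to GPU #1
  numactl --cpunodebind=$node --membind=$node ./BusSpeedDownload -d 1
  # MPI side: latency through the IB HCA (one rank here, one remote)
  mpirun -np 2 --npernode 1 numactl --cpunodebind=$node --membind=$node ./osu_latency
done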
Split MPI and GPU – MPI Latency (chart)
Split MPI and GPU – PCIe Bandwidth (chart)
Takeaways
– Dual I/O hubs deliver – but add complexity
– Ignoring the complexity will sink some apps
  – Wrong pinning sank HPL
  – Bandwidth-bound kernels & “function offload” apps
– Threads and libnuma can help – but can be tedious to use