Slide 1: Kyle Spafford, Jeremy S. Meredith, Jeffrey S. Vetter (http://ft.ornl.gov)
Slide 2 (Managed by UT-Battelle for the U.S. Department of Energy): S3D, DCA++: Early Work
Slide 3: "An experimental high performance computing system of innovative design." "Outside the mainstream of what is routinely available from computer vendors." (National Science Foundation, Track 2D call, Fall 2008)
Slide 4: Keeneland ID @ GT/ORNL
Slide 5: Inside a Node. Callouts: 4 hot-plug SFF (2.5") HDDs; 2 non-hot-plug SFF (2.5") HDDs; 1 GPU module in the rear, lower 1U; 2 GPU modules in the upper 1U; dual 1 GbE; dedicated management iLO3 LAN and 2 USB ports; VGA; UID LED and button; health LED; serial (RJ45); power button; QSFP (QDR InfiniBand).
Slide 6: Node block diagram: two CPUs linked by QPI; each CPU reaches an I/O hub (with integrated PCIe) over QPI; each GPU (6 GB) attaches via PCIe x16; DDR3 RAM per socket; InfiniBand hangs off an I/O hub.
Slide 7: Why a dual I/O hub? Diagram: in a Tesla 1U configuration, GPU #0 and GPU #1 share a PCIe switch into a single IOH.
Slide 8: Same diagram, with the shared 8.0 GB/s link into the IOH marked "Bottleneck!"
Slide 9: The bottlenecked single-IOH layout contrasted with a dual-IOH design: each CPU reaches its own IOH over a 12.8 GB/s QPI link, with GPU #0 on one IOH and GPUs #1 and #2 on the other over 8.0 GB/s PCIe links.
Slide 10: Introduction of NUMA. Diagram: a "short path" (CPU #0 to GPU #0 through its local IOH, 8.0 GB/s PCIe) versus a "long path" (CPU #0 to GPU #1, crossing the 12.8 GB/s QPI link to the remote IOH).
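The short-path/long-path distinction can be captured in a small lookup. This is a hedged sketch, not from the slides: it assumes GPU 0 is local to NUMA node 0 and GPUs 1 and 2 are local to node 1, which is consistent with the 0-1-1 pinning used later in the talk; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report which NUMA node gives a GPU its "short
# path", assuming GPU 0 hangs off the IOH near socket 0 and GPUs 1 and
# 2 off the IOH near socket 1 (matching the 0-1-1 pinning shown later).
local_node_for_gpu() {
    case "$1" in
        0)   echo 0 ;;
        1|2) echo 1 ;;
        *)   echo "unknown GPU $1" >&2; return 1 ;;
    esac
}

local_node_for_gpu 0   # short path is from CPU #0
local_node_for_gpu 2   # short path is from CPU #1
```

Any other placement of a task relative to its GPU pays the long-path penalty measured on the next slides.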
Slide 11: Bandwidth penalty (chart): host-to-device copy bandwidth from CPU #0 over the short versus long path.
Slide 12: Bandwidth penalty (chart): device-to-host copy from CPU #0, annotated "~2 GB/s".
Slide 13: Other benchmark results. MPI latency: 26% penalty for large messages, 12% for small messages. SHOC benchmarks: the mismap penalty (shown below) gives this effect context.
Slide 14: Given a multi-GPU app, how should processes be pinned?
Slide 15: Same question, with MPI ranks 0, 1, and 2 shown on the node diagram.
Slide 16: Node diagram (CPU #0, CPU #1, GPUs #0-#2, IOHs, InfiniBand): "Maximize GPU bandwidth."
Slide 17: Same diagram, with ranks 0, 1, and 2 placed next to their GPUs.
Slide 18: Node diagram, alternative placement of ranks 0, 1, and 2: "Maximize MPI bandwidth."
Slide 19: Same placement. "Pretty easy, right?"
Slide 20: Pinning with numactl:
    numactl --cpunodebind=0 --membind=0 ./program
Slide 21: 0-1-1 pinning with numactl:
    if [[ $OMPI_COMM_WORLD_LOCAL_RANK == "2" ]]
    then
        numactl --cpunodebind=1 --membind=1 ./prog
    elif [[ $OMPI_COMM_WORLD_LOCAL_RANK == "1" ]]
    then
        numactl --cpunodebind=1 --membind=1 ./prog
    else   # rank 0
        numactl --cpunodebind=0 --membind=0 ./prog
    fi
Slide 22: HPL scaling (chart). Sustained MPI and GPU operations; uses the other CPU cores via Intel MKL.
Slide 23: What happened with 0-1-1? Diagram: MPI tasks 0, 1, and 2 across CPU #0 and CPU #1.
Slide 24: Tasks placed 0-1-1: rank 0 on CPU #0, ranks 1 and 2 on CPU #1.
Slide 25: MKL threads appear alongside the MPI tasks.
Slide 26: Threads inherit the parent task's pinning!
Slide 27: Resulting MKL thread placement.
Slide 28: Two idle cores, one oversubscribed socket!
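One way out of the oversubscription is to size each rank's thread pool explicitly so that inherited pinning cannot overload a socket. This is a hedged sketch of that idea, not the talk's fix: the 6-cores-per-socket figure is an assumption (typical of a dual-socket Westmere node of that era), and MKL's thread count is assumed to follow `OMP_NUM_THREADS`.

```shell
#!/bin/sh
# Sketch: give each local rank only as many threads as its share of
# the socket it is pinned to, so inherited pinning cannot oversubscribe.
# CORES_PER_SOCKET=6 is an assumed machine parameter.
CORES_PER_SOCKET=6
threads_for_rank() {
    case "$1" in
        1|2) echo $((CORES_PER_SOCKET / 2)) ;;   # two ranks share socket 1
        *)   echo "$CORES_PER_SOCKET" ;;         # rank 0 has socket 0 alone
    esac
}

OMP_NUM_THREADS="$(threads_for_rank "${OMPI_COMM_WORLD_LOCAL_RANK:-0}")"
export OMP_NUM_THREADS
echo "$OMP_NUM_THREADS"
```

With this split, all twelve cores stay busy: 6 threads on socket 0 and 3 + 3 on socket 1, instead of two idle cores and one oversubscribed socket.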
Slide 29: NUMA impact on apps (chart).
Slide 30: "Well…" (chart; x-axis: time).
Slide 31: Can we improve utilization by sharing a Fermi among multiple tasks?
Slide 32: Bandwidth of the most bottlenecked task (chart).
Slide 33: Is the second I/O hub worth it?
Slide 34: Is the second I/O hub worth it? Aggregate bandwidth to the GPUs is 16.9 GB/s. What about real application behavior? Scenario A ("HPL"): 1 MPI and 1 GPU task per GPU. Scenario B: A, plus 1 MPI task for each remaining core.
Slide 35: Contention penalty (chart).
Slide 36: Puzzler, pinning redux: do ranks 1 and 2 always have a long path?
Slide 37: Same question; diagram shows CPU #0, GPU #1, and an IOH.
Slide 38: Same question; diagram extended with CPU #1, InfiniBand, and the second IOH.
Slide 39: Split MPI and GPU: MPI latency (chart).
Slide 40: Split MPI and GPU: PCIe bandwidth (chart).
Slide 41: Takeaways.
- Dual I/O hubs deliver, but add complexity.
- Ignoring the complexity will sink some apps: wrong pinning sank HPL; bandwidth-bound kernels and "function offload" apps are affected.
- Threads and libnuma can help, but can be tedious to use.
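Putting the takeaways together, a launch wrapper can pin each local rank next to its GPU and expose only that GPU to the application. This is a hedged sketch under the same 0-1-1 topology assumption as before; `CUDA_VISIBLE_DEVICES` is CUDA's standard device-masking variable, and the `placement` table is hypothetical and would need adjusting per machine.

```shell
#!/bin/sh
# Dry-run sketch of a combined launch wrapper: for each local rank,
# pick a NUMA node and a GPU from one table, then print (rather than
# exec) the resulting command. Layout matches the 0-1-1 example.
placement() {
    case "$1" in
        0) echo "0 0" ;;   # rank 0: NUMA node 0, GPU 0
        1) echo "1 1" ;;   # rank 1: NUMA node 1, GPU 1
        2) echo "1 2" ;;   # rank 2: NUMA node 1, GPU 2
    esac
}

set -- $(placement "${OMPI_COMM_WORLD_LOCAL_RANK:-0}")
node=$1; gpu=$2
echo "CUDA_VISIBLE_DEVICES=$gpu numactl --cpunodebind=$node --membind=$node ./prog"
```

Keeping the node and GPU assignment in a single table avoids the failure mode from slides 23-28, where CPU pinning and GPU selection drift apart.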
Slide 42: Thanks! kys@ornl.gov, http://kylespafford.com/