
Slide 1: Kyle Spafford, Jeremy S. Meredith, Jeffrey S. Vetter. http://ft.ornl.gov

Slide 2: Early Work with S3D and DCA++

Slide 3: "An experimental high performance computing system of innovative design." "Outside the mainstream of what is routinely available from computer vendors." (National Science Foundation, Track 2D call, Fall 2008)

Slide 4: Keeneland Initial Delivery (ID) @ GT/ORNL

Slide 5: Inside a Node
- 4 hot-plug SFF (2.5") HDDs
- 1 GPU module in the rear, lower 1U
- 2 GPU modules in the upper 1U
- Dual 1 GbE
- Dedicated management iLO3 LAN and 2 USB ports
- VGA
- UID LED and button
- Health LED
- Serial (RJ45)
- Power button
- QSFP (QDR InfiniBand)
- 2 non-hot-plug SFF (2.5") HDDs

Slide 6: Node Block Diagram (figure: two CPUs, each with DDR3 RAM, linked by QPI to each other and to the I/O hubs; GPUs (6 GB each) attach over PCIe x16, with InfiniBand integrated at an I/O hub)

Slide 7: Why a dual I/O hub? (figure: in a Tesla 1U, GPU #0 and GPU #1 sit behind a PCIe switch that shares a single link into one IOH; links labeled 8.0 GB/s)

Slide 8: Why a dual I/O hub? (same figure: the single 8.0 GB/s link from the PCIe switch into the IOH is the bottleneck!)

Slide 9: Why a dual I/O hub? (figure: the bottlenecked single-IOH Tesla 1U next to the dual-IOH node, where CPU #0 and CPU #1 each reach their own IOH at 12.8 GB/s and the GPUs are split across the two IOHs at 8.0 GB/s each)

Slide 10: Introduction of NUMA (figure: a copy between CPU #0 and the GPU on its local IOH takes the short path; a copy to a GPU behind the other IOH takes the long path across an extra 12.8 GB/s QPI hop)
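
Aside (not from the slides): before pinning anything, the node's NUMA layout can be inspected from the shell with standard tools, numactl and hwloc's lstopo:

    # List the NUMA nodes, their CPUs and memory, and inter-node distances
    numactl --hardware
    # With hwloc installed, lstopo also shows which IOH/PCIe slot each
    # GPU and the InfiniBand HCA hang off of
    lstopo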

Slide 11: Bandwidth Penalty (chart: host-to-device copy bandwidth from CPU #0, short path vs. long path)

Slide 12: Bandwidth Penalty (chart: device-to-host copy bandwidth from CPU #0; the long path costs roughly 2 GB/s)
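
The penalty is straightforward to reproduce by running any host-device bandwidth microbenchmark under both bindings. A sketch, not from the slides; ./bandwidth_bench and its --device flag are hypothetical stand-ins (SHOC's bus speed tests would serve):

    # Short path: bind to the socket whose IOH hosts GPU 0 (assumed to be node 0)
    numactl --cpunodebind=0 --membind=0 ./bandwidth_bench --device 0
    # Long path: same GPU, far socket; expect D->H bandwidth to drop by ~2 GB/s
    numactl --cpunodebind=1 --membind=1 ./bandwidth_bench --device 0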

Slide 13: Other Benchmark Results
- MPI latency: a 26% penalty for large messages, 12% for small messages
- SHOC benchmarks: the mismap penalty shown below gives this effect context

Slide 14: Given a multi-GPU app, how should processes be pinned?

Slide 15: Given a multi-GPU app, how should processes be pinned? (figure: MPI ranks 0, 1, and 2)

Slide 16: Maximize GPU Bandwidth (figure: node topology with CPU #0, CPU #1, the IOHs, InfiniBand, and GPUs #0, #1, #2)

Slide 17: Maximize GPU Bandwidth (figure: ranks 0, 1, and 2 each pinned to the socket nearest its GPU)

Slide 18: Maximize MPI Bandwidth (figure: ranks 0, 1, and 2 placed for the shortest path to the InfiniBand adapter)

Slide 19: Maximize MPI Bandwidth (same figure). Pretty easy, right?

Slide 20: Pinning with numactl

    numactl --cpunodebind=0 --membind=0 ./program
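
To confirm a binding actually took effect, numactl can report the policy it is running under. A quick check, not from the slides:

    # Run numactl --show under the binding; it prints the effective
    # cpubind and membind node lists for the child process
    numactl --cpunodebind=0 --membind=0 numactl --show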

Slide 21: 0-1-1 Pinning with numactl

    if [[ $OMPI_COMM_WORLD_LOCAL_RANK == "2" ]]; then
        numactl --cpunodebind=1 --membind=1 ./prog
    elif [[ $OMPI_COMM_WORLD_LOCAL_RANK == "1" ]]; then
        numactl --cpunodebind=1 --membind=1 ./prog
    else  # rank = 0
        numactl --cpunodebind=0 --membind=0 ./prog
    fi
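
The same 0-1-1 mapping fits in a small wrapper script, keeping the rank logic out of the job script. A sketch under assumptions: Open MPI's OMPI_COMM_WORLD_LOCAL_RANK is set, and ranks 1 and 2 belong on node 1:

    #!/bin/bash
    # pin.sh -- launch as: mpirun -np 3 ./pin.sh ./prog
    rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
    # 0-1-1 mapping: local rank 0 -> NUMA node 0, ranks 1 and 2 -> node 1
    if [[ $rank -eq 0 ]]; then node=0; else node=1; fi
    exec numactl --cpunodebind=$node --membind=$node "$@"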

Slide 22: HPL Scaling
- Sustains both MPI and GPU operations
- Uses the remaining CPU cores via Intel MKL

Slide 23: What happened with 0-1-1? (figure: MPI tasks 0, 1, and 2 alongside CPU #0 and CPU #1)

Slide 24: What happened with 0-1-1? (figure: task 0 lands on CPU #0; tasks 1 and 2 land on CPU #1)

Slide 25: What happened with 0-1-1? (figure: each MPI task spawns MKL threads)

Slide 26: What happened with 0-1-1? Threads inherit pinning! (figure: each task's MKL threads are confined to the task's socket)
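
The inheritance is easy to see from the shell, since every thread or child of a bound process gets the parent's CPU mask. A quick demonstration, not from the slides:

    # Any child of a numactl-bound process reports the restricted mask,
    # so MKL threads spawned by a bound rank stay on that socket
    numactl --cpunodebind=1 bash -c 'grep Cpus_allowed_list /proc/self/status'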

Slide 27: What happened with 0-1-1? (figure: the MKL threads alone)

Slide 28: What happened with 0-1-1? Two idle cores and one oversubscribed socket! (figure: ranks 1 and 2 stack their MKL threads on CPU #1 while cores on CPU #0 sit idle)

Slide 29: NUMA Impact on Apps (chart)

Slide 30: Well… (chart; axis label: time)

Slide 31: Can we improve utilization by sharing a Fermi among multiple tasks?

Slide 32: Bandwidth of Most Bottlenecked Task (chart)

Slide 33: Is the second IO hub worth it?

Slide 34: Is the second IO hub worth it? Aggregate bandwidth to the GPUs is 16.9 GB/s. What about real app behavior?
- Scenario A ("HPL"): 1 MPI task and 1 GPU task per GPU
- Scenario B: Scenario A plus 1 MPI task for each remaining core
(A launch sketch for both scenarios follows.)
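
How the two scenarios might be launched, as a hedged sketch: ./app is a hypothetical binary, pin.sh is the wrapper sketched at slide 21, and the rank counts assume a node with 3 GPUs and two six-core sockets:

    # Scenario A: one MPI rank driving each of the 3 GPUs
    mpirun -np 3 ./pin.sh ./app
    # Scenario B: scenario A plus one CPU-only rank per remaining core
    # (12 cores total: 3 GPU-driving ranks + 9 CPU-only ranks)
    mpirun -np 12 ./pin.sh ./app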

Slide 35: Contention Penalty (chart)

Slide 36: Puzzler – Pinning Redux. Do ranks 1 and 2 always have a long path?

Slide 37: Puzzler – Pinning Redux. Do ranks 1 and 2 always have a long path? (figure: CPU #0, GPU #1, and an IOH)

Slide 38: Puzzler – Pinning Redux. Do ranks 1 and 2 always have a long path? (figure: the full picture adds CPU #1 and the IOH with InfiniBand; which path is long depends on whether the rank is talking to its GPU or to the network)

Slide 39: Split MPI and GPU – MPI Latency (chart)

Slide 40: Split MPI and GPU – PCIe Bandwidth (chart)

Slide 41: Takeaways
- Dual IO hubs deliver, but add complexity
- Ignoring the complexity will sink some apps: wrong pinning sank HPL, and bandwidth-bound kernels and "function offload" apps are the most exposed
- Threads and libnuma can help, but can be tedious to use (a combined pinning sketch follows)
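
One way to take some of the tedium out of per-rank setup is to fold GPU selection into the same wrapper. A sketch extending the pin.sh example from slide 21; the 0-1-1 node mapping and the one-GPU-per-rank assignment via CUDA_VISIBLE_DEVICES are assumptions for illustration:

    #!/bin/bash
    # pin_gpu.sh -- launch as: mpirun -np 3 ./pin_gpu.sh ./prog
    rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
    # Give local rank i physical GPU i; inside the app, device 0
    # then refers to that GPU
    export CUDA_VISIBLE_DEVICES=$rank
    # 0-1-1 mapping: rank 0 -> NUMA node 0, ranks 1 and 2 -> node 1
    if [[ $rank -eq 0 ]]; then node=0; else node=1; fi
    exec numactl --cpunodebind=$node --membind=$node "$@"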

Slide 42: Thanks! kys@ornl.gov http://kylespafford.com/

