Slide 1: Resource Utilization in Large Scale InfiniBand Jobs
Galen M. Shipman, Los Alamos National Labs
LAUR-07-2873
Slide 2: The Problem
- InfiniBand specifies that receive resources are consumed in order, regardless of message size (see the sketch below)
- Small messages may therefore consume much larger receive buffers
- At very large scale, many applications are dominated by small message transfers
- Message sizes vary substantially from job to job, and even from rank to rank
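The in-order consumption the slide describes means every posted receive must be sized for the largest expected message. A minimal libibverbs sketch of that fixed-size posting pattern is below; it is not the presenter's code, and BUF_SIZE, NUM_BUFS, and post_fixed_size_recvs are illustrative assumptions:

```c
/* Minimal sketch: posting fixed-size receive buffers to a shared receive
 * queue (SRQ) with libibverbs.  Because the HCA consumes receive WQEs
 * strictly in posting order, a 64-byte message still retires an entire
 * BUF_SIZE buffer. */
#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

#define BUF_SIZE (64 * 1024)   /* every posted buffer is this large (assumed) */
#define NUM_BUFS 256           /* illustrative buffer count */

static int post_fixed_size_recvs(struct ibv_pd *pd, struct ibv_srq *srq)
{
    for (int i = 0; i < NUM_BUFS; i++) {
        void *buf = malloc(BUF_SIZE);
        if (!buf)
            return -1;

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                                       IBV_ACCESS_LOCAL_WRITE);
        if (!mr)
            return -1;

        struct ibv_sge sge = {
            .addr   = (uintptr_t) buf,
            .length = BUF_SIZE,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr = {
            .wr_id   = (uint64_t) i,
            .sg_list = &sge,
            .num_sge = 1,
        };
        struct ibv_recv_wr *bad_wr = NULL;

        /* Each incoming send consumes the next WQE regardless of its size. */
        if (ibv_post_srq_recv(srq, &wr, &bad_wr))
            return -1;
    }
    return 0;
}
```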
Slide 3: Receive Buffer Efficiency (figure)
Slide 4: Implications for SRQ
- A flood of small messages may exhaust SRQ resources
- Probability of RNR NAK increases
- Stalls the pipeline; performance degrades
- Wasted resource utilization
- Application may not complete within its allotted time slot (12+ hours for some jobs)
Slide 5: Why not just tune the buffer size?
- There is no "one size fits all" solution!
- Message size patterns differ based on: the number of processes in the parallel job, the input deck, and each rank's identity/function in the parallel job
- Need to balance optimization between performance and memory footprint
- Tuning for each application run is not acceptable
Slide 6: What Do Users Want?
- Optimal performance is important, but predictability at "acceptable" performance is more important
- HPC users want a default, "good enough" solution
- Parameter tweaking is fine for papers, not for our end users
- Parameter explosion: 48 OpenFabrics-related OMPI driver parameters, plus many other OMPI parameters
Slide 7: What Do Others Do?
- Portals: a contiguous memory region for unexpected messages (receiver-managed offset semantics)
- Myrinet GM: variable-size receive buffers can be allocated; the sender specifies which size receive buffer to consume (SIZE and PRIORITY fields)
- Quadrics Elan: TPORTS manages pools of buffers of various sizes; on receipt of an unexpected message, a buffer is chosen from the relevant pool
Slide 8: Bucket-SRQ
- Inspired by standard bucket allocation methods
- Multiple "buckets" of receive descriptors are created across multiple SRQs, each associated with a different buffer size (see the sketch below)
- A small pool of per-peer resources is also allocated
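A minimal verbs sketch of the bucket idea on this slide follows. It is not the Open MPI implementation; the bucket sizes, counts, and setup_bucket_srqs are illustrative assumptions. The point it shows: one SRQ per bucket, each pre-posted with buffers of a single size, so small messages land in small buffers instead of consuming large ones.

```c
/* Sketch of a bucket-SRQ layout: one SRQ per buffer-size class. */
#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

struct bucket {
    size_t          buf_size;   /* receive buffer size for this bucket */
    uint32_t        num_bufs;   /* how many buffers to pre-post         */
    struct ibv_srq *srq;        /* dedicated SRQ for this bucket        */
};

/* Illustrative bucket layout (an assumption, not the presenter's tuning). */
static struct bucket buckets[] = {
    { .buf_size =   128, .num_bufs = 1024 },
    { .buf_size =  4096, .num_bufs =  256 },
    { .buf_size = 65536, .num_bufs =   64 },
};

static int setup_bucket_srqs(struct ibv_pd *pd)
{
    for (size_t b = 0; b < sizeof(buckets) / sizeof(buckets[0]); b++) {
        struct ibv_srq_init_attr attr = {
            .attr = {
                .max_wr  = buckets[b].num_bufs,
                .max_sge = 1,
            },
        };
        buckets[b].srq = ibv_create_srq(pd, &attr);
        if (!buckets[b].srq)
            return -1;

        /* Pre-post num_bufs receive buffers of exactly buf_size bytes. */
        for (uint32_t i = 0; i < buckets[b].num_bufs; i++) {
            void *buf = malloc(buckets[b].buf_size);
            struct ibv_mr *mr = buf ? ibv_reg_mr(pd, buf, buckets[b].buf_size,
                                                 IBV_ACCESS_LOCAL_WRITE)
                                    : NULL;
            if (!mr)
                return -1;

            struct ibv_sge sge = {
                .addr   = (uintptr_t) buf,
                .length = (uint32_t) buckets[b].buf_size,
                .lkey   = mr->lkey,
            };
            struct ibv_recv_wr wr = { .sg_list = &sge, .num_sge = 1 };
            struct ibv_recv_wr *bad_wr = NULL;

            if (ibv_post_srq_recv(buckets[b].srq, &wr, &bad_wr))
                return -1;
        }
    }
    return 0;
}
```

In a layout like this, each connection's QP would be created with its ibv_qp_init_attr.srq pointing at the SRQ for its size class, and the sender-side protocol would target the smallest bucket that fits the message; the small per-peer pool the slide mentions is not shown here.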
Slide 9: Bucket-SRQ (figure)
Slide 10: Performance Implications
- Good overall performance
- Decreased (or no) RNR NAKs from draining the SRQ
- Never triggers the "SRQ limit reached" event (see the sketch below)
- Latency penalty for SRQ of roughly 1 usec
- A large number of QPs may not be efficient; still investigating the impact of high QP count on performance
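The "SRQ limit reached" event referenced above is the verbs low-watermark mechanism: software arms srq_limit, and the HCA raises an asynchronous event when the number of posted receives falls below it. A minimal sketch of arming and handling that event follows; it is an illustration, not Open MPI's code, and repost_to_srq is a hypothetical helper:

```c
/* Arm an SRQ low-watermark and refill the SRQ when the asynchronous
 * "SRQ limit reached" event fires.  Bucket-SRQ keeps enough buffers
 * posted that, per the slide, this event is never actually seen. */
#include <stdint.h>
#include <infiniband/verbs.h>

extern int repost_to_srq(struct ibv_srq *srq);  /* hypothetical helper */

/* Fire the event when fewer than `watermark` receives remain posted. */
static int arm_srq_limit(struct ibv_srq *srq, uint32_t watermark)
{
    struct ibv_srq_attr attr = { .srq_limit = watermark };
    return ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
}

/* Block on the device's async event queue and refill on the limit event. */
static void srq_event_loop(struct ibv_context *ctx, uint32_t watermark)
{
    struct ibv_async_event ev;

    while (ibv_get_async_event(ctx, &ev) == 0) {
        if (ev.event_type == IBV_EVENT_SRQ_LIMIT_REACHED) {
            struct ibv_srq *srq = ev.element.srq;
            repost_to_srq(srq);
            arm_srq_limit(srq, watermark);  /* the limit disarms after firing */
        }
        ibv_ack_async_event(&ev);
    }
}
```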
Slide 11: Results
- Evaluation applications: SAGE (DOE/LANL application), Sweep3D (DOE/LANL application), NAS Parallel Benchmarks (benchmark suite)
- Instrumented Open MPI
- Measured receive buffer efficiency: size of receive buffer / size of data received
Slide 12: SAGE: Hydrodynamics
- SAGE: SAIC's Adaptive Grid Eulerian hydrocode
- Hydrodynamics code with Adaptive Mesh Refinement (AMR)
- Applied to water shock, energy coupling, hydro-instability problems, etc.
- Routinely run on thousands of processors
- Scaling characteristic: weak
- Data decomposition (default): 1-D (of a 3-D AMR spatial grid)
- "Predictive Performance and Scalability Modeling of a Large-Scale Application", D.J. Kerbyson, H.J. Alme, A. Hoisie, F. Petrini, H.J. Wasserman, M. Gittings, in Proc. SC, Denver, 2001
- Courtesy: PAL Team, LANL
Slide 13: SAGE
- Adaptive Mesh Refinement (AMR) hydro-code
- Three repeated phases: gather data (including processor boundary data), compute, scatter data (send back results)
- 3-D spatial grid, partitioned in 1-D
- Parallel characteristics: message sizes vary, typically 10s to 100s of KB; distance between neighbors increases with scale
- Courtesy: PAL Team, LANL
Slide 14: SAGE: Receive Buffer Usage, 256 Processes (figure)
Slide 15: SAGE: Receive Buffer Usage, 4096 Processes (figure)
Slide 16: SAGE: Receive Buffer Efficiency (figure)
Slide 17: SAGE: Performance (figure)
Slide 18: Sweep3D
- 3-D spatial grid, partitioned in 2-D
- Pipelined wavefront processing, with a dependency in the 'sweep' direction
- Parallel characteristics: logical neighbors in X and Y; small message sizes, typically 100s of bytes
- Number of processors determines the pipeline length (Px + Py)
- 2-D example (figure)
- Courtesy: PAL Team, LANL
Slide 19: Sweep3D: Wavefront Algorithm
- Characterized by a dependency in cell processing
- Direction of the wavefront can change; a sweep can start from any corner point
- (Figure label: previously processed wavefront edge)
- Courtesy: PAL Team, LANL
Slide 20: Sweep3D: Receive Buffer Usage, 256 Processes (figure)
Slide 21: Sweep3D: Receive Buffer Efficiency (figure)
Slide 22: Sweep3D: Performance (figure)
Slide 23: NPB Receive Buffer Usage, Class D, 256 Processes (figure)
Slide 24: NPB Receive Buffer Efficiency, Class D, 256 Processes (figure; the IS benchmark is not available for Class D)
Slide 25: NPB Performance Results, Class D, 256 Processes (figure)
Slide 26: Conclusions
- Bucket-SRQ provides good performance at scale
- A "one size fits most" solution: eliminates the need to custom-tune each run
- Minimizes the receive buffer memory footprint (no more than 25 MB was allocated for any run)
- Avoids RNR NAKs in the communication patterns we examined
Slide 27: Future Work
- Take advantage of the ConnectX SRC feature to reduce the number of active QPs
- Further examine our protocol at 4K+ processor counts on SNL's Thunderbird cluster