Slide 1: Resource Utilization in Large Scale InfiniBand Jobs
Galen M. Shipman, Los Alamos National Labs
LAUR-07-2873
Slide 2: The Problem
- InfiniBand specifies that receive resources are consumed in order, regardless of message size (see the sketch below)
- Small messages may therefore consume much larger receive buffers
- At very large scale, many applications are dominated by small message transfers
- Message sizes vary substantially from job to job, and even from rank to rank
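The in-order consumption the slide describes means every posted receive must be sized for the largest expected message. A minimal libibverbs sketch of that fixed-size posting pattern is below; it is not the presenter's code, and BUF_SIZE, NUM_BUFS, and post_fixed_size_recvs are illustrative assumptions:

```c
/* Minimal sketch: posting fixed-size receive buffers to a shared receive
 * queue (SRQ) with libibverbs.  Because the HCA consumes receive WQEs
 * strictly in posting order, a 64-byte message still retires an entire
 * BUF_SIZE buffer. */
#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

#define BUF_SIZE (64 * 1024)   /* every posted buffer is this large (assumed) */
#define NUM_BUFS 256           /* illustrative buffer count */

static int post_fixed_size_recvs(struct ibv_pd *pd, struct ibv_srq *srq)
{
    for (int i = 0; i < NUM_BUFS; i++) {
        void *buf = malloc(BUF_SIZE);
        if (!buf)
            return -1;

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                                       IBV_ACCESS_LOCAL_WRITE);
        if (!mr)
            return -1;

        struct ibv_sge sge = {
            .addr   = (uintptr_t) buf,
            .length = BUF_SIZE,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr = {
            .wr_id   = (uint64_t) i,
            .sg_list = &sge,
            .num_sge = 1,
        };
        struct ibv_recv_wr *bad_wr = NULL;

        /* Each incoming send consumes the next WQE regardless of its size. */
        if (ibv_post_srq_recv(srq, &wr, &bad_wr))
            return -1;
    }
    return 0;
}
```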
Slide 3: Receive Buffer Efficiency (figure)
Slide 4: Implications for SRQ
- A flood of small messages may exhaust SRQ resources
- Probability of RNR NAK increases
- Stalls the pipeline; performance degrades
- Wasted resource utilization
- Application may not complete within its allotted time slot (12+ hours for some jobs)
Slide 5: Why not just tune the buffer size?
- There is no "one size fits all" solution!
- Message size patterns differ based on: the number of processes in the parallel job, the input deck, and each rank's identity/function in the parallel job
- Need to balance optimization between performance and memory footprint
- Tuning for each application run is not acceptable
Slide 6: What Do Users Want?
- Optimal performance is important, but predictability at "acceptable" performance is more important
- HPC users want a default, "good enough" solution
- Parameter tweaking is fine for papers, not for our end users
- Parameter explosion: 48 OpenFabrics-related OMPI driver parameters, plus many other OMPI parameters
Slide 7: What Do Others Do?
- Portals: a contiguous memory region for unexpected messages (receiver-managed offset semantics)
- Myrinet GM: variable-size receive buffers can be allocated; the sender specifies which size receive buffer to consume (SIZE and PRIORITY fields)
- Quadrics Elan: TPORTS manages pools of buffers of various sizes; on receipt of an unexpected message, a buffer is chosen from the relevant pool
Slide 8: Bucket-SRQ
- Inspired by standard bucket allocation methods
- Multiple "buckets" of receive descriptors are created across multiple SRQs, each associated with a different buffer size (see the sketch below)
- A small pool of per-peer resources is also allocated
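A minimal verbs sketch of the bucket idea on this slide follows. It is not the Open MPI implementation; the bucket sizes, counts, and setup_bucket_srqs are illustrative assumptions. The point it shows: one SRQ per bucket, each pre-posted with buffers of a single size, so small messages land in small buffers instead of consuming large ones.

```c
/* Sketch of a bucket-SRQ layout: one SRQ per buffer-size class. */
#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

struct bucket {
    size_t          buf_size;   /* receive buffer size for this bucket */
    uint32_t        num_bufs;   /* how many buffers to pre-post         */
    struct ibv_srq *srq;        /* dedicated SRQ for this bucket        */
};

/* Illustrative bucket layout (an assumption, not the presenter's tuning). */
static struct bucket buckets[] = {
    { .buf_size =   128, .num_bufs = 1024 },
    { .buf_size =  4096, .num_bufs =  256 },
    { .buf_size = 65536, .num_bufs =   64 },
};

static int setup_bucket_srqs(struct ibv_pd *pd)
{
    for (size_t b = 0; b < sizeof(buckets) / sizeof(buckets[0]); b++) {
        struct ibv_srq_init_attr attr = {
            .attr = {
                .max_wr  = buckets[b].num_bufs,
                .max_sge = 1,
            },
        };
        buckets[b].srq = ibv_create_srq(pd, &attr);
        if (!buckets[b].srq)
            return -1;

        /* Pre-post num_bufs receive buffers of exactly buf_size bytes. */
        for (uint32_t i = 0; i < buckets[b].num_bufs; i++) {
            void *buf = malloc(buckets[b].buf_size);
            struct ibv_mr *mr = buf ? ibv_reg_mr(pd, buf, buckets[b].buf_size,
                                                 IBV_ACCESS_LOCAL_WRITE)
                                    : NULL;
            if (!mr)
                return -1;

            struct ibv_sge sge = {
                .addr   = (uintptr_t) buf,
                .length = (uint32_t) buckets[b].buf_size,
                .lkey   = mr->lkey,
            };
            struct ibv_recv_wr wr = { .sg_list = &sge, .num_sge = 1 };
            struct ibv_recv_wr *bad_wr = NULL;

            if (ibv_post_srq_recv(buckets[b].srq, &wr, &bad_wr))
                return -1;
        }
    }
    return 0;
}
```

In a layout like this, each connection's QP would be created with its ibv_qp_init_attr.srq pointing at the SRQ for its size class, and the sender-side protocol would target the smallest bucket that fits the message; the small per-peer pool the slide mentions is not shown here.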
Slide 9: Bucket-SRQ (figure)
Slide 10: Performance Implications
- Good overall performance
- Decreased (or no) RNR NAKs from draining the SRQ
- Never triggers the "SRQ limit reached" event (see the sketch below)
- Latency penalty for SRQ of roughly 1 usec
- A large number of QPs may not be efficient; still investigating the impact of high QP count on performance
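The "SRQ limit reached" event referenced above is the verbs low-watermark mechanism: software arms srq_limit, and the HCA raises an asynchronous event when the number of posted receives falls below it. A minimal sketch of arming and handling that event follows; it is an illustration, not Open MPI's code, and repost_to_srq is a hypothetical helper:

```c
/* Arm an SRQ low-watermark and refill the SRQ when the asynchronous
 * "SRQ limit reached" event fires.  Bucket-SRQ keeps enough buffers
 * posted that, per the slide, this event is never actually seen. */
#include <stdint.h>
#include <infiniband/verbs.h>

extern int repost_to_srq(struct ibv_srq *srq);  /* hypothetical helper */

/* Fire the event when fewer than `watermark` receives remain posted. */
static int arm_srq_limit(struct ibv_srq *srq, uint32_t watermark)
{
    struct ibv_srq_attr attr = { .srq_limit = watermark };
    return ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
}

/* Block on the device's async event queue and refill on the limit event. */
static void srq_event_loop(struct ibv_context *ctx, uint32_t watermark)
{
    struct ibv_async_event ev;

    while (ibv_get_async_event(ctx, &ev) == 0) {
        if (ev.event_type == IBV_EVENT_SRQ_LIMIT_REACHED) {
            struct ibv_srq *srq = ev.element.srq;
            repost_to_srq(srq);
            arm_srq_limit(srq, watermark);  /* the limit disarms after firing */
        }
        ibv_ack_async_event(&ev);
    }
}
```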
Slide 11: Results
- Evaluation applications: SAGE (DOE/LANL application), Sweep3D (DOE/LANL application), NAS Parallel Benchmarks (benchmark suite)
- Instrumented Open MPI
- Measured receive buffer efficiency: size of receive buffer / size of data received
Slide 12: SAGE: Hydrodynamics
- SAGE: SAIC's Adaptive Grid Eulerian hydrocode
- Hydrodynamics code with Adaptive Mesh Refinement (AMR)
- Applied to water shock, energy coupling, hydro-instability problems, etc.
- Routinely run on thousands of processors
- Scaling characteristic: weak
- Data decomposition (default): 1-D (of a 3-D AMR spatial grid)
- "Predictive Performance and Scalability Modeling of a Large-Scale Application", D.J. Kerbyson, H.J. Alme, A. Hoisie, F. Petrini, H.J. Wasserman, M. Gittings, in Proc. SC, Denver, 2001
- Courtesy: PAL Team, LANL
Slide 13: SAGE
- Adaptive Mesh Refinement (AMR) hydro-code
- Three repeated phases: gather data (including processor boundary data), compute, scatter data (send back results)
- 3-D spatial grid, partitioned in 1-D
- Parallel characteristics: message sizes vary, typically 10s to 100s of KB; distance between neighbors increases with scale
- Courtesy: PAL Team, LANL
Slide 14: SAGE: Receive Buffer Usage, 256 Processes (figure)
Slide 15: SAGE: Receive Buffer Usage, 4096 Processes (figure)
Slide 16: SAGE: Receive Buffer Efficiency (figure)
Slide 17: SAGE: Performance (figure)
Slide 18: Sweep3D
- 3-D spatial grid, partitioned in 2-D
- Pipelined wavefront processing, with a dependency in the 'sweep' direction
- Parallel characteristics: logical neighbors in X and Y; small message sizes, typically 100s of bytes
- Number of processors determines the pipeline length (Px + Py)
- 2-D example (figure)
- Courtesy: PAL Team, LANL
Slide 19: Sweep3D: Wavefront Algorithm
- Characterized by a dependency in cell processing
- Direction of the wavefront can change; a sweep can start from any corner point
- (Figure label: previously processed wavefront edge)
- Courtesy: PAL Team, LANL
Slide 20: Sweep3D: Receive Buffer Usage, 256 Processes (figure)
Slide 21: Sweep3D: Receive Buffer Efficiency (figure)
Slide 22: Sweep3D: Performance (figure)
Slide 23: NPB Receive Buffer Usage, Class D, 256 Processes (figure)
Slide 24: NPB Receive Buffer Efficiency, Class D, 256 Processes (figure; the IS benchmark is not available for Class D)
Slide 25: NPB Performance Results, Class D, 256 Processes (figure)
Slide 26: Conclusions
- Bucket-SRQ provides good performance at scale
- A "one size fits most" solution: eliminates the need to custom-tune each run
- Minimizes the receive buffer memory footprint (no more than 25 MB was allocated for any run)
- Avoids RNR NAKs in the communication patterns we examined
Slide 27: Future Work
- Take advantage of the ConnectX SRC feature to reduce the number of active QPs
- Further examine our protocol at 4K+ processor counts on SNL's Thunderbird cluster