Resource Utilization in Large Scale InfiniBand Jobs
Galen M. Shipman, Los Alamos National Labs (LAUR-07-2873)

2 The Problem
- InfiniBand specifies that receive resources are consumed in order, regardless of message size
- Small messages may therefore consume much larger receive buffers (see the worked example below)
- At very large scale, many applications are dominated by small message transfers
- Message sizes vary substantially from job to job, and even from rank to rank
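For a feel for the waste involved (the 8 KiB buffer size here is an assumed illustration, not a figure from the talk), a 128-byte message that lands in an 8 KiB posted receive buffer leaves almost all of the buffer unused:

    \frac{128\ \text{B}}{8192\ \text{B}} \approx 1.6\%\quad\text{of the posted buffer actually carries data}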

3 Receive Buffer Efficiency

4 Implication for SRQ
- A flood of small messages may exhaust SRQ resources
- The probability of an RNR NAK (Receiver Not Ready negative acknowledgement) increases; the relevant QP settings are sketched below
  - Stalls the pipeline
  - Performance degrades
- Wasted resource utilization
- The application may not complete within its allotted time slot (12+ hours for some jobs)
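For reference, the RNR behaviour is not something MPI can tune away at run time; it is fixed per reliable-connection QP when the connection is brought up. A minimal sketch of the two knobs involved (values are illustrative, and every other attribute required by these transitions is omitted):

    #include <string.h>
    #include <infiniband/verbs.h>

    /* The receiver advertises a minimum back-off (min_rnr_timer, an IB-encoded
     * interval) at the RTR transition; the sender's retry budget (rnr_retry,
     * where 7 means "retry indefinitely") is set at the RTS transition. */
    static void set_rnr_attributes(struct ibv_qp *qp)
    {
        struct ibv_qp_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.qp_state      = IBV_QPS_RTR;
        attr.min_rnr_timer = 12;            /* encoded back-off after an RNR NAK */
        ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_MIN_RNR_TIMER
                                 /* | path, remote QPN, PSN, MTU, ... */);

        memset(&attr, 0, sizeof(attr));
        attr.qp_state  = IBV_QPS_RTS;
        attr.rnr_retry = 7;                 /* keep retrying after RNR NAKs */
        ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_RNR_RETRY
                                 /* | timeout, retry_cnt, SQ PSN, ... */);
    }

Even with an unlimited retry budget, every RNR NAK forces the sender to back off for the advertised interval, which is the pipeline stall and performance loss the slide refers to.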

5 Why not just tune the buffer size?
- There is no “one size fits all” solution!
- Message size patterns differ based on:
  - The number of processes in the parallel job
  - The input deck
  - Identity / function within the parallel job
- We need to balance the optimization between:
  - Performance
  - Memory footprint
- Tuning for each application run is not acceptable

6 What Do Users Want?
- Optimal performance is important
  - But predictability at “acceptable” performance is more important
- HPC users want a default, “good enough” solution
  - Parameter tweaking is fine for papers
  - Not for our end users
- Parameter explosion:
  - OMPI OpenFabrics-related driver parameters: 48
  - OMPI other parameters: …many…

7 What Do Others Do?
- Portals
  - Contiguous memory region for unexpected messages (receiver-managed offset semantics)
- Myrinet GM
  - Variable-size receive buffers can be allocated
  - The sender specifies which size of receive buffer to consume (SIZE and PRIORITY fields)
- Quadrics Elan
  - TPORTS manages pools of buffers of various sizes
  - On receipt of an unexpected message, a buffer is chosen from the relevant pool

8 Bucket-SRQ
- Inspired by standard bucket allocation methods
- Multiple “buckets” of receive descriptors are created across multiple SRQs (sketched below)
  - Each SRQ is associated with a different buffer size
- A small pool of per-peer resources is also allocated
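A minimal sketch of the bucket idea in libibverbs terms (the bucket sizes, depths, and helper name are illustrative assumptions, not the values or code used in Open MPI): one SRQ is created per bucket, and each SRQ is pre-posted with receive buffers of that bucket's size.

    #include <stdint.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    /* Illustrative bucket sizes; the real sizes and depths are tunable. */
    static const size_t bucket_sizes[] = { 256, 4096, 65536 };

    /* Create the SRQ for one bucket and pre-post num_bufs receive buffers of
     * buf_size bytes to it.  Error handling and teardown are omitted. */
    static struct ibv_srq *create_bucket(struct ibv_pd *pd, size_t buf_size,
                                         uint32_t num_bufs)
    {
        struct ibv_srq_init_attr init = {
            .attr = { .max_wr = num_bufs, .max_sge = 1 }
        };
        struct ibv_srq *srq = ibv_create_srq(pd, &init);
        if (srq == NULL)
            return NULL;

        for (uint32_t i = 0; i < num_bufs; i++) {
            void *buf = malloc(buf_size);
            struct ibv_mr *mr = ibv_reg_mr(pd, buf, buf_size,
                                           IBV_ACCESS_LOCAL_WRITE);
            struct ibv_sge sge = {
                .addr   = (uintptr_t) buf,
                .length = (uint32_t) buf_size,
                .lkey   = mr->lkey,
            };
            struct ibv_recv_wr wr = {
                .wr_id   = (uintptr_t) buf,
                .sg_list = &sge,
                .num_sge = 1,
            };
            struct ibv_recv_wr *bad_wr;
            ibv_post_srq_recv(srq, &wr, &bad_wr);
        }
        return srq;
    }
    /* One call per entry in bucket_sizes yields the per-size SRQ "buckets". */

Presumably each bucket's SRQ is then reached through its own set of QPs, and the sender picks the QP whose bucket best fits the outgoing message; that would also explain why the QP count grows, a cost noted on the performance slide and revisited under future work (ConnectX SRC).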

9 Bucket-SRQ

10 Performance Implications
- Good overall performance
- Decreased/no RNR NAKs from draining the SRQ
  - Never triggers the “SRQ limit reached” event (arming that event is sketched below)
- Latency penalty for SRQ: ~1 usec
- A large number of QPs may not be efficient
  - Still investigating the impact of a high QP count on performance
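For context, the “SRQ limit reached” event is the verbs-level low-watermark warning that an SRQ is about to run dry. A hedged sketch of how it is armed (the helper name and the choice of watermark are illustrative):

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Once fewer than `limit` receives remain posted on the SRQ, the HCA
     * raises an IBV_EVENT_SRQ_LIMIT_REACHED asynchronous event, giving a
     * progress thread a chance to repost buffers before senders start
     * receiving RNR NAKs. */
    static int arm_srq_low_watermark(struct ibv_srq *srq, uint32_t limit)
    {
        struct ibv_srq_attr attr = { .srq_limit = limit };
        return ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
    }

The slide's point is that with Bucket-SRQ this event never fired in the runs reported, i.e. none of the buckets ever came close to running dry.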

11 Results
- Evaluation applications:
  - SAGE (DOE/LANL application)
  - Sweep3D (DOE/LANL application)
  - NAS Parallel Benchmarks (benchmark suite)
- Instrumented Open MPI
  - Measured receive buffer efficiency: size of receive buffer / size of data received

12 SAGE: Hydrodynamics
- SAGE: SAIC’s Adaptive Grid Eulerian hydrocode
- Hydrodynamics code with Adaptive Mesh Refinement (AMR)
- Applied to water shock, energy coupling, hydro instability problems, etc.
- Routinely run on 1,000s of processors
- Scaling characteristic: weak
- Data decomposition (default): 1-D (of a 3-D AMR spatial grid)
“Predictive Performance and Scalability Modeling of a Large-Scale Application”, D.J. Kerbyson, H.J. Alme, A. Hoisie, F. Petrini, H.J. Wasserman, M. Gittings, in Proc. SC, Denver, 2001
Courtesy: PAL Team, LANL

13 SAGE
- Adaptive Mesh Refinement (AMR) hydro-code
- Three repeated phases:
  - Gather data (including processor boundary data)
  - Compute
  - Scatter data (send back the results)
- 3-D spatial grid, partitioned in 1-D
- Parallel characteristics:
  - Message sizes vary, typically on the order of KBytes
  - Distance between neighbors increases with scale
Courtesy: PAL Team, LANL

14 SAGE: Receive Buffer Usage, 256 Processes

15 SAGE: Receive Buffer Usage, 4096 Processes

16 SAGE: Receive Buffer Efficiency

17 SAGE: Performance

18 Sweep3D
- 3-D spatial grid, partitioned in 2-D
- Pipelined wavefront processing
  - Dependency in the ‘sweep’ direction
- Parallel characteristics:
  - Logical neighbors in X and Y
  - Small message sizes: 100s of bytes (typical)
  - The number of processors determines the pipeline length (P_X + P_Y)
- 2-D example (figure)
Courtesy: PAL Team, LANL

19 Sweep3D: Wavefront Algorithm
- Characterized by a dependency in cell processing (a skeleton of the exchange follows)
- The direction of the wavefront can change
  - A sweep can start from any corner point
(Figure: previously processed cells and the current wavefront edge)
Courtesy: PAL Team, LANL
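A hedged skeleton of one wavefront step in MPI terms (function and variable names are placeholders; the real Sweep3D kernel is Fortran and considerably more involved): each rank waits on its upstream X and Y neighbours, processes its block, then forwards the boundary data downstream, so the work pipelines diagonally across the 2-D process grid.

    #include <mpi.h>

    /* Placeholder for the per-block sweep kernel. */
    static void sweep_block(const double *in_x, const double *in_y,
                            double *out_x, double *out_y, int n)
    {
        (void) in_x; (void) in_y; (void) out_x; (void) out_y; (void) n;
    }

    /* One wavefront step on a 2-D process grid: receive boundary data from
     * the upstream X and Y neighbours, compute the local block, then forward
     * the results downstream.  Boundary ranks pass MPI_PROC_NULL, which turns
     * the corresponding calls into no-ops. */
    static void wavefront_step(MPI_Comm comm,
                               int up_x, int up_y, int down_x, int down_y,
                               double *in_x, double *in_y,
                               double *out_x, double *out_y, int n)
    {
        /* These boundary messages are the "100s of bytes" transfers noted above. */
        MPI_Recv(in_x, n, MPI_DOUBLE, up_x, 0, comm, MPI_STATUS_IGNORE);
        MPI_Recv(in_y, n, MPI_DOUBLE, up_y, 1, comm, MPI_STATUS_IGNORE);

        sweep_block(in_x, in_y, out_x, out_y, n);

        MPI_Send(out_x, n, MPI_DOUBLE, down_x, 0, comm);
        MPI_Send(out_y, n, MPI_DOUBLE, down_y, 1, comm);
    }

This pattern is why Sweep3D stresses the small-message buckets: every step of the pipeline is a pair of short boundary exchanges rather than a few large transfers.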

20 Sweep3D: Receive Buffer Usage, 256 Processes

21 Sweep3D: Receive Buffer Efficiency

22 Sweep3D: Performance

23 NPB: Receive Buffer Usage, Class D, 256 Processes

24 NPB: Receive Buffer Efficiency, Class D, 256 Processes (the IS benchmark is not available for Class D)

25 NPB: Performance Results, Class D, 256 Processes

26 Conclusions
- Bucket-SRQ provides:
  - Good performance at scale
  - A “one size fits most” solution (eliminates the need to custom-tune each run)
  - A minimal receive buffer memory footprint (no more than 25 MB was allocated for any run)
  - No RNR NAKs in the communication patterns we examined

27 Future Work
- Take advantage of the ConnectX SRC feature to reduce the number of active QPs
- Further examine our protocol at 4K+ processor counts on SNL’s Thunderbird cluster