Load Balancing and Topology Aware Mapping for Petascale Machines
Abhinav S Bhatele
Parallel Programming Laboratory, Illinois
Outline
– Scalable Load Balancing
– Current Supercomputers and their Topologies
– Topology Aware Mapping: Motivation, Examples, Tools and Techniques
Load Balancing
Load Imbalance can severely hurt performance
– The time to completion is bound by the slowest (most loaded) processor
– Hence, when dividing 121 units of work among 10 processors:
  – 12 units to 9 processors and 13 units to 1 (simple round-robin scheme)
  – 13 units to 9 processors and 4 units to 1 (blocking into chunks of 13)
– In both cases the maximum load is 13 units, which bounds the completion time (see the sketch below)
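A minimal sketch (not from the talk) that reproduces the arithmetic above: it distributes the 121 units both ways and reports the maximum per-processor load, which is what bounds the completion time.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Completion time is bound by the most loaded (slowest) processor.
int maxLoad(const std::vector<int>& loads) {
  return *std::max_element(loads.begin(), loads.end());
}

int main() {
  const int work = 121, p = 10;

  // Round-robin: unit i goes to processor i % p.
  std::vector<int> rr(p, 0);
  for (int i = 0; i < work; ++i) rr[i % p]++;

  // Simple blocking: contiguous chunks of ceil(work/p) = 13 units.
  std::vector<int> blk(p, 0);
  const int chunk = (work + p - 1) / p;
  for (int i = 0; i < work; ++i) blk[std::min(i / chunk, p - 1)]++;

  std::printf("round-robin: max load = %d\n", maxLoad(rr));   // 13
  std::printf("blocking:    max load = %d\n", maxLoad(blk));  // 13
  return 0;
}
```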
Load Balancing Strategies
– Static: achieved through the initial data distribution
– Dynamic:
  – Blocks of data which can be moved around during the execution
  – Measurement-based
  – Automatic (by the runtime)
Another classification
– Balance computational load
– Balance computational as well as communication load
  – Minimize communication volume on the network
– Minimize communication traffic
  – Bytes passing through each link on the network
Scalable Load Balancing
– Centralized strategies perform better but take longer
  – Memory/communication bottleneck
– Distributed strategies are fast but have incomplete knowledge
– Hybrid/hierarchical strategies allow using different strategies at different levels
Hierarchical Strategies
– Apply different strategies at each level
[Figure: K-processor hierarchical tree with groups of 1024 processors; Level 0, Level 1, ...]
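A minimal two-level sketch of the hierarchical idea (illustrative only, not the Charm++ hierarchical balancer): level 1 equalizes load across processor groups using only group totals, and level 0 then balances within each group, so no single balancer ever sees per-object data for the whole machine.

```cpp
#include <numeric>
#include <vector>

// Level 1: balance across groups using only one aggregate number per group.
std::vector<double> balanceAcrossGroups(const std::vector<double>& groupLoads) {
  double total = std::accumulate(groupLoads.begin(), groupLoads.end(), 0.0);
  // Target: every group owns an equal share of the total load.
  return std::vector<double>(groupLoads.size(), total / groupLoads.size());
}

// Level 0: balance within one group toward its target; a real strategy would
// move objects greedily or refine, rather than simply equalizing.
void balanceWithinGroup(std::vector<double>& procLoads, double groupTarget) {
  double perProc = groupTarget / procLoads.size();
  for (double& l : procLoads) l = perProc;
}
```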
References
– Trilinos (developed at Sandia)
– Charm++ Load Balancing Framework
Topology Aware Mapping
Current Machines and their Topologies
– 3D Mesh: Cray XT3/4/5
– 3D Torus: Blue Gene/L, Blue Gene/P
– Fat-tree, Clos network: InfiniBand, Federation
– Kautz graph: SiCortex
– Future topologies?
RoadRunner
Blue Gene/P
– Midplane: smallest unit with a torus in all dimensions: 8 x 8 x 8
– All bigger partitions are formed from this unit of 512 nodes
– Intrepid at ANL: 40,960 nodes
Cray XT5
– The full installation is a torus
– Typically the batch scheduler does not allocate cuboidal shapes
– Jaguar (XT4) at ORNL: 8,064 nodes, torus dimensions 21 x 16 x 24
– Jaguarpf (XT5) at ORNL: 21 x 32 x 24
Is topology important?
– For machines as large as these? Yes.
– For all applications? No.
Motivation
– Consider a 3D mesh/torus interconnect
– Message latencies can be modeled by (L_f / B) x D + L / B
  – L_f = length of a flit, B = link bandwidth, D = number of hops, L = message size
– When (L_f x D) << L, the first term is negligible
– But in the presence of contention ... (see the sketch below)
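A small illustration of the model with assumed numbers (the flit size, bandwidth, and message size below are made up for the example, not measurements): the per-hop term stays far below the size-dependent term until contention lowers the effective bandwidth.

```cpp
#include <cstdio>

// Latency model from the slide: T = (L_f / B) * D + L / B.
int main() {
  const double Lf = 32;          // flit size in bytes (assumed)
  const double B  = 425e6;       // link bandwidth in bytes/s (assumed)
  const double L  = 64 * 1024;   // message size: 64 KB (assumed)
  for (int D = 1; D <= 16; D *= 2) {
    double perHop = (Lf / B) * D;   // distance-dependent term
    double size   = L / B;          // bandwidth (size-dependent) term
    std::printf("D=%2d  per-hop term = %.3f us, size term = %.1f us\n",
                D, perHop * 1e6, size * 1e6);
  }
  // For L_f * D << L the per-hop term is negligible -- until contention
  // reduces the effective bandwidth B on shared links.
  return 0;
}
```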
Validation through Benchmarks
MPI Benchmarks†
– Quantification of message latencies and their dependence on hops
  – No sharing of links (no contention)
  – Sharing of links (with contention)
† Abhinav Bhatele, Laxmikant V. Kale, An Evaluation of the Effect of Interconnect Topologies on Message Latencies in Large Supercomputers, In Workshop on Large-Scale Parallel Processing (IPDPS), 2009.
WOCON: No Contention
– A master rank sends messages to all other ranks, one at a time (with replies); a minimal MPI sketch of this pattern follows
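A minimal MPI sketch of the pattern described (not the original benchmark code); the message size is an assumed placeholder and the per-destination timing would be recorded against the hop count to that rank.

```cpp
#include <mpi.h>
#include <vector>

// Rank 0 exchanges a message with every other rank, one partner at a time,
// so no two messages share a link concurrently.
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int bytes = 64 * 1024;            // message size (assumed)
  std::vector<char> buf(bytes);

  if (rank == 0) {
    for (int dest = 1; dest < size; ++dest) {
      double t0 = MPI_Wtime();
      MPI_Send(buf.data(), bytes, MPI_CHAR, dest, 0, MPI_COMM_WORLD);
      MPI_Recv(buf.data(), bytes, MPI_CHAR, dest, 0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
      double rtt = MPI_Wtime() - t0;       // latency vs. hops to 'dest'
      (void)rtt;                           // record/print per-destination time
    }
  } else {
    MPI_Recv(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Send(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}
```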
WOCON: Results
[Plots: measured latencies on ANL Blue Gene/P and PSC XT3 compared against the model (L_f / B) x D + L / B]
WICON: With Contention
– Divide all ranks into pairs and everyone sends to their respective partner simultaneously
– Two pairings: Near Neighbor (NN) and Random (RND)
(a minimal MPI sketch follows)
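A minimal MPI sketch of the contended pattern with NN pairing (not the original benchmark; it assumes an even number of ranks, and the RND variant would instead draw partners from a random permutation of the ranks).

```cpp
#include <mpi.h>
#include <vector>

// Every rank is paired with a partner and all pairs exchange messages at the
// same time, so messages compete for links.
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int bytes = 64 * 1024;                          // message size (assumed)
  std::vector<char> sbuf(bytes), rbuf(bytes);
  int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;  // NN pairing; assumes even size

  MPI_Barrier(MPI_COMM_WORLD);                          // start all pairs together
  double t0 = MPI_Wtime();
  MPI_Sendrecv(sbuf.data(), bytes, MPI_CHAR, partner, 0,
               rbuf.data(), bytes, MPI_CHAR, partner, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  double t = MPI_Wtime() - t0;                          // compare NN vs. RND timings
  (void)t;
  MPI_Finalize();
  return 0;
}
```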
WICON: Results
[Plots: NN and RND latencies on ANL Blue Gene/P and PSC XT3]
Stressing-a-link Benchmark
– Controlled congestion: quantify contention effects on a particular link
[Plots: stressing-a-link results on ANL Blue Gene/P and PSC XT3]
Equidistant-pairs Benchmark
– Pair each rank with a partner which is 'n' hops away (see the sketch below)
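A hedged sketch of one way to build such pairs on an X x Y x Z torus; the row-major rank-to-coordinate mapping and the blocked pairing along X are assumptions for illustration, not the benchmark's actual scheme.

```cpp
// Pair ranks along the X dimension in blocks of n: block 2k pairs with
// block 2k+1 element-wise, so every pair is exactly n hops apart
// (requires X % (2*n) == 0). A real run would take the rank <-> coordinate
// mapping from the system or the TopoManager API instead of assuming it.
int partnerRank(int rank, int n, int X, int Y, int Z) {
  int x = rank / (Y * Z);                 // row-major coordinates (assumed)
  int y = (rank / Z) % Y;
  int z = rank % Z;
  int px = ((x / n) % 2 == 0) ? x + n : x - n;   // symmetric pairing, n hops apart
  (void)X;                                 // assumes 0 <= px < X
  return px * Y * Z + y * Z + z;
}
```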
Blue Gene/P
[Plot: equidistant-pairs benchmark results]
Cray XT3
[Plot: equidistant-pairs benchmark results]
Can we avoid congestion? Yes, if we can minimize the distance each message travels.
Solution: topology aware mapping
– Initially
– At runtime (as a part of load balancing)
Example Application: NAMD
Molecular Dynamics
– A system of [charged] atoms with bonds
– Use Newtonian mechanics to find the positions and velocities of atoms
– Each time-step is typically a few femtoseconds
– At each time step:
  – calculate the forces on all atoms
  – calculate the velocities and move the atoms around
(a minimal sketch of one such step follows)
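A minimal sketch of one such time step (illustrative only; computeForces stands in for the bonded and non-bonded force evaluation described on the next slide).

```cpp
#include <vector>

struct Vec3 { double x, y, z; };

// One integration step: compute forces at the current positions, then update
// velocities and positions. dt is typically on the order of a femtosecond.
void timestep(std::vector<Vec3>& pos, std::vector<Vec3>& vel,
              const std::vector<double>& mass, double dt,
              void (*computeForces)(const std::vector<Vec3>&, std::vector<Vec3>&)) {
  std::vector<Vec3> f(pos.size());
  computeForces(pos, f);                       // forces on all atoms
  for (size_t i = 0; i < pos.size(); ++i) {
    vel[i].x += dt * f[i].x / mass[i];         // update velocities ...
    vel[i].y += dt * f[i].y / mass[i];
    vel[i].z += dt * f[i].z / mass[i];
    pos[i].x += dt * vel[i].x;                 // ... then move the atoms
    pos[i].y += dt * vel[i].y;
    pos[i].z += dt * vel[i].z;
  }
}
```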
NAMD: NAnoscale Molecular Dynamics
– Naïve force calculation is O(N^2)
– Reduced to O(N log N) by calculating:
  – Bonded forces
  – Non-bonded forces, using a cutoff radius
    – Short-range: calculated every time step
    – Long-range: calculated every fourth time-step (PME)
NAMD's Parallel Design
– Hybrid of spatial and force decomposition
Parallelization using Charm++
[Figure: NAMD decomposition in Charm++; labels: Static Mapping, Load Balancing]
Bhatele, A., Kumar, S., Mei, C., Phillips, J. C., Zheng, G. & Kale, L. V., Overcoming Scaling Challenges in Biomolecular Simulations across Multiple Platforms. In Proceedings of IEEE International Parallel and Distributed Processing Symposium, Miami, FL, USA, April 2008.
Communication in NAMD
– Each patch multicasts its information to many computes
– Each compute is a target of only two multicasts
– Use 'proxies' to send data to different computes on the same processor
Topology Aware Techniques
– Static placement of patches
Topology Aware Techniques (contd.)
– Placement of computes
Load Balancing in Charm++
– Principle of Persistence: object communication patterns and computational loads tend to persist over time
– Measurement-based load balancing:
  – Instrument computation time and communication volume at runtime
  – Use the database to make new load balancing decisions
NAMD's Load Balancing Strategy
– NAMD uses a dynamic, centralized, greedy strategy
– There are two schemes in play:
  – A comprehensive strategy (called once)
  – A refinement scheme (called several times during a run)
– Algorithm: pick a compute and find a "suitable" processor to place it on
Choice of a suitable processor
– Among underloaded processors, try to (in order from highest to lowest priority):
  – Find a processor with the two patches or their proxies
  – Find a processor with one patch or a proxy
  – Pick any underloaded processor
(a sketch of this priority order follows)
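A sketch of this priority order for placing one compute; the Proc structure and the average-load test are illustrative stand-ins, not NAMD's actual data structures.

```cpp
#include <vector>

// Among underloaded processors, prefer one that already holds both of the
// compute's patches (or proxies for them), then one with one of them,
// then any underloaded processor.
struct Proc {
  double load;
  std::vector<int> patchesOrProxies;   // patch ids available locally
  bool has(int patch) const {
    for (int p : patchesOrProxies) if (p == patch) return true;
    return false;
  }
};

int pickProcessor(const std::vector<Proc>& procs, double avgLoad,
                  int patchA, int patchB) {
  int oneMatch = -1, anyUnderloaded = -1;
  for (int i = 0; i < (int)procs.size(); ++i) {
    if (procs[i].load >= avgLoad) continue;            // only underloaded procs
    bool a = procs[i].has(patchA), b = procs[i].has(patchB);
    if (a && b) return i;                              // highest priority
    if ((a || b) && oneMatch < 0) oneMatch = i;        // middle priority
    if (anyUnderloaded < 0) anyUnderloaded = i;        // lowest priority
  }
  return oneMatch >= 0 ? oneMatch : anyUnderloaded;
}
```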
Load Balancing Metrics
– Load balance: bring the max-to-average ratio close to 1
– Communication volume: minimize the number of proxies
– Communication traffic: minimize hop-bytes
  – Hop-bytes = message size x distance traveled by the message (see the sketch below)
Agarwal, T., Sharma, A., Kale, L.V., Topology-aware task mapping for reducing communication contention on large parallel machines, In Proceedings of IEEE International Parallel and Distributed Processing Symposium, Rhodes Island, Greece, April 2006.
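A small sketch of the hop-bytes metric, assuming Manhattan distance on a 3D mesh (on a torus each coordinate term would wrap around its dimension).

```cpp
#include <cstdlib>
#include <vector>

// Hop-bytes for a set of messages: sum over messages of
// (message size in bytes) x (hops between source and destination).
struct Msg { int sx, sy, sz, dx, dy, dz; long bytes; };

long hopBytes(const std::vector<Msg>& msgs) {
  long total = 0;
  for (const Msg& m : msgs) {
    int hops = std::abs(m.sx - m.dx) + std::abs(m.sy - m.dy) +
               std::abs(m.sz - m.dz);                  // mesh distance
    total += m.bytes * hops;
  }
  return total;
}
```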
Results: Hop-bytes
Results: Performance
Abhinav Bhatele and Laxmikant V. Kale. Dynamic Topology Aware Load Balancing Algorithms for MD Applications. In Proceedings of International Conference on Supercomputing, 2009.
OpenAtom
– Ab-initio molecular dynamics code
– Communication is static and structured
– Challenge: multiple groups of objects with conflicting communication patterns
Parallelization using Charm++
Eric Bohm, Glenn J. Martyna, Abhinav Bhatele, Sameer Kumar, Laxmikant V. Kale, John A. Gunnels, and Mark E. Tuckerman. Fine Grained Parallelization of the Car-Parrinello ab initio MD Method on Blue Gene/L. IBM J. of R. and D.: Applications of Massively Parallel Systems, 52(1/2), 2008.
Abhinav Bhatele, Eric Bohm, Laxmikant V. Kale, A Case Study of Communication Optimizations on 3D Mesh Interconnects, To appear in Proceedings of Euro-Par (Topic 13 - High Performance Networks), 2009.
Topology Mapping of Chare Arrays
[Figure: state-wise communication and plane-wise communication among the chare arrays]
Joint work with Eric J. Bohm
Results on Blue Gene/P (ANL)
Results on XT3
Tools and Techniques
Topology Aware Mapping
Broadly three requirements:
– Processor topology graph
– Object communication graph
– Efficient and scalable mapping algorithms
Application Characteristics
– Computation-bound applications
– Communication-heavy applications
  – Latency tolerant: NAMD
  – Latency sensitive: OpenAtom
Topology Manager API
– The application needs information such as:
  – Dimensions of the partition
  – Rank to physical coordinates and vice-versa
– TopoManager: a uniform API
  – On BG/L and BG/P: provides a wrapper for system calls
  – On XT3/4/5, there are no such system calls
  – Provides a clean and uniform interface to the application
TopoMgrAPI
– getDimNX(), getDimNY(), getDimNZ(), getDimNT()
– rankToCoordinates(), coordinatesToRank()
– getHopsBetweenRanks()
(a hedged usage sketch follows)
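A hedged sketch of how an application might use the calls listed above; the header name and exact signatures are assumptions based on the slide, not verified against the library.

```cpp
// Header name and signatures assumed from the call list on the slide.
#include "TopoManager.h"

void inspectPlacement(int myRank, int peerRank) {
  TopoManager tmgr;
  int nx = tmgr.getDimNX(), ny = tmgr.getDimNY(),
      nz = tmgr.getDimNZ(), nt = tmgr.getDimNT();   // dimensions of the partition

  int x, y, z, t;
  tmgr.rankToCoordinates(myRank, x, y, z, t);        // where does this rank sit physically?
  int hops = tmgr.getHopsBetweenRanks(myRank, peerRank);

  // ... use the dimensions, coordinates, and hop count to drive a mapping decision ...
  (void)nx; (void)ny; (void)nz; (void)nt; (void)hops;
}
```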
Object Communication Graph
– Obtaining this graph:
  – Manually
  – Profiling (e.g. IBM's HPCT tools)
  – Charm++'s instrumentation framework
– Visualizing the graph
– Pattern matching
Communication Scenarios
– Static regular graph: 3D stencil / structured mesh applications such as POP, MILC, WRF
– Static irregular graph: unstructured mesh applications such as UMT2K
– Dynamic communication graph: MD codes, unstructured mesh codes with AMR
– Use pattern matching to get an idea of the pattern
Simple 2D/3D Graph
Many science applications have a near-neighbor communication pattern:
– POP: Parallel Ocean Program
– MILC: MIMD Lattice Computation
– WRF: Weather Research and Forecasting
WRF Communication Graph
Folding of 2D/3D to 3D
– For more objects than processors: blocking techniques
– SMP nodes?
Hao Yu, I-Hsin Chung and Jose Moreira, Topology Mapping for Blue Gene/L Supercomputer, In Proceedings of the ACM/IEEE Supercomputing Conference, 2006.
(a sketch of one folding scheme follows)
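One possible realization of folding with blocking (the scheme and names are illustrative, not taken from the cited paper): a 2D object grid is folded onto a 3D processor mesh by splitting the object Y dimension across the processor Y and Z dimensions with a snake ordering, so neighboring objects land on neighboring processors, and bx x by blocking handles more objects than processors.

```cpp
// Map object (ox, oy) from an OX x OY grid onto a PX x PY x PZ mesh.
// Assumes OX = bx*PX and OY = by*PY*PZ.
struct Proc3D { int x, y, z; };

Proc3D mapObject(int ox, int oy, int PX, int PY, int PZ, int bx, int by) {
  int px = ox / bx;                       // block index along X
  int q  = oy / by;                       // linear block index in [0, PY*PZ)
  int py = q / PZ;
  int pz = (py % 2 == 0) ? (q % PZ)       // snake (boustrophedon) fold keeps
                         : (PZ - 1 - q % PZ);  // consecutive blocks adjacent
  (void)PX; (void)PY;                     // bounds assumed to hold
  return Proc3D{px, py, pz};
}
```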
Irregular Graphs
– Heuristic techniques:
  – Pairwise exchanges
  – Greedy strategies
Automatic Mapping Framework
– Obtain the communication graph (from a previous run, or at runtime)
– Pattern matching to identify communication patterns
– Apply heuristic techniques for different scenarios to arrive at a mapping solution automatically
Summary
– Scalable load balancing: hierarchical strategies
– Machines of the petascale era have 'exploitable' topologies
– MPI benchmarks show that contention for the same links by different messages slows all of them down
– Topology aware mapping can avoid contention; carefully check whether an application would benefit from it
– Tools and techniques:
  – TopoMgrAPI
  – Obtaining the object graph: instrumentation
  – Pattern matching
  – Techniques: folding, heuristics
Summary
1. Topology is important again
2. Even on fast interconnects such as Cray's
3. In the presence of contention, bandwidth occupancy affects message latencies significantly
4. The effect increases with the number of hops each message travels
5. Topology Manager API: a uniform API for IBM and Cray machines
6. Different algorithms depending on the communication patterns
7. Eventually, an automatic mapping framework
Acknowledgements:
1. Argonne National Laboratory: Pete Beckman, Tisha Stacey
2. Pittsburgh Supercomputing Center: Chad Vizino, Shawn Brown
3. Oak Ridge National Laboratory: Patrick Worley, Donald Frederick
4. Grants: DOE Grant B funded by CSAR, DOE grant DE-FG05-08OR23332 through ORNL LCF, and a NIH Grant PHS 5 P41 RR for Molecular Dynamics
References:
1. Abhinav Bhatele, Laxmikant V. Kale, Quantifying Network Contention on Large Parallel Machines, Parallel Processing Letters (Special issue on Large-Scale Parallel Processing).
2. Abhinav Bhatele, Laxmikant V. Kale, Benefits of Topology-aware Mapping for Mesh Topologies, Parallel Processing Letters (Special issue on Large-Scale Parallel Processing), Vol. 18, Issue 4, 2008.
E-mail: bhatele@illinois.edu
Thanks!