Effects of Contention on Message Latencies in Large Supercomputers

Slides:

Advertisements

Similar presentations

Resource Management §A resource can be a logical, such as a shared file, or physical, such as a CPU (a node of the distributed system). One of the functions.

Advertisements

1 Advancing Supercomputer Performance Through Interconnection Topology Synthesis Yi Zhu, Michael Taylor, Scott B. Baden and Chung-Kuan Cheng Department.

Mapping Communication Layouts to Network Hardware Characteristics on Massive-Scale Blue Gene Systems Pavan Balaji*, Rinku Gupta*, Abhinav Vishnu + and.

Abhinav Bhatele, Laxmikant V. Kale University of Illinois at Urbana-Champaign Sameer Kumar IBM T. J. Watson Research Center.

Abhinav Bhatele, Laxmikant V. Kale University of Illinois at Urbana-Champaign Sameer Kumar IBM T. J. Watson Research Center.

Dr. Gengbin Zheng and Ehsan Totoni Parallel Programming Laboratory University of Illinois at Urbana-Champaign April 18, 2011.

A CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS University of Illinois at Urbana-Champaign Abhinav Bhatele, Eric Bohm, Laxmikant V.

Automating Topology Aware Task Mapping on Large Parallel Machines Abhinav S Bhatele Advisor: Laxmikant V. Kale University of Illinois at Urbana-Champaign.

Load Balancing and Topology Aware Mapping for Petascale Machines Abhinav S Bhatele Parallel Programming Laboratory, Illinois.

Topology Aware Mapping for Performance Optimization of Science Applications Abhinav S Bhatele Parallel Programming Lab, UIUC.

A Framework for Collective Personalized Communication Laxmikant V. Kale, Sameer Kumar, Krishnan Varadarajan.

Charm++ Load Balancing Framework Gengbin Zheng Parallel Programming Laboratory Department of Computer Science University of Illinois at.

© Fujitsu Laboratories of Europe 2009 HPC and Chaste: Towards Real-Time Simulation 24 March

Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links Ernie Chan.

Application-specific Topology-aware Mapping for Three Dimensional Topologies Abhinav Bhatelé Laxmikant V. Kalé.

Chapter 3 Parallel Algorithm Design. Outline Task/channel model Task/channel model Algorithm design methodology Algorithm design methodology Case studies.

© 2010 IBM Corporation Enabling Concurrent Multithreaded MPI Communication on Multicore Petascale Systems Gabor Dozsa 1, Sameer Kumar 1, Pavan Balaji 2,

 Collectives on Two-tier Direct Networks EuroMPI – 2012 Nikhil Jain, JohnMark Lau, Laxmikant Kale 26 th September, 2012.

A Fault Tolerant Protocol for Massively Parallel Machines Sayantan Chakravorty Laxmikant Kale University of Illinois, Urbana-Champaign.

Dynamic Load Balancing in Charm++ Abhinav S Bhatele Parallel Programming Lab, UIUC.

Charm Workshop CkDirect: Charm++ RDMA Put Presented by Eric Bohm CkDirect Team: Eric Bohm, Sayantan Chakravorty, Pritish Jetley, Abhinav Bhatele.

Overcoming Scaling Challenges in Bio-molecular Simulations Abhinav Bhatelé Sameer Kumar Chao Mei James C. Phillips Gengbin Zheng Laxmikant V. Kalé.

Workshop BigSim Large Parallel Machine Simulation Presented by Eric Bohm PPL Charm Workshop 2004.

University of Illinois at Urbana-Champaign Memory Architectures for Protein Folding: MD on million PIM processors Fort Lauderdale, May 03,

Using Charm++ to Mask Latency in Grid Computing Applications Gregory A. Koenig Parallel Programming Laboratory Department.

IPDPS Workshop: Apr 2002PPL-Dept of Computer Science, UIUC A Parallel-Object Programming Model for PetaFLOPS Machines and BlueGene/Cyclops Gengbin Zheng,

Scalable and Topology-Aware Load Balancers in Charm++ Amit Sharma Parallel Programming Lab, UIUC.

Super computers Parallel Processing

A uGNI-Based Asynchronous Message- driven Runtime System for Cray Supercomputers with Gemini Interconnect Yanhua Sun, Gengbin Zheng, Laximant(Sanjay) Kale.

1 Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet CAC Workshop Santa Fe, NM, 2004 Sameer Kumar* and Laxmikant.

Effects of contention on message latencies in large supercomputers Abhinav S Bhatele and Laxmikant V Kale Parallel Programming Laboratory, UIUC IS TOPOLOGY.

1 Optimizing Quantum Chemistry using Charm++ Eric Bohm Parallel Programming Laboratory Department of Computer Science University.

On the Placement of Web Server Replicas Yu Cai. Paper On the Placement of Web Server Replicas Lili Qiu, Venkata N. Padmanabhan, Geoffrey M. Voelker Infocom.

Hierarchical Load Balancing for Large Scale Supercomputers Gengbin Zheng Charm++ Workshop 2010 Parallel Programming Lab, UIUC 1Charm++ Workshop 2010.

Effects of contention on message latencies in large supercomputers Abhinav S Bhatele and Laxmikant V Kale Parallel Programming Laboratory, UIUC IS TOPOLOGY.

Interconnection Networks Communications Among Processors.

A CASE STUDY IN USING MASSIVELY PARALLEL SIMULATION FOR EXTREME- SCALE TORUS NETWORK CO-DESIGN Misbah Mubarak, Rensselaer Polytechnic Institute Christopher.

COMP8330/7330/7336 Advanced Parallel and Distributed Computing Communication Costs in Parallel Machines Dr. Xiao Qin Auburn University

Massively Parallel Cosmological Simulations with ChaNGa Pritish Jetley, Filippo Gioachin, Celso Mendes, Laxmikant V. Kale and Thomas Quinn.

BAHIR DAR UNIVERSITY Institute of technology Faculty of Computing Department of information technology Msc program Distributed Database Article Review.

Auburn University

Parallel Architecture

Interconnect Networks

Multiprocessor Systems

Auburn University COMP8330/7330/7336 Advanced Parallel and Distributed Computing Interconnection Networks (Part 2) Dr.

Interconnection topologies

Parallel Density-based Hybrid Clustering

Pablo Abad, Pablo Prieto, Valentin Puente, Jose-Angel Gregorio

uGNI-based Charm++ Runtime for Cray Gemini Interconnect

Performance Evaluation of Adaptive MPI

Welcome Final Year Project Oral Presentation

Jun Doi Tokyo Research Laboratory IBM Research

BlueGene/L Supercomputer

Mayank Bhatt, Jayasi Mehar

Title Meta-Balancer: Automated Selection of Load Balancing Strategies

Agent-based Model Simulation with Twister

Milind A. Bhandarkar Adaptive MPI Milind A. Bhandarkar

Interconnection Network Design Lecture 14

Communication operations

Parallelization of CPAIMD using Charm++

Advanced Computer Architecture 5MD00 / 5Z032 Multi-Processing 2

CS 6290 Many-core & Interconnect

Birds Eye View of Interconnection Networks

BigSim: Simulating PetaFLOPS Supercomputers

Gengbin Zheng, Esteban Meneses, Abhinav Bhatele and Laxmikant V. Kale

IXPUG, SC’16 Lightning Talk Kavitha Chandrasekar*, Laxmikant V. Kale

Parallel Programming in C with MPI and OpenMP

An Orchestration Language for Parallel Objects

Support for Adaptivity in ARMCI Using Migratable Objects

Chapter 2 from ``Introduction to Parallel Computing'',

Presentation transcript:

Effects of Contention on Message Latencies in Large Supercomputers IS TOPOLOGY IMPORTANT AGAIN? Effects of Contention on Message Latencies in Large Supercomputers Abhinav S Bhatele and Laxmikant V Kale ACM Research Competition, SC ‘08

Outline Why should we consider topology aware mapping for optimizing performance? Demonstrate the effects of contention on message latencies through simple MPI benchmarks Obtaining topology information: TopoManager API Case Study: OpenAtom November 19th, 2008

The Mapping Problem Given a set of communicating parallel “entities”, map them onto physical processors Entities COMM_WORLD ranks in case of an MPI program Objects in case of a Charm++ program Aim Balance load Minimize communication traffic November 19th, 2008

Target Machines 3D torus/mesh interconnects Blue Gene/P at ANL: 40,960 nodes, torus - 32 x 32 x 40 XT4 (Jaguar) at ORNL: 8,064 nodes, torus - 21 x 16 x 24 Other interconnects Fat-tree Kautz graph: SiCortex November 19th, 2008

Motivation Consider a 3D mesh/torus interconnect Message latencies can be modeled by (Lf/B) x D + L/B Lf = length of flit, B = bandwidth, D = hops, L = message size When (Lf * D) << L, first term is negligible But in presence of contention … November 19th, 2008

MPI Benchmarks† Quantification of message latencies and dependence on hops No sharing of links (no contention) Sharing of links (with contention) † http://charm.cs.uiuc.edu/~bhatele/phd/contention.htm November 19th, 2008

WOCON: No contention A master rank sends messages to all other ranks, one at a time (with replies) November 19th, 2008

WOCON: Results (Lf/B) x D + L/B ANL Blue Gene/P ORNL XT4 PSC XT3 November 19th, 2008

WICON: With Contention Divide all ranks into pairs and everyone sends to their respective partner simultaneously Near Neighbor: NN Random: RND November 19th, 2008

WICON: Results ANL Blue Gene/P PSC XT3 BG/P = 1.737 times XT3 = 2.24 times ANL Blue Gene/P PSC XT3 November 19th, 2008

Message Latencies and Hops Pair each rank with a partner which is ‘n’ hops away November 19th, 2008

November 19th, 2008

Results 8 times 1 to 2 – 1.84 1 to 3 – 2.11 1 to 4 - 2.55 November 19th, 2008

Peak bandwidth of each link = 425 MB/s unidirectional November 19th, 2008

Difference from previous work Then Now Mainly for theoretical object graphs on hypercubes, shuffle exchange and other theoretical networks Object graphs from real applications on 3D torus/mesh topologies Most techniques were used offline – slow Fast, runtime solutions Demonstrated on graphs with 10-100 nodes Scalable techniques for large machines No cardinality variation Multiple objects per processor Not tested with real applications on actual machines – theoretical work Targeted at production codes – tested with real applications November 19th, 2008

Difference from recent work [5] G. Bhanot, A. Gara, P. Heidelberger, E. Lawless, J. C. Sexton, and R. Walkup. Optimizing task layout on the Blue Gene/L supercomputer. IBM Journal of Research and Development, 49(2/3), 2005. Use of simulated annealing – slow and solution is developed offline [6] Hao Yu, I-Hsin Chung, and Jose Moreira. Topology mapping for Blue Gene/L supercomputer. In SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 116, New York, NY, USA, 2006. ACM Node mappings for simple scenarios (1D rings, 2D meshes, 3D) Only useful in case of simple near-neighbor communication November 19th, 2008

Topology Manager API† The application needs information such as Dimensions of the partition Rank to physical co-ordinates and vice-versa TopoManager: a uniform API On BG/L and BG/P: provides a wrapper for system calls On XT3 and XT4, there are no such system calls Help from PSC and ORNL staff to discovery topology at runtime Provides a clean and uniform interface to the application This can be used by MPI Topology functions † http://charm.cs.uiuc.edu/~bhatele/phd/topomgr.htm November 19th, 2008

OpenAtom Ab-Initio Molecular Dynamics code Communication is static and structured Challenge: Multiple groups of objects with conflicting communication patterns November 19th, 2008

Parallelization using Charm++ [10] Eric Bohm, Glenn J. Martyna, Abhinav Bhatele, Sameer Kumar, Laxmikant V. Kale, John A. Gunnels, and Mark E. Tuckerman. Fine Grained Parallelization of the Car-Parrinello ab initio MD Method on Blue Gene/L. IBM J. of R. and D.: Applications of Massively Parallel Systems, 52(1/2):159-174, 2008. November 19th, 2008

PairCalculator GSpace RealSpace States Planes Planes States Planes November 19th, 2008

Topology Mapping of Chare Arrays State-wise communication Plane-wise communication Joint work with Eric J. Bohm November 19th, 2008

Results on Blue Gene/P (ANL) November 19th, 2008

Results on XT3 (BigBen) November 19th, 2008

Summary Topology is important again Even on fast interconnects such as Cray In presence of contention, bandwidth occupancy effects message latencies significantly Increases with the number of hops each message travels Topology Manager API: A uniform API for IBM and Cray machines Case Studies: OpenAtom, NAMD, Stencil Eventually, an automatic mapping framework November 19th, 2008

Acknowledgements: 1. Argonne National Lab: Pete Beckman, Tisha Stacey 2. Pittsburgh Supercomputing Center: Chad Vizino, Shawn Brown 3. Oak Ridge National Laboratory: Patrick Worley, Donald Frederick 4. IBM: Robert Walkup, Sameer Kumar 5. Cray: Larry Kaplan 6. SiCortex: Matt Reilly References: 1. Abhinav Bhatele, Laxmikant V. Kale, Dynamic Topology Aware Load Balancing Algorithms for MD Applications, submitted to Philosophical Transactions of the Royal Society A, 2008 2. Abhinav Bhatele, Laxmikant V. Kale, Benefits of Topology-aware Mapping for Mesh Topologies, LSPP special issue of Parallel Processing Letters, 2008 3. Abhinav Bhatele, Laxmikant V. Kale, Application-specific Topology-aware Mapping for Three Dimensional Topologies, Proceedings of Workshop on Large-Scale Parallel Processing (held as part of IPDPS '08), 2008