Characterizing MPI and Hybrid MPI+Threads Applications at Scale: Case Study with BFS

Abdelhalim Amer*, Huiwei Lu*, Pavan Balaji*, Satoshi Matsuoka+
*Argonne National Laboratory, IL, USA
+Tokyo Institute of Technology, Tokyo, Japan

PPMM'15, in conjunction with CCGRID'15, May 4-7, 2015, Shenzhen, Guangdong, China
Evolution of High-End Systems

Systems with massive core counts are already in production
– Tianhe-2: 3,120,000 cores
– Mira: 3,145,728 HW threads
Core density is increasing
Other resources do not scale at the same rate
– Memory per core is decreasing
– Network endpoints

[Figure: evolution of the memory capacity per core in the Top500 list [1]]

[1] Peter Kogge. PIM & Memory: The Need for a Revolution in Architecture. The Argonne Training Program on Extreme-Scale Computing (ATPESC).
Parallelism with Message Passing

[Figure: a problem domain decomposed and mapped onto a target architecture of four nodes (Node 0-3), each with four cores (Core 0-3)]
Domain Decomposition with MPI vs. MPI+X

MPI-only = core-granularity domain decomposition
– One process per core
– Process-to-process communication (single copy)
– Boundary data (extra memory)

MPI+X = node-granularity domain decomposition
– One process per node, with threads inside the process
– Threads share data within the process

[Figure: the two decompositions, showing processes, communication, boundary data, threads, and shared data]
Process Model vs. Threading Model with MPI

– The process model has inherent limitations
– Sharing is becoming a requirement
– Using threads requires careful thread-safety implementations

Processes:
– Data is all private
– Sharing requires extra work (e.g., MPI-3 shared memory; see the sketch after this slide)
– Communication is fine-grained (core-to-core)
– Space overhead is high (buffers, boundary data, MPI runtime, etc.)
– Contention only for system resources
– No thread-safety overheads

Threads:
– Global data is all shared
– Sharing is given; consistency is not, and it implies protection
– Communication is coarse-grained (typically node-to-node)
– Space overhead is reduced
– Contention for system resources and shared data
– The magnitude of thread-safety overheads depends on the application and the MPI runtime
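As a reference point for the "extra work" row above, the following is a minimal sketch (not from the slides) of sharing data among MPI processes on the same node through MPI-3 shared memory; the array sizes and variable names are illustrative.

/* Sketch: node-level sharing with MPI_Win_allocate_shared. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Group the processes of this node into one communicator. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Each rank contributes a segment; together they form one shared region. */
    MPI_Aint local_bytes = 1024 * sizeof(double);
    double *base;
    MPI_Win win;
    MPI_Win_allocate_shared(local_bytes, (int)sizeof(double), MPI_INFO_NULL,
                            node_comm, &base, &win);

    /* Query a directly addressable pointer to rank 0's segment. */
    MPI_Aint size;
    int disp_unit;
    double *rank0_seg;
    MPI_Win_shared_query(win, 0, &size, &disp_unit, (void **)&rank0_seg);

    /* Standard store/sync/barrier/sync/load pattern for shared-memory windows. */
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    if (node_rank == 0)
        rank0_seg[0] = 42.0;
    MPI_Win_sync(win);
    MPI_Barrier(node_comm);
    MPI_Win_sync(win);
    if (node_rank != 0)
        printf("rank %d reads %.1f from rank 0's segment\n",
               node_rank, rank0_seg[0]);
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}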
MPI + Threads Interoperation in the MPI Standard

– An MPI process is allowed to spawn multiple threads
– Threads share the same rank
– A thread blocking for communication must not block other threads
– Applications specify how threads interoperate with MPI by requesting one of four thread levels (see the sketch that follows):

MPI_THREAD_SINGLE     – No additional threads
MPI_THREAD_FUNNELED   – Only the master thread communicates
MPI_THREAD_SERIALIZED – Threaded communication is serialized
MPI_THREAD_MULTIPLE   – No restrictions

Moving down this list trades restrictions and low thread-safety costs for flexibility and high thread-safety costs.
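A minimal sketch (not from the slides) of how an application requests a thread level and checks what the library actually grants:

/* Sketch: requesting MPI_THREAD_MULTIPLE and verifying the provided level. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* The library may provide a lower level than requested; a hybrid code
     * that lets any thread call MPI must check for this. */
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not supported (provided = %d)\n",
                provided);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    /* ... threads may now make MPI calls concurrently ... */

    MPI_Finalize();
    return 0;
}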
Breadth-First Search and Graph500

– Search in a graph, visiting neighbors first
– Solves many problems in graph theory
– Graph500 benchmark
  – BFS kernel
  – Kronecker graph as input
– Communication: two-sided nonblocking

[Figure: "This small, synthetic graph was generated by a method called Kronecker multiplication. Larger versions of this generator, modeling real-world graphs, are used in the Graph500 benchmark." (Courtesy of Jeremiah Willcock, Indiana University) [Sandia National Laboratory]]
Breadth-First Search: Baseline Implementation

while (1) {
    Process_Current_Level();    /* expand the current frontier */
    Synchronize();              /* drain remaining communication */
    MPI_Allreduce(MPI_IN_PLACE, &QueueLength, 1, MPI_LONG, MPI_SUM,
                  MPI_COMM_WORLD);   /* global size of the next frontier */
    if (QueueLength == 0)
        break;
}
MPI-only to Hybrid BFS

Hybrid MPI + OpenMP version (a sketch of the thread-level queue handling follows):
– MPI_THREAD_MULTIPLE
– Shared read queue
– Private temporary write queues
– Private buffers
– Lock-free/atomic-free

MPI only:
while (1) {
    Process_Current_Level();
    Synchronize();
    MPI_Allreduce(MPI_IN_PLACE, &QueueLength, 1, MPI_LONG, MPI_SUM,
                  MPI_COMM_WORLD);
    if (QueueLength == 0)
        break;
}

Hybrid MPI + OpenMP:
while (1) {
    #pragma omp parallel
    {
        Process_Current_Level();
        Synchronize();
    }
    MPI_Allreduce(MPI_IN_PLACE, &QueueLength, 1, MPI_LONG, MPI_SUM,
                  MPI_COMM_WORLD);
    if (QueueLength == 0)
        break;
}
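The following is a minimal sketch of how the shared read queue and private write queues can be combined without locks or atomics; it is my own illustration, not the authors' code, and the names (frontier, next_frontier, visited, adj, nlocal) are hypothetical. Each thread scans a disjoint slice of the shared current-level queue, appends discovered vertices to a private next-level queue, and the private queues are merged into disjoint slots using a prefix sum of their lengths.

/* Sketch: called from inside an "#pragma omp parallel" region. */
#include <omp.h>
#include <stdlib.h>
#include <string.h>

#define MAX_THREADS 64

static long *frontier, frontier_len;      /* shared read queue (current level) */
static long *next_frontier, next_len;     /* shared write queue (next level)   */
static long thread_counts[MAX_THREADS];   /* per-thread queue lengths          */

void process_current_level(const long *adj_off, const long *adj,
                           unsigned char *visited, long nlocal)
{
    int tid = omp_get_thread_num();
    long *my_q = malloc(nlocal * sizeof(long));   /* private write queue;
                                                     nlocal is assumed to be an
                                                     upper bound on discoveries */
    long my_len = 0;

    /* Each thread reads a disjoint slice of the shared frontier. */
    #pragma omp for schedule(static)
    for (long i = 0; i < frontier_len; i++) {
        long u = frontier[i];
        for (long j = adj_off[u]; j < adj_off[u + 1]; j++) {
            long v = adj[j];
            if (!visited[v]) {    /* intentionally racy marking: a vertex may be
                                     enqueued more than once, but no lock or
                                     atomic is needed                           */
                visited[v] = 1;
                my_q[my_len++] = v;
            }
        }
    }                              /* implicit barrier */

    /* Lock-free merge: publish lengths, compute each thread's offset from a
     * prefix sum, then copy private queues into disjoint slots. */
    thread_counts[tid] = my_len;
    #pragma omp barrier

    long offset = 0;
    for (int t = 0; t < tid; t++)
        offset += thread_counts[t];
    memcpy(&next_frontier[offset], my_q, my_len * sizeof(long));
    free(my_q);

    #pragma omp single
    {
        next_len = 0;
        for (int t = 0; t < omp_get_num_threads(); t++)
            next_len += thread_counts[t];
    }
}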
Communication Characterization

Problem size = 2^26 vertices (SCALE = 26)

[Figures: communication volume (GB) and message count]
Target Platform

Architecture:      Blue Gene/Q
Processor:         PowerPC A2
Clock frequency:   1.6 GHz
Cores per node:    16
HW threads/core:   4
Number of nodes:   49,152
Memory/node:       16 GB
Interconnect:      Proprietary
Topology:          5D torus
Compiler:          GCC
MPI library:       MPICH
Network driver:    BG/Q V1R2M1

Notes:
– Memory per HW thread = 256 MB!
– In the following, we use 1 rank or thread per core
– MPICH uses a global critical section
Baseline Weak Scaling Performance

[Figure: baseline weak-scaling results]
Main Sources of Overhead

[Figures: overhead breakdown for MPI-only and MPI+Threads]
Non-Scalable Subroutines

Eager polling for communication progress: O(P) per call

Make_Progress() {
    MPI_Test(&recvreq, &flag, &status);
    if (flag)
        compute();
    for (each process P) {                /* O(P) scan of send requests */
        MPI_Test(&sendreq[P], &flag, &status);
        if (flag)
            buffer_free[P] = 1;
    }
}

Global synchronization: O(P^2) empty messages
(2.75 G messages for 512K cores)

Synchronize() {
    for (each process P)                  /* zero-byte message to every process */
        MPI_Isend(buf, 0, MPI_BYTE, P, tag, comm, &sendreq[P]);
    while (!all_procs_done)
        Check_Incoming_Msgs();
}
Fixing the Scalability Issues

– Use a lazy polling (LP) policy
– Use the MPI-3 nonblocking barrier (IB), as sketched below

[Figure: weak-scaling results with the fixes applied]
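The following is a minimal sketch, my own reconstruction rather than the authors' code, of replacing the all-to-all empty messages with MPI_Ibarrier: each process enters the nonblocking barrier once its local sends complete and keeps draining incoming messages until the barrier completes. It assumes sends are posted with MPI_Issend, so that send completion implies receipt (the usual NBX-style pattern); all_sends_done and process_incoming are hypothetical helpers.

/* Sketch: level termination via MPI_Ibarrier instead of O(P^2) empty messages. */
#include <mpi.h>

int  all_sends_done(void);                               /* hypothetical */
void process_incoming(const MPI_Status *st, MPI_Comm c); /* hypothetical */

void synchronize(MPI_Comm comm)
{
    MPI_Request barrier_req = MPI_REQUEST_NULL;
    int barrier_posted = 0, done = 0;

    while (!done) {
        /* Drain any incoming frontier messages. */
        int msg_flag;
        MPI_Status status;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &msg_flag, &status);
        if (msg_flag)
            process_incoming(&status, comm);

        if (!barrier_posted) {
            /* Enter the nonblocking barrier once all local sends completed
             * (synchronous sends, so completion implies the data arrived). */
            if (all_sends_done()) {
                MPI_Ibarrier(comm, &barrier_req);
                barrier_posted = 1;
            }
        } else {
            /* The level is finished when every process has reached the
             * barrier, i.e., no more messages can be in flight. */
            MPI_Test(&barrier_req, &done, MPI_STATUS_IGNORE);
        }
    }
}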
Thread Contention in the MPI Runtime

– Default: a global critical section, which avoids extra overheads in uncontended cases
– Fine-grained critical sections can be used for highly contended scenarios

A microbenchmark sketch of the measured operation follows below.

[Figure: MPI_Test latency]
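This is a minimal sketch of my own construction, not the benchmark used in the paper, for measuring MPI_Test latency when several threads poll their own pending requests concurrently; this is where the runtime's critical-section granularity shows up. The tag value and iteration count are arbitrary.

/* Sketch: each thread posts a receive that never matches and times repeated
 * MPI_Test calls; contention inside the MPI runtime inflates the latency. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define UNMATCHED_TAG 999
#define ITERS 100000

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    #pragma omp parallel
    {
        int dummy, flag;
        MPI_Request req;
        /* A receive that is never matched, so MPI_Test always returns false
         * but still enters the progress engine. */
        MPI_Irecv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, UNMATCHED_TAG,
                  MPI_COMM_WORLD, &req);

        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++)
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
        double t1 = MPI_Wtime();

        #pragma omp critical
        printf("thread %d: %.2f us per MPI_Test\n",
               omp_get_thread_num(), (t1 - t0) / ITERS * 1e6);

        MPI_Cancel(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}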
Performance with Fine-Grained Concurrency

[Figures: profiling with 1K nodes; weak-scaling performance]
Summary

– The coarse-grained MPI+X communication model is generally more scalable
– In BFS, for example, MPI+X reduced
  – the O(P) polling overhead
  – the O(P^2) empty messages for global synchronization
– The model does not fix the root scalability issues
– Thread-safety overheads can be a significant source of performance loss, but this is not inevitable:
  – Various techniques can be used to reduce thread contention and thread-safety overheads
  – We are actively working on improving multithreading support in MPICH (MPICH derivatives can benefit from it)
– Characterizing MPI+shared-memory vs. MPI+threads models is being considered for a future study