MT-MPI: Multithreaded MPI for Many-Core Environments
Min Si [1,2], Antonio J. Peña [2], Pavan Balaji [2], Masamichi Takagi [3], Yutaka Ishikawa [1]
[1] University of Tokyo, [2] Argonne National Laboratory, [3] NEC Corporation

Presentation Overview
– Background and Motivation
– Proposal and Challenges
– Design and Implementation
  – OpenMP Runtime Extension
  – MPI Internal Parallelism
– Evaluation
– Conclusion

Many-core Architectures

Massively parallel environments:
– Intel® Xeon Phi coprocessor: 60 cores on a single chip, 240 hardware threads; self-hosting in the next generation, native mode in the current version
– Blue Gene/Q: 16 cores per node, 64 hardware threads

Many lightweight cores are becoming a common model.

MPI Programming on Many-Core Architectures

Thread single mode — a single-threaded MPI process interleaves computation and MPI communication:
    /* user computation */
    MPI_Function( );
    /* user computation */

Funneled / serialized mode — threads parallelize the computation, but only one thread issues MPI calls:
    #pragma omp parallel
    {  /* user computation */  }
    MPI_Function( );
    #pragma omp parallel
    {  /* user computation */  }

Multithreading mode — every thread may issue MPI calls:
    #pragma omp parallel
    {
        /* user computation */
        MPI_Function( );
        /* user computation */
    }
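The threading level is what an application requests at startup. Below is a minimal sketch (not taken from the slides; the computation is a placeholder) showing how funneled mode is selected with the standard MPI_Init_thread call.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;

        /* Request funneled mode: threads compute, only the main thread calls MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        if (provided < MPI_THREAD_FUNNELED)
            printf("funneled threading not supported by this MPI library\n");

        #pragma omp parallel
        {
            /* user computation */
        }

        /* Only the main thread issues the MPI call in funneled mode. */
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }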

Problem in Funneled / Serialized Mode

Funneled / serialized mode:
– Multiple threads are created for user computation
– A single thread issues MPI calls

    #pragma omp parallel
    {  /* user computation */  }
    MPI_Function( );            /* <-- issued by one thread */
    #pragma omp parallel
    {  /* user computation */  }

Issues:
1. Many threads are IDLE during the MPI call!
2. A single lightweight core delivers poor performance.

Our Approach: Sharing Idle Threads with the Application inside MPI

    #pragma omp parallel
    {  /* user computation */  }
    MPI_Function( ) {
        #pragma omp parallel
        {  /* MPI internal task */  }
    }
    #pragma omp parallel
    {  /* user computation */  }

Challenges (1/2)

Some parallel algorithms are not efficient when too few threads are available, so a tradeoff is needed — but the number of available threads is UNKNOWN!

    #pragma omp parallel
    {
        /* user computation */
        #pragma omp single
        {  /* MPI_Calls */  }    /* implicit barrier at the end of the single section */
    }

Challenges (2/2)

Nested parallelism:
– Simply creates new Pthreads and offloads thread scheduling to the OS
– Causes a thread-overrunning issue: the inner region creates N new Pthreads instead of using ONLY the idle threads

    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp parallel
            { … }                /* creates N new Pthreads! */
        }
    }

Design and Implementation

Implementation is based on the Intel OpenMP runtime.
– OpenMP Runtime Extension
– MPI Internal Parallelism

Guaranteed Idle Threads vs. Temporarily Idle Threads

Guaranteed idle threads — guaranteed to remain idle until the current thread exits.
Temporarily idle threads — the current thread does not know when they may become active again.

Example 1 (threads waiting to enter critical are only temporarily idle):
    #pragma omp parallel
    {
        #pragma omp single nowait
        { … }
        #pragma omp critical
        { … }
    }

Example 2 (threads waiting at the implicit barrier after single are guaranteed idle):
    #pragma omp parallel
    {
        #pragma omp single
        { … }
    }

Example 3 (threads waiting to enter critical are only temporarily idle):
    #pragma omp parallel
    {
        #pragma omp critical
        { … }
    }

Expose Guaranteed Idle Threads

MPI uses the guaranteed idle threads to schedule its internal parallelism efficiently (e.g., change the algorithm, specify the number of threads).

    #pragma omp parallel
    #pragma omp single
    {
        MPI_Function {
            num_idle_threads = omp_get_num_guaranteed_idle_threads( );
            if (num_idle_threads < N) {
                /* Sequential algorithm */
            } else {
                #pragma omp parallel num_threads(num_idle_threads)
                { … }
            }
        }
    }

Design and Implementation
– OpenMP Runtime Extension
– MPI Internal Parallelism

Implementation is based on MPICH.
1. Derived datatype related functions
2. Shared memory communication
3. Network-specific optimizations

1. Derived Datatype Packing Processing

MPI_Pack / MPI_Unpack, and communication using derived datatypes:
– Transfer non-contiguous data (described by block_length, count, stride)
– Pack / unpack the data internally

    #pragma omp parallel for
    for (i = 0; i < count; i++) {
        dest[i] = src[i * stride];
    }
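As an illustration, the following minimal sketch (an assumed example, not from the slides) builds a strided datatype with MPI_Type_vector and packs it with MPI_Pack; the internal copy loop behind that MPI_Pack call is the kind of loop MT-MPI parallelizes.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int count = 1024, stride = 8;
        double *src    = calloc((size_t)count * stride, sizeof(double));
        double *packed = malloc((size_t)count * sizeof(double));

        /* count blocks of 1 double, separated by 'stride' doubles. */
        MPI_Datatype vec;
        MPI_Type_vector(count, 1, stride, MPI_DOUBLE, &vec);
        MPI_Type_commit(&vec);

        /* Gather the non-contiguous elements into a contiguous buffer. */
        int position = 0;
        MPI_Pack(src, 1, vec, packed, count * (int)sizeof(double), &position,
                 MPI_COMM_WORLD);

        MPI_Type_free(&vec);
        free(src);
        free(packed);
        MPI_Finalize();
        return 0;
    }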

2. Shared Memory Communication

Original sequential algorithm:
– Shared user-space buffer between processes (cells Cell[0]..Cell[3])
– Pipelined copy on both the sender side and the receiver side

Parallel algorithm:
– Get as many available cells as we can
– Parallelize the large data movement

(Figure: (a) sequential pipelining vs. (b) parallel pipelining between sender and receiver user buffers through the shared buffer cells.)
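A rough sketch of the parallel data movement idea, under the assumption of a fixed cell size and illustrative function names (this is not MPICH's actual code): the copy into one large cell is divided among the idle threads.

    #include <omp.h>
    #include <string.h>

    #define CELL_SIZE (64 * 1024)

    /* Split one cell's copy across the available idle threads. */
    static void parallel_cell_copy(char *cell, const char *user_buf, size_t len,
                                   int nthreads)
    {
        #pragma omp parallel num_threads(nthreads)
        {
            int    tid   = omp_get_thread_num();
            int    nt    = omp_get_num_threads();
            size_t chunk = (len + (size_t)nt - 1) / (size_t)nt;
            size_t off   = (size_t)tid * chunk;

            if (off < len) {
                size_t n = (off + chunk > len) ? len - off : chunk;
                memcpy(cell + off, user_buf + off, n);
            }
        }
    }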

Sequential Pipelining vs. Parallelism

Small data transfers (< 128 KB):
– Thread synchronization overhead > parallel improvement

Large data transfers:
– Data transferred using sequential fine-grained pipelining
– Data transferred using parallelism with only a few threads (worse)
– Data transferred using parallelism with many threads (better)

(Figure: data flow from sender buffer to shared buffer to receiver buffer.)

3. InfiniBand Communication

Structures:
– IB context
– Protection domain (PD)
– Queue pair (QP): 1 QP per connection
– Completion queue (CQ): shared by 1 or more QPs

RDMA communication:
– Post RDMA operations to a QP
– Poll completions from the CQ

OpenMP contention issue: in the MPICH stack (ADI3 / CH3 / nemesis netmods: SHM, TCP, IB), access to these IB structures is protected by a critical section.
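The contention can be pictured with the following function-level sketch (QP, PD, and CQ setup are omitted; the function name and batching are assumptions, not MPICH's IB netmod code): because a queue pair is not safe for concurrent posting, a naive multithreaded version serializes ibv_post_send behind a critical section.

    #include <infiniband/verbs.h>

    /* Post a batch of prepared RDMA write work requests from multiple threads.
     * The critical section around ibv_post_send is the contention point. */
    void post_rdma_writes(struct ibv_qp *qp, struct ibv_send_wr *wrs, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            struct ibv_send_wr *bad_wr = NULL;
            #pragma omp critical
            (void)ibv_post_send(qp, &wrs[i], &bad_wr);
        }
    }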

Evaluation
1. Derived datatype related functions
2. Shared memory communication
3. Network-specific optimizations

All our experiments are executed on the Stampede supercomputer at the Texas Advanced Computing Center.

Derived Datatype Packing

Parallel packing of a 3D matrix of doubles (datatype: blocks within a vector within an hvector; the vector level is parallelized).

Packing the X-Z plane with varying Z — graph data: fixed matrix volume of 1 GB, fixed length of Y (2 doubles), length of Z given in the graph legend.
Packing the Y-Z plane with varying Y — graph data: fixed matrix volume of 1 GB, fixed length of X (2 doubles), length of Y given in the graph legend.

3D internode halo exchange using 64 MPI processes

– Cube: 512 × 512 × 512 doubles
– Large X: 16K × 128 × 64 doubles
– Large Y: 64 × 16K × 128 doubles
– Large Z: 64 × 128 × 16K doubles

This is not strong scaling, BUT we are using IDLE RESOURCES!

Hybrid MPI+OpenMP NAS Parallel MG benchmark

V-cycle multigrid algorithm to solve a 3D discrete Poisson equation. Halo exchanges with various dimension sizes, from 2 to 514 doubles, in class E with 64 MPI processes.

Graph data: class E using 64 MPI processes.

Shared Memory Communication

OSU MPI micro-benchmark, point-to-point bandwidth (similar results for latency and message rate).

Graph annotation: poor pipelining but even worse parallelism — caused by poor sequential performance due to a too-small eager/rendezvous communication threshold on Xeon Phi, not by MT-MPI!

One-sided Operations and IB netmod Optimization

Micro-benchmark: one-to-all experiment using 65 processes. The root sends many MPI_PUT operations to all of the other 64 processes (64 IB QPs); the IB communication is parallelized.

(Table: profile of the experiment issuing operations — total time, SP time, and SP/Total percentage per number of threads.) Ideal speedup = 1.61.

One-sided Graph500 benchmark

Every process issues many MPI_Accumulate operations to the other processes in every breadth-first search iteration. Scale 2^22, edge factor 16, 64 MPI processes.

Conclusion

On many-core architectures, the most popular hybrid MPI + threads programming model is the funneled / serialized mode:
– Many threads parallelize the user computation
– Only a single thread issues MPI calls
– Threads are IDLE during MPI calls!

We utilize these idle threads to parallelize MPI internal tasks, delivering better performance in various aspects:
– Derived datatype packing processing
– Shared memory communication
– IB network communication

Backup

OpenMP Runtime Extension 2

Waiting progress in a barrier:
– Threads SPIN-LOOP until timeout!
– May cause OVERSUBSCRIPTION

Solution: force waiting threads to enter a passive wait mode inside MPI:
– set_fast_yield (sched_yield)
– set_fast_sleep (pthread_cond_wait)

    while (time < KMP_BLOCKTIME) {
        if (done) break;
        /* spin loop */
    }
    pthread_cond_wait (…);

    #pragma omp parallel
    #pragma omp single
    {
        set_fast_yield (1);
        #pragma omp parallel
        { … }
    }
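For intuition, here is a schematic sketch (helper names are assumptions; the real mechanism is the set_fast_yield / set_fast_sleep runtime extension shown above) of the difference between the default active spin and a yielding passive wait.

    #include <sched.h>
    #include <stdatomic.h>

    static atomic_int done;

    /* Default behavior: the waiting thread spins and occupies a hardware thread. */
    static void wait_active(void)
    {
        while (!atomic_load(&done))
            ;   /* spin loop */
    }

    /* Passive behavior (what set_fast_yield requests): the waiting thread yields
     * its core so that MPI's internal OpenMP threads can run instead. */
    static void wait_passive_yield(void)
    {
        while (!atomic_load(&done))
            sched_yield();
    }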

OpenMP Runtime Extension 2

set_fast_sleep vs. set_fast_yield
– Test bed: Intel Xeon Phi cards (SE10P, 61 cores)

(Graphs: set_fast_sleep and set_fast_yield results.)

Prefetching issue when the compiler vectorizes non-contiguous data

(a) Sequential implementation (not vectorized):
    for (i = 0; i < count; i++) {
        *dest++ = *src;
        src += stride;
    }

(b) Parallel implementation (vectorized):
    #pragma omp parallel for
    for (i = 0; i < count; i++) {
        dest[i] = src[i * stride];
    }

Intra-node Large Message Communication

Sequential pipelining LMT:
– Shared user-space buffer (cells Cell[0]..Cell[3])
– Pipelined copy on both the sender side and the receiver side

Sender: gets an EMPTY cell from the shared buffer, copies data into this cell, and marks the cell FULL; then fills the next cell.
Receiver: gets a FULL cell from the shared buffer, copies the data out, and marks the cell EMPTY; then clears the next cell.
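A simplified sketch of the sender side of this pipeline (cell layout, sizes, and flag handling are illustrative assumptions, not MPICH's actual LMT structures):

    #include <string.h>

    #define NCELLS    4
    #define CELL_SIZE (32 * 1024)

    enum cell_state { EMPTY, FULL };

    struct cell {
        volatile enum cell_state state;
        size_t                   len;
        char                     data[CELL_SIZE];
    };

    /* Sender loop: wait for an EMPTY cell, fill it, mark it FULL, move on. */
    void sender_pipeline(struct cell cells[NCELLS], const char *buf, size_t total)
    {
        size_t sent = 0;
        int    idx  = 0;

        while (sent < total) {
            struct cell *c = &cells[idx];

            while (c->state != EMPTY)
                ;   /* wait until the receiver has drained this cell */

            size_t chunk = (total - sent < CELL_SIZE) ? total - sent : CELL_SIZE;
            memcpy(c->data, buf + sent, chunk);
            c->len   = chunk;
            c->state = FULL;   /* hand the cell over to the receiver */

            sent += chunk;
            idx   = (idx + 1) % NCELLS;
        }
    }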

Parallel InfiniBand communication

Two-level parallel policies:
– Parallelize the operations across different IB contexts
– Parallelize the operations across different CQs / QPs

(Figure: (a) parallelism on different IB CTXs; (b) parallelism on different CQs / QPs within one IB CTX.)

Parallelize InfiniBand Small Data Transfer

Three parallelism experiments based on ib_write_bw:
1. IB contexts: 1 process per node, 64 IB CTXs per process, 1 QP + 1 CQ per IB CTX
2. QPs and CQs: 1 process per node, 1 IB CTX per process, 64 QPs + 64 CQs per IB CTX
3. QPs only: 1 process per node, 1 IB CTX per process, 64 QPs + 1 shared CQ per IB CTX

Test bed: Intel Xeon Phi SE10P coprocessor, InfiniBand FDR. Data size: 64 bytes.

Eager Message Transferring in the IB netmod

When sending many small messages:
– IB resources are limited: QP, CQ, remote RDMA buffer
– Most of the messages are enqueued into the SendQ
– All SendQ messages are sent out in the wait progress

Major steps in the wait progress:
– Clean up issued requests
– Receiving: poll the RDMA buffer; copy received messages out
– Sending: copy sending messages from the user buffer; issue the RDMA operation

(Timeline: P0 enqueues messages into its SendQ, sends some messages immediately, then sends the SendQ messages during the wait progress; P1 exchanges SYNC messages, receives messages, and cleans up requests. The receiving and sending steps are parallelizable.)

Parallel Eager Protocol in the IB netmod

Parallel policy: parallelize a large set of messages going to different connections.
– Different QPs — sending processing: copy sending messages from the user buffer; issue the RDMA operation
– Different RDMA buffers — receiving processing: poll the RDMA buffer; copy received messages out

Example: One-sided Communication

Feature:
– A large number of small non-blocking RMA operations are sent to many targets
– ALL completions are waited for at the second synchronization call (MPI_Win_fence)

MPICH implementation: all the operations are issued in the second synchronization call.

(Diagram: the origin process issues PUT 0 through PUT 5 to target processes 1 and 2 between two MPI_Win_fence calls; the operations can be divided between threads Th 0 and Th 1.)
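A minimal sketch of this fence-based pattern (an assumed standalone example, not the benchmark code): many MPI_Put operations are issued inside an access epoch, and they all complete at the closing MPI_Win_fence, which is where MPICH actually issues them.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int     rank, nprocs;
        double  local = 0.0, origin = 1.0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each process exposes one double through the window. */
        MPI_Win_create(&local, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                 /* open the access epoch */
        if (rank == 0) {
            /* The origin issues a small PUT to every other target. */
            for (int t = 1; t < nprocs; t++)
                MPI_Put(&origin, 1, MPI_DOUBLE, t, 0, 1, MPI_DOUBLE, win);
        }
        MPI_Win_fence(0, win);                 /* all PUTs complete here */

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }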

Parallel One-sided Communication

Challenges in parallelism:
– Global SendQ → group messages by target
– Queue structures → store them in a ring array with head / tail pointers
– Netmod internal messages (SYNC etc.) → enqueue them into a separate SendQ (SYNC-SendQ)
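The per-target ring-array queue can be pictured with the following illustrative sketch (structure and names are assumptions, not the MPICH data structures): one ring per target, so threads working on different targets never touch the same global queue.

    #include <stddef.h>

    #define RING_CAP 1024

    /* One send ring per target: a fixed array plus head/tail indices. */
    struct send_ring {
        void *ops[RING_CAP];   /* pending operations for this target */
        int   head;            /* next entry to issue */
        int   tail;            /* next free slot */
    };

    static int ring_enqueue(struct send_ring *q, void *op)
    {
        int next = (q->tail + 1) % RING_CAP;
        if (next == q->head)
            return -1;         /* ring full */
        q->ops[q->tail] = op;
        q->tail = next;
        return 0;
    }

    static void *ring_dequeue(struct send_ring *q)
    {
        if (q->head == q->tail)
            return NULL;       /* ring empty */
        void *op = q->ops[q->head];
        q->head = (q->head + 1) % RING_CAP;
        return op;
    }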

Parallel One-sided Communication

Optimization:
– Every operation was issued through a long critical section → issue all operations together
– A large number of requests were created → create only one request

(Figure: issuing RMA operations from CH3 to the IB netmod in the MPICH stack — ADI3 / CH3 / SHM / nemesis / IB — with PUT operations entering the SendQ under a critical section.)

Profiling

We measured the execution time of the netmod send-side communication processing at the root process (SP):
– Copy from the user buffer to a preregistered chunk
– Posting of the operations to the IB network

(Table: profile of the experiment issuing operations — Nthreads, total time, SP time, SP/Total percentage, and speedups of Total and SP.) The speedup is as expected, but the percentage of time spent in SP becomes smaller and smaller.